Multimodal Human-Computer Interfaces and Individual Differences. Annotation, perception, representation and generation of situated multimodal behaviors
From 2007 Scientific Report
Current human-computer interfaces are limited compared to the everyday multimodal communication we use when engaging with other people, whether in routine or emotionally loaded interactions. In human-human interaction, we combine several communication modalities, such as speech and gestures, in a spontaneous and subtle way.
Whereas current human-computer interfaces make extensive use of graphical user interfaces combining keyboard, mouse, and screen, multimodal human-computer interfaces aim at intuitive interaction through several human-like modalities, such as speech and gestures, that require no training to use.
In a multimodal input interface, the user should be able to combine speech and gestures, for example to query graphical objects displayed on the screen. The goal of such multimodal input is intuitive, robust, and efficient human-computer interaction, useful for example in kiosk applications in public settings or in map applications on mobile devices.
An Embodied Conversational Agent (ECA) is a multimodal output interface in which an animated character displayed on the screen combines several human-like modalities such as speech, gestures, and facial expressions. An ECA is expected to make interaction intuitive and friendly, for instance through the display of emotional expressions, which can be useful in educational applications as a means of motivating students.
A bidirectional interface aims at combining multimodal input and ECA, thus involving intuitive modalities on both sides of the interaction. It raises several challenges for computer scientists. Designers of the multimodal input interface need to know how users will combine their speech and gestures when referring to objects so that they can define appropriate fusion algorithms. Designers of the ECA need to know how the character should combine its communication modalities so that the user gets the message right. This calls for appropriate input and output computational models of multimodal behaviors.
With respect to evaluation, the fact that users and the system can choose among several modalities could be an advantage over classical graphical user interfaces.
The literature in the social sciences provides results from numerous studies on nonverbal communication. However, using these results to design multimodal interfaces is not straightforward: analyzing or generating multimodal behaviors requires detailed descriptions of observed behavior. Furthermore, the situation for which we want to design an interface may differ markedly from those studied in the literature, and transferring results to another situational context requires domain-specific investigations.
Our goal is to contribute to the design of bidirectional multimodal interfaces. Our approach is threefold. First, we build representations of situated (i.e. collected in a specific context) multimodal communicative behaviors. Second, we consider both sides of the multimodal human-computer interface (i.e. multimodal input from the user, and multimodal output via the ECA). Third, we evaluate such representations and interfaces via experimental studies involving different user profiles.
We define a methodological framework for the collection of videotaped multimodal behaviors, their annotation, representation, and generation, as well as the perception of collected and generated behaviors. This framework features a typology for analyzing how modalities combine in multimodal behavior (complementarity, redundancy, conflict, equivalence, specialization, transfer).
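To make the typology concrete, the following is a minimal sketch (not the framework's actual implementation) of how the six cooperation types could be encoded, with a toy classifier that compares the information annotated on two modality tracks; all names and the set-based heuristic are assumptions for illustration.

```python
from enum import Enum, auto

class Cooperation(Enum):
    """Hypothetical encoding of the six cooperation types of the typology."""
    COMPLEMENTARITY = auto()  # modalities convey different chunks that must be merged
    REDUNDANCY = auto()       # both modalities convey the same information
    CONFLICT = auto()         # modalities convey contradictory information
    EQUIVALENCE = auto()      # the information could pass through either modality
    SPECIALIZATION = auto()   # this information type always uses a single modality
    TRANSFER = auto()         # information produced by one modality is used by another

def classify(speech_info: set, gesture_info: set) -> Cooperation:
    """Toy classifier over the information sets annotated on each track."""
    if not speech_info or not gesture_info:
        return Cooperation.SPECIALIZATION  # only one modality carries information
    if speech_info == gesture_info:
        return Cooperation.REDUNDANCY      # identical information on both tracks
    return Cooperation.COMPLEMENTARITY     # differing chunks need to be merged

# "put that there" + pointing gesture: speech gives the action, gesture the object
print(classify({"action:put"}, {"object:lamp"}).name)  # COMPLEMENTARITY
```

Detecting conflict, equivalence, or transfer would require richer annotations than flat sets (e.g. timing and semantic relations), which is precisely what the manual annotation schemes described below provide.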
In this report we summarize several studies conducted with this methodological framework. We focus on multimodal deictic and emotional behaviors, selected because 1) they involve several modalities, 2) they are required both in input and output, and 3) they are especially useful in pedagogical applications.
With respect to deictic behaviors, we studied how adults and children combine their speech and gestures when they interact with 2D and 3D animated agents in conversational edutainment applications. We used a simulated version of the system, analyzed the combinations of modalities produced by users, designed an input fusion module, and evaluated the fusion of users' speech and deictic gestures in the final system. On the output side, we specified different multimodal strategies and graphical looks for animated agents referring to graphical objects during technical presentations, and compared how different users perceived them.
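As an illustration of what such an input fusion module has to do (this is a hedged sketch, not the module described above), a common baseline is temporal alignment: pair each deictic word with the closest pointing event within a time window. The data classes, word list, and 1-second threshold are all assumptions for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpeechToken:
    word: str
    time: float  # seconds from the start of the utterance

@dataclass(frozen=True)
class PointingEvent:
    target: str  # identifier of the pointed-at graphical object
    time: float

DEICTIC_WORDS = {"this", "that", "here", "there"}

def fuse(tokens, gestures, window=1.0):
    """Pair each deictic word with the temporally closest pointing event
    within `window` seconds; deictics without a nearby gesture stay unresolved."""
    pairs = []
    for tok in tokens:
        if tok.word not in DEICTIC_WORDS:
            continue
        near = [g for g in gestures if abs(g.time - tok.time) <= window]
        if near:
            best = min(near, key=lambda g: abs(g.time - tok.time))
            pairs.append((tok.word, best.target))
    return pairs

# "move that" while pointing at a lamp shortly before the word "that":
print(fuse([SpeechToken("move", 0.4), SpeechToken("that", 1.2)],
           [PointingEvent("lamp", 1.0)]))  # [('that', 'lamp')]
```

A real fusion module would additionally exploit the cooperation types of the typology (e.g. redundancy versus complementarity between speech and gesture) rather than timing alone.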
With respect to multimodal expressions of emotion, we studied how people combine different modalities in spontaneous emotional expression. Most emotion studies in the social sciences focus on acted, mono-modal behaviors elicited with single emotion labels, although some studies have also observed blends of emotions. We used our approach to explore how people express and perceive such blends in non-acted multimodal behavior. We collected videos of multimodal emotional behaviors from TV interviews and defined schemes for manually annotating these expressive emotional behaviors. We investigated the use of image processing techniques for validating these manual annotations and computed representations of multimodal expressive behaviors. We defined a copy-synthesis approach that uses these representations to replay the behaviors with an expressive agent. Perceptual studies were conducted to analyze individual differences in the perception of such blended emotional expressions. Finally, the understanding of conflicting facial expressions of emotion and textual dialogues was assessed with autistic users in an experimental study for which a dedicated multimedia platform was designed.
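One simple way to represent a blend of emotions, rather than a single label, is a normalized soft vector over the labels assigned by several annotators to the same video segment. The following sketch is an assumption-laden illustration of that idea, not the annotation scheme actually defined in the studies above.

```python
from collections import Counter

def blend_profile(annotations):
    """Merge per-annotator emotion labels into a normalized soft vector,
    so a segment can carry a blend of emotions instead of one label.
    `annotations` is a list of label lists, one per annotator (hypothetical format)."""
    counts = Counter(label for labels in annotations for label in labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Two annotators label a segment anger+despair, a third labels only anger:
profile = blend_profile([["anger", "despair"], ["anger", "despair"], ["anger"]])
print(profile)  # {'anger': 0.6, 'despair': 0.4}
```

Such a profile can then drive a copy-synthesis step, e.g. by weighting the facial expressions an expressive agent superposes when replaying the segment.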
These studies provide three contributions to the design of multimodal human-computer interfaces. First, they revealed several individual differences between subjects, related to age, gender, cognitive profile, and personality traits, in the perception and production of deictic and emotional behaviors. Second, we iteratively defined a methodological framework for studying multimodal communication, covering the annotation, representation, generation, and perception of situated multimodal behaviors. Third, this framework was used to define representations such as combinations of modalities in users' multimodal input, and expressive emotional profiles for the annotation and replay of emotional behaviors. Our future research aims at providing further contributions to inform the design of bidirectional multimodal human-computer interfaces. We explain how we will use our framework to study situated multimodal behaviors along five dimensions of multimodal communication: 1) user and agent as individuals, 2) blends of social emotions, 3) interaction patterns and interpersonal attitudes, 4) interaction context and interaction memory, and 5) from acted to natural behaviors.
- Abrilian, S. (2007). Représentation de Comportements Emotionnels Multimodaux Spontanés : Perception, Annotation et Synthèse. PhD in Computer Science, Université Paris XI. Advisors: J.-C. Martin, L. Devillers, J. Mariani. 7 September 2007.
- Buisine, S. (2005). Conception et Évaluation d'Agents Conversationnels Multimodaux Bidirectionnels. PhD in Cognitive Psychology - Ergonomics, Université Paris V. 8 April 2005. Advisors: J.-C. Martin, J.-C. Sperandio. http://stephanie.buisine.free.fr/. Highest honors with the jury's congratulations.
- Buisine, S., Martin, J.-C. (2007). The effects of speech-gesture co-operation in animated agents' behaviour in multimedia presentations. International Journal "Interacting with Computers: The interdisciplinary journal of Human-Computer Interaction". Elsevier, 19: 484-493.
- Buisine, S., Abrilian, S., Niewiadomski, R., Martin, J.-C., Devillers, L., Pelachaud, C. (2006). Perception of Blended Emotions: from Video Corpus to Expressive Agent. 6th International Conference on Intelligent Virtual Agents (IVA'2006), 21-23 August, Marina del Rey, USA. LNAI 4133, Springer, pp. 93-106. Acceptance rate: 33% of submitted papers accepted as oral. Best Paper Award.
- Grynszpan, O. (2005). Conception d'interfaces humain-machine pour les personnes autistes. PhD in Computer Science, Université Paris XI. 9 December 2005. Advisors: J.-C. Martin, J. Nadel, J. Mariani.
- Grynszpan, O., Martin, J.-C., Nadel, J. (2007). Exploring the Influence of Task Assignment and Output Modalities on Computerized Training for Autism. International Journal "Interaction Studies: Social Behaviour and Communication in Biological and Artificial Systems", John Benjamins Publishing, 8(2): 241-266.
- Martin, J.-C., Caridakis, G., Devillers, L., Karpouzis, K., Abrilian, S. (2007). Manual Annotation and Automatic Image Processing of Multimodal Emotional Behaviours: Validating the Annotation of TV Interviews. International Journal "Personal and Ubiquitous Computing", special issue on 'Emerging Multimodal Interfaces' following the special session of the AIAI 2006 Conference, Springer, May 2007.
- Martin, J.-C. (2006). Multimodal Human-Computer Interfaces and Individual Differences. Annotation, perception, representation and generation of situated multimodal behaviors. Habilitation à diriger des recherches en Informatique, Université Paris XI. 6 December 2006.
- Martin, J.-C., Buisine, S., Pitel, G., Bernsen, N.O. (2006). Fusion of Children's Speech and 2D Gestures when Conversing with 3D Characters. International Journal "Signal Processing", special issue on "multimodal interfaces". Editors: Thierry Dutoit, Laurence Nigay and Michael Schnaider, Elsevier. Vol. 86, issue 12, pp. 3596-3624.
- Martin, J.-C., Niewiadomski, R., Devillers, L., Buisine, S., Pelachaud, C. (2006). Multimodal Complex Emotions: Gesture Expressivity And Blended Facial Expressions. International Journal of Humanoid Robotics. Eds: C. Pelachaud, L. Canamero. Vol. 3, No. 3, September 2006, pp. 269-291.