Emotions in human-machine interactions: perception, detection and generation
Since antiquity, emotions have been studied by philosophers such as Aristotle and Plato, mainly in discourse. In 1872, Darwin argued in « The Expression of the Emotions in Man and Animals » that some emotions, called primary, represent universal emotional processes useful for survival. In contrast to cognition, the object of many investigations, emotions were long neglected by modern psychology and the neurosciences. At the end of the 1980s, interest in emotions revived. Today, cognitive psychology, neuropsychology, neurobiology and the social sciences have restored the central position of affective states, which are described as essential to communication: they serve many functions in social regulation and play an important role in behaviour and decision making. More recently, this research area has attracted growing interest in artificial intelligence and in the information processing sciences. Three domains of interest can be distinguished: emotion recognition, emotion synthesis and computers with emotions. Early work in computer science on emotions used a restricted number of emotions. Most often, studies focused on primary emotions such as Anger, Fear, Joy and Sadness in acted data. The conversational agent community was the first to integrate emotional information into human-machine systems. More recently, researchers in speech technologies have become interested in emotional behaviours as a way to improve synthesis, detection and dialogue systems. Detecting emotions could, for instance, make it possible to modify the strategy of a system or, more simply, to transfer the caller to a human operator.
My research on emotions aims, on the one hand, to describe and model the expression of emotional behaviours from authentic data and to study the variability of oral signals between speakers (in the much longer term, with multimodal data for different languages and cultures) and, on the other hand, to improve speech technologies and, more widely, to build « affectively intelligent » oral and multimodal systems.
I defend the idea that emotional behaviour is linked to experiences in situ (socio-cultural and interactional context) and to the personality of the participants, and is not determined only by abstract, context-free processes or by physiological processes. One fundamental difficulty is the very definition of emotion and of emotional states: emotion, affect, mood and attitude. Most theoretical works do not share the same terminology, and the diversity of models of emotional expression shows both the complexity and the richness of the subject. The originality of this work lies in the use of real-life data in which emotions are natural. There are very few studies on authentic data collected without experimental control. Adding an emotional component to a human-machine interaction system intended for everyday use requires analysing real-life data. Compared with the large number of studies carried out on artificial data, studying non-basic emotions in real-life data raises several new challenges.
The first challenge is that there are no annotation standards for such data. The first objective was therefore to propose an experimental methodology for working on natural data, together with a representation scheme for emotions. I am involved in the W3C Emotion Incubator Group as an expert on emotion annotation.
The second challenge is to find robust audio cues for characterizing emotions and to build systems that detect them. With real-life data and authentic emotions, the issue is to annotate reliably the emotional behaviour found in natural corpora. On top of the complexity of natural data, in which emotions are often mixed, everyday language offers a large choice of terms for naming and nuancing emotions. How can this palette of choices be represented, and how can it be limited to a set of terms that is sufficiently representative and still allows statistical models to be trained? How can context be integrated into the annotations? The subjectivity of annotation choices is indeed one of the main difficulties of this research domain. How can the quality of systems trained on such data be measured? The proposed multi-level annotation scheme integrates the representation of emotions and the context in which they emerge.
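As an illustration of how the reliability of subjective annotations can be quantified, the sketch below computes chance-corrected agreement between two annotators. Cohen's kappa is an assumption made here for the example, since the text does not name the agreement measure actually used; the function name, the label set and the toy annotations are all hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa).

    Assumption: each annotator assigns exactly one label per segment.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of segments")
    n = len(labels_a)
    # Proportion of segments on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four speech segments by two annotators.
annotator_1 = ["fear", "anger", "fear", "neutral"]
annotator_2 = ["fear", "anger", "neutral", "neutral"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

A value near 1 indicates agreement well above chance, a value near 0 agreement no better than chance. With mixed emotions, a single-label measure of this kind is only a first approximation.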
My research on emotions ranges from emotional expression in the voice, through multimodal expression, to emotional and mental states in interaction. The studies presented were carried out on natural corpora recorded in real-life contexts: data collected in financial and medical call centers, clips extracted from TV news, and face-to-face games designed to collect multimodal interaction data. They were also carried out on fiction in order to model extreme emotions such as fear, which is difficult to observe in real data. Emotional behaviour was studied across these different corpora. The analysis of real-life emotional data, both audio-only (unobtrusive recordings) and audio-visual (TV news interviews of engaged subjects), provides evidence of complex emotions and even of emotion mixtures with contradictory valence, positive and negative at the same time. In the medical task, for example, a caller can be relieved because help is coming while also showing anxiety, stress and hurt.
The cues carrying emotions in natural data are multiple (acoustic or gestural, but also linguistic and dialogic), sometimes contradictory, and either uncontrolled or, on the contrary, highly controlled (acted) and masked by social convention (or private interest). In my work I propose a quantitative model based on the combination of multi-level cues. The annotation scheme has been adapted to the application goal, detection or generation.
Authentic data yield emotional representations that are realistic, rich and diverse.
For detection, the objective was to build large annotated corpora from which to extract robust audio cues, to use these cues in machine learning, and to show the feasibility of an emotion detection system on real-life data. This work was carried out on French telephone speech, on American/English fiction data (realistic movies) and on spoken interaction with the AIBO robot in German (children's voices). The robustness of the cues was therefore evaluated on several corpora with different transmission channels: telephone (call centres, financial and medical tasks), DVD (realistic movies) and microphone (AIBO data).
For generation from audiovisual cues, the annotation was carried out on a small amount of data compared with the detection work (20 min of TV interviews). The existence of emotion mixtures was demonstrated with perceptual tests on the original data and on their replay by embodied conversational agents (ECAs).
The analyses and representations obtained for detection and for generation are complementary and mutually enriching. The use of audio and audiovisual corpora in different languages, with a large variety of voices and contexts, is also one of the instructive aspects of this work. The first results are a multi-level annotation scheme for representing emotions together with the context in which they emerge, and protocols for acquisition, for the definition of task-dependent emotion representations, for annotation and for annotation validation. Another result concerns the presence of complex mixed emotions in real-life data and the proposal to represent emotion mixtures with a soft vector of emotions. The role of context has also been emphasized, and appraisal dimensions (Scherer's model) have been studied from a perceptual point of view. Finally, it has been shown that task-dependent real-life emotions can be recognized with fairly high performance and that paralinguistic cues are robust across tasks and languages. Combining linguistic and paralinguistic cues has also been explored and gives better performance. Contextual information has been used in first experiments: for instance, the role of the speaker (agent/caller) and the timing of the threat (immediate, latent) brought some improvement in detection performance.
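To make the idea of a soft vector of emotions concrete, the following minimal sketch turns the labels chosen by several annotators for one segment into a normalized vector over emotion labels. It is an illustration under stated assumptions, not the scheme used in this work: the major/minor label pairs, the weights, the label set and the function name are all hypothetical.

```python
# Hypothetical weights: a label chosen as "major" counts twice as much
# as a label chosen as "minor". These values are assumptions.
MAJOR_WEIGHT = 2.0
MINOR_WEIGHT = 1.0


def soft_emotion_vector(annotations, labels):
    """Combine (major, minor) label pairs into weights that sum to 1.

    annotations: one (major, minor) pair per annotator; minor may be None.
    labels: the closed set of emotion labels used for the task.
    """
    scores = {label: 0.0 for label in labels}
    for major, minor in annotations:
        scores[major] += MAJOR_WEIGHT
        if minor is not None:
            scores[minor] += MINOR_WEIGHT
    total = sum(scores.values())
    if total == 0:
        # No annotation at all: return the all-zero vector unchanged.
        return scores
    return {label: score / total for label, score in scores.items()}


# Hypothetical segment from a medical call: the caller sounds both
# relieved and anxious, and the two annotators weight them differently.
LABELS = ["anger", "anxiety", "relief", "sadness", "neutral"]
annotations = [("relief", "anxiety"), ("anxiety", "relief")]
vector = soft_emotion_vector(annotations, LABELS)
```

In this toy case the positive and negative labels receive equal weight, so a mixture with contradictory valence is kept in the representation instead of being forced into a single class.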
My research perspectives lie mainly within oral human-machine communication, where the automatic detection of the speakers' emotional states can improve speech recognition performance through adapted models and improve strategy management in the interaction according to emotional manifestations. Within this framework, I also plan to combine oral emotional cues with multimodal ones (body language: gestures, facial expressions) in order to detect emotional states in multimodal human-machine interaction. Finally, my long-term project is to develop a hybrid rational and emotional model for human-machine interaction. Some of the studies mentioned above were carried out with PhD students under my supervision, supported by national public and private grants (THALES) and an international grant (NoE FP6-HUMAINE), and through collaborations with other researchers, in particular at LIMSI and in the NoE HUMAINE.
 Devillers, L., Les émotions dans les interactions homme-machine : perception, détection et génération. Thèse d'Habilitation à diriger des Recherches, Université Paris-Sud/LIMSI (2006).
 Devillers, L., Vidrascu, L., Lamel, L., Challenges in real-life emotion annotation and machine learning based detection, Journal of Neural Networks, 18/4, “Emotion and Brain” (2005).
 Devillers, L., Vidrascu, L., Emotion recognition, in « Speaker Characterization », Christian Müller, Susanne Schötz (eds.), Springer-Verlag (2007).
 Schroeder, M., Devillers, L., Karpouzis, K., Martin, J.-C., Pelachaud, C., Peter, Ch., Pirker, H., Schuller, B., Tao, J., Wilson, I., What should a generic emotion markup language be able to represent?, ACII07, 2nd International Conference on Affective Computing & Intelligent Interaction (2007).
 Clavel, C., Devillers, L., Richard, G., Vasilescu, I., Ehrette, T., Detection and analysis of abnormal situations through fear-type acoustic manifestations, ICASSP 2007, IEEE International Conference on Acoustics, Speech and Signal Processing (2007).
 Martin, J.-C., Caridakis, G., Devillers, L., Karpouzis, K., Abrilian, S., Manual annotation and automatic image processing of multimodal emotional behaviors: validating the annotation of TV interviews, Personal and Ubiquitous Computing (2007).