Acoustic and prosodic characteristics of vocalic hesitations across languages
Substantial progress in speech processing research over the past decade has led to a variety of successful demonstrators involving spontaneous, or at least unprepared, speech. To increase the usability of speech systems, the challenges of multilinguality and spontaneous speech must be addressed efficiently.
In this context there has been growing interest in studies on hesitation phenomena. Hesitation phenomena are commonly understood to include a variety of vocalic realizations, ranging from simple events such as silent or filled pauses and word lengthening to more complex phenomena involving repetitions and speech repairs. Among these, filled pauses, which correspond to autonomous vocalic hesitations and are ever-present in spontaneous spoken language, have received special attention. Over the past half-century, filled pauses have been analyzed from different standpoints by (psycho)linguists, clinicians, cognitive scientists and sociolinguists (O'Connell and Kowal, 2004). Special attention has been paid to their communicative role, generally associated with the planning of the verbal message within an interpersonal interaction.
Our work focuses on autonomous vocalic hesitations in American English, French and Spanish. Such vocalic hesitations occur without lexical support and are thus to be distinguished from the vocalic lengthening of segments belonging to lexical items (generally function words). Vocalic hesitations are widely encountered in the world's languages and consist of the insertion of a relatively long, stable vocalic segment. Their orthographic transcription varies across languages, suggesting differences in their perception by native speakers: for example uh/um in American English, er in British English, euh in French, eh in Spanish, äh/ähm in German, and so on (Clark and Fox Tree, 2002).
While the function of vocalic hesitations in speech as systematic verbalizations with (almost) lexical meaning has been widely discussed, studies on their language-specific acoustic and prosodic characteristics are still lacking. These characteristics are nevertheless a relevant topic for computational linguistics approaches conducted on large speech corpora: the increasing need to transcribe spontaneous speech calls for a more in-depth analysis of the non-lexical events, such as hesitations, occurring in unprepared speech. Studies on the acoustic and prosodic characteristics of vocalic hesitations have therefore focused on describing a number of objective parameters, with the aim of finding potential cues able to indicate when this particular type of disfluency occurs within the speech flow.
Aim of the study
Our research aims to contribute both to speech processing research and to linguistics. From a linguistic perspective, this work addresses the question of whether some language universals can be found for hesitation vowels across languages. From a complementary perspective, it also investigates whether hesitation vowels bear language-specific information. An additional point of interest is the link between hesitation vowels and intra-lexical vowels in each language. Does the timbre of the vocalic hesitation correspond to a vowel of the phonemic inventory of the language, or does the vocalic hesitation tend towards a given, potentially universal and neutral "rest" position? The insights gained may also be of interest to research in speech technology fields such as acoustic modeling for automatic speech, speaker, language and accent recognition.
Vocalic hesitations in French, American English and European Spanish were automatically extracted from large journalistic broadcast and parliamentary debate corpora. The LIMSI speech transcription system (Gauvain et al., 2002) was used for speech alignment. These data are available at LIMSI in the framework of different projects. For French, 20 hours of speech from several national radio and TV broadcast channels (France Inter, France Info, France2, ...) were selected. The American English data comprise about 10 hours of recordings from broadcast channels (CNN, VOA, ABC, etc.) distributed by LDC. For Spanish, 10 hours of European Parliamentary debates are used. For all three languages, several dozen different speakers contribute the majority of the audio data. The extraction yielded approximately 1300 vocalic hesitations in French, 2000 in American English and 1700 in Spanish, whereas intra-lexical vowels measured in all types of contexts amounted to 85k vocalic items in French, 24k in American English and 403k in Spanish. Duration, fundamental frequency and formant values were measured and compared from both an inter- and an intra-language perspective. Results are presented here for male speakers, since they are better represented in the data; however, the observations remain valid for female speakers.
Vocalic hesitations across languages
Whereas fundamental frequency and duration confirm universal patterns (low, flat fundamental frequency and long durations), timbre comparisons reveal differences across languages. As mentioned above, the transcription of vocalic hesitations varies across languages: they are transcribed, for instance, as uh/um in American English, euh in French or eh in Spanish. In order to evaluate whether these transcriptions mirror acoustic differences between vocalic hesitations, timbres have been compared across languages through first (F1) and second (F2) formant values. Average F1/F2 measures on hesitation vowels in American English, French and Spanish have been plotted in the vocalic space obtained from similar measures on intra-lexical vocalic segments. The vocalic hesitations in the three languages analyzed here show language-dependent characteristics: they differ in their timbres across languages and do not inevitably result in central vowels, as stated by Clark and Fox Tree (2002). Figure 1 shows average F1 and F2 values for each language, whereas Figure 2 displays objective distances between hesitation vowels and the intra-lexical vocalic segments which demarcate the vocalic space in American English, French and Spanish. In the framework of language identification approaches, these results suggest that vocalic hesitations require language-specific acoustic models.
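The comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure used in the study: the text does not specify how the "objective distances" were computed, so a Euclidean distance in Hz in the F1/F2 plane is assumed here. The function names (`mean_formants`, `distances_to_hesitation`, `nearest_vowel`) are hypothetical, and the formant values are invented toy numbers, not measurements from the corpora. The "&" label for the hesitation vowel follows the figure caption.

```python
import math
from collections import defaultdict


def mean_formants(tokens):
    """Average F1/F2 (Hz) per vowel label from (label, f1_hz, f2_hz) tuples."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for label, f1, f2 in tokens:
        acc = sums[label]
        acc[0] += f1
        acc[1] += f2
        acc[2] += 1
    return {label: (s1 / n, s2 / n) for label, (s1, s2, n) in sums.items()}


def distances_to_hesitation(centroids, hesitation_label="&"):
    """Euclidean F1/F2 distance (assumed metric) from the hesitation centroid
    to every other vowel centroid."""
    h1, h2 = centroids[hesitation_label]
    return {
        label: math.hypot(f1 - h1, f2 - h2)
        for label, (f1, f2) in centroids.items()
        if label != hesitation_label
    }


def nearest_vowel(centroids, hesitation_label="&"):
    """Label of the intra-lexical vowel whose centroid is closest to the hesitation."""
    dist = distances_to_hesitation(centroids, hesitation_label)
    return min(dist, key=dist.get)


# Invented toy tokens (label, F1 in Hz, F2 in Hz); NOT data from the study.
toy_tokens = [
    ("&", 450.0, 1800.0), ("&", 470.0, 1900.0),
    ("e", 440.0, 1900.0), ("e", 460.0, 2000.0),
    ("a", 700.0, 1400.0), ("a", 720.0, 1500.0),
    ("u", 350.0, 900.0),
]

centroids = mean_formants(toy_tokens)
closest = nearest_vowel(centroids)
```

With real data, distances in Hz weight F2 differences more heavily than F1 differences; a perceptual scale such as Bark or per-speaker normalization could be substituted, but that choice is outside what the text states.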
Vocalic hesitations vs vocalic systems
Vocalic hesitations exhibit similarities with some intra-lexical vowels, but the degree of similarity varies across languages. In European Spanish the vocalic hesitation almost coincides with the mid-closed front vowel /e/, whereas in French it is central and slightly more open than /œ/, and in American English it is mid-open central, between /ʌ/ and /æ/. The relation between the vocalic hesitation and the language inventory thus varies across languages: the average timbres of vocalic hesitations are close to a central, mid-open realization in French and American English, but close to a front mid-closed vowel in Spanish. Vocalic hesitations thus appear to be realized as interpolations between timbres attested in the languages' phonemic inventories and some neutral timbres occupying the central area of the vocalic triangle. Finally, in the manner of phonemic vowels, vocalic hesitations display centralization of their vocalic qualities towards a neutral schwa-type position when the duration of the segment decreases (Vasilescu and Adda-Decker, 2006).
Figure 2: F1/F2 mean values for intra-lexical vowels and the hesitation vowel ("&" label).
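The centralization of shorter hesitations reported above could be checked along the following lines. Again, this is a hedged sketch rather than the authors' method: the reference "centre" of the vocalic space, the 0.2 s duration threshold, the function name `distance_to_centre_by_duration` and all token values are assumptions introduced purely for illustration.

```python
import math
from statistics import mean


def distance_to_centre_by_duration(tokens, centre, threshold_s):
    """Mean F1/F2 distance to an assumed neutral centre for short vs long tokens.

    tokens: iterable of (duration_s, f1_hz, f2_hz) hesitation measurements.
    centre: (f1_hz, f2_hz) of the assumed neutral, schwa-like position.
    threshold_s: tokens shorter than this count as "short".
    """
    short, long_ = [], []
    for duration, f1, f2 in tokens:
        dist = math.hypot(f1 - centre[0], f2 - centre[1])
        (short if duration < threshold_s else long_).append(dist)
    return {
        "short": mean(short) if short else None,
        "long": mean(long_) if long_ else None,
    }


# Invented toy hesitation tokens (duration in s, F1 in Hz, F2 in Hz); NOT study data.
toy_hesitations = [
    (0.08, 510.0, 1550.0),
    (0.10, 490.0, 1600.0),
    (0.30, 450.0, 1850.0),
    (0.45, 460.0, 1900.0),
]

result = distance_to_centre_by_duration(
    toy_hesitations, centre=(500.0, 1500.0), threshold_s=0.2
)
```

Centralization would show up as a smaller mean distance for the "short" group than for the "long" group; the toy values were chosen to exhibit that pattern and prove nothing about the real corpora.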
Many more detailed studies can be carried out in the future, including analyses depending on different hesitation functions. Extensions to other languages, to regional accents and to more spontaneous speech collected in various interaction contexts are promising research directions that can contribute to an extensive description and a more in-depth understanding of hesitation phenomena in speech.
D.C. O'Connell, S. Kowal, The history of research on the filled pause as evidence of the written language bias in linguistics (Linell, 1982), Journal of Psycholinguistic Research, vol. 33(6), pp. 459-474, 2004.
H.H. Clark, J.E. Fox Tree, Using "uh" and "um" in spontaneous speaking, Cognition, vol. 84, pp. 73-111, 2002.
J.L. Gauvain, L.F. Lamel, G. Adda, The LIMSI Broadcast News Transcription System, Speech Communication, vol. 37(1-2), pp. 89-108, 2002.
I. Vasilescu, M. Adda-Decker, A cross-language study of acoustic and prosodic characteristics of vocalic hesitation, In: NATO Security Through Science Program The Fundamentals of Verbal and Non-verbal Communication and the Biometrical Issue, IOS Press BV, 2006.
I. Vasilescu, R. Nemoto, M. Adda-Decker, Vocalic Hesitations vs Vocalic Systems: a Cross-Language Comparison, In: 16th International Congress of Phonetic Sciences, Saarbrücken, July 2007.