PFC corpus: automatic processing within TCAN VarCom & PFC-Cor projects

From 2007 Scientific Report

M. Adda-Decker P. Boula de Mareüil L. Lamel C. Woehrling G. Adda E. Bilinski



Object

The PFC (Phonologie du Français Contemporain, http://www.projet-pfc.net/) corpus is the result of an ambitious long-term project, initiated and driven by French phonologists and phoneticians [1]. Recordings from several tens of sampling points across the French-speaking world, totalling hundreds of speakers and hundreds of hours of speech, have been gathered to enable large-scale studies of linguistic and sociolinguistic variation in spoken French. Automatic processing, in particular alignment at the lexical and phonemic levels, can be produced by automatic speech recognition (ASR) systems, provided that accurate and normalized speech transcripts are available. Regional accents, spontaneous speech, pronunciation variants and lexical changes are challenges for ASR systems [2,3,4,5]. They motivate our interdisciplinary collaborations within the CNRS TCAN VarCom project (headed by N. Nguyen, LPL, Univ. Provence) and the ANR PFC-Cor project (headed by B. Laks, Modyco, Univ. Paris 10), which also involve ERSS and Laboratoire Jacques-Lordat (Univ. Toulouse-Le Mirail), Klassisk og Romansk Institutt (Univ. Oslo) and Laboratoire de Psycholinguistique (Univ. Genève). We report here on ongoing work on the challenging PFC audio data. We believe that this kind of interdisciplinary research fosters both spoken corpus linguistics and automatic speech processing [6,7].

Description

In the following we address the question of producing automatically aligned speech for further studies by the project partners. This includes normalizing the manual transcripts and updating the pronunciation dictionaries. Specific pronunciation variants may be allowed during alignment in order to study variation as a function of region, gender or age.

PFC corpus

Fig. 1 shows the 18 locations in France, Belgium and Switzerland for which the recorded data have been automatically processed, including transcript normalization and automatic alignment.

Figure 1: France and neighboring French-speaking countries (Belgium and Switzerland). Red circles: locations of the automatically processed PFC speech samples.

Automatic processing

Manually transcribed data have first been normalized to fit the requirements of automatic processing: spurious structure tags and punctuation marks (mainly speaker codes, quotes, double quotes...) as well as orthographic and typographic errors have been semi-automatically removed from the original Praat transcription tiers. Filtering against a standard French dictionary makes it possible to locate either errors or unknown lexical items, which may correspond to regional specificities. Pronunciation dictionaries have then been augmented with topic-, class/age- or region-specific items (e.g. réaleviner concerning fish culture, Xonrupt-Longemer for Lorraine, röstis, huitante for Switzerland, cokoteurs for Belgian students).
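The normalization and dictionary filtering steps can be illustrated with a short Python sketch. It is only an illustration and not the tooling used in the project: the speaker-code pattern, the punctuation set, the function names normalize and unknown_items and the tiny lexicon are all assumptions made for the example.

```python
import re

# Assumed conventions (hypothetical, not the PFC transcription format):
# a speaker code looks like "E1:" at the start of a line.
SPEAKER_CODE = re.compile(r'^\s*[A-Z]{1,3}\d*\s*:\s*')
# Quotes and punctuation marks to strip; apostrophes are kept ("c' est").
PUNCTUATION = re.compile(r'["«»“”().,;:!?]')


def normalize(line):
    """Strip the speaker code and punctuation, then lowercase and tokenize."""
    line = SPEAKER_CODE.sub("", line)
    line = PUNCTUATION.sub(" ", line)
    return line.lower().split()


def unknown_items(tokens, lexicon):
    """Return the sorted set of tokens absent from the reference lexicon.

    These are candidates for manual review: either transcription errors
    or genuine regional items to add to the pronunciation dictionary.
    """
    return sorted({token for token in tokens if token not in lexicon})


if __name__ == "__main__":
    toy_lexicon = {"ça", "peut", "être", "des"}  # stand-in for a full dictionary
    tokens = normalize('E1: ça peut être des "röstis", huitante.')
    print(unknown_items(tokens, toy_lexicon))  # ['huitante', 'röstis']
```

In this sketch the flagged items are only listed. Deciding whether an item is an error or a regional word such as huitante remains a manual step, which is why the process is described as semi-automatic.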


Fig. 2 shows a 12-second excerpt of a spontaneous conversation corresponding to the following transcription:

ça peut être des romans , ça peut être des nouvelles , ça peut être des aventures , 
  ça peut être ... je crois que c' est plus le plaisir de lire,

where manual transcripts have been automatically aligned with the audio signal, providing word and phone boundaries. Two points in Fig. 2 are worth underlining. First, speech alternates with relatively long pauses (marked as [pause-respir]), which distinguishes conversational (informal, spontaneous) speech from read or prepared speech, where a higher density of speech and a more regular speech flow are commonly observed. Second, content words such as "romans", "nouvelles", "aventures" and "plaisir de lire" tend to last significantly longer than the surrounding (repeated) function word sequences "ça peut être des" or "je crois que c'est plus". These changes in temporal structure are clearly visible in the orthographic transcription tier of Fig. 2, where almost only content words remain readable; the remaining items are squeezed because of their very short durations.


Figure 2: Speech sample of 12 seconds from Nyon (Switzerland), guided conversations. To listen, click Media:roman-plaisir-de-lireNyon.ogg; WAV format: Roman ... Plaisir de Lire (Nyon).


Each word has been aligned at the phonemic level using standard pronunciation dictionaries. Word and phonemic class (plosives, fricatives, nasals, glides, vowels) segmentations have been made available to partners of the TCAN VarCom and PFC-Cor projects for specific research actions or, more generally, to enrich the PFC Web site http://www.projet-pfc.net/ in the future.


Fig. 3 shows two utterances of the word sequence "plaisir de lire": the first comes from a female speaker from Dijon, the second from a male speaker from Nyon (the same speaker as in Fig. 2). Beyond the variation in duration (by a factor of two), differences can be perceived at both the prosodic and phonetic levels. Future challenges include making these differences explicit and quantifying them over all of the PFC data, thus contributing to a better understanding of variation in speech and to improvements in acoustic modeling and automatic processing.



To listen, click Media:plaisir-de-lireDijon.ogg; WAV format: Plaisir de Lire (Dijon)

To listen, click Media:plaisir-de-lireNyon.ogg; WAV format: Plaisir de Lire (Nyon)

Figure 3: Variation in conversational speech: two utterances of the word sequence "plaisir de lire" by a female speaker from Dijon and a male speaker from Nyon (Switzerland).

Alignment results, in particular phone boundaries and labels, largely depend on the pronunciation dictionary in use. Beyond canonical phonemic alignments, specific pronunciation variants have been introduced to address some issues of phonetic and prosodic variation in French: schwa, mid-open vowels, nasalized vowels [8]. Fig. 4 gives an example of acoustic modeling of the word "épais" with a canonical pronunciation dictionary (left) and with a dictionary that allows variants specific to mid-open vowels (right).

Left: canonical pronunciation (ModelAcoust1eng.png). Right: pronunciation variants (ModelAcoust2eng.png).

Figure 4: Alignment with a canonical pronunciation (left) consists of producing phone boundaries for a given pronunciation (/epE/). When pronunciation variants are used (right), alignment consists of fixing boundaries for the most likely pronunciation (among [epe], [epE], [Epe] and [EpE]).
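The variant selection shown in Fig. 4 can be sketched in Python as follows. This is a toy illustration under stated assumptions: each phone is written as a single character, the alternation table only covers the closed/open mid vowel pair e/E, and the score passed to best_variant is a placeholder for the acoustic likelihood that a real aligner would compute on the signal. The names ALTERNATIONS, variants and best_variant are hypothetical.

```python
from itertools import product

# Assumed alternation table: a canonical mid vowel may surface as
# closed (e) or open (E); every other phone has a single realization.
ALTERNATIONS = {"e": ("e", "E"), "E": ("e", "E")}


def variants(canonical):
    """Expand a canonical pronunciation into all allowed variants."""
    choices = [ALTERNATIONS.get(phone, (phone,)) for phone in canonical]
    return ["".join(combination) for combination in product(*choices)]


def best_variant(canonical, score):
    """Pick the variant with the highest score.

    `score` maps a pronunciation to a number; in a real system this would
    be the alignment likelihood given the audio, here it is a stand-in.
    """
    return max(variants(canonical), key=score)


if __name__ == "__main__":
    print(variants("epE"))  # ['epe', 'epE', 'Epe', 'EpE']
    # Made-up log-likelihoods, for illustration only.
    toy_scores = {"epe": -10.5, "epE": -12.0, "Epe": -14.0, "EpE": -11.0}
    print(best_variant("epE", toy_scores.get))  # epe
```

With a canonical dictionary only the first step is skipped: the single pronunciation /epE/ is aligned as is, whatever the speaker actually produced.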

Results and prospects

The recordings of 200 speakers, corresponding to about 150 hours of speech, have been aligned and distributed to the partners to accelerate or enable a wide variety of studies.

As highlighted in the spectrogram of Fig. 2, which represents an excerpt of conversational speech, substantial durational changes are observed with respect to the phonemic units predicted by canonical pronunciations. Durations may be good indicators of pronunciation variation, and specific analyses of function words and content words as a function of speaking style and of other factors such as gender, age and region are underway.

Fig. 5 reports average duration distributions corresponding to different speaking styles: read text, guided conversations and free conversations (Fig. 5 left). Whereas comparisons between different PFC locations show only small changes (not shown here), large differences can be observed between scripted and conversational speaking styles. Differences between guided and free conversations are almost negligible. In conversational speech, the percentage of minimal-duration segments (30 ms) increases drastically to more than 20 percent; a possible explanation is the incomplete pronunciation of frequent word sequences (listen to the speech sample of Fig. 2). For comparison, duration distributions of standard ASR speech corpora, namely journalistic speech from broadcast news (BN) and conversational telephone speech (CTS), are added to the PFC distributions (Fig. 5 right), with the expectation that BN would be closer to read speech and CTS similar to PFC conversations. Whereas the CTS distributions almost merge with the PFC face-to-face conversation distributions, prepared (BN) speech has a sharper bell-shaped distribution than read speech. This difference may be related to time pressure and professional performance in journalistic speech, which tends towards a steady-stream production, or to different cognitive loads for prepared speech as compared to reading tasks.

PhonDur-PFCalltgl.png

PhonDur.png

Figure 5: Global phone duration distributions. X-axis: segment duration in cs (minimal segment duration is 3 cs) - Y-axis: percentage in corpus. Left: durations are compared using PFC data: read speech vs guided and free conversations. Right: PFC durations are compared to prepared BN speech and CTS conversational data in French.
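Distributions of the kind plotted in Fig. 5 can be derived from aligned segments with a few lines of Python. The sketch below assumes that each segment is available as a (label, start, end) triple with times in seconds, which is a hypothetical format and not necessarily the one distributed to the partners; the function name duration_distribution is likewise an assumption.

```python
from collections import Counter


def duration_distribution(segments):
    """Return {duration in centiseconds: percentage of segments}.

    `segments` is an iterable of (label, start_s, end_s) triples,
    an assumed format with times expressed in seconds.
    """
    durations = [round((end - start) * 100) for _, start, end in segments]
    counts = Counter(durations)
    total = len(durations)
    return {d: 100.0 * n / total for d, n in sorted(counts.items())}


if __name__ == "__main__":
    # Made-up alignment, for illustration only.
    toy_segments = [
        ("s", 0.00, 0.03),
        ("a", 0.03, 0.06),
        ("p", 0.06, 0.14),
        ("E", 0.14, 0.20),
    ]
    distribution = duration_distribution(toy_segments)
    print(distribution)  # {3: 50.0, 6: 25.0, 8: 25.0}
    # Share of segments at the minimal duration of 3 cs.
    print(distribution.get(3, 0.0))  # 50.0
```

Computing one such dictionary per speaking style (read text, guided conversation, free conversation) and comparing the share found at 3 cs would reproduce the kind of contrast discussed above, assuming the aligner enforces that minimal duration.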

Whatever the explanations, the duration curves highlight globally different pronunciations and call for more in-depth phonetic and prosodic studies. We hope that the insights gained will contribute both to improving ASR performance and to increasing our knowledge and understanding of variation.

Results on French accent identification and characterization [9], which draw on perceptual studies and on automatic alignment with specific pronunciation dictionaries, are presented at http://rs2007.limsi.fr/index.php?title=TLP:Page_9.


A variety of studies have been launched on the PFC corpus so far. Automatically processed speech corpora open up a wide range of stimulating research directions that were still out of reach ten years ago. We hope that the interdisciplinary collaborations of VarCom and PFC-Cor will help explore the linguistically rich information buried in the PFC corpus.

References

[1] Durand, Jacques, Laks, Bernard & Lyche, Chantal (2002). La phonologie du français contemporain: usages, variétés et structure. In: C. Pusch & W. Raible (eds.) Romanistische Korpuslinguistik - Korpora und gesprochene Sprache / Romance Corpus Linguistics - Corpora and Spoken Language. Tübingen: Gunter Narr Verlag, pp. 93-106.

[2] Gauvain, Jean-Luc, Adda, Gilles, Adda-Decker, Martine, Allauzen, Alexandre, Gendner, Véronique, Lamel, Lori, Schwenk, Holger (2005). Where Are We in Transcribing French Broadcast News? In InterSpeech, Lisbon, September 2005.

[3] Adda-Decker, Martine (2007). Problèmes posés par le schwa en reconnaissance et en alignement automatiques de la parole. In 5es Journées d'Études Linguistiques de Nantes, Nantes, June 2007.

[4] Adda-Decker, Martine, Lamel, Lori (2005). Do Speech Recognizers Prefer Female Speakers? In InterSpeech, Lisbon, September 2005.

[5] Adda-Decker, Martine, Boula de Mareüil, Philippe, Adda, Gilles, Lamel, Lori (2005). Investigating syllabic structures and their variation in spontaneous French. Speech Communication, 46:119-139, 2005.

[6] Adda-Decker, Martine (2006). De la reconnaissance automatique de la parole à l'analyse linguistique de corpus oraux. In Journées d'Étude sur la Parole, pp. 389-400, Dinard, June 2006.

[7] Adda-Decker, Martine (2007). Corpus pour la transcription automatique de l'oral. Revue Française de Linguistique Appliquée, XII(1):71-84, June 2007.

[8] Boula de Mareüil, Philippe, Adda-Decker, Martine, Woehrling, Cécile (2007). Analysis of oral and nasal vowel realisation in northern and southern French varieties. In 16th International Congress of Phonetic Sciences, Saarbrücken, July 2007.

[9] Woehrling, Cécile, Boula de Mareüil, Philippe (2006). Identification of regional accents in French: perception and categorization. In International Conference on Speech and Language Processing, pp. 1511-1514, Pittsburgh, September 2006.
