Multi-level analysis for spoken and written documents

From 2007 Scientific Report

O. Galibert, S. Rosset



In our dialogue systems and question-answering systems, the same analysis is applied to user queries, document collections and written questions. The general objective of this analysis is to find the pieces of information that can be of use for search and extraction, which we call pertinent information chunks. These can be of different categories: named entities, linguistic entities (e.g. verbs, prepositions), or specific entities (e.g. scores). All words that do not fall into such chunks are also annotated, following a longest-match strategy to find chunks with coherent meanings. An example of pertinent information chunks is given below:

<org> Airbus </org> <aux> a </aux> <action> vendu </action> <obj_val> <val> 10 </val> <prod> A380 </prod> </obj_val> à <org> Fedex </org>
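To make the nesting of chunks concrete, the annotated string can be read into a small tree. The following is a minimal Python sketch, not the system's own data structure: the `Chunk` class and the `parse` and `words` helpers are hypothetical names introduced for illustration, and closing tags are accepted without checking their name so that both `</org>` and the anonymous `</>` form work.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One annotated span: a label plus nested chunks and plain words."""
    label: str
    children: list = field(default_factory=list)


# A token is either a tag (<org>, </org>, </>) or a plain word.
TOKEN = re.compile(r"</?[^<>\s]*>|[^\s<>]+")


def parse(annotated: str) -> Chunk:
    """Build a tree from an annotated string using a stack of open chunks."""
    root = Chunk("root")
    stack = [root]
    for tok in TOKEN.findall(annotated):
        if tok.startswith("</"):
            if len(stack) == 1:
                raise ValueError("closing tag without an open chunk")
            stack.pop()
        elif tok.startswith("<"):
            node = Chunk(tok[1:-1])
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].children.append(tok)
    if len(stack) != 1:
        raise ValueError("unclosed chunk")
    return root


def words(chunk: Chunk) -> list:
    """Return the plain words of a chunk in their original order."""
    out = []
    for child in chunk.children:
        out.extend(words(child) if isinstance(child, Chunk) else [child])
    return out


tree = parse(
    "<org> Airbus </org> <aux> a </aux> <action> vendu </action> "
    "<obj_val> <val> 10 </val> <prod> A380 </prod> </obj_val> à "
    "<org> Fedex </org>"
)
```

Here `tree.children[3]` is the `obj_val` chunk, which itself contains a `val` chunk and a `prod` chunk, while the preposition `à` remains a plain word at the top level.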


In the following sections, we describe the types of entities that are handled by the system and how they are recognized.

Entities definition

Different entities at various linguistic levels have been defined. Starting from the widely adopted definition of named entities, we extended it to cover expressions describing one specific element of a given type, where the type can take values such as location, person, organization, etc. For instance, president of the United States in 2006 is a person named entity by our definition. Additionally, we decided to allow hierarchical named entities, so the previous example would also have United States annotated as an organization and 2006 as a date. These entities prove very useful in a question-answering system as search keys or answer chunks, and for identifying query types. For example, ONU is annotated as org, 2006 Cannes festival as event, and veni vidi vici as citation.

However, we have found that information present in named entities is not sufficient to analyze the wide range of user utterances that a robust dialog system should be able to deal with. After manual identification in a corpus of user utterances and text documents from various sources [1], we have identified three additional categories of specific entities. Table 1 summarizes the different entity types that are used.

Type of entity             Examples
classical named entities   pers: Romano Prodi ; Winston Churchill
                           prod: les Misérables ; Titanic
                           time: third century ; 1998 ; June 30th
                           org: European Commission ; NATO
                           loc: Cambridge ; England
extended named entities    monument: Chateau de Versailles ; Great Wall of China
                           event: the 2007 Six Nations ; the XXth Olympic winter games
                           amount: 500 ; two hundred and fifty thousand
                           measure: year ; mile ; Hertz
                           animal: cow ; whale
                           awards: Nobel Prize ; Academy awards
                           type_award: physics ; literature ; best picture
question markers           Qpers: who won the Oscar ; I'm looking for the name of the first minister in Italy
                           Qloc: where is Croke Park ; in which country is Dublin
                           Qmeasure: what's the distance in kilometers between Paris and Moscow
linguistic chunk           compound: language processing ; information technology ; producer of cocoa
                           verb: Roberto Martinez now knows the full size of the task
                           adj_comp: the microphones would be similar to ...
                           adj_sup: the biggest producer of cocoa of the world

Table 1. Examples for the main entity types.

Most entity types have subtypes that further refine the description of entities. For instance, the type for physical locations (loc) has different subtypes: continent, country, city, etc. Likewise, the type for measurements (measure) has subtypes including acreage, volume, length, speed, etc. Furthermore, some types allow inclusion of other types that act as specifiers. For instance, the type for events (event) can include entities of subtype period (e.g. the Summer olympic games), date (e.g. the 2007 Rugby World Cup), city (e.g. the Cannes film festival) or others.
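The relations between types, subtypes and specifiers can be pictured as two small lookup tables. The sketch below is only an illustration in Python: the table names `SUBTYPES` and `SPECIFIERS` and both helper functions are hypothetical, and only the subtypes named in the paragraph above are listed, not the full inventory.

```python
# Subtypes that refine a type (only those named in the text).
SUBTYPES = {
    "loc": {"continent", "country", "city"},
    "measure": {"acreage", "volume", "length", "speed"},
}

# Subtypes that may appear inside an entity of a given type as specifiers.
SPECIFIERS = {
    "event": {"period", "date", "city"},
}


def parent_type(subtype):
    """Return the type that a subtype refines, or None if it is not listed."""
    for type_name, subs in SUBTYPES.items():
        if subtype in subs:
            return type_name
    return None


def may_include(outer_type, inner_subtype):
    """Tell whether an entity of outer_type may contain inner_subtype."""
    return inner_subtype in SPECIFIERS.get(outer_type, set())
```

With these tables, `parent_type("city")` gives `"loc"`, and `may_include("event", "city")` holds, which matches the Cannes film festival example.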

All these entity types have corresponding generic interrogative expressions that can be found in user queries (e.g. in which <Country> country </Country> is Dover; <Qsize> how tall </Qsize> is the Eiffel Tower). We also consider types of entities that cover expressions indicative of communication management (e.g. <Dopening> hello </Dopening>; <Qdial> I'm looking for </Qdial>), and specific linguistic phenomena such as negation (e.g. <Qnegdial> I'm not looking for </Qnegdial> an <Qneg_info> animal </Qneg_info> but for a type of car ; <Qneg_sys> I didn't ask for that </Qneg_sys>).


With the described types as targets, we developed an analysis module to detect and annotate instances in text and transcribed speech. While most of our written documents are of largely acceptable linguistic quality, speech transcriptions contain disfluencies, repetitions, syntactic and grammatical errors, and lexical errors introduced by the ASR system, all of which the analysis strategy must deal with.

The types we need to detect correspond to two levels of analysis: named-entity recognition and chunk-based shallow parsing. Various strategies for named-entity recognition using machine learning techniques have been proposed ([2], [3], [4]), but they rely on the availability of large annotated corpora, which were difficult to build, in part because of the large number of occurrences needed to cover all defined types and subtypes. Rule-based approaches to named-entity recognition (e.g. [5]) rely on morphosyntactic and/or syntactic analysis of the documents. This kind of analysis was however not applicable to the present work: first, speech transcriptions are too noisy to allow for both accurate and robust linguistic analysis based on typical rules; second, the processing time of most existing linguistic analyzers was not compatible with the high-speed requirement of the system.

We decided to tackle the problem with rules based on regular expressions on words as in other works [6], which allows the use of lists for initial detection, and the definition of local contexts and simple categorizations. We have built a word-based regular expression engine named wmatch which works with backtracking and supports the following features:

  • positive and negative lookaheads
  • shy and greedy groupings
  • named classes and macros
  • rule application ordering
  • search and substitution on subtrees
  • external annotation of words (e.g. for morphosyntactic categories)
  • internal categorization of words (e.g. numbers, acronyms, proper nouns, etc.)

Rules are preanalyzed and optimized in several ways, and stored in a compact format in order to speed up processing.

Analysis is multi-pass, and subsequent rule applications operate on the results of previous rule applications which can be enriched or modified. The following example shows a wmatch rule used by the analysis module for annotating certain entities of type loc (location):

loc: ((<= &prep_loc) %caps | %pays | %ville);

The first element is the name of the rule, which will be used for annotation of the corresponding text spans. The (<= &prep_loc) subexpression is a positive lookbehind assertion which searches for elements covered by the macro &prep_loc (typically, a set of prepositions that introduce locations of the following kinds). The | operator denotes disjunction, and the elements prefixed with a % sign denote classes of words. For example, %caps is a predefined class that represents capitalized words, and %pays is a user-defined class that represents country names and their variants. Such classes use resources that are specifically built: for instance, we use lists for country names (approx. 500), city names (approx. 185,000), languages (approx. 300), etc. Rules can also be used to normalize variants of given values.

The example rule above only applies if an element matching the lookbehind assertion is found before any of the elements defined by the classes; the element matched by the lookbehind is not itself annotated when the rule applies. This illustrates how the surrounding context of words can be used to annotate them in a robust manner, something which would be more difficult to capture with full linguistic analysis and impractical to express with character-based regular expressions. As with any other approach, hand-coded rules can produce erroneous results. However, the impact on search recall can be minimized by using appropriate backoff strategies during document retrieval, and search precision can be improved by subsequently using more precise analysis techniques if time permits.
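The behaviour of the loc rule can be approximated at the word level in a few lines of Python. This is only a rough sketch and not wmatch itself: the word lists are tiny English stand-ins for the real resources, the function names are hypothetical, and the reading of %ville as a class of city names is an assumption.

```python
# Toy stand-ins for the macro and the word classes (assumed contents).
PREP_LOC = {"in", "at", "to", "from", "near"}  # plays the role of &prep_loc
PAYS = {"england", "france", "italy"}          # plays the role of %pays
VILLE = {"cambridge", "dublin", "paris"}       # plays the role of %ville (assumed)


def is_caps(word):
    """Stand-in for %caps: true when the word starts with a capital."""
    return word[:1].isupper()


def annotate_loc(words):
    """Tag a word as loc when a location preposition precedes it.

    The preposition only serves as context (lookbehind): it is checked
    but never annotated itself.
    """
    result = []
    for i, word in enumerate(words):
        preceded = i > 0 and words[i - 1].lower() in PREP_LOC
        in_class = (
            is_caps(word) or word.lower() in PAYS or word.lower() in VILLE
        )
        result.append(("loc" if preceded and in_class else None, word))
    return result


tagged = annotate_loc("I live in Cambridge".split())
```

In `tagged`, only `Cambridge` carries the `loc` label: `I` is capitalized but has no location preposition before it, and `in` is used as context only.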


The full analysis comprises some 50 steps and takes roughly 4 ms on a typical user utterance, which is fast enough for the analysis of short texts but still too slow for smooth reaction times when dynamically analyzing large document sets from the Web. Figure 1 below shows the annotation of a document snippet, and Figure 2 shows the annotation of a user utterance.

le président Bush a rencontré le premier ministre israélien après les dernières élections palestiniennes
<det> le </> <fonc> <fonction_publique> président </> </>

<pers> Bush </>
<aux> a </> <verb> rencontré </>
<det> le </>
<fonc> <fonction_publique> premier ministre </> </>
<org> <pays> <Israël> israélien </> </> </> </>
<prep> après </>
<range_objet> <det> les </> dernières </>
<Eve> élections </>
<org> <pays> <Palestine> palestiniennes </> </> </> </>

Figure 1. Annotation of a snippet

je ne veux pas d' informations sur Benedetti je voudrais une information sur le dernier prix Nobel de la paix
<Qneg_dial> je ne veux pas d' informations </> <neg_info> <prep> sur </> <pers> Benedetti </> </>

<Qdial> je voudrais <1> une </> information sur </>
<det> le </> <range_objet> dernier </>
<prix> <Prix> prix Nobel </> <type_prix> de la paix </> </>

Figure 2. Annotation of a user query

This system has been used in a dialog platform during a pilot user study [7]. We manually annotated the 20 collected dialogs (comprising 280 user utterances, for 1,852 words, 364 of them distinct) with the tags defined in our ontology. Evaluation was carried out on the results of the analysis produced by the system during the dialogs. The results obtained for the 118 different types of tags found in this corpus were 88.5% precision, 88.4% recall and an F-measure of 0.88. Specific error rates on tags were also measured:

  • wrong type and boundary errors: 124 tags (10.0%). Of these, 51 were on tags related to communication management and negation, which play an important role in the confirmation process and history management. Errors whereby a title question marker was confused with a citation question marker occurred 3 times, which is quite problematic for subsequent QA.
  • wrong type errors: 33 tags (2.6%). Of these, 6 led to topic detection errors (e.g. confusion between an organisation and a substantive), and 6 others were simply confusions between titles and proper nouns, which are less problematic as the backoffs implemented in the QA system can correct them. Other confusions (e.g. verb or substantive for adjective) had less impact on QA.
  • wrong boundary errors: 23 tags (1.8%). Of these, 18 involve question tags or question types and can therefore have a negative impact on confirmation and possibly QA.
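The reported F-measure is consistent with the usual balanced definition, the harmonic mean of precision and recall. A short Python check, assuming that this balanced form (beta = 1) is the one used here; the function name `f_measure` is introduced for illustration:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported for the 118 tag types.
score = f_measure(0.885, 0.884)
print(round(score, 2))  # 0.88
```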

Conclusion & Perspectives

The analysis is based on regular expressions on words for both list-based detection and context detection. The analysis is used in an oral and interactive question-answering system and is able to analyze queries and text in both oral and written forms. The system currently works on French and English data only. One of the major problems is the disambiguation between these entities, along with the classification of unknown proper names. For instance, Diana can be a person or a city; France could be an organisation, a location or the name of a boat. A rule system, in practice, tends to lack subtlety when encountering such a disambiguation task. Statistical methods seem better suited to this kind of automatic decision making. We are currently integrating support for these kinds of disambiguations into the rules engine, using a Memory Based Learning approach.
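As a generic illustration of what memory-based disambiguation looks like, the following Python sketch stores labeled examples and classifies a new instance by its nearest stored neighbours under a simple feature-overlap measure. It is not the integration described above: the features (previous word, next word), the stored examples and the function names are all invented for this toy example.

```python
from collections import Counter


def overlap(a, b):
    """Count the feature positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b))


def classify(memory, features, k=1):
    """Label an instance by majority vote among its k most similar examples."""
    ranked = sorted(
        memory, key=lambda ex: overlap(ex[0], features), reverse=True
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy memory: (previous word, next word) around an ambiguous name.
MEMORY = [
    (("princess", "said"), "pers"),
    (("with", "said"), "pers"),
    (("in", "the"), "loc"),
    (("to", "yesterday"), "loc"),
]

label = classify(MEMORY, ("princess", "spoke"))
```

Here `label` is `"pers"` because the stored example sharing the previous word `princess` is the closest one. The appeal of this family of methods for the task is that new disambiguation evidence is added by storing examples rather than by writing further rules.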

Another aspect we will investigate is adding an initial chunking pass and using these chunks as additional information for the rest of the analysis. This chunking would probably be statistical too, using either language models or unsupervised parsing methodologies.


  • [1] S. Rosset and S. Petel. The Ritel Corpus - An annotated Human-Machine open-domain question answering spoken dialog corpus. In LREC'06, pages 1640-1643. Genoa. May 2006.
  • [2] D.M. Bikel, S. Miller, R. Schwartz and R. Weischedel. Nymble: a high-performance learning name-finder. In ANLP'97. Washington, USA. 1997.
  • [3] H. Isozaki and H. Kazawa. Efficient Support Vector Classifiers for Named Entity Recognition. In COLING'02. Taipei, Taiwan. 2002.
  • [4] M. Surdeanu, J. Turmo and E. Comelles. Named Entity Recognition from spontaneous Open-Domain Speech. In InterSpeech'05. Lisbon, Portugal. 2005.
  • [5] F. Wolinski, F. Vichot and B. Dillet. Automatic Processing of Proper Names in Texts. In EACL'95. Dublin, Ireland. 1995.
  • [6] S. Sekine. Definition, dictionaries and tagger of Extended Named Entity hierarchy. In LREC'04. Lisbon, Portugal. 2004.
  • [7] B. van Schooten, S. Rosset, O. Galibert, A. Max, R. op den Akker and G. Illouz. Handling speech input in the Ritel QA dialogue system. In Interspeech'07. Antwerp, Belgium. August 2007.

