Constrained MLLR for Speaker Recognition

From 2007 Scientific Report

M. Ferràs, C.-C. Leung, C. Barras, J.-L. Gauvain

Objective

Maximum-Likelihood Linear Regression (MLLR) and Constrained MLLR (CMLLR) are two widely used techniques for speaker adaptation in large-vocabulary speech recognition systems. Recently, the use of MLLR transforms as features for speaker recognition tasks has been proposed, achieving performance comparable to that obtained with cepstral features. We describe a new feature extraction technique for speaker recognition based on CMLLR speaker adaptation which avoids the use of transcripts. Modeling is carried out with Support Vector Machines (SVMs). Results are reported on the NIST 2005 Speaker Recognition Evaluation dataset, both for the new system alone and in combination with two cepstral approaches, MFCC-GMM and MFCC-SVM; the combination improves the Equal Error Rate by 10% relative.

Description

Figure 1. Block diagram for GMM/UBM re-estimation.

MLLR and CMLLR can be used in speaker recognition systems to extract features that focus more specifically on speaker-related characteristics than standard spectral envelope features. Existing work relies on a large-vocabulary speech recognition system to derive several class-dependent MLLR transforms, whose coefficients are stacked into a single vector that serves as the feature vector. We propose a slightly different approach consisting of two stages. First, a GMM/UBM model is built on cepstral features from background speakers, following the iterative re-estimation procedure shown in Fig. 1. Then, a CMLLR transform is estimated for each speaker of interest using this UBM and rearranged as a vector for subsequent modeling. Only one CMLLR transform is estimated per speaker, resulting in one high-dimensional feature vector per speaker, which is especially well suited to SVM modeling.
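
A minimal sketch of this estimation step follows (Python/NumPy, with scikit-learn standing in for the actual UBM trainer): it estimates a single CMLLR transform W = [A b] against a diagonal-covariance GMM using the standard row-by-row update, then flattens it into a feature vector. The function and the toy data are illustrative assumptions, not the system's actual implementation.

 import numpy as np
 from sklearn.mixture import GaussianMixture

 def estimate_cmllr(gmm, X, n_iter=5):
     """Estimate a CMLLR transform W = [A b] for the frames X under a
     diagonal-covariance GMM, maximizing
     sum_t sum_g gamma_g(t) * (log|A| - 0.5*||(W xi_t - mu_g)/sigma_g||^2)."""
     T, d = X.shape
     xi = np.hstack([X, np.ones((T, 1))])           # extended frames [x; 1]
     gamma = gmm.predict_proba(X)                   # frame-level posteriors
     beta = gamma.sum()                             # total occupancy
     mu, var = gmm.means_, gmm.covariances_         # diagonal covariances
     # Per-row sufficient statistics G_i (d+1 x d+1) and k_i (d+1)
     G = np.zeros((d, d + 1, d + 1))
     k = np.zeros((d, d + 1))
     for g in range(gmm.n_components):
         S = (xi * gamma[:, g:g + 1]).T @ xi        # weighted scatter of xi
         s = gamma[:, g] @ xi                       # weighted sum of xi
         for i in range(d):
             G[i] += S / var[g, i]
             k[i] += (mu[g, i] / var[g, i]) * s
     W = np.hstack([np.eye(d), np.zeros((d, 1))])   # start from identity
     for _ in range(n_iter):                        # row-by-row updates
         for i in range(d):
             A = W[:, :d]
             # extended cofactor row of A (zero in the bias position)
             p = np.append(np.linalg.det(A) * np.linalg.inv(A).T[i], 0.0)
             Gi = np.linalg.inv(G[i])
             a, b = p @ Gi @ p, p @ Gi @ k[i]
             # alpha solves a*alpha^2 + b*alpha - beta = 0; a full
             # implementation compares both roots, we take the larger one
             alpha = (-b + np.sqrt(b * b + 4 * a * beta)) / (2 * a)
             W[i] = (alpha * p + k[i]) @ Gi
     return W                                       # d x (d+1)

 # Toy usage: UBM on background data, then one flattened transform per speaker
 rng = np.random.default_rng(0)
 ubm = GaussianMixture(8, covariance_type="diag").fit(rng.normal(size=(2000, 12)))
 speaker_vec = estimate_cmllr(ubm, rng.normal(0.3, 1.1, (500, 12))).ravel()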

Experiments are conducted on conversational telephone speech from the NIST 2005 Speaker Recognition Evaluation, and performance is evaluated in terms of the NIST-defined minimum detection cost function (MDC) and the equal error rate (EER). All systems use cepstral features with differential coefficients, channel compensation and short-term Gaussianization. The MFCC-GMM system is based on a MAP-adapted, gender-dependent UBM with 1536 Gaussians. The MFCC-SVM system expands the cepstral features through a third-order monomial expansion, resulting in a 20824-dimensional normalized mean vector, further reduced to 3197 dimensions using Kernel Principal Component Analysis (KPCA); a linear-kernel SVM is trained on the resulting features. The CMLLR-SVM system applies a similar SVM setup to the CMLLR transforms. The individual systems are fused at the score level by arithmetic averaging.
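
As an illustration of the MFCC-SVM front end, the sketch below performs a third-order monomial expansion of each frame, averages it over the utterance, reduces the dimension with (linear-kernel) KPCA and trains a linear SVM. All dimensions here are toy assumptions; for what it is worth, 48-dimensional frames would give C(51,3) - 1 = 20824 monomials, consistent with the figure quoted above.

 import numpy as np
 from sklearn.preprocessing import PolynomialFeatures
 from sklearn.decomposition import KernelPCA
 from sklearn.svm import LinearSVC

 rng = np.random.default_rng(1)
 D = 8                                       # frame dimension, kept small here
 expand = PolynomialFeatures(degree=3, include_bias=False)

 def utt_vector(frames):
     # normalized mean of the per-frame third-order monomial expansion
     v = expand.fit_transform(frames).mean(axis=0)
     return v / np.linalg.norm(v)

 # Toy utterances standing in for background/target cepstral frame streams
 utts = np.vstack([utt_vector(rng.normal(size=(200, D))) for _ in range(40)])
 labels = np.array([0] * 20 + [1] * 20)      # 0 = background, 1 = target

 kpca = KernelPCA(n_components=10)           # 3197 components in the real system
 svm = LinearSVC().fit(kpca.fit_transform(utts), labels)
 test = utt_vector(rng.normal(0.2, 1.0, (200, D)))
 score = svm.decision_function(kpca.transform(test[None, :]))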

Results

Figure 2. DET curves for the individual systems (MFCC-GMM, MFCC-SVM and CMLLR-SVM) and for the baseline and all-combination systems.

Two re-estimation iterations were found to be optimal for the CMLLR-SVM system. Table 1 shows results for each individual system, MFCC-GMM (a), MFCC-SVM (b) and CMLLR-SVM (c), as well as for the baseline system (a+b) and the all-combination system (a+b+c). CMLLR-SVM is competitive with the other individual systems (a and b) in terms of EER; however, both MFCC-GMM and MFCC-SVM significantly outperform it in terms of MDC. This trend is confirmed after fusion of all individual systems: including CMLLR-SVM in the fusion brings a 10% relative improvement over the baseline in EER (from 7.11% to 6.40%), but leaves MDC at essentially the same level.

System                     MDC (x100)   EER (%)
MFCC-GMM (a)                  0.330       8.61
MFCC-SVM (b)                  0.277       7.41
CMLLR-SVM (c)                 0.370       8.15
Baseline (a+b)                0.266       7.11
All-combination (a+b+c)       0.260       6.40

Table 1. MDC and EER for the individual, baseline and all-combination systems.
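
To make the fusion and the metrics concrete, the following self-contained sketch computes the EER, a minimum detection cost, and the arithmetic-average score fusion on random stand-in scores. The cost parameters (c_miss = 10, c_fa = 1, p_tgt = 0.01) are the usual NIST SRE settings, stated here as an assumption since the page does not give them.

 import numpy as np

 def rates(scores, labels):
     # miss and false-alarm rates at every candidate threshold
     scores, labels = np.asarray(scores), np.asarray(labels)
     thr = np.sort(scores)
     miss = np.array([(scores[labels == 1] < t).mean() for t in thr])
     fa = np.array([(scores[labels == 0] >= t).mean() for t in thr])
     return miss, fa

 def eer(scores, labels):
     # operating point where miss and false-alarm rates are (nearly) equal
     miss, fa = rates(scores, labels)
     i = np.argmin(np.abs(miss - fa))
     return (miss[i] + fa[i]) / 2

 def min_dcf(scores, labels, c_miss=10.0, c_fa=1.0, p_tgt=0.01):
     # minimum of the NIST detection cost over all thresholds
     miss, fa = rates(scores, labels)
     return np.min(c_miss * p_tgt * miss + c_fa * (1 - p_tgt) * fa)

 # Random stand-ins for per-trial scores of systems (a), (b) and (c)
 rng = np.random.default_rng(2)
 labels = rng.integers(0, 2, 1000)              # 1 = target trial
 a = labels + rng.normal(0.0, 0.9, 1000)
 b = labels + rng.normal(0.0, 0.8, 1000)
 c = labels + rng.normal(0.0, 0.85, 1000)
 fused = (a + b + c) / 3.0                      # arithmetic-average fusion
 print(f"EER = {eer(fused, labels):.3f}  minDCF = {min_dcf(fused, labels):.4f}")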

Fig. 2 shows DET curves for the individual systems. MFCC-SVM outperforms the two other systems, while CMLLR-SVM and MFCC-GMM complement each other, each being stronger in a different region of the DET curve. The all-combination system consistently outperforms the baseline system. This improvement is small at the MDC operating point and grows at lower miss probabilities, reaching for instance a 10% relative improvement at the EER operating point.

Perspectives

The main advantage of using CMLLR instead of MLLR transforms is that the training procedure is neither transcript-dependent nor language-dependent, while still capturing the differences between speaker-independent and speaker-dependent acoustic features. On the other hand, since a GMM is used to estimate the transform, the resulting transform is less precise and probably more dependent on the spoken message.

References

M. Ferràs, C.-C. Leung, C. Barras, and J.-L. Gauvain, "Constrained MLLR for Speaker Recognition," in Proceedings of ICASSP, Honolulu, Hawaii, April 2007, pp. 53-56.
