iVector approach to Phonotactic LRE
Mehdi Soufifar, 2nd May 2011
Phonotactic LRE
[Block diagram of the phonotactic LRE system]
Train: utterance -> phoneme recognizer (HVite, BUT PR, ...) with language-dependent AMs -> phoneme sequence -> extract n-gram statistics -> n-gram counts -> train L classifiers (LR, SVM, LM, GLC, ...), one per target language.
Test: test utterance -> phoneme recognizer -> phoneme sequence -> extract n-gram statistics -> n-gram counts -> L classifiers -> language-dependent scores.
N-gram Counts
• Problem: huge vector of n-gram counts (N^3 = 226981 for the RU phoneme set)
• Solutions:
  • Choose the most frequent n-grams
  • Choose the top N n-grams discriminatively (LL)
  • Compress the n-gram counts:
    • Singular Value Decomposition (SVD): decompose the document matrix D and use the transformation matrix U to reduce the n-gram vector dimensionality (see the sketch below)
    • PCA-based dimensionality reduction
    • iVector feature selection
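A minimal NumPy sketch of the SVD option above, using toy dimensions; the orientation of the count matrix and the names E, N, R, U_r are illustrative assumptions, not the exact setup used in these experiments.

```python
# Sketch: compress huge n-gram count vectors by projecting them onto the top
# left singular vectors of the training count matrix (toy sizes for speed).
import numpy as np

E, N, R = 5000, 200, 50                 # #n-grams, #training utterances, reduced dim (illustrative)
rng = np.random.default_rng(0)
D = rng.poisson(0.05, size=(E, N)).astype(float)   # n-gram-by-utterance count matrix

# Thin SVD: D = U @ diag(S) @ Vt; the columns of U span the n-gram space.
U, S, Vt = np.linalg.svd(D, full_matrices=False)
U_r = U[:, :R]                          # transformation matrix keeping the top R directions

def reduce_counts(x):
    """Project a raw n-gram count vector (length E) down to R dimensions."""
    return U_r.T @ x

x_low = reduce_counts(D[:, 0])          # compressed representation of the first utterance
```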
Sub-space multinomial modeling
• Every vector of n-gram counts consists of E events (#n-grams)
• Log probability of the nth utterance under the multinomial (MN) distribution:
  log P(gamma_n) = sum_e gamma_ne log phi_ne
• phi_ne can be defined via the sub-space model as:
  phi_ne = exp(m_e + t_e w_n) / sum_e' exp(m_e' + t_e' w_n)
• Model parameters to be estimated by ML are the sub-space matrix T (rows t_e) and the utterance vectors w_n
• No analytical solution!
• We use Newton-Raphson updates as a numerical solution (see the sketch below)
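A minimal numerical sketch of one Newton-Raphson step on an utterance vector w under a sub-space multinomial model of this form, assuming phi = softmax(m + T w). The diagonal (Fisher-style) Hessian approximation and all names (m, T, gamma) are illustrative assumptions, not the exact update used by the authors.

```python
# Sketch: one Newton-Raphson update of the iVector w for a single utterance
# under phi = softmax(m + T @ w), maximizing sum_e gamma_e * log(phi_e).
import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def newton_update_w(w, m, T, gamma):
    """m: (E,) log base distribution, T: (E, R) sub-space matrix,
    gamma: (E,) n-gram counts of this utterance, w: (R,) current iVector."""
    phi = softmax(m + T @ w)               # model distribution over the E events
    n_tot = gamma.sum()                    # total number of n-gram tokens
    grad = T.T @ (gamma - n_tot * phi)     # gradient of the log-likelihood wrt w
    # Approximate negative Hessian (diagonal/Fisher-style), kept positive definite.
    H = T.T @ (T * (n_tot * phi)[:, None]) + 1e-6 * np.eye(T.shape[1])
    return w + np.linalg.solve(H, grad)

# Toy usage: E events, R-dimensional sub-space, a few iterations from w = 0.
rng = np.random.default_rng(1)
E, R = 1000, 20
m = np.log(np.full(E, 1.0 / E))
T = 0.01 * rng.standard_normal((E, R))
gamma = rng.poisson(0.2, size=E).astype(float)
w = np.zeros(R)
for _ in range(5):
    w = newton_update_w(w, m, T, gamma)
```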
Sub-space multinomial modeling
• 1st solution:
  • Consider all 3-grams to be components of a Bernoulli trial
  • Model the entire vector of 3-gram counts with one multinomial distribution
  • N-gram events are not independent (not consistent with the Bernoulli-trial presumption!)
• 2nd solution:
  • Cluster 3-grams based on their histories
  • Model each history with a separate MN distribution (a per-history sketch follows below)
  • Data sparsity problem!
  • Cluster 3-grams based on a binary decision tree
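A small sketch of how the 2nd solution changes the likelihood: the softmax normalisation runs separately within each block of 3-grams sharing the same 2-gram history. The block layout (history_slices) is a hypothetical input format, not the authors' data structure.

```python
# Sketch: per-history multinomial likelihood; each history gets its own softmax.
import numpy as np

def per_history_log_prob(gamma, m, T, w, history_slices):
    """history_slices: list of index arrays, one per history, partitioning the E events."""
    a = m + T @ w
    ll = 0.0
    for idx in history_slices:
        block = a[idx] - a[idx].max()
        log_phi = block - np.log(np.exp(block).sum())   # softmax within this history only
        ll += gamma[idx] @ log_phi
    return ll
```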
Training of iVector extractor
• Number of iterations: 5-7 (depends on the sub-space dimension)
• Sub-space dimension: 600
[Plots for the 3-second, 10-second and 30-second conditions]
Classifiers
• Configuration: L one-to-all linear classifiers (L = number of targeted languages)
• Classifiers (a sketch of the one-to-all setup follows below):
  • SVM
  • LR
  • Linear Generative Classifier
  • MLR (to be done!)
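A minimal sketch of the L one-to-all linear classifier setup on top of the extracted iVectors, using scikit-learn logistic regression as one of the listed back-ends; the data shapes, the number of languages and the regularisation settings are illustrative, not the configuration used here.

```python
# Sketch: L one-vs-rest logistic regression classifiers over 600-dim iVectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(2)
R, L = 600, 23                                   # iVector dim, #target languages (illustrative)
X_train = rng.standard_normal((500, R))          # training iVectors (toy data)
y_train = rng.integers(0, L, size=500)           # language labels

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Per-language scores for a batch of test iVectors (one column per language).
scores = clf.decision_function(rng.standard_normal((10, R)))
```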
Results on different classifiers
• Task: NIST LRE 2009
N-gram clustering
• Remove all 3-grams occurring fewer than 10 times over all training utterances
• Model each history with a separate MN distribution
• 1084 histories, up to 33 3-grams each (see the counting sketch below)
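A small sketch of this pruning and grouping step, assuming the training data is available as lists of phoneme symbols; the threshold of 10 follows the slide, everything else (names, data layout) is illustrative.

```python
# Sketch: count 3-grams over the training set, drop rare ones, group by 2-gram history.
from collections import Counter, defaultdict

def build_histories(train_phoneme_seqs, min_count=10):
    counts = Counter()
    for seq in train_phoneme_seqs:
        for i in range(len(seq) - 2):
            counts[tuple(seq[i:i + 3])] += 1          # count every 3-gram occurrence

    kept = [g for g, c in counts.items() if c >= min_count]   # prune 3-grams seen < 10 times

    histories = defaultdict(list)                     # (p_i, p_j) history -> its 3-grams
    for g in kept:
        histories[g[:2]].append(g)
    return histories

# histories = build_histories(train_seqs)   # e.g. ~1084 histories on the RU phoneme set
```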
Merging histories using BDT
• In the case of a 3-gram P_i P_j P_k:
• Merge histories which do not increase the entropy by more than a certain value (a sketch of this criterion follows below)
[Diagram: Models 1 keep the histories separate (P_i P_22 P_k and P_i P_33 P_k); Model 2 merges them into P_i P_{33+22} P_k; E1 = Entropy(Models 1), E2 = Entropy(Model 2), D = E1 - E2]
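A small sketch of the entropy test behind the merging, assuming each history model is summarised by its vector of continuation counts; the count-weighted entropy and the sign convention are one plausible reading of the slide, not necessarily the exact formula used.

```python
# Sketch: how much does pooling two histories' continuation counts raise the entropy?
import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def entropy_increase(counts1, counts2):
    """Increase in (count-weighted) entropy caused by merging the two histories."""
    n1, n2 = counts1.sum(), counts2.sum()
    e_sep = (n1 * entropy(counts1) + n2 * entropy(counts2)) / (n1 + n2)
    e_merged = entropy(counts1 + counts2)
    return e_merged - e_sep

c1 = np.array([5.0, 1.0, 0.0])               # continuation counts of history P_i P_22 (toy)
c2 = np.array([4.0, 2.0, 1.0])               # continuation counts of history P_i P_33 (toy)
merge_ok = entropy_increase(c1, c2) < 0.05   # merge only if the increase stays below a threshold
```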
Results on DT Hist. merging
• 1089-60
• More iterations of training T => the T matrix is moving toward the zero matrix!
Strange results
• 3-grams with no occurrences throughout the whole training set should not affect system performance!
• Remove all 3-grams with no occurrences throughout the whole training set: 35973 -> 35406 (567 removed)
• Even worse results if we prune more!
DT clustering of n-gram histories
• The overall likelihood is an order of magnitude higher than with the 1st solution
• The change in model likelihood is quite notable in each iteration!
• The T matrix is mostly zero after some iterations!
iVector inspection
[Scatter plots of iVectors for English (Engl) and Cantonese (Cant)]
iVector inspection
• Multiple data sources cause bimodality
• We also see this effect in some single-source languages
[Plot of iVectors for Amharic (Amha)]