Reference (ICASSP 2005)
• Rapid Language Model Development Using External Resources for New Spoken Dialog Domains
Ruhi Sarikaya¹, Agustin Gravano², Yuqing Gao¹ (¹IBM, ²Columbia University)
• Maximum Entropy Based Generic Filter for Language Model Adaptation
Dong Yu, Milind Mahajan, Peter Mau, Alex Acero (Microsoft)
• Language Model Estimation for Optimizing End-to-End Performance of a Natural Language Call Routing System
Vaibhava Goel, Hong-Kwang Jeff Kuo, Sabine Deligne, Cheng Wu (IBM)
Introduction
• LM adaptation consists of four steps:
• 1. Collect task-specific adaptation data
• 2. Normalize the data (abbreviations, dates and times, punctuation)
• 3. Analyze the adaptation data and build a task-specific LM
• 4. Interpolate the task-specific LM with the task-independent LM (a minimal sketch follows)
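A minimal Python sketch of step 4, assuming `p_task` and `p_indep` are callables returning n-gram probabilities and that the interpolation weight `lam` has been tuned on held-out in-domain data (all names here are illustrative, not from the papers):

```python
# Step 4: linear interpolation of the task-specific LM with the
# task-independent LM. `p_task`, `p_indep`, and `lam` are illustrative.
def interpolated_prob(word, history, p_task, p_indep, lam=0.6):
    """P(w|h) = lam * P_task(w|h) + (1 - lam) * P_indep(w|h)."""
    return lam * p_task(word, history) + (1.0 - lam) * p_indep(word, history)
```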
Introduction
• Language modeling research has concentrated in two directions:
• 1. Improving language model probability estimation
• 2. Obtaining additional training material
• The largest data set is the World Wide Web (WWW): more than 4 billion pages
Introduction
• Using web data for language modeling involves query generation and filtering the relevant text from the retrieved pages
• Web counts are certainly less sparse than the counts in a corpus of a fixed size
• Web counts are also likely to be significantly noisier than counts obtained from a carefully cleaned and normalized corpus
• Retrieval unit: whole document vs. sentence (utterance)
Build an LM for a new domain
• In practice, when we start to build a spoken dialog system (SDS) for a new domain, the amount of in-domain data for the target domain is usually small
• Definitions:
• Static resources: corpora collected for other tasks
• Dynamic resources: web data
Generating search queries
• Google is used as the search engine
• The more specific a query is, the more relevant the retrieved pages are (see the sketch below)
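As a hypothetical illustration of query generation, the sketch below builds quoted queries from the most frequent in-domain n-grams; the n-gram order and the `top_k` cut-off are assumptions, not the paper's exact recipe:

```python
from collections import Counter

def make_queries(sentences, n=3, top_k=20):
    """Turn the most frequent in-domain n-grams into quoted queries.

    Quoting each n-gram makes the query more specific, and more
    specific queries tend to retrieve more relevant pages.
    """
    grams = Counter()
    for s in sentences:
        toks = s.split()
        grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return ['"%s"' % " ".join(g) for g, _ in grams.most_common(top_k)]

# Example: make_queries(["book a flight to boston", "show me morning flights"])
```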
Similarity-based sentence selection
• Retrieved sentences are scored with machine translation's BLEU (BiLingual Evaluation Understudy) metric:

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where N is the maximum n-gram length, and $w_n$ and $p_n$ are the corresponding weight and precision, respectively. BP is the brevity penalty:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

where r and c are the lengths of the reference and candidate sentences, respectively.
• Sentences scoring above a threshold of 0.08 are kept.
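A runnable sketch of the selection step under the definitions above; the add-epsilon smoothing and the use of the best match against any in-domain sentence are assumptions, since the slide does not specify them:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """BLEU-style similarity with uniform weights w_n = 1/max_n."""
    cand, ref = candidate.split(), reference.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        p_n = max(overlap, 1e-9) / total  # epsilon avoids log(0)
        log_p += math.log(p_n) / max_n
    r, c = len(ref), len(cand)
    bp = 1.0 if c > r else math.exp(1.0 - float(r) / max(c, 1))  # brevity penalty
    return bp * math.exp(log_p)

def select(web_sentences, in_domain, threshold=0.08):
    """Keep web sentences whose best similarity against any
    in-domain sentence exceeds the 0.08 threshold from the slide."""
    return [s for s in web_sentences
            if max(bleu(s, r) for r in in_domain) > threshold]
```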
Experimental results
• SCLM: language model built from static corpora
• WWW-20 / WWW-100: predefined limit of 20 / 100 pages per sentence
E-mail corpus • Dictated and non-dictated
Filtering the corpus
• Filtering out non-dictated text is not an easy job in general
• Hand-crafted rules (e.g. regular expressions, as illustrated below) have limitations:
• They do not generalize well to situations that have not been encountered
• Rules are usually language dependent
• Developing and testing rules is very costly
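For illustration, a few hand-crafted rules of the kind the slide alludes to; the patterns are hypothetical examples, not the rule set used in the paper:

```python
import re

# Typical hand-crafted rules for dropping non-dictated e-mail lines.
# Real rule sets are large, language-dependent, and costly to maintain,
# which motivates the MaxEnt filter described next.
NON_DICTATED = [
    re.compile(r"^-{2,}\s*$"),                # signature separator "--"
    re.compile(r"^(From|To|Subject|Date):"),  # quoted mail headers
    re.compile(r"^>+"),                       # quoted reply lines
]

def is_non_dictated(line):
    return any(p.match(line) for p in NON_DICTATED)
```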
Maximum Entropy based filter
• Consider the filtering task as a labeling problem: segment the adaptation data into two categories
• Category D (Dictated text): text which should be used for LM adaptation
• Category N (Non-dictated text): text which should not be used for LM adaptation
• The text is divided into a sequence of units (such as lines) $T = t_1 \ldots t_m$, where $t_i$ is a text unit and $l_i$ is the label associated with $t_i$
Label dependency
• Assume that the labels of the text units are independent of each other given the complete sequence of text units: $P(L \mid T) = \prod_i P(l_i \mid T)$
• We further assume that the label for a given unit depends only upon the units in a surrounding window of size k: $P(l_i \mid T) = P(l_i \mid t_{i-k} \ldots t_{i+k})$
• With k = 1: $P(l_i \mid t_{i-1}, t_i, t_{i+1})$
Classification Model
• A MaxEnt model has the form:

$$P(l \mid t) = \frac{\exp\left(\boldsymbol{\lambda} \cdot \mathbf{f}(l, t)\right)}{\sum_{l'} \exp\left(\boldsymbol{\lambda} \cdot \mathbf{f}(l', t)\right)}$$

where $\boldsymbol{\lambda}$ is the vector of model parameters and $\mathbf{f}(l, t)$ is the feature vector.
Classification Model
• A unit is labeled D (and kept for adaptation) when $P(D \mid t) > P_{thresh} = 0.5$
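A minimal sketch of the resulting binary MaxEnt classifier; the feature names and weights are placeholders, not the paper's:

```python
import math

def maxent_prob(features, weights):
    """P(D | unit) for a binary MaxEnt model: exp(sum_j w_j f_j) / Z.

    `features` maps feature names to values for the current text unit
    (and, with k = 1, its neighbours); `weights` holds one lambda per
    (label, feature) pair. Both are hypothetical placeholders.
    """
    scores = {}
    for label in ("D", "N"):
        scores[label] = math.exp(sum(weights.get((label, f), 0.0) * v
                                     for f, v in features.items()))
    return scores["D"] / sum(scores.values())

def keep_for_adaptation(features, weights, p_thresh=0.5):
    """Label the unit D (dictated) when P(D | unit) > P_thresh = 0.5."""
    return maxent_prob(features, weights) > p_thresh
```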
Evaluation • Uses only features RawCompact, EOS, and OOV • No space splitting
Filtering is especially important and effective for adaptation data with a high percentage of non-dictated text (U2)
Efficient Linear Combination for Distant n-gram Models
David Langlois, Kamel Smaili, Jean-Paul Haton
EUROSPEECH 2003, pp. 409-412
Introduction
• Classical n-gram model: predicts a word from the n−1 immediately preceding words
• Distant language models: exploit information located farther back in the history
Modelization of distance in SLM
• Cache model (self-relationship)
• It deals with the relationship between a word present in the history and itself: if a word is frequent in the history, it is more likely to appear again (see the sketch below)
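A common way to realize this idea is a unigram cache interpolated with the base model; the sketch below is illustrative, with `base_prob` and `lam` as assumed inputs rather than the paper's formulation:

```python
from collections import Counter

def cache_prob(word, history, base_prob, lam=0.2):
    """Unigram cache interpolated with the base model: a word that is
    frequent in the history gets a boosted probability. `lam` is an
    illustrative interpolation weight."""
    counts = Counter(history)
    p_cache = counts[word] / float(len(history)) if history else 0.0
    return lam * p_cache + (1.0 - lam) * base_prob(word)
```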
Modelization of distance in SLM (cont.)
• Trigger model: the relationship between two words
• It deals with pairs of words v → w such that if v (the triggering word) is in the history, w (the triggered word) is more likely to appear
• In fact, the majority of triggers are self-triggers (v → v): a word triggers itself
d-n-gram model
• A d-n-gram predicts $w_i$ from n−1 words located d positions further back in the history:

$$P_d(w_i \mid w_{i-d-n+1} \ldots w_{i-d-1}) = \frac{N_d(w_{i-d-n+1} \ldots w_{i-d-1}\, w_i)}{N_d(w_{i-d-n+1} \ldots w_{i-d-1})}$$

where $N_d(\cdot)$ is the discounted count
• The 0-n-gram model is the classical n-gram model
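A short sketch of a relative-frequency d-bigram under this definition; plain counts stand in for the paper's discounted counts $N_d$, so this is a simplification:

```python
from collections import Counter

def distant_bigram_model(corpus, d):
    """Relative-frequency d-bigram: P_d(w_i | w_{i-1-d}).

    d = 0 recovers the classical bigram. Plain counts are used here
    instead of discounted counts to keep the sketch short.
    """
    pair, hist = Counter(), Counter()
    for sent in corpus:
        toks = sent.split()
        for i in range(1 + d, len(toks)):
            h, w = toks[i - 1 - d], toks[i]
            pair[(h, w)] += 1
            hist[h] += 1
    return lambda w, h: pair[(h, w)] / float(hist[h]) if hist[h] else 0.0
```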
Evaluation
• Vocabulary: 20k words
• Training set: 38M words
• Development set: 2M words
• Test set: 2M words
• Baseline: classical n-gram models
Integration of distant n-gram models
• A distant n-gram model cannot be used alone, since it takes into account only a part of the history: perplexity is 717 for n = 2 and d = 4
• Therefore, several models with distances up to d are combined with the baseline model
• Improvements: 7.1% (distant bigrams) and 3.1% (distant trigrams)
• The utility of distant n-gram models decreases with the distance: a distance greater than 2 does not provide more information
• Distant trigrams lead to an improvement, but it is less important than for distant bigrams, possibly because of the overlap between the histories of the d-trigram and the (d+1)-trigram
Backoff smoothing
• Smoothed models: b_u_z (bigram backing off to unigram and zerogram), db_u_z (distant bigram with the same backoff chain), and their combination (b_u_z)·(db_u_z)
• Improvements: 7.9% and 11.6%
Combination weights
• A unique (single) weight per model, or
• Weights that depend on each history (on the class of each sub-history)
Combination of distant n-grams
• In order to combine K models $M_1, \ldots, M_K$, a set of weights $\lambda_1, \ldots, \lambda_K$ is defined and the combination is expressed by:

$$P(w \mid h) = \sum_{k=1}^{K} \lambda_k P_{M_k}(w \mid h)$$

• The development corpus is not sufficient to estimate a huge number of parameters
• Solution: classify the histories and assign a weight to each class
Classification
• Break the history into several parts (sub-histories). Each sub-history is analyzed in order to estimate its importance for prediction, and is then put into a class
• Such a class is directly linked to the value of the sub-history frequency: it gathers all sub-histories that have approximately the same frequency (see the sketch below)
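A sketch of the per-class combination, assuming each model exposes hypothetical `subhistory` and `prob` methods and that the per-class weights were estimated on the development corpus (the class boundaries below are illustrative, not the paper's 4000/8000-class scheme):

```python
def history_class(subhistory_count, boundaries=(1, 5, 20, 100)):
    """Bucket a sub-history by frequency; sub-histories with similar
    training counts share a class and hence a combination weight."""
    for cls, b in enumerate(boundaries):
        if subhistory_count < b:
            return cls
    return len(boundaries)

def combined_prob(word, history, models, class_weights, counts):
    """P(w|h) = sum_k lambda_k(class(h_k)) P_k(w|h_k): each model sees
    its own sub-history, and its weight depends on that sub-history's
    frequency class. `counts` maps sub-histories to training counts;
    weights per class are assumed normalised."""
    total = 0.0
    for k, model in enumerate(models):
        sub = model.subhistory(history)        # model-specific part of h
        cls = history_class(counts.get(sub, 0))
        total += class_weights[k][cls] * model.prob(word, sub)
    return total
```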
Results
• 8000 classes: perplexity 115.4, a 12.8% improvement over the baseline (132.4) and a 5.3% improvement over the single-weight combination (121.9)
• 4000 classes: perplexity 85.2, a 12.8% improvement over the baseline (97.8) and a 1.5% improvement over the single-weight combination (86.5)