Reference (ICASSP 2005)
• Rapid Language Model Development Using External Resources for New Spoken Dialog Domains
Ruhi Sarikaya¹, Agustin Gravano², Yuqing Gao¹ (¹IBM, ²Columbia University)
• Maximum Entropy Based Generic Filter for Language Model Adaptation
Dong Yu, Milind Mahajan, Peter Mau, Alex Acero (Microsoft)
• Language Model Estimation for Optimizing End-to-End Performance of a Natural Language Call Routing System
Vaibhava Goel, Hong-Kwang Jeff Kuo, Sabine Deligne, Cheng Wu (IBM)
Introduction
• LM adaptation consists of four steps:
• 1. Collect task-specific adaptation data
• 2. Normalize the data (abbreviations, dates and times, punctuation)
• 3. Analyze the adaptation data and build a task-specific LM
• 4. Interpolate the task-specific LM with the task-independent LM (a minimal sketch follows)
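A minimal Python sketch of step 4, assuming `p_task` and `p_indep` are callables returning n-gram probabilities and that the interpolation weight `lam` has been tuned on held-out in-domain data (all names here are illustrative, not from the papers):

```python
# Step 4: linear interpolation of the task-specific LM with the
# task-independent LM. `p_task`, `p_indep`, and `lam` are illustrative.
def interpolated_prob(word, history, p_task, p_indep, lam=0.6):
    """P(w|h) = lam * P_task(w|h) + (1 - lam) * P_indep(w|h)."""
    return lam * p_task(word, history) + (1.0 - lam) * p_indep(word, history)
```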
Introduction
• Language modeling research has concentrated in two directions:
• 1. Improving language model probability estimation
• 2. Obtaining additional training material
• The largest data set is the World Wide Web (WWW): more than 4 billion pages
Introduction
• Using web data for language modeling involves query generation and filtering the relevant text from the retrieved pages
• Web counts are certainly less sparse than the counts in a corpus of a fixed size
• Web counts are also likely to be significantly noisier than counts obtained from a carefully cleaned and normalized corpus
• Retrieval unit: whole document vs. sentence (utterance)
Build an LM for a new domain
• In practice, when we start to build a spoken dialog system (SDS) for a new domain, the amount of in-domain data for the target domain is usually small
• Definitions:
• Static resources: corpora collected for other tasks
• Dynamic resources: web data
Generating search queries
• Google is used as the search engine
• The more specific a query is, the more relevant the retrieved pages are (see the sketch below)
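As a hypothetical illustration of query generation, the sketch below builds quoted queries from the most frequent in-domain n-grams; the n-gram order and the `top_k` cut-off are assumptions, not the paper's exact recipe:

```python
from collections import Counter

def make_queries(sentences, n=3, top_k=20):
    """Turn the most frequent in-domain n-grams into quoted queries.

    Quoting each n-gram makes the query more specific, and more
    specific queries tend to retrieve more relevant pages.
    """
    grams = Counter()
    for s in sentences:
        toks = s.split()
        grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return ['"%s"' % " ".join(g) for g, _ in grams.most_common(top_k)]

# Example: make_queries(["book a flight to boston", "show me morning flights"])
```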
Similarity-based sentence selection
• Retrieved sentences are scored with machine translation's BLEU (BiLingual Evaluation Understudy) metric:

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where N is the maximum n-gram length, and $w_n$ and $p_n$ are the corresponding weight and precision, respectively. BP is the brevity penalty:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

where r and c are the lengths of the reference and candidate sentences, respectively.
• Sentences scoring above a threshold of 0.08 are kept.
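A runnable sketch of the selection step under the definitions above; the add-epsilon smoothing and the use of the best match against any in-domain sentence are assumptions, since the slide does not specify them:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """BLEU-style similarity with uniform weights w_n = 1/max_n."""
    cand, ref = candidate.split(), reference.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        p_n = max(overlap, 1e-9) / total  # epsilon avoids log(0)
        log_p += math.log(p_n) / max_n
    r, c = len(ref), len(cand)
    bp = 1.0 if c > r else math.exp(1.0 - float(r) / max(c, 1))  # brevity penalty
    return bp * math.exp(log_p)

def select(web_sentences, in_domain, threshold=0.08):
    """Keep web sentences whose best similarity against any
    in-domain sentence exceeds the 0.08 threshold from the slide."""
    return [s for s in web_sentences
            if max(bleu(s, r) for r in in_domain) > threshold]
```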
Experimental results
• SCLM: language model built from static corpora
• WWW-20 / WWW-100: predefined limit of 20 / 100 pages per sentence
E-mail corpus • Dictated and non-dictated
Filtering the corpus
• Filtering out non-dictated text is not an easy job in general
• Hand-crafted rules (e.g. regular expressions, as illustrated below) have limitations:
• They do not generalize well to situations that have not been encountered
• Rules are usually language dependent
• Developing and testing rules is very costly
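For illustration, a few hand-crafted rules of the kind the slide alludes to; the patterns are hypothetical examples, not the rule set used in the paper:

```python
import re

# Typical hand-crafted rules for dropping non-dictated e-mail lines.
# Real rule sets are large, language-dependent, and costly to maintain,
# which motivates the MaxEnt filter described next.
NON_DICTATED = [
    re.compile(r"^-{2,}\s*$"),                # signature separator "--"
    re.compile(r"^(From|To|Subject|Date):"),  # quoted mail headers
    re.compile(r"^>+"),                       # quoted reply lines
]

def is_non_dictated(line):
    return any(p.match(line) for p in NON_DICTATED)
```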
Maximum Entropy based filter
• Consider the filtering task as a labeling problem: segment the adaptation data into two categories
• Category D (Dictated text): text which should be used for LM adaptation
• Category N (Non-dictated text): text which should not be used for LM adaptation
• The text is divided into a sequence of units (such as lines) $T = t_1 \ldots t_m$, where $t_i$ is a text unit and $l_i$ is the label associated with $t_i$
Label dependency
• Assume that the labels of the text units are independent of each other given the complete sequence of text units: $P(L \mid T) = \prod_i P(l_i \mid T)$
• We further assume that the label for a given unit depends only upon the units in a surrounding window of size k: $P(l_i \mid T) = P(l_i \mid t_{i-k} \ldots t_{i+k})$
• With k = 1: $P(l_i \mid t_{i-1}, t_i, t_{i+1})$
Classification Model
• A MaxEnt model has the form:

$$P(l \mid t) = \frac{\exp\left(\boldsymbol{\lambda} \cdot \mathbf{f}(l, t)\right)}{\sum_{l'} \exp\left(\boldsymbol{\lambda} \cdot \mathbf{f}(l', t)\right)}$$

where $\boldsymbol{\lambda}$ is the vector of model parameters and $\mathbf{f}(l, t)$ is the feature vector.
Classification Model
• A unit is labeled D (and kept for adaptation) when $P(D \mid t) > P_{thresh} = 0.5$
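A minimal sketch of the resulting binary MaxEnt classifier; the feature names and weights are placeholders, not the paper's:

```python
import math

def maxent_prob(features, weights):
    """P(D | unit) for a binary MaxEnt model: exp(sum_j w_j f_j) / Z.

    `features` maps feature names to values for the current text unit
    (and, with k = 1, its neighbours); `weights` holds one lambda per
    (label, feature) pair. Both are hypothetical placeholders.
    """
    scores = {}
    for label in ("D", "N"):
        scores[label] = math.exp(sum(weights.get((label, f), 0.0) * v
                                     for f, v in features.items()))
    return scores["D"] / sum(scores.values())

def keep_for_adaptation(features, weights, p_thresh=0.5):
    """Label the unit D (dictated) when P(D | unit) > P_thresh = 0.5."""
    return maxent_prob(features, weights) > p_thresh
```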
Evaluation • Uses only features RawCompact, EOS, and OOV • No space splitting
Filtering is especially important and effective for adaptation data with a high percentage of non-dictated text (U2)
Efficient Linear Combination for Distant n-gram Models
David Langlois, Kamel Smaili, Jean-Paul Haton
EUROSPEECH 2003, pp. 409-412
Introduction
• Classical n-gram model: predicts a word from the n−1 immediately preceding words
• Distant language models: exploit information located farther back in the history
Modelization of distance in SLM
• Cache model (self-relationship)
• It deals with the relationship between a word present in the history and itself: if a word is frequent in the history, it is more likely to appear again (see the sketch below)
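A common way to realize this idea is a unigram cache interpolated with the base model; the sketch below is illustrative, with `base_prob` and `lam` as assumed inputs rather than the paper's formulation:

```python
from collections import Counter

def cache_prob(word, history, base_prob, lam=0.2):
    """Unigram cache interpolated with the base model: a word that is
    frequent in the history gets a boosted probability. `lam` is an
    illustrative interpolation weight."""
    counts = Counter(history)
    p_cache = counts[word] / float(len(history)) if history else 0.0
    return lam * p_cache + (1.0 - lam) * base_prob(word)
```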
Modelization of distance in SLM (cont.)
• Trigger model: the relationship between two words
• It deals with pairs of words v → w such that if v (the triggering word) is in the history, w (the triggered word) is more likely to appear
• In fact, the majority of triggers are self-triggers (v → v): a word triggers itself
d-n-gram model
• A d-n-gram predicts $w_i$ from n−1 words located d positions further back in the history:

$$P_d(w_i \mid w_{i-d-n+1} \ldots w_{i-d-1}) = \frac{N_d(w_{i-d-n+1} \ldots w_{i-d-1}\, w_i)}{N_d(w_{i-d-n+1} \ldots w_{i-d-1})}$$

where $N_d(\cdot)$ is the discounted count
• The 0-n-gram model is the classical n-gram model
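A short sketch of a relative-frequency d-bigram under this definition; plain counts stand in for the paper's discounted counts $N_d$, so this is a simplification:

```python
from collections import Counter

def distant_bigram_model(corpus, d):
    """Relative-frequency d-bigram: P_d(w_i | w_{i-1-d}).

    d = 0 recovers the classical bigram. Plain counts are used here
    instead of discounted counts to keep the sketch short.
    """
    pair, hist = Counter(), Counter()
    for sent in corpus:
        toks = sent.split()
        for i in range(1 + d, len(toks)):
            h, w = toks[i - 1 - d], toks[i]
            pair[(h, w)] += 1
            hist[h] += 1
    return lambda w, h: pair[(h, w)] / float(hist[h]) if hist[h] else 0.0
```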
Evaluation
• Vocabulary: 20k words
• Training set: 38M words
• Development set: 2M words
• Test set: 2M words
• Baseline: classical n-gram models
Integration of distant n-gram models
• A distant n-gram model cannot be used alone, since it takes into account only a part of the history: perplexity is 717 for n = 2 and d = 4
• Therefore, several models with distances up to d are combined with the baseline model
• Improvements: 7.1% (distant bigrams) and 3.1% (distant trigrams)
• The utility of distant n-gram models decreases with the distance: a distance greater than 2 does not provide more information
• Distant trigrams lead to an improvement, but it is less important than for distant bigrams, possibly because of the overlap between the histories of the d-trigram and the (d+1)-trigram
Backoff smoothing
• Smoothed models: b_u_z (bigram backing off to unigram and zerogram), db_u_z (distant bigram with the same backoff chain), and their combination (b_u_z)·(db_u_z)
• Improvements: 7.9% and 11.6%
Combination weights
• A unique (single) weight per model, or
• Weights that depend on each history (on the class of each sub-history)
Combination of distant n-grams
• In order to combine K models $M_1, \ldots, M_K$, a set of weights $\lambda_1, \ldots, \lambda_K$ is defined and the combination is expressed by:

$$P(w \mid h) = \sum_{k=1}^{K} \lambda_k P_{M_k}(w \mid h)$$

• The development corpus is not sufficient to estimate a huge number of parameters
• Solution: classify the histories and assign a weight to each class
Classification
• Break the history into several parts (sub-histories). Each sub-history is analyzed in order to estimate its importance for prediction, and is then put into a class
• Such a class is directly linked to the value of the sub-history frequency: it gathers all sub-histories that have approximately the same frequency (see the sketch below)
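A sketch of the per-class combination, assuming each model exposes hypothetical `subhistory` and `prob` methods and that the per-class weights were estimated on the development corpus (the class boundaries below are illustrative, not the paper's 4000/8000-class scheme):

```python
def history_class(subhistory_count, boundaries=(1, 5, 20, 100)):
    """Bucket a sub-history by frequency; sub-histories with similar
    training counts share a class and hence a combination weight."""
    for cls, b in enumerate(boundaries):
        if subhistory_count < b:
            return cls
    return len(boundaries)

def combined_prob(word, history, models, class_weights, counts):
    """P(w|h) = sum_k lambda_k(class(h_k)) P_k(w|h_k): each model sees
    its own sub-history, and its weight depends on that sub-history's
    frequency class. `counts` maps sub-histories to training counts;
    weights per class are assumed normalised."""
    total = 0.0
    for k, model in enumerate(models):
        sub = model.subhistory(history)        # model-specific part of h
        cls = history_class(counts.get(sub, 0))
        total += class_weights[k][cls] * model.prob(word, sub)
    return total
```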
Results
• 8000 classes: perplexity 115.4, a 12.8% improvement over the baseline (132.4) and a 5.3% improvement over the single-weight combination (121.9)
• 4000 classes: perplexity 85.2, a 12.8% improvement over the baseline (97.8) and a 1.5% improvement over the single-weight combination (86.5)