Chapter 5. Probabilistic Models of Pronunciation and Spelling
May 4, 2007
Artificial Intelligence Lab, Pusan National University
Kim Min-ho
Text: Speech and Language Processing, pages 141-189
Outline • Introduction • 5.1 Dealing with Spelling Errors • 5.2 Spelling Error Patterns • 5.3 Detecting Non-Word Errors • 5.4 Probabilistic Models • 5.5 Applying the Bayesian Method to Spelling • 5.6 Minimum Edit Distance • 5.7 English Pronunciation Variation • 5.8 The Bayesian Method for Pronunciation • 5.9 Weighted Automata • 5.10 Pronunciation in Humans
Introduction • Introduce the problems of detecting and correcting spelling errors • Summarize typical human spelling error patterns • The essential probabilistic architecture: • Bayes' rule • Noisy channel model • The essential algorithms: • Dynamic programming • Viterbi algorithm • Minimum edit distance algorithm • Forward algorithm • Weighted automaton 3 / 40
5.1 Dealing with Spelling Errors (1/2) • The detection and correction of spelling errors • an integral part of modern word-processors • Applications in which even the individual letters aren't guaranteed to be accurately identified • Optical character recognition (OCR) • On-line handwriting recognition • Detection and correction of spelling errors, mainly in typed text • OCR systems often • misread "D" as "O" or "ri" as "n" • producing 'mis-spelled' words like dension for derision 4 / 40
5.1 Dealing with Spelling Errors (2/2) • Kukich (1992) breaks the field down into three increasingly broad problems: • non-word error detection (graffe for giraffe) • isolated-word error correction (correcting graffe to giraffe) • context-dependent error detection and correction • there for three, dessert for desert, piece for peace 5 / 40
5.2 Spelling Error Patterns (1/2) • Single-error misspellings - Damerau (1964) • insertion: mistyping the as ther • deletion: mistyping the as th • substitution: mistyping the as thw • transposition: mistyping the as hte • Kukich (1992) breaks down human typing errors into • Typographic errors (spell as speel) • Cognitive errors (separate as seperate) 6 / 40
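These four single-error classes are easy to enumerate directly. The following minimal Python sketch (not from the chapter; the function name and lowercase-only alphabet are assumptions for illustration) generates every string exactly one insertion, deletion, substitution, or transposition away from a typed word:

```python
import string

def single_edit_variants(word):
    """Enumerate all strings one Damerau-style edit away from `word`:
    insertion, deletion, substitution, or adjacent transposition."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = {left + c + right for left, right in splits for c in letters}
    deletes = {left + right[1:] for left, right in splits if right}
    substitutions = {left + c + right[1:]
                     for left, right in splits if right for c in letters}
    transpositions = {left + right[1] + right[0] + right[2:]
                      for left, right in splits if len(right) > 1}
    return inserts | deletes | substitutions | transpositions

# e.g. "hte" (a transposition of "the") is among the single-edit variants
print("hte" in single_edit_variants("the"))  # True
```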
5.2 Spelling Error Patterns (2/2) • OCR errors are usually grouped into five classes • substitutions (e → c) • multi-substitutions (m → rn, he → b) • space deletions or insertions • failures (u → ~) • framing errors 7 / 40
5.3 Detecting Non-word Errors • Detecting non-word errors in text is done by the use of a dictionary • dictionaries would need to be kept small, because • large dictionaries contain very rare words that resemble misspellings of other words 8 / 40
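A minimal sketch of the idea; the toy word list, tokenizer, and function name below are illustrative assumptions, not part of the text:

```python
import re

def find_nonword_errors(text, dictionary):
    """Flag tokens that do not appear in the dictionary (possible non-word errors)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in dictionary]

dictionary = {"the", "giraffe", "ate", "leaves"}                   # toy word list
print(find_nonword_errors("The graffe ate leaves", dictionary))    # ['graffe']
```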
5.4 Probabilistic Models (1/3) • The intuition of the noisy channel model is to treat the surface form as an instance of the lexical form that has been passed through a noisy channel • we build a model of the channel so that we can figure out how it modified this "true" word and recover it • sources of noise • variation in pronunciation, variation in the realization of phones, acoustic variation due to the channel 9 / 40
5.4 Probabilistic Models (2/3) • given a string of phones (say [ni]) • which word corresponds to this string of phones? • consider all possible words and choose the one for which P(word | observation) is highest • (5.1) \hat{w} = \mathrm{argmax}_{w \in V} P(w \mid O) • \hat{w} : our estimate of the correct w • O : the observation sequence [ni] • the function \mathrm{argmax}_x f(x) : the x such that f(x) is maximized 10 / 40
5.4 Probabilistic Models (3/3) • Bayes' rule: (5.2) P(w \mid O) = \frac{P(O \mid w)\, P(w)}{P(O)} • substituting (5.2) into (5.1) gives (5.3) \hat{w} = \mathrm{argmax}_{w \in V} \frac{P(O \mid w)\, P(w)}{P(O)} • we can ignore P(O): it is the same for every candidate word w, so it does not change which w wins the argmax • (5.4) \hat{w} = \mathrm{argmax}_{w \in V} P(O \mid w)\, P(w) • P(w) is called the prior probability • P(O|w) is called the likelihood 11 / 40
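A minimal sketch of equation (5.4) as code; the candidate words and all probability values below are made-up illustrative numbers, not estimates from the chapter:

```python
def noisy_channel_best_word(candidates):
    """Pick the w maximizing P(O|w) * P(w), as in equation (5.4)."""
    return max(candidates,
               key=lambda w: candidates[w]["likelihood"] * candidates[w]["prior"])

# hypothetical candidates for an observation such as [ni]
candidates = {
    "the":  {"prior": 0.046,   "likelihood": 0.0001},
    "knee": {"prior": 0.00002, "likelihood": 0.95},
    "new":  {"prior": 0.001,   "likelihood": 0.36},
}
print(noisy_channel_best_word(candidates))
```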
5.5 Applying the Bayesian Method to Spelling (3/5) • P(acress|across) → the number of times that e was substituted for o in some large corpus of errors • confusion matrix • a square 26 × 26 table • number of times one letter was incorrectly used instead of another • [o,e] in a substitution confusion matrix • count of times e was substituted for o 14 / 40
5.5 Applying the Bayesian Method to Spelling (4/5) • del[x,y] contains the number of times in the training set that the characters xy in the correct word were typed as x • ins[x,y] contains the number of times in the training set that the character x in the correct word was typed as xy • sub[x,y] contains the number of times that x was typed as y • trans[x,y] contains the number of times that xy was typed as yx 15 / 40
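These four tables are typically turned into the likelihood P(t|c) of a typo t given a candidate correction c by normalizing each count by how often the relevant character context occurs in the corpus. A sketch in the style of Kernighan et al. (1990), with p the position of the error; take the exact normalizing counts as an assumption here rather than the chapter's definitive formula:

$$
P(t \mid c) =
\begin{cases}
\dfrac{\mathrm{del}[c_{p-1}, c_p]}{\mathrm{count}[c_{p-1} c_p]} & \text{if deletion} \\[2ex]
\dfrac{\mathrm{ins}[c_{p-1}, t_p]}{\mathrm{count}[c_{p-1}]} & \text{if insertion} \\[2ex]
\dfrac{\mathrm{sub}[t_p, c_p]}{\mathrm{count}[c_p]} & \text{if substitution} \\[2ex]
\dfrac{\mathrm{trans}[c_p, c_{p+1}]}{\mathrm{count}[c_p c_{p+1}]} & \text{if transposition}
\end{cases}
$$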
5.6 Minimum Edit Distance (1/6) • string distance - some metric of how alike two strings are to each other • minimum edit distance - the minimum number of editing operations needed to transform one string into another • operations - insertion, deletion, substitution • For example • the gap between intention and execution is five operations • trace, alignment, operation list (Figure 5.4) 17 / 40
5.6 Minimum Edit Distance (2/6) 18 / 40
5.6 Minimum Edit Distance (3/6) • Levenshtein distance • assign a particular cost or weight to each of the operations • simplest weighting factor • each of the three operations has a cost of 1 • Levenshtein distance between intention and execution is 5 • alternate version - substitutions have a cost of 2 (why? a substitution can be seen as one deletion plus one insertion) • The minimum edit distance is computed by dynamic programming 19 / 40
5.6 Minimum Edit Distance (4/6) • Dynamic programming • a large problem can be solved by properly combining the solutions to various subproblems • minimum edit distance for spelling error correction • Viterbi and the forward algorithm for speech recognition • CYK and Earley for parsing 20 / 40
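A minimal dynamic-programming sketch of the computation with unit costs (function and variable names are illustrative):

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Fill a (len(source)+1) x (len(target)+1) table of edit distances bottom-up."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost            # delete everything
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost            # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost,             # delete source[i-1]
                d[i][j - 1] + ins_cost,             # insert target[j-1]
                d[i - 1][j - 1]
                + (0 if source[i - 1] == target[j - 1] else sub_cost),
            )
    return d[n][m]

print(min_edit_distance("intention", "execution"))  # 5 with unit costs
```

With sub_cost=2 the same call returns 8, matching the alternate weighting on the previous slide.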
5.6 Minimum Edit Distance (6/6) 22 / 40
5.8 The Bayesian Method for Pronunciation (1/6) • the Bayesian algorithm can be used to solve what is often called the pronunciation subproblem in speech recognition • when [ni] occurs after the word I at the beginning of a sentence • investigation of the Switchboard corpus produces a total of 7 words • the, neat, need, new, knee, to, you (see Chapter 4) • two components • candidate generation • candidate scoring 23 / 40
5.8 The Bayesian Method for Pronunciation (2/6) • Speech recognizers often use an alternative architecture, trading storage for speed • each pronunciation is expanded in advance with all possible variants, which are then pre-stored with their scores • Thus there is no need for candidate generation • the pronunciation [ni] is simply stored with the list of words that can generate it 24 / 40
5.8 The Bayesian Method for Pronunciation (3/6) • writing y for the observed sequence of phones and w for a candidate word, we again pick the w that maximizes P(y|w) P(w) • it turns out that confusion matrices don't do as well for pronunciation • the changes in pronunciation between a lexical and surface form are much greater • probabilistic models of pronunciation variation include a lot more factors than a simple confusion matrix can include • One simple way to generate pronunciation likelihoods is via probabilistic rules 25 / 40
5.8 The Bayesian Method for Pronunciation (4/6) • a word-initial [ð] becomes [n] if the preceding word ended in [n] or sometimes [m] • ncount : number of times lexical [ð] is realized word-initially by surface [n] when the previous word ends in a nasal • envcount : total number of times lexical [ð] occurs word-initially when the previous word ends in a nasal 26 / 40
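Written out, the rule probability is simply the relative frequency of the two counts above:

$$
P(\text{[ð]} \rightarrow \text{[n]} \mid \text{previous word ends in a nasal}) = \frac{ncount}{envcount}
$$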
5.8 The Bayesian Method for Pronunciation (6/6) • Decision Tree Models of Pronunciation Variation 28 / 40
5.9 Weighted Automata (1/12) • Weighted Automata • a simple augmentation of the finite automaton • each arc is associated with a probability • the probabilities on all the arcs leaving a node must sum to 1 29 / 40
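A minimal sketch of a weighted automaton as a transition table, with a check that arc probabilities leaving each state sum to 1 and a helper that scores one path; the states, symbols, and numbers are illustrative assumptions, not the chapter's example:

```python
import math

# arcs[state] -> list of (symbol, next_state, probability);
# probabilities on all arcs leaving a state must sum to 1
arcs = {
    "start": [("ax", "mid", 0.68), ("ix", "mid", 0.32)],   # illustrative numbers
    "mid":   [("b", "end", 1.0)],
    "end":   [],
}

def check_probabilities(arcs):
    """Verify that the outgoing probabilities at every non-final state sum to 1."""
    for state, out in arcs.items():
        if out and not math.isclose(sum(p for _, _, p in out), 1.0):
            raise ValueError(f"probabilities leaving {state} do not sum to 1")

def path_probability(arcs, state, symbols):
    """Multiply arc probabilities along the path that spells out `symbols`."""
    prob = 1.0
    for sym in symbols:
        nexts = {s: (dst, p) for s, dst, p in arcs[state]}
        if sym not in nexts:
            return 0.0
        state, p = nexts[sym]
        prob *= p
    return prob

check_probabilities(arcs)
print(path_probability(arcs, "start", ["ax", "b"]))  # 0.68
```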
5.9 Weighted Automata (2/12) 30 / 40
5.9 Weighted Automata (3/12) 31 / 40
5.9 Weighted Automata (4/12) 32 / 40
5.9 Weighted Automata (5/12) 33 / 40
5.9 Weighted Automata (6/12) 34 / 40
5.9 Weighted Automata (7/12) 35 / 40
5.9 Weighted Automata (8/12) 36 / 40
5.9 Weighted Automata (9/12) 37 / 40
5.9 Weighted Automata (10/12) 38 / 40
5.9 Weighted Automata (11/12) 39 / 40
5.9 Weighted Automata (12/12) 40 / 40