
CS60057 Speech & Natural Language Processing

CS60057 Speech & Natural Language Processing. Autumn 2007. Lecture 10, 16 August 2007. Hidden Markov Model (HMM) Tagging. Using an HMM to do POS tagging. HMM is a special case of Bayesian inference. It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition).





Presentation Transcript


  1. CS60057 Speech & Natural Language Processing. Autumn 2007. Lecture 10, 16 August 2007

  2. Hidden Markov Model (HMM) Tagging
  • Using an HMM to do POS tagging
  • An HMM is a special case of Bayesian inference
  • It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition)

  3. Hidden Markov Model (HMM) Taggers
  • Goal: maximize P(word|tag) x P(tag|previous n tags)
  • P(word|tag): the word/lexical likelihood
  • the probability that, given this tag, we have this word
  • NOT the probability that this word has this tag
  • modeled through the word-tag matrix (lexical information)
  • P(tag|previous n tags): the tag sequence likelihood
  • the probability that this tag follows these previous tags
  • modeled through the tag-tag matrix (syntagmatic information)

  4. POS tagging as a sequence classification task
  • We are given a sentence (an “observation” or “sequence of observations”)
  • Secretariat is expected to race tomorrow
  • a sequence of n words w1…wn
  • What is the best sequence of tags corresponding to this sequence of observations?
  • Probabilistic/Bayesian view:
  • Consider all possible sequences of tags
  • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn

  5. Getting to HMM
  • Let T = t1, t2, …, tn
  • Let W = w1, w2, …, wn
  • Goal: out of all sequences of tags t1…tn, find the most probable sequence of POS tags T underlying the observed sequence of words w1, w2, …, wn
  • The hat ^ means “our estimate of the best, i.e. the most probable, tag sequence”
  • argmax_x f(x) means “the x such that f(x) is maximized”; it maximizes our estimate of the best tag sequence

  6. Getting to HMM
  • This equation is guaranteed to give us the best tag sequence
  • But how do we make it operational? How do we compute this value?
  • Intuition of Bayesian classification:
  • Use Bayes rule to transform it into a set of other probabilities that are easier to compute
  • Thomas Bayes: British mathematician (1702-1761)

  7. Bayes Rule
  • Bayes rule breaks down any conditional probability P(x|y) into three other probabilities:
  • P(x|y) = P(y|x) P(x) / P(y)
  • P(x|y): the conditional probability of an event x given that y has occurred

  8. Bayes Rule
  • We can drop the denominator P(W): it does not change across tag sequences, since we are looking for the best tag sequence for the same observation, i.e. the same fixed set of words
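
Written out, the derivation on this and the surrounding slides is (a reconstruction of the equation images, with T the tag sequence and W the word sequence):

```latex
\hat{T} = \operatorname*{argmax}_{T} P(T \mid W)
        = \operatorname*{argmax}_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
        = \operatorname*{argmax}_{T} P(W \mid T)\, P(T)
```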

  9. Bayes Rule

  10. Likelihood and prior
  • P(W|T) is the likelihood; P(T) is the prior

  11. Likelihood and prior
  • Further simplifications:
  • 1. The probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it
  • 2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag
  • 3. The most probable tag sequence can then be estimated by the bigram tagger
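
In symbols, the three simplifications amount to (a reconstruction of the slide's equations, consistent with the bigram tagger described here):

```latex
P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i), \qquad
P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}), \qquad
\hat{T} = \operatorname*{argmax}_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```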

  12. Likelihood and prior
  [figure: the words “the koala put the keys on the table” aligned against the tags N, V, P, DET]
  • Further simplification 1: the probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it

  13. Likelihood and prior
  • Further simplification 2: BIGRAM assumption: the probability of a tag appearing depends only on the previous tag
  • Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-grams
  • Bigrams are used as the basis for simple statistical analysis of text
  • The bigram assumption is related to the first-order Markov assumption

  14. Likelihood and prior
  • Further simplification 3: the most probable tag sequence as estimated by the bigram tagger, under the bigram assumption

  15. Two kinds of probabilities (1)
  • Tag transition probabilities p(ti|ti-1)
  • Determiners are likely to precede adjectives and nouns
  • That/DT flight/NN
  • The/DT yellow/JJ hat/NN
  • So we expect P(NN|DT) and P(JJ|DT) to be high
  • But we expect P(DT|JJ) to be low

  16. Two kinds of probabilities (1)
  • Tag transition probabilities p(ti|ti-1)
  • Compute P(NN|DT) by counting in a labeled corpus: P(NN|DT) = C(DT, NN) / C(DT), the number of times DT is followed by NN divided by the number of times DT occurs
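
Both kinds of counts can be read off a labeled corpus directly. A minimal sketch, using a made-up toy tagged corpus (the sentences and counts are illustrative, not from any real corpus):

```python
from collections import Counter

# Toy hand-tagged corpus (hypothetical; real taggers train on e.g. the Brown corpus).
tagged = [("the", "DT"), ("flight", "NN"), ("the", "DT"), ("yellow", "JJ"),
          ("hat", "NN"), ("that", "DT"), ("flight", "NN")]

tags = [t for _, t in tagged]
bigram_counts = Counter(zip(tags, tags[1:]))  # C(t1, t2)
tag_counts = Counter(tags)                    # C(t)
word_tag_counts = Counter(tagged)             # C(w, t)

def p_trans(t2, t1):
    # P(t2 | t1) = C(t1, t2) / C(t1): how often tag t1 is followed by tag t2
    return bigram_counts[(t1, t2)] / tag_counts[t1]

def p_emit(w, t):
    # P(w | t) = C(w, t) / C(t): how often tag t is realized as word w
    return word_tag_counts[(w, t)] / tag_counts[t]

print(p_trans("NN", "DT"))    # DT is followed by NN in 2 of its 3 occurrences
print(p_emit("flight", "NN"))
```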

  17. Two kinds of probabilities (2)
  • Word likelihood probabilities p(wi|ti)
  • P(is|VBZ) = the probability of a VBZ (3sg present verb) being “is”
  • Compute P(is|VBZ) by counting in a labeled corpus: P(is|VBZ) = C(VBZ, is) / C(VBZ)
  • If we were expecting a third person singular verb, how likely is it that this verb would be is?

  18. An Example: the verb “race”
  • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
  • People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
  • How do we pick the right tag?

  19. Disambiguating “race”

  20. Disambiguating “race”
  • P(NN|TO) = .00047
  • P(VB|TO) = .83
  • The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: “How likely are we to expect a verb/noun given the previous tag TO?”
  • P(race|NN) = .00057
  • P(race|VB) = .00012
  • Lexical likelihoods from the Brown corpus for “race” given the POS tag NN or VB
  • P(NR|VB) = .0027
  • P(NR|NN) = .0012
  • Tag sequence probabilities for the likelihood of an adverb occurring given the previous tag verb or noun
  • P(VB|TO) P(NR|VB) P(race|VB) = .00000027
  • P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
  • Multiply the lexical likelihoods with the tag sequence probabilities: the verb wins
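
The slide's arithmetic can be checked directly (the probabilities are those quoted above from the Brown corpus):

```python
# Score both readings of "race" after "to", multiplying the tag transition,
# following-tag, and lexical likelihood probabilities from the slide.
p_verb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_noun = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(p_verb)  # ≈ 2.7e-07
print(p_noun)  # ≈ 3.2e-10: the verb reading wins by about three orders of magnitude
```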

  21. Hidden Markov Models
  • What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)
  • Let’s just spend a bit of time tying this into the model
  • In order to define HMMs, we will first introduce the Markov Chain, or observable Markov Model

  22. Definitions
  • A weighted finite-state automaton adds probabilities to the arcs
  • The probabilities leaving any state must sum to one
  • A Markov chain is a special case of a weighted FSA in which the input sequence uniquely determines which states the automaton will go through
  • Markov chains can’t represent inherently ambiguous problems
  • They are useful for assigning probabilities to unambiguous sequences

  23. Markov chain = “First-order observable Markov Model”
  • a set of states Q = q1, q2, …, qN; the state at time t is qt
  • a set of transition probabilities A = a01, a02, …, an1, …, ann
  • each aij represents the probability of transitioning from state i to state j
  • the set of these is the transition probability matrix A
  • distinguished start and end states
  • a special initial probability vector π: each πi expresses the probability p(qi|START) that the Markov model will start in state i

  24. Markov chain = “First-order observable Markov Model”
  Markov chain for weather: Example 1
  • three types of weather: sunny, rainy, foggy
  • we want to find the following conditional probabilities: P(qn|qn-1, qn-2, …, q1), i.e., the probability of the unknown weather on day n, given the (known) weather of the preceding days
  • we could infer this probability from the relative frequency (the statistics) of past observations of weather sequences
  • Problem: the larger n is, the more observations we must collect. Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories

  25. Markov chain = “First-order observable Markov Model”
  • Therefore, we make a simplifying assumption, called the (first-order) Markov assumption: for a sequence of observations q1, …, qn, the current state depends only on the previous state, P(qn|qn-1, …, q1) = P(qn|qn-1)
  • the joint probability of certain past and current observations then factorizes as P(q1, …, qn) = P(q1) P(q2|q1) … P(qn|qn-1)

  26. Markov chain = “First-order observable Markov Model”

  27. Markov chain = “First-order observable Markov Model”
  • Given that today the weather is sunny, what’s the probability that tomorrow is sunny and the day after is rainy?
  • Using the Markov assumption and the probabilities in table 1, this translates into: P(q2 = sunny, q3 = rainy | q1 = sunny) = P(q2 = sunny | q1 = sunny) × P(q3 = rainy | q2 = sunny)

  28. Markov chain for weather
  • What is the probability of 4 consecutive rainy days?
  • The sequence is rainy-rainy-rainy-rainy, i.e., the state sequence is 3-3-3-3
  • P(3, 3, 3, 3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
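
The same computation in code (state 3 = rainy, with π3 = 0.2 and a33 = 0.6 as on the slide):

```python
# P(rainy, rainy, rainy, rainy) = pi_3 * a_33 * a_33 * a_33:
# one initial probability, then three rainy-to-rainy transitions.
pi_rainy = 0.2        # probability of starting in the rainy state
a_rainy_rainy = 0.6   # probability of rainy following rainy
p = pi_rainy * a_rainy_rainy ** 3
print(p)  # ≈ 0.0432
```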

  29. Hidden Markov Model
  • For Markov chains, the output symbols are the same as the states
  • See sunny weather: we’re in state sunny
  • But in part-of-speech tagging (and other things):
  • the output symbols are words
  • but the hidden states are part-of-speech tags
  • So we need an extension!
  • A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states
  • This means we don’t know which state we are in

  30. Markov chain for words
  • Observed events: words
  • Hidden events: tags

  31. Hidden Markov Models
  • States Q = q1, q2, …, qN
  • Observations O = o1, o2, …, oN
  • Each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
  • Transition probabilities (prior): transition probability matrix A = {aij}
  • Observation likelihoods (likelihood): output probability matrix B = {bi(ot)}, a set of observation likelihoods (emission probabilities), each expressing the probability of an observation ot being generated from a state i
  • Special initial probability vector π: each πi expresses the probability p(qi|START) that the HMM will start in state i

  32. Assumptions
  • Markov assumption: the probability of a particular state depends only on the previous state
  • Output-independence assumption: the probability of an output observation depends only on the state that produced that observation

  33. HMM for Ice Cream
  • You are a climatologist in the year 2799, studying global warming
  • You can’t find any records of the weather in Boston, MA for the summer of 2007
  • But you find Jason Eisner’s diary, which lists how many ice-creams Jason ate every day that summer
  • Our job: figure out how hot it was

  34. The task
  • Given an ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, … (cp. with output symbols)
  • Produce a weather sequence: C, C, H, C, C, C, H, … (cp. with hidden, causing states)

  35. HMM for ice cream

  36. Different types of HMM structure
  • Ergodic = fully-connected
  • Bakis = left-to-right

  37. HMM Taggers
  • Two kinds of probabilities:
  • A: transition probabilities (PRIOR)
  • B: observation likelihoods (LIKELIHOOD)
  • HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability

  38. Weighted FSM corresponding to hidden states of HMM, showing A probs

  39. B observation likelihoods for POS HMM

  40. The A matrix for the POS HMM

  41. The B matrix for the POS HMM

  42. HMM Taggers
  • The probabilities are trained on hand-labeled training corpora (the training set)
  • They can combine different N-gram levels
  • Taggers are evaluated by comparing their output on a test set to human labels for that test set (the Gold Standard)

  43. The Viterbi Algorithm
  • What is the best tag sequence for “John likes to fish in the sea”?
  • Viterbi efficiently computes the most likely state sequence given a particular output sequence
  • It is based on dynamic programming

  44. A smaller example
  [figure: a weighted automaton with states start, q, r, end, and arcs over the symbols a and b with probabilities 0.2, 0.8, 0.4, 0.6, 0.7, 0.5, 0.3, 0.5]
  • What is the best sequence of states for the input string “bbba”?
  • Computing all possible paths and finding the one with the max probability is exponential

  45. A smaller example (con’t)
  • For each state, store the most likely sequence that could lead to it (and its probability)
  • Path probability matrix: an array of states versus time (tags versus words) that stores the probability of being at each state at each time, in terms of the probabilities of being in each state at the preceding time

  46. Viterbi intuition: we are looking for the best ‘path’ [trellis figure over states S1…S5] (Slide from Dekang Lin)

  47. The Viterbi Algorithm

  48. Intuition
  • The value in each cell is computed by taking the MAX over all paths that lead to this cell
  • An extension of a path from state i at time t-1 is computed by multiplying:
  • the previous path probability from the previous cell, viterbi[t-1, i]
  • the transition probability aij from previous state i to current state j
  • the observation likelihood bj(ot) that current state j matches observation symbol ot
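
This MAX recurrence is the heart of Viterbi. A compact sketch, run on a toy hot/cold ice-cream-style HMM whose parameters (pi, A, B) are assumed for illustration, not taken from the slides:

```python
def viterbi(obs, states, pi, A, B):
    # v[t][s]: probability of the best path ending in state s after t+1 observations
    v = [{s: pi[s] * B[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in states:
            # MAX over all states i that could precede j at time t
            best_i = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            v[t][j] = v[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
            back[t][j] = best_i
    # follow the backpointers from the best final state
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Assumed toy parameters: hidden weather states H(ot)/C(old), observed ice-cream counts.
states = ("H", "C")
pi = {"H": 0.8, "C": 0.2}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
print(viterbi([3, 1, 3], states, pi, A, B))
```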

  49. Viterbi example

  50. Smoothing of probabilities
  • Data sparseness is a problem when estimating probabilities based on corpus data
  • The “add one” smoothing technique: P = (C + 1) / (N + B), where C is the absolute frequency, N is the number of training instances, and B is the number of different types
  • Linear interpolation methods can compensate for data sparseness with higher-order models. A common method is interpolating trigrams, bigrams and unigrams: P(t3|t1,t2) = λ1 P(t3) + λ2 P(t3|t2) + λ3 P(t3|t1,t2)
  • The lambda values are automatically determined using a variant of the Expectation Maximization algorithm
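
A minimal sketch of the add-one (Laplace) estimate described above; the token list is a made-up example:

```python
from collections import Counter

# Add-one smoothing: P(w) = (C(w) + 1) / (N + B), where N is the number of
# training instances and B the number of distinct types seen in training.
def add_one_probs(tokens):
    counts = Counter(tokens)
    n, b = len(tokens), len(counts)
    return lambda w: (counts[w] + 1) / (n + b)

p = add_one_probs(["the", "cat", "the", "dog"])
print(p("the"))   # (2+1)/(4+3) = 3/7
print(p("fish"))  # an unseen word still gets mass: (0+1)/(4+3) = 1/7
```

The key property, visible in the second call, is that unseen events no longer receive zero probability.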
