
COMP790: Statistical NLP


alda

Presentation Transcript


  1. COMP790: Statistical NLP POS Tagging Chap. 10

  2. POS tagging • Goal: assign the right part of speech (noun, verb, …) to words in a text “The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN.” • Terminology • POS, part-of-speech tag • word class • morphological class • lexical tag • grammatical tag

  3. Why do POS Tagging? • Purpose: • 1st step to NLU • easier than full NLU (results > 95% accuracy) • Useful for: • speech recognition / synthesis (better accuracy) • how to recognize/pronounce a word • CONtent/noun VS conTENT/adj • stemming in IR • which morphological affixes the word can take • adjective - ly = noun (friendly - ly = friend) • for IR and QA • pick out nouns, which may be more important than other words in indexing documents or query analysis • partial parsing/chunking (for IE) • to find noun phrases/verb phrases

  4. Tag sets • Different tag sets exist, depending on the purpose of the application • 45 tags in Penn Treebank • 62 tags in CLAWS with BNC corpus • 79 tags in Church (1991) • 87 tags in Brown corpus • 147 tags in C7 tagset • 258 tags in Tzoukermann and Radev (1995)

  5. Tag set: Penn TreeBank • 45 tags

  6. Most word types are not ambiguous but... • but most word types are rare… • Brown corpus (Francis & Kucera, 1982): • 11.5% of word types are ambiguous (>1 tag) • 40% of word tokens are ambiguous (>1 tag)
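The gap between type ambiguity and token ambiguity can be illustrated with a minimal sketch. The toy corpus below is hypothetical (it is not Brown corpus data), and the function name `ambiguity_rates` is an assumption made for illustration.

```python
from collections import defaultdict

# Toy tagged corpus (hypothetical data, not taken from the Brown corpus).
tagged = [
    ("the", "AT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN"),
    ("they", "PN"), ("race", "VB"), ("the", "AT"), ("race", "NN"),
]

def ambiguity_rates(tagged_tokens):
    """Return (share of ambiguous word types, share of ambiguous word tokens)."""
    tags_seen = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_seen[word].add(tag)
    # A word type is ambiguous when it was seen with more than one tag.
    ambiguous = {w for w, tags in tags_seen.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_seen)
    token_rate = sum(w in ambiguous for w, _ in tagged_tokens) / len(tagged_tokens)
    return type_rate, token_rate

type_rate, token_rate = ambiguity_rates(tagged)
print(type_rate, token_rate)
```

In this toy corpus only one type out of five is ambiguous, but it accounts for three of the eight tokens, which mirrors the pattern on the slide: ambiguous words tend to be the frequent ones.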

  7. Techniques for POS tagging • rule-based tagging • uses hand-written rules • stochastic tagging • uses probabilities computed from training corpus • transformation-based tagging • uses rules learned automatically

  8. Information sources for tagging All techniques are based on the same observations… • Syntagmatic information: • some tag sequences are more probable than others • ART+ADJ+N is more probable than ART+ADJ+VB • Lexical information: • knowing the word to be tagged gives a lot of information about the correct tag • “table”: {noun, verb} but not {adj, prep, …} • “rose”: {noun, adj, verb} but not {prep, ...}

  9. Naïve POS tagging • using only syntagmatic patterns: • Greene & Rubin (1971) • accuracy of 77% • using the most-likely tag for each word: • Charniak et al. (1993) • accuracy of 90% • much better, but not very good... • 1 mistake every 10 words • used as baseline for evaluation
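The most-likely-tag baseline can be sketched in a few lines. This is a minimal illustration on a hypothetical toy corpus; the function names and the fallback tag `NN` for unknown words are assumptions, not part of the slides.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_tokens):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_unigram(words, model, default="NN"):
    """Tag each word independently; unknown words get `default` (an assumption)."""
    return [(w, model.get(w, default)) for w in words]

# Hypothetical training data: "race" is seen twice as NN and once as VB.
train = [("the", "AT"), ("race", "NN"), ("race", "NN"), ("to", "TO"), ("race", "VB")]
model = train_unigram_tagger(train)
tagged = tag_unigram(["to", "race"], model)
print(tagged)
```

Because context is ignored, "race" after "to" is still tagged NN here, which is exactly the kind of error the context-aware taggers on the following slides try to fix.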

  10. Techniques for POS tagging • --> rule-based tagging • uses hand-written rules • stochastic tagging • uses probabilities computed from training corpus • transformation-based tagging • uses rules learned automatically

  11. Rule-based POS tagging • Step 1: Assign to each word all of its possible tags • use a dictionary • Step 2: Use if-then rules to identify the correct tag in context (disambiguation rules)

  12. Sample rules N-IP rule: A tag N (noun) cannot be followed by a tag IP (interrogative pronoun) ... man who … • man: {N} • who: {RP, IP} --> {RP} relative pronoun ART-V rule: A tag ART (article) cannot be followed by a tag V (verb) ...the book… • the: {ART} • book: {N, V} --> {N}
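The two sample rules can be read as forbidden tag pairs that prune candidate tag sets. The sketch below is one possible way to apply them, assuming each word already carries its dictionary tags; the names `FORBIDDEN` and `disambiguate` are hypothetical.

```python
# Forbidden (previous tag, current tag) pairs taken from the two sample rules.
FORBIDDEN = {("N", "IP"), ("ART", "V")}

def disambiguate(candidates):
    """candidates: list of (word, set of possible tags).

    A tag is dropped when it is forbidden after every tag that is still
    possible for the previous word. If a rule would remove all tags,
    the original set is kept unchanged.
    """
    result = [(w, set(tags)) for w, tags in candidates]
    for i in range(1, len(result)):
        prev_tags = result[i - 1][1]
        word, tags = result[i]
        kept = {t for t in tags if any((p, t) not in FORBIDDEN for p in prev_tags)}
        if kept:
            result[i] = (word, kept)
    return result

print(disambiguate([("man", {"N"}), ("who", {"RP", "IP"})]))
print(disambiguate([("the", {"ART"}), ("book", {"N", "V"})]))
```

Both calls reproduce the outcomes on the slide: "who" keeps only RP and "book" keeps only N.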

  13. Techniques for POS tagging • rule-based tagging • uses hand-written rules • --> stochastic tagging • uses probabilities computed from training corpus • transformation-based tagging • uses rules learned automatically

  14. Stochastic POS tagging • Assume that a word’s tag only depends on the previous tags (not following ones) • Use a training set (manually tagged corpus) to: • learn the regularities of tag sequences • learn the possible tags for a word • model this info through a language model (n-gram)

  15. Hidden Markov Model (HMM) Taggers • Goal: maximize P(word|tag) x P(tag|previous n tags) • P(word|tag) (lexical information) • word/lexical likelihood • probability that, given this tag, we have this word • NOT the probability that this word has this tag • modeled through language model (word-tag matrix) • P(tag|previous n tags) (syntagmatic information) • tag sequence likelihood • probability that this tag follows these previous tags • modeled through language model (tag-tag matrix)

  16. Tag sequence probability • P(tag|previous n tags) • if we look at the (n-1) preceding tags to find the current tag --> n-gram model • trigram model • chooses the most probable tag t_i for word w_i given: • the previous 2 tags t_{i-2} & t_{i-1} and • the current word w_i • bigram model • chooses the most probable tag t_i for word w_i given: • the previous tag t_{i-1} and • the current word w_i • unigram model (just most-likely tag) • chooses the most probable tag t_i for word w_i given: • the current word w_i

  17. Example • “race” can be VB or NN • “Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/ADV” • “People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN” • let’s tag the word “race” in the 1st sentence with a bigram model.

  18. Example (con’t) • assuming previous words have been tagged, we have: “Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow” • P(race|VB) x P(VB|TO) ? • given that we have a VB, how likely is the current word to be race • given that the previous tag is TO, how likely is the current tag to be VB • P(race|NN) x P(NN|TO) ? • given that we have a NN, how likely is the current word to be race • given that the previous tag is TO, how likely is the current tag to be NN

  19. Example (con’t) • From the training corpus, we found that: • P(NN|TO) = .021 // given that the previous tag is TO, there is a 2.1% chance that the current tag is NN • P(VB|TO) = .34 // given that the previous tag is TO, there is a 34% chance that the current tag is VB • P(race|NN) = .00041 // given that we have an NN, there is a 0.041% chance that this word is "race" • P(race|VB) = .00003 // given that we have a VB, there is a 0.003% chance that this word is "race" • so: P(VB|TO) x P(race|VB) = .34 x .00003 = .0000102 ≈ .00001 • P(NN|TO) x P(race|NN) = .021 x .00041 = .00000861 ≈ .000009 • so: VB is more probable!
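The comparison above is plain arithmetic and can be checked directly. The sketch below uses the four probabilities quoted on the slide; the dictionary layout and the helper name `score` are assumptions made for illustration.

```python
# Probabilities quoted on the slide, keyed as (previous tag, tag) and (word, tag).
p_trans = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
p_emit = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}

def score(word, tag, prev_tag):
    """Bigram score: P(tag | prev_tag) * P(word | tag)."""
    return p_trans[(prev_tag, tag)] * p_emit[(word, tag)]

scores = {tag: score("race", tag, "TO") for tag in ("VB", "NN")}
best = max(scores, key=scores.get)
print(scores, best)
```

The two products are about 1.02e-5 for VB and 8.61e-6 for NN, so the transition probability after TO outweighs the much higher lexical likelihood of "race" as a noun.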

  20. Example (con’t) • and by the way: race is 98% of the time a NN !!! P(VB|race) = 0.02 P(NN|race) = 0.98 !!! • How are the probabilities found ? • using a training corpus of hand-tagged text • long & meticulous work done by linguists

  21. HMM Tagging • But HMM tagging tries to find: • the best sequence of tags for a sentence • not just best tag for a single word • goal: maximize the probability of a tag sequence, given a word sequence • i.e. choose the sequence of tags that maximizes P(tag sequence|word sequence)

  22. HMM Tagging (con’t) • By Bayes' law: P(tagSeq|wordSeq) = P(wordSeq|tagSeq) x P(tagSeq) / P(wordSeq) • wordSeq is given… • so P(wordSeq) will be the same for all tagSeq • so we can drop it from the equation

  23. Assumptions in HMM Tagging • words are independent of each other • Markov assumption (approximation to a short history) • ex. with bigram approximation: P(t_i | t_1, …, t_{i-1}) ≈ P(t_i | t_{i-1}) (state transition probability) • probability of a word is only dependent on its tag: P(w_i | t_1, …, t_n) ≈ P(w_i | t_i) (emission probability)

  24. The derivation
      bestTagSeq = argmax P(tagSeq) x P(wordSeq|tagSeq)
      (t_1 … t_n)* = argmax P(t_1, …, t_n) x P(w_1, …, w_n | t_1, …, t_n)
      Assumption 1: Independence assumption + chain rule
      P(t_1, …, t_n) x P(w_1, …, w_n | t_1, …, t_n)
        = P(t_n | t_1, …, t_{n-1}) x P(t_{n-1} | t_1, …, t_{n-2}) x P(t_{n-2} | t_1, …, t_{n-3}) x … x P(t_1)
          x P(w_1 | t_1, …, t_n) x P(w_2 | t_1, …, t_n) x P(w_3 | t_1, …, t_n) x … x P(w_n | t_1, …, t_n)
      Assumption 2: Markov assumption: only look at a short history (ex. bigram)
        = P(t_n | t_{n-1}) x P(t_{n-1} | t_{n-2}) x P(t_{n-2} | t_{n-3}) x … x P(t_1)
          x P(w_1 | t_1, …, t_n) x P(w_2 | t_1, …, t_n) x P(w_3 | t_1, …, t_n) x … x P(w_n | t_1, …, t_n)
      Assumption 3: A word’s identity only depends on its tag
        = P(t_n | t_{n-1}) x P(t_{n-1} | t_{n-2}) x P(t_{n-2} | t_{n-3}) x … x P(t_1)
          x P(w_1 | t_1) x P(w_2 | t_2) x P(w_3 | t_3) x … x P(w_n | t_n)
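The final line of the derivation can be written compactly as a single product. The notation below is a sketch that assumes a dummy start tag t_0, so that P(t_1 | t_0) stands for P(t_1); that convention is introduced here for brevity and does not appear on the slide.

```latex
\hat{t}_{1:n}
  = \operatorname*{arg\,max}_{t_1,\dots,t_n}
    \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i),
\qquad P(t_1 \mid t_0) \equiv P(t_1)
```

Each factor pairs one state transition probability with one emission probability, which is why the two matrices on the next slides are all the model needs.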

  25. Emission & transition probabilities • let • N: number of possible tags (size of tag set) • V: number of word types (vocabulary) • from a tagged training corpus, we estimate from relative frequencies: • Emission probabilities P(w_i | t_i) • stored in an N x V matrix • emission[i,j] = probability that tag i emits word j, i.e. P(word j | tag i) • Transition probabilities P(t_i | t_{i-1}) • stored in an N x N matrix • transition[i,j] = probability that tag i follows tag j • In practice, these matrices are very sparse • So these models are smoothed to avoid zero probabilities

  26. Emission probabilities P(w_i | t_i) • stored in an N x V matrix • emission[i,j] = probability/frequency that tag i emits word j, i.e. P(word j | tag i)

  27. Transition probabilities P(t_i | t_{i-1}) • stored in an N x N matrix • transition[i,j] = probability/frequency that tag i follows tag j
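Estimating both tables from a tagged corpus amounts to counting and normalizing. The sketch below uses add-alpha (Laplace) smoothing as one simple way to avoid zero probabilities; the slides do not say which smoothing method is used, so that choice, the start symbol `<s>` and all names are assumptions.

```python
from collections import Counter

def estimate(tagged_sentences, alpha=1.0):
    """Return smoothed estimators p_emit(word, tag) and p_trans(tag, prev)."""
    tags = sorted({t for s in tagged_sentences for _, t in s})
    vocab = sorted({w for s in tagged_sentences for w, _ in s})
    emit_counts, trans_counts = Counter(), Counter()
    tag_counts, prev_counts = Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start symbol
        for word, tag in sent:
            emit_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            trans_counts[(prev, tag)] += 1
            prev_counts[prev] += 1
            prev = tag

    def p_emit(word, tag):
        # P(word | tag) with add-alpha smoothing over the vocabulary
        return (emit_counts[(tag, word)] + alpha) / (tag_counts[tag] + alpha * len(vocab))

    def p_trans(tag, prev):
        # P(tag | prev) with add-alpha smoothing over the tag set
        return (trans_counts[(prev, tag)] + alpha) / (prev_counts[prev] + alpha * len(tags))

    return p_emit, p_trans, tags, vocab

# Hypothetical two-sentence corpus.
corpus = [[("the", "AT"), ("race", "NN")], [("to", "TO"), ("race", "VB")]]
p_emit, p_trans, tags, vocab = estimate(corpus)
print(p_emit("race", "NN"), p_trans("NN", "AT"))
```

With smoothing, each row still sums to one over the vocabulary (emissions) or the tag set (transitions), but unseen pairs receive a small non-zero probability.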

  28. Efficiency issues • finding the most probable tag sequence by enumerating all possible sequences takes exponential time • for efficiency, we usually use the Viterbi algorithm • for global maximisation • i.e. best tag sequence

  29. an Example • Emission probabilities: • Transition probabilities:

  30. State Transition Diagram (VMM) • Transition probabilities [Figure: state transition diagram of a visible Markov model over the states start, TO, AT, NN, VB, PN, IN and end; arcs are labeled with transition probabilities]

  31. State Transition Diagram (HMM) • but the states are "invisible" (we only see the words) [Figure: the same state transition diagram over start, TO, AT, NN, VB, PN, IN and end, with each state additionally labeled with emission probabilities for words such as in, the, to, likes, sea, fish and John]

  32. The Viterbi Algorithm • best tag sequence for "John likes to fish in the sea"? • efficiently computes the most likely state sequence given a particular output sequence • based on dynamic programming

  33. A smaller example • What is the best sequence of states for the input string “bbba”? • Computing all possible paths and finding the one with the max probability is exponential [Figure: HMM with states start, q, r and end, emitting the symbols a and b; arcs and emissions are labeled with probabilities]

  34. A smaller example (con’t) • For each state, store the most likely sequence that could lead to it (and its probability) • Path probability matrix: • An array of states versus time (tags versus words) • That stores the prob. of being at each state at each time in terms of the prob. for being in each state at the preceding time.

  35. Viterbi for POS tagging Let: • n = number of words in the sentence to tag (number of input tokens) • T = number of tags in the tag set (number of states) • vit = path probability matrix (Viterbi) • vit[i,j] = probability of the best path that ends in state (tag) j at word i • path = matrix to recover the nodes of the best path (best tag sequence) • path[i+1,j] = the state (tag) of the incoming arc that led to this most probable state j at word i+1
      // Initialization
      vit[1,PERIOD] := 1.0          // pretend that there is a period before our sentence (start tag = PERIOD)
      vit[1,t] := 0.0 for t ≠ PERIOD

  36. Viterbi for POS tagging (con’t)
      // Induction (build the path probability matrix)
      for i := 1 to n step 1 do              // for all words in the sentence
        for all tags t_j do                  // for all possible tags
          // vit[i,t_k]     = probability of best path leading to state t_k at word i
          // P(w_{i+1}|t_j) = emission probability
          // P(t_j|t_k)     = state transition probability
          // store the max prob of the path
          vit[i+1,t_j] := max_{1≤k≤T} (vit[i,t_k] x P(w_{i+1}|t_j) x P(t_j|t_k))
          // store the actual state
          path[i+1,t_j] := argmax_{1≤k≤T} (vit[i,t_k] x P(w_{i+1}|t_j) x P(t_j|t_k))
        end
      end
      // Termination and path-readout
      bestState_{n+1} := argmax_{1≤j≤T} vit[n+1,j]
      for j := n to 1 step -1 do             // for all the words in the sentence
        bestState_j := path[j+1, bestState_{j+1}]
      end
      P(bestState_1, …, bestState_n) := max_{1≤j≤T} vit[n+1,j]
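The pseudocode translates fairly directly into Python. The version below is a minimal sketch under a few assumptions: probabilities are stored in plain dictionaries, missing entries count as zero, the sentence is non-empty, and a hypothetical start symbol `<s>` replaces the PERIOD pseudo-tag. The transition and emission values for "race" come from slide 19; the values of 1.0 for "to" are made up for the example.

```python
def viterbi(words, tags, p_trans, p_emit, start="<s>"):
    """Return (best tag sequence, its probability) for a non-empty word list.

    p_trans[(prev_tag, tag)] and p_emit[(tag, word)] are dicts;
    missing entries are treated as probability 0.
    """
    # vit[i][t]: probability of the best path that ends in tag t at word i
    vit = [{t: p_trans.get((start, t), 0.0) * p_emit.get((t, words[0]), 0.0)
            for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        vit.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda k: vit[i - 1][k] * p_trans.get((k, t), 0.0))
            vit[i][t] = (vit[i - 1][best_prev]
                         * p_trans.get((best_prev, t), 0.0)
                         * p_emit.get((t, words[i]), 0.0))
            back[i][t] = best_prev
    # Termination and path read-out, following the back-pointers
    last = max(tags, key=lambda t: vit[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, vit[-1][last]

tags = ["TO", "VB", "NN"]
p_trans = {("<s>", "TO"): 1.0, ("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
p_emit = {("TO", "to"): 1.0, ("VB", "race"): 0.00003, ("NN", "race"): 0.00041}
best_path, best_prob = viterbi(["to", "race"], tags, p_trans, p_emit)
print(best_path, best_prob)
```

The work per word is proportional to T squared, so the whole sentence costs on the order of n x T^2 operations instead of enumerating all T^n tag sequences. For long sentences a practical implementation would likely use log probabilities to avoid underflow.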

  37. Possible improvements • in bigram POS tagging, we condition a tag only on the preceding tag • why not... • use more context (ex. use trigram model) • more precise: • “is clearly marked”--> verb, past participle • “he clearly marked” -->verb, past tense • combine trigram, bigram, unigram models • condition on words too • but with an n-gram approach, this is too costly (too many parameters to model) • transformation-based tagging...

  38. Techniques for POS tagging • rule-based tagging • uses hand-written rules • stochastic tagging • uses probabilities computed from training corpus • --> transformation-based tagging • uses rules learned automatically

  39. Transformation-based tagging • Due to Eric Brill (1995) • basic idea: • take a non-optimal sequence of tags and • improve it successively by applying a series of well-ordered re-write rules • rule-based • but, rules are learned automatically by training on a pre-tagged corpus

  40. An example 1. Assign to words their most likely tag • P(NN|race) = .98 • P(VB|race) = .02 2. Change some tags by applying transformation rules

  41. Types of context • lots of latitude… • can be: • tag-triggered transformation • The preceding/following word is tagged this way • The word two before/after is tagged this way • ... • word-triggered transformation • The preceding/following word is this word • … • morphology-triggered transformation • The preceding/following word finishes with an s • … • a combination of the above • The preceding word is tagged this way AND the following word is this word

  42. Learning the transformation rules • Input: A corpus with each word: • correctly tagged (for reference) • tagged with its most frequent tag (C0) • Output: A bag of transformation rules • Algorithm: • Instantiates a small set of hand-written templates (generic rules) by comparing the reference corpus to C0 • Change tag a to tag b when… • The preceding/following word is tagged z • The word two before/after is tagged z • One of the 2 preceding/following words is tagged z • One of the 2 preceding words is z • …

  43. Learning the transformation rules (con't) • Run the initial tagger and compile types of errors • <incorrect tag, desired tag, # of occurrences> • For each error type, instantiate all templates to generate candidate transformations • Apply each candidate transformation to the corpus and count the number of corrections and errors that it produces • Save the transformation that yields the greatest improvement • Stop when no transformation can reduce the error rate by a predetermined threshold

  44. Example • if the initial tagger mistags 159 words as verbs instead of nouns • create the error triple: <verb, noun, 159> • Suppose template #3 is instantiated as the rule: • Change the tag from <verb> to <noun> if one of the two preceding words is tagged as a determiner. • When this rule is applied to the corpus: • it corrects 98 of the 159 errors • but it also creates 18 new errors • Error reduction is 98-18=80
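Scoring a candidate transformation means counting how many tags it fixes and how many it breaks, as in the 98 - 18 = 80 computation above. The sketch below implements the rule shape from this slide on a hypothetical eight-word example. The function names are assumptions, and the rule is evaluated against the tags as they were before the rule fired, which is one of several possible conventions.

```python
def apply_rule(tags, from_tag, to_tag, trigger):
    """Change from_tag to to_tag when one of the two preceding tags is `trigger`.

    Contexts are read from the input tags, not from tags already changed
    by this rule (an assumption).
    """
    out = list(tags)
    for i, t in enumerate(tags):
        if t == from_tag and trigger in tags[max(0, i - 2):i]:
            out[i] = to_tag
    return out

def net_gain(current, gold, from_tag, to_tag, trigger):
    """Corrections minus newly introduced errors for one candidate rule."""
    new = apply_rule(current, from_tag, to_tag, trigger)
    fixed = sum(c != g and n == g for c, n, g in zip(current, new, gold))
    broken = sum(c == g and n != g for c, n, g in zip(current, new, gold))
    return fixed - broken

# Hypothetical data: two nouns are mistagged as VB.
gold = ["DT", "NN", "VB", "DT", "JJ", "NN", "TO", "VB"]
current = ["DT", "VB", "VB", "DT", "JJ", "VB", "TO", "VB"]
gain = net_gain(current, gold, "VB", "NN", "DT")
print(apply_rule(current, "VB", "NN", "DT"), gain)
```

Here the rule fixes two tags but also turns one correct VB into NN, so its net gain is 1.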

  45. Learning the best transformations • input: • a corpus with each word: • correctly tagged (for reference) • tagged with its most frequent tag (C0) • a bag of unordered transformation rules • output: • an ordering of the best transformation rules

  46. Learning the best transformations (con’t) let: • E(C_k) = number of words incorrectly tagged in the corpus at iteration k • v(C) = the corpus obtained after applying rule v on the corpus C • ε = minimum error reduction required to keep a rule
      for k := 0 step 1 do
        bt := argmin_t E(t(C_k))            // find the transformation t that minimizes the error rate
        if (E(C_k) - E(bt(C_k))) < ε        // if bt does not improve the tagging significantly
          then goto finished
        C_{k+1} := bt(C_k)                  // apply rule bt to the current corpus
        T_{k+1} := bt                       // bt is kept as the current transformation rule
      end
      finished: the sequence T_1 T_2 … T_k is the ordered list of transformation rules
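The greedy loop can be sketched as follows. For brevity, rules here are hypothetical triples (from_tag, to_tag, prev_tag) that fire when the immediately preceding tag matches, which is a simpler template than the ones listed on slide 42. The sketch assumes epsilon is at least 1, so the error count strictly decreases and the loop terminates.

```python
def apply_rule(tags, rule):
    """rule = (from_tag, to_tag, prev_tag): retag when the previous tag matches."""
    from_tag, to_tag, prev_tag = rule
    return [to_tag if i > 0 and t == from_tag and tags[i - 1] == prev_tag else t
            for i, t in enumerate(tags)]

def errors(tags, gold):
    """E(C): number of tags that differ from the reference tagging."""
    return sum(t != g for t, g in zip(tags, gold))

def learn_rules(initial, gold, candidates, epsilon=1):
    """Greedily pick the rule with the lowest resulting error count each round."""
    current, learned = list(initial), []
    while candidates:
        best = min(candidates, key=lambda r: errors(apply_rule(current, r), gold))
        improved = apply_rule(current, best)
        if errors(current, gold) - errors(improved, gold) < epsilon:
            break  # no rule improves the tagging enough
        current = improved
        learned.append(best)
    return learned, current

# Hypothetical data and candidate rules.
gold = ["TO", "VB", "DT", "NN"]
initial = ["TO", "NN", "DT", "VB"]
candidates = [("NN", "VB", "TO"), ("VB", "NN", "DT"), ("NN", "JJ", "DT")]
learned, final = learn_rules(initial, gold, candidates)
print(learned, final)
```

The order of the learned list matters: at tagging time the rules are applied in the same sequence in which they were learned.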

  47. Strengths of transformation-based tagging • exploits a wider range of lexical and syntactic regularities • can look at a wider context • condition the tags on preceding/next words not just preceding tags. • can use more context than bigram or trigram. • transformation rules are easier to understand than matrices of probabilities

  48. Evaluation of POS taggers • compared with a gold standard of human performance • metric: • accuracy = % of tags that are identical to the gold standard • most taggers ~96-97% accuracy • must compare accuracy to: • ceiling (best possible results) • how do human annotators score compared to each other? (96-97%) • so systems are not bad at all! • baseline (worst possible results) • what if we take the most-likely tag (unigram model) regardless of previous tags? (90-91%) • so anything less is really bad
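The accuracy metric is a straightforward token-level comparison. The sketch below uses the example sentence from slide 2 as the gold standard and a hypothetical prediction with one VBD/VBN confusion; the function name is an assumption.

```python
def accuracy(predicted, gold):
    """Share of tags that are identical to the gold standard."""
    if len(predicted) != len(gold):
        raise ValueError("sequences must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Gold tags for "The representative put chairs on the table".
gold = ["AT", "NN", "VBD", "NNS", "IN", "AT", "NN"]
# Hypothetical tagger output with one error (VBN instead of VBD).
pred = ["AT", "NN", "VBN", "NNS", "IN", "AT", "NN"]
acc = accuracy(pred, gold)
print(round(acc, 3))
```

The same function can be run on the output of a most-likely-tag baseline to see how much a tagger actually gains over it.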

  49. More on tagger accuracy • is 95% good? • that’s 5 mistakes every 100 words • if on average, a sentence is 20 words, that’s 1 mistake per sentence • when comparing tagger accuracy, beware of: • size of training corpus • the bigger, the better the results • difference between training & testing corpora (genre, domain…) • the closer, the better the results • size of tag set • Prediction versus classification • unknown words • the more unknown words (not in dictionary), the worse the results

  50. Error analysis of POS taggers • Where did the tagger go wrong ? • Use a confusion matrix / contingency table • Most confused: • NN (noun) vs. NNP (proper noun) vs. JJ (adjective) • VBD (verb, past tense) vs. VBN (past participle) vs. JJ (adjective) • he chopped carrots, the carrots were chopped, the chopped carrots
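A confusion table can be built by counting (gold tag, predicted tag) pairs for the mismatches. This is a minimal sketch on hypothetical data; the function name and the tag sequences are assumptions.

```python
from collections import Counter

def confusion(predicted, gold):
    """Count (gold tag, predicted tag) pairs for every tagging error."""
    return Counter((g, p) for p, g in zip(predicted, gold) if p != g)

# Hypothetical gold and predicted tags for four tokens.
gold = ["VBD", "VBN", "JJ", "NN"]
pred = ["VBD", "VBD", "VBN", "NNP"]
table = confusion(pred, gold)
for (g, p), n in table.most_common():
    print(f"gold {g} tagged as {p}: {n}")
```

Sorting the pairs by count points directly at the tag pairs that deserve attention, such as the VBD/VBN/JJ group named on the slide.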
