
Tagging – more details




  1. Tagging – more details Reading: D Jurafsky & J H Martin (2000) Speech and Language Processing, Ch 8; R Dale et al (2000) Handbook of Natural Language Processing, Ch 17; C D Manning & H Schütze (1999) Foundations of Statistical Natural Language Processing, Ch 10

  2. POS tagging - overview • What is a “tagger”? • Tagsets • How to build a tagger and how a tagger works • Supervised vs unsupervised learning • Rule-based vs stochastic • And some details

  3. What is a tagger? • Lack of distinction between … • Software which allows you to create something you can then use to tag input text, e.g. “Brill’s tagger” • The result of running such software, e.g. a tagger for English (based on the such-and-such corpus) • Taggers (even rule-based ones) are almost invariably trained on a given corpus • “Tagging” is usually understood to mean “POS tagging”, but you can have other types of tags (e.g. semantic tags)

  4. Tagging vs. parsing • Once a tagger is “trained”, the process consists of straightforward look-up, plus local context (and sometimes morphology) • It will attempt to assign a tag to unknown words, and to disambiguate homographs • The “tagset” (list of categories) is usually larger, with more distinctions

  5. Tagset • Parsing usually has basic word-categories, whereas tagging makes more subtle distinctions • E.g. noun sg vs pl vs genitive, common vs proper, +is, +has, … and all combinations • Parser uses maybe 12-20 categories, tagger may use 60-100

  6. Simple taggers • A default tagger has one tag per word, and assigns it on the basis of dictionary lookup • Tags may indicate ambiguity but not resolve it, e.g. nvb for noun-or-verb • Words may be assigned different tags with associated probabilities • The tagger will assign the most probable tag unless there is some way to identify when a less probable tag is in fact correct • Tag sequences may be defined by regular expressions, and assigned probabilities (including 0 for illegal sequences)
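To make the idea of a simple dictionary-lookup tagger concrete, here is a minimal sketch in Python. The lexicon, tags, probabilities and the NN fallback are invented for illustration, not taken from any real corpus or from the original slides.

```python
# Minimal "simple tagger" sketch: per-word tag probabilities (assumed to have
# been learned from some training corpus); the most probable tag always wins.

# Invented lexicon: word -> {tag: probability}
LEXICON = {
    "the":  {"DET": 1.0},
    "can":  {"MD": 0.7, "NN": 0.2, "VB": 0.1},  # modal / noun / verb
    "fish": {"NN": 0.8, "VB": 0.2},
}
DEFAULT_TAG = "NN"  # fallback guess for unknown words

def tag(words):
    """Assign each word its most probable tag (no context is used)."""
    tagged = []
    for w in words:
        probs = LEXICON.get(w.lower())
        tagged.append((w, DEFAULT_TAG if probs is None else max(probs, key=probs.get)))
    return tagged

print(tag("the can".split()))  # [('the', 'DET'), ('can', 'MD')]
```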

  7. What probabilities do we have to learn? • (a) Individual word probabilities: the probability that a given tag t is appropriate for a given word w • Easy (in principle): learn the relative frequencies from a training corpus (a reconstruction of the estimate is sketched below) • Problem of “sparse data” • Add a small amount to each count, so we get no zeros
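The estimate itself appeared as a formula image on the original slide; a standard reconstruction (relative frequency, then add-λ smoothing so that no probability is zero, with T the tagset) would be:

```latex
\begin{align*}
P(t \mid w) &\approx \frac{C(w,t)}{C(w)}
  && \text{(relative frequency from the training corpus)} \\
P(t \mid w) &\approx \frac{C(w,t) + \lambda}{C(w) + \lambda\,|T|}
  && \text{(add a small amount $\lambda$ to each count: no zeros)}
\end{align*}
```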

  8. (b) Tag sequence probability: the probability that a given tag sequence t1,t2,…,tn is appropriate for a given word sequence w1,w2,…,wn • P(t1,t2,…,tn | w1,w2,…,wn) = ??? • Too hard to calculate for the entire sequence: P(t1,t2,t3,t4,…) = P(t1) P(t2|t1) P(t3|t1,t2) P(t4|t1,t2,t3) … • A subsequence is more tractable, and a sequence of 2 or 3 should be enough: Bigram model: condition each tag only on the previous tag, P(ti|ti-1); Trigram model: condition on the two previous tags, P(ti|ti-2,ti-1); N-gram model: see the reconstruction below
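The slide's own formulas were compressed in transcription; a standard way to write out the chain rule and the bigram/trigram/N-gram approximations is:

```latex
\begin{align*}
P(t_1,\dots,t_n) &= P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1,t_2)\cdots P(t_n \mid t_1,\dots,t_{n-1})
  && \text{(chain rule, too hard in full)} \\
P(t_1,\dots,t_n) &\approx P(t_1)\prod_{i=2}^{n} P(t_i \mid t_{i-1})
  && \text{(bigram model)} \\
P(t_1,\dots,t_n) &\approx P(t_1)\,P(t_2 \mid t_1)\prod_{i=3}^{n} P(t_i \mid t_{i-2},t_{i-1})
  && \text{(trigram model)} \\
P(t_1,\dots,t_n) &\approx \prod_{i=1}^{n} P(t_i \mid t_{i-N+1},\dots,t_{i-1})
  && \text{(N-gram model)}
\end{align*}
```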

  9. More complex taggers • Bigram taggers assign tags on the basis of sequences of two words (usually assigning a tag to word n on the basis of word n−1) • An nth-order tagger assigns tags on the basis of sequences of n words • As the value of n increases, so does the complexity of the statistical calculation involved in comparing probability combinations

  10. History (timeline, 1960s–2000s) • Greene and Rubin, rule based – 70% • Brown Corpus created (EN-US), 1 million words • Brown Corpus tagged • LOB Corpus created (EN-UK), 1 million words • HMM tagging (CLAWS) – 93%–95% • LOB Corpus tagged • DeRose/Church, efficient HMM, sparse data – 95%+ • POS tagging separated from other NLP • Penn Treebank Corpus (WSJ, 4.5M) • Transformation-based tagging (Eric Brill), rule based – 95%+ • Tree-based statistics (Helmut Schmid) – 96%+ • British National Corpus (tagged by CLAWS) • Trigram tagger (Kempe) – 96%+ • Neural network – 96%+ • Combined methods – 98%+

  11. How do they work? • Tagger must be “trained” • Many different techniques, but typically … • Small “training corpus” hand-tagged • Tagging rules learned automatically • Rules define most likely sequence of tags • Rules based on • Internal evidence (morphology) • External evidence (context)

  12. Rule-based taggers • Earliest type of tagging: two stages • Stage 1: look up the word in a lexicon to give a list of potential POSs • Stage 2: apply rules which certify or disallow tag sequences • Rules were originally handwritten; more recently, machine learning methods can be used • cf. transformation-based learning, below

  13. Stochastic taggers • Nowadays pretty much all taggers are statistics-based, and have been since the 1980s (or even earlier: some primitive algorithms were already published in the 1960s and 70s) • The most common approach is based on Hidden Markov Models (also found in speech processing, etc.)

  14. (Hidden) Markov Models • The probability calculations imply Markov models: we assume that P(t|w) depends only on the previous word (or a short sequence of previous words) • (Informally) Markov models are the class of probabilistic models that assume we can predict the future without taking too much account of the past • Markov chains can be modelled by finite state automata: the next state in a Markov chain is always dependent on some finite history of previous states • The model is “hidden” when the underlying chain of states (here, the tags) cannot be observed directly; only the words are seen

  15. Three stages of HMM training • Estimating likelihoods on the basis of a corpus: Forward-backward algorithm • “Decoding”: applying the process to a given input: Viterbi algorithm • Learning (training): Baum-Welch algorithm or Iterative Viterbi

  16. Forward-backward algorithm • Denote by At(s) the probability of the first t words together with tag s at position t (the “forward” probability) • Claim: At+1(s) can be computed from the At values for all states, therefore we can calculate all At(s) in time O(L·T^n) • Similarly, by going backwards, we can get Bt(s), the probability of the remaining words wt+1…wL given tag s at position t (the “backward” probability) • Multiplying, At(s)·Bt(s) gives the probability of the whole word sequence together with tag s at position t • Note that summing this for all states at a time t gives the likelihood of w1…wL
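A compact Python/NumPy sketch of the forward and backward passes for a bigram HMM. The parameter layout (pi, trans, emit) and the integer encoding of words are assumptions made for this example, not part of the original slides.

```python
import numpy as np

# A minimal forward-backward sketch for a bigram HMM tagger.
# Assumed parameters:
#   pi[s]      - probability of starting in tag s
#   trans[s,q] - P(next tag = q | current tag = s)
#   emit[s,w]  - P(word w | tag s), words encoded as integer ids

def forward_backward(words, pi, trans, emit):
    T = len(pi)                # number of tags
    L = len(words)             # sentence length
    alpha = np.zeros((L, T))   # alpha[t, s] = P(w_1..w_t, tag_t = s)
    beta = np.zeros((L, T))    # beta[t, s]  = P(w_{t+1}..w_L | tag_t = s)

    # forward pass
    alpha[0] = pi * emit[:, words[0]]
    for t in range(1, L):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, words[t]]

    # backward pass
    beta[L - 1] = 1.0
    for t in range(L - 2, -1, -1):
        beta[t] = trans @ (emit[:, words[t + 1]] * beta[t + 1])

    # for any t, sum over tags of alpha[t] * beta[t] equals the sentence likelihood
    likelihood = alpha[L - 1].sum()
    return alpha, beta, likelihood
```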

  17. Viterbi algorithm (aka dynamic programming) (see J&M p177ff) • Denote by Qt(s) the probability of the best tag sequence for w1…wt that ends in state s • Claim: the best path ending in s at time t+1 extends the best path to some state q at time t; otherwise, appending s to the better prefix would give a path better than Qt+1(s) • Therefore, checking all possible states q at time t, multiplying by the transition probability between q and s and the emission probability of wt+1 given s, and finding the maximum, gives Qt+1(s) • We need to store, for each state, the previous state on the path achieving Qt(s) • Find the maximal finish state, and reconstruct the path • O(L·T^n) instead of T^L
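Viterbi decoding for the same assumed bigram HMM layout as in the forward-backward sketch above; for the bigram case the running time is O(L·T²).

```python
import numpy as np

# Viterbi decoding for a bigram HMM tagger (same assumed parameter layout
# as the forward-backward sketch: pi, trans, emit, integer word ids).
def viterbi(words, pi, trans, emit):
    T, L = len(pi), len(words)
    Q = np.zeros((L, T))                # Q[t, s]: best score of any tag path ending in s at t
    back = np.zeros((L, T), dtype=int)  # back[t, s]: predecessor tag on that best path

    Q[0] = pi * emit[:, words[0]]
    for t in range(1, L):
        # for each state s, check every previous state q, multiply by the
        # transition probability q->s, take the max, then multiply by the
        # emission probability of w_t given s
        scores = Q[t - 1][:, None] * trans      # scores[q, s]
        back[t] = scores.argmax(axis=0)
        Q[t] = scores.max(axis=0) * emit[:, words[t]]

    # find the maximal finish state and reconstruct the path
    path = [int(Q[L - 1].argmax())]
    for t in range(L - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))         # best tag sequence, O(L * T^2) time
```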

  18. Baum-Welch algorithm • Start with an initial HMM • Calculate, using the forward-backward quantities, the probability that each hidden state was used at each time i, given our observations • Re-estimate the HMM parameters • Continue until convergence • Can be shown to improve the likelihood at every iteration
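Written in terms of the forward (α) and backward (β) quantities, the re-estimation step takes a standard form; the notation here is a reconstruction, not necessarily that of the original slide:

```latex
\begin{align*}
% posterior probability of being in tag s at time t, given the observed words
\gamma_t(s) &= \frac{\alpha_t(s)\,\beta_t(s)}{\sum_{q} \alpha_t(q)\,\beta_t(q)} \\
% posterior probability of the transition s -> q between times t and t+1
\xi_t(s,q) &= \frac{\alpha_t(s)\,P(q \mid s)\,P(w_{t+1} \mid q)\,\beta_{t+1}(q)}
                 {\sum_{s'}\sum_{q'} \alpha_t(s')\,P(q' \mid s')\,P(w_{t+1} \mid q')\,\beta_{t+1}(q')} \\
% re-estimated transition and emission probabilities
\hat{P}(q \mid s) &= \frac{\sum_t \xi_t(s,q)}{\sum_t \gamma_t(s)}, \qquad
\hat{P}(w \mid s) = \frac{\sum_{t\,:\,w_t = w} \gamma_t(s)}{\sum_t \gamma_t(s)}
\end{align*}
```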

  19. Unsupervised learning • We have an untagged corpus • We may also have partial information such as a set of tags, a dictionary, knowledge of tag transitions, etc. • Use Baum-Welch to estimate both the context probabilities and the lexical probabilities

  20. Supervised learning • Use a tagged corpus • Count the frequencies of tag–word pairs t,w: C(t,w) • Estimate the lexical probabilities by Maximum Likelihood (relative frequencies; a sketch is given below) • Count the frequencies of tag n-grams C(t1…tn) • Estimate the tag-sequence probabilities by Maximum Likelihood in the same way • What about small counts? Zero counts?
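A sketch of the counting and Maximum Likelihood estimation step in Python. The corpus format (a list of sentences of (word, tag) pairs) is an assumption, and the lexical probability is estimated here as P(w|t), the usual HMM convention; it could equally be set up as P(t|w) as on slide 7.

```python
from collections import Counter

def mle_estimates(tagged_sentences):
    """Counts and Maximum Likelihood Estimates from [(word, tag), ...] sentences."""
    word_tag = Counter()   # C(t, w)
    tag_count = Counter()  # C(t)
    tag_bi = Counter()     # C(t1, t2)
    tag_ctx = Counter()    # C(t1), counted only where a following tag exists

    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        for w, t in sent:
            word_tag[(t, w)] += 1
            tag_count[t] += 1
        for t1, t2 in zip(tags, tags[1:]):
            tag_bi[(t1, t2)] += 1
            tag_ctx[t1] += 1

    # MLE: P(w|t) = C(t,w) / C(t)   and   P(t2|t1) = C(t1,t2) / C(t1)
    emit = {(t, w): c / tag_count[t] for (t, w), c in word_tag.items()}
    trans = {(t1, t2): c / tag_ctx[t1] for (t1, t2), c in tag_bi.items()}
    return emit, trans

# Tiny invented example:
emit, trans = mle_estimates([[("the", "DET"), ("fish", "NN"), ("swims", "VBZ")]])
```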

  21. Sparse training data – smoothing • Adding a bias (a small constant added to every count): • Compensates for estimation error (Bayesian approach) • Has a larger effect on low-count words • Solves the zero-count word problem • Generalized smoothing: interpolate the Maximum Likelihood estimate with a prior distribution • Reduces to the bias form for a particular choice of weights (one reconstruction is given below)
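The smoothing formulas on the original slide were images; one common reconstruction of the “bias” and of a generalized smoother that reduces to it (with V the vocabulary size and a uniform prior P₀) is:

```latex
\begin{align*}
% add-\lambda ("bias") smoothing of the lexical probability
P(w \mid t) &= \frac{C(t,w) + \lambda}{C(t) + \lambda\,V} \\
% generalized smoothing: interpolate the MLE with a prior distribution P_0
P(w \mid t) &= \mu\,\frac{C(t,w)}{C(t)} + (1-\mu)\,P_0(w),
\quad\text{which reduces to the bias form when } P_0 \text{ is uniform and } \mu = \tfrac{C(t)}{C(t)+\lambda V}
\end{align*}
```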

  22. Decision-tree tagging • Not all n-grams are created equal: • Some n-grams contain redundant information that may be captured well enough with fewer context tags • Some n-grams are too sparse • Decision tree (Schmid, 1994)

  23. Decision Trees • Each node is a binary test of tag ti-k. • The leaves store probabilities for ti. • All HMM algorithms can still be used • Learning: • Build tree from root to leaves • Choose tests for nodes that maximize information gain • Stop when branch too sparse • Finally, prune tree
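A small Python sketch of the node-selection criterion (information gain of a binary test over the distribution of tags ti); the context representation and the example test are hypothetical.

```python
import math
from collections import Counter

# Node selection for a decision-tree tagger: a candidate binary test splits
# the training contexts in two, and the test with the highest information
# gain over the distribution of tags ti is chosen for the node.
def entropy(tags):
    counts = Counter(tags)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(contexts, test):
    """contexts: list of (history, tag) pairs; test: predicate on the history."""
    yes = [t for h, t in contexts if test(h)]
    no = [t for h, t in contexts if not test(h)]
    before = entropy([t for _, t in contexts])
    after = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(contexts)
    return before - after

# Hypothetical example: is the previous tag "DET"?
contexts = [(("DET",), "NN"), (("DET",), "NN"), (("VBZ",), "DET")]
print(information_gain(contexts, lambda h: h[-1] == "DET"))
```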

  24. Transformation-based learning • Eric Brill (1993) • Start from an initial tagging, and apply a series of transformations • The transformations themselves are learned from the training data • Captures the tagging data in far fewer parameters than stochastic models • The transformations learned have linguistic meaning

  25. Transformation-based learning • Examples: Change tag a to b when: • The preceding (following) word is tagged z • The word two before (after) is tagged z • One of the 2 preceding (following) words is tagged z • The preceding word is tagged z and the following word is tagged w • The preceding (following) word is W

  26. Transformation-based Tagger: Learning • Start with an initial tagging • Score the possible transformations by comparing their result to the “truth” • Choose the transformation that maximizes the score and apply it • Repeat the last two steps until no transformation improves the score
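A minimal Python sketch of this learning loop, in the spirit of Brill (1993); the transformation representation, the scoring function and the example transformation are simplified assumptions, not Brill's actual implementation.

```python
# Transformation-based learning loop: repeatedly pick the transformation that
# most improves agreement with the gold-standard tags, and apply it.
def tbl_learn(words, truth, initial_tags, candidate_transformations, min_gain=1):
    """Each transformation maps (words, tags) -> a new list of tags."""
    tags = list(initial_tags)
    learned = []
    while candidate_transformations:
        # net gain = number of tags corrected minus number of tags broken
        def gain(tr):
            new = tr(words, tags)
            return sum((n == g) - (o == g) for n, o, g in zip(new, tags, truth))
        best = max(candidate_transformations, key=gain)
        if gain(best) < min_gain:
            break
        tags = best(words, tags)
        learned.append(best)
    return learned

# Hypothetical transformation template instance:
# change NN to VB when the preceding word is tagged TO
def nn_to_vb_after_to(words, tags):
    return [("VB" if t == "NN" and i > 0 and tags[i - 1] == "TO" else t)
            for i, t in enumerate(tags)]
```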
