Computational Tools for Linguists Inderjeet Mani Georgetown University im5@georgetown.edu
Topics • Computational tools for • manual and automatic annotation of linguistic data • exploration of linguistic hypotheses • Case studies • Demonstrations and training • Inter-annotator reliability • Effectiveness of annotation scheme • Costs and tradeoffs in corpus preparation
Outline • Concordances • Data sparseness • Chomsky’s Critique • Ngrams • Mutual Information • Part-of-speech tagging • Annotation Issues • Inter-Annotator Reliability • Named Entity Tagging • Relationship Tagging • Case Studies: metonymy, adjective ordering, discourse markers (then), TimeML
Corpus Linguistics • Use of linguistic data from corpora to test linguistic hypotheses => emphasizes language use • Uses computers to do the searching and counting from on-line material • Faster than doing it by hand! • Most typical tool is a concordancer, but there are many others! • Tools can analyze a certain amount; the rest is left to the human! • Corpus Linguistics is also a particular approach to linguistics, namely an empiricist approach • Sometimes (extreme view) opposed to the rationalist approach, at other times (more moderate view) viewed as complementary to it • Cf. Theoretical vs. Applied Linguistics
Empirical Approaches in Computational Linguistics • Empiricism – the doctrine that knowledge is derived from experience • Rationalism: the doctrine that knowledge is derived from reason • Computational Linguistics is, by necessity, focused on ‘performance’, in that naturally occurring linguistic data has to be processed • Naturally occurring data is messy! This means we have to process data characterized by false starts, hesitations, elliptical sentences, long and complex sentences, input that is in a complex format, etc. • The methodology used is corpus-based • linguistic analysis (phonological, morphological, syntactic, semantic, etc.) carried out on a fairly large scale • rules are derived by humans or machines from looking at phenomena in situ (with statistics playing an important role)
Example: metonymy • Metonymy: substituting the name of one referent for another • George W. Bush invaded Iraq • A Mercedes rear-ended me • Is metonymy involving institutions as agents more common in print news than in fiction? • “The X V”, where V is a reporting verb • Let’s start with: “The X said” • This pattern will provide a “handle” to identify the data
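As a rough illustration of how such a handle can be pulled out of raw text, here is a minimal keyword-in-context (KWIC) sketch. The file name corpus.txt, the context window size, and the capitalized-X heuristic are illustrative assumptions, not part of the original demo.

```python
import re

# Hypothetical corpus file; any plain-text file will do.
CORPUS = "corpus.txt"

# "The X said", where X is one or two capitalized tokens, a rough handle
# for institution names such as "The Pentagon said".
PATTERN = re.compile(r"\bThe\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)\s+said\b")

def kwic(path, pattern, window=40):
    """Yield keyword-in-context lines for every match of pattern."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - window):m.start()].replace("\n", " ")
        right = text[m.end():m.end() + window].replace("\n", " ")
        yield f"{left:>{window}} [{m.group(0)}] {right}"

if __name__ == "__main__":
    for line in kwic(CORPUS, PATTERN):
        print(line)
```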
Exploring Corpora • Datasets http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/DataSets.cgi • Metonymy Test using Corpora http://complingtwo.georgetown.edu/~gwilson/Tools/Metonymy/TheXSaid_MST.html
‘The X said’ from Concordance data • The preference for metonymy in print news arises from the need to communicate information from companies and governments.
Chomsky’s Critique of Corpus-Based Methods 1. Corpora model performance, while linguistics is aimed at the explanation of competence • If you define linguistics that way, linguistic theories will never be able to deal with actual, messy data • Many linguists don’t find the competence-performance distinction to be clear-cut • Sociolinguists have argued that the variability of linguistic performance is systematic, predictable, and meaningful to speakers of a language • Grammatical theories vary in where they draw the line between competence and performance, with some grammars (such as Halliday’s Systemic Grammar) organized as systems of functionally-oriented choices
Chomsky’s Critique (concluded) 2. Natural language is in principle infinite, whereas corpora are finite, so many examples will be missed • Excellent point, which needs to be understood by anyone working with a corpus. But does that mean corpora are useless? • Introspection is unreliable (prone to performance factors, and biased toward short sentences), and pretty useless with child data • Also, insights from a corpus might lead to generalization/induction beyond the corpus, if the corpus is a good sample of the “text population” 3. Ungrammatical examples won’t be available in a corpus • Depends on the corpus, e.g., spontaneous speech, language learners, etc. • The notion of grammaticality is not that clear • Who did you see [pictures/?a picture/??his picture/*John’s picture] of? (an argument/adjunct example)
Which Words are the Most Frequent? • Common words in Tom Sawyer (71,370 words), from Manning & Schütze p. 21 • Will these counts hold in a different corpus (and genre, cf. Tom)? • What happens if you have 8-9M words? (check the usage demo!)
Data Sparseness • Many low-frequency words, fewer high-frequency words: only a few words will have lots of examples • About 50% of word types occur only once • Over 90% occur 10 times or less • So, there is merit to Chomsky’s 2nd objection • (Frequency of word types in Tom Sawyer, from M&S p. 22)
Zipf’s Law: Frequency is inversely proportional to rank • Empirical evaluation of Zipf’s Law on Tom Sawyer, from M&S p. 23
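A quick way to check this relationship yourself is to rank words by frequency and look at rank * frequency, which Zipf’s law predicts should be roughly constant. A minimal sketch, assuming you have some plain-text corpus on disk (the file name tom_sawyer.txt is a stand-in):

```python
from collections import Counter
import re

def zipf_table(text, top=20):
    """Rank words by frequency and report rank * frequency,
    which Zipf's law predicts should be roughly constant."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    print(f"{'rank':>5} {'word':<12} {'freq':>7} {'rank*freq':>10}")
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:>5} {word:<12} {freq:>7} {rank * freq:>10}")

if __name__ == "__main__":
    with open("tom_sawyer.txt", encoding="utf-8") as f:   # hypothetical file name
        zipf_table(f.read())
```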
Illustration of Zipf’s Law (Brown Corpus, from M&S p. 30; frequency and rank plotted on logarithmic scales) • See also http://www.georgetown.edu/faculty/wilsong/IR/WordDist.html
Tokenizing words for corpus analysis • 1. Break on • Spaces? Not always present: 犬に当る男の子は私の兄弟である。 (inu o butta otokonoko wa otooto da) • Periods? (U.K. Products) • Hyphens? data-base = database = data base • Apostrophes? won’t, couldn’t, O’Riley, car’s • 2. Should different word forms be counted as distinct? • Lemma: a set of lexical forms having the same stem, the same pos, and the same word-sense. So, cat and cats are the same lemma. • Sometimes words are lemmatized by stemming, other times by morphological analysis, using a dictionary and/or morphological rules • 3. Fold case or not (usually folded)? • The, the, THE; Mark versus mark • One may need, however, to regenerate the original case when presenting it to the user
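To make the whitespace/period/hyphen/apostrophe decisions concrete for English, here is a crude regex-based tokenizer sketch with case folding. The exact rules are illustrative choices, not a recommended standard:

```python
import re

def tokenize(text, fold_case=True):
    """A crude tokenizer illustrating the decisions above: keep internal
    apostrophes, hyphens, and periods (won't, data-base, U.K.), drop
    surrounding punctuation, and optionally fold case."""
    token_re = re.compile(r"[A-Za-z]+(?:[-'.][A-Za-z]+)*")
    tokens = token_re.findall(text)
    return [t.lower() for t in tokens] if fold_case else tokens

print(tokenize("The U.K. won't ship data-base products, Mark said."))
# ['the', 'u.k', "won't", 'ship', 'data-base', 'products', 'mark', 'said']
```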
Counting: Word Tokens vs Word Types • Word tokens in Tom Sawyer: 71,370 • Word types (i.e., how many different words): 8,018 • In newswire text of the same length you would see around 11,000 word types; Tom Sawyer has fewer, perhaps because it is written in a simple style.
Inspecting word frequencies in a corpus • http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/DataSets.cgi • Usage demo: • http://complingtwo.georgetown.edu/cgi-bin/gwilson/bin/Usage.cgi
Ngrams • Sequences of linguistic items of length n • See count.pl
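In the spirit of count.pl, here is a minimal n-gram counter. This Python sketch is not the Perl script itself; the toy sentence is illustrative:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all length-n sequences of adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat near the cat".split()
bigram_counts = Counter(ngrams(tokens, 2))
for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
# ('the', 'cat') occurs twice; every other bigram occurs once
```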
A test for association strength: Mutual Information • Data from (Church et al. 1991), 1988 AP corpus; N = 44.3M words
Interpreting Mutual Information • High scores, e.g., strong supporter (8.85), indicate that the words are strongly associated in the corpus • MI is a logarithmic score: since x = 2^(log2 x), a score of 8.85 corresponds to 2^8.85 ≈ 461.44, i.e., about 461 times chance • Low scores, e.g., powerful support (1.74): this is about 3 times chance, since 2^1.74 ≈ 3 • For powerful support: f(x,y) = 2, f(x) = 1984, f(y) = 13,428, so I = log2(N * f(x,y) / (f(x) * f(y))) = log2(2N / (1984 * 13,428)) ≈ 1.74 • So a low score doesn’t necessarily mean weakly associated; it could be due to data sparseness
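A small sketch of the pointwise MI score used above, plugging in the “powerful support” counts from the slide (the counts and N come from the slide; everything else is illustrative):

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2( P(x,y) / (P(x) * P(y)) ),
    estimated from corpus counts."""
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))

# Counts for "powerful support": f(x,y) = 2, f(x) = 1984, f(y) = 13,428,
# corpus size N = 44.3 million.
N = 44_300_000
print(round(pmi(2, 1984, 13428, N), 2))   # ~1.73, i.e. about 3x chance (2**1.73)
```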
Mutual Information over Grammatical Relations • Parse a corpus • Determine subject-verb-object triples • Identify head nouns of subject and object NPs • Score subj-verb and verb-obj associations using MI
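A rough sketch of this pipeline using spaCy’s dependency parser as a stand-in for the parser used in the demo (this assumes spaCy and its en_core_web_sm model are installed; the nsubj/dobj labels are spaCy’s conventions, not necessarily the demo’s). The resulting counts could then feed the MI score from the earlier slide:

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

def sv_vo_pairs(texts):
    """Collect (subject, verb) and (verb, object) lemma pairs
    from spaCy's dependency parses."""
    subj_verb, verb_obj = Counter(), Counter()
    for doc in nlp.pipe(texts):
        for tok in doc:
            if tok.dep_ == "nsubj" and tok.head.pos_ == "VERB":
                subj_verb[(tok.lemma_, tok.head.lemma_)] += 1
            if tok.dep_ == "dobj" and tok.head.pos_ == "VERB":
                verb_obj[(tok.head.lemma_, tok.lemma_)] += 1
    return subj_verb, verb_obj

sv, vo = sv_vo_pairs(["The police arrested the suspect.",
                      "The politician wooed the voters."])
print(sv.most_common())
print(vo.most_common())
```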
Demo of Verb-Subj, Verb-Obj Parses • Who devours, or what gets devoured? • Demo: http://www.cs.ualberta.ca/~lindek/demos/depindex.htm
MI over verb-obj relations • Data from (Church et al. 1991)
A Subj-Verb MI Example: Who does what in news? (Data from Schiffman et al. 2001)
  executive            police                politician
  reprimand  16.36     shoot        17.37    clamor     16.94
  conceal    17.46     raid         17.65    jockey     17.53
  bank       18.27     arrest       17.96    wrangle    17.59
  foresee    18.85     detain       18.04    woo        18.92
  conspire   18.91     disperse     18.14    exploit    19.57
  convene    19.69     interrogate  18.36    brand      19.65
  plead      19.83     swoop        18.44    behave     19.72
  sue        19.85     evict        18.46    dare       19.73
  answer     20.02     bundle       18.50    sway       19.77
  commit     20.04     manhandle    18.59    criticize  19.78
  worry      20.04     search       18.60    flank      19.87
  accompany  20.11     confiscate   18.63    proclaim   19.91
  own        20.22     apprehend    18.71    annul      19.91
  witness    20.28     round        18.78    favor      19.92
‘Famous’ Corpora • Must see: http://www.ldc.upenn.edu/Catalog/ • Brown Corpus • British National Corpus • International Corpus of English • Penn Treebank • Lancaster-Oslo-Bergen Corpus • Canadian Hansard Corpus • U.N. Parallel Corpus • TREC Corpora • MUC Corpora • English, Arabic, Chinese Gigawords • Chinese, Arabic Treebanks • North American News Text Corpus • Multext East Corpus – ‘1984’ in multiple Eastern/Central European languages
Links to Corpora • Corpora: • Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ • Oxford Text Archive http://sable.ox.ac.uk/ota/ • Project Gutenberg http://www.promo.net/pg/ • CORPORA list http://www.hd.uib.no/corpora/archive.html • Other: • Chris Manning’s Corpora Page • http://www-nlp.stanford.edu/links/statnlp.html#Corpora • Michael Barlow’s Corpus Linguistics page http://www.ruf.rice.edu/~barlow/corpus.html • Cathy Ball’s Corpora tutorial http://www.georgetown.edu/faculty/ballc/corpora/tutorial.html
Summary: Introduction • Concordances and corpora are widely used and available, to help one to develop empirically-based linguistic theories and computer implementations • The linguistic items that can be counted are many, but “words” (defined appropriately) are basic items • The frequency distribution of words in any natural language is Zipfian • Data sparseness is a basic problem when using observations in a corpus sample of language • Sequences of linguistic items (e.g., word sequences – n-grams) can also be counted, but the counts will be very rare for longer items • Associations between items can be easily computed • e.g., associations between verbs and parser-discovered subjs or objs
Outline • Concordances • Data sparseness • Chomsky’s Critique • Ngrams • Mutual Information • Part-of-speech tagging • Annotation Issues • Inter-Annotator Reliability • Named Entity Tagging • Relationship Tagging • Case Studies: metonymy, adjective ordering, discourse markers (then), TimeML
Using POS in Concordances • deal is more often a verb in Fiction 2000 • deal is more often a noun in English Gigaword • deal is more prevalent in Fiction 2000 than in Gigaword
POS Tagging – What is it? • Given a sentence and a tagset of lexical categories, find the most likely tag for each word in the sentence • Tagset – e.g., Penn Treebank (45 tags, derived from the 87-tag Brown corpus tagset) • Note that many of the words may have unambiguous tags • Example: Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
More details of the POS problem • How ambiguous? • Most words in English have only one Brown Corpus tag • Unambiguous (1 tag): 35,340 word types • Ambiguous (2-7 tags): 4,100 word types = 11.5% • 7 tags: 1 word type (“still”) • But many of the most common words are ambiguous • Over 40% of Brown corpus tokens are ambiguous • Obvious strategies may be suggested based on intuition • to/TO race/VB • the/DT race/NN • will/MD race/VB • Sentences can also contain unknown words for which tags have to be guessed: Secretariat/NNP is/VBZ
Different English Part-of-Speech Tagsets • Brown corpus – 87 tags • Allows compound tags • “I'm” tagged as PPSS+BEM • PPSS for “non-3rd person nominative personal pronoun” and BEM for “am, 'm” • Others have derived their work from the Brown Corpus • LOB Corpus: 135 tags • Lancaster UCREL Group: 165 tags • London-Lund Corpus: 197 tags • BNC – 61 tags (C5) • PTB – 45 tags • To see comparisons and mappings of tagsets, go to www.comp.leeds.ac.uk/amalgam/tagsets/tagmenu.html
PTB Tagset Development • Several changes were made to Brown Corpus tagset: • Recoverability • Lexical: Same treatment of Be, do, have, whereas BC gave each its own symbol • Do/VB does/VBZ did/VBD doing/VBG done/VBN • Syntactic: Since parse trees were used as part of Treebank, conflated certain categories under the assumption that they would be recoverable from syntax • subject vs. object pronouns (both PP) • subordinating conjunctions vs. prepositions on being informed vs. on the table (both IN) • Preposition “to” vs. infinitive marker (both TO) • Syntactic Function • BC: the/DT one/CD vs. PTB: the/DT one/NN • BC: both/ABX vs. • PTB: both/PDT the boys, the boys both/RB, both/NNS of the boys, both/CC boys and girls
PTB Tagging Process • Tagset developed • Automatic tagging by rule-based and statistical pos taggers • Human correction using an editor embedded in Gnu Emacs • Takes under a month for humans to learn this (at 15 hours a week), and annotation speeds after a month exceed 3,000 words/hour • Inter-annotator disagreement (4 annotators, eight 2000-word docs) was 7.2% for the tagging task and 4.1% for the correcting task • Manual tagging took about 2X as long as correcting, with about 2X the inter-annotator disagreement rate and an error rate that was about 50% higher. • So, for certain problems, having a linguist correct automatically tagged output is far more efficient and leads to better reliability among linguists compared to having them annotate the text from scratch!
Automatic POS tagging • http://complingone.georgetown.edu/~linguist/
A Baseline Strategy • Choose the most likely tag for each ambiguous word, independent of previous words, i.e., assign each token to the pos-category it occurred in most often in the training set • E.g., race – which pos is more likely in a corpus? • This strategy gives you about 90% accuracy in controlled tests • So, this “unigram baseline” is what other taggers must always be compared against
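A minimal sketch of such a unigram baseline tagger (the toy training data and the NN default for unknown words are illustrative assumptions):

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Record, for each word, the tag it received most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(sentence, model, default="NN"):
    """Tag each word with its most frequent training tag, NN if unseen."""
    return [(w, model.get(w, default)) for w in sentence]

train = [[("the", "DT"), ("race", "NN"), ("ended", "VBD")],
         [("they", "PRP"), ("race", "VB"), ("daily", "RB")],
         [("the", "DT"), ("race", "NN"), ("began", "VBD")]]
model = train_unigram_tagger(train)
print(tag(["the", "race"], model))   # race -> NN, its most frequent training tag
```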
Beyond the Baseline • Hand-coded rules • Sub-symbolic machine learning • Symbolic machine learning
Machine Learning • Machines can learn from examples • Learning can be supervised or unsupervised • Given training data, machines analyze the data, and learn rules which generalize to new examples • Can be sub-symbolic (rule may be a mathematical function) –e.g. neural nets • Or it can be symbolic (rules are in a representation that is similar to representation used for hand-coded rules) • In general, machine learning approaches allow for more tuning to the needs of a corpus, and can be reused across corpora
A Probabilistic Approach to POS Tagging • Find the “best” sequence of pos tags C = C1..Cn for a sentence W = W1..Wn (here Ci is pos_tag(Wi)), i.e., find the sequence of pos tags C that maximizes P(C | W) • Using Bayes’ Rule: P(C | W) = P(W | C) * P(C) / P(W) • Since we want the C that maximizes the RHS, the denominator can be discarded, since it is the same for every C • So the problem is: find the C which maximizes P(W | C) * P(C) • Example: He will race, with W = W1 W2 W3 = He will race and C = C1 C2 C3 • Possible tag sequences: He/PP will/MD race/NN; He/PP will/NN race/NN; He/PP will/MD race/VB; He/PP will/NN race/VB (i.e., C = PP MD NN, PP NN NN, PP MD VB, PP NN VB)
Independence Assumptions • P(C1…Cn) ≈ ∏ i=1..n P(Ci | Ci-1) • assumes that the event of a pos-tag occurring is independent of the event of any other pos-tag occurring, except for the immediately previous pos tag • From a linguistic standpoint, this seems an unreasonable assumption, due to long-distance dependencies • P(W1…Wn | C1…Cn) ≈ ∏ i=1..n P(Wi | Ci) • assumes that the event of a word appearing in a category is independent of the event of any other word appearing in a category • Ditto • However, the proof of the pudding is in the eating! • N-gram models work well for part-of-speech tagging
A Statistical Method for POS Tagging • Find the value of C1..Cn which maximizes: ∏ i=1..n P(Wi | Ci) * P(Ci | Ci-1)
[HMM trellis diagram: <s> → he|PP → {will|MD, will|NN} → {race|NN, race|VB}, with edges labeled by the probabilities below]
Lexical generation probabilities P(word | tag):
           MD    NN    VB    PP
  he        0     0     0    .3
  will     .8    .2     0     0
  race      0    .4    .6     0
POS bigram probabilities P(current | previous):
  prev \ cur    MD    NN    VB    PP
  <s>                              1
  PP            .8    .2
  MD                  .4    .6
  NN                  .3    .7
Finding the Best Path Through an HMM (Viterbi algorithm) • Score(I) = Max over predecessors J of I [Score(J) * transition(I | J)] * lex(I) • In the trellis above, A = <s>, B = he|PP, C = will|MD, D = will|NN, E = race|NN, F = race|VB • Score(B) = P(PP | <s>) * P(he | PP) = 1 * .3 = .3 • Score(C) = Score(B) * P(MD | PP) * P(will | MD) = .3 * .8 * .8 ≈ .19 • Score(D) = Score(B) * P(NN | PP) * P(will | NN) = .3 * .2 * .2 = .012 • Score(E) = Max [Score(C) * P(NN | MD), Score(D) * P(NN | NN)] * P(race | NN) = ? • Score(F) = Max [Score(C) * P(VB | MD), Score(D) * P(VB | NN)] * P(race | VB) = ?
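To make the recurrence concrete, here is a minimal Viterbi sketch over the toy model above. The tag names, probabilities, and the sentence “he will race” come from the slides; the code itself is an illustrative implementation, not the one used in any demo:

```python
def viterbi(words, tags, trans, lex, start="<s>"):
    """trans[(prev, cur)] = P(cur | prev); lex[(word, tag)] = P(word | tag).
    Returns (probability, tag path) for the highest-scoring path."""
    # Initialization: best path of length 1 ending in each tag.
    best = {t: (trans.get((start, t), 0.0) * lex.get((words[0], t), 0.0), [t])
            for t in tags}
    # Recurrence: extend the best path ending in each previous tag.
    for word in words[1:]:
        new_best = {}
        for cur in tags:
            new_best[cur] = max(
                (prev_score * trans.get((prev, cur), 0.0) * lex.get((word, cur), 0.0),
                 prev_path + [cur])
                for prev, (prev_score, prev_path) in best.items())
        best = new_best
    return max(best.values())

trans = {("<s>", "PP"): 1.0, ("PP", "MD"): 0.8, ("PP", "NN"): 0.2,
         ("MD", "NN"): 0.4, ("MD", "VB"): 0.6, ("NN", "NN"): 0.3, ("NN", "VB"): 0.7}
lex = {("he", "PP"): 0.3, ("will", "MD"): 0.8, ("will", "NN"): 0.2,
       ("race", "NN"): 0.4, ("race", "VB"): 0.6}

score, path = viterbi(["he", "will", "race"], ["PP", "MD", "NN", "VB"], trans, lex)
print(path, score)   # ['PP', 'MD', 'VB'], probability .3*.8*.8*.6*.6 ≈ 0.069
```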
But Data Sparseness Bites Again! • Lexical generation probabilities will lack observations for low-frequency and unknown words • Most systems do one of the following • Smooth the counts • E.g., add a small number to unseen data (to zero counts). For example, assume a bigram not seen in the data has a very small probability, e.g., .0001. • Back off from bigrams to unigrams, etc. • Use lots more data (you’ll still lose, thanks to Zipf!) • Group items into classes, thus increasing class frequency • e.g., group words into ambiguity classes, based on their set of tags. For counting, all words in an ambiguity class are treated as variants of the same ‘word’
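For instance, a small sketch of add-one (Laplace) smoothing over tag bigrams, so that unseen bigrams get a small nonzero probability instead of zero (the toy tag sequences are illustrative):

```python
from collections import Counter

def add_one_bigram_probs(tag_sequences):
    """Return a function giving add-one smoothed P(cur | prev)."""
    bigrams, unigrams, tagset = Counter(), Counter(), set()
    for tags in tag_sequences:
        tagset.update(tags)
        unigrams.update(tags[:-1])              # contexts (previous tags)
        bigrams.update(zip(tags, tags[1:]))     # (prev, cur) pairs
    V = len(tagset)
    return lambda prev, cur: (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)

p = add_one_bigram_probs([["DT", "NN", "VBD"], ["DT", "JJ", "NN"]])
print(p("DT", "NN"))   # seen bigram: (1 + 1) / (2 + 4)
print(p("NN", "DT"))   # unseen bigram: small but nonzero, (0 + 1) / (1 + 4)
```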
A Symbolic Learning Method • HMMs are subsymbolic – they don’t give you rules that you can inspect • A method called Transformational Rule Sequence learning (Brill algorithm) can be used for symbolic learning (among other approaches) • The rules (actually, a sequence of rules) are learnt from an annotated corpus • Performs at least as accurately as other statistical approaches • Has better treatment of context compared to HMMs • rules which use the next (or previous) pos • HMMs just use P(Ci| Ci-1) or P(Ci| Ci-2Ci-1) • rules which use the previous (next) word • HMMs just use P(Wi|Ci)
Brill Algorithm (Overview) • Assume you are given a training corpus G (the gold standard); first, create a tag-free version V of it • 1. Label every word token in V with the most likely tag for that word type from G. If this ‘initial state annotator’ is perfect, you’re done! • 2. Then consider every possible transformational rule, selecting the one that leads to the most improvement in V, using G to measure the error • 3. Retag V based on this rule • 4. Go back to 2, until there is no significant improvement in accuracy over the previous iteration • Notes: As the algorithm proceeds, each successive rule becomes narrower (covering fewer examples, i.e., changing fewer tags), but also potentially more accurate; some later rules may change tags changed by earlier rules
Brill Algorithm (Detailed) • 1. Label every word token with its most likely tag (based on lexical generation probabilities), e.g., P(NN | race) = .98, P(VB | race) = .02, giving: Is/VBZ expected/VBN to/TO race/NN tomorrow/NN • 2. List the positions of tagging errors and their counts, by comparing with ground-truth (GT) • 3. For each error position, consider each instantiation I of X, Y, and Z in the rule template “Change a word from tag X to tag Y when the previous tag is Z”. If Y = GT, increment improvements[I], else increment errors[I] • 4. Pick the I which results in the greatest error reduction, and add it to the output, e.g., VB NN PREV1OR2TAG DT improves 98 errors but produces 18 new errors, a net decrease of 80 errors • 5. Apply that I to the corpus; for the example above, the instantiation NN VB PREV1OR2TAG TO yields: Is/VBZ expected/VBN to/TO race/VB tomorrow/NN • 6. Go to 2, unless the stopping criterion is reached
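To make steps 3-4 concrete, here is a toy sketch of one greedy iteration over the single rule template above. The mini corpus is illustrative; a real Brill learner uses many templates, a large corpus, and repeats until improvement stops:

```python
from collections import Counter

def best_rule(current, gold):
    """One greedy step of a Brill-style learner: over all rules of the form
    'change tag X to Y when the previous tag is Z', return the rule with the
    largest net error reduction on the corpus."""
    tagset = {t for sent in gold for t in sent}
    net = Counter()                  # net[(X, Y, Z)] = errors fixed - errors introduced
    for cur, gld in zip(current, gold):
        for i in range(1, len(cur)):
            x, z = cur[i], cur[i - 1]
            for y in tagset - {x}:
                if gld[i] == y:      # rule X -> Y / PREVTAG Z would fix this tag
                    net[(x, y, z)] += 1
                elif gld[i] == x:    # ...or break a currently correct tag
                    net[(x, y, z)] -= 1
    (x, y, z), gain = net.most_common(1)[0]
    return f"{x} -> {y} / PREVTAG {z}", gain

current = [["TO", "NN"], ["DT", "NN"]]    # output of the initial-state annotator
gold    = [["TO", "VB"], ["DT", "NN"]]    # gold standard tags
print(best_rule(current, gold))           # ('NN -> VB / PREVTAG TO', 1)
```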
Example of Error Reduction From Eric Brill (1995): Computational Linguistics, 21, 4, p. 7
Example of Learnt Rule Sequence • 1. NN VB PREVTAG TO: to/TO race/NN -> VB • 2. VBP VB PREV1OR2OR3TAG MD: might/MD vanish/VBP -> VB • 3. NN VB PREV1OR2TAG MD: might/MD not/MD reply/NN -> VB • 4. VB NN PREV1OR2TAG DT: the/DT great/JJ feast/VB -> NN • 5. VBD VBN PREV1OR2OR3TAG VBZ: He/PP was/VBZ killed/VBD -> VBN by/IN Chapman/NNP
Handling Unknown Words • Can also use the Brill method • Guess NNP if capitalized, NN otherwise • Or use the tag most common for words ending in the same last 3 letters • etc. • Example Learnt Rule Sequence for Unknown Words
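A sketch of heuristics of this kind (the capitalization rule comes from the slide; the particular suffix-to-tag mapping is an illustrative assumption, not the learnt rule sequence itself):

```python
def guess_unknown_tag(word, suffix_tags=None):
    """Guess a tag for an out-of-vocabulary word: capitalized -> NNP,
    common suffixes -> their usual tag, otherwise NN."""
    suffix_tags = suffix_tags or {"ing": "VBG", "ed": "VBD", "ly": "RB", "s": "NNS"}
    if word[0].isupper():
        return "NNP"
    # Check longer suffixes first so "ing" wins over shorter matches.
    for suffix, tag in sorted(suffix_tags.items(), key=lambda kv: -len(kv[0])):
        if word.endswith(suffix):
            return tag
    return "NN"

for w in ["Secretariat", "gallumphing", "blixed", "frumiously", "wugs", "wug"]:
    print(w, guess_unknown_tag(w))
# Secretariat NNP, gallumphing VBG, blixed VBD, frumiously RB, wugs NNS, wug NN
```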