Natural Language Processing

Natural Language Processing: Part-of-Speech Tagging

Parts of Speech • 8–10 traditional parts of speech • Noun, verb, adjective, adverb, preposition, article, interjection, pronoun, conjunction, … • Variously called: • Parts of speech, lexical categories, word classes, morphological classes, lexical tags, ... • Lots of debate within linguistics about the number, nature, and universality of these • We’ll completely ignore this debate

POS Examples • N (noun): chair, bandwidth, pacing • V (verb): study, debate, munch • ADJ (adjective): purple, tall, ridiculous • ADV (adverb): unfortunately, slowly • P (preposition): of, by, to • DET (determiner): the, a, that, those • INT (interjection): ouch, hey • PRO (pronoun): I, me, mine • CONJ (conjunction): and, but, for, because

POS Tagging • The process of assigning a part-of-speech or lexical class marker to each word in a collection • Example (WORD/tag): the/DET koala/N put/V the/DET keys/N on/P the/DET table/N
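As a rough illustration of what "assigning a tag to each word" produces, the sketch below represents the koala sentence as a list of (word, tag) pairs. The lexicon and the function name `tag_with_lexicon` are hypothetical and cover only this one sentence; a real tagger has to choose among several candidate tags per word using context.

```python
# Toy illustration: the output of POS tagging as a list of (word, tag) pairs.
# TOY_LEXICON is a hand-built assumption covering only the example sentence.
TOY_LEXICON = {
    "the": "DET",
    "koala": "N",
    "put": "V",
    "keys": "N",
    "on": "P",
    "table": "N",
}


def tag_with_lexicon(words, lexicon, default="N"):
    """Look each word up in the lexicon; unknown words get the default tag."""
    return [(word, lexicon.get(word.lower(), default)) for word in words]


sentence = "the koala put the keys on the table".split()
tagged = tag_with_lexicon(sentence, TOY_LEXICON)

# Print in the common WORD/tag notation.
print(" ".join(f"{word}/{tag}" for word, tag in tagged))
```

A lookup like this cannot handle words whose tag depends on context, which is where the actual tagging problem begins.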

Why is POS Tagging Useful? • First step of a vast number of practical tasks • Speech synthesis • How to pronounce “lead”? • INsult inSULT • OBject obJECT • OVERflow overFLOW • DIScount disCOUNT • CONtent conTENT • Parsing • Need to know if a word is an N or V before you can parse • Information extraction • Finding names, relations, etc. • Machine Translation
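To make the speech synthesis case concrete, here is a minimal sketch of a pronunciation lookup keyed on (word, tag). It assumes the usual English pattern for the pairs above: first-syllable stress for the noun and second-syllable stress for the verb (or, for "content", the adjective). The table `STRESS` and the function `stressed_form` are hypothetical names for illustration only.

```python
# Sketch: choosing a stress pattern from the POS tag.
# Assumption: noun reading = first-syllable stress, verb (or adjective, for
# "content") reading = second-syllable stress.
STRESS = {
    ("insult", "N"): "INsult",
    ("insult", "V"): "inSULT",
    ("object", "N"): "OBject",
    ("object", "V"): "obJECT",
    ("overflow", "N"): "OVERflow",
    ("overflow", "V"): "overFLOW",
    ("discount", "N"): "DIScount",
    ("discount", "V"): "disCOUNT",
    ("content", "N"): "CONtent",
    ("content", "ADJ"): "conTENT",
}


def stressed_form(word, tag):
    """Return the stress-marked form for (word, tag), or the word unchanged."""
    return STRESS.get((word.lower(), tag), word)


print(stressed_form("object", "N"))
print(stressed_form("object", "V"))
```

Without the tag, the synthesizer would have no principled way to choose between the two entries for the same spelling.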

Open and Closed Classes • Closed class: a small fixed membership • Prepositions: of, in, by, … • Auxiliaries: may, can, will, had, been, … • Pronouns: I, you, she, mine, his, them, … • Usually function words (short common words which play a role in grammar) • Open class: new ones can be created all the time • English has 4: Nouns, Verbs, Adjectives, Adverbs • Many languages have these 4, but not all!

POS Tagging: Choosing a Tagset • There are many parts of speech and many potential distinctions we could draw • To do POS tagging, we need to choose a standard set of tags to work with • Could pick very coarse tagsets • N, V, Adj, Adv, … • A more commonly used set is finer grained: the “Penn TreeBank tagset”, with 45 tags • PRP$, WRB, WP$, VBG • Even more fine-grained tagsets exist
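The relationship between a fine-grained and a coarse tagset can be sketched as a many-to-one mapping. The mapping below is an illustrative assumption, not an official conversion: it collapses a handful of Penn TreeBank tags onto the coarse labels used earlier in these slides, and sends everything else to a catch-all `OTHER`.

```python
# Sketch: collapsing some Penn TreeBank tags to a coarse tagset.
# The coarse labels and this particular mapping are assumptions for
# illustration; only a subset of the 45 Penn tags is covered.
EXACT = {
    "PRP": "PRO",   # personal pronoun
    "PRP$": "PRO",  # possessive pronoun
    "WP": "PRO",    # wh-pronoun
    "WP$": "PRO",   # possessive wh-pronoun
    "WRB": "ADV",   # wh-adverb
    "DT": "DET",
    "IN": "P",
    "CC": "CONJ",
    "UH": "INT",
}

# Families of tags that share a prefix, e.g. NN, NNS, NNP, NNPS.
PREFIXES = [("NN", "N"), ("VB", "V"), ("JJ", "ADJ"), ("RB", "ADV")]


def coarsen(penn_tag):
    """Map a Penn TreeBank tag to a coarse tag, or 'OTHER' if not covered."""
    if penn_tag in EXACT:
        return EXACT[penn_tag]
    for prefix, coarse in PREFIXES:
        if penn_tag.startswith(prefix):
            return coarse
    return "OTHER"


for tag in ["PRP$", "WRB", "WP$", "VBG"]:
    print(tag, "->", coarsen(tag))
```

Going from fine to coarse loses information (VBG and VBD both become V), which is why the reverse direction cannot be done by lookup alone.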

Penn TreeBank POS Tagset

POS Tagging • Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • The POS tagging problem is to determine the tag for a particular instance of a word • (These examples are from Dekang Lin)

How Hard is POS Tagging? Measuring Ambiguity
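One simple way to measure ambiguity is to count, in a hand-tagged corpus, how many word types occur with more than one tag and how many tokens belong to such types. The sketch below does this on a toy corpus built from the four "back" phrases; the tags on the other words are my own Penn-style annotation, and the resulting numbers describe only this toy data, not English in general.

```python
from collections import defaultdict

# Toy tagged corpus built from the four "back" phrases.
# Tags other than those for "back" are an illustrative annotation.
CORPUS = [
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")],
    [("promised", "VBD"), ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
]


def measure_ambiguity(corpus):
    """Return (tags_by_word, type_ambiguity, token_ambiguity)."""
    tags_by_word = defaultdict(set)
    tokens = []
    for sentence in corpus:
        for word, tag in sentence:
            tags_by_word[word].add(tag)
            tokens.append(word)
    if not tokens:
        raise ValueError("corpus is empty")
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}
    type_ambiguity = len(ambiguous) / len(tags_by_word)
    token_ambiguity = sum(w in ambiguous for w in tokens) / len(tokens)
    return dict(tags_by_word), type_ambiguity, token_ambiguity


tags_by_word, type_amb, token_amb = measure_ambiguity(CORPUS)
print(sorted(tags_by_word["back"]))
print(f"ambiguous types: {type_amb:.2f}, ambiguous tokens: {token_amb:.2f}")
```

Even here the token-level figure is higher than the type-level one, because the single ambiguous word is also a frequent one.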

Evaluation • So once you have your POS tagger running, how do you evaluate it? • Overall error rate with respect to a manually annotated gold-standard test set • Error rates on particular tags • Tag confusions ... • Accuracy typically reaches 96–97% for English newswire text • What about Turkish? • What about Twitter?
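The overall and per-tag error rates can be computed directly by comparing predicted tags with gold tags, token by token. This is a minimal sketch; the function name `evaluate` and the toy tag sequences are made up for illustration and do not come from any real tagger.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (overall_error_rate, per_tag_error_rate) against gold tags."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot evaluate on an empty test set")
    # How often each gold tag occurs, and how often it was mis-tagged.
    gold_counts = Counter(gold)
    error_counts = Counter(g for g, p in zip(gold, predicted) if g != p)
    overall = sum(error_counts.values()) / len(gold)
    per_tag = {tag: error_counts[tag] / n for tag, n in gold_counts.items()}
    return overall, per_tag


# Toy data: one tag per token, gold standard vs. hypothetical tagger output.
gold = ["DT", "NN", "VBD", "DT", "NNS", "IN", "DT", "NN"]
pred = ["DT", "NN", "VBN", "DT", "NNS", "IN", "DT", "JJ"]

overall, per_tag = evaluate(gold, pred)
print(f"error rate: {overall:.2%}, accuracy: {1 - overall:.2%}")
for tag, rate in sorted(per_tag.items()):
    print(f"{tag}: {rate:.2f}")
```

The per-tag breakdown matters because a low overall error rate can hide tags that are wrong most of the time.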

Error Analysis • Look at a confusion matrix • See what errors are causing problems • Noun (NN) vs ProperNoun (NNP) vs Adj (JJ) • Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
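A confusion matrix can be sketched as a count of (gold tag, predicted tag) pairs, with the off-diagonal entries showing which confusions dominate. The tag sequences below are invented toy data chosen to mirror the confusions listed above; the function names are hypothetical.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count every (gold_tag, predicted_tag) pair, including correct ones."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def top_confusions(gold, predicted, n=3):
    """Return the n most frequent off-diagonal (gold, predicted) pairs."""
    matrix = confusion_counts(gold, predicted)
    errors = Counter({pair: c for pair, c in matrix.items() if pair[0] != pair[1]})
    return errors.most_common(n)


# Invented toy data, not output from a real tagger.
gold = ["NN", "NNP", "JJ", "VBD", "VBN", "NN", "VBD"]
pred = ["NN", "NN", "NN", "VBN", "VBN", "JJ", "VBN"]

matrix = confusion_counts(gold, pred)
print("VBD tagged as VBN:", matrix[("VBD", "VBN")])
print("most frequent confusion:", top_confusions(gold, pred, n=1))
```

Sorting the off-diagonal cells by count is usually the quickest way to decide which error type to work on first.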
