Introduction to Natural Language Processing (NLP) Dekang Lin Department of Computing Science University of Alberta lindek@cs.ualberta.ca
Outline • What is NLP • Applications • Challenges • Linguistic Issues • Part of Speech Tagging
What is Natural Language Processing? • Natural Language Processing • Process information contained in natural language text. • Also known as Computational Linguistics • Can machines understand human language? • Define ‘understand’ • Understanding is the ultimate goal. However, one doesn’t need to fully understand to be useful.
Why Study NLP? • A hallmark of human intelligence. • Text is the largest repository of human knowledge and is growing quickly. • emails, news articles, web pages, IRC, scientific articles, insurance claims, customer complaint letters, transcripts of phone calls, technical documents, government documents, patent portfolios, court decisions, contracts, …… • Are we reading any faster than before?
NLP Applications • Question answering • Who is the first Taiwanese president? • Text Categorization/Routing • e.g., customer e-mails. • Text Mining • Find everything that interacts with BRCA1. • Machine (Assisted) Translation • Language Teaching/Learning • Usage checking • Spelling correction • Is that just dictionary lookup?
Challenges in NLP: Ambiguity • Words or phrases can often be understood in multiple ways. • Teacher Strikes Idle Kids • Killer Sentenced to Die for Second Time in 10 Years • They denied the petition for his release that was signed by over 10,000 people. • child abuse expert/child computer expert • Who does Mary love?
Probabilistic/Statistical Resolution of Ambiguities • When there are ambiguities, choose the interpretation with the highest probability. • Example: how many times do people say • “Mary loves …” • “the Mary love” • Which interpretation has the highest probability?
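The counting idea on this slide can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the four-sentence corpus is invented, the helper names `count_phrase` and `most_probable` are hypothetical, and the labels given to the two readings are illustrative rather than taken from the lecture. Probabilities are plain relative frequencies of the two cue phrases.

```python
# Toy corpus, invented for illustration only (not real usage data).
CORPUS = [
    "mary loves john",
    "mary loves chocolate",
    "everyone knows mary loves music",
    "the mary love story was retold",
]

def count_phrase(sentences, phrase):
    """Count how often a word sequence occurs in whitespace-tokenized sentences."""
    target = phrase.split()
    n = len(target)
    total = 0
    for sentence in sentences:
        tokens = sentence.split()
        total += sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))
    return total

def most_probable(sentences, candidates):
    """Pick the reading whose cue phrase has the highest relative frequency."""
    counts = {label: count_phrase(sentences, cue) for label, cue in candidates.items()}
    total = sum(counts.values())
    probs = {label: (c / total if total else 0.0) for label, c in counts.items()}
    best = max(probs, key=probs.get)
    return best, probs

# Hypothetical labels for the two readings and the phrase that signals each one.
CANDIDATES = {
    "subject reading": "mary loves",
    "modifier reading": "the mary love",
}

best, probs = most_probable(CORPUS, CANDIDATES)
print(best, probs)
```

With a realistic corpus the counts would come from millions of sentences, and sparse counts would need smoothing, which this sketch leaves out.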
Challenges in NLP: Variations • The same meaning can be expressed in different ways • Who wrote “The Language Instinct”? • Steven Pinker, an MIT professor and author of “The Language Instinct”, ……
Linguistic Issues • Morphology • Internal structure of words • Syntax • Internal structure of sentences • Semantics • How to interpret the meanings of words, phrases and sentences.
Morphology • Morphology is concerned with the internal make-up of words • The fearsome cats attacked the foolish dog • The fear-some cat-s attack-ed the fool-ish dog • Inflectional morphology • Does not change the grammatical category of words: cats/cat-s, attacked/attack-ed • Derivational morphology • May involve changes to grammatical categories: fearsome/fear-some, foolish/fool-ish
Morphology Is not as Easy as It May Seem to be • Examples from Woods et al. 2000 • delegate (de + leg + ate) take the legs from • caress (car + ess) female car • cashier (cashy + er) more wealthy • lacerate (lace + rate) speed of tatting • ratify (rat + ify) infest with rodents • infantry (infant + ry) childish behavior
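To see why these examples are a problem, consider a naive analyzer that splits a word wherever a known stem is followed by a known ending. The sketch below assumes a hypothetical mini-lexicon (`STEMS`, `ENDINGS`) chosen to mirror the slides; it is not the analyzer of Woods et al. It segments "cats" and "foolish" sensibly, but it accepts the spurious analyses just as readily.

```python
# Hypothetical mini-lexicon, chosen only to mirror the examples on the slides.
STEMS = {"cat", "attack", "fear", "fool", "car", "lace", "rat", "infant"}
ENDINGS = {"s", "ed", "some", "ish", "ess", "rate", "ify", "ry"}

def naive_segment(word):
    """Return every (stem, ending) split licensed by the toy lexicon."""
    splits = []
    for i in range(1, len(word)):
        stem, ending = word[:i], word[i:]
        if stem in STEMS and ending in ENDINGS:
            splits.append((stem, ending))
    return splits

for word in ["cats", "foolish", "caress", "lacerate", "ratify", "infantry"]:
    # The first two splits are real morphology; the rest are accidental.
    print(word, naive_segment(word))
```

String matching alone cannot tell the two kinds of split apart; a real analyzer needs to know which words are actually derived from which stems.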
A Turkish Example [Oflazer & Guzey 1994] • uygarlastiramayabileceklerimizdenmissinizcesine • uygar/civilized las/BECOME tir/CAUS ama/NEG yabil/POT ecek/FUT ler/3PL imiz/POSS-1PL den/ABL mis/NARR siniz/2PL cesine/AS-IF • an adverb meaning roughly “(behaving) as if you were one of those whom we might not be able to civilize.”
Sentence Structures • Sentences have structures and are made up of constituents. • The constituents are phrases. • A phrase consists of a head and modifiers. • The categorial type of the head determines the categorial type of the phrase (e.g., a phrase headed by a noun is a noun phrase).
Parsing • Analyze the structure of a sentence
[S [NP [D The] [N student]] [VP [V put] [NP [D the] [N book]] [PP [P on] [NP [D the] [N table]]]]]
[S [NP [N Teacher] [N strikes]] [VP [V idle] [NP [N kids]]]]
[S [NP [N Teacher]] [VP [V strikes] [NP [A idle] [N kids]]]]
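The headline ambiguity can be reproduced with a very small chart parser. The following is a sketch only: the grammar (`BINARY`, `UNARY`) and the lexicon are assumptions made up for this one sentence, and the parser is a simplified CKY-style loop rather than any particular system from the lecture. It returns every bracketed tree rooted in S.

```python
from collections import defaultdict

# Toy grammar and lexicon, assumed for illustration; they cover only this headline.
BINARY = [("S", "NP", "VP"), ("NP", "N", "N"), ("NP", "A", "N"), ("VP", "V", "NP")]
UNARY = [("NP", "N")]  # a bare noun can be a noun phrase
LEXICON = {"teacher": ["N"], "strikes": ["N", "V"], "idle": ["V", "A"], "kids": ["N"]}

def parse(words):
    """Return all bracketed parse trees rooted in S (CKY-style chart parsing)."""
    n = len(words)
    chart = defaultdict(list)  # (start, end, label) -> list of tree strings
    # Fill single-word spans from the lexicon, then apply unary rules.
    for i, word in enumerate(words):
        for tag in LEXICON.get(word.lower(), []):
            chart[i, i + 1, tag].append(f"[{tag} {word}]")
        for parent, child in UNARY:
            for tree in list(chart[i, i + 1, child]):
                chart[i, i + 1, parent].append(f"[{parent} {tree}]")
    # Combine adjacent spans bottom-up with the binary rules.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in BINARY:
                    for left_tree in chart[start, mid, left]:
                        for right_tree in chart[mid, end, right]:
                            chart[start, end, parent].append(
                                f"[{parent} {left_tree} {right_tree}]"
                            )
    return chart[0, n, "S"]

trees = parse("Teacher strikes idle kids".split())
for tree in trees:
    print(tree)
```

Because "strikes" is listed as both N and V, and "idle" as both V and A, the chart ends up with two complete S trees, one for each reading of the headline.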
Syntax • Syntax is the study of the regularities and constraints of word order and phrase structure • How words are organized into phrases • How phrases are combined into larger phrases (including sentences).
Phrase Structures • Noun phrases • A noun phrase consists of a head noun and a set of modifiers. • The meaning of the noun phrase is largely determined by the noun. • Verb phrases • A verb phrase consists of a head verb and a set of modifiers • the head verb denotes the action/activity/state
Part of Speech • Syntactic categories that words belong to • N, V, Adj/Adv, Prep, Aux, • Open/Closed class, lexical/functional categories • Also known as: grammatical categories, syntactic tags, POS tags, word classes, …
POS Examples
Open class
  N        noun          baby, toy
  V        verb          see, kiss
  ADJ      adjective     tall, grateful, alleged
  ADV      adverb        quickly, frankly, ...
Closed class
  P        preposition   in, on, near
  DET      determiner    the, a, that
  WhPron   wh-pronoun    who, what, which, …
  COORD    coordinator   and, or
Substitution Test • Two words belong to the same category if replacing one with another does not change the grammaticality of a sentence. • The _____ is angry. • The ____ dog is angry. • Fifi ____ . • Fifi ____ the book.
POS Tags • There is no standard set of POS tags • Some use coarse classes: e.g., N • Others prefer finer distinctions (e.g., Penn Treebank): • PRP: personal pronouns (you, me, she, he, them, him, her, …) • PRP$: possessive pronouns (my, our, her, his, …) • NN: singular common nouns (sky, door, theorem, …) • NNS: plural common nouns (doors, theorems, women, …) • NNP: singular proper names (Fifi, IBM, Canada, …) • NNPS: plural proper names (Americas, Carolinas, …)
Part of Speech Tagging • Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB • The POS tagging problem is to determine the POS tag for a particular instance of a word.
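One very simple way to approach the tagging problem is to look at the tag of the previous word. The sketch below is a toy under stated assumptions: the training data is just the four "back" examples from the slide, with tags for the other words filled in by hand; the function names `train` and `tag` are hypothetical; and the greedy left-to-right strategy is a stand-in for illustration, not the method taught in the lecture.

```python
from collections import Counter, defaultdict

# The four "back" examples, hand-tagged with Penn Treebank style tags (assumed).
TRAIN = [
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")],
    [("promised", "VBD"), ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
]

def train(sentences):
    """Collect tag counts per (previous tag, word) and per word alone."""
    by_context = defaultdict(Counter)
    by_word = defaultdict(Counter)
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            by_context[prev, word][tag] += 1
            by_word[word][tag] += 1
            prev = tag
    return by_context, by_word

def tag(words, by_context, by_word):
    """Greedy left-to-right tagging: context counts first, then word counts."""
    prev, output = "<s>", []
    for word in words:
        word = word.lower()
        if (prev, word) in by_context:
            guess = by_context[prev, word].most_common(1)[0][0]
        elif word in by_word:
            guess = by_word[word].most_common(1)[0][0]
        else:
            guess = "NN"  # crude default for unknown words
        output.append((word, guess))
        prev = guess
    return output

by_context, by_word = train(TRAIN)
print(tag("to back the bill".split(), by_context, by_word))
print(tag("my back".split(), by_context, by_word))
```

Four sentences are far too few to generalize: "the back" is tagged JJ here only because that is the single context seen in training, even though "the back of the room" would need NN.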
POS Ambiguity (in the Brown Corpus) • Unambiguous (1 tag): 35,340 word types • Ambiguous (2-7 tags): 4,100 word types • (DeRose, 1988)
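Counts like these come from grouping the word types of a tagged corpus by how many distinct tags each one receives. A minimal sketch of that bookkeeping follows, assuming a tiny hand-tagged token list built from the "back" examples instead of the Brown Corpus; the function name `ambiguity_histogram` is hypothetical.

```python
from collections import defaultdict

# Tiny hand-tagged token list (assumed); a real count would use a full corpus.
TAGGED = [
    ("the", "DT"), ("back", "JJ"), ("door", "NN"),
    ("on", "IN"), ("my", "PRP$"), ("back", "NN"),
    ("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB"),
    ("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN"),
]

def ambiguity_histogram(tagged_tokens):
    """Map number-of-distinct-tags -> number of word types with that many tags."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    histogram = defaultdict(int)
    for tags in tags_by_word.values():
        histogram[len(tags)] += 1
    return dict(histogram)

histogram = ambiguity_histogram(TAGGED)
print(histogram)
```

Note that the histogram counts word types, not tokens: an ambiguous type such as "back" is counted once here even though it occurs four times in the text.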