
CS 8520: Artificial Intelligence




  1. CS 8520: Artificial Intelligence Natural Language Processing Introduction Paula Matuszek Spring, 2013 CSC 8520 Spring 2013. Paula Matuszek

  2. Natural Language Processing • speech recognition • natural language understanding • computational linguistics • psycholinguistics • information extraction • information retrieval • inference • natural language generation • speech synthesis • language evolution

  3. Applied NLP • Machine translation • Spelling/grammar correction • Information retrieval • Data mining • Document classification • Question answering, conversational agents

  4. Natural Language Understanding [pipeline diagram: sound waves → acoustic/phonetic → morphological/syntactic → semantic/pragmatic → internal representation]

  5. Natural Language Understanding [same pipeline diagram, with the stages grouped as Sounds, Symbols, and Sense: sound waves → acoustic/phonetic → morphological/syntactic → semantic/pragmatic → internal representation]

  6. Where are the words? [acoustic/phonetic stage] • “How to recognize speech, not to wreck a nice beach” • “The cat scares all the birds away” • “The cat’s cares are few” • Pauses in speech bear little relation to word breaks • Intonation offers additional clues to meaning

  7. Dissecting words/sentences [morphological/syntactic stage] • “The dealer sold the merchant a dog” • “I saw the Golden bridge flying into San Francisco”

  8. What does it mean? [semantic/pragmatic stage] • “I saw Pathfinder on Mars with a telescope” • “Pathfinder photographed Mars” • “The Pathfinder photograph from Ford has arrived” • “When a Pathfinder fords a river it sometimes mars its paint job.”

  9. What does it mean? [semantic/pragmatic stage] • “Jack went to the store. He found the milk in aisle 3. He paid for it and left.” • “Q: Did you read the report? A: I read Bob’s email.”

  10. The steps in NLP • Morphology: concerns the way words are built up from smaller meaning-bearing units • Syntax: concerns how words are put together to form correct sentences and what structural role each word has • Semantics: concerns what words mean and how these meanings combine in sentences to form sentence meanings Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

  11. The steps in NLP (Cont.) • Pragmatics: concerns how sentences are used in different situations and how use affects the interpretation of the sentence • Discourse: concerns how the immediately preceding sentences affect the interpretation of the next sentence Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

  12. Some of the Tools • Regular expressions and finite state automata • Part-of-speech taggers • N-Grams • Grammars • Parsers • Semantic analysis

  13. Parsing (Syntactic Analysis) • Assigning a syntactic and logical form to an input sentence • Uses knowledge about words and word meanings (lexicon) • Uses a set of rules defining legal structures (grammar) • Example: “Paula ate the apple.” parses to (S (NP (NAME Paula)) (VP (V ate) (NP (ART the) (N apple)))) Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt
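As a minimal sketch of what a parser does with the slide's example sentence: the toy lexicon and hard-coded grammar below are invented for illustration (a real parser would search over general grammar rules), but the output matches the bracketed tree on the slide.

```python
# Toy lexicon mapping each word to its part-of-speech tag.
LEXICON = {"Paula": "NAME", "ate": "V", "the": "ART", "apple": "N"}

def parse(sentence):
    """Parse NAME V ART N sentences into a nested (S (NP ...) (VP ...)) tree.

    Grammar (hard-coded): S -> NP VP; NP -> NAME | ART N; VP -> V NP.
    """
    words = sentence.rstrip(".").split()
    tags = [LEXICON[w] for w in words]
    if tags == ["NAME", "V", "ART", "N"]:
        return ("S",
                ("NP", ("NAME", words[0])),
                ("VP", ("V", words[1]),
                       ("NP", ("ART", words[2]), ("N", words[3]))))
    raise ValueError("sentence not covered by the toy grammar")

tree = parse("Paula ate the apple.")
```

Nested tuples stand in for the slide's bracketed tree notation; printing `tree` reproduces the same structure.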

  14. Word Sense Resolution • Many words have many meanings, or senses • We need to resolve which of the senses of an ambiguous word is invoked in a particular use of the word • “I made her duck.” (cooked a bird for her lunch, or made her move her head quickly downwards?) Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

  15. Reference Resolution • Domain knowledge (registration transaction) • Discourse knowledge • World knowledge • U: I would like to register in a CSC course. • S: Which number? • U: Make it 8520. • S: Which section? • U: Which section is in the evening? • S: Section 1. • U: Then make it that section. Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

  16. Stages of NLP [diagram: surface form → morphological analysis (using the lexicon; produces stems) → syntactic analysis (produces a parse tree) → semantic analysis (produces an internal representation) → discourse analysis (resolve references) → pragmatic analysis (perform action), interacting with the user] Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt

  17. Human Languages • You know ~50,000 words of your primary language, each with several meanings • A six-year-old knows ~13,000 words • For the first 16 years we learn about 1 word every 90 minutes of waking time • Mental grammar generates sentences; virtually every sentence is novel • Three-year-olds already have 90% of grammar • ~6,000 human languages, none of them simple! Adapted from Martin Nowak 2000, Evolutionary biology of language, Phil. Trans. Royal Society London

  18. Human Spoken Language • The most complicated mechanical motion of the human body • Movements must be accurate to within millimeters • Synchronized within hundredths of a second • We can understand up to 50 phonemes/sec (normal speech is 10-15 phonemes/sec) • But if a sound is repeated 20 times/sec we hear a continuous buzz! • All aspects of language processing are involved and manage to keep pace Adapted from Martin Nowak 2000, Evolutionary biology of language, Phil. Trans. Royal Society London

  19. Why Language is Hard • NLP is AI-complete • Abstract concepts are difficult to represent • LOTS of possible relationships among concepts • Many ways to represent similar concepts • Hundreds or thousands of features/dimensions

  20. Why Language is Easy • Highly redundant • Many relatively crude methods provide fairly good results

  21. History of NLP • Prehistory (1940s, 1950s) • Automata theory, formal language theory, Markov processes (Turing, McCulloch & Pitts, Chomsky) • Information theory and probabilistic algorithms (Shannon) • Turing test: can machines think?

  22. History of NLP • Early work: • Symbolic approach • Generative syntax, e.g. the Transformations and Discourse Analysis Project (TDAP, Harris) • AI: pattern matching, logic-based, special-purpose systems • Eliza, the Rogerian therapist: http://www.manifestation.com/neurotoys/eliza.php3 • Stochastic approach • Bayesian methods • Early successes -- $$$$ grants! • By 1966 the US government had spent $20 million on machine translation • Critics: • Bar-Hillel: “no way to disambiguate without deep understanding” • Pierce, NSF 1966 report: “no way to justify work in terms of practical output”

  23. History of NLP • The middle ages (1970-1990) • Stochastic • Speech recognition and synthesis (Bell Labs) • Logic-based • Compositional semantics (Montague) • Definite clause grammars (Pereira & Warren) • Ad hoc AI-based NLU systems • SHRDLU robot in blocks world (Winograd) • Knowledge representation systems at Yale (Schank) • Discourse modeling • Anaphora • Focus/topic (Grosz et al.) • Conversational implicature (Grice)

  24. History of NLP • NLP Renaissance (1990-2000) • Lessons from phonology & morphology successes: • finite-state models are very powerful • probabilistic models pervasive • Web creates new opportunities and challenges • practical applications driving the field again • 21st Century NLP • The web changes everything: • much greater use for NLP • much more data available

  25. Document Features • Most NLP is applied to some quantity of unstructured text. • For simplicity, we will refer to any such quantity as a document. • What features of a document are of interest? • Most common are the actual terms in the document.

  26. Tokenization • Tokenization is the process of breaking up a string of letters into words and other meaningful components (numbers, punctuation, etc.) • Typically broken up at white space • A very standard NLP tool • Language-dependent, and sometimes also domain-dependent, e.g. 3,7-dihydro-1,3,7-trimethyl-1H-purine-2,6-dione • Tokens can also be larger divisions: sentences, paragraphs, etc.
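The caffeine example above shows why tokenization can be domain-dependent. As a rough sketch (the regex and its design choices are illustrative, not NLTK's algorithm): splitting at whitespace and punctuation, but allowing hyphens, commas, and digits inside a token so chemical names survive as one unit.

```python
import re

# One token is either a run of alphanumerics possibly joined by "-" or ","
# (so "3,7-dihydro-..." stays whole), or a single punctuation character.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-,][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

def tokenize(text):
    """Return the list of tokens in text; whitespace is discarded."""
    return TOKEN.findall(text)
```

With these rules, `tokenize("Call me Ishmael.")` yields the three words plus the period as a separate token, while the full caffeine name comes back as a single token.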

  27. Lexical Analyser [state diagram: states 0, 1, 2 with transitions on A-Z and on blank/EOF] • Basic idea is a finite state machine • Triples of input state, transition token, output state • Must be very efficient; gets used a LOT
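A sketch of that finite state machine in code (the state numbering is my reading of the diagram, with state 0 "between words" and state 1 "inside a word"; each branch below corresponds to one (state, input class, next state) triple):

```python
def fsm_tokenize(text):
    """Emit words by walking a two-state machine over the characters."""
    words, current, state = [], "", 0
    for ch in text + " ":            # trailing blank acts as EOF, flushing the last word
        is_letter = ch.isalpha()
        if state == 0 and is_letter:      # (0, A-Z, 1): start a new word
            current, state = ch, 1
        elif state == 1 and is_letter:    # (1, A-Z, 1): extend the current word
            current += ch
        elif state == 1:                  # (1, blank/EOF, 0): emit the word
            words.append(current)
            current, state = "", 0
        # (0, blank, 0): skip whitespace between words
    return words
```

A real lexical analyser would be table-driven for speed (the slide notes it "gets used a LOT"), but the control flow is the same.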

  28. Design Issues for Tokenizer • Punctuation • treat as whitespace? • treat as characters? • treat specially? • Case • fold? • Digits • assemble into numbers? • treat as characters? • treat as punctuation?

  29. NLTK Tokenizer • Natural Language ToolKit • http://text-processing.com/demo/tokenize/ • Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.

  30. Document Features • Once we have a corpus of standardized and possibly tokenized documents, what features of text documents might be interesting? • Word frequencies: Bag of Words (BOW) • Language • Document length -- characters, words, sentences • Named entities • Parts of speech • Average word length • Average sentence length • Domain-specific stuff

  31. Bag of Words • Specifically not interested in order • Frequency of each possible word in this document • A very sparse vector! • In order to assign each count to the correct position, we need to know all the words used in the corpus • Two-pass approach • Reverse (inverted) index
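A sketch of the two-pass approach: the first pass over the corpus fixes the vocabulary (and hence each word's position in the vector), the second builds each document's count vector. Lowercase-and-split tokenization here is a simplification.

```python
from collections import Counter

def bow_vectors(docs):
    """Return (vocabulary, list of count vectors), one vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})    # pass 1: collect words
    index = {w: i for i, w in enumerate(vocab)}                # word -> vector position
    vectors = []
    for toks in tokenized:                                     # pass 2: count per document
        vec = [0] * len(vocab)
        for word, n in Counter(toks).items():
            vec[index[word]] = n
        vectors.append(vec)
    return vocab, vectors

vocab, vecs = bow_vectors(["the cat sat", "the cat scares the birds"])
```

Each vector has one slot per vocabulary word, mostly zeros: exactly the sparse representation the slide describes.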

  32. Simplifying BOW • Do we really want every word? • Stop words • Omit common function words • e.g. http://www.ranks.nl/resources/stopwords.html • Stemming or lemmatization • Convert words to a standard form • The lemma is a standard word; a stem may not be a word at all • Synonym matching • TF*IDF • Use the “most meaningful” words

  33. Stemming • Inflectional stemming: eliminate morphological variants • Singular/plural, present/past • In English, rules plus a dictionary • books -> book, children -> child • Few errors, but many omissions • Root stemming: eliminate inflections and derivations • invention, inventor -> invent • Much more aggressive • Which (if either) to use depends on the problem • http://text-processing.com/demo/stem/
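An inflectional-stemming sketch in the "rules plus a dictionary" spirit: a few suffix rules for regular plurals and a small exception dictionary for irregulars. The specific rules and dictionary entries are illustrative, far smaller than a real stemmer's.

```python
# Exception dictionary for irregular forms the rules can't handle.
IRREGULAR = {"children": "child", "geese": "goose", "went": "go"}

def stem(word):
    """Strip regular plural suffixes; fall back to the exception dictionary."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"                      # ponies -> pony
    if w.endswith(("xes", "ches", "shes", "sses")):
        return w[:-2]                            # boxes -> box, churches -> church
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                            # books -> book
    return w
```

As the slide warns, such rules make few errors but many omissions: anything not covered by a rule or the dictionary passes through unchanged.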

  34. Synonym-matching • Map some words to their synonyms • http://thesaurus.com/browse/computer • Problematic in English • requires a large dictionary • many words have multiple meanings • May be important in specific domains • biological and chemical domains: http://en.wikipedia.org/wiki/Caffeine • any specific domain: Nova, Villanova, V’Nova

  35. TF-IDF • Word frequency is simple, but: • Affected by length of document • Not the best indicator of what a doc is about • Very common words don’t tell us much about differences between documents • Very uncommon words are often typos or idiosyncratic • Term Frequency * Inverse Document Frequency • tf-idf(j) = tf(j) * idf(j), where idf(j) = log(N/df(j))
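The slide's formula translates directly to code. In this sketch the three-document corpus is invented for illustration; note how "the", which appears in every document, gets df(j) = N and hence a score of log(1) = 0, which is exactly the point of the idf term.

```python
import math

def tf_idf(term, doc_tokens, corpus_tokens):
    """tf-idf(j) = tf(j) * idf(j), with idf(j) = log(N / df(j))."""
    tf = doc_tokens.count(term)                        # term frequency in this document
    df = sum(1 for d in corpus_tokens if term in d)    # document frequency across corpus
    n = len(corpus_tokens)                             # N: number of documents
    return tf * math.log(n / df) if df else 0.0

corpus = [["the", "cat", "sat"],
          ["the", "dog", "ran"],
          ["the", "cat", "scares", "the", "birds"]]
```

Here `tf_idf("the", corpus[2], corpus)` is 0.0 despite "the" occurring twice in that document, while the rarer "cat" scores log(3/2) per occurrence.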

  36. Language • Relevant if documents are in multiple languages • May know from source • Determining language itself can be considered a form of NLP. • http://translate.google.com/?hl=en&tab=wT • http://fr.wikipedia.org/ • http://de.wikipedia.org/

  37. Document Counts • Number of characters, words, sentences • Average length of words, sentences, paragraphs • E.g., clustering documents to determine how many authors have written them or how many genres are represented • NLTK + Python makes this easy

  38. Named Entities • What persons, places, companies are mentioned in documents? • “Proper nouns” • One of the most common information extraction tasks • Combination of rules and a dictionary • Example rules: • Capitalized word not at the beginning of a sentence • Two capitalized words in a row • One or more capitalized words followed by “Inc.” • Dictionaries of common names, places, major corporations; sometimes called a “gazetteer”
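A sketch implementing two of the example rules above: runs of capitalized words are candidates, kept if they do not start the sentence, or (when sentence-initial) contain at least two capitalized words in a row. A real system would add the gazetteer lookup; the sentence splitter and punctuation handling here are deliberately crude.

```python
import re

def candidate_entities(text):
    """Return capitalized-word runs that the two rules accept as entity candidates."""
    entities = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = [w.strip(".,!?") for w in sentence.split()]
        i = 0
        while i < len(words):
            if words[i][:1].isupper():
                j = i
                while j < len(words) and words[j][:1].isupper():
                    j += 1                       # extend the run of capitalized words
                run = words[i:j]
                # Rule 1: capitalized but not sentence-initial.
                # Rule 2: two (or more) capitalized words in a row.
                if i > 0 or len(run) >= 2:
                    entities.append(" ".join(run))
                i = j
            else:
                i += 1
    return entities
```

On "I saw Paula Matuszek in Philadelphia." this yields "Paula Matuszek" and "Philadelphia" while correctly ignoring the sentence-initial "I".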

  39. Parts of Speech • Part-of-speech (POS) taggers identify nouns, verbs, adjectives, noun phrases, etc. • Brill’s is the best-known rule-based tagger • More recent work uses machine learning to create taggers from labeled examples • http://text-processing.com/demo/tag/ • http://cst.dk/online/pos_tagger/uk/index.html
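A toy sketch in the rule-based spirit: assign a default tag, then let simple suffix rules override it. The tagset abbreviations follow Penn Treebank conventions, but the rules themselves are invented for illustration and far cruder than Brill's learned transformation rules.

```python
# Suffix rules tried in order; first match wins.
SUFFIX_RULES = [("ing", "VBG"),   # gerund/present participle
                ("ed", "VBD"),    # past tense
                ("ly", "RB"),     # adverb
                ("s", "NNS")]     # plural noun

def tag(tokens):
    """Tag each token: default to noun (NN), then apply suffix rules."""
    tagged = []
    for tok in tokens:
        pos = "NN"
        for suffix, t in SUFFIX_RULES:
            if tok.lower().endswith(suffix):
                pos = t
                break
        tagged.append((tok, pos))
    return tagged
```

This mis-tags plenty of words ("sing" becomes VBG), which is why real taggers learn their rules, or probabilities, from labeled corpora.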

  40. Domain-Specific • Structure in document • email: To, From, Subject, Body • Villanova Campus Currents: News, Academic Notes, Events, Sports, etc. • Tags in document • Medline corpus is tagged with MeSH terms • Twitter feeds may be tagged with #tags. • Intranet documents might have date, source, department, author, etc.

  41. N-Grams • These language features will take you a long way • But it’s not hard to come up with examples where they fail: • Dog bites man. vs. Man bites dog. • Without adding explicit semantics we can still get substantial additional information by considering sequences of words.

  42. Free Association Exercise  • I am going to say some phrases. Write down the next word or two that occur to you. • Microsoft announced a new security ____ • NHL commissioner cancels rest ____ • One Fish, ______ • Colorless green ideas ______ • Conjunction Junction, what’s __________ • Oh, say, can you see, by the dawn’s ______ • After I finished my homework I went _____.

  43. Human Word Prediction • Clearly, at least some of us have the ability to predict future words in an utterance. • How? • Domain knowledge • Syntactic knowledge • Lexical knowledge

  44. Claim • A useful part of the knowledge needed to allow Word Prediction can be captured using simple statistical techniques • In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence)

  45. Applications • Why do we want to predict a word, given some preceding words? • Rank the likelihood of sequences containing various alternative hypotheses, e.g. for automated speech recognition or OCR • Theatre owners say popcorn/unicorn sales have doubled... • Assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation • Como mucho pescado. --> At the most fished.

  46. Real Word Spelling Errors • They are leaving in about fifteen minuets to go to her house. • The study was conducted mainly be John Black. • The design an construction of the system will take more than a year. • Hopefully, all with continue smoothly in my absence. • Can they lave him my messages? • I need to notified the bank of…. • He is trying to fine out. Example from Dorr, http://www.umiacs.umd.edu/~bonnie/courses/cmsc723-04/lecture-notes/Lecture5.ppt

  47. Language Modeling • A fundamental tool in NLP • Main idea: • Some words are more likely than others to follow each other • You can predict that likelihood fairly accurately • In other words, you can build a language model Adapted from Hearst, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/lecture4.ppt

  48. N-Grams • N-Grams are sequences of tokens • The N stands for how many terms are used • Unigram: 1 term • Bigram: 2 terms • Trigram: 3 terms • You can use different kinds of tokens • Character-based n-grams • Word-based n-grams • POS-based n-grams • N-Grams give us some idea of the context around the token we are looking at. Adapted from Hearst, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/lecture4.ppt
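Extracting n-grams is just sliding a window of n tokens along the sequence. In this sketch the same function handles word-based n-grams (pass a token list) and character-based n-grams (pass the characters of a string):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the cat scares all the birds away".split()
bigrams = ngrams(words, 2)     # [("the", "cat"), ("cat", "scares"), ...]
```

A 7-word sentence yields 6 bigrams and 5 trigrams; POS-based n-grams work the same way over a sequence of tags instead of words.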

  49. N-Gram Models of Language • A language model is a model that lets us compute the probability, or likelihood, of a sentence S, P(S). • N-Gram models use the previous N-1 words in a sequence to predict the next word • unigrams, bigrams, trigrams,… • How do we construct or train these language models? • Count frequencies in very large corpora • Determine probabilities using Markov models, similar to POS tagging. CSC 8520 Spring 2013. Paula Matuszek
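A bigram-model sketch of "count frequencies, then determine probabilities": estimate P(w | prev) as count(prev, w) / count(prev) from a tiny, invented training corpus, and score a sentence as the product of its bigram probabilities under the Markov assumption. No smoothing, so any unseen bigram drives P(S) to zero, one reason real models need very large corpora and smoothing.

```python
from collections import Counter

def train(sentences):
    """Count unigrams and bigrams, with <s> marking each sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def prob(sentence, unigrams, bigrams):
    """P(S) = product over i of P(w_i | w_{i-1}) = count(prev, w) / count(prev)."""
    toks = ["<s>"] + sentence.split()
    p = 1.0
    for prev, w in zip(toks, toks[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

uni, bi = train(["the cat sat", "the cat ran", "the dog ran"])
```

Under this model "the cat sat" gets probability 1 * 2/3 * 1/2 = 1/3, while "the dog sat" gets 0 because the bigram (dog, sat) was never observed.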

  50. Counting Words in Corpora • What is a word? • e.g., are cat and cats the same word? • September and Sept? • zero and oh? • Is _ a word? * ? ‘(‘ ? • How many words are there in don’t ? Gonna ? • In Japanese and Chinese text -- how do we identify a word? • Back to stemming!
