Natural Language Processing



  1. Natural Language Processing Lecture Jim Martin

  2. Today • Briefly review sequence labeling • HMMs vs. MEMMs • Information extraction (Ch 22) Speech and Language Processing - Jurafsky and Martin

  3. Statistical Sequence Labeling • With POS tagging we trained systems to tag using annotated training data and HMMs. • Training data • Hand tag a bunch of data with POS tags • We also saw that you can do the same thing for partial parsing (chunking) using a clever scheme for encoding the chunks as tags

  4. Encoding • With IOB encoding you can turn a labeled bracketing task into a tagging task, and then proceed exactly as we did with POS tagging. • We’ll use what’s called IOB labeling to do this • I -> Inside • O -> Outside • B -> Begin

  5. IOB encoding • This example shows the encoding for just base-NPs. There are 3 tags in this scheme. • This example shows full coverage. In this scheme there are 2*N+1 tags, where N is the number of constituent types in your set.
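
The encoding on the last two slides can be sketched in a few lines of Python. This is only an illustration: the function names, the (start, end, label) span format, and the example sentence are assumptions made here, not something taken from the slides.

```python
def spans_to_iob(n_tokens, spans):
    """Convert labeled spans (start, end_exclusive, label) into one IOB tag per token."""
    tags = ["O"] * n_tokens                  # everything starts outside any chunk
    for start, end, label in spans:
        tags[start] = f"B-{label}"           # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # remaining tokens of the chunk
    return tags


def tag_set(labels):
    """One B and one I tag per constituent type, plus a single O: 2*N+1 tags."""
    tags = ["O"]
    for label in labels:
        tags += [f"B-{label}", f"I-{label}"]
    return tags


# Illustrative sentence with two base-NPs (assumed example).
tokens = ["The", "morning", "flight", "from", "Denver", "has", "arrived"]
spans = [(0, 3, "NP"), (4, 5, "NP")]
iob_tags = spans_to_iob(len(tokens), spans)

for token, tag in zip(tokens, iob_tags):
    print(f"{token}\t{tag}")
```

With only base-NPs, `tag_set(["NP"])` has 3 tags; with three constituent types it has 7.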

  6. Methods • Argmax P(Tags|Words) • HMMs • Discriminative Sequence Classification • Using any kind of standard ML-based classifier.

  7. HMM Tagging • Same as with POS tagging • Argmax_T P(T|W) = Argmax_T P(W|T)P(T) • The tags are the hidden states • Works OK, but has one major shortcoming • The typical kinds of things that we might think would be useful in this task aren’t easily squeezed into the HMM model • We’d like to be able to make arbitrary features available for the statistical inference being made.
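
For reference, the HMM objective can be written out step by step. The first three lines are just Bayes' rule plus the fact that P(W) does not depend on T. The last line assumes the usual bigram HMM independence assumptions, which the slide does not spell out.

```latex
\begin{align*}
\hat{T} &= \operatorname*{argmax}_{T} P(T \mid W) \\
        &= \operatorname*{argmax}_{T} \frac{P(W \mid T)\,P(T)}{P(W)} \\
        &= \operatorname*{argmax}_{T} P(W \mid T)\,P(T) \\
        &\approx \operatorname*{argmax}_{t_1 \dots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})
\end{align*}
```

The factored form shows the shortcoming: each term only sees the current word and the previous tag, so there is no obvious place to plug in arbitrary features.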

  8. Supervised Classification • Training a system to take an object represented as a set of features and apply a label to that object. • Methods typically include • Naïve Bayes • Decision Trees • Maximum Entropy (logistic regression) • Support Vector Machines • …

  9. Sequence Classification • Applying this to tagging… • The object to be tagged is a word in the sequence • The features are • features of the word, • features of its immediate neighbors, • and features derived from the entire sentence. • Sequential tagging means sweeping the classifier across the input assigning tags to words as you proceed.

  10. Typical Features • Typical setup involves • A small sliding window around the object being tagged • Features extracted from the window • Current word token • Previous/next N word tokens • Current word POS • Previous/next POS • Previous N chunk labels • Capitalization information • ...
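
A sliding-window feature extractor along these lines might look like the sketch below. The feature names, the `<s>`/`</s>` padding symbols, and the function signature are assumptions for illustration; a real system would pass the resulting dictionary to whichever classifier is being used.

```python
def window_features(tokens, pos_tags, prev_chunk_tags, i, window=2):
    """Features for token i. prev_chunk_tags holds the tags already assigned to tokens 0..i-1."""
    feats = {
        "word": tokens[i],
        "pos": pos_tags[i],
        "is_capitalized": tokens[i][:1].isupper(),
    }
    for k in range(1, window + 1):
        left, right = i - k, i + k
        feats[f"word-{k}"] = tokens[left] if left >= 0 else "<s>"
        feats[f"pos-{k}"] = pos_tags[left] if left >= 0 else "<s>"
        feats[f"word+{k}"] = tokens[right] if right < len(tokens) else "</s>"
        feats[f"pos+{k}"] = pos_tags[right] if right < len(tokens) else "</s>"
        # Only labels to the left exist while sweeping left to right.
        feats[f"chunk-{k}"] = prev_chunk_tags[left] if left >= 0 else "<s>"
    return feats


# Assumed example: tagging "Denver" after the first four words are already labeled.
tokens = ["The", "morning", "flight", "from", "Denver"]
pos_tags = ["DT", "NN", "NN", "IN", "NNP"]
assigned = ["B-NP", "I-NP", "I-NP", "O"]
feats = window_features(tokens, pos_tags, assigned, i=4)
print(feats)
```

Note that chunk labels are only available for earlier positions, which is why the sweep direction matters.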

  11. Statistical Sequence Labeling

  12. Notes... • Viewing complex string tasks as tagging tasks can open up a lot of possibilities • Finite-state methods • Statistical learning methods • The combination of Markov-style sequence processing with standard ML techniques is very productive • We’ll see this combo again and again and again

  13. Evaluation • Suppose you employ this scheme. What’s the best way to measure performance? • Probably not the per-tag accuracy we used for POS tagging. • Why? • It’s not measuring what we care about • We need a metric that looks at the chunks, not the tags

  14. Example • Suppose we were looking for PP chunks for some reason. • If the system simply said O all the time it would do pretty well on a per-label basis since most words reside outside any PP.

  15. Precision/Recall/F • Precision: • The fraction of chunks the system returned that were right • “Right” means the boundaries and the label are correct given some labeled test set. • Recall: • The fraction of the chunks the system should have found that it actually found. • F1: Harmonic mean of those two numbers.

  16. F1 • F1 balances precision and recall: F1 = 2PR / (P + R).
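
To make the contrast concrete, here is a minimal sketch of chunk-level scoring next to per-tag accuracy. The helper names and the toy tag sequences are made up for illustration. Stray I- tags with no open chunk are simply ignored, which is one of several possible conventions.

```python
def extract_chunks(tags):
    """Return the set of (start, end_exclusive, label) chunks in an IOB tag sequence."""
    chunks, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):      # sentinel O closes a final chunk
        if tag.startswith("I-") and start is not None and tag[2:] == label:
            continue                                   # still inside the open chunk
        if start is not None:
            chunks.add((start, i, label))              # close the open chunk
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]                  # open a new chunk
    return chunks


def prf1(gold_tags, pred_tags):
    """Chunk-level precision, recall and F1: boundaries and label must both match."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def tag_accuracy(gold_tags, pred_tags):
    """Per-tag accuracy, the metric used for POS tagging."""
    return sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)


gold = ["O", "O", "O", "B-PP", "I-PP", "O", "O", "O", "O", "O"]
all_o = ["O"] * len(gold)                              # the "always say O" system
system = ["O", "O", "O", "B-PP", "I-PP", "O", "O", "B-PP", "I-PP", "O"]

print(tag_accuracy(gold, all_o), prf1(gold, all_o))
print(tag_accuracy(gold, system), prf1(gold, system))
```

On this toy data the all-O system gets most tags right but finds no chunks at all, so its chunk-level scores are zero.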

  17. Information Extraction • It turns out that these partial parsing and chunking methods for syntax give us the means to address shallow semantic problems as well • Figure out the entities (the players, props, instruments, locations, etc.) in a text • Figure out how they’re related • Figure out what kind of events/activities they’re all up to • And do each of those tasks in a loosely-coupled data-driven manner

  18. Information Extraction • Ordinary newswire text is often used in typical examples. • And there are lots of useful applications out there • But the real interest/money is in specialized domains • Bioinformatics • Electronic medical records • Stock market analysis • Intelligence analysis • Social media

  19. Example Application

  20. Services • Calais (Thomson Reuters) • Alchemy API (Denver based)

  21. Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

  27. Named Entity Recognition • Find and classify all the named entities in a text. • What’s a named entity? • A reference to an entity via the mention of its name. • Colorado Rockies • This is a subset of the possible mentions... • Rockies, the team, it, they... • Find means identify the exact span of the mention. • Classify means determine the category of the entity being referred to.

  28. NER Approaches • As with chunking, there are two basic approaches (and hybrids) • Rule-based (regular expressions) • Lists of names • Patterns to match things that look like names • Patterns to match the environments that classes of names tend to occur in. • ML-based approaches • Get annotated training data • Extract features • Train systems to replicate the annotation
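
The three rule-based ingredients can be sketched with Python's `re` module. Everything specific here is a toy assumption chosen to fit the airline story used in this lecture: the two-entry name list, the "spokesman X Y" context pattern, and the ORG/PER/MONEY labels. Real rule sets are far larger.

```python
import re

# 1. A list of names (toy gazetteer, assumed for illustration).
AIRLINES = {"United Airlines", "American Airlines"}

# 2. A pattern for things that look like a certain kind of entity (dollar amounts).
MONEY = re.compile(r"\$\d+(?:\.\d+)?")

# 3. A pattern for an environment that person names tend to occur in.
PERSON_CONTEXT = re.compile(
    r"\bspokes(?:man|woman|person) ([A-Z][a-z]+(?: [A-Z][a-z]+)+)"
)


def rule_based_ner(text):
    """Return sorted (start, end, label) character spans found by the toy rules."""
    entities = []
    for name in AIRLINES:
        for m in re.finditer(re.escape(name), text):
            entities.append((m.start(), m.end(), "ORG"))
    for m in MONEY.finditer(text):
        entities.append((m.start(), m.end(), "MONEY"))
    for m in PERSON_CONTEXT.finditer(text):
        entities.append((m.start(1), m.end(1), "PER"))   # group 1 is the name itself
    return sorted(entities)


text = ("American Airlines, a unit of AMR, immediately matched the move, "
        "spokesman Tim Wagner said.")
found = [(text[s:e], label) for s, e, label in rule_based_ner(text)]
print(found)
```

Note what the toy rules miss: AMR is not in the list and matches no pattern, which is the usual coverage problem that pushes people toward the ML-based approach.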

  29. ML Approach

  30. Encoding for Sequence Labeling • We can use the same IOB encoding here that we used for chunking: • For N classes we have 2*N+1 tags • An I and B for each class and an O for outside any class. • Each token in a text gets a tag.

  31. NER Features

  32. NER Features • The usefulness of different features varies by domain and by language • But features should be superficial and easily extracted from the text to be analyzed • Can’t solve a problem by using a feature that’s harder to extract than the actual problem! • The “shape” feature turns out to be amazingly useful in many domains
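
One common way to compute a shape feature is to map each character to a class symbol. The sketch below assumes the mapping uppercase to X, lowercase to x, digit to d, with an optional collapsed variant. The slide does not say which mapping was intended.

```python
import re


def word_shape(token, collapse=False):
    """Map characters to classes: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    if collapse:
        # Squeeze runs of the same symbol so that word length no longer matters.
        shape = re.sub(r"(.)\1+", r"\1", shape)
    return shape


for token in ["United", "AMR", "DC10-30", "$6", "I.M.F."]:
    print(token, word_shape(token), word_shape(token, collapse=True))
```

This is about as superficial as a feature gets: it needs nothing beyond the token's characters.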

  33. NER as Sequence Labeling

  34. NER Evaluation • As with chunking, it is a bad idea to evaluate sequence labelers at the tag level. • Most labels are O, so just guessing O gives a learning algorithm a lot of credit. • So we need to evaluate precision, recall and F at the entity level. • But we may not care equally about all kinds of entities • So we might weight them differently in our evaluation.

  35. Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

  36. NER and Entities • Traditionally, NER only refers to entities that are referred to with an explicit mention of a name. • “Jane Smith” vs. “she” • “Twitter” vs. “it” or “they” • “Tesla Model S” vs. “the car” • General entity reference and tracking is a bigger problem.

  37. Relations • Once you have captured the entities in a text, you might want to ascertain how they relate to one another. • Here we’re just talking about explicitly stated relations

  38. Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

  39. Relation Types • As with named entities, the list of relations is application specific. For generic news texts...

  40. Relations • By relation we really mean sets of tuples. • Think about populating a database.
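
As a rough picture of "sets of tuples" feeding a database, the sketch below stores a few relations read off the airline story in an in-memory SQLite table. The relation names (`part_of`, `spokesman_for`) and the table layout are invented for illustration and are not a standard inventory.

```python
import sqlite3

# Each relation instance is a tuple: (relation type, argument 1, argument 2).
relations = {
    ("part_of", "American Airlines", "AMR"),
    ("part_of", "United Airlines", "UAL"),
    ("spokesman_for", "Tim Wagner", "American Airlines"),
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relation (type TEXT, arg1 TEXT, arg2 TEXT, "
    "UNIQUE (type, arg1, arg2))"          # a set: no duplicate tuples
)
conn.executemany("INSERT OR IGNORE INTO relation VALUES (?, ?, ?)", sorted(relations))

# Once the tuples are in a table, questions about the text become queries.
units = conn.execute(
    "SELECT arg1, arg2 FROM relation WHERE type = 'part_of' ORDER BY arg1"
).fetchall()
print(units)
```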

  41. Relation Analysis • We can divide relation analysis into two parts • Determining if 2 entities are related • And if they are, classifying the relation • There are 2 reasons to do this • Cutting down on training time for classification by eliminating most pairs • Producing separate feature-sets that are appropriate for each task.

  42. Relation Analysis • Let’s just worry about named entities within the same sentence
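
The two-part setup, restricted to pairs within a sentence, might be wired together as in the sketch below. The `is_related` and `classify` arguments stand in for two separately trained classifiers. They are placeholders, as are the function names and the (text, type) mention format.

```python
from itertools import combinations


def candidate_pairs(sentences):
    """Yield (sentence index, mention 1, mention 2) for every pair in the same sentence."""
    for sent_id, mentions in enumerate(sentences):
        for m1, m2 in combinations(mentions, 2):
            yield sent_id, m1, m2


def analyze(sentences, is_related, classify):
    """Stage 1 filters pairs; stage 2 labels only the pairs that survive."""
    results = []
    for _, m1, m2 in candidate_pairs(sentences):
        if is_related(m1, m2):                              # binary decision
            results.append((classify(m1, m2), m1[0], m2[0]))  # relation type
    return results


# Named-entity mentions per sentence, as (text, type) pairs, from the airline story.
sentences = [
    [("American Airlines", "ORG"), ("AMR", "ORG"), ("Tim Wagner", "PER")],
    [("United", "ORG"), ("UAL", "ORG")],
]
pairs = list(candidate_pairs(sentences))
print(len(pairs), "candidate pairs")
```

Because most candidate pairs are unrelated, the first stage keeps the relation classifier from having to train on, or score, the bulk of them.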

  43. Features • We can group the features (for both tasks) into three categories • Features of the named entities involved • Features derived from the words between and around the named entities • Features derived from the syntactic environment that governs the two entities

  44. Features • Features of the entities • Their types • Concatenation of the types • Headwords of the entities • George Washington Bridge • Words in the entities • Features between and around • Particular positions to the left and right of the entities • +/- 1, 2, 3 • Bag of words between

  45. Features • Syntactic environment • Constituent path through the tree from one to the other • Base syntactic chunk sequence from one to the other • Dependency path

  46. Example • For the following example, we’re interested in the possible relation between American Airlines and Tim Wagner. • American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
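
For this pair, the entity features and the between/around features could be extracted roughly as follows. The sketch leaves out the syntactic features, since those need a parser. The tokenization, the span format, and the "last token is the headword" shortcut are simplifying assumptions.

```python
def relation_features(tokens, e1, e2, window=1):
    """e1 and e2 are (start, end_exclusive, type) token spans, with e1 before e2."""
    (s1, end1, type1), (s2, end2, type2) = e1, e2
    feats = {
        "type1": type1,
        "type2": type2,
        "types": f"{type1}-{type2}",            # concatenation of the types
        "head1": tokens[end1 - 1],              # crude headword: last token of the mention
        "head2": tokens[end2 - 1],
        # Bag of words between the two mentions (punctuation dropped, lowercased).
        "between": sorted({w.lower() for w in tokens[end1:s2] if w.isalpha()}),
    }
    for k in range(1, window + 1):
        left, right = s1 - k, end2 - 1 + k
        feats[f"left-{k}"] = tokens[left] if left >= 0 else "<s>"
        feats[f"right+{k}"] = tokens[right] if right < len(tokens) else "</s>"
    return feats


tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",", "immediately",
          "matched", "the", "move", ",", "spokesman", "Tim", "Wagner", "said", "."]
american = (0, 2, "ORG")
wagner = (14, 16, "PER")
feats = relation_features(tokens, american, wagner)
print(feats)
```

The word "spokesman" landing in the between-bag is the kind of cue a relation classifier can pick up on.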

  47. Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

  48. Case Study: Bioinformatic NLP • An example domain • Very important: basic science, clinical practice, billing, etc. • Practitioners care about the technology • They have problems they’re trying to solve • Lots and lots of text available • Lots of interesting problems

  49. Lots and Lots of Text

  50. Lots and Lots of Text • Electronic health records are the next frontier for this kind of research • Mandated for health care providers