
Information Extraction

Information Extraction. MAS.S60, Catherine Havasi and Rob Speer. Wikipedia as a corpus: 3.9 million English articles, 284 languages, 2 billion words (the Brown corpus has 1 million); DBpedia and Freebase. Text reveals relations.


Presentation Transcript


  1. Information Extraction • MAS.S60 • Catherine Havasi, Rob Speer

  2. Wikipedia as a corpus • 3.9 million English articles, 284 languages • 2 billion words • For comparison, the Brown corpus has 1 million words • DBpedia and Freebase

  3. Text reveals relations • “Various explanations of the overabundance of carbon, oxygen, nitrogen, and other elements have been proposed.” • “These were performed in town halls and other large buildings...” • “The splendid artistic legacy of Angkor Wat and other Khmer monuments...”
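
All three example sentences share the cue "X and other Y", which suggests that X is a kind of Y. A minimal sketch of pulling that relation out with plain string handling follows. The function name `and_other` and the `STOP` word list are assumptions made for illustration, and without a noun-phrase chunker the guess at X is crude (only the last word of each item is kept).

```python
import re

# Hypothetical stop list: words that end the noun phrase after "and other".
STOP = {"have", "has", "is", "are", "was", "were", "in", "of", "that", "which"}

def and_other(sentence):
    """Guess (hyponyms, hypernym) from a sentence containing 'X and other Y'."""
    left, sep, right = sentence.partition(" and other ")
    if not sep:
        return None
    # Hypernym: the words after "and other", up to a stop word or punctuation.
    hypernym = []
    for word in re.findall(r"[A-Za-z]+|[^\sA-Za-z]", right):
        if not word.isalpha() or word.lower() in STOP:
            break
        hypernym.append(word)
    # Hyponyms: comma-separated items before "and other"; keep only the last
    # word of each item as a rough stand-in for its head noun.
    items = [item.split()[-1] for item in left.split(",") if item.strip()]
    return items, " ".join(hypernym)

sentence = ("Various explanations of the overabundance of carbon, oxygen, "
            "nitrogen, and other elements have been proposed.")
print(and_other(sentence))  # (['carbon', 'oxygen', 'nitrogen'], 'elements')
```

On the second example the same function returns `(['halls'], 'large buildings')`, which shows why a chunker or tagger would help: the intended X is "town halls", not "halls".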

  4. NACLO puzzle • Would it be plausible to describe something as “danty but sloshful”?

  5. Possible patterns • both X and Y • X but not Y • use NP to VP • [Un]fortunately, VP
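
The first two patterns can be tried directly as regular expressions when X and Y are single words. The sketch below assumes exactly that simplification; the names `PATTERNS` and `match_patterns` are illustrative. The patterns "use NP to VP" and "[Un]fortunately, VP" need phrase boundaries, so they would require a tagger or chunker rather than a bare regex.

```python
import re

# Single-word versions of two of the listed patterns (a simplifying assumption).
PATTERNS = {
    "both_and": re.compile(r"\bboth (\w+) and (\w+)", re.IGNORECASE),
    "but_not": re.compile(r"\b(\w+) but not (\w+)", re.IGNORECASE),
}

def match_patterns(text):
    """Return a list of (pattern_name, x, y) for every match in text."""
    results = []
    for name, regex in PATTERNS.items():
        for m in regex.finditer(text):
            results.append((name, m.group(1), m.group(2)))
    return results

print(match_patterns("The soup was both hot and spicy, cheap but not bland."))
# [('both_and', 'hot', 'spicy'), ('but_not', 'cheap', 'bland')]
```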

  6. Constraints using named entities

  7. Constraints using named entities and parts of speech
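
One way to read the two slides above: a pattern can be restricted so that its slots only accept tokens with a given entity label or part-of-speech tag. The sketch below assumes hand-tagged input (the tags are written by hand for illustration, not produced by a real tagger) and a hypothetical pattern `PERSON+ VB NN+`.

```python
# Hand-tagged tokens: (word, part-of-speech tag, entity label or None).
# These tags are illustrative assumptions, not output from a real tagger.
SENTENCE = [
    ("Julius", "NNP", "PERSON"), ("Neubronner", "NNP", "PERSON"),
    ("invented", "VBD", None),
    ("pigeon", "NN", None), ("photography", "NN", None),
    ("in", "IN", None), ("1907", "CD", "DATE"),
]

def person_verb_noun(tokens):
    """Return (person, verb, noun phrase) if tokens start with PERSON+ VB NN+."""
    i = 0
    person = []
    # Slot 1: one or more tokens labeled as a PERSON entity.
    while i < len(tokens) and tokens[i][2] == "PERSON":
        person.append(tokens[i][0])
        i += 1
    # Slot 2: exactly one verb (any Penn Treebank VB* tag).
    if not person or i >= len(tokens) or not tokens[i][1].startswith("VB"):
        return None
    verb = tokens[i][0]
    i += 1
    # Slot 3: one or more nouns that are not themselves named entities.
    noun = []
    while (i < len(tokens) and tokens[i][1].startswith("NN")
           and tokens[i][2] is None):
        noun.append(tokens[i][0])
        i += 1
    if not noun:
        return None
    return " ".join(person), verb, " ".join(noun)

print(person_verb_noun(SENTENCE))
# ('Julius Neubronner', 'invented', 'pigeon photography')
```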

  8. TextRunner • Starts out with some seed patterns • Label: Uses those to label possible extractions in a sentence • Learn: Using a graphical model • Extract: Using the learned patterns, extract relations from the sentence • Problem: 200,000 – 300,000 labeled training points needed

  9. ReVerb • Syntactic Constraint • Requires extraction to match syntactic patterns • Lexical Constraint • Phrases must have many different arguments in the corpus
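
Both constraints can be sketched in a few lines. The version below is a simplification, not ReVerb's actual implementation: the syntactic check collapses Penn Treebank tags into three classes and accepts relation phrases shaped like V, V P, or V W* P, and the lexical check keeps a relation phrase only if it occurs with at least `min_distinct_args` distinct argument pairs. The function names, the tag grouping, and the threshold value are assumptions.

```python
import re
from collections import defaultdict

def tag_class(pos):
    """Collapse a Penn Treebank tag into V (verb), P (prep/particle/to), W, or O."""
    if pos.startswith("VB"):
        return "V"
    if pos in {"IN", "RP", "TO"}:
        return "P"
    if pos.startswith(("NN", "JJ", "RB", "PRP")) or pos == "DT":
        return "W"
    return "O"

# Relation phrase shapes: V, V P, or V W* P.
RELATION = re.compile(r"V(?:W*P)?")

def satisfies_syntactic_constraint(pos_tags):
    """True if the tag sequence of a candidate relation phrase fits the shape."""
    classes = "".join(tag_class(t) for t in pos_tags)
    return RELATION.fullmatch(classes) is not None

def lexically_plausible(extractions, min_distinct_args=2):
    """Return relation phrases seen with enough distinct (arg1, arg2) pairs."""
    args = defaultdict(set)
    for arg1, rel, arg2 in extractions:
        args[rel].add((arg1, arg2))
    return {rel for rel, pairs in args.items() if len(pairs) >= min_distinct_args}

print(satisfies_syntactic_constraint(["VBZ", "DT", "NN", "IN"]))  # True  ("is a city in")
print(satisfies_syntactic_constraint(["VBD", "DT", "NN"]))        # False ("invented the camera")
```

The lexical constraint is what filters out overly specific phrases: a long relation phrase that appears with only one argument pair in the whole corpus is unlikely to be a reusable relation.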

  10. Accuracy of IE • Incoherent extractions make up 15-30% of extracted knowledge bits • Uninformative extractions 3-7%

  11. Tom Mitchell (NELL) • Unsupervised learning machine

  12. Categories on Wikipedia (Dan Weld)

  13. How Kylin Works

  14. Word senses on Wikipedia

  15. Named entities on Wikipedia? [[Pigeon photography]] is an [[aerial photography]] technique invented in 1907 by the German apothecary [[Julius Neubronner]]...
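
The double-bracket links in the wikitext above already mark candidate entities, so a first pass can simply collect them. A minimal sketch, assuming standard `[[target]]` and `[[target|display text]]` link syntax; the names `WIKILINK` and `wikilinks` are illustrative.

```python
import re

# [[target]] or [[target|display text]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")

def wikilinks(wikitext):
    """Return (target, display text) pairs for each link in the wikitext."""
    return [(m.group(1), m.group(2) or m.group(1))
            for m in WIKILINK.finditer(wikitext)]

text = ("[[Pigeon photography]] is an [[aerial photography]] technique "
        "invented in 1907 by the German apothecary [[Julius Neubronner]]...")
print([target for target, _ in wikilinks(text)])
# ['Pigeon photography', 'aerial photography', 'Julius Neubronner']
```

Not every link target is a named entity ("aerial photography" is a concept, not a name), so the links are better treated as candidates to filter than as finished annotations.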

  16. Downloading Wikipedia and other Wikimedia projects • A 2200-article sample is available on the class web site

  17. Lab • Find an information pattern besides the ones we’ve listed • Run it over the Wikipedia front page corpus • Does it need a tagger? A named entity extractor?

  18. Assignment • Choose and refine an information extractor • Hand-tag some examples • Add a classifier for good vs. bad matches • You are allowed to work in groups • Sharing code is fine, but one writeup per person
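
For the "classifier for good vs. bad matches" step, one small option is a perceptron trained on hand-tagged matches. The sketch below is only one possible shape for it: the features, the four labeled examples, and all function names are invented for illustration and are not part of the assignment handout.

```python
from collections import defaultdict

PRONOUNS = {"it", "this", "these", "they"}

def features(hyponym, hypernym):
    """Hypothetical features for one 'X and other Y' match."""
    return {
        "bias": 1.0,
        "hypernym_plural": 1.0 if hypernym.endswith("s") else 0.0,
        "hyponym_is_pronoun": 1.0 if hyponym.lower() in PRONOUNS else 0.0,
    }

def score(weights, feats):
    return sum(weights[name] * value for name, value in feats.items())

def train_perceptron(examples, epochs=10):
    """examples: list of (feature dict, label) with label +1 (good) or -1 (bad)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            # Update only on a mistake (or a tie at zero).
            if label * score(weights, feats) <= 0:
                for name, value in feats.items():
                    weights[name] += label * value
    return weights

def predict(weights, feats):
    return 1 if score(weights, feats) > 0 else -1

# Hand-tagged matches (invented for illustration): +1 = good, -1 = bad.
TAGGED = [
    ("nitrogen", "elements", 1),
    ("halls", "large buildings", 1),
    ("it", "things", -1),
    ("these", "reasons", -1),
]
weights = train_perceptron([(features(x, y), label) for x, y, label in TAGGED])
print(predict(weights, features("oxygen", "gases")))  # 1
print(predict(weights, features("they", "people")))   # -1
```

With real data, the interesting work is in choosing features (tags, entity labels, context words) and in hand-tagging enough examples that a held-out set can be used to measure accuracy.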
