Information Extraction

MAS.S60 • Catherine Havasi and Rob Speer

Wikipedia as a corpus • 3.9 million English articles; editions in 284 languages • 2 billion words • By comparison, the Brown corpus has 1 million words • DBpedia and Freebase

Text reveals relations • “Various explanations of the overabundance of carbon, oxygen, nitrogen, and other elements have been proposed.” • “These were performed in town halls and other large buildings...” • “The splendid artistic legacy of Angkor Wat and other Khmer monuments...”
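The three quotations share the surface pattern “X and other Y”, which suggests that X is a kind of Y. A minimal sketch of matching it with a regular expression follows. The function name `and_other_pairs` and the choice to capture one word before the cue and up to two words after it are assumptions made for illustration, not something the slides specify.

```python
import re

# Assumed pattern: one word before "and other", one or two words after it.
# Real noun-phrase boundaries would need a tagger or chunker.
AND_OTHER = re.compile(r"(\w+),? and other (\w+(?: \w+)?)")


def and_other_pairs(text):
    """Return (x, y) word spans for every 'X and other Y' match in text."""
    return [(m.group(1), m.group(2)) for m in AND_OTHER.finditer(text)]


if __name__ == "__main__":
    sentences = [
        "Various explanations of the overabundance of carbon, oxygen, "
        "nitrogen, and other elements have been proposed.",
        "These were performed in town halls and other large buildings...",
        "The splendid artistic legacy of Angkor Wat and other Khmer monuments...",
    ]
    for sentence in sentences:
        print(and_other_pairs(sentence))
```

On the second and third quotations this yields `('halls', 'large buildings')` and `('Wat', 'Khmer monuments')`; on the first it yields `('nitrogen', 'elements have')`. The truncated “Wat” and the stray “have” show where word-level matching falls short, which is one reason a part-of-speech tagger or named entity extractor may be worth adding.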

NACLO puzzle • Would it be plausible to describe something as “danty but sloshful”?

Possible patterns • both X and Y • X but not Y • use NP to VP • [Un]fortunately, VP
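Two of the patterns listed above, “both X and Y” and “X but not Y”, can be tried with plain regular expressions before any tagging is added. The sketch below is only one way to set this up; the pattern names, the `match_patterns` helper, and the single-word slots for X and Y are assumptions for illustration.

```python
import re

# Hypothetical pattern table: each entry captures single-word X and Y slots.
PATTERNS = {
    "both_and": re.compile(r"\bboth (\w+) and (\w+)", re.IGNORECASE),
    "but_not": re.compile(r"\b(\w+) but not (\w+)", re.IGNORECASE),
}


def match_patterns(text):
    """Return (pattern_name, x, y) for every pattern match found in text."""
    hits = []
    for name, regex in PATTERNS.items():
        for m in regex.finditer(text):
            hits.append((name, m.group(1), m.group(2)))
    return hits


if __name__ == "__main__":
    print(match_patterns("The soup was both hot and spicy, cheap but not bland."))
```

The remaining patterns, “use NP to VP” and “[Un]fortunately, VP”, refer to phrase types rather than words, so they would likely need part-of-speech tags or a chunker rather than a regular expression over raw text.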

Constraints using named entities

Constraints using named entities and parts of speech

TextRunner • Starts out with some seed patterns • Label: Uses those patterns to label possible extractions in a sentence • Learn: Trains a graphical model on the labeled extractions • Extract: Uses the learned model to extract relations from each sentence • Problem: 200,000 – 300,000 labeled training points needed

ReVerb • Syntactic constraint: requires the extraction to match syntactic patterns • Lexical constraint: relation phrases must occur with many different arguments in the corpus
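The two ReVerb constraints can be sketched as a pair of filters. The code below assumes a simplified form of the syntactic pattern (a verb, optionally followed by other words and ending in a preposition or particle) over a coarse, made-up one-letter tag set, and treats the lexical threshold `min_distinct_args` as a hypothetical parameter. It is an illustration of the idea, not ReVerb's actual implementation.

```python
import re
from collections import defaultdict

# Assumed coarse tags, one letter per token:
#   V = verb, W = noun/adjective/adverb/pronoun/determiner, P = preposition/particle
# Simplified relation shape: V | V P | V W* P
RELATION_SHAPE = re.compile(r"V(W*P)?")


def passes_syntactic(tags):
    """True if the tag sequence of a relation phrase fits the assumed shape."""
    return RELATION_SHAPE.fullmatch("".join(tags)) is not None


def passes_lexical(extractions, min_distinct_args=2):
    """Return relation phrases seen with at least min_distinct_args argument pairs.

    extractions is an iterable of (arg1, relation_phrase, arg2) triples.
    """
    args_by_relation = defaultdict(set)
    for arg1, relation, arg2 in extractions:
        args_by_relation[relation].add((arg1, arg2))
    return {
        relation
        for relation, pairs in args_by_relation.items()
        if len(pairs) >= min_distinct_args
    }


if __name__ == "__main__":
    print(passes_syntactic(["V", "W", "W", "P"]))  # verb ... preposition
    print(passes_syntactic(["V", "W"]))            # ends in a non-preposition
    triples = [
        ("Paris", "is in", "France"),
        ("Boston", "is in", "Massachusetts"),
        ("X", "is offering only modest targets at", "Y"),
    ]
    print(passes_lexical(triples))
```

The lexical filter is what removes overly specific relation phrases: a phrase that appears with only one argument pair in a large corpus is unlikely to name a general relation.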

Accuracy of IE • Incoherent extractions make up 15-30% of extracted knowledge bits • Uninformative extractions make up 3-7%

Tom Mitchell (NELL) • Unsupervised learning machine

Categories on Wikipedia (Dan Weld)

How Kylin Works

Word senses on Wikipedia

Named entities on Wikipedia? [[Pigeon photography]] is an [[aerial photography]] technique invented in 1907 by the German apothecary [[Julius Neubronner]]...
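The double-bracket links in the wikitext above already mark spans that point at articles, so they can serve as candidate entity mentions. A minimal sketch of pulling them out follows; the `wiki_links` helper is a hypothetical name, and the regular expression handles only plain `[[Target]]` and piped `[[Target|label]]` links, not templates, files, or nested markup.

```python
import re

# Matches [[Target]] and [[Target|label]]; nested brackets are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def wiki_links(wikitext):
    """Return (target_article, displayed_text) for each simple wiki link."""
    return [
        (m.group(1), m.group(2) or m.group(1))
        for m in WIKILINK.finditer(wikitext)
    ]


if __name__ == "__main__":
    sample = (
        "[[Pigeon photography]] is an [[aerial photography]] technique "
        "invented in 1907 by the German apothecary [[Julius Neubronner]]..."
    )
    for target, label in wiki_links(sample):
        print(target, "|", label)
```

Not every link target is a named entity: “aerial photography” is a common noun phrase, while “Julius Neubronner” is a person. Deciding between the two still needs some further signal, such as capitalization or the article's categories.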

Downloading Wikipedia and other Wikimedia projects • A 2200-article sample is available on the class web site

Lab • Find an information pattern besides the ones we’ve listed • Run it over the Wikipedia front page corpus • Does it need a tagger? A named entity extractor?

Assignment • Choose and refine an information extractor • Hand-tag some examples • Add a classifier for good vs. bad matches • You are allowed to work in groups • Sharing code is fine, but one writeup per person
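Once some matches have been hand-tagged, a good-vs-bad classifier can be scored against those tags. The sketch below is one possible setup: the `evaluate` and `baseline_classifier` names, the tiny auxiliary-verb list, and the three tagged examples are all invented for illustration, and the baseline is a single hand-written rule rather than a trained model.

```python
# Hypothetical rule: a match is bad if its right-hand span ends in an auxiliary verb.
AUXILIARIES = {"have", "has", "had", "is", "are", "was", "were"}


def baseline_classifier(match):
    """Return True if the (x, y) match looks good under the assumed rule."""
    _, y = match
    return y.split()[-1].lower() not in AUXILIARIES


def evaluate(tagged, classify):
    """Compute (precision, recall) of classify against hand tags.

    tagged is a list of (match, is_good) pairs, where is_good is the hand tag.
    """
    tp = fp = fn = 0
    for match, is_good in tagged:
        predicted = classify(match)
        if predicted and is_good:
            tp += 1
        elif predicted and not is_good:
            fp += 1
        elif not predicted and is_good:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative hand tags, not real annotation data.
    hand_tagged = [
        (("halls", "large buildings"), True),
        (("Wat", "Khmer monuments"), False),   # left span is truncated
        (("nitrogen", "elements have"), False),
    ]
    print(evaluate(hand_tagged, baseline_classifier))
```

A learned classifier can be dropped in by passing a different `classify` function, which keeps the hand-tagged examples and the scoring code unchanged.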
