Explore the step-by-step process of extracting events from Wikipedia, indexing them by date, and displaying them in a user-friendly web application. Learn about challenges faced, lessons learned, and technologies used, such as MySQL, Lingpipe API, Lucene, and more. Follow the journey of Brandon, Alex, Robert, and Sean in creating this innovative tool.
Timedex.org
Brandon Bell, Alex Loddengaard, Robert Gay, Sean McCarthy
Abstract
• Extract events from Wikipedia
• Index the events based on date of occurrence
• Display the events in a useful webapp
Step 1: Importing Wikipedia into MySQL
• Large dataset
  • 8 GB page links table
  • ~5 GB page table
• Difficulties:
  • Altering tables (adding columns or indexes)
• Lessons learned:
  • Be careful with alter operations on large tables
  • Use Postgres
Step 2: Extracting Sentences and Page Hierarchy
• Used Lingpipe API to find sentences
• Parsed Wikipedia tags to create heading/sentence tree of each page
• Difficulties:
  • Many Wikipedia sentences are terminated by newlines
  • Periods in abbreviations can be confusing
• Lessons learned:
  • 3rd party packages never do exactly what you need
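The two difficulties above (newline-terminated sentences and periods in abbreviations) can be illustrated with a small sketch. This is not the Lingpipe-based pipeline the team used. It is a plain-Python approximation that assumes headings are written in the `== Heading ==` wikitext style, treats every line break as a sentence boundary, and relies on a short, hypothetical abbreviation list (`ABBREVIATIONS`). The names `split_sentences` and `parse_page` are made up for this example.

```python
import re

# Wikitext heading such as "== Early life ==" (2 to 6 equals signs on each side).
HEADING_RE = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")

# Hypothetical abbreviation list; a real system would need a much longer one.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "St.", "Jr.", "U.S."}


def split_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace, skipping abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        last_token = text[start:end].split()[-1]
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:  # text with no terminator, e.g. a line ended only by a newline
        sentences.append(tail)
    return sentences


def parse_page(wikitext):
    """Build a heading/sentence tree; each line break also ends a sentence."""
    root = {"heading": None, "level": 1, "sentences": [], "children": []}
    stack = [root]  # path from the root to the current heading
    for line in wikitext.splitlines():
        line = line.strip()
        if not line:
            continue
        heading = HEADING_RE.match(line)
        if heading:
            level = len(heading.group(1))
            node = {"heading": heading.group(2), "level": level,
                    "sentences": [], "children": []}
            # Climb back up until the parent is shallower than the new heading.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["sentences"].extend(split_sentences(line))
    return root
```

Calling `parse_page` on a page's wikitext returns nested dictionaries in which each sentence hangs off the deepest heading above it, which is the structure the later ranking and summary steps rely on.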
Step 3: Detecting Dates
• Used a set of regular expressions to check for dates
• Difficulties:
  • Deciding which date formats to accept so that few date-like constructs that are not dates get matched
• Lessons learned:
  • Regular expressions are easy to control and tune, so use them if possible
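As a rough illustration of the regular-expression approach, the sketch below tries a few patterns from most to least specific and only accepts a bare year when a preposition precedes it. The pattern set, the preposition list, and the `find_dates` helper are assumptions made for this example; the slides do not say which formats the project actually accepted.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
MONTH_INDEX = {name: i + 1 for i, name in enumerate(MONTHS.split("|"))}

# Ordered from most to least specific; earlier patterns win on overlap.
DATE_PATTERNS = [
    # "July 4, 1776" or "July 4 1776"
    re.compile(rf"\b(?P<month>{MONTHS}) (?P<day>\d{{1,2}}),? (?P<year>\d{{4}})\b"),
    # "4 July 1776"
    re.compile(rf"\b(?P<day>\d{{1,2}}) (?P<month>{MONTHS}) (?P<year>\d{{4}})\b"),
    # "July 1776"
    re.compile(rf"\b(?P<month>{MONTHS}) (?P<year>\d{{4}})\b"),
    # Bare year, accepted only after a preposition to avoid plain numbers.
    re.compile(r"\b(?:in|during|since|by) (?P<year>1\d{3}|20\d{2})\b"),
]


def find_dates(sentence):
    """Return (year, month, day) tuples in order of appearance; None if unknown."""
    found = []  # entries are (start, end, date_tuple)
    for pattern in DATE_PATTERNS:
        for match in pattern.finditer(sentence):
            overlaps = any(match.start() < end and start < match.end()
                           for start, end, _ in found)
            if overlaps:
                continue  # a more specific pattern already claimed this span
            parts = match.groupdict()
            day = parts.get("day")
            date = (int(parts["year"]),
                    MONTH_INDEX.get(parts.get("month")),
                    int(day) if day else None)
            found.append((match.start(), match.end(), date))
    return [date for _, _, date in sorted(found, key=lambda item: item[0])]
```

The trade-off named in the slide shows up directly: requiring a preposition before a bare year keeps "The room holds 1500 people" from matching, but it also misses the second year in "in 1948 and 1951". Tuning means editing one pattern at a time and rerunning.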
Step 4: Event Summaries
• Given a sentence containing a date, find a short phrase describing it
• We thought this would be done best by training an HMM or CRF to extract the events
• Switched to a much simpler system of using headings
• Difficulties:
  • Most sentences do not contain good, complete information!
  • E.g. “He had two children by her in 1948 and 1951.”
• Lessons learned:
  • Try basic methods first, then experiment with more elaborate schemes
  • English is hard; pronouns suck
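One way to picture the heading-based approach: since a sentence full of pronouns does not say who it is about, the page title and the chain of headings above the sentence supply that context. The function below is only a guess at the idea. The name `summarize_event`, the `" > "` separator, and the 80-character cutoff are all assumptions, not details from the project.

```python
def summarize_event(page_title, heading_path, sentence, max_len=80):
    """Prefix a (possibly truncated) sentence with its page and heading context."""
    context = " > ".join([page_title, *heading_path])
    if len(sentence) > max_len:
        # Leave room for the ellipsis so the snippet stays within max_len.
        sentence = sentence[:max_len - 3].rstrip() + "..."
    return f"{context}: {sentence}"
```

For the example sentence above, a call such as `summarize_event("Example Person", ["Personal life"], "He had two children by her in 1948 and 1951.")` at least tells the reader whose children are meant, even though "her" stays unresolved.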
Step 5: Ranking Events
• We have more events than anyone would ever want
• Run PageRank at page level using link table
  • Assume highly linked-to pages contain better events
• Count heading levels
  • Assume events at deep subheading levels are less important
• Empirically, first sentence of page often has most useful event
• Difficulties:
  • PageRank can take until the end of time (177 million links)
• Lessons learned:
  • In rare cases, efficiency does matter!
  • Buy more memory
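The ranking signals above can be sketched in a few lines. The `pagerank` function is a textbook power-iteration version run on an in-memory list of `(source, target)` links, which is fine for a toy graph but would not cope with 177 million links. The `event_score` function combines the three signals using made-up weights (`depth_penalty`, `first_sentence_boost`). The slides do not give the actual formula, so treat both as assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) page links."""
    pages = sorted({page for link in links for page in link})
    out_links = {page: [] for page in pages}
    for source, target in links:
        out_links[source].append(target)

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Pages with no outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for source in pages:
            targets = out_links[source]
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def event_score(page_rank, heading_depth, is_first_sentence,
                depth_penalty=0.5, first_sentence_boost=2.0):
    """Combine the slide's three signals; the weights are hypothetical."""
    score = page_rank * (depth_penalty ** heading_depth)
    if is_first_sentence:
        score *= first_sentence_boost
    return score
```

With this shape, an event in the first sentence of a heavily linked page scores highest, and each extra level of subheading halves the score under the default penalty.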
Step 6: Writing the Webapp
• Makes an asynchronous JS request
• Results returned in JSON
• Uses Lucene to query for keywords
• Difficulties:
  • Creating a distribution of ranks that makes the “hide/show” experience interesting
  • Creating a good-looking site
• Lessons learned:
  • Use JS libraries whenever possible
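The rank-distribution difficulty can be made concrete with a sketch. Raw scores such as PageRank values tend to be heavily skewed, so a "hide/show" control driven by the raw score would reveal almost nothing for most of its range and then everything at once. One possible fix, shown here purely as an assumption about how it could be done, is to turn scores into equal-sized levels before serializing the results as JSON. The field names (`summary`, `score`, `level`) and both function names are invented for this example.

```python
import json


def assign_levels(events, num_levels=5):
    """Give each event a level from 1 (most important) to num_levels.

    Levels are based on rank position rather than raw score, so each
    level holds roughly the same number of events.
    """
    ordered = sorted(events, key=lambda event: event["score"], reverse=True)
    leveled = []
    for position, event in enumerate(ordered):
        level = position * num_levels // len(ordered) + 1
        leveled.append({**event, "level": level})
    return leveled


def to_json_response(events, num_levels=5):
    """Serialize leveled events as the JSON body of an asynchronous response."""
    return json.dumps({"events": assign_levels(events, num_levels)})
```

On the client side, showing "more" events then means raising the maximum visible level by one, and each step reveals a similar number of new events.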
Technologies Used
• Mallet
  • Machine learning API
• Lingpipe
  • Linguistic analysis API
• Hibernate
  • Java persistence layer
• Spring
  • Java MVC framework
• MySQL + Tomcat + Apache
• Scriptaculous
  • JS library
• Lucene
  • Java indexing API
Who Did What
• Brandon
  • Sentence and event extraction, date extraction, Lucene semantics
• Alex
  • MySQL import, PageRank, webapp, data access layer
• Robert
  • Sentence and event extraction, webapp, data access layer
• Sean
  • Sentence and event extraction, hierarchy parsing