
Timedex

Explore the step-by-step process of extracting events from Wikipedia, indexing them by date, and displaying them in a user-friendly web application. Learn about challenges faced, lessons learned, and technologies used, such as MySQL, Lingpipe API, Lucene, and more. Follow the journey of Brandon, Alex, Robert, and Sean in creating this innovative tool.

rketchum

Presentation Transcript


  1. Timedex.org
     • Brandon Bell
     • Alex Loddengaard
     • Robert Gay
     • Sean McCarthy

  2. Abstract
     • Extract events from Wikipedia
     • Index the events based on date of occurrence
     • Display the events in a useful webapp

  3. Step 1: Importing Wikipedia into MySQL
     • Large dataset
       • 8GB page links table
       • ~5GB page table
     • Difficulties:
       • Altering tables (adding columns or indexes)
     • Lessons learned:
       • Be careful with alter operations on large tables
       • Use Postgres

  4. Step 2: Extracting Sentences and Page Hierarchy
     • Used Lingpipe API to find sentences
     • Parsed Wikipedia tags to create heading/sentence tree of each page
     • Difficulties:
       • Many Wikipedia sentences are terminated by newlines
       • Periods in abbreviations can be confusing
     • Lessons learned:
       • 3rd party packages never do exactly what you need
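The heading/sentence tree from Step 2 can be sketched roughly as follows. This is a minimal illustration, not the project's code: the project used Lingpipe for sentence detection, while this sketch swaps in a naive regular-expression splitter, and the names `parse_page`, `HEADING`, and `SENTENCE_END` are hypothetical. It assumes section headings use the standard wikitext form `== Heading ==`.

```python
import re

# Wikitext section heading: "== Title ==" (level 2) up to "====== Title ======".
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")

# Naive stand-in for a real sentence detector: split after ., ! or ? when
# the next word is capitalized. Abbreviations such as "Dr. Smith" will be
# split incorrectly, which is one of the difficulties the slide mentions.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def parse_page(title, wikitext):
    """Build a tree of headings, each holding the sentences found under it."""
    root = {"heading": title, "level": 1, "sentences": [], "children": []}
    stack = [root]  # path from the root to the section currently being filled
    for line in wikitext.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            node = {"heading": match.group(2), "level": level,
                    "sentences": [], "children": []}
            # Climb back up until the top of the stack is a shallower heading.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Each line is split on its own, so a newline always ends a
            # sentence even when the closing period is missing.
            stack[-1]["sentences"].extend(
                s for s in SENTENCE_END.split(line) if s
            )
    return root
```

Keeping the path to the current section on a stack means each sentence can later be reported together with the headings above it.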

  5. Step 3: Detecting Dates
     • Used a set of regular expressions to check for dates
     • Difficulties:
       • Deciding which date formats to accept so that date-like constructs that are not dates are rarely matched
     • Lessons learned:
       • Regular expressions are easy to control and tune, so use them if possible
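A small sketch of what regex-based date detection might look like. The slides do not list the expressions actually used, so every pattern here is an assumption chosen for illustration, as are the names `DATE_PATTERNS` and `find_dates`. The bare-year pattern shows the trade-off described above: it only accepts a year after a preposition, which rejects most non-date numbers but also misses some real dates.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical patterns, ordered from most to least specific.
DATE_PATTERNS = [
    # "July 4, 1776" or "July 4 1776"
    re.compile(rf"\b(?:{MONTHS}) \d{{1,2}},? \d{{3,4}}\b"),
    # "4 July 1776"
    re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{3,4}}\b"),
    # "July 1776"
    re.compile(rf"\b(?:{MONTHS}) \d{{3,4}}\b"),
    # "in 1948": a bare year counts only after a preposition, so plain
    # four-digit numbers such as "1200 soldiers" are not matched.
    re.compile(r"\b(?:in|since|by|until) (\d{4})\b"),
]


def find_dates(sentence):
    """Return date-like substrings in order of appearance."""
    found = []
    taken = []  # character spans already claimed by a more specific pattern
    for pattern in DATE_PATTERNS:
        for match in pattern.finditer(sentence):
            # Use the captured group when the pattern has one, else the whole match.
            start, end = match.span(match.lastindex or 0)
            if any(start < e and s < end for s, e in taken):
                continue
            taken.append((start, end))
            found.append((start, sentence[start:end]))
    return [text for _, text in sorted(found)]
```

On the example sentence from Step 4, “He had two children by her in 1948 and 1951.”, this sketch finds 1948 but not 1951, which illustrates why tuning the accepted formats is the hard part.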

  6. Step 4: Event Summaries
     • Given a sentence containing a date, find a short phrase describing it
     • We thought this would be done best by training an HMM or CRF to extract the events
     • Switched to a much simpler system of using headings
     • Difficulties:
       • Most sentences do not contain good, complete information!
       • E.g. “He had two children by her in 1948 and 1951.”
     • Lessons learned:
       • Try basic methods first, then experiment with more elaborate schemes
       • English is hard; pronouns suck
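The slides do not say exactly how headings were turned into summaries, so the following is only one plausible reading: label each dated sentence with the page title and the section headings above it, so that a pronoun such as “He” is resolved by the label. The function name, the separator, and the length limit are all hypothetical.

```python
def summarize_event(heading_path, sentence, max_length=120):
    """Label a dated sentence with its heading context.

    heading_path: headings from the page title down to the section that
    contains the sentence (an assumed input, e.g. taken from a heading tree).
    """
    label = " > ".join(heading_path)
    if len(sentence) > max_length:
        # Truncate long sentences so the summary stays short.
        sentence = sentence[: max_length - 3].rstrip() + "..."
    return f"{label}: {sentence}"
```

For a hypothetical page titled "John Doe" with a "Personal life" section, the example sentence becomes "John Doe > Personal life: He had two children by her in 1948 and 1951.", which says who “He” is without any language modelling.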

  7. Step 5: Ranking Events
     • We have more events than anyone would ever want
     • Run PageRank at page level using link table
       • Assume highly linked-to pages contain better events
     • Count heading levels
       • Assume events at deep subheading levels are less important
     • Empirically, first sentence of page often has most useful event
     • Difficulties:
       • PageRank can take until the end of time (177 million links)
     • Lessons learned:
       • In rare cases, efficiency does matter!
       • Buy more memory
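The ranking signals above can be sketched as a page-level PageRank combined with a heading-depth penalty and a first-sentence boost. This is an in-memory power-iteration sketch for a small link graph, not the project's implementation; with 177 million links the real computation would need a far more careful approach. The function names and the values of `damping`, `depth_penalty`, and `first_sentence_boost` are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in pages
        }
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def event_score(page_rank, heading_depth, is_first_sentence,
                depth_penalty=0.5, first_sentence_boost=2.0):
    """Combine the three signals; the weights are made up for this sketch."""
    score = page_rank * (depth_penalty ** heading_depth)
    if is_first_sentence:
        score *= first_sentence_boost
    return score
```

Each iteration touches every link once, so the cost grows with the number of links times the number of iterations, which is where the memory and running-time concerns come from.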

  8. Step 6: Writing the Webapp
     • Makes an asynchronous JS request
     • Results returned in JSON
     • Uses Lucene to query for keywords
     • Difficulties:
       • Creating a distribution of ranks that makes the “hide/show” experience interesting
       • Creating a good-looking site
     • Lessons learned:
       • Use JS libraries whenever possible

  9. Technologies Used
     • Mallet
       • Machine learning API
     • Lingpipe
       • Linguistic analysis API
     • Hibernate
       • Java persistence layer
     • Spring
       • Java MVC framework
     • MySQL + Tomcat + Apache
     • Scriptaculous
       • JS library
     • Lucene
       • Java indexing API

  10. Who Did What
     • Brandon
       • Sentence and event extraction, date extraction, Lucene semantics
     • Alex
       • MySQL import, PageRank, webapp, data access layer
     • Robert
       • Sentence and event extraction, webapp, data access layer
     • Sean
       • Sentence and event extraction, hierarchy parsing

  11. Demo
