Pattern Markup-Language

A tool for simplifying data extraction from semi-structured sources. By Jonathan Baker, Hilton Campbell, Jordan Crabtree, and David W. Embley. Topics include sites with genealogical data, structural patterns, and programmer-defined regular expressions.

Presentation Transcript


  1. Pattern Markup-Language: A tool for simplifying data extraction from semi-structured sources. Jonathan Baker, Hilton Campbell, Jordan Crabtree, David W. Embley

  2. Many Sites with Genealogical Data

  5. Structural Patterns

  10. Programmer-Defined Regular Expressions

  11. Programmer-Defined Regular Expressions

  12. Programmer-Defined Regular Expressions
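
A programmer-defined regular expression of the kind these slides refer to might look like the following minimal sketch. The page fragment, the class names, and the `extract_people` helper are all hypothetical and invented for illustration; the slides do not show the actual expressions used.

```python
import re

# Hypothetical page fragment (an assumption); real genealogy sites
# each use their own markup.
SAMPLE_HTML = """
<tr><td class="name">John Smith</td><td class="birth">12 Mar 1850</td></tr>
<tr><td class="name">Mary Jones</td><td class="birth">3 Jul 1852</td></tr>
"""

# One hand-written expression per repeated structure on the page.
# Named groups label the facts that each match captures.
PERSON_RE = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td>\s*'
    r'<td class="birth">(?P<birth>[^<]+)</td>'
)

def extract_people(html):
    """Return one dict of captured facts per table row that matches."""
    return [m.groupdict() for m in PERSON_RE.finditer(html)]

people = extract_people(SAMPLE_HTML)
for person in people:
    print(person["name"], "-", person["birth"])
```

Each site needs its own expressions, which is the manual effort the later slides on generation tools aim to reduce.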

  13. Which Relationships Found?

  14. Simple Schema Represents Relationships

  15. Combine Schema and Regular Expressions: Tree Represented by XML = PatML
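
The idea of a schema tree whose nodes carry regular expressions, serialized as XML, can be sketched roughly as below. This is not the real PatML format: the `node` element, its `name` and `regex` attributes, and the `extract` function are assumptions made up for illustration, since the transcript does not include the actual syntax.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern file: each node names a fact and gives a regex
# whose first capture group is the text handed to its child nodes.
PATTERN_XML = r"""
<node name="person" regex="&lt;li&gt;(.*?)&lt;/li&gt;">
  <node name="name" regex="&lt;b&gt;(.*?)&lt;/b&gt;"/>
  <node name="birth" regex="born (\d{4})"/>
</node>
""".strip()

def extract(node, text):
    """Apply node's regex to text; recurse into children on each match."""
    results = []
    for match in re.finditer(node.get("regex"), text, re.DOTALL):
        inner = match.group(1)
        children = list(node)
        if not children:
            # Leaf node: the captured text is the extracted fact.
            results.append(inner)
        else:
            # Inner node: children search only inside this match, so the
            # tree structure records which facts belong together.
            record = {}
            for child in children:
                found = extract(child, inner)
                record[child.get("name")] = found[0] if found else None
            results.append(record)
    return results

# Hypothetical sample page source.
SAMPLE = (
    "<ul><li><b>John Smith</b>, born 1850</li>"
    "<li><b>Mary Jones</b>, born 1852</li></ul>"
)

records = extract(ET.fromstring(PATTERN_XML), SAMPLE)
print(records)
```

Scoping each child expression to its parent's match is what lets the tree express relationships (this name goes with this birth year) that a flat list of expressions could not.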

  20. PatML Generation Tools: Schema Generator. Establishes relationships.

  21. PatML Generation Tools: PatML Editor. Helps write the regular expressions and establish which facts they match.

  23. Using PatML Editor
      • Get your schema file
      • Browse for sample page
      • Add nodes
      • Add expressions
      • See the highlights in source
      • Adjust

  24. PatML Editor Interface
      • Tree representing PatML structure
      • Text area with sample page source
      • Browser with rendered sample page

  26. Fast and Versatile
      • Regular sites can be integrated in hours
      • Adaptable to any type of information

  27. Implementation to Date
      • Genesis uses PatML files to search a variety of sites
      • Searches TNG, Retrospect-GDS, Family Search, GedCom and Kansas Gunslingers
      • Standardizes information for a common data model
      • Simultaneously searches other sites (in different formats) for people with similar information

  28. Results

  29. Results
      • Produced PatML that correctly extracts data from TNG, RGDS, GedCom sites, and Kansas Gunslingers
      • User interface provides an improved debugging environment
      • ~1/10 the coding time with PatML generation tools compared to similarly functioning hand-coded parsers

  30. Limitations
      • Sites must be recognizable with regular expressions
      • Even regular sites have page-to-page HTML variations
      • Regular expressions are prone to programmer error
      • Regular expression operations can be slow

  31. Future Work
      • Automatic regular expression generation
      • Parsing links to extract data on connected pages
      • Use in other applications and fields
      • XPath approaches
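
As a rough illustration of what an XPath-style approach could look like, the sketch below selects facts by element path instead of by regular expression. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree`, and the sample markup is hypothetical and assumed to be well-formed XML, which real HTML pages often are not. The slides do not say which XPath tooling the authors had in mind.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption).
SAMPLE = (
    "<ul>"
    "<li><b>John Smith</b><span class='birth'>1850</span></li>"
    "<li><b>Mary Jones</b><span class='birth'>1852</span></li>"
    "</ul>"
)

root = ET.fromstring(SAMPLE)

# ".//li" selects every list item; the child paths pick out each fact.
xpath_records = [
    {
        "name": li.findtext("b"),
        "birth": li.findtext("span[@class='birth']"),
    }
    for li in root.findall(".//li")
]
print(xpath_records)
```

Path-based selection depends on element structure and so tolerates changes in whitespace or attribute order, but it requires parseable markup.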
