
A Survey of WEB Information Extraction Systems

This survey explores web information extraction systems, including manual, supervised, semi-supervised, and unsupervised systems, with a focus on automation degree, techniques, and output targets. It covers technologies, tools, and related work in the field as of 2005.

Presentation Transcript


  1. A Survey of WEB Information Extraction Systems Chia-Hui Chang National Central University Sep. 22, 2005

  2. Introduction • Abundant information on the Web • Static Web pages • Searchable databases: Deep Web • Information Integration • Information for life • e.g. shopping agents, travel agents • Data for research purpose • e.g. bioinformatics, auction economy

  3. Introduction (Cont.) • Information Extraction (IE) • identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form • An IE task is defined by its input and output

  4. An IE Task

  5. Web Data Extraction (figure: an example Web page with its data records marked)

  6. IE Systems • Wrappers • Programs that perform the task of IE are referred to as extractors or wrappers. • Wrapper Induction • IE systems are software tools that are designed to generate wrappers.

  7. Various IE Surveys • Muslea • Hsu and Dung • Chang • Kushmerick • Laender • Sarawagi • Kuhlins and Tredwell

  8. Related Work: Time • MUC Approaches • AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995] • Post-MUC Approaches • WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998] and STALKER [Muslea, 1999]

  9. Related Work: Automation Degree • Hsu and Dung [1998] • hand-crafted wrappers using general programming languages • specially designed programming languages or tools • heuristic-based wrappers, and • WI approaches

  10. Related Work: Automation Degree • Chang and Kuo [2003] • systems that need programmers, • systems that need annotation examples, • annotation-free systems and • semi-supervised systems

  11. Related Work: Input and Extraction Rules • Muslea [1999] • The first class performs IE from free text using extraction patterns that are mainly based on syntactic/semantic constraints. • The second class, wrapper induction systems, relies on delimiter-based rules. • The third class also performs IE from online documents, but its patterns are based on both delimiters and syntactic/semantic constraints.

  12. Related Work: Extraction Rules • Kushmerick [2003] • Finite-state tools (regular expressions) • Relational learning tools (logic rules)

  13. Related Work: Techniques • Laender [2002] • languages for wrapper development • HTML-aware tools • NLP-based tools • Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER), • Modeling-based tools • Ontology-based tools • New Criteria: • degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience and adaptiveness.

  14. Related Work: Output Targets • Sarawagi [VLDB 2002] • Record-level • Page-level • Site-level

  15. Related Work: Usability • Kuhlins and Tredwell [2002] • Commercial • Noncommercial

  16. Three Dimensions • Task Domain • Input (Unstructured, semi-structured) • Output Targets (record-level, page-level, site-level) • Automation Degree • Programmer-involved, learning-based or annotation-free approaches • Techniques • Regular expression rules vs Prolog-like logic rules • Deterministic finite-state transducer vs probabilistic hidden Markov models

  17. Task Domain: Input

  18. Task Domain: Output • Missing Attributes • Multi-valued Attributes • Multiple Permutations • Nested Data Objects • Various Templates for an attribute • Common Templates for various attributes • Untokenized Attributes

  19. Classification by Automation Degree • Manually • TSIMMIS, Minerva, WebOQL, W4F, XWrap • Supervised • WIEN, Stalker, Softmealy • Semi-supervised • IEPAD, OLERA • Unsupervised • DeLa, RoadRunner, EXALG

  20. Automation Degree • Page-fetching Support • Annotation Requirement • Output Support • API Support

  21. Technologies • Scan passes • Extraction rule types • Learning algorithms • Tokenization schemes • Features used

  22. A Survey of Contemporary IE Systems • Manually-constructed IE tools • Programmer-aided • Supervised IE systems • Label-based • Semi-supervised IE systems • Unsupervised IE systems • Annotation-free

  23. Manually-constructed IE Systems • TSIMMIS [Hammer et al., 1997] • Minerva [Crescenzi, 1998] • WebOQL [Arocena and Mendelzon, 1998] • W4F [Sahuguet and Azavant, 2001] • XWrap [Liu et al., 2000]

  24. A Running Example

  25. TSIMMIS • Each command is of the form: [variables, source, pattern] where • source specifies the input text to be considered • pattern specifies how to find the text of interest within the source, and • variables are a list of variables that hold the extracted results. • Note: • # means “save in the variable” • * means “discard”
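
For illustration, a minimal Python sketch of how a TSIMMIS-style command [variables, source, pattern] could be interpreted, with * discarding text and # saving it. The apply_command helper and the pattern string below are illustrative assumptions, not TSIMMIS's actual syntax or API.

    import re

    # Hypothetical helper: interpret a [variables, source, pattern] command where
    # '*' discards the matched text and '#' saves it into the next variable.
    def apply_command(variables, source, pattern):
        regex = ""
        for ch in pattern:
            if ch == "*":
                regex += ".*?"        # discard
            elif ch == "#":
                regex += "(.*?)"      # save in a variable
            else:
                regex += re.escape(ch)
        m = re.search(regex, source, re.DOTALL)
        return dict(zip(variables, m.groups())) if m else {}

    page = "<b>Book Name</b> Databases <b>Reviews</b>"
    print(apply_command(["title"], page, "*<b>Book Name</b>#<b>*"))
    # {'title': ' Databases '}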

  26. Minerva • The grammar used by Minerva is defined in an EBNF style

  27. WebOQL • Pages are modeled as hypertrees whose nodes carry a Tag, a Source fragment, and a Text value (figure: the hypertree of the running example page, with nodes such as Tag: OL / Source: <ol>…</ol>, Tag: <b> / Source: <b>Reviewer Name</b>, Tag: NOTAG / Source: John) • Example query: Select [Z!'.Text] From x in browse("pe2.html")', y in x', Z in y' Where x.Tag = "ol" and Z.Text = "Reviewer Name"

  28. W4F • WYSIWYG support • Java toolkit • Extraction rule • HTML parse tree (DOM object) • e.g. html.body.ol[0].li[*].pcdata[0].txt • Regular expressions to address finer pieces of information
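
A small sketch of the path idea only (not W4F's engine or data model): evaluating a W4F-like path such as body.ol[0].li[*].pcdata[0] over a hand-built parse tree of the running example. The node representation and helper names are assumed for illustration.

    # Hypothetical node representation: {"tag": ..., "children": [...], "text": ...}
    def children_by_tag(node, tag):
        return [c for c in node["children"] if c["tag"] == tag]

    def evaluate(node, steps):
        """Return the list of nodes addressed by the remaining path steps."""
        if not steps:
            return [node]
        tag, index = steps[0]
        kids = children_by_tag(node, tag)
        selected = kids if index == "*" else [kids[index]]
        results = []
        for k in selected:
            results.extend(evaluate(k, steps[1:]))
        return results

    def text(s): return {"tag": "pcdata", "children": [], "text": s}
    def elem(tag, *children): return {"tag": tag, "children": list(children), "text": None}

    body = elem("body",
                elem("ol",
                     elem("li", text("John"), text("7")),
                     elem("li", text("Jane"), text("6"))))

    # body.ol[0].li[*].pcdata[0].txt -> the first text node of every <li>
    steps = [("ol", 0), ("li", "*"), ("pcdata", 0)]
    print([n["text"] for n in evaluate(body, steps)])   # ['John', 'Jane']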

  29. Supervised IE systems • SRV [Freitag, 1998] • Rapier [Califf and Mooney, 1998] • WIEN [Kushmerick, 1997] • WHISK [Soderland, 1999] • NoDoSE [Adelberg, 1998] • Softmealy [Hsu and Dung, 1998] • Stalker [Muslea, 1999] • DEByE [Laender, 2002b]

  30. SRV • Single-slot information extraction • Top-down (general to specific) relational learning algorithm • Positive examples • Negative examples • Learning algorithm works like FOIL • Token-oriented features • Logic rule, e.g. Rating extraction rule :- Length(=1), Every(numeric true), Every(in_list true).
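
A tiny sketch of what checking the learned rule above amounts to (not SRV's relational learner): a candidate fragment satisfies the Rating rule if it is a single numeric token occurring inside a list item. The matches_rating_rule helper and the in_list flag encoding are assumptions for illustration.

    # Hypothetical check for: rating :- Length(=1), Every(numeric true), Every(in_list true)
    def matches_rating_rule(tokens, in_list_flags):
        return (len(tokens) == 1                          # Length(=1)
                and all(t.isdigit() for t in tokens)      # Every(numeric true)
                and all(in_list_flags))                   # Every(in_list true)

    candidates = [(["7"], [True]), (["John"], [True]), (["7", "10"], [True, True])]
    print([matches_rating_rule(t, f) for t, f in candidates])   # [True, False, False]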

  31. Rapier • Field-level (single-slot) data extraction • Bottom-up (specific to general) learning • The extraction rules consist of 3 parts: a pre-filler pattern, a slot-filler pattern, and a post-filler pattern • Book title extraction rule: pre-filler: word: Book, word: Name, word: </b>; slot-filler: Length=2, Tag: [nn, nns]; post-filler: word: <b>

  32. WIEN • LR Wrapper • (‘Reviewer name </b>’, ‘<b>’, ‘Rating </b>’, ‘<b>’, ‘Text </b>’, ‘</li>’) • HLRT Wrapper (Head LR Tail) • OCLR Wrapper (Open-Close LR) • HOCLRT Wrapper • N-LR Wrapper (Nested LR) • N-HLRT Wrapper (Nested HLRT)
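
For concreteness, a minimal sketch of how an LR wrapper like the one above can be executed (the lr_extract helper and the sample page are illustrative, not WIEN itself): for each attribute, cut out the text between its left and right delimiters, scanning the page record by record.

    # Execute an LR wrapper: one (left, right) delimiter pair per attribute.
    def lr_extract(page, delimiters):
        records, pos = [], 0
        while True:
            record = []
            for left, right in delimiters:
                start = page.find(left, pos)
                if start < 0:
                    return records            # no more records
                start += len(left)
                end = page.find(right, start)
                record.append(page[start:end].strip())
                pos = end
            records.append(record)

    page = ("<li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> nice </li>"
            "<li><b>Reviewer Name</b> Jane <b>Rating</b> 6 <b>Text</b> ok </li>")
    lr = [("Reviewer Name</b>", "<b>"), ("Rating</b>", "<b>"), ("Text</b>", "</li>")]
    print(lr_extract(page, lr))
    # [['John', '7', 'nice'], ['Jane', '6', 'ok']]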

  33. WHISK • Top-down (general to specific) learning • Example • To generate 3-slot book reviews, it starts with the empty rule “*(*)*(*)*(*)*” • Each pair of parentheses indicates a phrase to be extracted • The phrase in the first pair of parentheses is bound to variable $1, the 2nd to $2, etc. • The extraction logic is similar to WIEN's LR wrapper. Pattern:: * ‘Reviewer Name </b>’ (Person) ‘<b>’ * (Digit) ‘<b>Text</b>’ (*) ‘</li>’ Output:: BookReview {Name $1} {Rating $2} {Comment $3}
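
A rough sketch of reading the WHISK pattern above as a regular expression (WHISK interprets its own rule language; the regex below and its crude Person approximation are only illustrative assumptions): '*' skips text, quoted literals must match, and the parenthesised terms become the extracted slots $1..$3.

    import re

    # '*' -> lazy skip, literals -> literal match, (Person)/(Digit)/(*) -> capture groups
    pattern = re.compile(
        r".*?Reviewer Name\s*</b>\s*([A-Z][a-z]+)\s*<b>"   # (Person) -> $1
        r".*?(\d+)\s*<b>Text</b>\s*(.*?)\s*</li>",         # (Digit) -> $2, (*) -> $3
        re.DOTALL)

    record = "<li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> great book </li>"
    m = pattern.search(record)
    print({"Name": m.group(1), "Rating": m.group(2), "Comment": m.group(3)})
    # {'Name': 'John', 'Rating': '7', 'Comment': 'great book'}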

  34. NoDoSE • Assumes the order of attributes within a record is fixed • The user interacts with the system to decompose the input. • For the running example • a book title (an attribute of type string) and • a list of Reviewers • each with RName (string), Rate (integer), and Text (string).

  35. Softmealy • Finite-state transducer (figure: a transducer whose states correspond to the attributes N, R, T plus begin/end states, and whose transitions either emit the next token into the current attribute or emit ε) • Contextual rules, e.g.: s<,R>L ::= HTML(<b>) C1Alph(Rating) HTML(</b>); s<,R>R ::= Spc(-) Num(-); s<R,>L ::= Num(-); s<R,>R ::= NL(-) HTML(<b>)

  36. Stalker • Embedded Category Tree • Multipass Softmealy

  37. DEByE • Bottom-up extraction strategy • Comparison • DEByE: the user marks only atomic (attribute) values to assemble nested tables • NoDoSE: the user decomposes the whole document in a top-down fashion

  38. Semi-supervised Approaches • IEPAD [Chang and Lui, 2001] • OLERA [Chang and Kuo, 2003] • Thresher [Hogue, 2005]

  39. IEPAD • Encoding of the input page • Multiple-record pages • Pattern Mining by PAT Tree • Multiple string alignment • For the running example • <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
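
A rough sketch of IEPAD's encoding step only (the PAT-tree mining and multiple string alignment are not shown; the encode helper is an illustrative assumption): abstract every text chunk to T so that repeated record patterns such as the one on this slide become visible.

    import re

    # Encode a page as its HTML tags with every text chunk abstracted to 'T'.
    def encode(html):
        tokens = re.split(r"(<[^>]+>)", html)
        out = []
        for tok in tokens:
            if not tok.strip():
                continue
            out.append(tok if tok.startswith("<") else "T")
        return "".join(out)

    record = "<li><b>Reviewer Name</b> John <b>Rating</b> 7 <b>Text</b> ... </li>"
    print(encode(record))   # <li><b>T</b>T<b>T</b>T<b>T</b>T</li>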

  40. OLERA • Online extraction rule analysis • Enclosing • Drill-down / Roll-up • Attribute Assignment

  41. Thresher • Works similarly to OLERA • Applies tree alignment instead of string alignment

  42. Unsupervised Approaches • Roadrunner [Crescenzi, 2001] • DeLa [Wang, 2002; 2003] • EXALG [Arasu and Garcia-Molina, 2003] • DEPTA [Zhai, et al., 2005]

  43. Roadrunner • Input: multiple pages with the same template • Matches two input pages at a time: the wrapper is initialized to the first sample page and generalized wherever it mismatches the second page, string mismatches becoming #PCDATA fields and tag mismatches becoming optionals or iterators (figure: line-by-line parsing of the two sample pages with the string and tag mismatches marked) • Wrapper after solving mismatches: <html><body><b> Book Name </b> #PCDATA <b> Reviews </b> <OL> ( <LI><b> Reviewer Name </b> #PCDATA <b> Rating </b> #PCDATA <b> Text </b> #PCDATA </LI> )+ </OL></body></html>
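
A much-simplified sketch of the matching idea only (not RoadRunner itself): align two pages token by token and turn every string mismatch into a #PCDATA field; resolving tag mismatches into optionals and iterators, as in the wrapper above, is deliberately left out.

    # Simplified page matching: equal tokens are kept, text mismatches become data fields.
    def match(page_a, page_b):
        wrapper = []
        for a, b in zip(page_a, page_b):
            if a == b:
                wrapper.append(a)
            elif not a.startswith("<") and not b.startswith("<"):
                wrapper.append("#PCDATA")     # string mismatch -> data field
            else:
                raise NotImplementedError("tag mismatch: needs optional/iterator discovery")
        return wrapper

    p1 = ["<b>", "Book Name", "</b>", "Databases",   "<b>", "Reviews", "</b>"]
    p2 = ["<b>", "Book Name", "</b>", "Data mining", "<b>", "Reviews", "</b>"]
    print(" ".join(match(p1, p2)))
    # <b> Book Name </b> #PCDATA <b> Reviews </b>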

  44. DeLa • Similar to IEPAD • Works on a single input page • Handles nested data structures • Example • <P><A>T</A><A>T</A>T</P><P><A>T</A>T</P> • <P><A>T</A>T</P><P><A>T</A>T</P> • (<P>(<A>T</A>)*T</P>)*
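
A simplified sketch of the repeat-collapsing idea behind the example above (DeLa actually mines continuously repeated patterns with a suffix tree; collapse_repeats and the token list are illustrative assumptions): repeatedly replace two adjacent copies of a substring with one copy and record the repeated unit, which surfaces the nested pattern.

    def collapse_repeats(tokens):
        repeats = []
        changed = True
        while changed:
            changed = False
            n = len(tokens)
            for size in range(1, n // 2 + 1):
                for i in range(n - 2 * size + 1):
                    if tokens[i:i + size] == tokens[i + size:i + 2 * size]:
                        repeats.append(tuple(tokens[i:i + size]))           # repeated unit
                        tokens = tokens[:i + size] + tokens[i + 2 * size:]  # keep one copy
                        changed = True
                        break
                if changed:
                    break
        return tokens, repeats

    page = ['<P>', '<A>', 'T', '</A>', '<A>', 'T', '</A>', 'T', '</P>',
            '<P>', '<A>', 'T', '</A>', 'T', '</P>']
    template, units = collapse_repeats(page)
    print(template)   # ['<P>', '<A>', 'T', '</A>', 'T', '</P>']
    print(units)      # repeated units: (<A> T </A>) and (<P> <A> T </A> T </P>)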

  45. EXALG • Input: multiple pages with the same template • Techniques: • Differentiating token roles • Equivalence classes (ECs), i.e. sets of tokens with the same occurrence vector across pages, form the template
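
A small sketch of the occurrence-vector idea (not EXALG's full algorithm; the two toy pages are illustrative): count how often each token occurs in every page and group tokens that share the same occurrence vector into candidate equivalence classes.

    from collections import Counter, defaultdict

    pages = [
        "<b> Book Name </b> Databases <b> Reviews </b>".split(),
        "<b> Book Name </b> Data mining <b> Reviews </b>".split(),
    ]

    # Occurrence vector of a token = its count in each page.
    counters = [Counter(page) for page in pages]
    vocabulary = {tok for page in pages for tok in page}
    vectors = {tok: tuple(c[tok] for c in counters) for tok in vocabulary}

    # Group tokens with identical vectors into candidate equivalence classes.
    classes = defaultdict(list)
    for tok, vec in vectors.items():
        classes[vec].append(tok)

    for vec, toks in sorted(classes.items()):
        print(vec, sorted(toks))
    # Constant vectors (<b>, </b>, Book, Name, Reviews) suggest template tokens;
    # varying ones (Databases, Data, mining) suggest data.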

  46. DEPTA • Identify data regions • Allow mismatches between data records • Identify data records • Data records may not be contiguous • Identify data items • By partial tree alignment

  47. Comparison • How do we differentiate template tokens from data tokens? • DeLa and DEPTA assume HTML tags are template tokens and all other tokens are data • IEPAD and OLERA leave the problem to users • How is information from multiple pages applied? • DeLa and DEPTA conduct the mining on a single page • Roadrunner and EXALG do the analysis over multiple pages

  48. Comparison (Cont.) • Technique improvements • From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher) • From full alignment (IEPAD) to partial alignment (DEPTA)

  49. Task domain comparison • Page type • structured, semi-structured or free-text Web pages • Non-HTML support • Extraction level • Field level, record-level, page-level
