Information Extraction on the Web
Chia-Hui Chang
Department of Computer Science & Information Engineering
National Central University
chia@csie.ncu.edu.tw
Outline
• What is information extraction?
• Document types
• Applications
• Wrapper induction
• Automatic wrapper generator
• Conclusions
What is information extraction?
• An information extraction system is a cascade of transducers or modules that, at each step, add structure and often lose information (ideally irrelevant information) by applying rules that are acquired manually and/or automatically.
• Example: a parser
  • Input: a sequence of lexical items and perhaps small-scale structures (phrases)
  • Output: a set of parse tree fragments, possibly complete
Modules
• Text zoner: turns a text into a set of text segments
• Preprocessor: turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes
• Filter: turns a set of sentences into a smaller set of sentences by filtering out the irrelevant ones
Document types
• Plain text (sentence after sentence, plain narrative)
  • Relies on lexical and semantic analysis.
  • AutoSlog (Riloff 93), LIEP (Huffman 95), CRYSTAL (Soderland 95), HASTEN (Krupka 95).
• Web pages (semi-structured documents)
  • Exploit features of HTML syntax: tags.
  • Heuristics derived from observation: layout.
Applications
• Meta search engines
• Information agents
  • Oriented toward a specific purpose, for example:
    • News agents (news spiders)
      • Gathering news
    • Comparison shopping
    • Job hunting
• ShopBot (Doorenbos 97), Software LEGO (Hsu 99).
Information Integration Systems
[Figure: layered architecture of an information integration system. From top to bottom: human and computer users; user services (query, monitor, update); information integration service; mediators; wrappers (interface labels: SQL, ORB); heterogeneous data sources (text, images/video, spreadsheets; hierarchical and network databases; object and knowledge bases; relational databases). A side scale runs from "Abstracted Information" through agent/module coordination, mediation, semantic integration, and translation and wrapping down to "Unprocessed, Unintegrated Details".]
What is a wrapper?
• Wrapper: a program that extracts the desired information from Web pages.
• Semi-structured document → wrapper → structured information
Web Wrappers
• Web wrappers wrap:
  • "Query-able" or "search-able" Web sites
  • Web pages with large itemized lists
• The primary issue is:
  • How to build the extractor quickly?
Free Text Extraction vs. Semi-structured Text Extraction
• Example: extract three attributes (job title, employer, and phone number) from a job item list
• Free text extraction can depend on natural-language knowledge:
  “The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details.”
• Semi-structured text extraction depends on appearance and regularity:
  “Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555”
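A minimal sketch of the semi-structured case, assuming every item follows the comma-separated layout of the second example. The pattern, the field names, and the function `extract_job_item` are illustrative assumptions and are not part of the original slides.

```python
import re

# Hypothetical layout assumption: "<title>, <unit>, <employer>. Call <phone>"
ITEM_PATTERN = re.compile(
    r"^(?P<title>[^,]+),\s*(?P<unit>[^,]+),\s*(?P<employer>[^.]+)\.\s*"
    r"Call\s*(?P<phone>\(\d{3}\)\d{3}-\d{4})$"
)


def extract_job_item(item):
    """Return the fields of one job item, or None if the layout does not fit."""
    match = ITEM_PATTERN.match(item.strip())
    return match.groupdict() if match else None


semi_structured = ("Faculty position, department of computer science, "
                   "Cranberry Lemon University. Call (555)333-5555")
free_text = ("The department of computer science at Cranberry Lemon University "
             "has a faculty position opening. Please call (555)333-5555 "
             "for more details.")

print(extract_job_item(semi_structured))
print(extract_job_item(free_text))  # layout rule does not apply to free text
```

The same rule returns nothing for the free-text sentence, which is why that case needs linguistic knowledge rather than layout regularity.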
Wrapper Representations
• Delimiter-based finite state automata
[Figure: four-state automaton; states 1 to 4 alternate between skip and extract, with transitions on <B>, </B>, <I>, and </I>]
<HTML><TITLE>Some Country Codes</TITLE><BODY>
<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>
<B>Belize</B><I>501</I><BR>
<B>Spain</B><I>34</I><BR>
</BODY></HTML>
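The four-state automaton can be sketched as a loop that cycles through the delimiters, skipping text in the odd-numbered states and extracting in the even-numbered ones (counting from 1 as on the slide). The function name `run_delimiter_fsa` and the sample page string are assumptions made for illustration.

```python
def run_delimiter_fsa(page, delimiters=("<B>", "</B>", "<I>", "</I>")):
    """Cycle through the delimiters; text before a closing delimiter is extracted."""
    tuples, current = [], []
    pos, state = 0, 0  # state is 0-based here; the slide numbers states from 1
    while True:
        end = page.find(delimiters[state], pos)
        if end == -1:  # no further delimiter: stop
            break
        if state % 2 == 1:  # extracting state: keep text up to the delimiter
            current.append(page[pos:end])
        pos = end + len(delimiters[state])
        state = (state + 1) % len(delimiters)
        if state == 0:  # a full cycle yields one tuple
            tuples.append(tuple(current))
            current = []
    return tuples


PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>\n"
    "<B>Congo</B><I>242</I><BR>\n"
    "<B>Egypt</B><I>20</I><BR>\n"
    "<B>Belize</B><I>501</I><BR>\n"
    "<B>Spain</B><I>34</I><BR>\n"
    "</BODY></HTML>"
)

for country, code in run_delimiter_fsa(PAGE):
    print(country, code)
```

Exact string matching keeps `<BODY>` and `<BR>` from being mistaken for `<B>`.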
Related Work
• ShopBot
  • Doorenbos, Etzioni, Weld, AA-97
• Ariadne
  • Ashish, Knoblock, CoopIS-97
• WIEN
  • Kushmerick, Weld, IJCAI-97
Related Work (Cont.)
• SoftMealy wrapper representation
  • Hsu, IJCAI-99
• STALKER
  • Muslea, Minton, Knoblock, AA-99
  • A hierarchical FST
• IEPAD
  • Chang, WWW01
WIEN
• HLRT (Head-Left-Right-Tail)
• Labeling: by PageOracle and LabelOracle.
• PAC analysis
• Extracts 48% of Web pages successfully.
• Weaknesses:
  • Missing attributes, attributes not in order, tabular data, etc.
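A rough sketch of how an HLRT wrapper might be executed, assuming the usual reading of the name: a head delimiter marks where the data region starts, a tail delimiter marks where it ends, and one left/right delimiter pair brackets each attribute. The function `exec_hlrt`, the delimiters, and the sample page are illustrative assumptions, not WIEN's actual code.

```python
def exec_hlrt(page, head, tail, lr_pairs):
    """Extract tuples between head and tail using one (left, right) pair per attribute."""
    pos = page.find(head)
    if pos == -1:
        return []
    pos += len(head)  # skip the page header
    tuples = []
    first_left = lr_pairs[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further tuple starts before the tail delimiter.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            break
        row = []
        for left, right in lr_pairs:
            start = page.find(left, pos)
            if start == -1:
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


# Hypothetical page: the bold heading and footer would confuse a wrapper
# that only knows the <B>...</B> and <I>...</I> delimiters.
HLRT_PAGE = (
    "<HTML><BODY><B>Some Country Codes</B><P>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<HR><B>End</B></BODY></HTML>"
)

print(exec_hlrt(HLRT_PAGE, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")]))
```

Because each attribute is located by a fixed pair in a fixed order, a sketch like this also shows where the listed weaknesses come from: a missing or reordered attribute shifts every later match.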
SoftMealy
Chun-Nan Hsu, 1998, Arizona State University
SoftMealy
• Finite-state transducers for semi-structured text mining
• Labeling: examples are labeled manually through an interface.
• FST (finite-state transducer)
  • Single-pass
  • Multi-pass
SoftMealy wrapper representation
• Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path
• Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes
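A simplified sketch of the idea, not SoftMealy's actual implementation. Each transition carries a contextual rule made of a left-context and a right-context regular expression, and different attribute orders follow different paths through the transducer. The state names (`b`, `U`, `-U`, `N`, `-N`, `A`, `e`), the reading of U, N, A as URL, name, and academic title, the rules, and the sample items are all assumptions made for illustration.

```python
import re

# (source state, target state, left-context regex, right-context regex)
RULES = [
    ("b", "U", r'<A HREF="', r""),   # a link follows: start with the URL
    ("b", "N", r"<LI>", r"[A-Z]"),   # no link: the name starts right away
    ("U", "-U", r"", r'">'),
    ("-U", "N", r'">', r""),
    ("N", "-N", r"", r"(</A>)?,"),
    ("-N", "A", r"<I>", r""),
    ("A", "e", r"", r"</I>"),
]


def matches_at(text, pos, left, right):
    """True if left matches the text ending at pos and right matches from pos."""
    return (re.search("(?:" + left + r")\Z", text[:pos]) is not None
            and re.match(right, text[pos:]) is not None)


def next_transition(text, state, pos):
    """Find the earliest position where a rule leaving `state` applies."""
    for p in range(pos, len(text) + 1):
        for src, dst, left, right in RULES:
            if src == state and matches_at(text, p, left, right):
                return p, dst
    return None


def run_fst(text, start="b", final="e"):
    """Follow transitions to the final state, emitting text seen in attribute states."""
    state, pos, out = start, 0, []
    while state != final:
        hit = next_transition(text, state, pos)
        if hit is None:
            return None  # no successful path for this item
        p, dst = hit
        if state != start and not state.startswith("-"):  # extracting state
            out.append((state, text[pos:p]))
        state, pos = dst, p
    return out


print(run_fst('<LI><A HREF="mani.html">Mani Chandy</A>, <I>Professor</I>'))
print(run_fst("<LI>Fred Thompson, <I>Lecturer</I>"))
```

The first item follows the path U, N, A and the second follows N, A, so a missing attribute is handled by an alternative path rather than by a separate wrapper.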
Output: four cases
Finite State Transducer
• Additionally handles two more cases: (N, M) and (N, A, M)
[Figure: transducer with states b, U, -U, N, -N, A, -A, M, and e; transitions are labeled skip or extract]
STALKER
Muslea, Minton, Knoblock, AA-99
A Hierarchical FST
STALKER
• “STALKER: Learning Extraction Rules for Semi-structured, Web-based Information Sources”. AAAI-98, Muslea.
• The Embedded Catalog description is a tree-like structure.
Multi-Pass or Hierarchical Wrapper
• Multi-pass:
  • Pass 1: extract U
  • Pass 2: extract N
  • Pass 3: extract A
  • Pass 4: extract M
• Hierarchical:
  • First extract the Body
  • Then extract the Tuples
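The two ideas can be combined in a small sketch: first narrow the page to the body, then to the tuples, then run one independent pass per attribute. The landmark strings in `ATTRIBUTE_RULES`, the helper `between`, and the sample page are hypothetical and chosen only to illustrate the control flow.

```python
import re

# Hypothetical landmarks for the four attributes U, N, A, M.
ATTRIBUTE_RULES = {
    "U": ('HREF="', '"'),
    "N": ("<B>", "</B>"),
    "A": ("<I>", "</I>"),
    "M": ("<EM>", "</EM>"),
}


def between(text, start, end):
    """Return every substring enclosed by the start and end landmarks."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text, flags=re.S)


def hierarchical_extract(page):
    body = between(page, "<BODY>", "</BODY>")[0]       # first the body
    records = []
    for item in between(body, "<LI>", "</LI>"):         # then the tuples
        record = {}
        for name, (start, end) in ATTRIBUTE_RULES.items():  # one pass per attribute
            found = between(item, start, end)
            record[name] = found[0] if found else None
        records.append(record)
    return records


SAMPLE_PAGE = (
    "<HTML><BODY><UL>"
    '<LI><I>Professor</I> <A HREF="a.html"><B>Ann Lee</B></A> <EM>Chair</EM></LI>'
    "<LI><B>Bob Wu</B> <I>Lecturer</I></LI>"
    "</UL></BODY></HTML>"
)

for record in hierarchical_extract(SAMPLE_PAGE):
    print(record)
```

Since every attribute is searched for within the whole tuple, the order in which attributes appear does not matter, at the cost of scanning each tuple once per attribute.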
Rule Generation
• Goal: extract the Credit info.
• 1st iteration:
  • Terminals: {; reservation _Symbol_ _Word_}
  • Candidates: {; <i> _Symbol_ _HtmlTag_}
  • Perfect disjunct: {<i> _HtmlTag_}, covering positive examples D3 and D4
• 2nd iteration:
  • Uncovered examples: {D1, D2}
  • Candidates: {; _Symbol_}
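The two iterations resemble a sequential-covering loop: pick a candidate that is correct wherever it matches, remove the examples it covers, and repeat on the rest. The sketch below is a heavily simplified version with single-landmark rules. The four documents are invented stand-ins for D1 to D4, and the function names and the definition of "perfect" used here are assumptions rather than STALKER's exact procedure.

```python
def skip_to(text, landmark):
    """Position just after the first occurrence of landmark, or None."""
    i = text.find(landmark)
    return None if i == -1 else i + len(landmark)


def learn_disjunctive_rule(examples, candidates):
    """Greedy covering: each disjunct must be correct wherever it matches."""
    uncovered, rule = list(examples), []
    while uncovered:
        best, best_covered = None, []
        for landmark in candidates:
            results = [(ex, skip_to(ex[0], landmark)) for ex in uncovered]
            if any(pos is not None and pos != ex[1] for ex, pos in results):
                continue  # matches at a wrong position: not a perfect disjunct
            covered = [ex for ex, pos in results if pos == ex[1]]
            if len(covered) > len(best_covered):
                best, best_covered = landmark, covered
        if best is None:
            break  # no candidate covers the remaining examples
        rule.append(best)
        uncovered = [ex for ex in uncovered if ex not in best_covered]
    return rule, uncovered


def apply_rule(text, rule):
    """Apply the disjuncts in order and return the first match position."""
    for landmark in rule:
        pos = skip_to(text, landmark)
        if pos is not None:
            return pos
    return None


def labeled(text, target):
    """Build an example: the text plus the index where the target starts."""
    return (text, text.index(target))


# Invented documents standing in for D1..D4.
EXAMPLES = [
    labeled("reservation required; Visa, MasterCard", "Visa"),
    labeled("no reservation; Amex", "Amex"),
    labeled("<b>Cards:</b> <i>Visa</i>", "Visa"),
    labeled("<b>Cards:</b> <i>Diners</i>", "Diners"),
]

rule, leftover = learn_disjunctive_rule(EXAMPLES, ["<i>", "; "])
print(rule, leftover)
```

On these stand-in documents the first disjunct covers the two tagged examples and the second covers the remaining two, mirroring the two iterations listed on the slide.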
Features
• Processing is performed in a hierarchical manner.
• No problem with attributes appearing out of order.
• Disjunctive rules can handle missing attributes.
Comparison
• Both:
  • Can handle irregular missing attributes.
  • Require training for attributes that have not been seen before.
• Single-pass:
  • Only a limited set of attribute permutations is allowed.
  • Good for tabular pages.
  • Faster.
• Multi-pass:
  • Attribute permutations have no effect.
  • Good for tagged-list pages.
  • Slower.
Comparison
• Quote Server
  • STALKER: 79%, 10 example tuples, 500 test
  • WIEN: the collection is beyond the learner's capability
  • SoftMealy: multi-pass 85%, single-pass 97%
• Internet Address Finder
  • STALKER: 80% ~ 100%, 500 test
  • WIEN: the collection is beyond the learner's capability
  • SoftMealy: multi-pass 68%, single-pass 41%
Comparison
• Okra (tabular pages)
  • STALKER: 97%, 1 example tuple
  • WIEN: 100%, 13 example tuples, 30 test
  • SoftMealy: single-pass 100%, 1 example tuple, 30 test
• Big-book (tagged-list pages)
  • STALKER: 97%, 8 example tuples
  • WIEN: perfect, 18 example tuples, 30 test
  • SoftMealy: single-pass 97%, 4 examples, 30 test; multi-pass 100%, 6 examples, 30 test