WinaCS Project
Web Entity Extraction and Mapping
Discovering and Propagating Context
Tim Weninger
Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL
Past, Present, Future
Past: Entity search and retrieval is one of the dreams of the Web (TBL)
Present: Ranking and Retrieval, a bi-directional approach
  1) Information Networks
  2) Web mining and Information Extraction
     a) List Finding
     b) Entity-page Discovery
     c) Entity-page Mapping
Future: InfoBase Project, information extraction via Schema Discovery
Finding lists on the Web is Hard! (KDD Explorations, Dec. 2010)
1. Google Sets
2. WebTables
3. Mining Data Records (MDR)
4. World Wide Tables (WWT)
5. Tag Path Clustering
6. RoadRunner
7. SEAL
8. Visual List Extraction
9. VIsion-based Page Segmentation (VIPS)
10. Visualized Element Nodes Table extraction (VENTex)
Why is finding lists important?
• Charu Aggarwal • Deepayan Chakrabarti • Ed Chang • Kevin Chang • Olivier Chapelle • Chris Clifton • Jiawei Han • …
• Jiawei Han • ChengXiang Zhai • Kevin Chang • Dan Roth • Marianne Winslett
• Jiawei Han • ChengXiang Zhai • Kevin Chang • Dan Roth • Marianne Winslett • Sarita Adve • Tarek Adelzaher • Vikram Adve • Gul Agha • …
Correction, Inference, Disambiguation, Recommendation, etc.
Mapping Pages to Records (CIKM’10)
Example:
Ap1 = {People, Faculty, Dan Roth, Personal Site}
Ap2 = {Research, Data Mining, Dan Roth, Personal Site}
Bag of Anchors: {Research: 1, People: 1, Faculty: 1, Data Mining: 1, Dan Roth: 2, Personal Site: 2}
Sorted Bag of Anchors: Au;v1 = {Dan Roth: 2/2 = 1, Research: 1/2 = 0.5, Data Mining: 1/2 = 0.5, Personal Site: 2/5 = 0.4, People: 1/3 = 0.33, Faculty: 1/3 = 0.33}
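The anchor example above can be reproduced with a short sketch. The slide does not say where the denominators (2, 2, 2, 5, 3, 3) come from, so they are passed in here as a given per-anchor normalizer. The function names `bag_of_anchors` and `sorted_bag` are hypothetical and not taken from the CIKM'10 paper.

```python
from collections import Counter

def bag_of_anchors(paths):
    """Count how often each anchor text appears across all anchor paths."""
    bag = Counter()
    for path in paths:
        bag.update(path)
    return bag

def sorted_bag(bag, normalizer):
    """Divide each count by its normalizer and sort by score, highest first.

    `normalizer` is an assumption: the slide shows the denominators
    but does not define how they are computed.
    """
    scored = {anchor: count / normalizer[anchor] for anchor, count in bag.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Anchor paths from the slide.
ap1 = ["People", "Faculty", "Dan Roth", "Personal Site"]
ap2 = ["Research", "Data Mining", "Dan Roth", "Personal Site"]

# Denominators copied from the slide's worked example.
normalizer = {
    "Dan Roth": 2, "Research": 2, "Data Mining": 2,
    "Personal Site": 5, "People": 3, "Faculty": 3,
}

bag = bag_of_anchors([ap1, ap2])
ranked = sorted_bag(bag, normalizer)
```

With these inputs, "Dan Roth" ranks first, which matches the intuition that the most page-specific anchor text identifies the record.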
CSMap: Locations of the top 25 computer science departments. Automatically generated by extracting and ranking 5-digit numbers from entity Web pages.
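A minimal sketch of the extraction step, assuming the ranking is by raw frequency across an entity's pages (the slide does not state the ranking criterion). The function name `rank_five_digit_numbers` is hypothetical.

```python
import re
from collections import Counter

# Match exactly five digits that are not part of a longer digit run.
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")

def rank_five_digit_numbers(pages):
    """Return (number, count) pairs, most frequent first.

    `pages` is an iterable of page texts for one entity. Frequency
    ranking is an assumption made for illustration only.
    """
    counts = Counter()
    for text in pages:
        counts.update(FIVE_DIGITS.findall(text))
    return counts.most_common()

# Illustrative page texts, not real data.
pages = [
    "Contact: Urbana, IL 61801",
    "Mailing address: Urbana IL 61801, fax 2173334321",
    "Room 12345",
]
result = rank_five_digit_numbers(pages)
```

The top-ranked number for each department could then be treated as a candidate ZIP code and placed on a map.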
Next Steps: The hard part!
Infer categories/schemas from a set of Web pages
Example: Name, Address, ZipCode, Publications, Collaborators, Organizations
How can we infer this schema? Wikipedia?
How can we populate it?
What do these entities have in common?
Next Steps: The hardest part!
[Figure: schema with parts labeled "Inferred" and "Given"]
This can be modeled as a heterogeneous information network.
Thus, ranking and clustering are possible.
So are semantic search, keyword search, and typal search.
Cube operations are possible.
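One way to picture the heterogeneous information network is as a graph whose nodes carry a type drawn from the schema. The sketch below is an illustration only: the class `HeteroNetwork`, its methods, and the placeholder node names are hypothetical, and reading "typal search" as "find neighbors of a given type" is an assumption.

```python
from collections import defaultdict

class HeteroNetwork:
    """Undirected graph whose nodes each carry a type label."""

    def __init__(self):
        self.node_type = {}
        self.adj = defaultdict(set)

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, u, v):
        self.adj[u].add(v)
        self.adj[v].add(u)

    def neighbors_of_type(self, node, ntype):
        """Return the neighbors of `node` that have type `ntype`, sorted."""
        return sorted(n for n in self.adj[node] if self.node_type[n] == ntype)

# Placeholder entities; the types reuse field names from the example schema.
net = HeteroNetwork()
net.add_node("person_a", "Name")
net.add_node("person_b", "Name")
net.add_node("org_x", "Organizations")
net.add_node("zip_1", "ZipCode")
net.add_edge("person_a", "org_x")
net.add_edge("person_b", "org_x")
net.add_edge("org_x", "zip_1")

people_at_org = net.neighbors_of_type("org_x", "Name")
```

Ranking and clustering would then operate over these typed links rather than over a flat page graph.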