WinaCS Project Web Entity Extraction and Mapping Discovering and Propagating Context

Tim Weninger
Department of Computer Science
University of Illinois Urbana-Champaign, Urbana, IL

Past, Present, Future
Past – Entity search and retrieval is one of the dreams of the Web – TBL
Present – Ranking and Retrieval bi-directional approach
1) Information Networks
2) Web mining and Information Extraction
   a) List Finding
   b) Entity-page Discovery
   c) Entity-page Mapping
Future – InfoBase Project: Information extraction via Schema Discovery

Finding lists on the Web is Hard! (KDD Explorations Dec. 2010)
1. Google Sets
2. WebTables
3. Mining Data Records (MDR)
4. World Wide Tables (WWT)
5. Tag Path Clustering
6. RoadRunner
7. SEAL
8. Visual List Extraction
9. VIsion-based Page Segmentation (VIPS)
10. Visualized Element Nodes Table extraction (VENTex)

Why is finding lists important?
• Charu Aggarwal • Deepayan Chakrabarti • Ed Chang • Kevin Chang • Olivier Chapelle • Chris Clifton • Jiawei Han • …
• Jiawei Han • ChengXiang Zhai • Kevin Chang • Dan Roth • Marianne Winslett • Jiawei Han • ChengXiang Zhai • Kevin Chang • Dan Roth • Marianne Winslett • Sarita Adve • Tarek Abdelzaher • Vikram Adve • Gul Agha • …
Correction, Inference, Disambiguation, Recommendation, etc.

Our list finding algorithm (Accepted: WWW 2011)

List Finding for Entity Page Discovery

Growing Parallel Paths (Accepted: WWW 2011) Result:

Mapping Pages to Records (CIKM’10)

Mapping Pages to Records (CIKM’10)
Example
Ap1 = {People, Faculty, Dan Roth, Personal Site}
Ap2 = {Research, Data Mining, Dan Roth, Personal Site}
Bag of Anchors: {Research: 1, People: 1, Faculty: 1, Data Mining: 1, Dan Roth: 2, Personal Site: 2}
Sorted Bag of Anchors: Au;v1 = {Dan Roth: 2/2 = 1, Research: 1/2 = 0.5, Data Mining: 1/2 = 0.5, Personal Site: 2/5 = 0.4, People: 1/3 = 0.33, Faculty: 1/3 = 0.33}
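The arithmetic in the example above can be reproduced with a short sketch. This is only an illustration, not the method from the CIKM’10 paper: the slide does not say what the denominators (2, 2, 2, 5, 3, 3) represent, so they are passed in as a given `normalizers` table, and the function names `bag_of_anchors` and `sorted_bag` are made up for this example.

```python
from collections import Counter

def bag_of_anchors(paths):
    """Count how often each anchor text occurs across all anchor paths."""
    bag = Counter()
    for path in paths:
        bag.update(path)
    return bag

def sorted_bag(bag, normalizers):
    """Divide each count by its normalizer and sort by descending score."""
    scored = [(anchor, count / normalizers[anchor]) for anchor, count in bag.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# The two anchor paths from the slide.
ap1 = ["People", "Faculty", "Dan Roth", "Personal Site"]
ap2 = ["Research", "Data Mining", "Dan Roth", "Personal Site"]

# Assumption: denominators copied from the slide; their definition is not given there.
normalizers = {
    "Dan Roth": 2, "Research": 2, "Data Mining": 2,
    "Personal Site": 5, "People": 3, "Faculty": 3,
}

bag = bag_of_anchors([ap1, ap2])
ranked = sorted_bag(bag, normalizers)
for anchor, score in ranked:
    print(f"{anchor}: {score:.2f}")
```

Generic anchors such as "Personal Site" occur in both paths but carry a large denominator, so they fall below the entity name "Dan Roth" in the ranking.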

CSMap
Locations of top 25 computer science departments. Automatically generated by extracting and ranking 5-digit numbers from entity Web pages.
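A minimal sketch of the extraction step follows. The slide does not say how the 5-digit numbers were ranked, so ranking by raw frequency is an assumption here, and both the function name `rank_five_digit_numbers` and the sample page text are invented for illustration.

```python
import re
from collections import Counter

# Match exactly five digits that are not part of a longer digit run.
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")

def rank_five_digit_numbers(pages):
    """Return (number, count) pairs, most frequent first (assumed ranking)."""
    counts = Counter()
    for text in pages:
        counts.update(FIVE_DIGITS.findall(text))
    return counts.most_common()

# Hypothetical page texts for one entity.
pages = [
    "Contact: Example Hall, Urbana, IL 61801. Fax: 5551234567",
    "Mailing address: Urbana, IL 61801; grant no. 12345",
]

ranking = rank_five_digit_numbers(pages)
print(ranking)
```

The top-ranked number is then a candidate ZIP code for placing the entity on the map. Other 5-digit strings, such as the grant number in the sample, are noise that the ranking is meant to push down.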

Next Steps: The hard part!
Infer categories/schemas from a set of Web pages
Example: Name, Address, ZipCode, Publications, Collaborators, Organizations
How can we infer this schema? Wikipedia?
How can we populate it?
What do these entities have in common?

Idea! Propagating schemas

Next Steps: The hardest part!
(Figure legend: Inferred, Given)
This can be modeled as a heterogeneous information network.
Thus, ranking and clustering are possible.
So are semantic search, keyword search and typal search.
Cube operations are possible.
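As a rough illustration of what "heterogeneous" means here, the sketch below stores typed nodes and undirected links and answers a simple type-restricted neighbor query. It is not the WinaCS implementation: the class `InfoNetwork`, its methods, the node types, and all sample nodes are hypothetical, and reading "typal search" as a type-filtered lookup is an assumption.

```python
from collections import defaultdict

class InfoNetwork:
    """Toy heterogeneous information network: every node carries a type."""

    def __init__(self):
        self.node_type = {}            # node -> type label
        self.edges = defaultdict(set)  # node -> set of linked nodes

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, a, b):
        # Links are undirected in this sketch.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors_of_type(self, node, ntype):
        """Return the neighbors of `node` that have type `ntype`, sorted."""
        return sorted(n for n in self.edges[node] if self.node_type[n] == ntype)

# Placeholder entities; the types mirror the schema fields named earlier.
net = InfoNetwork()
net.add_node("Person A", "Person")
net.add_node("Person B", "Person")
net.add_node("Org X", "Organization")
net.add_node("Paper 1", "Publication")
net.add_edge("Person A", "Person B")  # collaborator link
net.add_edge("Person A", "Org X")     # affiliation link
net.add_edge("Person A", "Paper 1")   # authorship link
net.add_edge("Person B", "Paper 1")

print(net.neighbors_of_type("Person A", "Person"))
print(net.neighbors_of_type("Paper 1", "Person"))
```

Because every node is typed, ranking, clustering, and search can be restricted to or grouped by a type, which is what distinguishes this model from a plain Web link graph.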

WinaCS – An information network based Web search engine

Questions? Challenges?
