Querying The Web Database
Michael J. Cafarella, University of Michigan
CS4HS, August 18, 2010
Two kinds of databases
• Structured databases (your bank)
  • Expensive, hard to use
  • Few sources of data
  • Powerful queries
    • “Who lives in Ypsilanti and has a balance between $800 and $1400?”
• Unstructured databases (the Web)
  • Cheap, easy to use
  • Many sources of data
  • Very boring “topic” queries
    • britney spears, etc.
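To make the "powerful queries" point concrete, here is a minimal sketch of the Ypsilanti question against a structured database, using Python's built-in sqlite3 module. The table name `accounts`, its columns, and the sample rows are all invented for illustration; they are not from the talk.

```python
import sqlite3


def customers_in_range(conn, city, low, high):
    """Return names of customers in `city` with low <= balance <= high."""
    rows = conn.execute(
        "SELECT name FROM accounts "
        "WHERE city = ? AND balance BETWEEN ? AND ? "
        "ORDER BY name",
        (city, low, high),
    ).fetchall()
    return [name for (name,) in rows]


# Hypothetical bank data, held in memory only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, city TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [
        ("Alice", "Ypsilanti", 950.0),
        ("Bob", "Ann Arbor", 1200.0),
        ("Carol", "Ypsilanti", 2000.0),
        ("Dave", "Ypsilanti", 1400.0),
    ],
)

result = customers_in_range(conn, "Ypsilanti", 800, 1400)
print(result)
```

The query works only because the data already has labeled columns (`city`, `balance`). A keyword search over Web pages has no such columns to filter on, which is the contrast the slide draws.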
The Structured Web?
• What if we had a structured-data version of everything on the Web?
• “A Database of Everything”
• “List all scientists from Belgium who were left-handed”
• “Which heart surgeon in Michigan has the highest success rate?”
• “List Miami hotels with hot tubs near a beach”
This page contains 16 distinct HTML tables, but only one structured database
WebTables | Schema Statistics | Applications
• The WebTables system automatically extracts databases from a web crawl
• An extracted database is one table plus labeled columns
• We estimate that our crawl of 14.1B raw HTML tables contains ~154M good structured databases
[Figure: raw crawled pages → raw HTML tables → recovered databases]
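The pipeline above (pages, then raw HTML tables, then recovered databases) can be sketched in a few lines of Python with the standard-library `html.parser`. This is only a toy under stated assumptions: tables are not nested, the first row holds the column labels, and `looks_relational` is a hand-written heuristic invented here. It does not claim to reproduce how WebTables decides which tables are good, and the sample page and its values are made up.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows of cell strings (no nesting)."""

    def __init__(self):
        super().__init__()
        self.tables = []     # finished tables
        self._table = None   # rows of the table being read
        self._row = None     # cells of the row being read
        self._cell = None    # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def looks_relational(table, min_rows=3, min_cols=2):
    """Hypothetical filter: enough rows, equal-width rows, non-empty header."""
    if len(table) < min_rows:
        return False
    width = len(table[0])
    if width < min_cols or any(len(row) != width for row in table):
        return False
    return all(cell for cell in table[0])


def recover_databases(html):
    """Return one {'columns', 'rows'} record per table that passes the filter."""
    parser = TableExtractor()
    parser.feed(html)
    return [
        {"columns": t[0], "rows": t[1:]}
        for t in parser.tables
        if looks_relational(t)
    ]


# Illustrative page: one layout table and one data table (values are made up).
page = """
<table><tr><td>Home | About | Contact</td></tr></table>
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Ann Arbor</td><td>113934</td></tr>
  <tr><td>Ypsilanti</td><td>19435</td></tr>
</table>
"""
dbs = recover_databases(page)
print(dbs)
```

The layout table is discarded and the data table survives as one table plus labeled columns, which mirrors the ratio on the slide: most raw HTML tables are not databases.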
Easy Data Analysis
• Knowledge worker queries for “city population” [VLDB08, “WebTables: Exploring…”, Cafarella et al.]
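As a rough illustration of what a “city population” query over recovered tables could look like, the sketch below ranks tables by how many query words appear in their column labels. The scoring rule, the function name `rank_tables`, and the three sample tables are assumptions made for this example; the ranking used by the actual system is not described on the slide.

```python
def rank_tables(query, tables):
    """Order table names by the number of query terms found in column labels."""
    terms = query.lower().split()
    scored = []
    for table in tables:
        header_words = " ".join(table["columns"]).lower().split()
        score = sum(1 for term in terms if term in header_words)
        if score > 0:
            scored.append((score, table["name"]))
    # Highest score first; ties broken alphabetically by table name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical recovered tables, each with a name and labeled columns.
tables = [
    {"name": "us_cities", "columns": ["City", "State", "Population"]},
    {"name": "hotels", "columns": ["Hotel", "City", "Rating"]},
    {"name": "elements", "columns": ["Symbol", "Atomic number"]},
]
ranked = rank_tables("city population", tables)
print(ranked)
```

Matching against column labels rather than page text is the design choice worth noticing: the labels are exactly the structure that extraction adds on top of the raw Web.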
Conclusions
• The Structured Web exists in raw form today, but tools largely ignore it
• Information Extraction helps gather structural information from existing Web info
• These techniques bring the promise of the Structured Web much closer