Modeling Query-Based Access to Text Databases. Eugene Agichtein Panagiotis Ipeirotis Luis Gravano Computer Science Department Columbia University. Extracting Structured Information “Buried” in Text Documents.
Modeling Query-Based Access to Text Databases
Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
Computer Science Department, Columbia University
Extracting Structured Information “Buried” in Text Documents
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…
Information Extraction System (e.g., NYU’s Proteus)
Extracting All “Tuples” of a Relation from a Text Database
[Figure: Information Extraction System, Extracted Tuples]
• Naïve approach: feed every document to the information extraction system. At 7 secs./document, Proteus takes over 8 days for 100K documents.
• Only a tiny fraction of documents contains tuples, so processing every document is inefficient.
• Many databases are not crawlable (scannable), but are available only via a search engine.
Search engines can help with both efficiency and accessibility.
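The running-time figure for the naïve approach follows from simple arithmetic, assuming documents are processed one at a time with no parallelism:

```latex
7\ \tfrac{\text{s}}{\text{document}} \times 100{,}000\ \text{documents} = 700{,}000\ \text{s},
\qquad
\frac{700{,}000\ \text{s}}{86{,}400\ \text{s/day}} \approx 8.1\ \text{days}
```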
A Query-Based Strategy for Information Extraction [Agichtein and Gravano, ICDE 2003]
0. Start with some seed tuples (e.g., <“May 1995”, “Ebola”, “Zaire”>)
1. While seed has an unprocessed tuple t:
2.     Retrieve up to MaxResults documents using a query derived from t
3.     Extract new tuples te from these documents
4.     Augment seed with te
[Figure: seed and tuples t0, t1, t2]
Potential problem: the strategy may run out of tuples (and queries), leaving an incomplete relation!
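The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the cited paper: the corpus, the `search` and `extract` helpers, and the choice to issue one query per attribute value are all assumptions made for the example.

```python
from collections import deque

# Hypothetical toy corpus: document id -> tuples an extraction system would
# return for it (a stand-in for a real extractor such as Proteus).
CORPUS = {
    "d1": {("May 1995", "Ebola", "Zaire")},
    "d2": {("May 1995", "Ebola", "Zaire"), ("June 1995", "Cholera", "Zaire")},
    "d3": {("June 1995", "Cholera", "Zaire"), ("1996", "Cholera", "Peru")},
    "d4": {("1997", "Dengue", "Cuba")},
}

def search(token, max_results):
    """Toy search engine: ids of documents mentioning `token`, truncated."""
    hits = sorted(d for d, tuples in CORPUS.items()
                  if any(token in t for t in tuples))
    return hits[:max_results]

def extract(doc_id):
    """Toy extractor: looks tuples up instead of parsing text."""
    return CORPUS[doc_id]

def iterative_extraction(seed, max_results=10):
    relation = set(seed)          # step 0: start from the seed tuples
    queue = deque(seed)           # tuples not yet used for querying
    seen_docs = set()
    while queue:                  # step 1: while an unprocessed tuple exists
        t = queue.popleft()
        for token in t:           # step 2: one query per attribute (assumption)
            for doc_id in search(token, max_results):
                if doc_id in seen_docs:
                    continue
                seen_docs.add(doc_id)
                for te in extract(doc_id):      # step 3: extract tuples
                    if te not in relation:      # step 4: augment the seed
                        relation.add(te)
                        queue.append(te)
    return relation
```

Starting from the Ebola seed tuple, this toy run finds the two Cholera tuples but never reaches the Dengue tuple in `d4`, which is the "incomplete relation" problem in miniature. Lowering `max_results` to 1 stops the run at the seed tuple alone.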
Iterative Methods Sometimes (but Not Always) “Succeed”
[Figure: two runs starting from a seed, one labeled SUCCESS!, one labeled FAIL]
Can we predict if a query-based strategy will succeed?
Model: Querying Graph
• Tokens: tuple attributes, e.g., <“May 1995”, “Ebola”, “Zaire”>
• Each token (as a query) retrieves documents
• Documents contain tokens
[Figure: bipartite graph between Tokens (t1 to t5) and Documents (d1 to d5)]
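One way to picture the querying graph is as a directed bipartite edge list. The sketch below assumes hypothetical toy data (the `RETRIEVES` and `CONTAINS` maps are invented for illustration and do not reproduce the edges drawn on the slide).

```python
# Hypothetical toy data: which documents each token retrieves when issued
# as a query, and which tokens each document contains.
RETRIEVES = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
CONTAINS = {
    "d1": ["t1", "t2"],
    "d2": ["t2", "t3"],
    "d3": ["t3", "t4"],
    "d4": ["t4"],
    "d5": ["t5"],
}

def querying_graph(retrieves, contains):
    """Return the directed edges of the bipartite querying graph."""
    edges = []
    for token, docs in retrieves.items():
        edges += [(token, d) for d in docs]    # token (as query) -> document
    for doc, tokens in contains.items():
        edges += [(doc, t) for t in tokens]    # document -> token it contains
    return edges

EDGES = querying_graph(RETRIEVES, CONTAINS)
```

Keeping the two edge directions in separate maps makes it easy to model a search engine that returns only some of the documents containing a token.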
Model: Reachability Graph
• t1 retrieves document d1, which contains t2
• t2, t3, and t4 are “reachable” from t1
[Figure: querying graph over Tokens (t1 to t5) and Documents (d1 to d5), next to the reachability graph over tokens]
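A reachability graph can be derived by collapsing each token → document → token path into a single token → token edge. The following sketch assumes hypothetical toy data, chosen so that t2, t3, and t4 are reachable from t1 as in the slide; the exact edges on the slide are not recoverable from the text.

```python
from collections import deque

# Hypothetical toy data: documents each token retrieves, tokens each
# document contains.
RETRIEVES = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
CONTAINS = {
    "d1": ["t1", "t2"],
    "d2": ["t2", "t3"],
    "d3": ["t3", "t4"],
    "d4": ["t4"],
    "d5": ["t5"],
}

def reachability_graph(retrieves, contains):
    """Edge t -> t' whenever t retrieves a document that contains t'."""
    graph = {t: set() for t in retrieves}
    for t, docs in retrieves.items():
        for d in docs:
            graph[t].update(tok for tok in contains[d] if tok != t)
    return graph

def reachable_from(graph, start):
    """Tokens reachable from `start` by breadth-first search (start excluded)."""
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen

GRAPH = reachability_graph(RETRIEVES, CONTAINS)
```

With this toy data, `reachable_from(GRAPH, "t1")` covers t2, t3, and t4, while t5 stays unreachable because no retrieved document mentions it alongside another token.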
Model: Connected Components
• Core: strongly connected
• Out: tokens not in Core, but reachable from Core
• In: tokens not in Core, but from which Core is reachable
[Figure: Core, In, and Out regions with example tokens t1 to t4]
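The Core/In/Out decomposition can be computed directly from the definitions. This is a simple quadratic-time sketch on a hypothetical toy graph; a real implementation would more likely use a linear-time strongly-connected-components algorithm such as Tarjan's.

```python
def reach(graph, start):
    """All nodes reachable from `start`, including `start` itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse(graph):
    """Graph with every edge direction flipped."""
    rev = {v: set() for v in graph}
    for v, outs in graph.items():
        for w in outs:
            rev.setdefault(w, set()).add(v)
    return rev

def bow_tie(graph):
    """Return (core, in_, out) for the largest strongly connected component."""
    rev = reverse(graph)
    core = set()
    for v in sorted(set(graph) | set(rev)):
        # The component of v = nodes that v reaches and that reach v.
        scc = reach(graph, v) & reach(rev, v)
        if len(scc) > len(core):
            core = scc
    anchor = next(iter(core))              # any Core node works as a start
    out = reach(graph, anchor) - core      # reachable from Core
    in_ = reach(rev, anchor) - core        # can reach Core
    return core, in_, out

# Hypothetical toy reachability graph (token -> tokens it leads to).
TOY = {"t0": {"t1"}, "t1": {"t2"}, "t2": {"t1", "t3"},
       "t3": {"t4"}, "t4": set(), "t5": set()}
CORE, IN, OUT = bow_tie(TOY)
```

In the toy graph, t1 and t2 form the Core, t0 is in In, t3 and t4 are in Out, and t5 belongs to none of the three.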
Components of Reachability Graph
[Figure: several components, each with In, Core (strongly connected), and Out regions; seed token t0]
How many tokens are in the largest Core + Out?
Model: Power-law Graphs
• Conjecture: the degree distribution in the reachability graph follows a power law: #(nodes with degree k) ≈ O(k^{-β}) (i.e., many nodes with small degree, a few nodes with large degree)
• Power-law random graphs are expected to have at most one giant connected component (~Core+In+Out). Other connected components are small.
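A rough way to check the power-law conjecture on observed degrees is to fit a straight line to the degree histogram in log-log space and read β off the slope. The sketch below uses plain least squares, which is an assumption made for illustration (the slides do not say how the fit was done), and the function name `estimate_beta` is hypothetical.

```python
import math
from collections import Counter

def estimate_beta(degrees):
    """Least-squares slope of log(count) vs. log(degree), negated.

    Needs at least two distinct positive degrees; zero degrees are skipped
    because log(0) is undefined.
    """
    counts = Counter(k for k in degrees if k > 0)
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic degrees whose counts follow 64 * k^-2 exactly.
SYNTHETIC = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
BETA = estimate_beta(SYNTHETIC)
```

Least squares on a log-log histogram is sensitive to noise in the sparse high-degree tail, so this is best treated as a quick sanity check rather than a rigorous estimator.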
Model: Reachability
Reachability: fraction of tokens in the largest Core + Out (the power law allows us to ignore small components)
[Figure: In, Core (strongly connected), and Out regions; seed token t0]
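Written as a formula, with N_tokens denoting the total number of tokens (notation introduced here, not taken from the slides):

```latex
\text{reachability} = \frac{|\text{Core} \cup \text{Out}|}{N_{\text{tokens}}}
```

As a hypothetical example, a graph with 6 tokens whose largest Core holds 2 tokens and whose Out holds 2 more has reachability 4/6 ≈ 0.67.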
Estimating Reachability
• In a power-law random graph G, a giant component CG emerges if the average outdegree d > 1
• Graph theory results predict the relative size of CG [Chung and Lu, Annals of Combinatorics, 2002]
Estimate reachability as the relative size of CG, which reduces to estimating the average outdegree of the reachability graph
Estimating Reachability Using Sampling (estimate average outdegree)
• Choose S random seed tokens
• Query the database for each seed token
• Extract tokens to compute the reachability graph edges for the seed tokens
• Estimate d as the average outdegree of the seed tokens
• Estimate reachability
[Figure: querying graph over Tokens and Documents with the sampled reachability edges; in the example, d = 1.5]
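The first four sampling steps can be sketched as follows. The `query` and `extract` callables and the toy data are hypothetical stand-ins for a search interface and an extraction system. The last step, mapping d to the relative size of the giant component, relies on the Chung and Lu result and is not reproduced here; the sketch only checks the d > 1 threshold.

```python
import random

# Hypothetical toy data: documents each token retrieves, tokens each
# document contains.
RETRIEVES = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
CONTAINS = {
    "d1": ["t1", "t2"],
    "d2": ["t2", "t3"],
    "d3": ["t3", "t4"],
    "d4": ["t4"],
    "d5": ["t5"],
}

def estimate_average_outdegree(tokens, query, extract, sample_size, rng):
    """Estimate d from the outdegrees of a random sample of seed tokens."""
    seeds = rng.sample(sorted(tokens), min(sample_size, len(tokens)))
    total = 0
    for t in seeds:
        neighbours = set()
        for doc in query(t):               # query the database for the seed
            neighbours.update(extract(doc))  # tokens found in retrieved docs
        neighbours.discard(t)              # ignore the self-loop
        total += len(neighbours)           # outdegree of t
    return total / len(seeds)

D_HAT = estimate_average_outdegree(
    RETRIEVES, RETRIEVES.__getitem__, CONTAINS.__getitem__,
    sample_size=5, rng=random.Random(0),
)
GIANT_COMPONENT_EXPECTED = D_HAT > 1   # threshold from the power-law model
```

Sampling all five toy tokens gives an estimate below 1, so under the model no giant component is expected, which would suggest switching strategies before spending queries on the full iterative run.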
Experimental Results: Verifying the “Power-law” Conjecture
Task 1: NYT DiseaseOutbreaks(Date, Disease, Location), New York Times, 1995
|T| = 8,859, |D| = 137,000
Follows the power-law distribution
Experimental Results: Estimating Reachability by Sampling
• Approximate reachability is estimated with S = 50 tokens
• The reachability correctly predicts the performance of the query-based information extraction strategy
• If the estimated reachability is too low, we can switch to a different strategy early
Future Work
• What if we have only limited access to the database?
  • Limit on number of queries
  • Limit on number of documents retrieved
• Not modelled by the reachability graph, but can be modelled using properties of the querying graph
[Figure: querying graph between Tokens (t1 to t5) and Documents (d1 to d5)]
Summary
• Presented a graph model for query-based algorithms:
  • for Information Extraction
  • for Constructing Database Content Summaries
• Showed that querying and reachability graphs can be used to analyze such algorithms
• Presented a single reachability metric to predict the success of iterative query-based algorithms
• Presented and verified the conjecture that reachability graphs for these algorithms follow the power law
• Presented efficient techniques for estimating reachability by exploiting properties of power-law random graphs