Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University
Users Have Many Available Information Sources [Figure: the user query “Houses near Palo Alto for around $300K.” is sent to Source 1, Source 2, ...; query results: Source 1 returns h11, h12, h13, ...; Source 2 returns nothing] Luis Gravano
Challenges • Sources are too numerous • Sources are heterogeneous (query language, model, results) • Users want a single query result
Metasearcher • Selects the good sources for a query • Extracts and combines the query results from the sources
Text Sources Rank Query Results [Figure: the query “Distributed Databases” is sent to a text source, which returns ranked results: Doc 1: 0.8, Doc 2: 0.6, ...]
Structured Sources on the Internet also Rank Results A real-estate agent receives queries on Location and Price: Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”
The Agent Ranks its Houses Based on its Own Scoring Function Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”
A Metasearcher then Faces Two Problems • Extracting the top objects from the underlying sources • Merging the results from the various sources
Merging Query Results is Easy with Enough Information Given a record like: the metasearcher ignores the Source score and computes its Target score from the Location and Price
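The merging step on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout `(house_id, attrs, source_score)`, the attribute names `location` and `price` holding per-attribute match scores in [0, 1], and the choice of `min` as the Target function are all hypothetical and not taken from the slides.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (house_id, per-attribute match scores, source score).
Record = Tuple[str, Dict[str, float], float]


def target_score(attrs: Dict[str, float]) -> float:
    """Assumed Target function: the minimum of the per-attribute scores."""
    return min(attrs["location"], attrs["price"])


def merge(results_per_source: List[List[Record]], k: int) -> List[Tuple[str, float]]:
    """Ignore each source's own score and rank all records by the Target score."""
    rescored = [
        (house_id, target_score(attrs))
        for results in results_per_source
        for house_id, attrs, _source_score in results
    ]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:k]


source1 = [
    ("h11", {"location": 0.9, "price": 0.5}, 0.95),
    ("h12", {"location": 0.7, "price": 0.8}, 0.40),
]
source2 = [("h21", {"location": 0.6, "price": 0.9}, 0.99)]
merged = merge([source1, source2], k=2)
```

In this made-up data the sources rank h11 and h21 highest, but the metasearcher's own ranking puts h12 first, because the source scores are never consulted.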
Extracting the Top Objects from a Source is Hard The metasearcher’s scoring function might be different from the source’s!
We Want to Avoid Extracting All the Source’s Contents Assume a house h with: • Source(Q, h) = 0 (worst for source) • Target(Q, h) = 1 (best for metasearcher) Problem!
The Example Query is Not Manageable at the Agent A query Q is manageable at a source if ∃ ε < 1 such that: Source(Q, h) ≥ Target(Q, h) - ε [Figure: Source score plotted against Target score on the unit square from (0,0) to (1,1), with ε and 1-ε marked]
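The manageability condition can be checked numerically on a sample of objects. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it only inspects the sampled (Source, Target) score pairs, whereas the definition on the slide quantifies over every object the source might hold.

```python
from typing import List, Tuple


def smallest_epsilon(pairs: List[Tuple[float, float]]) -> float:
    """Smallest eps >= 0 with Source >= Target - eps on the given (source, target) pairs."""
    worst_gap = max((target - source for source, target in pairs), default=0.0)
    return max(0.0, worst_gap)


def is_manageable_on_sample(pairs: List[Tuple[float, float]]) -> bool:
    """The condition requires a slack constant strictly below 1."""
    return smallest_epsilon(pairs) < 1.0


# The problem case from the previous slide: Source = 0 but Target = 1.
problem_case = [(0.0, 1.0)]
# A milder, made-up sample where the source under-scores by at most 0.25.
mild_case = [(0.5, 0.75), (1.0, 0.5)]
```

On `problem_case` the smallest slack is exactly 1, so the query is not manageable; on `mild_case` it is 0.25.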
Single-Attribute Queries Are More Likely to be Manageable Single-attribute queries for Q: • Q1: Location = Palo Alto • Q2: Price = $300K
The Example Becomes Tractable! … if the top Target objects for Q are among the top Source objects for Q1 and Q2
A Cover Bounds the Target Scores for Q Q1, …, Qm single-attribute queries form a cover for Q if ∃ g1, …, gm, G such that: Target(Qi, h) ≤ gi for all i ⇒ Target(Q, h) ≤ G
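A short worked case may make the cover condition concrete. It assumes, which the slides do not state explicitly, that the metasearcher's Target for Q is the minimum (or the maximum) of the single-attribute Target scores; under that assumption the bound G follows directly from the thresholds g_i.

```latex
\begin{align*}
\text{If } \mathrm{Target}(Q,h) &= \min_{i}\, \mathrm{Target}(Q_i,h)
  \text{ and } \mathrm{Target}(Q_i,h) \le g_i \text{ for all } i,
  \text{ then } \mathrm{Target}(Q,h) \le \min_{i} g_i = G. \\
\text{If } \mathrm{Target}(Q,h) &= \max_{i}\, \mathrm{Target}(Q_i,h)
  \text{ and } \mathrm{Target}(Q_i,h) \le g_i \text{ for all } i,
  \text{ then } \mathrm{Target}(Q,h) \le \max_{i} g_i = G.
\end{align*}
```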
Having a Manageable Cover for a Query is Sufficient... Manageable Cover for query Q at source S ⇒ “Efficient” Executions Possible at S
Having a Manageable Cover for a Query is Sufficient...
(1) Pick a manageable cover C = {Q1, ..., Qm} for Q at S
(2) For i = 1 to m: Find εi for Qi
(3) Pick 0 ≤ g1, ..., gm, G < 1 for cover C
(4) For i = 1 to m
(5)   Retrieve all objects t with Source(Qi, t) ≥ Gi = gi - εi
(6) Compute Target(Q, t) for all objects t retrieved
(7) If ∃ i such that Gi ≤ 0 Then Go to Step (11)
(8) If for all t retrieved, Target(Q, t) ≤ G Then
(9)   Find new, lower 0 ≤ g1, ..., gm, G < 1 for C
(10)  Go to Step (4)
(11) Output those objects retrieved with the highest Target score
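The steps above can be sketched as a small in-memory simulation. Several things here are assumptions made for illustration rather than details from the slides: each source query Qi is modelled as a dictionary from object id to Source(Qi, t), the cover bound G is supplied by a hypothetical `bound` callable, all thresholds start at the same value and are lowered by a fixed `step`, and ties for the highest Target score are all returned.

```python
from typing import Callable, Dict, List, Sequence


def extract_top(
    source_scores: Sequence[Dict[str, float]],   # source_scores[i][t] = Source(Qi, t)
    target: Callable[[str], float],              # Target(Q, t)
    eps: Sequence[float],                        # slack constant for each Qi, step (2)
    bound: Callable[[Sequence[float]], float],   # G implied by g1..gm for the cover
    g_start: float = 0.75,
    step: float = 0.25,
) -> List[str]:
    m = len(source_scores)
    g = [g_start] * m                                    # step (3)
    while True:
        thresholds = [g[i] - eps[i] for i in range(m)]   # Gi = gi - eps_i
        retrieved = {                                    # steps (4)-(5)
            t
            for i in range(m)
            for t, s in source_scores[i].items()
            if s >= thresholds[i]
        }
        scores = {t: target(t) for t in retrieved}       # step (6)
        if any(th <= 0 for th in thresholds):            # step (7)
            break
        if all(s <= bound(g) for s in scores.values()):  # step (8)
            g = [max(0.0, gi - step) for gi in g]        # step (9)
            continue                                     # step (10)
        break
    if not scores:
        return []
    best = max(scores.values())                          # step (11)
    return sorted(t for t, s in scores.items() if s == best)


# Made-up scores for two single-attribute queries, with eps = 0 for both.
q1 = {"a": 0.9, "b": 0.5, "c": 0.2, "d": 0.1}
q2 = {"a": 0.3, "b": 0.6, "c": 0.8, "d": 0.1}
top_min = extract_top([q1, q2], lambda t: min(q1[t], q2[t]), [0.0, 0.0], min)
top_max = extract_top([q1, q2], lambda t: max(q1[t], q2[t]), [0.0, 0.0], max)
```

With Target = Min the loop lowers the thresholds twice before some retrieved object exceeds G, and object d is never retrieved; with Target = Max it stops on the first pass.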
Algorithm to Extract Top Target Objects [Figure: Q1 and Q2 score axes from 0 to 1 with thresholds g1 and g2 marked; label: Target(Q, h) ≤ G]
Algorithm to Extract Top Target Objects [Figure: Q1 and Q2 score axes from 0 to 1 with thresholds g1’ and g2’ marked; labels: Target(Q, h) ≤ G’; an object h’ with Target(Q, h’) > G’!]
Preliminary Performance Results for our Algorithm • Target=Min: 14% objects retrieved • Target=Max: 4% objects retrieved Setup: 10,000 objects, 4 query attributes, ε = 0
Preliminary Performance Results for our Algorithm • Target=Min: 25% objects retrieved • Target=Max: 44% objects retrieved Setup: 10,000 objects, 4 query attributes, ε = 0.10
Having a Manageable Cover for a Query is Also Necessary... No Manageable Cover for query Q at source S ⇒ Efficient Executions Impossible at S
A Manageable Cover is Necessary: Proof Consider a minimal cover Q1, Q2, Q3 for Q with: Q1, Q2 manageable, Q3 not manageable • For any “efficient” execution, build h such that: • h is not retrieved • Target(Q, h) > G = max{Target(Q, o) | o retrieved}
A Manageable Cover is Necessary: Proof [Figure: Q1, Q2, and Q3 score axes from 0 to 1 with thresholds g1, g2, and g3 marked; objects h’ and h placed on each axis] Target(Q, h’) > G; Target(Q3, h) ≥ Target(Q, h’); Target(Q, h) > G!
We Studied Two Metasearching Problems • Extracting the top objects from the underlying sources • Merging the results from the various sources
Related Work: Collection Fusion • Voorhees et al. • Callan/Lu/Croft • Gauch/Wang