Top-k Set Similarity Joins. Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang Univ. of New South Wales, Australia ICDE ’09 9 Feb 2011 Taewhi Lee. Based on Chuan Xiao’s presentation slides in ICDE ’09. Outline. Introduction Problem Definition Existing Approaches
Top-k Set Similarity Joins • Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang, Univ. of New South Wales, Australia, ICDE ’09 • 9 Feb 2011, Taewhi Lee • Based on Chuan Xiao’s presentation slides in ICDE ’09
Outline • Introduction • Problem Definition • Existing Approaches • Top-k Similarity Join Algorithms • Experiments
Motivation • Data Cleaning
More Applications • Near-duplicate Web page detection • Example: the article "Obama Has Busy Final Day Before Taking Office as Bush Says Farewells" appears on iht.com (Jan 20, 2009) and in the New York Times (Jan 19, 2009)
(Traditional) Set Similarity Join • Each record is tokenized into a set • Given a collection of records, the set similarity join problem is to find all pairs of records <x,y> such that sim(x,y) ≥ t • Common similarity functions: • jaccard: $J(x,y) = \frac{|x \cap y|}{|x \cup y|}$ • cosine: $C(x,y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}}$ • dice: $D(x,y) = \frac{2\,|x \cap y|}{|x| + |y|}$ • What if t is unknown beforehand?
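As a minimal sketch, the three similarity functions named on this slide can be written over Python sets. The definitions below are the usual set-based ones and are assumed here; the function names are illustrative, not taken from the slides.

```python
import math

def jaccard(x, y):
    """Overlap divided by the size of the union."""
    x, y = set(x), set(y)
    union = len(x | y)
    return len(x & y) / union if union else 0.0

def cosine(x, y):
    """Overlap divided by the geometric mean of the set sizes."""
    x, y = set(x), set(y)
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))

def dice(x, y):
    """Twice the overlap divided by the sum of the set sizes."""
    x, y = set(x), set(y)
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0

# Two 5-token records sharing 4 tokens
w = {"A", "B", "C", "D", "E"}
x = {"A", "B", "C", "E", "F"}
print(jaccard(w, x), cosine(w, x), dice(w, x))
```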
What If t Is Unknown Beforehand? • Example using the jaccard similarity function • w = {A, B, C, D, E} • x = {A, B, C, E, F} • y = {B, C, D, E, F} • z = {B, C, F, G, H} • If t = 0.7 → no results • If t = 0.4 → <w,x>, <w,y>, <x,y>, <x,z>, <y,z> (too many results and a long running time) • Instead, return the top-k results ranked by their similarity values • If k = 1 → <w,x> (<w,x>, <w,y> and <x,y> tie at 4/6, so any one of them is a valid answer)
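The example can be checked with a brute-force sketch that compares every pair. The helper names `threshold_join` and `topk_naive` are illustrative only, and ties are broken by record name, which is an assumption of this sketch.

```python
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

records = {
    "w": {"A", "B", "C", "D", "E"},
    "x": {"A", "B", "C", "E", "F"},
    "y": {"B", "C", "D", "E", "F"},
    "z": {"B", "C", "F", "G", "H"},
}

def threshold_join(records, t):
    """All pairs whose jaccard similarity is at least t (brute force)."""
    return [(a, b) for a, b in combinations(sorted(records), 2)
            if jaccard(records[a], records[b]) >= t]

def topk_naive(records, k):
    """The k most similar pairs; the stable sort keeps name order on ties."""
    pairs = sorted(combinations(sorted(records), 2),
                   key=lambda p: jaccard(records[p[0]], records[p[1]]),
                   reverse=True)
    return pairs[:k]

print(threshold_join(records, 0.7))
print(threshold_join(records, 0.4))
print(topk_naive(records, 1))
```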
Top-k Set Similarity Join • Return the top-k pairs of records, ranked by similarity scores • Advantages over the traditional similarity join • No threshold needs to be specified • Results are output progressively, which benefits interactive applications • Produces the most meaningful results under limited resources or time constraints • Can be stopped at any time while still guaranteeing sim(output results) ≥ sim(unseen pairs)
Straightforward Solution • Start from a certain t and repeat the following steps: • Answer the traditional sim-join with t as the threshold • If # of results ≥ k, stop and output the k results with the highest sim • Else, decrease t • Example (jaccard, k = 2) • w = {A, B, C, E} • x = {A, B, C, E, F} • y = {B, C, D, E, F} • z = {B, C, F, G, H} • t = 0.9 → no result • t = 0.8 → <w,x> • t = 0.7 → <w,x> • t = 0.6 → <w,x>, <x,y> • Which thresholds shall we enumerate? Only t = 0.8 and t = 0.6 produce new results; at t = 0.7 the results don’t change!
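The repeat-and-lower loop can be sketched as below, using a brute-force join as a stand-in for a real threshold sim-join. The function names and the fixed list of thresholds are assumptions of this sketch. It returns the answer together with the number of join invocations, which shows the repeated work.

```python
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

def threshold_join(records, t):
    """Brute-force stand-in for a threshold sim-join; returns (sim, a, b) triples."""
    out = []
    for a, b in combinations(sorted(records), 2):
        sim = jaccard(records[a], records[b])
        if sim >= t:
            out.append((sim, a, b))
    return out

def topk_by_lowering_threshold(records, k, thresholds):
    """Rerun the join with decreasing thresholds until at least k results appear."""
    results, rounds = [], 0
    for t in thresholds:
        rounds += 1
        results = threshold_join(records, t)
        if len(results) >= k:
            break
    results.sort(key=lambda r: r[0], reverse=True)
    return [(a, b) for _, a, b in results[:k]], rounds

records = {
    "w": {"A", "B", "C", "E"},
    "x": {"A", "B", "C", "E", "F"},
    "y": {"B", "C", "D", "E", "F"},
    "z": {"B", "C", "F", "G", "H"},
}
thresholds = [i / 10 for i in range(9, 0, -1)]  # 0.9, 0.8, ..., 0.1
print(topk_by_lowering_threshold(records, 2, thresholds))
```

With this data the join runs four times (t = 0.9, 0.8, 0.7, 0.6), and the run at t = 0.7 adds nothing new.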
Naïve and Index-Based Algorithms • Naïve algorithm: compare every pair of objects → O(n²) time complexity • Index-based algorithm [Sarawagi et al. SIGMOD04]: inverted lists • Pipeline: Record Set → Index Construction → Candidate Generation → Verification → Result Pairs
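A minimal sketch of the index-based pipeline, assuming jaccard similarity and one inverted list per token: each record probes the lists of its tokens to collect candidates, verifies them, and is then indexed itself. This illustrates the idea only and is not the algorithm of the cited paper.

```python
from collections import defaultdict

def jaccard(x, y):
    return len(x & y) / len(x | y)

def index_based_join(records, t):
    inverted = defaultdict(list)  # token -> ids of records already indexed
    results = []
    for rid in sorted(records):
        tokens = records[rid]
        # Candidate generation: records sharing at least one token
        candidates = set()
        for tok in tokens:
            candidates.update(inverted[tok])
        # Verification: compute the exact similarity of each candidate
        for cid in candidates:
            if jaccard(tokens, records[cid]) >= t:
                results.append((cid, rid))
        # Index construction: add this record to its tokens' lists
        for tok in tokens:
            inverted[tok].append(rid)
    return sorted(results)

records = {
    "w": {"A", "B", "C", "E"},
    "x": {"A", "B", "C", "E", "F"},
    "y": {"B", "C", "D", "E", "F"},
    "z": {"B", "C", "F", "G", "H"},
}
print(index_based_join(records, 0.6))
```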
Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07] • Sort the tokens by a global ordering • Increasing order of document frequency • Only need to index the first few tokens (the prefix) of each record • Example: jaccard, t = 0.8 → |x ∩ y| ≥ 4 is required if |x| = |y| = 5 • If the sorted prefixes of x and y share no token, the upper bound of the overlap is O(x,y) = 3 < 4, so the pair cannot qualify • Two records must share at least one token in their prefixes to be a candidate pair • For jaccard, prefix length = ⌊|x| * (1 – t)⌋ + 1 → each t is associated with a prefix length
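The prefix filter for jaccard can be sketched as follows, assuming the token lists are already sorted by the global ordering. The helper names are illustrative. The prefix length is computed as |x| - ⌈t·|x|⌉ + 1, which equals ⌊|x|·(1 - t)⌋ + 1 for integer sizes, and a small epsilon is an added guard against floating-point round-off.

```python
import math

def prefix_length(size, t):
    """Number of leading tokens to index for a record of the given size."""
    return size - math.ceil(t * size - 1e-9) + 1

def share_prefix_token(x_sorted, y_sorted, t):
    """Prefix filter: True if the pair survives as a candidate."""
    px = set(x_sorted[:prefix_length(len(x_sorted), t)])
    py = set(y_sorted[:prefix_length(len(y_sorted), t)])
    return bool(px & py)

# |x| = |y| = 5 and t = 0.8 give a prefix of 2 tokens each.
x = ["A", "B", "C", "D", "E"]
y = ["F", "G", "C", "D", "E"]   # prefixes {A,B} and {F,G}: overlap is at most 3
y2 = ["B", "C", "D", "E", "F"]  # prefixes {A,B} and {B,C} share B
print(prefix_length(5, 0.8), share_prefix_token(x, y, 0.8), share_prefix_token(x, y2, 0.8))
```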
Necessary Thresholds • Each prefix is associated with a threshold: the maximum possible similarity a record can achieve with other records • [Figure: a record x with the threshold t associated with each of its prefixes, alongside records y and z]
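Inverting the jaccard prefix-length formula from the Prefix Filter slide suggests that the p-th token of a record x (1-based) corresponds to the threshold 1 - (p - 1)/|x|. The sketch below lists these thresholds for a record of a given size. The function name is hypothetical, and the formula is a derivation under that assumption rather than something stated on the slide.

```python
def prefix_thresholds(size):
    """Threshold associated with each prefix position 1..size (jaccard)."""
    # (size - p + 1) / size equals 1 - (p - 1) / size, with less float round-off
    return [(size - p + 1) / size for p in range(1, size + 1)]

print(prefix_thresholds(5))
```

For a 5-token record the thresholds fall from 1.0 at the first token to 0.2 at the last, so lowering t by less than 0.2 does not always extend the prefix.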
Event-driven Model • Problem: repeated invocation of the sim-join algorithm • t is decreasing → run the sim-join algorithm in an incremental way • Prefix event <x, A, t>: record x, token A, threshold t • Initialize the prefix length of each record as 1, e.g. <x, A, 1.0> • For each prefix event • Probe the inverted list of the token for candidate pairs, verify the candidate pairs, and insert them into the temp results • Insert x into A’s inverted list • Extend the prefix by one token • Maintain the prefix events in a max-heap on t • Stop when t ≤ the k-th temp result’s similarity
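The event loop above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it assumes jaccard similarity, uses the threshold (|x| - p + 1)/|x| for the p-th prefix token, and remembers verified pairs in a plain set instead of applying any optimization.

```python
import heapq
from collections import Counter, defaultdict

def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def topk_join(records, k):
    if k <= 0:
        return []
    # Global ordering: increasing document frequency, ties broken by token
    df = Counter(tok for toks in records.values() for tok in set(toks))
    recs = {rid: sorted(set(toks), key=lambda tok: (df[tok], tok))
            for rid, toks in records.items()}
    # Max-heap of prefix events, stored as (-threshold, record id, position)
    events = [(-1.0, rid, 0) for rid, toks in recs.items() if toks]
    heapq.heapify(events)
    inverted = defaultdict(list)  # token -> ids of records indexed so far
    top = []                      # min-heap of the best k (sim, a, b) so far
    seen = set()                  # pairs already verified
    while events:
        neg_t, rid, pos = heapq.heappop(events)
        if len(top) == k and -neg_t <= top[0][0]:
            break  # no unseen pair can beat the k-th temp result
        toks = recs[rid]
        tok = toks[pos]
        for other in inverted[tok]:  # probe for candidate pairs
            pair = (min(rid, other), max(rid, other))
            if pair in seen:
                continue
            seen.add(pair)
            sim = jaccard(toks, recs[other])  # verification
            if len(top) < k:
                heapq.heappush(top, (sim, *pair))
            elif sim > top[0][0]:
                heapq.heapreplace(top, (sim, *pair))
        inverted[tok].append(rid)
        if pos + 1 < len(toks):  # extend the prefix by one token
            next_t = (len(toks) - (pos + 1)) / len(toks)
            heapq.heappush(events, (-next_t, rid, pos + 1))
    return sorted(top, reverse=True)

records = {
    "w": {"A", "B", "C", "E"},
    "x": {"A", "B", "C", "E", "F"},
    "y": {"B", "C", "D", "E", "F"},
    "z": {"B", "C", "F", "G", "H"},
}
print(topk_join(records, 2))
```

The result list is ordered by decreasing similarity, and pairs that tie with the k-th result may be omitted.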
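Topk-join - Example (jaccard, k = 2) • [Figure: prefix events for records w, x, y, z with their inverted lists and temporary results; the event shown has t = 0.6, compared against the 2nd temp result’s sim; some pairs are verified twice]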
Optimizations - Verification • In the above example, (w,x) and (y,z) have been verified twice • How to avoid repeated verification? • Memorizing all verified pairs in a hash table → too much memory consumption • Instead, when a pair is verified for the first time, check whether it will be identified again • Keep only those pairs that will be identified again before the algorithm stops • This guarantees that no pair is verified twice • [Figure: records x and y; if the k-th temp result’s sim = 0.7, the pair won’t be identified again]
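One way to read the "will it be identified again?" check is sketched below. It rests on several assumptions not stated on the slide: jaccard prefix thresholds of (|x| - p + 1)/|x|, token lists sorted so that plain `<` matches the global ordering, and a pair being re-identified only when the later of its two events for the next common token fires before the algorithm stops. The function name is hypothetical.

```python
def will_be_identified_again(x, y, pos_x, pos_y, s_k):
    """x, y: sorted token lists; pos_x, pos_y: 0-based positions of the common
    token that just produced the pair; s_k: current k-th temp similarity."""
    i, j = pos_x + 1, pos_y + 1
    while i < len(x) and j < len(y):
        if x[i] == y[j]:
            # The pair reappears when the later (lower-threshold) event fires.
            t_next = min((len(x) - i) / len(x), (len(y) - j) / len(y))
            return t_next > s_k  # events with t <= s_k are never processed
        if x[i] < y[j]:
            i += 1
        else:
            j += 1
    return False  # no further common token

x = ["A", "B", "C", "D", "E"]
y = ["A", "C", "D", "E", "F"]
# Next common token is C, at thresholds 0.6 (in x) and 0.8 (in y).
print(will_be_identified_again(x, y, 0, 0, 0.7))
print(will_be_identified_again(x, y, 0, 0, 0.5))
```

Only pairs for which the check returns True would need to be kept in the hash table. Because the k-th similarity never decreases, a False answer stays valid for the rest of the run.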
Optimizations - Indexing • How to reduce the inverted list size to save memory? • t is decreasing → calculate the upper bound of similarity for future probings into the inverted lists • Don’t insert a record into an inverted list if this upper bound ≤ the k-th temp result’s similarity • [Figure: records x and y, each at prefix threshold 0.8; max. similarity = 4/6 = 0.67]
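A possible form of this upper bound for jaccard is sketched below. It is a derivation under assumptions, not a formula given on the slide. If x is indexed at prefix threshold t_x and a future probing record y arrives at threshold at most t, the overlap o satisfies o ≤ t_x·|x| and o ≤ t·|y|, which gives jaccard ≤ 1 / (1/t_x + 1/t - 1). The function names are hypothetical.

```python
def indexing_upper_bound(t_x, t):
    """Upper bound on jaccard between x (indexed at threshold t_x) and any
    record probing later at a threshold of at most t (both in (0, 1])."""
    return (t_x * t) / (t_x + t - t_x * t)

def should_index(t_x, t, s_k):
    """Skip the insertion when no future probe can beat the k-th temp result."""
    return indexing_upper_bound(t_x, t) > s_k

print(indexing_upper_bound(0.8, 0.8))
print(should_index(0.8, 0.8, 0.7), should_index(0.8, 0.8, 0.6))
```

With t_x = t = 0.8 the bound evaluates to about 0.67, matching the 4/6 in the slide's figure.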
Experiment Settings • Algorithms • topk-join • pptopk: a modified ppjoin [Xiao et al. WWW08], a prefix-filter based approach, run with t = 0.95, 0.90, 0.85, ... • Measures • Compare topk-join and pptopk (candidate size, running time) • Output results progressively • Dataset
Thank You! Any questions or comments?
Related Work • Index-based approaches • S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004. • C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008. • Prefix-based approaches • S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006. • R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007. • C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008. • PartEnum • A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.