Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1] Pirooz Chubak May 22, 2008
Motivation • Selectivity estimation of approximate string matching queries • Applications • Misspelling correction/suggestion • Data integration and data cleaning • Query optimization (generating query plans)
Approximate String Matching • String similarity measures • Edit distance • Hamming distance • Jaccard similarity coefficient • Edit distance: the minimum number of edit operations (insertion, deletion, replacement) needed to convert one string into the other
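A minimal sketch of the edit distance defined above, using the standard dynamic-programming recurrence with unit cost for each insertion, deletion and replacement. The function name `edit_distance` is illustrative and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # converting a[:i] to the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace (or keep) ca
        prev = curr
    return prev[-1]
```

For example, `edit_distance("abcd", "abd")` is 1, since a single deletion suffices.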
Short Identifying Substring • SIS by Chaudhuri et al. [2] • A string s usually has a short substring s’ such that if an attribute value contains s’, it almost always contains s as well • Thus, the selectivity of a long query string can be approximated by that of its short identifying substring
Related Work • SEPIA [3] • Clusters similar strings • Selects a pivot for each cluster • Captures the edit distance distribution with histograms • For each query, visit all the clusters and estimate the number of strings within the distance threshold
Problem Statement • Given a query string sq and a bag of strings DB, estimate the size of the answer set, i.e. the number of strings in DB within a given edit distance threshold of sq • Interested in low edit distance thresholds (1-3)
Basic Definitions • Q-gram • Any string of length q • N-gram table • Frequencies of all q-grams for q = 1…N • Ans(sq,iDjImR) = set of strings s’ such that sq can be converted to s’ with i deletions, j insertions and m replacements • Ans(sq,k) = set of strings s’ obtained from sq with exactly k edit operations
Examples • Ans(“abcd”,1R) = {“?bcd”, “a?cd”, “ab?d”, “abc?”} • Alphabet for extended Q-grams = Σ ∪ {#, $, ?}, where # and $ mark the start and end of a string and ? is a wildcard matching any single character • 3-gram table for “beau” contains frequencies for • 1-grams (b, e, a, u) • 2-grams (#b, be, ea, au, u$) • 3-grams (#be, bea, eau, au$) • Extended 3-gram table also contains frequencies for • 2-grams (?b, ?e, ?a, ?u, b?, e?, a?, u?, ??, #?, ?$) • 3-grams (?ea, #?e, ??$, etc.)
Replacement semi-lattice • Assume only replacements are allowed • E.g. Ans(“abcd”,2R) • Possible answers = ab??, a?c?, ?bc?, a??d, ?b?d, ??cd • Find the value of |Ans(“abcd”,2R)| = |S1 ∪ … ∪ S6| using the inclusion-exclusion principle • S1 = ab??, … , S6 = ??cd
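Written out, the count for this example would take the following form, assuming the formula omitted from the slide is the standard inclusion-exclusion expansion over the six sets. Each intersection is itself an extended string, e.g. ab?? ∩ a?c? = abc?.

```latex
\[
|\mathrm{Ans}(\text{abcd}, 2R)|
  = \Bigl|\bigcup_{i=1}^{6} S_i\Bigr|
  = \sum_{i} |S_i|
  - \sum_{i<j} |S_i \cap S_j|
  + \sum_{i<j<l} |S_i \cap S_j \cap S_l|
  - \cdots
  - |S_1 \cap \cdots \cap S_6|
\]
```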
Replacement semi-lattice (Cont.) • Figure: semi-lattice for Ans(“abcd”,2R) • Get the values of the intersections from the semi-lattice and plug them into the formula for |Ans(“abcd”,2R)|
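The following sketch shows how such a count can be computed for a small in-memory bag of strings. It assumes inclusion-exclusion over the base strings and obtains frequencies by scanning the bag directly rather than by looking them up in an extended N-gram table; all function names are hypothetical.

```python
from itertools import combinations


def replacement_patterns(sq, k, wildcard="?"):
    """Base strings of Ans(sq, kR): sq with k positions replaced by wildcards."""
    return ["".join(wildcard if i in pos else c for i, c in enumerate(sq))
            for pos in combinations(range(len(sq)), k)]


def intersect(p1, p2, wildcard="?"):
    """Pattern matching exactly the strings matched by both p1 and p2, or None."""
    if len(p1) != len(p2):
        return None
    out = []
    for a, b in zip(p1, p2):
        if a == wildcard:
            out.append(b)
        elif b == wildcard or a == b:
            out.append(a)
        else:
            return None  # conflicting fixed characters: empty intersection
    return "".join(out)


def matches(pattern, s, wildcard="?"):
    return len(pattern) == len(s) and all(p == wildcard or p == c
                                          for p, c in zip(pattern, s))


def frequency(pattern, db):
    """Exact frequency by scanning; a real system would use the N-gram table."""
    return sum(matches(pattern, s) for s in db)


def count_replacements(sq, k, db):
    """|Ans(sq, kR)| over db via inclusion-exclusion on the base strings."""
    patterns = replacement_patterns(sq, k)
    total = 0
    for r in range(1, len(patterns) + 1):
        sign = 1 if r % 2 == 1 else -1
        for combo in combinations(patterns, r):
            node = combo[0]
            for p in combo[1:]:
                node = intersect(node, p)
            total += sign * frequency(node, db)
    return total
```

Because the frequencies here are exact, the result equals the number of strings of the same length as the query that differ from it in at most k positions. With table-based frequency estimates the same sum becomes an approximation.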
General Formulas • Generalize the above idea to find |Ans(sq,kR)| • The general formula for deletions is simple: the result is always the sum of the frequencies of the level-0 nodes • The general case for insertions can be very complex, so only up to 3 insertions are considered
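A small sketch of the deletion case. It assumes the level-0 nodes are the distinct strings obtained by deleting k characters from the query, and that a node's frequency is the number of exact occurrences in the bag. Since these nodes contain no wildcards, distinct nodes match disjoint sets of strings, which is one way to see why a plain sum suffices. The function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def deletion_nodes(sq, k):
    """Distinct strings obtained from sq by deleting exactly k characters."""
    return {"".join(c for i, c in enumerate(sq) if i not in drop)
            for drop in combinations(range(len(sq)), k)}


def count_deletions(sq, k, db):
    """|Ans(sq, kD)| over db: sum of the frequencies of the level-0 nodes."""
    freq = Counter(db)
    return sum(freq[node] for node in deletion_nodes(sq, k))
```

For instance, `deletion_nodes("aab", 1)` is `{"ab", "aa"}`: deleting either `a` gives the same node, which is counted only once.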
Estimate selectivity • General idea • Group Ans(sq,k) by the length of the strings (l-k ... l+k, where l is the length of sq) • Estimate the size of each subset separately • Ans(“abcde”,2) • 5 subsets, containing strings of length 3 to 7 • Length 3 is Ans(“abcde”,2D) • Length 5 is Ans(“abcde”,1I1D) ∪ Ans(“abcde”,2R) • Lots of overlap between the two sets
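The grouping by length can be sketched by enumerating every split of the k operations into i deletions, j insertions and m replacements, and recording the resulting length l - i + j. The function name is hypothetical.

```python
from collections import defaultdict


def operations_by_length(l, k):
    """Map each result length to the (i deletions, j insertions, m replacements)
    combinations with i + j + m == k that produce it from a string of length l."""
    groups = defaultdict(list)
    for i in range(k + 1):
        for j in range(k + 1 - i):
            m = k - i - j
            groups[l - i + j].append((i, j, m))
    return dict(groups)
```

For l = 5 and k = 2 this gives lengths 3 to 7; length 3 maps to `(2, 0, 0)`, i.e. 2D, and length 5 maps to `(0, 0, 2)` and `(1, 1, 0)`, i.e. 2R and 1I1D, matching the example above.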
Estimate selectivity (Cont.) • Combined Approach • Obtain base strings for both sets • Remove redundant base strings • Ans(“abcde”,2R) generates “abc??” • Ans(“abcde”,1I1D) generates “abcd?” • “abc??” matches every string that “abcd?” matches, so remove “abcd?” from the base strings
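Redundancy removal can be sketched as a containment test between wildcard patterns: a base string is dropped when another base string of the same length matches everything it matches. This assumes `?` is the only wildcard; the names `covers` and `remove_redundant` are illustrative.

```python
def covers(general, specific, wildcard="?"):
    """True if every string matched by `specific` is also matched by `general`."""
    return len(general) == len(specific) and all(
        g == wildcard or g == s for g, s in zip(general, specific))


def remove_redundant(base_strings):
    """Keep only base strings that are not covered by a different base string."""
    unique = list(dict.fromkeys(base_strings))  # drop duplicates, keep order
    return [p for p in unique
            if not any(q != p and covers(q, p) for q in unique)]
```

With the example from the slide, `remove_redundant(["abc??", "abcd?"])` keeps only `"abc??"`.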
Estimate selectivity (cont.) • BasicEQ, for a given string length • Find the base strings (remove redundancies) • Iteratively intersect base strings to obtain r-intersections (r = 2..|base strings|) • This will generate new nodes in the hierarchy • Partition the nodes and estimate their frequencies • Add these estimated frequencies
Estimate selectivity (cont.) • Node Partitioning • Partition the nodes, so that every node q in a partition has the same coefficient Cq • Cq is the number of times q appears in all the intersections of base strings • For each partition find Cq and sum of frequencies of its nodes
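The intersection and partitioning steps can be sketched as follows. The sketch assumes that Cq is the signed inclusion-exclusion count of q, i.e. +1 for each odd-sized and -1 for each even-sized group of base strings whose intersection is q; this reading, the brute-force enumeration of all groups, and the function names are assumptions rather than the procedure of [1]. The frequency function is passed in so that either exact counts or table-based estimates can be used.

```python
from collections import Counter, defaultdict
from itertools import combinations


def intersect(p1, p2, wildcard="?"):
    """Pattern matched by both p1 and p2, or None if no string matches both."""
    if p1 is None or p2 is None or len(p1) != len(p2):
        return None
    out = []
    for a, b in zip(p1, p2):
        if a == wildcard:
            out.append(b)
        elif b == wildcard or a == b:
            out.append(a)
        else:
            return None
    return "".join(out)


def coefficients(base_strings):
    """Signed coefficient Cq for every node q reachable as an r-intersection."""
    coeff = Counter()
    for r in range(1, len(base_strings) + 1):
        sign = 1 if r % 2 == 1 else -1
        for combo in combinations(base_strings, r):
            node = combo[0]
            for p in combo[1:]:
                node = intersect(node, p)
            if node is not None:  # empty intersections contribute nothing
                coeff[node] += sign
    return coeff


def estimate(base_strings, frequency):
    """Partition nodes by coefficient, then sum Cq * (sum of node frequencies)."""
    partitions = defaultdict(list)
    for node, c in coefficients(base_strings).items():
        partitions[c].append(node)
    return sum(c * sum(frequency(q) for q in nodes)
               for c, nodes in partitions.items())
```

For the base strings of Ans(“abc”,1R), namely `ab?`, `a?c` and `?bc`, each base string gets coefficient 1 and the node `abc` gets -3 + 1 = -2.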
Frequency Estimation • Estimate the frequency of an extended q-gram using the extended N-gram table • Maximal Overlap (MO) [4] • Finds the substrings in the table that have the maximum overlap with sq • MAX approach • If MO(“abc?”) < MO(“abcd”), then use MO(“abcd”) as the estimate for “abc?” • MO+ • Find the substring with the minimum frequency • MM • Combination of MAX and MO+
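A simplified sketch of an MO-style estimate, not the exact procedure of [4]. It assumes n >= 2, that every n-gram and (n-1)-gram of the string is present in the table, and that consecutive n-grams, which overlap in n-1 characters, are the maximally overlapping substrings. The estimate multiplies the frequency of the first n-gram by the conditional ratios f(gram) / f(overlap). The table layout and the function name are hypothetical.

```python
def mo_estimate(s, table, n):
    """Markov-style estimate of the frequency of s from an n-gram table.

    `table` maps grams of length <= n to their frequencies.
    """
    if len(s) <= n:
        return float(table.get(s, 0))
    est = float(table.get(s[:n], 0))
    for i in range(1, len(s) - n + 1):
        gram = s[i:i + n]             # next n-gram
        overlap = s[i:i + n - 1]      # part shared with the previous n-gram
        denom = table.get(overlap, 0)
        if denom == 0:
            return 0.0
        est *= table.get(gram, 0) / denom
    return est
```

With a 2-gram table where `ab` has frequency 4, `bc` has 2 and `b` has 8, the estimate for `abc` is 4 × 2 / 8.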
Estimate selectivity (cont.) • BasicEQ is efficient if the general formulas are applicable • OptEQ is proposed, adding two enhancements to BasicEQ • Approximates the coefficient Cq, trading exactness for better performance • Groups the sets of strings obtained in each iteration of BasicEQ so that intersections can be tested for emptiness faster
Experimental Evaluation • Results table with columns: method, NB, NE, PT
Experimental Evaluation • Figure: Space vs. Accuracy
Conclusions • Proposed OptEQ • Approximates coefficients of partitions • Groups semi-lattices to obtain scalability • More accurate than SEPIA • Exploits disk space to give higher precision • MM and MAX estimates give good results
References • [1] H. Lee, R. T. Ng, and K. Shim, “Extending Q-grams to Estimate Selectivity of String Matching with Low Edit Distance”, VLDB 2007 • [2] S. Chaudhuri, V. Ganti, and L. Gravano, “Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem”, ICDE 2004 • [3] L. Jin and C. Li, “Selectivity Estimation for Fuzzy String Predicates in Large Data Sets”, VLDB 2005 • [4] H. V. Jagadish, R. T. Ng, and D. Srivastava, “Substring Selectivity Estimation”, PODS 1999