
Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1]

Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1]. Pirooz Chubak May 22, 2008. Motivation. Selectivity estimation of approximate string matching queries Applications Misspelling correction/suggestion Data integration and data cleaning


Presentation Transcript


  1. Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1] Pirooz Chubak May 22, 2008

  2. Motivation • Selectivity estimation of approximate string matching queries • Applications • Misspelling correction/suggestion • Data integration and data cleaning • Query optimization (generating query plans)

  3. Approximate String Matching • String similarity measures • Edit distance • Hamming distance • Jaccard similarity coefficient • Edit distance • Minimum number of edit operations (insertion, deletion, replacement) needed to convert one string into the other
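
The edit distance definition on this slide can be made concrete with the standard dynamic-programming recurrence. The following is a minimal sketch, not code from the talk or the paper; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or match
        prev = curr
    return prev[-1]


print(edit_distance("abcde", "abc"))  # two deletions
```

Only two rows of the table are kept in memory, which is enough when just the distance (and not the edit script) is needed.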

  4. Short Identifying Substring • SIS by Chaudhuri et al. [2] • A string s usually has a short substring s’ such that if an attribute value contains s’, it almost always contains s as well • Thus, approximate the selectivity of long query strings by that of their shorter substrings

  5. Related Work • SEPIA [3] • Clusters similar strings • Selects a pivot for each cluster • Captures the edit distance distribution with histograms • For each query, visit all the clusters and estimate the number of strings within the distance threshold

  6. Problem Statement • Given a query string sq and a bag of strings DB, estimate the size of the answer set • Interested in low edit distance thresholds (1 to 3)

  7. Basic Definitions • Q-gram • Any string of length q • N-gram table • Frequencies of all q-grams for q = 1…N • Ans(sq,iDjImR) = set of strings s’ such that sq can be converted to s’ with i deletions, j insertions and m replacements • Ans(sq,k) = set of strings s’ obtained from sq with exactly k edit operations

  8. Examples • Ans(“abcd”,1R) = {“?bcd”, “a?cd”, “ab?d”, “abc?”} • Alphabet for extended q-grams = Σ ∪ {#, $, ?}, where # and $ mark the start and end of a string and ? matches any single character • 3-gram table for “beau” contains frequencies for • 1-grams (b, e, a, u) • 2-grams (#b, be, ea, au, u$) • 3-grams (#be, bea, eau, au$) • Extended 3-gram table also contains frequencies for • 2-grams (?b, ?e, ?u, b?, e?, a?, u?, ??, #?, ?$) • 3-grams (?ea, #?e, ??$, etc.)
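
As an illustration of the plain (non-extended) N-gram table in this example, here is a small sketch that pads each string with `#` and `$` and counts every q-gram for q = 1…N. The function name `ngram_table` and the choice to count occurrences (rather than the number of strings containing a q-gram) are assumptions made for this sketch; the wildcard `?` entries of the extended table are not generated.

```python
from collections import Counter


def ngram_table(strings, n):
    """Count all q-grams (q = 1..n) of the strings, padded with '#' and '$'."""
    table = Counter()
    for s in strings:
        padded = "#" + s + "$"
        for q in range(1, n + 1):
            for i in range(len(padded) - q + 1):
                gram = padded[i:i + q]
                if gram in ("#", "$"):
                    continue  # the markers alone are not listed as 1-grams
                table[gram] += 1
    return table


table = ngram_table(["beau"], 3)
print(sorted(g for g in table if len(g) == 2))
```

For the single string “beau” this yields the 1-grams, 2-grams and 3-grams listed on the slide, each with frequency 1.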

  9. Replacement semi-lattice • Assume only replacements are allowed • E.g. Ans(“abcd”,2R) • Possible answers = ab??, a?c?, ?bc?, a??d, ?b?d, ??cd • Find the value of |Ans(“abcd”,2R)| = |S1 ∪ … ∪ S6| by inclusion-exclusion, using • S1 = ab??, … , S6 = ??cd

  10. Replacement semi-lattice (Cont.) • Figure: semi-lattice for Ans(“abcd”,2R) • Get the values of the intersections from the semi-lattice and plug them into the formula for |Ans(“abcd”,2R)|
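
The replacement case can be sketched as inclusion-exclusion over the wildcard patterns S1, …, S6. In this sketch the frequency of a pattern is obtained by scanning a small in-memory bag of strings and matching whole strings; that is an assumption made for illustration (the method in the talk looks frequencies up in the extended N-gram table instead). All function names are hypothetical.

```python
from itertools import combinations


def matches(pattern, s):
    """True if s has the pattern's length and agrees on every non-'?' position."""
    return len(pattern) == len(s) and all(p in ("?", c) for p, c in zip(pattern, s))


def intersect(p, q):
    """Pattern matched by exactly the strings matching both p and q (None if empty)."""
    out = []
    for a, b in zip(p, q):
        if a == "?":
            out.append(b)
        elif b == "?" or a == b:
            out.append(a)
        else:
            return None
    return "".join(out)


def replacement_patterns(sq, k):
    """All patterns obtained from sq by turning k positions into '?'."""
    return ["".join("?" if i in pos else c for i, c in enumerate(sq))
            for pos in combinations(range(len(sq)), k)]


def union_size(patterns, db):
    """|S1 ∪ ... ∪ Sn| by inclusion-exclusion over pattern intersections."""
    total = 0
    for r in range(1, len(patterns) + 1):
        sign = 1 if r % 2 else -1
        for subset in combinations(patterns, r):
            node = subset[0]
            for p in subset[1:]:
                node = intersect(node, p)
                if node is None:
                    break
            if node is not None:
                total += sign * sum(matches(node, s) for s in db)
    return total


db = ["abcd", "abxd", "xbcy", "axyz", "abcd", "abc"]
print(union_size(replacement_patterns("abcd", 2), db))  # 4
```

In the toy bag, the two copies of “abcd”, plus “abxd” and “xbcy”, are within two replacements of “abcd”, so the result is 4; “axyz” needs three replacements and “abc” has the wrong length.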

  11. General Formulas • Generalize the above idea to find |Ans(sq,kR)| • The general formula for deletions is simple: it is always the sum of the frequencies of the level-0 nodes • The general case for insertions can be very complex, so only up to 3 insertions are considered

  12. Estimate selectivity • General idea • Group Ans(sq,k) by the length of the strings (l-k … l+k, where l = |sq|) • Estimate the size of each subset separately • Ans(“abcde”,2) • 5 subsets, with strings of length 3 to 7 • Length 3 is Ans(“abcde”,2D) • Length 5 is Ans(“abcde”,1I1D) ∪ Ans(“abcde”,2R) • Lots of overlap between these two sets
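
The grouping by length can be sketched by enumerating every split of k operations into deletions, insertions and replacements and recording the resulting string length. The function name and the label format (for example `1I1D`) are assumptions that follow the notation used on this slide.

```python
def edit_combinations_by_length(sq, k):
    """Map each result length to the operation mixes (exactly k edits) that produce it."""
    l = len(sq)
    groups = {}
    for d in range(k + 1):              # number of deletions
        for i in range(k - d + 1):      # number of insertions
            r = k - d - i               # the remaining edits are replacements
            label = "".join(f"{n}{op}" for n, op in ((i, "I"), (d, "D"), (r, "R")) if n)
            groups.setdefault(l - d + i, []).append(label)
    return dict(sorted(groups.items()))


for length, mixes in edit_combinations_by_length("abcde", 2).items():
    print(length, mixes)
```

For “abcde” with k = 2 this produces five groups, for lengths 3 to 7, with `2D` alone at length 3 and both `2R` and `1I1D` at length 5.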

  13. Estimate selectivity (Cont.) • Combined Approach • Obtain base strings for both sets • Remove redundant base strings • Ans(“abcde”,2R) generates “abc??” • Ans(“abcde”,1I1D) generates “abcd?” • “abc??” covers all the strings matched by “abcd?”, so remove “abcd?” from the base strings
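
Removing redundant base strings amounts to a subsumption test between wildcard patterns. A minimal sketch, with hypothetical helper names `covers` and `remove_redundant`:

```python
def covers(general, specific):
    """True if every string matched by `specific` is also matched by `general`."""
    return (len(general) == len(specific)
            and all(g == "?" or g == s for g, s in zip(general, specific)))


def remove_redundant(base_strings):
    """Drop duplicates and every base string covered by another base string."""
    unique = list(dict.fromkeys(base_strings))  # keeps first-seen order
    return [p for p in unique
            if not any(q != p and covers(q, p) for q in unique)]


print(remove_redundant(["abc??", "abcd?", "?bcde"]))  # ['abc??', '?bcde']
```

Here “abcd?” is dropped because “abc??” covers it, while “?bcde” stays because no other base string matches all of its strings.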

  14. Estimate selectivity (cont.) • BasicEQ, for a given string length • Find the base strings (remove redundancies) • Iteratively intersect base strings to obtain r-intersections (r = 2..|base strings|) • This will generate new nodes in the hierarchy • Partition the nodes and estimate their frequencies • Add these estimated frequencies

  15. Estimate selectivity (cont.) • Node Partitioning • Partition the nodes, so that every node q in a partition has the same coefficient Cq • Cq is the number of times q appears in all the intersections of base strings • For each partition find Cq and sum of frequencies of its nodes
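
One way to read the coefficient Cq is as the signed number of times node q arises as an intersection of base strings in the inclusion-exclusion sum; that reading is an assumption of this sketch, and the function names are hypothetical. Nodes are then grouped by coefficient, so the estimate becomes the sum over partitions of Cq times the summed node frequencies.

```python
from collections import defaultdict
from itertools import combinations


def intersect(p, q):
    """Pattern matched by exactly the strings matching both p and q (None if empty)."""
    out = []
    for a, b in zip(p, q):
        if a == "?":
            out.append(b)
        elif b == "?" or a == b:
            out.append(a)
        else:
            return None
    return "".join(out)


def coefficients(base_strings):
    """Signed count of how often each node appears among all r-intersections."""
    coeff = defaultdict(int)
    for r in range(1, len(base_strings) + 1):
        sign = 1 if r % 2 else -1
        for subset in combinations(base_strings, r):
            node = subset[0]
            for p in subset[1:]:
                node = intersect(node, p)
                if node is None:
                    break
            if node is not None:
                coeff[node] += sign
    return dict(coeff)


def partition_by_coefficient(coeff):
    """Group nodes sharing the same non-zero coefficient."""
    parts = defaultdict(list)
    for node, c in coeff.items():
        if c != 0:
            parts[c].append(node)
    return {c: sorted(nodes) for c, nodes in parts.items()}


base = ["ab??", "a?c?", "?bc?", "a??d", "?b?d", "??cd"]
for c, nodes in partition_by_coefficient(coefficients(base)).items():
    print(c, nodes)
```

For the base strings of Ans(“abcd”,2R) this gives three partitions: coefficient 1 for the six base strings, -2 for the four one-wildcard nodes, and 3 for “abcd” itself.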

  16. Frequency Estimation • Estimate the frequency of an extended q-gram in the extended N-gram table • Maximal Overlap (MO) [4] • Finds the substring in the table that has the maximum overlap with sq • MAX approach • If MO(“abc?”) < MO(“abcd”), then use MO(“abcd”) as the estimate for “abc?” • MO+ • Find the substring with the minimum frequency • MM • Combination of MAX and MO+
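
The idea behind MO can be sketched for plain strings (no wildcards) under a simplifying assumption: the table holds the count of every substring of length at most n, so the maximal substrings are the consecutive length-n windows and each overlaps its predecessor in n-1 characters. The estimate then chains conditional frequencies. This is a simplified illustration and not the estimator of [4] or of the talk; the function name, the `counts` dictionary and the sample counts are made up for the example.

```python
def mo_estimate(s, counts, n):
    """Estimate the count of s from counts of substrings of length <= n (n >= 2)."""
    if n < 2:
        raise ValueError("this sketch needs n >= 2 so that windows overlap")
    if len(s) <= n:
        return float(counts.get(s, 0))  # the table answers directly
    estimate = float(counts.get(s[:n], 0))
    for i in range(1, len(s) - n + 1):
        window = s[i:i + n]
        overlap = window[:-1]  # the n-1 characters shared with the previous window
        denom = counts.get(overlap, 0)
        if denom == 0:
            return 0.0
        # multiply by the estimated P(window | overlap)
        estimate *= counts.get(window, 0) / denom
    return estimate


# Hypothetical counts, for illustration only.
counts = {"abc": 10, "bcd": 6, "bc": 12, "cde": 3, "cd": 6}
print(mo_estimate("abcd", counts, 3))  # 10 * 6/12 = 5.0
```

The MAX rule on the slide would sit on top of such an estimator: an estimate for “abc?” is raised to the estimate for “abcd” whenever it comes out smaller, since every string matching “abcd” also matches “abc?”.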

  17. Estimate selectivity (cont.) • BasicEQ is efficient if the general formulas are applicable • OptEQ is proposed, adding two enhancements to BasicEQ • Approximates the coefficient Cq, which trades some accuracy for better performance • Groups the sets of strings obtained in each iteration of BasicEQ to speed up the tests for empty intersections

  18. Experimental Evaluation (method, NB, NE, PT)

  19. Experimental Evaluation

  20. Experimental Evaluation

  21. Experimental Evaluation • Space vs. Accuracy

  22. Conclusions • Proposed OptEQ • Approximates coefficients of partitions • Groups semi-lattices to obtain scalability • More accurate than SEPIA • Exploits disk space to give higher precision • MM and MAX estimates give good results

  23. References • [1] H. Lee, R. T. Ng, and K. Shim, “Extending Q-grams to Estimate Selectivity of String Matching with Low Edit Distance”, VLDB 2007 • [2] S. Chaudhuri, V. Ganti, and L. Gravano, “Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem”, ICDE 2004 • [3] L. Jin and C. Li, “Selectivity Estimation for Fuzzy String Predicates in Large Data Sets”, VLDB 2005 • [4] H. V. Jagadish, R. T. Ng, and D. Srivastava, “Substring Selectivity Estimation”, PODS 1999
