Reverse Spatial and Textual k Nearest Neighbor Search

Presentation at HP Labs China • Jiaheng Lu • Renmin University of China • Sep 6, 2011

Research experience • Associate Professor: Renmin University of China • XML data management, spatial data management, cloud data management • Post-doc: University of California, Irvine • Data integration, approximate string matching • PhD: National University of Singapore • XML data management

Outline XML data management • XML twig query processing • XML keyword search Approximate string matching Reverse Spatial and Textual k Nearest Neighbor Search (SIGMOD 2011)

XML twig query processing • XPath: Section[Title]/Paragraph//Figure • Twig pattern Section Paragraph Title Figure

XML twig query processing (Cont.) • Problem Statement • Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D. • E.g., consider the following query and document • Query: Section with branches title and figure • Document nodes: s1, t1, s2, t2, p1, f1 • Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
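The problem statement can be illustrated with a naive matcher. This is only a sketch, not one of the holistic join algorithms from the talk: the nesting of the document in `DOC` is an assumption chosen so that it reproduces the three solutions listed on the slide, and `match_twig` is a hypothetical helper that handles a root with ancestor-descendant branches only.

```python
from itertools import product

# Hypothetical document tree: node -> (tag, children). The nesting is an
# assumption; it was chosen to be consistent with the slide's solutions.
DOC = {
    "s1": ("section", ["t1", "s2"]),
    "t1": ("title", []),
    "s2": ("section", ["t2", "p1"]),
    "t2": ("title", []),
    "p1": ("paragraph", ["f1"]),
    "f1": ("figure", []),
}

def descendants(node):
    """Yield every proper descendant of node in document order."""
    for child in DOC[node][1]:
        yield child
        yield from descendants(child)

def match_twig(root_tag, branch_tags):
    """Enumerate all matches of root_tag[.//b1][.//b2]... by brute force."""
    answers = []
    for node, (tag, _) in DOC.items():
        if tag != root_tag:
            continue
        desc = list(descendants(node))
        # One list of matching descendants per branch of the twig.
        per_branch = [[d for d in desc if DOC[d][0] == b] for b in branch_tags]
        answers.extend((node, *combo) for combo in product(*per_branch))
    return answers

solutions = match_twig("section", ["title", "figure"])
print(solutions)
```

The brute-force scan revisits the same subtrees many times, which is the cost that holistic twig joins are designed to avoid.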

An example for TJFast algorithm • [Figure: document tree labeled with extended Dewey IDs, and a query twig over A, B, C, D; a set is kept for the branching node A] • TD: 0.0.1, 0.3.1, 0.5.0 • TC: 0.3.2.1, 0.5.0.0

XML twig query processing (Cont.) • Several efficient pattern matching algorithms • TJFast (VLDB 05) • iTwigJoin (SIGMOD 05) • TwigStackList (CIKM 04) • TreeMatch (TKDE 10) • Current works: distributed XML twig pattern processing

XML twig query processing • Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. CIKM 2004:533-542 • Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204 • Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189 • Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119 • Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309 • Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing - (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178 • Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263 • Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298 • Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. SIGMOD 2005:455-466 • ……

Background: XQuery vs. keyword search • XML keyword search • Query: papers by “Mike” • XQuery (complicated): for $a in doc("bib.xml")//author, $n in $a/name where $n = "Mike" return $a//inproceedings • Keyword search: Mike, inproceedings

XML keyword search • The proposed keyword search returns the set of smallest trees containing all keywords. Keywords: bib Mike hobby Paper author author article 2009 name publications hobby name publications hobby Mike ward Paper folding John Hopking Read book inproceedings articles inproceedings article title year title year title year title year 2002 Information Retrieval Base line of XML key 2002 Data Mining 2007 Keyword Search in XML 2009

Effectiveness • Capture the user’s search intention • Identify the target that the user intends to search for • Infer the predicate constraints that the user intends to search via • Result ranking • Rank the query results according to their objective relevance to the user’s search intention

XML keyword search • Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML keyword searching. CIKM 2010:1933-1934 • Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109 • Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE, 22(8):1077-1092 (2010) • Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754 • Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528 • Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537 • Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716 • ……

Outline XML data management • XML twig query processing • XML keyword search Approximate string matching Reverse Spatial and Textual k Nearest Neighbor Search

Motivation: Data Cleaning Should clearly be “Niels Bohr” • Real-world data is dirty • Typos • Inconsistent representations • (PO Box vs. P.O. Box) • Approximately check against clean dictionary Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008

Motivation: Record Linkage We want to link records belonging to the same entity No exact match! The same entity may have similar representations Arnold Schwarzeneger versus Arnold Schwarzenegger Forrest Whittaker versus Forest Whittacker

Motivation: Query Relaxation • Errors in queries • Errors in data • Bring query and meaningful results closer together Actual queries gathered by Google http://www.google.com/jobs/britney.html

What is Approximate String Search? Queries against collection: Find all entries similar to “Forrest Whitaker” Find all entries similar to “Arnold Schwarzenegger” Find all entries similar to “Brittany Spears” String Collection: (People) Brad Pitt Forest Whittacker George Bush Angelina Jolie Arnold Schwarzeneger … … … What do we mean by similar to? • Edit Distance • Jaccard Similarity • Cosine Similarity • Dice • Etc. The similar to predicate can help our described applications! How can we support these types of queries efficiently?

Approximate Query Answering Main Idea: Use q-grams as signatures for a string irvine Sliding Window 2-grams {ir, rv, vi, in, ne} Intuition: Similar strings share a certain number of grams Inverted index on grams supports finding all data strings sharing enough grams with a query
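The sliding-window idea can be sketched in a few lines. The helper names `qgrams` and `build_index` and the three sample strings are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

def qgrams(s, q=2):
    """Return the q-grams produced by sliding a window of width q over s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    """Map each gram to the sorted IDs of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return {g: sorted(ids) for g, ids in index.items()}

grams = qgrams("irvine")
index = build_index(["irvine", "irving", "marvin"])  # hypothetical collection
print(grams)
print(index["rv"])
```

Strings that share many grams with the query show up on many of the query's inverted lists, which is what the lookup step exploits.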

Approximate Query Example • Query: “irvine”, Edit Distance 1 • 2-grams: {ir, rv, vi, in, ne} • Look up each gram in the inverted lists (gram → stringIDs) • [Figure: inverted lists of stringIDs for grams such as tf, vi, ir, ef, rv, ne, un, in] • Candidates = {1, 5, 9} • Candidates may contain false positives, so the real similarity still has to be computed • Each edit operation can “destroy” at most q grams • Answers must share at least T = 5 – 1 * 2 = 3 grams with the query • T-occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list merging; T is called the merging threshold.
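The filter-then-verify flow can be sketched as below. It counts occurrences with a hash map rather than using an optimized list-merging algorithm, and the string collection is hypothetical (the slide's inverted lists are not reproduced). The fallback for T <= 0 is an added assumption: in that case the count filter cannot prune anything.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search(query, strings, k, q=2):
    """Return (T, candidate IDs, verified answer IDs) for edit distance <= k."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index[g].append(sid)
    grams = qgrams(query, q)
    T = len(grams) - k * q  # merging threshold
    if T <= 0:
        candidates = list(range(len(strings)))  # filter is useless, scan all
    else:
        counts = Counter(sid for g in grams for sid in index.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= T)
    # Verification step removes the false positives.
    answers = [sid for sid in candidates if edit_distance(query, strings[sid]) <= k]
    return T, candidates, answers

strings = ["irvine", "irving", "marvin", "ervine", "alpine"]  # hypothetical data
T, candidates, answers = search("irvine", strings, k=1)
print(T, candidates, answers)
```

In this toy collection "marvin" shares three grams with the query and so passes the filter, but its edit distance is 3, which makes it a false positive that verification removes.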

Approximate string matching • Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient algorithms for approximate member extraction using signature-based inverted lists. CIKM 2009:315-324 • Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009:604-615 • Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008:257-266 • Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009:734-739 • ……

Outline XML data management • XML twig query processing • XML keyword search Approximate string matching Reverse Spatial and Textual k Nearest Neighbor Search (SIGMOD 2011)

Motivation • [Figure: map of existing shops labeled clothes, food, and sports, with a new shop at Q] • If a new shop is added at Q, which shops will be influenced? • Influence factors • Spatial distance: results D, F • Textual similarity (services/products...): results F, C

Problems of finding Influential Sets • Traditional query: Reverse k nearest neighbor query (RkNN) • Our new query: Reverse spatial and textual k nearest neighbor query (RSTkNN)

Problem Statement • Spatial-Textual Similarity • Describes the similarity between two objects based on both their spatial proximity and their textual similarity. • Spatial-Textual Similarity Function
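The similarity function itself is not preserved in this transcript. A minimal sketch, assuming a linear combination weighted by the parameter α that appears later in the talk, could look as follows. The linear distance normalization by `max_dist`, the extended Jaccard text measure, and all function names are assumptions made for illustration.

```python
import math

def spatial_sim(p, q, max_dist):
    """1 at distance 0, falling linearly to 0 at max_dist (assumed form)."""
    return 1.0 - min(math.dist(p, q), max_dist) / max_dist

def textual_sim(u, v):
    """Extended Jaccard similarity between sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sum(w * w for w in u.values())
    nv = sum(w * w for w in v.values())
    denom = nu + nv - dot
    return dot / denom if denom else 0.0

def sim_st(o1, o2, alpha, max_dist):
    """Assumed combination: alpha * spatial + (1 - alpha) * textual."""
    (loc1, txt1), (loc2, txt2) = o1, o2
    return alpha * spatial_sim(loc1, loc2, max_dist) + (1 - alpha) * textual_sim(txt1, txt2)

shop_a = ((0.0, 0.0), {"clothes": 1.0})  # hypothetical objects
shop_b = ((3.0, 4.0), {"clothes": 1.0})
score = sim_st(shop_a, shop_b, alpha=0.6, max_dist=10.0)
print(score)
```

With α close to 1 the ranking is driven by location, and with α close to 0 it is driven by text, which is why α is treated as a per-query parameter.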

Problem Statement (cont.) • RSTkNN query • Finds the objects that have the query object among their k most spatial-textually similar objects.
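The query definition can be checked with a brute-force sketch that compares every pair of objects. The toy similarity `toy_sim`, the one-dimensional shop positions, and the tie handling (ties count in favor of the query) are assumptions for illustration only.

```python
def rstknn_bruteforce(query, objects, k, sim):
    """Return IDs of objects that rank `query` among their k most similar."""
    result = []
    for oid, o in objects.items():
        s_q = sim(o, query)
        # Count the other objects that are strictly more similar to o than q is.
        closer = sum(1 for pid, p in objects.items()
                     if pid != oid and sim(o, p) > s_q)
        if closer < k:
            result.append(oid)
    return result

def toy_sim(a, b, alpha=0.5, max_dist=10.0):
    """Hypothetical similarity: 1-D distance plus exact category match."""
    (xa, ca), (xb, cb) = a, b
    spatial = 1.0 - abs(xa - xb) / max_dist
    textual = 1.0 if ca == cb else 0.0
    return alpha * spatial + (1 - alpha) * textual

shops = {"A": (0.0, "food"), "B": (1.0, "clothes"), "C": (9.0, "clothes")}
new_shop = (8.0, "clothes")
influenced = rstknn_bruteforce(new_shop, shops, k=1, sim=toy_sim)
print(influenced)
```

Shop B is far from the new shop yet still influenced because their text matches, while the nearer-by-text-agnostic shop A is not. The quadratic cost of this scan is what an index-based algorithm is meant to avoid.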

Related Work • Pre-computing the kNN for each object (Korn et al., SIGMOD 2000; Yang et al., ICDE 2001) • (Hyper) Voronoi cell/plane pruning strategy (Tao et al., VLDB 2004; Wu et al., PVLDB 2008; Kriegel et al., ICDE 2009) • 60-degree pruning method (Stanoi et al., SIGMOD 2000) • Branch and bound (based on Lp-norm metric space) (Achtert et al., SIGMOD 2006; Achtert et al., EDBT 2009) Challenging Features: • Euclidean geometric properties are lost. • The text space is high-dimensional. • k and α differ from query to query.

Intersection and Union R-tree (IUR-tree)
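The slide gives only the name of the index. One plausible reading, sketched below under that assumption, is an R-tree whose entries carry a bounding rectangle plus two text vectors summarizing everything beneath them: an intersection vector (per-term minimum weight, for terms present everywhere) and a union vector (per-term maximum weight). The `Entry` class and the helpers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    mbr: tuple      # (xmin, ymin, xmax, ymax)
    int_vec: dict   # term -> minimum weight over all objects below
    uni_vec: dict   # term -> maximum weight over all objects below

def leaf_entry(x, y, text):
    """Entry for a single object: degenerate rectangle, both vectors = text."""
    return Entry((x, y, x, y), dict(text), dict(text))

def merge(entries):
    """Build the parent entry that summarizes a non-empty list of entries."""
    mbr = (min(e.mbr[0] for e in entries), min(e.mbr[1] for e in entries),
           max(e.mbr[2] for e in entries), max(e.mbr[3] for e in entries))
    common = set.intersection(*(set(e.int_vec) for e in entries))
    int_vec = {t: min(e.int_vec[t] for e in entries) for t in common}
    terms = set().union(*(e.uni_vec for e in entries))
    uni_vec = {t: max(e.uni_vec.get(t, 0.0) for e in entries) for t in terms}
    return Entry(mbr, int_vec, uni_vec)

a = leaf_entry(0, 0, {"clothes": 1.0, "food": 0.5})   # hypothetical objects
b = leaf_entry(2, 3, {"clothes": 0.4, "sports": 0.8})
node = merge([a, b])
print(node)
```

Summaries of this kind let a search derive lower and upper similarity bounds for a whole subtree without visiting its objects.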

Overview of Search Algorithm • RSTkNN Algorithm: • Traverse the IUR-tree starting from the root • Progressively update the lower and upper bounds • Apply the search strategy: • prune unrelated entries into Pruned; • report result entries into Ans; • add candidate objects to Cnd. • FinalVerification • For each object in Cnd, decide whether it is a result by tightening its bounds, expanding entries in Pruned as needed.
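The three-way decision in the search strategy can be sketched as a comparison of two intervals. This is an assumed formulation: `knn_lower` and `knn_upper` stand for the bounds on an entry's similarity to its k-th most similar object, `min_sim_q` and `max_sim_q` for the bounds on its similarity to the query, and the numbers in the usage lines are made up.

```python
def classify(min_sim_q, max_sim_q, knn_lower, knn_upper):
    """Decide the fate of an entry from its similarity bounds (assumed rule)."""
    if max_sim_q < knn_lower:
        # Even in the best case the query is less similar than the k-th
        # neighbor, so nothing under this entry can be a result.
        return "pruned"
    if min_sim_q >= knn_upper:
        # Even in the worst case the query is at least as similar as the
        # k-th neighbor, so everything under this entry is a result.
        return "answer"
    # Bounds overlap: expand an index node, or keep an object as a candidate.
    return "candidate"

# Hypothetical bounds for three entries: (min_sim_q, max_sim_q, kNN_L, kNN_U)
decisions = [
    classify(0.10, 0.30, 0.37, 0.432),
    classify(0.65, 0.80, 0.21, 0.619),
    classify(0.30, 0.50, 0.374, 0.45),
]
print(decisions)
```

Tighter bounds shrink the overlapping case, which is what reduces the work left for final verification.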

Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6 • [Figure: IUR-tree with nodes N1, N2, N3, N4 and objects p1, p2, p3, p4, p5] • Initialize N4.CLs; EnQueue(U, N4) • U: N4(0, 0)

Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6 • [Figure: IUR-tree with nodes N1, N2, N3, N4 and objects p1, p2, p3, p4, p5; mutual effect among N1, N2, N3] • DeQueue(U, N4); EnQueue(U, N2); EnQueue(U, N3); Pruned.add(N1) • Pruned: N1(0.37, 0.432) • U: N4(0, 0) (dequeued), N3(0.323, 0.619), N2(0.21, 0.619)

Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6 • [Figure: IUR-tree with nodes N1, N2, N3, N4 and objects p1, p2, p3, p4, p5; mutual effect among p4, p5, N2] • DeQueue(U, N3); Answer.add(p4); Candidate.add(p5) • Pruned: N1(0.37, 0.432) • Answer: p4(0.21, 0.619) • U: N3(0.323, 0.619) (dequeued), N2(0.21, 0.619) • Candidate: p5(0.374, 0.374)

Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6 • [Figure: IUR-tree with nodes N1, N2, N3, N4 and objects p1, p2, p3, p4, p5; mutual effect among p2, p3, p4, p5] • DeQueue(U, N2); Answer.add(p2, p3); Pruned.add(p5) • Pruned: N1(0.37, 0.432), p5(0.374, 0.374) • Answer: p4, p2, p3 • Since U and the candidate set are now both empty, the algorithm ends. • Results: p2, p3, p4.

Cluster IUR-tree: CIUR-tree • IUR-tree: Texts in an index node could be very different. • CIUR-tree: An enhanced IUR-tree that incorporates textual clusters.

Optimizations • Motivation • To give a tighter bound during CIUR-tree traversal • To purify the textual description in the index node • Outlier Detection and Extraction (ODE-CIUR) • Extract subtrees with outlier clusters • Treat the outliers separately and calculate their bounds separately. • Text-entropy based optimization (TE-CIUR) • Define TextEntropy to describe the distribution of text clusters in an entry of the CIUR-tree • Visit first the entries with higher TextEntropy, i.e., those whose texts are more diverse.
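The slide does not define TextEntropy. A minimal sketch, assuming it is the Shannon entropy of the distribution of an entry's objects over text clusters, shows how entries could be ordered for traversal. The function name, the cluster labels, and the entry names are hypothetical.

```python
import math
from collections import Counter

def text_entropy(cluster_ids):
    """Shannon entropy (bits) of the cluster distribution within one entry."""
    n = len(cluster_ids)
    if n == 0:
        return 0.0
    counts = Counter(cluster_ids)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical entries: the cluster label of each object under the entry.
entries = {
    "E1": ["clothes", "clothes", "clothes", "clothes"],
    "E2": ["clothes", "clothes", "food", "food"],
    "E3": ["clothes", "food", "sports", "books"],
}
order = sorted(entries, key=lambda e: text_entropy(entries[e]), reverse=True)
print(order)
```

Entries with mixed text tend to have loose bounds, so expanding them early is one way to tighten the bounds sooner.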

Experimental Study • Experimental Setup • OS: Windows XP; CPU: 2.0GHz; Memory: 4GB • Page size: 4KB; Language: C/C++ • Compared Methods • baseline, IUR-tree, ODE-CIUR, TE-CIUR, and ODE-TE • Datasets • ShopBranches (Shop), extended from a small real dataset • GeographicNames (GN), real data • CaliforniaDBpedia (CD), generated by combining locations in California with documents from DBpedia • Metrics • Total query time • Number of page accesses

Scalability • [Figure: (1) log-scale version, (2) linear-scale version]

Effect of k • [Figure: query time]

Conclusion • Proposed a new query problem, RSTkNN. • Presented a hybrid index, the IUR-tree. • Presented the enhanced variant, the CIUR-tree, and two optimizations, ODE-CIUR and TE-CIUR, that further improve search processing.

Current and future works • Distributed XML query processing • Cloud-based SQL Processing • Spatial and Temporal Keyword search

Thank you Q&A
