
Lecture 5: Indexing: R-Trees

Lecture 5: Indexing: R-Trees. Sept. 8, 2006. ChengXiang Zhai. Most slides are adapted from Kevin Chang’s lecture slides.



  1. Lecture 5: Indexing: R-Trees • Sept. 8, 2006 • ChengXiang Zhai • Most slides are adapted from Kevin Chang’s lecture slides

  2. Two Essential Techniques for Efficient Processing of Declarative Queries? • From “declarative” to “procedural” (query compilation) • Physical-query-plan operators • Query optimization (access path selection) improves efficiency • Implementation of physical-query-plan operators • Access methods (procedural support for relational operators) • Indexing improves efficiency

  3. Access Methods: Indexing • What is an index? • partition data into buckets • label each bucket • use the labels to determine which buckets are relevant to a query • One-dimensional indexing: • hashing • B-trees/B+-trees (we’ll simply call them B-trees) • What are the differences (pros & cons)? • Indexes are designed to serve queries! • check each index method against the typical queries

  4. Multidimensional Data • GIS (geographic info. systems) database • e.g.: map databases • CAD database • circuit layout (of rectangles) • Data cube systems • sales data (date, time, store, item, price) • a tuple is a point in multi-dimensional space • queries typically retrieve a “cube” • e.g.: date between 1990 and 1999, store = supermarket, price > 100

  5. Example of Multidimensional Data • Values may be multi-dimensional

  6. Multidimensional Queries • Point queries: • constrain values of some dimensions • Range queries: • constrain ranges of some dimensions • Nearest neighbor: • e.g.: the closest city to Champaign • Where-am-I query: regions related to a query region • given a point, find out where it is located • given a region, find overlapping, containing, contained regions • Others?

  7. B-Tree for Nearest Neighbor Queries • Object represented by (X, Y) • X indexed by a B-tree • Y indexed by a B-tree • Given an object (x0, y0), find its NN object: • let d = 1 • find objs in the box from (x0-d, y0-d) to (x0+d, y0+d) • if there are some objs, find the NN among them --> done! • else d = d + delta; repeat • Problems?

  8. B-Tree for Nearest Neighbor Queries: Problems • Intersecting multiple indexes is not efficient • Need to guess the distance d and delta • too small: no objs in the bounding box • too large: may be too many objs in the bounding box • May actually miss the NN point! • a point in a corner of the box can be up to d·√2 away, while a closer point at distance slightly > d lies outside the box --> wrong answer!
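A minimal sketch of the expanding-box search from slides 7 and 8, assuming a plain list of points stands in for the two B-trees. The names `box_nn` and `true_nn` are illustrative and not from the lecture. It shows how stopping at the first non-empty box can return the wrong neighbor.

```python
import math

def box_nn(points, q, d=1.0, delta=1.0):
    """Expanding-box heuristic: stop at the first non-empty box around q.

    Assumes `points` is non-empty; otherwise the loop never ends.
    """
    x0, y0 = q
    while True:
        # A real system would intersect two B-tree range scans here.
        hits = [p for p in points
                if x0 - d <= p[0] <= x0 + d and y0 - d <= p[1] <= y0 + d]
        if hits:
            return min(hits, key=lambda p: math.dist(p, q))
        d += delta

def true_nn(points, q):
    """Brute-force nearest neighbor, used only as ground truth."""
    return min(points, key=lambda p: math.dist(p, q))

q = (0.0, 0.0)
points = [(0.9, 0.9), (1.1, 0.0)]   # corner point vs. point just outside the box
print(box_nn(points, q))    # (0.9, 0.9), at distance about 1.27
print(true_nn(points, q))   # (1.1, 0.0), at distance 1.1
```

One possible repair, not stated on the slides, is to rescan once with d set to the distance of the candidate found, since the true nearest neighbor must lie inside that larger box.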

  9. Indexing Techniques: How to reach the right buckets quickly? • Hash-like schemes: lookup by k → h(k) • grid files • partitioned indexing functions • Tree-like schemes: lookup by tree traversal • multiple key indexes • kd-trees • quad trees • R-trees • See demo: http://www.cs.umd.edu/~brabec/quadtree

  10. Grid Files • [Figure: grid over the Age × Salary plane with data points marked *; Age axis marks 0, 30, 40, 70, 100; Salary axis marks 20K, 80K, 150K, 200K, 250K]

  11. Grid Files • Each region corresponds to a bucket • Why is this hash-based indexing? • Good for partial-match, range, and NN queries • Buckets may become empty or overfull over time • points are not distributed uniformly • Number of buckets grows exponentially with the number of dimensions D • Reorganization requires repartitioning the space • need to move the grid lines
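The bucket lookup described above can be sketched as follows. This is a toy in-memory version under stated assumptions: the class name `GridFile`, the fixed grid lines, and the sample values are all illustrative, and no reorganization (moving grid lines) is attempted.

```python
from bisect import bisect_right
from collections import defaultdict

class GridFile:
    """Toy two-dimensional grid file; grid lines are fixed at construction."""

    def __init__(self, x_lines, y_lines):
        self.x_lines = sorted(x_lines)
        self.y_lines = sorted(y_lines)
        self.buckets = defaultdict(list)   # (col, row) -> list of points

    def cell(self, x, y):
        # The position among the grid lines plays the role of the hash value.
        return bisect_right(self.x_lines, x), bisect_right(self.y_lines, y)

    def insert(self, x, y):
        self.buckets[self.cell(x, y)].append((x, y))

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        c_lo, r_lo = self.cell(x_lo, y_lo)
        c_hi, r_hi = self.cell(x_hi, y_hi)
        out = []
        # Only buckets whose cell overlaps the query box are visited.
        for c in range(c_lo, c_hi + 1):
            for r in range(r_lo, r_hi + 1):
                for (x, y) in self.buckets.get((c, r), []):
                    if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                        out.append((x, y))
        return out

# Age and salary (in $K); the grid lines are assumed values.
g = GridFile(x_lines=[40, 55], y_lines=[90, 225])
for p in [(25, 60), (45, 60), (50, 75), (50, 100), (70, 110), (30, 260)]:
    g.insert(*p)
print(g.cell(50, 100))                  # (1, 1)
print(g.range_query(40, 60, 50, 120))   # age 40..60, salary 50..120
```

Unlike an ordinary hash function, the cell number preserves order along each dimension, which is what makes range and NN queries workable.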

  12. Partitioned Hash Functions • Hash function produces K bits that identify a bucket • Bits are partitioned among attributes • buckets identified by b1b2b3b4b5b6 • age produces b1b2b3 • salary produces b4b5b6 • How is this different from grid files? • can simulate a grid file (generalization of grid files!) • range/NN queries? • bucket utilization?
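A small sketch of the bit layout above, assuming K = 6 with three bits per attribute as on the slide. The modulo hashes and the function names are placeholders chosen for illustration, not the lecture's hash functions.

```python
def partitioned_hash(age, salary_k, age_bits=3, salary_bits=3):
    """Concatenate per-attribute hash bits: b1b2b3 from age, b4b5b6 from salary."""
    h_age = age % (1 << age_bits)            # assumed toy hash for age
    h_sal = salary_k % (1 << salary_bits)    # assumed toy hash for salary
    return (h_age << salary_bits) | h_sal

def buckets_for_age(age, age_bits=3, salary_bits=3):
    """Partial match on age only: the age bits are fixed, the salary bits are free."""
    prefix = (age % (1 << age_bits)) << salary_bits
    return [prefix | s for s in range(1 << salary_bits)]

b = partitioned_hash(45, 60)
print(format(b, "06b"))            # 101100
print(len(buckets_for_age(45)))    # 8 of the 64 buckets must be probed
```

A partial-match query fixes some of the bits and enumerates the rest. A good hash scatters nearby values across unrelated buckets, which is why range and NN queries get little help from this scheme.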

  13. Multiple-Key Indexes • Each level is an index for one attribute • Partial match will work if the prefix of a path is specified • But, what if only salary is specified? • NN queries? • [Figure: index levels labeled age, then salary]

  14. kd-Tree: k-dimensional Search Tree • binary tree • interior nodes: attr, div value • attributes alternate at different levels • binary search to find records • [Figure: example kd-tree over (age, salary) records; interior nodes Salary 150, Age 60, Age 47, Salary 80, Salary 300, Age 38; leaf records (50, 275), (70, 110), (60, 260), (85, 140), (50, 100), (50, 120), (30, 260), (25, 400), (25, 60), (45, 60), (45, 350), (50, 75)]
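A minimal kd-tree sketch over the (age, salary) records listed on this slide. The median split rule, the leaf capacity of 2, and starting with age at the root are assumptions, so the resulting tree will not match the slide's figure node for node.

```python
def build_kd(points, depth=0):
    """Build a kd-tree over 2-D points, alternating the split attribute by level."""
    if len(points) <= 2:                     # leaf capacity of 2 is an assumption
        return {"leaf": list(points)}
    axis = depth % 2                         # 0 = age, 1 = salary
    pts = sorted(points, key=lambda p: p[axis])
    split = pts[len(pts) // 2][axis]         # median value as the dividing value
    left = [p for p in pts if p[axis] < split]
    right = [p for p in pts if p[axis] >= split]
    if not left:                             # cannot separate on this attribute
        return {"leaf": pts}
    return {"axis": axis, "split": split,
            "lo": build_kd(left, depth + 1), "hi": build_kd(right, depth + 1)}

def range_search(node, lo, hi, out):
    """Collect points p with lo[i] <= p[i] <= hi[i] on both attributes."""
    if "leaf" in node:
        out.extend(p for p in node["leaf"]
                   if all(lo[i] <= p[i] <= hi[i] for i in range(2)))
        return out
    a, s = node["axis"], node["split"]
    if lo[a] < s:                            # query reaches below the dividing value
        range_search(node["lo"], lo, hi, out)
    if hi[a] >= s:                           # query reaches the dividing value or above
        range_search(node["hi"], lo, hi, out)
    return out

records = [(50, 275), (70, 110), (60, 260), (85, 140), (50, 100), (50, 120),
           (30, 260), (25, 400), (25, 60), (45, 60), (45, 350), (50, 75)]
tree = build_kd(records)
# Age between 40 and 60, salary between 50 and 120.
print(sorted(range_search(tree, (40, 50), (60, 120), [])))
```

A range query may have to descend both children of a node, so the search is a pruned traversal rather than a single root-to-leaf path.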

  15. Quad Trees • Interior nodes represent square regions • with at most M entries • Split into four quadrants if more than M entries • [Figure: quadtree partitioning of the Age × Salary plane, Age 0 to 100 and Salary up to 200K, data points marked *]
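The split rule above can be sketched as a small point quadtree. The class name `QuadTree`, the default capacity M = 2, and the coordinate range are illustrative assumptions.

```python
class QuadTree:
    """Point quadtree over the square [x, x+size) x [y, y+size)."""

    def __init__(self, x, y, size, capacity=2):
        self.x, self.y, self.size, self.capacity = x, y, size, capacity
        self.points = []
        self.children = None               # four sub-quadrants once split

    def _child_for(self, p):
        half = self.size / 2
        col = 1 if p[0] >= self.x + half else 0
        row = 1 if p[1] >= self.y + half else 0
        return self.children[2 * row + col]

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        # Split into four quadrants once more than M entries are stored.
        # The size guard stops endless splitting on duplicate points.
        if len(self.points) > self.capacity and self.size > 1e-9:
            half = self.size / 2
            self.children = [QuadTree(self.x + dx * half, self.y + dy * half,
                                      half, self.capacity)
                             for dy in (0, 1) for dx in (0, 1)]
            pts, self.points = self.points, []
            for q in pts:
                self._child_for(q).insert(q)

    def all_points(self):
        if self.children is None:
            return list(self.points)
        return [p for c in self.children for p in c.all_points()]

    def depth(self):
        if self.children is None:
            return 1
        return 1 + max(c.depth() for c in self.children)

t = QuadTree(0, 0, 100)
for p in [(10, 10), (20, 20), (80, 80), (15, 12)]:
    t.insert(p)
print(t.depth(), sorted(t.all_points()))
```

Clustered points force repeated splits of the same quadrant, so the tree is not balanced: depth depends on the data distribution rather than only on the number of points.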

  16. Quad Trees (cont.) • In terms of space partitioning: • How is it different from a grid file? • How is it different from a kd-tree?

  17. R-Trees: Region Trees • B-tree related: • balanced tree (by dynamic restructuring) • guaranteed utilization of nodes • between half and completely full • Support regions • Coverage of siblings may overlap

  18. R-Trees • Structure like B-tree, but keys are minimum bounding rectangle (MBR) regions • does B-tree use (1-D) MBR? e.g., keys: | 57 | 81 | 95 | • Interior node contains sets of regions • Regions can overlap • Regions do not necessarily fully cover the entire parent region • Region can be a point or a shape • Utilization of nodes: (unless root) • max capacity: M (How to determine M?) • min capacity: m <= M/2 (Why?)
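A short sketch of the MBR operations an R-tree node relies on. The `Rect` type and the helper names are illustrative, not from the lecture.

```python
from typing import NamedTuple

class Rect(NamedTuple):
    xlo: float
    ylo: float
    xhi: float
    yhi: float

def area(r):
    return (r.xhi - r.xlo) * (r.yhi - r.ylo)

def mbr(a, b):
    """Minimum bounding rectangle of two rectangles."""
    return Rect(min(a.xlo, b.xlo), min(a.ylo, b.ylo),
                max(a.xhi, b.xhi), max(a.yhi, b.yhi))

def overlaps(a, b):
    return a.xlo <= b.xhi and b.xlo <= a.xhi and a.ylo <= b.yhi and b.ylo <= a.yhi

def enlargement(r, e):
    """Extra area r would need in order to also cover e."""
    return area(mbr(r, e)) - area(r)

a = Rect(0, 0, 2, 2)
b = Rect(1, 1, 4, 3)
print(mbr(a, b))           # Rect(xlo=0, ylo=0, xhi=4, yhi=3)
print(overlaps(a, b))      # True
print(enlargement(a, b))   # 8
```

A point is simply a rectangle with zero extent, for example `Rect(3, 3, 3, 3)`, which is how one structure can hold both points and shapes.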

  19. R-Trees: Region Trees • How is it different from others?

  20. R-Tree: Where-Am-I Queries • Given point P, return data regions containing P • Start with root • Find subregions S containing P: • if S is a data region: return S • else: recursively search S
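The recursive search above can be sketched as follows, assuming a hand-built two-level tree. The `Node` class and the rectangle layout `(xlo, ylo, xhi, yhi)` are illustrative choices.

```python
class Node:
    def __init__(self, leaf, entries):
        self.leaf = leaf          # leaf: entries are (rect, object_id)
        self.entries = entries    # interior: entries are (rect, child Node)

def contains(rect, p):
    xlo, ylo, xhi, yhi = rect
    return xlo <= p[0] <= xhi and ylo <= p[1] <= yhi

def where_am_i(node, p):
    """Return the ids of all data regions that contain point p."""
    found = []
    for rect, item in node.entries:
        if contains(rect, p):
            if node.leaf:
                found.append(item)                  # data region: report it
            else:
                found.extend(where_am_i(item, p))   # subregion: search it
    return found

leaf1 = Node(True, [((0, 0, 4, 4), "A"), ((3, 3, 6, 6), "B")])
leaf2 = Node(True, [((5, 0, 9, 4), "C"), ((3, 2, 8, 5), "D")])
root = Node(False, [((0, 0, 6, 6), leaf1), ((3, 0, 9, 5), leaf2)])
print(where_am_i(root, (3.5, 3.5)))   # ['A', 'B', 'D']
```

Because sibling regions may overlap, the query point falls inside both root entries here and both subtrees have to be searched, unlike a B-tree lookup that follows one path.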

  21. R-Tree: Split Node • Objectives: • minimize covered area of the containing region • Why? • Is this the only objective? What else?

  22. R-Tree: Split Node • Finding the minimal-covered-area split is exponential • need to consider all binary partitionings • M+1 entries --> on the order of 2^M candidate partitions (?) • Use heuristics (not exhaustive) • quadratic algorithm • linear algorithm
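A brute-force sketch that makes the exponential cost concrete. It assumes the objective is the summed area of the two group MBRs; the function name `exhaustive_split` and the sample rectangles are illustrative. Entry 0 is pinned to the first group so that mirror-image partitions are not counted twice.

```python
from itertools import combinations

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def mbr(rects):
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def exhaustive_split(rects, m=1):
    """Try every two-group partition; return (best, number of partitions tried)."""
    n = len(rects)
    others = range(1, n)
    best, tried = None, 0
    for k in range(n - 1):                     # how many others join entry 0
        for combo in combinations(others, k):
            g1 = [0, *combo]
            g2 = [i for i in others if i not in combo]
            tried += 1
            if len(g1) < m or len(g2) < m:     # respect the minimum fill m
                continue
            cost = (area(mbr([rects[i] for i in g1]))
                    + area(mbr([rects[i] for i in g2])))
            if best is None or cost < best[0]:
                best = (cost, g1, g2)
    return best, tried

# M = 4, so an overflowing node holds M + 1 = 5 entries (xlo, ylo, xhi, yhi).
rects = [(0, 0, 1, 1), (1, 0, 2, 1), (8, 8, 9, 9), (9, 8, 10, 9), (0, 1, 1, 2)]
best, tried = exhaustive_split(rects, m=2)
print(tried)   # 15, which is 2**4 - 1
print(best)    # (6, [0, 1, 4], [2, 3])
```

With M + 1 entries the loop visits 2^M - 1 partitions before the minimum-fill filter, which is why practical implementations fall back on the quadratic or linear heuristics.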

  23. R-Tree: Quadratic Split • PickSeeds: select two seeds for the two groups • heuristic: the pair that wastes the largest area • waste = area(MBR of pair) - [area(I) + area(J)] • While entries remain and neither group needs all of them to reach the minimum m: PickNext: • heuristic: greedily select the entry with the strongest preference • d1 & d2: area increase if entry E is added to group 1 & group 2 • select E with max(|d1 - d2|) • add E to the group with the least enlargement • Quadratic in M: • PickSeeds considers every pair • What about PickNext?
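A sketch of the quadratic split steps listed above, working on entry indices. The tie-breaking order (least enlargement, then smaller area, then fewer entries) and the function names are assumptions made for this illustration.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def pick_seeds(rects):
    """Pair that wastes the most area: area(MBR of pair) - area(I) - area(J)."""
    pairs = [(i, j) for i in range(len(rects)) for j in range(i + 1, len(rects))]
    return max(pairs, key=lambda ij: area(union(rects[ij[0]], rects[ij[1]]))
               - area(rects[ij[0]]) - area(rects[ij[1]]))

def quadratic_split(rects, m):
    """Split entries into two groups of indices, each with at least m members."""
    i, j = pick_seeds(rects)
    groups, boxes = [[i], [j]], [rects[i], rects[j]]
    rest = [k for k in range(len(rects)) if k not in (i, j)]

    def growth(k, g):
        return area(union(boxes[g], rects[k])) - area(boxes[g])

    while rest:
        short = [g for g in (0, 1) if len(groups[g]) + len(rest) <= m]
        if short:
            # One group needs every remaining entry to reach the minimum m.
            g, picks = short[0], list(rest)
        else:
            # PickNext: the entry with the largest difference in preference.
            k = max(rest, key=lambda k: abs(growth(k, 0) - growth(k, 1)))
            g = min((0, 1), key=lambda g: (growth(k, g), area(boxes[g]),
                                           len(groups[g])))
            picks = [k]
        for k in picks:
            groups[g].append(k)
            boxes[g] = union(boxes[g], rects[k])
        rest = [k for k in rest if k not in picks]
    return groups[0], groups[1]

rects = [(0, 0, 1, 1), (1, 0, 2, 1), (8, 8, 9, 9), (9, 8, 10, 9), (0, 1, 1, 2)]
print(pick_seeds(rects))           # (0, 3)
print(quadratic_split(rects, 2))   # ([0, 1, 4], [3, 2])
```

PickNext re-evaluates every remaining entry each round, so it performs on the order of M^2 area computations overall, the same order as the all-pairs scan in PickSeeds.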

  24. R-Tree: Linear Split • PickSeeds: select two seeds for the two groups • heuristic: the pair that is the most separated in any dimension • separation = distance between the near sides • normalized by the extent of the whole entry set along that dimension • While entries remain and neither group needs all of them to reach the minimum m: PickNext: • heuristic: take any of the remaining entries • add it to the group with the least enlargement • Why is this linear?
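A sketch of the seed selection for the linear split only, with rectangles as `(xlo, ylo, xhi, yhi)` tuples. The function name and the fallback for degenerate input are assumptions.

```python
def linear_pick_seeds(rects):
    """Return the pair with the greatest normalized separation along any dimension."""
    best = None
    idx = range(len(rects))
    for lo, hi in ((0, 2), (1, 3)):                  # x sides, then y sides
        high_low = max(idx, key=lambda i: rects[i][lo])    # highest low side
        low_high = min(idx, key=lambda i: rects[i][hi])    # lowest high side
        width = max(r[hi] for r in rects) - min(r[lo] for r in rects)
        if high_low == low_high or width == 0:
            continue
        # Gap between the near sides, normalized by the extent of the whole set.
        sep = (rects[high_low][lo] - rects[low_high][hi]) / width
        if best is None or sep > best[0]:
            best = (sep, low_high, high_low)
    if best is None:
        return 0, 1            # degenerate input: arbitrary fallback (assumption)
    return best[1], best[2]

rects = [(0, 0, 1, 1), (1, 0, 2, 1), (8, 8, 9, 9), (9, 8, 10, 9), (0, 1, 1, 2)]
print(linear_pick_seeds(rects))   # (0, 3)
```

Each dimension needs only a constant number of scans over the entries, with no pairwise comparison, which is where the linear running time comes from.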

  25. R-Tree: Other Operations • Insertion: • ChooseLeaf with least enlargement • SplitNode • AdjustTree: • propagate node split/adjust MBR upward • Deletion: • seek and destroy • CondenseTree: • propagate node deletion/adjust MBR upward • reinsert orphaned entries
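The ChooseLeaf step of insertion can be sketched as below. Nodes are plain dicts here, and breaking ties by the smaller area is an assumption; SplitNode and AdjustTree are not shown.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(r, e):
    return area(union(r, e)) - area(r)

def choose_leaf(node, rect):
    """Descend from the root, always into the child needing the least enlargement."""
    while not node["leaf"]:
        _, node = min(node["entries"],
                      key=lambda e: (enlargement(e[0], rect), area(e[0])))
    return node

leaf1 = {"leaf": True, "entries": [((0, 0, 4, 4), "A"), ((3, 3, 6, 6), "B")]}
leaf2 = {"leaf": True, "entries": [((5, 0, 9, 4), "C"), ((3, 2, 8, 5), "D")]}
root = {"leaf": False, "entries": [((0, 0, 6, 6), leaf1), ((3, 0, 9, 5), leaf2)]}

print(choose_leaf(root, (8, 1, 9, 2)) is leaf2)   # True: no enlargement needed
print(choose_leaf(root, (0, 5, 1, 7)) is leaf1)   # True: 6 extra area vs. 33
```

After the entry is added to the chosen leaf, the MBRs on the path back to the root have to be widened, and an overflowing leaf is handed to one of the split heuristics.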

  26. R-Tree Variants • R*-trees: • new and improved split algorithm: O(n log n) • forced reinserts of extreme entries • decrease overlapping of sibling regions • regions packed better • splits less often • R+-tree: • instead of overlapping regions, insert an object into multiple nodes • speeds up search, slows down insert/delete/update

  27. What You Should Know • What is an R-tree? • How does an R-tree generalize a B-tree? • Different kinds of queries prefer different indices; what kind of queries is an R-tree index especially suitable for?

  28. Carry Away Messages • Once again, we see that a breakthrough was motivated by new applications, so always keep an eye on new applications • New applications -> Limitations of existing framework/methods -> Improved framework/methods • Generalize whenever possible • Spatial data querying -> multi-dimensional data querying • 1-dimensional indexing (B+-tree, interval-based) -> multi-dimensional indexing (R-tree, region-based) • Can we further generalize R-trees? • New query capabilities often require new indices; what kinds of new query capabilities haven’t we thought about?
