Lecture 5: Indexing: R-Trees

Sept. 8, 2006 • ChengXiang Zhai • Most slides are adapted from Kevin Chang’s lecture slides

Two Essential Techniques for Efficient Processing of Declarative Queries • From “declarative” to “procedural” (query compilation) • Physical-query-plan operators • Query optimization (access path selection) improves efficiency • Implementation of physical-query-plan operators • Access methods (procedural support for relational operators) • Indexing improves efficiency

Access Methods: Indexing • What is an index? • partition data into buckets • label each bucket • use the labels to determine which buckets are relevant to a query • One-dimensional indexing: • hashing • B-trees/B+-trees (we’ll simply call them B-trees) • What are the differences (pros & cons)? • Indexes are designed to serve queries! • check each index method w.r.t. typical queries

Multidimensional Data • GIS (geographic info. systems) database • e.g.: map databases • CAD database • circuit layout (of rectangles) • Data cube systems • sale data (date, time, store, item, price) • a tuple as a point in multi-dimensional space • queries typically retrieve a “cube” • e.g.: date between 1990-1999, store=supermarket, price > 100

Example of Multidimensional Data • Values may be multi-dimensional

Multidimensional Queries • Point queries: • constrain values of some dimensions • Range queries: • constrain ranges of some dimensions • Nearest neighbor: • e.g.: the closest city to Champaign • Where-am-I query: regions related to a query region • given a point, find out where it is located • given a region, find overlapping, containing, contained regions • Others?

B-Tree for Nearest Neighbor Queries • Object represented by (X,Y) • X indexed by a B-tree • Y indexed by a B-tree • Given an object (x0, y0), find its NN object: • let d = 1 • find objs in the box from (x0-d, y0-d) to (x0+d, y0+d) • if there are some objs, find the NN among them --> done! • else d = d + delta; repeat • Problems?

B-Tree for Nearest Neighbor Queries: Problems • Intersecting multiple indexes is not efficient • Need to guess the distance d and delta • too small: no objs in the bounding box • too large: may be too many objs in the bounding box • May actually miss the NN point! • a point near a corner of the box can be farther than d from the query, so a closer point may lie just outside the box: wrong answer!
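The missed-neighbor problem can be reproduced in a few lines. The sketch below is illustrative only: it scans a plain list instead of intersecting two B-tree range scans, and the names `box_nn`, `box_nn_checked`, and the sample points are assumptions, not part of the lecture. The second function shows one possible repair (also not from the slides): once a candidate at distance r is found, re-scan the box of half-width r, which must contain the true nearest neighbor.

```python
import math

def box_nn(points, q, d=1.0, delta=1.0):
    """Expanding-box search from the slide: grow the box around q until it
    holds at least one point, then return the closest point inside it.
    Assumes `points` is non-empty (otherwise the loop never ends)."""
    x0, y0 = q
    while True:
        inside = [p for p in points
                  if abs(p[0] - x0) <= d and abs(p[1] - y0) <= d]
        if inside:
            return min(inside, key=lambda p: math.dist(p, q))
        d += delta

def box_nn_checked(points, q, d=1.0, delta=1.0):
    """Hypothetical repair: re-scan the box whose half-width equals the
    distance to the first candidate; the true NN cannot lie outside it."""
    x0, y0 = q
    r = math.dist(box_nn(points, q, d, delta), q)
    inside = [p for p in points
              if abs(p[0] - x0) <= r and abs(p[1] - y0) <= r]
    return min(inside, key=lambda p: math.dist(p, q))

# (0.9, 0.9) sits in a corner of the d=1 box; (1.1, 0.0) is closer to the
# query but just outside the box.
points = [(0.9, 0.9), (1.1, 0.0)]
q = (0.0, 0.0)
print(box_nn(points, q))          # the corner point, not the true NN
print(box_nn_checked(points, q))  # the true NN
```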

Indexing Techniques • How to reach the right buckets quickly? • Hash-like schemes: lookup by k → h(k) • grid files • partitioned indexing functions • Tree-like schemes: lookup by tree traversal • multiple key indexes • kd-trees • quad trees • R-trees • See demo: http://www.cs.umd.edu/~brabec/quadtree

Grid Files • [Figure: grid over data points; x-axis Age (0, 30, 40, 70, 100), y-axis Salary (20K, 80K, 150K, 200K, 250K)]

Grid Files • Each region corresponds to a bucket • Why is this hash-based indexing? • Good for partial-match, range, NN • Buckets may become empty or overfull over time • points are not distributed uniformly • Number of buckets grows exponentially with the number of dimensions D • Reorganization requires repartitioning space • need to move the grid lines
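A grid file lookup can be sketched as one binary search per dimension, which is what makes it hash-like: the cell is computed from the key rather than found by traversal. Everything here is an assumption for illustration: the grid-line positions (loosely borrowed from the figure's axis marks), the in-memory `buckets` dictionary, and the function names.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical grid lines; salary is in thousands.
AGE_LINES = [30, 40, 70]
SALARY_LINES = [20, 80, 150, 200]

buckets = defaultdict(list)  # (age stripe, salary stripe) -> records

def cell(age, salary):
    """Map a point to its grid cell: one binary search per dimension."""
    return (bisect_right(AGE_LINES, age), bisect_right(SALARY_LINES, salary))

def insert(age, salary):
    buckets[cell(age, salary)].append((age, salary))

def range_query(age_lo, age_hi, sal_lo, sal_hi):
    """Visit only the cells overlapping the query box, then filter."""
    a0, s0 = cell(age_lo, sal_lo)
    a1, s1 = cell(age_hi, sal_hi)
    hits = []
    for a in range(a0, a1 + 1):
        for s in range(s0, s1 + 1):
            hits += [p for p in buckets.get((a, s), [])
                     if age_lo <= p[0] <= age_hi and sal_lo <= p[1] <= sal_hi]
    return hits

for rec in [(25, 60), (45, 60), (50, 120), (85, 140)]:
    insert(*rec)
print(range_query(40, 60, 50, 130))
```

Empty cells cost nothing in this dictionary-based sketch, but a disk-based grid file has to keep a directory entry per cell, which is where the exponential growth with D shows up.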

Partitioned Hash Functions • Hash function produces K bits that identify a bucket • Bits are partitioned among attributes • buckets identified by b1b2b3b4b5b6 • age produces b1b2b3 • salary produces b4b5b6 • How is this different from grid files? • can simulate a grid file (generalization of grid files!) • range/NN queries? • bucket utilization?
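The six-bit example can be sketched as two small hash functions whose outputs are concatenated. The particular functions `h_age` and `h_salary` are made up for illustration; any functions producing three bits each would do.

```python
def h_age(age):
    """Hypothetical 3-bit hash of age (bits b1b2b3)."""
    return age % 8

def h_salary(salary_k):
    """Hypothetical 3-bit hash of salary in thousands (bits b4b5b6)."""
    return (salary_k // 10) % 8

def bucket(age, salary_k):
    """Concatenate the two 3-bit codes into one 6-bit bucket number."""
    return (h_age(age) << 3) | h_salary(salary_k)

def buckets_for_age(age):
    """Partial match on age only: the age bits are fixed and the salary
    bits are unknown, so 8 of the 64 buckets must be searched."""
    prefix = h_age(age) << 3
    return [prefix | low for low in range(8)]

b = bucket(45, 60)
print(format(b, "06b"))
print(buckets_for_age(45))
```

Because nearby ages hash to unrelated bit patterns here, a range or nearest-neighbor query gets no help from this structure, which is one way to approach the question the slide poses.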

Multiple-Key Indexes • Each level is an index for one attribute • Partial match will work if the prefix of a path is specified • But, what if only salary is specified? • NN queries? • [Figure: two index levels, age then salary]

kd-Tree: k-dimensional Search Tree • binary tree • interior nodes: attr, div value • attributes alternate at different levels • binary search to find records • [Figure: example kd-tree; interior nodes split on Salary or Age (e.g., Salary 150 at the root, then Age 60 and Age 47), with (age, salary) records stored at the leaves]
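A minimal kd-tree in the slide's style (interior nodes hold an attribute and a dividing value, records live at the leaves) might look like the sketch below. The median split rule, the `leaf_size` parameter, and the choice to start with age at the root are assumptions of this sketch; the slide's example tree starts with salary and picks its own dividing values.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KDNode:
    attr: int = 0                      # 0 = age, 1 = salary
    value: float = 0.0                 # dividing value for `attr`
    left: Optional["KDNode"] = None    # records with attr < value
    right: Optional["KDNode"] = None   # records with attr >= value
    records: Optional[list] = None     # set only on leaves

def build(records, depth=0, leaf_size=2):
    """Split on alternating attributes at the median until leaves are small."""
    if len(records) <= leaf_size:
        return KDNode(records=list(records))
    attr = depth % 2
    ordered = sorted(records, key=lambda r: r[attr])
    value = ordered[len(ordered) // 2][attr]
    left = [r for r in ordered if r[attr] < value]
    right = [r for r in ordered if r[attr] >= value]
    if not left:  # cannot separate on this attribute; keep an oversized leaf
        return KDNode(records=ordered)
    return KDNode(attr, value,
                  build(left, depth + 1, leaf_size),
                  build(right, depth + 1, leaf_size))

def find(node, record):
    """Exact-match lookup: one comparison per level, like binary search."""
    while node.records is None:
        node = node.left if record[node.attr] < node.value else node.right
    return record in node.records

# (age, salary) records taken from the slide's figure.
pts = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
       (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
root = build(pts)
print(find(root, (50, 120)), find(root, (40, 40)))
```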

Quad Trees • Interior nodes represent square regions • with at most M entries • Split into four quadrants if more than M entries • [Figure: data points in the Age (0 to 100) by Salary (0 to 200K) plane, recursively divided into quadrants]
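The split rule can be sketched as follows. This is a bare-bones point quadtree over a square region; the class name, the `capacity` parameter (playing the role of M), and the coordinates are illustrative assumptions. Note that inserting more than `capacity` identical points would recurse forever in this sketch.

```python
class QuadTree:
    def __init__(self, x0, y0, size, capacity=2):
        self.x0, self.y0, self.size = x0, y0, size  # lower-left corner, side
        self.capacity = capacity                    # M: max points per leaf
        self.points = []
        self.children = None                        # None while this is a leaf

    def _child(self, p):
        """Pick the quadrant containing p (0=SW, 1=SE, 2=NW, 3=NE)."""
        half = self.size / 2
        ix = 1 if p[0] >= self.x0 + half else 0
        iy = 1 if p[1] >= self.y0 + half else 0
        return self.children[2 * iy + ix]

    def insert(self, p):
        if self.children is not None:
            self._child(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity:
            # Overflow: split into four equal quadrants and push points down.
            half = self.size / 2
            self.children = [
                QuadTree(self.x0 + dx * half, self.y0 + dy * half,
                         half, self.capacity)
                for dy in (0, 1) for dx in (0, 1)
            ]
            pts, self.points = self.points, []
            for q in pts:
                self._child(q).insert(q)

tree = QuadTree(0, 0, 100, capacity=2)
for p in [(10, 10), (20, 20), (80, 80)]:
    tree.insert(p)
print(tree.children[0].points, tree.children[3].points)
```

Unlike a kd-tree, the dividing lines here are fixed at the midpoints of the region and do not depend on the data.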

Quad Trees (cont.) • In terms of space partitioning: • How is it different from a grid file? • How is it different from a kd-tree?

R-Trees: Region Trees • B-tree related: • balanced tree (by dynamic restructuring) • guaranteed utilization of nodes • between half and completely full • Support regions • Coverage of siblings may overlap

R-Trees • Structure like B-tree, but keys are minimum bounding rectangle (MBR) regions • does B-tree use (1-D) MBR? e.g., keys: | 57 | 81 | 95 | • Interior node contains sets of regions • Regions can overlap • Regions do not necessarily fully cover the entire parent region • Region can be a point or a shape • Utilization of nodes: (unless root) • max capacity: M (How to determine M?) • min capacity: m <= M/2 (Why?)

R-Trees: Region Trees • How is it different from others?

R-Tree: Where-Am-I Queries • Given point P, return data regions containing P • Start with root • Find subregions S containing P: • if S is a data region: return S • else: recursively search S
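The recursive search can be sketched as below. The node layout (`RNode` with a list of `(mbr, item)` entries) and the rectangle encoding `(xmin, ymin, xmax, ymax)` are assumptions of this sketch, not a prescribed R-tree format.

```python
from dataclasses import dataclass, field

def contains(r, p):
    """True if rectangle r = (xmin, ymin, xmax, ymax) contains point p."""
    return r[0] <= p[0] <= r[2] and r[1] <= p[1] <= r[3]

@dataclass
class RNode:
    leaf: bool
    # (mbr, child RNode) on interior nodes, (mbr, data id) on leaves
    entries: list = field(default_factory=list)

def where_am_i(node, p):
    """Return the ids of all data regions whose rectangle contains p."""
    hits = []
    for mbr, item in node.entries:
        if contains(mbr, p):
            if node.leaf:
                hits.append(item)
            else:
                hits.extend(where_am_i(item, p))
    return hits

leaf1 = RNode(True, [((0, 0, 4, 4), "A"), ((3, 3, 6, 6), "B")])
leaf2 = RNode(True, [((5, 0, 9, 4), "C")])
root = RNode(False, [((0, 0, 6, 6), leaf1), ((5, 0, 9, 4), leaf2)])

print(where_am_i(root, (3.5, 3.5)))
print(where_am_i(root, (5.5, 3.5)))
```

In the second query the point falls inside both sibling MBRs, so both subtrees are searched. That is the cost of allowing sibling regions to overlap, and it is why a B-tree-style single root-to-leaf path is not guaranteed.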

R-Tree: Split Node • Objectives: • minimize covered area of the containing region • Why? • Is this the only objective? What else?

R-Tree: Split Node • Finding the minimal-covered-area split takes exponential time • need to consider all binary partitionings • M+1 entries --> approximately 2^M - 1 candidate partitionings • Use heuristics (not exhaustive) • quadratic algorithm • linear algorithm
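The exponential count can be checked with a short calculation. It ignores the minimum-fill constraint m, which removes some candidates but does not change the exponential growth: each of the M+1 entries goes to one of two groups, the two assignments that leave a group empty are discarded, and the result is halved because the groups are unlabeled.

```latex
\[
\#\{\text{splits of } M+1 \text{ entries into two non-empty groups}\}
  \;=\; \frac{2^{M+1} - 2}{2} \;=\; 2^{M} - 1
\]
```

For a node capacity of M = 50, that is already about 1.1 × 10^15 candidate splits.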

R-Tree: Quadratic Split • PickSeeds: select two seeds for two groups • heuristics: pair that wastes the largest area • waste = area(MBR of pair I, J) - [area(I) + area(J)] • While entries remain (and both groups can still reach the minimum fill): PickNext: • heuristics: greedily select the next entry with max preference • d1 & d2: area increase if entry E is added to group 1 & 2 • select E with max(|d1 - d2|) • Add it to the group with least enlargement • Quadratic in M: • PickSeeds considers every pair • What about PickNext?
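The two heuristics can be sketched as follows. This is a simplified illustration, not a complete split routine: rectangles are `(xmin, ymin, xmax, ymax)` tuples, ties in PickNext go to group 0, and the function names and the sample rectangles are assumptions of this sketch.

```python
from itertools import combinations

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """Minimum bounding rectangle of two rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(mbr, r):
    """Area increase of mbr if r is added to it."""
    return area(union(mbr, r)) - area(mbr)

def pick_seeds(rects):
    """Pair that wastes the most area when covered together (all pairs)."""
    def waste(ij):
        i, j = ij
        return area(union(rects[i], rects[j])) - area(rects[i]) - area(rects[j])
    return max(combinations(range(len(rects)), 2), key=waste)

def quadratic_split(rects, m):
    """Split entry indices into two groups, each with at least m entries."""
    i, j = pick_seeds(rects)
    groups, mbrs = [[i], [j]], [rects[i], rects[j]]
    rest = [k for k in range(len(rects)) if k not in (i, j)]
    while rest:
        # If a group needs every remaining entry to reach m, it gets them.
        needy = next((g for g in (0, 1)
                      if len(groups[g]) + len(rest) == m), None)
        if needy is not None:
            k, g = rest[0], needy
        else:
            # PickNext: the entry with the strongest preference |d1 - d2|.
            k = max(rest, key=lambda e: abs(enlargement(mbrs[0], rects[e])
                                            - enlargement(mbrs[1], rects[e])))
            d1, d2 = (enlargement(mbrs[g], rects[k]) for g in (0, 1))
            g = 0 if d1 <= d2 else 1
        rest.remove(k)
        groups[g].append(k)
        mbrs[g] = union(mbrs[g], rects[k])
    return groups

rects = [(0, 0, 1, 1), (0.5, 0.5, 1.5, 1.5), (10, 10, 11, 11),
         (10.5, 10.5, 11.5, 11.5), (0, 1, 1, 2)]
print(pick_seeds(rects))
print(quadratic_split(rects, m=2))
```

In this sketch each PickNext call scans all remaining entries and is called once per entry, so the assignment loop is also quadratic in the number of entries.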

R-Tree: Linear Split • PickSeeds: select two seeds for two groups • heuristics: pair that is the most separated in any dimension • separation = distance between the near sides • normalized by the extent of the whole entry set along that dimension • While entries remain (and both groups can still reach the minimum fill): PickNext: • heuristics: take any of the remaining entries • Add it to the group with least enlargement • Why is this linear?
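The linear PickSeeds step can be sketched as a single pass per dimension. The rectangle encoding `(xmin, ymin, xmax, ymax)`, the function name, and the fallback to the first two entries when no dimension separates anything are assumptions of this sketch.

```python
def linear_pick_seeds(rects, dims=2):
    """For each dimension, take the rectangle with the highest low side and
    the one with the lowest high side; keep the pair with the largest
    separation after normalizing by the total extent in that dimension."""
    best = None
    for d in range(dims):
        lo, hi = d, d + dims  # tuple positions of the low and high side
        highest_low = max(range(len(rects)), key=lambda i: rects[i][lo])
        lowest_high = min(range(len(rects)), key=lambda i: rects[i][hi])
        width = max(r[hi] for r in rects) - min(r[lo] for r in rects)
        if highest_low == lowest_high or width == 0:
            continue  # this dimension does not separate two entries
        sep = (rects[highest_low][lo] - rects[lowest_high][hi]) / width
        if best is None or sep > best[0]:
            best = (sep, lowest_high, highest_low)
    return (best[1], best[2]) if best else (0, 1)

rects = [(0, 0, 1, 1), (2, 0, 3, 1), (9, 0, 10, 1)]
print(linear_pick_seeds(rects))
```

Each dimension needs only a constant number of scans over the entries and no pair is ever compared directly, which is the short answer to the slide's closing question.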

R-Tree: Other Operations • Insertion: • ChooseLeaf with least enlargement • SplitNode • AdjustTree: • propagate node split/adjust MBR upward • Deletion: • seek and destroy • CondenseTree: • propagate node deletion/adjust MBR upward • reinsert orphaned entries
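ChooseLeaf can be sketched as a descent that, at every level, follows the entry whose MBR needs the least enlargement to cover the new rectangle. The dictionary-based node layout and the tie-break on smaller area are assumptions of this sketch; SplitNode and AdjustTree are left out.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(mbr, r):
    """Area increase of mbr if it had to cover r as well."""
    return area(union(mbr, r)) - area(mbr)

def choose_leaf(node, rect):
    """Descend from `node` to the leaf where `rect` should be inserted.
    Interior entries are (mbr, child) pairs; ties go to the smaller MBR."""
    while not node["leaf"]:
        _, node = min(node["entries"],
                      key=lambda e: (enlargement(e[0], rect), area(e[0])))
    return node

leaf1 = {"leaf": True, "name": "leaf1", "entries": []}
leaf2 = {"leaf": True, "name": "leaf2", "entries": []}
root = {"leaf": False,
        "entries": [((0, 0, 6, 6), leaf1), ((5, 0, 9, 4), leaf2)]}

print(choose_leaf(root, (1, 1, 2, 2))["name"])
print(choose_leaf(root, (5.5, 1, 6, 2))["name"])
```

The second rectangle fits inside both sibling MBRs with zero enlargement, so the tie-break decides; after the insert, AdjustTree would widen MBRs on the way back up only if the chosen leaf's MBR actually grew.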

R-Tree Variants • R*-trees: • new and improved split algorithm: n log(n) • forced reinserts of extreme entries • decrease overlapping of sibling regions • regions packed better • splits less often • R+-tree: • instead of overlapping regions, use multiple inserts • speed up search, slow down insert/delete/update

What You Should Know • What is an R-tree? • How does an R-tree generalize a B-tree? • Different kinds of queries prefer different indices; what kind of queries is an R-tree index especially suitable for?

Carry Away Messages • Once again, we see a breakthrough was motivated by new applications, so always keep an eye on new applications • New applications -> Limitations of existing framework/methods -> Improved framework/methods • Generalize whenever it’s possible • Spatial data querying -> multi-dimensional data querying • 1-dimension indexing (B+-tree, interval-based)-> Multi-dimension indexing (R-tree, region-based) • Can we further generalize R-trees? • New query capabilities often require new indices; what kind of new query capabilities haven’t we thought about?
