Ch. 16: Sweep-Zones
Basic Question: Is it possible to compute nearest neighbors in expected time O(n*log(n))?
Basic Idea: Generalize sweep-lines to sweep-zones.
Def.: The sweep-zone SZ of an area is the set of regions touching the upper boundary of the area from below.
R. Bayer, Ch. 16, DWH-SS2000
UB-Tree Insertion 18/19
[Figure: UB-tree space partitioning with regions numbered 1 to 18]
Sweep-Zone Algorithm 1:
i: { Z-regions have been read in increasing Z-order up to region Ri-1, i.e. area(Ri-1) with upper boundary B(Ri-1) }
{ the set of cached regions C(Ri) is the set of regions in SZi-1 = SZ(area(Ri-1)) plus region Ri }
1. For every point p in Ri, let l(p) and h(p) be the lower and higher neighbor of p on the Z-curve; compute l(p) and h(p).
2. Let q = l(p) if dist(p, l(p)) < dist(p, h(p)), and q = h(p) otherwise.
3. Let Q(p) be the query box with center p and side length 2*dist(p, q).
4. Retrieve Q(p) from cache or disk and compute the nearest neighbor of p. { Note: retrieval of Q(p) should take time O(log n); finding the nearest neighbor of p should take nearly constant time. }
5. Cache the regions intersecting Q(p) to enforce linear I/O time.
6. If Ri was the last region in Z-order, then exit.
7. Release all regions from C(Ri) which are not in SZi.
8. i := i+1; read the next region Ri in Z-order.
9. Go to step 1.
{ all nearest neighbors are known, now cluster }
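Steps 1 to 4 can be sketched in memory as follows. This is a minimal illustration, not the slides' implementation: it assumes distinct 2D points with non-negative integer coordinates, ignores regions, caching, and I/O entirely, and replaces the retrieval of Q(p) by a linear scan. The names `z_address` and `nearest_neighbors` are hypothetical.

```python
from math import dist


def z_address(x, y, bits=16):
    """Interleave the bits of x and y into one Z-order (Morton) address."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def nearest_neighbors(points):
    """Map every point to its nearest neighbor (needs at least two distinct points)."""
    order = sorted(points, key=lambda p: z_address(*p))
    result = {}
    for i, p in enumerate(order):
        # Step 1: lower and higher neighbor of p on the Z-curve.
        lo = order[i - 1] if i > 0 else None
        hi = order[i + 1] if i + 1 < len(order) else None
        # Step 2: q is l(p) if it is strictly closer, h(p) otherwise.
        if hi is None or (lo is not None and dist(p, lo) < dist(p, hi)):
            q = lo
        else:
            q = hi
        r = dist(p, q)
        # Step 3: Q(p) has center p and side length 2*r.
        # Step 4: search Q(p); a linear scan stands in for the box retrieval.
        best, best_d = q, r
        for c in order:
            if c != p and abs(c[0] - p[0]) <= r and abs(c[1] - p[1]) <= r:
                d = dist(p, c)
                if d < best_d:
                    best, best_d = c, d
        result[p] = best
    return result
```

The box Q(p) encloses the circle of radius dist(p, q) around p, so the true nearest neighbor, which is at most as far away as q, always lies inside it.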
Sweep-Zone Algorithm 2:
Basic Idea: run the algorithm forward to compute the lower (w.r.t. Z-order) nearest neighbor of p and backward to compute the upper (w.r.t. Z-order) nearest neighbor of p; the nearest neighbor of p is then the closer of the two,
i.e. modify step 4 in Sweep-Zone Algorithm 1 to compute Q(p) ∩ area(Ri).
Advantages: all pages are read in increasing or decreasing Z-order only (sequential reads), and cache requirements are smaller.
Disadvantage: data must be read twice. Tradeoff?
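The forward/backward idea can be illustrated with a small in-memory sketch. It assumes distinct 2D integer points and uses a brute-force scan over the points already seen in each pass instead of query boxes and cached regions; `one_pass` and `nearest_by_two_passes` are hypothetical names.

```python
from math import dist


def z_address(x, y, bits=16):
    """Interleave the bits of x and y into one Z-order (Morton) address."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def one_pass(order):
    """For each point, the nearest among the points read earlier in this pass."""
    seen, found = [], {}
    for p in order:
        if seen:
            found[p] = min(seen, key=lambda c: dist(p, c))
        seen.append(p)
    return found


def nearest_by_two_passes(points):
    order = sorted(points, key=lambda p: z_address(*p))
    lower = one_pass(order)        # forward: lower nearest neighbors
    upper = one_pass(order[::-1])  # backward: upper nearest neighbors
    result = {}
    for p in order:
        candidates = [c for c in (lower.get(p), upper.get(p)) if c is not None]
        # The nearest neighbor is the closer of the lower and upper one.
        result[p] = min(candidates, key=lambda c: dist(p, c))
    return result
```

Every other point precedes or follows p in Z-order, so the closer of the two per-pass results is the overall nearest neighbor.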
Cache Contents for Algorithm 2:
[Figure: cache contents as regions 1 to 18 are read]
Cache Modification
1. Determine the extension of the next region to be read, using the upper part of the UB-index.
2. Determine the regions that can be released, i.e. the regions of SZi-1 that are not in SZi.
3. Release these regions from the cache.
4. Read the next region, i.e. transfer it from disk to cache.
Observations:
expected cache size ~ 1.5 * sqrt(18) = 6.4
maximal occurring cache size = 6
average cache size = 4.28
Cache Organization: keep the cache organized as a set of regions sorted in Z-order, e.g. an AVL-tree with the elementary operations "append single element" and "delete set of elements".
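A rough sketch of such a cache and of the expected-size estimate follows. It makes two assumptions that are not in the slides: regions are identified by their number in Z-order, and a plain sorted Python list stands in for the AVL-tree (the standard library has none). It covers the forward pass only; `RegionCache` and `expected_cache_size` are hypothetical names.

```python
from bisect import bisect_left
from math import sqrt


class RegionCache:
    """Cached regions kept sorted by their Z-order number."""

    def __init__(self):
        self.regions = []
        self.max_size = 0

    def append(self, region):
        # Regions arrive in increasing Z-order, so the new one goes last.
        if self.regions and region <= self.regions[-1]:
            raise ValueError("regions must be appended in increasing Z-order")
        self.regions.append(region)
        self.max_size = max(self.max_size, len(self.regions))

    def release(self, to_release):
        # Delete a whole set of regions in one pass; the order is preserved.
        drop = set(to_release)
        self.regions = [r for r in self.regions if r not in drop]

    def __contains__(self, region):
        # Binary search works because the list stays sorted.
        i = bisect_left(self.regions, region)
        return i < len(self.regions) and self.regions[i] == region


def expected_cache_size(n_regions, factor=1.5):
    """The estimate from the observations: factor * sqrt(number of regions)."""
    return factor * sqrt(n_regions)
```

For the 18 regions of the example, `expected_cache_size(18)` evaluates to about 6.36, which the slide rounds to 6.4.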
Open Questions:
• Which algorithm is faster?
• Which algorithm requires fewer resources?
• What are the tradeoffs between I/O, cache size, CPU time, total time, etc.?
• Analytic comparison of both algorithms?
Algorithm 3
This is a local optimization of Algorithm 2: if Q(p) ⊆ area(Ri), then the lower nearest neighbor of p is already the nearest neighbor of p, and we can skip the computation of the upper nearest neighbor of p in the backward phase.
Algorithm 4
If the lower nearest neighbor of p is already the nearest neighbor of p, then discard p entirely from the backward phase, i.e. reduce the amount of data and computations for the second phase; but then we have to write out the non-discarded points.
Open Question: under what conditions is Algorithm 4 better than Algorithm 3?
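One possible way to test whether the query box lies inside the area read so far is sketched below. It assumes that area(Ri) consists of all grid cells whose Z-address is at most the upper boundary B(Ri), given as an integer address, and that coordinates are non-negative integers; `box_inside_read_area` is a hypothetical name. The Z-address grows with each coordinate, so the largest address inside an axis-parallel box is that of its upper corner, and comparing this single address with the boundary is enough.

```python
from math import ceil


def z_address(x, y, bits=16):
    """Interleave the bits of x and y into one Z-order (Morton) address."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def box_inside_read_area(p, r, upper_boundary, bits=16):
    """True if the box with center p and half side r has no cell beyond upper_boundary."""
    top = (1 << bits) - 1
    # Upper corner of Q(p), clipped to the universe.
    x_hi = min(p[0] + ceil(r), top)
    y_hi = min(p[1] + ceil(r), top)
    return z_address(x_hi, y_hi, bits) <= upper_boundary
```

If the test succeeds for p in the forward phase, the result found there is final and p needs no further work in the backward phase.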