B-Trees

B-Trees Size and Lookup

Records sorted on search key, with 10 records per blocks. The B-tree is a dense index. • 1,000,000 /10 = 100,000 blocks to hold the records • For the B-tree we reason as follows: • We need 1,000,000 pointers at the leaves to point to each record. • They can be packed into 1,000,000 / 70 = 14,286 leave blocks. • We need 14,286 pointers in the above level. • Packed into 14,286/70=204 blks • We need 204 pointers in the above • Packed into 204/70 = 3 blocks • Total: 100,000 + 14,286 + 204 + 3 1(root) = 114,494 blocks • Since the tree has 4 levels, we need 5 I/O’s for a lookup. B-Trees (I) Suppose: • Blocks can hold either: • 10 records or • 99 keys and 100 pointers • B-Tree nodes are 70% full • 69 keys and 70 pointers • 1,000,000 records • For each structure described below, determine • The total number of blocks • The average # of I/O for lookup given the search key.

Same as (A), but the records aren’t sorted. Still, records packed 10 to a block. • Same as (A) = 114,494 blocks • Same as (A) = 5 I/O’s for a lookup. B-Trees (II) • Suppose: • Blocks can hold either: • 10 records or • 99 keys and 100 pointers • B-Tree nodes are 70% full • 69 keys and 70 pointers • 1,000,000 records • For each structure described below, determine • The total number of blocks • The average # of I/O for lookup given the search key.

Same as (A), but B-tree is a sparse index. • 1,000,000 /10 = 100,000 blocks to hold the records • For the sparse B-tree: • We need 100,000 pointers at the leaves to point to each data block. • They can be packed into 100,000 / 70 = 1,429 leave blocks. • We need 1,429 pointers in the above level. • Packed into 1429/70=21 blocks • We need 21 pointers in the above • Packed into 21/70 = 1 block (root) • Total: 100,000 + 1,429 + 21 + 1 = 101,451blocks • Since the tree has 3 levels, we need 4 I/O’s for a lookup. B-Trees (III) • Suppose: • Blocks can hold either: • 10 records or • 99 keys and 100 pointers • B-Tree nodes are 70% full • 69 keys and 70 pointers • 1,000,000 records • For each structure described below, determine • The total number of blocks • The average # of I/O for lookup given the search key.

Instead of the B-Tree leaves having pointers to data records, the B-Tree leaves hold the records themselves. A block can hold 10 records, but a leaf block is in fact 70% full, i.e. there are 7 records per leaf. • 1,000,000 /7 = 142,857 blocks to hold the records. These blocks will be the leaves of the B-tree. • We need 142,857 pointers at the next level. • They can be packed into 142,857/ 70 = 2,040 blocks. • We need 2,040 pointers in the above next level. • Packed into 2040/70=30 blocks • We need 30 pointers in the above… • Packed into 30/70 = 1 block (root) • Total: 142857 + 2040 +30 +1 = 144,928 blocks • Since the tree has 4 levels, we need 4 I/O’s for a lookup. B-Trees (IV) • Suppose: • Blocks can hold either: • 10 records or • 99 keys and 100 pointers • B-Tree nodes are 70% full • 69 keys and 70 pointers • 1,000,000 records • For each structure described below, determine • The total number of blocks • The average # of I/O for lookup given the search key.

B-Trees Range Queries: Repeat the exercise in the case that the query is a range query that is matched by 1000 records.

Records sorted on search key, with 10 records per blocks. The B-tree is a dense index. • Since the tree has 4 levels, we need 4 I/O’s to go to the leaf where the pointer to start of the range is located. • Then, by following the sibling pointers we retrieve all the leaves holding the pointers to the matching records. • 1000 pointers are packed in 1000/70 = 14 blocks. • Now, by following each of the 1000 pointers we read the 1000 records. • Since the records are sorted the 1000 range records will occupy (almost) as few blocks as possible, i.e. 1000 / 10 = 100 blocks. • In total, we need 4 + 14 + 100 = 118 I/O’s. B-Trees (I) • Suppose: • Blocks can hold either: • 10 records or • 99 keys and 100 pointers • B-Tree nodes are 70% full • 69 keys and 70 pointers • 1,000,000 records • For each structure described below, determine • The total number of blocks • The average # of I/O for lookup given the search key.

Same as (A), but the records aren’t sorted. Still, records packed 10 to a block. • As in (A) we need 4+14 I/O’s to locate the 1000 pointers. • However, since the records aren’t sorted, it might be that the 1000 records are located in 1000 different blocks. So, we might end up reading 1000 blocks. • In total, we need 4 + 14 + 1000 = 1018 I/O’s. B-Trees (II) • Suppose: • Blocks can hold either: • 10 records or • 99 keys and 100 pointers • B-Tree nodes are 70% full • 69 keys and 70 pointers • 1,000,000 records • For each structure described below, determine • The total number of blocks • The average # of I/O for lookup given the search key.

Same as (A), but B-tree is a sparse index. • Since the tree has 3 levels, we need 3 I/O’s to go to the leaf where the pointer to start of the range is located. • How many pointers to data we need to follow? • 1000 records are packed into 1000/10 = 100 blocks. So, we need to follow 100 pointers. • How many leaves are needed to pack 100 pointers? • 2 leaves. • Total: 3 + 2 + 100 = 105 I/O’s B-Trees (III) • Suppose: • Blocks can hold either: • 10 records or • 99 keys and 100 pointers • B-Tree nodes are 70% full • 69 keys and 70 pointers • 1,000,000 records • For each structure described below, determine • The total number of blocks • The average # of I/O for lookup given the search key.

Instead of the B-Tree leaves having pointers to data records, the B-Tree leaves hold the records themselves. A block can hold 10 records, but a leaf block is in fact 70% full, i.e. there are 7 records per leaf. • Since the tree has 4 levels, we need 4 I/O’s to go to the leaf where the start record of the range is located. • By following the sibling pointers we need to retrieve as many leaves as are needed to hold 1000 records. • 1000 records are packed in 1000 /7 = 143 leaves. • Total: 4 + 143 = 147 I/O’s B-Trees (IV) • Suppose: • Blocks can hold either: • 10 records or • 99 keys and 100 pointers • B-Tree nodes are 70% full • 69 keys and 70 pointers • 1,000,000 records • For each structure described below, determine • The total number of blocks • The average # of I/O for lookup given the search key.

Hash Tables Extensible Hash Tables

Dynamic Hashing Framework • Hash function h produces a sequence of k bits. • Only some of the bits are used at any time to determine placement of keys in buckets. Extensible Hashing (Buckets may share blocks!) • Keep parameter i = number of bits from the beginning of h(K) that determine the bucket. • Bucket array now = pointers to buckets. • A block can serve as several buckets. • For each block, a parameter ji tells how many bits of h(K) determine membership in the block. • I.e., a block represents 2i-j buckets that share the first j bits of their number.

Example • An extensible hash table when i=1:

Extensible Hashtable Insert • If record with key K fits in the block pointed to by h(K), put it there. • If not, let this block B represent j bits. • j<i: • Split block B into two and distribute the records (of B) according to (j+1)st bit; • set j:=j+1; • fix pointers in bucket array, so that entries that formerly pointed to B now point either to B or the new block How? depending on…(j+1)st bit • j=i: • Set i:=i+1; • Double the bucket array, so it has now 2i+1 entries; • proceed as in (1). Letwbe an old array entry. Both the new entriesw0andw1point to the same block that w used to point to.

Now, after the insertion Before Example • Insert record with h(K) = 1010.

Currently After the insertions Example: Next • Next: records with h(K)=0000; h(K)=0111. • Bucket for 0... gets split, • but i stays at 2. • Then: record with h(K) = 1000. • Overflows bucket for 10... • Raise i to 3.

Exercise • Suppose we want to insert keys with hash values: 0000…1111 in an extensible hash table. • Assume that a block can hold three records.

i=1 i=1 0000 1 0000 1 0001 1 1 i=1 0000 1 0001 0010 1 0 1 0 1 0 1 Insertion of 0011. No room

i=2 i=3 0000 2 000 0000 3 0001 001 0001 0010 010 011 2 0010 3 100 0011 This is the new block. This is the new block. 101 110 1 111 2 1 10 01 00 11

i=3 i=3 000 0000 3 000 0000 3 001 0001 001 0001 010 010 011 011 0010 3 0010 3 100 100 0011 0011 101 101 110 110 111 2 111 0100 2 1 1

i=3 i=3 000 0000 3 000 0000 3 001 0001 001 0001 010 010 011 011 0010 3 0010 3 100 100 0011 0011 101 101 110 110 111 0100 2 111 0100 2 0101 0101 0110 1 1

i=3 000 0000 3 001 0001 010 011 0010 3 100 0011 101 110 111 0100 3 0110 3 0101 0111 1

i=3 000 0000 3 001 0001 010 011 0010 3 100 0011 101 110 111 0100 3 0110 3 0101 0111 1000 1

i=3 000 0000 3 001 0001 010 011 0010 3 100 0011 101 110 111 0100 3 0110 3 0101 0111 1000 1 1001

i=3 000 0000 3 001 0001 010 011 0010 3 100 0011 101 110 111 0100 3 0110 3 0101 0111 1000 1 1001 1010

i=3 000 0000 3 001 0001 010 011 0010 3 100 0011 101 110 111 0100 3 0110 3 0101 0111 1000 2 2 1001 Still no room for 1011 1010

i=3 000 0000 3 001 0001 010 011 0010 3 100 0011 101 110 111 0100 3 0110 3 0101 0111 1000 3 1010 3 2 1001 1011

i=3 000 0000 3 001 0001 010 011 0010 3 100 0011 101 110 111 0100 3 0110 3 0101 0111 1000 3 1010 3 1100 2 1001 1011

i=3 000 0000 3 001 0001 010 011 0010 3 100 0011 101 110 111 0100 3 0110 3 0101 0111 1000 3 1010 3 1100 2 1001 1011 1101

i=3 000 0000 3 001 0001 010 011 0010 3 100 0011 101 110 111 0100 3 0110 3 0101 0111 1000 3 1010 3 1100 2 1001 1011 1101 1110

i=3 000 0000 3 001 0001 010 011 0010 3 100 0011 101 110 111 0100 3 0110 3 0101 0111 1000 3 1010 3 1100 3 1110 3 1001 1011 1101 1111

Hash Tables Linear Hash Tables

Linear Hashing • Use i bits from right (loworder) end of h(K). • Buckets numbered [0…n-1], where 2i-1<n2i. • Let last i bits of h(K) be m = (a1,a2,…,ai) • If m < n, then record belongs in bucket m. • If nm<2i, then record belongs in bucket m-2i-1, that is the bucket we would get if we changed a1 (which must be 1) to 0. #of buckets #of records This is also part of the structure

Linear HashTable Insert • Pick an upper limit on capacity, • e.g., 85% (1.7 records/bucket in our example). • If an insertion exceeds capacity limit, set n := n + 1. • If new n is 2i + 1, set i := i + 1. No change in bucket numbers needed --- just imagine a leading 0. • Need to split bucket n - 2i-1 because there is now a bucket numbered (old) n.

i=1 #of buckets n=2 r=3 #of records Example • Insert record with h(K) = 0101. • Capacity limit exceeded; increment n. i=2 #of buckets n=3 r=4 #of records

Example • Insert record with h(K) = 0001. • Capacity limit not exceeded. • But bucket is full; add overflow bucket. i=2 n=3 r=5

i=2 n=4 r=7 Example • Insert record with h(K) = 1100. • Capacity exceeded; set n = 4, add bucket 11. • Split bucket 01.

Lookup in Linear Hash Table • For record(s) with search key K, compute h(K); search the corresponding bucket according to the procedure described for insertion. • If the record we wish to look up isn’t there, it can’t be anywhere else. • E.g. lookup for a key which hashes to 1010, and then for a key which hashes to 1011. i=2 n=3 r=4

Exercise • Suppose we want to insert keys with hash values: 0000…1111 in a linear hash table with 100% capacity threshold. • Assume that a block can hold three records.

i=1 0000 i=1 0000 0 0 0001 n=1 n=1 r=1 r=2 i=1 0000 i=1 0000 0 0 0001 0010 n=1 n=2 0010 r=3 r=4 0001 1 0011

i=1 0000 i=1 0000 0 0 0010 0010 n=2 n=2 0100 0100 r=5 r=6 0001 0001 1 1 0011 0011 0101

i=2 0000 00 n=3 0100 r=7 0001 01 0011 0101 0010 10 0110 Continue at home…

Multidimensional Indexes

Grid files (hash-like structure) • Divide data into stripes in each dimension • Rectangle in grid points to bucket • Example: database records (age,salary) for people who buy gold jewelry. Data: (25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260)

Grid file

Operations Lookup Find coordinates of point in each dimension --- gives you a bucket to search. Nearest Neighbor Lookup point P . Consider points in that bucket. • Problem: there could be points in adjacent buckets that are closer. • Example: NN of (45; 200). • Problem: there could be no points at all in the bucket: widen search? Range Queries Ranges define a region of buckets. • Buckets on border may contain points not in range. • Example: 35 < age <= 45; 50 < salary <= 100. Queries Specifying Only One Attribute • Problem: must search a whole row or column of buckets.

Insertion • Use overflow buckets, or split stripes in one or more dimensions • Insert (52,200). Split central bucket, for instance by splitting central salary stripe • The blocks of 3 buckets are to be processed. • In general the blocks of n buckets are to be processed during a split. • n is the number of buckets in the chosen direction • Very expensive.

Partitioned hashing • Example: Gold jewelry with • first bit = age mod 2 • bits 2 and 3: salary mod 4 • Works well for: • partial match (i.e. just an attribute specified) • Bad for: • range • Nearest Neighbors queries

KD-Trees • Generalizes binary search trees, but search attributes rotate among dimensions • Levels rotate among the dimensions, partitioning the points by comparison with a value for that dimension. • Leaves are blocks

Geometrically… • Remember we didn’t want the stripes in grid files to continue all along the vertical or horizontal direction? • Here they don’t.

B-Trees

B-Trees

Presentation Transcript

B + -Trees

B-Trees

B+-trees

B-Trees

B+ Trees

B Trees

B-Trees

B-trees

B-Trees

B + -Trees

B+ Trees

B-Trees And B+-Trees

B-Trees

B-Trees

B-Trees

B-Trees

B-Trees

B-Trees

B-Trees

B-Trees

B+ Trees

B-Trees