340 likes | 563 Views
Hashing and Indexing. John Ortiz. Overview. How to retrieve data records with a key value? Sequential search (O(N)) Binary Search (O(log 2 N)) Hashing Conventional (simple) indexing B+ tree indexing More advanced hashing & indexing. key h(key). <key>. Buckets (typically 1
E N D
Hashing and Indexing John Ortiz
Overview • How to retrieve data records with a key value? • Sequential search (O(N)) • Binary Search (O(log2 N)) • Hashing • Conventional (simple) indexing • B+ tree indexing • More advanced hashing & indexing Hashing and Indexing
key h(key) <key> Buckets (typically 1 disk block) . . . Hashing • A hash function maps a key value to a bucket address where the record can be found • Good for queries with condition A=v • Typical complexity O(1) Hashing and Indexing
record key h(key) key 1 Hash table Data file Hash Table • What are stored in buckets? • Option 1: Actual records (hash table is the file) • Option 2: Key values and pointers to records • Should we sort keys in buckets? Hashing and Indexing
Hash Functions • Perfect hash functions should evenly distribute key values to buckets (very difficult to come by). Good hash functions do a random distribution. • Ex: Key = ‘x1 x2 … xn’ n byte character string and table has b buckets, h(key) = (x1 + x2 + … xn) mod b • Many other choices (see Knuth Vol. 3) • May have to handle collision (bucket overflow). Typically with chaining of overflow blocks • With K mod M, 0 <= result <= M-1 Hashing and Indexing
Loading Factor • Loading Factor = # keys loaded / total # keys that fit • Try to keep loading factor between 50% & 80% • If < 50%, wasting space • If > 80%, overflows significant depends on how good hash function is & on # keys per bucket Hashing and Indexing
Data File Index File 90 70 50 30 10 110 40 120 20 60 80 100 10 50 90 30 70 100 20 40 80 60 Indexing Find rec. w/ key 60 Hashing and Indexing
Terms • Data file: contains blocks of data records • Index file: contains blocks of index entries • Index entry: <index key, address> • Index key: not necessarily a key of a relation • Address: for record with a key value (may be block or record address) • Search key: Index value used for a search Hashing and Indexing
Types of Indexes • What type of index key are used? • Primary indexes • Clustering indexes • Secondary indexes • Multilevel indexes • Dynamic Indexes, B-Trees, B+-Trees • Does every record has an index entry? • Dense index, sparse index • Consideration: How does it handle Updates? • Insertions? Deletions? Hashing and Indexing
210 170 130 90 50 10 Index Data file 30 70 230 150 190 110 90 70 50 30 10 80 60 40 20 100 Primary Index • Index key is the primary key • Sparse: one entry per data block • Data file sorted on index key • Can be B+ tree as well Hashing and Indexing
Primary Index • Example 1, p.159 • Ordered file, r = 30,000 records • Block Size B = 1024 bytes/block • Unspanned • Record Length R = 100 bytes/record • Bfr = floor( B/R ) records/block • # blocks = ??? (use units to determine formula!) • Binary search yields ~log2B accesses • Compare to search with primary index, see p.159 Hashing and Indexing
Data File Index 40 30 20 10 10 10 20 30 45 30 10 20 30 40 45 56 59 61 Clustering Index • Index key is not a key of relation (may have duplicate values) • Sparse: one entry per distinct value • Data file sorted on index key • Can be B+ tree Hashing and Indexing
Secondary Index • Index key: can be key or non-key • Pointers point to record (not block) • Dense: one entry per record (sparse at higher levels) • Data file not sorted on index key! • Can be B+ tree Hashing and Indexing
Index Data file 30 20 80 100 90 50 70 40 10 60 10 50 10 60 50 20 30 70 90 40 ... ... sparse high level Secondary Index Hashing and Indexing
Index file Data file 10 10 20 30 20 40 10 40 40 40 10 20 40 40 30 10 ... 40 10 20 40 Secondary Index for Non-Key • Option 1: Dense index • Problem: • excess overhead! • disk space Hashing and Indexing
Index file Data file 30 10 20 10 20 40 40 40 10 40 10 40 30 20 Secondary Index for Non-Key • Option 2: Reserve multiple pointers Problem: variable size records in index! Hashing and Indexing
10 30 10 20 20 40 40 40 10 40 50 10 20 60 ... 30 40 Secondary Index for Non-Key • Option 3: Use pointer buckets Data file Index file Buckets Hashing and Indexing
Advantage of Pointer Buckets • Assume following indexes on EMP(name,dept,floor,...) • Name: primary • Dept: secondary • Floor: secondary • Find employees in Toy dept on 2nd floor • Find pointers for Toy dept • Find pointers for 2nd floor • Intersect these sets of pointers • Retrieve records Hashing and Indexing
Index level 2 Index level 1 Data file 330 10 170 490 10 90 130 170 210 50 190 150 110 70 570 410 230 250 90 30 70 50 90 10 30 20 100 60 80 40 Simple Multiple level Index Hashing and Indexing
B+ Tree Index • The index file is organized as a B+ tree • Height-balanced • Nodes are blocks of index keys and pointers • Order P: Max # of pointers fits in a node • Nodes are at least 50% full • Support efficient updates Hashing and Indexing
100 Index file 120 150 180 30 3 5 11 120 130 180 200 100 101 110 150 156 179 30 35 Point to data records/blocks B+ Tree Index Example • P = 4 Root Hashing and Indexing
p0 a1 … ai pi ai+1 … ak pk … key < a1 … ai <= key < ai+1 key >= ak Internal Nodes • The root must have k 2 pointers • Others must have k P/2 pointers, where P is the order of the B+ tree • Must have k keys and k+1 pointers • Keys are sorted Hashing and Indexing
a1 p1 … ai pi … ak pk p to next Leaf node to data records Leaf Nodes • All external nodes are at the same level. • Must have k P/2 keys, unless it is the only node in the tree. • Keys are sorted • Has a (block pointer) to next leaf node (other pointers can be block or record pointers) Hashing and Indexing
An Example • File: Employees(SSN, Name, Dept, Age, Phone) • Attributes sizes in bytes: SSN (9), Name (25), Dept (4), Age (4), Phone (10). • Block (page) size = 1024 bytes. • # of records: 30,000, packed unspanned. • What is the file size in pages? • Tuple Size = 9+25+4+4+10 = 52 bytes • bfEmployee = 1024 / 52 = 19 record/page (block) • bEmployee = 30,000 / 19 = 1,579 pages (blocks) Hashing and Indexing
Example: B+ Tree Primary Index • Pointer size = 4 bytes • Nodes are 70% full • How big is a B+ tree primary index on SSN? • Order: P = (1024 + 9) / (9+4) = 79 • Average order = 79*.7 =56 pointers • # pointers of internal nodes = 56 • # index entries in leaf node = 55 • # index entries = 1579 (one per page) • # leaf nodes = 1579 / 55 = 29 • # nodes next level = 29 / 56 = 1 Hashing and Indexing
Example: B+ Tree Primary Index • Total # of levels = 2 • Total # of index nodes (pages) = 30 • To answer the query: select * from Employees where SSN=123456789; • # of page I/Os = 3 (2 index pages + 1 data page). Hashing and Indexing
Example: B+ Tree Secondary Index • B+ tree secondary index on Dept • # of distinct values = 1000 • Assume a dense index. • # of index entries = 30000 • Size of index entry = 8 bytes (4-byte Dept + 4-byte pointer) • Order: P = (1024+4) / (4+4) = 128 • Assume nodes are 70% full • internal node has 90 pointers • leaf node has 89 keys. Hashing and Indexing
Example: B+ Tree Secondary Index • # of leaf nodes = 30,000 / 89 = 338 • # of nodes at 2nd level = 338 / 90 = 4 • # of nodes at 3rd level = 4 / 90 = 1 • Total # of levels = 3 • Total # of pages = 343 • # records per distinct value = 30 (each on a different page) • To find all employees of “Dept = x”, # of page I/O = 33 (3 pages of index + 30 pages of data) Hashing and Indexing
Indexes in SQL • Create a secondary index create index Salary_Index on Employees(Salary); • Create an index on a key create unique index SSN_Index on Employees(SSN); • Drop an index drop index Salary_Index; Hashing and Indexing
Summary • Hashing is every efficient, but is effective only when search condition is equality • Indexing is effective for range selection as well as equality selection • Simple indexing is good for small files • ISAM is good if update is infrequent • B+ tree is a dynamic structure. • Inserts/deletes leave B+ tree height-balanced; O(logP N) cost. • Typically has 3 or 4 levels for large files Hashing and Indexing
Summary (Contd.) • Almost always better than maintaining a sorted file (no sorting, no global moving) • Typically, 67%-70% full on average • Usually preferable to ISAM, modulo locking considerations; adjusts to growth gracefully. • Oracle automatically creates index for primary key attribute(s) and unique attribute(s) • It is not possible to specify different types of index or hashing using SQL • Many other types of indexes … Hashing and Indexing
Look Ahead • Next topic: Query Processing and Optimization • Read textbook: • Chapter 18 Hashing and Indexing