Dictionaries and Hash Tables

Chapter 2.5: Dictionaries and Hash Tables
[Figure: a five-slot hash table. Slot 1 holds 025-612-0001, slot 2 holds 981-101-0002, slot 4 holds 451-229-0004; slots 0 and 3 are empty.]

Dictionary ADT (§2.5.1)
• The dictionary ADT models a searchable collection of key-element items
• The main operations of a dictionary are searching, inserting, and deleting items
• Multiple items with the same key are allowed
• Applications:
  • address book
  • credit card authorization
  • mapping host names (e.g., cs16.net) to internet addresses (e.g., 128.148.34.101)
• Dictionary ADT methods:
  • findElement(k): if the dictionary has an item with key k, returns its element; otherwise returns the special element NO_SUCH_KEY
  • insertItem(k, o): inserts item (k, o) into the dictionary
  • removeElement(k): if the dictionary has an item with key k, removes it from the dictionary and returns its element; otherwise returns the special element NO_SUCH_KEY
  • size(), isEmpty()
  • keys(), elements()

Log File (§2.5.1)
• A log file is a dictionary implemented by means of an unsorted sequence
  • We store the items of the dictionary in a sequence (based on a doubly-linked list or a circular array), in arbitrary order
• Performance:
  • insertItem takes O(1) time since we can insert the new item at the beginning or at the end of the sequence
  • findElement and removeElement take O(n) time since in the worst case (the item is not found) we traverse the entire sequence to look for an item with the given key
• The log file is effective only for dictionaries of small size, or for dictionaries on which insertions are the most common operations while searches and removals are rarely performed (e.g., historical record of logins to a workstation)

Lookup Table
• A lookup table is a dictionary implemented with a sorted sequence
  • We store the items of the dictionary in an array-based sequence, sorted by key
  • We use an external comparator for the keys
• Performance:
  • findElement takes O(log n) time, using binary search
  • insertItem takes O(n) time since in the worst case we have to shift O(n) items to make room for the new item
  • removeElement takes O(n) time since in the worst case we have to shift O(n) items to compact the items after the removal
• Effective for small dictionaries or for dictionaries where searches are common but inserts and deletes are rare (e.g., credit card authorizations)
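A minimal sketch of the lookup table, assuming Python's built-in ordering of keys stands in for the external comparator. The class name, the snake_case method names, and the `NO_SUCH_KEY` sentinel object are illustrative choices, not part of the slides.

```python
import bisect

NO_SUCH_KEY = object()  # hypothetical sentinel standing in for the slides' NO_SUCH_KEY


class LookupTable:
    """Dictionary kept as two parallel lists, sorted by key."""

    def __init__(self):
        self._keys = []
        self._elements = []

    def find_element(self, k):
        i = bisect.bisect_left(self._keys, k)      # O(log n) binary search
        if i < len(self._keys) and self._keys[i] == k:
            return self._elements[i]
        return NO_SUCH_KEY

    def insert_item(self, k, o):
        i = bisect.bisect_right(self._keys, k)     # O(log n) to locate the position
        self._keys.insert(i, k)                    # O(n) shift to make room
        self._elements.insert(i, o)

    def remove_element(self, k):
        i = bisect.bisect_left(self._keys, k)
        if i < len(self._keys) and self._keys[i] == k:
            self._keys.pop(i)                      # O(n) shift to close the gap
            return self._elements.pop(i)
        return NO_SUCH_KEY
```

The binary search is cheap; the list `insert` and `pop` calls are where the O(n) shifting cost shows up.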

Hashing (§2.5.2)
• Application: word occurrence statistics
  • Operations: insert, find
• Dictionary: insert, delete, find
• Are O(log n) comparisons necessary? (no)
• Hashing basic plan:
  • create a big array for the items to be stored
  • use a function to figure out storage location from key (hash function)
  • a collision resolution scheme is necessary

Hash Table Example
• Simple hash function:
  • Treat the key as a large integer K
  • h(K) = K mod M, where M is the table size
  • let M be a prime number to better scramble the result
• Example:
  • Suppose we have 101 buckets in the hash table.
  • ‘abcd’ in hex is 0x61626364
  • Converted to decimal it’s 1633837924
  • 1633837924 % 101 = 11
  • Thus h(‘abcd’) = 11. Store the key at location 11.
  • ‘dcba’ hashes to 57.
  • ‘abbc’ also hashes to 57: a collision. What to do?
• If you have billions of possible keys and hundreds of buckets, collisions are unavoidable!
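The example above can be checked with a few lines of Python. This is only a sketch: the function name `simple_hash` is made up for illustration, and it assumes ASCII keys packed big-endian, which is what reading ‘abcd’ as 0x61626364 implies.

```python
def simple_hash(key: str, m: int = 101) -> int:
    """Treat the key's bytes as one big-endian integer K and return K mod m."""
    k = int.from_bytes(key.encode("ascii"), "big")
    return k % m


for word in ("abcd", "dcba", "abbc"):
    print(word, simple_hash(word))
# prints:
# abcd 11
# dcba 57
# abbc 57
```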

Hash Functions (§2.5.3)
• A hash function is usually specified as the composition of two functions:
  • Hash code map: $h_1 : \text{keys} \to \text{integers}$
  • Compression map: $h_2 : \text{integers} \to [0, N-1]$
• The hash code map is applied first, and the compression map is applied next on the result, i.e., $h(x) = h_2(h_1(x))$
• The goal of the hash function is to “disperse” the keys in an apparently random way

Hash Code Maps (§2.5.3)
• Memory address: interpret the memory address of the key as an integer
• Integer cast: interpret the bits of the key as an integer (for short keys)
• Component sum: partition the bits of the key into chunks (e.g., 16 or 32 bits) and sum, ignoring overflows (for long keys)
• Polynomial accumulation: like component sum, but multiply each term by $1, z, z^2, z^3, \ldots$
  • $p(z) = a_0 + a_1 z + a_2 z^2 + \cdots + a_{n-1} z^{n-1}$ at a fixed value $z$, ignoring overflows
  • Can be evaluated in O(n) time using Horner’s rule; each term is computed from the previous in O(1) time:
    $p_0(z) = a_{n-1}$
    $p_i(z) = a_{n-i-1} + z\,p_{i-1}(z) \quad (i = 1, 2, \ldots, n-1)$
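Polynomial accumulation with Horner's rule can be sketched as follows. The choice z = 33, the 32-bit mask used to model "ignoring overflows", and the function names are assumptions made for illustration; the slides do not fix any of them.

```python
MASK32 = 0xFFFFFFFF  # keep only the low 32 bits, i.e. "ignore overflows"


def poly_hash_code(key: str, z: int = 33) -> int:
    """Evaluate a_0 + a_1*z + ... + a_{n-1}*z^(n-1) by Horner's rule, modulo 2^32."""
    h = 0
    for ch in reversed(key):             # first step yields a_{n-1}, matching p_0(z)
        h = (ord(ch) + z * h) & MASK32   # p_i(z) = a_{n-i-1} + z * p_{i-1}(z)
    return h


def poly_hash_direct(key: str, z: int = 33) -> int:
    """Same polynomial evaluated term by term; used only as a cross-check."""
    return sum(ord(ch) * z**i for i, ch in enumerate(key)) & MASK32
```

Unlike a plain component sum, the polynomial gives different codes to permutations of the same characters: "ab" and "ba" hash to different values here.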

Compression Maps (§2.5.4)
• Division:
  • $h_2(y) = y \bmod N$
  • The size N of the hash table is usually chosen to be a prime to better scatter hash values
• Multiply, Add and Divide (MAD):
  • $h_2(y) = (ay + b) \bmod N$
  • a and b are nonnegative integers such that $a \bmod N \neq 0$
  • Otherwise, every integer would map to the same value b
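Both compression maps are one-liners. The sketch below uses hypothetical function names, and the concrete values of a, b, and N are arbitrary picks. It also shows the degenerate case the slide warns about.

```python
def compress_division(y: int, n: int) -> int:
    """Division method: h2(y) = y mod N."""
    return y % n


def compress_mad(y: int, n: int, a: int, b: int) -> int:
    """MAD method: h2(y) = (a*y + b) mod N, rejecting the degenerate a mod N == 0."""
    if a % n == 0:
        raise ValueError("a mod N must be nonzero, or every y maps to b mod N")
    return (a * y + b) % n


# Degenerate case: with N = 13 and a = 26 (a multiple of N), the a*y term vanishes.
print([(26 * y + 4) % 13 for y in range(5)])   # prints [4, 4, 4, 4, 4]
```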

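Hashing Strings
• h(‘aVeryLongVariableName’)?
• Horner’s method example (characters ‘a’ = 97, ‘V’ = 86, ‘e’ = 101, ‘r’ = 114):
  • (256 * 97 + 86) % 101 = 24918 % 101 = 72
  • (256 * 72 + 101) % 101 = 18533 % 101 = 50
  • (256 * 50 + 114) % 101 = 12914 % 101 = 87
• Further scramble by replacing 256 with 117

    /* Horner's-rule string hash: M is the table size, 117 the multiplier. */
    int hash(const char *v, int M)
    {
        int h, a = 117;
        for (h = 0; *v; v++)
            h = (a * h + (unsigned char)*v) % M;  /* cast keeps bytes >= 128 nonnegative */
        return h;
    }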

Collision Resolution: Chaining
• Build a linked list for each bucket
• Linear search within each list
• Simple, practical, widely used
• Cuts search time by a factor of M over sequential search
• But, requires extra memory beyond table (linked lists)
[Figure: a five-bucket table. Bucket 1 holds 025-612-0001; bucket 4 holds a chain of 451-229-0004 and 981-101-0004; buckets 0, 2, and 3 are empty.]

Chaining 2
• Insertion time?
  • O(1)
• Average search cost, successful search?
  • O(N/2M)
• Average search cost, unsuccessful?
  • O(N/M)
• M large: CONSTANT average search time
• Worst case: N (“probabilistically unlikely”)
• Keep lists sorted?
  • insert time O(N/2M)
  • unsuccessful search time O(N/2M)
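A minimal separate-chaining sketch follows. It assumes Python lists play the role of the per-bucket linked lists and uses the built-in `hash()` as the hash code; the class and method names are illustrative only.

```python
class ChainedHashTable:
    """Separate chaining: one list per bucket stands in for the linked list."""

    def __init__(self, m=101):
        self._buckets = [[] for _ in range(m)]
        self._n = 0

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def insert(self, key, value):
        # O(1): append without scanning (the dictionary ADT allows duplicate keys).
        self._bucket(key).append((key, value))
        self._n += 1

    def find(self, key):
        # Linear search within a single chain, about N/M items long on average.
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (k, v) in enumerate(bucket):
            if k == key:
                del bucket[i]        # "unlink" the item from its chain
                self._n -= 1
                return v
        return None

    def load(self):
        """Average chain length N/M."""
        return self._n / len(self._buckets)
```

With a small table the collisions are easy to see: in a five-bucket table the integer keys 1 and 6 land in the same chain, and both remain findable.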

Linear Probing (§2.5.5)
• Or, we could keep everything in the same table
• Insert: upon collision, search for a free spot
• Search: same (if you find a free spot, fail)
• Runtime?
  • Still O(1) if table is sparse
  • But: as table fills, clustering occurs
  • Skipping c spots doesn’t help…
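Insert and search with linear probing can be sketched as below. As a simplification that the slides do not prescribe, inserting an existing key overwrites its value, and the table never grows; names are illustrative.

```python
class LinearProbingTable:
    """Open addressing with linear probing; None marks a free slot."""

    def __init__(self, m=11):
        self._slots = [None] * m

    def _probe(self, key):
        """Yield slot indices from the home slot onward, wrapping around once."""
        m = len(self._slots)
        start = hash(key) % m
        for i in range(m):
            yield (start + i) % m

    def insert(self, key, value):
        """Store the pair in the first free (or matching) slot and return its index."""
        for j in self._probe(key):
            if self._slots[j] is None or self._slots[j][0] == key:
                self._slots[j] = (key, value)
                return j
        raise RuntimeError("table is full")

    def find(self, key):
        for j in self._probe(key):
            if self._slots[j] is None:       # reached a free slot: the key is absent
                return None
            if self._slots[j][0] == key:
                return self._slots[j][1]
        return None
```

Inserting the integer keys 5, 16, and 27 into an 11-slot table shows a cluster forming: all three have home slot 5 and end up in slots 5, 6, and 7, so a later key with home slot 6 is pushed to slot 8.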

Clustering
• Long clusters tend to get longer
• Precise analysis difficult
• Theorem (Knuth):
  • Insert cost: approx. $\frac{1}{2}\left(1 + \frac{1}{(1 - N/M)^2}\right)$
    (50% full → 2.5 probes; 80% full → 13 probes)
  • Search (hit) cost: approx. $\frac{1}{2}\left(1 + \frac{1}{1 - N/M}\right)$
    (50% full → 1.5 probes; 80% full → 3 probes)
  • Search (miss): same as insert
• Too slow when table gets 70-80% full
• How to reduce/avoid clustering?
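The probe counts quoted for linear probing follow directly from the two approximations. The short check below writes the load factor N/M as `alpha`; the function names are made up for illustration.

```python
def probes_insert(alpha: float) -> float:
    """Approximate probes for an insert or a search miss under linear probing."""
    return (1 + 1 / (1 - alpha) ** 2) / 2


def probes_hit(alpha: float) -> float:
    """Approximate probes for a successful search under linear probing."""
    return (1 + 1 / (1 - alpha)) / 2


for alpha in (0.5, 0.8):
    print(f"{alpha:.0%} full: insert {probes_insert(alpha):.1f}, hit {probes_hit(alpha):.1f}")
# prints:
# 50% full: insert 2.5, hit 1.5
# 80% full: insert 13.0, hit 3.0
```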

Double Hashing
• Use a second hash function to compute the increment sequence
• Analysis extremely difficult
• About like ideal (random probe)
• Theorem (Guibas-Szemerédi):
  • Insert: approx. $\frac{1}{1 - N/M}$
  • Search hit: approx. $\frac{\ln\left(1/(1 - N/M)\right)}{N/M}$
  • Search miss: same as insert
• Not too slow until the table is about 90% full

Example of Double Hashing
• Consider a hash table storing integer keys that handles collisions with double hashing
  • N = 13
  • h(k) = k mod 13
  • d(k) = 7 - k mod 7
• Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order
[Figure: the resulting 13-slot table. Slot 0: 31, slot 2: 41, slot 5: 18, slot 6: 32, slot 7: 59, slot 8: 73, slot 9: 22, slot 10: 44; slots 1, 3, 4, 11, and 12 are empty.]
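The insertion sequence of this example can be replayed with a short sketch. The function name and the use of `None` for an empty slot are illustrative; h and d are taken from the slide, and the j-th probe is assumed to go to (h(k) + j·d(k)) mod N.

```python
def double_hash_insert(table, k):
    """Insert integer key k using h(k) = k mod N and step d(k) = 7 - k mod 7."""
    n = len(table)
    h = k % n
    d = 7 - k % 7                 # always in 1..7, so the step is never zero
    for j in range(n):
        i = (h + j * d) % n       # j-th probe
        if table[i] is None:
            table[i] = k
            return i
    raise RuntimeError("no free slot found")


table = [None] * 13
for key in (18, 41, 22, 44, 59, 32, 31, 73):
    double_hash_insert(table, key)

print(table)
# prints [31, None, 41, None, None, 18, 32, 59, 73, 22, 44, None, None]
```

Only two keys collide: 44 finds slot 5 taken and steps by d(44) = 5 to slot 10, while 31 finds slots 5 and 9 taken and steps by d(31) = 4 to slot 0.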

Dynamic Hash Tables
• Suppose you are making a symbol table for a compiler. How big should you make the hash table?
• If you don’t know in advance how big a table to make, what to do?
• Could grow the table when it “fills” (e.g., 50% full)
  • Make a new table of twice the size.
  • Make a new hash function
  • Re-hash all of the items into the new table
  • Dispose of the old table

Table Growing Analysis
• Worst case insertion: Θ(n), to re-hash all items
• Can we make any better statements?
• Average case?
  • O(1), since insertions n through 2n cost O(n) (on average) for insertions and O(2n) (on average) for rehashing → O(n) total (with 3x the constant)
• Amortized analysis?
  • The result above is actually an amortized result for the rehashing.
  • Any sequence of j insertions into an empty table has O(j) average cost for insertions and O(2j) for rehashing.
  • Or, think of it as billing 3 time units for each insertion, storing 2 in the bank. Withdraw them later for rehashing.
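The "3 time units per insertion" accounting can be checked by counting work in a simulated doubling table. This sketch makes several assumptions: an initial size of 4, growth just before the table would pass 50% full, one unit per insertion, and one unit per re-hashed item. No real table is built; only the counters are tracked.

```python
def count_growth_cost(n_inserts, initial_size=4):
    """Return (insert_units, rehash_units) for n_inserts into a doubling table."""
    size, count = initial_size, 0
    insert_units = rehash_units = 0
    for _ in range(n_inserts):
        if count + 1 > size // 2:     # the table would pass 50% full: grow first
            size *= 2
            rehash_units += count     # every stored item is re-hashed once
        count += 1
        insert_units += 1
    return insert_units, rehash_units


print(count_growth_cost(1000))   # prints (1000, 1022)
```

Under these assumptions the rehashing total stays below twice the number of insertions, so the combined cost stays below 3 units per insertion.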

Chaining vs. Double Hashing
• Assume the same amount of space for keys, links (use pointers for long or variable-length keys)
• Chaining:
  • 1M buckets, 4M keys
  • 4M links in nodes
  • 9M words total; avg search time 2
• Double hashing in same space:
  • 4M items, 9M buckets in table
  • average search time: 1/(1-4/9) = 1.8: 10% faster
• Double hashing in same time:
  • 4M items, average search time 2
  • space needed: 8M words (1/(1-4/8) = 2) (11% less space)

Deletion
• How to implement delete() with separate chaining?
  • Simply unlink unwanted item
  • Runtime?
    • Same as search()
• How to implement delete() with linear probing?
  • Can’t just erase it. (Why not?)
  • Re-hash entire cluster
  • Or mark as deleted?
• How to delete() with double hashing?
  • Re-hashing cluster doesn’t work: which “cluster”?
  • Mark as deleted
  • Every so often re-hash entire table to prune “dead-wood”
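The "mark as deleted" option can be sketched with a tombstone value. The sketch assumes linear probing over a fixed-size table of bare keys, and the `DELETED` sentinel and class name are hypothetical. Erasing a slot outright would cut the probe path to any key stored beyond it, which is why the slot is marked instead.

```python
DELETED = object()   # hypothetical tombstone marking a removed entry


class ProbingTableWithDelete:
    """Linear-probing set of keys with tombstone deletion."""

    def __init__(self, m=11):
        self._slots = [None] * m

    def _probe(self, key):
        m = len(self._slots)
        start = hash(key) % m
        return ((start + i) % m for i in range(m))

    def insert(self, key):
        for j in self._probe(key):
            if self._slots[j] is None or self._slots[j] is DELETED:
                self._slots[j] = key          # tombstones may be reused
                return
        raise RuntimeError("table is full")

    def contains(self, key):
        for j in self._probe(key):
            if self._slots[j] is None:        # only a truly empty slot ends the search
                return False
            if self._slots[j] == key:         # a tombstone never equals a key
                return True
        return False

    def delete(self, key):
        for j in self._probe(key):
            if self._slots[j] is None:
                return False
            if self._slots[j] == key:
                self._slots[j] = DELETED      # mark, do not erase
                return True
        return False
```

With keys 5, 16, and 27 sharing home slot 5 in an 11-slot table, deleting 16 leaves a tombstone in slot 6, and a search for 27 still walks past it to slot 7.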

Comparisons
• Separate chaining advantages:
  • Runtime degrades gracefully if you overfill it
  • No large chunks of memory needed
• Why use hashing?
  • Fastest dictionary implementation
  • Constant time search and insert, on average
  • Easy to implement; built into many environments
• Why not use hashing?
  • No performance guarantees
  • Uses more space than a filled array
  • Doesn’t support pred, succ, sort, etc.: no notion of order
• Where did Perl “hashes” get their name?

Hashing Summary
• Separate chaining: easiest to deploy
• Linear probing: fastest (but takes more memory)
• Double hashing: least memory (but takes more time to compute the second hash function)
• Dynamic (grow): handles any number of inserts at < 3x time
• Curious use of hashing: early Unix spell checker (back in the days of 3M machines…)

            Construction                  Search miss
N       Chain  Probe   Dbl  Grow     Chain  Probe   Dbl  Grow
5k          1      4     4     3         1      0     1     0
50k        18     11    12    22        15      8     8     8
100k       35     21    23    47        45     23    21    15
190k       79    106    59   155       144   2194   261    30
200k       84      -     -   159       156      -     -    33

File tamper test
• Problem: you want to guarantee that a file hasn’t been tampered with. How?

Password verification
• Problem: you run a website. You need to verify people’s logins. How?
• Possible techniques?
• Ethics?

Cache filenames
• Problem: caching FlexScores

    $outfileName = $outDirectory . "/" . $cfg->hymn . "-" . $cfg->instrument;
    // Custom scores get the first 6 hex digits of the MD5 of their configuration XML appended
    if ($custom)
        $outfileName .= "-" . substr(md5($cfg->asXML()), 0, 6);

Hash Tables in C#
• System.Collections: basic data structures
  • Dictionary
  • ArrayList
• (See handout)

Babble
• You want to generate natural-sounding random text. How?

Google Translate

Babble
• Use statistics: what word tends to come after each other word or phrase?
• He a loose tongues, and saw the day's dinner does not mean quite bloated Mr. Barsad.
• "I hope so, sir," pleaded the abashed Mr. Cruncher, who was murdered," said one.
• "There's all manner of gestures while he spoke, as if in some sense of better things, Mr. Carton!"
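One way to collect the "what word comes next" statistics is a hash table that maps each word to the list of words seen after it. The sketch below assumes single-word context and whitespace tokenization; the function names are invented for illustration, and the slides do not say how the sample sentences were actually produced.

```python
import random
from collections import defaultdict


def build_followers(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    followers = defaultdict(list)        # hash table keyed by word
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)   # repeats keep the observed frequencies
    return followers


def babble(followers, start, length, rng=random):
    """Random walk: repeatedly pick one of the recorded followers of the last word."""
    out = [start]
    for _ in range(length - 1):
        choices = followers.get(out[-1])
        if not choices:                  # dead end: this word was never followed by anything
            break
        out.append(rng.choice(choices))
    return " ".join(out)
```

Calling `babble(build_followers(novel_text), "He", 20)` on a long text (here `novel_text` is a placeholder for whatever corpus is loaded) produces output in which every adjacent word pair occurs somewhere in the source, which is why the result sounds locally plausible but globally nonsensical.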

GUI Programming in C#
• Microsoft has provided a number of different GUI programming environments over the years:
  • MFC: Microsoft Foundation Classes
    • C++, legacy
  • Windows Forms
    • .NET wrapper for access to native Windows interface
  • WPF: Windows Presentation Foundation
    • .NET, managed code, primarily C# and VB
    • XML file defines user interface
    • Better support for media, animation, etc.
