CSE 30331 Lecture 16 – Hashing & Tables

CSE 30331 Lecture 16 – Hashing & Tables • Binary Search Tree vs. Hash Table • Hash Function (quick intro) • Collision • Coping with Collisions • Open addressing & linear probing • Chaining with separate lists • Hash Functions • What works • C++ Function Objects • Hash Iterators • Efficiency of Hash Methods

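BST vs Hash Table • Both used to implement Sets & Maps • Binary Search Tree – ordered associative container • O(log N) access on average; the worst case is also O(log N) when the tree is kept balanced (as the STL set and map guarantee), but O(N) for an unbalanced tree • Hash Table – unordered associative container • O(1) access (average case)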

Hash Function • A hash function converts a key into a numeric (unsigned int) table index • Ideal hash functions uniformly distribute keys to all available indices • When two keys hash to the same index a collision occurs • Keys are not in any particular order (numeric, alphabetical, ...) within the table

Example Hash Function • hf(n) = n, the identity function • index = hf(n) % m, where m is the table size

Collision • Given keys p and q and table size m, a collision occurs when hf(p) % m and hf(q) % m produce the same index • Example with m = 7: hf(36) = 36, and 36 % 7 = 1
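The index calculation and the collision test can be sketched in a few lines. This is a minimal illustration using the identity hash from the slide; the helper names tableIndex and collides are hypothetical, and the second key (22) is chosen here only because it also maps to index 1 when m = 7.

```cpp
#include <cassert>

// Identity hash function from the slide: hf(n) = n.
unsigned int hf(unsigned int n) { return n; }

// Table index for key n in a table of size m (m must be nonzero).
unsigned int tableIndex(unsigned int n, unsigned int m) {
    assert(m > 0);
    return hf(n) % m;
}

// Two distinct keys collide when they map to the same table index.
bool collides(unsigned int p, unsigned int q, unsigned int m) {
    return p != q && tableIndex(p, m) == tableIndex(q, m);
}
```

With m = 7, tableIndex(36, 7) is 1 and collides(36, 22, 7) is true, because 22 % 7 is also 1.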

Coping with Collisions • Three primary methods exist for coping with collisions • Rehashing: use same key but different hash function • Linear Probing: examine successive locations (index, index+1, index+2, ...) • Chaining: implement table with separate list at each table[index] location • Note: Except for the last case, the table is a fixed size.

Hash Table Using Linear Probing – Open Addressing

Linear Probing PseudoCode
// insert item into a table of size n, using hashFunc() to calculate the index.
// this assumes no duplicate keys, and some method of indicating
// that a hash table location is empty
int index = hashFunc(item) % n;
int origIndex = index;
do {
    if table[index] is empty
        insert item as table[index] and return
    else if table[index] matches item
        return
    index = (index+1) % n;    // this is the next location to probe
} while (index != origIndex);
throw overflowError;    // if we get here, table is full and does not contain item
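The pseudocode above can be made runnable roughly as follows. This is a sketch under stated assumptions, not the course's implementation: the class name ProbeTable is hypothetical, keys are ints hashed with the identity function, and a parallel vector of flags is the assumed way of marking empty locations.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Fixed-size open-addressing table of ints (hypothetical example class).
class ProbeTable {
public:
    explicit ProbeTable(std::size_t n) : values(n), used(n, false) {
        assert(n > 0);
    }

    // Inserts item unless it is already present; returns the slot index
    // where item lives. Throws std::overflow_error if the table is full
    // and does not contain item.
    std::size_t insert(int item) {
        std::size_t index = hashFunc(item) % values.size();
        const std::size_t origIndex = index;
        do {
            if (!used[index]) {            // empty slot: place item here
                values[index] = item;
                used[index] = true;
                return index;
            }
            if (values[index] == item)     // item already in the table
                return index;
            index = (index + 1) % values.size();  // next location to probe
        } while (index != origIndex);
        throw std::overflow_error("hash table is full");
    }

private:
    // Identity hash (assumption for this sketch).
    static unsigned int hashFunc(int item) {
        return static_cast<unsigned int>(item);
    }
    std::vector<int> values;   // stored keys
    std::vector<bool> used;    // used[i] is true when values[i] holds a key
};
```

In a table of size 7, inserting 36, 22 and 15 in that order places them at indices 1, 2 and 3: all three hash to index 1, so the later keys are pushed to the next free locations.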

Problems with Linear Probing • Clustering of items occurs as number of items approaches size of table • Colliding items fill in gaps between other entries • This forms runs or clusters within the table • Items in the cluster are a mix of items that hash to different indices • Degraded performance results • Long sequences of repeated probes are required to find what is sought

Chaining – Uses Lists or Buckets • Implement the hash table as a vector of lists • Each list (bucket, chain, ...) contains all items that hash to the associated table location • Buckets are not mixed like clusters in linear probing • Table size can grow easily by expanding individual buckets as necessary • The number of buckets stays constant • Within a bucket, items are unordered and must be searched linearly
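A vector of lists is enough to sketch chaining. The class below is an illustrative assumption (the name ChainTable, the int keys, and the identity hash are not from the slides); it shows that colliding items share one bucket and that a lookup searches only that bucket.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hash table of ints using one list (bucket) per table location.
class ChainTable {
public:
    explicit ChainTable(std::size_t nbuckets) : buckets(nbuckets) {
        assert(nbuckets > 0);
    }

    // Appends item to its bucket unless already there; true if added.
    bool insert(int item) {
        std::list<int>& b = buckets[indexOf(item)];
        if (std::find(b.begin(), b.end(), item) != b.end())
            return false;
        b.push_back(item);
        return true;
    }

    // Linear search of the single bucket that item hashes to.
    bool contains(int item) const {
        const std::list<int>& b = buckets[indexOf(item)];
        return std::find(b.begin(), b.end(), item) != b.end();
    }

    // Number of items currently stored in bucket i.
    std::size_t bucketSize(std::size_t i) const { return buckets.at(i).size(); }

private:
    // Identity hash reduced modulo the (constant) number of buckets.
    std::size_t indexOf(int item) const {
        return static_cast<unsigned int>(item) % buckets.size();
    }
    std::vector<std::list<int> > buckets;
};
```

With 7 buckets, keys 36 and 22 both land in bucket 1, which then holds two items while the number of buckets is unchanged.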

Chaining with Separate Lists Example

C++ Function Objects • A function object is an instance of a class that overloads the function call operator, operator() • Function objects are easily passed as parameters to other functions • Commonly used to implement hash functions and comparison operations
template <typename T>
class greaterThan
{
public:
    bool operator() (const T& x, const T& y) const
    { return x > y; }
};

Using a function object • Here is a template function that swaps two parameters only IF the comparison is true
template<typename T, typename Compare>
void swap(T& a, T& b, Compare comp)
{
    if (comp(a,b))
    {
        T temp = a;
        a = b;
        b = temp;
    }
}
• Here is a sample call for int variables x and y; greaterThan is a class template, so the call passes a temporary object of a specific instantiation
swap(x, y, greaterThan<int>());
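The two pieces above can be combined into a small self-contained example. The conditional swap is renamed swapIf here (a hypothetical name) so that it cannot be confused with std::swap, and orderPair is an illustrative helper that is not part of the slides.

```cpp
// Comparison function object: true when x is greater than y.
template <typename T>
class greaterThan {
public:
    bool operator()(const T& x, const T& y) const { return x > y; }
};

// Swaps a and b only if comp(a, b) is true.
template <typename T, typename Compare>
void swapIf(T& a, T& b, Compare comp) {
    if (comp(a, b)) {
        T temp = a;
        a = b;
        b = temp;
    }
}

// Leaves the smaller value in lo and the larger value in hi.
inline void orderPair(int& lo, int& hi) {
    swapIf(lo, hi, greaterThan<int>());  // temporary function object
}
```

Calling orderPair on the pair (9, 2) leaves it as (2, 9); calling it again changes nothing, because the comparison is then false.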

Reasonable Hash Functions • Integer key: Identity function • Good distribution if key or a portion of it is random
class hfIntKey
{
public:
    // returns the key itself as the (unsigned) hash value
    unsigned int operator() (int key) const
    { return key; }
};

Reasonable Hash Functions • Integer key: Midsquare technique • Extracts middle two bytes of 4 byte square of key • Works well with random and non-random keys
class hfMidSq
{
public:
    unsigned int operator() (int key) const
    {
        unsigned int n = key;
        return ((n*n)/256) % 65536;    // 0 .. 2^16-1
    }
};
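A worked value makes the midsquare step concrete. The sketch below assumes a 32-bit unsigned int; midBits is a hypothetical helper that expresses the same extraction with a shift and a mask. For key 1000 the square is 1,000,000 = 0x000F4240, and the middle two bytes are 0x0F42 = 3906.

```cpp
// Midsquare hash: middle two bytes of the 4-byte square of the key.
// Assumes unsigned int is 32 bits wide.
class hfMidSq {
public:
    unsigned int operator()(int key) const {
        unsigned int n = static_cast<unsigned int>(key);
        return ((n * n) / 256) % 65536;   // drop low byte, keep next 16 bits
    }
};

// Same extraction written with bit operations (bits 8 through 23).
inline unsigned int midBits(int key) {
    unsigned int n = static_cast<unsigned int>(key);
    return ((n * n) >> 8) & 0xFFFFu;
}
```

Dividing by 256 discards the low byte of the square and the modulus by 65536 keeps the next 16 bits, which is why the two forms agree.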

Reasonable Hash Functions • String key: string-to-number • Simple function combines the ASCII codes of the string's characters into a single unsigned integer
class hfString
{
public:
    unsigned int operator() (const string& key) const
    {
        const unsigned int prime = 2049982463;
        unsigned int n = 0;    // unsigned, so overflow wraps instead of going negative
        for (size_t i = 0; i < key.length(); i++)
            n = n*8 + (unsigned char) key[i];
        return n % prime;
    }
};

Reasonable Hash Functions • String key: folding • Uses substrings as numbers and combines them by addition or multiplication or … • Example: Sum of the 3 character substrings of a SSN • Assuming no dashes in SSN … • “987654321” → 987+654+321 = 1962
class hfSSN
{
public:
    unsigned int operator() (string ssn) const
    {
        return ( atoi(ssn.substr(0,3).c_str()) +
                 atoi(ssn.substr(3,3).c_str()) +
                 atoi(ssn.substr(6,3).c_str()) );
    }
};
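The folding example can also be written as a loop over the three groups. This variant is only a sketch: the name hfSSNFold is hypothetical, it uses std::stoi (C++11) in place of atoi, and it assumes the input is exactly nine digits with no dashes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Folds a 9-digit SSN into the sum of its three 3-digit groups.
class hfSSNFold {
public:
    unsigned int operator()(const std::string& ssn) const {
        assert(ssn.length() == 9);          // assumption: digits only, no dashes
        unsigned int sum = 0;
        for (std::size_t pos = 0; pos < 9; pos += 3)
            sum += static_cast<unsigned int>(std::stoi(ssn.substr(pos, 3)));
        return sum;
    }
};
```

Because each group is at most 999, the result always lies in 0 .. 2997, so folding by addition spreads keys over a fairly small range of values.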

Hash Class – not in STL • See headers in Ford & Topp include folder • d_hash.h – for the hash table using buckets • d_hashf.h – for hash function object • d_uset.h – for unordered set based on hash class • d_hiter.h – for hash class iterator and const_iterator

Hash Class
template <typename T, typename HashFunc>
class hash
{
public:
    hash (int nbuckets, const HashFunc& hfunc = HashFunc());
    hash (T *first, T *last, int nbuckets, const HashFunc& hfunc = HashFunc());
    bool empty() const;
    int size() const;
    iterator find(const T& item);
    pair<iterator,bool> insert(const T& item);
    int erase(const T& item);
    void erase(iterator pos);
    void erase(iterator first, iterator last);
    iterator begin();
    const_iterator begin() const;
    iterator end();
    const_iterator end() const;
private:
    int numBuckets;               // number of buckets
    vector<list<T> > bucket;      // table is vector of lists
    HashFunc hf;                  // hash function
    int hashtableSize;            // number of elements
};

Hash::find(item)
template <typename T, typename HashFunc>
typename hash<T,HashFunc>::iterator hash<T,HashFunc>::find(const T& item)
{
    // hashIndex selects the bucket that item hashes to
    int hashIndex = int(hf(item) % numBuckets);
    list<T>& myBucket = bucket[hashIndex];
    typename list<T>::iterator bucketIter;    // typename: dependent type
    // traverse list and look for a match with item
    bucketIter = myBucket.begin();
    while (bucketIter != myBucket.end())
    {
        if (*bucketIter == item)    // return iterator to found item
            return iterator(this, hashIndex, bucketIter);
        bucketIter++;
    }
    // did not find item, so return iterator to table end
    return end();
}

Hash::insert(item)
template <typename T, typename HashFunc>
pair<typename hash<T, HashFunc>::iterator, bool>
hash<T, HashFunc>::insert(const T& item)
{
    int hashIndex = int(hf(item) % numBuckets);
    list<T>& myBucket = bucket[hashIndex];
    typename list<T>::iterator bucketIter;    // typename: dependent type
    bool success;
    // search the bucket for item
    bucketIter = myBucket.begin();
    while (bucketIter != myBucket.end())
    {
        if (*bucketIter == item)
            break;    // found the item already in bucket
        else
            bucketIter++;
    }
    if (bucketIter == myBucket.end())
    {
        // not found: insert at the end of the bucket
        bucketIter = myBucket.insert(bucketIter, item);
        success = true;
        hashtableSize++;
    }
    else
        success = false;    // item already in table
    return pair<iterator,bool> (iterator(this,hashIndex,bucketIter), success);
}
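For comparison, the C++11 standard library container std::unordered_set has a similar find/insert interface and also accepts a hash function object. The sketch below uses that standard container, not the Ford & Topp hash class; hfStringSketch, StringSet and addWord are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hash function object in the style of the slides. std::unordered_set
// expects operator() to return std::size_t.
struct hfStringSketch {
    std::size_t operator()(const std::string& key) const {
        std::size_t n = 0;
        for (unsigned char c : key)
            n = n * 8 + c;       // unsigned arithmetic wraps on overflow
        return n;
    }
};

typedef std::unordered_set<std::string, hfStringSketch> StringSet;

// Returns true if word was newly inserted, false if it was already present.
inline bool addWord(StringSet& words, const std::string& word) {
    return words.insert(word).second;   // insert returns pair<iterator,bool>
}
```

As in the insert member above, the bool in the returned pair reports whether the item was actually added, and find returns end() when the item is absent.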

Hash Iterator hIter Referencing Element 22 in Table ht

Determining Performance • The load factor λ = n/m measures the table density, where m = size of table and n = items in table • Open addressing with linear probing: m = size of the vector (the maximum number of items), so λ ≤ 1 • Chaining: m = number of buckets, so λ may exceed 1 • Worst case: all items hash to the same table location or bucket, so a search becomes a linear search and is O(n) • Making the table size prime helps prevent the nonuniform distribution that causes this worst case

Average Case - Chaining • Finding bucket is O(1) – using hash function • Uniform hashing implies each bucket has n/m items • Assuming uniform hash distribution • The ith item was inserted at the end of its bucket when the previous (i-1) items were spread evenly over the m buckets • To find this item takes 1+(i-1)/m comparisons since there are (on average) (i-1)/m items ahead of it in its bucket • Average performance of search for an arbitrary item is the average of the number of comparisons required to find each item in the list
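Carrying out the average described in the last bullet gives a closed form. The derivation below uses only the quantities already defined (n items, m buckets, load factor λ = n/m) and the slide's assumption of uniform hashing.

```latex
\frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right)
  = 1 + \frac{1}{nm}\sum_{i=1}^{n}(i-1)
  = 1 + \frac{1}{nm}\cdot\frac{n(n-1)}{2}
  = 1 + \frac{n-1}{2m}
  \approx 1 + \frac{\lambda}{2}
```

So a successful search under chaining examines about 1 + λ/2 items on average, which is O(1) as long as the load factor is bounded by a constant.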

Efficiency of Hash Methods • Hash table size = m, number of elements in hash table = n, load factor λ = n/m • Table: average probes for a successful search and for an unsuccessful search, by method (open probe, chaining)
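The formulas usually quoted for this comparison are the classical textbook approximations from Knuth's analysis, and the sketch below evaluates them. Treat them as an assumption rather than a transcription of the slide's table: for linear probing, about ½(1 + 1/(1−λ)) probes for a successful search and ½(1 + 1/(1−λ)²) for an unsuccessful one; for chaining, about 1 + λ/2 and λ. The names Probes, linearProbing and chaining are hypothetical.

```cpp
#include <cassert>

// Average probe counts for one collision-resolution method.
struct Probes {
    double successful;
    double unsuccessful;
};

// Classical approximations for open addressing with linear probing.
// Valid only for load factors 0 <= lambda < 1.
Probes linearProbing(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    const double d = 1.0 - lambda;
    return { 0.5 * (1.0 + 1.0 / d), 0.5 * (1.0 + 1.0 / (d * d)) };
}

// Classical approximations for chaining with separate lists
// (unsuccessful search counts list items examined).
Probes chaining(double lambda) {
    assert(lambda >= 0.0);
    return { 1.0 + lambda / 2.0, lambda };
}
```

At λ = 0.5 these give 1.5 and 2.5 probes for linear probing versus 1.25 and 0.5 for chaining, and the linear probing figures grow without bound as λ approaches 1.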

Final Variations • Universal Hashing • Choose hf(n) randomly before execution from a set of hash functions • Prevents the same clustering of collisions each time a given set of data is used in the hash table • Efficiency is more likely to be Θ(1), even for data that is a worst case for one particular hash function • Perfect Hashing • Two tier approach (requires static set of keys) • Uses two hash functions from the universal hf(n) set • Like chaining, but with secondary hash tables instead of chains • Size of each secondary hash table is the square of the number of items hashing to that table using the first hash function • Second hash function is chosen so no collisions occur in each secondary table • Efficiency is guaranteed to be Θ(1)
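One common universal family, assumed here for illustration since the slides do not name one, is h(k) = ((a·k + b) mod p) mod m with p prime and a, b drawn at random once. In this sketch the prime is 2^31 − 1, keys must be smaller than that prime, and the names UniversalHash and kPrime are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Prime modulus for the family; keys must be smaller than this value.
const std::uint64_t kPrime = 2147483647ULL;   // 2^31 - 1

// One randomly chosen member of the family
// h(k) = ((a*k + b) mod p) mod m, with 1 <= a < p and 0 <= b < p.
class UniversalHash {
public:
    UniversalHash(std::uint64_t tableSize, std::mt19937_64& rng)
        : m(tableSize) {
        assert(m > 0);
        a = std::uniform_int_distribution<std::uint64_t>(1, kPrime - 1)(rng);
        b = std::uniform_int_distribution<std::uint64_t>(0, kPrime - 1)(rng);
    }

    // Returns a table index in 0 .. m-1. a*key + b stays below 2^63,
    // so the 64-bit arithmetic cannot overflow.
    std::uint64_t operator()(std::uint32_t key) const {
        assert(key < kPrime);
        return ((a * key + b) % kPrime) % m;
    }

private:
    std::uint64_t m;   // table size
    std::uint64_t a;   // random multiplier
    std::uint64_t b;   // random offset
};
```

Constructing a new UniversalHash from a freshly seeded generator picks a different function, so a data set that clusters badly under one choice is unlikely to cluster the same way under the next.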

Summary • Hash Table • Simulates the fastest searching technique: knowing the index of the required value in a vector or array and using that index to access the value. A hash function makes this possible by converting the data to an integer • The table index is the remainder obtained by dividing the hash function value by the table size; that index is then used to access the table. Normally, the number of elements in the table is much smaller than the number of distinct data values, so collisions occur. • To handle collisions, we must place a value that collides with an existing table element into the table in such a way that we can efficiently access it later. • Average running time for a search of a hash table is Θ(1) • Worst case is Θ(n)

Summary • Collision Resolution • Linear open probe addressing • The table is a vector or array of static size • After using the hash function to compute a table index, look up the entry in the table. • If the values match, perform an update if necessary. • If the table entry is empty, insert the value in the table. • Otherwise, probe successive locations (index+1, index+2, ..., wrapping around at the end of the table) until a match or an empty entry is found.

Summary • Collision Resolution (Cont…) • Chaining with separate lists. • The hash table is a vector of list objects • Each list is a sequence of colliding items. • After applying the hash function to compute the table index, search the list for the data value. • If it is found, update its value; otherwise, insert the value at the back of the list. • You search only items that collided at the same table location • There is no limitation on the number of values in the table, and deleting an item from the table involves only erasing it from its corresponding list
