Tuesday, May 17, 2011

Everything about hashtables

At heart, a hashtable is an array. Hashtables are very useful data structures when you need a quick lookup operation. They exploit the fact that arrays provide lookup by index in O(1) time: a hashtable stores its entries in an array, and, given a key to store or search for, it internally converts the key into an index into that array, then performs the store or lookup by index in O(1) time. This conversion of a key into an index is called hashing; in Java, the first step of it is handled by the hashCode() method inherited from the Object class.
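The key-to-index step above can be sketched in a few lines. This is a minimal illustration, assuming the common "hashcode mod capacity" scheme; the class and method names are made up for the example.

```java
// Sketch of how a hashtable maps a key to an array index, assuming the
// common "hashcode mod capacity" scheme. Names here are illustrative.
public class IndexDemo {
    static int indexFor(Object key, int capacity) {
        // Math.floorMod keeps the result non-negative even when
        // hashCode() returns a negative int.
        return Math.floorMod(key.hashCode(), capacity);
    }

    public static void main(String[] args) {
        int capacity = 16;
        // The index is always within 0..capacity-1, so it is a valid array slot.
        System.out.println(indexFor("apple", capacity));
        System.out.println(indexFor("banana", capacity));
    }
}
```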

The catch is that not every key can be converted into a unique index. For example, two different strings may hash to the same index; this phenomenon is called a collision. There are several ways to handle collisions: (1) chaining, i.e. the hashtable as an array of linked lists, and (2) open addressing.
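Collisions are easy to demonstrate in Java: the strings "Aa" and "BB" are a classic pair with identical hash codes, so in any table they would compete for the same slot.

```java
// Two distinct strings can produce the same hash code, which means they
// map to the same slot: a collision. "Aa" and "BB" are a classic Java example.
public class CollisionDemo {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
    }
}
```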

(1) Hashtable as array of linked lists (chaining):
This method in fact weakens the main advantage of hashtables, the O(1) lookup time, because searching a linked list takes O(n) in the worst case. Hashtables are created with an initial capacity that depends on the expected number of data values and on the number of values we can afford to keep in one linked list on average (the load factor):

Capacity = expected number of data values / load factor

If the number of data values inserted grows beyond the expected number, the hashtable is resized and all values are rehashed and reinserted. Note that the index of each value may change when the capacity changes: indices are computed as hashcode mod m, where m is the capacity, which restricts them to the range 0 to m-1. In Java's Hashtable, this resizing is performed by the rehash() method.
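The chaining-plus-resize scheme described above can be sketched as follows. This is a minimal illustration, not the JDK's implementation; the class name, the load factor of 0.75, and the doubling policy are all assumptions for the example.

```java
import java.util.LinkedList;

// Minimal sketch of a chained hashtable with resizing, assuming a load
// factor of 0.75 and capacity doubling. Names are illustrative, not the JDK's.
public class ChainedTable<K, V> {
    private static final double LOAD_FACTOR = 0.75;
    private LinkedList<Entry<K, V>>[] buckets;
    private int size = 0;

    private static class Entry<K, V> {
        final K key; V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    public ChainedTable(int capacity) {
        buckets = new LinkedList[capacity];
    }

    private int indexFor(K key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity);
    }

    public void put(K key, V value) {
        if (size + 1 > LOAD_FACTOR * buckets.length) rehash();
        int i = indexFor(key, buckets.length);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; } // update in place
        }
        buckets[i].add(new Entry<>(key, value));
        size++;
    }

    public V get(K key) {
        LinkedList<Entry<K, V>> chain = buckets[indexFor(key, buckets.length)];
        if (chain != null)
            for (Entry<K, V> e : chain)
                if (e.key.equals(key)) return e.value;
        return null; // absent
    }

    // Doubling the capacity changes "mod m", so every entry must be rehashed.
    @SuppressWarnings("unchecked")
    private void rehash() {
        LinkedList<Entry<K, V>>[] old = buckets;
        buckets = new LinkedList[old.length * 2];
        size = 0;
        for (LinkedList<Entry<K, V>> chain : old)
            if (chain != null)
                for (Entry<K, V> e : chain) put(e.key, e.value);
    }
}
```

Note how rehash() reinserts every entry through put(): each one recomputes its index against the new capacity, which is exactly the "all values are rehashed and reinserted" step described above.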

(2) Open addressing.
Open addressing is the idea of storing all the keys in the hashtable's array itself instead of in linked lists. This saves the memory that chaining spends on list pointers; in exchange, the table itself must have enough spare slots to hold every key.
  • How open addressing works: there are several probing schemes. Linear and quadratic probing use a linear or quadratic function of the probe number to generate a sequence of slots; for example, linear probing uses h(k, i) = (h'(k) + i) mod m, which starts at h'(k) and steps forward one slot per probe. The problem is that any two keys with the same initial value h'(k) follow exactly the same probe sequence.
  • Another alternative is double hashing, where the probe sequence is driven by two hash functions, for example h(k, i) = (h1(k) + i·h2(k)) mod m. Since the sequence depends on two hash functions, two keys are much less likely to share the same sequence, resulting in better search and insertion times.
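The double-hashing probe sequence above can be made concrete with a small sketch. The particular choices of h1 and h2 here are illustrative assumptions; the important properties are that h2 never returns zero (so every probe moves) and that a prime capacity lets the sequence visit every slot.

```java
// Sketch of double hashing: the i-th probe for key k is
// h(k, i) = (h1(k) + i * h2(k)) mod m. The h1/h2 below are illustrative
// choices, not canonical; h2 is kept in [1, m-1] so probes always advance.
public class DoubleHashing {
    static int h1(int k, int m) { return Math.floorMod(k, m); }
    static int h2(int k, int m) { return 1 + Math.floorMod(k, m - 1); }

    // Slot tried on the i-th probe for key k in a table of capacity m.
    static int probe(int k, int i, int m) {
        return Math.floorMod(h1(k, m) + i * h2(k, m), m);
    }

    public static void main(String[] args) {
        int m = 11; // a prime capacity helps the sequence cover all slots
        for (int i = 0; i < 4; i++)
            System.out.println(probe(42, i, m)); // prints 9, 1, 4, 7
    }
}
```

For key 42 and m = 11, h1 = 9 and h2 = 3, so the probe sequence is 9, 1, 4, 7, ... and a second key with the same h1 but a different h2 would diverge after the first probe.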
How to avoid O(n) lookup time when using linked lists.
The O(n) lookup time of a hashtable built as an array of linked lists can be avoided by perfect hashing. Before we learn about perfect hashing, we need to learn about universal hashing.

Universal Hashing. For a fixed hash function, an adversary can easily choose a set of keys that all hash to the same bucket. To overcome this problem, universal hashing is performed: the hash function is chosen randomly from a family of hash functions before any data is inserted. For example, if m is the number of slots, the hash function could be of the form h(k) = (a·k + b) mod m, where a and b are chosen randomly before insertion (the standard textbook construction first reduces modulo a prime p larger than any key, then modulo m). The idea behind universal hashing is that, by choosing the hash function randomly, the probability of picking one that is bad for the actual keys, causing undesirable collisions, is low.
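A sketch of such a family, following the textbook form h(k) = ((a·k + b) mod p) mod m: the prime p, the class name, and the assumption that keys are non-negative and smaller than p are all illustrative choices, not part of the original text.

```java
import java.util.Random;

// Sketch of a universal hash family: h(k) = ((a*k + b) mod p) mod m.
// a and b are drawn at random ONCE, before any data is inserted.
// Assumes keys are non-negative and smaller than the prime p.
public class UniversalHash {
    private final long a, b, p;
    private final int m;

    public UniversalHash(int m, long p, Random rnd) {
        this.m = m;
        this.p = p;
        this.a = 1 + Math.floorMod(rnd.nextLong(), p - 1); // a in [1, p-1]
        this.b = Math.floorMod(rnd.nextLong(), p);         // b in [0, p-1]
    }

    public int hash(long key) {
        return (int) (Math.floorMod(a * key + b, p) % m);
    }

    public static void main(String[] args) {
        // p = 2^31 - 1, a Mersenne prime larger than any int key.
        UniversalHash h = new UniversalHash(16, 2_147_483_647L, new Random());
        System.out.println(h.hash(12345)); // some slot in 0..15
    }
}
```

An adversary who fixes their keys in advance cannot target this function, because a and b are not known until after the keys are chosen.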


Perfect Hashing. In perfect hashing, the linear-time search of linked lists on collision is reduced or avoided by using a hashtable of hashtables. Both the primary and the secondary hashtables use universal hashing to choose their hash functions: the primary hashtable has one randomly chosen hash function, and each secondary hashtable has its own randomly chosen hash function.


Idea behind perfect hashing. It has been proved that if n is the number of keys and m is the number of slots in a hashtable, then with m = n^2 and a hash function chosen randomly from a universal family, the probability that any two keys collide in this hashtable is less than 1/2.
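The bound can be checked with a short calculation. Under a universal family, each of the C(n, 2) pairs of distinct keys collides with probability 1/m, so the expected number of colliding pairs is C(n, 2)/m; with m = n^2 this is (n-1)/(2n), which stays below 1/2 for every n, and by Markov's inequality a randomly chosen function is collision-free with probability greater than 1/2.

```java
// Expected number of colliding pairs under a universal family:
// C(n,2) / m. With m = n^2 this equals (n-1)/(2n), always below 1/2.
public class PerfectBound {
    public static void main(String[] args) {
        for (int n : new int[]{10, 100, 1000}) {
            double m = (double) n * n;
            double expectedPairs = (n * (n - 1) / 2.0) / m;
            System.out.println("n=" + n + " -> " + expectedPairs); // < 0.5
        }
    }
}
```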


Why we need two levels of hashtables in perfect hashing. Since a hashtable of capacity n^2 is not affordable, we use secondary hashtables instead. Because the primary level uses universal hashing, the number of elements colliding into any one secondary hashtable is low, so we can afford to give each secondary hashtable Sj a capacity of nj^2, where nj is the number of elements hashing to Sj.


What will be the memory requirement for perfect hashing? It has been proven that, when both the primary and secondary hashtables use universal hashing and each secondary hashtable has capacity nj^2, the expected total number of slots required for perfect hashing is bounded above by 2n.
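Putting the pieces together, a static two-level table over a fixed set of distinct integer keys might look like the sketch below. The hash family, the prime P, and every name are assumptions for illustration; each secondary table of size nj^2 is simply re-seeded until it is collision-free, which the bound above says takes only a few attempts on average (the primary-level re-seeding that enforces the 2n space bound is omitted for brevity).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of static two-level perfect hashing over distinct integer keys.
// Hash functions come from the ((a*k + b) mod P) mod size family; all names
// are illustrative. Assumes keys are non-negative and smaller than P.
public class PerfectHashTable {
    private static final long P = 2_147_483_647L; // prime larger than any key
    private final Random rnd = new Random(42);
    private final int m;           // number of primary buckets (= number of keys)
    private long a1, b1;           // primary hash parameters
    private final long[] a2, b2;   // secondary hash parameters, one pair per bucket
    private final Long[][] tables; // secondary tables of size nj * nj (null = empty)

    private int hash(long a, long b, long key, int size) {
        return (int) (Math.floorMod(a * key + b, P) % size);
    }

    private long randA() { return 1 + Math.floorMod(rnd.nextLong(), P - 1); }
    private long randB() { return Math.floorMod(rnd.nextLong(), P); }

    public PerfectHashTable(long[] keys) {
        m = keys.length;
        a2 = new long[m]; b2 = new long[m];
        tables = new Long[m][];
        a1 = randA(); b1 = randB();

        // Group the keys by primary bucket.
        List<List<Long>> buckets = new ArrayList<>();
        for (int i = 0; i < m; i++) buckets.add(new ArrayList<>());
        for (long k : keys) buckets.get(hash(a1, b1, k, m)).add(k);

        // Build a collision-free secondary table of size nj^2 for each bucket,
        // re-seeding the secondary hash function until it is injective.
        for (int i = 0; i < m; i++) {
            int nj = buckets.get(i).size();
            if (nj == 0) continue;
            int size = nj * nj;
            while (true) {
                a2[i] = randA(); b2[i] = randB();
                Long[] t = new Long[size];
                boolean ok = true;
                for (long k : buckets.get(i)) {
                    int slot = hash(a2[i], b2[i], k, size);
                    if (t[slot] != null) { ok = false; break; } // collision: retry
                    t[slot] = k;
                }
                if (ok) { tables[i] = t; break; }
            }
        }
    }

    // Two array lookups, no chain traversal: worst-case O(1) search.
    public boolean contains(long key) {
        int i = hash(a1, b1, key, m);
        if (tables[i] == null) return false;
        Long slot = tables[i][hash(a2[i], b2[i], key, tables[i].length)];
        return slot != null && slot == key;
    }

    public static void main(String[] args) {
        PerfectHashTable t = new PerfectHashTable(new long[]{3, 17, 42, 99, 1000});
        System.out.println(t.contains(42)); // true
        System.out.println(t.contains(7));  // false
    }
}
```

Because every secondary table is collision-free by construction, a lookup is exactly two hash evaluations and two array reads, which is the worst-case O(1) search that motivated the whole construction.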


Note. Collisions in the secondary hashtables have low probability, but when they do occur they still need to be resolved, for example by re-choosing the secondary hash function or by falling back to linked lists.