Hash table

From Citizendium
Revision as of 11:56, 16 January 2008 by imported>Pat Palmer (collision definied in paragraph preceding the section on it)
Jump to navigation Jump to search
This article is developing and not approved.
Main Article
Discussion
Related Articles  [?]
Bibliography  [?]
External Links  [?]
Citable Version  [?]
 
This editable Main Article is under development and subject to a disclaimer.

In computer science, a hash table is an unordered, dictionary-like data structure that provides very efficient insertion, deletion or lookup of elements. A hash table operates on pairs of keys and values; when a programmer presents a key, the hash table returns the corresponding value.

Hash tables are often used to implement dictionaries or sets. Internet routers usually use a hash table to correlate ranges of IP addresses to route paths.

How a hash table works

A hash table is typically implemented as an array in which value elements are stored, with a value's address within that array computed by feeding the key to the hash function. Any type of data can be used as key or value elements, and the key and value elements may be the same. The only requirement is the existence of a hash function which maps keys to hash values (almost always word-sized integers). For efficiency, the hash function should look in some sense like it maps each key to a random hash value; see universal hash function for an example formalization of this intuition.

For instance, if we wanted to use a hash table to store telephone numbers, indexed against the owner's name, we would have strings of names as key elements, and strings of phone numbers as value elements. Our hash function would, given a string, produce an integer. We would then store or retrieve a reference to that person's phone number at that index in the array of phone numbers. For the cost of executing the hash function once, we go directly to the phone number, avoiding costly search algorithms that must check multiple elements to find the one in question. As a result, insertion, deletion, or lookup of an element in a hash table takes only O(1), which means that the time necessary to perform these actions does not grow with the number of elements in the table.

A good hash function should result in distributing the computed addresses fairly evenly over the range of addresses in the array. However, there can be more than one key that yield the same address, a situation known as a collision.

Hash collisions

Hash tables execute on real computers, so the array of value elements must be finite. Generally, hash algorithms will compute the hash of each key element modulo the size of the array. As a result, if N elements are added to a hash table that has an array of M elements, and if N > M, then by the pigeonhole principle, some elements must be assigned to the same array indices. This situation is called a hash collision, since two keys collide on a particular array element. A hash function that produces no collisions is called a perfect hash function.

A programmer can minimize the occurrence of hash collisions by carefully selecting a hash function. Such a selection must be made against the type of the key elements, the size of the hash table, and typical use of the application. In all but a few special situations, it is impossible to totally avoid hash collisions.

Several variants of hash tables exist which handle hash collisions in different ways.

  1. separate chaining (also known as bucket hashing)---by far, the simplest---stores an array of linked lists of value elements. If two keys have the same hash value, then they are both inserted into the list. In the expected case, when each list contains no more than one element, then insertions, deletions and lookups are still O(1). If, on the other hand, the hash function has poor distribution, some lists may grow very long. In that case, insertions, deletions and lookups slow to O(n), where n is the length of the longest list.
  2. linear probing attempts to store the element in the next available array index, tested in sequential order (for instance, the nth, the n+1th, the n+2th, and so on). In that sense, linear probing is an extension to the hash function. Linear probing has two major problems: first, it tends to worsen the distribution of the hash function by creating clusters of occupied array indices, and second it gives the hash table a finite size since the number of array indices does not increase. In the worst case insertions, deletions and lookups slow to O(n), where n is the length of the longest cluster.
  3. quadratic probing is very similar to linear probing, except a different sequence is used to determine next available array index. Instead of testing them sequentially, quadratic probing tests the nth, the nth+1, the nth+2, then nth+4, the nth+9, and so on. Unlike linear probing, quadratic probing does not tend to produce clusters.
  4. double hashing stores a hash table of hash tables. This method is most effective when two orthogonal hash functions can be found, and is thus limited by problem domain.