Hashing
Hashing
by
Farhan Sufyan
Contents
1 Hashing 1
1.1 Direct Addressing or Direct Access Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Hash Function and Hash Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Complexity of Opertaions in Hash Table . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Limitations of Hash Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.4 Types of Hash Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Collision Resolution Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Separate Chaining (Open Hashing or Closed Addressing) . . . . . . . . . . . . . . . . . . 9
1.3.2 Open Addressing (Closed Hashing) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2.1 Linear Probing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2.2 Quadratic Probing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.2.3 Double Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3.3 Comparison of Probing Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4 Separate Chaining vs Open Addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 1
Hashing
• For example: Suppose we have to store the records of 15 students and each record consist of roll number and
name of the student. The roll number are in range 0 to 14 and they are taken as keys. The record with key
(roll number) 0 is stored at array index 0 and so on. Hence, we can access any record in constant time and no
comparison is involved.
• The above method is called Direct Addressing or Key-indexed search, but it is only useful only when the
set of possible key value is small.
• Direct Address Table is a data structure that has the capability of mapping records to their corresponding
keys using arrays. In direct address tables, records are placed using their key values directly as indexes. They
facilitate fast searching, insertion and deletion operations.
• Consider the case where we have to store 500 employees ofcompany and their mobile number is taken as key
to identify the individual. The mobile number can be from 0000000000 to 9999999999, so here the set of
possible key value is much more than the number of key.
• Time complexity wise this solution is the best among all, we can do all operations in O(1) time. However,
problem with this solution is extra space required is huge.
1
• Due to above limitations Direct Access Table cannot always be used. Hashing is the solution that can be
used in almost all such situations and performs extremely well compared to above data structures like Array,
Linked List, Balanced BST in practice.
• With hashing we get O(1) search time on average (under reasonable assumptions) and O(n) in worst case.
1.1.1 Advantages
• Searching in O(1) Time: Direct address tables use arrays which are random access data structure, so, the
key values (which are also the index of the array) can be easily used to search the records in O(1) time.
• Insertion in O(1) Time: We can easily insert an element in an array in O(1) time. The same thing follows
in a direct address table also.
• Deletion in O(1) Time: Deletion of an element takes O(1) time in an array. Similarly, to delete an element
in a direct address table we need O(1) time.
1.1.2 Limitations
• Prior knowledge of maximum key value
• Practically useful only if the maximum value is very less.
• It causes wastage of memory space if there is a significant difference between total records and maximum
value.
1.2 Hashing
• Hashing is an improvement over Direct Access Table. The idea is to use hash function that converts a given
phone number or any other key to a smaller number and uses the small number as index in a table called
hash table.
• Hashing is one of the searching techniques that uses a constant time. The time complexity in hashing is O(1).
Till now, we read the two techniques for searching, i.e., linear search and binary search. The worst time
complexity in linear search is O(n), and O(log n) in binary search.
• In both the searching techniques, the searching depends upon the number of elements but we want the
technique that takes a constant time. So, hashing technique came that provides a constant time.
• In Hashing technique, the hash table and hash function are used.
• Using the hash function, we can calculate the address at which the value can be stored.
• The main idea behind the hashing is to create the (key/value) pairs. If the key is given, then the algorithm
computes the index at which the value would be stored. It can be written as:
hash(key) = index
where hash is a hash function, key is the key and index is the hash value of key key. Now the key key is stored
at array index index, also known as it home address.
2
1.2.1 Hash Function and Hash Table
• A hash table is a type of data structure that stores key-value pairs. The key is sent to a hash function that
performs arithmetic operations on it to produce the hash value or hash, which is the index of the key-value
pair in the hash table. Hence, the process of generating the address from keys is called hashing and the array
in which insertion, deletion and searching perform through hashing is called hash table.
• A hash function is a function that can be used to map data of arbitrary size to fixed-size values . The
fixed-size values are usually used to index a fixed-size table called a hash table.
• The values returned by a hash function are called hash values, hash codes, message digests, or simply
hashes.
• Keys may be of different types like string, integer, etc but hash value will always be integer.
3
• To achieve a good hashing mechanism, It is important to have a good hash function with the following basic
requirements:
– Easy to compute: It should be easy to compute and must not become an algorithm in itself.
– Uniform distribution: It should provide a uniform distribution across the hash table and should not
result in clustering.
– Less collisions: Collisions occur when pairs of elements are mapped to the same hash value. These
should be avoided.
Here, searching will take O(n) time (where n is the number of strings) to access a specific string. This shows
that the hash function is not a good hash function.
By using a good hash function, hashing can work well. Under reasonable assumptions, the average time
required to search for an element in a hash table is O(1).
4
1.2.2 Complexity of Opertaions in Hash Table
5
1.2.4 Types of Hash Function
There are three ways of calculating the hash function:
• Folding method
6
• Division method (Modulo-Division)
7
1.3 Collision Resolution Techniques
While hashing, two or more key points to the same hash table index is called as collision. Irrespective of how
good a hash function is, collisions are bound to occur. Therefore, to maintain the performance of a hash table, it is
important to manage collisions through various collision resolution techniques.
• Separate Chaining (Open Hashing or Closed Addressing)
• Open Addressing (Closed Hashing)
– Linear Probing
– Quadratic Probing
– Double Hashing
8
1.3.1 Separate Chaining (Open Hashing or Closed Addressing)
• Open hashing is a collision avoidance method which uses array of linked list to resolve the collision.
• In separate chaining, each element of the hash table is a linked list.
• It is also known as the separate chaining method (each linked list is considered as a chain).
• For separate chaining, the worst-case scenario is when all the entries are inserted into the same linked list.
The lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number
(N) of entries in the table.
9
• Let us consider a simple hash function as key mod 7 and sequence of keys as 50, 700, 76, 85, 92, 73, 101.
• Draw an empty hash table. For the given hash function, the possible range of hash values is [0, 6]. So, draw
an empty hash table consisting of 7 buckets
• The first key to be inserted in the hash table is 50. Bucket of the hash table to which key 50 maps as
50 mod 7 = 1. So, key 50 will be inserted in bucket-1 of the hash table.
• Key 700 maps as 700 mod 7 = 0. So, key 700 will be inserted in bucket-0 of the hash table. Key 76 maps as
76 mod 7 = 6. So, key 76 will be inserted in bucket-6 of the hash table.
• Key 85 maps as 85 mod 7 = 1. Since bucket-1 is already occupied, so collision occurs. Separate chaining
handles the collision by creating a linked list to bucket-1. So, key 85 will be inserted in bucket-1 of the hash
table and so on.
• Advantages
– Simple to implement.
– Hash table never fills up, we can always add more elements to the chain.
– Less sensitive to the hash function or load factors.
– It is mostly used when it is unknown how many and how frequently keys may be inserted or deleted.
• Disadvantages
– Cache performance of chaining is not good as keys are stored using a linked list.
– Wastage of Space (Some parts of hash table are never used).
– If the chain becomes long, then search time can become O(n) in the worst case.
– Uses extra space for links.
10
1.3.2 Open Addressing (Closed Hashing)
• Unlike chaining, it does not insert elements to some other data-structures. It inserts the data into the hash
table itself.
• The size of the hash table should be larger or atleast equal to the number of keys i.e. it has at most one
element per bucket.
• In open addressing, all elements are stored in the hash table itself.
• When a new entry has to be inserted, the hash index of the hashed value is computed and then the array is
examined (starting with the hashed index). If the slot at the hashed index is unoccupied, then the entry record
is inserted in slot at the hashed index else it proceeds in some probe sequence until it finds an unoccupied
slot.
• The probe sequence is the sequence that is followed while traversing through entries. In different probe
sequences, you can have different intervals between successive entry slots or probes.
• When searching for an entry, the array is scanned in the same sequence until either the target element is
found or an unused slot is found.
• The name open addressing refers to the fact that the location or address of the item is not determined by its
hash value.
11
• Example
Initial Hash Table
Insert 13
12
Insert 1
Insert 6
1%5 = 1
6%5 = 1
Both 1 and 6 points the same index under modulo 5. So that we have placed 6 in arr[2] which is next available
index.
13
Insert 11
1%5 = 1
6%5 = 1
11%5 = 1
1, 6 and 11 points the same index under modulo 5. So that we have placed 11 in arr[4] which is next available
index.
Insert 10
14
Insert 15
15
Let us consider a simple hash function as key mod 7 and a sequence of keys as 50, 700, 76, 85, 92, 73, 101.
• Advantages
– Easy to compute.
– Achieves good cache performance since the probing sequence is linear in memory
• Disadvantages
– A problem tends to create long sequences of occupied buckets. The existing chain will act as a "net"
and catch many of the new keys, which will be appended to the chain and exacerbate the problem.
– Key is storing far away from its home address, which is bad for performance since the average number
of buckets scanned during insert and lookup or search increases. The phenomenon is called primary
clustering or just clustering.
– Worst time to search an element in linear probing is O(N), where N is the size of the table.
16
1.3.2.2 Quadratic Probing
• In Quadratic Probing, the problem of primary clustering is solved by solved by storing the colliding keys
away from the initial collision point.
• Quadratic probing is an open-addressing scheme where we look for i2 -th slot in it h iteration if the given hash
value x collides in the hash table.
17
• Example
18
1.3.2.3 Double Hashing
• Double hashing uses the idea of applying a second hash function to key when a collision occurs.
• First hash function is
hash(key) or index = key%hashTableSize
19
• Advantage
– It does not suffer from any clustering.
– Although the computational cost may be high, double hashing can find the next free slot faster than the
linear probing approach.
• Disadvantages
– It requires more computation time as two hash functions need to be computed.
• Double caching has poor cache performance but no clustering. Double hashing requires more computation
time as two hash functions need to be computed.
20
1.4 Separate Chaining vs Open Addressing
21