
Data Structures Notes

by

Farhan Sufyan
Contents

1 Hashing
1.1 Direct Addressing or Direct Access Table
1.1.1 Advantages
1.1.2 Limitations
1.2 Hashing
1.2.1 Hash Function and Hash Table
1.2.2 Complexity of Operations in Hash Table
1.2.3 Limitations of Hash Function
1.2.4 Types of Hash Function
1.3 Collision Resolution Techniques
1.3.1 Separate Chaining (Open Hashing or Closed Addressing)
1.3.2 Open Addressing (Closed Hashing)
1.3.2.1 Linear Probing
1.3.2.2 Quadratic Probing
1.3.2.3 Double Hashing
1.3.3 Comparison of Probing Techniques
1.4 Separate Chaining vs Open Addressing
Chapter 1

Hashing

1.1 Direct Addressing or Direct Access Table


• Searching algorithms such as linear search and binary search depend on the number of elements and involve many comparisons.
• Suppose we have n records with unique keys in the range 0 to (n − 1). We can take an array of size n and store each record in that array so that the key and the array index are the same.

• For example: suppose we have to store the records of 15 students, where each record consists of a roll number and the name of the student. The roll numbers are in the range 0 to 14 and are taken as keys. The record with key (roll number) 0 is stored at array index 0, and so on. Hence, we can access any record in constant time and no comparison is involved.
• The above method is called Direct Addressing or key-indexed search, but it is useful only when the set of possible key values is small.
• A Direct Address Table is a data structure that maps records to their corresponding keys using arrays. In direct address tables, records are placed using their key values directly as indexes. They facilitate fast searching, insertion and deletion operations (a small code sketch follows this list).

• Consider the case where we have to store 500 employees of a company, with the mobile number taken as the key to identify each individual. A mobile number can range from 0000000000 to 9999999999, so the set of possible key values is much larger than the number of keys.
• In terms of time complexity this solution is the best of all: we can do every operation in O(1) time. However, the problem with this solution is that the extra space required is huge.

• Due to the above limitations, a Direct Access Table cannot always be used. Hashing is a solution that can be used in almost all such situations and, in practice, performs extremely well compared to data structures such as arrays, linked lists and balanced BSTs.
• With hashing we get O(1) search time on average (under reasonable assumptions) and O(n) in the worst case.

• Hashing can overcome these limitations of direct address tables.
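A minimal sketch of a direct address table, assuming roll numbers 0 to 14 as in the student example above (the stored name strings are made up for illustration):

class DirectAddressTable:
    """Direct address table: the key itself is used as the array index."""

    def __init__(self, max_keys):
        # One slot per possible key; unused slots stay None.
        self.slots = [None] * max_keys

    def insert(self, key, value):   # O(1)
        self.slots[key] = value

    def search(self, key):          # O(1)
        return self.slots[key]

    def delete(self, key):          # O(1)
        self.slots[key] = None


students = DirectAddressTable(15)              # roll numbers 0..14
students.insert(0, "student with roll no 0")
students.insert(7, "student with roll no 7")
print(students.search(7))                      # found in constant time, no comparisons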

1.1.1 Advantages
• Searching in O(1) Time: Direct address tables use arrays, which are random-access data structures, so the key values (which are also the indexes of the array) can be used directly to search the records in O(1) time.
• Insertion in O(1) Time: We can insert an element into an array in O(1) time. The same holds for a direct address table.

• Deletion in O(1) Time: Deletion of an element takes O(1) time in an array. Similarly, deleting an element from a direct address table takes O(1) time.

1.1.2 Limitations
• Prior knowledge of the maximum key value is required.
• It is practically useful only if the maximum key value is fairly small.
• It wastes memory space if there is a significant difference between the total number of records and the maximum key value.

1.2 Hashing
• Hashing is an improvement over the Direct Access Table. The idea is to use a hash function that converts a given phone number, or any other key, to a smaller number and uses that smaller number as an index into a table called the hash table.
• Hashing is a searching technique that takes constant time: the expected time complexity of hashing is O(1). So far, we have seen two searching techniques, linear search and binary search, whose worst-case time complexities are O(n) and O(log n) respectively.
• In both of those techniques the search time depends on the number of elements, but we want a technique that takes constant time. Hashing provides this.
• The hashing technique uses a hash table and a hash function.

• Using the hash function, we can calculate the address at which a value is stored.
• The main idea behind hashing is to create (key, value) pairs. If the key is given, the algorithm computes the index at which the value is stored. This can be written as:
hash(key) = index
where hash is the hash function, key is the key and index is the hash value of the key. The key is then stored at array index index, also known as its home address.
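As a minimal sketch of this idea (the table size of 10 and the use of the division method, described later in these notes, are assumptions for illustration):

TABLE_SIZE = 10
table = [None] * TABLE_SIZE        # the hash table

def hash_index(key):
    # Division method: map the key to an index in [0, TABLE_SIZE - 1].
    return key % TABLE_SIZE

key, value = 42, "some record"
table[hash_index(key)] = value     # 42 hashes to index 2, its home address
print(table[hash_index(42)])       # retrieved in constant time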

1.2.1 Hash Function and Hash Table
• A hash table is a data structure that stores key-value pairs. The key is passed to a hash function that performs arithmetic operations on it to produce the hash value (or hash), which is the index of the key-value pair in the hash table. The process of generating an address from a key is called hashing, and the array in which insertion, deletion and searching are performed through hashing is called the hash table.
• A hash function is a function that can be used to map data of arbitrary size to fixed-size values. The fixed-size values are usually used to index a fixed-size table called a hash table.
• The values returned by a hash function are called hash values, hash codes, message digests, or simply
hashes.
• Keys may be of different types, such as strings or integers, but the hash value is always an integer.

• To achieve a good hashing mechanism, it is important to have a good hash function with the following basic requirements:
– Easy to compute: It should be easy to compute and must not become an algorithm in itself.
– Uniform distribution: It should provide a uniform distribution across the hash table and should not
result in clustering.
– Less collisions: Collisions occur when pairs of elements are mapped to the same hash value. These
should be avoided.

If a hash function maps many strings to the same index, searching will take O(n) time (where n is the number of strings) to access a specific string. Such a hash function is not a good hash function.
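As a small illustration (the two functions below are hypothetical examples, not taken from the original notes):

# A poor hash function for strings: it uses only the first character,
# so "apple", "avocado" and "apricot" all collide at the same index.
def bad_hash(s, table_size):
    return ord(s[0]) % table_size

# A simple but better hash function: every character contributes,
# so keys spread more uniformly across the table.
def better_hash(s, table_size):
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % table_size
    return h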

By using a good hash function, hashing can work well. Under reasonable assumptions, the average time
required to search for an element in a hash table is O(1).

1.2.2 Complexity of Operations in Hash Table

With a good hash function, searching, insertion and deletion in a hash table take O(1) time on average. In the worst case, when many keys collide into the same slot, each of these operations can degrade to O(n) time.
1.2.3 Limitations of Hash Function


• Since a hash function maps a big key to a small number, there is a possibility that two keys result in the same value. The situation where a newly inserted key maps to an already occupied slot in the hash table is called a collision.
• When the hash function generates the same index for multiple keys, there is a conflict about which value should be stored at that index. This is called a hash collision.

1.2.4 Types of Hash Function
There are three common ways of calculating the hash function (a small sketch of each follows this list):
• Folding method

• Mid square method

• Division method (Modulo-Division)
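The worked examples of these methods appear as figures in the original notes. The sketch below shows one plausible version of each method; the digit and part sizes chosen are assumptions for illustration:

def division_hash(key, table_size):
    # Division (modulo-division) method: remainder of key divided by the table size.
    return key % table_size

def mid_square_hash(key, table_size, num_digits=2):
    # Mid-square method: square the key and take the middle digits of the result.
    squared = str(key * key)
    mid = len(squared) // 2
    start = max(0, mid - num_digits // 2)
    return int(squared[start:start + num_digits]) % table_size

def folding_hash(key, table_size, part_len=2):
    # Folding method: split the key's digits into equal-sized parts and add the parts.
    digits = str(key)
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    return sum(parts) % table_size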

1.3 Collision Resolution Techniques
In hashing, when two or more keys map to the same hash table index, this is called a collision. Irrespective of how good a hash function is, collisions are bound to occur. Therefore, to maintain the performance of a hash table, it is important to manage collisions through various collision resolution techniques.
• Separate Chaining (Open Hashing or Closed Addressing)
• Open Addressing (Closed Hashing)

– Linear Probing
– Quadratic Probing
– Double Hashing

1.3.1 Separate Chaining (Open Hashing or Closed Addressing)
• Open hashing is a collision resolution method that uses an array of linked lists to resolve collisions.
• In separate chaining, each element of the hash table is a linked list.
• It is also known as the separate chaining method (each linked list is considered as a chain).

• Chaining is simple, but requires additional memory outside the table.

• For separate chaining, the worst-case scenario is when all the entries are inserted into the same linked list.
The lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number
(N) of entries in the table.

• Let us consider a simple hash function as key mod 7 and sequence of keys as 50, 700, 76, 85, 92, 73, 101.
• Draw an empty hash table. For the given hash function, the possible range of hash values is [0, 6]. So, draw
an empty hash table consisting of 7 buckets

• The first key to be inserted in the hash table is 50. The bucket to which key 50 maps is given by 50 mod 7 = 1, so key 50 is inserted in bucket-1 of the hash table.
• Key 700 maps to 700 mod 7 = 0, so key 700 is inserted in bucket-0 of the hash table. Key 76 maps to 76 mod 7 = 6, so key 76 is inserted in bucket-6.
• Key 85 maps to 85 mod 7 = 1. Since bucket-1 is already occupied, a collision occurs. Separate chaining handles the collision by appending to the linked list (chain) of bucket-1, so key 85 is inserted in bucket-1, and so on for the remaining keys (a code sketch reproducing this example follows this list).
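A minimal sketch of separate chaining, reproducing the key mod 7 example above (Python lists stand in for the linked lists; that substitution is an implementation convenience, not part of the notes):

class ChainedHashTable:
    def __init__(self, size=7):
        # Each bucket holds a chain of the keys that hash to that index.
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _hash(self, key):
        return key % self.size

    def insert(self, key):
        self.buckets[self._hash(key)].append(key)

    def search(self, key):
        # Worst case scans the whole chain: O(length of chain).
        return key in self.buckets[self._hash(key)]


table = ChainedHashTable()
for key in (50, 700, 76, 85, 92, 73, 101):
    table.insert(key)
print(table.buckets)
# [[700], [50, 85, 92], [], [73, 101], [], [], [76]] -- bucket-1 chains 50, 85 and 92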

• Advantages
– Simple to implement.
– Hash table never fills up, we can always add more elements to the chain.
– Less sensitive to the hash function or load factors.
– It is mostly used when it is unknown how many and how frequently keys may be inserted or deleted.
• Disadvantages
– Cache performance of chaining is poor because keys are stored in linked lists.
– Wastage of Space (Some parts of hash table are never used).
– If the chain becomes long, then search time can become O(n) in the worst case.
– Uses extra space for links.

1.3.2 Open Addressing (Closed Hashing)
• Unlike chaining, open addressing does not insert elements into any other data structure; it inserts the data into the hash table itself.
• The size of the hash table should be larger than or at least equal to the number of keys, i.e., it holds at most one element per bucket.

• In open addressing, all elements are stored in the hash table itself.
• When a new entry has to be inserted, the hash index is computed from the key and the array is examined starting at that index. If the slot at the hashed index is unoccupied, the entry is inserted there; otherwise the search proceeds in some probe sequence until an unoccupied slot is found.

• The probe sequence is the sequence that is followed while traversing through entries. In different probe
sequences, you can have different intervals between successive entry slots or probes.
• When searching for an entry, the array is scanned in the same sequence until either the target element is
found or an unused slot is found.

• The name open addressing refers to the fact that the location or address of an item is not determined solely by its hash value.

1.3.2.1 Linear Probing


• When a collision occurs in linear probing, we keep probing linearly, one bucket at a time, until an empty bucket is found.

• The probing sequence for linear probing is:

Step 1 - Calculate the home index:
index = key % hashTableSize
Step 2 - If hashTable[index] is empty, store the value directly: hashTable[index] = key.
Step 3 - If that slot already holds some value, probe the following slots, each offset from the home index:
(index + 1) % hashTableSize
(index + 2) % hashTableSize
(index + 3) % hashTableSize
and so on until an empty bucket is found.

• The formula for linear probing can be written as:

H(key, i) = (hash(key) + i) mod hashTableSize
where i = 0, 1, ....., (hashTableSize − 1)
• The search for an empty location proceeds in the sequence
H(key, 0), H(key, 1), H(key, 2), H(key, 3), ....., i.e., hash(key), hash(key) + 1, hash(key) + 2, ..... (all taken mod hashTableSize). A code sketch is given at the end of this subsection.

• Example
Initial Hash Table

Insert 13

Insert 1

Insert 6

1 % 5 = 1
6 % 5 = 1
Both 1 and 6 map to the same index under modulo 5, so 6 is placed in arr[2], the next available index.

Insert 11

1 % 5 = 1
6 % 5 = 1
11 % 5 = 1
Keys 1, 6 and 11 all map to the same index under modulo 5, so 11 is placed in arr[4], the next available index.
Insert 10

Insert 15

As a further exercise, consider a simple hash function key mod 7 and the sequence of keys 50, 700, 76, 85, 92, 73, 101, and trace the insertions using linear probing.

• Advantages

– Easy to compute.
– Achieves good cache performance since the probing sequence is linear in memory
• Disadvantages
– Linear probing tends to create long runs of occupied buckets. An existing run acts as a "net" that catches many of the new keys, which are then appended to the run and make the problem worse.
– Keys end up stored far from their home addresses, which is bad for performance since the average number of buckets scanned during insertion and search increases. This phenomenon is called primary clustering, or just clustering.
– The worst-case time to search for an element with linear probing is O(N), where N is the size of the table.
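A minimal sketch of insertion and search with linear probing, reproducing the size-5 example above (None marks an empty slot; raising an error when the table is full is an assumption of this sketch):

class LinearProbingHashTable:
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size

    def insert(self, key):
        # Probe hash(key), hash(key)+1, hash(key)+2, ... modulo the table size.
        for i in range(self.size):
            index = (key + i) % self.size
            if self.slots[index] is None:
                self.slots[index] = key
                return index
        raise RuntimeError("hash table is full")

    def search(self, key):
        # Follow the same probe sequence until the key or an empty slot is found.
        for i in range(self.size):
            index = (key + i) % self.size
            if self.slots[index] is None:
                return -1
            if self.slots[index] == key:
                return index
        return -1


table = LinearProbingHashTable(5)
for key in (13, 1, 6, 11, 10):
    table.insert(key)
print(table.slots)      # [10, 1, 6, 13, 11]
print(table.search(11)) # 4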

1.3.2.2 Quadratic Probing
• In quadratic probing, the problem of primary clustering is reduced by storing the colliding keys farther away from the initial collision point.
• Quadratic probing is an open-addressing scheme in which, if the given hash value collides in the hash table, we look for the slot that is i² positions away in the i-th iteration.

• The probing sequence for quadratic probing is:

Step 1 - Calculate the home index:
index = key % hashTableSize
Step 2 - If hashTable[index] is empty, store the value directly: hashTable[index] = key.
Step 3 - If that slot already holds some value, probe the following slots, each offset from the home index by i², where i = 1, 2, ..., (hashTableSize − 1):
(index + 1²) % hashTableSize = (index + 1) % hashTableSize
(index + 2²) % hashTableSize = (index + 4) % hashTableSize
(index + 3²) % hashTableSize = (index + 9) % hashTableSize
and so on until an empty bucket is found.

• The formula for quadratic probing can be written as:

H(key, i) = (hash(key) + i²) mod hashTableSize
where i = 0, 1, ....., (hashTableSize − 1)
• The search for an empty location proceeds in the sequence:
H(key, 0), H(key, 1), H(key, 2), H(key, 3), ....., i.e., hash(key), hash(key) + 1, hash(key) + 4, hash(key) + 9, ..... (all taken mod hashTableSize).
• Advantages - This creates larger and larger gaps in the search sequence and avoids primary clustering.
• Disadvantage - If one key hashes to the same bucket as another key, the search sequence for the second key
will go in the footsteps of the first one. If this happens repeatedly (for example due to a poorly implemented
hash function) long chains will still form, and cause performance to degrade. The phenomenon is called
secondary clustering.

• Example
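The worked example appears as a figure in the original notes. As a small sketch (reusing the key mod 7 sequence from the earlier examples; that choice of keys is an assumption), here is quadratic-probing insertion:

def quadratic_probe_insert(table, key):
    # Try (hash(key) + i*i) % size for i = 0, 1, 2, ...
    # Note: quadratic probing can fail to find a free slot even when the
    # table is not completely full; this sketch simply raises an error then.
    size = len(table)
    home = key % size
    for i in range(size):
        index = (home + i * i) % size
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("no empty slot found")


table = [None] * 7
for key in (50, 700, 76, 85, 92, 73, 101):
    quadratic_probe_insert(table, key)
print(table)   # [700, 50, 85, 73, 101, 92, 76]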

1.3.2.3 Double Hashing
• Double hashing uses the idea of applying a second hash function to key when a collision occurs.
• The first (primary) hash function is:
hash(key) = key % hashTableSize

• A popular second (secondary) hash function is:

hash′(key) = PRIME − (key % PRIME)
where PRIME is a prime number smaller than hashTableSize.
• The secondary hash function should have several characteristics:

– It should never yield an index of zero


– It should cycle through the whole table
– It should be very fast to compute

The formula for double hashing can be written as:

H(key, i) = (hash(key) + i · hash′(key)) mod hashTableSize
where i = 0, 1, ....., (hashTableSize − 1)

• The search for an empty location proceeds in the sequence:

hash(key), hash(key) + hash′(key), hash(key) + 2 · hash′(key), hash(key) + 3 · hash′(key), ..... (all taken mod hashTableSize). A code sketch is given at the end of this subsection.

• Advantage
– It does not suffer from any clustering.
– Although the computational cost may be high, double hashing can find the next free slot faster than the
linear probing approach.
• Disadvantages
– It requires more computation time as two hash functions need to be computed.
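A minimal sketch of double-hashing insertion, using the secondary hash hash′(key) = PRIME − (key % PRIME) from above with PRIME = 5 and a table of size 7 (the key sequence reuses the mod 7 example; these parameter choices are assumptions for illustration):

PRIME = 5   # a prime smaller than the table size

def double_hash_insert(table, key):
    # Probe (hash(key) + i * hash2(key)) % size for i = 0, 1, 2, ...
    size = len(table)
    h1 = key % size
    h2 = PRIME - (key % PRIME)   # secondary hash: always between 1 and PRIME, never zero
    for i in range(size):
        index = (h1 + i * h2) % size
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("no empty slot found")


table = [None] * 7
for key in (50, 700, 76, 85, 92, 73, 101):
    double_hash_insert(table, key)
print(table)   # [700, 50, 101, 92, 85, 73, 76]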

1.3.3 Comparison of Probing Techniques


• Linear Probing has the best cache performance but suffers from primary clustering. Linear probing is also
easy to compute.
• Quadratic probing lies between the two in terms of cache performance and secondary clustering.

• Double hashing has poor cache performance but no clustering. It requires more computation time, as two hash functions need to be computed.

1.4 Separate Chaining vs Open Addressing

• In separate chaining, colliding keys are stored outside the table in linked lists (chains); in open addressing, all keys are stored in the hash table itself.
• A chained hash table never fills up, while an open-addressed table holds at most one element per bucket, so its size must be at least the number of keys.
• Chaining uses extra space for links and has poorer cache performance; open addressing (especially linear probing) has better cache performance but suffers from clustering and is more sensitive to the hash function and load factor.
