0% found this document useful (0 votes)
100 views

Hashing and Indexing

Hashing is an indexing technique that maps keys to values in a hash table using a hash function. This allows for fast lookup of values in constant time. However, collisions may occur when different keys map to the same location. Open hashing resolves collisions using separate chaining by storing values in linked lists, while closed hashing uses techniques like linear probing to find alternate locations in the table. A good hash function provides a uniform distribution of keys and is easy to compute. Hashing is commonly used to index genomes, proteins, and count character frequencies in strings.

Uploaded by

Ayesha Khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
100 views

Hashing and Indexing

Hashing is an indexing technique that maps keys to values in a hash table using a hash function. This allows for fast lookup of values in constant time. However, collisions may occur when different keys map to the same location. Open hashing resolves collisions using separate chaining by storing values in linked lists, while closed hashing uses techniques like linear probing to find alternate locations in the table. A good hash function provides a uniform distribution of keys and is easy to compute. Hashing is commonly used to index genomes, proteins, and count character frequencies in strings.

Uploaded by

Ayesha Khan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 28

Hashing and Indexing

Zoya Khalid
[email protected]
Indexing a text (a genome, etc)

• Example 1: we want to index a genome such that we can look up


any k-mer along the genome in O(1) time (without scanning the
whole genome).

• Example 2: we want to index a protein database such that we can


look up all the proteins containing a word (k-mer) in constant
time.
Hashing

• Hashing is an indexing technique that enables fast search by


computing index directly based on the key
Terminologies

• The process of finding a record using some computation to map


its key value to a position in the array is called hashing.
• The function that maps key values to positions is called a hash
function (h).
• The array that holds the hash table is called the hash table (HT).
• A position in the hash table is also known as a slot.
Hashing

• Hashing is a technique that is used to uniquely identify a specific


object from a group of similar objects. Some examples of how
hashing is used in our lives include:
1. In universities, each student is assigned a unique roll
number that can be used to retrieve information about them.

2. In libraries, each book is assigned a unique number that can


be used to determine information about the book, such as its
exact position in the library or the users it has been issued to
etc.
Hashing
• Assume that you have an object and you want to assign a key to it to make
searching easy. To store the key/value pair, you can use a simple array like a
data structure where keys (integers) can be used directly as an index to store
values. However, in cases where the keys are large and cannot be used
directly as an index, you should use hashing.

• In hashing, large keys are converted into small keys by using hash functions.
The values are then stored in a data structure called hash table.

• The idea of hashing is to distribute entries (key/value pairs) uniformly across


an array. Each element is assigned a key (converted key). By using that key
you can access the element in O(1) time. Using the key, the algorithm (hash
function) computes an index that suggests where an entry can be found or
inserted.
Hashing
• Hashing is implemented in two steps:
1. An element is converted into an integer by using a hash function.
This element can be used as an index to store the original element,
which falls into the hash table.
2. The element is stored in the hash table where it can be quickly
retrieved using hashed key.
hash = hashfunc(key)
index = hash % array_size
• In this method, the hash is independent of the array size and it is
then reduced to an index (a number between 0 and array_size − 1)
by using the modulo operator (%).
Hash Function

• A hash function is any function that can be used to map a data set of
an arbitrary size to a data set of a fixed size, which falls into the hash
table. The values returned by a hash function are called hash values,
hash codes, hash sums, or simply hashes.
• To achieve a good hashing mechanism, It is important to have a good
hash function with the following basic requirements:
1.Easy to compute: It should be easy to compute and must not
become an algorithm in itself.
2. Uniform distribution: It should provide a uniform distribution
across the hash table and should not result in clustering.
Need for a good hash function
• Assume that you have to store strings in the hash table by using
the hashing technique {“abcdef”, “bcdefa”, “cdefab” , “defabc” }
• To compute the index for storing the strings, use a hash function
that states the following:
– The index for a specific string will be equal to the sum of the ASCII values
of the characters modulo 599.
– As 599 is a prime number, it will reduce the possibility of indexing
different strings (collisions). It is recommended that you use prime
numbers in case of modulo.
– The ASCII values of a, b, c, d, e, and f are 97, 98, 99, 100, 101, and 102
respectively. Since all the strings contain the same characters with
different permutations, the sum will 599.
Hash Table
Hash Function

• Let’s try a different hash function. The index for a specific string
will be equal to sum of ASCII values of characters multiplied by
their respective order in the string after which it is modulo with
2069 (prime number).
• String Hash function Index
abcdef (971 + 982 + 993 + 1004 + 1015 + 1026)%2069 38
bcdefa (981 + 992 + 1003 + 1014 + 1025 + 976)%2069 23
cdefab (991 + 1002 + 1013 + 1024 + 975 + 986)%2069 14
defabc (1001 + 1012 + 1023 + 974 + 985 + 996)%2069 11

Hash Table
Hash Function Frequency Count

• Let us consider string S. You are required to count the frequency


of all the characters in this string.
• string S = “ababcd”

• The simplest way to do this is to iterate over all the possible


characters and count their frequency one by one. The time
complexity of this approach is O(26*N) where N is the size of the
string and there are 26 possible characters.
Hash Function Frequency Count

• Let us apply hashing to this problem.

• Take an array frequency of size 26 and hash the 26 characters


with indices of the array by using the hash function.

• Then, iterate over the string and increase the value in the
frequency at the corresponding index for each character.

• The complexity of this approach is O(N) where N is the size of


the string.
Hash Table
Hash functions and collisions
• Typically there are many more values in the key range than there are
slots in the hash table.

• Given a hash function h and two keys k1 and k2, if h(k1)=h(k2)=β, we


say that k1 and k2 have a collision at slot β under hash function h.

• Perfect hashing is a system in which records are hashed such that


there are no collisions (e.g., indexing k-mers when k is small).

• An ideal hash function stores the actual records in the collection such
that each slot in the hash table has equal probability of being filled; but
clustering of records happens (many records hash to only a few of the
slots)
Collision resolution

• While the goal of a hash function is to minimize collisions, some


collisions are unavoidable in practice.
• Hashing implementations must include some form of collision
resolution policy.
• Two class of collision resolution techniques:
– Open hashing (separate chaining)—collisions are stored outside the table
– Closed hashing—collisions result in storing one of the records at another
slot in the table
Open hashing

• The simplest form of open hashing defines each slot in the hash
table to be the head of a linked list. All records that hash to a
particular slot are placed on that slot’s linked list
• Records within a slot’s list can be ordered in several ways: by
insertion order, by key value order, or by frequency-of-access
order
• The average cost for hashing should be Θ(1); however, if clustering
of records exists, then the cost to access a record can be much
higher because many elements on the linked list must be searched
Example
Closed hashing

• Closed hashing stores all records directly in the hash table.


• A collision resolution policy must be built to determine which slot
to use when collision is detected.
• The same policy must be followed during search as during
insertion.
• Some common closed hashing
– Bucket hashing --- overflow goes to an overflow bucket
Closed Hashing

• Like separate chaining, open addressing is a method for handling


collisions.

• In Open Addressing, all elements are stored in the hash table


itself. So at any point, size of the table must be greater than or
equal to the total number of keys (Note that we can increase table
size by copying old data if needed).
Example
Problem
• Clustering: The main problem with linear probing is clustering,
many consecutive elements form groups and it starts taking time
to find a free slot or to search an element.

• It occurs after a hash collision causes two of the records in the


hash table to hash to the same position, and causes one of the
records to be moved to the next location in its probe sequence.

• Once this happens, the cluster formed by this pair of records is


more likely to grow by the addition of even more colliding
records, regardless of whether the new records hash to the same
location as the first two. This phenomenon causes searches for
keys within the cluster to be longer
Comparison
BLAT Tool
References

• Lecture notes of Colin Dewey @ University of Wisconsin-Madison


• Lecture notes of Arne Elofsson @ Stockholm University
• Lecture notes of Yuzhen Ye @ Indiana University

You might also like