Hashing and Indexing
Hashing and Indexing
Zoya Khalid
[email protected]
Indexing a text (a genome, etc)
• In hashing, large keys are converted into small keys by using hash functions.
The values are then stored in a data structure called hash table.
• A hash function is any function that can be used to map a data set of
an arbitrary size to a data set of a fixed size, which falls into the hash
table. The values returned by a hash function are called hash values,
hash codes, hash sums, or simply hashes.
• To achieve a good hashing mechanism, It is important to have a good
hash function with the following basic requirements:
1.Easy to compute: It should be easy to compute and must not
become an algorithm in itself.
2. Uniform distribution: It should provide a uniform distribution
across the hash table and should not result in clustering.
Need for a good hash function
• Assume that you have to store strings in the hash table by using
the hashing technique {“abcdef”, “bcdefa”, “cdefab” , “defabc” }
• To compute the index for storing the strings, use a hash function
that states the following:
– The index for a specific string will be equal to the sum of the ASCII values
of the characters modulo 599.
– As 599 is a prime number, it will reduce the possibility of indexing
different strings (collisions). It is recommended that you use prime
numbers in case of modulo.
– The ASCII values of a, b, c, d, e, and f are 97, 98, 99, 100, 101, and 102
respectively. Since all the strings contain the same characters with
different permutations, the sum will 599.
Hash Table
Hash Function
• Let’s try a different hash function. The index for a specific string
will be equal to sum of ASCII values of characters multiplied by
their respective order in the string after which it is modulo with
2069 (prime number).
• String Hash function Index
abcdef (971 + 982 + 993 + 1004 + 1015 + 1026)%2069 38
bcdefa (981 + 992 + 1003 + 1014 + 1025 + 976)%2069 23
cdefab (991 + 1002 + 1013 + 1024 + 975 + 986)%2069 14
defabc (1001 + 1012 + 1023 + 974 + 985 + 996)%2069 11
•
Hash Table
Hash Function Frequency Count
• Then, iterate over the string and increase the value in the
frequency at the corresponding index for each character.
• An ideal hash function stores the actual records in the collection such
that each slot in the hash table has equal probability of being filled; but
clustering of records happens (many records hash to only a few of the
slots)
Collision resolution
• The simplest form of open hashing defines each slot in the hash
table to be the head of a linked list. All records that hash to a
particular slot are placed on that slot’s linked list
• Records within a slot’s list can be ordered in several ways: by
insertion order, by key value order, or by frequency-of-access
order
• The average cost for hashing should be Θ(1); however, if clustering
of records exists, then the cost to access a record can be much
higher because many elements on the linked list must be searched
Example
Closed hashing