Hashing and Indexing

Hashing is an indexing technique that maps keys to values in a hash table using a hash function. This allows for fast lookup of values in constant time. However, collisions may occur when different keys map to the same location. Open hashing resolves collisions using separate chaining by storing values in linked lists, while closed hashing uses techniques like linear probing to find alternate locations in the table. A good hash function provides a uniform distribution of keys and is easy to compute. Hashing is commonly used to index genomes, proteins, and count character frequencies in strings.

Uploaded by

Ayesha Khan


Hashing and Indexing

Zoya Khalid
Indexing a text (a genome, etc)

• Example 1: we want to index a genome so that we can look up any k-mer along the genome in O(1) time (without scanning the whole genome).

• Example 2: we want to index a protein database so that we can look up all the proteins containing a word (k-mer) in constant time.
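As a rough illustration of Example 1, the sketch below builds a k-mer index with a Python dict (itself a hash table). The function name build_kmer_index and the toy genome string are assumptions made for this example, not part of the original slides.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in `genome` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return dict(index)


# Toy genome, chosen only for illustration.
genome = "ACGTACGTGA"
kmer_index = build_kmer_index(genome, 3)

# Looking up a k-mer is a single dict access; the genome is not rescanned.
print(kmer_index["ACG"])          # [0, 4]
print(kmer_index.get("TTT", []))  # [] (this k-mer does not occur)
```

Building the index takes one pass over the genome; after that, each lookup costs one hash computation on a k-character string, on average.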
Hashing

• Hashing is an indexing technique that enables fast search by computing the index directly from the key.
Terminologies

• The process of finding a record using some computation to map its key value to a position in the array is called hashing.
• The function that maps key values to positions is called a hash function (h).
• The array that holds the records is called the hash table (HT).
• A position in the hash table is also known as a slot.
Hashing

• Hashing is a technique that is used to uniquely identify a specific object from a group of similar objects. Some examples of how hashing is used in our lives include:
1. In universities, each student is assigned a unique roll number that can be used to retrieve information about them.
2. In libraries, each book is assigned a unique number that can be used to determine information about the book, such as its exact position in the library or the users it has been issued to.
Hashing
• Assume that you have an object and you want to assign a key to it to make searching easy. To store the key/value pair, you can use a simple array-like data structure where keys (integers) can be used directly as an index to store values. However, in cases where the keys are large and cannot be used directly as an index, you should use hashing.

• In hashing, large keys are converted into small keys by using hash functions. The values are then stored in a data structure called a hash table.

• The idea of hashing is to distribute entries (key/value pairs) uniformly across an array. Each element is assigned a key (the converted key). By using that key you can access the element in O(1) time. Using the key, the algorithm (hash function) computes an index that suggests where an entry can be found or inserted.
Hashing
• Hashing is implemented in two steps:
1. An element is converted into an integer by using a hash function. This integer can be used as an index at which to store the original element in the hash table.
2. The element is stored in the hash table, where it can be quickly retrieved using the hashed key.
hash = hashfunc(key)
index = hash % array_size
• In this method, the hash is independent of the array size and it is then reduced to an index (a number between 0 and array_size − 1) by using the modulo operator (%).
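The two steps can be sketched in Python as follows. The base-31 polynomial used for hashfunc is only an illustrative choice (the slides do not specify one), and slot_for is a hypothetical helper name. A hand-written function is used here because Python's built-in hash() for strings is salted per process, which would make the output vary between runs.

```python
def hashfunc(key: str) -> int:
    """Toy hash: read the string as a base-31 number (illustrative assumption)."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)
    return h


def slot_for(key: str, array_size: int) -> int:
    """Reduce the size-independent hash to an index in 0 .. array_size - 1."""
    return hashfunc(key) % array_size


array_size = 11
table = [None] * array_size

# Step 1: key -> integer hash; step 2: store the value at hash % array_size.
key, value = "ab", "some record"
table[slot_for(key, array_size)] = value

print(hashfunc(key))              # 3105  (97 * 31 + 98)
print(slot_for(key, array_size))  # 3     (3105 % 11)
```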
Hash Function

• A hash function is any function that can be used to map a data set of an arbitrary size to a data set of a fixed size, which falls into the hash table. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.
• To achieve a good hashing mechanism, it is important to have a good hash function with the following basic requirements:
1. Easy to compute: It should be easy to compute and must not become an algorithm in itself.
2. Uniform distribution: It should provide a uniform distribution across the hash table and should not result in clustering.
Need for a good hash function
• Assume that you have to store the strings {“abcdef”, “bcdefa”, “cdefab”, “defabc”} in the hash table by using the hashing technique.
• To compute the index for storing the strings, use a hash function that states the following:
– The index for a specific string will be equal to the sum of the ASCII values of the characters modulo 599.
– As 599 is a prime number, it will reduce the possibility of different strings being mapped to the same index (collisions). It is recommended that you use prime numbers in case of modulo.
– The ASCII values of a, b, c, d, e, and f are 97, 98, 99, 100, 101, and 102 respectively. Since all the strings contain the same characters in different permutations, the sum will always be 597, so every string is mapped to the same index (597 mod 599 = 597) and all four collide.
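A quick check of this hash function, as a minimal sketch (the name ascii_sum_hash is made up for this example): because addition ignores the order of the characters, every permutation of the same letters gives the same sum, 97 + 98 + 99 + 100 + 101 + 102 = 597, and therefore the same index.

```python
def ascii_sum_hash(s: str, table_size: int = 599) -> int:
    """Sum of the ASCII values of the characters, modulo the table size."""
    return sum(ord(ch) for ch in s) % table_size


strings = ["abcdef", "bcdefa", "cdefab", "defabc"]
sum_indices = [ascii_sum_hash(s) for s in strings]

print(sum_indices)  # [597, 597, 597, 597] -> all four strings collide
```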
Hash Table
Hash Function

• Let’s try a different hash function. The index for a specific string will be equal to the sum of the ASCII values of its characters, each multiplied by its position in the string, taken modulo 2069 (a prime number).

String   Hash function                                           Index
abcdef   (97·1 + 98·2 + 99·3 + 100·4 + 101·5 + 102·6) % 2069     38
bcdefa   (98·1 + 99·2 + 100·3 + 101·4 + 102·5 + 97·6) % 2069     23
cdefab   (99·1 + 100·2 + 101·3 + 102·4 + 97·5 + 98·6) % 2069     14
defabc   (100·1 + 101·2 + 102·3 + 97·4 + 98·5 + 99·6) % 2069     11
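The position-weighted hash can be checked with a short sketch (weighted_hash is a name chosen for this example). Positions are counted from 1, as in the slide.

```python
def weighted_hash(s: str, table_size: int = 2069) -> int:
    """Sum of (ASCII value * 1-based position) over the string, modulo table size."""
    return sum(ord(ch) * pos for pos, ch in enumerate(s, start=1)) % table_size


weighted_indices = {s: weighted_hash(s)
                    for s in ["abcdef", "bcdefa", "cdefab", "defabc"]}

print(weighted_indices)
# {'abcdef': 38, 'bcdefa': 23, 'cdefab': 14, 'defabc': 11}
```

Because each character is now weighted by where it appears, reordering the same letters changes the sum, and the four strings land in four different slots.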
Hash Table
Hash Function Frequency Count

• Let us consider string S. You are required to count the frequency of all the characters in this string.
• string S = “ababcd”

• The simplest way to do this is to iterate over all the possible characters and count their frequency one by one. The time complexity of this approach is O(26*N) where N is the size of the string and there are 26 possible characters.
Hash Function Frequency Count

• Let us apply hashing to this problem.

• Take an array, frequency, of size 26 and use a hash function to map each of the 26 characters to an index of the array.

• Then, iterate over the string and increase the value in frequency at the corresponding index for each character.

• The complexity of this approach is O(N) where N is the size of the string.
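A minimal sketch of the hashed frequency count, assuming the string contains only the lowercase letters a to z. The hash function here is simply ord(ch) - ord("a"), which maps 'a' to 0 through 'z' to 25; the name char_frequencies is chosen for this example.

```python
def char_frequencies(s: str) -> list[int]:
    """Count lowercase letters in one pass over the string, O(N)."""
    frequency = [0] * 26
    for ch in s:
        # Hash function: character -> array index in 0..25.
        frequency[ord(ch) - ord("a")] += 1
    return frequency


freq = char_frequencies("ababcd")
for i, count in enumerate(freq):
    if count:
        print(chr(ord("a") + i), count)
# a 2
# b 2
# c 1
# d 1
```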
Hash Table
Hash functions and collisions
• Typically there are many more values in the key range than there are slots in the hash table.

• Given a hash function h and two keys k1 and k2, if h(k1) = h(k2) = β, we say that k1 and k2 have a collision at slot β under hash function h.

• Perfect hashing is a system in which records are hashed such that there are no collisions (e.g., indexing k-mers when k is small).

• An ideal hash function distributes the records in the collection so that each slot in the hash table has an equal probability of being filled; in practice, however, records often cluster (many records hash to only a few of the slots).
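One way to see why small k-mers allow perfect hashing: a DNA k-mer over the alphabet A, C, G, T can be read as a k-digit base-4 number, which gives each of the 4^k possible k-mers its own slot. The sketch below assumes that DNA alphabet and a particular letter-to-digit assignment; both, along with the name kmer_to_slot, are assumptions made for this example.

```python
from itertools import product

# Assumed 2-bit encoding of the DNA alphabet.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_slot(kmer: str) -> int:
    """Read the k-mer as a base-4 number: a collision-free index in 0 .. 4**k - 1."""
    slot = 0
    for base in kmer:
        slot = slot * 4 + CODE[base]
    return slot


k = 3
all_slots = {kmer_to_slot("".join(p)) for p in product("ACGT", repeat=k)}

print(kmer_to_slot("AAA"))  # 0
print(kmer_to_slot("ACG"))  # 6   (0*16 + 1*4 + 2)
print(kmer_to_slot("TTT"))  # 63
print(len(all_slots))       # 64 distinct slots for 64 distinct 3-mers
```

The table needs 4^k slots, so this only stays practical while k is small.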
Collision resolution

• While the goal of a hash function is to minimize collisions, some collisions are unavoidable in practice.
• Hashing implementations must include some form of collision resolution policy.
• Two classes of collision resolution techniques:
– Open hashing (separate chaining): colliding records are stored outside the table
– Closed hashing: a collision results in storing one of the records at another slot in the table
Open hashing

• The simplest form of open hashing defines each slot in the hash table to be the head of a linked list. All records that hash to a particular slot are placed on that slot’s linked list.
• Records within a slot’s list can be ordered in several ways: by insertion order, by key value order, or by frequency-of-access order.
• The average cost for hashing should be Θ(1); however, if clustering of records exists, then the cost to access a record can be much higher because many elements on the linked list must be searched.
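A minimal sketch of open hashing, under a few stated assumptions: keys are integers, the hash function is key % size, and each chain is a plain Python list standing in for the linked list described above, with records kept in insertion order. The class name ChainedHashTable is made up for this example.

```python
class ChainedHashTable:
    """Open hashing (separate chaining) with one chain per slot."""

    def __init__(self, size: int = 7):
        self.size = size
        # A Python list stands in for each slot's linked list.
        self.slots = [[] for _ in range(size)]

    def _slot(self, key: int) -> int:
        return key % self.size

    def insert(self, key: int, value) -> None:
        chain = self.slots[self._slot(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # overwrite an existing record
                return
        chain.append((key, value))       # insertion order within the chain

    def search(self, key: int):
        # Only the chain of the key's own slot has to be scanned.
        for existing_key, value in self.slots[self._slot(key)]:
            if existing_key == key:
                return value
        return None


chained = ChainedHashTable(size=7)
for key in (10, 17, 24, 5):
    chained.insert(key, f"record-{key}")

print(chained.slots[3])    # [(10, 'record-10'), (17, 'record-17'), (24, 'record-24')]
print(chained.search(17))  # record-17
print(chained.search(3))   # None
```

Keys 10, 17 and 24 all hash to slot 3, so finding 24 means walking past the other two records; this is the extra cost the last bullet refers to.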
Example
Closed hashing

• Closed hashing stores all records directly in the hash table.

• A collision resolution policy must be built to determine which slot to use when a collision is detected.
• The same policy must be followed during search as during insertion.
• Some common closed hashing methods:
– Bucket hashing: overflow goes to an overflow bucket
Closed Hashing

• Like separate chaining, open addressing (another name for closed hashing) is a method for handling collisions.

• In open addressing, all elements are stored in the hash table itself. So at any point, the size of the table must be greater than or equal to the total number of keys (note that we can increase the table size by copying the old data into a larger table if needed).
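A minimal sketch of open addressing, using linear probing as the collision resolution policy. It assumes integer keys, the hash function key % size, and no deletions (removing records from a probed table needs extra bookkeeping that is left out here). The class name LinearProbingTable is made up for this example.

```python
class LinearProbingTable:
    """Closed hashing (open addressing) with linear probing; no deletion."""

    def __init__(self, size: int = 7):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size

    def _probe(self, key: int):
        """Yield slots in probe order: the home slot, then each next slot, wrapping."""
        home = key % self.size
        for step in range(self.size):
            yield (home + step) % self.size

    def insert(self, key: int, value) -> int:
        for slot in self._probe(key):
            if self.keys[slot] is None or self.keys[slot] == key:
                self.keys[slot] = key
                self.values[slot] = value
                return slot
        raise RuntimeError("hash table is full")

    def search(self, key: int):
        # Same probe sequence as insert; an empty slot means the key is absent.
        for slot in self._probe(key):
            if self.keys[slot] is None:
                return None
            if self.keys[slot] == key:
                return self.values[slot]
        return None


probing = LinearProbingTable(size=7)
for key in (10, 17, 24, 6):
    probing.insert(key, f"record-{key}")

print(probing.keys)        # [None, None, None, 10, 17, 24, 6]
print(probing.search(24))  # record-24
print(probing.search(3))   # None
```

Keys 10, 17 and 24 share home slot 3, so 17 and 24 are pushed to slots 4 and 5. Search follows the same probe sequence as insertion, as closed hashing requires.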
Example
Problem
• Clustering: The main problem with linear probing is clustering: many consecutive elements form groups, and it starts taking longer to find a free slot or to search for an element.

• It occurs after a hash collision causes two of the records in the hash table to hash to the same position, and causes one of the records to be moved to the next location in its probe sequence.

• Once this happens, the cluster formed by this pair of records is more likely to grow by the addition of even more colliding records, regardless of whether the new records hash to the same location as the first two. This phenomenon causes searches for keys within the cluster to take longer.
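The growth of a cluster can be observed by counting how many slots each insertion has to examine. The sketch below assumes integer keys, the hash function key % size, and a table of 11 slots; insert_linear and the chosen keys are illustrative only.

```python
def insert_linear(table: list, key: int) -> int:
    """Insert `key` with linear probing; return the number of slots examined."""
    size = len(table)
    home = key % size
    for step in range(size):
        slot = (home + step) % size
        if table[slot] is None:
            table[slot] = key
            return step + 1
    raise RuntimeError("hash table is full")


cluster_table = [None] * 11
probes = {key: insert_linear(cluster_table, key) for key in (22, 33, 12, 1, 44)}

print(cluster_table[:6])  # [22, 33, 12, 1, 44, None]
print(probes)             # {22: 1, 33: 2, 12: 2, 1: 3, 44: 5}
```

Keys 22, 33 and 44 have home slot 0, while 12 and 1 have home slot 1. Key 12 does not share a home slot with 22 or 33, yet it still lands in the same run of occupied slots and lengthens it, so the last insertion (44) has to examine five slots.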
Comparison
BLAT Tool
References

• Lecture notes of Colin Dewey @ University of Wisconsin-Madison

• Lecture notes of Arne Elofsson @ Stockholm University
• Lecture notes of Yuzhen Ye @ Indiana University
