Module 5
1. Introduction
a. What is Hashing?
A hash function is like a black box that produces an address every time you drop in a key. More formally, it is a function h(K) that transforms a key K into an address. The resulting address is used as the basis for storing and retrieving records.
The key LOWELL, for example, is transformed by the hash function to the address 4. That is, h(LOWELL) = 4. Address 4 is said to be the home address of LOWELL.
Suppose you want to store 75 records in a file, where the key to each record is a person's name. Suppose also
that you set aside space for 1,000 records.
The key can be hashed by taking two numbers from the ASCII representations of the first two characters of
the name, multiplying these together, and then using the rightmost three digits of the result for the address.
Table below shows how three names would produce three addresses.
Note that even though the names are listed in alphabetical order, there is no apparent order to the addresses.
They appear to be in random order.
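As a small illustration of the scheme just described, here is a minimal Python sketch (the function name and the two extra sample names are my own; the text itself only gives LOWELL's address, 004):

def two_letter_hash(name):
    # Multiply the ASCII codes of the first two characters of the name,
    # then keep the rightmost three digits of the product as the address.
    product = ord(name[0]) * ord(name[1])
    return product % 1000

# LOWELL: 76 * 79 = 6004; the rightmost three digits give address 004.
for name in ["LOWELL", "BALL", "TREE"]:      # sample names, for illustration only
    print(name, two_letter_hash(name))

As the text observes, the alphabetical order of the names tells you nothing about the order of the resulting addresses.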
b. Collisions
Now suppose there is a key in the sample file with the name OLIVIER. Because OLIVIER begins with the same two letters as LOWELL, only in the opposite order (OL rather than LO), the product of the ASCII codes is the same, so both names produce the same address (004).
There is a collision between the record for OLIVIER and the record for LOWELL. We refer to keys that
hash to the same address as synonyms.
Collisions cause problems. We cannot put two records in the same space, so we must resolve collisions. We do this in two ways: by choosing hashing algorithms partly on the basis of how few collisions they are likely to produce, and by playing some tricks with the way we store records.
c. Solution to collisions
i. perfect hashing algorithm
The ideal solution to collisions is to find a transformation algorithm that avoids collisions altogether. Such
an algorithm is called a perfect hashing algorithm.
It turns out to be much more difficult to find a perfect hashing algorithm than one might expect, however.
ii. Spread out the records
Collisions occur when two or more records compete for the same address.
If we could find a hashing algorithm that distributes the records fairly randomly among the available
addresses, then we would not have large numbers of records clustering around certain addresses.
Our sample hash algorithm, which uses only two letters from the key, is not good on this account because certain combinations of two letters are quite common at the start of names, while others are uncommon (e.g., compare the number of names that start with "JO" with the number that start with "XZ").
We need to find a hashing algorithm that distributes records more randomly.
iii. Use extra memory
It is easier to find a hash algorithm that avoids collisions if we have only a few records to distribute among
many addresses than if we have about the same number of records as addresses.
Our sample hashing algorithm is very good on this account: there are 1,000 possible addresses, and at most 75 of them (one for each of the 75 records) will actually be generated.
The obvious disadvantage to spreading out the records is that storage space is wasted.
2. A Simple Hashing Algorithm
The first step of a simple hashing algorithm is to represent the key in numerical form: each character of the key (padded with blanks to a fixed length) is replaced by its ASCII code, and the codes are grouped into pairs. For the key LOWELL followed by six blanks, this yields the number pairs 7679, 8769, 7676, 3232, 3232, and 3232.
The second step is to fold and add. These number pairs can be thought of as integer variables, so we can do arithmetic on them; in particular, we can add them together.
Before we add the numbers, we have to mention a problem caused by the fact that in most cases the sizes of
numbers we can add together are limited.
On some microcomputers, for example, integer values that exceed 32,767 (15 bits) cause overflow errors or
become negative.
For example, adding the first five of the foregoing numbers gives 7679 + 8769 + 7676 + 3232 + 3232 = 30,588.
Adding in the last 3232 would, unfortunately, push the result over the maximum of 32,767 (30,588 + 3,232 = 33,820), causing an overflow error.
Consequently, we need to make sure that each successive sum is less than 32,767. We can do this by first identifying the largest single value we will ever add in the summation, and then making sure after each step that our intermediate result differs from 32,767 by at least that amount.
In our case, let us assume that keys consist only of blanks and uppercase alphabetic characters, so the largest addend is 9,090, corresponding to ZZ.
Suppose we choose 19,937 as our largest allowable intermediate result. This differs from 32,767 by much more than 9,090, so we can be confident that no new addition will cause overflow.
We can ensure in our algorithm that no intermediate sum exceeds 19,937 by applying the mod operator, which returns the remainder when one integer is divided by another, after each addition.
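A minimal Python sketch of this fold-and-add step, assuming the LOWELL number pairs listed above and 19,937 as the largest allowable intermediate result (the function and variable names are my own):

def fold_and_add(pairs, limit=19937):
    # Add the number pairs together, taking the result mod `limit` after each
    # addition so that no intermediate sum can exceed the 15-bit maximum.
    total = 0
    for pair in pairs:
        total = (total + pair) % limit
    return total

lowell_pairs = [7679, 8769, 7676, 3232, 3232, 3232]
print(fold_and_add(lowell_pairs))            # 13883 for the LOWELL pairs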
The last step is to divide the result of the fold-and-add step by the size of the address space and to use the remainder as the home address: a = s mod n, where s is the folded sum and n is the number of addresses. Suppose, for example, that we decide to use the 100 addresses 0-99 for our file. In terms of this formula,
a = 13883 mod 100 = 83
A prime number is usually used for the divisor because primes tend to distribute remainders much more
uniformly than do nonprimes.
If 101 is the size of the address space, the home address of the record in the example becomes
a = 13883 mod 101 = 46
Hence, the record whose key is LOWELL is assigned to record number 46 in the file.
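Putting the three steps together, a compact Python version of the whole algorithm might look like this (the function name, the 12-character padding length, and the defaults are my own assumptions; it reproduces the address 46 for LOWELL with an address space of 101):

def simple_hash(key, n, key_len=12, limit=19937):
    # Step 1: pad the key with blanks and turn each pair of characters into a
    # four-digit number built from their two ASCII codes.
    key = key.ljust(key_len)
    pairs = [ord(key[i]) * 100 + ord(key[i + 1]) for i in range(0, key_len, 2)]
    # Step 2: fold and add, keeping every intermediate sum below the limit.
    total = 0
    for pair in pairs:
        total = (total + pair) % limit
    # Step 3: divide by the (preferably prime) size of the address space and
    # use the remainder as the home address.
    return total % n

print(simple_hash("LOWELL", 101))            # -> 46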
3. Hashing Functions and Record Distributions
a. Distributing Records among Addresses
There are three general kinds of distributions of records among addresses.
Ideally, a hash function should distribute records in a file so that there are no collisions. Such a distribution is called uniform (the best case) because the records are spread out uniformly among the available addresses.
We pointed out earlier that completely uniform distributions are so hard to find that it is generally not
considered worth trying to find them.
The worst case is a distribution in which all records share the same home address, resulting in the maximum number of collisions; the more a distribution looks like this one, the more of a problem collisions will be.
The average case is a distribution in which the records are somewhat spread out but a few collisions occur. This is the most likely case if we have a function that distributes keys randomly.
If a hash function is random, then for a given key every address has the same likelihood of being chosen as
every other address.
The fact that a certain address is chosen for one key neither diminishes nor increases the likelihood that the
same address will be chosen for another key.
b. Some Other Hashing Methods
Examine keys for a pattern. Sometimes keys fall into patterns that naturally spread themselves out. For example, a set of employee identification numbers might be ordered according to when the employees entered an organization.
This might even lead to no synonyms. If some part of a key shows a usable underlying pattern, a hash function that extracts that part of the key can also be used.
Fold parts of the key.
Folding is one stage in the method discussed earlier. It involves extracting digits from part of a key and adding the extracted parts together.
This method destroys the original key patterns but in some circumstances may preserve the separation
between certain subsets of keys that naturally spread themselves out.
Divide the key by a number.
Division by the address size and use of the remainder usually is involved somewhere in a hash function since
the purpose of the function is to produce an address within a certain range.
Division preserves consecutive key sequences, so you can take advantage of sequences that effectively spread
out keys.
However, if there are several consecutive key sequences, division by a number that has many small factors
can result in many collisions.
Research has shown that numbers with no divisors less than 19 generally avoid this problem.
Division by a prime is even more likely than division by a nonprime to generate different results from
different consecutive sequences.
Square the key and take the middle.
This popular method (often called the mid-square method) involves treating the key as a single large number,
squaring the number, and extracting whatever number of digits is needed from the middle of the result.
For example, suppose you want to generate addresses between 0 and 99. If the key is the number 453, its square is 205,209. Extracting the middle two digits yields a number between 0 and 99, in this case 52 (a short code sketch of this method and the next follows the list of methods).
Radix transformation.
This method involves converting the key to some number base other than the one you are working in, and
then taking the result modulo the maximum address as the hash address.
For example, suppose you want to generate addresses between 0 and 99. If the key is the decimal number
453, its base 11 equivalent is 382; 382 mod 99 = 85, so 85 is the hash address.
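Small Python sketches of these last two methods, under the same assumptions as the examples above (two-digit addresses and the key 453; the helper names are my own):

def mid_square(key, digits=2):
    # Square the key and take `digits` digits from the middle of the result.
    squared = str(key * key)
    mid = len(squared) // 2
    return int(squared[mid - 1 : mid - 1 + digits])

def radix_transform(key, base=11, max_address=99):
    # Rewrite the key in the given base, read the digit string back as an
    # ordinary decimal number, then take it modulo the maximum address.
    # (This simple digit-string approach assumes every base-`base` digit is a
    # single decimal digit, which holds for this example.)
    digits = ""
    while key > 0:
        digits = str(key % base) + digits
        key //= base
    return int(digits or "0") % max_address

print(mid_square(453))          # 453 squared is 205209; the middle two digits are 52
print(radix_transform(453))     # 453 is 382 in base 11; 382 mod 99 = 85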
c. Predicting the Distribution of Records
Given that it is nearly impossible to achieve a uniform distribution of records among the available addresses
in a file, it is important to be able to predict how records are likely to be distributed.
The Poisson function, which we also denote by p(x), is given by
p(x) = ((r/N)^x * e^(-r/N)) / x!
where N is the number of available addresses, r is the number of records to be stored, and x is the number of records assigned to a given address.
Then p(x) gives the probability that a given address will have had x records assigned to it after the hashing function has been applied to all r records.
Suppose, for example, that there are 1,000 addresses (N = 1,000) and 1,000 records whose keys are to be hashed to the addresses (r = 1,000).
Since r/N = 1, the probability that a given address will have no keys hashed to it (x = 0) becomes
p(0) = (1^0 * e^(-1)) / 0! = e^(-1) = 0.368
The probabilities that a given address will have exactly one, two, or three keys, respectively, hashed to it are
p(1) = (1^1 * e^(-1)) / 1! = 0.368
p(2) = (1^2 * e^(-1)) / 2! = 0.184
p(3) = (1^3 * e^(-1)) / 3! = 0.061
Continuing with this example (N = 1,000 and r = 1,000), multiplying 1,000 by the probability that a given address will have x records assigned to it gives the expected total number of addresses with x records assigned to them.
In general, if there are N addresses, then the expected number of addresses with x records assigned to them
is Np(x) .
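A brief Python sketch that evaluates the Poisson expression (the function name is my own); for N = r = 1,000 it reproduces the proportions above together with the expected numbers of addresses, N × p(x):

import math

def poisson(N, r, x):
    # Probability that a given address is assigned exactly x of the r records
    # when they are hashed randomly among N addresses.
    mean = r / N
    return (mean ** x) * math.exp(-mean) / math.factorial(x)

N = r = 1000
for x in range(4):
    print(x, round(poisson(N, r, x), 3), round(N * poisson(N, r, x)))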
4. How Much Extra Memory Should Be Used?
We have seen the importance of choosing a good hashing algorithm to reduce collisions.
A second way to decrease the number of collisions (and thereby decrease the average search length) is to use
extra memory.
The tools developed in the previous section can be used to help us determine the effect of the use of extra
memory on performance.
Packing Density
The term packing density refers to the ratio of the number of records to be stored (r) to the number of available spaces (N):
packing density = r/N
For example, if there are 75 records (r = 75) and 100 addresses (N = 100), the packing density is 75/100 = 0.75, or 75 percent.
The packing density gives a measure of the amount of space in a file that is actually used, and it is the only
such value needed to assess performance in a hashing environment, assuming that the hash method used
gives a reasonably random distribution of records.
The raw size of a file and its address space do not matter; what is important is the relative sizes of the two,
which are given by the packing density.
a. Predicting Collisions for Different Packing Densities
The formula for packing density (r/N) occurs twice in the Poisson formula.
Indeed, the numbers of records (r) and addresses (N) always occur together as the ratio r/N. They never occur
independently.
Problem :
Suppose that 1,000 addresses are allocated to hold 500 records in a randomly hashed file, and that each
address can hold one record.
i. What is the packing density for the file?
ii. How many addresses should have no records assigned to them?
iii. How many addresses should have exactly one record assigned (no synonyms) ?
iv. How many addresses should have one record plus one or more synonyms?
v. Assuming that only one record can be assigned to each home address, how many overflow records
can be expected?
vi. What percentage of records should be overflow records?
Solution
i. The packing density is r/N = 500/1,000 = 0.5, or 50 percent.
ii. Since p(0) gives the proportion of addresses with no records assigned, the number of such addresses is
N × p(0) = 1,000 × ((0.5^0 × e^(-0.5)) / 0!) = 1,000 × 0.607 = 607.
iii. The number of addresses with exactly one record assigned (no synonyms) is
N × p(1) = 1,000 × ((0.5^1 × e^(-0.5)) / 1!) = 1,000 × 0.303 = 303.
iv. The values of p(2), p(3), p(4), and so on give the proportions of addresses with one, two, three, and so on synonyms assigned to them. Hence the sum
p(2) + p(3) + p(4) + ...
gives the proportion of all addresses with at least one synonym. We need only compute the terms up to p(5) before they become insignificantly small:
p(2) + p(3) + p(4) + p(5) = 0.0758 + 0.0126 + 0.0016 + 0.0002 = 0.0902
The number of addresses with one or more synonyms is just the product of N and this result: 1,000 × 0.0902 ≈ 90 addresses.
v. For each of the addresses represented by p(2), one record can be stored at the address and one must be an overflow record. For each address represented by p(3), one record can be stored at the address and two are overflow records, and so on. Hence, the expected number of overflow records is given by
N × [1 × p(2) + 2 × p(3) + 3 × p(4) + 4 × p(5)] = 1,000 × [0.0758 + 2 × 0.0126 + 3 × 0.0016 + 4 × 0.0002] ≈ 107.
vi. If there are 107 overflow records and 500 records in all, then the proportion of overflow records is 107/500 = 0.214, or 21.4 percent.
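The same answers can also be computed directly from the Poisson function; a short sketch under the problem's assumptions (N = 1,000 addresses, r = 500 records, one record per home address; truncating the sums at x = 20 is an arbitrary cutoff of my own):

import math

def poisson(N, r, x):
    mean = r / N
    return (mean ** x) * math.exp(-mean) / math.factorial(x)

N, r = 1000, 500
no_records    = N * poisson(N, r, 0)                                       # part ii
exactly_one   = N * poisson(N, r, 1)                                       # part iii
with_synonyms = N * sum(poisson(N, r, x) for x in range(2, 20))            # part iv
overflow      = N * sum((x - 1) * poisson(N, r, x) for x in range(2, 20))  # part v
print(round(no_records), round(exactly_one), round(with_synonyms),
      round(overflow), round(100 * overflow / r, 1))                       # last value: part vi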
The reason to avoid overflow is, of course, that extra searches (hence, extra disk accesses) have to occur
when a record is not found in its home address.
If there are a lot of collisions, there are going to be a lot of overflow records taking up spaces where they
ought not to be.
Clusters of records can form, resulting in the placement of records a long way from home, and so many disk
accesses are required to retrieve them.
Consider the following set of keys and the
corresponding addresses produced by some
hash function.
With progressive overflow (also called linear probing), a record that collides with one already stored at its home address is simply placed in the next available address following it. If these records are loaded into an empty file and progressive overflow is used to resolve collisions, only two of the records will be at their home addresses. All the others require extra accesses to retrieve.
Figure below shows where each key is stored, together with information on how many accesses are required
to retrieve it.
The term search length refers to the number of
accesses required to retrieve a record from
secondary memory.
In the context of hashing, the search length for a
record increases every time there is a collision.
If a record is a long way from its home address,
the search length may be unacceptable.
A good measure of the extent of the overflow
problem is average search length.
The average search length is just the average
number of times you can expect to have to
access the disk to retrieve a record.
A rough estimate of the average search length may be computed by finding the total search length (the sum of the search lengths of the individual records) and dividing this by the number of records:
average search length = total search length / total number of records
In the example, the average search length for the five records can be computed in this way from the search lengths shown in the figure.
With no collisions at all, the average search length is 1, since only one access is needed to retrieve any record.
On the other hand, if a large number of the records in a file results in collisions, the average search length
becomes quite long.
It turns out that, using progressive overflow, the average search length goes up very rapidly as the packing
density increases.
6. Storing More Than One Record per Address: Buckets
The word bucket is sometimes used to describe a block of records that is retrieved in one disk access,
especially when those records are seen as sharing the same address.
On sector-addressing disks, a bucket typically consists of one or more sectors; on block-addressing disks, a
bucket might be a block.
Consider the following set of keys, which is to be loaded into a hash file.
Figure illustrates part of a file into which the records with these keys are loaded.
Each address in the file identifies a bucket capable of holding three records, so up to three synonyms can share a home address.
Only the record corresponding to Nutt cannot be accommodated in a home address.
When a record is to be stored or retrieved, its home bucket address is determined by hashing.
The entire bucket is loaded into primary memory.
When a bucket is filled, we still have to worry about the record overflow problem, but this occurs much less
often when buckets are used than when each address can hold only one record.
c. Effects of Buckets on Performance
When buckets are used, the formula used to compute packing density is changed slightly since each bucket
address can hold more than one record.
To compute how densely packed a file is, we need to consider both the number of addresses (buckets) and the number of records we can put at each address (the bucket size).
If N is the number of addresses and b is the number of records that fit in a bucket, then bN is the number of
available locations for records.
If r is still the number of records in the file, then the packing density becomes
packing density = r / (bN)
Suppose we have a file in which 750 records are to be stored. Consider the following two ways we might
organize the file.
We can store the 750 data records among 1,000 locations, where each location can hold one record. The packing density in this case is 750/1,000 = 75%.
We can store the 750 records among 500 locations, where each location has a bucket size of 2. There are still 1,000 places (2 × 500) to store the 750 records, so the packing density is still 750/1,000 = 75%.
Since the packing density is not changed, we might at first not expect the use of buckets in this way to improve performance, but in fact it does improve performance dramatically. The key to the improvement is that, although there are fewer addresses, each individual address has more room for variation in the number of records assigned to it.
Let's calculate the difference in performance for
these two ways of storing the same number of
records in the same amount of space.
The starting point for our calculations is the
fundamental description of each file structure.
In the case of the file with bucket size one, any address that is assigned exactly one record does not have any
overflow.
Any address with more than one record assigned to it does have overflow. Recall that the expected number of overflow records is given by
N × [1 × p(2) + 2 × p(3) + 3 × p(4) + ...]
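This comparison can be carried out numerically. Below is a hedged sketch that uses one common generalization of the overflow estimate to buckets of size b: an address overflows only when more than b records hash to it, and each record beyond the b that fit is an overflow record.

import math

def poisson(N, r, x):
    mean = r / N
    return (mean ** x) * math.exp(-mean) / math.factorial(x)

def expected_overflow(N, r, b, max_x=40):
    # Expected number of overflow records when each of the N addresses is a
    # bucket holding at most b records (the sum is truncated at max_x).
    return N * sum((x - b) * poisson(N, r, x) for x in range(b + 1, max_x))

r = 750
for N, b in [(1000, 1), (500, 2)]:
    overflow = expected_overflow(N, r, b)
    print(f"N={N}, bucket size {b}: about {overflow:.0f} overflow records "
          f"({100 * overflow / r:.1f}% of the {r} records)")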
A program that loads records into such a file proceeds in two steps. First, the program uses the function hash() to produce a home address for each key. Second, the program looks for a free space for the record by starting with the bucket stored at its home address and then, if the home bucket is full, continuing to look at successive buckets until one is found that is not full. The new record is inserted in this bucket, which is then rewritten to the file at the location from which it was loaded.
If, as it searches for an empty bucket, a loading program passes the maximum allowable address, it must
wrap around to the beginning address.
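A minimal in-memory Python sketch of this loading logic (the bucket size, file size, and names are illustrative assumptions, and the hash function is left to the caller):

BUCKET_SIZE = 3
NUM_BUCKETS = 100
buckets = [[] for _ in range(NUM_BUCKETS)]   # stands in for the hashed file

def insert(key, home_address):
    # Try the home bucket first, then successive buckets (progressive
    # overflow), wrapping around to address 0 past the end of the file.
    for i in range(NUM_BUCKETS):
        address = (home_address + i) % NUM_BUCKETS
        if len(buckets[address]) < BUCKET_SIZE:
            buckets[address].append(key)
            return address
    raise RuntimeError("the file is full")

A retrieval routine would probe the same sequence of buckets, stopping either when it finds the key or when it reaches a bucket that is not full.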
7. Making Deletions
When progressive overflow is used, a search for a record terminates if an open address is encountered.
Because of this, we do not want to leave open
addresses that break overflow searches
improperly.
The following example illustrates the problem.
Adams, Jones, Morris, and Smith are stored in a
hash file in which each address can hold one
record.
Adams and Smith both are hashed to address 5,
and Jones and Morris are hashed to address 6.
If they are loaded in alphabetical order using
progressive overflow for collisions, they are
stored in the locations shown in Fig.
A search for Smith starts at address 5 (Smith's
home address) , successively looks for Smith at
addresses 6, 7, and 8, then finds Smith at 8.
Now suppose Morris is deleted, leaving an empty space, as illustrated in the figure.
A search for Smith again starts at address 5, and
then looks at addresses 6 and 7. Since address 7
is now empty, it is reasonable for the program to
conclude that Smith's record is not in the file.
A common solution is to mark the slot of a deleted record with a special tombstone key (shown here as ######) rather than truly reopening it: a search does not stop when it encounters a tombstone, but the slot can be reused later. With the introduction of tombstones, however, the insertion of records becomes slightly more difficult than our earlier discussions imply.
Whereas programs that perform initial loading simply search for the first occurrence of an empty record slot
(signified by the presence of the key /////), it is now permissible to insert a record where either ///// or ######
occurs as the key.
This new feature, which is desirable because it yields a shorter average search length, brings with it a certain
danger.
Consider, for example, the earlier example in which Morris is deleted, giving the file organization shown in
Figure above.
Now suppose you want a program to insert Smith into the file. If the program simply searches until it
encounters ######, it never notices that Smith is already in the file.
We almost certainly don't want to put a second Smith record into the file, since doing so means that later
searches would never find the older Smith record.
To prevent this from occurring, the program must examine the entire cluster of contiguous keys and
tombstones to ensure that no duplicate key exists and then go back and insert the record in the first available
tombstone, if there is one.
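A sketch of deletion and duplicate-safe insertion with tombstones, for an in-memory file that holds one record per address (EMPTY and TOMBSTONE stand in for the ///// and ###### markers; all names and sizes are my own):

EMPTY, TOMBSTONE = "/////", "######"
SIZE = 11
table = [EMPTY] * SIZE                       # stands in for the hashed file

def delete(key, home):
    # Replace the deleted record with a tombstone so that overflow searches
    # passing through this slot are not cut short.
    for i in range(SIZE):
        addr = (home + i) % SIZE
        if table[addr] == EMPTY:
            return False                     # an empty slot ends the search
        if table[addr] == key:
            table[addr] = TOMBSTONE
            return True
    return False

def insert(key, home):
    # Scan the whole cluster of keys and tombstones up to the first truly
    # empty slot, watching for a duplicate and remembering the first
    # reusable slot (a tombstone) along the way.
    first_free = None
    for i in range(SIZE):
        addr = (home + i) % SIZE
        slot = table[addr]
        if slot == key:
            return False                     # already present; do not insert twice
        if slot == TOMBSTONE and first_free is None:
            first_free = addr
        if slot == EMPTY:
            table[first_free if first_free is not None else addr] = key
            return True
    if first_free is not None:               # no empty slot, but a tombstone is reusable
        table[first_free] = key
        return True
    return False                             # the file is completely full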
8. Other Collision Resolution Techniques
a. Double Hashing
One of the problems with progressive overflow is that if many records hash to buckets in the same vicinity,
clusters of records can form.
As the packing density approaches one, this clustering tends to lead to extremely long searches for some
records.
One method for avoiding clustering is to store overflow records a long way from their home addresses by
double hashing.
With double hashing, when a collision occurs, a second hash function is applied to the key to produce a
number c that is relatively prime to the number of addresses.
The value c is added to the home address to produce the overflow address.
If the overflow address is already occupied, c is added to it to produce another overflow address.
This procedure continues until a free overflow address is found.
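A sketch of the probe sequence that double hashing produces (both hash functions here are simple stand-ins; the only property the text requires is that the step c be relatively prime to the number of addresses, which a prime file size guarantees):

NUM_ADDRESSES = 11                           # a prime size keeps every c in 1..10 coprime to it

def h1(key):                                 # primary hash: the home address
    return sum(ord(ch) for ch in key) % NUM_ADDRESSES

def h2(key):                                 # second hash: the step c, in 1..10
    return 1 + sum(ord(ch) * 7 for ch in key) % (NUM_ADDRESSES - 1)

def probe_sequence(key):
    # Home address first; after each collision, add c again (mod the file size).
    home, c = h1(key), h2(key)
    return [(home + i * c) % NUM_ADDRESSES for i in range(NUM_ADDRESSES)]

print(probe_sequence("LOWELL"))              # visits each of the 11 addresses exactly once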
Double hashing does tend to spread out the
records in a file, but it suffers from a potential
problem that is encountered in several improved
overflow methods: It violates locality by
deliberately moving overflow records some
distance from their home addresses, increasing
the likelihood that the disk will need extra time
to get to the new overflow address.
b. Chained Progressive Overflow
Chained progressive overflow works like progressive overflow except that synonyms are linked together with pointers, so a search can skip over records that are not synonyms. With simple progressive overflow, a search for Cole involves an access to Adams (a synonym) and to Bates (not a synonym). Flint, the worst case, requires six accesses, only two of which involve synonyms.
Since Adams, Cole, and Flint are synonyms, a chaining algorithm forms a linked list connecting these three
names, with Adams at the head of the list.
Since Bates and Dean are also synonyms, they form a second list. This arrangement is illustrated in Figure.
The use of chained progressive overflow requires that we attend to some details that are not required for
simple progressive overflow.
Chaining therefore reduces the average search length below the 2.5 that simple progressive overflow requires for this example.
a. Dynamic Hashing
In dynamic hashing, an original directory of addresses is expanded by attaching small binary trees (tries) to directory nodes whose buckets overflow. Let's look at an example. Figure (a) shows an initial address space of four, and four buckets descending from the four addresses in the directory.
In Figure (b) we have split the bucket at address 4. We address the two buckets resulting from the split as 40 and 41.
In Figure (b) we have split the bucket at address 4. We address the two buckets resulting from the split as 40
and 41.
We change the shape of the directory node at address 4 from a square to a circle because it has changed from an external node, referencing a bucket, to an internal node that points to two child nodes.
In Figure (c) we split the bucket addressed by node 2, creating the new external nodes 20 and 21.
We also split the bucket addressed by 41, extending the trie downward to include 410 and 411.
Because the directory node 41 is now an internal node rather than an external one, it changes from a square
to a circle.
As we continue to add keys and split buckets, these directory tries continue to grow.
Finding a key in a dynamic hashing scheme can involve the use of two hash functions, rather than just one.
First, there is the hash function that covers the original address space.
If you find that the directory node is an external node, and therefore points to a bucket, the search is complete.
However, if the directory node is an internal node, then you need additional address information to guide
you through the ones and zeroes that form the trie.
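A toy Python sketch of the two-level lookup just described (the node layout, hash functions, and bit source are illustrative assumptions of mine; bucket splitting itself is omitted):

class Node:
    # An external node holds a bucket of keys; once split, it becomes an
    # internal node whose two children are reached by the trie bits.
    def __init__(self):
        self.bucket = []                     # used only while the node is external
        self.children = None                 # set to [Node(), Node()] on a split

DIRECTORY_SIZE = 4
directory = [Node() for _ in range(DIRECTORY_SIZE)]

def h1(key):                                 # hash covering the original address space
    return sum(ord(ch) for ch in key) % DIRECTORY_SIZE

def h2_bits(key):                            # second hash, consumed one bit at a time
    value = sum(ord(ch) * 31 for ch in key)
    while True:
        yield value & 1
        value >>= 1

def search(key):
    node, bits = directory[h1(key)], h2_bits(key)
    while node.children is not None:         # internal node: follow the trie downward
        node = node.children[next(bits)]
    return key in node.bucket                # external node: search its bucket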
b. Linear Hashing
The key feature of both extendible hashing and dynamic hashing is that they use a directory to direct access
to the actual buckets containing the key records.
This directory makes it possible to expand and modify the hashed address space without expanding the number of buckets: after expanding the directory, more than one directory node can point to the same bucket.
However, the directory adds additional layers of indirection which, if the directory must be stored on disk,
can result in an additional seek.