
Module – 5 Hashing and Extendible Hashing

1. Introduction
a. What is Hashing?
 A hash function is like a black box that produces
an address every time you drop in a key. More
formally, it is a function h(K) that transforms a
key K into an address. The resulting address is
used as the basis for storing and retrieving
records
 The key LOWELL is transformed by the Hash
function to the address 4. That is, h(LOWELL)
= 4. Address 4 is said to be the home address of
LOWELL
 Suppose you want to store 75 records in a file, where the key to each record is a person's name. Suppose also
that you set aside space for 1,000 records.
 The key can be hashed by taking two numbers from the ASCII representations of the first two characters of
the name, multiplying these together, and then using the rightmost three digits of the result for the address.
 Table below shows how three names would produce three addresses.

 Note that even though the names are listed in alphabetical order, there is no apparent order to the addresses.
They appear to be in random order.
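 As a rough illustration, the two-letter hash described above can be sketched in Python (the function name is ours, and a real implementation would read its keys from a file):

def two_letter_hash(name):
    # Multiply the ASCII codes of the first two characters of the name,
    # then keep the rightmost three digits of the product as the address.
    product = ord(name[0]) * ord(name[1])
    return product % 1000            # addresses 000-999

print(two_letter_hash("LOWELL"))     # 76 * 79 = 6004 -> home address 4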
b. Collisions
 Now suppose there is a key in the sample file with the name OLIVIER. Since the name OLIVIER starts with
the same two letters as the name LOWELL, they produce the same address (004) .
 There is a collision between the record for OLIVIER and the record for LOWELL. We refer to keys that
hash to the same address as synonyms.
 Collisions cause problems. We cannot put two records in the same space, so we must resolve collisions. We
do this in two ways: by choosing hashing algorithms partly on the basis of how few collisions they are likely
to produce, and by playing some tricks with the ways we store records
c. Solution to collisions
i. perfect hashing algorithm
 The ideal solution to collisions is to find a transformation algorithm that avoids collisions altogether. Such
an algorithm is called a perfect hashing algorithm.
 It turns out to be much more difficult to find a perfect hashing algorithm than one might expect, however.
ii. Spread out the records
 Collisions occur when two or more records compete for the same address.
 If we could find a hashing algorithm that distributes the records fairly randomly among the available
addresses, then we would not have large numbers of records clustering around certain addresses.
 Our sample hash algorithm, which uses only two letters from the key, is not good on this account because
certain combinations of two letters are quite common in starting names, while others are uncommon (e.g.,
compare the number of names that start with "JO" with the number that start with "XZ").
 We need to find a hashing algorithm that distributes records more randomly.
iii. Use extra memory
 It is easier to find a hash algorithm that avoids collisions if we have only a few records to distribute among
many addresses than if we have about the same number of records as addresses.
 Our sample hashing algorithm is very good on this account, since there are 1,000 possible addresses and only
75 of them (one for each of the 75 records) will be used.
 The obvious disadvantage to spreading out the records is that storage space is wasted.

iv. Put more than one record at a single address


 Up to now we have assumed tacitly that each physical record location in a file could hold exactly one record,
but there is usually no reason why we cannot create our file in such a way that every file address is big enough
to hold several records.
 If, for example, each record is 80 bytes long, and we create a file with 512-byte physical records, we can
store up to six records at each file address.
 Each address is able to tolerate five synonyms. Addresses that can hold several records in this way are
sometimes called buckets.
2. A Simple Hashing Algorithm
 This algorithm has three steps:
1. Represent the key in numerical form.
2. Fold and add.
3. Divide by a prime number and use the remainder as the address.
 Step 1. Represent the Key in Numerical Form:
 If the key is already a number, then this step is already accomplished.
 If it is a string of characters, we take the ASCII code of each character and use it to form a number.
 In this algorithm we use the entire key, rather than just the first two letters.
 By using more parts of a key, we increase the likelihood that differences among the keys cause differences
in the addresses produced.
 The extra processing time required to do this is usually insignificant when compared to the potential
improvement in performance

 Step 2. Fold and Add


 Folding and adding means chopping off pieces of the number and adding them together.
 In our algorithm we chop off pieces with two ASCII numbers each. For LOWELL padded with blanks, the pieces are
76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32, which form the numbers 7679, 8769, 7676, 3232, 3232, and 3232.
 These number pairs can be thought of as integer variables so we can do arithmetic on them. If we can treat
them as integer variables, then we can add them.
 Before we add the numbers, we have to mention a problem caused by the fact that in most cases the sizes of
numbers we can add together are limited.
 On some microcomputers, for example, integer values that exceed 32,767 (the largest value that fits in a
16-bit signed integer) cause overflow errors or become negative.
 For example, adding the first five of the foregoing numbers gives 7679 + 8769 + 7676 + 3232 + 3232 =
30588.
 Adding in the last 3232 would, unfortunately, push the result over the maximum 32,767
(30,588 + 3,232 = 33,820), causing an overflow error.
 Consequently, we need to make sure that each successive sum is less than 32,767. We can do this by first
identifying the largest single value we will ever add in our summation, and then making sure after each step
that our intermediate result differs from 32,767 by at least that amount.
 In our case, let us assume that keys consist only of blanks and uppercase alphabetic characters, so the largest
addend is 9,090, corresponding to ZZ.
 Suppose we choose 19,937 as our largest allowable intermediate result. This differs from 32,767 by much
more than 9,090, so we can be confident that no new addition will cause overflow.
 We can ensure in our algorithm that no intermediate sum exceeds 19,937 by using the mod operator, which
returns the remainder when one integer is divided by another. For LOWELL the running sums become:
7679 + 8769 = 16448 -> 16448 mod 19937 = 16448
16448 + 7676 = 24124 -> 24124 mod 19937 = 4187
4187 + 3232 = 7419 -> 7419 mod 19937 = 7419
7419 + 3232 = 10651 -> 10651 mod 19937 = 10651
10651 + 3232 = 13883 -> 13883 mod 19937 = 13883
 The number 13,883 is the result of the fold-and-add operation.


 Step 3. Divide by the Size of the Address Space
 The purpose of this step is to cut down to size the number produced in step 2 so it falls within the range of
addresses of records in the file.
 This can be done by dividing that number by a number that is the address size of the file, and then taking the
remainder.
 The remainder will be the home address of the record.
 We can represent this operation symbolically as follows: if s represents the sum produced in step 2 (13,883
in the example), n represents the divisor (the number of addresses in the file), and a represents the address
we are trying to produce, we apply the formula
a = s mod n.

 Suppose, for example, that we decide to use the 100 addresses 0-99 for our file. In terms of the preceding
formula,
a = 13883 mod 100 = 83
 A prime number is usually used for the divisor because primes tend to distribute remainders much more
uniformly than do nonprimes.
 If 101 is the size of the address space, the home address of the record in the example becomes
a = 13883 mod 101 = 46.
 Hence, the record whose key is LOWELL is assigned to record number 46 in the file. A short sketch of the
complete algorithm is given below.
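 The following Python sketch pulls the three steps together; the padding width of 12 characters and the
19,937 cap are taken from the worked example, and the function name is illustrative:

def hash_key(key, num_addresses=101, pad_to=12, max_partial=19937):
    key = key.ljust(pad_to)                         # Step 1: pad with blanks; use ASCII codes
    total = 0
    for i in range(0, pad_to, 2):                   # Step 2: fold and add two characters at a time
        pair = ord(key[i]) * 100 + ord(key[i + 1])  # e.g. 'L','O' -> 7679
        total = (total + pair) % max_partial        # keep the running sum from overflowing
    return total % num_addresses                    # Step 3: divide by a prime, use the remainder

print(hash_key("LOWELL"))                           # 46, as in the worked example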
3. Hashing Functions and Record Distributions
a. Distributing Records among Addresses
 There are three different ways records can be distributed among addresses.
 Ideally, a hash function should distribute records in a file so there are no collisions, as illustrated by
distribution below.
 Such a distribution is called uniform (best case) because the records are spread out uniformly among the
addresses.
 We pointed out earlier that completely uniform distributions are so hard to find that it is generally not
considered worth trying to find them.
 In the worst-case distribution, all records share the same home address, resulting in the maximum number
of collisions. The more a distribution looks like this one, the more collisions will be a problem.
 A distribution in which the records are somewhat spread out, but with a few collisions, is called the
average case. This is the most likely case if we have a function that distributes keys randomly.
 If a hash function is random, then for a given key every address has the same likelihood of being chosen as
every other address.
 The fact that a certain address is chosen for one key neither diminishes nor increases the likelihood that the
same address will be chosen for another key.

b. Some Other Hashing Methods


 Examine keys for a pattern.
 Sometimes keys fall in patterns that naturally spread themselves out.
 This is more likely to be true of numeric keys than of alphabetic keys.

 For example, a set of employee identification numbers might be ordered according to when the employees
entered an organization.
 This might even lead to no synonyms. If some part of a key shows a usable underlying pattern, a hash function
that extracts that part of the key can also be used.
 Fold parts of the key.
 Folding is one stage in the method discussed earlier. It involves extracting digits from part of a key and
adding the extracted parts together.
 This method destroys the original key patterns but in some circumstances may preserve the separation
between certain subsets of keys that naturally spread themselves out.
 Divide the key by a number.
 Division by the address size and use of the remainder usually is involved somewhere in a hash function since
the purpose of the function is to produce an address within a certain range.
 Division preserves consecutive key sequences, so you can take advantage of sequences that effectively spread
out keys.
 However, if there are several consecutive key sequences, division by a number that has many small factors
can result in many collisions.
 Research has shown that numbers with no divisors less than 19 generally avoid this problem.
 Division by a prime is even more likely than division by a nonprime to generate different results from
different consecutive sequences.
 Square the key and take the middle.
 This popular method (often called the mid-square method) involves treating the key as a single large number,
squaring the number, and extracting whatever number of digits is needed from the middle of the result.
 For example, suppose you want to generate addresses between 0 and 99. If the key is the number 453, its
square is 205,209.
 Extracting the middle two digits yields a number between 0 and 99, in this case 52.
 Radix transformation.
 This method involves converting the key to some number base other than the one you are working in, and
then taking the result modulo the maximum address as the hash address.
 For example, suppose you want to generate addresses between 0 and 99. If the key is the decimal number
453, its base 11 equivalent is 382; 382 mod 99 = 85, so 85 is the hash address.
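 The mid-square and radix-transformation examples above can be sketched as follows (a hedged illustration
that only handles keys like the ones in the worked examples; the values for key 453 appear in the comments):

def mid_square(key, address_space=100):
    # Square the key and pull two digits from the middle of the result.
    squared = str(key * key)                              # 453 -> "205209"
    mid = len(squared) // 2
    return int(squared[mid - 1:mid + 1]) % address_space  # -> 52

def radix_transform(key, radix=11):
    # Write the key in base 11, read the digit string back as a decimal number,
    # and reduce it modulo the maximum address (99 in the example).
    # Assumes every base-11 digit of the key happens to be below 10.
    digits = ""
    n = key
    while n > 0:
        digits = str(n % radix) + digits
        n //= radix
    return int(digits) % 99                               # 453 -> "382" -> 382 mod 99 = 85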
c. Predicting the Distribution of Records
 Given that it is nearly impossible to achieve a uniform distribution of records among the available addresses
in a file, it is important to be able to predict how records are likely to be distributed.
 The Poisson function, which we denote by p(x), is given by
p(x) = ((r/N)^x * e^(-r/N)) / x!
where N = the number of available addresses; r = the number of records to be stored; and x = the number
of records assigned to a given address.
 Then p(x) gives the probability that a given address will have had x records assigned to it after the hashing
function has been applied to all r records.
 For example, suppose that there are 1,000 addresses (N = 1,000) and 1,000 records whose keys are to be hashed to
the addresses (r = 1,000).
 Since r/N = 1, the probability that a given address will have no keys hashed to it (x = 0) becomes
p(0) = e^(-1) = 0.368.
 The probabilities that a given address will have exactly one, two, or three keys, respectively, hashed to it are
p(1) = 0.368, p(2) = 0.184, and p(3) = 0.061.

 Suppose there are 1,000 addresses (N = 1,000) and 1,000 records (r = 1,000). Multiplying 1,000 by the
probability that a given address will have x records assigned to it gives the expected total number of addresses
with x records assigned to them.
 In general, if there are N addresses, then the expected number of addresses with x records assigned to them
is Np(x) .
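 A small Python sketch of the Poisson estimate (the function names are ours):

import math

def poisson(x, r, N):
    # p(x): probability that a given address has exactly x records assigned
    # to it after r records have been hashed randomly into N addresses.
    return ((r / N) ** x) * math.exp(-r / N) / math.factorial(x)

N = r = 1000
for x in range(4):
    print(x, round(poisson(x, r, N), 3), round(N * poisson(x, r, N)))
# p(0) = p(1) = 0.368, p(2) = 0.184, p(3) = 0.061,
# so roughly 368, 368, 184, and 61 addresses respectively.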
4. How Much Extra Memory Should Be Used?
 We have seen the importance of choosing a good hashing algorithm to reduce collisions.
 A second way to decrease the number of collisions (and thereby decrease the average search length) is to use
extra memory.
 The tools developed in the previous section can be used to help us determine the effect of the use of extra
memory on performance.
 Packing Density

 The term packing density refers to the ratio of the number of records to be stored (r) to the number of
available spaces (N).
 For example, if there are 75 records (r = 75) and 100 addresses (N = 100), the packing density is
75/100 = 0.75 = 75%.
 The packing density gives a measure of the amount of space in a file that is actually used, and it is the only
such value needed to assess performance in a hashing environment, assuming that the hash method used
gives a reasonably random distribution of records.
 The raw size of a file and its address space do not matter; what is important is the relative sizes of the two,
which are given by the packing density.
a. Predicting Collisions for Different Packing Densities
 The formula for packing density (r/N) occurs
twice in the Poisson formula
 Indeed, the numbers of records (r) and addresses (N) always occur together as the ratio r/N. They never occur
independently.
 Problem :
Suppose that 1,000 addresses are allocated to hold 500 records in a randomly hashed file, and that each
address can hold one record.
i. What is the packing density for the file?
ii. How many addresses should have no records assigned to them?
iii. How many addresses should have exactly one record assigned (no synonyms) ?
iv. How many addresses should have one record plus one or more synonyms?
v. Assuming that only one record can be assigned to each home address, how many overflow records
can be expected?
vi. What percentage of records should be overflow records?
 Solution
i. Packing density = r/N = 500/1,000 = 0.5 = 50%.
ii. Since p(0) gives the proportion of addresses with no records assigned, the number of such addresses is
N x p(0) = 1,000 x e^(-0.5) = 1,000 x 0.607 = 607 (approximately).
iii. The number of addresses with exactly one record assigned (no synonyms) is
N x p(1) = 1,000 x 0.5 x e^(-0.5) = 1,000 x 0.303 = 303 (approximately).
iv. The values of p(2), p(3), p(4), and so on give the proportions of addresses with one, two, three, and
so on synonyms assigned to them. Hence the sum
p(2) + p(3) + p(4) + p(5) + ...
gives the proportion of all addresses with at least one synonym. We need only compute the results
up to p(5) before they become insignificantly small:
p(2) + p(3) + p(4) + p(5) = 0.0758 + 0.0126 + 0.0016 + 0.0002 = 0.09 (approximately).
The number of addresses with one or more synonyms is just the product of N and this result:
1,000 x 0.09 = 90 addresses (approximately).
v. For each of the addresses represented by p(2), one record can be stored at the address and one must
be an overflow record. For each address represented by p(3), one record can be stored at the address,
two are overflow records, and so on. Hence, the expected number of overflow records is given by
N x [1 x p(2) + 2 x p(3) + 3 x p(4) + 4 x p(5)] = 1,000 x [0.0758 + 2 x 0.0126 + 3 x 0.0016 + 4 x 0.0002] = 107 (approximately).
vi. If there are 107 overflow records and 500 records in all, then the proportion of overflow records is
107/500 = 0.214 = 21.4%.
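 The same answers can be reproduced with a short Python sketch (variable names are ours):

import math

def p(x, density):
    return (density ** x) * math.exp(-density) / math.factorial(x)

N, r = 1000, 500
d = r / N                                                        # i.   packing density = 0.5
empty        = N * p(0, d)                                       # ii.  about 607 empty addresses
one_record   = N * p(1, d)                                       # iii. about 303 addresses, no synonyms
with_synonym = N * sum(p(x, d) for x in range(2, 8))             # iv.  about 90 addresses
overflow     = N * sum((x - 1) * p(x, d) for x in range(2, 8))   # v.   about 107 overflow records
print(round(empty), round(one_record), round(with_synonym), round(overflow))
print(f"{round(overflow) / r:.1%}")                              # vi.  21.4% of the records overflow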
5. Collision Resolution by Progressive Overflow


a. Progressive Overflow
 In the example, we want to store the record whose key is York in the file. Unfortunately, the name York
hashes to the same address as the name Rosen, whose record is already stored there. Since York cannot fit
in its home address, it is an overflow record.
 If progressive overflow is used, the next several addresses are searched in sequence until an empty one is
found. The first free address becomes the address of the record.
 In the example, address 9 is the first address found empty, so the record pertaining to York is stored in
address 9.
 Eventually we need to find York's record in the file. Since York still hashes to 6, the search for the record
begins at address 6.
 It does not find York's record there, so it
proceeds to look at successive records until it
gets to address 9, where it finds York.
 It is assumed that the file can hold 100 records
in addresses 0-99. Blue is hashed to record
number 99, which is already occupied by Jello.
 Since the file holds only 100 records, it is not
possible to use 100 as the next address.
 The way this is handled in progressive overflow
is to wrap around the address space of the file by
choosing address 0 as the next address. Since, in
this case, address 0 is not occupied, Blue gets
stored in address 0.
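 A minimal Python sketch of progressive overflow with wraparound (slot layout and names are assumptions;
a real file would hold fixed-length records on disk, and the file is assumed not to be completely full):

FILE_SIZE = 100
slots = [None] * FILE_SIZE              # None marks an empty address

def insert(home_address, key):
    addr = home_address
    while slots[addr] is not None:      # probe successive addresses until a free one is found
        addr = (addr + 1) % FILE_SIZE   # wrap from address 99 back to address 0
    slots[addr] = key

def search(home_address, key):
    # The number of probes made here is the "search length" discussed in the next subsection.
    addr = home_address
    while slots[addr] is not None:
        if slots[addr] == key:
            return addr
        addr = (addr + 1) % FILE_SIZE
    return None                         # an empty address ends the search: the key is absent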
b. Search Length

 The reason to avoid overflow is, of course, that extra searches (hence, extra disk accesses) have to occur
when a record is not found in its home address.
 If there are a lot of collisions, there are going to be a lot of overflow records taking up spaces where they
ought not to be.
 Clusters of records can form, resulting in the placement of records a long way from home, and so many disk
accesses are required to retrieve them.
 Consider the following set of keys and the
corresponding addresses produced by some
hash function.
 If these records are loaded into an empty file,
and progressive overflow is used to resolve
collisions, only two of the records will be at their
home addresses.
 All the others require extra accesses to retrieve.
 Figure below shows where each key is stored, together with information on how many accesses are required
to retrieve it.
 The term search length refers to the number of
accesses required to retrieve a record from
secondary memory.
 In the context of hashing, the search length for a
record increases every time there is a collision.
 If a record is a long way from its home address,
the search length may be unacceptable.
 A good measure of the extent of the overflow
problem is average search length.
 The average search length is just the average
number of times you can expect to have to
access the disk to retrieve a record.

 A rough estimate of average search length may be computed by finding the total search length (the sum of
the search lengths of the individual records) and dividing this by the number of records:
average search length = total search length / total number of records.
 In the example, the average search length for the five records is

 With no collisions at all, the average search length is 1, since only one access is needed to retrieve any record.
 On the other hand, if a large number of the records in a file results in collisions, the average search length
becomes quite long.
 It turns out that, using progressive overflow, the average search length goes up very rapidly as the packing
density increases.
6. Storing More Than One Record per Address: Buckets
 The word bucket is sometimes used to describe a block of records that is retrieved in one disk access,
especially when those records are seen as sharing the same address.
 On sector-addressing disks, a bucket typically consists of one or more sectors; on block-addressing disks, a
bucket might be a block.
 Consider the following set of keys, which is to be loaded into a hash file.

 Figure illustrates part of a file into which the records with these keys are loaded.
 Each address in the file identifies a bucket capable of holding the records corresponding to three synonyms.
 Only the record corresponding to Nutt cannot be accommodated in a home address.
 When a record is to be stored or retrieved, its home bucket address is determined by hashing.
 The entire bucket is loaded into primary memory.
 When a bucket is filled, we still have to worry about the record overflow problem, but this occurs much less
often when buckets are used than when each address can hold only one record.
c. Effects of Buckets on Performance
 When buckets are used, the formula used to compute packing density is changed slightly since each bucket
address can hold more than one record.
 To compute how densely packed a file is, we need to consider both the number of addresses (buckets) and
the number or records we can put at each address (bucket size).
 If N is the number of addresses and b is the number of records that fit in a bucket, then bN is the number of
available locations for records.
 If r is still the number of records in the file, then the packing density becomes
packing density = r / (bN).
 Suppose we have a file in which 750 records are to be stored. Consider the following two ways we might
organize the file.
 We can store the 750 data records among 1,000 locations, where each location can hold one record. The
packing density in this case is 750/1,000 = 75%.
 We can store the 750 records among 500 locations, where each location has a bucket size of 2.
 There are still 1,000 places (2 x 500) to store the
750 records, so the packing density is still 750/1,000 = 75%.
 Since the packing density is not changed, we
might at first not expect the use of buckets in this
way to improve performance, but in fact it does
improve performance dramatically.
 The key to the improvement is that, although
there are fewer addresses, each individual
address has more room for variation in the
number of records assigned to it.
 Let's calculate the difference in performance for
these two ways of storing the same number of
records in the same amount of space.
 The starting point for our calculations is the
fundamental description of each file structure.
 In the case of the file with bucket size one, any address that is assigned exactly one record does not have any
overflow.
 Any address with more than one record does have overflow. Recall that the expected number of overflow
records is given by
N x [1 x p(2) + 2 x p(3) + 3 x p(4) + ...]
which, for r/N = 0.75 and N = 1,000, is approximately 222.
 The 222 overflow records represent 29.6% overflow.


 In the case of the bucket file, any address that is assigned either one or two records does not have overflow.
 The value of p(1) (with r/N = 1.5) gives the proportion of addresses assigned exactly one record, and p(2)
(with r/N = 1.5) gives the proportion of addresses assigned exactly two records.
 It is not until we get to p(3) that we encounter addresses for which there are overflow records.
 For each address represented by p(3) , two records can be stored at the address , and one must be an overflow
record.
 Similarly, for each address represented by p(4) , there are two overflow records, and so forth. Hence, the
expected number of overflow records in the bucket file is
N x [1 x p(3) + 2 x p(4) + 3 x p(5) + 4 x p(6) + ...]
which for r/N = 1.5 and N = 500 is approximately
500 x [1 x 0.1255 + 2 x 0.0471 + 3 x 0.0141 + 4 x 0.0035 + 5 x 0.0008] = 140.
 The 140 overflow records represent 18.7% overflow.
 We have shown that with one record per address and a packing density of 75%, the expected proportion of
overflow records is 29.6%.
 When 500 buckets are used, each capable of holding two records, the packing density remains 75%, but the
expected proportion of overflow records drops to 18.7%.
 That is about a 37% decrease in the number of times the program is going to have to look elsewhere for a
record.
 As the bucket size gets larger, performance continues to improve.
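 The two packing-density calculations above can be checked with a short Poisson sketch (names are ours):

import math

def p(x, density):
    return (density ** x) * math.exp(-density) / math.factorial(x)

def overflow_fraction(r, N, bucket_size, terms=25):
    # Expected fraction of the r records that overflow when each of the
    # N addresses is a bucket holding up to bucket_size records.
    d = r / N
    expected = N * sum((x - bucket_size) * p(x, d) for x in range(bucket_size + 1, terms))
    return expected / r

print(f"{overflow_fraction(750, 1000, 1):.1%}")   # about 29.6% with bucket size 1
print(f"{overflow_fraction(750, 500, 2):.1%}")    # about 18.7% with bucket size 2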
d. Bucket Structure
 The only difference between a file with buckets and one in which each address can hold only one key is that
with a bucket file each address has enough space to hold more than one logical record.
 All records that are housed in the same bucket share the same address.
 Suppose, for example, that we want to store as many as five names in one bucket.
 Here are three such buckets with different numbers of records .
 Each bucket contains a counter that keeps track of how many records it has stored in it.
 Collisions can occur only when the addition of a new record causes the counter to exceed the number of
records a bucket can hold.
 The counter tells us how many data records are
stored in a bucket, but it does not tell us which
slots are used and which are not.
 We need a way to tell whether or not a record
slot is empty.
 One simple way to do this is to use a special
marker to indicate an empty record, just as we
did with deleted records earlier.
 We use the key value ///// to mark empty records in the preceding illustration.
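 A bucket of this kind can be pictured with a small Python sketch (the bucket size of five and the /////
marker follow the illustration above; the field names are ours):

BUCKET_SIZE = 5
EMPTY = "/////"

class Bucket:
    def __init__(self):
        self.count = 0                         # how many records the bucket currently holds
        self.keys = [EMPTY] * BUCKET_SIZE      # empty slots are marked with /////

    def add(self, key):
        if self.count == BUCKET_SIZE:
            return False                       # bucket full: the new record must overflow
        self.keys[self.keys.index(EMPTY)] = key
        self.count += 1
        return True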
e. Initializing a File for Hashing
 Since the logical size of a hashed file must remain fixed, it makes sense in most cases to allocate physical
space for the file before we begin storing data records in it.
 This is generally done by creating a file of empty spaces for all records , and then filling the slots as they are
needed with the data records .
f. Loading a Hash File
 A program that loads a hash file is similar in many ways to earlier programs we used for populating
fixed-length record files, with two differences.

 First, the program uses the function hash() to produce a home address for each key.
 Second, the program looks for a free space for the record by starting with the bucket stored at its home
address and then, if the home bucket is full, continuing to look at successive buckets until one is found that
is not full.
 The new record is inserted in this bucket, which is rewritten to the file at the location from which it was read.
 If, as it searches for an empty bucket, a loading program passes the maximum allowable address, it must
wrap around to the beginning address.
7. Making Deletions
 When progressive overflow is used, a search for a record terminates if an open address is encountered.
 Because of this, we do not want to leave open
addresses that break overflow searches
improperly.
 The following example illustrates the problem.
 Adams, Jones, Morris, and Smith are stored in a
hash file in which each address can hold one
record.
 Adams and Smith both are hashed to address 5,
and Jones and Morris are hashed to address 6.
 If they are loaded in alphabetical order using
progressive overflow for collisions, they are
stored in the locations shown in Fig.
 A search for Smith starts at address 5 (Smith's
home address) , successively looks for Smith at
addresses 6, 7, and 8, then finds Smith at 8.
 Now suppose Morris is deleted, leaving an
empty space, as illustrated in the figure.
 A search for Smith again starts at address 5, and
then looks at addresses 6 and 7. Since address 7
is now empty, it is reasonable for the program to
conclude that Smith's record is not in the file.

a. Tombstones for Handling Deletions


 One simple technique we use for identifying deleted records involves replacing the deleted record (or just its
key) with a marker indicating that a record once lived there but no longer does.
 Such a marker is sometimes referred to as a tombstone.
 The nice thing about the use of tombstones is that it solves both of the problems described previously :
 The freed space does not break a sequence of searches for a record;
 The freed space is obviously available and may be reclaimed for later additions.
 Figure illustrates how the sample file might look after the tombstone ###### is inserted for the
deleted record.
 Now a search for Smith does not halt at the empty record number 7. Instead, it uses the ###### as an
indication that it should continue the search.
 It is not necessary to insert tombstones every time a deletion occurs. For example, suppose in the preceding
example that the record for Smith is to be deleted.
 Since the slot following the Smith record is empty, nothing is lost by marking Smith's slot as empty rather
than inserting a tombstone.
b. Implications of Tombstones for Insertions

 With the introduction of the use of tombstones, the insertion of records becomes slightly more difficult than
our earlier discussions imply.
 Whereas programs that perform initial loading simply search for the first occurrence of an empty record slot
(signified by the presence of the key /////), it is now permissible to insert a record where either ///// or ######
occurs as the key.
 This new feature, which is desirable because it yields a shorter average search length, brings with it a certain
danger.
 Consider, for example, the earlier example in which Morris is deleted, giving the file organization shown in
Figure above.
 Now suppose you want a program to insert Smith into the file. If the program simply searches until it
encounters ######, it never notices that Smith is already in the file.
 We almost certainly don't want to put a second Smith record into the file, since doing so means that later
searches would never find the older Smith record.
 To prevent this from occurring, the program must examine the entire cluster of contiguous keys and
tombstones to ensure that no duplicate key exists and then go back and insert the record in the first available
tombstone, if there is one.
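 The deletion and insertion rules just described can be sketched as follows (the ///// and ###### markers
follow the text; the in-memory slot layout is an assumption for illustration):

EMPTY, TOMBSTONE = "/////", "######"

def delete(slots, home, key):
    addr = home
    while slots[addr] != EMPTY:                     # progressive-overflow search for the key
        if slots[addr] == key:
            nxt = (addr + 1) % len(slots)
            # A tombstone is needed only if removing the record could break a later
            # search; if the next slot is already empty, just mark this one empty too.
            slots[addr] = EMPTY if slots[nxt] == EMPTY else TOMBSTONE
            return True
        addr = (addr + 1) % len(slots)
    return False

def insert(slots, home, key):
    addr, first_free = home, None
    while slots[addr] != EMPTY:                     # examine the whole cluster first
        if slots[addr] == key:
            return False                            # duplicate key: do not insert it twice
        if slots[addr] == TOMBSTONE and first_free is None:
            first_free = addr                       # remember the first reusable slot
        addr = (addr + 1) % len(slots)
    slots[first_free if first_free is not None else addr] = key
    return True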
8. Other Collision Resolution Techniques
a. Double Hashing
 One of the problems with progressive overflow is that if many records hash to buckets in the same vicinity,
clusters of records can form.
 As the packing density approaches one, this clustering tends to lead to extremely long searches for some
records.
 One method for avoiding clustering is to store overflow records a long way from their home addresses by
double hashing.
 With double hashing, when a collision occurs, a second hash function is applied to the key to produce a
number c that is relatively prime to the number of addresses.
 The value c is added to the home address to produce the overflow address.
 If the overflow address is already occupied, c is added to it to produce another overflow address.
 This procedure continues until a free overflow address is found.
 Double hashing does tend to spread out the
records in a file, but it suffers from a potential
problem that is encountered in several improved
overflow methods: It violates locality by
deliberately moving overflow records some
distance from their home addresses, increasing
the likelihood that the disk will need extra time
to get to the new overflow address.
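 A hedged sketch of the idea (the second hash shown here is one common choice, not necessarily the exact
function the text has in mind):

def step_size(key_sum, num_addresses):
    # Second hash function: derive a step c in the range 1..num_addresses-1.
    # If num_addresses is prime, any such c is relatively prime to it.
    return 1 + key_sum % (num_addresses - 1)

def double_hash_probes(home, c, num_addresses):
    # Overflow addresses visited: home, home+c, home+2c, ... (mod num_addresses).
    addr = home
    for _ in range(num_addresses):
        yield addr
        addr = (addr + c) % num_addresses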

b. Chained Progressive Overflow


 Chained progressive overflow is another technique designed to avoid the problems caused by clustering.
 It works in the same manner as progressive overflow, except that synonyms are linked together with pointers.
 That is, each home address contains a number indicating the location of the next record with the same home
address.
 The next record in turn contains a pointer to the following record with the same home address, and so forth.
 The net effect of this is that for each set of synonyms there is a linked list connecting their records , and it is
this list that is searched when a record is sought.
 The advantage of chained progressive overflow over simple progressive overflow is that only records with
keys that are synonyms need to be accessed in any given search.
 Suppose, for example, that the set of keys shown in Fig. below is to be loaded in the order shown into a hash
file with bucket size one, and progressive overflow is used.

 A search for Cole involves an access to Adams (a synonym) and Bates (not a synonym) . Flint, the worst
case, requires six accesses, only two of which involve synonyms.
 Since Adams, Cole, and Flint are synonyms, a chaining algorithm forms a linked list connecting these three
names, with Adams at the head of the list.
 Since Bates and Dean are also synonyms, they form a second list. This arrangement is illustrated in Figure.
 With chaining, the average search length decreases from 2.5 to 1.7.
 The use of chained progressive overflow requires that we attend to some details that are not required for
simple progressive overflow.
 Suppose that in the example Dean's home address is 22 instead of 21.


 Since, by the time Dean is loaded, address 22 is already occupied by Cole, Dean still ends up at address 23.
 The question is what link to store at address 22, where a search for Dean begins: if the pointer is 25, the
linked list joining Adams, Cole, and Flint is kept intact, but Dean is lost; if the pointer is 23, Flint is lost.
 The problem here is that a certain address (22) that should be occupied by a home record (Dean) is occupied
by a different record.
 One solution to the problem is to require that every address that qualifies as a home address for some record
in the file actually hold a home record.
 The problem can be handled easily when a file is first loaded by using a technique called two-pass loading.
 On the first pass, only home records are loaded. All records that are not home records are kept in a separate
file. This guarantees that no potential home addresses are occupied by overflow records.
 On the second pass, each overflow record is loaded and stored in one of the free addresses according to
whatever collision resolution technique is being used.
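 A small sketch of chained progressive overflow (it assumes, as two-pass loading guarantees, that every
occupied home address holds a home record; the slot layout and names are illustrative):

FILE_SIZE = 100
keys  = [None] * FILE_SIZE      # None marks an empty address
links = [-1] * FILE_SIZE        # -1 marks the end of a synonym chain

def insert(home, key):
    free = home
    while keys[free] is not None:           # find a free slot by progressive overflow
        free = (free + 1) % FILE_SIZE
    keys[free] = key
    if free != home:                        # link the record onto the end of its synonym chain
        addr = home
        while links[addr] != -1:
            addr = links[addr]
        links[addr] = free

def search(home, key):
    addr = home
    while addr != -1 and keys[addr] is not None:
        if keys[addr] == key:
            return addr
        addr = links[addr]                  # follow only the chain of synonyms
    return None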
c. Chaining with a Separate Overflow Area
 One way to keep overflow records from occupying home addresses where they should not be is to move
them all to a separate overflow area.
 Many hashing schemes are variations of this basic approach. The set of home addresses is called the prime
data area, and the set of overflow addresses is called the overflow area.
 The advantage of this approach is that it keeps all unused but potential home addresses free for later additions.
 In terms of the file we examined in the preceding
section, the records for Cole, Dean, and Flint
could have been stored in a separate overflow
area rather than in potential home addresses for
later-arriving records.
 Now no problem occurs when a new record is
added. If its home address has room, it is stored
there.
 If not, it is moved to the overflow file, where it is added to the linked list that starts at the home address
 If the bucket size for the primary file is large enough to prevent excessive numbers of overflow records, the
overflow file can be a simple entry-sequenced file with a bucket size of one.
 Space can be allocated for overflow records only when it is needed.

d. Scatter Tables : Indexing Revisited


 Suppose you have a hash file that contains no
records, only pointers to records.
 The file is obviously just an index that is
searched by hashing rather than by some other
method.
 The term scatter table is often applied to this
approach to file organization.
 Figure illustrates the organization of a file using a scatter table. The scatter table organization provides many
of the same advantages simple indexing generally provides, with the additional advantage that the search of
the index itself requires only one access.
 The data file can be implemented in many different ways.
 For example, it can be a set of linked lists of synonyms, a sorted file, or an entry-sequenced file.
 Also, scatter table organizations conveniently support the use of variable-length records.
9. Extendible Hashing
a. Tries
 The key idea behind extendible hashing is to combine conventional hashing with another retrieval approach
called the trie.
 Tries are also sometimes referred to as radix
searching because the branching factor of the
search tree is equal to the number of alternative
symbols (the radix of the alphabet) that can
occur in each position of the key.
 Suppose we want to build a trie that stores the
keys able, abrahms, adams, anderson, andrews,
and baird.
 A schematic form of the trie is shown in Figure.
 Since there are 26 symbols in the alphabet, the potential branching factor at every node of the search is 26.
 If we used the digits 0- 9 as our search alphabet,
rather than the letters a-z, the radix of the search
would be reduced to 10. A search tree using
digits might look like the one shown in Figure.
 In searching a trie we sometimes use only a
portion of the key.
 We use more of the key as we need more
information to complete the search. This
use-more-as-we-need-more capability is
fundamental to the structure of extendible
hashing.
b. Turning the Trie into a Directory
 We use tries with a radix of two in our approach to extendible hashing: search decisions are made on a
bit-by-bit basis.
 Suppose we have bucket A containing keys that, when hashed, have hash addresses that begin with the bits
01.
 Bucket B contains keys with hash addresses
beginning with 10, and bucket C contains keys
with addresses that start with 11.
 Figure shows a trie that allows us to retrieve
these buckets.

 Rather than representing the trie as a tree, we
flatten it into an array of contiguous records,
forming a directory of hash addresses and
pointers to the corresponding buckets.
 The first step in turning a tree into an array
involves extending it so it is a complete binary
tree with all of its leaves at the same level as
shown in Figure.
 Even though the initial 0 is enough to select
bucket A, the new form of the tree also uses the
second address bit so both alternatives lead to
the same bucket.
 Once we have extended the tree this way, we can
collapse it into the directory structure shown in
Figure.
c. Splitting to Handle Overflow
 The goal in an extendible hashing system is to find a way to increase the address space in response to
overflow, rather than responding by creating long sequences of overflow records and buckets that have to be
searched linearly.
 Suppose we insert records that cause bucket A in Figure above to overflow.
 In this case the solution is simple: Since addresses beginning with 00 and 01 are mixed together in bucket A,
we can split bucket A by putting all the 01 addresses in a new bucket D, while keeping only the 00 addresses
in A.
 Put another way, we already have two bits of address information, but are throwing one away as we access
bucket A.
 So, now that bucket A is overflowing, we must
use the full two bits to divide the addresses
between two buckets.
 We do not need to extend the address space; we
simply make full use of the address information
that we already have.
 Figure below shows the directory and buckets
after the split.
 Starting once again with the directory and buckets in same Figure, suppose that bucket B overflows.
 Unlike our previous example, we do not have
additional, unused bits of address space that we
can press into duty as we split the bucket.
 We now need to use three bits of the hash
address in order to divide up the records that
hash to bucket B.
 The trie illustrated in Figure makes the
distinctions required to complete the split.
 Figure below shows what this trie looks like once it is extended into a completely full binary tree with all
leaves at the same level and shows the collapsed, directory form of the trie.
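 In code, the directory for the example above can be pictured as a small table indexed by the leading bits of
the hashed key (a schematic Python sketch; a real directory doubles in size when more address bits are needed):

def directory_index(hash_bits, depth):
    # hash_bits is a string of bits such as "0110..."; only the first
    # `depth` bits are used to index the directory.
    return int(hash_bits[:depth], 2)

depth = 2
directory = ["A", "A", "B", "C"]          # cells 00 and 01 share bucket A, as in the figure

# Splitting bucket A uses the second bit that was previously being ignored:
directory[0b00], directory[0b01] = "A", "D"
bucket = directory[directory_index("01101", depth)]   # -> "D" after the split

# Splitting bucket B, by contrast, would need a third bit, so the directory
# itself would have to double (depth 2 -> 3) before the split.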

10. Alternative Approaches


a. Dynamic Hashing
 Functionally, dynamic hashing and extendible hashing are very similar.
 Both use a directory to track the addresses of the buckets, and both extend the directory through the use of
tries.
 The key difference between the approaches is that dynamic hashing, like conventional, static hashing, starts
with a hash function that covers an address space of a fixed size.
 As buckets within that fixed address space overflow, they split, forming the leaves of a trie that grows down
from the original address node.
 Eventually, after enough additions and splitting, the buckets are addressed through a forest of tries that have
been seeded out of the original static address space.

 Let's look at an example. Figure (a) shows an initial address space of four, and four buckets descending from
the four addresses in the directory.
 In Figure (b) we have split the bucket at address 4. We address the two buckets resulting from the split as 40
and 41.
 We change the shape of the directory node at address 4 from a square to a circle because it has changed from
an external node, referencing a bucket, to an internal node that points to two child nodes.
 In Figure (c) we split the bucket addressed by node 2, creating the new external nodes 20 and 21.
 We also split the bucket addressed by 41, extending the trie downward to include 410 and 411.
 Because the directory node 41 is now an internal node rather than an external one, it changes from a square
to a circle.
 As we continue to add keys and split buckets, these directory tries continue to grow.
 Finding a key in a dynamic hashing scheme can involve the use of two hash functions, rather than just one.
 First, there is the hash function that covers the original address space.
 If you find that the directory node is an external node, and therefore points to a bucket, the search is complete.
 However, if the directory node is an internal node, then you need additional address information to guide
you through the ones and zeroes that form the trie.
b. Linear Hashing
 The key feature of both extendible hashing and dynamic hashing is that they use a directory to direct access
to the actual buckets containing the key records.
 This directory makes it possible to expand and modify the hashed address space without expanding the
number of buckets
 After expanding the directory, more than one directory node can point to the same bucket.

 However, the directory adds additional layers of indirection which, if the directory must be stored on disk,
can result in an additional seek.

 An example, developed in Figure, shows how linear hashing works.


 Linear hashing, like extendible hashing, uses more bits of hashed value as the address space grows.
 The example begins (Fig. a) with an address space of four, which means that we are using an address function
that produces addresses with two bits of depth.
 As we add records, bucket b overflows. The overflow forces a split.
 However, as Fig. (b) shows, it is not bucket b that splits, but bucket a.
 The reason for this is that we are extending the address space linearly, and bucket a is the next bucket that
must split to create the next linear extension, which we call bucket A.
 Since bucket b was not the bucket that we split, the overflowing record is placed into an overflow bucket w.
 We add more records, and bucket d overflows. Bucket b is the next one to split and extend the address
space, dividing the records from bucket b and its overflow bucket w between b and the new bucket B.
 The record overflowing bucket d is placed in an overflow bucket x. The resulting arrangement is illustrated
in Fig. (c).
 Figure (d) shows what happens when, as we add more records, bucket d overflows beyond the capacity of
its overflow bucket x.
 Finally, assume that bucket B overflows. The overflow record is placed in the overflow bucket z.
 The overflow also triggers the extension to bucket D, dividing the contents of d, x, and y between buckets d
and D.
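 The addressing rule behind this scheme is often written as follows (a common textbook formulation of
linear hashing, offered here as a hedged sketch rather than a transcription of the figure):

def linear_address(hashed, d, next_to_split):
    # d bits of the hashed value normally pick the bucket; buckets that have
    # already been split (those below the split pointer) use d + 1 bits.
    addr = hashed % (2 ** d)
    if addr < next_to_split:
        addr = hashed % (2 ** (d + 1))    # distinguishes, e.g., bucket a from bucket A
    return addr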
Question bank
1. What is hashing? Explain simple hashing algorithm with example
2. Explain the various collision resolution techniques, with examples.
3. Suppose that 10,000 addresses are allocated to hold 8,000 records in a randomly hashed file and that each
address can hold one record. Compute the following values.
a. The packing density for the file
b. The expected no. of addresses with no records assigned to them by the hash function
c. The expected no. of addresses with one record assigned
d. The expected no. of overflow records
4. Briefly explain the different collision resolution techniques by progressive overflow.
5. Explain double hashing technique
6. Discuss the issues that are involved in implementing the hashed file
7. How is tombstone used for handling deletion of records from a hashed file?
8. Briefly discuss the working of extendible hashing
9. Write a note on dynamic hashing
10. Write a note on linear hashing
11. Write a note on extendible hashing
12. Explain how extendible hashing works.
13. Analyze the performance of extendible hashing
**************************************************************************
