Indexing & Hashing
● Basic Concepts
● Ordered Indices
● B+-Tree Index Files
● B-Tree Index Files
● Static Hashing
● Dynamic Hashing
● Comparison of Ordered Indexing and Hashing
● Index Definition in SQL
● Multiple-Key Access
[Figure: memory/storage hierarchy (registers, L1 cache, L2 cache, main memory, disk) with representative access times, from tens of nanoseconds (40–90 ns shown for the caches) up to about 14 ms = 14,000,000 ns at the slowest level.]
● Index files are typically much smaller than the original file
● Two basic kinds of indices:
Ordered indices: search keys are stored in sorted order
Hash indices: search keys are distributed uniformly across
“buckets” using a “hash function”.
The reason this design is a good trade-off is that the dominant cost in
processing a database request is the time that it takes to bring a block
from disk into main memory. Once we have brought in the block, the
time to scan the entire block is negligible.
● Typical node: holds search-key values K1, …, Kn−1 and pointers P1, …, Pn; in a leaf node the pointers lead to records (or buckets of records), in a non-leaf node to child nodes
● Find the leaf node in which the search-key value would appear
● If the search-key value already appears in the leaf node, the record is
added to the file and, if necessary, a pointer is inserted into the
bucket.
● If the search-key value is not present, then add the record to the
main file and create a bucket if necessary. Then:
If there is room in the leaf node, insert the (key-value, pointer) pair into
the leaf node
Otherwise, split the node (along with the new (key-value, pointer)
entry) as described below.
● Splitting a node:
take the n (search-key value, pointer) pairs (including the one being
inserted) in sorted order. Place the first ⌈n/2⌉ in the original node,
and the rest in a new node.
let the new node be p, and let k be the least key value in p. Insert
(k, p) into the parent of the node being split. If the parent is full, split it
and propagate the split further up.
● The splitting of nodes proceeds upwards until a node that is not
full is found. In the worst case, the root node may be split,
increasing the height of the tree by 1. (A minimal sketch of leaf
insertion and splitting follows below.)
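A minimal sketch in Python of the leaf-level insertion and split described above. The node layout, the capacity N, and the names Leaf and insert_in_leaf are illustrative assumptions, not the textbook's implementation.

import bisect
from math import ceil

N = 4  # assumed maximum number of (key, pointer) pairs in a leaf

class Leaf:
    def __init__(self):
        self.keys = []      # search-key values, kept in sorted order
        self.ptrs = []      # record (or bucket) pointers, one per key
        self.next = None    # next leaf in the B+-tree leaf chain

def insert_in_leaf(leaf, key, ptr):
    # Insert (key, ptr); on a split, return (new_leaf, least key in new_leaf)
    # so the caller can insert that pair into the parent, splitting the
    # parent in turn if it is full; otherwise return None.
    i = bisect.bisect_left(leaf.keys, key)
    leaf.keys.insert(i, key)
    leaf.ptrs.insert(i, ptr)
    if len(leaf.keys) <= N:
        return None                      # room in the leaf: done
    # Split: keep the first ceil(n/2) of the n pairs (including the new
    # one) in the original node, move the rest to a new node.
    mid = ceil(len(leaf.keys) / 2)
    new = Leaf()
    new.keys, leaf.keys = leaf.keys[mid:], leaf.keys[:mid]
    new.ptrs, leaf.ptrs = leaf.ptrs[mid:], leaf.ptrs[:mid]
    new.next, leaf.next = leaf.next, new
    return new, new.keys[0]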
● Otherwise, if the node has too few entries due to the removal, and
the entries in the node and a sibling fit into a single node, then
Merge the two nodes into one (deleting the other node), and remove
the (search-key value, pointer) pair that pointed to the deleted node
from the parent, applying the deletion procedure to the parent recursively.
● Otherwise, if the entries in the node and a sibling do not fit into a
single node, then
Redistribute the pointers between the node and the sibling such that
both have more than the minimum number of entries.
Update the corresponding search-key value in the parent of the
node.
● The node deletions may cascade upwards until a node which has
⌈n/2⌉ or more pointers is found. If the root node has only one
pointer after deletion, it is deleted and the sole child becomes
the root. (A sketch of the merge/redistribution step follows below.)
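Continuing the sketch above, an illustrative version of the underflow handling: merge when the node and its (left) sibling fit into one node, otherwise redistribute and update the parent's separator key. The parent is assumed to keep parallel keys and children lists; the names are assumptions.

def handle_underflow(leaf, left_sibling, parent, sep_index):
    # Called when a deletion leaves `leaf` with too few entries;
    # parent.keys[sep_index] separates left_sibling from leaf.
    if len(leaf.keys) + len(left_sibling.keys) <= N:
        # Merge: all entries fit in one node; delete `leaf` and remove the
        # separating (key, pointer) pair from the parent, which may itself
        # underflow and cascade the deletion upwards.
        left_sibling.keys += leaf.keys
        left_sibling.ptrs += leaf.ptrs
        left_sibling.next = leaf.next
        del parent.keys[sep_index]
        del parent.children[sep_index + 1]
    else:
        # Redistribute: borrow the sibling's largest entry so both nodes
        # keep at least the minimum, then fix the separator in the parent.
        leaf.keys.insert(0, left_sibling.keys.pop())
        leaf.ptrs.insert(0, left_sibling.ptrs.pop())
        parent.keys[sep_index] = leaf.keys[0]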
● Good space utilization is important since records use more space than
pointers.
● To improve space utilization, involve more sibling nodes in redistribution
during splits and merges
Involving 2 siblings in redistribution (to avoid split / merge where possible)
results in each node having at least ⌊2n/3⌋ entries
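For example, with n = 6 pointers per node, plain splitting guarantees only ⌈6/2⌉ = 3 entries per node, whereas redistributing across 2 siblings guarantees ⌊2·6/3⌋ = 4.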
● Hashing can be used not only for file organization, but also for
index-structure creation.
● A hash index organizes the search keys, with their associated
record pointers, into a hash file structure.
● Strictly speaking, hash indices are always secondary hash
indices
if the file itself is organized using hashing, a separate hash index on
it using the same search-key is unnecessary.
We use the term hash index to refer to both secondary hash indices
and hash-organized files.
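A minimal sketch of a secondary hash index in Python, assuming a fixed number of buckets (static hashing) and an illustrative hash function; each bucket holds (search-key, record-pointer) pairs. The names and the usage values are hypothetical.

NUM_BUCKETS = 8   # assumed fixed number of buckets

class HashIndex:
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def _bucket(self, key):
        # Illustrative hash function: Python's built-in hash reduced modulo
        # the number of buckets; a real system hashes the key's bytes.
        return self.buckets[hash(key) % NUM_BUCKETS]

    def insert(self, key, record_ptr):
        self._bucket(key).append((key, record_ptr))

    def lookup(self, key):
        # Return pointers to all records with this search-key value.
        return [ptr for k, ptr in self._bucket(key) if k == key]

# Hypothetical usage: a hash index on branch-name
idx = HashIndex()
idx.insert("Perryridge", ("block7", 3))
idx.insert("Downtown", ("block2", 0))
print(idx.lookup("Perryridge"))   # [('block7', 3)]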
● Create an index
create index <index-name> on <relation-name>
(<attribute-list>)
E.g.: create index b-index on branch(branch-name)
● Use create unique index to indirectly specify and enforce the
condition that the search key is a candidate key.
Not really required if the SQL unique integrity constraint is supported
● To drop an index
drop index <index-name>
● Bitmaps are packed into words; a single word-AND (a basic CPU
instruction) computes the AND of 32 or 64 bits at once
E.g., two 1-million-bit maps can be ANDed with just 31,250 instructions
(using 32-bit words)
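A one-line sketch of the word-wise AND, assuming each bitmap is stored as a list of machine words; with 32-bit words a 1-million-bit map takes 1,000,000 / 32 = 31,250 such ANDs (15,625 with 64-bit words).

def and_bitmaps(a, b):
    # Intersect two bitmaps stored as equal-length lists of machine words.
    return [wa & wb for wa, wb in zip(a, b)]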
● Counting the number of 1s can be done quickly by a trick:
Use each byte to index into a precomputed array of 256 elements,
each storing the count of 1s in the binary representation of its index
★ Can use pairs of bytes to speed this up further, at a higher memory
cost
Add up the retrieved counts
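A sketch of the byte-lookup counting trick, assuming the bitmap is stored as 64-bit words: a precomputed 256-element array gives the number of 1 bits in each possible byte value, and each word is summed one byte at a time.

# Precomputed array: ones_in_byte[v] = number of 1 bits in the byte value v
ones_in_byte = [bin(v).count("1") for v in range(256)]

def count_ones(bitmap_words):
    # Count the 1 bits in a bitmap stored as a list of 64-bit words.
    total = 0
    for word in bitmap_words:
        for _ in range(8):            # 8 bytes per 64-bit word
            total += ones_in_byte[word & 0xFF]
            word >>= 8
    return total
# (A 65,536-element table indexed by byte pairs would halve the lookups,
#  at a higher memory cost.)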
● Bitmaps can be used instead of Tuple-ID lists at leaf levels of
B+-trees, for values that have a large number of matching
records
Worthwhile if > 1/64 of the records have that value, assuming a
tuple-id is 64 bits
The above technique merges the benefits of bitmap and B+-tree indices
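For example, with 1,000,000 records a bitmap occupies 1,000,000 bits = 125,000 bytes regardless of how many records match, while a list of 64-bit tuple-ids grows by 8 bytes per matching record and exceeds that size once more than 1,000,000 / 64 = 15,625 records (more than 1/64 of them) have the value.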