0% found this document useful (0 votes)
38 views44 pages

IN3020/4020 - Database Systems Spring 2020, Week 3.1 Indexing

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
38 views44 pages

IN3020/4020 - Database Systems Spring 2020, Week 3.1 Indexing

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 44

IN3020/4020 – Database Systems

Spring 2020, Week 3.1

INDEXING

Dr. M. Naci Akkøk, Chief Architect, Oracle Nordics


Based upon slides by E. Thorstensen from Spring 2019
We´ll be looking at indexes & indexing
o ConvenHonal indexes
o B-rees and hashing
o MulHdimensional indexes
o Tree structures
o Hash-like structures
o Bitmap indexes
Index

o An index on an attribute A is a data structure that facilitates


finding the elements with a certain value in A ( the search
key)
o The simplest type is probably the hash-index
o There are other (more advanced) indexing techniques based
upon binary search
Search keys are created to search for a single entry or a set of

Index
entries in an index. Search keys may only be constructed for
the key columns in the index, and may contain one or more
column values.

o An index on an attribute A is a data structure that facilitates finding the


elements with a certain values in A ( the search key)
o The index is sorted on the search key
o For each value of the search key, the index has a list of pointers to the
corresponding records
o More than one index in the same file means
o Faster search
o More complexity (changes will lead to updated indexes as well)
o Increased storage requirement, larger files
Different types of indexes
o Dense vs. sparse indexes
o Primary Index
o The data file is sorted (physically) on the search key
o Maximum one data entry for each search-key value
o Cluster index
o The data file is (sHll) sorted (physically) on the search key
o Allows more than one data entries with same search-key value
o Secondary index
The data file is NOT sorted (physically) on the related search key
A candidate key is a combination of attributes

Overview of indexes that uniquely identify a database record without


referring to any other data. Each table may have
one or more candidate. One of these candidate
keys is selected as the table primary key.

Search key is a candidate Search key is not a


Data file
key candidate key
Primary index dense or
Sorted on search key clustering dense or sparse
sparse
Not sorted on search key Secondary index dense Secondary index dense

Quick overview:
hVps://en.wikipedia.org/wiki/Database_index
Less quick but more detailed overview (for geeks):
hVps://www.geeksforgeeks.org/indexing-in-databases-set-1/
Primary index:
Dense vs. Sparse Indexes
A dense index has one
lookup for each value of the
search key

A sparse index has one


lookup for each data block
• Assume that we have

A simple comparison
• 1.000.000 entries of 300B,4B search key, 4B pointers
• 4K block size, average 5,6ms for fetching a block
• 13,6 records per block, i.e., 76924. blocks with data
• 512 indexes per block, i.e., 1954 blocks for a dense
o Without index: index and 151 blocks for a sparse index
o 76924/2 = 38462 average block access.
o Takes 38462 * 5,6 ms = 215,4 s
o With dense index and binary search:
o [log2(1954)] + 1 = 11 + 1 = 12 block accesses (max.),
o Takes 12 * 5,6 ms = 67,2 ms
o 3205 Hmes faster than the one without indexes!
o With sparse index and binary search:
o [log2(151)] + 1 = 8 + 1 = 9 block accesses (max.),
Takes 9 * 5,6 ms = 50,4 ms
o 4272 Hmes faster as compared to no index, and 1,33 Hmes faster than dense index
A simple comparison
o An index can occupy several blocks
o A mulH-level index (i.e., an index on an
index) can improve performance
o ConHnuing the example:
o Needs only 1954 / 512 = 4 blocks for 2. level h
o [log2(4)] + 1 + 1 = 2 + 1 + 1 = 4 block access,
takes 4 * 5,6 ms = 22,4 ms
o 2,25 faster than a simple sparse index,
3 Hmes faster than a dense index
o One can in principle have any number of
indices
Deletion in the case of
sparse index
o Delete entry where a = 60
o No change necessary for the
index

o Delete entry with a = 40


o The first entry in the block
is updated, which means
that the index needs to be
updated
Deletion in the case of
dense index
o Delete entry where a = 60
o Delete entry where a = 40

o One ususally prefers to


compress the data in the
blocks
o One can also compress the
whole data set, but we usually
like to keep free space for
future inserts
Insertion in the case of
sparse index
o Insert entry with a = 60
o Lucky! Place available exactly
where we need it!
o Insert entry with a = 25
o Must move entry with a = 30 to
next block to make space for 25
o The first entry on block two is
changed, and the index must be
updated
o NOTE: We could have inserted a
new block or overflow block
InserQon in the case of sparse
(and dense) index
o Insert entry with a = 95
o No space. Insert a new/overflow
block
o Overflow block: No need to do
much with the indexes. Need only
a pointer to the main blocks
o Overflow block: The index needs
to be updated too
o Insertion in the case of dense
indexes is the same, except that the
index must be updated every time
Cluster Index - Duplicate (or Multiple)
Search Keys
o If the file is sorted, a cluster
index can be used even if the
search key is not uniqe
o Example 1 – dense index:
o One index field per entry
o Makes it easy to find entries
and how there are of each
o Too many fields? More than
necessary? 🧐
Cluster Index – Dense Index
o Only one index field per
uniqe search key
o Less index – faster search

o Finding subsequent
(intermediary) records is
more complex 🙃
Cluster Index – Sparse Index
o Index fields point to the
first entry in each block

o Even fewer indexes – faster


search ☺
o Finding records is more
complex 🙃
Unsorted Files and Secondary Indices
o One can use a secondary
index if the files is
unsorted (or sorted on
another attribute)
o Sorted on the search
key: fast search
o First level is always
dense, higher levels are
sparse
o Duplicates are allowed
Dense vs. Sparse Indices

DENSE SPARSE
Space required One index field per entry One index field per data block
Block access “Many” “Feẘ”
Access to entry Direct access Must search in the data block
Must always access the data
Exists-queries Uses only the index
block
Usage All cases Not on unsorted elements
Always updated if entry Updated only if the first entry of
Updates/changes
sequence is changed the data block is changed
Inverted Indices

o What if we want to search for items within an attribute?


o SELECT * from R where a like ´%cat%´
o Search for documents that contain certain keywords, e.g.
search engines like Google, Altavista, Excite, Lycos,
AllTheWeb etc.
Inverted Indices
(called “inverted” because…)
o Direct Index: Entries of the form
(id1: document1), (id2: document2), ...
o Look up an ID, access the document
o Inverted index: Entries of
the form
(computer: [id1]),
(disk: [id2, id3]), ...
o Look up a keyword,
access all relevant
document IDs
Data Structures for Indices

o So, how do we represent an index?

o Obvious idea: Sorted list.


o BeVer ideas?
B+ Trees (Balanced Binary Trees)
o The nodes are blocks. All leaf nodes are at the same level. Each node has n search keys and n + 1
pointers.
o Inner node: all pointers are to sub-nodes.
o Leaf node: n data-pointers and 1 next-pointer.

o All nodes must contain a certain amount of


search keys / pointers
o Inner node: at least ⌈ (n + 1) / 2 ⌉ pointers to sub-nodes.
o Leaf node: at least ⌊ (n + 1) / 2 ⌋ data pointers.
B+ Trees: Efficiency
o B+ trees:
☹ A search must always start from the root to a leaf node, i.e., the number of block accesses is
equal to the height of the tree plus access to the records themselves in the data file.
☺ The number of levels is usually very low (typically 3)
☺ Interval search is fast
☺ For large n, it is rarely necessary to split / merge nodes
☺ Disk I / O can be reduced by keeping some of the index blocks in memory
o Example: 4B search keys, 8B pointers, 4KB blocks
o How many values can be stored in each node?
4𝑛 + 8 𝑛 + 1 ≤ 4096 → 𝑛 = 𝟑𝟒𝟎
o The nodes are on the average 75% full. How many records can a 3-level B+ tree contain?
340 × 75% 3 = 16679103,875 ≈ 16,6 million records!
Hash Tables
o Uses a hash function on the search key to an array
index with points to which bucket possibly contains
information about the current record
o Each bucket is a block (with support for overflow
blocks)
o Array size is usually a prime number
o Important for a hash-function!
o Fast
o A good distribution of the search keys to
buckets
o Example: Array size B = 5; h(key) = mod(key, B)
Hash Table: Efficiency
o Ideally, the array size is large enough so that all elements of
one hash value fit into one bucket block.

☺ Then we get significantly fewer disk operaHons than with


regular indices and B-trees
☺ Fast search for specific search key
☹ MulHple entries can lead to more blocks per bucket
☹ Poor on interval search
Dynamic Hash Tables
o Difficult to keep all items for one hash value within a bucket
block if the number of records increases while the hash
table remains staHc
o Dynamic hash tables allow the array size to vary so that it is
sufficient with one block per bucket
o Extensible hashing
o Linear hashing
Sequential vs. Hash Indices
o Sequential indexes such as B trees are good at interval
search:

select * from R where a > 5

o Hash indexes are good when searching for a specific key:

select * from R where a = 5


Indices in SQL
o Syntaks (DBMS-dependent):
o create index name on relaHonName (aVribute)
o create unique index name on relaHonName (aVribute)
o drop index name

o NOTE: Not all DBMSs allow us to specify


o type of index, e.g. B-three, hashing, etc.
o parameters such as load factor, hash size etc.
Indices in PSQL
Default index in Postgres is B-tree.
“B-trees can handle equality and range queries on data that
can be sorted into some ordering”
https://www.postgresql.org/docs/9.2/static/indexes-
types.html

Works for =, <, <=, BETWEEN and IN.


Also for LIKE, but limited.
Indices in PSQL

o There is also something called GiST. This is a mechanism for


implementing indexes for particular data types.

o Can be used in many cases. Some is included in postgres by


default.
Queries with several condiQons
select ... from R where a = 30 and b < 5

oStrategy 1:
o Use an index, f.ex. on a
o Find and fetch all records with a = 30.
o Search through these records to find records with b < 5.
☺ Simple strategy
☹ Risks reading many unnecessary records from disk (storage)
Several conditions: Strategy 2
select c from R where a = 30 and b = ´x´

o Use two dense indexes: One


for a, one for b
o Find all pointers to records
with a = 30.
o Find all pointers to entries with
b=‘x’.
o Compare (intersect) pointers
and fetch the relevant records
o Pick the relevant attributes
Multidimensional indices
o A multidimensional index combines multiple dimensions in
one index
o A simple tree-like approach:
Multidimensional indices: Example (with
dense indices)
select c from R where a = 30 and b = ´x´

o Search key = (30, x)


o Read the a dimension.
o Look for 30, find the
related index fo the b-
dimension
o Search for x, read relevant disk
block and fetch the record
o Pick ou the relevant aVributes
Multidimensional indices: For which queries
is this a good index to use?
😊 Find records with a = 10 and b = ´x´
😊 Find records with a = 10 and b >= ´x´
☹ Find records with a = 10
☹ Find records with b = ´x´
🤔 Find records with a >= 10 and b = ´x´
o Risks searching through many indexes in the next dimension
o Would have been betrer if the dimension order could be changed
o But there are many other alternaHves...
o Other tree-like structures,
o Hash-like structures
o Bitmap indices
Map View

o We can visulize a multidimensional


index with two dimensions as a map

o Search is then equivalent to a serach


on the map for:
o Points: a1 and b1
o Lines: a2 and < b2, b3 >
o Areas: < a3, a4 > and < b4, b5 >
Tree structures
o There are many tree-like structures that correspond to searching for map
areas:
o k-d trees (a binary search tree in which every leaf node is a k-dimensional
point. Very useful for range and nearest neighbor searches, and for
multidimensional search keys)
o Quad-trees (a tree data structure in which each internal node has exactly
four children. Often used to partition a two-dimensional space by
recursively subdividing it into four quadrants or regions. Used in image
compression, also in spatial search etc.)
o R-trees (tree data structures used for spatial access methods, i.e., for
indexing multi-dimensional information such as geographical coordinates,
rectangles or polygons. )
Tree structures - characterisQcs

o All these tree structures must saHsfy at least one of the


following characterisHcs of B trees:
o Balancing - all leaf nodes are at the same level
(a B-tree is a self-balancing tree)
o Correspondence between tree nodes and disk blocks
o Performance for update operaHons.
R-trees
o The basic idea:
Group geometric objects that are close to each other
o Nodes are blocks. All leaf nodes are at the same level
o Inner node: The smallest rectangle that covers all the
objects in the subtree
o Leaf node: An object
o All inner nodes must contain a certain number of pointers

o Search (intersecHon, contained in, nearest neighbor) is easy


o InserHon is challenging
o The tree should be completely balanced
o Rectangles should not contain too much space
o Rectangles with common parent node should not overlap
o May need to delete and re-insert objects for beVer
placement
Hash-like structures: Grid files
Grid files extend traditional hash indices to
multiple dimensions
o Hashes values for each attribute in a multi-
dimensional index
o Does not usually hash to individual values
but to regions: h(key) = <x, y>
o Grid lines partition areas into stripes
Example (2 dimensions):
Find record with (a, b) = (22, 31)
h1(22) = <ax, ay>
h2(31) = <bm, bn>
→ enter f
Grid files
o Grid files are good for finding records/entries with
o key1 = Vi and key2 =Xj
o key1 = Vi
o key2 = Vj
o key1 >= Vi and key2 < Xj
o Grid files:
☺ Are good for searches with mulHple keys
☹ Use too much space, and require some ornagizing
Bitmap indices
o StarHng point: Each recorder is assigned an unchanging,
unambiguous number
o Numbering from 1 to n
o The number can be considered a record ID and cannot be reused
even if the record is deleted
o Select the field F to be indexed
o For each value v used for F in one of the records, create a bit
vector bv of length n
o If record nr. i has F = v, let bv[i]=1
o If record nr. i has F ≠ v, let bv[i]=0
Bitmap indices: Example
Bitmap index characterisQcs
o Space requirements:
o Total number of bits is #records * #values
o In the worst case, n2 bits is needed (but then each bit vector has only one 1-bit)
o Bit vectors can be compressed; there is never more than n 1-bits total in the bit
vectors
o Effective for:
o Partial match queries (= state values for some fields, find all that have given
values)
o Calculate bit by bit and across the bitmap indexes for the relevant attributes
o Range queries (= enter intervals for some fields, find all that have values within
the ranges)
o Calculate bit by bit within the intervals

You might also like