
CS8492-DATABASE MANAGEMENT SYSTEM

UNIT-IV IMPLEMENTATION TECHNIQUES


RAID – File Organization –Organization of Records in Files – Indexing and
Hashing –Ordered Indices – B+ tree Index Files – Btree Index Files – Static
Hashing – Dynamic Hashing – Query Processing Overview-Algorithms for
SELECT and JOIN operations-Query optimization using Heuristics and Cost
Estimation.

RAID

Databases are stored as files, which contain records. At the physical level, the
actual data is stored in electromagnetic format on some storage device. These
storage devices can be broadly categorized into three types −

 Primary Storage − The memory storage that is directly accessible to the
CPU comes under this category. The CPU's internal memory (registers), fast
memory (cache), and main memory (RAM) are directly accessible to the
CPU, as they are all placed on the motherboard or CPU chipset. This storage
is typically very small, ultra-fast, and volatile. Primary storage requires a
continuous power supply to maintain its state; in case of a power failure, all
its data is lost.
 Secondary Storage − Secondary storage devices are used to store data for
future use or as backup. Secondary storage includes memory devices that are
not a part of the CPU chipset or motherboard, for example, magnetic disks,
optical disks (DVD, CD, etc.), hard disks, flash drives, and magnetic tapes.


 Tertiary Storage − Tertiary storage is used to store huge volumes of data.
Since such storage devices are external to the computer system, they are the
slowest in speed. These devices are mostly used to back up an entire system.
Optical disks and magnetic tapes are widely used as tertiary storage.

Redundant Array of Independent Disks


RAID, or Redundant Array of Independent Disks, is a technology used to
connect multiple secondary storage devices and use them as a single storage
medium. RAID consists of an array of disks in which multiple disks are
connected together to achieve different goals. RAID levels define the use of
disk arrays.
 RAID 0
In this level, a striped array of disks is implemented. The data is broken
down into blocks, and the blocks are distributed among the disks. Each disk
receives a block of data to write/read in parallel, which enhances the speed
and performance of the storage device. There is no parity or backup in level 0.

 RAID 1
RAID 1 uses mirroring techniques. When data is sent to a RAID controller,
it sends a copy of data to all the disks in the array. RAID level 1 is also
called mirroring and provides 100% redundancy in case of a failure.

 RAID 2
RAID 2 records Error Correction Codes (using a Hamming code) for its data,
striped across different disks. Like level 0, each data bit in a word is recorded
on a separate disk, and the ECC codes of the data words are stored on a
different set of disks. Due to its complex structure and high cost, RAID 2 is
not commercially available.

 RAID 3
RAID 3 stripes the data onto multiple disks. The parity bit generated for each
data word is stored on a separate disk. This technique makes it possible to
recover from single-disk failures.

 RAID 4
In this level, an entire block of data is written onto data disks and then the
parity is generated and stored on a different disk. Note that level 3 uses byte-
level striping, whereas level 4 uses block-level striping. Both level 3 and
level 4 require at least three disks to implement RAID.

 RAID 5
RAID 5 writes whole data blocks onto different disks, but the parity bits
generated for each block stripe are distributed among all the data disks rather
than being stored on a single dedicated disk.


 RAID 6
RAID 6 is an extension of level 5. In this level, two independent parities are
generated and stored in distributed fashion among multiple disks. Two
parities provide additional fault tolerance. This level requires at least four
disk drives to implement RAID.
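The parity idea behind RAID levels 3–6 can be made concrete with a small sketch. This is an illustrative Python fragment, not part of any real RAID controller: it computes a stripe's parity block as the byte-wise XOR of the data blocks, and rebuilds a lost block from the survivors plus the parity. The two-byte block contents are made-up values.

```python
# Hypothetical sketch of striping with parity (as in RAID 3/4/5).
# XOR-ing the data blocks of a stripe yields the parity block; XOR-ing the
# surviving blocks with the parity reconstructs a lost block.

def parity(blocks):
    """Compute the parity block as the byte-wise XOR of all data blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

def reconstruct(surviving_blocks, parity_block):
    """Rebuild the lost block from the surviving blocks and the parity."""
    return parity(list(surviving_blocks) + [parity_block])

stripe = [b"\x01\x02", b"\x0f\x00", b"\xff\x10"]   # one stripe across 3 data disks
p = parity(stripe)                                  # stored on a parity disk
lost = stripe[1]                                    # simulate losing disk 1
rebuilt = reconstruct([stripe[0], stripe[2]], p)
assert rebuilt == lost
```

RAID 5 differs from RAID 4 only in where the parity block is placed: it is rotated among the disks rather than kept on one dedicated disk.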

Related data and information are stored collectively in files. A file is a
sequence of records stored in binary format. A disk drive is formatted into several
blocks that can store records, and file records are mapped onto those disk blocks.

Choice of RAID Level

Factors in choosing a RAID level:

 Monetary cost
 Performance: number of I/O operations per second, and bandwidth during
normal operation
 Performance during failure
 Performance during rebuild of a failed disk, including the time taken to
rebuild the failed disk

 RAID 0 is used only when data safety is not important, e.g., when data can
be recovered quickly from other sources.
 Levels 2 and 4 are never used, since they are subsumed by levels 3 and 5.
 Level 3 is not used anymore, since bit-level striping forces single-block
reads to access all disks, wasting disk arm movement, which block-level
striping (level 5) avoids.
 Level 6 is rarely used, since levels 1 and 5 offer adequate safety for almost
all applications.
o So the competition is between levels 1 and 5 only.

Level 1 provides much better write performance than level 5:
o Level 5 requires at least 2 block reads and 2 block writes to write a single
block, whereas level 1 requires only 2 block writes.
o Level 1 is preferred for high-update environments such as log disks.

Level 1 has a higher storage cost than level 5:
o Disk drive capacities are increasing rapidly (50%/year), whereas disk access
times have decreased much less (about 3x in 10 years).
o I/O requirements have increased greatly, e.g., for Web servers.
o When enough disks have been bought to satisfy the required rate of I/O,
they often have spare storage capacity, so there is often no extra monetary
cost for Level 1.

Level 5 is preferred for applications with a low update rate and large
amounts of data; Level 1 is preferred for all other applications.


File Organization

 A file is a collection of records. Using the primary key, we can access the
records. The type and frequency of access are determined by the type of
file organization used for a given set of records.
 File organization is a logical relationship among various records. This
method defines how file records are mapped onto disk blocks.
 File organization describes the way in which records are stored in terms of
blocks, and how the blocks are placed on the storage medium.
 The first approach to mapping the database to files is to use several files
and store only fixed-length records in any given file. An alternative
approach is to structure the files so that they can contain records of
multiple lengths.
 Files of fixed-length records are easier to implement than files of variable-
length records.

Objective of file organization

 Optimal selection of records, i.e., records can be selected as fast as
possible.
 Insert, delete, or update transactions on the records should be quick and
easy.
 No duplicate records should be induced as a result of an insert, update, or
delete.
 For minimal storage cost, records should be stored efficiently.

Types of file organization:

File organization contains various methods. These particular methods have pros
and cons on the basis of access or selection. In the file organization, the
programmer decides the best-suited file organization method according to his
requirement.


Types of file organization are as follows:

 Sequential file organization


 Heap file organization
 Hash file organization
 B+ file organization
 Indexed sequential access method (ISAM)
 Cluster file organization

ORGANIZATION OF RECORDS IN FILES


Records
– Data is usually stored in form of records.
– Each record consists of a collection of related data values or items.
– Records usually describe entities and their attributes.
– For example, an EMPLOYEE record represents an employee entity, and each
field value in the record specifies some attribute of that employee, such as NAME,
BIRTHDATE, or SALARY.
– A collection of field names and their corresponding data types constitutes a
record type or record format.
File
– A sequence of records.
– Usually all records in a file are of the same record type (Fixed-length records)
1. File Systems Organization – Sequential, Pointer, Indexed, Direct
A file is a collection or set (ordered or unordered) of data elements stored on
storage media. A system software module is responsible for managing (reading,

processing, deleting, etc.) a file.


Logical Structure of a File
Field: Smallest (fixed) indivisible logical unit of a file. A field holds a part of some
data value.
Record: A set of logically related fields. Cardinality of this set may be fixed or
variable, i.e., a record size may be fixed or variable. A file, therefore, is a
collection of logically related records.

Storing the files in certain order is called file organization. The main objective of
file organization is

 Optimal selection of records i.e.; records should be accessed as fast as


possible.
 Any insert, update or delete transaction on records should be easy, quick and
should not harm other records.
 No duplicate records should be induced as a result of insert, update or delete
 Records should be stored efficiently so that cost of storage is minimal.

There are various methods of file organizations. These methods may be efficient
for certain types of access/selection meanwhile it will turn inefficient for other
selections. Hence it is up to the programmer to decide the best suited file
organization method depending on his requirement.

Sequential File Organization

This method is the easiest method for file organization. In this method, files are
stored sequentially. This method can be implemented in two ways:


1. Pile File Method:

 It is a quite simple method. In this method, we store the records in a
sequence, i.e., one after another; records are inserted in the order in which
they arrive.
 In case of updating or deleting a record, the record is first searched for in
the memory blocks. When it is found, it is marked for deletion, and the new
record is inserted.

Insertion of the new record:

Suppose we have records R1, R3, and so on up to R9 and R8 in a sequence.
Records are nothing but rows in a table. If we want to insert a new record R2,
it will be placed at the end of the file.


2. Sorted File Method:

 In this method, the new record is always inserted at the file's end, and then it
will sort the sequence in ascending or descending order. Sorting of records is
based on any primary key or any other key.
 In the case of modification of any record, it will update the record and then
sort the file, and lastly, the updated record is placed in the right place.

Insertion of the new record:

Suppose there is a preexisting sorted sequence of records R1, R3, and so on
up to R6 and R7. If a new record R2 has to be inserted, it will be inserted at
the end of the file, and then the sequence will be sorted.
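The sorted file method above (append at the end, then sort on the key) can be sketched in a few lines of Python; the record names follow the R1…R7 example, and the numeric keys are hypothetical:

```python
# Illustrative sketch of the sorted file method: a new record is appended at
# the end of the file and the sequence is then re-sorted on the key.

records = [("R1", 1), ("R3", 3), ("R6", 6), ("R7", 7)]  # preexisting sorted file

def insert_sorted(file_records, new_record):
    file_records.append(new_record)          # step 1: place at end of file
    file_records.sort(key=lambda r: r[1])    # step 2: sort on the key
    return file_records

insert_sorted(records, ("R2", 2))
print([name for name, _ in records])  # ['R1', 'R2', 'R3', 'R6', 'R7']
```

This keeps the illustration simple; a real DBMS would rewrite disk blocks rather than sort an in-memory list.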

Pros of sequential file organization

 It is a fast and efficient method for huge amounts of data.
 In this method, files can be stored on cheaper storage mechanisms such as
magnetic tape.
 It is simple in design and requires little effort to store the data.


 This method is used when most of the records have to be accessed like grade
calculation of a student, generating the salary slip, etc.
 This method is used for report generation or statistical calculations.

Cons of sequential file organization

 It wastes time, as we cannot jump directly to a required record; we have to
move through the file sequentially.
 The sorted file method takes more time and space for sorting the records.

Heap file organization

 It is the simplest and most basic type of organization. It works with data
blocks. In heap file organization, the records are inserted at the file's end.
When the records are inserted, it doesn't require the sorting and ordering of
records.
 When the data block is full, the new record is stored in some other block.
This new data block need not to be the very next data block, but it can select
any data block in the memory to store new records. The heap file is also
known as an unordered file.
 In the file, every record has a unique id, and every page in a file is of the
same size. It is the DBMS responsibility to store and manage the new
records.


Insertion of a new record

Suppose we have five records R1, R3, R6, R4 and R5 in a heap, and we want to
insert a new record R2. If data block 3 is full, the record will be inserted into
any data block selected by the DBMS, let's say data block 1.

If we want to search, update, or delete data in heap file organization, we need
to traverse the data from the start of the file until we get the requested record.

If the database is very large then searching, updating or deleting of record will be
time-consuming because there is no sorting or ordering of records. In the heap file
organization, we need to check all the data until we get the requested record.
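A minimal in-memory sketch of a heap file, assuming a hypothetical block capacity of three records, shows both behaviors described above: a record goes into any block with free space, and search must scan every block:

```python
# Minimal sketch of a heap (unordered) file: records go into any block with
# free space; search must scan every block. Block size of 3 is illustrative.

BLOCK_SIZE = 3

class HeapFile:
    def __init__(self):
        self.blocks = [[]]

    def insert(self, record):
        for block in self.blocks:            # any block with room will do
            if len(block) < BLOCK_SIZE:
                block.append(record)
                return
        self.blocks.append([record])         # all full: allocate a new block

    def search(self, key):
        for block in self.blocks:            # full scan: no ordering to exploit
            for rec_key, value in block:
                if rec_key == key:
                    return value
        return None

f = HeapFile()
for k in ["R1", "R3", "R6", "R4", "R5", "R2"]:
    f.insert((k, k.lower()))
print(f.search("R2"))  # r2
```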

Pros of Heap file organization

 It is a very good file organization method for bulk insertion. If a large
amount of data needs to be loaded into the database at one time, this
method is best suited.
 In the case of a small database, fetching and retrieving records is faster
than in sequential organization.


Cons of Heap file organization

 This method is inefficient for large databases, because it takes time to
search for or modify a record.

Hash File Organization

Hash File Organization uses the computation of hash function on some fields
of the records. The hash function's output determines the location of disk
block where the records are to be placed.

When a record has to be retrieved using the hash key columns, the address is
generated, and the whole record is retrieved using that address. In the same way,
when a new record has to be inserted, the address is generated using the hash
key and the record is directly inserted. The same process applies to delete and
update.

In this method, there is no effort for searching and sorting the entire file. In this
method, each record will be stored randomly in the memory.
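The hash-file behavior described above can be sketched as follows; the block count and the use of Python's built-in `hash` are illustrative stand-ins for a real disk-block hash function:

```python
# Sketch of hash file organization: the hash of a key column determines the
# block directly, so neither searching nor sorting the entire file is needed.

NUM_BLOCKS = 8
blocks = [[] for _ in range(NUM_BLOCKS)]

def block_address(key):
    return hash(key) % NUM_BLOCKS        # hash function picks the block

def insert(key, record):
    blocks[block_address(key)].append((key, record))

def retrieve(key):
    for k, record in blocks[block_address(key)]:  # look only in one block
        if k == key:
            return record
    return None

insert(103, {"name": "Anu"})
insert(104, {"name": "Ravi"})
assert retrieve(103) == {"name": "Anu"}
```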


B+ File Organization

 B+ tree file organization is an advanced form of the indexed sequential
access method. It uses a tree-like structure to store records in a file.
 It uses the same concept of key indexing, where the primary key is used to
sort the records. For each primary key, an index value is generated and
mapped with the record.
 The B+ tree is similar to a binary search tree (BST), but it can have more
than two children. In this method, all the records are stored only at the leaf
nodes; intermediate nodes act as pointers to the leaf nodes and do not
contain any records.


The above B+ tree shows that:

 There is one root node of the tree, i.e., 25.


 There is an intermediary layer of nodes. They do not store the actual
records; they have only pointers to the leaf nodes.
 The nodes to the left of the root contain values prior to the root value, and
the nodes to the right contain values after the root value, i.e., 15 and 30
respectively.
 The leaf nodes contain only values, i.e., 10, 12, 17, 20, 24, 27 and 29.
 Searching for any record is easier as all the leaf nodes are balanced.
 In this method, searching any record can be traversed through the single path
and accessed easily.

Pros of B+ tree file organization

 In this method, searching becomes very easy, as all the records are stored
only in the leaf nodes and sorted in a sequential linked list.
 Traversing through the tree structure is easier and faster.
 The size of the B+ tree has no restrictions, so the number of records can
increase or decrease and the B+ tree structure can also grow or shrink.
 It is a balanced tree structure, and any insert/update/delete does not affect the
performance of tree.

Cons of B+ tree file organization

 This method is inefficient for static tables.

Indexed sequential access method (ISAM)

ISAM is an advanced sequential file organization. In this method, records are
stored in the file using the primary key. An index value is generated for each
primary key and mapped with the record. This index contains the address of
the record in the file.

If any record has to be retrieved based on its index value, then the address of the
data block is fetched and the record is retrieved from the memory.

Pros of ISAM:

 Since each record has the address of its data block, searching for a record
in a huge database is quick and easy.
 This method supports range retrieval and partial retrieval of records. Since
the index is based on the primary key values, we can retrieve the data for a
given range of values. In the same way, partial values can also be searched
easily, e.g., student names starting with 'JA'.

Cons of ISAM

 This method requires extra space in the disk to store the index value.
 When the new records are inserted, then these files have to be reconstructed
to maintain the sequence.
 When the record is deleted, then the space used by it needs to be released.
Otherwise, the performance of the database will slow down.

Cluster file organization

 When two or more records are stored in the same file, it is known as
clustering. These files will have two or more tables in the same data block,
and the key attributes used to map these tables together are stored only
once.
 This method reduces the cost of searching for various records in different
files.
 The cluster file organization is used when there is a frequent need for joining
the tables with the same condition. These joins will give only a few records
from both tables. In the given example, we are retrieving the record for only
particular departments. This method can't be used to retrieve the record for
the entire department.

In this method, we can directly insert, update or delete any record. Data is sorted
based on the key with which searching is done. Cluster key is a type of key with
which joining of the table is performed.

Types of Cluster file organization:

Cluster file organization is of two types:


1. Indexed Clusters:

In an indexed cluster, records are grouped based on the cluster key and stored
together. The EMPLOYEE and DEPARTMENT relationship above is an example
of an indexed cluster, where all the records are grouped based on the cluster
key DEP_ID.

2. Hash Clusters:

It is similar to the indexed cluster. In hash cluster, instead of storing the records
based on the cluster key, we generate the value of the hash key for the cluster key
and store the records with the same hash key value.

Pros of Cluster file organization

 The cluster file organization is used when there is a frequent request for
joining the tables with same joining condition.
 It provides the efficient result when there is a 1:M mapping between the
tables.

Cons of Cluster file organization

 This method has low performance for very large databases.
 If there is any change in the joining condition, this method cannot be used;
traversing the file with a different join condition takes a lot of time.
 This method is not suitable for a table with a 1:1 relationship.

INDEXING AND HASHING

Indexing in DBMS

 Indexing is used to optimize the performance of a database by minimizing


the number of disk accesses required when a query is processed.
 The index is a type of data structure. It is used to locate and access the data
in a database table quickly.


Index structure:

Indexes can be created using some database columns.

 The first column of the database is the search key that contains a copy of the
primary key or candidate key of the table. The values of the primary key are
stored in sorted order so that the corresponding data can be accessed easily.
 The second column of the database is the data reference. It contains a set of
pointers holding the address of the disk block where the value of the
particular key can be found.

Indexing Methods

 Ordered indices
 Primary index
o Dense index
o Sparse index
 Clustering index
 Secondary index

Ordered indices

The indices are usually sorted to make searching faster. The indices which are
sorted are known as ordered indices.


Example: Suppose we have an employee table with thousands of records, each
10 bytes long. If the IDs start with 1, 2, 3, ... and so on, and we have to search
for the employee with ID 543:

 In the case of a database with no index, we have to read the disk blocks
from the start until we reach 543. The DBMS will find the record after
reading 543*10 = 5430 bytes.
 In the case of an index, we will search using the index, and the DBMS will
find the record after reading only 542*2 = 1084 bytes (assuming each index
entry is 2 bytes), which is far less than the previous case.

Primary Index

 If the index is created on the basis of the primary key of the table, then it is
known as primary indexing. These primary keys are unique to each record
and contain 1:1 relation between the records.
 As primary keys are stored in sorted order, the performance of the searching
operation is quite efficient.
 The primary index can be classified into two types: Dense index and Sparse
index.

Dense index

 The dense index contains an index record for every search key value in the
data file. It makes searching faster.
 In this, the number of records in the index table is same as the number of
records in the main table.
 It needs more space to store index record itself. The index records have the
search key and a pointer to the actual record on the disk.


Sparse index

 In the data file, index record appears only for a few items. Each item points
to a block.
 In this, instead of pointing to each record in the main table, the index points
to the records in the main table in a gap.
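The sparse-index lookup (find the largest index entry not exceeding the search key, then scan only that block) can be sketched like this; the keys and block contents are hypothetical:

```python
# Sketch of a sparse index: one index entry per block, holding the smallest
# key in that block. Lookup finds the largest index key <= the search key,
# then scans only that block.
import bisect

index_keys   = [1, 11, 21, 31]                    # first key of each block
block_of_key = [0, 1, 2, 3]                       # block number per entry
blocks = [list(range(1, 11)), list(range(11, 21)),
          list(range(21, 31)), list(range(31, 41))]

def sparse_lookup(key):
    i = bisect.bisect_right(index_keys, key) - 1  # largest index key <= key
    if i < 0:
        return False
    return key in blocks[block_of_key[i]]         # scan a single block

assert sparse_lookup(25) is True
assert sparse_lookup(45) is False
```

A dense index would instead hold one entry per record, trading extra space for a direct pointer to every record.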

Clustering Index

 A clustered index can be defined as an ordered data file. Sometimes the


index is created on non-primary key columns which may not be unique for
each record.
 In this case, to identify the record faster, we will group two or more columns
to get the unique value and create index out of them. This method is called a
clustering index.
 The records which have similar characteristics are grouped, and indexes are
created for these group.

Example: suppose a company contains several employees in each department.


Suppose we use a clustering index, where all employees which belong to the same
Dept_ID are considered within a single cluster, and index pointers point to the
cluster as a whole. Here Dept_Id is a non-unique key.


The previous scheme is a little confusing, because one disk block is shared by
records that belong to different clusters. Using a separate disk block for each
cluster is the better technique.


Secondary Index

In the sparse indexing, as the size of the table grows, the size of mapping also
grows. These mappings are usually kept in the primary memory so that address
fetch should be faster. Then the secondary memory searches the actual data based
on the address got from mapping. If the mapping size grows then fetching the
address itself becomes slower. In this case, the sparse index will not be efficient.
To overcome this problem, secondary indexing is introduced.

In secondary indexing, to reduce the size of mapping, another level of indexing is


introduced. In this method, the huge range for the columns is selected initially so
that the mapping size of the first level becomes small. Then each range is further
divided into smaller ranges. The mapping of the first level is stored in the primary
memory, so that address fetch is faster. The mapping of the second level and actual
data are stored in the secondary memory (hard disk).

For example:

 If you want to find the record of roll 111 in the diagram, then it will search
the highest entry which is smaller than or equal to 111 in the first level
index. It will get 100 at this level.


 Then, at the second index level, it again finds the highest entry less than or
equal to 111 and gets 110. Using the address 110, it goes to the data block
and searches each record sequentially until it finds 111.
 This is how a search is performed in this method. Inserting, updating, or
deleting is also done in the same manner.
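The two-level search steps above can be sketched as follows; the range boundaries and data-block contents are invented for illustration, while the roll numbers 111, 100 and 110 follow the example:

```python
# Sketch of the two-level lookup: the first-level index (kept in primary
# memory) covers wide ranges; each entry points to a second-level index that
# narrows the search to one data block on disk.
import bisect

first_level  = [1, 100, 200]                 # highest entry <= key is chosen
second_level = {1: [1, 50], 100: [100, 110, 150], 200: [200, 250]}
data_blocks  = {110: [110, 111, 112]}        # block starting at roll 110

def lookup(roll):
    hi = first_level[bisect.bisect_right(first_level, roll) - 1]   # level 1
    starts = second_level[hi]
    block = starts[bisect.bisect_right(starts, roll) - 1]          # level 2
    return roll in data_blocks.get(block, [])                      # scan block

assert lookup(111) is True   # level 1 gives 100, level 2 gives 110
```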

B+ TREE

B+ Tree

 The B+ tree is a balanced search tree (each node can have more than two
children). It follows a multi-level index format.
 In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures
that all leaf nodes remain at the same height.
 In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+
tree can support sequential access as well as random access.

Structure of B+ Tree

 In the B+ tree, every leaf node is at equal distance from the root node. The
B+ tree is of the order n where n is fixed for every B+ tree.
 It contains an internal node and leaf node.

Internal node

 An internal node of the B+ tree can contain at least n/2 record pointers
except the root node.
 At most, an internal node of the tree contains n pointers.

Leaf node

 The leaf node of the B+ tree can contain at least n/2 record pointers and n/2
key values.
 At most, a leaf node contains n record pointer and n key values.
 Every leaf node of the B+ tree contains one block pointer P to point to next
leaf node.


Searching a record in B+ Tree

Suppose we have to search for 55 in the B+ tree structure below. First, we reach
the intermediary node, which directs us to the leaf node that can contain the
record for 55.

In the intermediary node, we find the branch between the 50 and 75 keys, which
redirects us to the third leaf node. There, the DBMS performs a sequential
search to find 55.
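The descend-then-scan search described above can be sketched on a small hand-built tree; the node layout below is illustrative, not the exact tree from the (omitted) diagram:

```python
# Sketch of B+ tree search: internal nodes hold separator keys and child
# pointers; all records live in the leaves, which are scanned sequentially.

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys              # separator keys (internal) or key values (leaf)
        self.children = children      # child nodes, or None for a leaf
        self.records = records        # leaf payloads

def search(node, key):
    while node.children is not None:            # descend internal nodes
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    for k, rec in zip(node.keys, node.records): # sequential search in the leaf
        if k == key:
            return rec
    return None

leaf1 = Node([10, 25], records=["r10", "r25"])
leaf2 = Node([50, 55, 65], records=["r50", "r55", "r65"])
leaf3 = Node([75, 80], records=["r75", "r80"])
root = Node([50, 75], children=[leaf1, leaf2, leaf3])
print(search(root, 55))  # r55
```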

B+ Tree Insertion

Suppose we want to insert a record 60 in the below structure. It will go to the 3rd
leaf node after 55. It is a balanced tree, and a leaf node of this tree is already full,
so we cannot insert 60 there.

In this case, we have to split the leaf node, so that it can be inserted into tree
without affecting the fill factor, balance and order.


The 3rd leaf node has the values (50, 55, 60, 65, 70), and the intermediate-node
key pointing to it is 50. We split the leaf node in the middle so that its balance
is not altered, grouping (50, 55) and (60, 65, 70) into two leaf nodes.

If these two are to be leaf nodes, the intermediate node cannot branch from 50
alone. It should have 60 added to it, and then we can have a pointer to the new
leaf node.

This is how we insert an entry when there is an overflow. In a normal scenario,
it is very easy to find the node where the entry fits and place it in that leaf node.

B+ Tree Deletion

Suppose we want to delete 60 from the above example. In this case, we have to
remove 60 from the intermediate node as well as from the leaf node. If we
remove it only from the intermediate node, the tree will no longer satisfy the
rules of a B+ tree, so we need to modify it to keep the tree balanced.

After deleting 60 from the above B+ tree and re-arranging the nodes, it will
appear as shown in the following B+ tree.


Hashing

In a huge database structure, it is very inefficient to search all the index values and
reach the desired data. Hashing technique is used to calculate the direct location of
a data record on the disk without using index structure.

In this technique, data is stored at the data blocks whose address is generated by
using the hashing function. The memory location where these records are stored is
known as data bucket or data blocks.

A hash function can use any column value to generate the address; most of the
time, the hash function uses the primary key to generate the address of the data
block. The hash function can be any simple or complex mathematical function.
We can even consider the primary key itself as the address of the data block,
meaning each row is stored at a data block address equal to its primary key.

The above diagram shows data block addresses that are the same as the primary
key values. The hash function can also be a simple mathematical function like
exponential, mod, cos, sin, etc. Suppose we have a mod(5) hash function to
determine the address of a data block. In this case, mod(5) is applied to the
primary keys, generating 3, 3, 1, 4 and 2 respectively, and the records are
stored at those data block addresses.

Static Hashing

In static hashing, the resultant data bucket address is always the same. That
means if we generate an address for EMP_ID = 103 using the hash function
mod(5), it will always result in the same bucket address, 3. There is no change
in the bucket address.


Hence in this static hashing, the number of data buckets in memory remains
constant throughout. In this example, we will have five data buckets in the memory
used to store the data.
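The mod(5) scheme from the text can be sketched directly; the bucket contents here are illustrative:

```python
# Sketch of static hashing with the mod(5) function from the text: the bucket
# address for a key never changes, and the number of buckets stays fixed.

NUM_BUCKETS = 5
buckets = [[] for _ in range(NUM_BUCKETS)]

def bucket_address(emp_id):
    return emp_id % NUM_BUCKETS      # always the same bucket for the same key

def insert(emp_id, record):
    buckets[bucket_address(emp_id)].append((emp_id, record))

def search(emp_id):
    for key, record in buckets[bucket_address(emp_id)]:
        if key == emp_id:
            return record
    return None

insert(103, "record-103")
print(bucket_address(103))  # 3  (as in the text)
```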

Operations of Static Hashing

 Searching a record

When a record needs to be searched, then the same hash function retrieves the
address of the bucket where the data is stored.

 Insert a Record

When a new record is inserted into the table, then we will generate an address for a
new record based on the hash key and record is stored in that location.

 Delete a Record

To delete a record, we will first fetch the record which is supposed to be deleted.
Then we will delete the records for that address in memory.

 Update a Record

To update a record, we will first search it using a hash function, and then the data
record is updated.

If we want to insert a new record into the file, but the data bucket address
generated by the hash function is not empty (data already exists at that
address), the situation is known as bucket overflow. This is a critical situation
in static hashing.

To overcome this situation, there are various methods. Some commonly used
methods are as follows:

1. Open Hashing

When the hash function generates an address at which data is already stored, the
next free bucket is allocated to the record. This mechanism is called linear probing.

For example, suppose R3 is a new record that needs to be inserted, and the hash
function generates address 112 for it. But that address is already full, so the
system searches for the next available data bucket, 113, and assigns R3 to it.
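A minimal sketch of linear probing, assuming single-record buckets and the hypothetical addresses 112 and 113 from the example:

```python
# Linear-probing sketch: single-record buckets; when the hashed
# address is occupied, probe the following addresses until a free
# one is found.

table = {112: "R1"}                 # address 112 already holds a record

def insert_linear(record, home_addr, table, table_size=200):
    """Place record at home_addr, or at the next free address."""
    a = home_addr
    while a in table:               # bucket occupied: try the next one
        a = (a + 1) % table_size
    table[a] = record
    return a

print(insert_linear("R3", 112, table))   # 112 is full, so 113 is used
```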


2. Closed Hashing

When a bucket is full, a new data bucket is allocated for the same hash result and
is linked after the previous one. This mechanism is known as overflow chaining.

For example, suppose R3 is a new record that needs to be inserted into the table,
and the hash function generates address 110 for it. But that bucket is already full.
In this case, a new bucket is inserted at the end of bucket 110's chain and linked
to it.
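A minimal sketch of overflow chaining, assuming a primary bucket capacity of one record and the hypothetical bucket 110 from the example; a Python list models the chain of overflow buckets:

```python
# Overflow-chaining sketch: each home bucket is a list; records past
# the (assumed) capacity of 1 model linked overflow buckets.

from collections import defaultdict

CAPACITY = 1                        # records per primary bucket (assumed)
chains = defaultdict(list)
chains[110] = ["R1"]                # bucket 110 is already full

def insert_chained(record, home_addr):
    """Append the record to the bucket's chain.

    Returns True if an overflow bucket had to be used."""
    chains[home_addr].append(record)
    return len(chains[home_addr]) > CAPACITY

print(insert_chained("R3", 110), chains[110])   # True ['R1', 'R3']
```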

Dynamic Hashing

 The dynamic hashing method is used to overcome the problems of static
hashing, such as bucket overflow.
 In this method, data buckets grow or shrink as the number of records
increases or decreases. It is also known as the extendable hashing method.
 This method makes hashing dynamic, i.e., it allows insertion or deletion
without resulting in poor performance.

How to search a key

 First, calculate the hash address of the key.


 Check how many bits are used in the directory; this number of bits is
called i.
 Take the least significant i bits of the hash address. This gives an index
into the directory.
 Using that index, go to the directory and find the bucket address where the
record might be.
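The lookup steps can be sketched as follows, assuming a directory with i = 2 bits and hypothetical 5-bit hash addresses as in the example that follows:

```python
# Directory lookup in extendible hashing: the least significant i bits
# of the hash address index the directory, which holds bucket references.

i = 2                                    # bits currently used by the directory
directory = ["B0", "B1", "B2", "B3"]     # 2**i entries -> bucket names

def find_bucket(hash_address):
    index = hash_address & ((1 << i) - 1)    # least significant i bits
    return directory[index]

print(find_bucket(0b10001))              # B1  (last two bits are 01)
print(find_bucket(0b00100))              # B0  (last two bits are 00)
```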

How to insert a new record

 Firstly, follow the same procedure as for retrieval, ending up in some
bucket.
 If there is still space in that bucket, then place the record in it.
 If the bucket is full, then we will split the bucket and redistribute the records.

For example:

Consider the following grouping of keys into buckets, depending on the last bits
of their hash addresses:

The last two bits of the hash addresses of 2 and 4 are 00, so they go into bucket
B0. The last two bits of 5 and 6 are 01, so they go into bucket B1. The last two
bits of 1 and 3 are 10, so they go into bucket B2. The last two bits of 7 are 11, so
it goes into B3.


Insert key 9 with hash address 10001 into the above structure:

 Since key 9 has hash address 10001, it must go into bucket B1 (last two
bits 01). But bucket B1 is full, so it gets split.
 The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so
they go into bucket B1, while the last three bits of 6 are 101, so it goes into
bucket B5.
 Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000
and 100 directory entries, because the last two bits of both entries are 00.
 Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010
and 110 entries, because the last two bits of both entries are 10.
 Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011
entries, because the last two bits of both entries are 11.


Advantages of dynamic hashing

 In this method, performance does not degrade as the data in the system
grows; the method simply increases the amount of memory to
accommodate the data.
 Memory is well utilized, as it grows and shrinks with the data; there is no
memory lying unused.
 This method is good for dynamic databases where data grows and shrinks
frequently.

Disadvantages of dynamic hashing

 In this method, if the data size increases, the size of the bucket address
table also increases. The addresses of the data are maintained in the
bucket address table, and they keep changing as buckets grow and shrink.
If there is a huge increase in data, maintaining the bucket address table
becomes tedious.
 Bucket overflow can still occur, but it takes longer to reach that situation
than in static hashing.


Query Processing Overview

Introduction to Query Processing

 Query processing is the translation of high-level queries into low-level
expressions.
 It is a stepwise process that spans the physical level of the file system,
query optimization, and actual execution of the query to get the result.
 It requires the basic concepts of relational algebra and file structure.
 It refers to the range of activities that are involved in extracting data from
the database.
 It includes translation of queries in high-level database languages into
expressions that can be implemented at the physical level of the file system.
 In query processing, we will actually understand how these queries are
processed and how they are optimized.

In the above diagram,


 The first step is to transform the query into a standard form.
 The SQL query is parsed and translated into a relational algebra
expression. During this process, the parser checks the syntax and verifies
the relations and the attributes used in the query.
 The second step is query optimization. The optimizer transforms the query
into equivalent expressions that are more efficient to execute.
 The third step is query evaluation. The engine executes the query
execution plan and returns the result.

Translating SQL Queries into Relational Algebra

 An SQL query is first translated into an equivalent extended relational
algebra expression (represented as a query tree), which is then optimized.
 Query block in SQL: the basic unit that can be translated into the algebraic
operators and optimized. A query block contains a single SELECT-FROM-
WHERE expression, as well as GROUP BY and HAVING clauses.
Consider the following SQL query.
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY >( SELECT MAX (SALARY)
FROM EMPLOYEE

WHERE DNO=5);

- We can decompose it into two blocks:

Inner block:
( SELECT MAX (SALARY)
  FROM EMPLOYEE
  WHERE DNO=5 )

Outer block:
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > c

where c represents the result returned by the inner block.

- Then translate each block into an algebra expression:

Inner block: ℱ MAX SALARY (σ DNO=5 (EMPLOYEE))

Outer block: π LNAME, FNAME (σ SALARY>c (EMPLOYEE))


Example

SELECT Ename
FROM Employee
WHERE Salary > 5000;

Translated into a relational algebra expression:

π Ename (σ Salary > 5000 (Employee))

(Note that applying the projection first, as in σ Salary > 5000 (π Ename
(Employee)), would be invalid here, since the Salary attribute is eliminated by
the projection onto Ename alone.)

 A sequence of primitive operations that can be used to evaluate a query is
called a query execution plan (or query evaluation plan).
 The above diagram indicates that the query execution engine takes a query
execution plan and returns the answers to the query.
 Query Execution Plan minimizes the cost of query evaluation.
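To make the algebra expressions above concrete, here is a small sketch that models relations as Python lists of dicts, selection (σ) as a filter, and projection (π) as a column pick; the Employee rows are made-up sample data:

```python
# Relational algebra over in-memory relations: selection and projection.

def select(rows, pred):                  # σ_pred(rows)
    return [r for r in rows if pred(r)]

def project(rows, cols):                 # π_cols(rows)
    return [{c: r[c] for c in cols} for r in rows]

employee = [
    {"Ename": "Arun", "Salary": 7000},
    {"Ename": "Banu", "Salary": 4000},
]

# π Ename (σ Salary > 5000 (Employee))
result = project(select(employee, lambda r: r["Salary"] > 5000), ["Ename"])
print(result)                            # [{'Ename': 'Arun'}]
```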

ALGORITHMS FOR SELECT AND JOIN OPERATIONS


Implementing the SELECT Operation

Five example operations for demonstration:
(OP1): σ SSN='123456789' (EMPLOYEE)
(OP2): σ DNUMBER>5 (DEPARTMENT)
(OP3): σ DNO=5 (EMPLOYEE)
(OP4): σ DNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)
(OP5): σ ESSN='123456789' AND PNO=10 (WORKS_ON)

Search methods for simple selection:

S1. Linear search (brute force): retrieve every record in the file, and test whether
its attribute values satisfy the selection condition.

S2. Binary search: if the selection condition involves an equality comparison on a
key attribute on which the file is ordered (e.g., OP1).

S3. Using a primary index: if the selection condition involves an equality
comparison on a key attribute with a primary index (e.g., OP1).

S4. Using a primary index to retrieve multiple records: if the comparison
condition is >, ≥, <, or ≤ on a key field with a primary index (for example,
DNUMBER > 5 in OP2), use the index to find the record satisfying the equality
condition (DNUMBER = 5), then retrieve all subsequent (or preceding) records
in the ordered file.

S5. Using a clustering index to retrieve multiple records: if the selection
condition involves an equality comparison on a non-key attribute with a
clustering index (for example, DNO = 5 in OP3), use the index to retrieve all the
records satisfying the condition.

S6. Using a secondary (B+-tree) index on an equality comparison: retrieve one
record if the indexing field is a key, or multiple records if the indexing field is
not a key. This method can also be used for comparisons involving >, ≥, <, or ≤.
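Methods S1 and S2 can be sketched as follows, using a list of records ordered on a hypothetical SSN key for the binary-search case:

```python
# S1 (linear search) vs. S2 (binary search) for an equality selection
# on a key attribute; the file is ordered on SSN for the S2 case.

import bisect

file_records = [{"SSN": s} for s in (101, 205, 333, 478, 590)]

def linear_search(records, key):         # S1: scan every record
    for r in records:
        if r["SSN"] == key:
            return r
    return None

def binary_search(records, key):         # S2: file ordered on the key
    keys = [r["SSN"] for r in records]
    pos = bisect.bisect_left(keys, key)
    return records[pos] if pos < len(keys) and keys[pos] == key else None

print(linear_search(file_records, 333))  # {'SSN': 333}
print(binary_search(file_records, 333))  # {'SSN': 333}
```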

Search methods for complex (conjunctive) selection:

S7. Conjunctive selection using an individual index: if an attribute involved in
any single simple condition in the conjunctive condition has an access path that
permits the use of one of the methods S2 to S6, use that method to retrieve the
records, and then check whether each retrieved record satisfies the remaining
conditions in the conjunctive condition.

S8. Conjunctive selection using a composite index: if two or more attributes are
involved in equality conditions and a composite index exists on the combined
fields, we can use the index directly. For example, OP5 can use this method if
there is a composite index on (ESSN, PNO) for the WORKS_ON file.

S9. Conjunctive selection by intersection of record pointers: applicable if
secondary indexes are available on more than one of the fields in the conjunctive
condition, and if the indexes include record pointers (rather than block pointers).
Each index yields a set of record pointers; the intersection of these sets contains
the pointers to the records satisfying the indexed conditions, and those records
are then retrieved directly. If only some of the simple conditions have secondary
indexes, each retrieved record is further tested for the remaining conditions.
Implementing the JOIN Operation
Two operations for demonstration:
(OP6): EMPLOYEE ⋈ DNO=DNUMBER DEPARTMENT
(OP7): DEPARTMENT ⋈ MGRSSN=SSN EMPLOYEE
 Methods for implementing JOINs:
The algorithms we consider are for a join operation of the form
R ⋈ A=B S
o J1. Nested-loop join (brute force): for each record t in R (outer loop),
retrieve every record s from S (inner loop) and test whether the two
records satisfy the join condition t[A] = s[B].
o J2. Single-loop join (using an access structure to retrieve the
matching records): if an index exists for one of the two join attributes,
say B of S, retrieve each record t in R, one at a time (single loop), and
then use the access structure to retrieve directly all matching records s
from S that satisfy s[B] = t[A].
o J3. Sort-merge join: first sort R and S on the corresponding attributes
A and B using external sorting; then merge the sorted R and S by
retrieving the matching records t and s that satisfy the join condition
t[A] = s[B]. Note that if there are secondary indexes for R and S on
attributes A and B, we can merge these two secondary indexes instead
of sorting and merging the data files R and S.
o J4. Hash join:
 Partitioning phase: a single pass through the file with fewer
records (say, R) hashes its records (using A as the hash key) to
the hash file buckets.
 Probing phase: a single pass through the other file (S) then
hashes each of its records (using B as the hash key) to probe
the appropriate bucket.
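J1 and J4 can be sketched over in-memory relations as follows; the attribute names and sample tuples are made up, and a Python dict of lists plays the role of the hash file buckets:

```python
# Sketches of J1 (nested-loop join) and J4 (hash join) for R join S
# on R.A = S.B, with relations as lists of dicts.

from collections import defaultdict

def nested_loop_join(R, S, A, B):        # J1: compare every (t, s) pair
    return [{**t, **s} for t in R for s in S if t[A] == s[B]]

def hash_join(R, S, A, B):
    # Partitioning phase: hash the smaller file R on attribute A.
    buckets = defaultdict(list)
    for t in R:
        buckets[t[A]].append(t)
    # Probing phase: each record of S probes the matching bucket.
    return [{**t, **s} for s in S for t in buckets[s[B]]]

R = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}]
S = [{"B": 2, "y": "s1"}, {"B": 3, "y": "s2"}]
print(nested_loop_join(R, S, "A", "B"))  # [{'A': 2, 'x': 'r2', 'B': 2, 'y': 's1'}]
```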

QUERY OPTIMIZATION USING HEURISTICS AND COST ESTIMATION

Query: A query is a request for information from a database.

Query Plans: A query plan (or query execution plan) is an ordered set of steps
used to access data in a SQL relational database management system.

Query Optimization is the process of selecting the most efficient Query evaluation
plan from among many.

Query Parsing and Translation

Initially, the SQL query is scanned. It is then parsed to check for syntactical
errors and correctness of data types. If the query passes this step, it is
decomposed into smaller query blocks. Each block is then translated into an
equivalent relational algebra expression.

Steps for Query Optimization

Query optimization involves three steps, namely query tree generation, plan
generation, and query plan code generation.

Step 1 − Query Tree Generation

A query tree is a tree data structure representing a relational algebra expression.


The tables of the query are represented as leaf nodes. The relational algebra
operations are represented as the internal nodes. The root represents the query as a
whole.

During execution, an internal node is executed whenever its operand tables are
available. The node is then replaced by the result table. This process continues for
all internal nodes until the root node is executed and replaced by the result table.


For example, let us consider the following schemas −

EMPLOYEE

EmpID EName Salary DeptNo DateOfJoining

DEPARTMENT

DNo DName Location

Example 1

Let us consider the query as the following.

πEmpID(σEName="ArunKumar"(EMPLOYEE))

The corresponding query tree will be −

Example 2

Let us consider another query involving a join.

π EName,Salary (σ DName="Marketing" (DEPARTMENT) ⋈ DNo=DeptNo EMPLOYEE)

Following is the query tree for the above query.


Step 2 − Query Plan Generation

After the query tree is generated, a query plan is made. A query plan is an extended
query tree that includes access paths for all operations in the query tree. Access
paths specify how the relational operations in the tree should be performed. For
example, a selection operation can have an access path that gives details about the
use of B+ tree index for selection.

Besides, a query plan also states how the intermediate tables should be passed from
one operator to the next, how temporary tables should be used and how operations
should be pipelined/combined.

Step 3− Code Generation

Code generation is the final step in query optimization. It is the executable form of
the query, whose form depends upon the type of the underlying operating system.
Once the query code is generated, the Execution Manager runs it and produces the
results.

Terminologies:

QUERY OPTIMIZATION
Query optimization is the process of selecting the most efficient query-evaluation
plan from among the many strategies usually possible for processing a given query.

As there are many equivalent transformations of the same high-level query, the
aim of query optimization is to choose the one that minimizes resource usage.
Generally, query optimization aims to reduce the total execution time of a query.
The problem is computationally intractable with a large number of relations, so
the strategy adopted is reduced to finding a near-optimum solution. Two kinds of
query optimization are available:
1. Heuristic or rule-based optimization
2. Systematic or cost-based optimization
1. Heuristic Query Optimization
Oracle and IBM DB2 use rule-based optimization.
A query can be represented as a tree data structure, with operations at the interior
nodes and data items (tables, columns) at the leaves. The query is evaluated in a
depth-first pattern.
Example:
SELECT PNUMBER, DNUM, LNAME
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER AND MGREMPID=EMPID AND
PLOCATION = 'Stafford';
Normal Execution Plan:
Optimized Query Execution: These transformations can be used in various
combinations to optimize queries. Some general steps follow:
[1] Using rule 1, break up conjunctive selection conditions and chain them together.
[2] Using the commutativity rules, move the selection operations as far down the
tree as possible.
[3] Using the associativity rules, rearrange the leaf nodes so that the most restrictive
selection conditions are executed first.
[4] Combine Cartesian product operations with associated selection conditions to
form a single join operation.


[5] Using the rules for commuting projection with other operations, move the
projection operations down the tree to reduce the sizes of intermediate result sets.
[6] Finally, identify subtrees that can be executed using a single efficient access
method.
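Step [2] can be illustrated with a toy version of the Stafford query: pushing the selection below the join gives the same answer from a much smaller join input. Relation contents here are invented sample data:

```python
# Selection pushdown: filter PROJECT on PLOCATION before joining
# with DEPARTMENT, instead of joining first and filtering afterwards.

def join(R, S, a, b):
    return [{**r, **s} for r in R for s in S if r[a] == s[b]]

projects = [{"PNUM": p, "DNUM": p % 3,
             "LOC": "Stafford" if p % 4 == 0 else "Houston"}
            for p in range(12)]
departments = [{"DNUMBER": d, "DNAME": f"D{d}"} for d in range(3)]

# Unoptimized: join first, select afterwards.
plan1 = [r for r in join(projects, departments, "DNUM", "DNUMBER")
         if r["LOC"] == "Stafford"]

# Heuristic step [2]: push the selection below the join.
stafford = [p for p in projects if p["LOC"] == "Stafford"]
plan2 = join(stafford, departments, "DNUM", "DNUMBER")

print(plan1 == plan2, len(stafford), len(projects))   # True 3 12
```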
2. Systematic or Cost-based Optimization
Cost estimation for relational algebra operations is performed, and the execution
plan that incurs the lowest cost is chosen.
The cost of query evaluation can be measured in terms of a number of different
resources, including disk accesses, CPU time to execute a query, and, in a
distributed or parallel database system, the cost of communication. The number
of block transfers from disk and the number of disk seeks are used to estimate the
cost of a query-evaluation plan. The optimizer uses formulae that estimate the
costs of a number of options and selects the one with the lowest cost. It typically
considers only the cost of disk access, which is usually the dominant cost in
query processing.

The response time for a query-evaluation plan is very hard to estimate without
actually executing the plan, for the following reasons:
a. The response time depends on the contents of the buffer when the query begins
execution; this information is not available when the query is optimized, and is hard
to account for even if it were available.
b. In a system with multiple disks, the response time depends on how accesses
are distributed among disks, which is hard to estimate without detailed
knowledge of the data layout on disk.
A cost-based optimizer explores the space of all query-evaluation plans that are
equivalent to the given query and chooses the one with the least estimated cost.
The term "memoization" refers to storing the optimal query evaluation plan for a
subexpression when it is optimized for the first time; subsequent requests to
optimize the same subexpression are handled by returning the already memoized
plan. The optimizer in PostgreSQL is a cost-based query optimizer.
Several cost components must be considered:
1. Access cost to secondary storage (hard disk)
2. Storage cost for intermediate result sets
3. Computation costs: CPU, memory transfers, etc.
4. Communication costs, e.g., in a distributed or client/server database
Cost-based Algorithm:
1. Enumerate all of the legitimate plans (call these P1…Pn), where each plan
contains a set of operations O1…Ok.
2. Select a plan.
3. For each operation Oi in the plan, enumerate the candidate access routines.
4. For each possible access routine for Oi, estimate its cost.
5. Select the access routine with the lowest cost.
6. Repeat steps 4 and 5 until an access routine has been selected for each
operation.
7. Sum up the costs of the access routines to determine a total cost for the plan.
8. Repeat steps 2 through 7 for each plan and choose the plan with the lowest
total cost.
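The algorithm can be sketched with invented plans and cost figures: each plan is a list of operations, each operation maps candidate access routines to estimated costs, and the optimizer picks the cheapest routine per operation and then the cheapest plan overall:

```python
# Toy cost-based optimizer: all plan names, routines and costs are
# invented numbers for illustration.

plans = {
    "P1": [{"scan": 100, "index": 20},       # operation O1: routine -> cost
           {"sort_merge": 80, "hash": 60}],  # operation O2
    "P2": [{"scan": 100},
           {"nested_loop": 400}],
}

def plan_cost(ops):
    # Steps 3-7: cheapest access routine per operation, summed.
    return sum(min(routines.values()) for routines in ops)

best = min(plans, key=lambda p: plan_cost(plans[p]))
print(best, plan_cost(plans[best]))          # P1 80
```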
