100% found this document useful (1 vote)
141 views25 pages

r23 Dbms Unit 4- Implementation Techniques

The document discusses RAID (Redundant Array of Independent Disks) technology, which enhances performance and data redundancy through various levels, including RAID 0 to RAID 6, each with unique features and fault tolerance capabilities. It also covers file organization methods, including fixed and variable length records, and different approaches to organizing records in files such as heap, sequential, and hashing file organizations. Additionally, it explains indexing and hashing techniques that improve data retrieval efficiency, highlighting the trade-offs involved in maintaining multiple indices.

Uploaded by

ganeshbalaji765
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
141 views25 pages

r23 Dbms Unit 4- Implementation Techniques

The document discusses RAID (Redundant Array of Independent Disks) technology, which enhances performance and data redundancy through various levels, including RAID 0 to RAID 6, each with unique features and fault tolerance capabilities. It also covers file organization methods, including fixed and variable length records, and different approaches to organizing records in files such as heap, sequential, and hashing file organizations. Additionally, it explains indexing and hashing techniques that improve data retrieval efficiency, highlighting the trade-offs involved in maintaining multiple indices.

Uploaded by

ganeshbalaji765
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 25

Database Management Systems 4-2 Implementation Techniques

Part I : File Organization and Indexing

4.1 RAID AU : May-07,15,16,17, Dec.-06,13,14,15,16, Marks 16

 RAID stands for Redundant Array of Independent Disks. This is a technology in


which multiple secondary disks are connected together to increase the performance,
data redundancy or both.
 For achieving the data redundancy - in case of disk failure, if the same data is also
backed up onto another disk, we can retrieve the data and go on with the operation.
 It consists of an array of disks in which multiple disks are connected to achieve
different goals.
 The main advantage of RAID, is the fact that, to the operating system the array of
disks can be presented as a single disk.
Need for RAID
o RAID is a technology that is used to increase the performance.
o It is used for increased reliability of data storage.
o An array of multiple disks accessed in parallel will give greater throughput than a
single disk.
o With multiple disks and a suitable redundancy scheme, your system can stay up
and running when a disk fails, and even while the replacement disk is being
installed and its data restored.
Features
(1) RAID is a technology that contains the set of physical disk drives.
(2) In this technology, the operating system views the separate disks as a single logical
disk.
(3) The data is distributed across the physical drives of the array.
(4) In case of disk failure, the parity information can be helped to recover the data.

4.1.1 RAID Levels


Level : RAID 0
o In this level, data is broken down into blocks and these blocks are stored across all
the disks.
o Thus striped array of disks is implemented in this level. For instance in the
following figure blocks “A B” form a stripe.
Database Management Systems 4-3 Implementation Techniques

o There is no duplication of data in this level so once a block is lost then there is no
way recover it.
o The main priority of this level is performance and not the reliability.

Level : RAID 1
o This level makes use of mirroring. That means all data in the drive is duplicated to
another drive.
o This level provides 100% redundancy in case of failure.
o Only half space of the drive is used to store the data. The other half of drive is just
a mirror to the already stored data.
o The main advantage of this level is fault tolerance. If some disk fails then the other
automatically takes care of lost data.

Level : RAID 2
o This level makes use of mirroring as well as stores Error Correcting Codes (ECC)
for its data striped on different disks.
o The data is stored in separate set of disks and ECC is stored another set of disks.
o This level has a complex structure and high cost. Hence it is not used
commercially.
Database Management Systems 4-4 Implementation Techniques

Level : RAID 3
o This level consists of byte-level stripping with dedicated parity. In this level, the
parity information is stored for each disk section and written to a dedicated parity
drive.
o We can detect single errors with a parity bit. Parity is a technique that checks
whether data has been lost or written over when it is moved from one place in
storage to another.
o In case of disk failure, the parity disk is accessed and data is reconstructed from
the remaining devices. Once the failed disk is replaced, the missing data can be
restored on the new disk.

Level : RAID 4
o RAID 4 consists of block-level stripping with a parity disk.
Database Management Systems 4-5 Implementation Techniques
o Note that level 3 uses byte-level striping, whereas level 4 uses block-level striping.
Level : RAID 5
o RAID 5 is a modification of RAID 4.
o RAID 5 writes whole data blocks onto different disks, but the parity bits generated
for data block stripe are distributed among all the data disks rather than storing
them on a different dedicated disk.

RAID : Level 6
o RAID 6 is a extension of Level 5
o RAID 6 writes whole data blocks onto different disks, but the two independent
parity bits generated for data block stripe are distributed among all the data
disks rather than storing them on a different dedicated disk.
o Two parities provide additional fault tolerance.
o This level requires at least four disks to implement RAID.

The factors to be taken into account in choosing a RAID level are :

Monetary cost of extra disk-storage requirements.


1. Performance requirements in terms of number of I/O operations.
2. Performance when a disk has failed.
3. Performance during rebuild
Database Management Systems 4-6 Implementation Techniques
University Questions
4.2 File Organization AU : Dec.-08, Marks 10

 A file organization is a method of arranging records in a file when the file is stored on
disk.
 A file is organized logically as a sequence of records.
 Record is a sequence of fields.
 There are two types of records used in file organization (1) Fixed Length Record
(2) Variable Length Record.
(1) Fixed Length Record
 A file where each records is of the same length is said to have fixed length records.
 Some fields are always the same length (e.g. PhoneNumber is always 10
characters).
 Some fields may need to be 'padded out' so they are the correct length.
 For example -
type Employee = record
EmpNo varchar(4)
Ename varchar(10)
Salary integer(5)
Phone varchar(10)
End

EmpNo EName Salary Phone


1111 AAA 1000 1111111111
2222 BBB 2000 2222222222
3333 CCC 3000 3333333333
4444 DDD 4000 4444444444
For instance the first record of example file can be stored as
1 1 1 1 A A A 1
0 0 0 1 1 1 1 1 1 1 1 1 1
Thus total 29 bytes are required to store.
Advantage :
 Access is fast because the computer knows where each record starts.
Disadvantage :
(1) Due to fixed size, some larger sized record may cross the block boundaries. That
means part of record will be stored in one block and other part of the record may be
Database Management Systems 4-7 Implementation Techniques
stored in some another block. Thus we may require two block access for each read or
write.
(2) It is difficult to delete the record from this structure. If some intermediate record is
deleted from the file then the vacant space must be occupied by next subsequent
records.
 When a record is deleted,we could move the record that came after it into the space
formerly occupied by the deleted record. But this may require moving of multiple
records to occupy the vacant space. This is an undesirable solution to fill up the
vacant space of deleted records.
 Another approach is to use a file header. At the beginning of the file, we allocate a
certain number of bytes as a file header. The header will contain information such as
address of the first record whose contents are deleted. We use this first record to store
the address of the second available record and so on. Thus the stored record addresses
are referred as pointer while the deleted records thus form a linked list which is called
as free list.
For example - Consider the employee record in a file is -

Emp No Ename Salary Phone

record 0 1111 AAA 1000 1111111111


record 1 2222 BBB 2000 2222222222
record 2 3333 CCC 3000 3333333333
record 3 4444 DDD 4000 4444444444
record 4 5555 EEE 5000 5555555555
record 5 6666 FFF 6000 6666666666
record 6 7777 GGG 7000 7777777777
The representation of records maintaining free list after deletion of record 1,3 and 5
Database Management Systems 4-8 Implementation Techniques

 On insertion of a new record, we use the record pointed to by the header. We change
the header pointer to point to the next available record. If no space is available, we
add the new record to the end of the file.
(2) Variable Length Record

 Variable-length records arise in a database in several ways:


(i) Storage of multiple record types in a file.
(ii) Record types that allow variable lengths for one or more fields.
(iii) Record types that allow repeating fields such as arrays or multisets.
 The representation of a record with variable-length attributes typically has two parts :
a) an initial part with fixed length attributes. This initial part of the record is
represented by a pair (offset, length). The offset denotes the starting address of the
record while the length represents the actual length of the length.
b) followed by data for variable length attributes.
 For example - Consider the employee records stored in a file as

record 0 1111 AAA 1000 1111111111


2222 BBB 2000 2222222222
record 1

record 2 3333 CCC 3000 3333333333


record 3 4444 DDD 4000 4444444444
record 4 5555 EEE 5000 5555555555
record 5 6666 FFF 6000 6666666666
record 6 7777 GGG 7000 7777777777

 The variable length representation of first record is,

 The figure also illustrates the use of a null bitmap, which indicates which attributes of
the record have a null value.
 The variable length record can be stored in blocks. A specialized structure called
Database Management Systems 4-9 Implementation Techniques
slotted page structure is commonly used for organizing the records within a block.
This structure is as shown by following Fig. 4.2.1.

Fig. 4.2.1

 This structure can be described as follows -


o At the beginning of each block there is a block header which contains -
 Total number of entries (i.e. records).
 Pointer to end of free list.
 Followed by an array of entries that contain size and location of each record.
o The actual records are stored in contiguous block. Similarly the free list is also
maintained as continuous block.
o When a new record is inserted then the space is allocated from the block of free
list.
o Records can be moved around within a page to keep them contiguous with no
empty space between them; entry in the header must be updated.

4.3 Organization of Records in File


There are three commonly used approaches of organizing records in file -

(1) Heap File Organization : Any record can be placed anywhere in the file where
there is a space. There is no ordering for placing the records in the file. Generally
single file is used.
(2) Sequential File Organization : Records are stored in sequential order based on
the value of search key.
(3) Hashing File Organization : A hash function is used to obtain the location of the
record in the file. Based on the value returned by hash function, the record is
stored in the file.
4.3.1 Sequential File Organization
Database Management Systems 4 - 10 Implementation Techniques
 The sequential file organization is a simple file organization method in which the
records are stored based on the search key value.
 For example – Consider following set of records stored according to the RollNo of
student. Note that we assume here that the RollNo is a search key.

Now if we want to insert following record

4 E
E
Then, we will insert it in sorted order of RollNo and
E adjust the pointers accordingly.

Fig. 4.3.1
 However, maintaining physical sequential order is very difficult as there can be
several insertions and deletions.
 We can maintain the deletion by next pointer chain.
 For insertion following rules can be applied -
o If there is free space insert record there. If no free space, insert the record in an
overflow block.
o In either case, pointer chain must be updated.
4.3.2 Multi-table Clustering File Organization
In a multitable clustering file organization, records of several different relations are
stored in the same file.
For example - Following two tables Student and Course
Database Management Systems 4 - 11 Implementation Techniques
Sname Marks
Ankita 55
Ankita 67
Ankita 86
Prajkta 91

Sname Cname City


Ankita ComputerSci Chennai
Prajkta Electronics Pune
The multitable clustering organization for above tables is,

Ankita ComputerSci Chennai

Ankita 55

Ankita 67

Ankita 86

Prajkta Electronics Pune

Prajkta 91

This type of file organization is good for join operations such as Student Course.
This file organization results in variable size records.
The pointer chain can be added to the above records to keep track of address of next
record. It can be as shown in following Fig. 4.3.2

Fig. 4.3.2
Database Management Systems 4 - 12 Implementation Techniques
4.4 Indexing and Hashing AU : Dec.-11, Marks 10

 An index is a data structure that organizes data records on the disk to make the
retrieval of data efficient.
 The search key for an index is collection of one or more fields of records using which
we can efficiently retrieve the data that satisfy the search conditions.
 The indexes are required to speed up the search operations on file of records.
 There are two types of indices -
o Ordered Indices : This type of indexing is based on sorted ordering values.
o Hash Indices : This type of indexing is based on uniform distribution of values
across range of buckets. The address of bucket is obtained using the hash function.
 There are several techniques of for using indexing and hashing. These techniques are
evaluated based on following factors -
o Access Types : It supports various types of access that are supported efficiently.
o Access Time : It denotes the time it takes to find a particular data item or set items.
o Insertion Time : It represents the time required to insert new data item.
o Deletion Time : It represents the time required to delete the desired data item.
o Space overhead : The space is required to occupy the index structure. But
allocating such extra space is worth to achieve improved performance.
Example 4.4.1 Since indices speed query processing. Why might they not be kept on several
search keys? List as many reasons as possible. AU : Dec.-11, Marks 6

Solution : Reasons for not keeping several search indices include :


a. Every index requires additional CPU time and disk I/O overhead during inserts and
deletions.
b. Indices on non-primary keys might have to be changed on updates, although an index
on the primary key might not (as updates typically do not modify the primary key
attributes).
c. Each extra index requires additional storage space.
d. For queries which involve conditions on several search keys, efficiency might not be
bad even if only some of the keys have indices on them.
Therefore database performance is improved less by adding indices when many
indices already exist.

4.5 Ordered Indices AU : Dec.-04, 08, 15, May-06, Marks 10

4.5.1 Primary and Clustered Indices


Primary Index :
 An index on a set of fields that includes the primary key is called a primary index. The
Database Management Systems 4 - 13 Implementation Techniques
primary index file should be always in sorted order.
 The primary indexing is always done when the data file is arranged in sorted order
and primary indexing contains the primary key as its search key.
 Consider following scenario in which the primary index consists of few entries as
compared to actual data file.

Fig. 4.5.1 : Example of primary index

 Once if you are able to locate the first entry of the record containing block, other
entries are stored continuously. For example if we want to search a record for RegNo
11AS32 we need not have to search for the entire data file. With the help of primary
index structure we come to know the location of the record containing the RegNo
11AS30, now when the first entry of block 30 is located, then we can easily locate the
entry for 11AS32.
 We can apply binary search technique. Suppose there are n = 300 blocks in a main data
file then the number of accesses required to search the data file will be log2 n + 1 = (log2 
300) + 1  9
 If we use primary index file which contains at the most n = 3 blocks then using binary
search technique, the number of accesses required to search using the primary index
file will be log2 n + 1 = (log2 3 ) + 1  3

 This shows that using primary index the access time can be deduced to great extent.
Clustered Index :
 In some cases, the index is created on non-primary key columns which may not be
unique for each record. In such cases, in order to identify the records faster, we will
group two or more columns together to get the unique values and create index out of
them. This method is known as clustering index.
 When a file is organized so that the ordering of data records is the same as the
ordering of data entries in some index then say that index is clustered, otherwise it is
an unclustered index.
Database Management Systems 4 - 14 Implementation Techniques
 Note that, the data file need to be in sorted order.
 Basically, records with similar characteristics are grouped together and indexes are
created for these groups.
 For example, students studying in each semester are grouped together. i.e.; 1st
semester students, 2nd semester students, 3rd semester students etc. are grouped.

Fig. 4.5.2 Clustered Index

4.5.2 Dense and Sparse Indices


There are two types of ordered indices :
1) Dense index :
 An index record appears for every search key value in file.
 This record contains search key value and a pointer to the actual record.
 For example :

2) Sparse index :

Fig. 4.5.3 : Dense index


Database Management Systems 4 - 15 Implementation Techniques

 Index records are created only for some of the records.


 To locate a record, we find the index record with the largest search key value less
than or equal to the search key value we are looking for.
 We start at that record pointed to by the index record, and proceed along the
pointers in the file (that is, sequentially) until we find the desired record.
 For example -

Fig. 4.5.4 : Sparse index

4.5.3 Single and Multilevel Indices


Single level indexing :
 A single-level index is an auxiliary file that makes it more efficient to search for a
record in the data file.
 The index is usually specified on one field of the file (although it could be specified on
several fields).

 Each index can be in the following form.


Search Key Pointer to Record

 The index file usually occupies considerably less disk blocks than the data file because
its entries are much smaller.
 A binary search on the index yields a pointer to the file record.
 The types of single level indexing can be primary indexing, clustering index or
secondary indexing.
Database Management Systems 4 - 16 Implementation Techniques
 Example : Following Fig. 4.5.5 represents the single level indexing -

Fig. 4.5.5 : Single level indexing


Multilevel indexing :
 There is an immense need to keep the index records in the main memory so as to
speed up the search operations. If single-level index is used, then a large size index
cannot be kept in memory which leads to multiple disk accesses.
 Multi-level Index helps in breaking down the index into several smaller indices in
order to make the outermost level so small that it can be saved in a single disk block,
which can easily be accommodated anywhere in the main memory.
The multilevel indexing can be represented by following Fig. 4.5.6
Database Management Systems 4 - 17 Implementation Techniques

Fig. 4.5.6 Multilevel indexing


4.5.4 Secondary Indices

 In this technique two levels of indexing are used in order to reduce the mapping size
of the first level and in general.
 Initially, for the first level, a large range of numbers is selected so that the mapping
size is small. Further, each range is divided into further sub ranges.
 It is used to optimize the query processing and access records in a database with some
information other than the usual search key.
Database Management Systems 4 - 18 Implementation Techniques

 For example -

Part II : Query Processing

4.11 Query Processing Overview AU : May-14, 16, 18, Marks 16

 Query processing is a collection of activities that are involved in extracting data from
database.
 During query processing there is translation high level database language queries into
the expressions that can be used at the physical level of filesystem.
 There are three basic steps involved in query processing and those are –
1. Parsing and Translation
o In this step the query is translated into its internal form and then into relational
algebra.
o Parser checks syntax and verifies relations.
o For instance - If we submit the query as,
SELECT RollNo, name
FROM Student
HAVING RollNo=10
Then it will issue a syntactical error message as the correct query should be
SELECT RollNo, name
FROM Student
HAVING RollNo=10
Database Management Systems 4 - 19 Implementation Techniques
Thus during this step the syntax of the query is checked so that only correct and verified
query can be submitted for further processing.

2. Optimization :
o During this process the query evaluation plan is prepared from all the relational
algebraic expressions.
o The query cost for all the evaluation plans is calculated.
o Amongst all equivalent evaluation plans the one with lowest cost is chosen.
o Cost is estimated using statistical information from the database catalog, such as
the number of tuples in each relation, size of tuples, etc.
3. Evaluation
o The query-execution engine takes a query-evaluation plan,executes that plan, and
returns the answers to the query.
The above describe steps are represented by following Fig. 4.11.1 –

Fig. 4.11.1 Query processing


For example - If the SQL query is,
SELECT balance
FROM account
WHERE balance<1000
Step 1 : This query is first verified by the parser and translator unit for correct syntax. If
so then the relational algebra expressions can be obtained. For the above given queries
there are two possible relational algebra
(1)  balance<1000( balance(account))

(2) balance ( balance<1000 (account))


Database Management Systems 4 - 20 Implementation Techniques
Step 2 :
Query Evaluation Plan : To specify fully how to evaluate a query, we need not only to
provide the relational-algebra expression, but also to annotate it with instructions
specifying how to evaluate each operation. For that purpose, using the order of evaluation
of queries, two query evaluation plans are prepared. These are as follows

Fig. 4.11.2 Query evaluation plans


Associated with each query evaluation plan there is a query cost. The query optimization
selects the query evaluation plan having minimum query cost.
Once the query plan is chosen, the query is evaluated with that plan and the result of the
query is output.

University Questions

1. Briefly explain about query processing. AU : May-14, 16, Marks 16

2. What is query optimization? Outline the steps in query optimization. AU : May-18, Marks 13

4.12 Measure of Query Cost


 There are multiple possible evaluation plans for a query, and it is important to be able
to compare the alternatives in terms of their estimated cost and choose the best plan.
 There are many factors that contribute to query cost are -
o Disk access
o CPU time to execute the query
o Cost of communication in distributed and parallel database system.
 The cost of access data from disk is an important cost. Normally disk access is
relatively slow as compared to in-memory operations. Moreover, the CPU speed is
much faster than the disk speed. Hence the time spent in disk is relatively dominant
factor in query execution.
 Computing CPU access time is comparatively harder, hence we will not consider the
CPU time to execute the query.
 Similarly the cost of communication does not matter for simple large databases
present in the centralized database system.
 Typically disk access is the predominant cost and is also relatively easy to estimate,
Database Management Systems 4 - 21 Implementation Techniques
taking into account:
o Number of seeks × average-seek-cost
o Number of blocks read × average-block-read-cost
o Number of blocks written × average-block-write-cost
 Cost to write a block is greater than cost to read a block because data is read back after
being written to ensure that the write was successful.
 We use number of block transfers from disk and number of disk seeks to estimate
the Query cost.
 Let,
o b be the number to blocks
o S be the number of Seeks
o tT is average time required to transfer a block of data, in seconds
o tS is average block access time in seconds.
Then query cost can be computed using following formula b*tT+S*tS

4.13 Algorithms for SELECT Operation


For selection operation, the file scan is an important activity. Basically file scan is a
based on searching algorithms. These searching algorithms locate and retrieve the records
that fulfills a selection condition.
Let us discuss various algorithms used for SELECT Operation based in file scan
Algorithm A1 : Linear Search

 Scan each file block and test all records to see whether they satisfy the selection
condition 
Cost = br block transfers + 1 seek
Where,
br denotes number of blocks containing records from relation r

 If selection is on a key attribute, can stop on finding record


Cost = (br/2) block transfers + 1 seek

 Advantages of Linear Search


o Linear search works even-if there is no selection condition specified.
o For linear search, there is no need to have records in the file in ordered form.
o Linear search works regardless of indices.
Algorithm A2 : Binary Search
 Applicable if selection is an equality comparison on the attribute on which file is
Database Management Systems 4 - 22 Implementation Techniques
ordered.
 Assume that the blocks of a relation are stored contiguously. 
 Cost estimate is nothing but the number of disk blocks to be scanned.
Cost of locating the first tuple by a binary search on the blocks = ⌈log2(br)⌉ × (tT + tS )
Where,
br denotes number of blocks containing records from relation r
tT is average time required to transfer a block of data, in seconds
tS is average block access time, in seconds

 If there are multiple records satisfying the selection add transfer cost of the number of
blocks containing records that satisfy selection condition.
4.14 Algorithms for JOIN Operation
 JOIN operation is the most time consuming operation to process.
 There are Several different algorithms to implement joins
1. Nested-loop join
2. Block nested-loop join
3. Indexed nested-loop join
4. Merge-join
5. Hash-join
Choice of a particular algorithm is based on cost estimate.
Algorithm For Nested Loop Join
This algorithm is for computing r s
Let, r is called the outer relation and s the inner relation of the join.
for each tuple tr in r do begin
for each tuple ts in s do begin
test pair (tr,ts) to see if they satisfy the join condition 
if  is satisfied, then, add (t r , ts ) to the result.
end
end
 This algorithm requires no indices and can be used with any kind of join condition.
 It is expensive since it examines every pair of tuples in the two relations.
 In the worst case, if there is enough memory only to hold one block of each relation,
the estimated cost is
 nr × bs + br block transfers, plus nr + br seeks
 If the smaller relation fits entirely in memory, use that as the inner relation
o Reduces cost to br + bs block transfers and 2 seeks
Database Management Systems 4 - 23 Implementation Techniques
 For example - Assume the query CUSTOMERS ORDERS (with join attribute only
being CName)
Number of records of customer : 10000 order : 5000
Number of blocks of customer : 400 order : 100
Formula Used :
(1) nr × bs + br block transfers,
(2) nr + br seeks
r is outer relation and s is inner relation.
With order as outer relation:
nr = 5000 bs = 400, br = 100
5000 × 400 + 100 = 2000100 block transfers and
5000 + 100 = 5100 seeks
With customer as the outer relation:
nr=10000, bs=100, br=400
10000 × 100 + 400 = 1000400 block transfers and
10000+400 =10400 seeks
If smaller relation (order ) fits entirely in memory, the cost estimate will be:
br + bs = 500 block transfers
Algorithm For Block Nested Loop Join
Variant of nested-loop join in which every block of inner relation is paired with every
block of outer relation.

This algorithm is for computing r ⋈ s


Let, r is called the outer relation and s the inner relation of the join.
for each block Br of r do
for each block Bs of s do
for each tuple t r in Br do begin
for each tuple ts in Bs do begin
test pair (tr , ts) to see if they satisfy the join condition 
if  is satisfied, then, add (t r , ts ) to the result.
end
end
end
end
Worst case estimate: br × bs + br block transfers + 2 × br seeks
Database Management Systems 4 - 24 Implementation Techniques
Each block in the inner relation s is read once for each block in the outer relation.
Best case : br + bs block transfers + 2 seeks
(3) Merge Join

 In this operation, both the relations are sorted on their join attributes. Then merged
these sorted relations to join them.
 Only equijoin and natural joins are used.
 The cost of merge join is,
br + bs block transfers + ⌈br /bb⌉ + ⌈bs/bb⌉ seeks + the cost of sorting, if relations are
unsorted.
(4) Hash Join

 In this operation, the hash function h is used to partition tuples of both the relations.
 h maps A values to {0, 1, ..., n}, where A denotes the attributes of r and s used in the
join.
 Cost of hash join is :
3(br + bs ) + 4 × n block transfers + 2(⌈br /b b⌉ + ⌈bs/bb⌉) seeks
If the entire build input can be kept in main memory no partitioning is required
Cost estimate goes down to br + bs.

4.15 Query Optimization using Heuristics and Cost Estimation


AU : Dec. -13, 16, May-15, Marks 16

4.15.1 Heuristic Estimation


 Heuristic is a rule that leads to least cost in most of cases.
 Systems may use heuristics to reduce the number of choices that must be made in a
cost-based fashion.
 Heuristic optimization transforms the query-tree by using a set of rules that typically t
improve execution performance. These rules are
1. Perform selection early (reduces the number of tuples)
2. Perform projection early (reduces the number of attributes)
3. Perform most restrictive selection and join operations before other similar
operations (such as cartesian product).
 Some systems use only heuristics, others combine heuristics with partial cost-based
optimization.
Steps in Heuristic Estimation
Step 1 : Scanner and parser generate initial query representation
Database Management Systems 4 - 25 Implementation Techniques
Step 2 : Representation is optimized according to heuristic rules
Step 3 : Query execution plan is developed

For example : Suppose there are two relational algebra -


(1) city= “Pune” (cname Branch) Account Customer)

(2) cname(city=”Pune(Branch Account Customer))


The query evaluation plan can be drawn using the query trees as follows -

Fig. 4.15.1 Query evaluation plan

Out of the above given query evaluation plans, the Fig. 4.15.1 (b) is much faster than
Fig. 4.15.1 (a) because – in Fig. 4.15.1 (a) the join operation is among Branch, Account and
Customer, whereas in Fig. 4.15.1 (b) the join of (Account and Customer) is made with the
selected tuple for City=”Pune” . Thus the output of entire table for join operation is much
more than the join for some selected tuples. Thus we get choose the optimized query.

4.15.2 Cost based Estimation


 A cost based optimizer will look at all of the possible ways or scenarios in which a
query can be executed.
 Each scenario will be assigned a ‘cost’, which indicates how efficiently that query can
be run.
 Then, the cost based optimizer will pick the scenario that has the least cost and
execute the query using that scenario, because that is the most efficient way to run the
query.
 Scope of query optimization is a query block. Global query optimization involves
multiple query blocks.
 Cost components for query execution
o Access cost to secondary storage
Database Management Systems 4 - 26 Implementation Techniques
o Disk storage cost
o Computation cost
o Memory usage cost
o Communication cost
 Following information stored in DBMS catalog and used by optimizer
o File size
o Organization
o Number of levels of each multilevel index
o Number of distinct values of an attribute
o Attribute selectivity

 RDBMS stores histograms for most important attributes.

You might also like