r23 Dbms Unit 4- Implementation Techniques
r23 Dbms Unit 4- Implementation Techniques
o There is no duplication of data in this level so once a block is lost then there is no
way recover it.
o The main priority of this level is performance and not the reliability.
Level : RAID 1
o This level makes use of mirroring. That means all data in the drive is duplicated to
another drive.
o This level provides 100% redundancy in case of failure.
o Only half space of the drive is used to store the data. The other half of drive is just
a mirror to the already stored data.
o The main advantage of this level is fault tolerance. If some disk fails then the other
automatically takes care of lost data.
Level : RAID 2
o This level makes use of mirroring as well as stores Error Correcting Codes (ECC)
for its data striped on different disks.
o The data is stored in separate set of disks and ECC is stored another set of disks.
o This level has a complex structure and high cost. Hence it is not used
commercially.
Database Management Systems 4-4 Implementation Techniques
Level : RAID 3
o This level consists of byte-level stripping with dedicated parity. In this level, the
parity information is stored for each disk section and written to a dedicated parity
drive.
o We can detect single errors with a parity bit. Parity is a technique that checks
whether data has been lost or written over when it is moved from one place in
storage to another.
o In case of disk failure, the parity disk is accessed and data is reconstructed from
the remaining devices. Once the failed disk is replaced, the missing data can be
restored on the new disk.
Level : RAID 4
o RAID 4 consists of block-level stripping with a parity disk.
Database Management Systems 4-5 Implementation Techniques
o Note that level 3 uses byte-level striping, whereas level 4 uses block-level striping.
Level : RAID 5
o RAID 5 is a modification of RAID 4.
o RAID 5 writes whole data blocks onto different disks, but the parity bits generated
for data block stripe are distributed among all the data disks rather than storing
them on a different dedicated disk.
RAID : Level 6
o RAID 6 is a extension of Level 5
o RAID 6 writes whole data blocks onto different disks, but the two independent
parity bits generated for data block stripe are distributed among all the data
disks rather than storing them on a different dedicated disk.
o Two parities provide additional fault tolerance.
o This level requires at least four disks to implement RAID.
A file organization is a method of arranging records in a file when the file is stored on
disk.
A file is organized logically as a sequence of records.
Record is a sequence of fields.
There are two types of records used in file organization (1) Fixed Length Record
(2) Variable Length Record.
(1) Fixed Length Record
A file where each records is of the same length is said to have fixed length records.
Some fields are always the same length (e.g. PhoneNumber is always 10
characters).
Some fields may need to be 'padded out' so they are the correct length.
For example -
type Employee = record
EmpNo varchar(4)
Ename varchar(10)
Salary integer(5)
Phone varchar(10)
End
On insertion of a new record, we use the record pointed to by the header. We change
the header pointer to point to the next available record. If no space is available, we
add the new record to the end of the file.
(2) Variable Length Record
The figure also illustrates the use of a null bitmap, which indicates which attributes of
the record have a null value.
The variable length record can be stored in blocks. A specialized structure called
Database Management Systems 4-9 Implementation Techniques
slotted page structure is commonly used for organizing the records within a block.
This structure is as shown by following Fig. 4.2.1.
Fig. 4.2.1
(1) Heap File Organization : Any record can be placed anywhere in the file where
there is a space. There is no ordering for placing the records in the file. Generally
single file is used.
(2) Sequential File Organization : Records are stored in sequential order based on
the value of search key.
(3) Hashing File Organization : A hash function is used to obtain the location of the
record in the file. Based on the value returned by hash function, the record is
stored in the file.
4.3.1 Sequential File Organization
Database Management Systems 4 - 10 Implementation Techniques
The sequential file organization is a simple file organization method in which the
records are stored based on the search key value.
For example – Consider following set of records stored according to the RollNo of
student. Note that we assume here that the RollNo is a search key.
4 E
E
Then, we will insert it in sorted order of RollNo and
E adjust the pointers accordingly.
Fig. 4.3.1
However, maintaining physical sequential order is very difficult as there can be
several insertions and deletions.
We can maintain the deletion by next pointer chain.
For insertion following rules can be applied -
o If there is free space insert record there. If no free space, insert the record in an
overflow block.
o In either case, pointer chain must be updated.
4.3.2 Multi-table Clustering File Organization
In a multitable clustering file organization, records of several different relations are
stored in the same file.
For example - Following two tables Student and Course
Database Management Systems 4 - 11 Implementation Techniques
Sname Marks
Ankita 55
Ankita 67
Ankita 86
Prajkta 91
Ankita 55
Ankita 67
Ankita 86
Prajkta 91
This type of file organization is good for join operations such as Student Course.
This file organization results in variable size records.
The pointer chain can be added to the above records to keep track of address of next
record. It can be as shown in following Fig. 4.3.2
Fig. 4.3.2
Database Management Systems 4 - 12 Implementation Techniques
4.4 Indexing and Hashing AU : Dec.-11, Marks 10
An index is a data structure that organizes data records on the disk to make the
retrieval of data efficient.
The search key for an index is collection of one or more fields of records using which
we can efficiently retrieve the data that satisfy the search conditions.
The indexes are required to speed up the search operations on file of records.
There are two types of indices -
o Ordered Indices : This type of indexing is based on sorted ordering values.
o Hash Indices : This type of indexing is based on uniform distribution of values
across range of buckets. The address of bucket is obtained using the hash function.
There are several techniques of for using indexing and hashing. These techniques are
evaluated based on following factors -
o Access Types : It supports various types of access that are supported efficiently.
o Access Time : It denotes the time it takes to find a particular data item or set items.
o Insertion Time : It represents the time required to insert new data item.
o Deletion Time : It represents the time required to delete the desired data item.
o Space overhead : The space is required to occupy the index structure. But
allocating such extra space is worth to achieve improved performance.
Example 4.4.1 Since indices speed query processing. Why might they not be kept on several
search keys? List as many reasons as possible. AU : Dec.-11, Marks 6
Once if you are able to locate the first entry of the record containing block, other
entries are stored continuously. For example if we want to search a record for RegNo
11AS32 we need not have to search for the entire data file. With the help of primary
index structure we come to know the location of the record containing the RegNo
11AS30, now when the first entry of block 30 is located, then we can easily locate the
entry for 11AS32.
We can apply binary search technique. Suppose there are n = 300 blocks in a main data
file then the number of accesses required to search the data file will be log2 n + 1 = (log2
300) + 1 9
If we use primary index file which contains at the most n = 3 blocks then using binary
search technique, the number of accesses required to search using the primary index
file will be log2 n + 1 = (log2 3 ) + 1 3
This shows that using primary index the access time can be deduced to great extent.
Clustered Index :
In some cases, the index is created on non-primary key columns which may not be
unique for each record. In such cases, in order to identify the records faster, we will
group two or more columns together to get the unique values and create index out of
them. This method is known as clustering index.
When a file is organized so that the ordering of data records is the same as the
ordering of data entries in some index then say that index is clustered, otherwise it is
an unclustered index.
Database Management Systems 4 - 14 Implementation Techniques
Note that, the data file need to be in sorted order.
Basically, records with similar characteristics are grouped together and indexes are
created for these groups.
For example, students studying in each semester are grouped together. i.e.; 1st
semester students, 2nd semester students, 3rd semester students etc. are grouped.
2) Sparse index :
The index file usually occupies considerably less disk blocks than the data file because
its entries are much smaller.
A binary search on the index yields a pointer to the file record.
The types of single level indexing can be primary indexing, clustering index or
secondary indexing.
Database Management Systems 4 - 16 Implementation Techniques
Example : Following Fig. 4.5.5 represents the single level indexing -
In this technique two levels of indexing are used in order to reduce the mapping size
of the first level and in general.
Initially, for the first level, a large range of numbers is selected so that the mapping
size is small. Further, each range is divided into further sub ranges.
It is used to optimize the query processing and access records in a database with some
information other than the usual search key.
Database Management Systems 4 - 18 Implementation Techniques
For example -
Query processing is a collection of activities that are involved in extracting data from
database.
During query processing there is translation high level database language queries into
the expressions that can be used at the physical level of filesystem.
There are three basic steps involved in query processing and those are –
1. Parsing and Translation
o In this step the query is translated into its internal form and then into relational
algebra.
o Parser checks syntax and verifies relations.
o For instance - If we submit the query as,
SELECT RollNo, name
FROM Student
HAVING RollNo=10
Then it will issue a syntactical error message as the correct query should be
SELECT RollNo, name
FROM Student
HAVING RollNo=10
Database Management Systems 4 - 19 Implementation Techniques
Thus during this step the syntax of the query is checked so that only correct and verified
query can be submitted for further processing.
2. Optimization :
o During this process the query evaluation plan is prepared from all the relational
algebraic expressions.
o The query cost for all the evaluation plans is calculated.
o Amongst all equivalent evaluation plans the one with lowest cost is chosen.
o Cost is estimated using statistical information from the database catalog, such as
the number of tuples in each relation, size of tuples, etc.
3. Evaluation
o The query-execution engine takes a query-evaluation plan,executes that plan, and
returns the answers to the query.
The above describe steps are represented by following Fig. 4.11.1 –
University Questions
2. What is query optimization? Outline the steps in query optimization. AU : May-18, Marks 13
Scan each file block and test all records to see whether they satisfy the selection
condition
Cost = br block transfers + 1 seek
Where,
br denotes number of blocks containing records from relation r
If there are multiple records satisfying the selection add transfer cost of the number of
blocks containing records that satisfy selection condition.
4.14 Algorithms for JOIN Operation
JOIN operation is the most time consuming operation to process.
There are Several different algorithms to implement joins
1. Nested-loop join
2. Block nested-loop join
3. Indexed nested-loop join
4. Merge-join
5. Hash-join
Choice of a particular algorithm is based on cost estimate.
Algorithm For Nested Loop Join
This algorithm is for computing r s
Let, r is called the outer relation and s the inner relation of the join.
for each tuple tr in r do begin
for each tuple ts in s do begin
test pair (tr,ts) to see if they satisfy the join condition
if is satisfied, then, add (t r , ts ) to the result.
end
end
This algorithm requires no indices and can be used with any kind of join condition.
It is expensive since it examines every pair of tuples in the two relations.
In the worst case, if there is enough memory only to hold one block of each relation,
the estimated cost is
nr × bs + br block transfers, plus nr + br seeks
If the smaller relation fits entirely in memory, use that as the inner relation
o Reduces cost to br + bs block transfers and 2 seeks
Database Management Systems 4 - 23 Implementation Techniques
For example - Assume the query CUSTOMERS ORDERS (with join attribute only
being CName)
Number of records of customer : 10000 order : 5000
Number of blocks of customer : 400 order : 100
Formula Used :
(1) nr × bs + br block transfers,
(2) nr + br seeks
r is outer relation and s is inner relation.
With order as outer relation:
nr = 5000 bs = 400, br = 100
5000 × 400 + 100 = 2000100 block transfers and
5000 + 100 = 5100 seeks
With customer as the outer relation:
nr=10000, bs=100, br=400
10000 × 100 + 400 = 1000400 block transfers and
10000+400 =10400 seeks
If smaller relation (order ) fits entirely in memory, the cost estimate will be:
br + bs = 500 block transfers
Algorithm For Block Nested Loop Join
Variant of nested-loop join in which every block of inner relation is paired with every
block of outer relation.
In this operation, both the relations are sorted on their join attributes. Then merged
these sorted relations to join them.
Only equijoin and natural joins are used.
The cost of merge join is,
br + bs block transfers + ⌈br /bb⌉ + ⌈bs/bb⌉ seeks + the cost of sorting, if relations are
unsorted.
(4) Hash Join
In this operation, the hash function h is used to partition tuples of both the relations.
h maps A values to {0, 1, ..., n}, where A denotes the attributes of r and s used in the
join.
Cost of hash join is :
3(br + bs ) + 4 × n block transfers + 2(⌈br /b b⌉ + ⌈bs/bb⌉) seeks
If the entire build input can be kept in main memory no partitioning is required
Cost estimate goes down to br + bs.
Out of the above given query evaluation plans, the Fig. 4.15.1 (b) is much faster than
Fig. 4.15.1 (a) because – in Fig. 4.15.1 (a) the join operation is among Branch, Account and
Customer, whereas in Fig. 4.15.1 (b) the join of (Account and Customer) is made with the
selected tuple for City=”Pune” . Thus the output of entire table for join operation is much
more than the join for some selected tuples. Thus we get choose the optimized query.