File and File Structure: Overview of Storage Device
4) Hard disk:
A hard disk is a hardware component in a computer on which data is kept permanently. Data on the hard disk is not directly accessible by the CPU, hence access is slower than main memory; compared with RAM, however, a hard disk is cheaper per bit. Typically, the entire database is stored on disk, and data must be moved from disk to main memory in order to be processed.
5) Magnetic tape:
Magnetic tape memory is usually used for backing up large volumes of data. When the system needs to access a tape, the tape is first mounted; when the data has been accessed, it is un-mounted. Memory access time is slower on magnetic tape, and it usually takes a few minutes to access a tape.
If we assume that each character occupies one byte, an integer occupies 4 bytes, and a real 8 bytes, our
deposit record is 52 bytes long. The simplest approach is to use the first 52 bytes for the first record, the
next 52 bytes for the second, and so on.
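With fixed-length records, the position of any record can be computed directly from its number. A minimal sketch, using the 52-byte record size from above (the helper names are illustrative):

```python
RECORD_SIZE = 52  # bytes per deposit record, as computed above

def record_offset(i):
    """Byte offset at which the i-th record (0-based) starts."""
    return i * RECORD_SIZE

def record_slot(offset):
    """Which record number starts at a given byte offset."""
    return offset // RECORD_SIZE
```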
However, there are two problems with this approach:
1. Account-information is an array with an arbitrary number of elements, i.e. the type definition does not limit the number of elements in the array, so there is no limit on how large a record can be.
2. The space available for the records must be managed carefully. To manage free space, a pointer indicates the start of the free-space area.
There are two approaches for representing variable-length records:
1. Byte-string representation
2. Fixed-length representation
1. Byte-String Representation
Byte-string representation attaches a special end-of-record symbol to the end of each record, so that each record can be stored as a string of successive bytes.
b. Pointers: represent a record by a list of fixed-length records, chained together via pointers. This can be used even if the maximum record length is not known. The problem is that space is wasted in the successive records of a chain, as the non-repeating fields are still present in them.
Buffer Management
We need to use disk storage for the database, and to transfer blocks of data between MM and disk. We also want
to minimize the number of such transfers, as they are time-consuming. One way is to keep as many blocks as
possible in MM. Usually, we cannot keep all blocks in MM, and so we need to manage the allocation of available
MM space. The buffer is the part of MM available for storage of copies of disk blocks. The subsystem
responsible for the allocation of buffer space is called the buffer manager.
The buffer manager handles all requests for blocks of the database. If the block is already in MM, its address in MM is given to the requestor. If not, the buffer manager must read the block in from disk (possibly displacing some other block if the buffer is full) and then pass the address in MM to the requestor. The buffer manager must use sophisticated techniques, such as a replacement policy, in order to provide good service.
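A buffer manager of this kind can be sketched with an LRU (least-recently-used) replacement policy. This is illustrative Python: the class and method names are not a real DBMS API, and the disk read is simulated.

```python
from collections import OrderedDict

class BufferManager:
    """Minimal sketch of a buffer manager with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = OrderedDict()  # block_id -> block contents, LRU first

    def get_block(self, block_id):
        if block_id in self.buffer:            # block already in MM
            self.buffer.move_to_end(block_id)  # mark as most recently used
            return self.buffer[block_id]
        if len(self.buffer) >= self.capacity:  # buffer full: displace LRU block
            self.buffer.popitem(last=False)
        block = self._read_from_disk(block_id)
        self.buffer[block_id] = block
        return block

    def _read_from_disk(self, block_id):
        return f"<contents of block {block_id}>"  # simulated disk transfer
```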
File Organization
A file is a sequence of records stored in binary format. A disk drive is formatted into several blocks that can store
records. File records are mapped onto those disk blocks. File Organization defines how file records are mapped
onto disk blocks. We have four types of File Organization to organize file records.
1. Heap (Pile) file organization
2. Sequential file organization
3. Indexed Sequential file organization
4. Hash file organization
Disadvantage
This method is inefficient for larger databases, as it takes time to search for or modify a record.
Proper memory management is required to boost performance; otherwise there will be many unused memory blocks lying around and the file size will simply keep growing.
Advantage:
The design is very simple compared to other file organizations.
Searching and deletion are faster than in heap file organization.
This method is good for report generation or statistical calculations.
These files can be stored on magnetic tapes, which are comparatively cheap.
Disadvantage
This method always involves the effort of sorting the records. Each time an insert/update/delete transaction is performed, the file must be re-sorted.
3. Indexing
An index is any data structure that takes as input a property of records, typically the values of one or more fields, and finds the records with that property quickly.
Primary index
We assume all files are ordered sequentially on some search key; such files are also called index-sequential files. A primary index may be dense or sparse.
Dense Index
An index record appears for every search-key value in the file. Each index record contains the search-key value and a pointer to the actual record. This requires more memory space for the index, but records are accessed fast.
Secondary index
We can use an extra level of indirection to implement secondary indices on search keys that are not candidate keys. The pointers in such a secondary index do not point directly to the file but to a bucket that contains pointers to the file. A secondary index must be dense, with an index entry for every search-key value and a pointer to every record in the file. Secondary indices improve the performance of queries that use keys other than the search key of the primary index.
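The bucket indirection described above can be sketched as follows. This is illustrative Python: the sample records and the city values are invented for the example.

```python
# Secondary index on a non-candidate key (city): each search-key value maps
# to a bucket of record pointers, since several records may share the value.
records = {0: ("Ram", "KTM"), 1: ("Sita", "PKR"), 2: ("Hari", "KTM")}

secondary_index = {}  # city -> bucket (list of record pointers)
for ptr, (name, city) in records.items():
    secondary_index.setdefault(city, []).append(ptr)

def lookup(city):
    """Follow the bucket, then dereference each pointer into the file."""
    return [records[ptr] for ptr in secondary_index.get(city, [])]
```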
Suppose we want to search for 65 in the B+ tree structure below. First we fetch the intermediary node, which directs us to the leaf node that can contain the record for 65. So we follow the branch between the keys 50 and 75 in the intermediary node, and we are redirected to the third leaf node. There the DBMS performs a sequential search to find 65. Suppose, instead of 65, we have to search for 60. What will happen in this case? We will not find it in the leaf node. No insertion/update/delete is allowed during a search in a B+ tree.
Insertion in B+ Tree
Traverse the B+ tree until we find the leaf node suitable for insertion.
We have two cases for inserting a key.
i. Node is not full
- Insertion is done in the node in increasing order of keys.
ii. Node is full
a. If the full node is a leaf, classify the keys as L (lowest), M (middle) and H (highest) and split the node into a left node holding L and a right node holding M and H.
b. If the full node is a non-leaf, classify the keys as L (lowest), M (middle) and H (highest) and split the node into a left node holding L and a right node holding H.
In both cases the median value (M) is pushed one level up into the parent node. If the parent node is not full, it accommodates the median value; otherwise repeat the same procedure as in case ii.
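The two split cases can be sketched in code. A minimal sketch, assuming a sorted Python list of keys per node; the function names are illustrative:

```python
def split_leaf(keys):
    """Split an overflowing leaf node: L stays left, M and H go right.
    In a B+ tree leaf split, the first key of the right node (M) is
    *copied* up into the parent."""
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]

def split_internal(keys):
    """Split an overflowing non-leaf node: the middle key (M) is
    *pushed* up into the parent and removed from both halves."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid + 1:], keys[mid]
```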
Example:
Construct a B+ tree for the following set of key values: (2,3,5,7,11,17,19,23,29,31)
Assume that the tree is initially empty and values are added in ascending order.
Construct the B+ tree given that the number of pointers per node is four.
Solution:
We know each node has N pointers and N-1 search-key values. In our example the pointer number is 4, so the number of search-key values is 4-1 = 3. So each node has at most 4 pointers and 3 search-key values.
[The intermediate tree diagrams are not reproduced here.] We fill the first leaf node, then insert further values, splitting leaf nodes as they overflow. We put 19, 23 in the leaf node and split the leaf node. Then we put 29, 31 and split the leaf node again; this time, splitting the leaf also forces us to split the non-leaf node, which we split by the same rules.
A hash function H(K) is a mapping function that maps the set of search keys K to the bucket addresses where the keys and their associated pointers are placed.
Formally, let K denote the set of all search-key values, B denote the set of all bucket addresses, and H denote the hash function. To insert a record with search key Ki, we compute H(Ki), which gives the address of the bucket for that record; the record is then stored in that bucket.
Deletion is also simple. If the search-key value of the record to be deleted is Ki, we compute H(Ki) to find the corresponding bucket for that record and delete the record from the bucket.
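Insertion, lookup and deletion through H(K) can be sketched as follows. The bucket count, the choice of hash function and the helper names are assumptions for the example:

```python
NUM_BUCKETS = 8  # illustrative number of bucket addresses

def H(key):
    """Hash function mapping a search key to a bucket address."""
    return hash(key) % NUM_BUCKETS

buckets = {b: [] for b in range(NUM_BUCKETS)}

def insert(key, record):
    """Store the record in the bucket H(key)."""
    buckets[H(key)].append((key, record))

def delete(key):
    """Remove all records with this search key from its bucket."""
    b = buckets[H(key)]
    b[:] = [(k, r) for (k, r) in b if k != key]

def lookup(key):
    """Return the records for this search key (collisions are filtered out)."""
    return [r for (k, r) in buckets[H(key)] if k == key]
```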
Assignment
Recovery System
A computer system is subject to failure due to a variety of causes such as disk crash, power failure, software crash, fire, network outage, etc.
Whatever the cause, information may be lost.
The database system must take action in advance to ensure that the atomicity and durability properties of transactions are preserved.
The recovery system is an integral part of a database system; it is responsible for restoring the database to a consistent state that existed prior to the occurrence of the failure.
Failure Classification
Designing a reliable system that can recover from failure requires identifying the types of failure with which the system has to deal.
The major types of failure are:
1. Transaction failure: a transaction failure occurs when a transaction is aborted. There are two types of error that can make a transaction fail.
Logical errors: the transaction cannot complete due to some internal error condition, such as bad input data or a resource limit being exceeded.
System errors: the database system must terminate an active transaction due to an error condition (e.g., deadlock). The transaction can, however, be re-executed at a later time.
2. System crash: a power failure or other hardware or software failure causes the system to crash. This generally occurs when hardware malfunctions or there is a bug in the database software or OS, resulting in loss of the contents of volatile storage.
3. Disk failure: a head crash or similar disk failure destroys all or part of disk storage. Copies of the data on another disk or on tertiary media such as tapes are used to recover from the failure.
4. Natural disaster: fire, flood, earthquake, etc. may cause physical loss of all information on all media.
Storage Structure
The various data items in a database may be stored on and accessed from a number of different storage media. To decide how to recover data from any kind of failure, we must know the types of storage media and their access methods. Storage media can be classified as:
1. Volatile storage:
Information here does not survive system crashes.
Examples: main memory, cache memory.
2. Nonvolatile storage:
Information normally survives system crashes.
Examples: disk, tape, flash memory.
It may still fail, however, e.g. in the case of a head crash.
3. Stable storage:
An idealized form of storage that survives all failures, i.e. the system is designed not to lose data.
It is approximated by maintaining multiple copies on distinct nonvolatile media.
3. Checkpoints
The main problems faced in the recovery procedures discussed earlier are:
1. Searching the entire log to determine which transactions need to be redone and which need to be undone is time-consuming.
2. We might unnecessarily redo transactions which have already output their updates to the database.
A scheme called checkpointing is used to limit the volume of log information that has to be handled and processed in the event of a system failure. A checkpoint periodically copies log information onto stable storage. A checkpoint requires the following sequence of actions:
1. Output all the log records currently residing in main memory onto stable storage.
2. Output all modified buffer blocks to disk.
3. Write a log record <checkpoint> onto stable storage.
During recovery, we need to consider only the most recent transaction Ti that started just before the checkpoint and the transactions that started after it. For this, the following tasks are done:
1. Scan backwards from the end of the log to the most recent <checkpoint> record.
2. Continue scanning backward until a record <Ti start> is identified.
3. Consider only the log records following <Ti start>; the earlier part of the log can be ignored during recovery.
The UNDO and REDO operations on all transactions Tj that started execution after the identified Ti are as follows:
- If there is no <Tj commit> record in the log and Tj has performed immediate modification, then execute UNDO for Tj.
- If there is a <Tj commit> record in the log, then execute REDO for Tj.
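The checkpoint-based scan can be sketched over a simplified log. This is a sketch under assumptions: the tuple encoding of log records is invented, and the refinement involving the transaction Ti active at the checkpoint is omitted; only records after the checkpoint are classified.

```python
def recover(log):
    """Simplified checkpoint-based recovery scan. The log is a list of
    records like ("start", "T2"), ("commit", "T1") or ("checkpoint",)."""
    # Scan backwards to the most recent <checkpoint> record.
    cp = max(i for i, rec in enumerate(log) if rec[0] == "checkpoint")
    started, committed = set(), set()
    for rec in log[cp + 1:]:               # only records after the checkpoint
        if rec[0] == "start":
            started.add(rec[1])
        elif rec[0] == "commit":
            committed.add(rec[1])
    redo = sorted(committed)               # committed after checkpoint: REDO
    undo = sorted(started - committed)     # started but never committed: UNDO
    return redo, undo
```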
Shadow Paging
Shadow paging is an alternative to log-based recovery which is useful if transactions execute serially. Here, when a page of storage is modified by a transaction, a new page is allocated for the modified data and the old page remains as a shadow copy of the data.
In this technique, the database is partitioned into some number of fixed-length blocks called pages. The key idea behind the shadow paging technique is to maintain two page tables during the lifetime of a transaction, called the current page table and the shadow page table. Each entry in the page tables contains a pointer to a page on disk. When the transaction starts, both page tables are identical; the shadow page table is never modified during the transaction, while the current page table is updated on every write, and it becomes the new shadow page table when the transaction commits.
Disadvantage
1. Copying the entire page table is very expensive.
2. Commit overhead: the commit of a single transaction requires multiple blocks to be output: the current page table and the actual data, i.e. the updated pages. Log-based schemes need to output only the log records.
3. Data fragmentation: shadow paging causes database pages to change location, so they are no longer contiguous.
4. Garbage collection: each time a transaction commits, the database pages containing the old version of the data changed by the transaction must become inaccessible. Such pages do not contain usable information and are hence considered garbage. Periodically it is necessary to find all the garbage pages and add them to the list of free pages. This process is called garbage collection and imposes additional overhead and complexity on the system.
5. It is hard to extend the algorithm to allow transactions to run concurrently.
When the primary site fails, the remote backup site takes over processing.
First, however, it performs recovery, using its (perhaps outdated) copy of the data from the primary and the log records received from the primary.
In effect, the remote backup site performs the recovery actions that would have been performed at the primary site when the latter recovered.
Once recovery has been performed, the remote backup site starts processing transactions.
When the backup site takes over processing, it becomes the new primary, and the old primary site becomes the new backup site.
Transaction
A collection of operations such as insertions, deletions, modifications, etc. that forms a single logical unit of work is called a transaction. A transaction is a unit of program execution that accesses and possibly updates various data items. A transaction's operations must be done entirely (commit) or aborted entirely (rollback); no intermediate state is acceptable. During transaction execution the database may be temporarily inconsistent; however, when the transaction completes successfully, i.e. commits, the database must be consistent. After a transaction commits, the changes it has made to the database persist, even if there is a system failure.
Transaction processing systems are systems that include a large database and allow many users to use the same database concurrently. For example: an airline reservation system.
Example:
Transaction to transfer Rs 50 from account A to account B
1. Read(A)
2. A := A – 50
3. Write (A)
4. Read(B)
5. B := B + 50
6. Write(B)
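The six steps above can be sketched as an all-or-nothing unit. This is illustrative Python: the in-memory dictionary standing in for the database and the starting balances are assumptions for the example.

```python
db = {"A": 100, "B": 200}  # assumed starting balances

def transfer(amount):
    """Transfer `amount` from A to B atomically: commit or roll back."""
    old = dict(db)           # snapshot of the consistent state for rollback
    try:
        a = db["A"]          # 1. Read(A)
        a = a - amount       # 2. A := A - amount
        db["A"] = a          # 3. Write(A)
        if a < 0:
            raise ValueError("insufficient funds")  # force an abort
        b = db["B"]          # 4. Read(B)
        b = b + amount       # 5. B := B + amount
        db["B"] = b          # 6. Write(B); transaction commits
    except ValueError:
        db.clear()
        db.update(old)       # rollback: restore the state prior to the start
```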
State of Transaction
After a transaction starts execution, it changes state. The state of a transaction is defined as the current activity of that transaction. Each transaction may be in one of the states explained below.
[State transition diagram: Active → Partially Committed → Committed; Active → Failed → Aborted]
1. Active: the initial state; the transaction stays in this state while it is executing.
2. Partially Committed: after the final statement has been executed.
3. Failed: after the discovery that normal execution can no longer proceed.
4. Aborted: after the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction.
5. Committed: a transaction that completes its execution successfully is said to be committed.
Schedules
When several transactions execute concurrently, the order of execution of the various instructions is known as a schedule. There are four types of schedule:
1. Serial schedule
2. Non-serial schedule
3. Equivalent schedule
4. Conflict schedule
A schedule S is called a serial schedule if, for any two transactions Ti and Tj participating in S, either all operations of Ti precede all operations of Tj or vice versa. Hence in a serial schedule only one transaction is active at a time; the commit or abort of the active transaction initiates execution of the next transaction. No interleaving occurs in a serial schedule.
A schedule S is called a non-serial schedule if, for any two transactions Ti and Tj participating in S, the operations of each transaction are executed non-consecutively, interleaved with operations from the other transaction.
Two schedules A and B are said to be equivalent schedules if the effect of executing the first schedule is identical to the effect of executing the second schedule.
A schedule S is called a conflict schedule if for some two transactions Ti and Tj there is an action ti of Ti and an action tj of Tj accessing the same object, and at least one of the two actions is a write.
Schedule1 (serial: T1 followed by T2)
    T1: Read(A); A := A - 50; Write(A); Read(B); B := B + 50; Write(B)
    T2: Read(A); Temp := A * 0.1; A := A - Temp; Write(A); Read(B); B := B + Temp; Write(B)

Schedule2 (serial: T2 followed by T1)
    T2: Read(A); Temp := A * 0.1; A := A - Temp; Write(A); Read(B); B := B + Temp; Write(B)
    T1: Read(A); A := A - 50; Write(A); Read(B); B := B + 50; Write(B)
Schedule3 (non-serial)
    T1: Read(A); A := A - 50; Write(A)
    T2: Read(A); Temp := A * 0.1; A := A - Temp; Write(A)
    T1: Read(B); B := B + 50; Write(B)
    T2: Read(B); B := B + Temp; Write(B)

Schedule4 (non-serial)
    T1: Read(A); A := A - 50
    T2: Read(A); Temp := A * 0.1; A := A - Temp; Write(A); Read(B)
    T1: Write(A); Read(B); B := B + 50; Write(B)
    T2: B := B + Temp; Write(B)
In the above example, Schedule1 and Schedule2 are serial schedules, because in Schedule1 transaction T1 is followed by transaction T2 and in Schedule2 transaction T2 is followed by transaction T1. Schedule3 is a non-serial schedule because there is interleaving of operations between transactions T1 and T2; however, Schedule3 is equivalent to Schedule1 and Schedule2, since the sum (A + B) is preserved. Schedule4 is a conflict schedule because it does not preserve the value of (A + B).
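The claim about (A + B) can be checked by simulating the schedules. This is a sketch under assumptions: the starting balances A = 100, B = 200 and the tuple encoding of operations are invented for the example.

```python
def run(schedule):
    """Replay a schedule. Each step is (transaction, operation, item)."""
    db = {"A": 100, "B": 200}          # assumed starting balances
    local = {"T1": {}, "T2": {}}       # per-transaction local variables
    for txn, op, var in schedule:
        v = local[txn]
        if op == "read":
            v[var] = db[var]
        elif op == "write":
            db[var] = v[var]
        elif op == "calc" and txn == "T1":   # T1: A := A - 50, B := B + 50
            v[var] = v[var] - 50 if var == "A" else v[var] + 50
        elif op == "calc":                   # T2: Temp := A * 0.1
            if var == "A":
                v["Temp"] = v["A"] * 0.1
                v["A"] = v["A"] - v["Temp"]
            else:
                v["B"] = v["B"] + v["Temp"]
    return db["A"] + db["B"]

def ops(txn, var):
    """Read-compute-write burst of one transaction on one item."""
    return [(txn, "read", var), (txn, "calc", var), (txn, "write", var)]

schedule1 = ops("T1", "A") + ops("T1", "B") + ops("T2", "A") + ops("T2", "B")
schedule2 = ops("T2", "A") + ops("T2", "B") + ops("T1", "A") + ops("T1", "B")
schedule3 = ops("T1", "A") + ops("T2", "A") + ops("T1", "B") + ops("T2", "B")
# Schedule4: T2 overwrites A between T1's read and write of A.
schedule4 = ([("T1", "read", "A"), ("T1", "calc", "A")]
             + ops("T2", "A") + [("T2", "read", "B")]
             + [("T1", "write", "A")] + ops("T1", "B")
             + [("T2", "calc", "B"), ("T2", "write", "B")])
```

Running the four schedules shows that Schedule1, Schedule2 and Schedule3 preserve A + B = 300, while Schedule4 does not.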
Serializability
Serial schedules are generally considered unacceptable in practice because, in a serial schedule, if a transaction waits for an I/O operation to complete, CPU time is wasted while other transactions are ready for execution. Interleaving, i.e. a non-serial schedule, can therefore improve the use of CPU cycles. But some non-serial schedules produce erroneous results while others produce correct results. This introduces the concept of serializability.
A concurrent execution of N transactions is called serializable if the execution is computationally equivalent to a serial execution. When more than one transaction is executed concurrently, we must have serializability in order to have the same effect on the database as some serial execution would have.
There are two types of serializability.
1. Conflict serializability
2. View serializability
1. Conflict serializability
If two operations in a schedule satisfy all three of the following conditions, the operations are said to conflict:
i. They belong to different transactions.
ii. They access the same item.
iii. At least one of the operations is a write operation.
Instructions li and lj of transactions Ti and Tj respectively conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions is a write of Q.
1. li = read(Q), lj = read(Q): li and lj do not conflict.
2. li = read(Q), lj = write(Q): they conflict.
3. li = write(Q), lj = read(Q): they conflict.
4. li = write(Q), lj = write(Q): they conflict.
If a schedule S can be transformed into a schedule S´ by a series of swaps of non-conflicting instructions, we say that S and S´ are conflict equivalent.
So a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
    T1              T2
    Read (A)
                    Write (A)
    Write (A)
Schedule1 is conflict serializable, since we can swap non-conflicting instructions to obtain an equivalent serial schedule (T1, T2) or (T2, T1); but Schedule2, shown above, is not, as we are unable to swap its instructions to obtain either the serial schedule (T1, T2) or (T2, T1).
2. View serializability
Two schedules S1 and S2 having the same set of instructions are said to be view equivalent if the following conditions are met:
1. If transaction Ti in S1 reads the initial value of an object X, so does transaction Ti in S2.
2. If transaction Ti in S1 reads the value written by transaction Tj in S1 for object X, so does transaction Ti in S2.
3. If transaction Ti in S1 is the final transaction to write the value of an object X, so is transaction Ti in S2.
A schedule is said to be view serializable if it is view equivalent to a serial schedule. Schedule3 above is both view serializable and conflict serializable. Every conflict-serializable schedule is also view serializable, but there are view-serializable schedules that are not conflict serializable.
The figure below shows a schedule which is view serializable but not conflict serializable.
Testing of Serializability
For testing serializability, a simple and efficient method is to construct a directed graph, called a precedence graph, for schedule S:
G = (V, E)
The set of vertices consists of all the transactions participating in the schedule.
The set of edges consists of all edges Ti → Tj for which one of three conditions holds:
a. Ti executes Write(Q) before Tj executes Read(Q).
b. Ti executes Read(Q) before Tj executes Write(Q).
c. Ti executes Write(Q) before Tj executes Write(Q).
If the graph contains no cycle, the schedule is conflict serializable; otherwise it is non-serializable. That is, a schedule is conflict serializable if and only if its precedence graph is acyclic.
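The test above can be sketched in code. A minimal sketch, assuming an illustrative encoding of a schedule as (transaction, operation, item) tuples:

```python
def precedence_graph(schedule):
    """Build the precedence-graph edge set of a schedule.
    Adds Ti -> Tj for every conflicting pair where Ti's step comes first."""
    edges = set()
    for i, (ti, op_i, q_i) in enumerate(schedule):
        for tj, op_j, q_j in schedule[i + 1:]:
            # conflict: different txns, same item, at least one write
            if ti != tj and q_i == q_j and "write" in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    """Detect a cycle via depth-first search over the edge set."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    def dfs(u):
        color[u] = GREY
        for v in adj.get(u, []):
            if color.get(v, WHITE) == GREY:
                return True            # back edge: cycle found
            if color.get(v, WHITE) == WHITE and dfs(v):
                return True
        color[u] = BLACK
        return False
    return any(color.get(u, WHITE) == WHITE and dfs(u) for u in list(adj))

def conflict_serializable(schedule):
    return not has_cycle(precedence_graph(schedule))
```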
Example:
[Precedence graph over T1, T2 and T3 with conflicting accesses on item X; the edges form a cycle.]
Since the graph contains a cycle, the schedule is not conflict serializable.
[Precedence graph over T1, T2 and T3 with conflicting accesses on item X; the edges form no cycle.]
Since the graph contains no cycle, the schedule is conflict serializable.
T1 T2 T3
Read(X)
Read (Z)
Read (X)
Read (Z)
Read (Y)
Read (Y)
Write (X)
Write (Z)
Write (Y)
Write (Y)
[Precedence graph over T1, T2 and T3 with edges labeled X, Y and Z; the edges form a cycle.]
Since the graph contains a cycle, the schedule is not conflict serializable.
Let A and B be two arbitrary lock modes. Suppose transaction T1 holds an A-mode lock on data item X and another transaction T2 requests a B-mode lock on the same data item. If T2 can be granted the lock on X in B mode, we say that mode B is compatible with mode A. Shared mode is compatible with shared mode but not with exclusive mode.
At any time, several shared locks can be held simultaneously on a particular data item. A subsequent exclusive-mode locking request has to wait until the currently held shared-mode locks are released.
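The compatibility rules for shared (S) and exclusive (X) mode can be written as a small matrix. A sketch; the helper names are illustrative:

```python
# Lock-compatibility matrix: COMPATIBLE[(held, requested)] is True when the
# requested mode can be granted alongside the held mode.
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}

def can_grant(held_modes, requested):
    """A request is granted only if it is compatible with every held lock."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)
```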
For example, let us consider two transactions as below.
    T1              T2
    Read (Y)        Read (X)
    Read (X)        Read (Y)
    X = X + Y       X = X - Y
    Write (X)       Write (X)
Compiled By: Mohan Bhandari
The schedule S2 would not lead to an incorrect result, but both transactions become blocked waiting for each other, resulting in a deadlock situation. Deadlock occurs because of a circular-wait condition involving two or more transactions, as above: T2 holds B and requests A, which is held by T1, while T1 holds A and requests B, which is held by T2. So in 2PL we need a deadlock detection mechanism and a technique to resolve the deadlock once it occurs.
[Wait-for graph: T1 → T2 waiting for B, and T2 → T1 waiting for A, forming a cycle.]
Types
A. Static 2PL:
In this scheme, a transaction acquires all the locks it needs before it begins execution. All the locks are released after the last step of the transaction.
Example:
Readlock (X)
Readlock(Y)
Writelock(Z)
Read (X)
Read (Y)
Z=X+Y
Write (Z)
Unlock (X)
Unlock (Y)
Unlock (Z)
B. Dynamic 2PL:
In this scheme, a transaction locks a data item immediately before it accesses it. All the locks are released immediately after the last step of the transaction.
Example:
Readlock (X)
Read (X)
Readlock (Y)
Read (Y)
Writelock (Z)
Z=X+Y
Write (Z)
Unlock (X)
Unlock (Y)
Unlock (Z)
C. Strict 2PL:
In this protocol, a transaction does not release any of its write (exclusive) locks until it commits or aborts.
D. Rigorous 2PL:
In this protocol, a transaction does not release any of its locks, write or read, until it commits or aborts.
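The two-phase property itself — once any lock is released, no new lock may be acquired — can be checked with a small sketch (the operation encoding is illustrative):

```python
def is_two_phase(ops):
    """Check the two-phase property for one transaction's operation list,
    e.g. [("lock", "X"), ("unlock", "X")]."""
    unlocked = False
    for action, _item in ops:
        if action == "unlock":
            unlocked = True                # shrinking phase has begun
        elif action == "lock" and unlocked:
            return False                   # lock acquired after an unlock
    return True
```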
Deadlock
A deadlock is a set of blocked transactions, each holding a resource and waiting to acquire a resource held by another transaction in the set. The simplest example of deadlock is given below.
1. Transaction T1 requests resource A and receives it.
2. Transaction T2 requests resource B and receives it.
3. Transaction T1 requests resource B and is queued up, waiting for the release of B by transaction T2.
4. Transaction T2 requests resource A and is queued up, waiting for the release of A by transaction T1.
Condition of deadlock
We can say a deadlock has occurred if these four conditions hold simultaneously.
1. Mutual exclusion: only one transaction at a time can use a resource.
2. Hold and wait: a transaction holds at least one resource while waiting to acquire further resources held by other transactions.
3. No preemption: a resource can be released only voluntarily by the transaction holding it.
4. Circular wait: there is a set of waiting transactions {T1, T2, ..., Tn} such that each Ti is waiting for a resource held by the next transaction in the set.
Deadlock Prevention
We can prevent deadlock by:
1. Preventing mutual exclusion by making every sharable data item sharable. This does not create any problem for read operations but may lead to serious problems for write operations.
2. Ensuring that no circular wait can occur.
3. Allowing resources to be preempted.
4. Not allowing a transaction to request a resource while it is holding another resource.
5. Performing transaction rollback instead of waiting for a lock.
Deadlock Detection
Whenever a transaction is blocked by a lock, the DBMS builds and analyses a wait-for graph. If there is a cycle in the wait-for graph, then a deadlock exists in the system. Once a deadlock has been detected, it must be resolved. The most common resolution technique is to select one transaction in the wait-for graph as a victim, roll it back and restart it, while the other transactions continue.
Deadlock Recovery
When deadlock is detected, the system must recover from the deadlock. Some of the approaches for breaking a
deadlock are listed below.
1. Simply Rollback one or more Transactions (victim) in order to break the circular wait.
In this, two method can be applied.
a) Rollback all deadlock transaction
This will break the deadlock cycle but at a loss of lots of computation, since transactions may have
computed for a long period of time and the result of the partial computation must be discarded and
recomputed later.
b) Rollback one process at a time
This will rollback one transaction at a time until a deadlock cycle is eliminated. With each rollback,
this requires invoking deadlock detection algorithm to determine whether any other transactions are
still deadlock.
2. Resource Preemption
In this, preempt some resource from transactions and allocate it to other transactions till the deadlock
cycle is broken. It involve the following issues.
a) We must determine which resource and which transaction are to be preempted. If a transaction has
completed 95% of its execution, it should not be our choice for preemption.
b) The process from which the resource is preempted should be rolled back and restarted.
c) Starvation should be avoided. Starvation happens if same transaction is chosen as victim .i.e. resource
are always preempted from same transaction as a result this transaction never completes its task.
Multiple Granularity
Granularity is the size of the data item being locked. It is a major factor in concurrency control because it affects the performance of concurrency and recovery: if the granularity of a data item is very large, the locking overhead is very low, but so is the degree of concurrency. For example, if we select the whole database as the data item, then a transaction T accessing even a small portion of the database will lock the whole database, and no concurrent access by other transactions will be allowed.
Multiple granularity allows data items to be of various sizes and defines a hierarchy of data granularities, where the smaller granularities are nested within the larger ones. It can be represented graphically as a tree.
The highest level represents the entire database. Below it are nodes of type area. Each area in turn has nodes of type file as its children. Finally, each file has nodes of type record. In addition to the shared lock and the exclusive lock, there are three additional lock modes:
1. Intention-shared (IS): if a node is locked in IS mode, it indicates explicit locking at a lower level of the tree, but only with shared locks.
2. Intention-exclusive (IX): indicates explicit locking at a lower level with exclusive or shared locks.
3. Shared and intention-exclusive (SIX): the subtree rooted at the node is locked explicitly in shared mode, and explicit locking is being done at a lower level with exclusive-mode locks.
Intention locks allow a higher-level node to be locked in S or X mode without having to check all descendant nodes. Locks are acquired in root-to-leaf order, whereas they are released in leaf-to-root order.
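The compatibility among the five modes is usually given as a matrix. A sketch of the standard multiple-granularity compatibility matrix (the helper names are illustrative):

```python
# Multiple-granularity lock compatibility: COMPAT[held][requested] is True
# when the requested mode can coexist with the held mode on the same node.
COMPAT = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def compatible(held, requested):
    """Can `requested` be granted on a node already locked in `held` mode?"""
    return COMPAT[held][requested]
```

Note that SIX conflicts even with itself: only one transaction at a time may hold a subtree in SIX mode, though readers with IS locks can coexist with it.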
Assignment