Storage System Hierarchy in DBMS
The storage hierarchy typically has multiple levels, each with its specific
characteristics. Here's a typical hierarchy from fastest (and usually most expensive
per byte) to slowest (and usually least expensive per byte):
1. Registers
Located within the CPU.
Smallest and fastest type of storage.
Used to hold data currently being processed.
2. Cache Memory
Small, fast memory located between the CPU and main memory.
Holds frequently used data and instructions to speed up processing.
3. Main Memory (RAM)
Volatile working memory where the operating system, programs, and data in current use reside.
4. Flash Memory / Solid-State Drives (SSDs)
Non-volatile solid-state storage with no moving parts.
Faster than magnetic disks but more expensive per byte.
5. Magnetic Disks (HDDs)
The traditional medium for long-term database storage.
Supports random access to data.
6. Optical Disks
CDs and DVDs, mainly used for software distribution and archiving.
7. Magnetic Tapes
Sequential access storage, unlike disks which are random access.
Often used for backups and archiving due to their high capacity and low cost.
Much slower access times compared to magnetic disks.
Types of Storage:
1. Primary Storage: Includes registers, cache memory, and main memory
(RAM). It's the main memory where the operating system, application
programs, and data in current use are kept for quick access by the computer's
processor.
2. Secondary Storage: Encompasses data storage devices like HDDs, SSDs,
CDs, and USB drives. It is non-volatile and retains data even when the
computer is turned off.
3. Tertiary Storage or Off-line Storage: Often involves magnetic tape systems
or optical disk archives. This is slower than secondary storage and is used for
data archiving and backup.
4. Quaternary Storage: Refers to methods like cloud storage or other remote
storage techniques where data is stored in remote servers and is fetched over
the internet or other networks.
File Organization
File organization refers to the arrangement of data on storage devices. The method
chosen can have a profound effect on the efficiency of various database operations.
Common methods of file organization include sequential, hash, and indexed organization.
Sequential File Organization
Ordered Records: Records in a sequential file are stored based on a key field.
Contiguous Memory Allocation: The records are stored in contiguous memory locations.
No Direct Access: To access a record, you have to traverse from the first record
until you find the desired one.
Simplicity: The design and logic behind sequential file organization are
straightforward.
Efficient for Batch Processing: Since records are stored in sequence, sequential
processing (like batch updates) can be done efficiently.
Less Overhead: There's no need for complex algorithms or mechanisms to store
records.
Inefficient for Random Access: If you need a specific record, you may have to go through many records before finding the desired one. This makes random access slow.
Insertion and Deletion: Inserting or deleting a record (other than at the end) can be time-consuming, since you may need to shift records to maintain the order.
Redundancy Issues: There's a risk of redundancy if checks are not made before inserting records. For example, a record with the same key might get added twice if not checked.
Practical Application: Suppose you have a file of students ordered by their roll
number:
Roll No | Name
--------|--------
1 | Madhu
2 | Naveen
4 | Shivaji
5 | Durga
In a sequential file, if you wanted to add a student with roll number 6, you would
append them at the end. However, if you wanted to insert a student with a roll
number 3 which is between 2 and 4, you would need to shift all subsequent records
to maintain the sequence, which can be time-consuming.
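The shifting cost can be sketched in a few lines of Python; the list stands in for contiguous records on disk, and the new student's name is made up for illustration:

```python
# Sequential file ordered by roll number (the key field).
records = [(1, "Madhu"), (2, "Naveen"), (4, "Shivaji"), (5, "Durga")]

def insert_sequential(records, new_record):
    """Insert while keeping key order; every later record is shifted right."""
    key = new_record[0]
    pos = 0
    while pos < len(records) and records[pos][0] < key:
        pos += 1
    records.insert(pos, new_record)  # shifts all records after pos by one slot

insert_sequential(records, (3, "Ravi"))  # "Ravi" is a hypothetical name
print([r[0] for r in records])  # keys are back in order: [1, 2, 3, 4, 5]
```

On disk the shift means rewriting every block after the insertion point, which is why appending (roll number 6) is cheap but inserting in the middle (roll number 3) is not.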
Hash File Organization
Hash Function: A hash function converts a record's key value into an address.
Buckets: A bucket typically stores one or more records. A hash function might map
multiple keys to the same bucket.
No Ordering of Records: Records are not stored in any specific logical order.
Rapid Access: If the hash function is efficient and there's minimal collision, the
retrieval of a record is very quick.
Uniform Distribution: A good hash function will distribute records uniformly across
all buckets.
Efficient Search: Searching becomes efficient as only a specific bucket needs to be
searched rather than the entire file.
Collisions: A collision occurs when two different keys hash to the same bucket. Handling collisions can be tricky and might affect access time.
Dependency on Hash Function: The efficiency of a hash file organization depends heavily on the hash function used. A bad hash function can lead to clustering and inefficient utilization of space.
Dynamic Growth and Shrinking: If the number of records grows or shrinks significantly, rehashing might be needed, which is an expensive operation.
Practical Application: Imagine a database that holds information about books.
Each book has a unique ISBN number. A hash function takes an ISBN and returns
an address. When you want to find a particular book's details, you hash the ISBN,
which directs you to a particular bucket. If two books' ISBNs hash to the same value,
you handle that collision, maybe by placing the new record in a linked list associated
with that bucket.
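A minimal sketch of this scheme in Python, assuming a fixed bucket count and made-up ISBNs, with collisions handled by chaining records within a bucket:

```python
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]  # each bucket is a chain of records

def bucket_for(isbn):
    """Hash function: map an ISBN string to a bucket number."""
    return hash(isbn) % NUM_BUCKETS

def insert(isbn, title):
    # Colliding keys simply extend the same bucket's chain.
    buckets[bucket_for(isbn)].append((isbn, title))

def lookup(isbn):
    # Only one bucket is scanned, not the whole file.
    for key, title in buckets[bucket_for(isbn)]:
        if key == isbn:
            return title
    return None

insert("978-0135957059", "Book A")  # illustrative ISBNs, not real catalog data
insert("978-0262033848", "Book B")
print(lookup("978-0262033848"))  # Book B
```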
Indexed File Organization
Primary Data File: The actual database file where records are stored.
Index: An auxiliary file that contains key values and pointers to the corresponding
records in the data file.
Multi-level Index: Sometimes, if the index becomes large, a secondary (or even
tertiary) index can be created on the primary index to expedite searching further.
Quick Random Access: Direct access to records is possible using the index.
Flexible Searches: Since an index provides a mechanism to jump directly to
records, different types of search operations (like range queries) can be efficiently
supported.
Ordered Access: If the primary file is ordered, then indexed file organization can
support efficient sequential access too.
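A rough Python sketch of indexed access, assuming a small data file and a sorted (key, pointer) index; both direct lookups and range queries go through the index:

```python
import bisect

data_file = ["rec-A", "rec-B", "rec-C", "rec-D"]  # primary data file (illustrative)
index = [(10, 0), (20, 1), (30, 2), (40, 3)]      # (search key, pointer to record)
keys = [k for k, _ in index]

def find(key):
    """Direct access: binary-search the index, then follow the pointer."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return data_file[index[i][1]]
    return None

def range_query(lo, hi):
    """Range query: walk the contiguous run of index entries in [lo, hi]."""
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return [data_file[p] for _, p in index[i:j]]

print(find(30))             # rec-C
print(range_query(15, 35))  # ['rec-B', 'rec-C']
```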
Indexing
Indexing involves creating an auxiliary structure (an index) to improve data retrieval
times. Just like the index in the back of a book, a database index provides pointers to
the locations of records.
Structure of Index
An index can be created on one or more columns of a table.
|-------------|----------------|
| Search Key | Data Reference |
|-------------|----------------|
The search key is the index's first column; it holds a copy of the table's primary key or candidate key. The key values are stored in sorted order so that the related data can be accessed quickly.
The data reference is the index's second column. It contains pointers to the disk block where the record with that key value can be found.
Types of Indexes:
1. Single-level Index: A single index table that contains pointers to the actual
data records.
2. Multi-level Index: An index of indexes. This hierarchical approach reduces
the number of accesses (disk I/O operations) required to find an entry.
3. Dense and Sparse Indexes:
o In a dense index, there's an index entry for every search key value in
the database.
o In a sparse index, there are fewer index entries. One entry might point
to several records.
4. Primary and Secondary Indexes:
o A primary index is an ordered file whose records are of fixed length
with two fields. The first field is the same as the primary key, and the
second field is a pointer to the data block. There's a one-to-one
relationship between the number of entries in the index and the number
of records in the main file.
o A secondary index provides a secondary means of accessing data. For
each secondary key value, the index points to all the records with that
key value.
5. Clustered vs. Non-clustered Index:
o In a clustered index, the rows of data in the table are stored on disk in
the same order as the index. There can only be one clustered index
per table.
o In a non-clustered index, the order of rows does not match the index's
order. You can have multiple non-clustered indexes.
6. Bitmap Index: Used mainly for data warehousing setups, a bitmap index
uses bit arrays (bitmaps) and usually involves columns that have a limited
number of distinct values.
7. B-trees and B+ trees: Balanced tree structures that ensure logarithmic
access time. B+ trees are particularly popular in DBMS for their efficiency in
disk I/O operations.
Benefits of Indexing:
Faster search and retrieval times for database operations.
Drawbacks of Indexing:
Overhead for insert, update, and delete operations, as indexes need to be
maintained.
Additional storage requirements for the index structures.
For example, if book records are stored in the order of their BookID and an index is built on that column, the DBMS can quickly locate a book by its ID.
1. Dense Index: In this, an index entry appears for every search key value in the
data file.
2. Sparse (or Non-Dense) Index: Here, index records are created only for some
of the search key values. A sparse index reduces the size of the index file.
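The difference can be sketched in Python, assuming each block holds two records (an illustrative layout, reusing the roll-number data from earlier):

```python
blocks = [[(1, "Madhu"), (2, "Naveen")],   # block 0
          [(4, "Shivaji"), (5, "Durga")]]  # block 1

# Dense index: one entry per search key value.
dense = {1: 0, 2: 0, 4: 1, 5: 1}           # key -> block number

# Sparse index: one entry per block (the first key stored in that block).
sparse = [(1, 0), (4, 1)]                   # (first key in block, block number)

def sparse_lookup(key):
    """Find the last sparse entry with key <= target, then scan its block."""
    block_no = 0
    for first_key, b in sparse:
        if first_key <= key:
            block_no = b
    for k, name in blocks[block_no]:
        if k == key:
            return name
    return None

print(sparse_lookup(4))  # Shivaji, found via only two index entries
```

The sparse index holds half as many entries here; the price is the short scan inside the chosen block.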
Characteristics of a Secondary Index:
1. Provides an alternative path to access the data.
2. Can be either dense or sparse.
3. Allows for non-unique values.
4. Does not guarantee the order of records in the data file.
5. Unlike a primary (often clustered) index, a secondary index is typically a non-clustered index. This means the physical order of rows in a table is not the same as the index order.
Consider a secondary index on students' ages: each age value in the index points directly to the corresponding record. If another student with an age of 22 is added, the entry for 22 simply gains a second pointer, since a secondary index allows non-unique values.
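A secondary index on a non-unique column such as age can be sketched as follows (the student records are illustrative):

```python
from collections import defaultdict

students = [("S1", "Abhi", 21), ("S2", "Bharath", 22), ("S3", "Chinni", 23)]

# Secondary index: age -> list of record positions (non-unique values allowed).
age_index = defaultdict(list)
for pos, (_, _, age) in enumerate(students):
    age_index[age].append(pos)

# Adding another 22-year-old extends the existing entry for 22.
students.append(("S4", "Devid", 22))
age_index[22].append(len(students) - 1)

print([students[p][1] for p in age_index[22]])  # ['Bharath', 'Devid']
```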
Hash-Based Indexing
In hash-based indexing, a hash function is used to convert a key into a hash code.
This hash code serves as an index where the value associated with that key is
stored. The goal is to distribute the keys uniformly across an array, so that access
time is, on average, constant.
Let's break down some of these elements to further understand how hash-based
indexing works in practice:
Buckets
In hash-based indexing, the data space is divided into a fixed number of slots known
as "buckets." A bucket usually contains a single page (also known as a block), but it
may have additional pages linked in a chain if the primary page becomes full. This is
known as overflow.
Hash Function
The hash function is a mapping function that takes the search key as an input and
returns the bucket number where the record should be located. Hash functions aim
to distribute records uniformly across buckets to minimize the number of collisions
(two different keys hashing to the same bucket).
Insert Operations
When a new record is inserted into the dataset, its search key is hashed to find the
appropriate bucket. If the primary page of the bucket is full, an additional overflow
page is allocated and linked to the primary page. The new record is then stored on
this overflow page.
Search Operations
To find a record with a specific search key, the hash function is applied to the search
key to identify the bucket. All pages (primary and overflow) in that bucket are then
examined to find the desired record.
Limitations
Hash-based indexing is not suitable for range queries or when the search key is not known. In such cases, a full scan of all pages is required, which is resource-intensive.
Hash Function: H(x) = ASCII value of first letter of the name mod 3
Alice: 65 mod 3 = 2
Bob: 66 mod 3 = 0
Carol: 67 mod 3 = 1
Buckets:
Bucket 0: Bob
Bucket 1: Carol
Bucket 2: Alice
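The same bucket assignment, written out in Python:

```python
def H(name):
    """H(x) = ASCII value of the first letter of the name, mod 3."""
    return ord(name[0]) % 3

buckets = {0: [], 1: [], 2: []}
for name in ["Alice", "Bob", "Carol"]:
    buckets[H(name)].append(name)

# 'A' = 65 -> 2, 'B' = 66 -> 0, 'C' = 67 -> 1
print(buckets)  # {0: ['Bob'], 1: ['Carol'], 2: ['Alice']}
```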
Tree-based Indexing
The most commonly used tree-based index structure is the B-Tree, and its variations
like B+ Trees and B* Trees. In tree-based indexing, data is organized into a tree-like
structure. Each node represents a range of key values, and leaf nodes contain the
actual data or pointers to the data.
Sorted Data: They maintain data in sorted order, making it easier to perform
range queries.
Balanced Tree: B-Trees and their variants are balanced, meaning the path
from the root node to any leaf node is of the same length. This balancing
ensures that data retrieval times are consistently fast, even as the dataset
grows.
Multi-level Index: Tree-based indexes can be multi-level, which helps to
minimize the number of disk I/Os required to find an item.
Dynamic Nature: B-Trees are dynamic, meaning they're good at inserting
and deleting records without requiring full reorganization.
Versatility: They are useful for both exact-match and range queries.
ID | Name
---|--------
1  | Abhi
2  | Bharath
3  | Chinni
4  | Devid
[1, 3]
/ \
[1] [3, 4]
/ \ / \
1 2 3 4
In the tree, navigating from the root to the leaf nodes will lead us to the desired data
record.
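What the tree buys us can be approximated in Python with a sorted key array and binary search standing in for the root-to-leaf traversal (a sketch, not a real B+ tree):

```python
import bisect

table = {1: "Abhi", 2: "Bharath", 3: "Chinni", 4: "Devid"}
sorted_keys = sorted(table)  # a B+ tree keeps keys sorted in its leaves

def exact_match(key):
    # O(log n) search, analogous to one root-to-leaf walk in the tree.
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return table[key]
    return None

def range_query(lo, hi):
    """Sorted leaves make a range scan a contiguous walk of keys."""
    i = bisect.bisect_left(sorted_keys, lo)
    j = bisect.bisect_right(sorted_keys, hi)
    return [table[k] for k in sorted_keys[i:j]]

print(exact_match(3))     # Chinni
print(range_query(2, 4))  # ['Bharath', 'Chinni', 'Devid']
```

Hash-based indexing cannot answer the range query above without scanning everything, which is exactly why B+ trees dominate general-purpose DBMS indexes.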
File Organizations
1. Objective: To physically store records on storage media in an organized manner.
2. Methodologies: Includes sequential, random (or direct), and hashed file
organizations, among others.
3. Implications:
Sequential organization is efficient for batch processing but slow for random access.
Hashed organization gives fast exact-match access but handles range queries poorly.
The chosen organization determines how cheaply records can be inserted, deleted, and scanned.
Indexing
1. Objective: To create a data structure that improves the speed of data retrieval
operations.
2. Methodologies: Includes clustered, non-clustered, primary, secondary,
composite, bitmap, and hash indexes, among others.
3. Implications:
Clustered indexes are excellent for range-based queries but slow down
insert/update operations.
Non-clustered indexes improve data retrieval speed but can take up additional
storage.
Bitmap indexes are useful for low-cardinality columns.
4. Real-world Examples: Search engines, e-commerce websites, any application
that requires fast data retrieval.
Performance Tuning
1. Objective: To optimize the resources used by the database for efficient
transaction processing.
2. Methodologies: Query optimization, index tuning, denormalization, database
sharding, caching, partitioning, etc.
3. Implications:
Query optimization can dramatically reduce the resources needed for query
processing.
Proper indexing can mitigate the need for full-table scans.
Denormalization and caching can improve read operations but may
compromise data integrity or consistency.
4. Real-world Examples: Financial trading systems, real-time analytics, high-performance computing.
Key Differences
1. Granularity:
File organization is about how data is stored at the file level.
Indexing is about improving data access at the table or even column level.
Performance tuning is a broad set of activities that can encompass both file
organization and indexing among many other techniques.
2. Resource Usage:
File organization techniques aim to use disk space efficiently.
Indexing aims to use both disk space and memory for fast data retrieval.
Performance tuning aims to optimize all system resources including CPU,
memory, disk, and network bandwidth.
3. Query Efficiency:
File organization generally impacts how efficiently data can be read or written
to disk.
Indexing significantly impacts how efficiently queries can retrieve data.
Performance tuning seeks to optimize both read and write operations through
a variety of methods.
4. Complexity:
File organization is relatively straightforward.
Indexing can become complex depending on the types of indexes and the
nature of the queries.
Performance tuning is usually the most complex as it involves a holistic
understanding of hardware, software, data, and queries.