
UNIT 4 NOTES

Unit 4 Relational Database Design and File Organization, Indexing & Hashing

Normalization

Normalization is the process of organizing a database to reduce redundancy and improve data
integrity.

Normalization also simplifies the database design so that it achieves the optimal structure
composed of atomic elements (i.e. elements that cannot be broken down into smaller parts).
Also referred to as database normalization or data normalization, normalization is an
important part of relational database design, as it helps with the speed, accuracy, and
efficiency of the database.

By normalizing a database, you arrange the data into tables and columns. You ensure that
each table contains only related data. If data is not directly related, you create a new table for
that data.
For example, if you have a “Customers” table, you’d normally create a separate table for the
products they can order (you could call this table “Products”). You’d create another table for
customers’ orders (perhaps called “Orders”). And if each order could contain multiple items,
you’d typically create yet another table to store each order item (perhaps called “Order
Items”). These tables would be linked by keys: each table has a primary key, and related
tables reference it through foreign keys, which lets you find related data across all these
tables (such as all orders by a given customer).

Benefits of Normalization
There are many benefits of normalizing a database. Here are some of the key benefits:
 Minimizes data redundancy (duplicate data).

 Minimizes null values.


 Results in a more compact database (due to less data redundancy/null values).

 Minimizes/avoids data modification issues.


 Simplifies queries.
 The database structure is cleaner and easier to understand. You can learn a lot about
a relational database just by looking at its schema.
 You can extend the database without necessarily impacting the existing data.

 Searching, sorting, and creating indexes can be faster, since tables are narrower, and
more rows fit on a data page.


Why do we need Normalization?


The main reason for normalizing relations is to remove such anomalies. Failure to
eliminate anomalies leads to data redundancy and can cause data integrity and other
problems as the database grows. Normalization consists of a series of guidelines that help
you create a good database structure.

Data modification anomalies can be categorized into three types:

o Insertion Anomaly: An insertion anomaly occurs when one cannot insert a new tuple
into a relation because some required data is missing.

o Deletion Anomaly: A deletion anomaly occurs when deleting some data results in the
unintended loss of other important data.
o Update Anomaly: An update anomaly occurs when changing a single data value
requires multiple rows of data to be updated.

Normalization Rules:

1. First Normal Form (1NF):


- Each table cell must contain a single value.
- No repeating groups or arrays.

2. Second Normal Form (2NF):


- A relation will be in 2NF if it is in 1NF and all non-key attributes are fully functionally
dependent on the primary key.
3. Third Normal Form (3NF):

- A relation will be in 3NF if it is in 2NF and no transitive dependency exists.

4. Boyce-Codd Normal Form (BCNF):

- A relation will be in BCNF if it is in 3NF and, for every functional dependency X → Y, X is a
superkey.

5. Higher Normal Forms (4NF, 5NF, 6NF):


- Deal with more complex dependencies and relationships.
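These checks all reduce to asking whether the left-hand side of a dependency determines a key. As a rough sketch (in Python, with illustrative attribute names and FDs, not a definitive implementation), the standard attribute-closure algorithm answers that question:

```python
# Sketch: computing an attribute closure to test key/normal-form conditions.
# The attributes and FDs below are illustrative, not from a real schema.

def closure(attrs, fds):
    """Return the closure of `attrs` under the FDs in `fds`.

    `fds` is a list of (lhs, rhs) pairs of frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have all of lhs, we also determine all of rhs.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [(frozenset({"OrderID"}), frozenset({"CustomerName", "OrderDate"}))]
all_attrs = {"OrderID", "CustomerName", "OrderDate"}

# OrderID determines every attribute, so it is a (super)key:
print(closure({"OrderID"}, fds) == all_attrs)  # True
```

An attribute set is a superkey exactly when its closure covers all attributes, which is the test BCNF applies to every dependency.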

Example
Suppose we have a table Orders with the following attributes:

Orders:


OrderID CustomerName OrderDate ProductName Quantity

1 John Smith 2022-01-01 iPhone 2

1 John Smith 2022-01-01 MacBook 1

2 Jane Doe 2022-01-15 iPad 3

Normalized Tables:

Customers:

CustomerID CustomerName

1 John Smith
2 Jane Doe

Orders:

OrderID CustomerID OrderDate

1 1 2022-01-01

2 2 2022-01-15

OrderDetails:

OrderID ProductName Quantity

1 iPhone 2

1 MacBook 1
2 iPad 3


First Normal Form (1NF)


Unnormalized Table:

OrderID CustomerName OrderDate Products

1 John Smith 2022-01-01 iPhone, MacBook


2 Jane Doe 2022-01-15 iPad, iPhone

Problem: Repeating group (Products)


1NF Table:

OrderID CustomerName OrderDate ProductName

1 John Smith 2022-01-01 iPhone


1 John Smith 2022-01-01 MacBook
2 Jane Doe 2022-01-15 iPad
2 Jane Doe 2022-01-15 iPhone

Second Normal Form (2NF)


Unnormalized Table (1NF):

OrderID CustomerName OrderDate ProductName


1 John Smith 2022-01-01 iPhone
1 John Smith 2022-01-01 MacBook
2 Jane Doe 2022-01-15 iPad
2 Jane Doe 2022-01-15 iPhone

Problem: Partial dependency (CustomerName depends on OrderID, not ProductName)

2NF Tables:
Orders:

OrderID CustomerName OrderDate

1 John Smith 2022-01-01


2 Jane Doe 2022-01-15

OrderDetails:

OrderID ProductName
1 iPhone
1 MacBook
2 iPad
2 iPhone


Third Normal Form (3NF)

Unnormalized Table (2NF):

Orders:

OrderID CustomerName OrderDate Salesperson


1 John Smith 2022-01-01 Jane Doe
2 Jane Doe 2022-01-15 John Smith

Problem: Transitive dependency (Salesperson depends on CustomerName, not OrderID)


3NF Tables:
Orders:

OrderID CustomerID OrderDate

1 1 2022-01-01

2 2 2022-01-15

Customers:

CustomerID CustomerName Salesperson

1 John Smith Jane Doe


2 Jane Doe John Smith

Boyce-Codd Normal Form (BCNF)


Unnormalized Table (3NF):

Courses:

CourseID CourseName InstructorID InstructorName


1 Math 1 John Smith
2 Science 2 Jane Doe

Problem: InstructorName depends on InstructorID, and InstructorID is not a candidate key
(the determinant of this dependency is not a superkey)


BCNF Tables:

Courses:

CourseID CourseName InstructorID


1 Math 1
2 Science 2

Instructors:

InstructorID InstructorName
1 John Smith
2 Jane Doe

Higher Normal Forms (4NF, 5NF, 6NF)

These forms deal with more complex dependencies and relationships, such as:

- Multi-level dependencies

- Join dependencies
- Multi-valued dependencies

Example 2

First Normal Form (1NF)

o A relation will be in 1NF if it contains only atomic values.

o It states that an attribute of a table cannot hold multiple values. Each attribute must
hold only a single value.

o First normal form disallows the multi-valued attribute, composite attribute, and their
combinations.

Example: Relation EMPLOYEE is not in 1NF because of multi-valued attribute EMP_PHONE.


EMPLOYEE table:

EMP_ID EMP_NAME EMP_PHONE EMP_STATE

14 John 7272826385, 9064738238 UP

20 Harry 8574783832 Bihar

12 Sam 7390372389, 8589830302 Punjab

The decomposition of the EMPLOYEE table into 1NF has been shown below:

EMP_ID EMP_NAME EMP_PHONE EMP_STATE

14 John 7272826385 UP

14 John 9064738238 UP

20 Harry 8574783832 Bihar

12 Sam 7390372389 Punjab

12 Sam 8589830302 Punjab
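The same decomposition can be sketched in Python (the column names mirror the tables above; the list-of-dicts representation is only for illustration):

```python
# Sketch: flattening the multi-valued EMP_PHONE column into 1NF rows.
employees = [
    {"EMP_ID": 14, "EMP_NAME": "John",
     "EMP_PHONE": ["7272826385", "9064738238"], "EMP_STATE": "UP"},
    {"EMP_ID": 20, "EMP_NAME": "Harry",
     "EMP_PHONE": ["8574783832"], "EMP_STATE": "Bihar"},
    {"EMP_ID": 12, "EMP_NAME": "Sam",
     "EMP_PHONE": ["7390372389", "8589830302"], "EMP_STATE": "Punjab"},
]

def to_1nf(rows):
    # One output row per phone number: every cell now holds a single value.
    return [
        {**row, "EMP_PHONE": phone}
        for row in rows
        for phone in row["EMP_PHONE"]
    ]

flat = to_1nf(employees)
print(len(flat))  # 5 rows, matching the decomposed table above
```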


Second Normal Form (2NF)


o To be in 2NF, a relation must first be in 1NF.

o In the second normal form, all non-key attributes are fully functionally dependent on
the primary key.

Example: Let's assume, a school can store the data of teachers and the subjects they teach. In
a school, a teacher can teach more than one subject.

TEACHER table

TEACHER_ID SUBJECT TEACHER_AGE

25 Chemistry 30

25 Biology 30

47 English 35

83 Math 38

83 Computer 38

In the given table, the non-prime attribute TEACHER_AGE depends on TEACHER_ID alone,
which is a proper subset of the candidate key {TEACHER_ID, SUBJECT}. This partial
dependency violates the rule for 2NF.

To convert the given table into 2NF, we decompose it into two tables:
TEACHER_DETAIL table:

TEACHER_ID TEACHER_AGE

25 30

47 35

83 38

TEACHER_SUBJECT table:


TEACHER_ID SUBJECT

25 Chemistry

25 Biology

47 English

83 Math

83 Computer

Third Normal Form (3NF)

o A relation will be in 3NF if it is in 2NF and does not contain any transitive
dependency.

o 3NF is used to reduce data duplication and to achieve data integrity.
o If there is no transitive dependency for non-prime attributes, then the relation is
in third normal form.
A relation is in third normal form if it holds at least one of the following conditions for every
non-trivial functional dependency X → Y.

1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.

Example:

EMPLOYEE_DETAIL table:

EMP_ID EMP_NAME EMP_ZIP EMP_STATE EMP_CITY

222 Harry 201010 UP Noida

333 Stephan 02228 US Boston

444 Lan 60007 US Chicago

555 Katharine 06389 UK Norwich


666 John 462007 MP Bhopal

Super key in the table above:

1. {EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_NAME, EMP_ZIP}....so on


Candidate key: {EMP_ID}
Non-prime attributes: In the given table, all attributes except EMP_ID are non-prime.

Here, EMP_STATE and EMP_CITY depend on EMP_ZIP, and EMP_ZIP depends on EMP_ID.
The non-prime attributes (EMP_STATE, EMP_CITY) are therefore transitively dependent on
the super key (EMP_ID). This violates the rule of third normal form.

That's why we need to move EMP_CITY and EMP_STATE to a new EMPLOYEE_ZIP table,
with EMP_ZIP as the primary key.

EMPLOYEE table:

EMP_ID EMP_NAME EMP_ZIP

222 Harry 201010

333 Stephan 02228

444 Lan 60007

555 Katharine 06389

666 John 462007

EMPLOYEE_ZIP table:

EMP_ZIP EMP_STATE EMP_CITY

201010 UP Noida

02228 US Boston

60007 US Chicago

06389 UK Norwich


462007 MP Bhopal

Boyce Codd normal form (BCNF)


o BCNF is the advanced version of 3NF. It is stricter than 3NF.
o A table is in BCNF if, for every functional dependency X → Y, X is a superkey of the table.

o For BCNF, the table should be in 3NF, and for every FD, the left-hand side must be a
superkey.
Example: Let's assume there is a company where employees work in more than one
department.
EMPLOYEE table:

EMP_ID EMP_COUNTRY EMP_DEPT DEPT_TYPE EMP_DEPT_NO

264 India Designing D394 283

264 India Testing D394 300

364 UK Stores D283 232

364 UK Developing D283 549

In the above table, the functional dependencies are as follows:

1. EMP_ID → EMP_COUNTRY

2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}

Candidate key: {EMP_ID, EMP_DEPT}



The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone is a superkey.
To convert the given table into BCNF, we decompose it into three tables:


EMP_COUNTRY table:

EMP_ID EMP_COUNTRY

264 India

364 UK

EMP_DEPT table:

EMP_DEPT DEPT_TYPE EMP_DEPT_NO

Designing D394 283

Testing D394 300

Stores D283 232

Developing D283 549

EMP_DEPT_MAPPING table:

EMP_ID EMP_DEPT

264 Designing

264 Testing

364 Stores

364 Developing

Functional dependencies:

1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}

Candidate keys:


For the first table: EMP_ID


For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}

Now, this design is in BCNF because the left-hand side of each functional dependency is a key.
Functional Dependency Theory:
Functional Dependency (FD) is a fundamental concept in database design, ensuring data
consistency and reducing redundancy.

Definition:
A functional dependency exists between two attributes, X and Y, if:

X → Y (X determines Y)
Properties:

1. Reflexivity: If Y ⊆ X, then X → Y (a set of attributes determines any of its subsets)

2. Augmentation: If X → Y, then XZ → YZ (adding the same attributes to both sides
preserves the dependency)

3. Transitivity: If X → Y and Y → Z, then X → Z (chaining dependencies)


Real-World Example:

Suppose we have a table Employees with attributes:


| EmployeeID | Name | Department | Manager |

Functional Dependencies:
1. EmployeeID → Name, Department (EmployeeID determines Name and Department)

2. Department → Manager (Department determines Manager)

These FDs ensure data consistency:


- Each employee has exactly one name and one department.

- Each department has a single manager.
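Whether an FD actually holds in a given relation instance can be checked mechanically. A minimal Python sketch (with made-up sample rows) follows:

```python
# Sketch: checking whether an FD X -> Y holds in a sample relation instance.
# The rows below are illustrative, not real employee data.
rows = [
    {"EmployeeID": 1, "Name": "Ava", "Department": "Sales", "Manager": "Lee"},
    {"EmployeeID": 2, "Name": "Ben", "Department": "Sales", "Manager": "Lee"},
    {"EmployeeID": 3, "Name": "Cal", "Department": "HR", "Manager": "Kim"},
]

def fd_holds(rows, x, y):
    """True if no two rows agree on attributes x but disagree on y."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in x)
        val = tuple(row[a] for a in y)
        if key in seen and seen[key] != val:
            return False  # same X value maps to two different Y values
        seen[key] = val
    return True

print(fd_holds(rows, ["Department"], ["Manager"]))   # True
print(fd_holds(rows, ["Manager"], ["EmployeeID"]))   # False
```

Note that sample data can only refute an FD, never prove it in general; an FD is a constraint on all possible instances.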

Multivalued Dependencies (MVDs):

MVDs occur when a single value of one attribute corresponds to multiple values of another
attribute.

Definition:
An MVD exists between two attributes, X and Y if:

X →→ Y (X multivalued determines Y)


Example:
Suppose we have a table Orders with attributes:

| OrderID | CustomerID | ProductID |

MVD:

CustomerID →→ ProductID (a customer can order multiple products)

This MVD implies:

- For a given customer, the set of products is independent of any other attributes.
- A customer can order multiple products, and a product can be ordered by multiple
customers.

File Organization:

File organization refers to the way data is structured and stored in a file or database.
Types of File Organization:

1. Sequential File Organization:


Records are stored in sequential order, one after the other.

This method is the easiest method for file organization. In this method, files are stored
sequentially. This method can be implemented in two ways:
1. Pile File Method:

It is a quite simple method. In this method, we store the records in sequence, i.e., one after
another, in the order in which they are inserted into the table.
When updating or deleting a record, the record is first searched for in the memory blocks.
Once found, it is marked for deletion, and the new record is inserted.

Insertion of the new record:

Suppose we have records R1, R3, R4 and so on up to R9 stored in a sequence; each record is
simply a row in a table. If we want to insert a new record R2 into the sequence, it will simply
be placed at the end of the file.


2. Sorted File Method:

In this method, the new record is always inserted at the file's end, and then it will sort the
sequence in ascending or descending order. Sorting of records is based on any primary key or
any other key.

In the case of modification of any record, it will update the record and then sort the file, and
lastly, the updated record is placed in the right place.

Insertion of the new record:


Suppose there is a preexisting sorted sequence of records R1, R3, R4 and so on up to R7.
If a new record R2 has to be inserted into the sequence, it is first inserted at the end of the
file, and then the sequence is sorted.
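The sorted file method can be sketched in a few lines of Python; here `bisect.insort` stands in for the append-then-sort step by placing the record directly in key order (a simplification, since a real file would append and then re-sort):

```python
import bisect

# Sketch: the sorted file method keeps records in key order at all times.
records = ["R1", "R3", "R4", "R5", "R6", "R7"]  # already sorted by key

def insert_sorted(records, new_record):
    # Places the record in its correct sorted position in one step.
    bisect.insort(records, new_record)
    return records

insert_sorted(records, "R2")
print(records)  # ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7']
```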


Advantages:
 Easy to implement
 Fast data retrieval for sequential access
 This method is used for report generation or statistical calculations.
Disadvantages:
 Slow data retrieval for random access
 Difficult to insert or delete records

2. Indexed File Organization:

Records are stored in random order, with an index file pointing to record locations.

The ISAM (Indexed Sequential Access Method) is an advanced sequential file organization. In
this method, records are stored in the file using the primary key. An index value is generated
for each primary key and mapped to the record. This index contains the address of the
record in the file.


 If any record has to be retrieved based on its index value, then the address of the
data block is fetched and the record is retrieved from the memory
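A minimal Python sketch of this index-based lookup, using a list as a stand-in for disk blocks and a dictionary as the index (the table contents and names are illustrative):

```python
# Sketch of an ISAM-style lookup. The index maps each primary key to the
# block address (modeled here as a list offset) where the full record lives.
blocks = [
    {"EMP_ID": 20, "EMP_NAME": "Harry"},
    {"EMP_ID": 12, "EMP_NAME": "Sam"},
    {"EMP_ID": 14, "EMP_NAME": "John"},
]

# Build the index on the primary key: key -> address of the data block.
index = {rec["EMP_ID"]: addr for addr, rec in enumerate(blocks)}

def fetch(emp_id):
    addr = index[emp_id]   # one index probe yields the block address
    return blocks[addr]    # a single block access retrieves the record

print(fetch(14)["EMP_NAME"])  # John
```

The trade-off described below falls out directly: the dictionary costs extra space and must be updated on every insert or delete, but it avoids scanning the blocks.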

Advantages:

 Fast data retrieval for random access


 Easy to insert or delete records

Disadvantages:

 Requires additional storage for index file


 Index file must be updated frequently

3. Hashed File Organization:


 Hash File Organization uses the computation of hash function on some fields of
the records. The hash function's output determines the location of disk block
where the records are to be placed.


When a record has to be retrieved using the hash key columns, the address is
generated, and the whole record is retrieved using that address. In the same way,
when a new record has to be inserted, the address is generated using the hash
key and the record is inserted directly. The same process applies to delete
and update operations.

In this method, there is no effort for searching and sorting the entire file. In this
method, each record will be stored randomly in the memory.



Records are stored using a hash function, allowing fast retrieval.


Advantages:

 Fast data retrieval for random access


 Efficient storage usage.
Disadvantages:

 Requires complex hash function implementation.


 Collision resolution required.

4. Linked List File Organization:

Records are stored as linked lists, with each record pointing to the next.

Advantages:

 Efficient insertion and deletion


 Flexible storage usage

Disadvantages:

 Slow data retrieval for random access


 Requires additional storage for pointers

File Organization Techniques:

1. Clustering: Grouping related records together.


2. Segmentation: Dividing files into smaller segments.

3. Blocking: Grouping records into fixed-size blocks.

4. Caching: Storing frequently accessed data in fast memory

Indexing in DBMS
Indexing is used to optimize the performance of a database by minimizing the number
of disk accesses required when a query is processed.

The index is a type of data structure. It is used to locate and access the data in a
database table quickly.


Index structure:

Indexes can be created using one or more database columns.

The first column of the index is the search key, which contains a copy of the primary
key or candidate key of the table. These values are stored in sorted order so that the
corresponding data can be accessed easily.

The second column of the index is the data reference. It contains a set of pointers
holding the address of the disk block where the value of the particular key can be
found.

Indexing Methods


Ordered Indices:
Ordered indices are data structures used to speed up query performance in databases by
allowing efficient retrieval of data based on specific search criteria.
Types of Ordered Indices:
1. B-Tree Index: A self-balancing search tree that keeps data sorted and allows efficient
insertion, deletion, and search operations.
2. B+ Tree Index: A variation of B-Tree index that stores only key values in the index nodes and
data records in leaf nodes.


3. Hash Index: An index that uses a hash function to map key values to specific locations in
the index.
4. Clustered Index: An index that reorders the physical records of the table according to the
index keys.
5. Non-Clustered Index: An index that creates a separate data structure for the index keys,
leaving the physical records unchanged.

Advantages:

1. Faster Query Performance: Ordered indices speed up query execution by reducing the
number of records to scan.

2. Efficient Data Retrieval: Ordered indices enable efficient retrieval of data based on specific
search criteria.
3. Improved Data Integrity: Ordered indices help maintain data consistency by ensuring
unique key values.

Disadvantages:

1. Additional Storage Requirements: Ordered indices require additional storage space.


2. Insertion and Deletion Overhead: Maintaining ordered indices can slow down insertion and
deletion operations.

3. Index Maintenance: Ordered indices require periodic maintenance to ensure optimal


performance.

Real-World Applications:

1. Database Query Optimization: Ordered indices are crucial for optimizing database query
performance.

2. Data Warehousing: Ordered indices help improve query performance in data warehousing
applications.

3. Big Data Analytics: Ordered indices facilitate efficient data retrieval in big data analytics.

B-Tree Index Files:


A B-tree in DBMS is an m-way tree that balances itself. Due to their balanced structure, such
trees are frequently used to manage and organize enormous databases and to facilitate
searches. In a B-tree, each node can have a maximum of m child nodes. In DBMS, the B-tree is
an example of multilevel indexing. Leaf nodes and internal nodes both hold record
references. A B-tree is called a balanced search tree because all the leaf nodes are at the
same level.

Properties of B-tree
Following are some of the properties of B-tree in DBMS:

pg. 21
UNIT 4 NOTES

 A non-leaf node's number of keys is one less than its number of children.
 The number of keys in the root ranges from one to m-1. Therefore, the root has a
minimum of two and a maximum of m children.
 All nodes (non-leaf nodes) besides the root have between ⌈m/2⌉-1 and m-1 keys.
Thus, they can have between ⌈m/2⌉ and m children.
 Every leaf node is at the same level.

How Database B-Tree Indexing Works

 When B-tree is used for database indexing, it becomes a little more complex because
it has both a key and a value. The value serves as a reference to the particular data
record. A payload is the collective term for the key and value.
 To index data with a particular key and value, the database first constructs a unique
index (or uses the primary key) for each of the supplied records. The keys and record
byte streams are then stored in the tree, and the generated index is used for
indexing the data.
 So this indexing helps to decrease the searching time of data. In a B-tree, all the data
is stored on the leaf nodes, now for accessing a particular data index, the database
can make use of binary search on the leaf nodes as the data is stored in the sorted
order.
 If indexing is not used, the database reads each and every record to locate the
requested record and it increases time and cost for searching the records, so B-tree
indexing is very efficient.

Advantages:
1. Fast search, insertion, and deletion. 2. Self-balancing.
3. Efficient use of storage.

Disadvantages:
1. Complex implementation. 2. Additional storage requirements.

Real-World Applications:
1. Database query optimization. 2. File systems. 3. Compilers.

B+ Tree Index File:


The B+ tree is a balanced m-way search tree. It follows a multi-level index format.
In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree ensures that all leaf
nodes remain at the same height.
In the B+ tree, the leaf nodes are connected in a linked list. Therefore, a B+ tree can support
sequential access as well as random access.


Structure of B+ Tree
In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of the
order n where n is fixed for every B+ tree.

It contains an internal node and leaf node.

Internal node

An internal node of the B+ tree can contain at least ⌈n/2⌉ child pointers, except the root node.

At most, an internal node of the tree contains n child pointers.

Leaf node
A leaf node of the B+ tree can contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
At most, a leaf node contains n record pointers and n key values.

Every leaf node of the B+ tree contains one block pointer P to point to the next leaf node.
Searching a record in B+ Tree

Suppose we have to search for 55 in the B+ tree structure below. First, we look at the
intermediary node, which directs us to the leaf node that can contain the record for 55.

In the intermediary node, we find the branch between keys 50 and 75. We are then
redirected to the third leaf node, where the DBMS performs a sequential search
to find 55.
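The search walk can be sketched on a hand-built two-level B+ tree in Python (a simplified three-leaf tree for illustration, not the exact tree from the figure):

```python
import bisect

# Sketch: B+ tree search on a hand-built two-level tree.
# In a full B+ tree the leaves would also be linked for sequential scans.
internal = [50, 75]                              # separator keys in the root
leaves = [[10, 25, 40], [50, 55, 60, 65, 70], [75, 80, 95]]

def search(key):
    # Pick the branch: keys < 50 go left, 50..74 middle, >= 75 right.
    child = bisect.bisect_right(internal, key)
    # Sequential search inside the chosen leaf, as described above.
    return key in leaves[child]

print(search(55))  # True: branch between 50 and 75, then scan that leaf
```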

B+ Tree Insertion

Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf node
after 55. It is a balanced tree, and a leaf node of this tree is already full, so we cannot insert
60 there.


In this case, we have to split the leaf node, so that it can be inserted into tree without
affecting the fill factor, balance and order.

The 3rd leaf node would then have the values (50, 55, 60, 65, 70), and its branch key in the
intermediate node is 50. We split the leaf node in the middle so that the tree's balance is not
altered. So we can group (50, 55) and (60, 65, 70) into two leaf nodes.

If these two are to be leaf nodes, the intermediate node cannot branch only from 50. It
should have 60 added to it, and then we can have a pointer to the new leaf node.

This is how we can insert an entry when there is overflow. In a normal scenario, it is very easy
to find the node where it fits and then place it in that leaf node.
B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60
from the intermediate node as well as from the 4th leaf node. If we simply remove it from
the intermediate node, the tree will no longer satisfy the rules of the B+ tree, so we need to
modify the tree to keep it balanced.

After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as follows:


Advantages:

1. Faster search, insertion, and deletion. 2. Efficient use of storage. 3. Self-balancing.


Disadvantages:
1. Complex implementation. 2. Additional storage requirements

Real-World Applications:

1. Database query optimization. 2. File systems. 3. Compilers.

Comparison between B-Tree and B+ Tree:


Similarities:

1. Both are self-balancing search trees. 2. Both use key-value pairs.


3. Both support search, insertion, and deletion operations. 4. Both are used for indexing in
databases.

Differences:

B-Tree
1. Stores data in every node.
2. Nodes can have a variable number of keys.
3. Search, insertion, and deletion are slower.
4. More complex implementation.

B+ Tree
1. Stores data only in leaf nodes.
2. Nodes have a fixed number of keys.
3. Search, insertion, and deletion are faster.
4. Less complex implementation.


Comparison Table:

Feature B-Tree B+ Tree


Data Storage Every node Only leaf nodes
Node Key Count Variable Fixed
Search Speed Slower Faster
Insertion Speed Slower Faster
Deletion Speed Slower Faster
Complexity Higher Lower
Storage Efficiency Lower Higher


When to use B-Tree:


1. Small databases. 2. Infrequent insertions and deletions. 3. Query-intensive applications.

When to use B+ Tree:

1. Large databases. 2. Frequent insertions and deletions. 3. High-performance applications.

Database Systems:

1. B-Tree: MySQL, PostgreSQL.


2. B+ Tree: Oracle, Microsoft SQL Server.
Conclusion:

B+ Tree is generally preferred over B-Tree due to its faster search, insertion, and deletion
speeds, as well as its simpler implementation.

However, B-Tree may be suitable for smaller databases or query-intensive applications.

Hashing in DBMS
In a huge database structure, it is very inefficient to search all the index values and reach the
desired data. Hashing technique is used to calculate the direct location of a data record on
the disk without using index structure.

In this technique, data is stored at the data blocks whose address is generated by using the
hashing function. The memory location where these records are stored is known as data
bucket or data blocks.

In this technique, the hash function can use any column value to generate the address. Most
of the time, the hash function uses the primary key to generate the address of the data block.
The hash function can be anything from a simple to a complex mathematical function. We
can even use the primary key itself as the address of the data block: each row is then stored
at the data block whose address equals its primary key value.


The above diagram shows data block addresses that are the same as the primary key values.
The hash function can also be a simple mathematical function such as mod. Suppose we have
a mod(5) hash function to determine the address of the data block. In this case, applying the
mod(5) hash function to the primary keys generates 3, 3, 1, 4 and 2 respectively, and the
records are stored at those data block addresses.


Types of Hashing:
1) Static Hashing 2 ) Dynamic Hashing

Static Hashing
In static hashing, the resultant data bucket address is always the same. That means, if we
generate an address for EMP_ID = 103 using the hash function mod(5), it will always result
in the same bucket address, 3. The bucket address does not change.

Hence in this static hashing, the number of data buckets in memory remains constant
throughout. In this example, we will have five data buckets in the memory used to store the
data.
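The mod-based placement described above can be sketched in Python (bucket count and keys are illustrative):

```python
# Sketch: static hashing with mod(5). The bucket for a key never changes,
# and the number of buckets stays constant.
NUM_BUCKETS = 5
buckets = {i: [] for i in range(NUM_BUCKETS)}

def bucket_for(key):
    # The same key always hashes to the same bucket address.
    return key % NUM_BUCKETS

for emp_id in (103, 104, 105):
    buckets[bucket_for(emp_id)].append(emp_id)

print(bucket_for(103))  # always 3, no matter when it is computed
```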

Operations of Static Hashing

Searching a record

When a record needs to be searched, then the same hash function retrieves the address of
the bucket where the data is stored.

Insert a Record
When a new record is inserted into the table, an address is generated for the new
record based on the hash key, and the record is stored at that location.
Delete a Record

To delete a record, we will first fetch the record which is supposed to be deleted. Then we will
delete the records for that address in memory.

Update a Record


To update a record, we will first search it using a hash function, and then the data record is
updated.
If we want to insert a new record into the file, but the address of the data bucket generated
by the hash function is not empty (data already exists at that address), the situation is known
as bucket overflow. This is a critical situation in static hashing.

To overcome this situation, there are various methods. Some commonly used methods are as
follows:

1. Open Hashing

When a hash function generates an address at which data is already stored, the next
available bucket is allocated to the record. This mechanism is called linear probing.
For example: suppose R3 is a new record that needs to be inserted, and the hash function
generates the address 112 for R3. But the generated address is already full. So the system
searches the next available data bucket, 113, and assigns R3 to it.
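Linear probing can be sketched in Python (the table size and keys are chosen so that two keys collide; this is an in-memory toy, not disk buckets):

```python
# Sketch of linear probing: on a collision, scan forward for the next
# free slot, wrapping around at the end of the table.
SIZE = 7
table = [None] * SIZE

def insert(key):
    slot = key % SIZE
    while table[slot] is not None:      # bucket occupied: probe the next one
        slot = (slot + 1) % SIZE
    table[slot] = key
    return slot

insert(112)          # lands in slot 112 % 7 = 0
print(insert(119))   # also hashes to 0, probes forward to slot 1
```

A real implementation would also need deletion markers so that probe chains are not broken when records are removed.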

2. Close Hashing

When a bucket is full, a new data bucket is allocated for the same hash result and is
linked after the previous one. This mechanism is known as overflow chaining.
For example: suppose R3 is a new record that needs to be inserted into the table, and the
hash function generates the address 110 for it. But that bucket is too full to store the new
data. In this case, a new bucket is appended after bucket 110 and linked to it.
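Overflow chaining can be sketched similarly, with each bucket holding a chain of records that hash to the same address (again an in-memory toy with illustrative keys):

```python
from collections import defaultdict

# Sketch of overflow chaining: every bucket holds a chain of records that
# hash to the same address, so a full bucket simply grows its chain.
SIZE = 7
table = defaultdict(list)

def insert(key):
    table[key % SIZE].append(key)   # link the record after the previous ones

for key in (110, 117, 124):         # all three hash to the same bucket
    insert(key)

print(table[110 % 7])  # [110, 117, 124] chained in one bucket
```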


Advantages of Static Hashing:

 Fast Lookup: Static hashing allows for fast lookup, insertion, and deletion operations.
 Efficient Storage: Static hashing optimizes storage usage by minimizing empty slots.
 Simple Implementation: Static hashing is relatively easy to implement.
 Predictable Performance: Static hashing provides predictable performance.
 Low Overhead: Static hashing has low overhead in terms of memory and
computation.

Disadvantages of Static Hashing


Disadvantages of using static hashing techniques in DBMS are:

 Static hashing is not a good choice for big data.

 Searching can take longer than usual when many keys collide, because the system
may have to scan through overflow locations.

 It does not work well with extensible databases, since the number of buckets is fixed.

 Sorted (ordered) access to records is inefficient compared to other techniques.


Dynamic Hashing
 The dynamic hashing method is used to overcome the problems of static hashing like
bucket overflow.

 In this method, data buckets grow or shrink as the number of records increases or
decreases. This method is also known as the extendible hashing method.
 This method makes hashing dynamic, i.e., it allows insertion or deletion without
resulting in poor performance.

 How to search a key

 First, calculate the hash address of the key.

 Check how many bits are used in the directory; this number of bits is called i.
 Take the least significant i bits of the hash address. This gives an index into the
directory.

 Now using the index, go to the directory and find bucket address where the record
might be.
 How to insert a new record

 Firstly, you have to follow the same procedure for retrieval, ending up in some bucket.
 If there is still space in that bucket, then place the record in it.

 If the bucket is full, then we will split the bucket and redistribute the records.

 For example:

 Consider the following grouping of keys into buckets, depending on the last bits of
their hash address:

 The last two bits of 2 and 4 are 00, so they go into bucket B0. The last two bits of 5
and 6 are 01, so they go into bucket B1. The last two bits of 1 and 3 are 10, so they
go into bucket B2. The last two bits of 7 are 11, so it goes into bucket B3.


 Insert key 9 with hash address 10001 into the above structure:
 Since the hash address of key 9 ends in 01, it must go into bucket B1. But bucket B1 is
full, so it will get split.

 The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they go
into bucket B1, while the last three bits of 6 are 101, so it goes into bucket B5.
 Keys 2 and 4 are still in B0. The record in B0 is pointed to by the 000 and 100 entries
because the last two bits of both entries are 00.

 Keys 1 and 3 are still in B2. The record in B2 is pointed to by the 010 and 110 entries
because the last two bits of both entries are 10.

 Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 entries because
the last two bits of both entries are 11.
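A simplified Python sketch of this insertion follows. Real extendible hashing splits only the overflowing bucket and lets several directory entries share one bucket; for brevity this sketch rebuilds the whole directory on one more bit. The hash addresses are chosen to match the bit patterns described above:

```python
# Simplified sketch of extendible hashing. The directory indexes buckets
# by the least significant i bits of each record's hash address.
BUCKET_CAP = 2
hash_addr = {2: 0b100, 4: 0b1000, 5: 0b001, 6: 0b101,
             1: 0b010, 3: 0b110, 7: 0b111, 9: 0b10001}

i = 2                                        # directory currently uses 2 bits
directory = {b: [] for b in range(1 << i)}   # buckets 00, 01, 10, 11

def insert(key):
    global i, directory
    idx = hash_addr[key] & ((1 << i) - 1)    # least significant i bits
    if len(directory[idx]) < BUCKET_CAP:
        directory[idx].append(key)
        return
    # Bucket overflow: use one more bit and redistribute every record.
    keys = [key] + [k for bucket in directory.values() for k in bucket]
    i += 1
    directory = {b: [] for b in range(1 << i)}
    for k in keys:
        directory[hash_addr[k] & ((1 << i) - 1)].append(k)

for k in (2, 4, 5, 6, 1, 3, 7):
    insert(k)
insert(9)   # bucket 01 is full, so the directory grows to 3 bits
print(sorted(directory[0b001]))  # [5, 9] — now split apart from 6 (bucket 101)
```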


Advantages of dynamic hashing


o In this method, performance does not degrade as the data in the system grows. The
structure simply grows to accommodate the data.

o In this method, memory is well utilized as it grows and shrinks with the data. There
is no unused memory lying around.

o This method is good for the dynamic database where data grows and shrinks
frequently.

Disadvantages of dynamic hashing

o In this method, if the data size increases, the bucket address table must grow as well,
because data addresses keep changing as buckets grow and shrink. If there is a huge
increase in data, maintaining the bucket address table becomes tedious.
o Bucket overflow can still occur, although it takes longer to reach that situation than
in static hashing.
