Unit 4: Relational Database Design and File Organization, Indexing & Hashing

Normalization
Normalization is the process of organizing a database to reduce redundancy and improve data
integrity.
Normalization also simplifies the database design so that it achieves the optimal structure
composed of atomic elements (i.e. elements that cannot be broken down into smaller parts).
Also referred to as database normalization or data normalization, normalization is an
important part of relational database design, as it helps with the speed, accuracy, and
efficiency of the database.
By normalizing a database, you arrange the data into tables and columns. You ensure that
each table contains only related data. If data is not directly related, you create a new table for
that data.
For example, if you have a “Customers” table, you’d normally create a separate table for the
products they can order (you could call this table “Products”). You’d create another table for
customers’ orders (perhaps called “Orders”). And if each order could contain multiple items,
you’d typically create yet another table to store each order item (perhaps called “Order
Items”). All these tables would be linked through primary and foreign keys, which allows you to find related data across them (such as all orders by a given customer).
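As a rough illustration of this layout, the following Python sketch models the linked tables as dictionaries keyed by their primary keys. All table names, column names, and key values here are assumptions invented for this example, not part of the notes:

# Hypothetical normalized tables, each keyed by its primary key.
customers = {1: {"name": "John Smith"}, 2: {"name": "Jane Doe"}}
products = {10: {"name": "iPhone"}, 11: {"name": "MacBook"}}
orders = {100: {"customer_id": 1, "date": "2022-01-01"},
          101: {"customer_id": 2, "date": "2022-01-15"}}
# Each order item references its order and product by foreign key.
order_items = [
    {"order_id": 100, "product_id": 10, "quantity": 2},
    {"order_id": 100, "product_id": 11, "quantity": 1},
]

def orders_for_customer(customer_id):
    """Follow the foreign key from orders back to one customer."""
    return [oid for oid, o in orders.items()
            if o["customer_id"] == customer_id]

print(orders_for_customer(1))  # [100]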
Benefits of Normalization
There are many benefits of normalizing a database. Here are some of the key benefits:
Minimizes data redundancy (duplicate data).
Searching, sorting, and creating indexes can be faster, since tables are narrower, and
more rows fit on a data page.
Normalization also helps eliminate the following anomalies:
o Insertion Anomaly: an insertion anomaly occurs when a new tuple cannot be inserted into a relation due to a lack of other required data.
o Deletion Anomaly: a deletion anomaly occurs when deleting some data results in the unintended loss of other important data.
o Update Anomaly: an update anomaly occurs when updating a single data value requires multiple rows of data to be updated.
Normalization Rules:
The normal forms covered in these notes are First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), and Boyce-Codd Normal Form (BCNF).
Example
Suppose we have a single unnormalized table Orders with the attributes OrderID, CustomerID, CustomerName, OrderDate, ProductName, and Quantity:
Orders:
OrderID CustomerID CustomerName OrderDate ProductName Quantity
1 1 John Smith 2022-01-01 iPhone 2
1 1 John Smith 2022-01-01 MacBook 1
2 2 Jane Doe 2022-01-15 iPad 3
Normalized Tables:
Customers:
CustomerID CustomerName
1 John Smith
2 Jane Doe
Orders:
OrderID CustomerID OrderDate
1 1 2022-01-01
2 2 2022-01-15
OrderDetails:
OrderID ProductName Quantity
1 iPhone 2
1 MacBook 1
2 iPad 3
2NF Tables:
Orders:
OrderID CustomerID OrderDate
1 1 2022-01-01
2 2 2022-01-15
OrderDetails:
OrderID ProductName
1 iPhone
1 MacBook
2 iPad
2 iPhone
Orders:
OrderID CustomerID OrderDate
1 1 2022-01-01
2 2 2022-01-15
Customers:
Courses:
BCNF Tables:
Courses:
Instructors:
InstructorID InstructorName
1 John Smith
2 Jane Doe
Higher Normal Forms (4NF and 5NF):
These forms deal with more complex dependencies and relationships, such as:
- Multi-level dependencies
- Join dependencies
- Multi-valued dependencies
Example 2
First Normal Form (1NF):
o It states that an attribute of a table cannot hold multiple values; it must hold only atomic (single) values.
o First normal form disallows multi-valued attributes, composite attributes, and their combinations.
EMPLOYEE table:
EMP_ID EMP_NAME EMP_PHONE EMP_STATE
14 John 7272826385, 9064738238 UP
12 Sam 7390372389, 8589830302 Punjab
The decomposition of the EMPLOYEE table into 1NF has been shown below:
EMP_ID EMP_NAME EMP_PHONE EMP_STATE
14 John 7272826385 UP
14 John 9064738238 UP
12 Sam 7390372389 Punjab
12 Sam 8589830302 Punjab
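This decomposition can be sketched in a few lines of Python (a rough illustration; the list-of-dictionaries representation is an assumption for this example): each multi-valued phone attribute is expanded into its own row.

# Rows with a multi-valued EMP_PHONE attribute (violates 1NF).
employees = [
    {"id": 14, "name": "John", "phones": ["7272826385", "9064738238"], "state": "UP"},
    {"id": 12, "name": "Sam", "phones": ["7390372389", "8589830302"], "state": "Punjab"},
]

# 1NF decomposition: one row per (employee, phone) pair.
employees_1nf = [
    {"id": e["id"], "name": e["name"], "phone": p, "state": e["state"]}
    for e in employees
    for p in e["phones"]
]

for row in employees_1nf:
    print(row)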
Second Normal Form (2NF):
o A relation is in second normal form if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key.
Example: Let's assume a school stores the data of teachers and the subjects they teach. In a school, a teacher can teach more than one subject.
TEACHER table:
TEACHER_ID SUBJECT TEACHER_AGE
25 Chemistry 30
25 Biology 30
47 English 35
83 Math 38
83 Computer 38
To convert the given table into 2NF, we decompose it into two tables:
TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE
25 30
47 35
83 38
TEACHER_SUBJECT table:
TEACHER_ID SUBJECT
25 Chemistry
25 Biology
47 English
83 Math
83 Computer
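The same decomposition can be sketched in Python (an illustrative projection; the tuple layout follows the columns of the TEACHER table above): project the original rows onto the two smaller tables and drop duplicates.

# Original rows: (TEACHER_ID, SUBJECT) is the key, but TEACHER_AGE
# depends only on TEACHER_ID (a partial dependency).
teacher = [
    (25, "Chemistry", 30), (25, "Biology", 30),
    (47, "English", 35), (83, "Math", 38), (83, "Computer", 38),
]

# Project onto TEACHER_DETAIL(TEACHER_ID, TEACHER_AGE), dropping duplicates.
teacher_detail = sorted({(tid, age) for tid, _, age in teacher})
# Project onto TEACHER_SUBJECT(TEACHER_ID, SUBJECT).
teacher_subject = [(tid, subj) for tid, subj, _ in teacher]

print(teacher_detail)   # [(25, 30), (47, 35), (83, 38)]
print(teacher_subject)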
Third Normal Form (3NF):
o A relation is in 3NF if it is in 2NF and contains no transitive dependency of a non-prime attribute on the primary key.
o 3NF is used to reduce data duplication and to achieve data integrity.
o If there is no transitive dependency for non-prime attributes, then the relation is in third normal form.
A relation is in third normal form if it satisfies at least one of the following conditions for every non-trivial functional dependency X → Y:
1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.
Example:
EMPLOYEE_DETAIL table:
Here, EMP_STATE and EMP_CITY depend on EMP_ZIP, and EMP_ZIP depends on EMP_ID. The non-prime attributes (EMP_STATE, EMP_CITY) are therefore transitively dependent on the super key (EMP_ID), which violates the rule of third normal form.
That's why we need to move EMP_CITY and EMP_STATE to a new EMPLOYEE_ZIP table, with EMP_ZIP as its primary key.
EMPLOYEE table:
EMPLOYEE_ZIP table:
EMP_ZIP EMP_STATE EMP_CITY
201010 UP Noida
02228 US Boston
60007 US Chicago
06389 UK Norwich
462007 MP Bhopal
Boyce-Codd Normal Form (BCNF):
o A table is in BCNF if it is in 3NF and, for every functional dependency, the left-hand side is a super key.
Example: Let's assume there is a company where employees work in more than one department.
EMPLOYEE table:
Functional dependencies in the table:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone is a key.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
264 India
EMP_DEPT table:
DEPT_TYPE EMP_DEPT_NO
D394 283
D394 300
D283 232
D283 549
EMP_DEPT_MAPPING table:
EMP_ID EMP_DEPT
Functional dependencies:
1. EMP_ID → EMP_COUNTRY
2. EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID. For the second table: EMP_DEPT. For the third table: {EMP_ID, EMP_DEPT}.
Now, the decomposition is in BCNF because the left-hand side of each functional dependency is a key.
Functional Dependency Theory:
Functional Dependency (FD) is a fundamental concept in database design, ensuring data
consistency and reducing redundancy.
Definition:
A functional dependency exists between two attributes, X and Y, if:
X → Y (X determines Y)
Properties:
Functional dependencies obey Armstrong's axioms: reflexivity (if Y ⊆ X, then X → Y), augmentation (if X → Y, then XZ → YZ), and transitivity (if X → Y and Y → Z, then X → Z).
Example functional dependency:
1. EmployeeID → Name, Department (EmployeeID determines Name and Department)
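A functional dependency X → Y can be checked mechanically: no two rows may agree on X while disagreeing on Y. Here is a minimal Python sketch of that check (the sample rows and the function name are assumptions for this example):

def holds(rows, X, Y):
    """Return True if the functional dependency X -> Y holds in rows."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in X)
        y = tuple(row[a] for a in Y)
        if x in seen and seen[x] != y:
            return False  # same X value maps to two different Y values
        seen[x] = y
    return True

employees = [
    {"EmployeeID": 1, "Name": "Ann", "Department": "HR"},
    {"EmployeeID": 2, "Name": "Bob", "Department": "IT"},
]
print(holds(employees, ["EmployeeID"], ["Name", "Department"]))  # True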
Multivalued Dependency (MVD):
MVDs occur when a single value of one attribute corresponds to multiple values of another attribute.
Definition:
An MVD exists between two attributes, X and Y, if:
X →→ Y (X multidetermines Y)
Example:
In the Orders table above, a single OrderID corresponds to multiple product names, giving the MVD:
OrderID →→ ProductName
File Organization:
File organization refers to the way data is structured and stored in a file or database.
Types of File Organization:
1. Sequential File Organization:
This method is the easiest method of file organization. In this method, files are stored sequentially. It can be implemented in two ways:
1. Pile File Method:
This is a quite simple method in which records are stored one after another, in the order in which they are inserted into the table. In the case of updating or deleting a record, the record is searched in the memory blocks; once found, it is marked for deletion, and the new record is inserted.
Suppose we have records R1, R3, and so on up to R9 and R8 in a sequence (a record is simply a row in the table). If we want to insert a new record R2 into the sequence, it will be placed at the end of the file.
2. Sorted File Method:
In this method, the new record is always inserted at the file's end, and then the file is sorted in ascending or descending order. Sorting of records is based on the primary key or any other key.
In the case of modification of any record, the record is updated, the file is re-sorted, and the updated record is placed in the right position, as the sketch below illustrates.
Advantages:
Easy to implement
Fast data retrieval for sequential access
This method is used for report generation or statistical calculations.
Disadvantages:
Slow data retrieval for random access
Difficult to insert or delete records
2. Indexed Sequential Access Method (ISAM):
In this method, records can be stored in any order, with an index pointing to record locations. ISAM is an advanced form of sequential file organization in which records are stored in the file using the primary key. An index value is generated for each primary key and mapped to the record; this index contains the address of the record in the file.
If any record has to be retrieved based on its index value, then the address of the data block is fetched and the record is retrieved from memory.
Advantages: searching records by key is fast, and retrieval in key order is efficient because records are indexed on the primary key.
Disadvantages: extra storage is needed for the index, and insertions and deletions are slower because the index must be maintained.
3. Hash File Organization:
In this method, records are stored at addresses generated by applying a hash function to the hash key columns. When a record has to be retrieved using the hash key columns, the address is generated, and the whole record is retrieved using that address. In the same way, when a new record has to be inserted, the address is generated using the hash key and the record is directly inserted. The same process is applied in the case of delete and update.
In this method, there is no effort spent on searching and sorting the entire file; each record is stored at a random location in memory.
4. Chained File Organization:
Records are stored as linked lists, with each record pointing to the next.
Advantages: insertions and deletions are simple, since only the links need to be adjusted.
Disadvantages: random access is slow, because records must be traversed link by link.
Indexing in DBMS
Indexing is used to optimize the performance of a database by minimizing the number
of disk accesses required when a query is processed.
The index is a type of data structure. It is used to locate and access the data in a
database table quickly.
Index structure:
The first column of the index is the search key; it contains a copy of the primary key or a candidate key of the table. These values are stored in sorted order so that the corresponding data can be accessed easily.
The second column of the index is the data reference. It contains a set of pointers holding the address of the disk block where the value of the particular key can be found.
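This two-column structure can be sketched as a sorted list of (search key, block address) pairs. In the Python sketch below, the key values and block addresses are made up for illustration:

import bisect

# (search key, disk block address) pairs, kept sorted by key.
index = [(101, "block_7"), (205, "block_2"), (318, "block_9")]
keys = [k for k, _ in index]

def lookup(key):
    """Binary-search the index and return the block address, if any."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return index[i][1]
    return None

print(lookup(205))  # block_2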
Indexing Methods
Ordered Indices:
Ordered indices are data structures used to speed up query performance in databases by
allowing efficient retrieval of data based on specific search criteria.
Types of Ordered Indices:
1. B-Tree Index: A self-balancing search tree that keeps data sorted and allows efficient
insertion, deletion, and search operations.
2. B+ Tree Index: A variation of B-Tree index that stores only key values in the index nodes and
data records in leaf nodes.
pg. 20
UNIT 4 NOTES
3. Hash Index: An index that uses a hash function to map key values to specific locations in
the index.
4. Clustered Index: An index that reorders the physical records of the table according to the
index keys.
5. Non-Clustered Index: An index that creates a separate data structure for the index keys,
leaving the physical records unchanged.
Advantages:
1. Faster Query Performance: Ordered indices speed up query execution by reducing the
number of records to scan.
2. Efficient Data Retrieval: Ordered indices enable efficient retrieval of data based on specific
search criteria.
3. Improved Data Integrity: Ordered indices help maintain data consistency by ensuring
unique key values.
Disadvantages:
1. Additional storage is required for the index structure.
2. Inserts, updates, and deletes are slower because the index must be maintained.
Real-World Applications:
1. Database Query Optimization: Ordered indices are crucial for optimizing database query
performance.
2. Data Warehousing: Ordered indices help improve query performance in data warehousing
applications.
3. Big Data Analytics: Ordered indices facilitate efficient data retrieval in big data analytics.
Properties of B-tree
Following are some of the properties of B-tree in DBMS:
A non-leaf node's number of keys is one less than its number of children.
The number of keys in the root ranges from 1 to (m-1); therefore, the root has a minimum of two and a maximum of m children.
For all nodes other than the root, the number of keys ranges from ⌈m/2⌉-1 to m-1; thus, these nodes can have between ⌈m/2⌉ and m children.
All leaf nodes are at the same level.
When B-tree is used for database indexing, it becomes a little more complex because
it has both a key and a value. The value serves as a reference to the particular data
record. A payload is the collective term for the key and value.
To index data by key and value, the database first constructs a unique random index or a primary key for each of the supplied records. The keys and record byte streams are then all stored in a B+ tree, and the generated index is used for indexing of the data.
This indexing helps to decrease the search time for data. All the data is stored on the leaf nodes, so to access a particular data index the database can use binary search on the leaf nodes, since the data is stored in sorted order.
If indexing is not used, the database reads every record to locate the requested one, which increases the time and cost of searching, so B-tree indexing is very efficient.
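The contrast between the two search strategies can be sketched in Python (a simplification using a flat sorted list to stand in for the sorted keys on the leaf level):

import bisect

leaf_keys = list(range(0, 1000, 2))  # sorted keys, as on the leaf level

def indexed_search(key):
    """Binary search over sorted keys: O(log n) comparisons."""
    i = bisect.bisect_left(leaf_keys, key)
    return i < len(leaf_keys) and leaf_keys[i] == key

def full_scan(key):
    """Without an index: read every record until the key is found."""
    return any(k == key for k in leaf_keys)

print(indexed_search(438), full_scan(438))  # True True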
Advantages:
1. Fast search, insertion, and deletion.
2. Self-balancing.
3. Efficient use of storage.
Disadvantages:
1. Complex implementation.
2. Additional storage requirements.
Real-World Applications:
1. Database query optimization.
2. File systems.
3. Compilers.
Structure of B+ Tree
In the B+ tree, every leaf node is at an equal distance from the root node. A B+ tree is of order n, where n is fixed for the whole tree.
Internal node
An internal node of the B+ tree (other than the root) contains at least n/2 pointers and at most n pointers.
Leaf node
A leaf node of the B+ tree contains at least n/2 record pointers and n/2 key values, and at most n record pointers and n key values.
Every leaf node of the B+ tree contains one block pointer P to point to the next leaf node.
Searching a record in B+ Tree
Suppose we have to search for 55 in the B+ tree structure below. First, we traverse the intermediary node that directs us to the leaf node that may contain the record for 55.
In the intermediary node, we find the branch between 50 and 75, which redirects us to the third leaf node. There, the DBMS performs a sequential search to find 55.
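The search for 55 can be traced with a tiny hand-built tree in Python. This is a hypothetical two-level tree consistent with the description, not the exact tree in the notes' figure:

# A two-level B+ tree: the root holds separator keys, the leaves hold data.
root = {"keys": [50, 75], "children": None}
leaves = [[30, 40], [50, 55, 60], [75, 80, 85]]  # keys < 50, 50..74, >= 75
root["children"] = leaves

def search(key):
    """Pick the branch by comparing against separator keys, then scan the leaf."""
    i = 0
    while i < len(root["keys"]) and key >= root["keys"][i]:
        i += 1  # for 55: 55 >= 50 but 55 < 75, so i ends at 1 (middle leaf)
    return key in root["children"][i]

print(search(55))  # True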
B+ Tree Insertion
Suppose we want to insert a record with key 60 into the structure below. It belongs in the third leaf node, after 55. The tree is balanced, but that leaf node is already full, so we cannot simply insert 60 there.
In this case, we have to split the leaf node so that the key can be inserted into the tree without affecting the fill factor, balance, and order.
After insertion, the third leaf node would hold the values (50, 55, 60, 65, 70), and its parent branches at 50. We split the leaf node in the middle so that the balance is not altered, grouping the keys into two leaf nodes: (50, 55) and (60, 65, 70).
If these two are to be leaf nodes, the intermediate node cannot branch only from 50; 60 must be added to it, with a pointer to the new leaf node.
This is how an entry is inserted when there is overflow. In the normal case, it is very easy to find the node where the key fits and simply place it in that leaf node.
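The split described above can be sketched as a small function (a simplification that splits a list of keys and returns the separator to push into the parent):

def split_leaf(keys):
    """Split an overfull leaf in the middle; the first key of the right
    half becomes the separator copied up into the parent node."""
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]

left, right, separator = split_leaf([50, 55, 60, 65, 70])
print(left, right, separator)  # [50, 55] [60, 65, 70] 60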
B+ Tree Deletion
Suppose we want to delete 60 from the example above. In this case, we have to remove 60 both from the intermediate node and from the fourth leaf node. If we simply remove it from the intermediate node, the tree will no longer satisfy the rules of the B+ tree, so we need to modify the tree to keep it balanced.
After deleting 60 from the above B+ tree and re-arranging the nodes, the tree appears as follows:
Advantages: all records are stored in the leaf nodes in sorted order, and the linked leaves make range queries and sequential access efficient.
Real-World Applications: database indexing and file systems.
Differences:
B-Tree:
1. Stores data in every node.
2. Nodes can have a variable number of keys.
B+ Tree:
1. Stores data only in leaf nodes; internal nodes hold keys for navigation.
2. Leaf nodes are linked together for sequential access.
Database Systems:
B+ Tree is generally preferred over B-Tree due to its faster search, insertion, and deletion
speeds, as well as its simpler implementation.
Hashing in DBMS
In a huge database structure, it is very inefficient to search all the index values and reach the
desired data. Hashing technique is used to calculate the direct location of a data record on
the disk without using index structure.
In this technique, data is stored at the data blocks whose address is generated by using the
hashing function. The memory location where these records are stored is known as data
bucket or data blocks.
Here, the hash function can use any column value to generate the address; most of the time, it uses the primary key. The hash function can range from a simple mathematical function to a complex one. We can even use the primary key itself as the address of the data block, meaning each row is stored in the data block whose address equals its primary key value.
The diagram above shows data block addresses that are the same as the primary key values. The hash function can also be a simple mathematical function such as mod, or an exponential or trigonometric function. Suppose we use a mod(5) hash function to determine the address of the data block. Applying mod(5) to the primary keys generates 3, 3, 1, 4 and 2 respectively, and the records are stored at those data block addresses.
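The mod(5) calculation can be reproduced directly in Python. The primary key values below are assumed for illustration (the notes' actual key values are in the missing diagram), chosen so that mod(5) yields the addresses 3, 3, 1, 4 and 2 mentioned above:

def bucket_address(primary_key, buckets=5):
    """Hash function: the data block address is key mod number of buckets."""
    return primary_key % buckets

# Hypothetical primary keys; mod(5) yields 3, 3, 1, 4 and 2 respectively.
for key in [103, 108, 106, 109, 107]:
    print(key, "->", bucket_address(key))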
Types of Hashing:
1) Static Hashing 2) Dynamic Hashing
Static Hashing
In static hashing, the resultant data bucket address will always be the same. That means if we generate an address for EMP_ID = 103 using the hash function mod(5), it will always result in the same bucket address, 3. The bucket address does not change.
Hence in this static hashing, the number of data buckets in memory remains constant
throughout. In this example, we will have five data buckets in the memory used to store the
data.
Searching a record
When a record needs to be searched, then the same hash function retrieves the address of
the bucket where the data is stored.
Insert a Record
When a new record is inserted into the table, an address is generated for the new record based on the hash key, and the record is stored at that location.
Delete a Record
To delete a record, we will first fetch the record which is supposed to be deleted. Then we will
delete the records for that address in memory.
Update a Record
To update a record, we will first search it using a hash function, and then the data record is
updated.
Bucket Overflow:
If we want to insert a new record into the file but the address of the data bucket generated by the hash function is not empty (data already exists at that address), the situation is known as bucket overflow. This is a critical situation in static hashing.
To overcome this situation, there are various methods. Some commonly used methods are as follows:
1. Open Hashing
When a hash function generates an address at which data is already stored, the next bucket is allocated to the record. This mechanism is called linear probing.
For example, suppose R3 is a new record which needs to be inserted, and the hash function generates address 112 for it. But the bucket at that address is already full, so the system searches for the next available data bucket, 113, and assigns R3 to it.
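A minimal Python sketch of linear probing (the bucket addresses here are small array indices rather than the 112/113 block numbers in the example, and the table is assumed not to be completely full):

buckets = [None] * 8  # one record per bucket

def insert_linear_probing(key):
    """If the home bucket is full, try the following buckets in order."""
    addr = key % len(buckets)
    while buckets[addr] is not None:
        addr = (addr + 1) % len(buckets)  # probe the next bucket
    buckets[addr] = key
    return addr

print(insert_linear_probing(16))  # home bucket 0
print(insert_linear_probing(24))  # bucket 0 full, spills to bucket 1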
2. Closed Hashing
When the buckets are full, a new data bucket is allocated for the same hash result and is linked after the previous one. This mechanism is known as overflow chaining.
For example, suppose R3 is a new record which needs to be inserted into the table, and the hash function generates address 110 for it. But this bucket is too full to store the new data. In this case, a new bucket is inserted at the end of bucket 110 and linked to it.
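Overflow chaining can be sketched with one Python list per bucket, so a full home bucket simply grows a chain (a simplification; real systems link fixed-size overflow blocks):

from collections import defaultdict

chains = defaultdict(list)  # bucket address -> chain of records

def insert_chained(key, num_buckets=8):
    """Append the record to the chain hanging off its home bucket."""
    addr = key % num_buckets
    chains[addr].append(key)
    return addr

insert_chained(16)
insert_chained(24)  # same home bucket; linked after the previous record
print(chains[0])    # [16, 24]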
Advantages of Static Hashing:
Fast Lookup: Static hashing allows for fast lookup, insertion, and deletion operations.
Efficient Storage: Static hashing optimizes storage usage by minimizing empty slots.
Simple Implementation: Static hashing is relatively easy to implement.
Predictable Performance: Static hashing provides predictable performance.
Low Overhead: Static hashing has low overhead in terms of memory and computation.
Dynamic Hashing
The dynamic hashing method is used to overcome the problems of static hashing like
bucket overflow.
In this method, data buckets grow or shrink as the number of records increases or decreases. This method is also known as the extendable hashing method.
It makes hashing dynamic, allowing insertion and deletion without degrading performance.
How to search a key:
First, compute the hash address of the key.
Check how many bits are used in the directory; this number of bits is called i.
Take the least significant i bits of the hash address. This gives an index into the directory.
Using that index, go to the directory and find the bucket address where the record might be.
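These retrieval steps can be sketched directly in Python: mask off the least significant i bits of the hash address and use them as a directory index. The bucket names mirror the B0..B3 labels used in the example below:

i = 2  # global depth: number of directory bits
directory = {0b00: "B0", 0b01: "B1", 0b10: "B2", 0b11: "B3"}

def bucket_for(hash_address):
    """Take the least significant i bits as the directory index."""
    index = hash_address & ((1 << i) - 1)
    return directory[index]

# Key 9 has hash address 10001 (binary); its last two bits are 01.
print(bucket_for(0b10001))  # B1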
How to insert a new record
Firstly, you have to follow the same procedure for retrieval, ending up in some bucket.
If there is still space in that bucket, then place the record in it.
If the bucket is full, then we will split the bucket and redistribute the records.
For example:
Consider the following grouping of keys into buckets, depending on the last bits of their hash address:
The last two bits of 2 and 4 are 00, so they go into bucket B0. The last two bits of 5 and 6 are 01, so they go into bucket B1. The last two bits of 1 and 3 are 10, so they go into bucket B2. The last two bits of 7 are 11, so it goes into B3.
Insert key 9 with hash address 10001 into the above structure:
Since key 9 has hash address 10001, its last two bits are 01, so it must go into bucket B1. But bucket B1 is already full, so it gets split.
The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they go into bucket B1, while the last three bits of 6 are 101, so it goes into bucket B5.
Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100 directory entries, because the last two bits of both entries are 00.
Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110 directory entries, because the last two bits of both entries are 10.
Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 directory entries, because the last two bits of both entries are 11.
Advantages of Dynamic Hashing:
o In this method, memory is well utilized, as the structure grows and shrinks with the data; no allocated memory lies unused.
o This method is good for dynamic databases where the data grows and shrinks frequently.
Disadvantages of Dynamic Hashing:
o In this method, if the data size increases, the bucket size also increases. The data addresses are maintained in the bucket address table because they keep changing as buckets grow and shrink. If there is a huge increase in data, maintaining the bucket address table becomes tedious.
o Bucket overflow can still occur in this method, although it may take longer to reach that situation than with static hashing.