UNIT IV Dbms
UNIT IV Dbms
The Schema Refinement refers to refine the schema by using some technique. The best technique of
schema refinement is decomposition. Normalization means “split the tables into small tables which
will contain less number of attributes in such a way that table design must not contain any problem
of inserting, deleting, updating anomalies and guarantees no redundancy”.
Redundancy: refers to repetition of same data or duplicate copies of same data stored in different
locations.
Anomalies: Anomalies refers to the problems occurred after poorly planned and normalized
databases where all the data is stored in one table which is sometimes called a flat file database.
Here all the data is stored in a single table which causes redundancy of data or say anomalies as SID
and Sname are repeated once for same CID . Let us discuss anomalies one by one. Due to
redundancy of data we may get the following problems, those are
1. Insertion anomalies: It may not be possible to store some information unless some other
information is stored as well.
2. Redundant storage: some information is stored repeatedly
3. Update anomalies: If one copy of redundant data is updated, then inconsistency is created unless
all redundant copies of data are updated.
4. Deletion anomalies: It may not be possible to delete some information without losing some other
information as well.
Problem in updation / updation anomaly – If there is updation in the fee from 5000 to 7000, then
we have to update FEE column in all the rows, else data will become inconsistent.
Insertion Anomaly and Deletion Anomaly- These anomalies exist only due to redundancy, otherwise
they do not exist.
Insertion Anomalies: New course is introduced C4, But no student is there who is having C4 subject.
Because of insertion of some data, It is forced to insert some other dummy data.
Deletion Anomaly: Deletion of S3 student cause the deletion of course. Because of deletion of some
data forced to delete some other useful data.
Solutions To Anomalies : Decomposition of Tables – Schema Refinement, shown below.
Purpose of Normalization:
Advantages of Normalization:
1. The disadvantage of normalization is that it produces a lot of tables with a relatively small
number of columns. These columns then have to be joined using their primary/foreign key
relationship.
2. This has two disadvantages. Performance: all the joins required to merge data slow
processing & place additional stress on your hardware. Complex queries: developers have to
code complex queries in order to merge data from different tables.
In other words, a dependency FD: “X → Y” means that the values of Y are determined by the values
of X. Two tuples sharing the same values of X will necessarily have the same values of Y. An attribute
on left hand side is known as “Determinant”. Here X is a Determinant.
Case1: A →B
Here A1 belongs to B1 & B2. So A1 does not have unique value in B. So it is not in FD.
Case1: A →C
Note: try to find all the possibilities. i.e., A→D, B→C, B→D, and C→D
Reasoning about functional dependencies:
Armstrong Axioms (Inference Rules ) : The term Armstrong axioms refers to the sound and
complete set of inference rules or axioms, introduced by William W. Armstrong, that is used to test
logical implication of functional dependencies.
Armstrong axioms define the set of rules for reasoning about functional dependencies and also to
infer all the functional dependencies on a relational database.
Primary axioms:
Attribute closure of an attribute set can be defined as set of attributes which can be functionally
determined from it.
The set of FD’s that is logically implied by F is called the closure of F and written as F+. And it is
defined as “If F is a set FD’s on a relation R, the F+, the closure of F by using the inferences axioms
that are not contained in F+.
Example: R (A, B, C, D) and set of Functional Dependencies are A→B, B→D, C→B then what is the
Closure of A, B, C, D?
Solution: A+ is
1. Fully Functional Dependency: A functional dependency is said to be full dependency “if and only if
the determinant of the functional dependency if either candidate key or super key, and the
dependent can be either prime or non-prime attribute”.
(OR)
Let’s take the functional dependency X → Y (i.e., X determines y). Here Y is said to be fully
determinant, if it cannot determine any subset of X.
Example: Consider the following determinant ABC → D i.e., ABC determines D but D is not
determined by any subset of A/ BC/C/B/AB i.e., BC→D, C→D, A→D Functional dependencies are not
exists. So D is Fully Functional Dependent.
2. Partial Functional Dependency: If a non-prime attribute of the relation is getting derived by only a
part of the candidate key, then such dependency is known as Partial Dependency.
(OR)
In a relation having more than one key field, a subset of non key fields may depend on all key fields
but another subset or a particular non-key field may depend on only one of the key fields. Such
dependency is defined as Partial Dependency.
Example: Consider the following determinants AC→P, A→D, D→P. From these determinants P is not
fully FD on AC. Because, If we find A+ (means A’s Closure) A→D, D→P i.e., A→P. But we don’t have
any requirement of C. C attribute is removed completely. So P is Partially Dependent on AC.
(3) If all the attributes in the table are part of the primary key
(2) If all the attributes in the table are part of the primary key.
4. Trivial Functional Dependency: It is basically related to Reflexive rule. i.e., if X is a set of
attributes, and Y is subset of X then X→Y holds.
5. Multi-Valued Dependency: Consider 3 fields X, Y, and Z in a relation. If for each value of X, there is
a well-defined set of values Y and Well-defined set of values of Z and set of values of Y is
independent of the set values of Z. This dependency is Multi-valued Dependency. i.e., X →Y / Z.
Prime and non-prime attributes: Attributes which are parts of any candidate key of relation are
called as prime attribute, others are non-prime attributes.
Candidate Key: Candidate Key is minimal set of attributes of a relation which can be used to identify
a tuple uniquely.
Consider student table: student(sno, sname,sphone,age) we can take sno as candidate key.
Super Key: Super Key is set of attributes of a relation which can be used to identify a tuple uniquely.
To identify the additional F.D’s : To check any F.D’s like AB can be determined from F1 or not.
Complete A+ from F1 is A+ includes B also then; AB can be derived as a F.D in F1.
Examples:
1. In a schema with attributes A,B,C,D and E the following set of attributes are given AB, AC,
CDE, BD, EA. Find CDAC determines from the given FDs or not.
2. Check DA can be derived from the following FDs or not ABC, BCAD, DE, CFB.
Identification of key by using closure set as attributes: A key attribute: An attribute that is capable of
identifying all other attributes in a given table.
(i) Primary key: It is an unique value attribute in a table to enforce entity integrity and ti identify
rows in the table uniquely.
(ii) Composite Primary Key: Sometimes single attribute is not sufficient to identify uniquely the rows
in the table so, we combine 2 or more attributes to identify the rows uniquely.
(iii) Candidate keys: Sometimes 2 or more independent attribute or attributes can be used to
identify the rows uniquely Eg :( vech no,veng no,purchase date) Either vehicle no or vehicle engine
no can be used as a key attribute then they are called as candidate keys one of the candidate key can
be elected as primary key.
Example 1: Find candidate keys for the relation R(ABCD) having following FD’s ABCD, CA, DA.
Normal Forms in DBMS
1. First Normal Form – If a relation contain composite or multi-valued attribute, it violates first
normal form or a relation is in first normal form if it does not contain any composite or multi-valued
attribute. A relation is in first normal form if every attribute in that relation is singled valued
attribute.
Partial Dependency – If the proper subset of candidate key determines non-prime attribute, it is
called partial dependency.
{Note that, there are many courses having the same course fee. }
Hence,
COURSE_FEE would be a non-prime attribute, as it does not belong to the one only candidate key
{STUD_NO, COURSE_NO} ;
But, COURSE_NO -> COURSE_FEE , i.e., COURSE_FEE is dependent on COURSE_NO, which is a proper
subset of the candidate key. Non-prime attribute COURSE_FEE is dependent on a proper subset of
the candidate key, which is a partial dependency and so this relation is not in 2NF.
To convert the above relation to 2NF, we need to split the table into two tables such as:
STUD_NO COURSE_NO
1 C1
2 C2
1 C4
4 C3
4 C1
2 C5
COURSE_NO COURSE_FEE
C1 1000
C2 1500
C3 1000
C4 2000
C5 2000
NOTE: 2NF tries to reduce the redundant data getting stored in memory. For instance, if there are
100 students taking C1 course, we dont need to store its Fee as 1000 for all the 100 records, instead
once we can store it in the second table as the course fee for C1 is 1000.
In the above relation, AB is the only candidate key and there is no partial dependency, i.e., any
proper subset of AB doesn’t determine any non-prime attribute.
3. Third Normal Form –A relation is in third normal form, if there is no transitive dependency for
non-prime attributes as well as it is in second normal form. A relation is in 3NF if at least one of the
following condition holds in every non-trivial function dependency X –> Y
1. X is a super key.
Transitive dependency – If A->B and B->C are two FDs then A->C is called transitive dependency.
Example 1 – In relation STUDENT given in Table,
FD set: {STUD_NO -> STUD_NAME, STUD_NO -> STUD_STATE, STUD_STATE -> STUD_COUNTRY,
STUD_NO -> STUD_AGE}
For this relation in table, STUD_NO -> STUD_STATE and STUD_STATE -> STUD_COUNTRY are true. So
STUD_COUNTRY is transitively dependent on STUD_NO. It violates the third normal form. To convert
it in third normal form, we will decompose the relation STUDENT (STUD_NO, STUD_NAME,
STUD_PHONE, STUD_STATE, STUD_COUNTRY_STUD_AGE) as: STUDENT (STUD_NO, STUD_NAME,
STUD_PHONE, STUD_STATE, STUD_AGE) STATE_COUNTRY (STATE, COUNTRY)
A -> BC,
CD -> E,
B -> D,
E -> A
All possible candidate keys in above relation are {A, E, CD, BC} All attribute are on right sides of all
functional dependencies are prime.
A relation R is in BCNF if R is in Third Normal Form and for every FD, LHS is super key. A relation is in
BCNF iff in every non-trivial functional dependency X –> Y, X is a super key.
Example 1 – Find the highest normal form of a relation R(A,B,C,D,E) with FD set as {BC->D, AC->BE, B-
>E}
Step 1. As we can see, (AC)+ ={A,C,B,E,D} but none of its subset can determine all attribute of
relation, So AC will be candidate key. A or C can’t be derived from any other attribute of the relation,
so there will be only 1 candidate key {AC}.
Step 2. Prime attributes are those attribute which are part of candidate key {A, C} in this example
and others will be non-prime {B, D, E} in this example
Step 3. The relation R is in 1st normal form as a relational DBMS does not allow multi-valued or
composite attribute.
The relation is in 2nd normal form because BC->D is in 2nd normal form (BC is not a proper subset of
candidate key AC) and AC->BE is in 2nd normal form (AC is candidate key) and B->E is in 2nd normal
form (B is not a proper subset of candidate key AC).
The relation is not in 3rd normal form because in BC->D (neither BC is a super key nor D is a prime
attribute) and in B->E (neither B is a super key nor E is a prime attribute) but to satisfy 3rd normal
for, either LHS of an FD should be super key or RHS should be prime attribute. So the highest normal
form of relation will be 2nd Normal form.
A -> BC,
MVD or multivalued dependency means that for a single value of attribute ‘a’ multiple values of
attribute ‘b’ exist. We write it as, a --> --> b
Suppose a person named Geeks is working on 2 projects Microsoft and Oracle and has 2 hobbies
namely Reading and Music. This can be expressed in a tabular format in the following way.
Project and Hobby are multivalued attributes as they have more than one value for a single person
i.e., Geeks
We can say that multivalued dependency exists if the following conditions are met.
Any attribute say a multiple define another attribute b; if any legal relation r(R), for all pairs of tuples
t1 and t2 in r, such that, t1[a] = t2[a]
t1 = t4; t2 = t3
To check the MVD in given table, we apply the conditions stated above and we check it with the
values in the given table.
5. Fourth Normal Form-
A relation R is in 4NF if R is in BCNF and does not contain any multi-valued dependencies (MVDs)
Consider a university where a student can enroll in multiple courses, and a student can also
participate in multiple clubs. The two independent multi-valued attributes (Courses and Clubs) lead
to multi-valued dependency.
Causes redundancy: If a student joins an extra club, all their course records get duplicated.
We decompose the table into two separate tables, one for student-course and another for
student-club.
Since each independent multi-valued attribute(Course, Club) is now in it’s own table.
Data is stored efficiently, preventing duplication.
We can say that our table or relation is in 4NF.
A table is in 5NF if it is already in 4NF and it does not have join dependency that cannot be
represented as the natural join of the smaller tables which eliminates redundancy caused by
complex multi-join relationships and prevents the loss of information when decomposing.
A join dependency means that a table can be split into two or more smaller tables and then
joined back without losing any data.
If a table can be decomposed into smaller tables and reconstructed perfectly using a
natural join, it is said to have a valid join dependency.
However, if a table cannot be represented by natural joins of smaller tables without losing
information or causing extra tuples (spurious tuples), it violates 5NF.
Example: A company assigns Projects to Vendors, and Vendors supply Parts for those projects.
The Project, Vendor, and Part are interrelated, but their relationships are independent.
A Project may require multiple Parts.
A Vendor may supply multiple Parts.
A Project may use multiple Vendors.
The problem is If a project stops using a part, but the vendor still supplies it to another project,
deleting a row may cause loss of information.
Let R is decomposed into {R1, R2,....,Rn} with projected FD set {F1,F2,......Fn}. This decomposition is
dependency preserving if F + ={F1 U F2 U.........Fn} + .
Example
Let the relation R{A,B,C,D,E} F:{AB->C, C->D, AB->D} R is decomposed to R1(A,B,C), R2(D,E). Prove
decomposition is dependency preserving.
Solution
F1={AB->C}
F2={C->D}
=> (F1 u F2) = {AB->C, C->D}
AB+ under (F1 U F2) = {A,B,C,D} => AB->D is under (F1 U F2)
F+ = (F1 U F2) +
Let the relation R{A,B,C,D,E,F,G,H,I,J} where F: {AB->C, A->DE, B->F, F->GH. D->IJ}
Solution
F1={AB->C}
F2={}
F3={B->F}
F4={F->GH}
F5={D->IJ}
It is a process in which a relation is decomposed into two or more relations. This property
guarantees that the extra or less tuple generation problem does not occur, and no information is lost
from the original relation during the decomposition. It is also known as non-additive join
decomposition.
When the sub relations combine again then the new relation must be the same as the original
relation was before decomposition.
Consider a relation R if we decomposed it into sub-parts relation R1 and relation R2. The
decomposition is lossless when it satisfies the following statement −
whenever you are decomposing a table and finding union then R1UR2=R
and whenever you perform join operation R1∩R2 /= ∅
common attribute when decomposing a table should be a candidate key of
R1 or R2 or both.
So by the end when you achieve same number of tuples as that of an original table after
decomposing them and performing join operation should result in same number of tuples.
Spurious tuples: Spurious tuples are incorrect or extra tuples that appear when joining decomposed
relations improperly. They do not exist in the original relation but are generated due to lossy
decomposition.
Follow the example I taught in class when I explained dependency preserving decomposition.