BIT 3107 Database Systems II
Contact hours: 42
Pre-requisites: Database Systems BIT 2102
Purpose:
Expected Learning Outcomes of the Course:
By the end of the course, students should be able to:
i. Evaluate various file organization methods, distribution strategies, concurrency
mechanisms, query optimization techniques, and database recovery and security
procedures.
ii. Critically examine and reflect on the theoretical aspect of the process of normalization
and its practical implementation.
iii. Evaluate and use various modern database management systems tools in building a
small but realistic database system.
iv. Describe advances and emerging standards in database technology; understand the
comparative features and crucial aspects of data and knowledge management technology
and the impact of the Internet on database technology.
Course Content:
Course Assessment
Examination - 70%
Continuous Assessment Test (CATS) - 20%
Assignments - 10%
Total - 100%
COURSE OUTLINE
WEEK SEVEN: DATABASE SECURITY
Characteristics of database security
Threats
Causes of Failures
Recovery Procedures
Database Recovery Features
DBMS Recovery Facilities
Review Questions
Database and Database System
Week One
Database
The data in the database is expected to be both integrated and shared, particularly on
multi-user systems.
Integration – the database may be thought of as a unification of several otherwise
distinct files, with any redundancy among those files eliminated.
Shared – individual pieces of data in the database may be shared among several
different users.
Hardware
These are secondary storage on which the database physically resides, together with the
associated I/O devices, device controllers etc.
Procedures
These are instructions and rules that govern design and use of databases e.g. starting and
stopping database, making backups, handling failures.
2. Database Administrator (DBA) – a technical person responsible for the development of
the total system
Database Architecture
External Level – concerned with the way users perceive the database
Conceptual Level – concerned with abstract representation of the database in its entirety
Internal Level – concerned with the way data is actually stored
Data Independence
1. Logical Data Independence – users and user programs are independent of logical
structure of the database
2. Physical Data Independence – the separation of structural information about the data
from the programs that manipulate and use the data i.e. the immunity of application
programs to changes in the storage structure and access strategy
Database Design
Conceptual database design – the process of constructing a model of the information used in
an organization, independent of all physical considerations.
Step 1 Build local conceptual data model for each user view
– Identify entity types
– Identify relationship types
– Identify and associate attributes with entity or relationship
– Determine attributes domains
– Determine candidate and primary key attributes
– Specialize/generalize entity types (optional step)
– Draw Entity-Relationship diagram
– Review local conceptual data model with user
Logical Database Design for the Relational Model: the process of constructing a model of the
information used in an organization based on a specific data model, but independent of a
particular DBMS and other physical considerations
Step 2 Build and validate local data model for each user view
– Map local conceptual data model to local logical data model
– Derive relations from local logical data model
– Validate model using normalization
– Validate model against user transactions
– Draw Entity-Relationship diagram
– Define integrity constraints
– Review local logical data model with user
Step 3 Build and validate global logical data model
– Merge local logical data models into global model
– Validate global data model
– Check for future growth
– Draw final Entity-Relationship diagram
– Review global logical data model with users
Physical Database Design for Relational Databases: the process of producing a description of
the implementation of the database on secondary storage.
Step 4 Translate global data model for target DBMS
– Design base relations
– Design enterprise constraints for target DBMS
Step 5 Design physical representations
– Analyze transactions
– Choose file organizations
– Choose secondary indexes
– Consider the introduction of controlled redundancy
– Estimate disk space requirements
Step 6 Design security mechanisms
– Design user views
– Design access rules
Step 7 Monitor and tune the operational system
Data model
A data model is a collection of tools for describing
– Data
– Data relationships
– Data semantics
– Data constraints
Data models include the following
§ Relational model
§ Entity-Relationship data model (mainly for database design)
§ Object-based data models (Object-oriented and Object-relational)
§ Semi-structured data model (XML)
§ Other older models:
Network model
Hierarchical model
A Data model consists of
Entities
– A person, place, object, event or concept in the user environment about which
the organization wishes to maintain data.
Relationships
– Association between entities
– 1:M, M:N
Relational databases
A Relational Database is:
– A database that represents data as a collection of tables in which all data
relationships are represented by common values in related tables
– Database technology involving tables (relations) representing entities and
primary/foreign keys representing relationships
Traditional File Processing
File processing dates back to before databases existed; it was used to store, manipulate and
retrieve large files of data to support business operations.
It is still in use today, including for backing up database systems.
This approach to information systems design focused on the data processing needs of individual
departments, instead of evaluating the overall information needs of the organization.
Disadvantages of File Processing
Drawbacks of using file systems to store data include
i. Program-Data Dependence. All programs maintain metadata for each file they use
ii. Data redundancy and inconsistency. Multiple file formats, duplication of information in
different files.
iii. Duplication of data results in different systems/programs having separate copies of the
same data, i.e. multiple files
iv. Difficulty in accessing data. Need to write a new program to carry out each new task
v. Integrity problems. Integrity constraints (e.g., account balance > 0) become “buried” in
program code rather than being stated explicitly
vi. Limited Data Sharing. No centralized control of data
vii. Lengthy development times. It is hard to add new constraints or change existing ones,
and programmers must design their own file formats
viii. Atomicity of updates. Failures may leave database in an inconsistent state with partial
updates carried out. Example: Transfer of funds from one account to another should
either complete or not happen at all
ix. Concurrent access by multiple users. Concurrent access is needed for performance, but
uncontrolled concurrent access can lead to inconsistencies. Example: two people
reading a balance (say 100) and updating it by withdrawing money (say 50 each) at the
same time
x. Security problems. Hard to provide user access to some, but not all, data
xi. Excessive Program Maintenance. Can take up to 80% of information systems budget
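Problems (viii) and (ix) above are exactly what DBMS transactions address. A minimal sketch using Python's sqlite3, with a hypothetical account table:

```python
import sqlite3

# Atomicity: both updates of a funds transfer commit together or neither does.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO account VALUES (1, 100), (2, 100)")
con.commit()

try:
    with con:  # transaction: commits on success, rolls back on any exception
        con.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
        raise RuntimeError("simulated crash before the matching credit")
        # con.execute("UPDATE account SET balance = balance + 50 WHERE id = 2")
except RuntimeError:
    pass

balances = con.execute("SELECT balance FROM account ORDER BY id").fetchall()
print(balances)   # [(100,), (100,)] – the partial debit was rolled back
```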
SOLUTION:
The solution to these problems is the DATABASE approach, which has the following
characteristics:
i. Emphasizes the integration and sharing of data throughout the organization
ii. Central repository of shared data
iii. Data is managed by a controlling agent
iv. Stored in a standardized, convenient form
Typical database applications include:
Online retailers: order tracking, customized recommendations
Manufacturing: production, inventory, orders, supply chain
Human resources: employee records, salaries, tax deductions
Databases can be very large, and they touch all aspects of our lives.
Types of databases:
Databases can be classified by:
1. Location
Centralized:
– Supports data located at a single site
Distributed:
– Supports data distributed across several sites
2. Use:
Transactional (or production):
– Supports a company’s day-to-day operations
Data warehouse:
– Stores data used to generate information required to make tactical or
strategic decisions
– Often used to store historical data
– Structure is quite different
The DDL compiler generates a set of table templates stored in a data dictionary
The Data dictionary contains metadata (i.e., data about data). This consists of
o Database schema
o Integrity constraints
– Primary key (ID uniquely identifies instructors)
– Referential integrity (references constraint in SQL) e.g. dept_name value
in any instructor tuple must appear in department relation
o Authorization
1990s:
Large decision support and data-mining applications
Large multi-terabyte data warehouses
Emergence of Web commerce
Early 2000s:
XML and XQuery standards
Automated database administration
Later 2000s:
Giant data storage systems: Google BigTable, Yahoo! PNUTS, Amazon, etc.
Review Questions
1. Define the following terms
Data
Meta data
End user data
Database
Database management system
Database system
2. Discuss the main characteristics of the database approach and how it differs from a
traditional file system.
3. Discuss the components of database systems
4. Compare any three common databases used in the corporate world.
5. Discuss the capabilities provided by a typical DBMS.
6. Outline the main components of a DBMS
Week Two
Data Models:
A database can be modeled as a collection of entities and relationships among entities.
§ An Entity Model (also called Entity-Relationship Model) is a representation of the data
used by a system. It shows how data is logically grouped together and the relationships
between these logical data groupings as defined by the business requirements of the
system.
§ An Entity Model comprises:
§ Entity Relationship Diagram (ERD)
§ Associated Documentation of Entities, Relationships and Attributes.
Entity
An entity is an object that exists and is distinguishable from other objects. Example: a specific
person, company, event, plant etc.
§ It is a logical grouping of data relevant to the application under investigation. An
identifiable object, concept or activity belonging to the application about which
descriptive data is stored.
§ An entity set is a set of entities of the same type that share the same properties.
Example: set of all persons, companies, trees, holidays
An entity usually has attributes (i.e., data elements) that further describe it. Each
attribute is a characteristic of the entity. An entity must possess a set of one or more
attributes that uniquely identify it (called a primary key).
Attribute
Entities have attributes Example: people have names and addresses
An attribute is a property of an entity that distinguishes one entity occurrence from
another
§ There are two types: Key Attribute and Non-Key Attributes
§ Attributes can belong to an Attribute Domain - set of values that may be assigned to an
attribute and they can be:
– Simple attribute or Composite attribute - multiple components
– Single-valued or Multi-valued
– Derived attribute
Relationships
§ Relationship is a meaningful association between two or more entities
§ Degree of a Relationship refers to the number of participating entities in a relationship
§ One-to-one - For any A there may only be one member of B and for any B there is only
one member of A associated with it at any time.
§ One to Many - For any A there may be many members of B and for any B there is only
one member of A associated with it.
Example: Customer – Order
§ Many to Many - For any A there may be many members of B and for any B there may be
many members of A associated with it. Example: Customer – Product
§ One-to-One relationships are, in general, discouraged and it is normal to merge the two
entities involved into one
§ One-to-Many relationships defines the association between a master entity and a detail
entity
§ Many-to-Many relationships must be decomposed into One-to-Many relationships
Constructing the Entity Model
Step 1 - Develop the Initial Entities
§ Identify the objects about which the system currently holds data. This may be done
using interviews, notes, document reviews, observation
Step 2 - Identify Initial Relationships
§ Successful identification of relationships between entities demands constant reference
to the business requirements and environments of the system.
Step 3 Validate against Current System
§ The initial ERD is validated with users. The ERD is logical and so does not show physical
aspects of the current system: instead of "Blue Enquiry Form", "Order Form" and
"Employee Master Listing", use "Enquiry", "Order" and "Employee".
Step 4 - Extend EM to Cover New Requirements
§ The EM is extended to cover entities needed to fulfil the new requirements of the
system.
Step 5 - Rationalize the Structure
§ A review of the ERD is required to ensure that it is the minimum logical model required
to support the system requirements.
§ If there are two access paths between two entities, the access logic may be duplicated
and redundant.
§ If link entities are found to contain similar data, they may be merged.
Step 6 - Revalidate the EM with Users
More on Relationships
§ Recursive Relationships - A recursive relationship is one in which the same entity
participates more than once in different roles
Example:
Employee married to Employee
Employee manages Employee
§ Optional/Mandatory Relationships
In an optional relationship, an entity occurrence in one entity need not have an
occurrence in the other entity associated with it.
§ Cardinality Constraints - describes the maximum number of possible relationships for
each participating entity
• It is a function of policies established by an enterprise
• The most common cardinalities are one-to-one (1:1), one-to-many (1:M) and many-to-many (M:N)
§ Cardinality Constraints - Example
A Lecturer teaches at most three Classes per semester and each Class has at most one
Lecturer. A student is limited to six classes per semester and class size is limited to 35.
Participation Constraint (Existence – Dependent) - Determines whether the existence of
an entity depends on it being related to another entity through the relationship.
Example: Course generates Class
Enhanced Entity Model
§ Generalization vs Specialization
Generalization and specialization are important relationships that exist between a higher-level
entity set and one or more lower-level entity sets.
1. Generalization is the result of taking the union of two or more lower-level entity
sets to produce a higher-level entity set. Specialization is the result of taking
subsets of a higher-level entity set to form lower-level entity sets.
2. In generalization, each higher-level entity must also be a lower-level entity.
In specialization, some higher-level entities may not belong to any lower-level
entity set at all.
3. Specialization is a top-down process, whereas generalization is a bottom-up
process.
First Normal Form
§ A relation is in FNF if and only if it contains no repeating groups
* Remove repeating groups and propagate higher-level primary keys by partitioning the
un-normalized relation
Second Normal Form
§ A relation is in SNF if and only if it is in FNF and every non-key attribute is fully
functionally dependent on the key attribute
* Remove any partial dependencies by partitioning the relation
* A relation that is in FNF and has no composite key is necessarily in SNF!
Third Normal Form
§ A relation is in TNF if and only if it is in SNF and every non-key attribute is independent
of all other non-key attributes
* Remove any non-key attributes that depend on other non-key attributes
(transitive dependencies)
Functional Dependency
§ Given a relation R, attribute Y of R is functionally dependent on attribute X of R, if and
only if each X-value in R has associated with it precisely one Y-value in R (at any one
time)
R.X → R.Y (R.X functionally determines R.Y)
Transitive Dependency
§ A → B → C
If B depends on A and C depends on B, then C transitively depends on A via B
Example: StaffNo → name, salary, branchNo, address
branchNo → address
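The definition of functional dependency can be tested mechanically. A small sketch, assuming a relation held as a list of dicts (the sample data is illustrative):

```python
# Test whether attribute y is functionally dependent on attribute x in a
# relation: every x-value must be associated with exactly one y-value.
def functionally_determines(relation, x, y):
    seen = {}
    for row in relation:
        if seen.setdefault(row[x], row[y]) != row[y]:
            return False   # same x-value associated with two y-values
    return True

staff = [
    {"staffno": "S1", "branchno": "B1", "address": "Belfast"},
    {"staffno": "S2", "branchno": "B1", "address": "Belfast"},
    {"staffno": "S3", "branchno": "B2", "address": "Derry"},
]

print(functionally_determines(staff, "staffno", "branchno"))   # True
print(functionally_determines(staff, "branchno", "address"))   # True: branchno -> address
print(functionally_determines(staff, "address", "staffno"))    # False: Belfast maps to S1 and S2
```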
Normalization Steps
§ Represent the un-normalized relation
List Attributes
Identify Repeating Groups
Identify Key Attributes
§ Convert to FNF
• By removing Repeating Groups
§ Understand the Dependencies
• Functional Dependency Diagrams may be used
§ Convert to SNF
• by removing partial dependencies
§ Convert to TNF
• by removing transitive dependencies
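The final TNF step can be sketched in code. This hypothetical Staff relation has the transitive dependency branchNo → address from the example above; splitting it removes the redundancy:

```python
# Remove the transitive dependency staffno -> branchno -> address by
# partitioning the relation in two (sample data is illustrative).
staff = [
    {"staffno": "S1", "name": "Ann",  "salary": 30000, "branchno": "B1", "address": "Belfast"},
    {"staffno": "S2", "name": "Ben",  "salary": 28000, "branchno": "B1", "address": "Belfast"},
    {"staffno": "S3", "name": "Cara", "salary": 35000, "branchno": "B2", "address": "Derry"},
]

# Project the transitively dependent attribute into its own relation, keyed on branchno.
branch = {r["branchno"]: {"branchno": r["branchno"], "address": r["address"]} for r in staff}
staff_3nf = [{k: v for k, v in r.items() if k != "address"} for r in staff]

print(sorted(branch))   # ['B1', 'B2'] – each address is now stored once per branch
print(staff_3nf[0])     # {'staffno': 'S1', 'name': 'Ann', 'salary': 30000, 'branchno': 'B1'}
```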
Fourth Normal Form
§ Fourth Normal Form is concerned with multi-valued dependencies
• A functional dependency indicates that for each value of X there is one value of Y
• A multi-valued dependency states that for each value of attribute X there is a finite
set of values of Y. Example Book/Authors
Fifth Normal Form
§ Fifth Normal Form is concerned with relations which cannot be partitioned into two
relations without losing information, but can be partitioned into three or more relations
* Has little practical importance
Review Questions
1. An engineering consultancy firm supplies temporary specialized staff to bigger companies in
the country to work on their projects for a certain amount of time. The table below lists the
time spent by each of the company's employees at other companies to carry out projects.
The National Insurance Number (NIN) is unique for every member of staff.
[Table not fully reproduced in this copy; one surviving row: 674315A, SC1025, 48, R. Press, SC115, Belfast]
Week Three
Relational Keys
i. Candidate Key - a set of unique identifiers
ii. Primary Key - The candidate key selected to be the primary key
iii. Composite Key - a candidate key that consists of two or more attributes
iv. Foreign Key - A foreign key is an attribute(s) in one relation whose values are
required to match those of the primary key of some relation
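The four kinds of key above can be illustrated in SQL DDL. A sketch using SQLite; the table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this pragma
con.execute("CREATE TABLE branch (branchno TEXT PRIMARY KEY)")       # candidate key chosen as primary key
con.execute("""CREATE TABLE staff (
                   staffno  TEXT PRIMARY KEY,
                   branchno TEXT REFERENCES branch(branchno))""")    # foreign key
con.execute("""CREATE TABLE lending (
                   staffno TEXT,
                   bookno  TEXT,
                   PRIMARY KEY (staffno, bookno))""")                # composite key

con.execute("INSERT INTO branch VALUES ('B1')")
con.execute("INSERT INTO staff VALUES ('S1', 'B1')")      # B1 exists: accepted
try:
    con.execute("INSERT INTO staff VALUES ('S2', 'B9')")  # no such branch: rejected
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)   # True – foreign key values must match an existing primary key
```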
The Data Manipulation Languages (DML) of relational databases is based on Relational Algebra.
Each operator of the relational algebra takes either one or two relations as its input and
produces a new relation as its output
Data Manipulation Language (DML)
A DML is a language which enables users to access and manipulate data. The goal is to provide
efficient human interaction with the system. There are two types of DML:
i. procedural: the user specifies what data is needed and how to get it
ii. nonprocedural: the user only specifies what data is needed
A query language is a portion of a DML involving information retrieval only. The terms DML and
query language are often used synonymously.
Data manipulation languages were initially only used by computer programs, but (with the
advent of SQL) have come to be used by people, as well.
Data Manipulation Languages have their functional capability organized by the initial word in a
statement, which is almost always a verb. In the case of SQL, these verbs are:
1. Select
2. Insert
3. Update
4. Delete
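A minimal sketch of the four verbs using Python's sqlite3 (the book table is hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE book (bookno TEXT PRIMARY KEY, title TEXT)")

con.execute("INSERT INTO book VALUES ('B1', 'Databases')")                     # Insert
con.execute("UPDATE book SET title = 'Database Systems' WHERE bookno = 'B1'")  # Update
titles = con.execute("SELECT title FROM book").fetchall()                      # Select
print(titles)   # [('Database Systems',)]
con.execute("DELETE FROM book WHERE bookno = 'B1'")                            # Delete
print(con.execute("SELECT COUNT(*) FROM book").fetchone()[0])   # 0
```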
Relational Algebra
§ SELECT - Extracts specified tuples (rows/records) from a specified relation
§ PROJECT - Extracts specified attributes (columns/fields) from a specified relation
§ JOIN - Produces a relation from two specified relations consisting of all possible
concatenated pairs of tuples, one from each of the two specified relations, such that in
each pair the two tuples satisfy some specified condition
§ DIVIDE - Takes two relations, one binary and one unary, and builds a relation consisting
of all values of one attribute of the binary relation that match (in the other attribute) all
values in the unary relation
§ UNION - Produces a relation consisting of all tuples appearing in either or both of two
specified relations
§ INTERSECTION - Produces a relation consisting of all tuples appearing in both of two
specified relations
§ DIFFERENCE - Produces a relation consisting of all tuples appearing in the first and not in
the second of two specified relations
§ PRODUCT-Produces a relation from two specified relations consisting of all possible
concatenated pairs of tuples, one from each of the two specified relations
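Three of these operators can be sketched for relations held as lists of dicts. This is an illustrative toy implementation, not the formal definitions:

```python
def select(rel, pred):                 # SELECT: extract matching tuples
    return [t for t in rel if pred(t)]

def project(rel, attrs):               # PROJECT: extract attributes, drop duplicates
    result = []
    for t in rel:
        row = {a: t[a] for a in attrs}
        if row not in result:
            result.append(row)
    return result

def join(r, s, attr):                  # JOIN: pair tuples that agree on attr
    return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]

# Illustrative data in the spirit of the Student/Lending example.
student = [{"stud_no": 1, "name": "Ann"}, {"stud_no": 2, "name": "Ben"}]
lending = [{"stud_no": 1, "book_no": "B1"}, {"stud_no": 2, "book_no": "B2"}]

borrowed_b1 = project(join(select(lending, lambda t: t["book_no"] == "B1"),
                           student, "stud_no"), ["name"])
print(borrowed_b1)   # [{'name': 'Ann'}]
```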
A Data Definition Language (DDL) is a computer language for defining data structures. The
term was first introduced in relation to the Codasyl database model, where the schema* of the
database was written in a Data Definition Language describing the records, fields, and "sets"
making up the user Data Model. Initially it referred to a subset of SQL, but is now used in a
generic sense to refer to any formal language for describing data or information structures, like
XML schemas.
*Schema: The schema of a database system is its structure described in a formal language
supported by the database management system (DBMS). In a relational database, the schema
defines the tables, the fields in each table, and the relationships between fields and tables.
Schemas are generally stored in a data dictionary. Although a schema is defined in a text-based
database language, the term is often used to refer to a graphical depiction of the database
structure.
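SQLite illustrates this: the schema defined by DDL statements is recorded in its catalogue table sqlite_master, which plays the role of the data dictionary described above (the tables below are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE department (dept_name TEXT PRIMARY KEY)")
con.execute("""CREATE TABLE instructor (
                   id        INTEGER PRIMARY KEY,
                   dept_name TEXT REFERENCES department(dept_name))""")

# The DBMS's own catalogue now describes the schema we just defined.
tables = [name for (name,) in
          con.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)   # ['department', 'instructor']
```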
Review Questions
1. Define the following terms
Data model
Referential integrity
Primary key
Candidate key
Value set (domain)
Attribute
Attribute value
2. Differentiate between the following terms: -
Entity type and entity set
Relationship type, relationship instance and relationship set.
Weak entity types and strong (regular) entity types
3. Describe briefly the following data models
Relational data models
Network data models
Hierarchical data models
4. What is a participatory role? When is it necessary to use role names in the
description of relationship type?
5. Define the following terms:-
a. Internal schema
b. Conceptual schema
c. External schema
d. Data independence
e. Data definition language
f. Query language
g. Data manipulation language
6. Differentiate between
Database state and database schema
Logical data independence and physical data independence
Procedural and non-procedural DMLs.
7. Discuss the weaknesses of a relational database model
Week Four
Query optimisation
Objective
A query optimizer is a program that evaluates relational queries efficiently, making use of
relevant statistical information. The main objective of query optimization is to choose the most
efficient strategy for implementing a given relational query, and thereby improve system
performance.
Need for Query optimization
A database system based on either the hierarchical model or network model requires users to
navigate across the database to locate and retrieve the desired data. A relational database
system, however, allows users simply to state what data they require and leave the system to
locate that data in the database.
Although there are many different ways to perform a single user query, this automatic
navigation can be optimized by choosing and executing the most efficient plan, selected on the
basis of system statistical information. By executing queries in the most cost-effective way, the
overall system performance can be improved.
The speed discrepancy between CPU and I/O devices is still enormous despite rapid recent
advances in computer technology. As an I/O access is one of the most costly and frequently
performed operations in a typical database system, it is the main aim of the query optimiser to
minimize the I/O activities by choosing the cheapest query plan for a given query, usually the
one with a minimum number of tuple I/O operations.
Example: Effects of Optimization
Consider three relations: Student, Lending and Book. The attributes highlighted in fig. V4.1
(not reproduced here) are the keys of the relevant relations. The query is: retrieve the names
of students who have borrowed the book B1.
Assumptions:
The database contains 100 students and 10,000 lendings, of which only 50 are for book B1.
It is possible to hold up to 50 tuples in memory, without having to write back to disk.
Query Execution Plan A - No Optimization (Join-Select-Project)
1. Join relations Student and Lending over Stud_No
2. Select the result of Step 1 for just the tuples for book B1
3. Project the result of Step 2 over Stud_Name to get the result (50 max)
Total tuple I/O: 1,020,000
In this first approach, the operations required by the query are performed in the sequence
Join-Select-Project. We calculate the number of database accesses (tuple I/O operations)
incurred by each operation.
1. Join relations Student and Lending over Stud_No.
- For each Student row, every Lending row is retrieved and tested (reading the 10,000
Lending rows once for each of the 100 Student rows): 100 x 10,000 = 1,000,000 tuple reads.
- Every Lending row matches exactly one Student row, so the number of joined rows in
this intermediate relation is 10,000. These have to be written back to disk (only 50
tuples can be held in memory – see assumptions).
The number of tuple I/Os in this step is: 1,000,000 + 10,000 = 1,010,000
2. Select the result of Step 1 for just the tuples for book B1.
- This requires reading the 10,000 joined tuples (obtained in Step 1) back into memory.
- The Select then produces a relation containing only 50 tuples, which can be kept in
memory (see assumptions).
The number of tuple I/Os in this step is: 10,000 + 0 = 10,000
3. Project the result of Step 2 over Stud_Name to get the result (50 max).
- This reads a relation of 50 tuples (obtained in Step 2) which is already in memory, and
produces a final relation of no more than 50 tuples, again in memory.
The number of tuple I/Os in this step is: 0 + 0 = 0
Therefore, the total number of tuple I/Os for query plan A is 1,010,000 + 10,000 = 1,020,000.
Query Execution Plan B - With Optimization (Select-Join-Project)
1. Select the relation Lending for just the tuples for book B1
2. Join the result of Step 1 to relation Student over Stud_No
3. Project the result of Step 2 over Stud_Name
Total tuple I/O: 10,100
In this approach, the sequence of the operations has been changed to Select-Join-Project.
1. Select the relation Lending for just the tuples for book B1. This reads the 10,000
Lending tuples and produces a relation with 50 tuples, which will be kept in memory
(see assumptions).
The number of tuple I/Os in this step is: 10,000 + 0 = 10,000
2. Join the result of Step 1 with relation Student over Stud_No. Each of the 100 Student
rows is read once and tested against the 50 selected tuples already in memory; the 50
joined tuples also remain in memory.
The number of tuple I/Os in this step is: 100 + 0 = 100
3. Project the result of Step 2 over Stud_Name. As in Plan A, this takes place entirely in
memory: 0 tuple I/Os.
Therefore, the total number of tuple I/Os for query plan B is 10,000 + 100 = 10,100.
Comparison of A and B
Although plan A and plan B will produce the same end result for the given
query, the order in which the required relational operations are executed is
different. It is obvious that plan B requires far fewer tuple I/O accesses than
plan A, and therefore is far more efficient in terms of performance.
If, furthermore, use was made in Plan B of an index on Book_No for the Lending relation, the
number of tuples read in Step 1 would be further reduced from 10,000 to just 50. Similarly, an
index on Student.Stud_No would reduce retrievals in Step 2 to only 50 tuples. Similar examples
show how tuple I/Os are calculated and used as a parameter in the comparison of different
query execution plans.
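The tuple I/O arithmetic above can be reproduced directly:

```python
# Reproduce the tuple-I/O counts: 100 students, 10,000 lendings,
# 50 lendings for book B1, up to 50 tuples held in memory.
STUDENTS, LENDINGS = 100, 10_000

# Plan A: Join-Select-Project
join_io   = STUDENTS * LENDINGS + LENDINGS  # read every Student/Lending pair, write 10,000 joined tuples
select_io = LENDINGS                        # re-read the joined tuples; 50-tuple result fits in memory
plan_a    = join_io + select_io + 0         # projection happens entirely in memory
print(plan_a)   # 1020000

# Plan B: Select-Join-Project
select_io = LENDINGS                        # one scan of Lending; 50 tuples stay in memory
join_io   = STUDENTS                        # one scan of Student against the in-memory tuples
plan_b    = select_io + join_io + 0
print(plan_b)   # 10100
```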
Four Stages of Optimization
1. Convert the query into an internal form suitable for machine manipulation,
e.g. a query tree or relational algebra
2. Further convert the internal form into an equivalent and more efficient canonical
form, using well-defined transformation rules (laws)
– Query tree
– Relational algebra
3. Choose a set of candidate low-level procedures, using statistics about the database
– Low-level operation
– Implementation procedure
– Cost formula
4. Generate query plans and choose the best (cheapest) plan by evaluating the
cost formulae
Having completed steps 1 and 2, the query optimizer must now decide how to evaluate
the transformed query. The basic principle here is to consider the transformed query
representation as specifying a series of low-level operations, using statistical
information about the database
The main results produced in this stage are the candidate low-level procedures, their
implementation details and their cost formulae.
Transformation Rules
Transformation Rules - 1
Rule 1
(A where Restrict-1) where Restrict-2
≡ A where (Restrict-1 AND Restrict-2)
Rule 2
(A [Project]) where Restrict
≡ (A where Restrict) [Project]
Rule 3
(A [Project-1]) [Project-2] ≡ A [Project-2]
Rule 4
(A Join B) where Restrict-on-A AND Restrict-on-B
≡ (A where Restrict-on-A) Join (B where Restrict-on-B)
• Rule 1
Transform a sequence of restrictions against a given relation into a single restriction.
(A where Restrict-1) where Restrict-2
≡ A where (Restrict-1 AND Restrict-2)
• Rule 2
Transform a restriction of a projection into a projection of a restriction.
(A [Project]) where Restrict
≡ (A where Restrict) [Project]
• Rule 3
Transform a sequence of projections against a given relation into a single (the last)
projection.
• Rule 4
Distributivity (for restrictions and projections).
Transformation Rules - 2
Rule 5
where p OR (q AND r)
≡ where (p OR q) AND (p OR r)
Rule 6
(A Join B) Join C ≡ A Join (B Join C)
Rule 7
Perform projections as early as possible
Rule 8
Perform restrictions as early as possible
• Rule 5
Distributivity (for logical expressions).
where p OR (q AND r)
≡ where (p OR q) AND (p OR r)
• Rule 6
Associativity of joins: choose an optimal ordering of the joins to keep the
intermediate results low in size.
(A Join B) Join C ≡ A Join (B Join C)
• Rule 7
Perform projections as early as possible.
• Rule 8
Perform restrictions as early as possible.
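Rules 4 and 8 (pushing restrictions before the join) can be checked on toy relations. A sketch with illustrative data:

```python
# Two relations joined on attribute x; y and z carry the restrictions.
a = [{"x": i, "y": i % 3} for i in range(100)]
b = [{"x": i, "z": i % 5} for i in range(100)]

def join(r, s):
    return [{**t, **u} for t in r for u in s if t["x"] == u["x"]]

# Restrict after the join (the un-optimized order) ...
late = [t for t in join(a, b) if t["y"] == 0 and t["z"] == 0]
# ... versus restricting each operand first (Rules 4 and 8).
early = join([t for t in a if t["y"] == 0], [u for u in b if u["z"] == 0])

print(late == early)                  # True: the two plans are equivalent
print(len(join(a, b)), len(early))   # 100 vs 7 – the early plan builds far smaller intermediates
```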
Database Statistics
Database statistics are held in the system catalogue, or data dictionary. Typical statistics
maintained by it include:
For each base table
– cardinality
– number of pages for this table
For each column of each base table
– maximum, minimum, and average value
– actual values and their frequencies
For each index
– whether it is a 'clustering index'
– number of leaf pages
– number of levels
Various decisions which have to be made in the optimization process are based upon the
database statistics stored in the system, often in the form of a system catalogue or a data
dictionary.
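SQLite maintains exactly this kind of catalogue statistic: ANALYZE collects per-table and per-index figures into the sqlite_stat1 table. A sketch with an illustrative lending table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lending (stud_no INTEGER, book_no TEXT)")
con.execute("CREATE INDEX idx_book ON lending(book_no)")
con.executemany("INSERT INTO lending VALUES (?, ?)",
                [(i, "B%d" % (i % 4)) for i in range(1000)])
con.execute("ANALYZE")   # gather statistics into the sqlite_stat1 catalogue table

# Each row records the table, index, and cardinality estimates the optimizer uses.
stats = con.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)
```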
Summary
The following key points should be emphasized for this topic:
Effects: the dramatic effects of optimization are illustrated by the example. Students should
be able to compare different query execution plans by calculating the number of tuple I/Os
for each plan, as shown in the example.
Processes: the main tasks involved in each of the four stages of optimization.
Information needed: the critical role of database statistics in the form of a system catalogue
or data dictionary.
Consider the following 3 tables: Student, Lending & Book. The attributes
highlighted are
the keys for the relevant relations:
Retrieve the names of students who have borrowed the book B1.
The database contains 100 students and 10,000 lendings, of which only
50 are for book B1.
It is possible to hold up to 50 tuples in memory, without having to write
back to disk.
Query Plan A: In this first approach, the operations required by the query are performed in the sequence:
Join-Select-Project
1. Join Student and Lending over the student-number attribute.
- Reading each of the 100 Student tuples and, for each, all 10,000 Lending tuples involves 100 × 10,000 = 1,000,000 tuple reads; the 10,000 joined tuples exceed the 50-tuple memory limit and must be written back to disk.
The number of tuple I/Os in this step is:
1,000,000 + 10,000 = 1,010,000
2. Select the result of Step 1 for just the tuples for book B1.
- This involves reading the 10,000 joined tuples (obtained in step 1) back into memory.
- Then Select produces a relation containing only 50 tuples, which can be kept in memory (see assumption).
The number of tuple I/Os in this step is:
10,000 + 0 = 10,000
3. Project the result of Step 2 over Stud_Name to get result (50 max).
- This results in reading a relation of 50 tuples (obtained in step
2) which is already in memory, and producing a final relation of
no more than 50 tuples, again in memory.
The number of tuple I/O in this step is:
0+0=0
Therefore, the total number of tuple I/Os for query plan A is:
1,010,000 + 10,000 = 1,020,000
Query Plan B: In this approach, the sequence of the operations has been changed to the following:
Select-Join-Project
1. Select the relation Lending for just the tuples for book B1.
- This involves reading all 10,000 Lending tuples and produces a relation of 50 tuples, which can be kept in memory.
The number of tuple I/Os in this step is:
10,000 + 0 = 10,000
2. Join Student with the 50 selected tuples.
- This involves reading the 100 Student tuples; the joined result (at most 50 tuples) stays in memory.
The number of tuple I/Os in this step is:
100 + 0 = 100
3. Project the result of Step 2 over Stud_Name.
- Same as step 3 of Query Plan A.
Therefore, the total number of tuple I/Os for query plan B is 10,000 + 100 = 10,100.
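The arithmetic in the two plans can be checked with a short script. This is a sketch, not part of the notes, recomputing the tuple-I/O totals under the stated assumptions: 100 students, 10,000 lendings, 50 of them for book B1, and room for at most 50 tuples in memory.

```python
STUDENTS, LENDINGS, B1_LENDINGS = 100, 10_000, 50

# Plan A: Join - Select - Project
join_io = STUDENTS * LENDINGS + LENDINGS  # 1,000,000 reads + 10,000 writes back to disk
select_io = LENDINGS                      # re-read joined tuples; 50 results stay in memory
project_io = 0                            # input and output both fit in memory
plan_a = join_io + select_io + project_io

# Plan B: Select - Join - Project
select_b = LENDINGS                       # read Lending once; 50 B1 tuples fit in memory
join_b = STUDENTS                         # read Student once, join in memory
plan_b = select_b + join_b + 0            # projection again costs nothing

print(plan_a)  # 1020000
print(plan_b)  # 10100
```

Plan B is roughly a hundred times cheaper, which is the dramatic effect of optimization the summary refers to.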
Transformation Rules
• Rule 1
Transform a sequence of restrictions against a given relation into a single
restriction.
(A where Restrict-1) where Restrict-2
≡ A where (Restrict-1 AND Restrict-2)
• Rule 2
Transform a restriction of a projection into a projection of a restriction.
• Rule 5
Distributivity (for logical expressions).
where p OR (q AND r)
≡ where (p OR q) AND (p OR r)
• Rule 6
Choose an optimal ordering of the joins to keep the intermediate results
low in size.
• Rule 7
Perform projections as early as possible.
• Rule 8
Perform restrictions as early as possible.
Review Questions
1. Discuss the reasons for converting SQL queries into relational algebra before
optimization is done.
2. Briefly discuss the process of query optimization
3. Why are queries converted into canonical form during optimization?
4. What is meant by the term heuristic optimization? Discuss the main heuristics that are
applied during query optimization
Week Five
Introduction to SQL
What is SQL?
SQL (Structured Query Language) is a standard language for accessing and manipulating relational databases.
To build a web site that shows data from a database, you will need:
An RDBMS database program (e.g. MS Access, SQL Server, MySQL)
To use a server-side scripting language, like PHP or ASP
To use SQL to get the data you want
To use HTML / CSS
Create Table Construct
An SQL relation is defined using the create table command:
create table r (A1 D1, A2 D2, ..., An Dn,
(integrity-constraint1), ..., (integrity-constraintk))
r is the name of the relation
each Ai is an attribute name in the schema of relation r
Di is the data type of values in the domain of attribute Ai
Example:
create table instructor (
ID char(5),
name varchar(20) not null,
dept_name varchar(20),
salary numeric(8,2),
primary key (ID))
Note that a primary key declaration on an attribute automatically ensures not null.
The SQL data-manipulation language (DML) provides the ability to query information, and
insert, delete and update tuples
A typical SQL query has the form:
Select A1, A2, ..., An
From r1, r2, ..., rm
Where P
– Ai represents an attribute
– ri represents a relation
– P is a predicate.
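The create table construct and the basic select-from-where form can be tried out directly. A minimal sketch using Python's sqlite3 module; the instructor schema follows the example in these notes, and the inserted rows are illustrative.

```python
import sqlite3

# In-memory database: nothing touches disk, handy for experiments.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# create table r (A1 D1, ..., integrity constraints)
cur.execute("""
    CREATE TABLE instructor (
        ID        CHAR(5)      PRIMARY KEY,
        name      VARCHAR(20)  NOT NULL,
        dept_name VARCHAR(20),
        salary    NUMERIC(8,2)
    )""")

cur.execute("INSERT INTO instructor VALUES ('10101', 'Srinivasan', 'Comp. Sci.', 65000)")
cur.execute("INSERT INTO instructor VALUES ('12121', 'Wu', 'Finance', 90000)")

# select A1, ..., An from r1, ..., rm where P
cur.execute("SELECT name, salary FROM instructor WHERE salary > 70000")
print(cur.fetchall())   # [('Wu', 90000)]
```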
Views
In some cases, it is not desirable for all users to see the entire logical model (that is, all the
actual relations stored in the database.)
Consider a person who needs to know an instructor's name and department, but not the salary.
This person should see a relation described, in SQL, by:
select ID, name, dept_name from instructor
A view provides a mechanism to hide certain data from the view of certain users.
Any relation that is not of the conceptual model but is made visible to a user as a “virtual
relation” is called a view.
View Definition
A view is defined using the create view statement which has the form
Create view v as < query expression >
Where <query expression> is any legal SQL expression.
The view name is represented by v.
Once a view is defined, the view name can be used to refer to the virtual relation that the view
generates.
View definition is not the same as creating a new relation by evaluating the query expression
Rather, a view definition causes the saving of an expression; the expression is substituted into
queries using the view.
Example Views
A view of instructors without their salary
Create view faculty as
Select ID, name, dept_name
From instructor
Find all instructors in the Biology department
Select name
From faculty
Where dept_name = ‘Biology’
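The faculty view above can be exercised as follows. A sketch using Python's sqlite3 module; the two instructor rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE instructor (ID TEXT, name TEXT, dept_name TEXT, salary REAL)")
cur.execute("INSERT INTO instructor VALUES ('22222', 'Einstein', 'Physics', 95000)")
cur.execute("INSERT INTO instructor VALUES ('76766', 'Crick', 'Biology', 72000)")

# The view definition saves an expression; no new relation is materialized.
cur.execute("CREATE VIEW faculty AS SELECT ID, name, dept_name FROM instructor")

# The view name can now be used as if it were a relation; salary stays hidden.
cur.execute("SELECT name FROM faculty WHERE dept_name = 'Biology'")
print(cur.fetchall())   # [('Crick',)]
```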
SQL Transactions
A transaction is a unit of work. It is atomic: either fully executed, or rolled back as if it never occurred; and it is isolated from concurrent transactions.
Transactions begin implicitly and are ended by commit work or rollback work.
By default on most databases, however, each SQL statement commits automatically. One can turn off auto commit for a session (e.g. using the API).
In SQL:1999 one can use begin atomic ... end, but this is not supported on most databases.
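The commit/rollback behaviour described above can be sketched with Python's sqlite3 module; the account table and its balances are illustrative, not from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (no TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 50)")
conn.commit()

try:
    # One unit of work: both updates succeed, or neither does.
    conn.execute("UPDATE account SET balance = balance - 30 WHERE no = 'A'")
    conn.execute("UPDATE account SET balance = balance + 30 WHERE no = 'B'")
    conn.commit()            # make both changes durable
except sqlite3.Error:
    conn.rollback()          # undo partial effects on any failure

print(conn.execute("SELECT balance FROM account ORDER BY no").fetchall())
# [(70,), (80,)]
```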
Integrity Constraints
Integrity constraints guard against accidental damage to the database, by ensuring that
authorized changes to the database do not result in a loss of data consistency.
Integrity Constraints on a Single Relation are
not null
primary key
unique
check (P), where P is a predicate
Referential Integrity
Ensures that a value that appears in one relation for a given set of attributes also appears for a
certain set of attributes in another relation. Example: If “Biology” is a department name
appearing in one of the tuples in the instructor relation, then there exists a tuple in the
department relation for “Biology”.
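The department/instructor example can be enforced with a foreign key. A sketch using Python's sqlite3 module (where foreign-key checking must be switched on per connection); the rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # sqlite3 enforces FKs only when enabled
conn.execute("CREATE TABLE department (dept_name TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE instructor (
                    ID TEXT PRIMARY KEY,
                    name TEXT NOT NULL,
                    dept_name TEXT REFERENCES department(dept_name))""")

conn.execute("INSERT INTO department VALUES ('Biology')")
conn.execute("INSERT INTO instructor VALUES ('76766', 'Crick', 'Biology')")  # ok

try:
    # 'History' does not appear in department, so the insert is rejected.
    conn.execute("INSERT INTO instructor VALUES ('58583', 'Califieri', 'History')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)   # FOREIGN KEY constraint failed
```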
Built-in Data Types in SQL
date: Dates, containing a (4 digit) year, month and date Example: date ‘2005-7-27’
time: Time of day, in hours, minutes and seconds. Example: time ‘09:00:30’ time ‘09:00:30.75’
timestamp: date plus time of day Example: timestamp ‘2005-7-27 09:00:30.75’
interval: period of time Example: interval ‘1’ day
– Subtracting a date/time/timestamp value from another gives an interval value
– Interval values can be added to date/time/timestamp values
Authorization
Forms of authorization on parts of the database:
Read - allows reading, but not modification of data.
Insert - allows insertion of new data, but not modification of existing data.
Update - allows modification, but not deletion of data.
Delete - allows deletion of data.
Privileges in SQL
select: allows read access to relation, or the ability to query using the view
Example: grant users U1, U2, and U3 select authorization on the instructor relation:
grant select on instructor to U1, U2, U3
insert: the ability to insert tuples
update: the ability to update using the SQL update statement
delete: the ability to delete tuples.
all privileges: used as a short form for all the allowable privileges
Joined Relations
Join operations take two relations and return as a result another relation.
A join operation is a Cartesian product which requires that tuples in the two relations match
(under some condition). It also specifies the attributes that are present in the result of the join
The join operations are typically used as sub query expressions in the from clause
Join operations – Example
Relation course
Relation prerequisites
Observe that
prerequisite information is missing for CS-315 and
course information is missing for CS-437
Outer Join
§ Is an extension of the join operation that avoids loss of information
§ Computes the join and then adds tuples from one relation that do not match tuples in the other relation to the result of the join.
§ Uses null values.
Join condition – defines which tuples in the two relations match, and what attributes are
present in the result of the join.
Join type – defines how tuples in each relation that do not match any tuple in the other relation
(based on the join condition) are treated.
The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL on the right side when there is no match.
LEFT JOIN syntax (LEFT OUTER JOIN is equivalent):
SELECT column_name(s)
FROM table1
LEFT OUTER JOIN table2
ON table1.column_name=table2.column_name;
Demo Database
Below is a selection from the "Orders" table:
OrderID CustomerID EmployeeID OrderDate ShipperID
10308 2 7 1996-09-18 3
10309 37 3 1996-09-19 1
10310 77 8 1996-09-20 2
The following SQL statement will return all customers, and any orders they might have:
Example
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
LEFT JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;
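The left-join behaviour can be reproduced on a toy pair of tables. A sketch using Python's sqlite3 module; the rows are a made-up subset of the demo database, and NULL surfaces in Python as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER, CustomerName TEXT)")
conn.execute("CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER)")
conn.executemany("INSERT INTO Customers VALUES (?, ?)",
                 [(1, 'Alfreds Futterkiste'), (2, 'Ana Trujillo')])
conn.execute("INSERT INTO Orders VALUES (10308, 2)")   # only customer 2 has an order

# Every customer appears; customers with no order get NULL on the right side.
rows = conn.execute("""
    SELECT Customers.CustomerName, Orders.OrderID
    FROM Customers
    LEFT JOIN Orders ON Customers.CustomerID = Orders.CustomerID
    ORDER BY Customers.CustomerName""").fetchall()
print(rows)   # [('Alfreds Futterkiste', None), ('Ana Trujillo', 10308)]
```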
SQL RIGHT JOIN Keyword
The RIGHT JOIN keyword returns all rows from the right table (table2), with the matching rows in the left table (table1). The result is NULL on the left side when there is no match.
RIGHT JOIN syntax (RIGHT OUTER JOIN is equivalent):
SELECT column_name(s)
FROM table1
RIGHT OUTER JOIN table2
ON table1.column_name=table2.column_name;
Demo Database
OrderID CustomerID EmployeeID OrderDate ShipperID
10308 2 7 1996-09-18 3
10309 37 3 1996-09-19 1
10310 77 8 1996-09-20 2
The following SQL statement will return all orders, and any customers that might have placed
them:
Example
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
RIGHT JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;
SQL FULL OUTER JOIN Keyword
The FULL OUTER JOIN keyword returns all rows from the left table (table1) and from the right table (table2).
The FULL OUTER JOIN keyword combines the result of both LEFT and RIGHT joins.
Demo Database
Below is a selection from the "Orders" table:
OrderID CustomerID EmployeeID OrderDate ShipperID
10308 2 7 1996-09-18 3
10309 37 3 1996-09-19 1
10310 77 8 1996-09-20 2
The following SQL statement selects all customers, and all orders:
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
FULL OUTER JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;
A selection from the result-set:
CustomerName | OrderID
Alfreds Futterkiste | (null)
Ana Trujillo Emparedados y helados | 10308
(null) | 10382
(null) | 10351
Note: The FULL OUTER JOIN keyword returns all the rows from the left table (Customers), and
all the rows from the right table (Orders). If there are rows in "Customers" that do not have
matches in "Orders", or if there are rows in "Orders" that do not have matches in "Customers",
those rows will be listed as well.
SQL UNION Operator
The UNION operator is used to combine the result-set of two or more SELECT statements.
Notice that each SELECT statement within the UNION must have the same number of columns.
The columns must also have similar data types. Also, the columns in each SELECT statement
must be in the same order.
Note: The UNION operator selects only distinct values by default. To allow duplicate values, use
the ALL keyword with UNION.
Note: The column names in the result-set of a UNION are usually equal to the column names in
the first SELECT statement in the UNION.
Demo Database
Below is a selection from the "Suppliers" table:
SupplierID SupplierName ContactName Address City PostalCode Country
1 Exotic Liquid Charlotte Cooper 49 Gilbert St. Londona EC1 4SD UK
The following SQL statement selects all the different cities (only distinct values) from the
"Customers" and the "Suppliers" tables:
Example
SELECT City FROM Customers
UNION
SELECT City FROM Suppliers
ORDER BY City;
Note: UNION cannot be used to list ALL cities from the two tables. If several customers and
suppliers share the same city, each city will only be listed once. UNION selects only distinct
values. Use UNION ALL to also select duplicate values!
The following SQL statement uses UNION ALL to select all (duplicate values also) cities from the
"Customers" and "Suppliers" tables:
Example
SELECT City FROM Customers
UNION ALL
SELECT City FROM Suppliers
ORDER BY City;
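The difference between UNION and UNION ALL can be seen on a toy pair of tables. A sketch using Python's sqlite3 module; the city values are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (City TEXT)")
conn.execute("CREATE TABLE Suppliers (City TEXT)")
conn.executemany("INSERT INTO Customers VALUES (?)", [('Berlin',), ('London',)])
conn.executemany("INSERT INTO Suppliers VALUES (?)", [('London',), ('Tokyo',)])

# UNION removes duplicates; 'London' appears once.
distinct = conn.execute(
    "SELECT City FROM Customers UNION SELECT City FROM Suppliers ORDER BY City"
).fetchall()
# UNION ALL keeps duplicates; 'London' appears twice.
everything = conn.execute(
    "SELECT City FROM Customers UNION ALL SELECT City FROM Suppliers ORDER BY City"
).fetchall()

print(distinct)    # [('Berlin',), ('London',), ('Tokyo',)]
print(everything)  # [('Berlin',), ('London',), ('London',), ('Tokyo',)]
```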
The following SQL statement uses UNION ALL to select all (duplicate values also) German cities
from the "Customers" and "Suppliers" tables:
Example
SELECT City, Country FROM Customers
WHERE Country='Germany'
UNION ALL
SELECT City, Country FROM Suppliers
WHERE Country='Germany'
ORDER BY City;
Advanced SQL
JDBC and ODBC
Java Database Connectivity- JDBC
§ JDBC is a Java API for communicating with database systems supporting SQL.
§ JDBC supports a variety of features for querying and updating data, and for retrieving
query results.
§ JDBC also supports metadata retrieval, such as querying about relations present in the
database and the names and types of relation attributes.
§ Model for communicating with the database:
– Open a connection
– Create a “statement” object
– Execute queries using the Statement object to send queries and fetch results
– Exception mechanism to handle errors
JDBC Code
public static void JDBCexample(String dbid, String userid, String passwd)
{
   try {
      Class.forName("oracle.jdbc.driver.OracleDriver");
      Connection conn = DriverManager.getConnection(
         "jdbc:oracle:thin:@db.yale.edu:2000:univdb", userid, passwd);
      Statement stmt = conn.createStatement();
      ... Do actual work ...
      stmt.close();
      conn.close();
   }
   catch (SQLException sqle) {
      System.out.println("SQLException : " + sqle);
   }
}
Update to database
try {
stmt.executeUpdate(
"insert into instructor values(’77987’, ’Kim’, ’Physics’, 98000)");
} catch (SQLException sqle)
{ System.out.println("Could not insert tuple. " + sqle);
}
SQL Injection
§ Suppose query is constructed using "select * from instructor where name = ’" + name +
"’"
§ Suppose the user, instead of entering a name, enters: X’ or ’Y’ = ’Y
§ then the resulting statement becomes:
– "select * from instructor where name = ’" + "X’ or ’Y’ = ’Y" + "’"
– which is: select * from instructor where name = ’X’ or ’Y’ = ’Y’
– User could have even used X’; update instructor set salary = salary + 10000; --
§ A prepared statement internally uses: "select * from instructor where name = ’X\’ or \’Y\’ = \’Y’". Always use prepared statements, with user inputs as parameters.
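The injection above can be reproduced, and defeated, with parameter binding. A sketch using Python's sqlite3 module, where ? plays the role of a prepared-statement parameter; the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instructor (name TEXT, salary INTEGER)")
conn.execute("INSERT INTO instructor VALUES ('Kim', 98000)")

user_input = "X' or 'Y' = 'Y"

# Unsafe: string concatenation lets the input rewrite the predicate.
unsafe = "select * from instructor where name = '" + user_input + "'"
print(conn.execute(unsafe).fetchall())       # every row leaks: [('Kim', 98000)]

# Safe: the driver treats the whole input as a single string value.
safe = conn.execute("select * from instructor where name = ?", (user_input,))
print(safe.fetchall())                       # [] - no instructor has that name
```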
Embedded SQL
§ The SQL standard defines embeddings of SQL in a variety of programming languages
such as C, Java, and Cobol.
§ A language to which SQL queries are embedded is referred to as a host language, and
the SQL structures permitted in the host language comprise embedded SQL.
§ The basic form of these languages follows that of the System R embedding of SQL into
PL/I.
§ EXEC SQL statement is used to identify embedded SQL request to the preprocessor
Example Query
From within a host language, find the ID and name of students who have completed more than
the number of credits stored in variable credit_amount.
EXEC SQL
declare c cursor for
select ID, name
from student
where tot_cred > :credit_amount
END_EXEC
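The same query can be issued from a host language through a cursor-style API. A sketch using Python's sqlite3 module, whose DB-API cursor plays the role of the embedded-SQL cursor; the student rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (ID TEXT, name TEXT, tot_cred INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [('00128', 'Zhang', 102), ('12345', 'Shankar', 32)])

credit_amount = 90          # host-language variable, bound as :credit_amount
cur = conn.cursor()
cur.execute("SELECT ID, name FROM student WHERE tot_cred > :credit_amount",
            {"credit_amount": credit_amount})
for row in cur:             # fetch one tuple at a time, as with FETCH on a cursor
    print(row)              # ('00128', 'Zhang')
```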
Review Questions
1. List the data types that are allowed in SQL
2. How does SQL allow implementation of the entity integrity and referential integrity
constraints?
3. What is a view in SQL and how is it defined?
4. Discuss the problems that may arise when one attempts to update a view
5. Describe the six clauses in the syntax of an SQL query and show what type of constructs can be specified in each of the six clauses. Which of the six clauses are required and which are optional?
6. Describe conceptually how an SQL query will be executed by specifying the conceptual
order of executing each of the six clauses. Create a Student table and a course table
using SQL.
7. Feed some student records and Course records using SQL.
8. Use SQL to retrieve data from both the tables to show the Student details and the
course they take.
Week Six
Transaction management
Objectives
Transaction Concept
Transaction State
Concurrent Executions
Recoverability
Implementation of Isolation
Transaction Definition in SQL
Transaction Concept
A transaction is a unit of program execution that accesses and possibly updates various data
items.
Two main issues to deal with:
i. Failures of various kinds, such as hardware failures and system crashes
ii. Concurrent execution of multiple transactions
Transaction Management
* What if the system fails?
* What if more than one user is concurrently updating the same data?
What is a transaction?
A transaction is a unit of program execution that accesses and possibly updates various data items. To preserve the integrity of data the database system must ensure:
Atomicity. Either all operations of the transaction are properly reflected in the database or
none are.
Consistency. Execution of a transaction in isolation preserves the consistency of the database.
Isolation. Although multiple transactions may execute concurrently, each transaction must be
unaware of other concurrently executing transactions. Intermediate transaction results must be
hidden from other concurrently executed transactions. That is, for every pair of transactions Ti
and Tj, it appears to Ti that either Tj, finished execution before Ti started, or Tj started
execution after Ti finished.
Durability. After a transaction completes successfully, the changes it has made to the database
persist, even if there are system failures
Transactions Management deals with the problem of always keeping the database in a
consistent state even when concurrent accesses and failures occur
The database can (and usually is) temporarily inconsistent during the execution of a
transaction. The important point is that the database should be consistent when the
transaction terminates
Transaction:
• A transaction is a series of actions, carried out by a single user or application
program, which must be treated as a logical unit of work.
• A transaction is a unit of consistent and reliable computation.
• Example the transfer of funds from one account to another: either the entire
operations are carried out or none of the operation is carried out.
• Transactions transform the database from consistent state to another consistent
state.
Transaction State
Active – the initial state; the transaction stays in this state while it is executing
Partially committed – after the final statement has been executed.
Failed -- after the discovery that normal execution can no longer proceed.
Aborted – after the transaction has been rolled back and the database restored to its state prior to the start of the transaction. Two options after it has been aborted: restart the transaction (possible only if there was no internal logical error) or kill the transaction.
Committed – after successful completion.
Transaction State
ACID Properties
A database transaction, by definition, must be atomic, consistent, isolated and durable.
Database practitioners often refer to these properties of database transactions using the
acronym ACID.
Properties of a Transaction: ACID
§ Atomicity
– Transactions are either done or not done
– They are never left partially executed
§ Consistency
– Transactions should leave the database in a consistent state
§ Isolation
– Transactions must behave as if they were executed in isolation
§ Durability
– Effects of completed transactions are resilient against failures
Concurrent Executions
Transactions in SQL
§ Consider the following SQL statement
UPDATE Project
SET Budget=Budget*1.1
WHERE Pname=‘Engineering’
§ This can be translated into a transaction as follows:
Begin_Transaction Budget_Update
Begin
EXEC SQL UPDATE Project
SET Budget=Budget*1.1
WHERE Pname=‘Engineering’
End
Note that
§ (Implicit beginning of transaction)
– SELECT …;
– UPDATE …;
– ……
– ROLLBACK | COMMIT;
§ ROLLBACK (transaction abort)
– Will undo the partial effects of the transaction
– May be initiated by the DBMS
Concurrency Control
DBMS Components
Transactions Manager - It is the responsibility of the Transactions Manager of a DBMS
to ensure transactions do not interfere with one another and corrupt the database.
Scheduler - Responsible for implementing a particular strategy for concurrency control.
Also called the Lock Manager (for those systems that use locking as the means of
managing concurrency)
Recovery Manager- Ensures that the database is restored to the state it was in before
the start of the transaction.
Buffer Manager - Responsible for the transfer of data between disk storage and
memory.
Conflicting Operations
§ Two operations on the same data item conflict if at least one of the operations is a write
– r(X) and w(X) conflict
– w(X) and r(X) conflict
– w(X) and w(X) conflict
– r(X) and r(X) do not
– r/w(X) and r/w(Y) do not
§ Order of conflicting operations matters
– If T1 .r(A) precedes T2 .w(A), then conceptually, T1 should precede T2
Interference Between Concurrent Transactions
There are many ways in which concurrently executing transactions can interfere with one another and so compromise the integrity and consistency of the DB. Three examples of such interference are:
Lost Update Problem
Uncommitted Dependency Problem
Inconsistent Analysis Problem
The Inconsistent Analysis Problem
Locking
• If a transaction wants to read an object, it must first request a shared lock (S mode) on
that object
• If a transaction wants to modify an object, it must first request an exclusive lock (X
mode) on that object
• Allow one exclusive lock, or multiple shared locks
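The shared/exclusive rule above can be sketched as a toy lock table. This is an illustration only, not a real DBMS component; transaction and object names are made up.

```python
class LockManager:
    """Toy S/X lock table: any number of shared locks, or one exclusive lock."""

    def __init__(self):
        self.locks = {}                        # object -> (mode, set of tx ids)

    def acquire(self, tx, obj, mode):          # mode is 'S' (shared) or 'X' (exclusive)
        held = self.locks.get(obj)
        if held is None:
            self.locks[obj] = (mode, {tx})     # object unlocked: grant immediately
            return True
        held_mode, holders = held
        if mode == 'S' and held_mode == 'S':
            holders.add(tx)                    # shared locks are compatible
            return True
        if holders == {tx}:                    # sole holder may reuse or upgrade
            self.locks[obj] = ('X' if mode == 'X' else held_mode, holders)
            return True
        return False                           # conflict: the caller must wait

lm = LockManager()
print(lm.acquire('T1', 'A', 'S'))   # True  - first shared lock on A
print(lm.acquire('T2', 'A', 'S'))   # True  - shared locks coexist
print(lm.acquire('T3', 'A', 'X'))   # False - exclusive conflicts with the readers
```

A real scheduler would block the conflicting transaction in a wait queue rather than simply return False.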
The Lock Manager may schedule the tx as follows:
Two Phase Locking
Strict Two phase Locking
Only release locks at commit/abort time
– A writer will block all other readers until the writer commits or aborts
Used in most commercial DBMS (except Oracle)
Deadlocks
System is deadlocked if there is a set of transactions such that every transaction in the set is
waiting for another transaction in the set.
Deadlock prevention protocols ensure that the system will never enter into a deadlock state.
Some prevention strategies : Require that each transaction locks all its data items before it
begins execution (predeclaration).
Impose partial ordering of all data items and require that a transaction can lock data items only
in the order specified by the partial order (graph-based protocol).
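A common companion to prevention is deadlock detection: searching for a cycle in a wait-for graph, where an edge Ti → Tj means Ti is waiting for a lock held by Tj. A sketch of that check (not from the notes):

```python
def has_deadlock(wait_for):
    """Depth-first cycle search over a wait-for graph given as {tx: [txs it waits for]}."""
    visited, on_path = set(), set()

    def dfs(tx):
        if tx in on_path:
            return True                        # returned to a node on the current path: cycle
        if tx in visited:
            return False                       # already explored, no cycle through here
        visited.add(tx)
        on_path.add(tx)
        if any(dfs(n) for n in wait_for.get(tx, [])):
            return True
        on_path.remove(tx)
        return False

    return any(dfs(tx) for tx in wait_for)

print(has_deadlock({'T1': ['T2'], 'T2': ['T3'], 'T3': []}))     # False - a chain, no cycle
print(has_deadlock({'T1': ['T2'], 'T2': ['T3'], 'T3': ['T1']})) # True  - T1 -> T2 -> T3 -> T1
```

When a cycle is found, the system picks one transaction in it as the victim and rolls it back.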
Starvation happens if the same transaction is always chosen as the victim. Including the number of rollbacks in the cost factor avoids starvation.
Review Questions
1. Discuss the meaning of concurrent execution of database transaction in a multi-user
system.
2. Discuss why concurrency control is necessary in transaction management.
3. State and explain using suitable diagram the states that a transaction goes through
during execution.
4. Discuss the ACID properties of a transaction.
5. Define the following terms as used in transactions
Schedule
Serial schedule
Serializable schedule
Week Seven
Database Security
Database security can be defined as the protection of the database against
• unauthorized access to or modification of the database,
• denial of service to authorized users and
• provision of service to unauthorized users
It also includes the measures necessary to detect, document, and counter threats
Threats
Browsing - accessing info
Misuse – malice, errors of omission etc.
Penetration – Unauthorized access
Systems Flaws – h/w and s/w errors
Component Failure – malfunctioning of h/w, s/w or media
Tampering – attacks to physical and logical components
Eavesdropping – passive surveillance of telecomm channel e.g. tapping, sniffing
Denial of Service – preventing or delaying performance e.g. jamming, traffic flooding
Countermeasures – computer-based controls
authorization
views
backup and recovery
integrity
encryption
Redundant array of independent disks (RAID)
Authorization
The granting of a right or privilege that enables a subject to have legitimate
access to a database system or a database system’s object.
Authentication
A mechanism that determines whether a user is who he or she claims to be.
View
A view is a virtual table that does not necessarily exist in the database but can be
produced upon request by a particular user, at the time of request.
Backup
Process of periodically taking a copy of the database and log file (and possibly programs)
onto offline storage media.
Journaling
Process of keeping and maintaining a log file (or journal) of all changes made to
database to enable recovery to be undertaken effectively in the event of failure.
Integrity
Prevents data from becoming invalid, and hence giving misleading or incorrect
results.
Encryption
Encoding the data by a special algorithm that renders the data unreadable by any
program without the decryption key.
Redundant array of independent disks (RAID)
Hardware that the DBMS runs on must be fault-tolerant, meaning that the DBMS
should continue to operate even if one of the hardware components fails.
Suggests having redundant components that can be seamlessly integrated into the
working system whenever there are failures.
The main hardware components that should be fault-tolerant include disk drives,
disk controllers, CPU, power supplies, and cooling fans.
Disk drives are the most vulnerable components with the shortest times between
failure of any of the hardware components.
One solution is to provide a large disk array comprising an arrangement of several
independent disks that are organized to improve reliability and at the same time
increase performance.
Database recovery
Database Recovery is the process of restoring the database to a correct state in the event of a
failure
Causes of Failures:
Transaction-Local - The failure of an individual transaction can be caused in several ways:
* Transaction – Induced Abort e.g. insufficient funds
* Unforeseen Transaction Failure, arising from bugs in the application
programs
* System-Induced Abort – e.g. when Transaction Manager explicitly aborts
a transaction because it conflicts with another transaction or to break a
deadlock.
§ Site Failures - Occurs due to failure of local CPU and results in a System Crashing.
• Total Failure – all sites in DDS are down
• Partial Failure – some sites are down
§ Media Failures e.g. head crashes
§ Network Failures - a failure may occur in the communications links.
§ Disasters e.g. fire or power failures
§ Carelessness - unintentional destruction of data by users
§ Sabotage - intentional corruption or destruction of data, h/w or s/w facilities
Recovery Procedures
Recovering from any type of system failure requires the following:
• Determining which data structures are intact and which ones need recovery.
• Following the appropriate recovery steps.
• Restarting the database so that it can resume normal operations.
• Ensuring that no work has been lost and no incorrect data has been entered into the database.
Recovery Techniques
§ Restart Procedures – No transactions are accepted until the database has been
repaired. Includes:
• Emergency Restart - follows when a system fails without warning e.g. due to
power failure.
• Cold Restart - system restarted from archive when the log and/or restart file has
been corrupted.
• Warm Restart - follows controlled shutdown of system
§ Backups - particularly useful if there has been extensive damage to database.
§ Mirroring - Two complete copies of the DB are maintained on-line on different stable
storage devices. Used in environments with non-stop, fault-tolerant operations.
§ Undo/Redo - Undoing and redoing a transaction after failure. The Transaction Manager keeps an active-list, an abort-list and a commit-list: transactions that have begun, aborted transactions, and committed transactions respectively.
Review Questions
Week Eight
Concepts
Distributed Database: A logically interrelated collection of shared data (and a description of
this data), physically distributed over a computer network.
Distributed DBMS: Software system that permits the management of the distributed database
and makes the distribution transparent to users.
§ Collection of logically-related shared data.
§ Data split into fragments.
§ Fragments may be replicated.
§ Fragments/replicas allocated to sites.
§ Sites linked by a communications network.
§ Data at each site is under control of a DBMS.
§ DBMSs handle local applications autonomously.
§ Each DBMS participates in at least one global application
Distributed DBMS
Distributed Processing
A centralized database that can be accessed over a computer network.
Parallel DBMS
A DBMS running across multiple processors and disks designed to execute operations in
parallel, whenever possible, to improve performance.
§ Based on premise that single processor systems can no longer meet requirements for
cost-effective scalability, reliability, and performance.
§ Parallel DBMSs link multiple, smaller machines to achieve same throughput as single,
larger machine, with greater scalability and reliability.
§ Main architectures for parallel DBMSs are:
Shared memory,
Shared disk,
Shared nothing.
Parallel DBMS
Advantages of DDBMSs
§ Reflects organizational structure
§ Improved shareability and local autonomy
§ Improved availability
§ Improved reliability
§ Improved performance
§ Economics
§ Modular growth
Disadvantages of DDBMSs
§ Complexity
§ Cost
§ Security
§ Integrity control more difficult
§ Lack of standards
§ Lack of experience
§ Database design more complex
Types of DDBMS
1) Homogeneous DDBMS
2) Heterogeneous DDBMS
Homogeneous DDBMS
§ All sites use same DBMS product.
§ Much easier to design and manage.
§ Approach provides incremental growth and allows increased performance.
Heterogeneous DDBMS
§ Sites may run different DBMS products, with possibly different underlying data models.
§ Occurs when sites have implemented their own databases and integration is considered
later.
§ Translations required to allow for:
o Different hardware.
o Different DBMS products.
o Different hardware and different DBMS products.
§ Typical solution is to use gateways.
Overview of Networking
Network - Interconnected collection of autonomous computers, capable of exchanging
information.
Local Area Network (LAN) intended for connecting computers at same site.
Wide Area Network (WAN) used when computers or LANs need to be connected over
long distances.
WANs are relatively slow and less reliable than LANs. A DDBMS using a LAN provides much faster response time than one using a WAN.
Functions of a DDBMS
§ Expect DDBMS to have at least the functionality of a DBMS.
§ Also to have following functionality:
– Extended communication services.
– Extended Data Dictionary.
– Distributed query processing.
– Extended concurrency control.
– Extended recovery services.
Reference Architecture for DDBMS
§ Due to diversity, no accepted architecture equivalent to ANSI/SPARC 3-level
architecture.
§ A reference architecture consists of:
o Set of global external schemas.
o Global conceptual schema (GCS).
o Fragmentation schema and allocation schema.
o Set of schemas for each local DBMS conforming to 3-level ANSI/SPARC.
§ Some levels may be missing, depending on levels of transparency supported.
Reference Architecture for DDBMS
Fragmentation
Relation may be divided into a number of sub-relations, which are then distributed.
Definition and allocation of fragments carried out strategically to achieve:
– Locality of Reference.
– Improved Reliability and Availability.
– Improved Performance.
– Balanced Storage Capacities and Costs.
– Minimal Communication Costs.
Involves analyzing most important applications, based on quantitative/qualitative
information.
Quantitative information may include:
– frequency with which an application is run;
– site from which an application is run;
– performance criteria for transactions and applications.
Qualitative information may include transactions that are executed by application, type
of access (read or write), and predicates of read operations.
Data Allocation
Each fragment is stored at site with “optimal” distribution.
Four alternative strategies regarding placement of data:
– Centralized,
– Partitioned (or Fragmented),
– Complete Replication,
– Selective Replication.
• Centralized: Consists of single database and DBMS stored at one site with users
distributed across the network.
• Partitioned: Database partitioned into disjoint fragments, each fragment
assigned to one site.
• Complete Replication: Consists of maintaining complete copy of database at
each site.
• Selective Replication: Combination of partitioning, replication, and
centralization.
Replication
Copy of fragment may be maintained at several sites.
Comparison of Strategies for Data Distribution
Why Fragment?
§ Usage
o Applications work with views rather than entire relations.
§ Efficiency
o Data is stored close to where it is most frequently used.
o Data that is not needed by local applications is not stored.
§ Parallelism
o With fragments as unit of distribution, transaction can be divided into several
subqueries that operate on fragments.
§ Security
o Data not required by local applications is not stored and so not available to
unauthorized users.
Disadvantages
o Performance,
o Integrity.
Correctness of Fragmentation
There are three correctness rules:
– Completeness,
– Reconstruction,
– Disjointness.
Completeness
If relation R is decomposed into fragments R1, R2, ... Rn, each data item that can be
found in R must appear in at least one fragment.
Reconstruction
• Must be possible to define a relational operation that will reconstruct R from the
fragments.
• Reconstruction for horizontal fragmentation is the Union operation; for vertical fragmentation it is the Join.
Disjointness
• If data item di appears in fragment Ri, then it should not appear in any other
fragment.
• Exception: vertical fragmentation, where primary key attributes must be repeated to
allow reconstruction.
• For horizontal fragmentation, data item is a tuple.
• For vertical fragmentation, data item is an attribute.
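The three correctness rules can be checked directly for a horizontal fragmentation. A minimal sketch in Python, modelling a relation as a set of tuples; the relation and data values are invented for illustration:

```python
# Sketch: verifying completeness, reconstruction, and disjointness for a
# horizontal fragmentation. A relation is modelled as a set of tuples;
# all names and values here are illustrative.

R = {("PA14", "House"), ("PL94", "Flat"), ("PG4", "Flat")}

# Horizontal fragments produced by selection on 'type'.
R1 = {t for t in R if t[1] == "House"}
R2 = {t for t in R if t[1] == "Flat"}
fragments = [R1, R2]

# Completeness: every tuple of R appears in at least one fragment.
complete = all(any(t in f for f in fragments) for t in R)

# Reconstruction: for horizontal fragmentation, union rebuilds R.
reconstructed = set().union(*fragments) == R

# Disjointness: no tuple appears in more than one fragment.
disjoint = all(sum(t in f for f in fragments) == 1 for t in R)

print(complete, reconstructed, disjoint)  # True True True
```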
Types of Fragmentation
Four types of fragmentation:
– Horizontal,
– Vertical,
– Mixed,
– Derived.
– Other possibility is no fragmentation:
– If relation is small and not updated frequently, may be better not to fragment
relation.
Horizontal and Vertical Fragmentation
Horizontal Fragmentation
§ Consists of a subset of the tuples of a relation.
§ Defined using Selection operation of relational algebra:
σp(R)
§ For example:
P1 = σtype=‘House’(PropertyForRent)
P2 = σtype=‘Flat’(PropertyForRent)
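The two selections above can be mirrored in SQL. A minimal sketch using Python's sqlite3; the table and column names follow the PropertyForRent example, but the sample rows are invented:

```python
import sqlite3

# Sketch: horizontal fragmentation of PropertyForRent by selection on
# 'type', mirroring P1 = σ(type='House') and P2 = σ(type='Flat').
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PropertyForRent (propertyNo TEXT, type TEXT)")
con.executemany(
    "INSERT INTO PropertyForRent VALUES (?, ?)",
    [("PA14", "House"), ("PL94", "Flat"), ("PG4", "Flat")],
)

# Each fragment is the result of a Selection on the global relation.
p1 = con.execute(
    "SELECT * FROM PropertyForRent WHERE type = 'House'").fetchall()
p2 = con.execute(
    "SELECT * FROM PropertyForRent WHERE type = 'Flat'").fetchall()

print(len(p1), len(p2))  # 1 2
```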
Mixed Fragmentation
§ Consists of a horizontal fragment that is vertically fragmented, or a vertical fragment
that is horizontally fragmented.
§ Defined using Selection and Projection operations of relational algebra:
o σp(Πa1, ... ,an(R)) or
o Πa1, ... ,an(σp(R))
Example - Mixed Fragmentation
S1 = ΠstaffNo, position, sex, DOB, salary(Staff)
S2 = ΠstaffNo, fName, lName, branchNo(Staff)
S21 = σbranchNo=‘B003’(S2)
S22 = σbranchNo=‘B005’(S2)
S23 = σbranchNo=‘B007’(S2)
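A reduced version of the Staff example can be run with sqlite3: two vertical fragments (projections that both keep the primary key), S2 fragmented horizontally by branch, and reconstruction of the vertical split by a join. Sample rows are invented, and a subset of the S1 columns is used for brevity:

```python
import sqlite3

# Sketch: mixed fragmentation of Staff — vertical fragments S1 and S2
# (the primary key staffNo is repeated in both so the relation can be
# reconstructed), with S2 further fragmented horizontally by branchNo.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, fName TEXT, "
            "lName TEXT, position TEXT, salary INT, branchNo TEXT)")
con.executemany("INSERT INTO Staff VALUES (?, ?, ?, ?, ?, ?)",
                [("SG37", "Ann", "Beech", "Assistant", 12000, "B003"),
                 ("SL21", "John", "White", "Manager", 30000, "B005")])

# Vertical fragments as projections sharing the primary key.
con.execute("CREATE VIEW S1 AS SELECT staffNo, position, salary FROM Staff")
con.execute("CREATE VIEW S2 AS "
            "SELECT staffNo, fName, lName, branchNo FROM Staff")

# Horizontal fragment of S2 for branch B003.
s21 = con.execute("SELECT * FROM S2 WHERE branchNo = 'B003'").fetchall()

# Reconstruction of the vertical split: join S1 and S2 on staffNo.
rebuilt = con.execute(
    "SELECT COUNT(*) FROM S1 JOIN S2 USING (staffNo)").fetchone()[0]
print(len(s21), rebuilt)  # 1 2
```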
Derived Horizontal Fragmentation
§ A horizontal fragment that is based on horizontal fragmentation of a parent relation.
§ Ensures that fragments that are frequently joined together are at same site.
§ Defined using Semijoin operation of relational algebra:
Ri = R ⋉F Si, 1 ≤ i ≤ w
Example - Derived Horizontal Fragmentation
S3 = σbranchNo=‘B003’(Staff)
S4 = σbranchNo=‘B005’(Staff)
S5 = σbranchNo=‘B007’(Staff)
Could use derived fragmentation for PropertyForRent:
Pi = PropertyForRent ⋉branchNo Si, 3 ≤ i ≤ 5
§ If relation contains more than one foreign key, need to select one as parent.
§ Choice can be based on fragmentation used most frequently or fragmentation with
better join characteristics.
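SQL has no semijoin operator, but EXISTS expresses it. A sketch of the derived fragment Pi = PropertyForRent ⋉branchNo Si with sqlite3; sample rows are invented:

```python
import sqlite3

# Sketch: derived horizontal fragmentation — the PropertyForRent tuples
# whose branchNo appears in a given Staff fragment, so properties are
# co-located with the staff of the same branch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Staff (staffNo TEXT, branchNo TEXT)")
con.execute("CREATE TABLE PropertyForRent (propertyNo TEXT, branchNo TEXT)")
con.executemany("INSERT INTO Staff VALUES (?, ?)",
                [("SG37", "B003"), ("SL21", "B005")])
con.executemany("INSERT INTO PropertyForRent VALUES (?, ?)",
                [("PG4", "B003"), ("PG36", "B003"), ("PL94", "B005")])

# P3 = PropertyForRent semijoined with S3 = σ(branchNo='B003')(Staff).
p3 = con.execute("""
    SELECT p.* FROM PropertyForRent p
    WHERE EXISTS (SELECT 1 FROM Staff s
                  WHERE s.branchNo = p.branchNo AND s.branchNo = 'B003')
""").fetchall()
print(p3)  # the two B003 properties
```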
Distributed Database Design Methodology
1. Use normal methodology to produce a design for the global relations.
2. Examine topology of system to determine where databases will be located.
3. Analyze most important transactions and identify appropriateness of horizontal/vertical
fragmentation.
4. Decide which relations are not to be fragmented.
5. Examine relations on the one (1) side of 1:* relationships and determine a suitable
fragmentation schema. Relations on the many (*) side may be candidates for derived fragmentation.
Transparencies in a DDBMS
§ Distribution Transparency
– Fragmentation Transparency
– Location Transparency
– Replication Transparency
– Local Mapping Transparency
– Naming Transparency
§ Transaction Transparency
– Concurrency Transparency
– Failure Transparency
§ Performance Transparency
– DBMS Transparency
Distribution transparency allows user to perceive database as single, logical entity.
If DDBMS exhibits distribution transparency, user does not need to know:
– that the data is fragmented (fragmentation transparency);
– the location of data items (location transparency).
If the user has to refer to data items by fragment name and location, only local mapping transparency is provided.
With replication transparency, user is unaware of replication of fragments.
Naming Transparency
§ Each item in a DDB must have a unique name.
§ DDBMS must ensure that no two sites create a database object with same name.
§ One solution is to create central name server. However, this results in:
– loss of some local autonomy;
– central site may become a bottleneck;
– low availability; if the central site fails, remaining sites cannot create any new
objects.
§ Alternative solution - prefix object with identifier of site that created it.
§ For example, Branch created at site S1 might be named S1.BRANCH.
§ Also need to identify each fragment and its copies.
§ Thus, copy 2 of fragment 3 of Branch created at site S1 might be referred to as
S1.BRANCH.F3.C2.
§ However, this results in loss of distribution transparency
§ An approach that resolves these problems uses aliases for each database object.
§ Thus, S1.BRANCH.F3.C2 might be known as Local Branch by user at site S1.
§ DDBMS has task of mapping an alias to appropriate database object.
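The alias-mapping task reduces to a lookup keyed by site and alias. A minimal sketch; the `alias_map` table and `resolve` function are invented for illustration:

```python
# Sketch: a DDBMS mapping user-level aliases to fully qualified object
# names (site.relation.fragment.copy). Names here are illustrative.
alias_map = {
    ("S1", "Local Branch"): "S1.BRANCH.F3.C2",  # alias known at site S1
}

def resolve(site, name):
    """Return the fully qualified name for an alias, or the name
    unchanged if no alias is registered at that site."""
    return alias_map.get((site, name), name)

print(resolve("S1", "Local Branch"))  # S1.BRANCH.F3.C2
```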
Transaction Transparency
§ Ensures that all distributed transactions maintain distributed database’s integrity and
consistency.
§ Distributed transaction accesses data stored at more than one location.
§ Each transaction is divided into a number of sub-transactions, one for each site that has
to be accessed.
§ DDBMS must ensure the indivisibility of both the global transaction and each of the sub-
transactions.
Example - Distributed Transaction
T prints out names of all staff, using the schema defined above as S1, S2, S21, S22, and S23. Define
three sub-transactions TS3, TS5, and TS7 to represent the agents at sites 3, 5, and 7.
Concurrency Transparency
§ All transactions must execute independently and be logically consistent with results
obtained if transactions executed one at a time, in some arbitrary serial order.
§ Same fundamental principles as for centralized DBMS.
§ DDBMS must ensure both global and local transactions do not interfere with each other.
§ Similarly, DDBMS must ensure consistency of all subtransactions of global transaction.
Classification of Transactions
In IBM’s Distributed Relational Database Architecture (DRDA), four types of
transactions:
– Remote request
– Remote unit of work
– Distributed unit of work
– Distributed request.
Failure Transparency
§ DDBMS must ensure atomicity and durability of global transaction.
§ Means ensuring that subtransactions of global transaction either all commit or all abort.
§ Thus, DDBMS must synchronize global transaction to ensure that all subtransactions
have completed successfully before recording a final COMMIT for global transaction.
§ Must do this in presence of site and network failures.
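This synchronization is usually implemented with two-phase commit (2PC). A minimal coordinator sketch; the protocol shape is standard, but the Participant class and vote values are invented, and a real implementation would also log every step for recovery:

```python
# Sketch: two-phase commit — subtransactions either all commit or all
# abort. Phase 1 collects votes; phase 2 enforces the global decision.

class Participant:
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):            # phase 1: vote YES or NO
        return "YES" if self.can_commit else "NO"

    def commit(self):             # phase 2a
        self.state = "committed"

    def abort(self):              # phase 2b
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: coordinator collects votes from every site.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all voted YES; otherwise abort everywhere.
    if all(v == "YES" for v in votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

sites = [Participant("site3", True), Participant("site5", True)]
print(two_phase_commit(sites))          # committed
sites2 = [Participant("site3", True), Participant("site7", False)]
print(two_phase_commit(sites2))         # aborted — one NO vote aborts all
```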
Performance Transparency
§ DDBMS must perform as if it were a centralized DBMS.
– DDBMS should not suffer any performance degradation due to distributed
architecture.
– DDBMS should determine most cost-effective strategy to execute a request.
§ Distributed Query Processor (DQP) maps data request into ordered sequence of
operations on local databases.
§ Must consider fragmentation, replication, and allocation schemas.
§ DQP has to decide:
– which fragment to access;
– which copy of a fragment to use;
– which location to use.
§ DQP produces execution strategy optimized with respect to some cost function.
§ Typically, costs associated with a distributed request include:
– I/O cost;
– CPU cost;
– Communication cost.
Example - assumptions for comparing the costs of alternative execution strategies:
– Each tuple in each relation is 100 characters long.
– 10 renters with maximum price greater than £200,000.
– 100,000 viewings for properties in Aberdeen.
– Computation time negligible compared to communication time.
Date’s 12 Rules for a DDBMS
Fundamental Principle: To the user, a distributed system should look exactly like a non-
distributed system.
1. Local Autonomy
2. No Reliance on a Central Site
3. Continuous Operation
4. Location Independence
5. Fragmentation Independence
6. Replication Independence
7. Distributed Query Processing
8. Distributed Transaction Processing
9. Hardware Independence
10. Operating System Independence
11. Network Independence
12. Database Independence
Last four rules are ideals.
Review Questions
1. What are the reasons for, and advantages of, distributed databases?
2. What is fragmentation? What are the main types of fragments?
3. Why is the concept of fragmentation useful in distributed database design?
4. Outline the major advantages and disadvantages of:
- Fragmentation
- Replication
5. Why is data replication useful in DDBMSs? What typical units of data are replicated?
6. How is a horizontal partitioning of a relation specified? How can a relation be put back together
from a complete horizontal partitioning?
7. How is a vertical partitioning of a relation specified? How can a relation be put back together
from a complete vertical partitioning?
8. Discuss the main problems of distributed databases.
9. List and explain any five advantages of a distributed database management system.
10. With the help of illustrative diagrams, discuss the three main architectures of parallel DBMS.
Week Nine: Object-Oriented Databases
2. Semantic data models
– Aggregation
– Generalization
3. Object-Oriented Programming
– Complex objects
– Object Identity
– Classes and Methods
– Encapsulation
– Inheritance
– Extensibility
4. Object-Oriented Data Model
Sets are critical because they are a natural way of representing collections from the real
world
Tuples are critical because they are a natural way of representing properties of an entity
Lists or arrays are important because they capture order, which occurs in the real world
The object constructors must be orthogonal: any constructor should apply to any object
Operations on a complex object must propagate transitively to all its components
Support for Types or Classes (4)
There are two main categories of object-oriented systems, those supporting the notion
of class and those supporting the notion of type
A type, in an object-oriented system, summarizes the common features of a set of
objects with the same characteristics. It has two parts: the interface and the
implementation (or implementations).
– The interface consists of a list of operations together with their signatures (i.e.,
the type of the input parameters and the type of the result)
– The type implementation consists of a data part and an operation part
The notion of class is different from that of type. Its specification is the same as that of a
type, but it is more of a run-time notion. It contains two aspects: an object factory and an
object warehouse.
– The object factory can be used to create new objects, by performing the
operation new on the class, or by cloning some prototype object representative
of the class
– The object warehouse means that attached to the class is its extension, i.e., the
set of objects that are instances of the class
Class or Type Hierarchies (5)
Inheritance has two advantages:
– it is a powerful modeling tool, because it gives a concise and precise description
of the world and
– it helps in factoring out shared specifications and implementations in
applications
Example
– Employee(name, age, salary) - can die, get married, be paid
– Student(name, age, set-of-grades) - can die, get married, have GPA computed
– Person(name, age) - can die, get married
With Person as the common superclass, the subclasses add only their own features:
– Employee(+salary) - pay
– Student(+set-of-grades) - GPA computation
There are at least four types of inheritance:
– substitution inheritance
– inclusion inheritance
– constraint inheritance
– specialization inheritance
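The Person/Employee/Student example above can be sketched directly: shared attributes and behaviour are factored into Person, and each subclass adds only its own state and operations. Method bodies and data values are illustrative:

```python
# Sketch: inheritance factors shared specification into a superclass.

class Person:
    def __init__(self, name, age):
        self.name, self.age = name, age

    def marry(self):
        return f"{self.name} got married"

class Employee(Person):              # Person + salary; can be paid
    def __init__(self, name, age, salary):
        super().__init__(name, age)
        self.salary = salary

    def pay(self):
        return self.salary

class Student(Person):               # Person + grades; has a GPA
    def __init__(self, name, age, grades):
        super().__init__(name, age)
        self.grades = grades

    def gpa(self):
        return sum(self.grades) / len(self.grades)

e = Employee("Ann", 30, 12000)
s = Student("Bob", 21, [3.0, 4.0])
print(e.pay(), s.gpa(), s.marry())   # 12000 3.5 Bob got married
```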
Overriding, Overloading and Late Binding (6)
Example: the display operation. In a conventional system, the application must dispatch on type by hand:
for x in X do
begin
case of type(x)
person: display(x);
bitmap: display-bitmap(x);
graph: display-graph(x);
end
end
In an object-oriented system, the whole loop is simply: for x in X do display(x)
In an object-oriented system, we define the display operation at the object type level
(the most general type in the system). Thus, display has a single name and can be used
indifferently on graphs, persons and pictures. However, we redefine the
implementation of the operation for each of the types according to the type (this
redefinition is called overriding). This results in a single name (display) denoting three
different programs (this is called overloading). To display the set of elements, we simply
apply the display operations to each one of them, and let the system pick the
appropriate implementation at run-time.
In order to provide this new functionality, the system cannot bind operation names to
programs at compile time. Therefore, operation names must be resolved (translated
into program addresses) at run-time. This delayed translation is called late binding.
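Overriding and late binding as described above can be sketched in Python, where method dispatch is resolved at run time. Class names follow the text; the drawing itself is stubbed out with strings:

```python
# Sketch: one operation name (display), overridden per type, with the
# implementation selected at run time (late binding).

class DatabaseObject:
    def display(self):
        return "generic object"

class Person(DatabaseObject):
    def display(self):               # overriding: same name, new body
        return "person"

class Bitmap(DatabaseObject):
    def display(self):
        return "bitmap"

class Graph(DatabaseObject):
    def display(self):
        return "graph"

# No case analysis needed: each call binds to the right implementation.
X = [Person(), Bitmap(), Graph()]
shown = [x.display() for x in X]
print(shown)  # ['person', 'bitmap', 'graph']
```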
Computational Completeness (7)
From a programming language point of view the computational completeness means
that one can express any computable function, using the DML of the database system
From a database point of view this is a novelty, since SQL, for instance, is not complete.
Computational completeness can be introduced through a reasonable connection to
existing programming languages
Note that this is different from being "resource complete", i.e., being able to access all
resources of the system (e.g. screen and remote communication) from within the
language
Database query languages usually impose severe restrictions on the kind of
computations that can be performed. As a result application programs must be
implemented in general-purpose languages while access to data is realized via
declarative data sublanguages, like SQL. As a consequence data has to be passed
between these two languages. Since both languages are usually semantically as well as
structurally different such transformations may lead to a loss of information. This
problem is known as impedance mismatch.
Extensibility (8)
The database system comes with a set of predefined types. These types can be used at
will by programmers to write their applications. This set of types must be extensible in
the following sense:
– there is a means to define new types and
– there is no distinction in usage between system defined and user defined types
Relational database systems rely on a simple data model. It offers a fixed number of
predefined atomic data types and very few type constructors.
In order to permit the integration of arbitrary data types into a data model, it must be
possible to define new basic data types, to use previously defined data types as building
blocks for other data types, or to nest data structures to arbitrary levels
Since object-oriented database systems are meant to offer an interface which integrates
a programming language and data management facilities they must allow the
programmer to define and execute any kind of operations
Persistence (9)
Persistence is the ability of the programmer to have her/his data survive the execution
of a process, in order to eventually reuse it in another process
Persistence should be orthogonal, i.e., each object, independent of its type, is allowed
to become persistent as such
Persistence should also be implicit: the user should not have to explicitly move or copy
data to make it persistent
Persistence can be achieved in several ways:
A first solution is to define the persistence property in the methods by using operations
which make an object persistent
– Another straightforward solution is to make everything persistent
– Different approach is to let the system designer specify which classes are to be
persistent or to store only data which is explicitly identified by a unique name
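The last approach, storing data under an explicitly identified unique name, can be sketched with Python's pickle; the in-memory `store` dict stands in for a persistent object store, and in a real OODBMS persistence would be implicit rather than called by the programmer:

```python
import pickle

# Sketch: explicit persistence by name — an object is serialized under
# a unique name and later reloaded, surviving the original process.

store = {}                     # stands in for the persistent object store

def persist(name, obj):
    store[name] = pickle.dumps(obj)

def load(name):
    return pickle.loads(store[name])

persist("my_list", [1, 2, 3])
restored = load("my_list")
print(restored)  # [1, 2, 3]
```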
Secondary Storage Management (10)
Secondary storage management is a classical feature of database management systems.
It is usually supported through a set of mechanisms. These include index management,
data clustering, data buffering, access path selection and query optimization.
None of these is visible to the user: they are simply performance features
The application programmer should not have to write code to maintain indices, to
allocate disk storage, or to move data between disk and main memory
Since objects in object-oriented database systems may be rather complex as well as
extremely large, it seems necessary to support access strategies within objects (e.g.
indexes on attribute types which consist of collections of values or instances of other
types)
Support for Concurrent Users (11)
With respect to the management of multiple users concurrently interacting with the
system, the system should offer the same level of service as current database systems
provide
It should therefore ensure harmonious coexistence among users working simultaneously
on the database
The system should therefore support the standard notion of atomicity of a sequence of
operations and of controlled sharing
Serializability of operations should at least be offered, although less strict alternatives
may be offered
The existence of a class hierarchy and the fact that complex objects may contain other
objects need to be considered especially by the concurrency-control mechanism for the
following reasons:
– If the scope of a query is not a single class but the complete subclass hierarchy
the set of objects of the subclass hierarchy must not be modified by a conflicting
transaction
– If a complete composite object O is accessed by a transaction it has to be
ensured that shared subobjects of O cannot be accessed in an incompatible way
by concurrent transactions
– Since a subclass S inherits structure and behavior from its superclasses it must be
forbidden that a concurrent transaction modifies the definition of any of the
superclasses of S while a transaction T accesses instances of S (otherwise
inherited attributes or methods may be deleted from S or new attributes may be
added to S while instances of S are accessed by T)
Recovery (12)
The object-oriented database system should provide the same level of service as current
database systems
Therefore, in case of hardware or software failures, the system should recover, i.e.,
bring itself back to some coherent state of the data
Hardware failures include both processor and disk failures
Support for long transactions needs a finer unit of recovery since it is unreasonable for a
user to lose all his work (of days or weeks) in case of, e.g. a system malfunction
Ad Hoc Query Facility (13)
A query language should be:
– application-independent
– high-level (descriptive)
– optimizable
– complete
– generic
– powerful
– adequate
– extensible
– efficient
In ONTOS, persistent application classes inherit methods from an ONTOS-supplied persistent object
class.
GemStone was designed based on a derivative of Smalltalk called OPAL. However, it has
subsequently been integrated with C++ as well.
ORION was designed around extensions of Common LISP, as is its follow-on product,
ITASCA. LISP methods are stored in the database, and are invoked by the system.
SQL3
Most relevant to extended relational approach
ANSI and ISO draft standard
“Work on SQL3 is well underway, but the final standard is several years away”
Most important new features:
– Abstract data types with attributes and routines
– Inheritance of ADTs and tables
– Collections, parameterized types
– Procedures, functions: computational completeness
Strategies for Developing an OODBMS
Extend an Existing OOP Language with Database Capabilities: Add traditional
database capabilities to an existing OOP language e.g. Smalltalk, C++ or Java.
Used in GemStone.
Provide Extensible OODBMS Libraries: Similar to the above. However, rather
than extending the language, class libraries are provided that support
persistence, aggregation, data types, transactions, concurrency, etc. Used in
Ontos, Versant, ObjectStore.
Embed OO Database Language Constructs in a Conventional Host Language:
used in O2, which supports embedded extensions for C.
Extend an Existing Database Language with OO Capabilities: e.g. extending SQL
to provide OO constructs (SQL3).
Develop a New Database Data Model/Data Language: develop an entirely new
database language and DBMS with OO capabilities.
The Object Database Management Group (ODMG) is a consortium of object-oriented
database management system (ODBMS) vendors and interested parties working on
standards to allow portability of customer software across ODBMS products.
Current voting members:
GemStone Systems
IBEX Computing (ITASCA)
O2 Technology
Object Design (ObjectStore)
Objectivity
POET Software
UniSQL
Versant Object Technology
Review Questions
Week Ten: Web Technology and DBMSs
Internet
Worldwide collection of interconnected networks.
§ Began in the late 1960s with ARPANET, a US DoD project investigating how to build
networks that could withstand partial outages.
§ Starting with a few nodes, Internet estimated to have over 945 million users by end
of 2004.
§ 2 billion users projected by 2010.
§ About 3.5 billion documents on Internet (550 billion if intranets/extranets included).
Intranet and Extranet
§ Intranet - Web site or group of sites belonging to an organization, accessible only by
members of that organization.
§ Extranet - An intranet that is partially accessible to authorized outsiders.
Whereas intranet resides behind firewall and is accessible only to people who are
members of same organization, extranet provides various levels of accessibility to
outsiders.
eCommerce and eBusiness
§ eCommerce - Customers can place and pay for orders via the business’s Web site.
§ eBusiness - Complete integration of Internet technology into economic
infrastructure of the business.
§ Business-to-business transactions may reach $2.1 trillion in Europe and $7 trillion in
US by 2006.
§ eCommerce may account for $12.8 trillion in worldwide corporate revenue by 2006
and could represent 18% of sales in the global economy.
The Web
Hypermedia-based system that provides a simple ‘point and click’ means of browsing
information on the Internet using hyperlinks.
§ Information presented on Web pages, which can contain text, graphics, pictures,
sound, and video.
§ Can also contain hyperlinks to other Web pages, which allow users to navigate in a
non-sequential way through information.
§ Web documents written using HTML.
§ Web consists of network of computers that can act in two roles:
– as servers, providing information;
– as clients (browsers), requesting information.
§ Protocol that governs exchange of information between Web server and browser is
HTTP and locations within documents identified as a URL.
§ Much of Web’s success is due to its simplicity and platform-independence.
HyperText Transfer Protocol (HTTP)
Protocol used to transfer Web pages through Internet.
– Based on request-response paradigm:
Connection - Client establishes connection with Web server.
Request - Client sends request to Web server.
Response - Web server sends response (HTML document)
to client.
Close - Connection closed by Web server.
HTTP/1.0 is stateless protocol - each connection is closed once server provides response.
This makes it difficult to support concept of a session that is essential to basic DBMS
transactions.
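The request-response exchange and its statelessness can be sketched without a network by building and parsing the messages by hand. The messages are canned examples showing the HTTP/1.0 wire format:

```python
# Sketch of one stateless HTTP/1.0 exchange: request line and headers,
# then a response whose connection is closed as soon as it is sent.

request = (
    "GET /Markup/MarkUp.html HTTP/1.0\r\n"   # request line
    "Host: www.w3.org\r\n"
    "\r\n"                                   # blank line ends the headers
)
print(request.split("\r\n")[0])   # GET /Markup/MarkUp.html HTTP/1.0

# A canned server response: status line, headers, blank line, body.
response = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html>...</html>"
)

head, body = response.split("\r\n\r\n", 1)
status_line = head.split("\r\n")[0]
version, status, reason = status_line.split(" ", 2)
print(status, reason)   # 200 OK
# The connection is now closed: the next request starts from scratch,
# which is why sessions are hard to support on top of HTTP/1.0.
```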
Uniform Resource Locators (URLs)
String of alphanumeric characters that represents location or address of a resource on Internet
and how that resource should be accessed.
– Defines uniquely where documents (resources) can be found.
– Uniform Resource Identifiers (URIs) - generic set of all Internet resource
names/addresses.
– Uniform Resource Names (URNs) - persistent, location-independent name. Relies on
name lookup services.
– URL consists of three basic parts:
u protocol used for the connection,
u host name,
u path name on host where resource stored.
– Can optionally specify:
u port through which connection to host should be made,
u query string.
http://www.w3.org/Markup/MarkUp.html
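The parts named above can be pulled out of the example URL with Python's standard urllib.parse; the second URL, with an explicit port and query string, is invented for illustration:

```python
from urllib.parse import urlparse

# Sketch: splitting a URL into protocol, host, and path, plus the
# optional port and query string.
u = urlparse("http://www.w3.org/Markup/MarkUp.html")
print(u.scheme, u.netloc, u.path)   # http www.w3.org /Markup/MarkUp.html

# Optional parts: an explicit port and a query string.
v = urlparse("http://www.w3.org:8080/search?q=sql")
print(v.port, v.query)              # 8080 q=sql
```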
– Web pages may need to be customized by and for each user.
– Requires hypertext to be generated by servers.
– Need scripts that perform conversions from different data formats into HTML ‘on-the-
fly’.
Web Services
Collection of functions packaged as single entity and published to network for use by other
programs.
– Web services are important paradigm in building applications and business processes
for the integration of heterogeneous applications.
– Based on open standards and focus on communication and collaboration among people
and applications.
– Unlike other Web-based applications, Web services have no user interface and are not
targeted for browsers. Instead, consist of reusable software components designed to be
consumed by other applications.
Web Services – Technologies & Standards
– eXtensible Markup Language (XML).
– SOAP (Simple Object Access Protocol) protocol, based on XML, used for communication
over Internet.
– WSDL (Web Services Description Language) protocol, again based on XML, used to
describe the Web service.
– UDDI (Universal Discovery, Description and Integration) protocol used to register the
Web service for prospective users.
– Common example is stock quote facility, which receives a request for current price of a
specified stock and responds with requested price.
– Second example is Microsoft MapPoint Web service that allows high quality maps,
driving directions, and other location information to be integrated into a user
application, business process, or Web site.
different Web servers;
Microsoft's (Distributed) Common Object Model (DCOM/COM);
CORBA/IIOP (Internet Inter-ORB protocol);
Java/Remote Method Invocation (RMI);
XML;
Web services (SOAP, WSDL, UDDI).
– Cost-effective solution that allows for scalability, growth, and changes in strategic
directions, and helps reduce applications development costs.
– Support for transactions that span multiple HTTP requests.
– Support for session- and application-based authentication.
– Acceptable performance.
– Minimal administration overhead.
– Set of high-level productivity tools to allow applications to be developed, maintained,
and deployed with relative ease and speed.
Approaches to Integrating Web and DBMSs
Scripting Languages.
Common Gateway Interface (CGI).
HTTP Cookies.
Extending the Web Server.
Java, J2EE, JDBC, SQLJ, JDO, Servlets, and JSP.
Microsoft Web Solution Platform: .NET, ASP, and ADO.
Oracle Internet Platform.
Scripting Languages (JavaScript and VBScript)
– Scripting languages can be used to extend browser and Web server with database
functionality.
– As script code is embedded in HTML, it is downloaded every time page is accessed.
– Updating browser is simply a matter of changing Web document on server.
– Some popular scripting languages are: JavaScript, VBScript, Perl, and PHP.
– They are interpreted languages, not compiled, making it easy to create small
applications.
Common Gateway Interface (CGI)
Specification for transferring information between a Web server and a CGI program.
– Server only intelligent enough to send documents and to tell browser what kind of
document it is.
– But server also knows how to launch other programs.
– When server sees that URL points to a program (script), it executes script and sends
back script’s output to browser as if it were a file.
CGI – Environment
– A CGI script is any script designed to accept and return data that conforms to the CGI
specification.
– Before the server launches the script, it prepares a number of environment variables
representing the current state of the server, who is requesting the information, and so
on.
– Script picks this up and reads STDIN.
– Then performs necessary processing and writes its output to STDOUT.
– Script responsible for sending MIME header, which allows browser to differentiate
between components.
– CGI scripts can be written in almost any language, provided it supports reading and
writing of an operating system’s environment variables.
Four primary methods for passing information from browser to a CGI script:
o Passing parameters on the command line.
o Passing environment variables to CGI programs.
o Passing data to CGI programs via standard input.
o Using extra path information.
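The environment-variable method can be sketched as a minimal CGI-style handler: read the query string from the environment, write the MIME header, a blank line, then the document. The `handle_request` function and the `name` parameter are invented for illustration:

```python
from urllib.parse import parse_qs

# Sketch of a minimal CGI script body: request data arrives via
# environment variables; output is MIME header + blank line + document.

def handle_request(environ):
    params = parse_qs(environ.get("QUERY_STRING", ""))
    who = params.get("name", ["world"])[0]
    # MIME header first, then a blank line, then the body.
    return "Content-Type: text/html\r\n\r\n" + f"<p>Hello, {who}</p>"

# A real Web server would set QUERY_STRING before launching the script.
output = handle_request({"QUERY_STRING": "name=student"})
print(output)
```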
CGI – Advantages
CGI is the de facto standard for interfacing Web servers with external applications.
Possibly most commonly used method for interfacing Web applications to data sources.
Advantages:
– simplicity,
– language independence,
– Web server independence,
– wide acceptance.
CGI – Disadvantages
– Communication between client and database server must always go through
Web server.
– Lack of efficiency and transaction support, and difficulty validating user input
inherited from statelessness of HTTP protocol.
– HTTP never intended for long exchanges or interactivity.
– Server has to generate a new process or thread for each CGI script.
– Security.
HTTP Cookies
Cookies can make CGI scripts more interactive.
Cookies are small text files stored on Web client.
CGI script creates cookie and has Web server send it to client’s browser to store
on hard disk.
Later, when client revisits Web site and uses a CGI script that requests this
cookie, client’s browser sends information stored in the cookie.
Cookies can be used to store registration information or preferences (e.g. for
virtual shopping cart).
However, not all browsers support cookies.
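The round trip described above can be sketched with Python's standard http.cookies; the cookie name and value are invented for illustration:

```python
from http.cookies import SimpleCookie

# Sketch: a script creates a cookie, which the server sends to the
# browser as a Set-Cookie header; on a later visit the browser returns
# the stored value and the script reads it back.

cookie = SimpleCookie()
cookie["basket"] = "PG4"                  # e.g. a virtual shopping cart
print(cookie.output())                    # Set-Cookie: basket=PG4

# Later request: the browser sends the value back; the script parses it.
returned = SimpleCookie()
returned.load("basket=PG4")
print(returned["basket"].value)           # PG4
```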
Since almost every Web browser vendor has already licensed Java and implemented
an embedded JVM, Java applications can currently be deployed on most end-user
platforms.
Before Java application can be executed, it must first be loaded into memory.
Done by Class Loader, which takes ‘.class’ file(s) containing bytecodes and transfers
it into memory.
Class file can be loaded from local hard drive or downloaded from network.
Finally, bytecodes must be verified to ensure that they are valid and do not violate
Java’s security restrictions.
Loosely speaking, Java is a ‘safe’ C++.
Safety features include strong static type checking, automatic garbage collection,
and absence of machine pointers at language level.
Safety is central design goal: ability to safely transmit Java code across Internet.
Security is also integral part of Java’s design - sandbox ensures untrusted application
cannot gain access to system resources.
Java 2 Platform
In mid-1999, Sun announced it would pursue a distinct and integrated Java enterprise
platform:
– J2ME: aimed at embedded and consumer-electronics platforms.
– J2SE: aimed at typical desktop and workstation environments. Serves as
foundation for J2EE and Web services.
– J2EE: aimed at robust, scalable, multiuser, and secure enterprise applications.
– J2EE was designed to simplify complex problems with development,
deployment, and management of multi-tier enterprise applications.
Cornerstone of J2EE is Enterprise JavaBeans (EJB), a standard for building server-side
components in Java.
Three types of EJB components:
– EJB Session Beans, components implementing business logic, business rules, and
workflow.
– EJB Message-Driven Beans (MDBs), which process messages sent by clients, EJBs,
or other J2EE components.
– EJB Entity Beans, components encapsulating some data contained by the
enterprise. Entity Beans are persistent.
Two types of entity beans:
– Bean-Managed Persistence (BMP), which requires developer to write code to
make bean persist using an API such as JDBC or SQLJ.
– Container-Managed Persistence (CMP), where persistence is provided
automatically by container.
We discuss five ways to access a database from Java: JDBC, SQLJ, CMP, JDO, and JSP.
JDBC
Modeled after ODBC, JDBC API supports basic SQL functionality.
With JDBC, Java can be used as host language for writing database applications.
On top of JDBC, higher-level APIs can be built.
Currently, two types of higher-level APIs:
– An embedded SQL for Java (e.g. SQLJ).
– A direct mapping of relational database tables to Java classes (e.g. TopLink from
Oracle).
JDBC API consists of two main interfaces: an API for application writers, and a lower-
level driver API for driver writers.
Applications and applets can access databases using:
– ODBC drivers and existing database client libraries;
– JDBC API with pure Java JDBC drivers.
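The call-level, dynamic-SQL style that JDBC exemplifies can be sketched with Python's built-in sqlite3 module (the staff table and its rows are hypothetical; in JDBC itself the same pattern uses Connection, PreparedStatement, and ResultSet objects):

```python
# Sketch of the JDBC-style call-level interface using Python's sqlite3
# module (table and data are hypothetical, not from the course text).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (staffNo TEXT, name TEXT, salary REAL)")
conn.executemany("INSERT INTO staff VALUES (?, ?, ?)",
                 [("SL21", "John White", 30000.0),
                  ("SG37", "Ann Beech", 12000.0)])

# SQL is passed as a string at run time, with '?' placeholders bound to
# host-language values -- the analogue of JDBC's PreparedStatement.
cur = conn.execute("SELECT name FROM staff WHERE salary > ?", (20000.0,))
rows = [r[0] for r in cur.fetchall()]
print(rows)   # ['John White']
conn.close()
```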
JDBC - Advantages/Disadvantages
Advantage of using ODBC drivers is that they are a de facto standard for PC database
access and are available for many DBMSs at very low cost.
Disadvantages with this approach:
– A non-pure JDBC driver will not necessarily work with a Web browser.
– Currently, a downloaded applet can connect only to a database located on the
host machine from which it was downloaded.
– Deployment costs increase.
SQLJ
Another JDBC-based approach uses Java with static embedded SQL.
SQLJ comprises a set of clauses that extend Java to include SQL constructs as statements
and expressions.
SQLJ translator transforms SQLJ clauses into standard Java code that accesses database
through a CLI.
Comparison of JDBC and SQLJ
SQLJ is based on static embedded SQL while JDBC is based on dynamic SQL.
Thus, SQLJ facilitates static analysis for syntax checking, type checking, and schema
checking, which may help produce more reliable programs at loss of some functionality.
It also potentially allows DBMS to generate an execution strategy for the query, thereby
improving performance of the query.
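The static-versus-dynamic trade-off can be illustrated with a small sketch (the query is hypothetical, and Python's sqlite3 stands in for a JDBC connection): with dynamic SQL, an error such as a misspelled column name surfaces only when the statement is executed, whereas a static approach like SQLJ can detect it at translation time.

```python
# Sketch: with dynamic SQL, errors in the query text are detected only
# at execution time (query and table are hypothetical).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (staffNo TEXT, name TEXT)")

try:
    conn.execute("SELECT nmae FROM staff")   # typo: 'nmae' instead of 'name'
    failed = False
except sqlite3.OperationalError:
    failed = True                            # caught only at run time
print(failed)   # True
```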
JDBC is low-level middleware tool with features to interface Java application with
RDBMS.
Developers need to design relational schema to which they will map Java objects, and
write code to map Java objects to rows of relations.
Problems:
– need to be aware of two different paradigms (object and relational);
– need to design relational schema to map onto object design;
– need to write mapping code
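The mapping-code burden listed above can be sketched as follows; the Staff class, table, and fields are hypothetical, and Python's sqlite3 stands in for JDBC, but the shape of the hand-written code is the same in Java:

```python
# Sketch of hand-written object-relational mapping code (hypothetical
# class and schema): every persistent field must be mapped in both
# directions by the developer.
import sqlite3

class Staff:
    def __init__(self, staff_no, name):
        self.staff_no = staff_no
        self.name = name

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (staffNo TEXT PRIMARY KEY, name TEXT)")

def save(conn, s):
    # Object -> row mapping, written by hand.
    conn.execute("INSERT INTO staff VALUES (?, ?)", (s.staff_no, s.name))

def load(conn, staff_no):
    # Row -> object mapping: the inverse must also be written by hand.
    row = conn.execute("SELECT staffNo, name FROM staff WHERE staffNo = ?",
                       (staff_no,)).fetchone()
    return Staff(*row) if row else None

save(conn, Staff("SL21", "John White"))
s = load(conn, "SL21")
print(s.name)   # John White
```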
EJBs
EJBs have 3 elements in common:
– an indirection mechanism;
– a bean implementation;
– a deployment description.
With indirection mechanism clients do not invoke EJB methods directly.
Session and entity beans provide access to their operations via interfaces.
The home interface defines methods that manage the lifecycle of a bean. The
corresponding server-side implementation classes are generated at deployment time.
To provide access to other operations, bean can expose a local interface (if client and
bean are colocated), a remote interface, or both.
Local interfaces expose methods to clients running in same container or JVM.
Remote interfaces make methods available to clients no matter where deployed.
When a client invokes create() method (which returns an interface) on home interface,
EJB container calls ejbCreate() to instantiate bean, at which point client can access bean
through remote or local interface returned by create().
Bean implementation is a Java class that implements business logic defined in remote
interface.
Transactional semantics are described declaratively and captured in the deployment
descriptor.
Deployment descriptor, written in XML, lists a bean’s properties and elements, which
may include:
– home interface, remote interface, local interface;
– Web service endpoint interface,
– bean implementation class,
– JNDI name for bean, transaction attributes, security attributes, and per-method
descriptors.
Following steps need to be followed for CMP:
– Define CMP fields in local interface.
– Define CMP fields in entity bean class implementation.
– Define CMP fields in deployment descriptor.
– Define PK field and its type in deployment descriptor.
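The descriptor portion of these steps might look as follows in an EJB 2.x ejb-jar.xml; the bean name, fields, and key shown here are hypothetical:

```xml
<!-- Hypothetical ejb-jar.xml fragment declaring CMP fields and the
     primary key for a Staff entity bean -->
<entity>
  <ejb-name>StaffEJB</ejb-name>
  <local-home>StaffLocalHome</local-home>
  <local>StaffLocal</local>
  <ejb-class>StaffBean</ejb-class>
  <persistence-type>Container</persistence-type>
  <prim-key-class>java.lang.String</prim-key-class>
  <abstract-schema-name>Staff</abstract-schema-name>
  <cmp-field><field-name>staffNo</field-name></cmp-field>
  <cmp-field><field-name>name</field-name></cmp-field>
  <primkey-field>staffNo</primkey-field>
</entity>
```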
EJB container can manage relationships between entity beans.
Relationships have a multiplicity, which can be 1:1, 1:M, or M:M, and a direction, which
can be unidirectional or bidirectional.
Local interfaces provide foundation for CMR.
With CMR, beans use local interfaces to maintain relationships with other beans.
For example, a Staff bean can use collection of PropertyForRent local interfaces to
maintain a 1:M relationship
Container can also manage referential integrity.
CMR relationships are described declaratively in deployment descriptor file outside
enterprise-beans element.
Need to specify both beans involved in relationship.
Each relationship is defined in an ejb-relation element, with each role defined in an
ejb-relationship-role element.
When bean is deployed, the container provider’s tools parse deployment descriptor and
generate code to implement underlying classes.
<method-params></method-params>
</query-method>
<result-type-mapping>Local</result-type-mapping>
<ejb-ql><![CDATA[SELECT OBJECT(s) FROM Staff s]]></ejb-ql>
</query>
<query>
<query-method>
<method-name>findByStaffName</method-name>
<method-params>java.lang.String</method-params>
</query-method>
<result-type-mapping>Local</result-type-mapping>
<ejb-ql><![CDATA[SELECT OBJECT(s)
FROM Staff s WHERE s.name = ?1]]>
</ejb-ql>
</query>
Java Data Objects (JDO)
PersistenceManager contains methods to manage the lifecycle of PersistenceCapable
instances and is also the factory for Query and Transaction instances.
A PersistenceManager instance supports one transaction at a time and uses one connection
to the underlying data source at a time.
Query allows applications to obtain persistent instances from data source. Can be many
Query instances associated with a PersistenceManager and multiple queries may be
designated for simultaneous execution.
This interface is implemented by each JDO vendor to translate expressions in JDOQL into
native query language of data store.
Allows developers to construct complex object graphs in memory and make them
persistent simply by creating a reference to graph from a persistent instance.
Instances have to be explicitly deleted.
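Persistence by reachability can be illustrated, loosely, with Python's pickle module (this is an analogy only, not the JDO API, and the classes are hypothetical): persisting the root of an object graph implicitly persists every object reachable from it.

```python
# Loose analogy for persistence by reachability: serializing the root
# of an object graph captures the whole graph (classes are hypothetical).
import pickle

class PropertyForRent:
    def __init__(self, prop_no):
        self.prop_no = prop_no

class Staff:
    def __init__(self, name, properties):
        self.name = name
        self.properties = properties      # references to other objects

root = Staff("John White", [PropertyForRent("PA14"), PropertyForRent("PG4")])
blob = pickle.dumps(root)                 # only the root is named...

restored = pickle.loads(blob)             # ...but the whole graph returns
print([p.prop_no for p in restored.properties])   # ['PA14', 'PG4']
```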
Java Servlets
Servlets are programs that run on Java-enabled Web server and build Web pages,
analogous to CGI.
Have a number of advantages over CGI:
– improved performance;
– portability;
– extensibility;
– simpler session management;
– improved security and reliability.
Since servlet is compiled, performance is improved.
§ Java API for XML-based RPC (JAX-RPC) sends SOAP method calls to remote services over
the Internet and receives results.
§ Client written in language other than Java can access a Web service developed and
deployed on Java platform.
§ Also, client written in Java can communicate with service developed and deployed using
some other platform.
§ Java API for XML Registries (JAXR), provides standard way to access business registries
and share information.
§ JAXR gives Java developers a uniform way to use business registries based on open
standards (such as ebXML) or industry consortium-led specifications (such as UDDI).
Object Linking and Embedding for DataBases (OLE DB)
§ Microsoft has defined set of data objects, collectively known as OLE DB.
§ Allows OLE-oriented applications to share and manipulate sets of data as objects.
§ OLE DB is an object-oriented specification based on C++ API.
§ Components can be treated as data consumers and data providers. Consumers take
data from OLE DB interfaces and providers expose OLE DB interfaces.
OLE DB Architecture
ASP Architecture
Remote Data Services (RDS)
§ Microsoft technology for client-side database manipulation across Internet.
§ Still uses ADO on server-side to execute query and return recordset to client, which can
then execute other queries on recordset.
§ RDS provides mechanism to send updated records back to server.
§ A disconnected recordset model.
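The disconnected-recordset model can be sketched in outline (Python's sqlite3 stands in for the server DBMS, and the property table is hypothetical): fetch a result set once, work on the local copy with no connection held open, then push the edits back in a batch.

```python
# Sketch of a disconnected recordset (hypothetical schema):
# copy rows to the client, edit locally, send updates back in one batch.
import sqlite3

server = sqlite3.connect(":memory:")   # stands in for the server DBMS
server.execute("CREATE TABLE property (propNo TEXT PRIMARY KEY, rent REAL)")
server.executemany("INSERT INTO property VALUES (?, ?)",
                   [("PA14", 650.0), ("PG4", 350.0)])

# 1. Execute the query once and detach: copy rows into a client-side
#    recordset (here, a plain list of dicts).
recordset = [{"propNo": p, "rent": r}
             for p, r in server.execute("SELECT propNo, rent FROM property")]

# 2. Work on the recordset without contacting the server.
for rec in recordset:
    if rec["propNo"] == "PG4":
        rec["rent"] = 375.0            # local edit only

# 3. Send the updated records back to the server in one batch.
server.executemany("UPDATE property SET rent = ? WHERE propNo = ?",
                   [(rec["rent"], rec["propNo"]) for rec in recordset])
(rent,) = server.execute(
    "SELECT rent FROM property WHERE propNo = 'PG4'").fetchone()
print(rent)   # 375.0
```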
Microsoft .NET
§ Number of limitations with Microsoft’s platform:
– a number of languages supported with different programming models (J2EE
composed solely of Java);
– no automatic state management;
– relatively simple user interfaces for Web compared to traditional Windows user
interfaces;
– need to abstract operating system (Windows API difficult to program).
§ Next, and current, evolution in Microsoft’s Web solution strategy was development of
.NET.
§ Various tools, services, technologies in .NET:
– Windows Server,
– BizTalk Server (to build XML-based business processes across applications and
organizations),
– Commerce Server (to build scalable e-Commerce solutions),
– Application Center (to deploy and manage scalable Web applications),
– Mobile Information Server (to support handheld devices),
– SQL Server,
– Microsoft Visual Studio .NET
– Microsoft .NET Framework (CLR + Class Library).
.NET Framework
§ The .NET Framework class library provides support for:
– network communications, thread management,
– user interface design features,
– database access and manipulation.
§ 3 main components:
– Windows Forms to support user interface development.
– ASP.NET to support development of Web applications and Web services.
Reengineered version of ASP to improve performance and scalability.
– ADO.NET to help applications connect to databases.
ADO.NET
§ Designed to address 3 main weaknesses with ADO:
– providing a disconnected data access model required for Web;
– providing compatibility with .NET Framework class library;
– providing extensive support for XML.
§ Different from connected style of programming that existed in traditional 2-tier C-S
architecture, where connection was held open for duration of program’s lifetime and no
special handling of state was required.
§ Also ADO data model is primarily relational and could not easily handle XML with a data
model that is heterogeneous and hierarchical.
§ Recognizing that ADO was a mature technology and widely used, ADO has been retained
in the .NET Framework, accessible through the .NET COM interoperability services.
§ Two main layers:
– a connected layer (similar to ADO);
– a disconnected layer, the DataSet (providing a similar functionality to RDS).
ADO.NET Architecture
§ In addition, .NET Framework provides set of classes that conform to all the underlying
communication standards, such as SOAP, WSDL, and XML.
§ Microsoft UDDI SDK enables developers to add UDDI functionality to development tools,
installation programs, and any other software that needs to register or locate and bind
remote Web services.
Oracle Internet Platform Architecture
§ Oracle HTTP Server includes a number of Apache modules, such as:
– mod_fastcgi, enhanced version of CGI that runs programs in pre-spawned
process;
– mod_oradav, provides support for WebDAV;
– mod_ossl, provides SSL (HTTPS) support;
– mod_osso, enables transparent single sign-on.
§ The Multi-Channel Server (MCS) automatically translates applications written in
Oracle Wireless XML, XHTML Mobile Profile, or XHTML+XForms for any device and network.
§ Also allows portal sites to be created that use Web pages, Java applications, and XML-
based applications.
Business Intelligence
Functions to track, extract, and analyze business intelligence to support strategic
decision-making:
– Oracle Reports Services enable users to run Oracle Reports over Internet.
– Oracle Discoverer allows users to produce queries, reports, and analysis of
information from databases, OLTP systems, and data warehouses using a Web
browser.
– Oracle Clickstream provides services to capture and analyze aggregate
information about Web site usage.
– Oracle Personalization enables users to track activity of specific user and
personalize information for that user.
Review Questions
1. Discuss and critically review the technologies for enabling Web–database
integration.
2. How are XHTML, HTML, and SGML related?
3. Explain the following terms in your own words:
HTML, HTTP, URL, search engine, XML, XML Schema, XQuery, SGML, web crawler.
4. How might we consider storing XML data in a database? What are the advantages and
disadvantages of this approach?
5. It is sometimes said that XML data is “schemaless”. How do you go about querying data that
has no schema? How would you design a query language for such data? Discuss.
Week Eleven
Three-Tier Architecture
1) Warehouse database server
– Almost always a relational DBMS; rarely flat files
2) OLAP servers
– Relational OLAP (ROLAP): extended relational DBMS that maps operations
on multidimensional data to standard relational operations.
– Multidimensional OLAP (MOLAP): special purpose server that directly
implements multidimensional data and operations.
3) Clients
– Query and reporting tools.
– Analysis tools
– Data mining tools (e.g., trend analysis, prediction)
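The ROLAP mapping in (2) can be sketched as follows: a roll-up of a (hypothetical) sales cube from (product, region) to product alone becomes an ordinary relational GROUP BY. Python's built-in sqlite3 stands in here for an extended relational DBMS:

```python
# Sketch of the ROLAP idea: a roll-up over a dimension maps onto a
# standard relational GROUP BY (fact table is hypothetical).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, region TEXT, qtr TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("TV",    "East", "Q1", 100.0),
    ("TV",    "West", "Q1", 150.0),
    ("Radio", "East", "Q1",  80.0),
])

# Roll-up from (product, region) to product alone: the multidimensional
# operation becomes aggregation over the dropped dimension.
rollup = dict(conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product"))
# rollup maps each product to its total: TV -> 250.0, Radio -> 80.0
```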
Data Mart
Data Mart is a subset of data warehouse that supports the requirements of a particular
department or business function.
Data Warehouse vs. Data Marts
• Enterprise warehouse: collects all information about subjects (customers, products,
sales, assets, personnel) that span the entire organization.
– Requires extensive business modeling
– May take years to design and build
• Data Marts: Departmental subsets that focus on selected subjects: Marketing data
mart: customer, products, sales.
– Faster roll out, but complex integration in the long run.
• Virtual warehouse: views over operational databases
– Materialize some summary views for efficient query processing
– Easier to build
– Requires excess capacity on operational database servers
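The virtual-warehouse bullets can be sketched loosely as follows, with Python's sqlite3 standing in for the operational database and a made-up orders schema: analytical views are defined over operational tables, and one summary view is materialized for efficient query processing.

```python
# Sketch of a virtual warehouse (hypothetical schema): views over the
# operational database, with one summary materialized for speed.
import sqlite3

op = sqlite3.connect(":memory:")          # the operational database
op.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
op.executemany("INSERT INTO orders VALUES (?, ?)",
               [("Acme", 100.0), ("Acme", 50.0), ("Bolt", 70.0)])

# Virtual warehouse: a view computed from operational data on demand.
op.execute("""CREATE VIEW sales_by_customer AS
              SELECT customer, SUM(amount) AS total
              FROM orders GROUP BY customer""")

# Materialize the summary so analytical queries avoid re-scanning the
# operational table (at the cost of keeping the copy current).
op.execute("""CREATE TABLE sales_summary AS
              SELECT * FROM sales_by_customer""")
totals = dict(op.execute("SELECT customer, total FROM sales_summary"))
print(totals == {"Acme": 150.0, "Bolt": 70.0})   # True
```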
Objectives
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining
Solution
Data warehousing and data mining
Data warehousing and on-line analytical processing
Mining interesting knowledge (rules, regularities, patterns, constraints) from
data in large databases
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining with a variety of applications
Web technology and global information systems
Text mining (news group, email, documents) and Web mining
Stream data mining
DNA and bio-data analysis
Fraud Detection and Management
Applications: Health care, retail, credit card service, telecomm.
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
Professional patients, ring of doctors, and ring of references
Unnecessary or correlated screening tests
Telecommunications: phone-call fraud
– Phone call model: destination of the call, duration, time of day or week. Analyze
patterns that deviate from an expected norm
Retail industry
– Analysts estimate that 38% of retail shrink is due to dishonest employees
Anti-terrorism
Data Mining: On What Kinds of Data?
1. Relational database
2. Data warehouse
3. Transactional database
4. Advanced database and information repository
– Object-relational database
– Spatial and temporal data
– Time-series data
– Stream data
– Multimedia database
– Heterogeneous and legacy database
– Text databases & WWW
Review Questions
1. Define a data warehouse. What are the major components of a data warehouse?
2. Discuss the importance of data mining in databases
3. Discuss the new trends and challenges of data warehousing technology
Week Twelve
Multi-Dimensional Databases
Learning Objectives
1. Multidimensional Databases
– What are they? Why are they needed?
2. Contrast MDD and Relational Databases
3. When is MDD (In) appropriate?
4. MDD Features
– Rotation, Ranging, Roll-Up and Drill-Down, Computations
5. Benefits of MDD
Organization of Concepts
[Concept map: MDDB definition and MDDB vs. relational database; data
requirements; transaction processing vs. data warehousing; data structures;
functions (rotation, ranging, roll-up, drill-down, computations); benefits;
architecture and tools; emergent applications (association, availability bias,
sequential pattern, classifying, clustering, hybrid, data mapping); example:
architecture of an IBM PowerPlay data warehouse]
Why Multi-Dimensional Databases?
§ No single "best" data structure for all applications within an enterprise
§ From econometric research conducted at MIT in the 1960s, the multidimensional
database has matured into the database engine of choice for data analysis applications
§ Inherent ability to integrate and analyze large volumes of enterprise data
§ Offers a good conceptual fit with the way end-users visualize business data
– Most business people already think about their businesses in multidimensional
terms
– Managers tend to ask questions about product sales in different markets over
specific time periods
What is a Multi-Dimensional Database?
A multidimensional database (MDD) is a computer software system designed to allow for the
efficient and convenient storage and retrieval of large volumes of data that is (1) intimately
related and (2) stored, viewed and analyzed from different perspectives. These perspectives
are called dimensions.
Contrasting Relational and Multi-Dimensional Models:
An Example
RELATIONAL STRUCTURE
2. Perspectives are embedded directly into the structure in the multidimensional model
All possible combinations of perspectives containing a specific attribute
(the color BLUE, for example) line up along the dimension position for
that attribute.
3. Perspectives are placed in fields in the relational model - tells us nothing about field
contents.
4. MDD makes data browsing and manipulation intuitive to the end-user
5. Any data manipulation action possible with a MDD is also possible using relational
technology
6. Substantial cognitive advantages in query formulation
7. Substantial computational performance advantages in query processing when using
MDD
Viewing Data - An Example
[Figure: Sales Volumes data cube with dimensions MODEL, DEALERSHIP, and COLOR]
• Assume that each dimension has 10 positions, as shown in the cube above
• How many records would there be in a relational table?
• Implications for viewing data from an end-user standpoint?
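A back-of-envelope answer to the first question above can be sketched as follows (dimension names taken from the cube; sizes as stated):

```python
# One fully populated cube cell corresponds to one relational row:
# with 10 positions per dimension, the table needs 10 * 10 * 10 rows.
dimensions = {"MODEL": 10, "DEALERSHIP": 10, "COLOR": 10}

cells = 1
for size in dimensions.values():
    cells *= size
print(cells)   # 1000

# In the cube a value is addressed directly by its coordinates; in the
# table the same value is one of 1000 (model, dealership, color, volume)
# rows that must be searched or indexed.
```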
When is MDD (In)appropriate?
• Our sales volume dataset has a great number of meaningful interrelationships
• Interrelationships more meaningful than individual data elements themselves.
• The greater the number of inherent interrelationships between the elements of a
dataset, the more likely it is that a study of those interrelationships will yield business
information of value to the company.
• Highly interrelated dataset types should be placed in a multidimensional data
structure for greatest ease of access and analysis
MDD Features: Multidimensional Computations
• Well equipped to handle demanding mathematical functions.
• Can treat arrays like cells in spreadsheets. For example, in a budget analysis situation,
one can divide the ACTUAL array by the BUDGET array to compute the VARIANCE array.
• Applications based on multidimensional database technology typically have one
dimension defined as a "business measurements" dimension.
• Integrates computational tools very tightly with the database structure.
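The ACTUAL/BUDGET example above can be sketched as array arithmetic; the figures here are made up, and plain nested lists stand in for the database's multidimensional arrays:

```python
# Sketch of the budget-analysis computation described above (figures are
# hypothetical): dividing the ACTUAL array by the BUDGET array cell by
# cell yields a VARIANCE (ratio) array with the same dimensions.
actual = [[110.0,  95.0],
          [130.0, 100.0]]
budget = [[100.0, 100.0],
          [100.0, 100.0]]

variance = [[a / b for a, b in zip(a_row, b_row)]
            for a_row, b_row in zip(actual, budget)]
print(variance)   # [[1.1, 0.95], [1.3, 1.0]]
```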
Benefits of MDD
• Cognitive Advantages for the User
• Ease of Data Presentation and Navigation
– Obtaining the same views in a relational world requires the end user to either
write complex SQL queries or use an SQL generator against the relational
database to convert the table outputs into a more intuitive format.
• Ease of Maintenance
– Because data is stored in the same way as it is viewed (i.e. according to its
fundamental attributes), no additional overhead is required to translate user
queries into requests for data
• Performance
– Multidimensional databases achieve performance levels that are difficult to
match in a relational environment.
Review Questions
1. Explain what multidimensional databases are and their usefulness.
2. Differentiate between MDD and relational databases.
3. Discuss the features of MDD.
4. What are the benefits of MDD?
References
1. Connolly, T. and Begg, C. Database Systems, 5th Edition, Addison-Wesley.
2. Elmasri, R. and Navathe, S. Fundamentals of Database Systems, 5th Edition, Addison-Wesley.
3. Date, C.J. (1990) An Introduction to Database Systems, Volume 1, 5th Edition, Addison-Wesley.
4. Korth, H.F. and Silberschatz, A. (1991) Database System Concepts, 2nd Edition, McGraw-Hill.
5. Atkinson, M., DeWitt, D., Maier, D., Bancilhon, F., Dittrich, K. and Zdonik, S. (1990) Object Data Management: Object-Oriented and Extended Relational Database Systems, Addison-Wesley.
6. Kim, W. (1991) Object-Oriented Database Systems, Morgan Kaufmann.
7. Unland, R. and Schlagetter, G. Introduction to Object-Oriented Databases, The MIT Press, Cambridge, Massachusetts.
8. Object-Oriented Database Systems: State of the Art and Research Problems. In: Expert Database Systems, Academic Press, 1992.