0% found this document useful (0 votes)
4 views49 pages

Ch1 Query Processing (2)

The document covers advanced topics in database management systems, focusing on query processing and optimization, including translating SQL queries into relational algebra and implementing various operations like SELECT and JOIN. It discusses algorithms for executing query operations, external sorting, and methods for optimizing query execution strategies. Additionally, it addresses database features such as security, concurrency control, and recovery techniques.

Uploaded by

hailuzemene21
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views49 pages

Ch1 Query Processing (2)

The document covers advanced topics in database management systems, focusing on query processing and optimization, including translating SQL queries into relational algebra and implementing various operations like SELECT and JOIN. It discusses algorithms for executing query operations, external sorting, and methods for optimizing query execution strategies. Additionally, it addresses database features such as security, concurrency control, and recovery techniques.

Uploaded by

hailuzemene21
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 49

Advanced Database Management System

Query processing and optimization


Database security and authorization
Transaction processing concepts
MID SEMESTER EXAMINATION
Concurrency control techniques
Database recovery techniques
Distributed database system
Query Processing and
Optimization

Database system features


Crash Recovery
Integrity Checking
Security
Concurrency Control
Query Processing and Optimization
File Organization and Optimization

2
Query Processing and Optimization

 Translating SQL Queries into Relational Algebra


 Basic Algorithms for Executing Query Operations
 Using Heuristic in Query Optimization
 Using Selectivity and Cost Estimates in Query
Optimization
 Semantic Query Optimization

•Query languages: Allow manipulation and


retrieval
of data from a database.

3
Relational algebra-operations on a relation

Operations that could be applied on relations




Selection (  ) Selects a subset of rows from a relation.

Projection (  ) Deletes unwanted columns from a relation.

Renaming: assigning intermediate relation for a single operation

Cross-Product ( x ) Allows to concatenate a tuple from one
relation with all the tuples from the other relation.
 Set-Difference ( - ) Tuples in relation R1, but not in relation
R2.
 Union ( ) Tuples in relation R , or in relation R .
1 2
 Intersection () Tuples in relation R and in relation R
1 1

Join Tuples joined from two relations based on a condition

Query processing :
The aim of query processing is to find information in one or more

databases and deliver it to the user.

4
Query Processing and Optimization
 A query expressed in a high-level query language such as
SQL must first be scanned, parsed, and validated.
 The scanner identifies the language tokens—such as
SQL keywords, attribute names, and relation names.
 The parser checks the query syntax based on the rules of
the query language.
 The query must also be validated, by checking that all
attribute and relation names are valid and semantically
meaningful names in the schema of the particular
database being queried.

5
Query Processing and Optimization
 An internal representation of the query is then
created, usually as a tree data structure called a
query tree.
 It is also possible to represent the query using a
graph data structure called a query graph.
 The DBMS must then devise an execution strategy
for retrieving the result of the query from the
database files. A query typically has many possible
execution strategies, and the process of choosing a
suitable one for processing a query is known as
query optimization.
6
Query Processing and Optimization
 The query optimizer module has the task of
producing an execution plan.
 The code generator generates the code to
execute that plan.
 The runtime database processor has the task
of running the query code.

7
Query Processing (cont…)

8
Query Processing (cont…)
 Translating SQL queries into relational
algebra
 Query block:

The basic unit that can be translated into the
algebraic operators
 A query block contains a single SELECT-FROM-
WHERE expression, as well as GROUP BY and
HAVING clause if these are part of the block.
 Nested queries within a query are identified as
separate query blocks.

9
Query Processing (cont…)
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5);

SELECT LNAME, FNAME SELECT MAX (SALARY)


FROM EMPLOYEE FROM EMPLOYEE
WHERE SALARY > C WHERE DNO = 5

πLNAME, FNAME (σSALARY>C(EMPLOYEE)) ℱMAX SALARY (σDNO=5 (EMPLOYEE))

10
Basic algorithms for executing query
operations
 External Sorting
 Implementing the SELECT Operation
 Implementing the JOIN Operation
 Implementing PROJECT and Set Operations

11
Basic algorithms for executing query
operations
 External sorting:
 Refers to sorting algorithms that are suitable for large

files of records stored on disk that do not fit entirely in


main memory, such as most database files.
 Sorting is one of the primary algorithms used in query
processing. For example, whenever an SQL query
specifies an ORDER BY-clause, the query result must be
sorted. Sorting is also a key component in sort-merge
algorithms used for JOIN and other operations (such as
UNION and INTERSECTION), and in duplicate
elimination algorithms for the PROJECT operation (when
an SQL query specifies the DISTINCT option in the
SELECT clause).

12
External sorting:

13
External sorting:
 External sorting uses Sort-Merge strategy:
 Starts by sorting small subfiles (runs) of the main file

and then merges the sorted runs, creating larger sorted


subfiles that are merged in turn. 2 phases:
 Sorting phase:


Number of subfiles (runs) nR = (b/nB)
 Merging phase:

Degree of merging(dM) = Min (nB-1, nR); nP = (logdM(nR))

14
Sort-Merge strategy (cont…)
 nR: number of initial runs;
 b: number of file blocks;

 nB: available buffer space;

 dM: degree of merging;

 P: number of passes

 The size of a run and number of initial run depends on the


number of file blocks (b) and available buffer space (nB)
 Example: if nB=5 blocks and size of the file=1024 blocks,
 nR=(b/nB)= (1024/5) =205 runs each of size 5 blocks

except the last run which will have 4 blocks.


 Hence, after the sort phase, 205 sorted runs are stored

as temporary subfiles on disk

15
Sort-Merge strategy (cont…)
 In the merging phase, the sorted runs are merged during
one or more passes.
 The degree of merging (dM) is the number of runs that
can be merged in each pass.
 dM=min((nB-1) and nR))
 The number of passes=[(logdM (nR))]
 In each pass, one buffer block is needed to hold one
block from each of the runs being merged and one block
is needed for containing one block of the merge result

16
Sort-Merge strategy (cont…)
 In the above example, dM=4(four way merging)
 Hence, the 205 initial sorted runs would be merged into:
 52 at the end of the first pass
 13 at the end of the second pass
 4 at the end of the third pass
 1 at the end of the fourth pass
 The minimum dM of 2 gives the worst case performance
of the algorithm, which is

2*b +2(*(b*(log2 (nR) )
 The first term represents the number of block access for
the sort phase and the second term represents the
number of block access for the merge phase

17
Exercise using the relation given below

 Assume available buffer


space is 3 blocks.
 Find DM?
 Find P?
 Find the number of
merges in each pass?

18
Merging phase
 Step 1 (only for run 1 and run 2)
 Read a (from first run) and b(from second run) and load to buffer

 a is written into an output buffer

 Read d(from first run) and load to buffer

 a is written to disk, and b is written into an output buffer

 Read c (from second run) and load to buffer

 b is written to disk, and c is written into an output buffer

 Read e(from second run) and load to buffer

 c is written to disk, d is written to an output buffer

 Read g (from first run) and load to buffer

 d is written to disk, e is written to an output buffer


 And write g to disk

Do for Step 1 (for run 3 and run 4) and the whole for step 2

19
Implementing the SELECT Operation
 There are many options for executing a SELECT operation
 Some options depend on the file having specific access

paths and may apply only to certain types of selection


conditions
Examples:
 (OP1):  SSN='123456789' (EMPLOYEE)
 (OP2):  DNUMBER>5(DEPARTMENT)
 (OP3):  DNO=5(EMPLOYEE)
 (OP4):  DNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE)

20
Implementing the SELECT Operation (cont…)
 Search Methods for implementing Simple Selection:

S1 Linear search (brute force):

Retrieve every record in the file, and test whether its attribute
values satisfy the selection condition.

S2 Binary search:

If the selection condition involves an equality comparison on a
key attribute on which the file is ordered, binary search (which
is more efficient than linear search) can be used. An example
is OP1 if SSN is the ordering attribute for EMPLOYEE file

S3 Using a primary index or hash key to retrieve a
single record:

If the selection condition involves an equality comparison on a
key attribute with a primary index (or a hash key), use the
primary index (or the hash key) to retrieve the record.

For Example, OP1 use primary index to retrieve the record

21
Implementing the SELECT Operation (cont…)
 Search Methods for implementing Simple Selection

S4 Using a primary index to retrieve multiple
records:

If the comparison condition is >, ≥, <, or ≤ on a key
field with a primary index, use the index to find the
record satisfying the corresponding equality
condition, then retrieve all subsequent records in
the (ordered) file. (see OP2)

S5 Using a clustering index to retrieve multiple
records:

If the selection condition involves an equality
comparison on a non-key attribute with a clustering
index, use the clustering index to retrieve all the
records satisfying the selection condition. (See OP3)

22
Implementing the SELECT Operation (cont…)
• S6: using a secondary index on an equality comparison:
• This search method can be used to retrieve a single record if
the indexing field is a key (has unique values) or to retrieve
multiple records if the indexing field is not a key
• Search Methods for implementing complex Selection like OP4
• S7: Conjunctive selection using an individual index :

If an attribute involved in any single simple condition in
the conjunctive condition has an access path that
permits the use of one of the methods S2 to S6, use that
condition to retrieve the records and then check whether
each retrieved record satisfies the remaining simple
conditions in the conjunctive condition

23
Implementing the SELECT Operation
(summarized)
 Whenever a single condition specifies the selection, we
can only check whether an access path exists on the
attribute involved in that condition.

If an access path exists, the method corresponding
to that access path is used;

Otherwise, the “brute force” linear search approach
of method S1 is used
 For conjunctive selection conditions, whenever

more than one of the attributes involved in the


conditions have an access path, query optimization
should be done to choose the access path that
retrieves the fewest records in the most efficient way
24
Implementing the SELECT Operation
(summarized)
 Disjunctive selection conditions: This is a situation where
simple conditions are connected by the OR logical
connective rather than AND.
 Compared to conjunctive selection, It is much harder to

process and optimize.


 Example: DNO=5 OR SALARY>3000 OR SEX=‘F’(EMPLOYEE)
 Little optimization can be done because the records
satisfying the disjunctive condition are the union of the
records satisfying the individual conditions.
 If any of the individual conditions does not have an
access path, we are compelled to use the brute force
approach.

25
Implementing the JOIN Operation:
 The join operation is one of the most time
consuming operation in query processing
 Join

two–way join: a join on two files

e.g. R A=B S

multi-way joins: joins involving more than two files

e.g. R A=B S C=D T
 Examples

(OP5): EMPLOYEE DNO=DNUMBER DEPARTMENT

(OP6): DEPARTMENT MGRSSN=SSN EMPLOYEE

26
Implementing the JOIN Operation(cont…)
 Methods for implementing joins:
 J1 Nested-loop join (brute force):

For each record t in R (outer loop), retrieve every
record s from S (inner loop) and test whether the
two records satisfy the join condition t[A] = s[B]
 J2 Single-loop join (Using an access structure to
retrieve the matching records):

If an index (or hash key) exists for one of the two
join attributes — say, B of S — retrieve each record
t in R, one at a time, and then use the access
structure to retrieve directly all matching records s
from S that satisfy s[B] = t[A].

27
Implementing the JOIN Operation (cont…)
 Methods for implementing joins:
 J3 Sort-merge join:

If the records of R and S are physically sorted
(ordered) by value of the join attributes A and B,
respectively, we can implement the join in the most
efficient way possible.

Both files are scanned in order of the join attributes,
matching the records that have the same values for
A and B.

In this method, the records of each file are scanned
only once each for matching with the other file—
unless both A and B are non-key attributes

28
Implementing the JOIN Operation (cont’d)
 Other type of JOIN algorithms
 Partition hash join

Partitioning phase:

Each file (R and S) is first partitioned into M partitions using a
partitioning hash function on the join attributes:

R1 , R2 , R3 , ...... Rm and S1 , S2 , S3 , ...... Sm

Minimum number of in-memory buffers needed for the
partitioning phase: M.

A disk sub-file is created per partition to store the tuples for
that partition.

Joining or probing phase:

Involves M iterations, one per partitioned file.

Iteration i involves joining partitions Ri and Si.

29
Algorithms for PROJECT and SET Operations
 Algorithm for PROJECT operation
 <attribute list>(R)
1. If <attribute list> has a key of relation R, extract all tuples from
R with only the values for the attributes in <attribute list>.
2. If <attribute list> does NOT include a key of relation R,
duplicated tuples must be removed from the results.

 Methods to remove duplicate tuples


1. Sorting
2. Hashing
 Note: in SQL query, the default is not to remove duplicates.
To remove duplicates we have to use the key word
‘DISTINCT’.

30
Set operations
 UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN
PRODUCT
 CARTESIAN PRODUCT: too expensive. If R has n records and j
attributes and S has m records and k attributes, the result
relation will have n * m records and j + k attributes. Hence, it is
important to avoid the CARTESIAN PRODUCT operation and
substitute other equivalent operations during query optimization.
 UNION, INTERSECTION, and SET DIFFERENCE: apply with
the same number of attributes and the same attribute domains.
The customary way to implement these operations is to use:
 sort-merge technique: the two relations are sorted on the

same attributes, and, after sorting, a single scan through each


relation is sufficient to produce the result.

31
Query Optimization
 There are two main techniques for implementing query
optimization.
 Heuristic rules

A heuristic is a rule that works well in most cases but is
not guaranteed to work well in every possible case. The
rules typically reorder the operations in a query tree.
 Cost estimation

Determining the cost of different execution strategies and
choosing the execution plan with the lowest cost estimate.
 The two techniques are usually combined in a query
optimizer

32
General Transformation rules for
relational algebra

1. Cascade of : A conjunctive selection condition can be


broken up into a cascade (sequence) of individual 
operations:

 c1 AND c2 AND ... AND cn(R) = c1 (c2 (...(cn(R))...) )
2. Commutativity of :

The  operation is commutative:

c1 (c2(R)) = c2 (c1(R))
3. Cascade of : In a cascade (sequence) of  operations, all
but the last one can be ignored:

List1 (List2 (...(Listn(R))...) ) = List1(R)

33
Transformation Rules (cont…)

4. Commuting  with :
 If the selection condition c involves only the
attributes A1, ..., An in the projection list, the two
operations can be commuted:
 A1, A2, ..., An (c (R)) = c (A1, A2, ..., An (R))

5. Commuting of and X: both are commutative


R cS = S c R, RXS=SXR

34
Transformation Rules (cont…)
6. Commuting  with : If projection list is L = {A1, ..., An,
B1, ..., Bm}, where A1, ..., An are attributes of R and
B1, ..., Bm are attributes of S and the join condition c
involves only attributes in L, the two operations can be
commuted as follows:

L ( R C S ) = (A1, ..., An (R)) C ( B1, ..., Bm (S))

7. Converting a (, x) sequence into : If the condition c of a


 that follows a x Corresponds to a join condition, convert
the (, x) sequence into a as follows:

(C (R x S)) = (R C S)

35
Transformation Rules (cont…)
8. Commutativity of the Set Operations: UNION and
INTERSECTION but not SET DIFFERENCE
RS=SR and RS=SR
9. Associativity of the THETA JOIN, CARTESIAN
PRODUCT, UNION and INTERSECTION.
(R S) T=R (S T) where is one of the
operations
10. Commuting SELECTION with SET OPERATIONS
c (R S)= (c(R)  c(S)) where is one of the operations
11. Commuting PROJECTION with UNION
L1 (SR )= L1 (S )  L1 (R )
36
Using Heuristics in Query Optimization
 The heuristic approach of optimization will make use of:
 Properties of individual operators

 Association between operators

 Query Tree: a graphical representation of the operators, relations,

attributes and predicates and processing sequence during query


processing.
 Query tree is composed of three main parts:

The Leafs: the base relations used for processing the query/
extracting the required information

The Root: the final result/relation as an output based on the
operation on the relations used for query processing

Nodes: intermediate results or relations before reaching the final
result.
 Sequence of execution of operation in a query tree will start from the leaves
and continues to the intermediate nodes and ends at the root

37
Using Heuristics in Query Optimization (cont…)
 Process for heuristics optimization
1. The parser of a high-level query generates an initial
internal representation;
2. Apply heuristics rules to optimize the internal
representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the files
involved in the query
 The main heuristic is to apply first the operations that
reduce the size of intermediate results
 E.g., Apply SELECT and PROJECT operations before
applying the JOIN or other binary operations

38
Using Heuristics in Query Optimization (cont…)
 Heuristic Optimization of Query Trees:
 The same query could correspond to many different relational algebra

expressions — and hence many different query trees.


 The task of heuristic optimization of query trees is to find a final query

tree that is efficient to execute.


 Example:
 Project (Pname, Pnumber), Employee (Bdate, Ssn, Fname, Lname)

and Works_on (Pno, Essn). Project and Employee are related by


Works_on table.
 List Last Name of employees who are working in AQUARIUS project

and born after 1957-12-31.


Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = ‘AQUARIUS’ AND
PNMUBER=PNO AND ESSN=SSN
AND BDATE > ‘1957-12-31’;

39
Steps in converting a query tree during heuristic optimization:
Initial query tree for the query Q

40
Steps in converting a query tree during heuristic optimization:

Moving the select operation down the tree

41
Steps in converting a query tree during heuristic optimization:
Applying the more restrictive select operation first

42
Steps in converting a query tree during heuristic optimization:
Replacing Cartesian product and select with join

43
Steps in converting a query tree during heuristic optimization:
Moving project operations down the query tree

44
Using Heuristics in Query Optimization
 Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that
reduce the size of intermediate results
2. Perform select operations as early as possible to reduce
the number of tuples and perform project operations as
early as possible to reduce the number of attributes. (This
is done by moving select and project operations as far
down the tree as possible.)
3. The select and join operations that are most restrictive
should be executed before other similar operations. (This
is done by reordering the leaf nodes of the tree among
themselves and adjusting the rest of the tree
appropriately.)

45
Exercise
 Consider the following schemas and the query, where the
EMPLOYEE and the PROJECT relations are related by the
WORKS_ON relation.
 EMPLOYEE (EEmpID, FName, LName, Salary, Dept, Sex, DoB)

 PROJECT (PProjID, PName, PLocation, PFund, PManagerID)

 WORKS_ON (WEmpID, WProjID

 Query: The manager of the company working on road construction


would like to view employees name born before January 1, 1965 who
are working on the project named Ring Road.
 Formulate SQL query, relational algebra and finally construct the
optimum query tree (show the steps you passed to reach the final
tree). You must justify to simplify your tree.

46
Using Selectivity and Cost Estimates in Query
Optimization
 Cost-based query optimization:
 Estimate and compare the costs of executing a query

using different execution strategies and choose the


strategy with the lowest cost estimate
 Cost Components for Query Execution
1. Access cost to secondary storage

2. Computation cost

3. Memory usage cost

4. Communication cost
 Note: Different database systems may focus on different
cost components.

47
Using Selectivity and Cost Estimates in
Query Optimization
 Catalog Information Used in Cost Functions
 Information about the size of a file

number of records (tuples) (r),

record size (R),

number of blocks (b)

blocking factor (bfr) –a factor of block size and record size
 Information about indexes and indexing attributes of a file

Number of levels (x) of each multilevel index

Number of first-level index blocks (bI1)

Number of distinct values (d) of an attribute

Selectivity (sl) of an attribute-r/d

48
Semantic Query Optimization
 Semantic Query Optimization:
 Uses constraints specified on the database schema in order to

modify one query into another query that is more efficient to


execute
 Consider the following SQL query,
SELECT E.LNAME, M.LNAME
FROM EMPLOYEE E M
WHERE E.SUPERSSN=M.SSN AND E.SALARY>M.SALARY
 Explanation:
 Suppose that we had a constraint on the database schema that

stated that no employee can earn more than his or her direct
supervisor. If the semantic query optimizer checks for the
existence of this constraint, it need not execute the query at all
because it knows that the result of the query will be empty.

49

You might also like