Ch1 Query Processing (2)
Ch1 Query Processing (2)
2
Query Processing and Optimization
3
Relational algebra-operations on a relation
Query processing :
The aim of query processing is to find information in one or more
4
Query Processing and Optimization
A query expressed in a high-level query language such as
SQL must first be scanned, parsed, and validated.
The scanner identifies the language tokens—such as
SQL keywords, attribute names, and relation names.
The parser checks the query syntax based on the rules of
the query language.
The query must also be validated, by checking that all
attribute and relation names are valid and semantically
meaningful names in the schema of the particular
database being queried.
5
Query Processing and Optimization
An internal representation of the query is then
created, usually as a tree data structure called a
query tree.
It is also possible to represent the query using a
graph data structure called a query graph.
The DBMS must then devise an execution strategy
for retrieving the result of the query from the
database files. A query typically has many possible
execution strategies, and the process of choosing a
suitable one for processing a query is known as
query optimization.
6
Query Processing and Optimization
The query optimizer module has the task of
producing an execution plan.
The code generator generates the code to
execute that plan.
The runtime database processor has the task
of running the query code.
7
Query Processing (cont…)
8
Query Processing (cont…)
Translating SQL queries into relational
algebra
Query block:
The basic unit that can be translated into the
algebraic operators
A query block contains a single SELECT-FROM-
WHERE expression, as well as GROUP BY and
HAVING clause if these are part of the block.
Nested queries within a query are identified as
separate query blocks.
9
Query Processing (cont…)
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5);
10
Basic algorithms for executing query
operations
External Sorting
Implementing the SELECT Operation
Implementing the JOIN Operation
Implementing PROJECT and Set Operations
11
Basic algorithms for executing query
operations
External sorting:
Refers to sorting algorithms that are suitable for large
12
External sorting:
13
External sorting:
External sorting uses Sort-Merge strategy:
Starts by sorting small subfiles (runs) of the main file
Number of subfiles (runs) nR = (b/nB)
Merging phase:
Degree of merging(dM) = Min (nB-1, nR); nP = (logdM(nR))
14
Sort-Merge strategy (cont…)
nR: number of initial runs;
b: number of file blocks;
P: number of passes
15
Sort-Merge strategy (cont…)
In the merging phase, the sorted runs are merged during
one or more passes.
The degree of merging (dM) is the number of runs that
can be merged in each pass.
dM=min((nB-1) and nR))
The number of passes=[(logdM (nR))]
In each pass, one buffer block is needed to hold one
block from each of the runs being merged and one block
is needed for containing one block of the merge result
16
Sort-Merge strategy (cont…)
In the above example, dM=4(four way merging)
Hence, the 205 initial sorted runs would be merged into:
52 at the end of the first pass
13 at the end of the second pass
4 at the end of the third pass
1 at the end of the fourth pass
The minimum dM of 2 gives the worst case performance
of the algorithm, which is
2*b +2(*(b*(log2 (nR) )
The first term represents the number of block access for
the sort phase and the second term represents the
number of block access for the merge phase
17
Exercise using the relation given below
18
Merging phase
Step 1 (only for run 1 and run 2)
Read a (from first run) and b(from second run) and load to buffer
Do for Step 1 (for run 3 and run 4) and the whole for step 2
19
Implementing the SELECT Operation
There are many options for executing a SELECT operation
Some options depend on the file having specific access
20
Implementing the SELECT Operation (cont…)
Search Methods for implementing Simple Selection:
S1 Linear search (brute force):
Retrieve every record in the file, and test whether its attribute
values satisfy the selection condition.
S2 Binary search:
If the selection condition involves an equality comparison on a
key attribute on which the file is ordered, binary search (which
is more efficient than linear search) can be used. An example
is OP1 if SSN is the ordering attribute for EMPLOYEE file
S3 Using a primary index or hash key to retrieve a
single record:
If the selection condition involves an equality comparison on a
key attribute with a primary index (or a hash key), use the
primary index (or the hash key) to retrieve the record.
For Example, OP1 use primary index to retrieve the record
21
Implementing the SELECT Operation (cont…)
Search Methods for implementing Simple Selection
S4 Using a primary index to retrieve multiple
records:
If the comparison condition is >, ≥, <, or ≤ on a key
field with a primary index, use the index to find the
record satisfying the corresponding equality
condition, then retrieve all subsequent records in
the (ordered) file. (see OP2)
S5 Using a clustering index to retrieve multiple
records:
If the selection condition involves an equality
comparison on a non-key attribute with a clustering
index, use the clustering index to retrieve all the
records satisfying the selection condition. (See OP3)
22
Implementing the SELECT Operation (cont…)
• S6: using a secondary index on an equality comparison:
• This search method can be used to retrieve a single record if
the indexing field is a key (has unique values) or to retrieve
multiple records if the indexing field is not a key
• Search Methods for implementing complex Selection like OP4
• S7: Conjunctive selection using an individual index :
If an attribute involved in any single simple condition in
the conjunctive condition has an access path that
permits the use of one of the methods S2 to S6, use that
condition to retrieve the records and then check whether
each retrieved record satisfies the remaining simple
conditions in the conjunctive condition
23
Implementing the SELECT Operation
(summarized)
Whenever a single condition specifies the selection, we
can only check whether an access path exists on the
attribute involved in that condition.
If an access path exists, the method corresponding
to that access path is used;
Otherwise, the “brute force” linear search approach
of method S1 is used
For conjunctive selection conditions, whenever
25
Implementing the JOIN Operation:
The join operation is one of the most time
consuming operation in query processing
Join
two–way join: a join on two files
e.g. R A=B S
multi-way joins: joins involving more than two files
e.g. R A=B S C=D T
Examples
(OP5): EMPLOYEE DNO=DNUMBER DEPARTMENT
(OP6): DEPARTMENT MGRSSN=SSN EMPLOYEE
26
Implementing the JOIN Operation(cont…)
Methods for implementing joins:
J1 Nested-loop join (brute force):
For each record t in R (outer loop), retrieve every
record s from S (inner loop) and test whether the
two records satisfy the join condition t[A] = s[B]
J2 Single-loop join (Using an access structure to
retrieve the matching records):
If an index (or hash key) exists for one of the two
join attributes — say, B of S — retrieve each record
t in R, one at a time, and then use the access
structure to retrieve directly all matching records s
from S that satisfy s[B] = t[A].
27
Implementing the JOIN Operation (cont…)
Methods for implementing joins:
J3 Sort-merge join:
If the records of R and S are physically sorted
(ordered) by value of the join attributes A and B,
respectively, we can implement the join in the most
efficient way possible.
Both files are scanned in order of the join attributes,
matching the records that have the same values for
A and B.
In this method, the records of each file are scanned
only once each for matching with the other file—
unless both A and B are non-key attributes
28
Implementing the JOIN Operation (cont’d)
Other type of JOIN algorithms
Partition hash join
Partitioning phase:
Each file (R and S) is first partitioned into M partitions using a
partitioning hash function on the join attributes:
R1 , R2 , R3 , ...... Rm and S1 , S2 , S3 , ...... Sm
Minimum number of in-memory buffers needed for the
partitioning phase: M.
A disk sub-file is created per partition to store the tuples for
that partition.
Joining or probing phase:
Involves M iterations, one per partitioned file.
Iteration i involves joining partitions Ri and Si.
29
Algorithms for PROJECT and SET Operations
Algorithm for PROJECT operation
<attribute list>(R)
1. If <attribute list> has a key of relation R, extract all tuples from
R with only the values for the attributes in <attribute list>.
2. If <attribute list> does NOT include a key of relation R,
duplicated tuples must be removed from the results.
30
Set operations
UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN
PRODUCT
CARTESIAN PRODUCT: too expensive. If R has n records and j
attributes and S has m records and k attributes, the result
relation will have n * m records and j + k attributes. Hence, it is
important to avoid the CARTESIAN PRODUCT operation and
substitute other equivalent operations during query optimization.
UNION, INTERSECTION, and SET DIFFERENCE: apply with
the same number of attributes and the same attribute domains.
The customary way to implement these operations is to use:
sort-merge technique: the two relations are sorted on the
31
Query Optimization
There are two main techniques for implementing query
optimization.
Heuristic rules
A heuristic is a rule that works well in most cases but is
not guaranteed to work well in every possible case. The
rules typically reorder the operations in a query tree.
Cost estimation
Determining the cost of different execution strategies and
choosing the execution plan with the lowest cost estimate.
The two techniques are usually combined in a query
optimizer
32
General Transformation rules for
relational algebra
33
Transformation Rules (cont…)
4. Commuting with :
If the selection condition c involves only the
attributes A1, ..., An in the projection list, the two
operations can be commuted:
A1, A2, ..., An (c (R)) = c (A1, A2, ..., An (R))
34
Transformation Rules (cont…)
6. Commuting with : If projection list is L = {A1, ..., An,
B1, ..., Bm}, where A1, ..., An are attributes of R and
B1, ..., Bm are attributes of S and the join condition c
involves only attributes in L, the two operations can be
commuted as follows:
L ( R C S ) = (A1, ..., An (R)) C ( B1, ..., Bm (S))
(C (R x S)) = (R C S)
35
Transformation Rules (cont…)
8. Commutativity of the Set Operations: UNION and
INTERSECTION but not SET DIFFERENCE
RS=SR and RS=SR
9. Associativity of the THETA JOIN, CARTESIAN
PRODUCT, UNION and INTERSECTION.
(R S) T=R (S T) where is one of the
operations
10. Commuting SELECTION with SET OPERATIONS
c (R S)= (c(R) c(S)) where is one of the operations
11. Commuting PROJECTION with UNION
L1 (SR )= L1 (S ) L1 (R )
36
Using Heuristics in Query Optimization
The heuristic approach of optimization will make use of:
Properties of individual operators
37
Using Heuristics in Query Optimization (cont…)
Process for heuristics optimization
1. The parser of a high-level query generates an initial
internal representation;
2. Apply heuristics rules to optimize the internal
representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the files
involved in the query
The main heuristic is to apply first the operations that
reduce the size of intermediate results
E.g., Apply SELECT and PROJECT operations before
applying the JOIN or other binary operations
38
Using Heuristics in Query Optimization (cont…)
Heuristic Optimization of Query Trees:
The same query could correspond to many different relational algebra
39
Steps in converting a query tree during heuristic optimization:
Initial query tree for the query Q
40
Steps in converting a query tree during heuristic optimization:
41
Steps in converting a query tree during heuristic optimization:
Applying the more restrictive select operation first
42
Steps in converting a query tree during heuristic optimization:
Replacing Cartesian product and select with join
43
Steps in converting a query tree during heuristic optimization:
Moving project operations down the query tree
44
Using Heuristics in Query Optimization
Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that
reduce the size of intermediate results
2. Perform select operations as early as possible to reduce
the number of tuples and perform project operations as
early as possible to reduce the number of attributes. (This
is done by moving select and project operations as far
down the tree as possible.)
3. The select and join operations that are most restrictive
should be executed before other similar operations. (This
is done by reordering the leaf nodes of the tree among
themselves and adjusting the rest of the tree
appropriately.)
45
Exercise
Consider the following schemas and the query, where the
EMPLOYEE and the PROJECT relations are related by the
WORKS_ON relation.
EMPLOYEE (EEmpID, FName, LName, Salary, Dept, Sex, DoB)
46
Using Selectivity and Cost Estimates in Query
Optimization
Cost-based query optimization:
Estimate and compare the costs of executing a query
2. Computation cost
4. Communication cost
Note: Different database systems may focus on different
cost components.
47
Using Selectivity and Cost Estimates in
Query Optimization
Catalog Information Used in Cost Functions
Information about the size of a file
number of records (tuples) (r),
record size (R),
number of blocks (b)
blocking factor (bfr) –a factor of block size and record size
Information about indexes and indexing attributes of a file
Number of levels (x) of each multilevel index
Number of first-level index blocks (bI1)
Number of distinct values (d) of an attribute
Selectivity (sl) of an attribute-r/d
48
Semantic Query Optimization
Semantic Query Optimization:
Uses constraints specified on the database schema in order to
stated that no employee can earn more than his or her direct
supervisor. If the semantic query optimizer checks for the
existence of this constraint, it need not execute the query at all
because it knows that the result of the query will be empty.
49