dbi3
Query Optimization
Lecture 3
Chapter 14: Query Optimization
Introduction
Transformation of Relational Expressions
Estimating Statistics of Expression Results
Choice of Evaluation Plans
Introduction
Alternative ways of evaluating a given query
Equivalent expressions
Different algorithms for each operation
Introduction (Cont.)
An evaluation plan defines exactly what algorithm is used for each
operation, and how the execution of the operations is coordinated.
Introduction (Cont.)
Cost difference between evaluation plans for a query can be enormous
E.g. seconds vs. days in some cases
Steps in cost-based query optimization
1. Generate logically equivalent expressions using equivalence rules
2. Annotate resultant expressions to get alternative query plans
3. Choose the cheapest plan based on estimated cost
Estimate of plan cost based on:
Statistical information about relations. e.g. -
number of tuples, number of distinct values for an attribute
Statistics estimation for intermediate results
to compute cost of complex expressions
Cost formulae for algorithms, computed using statistics
Transformation of Relational Expressions
Two relational algebra expressions are said to be equivalent if the
two expressions generate the same set of tuples on every legal
database instance
Note: order of tuples is irrelevant
In SQL, inputs and outputs are multisets of tuples
Two expressions in the multiset version of the relational algebra
are said to be equivalent if the two expressions generate the same
multiset of tuples on every legal database instance.
An equivalence rule says that expressions of two forms are
equivalent
Can replace expression of first form by second, or vice versa
Equivalence Rules
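1. Conjunctive selection operations can be deconstructed into a
sequence of individual selections.
$\sigma_{\theta_1 \land \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E))$
2. Selection operations are commutative.
$\sigma_{\theta_1}(\sigma_{\theta_2}(E)) = \sigma_{\theta_2}(\sigma_{\theta_1}(E))$
4. (b) $\sigma_{\theta_1}(E_1 \bowtie_{\theta_2} E_2) = E_1 \bowtie_{\theta_1 \land \theta_2} E_2$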
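Rules 1 and 2 above can be spot-checked on a small multiset instance. The sketch below is only an illustration: the `account` tuples, the predicates `theta1` and `theta2`, and the helper names are all made up for this example, and agreement on one instance does not prove equivalence on every legal instance.

```python
from collections import Counter

def select(pred, rel):
    """Multiset selection: keep the tuples of rel that satisfy pred."""
    return [t for t in rel if pred(t)]

def same_multiset(r1, r2):
    """Compare two relations as multisets (order of tuples is irrelevant)."""
    return Counter(r1) == Counter(r2)

# Hypothetical account relation: (account_number, branch_name, balance)
account = [
    ("A-101", "Downtown", 500),
    ("A-102", "Perryridge", 1200),
    ("A-201", "Brighton", 900),
    ("A-215", "Downtown", 1700),
    ("A-215", "Downtown", 1700),  # duplicate, to exercise multiset semantics
]

theta1 = lambda t: t[1] == "Downtown"   # assumed predicate on branch_name
theta2 = lambda t: t[2] > 1000          # assumed predicate on balance

# Rule 1: one conjunctive selection vs. a cascade of two selections
conjunctive = select(lambda t: theta1(t) and theta2(t), account)
cascade_12 = select(theta1, select(theta2, account))

# Rule 2: the two selections applied in the opposite order
cascade_21 = select(theta2, select(theta1, account))

rule1_holds = same_multiset(conjunctive, cascade_12)
rule2_holds = same_multiset(cascade_12, cascade_21)
print(rule1_holds, rule2_holds)
```

Comparing with `Counter` rather than `set` keeps duplicate tuples significant, which matches the multiset definition of equivalence used for SQL.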
Equivalence Rules (Cont.)
5. Theta-join operations (and natural joins) are commutative.
$E_1 \bowtie_\theta E_2 = E_2 \bowtie_\theta E_1$
6. (a) Natural join operations are associative:
$(E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3)$
Equivalence Rules (Cont.)
7. The selection operation distributes over the theta join operation under
the following two conditions:
(a) When all the attributes in $\theta_0$ involve only the attributes of one
of the expressions ($E_1$) being joined.
$\sigma_{\theta_0}(E_1 \bowtie_\theta E_2) = (\sigma_{\theta_0}(E_1)) \bowtie_\theta E_2$
Pictorial Depiction of Equivalence Rules
Equivalence Rules (Cont.)
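8. The projection operation distributes over the theta join operation as
follows:
Consider a join $E_1 \bowtie_\theta E_2$.
Let $L_1$ and $L_2$ be sets of attributes from $E_1$ and $E_2$, respectively.
Let $L_3$ be attributes of $E_1$ that are involved in join condition $\theta$, but are
not in $L_1 \cup L_2$, and
let $L_4$ be attributes of $E_2$ that are involved in join condition $\theta$, but are
not in $L_1 \cup L_2$.
$\Pi_{L_1 \cup L_2}(E_1 \bowtie_\theta E_2) = \Pi_{L_1 \cup L_2}((\Pi_{L_1 \cup L_3}(E_1)) \bowtie_\theta (\Pi_{L_2 \cup L_4}(E_2)))$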
Equivalence Rules (Cont.)
9. The set operations union and intersection are commutative
$E_1 \cup E_2 = E_2 \cup E_1$
$E_1 \cap E_2 = E_2 \cap E_1$
(set difference is not commutative).
10. Set union and intersection are associative.
$(E_1 \cup E_2) \cup E_3 = E_1 \cup (E_2 \cup E_3)$
$(E_1 \cap E_2) \cap E_3 = E_1 \cap (E_2 \cap E_3)$
11. The selection operation distributes over $\cup$, $\cap$ and –.
$\sigma_\theta(E_1 - E_2) = \sigma_\theta(E_1) - \sigma_\theta(E_2)$
and similarly for $\cup$ and $\cap$ in place of –
Also: $\sigma_\theta(E_1 - E_2) = \sigma_\theta(E_1) - E_2$
and similarly for $\cap$ in place of –, but not for $\cup$
12. The projection operation distributes over union
$\Pi_L(E_1 \cup E_2) = (\Pi_L(E_1)) \cup (\Pi_L(E_2))$
Transformation Example: Pushing Selections
Example with Multiple Transformations
Query: Find the names of all customers with an account at a
Brooklyn branch whose account balance is over $1000.
$\Pi_{customer\_name}(\sigma_{branch\_city = \text{"Brooklyn"} \land balance > 1000}(branch \bowtie (account \bowtie depositor)))$
Transformation using join associativity (Rules 6a, 7a):
$\Pi_{customer\_name}((\sigma_{branch\_city = \text{"Brooklyn"} \land balance > 1000}(branch \bowtie account)) \bowtie depositor)$
Second form provides an opportunity to apply the “perform
selections early” rule, resulting in the subexpression (Rule 7b)
$\sigma_{branch\_city = \text{"Brooklyn"}}(branch) \bowtie \sigma_{balance > 1000}(account)$
Thus a sequence of transformations can be useful
Branch_schema=(branch_name,branch_city,assets)
Account_schema=(account_number,branch_name,balance)
Depositor_schema=(customer_name,account_number)
Multiple Transformations (Cont.)
Transformation Example: Pushing Projections
When we compute
$\sigma_{branch\_city = \text{"Brooklyn"}}(branch) \bowtie account$
Join Ordering Example
For all relations $r_1$, $r_2$, and $r_3$,
$(r_1 \bowtie r_2) \bowtie r_3 = r_1 \bowtie (r_2 \bowtie r_3)$
(Join Associativity)
If $r_2 \bowtie r_3$ is quite large and $r_1 \bowtie r_2$ is small, we choose
$(r_1 \bowtie r_2) \bowtie r_3$
so that we compute and store a smaller temporary relation.
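A toy instance makes the effect of join order on the temporary relation concrete. This is a minimal sketch with invented data: the relations `r1`, `r2`, `r3`, their attributes `a` to `d`, and the naive `natural_join` helper are assumptions for illustration, not how a real engine evaluates joins.

```python
from collections import Counter

def natural_join(r, s):
    """Naive nested-loop natural join of two lists of dicts."""
    out = []
    for t in r:
        for u in s:
            common = t.keys() & u.keys()
            # Tuples match when they agree on every shared attribute
            if all(t[a] == u[a] for a in common):
                out.append({**t, **u})
    return out

def as_multiset(rel):
    """Order-independent view of a relation, for comparing results."""
    return Counter(tuple(sorted(t.items())) for t in rel)

# Invented data: r1(a, b) is tiny, r2(b, c) is small, r3(c, d) is larger
r1 = [{"a": 1, "b": 0}]
r2 = [{"b": i % 5, "c": i} for i in range(20)]
r3 = [{"c": i, "d": j} for i in range(20) for j in range(10)]

left_first = natural_join(r1, r2)    # temporary relation for (r1 join r2) join r3
right_first = natural_join(r2, r3)   # temporary relation for r1 join (r2 join r3)

result_a = natural_join(left_first, r3)
result_b = natural_join(r1, right_first)

print(len(left_first), len(right_first))          # sizes of the two temporaries
print(as_multiset(result_a) == as_multiset(result_b))
```

Both orders produce the same final result, but the first stores a far smaller temporary relation on this data, which is the reason for preferring it.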
Join Ordering Example (Cont.)
Consider the expression
$\Pi_{customer\_name}((\sigma_{branch\_city = \text{"Brooklyn"}}(branch)) \bowtie (account \bowtie depositor))$
Could compute $account \bowtie depositor$ first, and join result with
$\sigma_{branch\_city = \text{"Brooklyn"}}(branch)$
but $account \bowtie depositor$ is likely to be a large relation.
Only a small fraction of the bank’s customers are likely to have
accounts in branches located in Brooklyn, so it is better to compute
$\sigma_{branch\_city = \text{"Brooklyn"}}(branch) \bowtie account$
first.
Exercise:
Draw the expression trees and select the best options
Assignment(emp_no,proj_no,hours,……….)
Cost Estimation
Cost of each operator computed as described in Chapter 13
Need statistics of input relations
E.g. number of tuples, sizes of tuples
Inputs can be results of sub-expressions
Need to estimate statistics of expression results
To do so, we require additional statistics
E.g. number of distinct values for an attribute
Statistical Information for Cost Estimation
Histograms
Histogram on attribute age of relation person
Equi-width histograms
Equi-depth histograms
If no histogram is available, the optimizer assumes that the distribution
is uniform
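The difference between the two histogram types can be sketched in a few lines. The `ages` sample, the bucket count, and both helper functions are assumptions made up for this illustration; real systems store and build histograms in their own ways.

```python
def equi_width(values, lo, hi, buckets):
    """Count values per bucket; every bucket spans the same value range."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        # Clamp so that v == hi falls into the last bucket
        i = min(int((v - lo) / width), buckets - 1)
        counts[i] += 1
    return counts

def equi_depth(values, buckets):
    """Return the boundaries that split the sorted values into
    buckets holding (about) the same number of values each."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // buckets] for i in range(1, buckets)]

# Hypothetical sample of the age attribute of relation person
ages = [21, 22, 22, 23, 24, 25, 25, 26, 30, 31, 35, 40, 45, 52, 60, 75]

width_counts = equi_width(ages, 20, 80, 4)   # ranges fixed, counts vary
depth_bounds = equi_depth(ages, 4)           # counts fixed, ranges vary

print(width_counts)   # [10, 3, 2, 1]
print(depth_bounds)   # [24, 30, 45]
```

On skewed data like this sample, the equi-width counts are very uneven, while the equi-depth boundaries crowd together where the values are dense.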
Selection Size Estimation
$\sigma_{A=v}(r)$
$n_r / V(A,r)$: number of records that will satisfy the selection
Equality condition on a key attribute: size estimate = 1
$\sigma_{A \le v}(r)$ (case of $\sigma_{A \ge v}(r)$ is symmetric)
Let c denote the estimated number of tuples satisfying the condition.
If min(A,r) and max(A,r) are available in catalog
c = 0 if v < min(A,r)
c = $n_r$ if v > max(A,r)
$c = n_r \cdot \frac{v - \min(A,r)}{\max(A,r) - \min(A,r)}$ otherwise
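The estimates above translate directly into code. In this sketch the function names, parameter names, and the sample statistics are all hypothetical, and values are assumed to be uniformly distributed between min(A,r) and max(A,r).

```python
def estimate_equality(n_r, distinct_values, is_key=False):
    """Estimated size of an equality selection A = v on r.

    n_r is the number of tuples in r, distinct_values is V(A, r).
    """
    if is_key:
        return 1  # at most one tuple can match a key value
    return n_r / distinct_values

def estimate_leq(n_r, v, min_a, max_a):
    """Estimated size of the range selection A <= v on r,
    assuming a uniform distribution between min_a and max_a."""
    if v < min_a:
        return 0
    if v >= max_a:
        # v == max_a is handled here too: the linear formula would give
        # n_r anyway, and this avoids dividing by zero when min_a == max_a
        return n_r
    return n_r * (v - min_a) / (max_a - min_a)

# Made-up catalog statistics for illustration
eq_estimate = estimate_equality(n_r=10000, distinct_values=50)      # 200.0
range_estimate = estimate_leq(n_r=10000, v=1000, min_a=0, max_a=5000)  # 2000.0
print(eq_estimate, range_estimate)
```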
Size Estimation of Complex Selections
Size Estimation for Other Operations
Projection: estimated size of $\Pi_A(r)$ = V(A,r) /* eliminates duplicates */
Aggregation: estimated size of ${}_A\mathcal{G}_F(r)$ = V(A,r) /* group on A */
Set operations
For unions/intersections of selections on the same relation:
rewrite and use size estimate for selections
E.g. $\sigma_{\theta_1}(r) \cup \sigma_{\theta_2}(r)$ can be rewritten as $\sigma_{\theta_1 \lor \theta_2}(r)$
For operations on different relations:
estimated size of $r \cup s$ = size of r + size of s.
estimated size of $r \cap s$ = minimum size of r and size of s.
estimated size of $r - s$ = size of r
All the three estimates may be quite inaccurate, but provide
upper bounds on the sizes.
It is also possible to estimate the size of joins
Choice of Evaluation Plans
Must consider the interaction of evaluation techniques when choosing
evaluation plans
choosing the cheapest algorithm for each operation independently
may not yield best overall algorithm. E.g.
merge-join may be costlier than hash-join, but may provide a
sorted output which reduces the cost for an outer level
aggregation.
nested-loop join may provide opportunity for pipelining
Practical query optimizers incorporate elements of the following two
broad approaches:
1. Search all the plans and choose the best plan in a
cost-based fashion.
2. Use heuristics to choose a plan.
Cost-Based Optimization
Consider finding the best join-order for $r_1 \bowtie r_2 \bowtie \ldots \bowtie r_n$.
There are (2(n – 1))!/(n – 1)! different join orders for above expression.
With n = 7, the number is 665280, with n = 10, the number is greater
than 17.6 billion!
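The formula can be evaluated directly to see how quickly the search space grows. The function name below is a label chosen for this sketch.

```python
from math import factorial

def num_join_orders(n):
    """Number of join orders for n relations: (2(n - 1))! / (n - 1)!"""
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in (3, 5, 7, 10):
    print(n, num_join_orders(n))
```

Integer division is exact here because (n - 1)! always divides (2(n - 1))!.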
No need to generate all the join orders. Using dynamic programming,
the least-cost join order for any subset of
{r1, r2, . . . rn} is computed only once and stored for future use.
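The dynamic-programming idea can be sketched as follows. Everything about the cost model here is a simplifying assumption for illustration: result sizes are estimated as the product of input sizes and pairwise join selectivities, and the cost of a plan is the sum of the sizes of the results it produces. The relation sizes and selectivities are invented, and real optimizers use far more detailed cost formulae.

```python
from itertools import combinations

def best_join_order(sizes, selectivity):
    """Return (cost, plan) for the cheapest join order.

    sizes: dict mapping relation name -> number of tuples.
    selectivity: dict mapping frozenset({a, b}) -> join selectivity;
    a missing pair means no join predicate (Cartesian product).
    """
    names = sorted(sizes)

    def card(subset):
        # Assumed estimate: product of sizes times pairwise selectivities
        est = 1.0
        for r in subset:
            est *= sizes[r]
        for a, b in combinations(sorted(subset), 2):
            est *= selectivity.get(frozenset((a, b)), 1.0)
        return est

    # best[S] holds the cheapest (cost, plan) found for the subset S;
    # each subset is solved once and then reused by every larger subset
    best = {frozenset([r]): (0.0, r) for r in names}
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            s = frozenset(combo)
            out = card(s)
            candidates = []
            for i in range(1, k):
                for left in combinations(combo, i):
                    lhs = frozenset(left)
                    rhs = s - lhs
                    cost = best[lhs][0] + best[rhs][0] + out
                    candidates.append((cost, (best[lhs][1], best[rhs][1])))
            best[s] = min(candidates, key=lambda c: c[0])
    return best[frozenset(names)]

# Invented statistics: 5 selected branches, 10000 accounts, 10000 depositors
sizes = {"branch": 5, "account": 10000, "depositor": 10000}
selectivity = {
    frozenset(("branch", "account")): 1 / 50,
    frozenset(("account", "depositor")): 1 / 10000,
}

cost, plan = best_join_order(sizes, selectivity)
print(cost, plan)
```

With these numbers the cheapest plan joins branch with account first and brings in depositor last, in line with the join ordering example earlier in the lecture.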
Heuristic Optimization
Cost-based optimization is expensive, even with dynamic programming.
Systems may use heuristics to reduce the number of choices that must
be made in a cost-based fashion.
Heuristic optimization transforms the query-tree by using a set of rules
that typically (but not in all cases) improve execution performance:
Perform selection early (reduces the number of tuples)
Perform projection early (reduces the number of attributes)
Perform most restrictive selection and join operations (i.e. with
smallest result size) before other similar operations.
Some systems use only heuristics, others combine heuristics with
partial cost-based optimization.