dbi3

The document discusses query optimization in databases, focusing on the transformation of relational expressions, cost estimation, and the selection of evaluation plans. It outlines the steps involved in cost-based query optimization, including generating equivalent expressions and estimating the cost of different evaluation plans based on statistical information. Additionally, it presents various equivalence rules for relational algebra operations that can be used to optimize query execution.

Database Internals

Query Optimization
Lecture 3

Chapter 14: Query Optimization

• Introduction
• Transformation of Relational Expressions
• Estimating Statistics of Expression Results
• Choice of Evaluation Plans

Introduction
• Alternative ways of evaluating a given query
  - Equivalent expressions
  - Different algorithms for each operation

Introduction (Cont.)
• An evaluation plan defines exactly which algorithm is used for each operation, and how the execution of the operations is coordinated.

Introduction (Cont.)
• The cost difference between evaluation plans for a query can be enormous
  - E.g. seconds vs. days in some cases
• Steps in cost-based query optimization
  1. Generate logically equivalent expressions using equivalence rules
  2. Annotate the resultant expressions to get alternative query plans
  3. Choose the cheapest plan based on estimated cost
• The estimate of plan cost is based on:
  - Statistical information about relations, e.g.
    • number of tuples, number of distinct values for an attribute
  - Statistics estimation for intermediate results
    • to compute the cost of complex expressions
  - Cost formulae for algorithms, computed using statistics

Transformation of Relational Expressions
• Two relational algebra expressions are said to be equivalent if they generate the same set of tuples on every legal database instance
  - Note: the order of tuples is irrelevant
• In SQL, inputs and outputs are multisets of tuples
  - Two expressions in the multiset version of the relational algebra are said to be equivalent if they generate the same multiset of tuples on every legal database instance.
• An equivalence rule says that expressions of two forms are equivalent
  - Can replace an expression of the first form by the second, or vice versa

Equivalence Rules
1. Conjunctive selection operations can be deconstructed into a sequence of individual selections.
   σ_{θ1∧θ2}(E) = σ_θ1(σ_θ2(E))
2. Selection operations are commutative.
   σ_θ1(σ_θ2(E)) = σ_θ2(σ_θ1(E))
3. Only the last in a sequence of projection operations is needed; the others can be omitted.
   Π_L1(Π_L2(...(Π_Ln(E))...)) = Π_L1(E)
4. Selections can be combined with Cartesian products and theta joins.
   a. σ_θ(E1 × E2) = E1 ⋈_θ E2
   b. σ_θ1(E1 ⋈_θ2 E2) = E1 ⋈_{θ1∧θ2} E2
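
Rules 1, 2 and 4a can be spot-checked on a small instance. The Python sketch below is illustrative only: the relations E, E1 and E2, the predicates, and the helper names (select, cartesian, theta_join, same_multiset) are all assumptions made for this example. Agreement on one instance is a sanity check, not a proof of equivalence.

```python
from itertools import product

# A relation is modeled as a list of dicts (a list keeps duplicates, as in SQL).
E = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}, {"a": 4, "b": 40}]
E1 = [{"x": 1}, {"x": 2}]
E2 = [{"y": 2}, {"y": 3}]


def select(pred, rel):
    """sigma_pred(rel): keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]


def cartesian(r, s):
    """r x s: every pairing of a tuple of r with a tuple of s."""
    return [{**t, **u} for t, u in product(r, s)]


def theta_join(pred, r, s):
    """r join_pred s: pairings of r and s tuples that satisfy pred."""
    return [{**t, **u} for t in r for u in s if pred({**t, **u})]


def same_multiset(r, s):
    """Compare two relations as multisets, ignoring tuple order."""
    key = lambda t: sorted(t.items())
    return sorted(r, key=key) == sorted(s, key=key)


theta1 = lambda t: t["a"] > 1
theta2 = lambda t: t["b"] < 40
theta = lambda t: t["x"] == t["y"]

# Rule 1: a conjunctive selection equals a cascade of selections.
rule1 = same_multiset(
    select(lambda t: theta1(t) and theta2(t), E),
    select(theta1, select(theta2, E)),
)
# Rule 2: selections commute.
rule2 = same_multiset(
    select(theta1, select(theta2, E)),
    select(theta2, select(theta1, E)),
)
# Rule 4a: a selection over a Cartesian product is a theta join.
rule4a = same_multiset(
    select(theta, cartesian(E1, E2)),
    theta_join(theta, E1, E2),
)
print(rule1, rule2, rule4a)
```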

Equivalence Rules (Cont.)
5. Theta-join operations (and natural joins) are commutative.
   E1 ⋈_θ E2 = E2 ⋈_θ E1
6. (a) Natural join operations are associative:
       (E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
   (b) Theta joins are associative in the following manner:
       (E1 ⋈_θ1 E2) ⋈_{θ2∧θ3} E3 = E1 ⋈_{θ1∧θ3} (E2 ⋈_θ2 E3)
       where θ2 involves attributes from only E2 and E3.

Equivalence Rules (Cont.)
7. The selection operation distributes over the theta join operation under the following two conditions:
   (a) When all the attributes in θ0 involve only the attributes of one of the expressions (E1) being joined.
       σ_θ0(E1 ⋈_θ E2) = (σ_θ0(E1)) ⋈_θ E2
   (b) When θ1 involves only the attributes of E1 and θ2 involves only the attributes of E2.
       σ_{θ1∧θ2}(E1 ⋈_θ E2) = (σ_θ1(E1)) ⋈_θ (σ_θ2(E2))

Pictorial Depiction of Equivalence Rules

Equivalence Rules (Cont.)
8. The projection operation distributes over the theta join operation as follows:
   Consider a join E1 ⋈_θ E2.
   • Let L1 and L2 be sets of attributes from E1 and E2, respectively.
   • Let L3 be attributes of E1 that are involved in join condition θ, but are not in L1 ∪ L2, and
   • let L4 be attributes of E2 that are involved in join condition θ, but are not in L1 ∪ L2.
   (a) If θ involves only attributes from L1 ∪ L2:
       Π_{L1∪L2}(E1 ⋈_θ E2) = (Π_L1(E1)) ⋈_θ (Π_L2(E2))
   (b) General rule:
       Π_{L1∪L2}(E1 ⋈_θ E2) = Π_{L1∪L2}((Π_{L1∪L3}(E1)) ⋈_θ (Π_{L2∪L4}(E2)))

Equivalence Rules (Cont.)
9. The set operations union and intersection are commutative.
   E1 ∪ E2 = E2 ∪ E1
   E1 ∩ E2 = E2 ∩ E1
   • (Set difference is not commutative.)
10. Set union and intersection are associative.
   (E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)
   (E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)
11. The selection operation distributes over ∪, ∩ and –.
   σ_θ(E1 – E2) = σ_θ(E1) – σ_θ(E2)
   and similarly for ∪ and ∩ in place of –
   Also: σ_θ(E1 – E2) = σ_θ(E1) – E2
   and similarly for ∩ in place of –, but not for ∪
12. The projection operation distributes over union.
   Π_L(E1 ∪ E2) = (Π_L(E1)) ∪ (Π_L(E2))

Transformation Example: Pushing Selections

• Query: Find the names of all customers who have an account at some branch located in Brooklyn.
  Π_customer_name(σ_{branch_city = “Brooklyn”}(branch ⋈ (account ⋈ depositor)))
• Transformation using rule 7a:
  Π_customer_name((σ_{branch_city = “Brooklyn”}(branch)) ⋈ (account ⋈ depositor))
• Performing the selection as early as possible reduces the size of the relations to be joined.
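
As a rough illustration of pushing a selection, the Python sketch below evaluates both forms of the Brooklyn query on a tiny instance. The tuples are invented for this example, and select, natural_join and customer_names are hypothetical helpers, not part of any database system. Both forms give the same names, but the pushed-down form feeds one branch tuple into the join instead of three.

```python
# Invented sample data following the branch / account / depositor schemas.
branch = [
    {"branch_name": "Downtown", "branch_city": "Brooklyn", "assets": 9000000},
    {"branch_name": "Perryridge", "branch_city": "Horseneck", "assets": 1700000},
    {"branch_name": "Redwood", "branch_city": "Palo Alto", "assets": 2100000},
]
account = [
    {"account_number": "A-101", "branch_name": "Downtown", "balance": 500},
    {"account_number": "A-102", "branch_name": "Perryridge", "balance": 400},
    {"account_number": "A-222", "branch_name": "Redwood", "balance": 700},
    {"account_number": "A-217", "branch_name": "Downtown", "balance": 750},
]
depositor = [
    {"customer_name": "Johnson", "account_number": "A-101"},
    {"customer_name": "Hayes", "account_number": "A-102"},
    {"customer_name": "Lindsay", "account_number": "A-222"},
    {"customer_name": "Jones", "account_number": "A-217"},
]


def select(pred, rel):
    """Keep the tuples of rel that satisfy pred."""
    return [t for t in rel if pred(t)]


def natural_join(r, s):
    """Pair tuples that agree on every attribute name they share."""
    out = []
    for t in r:
        for u in s:
            if all(t[a] == u[a] for a in t.keys() & u.keys()):
                out.append({**t, **u})
    return out


def customer_names(rel):
    """Projection on customer_name, with duplicates eliminated."""
    return {t["customer_name"] for t in rel}


in_brooklyn = lambda t: t["branch_city"] == "Brooklyn"

# Original form: join everything, then select.
names_original = customer_names(
    select(in_brooklyn, natural_join(branch, natural_join(account, depositor)))
)

# After rule 7a: select on branch first, then join.
brooklyn_branches = select(in_brooklyn, branch)
names_pushed = customer_names(
    natural_join(brooklyn_branches, natural_join(account, depositor))
)

print(names_original == names_pushed)
print(len(branch), "branch tuples joined before vs", len(brooklyn_branches), "after")
```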

Example with Multiple Transformations
• Query: Find the names of all customers with an account at a Brooklyn branch whose account balance is over $1000.
  Π_customer_name(σ_{branch_city = “Brooklyn” ∧ balance > 1000}(branch ⋈ (account ⋈ depositor)))
• Transformation using join associativity (Rules 6a, 7a):
  Π_customer_name((σ_{branch_city = “Brooklyn” ∧ balance > 1000}(branch ⋈ account)) ⋈ depositor)
• The second form provides an opportunity to apply the “perform selections early” rule, resulting in the subexpression (Rule 7b)
  σ_{branch_city = “Brooklyn”}(branch) ⋈ σ_{balance > 1000}(account)
• Thus a sequence of transformations can be useful

Branch_schema = (branch_name, branch_city, assets)
Account_schema = (account_number, branch_name, balance)
Depositor_schema = (customer_name, account_number)

Multiple Transformations (Cont.)

Transformation Example: Pushing Projections

Π_customer_name((σ_{branch_city = “Brooklyn”}(branch) ⋈ account) ⋈ depositor)

• When we compute
  (σ_{branch_city = “Brooklyn”}(branch) ⋈ account)
  we obtain a relation whose schema is:
  (branch_name, branch_city, assets, account_number, balance)
• Push projections using equivalence rules 8a and 8b; eliminate unneeded attributes from intermediate results to get:
  Π_customer_name((Π_account_number((σ_{branch_city = “Brooklyn”}(branch)) ⋈ account)) ⋈ depositor)
• Performing the projection as early as possible reduces the size of the relation to be joined.
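
The effect of the early projection can be seen by counting attributes in the intermediate result. The following Python sketch uses invented tuples and hypothetical helpers (natural_join, project); it is only meant to show that the intermediate relation shrinks from five attributes to one while the final answer stays the same.

```python
# Invented sample data following the branch / account / depositor schemas.
branch = [
    {"branch_name": "Downtown", "branch_city": "Brooklyn", "assets": 9000000},
    {"branch_name": "Redwood", "branch_city": "Palo Alto", "assets": 2100000},
]
account = [
    {"account_number": "A-101", "branch_name": "Downtown", "balance": 500},
    {"account_number": "A-217", "branch_name": "Downtown", "balance": 750},
    {"account_number": "A-222", "branch_name": "Redwood", "balance": 700},
]
depositor = [
    {"customer_name": "Johnson", "account_number": "A-101"},
    {"customer_name": "Jones", "account_number": "A-217"},
    {"customer_name": "Lindsay", "account_number": "A-222"},
]


def natural_join(r, s):
    """Pair tuples that agree on every attribute name they share."""
    out = []
    for t in r:
        for u in s:
            if all(t[a] == u[a] for a in t.keys() & u.keys()):
                out.append({**t, **u})
    return out


def project(attrs, rel):
    """Keep only attrs, eliminating duplicate tuples (set semantics)."""
    seen, out = set(), []
    for t in rel:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


brooklyn = [b for b in branch if b["branch_city"] == "Brooklyn"]

# Intermediate result without the pushed projection: five attributes per tuple.
wide = natural_join(brooklyn, account)
# Intermediate result with the projection pushed down: only the join attribute.
narrow = project(["account_number"], wide)

names_wide = project(["customer_name"], natural_join(wide, depositor))
names_narrow = project(["customer_name"], natural_join(narrow, depositor))

print(len(wide[0]), "attributes vs", len(narrow[0]))
print(names_wide == names_narrow)
```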

Join Ordering Example
• For all relations r1, r2, and r3,
  (r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)
  (Join Associativity)
• If r2 ⋈ r3 is quite large and r1 ⋈ r2 is small, we choose
  (r1 ⋈ r2) ⋈ r3
  so that we compute and store a smaller temporary relation.

Join Ordering Example (Cont.)
• Consider the expression
  Π_customer_name((σ_{branch_city = “Brooklyn”}(branch)) ⋈ (account ⋈ depositor))
• Could compute account ⋈ depositor first, and join the result with
  σ_{branch_city = “Brooklyn”}(branch)
  but account ⋈ depositor is likely to be a large relation.
• Only a small fraction of the bank’s customers are likely to have accounts in branches located in Brooklyn
  - it is better to compute
    σ_{branch_city = “Brooklyn”}(branch) ⋈ account
    first.
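
The two join orders can be compared by the size of the temporary relation each one builds. The Python sketch below uses synthetic data (10 branches, 100 accounts, one depositor per account, a single Brooklyn branch); these numbers and the natural_join helper are assumptions chosen only to make the size difference visible.

```python
# Synthetic data: 10 branches (one in Brooklyn), 100 accounts, 100 depositors.
branch = [
    {"branch_name": f"B{i}", "branch_city": "Brooklyn" if i == 0 else "Elsewhere"}
    for i in range(10)
]
account = [{"account_number": n, "branch_name": f"B{n % 10}"} for n in range(100)]
depositor = [{"customer_name": f"C{n}", "account_number": n} for n in range(100)]


def natural_join(r, s):
    """Pair tuples that agree on every attribute name they share."""
    out = []
    for t in r:
        for u in s:
            if all(t[a] == u[a] for a in t.keys() & u.keys()):
                out.append({**t, **u})
    return out


brooklyn = [b for b in branch if b["branch_city"] == "Brooklyn"]

# Order A: compute account join depositor first (large temporary relation).
temp_a = natural_join(account, depositor)
result_a = natural_join(brooklyn, temp_a)

# Order B: compute sigma(branch) join account first (small temporary relation).
temp_b = natural_join(brooklyn, account)
result_b = natural_join(temp_b, depositor)

names_a = {t["customer_name"] for t in result_a}
names_b = {t["customer_name"] for t in result_b}

print("temporary relation sizes:", len(temp_a), "vs", len(temp_b))
print("same answer:", names_a == names_b)
```

With this data the first order stores 100 temporary tuples and the second stores 10, while both produce the same 10 customer names.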

Exercise:
Draw the expression trees and select the best option.

(A) Π_{proj_name, budget}((σ_{emp_city = “Moratuwa”}(Employee)) ⋈ (Assignment ⋈ Project))

(B) Π_{proj_name, budget}(σ_{emp_city = “Moratuwa”}((Employee ⋈ Assignment) ⋈ Project))

Write a better evaluation expression using the following information:

Employee(emp_no, emp_name, emp_city, ...)

Assignment(emp_no, proj_no, hours, ...)

Project(proj_name, budget, proj_no, ...)

Cost Estimation
• Cost of each operator computed as described in Chapter 13
  - Need statistics of input relations
    • E.g. number of tuples, sizes of tuples
• Inputs can be results of sub-expressions
  - Need to estimate statistics of expression results
  - To do so, we require additional statistics
    • E.g. number of distinct values for an attribute

Statistical Information for Cost Estimation

• n_r: number of tuples in a relation r.
• b_r: number of blocks containing tuples of r.
• l_r: size of a tuple of r (in bytes).
• f_r: blocking factor of r, i.e. the number of tuples of r that fit into one block.
• V(A, r): number of distinct values that appear in r for attribute A; same as the size of Π_A(r).
• If tuples of r are stored together physically in a file, then:
  b_r = ⌈n_r / f_r⌉
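
A small worked calculation of the block count, as a Python sketch. The block size of 4096 bytes and the tuple size of 100 bytes are made-up figures, and deriving the blocking factor by integer division assumes fixed-size tuples that never span blocks; neither assumption comes from the slides.

```python
def blocking_factor(block_size_bytes, tuple_size_bytes):
    """f_r: tuples per block, assuming fixed-size tuples that do not span blocks."""
    return block_size_bytes // tuple_size_bytes


def blocks_needed(n_r, f_r):
    """b_r = ceil(n_r / f_r), computed with integer arithmetic."""
    return (n_r + f_r - 1) // f_r


f_r = blocking_factor(4096, 100)     # hypothetical block and tuple sizes
b_r = blocks_needed(10000, f_r)      # hypothetical relation of 10,000 tuples
b_r_plus_one = blocks_needed(10001, f_r)

print(f_r, b_r, b_r_plus_one)
```

One extra tuple beyond a full block forces a whole additional block, which is why the formula rounds up.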

Histograms
• Histogram on attribute age of relation person

• Equi-width histograms
• Equi-depth histograms
• If no histogram is available, the optimizer assumes that the distribution is uniform
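
The difference between the two histogram kinds can be sketched in Python. The list of ages and the bucket count are invented, and equi_width and equi_depth are hypothetical helpers: the first splits the value range into equal-width intervals, the second picks boundaries so that each bucket holds about the same number of values.

```python
def equi_width(values, buckets):
    """Split [min, max] into equal-width ranges and count the values in each.

    Assumes max(values) > min(values). Returns (low, high, count) triples.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        # The maximum value would land at index `buckets`; clamp it into the last one.
        i = min(int((v - lo) / width), buckets - 1)
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def equi_depth(values, buckets):
    """Choose bucket boundaries so each bucket holds roughly the same number of values.

    Assumes len(values) >= buckets. Returns (low, high, count) triples.
    """
    ordered = sorted(values)
    n = len(ordered)
    out = []
    for i in range(buckets):
        chunk = ordered[i * n // buckets:(i + 1) * n // buckets]
        out.append((chunk[0], chunk[-1], len(chunk)))
    return out


# Invented ages, skewed toward the twenties.
ages = [21, 22, 23, 24, 25, 26, 27, 28, 30, 35, 40, 60]

width_hist = equi_width(ages, 4)
depth_hist = equi_depth(ages, 4)

for low, high, count in width_hist:
    print(f"equi-width [{low:.2f}, {high:.2f}): {count}")
for low, high, count in depth_hist:
    print(f"equi-depth [{low}, {high}]: {count}")
```

On skewed data the equi-width buckets have very uneven counts, while the equi-depth buckets have equal counts but uneven ranges.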

Selection Size Estimation
• σ_{A=v}(r)
  - n_r / V(A,r): number of records that will satisfy the selection
  - Equality condition on a key attribute: size estimate = 1
• σ_{A≤v}(r) (case of σ_{A≥v}(r) is symmetric)
  - Let c denote the estimated number of tuples satisfying the condition.
  - If min(A,r) and max(A,r) are available in the catalog
    • c = 0 if v < min(A,r)
    • c = n_r if v > max(A,r)
    • c = n_r · (v − min(A,r)) / (max(A,r) − min(A,r)) otherwise
  - If histograms are available, can refine the above estimate
  - In the absence of statistical information c is assumed to be n_r / 2.
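
The estimates above can be written as two small functions. This is a minimal Python sketch; the function names, parameter names, and the sample statistics are assumptions for illustration, and the range estimate relies on the uniform-distribution assumption used when no histogram is available.

```python
def estimate_equality(n_r, distinct_values):
    """Estimated size of sigma_{A=v}(r): n_r / V(A, r)."""
    return n_r / distinct_values


def estimate_less_equal(n_r, v, min_a=None, max_a=None):
    """Estimated size of sigma_{A<=v}(r).

    Falls back to n_r / 2 when min(A, r) and max(A, r) are not available.
    """
    if min_a is None or max_a is None:
        return n_r / 2
    if v < min_a:
        return 0
    if v >= max_a:
        # v == max_a is handled here too; the interpolation formula gives the
        # same value, and this avoids dividing by zero when min_a == max_a.
        return n_r
    return n_r * (v - min_a) / (max_a - min_a)


# Hypothetical statistics: 10,000 tuples, 50 distinct values, A ranges over 20..100.
print(estimate_equality(10000, 50))
print(estimate_less_equal(10000, 40, min_a=20, max_a=100))
print(estimate_less_equal(10000, 40))
```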

Size Estimation of Complex Selections

• The selectivity of a condition θi is the probability that a tuple in the relation r satisfies θi.
  - If s_i is the number of satisfying tuples in r, the selectivity of θi is given by s_i / n_r.
• Conjunction: σ_{θ1∧θ2∧...∧θn}(r). Assuming independence, the estimated number of tuples in the result is:
  n_r · (s_1 · s_2 · ... · s_n) / n_r^n
• Disjunction: σ_{θ1∨θ2∨...∨θn}(r). Estimated number of tuples:
  n_r · (1 − (1 − s_1/n_r) · (1 − s_2/n_r) · ... · (1 − s_n/n_r))
• Negation: σ_{¬θ}(r). Estimated number of tuples:
  n_r – size(σ_θ(r))
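
The three formulas translate directly into code. The Python sketch below is illustrative; the function names and the sample sizes (1000 tuples, conditions satisfied by 100 and 500 tuples) are assumptions, and the conjunction and disjunction estimates both rely on the independence assumption stated above.

```python
from math import prod


def conjunction_size(n_r, sizes):
    """n_r * (s_1 * ... * s_n) / n_r**n, i.e. n_r times the product of selectivities."""
    return n_r * prod(s / n_r for s in sizes)


def disjunction_size(n_r, sizes):
    """n_r * (1 - (1 - s_1/n_r) * ... * (1 - s_n/n_r))."""
    return n_r * (1 - prod(1 - s / n_r for s in sizes))


def negation_size(n_r, size_theta):
    """n_r - size(sigma_theta(r))."""
    return n_r - size_theta


n_r = 1000          # hypothetical relation size
sizes = [100, 500]  # hypothetical s_1 and s_2

print(conjunction_size(n_r, sizes))   # about 50
print(disjunction_size(n_r, sizes))   # about 550
print(negation_size(n_r, 100))
```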

Size Estimation for Other Operations
• Projection: estimated size of Π_A(r) = V(A,r) /* eliminates duplicates */
• Aggregation: estimated size of _A𝒢_F(r) = V(A,r) /* group on A */
• Set operations
  - For unions/intersections of selections on the same relation: rewrite and use size estimate for selections
    • E.g. σ_θ1(r) ∩ σ_θ2(r) can be rewritten as σ_{θ1∧θ2}(r)
  - For operations on different relations:
    • estimated size of r ∪ s = size of r + size of s.
    • estimated size of r ∩ s = minimum of size of r and size of s.
    • estimated size of r – s = size of r.
    • All three estimates may be quite inaccurate, but provide upper bounds on the sizes.
• It is also possible to estimate the size of joins

Choice of Evaluation Plans
• Must consider the interaction of evaluation techniques when choosing evaluation plans
  - Choosing the cheapest algorithm for each operation independently may not yield the best overall algorithm. E.g.
    • merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost for an outer-level aggregation.
    • nested-loop join may provide an opportunity for pipelining
• Practical query optimizers incorporate elements of the following two broad approaches:
  1. Search all the plans and choose the best plan in a cost-based fashion.
  2. Use heuristics to choose a plan.

Cost-Based Optimization
• Consider finding the best join order for r1 ⋈ r2 ⋈ . . . ⋈ rn.
• There are (2(n – 1))!/(n – 1)! different join orders for the above expression. With n = 7, the number is 665280; with n = 10, the number is greater than 17.6 billion!
• No need to generate all the join orders. Using dynamic programming, the least-cost join order for any subset of {r1, r2, . . . rn} is computed only once and stored for future use.
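
The counting formula and the dynamic-programming idea can be sketched in Python. The cost model here is a deliberate simplification and an assumption of this example, not the model from the slides: the cost of a plan is the sum of the estimated sizes of the join results it produces, and a result size is the product of the input cardinalities and hypothetical pairwise selectivities. The cardinalities and selectivities are invented.

```python
from itertools import combinations
from math import factorial


def join_order_count(n):
    """Number of join orders for n relations: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


def estimated_size(subset, cards, sel):
    """Assumed size model: product of cardinalities times pairwise selectivities."""
    size = 1.0
    for r in subset:
        size *= cards[r]
    for pair in combinations(sorted(subset), 2):
        size *= sel.get(pair, 1.0)  # no predicate on the pair: cross product
    return size


def best_join_order(cards, sel):
    """Return (cost, plan) for joining all relations in cards.

    best[S] holds the cheapest plan for subset S; each subset is solved once
    and then reused by every larger subset that contains it.
    """
    names = sorted(cards)
    best = {frozenset([r]): (0.0, r) for r in names}
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            subset = frozenset(combo)
            out_size = estimated_size(subset, cards, sel)
            candidates = []
            # Try every split into two non-empty parts (each split appears in
            # both orientations, which is harmless because joins commute).
            for i in range(1, k):
                for left in combinations(combo, i):
                    l, r = frozenset(left), subset - frozenset(left)
                    cost = best[l][0] + best[r][0] + out_size
                    candidates.append((cost, (best[l][1], best[r][1])))
            best[subset] = min(candidates, key=lambda c: c[0])
    return best[frozenset(names)]


print(join_order_count(7), join_order_count(10))

# Invented statistics: r1 join r2 is small, r2 join r3 is large.
cards = {"r1": 100, "r2": 1000, "r3": 1000}
sel = {("r1", "r2"): 0.001, ("r2", "r3"): 0.01}
cost, plan = best_join_order(cards, sel)
print(cost, plan)
```

With these invented numbers the cheapest plan joins r1 and r2 first, matching the earlier join ordering example where the smaller temporary relation is preferred.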

Heuristic Optimization
• Cost-based optimization is expensive, even with dynamic programming.
• Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.
• Heuristic optimization transforms the query tree by using a set of rules that typically (but not in all cases) improve execution performance:
  - Perform selection early (reduces the number of tuples)
  - Perform projection early (reduces the number of attributes)
  - Perform the most restrictive selection and join operations (i.e. with smallest result size) before other similar operations.
• Some systems use only heuristics; others combine heuristics with partial cost-based optimization.
