Data Warehousing: Need For Speed: Join Techniques
Need for Speed: Join Techniques
About Nested-Loop Join
Nested-Loop Join: Code
FOR i = 1 TO N DO BEGIN                  /* N rows in T1 */
    IF the ith row of T1 qualifies THEN BEGIN
        FOR j = 1 TO M DO BEGIN          /* M rows in T2 */
            IF the ith row of T1 matches the jth row of T2 on join key THEN BEGIN
                IF the jth row of T2 qualifies THEN BEGIN
                    produce output row
                END
            END
        END
    END
END
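The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a production implementation: the function name `nested_loop_join`, the predicate parameters, and the `customers`/`orders` sample rows are all hypothetical and do not come from the slides.

```python
def nested_loop_join(t1, t2, key1, key2,
                     qualifies1=lambda row: True,
                     qualifies2=lambda row: True):
    """Join two lists of dict rows with a plain nested loop."""
    output = []
    for outer in t1:                       # N rows in T1
        if not qualifies1(outer):          # filter the outer row first
            continue
        for inner in t2:                   # M rows in T2, rescanned per outer row
            if outer[key1] == inner[key2] and qualifies2(inner):
                output.append((outer, inner))
    return output


# Hypothetical sample data, for illustration only.
customers = [
    {"cust_id": 1, "city": "Lahore"},
    {"cust_id": 2, "city": "Karachi"},
    {"cust_id": 3, "city": "Lahore"},
]
orders = [
    {"order_id": 10, "cust_id": 1, "amount": 50},
    {"order_id": 11, "cust_id": 1, "amount": 500},
    {"order_id": 12, "cust_id": 3, "amount": 700},
    {"order_id": 13, "cust_id": 2, "amount": 900},
]

rows = nested_loop_join(
    customers, orders, "cust_id", "cust_id",
    qualifies1=lambda c: c["city"] == "Lahore",
    qualifies2=lambda o: o["amount"] > 100,
)
for cust, order in rows:
    print(cust["cust_id"], order["order_id"])
```

Note that the inner table is scanned in full for every qualifying outer row, which is why filtering the outer table early matters.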
Nested-Loop Join: Working Example
Nested-Loop Join: Cost Formula
Join cost = Cost of accessing Table_A +
(# of qualifying rows in Table_A × Blocks of Table_B to be
scanned for each qualifying row)
OR
Nested-Loop Join: Cost of reorder
Table_A = 500 blocks and
Table_B = 700 blocks.
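To see why the join order matters, the cost formula can be evaluated for both orderings of these two tables. The sketch below assumes that accessing the outer table costs one read per block and that the inner table is scanned in full for each qualifying outer row. The qualifying-row counts (10 and 40) are made-up values for illustration only; the slides do not give them.

```python
def nested_loop_cost(outer_blocks, qualifying_outer_rows, inner_blocks_per_row):
    """Cost of accessing the outer table plus the inner-table blocks
    scanned once for each qualifying outer row."""
    return outer_blocks + qualifying_outer_rows * inner_blocks_per_row


TABLE_A_BLOCKS = 500   # from the slide
TABLE_B_BLOCKS = 700   # from the slide

# Hypothetical numbers of qualifying rows (assumptions, not from the slides).
QUALIFYING_A = 10
QUALIFYING_B = 40

cost_a_outer = nested_loop_cost(TABLE_A_BLOCKS, QUALIFYING_A, TABLE_B_BLOCKS)
cost_b_outer = nested_loop_cost(TABLE_B_BLOCKS, QUALIFYING_B, TABLE_A_BLOCKS)

print("Table_A as outer:", cost_a_outer)   # 500 + 10 * 700 = 7500
print("Table_B as outer:", cost_b_outer)   # 700 + 40 * 500 = 20700
```

Under these assumptions the two orderings differ by almost a factor of three, so the choice of outer table is driven mainly by how many of its rows qualify.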
Sort-Merge Join
• The tables being joined must be sorted on the columns named in
the join predicate of the WHERE clause.
Sort-Merge Join: Process
• The Sort-Merge join requires that both tables to be
joined are sorted on the join columns.
Sort-Merge Join: Process
• The query optimizer typically scans an index on the
columns which are part of the join,
Sort-Merge Join: Process
• However, in rare cases the WHERE clause contains
multiple equalities; the merge columns are then taken
from only some of the given equality clauses.
Sort-Merge Join: Process
• For example, in an equi-join operation, rows are
returned if their join-column values are equal.
Sort-Merge Join: Process
The Sort-Merge join process just described works as follows:
Sort Table_A and Table_B on the join column in ascending order, then scan them
to do a "merge" (on the join column) and output result tuples/rows.
• Advance the scan of Table_A until current A_tuple ≥ current B_tuple, then
advance the scan of Table_B until current B_tuple ≥ current A_tuple; do
this until current A_tuple = current B_tuple.
• At this point, all A_tuples with the same value in Ai (current A_group) and all
B_tuples with the same value in Bj (current B_group) match; output <a, b>
for all pairs of such tuples/records.
• Update pointers, resume scanning Table_A and Table_B.
Table_A is scanned once; each B group is scanned once per matching Table_A
tuple.
(Multiple scans of a B group are likely to find the needed pages in the buffer.)
Cost: M log M + N log N + (M + N)
The cost of scanning is normally M + N; in the worst case it could be M × N (very unlikely!)
Sort-Merge Join Example
Table_A   Table_B
1         1
1         3
2         3
2         4
2         4
4         4
5         5
5         5
5         6
6         6
6         6
6         6
6         7
6         7
7         7
8         7
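The merge process can be sketched in Python on the join-key columns of the example above. This is a minimal in-memory illustration under simplifying assumptions: each "tuple" is just its join-key value, and the function name `sort_merge_join` is hypothetical.

```python
def sort_merge_join(a_keys, b_keys):
    """Sort both inputs on the join key, then merge them,
    emitting every <a, b> pair with equal keys."""
    a = sorted(a_keys)
    b = sorted(b_keys)
    output = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1                      # advance Table_A until A >= B
        elif b[j] < a[i]:
            j += 1                      # advance Table_B until B >= A
        else:
            key = a[i]
            # Find the end of the current A group and current B group.
            i_end = i
            while i_end < len(a) and a[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end] == key:
                j_end += 1
            # Every tuple in the A group matches every tuple in the B group.
            for a_val in a[i:i_end]:
                for b_val in b[j:j_end]:
                    output.append((a_val, b_val))
            i, j = i_end, j_end         # update pointers, resume scanning
    return output


# Join-key columns from the example table.
table_a = [1, 1, 2, 2, 2, 4, 5, 5, 5, 6, 6, 6, 6, 6, 7, 8]
table_b = [1, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7]

result = sort_merge_join(table_a, table_b)
print(len(result))                       # 35 matching pairs
```

Keys 2 and 8 in Table_A and key 3 in Table_B have no partner, so they are skipped without producing output; key 6 contributes the largest group (5 × 4 = 20 pairs).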
Hash-Based Join
Hash-Based Join: Working
• Hash joins are suitable for the VLDB environment
as they are useful for joining large data sets or
tables.
Hash-Based Join: Working
• The optimizer uses the smaller of the two tables or
data sources (say, Table_A) to build a hash table in
main memory on the join key used in the WHERE clause.
Hash-Based Join: Working
• The optimizer uses a hash join to join two tables if
they are joined using an equi-join and if either of
the following conditions is true:
Hash-Based Join: Working
• This method is best used when the smaller table
fits in the available main memory. The cost is then
limited to a single read pass over the data for the
two tables.
Hash-Based Join: Working
• Suitable for the VLDB environment.
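The build step described above, followed by a probe of the larger table, can be sketched in Python as follows. This is a minimal in-memory illustration that assumes the smaller table fits in main memory; the function name `hash_join` and the sample rows are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join: hash the smaller table once, then probe it
    with each row of the larger table."""
    # Build phase: one pass over the smaller table.
    hash_table = defaultdict(list)
    for row in build_rows:
        hash_table[row[build_key]].append(row)

    # Probe phase: one pass over the larger table.
    output = []
    for row in probe_rows:
        for match in hash_table.get(row[probe_key], []):
            output.append((match, row))
    return output


# Hypothetical sample data: Table_A is the smaller (build) table.
table_a = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
]
table_b = [
    {"a_id": 1, "qty": 5},
    {"a_id": 1, "qty": 7},
    {"a_id": 3, "qty": 9},
    {"a_id": 2, "qty": 4},
]

joined = hash_join(table_a, table_b, "id", "a_id")
for a_row, b_row in joined:
    print(a_row["name"], b_row["qty"])
```

Each table is read exactly once, which matches the single-read-pass cost noted above for the case where the build table fits in memory.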
Hash-Based Join: Example
[Figure: hash-based join. Labels: Original Relation (Table_A, Table_B); hash function h; MAIN MEMORY (partitions 1, 2, ..., N); Join Result.]