Data Warehousing: Need For Speed: Join Techniques

Nested-loop join iterates through each row of the first table and compares it to each row of the second table to find matches. Sort-merge join sorts both tables on the join key and then scans and merges the sorted tables to find matches. Hash join builds a hash table on the smaller table and probes it with the rows of the larger table to find matches, with collisions indicating joined rows. Which technique performs best depends on factors such as data size and available memory.

Data Warehousing

Need for Speed: Join Techniques

1
Need for Speed: Join Techniques

2
About Nested-Loop Join

Nested Loop Join

3
Nested-Loop Join: Code
FOR i = 1 TO N DO BEGIN                /* N rows in T1 */
    IF the ith row of T1 qualifies THEN BEGIN
        FOR j = 1 TO M DO BEGIN        /* M rows in T2 */
            IF the ith row of T1 matches the jth row of T2 on the join key THEN BEGIN
                IF the jth row of T2 qualifies THEN BEGIN
                    produce output row
                END
            END
        END
    END
END
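
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the slides' own code: rows are assumed to be dictionaries, and the names `nested_loop_join`, `qualifies1`, `qualifies2` and the sample tables are hypothetical.

```python
def nested_loop_join(t1, t2, key1, key2,
                     qualifies1=lambda row: True,
                     qualifies2=lambda row: True):
    """Join two lists of dict rows with a plain nested loop."""
    output = []
    for row1 in t1:                       # outer loop: N rows in T1
        if not qualifies1(row1):          # skip T1 rows that do not qualify
            continue
        for row2 in t2:                   # inner loop: M rows in T2
            # match on the join key, then apply the T2 filter
            if row1[key1] == row2[key2] and qualifies2(row2):
                output.append({**row1, **row2})
    return output


# Hypothetical sample data, for illustration only.
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
orders = [
    {"order_id": 10, "cust": 1},
    {"order_id": 11, "cust": 2},
    {"order_id": 12, "cust": 1},
]
joined = nested_loop_join(customers, orders, "cust_id", "cust")
```

Filtering the outer table before entering the inner loop is what keeps the number of inner scans down, which is the effect the cost formula on the next slide captures.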

4
Nested-Loop Join: Working Example

5
Nested-Loop Join: Cost Formula
Join cost = Cost of accessing Table_A +
            (# of qualifying rows in Table_A × Blocks of Table_B to be scanned for each qualifying row)

OR

Join cost = Blocks accessed for Table_A +
            (Blocks accessed for Table_A × Blocks accessed for Table_B)

6
Nested-Loop Join: Cost of reorder
Table_A = 500 blocks and
Table_B = 700 blocks.

Qualifying blocks for Table_A: QB(A) = 50
Qualifying blocks for Table_B: QB(B) = 100

Join cost A&B = 500 + 50 × 700 = 35,500 I/Os
Join cost B&A = 700 + 100 × 500 = 50,700 I/Os

i.e. an increase in I/O of about 43%.
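
The arithmetic on this slide can be checked with a few lines of Python. The function name `nested_loop_cost` and its parameter names are illustrative assumptions; the block counts are the ones given above.

```python
def nested_loop_cost(outer_blocks, outer_qualifying_blocks, inner_blocks):
    """I/O cost: read the outer table once, then scan the inner table
    once for every qualifying block of the outer table."""
    return outer_blocks + outer_qualifying_blocks * inner_blocks


cost_ab = nested_loop_cost(500, 50, 700)    # Table_A as the outer table
cost_ba = nested_loop_cost(700, 100, 500)   # Table_B as the outer table
increase_pct = (cost_ba - cost_ab) / cost_ab * 100

print(cost_ab, cost_ba, round(increase_pct))
```

Under these numbers, putting the table with fewer qualifying blocks on the outside is the cheaper order.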

7
Sort-Merge Join
• The joined tables must be sorted on the columns named in the WHERE clause of the join predicate.

• The query optimizer looks for a (clustered) index on those columns; if one exists, it uses the index order to perform the join.

• In the absence of an index, the tables are sorted on the columns named in the WHERE clause.

• If the WHERE clause contains multiple equalities, only some of them supply the merge columns.

8
Sort-Merge Join: Process
• The sort-merge join requires that both tables to be joined are sorted on the columns identified by the equality in the WHERE clause of the join predicate.

• Subsequently, the tables are merged based on the join columns.

9
Sort-Merge Join: Process
• The query optimizer typically scans an index on the columns that are part of the join, if one exists on the proper set of columns.

• Otherwise, the tables are sorted on the columns to be joined, which gives the same ordering that a clustered index on those columns would provide.

10
Sort-Merge Join: Process
• In rare cases there may be multiple equalities in the WHERE clause; the merge columns are then taken from only some of the given equality clauses.

• Because each table is sorted, the sort-merge join operator gets a row from each table and compares the two rows, advancing through the tables one row at a time.

11
Sort-Merge Join: Process
• For example, for equi-join operations, rows are returned if they are equal on the join predicate.

• If they are not equal, the row with the lower value is discarded and the next row is obtained from that table.

• This process is repeated until all the rows have been exhausted.

12
Sort-Merge Join: Process
The sort-merge join process just described works as follows:

Sort Table_A and Table_B on the join column in ascending order, then scan them to do a "merge" (on the join column) and output result tuples/rows.
• Advance the scan of Table_A until current A_tuple ≥ current B_tuple, then advance the scan of Table_B until current B_tuple ≥ current A_tuple; repeat until current A_tuple = current B_tuple.
• At this point, all A_tuples with the same value in Ai (current A_group) and all B_tuples with the same value in Bj (current B_group) match; output <a, b> for all pairs of such tuples/records.
• Update the pointers and resume scanning Table_A and Table_B.

Table_A is scanned once; each B group is scanned once per matching Table_A tuple.
(Multiple scans of a B group are likely to find the needed pages in the buffer.)
Cost: M log M + N log N + (M + N), where M and N are the sizes of the two tables.
The scanning cost is normally M + N; in the worst case it could be M × N (very unlikely!).
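
The merge procedure described on this slide might look like the following in Python. It is a sketch under simplifying assumptions: each table is reduced to a list of join-key values, everything fits in memory, and the function name `sort_merge_join` is hypothetical.

```python
def sort_merge_join(table_a, table_b):
    """Equi-join two lists of join-key values; returns (a, b) pairs."""
    a = sorted(table_a)                   # sort phase: M log M
    b = sorted(table_b)                   # sort phase: N log N
    result = []
    i = j = 0
    while i < len(a) and j < len(b):      # merge phase: M + N in the usual case
        if a[i] < b[j]:
            i += 1                        # discard the lower A value
        elif a[i] > b[j]:
            j += 1                        # discard the lower B value
        else:
            key = a[i]
            # find the end of the current A group and the current B group
            i_end = i
            while i_end < len(a) and a[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end] == key:
                j_end += 1
            # every tuple in the A group matches every tuple in the B group
            for x in a[i:i_end]:
                for y in b[j:j_end]:
                    result.append((x, y))
            i, j = i_end, j_end           # update pointers and resume scanning
    return result


pairs = sort_merge_join([3, 1, 2, 2], [2, 4, 2, 1])
```

The group handling is where the M × N worst case comes from: if every row in both tables carries the same key, the two inner loops produce all M × N pairs.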

13
Sort-Merge Join Example

Table_A   Table_B
1         1
1         3
2         3
2         4
2         4
4         4
5         5
5         5
5         6
6         6
6         6
6         6
6         7
6         7
7         7
8         7

14
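
As a rough check on the example, the size of the join result can be computed from the two sorted columns: each join key contributes (rows in its A group) × (rows in its B group) output rows. The snippet below uses the column values from the example table; the variable names are illustrative.

```python
from collections import Counter

# Join-key columns read from the example table.
table_a = [1, 1, 2, 2, 2, 4, 5, 5, 5, 6, 6, 6, 6, 6, 7, 8]
table_b = [1, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7]

count_a = Counter(table_a)   # size of each A group
count_b = Counter(table_b)   # size of each B group

# Only keys present in both tables produce output rows.
common_keys = sorted(count_a.keys() & count_b.keys())
rows_per_key = {k: count_a[k] * count_b[k] for k in common_keys}
total_rows = sum(rows_per_key.values())

print(rows_per_key)
print(total_rows)
```

Keys 2 and 8 appear only in Table_A and key 3 only in Table_B, so those rows are discarded during the merge.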
Hash-Based join

15
Hash-Based Join: Working
• Hash joins are suitable for the VLDB (very large database) environment, as they are useful for joining large data sets or tables.

• The choice of which table gets hashed first plays a pivotal role in the overall performance of the join operation and is left to the optimizer.

16
Hash-Based Join: Working
• The optimizer uses the smaller of the two tables or data sources (say, Table_A) to build a hash table in main memory on the join key used in the WHERE clause.

• It then scans the larger table (say, Table_B) and probes the hash table to find the joined rows.

• The joined rows are identified by collisions: a Table_B row whose join key hashes to an occupied entry and equals the key stored there joins with that Table_A row. In this sense, collisions are "good" in a hash join.
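
The build and probe phases can be sketched in Python as below. This is an illustration under assumptions, not the optimizer's actual implementation: rows are dictionaries, a Python `dict` stands in for the in-memory hash table, and `hash_join` and the sample tables are hypothetical names.

```python
from collections import defaultdict


def hash_join(small, large, small_key, large_key):
    """Build a hash table on the smaller table, then probe it with the larger."""
    # Build phase: hash every row of the smaller table on its join key.
    buckets = defaultdict(list)
    for row in small:
        buckets[row[small_key]].append(row)

    # Probe phase: one scan of the larger table; a hit in the hash table
    # (a "good" collision) yields one joined row per stored match.
    output = []
    for row in large:
        for match in buckets.get(row[large_key], []):
            output.append({**match, **row})
    return output


# Hypothetical sample data, for illustration only.
departments = [{"dept_id": 1, "dept": "Sales"}, {"dept_id": 2, "dept": "IT"}]
employees = [
    {"emp": "Ann", "dept_ref": 2},
    {"emp": "Bob", "dept_ref": 1},
    {"emp": "Cy", "dept_ref": 2},
    {"emp": "Di", "dept_ref": 3},   # no matching department
]
joined = hash_join(departments, employees, "dept_id", "dept_ref")
```

Each table is read exactly once here, which is the single-pass behavior the slides attribute to the case where the smaller table fits in memory.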

17
Hash-Based Join: Working
• The optimizer uses a hash join to join two tables if they are joined using an equi-join and either of the following conditions is true:

• A large amount of data needs to be joined.

• A large portion of the table needs to be joined.

18
Hash-Based Join: Working
• This method works best when the smaller table fits in the available main memory. The cost is then limited to a single read pass over the data of the two tables.

• Otherwise, the "smaller" table has to be partitioned, which causes delays and degrades performance because of the additional I/O.
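
The slides do not say how the partitioning is done. One common approach, often called a Grace hash join, splits both tables with the same hash function so that matching keys land in the same partition, then joins each pair of partitions separately. The sketch below assumes that approach; in-memory lists stand in for partition files on disk, and `partition`, `partitioned_hash_join` and `n_parts` are hypothetical names.

```python
from collections import defaultdict


def partition(rows, key, n_parts):
    """Split rows into n_parts lists by hashing the join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def partitioned_hash_join(table_a, table_b, key_a, key_b, n_parts=4):
    # Pass 1: partition both tables with the same hash function.
    # In a real system each partition would be written back to disk,
    # which is the extra I/O the slide warns about.
    parts_a = partition(table_a, key_a, n_parts)
    parts_b = partition(table_b, key_b, n_parts)

    # Pass 2: join partition i of Table_A with partition i of Table_B only;
    # equal keys always hash to the same partition number.
    output = []
    for part_a, part_b in zip(parts_a, parts_b):
        buckets = defaultdict(list)
        for row in part_a:                 # build on the (small) partition
            buckets[row[key_a]].append(row)
        for row in part_b:                 # probe with the other partition
            for match in buckets.get(row[key_b], []):
                output.append({**match, **row})
    return output


# Hypothetical sample data, for illustration only.
table_a = [{"id": i, "a": i * 10} for i in range(10)]
table_b = [{"ref": i % 5, "b": i} for i in range(20)]
joined = partitioned_hash_join(table_a, table_b, "id", "ref")
```

Every row is now handled twice (once to partition, once to join), compared with the single pass of the in-memory case.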

19
Hash-Based Join: Working
• Suitable for the VLDB environment.

• The choice of which table gets hashed first plays a pivotal role in the overall performance of the join operation; this is decided by the optimizer.

• Joined rows are identified by collisions, i.e. collisions are "good" in a hash join.

20
Hash-Based Join: Working

21
Hash-Based Join: Example

[Figure: Table_A and Table_B (the original relations) reside on disk. Hash function h loads Table_A into main memory as entries 1, 2, ..., N; Table_B stays on disk and is read to probe it; the join result is written to disk.]

22
