AN ADAPTIVE EXECUTION ENGINE FOR

APACHE SPARK SQL

Carson Wang (carson.wang@intel.com)
Yucai Yu (yucai.yu@intel.com)
Hao Cheng (hao.cheng@intel.com)
Agenda

• Challenges in Spark SQL* High Performance

• Adaptive Execution Background

• Adaptive Execution Architecture

• Benchmark Result

*Other names and brands may be claimed as the property of others.

Challenges in Tuning Shuffle Partition Number

• Partition Num P = spark.sql.shuffle.partitions (200 by default)

• Total Core Num C = Executor Num * Executor Core Num

• Each Reduce Stage runs the tasks in (P / C) rounds
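
As a rough worked example of the last bullet: the number of rounds is P / C rounded up, because a partially filled final round still takes a full round. The executor counts below are made-up values, not figures from the talk.

```python
import math

def reduce_rounds(partitions: int, executors: int, cores_per_executor: int) -> int:
    """Number of task rounds a reduce stage needs: ceil(P / C)."""
    total_cores = executors * cores_per_executor  # C = Executor Num * Executor Core Num
    return math.ceil(partitions / total_cores)

# Hypothetical cluster: 10 executors x 4 cores = 40 cores.
print(reduce_rounds(200, 10, 4))  # 200 / 40 -> 5 rounds
print(reduce_rounds(201, 10, 4))  # one extra partition costs a whole extra round -> 6
```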

Shuffle Partition Challenge 1
• Partition Num Too Small: Spill, OOM

• Partition Num Too Large: Scheduling overhead. More IO requests. Too many small output files

• Tuning method: Increase the partition number starting from C, 2C, … until performance begins to drop

Impractical for each query in production.
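
The manual tuning loop can be sketched as follows. `measure` is a hypothetical callback that runs the query with a given partition number and returns its duration; it does not correspond to any Spark API, and the durations in the usage lines are invented.

```python
from typing import Callable

def tune_partition_number(total_cores: int,
                          measure: Callable[[int], float],
                          max_multiple: int = 64) -> int:
    """Try C, 2C, 3C, ... and stop as soon as the query gets slower."""
    best_p = total_cores
    best_time = measure(best_p)
    for k in range(2, max_multiple + 1):
        p = k * total_cores
        elapsed = measure(p)
        if elapsed >= best_time:  # performance began to drop
            break
        best_p, best_time = p, elapsed
    return best_p

# Toy cost model standing in for real query runs (made-up numbers).
fake_durations = {40: 300.0, 80: 220.0, 120: 180.0, 160: 195.0}
print(tune_partition_number(40, lambda p: fake_durations.get(p, 999.0)))  # 120
```

Every call to `measure` stands for a full run of the query, which is what makes this approach impractical per query.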

Shuffle Partition Challenge 2

• The same shuffle partition number doesn't fit all stages

• Shuffle data size usually decreases during the execution of the SQL query

Question: Can we set the shuffle partition number for each stage automatically?

Spark SQL* Execution Plan

• The execution plan is fixed after the planning phase.

Image from: https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
Spark SQL* Join Selection

SELECT xxx

FROM A

JOIN B

ON A.Key1 = B.Key2

Broadcast Hash Join
[Figure: table B is broadcast to every executor. Task 1, Task 2, …, Task n each join one partition of table A (A1, A2, …, An) with a full copy of B.]

Shuffle Hash Join / Sort Merge Join

[Figure: in the MAP phase, tasks read partitions A0, A1, A2 and B0, B1, B2; the data is then SHUFFLEd; in the REDUCE phase, tasks produce output partitions 0, 1, 2, ….]

Spark SQL* Join Selection
• spark.sql.autoBroadcastJoinThreshold is 10 MB by default

• For complex queries, a join may take intermediate results as inputs. At the planning phase, Spark SQL* doesn't know their exact size and plans the join as a SortMergeJoin.
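
A minimal sketch of this size-based decision, assuming an unknown input size is modelled as `None`. `choose_join` is a hypothetical helper, not Spark's planner code, and the real planner works from size estimates rather than missing values.

```python
from typing import Optional

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, the default mentioned above

def choose_join(left_bytes: Optional[int], right_bytes: Optional[int],
                threshold: int = BROADCAST_THRESHOLD) -> str:
    """Pick a broadcast join only when one side is known to be small enough."""
    known = [size for size in (left_bytes, right_bytes) if size is not None]
    if known and min(known) <= threshold:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

# Planning time: the size of the intermediate result is not known yet.
print(choose_join(None, 100 * 1024**3))        # SortMergeJoin
# Runtime: the same input turned out to be 5 MB once its stage had run.
print(choose_join(5 * 1024**2, 100 * 1024**3))  # BroadcastHashJoin
```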

Question: Can we optimize the execution plan at runtime based on the runtime statistics?


Data Skew in Join
• Some partitions hold far more data than others.
• Data skew is a common source of slowness for Shuffle Joins.

Ways to Handle Skewed Join nowadays

• Increase shuffle partition number

• Increase the BroadcastJoin threshold to change a Shuffle Join to a Broadcast Join
• Add prefix to the skewed keys
• ……
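
The key-prefix workaround (often called salting) can be sketched in plain Python. This is an illustration under assumptions, not code from the talk: `salted_join` is a hypothetical helper, and the salt is stored as a tuple element rather than a string prefix.

```python
import random
from collections import defaultdict

def salted_join(big, small, skewed_keys, n_salts, seed=0):
    """Join two lists of (key, value) pairs, spreading skewed keys over n_salts buckets."""
    rng = random.Random(seed)
    # Big side: each row with a skewed key gets one random salt.
    salted_big = [((rng.randrange(n_salts) if k in skewed_keys else 0, k), v)
                  for k, v in big]
    # Small side: rows with a skewed key are replicated once per salt value.
    index = defaultdict(list)
    for k, v in small:
        salts = range(n_salts) if k in skewed_keys else (0,)
        for s in salts:
            index[(s, k)].append(v)
    # Join on the (salt, key) pair; the salt is dropped from the output.
    return [(k, bv, sv) for (s, k), bv in salted_big for sv in index[(s, k)]]

big = [("hot", i) for i in range(6)] + [("cold", 100)]
small = [("hot", "h"), ("cold", "c")]
print(sorted(salted_join(big, small, {"hot"}, n_salts=3)))
```

The result matches an unsalted join; the cost is replicating the small side's skewed rows `n_salts` times, and the skewed keys have to be known in advance.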

Question 3: These involve a lot of manual effort and are limited. Can we handle skewed joins at runtime automatically?

Adaptive Execution Background

• SPARK-9850: Adaptive execution in Spark*

• SPARK-9851: Support submitting map stages individually in DAGScheduler

• SPARK-9858: Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

A New Adaptive Execution Engine in Spark SQL*

Adaptive Execution Architecture
[Figure: adaptive execution workflow]

• Divide the execution plan into multiple QueryStages. In the figure, the Exchange operators below a SortMergeJoin become child stages that feed the join through QueryStageInput nodes.
• For each QueryStage: (a) execute the child stages, (b) optimize the plan, (c) determine the reducer number.
• Execute the stage as a DAG of RDDs.
• Example in the figure: the two child stages of a SortMergeJoin produce 100 GB and 5 MB, so the join is changed to a BroadcastJoin with a BroadcastExchange, executed through a LocalShuffledRDD.

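The ordering of steps (a) to (c) can be sketched as a post-order walk over the stages: a stage is re-optimized only after its child stages have run and reported their real output sizes. The `QueryStage` class here is a simplified stand-in invented for this sketch, not the class from the patch, and the stage names and sizes are illustrative.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QueryStage:
    name: str
    output_mb: float  # size this stage produces, known only after it has run
    children: List["QueryStage"] = field(default_factory=list)

def execute(stage: QueryStage, trace: List[str]) -> float:
    # (a) Run every child stage first and collect its actual output size.
    child_sizes = [execute(child, trace) for child in stage.children]
    # (b), (c) A real engine would re-optimize the plan and pick the reducer
    # number here; this sketch only records which statistics are available.
    trace.append(f"{stage.name}: child sizes = {child_sizes}")
    # Execute the stage itself and report its output size to the parent.
    return stage.output_mb

plan = QueryStage("join", 50.0, [QueryStage("scan_a", 100 * 1024.0),
                                 QueryStage("scan_b", 5.0)])
trace: List[str] = []
execute(plan, trace)
print("\n".join(trace))
```
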
Auto Setting the Number of Reducers
• 5 initial reducer partitions with size
[70 MB, 30 MB, 20 MB, 10 MB, 50 MB]
• Set target size per reducer = 64 MB. At runtime, we use 3 actual reducers.
• Also support setting target row count per reducer.

[Figure: map tasks 1 and 2 each write partitions 0 to 4. Reduce task 1 reads partition 0 (70 MB); reduce task 2 reads partitions 1, 2 and 3 (30 MB, 20 MB, 10 MB); reduce task 3 reads partition 4 (50 MB).]

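One way to arrive at the three reducers in this example is a greedy pass that merges contiguous partitions until adding the next one would exceed the target size. This is a sketch of that idea under stated assumptions, not the ExchangeCoordinator source; `coalesce_partitions` is a name chosen for illustration.

```python
def coalesce_partitions(sizes_mb, target_mb):
    """Group contiguous partition indexes so each group stays within target_mb
    where possible. A single partition larger than the target gets its own group."""
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes_mb):
        # Start a new reducer if adding this partition would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups

print(coalesce_partitions([70, 30, 20, 10, 50], target_mb=64))
# [[0], [1, 2, 3], [4]] -> 3 reducers
```
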
Shuffle Join => Broadcast Join
Example 1

• T1 < broadcast threshold

• T2 and T3 > broadcast threshold

• In this case, neither Join1 nor Join2 is changed to a broadcast join

[Figure: one QueryStage containing SortMergeJoin2 on top of SortMergeJoin1. Join1 reads QueryStageInputs T1 and T2; Join2 reads the result of Join1 and QueryStageInput T3. Each QueryStageInput is a child stage.]

Shuffle Join => Broadcast Join
Example 2

• T1 and T3 < broadcast threshold

• T2 > broadcast threshold

• In this case, both Join1 and Join2 are changed to broadcast join

[Figure: one QueryStage containing SortMergeJoin2 on top of SortMergeJoin1. Join1 reads QueryStageInputs T1 and T2; Join2 reads the result of Join1 and QueryStageInput T3. Each QueryStageInput is a child stage.]

Remote Shuffle Read => Local Shuffle Read
[Figure: map output A0, B0 is stored on Node 1 and A1, B1 on Node 2. With remote shuffle read, five reduce tasks (tasks 1 to 3 on Node 1, tasks 4 and 5 on Node 2) fetch map output from both nodes. With local shuffle read, one reduce task per node (task 1 on Node 1, task 2 on Node 2) reads only the map output on its own node.]

Skewed Partition Detection at Runtime
• After executing child stages, we calculate the data size and row count of each partition from MapStatus.

• A partition is skewed if its data size or row count is N times larger than the median, and also larger than a pre-defined threshold.
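
The skew rule can be sketched as a small function, applied to either metric (data size or row count). It reads "N times larger" as "greater than N times the median"; the function name and the sample numbers are assumptions for illustration.

```python
from statistics import median

def skewed_partitions(metric, factor, min_value):
    """Indexes of partitions whose metric exceeds both factor * median and min_value."""
    mid = median(metric)
    return [i for i, m in enumerate(metric) if m > factor * mid and m > min_value]

sizes_mb = [900, 60, 70, 65, 80]  # made-up per-partition sizes
print(skewed_partitions(sizes_mb, factor=5, min_value=256))          # [0]
# Large relative to the median but below the absolute threshold: not skewed.
print(skewed_partitions([30, 1, 1, 1, 1], factor=5, min_value=256))  # []
```

The absolute threshold keeps small partitions from being flagged just because the median is tiny.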

Handling Skewed Join
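• Table A: partition 0 is skewed. Table B: not skewed.

• Use N tasks instead of 1 task to join the data in Partition 0. The join result = Union (A0-0 Join B0, A0-1 Join B0, … , A0-N Join B0)

[Figure: the part of table A's partition 0 written by Map 0 is A0-0, by Map 1 is A0-1, and so on up to A0-N. Each part is shuffle-read by its own task and joined with B0, partition 0 of table B, which is shuffle-read from all of B's map tasks.]
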
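The union formula can be checked on toy data: joining each piece of the skewed partition with all of B0 and concatenating the results gives the same rows as joining the whole partition at once. The helper names and the rows are made up for this sketch.

```python
def join(left, right):
    """Naive inner join of two lists of (key, value) pairs."""
    return [(k, lv, rv) for k, lv in left for rk, rv in right if k == rk]

def skewed_join(a_chunks, b_partition):
    """Join each chunk of the skewed partition with all of B0, then union the results."""
    result = []
    for chunk in a_chunks:  # in a real engine, one task per chunk
        result.extend(join(chunk, b_partition))
    return result

# Partition 0 of A as written by three map tasks (A0-0, A0-1, A0-2).
a0_chunks = [[("k", 1), ("k", 2)], [("k", 3)], [("j", 4), ("k", 5)]]
b0 = [("k", "x"), ("j", "y")]

whole = join([row for chunk in a0_chunks for row in chunk], b0)
assert sorted(skewed_join(a0_chunks, b0)) == sorted(whole)
```

This works because every task sees all of B0; only the skewed side is split.
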
Benchmark Result

Cluster Setup
Hardware (BDW)
  Slave
    Node#:   98
    CPU:     Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz (88 cores)
    Memory:  256 GB
    Disk:    7× 400 GB SSD
    Network: 10 Gigabit Ethernet
  Master
    CPU:     Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz (88 cores)
    Memory:  256 GB
    Disk:    7× 400 GB SSD
    Network: 10 Gigabit Ethernet
Software
  OS:            CentOS* Linux release 6.9
  Kernel:        2.6.32-573.22.1.el6.x86_64
  Spark*:        Spark* master (2.3) / Spark* master (2.3) with adaptive execution patch
  Hadoop*/HDFS*: hadoop-2.7.3
  JDK:           1.8.0_40 (Oracle* Corporation)
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
TPC-DS* 100TB Benchmark

Spark SQL vs. Adaptive Execution

[Figure: bar chart of query duration in seconds (axis from 0 to 500) for Spark SQL and Adaptive Execution on queries q8, q81, q30, q51, q61, q60, q90, q37, q82, q56, q31, q19, q41, q74 and q91. The per-query speedup labels range from 1.2X to 3.2X.]

Auto Setting the Shuffle Partition Number
• Less scheduler overhead and task startup time.
• Fewer disk IO requests.
• Less data is written to disk because more data is aggregated.

Partition Number 10976 (q30)

Partition Number changed to 1084 and 1079 at runtime. (q30)

SortMergeJoin -> BroadcastJoin at Runtime
• Eliminates the data skew and stragglers in SortMergeJoin.
• Remote shuffle read -> local shuffle read.
• Random IO read -> sequential IO read.
SortMergeJoin (q8):

BroadcastJoin (q8 Adaptive Execution):

Scheduling Difference
• Spark SQL* has to wait for the completion of all broadcasts before scheduling the stages. Adaptive Execution can start a stage as soon as its dependencies are completed.
Original Spark:

50 Seconds Gap

Adaptive Execution:

Thank You
Legal Disclaimer
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as
well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel
representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are
available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting
www.intel.com/design/literature.htm.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced
data are accurate.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others

Copyright © 2017 Intel Corporation.
