An Adaptive Execution Engine For Apache Spark SQL: Carson Wang Yucai Yu Hao Cheng
Shuffle Partition Challenge 2
• The same shuffle partition number does not fit all stages
• Shuffle data size usually decreases during the execution of the SQL query
Question: Can we set the shuffle partition number for each stage automatically?
Spark SQL* Execution Plan
SELECT xxx
FROM A
JOIN B
ON A.Key1 = B.Key2
[Figure: Partition 1 to Partition n; partition i holds Ai together with B (A1 B, A2 B, ……, An B)]
Shuffle Hash Join / Sort Merge Join
[Figure: MAP tasks read A0, A1, A2 and B0, B1, B2; after the SHUFFLE, REDUCE tasks write Output Partition 0, Output Partition 1, Output Partition 2, ……]
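
The map / shuffle / reduce flow of a shuffle hash join can be sketched in a few lines of plain Python. This is a minimal illustration, not Spark's implementation: the function names `hash_partition` and `shuffle_join` are hypothetical, rows are assumed to be `(key, value)` tuples, and partitioning uses Python's built-in `hash`.

```python
from collections import defaultdict

def hash_partition(rows, num_partitions):
    """Map side: route each (key, value) row to a reducer partition by its key."""
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def shuffle_join(table_a, table_b, num_partitions):
    """Reduce side: join the co-partitioned slices of A and B, one partition at a time."""
    parts_a = hash_partition(table_a, num_partitions)
    parts_b = hash_partition(table_b, num_partitions)
    output = []
    for p in range(num_partitions):
        # Build a hash table on the B slice, then probe it with the A slice.
        build = defaultdict(list)
        for key, b_val in parts_b.get(p, []):
            build[key].append(b_val)
        for key, a_val in parts_a.get(p, []):
            for b_val in build.get(key, []):
                output.append((key, a_val, b_val))
    return output

table_a = [(1, "a1"), (2, "a2"), (1, "a3")]
table_b = [(1, "b1"), (3, "b3")]
joined = shuffle_join(table_a, table_b, num_partitions=3)
```

Because both tables are partitioned with the same function, rows with equal keys always meet in the same reduce partition, which is why the number of shuffle partitions directly sets the reduce-side parallelism.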
Spark SQL* Join Selection
• spark.sql.autoBroadcastJoinThreshold is 10 MB by default
Question: Can we optimize the execution plan at runtime based on the runtime statistics?
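
A minimal sketch of the size-based decision behind the threshold above, assuming only that a join side is broadcast when its size is at most the threshold and that a negative threshold disables broadcasting. The function `choose_join` and its return strings are hypothetical and are not Spark APIs; the example sizes are made up for illustration.

```python
# Default of spark.sql.autoBroadcastJoinThreshold: 10 MB, expressed in bytes.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024

def choose_join(size_a_bytes, size_b_bytes, threshold=DEFAULT_THRESHOLD_BYTES):
    """Pick a join strategy from the sizes of the two inputs (illustrative only)."""
    smaller = min(size_a_bytes, size_b_bytes)
    # A negative threshold means broadcasting is turned off.
    if threshold >= 0 and smaller <= threshold:
        return "broadcast join"
    return "shuffle join"

GB = 1024 ** 3
MB = 1024 ** 2

# Planning time: only a coarse size estimate of each input is known.
planned = choose_join(50 * GB, 5 * GB)
# Runtime: after filters and aggregations, one side turned out to be small.
at_runtime = choose_join(50 * GB, 8 * MB)
```

The two calls show why runtime statistics matter: the same rule gives a different answer once the actual size of an intermediate result is known.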
Ways to Handle Skewed Join nowadays
Question 3: These involve a lot of manual effort and are limited. Can we handle skewed joins at runtime automatically?
Adaptive Execution Background
• Divide the plan into multiple QueryStages
• (a) Execute ChildStages (b) Optimize the plan (c) Determine Reducer num
[Figure: a plan with Exchange nodes split into QueryStages; node labels include SortMergeJoin, BroadcastJoin and LocalShuffledRDD; the result is a DAG of RDDs and a Stage]
Auto Setting the Number of Reducers
• 5 initial reducer partitions with sizes [70 MB, 30 MB, 20 MB, 10 MB, 50 MB]
• Set target size per reducer = 64 MB. At runtime, we use 3 actual reducers: [70 MB], [30 MB, 20 MB, 10 MB] and [50 MB].
• Setting a target row count per reducer is also supported.
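
The example above can be reproduced with a small greedy packing routine. This is a sketch under the assumption that only adjacent partitions are merged and that a group is closed as soon as adding the next partition would exceed the target; `coalesce_partitions` is a hypothetical name, not a Spark function.

```python
def coalesce_partitions(sizes_mb, target_mb):
    """Group adjacent shuffle partitions so each group stays near the target size.

    Returns a list of groups, each a list of original partition indices.
    A single partition larger than the target forms a group on its own.
    """
    groups = []
    current, current_size = [], 0
    for index, size in enumerate(sizes_mb):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups

groups = coalesce_partitions([70, 30, 20, 10, 50], target_mb=64)
print(groups)       # [[0], [1, 2, 3], [4]]
print(len(groups))  # 3 reducers
```

Restricting the merge to adjacent partitions keeps each reducer's input a contiguous range of partition ids, so one reducer can fetch a range of shuffle blocks instead of an arbitrary set.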
Shuffle Join => Broadcast Join
Example 1
[Figure: QueryStage T3 with two child QueryStage inputs, T1 and T2]
Shuffle Join => Broadcast Join
Example 2
[Figure: QueryStage T3 with two child QueryStage inputs, T1 and T2]
Remote Shuffle Read => Local Shuffle Read
[Figure: shuffle blocks A0, B0, A1, B1 and the tasks that read them (Task1 to Task5; task 1, task 2)]
Skewed Partition Detection at Runtime
• After executing child stages, we calculate the data size and row count of each partition from MapStatus.
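
A minimal sketch of how per-partition sizes could be aggregated and checked for skew. The aggregation step follows the bullet above: each map task reports one size per reduce partition, and the sizes are summed per partition. The skew rule itself is an assumption made for illustration (a partition counts as skewed when it is both larger than `factor` times the median and larger than `min_size_mb`); the names `partition_sizes`, `skewed_partitions`, `factor` and `min_size_mb` are hypothetical.

```python
from statistics import median

def partition_sizes(map_statuses):
    """Sum the size each map task reported for every reduce partition.

    map_statuses[m][p] is the size (in MB) of map task m's output for partition p.
    """
    num_partitions = len(map_statuses[0])
    return [sum(status[p] for status in map_statuses) for p in range(num_partitions)]

def skewed_partitions(sizes, factor=10, min_size_mb=64):
    """Return the ids of partitions flagged as skewed (assumed rule, see lead-in)."""
    med = median(sizes)
    return [p for p, size in enumerate(sizes)
            if size > factor * med and size > min_size_mb]

# Three map tasks, three reduce partitions; partition 0 is much larger than the rest.
map_statuses = [
    [500, 10, 12],
    [700, 8, 9],
    [600, 11, 10],
]
sizes = partition_sizes(map_statuses)
skewed = skewed_partitions(sizes)
```

The same aggregation works for row counts by feeding per-map row counts instead of sizes.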
Handling Skewed Join
[Figure: Table A (partition 0 is skewed) and Table B, each with Map 0, Map 1, Map 2, ……; each map output slice of A's partition 0 (A0-0, A0-1, ……, A0-N) is shuffle-read and joined with B0 separately]
Union (A0-0 Join B0, A0-1 Join B0, … , A0-N Join B0)
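
The union above gives the same rows as joining the whole skewed partition A0 with B0 in a single task, because an inner join distributes over a union of one side's rows. A small sketch with in-memory lists, where `join` and `skewed_join` are hypothetical helper names and rows are assumed to be `(key, value)` tuples:

```python
def join(a_rows, b_rows):
    """Plain inner join of (key, value) rows on the key."""
    return [(ka, va, vb) for ka, va in a_rows for kb, vb in b_rows if ka == kb]

def skewed_join(a_map_outputs, b_partition):
    """Join each map-output slice of the skewed A partition with all of B's partition.

    a_map_outputs[i] plays the role of A0-i; b_partition plays the role of B0.
    Each slice could be processed by its own task; the results are then unioned.
    """
    result = []
    for a_slice in a_map_outputs:
        result.extend(join(a_slice, b_partition))
    return result

# A0 split by the map task that produced the rows: A0-0, A0-1, A0-2.
a0_slices = [
    [(1, "x0")],
    [(1, "x1"), (2, "y1")],
    [(1, "x2")],
]
b0 = [(1, "b"), (2, "c")]

split_result = skewed_join(a0_slices, b0)
whole_a0 = [row for a_slice in a0_slices for row in a_slice]
single_task_result = join(whole_a0, b0)
```

The trade-off is that B0 is read once per slice of A0, so the split pays extra shuffle reads on the non-skewed side in exchange for spreading the skewed side over several tasks.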
Benchmark Result
Cluster Setup
Hardware BDW
Slave Node# 98
CPU Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz (88 cores)
Memory 256 GB
Disk 7× 400 GB SSD
Network 10 Gigabit Ethernet
[Figure: bar chart comparing Spark SQL and Adaptive Execution on queries q8, q81, q30, q51, q61, q60, q90, q37, q82, q56, q31, q19, q41, q74, q91; y-axis from 0 to 300; speedup labels range from 1.2X to 3.2X]
*For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
SortMergeJoin -> BroadcastJoin at Runtime
• Eliminates the data skew and stragglers in SortMergeJoin
• Remote shuffle read -> local shuffle read
• Random IO read -> sequential IO read
SortMergeJoin (q8):
Scheduling Difference
• Spark SQL* has to wait for all broadcasts to complete before scheduling the stages. Adaptive Execution can start a stage earlier, as soon as that stage's dependencies are completed.
Original Spark:
50 Seconds Gap
Adaptive Execution: