AN ADAPTIVE EXECUTION ENGINE FOR

APACHE SPARK SQL


Carson Wang (carson.wang@intel.com)
Yucai Yu (yucai.yu@intel.com)
Hao Cheng (hao.cheng@intel.com)
Agenda

• Challenges in Spark SQL* High Performance

• Adaptive Execution Background

• Adaptive Execution Architecture

• Benchmark Result

*Other names and brands may be claimed as the property of others.


Challenges in Tuning Shuffle Partition Number

• Partition Num P = spark.sql.shuffle.partitions (200 by default)

• Total Core Num C = Executor Num * Executor Core Num

• Each Reduce Stage runs the tasks in (P / C) rounds
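The round count above can be checked with a few lines of Python. This is a minimal sketch that assumes one task per core and rounds P / C up, since a partial round still occupies a full wave of tasks; the function name and the cluster sizes are illustrative, not taken from the slides.

```python
import math

def reduce_rounds(num_partitions, executor_num, cores_per_executor):
    """Number of task waves a reduce stage needs: ceil(P / C)."""
    total_cores = executor_num * cores_per_executor  # C
    return math.ceil(num_partitions / total_cores)

# Hypothetical cluster: 10 executors x 4 cores = 40 cores.
print(reduce_rounds(200, 10, 4))  # 5
print(reduce_rounds(201, 10, 4))  # 6: one extra partition costs a whole extra round
```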

Shuffle Partition Challenge 1
• Partition Num Too Small: Spill, OOM

• Partition Num Too Large: Scheduling overhead. More IO requests. Too many small output files

• Tuning method: Increase the partition number starting from C, 2C, … until performance begins to drop

Impractical for each query in production.
Shuffle Partition Challenge 2

• The same shuffle partition number does not fit all stages

• Shuffle data size usually decreases during the execution of the SQL query

Question 1: Can we set the shuffle partition number for each stage automatically?
Spark SQL* Execution Plan

• The execution plan is fixed after the planning phase.

Image from: https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
Spark SQL* Join Selection

SELECT xxx
FROM A
JOIN B
ON A.Key1 = B.Key2
Broadcast Hash Join
[Figure: Table B is broadcast to every executor. Task 1, Task 2, …, Task n each run on an executor and join one partition of table A (A1, A2, …, An) with the full copy of B.]
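The idea in the figure can be sketched in plain Python. This is an illustrative toy, not Spark's implementation: rows are assumed to be (key, value) pairs, each list in a_partitions stands for one partition of table A, and the function name is hypothetical.

```python
from collections import defaultdict

def broadcast_hash_join(a_partitions, table_b):
    """Join every partition of A against one shared hash table built from B."""
    # Build side: the small table B is hashed once and shared by all tasks.
    b_by_key = defaultdict(list)
    for key, b_val in table_b:
        b_by_key[key].append(b_val)

    # Probe side: each task scans only its own partition of A, so A is never shuffled.
    results = []
    for partition in a_partitions:  # one iteration stands for one task
        out = [(key, a_val, b_val)
               for key, a_val in partition
               for b_val in b_by_key.get(key, [])]
        results.append(out)
    return results

a_parts = [[(1, "a1"), (2, "a2")], [(2, "a3"), (3, "a4")]]
b = [(2, "b1"), (3, "b2")]
print(broadcast_hash_join(a_parts, b))
# [[(2, 'a2', 'b1')], [(2, 'a3', 'b1'), (3, 'a4', 'b2')]]
```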
Shuffle Hash Join / Sort Merge Join

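[Figure: MAP tasks read partitions A0, A1, A2 of table A and B0, B1, B2 of table B. The SHUFFLE redistributes the rows by join key. Each REDUCE task joins one shuffle partition (Partition 0, Partition 1, Partition 2, ……) and writes its output.]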
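A toy version of the map, shuffle and reduce steps is sketched below. It assumes (key, value) rows and uses a per-partition hash lookup on the reduce side, so it corresponds to a shuffle hash join; a sort merge join would sort both sides of each partition and merge them instead. The function name is hypothetical.

```python
def shuffle_join(table_a, table_b, num_partitions):
    """Hash-partition both inputs by key, then join partition by partition."""
    def shuffle(rows):
        # Rows with the same key always land in the same partition.
        parts = [[] for _ in range(num_partitions)]
        for key, val in rows:
            parts[hash(key) % num_partitions].append((key, val))
        return parts

    a_parts, b_parts = shuffle(table_a), shuffle(table_b)
    output = []
    for a_part, b_part in zip(a_parts, b_parts):  # one reduce task per partition
        b_by_key = {}
        for key, b_val in b_part:
            b_by_key.setdefault(key, []).append(b_val)
        output.append([(k, a, b) for k, a in a_part for b in b_by_key.get(k, [])])
    return output

a = [(1, "a1"), (2, "a2"), (2, "a3"), (4, "a4")]
b = [(2, "b1"), (4, "b2"), (5, "b3")]
print(shuffle_join(a, b, 3))
# [[], [(4, 'a4', 'b2')], [(2, 'a2', 'b1'), (2, 'a3', 'b1')]]
```

Unlike the broadcast case, both tables cross the network here, and one heavily populated key makes one reduce partition much larger than the rest.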
Spark SQL* Join Selection
• spark.sql.autoBroadcastJoinThreshold is 10 MB by default

• For complex queries, a Join may take intermediate results as inputs. At the planning phase, Spark SQL* does not know their exact size, so it plans the Join as a SortMergeJoin.
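The effect of a missing size estimate can be illustrated with a simplified selection rule. This is a sketch, not Spark's planner: the function name is hypothetical, sizes are in bytes, and None stands for "size unknown at planning time", which this sketch treats as too large to broadcast.

```python
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # 10 MB default, in bytes

def choose_join(left_bytes, right_bytes, threshold=AUTO_BROADCAST_JOIN_THRESHOLD):
    """Pick a join strategy from size estimates; None means the size is unknown."""
    known = [s for s in (left_bytes, right_bytes) if s is not None]
    if known and min(known) <= threshold:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

# Planning time: one side is 100 GB, the other is an intermediate result of unknown size.
print(choose_join(100 * 1024**3, None))             # SortMergeJoin
# Runtime: the intermediate result turns out to be 5 MB.
print(choose_join(100 * 1024**3, 5 * 1024 * 1024))  # BroadcastHashJoin
```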

Question 2: Can we optimize the execution plan at runtime based on the runtime statistics?


Data Skew in Join
• Some partitions hold far more data than the other partitions.
• Data skew is a common source of slowness for Shuffle Joins.
Ways to Handle Skewed Join nowadays

• Increase the shuffle partition number
• Increase the BroadcastJoin threshold to change Shuffle Join to Broadcast Join
• Add prefix to the skewed keys
• ……
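The key-prefix workaround from the list above is usually called salting. The sketch below shows the general idea under stated assumptions: rows are (key, value) pairs, the set of skewed keys is already known, and all function names are hypothetical. It illustrates the manual effort involved rather than any Spark API.

```python
import random

def salt_skewed_side(rows, skewed_keys, num_salts, rng=random):
    """Prefix each skewed key with a random salt so its rows spread over num_salts keys."""
    out = []
    for key, val in rows:
        salt = rng.randrange(num_salts) if key in skewed_keys else 0
        out.append(((salt, key), val))
    return out

def replicate_other_side(rows, skewed_keys, num_salts):
    """Copy each matching row once per salt so every salted key still finds its partner."""
    out = []
    for key, val in rows:
        salts = range(num_salts) if key in skewed_keys else range(1)
        out.extend(((s, key), val) for s in salts)
    return out

def hash_join(a_rows, b_rows):
    """Plain in-memory equi-join on the first tuple element."""
    b_by_key = {}
    for key, b_val in b_rows:
        b_by_key.setdefault(key, []).append(b_val)
    return [(k, a, b) for k, a in a_rows for b in b_by_key.get(k, [])]

a = [("hot", i) for i in range(6)] + [("cold", 100)]
b = [("hot", "x"), ("cold", "y")]
salted = hash_join(salt_skewed_side(a, {"hot"}, 3), replicate_other_side(b, {"hot"}, 3))
# Dropping the salt gives the same rows as joining the original tables.
print(sorted((k[1], av, bv) for k, av, bv in salted) == sorted(hash_join(a, b)))  # True
```

The skewed key must be identified by hand and the other table is partly duplicated, which is why this approach does not scale to every query.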

Question 3: These workarounds require a lot of manual effort and are limited. Can we handle skewed joins at runtime automatically?
Adaptive Execution Background

• SPARK-9850: Adaptive execution in Spark*

• SPARK-9851: Support submitting map stages individually in DAGScheduler

• SPARK-9858: Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.
A New Adaptive Execution Engine in Spark SQL*

Adaptive Execution Architecture
• Divide the plan into multiple QueryStages. Each QueryStage reads its inputs through QueryStageInput nodes, which refer to its ChildStages.

• For each QueryStage:
(a) Execute ChildStages
(b) Optimize the plan
(c) Determine Reducer num

• Execute the Stage: the QueryStage is turned into a DAG of RDDs and its stages are executed.

[Figure: an execution plan with FileScan, Exchange, Sort and SortMergeJoin is divided into QueryStages at the Exchanges. After the ChildStages run, one QueryStageInput has Size=100GB and the other has Size=5MB, so the SortMergeJoin is replaced by a BroadcastJoin and the DAG of RDDs uses a LocalShuffledRDD.]
Auto Setting the Number of Reducers
• 5 initial reducer partitions with size
[70 MB, 30 MB, 20 MB, 10 MB, 50 MB]
• Set target size per reducer = 64 MB. At runtime, we use 3 actual reducers.
• Also support setting target row count per reducer.
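The 5-to-3 reduction above can be reproduced with a greedy pass over the partition sizes. This is a minimal sketch that assumes only adjacent partitions are merged and that a new reducer starts whenever adding the next partition would exceed the target; the function name is hypothetical and sizes are in MB.

```python
def coalesce_partitions(sizes_mb, target_mb):
    """Greedily merge adjacent shuffle partitions without exceeding the target size."""
    groups, current, current_size = [], [], 0
    for idx, size in enumerate(sizes_mb):
        # Close the current group if this partition would push it past the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups

print(coalesce_partitions([70, 30, 20, 10, 50], 64))
# [[0], [1, 2, 3], [4]]
```

Partition 0 alone already exceeds 64 MB, so it keeps its own reducer; partitions 1 to 3 add up to 60 MB and share one.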

Map Task 1 and Map Task 2 each write Partition 0 to Partition 4.

Reduce Task 1: Partition 0 (70 MB)
Reduce Task 2: Partition 1 (30 MB), Partition 2 (20 MB), Partition 3 (10 MB)
Reduce Task 3: Partition 4 (50 MB)
Shuffle Join => Broadcast Join
Example 1
• T1 < broadcast threshold

• T2 and T3 > broadcast threshold

• In this case, neither Join1 nor Join2 is changed to broadcast join

[Figure: one QueryStage contains SortMergeJoin1, which joins T1 and T2, and SortMergeJoin2, which joins the result of Join1 with T3. T1, T2 and T3 are each a QueryStageInput (child stage).]
Shuffle Join => Broadcast Join
Example 2
• T1 and T3 < broadcast threshold

• T2 > broadcast threshold

• In this case, both Join1 and Join2 are changed to broadcast join

[Figure: one QueryStage contains SortMergeJoin1, which joins T1 and T2, and SortMergeJoin2, which joins the result of Join1 with T3. T1, T2 and T3 are each a QueryStageInput (child stage).]
Remote Shuffle Read => Local Shuffle Read
[Figure: map output A0, B0 is stored on Node 1 and map output A1, B1 on Node 2. With Remote Shuffle Read, the reduce tasks on Node 1 and Node 2 (Task1 to Task5) fetch data from the map output on both nodes. With Local Shuffle Read, the reduce tasks on each node (task 1, task 2) read only the map output stored on that node.]
Skewed Partition Detection at Runtime
• After executing the child stages, we calculate the data size and row count of each partition from MapStatus.

• A partition is skewed if its data size or row count is N times larger than the median, and also larger than a pre-defined threshold.
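The skew rule can be written as a small predicate. This sketch reads "N times larger than the median" as "greater than N × median"; the function name, the factor and the threshold values are illustrative assumptions, and the same check applies to either data sizes or row counts.

```python
from statistics import median

def skewed_partitions(sizes, factor, min_size):
    """Indices of partitions larger than factor * median AND larger than min_size."""
    med = median(sizes)
    return [i for i, s in enumerate(sizes) if s > factor * med and s > min_size]

sizes_mb = [900, 40, 50, 45, 60]  # median is 50 MB
print(skewed_partitions(sizes_mb, factor=10, min_size=64))    # [0]
print(skewed_partitions(sizes_mb, factor=10, min_size=1000))  # []
```

The absolute threshold keeps small stages from being flagged when every partition is tiny but one happens to be many times the median.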
Handling Skewed Join
Table A (Partition 0 is skewed), Table B

• Use N tasks instead of 1 task to join the data in Partition 0. Each task shuffle-reads one slice of Table A's Partition 0 (A0-0, A0-1, …, A0-N) and joins it with B0, Partition 0 of Table B.

• The join result = Union (A0-0 Join B0, A0-1 Join B0, … , A0-N Join B0)
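The union formula can be checked on toy data. The sketch below assumes (key, value) rows and an inner equi-join; a0_slices stands for the slices A0-0 … A0-N and the function names are hypothetical. Each loop iteration corresponds to one of the N tasks.

```python
def hash_join(a_rows, b_rows):
    """Plain in-memory inner equi-join on the first tuple element."""
    b_by_key = {}
    for key, b_val in b_rows:
        b_by_key.setdefault(key, []).append(b_val)
    return [(k, a, b) for k, a in a_rows for b in b_by_key.get(k, [])]

def skewed_partition_join(a0_slices, b0):
    """Join each slice of the skewed partition with all of B0, then union the results."""
    result = []
    for a_slice in a0_slices:  # one task per slice, each reading the whole of B0
        result.extend(hash_join(a_slice, b0))
    return result

a0_slices = [[(7, "a0"), (7, "a1")], [(7, "a2")], [(8, "a3"), (7, "a4")]]
b0 = [(7, "b0"), (8, "b1")]
whole = hash_join([row for s in a0_slices for row in s], b0)
print(sorted(skewed_partition_join(a0_slices, b0)) == sorted(whole))  # True
```

B0 is read once per slice, so the extra parallelism on the skewed side is paid for with repeated reads of the other side's partition.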
Benchmark Result

Cluster Setup
Hardware: BDW

Slave Node# 98
  CPU      Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz (88 cores)
  Memory   256 GB
  Disk     7× 400 GB SSD
  Network  10 Gigabit Ethernet

Master
  CPU      Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz (88 cores)
  Memory   256 GB
  Disk     7× 400 GB SSD
  Network  10 Gigabit Ethernet

Software
  OS             CentOS* Linux release 6.9
  Kernel         2.6.32-573.22.1.el6.x86_64
  Spark*         Spark* master (2.3) / Spark* master (2.3) with adaptive execution patch
  Hadoop*/HDFS*  hadoop-2.7.3
  JDK            1.8.0_40 (Oracle* Corporation)
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
TPC-DS* 100TB Benchmark

Spark SQL vs. Adaptive Execution

[Bar chart: Duration (s), on an axis from 0 to 500, of Spark SQL and Adaptive Execution for q8, q81, q30, q51, q61, q60, q90, q37, q82, q56, q31, q19, q41, q74 and q91. The speedup labels on the bars range from 1.2X to 3.2X.]
Auto Setting the Shuffle Partition Number
• Less scheduler overhead and task startup time.
• Fewer disk IO requests.
• Less data is written to disk because more data is aggregated.

Partition Number 10976 (q30)

Partition Number changed to 1084 and 1079 at runtime. (q30)

SortMergeJoin -> BroadcastJoin at Runtime
• Eliminates the data skew and stragglers in SortMergeJoin
• Remote shuffle read -> local shuffle read.
• Random IO read -> Sequential IO read
SortMergeJoin (q8):

BroadcastJoin (q8 Adaptive Execution):

Scheduling Difference
• Spark SQL* has to wait for all broadcasts to complete before scheduling the stages. Adaptive Execution can start a stage as soon as its dependencies are completed.
Original Spark:

50 Seconds Gap

Adaptive Execution:

Thank YOU
Legal Disclaimer
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as
well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel
representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are
available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting
www.intel.com/design/literature.htm.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced
data are accurate.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others

Copyright © 2017 Intel Corporation.
