DBR 7.x - Spark 3.x Features & Migration
Prashanth Babu
Senior Resident Solutions Architect
Created: 25th Nov 2020
Last updated: 20th Jan 2021
Why should you migrate to DBR 7?
❖ Support for DBR 6.4 (Spark 2.4.x) ends on 1st April, 2021.
➢ DBR 6 will not be available after this cutoff
➢ You will not see this version in the Clusters dropdown menu or via the REST API.
■ But your existing jobs on earlier versions will continue to run fine.
4
Why should you migrate to DBR 7?
❖ DBR 7.3 (Spark 3.0.1) is the new LTS release in the 7.x line
https://docs.microsoft.com/en-us/azure/databricks/release-notes/runtime/releases
https://docs.databricks.com/release-notes/runtime/releases.html
5
Spark 3.0 feature highlights
▪ Performance & SQL compatibility: Adaptive Query Execution, Dynamic Partition Pruning, 40% compile-time reduction, cache sync, reserved keywords in the parser, proleptic Gregorian calendar, store assignment in INSERT, overflow checking
▪ Features & connectors: accelerator-aware scheduler, JDK 11 support, new built-in functions, Parquet/ORC nested column pruning, new binary data source, new NOOP data source, CSV filter pushdown, join hints
▪ Monitoring & APIs: Structured Streaming UI, JDBC tab in the Spark History Server, observable metrics, event log rollover, new Pandas UDFs using Python type hints, Pandas UDF enhancements, eager execution in the R shell, vectorization in SparkR
▪ SQL & ecosystem: EXPLAIN FORMATTED, DESCRIBE query, query plan dump, improved test coverage, Data Source APIs + Catalog, Hive 3.x metastore support, Hive 2.3 execution, Hadoop 3 support, SQL reference
6
Agenda
Performance: Spark 3.0 comes with performance improvements to make Spark faster, cheaper, and more flexible
▪ Adaptive Query Execution
▪ Dynamic Partition Pruning
▪ Join Optimization Hints
Usability: new features make Spark even easier to use
▪ Spark SQL: Explain Plans
Spark ML: new features in Spark ML
Migration and compatibility: important compatibility / behavior changes for migration
Spark Ecosystem
▪ New features in Delta
▪ Third party connectors
7
Free ebook (Learning Spark, 2nd Edition): https://dbricks.co/ebLearningSpark
8
Performance
9
Adaptive Query Execution
(AQE)
10
Adaptive Query Execution (AQE)
Re-optimizes queries based on the most up-to-date runtime statistics
11
Optimization in Spark 2.x
12
Adaptive Query Execution
Adaptive planning: based on statistics of the completed plan nodes, re-optimize the execution plan of the remaining query stages
▪ Dynamically switch join strategies
▪ Dynamically coalesce shuffle partitions
▪ Dynamically optimize skew joins
13
Performance Pitfall
Using the wrong join strategy
▪ Choose Broadcast Hash Join?
▪ Increase “spark.sql.autoBroadcastJoinThreshold”?
▪ Use the “broadcast” hint? (see the sketch below)
However
▪ Hard to tune
▪ Hard to maintain over time
▪ OOM…
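A minimal sketch of the two workarounds named above, using the public PySpark APIs; the table and column names ("sales", "stores", "store_id") are illustrative only:

from pyspark.sql.functions import broadcast

# Raise the size threshold below which Spark broadcasts the smaller join side (default 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Or force a broadcast explicitly with the broadcast() hint.
sales = spark.table("sales")
stores = spark.table("stores")
joined = sales.join(broadcast(stores), "store_id")

Both approaches are static: they must be re-tuned as table sizes change and can cause OOMs when the estimate is wrong, which is exactly what AQE removes the need for.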
14
Adaptive Query Execution
Vision: no more manual setting of broadcast hints/thresholds!
Capability: convert a sort-merge join (SMJ) into a broadcast hash join (BHJ) at runtime
[Diagram: the statically estimated relation size (15 MB) exceeds the broadcast threshold, but the actual runtime size (8 MB) fits, so AQE switches to a broadcast hash join]
15
Performance Pitfall
Choosing the wrong shuffle partition number
▪ Tuning spark.sql.shuffle.partitions
▪ Default magic number: 200 !?!
However
▪ Too small: GC pressure; disk spilling
▪ Too large: Inefficient I/O; scheduler pressure
▪ Hard to tune over the whole query plan
▪ Hard to maintain over time
16
Adaptive Query Execution
VISION: No more manual tuning of spark.sql.shuffle.partitions! Capability: coalesce shuffle partitions
▪ Set the initial partition number high to accommodate the largest data
size of the entire query execution
▪ Automatically coalesce partitions if needed after each query stage
[Diagram: AQE loop of executing a query stage, then re-optimizing the remaining plan]
17
Performance Pitfall
Data skew
▪ Symptoms of data skew
▪ Frozen/long-running tasks
▪ Disk spilling
▪ Low resource utilization in most nodes
▪ OOM
▪ Common workarounds
▪ Find the skewed values and rewrite the queries
▪ Adding extra (salting) keys to spread out the skew…
18
Adaptive Query Execution
Data Skew
19
Adaptive Query Execution
VISION: No more manual tuning of skew hints!
20
Skew Join and AQE Demo
21
AQE Configuration Settings
AQE is enabled by default in DBR 7.3 LTS and above: Blogpost
▪ spark.sql.adaptive.coalescePartitions.enabled
▪ spark.sql.adaptive.coalescePartitions.minPartitionNum
▪ spark.sql.adaptive.coalescePartitions.initialPartitionNum
▪ spark.sql.adaptive.advisoryPartitionSizeInBytes
(see the configuration sketch below)
22
Key Takeaways - AQE
▪ Adaptive Query Execution is an extensible framework
▪ Individual AQE features may default to enabled, but they are all still gated by the master configuration flag spark.sql.adaptive.enabled
23
Dynamic Partition
Pruning
24
Static Partition Pruning
Most optimizations employ simple static partition pruning
[Diagram: a join where the static filter is pushed down into the scan of one side]
26
Table Denormalization
[Diagram: Sales joined with Stores, with the filter city = 'New York' applied to the Stores scan]
27
Dynamic Partition Pruning
Physical Plan Optimization
[Diagram: broadcast hash join in which the broadcast exchange built from the filtered dimension-table scan also feeds a dynamic filter into the fact-table file scan]
28
Dynamic Partition Pruning
▪ Reads only the data you need
▪ Optimizes queries by pruning the partitions read from a fact table, using the partition values that result from filtering the dimension tables
▪ Significant speedups, shown on many TPC-DS queries
▪ May help you avoid ETL-ing denormalized tables (see the example below)
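A minimal sketch of the star-schema query shape that benefits from DPP; the tables ("sales", assumed to be partitioned by store_id, plus a small "stores" dimension) are illustrative only:

# The filter on the small dimension table is turned into a dynamic partition
# filter on the partitioned fact table, so only matching partitions are scanned.
result = spark.sql("""
    SELECT s.store_id, SUM(s.amount) AS total
    FROM sales s
    JOIN stores d ON s.store_id = d.store_id
    WHERE d.city = 'New York'
    GROUP BY s.store_id
""")
result.explain()  # look for dynamicpruningexpression(...) in the fact table's partition filters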
29
3.0: SQL Engine
Adaptive Query Execution (AQE): change the execution plan at runtime to automatically set the number of reducers and the join algorithms
[Chart: TPC-DS 1TB (no-stats) query durations in seconds, with vs. without Adaptive Query Execution; the largest gains come from changing the join algorithm]
Accelerates TPC-DS queries up to 8x
30
SMJ -> BHJ with AQE Demo
Without AQE
With AQE
31
32
Key Takeaways - Dynamic Partition Pruning
▪ It is enabled by default
▪ Spark first runs a query on the “small” (dimension) table; its result is used to build a “dynamic filter” of partition values that is pushed into the fact-table scan
33
Join Optimization Hints
34
Join Optimization Hints
▪ Join hints enable the user to override the optimizer and select a join strategy themselves.
35
Join Strategies
▪ Sort-Merge
▪ Broadcast Hash
▪ Shuffle Hash
▪ Shuffle-and-Replicate Nested Loop
36
Join Hint Syntax
For Broadcast Joins
* Note the spaces for the hints
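The slide's screenshot is not reproduced here; a minimal sketch of the broadcast hint, assuming tables t1 and t2 are already registered:

# SQL form: the hint lives inside a /*+ ... */ comment right after SELECT.
spark.sql("SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key").show()

# DataFrame API equivalent.
df1 = spark.table("t1")
df2 = spark.table("t2")
df1.hint("broadcast").join(df2, ["key"]).show()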
37
Join Hint Syntax
For Shuffle Sort Merge Joins
* Note the spaces for the hints
38
Join Hint Syntax
For Shuffle Hash Joins and Shuffle-and-Replicate Nested Loop Join
* Note the spaces for the hints
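A minimal sketch of these two hints, again assuming tables t1 and t2 are registered; SHUFFLE_HASH and SHUFFLE_REPLICATE_NL are the hint names added in Spark 3.0:

# Shuffle hash join hint.
spark.sql("SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key").show()

# Shuffle-and-replicate nested loop (cartesian product) join hint.
spark.sql("SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key").show()

# DataFrame API equivalents.
df1 = spark.table("t1")
df2 = spark.table("t2")
df1.hint("shuffle_hash").join(df2, ["key"]).show()
df1.hint("shuffle_replicate_nl").join(df2, ["key"]).show()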
39
Join Hint Syntax
Shuffle Merge
SQL
SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 INNER JOIN t2 ON
t1.key = t2.key;
Python
df1 = spark.table("t1")
df2 = spark.table("t2")
df1.hint("SHUFFLE_MERGE").join(df2, ["key"]).show()
40
Key Takeaways - Join Optimizations
▪ Join hints enable the user to manually override the optimizer's (and AQE's) choice of join strategy
▪ More join hint strategies are available as of Spark 3.0 / DBR 7.x
41
Usability -- Spark SQL:
Explain Plans
42
Spark: Catalyst Optimizer
43
Spark SQL: EXPLAIN Plans
44
Spark SQL: EXPLAIN Plans
Note: Old syntax works in Spark 3.x as well
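A minimal sketch of both forms, assuming tables t1 and t2 are registered; the query itself is just a placeholder:

# Old syntax, still valid in Spark 3.x.
spark.sql("EXPLAIN EXTENDED SELECT * FROM t1 JOIN t2 ON t1.key = t2.key").show(truncate=False)

# New in Spark 3.0: EXPLAIN FORMATTED, and the equivalent DataFrame mode.
spark.sql("EXPLAIN FORMATTED SELECT * FROM t1 JOIN t2 ON t1.key = t2.key").show(truncate=False)
spark.table("t1").join(spark.table("t2"), "key").explain(mode="formatted")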
45
Spark SQL: Old EXPLAIN Plan
46
Spark SQL: New EXPLAIN FORMATTED
47
Spark SQL: New EXPLAIN FORMATTED
Subqueries are listed separately
48
Key Takeaways - Explain Plans
▪ Spark 3.0 adds usability features such as EXPLAIN FORMATTED for easier-to-read query plans.
49
Pandas UDFs
(aka Vectorized UDFs)
50
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Spark 2.x style: Scalar Pandas UDF [Series to Series]
@pandas_udf('long', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    # `v` is a pandas Series; outputs a pandas Series
    return v + 1

spark.range(10).select(pandas_plus_one("id")).show()

# Scalar Iterator Pandas UDF [Iterator of Series to Iterator of Series]
@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(itr):
    # `itr` is an iterator of pandas Series; outputs an iterator of pandas Series
    return map(lambda v: v + 1, itr)

spark.range(10).select(pandas_plus_one("id")).show()

Spark 2.3/4 vs. Spark 3.0: Python type hints
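The slide's Spark 3.0 column is a screenshot; a minimal sketch of the equivalent type-hint style, where the UDF variant is inferred from the Python type hints instead of a PandasUDFType value:

import pandas as pd
from typing import Iterator
from pyspark.sql.functions import pandas_udf

# Spark 3.0 style: Series-to-Series Pandas UDF declared via type hints.
@pandas_udf('long')
def pandas_plus_one(v: pd.Series) -> pd.Series:
    return v + 1

spark.range(10).select(pandas_plus_one("id")).show()

# Spark 3.0 style: Iterator of Series to Iterator of Series.
@pandas_udf('long')
def pandas_plus_one_iter(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for v in batches:
        yield v + 1

spark.range(10).select(pandas_plus_one_iter("id")).show()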
53
Pandas UDFs
Pandas Function APIs
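Spark 3.0 also introduces Pandas function APIs such as mapInPandas; a minimal sketch (the even-number filter is purely illustrative):

from typing import Iterator
import pandas as pd

def keep_even(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Receives an iterator of pandas DataFrames and yields pandas DataFrames.
    for pdf in batches:
        yield pdf[pdf.id % 2 == 0]

df = spark.range(10)
df.mapInPandas(keep_even, schema=df.schema).show()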
57
Spark ML
▪ ML function parity between Scala and Python (PySpark).
61
Migration &
Compatibility
62
Compiling your code for Spark 3.0
• Only builds with Scala 2.12
• Deprecates Python 2 (already EOL)
• Can build with various Hadoop/Hive versions
– Hadoop 2.7 + Hive 1.2
– Hadoop 2.7 + Hive 2.3 (supports Java 11) [Default]
– Hadoop 3.2 + Hive 2.3 (supports Java 11)
• Supports the following Hive metastore versions:
– "0.12", "0.13", "0.14", "1.0", "1.1", "1.2", "2.0", "2.1", "2.2", "2.3", "3.0", "3.1"
• NOTE: Databricks Runtime 7.x supports only Java 8
Spark Core, Spark SQL, DataFrames, and Datasets
▪ Built with Scala 2.12 (should be backwards compatible)
DBX blogpost on Dates and Timestamps Notes on Azure Databricks Docs Portal
Spark Core, Spark SQL, DataFrames, and Datasets
▪ In Spark 2.x, the CSV/JSON data sources convert a malformed string to a row with all nulls in PERMISSIVE mode.
– In Spark 3.0, the returned row can contain non-null fields if some of the CSV/JSON column values were parsed and converted to the desired types successfully.
▪ To get the week number of the year from a date/timestamp in Spark 3.x, use EXTRACT or the weekofyear function (see the sketch below).
– In Spark 2.4 and below, this was possible via the "w" format pattern in the date_format method.
DBX blogpost on Dates and Timestamps Notes on Azure Databricks Docs Portal 66
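A minimal sketch of the Spark 3.x approaches (the date is illustrative):

from pyspark.sql import functions as F

df = spark.createDataFrame([("2021-01-20",)], ["d"]).select(F.to_date("d").alias("d"))
df.select(F.weekofyear("d").alias("week")).show()

# SQL EXTRACT form
spark.sql("SELECT extract(week FROM date'2021-01-20') AS week").show()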
Spark Core, Spark SQL, DataFrames, and Datasets
▪ For many changes, you do have the option to restore the pre-Spark 3.0 behavior using the spark.sql.legacy.* configuration flags (example below)
http://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30 67
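For instance, a sketch of one documented legacy flag, which restores the Spark 2.4 datetime parsing behavior:

# Fall back to the Spark 2.4 (SimpleDateFormat-based) datetime parser behavior.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")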
Spark Core, Spark SQL, DataFrames, and Datasets
▪ Type coercions are performed per ANSI SQL standards when inserting
new values into a column.
– Unreasonable conversions will fail and throw an exception
http://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30 68
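A minimal sketch of the new behavior; the table name is illustrative, and the policy is controlled by spark.sql.storeAssignmentPolicy (ANSI by default in Spark 3.0):

spark.sql("CREATE TABLE IF NOT EXISTS demo_ints (i INT) USING parquet")

# OK: a numeric value is coerced into the INT column.
spark.sql("INSERT INTO demo_ints VALUES (1)")

# Fails under the ANSI store-assignment policy: string cannot be cast to int.
spark.sql("INSERT INTO demo_ints VALUES ('not a number')")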
Spark Core, Spark SQL, DataFrames, and Datasets
▪ is removed
▪ Event log files will be written in UTF-8 encoding, NOT in the default charset of the driver JVM process
http://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30 69
Spark Core
Deprecated methods removed/replaced
70
ANSI Compliance in Spark SQL
There are two new experimental options, spark.sql.ansi.enabled and spark.sql.storeAssignmentPolicy, for better compliance with the ANSI SQL standard
71
32 New Built-in Functions
72
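A few illustrative examples (these particular functions are among the built-ins added in Spark 3.0):

spark.sql("SELECT make_date(2021, 1, 20) AS d").show()
spark.sql("SELECT max_by(name, salary) FROM VALUES ('a', 10), ('b', 20) AS t(name, salary)").show()
spark.sql("SELECT count_if(id % 2 = 0) FROM range(10)").show()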
PySpark
Library updates may be required
https://docs.microsoft.com/en-us/azure/databricks/release-notes/runtime/7.3
https://docs.databricks.com/release-notes/runtime/7.3.html
73
SparkR
Deprecated methods removed/replaced
74
Key Takeaways - Migration & Compatibility
▪ Spark 3.0 is more user-friendly and standardized.
▪ A lot of new built-in functions and higher-order functions (HOFs).
▪ You might have to re-compile your libraries for the upgraded runtime
76
Delta
• In addition to many new features of Spark 3.0, DBR 7.x also
comes with a number of optimizations and new Delta features.
– Hilbert Curve (coming soon)
– DBR 7.4 -- CONSTRAINTS, RESTORE
– DBR 7.3 -- Significant reduction in Delta metadata overhead time
– DBR 7.2 -- DEEP / SHALLOW CLONE
– DBR 7.1 -- CONVERT TO DELTA, MERGE INTO performance
improvements
– DBR 7.0 -- Auto Loader, COPY INTO, Dynamic subquery reuse
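A minimal sketch of a few of these commands; the table names and path are illustrative, and the exact syntax for each is documented with the matching DBR release:

# Convert an existing Parquet directory to Delta in place.
spark.sql("CONVERT TO DELTA parquet.`/mnt/data/events`")

# Create a zero-copy shallow clone of a Delta table (DBR 7.2+).
spark.sql("CREATE TABLE events_clone SHALLOW CLONE events")

# Roll a table back to an earlier version (RESTORE, DBR 7.4+).
spark.sql("RESTORE TABLE events TO VERSION AS OF 5")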
Third party connectors
78
Compatibility of connectors to external systems
Some 3rd party connectors are still not compatible with Spark 3.0
▪ Azure Cosmos DB: not supported, PR available in the repo; compile yourself using code from the pull request
▪ Azure Event Hubs: supported; best to use the latest version (2.3.17 had problems)
▪ Apache Cassandra: supported since version 3.0.0; for Databricks the assembly variant should be used, and some functionality may not be available out of the box
▪ Spark-Redis: not supported, no PR available; may work with a version compiled with Scala 2.12, not tested
▪ Elasticsearch: not supported, PR available; compile yourself using code from the pull request
▪ Salesforce: not supported, no PR available; may work with a version compiled with Scala 2.12, not tested
▪ Apache HBase: not supported, no PR available
▪ Snowflake: supported
▪ Google BigQuery: unknown; could be compatible in a version compiled with Scala 2.12, needs testing
▪ Exasol: unknown; could be compatible in a version compiled with Scala 2.12, needs testing
83
References and further reading
84
References and further reading
● Spark 3.0 migration guide -- Apache Spark Docs
● New Pandas UDFs & Python Type hints in Spark 3.0 -- Databricks Blogpost ||
Spark+AI Summit talk
● Top tuning tips for Spark 3.0 & Delta Lake -- Databricks Tech Talk
85
Thank you
86