DBR 7.x - Spark 3.x Features & Migration
Prashanth Babu
Senior Resident Solutions Architect
Created: 25th Nov 2020
Last updated: 20th Jan 2021
Why should you migrate to DBR 7?
❖ Support for DBR 6.4 (Spark 2.4.x) ends on 1st April, 2021.
➢ DBR 6 will not be available after this cutoff
➢ You will not see this version in the Clusters dropdown menu or via the REST API.
■ But your existing jobs on earlier versions will continue to run fine.
4
Why should you migrate to DBR 7?
❖ DBR 7.3 (Spark 3.0.1) is the new LTS release in the 7.x line
https://docs.microsoft.com/en-us/azure/databricks/release-notes/runtime/releases
https://docs.databricks.com/release-notes/runtime/releases.html
5
Spark 3.0 feature highlights
▪ Performance & SQL compatibility: Adaptive Query Execution, Dynamic Partition Pruning, 40% compile-time reduction, cache sync, reserved keywords in the parser, proleptic Gregorian calendar, store assignment in INSERT, overflow checking
▪ Features & connectors: accelerator-aware scheduler, JDK 11 support, new built-in functions, Parquet/ORC nested column pruning, new binary data source, new NOOP data source, CSV filter pushdown, join hints
▪ Monitoring & APIs: Structured Streaming UI, JDBC tab in the Spark History Server, observable metrics, event log rollover, new Pandas UDFs using Python type hints, Pandas UDF enhancements, eager execution in the R shell, vectorization in SparkR
▪ SQL & ecosystem: EXPLAIN FORMATTED, DESCRIBE query, query plan dump, improved test coverage, Data Source APIs + Catalog, Hive 3.x metastore support, Hive 2.3 execution, Hadoop 3 support, SQL reference
6
Agenda
Performance: Spark 3.0 comes with performance improvements to make Spark faster, cheaper, and more flexible
▪ Adaptive Query Execution
▪ Dynamic Partition Pruning
▪ Join Optimization Hints
Usability: new features make Spark even easier to use
▪ Spark SQL: Explain Plans
Spark ML: new features in Spark ML
Migration and compatibility: important compatibility / behavior changes for migration
Spark Ecosystem
▪ New features in Delta
▪ Third party connectors
7
Free ebook (Learning Spark, 2nd Edition): https://dbricks.co/ebLearningSpark
8
Performance
9
Adaptive Query Execution
(AQE)
10
Adaptive Query Execution (AQE)
Re-optimizes queries based on the most up-to-date runtime statistics
11
Optimization in Spark 2.x
12
Adaptive Query Execution
Adaptive planning: based on statistics of the completed plan nodes, re-optimize the execution plan of the remaining query stages
▪ Dynamically switch join strategies
▪ Dynamically coalesce shuffle partitions
▪ Dynamically optimize skew joins
13
Performance Pitfall
Using the wrong join strategy
▪ Choose Broadcast Hash Join?
▪ Increase “spark.sql.autoBroadcastJoinThreshold”?
▪ Use the “broadcast” hint? (see the sketch below)
However
▪ Hard to tune
▪ Hard to maintain over time
▪ OOM…
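A minimal sketch of the two workarounds named above, using the public PySpark APIs; the table and column names ("sales", "stores", "store_id") are illustrative only:

from pyspark.sql.functions import broadcast

# Raise the size threshold below which Spark broadcasts the smaller join side (default 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Or force a broadcast explicitly with the broadcast() hint.
sales = spark.table("sales")
stores = spark.table("stores")
joined = sales.join(broadcast(stores), "store_id")

Both approaches are static: they must be re-tuned as table sizes change and can cause OOMs when the estimate is wrong, which is exactly what AQE removes the need for.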
14
Adaptive Query Execution
Vision: no more manual setting of broadcast hints/thresholds!
Capability: convert a sort-merge join (SMJ) into a broadcast hash join (BHJ) at runtime
[Diagram: the statically estimated relation size (15 MB) exceeds the broadcast threshold, but the actual runtime size (8 MB) fits, so AQE switches to a broadcast hash join]
15
Performance Pitfall
Choosing the wrong shuffle partition number
▪ Tuning spark.sql.shuffle.partitions
▪ Default magic number: 200 !?!
However
▪ Too small: GC pressure; disk spilling
▪ Too large: Inefficient I/O; scheduler pressure
▪ Hard to tune over the whole query plan
▪ Hard to maintain over time
16
Adaptive Query Execution
VISION: No more manual tuning of spark.sql.shuffle.partitions! Capability: coalesce shuffle partitions
▪ Set the initial partition number high to accommodate the largest data
size of the entire query execution
▪ Automatically coalesce partitions if needed after each query stage
[Diagram: AQE loop of executing a query stage, then re-optimizing the remaining plan]
17
Performance Pitfall
Data skew
▪ Symptoms of data skew
▪ Frozen/long-running tasks
▪ Disk spilling
▪ Low resource utilization in most nodes
▪ OOM
▪ Common workarounds
▪ Find the skewed values and rewrite the queries
▪ Adding extra (salting) keys to spread out the skew…
18
Adaptive Query Execution
Data Skew
19
Adaptive Query Execution
VISION: No more manual tuning of skew hints!
20
Skew Join and AQE Demo
21
AQE Configuration Settings
AQE is enabled by default in DBR 7.3 LTS and above: Blogpost
▪ spark.sql.adaptive.coalescePartitions.enabled
▪ spark.sql.adaptive.coalescePartitions.minPartitionNum
▪ spark.sql.adaptive.coalescePartitions.initialPartitionNum
▪ spark.sql.adaptive.advisoryPartitionSizeInBytes
(see the configuration sketch below)
22
Key Takeaways - AQE
▪ Adaptive Query Execution is an extensible framework
▪ Individual AQE features may default to enabled, but they are all still gated by the master configuration flag spark.sql.adaptive.enabled
23
Dynamic Partition
Pruning
24
Static Partition Pruning
Most optimizations employ simple static partition pruning
[Diagram: a join where the static filter is pushed down into the scan of one side]
26
Table Denormalization
[Diagram: Sales joined with Stores, with the filter city = 'New York' applied to the Stores scan]
27
Dynamic Partition Pruning
Physical Plan Optimization
[Diagram: broadcast hash join in which the broadcast exchange built from the filtered dimension-table scan also feeds a dynamic filter into the fact-table file scan]
28
Dynamic Partition Pruning
▪ Reads only the data you need
▪ Optimizes queries by pruning the partitions read from a fact table, using the partition values that result from filtering the dimension tables
▪ Significant speedups, shown on many TPC-DS queries
▪ May help you avoid ETL-ing denormalized tables (see the example below)
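A minimal sketch of the star-schema query shape that benefits from DPP; the tables ("sales", assumed to be partitioned by store_id, plus a small "stores" dimension) are illustrative only:

# The filter on the small dimension table is turned into a dynamic partition
# filter on the partitioned fact table, so only matching partitions are scanned.
result = spark.sql("""
    SELECT s.store_id, SUM(s.amount) AS total
    FROM sales s
    JOIN stores d ON s.store_id = d.store_id
    WHERE d.city = 'New York'
    GROUP BY s.store_id
""")
result.explain()  # look for dynamicpruningexpression(...) in the fact table's partition filters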
29
3.0: SQL Engine
Adaptive Query Execution (AQE): change the execution plan at runtime to automatically set the number of reducers and the join algorithms
[Chart: TPC-DS 1TB (no-stats) query durations in seconds, with vs. without Adaptive Query Execution; the largest gains come from changing the join algorithm]
Accelerates TPC-DS queries up to 8x
30
SMJ -> BHJ with AQE Demo
Without AQE
With AQE
31
32
Key Takeaways - Dynamic Partition Pruning
▪ It is enabled by default
▪ Spark first runs a query on the “small” (dimension) table; its result is used to build a “dynamic filter” of partition values that is pushed into the fact-table scan
33
Join Optimization Hints
34
Join Optimization Hints
▪ Join hints enable the user to override the optimizer and select a join strategy themselves.
35
Join Strategies
▪ Sort-Merge
▪ Broadcast Hash
▪ Shuffle Hash
▪ Shuffle-and-Replicate Nested Loop
36
Join Hint Syntax
For Broadcast Joins
* Note the spaces for the hints
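The slide's screenshot is not reproduced here; a minimal sketch of the broadcast hint, assuming tables t1 and t2 are already registered:

# SQL form: the hint lives inside a /*+ ... */ comment right after SELECT.
spark.sql("SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key").show()

# DataFrame API equivalent.
df1 = spark.table("t1")
df2 = spark.table("t2")
df1.hint("broadcast").join(df2, ["key"]).show()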
37
Join Hint Syntax
For Shuffle Sort Merge Joins
* Note the spaces for the hints
38
Join Hint Syntax
For Shuffle Hash Joins and Shuffle-and-Replicate Nested Loop Join
* Note the spaces for the hints
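A minimal sketch of these two hints, again assuming tables t1 and t2 are registered; SHUFFLE_HASH and SHUFFLE_REPLICATE_NL are the hint names added in Spark 3.0:

# Shuffle hash join hint.
spark.sql("SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key").show()

# Shuffle-and-replicate nested loop (cartesian product) join hint.
spark.sql("SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key").show()

# DataFrame API equivalents.
df1 = spark.table("t1")
df2 = spark.table("t2")
df1.hint("shuffle_hash").join(df2, ["key"]).show()
df1.hint("shuffle_replicate_nl").join(df2, ["key"]).show()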
39
Join Hint Syntax
Shuffle Merge
SQL
SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 INNER JOIN t2 ON
t1.key = t2.key;
Python
df1 = spark.table("t1")
df2 = spark.table("t2")
df1.hint("SHUFFLE_MERGE").join(df2, ["key"]).show()
40
Key Takeaways - Join Optimizations
▪ Join hints enable the user to manually override the optimizer's (and AQE's) choice of join strategy
▪ More join hint strategies are available as of Spark 3.0 / DBR 7.x
41
Usability -- Spark SQL:
Explain Plans
42
Spark: Catalyst Optimizer
43
Spark SQL: EXPLAIN Plans
44
Spark SQL: EXPLAIN Plans
Note: Old syntax works in Spark 3.x as well
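A minimal sketch of both forms, assuming tables t1 and t2 are registered; the query itself is just a placeholder:

# Old syntax, still valid in Spark 3.x.
spark.sql("EXPLAIN EXTENDED SELECT * FROM t1 JOIN t2 ON t1.key = t2.key").show(truncate=False)

# New in Spark 3.0: EXPLAIN FORMATTED, and the equivalent DataFrame mode.
spark.sql("EXPLAIN FORMATTED SELECT * FROM t1 JOIN t2 ON t1.key = t2.key").show(truncate=False)
spark.table("t1").join(spark.table("t2"), "key").explain(mode="formatted")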
45
Spark SQL: Old EXPLAIN Plan
46
Spark SQL: New EXPLAIN FORMATTED
47
Spark SQL: New EXPLAIN FORMATTED
Subqueries are listed separately
48
Key Takeaways - Explain Plans
▪ Spark 3.0 adds usability features such as EXPLAIN FORMATTED for easier-to-read query plans.
49
Pandas UDFs
(aka Vectorized UDFs)
50
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Spark 2.x style: Scalar Pandas UDF [Series to Series]
@pandas_udf('long', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    # `v` is a pandas Series; outputs a pandas Series
    return v + 1

spark.range(10).select(pandas_plus_one("id")).show()

# Scalar Iterator Pandas UDF [Iterator of Series to Iterator of Series]
@pandas_udf('long', PandasUDFType.SCALAR_ITER)
def pandas_plus_one(itr):
    # `itr` is an iterator of pandas Series; outputs an iterator of pandas Series
    return map(lambda v: v + 1, itr)

spark.range(10).select(pandas_plus_one("id")).show()

Spark 2.3/4 vs. Spark 3.0: Python type hints
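The slide's Spark 3.0 column is a screenshot; a minimal sketch of the equivalent type-hint style, where the UDF variant is inferred from the Python type hints instead of a PandasUDFType value:

import pandas as pd
from typing import Iterator
from pyspark.sql.functions import pandas_udf

# Spark 3.0 style: Series-to-Series Pandas UDF declared via type hints.
@pandas_udf('long')
def pandas_plus_one(v: pd.Series) -> pd.Series:
    return v + 1

spark.range(10).select(pandas_plus_one("id")).show()

# Spark 3.0 style: Iterator of Series to Iterator of Series.
@pandas_udf('long')
def pandas_plus_one_iter(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for v in batches:
        yield v + 1

spark.range(10).select(pandas_plus_one_iter("id")).show()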
53
Pandas UDFs
Pandas Function APIs
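Spark 3.0 also introduces Pandas function APIs such as mapInPandas; a minimal sketch (the even-number filter is purely illustrative):

from typing import Iterator
import pandas as pd

def keep_even(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Receives an iterator of pandas DataFrames and yields pandas DataFrames.
    for pdf in batches:
        yield pdf[pdf.id % 2 == 0]

df = spark.range(10)
df.mapInPandas(keep_even, schema=df.schema).show()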
57
Spark ML
▪ ML function parity between Scala and Python (PySpark).
61
Migration &
Compatibility
62
Compiling your code for Spark 3.0
• Only builds with Scala 2.12
• Deprecates Python 2 (already EOL)
• Can build with various Hadoop/Hive versions
– Hadoop 2.7 + Hive 1.2
– Hadoop 2.7 + Hive 2.3 (supports Java 11) [Default]
– Hadoop 3.2 + Hive 2.3 (supports Java 11)
• Supports the following Hive metastore versions:
– "0.12", "0.13", "0.14", "1.0", "1.1", "1.2", "2.0", "2.1", "2.2", "2.3", "3.0", "3.1"
• NOTE: Databricks Runtime 7.x supports only Java 8
Spark Core, Spark SQL, DataFrames, and Datasets
▪ Built with Scala 2.12 (should be backwards compatible)
DBX blogpost on Dates and Timestamps Notes on Azure Databricks Docs Portal
Spark Core, Spark SQL, DataFrames, and Datasets
▪ In Spark 2.x, the CSV/JSON data sources convert a malformed string to a row with all nulls in PERMISSIVE mode.
– In Spark 3.0, the returned row can contain non-null fields if some of the CSV/JSON column values were parsed and converted to the desired types successfully.
▪ To get the week number of the year from a date/timestamp in Spark 3.x, use EXTRACT or the weekofyear function (see the sketch below).
– In Spark 2.4 and below, this was possible via the "w" format pattern in the date_format method.
DBX blogpost on Dates and Timestamps Notes on Azure Databricks Docs Portal 66
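A minimal sketch of the Spark 3.x approaches (the date is illustrative):

from pyspark.sql import functions as F

df = spark.createDataFrame([("2021-01-20",)], ["d"]).select(F.to_date("d").alias("d"))
df.select(F.weekofyear("d").alias("week")).show()

# SQL EXTRACT form
spark.sql("SELECT extract(week FROM date'2021-01-20') AS week").show()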
Spark Core, Spark SQL, DataFrames, and Datasets
▪ For many changes, you do have the option to restore the pre-Spark 3.0 behavior using the spark.sql.legacy.* configuration flags (example below)
http://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30 67
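For instance, a sketch of one documented legacy flag, which restores the Spark 2.4 datetime parsing behavior:

# Fall back to the Spark 2.4 (SimpleDateFormat-based) datetime parser behavior.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")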
Spark Core, Spark SQL, DataFrames, and Datasets
▪ Type coercions are performed per ANSI SQL standards when inserting
new values into a column.
– Unreasonable conversions will fail and throw an exception
http://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30 68
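A minimal sketch of the new behavior; the table name is illustrative, and the policy is controlled by spark.sql.storeAssignmentPolicy (ANSI by default in Spark 3.0):

spark.sql("CREATE TABLE IF NOT EXISTS demo_ints (i INT) USING parquet")

# OK: a numeric value is coerced into the INT column.
spark.sql("INSERT INTO demo_ints VALUES (1)")

# Fails under the ANSI store-assignment policy: string cannot be cast to int.
spark.sql("INSERT INTO demo_ints VALUES ('not a number')")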
Spark Core, Spark SQL, DataFrames, and Datasets
▪ is removed
▪ Event log files will be written in UTF-8 encoding, NOT in the default charset of the driver JVM process
http://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30 69
Spark Core
Deprecated methods removed/replaced
70
ANSI Compliance in Spark SQL
There are two new experimental options, spark.sql.ansi.enabled and spark.sql.storeAssignmentPolicy, for better compliance with the ANSI SQL standard
71
32 New Built-in Functions
72
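A few illustrative examples (these particular functions are among the built-ins added in Spark 3.0):

spark.sql("SELECT make_date(2021, 1, 20) AS d").show()
spark.sql("SELECT max_by(name, salary) FROM VALUES ('a', 10), ('b', 20) AS t(name, salary)").show()
spark.sql("SELECT count_if(id % 2 = 0) FROM range(10)").show()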
PySpark
Library updates may be required
https://docs.microsoft.com/en-us/azure/databricks/release-notes/runtime/7.3
https://docs.databricks.com/release-notes/runtime/7.3.html
73
SparkR
Deprecated methods removed/replaced
74
Key Takeaways - Migration & Compatibility
▪ Spark 3.0 is more user-friendly and standardized.
▪ A lot of new built-in functions and higher-order functions (HOFs).
▪ You might have to re-compile your libraries for the upgraded runtime
76
Delta
• In addition to many new features of Spark 3.0, DBR 7.x also
comes with a number of optimizations and new Delta features.
– Hilbert Curve (coming soon)
– DBR 7.4 -- CONSTRAINTS, RESTORE
– DBR 7.3 -- Significant reduction in Delta metadata overhead time
– DBR 7.2 -- DEEP / SHALLOW CLONE
– DBR 7.1 -- CONVERT TO DELTA, MERGE INTO performance
improvements
– DBR 7.0 -- Auto Loader, COPY INTO, Dynamic subquery reuse
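A minimal sketch of a few of these commands; the table names and path are illustrative, and the exact syntax for each is documented with the matching DBR release:

# Convert an existing Parquet directory to Delta in place.
spark.sql("CONVERT TO DELTA parquet.`/mnt/data/events`")

# Create a zero-copy shallow clone of a Delta table (DBR 7.2+).
spark.sql("CREATE TABLE events_clone SHALLOW CLONE events")

# Roll a table back to an earlier version (RESTORE, DBR 7.4+).
spark.sql("RESTORE TABLE events TO VERSION AS OF 5")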
Third party connectors
78
Compatibility of connectors to external systems
Some 3rd party connectors are still not compatible with Spark 3.0
▪ Azure Cosmos DB: not supported, PR available in the repo; compile yourself using code from the pull request
▪ Azure Event Hubs: supported; best to use the latest version (2.3.17 had problems)
▪ Apache Cassandra: supported since version 3.0.0; for Databricks the assembly variant should be used, and some functionality may not be available out of the box
▪ Spark-Redis: not supported, no PR available; may work with a version compiled with Scala 2.12, not tested
▪ Elasticsearch: not supported, PR available; compile yourself using code from the pull request
▪ Salesforce: not supported, no PR available; may work with a version compiled with Scala 2.12, not tested
▪ Apache HBase: not supported, no PR available
▪ Snowflake: supported
▪ Google BigQuery: unknown; could be compatible in a version compiled with Scala 2.12, needs testing
▪ Exasol: unknown; could be compatible in a version compiled with Scala 2.12, needs testing
83
References and further reading
84
References and further reading
● Spark 3.0 migration guide -- Apache Spark Docs
● New Pandas UDFs & Python Type hints in Spark 3.0 -- Databricks Blogpost ||
Spark+AI Summit talk
● Top tuning tips for Spark 3.0 & Delta Lake -- Databricks Tech Talk
85
Thank you
86