Azure Databricks Course Slide Deck
Ramesh Retnasamy
Data Engineer/ Machine Learning Engineer
https://www.linkedin.com/in/ramesh-retnasamy/
About this course
PySpark
Spark SQL
Azure Databricks
Delta Lake
Azure Key Vault
PowerBI
Formula1 Cloud Data Platform
ADF Pipelines
University students
Data Architects
Who is this course not for
You are not interested in a hands-on learning approach
Azure Account
Our Commitments
Orchestration
21. Azure Data Factory
22. Connecting Other Tools
Introduction to Azure Databricks
Azure Databricks
Databricks
Microsoft Azure
Apache Spark
Apache Spark is a lightning-fast unified analytics engine for big
data processing and machine learning
Spark Core
Scala Python Java R
Databricks
Workspace/ Notebook
MLFlow
SQL Analytics
Administration Controls
Optimized Spark (5x faster)
Delta Lake
Databases/ Tables
Microsoft Azure
Unified Azure Portal + Unified Billing
Data Services
Messaging Services
Azure IoT Hub
Azure Event Hub
Azure DevOps
Azure ML
Power BI
Databricks Service Overview
Creating Azure Databricks Service
Azure Databricks Architecture
Control Plane (Databricks Subscription)
Databricks Cluster Manager
Databricks UX
DBFS
Data Plane (Customer Subscription)
VNet
VM VM VM VM
Azure Resource Manager
Azure AD
Databricks Workspace
Clusters
Notebooks
Jobs
Data
Models
Databricks Clusters
Cluster Types
Cluster Configuration
Creating a cluster
Cluster Pools
Databricks Cluster
VM
Driver
VM VM VM
Cluster Mode
Standard: Support for all DSL. For production workloads & ad hoc development
High Concurrency: Doesn’t support Scala. For interactive analysis & ad hoc development
Single Node: Support for all DSL. Lightweight workload for ML & data analysis
Cluster Configuration
Databricks Runtime
Spark
Scala, Java, Python, R
Ubuntu Libraries
GPU Libraries
Delta Lake
Other Databricks Services
Databricks Runtime
Databricks Runtime ML
Auto Scaling
• User specifies the min and max worker nodes
• Auto scales between min and max based on the workload
• Not recommended for streaming workloads
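The min/max rule for auto scaling can be pictured as a simple clamp. The sketch below is only an illustration in plain Python: `target_workers` and the `requested` figure are hypothetical names, and how Databricks actually derives the desired worker count from the workload is not described in these slides.

```python
def target_workers(requested: int, min_workers: int, max_workers: int) -> int:
    """Clamp a requested worker count into [min_workers, max_workers]."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return max(min_workers, min(requested, max_workers))


# Hypothetical demand figures for a cluster configured with 2 to 8 workers.
for demand in (0, 5, 20):
    print(demand, "->", target_workers(demand, min_workers=2, max_workers=8))
```

Whatever the workload asks for, the result never drops below the configured minimum or rises above the configured maximum.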
Cluster Configuration
Compute Optimized
Storage Optimized
GPU Accelerated
Creating Databricks Cluster
Cluster Pool
Pool (idle instances 2 & max instances 5)
Step 1: the pool holds idle instances VM1 and VM2
Step 2: Cluster 1 takes VM1 and VM2 from the pool
Step 3: the pool adds VM3 and VM4 to get back to 2 idle instances
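The pool behaviour shown in the diagram (2 idle instances, at most 5 instances in total) can be mimicked with a small toy model. This is a sketch under assumptions, not the Databricks implementation: the `Pool` class, its field names and the VM naming are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Toy model of a cluster pool; not the Databricks implementation."""
    min_idle: int
    max_capacity: int
    idle: list = field(default_factory=list)
    in_use: list = field(default_factory=list)
    _next_id: int = 1

    def __post_init__(self):
        self._replenish()

    def _replenish(self):
        # Add idle VMs until min_idle is met or total capacity is reached.
        while (len(self.idle) < self.min_idle
               and len(self.idle) + len(self.in_use) < self.max_capacity):
            self.idle.append(f"VM{self._next_id}")
            self._next_id += 1

    def acquire(self, count: int) -> list:
        # Hand idle VMs to a cluster, then top the pool back up.
        # Simplification: this toy model refuses requests larger than the idle set.
        if count > len(self.idle):
            raise RuntimeError("not enough idle instances in this toy model")
        taken, self.idle = self.idle[:count], self.idle[count:]
        self.in_use.extend(taken)
        self._replenish()
        return taken


pool = Pool(min_idle=2, max_capacity=5)
cluster_1 = pool.acquire(2)
print(cluster_1, pool.idle)
```

In this model Cluster 1 receives VM1 and VM2 and the pool refills with VM3 and VM4, matching the slide sequence. A second cluster taking two more VMs would leave room for only one further idle instance because of the cap of 5.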
Creating Cluster Pool
Databricks Notebooks
What’s a notebook
Creating a notebook
Magic Commands
Databricks Utilities
Creating Notebooks
Magic Commands
Databricks Utilities
Widget Utilities
Secrets Utilities
Library Utilities
Databricks Utilities
Databricks Mounts
What is DBFS
DBFS
Azure Blob Storage Azure Data Lake Gen1 Azure Data Lake Gen2
Databricks File System (DBFS)
Benefits
Access files using file semantics rather than storage URLs (e.g. /mnt/storage1)
Stores files to object storage (e.g. Azure Blob), so you get all the benefits from Azure
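To make the first benefit concrete, the sketch below models a mount as a lookup from a mount point to the storage URL behind it. It is a plain-Python illustration only: the `MOUNTS` table, the container and account names, and the `resolve` helper are hypothetical, and inside a notebook the mounts themselves would be created with `dbutils.fs.mount`.

```python
# Hypothetical mount table: mount point -> storage URL (all names are made up).
MOUNTS = {
    "/mnt/storage1": "abfss://raw@storage1.dfs.core.windows.net",
    "/mnt/storage2": "wasbs://data@storage2.blob.core.windows.net",
}


def resolve(path: str) -> str:
    """Translate a mounted file path into the storage URL behind it."""
    for mount_point, url in MOUNTS.items():
        if path == mount_point or path.startswith(mount_point + "/"):
            return url + path[len(mount_point):]
    raise KeyError(f"no mount covers {path}")


print(resolve("/mnt/storage1/circuits.csv"))
```

Notebook code only ever needs the short `/mnt/...` path; the long URL and the credentials stay with the mount definition.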
Databricks File System (DBFS)
DBFS Root
DBFS
Azure Blob Storage Azure Data Lake Gen1 Azure Data Lake Gen2
Mounting Azure Storage
DBFS
mnt/storage1 mnt/storage2
Create mount using Credentials
Provide Required Access
Azure Blob Storage Azure Data Lake Gen1 Azure Data Lake Gen2
Mounting Azure Data Lake Storage Gen2
What is formula1
Project Requirements
Solution Architecture
Data Overview
Formula1
Formula1 Overview
Seasons
Race Weekend
Teams/ Constructors
Drivers
Qualifying
Qualifying Results
Race
Laps
Pit Stops
Race Results
Driver Standings
Constructor Standings
Formula1 Data Source
http://ergast.com/mrd/
Formula1 Data Files
Import Raw Data to Data Lake
Project Requirements
Data Ingestion Requirements
Ingest All 8 files into the data lake
Driver Standings
Constructor Standings
Analysis Requirements
Dominant Drivers
Dominant Teams
https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/azure-databricks-modern-analytics-architecture
Databricks Architecture
https://databricks.com/solutions/data-pipelines
Spark Architecture
Driver Node
VM
Core Core Core Core
Driver Program
Source: https://www.bbc.co.uk/sport/formula1/drivers-world-championship/standings
Spark DataFrame
Data Source (CSV, JSON etc)
Data Sources API - DataFrame Reader Methods
DataFrame
DataFrame API
DataFrame
Data Sources API - DataFrame Writer Methods
Data Sink (ORC, Parquet etc)
Spark Documentation
Data Ingestion Overview
Data Ingestion Requirements
Ingest All 8 files into the data lake
Sources: CSV, JSON, XML, JDBC, …
Read Data -> Transform Data -> Write Data
Sinks: Parquet, Avro, Delta, JDBC, ….
Data Types
Row
Column
Functions
Window
Grouping
Data Ingestion Overview
Data Ingestion - Circuits
Partition By race_year
Partition By race_id
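Partitioning by a column means each distinct value gets its own folder, named `column=value`, with the column itself left out of the files inside. The sketch below imitates that layout in plain Python; the sample rows and the `partition_rows` helper are made up for illustration, and in PySpark the equivalent step would be `partitionBy("race_year")` on the DataFrame writer.

```python
from collections import defaultdict

# Made-up sample rows; real race data has many more columns.
races = [
    {"race_id": 1, "race_year": 2020, "name": "Race A"},
    {"race_id": 2, "race_year": 2020, "name": "Race B"},
    {"race_id": 3, "race_year": 2021, "name": "Race C"},
]


def partition_rows(rows, column):
    """Group rows under 'column=value' folder names, dropping the column itself."""
    folders = defaultdict(list)
    for row in rows:
        rest = {k: v for k, v in row.items() if k != column}
        folders[f"{column}={row[column]}"].append(rest)
    return dict(folders)


layout = partition_rows(races, "race_year")
for folder, rows in layout.items():
    print(folder, rows)
```

A query that filters on `race_year` can then skip every folder whose name does not match, which is the main reason to partition on a commonly filtered column.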
Data Ingestion - Pitstops
Multi Line JSON -> Read Data -> Transform Data -> Write Data -> Parquet
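Multi-line JSON needs separate handling because a reader that treats every line as one record cannot parse it. The plain-Python sketch below shows the difference using made-up pit stop records; the field names are illustrative only. In Spark, the JSON reader expects one record per line by default and has a `multiLine` option for files like the first one here.

```python
import json

# One JSON array spread over several lines (made-up pit stop records).
multi_line = """[
  {"raceId": 1, "driverId": 10, "stop": 1},
  {"raceId": 1, "driverId": 11, "stop": 1}
]"""

# The same records as JSON Lines: one complete object per line.
json_lines = (
    '{"raceId": 1, "driverId": 10, "stop": 1}\n'
    '{"raceId": 1, "driverId": 11, "stop": 1}'
)


def parse_per_line(text):
    """Parse each line as its own JSON document."""
    return [json.loads(line) for line in text.splitlines()]


records = json.loads(multi_line)           # whole-document parse works
same_records = parse_per_line(json_lines)  # per-line parse works for JSON Lines

try:
    parse_per_line(multi_line)
    per_line_ok = True
except json.JSONDecodeError:
    per_line_ok = False                    # "[" alone is not a valid document

print(records == same_records, per_line_ok)
```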
Data Ingestion - Laptimes
Include notebook
Notebook workflow
Databricks Jobs
Include notebook (%run)
Passing Parameters (widgets)
Notebook Workflow
Databricks Jobs
Filter/ Join Transformations
Filter Transformation
Join Transformations
Simple Aggregations
Grouped Aggregations
Window Functions
Spark SQL
Databricks Default
Hive Meta Store
External Meta Store (Azure SQL, MySQL etc.)
Databricks Workspace
Database
Table
Views
Managed
External
Managed/ External Tables
ADF Pipelines
SQL Basics
Simple Functions
Aggregate Functions
Window Functions
Joins
Dominant Drivers/ Teams Analysis
Implementation
Data Load Types
Full Refresh/ Load
Each day's file contains every race so far; Ingestion and Transformation rewrite the full set in the Data Lake.
Day 1 file: Race 1. Data Lake: Race 1
Day 2 file: Race 1, Race 2. Data Lake: Race 1, Race 2
Day 3 file: Race 1, Race 2, Race 3. Data Lake: Race 1, Race 2, Race 3
Incremental Load
Each day's file contains only the new race; Ingestion and Transformation add it to what is already in the Data Lake.
Day 1 file: Race 1. Data Lake: Race 1
Day 2 file: Race 2. Data Lake: Race 1, Race 2
Day 3 file: Race 3. Data Lake: Race 1, Race 2, Race 3
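The two load types can be compared side by side with a small plain-Python sketch. The function names and the list-based "lake" are hypothetical stand-ins; only the three-day sequence of races comes from the slides.

```python
def full_load(lake, source_file):
    """Full refresh: replace the lake contents with the latest complete file."""
    return list(source_file)


def incremental_load(lake, source_file):
    """Incremental: keep what is in the lake and add races not yet present."""
    return lake + [race for race in source_file if race not in lake]


# Daily source files as in the slides (race names only, for illustration).
full_files = [["Race 1"], ["Race 1", "Race 2"], ["Race 1", "Race 2", "Race 3"]]
incremental_files = [["Race 1"], ["Race 2"], ["Race 3"]]

full_lake, incr_lake = [], []
for day_file in full_files:
    full_lake = full_load(full_lake, day_file)
for day_file in incremental_files:
    incr_lake = incremental_load(incr_lake, day_file)

print(full_lake)
print(incr_lake)
```

Both approaches end with the same three races in the lake. The difference is the work per day: the full load reprocesses everything, while the incremental load touches only the new race.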
Hybrid Scenarios
Race 2 …. Race 1047
Formula1 Scenario / Solution 1
Race 2 …. Race 1047
Lakehouse Architecture
Data Sources: Operational Data, External Data
ETL
Data Warehouse / Mart
Data Consumers
Data Warehouse
Scalability
Data Sources
Ingest
Transform
Data Science/ ML workloads
BI Reports
Data Lake
No support for ACID transactions
Failed jobs leave partial files
Inconsistent reads
Unable to handle corrections to data
No history or versioning
Poor performance
Poor BI support
Complex to set up
Lambda architecture for streaming workloads
Data Lakehouse
Data Sources: Operational Data, External Data
Ingest
Transform
Data Science/ ML workloads
BI Reports
Data Lakehouse
Handles all types of data
Cheap cloud object storage
Uses open source format
Support for all types of workloads
Ability to use BI tools directly
ACID support
History & Versioning
Better performance
Simple architecture
Delta Lake
Overview
Creating Pipelines
Creating Triggers
Azure Data Factory Overview
What is Azure Data Factory
Multi Cloud
SaaS Apps
Data Formats
On Premises
The Data Problem
Data Sources
Ingest
Transform/ Analyze
Publish
Data Consumers
Serverless
A fully managed, serverless data integration solution for ingesting, preparing and
transforming all of your data at scale.
What Azure Data Factory Is Not
Pipeline
Activity Activity
Dataset
Storage: ADLS, SQL Database
Compute: Azure Databricks, Azure HDInsight
Connecting from Power BI
Congratulations! & Thank you
Feedback
Ratings & Review
Thank you & Good Luck!