Azure Databricks Course Slide Deck

This document provides an overview of Azure Databricks:
- Azure Databricks is a unified analytics platform optimized for Apache Spark. It integrates Spark with other Azure services and tools such as Azure Data Lake Storage, Azure Data Factory, and Power BI.
- The course covers Spark, Delta Lake, data ingestion, transformations, analytics using Spark SQL and PySpark, and orchestrating workflows with Azure Data Factory.
- The target audience includes data engineers, developers, data architects, and students interested in hands-on learning about data engineering with Azure Databricks. Basic Python and SQL skills are required.


About Me

Ramesh Retnasamy
Data Engineer/ Machine Learning Engineer

https://www.linkedin.com/in/ramesh-retnasamy/
About this course

PySpark

Spark SQL

Azure Databricks
Delta Lake
About this course

Azure Data Lake Storage Gen2

Azure Data Factory

Azure Databricks
Azure Key Vault

PowerBI
Formula1 Cloud Data Platform

ADF Pipelines orchestrate three stages: Ingest, Transform, Analyze

Ergast API → ADLS Raw Layer → ADLS Ingested Layer → ADLS Presentation Layer → Report
Who is this course for

University students

IT Developers from other disciplines

AWS/ GCP/ On-prem Data Engineers

Data Architects
Who is this course not for
You are not interested in a hands-on learning approach

Your only focus is Azure Data Engineering Certification

You want to Learn Spark ML or Streaming

You want to Learn Scala or Java


Pre-requisites

All code and step-by-step instructions provided

Basic Python programming knowledge required

Basic SQL knowledge required

Cloud fundamentals will be beneficial, but not mandatory

Azure Account
Our Commitments

Ask questions, I will answer 🙂

Keeping the course up to date

Udemy lifetime access

Udemy 30-day money-back guarantee


Course Structure

Overviews: 2. Azure Portal, 3. Azure Databricks, 7. Project Overview, 8. Spark Overview

Databricks: 4. Clusters, 5. Notebooks, 6. DBFS, 12. Jobs

Spark (Python): 9. Data Ingestion 1, 10. Data Ingestion 2, 11. Data Ingestion 3, 13. Transformation, 14. Aggregations

Spark (SQL): 15. Temp Views, 16. DDL, 17. DML, 18. Analysis

Delta Lake: 19. Incremental Load, 20. Delta Lake

Orchestration: 21. Azure Data Factory, 22. Connecting Other Tools
Introduction to Azure Databricks
Azure Databricks

Azure Databricks = Databricks + Microsoft Azure
Apache Spark
Apache Spark is a lightning-fast unified analytics engine for big
data processing and machine learning

100% Open source under Apache License

Simple and easy to use APIs

In-memory processing engine

Distributed computing Platform

Unified engine which supports SQL, streaming, ML and graph processing
Apache Spark Architecture

Libraries: Spark SQL, Spark ML, Spark Streaming, Spark Graph

DataFrame / Dataset APIs

Spark SQL Engine: Catalyst Optimizer, Tungsten

Spark Core: language APIs (Scala, Python, Java, R), Resilient Distributed Dataset (RDD)

Cluster managers: Spark Standalone, YARN, Apache Mesos, Kubernetes

Databricks

Databricks capabilities: Clusters, Workspace/ Notebooks, MLFlow, SQL Analytics, Administration Controls, Delta Lake, Databases/ Tables, Optimized Spark (5x faster)

Azure Databricks

Azure Databricks adds: Unified Azure Portal + Unified Billing, Azure Active Directory, Azure Dev Ops

Data Services: Azure Data Lake, Azure Blob Storage, Azure Cosmos DB, Azure SQL Database, Azure Synapse

Messaging Services: Azure IoT Hub, Azure Event Hub

Other integrations: Azure Data Factory, Power BI, Azure ML

Microsoft Azure
Databricks Service Overview
Creating Azure Databricks Service
Azure Databricks Architecture

Control Plane (Databricks Subscription): Databricks Cluster Manager, Databricks UX

Data Plane (Customer Subscription): VNet containing the cluster VMs (Databricks Clusters), with DBFS and Workspace storage backed by Azure Blob Storage

Related Azure services: Azure Resource Manager, Azure AD
Databricks Workspace Components

Notebooks, Clusters, Jobs, Data, Models
Databricks Clusters

What is Databricks Cluster

Cluster Types

Cluster Configuration

Creating a cluster

Cluster Pools
Databricks Cluster

One Driver node (VM) coordinating multiple Worker nodes (VMs)
Cluster Types

All Purpose Cluster: Created manually; Persistent; Suitable for interactive workloads; Shared among many users; Expensive to run

Job Cluster: Created by Jobs; Terminated at the end of the job; Suitable for automated workloads; Isolated just for the job; Cheaper to run

Cluster Configuration – Cluster Mode

Standard: Single user; No process isolation; No task preemption; Support for all DSLs; For production workloads & ad-hoc development

High Concurrency: Multiple users; Provides process isolation; Provides task preemption; Doesn't support Scala; For interactive analysis & ad-hoc development

Single Node: Single user; No process isolation; No task preemption; Support for all DSLs; Lightweight workloads for ML & data analysis
Cluster Configuration – Databricks Runtime

Databricks Runtime: Spark; Scala, Java, Python and R libraries; Ubuntu and its system libraries; GPU libraries; Delta Lake; other Databricks services

Databricks Runtime ML: everything from Databricks Runtime + popular ML libraries (PyTorch, Keras, TensorFlow, XGBoost etc.)

Databricks Runtime Genomics: everything from Databricks Runtime + popular open source genomic libraries (e.g. Glow, ADAM etc.) + popular genomic pipelines (e.g. DNASeq, RNASeq etc.)

Databricks Runtime Light: runtime option for only jobs not requiring advanced features
Cluster Configuration – Auto Termination

• Terminates the cluster after X minutes of inactivity
• Default value for Single Node and Standard clusters is 120 minutes
• High Concurrency clusters do not have a default auto termination set
• Users can specify a value between 10 and 10000 minutes as the duration
Cluster Configuration – Auto Scaling

• User specifies the min and max worker nodes
• Auto scales between min and max based on the workload
• Not recommended for streaming workloads
Cluster Configuration – Cluster VM Type/ Size

Memory Optimized, Compute Optimized, Storage Optimized, General Purpose, GPU Accelerated
Creating Databricks Cluster
Cluster Pool

A Pool (e.g. idle instances 2 & max instances 5) keeps idle VMs ready (VM1, VM2)

When Cluster 1 is created from the pool, it takes the idle instances (VM1, VM2)

The pool then provisions new idle instances (VM3, VM4), up to the max, so the next cluster can start without waiting for VMs
Creating Cluster Pool
Databricks Notebooks

What’s a notebook

Creating a notebook

Magic Commands

Databricks Utilities
Creating Notebooks
Magic Commands
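A quick sketch of the magic commands used in this course, shown as separate notebook cells (each magic command must be the first line of its own cell; the notebook path below is a placeholder):

%md
### Renders this cell as Markdown

%sql
SELECT current_date()

%fs
ls /databricks-datasets

%sh
ps aux

%run ./includes/configuration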
Databricks Utilities

File System Utilities

Notebook Workflow Utilities

Widget Utilities

Secrets Utilities

Library Utilities
Databricks Utilities
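A minimal sketch of the dbutils utilities listed above (paths are placeholders); widgets, notebook workflows and secrets are covered in later sections, and every utility documents itself via help():

# File system utilities
dbutils.fs.ls("/databricks-datasets")
dbutils.fs.mkdirs("/tmp/demo")

# Self-documentation for each utility group
dbutils.fs.help()
dbutils.widgets.help()
dbutils.notebook.help()
dbutils.secrets.help()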
Databricks Mounts
What is DBFS

What are Databricks mounts

Mount ADLS container to databricks

Create Service Principal

Create Azure Data Lake Storage Gen2

Creating Azure Key Vault

Creating Databricks secret scope


Databricks File System (DBFS)

Databricks Notebooks, the Databricks CLI and the Databricks API access DBFS, which is backed by Azure Blob Storage, Azure Data Lake Gen1 or Azure Data Lake Gen2
Databricks File System (DBFS)
Benefits

Access data without requiring credentials

Access files using file semantics rather than storage URLs (e.g. /mnt/storage1)

Stores files to object storage (e.g. Azure Blob), so you get all the benefits from Azure
Databricks File System (DBFS)
DBFS Root

Backed by Azure Blob Storage in the Databricks-created resource group

Default storage location, but not recommended for user data

Access storage via web UI (Special Folder FileStore)

Query results are stored here (e.g. display commands)

Contains data and metadata for managed (non-external) tables


DBFS Root Demo
Mounting Azure Storage

Databricks Workspace, the Databricks CLI and the Databricks API access storage through DBFS mount points (e.g. /mnt/storage1, /mnt/storage2)

Mounts are created using credentials (Service Principal, Access Key or SAS Token) that have the required access on the underlying storage: Azure Blob Storage, Azure Data Lake Gen1 or Azure Data Lake Gen2
Mounting Azure Data Lake Storage Gen2

Create Azure Storage Account (Data Lake Gen2)

Create Azure Service Principal

Provide required access to the service principal

Create the mount using the service principal


Creating Azure Storage Account
(Data Lake Gen2)
Creating Azure Service Principal
Mounting Azure Data Lake
Storage Gen2
Demo
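A minimal sketch of mounting an ADLS Gen2 container with a service principal; the storage account, container, mount point, secret scope and secret names below are placeholders (the secret scope itself is covered next):

client_id = dbutils.secrets.get(scope="formula1-scope", key="client-id")
tenant_id = dbutils.secrets.get(scope="formula1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="formula1-scope", key="client-secret")

# Standard ABFS OAuth configuration for a service principal
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Mount the 'raw' container of the 'formula1dl' storage account to /mnt/formula1dl/raw
dbutils.fs.mount(
    source="abfss://raw@formula1dl.dfs.core.windows.net/",
    mount_point="/mnt/formula1dl/raw",
    extra_configs=configs,
)

display(dbutils.fs.mounts())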
Secret Scope

Secret scopes store credentials securely so that notebooks and jobs can reference them when required

Two types: Databricks-backed Secret Scope and Azure Key Vault-backed Secret Scope

Flow (Key Vault-backed): add secrets to the Azure Key Vault → create a Databricks secret scope backed by the Key Vault → notebooks/ jobs get secrets using dbutils.secrets.get
Demo
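A minimal sketch of using a secret scope from a notebook (the scope and key names are placeholders):

dbutils.secrets.listScopes()                        # list all secret scopes in the workspace
dbutils.secrets.list("formula1-scope")              # list the secret keys in one scope
key = dbutils.secrets.get(scope="formula1-scope", key="storage-account-key")  # value is redacted if printed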
Project Overview

What is formula1

Formula1 data source & datasets

Prepare the data for the project

Project Requirements

Solution Architecture
Data Overview

Formula1 Overview

Entities: Seasons, Race Circuits, Teams/ Constructors, Drivers

Race Weekend: Practice, Qualifying, Race

Results data: Qualifying Results, Race Results, Laps, Pit Stops, Driver Standings, Constructor Standings
Formula1 Data Source
http://ergast.com/mrd/
Formula1 Data Source
Formula1 Data Files
Import Raw Data to Data Lake
Project Requirements
Data Ingestion Requirements

Ingest all 8 files into the data lake

Ingested data must have the schema applied

Ingested data must have audit columns

Ingested data must be stored in columnar format (i.e., Parquet)

Must be able to analyze the ingested data via SQL

Ingestion logic must be able to handle incremental load
Data Transformation Requirements

Join the key information required for reporting to create a new table

Join the key information required for analysis to create a new table

Transformed tables must have audit columns

Must be able to analyze the transformed data via SQL

Transformed data must be stored in columnar format (i.e., Parquet)

Transformation logic must be able to handle incremental load
Reporting Requirements

Driver Standings

Constructor Standings
Analysis Requirements

Dominant Drivers

Dominant Teams

Visualize the outputs

Create Databricks Dashboards


Scheduling Requirements

Scheduled to run every Sunday at 10 PM

Ability to monitor pipelines

Ability to re-run failed pipelines

Ability to set-up alerts on failures


Other Non-Functional Requirements

Ability to delete individual records

Ability to see history and time travel

Ability to roll back to a previous version


Solution Architecture Overview
Solution Architecture

ADF Pipelines orchestrate three stages: Ingest, Transform, Analyze

Ergast API → ADLS Raw Layer → ADLS Ingested Layer → ADLS Presentation Layer → Report
Azure Databricks Modern Analytics Architecture

https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/azure-databricks-modern-analytics-architecture
Databricks Architecture

https://databricks.com/solutions/data-pipelines
Spark Architecture

A cluster has a Driver Node (VM) and Worker Nodes (VMs), each with multiple cores

The Driver Node runs the Driver Program; each Worker Node runs an Executor with one slot per core

A Spark Application is split into Jobs, each Job into Stages, and each Stage into Tasks; the driver assigns Tasks to the executor slots

Spark Architecture – Cluster Scaling

Adding Worker Nodes adds Executors and slots, increasing the parallelism available to the application
Spark DataFrame
Spark DataFrame

Source: https://www.bbc.co.uk/sport/formula1/drivers-world-championship/standings
Spark DataFrame – Read, Transform, Write

Read: Data Source (CSV, JSON etc.) → Data Sources API (DataFrameReader methods) → DataFrame

Transform: DataFrame API methods produce a new DataFrame

Write: DataFrame → Data Sources API (DataFrameWriter methods) → Data Sink (ORC, Parquet etc.)
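For example, a minimal PySpark sketch of this read → transform → write pattern (the file paths and column names are placeholders; spark is the SparkSession pre-created in Databricks notebooks):

from pyspark.sql.functions import col

df = spark.read.option("header", True).csv("/mnt/formula1dl/raw/example.csv")        # DataFrameReader
df_transformed = df.select(col("name"), col("country"))                              # DataFrame API transformation
df_transformed.write.mode("overwrite").parquet("/mnt/formula1dl/processed/example")  # DataFrameWriter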
Spark Documentation
Data Ingestion Overview
Data Ingestion Requirements

Ingest all 8 files into the data lake

Ingested data must have the schema applied

Ingested data must have audit columns

Ingested data must be stored in columnar format (i.e., Parquet)

Must be able to analyze the ingested data via SQL

Ingestion logic must be able to handle incremental load
Data Ingestion Overview

Sources (CSV, JSON, XML, JDBC, …) → Read Data (DataFrameReader API) → Transform Data (DataFrame API) → Write Data (DataFrameWriter API) → Sinks (Parquet, Avro, Delta, JDBC, …)

Supporting constructs: Data Types, Row, Column, Functions, Window, Grouping
Data Ingestion Overview

Data Ingestion - Circuits

CSV → Read Data → Transform Data → Write Data → Parquet

Data Ingestion – Races (Assignment)

CSV → Read Data → Transform Data → Write Data → Parquet

withColumn('race_timestamp', to_timestamp(concat(col('date'), lit(' '), col('time')), 'yyyy-MM-dd HH:mm:ss'))

Data Ingestion – Races (Partition By)

CSV → Read Data → Transform Data → Write Data → Parquet, partitioned by race_year

withColumn('race_timestamp', to_timestamp(concat(col('date'), lit(' '), col('time')), 'yyyy-MM-dd HH:mm:ss'))

Data Ingestion - Constructors

JSON → Read Data → Transform Data → Write Data → Parquet

Data Ingestion - Drivers

JSON → Read Data → Transform Data → Write Data → Parquet

Data Ingestion – Results (Assignment)

JSON → Read Data → Transform Data → Write Data → Parquet, partitioned by race_id

Data Ingestion - Pitstops

Multi-line JSON → Read Data → Transform Data → Write Data → Parquet

Data Ingestion - Laptimes

Multiple CSV files → Read Data → Transform Data → Write Data → Parquet

Data Ingestion – Qualifying (Assignment)

Multiple multi-line JSON files → Read Data → Transform Data → Write Data → Parquet
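A minimal PySpark sketch of this ingestion pattern for the circuits file; the mount point and output path are assumptions, and the column names follow the Ergast circuits file:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
from pyspark.sql.functions import current_timestamp

# Explicit schema so the ingested data has the schema applied (requirement above)
circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("circuitRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True),
    StructField("alt", IntegerType(), True),
    StructField("url", StringType(), True),
])

circuits_df = (spark.read
               .option("header", True)
               .schema(circuits_schema)
               .csv("/mnt/formula1dl/raw/circuits.csv"))

# Audit column requirement
circuits_final_df = circuits_df.withColumn("ingestion_date", current_timestamp())

# Columnar (Parquet) output; the races file would additionally use .partitionBy("race_year")
circuits_final_df.write.mode("overwrite").parquet("/mnt/formula1dl/processed/circuits")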
Databricks Workflows

Include notebook

Defining notebook parameters

Notebook workflow

Databricks Jobs
Include notebook (%run)
Passing Parameters (widgets)
Notebook Workflow
Databricks Jobs
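A minimal sketch of these workflow features (notebook paths and parameter names are placeholders):

# %run ./includes/configuration   -> first line of its own cell; includes another notebook's variables/functions

# Widgets: define and read notebook parameters
dbutils.widgets.text("p_data_source", "")
v_data_source = dbutils.widgets.get("p_data_source")

# Notebook workflow: run a child notebook with a timeout (0 = no timeout) and parameters,
# capturing the value the child passes to dbutils.notebook.exit("Success")
result = dbutils.notebook.run("./ingestion/ingest_circuits_file", 0, {"p_data_source": "Ergast API"})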
Filter/ Join Transformations

Filter Transformation

Join Transformations

Apply Transformations to F1 Project


Filter Transformations
Join Transformations
Join Transformation
Race Results
Join – Race Results

Join – Race Results
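A minimal sketch of the filter and join transformations used to build the race results (DataFrame and column names are assumptions):

from pyspark.sql.functions import col

races_df = spark.read.parquet("/mnt/formula1dl/processed/races")
circuits_df = spark.read.parquet("/mnt/formula1dl/processed/circuits")

# Filter transformation: keep only one season
races_2019_df = races_df.filter(col("race_year") == 2019)

# Join transformation: inner join races to circuits on the circuit id
race_circuits_df = (races_2019_df.join(circuits_df,
                                       races_2019_df.circuit_id == circuits_df.circuit_id,
                                       "inner")
                    .select(races_2019_df.race_name, circuits_df.circuit_name, races_2019_df.race_year))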
Set-up Environment
Presentation Layer
Aggregations

Simple Aggregations

Grouped Aggregations

Window Functions

Apply Aggregations to F1 Project


Built-in Aggregate Functions
Group By
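A minimal sketch of a grouped aggregation and a window function over the race results (table and column names are assumptions):

from pyspark.sql.functions import sum, count, desc, rank
from pyspark.sql.window import Window

race_results_df = spark.read.parquet("/mnt/formula1dl/presentation/race_results")

# Grouped aggregation: total points and number of races per driver per year
driver_standings_df = (race_results_df
                       .groupBy("race_year", "driver_name")
                       .agg(sum("points").alias("total_points"),
                            count("race_name").alias("number_of_races")))

# Window function: rank drivers within each year by total points
driver_rank_spec = Window.partitionBy("race_year").orderBy(desc("total_points"))
final_df = driver_standings_df.withColumn("rank", rank().over(driver_rank_spec))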
Databases/ Tables/ Views
Hive Meta Store

Spark SQL stores table metadata in a Hive Meta Store: either the Databricks default Hive Meta Store or an external meta store (Azure SQL, MySQL etc.)

The table data itself is stored in Azure Data Lake

Spark Databases/ Tables/ Views

A Databricks Workspace contains Databases; each Database contains Tables (Managed or External) and Views
Managed/ External Tables

In the project architecture (Ergast API → ADLS Raw Layer → Ingested Layer → Presentation Layer → Report, orchestrated by ADF Pipelines), the data lake layers are registered in the metastore as External Tables and Managed Tables
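A minimal sketch of creating a managed and an external table over the project data (database, table and path names are assumptions):

spark.sql("CREATE DATABASE IF NOT EXISTS f1_processed")

# Managed table: Spark manages both metadata and data (data lands in the metastore's default location)
race_results_df = spark.read.parquet("/mnt/formula1dl/presentation/race_results")
race_results_df.write.mode("overwrite").format("parquet").saveAsTable("f1_processed.race_results_managed")

# External table: Spark manages only the metadata; the data stays at the specified LOCATION
spark.sql("""
  CREATE TABLE IF NOT EXISTS f1_processed.circuits_ext
  USING PARQUET
  LOCATION '/mnt/formula1dl/processed/circuits'
""")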
Spark SQL Introduction

SQL Basics

Simple Functions

Aggregate Functions

Window Functions

Joins
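A minimal sketch of exposing a DataFrame to Spark SQL through a temp view (names are assumptions):

race_results_df = spark.read.parquet("/mnt/formula1dl/presentation/race_results")
race_results_df.createOrReplaceTempView("v_race_results")

# Query the view with SQL (in a notebook this could also be a %sql cell)
demo_df = spark.sql("SELECT race_year, driver_name, points FROM v_race_results WHERE race_year = 2020")
display(demo_df)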
Dominant Drivers/ Teams Analysis

Create a table with the data required

Granularity of the data – race_year, driver, team

Rank the dominant drivers of all time/ last decade etc

Rank the dominant teams of all time/ last decade etc

Visualization of dominant drivers

Visualization of dominant teams
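For instance, a hedged sketch of ranking dominant drivers over the last decade with a SQL window function (the table and column names are assumptions):

dominant_drivers_df = spark.sql("""
  SELECT driver_name,
         COUNT(1) AS total_races,
         SUM(calculated_points) AS total_points,
         AVG(calculated_points) AS avg_points,
         RANK() OVER (ORDER BY AVG(calculated_points) DESC) AS driver_rank
    FROM f1_presentation.calc_race_results
   WHERE race_year BETWEEN 2011 AND 2020
   GROUP BY driver_name
  HAVING COUNT(1) >= 50
   ORDER BY avg_points DESC
""")
display(dominant_drivers_df)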


Incremental Load

Data Loading Patterns

F1 Project Load Pattern

Implementation
Data Load Types

Two patterns: Full Load and Incremental Load

Full Dataset – the source sends the entire history every day:
Day 1: Race 1 | Day 2: Races 1-2 | Day 3: Races 1-3 | Day 4: Races 1-4

Full Refresh/ Load – each day the whole dataset is ingested and transformed again, overwriting the ingested and transformed data in the data lake (Day 1: Race 1; Day 2: Races 1-2; Day 3: Races 1-3)
Incremental Dataset – the source sends only the new data every day:
Day 1: Race 1 | Day 2: Race 2 | Day 3: Race 3 | Day 4: Race 4

Incremental Load – each day only the newly received race is ingested and transformed, and appended to the data already in the data lake (after Day 3 the lake holds Races 1-3)
Hybrid Scenarios

Full Dataset received, but data loaded & transformed incrementally

Incremental dataset received, but data loaded & transformed in full

Data received contains both full and incremental files

Incremental data received. Ingested incrementally & transformed in full


Formula1 Scenario

Day 1: Races 1-1047 (full history) | Day 2: Race 1052 | Day 3: Race 1053

Formula1 Scenario / Solution 1

Day 1 processed as a Full Refresh; Day 2 and Day 3 processed as Incremental Loads

Formula1 Scenario / Solution 2

Day 1, Day 2 and Day 3 all processed as Incremental Loads
Formula1 Data Files
Current Solution

Data Lake raw folder (old data + new data) → Ingestion overwrites the data ingested → Data Lake processed folder → Transformation overwrites the data transformed → Data Lake presentation folder

Every run re-processes both old and new data in full

New Solution

Data Lake raw folder holds a sub folder per race date: race_date_1 (Races 1-1047), race_date_2 (Race 1052), race_date_3 (Race 1053)

Ingestion: 1) process the data from a particular sub folder, 2) delete the corresponding races from the ingested layer, 3) process the new data received

Transformation: 1) identify the dates for which we have received new data, 2) delete the races for which we have received data from the presentation layer, 3) load the new data received
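A minimal sketch of one way to implement this incremental pattern with dynamic partition overwrite; the folder, table and partition names are assumptions, and the course may implement the delete-and-reload steps differently:

from pyspark.sql.functions import current_timestamp

# Read only the newly arrived sub folder (e.g. race_date_2) and add the audit column
results_df = (spark.read
              .json("/mnt/formula1dl/raw/race_date_2/results.json")
              .withColumn("ingestion_date", current_timestamp()))

# Replace only the race_id partitions present in the new data, keeping earlier races intact.
# Assumes f1_processed.results already exists as a table partitioned by race_id
# (created on the first, full load) and that race_id is the last column of results_df.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
results_df.write.mode("overwrite").insertInto("f1_processed.results")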
Delta Lake

Pitfalls of Data Lakes

Lakehouse Architecture

Delta Lake Capabilities

Convert F1 project to Delta Lake


Delta Lake
Data Warehouse

Data Sources (Operational Data, External Data) → ETL → Data Warehouse / Mart → Data Consumers
Data Warehouse

Lack of support for unstructured data

Longer to ingest new data

Proprietary data formats

Scalability

Expensive to store data

Lack of support for ML/ AI workloads


Data Lake

Data Sources (Operational Data, External Data) → Ingest → Data Lake → Transform → Data Lake → Data Warehouse / Mart

Consumers: Data Science/ ML workloads, BI Reports
Data Lake

No support for ACID transactions
Failed jobs leave partial files
Inconsistent reads
Unable to handle corrections to data
Unable to roll back any data
Lack of ability to remove data for GDPR etc.
No history or versioning
Poor performance
Poor BI support
Complex to set up
Lambda architecture needed for streaming workloads
Data Lakehouse

Data Sources (Operational Data, External Data) → Ingest → Delta Lake → Transform → Delta Lake

Consumers: Data Science/ ML workloads, BI Reports
Data Lakehouse
Handles all types of data
Cheap cloud object storage
Uses open source format
Support for all types of workloads
Ability to use BI tools directly
ACID support
History & Versioning
Better performance
Simple architecture
Delta Lake

BI, SQL, ML/ AI and Streaming on one platform: Simple Architecture

Spark: Transformations, ML, Streaming etc.

Delta Table: Data security, Governance, Integrity, Performance

Delta Engine: Spark optimized engine for performance

Transaction Log: History, Versioning, ACID, Time Travel

Parquet Files: Open file format
Delta Lake Demo
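A minimal sketch of common Delta Lake operations covered here (table names, paths and column names are assumptions):

from delta.tables import DeltaTable

# Write a DataFrame as a Delta table
results_df = spark.read.parquet("/mnt/formula1dl/processed/results")
results_df.write.format("delta").mode("overwrite").saveAsTable("f1_demo.results_delta")

# Upsert (MERGE) incoming rows into the table - ACID, so readers never see partial writes
delta_table = DeltaTable.forName(spark, "f1_demo.results_delta")
(delta_table.alias("tgt")
 .merge(results_df.alias("src"), "tgt.result_id = src.result_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Delete individual records (e.g. GDPR), view history, and time travel to an earlier version
spark.sql("DELETE FROM f1_demo.results_delta WHERE driver_id = 1")
spark.sql("DESCRIBE HISTORY f1_demo.results_delta").show()
old_df = spark.sql("SELECT * FROM f1_demo.results_delta VERSION AS OF 0")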
Azure Data Factory

Overview

Creating Data Factory Service

Data Factory Components

Creating Pipelines

Creating Triggers
Azure Data Factory Overview
What is Azure Data Factory

A fully managed, serverless data integration solution for ingesting, preparing and transforming all of your data at scale.
The Data Problem

Data is spread across Multi Cloud, SaaS Apps, many Data Formats and On Premises systems

Pipelines must Ingest from the Data Sources, Transform/ Analyze (using transformation services), and Publish to the Data Consumers
What is Azure Data Factory

Fully Managed Service

Serverless

Data Integration Service

Data Transformation Service

Data Orchestration Service

A fully managed, serverless data integration solution for ingesting, preparing and transforming all of your data at scale.
What Azure Data Factory Is Not

Data Migration Tool

Data Streaming Service

Suitable for Complex Data Transformations

Data Storage Service


Create Data Factory Service
Data Factory Components

Trigger → Pipeline → Activities → Datasets → Linked Services

Linked Services connect to Storage (ADLS, SQL Database) and Compute (Azure Databricks, Azure HDInsight)
Connecting from Power BI
Congratulations!
&
Thank you
Feedback
Ratings & Review
Thank you
&
Good Luck!
