Introduction to Databricks
Databricks
Lakehouse
Kevin Barlow
Data Analytics Practitioner
The Data Warehouse
Pros
Highly performant
Cons
Very expensive
1 https://www.databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html
The Data Lake
Pros
Very flexible
Cost effective
Cons
1 https://www.databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html
Birth of the Lakehouse
1 https://www.databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html
The Databricks Lakehouse
The Databricks Lakehouse Platform
Simplified architecture
1 https://www.databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html
Databricks Architecture Benefits
Unification: benefits of both the data warehouse and the data lake
Multi-Cloud: no lock-in to a specific cloud platform
Databricks Development Benefits
Collaborative: ability to work in the same platform in real time
Open-Source: support for the most popular languages (Python, R, Scala, SQL)
Let's practice!
Core features of the
Databricks
Lakehouse Platform
Kevin Barlow
Data Practitioner
Apache Spark
Apache Spark is an open-source data processing framework and is the engine underneath
Databricks.
DataCamp Courses
Introduction to PySpark
Benefits of Spark
Key Benefits:
4. Databricks optimizations
1 https://spark.apache.org/docs/latest/cluster-overview.html
Cloud computing basics
Databricks Compute
Clusters
SQL Warehouses (SQL only, BI use cases)
Photon
Cloud data storage
Delta
Delta is an open-source data storage format that provides:
ACID transactions
Schema evolution
Table history
Time-travel
1 delta.io
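As a minimal sketch of what table history and time travel look like in practice, assuming a hypothetical Delta table named customers:

# Review the transaction history of a Delta table (table name is hypothetical)
display(spark.sql("DESCRIBE HISTORY customers"))

# Time travel: query the table as it existed at an earlier version
df_v0 = spark.sql("SELECT * FROM customers VERSION AS OF 0")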
Unity Catalog
Unity Catalog is an open data governance solution that controls access to all data assets in the Databricks Lakehouse Platform.
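A minimal sketch of how that access control is expressed in SQL, assuming a hypothetical main.sales.orders table and an analysts group; the same statements can be run from a notebook or the SQL editor:

# Grant a group read access to a Unity Catalog table (catalog, schema, and group names are hypothetical)
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")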
Databricks UI
Designed for easier access to capabilities
based on your data workload.
Let's review!
Administering a
Databricks
workspace
Kevin Barlow
Data Practitioner
Account Admin
Key Responsibilities:
Account Console
https://accounts.cloud.databricks.com/
Account Console - Workspaces
https://accounts.cloud.databricks.com/
Account Console - Data
https://accounts.cloud.databricks.com/
Account Console - Users & Groups
https://accounts.cloud.databricks.com/
Account Console - Settings
https://accounts.cloud.databricks.com/
Workspace Admin
Key Responsibilities:
Data Plane
Contains all of the customer's assets needed for computation with Databricks.
Control Plane
The portion of the platform that is managed and hosted by Databricks.
Databricks Platform Architecture
Each cloud will have the same general
options to create a workspace:
Account Console
1 https://docs.databricks.com/getting-started/overview.html
Let's review!
Setting up a
Databricks
workspace example
Kevin Barlow
Data Practitioner
Let's practice!
Getting started with
Databricks
Kevin Barlow
Data Practitioner
Compute cluster refresh
Create your first cluster
The first step is to create a cluster for your
data processing!
Configuration options:
Cluster Access
Create your first cluster
The first step is to create a cluster for your
data processing!
Configuration options:
Databricks Runtime
Photon Acceleration
Auto-scaling / Auto-termination
Data Explorer
Get familiar with the Data Explorer! In this UI,
you can:
Create a notebook
Databricks notebooks:
Built-in visualizations
Let's practice!
Data Engineering
foundations in
Databricks
Kevin Barlow
Data Practitioner
Medallion architecture
Reading data
Spark is a highly flexible framework and can read from various data sources/types.

Common data sources and types:
Delta tables
File formats (CSV, JSON, Parquet, XML)
Databases (MySQL, Postgres, EDW)
Streaming data
Images / Videos

# Delta table
spark.read.table("table_name")

# CSV files
spark.read.format('csv').load('*.csv')

# Postgres table
(spark.read.format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("dbtable", table)
    .option("user", user)
    .option("password", password)
    .load())
Structure of a Delta table
A Delta table provides table-like qualities to an open file format.
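A minimal sketch of what that looks like in storage, assuming a Delta table already written to the hypothetical path /tmp/delta/events:

# A Delta table directory holds Parquet data files plus a _delta_log folder of JSON commits
display(dbutils.fs.ls("/tmp/delta/events"))             # part-*.parquet files and _delta_log/
display(dbutils.fs.ls("/tmp/delta/events/_delta_log"))  # 00000000000000000000.json, 00000000000000000001.json, ...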
Explaining the Delta Lake structure
DataFrames
DataFrames are two-dimensional representations of data.
Look and feel similar to tables
Similar concept for many different data tools: Spark (default), pandas, dplyr, SQL queries
Underlying construct for most data processes

id  customerName  bookTitle
1   John Data     Guide to Spark
2   Sally Bricks  SQL for Data Engineering
3   Adam Delta    Keeping Data Clean

df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/data.csv"))
Writing data
Kinds of tables in Databricks

1. Managed tables
Default type
Stored with Unity Catalog
Databricks managed

df.write.saveAsTable(table_name)

CREATE TABLE table_name
USING delta
AS ...

2. External (unmanaged) tables
Set LOCATION
Customer managed

CREATE TABLE table_name
USING delta
LOCATION "<path>"
AS ...
Let's practice!
Data
transformations in
Databricks
Kevin Barlow
Data Practitioner
SQL for data engineering
SQL

-- Creating a new table in SQL
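A minimal sketch of the kind of statement the comment above refers to, using hypothetical table names; the same SQL can be run in a %sql notebook cell or through spark.sql():

# Hypothetical tables; illustrates CREATE TABLE ... AS SELECT in Databricks SQL
spark.sql("""
    CREATE TABLE sales_summary
    USING delta
    AS
    SELECT region, SUM(amount) AS total_sales
    FROM raw_sales
    GROUP BY region
""")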
Other languages for data engineering
Python, R, Scala

# Creating a new table in PySpark
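And a minimal PySpark equivalent of that statement, assuming a hypothetical DataFrame raw_sales_df:

# Aggregate and save the result as a managed Delta table (names are hypothetical)
(raw_sales_df
    .groupBy("region")
    .sum("amount")
    .withColumnRenamed("sum(amount)", "total_sales")
    .write
    .format("delta")
    .saveAsTable("sales_summary_py"))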
Common transformations
Schema manipulation
Filtering
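The original slide pairs each of these with a PySpark snippet; as a minimal sketch, assuming a hypothetical DataFrame df with price and category columns:

from pyspark.sql.functions import col

# Schema manipulation: rename, cast, and derive columns
df_clean = (df
    .withColumnRenamed("price", "unit_price")
    .withColumn("unit_price", col("unit_price").cast("double"))
    .withColumn("is_expensive", col("unit_price") > 20))

# Filtering: keep only the rows of interest
df_filtered = df_clean.filter(col("category") == "fantasy")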
Common transformations (continued)
Nested data
Arrays or Struct data
Expand or contract

from pyspark.sql.functions import col, explode, flatten, sum

df.select(explode(col('arrayCol')))   # wide to long
df.select(flatten(col('items')))      # long to wide

Aggregation
Group data based on columns
Calculate data summarizations

(df.groupBy(col('region'))
   .agg(sum(col('sales'))))
Auto Loader
Auto Loader processes new data files as they
land in a data lake.
Incremental processing
Efficient processing
Automatic
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.load(file_path)
1 https://www.databricks.com/blog/2020/02/24/introducing-databricks-ingest-easy-data-ingestion-into-delta-lake.html
Structured Streaming
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<host:port>")
    .option("subscribe", "<topic>")
    .load()
    .join(table_df, on="<id>", how="left")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<host:port>")
    .option("topic", "<topic>")
    .option("checkpointLocation", "<path>")
    .start())
Let's practice!
Orchestration in
Databricks
Kevin Barlow
Data Analytics Practitioner
What is data orchestration?
Data orchestration is a form of automation!
Databricks Workflows
Databricks Workflows is a collection of built-in capabilities to orchestrate all your data
processes, at no additional cost!
1 https://docs.databricks.com/workflows
What can we orchestrate?
Data engineers / data scientists
Data analysts
Databricks Jobs
Workflows UI
Users can create jobs directly from the
Databricks UI:
1 https://docs.databricks.com/workflows/jobs
Databricks Jobs
Programmatic

Users can also programmatically create jobs using the Jobs CLI or Jobs API with the Databricks platform.

{
  "name": "A multitask job",
  "tags": {},
  "tasks": [],
  "job_clusters": [],
  "format": "MULTI_TASK"
}
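A minimal sketch of submitting that payload to the Jobs API from Python; the workspace URL and personal access token below are hypothetical placeholders:

import requests

host = "https://<workspace-url>"        # hypothetical workspace URL
token = "<personal-access-token>"       # hypothetical token

job_spec = {
    "name": "A multitask job",
    "tasks": [],                        # task definitions would go here
    "format": "MULTI_TASK",
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json())                      # returns the new job_id on success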
Delta Live Tables
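Delta Live Tables (DLT) is presented visually here; as a minimal sketch of what a DLT pipeline definition can look like in Python, with the table names, landing path, and expectation rule all hypothetical:

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested from cloud storage")
def orders_bronze():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/landing/orders"))

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")   # simple data quality rule
def orders_silver():
    return dlt.read_stream("orders_bronze").where(col("order_id").isNotNull())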
Let's practice!
End-to-end data
pipeline example in
Databricks
Kevin Barlow
Data Practitioner
Let's practice!
Overview of
Databricks SQL
Kevin Barlow
Data Practitioner
Databricks for SQL Users
Databricks SQL
Data Warehousing for the Lakehouse
Databricks SQL vs. other databases
Databricks SQL: open file format (Delta)
Other data warehouses: proprietary data format
SQL in the Lakehouse Architecture
Let's review!
Getting started with
Databricks SQL
Kevin Barlow
Data Practitioner
SQL Compute vs. General Compute
Designing compute clusters for data science or data engineering workloads is inherently different than designing compute for SQL workloads.
SQL Warehouse
SQL Warehouse Configuration Options
1. Cluster Name
4. Cluster Type
SQL Warehouse Types
Different types provide different benefits:
Classic: in customer cloud
Pro
Serverless
SQL Editor
Common SQL Commands
COPY INTO: grab raw data and put into Delta
CREATE <entity> AS: create a Table or View
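A minimal sketch of both commands, assuming a hypothetical bronze_orders Delta table and landing path; the SQL can run in the SQL editor or through spark.sql():

# Load raw files into an existing Delta table (table and path are hypothetical)
spark.sql("""
    COPY INTO bronze_orders
    FROM '/landing/orders/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true')
""")

# Create a view on top of it
spark.sql("""
    CREATE OR REPLACE VIEW orders_by_region AS
    SELECT region, COUNT(*) AS order_count
    FROM bronze_orders
    GROUP BY region
""")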
Let's practice!
Databricks SQL
queries and
dashboards
Kevin Barlow
Data Practitioner
Databricks SQL Assets
Visualizations
Lightweight, in-platform visualizations
Support for standard visual types
Dashboards
Lightweight, easily created dashboards
Ability to share and govern across your organization
Query Filters
Filters
Query Parameters
Parameters
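A minimal sketch of a parameterized query as it would appear in the SQL editor; the double-brace token is an assumed parameter name that renders as a widget above the query:

# Illustrative only: {{ start_date }} becomes a parameter widget in the Databricks SQL editor
parameterized_query = """
SELECT order_date, SUM(amount) AS total_sales
FROM sales.orders
WHERE order_date >= '{{ start_date }}'
GROUP BY order_date
"""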
Let's practice!
Creating a
Databricks SQL
Dashboard
Kevin Barlow
Data Practitioner
Let's practice!
Overview of
Lakehouse AI
Kevin Barlow
Data Practitioner
Lakehouse AI
Why the Lakehouse for AI / ML?
1. Reliable data and files in the Delta lake
1 https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
MLOps Lifecycle
MLOps in the Lakehouse
DataOps
Integrating data across different sources
(AutoLoader)
MLOps in the Lakehouse
ModelOps
Develop and train different models
(Notebooks)
MLOps in the Lakehouse
DevOps
Govern access to different models (Unity
Catalog)
Continuous Integration and Continuous
Deployment (CI/CD) for model versions
(Model Registry)
Let's review!
Using Databricks for
machine learning
Kevin Barlow
Data Practitioner
Machine Learning Lifecycle
1 https://www.datacamp.com/blog/machine-learning-lifecycle-explained
Planning and preparation
Planning for machine learning
What do I have? What do I want?
ML Runtime
Extension of Databricks compute
Optimized for machine learning
applications
MLFlow
Exploratory Data Analysis
# pandas DataFrame
import pandas as pd
pandas_df = df.toPandas()
pandas_df.describe()

# Spark DataFrame
df.summary()
dbutils.data.summarize(df)
Feature tables and feature stores
Raw Data:
count  category  price  shelf_loc  rating
4      horror    12.50  end        3
6      romance   13.99  top        4.5
12     sci-fi    16.50  bottom     5
31     romance    9.99  bottom     3.5
23     fantasy   24.99  top        4
18     horror    19.99  end        2.5
19     cooking   17.50  end        4.5
7      fantasy   12.99  top        3
37     sci-fi    14.99  bottom     5

Feature table (categorical columns encoded as integers):
count  category  price  shelf_loc  rating
4      1         12.50  1          3
6      2         13.99  2          4.5
12     3         16.50  3          5
31     2          9.99  3          3.5
23     4         24.99  2          4
18     1         19.99  1          2.5
19     5         17.50  1          4.5
7      4         12.99  2          3
37     3         14.99  3          5
Databricks Feature Store
Centralized storage for featurized datasets
Easily discover and re-use features for machine learning models

from databricks import feature_store

fs = feature_store.FeatureStoreClient()
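A minimal sketch of registering a feature table with that client, assuming a hypothetical features_df keyed by book_id:

# Register a feature table in the Feature Store (names are hypothetical)
fs.create_table(
    name="main.features.book_features",
    primary_keys=["book_id"],
    df=features_df,
    description="Encoded book features for recommendation models",
)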
Let's practice!
Model training with
MLFlow in
Databricks
Kevin Barlow
Data Practitioner
Machine Learning Lifecycle
1 https://www.datacamp.com/blog/machine-learning-lifecycle-explained
Model training and development
Single-node vs. Multi-node
Single-node machine learning
Multi-node machine learning
AutoML
"Glass box" approach to automated
machine learning
1 https://www.databricks.com/product/automl
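A minimal sketch of launching an AutoML run from a notebook, assuming a hypothetical training DataFrame and target column:

from databricks import automl

# Kick off an AutoML classification experiment (dataset and target are hypothetical)
summary = automl.classify(
    dataset=train_df,
    target_col="churned",
    timeout_minutes=30,
)
print(summary.best_trial.model_path)   # MLflow model URI for the best trial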
MLFlow
Open-source framework
End-to-end machine learning lifecycle management
Track, evaluate, manage, and deploy
Pre-installed on ML Runtime!

import mlflow

with mlflow.start_run() as run:
    # machine learning training
    mlflow.autolog()
    mlflow.log_metric('accuracy', acc)
    mlflow.log_param('k', kNum)
MLFlow Experiments
Collect information across multiple runs in a single location
Sort and compare model runs
MLFlow Experiments
Let's practice!
Deploying a model
in Databricks
Kevin Barlow
Data Practitioner
Machine Learning Lifecycle
1 https://www.datacamp.com/blog/machine-learning-lifecycle-explained
Model Deployment and Operations
Concerns with deploying models
Availability
How will my end users or application use the model?
Where do I need to put my model to access it?
Will the model be easy to understand or use?

Evaluation
Are my users actually using my model?
Is my model still performing well?
Do I need to retrain my model?
Do I need a new model that is better?
Model Deployment Process
Model Flavors
MLFlow Models can store a model from any
machine learning framework
pyfunc
spark
tensorflow
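A minimal sketch of how flavors come into play when logging and loading, assuming a hypothetical Spark ML model object spark_model:

import mlflow

# Log a Spark ML model under the spark flavor (a generic pyfunc flavor is saved alongside it)
with mlflow.start_run():
    mlflow.spark.log_model(spark_model, artifact_path="model")

# Later, load it back through the framework-agnostic pyfunc flavor
loaded = mlflow.pyfunc.load_model("runs:/<run_id>/model")   # <run_id> left as a placeholder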
Model Registry
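The Model Registry is shown through the UI here; a minimal sketch of the same actions from code, with the run ID and model name hypothetical:

import mlflow
from mlflow.tracking import MlflowClient

# Register a logged model under a name in the Model Registry
result = mlflow.register_model("runs:/<run_id>/model", "book_recommender")

# Promote that version through lifecycle stages
client = MlflowClient()
client.transition_model_version_stage(
    name="book_recommender",
    version=result.version,
    stage="Production",
)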
Model Serving
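Model Serving is shown through the UI here; a minimal sketch of querying a serving endpoint over REST, with the endpoint name, workspace URL, and token all hypothetical:

import requests

host = "https://<workspace-url>"
token = "<personal-access-token>"

response = requests.post(
    f"{host}/serving-endpoints/book_recommender/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={"dataframe_records": [{"count": 4, "category": 1, "price": 12.50}]},
)
print(response.json())   # model predictions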
1 https://www.databricks.com/product/model-serving
Let's practice!
Example end-to-end
machine learning
pipeline
Kevin Barlow
Data Practitioner
Let's practice!
Wrap Up
Kevin Barlow
Data Practitioner
Why the Lakehouse?
1 https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
Databricks for data engineering
Apache Spark
Delta
Delta Live Tables
Auto Loader
Structured Streaming
Workflows
Databricks for data warehousing
SparkSQL
ANSI SQL
SQL Warehouses
Queries
Visualizations
Dashboards
Databricks for machine learning
Congratulations!