
Introduction to Databricks

The Databricks Lakehouse

Kevin Barlow
Data Analytics Practitioner
The Data Warehouse

Pros
Great for structured data
Highly performant
Easy to keep data clean

Cons
Very expensive
Cannot support modern applications
Not built for Machine Learning

1 https://www.databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html
The Data Lake

Pros
Support for all use cases
Very flexible
Cost effective

Cons
Data can become messy
Historically not very performant

1 https://www.databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html
Birth of the Lakehouse

1 https://www.databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html
The Databricks Lakehouse Platform

Single platform for all data workloads
Built on open source technology
Collaborative environment
Simplified architecture

1 https://www.databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html
Databricks Architecture Benefits

Unification
Every use case from AI to BI
Benefits of data warehouse and data lake

Multi-Cloud
Bring a powerful platform to your data
No lock-in to a specific cloud platform
Databricks Development Benefits

Collaborative
Every data persona
Ability to work in the same platform in real-time

Open-Source
Underpinned by Apache Spark
Support for most popular languages (Python, R, Scala, SQL)
Let's practice!

Core features of the Databricks Lakehouse Platform
Apache Spark

Apache Spark is an open-source data processing framework and is the engine underneath Databricks.

DataCamp Courses
Introduction to PySpark
Big Data Fundamentals with PySpark
Cleaning Data with PySpark
Machine Learning with PySpark
Introduction to Spark SQL in Python
Benefits of Spark

Key Benefits:
1. Extensible, flexible open-source framework
2. Large developer community
3. High performance
4. Databricks optimizations

1 https://spark.apache.org/docs/latest/cluster-overview.html
Cloud computing basics

Databricks Compute

Clusters
Collection of computational resources
All workloads, any use case
All-purpose clusters vs. Jobs clusters

SQL Warehouses
SQL-only workloads
BI use cases
Photon acceleration
Cloud data storage

Delta

Delta is an open-source data storage file format that provides:
ACID transactions
Unified batch and streaming
Schema evolution
Table history
Time travel

1 delta.io
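As a quick illustration of table history and time travel, a Delta table can be queried at an earlier version. A minimal sketch, assuming a placeholder table named my_table and a Databricks notebook where spark is predefined:

# Read the current version of a Delta table
df = spark.read.table("my_table")

# Time travel: query the table as of an earlier version
df_v0 = spark.sql("SELECT * FROM my_table VERSION AS OF 0")

# Table history: one row per transaction on the table
spark.sql("DESCRIBE HISTORY my_table").show()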
Unity Catalog

Unity Catalog is an open data governance solution that controls access to all data assets in the Databricks Lakehouse platform.

SQL GRANT and REVOKE statements to control access
Simple interface for governance
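For example, access can be granted and revoked with standard SQL. A sketch, where the table and group names are placeholders:

# Grant a group read access to a table (names are illustrative)
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Revoke the same privilege
spark.sql("REVOKE SELECT ON TABLE main.sales.orders FROM `analysts`")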
Databricks UI

Designed for easier access to capabilities based on your data workload.

All users have access to data and compute
SQL users get a familiar interface for queries and reports
Data engineers leverage Delta Live Tables
Machine Learning workloads use models, features, and more
Let's review!

Administering a Databricks workspace
Account Admin

Key Responsibilities:
Creating and managing workspaces
Enabling Unity Catalog
Managing identities
Managing the account subscription
Account Console

https://accounts.cloud.databricks.com/

Console sections: Workspaces, Data, Users & Groups, Settings
Workspace Admin

Key Responsibilities:
Managing identities in your workspace
Creating and managing compute resources
Managing workspace features and settings
Data Plane

Contains all of the customer's assets needed for computation with Databricks.
Data is stored in the customer's cloud environment
Clusters / SQL Warehouses run in the customer's cloud tenant
Control Plane

The portion of the platform that is managed and hosted by Databricks.
Orchestrates various background tasks in Databricks
Sends requests to the Data Plane to create clusters, run jobs, etc.
Databricks Platform Architecture

Each cloud offers the same general options to create a workspace:
Cloud Service Provider marketplace
Account Console
Databricks Accounts API
Programmatic deployment (e.g., Terraform)

1 https://docs.databricks.com/getting-started/overview.html
Let's review!

Setting up a Databricks workspace example

Let's practice!
Getting started with Databricks
Compute cluster refresh

Create your first cluster

The first step is to create a cluster for your data processing!

Configuration options:
Cluster policies and access
Databricks Runtime
Photon Acceleration
Node instance types and number
Auto-scaling / Auto-termination
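These options map onto the cluster specification accepted by the Clusters API. A minimal sketch, assuming a placeholder workspace URL and token, with illustrative values:

import requests

cluster_spec = {
    "cluster_name": "my-first-cluster",
    "spark_version": "13.3.x-scala2.12",                # Databricks Runtime version
    "node_type_id": "i3.xlarge",                        # node instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},  # auto-scaling bounds
    "autotermination_minutes": 60,                      # auto-termination after inactivity
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)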
Data Explorer

Get familiar with the Data Explorer! In this UI, you can:
1. Browse available catalogs/schemas/tables
2. Look at sample data and summary statistics
3. View data lineage and history

You can also upload new data by clicking the "plus" icon!

1 Photo by Jakub Zerdzicki: https://www.pexels.com/photo/magnifier-loupe-17284804/
Create a notebook

Databricks notebooks:
Standard interface for Databricks
Improvements on open-source Jupyter
Support for many languages: Python, R, Scala, SQL
Magic commands (%sql)
Built-in visualizations
Real-time commenting and collaboration
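For instance, a cell in the notebook's default language can use the predefined spark session and display(), while a magic command switches another cell's language. A sketch; the table name is illustrative:

# Python cell: spark and display() are predefined in Databricks notebooks
df = spark.read.table("samples.nyctaxi.trips")
display(df)   # renders an interactive table with built-in visualizations

# A separate cell prefixed with a magic command runs as SQL:
# %sql
# SELECT COUNT(*) FROM samples.nyctaxi.trips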

Let's practice!
Data Engineering foundations in Databricks
Medallion architecture

Reading data

Spark is a highly flexible framework and can read from various data sources/types.

Common data sources and types:
Delta tables
File formats (CSV, JSON, Parquet, XML)
Databases (MySQL, Postgres, EDW)
Streaming data
Images / Videos

# Delta table
spark.read.table('<table_name>')

# CSV files
spark.read.format('csv').load('*.csv')

# Postgres table
(spark.read.format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("dbtable", table)
    .option("user", user)
    .option("password", password)
    .load())
Structure of a Delta table

A Delta table provides table-like qualities to an open file format.
Feels like a table when reading
Access to underlying files (Parquet and JSON)
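A quick way to see this structure is to list the table's directory. A sketch; the path is a placeholder, and dbutils is predefined in Databricks notebooks:

# Parquet data files plus a _delta_log/ directory
dbutils.fs.ls("<path>/my_delta_table")

# The transaction log: JSON entries, one per commit
dbutils.fs.ls("<path>/my_delta_table/_delta_log")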
DataFrames

DataFrames are two-dimensional representations of data.
Look and feel similar to tables
Similar concept across many different data tools: Spark (default), pandas, dplyr, SQL queries
Underlying construct for most data processes

id  customerName  bookTitle
1   John Data     Guide to Spark
2   Sally Bricks  SQL for Data Engineering
3   Adam Delta    Keeping Data Clean

df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/data.csv"))
Writing data

Kinds of tables in Databricks:

1. Managed tables
Default type
Stored with Unity Catalog
Databricks managed

df.write.saveAsTable(table_name)

CREATE TABLE table_name
USING delta
AS ...

2. External tables
Stored in another location
Set LOCATION
Customer managed

df.write.option("path", "<path>").saveAsTable(table_name)

CREATE TABLE table_name
USING delta
LOCATION "<path>"
AS ...
Let's practice!

Data transformations in Databricks
SQL for data engineering

SQL
Familiar for Database Administrators (DBAs)
Great for standard manipulations
Execute pre-defined UDFs

-- Creating a new table in SQL
CREATE TABLE table_name
USING delta
AS (
    SELECT *
    FROM source_table
    WHERE date >= '2023-01-01'
)
Other languages for data engineering

Python, R, Scala
Familiar for software engineers
Standard and complex transformations
Use and define custom functions

# Creating a new table in PySpark
from pyspark.sql.functions import col

(spark
    .read
    .table('source_table')
    .filter(col('date') >= '2023-01-01')
    .write
    .saveAsTable('table_name'))
Common transformations

Schema manipulation
Add and remove columns
Redefine columns

# PySpark
(df
    .withColumn('newCol', ...)
    .drop('oldCol'))

Filtering
Reduce DataFrame to a subset of data
Pass multiple criteria

# PySpark
(df
    .filter(col('date') >= target_date)
    .filter(col('id').isNotNull()))
Common transformations (continued)

Nested data
Arrays or struct data
Expand or contract

# PySpark
from pyspark.sql.functions import explode, flatten, col, sum
df.select(explode(col('arrayCol')))   # arrays: one row per element (wide to long)
df.select(flatten(col('items')))      # nested arrays: collapse into a single array

Aggregation
Group data based on columns
Calculate data summarizations

# PySpark
(df
    .groupBy(col('region'))
    .agg(sum(col('sales'))))
Auto Loader

Auto Loader processes new data files as they land in a data lake.
Incremental processing
Efficient processing
Automatic

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(file_path))

1 https://www.databricks.com/blog/2020/02/24/introducing-databricks-ingest-easy-data-ingestion-into-delta-lake.html
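To complete the pipeline, the stream can be written incrementally into a Delta table. A minimal sketch, assuming placeholder paths and an illustrative target table name:

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "<schema-path>")   # tracks the inferred schema
    .load("<landing-path>")
    .writeStream
    .option("checkpointLocation", "<checkpoint-path>")      # records progress for incremental processing
    .trigger(availableNow=True)                             # process all available files, then stop
    .toTable("bronze_events"))                              # target Delta table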
Structured Streaming

Structured Streaming applies the same DataFrame API to continuous sources and sinks, such as Kafka:

(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<host:port>")
    .option("subscribe", "<topic>")
    .load()
    .join(table_df, on="<id>", how="left")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<host:port>")
    .option("topic", "<topic>")
    .option("checkpointLocation", "<path>")
    .start())
Let's practice!

Orchestration in Databricks
What is data orchestration?
Data orchestration is a form of automation!

Enables data engineers to automate the end-to-end data life cycle

Databricks Workflows
Databricks Workflows is a collection of built-in capabilities to orchestrate all your data
processes, at no additional cost!

Example Databricks Workflow

1 https://docs.databricks.com/workflows

What can we orchestrate?

Data engineers / data scientists
Data analysts
Databricks Jobs
Workflows UI

Users can create jobs directly from the Databricks UI:
Directly from a notebook
In the Workflows section

1 https://docs.databricks.com/workflows/jobs
Databricks Jobs
Programmatic

Users can also programmatically create jobs using the Jobs CLI or the Jobs API.

{
  "name": "A multitask job",
  "tags": {},
  "tasks": [],
  "job_clusters": [],
  "format": "MULTI_TASK"
}
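As an illustration, a job like the JSON above could be created with a direct call to the Jobs API. A sketch; the workspace URL, token, and notebook path are placeholders:

import requests

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "name": "A multitask job",
        "tasks": [{
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/path/to/notebook"},
        }],
    },
)
print(resp.json())   # {"job_id": ...} on success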
Delta Live Tables
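Delta Live Tables pipelines are declared in notebooks with the dlt module. A minimal sketch (it runs inside a DLT pipeline, not a plain notebook; table names are illustrative):

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Cleaned events")          # declares a live table managed by the pipeline
def clean_events():
    return (
        spark.read.table("raw_events")        # source table name is illustrative
             .filter(col("date") >= "2023-01-01")
    )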
Let's practice!
End-to-end data pipeline example in Databricks

Let's practice!
Overview of Databricks SQL
Databricks for SQL Users

Databricks SQL is data warehousing for the Lakehouse:
Familiar environment for SQL users
SQL-optimized performance (Photon)
Connect to your favorite BI tools
Comes built into the platform!
Databricks SQL vs. other databases

Databricks SQL                        Other Data Warehouses
Open file format (Delta)              Proprietary data format
Separation of compute and storage     Storage often tied to compute
ANSI SQL                              Tech-specific SQL
Integrated into other data workloads  Usually lacking advanced analytics

SQL in the Lakehouse Architecture
Let's review!

Getting started with Databricks SQL
SQL Compute vs. General Compute

Designing compute clusters for data science or data engineering workloads is inherently different than designing compute for SQL workloads.

# General compute: PySpark
import pyspark.sql.functions as F

spark_df = (spark
    .read
    .table('user_table'))

spark_df = (spark_df
    .withColumn('score', F.flatten(...)))

-- SQL compute
SELECT *
FROM user_table u
LEFT JOIN product_use p
    ON u.userId = p.userId
WHERE country = 'USA'
    AND utilization >= 0.6
SQL Warehouse

SQL Warehouse Configuration Options:
1. Cluster Name
2. Cluster Size (S, M, L, etc.)
3. Scaling behavior
4. Cluster Type
SQL Warehouse Types

Different types provide different benefits:

Classic
Most basic SQL compute
In customer cloud

Pro
More advanced features than Classic
In customer cloud

Serverless
Cutting edge features
In Databricks cloud
Most cost performant
SQL Editor

Common SQL Commands

COPY INTO
Grab raw data and put into Delta
The Extract of ETL

COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

CREATE <entity> AS
Create a Table or View
The Transform in ETL

CREATE TABLE events
USING DELTA
AS (
    SELECT *
    FROM raw_events
    WHERE ...
)
Let's practice!

Databricks SQL queries and dashboards
Databricks SQL Assets
Visualizations

Lightweight, in-platform visualizations
Support for standard visual types
Ability to quickly comprehend data in a graphical way
Dashboards

Lightweight, easily created dashboards
Ability to share and govern across your organization
Scalable and performant
Query Filters

Filters are interactive query / dashboard components that allow the user to reduce the size of the result dataset.
Work on the client side, so they are very fast
Support single select, multi-select, text fields, and date / time pickers

SELECT *
FROM nyctaxi.trips
WHERE pickup_zip = 10103
    AND dropoff_zip = 10023
Query Parameters

Parameters are more flexible than filters and support more kinds of selectors.
Allow the user to provide a value that is inserted into the underlying SQL query text
Created in the query by using the {{ }} syntax

SELECT *
FROM nyctaxi.trips
WHERE pickup_zip = 10103
    AND dropoff_zip = 10023
    AND {{ nullCheck }} IS NOT NULL
Let's practice!

Creating a Databricks SQL Dashboard

Let's practice!
Overview of Lakehouse AI
Lakehouse AI

Why the Lakehouse for AI / ML?
1. Reliable data and files in the Delta lake
2. Highly scalable compute
3. Open standards, libraries, frameworks
4. Unification with other data teams

1 https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html
MLOps Lifecycle

MLOps in the Lakehouse

DataOps
Integrating data across different sources (Auto Loader)
Transforming data into a usable, clean format (Delta Live Tables)
Creating useful features for models (Feature Store)
MLOps in the Lakehouse

ModelOps
Develop and train different models (Notebooks)
Machine learning templates and automation (AutoML)
Track parameters, metrics, and trials (MLflow)
Centralize and consume models (Model Registry)
MLOps in the Lakehouse

DevOps
Govern access to different models (Unity Catalog)
Continuous Integration and Continuous Deployment (CI/CD) for model versions (Model Registry)
Deploy models for consumption (Serving Endpoints)
Let's review!

Using Databricks for machine learning
Machine Learning Lifecycle

1 https://www.datacamp.com/blog/machine-learning-lifecycle-explained

Planning and preparation

Planning for machine learning

What do I have?
1. Data availability
2. Business requirements
3. Data scientists/data analysts

What do I want?
1. Use cases
2. Legal and security compliance
3. Business outcomes
ML Runtime

Extension of Databricks compute, optimized for machine learning applications.
Contains most common libraries and frameworks: scikit-learn, SparkML, TensorFlow, MLflow
Works with cluster library management
Exploratory Data Analysis

# pandas: df here is a pandas DataFrame
import pandas as pd
df.describe()

# Spark DataFrame
df.summary()

# Databricks utility: rich summary statistics for a Spark DataFrame
dbutils.data.summarize(df)

# bamboolib: displaying the DataFrame opens an interactive EDA UI
import bamboolib as bam
df
Feature tables and feature stores

Raw Data
count  category  price  shelf_loc  rating
4      horror    12.50  end        3
6      romance   13.99  top        4.5
12     sci-fi    16.50  bottom     5
31     romance   9.99   bottom     3.5
23     fantasy   24.99  top        4
18     horror    19.99  end        2.5
19     cooking   17.50  end        4.5
7      fantasy   12.99  top        3
37     sci-fi    14.99  bottom     5

Feature table (categorical columns encoded as integers)
count  category  price  shelf_loc  rating
4      1         12.50  1          3
6      2         13.99  2          4.5
12     3         16.50  3          5
31     2         9.99   3          3.5
23     4         24.99  2          4
18     1         19.99  1          2.5
19     5         17.50  1          4.5
7      4         12.99  2          3
37     3         14.99  3          5
Databricks Feature Store

Centralized storage for featurized datasets
Easily discover and re-use features for machine learning models
Upstream and downstream lineage

from databricks import feature_store

fs = feature_store.FeatureStoreClient()

fs.create_table(
    name=table_name,
    primary_keys=["wine_id"],
    df=features_df,
    schema=features_df.schema,
    description="wine features"
)
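Features written this way can later be read back for training. A one-line sketch with the same client, continuing the names above:

training_df = fs.read_table(name=table_name)   # the feature table as a Spark DataFrame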
Let's practice!

Model training with MLflow in Databricks
Machine Learning Lifecycle

1 https://www.datacamp.com/blog/machine-learning-lifecycle-explained

Model training and development

Single-node vs. Multi-node

Single-node machine learning
Great for experimenting and starting
Easier initial setup
Hard to implement in production

Multi-node machine learning
Great for production workloads
Easier maintenance long-term
Highly scalable
AutoML

"Glass box" approach to automated machine learning
Leverages open-source libraries
Creates models based on data and targeted prediction
Provides a notebook with the generated code for further development

1 https://www.databricks.com/product/automl
MLflow

Open-source framework
End-to-end machine learning lifecycle management
Track, evaluate, manage, and deploy
Pre-installed on the ML Runtime!

import mlflow

with mlflow.start_run() as run:
    # machine learning training
    mlflow.autolog()
    mlflow.log_metric('accuracy', acc)
    mlflow.log_param('k', kNum)
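Putting it together, a minimal sketch of a tracked training run, assuming the ML Runtime (mlflow and scikit-learn pre-installed) and synthetic data:

import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    mlflow.log_param("max_iter", 200)           # hyperparameter
    mlflow.log_metric("accuracy", acc)          # evaluation metric
    mlflow.sklearn.log_model(model, "model")    # the trained model artifact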
MLflow Experiments

Collect information across multiple runs in a single location
Sort and compare model runs
Find and promote the best model
Let's practice!

Deploying a model in Databricks
Machine Learning Lifecycle

1 https://www.datacamp.com/blog/machine-learning-lifecycle-explained

Model Deployment and Operations

Concerns with deploying models

Availability
How will my end users or application use the model?
Where do I need to put my model to access it?
Will the model be easy to understand or use?

Evaluation
Are my users actually using my model?
Is my model still performing well?
Do I need to retrain my model?
Do I need a new model that is better?
Model Deployment Process

Model Flavors

MLflow Models can store a model from any machine learning framework.
Models are stored alongside different configurations and artifacts
Models can be "translated" into another kind of model based on needs. For example:
scikit-learn
pyfunc
spark
tensorflow
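As an illustration, a model logged with the scikit-learn flavor can also be loaded through the generic pyfunc flavor. A sketch; the run ID is a placeholder:

import mlflow

model_uri = "runs:/<run_id>/model"
sk_model = mlflow.sklearn.load_model(model_uri)   # native scikit-learn object
py_model = mlflow.pyfunc.load_model(model_uri)    # generic python_function interface
# py_model.predict(...) gives a uniform prediction API regardless of framework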
Model Registry
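Registering a logged model is a single call. A sketch; the run ID and model name are placeholders:

import mlflow

result = mlflow.register_model("runs:/<run_id>/model", "<model_name>")
print(result.version)   # each registration creates a new model version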
Model Serving

1 https://www.databricks.com/product/model-serving
Let's practice!

Example end-to-end machine learning pipeline

Let's practice!
Wrap Up
Why the Lakehouse?

1 https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html

Databricks for data engineering
Apache Spark
Delta
Delta Live Tables
Auto Loader
Structured Streaming
Workflows

Databricks for data warehousing

SparkSQL
ANSI SQL
SQL Warehouses
Queries
Visualizations
Dashboards
Databricks for machine learning

Congratulations!