ADB Course Catalog
UPDATED: FEBRUARY 2022
Contents
Training
Credentials
Learning paths
Platform administration
Data analysis
Data engineering
Certification Prep Course for the Databricks Certified Associate Developer for Apache Spark Exam
Introduction to Jobs
Introduction to Photon
Structured Streaming
Credential descriptions
Training
● Self-paced online courses - asynchronous virtual training available to individuals through the Databricks Academy website. This training is free for Databricks customers. Each course is typically 1-2 hours in length.
● Workshops - live 1-3 hour trainings made available to groups, typically in a virtual format. Please reach out to a CSE / Databricks Account manager to request a Workshop.
Current pathways are available for business leaders, data analysts, data engineers,
data scientists, and platform administrators. The credential milestones for each step
within these pathways are shown in the image below.
Please note that we are actively making updates to the data analyst, data scientist,
and platform administration pathways. These updates will result in new certification
exams, as shown in the image below:
Below, you’ll find a breakdown of the courses required for each of these steps. We
will update these regularly, as new courses are released.
Data engineering
Instructor-led course descriptions
For a full list of available instructor-led courses, along with their descriptions, please click
here.
Duration: 12 hours
Duration: 1 hour
● Prerequisites
○ Basic SQL commands
○ Experience working with SQL in a Databricks notebook
Learning objectives
● Use optional arguments in CREATE TABLE to define data format and location
in a Databricks database
● Efficiently copy, modify, and create new tables from existing ones
● Use built-in functions and features of Spark SQL to manage and manipulate nested objects
● Use roll-up, cube, and window functions to aggregate data and pivot tables (see the sketch below)
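To make the last two objectives concrete, here is a minimal, hedged sketch of a CUBE aggregation and a window function in Spark SQL. It assumes a Databricks notebook, where `spark` is predefined; the `sales` table and its columns are hypothetical.

```python
# Hypothetical `sales` table with columns: region, product, amount.
# `spark` is predefined in Databricks notebooks.

# CUBE produces subtotals for every combination of the grouping columns,
# plus a grand-total row.
spark.sql("""
    SELECT region, product, SUM(amount) AS total_amount
    FROM sales
    GROUP BY CUBE (region, product)
""").show()

# A window function ranks rows within each region without collapsing them.
spark.sql("""
    SELECT region, product, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
    FROM sales
""").show()
```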
Duration: 1 hour
Prerequisites:
Learning objectives:
● Describe use cases for Databricks in an enterprise cloud architecture.
● Configure secure connections from Databricks to data in S3.
● Configure connections between Databricks and various first-party tools in an
enterprise cloud architecture, including Redshift and Kinesis.
● Deploy an MLflow model to a Sagemaker endpoint for serving online model
predictions.
● Configure Glue as an enterprise data catalog.
Course description: In this course, you will first define computation resources
(clusters, jobs, and pools) and determine which resources to use for different
workloads. Then, you will learn cluster provisioning strategies for several use cases to
maximize usability and cost-effectiveness. You will also identify best practices for
cluster governance, including cluster policies. This course also covers capacity
limits, cost management, and chargeback analysis.
Prerequisites:
Learning objectives:
Duration: 1 hour
Course description: In this course, you will learn about the Databricks File System
and Hive Metastore concepts. Then, you will apply best practices to secure access
to Amazon S3 from Databricks. Next, you will configure access control for data
objects including tables, databases, views, and functions. You will also apply column
and row-level permissions and data masking with dynamic views for multiple users
and groups. Lastly, you will identify methods for data isolation within your
organization on Databricks.
Prerequisites:
Learning objectives:
● Describe fundamental concepts about the Databricks File System and Hive
Metastore.
● Apply best practices to secure access to Amazon S3 from Databricks.
● Configure access control for data objects including tables, databases, views,
and functions.
● Apply column and row-level permissions and data masking with dynamic
views for multiple users and groups.
● Identify methods for data isolation within your organization on Databricks.
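As a hedged illustration of the dynamic-view objective above, the sketch below combines column masking and row-level filtering. The table, view, and group names are hypothetical; `spark` is assumed predefined in a Databricks notebook, and `is_member()` is the Databricks SQL function that checks the current user's group membership.

```python
# Hypothetical dynamic view for column masking and row-level permissions.
spark.sql("""
    CREATE OR REPLACE VIEW hr.employees_redacted AS
    SELECT
      id,
      name,
      -- Column-level masking: only members of hr_admins see raw salaries.
      CASE WHEN is_member('hr_admins') THEN salary ELSE NULL END AS salary,
      region
    FROM hr.employees
    -- Row-level permission: non-admins see only rows whose region matches
    -- a group they belong to.
    WHERE is_member('hr_admins') OR is_member(region)
""")
```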
AWS Databricks Identity Access
Management
Click here for the customer enrollment link.
Course description: In this course, you will learn how to manage user accounts and
groups in the Admin Console. You will also learn how to manage token-based
authentication and settings for your workspace, such as workspace storage and
additional cluster configurations. Lastly, this course covers access control for
workspace objects, such as notebooks and folders, in addition to clusters, pools, and
jobs.
Prerequisites:
Learning objectives:
Duration: 1 hour
Prerequisites:
● Beginning-level knowledge of basic AWS cloud computing terms (e.g., S3, VPC, IAM)
● Beginning-level knowledge of basic Databricks concepts (e.g., workspace, clusters, notebooks)
Learning objectives:
Duration: 1 hour
Course description: In this course, you will learn how to set up and configure access
to the Databricks SQL Analytics user interface. The administrative tasks in this
course will be done using the Databricks Workspace and Databricks SQL Analytics
UI, and will not include instruction for API access. By the end of this course, you will
be able to set up computational resources for users, grant and revoke access to
specific data, manage users and groups, and set up alert destinations.
Prerequisites:
Learning objectives:
Duration: 1 hour
Course description: This course will walk you through setting up your Databricks
account including setting up billing, configuring your AWS account, and adding users
with appropriate permissions. At the end of this course, you'll find guidance and
resources for additional setup options and best practices.
Prerequisites:
Learning objectives:
Duration: 1 hour
Prerequisites:
Learning objectives:
Duration: 1 hour
Course description: In this course, you will first define computation resources
(clusters, jobs, and pools) and determine which resources to use for different
workloads. Then, you will learn cluster provisioning strategies for several use cases to
maximize usability and cost-effectiveness. You will also identify best practices for
cluster governance, including cluster policies. This course also covers capacity
limits, cost management, and chargeback analysis.
Prerequisites:
Learning objectives:
Duration: 1 hour
Course description: In this course, you will learn about the Databricks File System
and Hive Metastore concepts. Then, you will apply best practices to secure access
to Azure data storage from Azure Databricks. Next, you will configure access control
for data objects including tables, databases, views, and functions. You will also apply
column and row-level permissions and data masking with dynamic views for
multiple users and groups. Lastly, you will identify methods for data isolation within
your organization on Azure Databricks.
Prerequisites:
Learning objectives:
Duration: 45 minutes
Course description: In this course, you will learn how to manage user accounts and
groups in the Admin Console. You will also learn how to manage token-based
authentication and settings for your workspace, such as workspace storage and
additional cluster configurations. Lastly, this course covers access control for
workspace objects, such as notebooks and folders, in addition to clusters, pools, and
jobs.
Prerequisites:
Learning objectives:
Duration: 1 hour
Course description: In this course, you will learn how to set up and configure access
to the Databricks SQL Analytics user interface. The administrative tasks in this
course will be done using the Databricks Workspace and Databricks SQL Analytics
UI, and will not include instruction for API access. By the end of this course, you will
be able to set up computational resources for users, grant and revoke access to
specific data, manage users and groups, and set up alert destinations.
Prerequisites:
Learning objectives:
Duration: 10 minutes
Course description: In this course, you will identify the prerequisites for creating an
Azure Databricks workspace, deploy an Azure Databricks workspace in the Azure
portal, launch the workspace, and access the Admin Console.
Prerequisites:
● To complete the actions outlined in this course, you must have access to an
Azure subscription.
Learning objectives:
Duration: 1 hour
Learning objectives
● Write basic SQL queries to subset gold-level tables using Databricks SQL
Queries
● Join multiple tables together to create a new table.
● Aggregate data columns using SQL functions to answer defined business
questions.
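A minimal sketch of the kind of query these objectives describe, assuming a Databricks notebook (`spark` predefined) and two hypothetical gold-level tables:

```python
# Hypothetical tables: orders(customer_id, total) and customers(id, segment).
spark.sql("""
    SELECT c.segment,
           COUNT(*)               AS order_count,
           ROUND(SUM(o.total), 2) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.segment
    ORDER BY revenue DESC
""").show()
```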
Duration: 2 hours
Course description: Prepare to take the Databricks Certified Associate Developer for
Apache Spark Exam. This course will cover the format and structure of the exam,
skills needed for the exam, tips for exam preparation, and the parts of the
DataFrame API and Spark architecture covered in the exam.
Prerequisites:
Learning objectives:
Duration: 1 hour
Course description: Databricks has extensive access control lists (ACLs) for
workspace assets to help administrators restrict and grant access to appropriate
users. This course includes a set of instructions and caveats for configuring many of
these settings, as well as a video walkthrough showing this configuration and the
resultant user experience.
Prerequisites:
Learning objectives:
Duration: 12 hours
Duration: 1 hour
Course description: This course will provide an overview of the features and
functionality within the Unified Data Analytics Platform that enable data
practitioners to follow data science and machine learning workflows. Aside from an
overview of features and functionality, the course will provide learners with
hands-on experience using the Unified Data Analytics Platform to execute basic
tasks and solve a real-world problem.
Prerequisites:
Learning objectives:
Duration: 1 hour
Course description: In this course, we’ll show you how to use scikit-learn on
Databricks, along with some core statistical and data science principles, to select a
family of machine learning models for deployment.
This course is the first in a series of three courses developed to show you how to
use Databricks to work with a single data set from experimentation to
production-scale machine learning model deployment. The other courses in this
series include:
Prerequisites:
Learning objectives:
Duration: 1 hour
Course description: In this course, you will learn how to use Databricks SQL, an
integrated SQL editing and dashboarding tool, from your Databricks workspace.
Working with Databricks SQL allows you to easily query your data lake, or other data
sources, and build dashboards that can be easily shared across your organization.
You will learn how to parameterize queries so that users can easily modify
dashboard views to target specific results. Also, we will make use of alerts for
ongoing monitoring so that you can be notified when certain events occur or when
particular attributes of a data set reach a certain threshold.
Prerequisites:
Learning objectives:
● Describe how you can use SQL from your Databricks workspace.
● Execute queries and create visualizations using Databricks SQL.
● Write parameterized queries so that users can easily customize their results
and visualizations.
● Create and share dashboards that hold a collection of visualizations.
Duration: 35 minutes
Course description: In this short course, you’ll learn how to create databases, tables,
and views on Databricks. Special attention will be given to differences in scope and
persistence for these various entities, allowing any user responsible for creating or
managing databases, tables, or views to make informed decisions for their
organization. While the syntax for creating and working with databases, tables, and
views will be familiar to most SQL users, some default behaviors may surprise users
new to Databricks.
Prerequisites:
Learning objectives:
Duration: 45 minutes
Prerequisites:
Learning objectives:
● Install and configure the Databricks CLI to securely interact with the
Databricks Workspace.
● Configure workspace secrets using the CLI for more secure sharing and use of
string-based credentials in notebooks.
● Sync notebooks and libraries between the Databricks workspace and other
environments using the CLI.
● Perform a variety of tasks including interacting with clusters, jobs, and runs
using the CLI.
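To complement the CLI objectives above, here is a hedged sketch of consuming a workspace secret from a notebook once it has been created with the CLI (for example, with `databricks secrets create-scope` and `databricks secrets put`). The scope, key, and JDBC details are hypothetical; `spark` and `dbutils` are predefined in Databricks notebooks.

```python
# Fetch a credential stored in a workspace secret scope; the value is
# redacted if echoed in notebook output.
password = dbutils.secrets.get(scope="jdbc-scope", key="warehouse-password")

# Use the secret to read from a (hypothetical) external database over JDBC.
df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://example-host:5432/warehouse")
          .option("dbtable", "public.events")
          .option("user", "analyst")
          .option("password", password)
          .load())
df.show(5)
```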
Duration: 1 hour
Prerequisites:
Learning objectives:
Course description: This course dives into the platform architecture and key
security features of Databricks on Google Cloud. You will start with an overview of
Databricks on Google Cloud and how it integrates with the Google Cloud ecosystem.
Then, you will define core components of the platform architecture and deployment
model on Databricks on Google Cloud. You will also learn about key security features
to consider when provisioning and managing workspaces, as well as guidelines on
network security, identity and access management, and data protection.
● Prerequisites
○ Basic familiarity with Databricks concepts (workspace, notebooks, clusters, DBFS, etc.)
○ Basic familiarity with Google Cloud concepts (projects, IAM, GCS, VPC, subnets, GKE, etc.)
Learning objectives
● Prerequisites
○ Familiarity with the Databricks on Google Cloud workspace
○ Beginning knowledge of Spark programming (reading/writing data,
batch and streaming jobs, transformations and actions)
○ Beginning-level experience using Python or Scala to perform basic
control flow operations.
○ Familiarity with navigation and resource configuration in the Databricks
on Google Cloud Console.
Learning objectives
Course description: This course covers essential cluster configuration features and
provisioning guidelines for Databricks on Google Cloud. In this course, you will start
by defining core computation resources (clusters, jobs, and pools) and determine
which resources to use for different workloads. Then, you will learn cluster
provisioning strategies for several use cases to maximize manageability. Lastly, you
will learn how to manage cluster usage and cost for your Databricks on Google Cloud
workspaces.
● Prerequisites
○ Beginning experience using the Databricks workspace
○ Beginning experience with Databricks administration
Learning objectives
Duration: 20 minutes
Course description: This is a short course that shows new customers how to set up
a Databricks account and deploy a workspace on Google Cloud. This will cover
accessing the Account Console and adding account admins, provisioning and
accessing workspaces, and adding users and admins to a workspace.
Learning objectives
Databricks with R
Click here for the customer enrollment link.
Duration: 7 hours
Course description: In this seven-hour course, you will analyze clickstream data from
an imaginary mattress retailer called Bedbricks. In this case study, you'll explore the
fundamentals of Spark Programming with R on Databricks, including Spark
architecture, the DataFrame API, and Machine Learning.
● Prerequisites
○ Beginning experience working with R.
Learning objectives
Course description: Apache Spark™ is the dominant processing framework for big
data. Delta Lake is a robust storage solution designed specifically to work with
Apache Spark™. It adds reliability to Spark so your analytics and machine learning
initiatives have ready access to quality, reliable data. Delta Lake makes data lakes
easier to work with and more robust. It is designed to address many of the problems
commonly found with data lakes. This course covers the basics of working with
Delta Lake, specifically with Python, on Databricks.
Prerequisites:
Learning objectives:
Duration: 2 hours
Course description: Apache Spark™ is the dominant processing framework for big
data. Delta Lake is a robust storage solution designed specifically to work with
Apache Spark™. It adds reliability to Spark so your analytics and machine learning
initiatives have ready access to quality, reliable data. Delta Lake makes data lakes
easier to work with and more robust. It is designed to address many of the problems
commonly found with data lakes. This course covers the basics of working with
Delta Lake, specifically with Spark SQL, on Databricks.
Prerequisites:
● How to upload data into a Databricks Workspace
● How to visualize data using Databricks
● Intermediate-level Spark SQL usage, including the CTAS pattern and Spark SQL functions such as from_unixtime, lag, lead, and partitioning.
Learning objectives:
● Use Delta Lake to create a new Delta table and to convert an existing
Parquet-based data lake table
● Differentiate between a batch append and an upsert to a Delta table
● Use Delta Lake Time Travel to view different versions of a Delta table
● Execute a MERGE command to upsert data into a Delta table
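To illustrate the upsert and Time Travel objectives above, a minimal hedged sketch follows. The table names are hypothetical, and it assumes a Databricks cluster with Delta Lake and a predefined `spark`.

```python
# Upsert: update matching customer rows and insert new ones from a staging table.
spark.sql("""
    MERGE INTO customers AS t
    USING customer_updates AS s
      ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time Travel: query the table as it looked at an earlier version.
previous = spark.sql("SELECT * FROM customers VERSION AS OF 0")
previous.show(5)
```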
Duration: 2 hours
Course description: In this course, we’ll show you how to train and deploy a large-scale machine learning model using MLflow and Apache Spark. This course is the
third in a series of three courses developed to show you how to use Databricks to
work with a single data set from experimentation to production-scale machine
learning model deployment. We recommend taking the first two courses in this
series before continuing with this course:
Prerequisites:
Learning objectives:
● Summarize Databricks best practices for deploying machine learning projects
with MLflow.
● Explain local development strategies for writing software with Databricks.
● Use Databricks to write production-grade machine learning software.
Duration: 1 hour
Course description: Databricks Auto Loader is the preferred method for ingesting
incremental data landing in cloud object storage into your Lakehouse. This course
introduces Auto Loader and demonstrates some of the newer features added to this
product. Included are recommended patterns for data ingestion with Auto Loader.
Prerequisites:
Learning objectives:
Course description: This course teaches experienced SQL analysts and engineers
how to complete common ELT tasks using Spark SQL on Databricks. Students will
extract data from multiple data sources, load data into Delta Lake tables, and apply
data quality checks and transformations. Students will also learn how to leverage
existing tables in a Lakehouse for last-mile ETL to support dashboards and
reporting.
Prerequisites:
Learning objectives:
● Extract data from a variety of common data sources using Spark SQL in the
Databricks Data Science and Engineering workspace
● Load data into Delta Lake tables using the Databricks Data Science and
Engineering workspace
● Apply transformations to complete common cleaning tasks and data quality
checks using the Databricks Data Science and Engineering workspace
● Reshape datasets with advanced functions to derive analytical insights using
the Databricks Data Science and Engineering workspace
Duration: 7 hours
Course description: In this course, you’ll learn how business leaders, admins, and architects use Databricks in their architecture. We’ll cover fundamental concepts about key players (data engineers, data scientists, and platform administrators) and raw data forms (structured and unstructured data, batch and streaming data) to set the stage for our discussion of how end users help businesses create data assets like machine learning models, reports, and dashboards. Then, we’ll discuss where components of Azure Databricks fit into an organization’s big data ecosystem. Finally, we’ll review real-world business use cases and create enterprise-level architecture infrastructure diagrams.
Prerequisites:
● Beginning knowledge about characteristics that define big data (3 of the Vs of
big data - velocity, volume, variety)
● Beginning knowledge about how organizations process and manage big data
(Relational/SQL vs NoSQL, cloud vs. on-premise, open-source database vs.
closed-source database as a service)
● Beginning knowledge about the roles that data practitioners play on data
science teams (can distinguish between database administrators and data
scientists, data analysts and machine learning engineers, data engineers and
platform administrators)
Learning objectives:
● Prerequisites
○ Beginning-level knowledge of the Databricks Lakehouse platform
(high-level knowledge of the structure and benefits of the Lakehouse
platform)
○ Intermediate-level knowledge of Python (good understanding of the
language as well as ability to read and write code)
○ Beginning-level knowledge of SQL (ability to understand and construct
basic queries)
Learning objectives
Duration: 1 hour
Course description: Databricks Machine Learning offers data scientists and other
machine learning practitioners a platform for completing and managing the
end-to-end machine learning lifecycle. This course guides practitioners through a
basic machine learning workflow using Databricks Machine Learning. Along the way,
students will learn how each of Databricks Machine Learning’s features better enable
data scientists and machine learning engineers to complete their work effectively
and efficiently.
● Prerequisites
○ Beginning-level knowledge of the Databricks Lakehouse platform
○ Intermediate-level knowledge of Python
○ Intermediate-level knowledge of machine learning workflows
Learning objectives
Duration: 1 hour
Course description: This course is an introductory course for SQL analysts that
demonstrates the entire data analysis process on Databricks SQL, from introducing
the Databricks SQL workspace (Workspace) to creating a dashboard. The course will
focus on what analysts can do, as opposed to what administrators can do, and it will
use the Workspace without administrator permissions.
● Prerequisites
○ Beginning knowledge of SQL
○ Access to Databricks SQL
○ Access to an empty database set up by an administrator
○ Access to a SQL endpoint set up by an administrator
Learning objectives
Course description: Learn the basics of Google Cloud and how to configure various
resources using the Cloud Console. This course begins with an overview of the
platform, key terminology, and core services. You will then learn essential IAM
concepts and how service accounts can be used to manage resources. You will also
learn about the function and use cases of several storage services, such as Cloud
Storage, Cloud SQL, and BigQuery. This course also covers virtual machine and
networking concepts in Compute Engine and VPC services. The course ends with an
overview of GKE clusters and Kubernetes concepts.
Prerequisites:
Learning objectives:
● Define basic concepts and core services in the Google Cloud Platform.
● Describe IAM concepts and how service accounts can be used to manage
resources.
● Identify use cases for storage services, such as Cloud Storage, Cloud SQL,
and BigQuery.
● Define virtual machine and networking concepts in Compute Engine and VPC
services.
● Describe Google Kubernetes Engine and the core components of Kubernetes
clusters.
Duration: 30 minutes
Course description: Before an analyst can analyze data, that data needs to be
ingested into the Lakehouse. This course shows three different ways to ingest data: 1.
Using the Data Science & Engineering UI, 2. Using SQL, and 3. Using Partner Connect.
● Prerequisites
○ Intermediate knowledge of Databricks SQL
○ Administrator privileges
Learning objectives
Duration: 1 hour
Course description: In this course, you will explore how Apache Spark executes a
series of queries. Examples will include simple narrow transformations and more
complex wide transformations.
This course will give developers a working understanding of how to write code that
leverages the power of Apache Spark for even the simplest of queries.
Prerequisites:
● Familiarity with basic information about Apache Spark (what it is, what it is
used for)
Learning objectives:
● Explain how Apache Spark applications are divided into jobs, stages, and
tasks.
● Explain the major components of Apache Spark's distributed architecture.
Duration: 1 hour
Course description: Linear modeling is a popular starting point for machine learning
studies for a number of reasons. Generally, these models are relatively easy to
interpret and explain, and they can be applied to a broad range of problems. In this
course, you will learn how to choose, apply, and evaluate commonly used linear
modeling techniques. As you work through the course, you can put your new skills into practice in five hands-on labs.
● Prerequisites
○ Intermediate experience with machine learning (experience using
machine learning and data science libraries like scikit-learn and
Pandas, knowledge of linear models).
○ Intermediate experience using the Databricks Workspace to perform
data analysis (using Spark DataFrames, Databricks notebooks, etc.).
○ Beginning experience with statistical concepts commonly used in data
science.
Learning objectives
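To make the linear modeling workflow concrete, here is a small, self-contained scikit-learn sketch on synthetic data (it is not taken from the course labs):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on two of three features, plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a linear model and evaluate it on held-out data.
model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", round(r2_score(y_test, model.predict(X_test)), 3))
print("Learned coefficients:", model.coef_)
```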
Course description: In this course you’ll learn, both in theory and in practice, about
statistical techniques that are fundamental to many data science projects.
Throughout the course, videos will guide you through the conceptual information
you need to know about these statistical concepts, and hands-on lab activities will
give you the chance to apply the concepts you learn using the Databricks
Workspace. This course is divided into three modules: Introduction to Statistics and
Probability, Probability Distributions, and Applying Statistics to Learn from Data.
● Prerequisites
○ Beginning experience using the Databricks Data Science Workspace
(familiarity with Spark SQL, experience importing files into the
Databricks Data Science Workspace)
○ Beginning experience using Python (ability to follow guided use of the
SciPy library)
Learning objectives
Duration: 3 hours
Course description: In this course, you’ll learn how to solve complex supervised
learning problems using tree-based models. First, we’ll explain how decision trees
can be used to identify complex relationships in data. Then, we’ll show you how to
develop a random forest model to build upon decision trees and improve model
generalization. Finally, we’ll introduce you to various techniques that you can use to
account for class imbalances in a dataset. Throughout the course, you’ll have the
opportunity to practice concepts learned in hands-on labs.
Prerequisites:
Learning objectives:
● Describe how decision trees are used to solve supervised learning problems.
● Identify complex relationships in data using decision trees.
● Develop a random forest model to build upon decision trees and improve
model generalization.
● Employ common techniques to account for class imbalances in a dataset.
Duration: 3 hours
Course description: In this course, we will describe and demonstrate how to learn
from data using unsupervised learning techniques during exploratory data analysis.
The course is divided into two sections: one focuses on K-means clustering, and the other describes principal components analysis, commonly
referred to as PCA. Each section includes demonstrations of important concepts, a
quiz to solidify your understanding, and a lab to practice your skills.
Prerequisites:
● Intermediate experience with machine learning (experience using machine
learning and data science libraries like scikit-learn and Pandas, knowledge of
linear models)
● Intermediate experience using the Databricks Workspace to perform data
analysis (using Spark DataFrames, Databricks notebooks, etc.)
● Beginning experience with machine learning concepts.
Learning objectives:
Duration: 30 minutes
Course description: The addition of clone to Delta Lake empowers data engineers
and administrators to easily replicate data stored in the Lakehouse. Organizations
can use deep clone to archive versions of their production tables for regulatory
compliance. Developers can easily create development datasets isolated from
production data with shallow clone. In this course, you’ll learn the basics of cloning
with Delta Lake and get hands-on experience working with the syntax.
Prerequisites:
Learning objectives:
● Describe the basic execution of deep and shallow clones with Delta Lake.
● Use deep clones to create full incremental backups of tables.
● Use shallow clones to create development datasets.
● Describe strengths, limitations, and caveats of each type of clone.
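A hedged sketch of both clone types described above (table names are hypothetical; assumes Databricks with Delta Lake and a predefined `spark`):

```python
# Deep clone: a full, independent copy of data and metadata. Re-running the
# statement synchronizes the backup incrementally.
spark.sql("CREATE OR REPLACE TABLE prod.events_backup DEEP CLONE prod.events")

# Shallow clone: copies only metadata and references the source's data files,
# which makes it a cheap, isolated development dataset.
spark.sql("CREATE OR REPLACE TABLE dev.events_dev SHALLOW CLONE prod.events")
```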
Introduction to Databricks Connect
Click here for the customer enrollment link.
Duration: 40 minutes
Prerequisites:
Learning objectives:
Duration: 30 minutes
Course description: Repos aims to make Databricks simple to use by giving data
scientists and engineers the familiar tools of git repositories and file systems. These
tools enable a more laptop-like developer experience for customers. Repos is the
new, top-level, customer-facing feature that packages these tools together in the
Databricks user interface. This course teaches how to get started with Repos.
Prerequisites:
Course description: Delta Lake is a powerful tool created by Databricks. Delta Lake
is an open, reliable, performant and secure data storage and management layer for
your data lake that enables you to create a true single source of truth. Since it is
built upon Apache Spark, you’re able to build high performance data pipelines to
clean your data from raw ingestion to business-level aggregates. Finally, given the open format, it allows you to avoid unnecessary replication and proprietary lock-in. Ultimately, it provides the reliability, performance, and security you need to serve your downstream data use cases.
Prerequisites:
Learning objectives:
Duration: 30 minutes
Course description: Delta Live Tables enables data teams to innovate rapidly with
simple development, using declarative tools to build and manage batch or streaming
data pipelines. Built-in quality controls and data quality monitoring ensure accurate
and useful BI, Data Science, and ML built on top of quality data. Delta Live Tables is
designed to scale with rapidly growing companies and provides clear observability
into pipeline operations and automatic error handling. This course will cover the
basics of this new product, including syntax, configuration, and deployment.
Prerequisites:
Learning objectives:
In this course, you’ll learn how to perform both of these tasks. This course is divided
into two modules - in the first, you’ll explore feature engineering. In the second, you’ll
explore feature selection. Both modules will start with an introduction to these
topics - what they are and why they’re used. Then, you’ll review techniques that help
data practitioners perform these tasks. Finally, you’ll have the chance to perform two
hands-on lab activities - one where you will engineer features and another where
you will select features for a fictional machine learning scenario.
Prerequisites:
Learning objectives:
Duration: 30 minutes
Prerequisites:
Learning objectives:
Introduction to Hyperparameter
Optimization
Click here for the customer enrollment link.
Duration: 2 hours
Course description: In this course, you’ll learn how to apply hyperparameter tuning
strategies to optimize machine learning models for unseen data. First, you’ll work
within a balanced binary classification problem setting where you’ll use random
forest to predict the correct class. You’ll learn to tune the hyperparameters of a
random forest to improve a model. Then, you’ll again work with a binary classification
problem using random forest and a technique known as cross-validation to
generalize a model.
Prerequisites:
Learning objectives:
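As a hedged, self-contained sketch of these ideas (synthetic data and scikit-learn, rather than any specific course lab):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A balanced synthetic binary classification problem.
X, y = make_classification(n_samples=500, n_features=10, weights=[0.5, 0.5],
                           random_state=42)

# Tune two random forest hyperparameters with 5-fold cross-validation so the
# chosen settings generalize beyond a single training split.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```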
Introduction to Jobs
Click here for the customer enrollment link.
Duration: 30 minutes
Prerequisites:
Learning objectives:
● Describe jobs and motivations for using jobs in the workflow of data
practitioners.
● Create single task jobs with a scheduled trigger.
● Orchestrate multiple notebook tasks with the Jobs UI.
● Discuss common use cases and patterns for Jobs.
Introduction to MLflow Model Registry
Click here for the customer enrollment link.
Duration: 30 minutes
Course description: This course will introduce you to MLflow Model Registry. Model
Registry is a centralized model management tool that allows you to track metrics,
parameters, and artifacts as part of experiments, package models and reproducible
ML projects, and deploy models to batch or real-time serving platforms. You will
learn how your team can use Model Registry as a central place to share ML models,
collaborate on moving them from experimentation to testing and production, and
implement approval and governance workflows.
Prerequisites:
Learning objectives:
Duration: 1 hour
● Prerequisites
○ Experience developing machine learning models in scikit-learn
○ Experience and comfort using Python and data science libraries (e.g., writing functions, using attributes and methods, instantiating classes, basic file I/O with Pandas)
○ Comfort building classification and regression models in scikit-learn
Learning objectives
Duration: 30 minutes
Course description: After a recap of single-task jobs, as well as the directed acyclic graph (DAG) model, you will learn how to create, trigger, or schedule multi-task jobs
in Databricks.
Prerequisites:
Learning objectives:
Duration: 4 hours
Course description: This course will introduce you to natural language processing
with Databricks. You will learn how to generate
term-frequency-inverse-document-frequency (TFIDF) vectors for your datasets
and how to perform latent semantic analysis using the Databricks Machine Learning
Runtime.
Prerequisites:
Learning objectives:
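One hedged way to build TFIDF vectors with the Spark ML feature APIs is sketched below, on a tiny synthetic corpus; it assumes a Databricks notebook with `spark` predefined.

```python
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# A tiny, synthetic corpus.
docs = spark.createDataFrame(
    [(0, "spark makes big data simple"),
     (1, "delta lake makes data reliable")],
    ["id", "text"])

# Tokenize, hash term frequencies, then weight by inverse document frequency.
tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1024).transform(tokens)
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)
tfidf.select("id", "tfidf").show(truncate=False)
```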
Introduction to Photon
Click here for the customer enrollment link.
Duration: 30 minutes
Course description: In this course, you’ll learn how Photon can be used to reduce
Databricks total cost of ownership (TCO) and dramatically improve query
performance. You’ll also learn best practices for when to use and not use Photon. Finally, the course will include a demonstration of a query run with and without Photon to show the improvement in query performance.
Prerequisites:
● Administrator privileges
● Introductory knowledge about the Databricks Lakehouse Platform (what the
Databricks Lakehouse Platform is, what it does, main components, etc.)
Learning objectives:
Duration: 1 hour
● Prerequisites
○ Familiarity with SQL
Learning objectives
Duration: 6 hours
NOTE: This is an e-learning version of the Just Enough Python for Apache Spark
instructor-led course. It is an on-demand recording available via the Databricks
Academy and covers the same content as the instructor-led course. For more
information about what’s in the course itself, please visit this link.
Duration: 3 hours
Prerequisites:
Learning objectives:
● Identify the core components of Delta Lake that make a Lakehouse possible.
● Define commonly used optimizations available in Delta Engine.
● Build end-to-end batch and streaming OLAP data pipelines using Delta Lake.
● Make data available for consumption by downstream stakeholders using
specified design patterns.
● Document data at the table level to promote data discovery and cross-team
communication.
● Apply Databricks’ recommended best practices in engineering a single source
of truth Delta architecture.
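A minimal hedged sketch of one incremental hop in such a pipeline (the table names and checkpoint path are hypothetical; assumes Databricks with Delta Lake and a predefined `spark`):

```python
from pyspark.sql import functions as F

# Incrementally read new records from a bronze Delta table, apply a simple
# quality filter, and append the results to a silver table.
(spark.readStream
      .table("bronze_events")
      .where(F.col("event_type").isNotNull())
      .writeStream
      .option("checkpointLocation", "/tmp/checkpoints/silver_events")
      .trigger(once=True)  # process what's available, then stop
      .toTable("silver_events"))
```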
Duration: 30 minutes
Course description: This course will enable experienced SAS developers to quickly
learn how to translate familiar SAS statements and functions into code that can be
run on Databricks. It begins with an introduction to the Databricks environment and
the different approaches to coding in Databricks, followed by an overview of how
SAS PROC and DATA steps can be performed in Databricks. You will learn about how
you can use Spark SQL, PySpark, and other tools to read .sas7bdat files and perform
common operations. Finally, you will see code examples and gain hands-on practice
performing some of the most common SAS operations in Databricks.
Prerequisites:
Learning objectives:
● Read data stored in .sas7bdat files using Spark SQL and PySpark.
● Explain the conceptual and syntactical relationships between SAS DATA and PROC statements and their counterparts on Databricks.
● Leverage Python to augment ANSI SQL to create reusable Spark SQL code.
● Translate common PROC functions to Databricks.
● Translate common DATA steps to Databricks.
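One hedged approach (not necessarily the course's exact method) to reading a .sas7bdat file into Spark is to load it with pandas and convert to a Spark DataFrame for SQL access. The file path is hypothetical; `spark` is assumed predefined.

```python
import pandas as pd

# pandas ships a native SAS reader; no SAS installation is required.
pdf = pd.read_sas("/dbfs/tmp/claims.sas7bdat", format="sas7bdat")

# Convert to a Spark DataFrame and expose it to Spark SQL.
df = spark.createDataFrame(pdf)
df.createOrReplaceTempView("claims")
spark.sql("SELECT COUNT(*) AS n FROM claims").show()
```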
Natural Language Processing at Scale with
Databricks
Click here for the customer enrollment link.
Duration: 5 hours
Course description: This five-hour course will teach you how to do natural language
processing at scale on Databricks. You will apply libraries such as NLTK and Gensim
in a distributed setting as well as SparkML/MLlib to solve classification, sentiment
analysis, and text wrangling tasks. You will learn how to remove stop words, when to
lemmatize vs stem your tokens, and how to generate
term-frequency-inverse-document-frequency (TFIDF) vectors for your dataset. You
will also use dimensionality reduction techniques to visualize word embeddings with
Tensorboard and apply and visualize basic vector arithmetic to embeddings.
Prerequisites:
Learning objectives:
Duration: 1 hour
Course description: In this course, learners will practice using the Databricks Feature
Store. From creating and updating feature store tables to searching across the
Feature Store, functionality is accessible through Databricks notebooks and Jobs.
Feature Store enables data practitioners to share and discover features across their
organization, as well as ensure that the same feature computation code is used for
model training and inference.
● Prerequisites
○ Creating models with scikit-learn or MLlib
○ Hardening for security concerns like handling data in flight, CORS or
SQL injection
○ API architecture beyond REST (e.g. SOAP or graph models will not be
discussed)
○ Optimizing clusters for serving (e.g. latency, SLAs, and throughput
concerns)
○ How the MLflow Model Registry works (ideally, the learner can already log models to the registry)
○ Monitoring model drift and performance
Learning objectives
Duration: 12 hours
Duration: 1 hour
Prerequisites:
Learning objectives:
● Describe how Delta Change Data Feed emits change data records.
● Use appropriate syntax and settings to set up Change Data Feed.
● Propagate inserts, updates, and deletes with Change Data Feed.
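A hedged sketch of the Change Data Feed objectives above (the table name is hypothetical; assumes Databricks with Delta Lake and a predefined `spark`):

```python
# Enable the change feed on an existing Delta table.
spark.sql("""
    ALTER TABLE customers
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read the change records (inserts, updates, deletes) emitted since version 1.
changes = (spark.read.format("delta")
               .option("readChangeFeed", "true")
               .option("startingVersion", 1)
               .table("customers"))
changes.select("_change_type", "_commit_version").show()
```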
Duration: 30 minutes
Learning objectives:
Duration: 30 minutes
Course description: Apache Spark™ is a unified analytics engine for large scale data
processing known for its speed, ease and breadth of use, ability to access diverse
data sources, and APIs built to support a wide range of use cases. Databricks builds
on top of Spark and adds many performance and security enhancements. This
course is meant to provide an overview of Spark’s internal architecture.
Prerequisites:
Learning objectives:
Duration: 3 hours
Course description: In this course, learners integrate machine learning solutions with
scalable production pipelines backed by Apache Spark. Learners will start by
investigating common inefficiencies in machine learning. Next, students will learn to
scale the development and tuning of the machine learning workflow using tools like
Spark ML and Hyperopt. Finally, learners will finish by using Pandas UDFs and the
Pandas Function APIs to create and apply group-specific machine learning models.
By the end of this course, learners will be capable of scaling the entirety of a
machine learning pipeline.
Prerequisites:
Learning objectives:
NOTE: This is an e-learning version of the Scalable Machine Learning with Apache
Spark instructor-led course. It is an on-demand recording available via the
Databricks Academy and covers the same content as the instructor-led course. For
more information about what’s in the course itself, please visit this link.
Duration: 1 hour
Course description: Taking this course will familiarize you with the content and
format of the Associate SQL Analyst Accreditation, as well as provide you with some
practical exercises that you can use to improve your skills or cement newly learned
concepts. We recommend that you complete Fundamentals of SQL on Databricks
and Applications of SQL on Databricks before using this guide.
Prerequisites:
Learning objectives:
Structured Streaming
Click here for the customer course enrollment link.
Duration: 1 hour
Learning objectives:
Duration: 2 hours
Course description: In this course, we’ll show you how to design an MLflow experiment to identify the best machine learning model for deployment. This course is the
second in a series of three courses developed to show you how to use Databricks to
work with a single data set from experimentation to production-scale machine
learning model deployment. The other courses in this series include:
Prerequisites:
Learning objectives:
● Create and explore an augmented sample from user event and profile data.
● Design an MLflow experiment and write notebook-based software to run the
experiment to assess various linear models.
● Examine experimental results to decide which model to develop for
production.
Duration: 1 hour
Course description: Whether your organization is moving to the cloud for the first
time or reevaluating its current approach, making decisions about the technology
used when storing your data can have huge implications for costs and performance
in downstream analytics. As a platform focused on computation and analytics,
Databricks seeks to help our customers make choices that unlock new
opportunities, reduce redundancies, and connect data teams. In this course, you’ll
start by exploring the characteristics of data lakes and data warehouses, two
popular data storage technologies. Then, you’ll learn about the Lakehouse, a new
data storage system invented and made popular by Databricks.
Prerequisites:
Learning objectives:
● Describe the strengths and limitations of data lakes, related to data storage.
● Describe the strengths and limitations of data warehouses, related to data
storage.
● Contrast data lake and data warehouse characteristics.
● Compare the features of a Lakehouse to the features of popular data storage
management solutions.
What is Big Data?
Click here for the customer course enrollment link.
Duration: 1 hour
Course description: This course was created for individuals who are new to the big
data landscape and want to become conversant with big data terminology. It will
cover foundational concepts related to the big data landscape including:
characteristics of big data; the relationship between big data, artificial intelligence,
and data science; how individuals on data science teams work with big data; and
how organizations can use big data to enable better business decisions.
Prerequisites:
Learning objectives:
Duration: 30 minutes
Prerequisites:
Learning objectives:
Duration: 30 minutes
Course description: Databricks Machine Learning offers data scientists and other
machine learning practitioners a platform for completing and managing the
end-to-end machine learning lifecycle. This course guides business leaders and
practitioners through a basic overview of Databricks Machine Learning, the benefits
of using Databricks Machine Learning, its fundamental components and
functionalities, and examples of successful customer use.
Prerequisites:
Learning objectives:
Duration: 30 minutes
Course description: Databricks SQL offers SQL users a platform for querying,
analyzing, and visualizing data in their organization’s Lakehouse. This course explains
how Databricks SQL processes queries and guides users through how to use the
interface. Then, this course will explain how you can connect Databricks SQL to
your favorite business intelligence tool, so that you can query your Lakehouse
without making changes to your analytical and dashboarding workflows.
Prerequisites:
● None.
Learning objectives:
Duration: 30 minutes
Course description: Delta Lake is an open format storage layer that sits on top of
your organization’s data lake. It is the foundation of a cost-effective, highly scalable
Lakehouse and is an integral part of the Databricks Lakehouse Platform.
In this course, we’ll break down the basics behind Delta Lake - what it does, how it works, and why it is valuable, from a business perspective, to any organization with big data and AI projects.
Prerequisites:
Learning objectives:
● Describe how Delta Lake fits into the Databricks Lakehouse Platform.
● Explain the four elements encompassed by Delta Lake.
● Summarize high-level Delta Lake functionality that helps organizations solve
common challenges related to enterprise-scale data analytics.
● Articulate examples of how organizations have employed Delta Lake on
Databricks to improve business outcomes.
Duration: 1 hour
Course description: In this course you’ll learn fundamental concepts about machine
learning. First, we’ll review machine learning basics - what it is, why it’s used, and
how it relates to data science. Then, we’ll explore the two primary categories of machine learning problems: supervised and unsupervised
learning. Finally, we’ll review how the machine learning workflow fits into the data
science process.
● Prerequisites
○ Beginning knowledge about concepts related to the big data landscape is helpful but not required (e.g., big data types, analysis techniques, processing techniques)
○ We recommend taking the Databricks Academy course "Introduction to
Big Data" before taking this course.
Learning objectives
● Explain how machine learning is used as an analysis tool in data science.
● Summarize the relationship between the data science process and the
machine learning workflow.
● Describe the two primary categories that machine learning problems are
categorized into.
● Describe popular machine learning techniques within the two primary
categories of machine learning.
● Determine the machine learning technique that should be used to analyze
data in a given real-world scenario.
Duration: 1 hour
● Prerequisites
○ Beginning knowledge about the Databricks Unified Data Analytics
Platform (what it is, what it is used for)
○ Beginning knowledge about concepts related to the big data landscape
(for example: structured streaming, batch processing, data pipelines)
○ Note: We recommend taking the following two Databricks Academy
courses to help you prepare for this course: Fundamentals of Big Data
and Fundamentals of Unified Data Analytics with Databricks.
Learning objectives
● Explain the benefits of Structured Streaming for working with streaming data.
● Distinguish where Structured Streaming fits into an organization’s big data
ecosystem.
● Articulate examples of real-world business use cases for Structured
Streaming.
Duration: 30 minutes
Course description: This course is designed for everyone who is brand new to the Databricks Lakehouse Platform and wants to learn more about what it is, why it was developed, what it
does, and the components that make it up.
Our goal is that by the time you finish this course, you’ll have a better understanding
of the Platform in general and be able to answer questions like: What is Databricks?
Where does Databricks fit into my workflow? How have other customers been
successful with Databricks?
NOTE: This course does not contain hands-on practice with the Databricks
Lakehouse Platform.
Prerequisites:
Learning objectives:
Duration: 30 minutes
Course description: This course was created to teach Databricks users about the
major improvements to Spark in the 3.0 release. It will give an overview of new
features meant to improve performance and usability. Students will also learn about
backwards compatibility with 2.x and some of the considerations required for
updating to Spark 3.0.
Prerequisites:
Learning objectives:
Credential descriptions
Duration: 2 hours
Certification exam description: The Azure Databricks Certified Associate Platform
Administrator certification exam assesses an understanding of network
infrastructure and security with Databricks, including workspace deployment, Azure
cloud concepts, and network security. The exam also assesses the understanding of
identity and access on Azure Databricks, including identity management, workspace
access control, data access control, and fine-grained security. In addition, the exam
assesses cluster configuration and usage management. Lastly, developer tools and
automation processes are assessed.
Prerequisites:
Duration: 2 hours
Prerequisites:
It is expected that developers who have been using the Spark DataFrame API for six months or more should be able to pass this certification exam.
While it will not be explicitly tested, the candidate must have a working
knowledge of either Python or Scala. The exam is available in both languages.
Price: $150
Prerequisites:
● Understand how to use the Databricks Lakehouse Platform and its tools, and the benefits of using them, including:
○ Data Lakehouse (architecture, descriptions, benefits)
○ Data Science and Engineering workspace (clusters, notebooks, data
storage)
○ Delta Lake (general concepts, table management and manipulation,
optimizations)
● Build ETL pipelines using Apache Spark SQL and Python, including:
○ Relational entities (databases, tables, views)
○ ELT (creating tables, writing data to tables, cleaning data, combining
and reshaping tables, SQL UDFs)
○ Python (facilitating Spark SQL with string manipulation and control flow,
passing data between PySpark and Spark SQL)
● Incrementally process data, including:
○ Structured Streaming (general concepts, triggers, watermarks)
○ Auto Loader (streaming reads)
○ Multi-hop Architecture (bronze-silver-gold, streaming applications)
○ Delta Live Tables (benefits and features)
● Build production pipelines for data engineering applications and Databricks
SQL queries and dashboards, including:
○ Jobs (scheduling, task orchestration, UI)
○ Dashboards (endpoints, scheduling, alerting, refreshing)
● Understand and follow best security practices, including:
○ Entity Permissions (team-based permissions, user-based permissions)
Individuals who pass this certification exam can be expected to complete data
engineering tasks using Databricks and its associated tools.
Prerequisites:
● Understand how to use the Databricks platform and its tools, and the benefits of using them, including:
○ Platform (notebooks, clusters, Jobs, Databricks SQL, relational entities,
Repos)
○ Apache Spark (PySpark, DataFrame API, basic architecture)
○ Delta Lake (SQL-based Delta APIs, basic architecture, core functions)
○ Databricks CLI (deploying notebook-based workflows)
○ Databricks REST API (configure and trigger production pipelines)
● Build data processing pipelines using the Spark and Delta Lake APIs, including:
○ Building batch-processed ETL pipelines
○ Building incrementally processed ETL pipelines
○ Optimizing workloads
○ Deduplicating data
○ Using Change Data Capture (CDC) to propagate changes
● Model data management solutions, including:
○ Lakehouse (bronze/silver/gold architecture, databases, tables, views,
and the physical layout)
○ General data modeling concepts (keys, constraints, lookup tables,
slowly changing dimensions)
● Build production pipelines using best practices around security and
governance, including:
○ Managing notebook and jobs permissions with ACLs
○ Creating row- and column-oriented dynamic views to control
user/group access
○ Securely storing personally identifiable information (PII)
○ Securely deleting data as requested, in accordance with GDPR and CCPA
● Configure alerting and storage to monitor and log production jobs, including:
○ Setting up notifications
○ Configuring SparkListener
○ Recording logged metrics
○ Navigating and interpreting the Spark UI
○ Debugging errors
● Follow best practices for managing, testing and deploying code, including:
○ Managing dependencies
○ Creating unit tests
○ Creating integration tests
○ Scheduling Jobs
○ Versioning code/notebooks
○ Orchestrating Jobs
It is expected that candidates with at least 1-2 years of experience in data engineering with Databricks should be able to pass this exam.
Duration: 2 hours
Prerequisites:
Duration: 30 minutes
This accreditation is the beginning step in most of the Databricks Academy learning
plans - SQL Analysts, Data Scientists, Data Engineers, and Platform Administrators.
Business leaders are also welcome to take this assessment.
Prerequisites:
● We recommend that you take the following courses to prepare for this
accreditation exam:
○ What is the Databricks Lakehouse Platform?
○ What are Enterprise Data Management Systems? (particularly the
section on Lakehouse architecture)
○ What is Delta Lake?
○ What is Databricks SQL?
○ What is Databricks Machine Learning?
Duration: 1 hour
Prerequisites: