O'Reilly Data Lake Bootcamp Day 1
Mohit Batra
Microsoft Certified Trainer
09/2023
linkedin.com/in/mohitbatra/
Learning Objectives
Day 1
● Azure Subscription
● Basics of Microsoft Azure
Introduction to Data Lake
What is Data Warehouse?
Challenges with Data Warehouse
Data Lake
[Figure: data lake diagram. Surviving labels: Relational Sources, Data Science]
Data Lake – What is it?
Single source of truth holding vast amounts of raw data in its native format
Medallion Architecture
Design Considerations
● Centralized or decentralized
● File Format
● Organizing Data Files
● Performance, Scalability & High Availability
● Data Security
● Monitoring
● Capturing Metadata
Design Considerations
(1) Centralized or Decentralized?
Centralized
● Easily manage security, apply governance rules, manage data lifecycle
Decentralized
● Apply separate policies to different lakes
● Faster development
● Better usage & cost tracking
Design Considerations
(3) Organizing Data Files based on…
Design Considerations
(3) Organizing Data Files
Design Considerations
(4) Performance, Scalability & High Availability
Design Considerations
(5) Security
Access
Personas
● Bronze layer – Access should be limited to Data Engineers
● Silver layer – Complete access to Data Engineers, and read-access to Data Scientists/ML Engineers
● Gold layer – Complete access to Data Engineers, and read-access to others (Business Users)
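The persona rules above can be sketched as a small lookup table. This is a minimal illustration in Python, not an Azure API: the persona keys, the `LAYER_ACCESS` table, the `"*"` wildcard for "others", and the `can_access` helper are all hypothetical, and "complete access" is modeled simply as read plus write. In a real lake these rules would be enforced by the storage service's own access controls.

```python
READ, WRITE = "read", "write"

# Hypothetical access matrix mirroring the three bullets above.
# "*" stands for "others", i.e. any persona not listed explicitly.
LAYER_ACCESS = {
    "bronze": {"data_engineer": {READ, WRITE}},
    "silver": {
        "data_engineer": {READ, WRITE},
        "data_scientist": {READ},
        "ml_engineer": {READ},
    },
    "gold": {
        "data_engineer": {READ, WRITE},
        "*": {READ},
    },
}


def can_access(persona: str, layer: str, action: str) -> bool:
    """Return True if the persona may perform the action on the layer."""
    rules = LAYER_ACCESS.get(layer, {})
    # Fall back to the wildcard entry, then to "no access".
    allowed = rules.get(persona, rules.get("*", set()))
    return action in allowed


print(can_access("data_scientist", "silver", READ))  # True
print(can_access("data_scientist", "bronze", READ))  # False
print(can_access("business_user", "gold", WRITE))    # False
```

Keeping the rules in one table makes it easy to review who can touch each layer before translating them into actual permissions.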
Design Considerations
(6) Monitoring
Design Considerations
(7) Capturing Metadata
Products / Services
● Data Sources: Azure SQL, On-prem DBs, Azure Event Hubs, Kafka
● Data Lake: HDFS, Azure Data Lake, AWS S3, GCS
● Data Science / ML: Azure ML, Databricks
● Reporting Tools: Power BI, Tableau
Using Azure Data Lake Gen2 to build Data Lake
Options to build Data Lake in Azure
Performance Tiers
● Standard
● Premium
Standard Performance Tier – Access Tiers
Tier      When to use?                 Minimum storage duration   Latency (time to first byte)
Hot       Read & write frequently      N/A                        milliseconds
Cool      Infrequently accessed data   30 days                    milliseconds
Archive   Archival                     180 days                   Blob rehydration: Standard – up to 15 hours; High – up to 1 hour (for 10 GB)
Reference: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers
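The minimum storage durations in the access tier table can be turned into a quick check. The sketch below assumes the three columns correspond to Azure's Hot, Cool and Archive tiers; the `AccessTier` class, the `TIERS` table, and the `days_short_of_minimum` helper are hypothetical names for illustration and do not call any Azure API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessTier:
    name: str
    use_case: str
    min_storage_days: Optional[int]  # None means no minimum (N/A)
    needs_rehydration: bool          # True if data must be rehydrated before reading


# Values copied from the access tier table; tier names are an assumption.
TIERS = {
    "hot": AccessTier("Hot", "Read & write frequently", None, False),
    "cool": AccessTier("Cool", "Infrequently accessed data", 30, False),
    "archive": AccessTier("Archive", "Archival", 180, True),
}


def days_short_of_minimum(tier: str, days_stored: int) -> int:
    """Days still needed before the tier's minimum storage duration is met."""
    minimum = TIERS[tier].min_storage_days or 0
    return max(0, minimum - days_stored)


print(days_short_of_minimum("cool", 10))      # 20
print(days_short_of_minimum("archive", 200))  # 0
print(days_short_of_minimum("hot", 1))        # 0
```

A non-zero result means the data would leave the tier before its minimum storage duration, which is worth checking before planning a lifecycle rule.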
Redundancy Options: Primary Region Only
Option                              Regions   Datacenters used   Benefit
LRS (locally redundant storage)     1         1                  Protects against failure in server racks
ZRS (zone-redundant storage)        1         3                  Protects against datacenter failures
Redundancy Options: Multiple Regions
Option                               Regions   Datacenters used   Benefit
GRS (geo-redundant storage)          2         2                  Protects against regional failures
GZRS (geo-zone-redundant storage)    2         4                  Protects against regional & datacenter failures
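The two redundancy slides can be combined into one lookup. The sketch below assumes the four columns are Azure's LRS, ZRS, GRS and GZRS options, and copies the region and datacenter counts and the stated benefits from the slides; the `RedundancyOption` class and the `pick_redundancy` helper are hypothetical and only illustrate how the choice follows from the failures you need to survive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyOption:
    name: str
    regions: int
    datacenters: int
    survives_datacenter_failure: bool
    survives_regional_failure: bool


# Counts and benefits as listed on the slides; option names are an assumption.
OPTIONS = [
    RedundancyOption("LRS", 1, 1, False, False),
    RedundancyOption("ZRS", 1, 3, True, False),
    RedundancyOption("GRS", 2, 2, False, True),
    RedundancyOption("GZRS", 2, 4, True, True),
]


def pick_redundancy(need_datacenter: bool, need_regional: bool) -> str:
    """Return the first option, in table order, that covers the required failures."""
    for option in OPTIONS:
        datacenter_ok = option.survives_datacenter_failure or not need_datacenter
        regional_ok = option.survives_regional_failure or not need_regional
        if datacenter_ok and regional_ok:
            return option.name
    raise ValueError("No option satisfies the requirements")


print(pick_redundancy(False, False))  # LRS
print(pick_redundancy(True, False))   # ZRS
print(pick_redundancy(False, True))   # GRS
print(pick_redundancy(True, True))    # GZRS
```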
Exercise
● Instructions @ https://github.com/Crystal-Talks/DataLakeBootcamp_2Days
Introduction to Apache Spark and Azure Databricks
Apache Spark
Spark Architecture
[Figure: Spark architecture diagram. Surviving labels: Driver Node; Task 3, Task 4; Part 1, Part 4; Output 1, Output 4]
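The architecture slide names a driver node, tasks, data partitions ("Part") and per-partition outputs. As a rough analogy only, the sketch below imitates that flow with Python's standard library: a driver function splits the data into partitions, runs one task per partition on a worker pool, and collects the outputs. This is not Spark code; `split_into_partitions`, `task` and `driver` are hypothetical names, and a thread pool stands in for the cluster's workers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split rows into num_partitions slices (round-robin)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def task(partition):
    """Work done on a single partition: here, sum its values."""
    return sum(partition)


def driver(rows, num_partitions=4):
    """Plan the work, run one task per partition, and collect the outputs."""
    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # map() returns results in partition order.
        return list(pool.map(task, partitions))


outputs = driver(list(range(1, 101)))
print(outputs)       # [1225, 1250, 1275, 1300]
print(sum(outputs))  # 5050
```

The point of the analogy is that each task only sees its own partition, so the partial outputs have to be combined afterwards to get the final answer.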
Exercise
● Instructions @ https://github.com/Crystal-Talks/DataLakeBootcamp_2Days
Extracting & Transforming Data from Multiple File Formats
DataFrames
● Higher-level Spark API to write code
[Figure: sample DataFrame with columns Id, Name, Salary; surviving row: 5, Neha, 32000; label: Column]
Exercise
● Work with multiple file formats like CSV, JSON and Parquet
● Instructions @ https://github.com/Crystal-Talks/DataLakeBootcamp_2Days
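As a starting point for the exercise, the sketch below shows one possible way to read CSV, JSON and Parquet files into DataFrames through Spark's `DataFrameReader` (`spark.read.format(...).options(...).load(...)`). It assumes `spark` is an existing SparkSession, as provided in Databricks notebooks; the `READ_OPTIONS` table, the `load_file` helper, and the file paths in the usage comments are hypothetical, and the actual exercise instructions in the repository may differ.

```python
# Per-format reader options (assumptions, adjust to the actual files).
READ_OPTIONS = {
    # Treat the first line as column names and let Spark guess column types.
    "csv": {"header": "true", "inferSchema": "true"},
    # Assumes each JSON record spans several lines; drop this for JSON Lines files.
    "json": {"multiLine": "true"},
    # Parquet files store their own schema, so no options are needed.
    "parquet": {},
}


def load_file(spark, path, fmt):
    """Read a file or folder into a DataFrame using the options for its format."""
    if fmt not in READ_OPTIONS:
        raise ValueError(f"Unsupported format: {fmt}")
    return spark.read.format(fmt).options(**READ_OPTIONS[fmt]).load(path)


# Usage in a notebook (placeholder paths):
# employees = load_file(spark, "/path/to/employees.csv", "csv")
# employees.select("Id", "Name", "Salary").show()
# orders = load_file(spark, "/path/to/orders.parquet", "parquet")
```

Routing every read through one helper keeps the format-specific options in a single place, which helps when the same pipeline ingests several file formats.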
Tomorrow…