
Data Lake Bootcamp

Building Reliable Data Lakes

Mohit Batra
Microsoft Certified Trainer

09/2023
Mohit Batra

● Founder, Crystal Talks


● Author
● Microsoft Certified Trainer
● Ex-Microsoft, Saxo Bank

linkedin.com/in/mohitbatra/
Learning Objectives

By the end of this course, you will understand:

● The concept and objective of the data lake

● Ways to organize data in a data lake

● The use of Spark/Databricks to process data

● Challenges with Data Lake

● Storing data reliably in Data Lake using Delta format


2-Day Agenda

Day 1

● Concept and objective of the Data Lake


● Ways to organize data in Data Lake
● Working with Spark/Databricks to process data in Data Lake
Day 2
● Challenges with Data Lake
● How Delta Lake brings reliability to your data lake
● How data can be managed using Spark tables
● How Delta Lake can be used to build warehouse-like features on Data Lake
Today’s Agenda – Day 1

● Introduction to Data Lake


● Using Azure Data Lake Gen2 to build Data Lake
● Introduction to Apache Spark and Azure Databricks
● Extracting and transforming data from multiple file formats
Prerequisites

● Azure Subscription
● Basics of Microsoft Azure
Introduction to
Data Lake
What is a Data Warehouse?

Relational Sources → Staging → Data Warehouse → BI / Reporting

● Centralized repository to store data in pre-defined structure

● Stores highly curated data in relational format


● Optimized for data retrieval
● Typically used by business analysts / users for BI & reporting purposes

8
Challenges with Data Warehouse

Relational Sources → Staging → Data Warehouse → BI / Reporting

New data sources: IoT Devices, Streaming, Files / Logs, Audio / Video

Velocity, Variety & Volume (3Vs) of data have significantly increased!

9
Data Lake

[Diagram: all sources – Relational Sources, IoT Devices, Streaming, Files / Logs, Audio / Video – feed the Data Lake; from there, data flows to the Data Warehouse for BI / Reporting, and is also used directly for Machine Learning and Data Science]

10
Data Lake – What is it?
Single source of truth holding vast amounts of raw data in its native format

● Store all types of data at large scale


● Supports storing streaming data at high speeds
● No need to define schema – store in native format
● Inexpensive storage compared to Data Warehouses

● Explore data first – before moving it to relational storage


● Use big data processing tools to enrich & curate data
● Supports multiple use cases

11
Medallion Architecture

Bronze Layer – Raw data
• Single source of truth
• Use this for reprocessing

Silver Layer – Cleaned & transformed data
• Store granular data
• Enrich data with more info
• Combine data from different files

Gold Layer – Aggregated / domain data
• Store business aggregates
• Optimized for querying
• Useful for BI/reporting
12
Design Considerations

● Centralized or decentralized
● File Format
● Organizing Data Files
● Performance, Scalability & High Availability

● Data Security
● Monitoring
● Capturing Metadata

13
Design Considerations
(1) Centralized or Decentralized?

Centralized Data Lake
• Easily manage security, apply governance rules, manage data lifecycle
• Better usage & cost tracking
• Impedes agile development
• Doesn’t work well for global applications

Region-specific / Customer-specific Data Lakes
• Apply separate policies to different lakes
• Faster development
• Management overhead
• If not governed properly, can turn into data swamps
14
Design Considerations
(2) File Format to use?

● Raw Data (Bronze layer)

○ Store in Native formats – CSV, JSON, Excel, XML etc.

● Enriched Data (Silver layer)


○ Store in Columnar formats – Parquet etc.

○ Provides file compression, stores metadata, faster retrieval

● Curated Data (Gold layer)


○ Depends on business, but many tools support formats like Parquet too

15
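To make the bronze-to-silver format change concrete, here is a minimal PySpark sketch (the paths and options are illustrative assumptions, not part of the course material): read raw CSV from the bronze layer and rewrite it as Parquet in the silver layer.

    # Minimal sketch: convert raw CSV (bronze) into Parquet (silver).
    # Paths below are hypothetical examples.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

    raw_df = (spark.read
                   .option("header", True)       # first row holds column names
                   .option("inferSchema", True)  # let Spark guess the data types
                   .csv("/mnt/datalake/bronze/orders/"))

    # Parquet keeps the schema, compresses the data, and is faster to scan
    raw_df.write.mode("overwrite").parquet("/mnt/datalake/silver/orders/")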
Design Considerations
(3) Organizing Data Files based on…

• Date / Time – Orders/Year/Month/Day/Hour, Sales/Year/Month/Day/Hour
• Dim model entities – Dim/Customers, Dim/Products, Fact/Orders, Fact/Sales
• Consumer – Sales/Customer, Sales/Product, Finance/Customer
• Region – Separate data lake accounts
• Confidentiality – PII/Customer/, Partner/Customer
• Usage – CurrentYear/Orders, PreviousYears/2021/Orders

16
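Spark can generate the date-based folder hierarchy shown above automatically. A hedged sketch, assuming the DataFrame already carries year/month/day columns (hypothetical names):

    # Writes files under .../orders/year=2022/month=11/day=22/...
    # Column names and the target path are assumptions for illustration.
    (orders_df.write
              .mode("append")
              .partitionBy("year", "month", "day")
              .parquet("/mnt/datalake/silver/orders/"))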
Design Considerations
(3) Organizing Data Files

Bronze Layer (Raw data):
Orders/2022/11/22/01/orders1.csv

Silver Layer (Cleaned & transformed data):
Fact/Orders/Partner1/2022/11/orders1.parquet

Gold Layer (Aggregated / domain data):
Finance/Orders/PartnerSummary/2022-11-22/summary_by_product.csv
Finance/Orders/PartnerSummary/2022-11-22/summary_by_customer.csv

17
Design Considerations
(4) Performance, Scalability & High Availability

● Setting this up on-premises requires planning & configuring these aspects

● Most cloud-based storage options provide this out-of-the-box


○ Azure Data Lake Store, Amazon S3, Google Cloud Storage

18
Design Considerations
(5) Security

Access

● Create groups & roles to provide role-based access control (RBAC)


● Use the principle of least-privilege
● Build your workflow to provide access to different assets

Personas
● Bronze layer – Access should be limited to Data Engineers
● Silver layer – Complete access to Data Engineers, and read-access to Data Scientists/ML Engineers
● Gold layer – Complete access to Data Engineers, and read-access to others (Business Users)

19
Design Considerations
(6) Monitoring

● Monitor data that is being read – by whom, at what frequency, etc.

● Capture latency in serving requests


● Capture logs
○ In a log database for near real-time querying

○ In file-based storage for long-term retention

20
Design Considerations
(7) Capturing Metadata

● Building a Catalog to capture asset information is extremely important!

● Catalog stores information about assets


○ Who owns it?
○ When was it created?
○ What attributes does it have?
○ Does it contain PII data?
○ Which domain / entity does it belong to?
○ …

21
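As a concrete (and purely hypothetical) illustration, a single catalog entry could capture these answers in a structure like the following; the field names are invented for this example and do not come from any specific catalog product.

    # Hypothetical catalog record for one data-lake asset
    asset_metadata = {
        "asset": "silver/orders",
        "owner": "data-engineering-team",
        "created_on": "2022-11-22",
        "attributes": ["order_id", "customer_id", "amount", "order_date"],
        "contains_pii": False,
        "domain": "Sales",
    }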
Products / Services

• Data Sources – Azure SQL, On-prem DBs, Azure Event Hubs, Kafka
• Ingestion Tools – Azure Data Factory, SSIS, Informatica
• Data Lake – HDFS, Azure Data Lake, AWS S3, GCS
• Processing Tools – Azure Synapse, Databricks, Azure Stream Analytics
• Reporting Tools – Power BI, Tableau
• Data Science / ML – Azure ML, Databricks

22
Using Azure Data
Lake Gen2 to
build Data Lake
Options to build Data Lake in Azure

Azure Storage
• Performance options
• Tiered options – Hot/cool/archive
• Global redundancy options

Azure Data Lake Gen1
• Compatible with webHDFS
• Allows nested folders (hierarchical namespace)
• Better performance
• Fine-grain permissions

Azure Data Lake Gen2
• Performance options
• Tiered options – Hot/cool/archive
• Global redundancy options
• Compatible with webHDFS
• Allows nested folders (hierarchical namespace)
• Better performance
• Fine-grain permissions

24
Performance Tiers

• Standard
• Premium

25
Standard Performance Tier – Access Tiers

Hot storage tier
• When to use: data that is read & written frequently
• Availability: 99.9%
• Charges: higher storage costs, lower access costs
• Minimum storage duration: N/A
• Blob rehydration latency (time to first byte): milliseconds

Cool storage tier
• When to use: infrequently accessed data
• Availability: 99%
• Charges: lower storage costs, higher access costs
• Minimum storage duration: 30 days
• Blob rehydration latency (time to first byte): milliseconds

Archive storage tier
• When to use: archival
• Availability: offline
• Charges: lowest storage costs, highest access costs
• Minimum storage duration: 180 days
• Blob rehydration latency (time to first byte): Standard priority – up to 15 hours; High priority – up to 1 hour (for 10 GB)

Reference: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers 26
Redundancy Options – Primary Region Only

Locally-redundant storage (LRS)
• Regions: 1
• Datacenters used: 1
• Copies: 3 – within the same datacenter
• Benefit: protects against failures in server racks

Zone-redundant storage (ZRS)
• Regions: 1
• Datacenters used: 3
• Copies: 3 – one copy in each datacenter
• Benefit: protects against datacenter failures

27
Redundancy Options – Multiple Regions

Geo-redundant storage (GRS)
• Regions: 2
• Datacenters used: 2
• Copies: 3 in primary & 3 in secondary
• Benefit: protects against regional failures

Geo-Zone-redundant storage (GZRS)
• Regions: 2
• Datacenters used: 4
• Copies: 3 in primary & 3 in secondary
• Benefit: protects against regional & datacenter failures

28
Exercise

● Create Azure Data Lake Gen2 account

● Set up permissions on the Data Lake account

● Configure data lifecycle

● Instructions @ https://github.com/Crystal-Talks/DataLakeBootcamp_2Days

● Data Files @ http://tinyurl.com/data-lake-bootcamp-2days

29
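For the "configure data lifecycle" step, Azure Storage lifecycle management rules are defined as JSON; the sketch below mirrors that structure as a Python dict so it stays in one language with the other examples. The rule name, prefix, and day thresholds are assumptions chosen to line up with the tier minimums on the access-tiers slide (30 days for cool, 180 for archive).

    # Sketch of one Azure Storage lifecycle rule, expressed as a Python dict
    # that mirrors the policy JSON. Values are illustrative.
    lifecycle_rule = {
        "enabled": True,
        "name": "age-out-raw-orders",
        "type": "Lifecycle",
        "definition": {
            "filters": {
                "blobTypes": ["blockBlob"],
                "prefixMatch": ["bronze/orders/"],
            },
            "actions": {
                "baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": 30},
                    "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                    "delete": {"daysAfterModificationGreaterThan": 365},
                }
            },
        },
    }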
Introduction to
Apache Spark and
Azure Databricks
Apache Spark

● Open-source compute engine to perform distributed data processing at scale
● Based on cluster computing – data is processed in the memory of the cluster
● Support for multiple languages – Scala, Python, SQL, R & Java
● Supports multiple use cases – batch processing, stream processing, ML, Data Science

● Works very well with data in Data Lake

31
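A minimal PySpark example of what working with Spark looks like from the developer's side (the sample rows are throwaway in-memory data; on Databricks a SparkSession named spark already exists):

    from pyspark.sql import SparkSession

    # Create a session when running Spark yourself; Databricks provides one.
    spark = SparkSession.builder.appName("intro").getOrCreate()

    df = spark.createDataFrame(
        [(1, "Mohit", 20000), (2, "Ivan", 27500), (3, "Sabrina", 24200)],
        ["Id", "Name", "Salary"],
    )

    # The same API runs on a laptop or across a large cluster
    df.filter(df.Salary > 21000).show()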
Spark Architecture

A Spark job:
1. Read a file from Data Lake
2. Apply transformations
3. Write processed data to Data Lake

[Diagram: one Driver Node running the Driver Process, and two Worker Nodes each running an Executor Process with two cores]

• Driver & Executors = JVM processes
• One Worker Node can have one or more Executor processes
• Executor size (in this example) = 2 cores + 6 GB RAM


32
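The three numbered steps above map directly onto a few lines of PySpark; a sketch with assumed paths and an assumed column name (amount):

    # 1. Read a file from the Data Lake
    df = spark.read.option("header", True).csv("/mnt/datalake/bronze/File1.csv")

    # 2. Apply transformations (run in parallel on the executor cores)
    transformed = df.dropDuplicates().filter("amount > 0")

    # 3. Write processed data back to the Data Lake
    transformed.write.mode("overwrite").parquet("/mnt/datalake/silver/file1/")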
Spark Architecture (contd.)

1. Read a file from Data Lake
2. Apply transformations
3. Write processed data to Data Lake

[Diagram: the Driver Process turns these steps into a Job, and File1.csv in the Data Lake is split logically into 4 parts]


33
Spark Architecture (contd.)

[Diagram: the Driver breaks the Job into Task 1–4, one task for each logical part of the file]


34
Spark Architecture (contd.)

[Diagram: the Driver schedules Task 1–4 onto the executor cores, one task per core]


35
Spark Architecture (contd.)

[Diagram: each task reads its own part (Part 1–4) of the file from the Data Lake into executor memory]


36
Spark Architecture (contd.)

[Diagram: each task applies the transformations to its part, producing Output 1–4 in executor memory]


37
Spark Architecture (contd.)

[Diagram: the outputs (Output 1–4) are written back to the Data Lake, completing the job]


38
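To relate the diagrams to code: each partition of a DataFrame becomes one task on an executor core. A small sketch (the file path is an assumption):

    df = spark.read.option("header", True).csv("/mnt/datalake/bronze/File1.csv")

    # Number of partitions = number of parallel tasks Spark will schedule
    print(df.rdd.getNumPartitions())

    # Force a particular split, e.g. the 4 parts shown in the diagrams
    df4 = df.repartition(4)
    print(df4.rdd.getNumPartitions())   # 4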
Azure Databricks
Unified analytics platform, based on Apache Spark, that runs in the cloud

● Managed and optimized platform for running Spark

● Set up infrastructure with a few clicks

● Integrated workspace to write code and collaborate

● Deploy and schedule jobs

● Enterprise-grade security

● Built-in optimizations – up to 50x higher performance than vanilla Spark deployments

*Image Source: Databricks 39


Azure Databricks - Components

● Workspace
● Cluster / Pools
● Notebooks
● Jobs
● Databases & Tables
● Models

40
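From a notebook, one common way to reach an Azure Data Lake Gen2 account is to point Spark at an abfss:// path. The sketch below uses the storage-account-key approach for brevity; the account, container, and path names are placeholders, and in practice a service principal or managed identity (and a secret scope rather than a hard-coded key) is preferred.

    # Placeholders – replace with your own account/container; never hard-code real keys.
    storage_account = "mydatalake"
    container = "bronze"

    spark.conf.set(
        f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
        "<storage-account-key>",
    )

    path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/orders/"
    orders_df = spark.read.option("header", True).csv(path)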
Exercise

● Set up Azure Databricks workspace

● Set up a Spark cluster in Databricks

● Instructions @ https://github.com/Crystal-Talks/DataLakeBootcamp_2Days

● Data Files @ http://tinyurl.com/data-lake-bootcamp-2days

41
Extracting &
Transforming Data
from Multiple File
Formats
DataFrames

● Higher-level Spark API to write code
● Reads data from a source as a tabular structure
● Like a table in a relational database – so you can apply typical table-type operations

Example DataFrame (each entry is a Row; Id, Name and Salary are Columns):

Id   Name      Salary
1    Mohit     20000
2    Ivan      27500
3    Sabrina   24200
4    Andrew    19000
5    Neha      32000

43
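Typical table-type operations on the DataFrame above, sketched in PySpark:

    from pyspark.sql import functions as F

    # Column projection and row filtering – like SELECT ... WHERE in SQL
    df.select("Name", "Salary").filter(F.col("Salary") > 20000).show()

    # Sorting and a simple aggregate
    df.orderBy(F.col("Salary").desc()).show()
    df.agg(F.avg("Salary").alias("avg_salary")).show()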
Exercise

● Work with multiple file formats like CSV, JSON and Parquet

● Read files from Data Lake using Spark’s DataFrame API

● Apply operations to clean and transform data

● Instructions @ https://github.com/Crystal-Talks/DataLakeBootcamp_2Days

● Data Files @ http://tinyurl.com/data-lake-bootcamp-2days

44
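A hedged sketch of the kind of code the exercise involves: reading each format and applying a couple of common clean-up operations. The paths, the multiLine option, and the order_id column are assumptions about the sample files.

    csv_df     = spark.read.option("header", True).csv("/mnt/datalake/bronze/orders.csv")
    json_df    = spark.read.option("multiLine", True).json("/mnt/datalake/bronze/customers.json")
    parquet_df = spark.read.parquet("/mnt/datalake/silver/orders/")

    # Common clean-up: drop exact duplicates, remove rows with a missing key,
    # and rename a column to a friendlier name
    clean_df = (csv_df.dropDuplicates()
                      .na.drop(subset=["order_id"])
                      .withColumnRenamed("order_id", "OrderId"))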
Tomorrow…

Building Reliable Data Lake with Delta Lake


● Loading processed data to Data Lake

● Introduction to Delta Lake


● Storing Data in Delta format
● Using Delta Lake features
