
Data Lake Bootcamp

Building Reliable Data Lakes

Mohit Batra
Microsoft Certified Trainer

09/2023
Mohit Batra

● Founder, Crystal Talks


● Author
● Microsoft Certified Trainer
● Ex-Microsoft, Saxo Bank

linkedin.com/in/mohitbatra/
Learning Objectives

By the end of this course, you will understand:

● The concept and objective of the data lake

● Ways to organize data in a data lake

● The use of Spark/Databricks to process data

● Challenges with Data Lake

● Storing data reliably in Data Lake using Delta format


2-Day Agenda

Day 1

● Concept and objective of the Data Lake


● Ways to organize data in Data Lake
● Working with Spark/Databricks to process data in Data Lake
Day 2
● Challenges with Data Lake
● How Delta Lake brings reliability to your data lake
● How data can be managed using Spark tables
● How Delta Lake can be used to build warehouse-like features on Data Lake
Today’s Agenda – Day 1

● Introduction to Data Lake


● Using Azure Data Lake Gen2 to build Data Lake
● Introduction to Apache Spark and Azure Databricks
● Extracting and transforming data from multiple file formats
Prerequisites

● Azure Subscription
● Basics of Microsoft Azure
Introduction to
Data Lake
What is a Data Warehouse?

Relational Sources → Staging → Data Warehouse → BI / Reporting

● Centralized repository to store data in pre-defined structure

● Stores highly curated data in relational format


● Optimized for data retrieval
● Typically used by business analysts / users for BI & reporting purposes

8
Challenges with Data Warehouse

Relational Sources → Staging → Data Warehouse → BI / Reporting

New data sources: IoT Devices, Streaming, Files / Logs, Audio / Video

Velocity, Variety & Volume (3Vs) of data have significantly increased!

9
Data Lake

[Diagram: all sources – Relational Sources, IoT Devices, Streaming, Files / Logs, Audio / Video – feed the Data Lake; from there, data flows to the Data Warehouse for BI / Reporting, and is also used directly for Machine Learning and Data Science]

10
Data Lake – What is it?
Single source of truth holding vast amounts of raw data in its native format

● Store all types of data at large scale


● Supports storing streaming data at high speeds
● No need to define schema – store in native format
● Inexpensive storage compared to Data Warehouses

● Explore data first – before moving it to relational storage


● Use big data processing tools to enrich & curate data
● Supports multiple use cases

11
Medallion Architecture

Bronze Layer – Raw data
• Single source of truth
• Use this for reprocessing

Silver Layer – Cleaned & transformed data
• Store granular data
• Enrich data with more info
• Combine data from different files

Gold Layer – Aggregated / domain data
• Store business aggregates
• Optimized for querying
• Useful for BI/reporting
12
Design Considerations

● Centralized or decentralized
● File Format
● Organizing Data Files
● Performance, Scalability & High Availability

● Data Security
● Monitoring
● Capturing Metadata

13
Design Considerations
(1) Centralized or Decentralized?

Centralized Data Lake
• Easily manage security, apply governance rules, manage data lifecycle
• Better usage & cost tracking
• Impedes agile development
• Doesn’t work well for global applications

Region-specific / Customer-specific Data Lakes
• Apply separate policies to different lakes
• Faster development
• Management overhead
• If not governed properly, can turn into data swamps
14
Design Considerations
(2) File Format to use?

● Raw Data (Bronze layer)

○ Store in Native formats – CSV, JSON, Excel, XML etc.

● Enriched Data (Silver layer)


○ Store in Columnar formats – Parquet etc.

○ Provides file compression, stores metadata, faster retrieval

● Curated Data (Gold layer)


○ Depends on business, but many tools support formats like Parquet too

15
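To make the bronze-to-silver format change concrete, here is a minimal PySpark sketch (the paths and options are illustrative assumptions, not part of the course material): read raw CSV from the bronze layer and rewrite it as Parquet in the silver layer.

    # Minimal sketch: convert raw CSV (bronze) into Parquet (silver).
    # Paths below are hypothetical examples.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

    raw_df = (spark.read
                   .option("header", True)       # first row holds column names
                   .option("inferSchema", True)  # let Spark guess the data types
                   .csv("/mnt/datalake/bronze/orders/"))

    # Parquet keeps the schema, compresses the data, and is faster to scan
    raw_df.write.mode("overwrite").parquet("/mnt/datalake/silver/orders/")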
Design Considerations
(3) Organizing Data Files based on…

• Date / Time – Orders/Year/Month/Day/Hour, Sales/Year/Month/Day/Hour
• Dim model entities – Dim/Customers, Dim/Products, Fact/Orders, Fact/Sales
• Consumer – Sales/Customer, Sales/Product, Finance/Customer
• Region – Separate data lake accounts
• Confidentiality – PII/Customer/, Partner/Customer
• Usage – CurrentYear/Orders, PreviousYears/2021/Orders

16
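Spark can generate the date-based folder hierarchy shown above automatically. A hedged sketch, assuming the DataFrame already carries year/month/day columns (hypothetical names):

    # Writes files under .../orders/year=2022/month=11/day=22/...
    # Column names and the target path are assumptions for illustration.
    (orders_df.write
              .mode("append")
              .partitionBy("year", "month", "day")
              .parquet("/mnt/datalake/silver/orders/"))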
Design Considerations
(3) Organizing Data Files

Bronze Layer (Raw data):
Orders/2022/11/22/01/orders1.csv

Silver Layer (Cleaned & transformed data):
Fact/Orders/Partner1/2022/11/orders1.parquet

Gold Layer (Aggregated / domain data):
Finance/Orders/PartnerSummary/2022-11-22/summary_by_product.csv
Finance/Orders/PartnerSummary/2022-11-22/summary_by_customer.csv

17
Design Considerations
(4) Performance, Scalability & High Availability

● Setting this up on-premises requires planning & configuring these aspects

● Most cloud-based storage options provide this out-of-the-box


○ Azure Data Lake Store, Amazon S3, Google Cloud Storage

18
Design Considerations
(5) Security

Access

● Create groups & roles to provide role-based access control (RBAC)


● Use the principle of least-privilege
● Build your workflow to provide access to different assets

Personas
● Bronze layer – Access should be limited to Data Engineers
● Silver layer – Complete access to Data Engineers, and read-access to Data Scientists/ML Engineers
● Gold layer – Complete access to Data Engineers, and read-access to others (Business Users)

19
Design Considerations
(6) Monitoring

● Monitor data that is being read – by whom, at what frequency, etc.

● Capture latency in serving requests


● Capture logs
○ In a log database for near real-time querying

○ In file-based storage for long-term retention

20
Design Considerations
(7) Capturing Metadata

● Building a Catalog to capture asset information is extremely important!

● Catalog stores information about assets


○ Who owns it?
○ When was it created?
○ What attributes does it have?
○ Does it contain PII data?
○ Which domain / entity does it belong to?
○ …

21
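As a concrete (and purely hypothetical) illustration, a single catalog entry could capture these answers in a structure like the following; the field names are invented for this example and do not come from any specific catalog product.

    # Hypothetical catalog record for one data-lake asset
    asset_metadata = {
        "asset": "silver/orders",
        "owner": "data-engineering-team",
        "created_on": "2022-11-22",
        "attributes": ["order_id", "customer_id", "amount", "order_date"],
        "contains_pii": False,
        "domain": "Sales",
    }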
Products / Services

• Data Sources – Azure SQL, On-prem DBs, Azure Event Hubs, Kafka
• Ingestion Tools – Azure Data Factory, SSIS, Informatica
• Data Lake – HDFS, Azure Data Lake, AWS S3, GCS
• Processing Tools – Azure Synapse, Databricks, Azure Stream Analytics
• Reporting Tools – Power BI, Tableau
• Data Science / ML – Azure ML, Databricks

22
Using Azure Data
Lake Gen2 to
build Data Lake
Options to build Data Lake in Azure

Azure Storage
• Performance options
• Tiered options – Hot/cool/archive
• Global redundancy options

Azure Data Lake Gen1
• Compatible with webHDFS
• Allows nested folders (hierarchical namespace)
• Better performance
• Fine-grain permissions

Azure Data Lake Gen2
• Performance options
• Tiered options – Hot/cool/archive
• Global redundancy options
• Compatible with webHDFS
• Allows nested folders (hierarchical namespace)
• Better performance
• Fine-grain permissions

24
Performance Tiers

• Standard
• Premium

25
Standard Performance Tier – Access Tiers

Hot storage tier
• When to use: data that is read & written frequently
• Availability: 99.9%
• Charges: higher storage costs, lower access costs
• Minimum storage duration: N/A
• Blob rehydration latency (time to first byte): milliseconds

Cool storage tier
• When to use: infrequently accessed data
• Availability: 99%
• Charges: lower storage costs, higher access costs
• Minimum storage duration: 30 days
• Blob rehydration latency (time to first byte): milliseconds

Archive storage tier
• When to use: archival
• Availability: offline
• Charges: lowest storage costs, highest access costs
• Minimum storage duration: 180 days
• Blob rehydration latency (time to first byte): Standard priority – up to 15 hours; High priority – up to 1 hour (for 10 GB)

Reference: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers 26
Redundancy Options – Primary Region Only

Locally-redundant storage (LRS)
• Regions: 1
• Datacenters used: 1
• Copies: 3 – within the same datacenter
• Benefit: protects against failures in server racks

Zone-redundant storage (ZRS)
• Regions: 1
• Datacenters used: 3
• Copies: 3 – one copy in each datacenter
• Benefit: protects against datacenter failures

27
Redundancy Options – Multiple Regions

Geo-redundant storage (GRS)
• Regions: 2
• Datacenters used: 2
• Copies: 3 in primary & 3 in secondary
• Benefit: protects against regional failures

Geo-Zone-redundant storage (GZRS)
• Regions: 2
• Datacenters used: 4
• Copies: 3 in primary & 3 in secondary
• Benefit: protects against regional & datacenter failures

28
Exercise

● Create Azure Data Lake Gen2 account

● Set up permissions on the Data Lake account

● Configure data lifecycle

● Instructions @ https://github.com/Crystal-Talks/DataLakeBootcamp_2Days

● Data Files @ http://tinyurl.com/data-lake-bootcamp-2days

29
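For the "configure data lifecycle" step, Azure Storage lifecycle management rules are defined as JSON; the sketch below mirrors that structure as a Python dict so it stays in one language with the other examples. The rule name, prefix, and day thresholds are assumptions chosen to line up with the tier minimums on the access-tiers slide (30 days for cool, 180 for archive).

    # Sketch of one Azure Storage lifecycle rule, expressed as a Python dict
    # that mirrors the policy JSON. Values are illustrative.
    lifecycle_rule = {
        "enabled": True,
        "name": "age-out-raw-orders",
        "type": "Lifecycle",
        "definition": {
            "filters": {
                "blobTypes": ["blockBlob"],
                "prefixMatch": ["bronze/orders/"],
            },
            "actions": {
                "baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": 30},
                    "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                    "delete": {"daysAfterModificationGreaterThan": 365},
                }
            },
        },
    }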
Introduction to
Apache Spark and
Azure Databricks
Apache Spark

● Open-source compute engine to perform distributed data processing at scale
● Based on cluster computing – data is processed in the memory of the cluster
● Support for multiple languages – Scala, Python, SQL, R & Java
● Supports multiple use cases – batch processing, stream processing, ML, Data Science

● Works very well with data in Data Lake

31
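A minimal PySpark example of what working with Spark looks like from the developer's side (the sample rows are throwaway in-memory data; on Databricks a SparkSession named spark already exists):

    from pyspark.sql import SparkSession

    # Create a session when running Spark yourself; Databricks provides one.
    spark = SparkSession.builder.appName("intro").getOrCreate()

    df = spark.createDataFrame(
        [(1, "Mohit", 20000), (2, "Ivan", 27500), (3, "Sabrina", 24200)],
        ["Id", "Name", "Salary"],
    )

    # The same API runs on a laptop or across a large cluster
    df.filter(df.Salary > 21000).show()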
Spark Architecture

A Spark job:
1. Read a file from Data Lake
2. Apply transformations
3. Write processed data to Data Lake

[Diagram: one Driver Node running the Driver Process, and two Worker Nodes each running an Executor Process with two cores]

• Driver & Executors = JVM processes
• One Worker Node can have one or more Executor processes
• Executor size (in this example) = 2 cores + 6 GB RAM


32
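The three numbered steps above map directly onto a few lines of PySpark; a sketch with assumed paths and an assumed column name (amount):

    # 1. Read a file from the Data Lake
    df = spark.read.option("header", True).csv("/mnt/datalake/bronze/File1.csv")

    # 2. Apply transformations (run in parallel on the executor cores)
    transformed = df.dropDuplicates().filter("amount > 0")

    # 3. Write processed data back to the Data Lake
    transformed.write.mode("overwrite").parquet("/mnt/datalake/silver/file1/")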
Spark Architecture (contd.)

1. Read a file from Data Lake
2. Apply transformations
3. Write processed data to Data Lake

[Diagram: the Driver Process turns these steps into a Job, and File1.csv in the Data Lake is split logically into 4 parts]


33
Spark Architecture (contd.)

[Diagram: the Driver breaks the Job into Task 1–4, one task for each logical part of the file]


34
Spark Architecture (contd.)

[Diagram: the Driver schedules Task 1–4 onto the executor cores, one task per core]


35
Spark Architecture (contd.)

[Diagram: each task reads its own part (Part 1–4) of the file from the Data Lake into executor memory]


36
Spark Architecture (contd.)

[Diagram: each task applies the transformations to its part, producing Output 1–4 in executor memory]


37
Spark Architecture (contd.)

[Diagram: the outputs (Output 1–4) are written back to the Data Lake, completing the job]


38
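To relate the diagrams to code: each partition of a DataFrame becomes one task on an executor core. A small sketch (the file path is an assumption):

    df = spark.read.option("header", True).csv("/mnt/datalake/bronze/File1.csv")

    # Number of partitions = number of parallel tasks Spark will schedule
    print(df.rdd.getNumPartitions())

    # Force a particular split, e.g. the 4 parts shown in the diagrams
    df4 = df.repartition(4)
    print(df4.rdd.getNumPartitions())   # 4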
Azure Databricks
Unified analytics platform, based on Apache Spark, that runs in the cloud

● Managed and optimized platform for running Spark

● Set up infrastructure with a few clicks

● Integrated workspace to write code and collaborate

● Deploy and schedule jobs

● Enterprise-grade security

● Built-in optimizations – up to 50x higher performance than vanilla Spark deployments

*Image Source: Databricks 39


Azure Databricks - Components

● Workspace
● Cluster / Pools
● Notebooks
● Jobs
● Databases & Tables
● Models

40
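From a notebook, one common way to reach an Azure Data Lake Gen2 account is to point Spark at an abfss:// path. The sketch below uses the storage-account-key approach for brevity; the account, container, and path names are placeholders, and in practice a service principal or managed identity (and a secret scope rather than a hard-coded key) is preferred.

    # Placeholders – replace with your own account/container; never hard-code real keys.
    storage_account = "mydatalake"
    container = "bronze"

    spark.conf.set(
        f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
        "<storage-account-key>",
    )

    path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/orders/"
    orders_df = spark.read.option("header", True).csv(path)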
Exercise

● Set up Azure Databricks workspace

● Set up a Spark cluster in Databricks

● Instructions @ https://github.com/Crystal-Talks/DataLakeBootcamp_2Days

● Data Files @ http://tinyurl.com/data-lake-bootcamp-2days

41
Extracting &
Transforming Data
from Multiple File
Formats
DataFrames

● Higher-level Spark API to write code
● Reads data from a source as a tabular structure
● Like a table in a relational database – so you can apply typical table-type operations

Example DataFrame (each entry is a Row; Id, Name and Salary are Columns):

Id   Name      Salary
1    Mohit     20000
2    Ivan      27500
3    Sabrina   24200
4    Andrew    19000
5    Neha      32000

43
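Typical table-type operations on the DataFrame above, sketched in PySpark:

    from pyspark.sql import functions as F

    # Column projection and row filtering – like SELECT ... WHERE in SQL
    df.select("Name", "Salary").filter(F.col("Salary") > 20000).show()

    # Sorting and a simple aggregate
    df.orderBy(F.col("Salary").desc()).show()
    df.agg(F.avg("Salary").alias("avg_salary")).show()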
Exercise

● Work with multiple file formats like CSV, JSON and Parquet

● Read files from Data Lake using Spark’s DataFrame API

● Apply operations to clean and transform data

● Instructions @ https://github.com/Crystal-Talks/DataLakeBootcamp_2Days

● Data Files @ http://tinyurl.com/data-lake-bootcamp-2days

44
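A hedged sketch of the kind of code the exercise involves: reading each format and applying a couple of common clean-up operations. The paths, the multiLine option, and the order_id column are assumptions about the sample files.

    csv_df     = spark.read.option("header", True).csv("/mnt/datalake/bronze/orders.csv")
    json_df    = spark.read.option("multiLine", True).json("/mnt/datalake/bronze/customers.json")
    parquet_df = spark.read.parquet("/mnt/datalake/silver/orders/")

    # Common clean-up: drop exact duplicates, remove rows with a missing key,
    # and rename a column to a friendlier name
    clean_df = (csv_df.dropDuplicates()
                      .na.drop(subset=["order_id"])
                      .withColumnRenamed("order_id", "OrderId"))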
Tomorrow…

Building Reliable Data Lake with Delta Lake


● Loading processed data to Data Lake

● Introduction to Delta Lake


● Storing Data in Delta format
● Using Delta Lake features
