ENGINEERING PIPELINES
Part 2: Extract, Transform, Load
NOT SO EASY TO OPTIMIZE
All parts of the technology stack come into play
THE MOST EFFICIENT PIPELINES CONSIDER ALL AT ONCE
• Software
• Infrastructure
• Individual Machine
ETL SYSTEMS ENGINEERING
[Pipeline diagram: source tables — Drivers, Cars, Sales Opportunities, Marketing Leads — feed a SQL join (2 hours); a Python job calculates demographics by car brand and promising sales opportunities (4 hours), writing Promising Leads (Staging); a cloud storage copy (0.5 hours) promotes the table to Promising Leads (Production).]
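As a sketch of how the stage timings in the diagram combine (assuming the three stages run sequentially, one after another):

```python
# Stage durations from the pipeline diagram, in hours.
stages = {
    "SQL join": 2.0,
    "Python demographics / promising-leads calculation": 4.0,
    "Cloud storage copy to production": 0.5,
}

# With sequential stages, end-to-end latency is the sum.
total_hours = sum(stages.values())
print(total_hours)  # 6.5
```

The 4-hour Python step dominates, which is why the rest of the deck focuses on accelerating the compute-heavy transform stage.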
CUDA BASICS
CUDA COMPONENTS
Kernels and Threads
• Kernel: a function to run in parallel on the GPU
• Thread: runs an instance of the kernel, selected by its thread ID (0, 1, 2, … 7)
Kernel code:
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
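The execution model on the slide can be simulated on the CPU. This is a toy pure-Python sketch, not real CUDA: each "thread" runs the same kernel body on the single element picked out by its thread ID (the `func` body is an arbitrary placeholder).

```python
def func(x):
    # Placeholder per-element computation (assumption, not from the deck).
    return x * 2.0

def kernel(thread_id, inp, out):
    # The kernel body from the slide, one element per thread.
    x = inp[thread_id]
    y = func(x)
    out[thread_id] = y

inp = [float(i) for i in range(8)]   # thread IDs 0..7, as on the slide
out = [0.0] * len(inp)
for thread_id in range(len(inp)):    # on a GPU these run in parallel
    kernel(thread_id, inp, out)

print(out)  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
```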
CUDA COMPONENTS
Blocks and Grids
• Block: a group of threads
• Grid: all blocks mapped on the GPU
Every thread in every block runs the same kernel code:
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
CUDA COMPONENTS
Hardware to Software
An RTX 3080 has 68 Streaming Multiprocessors (SMs); blocks are scheduled onto SMs.
MORE CORES MORE PERFORMANCE?
A Hardware Integration Example
[Diagram: a motherboard connecting CPU, RAM, memory drive, GPU, and PSU.]
• If hardware is not balanced, bottlenecks occur
• The right hardware depends on the task
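One way to see why unbalanced hardware causes bottlenecks: a data path can only move data as fast as its slowest link. A toy model with made-up bandwidth numbers (illustrative assumptions, not measurements of any specific part):

```python
# Hypothetical sustained bandwidths along one data path, in GB/s.
bandwidth_gbps = {
    "drive read": 3.0,
    "RAM": 50.0,
    "PCIe to GPU": 16.0,
    "GPU compute (effective)": 100.0,
}

# End-to-end throughput is limited by the slowest stage.
bottleneck = min(bandwidth_gbps, key=bandwidth_gbps.get)
print(bottleneck)  # drive read
```

With these numbers, upgrading the GPU does nothing: the drive is the limit, which is the "right hardware depends on the task" point.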
HOW DATA MOVES IN A COMPUTER
A Hardware Integration Example
[Diagram: a motherboard connecting CPU, RAM, memory drive, GPU, and PSU.]
GPUs excel at matrix multiplication, which is used frequently in ETL and ML model training. RAM speed, GPU speed, and memory bandwidth can all be potential bottlenecks for computation. Model and data size will impact the efficacy of the GPU.
PSU: there should be enough juice to power all the cool hardware. If it's not cool, try adding more fans.
DEBUGGING: WHY IS MY CPU FASTER THAN MY GPU?
How data moves to the GPU with CUDA
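A back-of-the-envelope model of why a GPU can lose to a CPU on small inputs: the PCIe transfer and launch overhead are paid before any GPU compute happens. All the per-element costs below are made-up illustrative numbers, not benchmarks:

```python
# Hypothetical costs (assumptions for illustration only).
cpu_cost_per_item = 10e-9          # seconds of CPU compute per element
gpu_cost_per_item = 1e-9           # GPU computes 10x faster per element
transfer_cost_per_item = 4e-9      # PCIe copy, host -> device
gpu_fixed_overhead = 20e-6         # kernel launch + driver overhead

def cpu_time(n):
    return n * cpu_cost_per_item

def gpu_time(n):
    return gpu_fixed_overhead + n * (transfer_cost_per_item + gpu_cost_per_item)

# For small n, fixed overhead and transfer dominate; at scale the GPU wins.
print(cpu_time(1_000) < gpu_time(1_000))              # True: CPU faster on small data
print(cpu_time(10_000_000) > gpu_time(10_000_000))    # True: GPU faster at scale
```

This is the usual first debugging question: is the kernel slow, or is the data simply too small (or too expensive to move) to amortize the transfer?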
DATA MOVEMENT
CPU vs GPU
[Diagram: in CPU-centric data movement, data from local storage and the network passes through the CPU and then over PCIe to GPU 0 and GPU 1. In GPU-centric data movement, GPUDirect Storage lets data move over PCIe between local storage, the network, and the GPUs without staging through the CPU.]
NVTABULAR
GPU-ACCELERATED ETL
The average data scientist spends up to 80% of their
time in ETL, as opposed to training models
Built on top of RAPIDS (www.rapids.ai)

Workload                            CPU version (CPU memory)    GPU-Accelerated (GPU memory)
ETL / Analytics                     Pandas                      cuDF
Machine Learning / Model Training   Scikit-Learn                cuML
Matrix Mathematics                  NumPy                       CuPy
Visualization                       Matplotlib, Plotly          pyViz, Plotly, cuXfilter
Distributed scaling                 Dask                        Dask
NVTABULAR KEY FEATURES
Faster and Easier GPU-based ETL
NVTabular vs Pandas code
100x fewer lines of code required

    # Import libraries.
    import glob
    import nvtabular as nvt

    # Create training and validation datasets from input files.
    train_files = glob.glob("./dataset/train/*.parquet")
    valid_files = glob.glob("./dataset/valid/*.parquet")

    # Add feature engineering and pre-processing ops to the workflow.
    # Zero-fill any nulls, log transform, and normalize continuous variables.
    proc.add_cont_feature([nvt.ops.ZeroFill(), nvt.ops.LogOp()])
    proc.add_cont_preprocess(nvt.ops.Normalize())
    # Encode categorical data.
    proc.add_cat_preprocess(nvt.ops.Categorify(use_frequency=True, freq_threshold=15))

(The nvt.ops.* calls are NVTabular operations. Equivalent Pandas/NumPy code: https://github.com/facebookresearch/dlrm/blob/master/data_utils.py)
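For comparison, a rough pandas/NumPy sketch of the same preprocessing steps (zero-fill, log transform, normalize, categorify) on hypothetical columns and data; the NVTabular ops apply the same transforms on the GPU:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, None, 1000.0, 5.0],   # continuous column (example data)
    "brand": ["a", "b", "a", "c"],        # categorical column (example data)
})

# ZeroFill + LogOp: fill nulls with 0, then log(1 + x) transform.
df["price"] = np.log1p(df["price"].fillna(0.0).clip(lower=0.0))

# Normalize: standardize to zero mean, unit variance.
df["price"] = (df["price"] - df["price"].mean()) / df["price"].std()

# Categorify: encode category values as integer codes.
df["brand"] = df["brand"].astype("category").cat.codes

print(df["brand"].tolist())  # [0, 1, 0, 2]
```

The column names and values here are invented for illustration; the real DLRM preprocessing linked above operates on the Criteo ads schema.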
Case Study: 1TB Ads Dataset
ETL 660x faster. Training 96x faster.

Pipeline                                         ETL (min)   Training (min)
NumPy CPU ETL + CPU Training                        7920         2880
Optimised Spark CPU ETL + PyTorch GPU Training       180           60
NVTabular GPU ETL + HugeCTR GPU Training              12           30

(Original chart: length in minutes, log scale.)
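The headline speedups follow directly from the chart's timings:

```python
timings_minutes = {
    "NumPy CPU ETL": 7920,
    "CPU training": 2880,
    "NVTabular GPU ETL": 12,
    "HugeCTR GPU training": 30,
}

etl_speedup = timings_minutes["NumPy CPU ETL"] / timings_minutes["NVTabular GPU ETL"]
train_speedup = timings_minutes["CPU training"] / timings_minutes["HugeCTR GPU training"]
print(etl_speedup, train_speedup)  # 660.0 96.0
```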
LET’S GO!
APPENDIX
GPU
Another Hardware Integration Example
[Diagram: the data path from storage to GPU, without GPUDirect vs with GPUDirect.]
MEMORY HIERARCHY IN GPUS
PARTS OF AN ETL PIPELINE

Part          Purpose                                          Individual Machine   Cloud Servers
Controller    Send instructions to other parts of the system   CPU (host)           Server
Devices       Parallelize calculation                          GPUs                 Workers
Storage       Data storage                                     Disk, Memory         Database server
KERNELS AND PAGEABLE MEMORY
Why they do not work together
Infinite loop.
BIG DATA FOR RETAIL
WORKING WITH NVTABULAR
Workflow

Step 1: Define Column Group
    cols = ["col_a", "col_b"]
Step 2: Transform Columns with Ops
    cleaned_cols = cols >> ops.FillMissing()
Step 3: Define Graph as Workflow
Step 4: Run Data through Workflow
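The `>>` pattern above builds a dataflow graph before any data moves; running the workflow later replays the recorded ops. A toy re-implementation of the idea in plain Python (simplified stand-ins, not the real nvtabular classes) shows the mechanics of the four steps:

```python
class Op:
    """Base class for a column transform (toy stand-in)."""
    def transform(self, rows):
        raise NotImplementedError

class FillMissing(Op):
    """Replace None values with a fill value."""
    def __init__(self, fill=0.0):
        self.fill = fill
    def transform(self, rows):
        return [{k: (self.fill if v is None else v) for k, v in r.items()}
                for r in rows]

class ColumnGroup:
    """A set of columns plus the ops recorded against them."""
    def __init__(self, columns, ops=None):
        self.columns = columns
        self.ops = ops or []
    def __rshift__(self, op):
        # Implements `cols >> op`: record the op, return a new group.
        return ColumnGroup(self.columns, self.ops + [op])

class Workflow:
    """Wraps a graph and runs data through its ops in order."""
    def __init__(self, graph):
        self.graph = graph
    def transform(self, rows):
        for op in self.graph.ops:
            rows = op.transform(rows)
        return rows

cols = ColumnGroup(["col_a", "col_b"])    # Step 1: define column group
cleaned_cols = cols >> FillMissing()      # Step 2: transform columns with ops
workflow = Workflow(cleaned_cols)         # Step 3: define graph as workflow
out = workflow.transform([{"col_a": None, "col_b": 2.0}])  # Step 4: run data
print(out)  # [{'col_a': 0.0, 'col_b': 2.0}]
```

The real NVTabular version of step 4 additionally fits statistics (e.g. category mappings) and executes the ops on the GPU via cuDF.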