
ACCELERATING DATA ENGINEERING PIPELINES
Part 2: Extract, Transform, Load

1
AGENDA
Part 1: Data Formats
Part 2: ETL with NVTabular
Part 3: Data Visualization

AGENDA – PART 2
• ETL Basics
• CUDA
• NVTabular
• Lab
ETL BASICS
4
DATA MANIPULATION IN 3 “EASY” STEPS

Extract
• Pull data from a database
• SQL, Blob Storage, Image Archive

Transform
• Alter the data in some way
• Cleaning, deduping, feature engineering

Load
• Export transformed data to a new database location

5
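To make the three steps concrete: a minimal sketch in Python with pandas. The file and column names are hypothetical, and the Parquet file stands in for a database source.

import pandas as pd

# Extract: pull data from a source.
df = pd.read_parquet("raw_customers.parquet")

# Transform: clean, dedupe, and engineer features.
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())
df["is_adult"] = df["age"] >= 18

# Load: export the transformed data to a new location.
df.to_parquet("clean_customers.parquet")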
NOT SO EASY TO OPTIMIZE
All parts of the technology stack come into play

Individual Machines
• CPU
• GPU
• RAM
• PCIe

Software
• Python vs C vs SQL
• Algorithm efficiency
• MapReduce
• GPU support

Infrastructure
• Number of servers
• Bandwidth
• Database location
• Client location

6
THE MOST EFFICIENT PIPELINES CONSIDER ALL AT ONCE

[Venn diagram: Software, Individual Machine, and Infrastructure overlap; the most efficient pipelines sit at the intersection of all three.]

7
ETL SYSTEMS ENGINEERING

[Pipeline diagram: the Drivers and Cars tables are combined with a SQL join (2 hours); Python jobs then calculate demographics by car brand and promising sales opportunities (4 hours), producing Promising Leads (Staging); a cloud storage copy (0.5 hours) promotes them to Promising Leads (Production), which feed the Marketing Leads.]

8
CUDA BASICS
11
CUDA COMPONENTS
Kernels and Threads

Kernel
• A function to run in parallel on the GPU

Thread
• Runs an instance of the kernel

Thread IDs 0–7 each execute the kernel code:

float x = input[threadID];
float y = func(x);
output[threadID] = y;

12
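The snippet above is CUDA C; to keep the deck's runnable examples in one language, here is the same kernel/thread pattern as a minimal sketch in Python using Numba's CUDA support. The array contents are illustrative, and x * 2.0 stands in for func(x).

from numba import cuda
import numpy as np

@cuda.jit
def my_kernel(inp, out):
    thread_id = cuda.grid(1)      # global thread index, like threadID on the slide
    if thread_id < inp.size:      # guard in case more threads than elements run
        x = inp[thread_id]
        out[thread_id] = x * 2.0  # stand-in for func(x)

inp = np.arange(8, dtype=np.float32)  # one element per thread, IDs 0-7 as above
out = np.zeros_like(inp)
my_kernel[1, 8](inp, out)             # launch 8 threads; each runs one kernel instance
print(out)                            # [ 0.  2.  4.  6.  8. 10. 12. 14.]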
CUDA COMPONENTS
Blocks and Grids

Block
• A group of threads

Grid
• All blocks mapped on the GPU

Every thread in every block runs the same kernel code:

float x = input[threadID];
float y = func(x);
output[threadID] = y;

13
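Continuing the Numba sketch from the previous slide: when the data outgrows a single block, the launch configuration sizes a grid of blocks so that together they cover every element (sizes here are illustrative).

from numba import cuda
import numpy as np

@cuda.jit
def double(inp, out):
    i = cuda.grid(1)               # index across the whole grid, not just one block
    if i < inp.size:               # the last block may extend past the data
        out[i] = inp[i] * 2.0

n = 1_000_000
inp = np.arange(n, dtype=np.float32)
out = np.zeros_like(inp)

threads_per_block = 256                                    # threads in each block
blocks = (n + threads_per_block - 1) // threads_per_block  # ceil-divide so the grid covers n
double[blocks, threads_per_block](inp, out)                # the grid: all blocks mapped on the GPU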
CUDA COMPONENTS
Hardware to Software

RTX 3080
• 8704 CUDA cores
• 68 Streaming Multiprocessors (SMs)

14
MORE CORES, MORE PERFORMANCE?
A Hardware Integration Example

[Diagram: motherboard with CPU, RAM, memory drive, GPU, and PSU.]

• If hardware is not balanced, bottlenecks occur
• The right hardware depends on the task

15
HOW DATA MOVES IN A COMPUTER
A Hardware Integration Example

• CPU: 1–2 cores per GPU.
• RAM: should be at least a little bigger than GPU RAM. PCIe lane speed can be a factor with massive datasets.
• Memory drive: if the data is too large to be loaded into RAM, reading speed from the memory drive can be a bottleneck.
• GPU: GPUs excel at matrix multiplication, which is used frequently in ETL and ML model training. RAM speed, GPU speed, and memory bandwidth can all be potential bottlenecks for computation. Model and data size will impact the efficacy of the GPU.
• PSU: there should be enough juice to power all the cool hardware. If it’s not cool, try adding more fans.

16
HOW DATA MOVES IN A COMPUTER
A Hardware Integration Example

1. The CPU tells RAM to load in data from the memory drive.
2. RAM preprocesses the data (e.g. preparing images for an ML model) and sends it to an open GPU.
3. The GPU completes backpropagation, sends model updates to the host, and requests the next batch.
4. Go back to step 1 and repeat until model convergence.

17
DEBUGGING: WHY IS MY CPU FASTER THAN MY GPU?
How data moves to the GPU with CUDA

[Diagram: on the host, the CPU with its local memory and DRAM; on the device, the GPU chip with multiprocessors, shared memory, and DRAM. Host and device are connected only by the PCIe bus, which every transfer must cross.]

18
DATA MOVEMENT
CPU vs GPU

CPU-Centric Data Movement
• All traffic between GPUs, local storage, and the network is routed through the CPU over PCIe.

GPU-Centric Data Movement
• GPUDirect Storage: local storage to GPU directly
• NVLink: GPU to GPU (same node)
• GPUDirect RDMA: network to GPU directly

19
HOW TO OPTIMIZE DATA TRANSFERS

20
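The figure for this slide is not preserved, but two standard CUDA transfer optimizations in this area are pinned (page-locked) host memory, which the GPU can DMA directly, and asynchronous copies on streams that overlap transfers with compute. A minimal sketch in Python with Numba, with illustrative buffer sizes:

from numba import cuda
import numpy as np

n = 1_000_000
pinned = cuda.pinned_array(n, dtype=np.float32)  # page-locked host buffer, DMA-friendly
pinned[:] = np.random.rand(n)

stream = cuda.stream()
d_buf = cuda.to_device(pinned, stream=stream)    # asynchronous host-to-device copy
# ... launch kernels on the same stream so copies overlap with compute ...
d_buf.copy_to_host(pinned, stream=stream)        # asynchronous device-to-host copy
stream.synchronize()                             # wait for all queued work to finish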
NVTABULAR
21
GPU-ACCELERATED ETL
The average data scientist spends up to 80% of their
time in ETL, as opposed to training models

22
Built on top of the RAPIDS stack (www.rapids.ai)

CPU version → GPU-Accelerated
• ETL / Analytics: Pandas → cuDF
• Machine Learning: Scikit-Learn → cuML
• Matrix Mathematics: NumPy → CuPy
• Deep Learning: PyTorch, MXNet, TensorFlow (GPU-accelerated in both stacks)
• Visualization: Matplotlib, Plotly → pyViz, Plotly, cuXfilter

Dask provides scale-out on both sides; the CPU stack operates in CPU memory, the GPU stack in GPU memory.

23
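As an example of the Pandas → cuDF row above, moving an ETL step to the GPU is often close to a drop-in change. A minimal sketch with hypothetical file and column names:

import cudf

df = cudf.read_parquet("sales.parquet")            # loads straight into GPU memory
df = df.drop_duplicates()
df["revenue"] = df["price"] * df["quantity"]       # column arithmetic runs on the GPU
summary = df.groupby("store_id")["revenue"].sum()  # GPU-accelerated groupby
print(summary.to_pandas())                         # move the small result back to the CPU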
NVTABULAR KEY FEATURES
Faster and Easier GPU-based ETL

• GPU-accelerated, eliminating CPU bottlenecks.
• Out-of-core execution: no GPU memory limits, and reduced I/O through lazy execution.
• PyTorch, TensorFlow and HugeCTR compatible.
• Filtering outliers or missing values.
• Imputing and filling in missing data.
• Discretization or bucketing of continuous features.
• Creating features by splitting or combining existing features.
• Normalizing numerical features to have zero mean and unit variance.
• Encoding discrete features using one-hot vectors or converting them to continuous integer indices.
• More to come ☺

                          NVTabular         CPU
Dataset size limitation   Unlimited         CPU Memory
Code complexity           Simple            Moderate
Lines of code             10 - 20           100 - 1000
Flexibility               Domain specific   General
Data loading transforms   Yes               No
Inference transforms      Yes               No

24
NVTABULAR
Integration with RAPIDS/DASK

25
NVTabular vs Pandas code
100x fewer lines of code required

Equivalent Pandas/NumPy code: https://github.com/facebookresearch/dlrm/blob/master/data_utils.py

import glob
import nvtabular as nvt

# Create training and validation datasets from input files.
train_files = glob.glob("./dataset/train/*.parquet")
valid_files = glob.glob("./dataset/valid/*.parquet")

train_ds = nvt.Dataset(train_files, gpu_memory_frac=0.1)
valid_ds = nvt.Dataset(valid_files, gpu_memory_frac=0.1)

# Initialise workflow, specifying categorical and continuous features.
cat_names = ["C" + str(x) for x in range(1, 27)]   # Categorical feature names
cont_names = ["I" + str(x) for x in range(1, 14)]  # Continuous feature names
label_name = ["label"]                             # Target feature

proc = nvt.Workflow(cat_names=cat_names, cont_names=cont_names, label_name=label_name)

# Add feature engineering and pre-processing ops to the workflow:
# zero-fill nulls, log-transform and normalize continuous variables,
# and encode categorical data.
proc.add_cont_feature([nvt.ops.ZeroFill(), nvt.ops.LogOp()])
proc.add_cont_preprocess(nvt.ops.Normalize())
proc.add_cat_preprocess(nvt.ops.Categorify(use_frequency=True, freq_threshold=15))

# Apply the operations: compute statistics, transform the data, and export
# new shuffled training and validation datasets to disk.
proc.apply(train_ds, shuffle=True, output_path="./processed_data/train", num_out_files=len(train_files))
proc.apply(valid_ds, shuffle=False, output_path="./processed_data/valid", num_out_files=len(valid_files))

26
NVTABULAR DAG

[Diagram: an NVTabular workflow represented as a directed acyclic graph of NVTabular operations.]

27
Case Study: 1TB Ads Dataset
ETL 660x faster. Training 96x faster.

Pipeline                                         ETL (min)   Training (min)
NumPy CPU ETL + CPU Training                     7920        2880
Optimised Spark CPU ETL + PyTorch GPU Training   180         60
NVTabular GPU ETL + HugeCTR GPU Training         12          30

CPU: AWS r5d.24xl, 96 cores, 768 GB RAM
GPU: 1 x NVIDIA V100 32GB
https://github.com/NVIDIA-Merlin/NVTabular/blob/master/examples/criteo-example.ipynb

28
NVTABULAR
Position compared to other popular DataFrame libraries

29
LET’S GO!

30
31
APPENDIX
32
GPU
Another Hardware Integration Example

[Diagram: data paths without GPUDirect vs with GPUDirect.]

33
MEMORY HIERARCHY IN GPUS

34
PARTS OF AN ETL PIPELINE

Purpose                                          Individual Machine   Cloud Servers
Send instructions to other parts of the system   CPU – Host           Controller Server
Parallelize calculation                          Devices – GPUs       Workers
Data transfer                                    PCIe                 Bandwidth
Data storage                                     Disk Memory          Database server

35
KERNELS AND PAGEABLE MEMORY
Why they do not work together

1. The kernel requests pageable data from the memory drive.
2. The data does not exist (it has been paged out).
3. The kernel asks the page fault handler to fetch the data.
4. The page fault handler is not loaded, so the fault is not handled.
5. The kernel restarts the user-defined code, which requests the same data again.
6. Infinite loop.

36
BIG DATA FOR RETAIL

37
WORKING WITH NVTABULAR

Workflow
Step 1: Define Column Group
    cols = ["col_a", "col_b"]
Step 2: Transform Columns with Ops
    cleaned_cols = cols >> ops.FillMissing()
Step 3: Define Graph as Workflow
Step 4: Run Data through Workflow

A complete sketch of all four steps follows below.

38
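Putting the four steps together: a minimal sketch using NVTabular's operator-graph API, with hypothetical column and file names.

import nvtabular as nvt
from nvtabular import ops

# Step 1: define a column group.
cols = ["col_a", "col_b"]

# Step 2: transform columns with ops (>> chains ops into a graph).
cleaned_cols = cols >> ops.FillMissing() >> ops.Normalize()

# Step 3: define the graph as a workflow.
workflow = nvt.Workflow(cleaned_cols)

# Step 4: run data through the workflow.
dataset = nvt.Dataset("data.parquet")           # hypothetical input file
workflow.fit(dataset)                           # compute statistics (e.g. means for Normalize)
workflow.transform(dataset).to_parquet("out/")  # apply the transforms and write the result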
