ENGINEERING PIPELINES
Part 2: Extract, Transform, Load
NOT SO EASY TO OPTIMIZE
All parts of the technology stack come into play
THE MOST EFFICIENT PIPELINES CONSIDER ALL AT ONCE
• Software
• Infrastructure
• Individual Machine
ETL SYSTEMS ENGINEERING
[Pipeline diagram: source tables — Drivers, Cars, Sales Opportunities, Marketing Leads — feed a SQL join (2 hours); a Python job calculates demographics by car brand and promising sales opportunities (4 hours), writing Promising Leads (Staging); a cloud storage copy (0.5 hours) promotes the table to Promising Leads (Production).]
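As a sketch of how the stage timings in the diagram combine (assuming the three stages run sequentially, one after another):

```python
# Stage durations from the pipeline diagram, in hours.
stages = {
    "SQL join": 2.0,
    "Python demographics / promising-leads calculation": 4.0,
    "Cloud storage copy to production": 0.5,
}

# With sequential stages, end-to-end latency is the sum.
total_hours = sum(stages.values())
print(total_hours)  # 6.5
```

The 4-hour Python step dominates, which is why the rest of the deck focuses on accelerating the compute-heavy transform stage.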
CUDA BASICS
CUDA COMPONENTS
Kernels and Threads
• Kernel: a function to run in parallel on the GPU
• Thread: runs an instance of the kernel, selected by its thread ID (0, 1, 2, … 7)
Kernel code:
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
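The execution model on the slide can be simulated on the CPU. This is a toy pure-Python sketch, not real CUDA: each "thread" runs the same kernel body on the single element picked out by its thread ID (the `func` body is an arbitrary placeholder).

```python
def func(x):
    # Placeholder per-element computation (assumption, not from the deck).
    return x * 2.0

def kernel(thread_id, inp, out):
    # The kernel body from the slide, one element per thread.
    x = inp[thread_id]
    y = func(x)
    out[thread_id] = y

inp = [float(i) for i in range(8)]   # thread IDs 0..7, as on the slide
out = [0.0] * len(inp)
for thread_id in range(len(inp)):    # on a GPU these run in parallel
    kernel(thread_id, inp, out)

print(out)  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
```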
CUDA COMPONENTS
Blocks and Grids
• Block: a group of threads
• Grid: all blocks mapped on the GPU
Every thread in every block runs the same kernel code:
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
CUDA COMPONENTS
Hardware to Software
An RTX 3080 has 68 Streaming Multiprocessors (SMs); blocks are scheduled onto SMs.
MORE CORES MORE PERFORMANCE?
A Hardware Integration Example
[Diagram: a motherboard connecting CPU, RAM, memory drive, GPU, and PSU.]
• If hardware is not balanced, bottlenecks occur
• The right hardware depends on the task
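One way to see why unbalanced hardware causes bottlenecks: a data path can only move data as fast as its slowest link. A toy model with made-up bandwidth numbers (illustrative assumptions, not measurements of any specific part):

```python
# Hypothetical sustained bandwidths along one data path, in GB/s.
bandwidth_gbps = {
    "drive read": 3.0,
    "RAM": 50.0,
    "PCIe to GPU": 16.0,
    "GPU compute (effective)": 100.0,
}

# End-to-end throughput is limited by the slowest stage.
bottleneck = min(bandwidth_gbps, key=bandwidth_gbps.get)
print(bottleneck)  # drive read
```

With these numbers, upgrading the GPU does nothing: the drive is the limit, which is the "right hardware depends on the task" point.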
HOW DATA MOVES IN A COMPUTER
A Hardware Integration Example
[Diagram: a motherboard connecting CPU, RAM, memory drive, GPU, and PSU.]
GPUs excel at matrix multiplication, which is used frequently in ETL and ML model training. RAM speed, GPU speed, and memory bandwidth can all be potential bottlenecks for computation. Model and data size will impact the efficacy of the GPU.
PSU: there should be enough juice to power all the cool hardware. If it's not cool, try adding more fans.
DEBUGGING: WHY IS MY CPU FASTER THAN MY GPU?
How data moves to the GPU with CUDA
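A back-of-the-envelope model of why a GPU can lose to a CPU on small inputs: the PCIe transfer and launch overhead are paid before any GPU compute happens. All the per-element costs below are made-up illustrative numbers, not benchmarks:

```python
# Hypothetical costs (assumptions for illustration only).
cpu_cost_per_item = 10e-9          # seconds of CPU compute per element
gpu_cost_per_item = 1e-9           # GPU computes 10x faster per element
transfer_cost_per_item = 4e-9      # PCIe copy, host -> device
gpu_fixed_overhead = 20e-6         # kernel launch + driver overhead

def cpu_time(n):
    return n * cpu_cost_per_item

def gpu_time(n):
    return gpu_fixed_overhead + n * (transfer_cost_per_item + gpu_cost_per_item)

# For small n, fixed overhead and transfer dominate; at scale the GPU wins.
print(cpu_time(1_000) < gpu_time(1_000))              # True: CPU faster on small data
print(cpu_time(10_000_000) > gpu_time(10_000_000))    # True: GPU faster at scale
```

This is the usual first debugging question: is the kernel slow, or is the data simply too small (or too expensive to move) to amortize the transfer?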
DATA MOVEMENT
CPU vs GPU
[Diagram: in CPU-centric data movement, data from local storage and the network passes through the CPU and then over PCIe to GPU 0 and GPU 1. In GPU-centric data movement, GPUDirect Storage lets data move over PCIe between local storage, the network, and the GPUs without staging through the CPU.]
NVTABULAR
GPU-ACCELERATED ETL
The average data scientist spends up to 80% of their
time in ETL, as opposed to training models
Built on top of RAPIDS (www.rapids.ai)

Workload                            CPU version (CPU memory)    GPU-Accelerated (GPU memory)
ETL / Analytics                     Pandas                      cuDF
Machine Learning / Model Training   Scikit-Learn                cuML
Matrix Mathematics                  NumPy                       CuPy
Visualization                       Matplotlib, Plotly          pyViz, Plotly, cuXfilter
Distributed scaling                 Dask                        Dask
NVTABULAR KEY FEATURES
Faster and Easier GPU-based ETL
NVTabular vs Pandas code
100x fewer lines of code required

    # Import libraries.
    import glob
    import nvtabular as nvt

    # Create training and validation datasets from input files.
    train_files = glob.glob("./dataset/train/*.parquet")
    valid_files = glob.glob("./dataset/valid/*.parquet")

    # Add feature engineering and pre-processing ops to the workflow.
    # Zero-fill any nulls, log transform, and normalize continuous variables.
    proc.add_cont_feature([nvt.ops.ZeroFill(), nvt.ops.LogOp()])
    proc.add_cont_preprocess(nvt.ops.Normalize())
    # Encode categorical data.
    proc.add_cat_preprocess(nvt.ops.Categorify(use_frequency=True, freq_threshold=15))

(The nvt.ops.* calls are NVTabular operations. Equivalent Pandas/NumPy code: https://github.com/facebookresearch/dlrm/blob/master/data_utils.py)
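For comparison, a rough pandas/NumPy sketch of the same preprocessing steps (zero-fill, log transform, normalize, categorify) on hypothetical columns and data; the NVTabular ops apply the same transforms on the GPU:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, None, 1000.0, 5.0],   # continuous column (example data)
    "brand": ["a", "b", "a", "c"],        # categorical column (example data)
})

# ZeroFill + LogOp: fill nulls with 0, then log(1 + x) transform.
df["price"] = np.log1p(df["price"].fillna(0.0).clip(lower=0.0))

# Normalize: standardize to zero mean, unit variance.
df["price"] = (df["price"] - df["price"].mean()) / df["price"].std()

# Categorify: encode category values as integer codes.
df["brand"] = df["brand"].astype("category").cat.codes

print(df["brand"].tolist())  # [0, 1, 0, 2]
```

The column names and values here are invented for illustration; the real DLRM preprocessing linked above operates on the Criteo ads schema.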
Case Study: 1TB Ads Dataset
ETL 660x faster. Training 96x faster.

Pipeline                                         ETL (min)   Training (min)
NumPy CPU ETL + CPU Training                        7920         2880
Optimised Spark CPU ETL + PyTorch GPU Training       180           60
NVTabular GPU ETL + HugeCTR GPU Training              12           30

(Original chart: length in minutes, log scale.)
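The headline speedups follow directly from the chart's timings:

```python
timings_minutes = {
    "NumPy CPU ETL": 7920,
    "CPU training": 2880,
    "NVTabular GPU ETL": 12,
    "HugeCTR GPU training": 30,
}

etl_speedup = timings_minutes["NumPy CPU ETL"] / timings_minutes["NVTabular GPU ETL"]
train_speedup = timings_minutes["CPU training"] / timings_minutes["HugeCTR GPU training"]
print(etl_speedup, train_speedup)  # 660.0 96.0
```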
LET’S GO!
APPENDIX
GPU
Another Hardware Integration Example
[Diagram: the data path from storage to GPU, without GPUDirect vs with GPUDirect.]
MEMORY HIERARCHY IN GPUS
PARTS OF AN ETL PIPELINE

Part          Purpose                                          Individual Machine   Cloud Servers
Controller    Send instructions to other parts of the system   CPU (host)           Server
Devices       Parallelize calculation                          GPUs                 Workers
Storage       Data storage                                     Disk, Memory         Database server
KERNELS AND PAGEABLE MEMORY
Why they do not work together
Infinite loop.
BIG DATA FOR RETAIL
WORKING WITH NVTABULAR
Workflow

Step 1: Define Column Group
    cols = ["col_a", "col_b"]
Step 2: Transform Columns with Ops
    cleaned_cols = cols >> ops.FillMissing()
Step 3: Define Graph as Workflow
Step 4: Run Data through Workflow
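The `>>` pattern above builds a dataflow graph before any data moves; running the workflow later replays the recorded ops. A toy re-implementation of the idea in plain Python (simplified stand-ins, not the real nvtabular classes) shows the mechanics of the four steps:

```python
class Op:
    """Base class for a column transform (toy stand-in)."""
    def transform(self, rows):
        raise NotImplementedError

class FillMissing(Op):
    """Replace None values with a fill value."""
    def __init__(self, fill=0.0):
        self.fill = fill
    def transform(self, rows):
        return [{k: (self.fill if v is None else v) for k, v in r.items()}
                for r in rows]

class ColumnGroup:
    """A set of columns plus the ops recorded against them."""
    def __init__(self, columns, ops=None):
        self.columns = columns
        self.ops = ops or []
    def __rshift__(self, op):
        # Implements `cols >> op`: record the op, return a new group.
        return ColumnGroup(self.columns, self.ops + [op])

class Workflow:
    """Wraps a graph and runs data through its ops in order."""
    def __init__(self, graph):
        self.graph = graph
    def transform(self, rows):
        for op in self.graph.ops:
            rows = op.transform(rows)
        return rows

cols = ColumnGroup(["col_a", "col_b"])    # Step 1: define column group
cleaned_cols = cols >> FillMissing()      # Step 2: transform columns with ops
workflow = Workflow(cleaned_cols)         # Step 3: define graph as workflow
out = workflow.transform([{"col_a": None, "col_b": 2.0}])  # Step 4: run data
print(out)  # [{'col_a': 0.0, 'col_b': 2.0}]
```

The real NVTabular version of step 4 additionally fits statistics (e.g. category mappings) and executes the ops on the GPU via cuDF.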