INTRODUCTION TO CUDA - PART 1
TOPICS
• This lesson covers the following topics:
– CUDA principles
– CUDA programming models
– Thread Hierarchy
– Kernel
OBJECTIVES
• This lesson will allow you to achieve the following learning objectives:
– learning the basic concepts of CUDA;
– the CUDA programming model;
– the concepts of hardware thread and CUDA thread;
– the concept of kernel;
– basic CUDA programming;
– host and device in CUDA;
– memory in CUDA.
CUDA PROGRAMMING MODEL
• The CUDA model assumes that:
– CUDA kernels, made up of a number of threads, execute on a physically separate device that operates as a coprocessor to the host running the C program;
– both host and device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively;
– the host executes serial code and can, at any point, launch a parallel kernel on the device.
THREAD HIERARCHY
• Multi-dimensional organization of data and computation (up to 3 dimensions).
KERNELS
• Kernels are C functions that, when called, are executed N times in parallel by N different CUDA threads.
• A kernel is defined using the __global__ declaration specifier.
• The number of CUDA threads that execute a kernel for a given call is specified using the <<< >>> execution configuration syntax:
• Kernel_name<<< Dg, Db, Ns, S >>>(param1, param2, …);
– Dg: dim3 type, specifies the dimensions and size of the grid;
– Db: dim3 type, specifies the dimensions and size of each thread block;
– Ns: size_t type, specifies the number of bytes of shared memory that is dynamically allocated per block (optional, defaults to 0 bytes);
– S: cudaStream_t type, specifies the stream associated with this call (optional, defaults to stream 0).
• Launches are asynchronous!
– Use cudaDeviceSynchronize() to synchronize the host and the device (see the sketch below).
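As a concrete illustration of the execution configuration above, here is a minimal sketch; the kernel name, sizes, and explicit stream are illustrative assumptions, not taken from the slides:

// Hypothetical kernel: each of the N threads writes its own index into the output array.
__global__ void fill_indices(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

int main()
{
    const int N = 256;
    int *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(int));

    dim3 Dg(1);                  // Dg: grid of 1 thread block
    dim3 Db(N);                  // Db: block of N threads
    size_t Ns = 0;               // Ns: no dynamically allocated shared memory
    cudaStream_t S = 0;          // S: default stream

    fill_indices<<<Dg, Db, Ns, S>>>(d_out);  // asynchronous launch
    cudaDeviceSynchronize();                 // wait until the kernel has finished

    cudaFree(d_out);
    return 0;
}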
THREAD HIERARCHY
• Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
– threadIdx is a 3-component vector, so threads can form a one-, two-, or three-dimensional block of threads: the thread block;
– blockIdx is a built-in 3-component vector holding the unique ID of the thread block within the grid;
– blockDim is a built-in 3-component vector holding the dimensions of the thread block (see the sketch below).
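As an illustration of how threadIdx, blockIdx, and blockDim are typically combined, here is a minimal sketch for a 2D computation; the kernel name and matrix sizes are illustrative assumptions:

__global__ void matAdd(const float *A, const float *B, float *C,
                       int width, int height)
{
    // Global 2D coordinates of this thread within the whole grid.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < width && row < height) {
        int idx = row * width + col;     // linear index into the matrices
        C[idx] = A[idx] + B[idx];
    }
}

// Launch with a 2D grid of 2D blocks, e.g. for a 1024x768 matrix:
//   dim3 Db(16, 16);
//   dim3 Dg((1024 + 15) / 16, (768 + 15) / 16);
//   matAdd<<<Dg, Db>>>(dA, dB, dC, 1024, 768);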
MEMORY HIERARCHY
• CUDA threads may access data from multiple memory spaces during their execution.
• Each thread has private local memory.
• Each thread block has shared memory, visible to all threads of the block and with the same lifetime as the block.
• All threads have access to the same global memory.
• There are also two additional read-only memory spaces accessible by all threads:
– the constant memory space;
– the texture memory space.
ALLOCATING “DEVICE” MEMORY FOR DATA
• Use the cudaMalloc routine:

int size = N * sizeof(int);        // space for N integers
int *devA;                         // pointer to device memory
cudaMalloc((void**)&devA, size);
• There are other methods to allocate memory space on the device (see the sketch below):
– cudaMallocPitch()
– cudaMallocArray()
– …
– (see the Programming Guide)
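For example, a hedged sketch of cudaMallocPitch(), which allocates 2D data with padded rows so that each row starts at a properly aligned address; the array sizes are illustrative assumptions:

float *devM;
size_t pitch;                        // actual width of one row in bytes (>= width * sizeof(float))
int width = 100, height = 64;

cudaMallocPitch((void**)&devM, &pitch, width * sizeof(float), height);

// Inside a kernel, row r must be addressed through the pitch:
//   float *row = (float*)((char*)devM + r * pitch);
//   float element = row[c];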
• Host-side data used in the example:

#define N 256
...
int a[N], b[N], c[N];
TRANSFERRING DATA FROM HOST TO DEVICE
• Use the CUDA routine cudaMemcpy:

cudaMemcpy( Destination, Source, size, cudaMemcpyHostToDevice );

• where:
– Destination is a pointer to the destination in device memory;
– Source is a pointer to the source data on the host;
– size is the amount of memory to transfer, in bytes;
– cudaMemcpyHostToDevice is an enum value that specifies the direction of the copy.
• cudaMemcpy is a blocking function:
– the host must wait for the end of the copy.
• cudaMemcpyAsync() is the asynchronous version (see the sketch below).
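A minimal sketch that copies the host array a[] declared earlier into the device buffer devA; the asynchronous variant with a user-created stream is shown only as an assumption of typical usage:

// Blocking copy: returns only when the data has been transferred.
cudaMemcpy(devA, a, N * sizeof(int), cudaMemcpyHostToDevice);

// Asynchronous variant: returns immediately, the copy runs in the given stream.
// For real overlap the host buffer should be page-locked (cudaMallocHost / cudaHostAlloc).
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(devA, a, N * sizeof(int), cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);   // wait for the asynchronous copy to finish
cudaStreamDestroy(stream);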
LAUNCHING A KERNEL
• As seen before:
– kernel definition:

__global__ void VecAdd(int *A, int *B, int *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

– kernel launch:

int main()
{
    ...
    VecAdd<<<1, N>>>(devA, devB, devC);   // one block of N threads
    ...
}
TRANSFERRING DATA FROM DEVICE TO HOST
• Use the CUDA routine cudaMemcpy:

cudaMemcpy( Destination, Source, size, cudaMemcpyDeviceToHost );

• where:
– Destination is a pointer to the destination on the host;
– Source is a pointer to the source data in device memory;
– size is the amount of memory to transfer, in bytes;
– cudaMemcpyDeviceToHost is an enum value that specifies the direction of the copy.
• cudaMemcpy is a blocking function:
– the host must wait for the end of the copy.
• cudaMemcpyAsync() is the asynchronous version.
FREEING MEMORY
• Device: use cudaFree to deallocate memory previously allocated with cudaMalloc:

cudaFree(dev_ptr);

• Host: use the regular C free routine to deallocate memory previously allocated with malloc:

free(ptr);
PROGRAM EXAMPLE
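The original slide showed the full program as an image; what follows is a minimal sketch reconstructing a complete vector-add program from the fragments used in the previous slides (error checking omitted for brevity):

#include <stdio.h>

#define N 256

__global__ void VecAdd(int *A, int *B, int *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    int a[N], b[N], c[N];
    int *devA, *devB, *devC;
    int size = N * sizeof(int);

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // Allocate device memory.
    cudaMalloc((void**)&devA, size);
    cudaMalloc((void**)&devB, size);
    cudaMalloc((void**)&devC, size);

    // Copy input data from host to device.
    cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b, size, cudaMemcpyHostToDevice);

    // Launch one block of N threads.
    VecAdd<<<1, N>>>(devA, devB, devC);

    // Copy the result back to the host (blocking, so no explicit sync is needed).
    cudaMemcpy(c, devC, size, cudaMemcpyDeviceToHost);

    printf("c[10] = %d\n", c[10]);

    // Free device memory.
    cudaFree(devA); cudaFree(devB); cudaFree(devC);
    return 0;
}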
COMPILATION WITH NVCC
• Kernels can be written using the CUDA instruction set architecture, called PTX, or using a high-level language such as C; in both cases they must be compiled into binary code to execute on the device.
• NVCC is a compiler driver that simplifies the process of compiling C or PTX code.
OFFLINE COMPILATION
• Consists in separating device code from host code and then:
– compiling the device code into an assembly form (PTX code) and/or binary form (cubin object);
– modifying the host code by replacing the <<<…>>> syntax with the necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code and/or cubin object.
• The modified host code can be compiled using another tool, such as gcc.
• Applications can then link to the compiled host code (see the sketch below).
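In practice the whole workflow is driven by nvcc from the command line; a hedged sketch, assuming a single source file named vecadd.cu and a Volta-class GPU:

# Compile device and host code in one step and link the executable.
nvcc -arch=sm_70 -o vecadd vecadd.cu

# Stop after generating PTX, to inspect the intermediate assembly.
nvcc -arch=sm_70 -ptx vecadd.cu -o vecadd.ptx

# Compile the device code only, into a cubin object.
nvcc -arch=sm_70 -cubin vecadd.cu -o vecadd.cubin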
JUST‐IN‐TIME COMPILATION
• Any PTX code loaded by an application at runtime is compiled further to binary code by the device driver; this is called just-in-time compilation.
NVCC WORKFLOW
MEMORY ACCESS FROM DEVICE AND HOST
• Device code can read (but not write):
– per-grid constant memory (with caching);
– per-grid texture memory.
• The host can read and write global, constant, and texture memory.
HOST‐DEVICE CONNECTION
• First problem:
– Global memory bandwidth:
80 GB/s on G80;
500 GB/s on Maxwell;
1 TB/s on Pascal.
– PCI Express x16 Gen3 bandwidth:
16 GB/s peak.
– DDR3 maximum bandwidth:
17 GB/s peak.
• Data movement is expensive.
• For these reasons NVIDIA introduced the Dynamic Parallelism feature.
CUDA VARIABLE TYPE PERFORMANCE
Variable declaration              Memory     Penalty
int var;                          register   1x
int array_var[10];                local      100x
__shared__ int shared_var;        shared     1x
__device__ int global_var;        global     100x
__constant__ int constant_var;    constant   1x
• __shared__ variables reside in fast, on-chip memories.
• Thread-local arrays and __device__ global variables reside in uncached off-chip memory.
– Caches are available on newer GPUs, but off-chip accesses are still significantly slower.
• __constant__ variables reside in cached off-chip memory (declarations are sketched below).
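A minimal sketch of where each declaration qualifier typically appears; the variable names and sizes are illustrative assumptions:

__constant__ float coeff[16];          // constant memory: file scope, read-only from kernels
                                       // (set from the host with cudaMemcpyToSymbol)
__device__   float global_var;         // global memory: file scope, visible to all kernels

__global__ void memory_spaces_demo(float *out)
{
    int   var;                         // register: automatic scalar inside a kernel
    float array_var[10];               // local memory: per-thread array, usually off-chip

    __shared__ float shared_var[256];  // shared memory: one copy per thread block
                                       // (assumes a 256-thread block)

    int tid = threadIdx.x;
    shared_var[tid] = coeff[tid % 16] + global_var;
    __syncthreads();                   // make the shared data visible to the whole block

    array_var[0] = shared_var[tid];
    var = (int)array_var[0];
    out[blockIdx.x * blockDim.x + tid] = var;
}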
STORAGE LOCATIONS
LOCAL MEMORY
• Local memory does not exist physically:
– "local" refers to scope, not to location: it is specific to one thread;
– data in "local memory" is actually placed in cache or in global memory, at run time or by the compiler;
– if too many registers are needed for a computation ("high register pressure"), the excess data is spilled to local memory;
– access times for local memory are long, even when it is cached (see the sketch below).
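A small sketch of the kind of code that typically ends up using local memory; the array size is an illustrative assumption, and the actual placement depends on the compiler and the architecture:

__global__ void spill_example(float *out, int n)
{
    // A large per-thread array, especially one indexed dynamically,
    // cannot be kept entirely in registers and is usually placed in local memory.
    float buffer[64];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 64; i++)
        buffer[i] = tid * 0.5f + i;      // writes likely go to local memory

    if (tid < n)
        out[tid] = buffer[tid % 64];     // dynamic index forces local memory
}

Compiling with nvcc --ptxas-options=-v reports the registers, shared memory, and local memory (spill loads/stores) actually used per thread.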
REGISTER FILE
• Number of 32-bit registers in one SM:
– 16K on Tesla;
– 32K on Fermi;
– 64K on Kepler and Maxwell.
• Registers are dynamically partitioned among the threads resident on the SM.
GLOBAL MEMORY
• Uncached global and local memory, as well as constant and texture memory, are mapped to off-chip DRAM.
• The DRAM can be quite large:
– e.g. 4 GB, 6 GB, or 8 GB.
• Accessing the off-chip DRAM takes much longer than accessing on-chip memory.
MEMORY COALESCING
• The GPU executes 32-thread warps in a SIMD fashion.
• The device tries to coalesce the global memory accesses issued by the threads of a warp into as few memory transactions as possible: coalescing is critical for performance!
GLOBAL MEMORY ALIGNMENT ISSUES
– The CUDA driver aligns data in device memory at 256-byte boundaries.
– The device accesses global memory via 32-, 64-, or 128-byte transactions that are aligned to their size.
• Compute capability 1.0 (e.g. Tesla C870):
– misaligned accesses by a half warp of threads (or aligned but non-sequential accesses) result in 16 separate 32-byte transactions;
– each thread reads 4 bytes, so the effective bandwidth is reduced by a factor of eight!
• Compute capability 1.2 or 1.3 (e.g. Tesla C1060):
– misaligned accesses of contiguous data by a half warp are serviced in a few transactions that "cover" the requested data;
– far less penalty than in the C870 case.
• Compute capability 2.0 (e.g. Tesla C2050):
– L1 cache in each multiprocessor with a 128-byte line size;
– the device coalesces accesses by threads into as few cache lines as possible, hiding misalignment effects for sequential accesses.
COALESCING: IMPACT OF DATA LOCALITY
• Alignment issues for accesses with an offset:
– compute capability 1.0 and 1.1 require linear, aligned accesses from the threads for coalescing: bandwidth usage can be as low as 1/8;
– compute capability 1.2+ can coalesce accesses that fall into aligned segments (32-, 64-, or 128-byte segments on CC 1.2/1.3);
– 128-byte cache lines on compute capability 2.0 and higher.
• For accesses with large strides, the bandwidth is poor:
– this happens on all architectures: if addresses are far apart in physical memory, there is no chance for the GPU to combine the accesses.
• Multidimensional arrays often require strided access:
– e.g. scanning the elements of a matrix column (the stride is the row size);
– this can be mitigated by using shared memory: extract a 2D tile of the multidimensional array from global memory in a coalesced fashion into shared memory (see the sketch below).
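A hedged sketch of this tiling technique, applied to matrix transposition: the tile is read from global memory in a coalesced way, and the column-wise traversal happens only in shared memory. The tile size and kernel name are illustrative assumptions:

#define TILE 32

// Transposes a height x width row-major matrix into a width x height matrix.
__global__ void transposeTiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids shared memory bank conflicts

    // Coalesced read: consecutive threads (threadIdx.x) read consecutive addresses of a row.
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Coalesced write: the tile is re-read transposed from shared memory, so that
    // consecutive threads again write to consecutive addresses of the output.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

// Launch example for a 1024 x 1024 matrix:
//   dim3 Db(TILE, TILE);
//   dim3 Dg(1024 / TILE, 1024 / TILE);
//   transposeTiled<<<Dg, Db>>>(d_in, d_out, 1024, 1024);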
SHARED MEMORY
• On-chip memory: one for each SM.
• Organized in memory banks; bank width is 4 bytes (Kepler: 8 bytes).
• On an access, the hardware fetches the requested words and distributes the requested bytes among the threads (multi-cast capable).
SHARED MEMORY
• Shared memory is made of on-chip, physically separate memory banks.
• Shared memory is accessed concurrently by the 32 threads in a warp.
• The way data is distributed across the banks may impact performance: bank access conflicts serialize the accesses (see the sketch below).
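A small sketch contrasting a conflict-free access pattern with one that causes bank conflicts, assuming the usual 32 banks of 4-byte words; names and sizes are illustrative:

__global__ void bank_access_demo(float *out)
{
    __shared__ float data[32 * 32];
    int tid = threadIdx.x;           // assumes a 32-thread (single-warp) block

    // Initialize the shared array (stride-1 writes: conflict free).
    for (int i = tid; i < 32 * 32; i += 32)
        data[i] = (float)i;
    __syncthreads();

    // Conflict free: consecutive threads hit consecutive banks.
    float a = data[tid];

    // 32-way bank conflict: all addresses (tid * 32) map to the same bank,
    // so the 32 accesses of the warp are serialized.
    float b = data[tid * 32];

    out[tid] = a + b;                // out must hold at least 32 elements
}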
CONCLUSIONS
• In this lesson we covered the basic concepts of CUDA:
– CUDA principles;
– CUDA programming models;
– Thread Hierarchy;
– Kernel;
BIBLIOGRAPHY AND WEB REFERENCES
• CUDA C Programming Guide:
– https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
• David B. Kirk, Wen-mei W. Hwu - Programming Massively Parallel Processors: A Hands-on Approach (Second Edition)
COPYRIGHT
NOTICE
The works on this site have fulfilled the obligations arising from the legislation on copyright and related rights.
All contents are reserved literary property, protected by the copyright of Università degli Studi Guglielmo Marconi.
Please note that the teaching material provided is for the personal use of students, for educational purposes only.
Any other use will be subject to the sanctions provided for by Law no. 633 of 22 April 1941.
Copyright©UNIMARCONI