
INTRODUCTION TO CUDA ‐ PART 1


TOPICS
• This lecture covers the following topics:

– CUDA principles
– CUDA programming models
– Thread Hierarchy
– Kernel

– Basic CUDA programming



– HOST and DEVICE in CUDA


– MEMORY in CUDA
OBJECTIVES
• Taking this lecture will allow you to achieve the following learning objectives:

– learning the basic concepts of CUDA;
– the CUDA programming model;
– the concepts of Hardware Thread and Thread in CUDA;
– the concept of Kernel;
– Basic CUDA programming;
– HOST and DEVICE in CUDA;
– MEMORY in CUDA.


COMPUTE UNIFIED DEVICE ARCHITECTURE


• CUDA: a general purpose parallel computing
platform and programming model

– leverages the parallel compute engine in


NVIDIA GPUs to solve complex
computational problems.
• Emphasis on scalability.

• not a new programming language


– C/C++ is the programming language.

CUDA PROGRAMMING MODEL
• CUDA model assumes:

– CUDA kernels, made of a number of threads, execute on a
physically separate device that operates as a coprocessor
to the host running the C program;
– both host and device maintain their own separate
memory spaces in DRAM, referred to as host memory

and device memory, respectively;
– host executes serial code and can, at some time, launch a
parallel kernel on the device.


THREAD HIERARCHY
• Multi‐dimensional organization of data and
computation (up to 3 dimensions)

– a built‐in concept in CUDA:


– massively parallel computation usually takes place on
arrays and matrices.
• Groups of threads are organized into one‐, two‐, or three‐
dimensional grid of thread blocks.

• Each block within the grid can be identified by a one‐,


two‐, or three‐dimensional index:

– the index can be accessed within the kernel through


the built‐in blockIdx variable;
– the size of the thread block is accessible within the

kernel through the built‐in blockDim variable.


KERNELS
• C functions that, when called, are executed N times in parallel by N different CUDA threads

• A kernel is defined using the __global__ declaration specifier
• the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<
>>> execution configuration syntax
• Kernel_name<<< Dg, Db, Ns, S >>>(param1,param2,…);
– Dg: dim3 type, specifies the dimensions and size of the grid;
– Db: dim3 type, specifies the dimensions and size of each thread block;
– Ns: size_t type, specifies the number of bytes of shared memory that is dynamically allocated per block:
◦ optional;
◦ defaults to 0 bytes.
– S: cudaStream_t type, specifies the stream associated with this call:
◦ optional;
◦ defaults to 0 (the default stream).
• Launches are asynchronous!
– Use cudaDeviceSynchronize() to synchronize Host and Device.
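As an illustration of the full configuration syntax, a minimal sketch not taken from the original slide; the kernel name scale and the pointer devData are assumptions:

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Inside main(), with devData assumed already allocated via cudaMalloc:
dim3 Dg(64);              // Dg: grid of 64 blocks (1D)
dim3 Db(256);             // Db: 256 threads per block (1D)
size_t Ns = 0;            // Ns: no dynamically allocated shared memory
cudaStream_t S = 0;       // S: default stream
scale<<<Dg, Db, Ns, S>>>(devData, 64 * 256, 2.0f);
cudaDeviceSynchronize();  // launches are asynchronous: wait here for completion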

THREAD HIERARCHY
• Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

– threadIdx is a 3-component vector forming a one-, two-, or three-dimensional block of threads, the thread
block
– blockIdx is a 3-component vector
◦ is a built-in variable
◦ unique ID of a thread block
– blockDim is a 3-component vector
◦ is a built-in variable
◦ defines the dimension of a thread block

• There is a limit to the number of threads per block


– all threads of a block are expected to reside on the same
processor core
– must share the limited memory resources of that core

– Threads within a block can cooperate by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses using:
◦ __syncthreads()
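A hedged sketch of how these built-in variables are typically combined (the kernel name reverseInBlock is an assumption; it presumes 256-thread blocks and data holding gridDim.x * blockDim.x elements):

__global__ void reverseInBlock(int *data) {
    __shared__ int tile[256];                       // one element per thread of the block
    int local  = threadIdx.x;                       // thread index within the block
    int global = blockIdx.x * blockDim.x + local;   // unique index within the whole grid

    tile[local] = data[global];                     // stage the block's slice in shared memory
    __syncthreads();                                // wait until every thread has written its element

    data[global] = tile[blockDim.x - 1 - local];    // read back in reversed order
}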
MEMORY HIERARCHY
• CUDA threads may access data from multiple memory

spaces during their execution.
• Each thread has private local memory.
• Each thread block has shared memory visible to all
threads of the block and with the same lifetime as

the block.
• All threads have access to the same global memory.
• There are also two additional read‐only memory
spaces accessible by all threads:

– the constant memory space;
– the texture memory space.


BASIC CUDA PROGRAM STRUCTURE (HOST SIDE)


int main (int argc, char **argv ) {

1. Allocate memory space in device (GPU) for data


2. Allocate memory space in host (CPU) for data
3. Copy data to GPU
4. Call “kernel” routine to execute on GPU

5. Transfer results from GPU to CPU



6. Free memory space in device (GPU)


7. Free memory space in host (CPU)
return;

}
ALLOCATING “DEVICE” MEMORY FOR DATA
• Use cudaMalloc routines:

int size = N*sizeof(int); // space for N integers
int *devA; // devA ptr
cudaMalloc( (void**)&devA, size );

• There are other methods to allocate memory space on the device:
– cudaMallocPitch() (see the sketch after this list)

– cudaMallocArray()
– …
– (see Programming Guide)
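As referenced in the list, a minimal sketch of cudaMallocPitch for a 2D array; the matrix size and the name devMatrix are illustrative, not from the slide:

float *devMatrix;
size_t pitch;                     // width of one padded row, in bytes
int width = 64, height = 64;      // logical matrix size (illustrative values)
// Rows are padded so that each row starts at a properly aligned address
cudaMallocPitch((void**)&devMatrix, &pitch, width * sizeof(float), height);
// Element (row, col) is then addressed as:
// float *rowPtr = (float*)((char*)devMatrix + row * pitch);
// float value   = rowPtr[col];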

ALLOCATING “HOST” MEMORY FOR DATA


• Use regular C malloc routines:

int *a, *b, *c;



a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);

• or statically declare variables:



#define N 256
...
int a[N], b[N], c[N];

• To allocate page-locked memory in the host memory:


– cudaMallocHost()
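A minimal sketch of page-locked allocation (the name a_pinned is illustrative; size is assumed to be defined as in the earlier snippets):

int *a_pinned;
// Page-locked (pinned) host memory: faster host<->device transfers and
// required for truly asynchronous copies with cudaMemcpyAsync()
cudaMallocHost((void**)&a_pinned, size);
// ... use a_pinned like ordinary host memory ...
cudaFreeHost(a_pinned);           // released with cudaFreeHost(), not free()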
TRANSFERRING DATA FROM HOST TO DEVICE
• Use the CUDA routine cudaMemcpy:

cudaMemcpy( Destination, Source, size, cudaMemcpyHostToDevice);

• where:
– Destination is a pointer to the destination in device memory
– Source is a pointer to host data

– Size is the memory size to transfer
– cudaMemcpyHostToDevice is an enum
◦ Specifies the copy direction

• cudaMemcpy is a blocking function
– Host must wait for the end of the copy
• cudaMemcpyAsync() is the asynchronous version
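As an illustration of the asynchronous variant (the stream variable and the pinned buffer a_pinned are assumptions; page-locked host memory is needed for the copy to actually overlap with other work):

cudaStream_t stream;
cudaStreamCreate(&stream);
// Enqueue the copy on the stream; the call returns immediately
cudaMemcpyAsync(devA, a_pinned, size, cudaMemcpyHostToDevice, stream);
// ... the host can do other work or launch kernels on other streams here ...
cudaStreamSynchronize(stream);    // wait until the copy has completed
cudaStreamDestroy(stream);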

LAUNCHING A KERNEL
• As seen before:

– kernel definition:

__global__ void VecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];


}

– kernel launch:

int main(){

...
VecAdd<<<1, N>>>(devA, devB, devC);
...

}
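Because the launch itself reports no errors, a common hedged pattern (not shown on the original slide, and assuming stdio.h is included) is to query the error state after the call:

VecAdd<<<1, N>>>(devA, devB, devC);
cudaError_t err = cudaGetLastError();     // errors in the launch configuration
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();            // errors raised while the kernel was running
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));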
TRANSFERRING DATA FROM DEVICE TO HOST
• Use the CUDA routine cudaMemcpy:

cudaMemcpy( Destination, Source, size, cudaMemcpyDeviceToHost);

• where:
– Destination is a pointer to the destination in host memory
– Source is a pointer to device data

– Size is the memory size to transfer
– cudaMemcpyDeviceToHost is an enum
◦ Specifies the copy direction

• cudaMemcpy is a blocking function
– Host must wait for the end of the copy
• cudaMemcpyAsync() is the asynchronous version

FREE MEMORY SPACE IN DEVICE AND HOST


• Device: use CUDA cudaFree routines:

cudaFree(dev_ptr);

• Host: use regular C free routine to deallocate memory if previously allocated with
malloc:

free(ptr);
PROGRAM EXAMPLE

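The original slide shows the complete listing as a figure; below is a hedged reconstruction that follows the seven host-side steps listed earlier (the initialization values and the printf are illustrative additions):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 256

__global__ void VecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main(int argc, char **argv) {
    int size = N * sizeof(int);
    int *a, *b, *c;                 // host pointers
    int *devA, *devB, *devC;        // device pointers

    // 1. Allocate memory space in device (GPU) for data
    cudaMalloc((void**)&devA, size);
    cudaMalloc((void**)&devB, size);
    cudaMalloc((void**)&devC, size);

    // 2. Allocate memory space in host (CPU) for data
    a = (int*)malloc(size);
    b = (int*)malloc(size);
    c = (int*)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // 3. Copy data to GPU
    cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b, size, cudaMemcpyHostToDevice);

    // 4. Call "kernel" routine to execute on GPU
    VecAdd<<<1, N>>>(devA, devB, devC);

    // 5. Transfer results from GPU to CPU
    cudaMemcpy(c, devC, size, cudaMemcpyDeviceToHost);
    printf("c[1] = %d\n", c[1]);    // expected: 3

    // 6. Free memory space in device (GPU)
    cudaFree(devA); cudaFree(devB); cudaFree(devC);

    // 7. Free memory space in host (CPU)
    free(a); free(b); free(c);
    return 0;
}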

COMPILATION WITH NVCC

• Kernels can be written using the CUDA instruction set architecture, called PTX

(Parallel Thread Execution)


– It is however usually more effective to use a high-level programming
language such as C.
• In both cases, kernels must be compiled into binary code by NVCC to execute on the

device.
• NVCC is a compiler driver that simplifies the process of compiling C or PTX code

– The process can be divided in two phases:


• Offline compilation;
• Just-in-Time (JIT) compilation.
OFFLINE COMPILATION
• Consists of separating device code from host code and then:

– compiling the device code into an assembly form (PTX code) and/or binary form (cubin
object);
– modifying the host code by replacing the <<<…>>> syntax with the necessary CUDA C
runtime function calls to load and launch each compiled kernel from the PTX code and/or
cubin object.

• The modified host code can be compiled using another tool like gcc
• Applications can then:
– link to the compiled host code.


JUST‐IN‐TIME COMPILATION
• Any PTX code loaded by an application at runtime is compiled further to binary code by the

device driver (just‐in‐time):


– this increases application load time;
– but allows the application to benefit from any new compiler improvements coming
with each new device driver;
– It is also the only way for applications to run on devices that did not exist at the time the

application was compiled.



• Saving some JIT compilation time:


– when the device driver just‐in‐time compiles some PTX code for some application, it
automatically caches a copy of the generated binary code;

– avoid repeating the compilation in subsequent invocations of the application.


NVCC WORKFLOW

(Figure: NVCC compilation workflow.)

CUDA DEVICE MEMORY SPACE OVERVIEW


• Memory hierarchy as seen by a block while running on an
SM (see the figure).

• Each thread can:


– R/W per‐thread registers (~1 cycle)
– R/W per‐thread local memory
– R/W per‐block shared memory (~ 5 cycles)

– R/W per‐grid global memory (~ 400 cycles)


– Read only per‐grid constant memory (~5 cycles with caching)
– Read only per‐grid texture memory
• The host can R/W global, constant, and texture memory.

NOTE: Global, constant, and texture memory spaces are persistent between kernels called by the same host application.


HOST‐DEVICE CONNECTION
• First problem:

– Global Memory bandwidth:
◦ 80 GB/s on G80
◦ 500 GB/s on Maxwell
◦ 1 TB/s on Pascal
– PCI Express x16 Gen3 bandwidth:
◦ 16 GB/s peak
– DDR3 max bandwidth:
◦ 17 GB/s peak

• Data Movement is expensive.
• For these reasons NVIDIA introduced the
Dynamic Parallelism feature.


CUDA VARIABLE TYPE QUALIFIERS

Variable declaration              Memory      Scope     Lifetime
int var;                          register    thread    thread
int array_var[10];                local       thread    thread
__shared__ int shared_var;        shared      block     block
__device__ int global_var;        global      grid      application
__constant__ int constant_var;    constant    grid      application

• “automatic” scalar variables without qualifier reside in a register



– compiler may spill the variable to thread local memory.


• “automatic” array variables without qualifier reside in thread‐local memory.
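To make the table concrete, a hedged sketch showing where each qualifier may appear (all names are illustrative; it assumes blocks of at most 256 threads and out sized to the grid):

__constant__ float coeff[16];         // constant memory: read-only in kernels, written by the host
__device__   int   global_var;        // global memory: visible to all kernels, application lifetime

__global__ void qualifiersDemo(float *out) {
    int var = threadIdx.x;            // "automatic" scalar: normally kept in a register
    float array_var[10];              // "automatic" array: resides in thread-local memory
    __shared__ float shared_var[256]; // shared memory: one copy per block, same lifetime as the block

    shared_var[threadIdx.x] = coeff[threadIdx.x % 16];
    __syncthreads();
    array_var[0] = shared_var[threadIdx.x];
    out[blockIdx.x * blockDim.x + threadIdx.x] = array_var[0] + var;
}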
CUDA VARIABLE TYPE PERFORMANCE
Variable declaration              Memory      Penalty
int var;                          register    1x
int array_var[10];                local       100x
__shared__ int shared_var;        shared      1x
__device__ int global_var;        global      100x
__constant__ int constant_var;    constant    1x

• scalar variables reside in fast, on-chip registers.

• shared variables reside in fast, on-chip memories.
• thread-local arrays and global variables reside in uncached off- chip memory
– caches are available on newer GPUs, but are still significantly slower.

• constant variables reside in cached off-chip memory.

STORAGE LOCATIONS

Memory       Location    Cached    Access        Who
Register     On-chip     N/A       Read/Write    One thread
Shared       On-chip     N/A       Read/Write    All threads in a block
Global       Off-chip    Yes       Read/Write    All threads + host
Constant     Off-chip    Yes       Read          All threads + host
Texture      Off-chip    Yes       Read          All threads + host

NOTE: Global Memory caching depends on Compute Capability


LOCAL MEMORY
• Local memory does not exist physically

– “local” in scope but not in location: it’s
specific to one thread.
– Data in “local memory” is actually placed
in cache or the global memory at run time

or by the compiler.
– If too many registers are needed for
computation (“high register pressure”) the
exceeding data is stored in local memory.

– Long access times for local memory
◦ even when cached.
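A hedged sketch of a pattern that typically ends up in local memory: a per-thread array indexed with a value not known at compile time (the kernel name and sizes are illustrative):

__global__ void localDemo(const int *idx, float *out, int n) {
    float buf[32];                   // per-thread array; with the dynamic index below, the
                                     // compiler usually places it in (slow) local memory
    for (int k = 0; k < 32; k++)
        buf[k] = k * 0.5f;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = buf[idx[i] % 32];   // index unknown at compile time -> no register promotion
}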


REGISTER FILE
• Number of 32‐bit registers in one SM:

– 16K on Tesla
– 32K on Fermi
– 64K on Kepler and Maxwell
• Registers are dynamically partitioned across all Blocks assigned to the SM.



• Once assigned to a Block, these registers are not


accessible by threads in other Blocks.
• A thread in a Block can only access registers assigned
to itself

– Kepler and Maxwell: the compiler can assign up to 255 registers to a single thread.
GLOBAL MEMORY

• Uncached global and local memory
as well as constant and texture
memory are mapped to off‐chip
DRAM.

• The DRAM can be quite large:
– e.g. 4 GB, 6 GB, or 8 GB
• accessing the off‐chip DRAM takes

much longer than accessing on‐chip memory.


MEMORY COALESCING
• The GPU executes 32‐thread Warps (or 16‐thread Half Warps, depending on the device) in a SIMD fashion


– Load/Store operations are also executed concurrently
– the different addresses accessed by the threads in a Warp may dramatically affect the
performance.

• The device tries to coalesce global memory accesses issued by threads in a warp into as few

transactions as possible to minimize the used DRAM bandwidth


– relatively large windows (e.g. 32, 64, or 128 bytes) in the DRAM can be accessed by a single
transaction
– windows need to be properly aligned (e.g. on addresses multiple of 32, 64, or 128, resp.) to

be accessed by a single transaction


• “strided” memory accesses (i.e. addresses spaced out by a fixed stride) can hurt

performance!
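As an illustration (not from the original slides), two access patterns a warp might issue; the kernel names are hypothetical:

// Coalesced: consecutive threads read consecutive addresses, so the 32 loads of a
// warp typically fall into one or two aligned memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart, so the warp
// touches many distinct segments and the effective bandwidth drops sharply.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}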
GLOBAL MEMORY ALIGNMENT ISSUES
– CUDA driver aligns data in device memory at 256 byte boundaries

– device accesses global memory via 32‐, 64‐, or 128‐byte transactions that are
aligned to their size
• Compute capability 1.0 (e.g. Tesla C870)
– misaligned accesses by a half warp of threads (or aligned but non‐ sequential accesses)
results in 16 separate 32‐byte transactions.
– each reads 4 bytes → bandwidth reduced by a factor of eight!

• Compute capability 1.2 or 1.3 (e.g. Tesla C1060)
– misaligned accesses of contiguous data by a half warp are serviced in a few
transactions that “cover” the requested data

– far less penalty than the C870 case
• Compute capability 2.0 (e.g. Tesla C2050)
– L1 cache in each multiprocessor with a 128‐byte line size
– The device coalesces accesses by threads into as few cache lines as possible, hiding
misalignment effects for sequential accesses

COALESCING: IMPACT OF DATA LOCALITY
• Alignment issues for accesses with offset

– Compute capability 1.0, 1.1 requires linear aligned accesses from threads for coalescing → bandwidth usage can be as low as 1/8
– Compute capability 1.2+ can coalesce accesses that fall into aligned segments → 32, 64, or 128 byte segments on CC 1.2/1.3
– 128‐byte cache lines on Compute Capability 2.0 and higher

• For accesses with large strides, the bandwidth is poor
– happens on all architectures: if addresses are far apart in physical memory, there’s no
chance for the GPU to combine the accesses

• Multidimensional arrays often require strided access
– e.g. scan the elements of a matrix column (stride is the row size)
– can be mitigated by using Shared Memory:
◦ extract a 2D tile of a multidimensional array from global memory in a coalesced fashion into shared memory (see the sketch below)
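A hedged sketch of that tiling idea, applied to a matrix transpose (TILE_DIM, the kernel name, and the +1 padding that avoids shared-memory bank conflicts are illustrative choices; a TILE_DIM x TILE_DIM thread block and a grid covering the whole matrix are assumed):

#define TILE_DIM 32

__global__ void transposeTiled(const float *in, float *out, int width, int height) {
    // +1 column of padding so that threads of a warp hit different banks
    // when they later read the tile column-wise
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read, row by row
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;                   // swap block coordinates
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write, row by row
}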

SHARED MEMORY
• On‐chip memory: one for each SM.

• Comparing Shared Memory to Global Memory:


– One order of magnitude (20‐30x) lower latency;
– One order of magnitude (~10x) higher bandwidth;
– accessed at bank‐width granularity (minimum chunk of data that can be handled):
◦ Fermi: 4 bytes;
◦ Kepler: 8 bytes;
◦ For comparison: Global Memory granularity is either 32 or 128 bytes.


• Shared Memory instruction operation:
– 32 threads in a warp provide addresses;
– determine into which 8‐byte words (4‐byte for Fermi) addresses fall;

– fetch the words, distribute the requested bytes among the threads:
◦ Multi‐cast capable;
◦ Bank conflicts cause serialization of accesses → increased access time.


SHARED MEMORY
• Shared Memory is made of on‐chip, physically separate memory banks
• Shared Memory is accessed concurrently by the 32 threads in a warp

• The way data is distributed
across the banks may impact
performance

◦ bank access conflicts


CONCLUSIONS
• In this lecture we covered the basic concepts of CUDA:

– CUDA principles;
– CUDA programming models;
– Thread Hierarchy;
– Kernel;

– Basic CUDA programming;



– HOST and DEVICE in CUDA;


– MEMORY in CUDA.
BIBLIOGRAPHY / WEB REFERENCES
• CUDA C Programming Guide:

– https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf

• CUDA C Best Practices Guide:


– http://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf

• David B. Kirk, Wen‐mei W. Hwu – Programming Massively Parallel Processors: A Hands‐on Approach (Second Edition)


COPYRIGHT

NOTICE

Pursuant to Art. 1, paragraph 1 of Decree-Law no. 72 of 22 March 2004, as amended by Conversion Law no. 128 of 21 May 2004, the works available on this site have fulfilled the obligations arising from copyright and related-rights legislation.

All contents are reserved literary property, protected by the copyright of the Università degli Studi Guglielmo Marconi.
The teaching material provided is for students' personal use, for educational purposes only.
Any other use will be subject to the penalties provided for by Law no. 633 of 22 April 1941.

Copyright©UNIMARCONI
