INTRODUCTION TO CUDA - PART 1
TOPICS
• This lesson covers the following topics:
– CUDA principles
– CUDA programming models
– Thread Hierarchy
– Kernel
OBJECTIVES
• This lesson will allow you to achieve the following learning objectives:
– learning the basic concepts of CUDA;
– the CUDA programming model;
– the concepts of hardware thread and CUDA thread;
– the concept of kernel;
– basic CUDA programming;
– host and device in CUDA;
– memory in CUDA.
CUDA PROGRAMMING MODEL
• The CUDA model assumes that:
– CUDA kernels, made up of a number of threads, execute on a physically separate device that operates as a coprocessor to the host running the C program;
– both host and device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively;
– the host executes serial code and can, at any point, launch a parallel kernel on the device.
THREAD HIERARCHY
• Multi-dimensional organization of data and computation (up to 3 dimensions).
KERNELS
• Kernels are C functions that, when called, are executed N times in parallel by N different CUDA threads.
• A kernel is defined using the __global__ declaration specifier.
• The number of CUDA threads that execute a kernel for a given call is specified using the <<< >>> execution configuration syntax:
• Kernel_name<<< Dg, Db, Ns, S >>>(param1, param2, …);
– Dg: dim3 type, specifies the dimensions and size of the grid;
– Db: dim3 type, specifies the dimensions and size of each thread block;
– Ns: size_t type, specifies the number of bytes of shared memory that is dynamically allocated per block (optional, defaults to 0 bytes);
– S: cudaStream_t type, specifies the stream associated with this call (optional, defaults to stream 0).
• Launches are asynchronous!
– Use cudaDeviceSynchronize() to synchronize the host and the device (see the sketch below).
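As a concrete illustration of the execution configuration above, here is a minimal sketch; the kernel name, sizes, and explicit stream are illustrative assumptions, not taken from the slides:

// Hypothetical kernel: each of the N threads writes its own index into the output array.
__global__ void fill_indices(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

int main()
{
    const int N = 256;
    int *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(int));

    dim3 Dg(1);                  // Dg: grid of 1 thread block
    dim3 Db(N);                  // Db: block of N threads
    size_t Ns = 0;               // Ns: no dynamically allocated shared memory
    cudaStream_t S = 0;          // S: default stream

    fill_indices<<<Dg, Db, Ns, S>>>(d_out);  // asynchronous launch
    cudaDeviceSynchronize();                 // wait until the kernel has finished

    cudaFree(d_out);
    return 0;
}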
THREAD HIERARCHY
• Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
– threadIdx is a 3-component vector, so threads can form a one-, two-, or three-dimensional block of threads: the thread block;
– blockIdx is a built-in 3-component vector holding the unique ID of the thread block within the grid;
– blockDim is a built-in 3-component vector holding the dimensions of the thread block (see the sketch below).
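As an illustration of how threadIdx, blockIdx, and blockDim are typically combined, here is a minimal sketch for a 2D computation; the kernel name and matrix sizes are illustrative assumptions:

__global__ void matAdd(const float *A, const float *B, float *C,
                       int width, int height)
{
    // Global 2D coordinates of this thread within the whole grid.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < width && row < height) {
        int idx = row * width + col;     // linear index into the matrices
        C[idx] = A[idx] + B[idx];
    }
}

// Launch with a 2D grid of 2D blocks, e.g. for a 1024x768 matrix:
//   dim3 Db(16, 16);
//   dim3 Dg((1024 + 15) / 16, (768 + 15) / 16);
//   matAdd<<<Dg, Db>>>(dA, dB, dC, 1024, 768);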
MEMORY HIERARCHY
• CUDA threads may access data from multiple memory spaces during their execution.
• Each thread has private local memory.
• Each thread block has shared memory, visible to all threads of the block and with the same lifetime as the block.
• All threads have access to the same global memory.
• There are also two additional read-only memory spaces accessible by all threads:
– the constant memory space;
– the texture memory space.
ALLOCATING “DEVICE” MEMORY FOR DATA
• Use the cudaMalloc routine:

int size = N * sizeof(int);        // space for N integers
int *devA;                         // pointer to device memory
cudaMalloc((void**)&devA, size);
• There are other methods to allocate memory space on the device (see the sketch below):
– cudaMallocPitch()
– cudaMallocArray()
– …
– (see the Programming Guide)
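For example, a hedged sketch of cudaMallocPitch(), which allocates 2D data with padded rows so that each row starts at a properly aligned address; the array sizes are illustrative assumptions:

float *devM;
size_t pitch;                        // actual width of one row in bytes (>= width * sizeof(float))
int width = 100, height = 64;

cudaMallocPitch((void**)&devM, &pitch, width * sizeof(float), height);

// Inside a kernel, row r must be addressed through the pitch:
//   float *row = (float*)((char*)devM + r * pitch);
//   float element = row[c];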
• Host-side data used in the example:

#define N 256
...
int a[N], b[N], c[N];
TRANSFERRING DATA FROM HOST TO DEVICE
• Use the CUDA routine cudaMemcpy:

cudaMemcpy( Destination, Source, size, cudaMemcpyHostToDevice );

• where:
– Destination is a pointer to the destination in device memory;
– Source is a pointer to the source data on the host;
– size is the amount of memory to transfer, in bytes;
– cudaMemcpyHostToDevice is an enum value that specifies the direction of the copy.
• cudaMemcpy is a blocking function:
– the host must wait for the end of the copy.
• cudaMemcpyAsync() is the asynchronous version (see the sketch below).
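A minimal sketch that copies the host array a[] declared earlier into the device buffer devA; the asynchronous variant with a user-created stream is shown only as an assumption of typical usage:

// Blocking copy: returns only when the data has been transferred.
cudaMemcpy(devA, a, N * sizeof(int), cudaMemcpyHostToDevice);

// Asynchronous variant: returns immediately, the copy runs in the given stream.
// For real overlap the host buffer should be page-locked (cudaMallocHost / cudaHostAlloc).
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(devA, a, N * sizeof(int), cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);   // wait for the asynchronous copy to finish
cudaStreamDestroy(stream);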
LAUNCHING A KERNEL
• As seen before:
– kernel definition:

__global__ void VecAdd(int *A, int *B, int *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

– kernel launch:

int main()
{
    ...
    VecAdd<<<1, N>>>(devA, devB, devC);   // one block of N threads
    ...
}
TRANSFERRING DATA FROM DEVICE TO HOST
• Use the CUDA routine cudaMemcpy:

cudaMemcpy( Destination, Source, size, cudaMemcpyDeviceToHost );

• where:
– Destination is a pointer to the destination on the host;
– Source is a pointer to the source data in device memory;
– size is the amount of memory to transfer, in bytes;
– cudaMemcpyDeviceToHost is an enum value that specifies the direction of the copy.
• cudaMemcpy is a blocking function:
– the host must wait for the end of the copy.
• cudaMemcpyAsync() is the asynchronous version.
FREEING MEMORY
• Device: use cudaFree to deallocate memory previously allocated with cudaMalloc:

cudaFree(dev_ptr);

• Host: use the regular C free routine to deallocate memory previously allocated with malloc:

free(ptr);
PROGRAM EXAMPLE
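The original slide showed the full program as an image; what follows is a minimal sketch reconstructing a complete vector-add program from the fragments used in the previous slides (error checking omitted for brevity):

#include <stdio.h>

#define N 256

__global__ void VecAdd(int *A, int *B, int *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    int a[N], b[N], c[N];
    int *devA, *devB, *devC;
    int size = N * sizeof(int);

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    // Allocate device memory.
    cudaMalloc((void**)&devA, size);
    cudaMalloc((void**)&devB, size);
    cudaMalloc((void**)&devC, size);

    // Copy input data from host to device.
    cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, b, size, cudaMemcpyHostToDevice);

    // Launch one block of N threads.
    VecAdd<<<1, N>>>(devA, devB, devC);

    // Copy the result back to the host (blocking, so no explicit sync is needed).
    cudaMemcpy(c, devC, size, cudaMemcpyDeviceToHost);

    printf("c[10] = %d\n", c[10]);

    // Free device memory.
    cudaFree(devA); cudaFree(devB); cudaFree(devC);
    return 0;
}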
COMPILATION WITH NVCC
• Kernels can be written using the CUDA instruction set architecture, called PTX, or using a high-level language such as C; in both cases they must be compiled into binary code to execute on the device.
• NVCC is a compiler driver that simplifies the process of compiling C or PTX code.
OFFLINE COMPILATION
• Consists in separating device code from host code and then:
– compiling the device code into an assembly form (PTX code) and/or binary form (cubin object);
– modifying the host code by replacing the <<<…>>> syntax with the necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code and/or cubin object.
• The modified host code can be compiled using another tool, such as gcc.
• Applications can then link to the compiled host code (see the sketch below).
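In practice the whole workflow is driven by nvcc from the command line; a hedged sketch, assuming a single source file named vecadd.cu and a Volta-class GPU:

# Compile device and host code in one step and link the executable.
nvcc -arch=sm_70 -o vecadd vecadd.cu

# Stop after generating PTX, to inspect the intermediate assembly.
nvcc -arch=sm_70 -ptx vecadd.cu -o vecadd.ptx

# Compile the device code only, into a cubin object.
nvcc -arch=sm_70 -cubin vecadd.cu -o vecadd.cubin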
JUST‐IN‐TIME COMPILATION
• Any PTX code loaded by an application at runtime is compiled further to binary code by the device driver; this is called just-in-time compilation.
NVCC WORKFLOW
MEMORY ACCESS FROM DEVICE AND HOST
• Device code can read (but not write):
– per-grid constant memory (with caching);
– per-grid texture memory.
• The host can read and write global, constant, and texture memory.
HOST‐DEVICE CONNECTION
• First problem:
– Global memory bandwidth:
80 GB/s on G80;
500 GB/s on Maxwell;
1 TB/s on Pascal.
– PCI Express x16 Gen3 bandwidth:
16 GB/s peak.
– DDR3 maximum bandwidth:
17 GB/s peak.
• Data movement is expensive.
• For these reasons NVIDIA introduced the Dynamic Parallelism feature.
CUDA VARIABLE TYPE PERFORMANCE
Variable declaration              Memory     Penalty
int var;                          register   1x
int array_var[10];                local      100x
__shared__ int shared_var;        shared     1x
__device__ int global_var;        global     100x
__constant__ int constant_var;    constant   1x
• __shared__ variables reside in fast, on-chip memories.
• Thread-local arrays and __device__ global variables reside in uncached off-chip memory.
– Caches are available on newer GPUs, but off-chip accesses are still significantly slower.
• __constant__ variables reside in cached off-chip memory (declarations are sketched below).
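A minimal sketch of where each declaration qualifier typically appears; the variable names and sizes are illustrative assumptions:

__constant__ float coeff[16];          // constant memory: file scope, read-only from kernels
                                       // (set from the host with cudaMemcpyToSymbol)
__device__   float global_var;         // global memory: file scope, visible to all kernels

__global__ void memory_spaces_demo(float *out)
{
    int   var;                         // register: automatic scalar inside a kernel
    float array_var[10];               // local memory: per-thread array, usually off-chip

    __shared__ float shared_var[256];  // shared memory: one copy per thread block
                                       // (assumes a 256-thread block)

    int tid = threadIdx.x;
    shared_var[tid] = coeff[tid % 16] + global_var;
    __syncthreads();                   // make the shared data visible to the whole block

    array_var[0] = shared_var[tid];
    var = (int)array_var[0];
    out[blockIdx.x * blockDim.x + tid] = var;
}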
STORAGE LOCATIONS
LOCAL MEMORY
• Local memory does not exist physically:
– "local" refers to scope, not to location: it is specific to one thread;
– data in "local memory" is actually placed in cache or in global memory, at run time or by the compiler;
– if too many registers are needed for a computation ("high register pressure"), the excess data is spilled to local memory;
– access times for local memory are long, even when it is cached (see the sketch below).
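A small sketch of the kind of code that typically ends up using local memory; the array size is an illustrative assumption, and the actual placement depends on the compiler and the architecture:

__global__ void spill_example(float *out, int n)
{
    // A large per-thread array, especially one indexed dynamically,
    // cannot be kept entirely in registers and is usually placed in local memory.
    float buffer[64];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 64; i++)
        buffer[i] = tid * 0.5f + i;      // writes likely go to local memory

    if (tid < n)
        out[tid] = buffer[tid % 64];     // dynamic index forces local memory
}

Compiling with nvcc --ptxas-options=-v reports the registers, shared memory, and local memory (spill loads/stores) actually used per thread.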
REGISTER FILE
• Number of 32-bit registers in one SM:
– 16K on Tesla;
– 32K on Fermi;
– 64K on Kepler and Maxwell.
• Registers are dynamically partitioned among the threads resident on the SM.
GLOBAL MEMORY
• Uncached global and local memory, as well as constant and texture memory, are mapped to off-chip DRAM.
• The DRAM can be quite large:
– e.g. 4 GB, 6 GB, or 8 GB.
• Accessing the off-chip DRAM takes much longer than accessing on-chip memory.
MEMORY COALESCING
• The GPU executes 32-thread warps in a SIMD fashion.
• The device tries to coalesce the global memory accesses issued by the threads of a warp into as few memory transactions as possible: coalescing is critical for performance!
GLOBAL MEMORY ALIGNMENT ISSUES
– The CUDA driver aligns data in device memory at 256-byte boundaries.
– The device accesses global memory via 32-, 64-, or 128-byte transactions that are aligned to their size.
• Compute capability 1.0 (e.g. Tesla C870):
– misaligned accesses by a half warp of threads (or aligned but non-sequential accesses) result in 16 separate 32-byte transactions;
– each thread reads 4 bytes, so the effective bandwidth is reduced by a factor of eight!
• Compute capability 1.2 or 1.3 (e.g. Tesla C1060):
– misaligned accesses of contiguous data by a half warp are serviced in a few transactions that "cover" the requested data;
– far less penalty than in the C870 case.
• Compute capability 2.0 (e.g. Tesla C2050):
– L1 cache in each multiprocessor with a 128-byte line size;
– the device coalesces accesses by threads into as few cache lines as possible, hiding misalignment effects for sequential accesses.
COALESCING: IMPACT OF DATA LOCALITY
• Alignment issues for accesses with an offset:
– compute capability 1.0 and 1.1 require linear, aligned accesses from the threads for coalescing: bandwidth usage can be as low as 1/8;
– compute capability 1.2+ can coalesce accesses that fall into aligned segments (32-, 64-, or 128-byte segments on CC 1.2/1.3);
– 128-byte cache lines on compute capability 2.0 and higher.
• For accesses with large strides, the bandwidth is poor:
– this happens on all architectures: if addresses are far apart in physical memory, there is no chance for the GPU to combine the accesses.
• Multidimensional arrays often require strided access:
– e.g. scanning the elements of a matrix column (the stride is the row size);
– this can be mitigated by using shared memory: extract a 2D tile of the multidimensional array from global memory in a coalesced fashion into shared memory (see the sketch below).
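A hedged sketch of this tiling technique, applied to matrix transposition: the tile is read from global memory in a coalesced way, and the column-wise traversal happens only in shared memory. The tile size and kernel name are illustrative assumptions:

#define TILE 32

// Transposes a height x width row-major matrix into a width x height matrix.
__global__ void transposeTiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids shared memory bank conflicts

    // Coalesced read: consecutive threads (threadIdx.x) read consecutive addresses of a row.
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Coalesced write: the tile is re-read transposed from shared memory, so that
    // consecutive threads again write to consecutive addresses of the output.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

// Launch example for a 1024 x 1024 matrix:
//   dim3 Db(TILE, TILE);
//   dim3 Dg(1024 / TILE, 1024 / TILE);
//   transposeTiled<<<Dg, Db>>>(d_in, d_out, 1024, 1024);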
SHARED MEMORY
• On-chip memory: one for each SM.
• Organized in memory banks; bank width is 4 bytes (Kepler: 8 bytes).
• On an access, the hardware fetches the requested words and distributes the requested bytes among the threads (multi-cast capable).
SHARED MEMORY
• Shared memory is made of on-chip, physically separate memory banks.
• Shared memory is accessed concurrently by the 32 threads in a warp.
• The way data is distributed across the banks may impact performance: bank access conflicts serialize the accesses (see the sketch below).
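A small sketch contrasting a conflict-free access pattern with one that causes bank conflicts, assuming the usual 32 banks of 4-byte words; names and sizes are illustrative:

__global__ void bank_access_demo(float *out)
{
    __shared__ float data[32 * 32];
    int tid = threadIdx.x;           // assumes a 32-thread (single-warp) block

    // Initialize the shared array (stride-1 writes: conflict free).
    for (int i = tid; i < 32 * 32; i += 32)
        data[i] = (float)i;
    __syncthreads();

    // Conflict free: consecutive threads hit consecutive banks.
    float a = data[tid];

    // 32-way bank conflict: all addresses (tid * 32) map to the same bank,
    // so the 32 accesses of the warp are serialized.
    float b = data[tid * 32];

    out[tid] = a + b;                // out must hold at least 32 elements
}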
CONCLUSIONS
• In this lesson we covered the basic concepts of CUDA:
– CUDA principles;
– CUDA programming models;
– Thread Hierarchy;
– Kernel;
BIBLIOGRAPHY AND WEB REFERENCES
• CUDA C Programming Guide:
– https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
• David B. Kirk, Wen-mei W. Hwu - Programming Massively Parallel Processors: A Hands-on Approach (Second Edition)
COPYRIGHT
NOTICE
The works on this site have fulfilled the obligations arising from the legislation on copyright and related rights.
All contents are reserved literary property, protected by the copyright of Università degli Studi Guglielmo Marconi.
Please note that the teaching material provided is for the personal use of students, for educational purposes only.
Any other use will be subject to the sanctions provided for by Law no. 633 of 22 April 1941.
Copyright©UNIMARCONI