HPC
CUDA Kernel Execution
• GPUs do work in parallel. GPU work is done in threads, and many threads run in parallel.
• A collection of threads is a block, and there are many blocks. A collection of blocks is a grid.
• GPU functions are called kernels. Kernels are launched with an execution configuration, for example performWork<<<2, 4>>>().
• The execution configuration defines the number of blocks in the grid as well as the number of threads in each block.
• Every block in the grid contains the same number of threads.
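A minimal sketch of such a launch (the empty kernel body is a placeholder; the host-side synchronization call is an assumption about how the kernel would typically be used):

// Kernel: every launched thread executes this function body in parallel.
__global__ void performWork()
{
    // (work would go here)
}

int main()
{
    // Execution configuration: 2 blocks in the grid, 4 threads per block.
    performWork<<<2, 4>>>();

    // Wait for the GPU to finish before the host program exits.
    cudaDeviceSynchronize();
    return 0;
}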
CUDA-Provided Thread Hierarchy Variables
• Inside kernel definitions, CUDA-provided variables describe the executing thread, block, and grid.
• gridDim.x is the number of blocks in the grid; for performWork<<<2, 4>>>() it is 2.
• blockIdx.x is the index of the current block within the grid; here it is 0 or 1.
• blockDim.x is the number of threads in a block; here it is 4. All blocks in a grid contain the same number of threads.
• threadIdx.x is the index of the thread within its block; here it is 0, 1, 2, or 3 in each block.
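A hedged sketch that prints these variables for each of the 8 threads launched by performWork<<<2, 4>>>() (device-side printf is assumed to be available, as on all recent GPUs):

#include <cstdio>

__global__ void performWork()
{
    // gridDim.x  = number of blocks in the grid   (2 here)
    // blockDim.x = number of threads per block    (4 here)
    // blockIdx.x / threadIdx.x identify this particular thread.
    printf("block %d of %d, thread %d of %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

int main()
{
    performWork<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}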
Coordinating Parallel Threads
• Assume data is in a 0-indexed vector of 8 elements (indices 0 through 7), and the kernel is launched with performWork<<<2, 4>>>().
• Somehow, each thread must be mapped to work on an element in the vector.
• Recall that each thread has access to the size of its block via blockDim.x, the index of its block within the grid via blockIdx.x, and its own index within its block via threadIdx.x.
• Using these variables, the formula threadIdx.x + blockIdx.x * blockDim.x maps each thread to one element in the vector:

threadIdx.x + blockIdx.x * blockDim.x = dataIndex
     0      +     0      *     4      =    0
     1      +     0      *     4      =    1
     2      +     0      *     4      =    2
     3      +     0      *     4      =    3
     0      +     1      *     4      =    4
     1      +     1      *     4      =    5
     2      +     1      *     4      =    6
     3      +     1      *     4      =    7
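A minimal sketch of how this mapping might be used (the 8-element vector, the doubling operation, and the use of unified memory are illustrative assumptions):

__global__ void performWork(int *data)
{
    // Map this thread onto exactly one element of the vector.
    int dataIndex = threadIdx.x + blockIdx.x * blockDim.x;
    data[dataIndex] *= 2;
}

int main()
{
    const int N = 8;
    int *data;
    cudaMallocManaged(&data, N * sizeof(int));   // unified memory, visible to host and device
    for (int i = 0; i < N; ++i) data[i] = i;

    performWork<<<2, 4>>>(data);                 // 2 blocks * 4 threads = 8 threads, one per element
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}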
Grid Size Work Amount Mismatch
• In the previous scenario, the number of threads in the grid matched the number of elements exactly.
• What if there are more threads than work to be done? For example, performWork<<<2, 4>>>() launches 8 threads, but the vector may contain only N = 5 elements (indices 0 through 4).
• Attempting to access non-existent elements can result in a runtime error.
• Code must check that the dataIndex calculated by threadIdx.x + blockIdx.x * blockDim.x is less than N, the number of data elements.
• The threads of block 0 compute dataIndex values 0 through 3, all within range. For the threads of block 1:

threadIdx.x + blockIdx.x * blockDim.x = dataIndex   dataIndex < N (N = 5)   Can work
     0      +     1      *     4      =    4        true                    yes
     1      +     1      *     4      =    5        false                   no
     2      +     1      *     4      =    6        false                   no
     3      +     1      *     4      =    7        false                   no
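A hedged sketch of the guarded kernel (N = 5 and the doubling operation are illustrative assumptions; the bounds check itself is the point):

__global__ void performWork(int *data, int N)
{
    int dataIndex = threadIdx.x + blockIdx.x * blockDim.x;

    // Threads whose index falls beyond the data do nothing,
    // preventing out-of-bounds accesses.
    if (dataIndex < N)
    {
        data[dataIndex] *= 2;
    }
}

int main()
{
    const int N = 5;                             // only 5 elements, but 8 threads launched
    int *data;
    cudaMallocManaged(&data, N * sizeof(int));
    for (int i = 0; i < N; ++i) data[i] = i;

    performWork<<<2, 4>>>(data, N);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}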
UNIT-5
CUDA THREADS
CUDA THREAD ORGANIZATION
• All CUDA threads in a grid execute the same kernel function and they
rely on coordinates to distinguish themselves from each other and to
identify the appropriate portion of the data to process.
• These threads are organized into a two-level hierarchy:
  - a grid consists of one or more blocks
  - each block in turn consists of one or more threads
• All threads in a block share the same block index, which can be accessed
as the blockIdx variable in a kernel.
• Each thread also has a thread index, which can be accessed as the
threadIdx variable in a kernel.
• The execution configuration parameters in a kernel launch statement specify the dimensions of the grid and the dimensions of each block. These dimensions are available as the predefined built-in variables gridDim and blockDim in kernel functions.
• When a thread executes a kernel function, references to the blockIdx and threadIdx variables return the coordinates of the thread.
• The exact organization of a grid is determined by the execution configuration parameters (within <<< >>>) of the kernel launch statement.
• Each such parameter is of dim3 type, which is a C struct with three unsigned integer fields: x, y, and z. These three fields correspond to the three dimensions.
Example : 1
• Host code can be used to launch the vecAddKernel() kernel function and generate a 1D grid that consists of 128 blocks, each of which consists of 32 threads. The total number of threads in the grid is 128 * 32 = 4,096.
• Note that dimBlock and dimGrid are host code variables defined by the programmer; see the sketch below.
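A minimal sketch of such host code. The vecAddKernel signature (arguments A, B, C, and n) is an assumption made for illustration:

// Assumed kernel: each thread adds one element of A and B into C.
__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void launchVecAdd(float *d_A, float *d_B, float *d_C, int n)
{
    // Host-defined execution configuration variables.
    dim3 dimGrid(128, 1, 1);    // 1D grid of 128 blocks
    dim3 dimBlock(32, 1, 1);    // 32 threads per block -> 128 * 32 = 4,096 threads

    vecAddKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, n);
}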
Example : 2
• The total size of a block is limited to 1,024 threads, with flexibility in distributing these threads across the three dimensions as long as the total number of threads does not exceed 1,024.
• For example, (512, 1, 1), (8, 16, 4), and (32, 16, 2) are all allowable blockDim values, but (32, 32, 2) is not allowable since the total number of threads would exceed 1,024.
• A grid can have higher dimensionality than its blocks, and vice versa (see the sketch below).
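A minimal sketch of these rules; the kernel name gridSketchKernel and the particular grid shape are hypothetical placeholders:

__global__ void gridSketchKernel() { /* work omitted in this sketch */ }

void launchExamples()
{
    dim3 ok1(512, 1, 1);     //   512 threads per block -> allowed
    dim3 ok2(8, 16, 4);      //   512 threads per block -> allowed
    dim3 ok3(32, 16, 2);     // 1,024 threads per block -> allowed (at the limit)
    // dim3 bad(32, 32, 2);  // 2,048 threads per block -> NOT allowed

    // A 2D grid of 1D blocks: the grid has higher dimensionality than its blocks.
    dim3 dimGrid(4, 3, 1);
    dim3 dimBlock(256, 1, 1);
    gridSketchKernel<<<dimGrid, dimBlock>>>();
}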
Example : 3
MAPPING THREADS TO MULTIDIMENSIONAL DATA
• The choice of 1D, 2D, or 3D thread organization is usually based on the nature of the data.
• For example, pictures are a 2D array of pixels. It is often convenient to use a 2D grid that consists of 2D blocks to process the pixels in a picture.
• Example: a 76 * 62 picture contains 4,712 pixels. Processing it with a 2D grid of 16 * 16 blocks requires 5 * 4 = 20 blocks, i.e. 80 * 64 threads in total.
• Host code uses integer variables n and m to track the number of pixels in the x and y directions, respectively.
• We further assume that the input picture data has been copied to the device memory and can be accessed through a pointer variable d_Pin.
• The output picture has been allocated in the device memory and can be accessed through a pointer variable d_Pout.
• The following host code can be used to launch a 2D kernel to process the picture:
dim3 dimGrid(ceil(n/16.0), ceil(m/16.0), 1);
dim3 dimBlock(16, 16, 1);
pictureKernel<<<dimGrid, dimBlock>>>(d_Pin, d_Pout, n, m);
To process a 2,000 * 1,500 (3 million pixel) picture, we will generate 11,750 blocks: 125 in the x direction (ceil(2000/16.0)) and 94 in the y direction (ceil(1500/16.0)).
MAPPING THREADS TO
MULTIDIMENSIONAL DATA
• In reality, all multidimensional arrays in C are linearized. This is due to the use of a "flat" memory space in modern computers.
• In the case of statically allocated arrays, the compiler allows the programmer to use higher-dimensional indexing: it linearizes the array into an equivalent 1D array and translates the multidimensional indexing syntax into a 1D offset.
• In the case of dynamically allocated arrays, the current CUDA C compiler leaves the work of such translation to the programmer, due to the lack of dimensional information.
• There are at least two ways one can linearize a 2D array. One is to place all elements of the same row into consecutive locations. The rows are then placed one after another into the memory space. This arrangement, called row-major layout, is used by C compilers.
• Another way to linearize a 2D array is to place all elements of the same column into consecutive locations. The columns are then placed one after another into the memory space. This arrangement, called column-major layout, is used by FORTRAN compilers.
Example
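A minimal sketch of row-major linearization (the 4 * 4 matrix size and the accessor function are illustrative assumptions):

// Row-major layout: element (row, col) of a matrix with `width` columns
// is stored at linear offset row * width + col.
__host__ __device__ float getElement(const float *M, int row, int col, int width)
{
    return M[row * width + col];
}

// For a 4 x 4 matrix, element (2, 1) lives at offset 2 * 4 + 1 = 9
// in the linearized 1D array.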
Let’s assume that the kernel will scale every pixel
value in the picture by a factor of 2.0. The kernel
code is conceptually quite simple. There are a total
of blockDim.x*gridDim.x threads in the horizontal
direction.
Example 4:
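The kernel source is not reproduced in these notes; a hedged reconstruction, consistent with the description above (scale each pixel by 2.0, guard with Col < n and Row < m), might look like this:

__global__ void pictureKernel(float *d_Pin, float *d_Pout, int n, int m)
{
    // Column (x) and row (y) of the pixel handled by this thread.
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;

    // Only threads that map to a real pixel do any work.
    if (Col < n && Row < m)
    {
        // Row-major access: pixel (Row, Col) of an n-pixel-wide picture.
        d_Pout[Row * n + Col] = 2.0f * d_Pin[Row * n + Col];
    }
}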
Execution of pictureKernel()
(The figure divides the 76 * 62 picture, covered by a 5 * 4 grid of 16 * 16 blocks, into four areas, discussed below.)
Area 1:
• Consists of the threads that belong to the 12
blocks covering the majority of pixels in the
picture.
• Both Col and Row values of these threads are
within range.
• All these threads will pass the if statement
test and process pixels in the dark shaded
area of the picture.
• That is, all 16 * 16 = 256 threads in each block
will process pixels.
AREA 2
• The second area contains the threads that belong to the 3 blocks in the medium-shaded area covering the upper-right portion of the picture.
• Although the Row values of these threads are always within range,
the Col values of some of them exceed the n value (76).
• This is because the number of threads in the horizontal direction is
always a multiple of the blockDim.x value chosen by the
programmer (16 in this case).
• The smallest multiple of 16 needed to cover 76 pixels is 80.
• As a result, 12 threads in each row will find their Col values within
range and will process pixels.
• On the other hand, 4 threads in each row will find their Col values
out of range, and thus fail the if statement condition.
• These threads will not process any pixels. Overall, 12 * 16 = 192 out
of the 16 * 16 = 256 threads will process pixels.
Area 3
• The third area accounts for the 3 lower-left blocks covering the medium-shaded area of the picture.
• Although the Col values of these threads are always within range, the Row values of some of them exceed the m value (62).
• This is because the number of threads in the vertical direction is always a multiple of the blockDim.y value chosen by the programmer (16 in this case).
• The smallest multiple of 16 needed to cover 62 is 64.
• As a result, 14 threads in each column will find their Row values within range and will process pixels.
• On the other hand, 2 threads in each column will fail the if statement, just as in area 2, and will not process any pixels; 16 * 14 = 224 out of the 256 threads will process pixels.
Area 4
• The fourth area, contains the threads that
cover the lower-right light-shaded area of the
picture.
• Similar to area 2, 4 threads in each of the top
14 rows will find their Col values out of range.
• Similar to area 3, the entire bottom two rows
of this block will find their Row values out of
range.
• So, only 14 * 12 = 168 of the 16 * 16 = 256 threads will be allowed to process pixels.
3D ARRAYS
• Similarly, 3D arrays are implemented by including another dimension when we linearize arrays. This is done by placing each "plane" of the array one after another in memory.
• Assume that the programmer uses variables m and n to track the number of rows and columns in a 3D array.
• The programmer also needs to determine the values of blockDim.z and gridDim.z when launching a kernel.
• In the kernel, the array index will involve another global index (see the sketch below):
int Plane = blockIdx.z*blockDim.z + threadIdx.z;
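A hedged sketch of how such a 3D index might be used (the scaling operation and the parameter names, including `depth` for the number of planes, are illustrative assumptions):

__global__ void scale3DKernel(float *d_In, float *d_Out, int n, int m, int depth)
{
    // n = columns (x), m = rows (y), depth = number of planes (z).
    int Col   = blockIdx.x * blockDim.x + threadIdx.x;
    int Row   = blockIdx.y * blockDim.y + threadIdx.y;
    int Plane = blockIdx.z * blockDim.z + threadIdx.z;

    if (Col < n && Row < m && Plane < depth)
    {
        // Linearized 3D index: whole planes first, then rows, then columns.
        int idx = Plane * (m * n) + Row * n + Col;
        d_Out[idx] = 2.0f * d_In[idx];
    }
}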