Unit_IV-Topic_7-CUDA_programming_model_features_and_architecture
The batch of threads that executes a kernel is organized as a grid. A grid consists
of either cooperative thread arrays or clusters of cooperative thread arrays as
described in this section and illustrated in Figure 1 and Figure 2. Cooperative
thread arrays (CTAs) implement CUDA thread blocks and clusters implement
CUDA thread block clusters.
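In CUDA C++, this organization surfaces directly in the launch configuration. A minimal sketch (the kernel name and shapes are hypothetical, not from the source): the grid and CTA shapes are passed as the two execution-configuration parameters.

    // Sketch: launch a 2D grid of 2D CTAs. Each of the 8x8 = 64 CTAs
    // is a cooperative thread array of 16x16 = 256 threads.
    dim3 gridShape(8, 8);     // CTAs per grid
    dim3 ctaShape(16, 16);    // threads per CTA
    myKernel<<<gridShape, ctaShape>>>(d_out);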
Threads within a CTA can communicate with each other. To coordinate the
communication of the threads within the CTA, one can specify synchronization
points where threads wait until all threads in the CTA have arrived.
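As a concrete illustration (a minimal sketch, not taken from the source; the kernel name and sizes are hypothetical), the CUDA C++ barrier __syncthreads() marks such a synchronization point: every thread of the CTA must arrive before any thread continues, which makes the shared-memory writes below safe to read.

    // Sketch: each thread stages one element in shared memory, then the
    // whole CTA synchronizes before threads read a peer's slot.
    __global__ void reverseInBlock(int *data)
    {
        __shared__ int tile[256];            // visible to all threads of the CTA
        int t = threadIdx.x;
        tile[t] = data[t];                   // each thread writes its own slot
        __syncthreads();                     // all CTA threads wait here
        data[t] = tile[blockDim.x - 1 - t];  // safe: every write is now visible
    }

Launched as reverseInBlock<<<1, 256>>>(d_data), the sketch assumes a single 256-thread CTA; without the barrier, a thread could read a slot its peer has not yet written.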
Each thread has a unique thread identifier within the CTA. Programs use a data
parallel decomposition to partition inputs, work, and results across the threads of
the CTA. Each CTA thread uses its thread identifier to determine its assigned role,
assign specific input and output positions, compute addresses, and select work to
perform. The thread identifier is a three-element vector tid (with elements tid.x,
tid.y, and tid.z) that specifies the thread's position within a 1D, 2D, or 3D CTA.
Each thread identifier component ranges from zero up to one less than the number
of threads in that CTA dimension.
Each CTA has a 1D, 2D, or 3D shape specified by a three-element vector ntid
(with elements ntid.x, ntid.y, and ntid.z). The vector ntid specifies the number of
threads in each CTA dimension.
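In CUDA C++ these PTX vectors surface as built-ins: tid corresponds to threadIdx and ntid to blockDim. A minimal sketch (the kernel name is hypothetical) of the usual data-parallel decomposition, where each thread derives a global position from its identifier and the CTA shape:

    // Each thread combines its CTA index (blockIdx), the CTA shape
    // (blockDim, i.e. ntid) and its thread id (threadIdx, i.e. tid)
    // to select the one element it is responsible for.
    __global__ void scale(float *out, const float *in, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global position
        if (i < n)                                      // guard the tail CTA
            out[i] = a * in[i];
    }

A launch such as scale<<<(n + 255) / 256, 256>>>(out, in, a, n) covers n elements with 256-thread CTAs; the bounds check handles the final, partially filled CTA.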
Grids may be launched with dependencies on one another: a grid may be a
dependent grid, a prerequisite grid, or both. To understand how grid
dependencies may be defined, refer to the section on CUDA Graphs in the CUDA
Programming Guide.
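One way such dependencies arise is by capturing work into a CUDA graph. The sketch below is illustrative only (the kernel names and launch shapes are hypothetical, and it uses the CUDA 12 signature of cudaGraphInstantiate): recording two kernels on the same stream makes the second grid a dependent grid of the first.

    // Capture two kernel launches into a graph. Because they are recorded
    // on the same stream, kernelB's grid depends on kernelA's grid, i.e.
    // kernelA's grid is a prerequisite grid of kernelB's.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernelA<<<gridA, blockA, 0, stream>>>(d_buf);   // prerequisite grid
    kernelB<<<gridB, blockB, 0, stream>>>(d_buf);   // dependent grid
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);          // CUDA 12 signature
    cudaGraphLaunch(exec, stream);                  // replays both grids in order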
PTX threads may access data from multiple state spaces during their
execution, as illustrated in Figure 3; the cluster level is introduced from target
architecture sm_90 onwards. Each thread has a private local memory. Each thread
block (CTA) has shared memory that is visible to all threads of the block and to
all active blocks in its cluster, and that has the same lifetime as the block. Finally,
all threads have access to the same global memory.
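The cluster level can be sketched with the cooperative groups API (this assumes sm_90 or newer and a CUDA 12 toolkit; the kernel is hypothetical): a CTA may map a peer CTA's shared memory into its own address space.

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Sketch: a two-CTA cluster in which each CTA reads the shared
    // memory of its peer. __cluster_dims__ fixes the cluster shape.
    __global__ void __cluster_dims__(2, 1, 1) exchange(int *out)
    {
        __shared__ int smem[1];
        cg::cluster_group cluster = cg::this_cluster();
        unsigned rank = cluster.block_rank();   // this CTA's rank in the cluster

        smem[0] = (int)rank;                    // publish a value in shared memory
        cluster.sync();                         // every CTA in the cluster has written

        int *peer = cluster.map_shared_rank(smem, rank ^ 1);  // peer CTA's smem
        if (threadIdx.x == 0)
            out[rank] = peer[0];                // read the other CTA's value
        cluster.sync();                         // keep peer smem alive until reads finish
    }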
There are additional state spaces accessible by all threads: the constant,
param, texture, and surface state spaces. Constant and texture memory are
read-only; surface memory is readable and writable. The global, constant, param,
texture, and surface state spaces are optimized for different memory usages. For
example, texture memory offers different addressing modes as well as data
filtering for specific data formats. Note that texture and surface memory are
cached, and within the same kernel call the cache is not kept coherent with respect
to global memory writes and surface memory writes; consequently, any texture
fetch or surface read from an address that has been written via a global or surface
write in the same kernel call returns undefined data. In other words, a thread can
safely read a texture or surface memory location only if that location was updated
by a previous kernel call or memory copy, not if it was previously updated by the
same thread or by another thread in the same kernel call.
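The constant state space, for instance, can be sketched as follows (a hypothetical kernel, assuming CUDA C++): the host writes a small read-only table once, and every thread then reads it through the constant cache.

    // Sketch: coefficients live in the read-only constant state space.
    __constant__ float coeffs[4];

    __global__ void poly(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = in[i];
            out[i] = coeffs[0] + x * (coeffs[1] + x * (coeffs[2] + x * coeffs[3]));
        }
    }

    // Host side: constant memory is written through a dedicated copy call.
    // float h[4] = {1.f, 2.f, 3.f, 4.f};
    // cudaMemcpyToSymbol(coeffs, h, sizeof(h));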
The global, constant, and texture state spaces are persistent across kernel
launches by the same application. Both the host and the device maintain their own
local memory, referred to as host memory and device memory, respectively. The
device memory may be mapped and read or written by the host, or, for more
efficient transfer, copied from the host memory through optimized API calls that
utilize the device’s high-performance Direct Memory Access (DMA) engine.
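A minimal sketch of such a transfer (variable names hypothetical): page-locked host memory lets the copy be carried out by the device's DMA engine and overlapped with other work on the stream.

    // Allocate pinned host memory and device memory, then copy
    // asynchronously; pinned pages allow the DMA engine to do the transfer.
    size_t bytes = (1 << 20) * sizeof(float);
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);          // page-locked (pinned) host memory
    cudaMalloc(&d_buf, bytes);              // device memory

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, s);
    // ... kernels on stream s that read and write d_buf ...
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);               // wait for the DMA transfers

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    cudaStreamDestroy(s);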