
22CD303 - COMPUTER ORGANIZATION AND ARCHITECTURE

LM 7 - UNIT IV - CUDA PROGRAMMING MODEL FEATURES

The CUDA programming model establishes a framework for developing applications that leverage the parallel processing power of NVIDIA GPUs. Here's a breakdown of its key features:

1. A HIGHLY MULTITHREADED COPROCESSOR

The GPU is a compute device capable of executing a very large number of threads in parallel. It operates as a coprocessor to the main CPU, or host: data-parallel, compute-intensive portions of applications running on the host are offloaded onto the device. More precisely, a portion of an application that is executed many times, but independently on different data, can be isolated into a kernel function that is executed on the GPU as many different threads. To that end, such a function is compiled to the PTX instruction set, and the resulting kernel is translated at install time to the target GPU instruction set.
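To make this concrete, here is a minimal CUDA C++ sketch of the offloading pattern (illustrative only; the kernel name, array size, and launch configuration are assumptions, not part of the lecture source). The kernel is executed on the GPU as many different threads, each handling one element independently:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each launched thread processes one array
// element independently of all the others.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (i < n)                                      // guard the tail threads
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // The host off-loads the data-parallel portion: one kernel
    // launch creates n logical threads on the device.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```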

1.1 THREAD HIERARCHY

The batch of threads that executes a kernel is organized as a grid. A grid consists
of either cooperative thread arrays or clusters of cooperative thread arrays as
described in this section and illustrated in Figure 1 and Figure 2. Cooperative
thread arrays (CTAs) implement CUDA thread blocks and clusters implement
CUDA thread block clusters.

Figure 1: Grid with CTAs. Figure 2: Grid with clusters.

1.2 COOPERATIVE THREAD ARRAYS

The Parallel Thread Execution (PTX) programming model is explicitly parallel: a PTX program specifies the execution of a given thread of a parallel thread array. A cooperative thread array, or CTA, is an array of threads that execute a kernel concurrently or in parallel.

Threads within a CTA can communicate with each other. To coordinate the
communication of the threads within the CTA, one can specify synchronization
points where threads wait until all threads in the CTA have arrived.
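In CUDA C++ such a synchronization point is written as the __syncthreads() barrier. A minimal sketch (the block size and the use of shared memory are assumptions for illustration):

```cuda
// Reverses one 256-element chunk within a single CTA.
// Assumes the kernel is launched with blockDim.x == 256.
__global__ void reverseInBlock(int *data) {
    __shared__ int tile[256];            // visible to all threads of the CTA
    int t = threadIdx.x;

    tile[t] = data[t];                   // each thread contributes one element
    __syncthreads();                     // wait until all threads have arrived

    data[t] = tile[blockDim.x - 1 - t];  // now safe to read peers' writes
}
```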

Each thread has a unique thread identifier within the CTA. Programs use a data-parallel decomposition to partition inputs, work, and results across the threads of the CTA. Each CTA thread uses its thread identifier to determine its assigned role, assign specific input and output positions, compute addresses, and select work to perform. The thread identifier is a three-element vector tid (with elements tid.x, tid.y, and tid.z) that specifies the thread's position within a 1D, 2D, or 3D CTA. Each thread identifier component ranges from zero up to one less than the number of threads in that CTA dimension.
Each CTA has a 1D, 2D, or 3D shape specified by a three-element vector ntid
(with elements ntid.x, ntid.y, and ntid.z). The vector ntid specifies the number of
threads in each CTA dimension.
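In CUDA C++ these PTX vectors surface as the built-in variables threadIdx (tid) and blockDim (ntid). A short sketch of the usual linearization of a 3D thread identifier into a single per-CTA index:

```cuda
__global__ void whoAmI() {
    // threadIdx corresponds to PTX %tid, blockDim to PTX %ntid.
    int linear = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
    // 'linear' is unique within the CTA and can be used to select
    // this thread's input and output positions.
}
```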

Threads within a CTA execute in SIMT (single-instruction, multiple-thread) fashion in groups called warps. A warp is a maximal subset of threads from a single CTA, such that the threads execute the same instructions at the same time. Threads within a warp are sequentially numbered. The warp size is a machine-dependent constant; typically, a warp has 32 threads. Some applications may be able to maximize performance with knowledge of the warp size, so PTX includes a run-time immediate constant, WARP_SZ, which may be used in any instruction where an immediate operand is allowed.
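In CUDA C++ the warp size is available as the built-in variable warpSize, the counterpart of PTX's WARP_SZ. A sketch of deriving warp and lane numbers, assuming a 1D thread block:

```cuda
__global__ void warpInfo() {
    int lane = threadIdx.x % warpSize;  // position within the warp
    int warp = threadIdx.x / warpSize;  // which warp of the CTA this is
    // Threads sharing the same 'warp' value execute in SIMT lockstep.
    (void)lane; (void)warp;             // placeholders for real per-warp work
}
```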

1.3 CLUSTER OF COOPERATIVE THREAD ARRAYS

A cluster is a group of CTAs that run concurrently or in parallel and can synchronize and communicate with each other via shared memory. Before communicating with a peer CTA through shared memory, the executing CTA must ensure that the peer CTA's shared memory exists, and that the peer CTA has not exited before the shared memory operation completes.

Threads within the different CTAs in a cluster can synchronize and communicate with each other via shared memory. Cluster-wide barriers can be used to synchronize all the threads within the cluster. Each CTA in a cluster has a unique CTA identifier within its cluster (cluster_ctaid). Each cluster of CTAs has a 1D, 2D, or 3D shape specified by the parameter cluster_nctaid. Each CTA in the cluster also has a unique CTA identifier (cluster_ctarank) across all dimensions. The total number of CTAs across all the dimensions in the cluster is specified by cluster_nctarank. Threads may read and use these values through the predefined, read-only special registers %cluster_ctaid, %cluster_nctaid, %cluster_ctarank, and %cluster_nctarank.

The cluster level is applicable only on target architecture sm_90 or higher. Specifying the cluster level at launch time is optional. If the user specifies the cluster dimensions at launch time, the launch is treated as an explicit cluster launch; otherwise it is treated as an implicit cluster launch with the default dimension 1x1x1. PTX provides the read-only special register %is_explicit_cluster to differentiate between explicit and implicit cluster launches.
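The sketch below shows one way an explicit cluster launch looks in CUDA C++ (hedged: it assumes an sm_90 target and a CUDA 12-era toolkit, and the grid, block, and cluster dimensions are arbitrary). The cooperative groups cluster object exposes the identifiers and the cluster-wide barrier described above:

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void clusterKernel() {
    cg::cluster_group cluster = cg::this_cluster();
    unsigned rank = cluster.block_rank();  // analogous to %cluster_ctarank
    cluster.sync();                        // cluster-wide barrier
    (void)rank;                            // placeholder for real per-CTA work
}

int main() {
    cudaLaunchConfig_t config = {};
    config.gridDim = dim3(8);     // 8 CTAs in total
    config.blockDim = dim3(128);  // 128 threads per CTA

    // Explicit cluster launch: 2x1x1 CTAs per cluster.
    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 2;
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    config.attrs = &attr;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, clusterKernel);
    cudaDeviceSynchronize();
    return 0;
}
```

Omitting the cluster attribute from the launch configuration would make this an implicit cluster launch with the default 1x1x1 dimension.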

1.4 GRID OF CLUSTERS

There is a maximum number of threads that a CTA can contain and a maximum number of CTAs that a cluster can contain. However, clusters with CTAs that execute the same kernel can be batched together into a grid of clusters, so that the total number of threads that can be launched in a single kernel invocation is very large. This comes at the expense of reduced thread communication and synchronization, because threads in different clusters cannot communicate and synchronize with each other.
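A common CUDA C++ idiom built on this property, shown here as an illustrative sketch, is the grid-stride loop: even when the data set is larger than the (already very large) number of launched threads, each thread strides across the whole grid. The kernel name and types are assumptions:

```cuda
__global__ void addOne(float *data, size_t n) {
    // Total number of threads launched across the grid.
    size_t stride = (size_t)gridDim.x * blockDim.x;
    // Each thread starts at its unique global index, then jumps
    // by the grid size until the whole array is covered.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride)
        data[i] += 1.0f;
}
```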

Each cluster has a unique cluster identifier (clusterid) within a grid of clusters. Each grid of clusters has a 1D, 2D, or 3D shape specified by the parameter nclusterid. Each grid also has a unique temporal grid identifier (gridid). Threads may read and use these values through the predefined, read-only special registers %tid, %ntid, %clusterid, %nclusterid, and %gridid. Each CTA has a unique identifier (ctaid) within a grid. Each grid of CTAs has a 1D, 2D, or 3D shape specified by the parameter nctaid. Threads may read and use these values through the predefined, read-only special registers %ctaid and %nctaid.

Each kernel is executed as a batch of threads organized as a grid of clusters consisting of CTAs, where the cluster is an optional level applicable only to target architectures sm_90 and higher. Figure 1 shows a grid consisting of CTAs and Figure 2 shows a grid consisting of clusters.

Grids may be launched with dependencies between one another: a grid may be a dependent grid and/or a prerequisite grid. To understand how grid dependencies may be defined, refer to the section on CUDA Graphs in the CUDA Programming Guide.
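As a hedged sketch only (the runtime calls below follow the CUDA 12 runtime API, and kernelA and kernelB are hypothetical no-argument kernels), grid dependencies can be expressed by building a graph in which one kernel node is a prerequisite of another:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA() {}
__global__ void kernelB() {}

int main() {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    cudaKernelNodeParams params = {};
    params.gridDim = dim3(4);
    params.blockDim = dim3(128);
    params.kernelParams = nullptr;  // the kernels take no arguments
    params.extra = nullptr;

    cudaGraphNode_t a, b;
    params.func = (void *)kernelA;
    cudaGraphAddKernelNode(&a, graph, nullptr, 0, &params);  // prerequisite grid

    params.func = (void *)kernelB;
    cudaGraphAddKernelNode(&b, graph, &a, 1, &params);       // dependent grid

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);
    cudaGraphLaunch(exec, 0);  // kernelB's grid runs only after kernelA's completes
    cudaDeviceSynchronize();
    return 0;
}
```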

1.5 MEMORY HIERARCHY

PTX threads may access data from multiple state spaces during their execution, as illustrated by Figure 3, where the cluster level is introduced from target architecture sm_90 onwards. Each thread has a private local memory. Each thread block (CTA) has shared memory that is visible to all threads of the block and to all active blocks in the cluster, and that has the same lifetime as the block. Finally, all threads have access to the same global memory.

There are additional state spaces accessible by all threads: the constant,
param, texture, and surface state spaces. Constant and texture memory are
read-only; surface memory is readable and writable. The global, constant, param,
texture, and surface state spaces are optimized for different memory usages. For
example, texture memory offers different addressing modes as well as data
filtering for specific data formats. Note that texture and surface memory is cached,
and within the same kernel call, the cache is not kept coherent with respect to
global memory writes and surface memory writes, so any texture fetch or surface
read to an address that has been written to via a global or a surface write in the
same kernel call returns undefined data. In other words, a thread can safely read
some texture or surface memory location only if this memory location has been
updated by a previous kernel call or memory copy, but not if it has been
previously updated by the same thread or another thread from the same kernel call.

The global, constant, and texture state spaces are persistent across kernel
launches by the same application. Both the host and the device maintain their own
local memory, referred to as host memory and device memory, respectively. The
device memory may be mapped and read or written by the host, or, for more
efficient transfer, copied from the host memory through optimized API calls that
utilize the device’s high-performance Direct Memory Access (DMA) engine.

Figure 3: Memory hierarchy.
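A small CUDA C++ sketch of how these state spaces appear in source code (illustrative; the names and sizes are assumptions): a per-thread local variable, per-CTA __shared__ memory, __constant__ memory written by the host, and global memory transferred with the DMA-backed copy API:

```cuda
#include <cuda_runtime.h>

__constant__ float coeff;                  // constant state space, read-only on device

__global__ void blend(float *out, const float *in, int n) {
    __shared__ float tile[256];            // shared state space, lifetime of the CTA
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // global -> shared
    __syncthreads();

    float acc = tile[threadIdx.x] * coeff; // 'acc' lives in private local memory
    if (i < n)
        out[i] = acc;                      // result back to global memory
}

int main() {
    const int n = 1024;
    float h_coeff = 0.5f, h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Host-to-device transfers; large copies use the device's DMA engine.
    cudaMemcpyToSymbol(coeff, &h_coeff, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    blend<<<n / 256, 256>>>(d_out, d_in, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```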
