
Lecture 7:

GPU Architecture & CUDA Programming
Parallel Computing
Stanford CS149, Fall 2023
Today
▪ History: how graphics processors, originally designed to accelerate 3D games,
evolved into highly parallel compute engines for a broad class of applications like:
- deep learning
- computer vision
- scientific computing

▪ Programming GPUs using the CUDA language


▪ A more detailed look at GPU architecture

Stanford CS149, Fall 2023


Basic GPU architecture (from lecture 2)

[Figure: GPU connected to DDR5 DRAM memory (a few GB) at ~150-300 GB/sec (high-end GPUs)]

GPU: multi-core chip
- SIMD execution within a single core (many execution units performing the same instruction)
- Multi-threaded execution on a single core (multiple threads executed concurrently by a core)
Stanford CS149, Fall 2023
Graphics 101 + GPU history
(for fun)

Stanford CS149, Fall 2023


What GPUs were originally designed to do:
3D rendering

Image credit: Henrik Wann Jensen

Input: description of a scene: 3D surface geometry (e.g., triangle mesh), surface materials, lights, camera, etc.
Output: image of the scene

Simple definition of the rendering task: compute how each triangle in the 3D mesh contributes to the appearance of each pixel in the image.
Stanford CS149, Fall 2023
What GPUs were originally designed to do

Unreal Engine Kite Demo (Epic Games 2015)
Stanford CS149, Fall 2023

Render high-complexity 3D scenes, in real time

Epic Nanite Demo
Stanford CS149, Fall 2023


The 3D graphics workload

Stanford CS149, Fall 2023


Real-time graphics primitives (entities)
Represent surfaces as 3D triangle meshes

[Figure: vertices (points in space) connected into primitives (e.g., triangles, points, lines)]

Stanford CS149, Fall 2023


Workload in one slide
▪ Given a triangle, determine where it lies on screen given the position of a virtual camera

▪ For all output image pixels covered by the triangle, compute the color of the surface at
that pixel.

Stanford CS149, Fall 2023


What does the surface look like at a point?

Images from Matusik et al. SIGGRAPH 2003


Stanford CS149, Fall 2023
Great diversity of materials and lights in the world!

Stanford CS149, Fall 2023


Example “shader program” *
Run once per fragment (per pixel covered by a triangle)
myTexture is a texture map

OpenGL shading language (GLSL) shader program: defines behavior of fragment processing stage

uniform sampler2D myTexture;   // read-only global variables
uniform float3 lightDir;
varying vec3 norm;             // inputs whose value changes per pixel:
varying vec2 uv;               // think of these as shader function parameters

void myShader()                // “shader” function (a.k.a. function invoked to compute the color of the pixel)
{
   vec3 kd = texture2D(myTexture, uv);
   kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
   return vec4(kd, 1.0);       // per-pixel output: RGBA surface color at pixel
}

* Syntax/details of this code are not important to CS149.
What is important is that a shader is a pure function invoked on a stream of inputs. Stanford CS149, Fall 2023
Shaded result
Image contains output of myShader() for each pixel covered by the surface
(pixels covered by multiple surfaces contain output from the surface closest to the camera)

Stanford CS149, Fall 2023


Why do GPUs have many high-throughput cores?
Many SIMD, multi-threaded cores provide efficient execution of shader programs

[Figure: GPU connected to DDR6 DRAM memory (10s of GB) at ~1 TB/sec (high-end GPUs)]
Stanford CS149, Fall 2023
Observation circa 2001-2003
GPUs are very fast processors for performing the same computation (shader programs) in parallel on large collections of data (streams of vertices, fragments, and pixels)

Wait a minute! That sounds a lot like data-parallelism to me! I remember data-parallelism from exotic supercomputers in the 90s.

And every year GPUs are getting faster because more transistors = more parallelism.

Stanford CS149, Fall 2023


Hack! early GPU-based scientific computation
Say you want to run a function on all elements of a 512x512 array
Set output image size to be array size (512 x 512)
Render two triangles that exactly cover screen
(one shader computation per pixel = one shader computation per output image element)

[Figure: full-screen quad made of two triangles with vertices v0=(0,0), v1=(512,0), v2=(512,512), v3=(0,512)]

We now can use the GPU like a data-parallel programming system.
Fragment shader function is mapped over a 512 x 512 element collection.
Hack!

Stanford CS149, Fall 2023


“GPGPU” 2002-2003
GPGPU = “general purpose” computation on GPUs

Coupled Map Lattice Simulation [Harris 02]


Sparse Matrix Solvers [Bolz 03]

Ray Tracing on Programmable Graphics Hardware [Purcell 02]


Stanford CS149, Fall 2023
Brook stream programming language (2004) [Buck 2004]

▪ Stanford graphics lab research project


▪ Abstract GPU hardware as data-parallel processor
kernel void scale(float amount, float a<>, out float b<>)
{
b = amount * a;
}

float scale_amount;
float input_stream<1000>; // stream declaration
float output_stream<1000>; // stream declaration

// omitting stream element initialization...

// map kernel function onto streams
scale(scale_amount, input_stream, output_stream);

▪ Brook compiler translated generic stream program into graphics commands (such
as drawTriangles) and a set of graphics shader programs that could be run on GPUs
of the day.
Stanford CS149, Fall 2023
GPU compute mode

Stanford CS149, Fall 2023


Review: how to run code on a CPU
Let’s say a user wants to run a program on a multi-core CPU…
- OS loads program text into memory
- OS selects CPU execution context
- OS interrupts processor, prepares execution context (sets contents of registers, program counter, etc. to prepare execution context)
- Go!
- Processor begins executing instructions from within the environment maintained in the execution context.

[Figure: multi-core CPU, each core with Fetch/Decode, ALU (Execute), and Execution Context]

Stanford CS149, Fall 2023


How to run code on a GPU (prior to 2007)
Let’s say a user wants to draw a picture using a GPU…
- Application (via graphics driver) provides GPU shader program binaries
- Application sets graphics pipeline parameters (e.g., output image size)
- Application provides GPU a buffer of vertices
- Application sends GPU a “draw” command: drawPrimitives(vertex_buffer)

This was the only interface to GPU hardware.
GPU hardware could only execute graphics pipeline computations.

[Figure: graphics pipeline: input vertex buffer -> Vertex Generation -> Vertex Processing -> Primitive Generation -> Fragment Generation (“Rasterization”) -> Fragment Processing -> Pixel Operations -> output image buffer (pixels)]
Stanford CS149, Fall 2023


NVIDIA Tesla architecture (2007)
First alternative, non-graphics-specific (“compute mode”) interface to GPU hardware

Let’s say a user wants to run a non-graphics program on the GPU’s programmable cores…
- Application can allocate buffers in GPU memory and copy data to/from buffers
- Application (via graphics driver) provides GPU a single kernel program binary
- Application tells GPU to run the kernel in an SPMD fashion (“run N instances of this kernel”)
  launch(myKernel, N)

Interestingly, this is a far simpler operation than the graphics operation drawPrimitives()

Stanford CS149, Fall 2023


CUDA programming language
▪ Introduced in 2007 with NVIDIA Tesla architecture
▪ “C-like” language to express programs that run on GPUs using the compute-mode hardware
interface

▪ Relatively low-level: CUDA’s abstractions closely match the capabilities/performance characteristics of modern GPUs
  (design goal: maintain low abstraction distance)

Stanford CS149, Fall 2023


The plan
1. CUDA programming abstractions
2. CUDA implementation on modern GPUs
3. More detail on GPU architecture

Things to consider throughout this lecture:


- Is CUDA a data-parallel programming model?
- Is CUDA an example of the shared address space model?
- Or the message passing model?
- Can you draw analogies to ISPC instances and tasks? What about pthreads?

Stanford CS149, Fall 2023


Clarification (here we go again...)
▪ I am going to describe CUDA abstractions using CUDA terminology

▪ Specifically, be careful with the use of the term “CUDA thread”. A CUDA thread presents a
similar abstraction as a pthread in that both correspond to logical threads of control,
but the implementation of a CUDA thread is very different

▪ We will discuss these differences at the end of the lecture

Stanford CS149, Fall 2023


CUDA programs consist of a hierarchy of concurrent threads
Thread IDs can be up to 3-dimensional (2D example below)
Multi-dimensional thread ids are convenient for problems that are naturally N-D

Regular application thread running on CPU (the “host”)


const int Nx = 12;
const int Ny = 6;

dim3 threadsPerBlock(4, 3);
dim3 numBlocks(Nx/threadsPerBlock.x, Ny/threadsPerBlock.y);

// assume A, B, C are allocated Nx x Ny float arrays

// this call will launch 72 CUDA threads:
// 6 thread blocks of 12 threads each
matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

Stanford CS149, Fall 2023


Basic CUDA syntax

“Host” code: serial execution
Running as part of normal C/C++ application on CPU

Bulk launch of many CUDA threads: “launch a grid of CUDA thread blocks”
Call returns when all threads have terminated

Regular application thread running on CPU (the “host”):

const int Nx = 12;
const int Ny = 6;

dim3 threadsPerBlock(4, 3);
dim3 numBlocks(Nx/threadsPerBlock.x, Ny/threadsPerBlock.y);

// assume A, B, C are allocated Nx x Ny float arrays

// this call will launch 72 CUDA threads:
// 6 thread blocks of 12 threads each
matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

SPMD execution of device kernel function:
“CUDA device” code: kernel function (__global__ denotes a CUDA kernel function) runs on GPU
Each thread computes its overall grid thread id from its position in its block (threadIdx) and its block’s position in the grid (blockIdx)

// kernel definition (runs on GPU)
__global__ void matrixAdd(float A[Ny][Nx],
                          float B[Ny][Nx],
                          float C[Ny][Nx])
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   int j = blockIdx.y * blockDim.y + threadIdx.y;

   C[j][i] = A[j][i] + B[j][i];
}

Stanford CS149, Fall 2023


Clear separation of host and device code
Separation of execution into host and device code is performed statically by the programmer

“Host” code: serial execution on CPU

const int Nx = 12;
const int Ny = 6;

dim3 threadsPerBlock(4, 3);
dim3 numBlocks(Nx/threadsPerBlock.x, Ny/threadsPerBlock.y);

// assume A, B, C are allocated Nx x Ny float arrays

// this call will cause execution of 72 threads
// 6 blocks of 12 threads each
matrixAddDoubleB<<<numBlocks, threadsPerBlock>>>(A, B, C);

“Device” code (SPMD execution on GPU):

__device__ float doubleValue(float x)
{
   return 2 * x;
}

// kernel definition
__global__ void matrixAddDoubleB(float A[Ny][Nx],
                                 float B[Ny][Nx],
                                 float C[Ny][Nx])
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   int j = blockIdx.y * blockDim.y + threadIdx.y;

   C[j][i] = A[j][i] + doubleValue(B[j][i]);
}

Stanford CS149, Fall 2023


Number of SPMD “CUDA threads” is explicit in the program
Number of kernel invocations is not determined by size of data collection
(a kernel launch is not specified by map(kernel, collection) as was the case with graphics shader programming)

Regular application thread running on CPU (the “host”):

const int Nx = 11;  // not a multiple of threadsPerBlock.x
const int Ny = 5;   // not a multiple of threadsPerBlock.y

dim3 threadsPerBlock(4, 3);
dim3 numBlocks((Nx+threadsPerBlock.x-1)/threadsPerBlock.x,
               (Ny+threadsPerBlock.y-1)/threadsPerBlock.y);

// assume A, B, C are allocated Nx x Ny float arrays

// this call will cause execution of 72 threads
// 6 blocks of 12 threads each
matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

CUDA kernel definition:

__global__ void matrixAdd(float A[Ny][Nx],
                          float B[Ny][Nx],
                          float C[Ny][Nx])
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   int j = blockIdx.y * blockDim.y + threadIdx.y;

   // guard against out of bounds array access
   if (i < Nx && j < Ny)
      C[j][i] = A[j][i] + B[j][i];
}

Stanford CS149, Fall 2023


CUDA execution model

Host (serial execution): implemented by the CPU
CUDA device (SPMD execution): implemented by the GPU

Stanford CS149, Fall 2023


CUDA memory model
Distinct host and device address spaces

Host (serial execution, on CPU): host memory address space
CUDA device (SPMD execution, on GPU): device “global” memory address space

Stanford CS149, Fall 2023


memcpy primitive
Move data between address spaces (host memory address space <-> device “global” memory address space)

float* A = new float[N];       // allocate buffer in host mem

// populate host address space pointer A
for (int i=0; i<N; i++)
   A[i] = (float)i;

int bytes = sizeof(float) * N;
float* deviceA;                // allocate buffer in
cudaMalloc(&deviceA, bytes);   // device address space

// populate deviceA
cudaMemcpy(deviceA, A, bytes, cudaMemcpyHostToDevice);

// note: directly accessing deviceA[i] is an invalid
// operation here (cannot manipulate contents of deviceA
// directly from host, since deviceA is not a pointer
// into the host’s address space)

What does cudaMemcpy remind you of?

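To complete the round trip, results are copied back with the same primitive (direction flag reversed) and buffers are released when no longer needed. A minimal sketch, assuming deviceA now holds results the host wants to read:

cudaMemcpy(A, deviceA, bytes, cudaMemcpyDeviceToHost);  // device -> host copy
cudaFree(deviceA);   // release device allocation
delete[] A;          // release host allocation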
Stanford CS149, Fall 2023


CUDA device memory model
Three distinct types of address spaces are visible to kernels:

- Per-thread private memory: readable/writable by the owning thread
- Per-block shared memory: readable/writable by all threads in the block
- Device global memory: readable/writable by all threads

Different address spaces reflect different regions of locality in the program.

As we will soon see, this has important implications for the efficiency of GPU implementations of CUDA
(e.g., how might you schedule threads if you know a priori that certain threads access the same variables?)

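A minimal kernel sketch (hypothetical name scaleByTwo, assuming a block size of 128) that touches all three address spaces:

__global__ void scaleByTwo(float* data) {          // data: device global memory
   __shared__ float staging[128];                  // per-block shared memory
   int i = blockIdx.x * blockDim.x + threadIdx.x;  // i: per-thread private (local) variable
   staging[threadIdx.x] = data[i];
   __syncthreads();                                // shared-memory writes visible to whole block
   data[i] = 2.0f * staging[threadIdx.x];
}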
Stanford CS149, Fall 2023


CUDA example: 1D convolution

[Figure: sliding window of three input elements producing each output element: input[0..9] -> output[0..7]]

output[i] = (input[i] + input[i+1] + input[i+2]) / 3.f;

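For reference, the same computation as a sequential loop on the CPU (a sketch, assuming the input array has N+2 elements so every window is in bounds):

void convolve_serial(int N, const float* input, float* output) {
   for (int i = 0; i < N; i++)
      output[i] = (input[i] + input[i+1] + input[i+2]) / 3.f;
}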
Stanford CS149, Fall 2023


1D convolution in CUDA (version 1)
One thread per output element

[Figure: the first thread block reads input[0..129] and writes output[0..127]; the last block reads input[N-128..N+1] and writes output[N-128..N-1]]

CUDA Kernel:

#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output) {

   int index = blockIdx.x * blockDim.x + threadIdx.x;  // thread-local variable

   float result = 0.0f;                                // thread-local variable
   for (int i=0; i<3; i++)                             // each thread computes
      result += input[index + i];                      // result for one element

   output[index] = result / 3.f;                       // each thread writes result to global memory
}

Host code:

int N = 1024 * 1024;
cudaMalloc(&devInput, sizeof(float) * (N+2));  // allocate input array in device memory
cudaMalloc(&devOutput, sizeof(float) * N);     // allocate output array in device memory

// properly initialize contents of devInput here ...

convolve<<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);

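Kernel launches return immediately and report failures asynchronously. A common host-side pattern (a sketch using standard CUDA runtime calls, not part of the lecture code) is to check for errors right after the launch above:

cudaError_t err = cudaGetLastError();   // catches launch-configuration errors
if (err == cudaSuccess)
   err = cudaDeviceSynchronize();       // waits for the kernel to finish; catches execution errors
if (err != cudaSuccess)
   printf("CUDA error: %s\n", cudaGetErrorString(err));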
Stanford CS149, Fall 2023


1D convolution in CUDA (version 2)
One thread per output element: stage input data in per-block shared memory

CUDA Kernel:

#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output) {

   __shared__ float support[THREADS_PER_BLK+2];        // per-block allocation
   int index = blockIdx.x * blockDim.x + threadIdx.x;  // thread-local variable

   // All threads cooperatively load the block’s support region from global
   // memory into shared memory (total of 130 load instructions instead of
   // 3 * 128 load instructions)
   support[threadIdx.x] = input[index];
   if (threadIdx.x < 2) {
      support[THREADS_PER_BLK + threadIdx.x] = input[index+THREADS_PER_BLK];
   }

   __syncthreads();                                    // barrier (all threads in block)

   float result = 0.0f;                                // thread-local variable
   for (int i=0; i<3; i++)                             // each thread computes
      result += support[threadIdx.x + i];              // result for one element

   output[index] = result / 3.f;                       // write result to global memory
}

Host code:

int N = 1024 * 1024;
cudaMalloc(&devInput, sizeof(float) * (N+2));  // allocate array in device memory
cudaMalloc(&devOutput, sizeof(float) * N);     // allocate array in device memory

// properly initialize contents of devInput here ...

convolve<<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);

Stanford CS149, Fall 2023


CUDA synchronization constructs
▪ __syncthreads()
- Barrier: wait for all threads in the block to arrive at this point

▪ Atomic operations
- e.g., float atomicAdd(float* addr, float amount)
- CUDA provides atomic operations on both global memory addresses and per-block shared memory addresses

▪ Host/device synchronization
- Implicit barrier across all threads at return of kernel

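A small sketch (hypothetical kernel blockSum, assuming a block size of 128) that uses both constructs: threads in a block stage values in shared memory, synchronize, and then one thread folds the block’s contribution into a global sum with a single atomic:

__global__ void blockSum(const float* input, float* globalSum) {
   __shared__ float partial[128];
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   partial[threadIdx.x] = input[i];
   __syncthreads();                          // all loads complete before the reduction
   if (threadIdx.x == 0) {
      float sum = 0.f;
      for (int j = 0; j < 128; j++)
         sum += partial[j];
      atomicAdd(globalSum, sum);             // one atomic update per block
   }
}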
Stanford CS149, Fall 2023


Summary: CUDA abstractions
▪ Execution: thread hierarchy
- Bulk launch of many threads (this is imprecise... I’ll clarify later)
- Two-level hierarchy: threads are grouped into thread blocks

▪ Distributed address space


- Built-in memcpy primitives to copy between host and device address spaces
- Three different types of device address spaces
- Per thread, per block (“shared”), or per program (“global”)

▪ Barrier synchronization primitive for threads in thread block

▪ Atomic primitives for additional synchronization (shared and global variables)

Stanford CS149, Fall 2023


CUDA semantics

Consider the implementation of a call to pthread_create() or std::thread(). It must allocate thread state:
- Stack space for the thread
- A control block so the OS can schedule the thread

Will running this CUDA program create 1 million instances of local variables/per-thread stack?
8K instances of shared variables? (support)

#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output) {

   __shared__ float support[THREADS_PER_BLK+2];        // per-block allocation
   int index = blockIdx.x * blockDim.x + threadIdx.x;  // thread-local var

   support[threadIdx.x] = input[index];
   if (threadIdx.x < 2) {
      support[THREADS_PER_BLK+threadIdx.x] = input[index+THREADS_PER_BLK];
   }

   __syncthreads();

   float result = 0.0f;  // thread-local variable
   for (int i=0; i<3; i++)
      result += support[threadIdx.x + i];

   output[index] = result / 3.f;
}

// host code //////////////////////////////////////////////////////
int N = 1024 * 1024;
cudaMalloc(&devInput, sizeof(float) * (N+2));  // allocate array in device memory
cudaMalloc(&devOutput, sizeof(float) * N);     // allocate array in device memory

// properly initialize contents of devInput here ...

// launch over 1 million CUDA threads (over 8K thread blocks)
convolve<<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);

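Working through the numbers in this launch: N = 1024 x 1024 = 1,048,576 CUDA threads, grouped into 1,048,576 / 128 = 8,192 thread blocks, each of which declares one 130-float (520-byte) support array. So logically there are over a million sets of per-thread local variables and 8K shared-memory allocations; the following slides look at how the implementation avoids materializing all of that state at once.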
Stanford CS149, Fall 2023


Assigning work

Mid-range GPU (6 cores)
High-end GPU (16 cores)

Desirable for a CUDA program to run on all of these GPUs without modification

Note: there is no concept of num_cores in the CUDA programs I have shown you.
(CUDA thread launch is similar in spirit to a forall loop in data-parallel model examples)

Stanford CS149, Fall 2023


CUDA compilation

A compiled CUDA device binary includes:
- Program text (instructions)
- Information about required resources:
  - 128 threads per block
  - B bytes of local data per thread
  - 128+2 = 130 floats (520 bytes) of shared space per thread block

#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output) {

   __shared__ float support[THREADS_PER_BLK+2];        // per-block allocation
   int index = blockIdx.x * blockDim.x + threadIdx.x;  // thread-local var

   support[threadIdx.x] = input[index];
   if (threadIdx.x < 2) {
      support[THREADS_PER_BLK+threadIdx.x] = input[index+THREADS_PER_BLK];
   }

   __syncthreads();

   float result = 0.0f;  // thread-local variable
   for (int i=0; i<3; i++)
      result += support[threadIdx.x + i];

   output[index] = result;
}

int N = 1024 * 1024;
cudaMalloc(&devInput, sizeof(float) * (N+2));  // allocate array in device memory
cudaMalloc(&devOutput, sizeof(float) * N);     // allocate array in device memory

// properly initialize contents of devInput here ...

// launch 8K thread blocks
convolve<<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);

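One way to see these per-kernel resource figures for yourself (a sketch, not from the lecture) is to query the CUDA runtime; nvcc’s -Xptxas -v flag reports similar numbers at compile time:

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, convolve);
printf("static shared mem per block: %zu bytes\n", attr.sharedSizeBytes);
printf("registers per thread:        %d\n", attr.numRegs);
printf("local mem per thread:        %zu bytes\n", attr.localSizeBytes);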
Stanford CS149, Fall 2023


CUDA thread-block assignment

Grid of 8K convolve thread blocks (specified by kernel launch)
Kernel launch command from host: launch(blockDim, convolve)

Block resource requirements (contained in compiled kernel binary):
- 128 threads
- 520 bytes of shared mem
- (128 x B) bytes of local mem

Special HW in GPU: thread block scheduler

Major CUDA assumption: thread block execution can be carried out in any order (no dependencies between blocks)

GPU implementation maps thread blocks (“work”) to cores using a dynamic scheduling policy that respects resource requirements

[Figure: thread block scheduler assigning blocks to cores; each core has its own shared mem (fast on-chip memory); all cores access device global memory (DRAM)]

Stanford CS149, Fall 2023


Another instance of our common design pattern:
a pool of worker “threads”

[Figure: problem to solve -> decomposition -> sub-problems (aka “tasks”, “work”) -> assignment -> worker threads]

Other examples:
- ISPC’s implementation of launching tasks
- Creates one pthread for each hyper-thread on CPU. Threads kept alive for remainder of program
- Thread pool in a web server
- Number of threads is a function of number of cores, not number of outstanding requests
- Threads spawned at web server launch, wait for work to arrive
Stanford CS149, Fall 2023
NVIDIA V100 SM “sub-core”

[Figure: one sub-core: warp selector, fetch/decode, register file, and SIMD execution units]

Legend:
- SIMD fp32 functional unit, control shared across 16 units (16 x MUL-ADD per clock *)
- SIMD int functional unit, control shared across 16 units (16 x MUL/ADD per clock *)
- SIMD fp64 functional unit, control shared across 8 units (8 x MUL/ADD per clock **)
- Tensor core unit
- Load/store unit

* one 32-wide SIMD operation every two clocks
** one 32-wide SIMD operation every four clocks
Stanford CS149, Fall 2023


NVIDIA V100 SM “sub-core”

Scalar registers for one CUDA thread: R0, R1, etc.

[Figure: the sub-core register file holds per-thread scalar registers (R0, R1, R2, ...) for threads 0-31, 32-63, 64-95, 96-127, ...]
Stanford CS149, Fall 2023


NVIDIA V100 SM “sub-core”

Scalar registers for 32 threads in the same “warp”

A group of 32 threads in a thread block is called a warp.
- In a thread block, threads 0-31 fall into the same warp (so do threads 32-63, etc.)
- Therefore, a thread block with 256 CUDA threads is mapped to 8 warps.
- Each sub-core in the V100 is capable of scheduling and interleaving execution of up to 16 warps

[Figure: register file partitioned among this sub-core’s warps: warp 0, warp 4, ..., warp 60]
Stanford CS149, Fall 2023


NVIDIA V100 SM “sub-core”

Scalar registers for 32 threads in the same “warp”

Threads in a warp are executed in a SIMD manner if they share the same instruction
- NVIDIA calls this SIMT (single instruction, multiple CUDA threads)
- If the 32 CUDA threads do not share the same instruction, performance can suffer due to divergent execution.
- This mapping is similar to how ISPC runs program instances in a gang *

A warp is not part of CUDA, but is an important CUDA implementation detail on modern NVIDIA GPUs

* But GPU hardware is dynamically checking whether 32 independent CUDA threads share an instruction, and if this is true, it executes all 32 threads in a SIMD manner. The CUDA program is not compiled to SIMD instructions like ISPC gangs. Stanford CS149, Fall 2023
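A small sketch (hypothetical kernel, not from the lecture) where threads in the same warp take different branches; the hardware runs the two paths one after the other with inactive lanes masked off, so the warp pays for both:

__global__ void divergent(float* data) {
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i % 2 == 0)
      data[i] = 2.0f * data[i];   // even lanes take this path...
   else
      data[i] = data[i] + 1.0f;   // ...odd lanes take this one; the two paths are serialized
}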
Instruction execution

Instruction stream for CUDA threads in a warp (note: in this example all instructions are independent):

00 fp32 mul r0 r1 r2
01 int32 add r3 r4 r5
02 fp32 mul r6 r7 r8
...

[Figure: timeline (clocks): instruction 00 is fetched and runs on the fp32 units for two clocks, then 01 on the int32 units, then 02 on the fp32 units, ...]

Remember, the entire warp of CUDA threads is running this instruction stream,
so each instruction is run by all 32 CUDA threads in the warp.
Since there are 16 ALUs, running the instruction for the entire warp takes two clocks.

Stanford CS149, Fall 2023


NVIDIA V100 GPU SM
This is one NVIDIA V100 streaming multiprocessor (SM) unit

[Figure: four sub-cores, each with its own warp selector, fetch/decode, register file, and SIMD units]

- 64 KB registers per sub-core (256 KB registers in total per SM)
- Registers divided among (up to) 64 “warps” per SM
- “Shared” memory + L1 cache storage (128 KB)

Legend (per sub-core):
- SIMD fp32 functional unit, control shared across 16 units (16 x MUL-ADD per clock *)
- SIMD int functional unit, control shared across 16 units (16 x MUL/ADD per clock *)
- SIMD fp64 functional unit, control shared across 8 units (8 x MUL/ADD per clock **)
- Tensor core unit
- Load/store unit

* one 32-wide SIMD operation every 2 clocks
** one 32-wide SIMD operation every 4 clocks

Stanford CS149, Fall 2023
Running a thread block on a V100 SM

A convolve thread block is executed by 4 warps
(4 warps x 32 threads/warp = 128 CUDA threads per block)

Per-block shared allocation: support (520 bytes)

#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output)
{
   __shared__ float support[THREADS_PER_BLK+2];
   int index = blockIdx.x * blockDim.x + threadIdx.x;

   support[threadIdx.x] = input[index];
   if (threadIdx.x < 2) {
      support[THREADS_PER_BLK+threadIdx.x] = input[index+THREADS_PER_BLK];
   }

   __syncthreads();

   float result = 0.0f;  // thread-local
   for (int i=0; i<3; i++)
      result += support[threadIdx.x + i];

   output[index] = result;
}

SM core operation each clock:
- Each sub-core selects one runnable warp (from the 16 warps in its partition)
- Each sub-core runs the next instruction for the CUDA threads in the warp (this instruction may apply to all or a subset of the CUDA threads in a warp depending on divergence)
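Working through the arithmetic for this kernel (a rough estimate from the numbers above; real occupancy also depends on per-thread register usage): one convolve block needs 4 warps and 520 bytes of shared memory. With up to 64 warps per SM, warp slots allow about 64 / 4 = 16 resident convolve blocks per SM, and shared storage is not the limit (16 x 520 bytes is roughly 8 KB out of 128 KB).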
Stanford CS149, Fall 2023
NVIDIA V100 GPU (80 SMs)

[Figure: 80 SMs, L2 cache (6 MB), connected to 16 GB of HBM GPU memory at 900 GB/sec (4096-bit interface)]
Stanford CS149, Fall 2023
Summary: geometry of the V100 GPU

- 1.245 GHz clock
- 80 SM cores per chip
- 80 x 4 x 16 = 5,120 fp32 mul-add ALUs = 12.7 TFLOPs *
- Up to 80 x 64 = 5,120 interleaved warps per chip (163,840 CUDA threads/chip)
- L2 cache: 6 MB
- Memory bandwidth: 900 GB/sec
- GPU memory: 16 GB

* mul-add counted as 2 flops


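Where the 12.7 TFLOPs figure comes from: 5,120 fp32 ALUs x 2 flops per mul-add x 1.245 GHz ≈ 12.75 x 10^12 flops/sec.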
Stanford CS149, Fall 2023
Running a CUDA program on a GPU

Stanford CS149, Fall 2023


Running the convolve kernel

convolve kernel’s execution requirements:
- Each thread block must execute 128 CUDA threads
- Each thread block requires 130 x sizeof(float) = 520 bytes of shared memory

Let’s assume array size N is very large, so the host-side kernel launch generates thousands of thread blocks.

#define THREADS_PER_BLK 128
convolve<<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>(N, input_array, output_array);

Let’s run this program on the fictitious two-core GPU below.
(Note: my fictitious cores are much “smaller” than the V100 SM cores discussed earlier in lecture: they have fewer execution units, support for fewer active warps, less shared memory, etc.)

[Figure: GPU work scheduler feeding two cores (Core 0, Core 1); each core has fetch/decode, execution context storage for 384 CUDA threads (12 warps), and 1.5 KB of “shared” memory storage]

Stanford CS149, Fall 2023


Running the CUDA kernel

Kernel’s execution requirements:
- Each thread block must execute 128 CUDA threads
- Each thread block must allocate 130 x sizeof(float) = 520 bytes of shared memory

Step 1: host sends CUDA device (GPU) a command (“execute this kernel”)

EXECUTE: convolve
ARGS: N, input_array, output_array
NUM_BLOCKS: 1000

[Figure: GPU work scheduler and two idle cores, each with execution context storage for 384 CUDA threads (12 warps) and 1.5 KB of “shared” memory storage]
Stanford CS149, Fall 2023
Running the CUDA kernel

Kernel’s execution requirements:
- Each thread block must execute 128 CUDA threads
- Each thread block must allocate 130 x sizeof(float) = 520 bytes of shared memory

Step 2: scheduler maps block 0 to core 0 (reserves execution contexts for 128 threads and 520 bytes of shared storage)

EXECUTE: convolve
ARGS: N, input_array, output_array
NUM_BLOCKS: 1000
NEXT = 1, TOTAL = 1000

[Figure: Core 0 now holds block 0 (contexts 0-127) and block 0’s support (520 bytes); Core 1 is idle]
Stanford CS149, Fall 2023
Running the CUDA kernel

Kernel’s execution requirements:
- Each thread block must execute 128 CUDA threads
- Each thread block must allocate 130 x sizeof(float) = 520 bytes of shared memory

Step 3: scheduler continues to map blocks to available execution contexts (interleaved mapping shown)

EXECUTE: convolve
ARGS: N, input_array, output_array
NUM_BLOCKS: 1000
NEXT = 2, TOTAL = 1000

[Figure: Core 0 holds block 0 (contexts 0-127, support at 0x0); Core 1 holds block 1 (contexts 0-127, support at 0x0)]
Stanford CS149, Fall 2023
Running the CUDA kernel

Kernel’s execution requirements:
- Each thread block must execute 128 CUDA threads
- Each thread block must allocate 130 x sizeof(float) = 520 bytes of shared memory

Step 3 (continued): scheduler continues to map blocks to available execution contexts (interleaved mapping shown)

EXECUTE: convolve
ARGS: N, input_array, output_array
NUM_BLOCKS: 1000
NEXT = 3, TOTAL = 1000

[Figure: Core 0 holds block 0 (contexts 0-127, support at 0x0) and block 2 (contexts 128-255, support at 0x520); Core 1 holds block 1 (contexts 0-127, support at 0x0)]
Stanford CS149, Fall 2023
Running the CUDA kernel

Kernel’s execution requirements:
- Each thread block must execute 128 CUDA threads
- Each thread block must allocate 130 x sizeof(float) = 520 bytes of shared memory

Step 3 (continued): scheduler continues to map blocks to available execution contexts (interleaved mapping shown). Only two thread blocks fit on a core (a third block won’t fit due to insufficient shared storage: 3 x 520 bytes > 1.5 KB)

EXECUTE: convolve
ARGS: N, input_array, output_array
NUM_BLOCKS: 1000
NEXT = 4, TOTAL = 1000

[Figure: Core 0 holds block 0 (contexts 0-127, support at 0x0) and block 2 (contexts 128-255, support at 0x520); Core 1 holds block 1 (contexts 0-127, support at 0x0) and block 3 (contexts 128-255, support at 0x520)]
Stanford CS149, Fall 2023
Running the CUDA kernel

Kernel’s execution requirements:
- Each thread block must execute 128 CUDA threads
- Each thread block must allocate 130 x sizeof(float) = 520 bytes of shared memory

Step 4: thread block 0 completes on core 0

EXECUTE: convolve
ARGS: N, input_array, output_array
NUM_BLOCKS: 1000
NEXT = 4, TOTAL = 1000

[Figure: Core 0 now holds only block 2 (contexts 128-255, support at 0x520); Core 1 holds block 1 (contexts 0-127, support at 0x0) and block 3 (contexts 128-255, support at 0x520)]
Stanford CS149, Fall 2023
Running the CUDA kernel

Kernel’s execution requirements:
- Each thread block must execute 128 CUDA threads
- Each thread block must allocate 130 x sizeof(float) = 520 bytes of shared memory

Step 5: block 4 is scheduled on core 0 (mapped to execution contexts 0-127)

EXECUTE: convolve
ARGS: N, input_array, output_array
NUM_BLOCKS: 1000
NEXT = 5, TOTAL = 1000

[Figure: Core 0 holds block 4 (contexts 0-127, support at 0x0) and block 2 (contexts 128-255, support at 0x520); Core 1 holds block 1 (contexts 0-127, support at 0x0) and block 3 (contexts 128-255, support at 0x520)]
Stanford CS149, Fall 2023
Running the CUDA kernel

Kernel’s execution requirements:
- Each thread block must execute 128 CUDA threads
- Each thread block must allocate 130 x sizeof(float) = 520 bytes of shared memory

Step 6: thread block 2 completes on core 0

EXECUTE: convolve
ARGS: N, input_array, output_array
NUM_BLOCKS: 1000
NEXT = 5, TOTAL = 1000

[Figure: Core 0 holds only block 4 (contexts 0-127, support at 0x0); Core 1 holds block 1 (contexts 0-127, support at 0x0) and block 3 (contexts 128-255, support at 0x520)]
Stanford CS149, Fall 2023
Running the CUDA kernel

Kernel’s execution requirements:
- Each thread block must execute 128 CUDA threads
- Each thread block must allocate 130 x sizeof(float) = 520 bytes of shared memory

Step 7: thread block 5 is scheduled on core 0 (mapped to execution contexts 128-255)

EXECUTE: convolve
ARGS: N, input_array, output_array
NUM_BLOCKS: 1000
NEXT = 6, TOTAL = 1000

[Figure: Core 0 holds block 4 (contexts 0-127, support at 0x0) and block 5 (contexts 128-255, support at 0x520); Core 1 holds block 1 (contexts 0-127, support at 0x0) and block 3 (contexts 128-255, support at 0x520)]
Stanford CS149, Fall 2023
More advanced scheduling questions:
(If you understand the following examples you really understand how CUDA programs
run on a GPU, and also have a good handle on the work scheduling issues we’ve
discussed in the course up to this point.)

Stanford CS149, Fall 2023


Why must CUDA allocate execution contexts for all threads in a block?

Imagine a thread block with 256 CUDA threads (see code below) running on a fictitious SM core with only 128 threads (four warps) worth of parallel execution in HW.

Why not just run threads 0-127 to completion, then run threads 128-255 to completion in order to execute the entire thread block?

#define THREADS_PER_BLK 256

__global__ void convolve(int N, float* input, float* output)
{
   __shared__ float support[THREADS_PER_BLK+2];
   int index = blockIdx.x * blockDim.x + threadIdx.x;

   support[threadIdx.x] = input[index];
   if (threadIdx.x < 2) {
      support[THREADS_PER_BLK+threadIdx.x] = input[index+THREADS_PER_BLK];
   }

   __syncthreads();

   float result = 0.0f;  // thread-local
   for (int i=0; i<3; i++)
      result += support[threadIdx.x + i];

   output[index] = result;
}

Because CUDA kernels may create dependencies between threads in a block. The simplest example is __syncthreads().
Threads in a block cannot be executed by the system in any order when dependencies exist.
CUDA semantics: threads in a block ARE running concurrently. If a thread in a block is runnable it will eventually be run! (no deadlock)
Stanford CS149, Fall 2023
Implementation of CUDA abstractions
▪ Thread blocks can be scheduled in any order by the system
- System assumes no dependencies between blocks
- Logically concurrent
- A lot like ISPC tasks, right?

▪ CUDA threads in same block run concurrently (live at same time)


- When block begins executing, all threads exist and have register state allocated
(these semantics impose a scheduling constraint on the system)
- A CUDA thread block is itself an SPMD program (like an ISPC gang of program instances)
- Threads in thread block are concurrent, cooperating “workers”

▪ CUDA implementation:
- An NVIDIA GPU warp has performance characteristics akin to an ISPC gang of instances (but unlike an ISPC gang, the warp concept does not exist in the programming model*)
- All warps in a thread block are scheduled onto the same SM, allowing for high-BW/low latency communication through
shared memory variables
- When all threads in block complete, block resources (shared memory allocations, warp execution contexts) become available
for next block
* Exceptions to this statement include intra-warp builtin operations like swizzle and vote Stanford CS149, Fall 2023
Consider a program that creates a histogram:
▪ This example: build a histogram of values in an array
  - All CUDA threads atomically update shared variables in global memory
▪ Notice I have never claimed CUDA thread blocks were guaranteed to be independent. I only stated CUDA reserves the right to schedule them in any order.

▪ This is valid code! This use of atomics does not impact the implementation’s ability to schedule blocks in any order (atomics are used for mutual exclusion, and nothing more)

Thread block 0:                ...   Thread block N:
atomicAdd(&counts[A[i]], 1);         atomicAdd(&counts[A[i]], 1);

Global memory:
int counts[10]
int A[N]

int* A = {0, 3, 4, 1, 9, 2, . . . , 8, 4, 1};  // array of integers between 0-9
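A minimal kernel sketch of this pattern (hypothetical name histogram; assumes counts[0..9] has been zero-initialized in device global memory):

__global__ void histogram(const int* A, int N, int* counts) {
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < N)
      atomicAdd(&counts[A[i]], 1);   // atomic update of a global-memory bin
}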
Stanford CS149, Fall 2023
But is this reasonable CUDA code?
▪ Consider implementation on a single-SM GPU with resources for only one CUDA thread block per SM
  - What happens if the CUDA implementation runs block 0 first?
  - What happens if the CUDA implementation runs block 1 first?

Thread block 0:                      Thread block 1:
// do stuff here                     while (atomicAdd(&myFlag, 0) == 0) {}
atomicAdd(&myFlag, 1);               // do stuff here

Global memory:
int myFlag   (assume myFlag is initialized to 0)

Stanford CS149, Fall 2023


Bonus slide: “persistent thread” CUDA programming style

Idea: write CUDA code that requires knowledge of the number of cores and blocks per core that are supported by the underlying GPU implementation.

Programmer launches exactly as many thread blocks as will fill the GPU.
(Program makes assumptions about GPU implementation: that the GPU will in fact run all blocks concurrently. Ugg!)

Now, work assignment to blocks is implemented entirely by the application (circumvents GPU’s thread block scheduler).

Now the programmer’s mental model is that *all* CUDA threads are concurrently running on the GPU at once.

#define THREADS_PER_BLK 128
#define BLOCKS_PER_CHIP 80 * (32*64/128)  // specific to V100 GPU

__device__ int workCounter = 0;  // global mem variable

__global__ void convolve(int N, float* input, float* output) {
   __shared__ int startingIndex;
   __shared__ float support[THREADS_PER_BLK+2];  // shared across block
   while (1) {
      if (threadIdx.x == 0)
         startingIndex = atomicInc(workCounter, THREADS_PER_BLK);
      __syncthreads();
      if (startingIndex >= N)
         break;

      int index = startingIndex + threadIdx.x;  // thread local
      support[threadIdx.x] = input[index];
      if (threadIdx.x < 2)
         support[THREADS_PER_BLK+threadIdx.x] = input[index+THREADS_PER_BLK];

      __syncthreads();

      float result = 0.0f;  // thread-local variable
      for (int i=0; i<3; i++)
         result += support[threadIdx.x + i];
      output[index] = result;

      __syncthreads();
   }
}

// host code //////////////////////////////////////////////////////
int N = 1024 * 1024;
cudaMalloc(&devInput, sizeof(float) * (N+2));  // allocate array in device memory
cudaMalloc(&devOutput, sizeof(float) * N);     // allocate array in device memory
// properly initialize contents of devInput here ...

convolve<<<BLOCKS_PER_CHIP, THREADS_PER_BLK>>>(N, devInput, devOutput);


Stanford CS149, Fall 2023
CUDA summary
▪ Execution semantics
- Partitioning of problem into thread blocks is in the spirit of the data-parallel model (intended to be machine
independent: system schedules blocks onto any number of cores)
- Threads in a thread block actually do run concurrently (they have to, since they cooperate)
- Inside a single thread block: SPMD shared address space programming
- There are subtle but notable differences between these models of execution. Make sure you understand them. (And ask yourself what semantics are being used whenever you encounter a parallel programming system.)
▪ Memory semantics
- Distributed address space: host/device memories
- Thread local/block shared/global variables within device memory
- Loads/stores move data between them (so it is correct to think about local/shared/global memory as being
distinct address spaces)
▪ Key implementation details:
- Threads in a thread block are scheduled onto same GPU “SM” to allow fast communication through shared memory
- Threads in a thread block are grouped into warps for SIMT execution on GPU hardware
Stanford CS149, Fall 2023
One last point…
▪ In this lecture, we talked about writing CUDA programs for the programmable cores in
a GPU
- Work (described by a CUDA kernel launch) was mapped onto the cores via a hardware work scheduler

▪ Remember, there is also the graphics pipeline interface for driving GPU execution
- And much of the interesting non-programmable functionality of the GPU exists to accelerate execution of
graphics pipeline operations
- It’s more or less “turned off” when running CUDA programs

▪ How the GPU implements the graphics pipeline efficiently is a topic for a graphics
class… *

* See CS248a or CS348K


Stanford CS149, Fall 2023
And…
▪ We didn’t even talk about the hundreds of teraflops available in the “tensor cores”
in the SM (for deep learning)

▪ A topic for later in the quarter

* See CS248 or CS348K Stanford CS149, Fall 2023
