07_gpuarch
[Figure: GPU connected to its GDDR5 DRAM memory (a few GB) at ~150-300 GB/sec on high-end GPUs]
Multi-core chip
SIMD execution within a single core (many execution units performing the same instruction)
Multi-threaded execution on a single core (multiple threads executed concurrently by a core)
Graphics 101 + GPU history
(for fun)
Unreal Engine Kite Demo (Epic Games 2015)
Render high-complexity 3D scenes in real time
[Figure: vertices (points in space, numbered 1-4) are assembled into primitives (e.g., triangles, points, lines)]
▪ For all output image pixels covered by the triangle, compute the color of the surface at
that pixel.
// "Shader" function (the function invoked to compute the color of each pixel)
vec4 myShader()
{
    vec3 kd = texture2D(myTexture, uv).rgb;
    kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
    return vec4(kd, 1.0);
}
[Figure: GPU connected to its GDDR6 DRAM memory (10s of GB) at ~1 TB/sec on high-end GPUs]
Observation circa 2001-2003:
GPUs are very fast processors for performing the same computation (shader programs) in parallel on large collections of data (streams of vertices, fragments, and pixels).
float scale_amount;
float input_stream<1000>;     // stream declaration
float output_stream<1000>;    // stream declaration
// Brook kernel: applied to every element of the input stream
kernel void scale(float amount, float a<>, out float b<>) { b = amount * a; }
scale(scale_amount, input_stream, output_stream);   // kernel invocation
▪ Brook compiler translated generic stream program into graphics commands (such as drawTriangles) and a set of graphics shader programs that could be run on GPUs of the day.
GPU compute mode
- Application (via graphics driver) provides GPU shader program binaries
- drawPrimitives(vertex_buffer)

[Figure: graphics pipeline stages: Vertex Generation → Fragment Generation ("Rasterization") → Fragment Processing → Pixel Operations → output image buffer (pixels)]

This was the only interface to GPU hardware: GPU hardware could only execute graphics pipeline computations.
▪ Specifically, be careful with the use of the term "CUDA thread". A CUDA thread presents a similar abstraction to a pthread in that both correspond to logical threads of control, but the implementation of a CUDA thread is very different.
// kernel definition ("device" code: SPMD execution on GPU)
__global__ void matrixAddDoubleB(float A[Ny][Nx],
                                 float B[Ny][Nx],
                                 float C[Ny][Nx])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    C[j][i] = A[j][i] + 2.0f * B[j][i];   // "DoubleB": C = A + 2B (body inferred from the kernel name)
}
// populate deviceA
cudaMemcpy(deviceA, A, bytes, cudaMemcpyHostToDevice);

(What does cudaMemcpy remind you of?)
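For context, a host-side sketch for the matrixAddDoubleB kernel above might look like the following (a minimal sketch, assuming Nx and Ny are compile-time constants and A, B, C are Ny x Nx host arrays; the buffer names, block dimensions, and the casts matching the kernel's 2D array parameters are illustrative):

    float *deviceA, *deviceB, *deviceC;
    size_t bytes = sizeof(float) * Nx * Ny;

    // allocate device buffers
    cudaMalloc(&deviceA, bytes);
    cudaMalloc(&deviceB, bytes);
    cudaMalloc(&deviceC, bytes);

    // copy inputs from host memory to device memory
    cudaMemcpy(deviceA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, B, bytes, cudaMemcpyHostToDevice);

    // launch a 2D grid of 2D thread blocks covering the Ny x Nx matrix
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(Nx / threadsPerBlock.x, Ny / threadsPerBlock.y);
    matrixAddDoubleB<<<numBlocks, threadsPerBlock>>>(
        (float (*)[Nx])deviceA, (float (*)[Nx])deviceB, (float (*)[Nx])deviceC);

    // copy the result back to host memory
    cudaMemcpy(C, deviceC, bytes, cudaMemcpyDeviceToHost);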
Device global memory: readable/writable by all threads.
Different address spaces reflect different regions of locality in the program.
[Figure: 1D input array: input[0], input[1], ..., input[9]]
CUDA Kernel
#define THREADS_PER_BLK 128

// Host code
int N = 1024 * 1024;
float *devInput, *devOutput;
cudaMalloc(&devInput, sizeof(float) * (N+2));   // allocate input array in device memory
cudaMalloc(&devOutput, sizeof(float) * N);      // allocate output array in device memory
▪ Atomic operations
- e.g., float atomicAdd(float* addr, float amount)
- CUDA provides atomic operations on both global memory addresses and per-block shared memory addresses
▪ Host/device synchronization
- Implicit barrier across all threads at return of kernel
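As a quick illustration of both bullets (a minimal sketch; the kernel, buffer names, and launch configuration are made up for this example):

    // every CUDA thread atomically accumulates its element into one global sum
    __global__ void sumAll(int N, float* input, float* totalSum) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            atomicAdd(totalSum, input[idx]);   // atomic add on a global memory address
    }

    // host code: the launch is asynchronous; cudaDeviceSynchronize() blocks the host
    // until all previously launched device work (the kernel) has completed
    sumAll<<<(N + 127) / 128, 128>>>(N, devInput, devSum);
    cudaDeviceSynchronize();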
convolve<<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);   // launch over 1 million CUDA threads (over 8K thread blocks)
[Figure: the thread blocks of this launch mapped onto a mid-range GPU (6 cores)]
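The convolve kernel's device code appears only in fragments on these slides; assembled from those fragments (the index computation and the 130-float support array are inferred from the launch above and the shared-memory requirement discussed later), it is:

    __global__ void convolve(int N, float* input, float* output) {
        __shared__ float support[THREADS_PER_BLK + 2];      // per-block shared memory (130 floats)
        int index = blockIdx.x * blockDim.x + threadIdx.x;  // thread-local

        // cooperatively load this block's region of the input (plus a 2-element halo)
        support[threadIdx.x] = input[index];
        if (threadIdx.x < 2) {
            support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
        }
        __syncthreads();   // wait for all loads into shared memory to complete

        // 3-element moving sum over the shared data
        float result = 0.0f;   // thread-local variable
        for (int i = 0; i < 3; i++)
            result += support[threadIdx.x + i];
        output[index] = result;
    }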
[Figure: problem to solve → decomposition → sub-problems (aka "tasks", "work") → assignment → worker threads]
Other examples:
- ISPC’s implementation of launching tasks
- Creates one pthread for each hyper-thread on CPU. Threads kept alive for remainder of program
- Thread pool in a web server
- Number of threads is a function of number of cores, not number of outstanding requests
- Threads spawned at web server launch, wait for work to arrive
NVIDIA V100 SM "sub-core"

[Figure: one V100 SM sub-core: warp selector, fetch/decode unit, and a register file holding per-thread registers (R0, R1, R2, ...) for the 32 lanes of each resident warp (warps 0, 4, ..., 60), covering CUDA threads 0-31, 32-63, 64-95, 96-127, ...; in this example the CUDA threads shown are mapped to 8 warps]

- Each sub-core in the V100 is capable of scheduling and interleaving the execution of up to 16 warps.
- A warp of 32 CUDA threads executes together, much like an ISPC gang* of program instances.
- A warp is not part of CUDA, but is an important CUDA implementation detail on modern NVIDIA GPUs.

* But GPU hardware is dynamically checking whether 32 independent CUDA threads share an instruction, and if this is true, it executes all 32 threads in a SIMD manner. The CUDA program is not compiled to SIMD instructions like ISPC gangs.
Instruction execution

Instruction stream for the CUDA threads in a warp (note: in this example all instructions are independent):

    00  fp32 mul  r0 r1 r2
    01  int32 add r3 r4 r5
    02  fp32 mul  r6 r7 r8
    ...

[Figure: timeline (in clocks) of the sub-core's warp selector issuing these instructions for warp 4]

Remember, the entire warp of CUDA threads is running this instruction stream, so each instruction is run by all 32 CUDA threads in the warp. Since there are 16 ALUs, running one instruction for the entire warp takes two clocks.
[Figure: the SM's register file, showing per-thread registers (R0, R1, R2, ...) for lanes 0-31 of each of warps 0-63]

- 64 KB of registers per sub-core, 256 KB of registers in total per SM
- Registers are divided among (up to) 64 "warps" per SM
- "Shared" memory + L1 cache storage: 128 KB per SM
- Execution units (figure legend):
  - SIMD fp32 functional units: control shared across 16 units (16 x MUL-ADD per clock*)
  - SIMD int functional units: control shared across 16 units (16 x MUL/ADD per clock*)
  - SIMD fp64 functional units: control shared across 8 units (8 x MUL/ADD per clock**)
  - Tensor core unit
  - Load/store units

* one 32-wide SIMD operation every 2 clocks    ** one 32-wide SIMD operation every 4 clocks
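A quick sanity check on these numbers: 256 KB of registers shared among 64 warps works out to 4 KB per warp, i.e. 128 bytes (32 four-byte registers) per CUDA thread at full occupancy; a kernel that needs more registers per thread reduces the number of warps that can be resident on the SM.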
Running a thread block on a V100 SM
#define THREADS_PER_BLK 128

    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
    }
    __syncthreads();
[Figure: the SM connects through a 6 MB L2 cache to 16 GB of HBM GPU memory over a 4096-bit interface providing 900 GB/sec of bandwidth]
Summary: geometry of the V100 GPU
- 1.245 GHz clock
- 6 MB L2 cache
- 900 GB/sec memory bandwidth
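These numbers can also be read out at run time via the standard CUDA runtime device-query API; a minimal sketch (the selection of printed fields is illustrative, chosen to match the figures above):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0

        printf("SM count:           %d\n",      prop.multiProcessorCount);
        printf("Clock rate:         %.3f GHz\n", prop.clockRate / 1e6);      // clockRate is in kHz
        printf("L2 cache:           %d MB\n",    prop.l2CacheSize >> 20);
        printf("Global memory:      %zu GB\n",   prop.totalGlobalMem >> 30);
        printf("Registers per SM:   %d KB\n",    (prop.regsPerMultiprocessor * 4) >> 10);
        printf("Shared mem per SM:  %zu KB\n",   prop.sharedMemPerMultiprocessor >> 10);
        printf("Max warps per SM:   %d\n",       prop.maxThreadsPerMultiProcessor / prop.warpSize);
        return 0;
    }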
Let's assume array size N is very large, so the host-side kernel launch generates thousands of thread blocks.

#define THREADS_PER_BLK 128
convolve<<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>(N, input_array, output_array);

Step 1: host sends the CUDA device (GPU) a command ("execute this kernel")

    EXECUTE:    convolve
    ARGS:       N, input_array, output_array
    NUM_BLOCKS: 1000

[Figure: a GPU with two cores (Core 0, Core 1), each with its own fetch/decode units and execution contexts]
Running the CUDA kernel

Kernel's execution requirements:
- Each thread block must execute 128 CUDA threads
- Each thread block must allocate 130 x sizeof(float) = 520 bytes of shared memory

    EXECUTE:    convolve
    ARGS:       N, input_array, output_array
    NUM_BLOCKS: 1000

Step 2: scheduler maps block 0 to core 0 (reserves execution contexts for 128 threads and 520 bytes of shared storage).

Step 3: scheduler continues to map blocks to available execution contexts (interleaved mapping shown). Only two thread blocks fit on a core; a third block won't fit due to insufficient shared storage (3 x 520 bytes > 1.5 KB).

[Figure: Core 0 runs Block 0 (contexts 0-127, Block 0's support: 520 bytes @ 0x0); Core 1 runs Block 1 (contexts 0-127, Block 1's support: 520 bytes @ 0x0)]

As blocks complete, the scheduler maps waiting blocks onto the freed execution contexts and shared storage. In the later steps of this example, Block 4 occupies core 0's contexts 0-127 (support: 520 bytes @ 0x0) and Block 3 occupies contexts 128-255 (support: 520 bytes @ 0x520), while Block 1 continues to run on core 1.
More advanced scheduling questions:
(If you understand the following examples, you really understand how CUDA programs run on a GPU, and also have a good handle on the work scheduling issues we've discussed in the course up to this point.)

    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
    }
    __syncthreads();

Assume a fictitious SM core (shown above) with only 128 threads (four warps) worth of parallel execution in HW.
Why not just run threads 0-127 to completion, then run threads 128-255 to completion, in order to execute the entire thread block?

- CUDA kernels may create dependencies between threads in a block
- The simplest example is __syncthreads()
- Threads in a block cannot be executed by the system in any order when dependencies exist
- CUDA semantics: threads in a block ARE running concurrently. If a thread in a block is runnable, it will eventually be run! (no deadlock)
Implementation of CUDA abstractions
▪ Thread blocks can be scheduled in any order by the system
- System assumes no dependencies between blocks
- Logically concurrent
- A lot like ISPC tasks, right?
▪ CUDA implementation:
- An NVIDIA GPU warp has performance characteristics akin to an ISPC gang of instances (but unlike an ISPC gang, the warp concept does not exist in the programming model*)
- All warps in a thread block are scheduled onto the same SM, allowing for high-bandwidth / low-latency communication through shared memory variables
- When all threads in a block complete, the block's resources (shared memory allocations, warp execution contexts) become available for the next block

* Exceptions to this statement include intra-warp builtin operations like swizzle and vote
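To make the footnote concrete, a small sketch of warp-level shuffle ("swizzle") and vote operations (CUDA 9+ *_sync intrinsics; the kernel itself is illustrative and assumes a 1D thread block):

    __global__ void warpOpsDemo(int* out) {
        int lane = threadIdx.x % 32;                 // lane index within the warp

        // shuffle ("swizzle"): read a register value from a neighboring lane of the same warp
        int fromNeighbor = __shfl_down_sync(0xffffffff, lane, 1);

        // vote: build a bitmask with one bit set per lane whose predicate is true
        unsigned evenLanes = __ballot_sync(0xffffffff, (lane % 2) == 0);

        // lane 0 of the block's first warp records a per-block result
        if (threadIdx.x == 0)
            out[blockIdx.x] = fromNeighbor + __popc(evenLanes);
    }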
Consider a program that creates a histogram:
▪ This example: build a histogram of values in an array
- All CUDA threads atomically update shared variables in global memory
▪ Notice I have never claimed CUDA thread blocks were guaranteed to be independent. I only stated CUDA reserves the right to schedule them in any order.
▪ This is valid code! This use of atomics does not impact the implementation's ability to schedule blocks in any order (atomics are used for mutual exclusion, and nothing more).

Global memory:
    int counts[10]
    int A[N]    // array of integers between 0-9, e.g. A = {0, 3, 4, 1, 9, 2, . . . , 8, 4, 1}
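A minimal sketch of such a histogram kernel (the kernel name and launch configuration are illustrative, not taken from the slide):

    __global__ void histogram(int N, int* A, int* counts) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index < N)
            atomicAdd(&counts[A[index]], 1);   // atomically bump the bin for this value
    }

    // host: zero the bins first, e.g. cudaMemset(devCounts, 0, 10 * sizeof(int));
    // histogram<<<(N + 127) / 128, 128>>>(N, devA, devCounts);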
But is this reasonable CUDA code?
▪ Consider running it on a single-SM GPU with resources for only one CUDA thread block per SM
  - What happens if the CUDA implementation runs block 0 first?
  - What happens if the CUDA implementation runs block 1 first?

Global memory:
    int myFlag    // assume myFlag is initialized to 0
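The kernel under discussion is not reproduced here; a sketch of the kind of pattern in question (assuming a two-block launch; names and details are illustrative):

    __global__ void dependsOnBlockOrder(volatile int* myFlag) {
        if (blockIdx.x == 0) {
            while (*myFlag == 0) { /* spin: wait for block 1 to set the flag */ }
            // ... work that depends on block 1 having run ...
        } else {
            *myFlag = 1;           // block 1 signals block 0
        }
    }

If the GPU has room for only one thread block at a time and the scheduler runs block 0 first, block 0 spins forever and block 1 never gets to run: the program only works under a scheduling order that CUDA does not guarantee.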
Now, work assignment to blocks is implemented entirely by the application (circumventing the GPU's thread block scheduler). The programmer's mental model is that *all* CUDA threads are running concurrently on the GPU at once.

        __syncthreads();

        float result = 0.0f;   // thread-local variable
        for (int i = 0; i < 3; i++)
            result += support[threadIdx.x + i];
        output[index] = result;
        __syncthreads();
    }
}

// host code //////////////////////////////////////////////////////
int N = 1024 * 1024;
cudaMalloc(&devInput, sizeof(float) * (N+2));   // allocate array in device memory
cudaMalloc(&devOutput, sizeof(float) * N);      // allocate array in device memory
// properly initialize contents of devInput here ...
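The head of this "persistent threads" version of convolve is not shown in the fragments above; the following is a sketch of the full pattern under the usual assumptions (the global work counter, its name, and the use of atomicAdd are illustrative rather than taken from the slide; THREADS_PER_BLK is as defined earlier):

    __device__ int workCounter = 0;   // shared work-queue index in global memory

    __global__ void convolve(int N, float* input, float* output) {
        __shared__ float support[THREADS_PER_BLK + 2];
        __shared__ int baseIndex;

        while (true) {
            // one thread per block atomically grabs the next chunk of work
            if (threadIdx.x == 0)
                baseIndex = atomicAdd(&workCounter, THREADS_PER_BLK);
            __syncthreads();
            if (baseIndex >= N) break;           // no work left: this block exits

            int index = baseIndex + threadIdx.x;
            support[threadIdx.x] = input[index];
            if (threadIdx.x < 2)
                support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
            __syncthreads();

            float result = 0.0f;                 // thread-local variable
            for (int i = 0; i < 3; i++)
                result += support[threadIdx.x + i];
            output[index] = result;
            __syncthreads();                     // all threads done with support before reuse
        }
    }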
▪ Remember, there is also the graphics pipeline interface for driving GPU execution
- And much of the interesting non-programmable functionality of the GPU exists to accelerate execution of graphics pipeline operations
- It's more or less "turned off" when running CUDA programs
▪ How the GPU implements the graphics pipeline efficiently is a topic for a graphics class…