ECE408 Lecture 5: CUDA Tiled Matrix Multiplication
Lecture 5:
Locality and Tiled Matrix Multiplication
Objective
• To learn to evaluate the performance implications of global memory accesses
• To prepare for MP-3: tiled matrix multiplication
Kernel Invocation (Host-side Code)
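A minimal sketch of the host-side launch for square Width x Width matrices, assuming device pointers d_M, d_N, d_P (names chosen for this example) and the MatrixMulKernel and TILE_WIDTH defined later in this deck:

// Host-side kernel invocation (sketch): one thread per element of P,
// organized as a 2D grid of TILE_WIDTH x TILE_WIDTH thread blocks.
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
             (Width + TILE_WIDTH - 1) / TILE_WIDTH, 1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);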
How about performance on a device with 150 GB/s memory bandwidth?
• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add (2 fp ops)
  – That is 4 bytes of memory traffic per FLOP
  – 150 GB/s of bandwidth therefore limits the code to 37.5 GFLOPS
• Idea: leverage the reuse pattern to reduce pressure on global memory (see the sketch below)
[Figure: CUDA memory hierarchy (grid, blocks, shared memory, registers) next to the M, P matrix diagram with Row, Col, and WIDTH labeled]
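To make the 4-bytes-per-FLOP figure concrete, here is the body of the straightforward (non-tiled) kernel as a sketch, using the Row, Col, and Width names from these slides; each loop iteration issues two 4-byte global loads for one multiply-add:

// Non-tiled kernel body (sketch): 8 bytes of global-memory traffic buy
// only 2 floating-point operations (one multiply, one add).
float Pvalue = 0;
for (int k = 0; k < Width; ++k)
    Pvalue += M[Row * Width + k] * N[k * Width + Col];   // two loads, one multiply-add
P[Row * Width + Col] = Pvalue;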
A Common Programming Strategy
• Global memory is implemented with DRAM - slow
• To avoid the global memory bottleneck, tile the input data to take advantage of shared memory:
  – Partition data into subsets (tiles) that fit into the (smaller but faster) shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently access any data element
    • Copying results from shared memory to global memory
  – Tiles are also called blocks in the literature
Declaring Shared Memory Arrays
• Kernel memory objects (e.g., variables) declared as __shared__ are shared across all threads in the thread block and are allocated in the shared memory of an SM. For example:
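A minimal, self-contained illustration (the kernel name SharedExample and what it computes are invented for this example; TILE_WIDTH is an assumed compile-time constant, and the launch is assumed to use one TILE_WIDTH x TILE_WIDTH thread block):

#define TILE_WIDTH 16   // assumed compile-time tile size

// Each thread block gets its own copy of subTile in the SM's shared memory;
// every thread of the block can read what any other thread wrote, once a
// barrier guarantees the writes have completed.
__global__ void SharedExample(float* out)
{
    __shared__ float subTile[TILE_WIDTH][TILE_WIDTH];
    subTile[threadIdx.y][threadIdx.x] = threadIdx.x;   // each thread writes one element
    __syncthreads();                                   // make all writes visible block-wide
    out[threadIdx.y * TILE_WIDTH + threadIdx.x] =
        subTile[threadIdx.x][threadIdx.y];             // read an element written by another thread
}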
Shared Memory Tiling Basic Idea
[Figure: without tiling, each thread fetches its operands directly from data in global memory; with tiling, the data is first staged into shared memory and the threads then read it from there.]
Outline of Technique
• Identify a tile of global data that are accessed by multiple threads
• Load the tile from global memory into shared memory
• Have the multiple threads access their data from shared memory
• Move on to the next block/tile (the skeleton below sketches this loop)
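In code form the outline becomes a loop over tiles; this is only a sketch, with the details filled in on the following slides (Width is assumed to be a multiple of TILE_WIDTH):

// Skeleton of the tiling technique: each iteration handles one tile of M and N.
for (int m = 0; m < Width / TILE_WIDTH; ++m) {
    // (1) all threads of the block cooperatively load tile m from global
    //     memory into shared memory
    __syncthreads();    // barrier: the tile must be complete before anyone uses it
    // (2) each thread accumulates its partial dot product from shared memory
    __syncthreads();    // barrier: finish using the tile before it is overwritten
}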
Use Shared Memory for data that will be reused
• Load each element into shared memory once and have several threads use the shared copy, reducing the demand on global memory bandwidth
[Figure: the M, P matrix diagram again, with Row, Col, and WIDTH labeled]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one tile of M and N
• For each tile:
  – Phase 1: Load the tiles of M and N into shared memory
  – Phase 2: Calculate the partial dot product for the tile of P
[Figure: M, N, and P divided into TILE_WIDTH x TILE_WIDTH tiles; block indices bx, by select the tile of P, thread indices tx, ty select the element within the tile]
Work for Block (0,0)
[Figure sequence, Steps 0-5: block (0,0) loads the first tiles of M and N into shared memory, accumulates the corresponding partial dot products into its tile of P, then loads the next tiles and accumulates the remaining partial products until its tile of P is complete.]
Phase 1: Loading a Tile
• All threads in a block participate
  – Each thread loads one M element and one N element in the basic tiling code
• Assign the loaded elements to threads so that the accesses within each warp are coalesced (more on this later)
Loading an Input Tile 0
Accessing tile 0 in 2D indexing:
  M[Row][tx]
  N[ty][Col]
[Figure: the tiling diagram with the first tiles of M and N highlighted]
Loading an Input Tile 1
Accessing tile 1 in 2D indexing:
  M[Row][1*TILE_WIDTH+tx]
  N[1*TILE_WIDTH+ty][Col]
[Figure: the same diagram with the second tiles of M and N highlighted]
Loading an Input Tile m
Accessing tile m in 2D indexing:
  M[Row][m*TILE_WIDTH+tx]
  N[m*TILE_WIDTH+ty][Col]
Since M and N are passed to the kernel as flat arrays (float*), we can only use 1D indexing:
  M[Row*Width + m*TILE_WIDTH + tx]
  N[(m*TILE_WIDTH+ty) * Width + Col]
[Figure: the general case, with tile m of M and N highlighted]
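In the kernel these two accesses become the cooperative tile load, written here as a sketch using the subTileM/subTileN arrays and the shorthand ty = threadIdx.y, tx = threadIdx.x from the slides:

// Each thread loads exactly one element of each tile. Threads with
// consecutive tx read consecutive global addresses, so each warp's loads
// coalesce into a few wide memory transactions.
subTileM[ty][tx] = M[Row * Width + m * TILE_WIDTH + tx];
subTileN[ty][tx] = N[(m * TILE_WIDTH + ty) * Width + Col];
__syncthreads();   // the whole tile must be loaded before any thread uses it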
Phase 2: Compute the partial product
To perform the k-th step of the dot product within the tile, each thread reads
  subTileM[ty][k]
  subTileN[k][tx]
from shared memory and accumulates their product (see the sketch below).
[Figure: the subTileM and subTileN tiles in shared memory feeding one element of P]
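As code, Phase 2 is the inner loop below (a sketch; Pvalue is each thread's running dot-product accumulator, as in the full kernel that follows):

// Phase 2: accumulate the partial dot product for this tile entirely from
// shared memory; no global-memory traffic inside this loop.
for (int k = 0; k < TILE_WIDTH; ++k)
    Pvalue += subTileM[ty][k] * subTileN[k][tx];
__syncthreads();   // finish with this tile before the next one overwrites it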
Barrier Synchronization
• __syncthreads() is a CUDA API call that acts as a barrier: every thread in the block must reach the call before any thread in the block is allowed to proceed past it
[Figure: timeline of Threads 0 through N-1 arriving at the barrier at different times; all of them continue only after the last thread arrives]
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
    __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];
    // ... index setup and the loop over tiles (load, __syncthreads(),
    //     partial dot product, __syncthreads()); see the full listing below ...
    P[Row*Width+Col] = Pvalue;
}
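For reference, the complete kernel assembled from the pieces on the preceding slides (a sketch in the deck's notation; Width is assumed to be a multiple of TILE_WIDTH, so no boundary checks are included):

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
    __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;   // row of P computed by this thread
    int Col = blockIdx.x * TILE_WIDTH + tx;   // column of P computed by this thread
    float Pvalue = 0;

    // Loop over the tiles of M and N required to compute P[Row][Col]
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Phase 1: cooperative, coalesced load of one tile of M and one tile of N
        subTileM[ty][tx] = M[Row * Width + m * TILE_WIDTH + tx];
        subTileN[ty][tx] = N[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();                      // wait until the whole tile is loaded

        // Phase 2: partial dot product entirely from shared memory
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += subTileM[ty][k] * subTileN[k][tx];
        __syncthreads();                      // wait before the tile is overwritten
    }
    P[Row * Width + Col] = Pvalue;
}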
• For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
  – Shared memory capacity alone could support up to 32 active blocks per SM
  – The threads-per-SM constraint (2048) limits the number of blocks to 8
  – This allows up to 8*512 = 4096 pending loads (2 per thread, 256 threads per block)
• TILE_WIDTH = 32 would lead to 2*32*32*4B = 8KB of shared memory per thread block
  – Shared memory capacity alone could support up to 8 active blocks per SM
  – The threads-per-SM constraint (2048) limits the number of blocks to 2
  – This still allows up to 2*2048 = 4096 pending loads (2 per thread, 1024 threads per block)
These resource limits can also be queried at runtime, as sketched below.
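The hand calculation above can be cross-checked with the CUDA occupancy API; a minimal sketch (the helper name reportOccupancy is invented here, while MatrixMulKernel and TILE_WIDTH are the ones defined above):

#include <cstdio>

// Sketch: ask the runtime how many blocks of the tiled kernel can be resident
// on one SM for a TILE_WIDTH*TILE_WIDTH-thread block. Static shared memory
// (the two subtiles) is accounted for automatically, so 0 bytes of dynamic
// shared memory are passed.
void reportOccupancy()
{
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, MatrixMulKernel, TILE_WIDTH * TILE_WIDTH, 0);
    printf("Resident blocks per SM: %d\n", blocksPerSM);
}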
Memory Bandwidth Consumption
• Using 16x16 tiling, we reduce global memory traffic by a factor of 16
  – Each float loaded from global memory is now used in 16 floating-point operations
  – The 150 GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!
Cache behavior of simple CPU version?
// Matrix multiplication on the (CPU) host in single precision
void MatrixMul(float *M, float *N, float *P, int Width)
{
    for (int row = 0; row < Width; ++row)
        for (int col = 0; col < Width; ++col) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[row * Width + k];
                float b = N[k * Width + col];
                sum += a * b;
            }
            P[row * Width + col] = sum;
        }
}
[Figure: for one output element, the k loop walks across a row of M and down a column of N]
Why doesn't the cache strategy apply to GPU shared memory?
ANY MORE QUESTIONS?
READ CHAPTER 4!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018