
ECE408/CS483/CSE408 Fall 2019

Applied Parallel Programming

Lecture 5:
Locality and Tiled Matrix
Multiplication

1
Objective
• To learn to evaluate the performance implications of global
memory accesses
• To prepare for MP-3: tiled matrix multiplication

2
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Kernel Invocation (Host-side Code)

// Setup the execution configuration
// BLOCK_WIDTH is a #define constant
dim3 dimGrid(ceil((1.0*Width)/BLOCK_WIDTH),
             ceil((1.0*Width)/BLOCK_WIDTH), 1);
dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018 3
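For context, a minimal host-side sketch of the full setup around this launch. The wrapper name hostMatrixMul and the host buffer names h_M, h_N, h_P are illustrative assumptions; the launch line and BLOCK_WIDTH follow the slide above, and MatrixMulKernel is the kernel on the next slide.

// A hypothetical host-side wrapper around the launch above (a sketch, not the
// course's reference code). Assumes square Width x Width row-major matrices.
#include <math.h>
#include <cuda_runtime.h>

void hostMatrixMul(float *h_M, float *h_N, float *h_P, int Width)
{
  size_t bytes = Width * Width * sizeof(float);
  float *Md, *Nd, *Pd;

  // Allocate device global memory and copy the inputs over
  cudaMalloc((void**)&Md, bytes);
  cudaMalloc((void**)&Nd, bytes);
  cudaMalloc((void**)&Pd, bytes);
  cudaMemcpy(Md, h_M, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(Nd, h_N, bytes, cudaMemcpyHostToDevice);

  // Execution configuration: one thread per P element
  dim3 dimGrid(ceil((1.0*Width)/BLOCK_WIDTH), ceil((1.0*Width)/BLOCK_WIDTH), 1);
  dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH, 1);
  MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

  // Copy the result back and free device memory
  cudaMemcpy(h_P, Pd, bytes, cudaMemcpyDeviceToHost);
  cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}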


A Simple Matrix Multiplication Kernel
__global__
void MatrixMulKernel(float *d_M, float *d_N, float *d_P, int Width)
{
  // Calculate the row index of the d_P element and d_M
  int Row = blockIdx.y*blockDim.y + threadIdx.y;
  // Calculate the column index of d_P and d_N
  int Col = blockIdx.x*blockDim.x + threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];
    d_P[Row*Width+Col] = Pvalue;
  }
}

4
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
How about performance on a device with 150 GB/s memory bandwidth?

• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add (2 FP ops)
  – That is 4 bytes of global memory traffic per FLOP
  – 150 GB/s therefore limits the code to 37.5 GFLOPS

• The actual code runs at about 25 GFLOPS

• Need to drastically cut down memory accesses to get closer to the peak of more than 1,000 GFLOPS

[Figure: CUDA device memory model — a grid of thread blocks, each with its own shared memory and per-thread registers, all connected to global and constant memory, which the host also accesses]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018 5


Data reuse within MatrixMulKernel

• Observe that each input element of M and N is used WIDTH times

• Idea: leverage the re-use pattern to reduce pressure on global memory

[Figure: WIDTH x WIDTH matrices M, N, and P — element P(Row, Col) is the dot product of row Row of M and column Col of N]

6
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
A Common Programming Strategy
• Global memory is implemented with DRAM - slow
• To avoid Global Memory bottleneck, tile the input data to take
advantage of Shared Memory:
– Partition data into subsets (tiles) that fit into the (smaller but faster)
shared memory
– Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory, using multiple threads
to exploit memory-level parallelism
• Performing the computation on the subset from shared memory; each thread can
efficiently access any data element
• Copying results from shared memory to global memory
– Tiles are also called blocks in the literature
7
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Declaring Shared Memory Arrays

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
  __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];

• Kernel memory objects (e.g., variables) declared as __shared__ are shared across all
threads in the thread block and are allocated in the shared memory of an SM

8
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
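As an aside (not used in the rest of this lecture), shared memory can also be sized at launch time with an extern __shared__ declaration. A minimal sketch, where the kernel name MatrixMulKernelDyn and the flat-array layout are illustrative assumptions:

// Dynamically sized shared memory: the byte count is supplied as the third
// launch-configuration parameter instead of being fixed at compile time.
__global__ void MatrixMulKernelDyn(float* M, float* N, float* P, int Width)
{
  extern __shared__ float tiles[];                   // one flat array per block
  float* subTileM = tiles;                           // first TILE_WIDTH*TILE_WIDTH floats
  float* subTileN = tiles + TILE_WIDTH*TILE_WIDTH;   // next TILE_WIDTH*TILE_WIDTH floats
  // ... index as subTileM[ty*TILE_WIDTH + tx], subTileN[k*TILE_WIDTH + tx], etc.
}

// Launch, reserving room for both tiles:
// MatrixMulKernelDyn<<<dimGrid, dimBlock, 2*TILE_WIDTH*TILE_WIDTH*sizeof(float)>>>(Md, Nd, Pd, Width);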
Shared Memory Tiling Basic Idea

[Figure: without tiling, each thread reads its operands directly from global memory; with tiling, the threads of a block first copy the data from global memory into shared memory and then read the shared copy]

9
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Outline of Technique
• Identify a tile of global data that are accessed by multiple threads
• Load the tile from global memory into shared memory
• Have the multiple threads access their data from shared memory
• Move on to the next block/tile

10
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
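The outline above maps onto a short CUDA skeleton. This is a minimal sketch: tiledKernelSketch, globalData, out, and TILE_SIZE are illustrative placeholders, the per-thread computation is a dummy accumulation just to show the reuse, and it assumes blockDim.x == TILE_SIZE with n a multiple of TILE_SIZE.

// Generic shape of the shared-memory tiling technique
__global__ void tiledKernelSketch(const float* globalData, float* out, int n)
{
  __shared__ float tile[TILE_SIZE];                            // one tile per block
  float result = 0.0f;

  for (int t = 0; t < n / TILE_SIZE; ++t) {                    // move on tile by tile
    tile[threadIdx.x] = globalData[t*TILE_SIZE + threadIdx.x]; // cooperative load from global memory
    __syncthreads();                                           // wait until the whole tile is loaded
    for (int k = 0; k < TILE_SIZE; ++k)                        // every thread reuses every tile element
      result += tile[k];
    __syncthreads();                                           // wait before the next phase overwrites the tile
  }
  out[blockIdx.x*blockDim.x + threadIdx.x] = result;
}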
Use Shared Memory for data that will be reused

• Observe that each input element of M and N is used WIDTH times

• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth

[Figure: same M, N, P diagram — the row of M and the column of N that feed P(Row, Col)]

11
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Tiled Multiply

• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one tile of M and N

• For each tile:
  – Phase 1: Load tiles of M & N into shared memory
  – Phase 2: Calculate partial dot product for the tile of P

[Figure: M, N, and P partitioned into TILE_WIDTH x TILE_WIDTH tiles; block indices (bx, by) select a tile of P and thread indices (tx, ty) select an element within it]

12
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Work for Block (0,0)
Step 0

[Figure: 4x4 example with 2x2 tiles. The threads of block (0,0) load the first tile of N (N0,0, N0,1, N1,0, N1,1) and the first tile of M (M0,0, M0,1, M1,0, M1,1) into shared memory; block (0,0) will produce P0,0, P0,1, P1,0, P1,1.]

13
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018

Work for Block (0,0)
Step 1

[Figure: with the first tiles of M and N in shared memory, each thread of block (0,0) accumulates the first term of its dot product, e.g. P0,0 += M0,0 * N0,0.]

14
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018

Work for Block (0,0)
Step 2

[Figure: each thread accumulates the second term from the same shared tiles, e.g. P0,0 += M0,1 * N1,0, finishing the contribution of the first tile pair.]

15
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018

Work for Block (0,0)
Step 3

[Figure: the threads load the second tile of N (N2,0, N2,1, N3,0, N3,1) and the second tile of M (M0,2, M0,3, M1,2, M1,3) into shared memory, overwriting the first tiles.]

16
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018

Work for Block (0,0)
Step 4

[Figure: each thread accumulates the third term of its dot product from the new shared tiles, e.g. P0,0 += M0,2 * N2,0.]

17
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018

Work for Block (0,0)
Step 5

[Figure: each thread accumulates the fourth and final term, e.g. P0,0 += M0,3 * N3,0, completing its P element for this 4x4 example.]

18
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Phase 1: Loading a Tile
• All threads in a block participate
– Each thread loads one M element and one N element in basic tiling code

• Assign the loaded elements to threads such that the accesses
within each warp are coalesced (more on this later).

19
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Loading an Input Tile 0

2D indexing for Tile 0:
  M[Row][tx]
  N[ty][Col]

[Figure: for tile 0, each thread (tx, ty) of block (bx, by) loads one element from the leftmost TILE_WIDTH columns of M's row Row and one from the topmost TILE_WIDTH rows of N's column Col]

20
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Loading an Input Tile 1

Accessing Tile 1 in 2D indexing:
  M[Row][1*TILE_WIDTH+tx]
  N[1*TILE_WIDTH+ty][Col]

[Figure: tile 1 lies one TILE_WIDTH further along the rows of M and one TILE_WIDTH further down the columns of N]

21
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Loading an Input Tile m

Accessing Tile m in 2D indexing:
  M[Row][m*TILE_WIDTH+tx]
  N[m*TILE_WIDTH+ty][Col]

However, recall that M and N are dynamically allocated and can only use 1D (row-major) indexing:
  M[Row*Width + m*TILE_WIDTH + tx]
  N[(m*TILE_WIDTH+ty)*Width + Col]

[Figure: the m-th tile of M along row Row and the m-th tile of N along column Col, highlighted within the WIDTH x WIDTH matrices]

22
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Phase 2: Compute partial product

To perform the kth step of the dot product within the tile, each thread reads:
  subTileM[ty][k]
  subTileN[k][tx]

[Figure: within the shared tiles, thread (tx, ty) walks along row ty of subTileM and column tx of subTileN]

23
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Barrier Synchronization
• An API function call in CUDA: __syncthreads()

• All threads in the same block must reach the __syncthreads() before any can move on

• Can be used to coordinate tiled algorithms
  – To ensure that all elements of a tile are loaded
  – To ensure that certain computation on elements is complete

24
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
[Figure: time runs left to right; threads 0 through N-1 reach the barrier at different times, and none proceeds until all N threads have arrived]
25
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
  __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the P element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;

  // Loop over the M and N tiles required to compute the P element
  // The code assumes that Width is a multiple of TILE_WIDTH!
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of M and N tiles into shared memory
    subTileM[ty][tx] = M[Row*Width + m*TILE_WIDTH + tx];
    subTileN[ty][tx] = N[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += subTileM[ty][k] * subTileN[k][tx];
    __syncthreads();
  }
  P[Row*Width + Col] = Pvalue;
}
26
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
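For MP-3, Width need not be a multiple of TILE_WIDTH. Below is a minimal sketch of how the loads and the final store can be guarded with boundary checks; this generalized variant (including the name MatrixMulKernelGeneral) is an illustrative assumption, not the course's reference solution.

__global__ void MatrixMulKernelGeneral(float* M, float* N, float* P, int Width)
{
  __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
  __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];

  int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
  int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
  float Pvalue = 0;

  // ceil(Width / TILE_WIDTH) phases cover the whole matrix
  for (int m = 0; m < (Width + TILE_WIDTH - 1)/TILE_WIDTH; ++m) {
    // Guard every global load; out-of-range elements become 0 so they
    // contribute nothing to the dot product
    int mCol = m*TILE_WIDTH + threadIdx.x;
    int nRow = m*TILE_WIDTH + threadIdx.y;
    subTileM[threadIdx.y][threadIdx.x] = (Row < Width && mCol < Width) ? M[Row*Width + mCol] : 0.0f;
    subTileN[threadIdx.y][threadIdx.x] = (nRow < Width && Col < Width) ? N[nRow*Width + Col] : 0.0f;
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += subTileM[threadIdx.y][k] * subTileN[k][threadIdx.x];
    __syncthreads();   // every thread reaches both barriers, even out-of-range ones
  }
  // Guard the store: out-of-range threads only helped load and then discard Pvalue
  if (Row < Width && Col < Width)
    P[Row*Width + Col] = Pvalue;
}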
Compare with Basic MM Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  // Calculate the row index of the P element and M
  int Row = blockIdx.y * blockDim.y + threadIdx.y;
  // Calculate the column index of P and N
  int Col = blockIdx.x * blockDim.x + threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
      Pvalue += M[Row*Width+k] * N[k*Width+Col];
    P[Row*Width+Col] = Pvalue;
  }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018 27


Shared Memory and Threading
• Each SM in Maxwell has 64KB shared memory (48KB max per block)
– Shared memory size is implementation dependent!

– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
• Shared memory can potentially support up to 32 active blocks
• The threads per SM constraint (2048) will limit the number of blocks to 8
• This allows up to 8*512 = 4096 pending loads. (2 per thread, 256 threads per block)

– TILE_WIDTH = 32 would lead to 2*32*32*4B= 8KB shared memory per thread block
• Shared memory can potentially support up to 8 active blocks
• The threads per SM constraint (2048) will limit the number of blocks to 2
• This allows up to 2*2048 = 4096 pending loads (2 per thread, 1024 threads per block)

28
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
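These resident-block counts can also be checked programmatically. A minimal sketch using the CUDA occupancy API: MatrixMulKernel and TILE_WIDTH come from the slides, the helper name reportOccupancy is an assumption, and the reported numbers are device dependent.

// Query how many MatrixMulKernel blocks can be resident on one SM for a
// given tile width (shared memory use is static here, so dynamic smem = 0).
#include <cstdio>
#include <cuda_runtime.h>

void reportOccupancy()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  int blocksPerSM = 0;
  int threadsPerBlock = TILE_WIDTH * TILE_WIDTH;   // 256 for TILE_WIDTH = 16
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, MatrixMulKernel,
                                                threadsPerBlock, 0);

  printf("%zu B shared memory per SM, %d max threads per SM, %d resident blocks\n",
         prop.sharedMemPerMultiprocessor, prop.maxThreadsPerMultiProcessor,
         blocksPerSM);
}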
Memory Bandwidth Consumption
• Using 16x16 tiling, we reduce the global memory accesses by a factor of 16
– Each float is now used by 16 floating-point operations
– The 150GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!

• Using 32x32 tiling, we reduce the global memory accesses by a factor of 32


– Each float is now used by 32 floating-point operations
– The 150 GB/s bandwidth can now support (150/4)*32 = 1200 GFLOPS!
– The memory bandwidth is no longer a limiting factor for performance!

29
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
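The same back-of-the-envelope bound can be written as a one-line helper (a sketch; the 4 bytes-per-FLOP figure comes from the two 4-byte loads per multiply-add in the untiled kernel, and the function name is an assumption):

// Bandwidth-limited throughput estimate, in GFLOPS, for a given tile width.
// bandwidthGBps is the global memory bandwidth, e.g. 150.0f for the device above.
float bandwidthBoundGFLOPS(float bandwidthGBps, int tileWidth)
{
  // The untiled code needs 4 B of global memory traffic per FLOP; tiling
  // reuses each loaded float tileWidth times, cutting traffic by that factor.
  return (bandwidthGBps / 4.0f) * tileWidth;
}

// bandwidthBoundGFLOPS(150.0f, 16) -> 600, bandwidthBoundGFLOPS(150.0f, 32) -> 1200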
Cache behavior of simple CPU version?

// Matrix multiplication on the (CPU) host in single precision
void MatrixMul(float *M, float *N, float *P, int Width)
{
  for (int row = 0; row < Width; ++row)
    for (int col = 0; col < Width; ++col) {
      float sum = 0;
      for (int k = 0; k < Width; ++k) {
        float a = M[row * Width + k];
        float b = N[k * Width + col];
        sum += a * b;
      }
      P[row * Width + col] = sum;
    }
}

Recall the CPU version of MatrixMul.

What level of reuse would we achieve with a simple HW cache?

Why doesn't the cache strategy apply to GPU shared memory?

[Figure: the row of M and the column of N traversed by the inner k loop for one element of P]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018 30
ANY MORE QUESTIONS?
READ CHAPTER 4!
31
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018
