ECE408 Lecture 5: CUDA Tiled Matrix Multiplication
Lecture 5:
Locality and Tiled Matrix Multiplication
Objective
• To learn to evaluate the performance implications of global memory accesses
• To prepare for MP-3: tiled matrix multiplication
Kernel Invocation (Host-side Code)
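A minimal sketch of the host-side launch for square Width x Width matrices, assuming device pointers d_M, d_N, d_P (names chosen for this example) and the MatrixMulKernel and TILE_WIDTH defined later in this deck:

// Host-side kernel invocation (sketch): one thread per element of P,
// organized as a 2D grid of TILE_WIDTH x TILE_WIDTH thread blocks.
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
             (Width + TILE_WIDTH - 1) / TILE_WIDTH, 1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);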
How about performance on a device with 150 GB/s memory bandwidth?
• All threads access global memory for their input matrix elements
  – Two memory accesses (8 bytes) per floating-point multiply-add (2 fp ops)
  – That is 4 bytes of memory traffic per FLOP
  – 150 GB/s of bandwidth therefore limits the code to 37.5 GFLOPS
• Idea: leverage the reuse pattern to reduce pressure on global memory (see the sketch below)
[Figure: CUDA memory hierarchy (grid, blocks, shared memory, registers) next to the M, P matrix diagram with Row, Col, and WIDTH labeled]
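To make the 4-bytes-per-FLOP figure concrete, here is the body of the straightforward (non-tiled) kernel as a sketch, using the Row, Col, and Width names from these slides; each loop iteration issues two 4-byte global loads for one multiply-add:

// Non-tiled kernel body (sketch): 8 bytes of global-memory traffic buy
// only 2 floating-point operations (one multiply, one add).
float Pvalue = 0;
for (int k = 0; k < Width; ++k)
    Pvalue += M[Row * Width + k] * N[k * Width + Col];   // two loads, one multiply-add
P[Row * Width + Col] = Pvalue;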
A Common Programming Strategy
• Global memory is implemented with DRAM - slow
• To avoid the global memory bottleneck, tile the input data to take advantage of shared memory:
  – Partition data into subsets (tiles) that fit into the (smaller but faster) shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently access any data element
    • Copying results from shared memory to global memory
  – Tiles are also called blocks in the literature
Declaring Shared Memory Arrays
• Kernel memory objects (e.g., variables) declared as __shared__ are shared across all threads in the thread block and are allocated in the shared memory of an SM. For example:
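A minimal, self-contained illustration (the kernel name SharedExample and what it computes are invented for this example; TILE_WIDTH is an assumed compile-time constant, and the launch is assumed to use one TILE_WIDTH x TILE_WIDTH thread block):

#define TILE_WIDTH 16   // assumed compile-time tile size

// Each thread block gets its own copy of subTile in the SM's shared memory;
// every thread of the block can read what any other thread wrote, once a
// barrier guarantees the writes have completed.
__global__ void SharedExample(float* out)
{
    __shared__ float subTile[TILE_WIDTH][TILE_WIDTH];
    subTile[threadIdx.y][threadIdx.x] = threadIdx.x;   // each thread writes one element
    __syncthreads();                                   // make all writes visible block-wide
    out[threadIdx.y * TILE_WIDTH + threadIdx.x] =
        subTile[threadIdx.x][threadIdx.y];             // read an element written by another thread
}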
Shared Memory Tiling Basic Idea
[Figure: without tiling, each thread fetches its operands directly from data in global memory; with tiling, the data is first staged into shared memory and the threads then read it from there.]
Outline of Technique
• Identify a tile of global data that are accessed by multiple threads
• Load the tile from global memory into shared memory
• Have the multiple threads access their data from shared memory
• Move on to the next block/tile (the skeleton below sketches this loop)
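In code form the outline becomes a loop over tiles; this is only a sketch, with the details filled in on the following slides (Width is assumed to be a multiple of TILE_WIDTH):

// Skeleton of the tiling technique: each iteration handles one tile of M and N.
for (int m = 0; m < Width / TILE_WIDTH; ++m) {
    // (1) all threads of the block cooperatively load tile m from global
    //     memory into shared memory
    __syncthreads();    // barrier: the tile must be complete before anyone uses it
    // (2) each thread accumulates its partial dot product from shared memory
    __syncthreads();    // barrier: finish using the tile before it is overwritten
}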
Use Shared Memory for data that will be reused
• Load each element into shared memory once and have several threads use the shared copy, reducing the demand on global memory bandwidth
[Figure: the M, P matrix diagram again, with Row, Col, and WIDTH labeled]
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one tile of M and N
• For each tile:
  – Phase 1: Load the tiles of M and N into shared memory
  – Phase 2: Calculate the partial dot product for the tile of P
[Figure: M, N, and P divided into TILE_WIDTH x TILE_WIDTH tiles; block indices bx, by select the tile of P, thread indices tx, ty select the element within the tile]
Work for Block (0,0)
[Figure sequence, Steps 0-5: block (0,0) loads the first tiles of M and N into shared memory, accumulates the corresponding partial dot products into its tile of P, then loads the next tiles and accumulates the remaining partial products until its tile of P is complete.]
Phase 1: Loading a Tile
• All threads in a block participate
  – Each thread loads one M element and one N element in the basic tiling code
• Assign the loaded elements to threads so that the accesses within each warp are coalesced (more on this later)
Loading an Input Tile 0
Accessing tile 0 in 2D indexing:
  M[Row][tx]
  N[ty][Col]
[Figure: the tiling diagram with the first tiles of M and N highlighted]
Loading an Input Tile 1
Accessing tile 1 in 2D indexing:
  M[Row][1*TILE_WIDTH+tx]
  N[1*TILE_WIDTH+ty][Col]
[Figure: the same diagram with the second tiles of M and N highlighted]
Loading an Input Tile m
Accessing tile m in 2D indexing:
  M[Row][m*TILE_WIDTH+tx]
  N[m*TILE_WIDTH+ty][Col]
Since M and N are passed to the kernel as flat arrays (float*), we can only use 1D indexing:
  M[Row*Width + m*TILE_WIDTH + tx]
  N[(m*TILE_WIDTH+ty) * Width + Col]
[Figure: the general case, with tile m of M and N highlighted]
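In the kernel these two accesses become the cooperative tile load, written here as a sketch using the subTileM/subTileN arrays and the shorthand ty = threadIdx.y, tx = threadIdx.x from the slides:

// Each thread loads exactly one element of each tile. Threads with
// consecutive tx read consecutive global addresses, so each warp's loads
// coalesce into a few wide memory transactions.
subTileM[ty][tx] = M[Row * Width + m * TILE_WIDTH + tx];
subTileN[ty][tx] = N[(m * TILE_WIDTH + ty) * Width + Col];
__syncthreads();   // the whole tile must be loaded before any thread uses it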
Phase 2: Compute the partial product
To perform the k-th step of the dot product within the tile, each thread reads
  subTileM[ty][k]
  subTileN[k][tx]
from shared memory and accumulates their product (see the sketch below).
[Figure: the subTileM and subTileN tiles in shared memory feeding one element of P]
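As code, Phase 2 is the inner loop below (a sketch; Pvalue is each thread's running dot-product accumulator, as in the full kernel that follows):

// Phase 2: accumulate the partial dot product for this tile entirely from
// shared memory; no global-memory traffic inside this loop.
for (int k = 0; k < TILE_WIDTH; ++k)
    Pvalue += subTileM[ty][k] * subTileN[k][tx];
__syncthreads();   // finish with this tile before the next one overwrites it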
Barrier Synchronization
• __syncthreads() is a CUDA API call that acts as a barrier: every thread in the block must reach the call before any thread in the block is allowed to proceed past it
[Figure: timeline of Threads 0 through N-1 arriving at the barrier at different times; all of them continue only after the last thread arrives]
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
    __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];
    // ... index setup and the loop over tiles (load, __syncthreads(),
    //     partial dot product, __syncthreads()); see the full listing below ...
    P[Row*Width+Col] = Pvalue;
}
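For reference, the complete kernel assembled from the pieces on the preceding slides (a sketch in the deck's notation; Width is assumed to be a multiple of TILE_WIDTH, so no boundary checks are included):

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float subTileM[TILE_WIDTH][TILE_WIDTH];
    __shared__ float subTileN[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;   // row of P computed by this thread
    int Col = blockIdx.x * TILE_WIDTH + tx;   // column of P computed by this thread
    float Pvalue = 0;

    // Loop over the tiles of M and N required to compute P[Row][Col]
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Phase 1: cooperative, coalesced load of one tile of M and one tile of N
        subTileM[ty][tx] = M[Row * Width + m * TILE_WIDTH + tx];
        subTileN[ty][tx] = N[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();                      // wait until the whole tile is loaded

        // Phase 2: partial dot product entirely from shared memory
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += subTileM[ty][k] * subTileN[k][tx];
        __syncthreads();                      // wait before the tile is overwritten
    }
    P[Row * Width + Col] = Pvalue;
}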
• For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory
  – Shared memory capacity alone could support up to 32 active blocks per SM
  – The threads-per-SM constraint (2048) limits the number of blocks to 8
  – This allows up to 8*512 = 4096 pending loads (2 per thread, 256 threads per block)
• TILE_WIDTH = 32 would lead to 2*32*32*4B = 8KB of shared memory per thread block
  – Shared memory capacity alone could support up to 8 active blocks per SM
  – The threads-per-SM constraint (2048) limits the number of blocks to 2
  – This still allows up to 2*2048 = 4096 pending loads (2 per thread, 1024 threads per block)
These resource limits can also be queried at runtime, as sketched below.
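The hand calculation above can be cross-checked with the CUDA occupancy API; a minimal sketch (the helper name reportOccupancy is invented here, while MatrixMulKernel and TILE_WIDTH are the ones defined above):

#include <cstdio>

// Sketch: ask the runtime how many blocks of the tiled kernel can be resident
// on one SM for a TILE_WIDTH*TILE_WIDTH-thread block. Static shared memory
// (the two subtiles) is accounted for automatically, so 0 bytes of dynamic
// shared memory are passed.
void reportOccupancy()
{
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, MatrixMulKernel, TILE_WIDTH * TILE_WIDTH, 0);
    printf("Resident blocks per SM: %d\n", blocksPerSM);
}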
Memory Bandwidth Consumption
• Using 16x16 tiling, we reduce global memory traffic by a factor of 16
  – Each float loaded from global memory is now used in 16 floating-point operations
  – The 150 GB/s bandwidth can now support (150/4)*16 = 600 GFLOPS!
Cache behavior of simple CPU version?
// Matrix multiplication on the (CPU) host in single precision
void MatrixMul(float *M, float *N, float *P, int Width)
{
    for (int row = 0; row < Width; ++row)
        for (int col = 0; col < Width; ++col) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[row * Width + k];
                float b = N[k * Width + col];
                sum += a * b;
            }
            P[row * Width + col] = sum;
        }
}
[Figure: for one output element, the k loop walks across a row of M and down a column of N]
Why doesn't the cache strategy apply to GPU shared memory?
ANY MORE QUESTIONS?
READ CHAPTER 4!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018