
Experiment 6

Matrix Multiplication in CUDA

Let's say we want to multiply matrix A with matrix B to compute matrix C. Assume A is a p × w matrix and B is a w × q matrix, so C will be a p × q matrix. Matrix multiplication is simple: to calculate the (i, j)-th element of C we multiply the i-th row of A with the j-th column of B (Fig. 1). So each individual element of C is the result of a vector-vector multiplication.

Fig. 1: What happens in matrix multiplication?
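The definition above can be sketched on the CPU first (a minimal Python sketch; the matrix values here are arbitrary examples):

```python
# Reference (CPU) matrix multiplication: C[i][j] is the dot product of
# row i of A and column j of B.
def matmul(A, B):
    p, w = len(A), len(A[0])
    q = len(B[0])
    C = [[0] * q for _ in range(p)]
    for i in range(p):
        for j in range(q):
            # vector-vector multiplication of row i and column j
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(w))
    return C

A = [[1, 2, 3],
     [4, 5, 6]]        # p = 2, w = 3
B = [[7, 8],
     [9, 10],
     [11, 12]]         # w = 3, q = 2
print(matmul(A, B))    # -> [[58, 64], [139, 154]]
```

The CUDA version below parallelizes exactly the two outer loops: one thread per (i, j).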

The obvious way to implement our parallel matrix multiplication in CUDA is to let each thread do one vector-vector multiplication, i.e. each element of the C matrix will be calculated by a separate CUDA thread.

Simple(st) CUDA implementation

In the CUDA programming model, threads are organized into thread-blocks and grids. A thread-block is the smallest group of threads allowed by the programming model, and a grid is an arrangement of multiple thread-blocks. If you are unfamiliar with thread-blocks and grids, refer to this. A thread-block or grid can be arranged in 1-D, 2-D or 3-D. Since we are multiplying 2-D matrices, it only makes sense to arrange the thread-blocks and the grid in 2-D.

In most modern NVIDIA GPUs one thread-block can have a maximum of 1024 threads. Therefore we can use a 32 × 32 2-D thread-block (let's assume our thread-block size is BLOCK_SIZE × BLOCK_SIZE from here on). Now how should we arrange our grid? Since the output matrix is p × q, we need at least ⌈p/32⌉ thread-blocks in the y-dimension and ⌈q/32⌉ thread-blocks in the x-dimension (Fig. 2).

Fig.2 : Thread-block and grid organization for simple matrix multiplication

The block and grid dimensions can then be specified as follows in CUDA. Here I assume that columns in the matrix are indexed in the x-dimension and rows in the y-dimension, so the x-dimension of the grid will have ⌈q/32⌉ blocks:

dim3 dim_grid(ceilf(Q/(float)BLOCK_SIZE), ceilf(P/(float)BLOCK_SIZE), 1);
dim3 dim_block(BLOCK_SIZE, BLOCK_SIZE, 1);
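The same ceiling arithmetic can be checked in plain Python (the matrix sizes here are made-up examples):

```python
import math

BLOCK_SIZE = 32
p, q = 1000, 500   # example output-matrix dimensions (hypothetical)

grid_y = math.ceil(p / BLOCK_SIZE)   # blocks along rows (y)
grid_x = math.ceil(q / BLOCK_SIZE)   # blocks along columns (x)
print(grid_x, grid_y)                # -> 16 32

# The launch rounds up, so more threads are created than elements needed.
threads_launched = grid_x * grid_y * BLOCK_SIZE * BLOCK_SIZE
print(threads_launched, p * q)       # -> 524288 500000
```

This surplus of threads is why the kernel below needs the P and Q bounds check.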

Now let's move on to our matrix multiplication kernel. First, what arguments does the kernel need? We need the A matrix, the B matrix and the result matrix C. Assume that all of our matrices are laid out in row-major order (i.e. elements of a row are placed in consecutive memory locations). We also need width, which is the length of the vector-vector multiplication each thread has to do. Because we take the ceiling of p/32 and q/32, the kernel launch will create more threads than we need. Therefore we also need the values P and Q (the dimensions of the C matrix) to check whether a given thread computes a valid element of the output matrix.


template<typename T>
__global__
void naive_matrix_multiply(const T *A, const T *B, T *C, int width,
                           int P, int Q)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    // check boundary conditions
    if (r < P && c < Q) {
        // do the multiplication for one row and one column
        T value = 0;
        for (int k = 0; k < width; k++) {
            value += A[r * width + k] * B[k * Q + c];
        }
        // store the result
        C[r * Q + c] = value;
    }
}
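Since every (block, thread) pair executes the kernel body independently, the whole launch can be emulated with plain loops. The following is a Python sketch (with a made-up block size of 4, not the CUDA API), including the same bounds check that filters out the surplus threads:

```python
import math

def naive_matrix_multiply(A, B, C, width, P, Q, block=4):
    """Emulate the CUDA launch: iterate over every block and every
    thread within it, computing r and c exactly as the kernel does."""
    grid_y = math.ceil(P / block)
    grid_x = math.ceil(Q / block)
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(block):
                for tx in range(block):
                    r = by * block + ty
                    c = bx * block + tx
                    if r < P and c < Q:   # boundary check, as in the kernel
                        C[r * Q + c] = sum(A[r * width + k] * B[k * Q + c]
                                           for k in range(width))

# 2x3 times 3x2, all matrices flattened row-major
A = [1, 2, 3, 4, 5, 6]
B = [7, 8, 9, 10, 11, 12]
C = [0] * 4
naive_matrix_multiply(A, B, C, width=3, P=2, Q=2)
print(C)   # -> [58, 64, 139, 154]
```

On a GPU the four outer loops run concurrently; the emulation only shows that each thread's work is independent of the others.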

To understand this code, first note that each CUDA thread executes it independently; P × Q of the launched threads will each compute one element of the C matrix. Because each thread computes an element of C, we must first calculate the row and column of that element.

Threads are arranged in 2-D thread-blocks within a 2-D grid. CUDA provides a simple indexing mechanism to obtain the thread-ID within a thread-block (threadIdx.x, threadIdx.y and threadIdx.z) and the block-ID within a grid (blockIdx.x, blockIdx.y and blockIdx.z). In our case rows are indexed in the y-dimension. To compute the row index r in terms of CUDA's threadIdx and blockIdx, we take blockIdx.y and multiply it by blockDim.y, which gives the total number of threads in the blockIdx.y blocks above ours. Then we add threadIdx.y, the thread-ID along the y-dimension within the block this thread belongs to (Fig. 3). The column index c can be computed similarly along the x-dimension.


Fig. 3: Row computation

The next steps are pretty straightforward. We check that r and c are within the bounds P and Q. Then we do the vector-vector multiplication, multiplying the r-th row of A with the c-th column of B. Since A and B are laid out in memory in row-major order, we can access all elements in row r of A using A[r*width + k] (0 ≤ k < width). Accessing column c of B is a little trickier. The value of column c in row 0 is easy: it is just the value at index c of the whole B array. The value of column c in row 1 is at B[1*Q + c]. Why? Remember that Q is the number of columns in B, so to reach column c in row 1 we jump Q elements along the flattened B array, starting from index c in row 0. In general, we can access all elements of column c using B[k*Q + c] (0 ≤ k < width). Pretty simple! After computing the vector-vector product, the final step is storing the result. We have computed the element at row r, column c of the output matrix C, and its index is simply r*Q + c.
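The column-access arithmetic is easy to verify in Python (a small check on a toy matrix, not part of the kernel):

```python
# Row-major flattening: element (r, c) of a matrix with Q columns
# lives at index r * Q + c in the flat array.
B = [[7,  8],
     [9, 10],
     [11, 12]]    # 3 rows, Q = 2 columns
Q = 2
B_flat = [v for row in B for v in row]   # [7, 8, 9, 10, 11, 12]

c = 1   # walk down column 1 by jumping Q elements at a time
col = [B_flat[k * Q + c] for k in range(3)]
print(col)   # -> [8, 10, 12]
assert col == [row[c] for row in B]
```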

Obviously this matrix multiplication kernel is very simple and does not exploit the full potential of GPUs. In the next post I will explain how to optimize it using shared memory and tiling.
Experiment-7

Text processing

Example 1. Convert text to lowercase

Python code:
input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)

Output:
the 5 biggest countries by population in 2017 are china, india, united
states, indonesia, and brazil.

Remove numbers

Remove numbers if they are not relevant to your analyses. Usually, regular expressions are used to remove numbers.

Example 2. Numbers removing

Python code:
import re
input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
result = re.sub(r'\d+', '', input_str)
print(result)

Output:
Box A contains red and white balls, while Box B contains red and blue
balls.
Remove punctuation

The following code removes this set of symbols: [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]:

Example 3. Punctuation removal

Python code:
import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"  # Sample string
result = input_str.translate(str.maketrans('', '', string.punctuation))
print(result)

Output:
This is an example of string with punctuation

Remove whitespaces

To remove leading and trailing whitespace, you can use the strip() function:

Example 4. White spaces removal

Python code:
input_str = " \t a string example\t "
input_str = input_str.strip()
input_str

Output:
'a string example'
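The four steps above are usually chained together. A sketch of one combined helper (the function name clean_text is my own, not from a library):

```python
import re
import string

def clean_text(text):
    """Apply the four preprocessing steps in order: lowercase,
    remove digits, remove punctuation, strip surrounding whitespace."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.strip()

print(clean_text("  Box A contains 3 red and 5 white balls!  "))
# -> box a contains  red and  white balls
```

Note that removing digits can leave doubled spaces behind; collapse them with re.sub(r'\s+', ' ', text) if that matters for your analysis.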
