
Experiment 6

Matrix Multiplication in CUDA

Let's say we want to multiply matrix A with matrix B to compute matrix C. Assume A is a p × w matrix and B is a w × q matrix, so C will be a p × q matrix. Matrix multiplication is simple: to calculate the (i, j)-th element of C we multiply the i-th row of A with the j-th column of B (Fig. 1). So each individual element of C is the result of a vector-vector multiplication.

Fig. 1: What happens in matrix multiplication?
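The definition above can be sketched on the CPU first (a minimal Python sketch; the matrix values here are arbitrary examples):

```python
# Reference (CPU) matrix multiplication: C[i][j] is the dot product of
# row i of A and column j of B.
def matmul(A, B):
    p, w = len(A), len(A[0])
    q = len(B[0])
    C = [[0] * q for _ in range(p)]
    for i in range(p):
        for j in range(q):
            # vector-vector multiplication of row i and column j
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(w))
    return C

A = [[1, 2, 3],
     [4, 5, 6]]        # p = 2, w = 3
B = [[7, 8],
     [9, 10],
     [11, 12]]         # w = 3, q = 2
print(matmul(A, B))    # -> [[58, 64], [139, 154]]
```

The CUDA version below parallelizes exactly the two outer loops: one thread per (i, j).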

The obvious way to implement our parallel matrix multiplication in CUDA is to let each thread do one vector-vector multiplication, i.e. each element of the C matrix will be calculated by a separate CUDA thread.

Simple(st) CUDA implementation

In the CUDA programming model, threads are organized into thread-blocks and grids. A thread-block is the smallest group of threads allowed by the programming model, and a grid is an arrangement of multiple thread-blocks. If you are unfamiliar with thread-blocks and grids, refer to this. A thread-block or grid can be arranged in 1-D, 2-D or 3-D. Since we are multiplying 2-D matrices, it only makes sense to arrange the thread-blocks and the grid in 2-D.

In most modern NVIDIA GPUs one thread-block can have a maximum of 1024 threads. Therefore we can use a 32 × 32 2-D thread-block (let's assume our thread-block size is BLOCK_SIZE × BLOCK_SIZE from here on). Now how should we arrange our grid? Since the output matrix is p × q, we need at least ⌈p/32⌉ thread-blocks in the y-dimension and ⌈q/32⌉ thread-blocks in the x-dimension (Fig. 2).

Fig.2 : Thread-block and grid organization for simple matrix multiplication

The block and grid dimensions can then be specified as follows in CUDA. Here I assume that columns in the matrix are indexed in the x-dimension and rows in the y-dimension, so the x-dimension of the grid will have ⌈q/32⌉ blocks:

dim3 dim_grid(ceilf(Q/(float)BLOCK_SIZE), ceilf(P/(float)BLOCK_SIZE), 1);
dim3 dim_block(BLOCK_SIZE, BLOCK_SIZE, 1);
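The same ceiling arithmetic can be checked in plain Python (the matrix sizes here are made-up examples):

```python
import math

BLOCK_SIZE = 32
p, q = 1000, 500   # example output-matrix dimensions (hypothetical)

grid_y = math.ceil(p / BLOCK_SIZE)   # blocks along rows (y)
grid_x = math.ceil(q / BLOCK_SIZE)   # blocks along columns (x)
print(grid_x, grid_y)                # -> 16 32

# The launch rounds up, so more threads are created than elements needed.
threads_launched = grid_x * grid_y * BLOCK_SIZE * BLOCK_SIZE
print(threads_launched, p * q)       # -> 524288 500000
```

This surplus of threads is why the kernel below needs the P and Q bounds check.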

Now let's move on to our matrix multiplication kernel. First, what arguments does the kernel need? We need the A matrix, the B matrix and the result matrix C. Assume that all of our matrices are laid out in row-major order (i.e. elements of a row are placed in consecutive memory locations). We also need width, which is the length of the vector-vector multiplication each thread has to do. Because we take the ceiling of p/32 and q/32, the kernel launch will create more threads than we need. Therefore we also need the values P and Q (the dimensions of the C matrix) to check whether a given thread computes a valid element of the output matrix.


template<typename T>
__global__
void naive_matrix_multiply(const T *A, const T *B, T *C, int width,
                           int P, int Q)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    // check boundary conditions
    if (r < P && c < Q) {
        // do the multiplication for one row and one column
        T value = 0;
        for (int k = 0; k < width; k++) {
            value += A[r * width + k] * B[k * Q + c];
        }
        // store the result
        C[r * Q + c] = value;
    }
}
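Since every (block, thread) pair executes the kernel body independently, the whole launch can be emulated with plain loops. The following is a Python sketch (with a made-up block size of 4, not the CUDA API), including the same bounds check that filters out the surplus threads:

```python
import math

def naive_matrix_multiply(A, B, C, width, P, Q, block=4):
    """Emulate the CUDA launch: iterate over every block and every
    thread within it, computing r and c exactly as the kernel does."""
    grid_y = math.ceil(P / block)
    grid_x = math.ceil(Q / block)
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(block):
                for tx in range(block):
                    r = by * block + ty
                    c = bx * block + tx
                    if r < P and c < Q:   # boundary check, as in the kernel
                        C[r * Q + c] = sum(A[r * width + k] * B[k * Q + c]
                                           for k in range(width))

# 2x3 times 3x2, all matrices flattened row-major
A = [1, 2, 3, 4, 5, 6]
B = [7, 8, 9, 10, 11, 12]
C = [0] * 4
naive_matrix_multiply(A, B, C, width=3, P=2, Q=2)
print(C)   # -> [58, 64, 139, 154]
```

On a GPU the four outer loops run concurrently; the emulation only shows that each thread's work is independent of the others.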

To understand this code, first note that each CUDA thread executes it independently; P × Q of the launched threads will each compute one element of the C matrix. Because each thread computes an element of C, we must first calculate the row and column of that element.

Threads are arranged in 2-D thread-blocks within a 2-D grid. CUDA provides a simple indexing mechanism to obtain the thread-ID within a thread-block (threadIdx.x, threadIdx.y and threadIdx.z) and the block-ID within a grid (blockIdx.x, blockIdx.y and blockIdx.z). In our case rows are indexed in the y-dimension. To compute the row index r in terms of CUDA's threadIdx and blockIdx, we take blockIdx.y and multiply it by blockDim.y, which gives the total number of threads in the blockIdx.y blocks above ours. Then we add threadIdx.y, the thread-ID along the y-dimension within the block this thread belongs to (Fig. 3). The column index c can be computed similarly along the x-dimension.


Fig. 3: Row computation

The next steps are pretty straightforward. We check that r and c are within the bounds P and Q. Then we do the vector-vector multiplication, multiplying the r-th row of A with the c-th column of B. Since A and B are laid out in memory in row-major order, we can access all elements in row r of A using A[r*width + k] (0 ≤ k < width). Accessing column c of B is a little trickier. The value of column c in row 0 is easy: it is just the value at index c of the whole B array. The value of column c in row 1 is at B[1*Q + c]. Why? Remember that Q is the number of columns in B, so to reach column c in row 1 we jump Q elements along the flattened B array, starting from index c in row 0. In general, we can access all elements of column c using B[k*Q + c] (0 ≤ k < width). Pretty simple! After computing the vector-vector product, the final step is storing the result. We have computed the element at row r, column c of the output matrix C, and its index is simply r*Q + c.
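The column-access arithmetic is easy to verify in Python (a small check on a toy matrix, not part of the kernel):

```python
# Row-major flattening: element (r, c) of a matrix with Q columns
# lives at index r * Q + c in the flat array.
B = [[7,  8],
     [9, 10],
     [11, 12]]    # 3 rows, Q = 2 columns
Q = 2
B_flat = [v for row in B for v in row]   # [7, 8, 9, 10, 11, 12]

c = 1   # walk down column 1 by jumping Q elements at a time
col = [B_flat[k * Q + c] for k in range(3)]
print(col)   # -> [8, 10, 12]
assert col == [row[c] for row in B]
```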

Obviously this matrix multiplication kernel is very simple and does not exploit the full potential of GPUs. In the next post I will explain how to optimize it using shared memory and tiling.
Experiment-7

Text processing

Example 1. Convert text to lowercase

Python code:
input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)

Output:
the 5 biggest countries by population in 2017 are china, india, united
states, indonesia, and brazil.

Remove numbers

Remove numbers if they are not relevant to your analyses. Usually, regular expressions are used to remove numbers.

Example 2. Numbers removing

Python code:
import re
input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
result = re.sub(r'\d+', '', input_str)
print(result)

Output:
Box A contains red and white balls, while Box B contains red and blue
balls.
Remove punctuation

The following code removes this set of symbols: [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]:

Example 3. Punctuation removal

Python code:
import string
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"  # Sample string
result = input_str.translate(str.maketrans('', '', string.punctuation))
print(result)

Output:
This is an example of string with punctuation

Remove whitespaces

To remove leading and trailing whitespace, you can use the strip() function:

Example 4. White spaces removal

Python code:
input_str = " \t a string example\t "
input_str = input_str.strip()
input_str

Output:
'a string example'
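The four steps above are usually chained together. A sketch of one combined helper (the function name clean_text is my own, not from a library):

```python
import re
import string

def clean_text(text):
    """Apply the four preprocessing steps in order: lowercase,
    remove digits, remove punctuation, strip surrounding whitespace."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.strip()

print(clean_text("  Box A contains 3 red and 5 white balls!  "))
# -> box a contains  red and  white balls
```

Note that removing digits can leave doubled spaces behind; collapse them with re.sub(r'\s+', ' ', text) if that matters for your analysis.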
