Hpc file
Let’s say we want to multiply matrix A with matrix B to compute matrix C. Assume A is a p × width matrix and B is a width × q matrix, so C is a p × q matrix.
The obvious way to implement parallel matrix multiplication in CUDA is to let each thread
do one vector-vector multiplication, i.e. each element of the C matrix is calculated by a
separate CUDA thread.
In the CUDA programming model, threads are organized into thread-blocks and grids. A
thread-block is the smallest group of threads allowed by the programming model, and a grid
is an arrangement of multiple thread-blocks. If you are unfamiliar with thread-blocks and grids,
refer to this. A thread-block or grid can be arranged in 1-D, 2-D or 3-D. Since we are
multiplying 2-D matrices, it makes sense to arrange the thread-blocks and the grid in 2-D.
In most modern NVIDIA GPUs one thread-block can have a maximum of 1024 threads.
Therefore we can use a 32 x 32 2-D thread-block (Let’s assume that our thread-block size
is BLOCK_SIZE x BLOCK_SIZE from here). Now how should we arrange our grid?
Since the output matrix is p × q, we need at least ⌈p/32⌉ thread-blocks in the y-dimension
and ⌈q/32⌉ thread-blocks in the x-dimension. The block and grid dimensions can be specified
as follows in CUDA. Here I assumed that columns in the matrix are indexed in the x-dimension
and rows in the y-dimension, so the x-dimension of the grid has ⌈q/32⌉ blocks.
// x-dimension covers the Q columns, y-dimension covers the P rows
dim3 dim_grid(ceilf(Q/(float)BLOCK_SIZE), ceilf(P/(float)BLOCK_SIZE),
1);
dim3 dim_block(BLOCK_SIZE, BLOCK_SIZE, 1);
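As a quick sanity check on the grid arithmetic, the following Python sketch computes the number of blocks with integer ceiling division. The helper name grid_dims and the example matrix size are assumptions made for illustration; they are not part of the CUDA API.

```python
BLOCK_SIZE = 32  # threads per block along each dimension, as in the text


def grid_dims(p, q, block_size=BLOCK_SIZE):
    """Return (blocks_x, blocks_y) needed to cover a p x q output matrix.

    Columns map to the x-dimension and rows to the y-dimension.
    """
    blocks_x = (q + block_size - 1) // block_size  # ceil(q / block_size)
    blocks_y = (p + block_size - 1) // block_size  # ceil(p / block_size)
    return blocks_x, blocks_y


# Hypothetical example: a 100 x 70 output matrix
blocks_x, blocks_y = grid_dims(100, 70)
launched = blocks_x * blocks_y * BLOCK_SIZE * BLOCK_SIZE
print(blocks_x, blocks_y, launched)  # 3 4 12288
```

Only 100 × 70 = 7000 of the 12288 launched threads map to a real element of C, which is why the kernel needs a bounds check.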
Now let’s move on to the matrix multiplication kernel. First, what arguments does the
kernel need? We need the A matrix, the B matrix and the result matrix C. Assume that all of our
matrices are stored in row-major order (i.e. the elements of a row are placed in
consecutive memory locations). We also need width, which is the length of the vector-vector
multiplication each thread has to do. Because we take the ceiling of q/32 and p/32, the CUDA
kernel launch creates more threads than we need. Therefore we also need the
values P and Q (the dimensions of the C matrix) to check whether a given thread computes a valid element
of the output matrix.
To understand the kernel code, first you need to know that each CUDA thread executes it
independently. At least P×Q threads will execute this code. Because each
thread computes one element of the C matrix, we must first calculate the row and column
of that element.
Threads are arranged in 2-D thread-blocks in a 2-D grid. CUDA provides a simple indexing
mechanism to obtain the thread ID within a thread-block (threadIdx.x, threadIdx.y and
threadIdx.z) and the block ID within the grid (blockIdx.x, blockIdx.y and blockIdx.z). In our
case rows are indexed in the y-dimension. To compute the row index r in terms of the CUDA
threadIdx and blockIdx, we take blockIdx.y and multiply it by blockDim.y to get the total
number of threads in the blocks before blockIdx.y. Then we add threadIdx.y, which is the
thread ID along the y-dimension within the block this thread belongs to (Fig. 3). The column
index c can be computed similarly along the x-dimension.
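To make the index arithmetic concrete, here is a small Python sketch that mirrors the formula. The function global_index is a hypothetical helper; in a real kernel the three values come from the built-in variables blockIdx, blockDim and threadIdx.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Global index along one dimension: threads in earlier blocks plus local offset."""
    return block_idx * block_dim + thread_idx


# Thread (x=5, y=3) inside block (x=1, y=2), with 32 x 32 thread-blocks
r = global_index(block_idx=2, block_dim=32, thread_idx=3)  # row, y-dimension
c = global_index(block_idx=1, block_dim=32, thread_idx=5)  # column, x-dimension
print(r, c)  # 67 37
```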
The next steps are pretty straightforward. We need to check that r and c are within the
bounds of P and Q. Then we multiply the r th row in A with the c th column in B. Since A
and B are laid out in memory in row-major order, we can access all elements in row r of A
using A[r*width + k] (0 ≤ k < width). Accessing column c in B is a little trickier. The value
of the c th column in row 0 is easy: it is just the value at index c in the whole B array. The
value of the c th column in row 1 is at B[1*Q + c]. How come? Remember that Q is the number
of columns in B. So to access the c th column in row 1, we jump by Q elements along the B
array, starting from the c th column index in row 0. To access all elements in column c we
can therefore use B[k*Q + c] (0 ≤ k < width). Pretty simple! After computing the
vector-vector product, the final step is storing the result. We have computed the element in
the r th row and c th column of the output matrix C, so its index is simply r*Q + c.
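The per-thread logic described above can be emulated in plain Python to check the indexing. This is only a sketch under stated assumptions: it is not CUDA, the name matmul_emulated is hypothetical, and the nested loops stand in for the blocks and threads that a GPU would run in parallel.

```python
def matmul_emulated(A, B, P, Q, width, block_size=32):
    """Emulate the one-thread-per-element kernel on flat row-major lists.

    A has P * width elements, B has width * Q elements; returns C with P * Q.
    """
    C = [0.0] * (P * Q)
    grid_x = (Q + block_size - 1) // block_size  # blocks along the columns
    grid_y = (P + block_size - 1) // block_size  # blocks along the rows
    for block_y in range(grid_y):
        for block_x in range(grid_x):
            for thread_y in range(block_size):
                for thread_x in range(block_size):
                    # Same arithmetic a CUDA thread would do
                    r = block_y * block_size + thread_y
                    c = block_x * block_size + thread_x
                    if r < P and c < Q:  # skip the surplus threads
                        value = 0.0
                        for k in range(width):
                            value += A[r * width + k] * B[k * Q + c]
                        C[r * Q + c] = value
    return C


# A is 2 x 3 and B is 3 x 2, both stored row-major
A = [1, 2, 3,
     4, 5, 6]
B = [7, 8,
     9, 10,
     11, 12]
print(matmul_emulated(A, B, P=2, Q=2, width=3))  # [58.0, 64.0, 139.0, 154.0]
```

With the default block size of 32, a single block covers this tiny example and 1020 of its 1024 emulated threads fail the bounds check, which shows why the check is needed.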
Obviously this matrix multiplication is very simple and it does not exploit the full potential of
GPUs. In the next post I will explain how to optimize this code using shared memory and
tiling.
Experiment-7
Text processing
Convert text to lowercase
Python code:
input_str = ("The 5 biggest countries by population in 2017 are China, "
             "India, United States, Indonesia, and Brazil.")
input_str = input_str.lower()
print(input_str)
Output:
the 5 biggest countries by population in 2017 are china, india, united
states, indonesia, and brazil.
Remove numbers
Python code:
import re
input_str = ("Box A contains 3 red and 5 white balls, while Box B "
             "contains 4 red and 2 blue balls.")
result = re.sub(r"\d+", "", input_str)  # delete every run of digits
print(result)
Output:
Box A contains  red and  white balls, while Box B contains  red and  blue
balls.
Each removed number leaves a doubled space behind.
Remove punctuation
Python code:
import string
input_str = ("This &is [an] example? {of} string. with.? "
             "punctuation!!!!")  # Sample string
# Python 3: the third argument of str.maketrans lists characters to delete
result = input_str.translate(str.maketrans("", "", string.punctuation))
print(result)
Output:
This is an example of string with punctuation
Remove whitespaces
To remove leading and trailing whitespace, you can use the strip() method:
Python code:
input_str = " \t a string example\t "
input_str = input_str.strip()
print(input_str)
Output:
a string example
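The four steps can also be chained into one helper. The sketch below is one possible arrangement, not a prescribed pipeline: the function name preprocess is hypothetical, and collapsing internal whitespace with split() and join() goes slightly beyond strip(), which only trims the two ends.

```python
import re
import string


def preprocess(text):
    """Lowercase, drop digits and punctuation, then tidy whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)  # remove runs of digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments also collapses the gaps left by removed tokens
    return " ".join(text.split())


print(preprocess("  Box A contains 3 red and 5 white balls!  "))
# box a contains red and white balls
```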