
Lab Experiment 6

This document outlines a lab experiment focused on solving the parallel reduction problem using loop unrolling in CUDA. It includes a comparison of traditional and unrolled loops for adding elements of two arrays, along with sample code demonstrating the implementation. The experiment also measures GPU computation time and verifies results against CPU addition.

Lab Experiment # 6

The Parallel Reduction Problem using Loop-Unrolling in CUDA [CLO 1, CLO 2, CLO 3]

In the last Lab, the parallel reduction problem was solved in two different ways:
Neighboured pair: Elements are paired with their immediate neighbour.
Interleaved pair: Paired elements are separated by a given stride.
In this Lab, the same problem will be solved with loop unrolling.
Unrolled Loops: Paired elements in different blocks are added without using loops.
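
As a reminder of how these schemes differ, the following is a minimal host-side sketch in plain Python, not the lab's CUDA code. The function names (reduce_neighboured, reduce_interleaved, reduce_unrolled2) are hypothetical, and the sketch assumes the input length is a power of two and a multiple of twice the block size. Each inner loop stands in for work that separate GPU threads would do in parallel.

```python
def reduce_neighboured(values):
    """Sum a power-of-two-length list by pairing immediate neighbours."""
    data = list(values)
    n = len(data)
    stride = 1
    while stride < n:
        # On a GPU, each of these additions would be done by one thread
        for i in range(0, n, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
    return data[0]


def reduce_interleaved(values):
    """Sum a power-of-two-length list by pairing elements a stride apart."""
    data = list(values)
    stride = len(data) // 2
    while stride > 0:
        for i in range(stride):
            data[i] += data[i + stride]
        stride //= 2
    return data[0]


def reduce_unrolled2(values, block_size):
    """Each simulated block first folds in the data block next to it
    (unroll factor 2), then reduces its own block_size elements."""
    data = list(values)
    partial_sums = []
    for start in range(0, len(data), 2 * block_size):
        # Unrolled step: one explicit addition per thread, no stride loop
        block = [data[start + t] + data[start + block_size + t]
                 for t in range(block_size)]
        partial_sums.append(reduce_interleaved(block))
    # The per-block partial sums are combined on the host
    return sum(partial_sums)
```

For example, all three functions give the same total for list(range(16)) when reduce_unrolled2 is called with a block size of 4.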
A simple loop to add n elements of two arrays is as follows:
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
}
An unrolled loop to do the same computation in n/3 iterations (assuming n is a multiple of 3) is as follows:
for (int i = 0; i < n; i += 3) {
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
    a[i+2] = b[i+2] + c[i+2];
}
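
When n is not a multiple of the unroll factor, the unrolled body has to stop early and a short cleanup loop finishes the leftover elements. The sketch below illustrates this in plain Python for an unroll factor of 3; the function name add_unrolled3 is hypothetical and the code is only meant to show the control flow, not to be fast.

```python
def add_unrolled3(b, c):
    """Element-wise b + c with the loop body unrolled three times."""
    n = len(b)
    a = [0] * n
    main_end = n - (n % 3)  # largest multiple of 3 that does not exceed n

    # Unrolled part: three additions per iteration, n // 3 iterations
    for i in range(0, main_end, 3):
        a[i] = b[i] + c[i]
        a[i + 1] = b[i + 1] + c[i + 1]
        a[i + 2] = b[i + 2] + c[i + 2]

    # Cleanup part: the 0, 1 or 2 elements left over at the end
    for i in range(main_end, n):
        a[i] = b[i] + c[i]
    return a
```

With 7 elements, the unrolled part covers indices 0 to 5 in two iterations and the cleanup loop handles index 6.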

Sample Code for Unrolled Loops

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import math
import time

# Function to initialize the arrays with random values
def initialize_arrays(n):
    A = np.random.rand(n).astype(np.float32)
    B = np.random.rand(n).astype(np.float32)
    C = np.zeros_like(A)
    return A, B, C

# CUDA C kernel with an unrolled loop: each thread adds 4 consecutive elements
kernel_code = """
__global__ void add_arrays_unrolled(float *A, float *B, float *C, int N)
{
    int stride = 4; // Number of elements processed by each thread

    // Scale the global thread index by the unroll factor so that
    // neighbouring threads do not recompute the same elements
    int i = (threadIdx.x + blockIdx.x * blockDim.x) * stride;

    // Check bounds
    if (i >= N) return;

    if (i + 3 < N) {
        // Unrolled body: add 4 elements without a loop
        C[i] = A[i] + B[i];
        C[i+1] = A[i+1] + B[i+1];
        C[i+2] = A[i+2] + B[i+2];
        C[i+3] = A[i+3] + B[i+3];
    } else {
        // Handle the remaining elements (at most 3) at the end of the array
        C[i] = A[i] + B[i];
        if (i + 1 < N) C[i + 1] = A[i + 1] + B[i + 1];
        if (i + 2 < N) C[i + 2] = A[i + 2] + B[i + 2];
    }
}
"""


# Main function to run the program
def main():
    # Set array size
    N = 1024  # You can change this to any size (1 million, 2 million, etc.)

    # Initialize data
    A, B, C = initialize_arrays(N)

    # Allocate memory on the GPU
    d_A = cuda.mem_alloc(A.nbytes)
    d_B = cuda.mem_alloc(B.nbytes)
    d_C = cuda.mem_alloc(C.nbytes)

    # Copy data from host to device
    cuda.memcpy_htod(d_A, A)
    cuda.memcpy_htod(d_B, B)

    # Compile the kernel code
    mod = SourceModule(kernel_code)
    add_arrays = mod.get_function("add_arrays_unrolled")

    # Set block and grid size
    block_size = 256  # 256 threads per block
    elements_per_thread = 4  # Must match the unroll factor in the kernel
    grid_size = math.ceil(N / (block_size * elements_per_thread))

    # Launch the kernel and measure the elapsed time
    start_time = time.time()
    add_arrays(d_A, d_B, d_C, np.int32(N),
               block=(block_size, 1, 1), grid=(grid_size, 1))
    cuda.Context.synchronize()  # Ensure the kernel finishes
    elapsed_time = time.time() - start_time
    print(f"GPU Computation Time: {elapsed_time:.4f} seconds")

    # Copy the result back to host
    cuda.memcpy_dtoh(C, d_C)

    # Verify the result by comparing with CPU addition
    cpu_result = A + B
    assert np.allclose(C, cpu_result), "Results don't match!"

    print("Computation complete and verified.")


if __name__ == "__main__":
    main()
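
When each thread is responsible for several elements, the number of threads and blocks needed shrinks accordingly. The following is a small plain-Python sketch of that arithmetic, assuming each thread handles four consecutive elements starting at four times its global thread index. The helper names launch_geometry and covered_indices are hypothetical and are not part of PyCUDA; the second helper only simulates which indices such a launch would touch.

```python
def launch_geometry(n, block_size=256, elems_per_thread=4):
    """Return (threads_needed, grid_size) for n elements."""
    # Integer ceiling divisions avoid floating-point rounding
    threads_needed = (n + elems_per_thread - 1) // elems_per_thread
    grid_size = (threads_needed + block_size - 1) // block_size
    return threads_needed, grid_size


def covered_indices(n, block_size=256, elems_per_thread=4):
    """Simulate the launch and list every array index a thread would write."""
    _, grid_size = launch_geometry(n, block_size, elems_per_thread)
    covered = []
    for tid in range(grid_size * block_size):
        base = tid * elems_per_thread  # first element owned by this thread
        for k in range(elems_per_thread):
            if base + k < n:  # same bounds check a kernel would need
                covered.append(base + k)
    return covered
```

Under these assumptions, N = 1024 with 256 threads per block needs 256 threads in a single block, and every index is written exactly once.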
