Lab Experiment 6
In the last lab, the parallel reduction problem was solved in two different ways:
Neighboured pair: Elements are paired with their immediate neighbour.
Interleaved pair: Paired elements are separated by a given stride.
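The two pairing schemes can be sketched on the CPU to show which indices are combined at each step. This is a sequential illustration only, not the GPU kernels from the last lab. The function names are hypothetical, the input is assumed to be non-empty, and the interleaved version assumes the length is a power of two.

```python
def reduce_neighboured(values):
    """Sum a non-empty sequence by pairing each element with its neighbour."""
    data = list(values)
    n = len(data)
    stride = 1
    while stride < n:
        # Element i absorbs the partial sum held at i + stride.
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
    return data[0]


def reduce_interleaved(values):
    """Sum a sequence whose length is a power of two (assumption)."""
    data = list(values)
    stride = len(data) // 2
    while stride > 0:
        # The first `stride` elements each absorb a partner `stride` away.
        for i in range(stride):
            data[i] += data[i + stride]
        stride //= 2
    return data[0]


sample = [1, 2, 3, 4, 5, 6, 7, 8]
neighboured_total = reduce_neighboured(sample)
interleaved_total = reduce_interleaved(sample)
```

Both functions return the same total. They differ only in which elements are paired at each step, which is what changes the memory access pattern on the GPU.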
In this Lab, the same problem will be solved with loop unrolling.
Unrolled Loops: Paired elements in different blocks are added without using loops.
A simple loop that adds two arrays of n elements, element by element, is as follows:
for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
}
An unrolled loop that does the same computation in n/3 iterations (assuming n is a multiple of 3) is as follows:
for (int i = 0; i < n; i += 3) {
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
    a[i+2] = b[i+2] + c[i+2];
}
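When n is not a multiple of 3, an unrolled loop needs a short clean-up loop for the leftover elements. The following is a minimal sketch of that index pattern in Python. The function name add_unrolled is hypothetical, and in Python the unrolling is for illustration only and is not expected to give a speed-up.

```python
def add_unrolled(b, c):
    """Element-wise sum of b and c, with the main loop unrolled by 3."""
    n = len(b)
    a = [0] * n
    limit = n - n % 3  # largest multiple of 3 that is <= n

    # Main unrolled loop: three additions per iteration.
    for i in range(0, limit, 3):
        a[i] = b[i] + c[i]
        a[i + 1] = b[i + 1] + c[i + 1]
        a[i + 2] = b[i + 2] + c[i + 2]

    # Clean-up loop: handles the 0, 1 or 2 elements left over.
    for i in range(limit, n):
        a[i] = b[i] + c[i]
    return a


result = add_unrolled([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
```

With five elements, the main loop runs once (indices 0 to 2) and the clean-up loop handles indices 3 and 4.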
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import math
import time
// Check bounds
if (i >= N) return;
# Initialize data
A, B, C = initialize_arrays(N)
# Allocate memory on the GPU
d_A = cuda.mem_alloc(A.nbytes)
d_B = cuda.mem_alloc(B.nbytes)
d_C = cuda.mem_alloc(C.nbytes)
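The helper initialize_arrays is called above but not shown in this handout. One possible version is sketched below, assuming A and B are float32 inputs and C is a zero-filled float32 output. The dtype, the random data, the seed parameter and the launch_blocks helper are all assumptions made for illustration. The sketch needs only NumPy, so it can be run without a GPU.

```python
import numpy as np


def initialize_arrays(n, seed=0):
    """Hypothetical helper: two random float32 inputs and a zeroed output."""
    rng = np.random.default_rng(seed)
    a = rng.random(n, dtype=np.float32)
    b = rng.random(n, dtype=np.float32)
    c = np.zeros(n, dtype=np.float32)
    return a, b, c


def launch_blocks(n, block_size):
    """Hypothetical helper: number of blocks needed to cover n elements."""
    # Rounding up can launch more threads than elements, which is why
    # a kernel needs a bounds check such as `if (i >= N) return;`.
    return (n + block_size - 1) // block_size


N = 1000
A, B, C = initialize_arrays(N)
blocks = launch_blocks(N, 256)  # block size of 256 is an assumption
```

Because each array holds N float32 values, A.nbytes equals 4 * N, which is the size passed to cuda.mem_alloc.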