GPU Series III CUDA Compilation Host Side
Img src: Programming Massively Parallel Processors by David B. Kirk, Wen-mei W. Hwu
Each CUDA source file can contain a mixture of host and device code. Here is a brief overview of the steps the host
code carries out around a kernel launch.
1. Memory Allocation:
• Allocate memory on the GPU for the data required by the computation.
• This involves using CUDA runtime functions like cudaMalloc.
2. Data Transfer:
• Transfer data from the CPU (host) memory to the GPU (device) memory using functions like cudaMemcpy.
3. Kernel Launch:
• Launch a kernel, specifying the number of threads and blocks. The kernel runs on the GPU, with each thread
executing a part of the computation.
• The kernel launch syntax includes configuration parameters defining the grid and block dimensions.
4. Device Synchronization:
• Ensure that all threads complete execution before proceeding. This is achieved using functions like
cudaDeviceSynchronize.
5. Result Transfer:
• Transfer the results from the GPU memory back to the CPU memory.
• This again uses cudaMemcpy.
6. Memory Deallocation:
• Free the allocated memory on the GPU to avoid memory leaks. This involves using cudaFree.
The DRAM on a GPU serves as its global memory. To execute a kernel on a device, the programmer needs to allocate
global memory on the device and transfer the required data from host memory to the allocated device memory.
In this article, we will see how vector addition is done on the CPU, first with stack-allocated arrays and then with
dynamically allocated ones. In the next article, we will dive into the CUDA code for the vector addition kernel.
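#include<stdio.h>
#include<time.h>
#include<stdlib.h>
#define SIZE 1000 // Assumed value; change it to try other vector lengths
// #define DISPLAY
// Print every element of a vector
void printVector(int *pointerToVectorArray){
    for(int i=0;i<SIZE;i++)
    {
        printf(" %d\t",*(pointerToVectorArray+i));
    }
    printf("\n");
}
// Assumed implementation: element-wise sum of the two vectors
void addVectors(int *a,int *b,int *result){
    for(int i=0;i<SIZE;i++)
    {
        result[i] = a[i] + b[i];
    }
}
int main(){
    // Allocate Memory for VectorA and Vector B and initialize them with 0
    int vectorA[SIZE]={0};
    int vectorB[SIZE]={0};
    int resultVector[SIZE]={0};
    // For timing
    clock_t start, end;
    double cpu_time_used;
    srand(time(NULL));
    start = clock();
    // Vector Addition
    addVectors(vectorA,vectorB,resultVector);
    end = clock();
    // Elapsed processor time in seconds
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("Time taken: %f seconds\n",cpu_time_used);
#ifdef DISPLAY
    printf("\nVector B: \n");
    printVector(vectorB);
#endif
    return 0;
}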
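#include<stdio.h>
#include<time.h>
#include<stdlib.h>
#define SIZE 1000 // Assumed value; change it to try other vector lengths
//#define DISPLAY
// Initialize values
void initVectors(int *pointerToVector){
    for(int i=0;i<SIZE;i++)
    {
        // Keep values small so that adding two of them cannot overflow an int
        *(pointerToVector+i) = rand() % 1000;
    }
}
// Print every element of a vector
void printVector(int *pointerToVector){
    for(int i=0;i<SIZE;i++)
    {
        printf(" %d\t",*(pointerToVector+i));
    }
    printf("\n");
}
// Assumed implementation: element-wise sum of the two vectors
void addVectors(int *a,int *b,int *result){
    for(int i=0;i<SIZE;i++)
    {
        result[i] = a[i] + b[i];
    }
}
int main(){
    // For timing
    clock_t start, end;
    double cpu_time_used;
    srand(time(NULL)); // Seed rand() so that every run uses different values
    // Allocate memory
    int *vectorA = (int*)malloc(SIZE*(sizeof(int)));
    int *vectorB = (int*)malloc(SIZE*(sizeof(int)));
    int *resultVector = (int*)malloc(SIZE*(sizeof(int)));
    if(vectorA==NULL || vectorB==NULL || resultVector==NULL)
    {
        printf("Memory allocation failed\n");
        free(vectorA);
        free(vectorB);
        free(resultVector);
        return 1;
    }
    initVectors(vectorA);
    initVectors(vectorB);
    start = clock();
    // Vector Addition
    addVectors(vectorA,vectorB,resultVector);
    end = clock();
    // Elapsed processor time in seconds
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("Time taken: %f seconds\n",cpu_time_used);
#ifdef DISPLAY
    printf("\nVector B: \n");
    printVector(vectorB);
#endif
    free(vectorA);
    free(vectorB);
    free(resultVector);
    return 0;
}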
Output of Stack:
[Figures: program output for SIZE = 1000, 100000, and 1000000]
Output of Dynamic:
[Figures: program output for SIZE = 1000, 10000, 100000, 1000000, and 100000000]
~~ To be Continued ~~