
GPU Series – III

CUDA COMPILATION PROCESS


Host Side

[Figure: CUDA compilation flow, host side. Image source: Programming Massively Parallel Processors by David B. Kirk, Wen-mei W. Hwu]

Each CUDA source file can have a mixture of both host and device code. Here’s a brief overview of the compilation
process.

Some terminology before we dive deeper

NVCC – NVIDIA CUDA Compiler, the compiler driver that processes CUDA C/C++ source files.


Host – CPU, Device – GPU.
Host Code: The portion of the CUDA program that runs on the CPU. It is traditional C code.
Device Code: The portion of the CUDA program that runs on the GPU.
Kernel: A function written in CUDA that runs on the GPU and is executed by multiple threads in parallel.
PTX (Parallel Thread Execution): An intermediate representation of the device code, generated by NVCC.
JIT Compiler (Just-In-Time Compiler): A compiler that converts PTX code into binary code that can be executed by
the GPU.
Grid: A collection of thread blocks that execute a kernel function on the GPU.
Thread Block: A group of threads that execute together and can share data through shared memory.
Global Memory: The main memory on the GPU, accessible by all threads.
Shared Memory: Fast memory accessible by all threads within a block.
Registers: The fastest type of memory available to each thread for temporary storage.
Constant and Texture Memory: Special types of read-only memory optimized for specific access patterns.



Source Code Structure: A CUDA source file typically contains both host and device code. Host code is written in
standard C/C++ and runs on the CPU, while device code, which includes kernel functions, is marked with CUDA-specific
keywords and runs on the GPU.
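To make this concrete, here is a minimal sketch of a single source file containing both kinds of code. It is only an illustration (the kernel name emptyKernel and its trivial body are placeholders, not code from this series):

// mixed.cu – one source file containing both host and device code
#include <stdio.h>

// Device code: the __global__ keyword marks a kernel that runs on the GPU
__global__ void emptyKernel(void)
{
    // each thread would do its share of the work here
}

// Host code: ordinary C/C++ that runs on the CPU
int main(void)
{
    emptyKernel<<<1, 1>>>();    // launch the kernel with 1 block of 1 thread
    cudaDeviceSynchronize();    // wait for the GPU to finish
    printf("Kernel launched from host code\n");
    return 0;
}

Compiling this file with NVCC splits it exactly as described next: main goes to the host compiler, while emptyKernel is turned into PTX for the device.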
NVCC Compilation
The NVIDIA CUDA Compiler (NVCC) processes the CUDA source file, separating the host and device code. The host
code is compiled using a standard C/C++ compiler, while the device code is compiled into PTX code.
Host Compilation:
• The host code is compiled by a standard C preprocessor, compiler, and linker.
• This part of the code remains largely unchanged from traditional C/C++ programs.
Device Compilation:
• The device code is compiled into PTX, an intermediate assembly-like language that represents the device code.
• PTX code is designed to be compatible across different GPU architectures, allowing for flexibility and
optimization.

Host Code Compilation


The host code is compiled using the host C/C++ compiler. This process involves preprocessing, compiling, and linking to
generate the host executable. The compiled host code includes calls to CUDA runtime functions that manage the device
code execution.
PTX to Binary Translation
Once the PTX code is generated, it is translated into binary code for a specific GPU architecture; when the executable carries only PTX for the installed GPU, a Just-In-Time (JIT) compiler performs this translation at runtime, optimizing the code for the particular hardware it will run on.
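The architecture this step targets is the compute capability of the installed GPU. As a small illustration (a minimal sketch using the standard runtime call cudaGetDeviceProperties), you can print what that is on your machine:

// query_cc.cu – print the compute capability the runtime (and its JIT) will target
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Device: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    return 0;
}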
Linking and Executable Generation
The host executable, which contains both the host code and the embedded PTX code, is generated. This executable can
manage and launch the device code on the GPU.
Execution Flow of CUDA Program
1. Memory Allocation:

• Allocate memory on the GPU for the data required by the computation.
• This involves using CUDA runtime functions like cudaMalloc.

2. Data Transfer:

• Transfer data from the CPU (host) memory to the GPU (device) memory using functions like cudaMemcpy.

3. Kernel Launch:

• Launch a kernel, specifying the number of threads and blocks. The kernel runs on the GPU, with each thread
executing a part of the computation.
• The kernel launch syntax includes configuration parameters defining the grid and block dimensions.

4. Device Synchronization:

• Ensure that all threads complete execution before proceeding. This is achieved using functions like
cudaDeviceSynchronize.



5. Data Transfer Back:

• Transfer the results from the GPU memory back to the CPU memory.
• This again uses cudaMemcpy.

6. Memory Deallocation:

• Free the allocated memory on the GPU to avoid memory leaks. This involves using cudaFree. (A minimal sketch of all six steps follows this list.)
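Here is a minimal sketch of these six steps. It is not the vector addition kernel of this series (that comes in the next article); the kernel name incrementKernel and its body are placeholders used purely to show the flow:

// flow.cu – the six steps of a CUDA program, in order
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void incrementKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] = data[i] + 1;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(int);

    int *hostData = (int *)malloc(bytes);
    for (int i = 0; i < n; i++) hostData[i] = i;

    // 1. Memory allocation on the device
    int *deviceData = NULL;
    cudaMalloc((void **)&deviceData, bytes);

    // 2. Data transfer: host -> device
    cudaMemcpy(deviceData, hostData, bytes, cudaMemcpyHostToDevice);

    // 3. Kernel launch: grid and block dimensions go inside <<< >>>
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    incrementKernel<<<blocksPerGrid, threadsPerBlock>>>(deviceData, n);

    // 4. Device synchronization: wait for all threads to finish
    cudaDeviceSynchronize();

    // 5. Data transfer back: device -> host
    cudaMemcpy(hostData, deviceData, bytes, cudaMemcpyDeviceToHost);

    // 6. Memory deallocation on the device
    cudaFree(deviceData);

    printf("hostData[0] = %d\n", hostData[0]);   // expect 1
    free(hostData);
    return 0;
}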

Execution of Device Code, Summarized:


- When a kernel function (parallel device code) is called or launched, it is executed by a large number of threads on
a device.
- All the threads that are generated by a kernel launch are collectively called a grid.
- These threads are the primary vehicle of parallel execution in a CUDA platform.
- When all threads of a kernel complete their execution, the corresponding grid terminates, and execution continues
on the host until another kernel is launched.
In this simplified view, CPU execution and GPU execution do not overlap; in practice, kernel launches are asynchronous and the host can keep running until it explicitly synchronizes.
What is a thread? – https://cthecosmos.com/2022/10/26/threads/
Let us follow this through with the example of a vector addition kernel.
The vector addition kernel is to CUDA what “Hello World” is to sequential programming.
Traditional vector addition:
Part 1: Allocate memory for vectors A, B, and C in the host (CPU) memory. Initialize the vectors A and B with the input
values.
Part 2: Iterate through each element of the vectors A and B. Compute the sum of corresponding elements and store the
result in vector C.
Part 3: The result vector C is now available in the host memory for further processing or output.

CUDA Vector Addition:


Part 1: Allocate space in device memory to hold copies of the A, B and C vectors, and copy vectors A and B over from the host.
Part 2: Launch parallel execution of the actual vector addition kernel on the device.
Part 3: Copy the sum vector C from device memory back to host memory and free the vectors in device memory.

The DRAM on any GPU serves as its global memory. To execute a kernel on a device, the programmer needs to allocate
global memory on the device and transfer the needed data from the host to that allocated device memory.

In this article, we will see how the vector addition is done conventionally on the CPU. In the upcoming one, we will dive
into the CUDA code for the vector addition kernel.

Traditional Vector Addition (Using Static Memory):


#include <stdio.h>
#include <time.h>
#include <stdlib.h>

// #define DISPLAY

#define SIZE 1000000

void initVectors(int *pointerToVectorArray)
{
    for (int i = 0; i < SIZE; i++)
    {
        *(pointerToVectorArray + i) = rand();
    }
}

void addVectors(int *pointerForVectorA, int *pointerForVectorB, int *pointerForResultVector)
{
    for (int i = 0; i < SIZE; i++)
    {
        *(pointerForResultVector + i) = *(pointerForVectorA + i) + *(pointerForVectorB + i);
    }
}

void printVector(int *pointerToVectorArray)
{
    for (int i = 0; i < SIZE; i++)
    {
        printf(" %d\t", *(pointerToVectorArray + i));
    }
    printf("\n");
}

int main()
{
    // Allocate memory for Vector A, Vector B and the result vector, initialized to 0
    int vectorA[SIZE] = {0};
    int vectorB[SIZE] = {0};
    int resultVector[SIZE] = {0};

    // For timing
    clock_t start, end;
    double cpu_time_used;

    // Seed the random number generator
    srand(time(NULL));

    // Initialize Vector A and Vector B with random values
    initVectors(vectorA);
    initVectors(vectorB);

    printf("\nInitialized Vector A and Vector B with Random values\n");

    start = clock();

    // Vector addition
    addVectors(vectorA, vectorB, resultVector);

    end = clock();

    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;

    printf("Execution time: %f seconds\n", cpu_time_used);

    // For displaying the vectors
#ifdef DISPLAY
    printf("\nVector A: \n");
    printVector(vectorA);

    printf("\nVector B: \n");
    printVector(vectorB);

    printf("\nResult Vector: \n");
    printVector(resultVector);
#endif

    return 0;
}

Using Dynamic Memory:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

// #define DISPLAY

#define SIZE 100000000

void initVectors(int *pointerToVector)
{
    // Initialize values
    for (int i = 0; i < SIZE; i++)
    {
        *(pointerToVector + i) = rand();
    }
}

void addVectors(int *pointerForVectorA, int *pointerForVectorB, int *pointerForResultVector)
{
    for (int i = 0; i < SIZE; i++)
    {
        *(pointerForResultVector + i) = *(pointerForVectorA + i) + *(pointerForVectorB + i);
    }
}

void printVector(int *pointerToVector)
{
    for (int i = 0; i < SIZE; i++)
    {
        printf(" %d\t", *(pointerToVector + i));
    }
    printf("\n");
}

int main()
{
    // Declare Vector A, Vector B and the result vector
    int *vectorA = NULL;
    int *vectorB = NULL;
    int *resultVector = NULL;

    // For timing
    clock_t start, end;
    double cpu_time_used;

    // Seed the random number generator
    srand(time(NULL));

    // Allocate memory on the heap
    vectorA = (int *)malloc(SIZE * sizeof(int));
    vectorB = (int *)malloc(SIZE * sizeof(int));
    resultVector = (int *)malloc(SIZE * sizeof(int));

    // Initialize Vector A and Vector B with random values
    initVectors(vectorA);
    initVectors(vectorB);

    printf("\nInitialized Vector A and Vector B with Random values\n");

    start = clock();

    // Vector addition
    addVectors(vectorA, vectorB, resultVector);

    end = clock();

    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;

    printf("Execution time: %f seconds\n", cpu_time_used);

    // For displaying the vectors
#ifdef DISPLAY
    printf("\nVector A: \n");
    printVector(vectorA);

    printf("\nVector B: \n");
    printVector(vectorB);

    printf("\nResult Vector: \n");
    printVector(resultVector);
#endif

    // Free heap memory
    free(vectorA);
    free(vectorB);
    free(resultVector);

    return 0;
}

Why Static and Dynamic Codes?


Parallel processing pays off for huge datasets. For vector addition, if we use fixed-size arrays, the data lives on the stack,
which is limited to a few megabytes on most systems. If we use dynamic memory, i.e., allocating through pointers with
malloc, the data lives on the heap, which is far larger than the stack.
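To see that stack limit concretely, here is a minimal sketch, assuming a Linux/POSIX system where getrlimit is available (the default soft limit is typically around 8 MB):

// stack_limit.c – compare the stack size limit with what the static arrays need
#include <stdio.h>
#include <sys/resource.h>

#define SIZE 1000000

int main(void)
{
    struct rlimit rl;
    getrlimit(RLIMIT_STACK, &rl);   // current (soft) stack size limit

    // vectorA, vectorB and resultVector from the static version above
    unsigned long long neededBytes = 3ULL * SIZE * sizeof(int);

    printf("Stack limit : %llu bytes\n", (unsigned long long)rl.rlim_cur);
    printf("Arrays need : %llu bytes\n", neededBytes);
    return 0;
}

With SIZE at 1,000,000 the three arrays already need about 12 MB, so the static version sits at or beyond a typical stack limit, while the heap comfortably holds the 100,000,000-element vectors of the dynamic version.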

Output of the stack (static) version:
[Screenshots of execution times for SIZE = 1000, 10000, 100000 and 1000000]

Output of the dynamic version:
[Screenshots of execution times for SIZE = 1000, 10000, 100000, 1000000 and 100000000]

Now you see why, going forward, we will be comparing against the dynamic-memory version.

~~ To be Continued ~~

___________________________________________________________________________________________________________

Reference: Programming Massively Parallel Processors by David B. Kirk, Wen-mei W. Hwu



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Article Written By: Yashwanth Naidu Tikkisetty
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
