CUDA 1
Heterogeneous Programming
Katia Oleinik
koleinik@bu.edu
Scientific Computing and Visualization
Boston University
CUDA
• Architecture
• C Language extensions
• Terminology
CUDA Basics
• Hello, World!
• CUDA kernels
• Blocks and threads overview
GPU memory
• Memory management
• Parallel kernels
• Threads synchronization
• Race conditions and atomic operations
Architecture
1.15 GHz clock x 448 CUDA cores x 1 double-precision flop per core per cycle ≈
515 Gigaflops double precision (peak)
Delivers this performance at about 10% of the cost and 5% of the power of a comparable CPU-based system
Architecture
CUDA:
Memory Bandwidth: the rate at which data can be read from or written to memory, expressed in bytes per second.
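As a rough worked example (the clock rate and bus width below are assumed values for a Fermi-class Tesla card, not taken from the slide): theoretical peak bandwidth = memory clock rate x (memory interface width in bytes) x 2 for double-data-rate memory. For a 1.85 GHz memory clock and a 384-bit interface: 1.85e9 x (384 / 8) x 2 ≈ 177.6 GB/s.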
CUDA:
# Change directory
scc-ha1 % cd deviceQuery
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
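The same limits can be queried programmatically with the runtime API. A minimal sketch (field names come from the cudaDeviceProp structure; the output format is illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);                  // number of CUDA-capable devices

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);     // fill in the properties structure

        printf("Device %d: %s\n", dev, prop.name);
        printf("  Warp size:                      %d\n", prop.warpSize);
        printf("  Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
        printf("  Max threads per block:          %d\n", prop.maxThreadsPerBlock);
        printf("  Max block dimensions:           %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  Max grid dimensions:            %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    return 0;
}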
CUDA Architecture
Query device capabilities and measure GPU/CPU bandwidth.
This is a simple test program that measures the memcpy bandwidth of the GPU and the memcpy bandwidth across PCIe.
# Change directory
scc-ha1 % cd bandwidthTest
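A minimal sketch of how such a host-to-device bandwidth measurement can be done with CUDA events (the buffer size and repetition count are assumptions, not the actual bandwidthTest code; the real tool can also use pinned memory via cudaMallocHost):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t size = 64 * 1024 * 1024;        // 64 MB test buffer (assumed size)
    const int    reps = 10;

    float *h_buf = (float *)malloc(size);
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, size);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; i++)
        cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);   // host -> device over PCIe
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds
    double gbps = (double)size * reps / (ms / 1000.0) / 1.0e9;
    printf("Host-to-device bandwidth: %.2f GB/s\n", gbps);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_buf); free(h_buf);
    return 0;
}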
CUDA:
Host: the CPU and its memory (host memory)
Device: the GPU and its memory (device memory)
CUDA: C Language Extensions
CUDA:
• Based on industry-standard C
#include <stdio.h>
int main(void){
    printf("Hello, World!\n");
    return(0);
}
The CUDA language closely follows C/C++ syntax with a minimal set of extensions:
the NVCC compiler compiles the functions that run on the device, and the host compiler
(e.g. gcc) takes care of all other functions that run on the host (e.g. main()).
Hello, Cuda!
#include <stdio.h>

// Kernel definition: runs on the device (empty body here; printing from the device is discussed below)
__global__ void cudakernel(void){
}

int main(void){
    cudakernel<<<1,1>>>();       // launch the kernel: 1 block of 1 thread
    cudaDeviceSynchronize();     // wait for the device to finish
    return(0);
}
Hello, Cuda!
cudakernel<<<N,M>>>();
cudaDeviceSynchronize();
Triple angle brackets indicate that the function will be executed on the device (GPU).
Such a function is called a kernel. The first launch parameter (N) is the number of blocks
and the second (M) is the number of threads per block.
Device management:
cudaGetDeviceCount(), cudaGetDeviceProperties()
Error management:
cudaGetLastError(), plus helper macros such as cudaSafeCall() and cudaCheckError()
(the latter two are commonly defined in sample code, not part of the runtime API itself)
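A header-style sketch of such helper macros (the macro bodies are an assumption following the naming convention above, not runtime API calls):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wrap a runtime API call and abort with a readable message on failure.
#define cudaSafeCall(call)                                              \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Check for an error left by the most recent kernel launch.
#define cudaCheckError()  cudaSafeCall(cudaGetLastError())

// Example usage (d_A and size are assumed to be defined elsewhere):
//   cudaSafeCall(cudaMalloc((void **)&d_A, size));
//   cudakernel<<<1,1>>>();
//   cudaCheckError();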
The ability to print from within a kernel was added in a later generation of the
architecture (Compute Capability 2.0). To request support for Compute Capability 2.0,
we need to add the corresponding architecture option (e.g. -arch=sm_20) to the nvcc
command line.
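A minimal sketch of the device-side printf version (the kernel body and compile line are illustrative, assuming a Fermi-class card):

#include <stdio.h>

__global__ void cudakernel(void){
    printf("Hello, Cuda!\n");        // device-side printf needs Compute Capability >= 2.0
}

int main(void){
    cudakernel<<<1,1>>>();
    cudaDeviceSynchronize();         // also flushes device-side printf output
    return(0);
}

// Compile with the architecture option, e.g.:
//   nvcc -arch=sm_20 -o HelloCuda HelloCuda.cu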
Hello, Cuda!
#include <stdio.h>
int main(void){
    . . .
    cudakernel<<<16,1>>>();      // 16 blocks, 1 thread per block
    . . .
}

To simplify the compilation process we will use a Makefile:
% make HelloCudaBlock
CUDA: C Language Extensions
In the simple 1-dimensional case, we use only the first component of each built-in variable,
e.g. threadIdx.x or blockIdx.x
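A minimal sketch of a kernel that uses the built-in index variables (the kernel name and message are illustrative):

#include <stdio.h>

__global__ void cudakernel(void){
    // Each thread prints its block index and its thread index within the block.
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void){
    cudakernel<<<16,1>>>();      // 16 blocks of 1 thread each
    cudaDeviceSynchronize();
    return(0);
}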
CUDA: Blocks and Threads
Execution alternates between the host and the device: serial code runs on the host,
Kernel A runs on the device, more serial code runs on the host, then Kernel B runs on the device.
CUDA: C Language Extensions
#include <stdio.h>
int main(void){
    . . .
    cudakernel<<<1,16>>>();      // 1 block, 16 threads per block
    . . .
}
CUDA: Blocks and Threads
CUDA: vectorAdd.cu
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;   // global thread index
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}
Vector Addition Example
CUDA: vectorAdd.cu
Each thread handles one array element: within every block threadIdx.x runs 0..7 in this illustration, and the global element index combines blockIdx.x, blockDim.x, and threadIdx.x.
int main(void) {
    . . .
    // Allocate device memory for vector A (similarly for d_B and d_C)
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);
    . . .
}
CUDA: vectorAdd.cu
int main(void) {
    . . .
    // Copy input vectors from host memory to device memory
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    . . .
}
Vector Addition Example
CUDA: vectorAdd.cu
int main(void) {
    . . .
    // Launch the Vector Add CUDA kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;   // round up
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    . . .
}
CUDA: vectorAdd.cu
int main(void) {
    . . .
    // Copy the result vector from device memory back to host memory
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    // Clean up device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    . . .
}
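Putting the fragments above together, a minimal end-to-end sketch (error checking omitted; the vector length and initialization are assumed values):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) C[i] = A[i] + B[i];
}

int main(void) {
    int numElements = 50000;                       // assumed vector length
    size_t size = numElements * sizeof(float);

    // Allocate and initialize host vectors
    float *A = (float *)malloc(size), *B = (float *)malloc(size), *C = (float *)malloc(size);
    for (int i = 0; i < numElements; i++) { A[i] = (float)i; B[i] = 2.0f * i; }

    // Allocate device vectors
    float *d_A = NULL, *d_B = NULL, *d_C = NULL;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy input vectors to the device
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

    // Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

    // Copy the result back and clean up
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(A); free(B); free(C);
    return 0;
}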
Timing CUDA kernel
CUDA: vectorAddTime.cu
float memsettime;                 // kernel execution time in milliseconds
cudaEvent_t start, stop;          // CUDA events used as timers
// CUDA Kernel
. . .
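A minimal sketch of how these events can be used to time the kernel launch, continuing the fragment above (the vectorAdd launch parameters are taken from the earlier example and are an assumption for this file):

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                    // timestamp before the launch
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    cudaEventRecord(stop, 0);                     // timestamp after the launch
    cudaEventSynchronize(stop);                   // wait until the stop event has completed

    cudaEventElapsedTime(&memsettime, start, stop);   // elapsed time in milliseconds
    printf("Kernel execution time: %f ms\n", memsettime);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);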
CUDA: vectorAddTime.cu
scc-ha1 % make
Explore how the CUDA kernel execution time depends on the block size:
Remember:
• A CUDA Streaming Multiprocessor executes threads in warps (32 threads each)
• There is a maximum of 1024 threads per block (for our GPU)
• There is a maximum of 1536 threads per multiprocessor (for our GPU)
Dot Product
CUDA: dotProd1.cu
Each thread computes one pairwise product a_i * b_i; the products are then summed into C:
C = A * B = (a0, a1, a2, a3) * (b0, b1, b2, b3) = a0*b0 + a1*b1 + a2*b2 + a3*b3
Dot Product
CUDA: dotProd1.cu
#define N 512
__global__ void dot( int *a, int *b, int *c ) {

What if thread 0 starts to calculate the sum before the other threads have completed their calculations?
Thread Synchronization
CUDA: dotProd1.cu
#define N 512
__global__ void dot( int *a, int *b, int *c ) {
    . . .
    __syncthreads();     // barrier: wait until all threads in the block reach this point
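A minimal sketch of the complete single-block kernel, assembled from the fragments above (the exact body of dotProd1.cu may differ slightly):

#define N 512

__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[N];                       // shared among the threads of the block

    // Each thread computes one pairwise product.
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    // Barrier: no thread proceeds until every thread has stored its product.
    __syncthreads();

    // Thread 0 sums up the products.
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < N; i++) sum += temp[i];
        *c = sum;
    }
}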
CUDA: dotProd1.cu
int main(void) {
    . . .
    // copy input vectors to the device
    . . .
    . . .
    // copy the result back from the device
    . . .
}

But our vector length is limited by the maximum block size. Can we use multiple blocks?
Race Condition
CUDA: dotProd2.cu
Block 0 computes a0*b0, a1*b1, a2*b2, a3*b3 and adds them into its partial sum;
Block 1 does the same for a4*b4, a5*b5, a6*b6, a7*b7;
both blocks then add their partial sums into C.
Race Condition
CUDA: dotProd2.cu
#define N (2048*2048)
#define THREADS_PER_BLOCK 512

__global__ void dotProductKernel( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    if( threadIdx.x == 0 ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ ) sum += temp[i];
        *c += sum;          // every block updates *c: this read-modify-write is not atomic
    }
}

Blocks interfere with each other – race condition
Race Condition
CUDA: dotProd2.cu
#define N (2048*2048)
#define THREADS_PER_BLOCK 512

__global__ void dotProductKernel( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    if( threadIdx.x == 0 ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ ) sum += temp[i];
        atomicAdd( c, sum );    // atomic read-modify-write: blocks no longer interfere
    }
}
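A sketch of the host-side launch for this multi-block version (the variable names d_a, d_b, d_c are assumptions, following the kernel arguments above):

    // The result must start from zero, since every block adds its partial sum into it.
    int zero = 0;
    cudaMemcpy(d_c, &zero, sizeof(int), cudaMemcpyHostToDevice);

    // One thread per element: N / THREADS_PER_BLOCK blocks of THREADS_PER_BLOCK threads.
    dotProductKernel<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy the scalar result back to the host.
    int c;
    cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);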
Atomic Operations
Race condition: the behavior depends upon the relative timing of multiple event sequences.
It can occur when an implied read-modify-write operation is interruptible.
NVIDIA’s link:
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
Katia Oleinik
koleinik@bu.edu
http://www.bu.edu/tech/research/training/tutorials/list/