CUDA 1
Heterogeneous Programming
Katia Oleinik
koleinik@bu.edu
Scientific Computing and Visualization
Boston University
CUDA
• Architecture
• C Language extensions
• Terminology
CUDA Basics
• Hello, World!
• CUDA kernels
• Blocks and threads overview
GPU memory
• Memory management
• Parallel kernels
• Threads synchronization
• Race conditions and atomic operations
Architecture
1.15 GHz clock x 448 CUDA cores x 1 double-precision flop per core per cycle ≈
515 Gigaflops double precision (peak)
Delivers this performance at about 10% of the cost and 5% of the power of a comparable CPU-based system
Architecture
CUDA:
Memory Bandwidth: the rate at which data can be read from or written to memory, expressed in bytes per second.
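As a rough worked example (the clock rate and bus width below are assumed values for a Fermi-class Tesla card, not taken from the slide): theoretical peak bandwidth = memory clock rate x (memory interface width in bytes) x 2 for double-data-rate memory. For a 1.85 GHz memory clock and a 384-bit interface: 1.85e9 x (384 / 8) x 2 ≈ 177.6 GB/s.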
CUDA:
# Change directory
scc-ha1 % cd deviceQuery
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
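The same limits can be queried programmatically with the runtime API. A minimal sketch (field names come from the cudaDeviceProp structure; the output format is illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);                  // number of CUDA-capable devices

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);     // fill in the properties structure

        printf("Device %d: %s\n", dev, prop.name);
        printf("  Warp size:                      %d\n", prop.warpSize);
        printf("  Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
        printf("  Max threads per block:          %d\n", prop.maxThreadsPerBlock);
        printf("  Max block dimensions:           %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  Max grid dimensions:            %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    return 0;
}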
CUDA Architecture
Query device capabilities and measure GPU/CPU bandwidth.
This is a simple test program that measures the memcpy bandwidth of the GPU and the memcpy bandwidth across PCIe.
# Change directory
scc-ha1 % cd bandwidthTest
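A minimal sketch of how such a host-to-device bandwidth measurement can be done with CUDA events (the buffer size and repetition count are assumptions, not the actual bandwidthTest code; the real tool can also use pinned memory via cudaMallocHost):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t size = 64 * 1024 * 1024;        // 64 MB test buffer (assumed size)
    const int    reps = 10;

    float *h_buf = (float *)malloc(size);
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, size);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; i++)
        cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);   // host -> device over PCIe
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds
    double gbps = (double)size * reps / (ms / 1000.0) / 1.0e9;
    printf("Host-to-device bandwidth: %.2f GB/s\n", gbps);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_buf); free(h_buf);
    return 0;
}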
CUDA:
Host: the CPU and its memory (host memory)
Device: the GPU and its memory (device memory)
CUDA: C Language Extensions
CUDA:
• Based on industry-standard C
#include <stdio.h>
int main(void){
    printf("Hello, World!\n");
    return(0);
}
The CUDA language closely follows C/C++ syntax with a minimal set of extensions:
the NVCC compiler compiles the functions that run on the device, and the host compiler
(e.g. gcc) takes care of all other functions that run on the host (e.g. main()).
Hello, Cuda!
#include <stdio.h>

// Kernel definition: runs on the device (empty body here; printing from the device is discussed below)
__global__ void cudakernel(void){
}

int main(void){
    cudakernel<<<1,1>>>();       // launch the kernel: 1 block of 1 thread
    cudaDeviceSynchronize();     // wait for the device to finish
    return(0);
}
Hello, Cuda!
cudakernel<<<N,M>>>();
cudaDeviceSynchronize();
Triple angle brackets indicate that the function will be executed on the device (GPU).
Such a function is called a kernel. The first launch parameter (N) is the number of blocks
and the second (M) is the number of threads per block.
Device management:
cudaGetDeviceCount(), cudaGetDeviceProperties()
Error management:
cudaGetLastError(), plus helper macros such as cudaSafeCall() and cudaCheckError()
(the latter two are commonly defined in sample code, not part of the runtime API itself)
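A header-style sketch of such helper macros (the macro bodies are an assumption following the naming convention above, not runtime API calls):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wrap a runtime API call and abort with a readable message on failure.
#define cudaSafeCall(call)                                              \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Check for an error left by the most recent kernel launch.
#define cudaCheckError()  cudaSafeCall(cudaGetLastError())

// Example usage (d_A and size are assumed to be defined elsewhere):
//   cudaSafeCall(cudaMalloc((void **)&d_A, size));
//   cudakernel<<<1,1>>>();
//   cudaCheckError();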
The ability to print from within a kernel was added in a later generation of the
architecture (Compute Capability 2.0). To request support for Compute Capability 2.0,
we need to add the corresponding architecture option (e.g. -arch=sm_20) to the nvcc
command line.
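A minimal sketch of the device-side printf version (the kernel body and compile line are illustrative, assuming a Fermi-class card):

#include <stdio.h>

__global__ void cudakernel(void){
    printf("Hello, Cuda!\n");        // device-side printf needs Compute Capability >= 2.0
}

int main(void){
    cudakernel<<<1,1>>>();
    cudaDeviceSynchronize();         // also flushes device-side printf output
    return(0);
}

// Compile with the architecture option, e.g.:
//   nvcc -arch=sm_20 -o HelloCuda HelloCuda.cu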
Hello, Cuda!
#include <stdio.h>
int main(void){
    . . .
    cudakernel<<<16,1>>>();      // 16 blocks, 1 thread per block
    . . .
}

To simplify the compilation process we will use a Makefile:
% make HelloCudaBlock
CUDA: C Language Extensions
In the simple 1-dimensional case, we use only the first component of each built-in variable,
e.g. threadIdx.x or blockIdx.x
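A minimal sketch of a kernel that uses the built-in index variables (the kernel name and message are illustrative):

#include <stdio.h>

__global__ void cudakernel(void){
    // Each thread prints its block index and its thread index within the block.
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void){
    cudakernel<<<16,1>>>();      // 16 blocks of 1 thread each
    cudaDeviceSynchronize();
    return(0);
}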
CUDA: Blocks and Threads
Execution alternates between the host and the device: serial code runs on the host,
Kernel A runs on the device, more serial code runs on the host, then Kernel B runs on the device.
CUDA: C Language Extensions
#include <stdio.h>
int main(void){
    . . .
    cudakernel<<<1,16>>>();      // 1 block, 16 threads per block
    . . .
}
CUDA: Blocks and Threads
CUDA: vectorAdd.cu
__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;   // global thread index
    if (i < numElements) {
        C[i] = A[i] + B[i];
    }
}
Vector Addition Example
CUDA: vectorAdd.cu
Each thread handles one array element: within every block threadIdx.x runs 0..7 in this illustration, and the global element index combines blockIdx.x, blockDim.x, and threadIdx.x.
int main(void) {
    . . .
    // Allocate device memory for vector A (similarly for d_B and d_C)
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);
    . . .
}
CUDA: vectorAdd.cu
int main(void) {
    . . .
    // Copy input vectors from host memory to device memory
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    . . .
}
Vector Addition Example
CUDA: vectorAdd.cu
int main(void) {
    . . .
    // Launch the Vector Add CUDA kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;   // round up
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    . . .
}
CUDA: vectorAdd.cu
int main(void) {
    . . .
    // Copy the result vector from device memory back to host memory
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    // Clean up device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    . . .
}
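Putting the fragments above together, a minimal end-to-end sketch (error checking omitted; the vector length and initialization are assumed values):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) C[i] = A[i] + B[i];
}

int main(void) {
    int numElements = 50000;                       // assumed vector length
    size_t size = numElements * sizeof(float);

    // Allocate and initialize host vectors
    float *A = (float *)malloc(size), *B = (float *)malloc(size), *C = (float *)malloc(size);
    for (int i = 0; i < numElements; i++) { A[i] = (float)i; B[i] = 2.0f * i; }

    // Allocate device vectors
    float *d_A = NULL, *d_B = NULL, *d_C = NULL;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy input vectors to the device
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);

    // Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

    // Copy the result back and clean up
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(A); free(B); free(C);
    return 0;
}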
Timing CUDA kernel
CUDA: vectorAddTime.cu
float memsettime;                 // kernel execution time in milliseconds
cudaEvent_t start, stop;          // CUDA events used as timers
// CUDA Kernel
. . .
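A minimal sketch of how these events can be used to time the kernel launch, continuing the fragment above (the vectorAdd launch parameters are taken from the earlier example and are an assumption for this file):

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                    // timestamp before the launch
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    cudaEventRecord(stop, 0);                     // timestamp after the launch
    cudaEventSynchronize(stop);                   // wait until the stop event has completed

    cudaEventElapsedTime(&memsettime, start, stop);   // elapsed time in milliseconds
    printf("Kernel execution time: %f ms\n", memsettime);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);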
CUDA: vectorAddTime.cu
scc-ha1 % make
Explore how the CUDA kernel execution time depends on the block size:
Remember:
• A CUDA Streaming Multiprocessor executes threads in warps (32 threads each)
• There is a maximum of 1024 threads per block (for our GPU)
• There is a maximum of 1536 threads per multiprocessor (for our GPU)
Dot Product
CUDA: dotProd1.cu
Each thread computes one pairwise product a_i * b_i; the products are then summed into C:
C = A * B = (a0, a1, a2, a3) * (b0, b1, b2, b3) = a0*b0 + a1*b1 + a2*b2 + a3*b3
Dot Product
CUDA: dotProd1.cu
#define N 512
__global__ void dot( int *a, int *b, int *c ) {

What if thread 0 starts to calculate the sum before the other threads have completed their calculations?
Thread Synchronization
CUDA: dotProd1.cu
#define N 512
__global__ void dot( int *a, int *b, int *c ) {
    . . .
    __syncthreads();     // barrier: wait until all threads in the block reach this point
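A minimal sketch of the complete single-block kernel, assembled from the fragments above (the exact body of dotProd1.cu may differ slightly):

#define N 512

__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[N];                       // shared among the threads of the block

    // Each thread computes one pairwise product.
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    // Barrier: no thread proceeds until every thread has stored its product.
    __syncthreads();

    // Thread 0 sums up the products.
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < N; i++) sum += temp[i];
        *c = sum;
    }
}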
CUDA: dotProd1.cu
int main(void) {
    . . .
    // copy input vectors to the device
    . . .
    . . .
    // copy the result back from the device
    . . .
}

But our vector length is limited by the maximum block size. Can we use multiple blocks?
Race Condition
CUDA: dotProd2.cu
Block 0 computes a0*b0, a1*b1, a2*b2, a3*b3 and adds them into its partial sum;
Block 1 does the same for a4*b4, a5*b5, a6*b6, a7*b7;
both blocks then add their partial sums into C.
Race Condition
CUDA: dotProd2.cu
#define N (2048*2048)
#define THREADS_PER_BLOCK 512

__global__ void dotProductKernel( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    if( threadIdx.x == 0 ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ ) sum += temp[i];
        *c += sum;          // every block updates *c: this read-modify-write is not atomic
    }
}

Blocks interfere with each other – race condition
Race Condition
CUDA: dotProd2.cu
#define N (2048*2048)
#define THREADS_PER_BLOCK 512

__global__ void dotProductKernel( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    if( threadIdx.x == 0 ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ ) sum += temp[i];
        atomicAdd( c, sum );    // atomic read-modify-write: blocks no longer interfere
    }
}
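A sketch of the host-side launch for this multi-block version (the variable names d_a, d_b, d_c are assumptions, following the kernel arguments above):

    // The result must start from zero, since every block adds its partial sum into it.
    int zero = 0;
    cudaMemcpy(d_c, &zero, sizeof(int), cudaMemcpyHostToDevice);

    // One thread per element: N / THREADS_PER_BLOCK blocks of THREADS_PER_BLOCK threads.
    dotProductKernel<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy the scalar result back to the host.
    int c;
    cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);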
Atomic Operations
Race condition: the behavior depends upon the relative timing of multiple event sequences.
It can occur when an implied read-modify-write operation is interruptible.
NVIDIA’s link:
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
Katia Oleinik
koleinik@bu.edu
http://www.bu.edu/tech/research/training/tutorials/list/