
Introduction to Parallel GPU Computing

John Owens
Department of Electrical and Computer Engineering
Institute for Data Analysis and Visualization
University of California, Davis
Goals for this Hour

• Why GPU computing?

• Multi-GPU computing

• Single-GPU computing
“If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?”
—Seymour Cray
Recent GPU Performance Trends

[Chart: historical single-/double-precision peak compute rates in GFLOPS (log scale) vs. year, 2002–2010, for AMD GPUs, NVIDIA GPUs, and Intel CPUs, SP and DP. Highlighted parts: AMD Radeon 5870 (153.6 GB/s, $390), NVIDIA GTX 480 (177.4 GB/s, $450), Intel Xeon X7560 (34 GB/s, $3692).]

Early data courtesy Ian Buck; from Owens et al. 2007 [CGF]
What’s new?

• Double precision

• Fast atomics

• Hardware cache & ECC

• (CUDA) debuggers & profilers
Intel ISCA Paper (June 2010)
Debunking the 100X GPU vs. CPU Myth:
An Evaluation of Throughput Computing on CPU and GPU

Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey
Throughput Computing Lab and Intel Architecture Group, Intel Corporation

Abstract (excerpt): “Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today’s multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average. […]”

Top-Level Results
[Table from the paper: peak memory bandwidth and SP (single-precision) / DP (double-precision) floating-point rates for the Core i7-960 and GTX280.]
CUDA Successes
• 146X: Medical Imaging, U of Utah
• 36X: Molecular Dynamics, U of Illinois
• 18X: Video Transcoding, Elemental Tech
• 50X: Matlab Computing, AccelerEyes
• 100X: Astrophysics, RIKEN
• 149X: Financial simulation, Oxford
• 47X: Linear Algebra, Universidad Jaime
• 20X: 3D Ultrasound, Techniscan
• 130X: Quantum Chemistry, U of Illinois
• 30X: Gene Sequencing, U of Maryland

(c) 2010 NVIDIA Corporation [courtesy David Luebke, NVIDIA]


[Roofline-style plot: double-precision Gflop/s vs. arithmetic intensity (flop : byte) for Fermi, C1060, dual Nehalem, and single Nehalem platforms, with peak points at (3.6, 515), (1.7, 86), (0.8, 78), and (1.7, 43). The GPU-CPU gap is roughly 6x for case studies 2 & 3 and roughly 3x for case study 1.]

Courtesy Rich Vuduc, Georgia Tech
13 Dwarfs
• 1. Dense Linear Algebra
• 2. Sparse Linear Algebra
• 3. Spectral Methods
• 4. N-Body Methods
• 5. Structured Grids
• 6. Unstructured Grids
• 7. MapReduce
• 8. Combinational Logic
• 9. Graph Traversal
• 10. Dynamic Programming
• 11. Backtrack and Branch-and-Bound
• 12. Graphical Models
• 13. Finite State Machines
GPU in system (3 alternatives)

[Diagram: three placements for the GPU: on the same chip as the CPU; integrated in the chipset; or as a discrete device attached over PCI Express, each with its associated memory.]
A Modern Computer

[Diagram: CPU and GPU connected through the chipset, which also connects to the network. Annotated interactions: kernel call (CPU to GPU), memory transfer (between CPU and GPU memory), and send/receive (CPU to/from the network).]
Mellanox GPUDirect

[Diagram: GPU-accelerated nodes communicating over InfiniBand with Mellanox GPUDirect.]
Fast & Flexible Communication

• CPUs are good at creating & manipulating data structures?

• GPUs are good at accessing & updating data structures?


http://www.watch.impress.co.jp/game/docs/20060329/3dps303.jpg
Structuring CPU-GPU Programs

CPU: marshal data, send to GPU, call kernel, retrieve from GPU

GPU: receive from CPU, execute kernel, send to CPU
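
A minimal host-code sketch of this structure, assuming a placeholder kernel (step) and problem size; the names are illustrative, not code from the talk:

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void step(float *data, int n)                  // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h_data = (float *)malloc(n * sizeof(float));   // marshal data on the CPU
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float),
               cudaMemcpyHostToDevice);                    // send to GPU

    step<<<(n + 255) / 256, 256>>>(d_data, n);             // call kernel; GPU executes it

    cudaMemcpy(h_data, d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);                    // retrieve from GPU

    cudaFree(d_data);
    free(h_data);
    return 0;
}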
Structuring Multi-GPU Programs

CPU: static division of work (Global Arrays: Zippy, CUDASA)

[Diagram: one CPU statically dividing work across several GPUs]

Want to run on GPU:

if (foo == true) {
  GPU[x][bar] = baz;
} else {
  bar = GPU[y][baz];
}

Instead, GPU as slave.
Goal: GPU as first-class citizen.
Our Research Program

[Layer diagram: Programming Models, built on Abstractions, built on Mechanisms]
Example

• Abstraction: GPU initiates network send

• Problems:

• GPU can’t communicate with NI

• GPU signals CPU

• Solution:

• CPU allocates “mailbox” in GPU mem

• GPU sets mailbox to initiate network send

• CPU polls mailbox

• Take-home: the abstraction does not change even if the underlying mechanisms change
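
A minimal sketch of the mailbox idea, with hypothetical names. For simplicity the mailbox here lives in zero-copy (mapped, pinned) host memory rather than GPU memory, which is itself an example of swapping the mechanism without changing the abstraction:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void worker(volatile int *mailbox)
{
    // ... do some work, prepare a message ...
    __threadfence_system();   // make prior writes visible to the host (sm_20+)
    *mailbox = 1;             // raise the flag: "please send my data now"
    // ... continue working ...
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);

    volatile int *mailbox_h;
    int *mailbox_d;
    cudaHostAlloc((void **)&mailbox_h, sizeof(int), cudaHostAllocMapped);
    *mailbox_h = 0;
    cudaHostGetDevicePointer((void **)&mailbox_d, (void *)mailbox_h, 0);

    worker<<<1, 1>>>(mailbox_d);              // kernel launch is asynchronous

    while (*mailbox_h == 0) { /* spin */ }    // CPU polls the mailbox
    printf("mailbox raised: CPU now initiates the network send\n");

    cudaDeviceSynchronize();
    cudaFreeHost((void *)mailbox_h);
    return 0;
}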


DCGN: MPI-Like Programming Model
• Distributed Computing for GPU Networks (DCGN,
pronounced decagon)

• MPI-like interface

• Allows communication between all CPUs and GPUs in system

• Allow GPU to source/sink communication

• Multithreaded communication via MPI

• Both synchronous and asynchronous (<- overlap!)

• Collectives

• Multiplex MPI addresses (“slots”)
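
A hedged sketch (not DCGN’s actual API) of the host-staged pattern that DCGN automates: a CPU thread relays GPU data over MPI so that, from the program’s point of view, the GPU sources and sinks messages. Buffer names and ranks are illustrative; assumes MPI_Init has already been called.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

// Relay a GPU-resident buffer to a remote rank on the GPU's behalf.
void relay_gpu_send(const float *d_buf, int count, int peer)
{
    float *h_buf = (float *)malloc(count * sizeof(float));
    cudaMemcpy(h_buf, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);
    free(h_buf);
}

// Receive from a remote rank and deposit the message into GPU memory.
void relay_gpu_recv(float *d_buf, int count, int peer)
{
    float *h_buf = (float *)malloc(count * sizeof(float));
    MPI_Recv(h_buf, count, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, h_buf, count * sizeof(float), cudaMemcpyHostToDevice);
    free(h_buf);
}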


Architecture
[Architecture diagram]
MapReduce: Keys to Performance

[Pipeline diagram: a CPU scheduler feeds chunks to multiple GPUs, each running Map + Partial Reduce, Partition, Sort, and Reduce stages.]

• Process data in chunks

• More efficient transmission & computation

• Also allows out of core

• Overlap computation and communication

• Accumulate

• Partial Reduce
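
A minimal sketch of chunked processing with overlapped transfer and compute, assuming a placeholder kernel (process) and pinned host data; illustrative of the technique, not the MapReduce implementation itself:

#include <cuda_runtime.h>

__global__ void process(float *chunk, int n)               // placeholder per-chunk work
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= 2.0f;
}

void process_in_chunks(float *h_data, int total, int chunk)
{
    // h_data must be pinned (cudaHostAlloc / cudaHostRegister) for async copies.
    float *d_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_buf[s], chunk * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    for (int off = 0, s = 0; off < total; off += chunk, s ^= 1) {
        int n = (total - off < chunk) ? (total - off) : chunk;
        cudaMemcpyAsync(d_buf[s], h_data + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n);
        cudaMemcpyAsync(h_data + off, d_buf[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
    }
}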
Why is data-parallel computing fast?
• The GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about)

• So, more transistors can be devoted to data processing rather than data caching and flow control

[Diagram: CPU die area dominated by control logic and cache with a few ALUs vs. GPU die area dominated by ALUs, each attached to its own DRAM.]
Programming Model: A Massively Multi-threaded Processor

• Move data-parallel application portions to the GPU

• Differences between GPU and CPU threads

• Lightweight threads

• GPU supports 1000s of threads

• Today:

• GPU hardware

• CUDA programming environment
Big Idea #1

• One thread per data element.

• Doesn’t this mean that large problems will have millions of threads?
Big Idea #2

• Write one program.

• That program runs on ALL threads in parallel.

• NVIDIA’s terminology here is “SIMT”: single-instruction, multiple-thread.

• Roughly: SIMD means many threads run in lockstep; SIMT means that some divergence is allowed and handled by the hardware.
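
A small illustrative kernel (hypothetical) showing SIMT divergence: threads within the same warp take different branches, and the hardware serializes the two paths before reconverging. Branching on blockIdx instead of threadIdx would keep each warp uniform and avoid the serialization.

__global__ void simt_divergence(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {      // even lanes of each warp take this path ...
        out[i] = 2.0f * i;
    } else {                         // ... odd lanes execute this as a second pass
        out[i] = -1.0f * i;
    }
}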
CUDA Kernels and Threads
Definitions: Device = GPU; Host = CPU; Kernel = function that runs on the device

• Parallel portions of an application are executed on the device as kernels

• One SIMT kernel is executed at a time

• Many threads execute each kernel

• Differences between CUDA and CPU threads

• CUDA threads are extremely lightweight

• Very little creation overhead

• Instant switching

• CUDA must use 1000s of threads to achieve efficiency

• Multi-core CPUs can use only a few


SM Multithreaded Multiprocessor
(This figure is 1 generation old)

[Diagram: an SM with a multithreaded instruction unit (MT IU), SP thread processors, and shared memory.]

• Each SM runs a block of threads

• SM has 8 SP Thread Processors

• 32 GFLOPS peak at 1.35 GHz

• IEEE 754 32-bit floating point

• Scalar ISA

• Up to 768 threads, hardware multithreaded

• 16KB Shared Memory

• Concurrent threads share data

• Low latency load/store
GPU Computing (G80 GPUs)
• Processors execute computing threads

• Thread Execution Manager issues threads

• 128 Thread Processors

• Parallel Data Cache accelerates processing

[Diagram: the host feeds an input assembler and thread execution manager, which issue threads to groups of thread processors, each group with a parallel data cache, connected through load/store units to global memory.]
NVIDIA Fermi

Performance
• 7x Double Precision of CPUs
• IEEE 754-2008 SP & DP Floating Point
• Increased Shared Memory from 16 KB to 64 KB
• Added L1 and L2 Caches

Flexibility
• ECC on all Internal and External Memories
• Enable up to 1 TeraByte of GPU Memories
• High Speed GDDR5 Memory Interface

Usability
• Multiple Simultaneous Tasks on GPU
• 10x Faster Atomic Operations
• C++ Support
• System Calls, printf support

Slide courtesy NVIDIA, image from http://images.anandtech.com/reviews/video/NVIDIA/GTX460/fullGF100.jpg
Big Idea #3

• Latency hiding.

• It takes a long time to go to memory.

• So while one set of threads is waiting for memory ...

• ... run another set of threads during the wait.

• In practice, 32 threads run in a “warp” and an efficient program usually has 128–256 threads in a block.
HW Goal: Scalability
• Scalable execution

• Program must be insensitive to the number of cores

• Write one program for any number of SM cores

• Program runs on any size GPU without recompiling

• Hierarchical execution model

• Decompose problem into sequential steps (kernels)

• Decompose kernel into computing parallel blocks

• Decompose block into computing parallel threads (this is very important)

• Hardware distributes independent blocks to SMs as available


Scaling the Architecture
• Same program

• Scalable performance

[Diagram: two GPUs with different numbers of thread processors and parallel data caches, each with its own thread execution manager, load/store path, and global memory, running the same program.]
CUDA Software Development Kit

[Diagram: integrated CPU + GPU C source code, plus CUDA optimized libraries (math.h, FFT, BLAS, …), goes through the NVIDIA C compiler, which emits NVIDIA assembly for computing (PTX) for the GPU (supported by the CUDA driver, debugger, and profiler) and CPU host code for a standard C compiler.]

Compiling CUDA for GPUs

[Diagram: a C/C++ CUDA application is compiled by NVCC into CPU code plus generic PTX code; a specialized PTX-to-target translator then produces device code for each target GPU.]
Programming Model (SPMD + SIMD): Thread Batching
• A kernel is executed as a grid of thread blocks

• A thread block is a batch of threads that can cooperate with each other by:

• Efficiently sharing data through shared memory

• Synchronizing their execution

• For hazard-free shared memory accesses

• Two threads from two different blocks cannot cooperate

• Blocks are independent

[Diagram: the host launches Kernel 1 over Grid 1, a 3x2 arrangement of blocks (0,0) to (2,1), and Kernel 2 over Grid 2; each block, e.g. Block (1,1), is a 5x3 arrangement of threads (0,0) to (4,2).]
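
A hedged launch-configuration sketch matching this picture: a 2D grid of 2D thread blocks described with dim3. The kernel and image size are hypothetical.

#include <cuda_runtime.h>

__global__ void fill2D(float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height)
        img[y * width + x] = (float)(x + y);         // placeholder computation
}

int main()
{
    const int width = 1024, height = 768;
    float *d_img;
    cudaMalloc(&d_img, width * height * sizeof(float));

    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,      // enough blocks to
              (height + block.y - 1) / block.y);     // cover the whole image
    fill2D<<<grid, block>>>(d_img, width, height);

    cudaDeviceSynchronize();
    cudaFree(d_img);
    return 0;
}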


Blocks must be independent
• Any possible interleaving of blocks should be valid

• presumed to run to completion without pre-emption

• can run in any order

• can run concurrently OR sequentially

• Blocks may coordinate but not synchronize

• shared queue pointer: OK

• shared lock: BAD … can easily deadlock

• Independence requirement gives scalability


Big Idea #4

• Organization into independent blocks allows scalability / different hardware instantiations

• If you organize your kernels to run over many blocks ...

• ... the same code will be efficient on hardware that runs one block at once and on hardware that runs many blocks at once
CUDA: Programming GPU in C
• Philosophy: provide minimal set of extensions necessary to expose power

• Declaration specifiers to indicate where things live

__global__ void KernelFunc(...); // kernel callable from host

__device__ void DeviceFunc(...); // function callable on device

__device__ int GlobalVar; // variable in device memory

__shared__ int SharedVar; // shared within thread block

• Extend function invocation syntax for parallel kernel launch


KernelFunc<<<500, 128>>>(...); // launch 500 blocks w/ 128 threads each

• Special variables for thread identification in kernels


dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;

• Intrinsics that expose specific operations in kernel code


__syncthreads(); // barrier synchronization within kernel
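
A small sketch (hypothetical kernel) that combines the extensions above: a per-block sum reduction that stages data in __shared__ memory and uses __syncthreads() between steps. Assumes a launch with 256 threads per block.

__global__ void blockSum(const float *in, float *partialSums, int n)
{
    __shared__ float scratch[256];                    // per-block scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (i < n) ? in[i] : 0.0f;    // load; zero-pad the last block
    __syncthreads();                                  // all loads finish before any reads
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();                              // wait for this reduction step
    }
    if (threadIdx.x == 0)
        partialSums[blockIdx.x] = scratch[0];         // one partial sum per block
}

// Example launch: blockSum<<<N/256, 256>>>(d_in, d_partial, N);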
Example: Vector Addition Kernel

• Compute vector sum C = A+B means:

• n = length(C)

• for i = 0 to n-1:

• C[i] = A[i] + B[i]

• So C[0] = A[0] + B[0], C[1] = A[1] + B[1], etc.


Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition

// Device Code
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

// Host Code
int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}
Synchronization of blocks
• Threads within block may synchronize with barriers
… Step 1 …
__syncthreads();
… Step 2 …

• Blocks coordinate via atomic memory operations

• e.g., increment shared queue pointer with atomicInc()

• Implicit barrier between dependent kernels


vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_dot<<<nblocks, blksize>>>(c, c);
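
A hedged sketch (hypothetical kernel) of the shared-queue-pointer idea: blocks coordinate by atomically claiming work items from a global counter (atomicAdd here; atomicInc works similarly), without ever synchronizing with each other.

__global__ void drainQueue(int *queueHead, int totalItems, float *items)
{
    while (true) {
        int my = atomicAdd(queueHead, 1);   // returns the old value: a unique index
        if (my >= totalItems)
            break;                          // queue drained: this thread is done
        items[my] *= 2.0f;                  // placeholder per-item work
    }
}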
CUDA: Runtime support
• Explicit memory allocation returns pointers to GPU memory

cudaMalloc(), cudaFree()

• Explicit memory copy for host ↔ device, device ↔ device

cudaMemcpy(), cudaMemcpy2D(), ...

• Texture management

cudaBindTexture(), cudaBindTextureToArray(), ...

• OpenGL & DirectX interoperability

cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C){
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}

int main(){
// Run N/256 blocks of 256 threads each
vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
Example: Host code for vecAdd
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
Basic Efficiency Rules

• Develop algorithms with a data parallel mindset

• Minimize divergence of execution within blocks

• Maximize locality of global memory accesses

• Exploit per-block shared memory as scratchpad

• Expose enough parallelism
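
A hedged sketch (hypothetical kernels) of the locality rule: consecutive threads touching consecutive addresses coalesce into a few wide memory transactions, while a strided pattern does not.

__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                       // thread i touches element i: coalesced
}

__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i * stride] = in[i * stride];     // neighboring threads far apart: uncoalesced
}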


Summing Up
• Four big ideas:

1. One thread per data element

2. Write one program, runs on all threads

3. Hide latency by switching to different work

4. Independent blocks allow scalability

• Three key abstractions:

1. hierarchy of parallel threads

2. corresponding levels of synchronization

3. corresponding memory spaces


GPU Computing Challenges

• Addressing other dwarfs

• Sparseness & adaptivity

• Scalability: Multi-GPU algorithms and data structures

• Heterogeneity (Fusion/Knight’s Corner architectures)

• Irregularity

• Incremental data structures

• Abstract models of GPU computation


Thanks to …

• David Luebke and Rich Vuduc for slides

• NVIDIA for hardware donations; Argonne and University of Illinois / NCSA for cluster access

• Funding agencies: Department of Energy (SciDAC Institute for Ultrascale Visualization, Early Career Principal Investigator Award), NSF, LANL, BMW, NVIDIA, HP, Intel, UC MICRO, Microsoft, ChevronTexaco, Rambus
