John Owens
Department of Electrical and Computer Engineering
Institute for Data Analysis and Visualization
University of California, Davis
Goals for this Hour
• Multi-GPU computing
• Single-GPU computing
“If you were plowing a field, which
would you rather use? Two strong
oxen or 1024 chickens?”
—Seymour Cray
Recent GPU Performance Trends
[Chart: Historical Single-/Double-Precision Peak Compute Rates (GFLOPS, log scale) for AMD GPUs, NVIDIA GPUs, and Intel CPUs, annotated with memory bandwidth and price: Radeon HD 5870 at 153.6 GB/s ($390), GeForce GTX 480 at 177.4 GB/s ($450), Xeon X7560 at 34 GB/s ($3692).]
• Double precision
• Fast atomics
• Hardware cache & ECC
• (CUDA) debuggers & profilers
Intel ISCA Paper (June 2010)
Debunking the 100X GPU vs. CPU Myth:
An Evaluation of Throughput Computing on CPU and GPU
ABSTRACT: Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences …

From the introduction: The past decade has seen a huge increase in digital content as more documents are being created in digital form than ever before. Moreover, the web has become the medium of choice for storing and delivering information such as stock market data, personal records, and news. Soon, the amount of digital data will exceed exabytes (10^18) [31]. The massive amount of data makes storing, cataloging, processing, and retrieving information challenging. A new class of applications has emerged across different domains such as database, games, video, and finance that can process this huge amount of data to distill and deliver appropriate content to users. A distinguishing feature of these applications is that they have plenty of data level parallelism and the data can be processed independently and in any order on different processing elements for a similar set of operations such as filtering, aggregating, ranking, etc. This feature together with a processing deadline defines throughput computing applications. Going forward, as digital data continues to grow rapidly, throughput computing applications are …
Top-Level Results
[Table from the paper: peak memory bandwidth (GB/s), single-precision (SP), and double-precision (DP) floating-point rates for the Core i7-960 and the GTX280 (e.g., 311.1/933.1 SP GFLOPS and 77.8 DP GFLOPS for the GTX280).]
CUDA Successes
[Chart: double-precision Gflop/s by platform, log scale: Fermi ≈ 515, C1060 ≈ 78, Nehalem x 2 ≈ 86, Nehalem ≈ 43; annotations mark roughly 6x (Case studies 2 & 3) and roughly 3x (Case study 1) differences.]
A Modern Computer
[Diagram, built up over several slides: a CPU and a discrete GPU, each with its own memory, connected through the chipset over PCI Express; the chipset also attaches to the network. Successive builds add the operations in a multi-GPU program: a kernel call on the GPU, a memory transfer between GPU and CPU memory, and a network send/receive.]
Mellanox GPUDirect
[Diagram: two nodes whose GPUs communicate with each other over InfiniBand.]
Fast & Flexible Communication
Marshal data
Send to GPU
Receive from CPU
Call kernel
Execute kernel
Retrieve from GPU
Send to CPU
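A minimal sketch of the steps above, assuming a CUDA + MPI program that stages data through host memory; h_buf, d_buf, n, src, dst, and the process kernel are illustrative names, and marshaling and error handling are omitted:

// One pipeline iteration: receive, push to the GPU, compute, pull back, forward.
MPI_Recv(h_buf, n, MPI_FLOAT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receive from CPU
cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);      // send to GPU
process<<<(n + 255) / 256, 256>>>(d_buf, n);                              // call & execute kernel
cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);      // retrieve from GPU
MPI_Send(h_buf, n, MPI_FLOAT, dst, 0, MPI_COMM_WORLD);                    // send to CPU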
Structuring Multi-GPU Programs
• Static division of work (Global Arrays: Zippy, CUDASA)
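A hedged sketch of such a static division, assuming CUDA 4.0 or later where one host thread can drive several devices; the array names, element count N, and the scale kernel are illustrative:

// Give each GPU a fixed, contiguous slice of the input.
int ndev = 0;
cudaGetDeviceCount(&ndev);
int chunk = N / ndev;                              // assume N divides evenly
for (int d = 0; d < ndev; ++d) {
    cudaSetDevice(d);                              // subsequent calls target GPU d
    float *d_slice;
    cudaMalloc((void**)&d_slice, chunk * sizeof(float));
    cudaMemcpy(d_slice, h_data + d * chunk, chunk * sizeof(float),
               cudaMemcpyHostToDevice);
    scale<<<(chunk + 255) / 256, 256>>>(d_slice, chunk);
    // ...copy results back and cudaFree(d_slice) once each GPU finishes...
}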
Programming Models
Abstractions
Mechanisms
Example
• Abstraction: GPU initiates network send
• Solution:
  • CPU allocates “mailbox” in GPU memory
  • GPU sets mailbox to initiate network send
• MPI-like interface
• Collectives
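A hedged sketch of the mailbox mechanism, using mapped (zero-copy) host memory so the GPU can raise a flag that the CPU polls before issuing the network send; the names, the single-block launch, and the stand-in work are illustrative:

// GPU side: do the work, then raise the mailbox flag.
__global__ void produce(float *out, int n, volatile int *mailbox) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = 2.0f * i;              // stand-in for real work
    __syncthreads();                    // all results written
    __threadfence_system();             // make them visible to the host
    if (threadIdx.x == 0) *mailbox = 1; // signal "ready to send"
}

// CPU side (inside main(), after d_out and n are set up):
volatile int *mb_h; int *mb_d;
cudaHostAlloc((void**)&mb_h, sizeof(int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&mb_d, (void*)mb_h, 0);
*mb_h = 0;
produce<<<1, 256>>>(d_out, n, mb_d);
while (*mb_h == 0) { /* spin until the GPU raises the flag */ }
// ...initiate the network send here (e.g., MPI_Send of the marshaled buffer)...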
• Process data in chunks
• Overlap communication
[Diagram: two CPU + GPU pairs, each running its own scheduler and reduce stage.]
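A hedged sketch of chunked processing that overlaps transfers with computation, assuming pinned host memory, double buffering on the device, and two CUDA streams; CHUNK, nChunks, the buffers, and the process kernel are illustrative:

// Copies queued in one stream can overlap the kernel running in the other.
cudaStream_t s[2];
cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);
for (int c = 0; c < nChunks; ++c) {
    int k = c % 2;                       // alternate between the two streams
    float *h = h_data + c * CHUNK;       // h_data allocated with cudaHostAlloc
    float *d = d_buf  + k * CHUNK;       // double-buffered device storage
    cudaMemcpyAsync(d, h, CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[k]);
    process<<<CHUNK / 256, 256, 0, s[k]>>>(d, CHUNK);
    cudaMemcpyAsync(h, d, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
}
cudaDeviceSynchronize();                 // wait for all chunks to finish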
Why is data-parallel computing fast?
• The GPU is specialized for compute-intensive, highly parallel
computation (exactly what graphics rendering is about)
• So, more transistors can be devoted to data processing rather than data
caching and flow control
[Diagram: relative die area; the CPU devotes much of its area to control logic and cache, while the GPU devotes most of its area to ALUs; each has its own DRAM.]
Programming Model: A Massively Multi-threaded Processor
• Lightweight threads
• Today:
• GPU hardware
• CUDA programming environment
Big Idea #1
• Instant switching
• SM has 8 SP Thread Processors
• Scalar ISA
• Up to 768 threads, hardware multithreaded
[Diagram: streaming multiprocessors, each with a multithreaded instruction unit (MT IU), SP thread processors, and shared memory.]
[Diagram: full GPU with an input assembler feeding many thread processors grouped with parallel data caches; load/store units connect to global memory.]
NVIDIA Fermi
Performance
• 7x Double Precision of CPUs
• IEEE 754-2008 SP & DP Floating Point
• Increased Shared Memory from 16 KB to 64 KB
• Added L1 and L2 Caches
Flexibility
• ECC on all Internal and External Memories
• Enable up to 1 TeraByte of GPU Memories
• High Speed GDDR5 Memory Interface
• Multiple Simultaneous Tasks on GPU
Usability
• 10x Faster Atomic Operations
• C++ Support
• System Calls, printf support
• Latency hiding.
• Scalable performance
[Diagram: two GPUs of different sizes, each with its own host interface, thread processors, parallel data caches, and load/store units, illustrating that the same program scales across chip sizes.]
Compiling CUDA for GPUs
[Diagram: the NVIDIA C compiler splits a C/C++ CUDA application into NVIDIA assembly for computing (PTX) and CPU host code. The PTX path feeds the CUDA driver, debugger, and profiler, and a specialized PTX-to-target translator produces device code for each target GPU; the host code is built with a standard C compiler and runs on the CPU.]
Programming Model (SPMD + SIMD): Thread Batching
• Two threads from two different blocks cannot cooperate
[Diagram: a block shown as a 2D array of threads, e.g. Thread (0,1) through Thread (4,2).]
• n = length(C)
• for i = 0 to n-1: C[i] = A[i] + B[i]
Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition

Device Code
__global__ void vecAdd(float* A, float* B, float* C){
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

Host Code
int main(){
    // Run N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
Synchronization of blocks
• Threads within block may synchronize with barriers
… Step 1 …
__syncthreads();
… Step 2 …
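For instance, a barrier is what makes it safe for threads to read shared-memory locations written by other threads in the same block; a hedged sketch (the within-block reversal is illustrative, not from the slides; launch with 256-thread blocks):

__global__ void reverseWithinBlock(float *d, int n) {
    __shared__ float tile[256];                   // one element per thread
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    tile[threadIdx.x] = (i < n) ? d[i] : 0.0f;    // Step 1: stage into shared memory
    __syncthreads();                              // barrier: Step 1 complete everywhere
    int j = blockDim.x - 1 - threadIdx.x;
    if (i < n) d[i] = tile[j];                    // Step 2: read other threads' writes
}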
• Memory management: cudaMalloc(), cudaFree()
• Texture management
• OpenGL & Direct3D interoperability: cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C){
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}
int main(){
// Run N/256 blocks of 256 threads each
vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
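The launch above assumes N is a multiple of 256. A hedged variant (not from the slides) rounds the block count up and passes the length so the extra threads can be masked off:

__global__ void vecAdd(float* A, float* B, float* C, int n){
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];   // threads past the end do nothing
}
// launch: vecAdd<<<(N + 255)/256, 256>>>(d_A, d_B, d_C, N);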
Example: Host code for vecAdd
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;
// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
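The fragment ends after the declarations; a hedged sketch of how the host code typically continues for this example (allocation, copies, launch, copy-back, cleanup), with N as before and h_C introduced here for the result:

cudaMalloc((void**)&d_A, N * sizeof(float));
cudaMalloc((void**)&d_B, N * sizeof(float));
cudaMalloc((void**)&d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// run N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

// copy the result back, then release device memory
float *h_C = (float*)malloc(N * sizeof(float));
cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);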
• Irregularity