GPUs DP Accelerators MSC
Data Parallel Accelerators
Dezső Sima
1. Introduction
2. Basics of the SIMT execution
3. Overview of GPGPUs
4. Overview of data parallel accelerators
5. Microarchitecture of GPGPUs
6. References
1. The emergence of GPGPUs
1. Introduction (1)
Figure: A 3D object represented by vertices, edges and surfaces
Vertices
• have three spatial coordinates, and
• carry supplementary information necessary to render the object, such as
• color
• texture
• reflectance properties
• etc.
1. Introduction (2)
Shaders
DirectX version   Pixel shader model   Vertex shader model   Supporting OS
8.1 (10/2001)     1.2, 1.3, 1.4        1.0, 1.1              Windows XP / Windows Server 2003
9.0 (12/2002)     2.0                  2.0
Based on their FP32 computing capability and the large number of FP units available,
GPUs became increasingly attractive for general-purpose computing as well; such devices are termed

GPGPUs
(General Purpose GPUs)
or
cGPUs
(computational GPUs)
1. Introduction (10)
Figure: Peak SP FP performance of Nvidia's GPUs vs Intel's P4 and Core2 processors [11]
1. Introduction (11)
Figure: Bandwidth values of Nvidia's GPUs vs Intel's P4 and Core2 processors [11]
1. Introduction (12)
Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11]
2. Basics of the SIMT execution
2. Basics of the SIMT execution (1)
• One-dimensional data parallel execution, i.e. it performs the same operation
  on all elements of given FX/FP input vectors
• Two-dimensional data parallel execution, i.e. it performs the same operation
  on all elements of given FX/FP input arrays (matrices)
• is massively multithreaded,
and provides
• data dependent flow control as well as
• barrier synchronization
Remark
(i.e. all ALUs of a SIMT core typically perform the same operation).
Figure: A SIMT core (a single Fetch/Decode unit feeding multiple ALUs)
SIMT cores are the basic building blocks of GPGPUs and data parallel accelerators.
During SIMT execution, 2-dimensional matrices are mapped to blocks of SIMT cores.
2. Basics of the SIMT execution (5)
Remark 1
ALUs take their operands from and write the calculated results to the register set
(RF) allocated to them.
Remark 2
Actually, the register sets (RF) allocated to each ALU are given parts of a
large enough register file.
Figure: Allocation of distinct parts of a large register set as workspaces of the ALUs
2. Basics of the SIMT execution (9)
Massive multithreading

Aim: speeding up computations by hiding long stalls.

Principle
• Suspend stalled threads and allocate ready-to-run threads for execution instead.
• When a large enough number of threads is available, long stalls can be hidden.
2. Basics of the SIMT execution (12)
Multithreading is implemented by
creating and managing parallel executable threads for each data element of the
execution domain.
Figure: Parallel executable threads for each element of the execution domain
(the same instructions are executed for all data elements)
2. Basics of the SIMT execution (13)
Achieved by
• providing separate contexts (register space) for each thread, and
• implementing a zero-cycle context switch mechanism.
2. Basics of the SIMT execution (14)
Figure: Providing separate thread contexts for each thread allocated for execution in a SIMT ALU
2. Basics of the SIMT execution (15)
In SIMT processing both paths of a branch are executed one after the other, such that
for each path the prescribed operations are executed only on those data elements which
fulfill the data condition given for that path (e.g. xi > 0).
Example
2. Basics of the SIMT execution (16)
First all ALUs meeting the condition execute the prescribed three operations,
then all ALUs missing the condition execute the next two operations.
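As an illustration, a minimal CUDA sketch of such a divergent branch (a hypothetical kernel, not taken from the source):

    // Within a warp, lanes with x[i] > 0 execute the 'then' path first,
    // while the remaining lanes are masked out; afterwards the remaining
    // lanes execute the 'else' path. The two paths run one after the
    // other, not concurrently.
    __global__ void branchExample(const float* x, float* y)
    {
        int i = threadIdx.x;
        if (x[i] > 0.0f)
            y[i] = x[i] * x[i];   // only lanes fulfilling the condition
        else
            y[i] = 0.0f;          // afterwards, the remaining lanes
    }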
Barrier synchronization
Makes all threads wait until all prior instructions have completed before the next instruction is executed.
Implemented e.g. in AMD’s Intermediate Language (IL) by the fence threads instruction [10].
Remark
In the R600 ISA this instruction is coded by setting the BARRIER field of the Control Flow
(CF) instruction format [7].
2. Basics of the SIMT execution (20)
Figure: Hierarchy of threads [25]
3. Overview of GPGPUs
3. Overview of GPGPUs (1)
GPGPUs
Figure: Overview of Nvidia's and AMD/ATI's GPGPU lines, from the 90 nm G80 (11/06)
and the 80 nm R600 (5/07) through enhanced architectures and shrinks to the 40 nm
Fermi and RV870 (both 9/09), along with the related programming environments
(CUDA, Brook+, RapidMind, OpenCL)
Nvidia:
  GTX260 ~ $300
  GTX280 ~ $600
AMD/ATI:
  HD4850 ~ $200
  HD4870  n/a
4. Overview of data parallel accelerators
4. Overview of data parallel accelerators (1)
Trend: from on-card implementation (recent implementations)
to on-die integration (future implementations)
On-card accelerators
E.g. Nvidia Tesla C870 Nvidia Tesla D870 Nvidia Tesla S870
Nvidia Tesla C1060 Nvidia Tesla S1070
AMD FireStream 9170 AMD FireStream 9250
Figure: PCI-E x16 host adapter card of Nvidia’s Tesla D870 desktop [4]
4. Overview of data parallel accelerators (8)
Figure: Connection cable between Nvidia’s Tesla S870 1U rack and the adapter cards
inserted into PCI-E x16 slots of the host server [6]
4. Overview of data parallel accelerators (11)
Nvidia Tesla

6/07: Desktop D870 (G80-based, incl. 2 × C870, 3 GB GDDR3, 1.037 TFLOPS)

AMD FireStream

12/07: Stream Computing SDK Version 1.0, including
• Brook+
• ACML (AMD Core Math Library)
• CAL (Compute Abstraction Layer)
• RapidMind support
                            Tesla C870   Tesla C1060   FireStream 9170   FireStream 9250
Core
  Core frequency            600 MHz      602 MHz       800 MHz           625 MHz
  ALU frequency             1350 MHz     1296 MHz      800 MHz           625 MHz
  Peak FP32 performance     518 GFLOPS   933 GFLOPS    512 GFLOPS        1 TFLOPS
Memory
  Mem. transfer rate (eff.) 1600 Mb/s    1600 Mb/s     1600 Mb/s         1986 Mb/s
  Mem. bandwidth            76.8 GB/s    102 GB/s      51.2 GB/s         63.5 GB/s
System

Table: Main features of Nvidia's and AMD/ATI's data parallel accelerator cards
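The C870 entry can be checked against the G80's configuration, assuming 128 streaming processors (an assumption, not stated above), each dual-issuing a MAD plus a MUL (3 FLOPs) per cycle at the 1350 MHz ALU clock:

    128 ALUs × 3 FLOPs/cycle × 1350 MHz ≈ 518 GFLOPS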
4. Overview of data parallel accelerators (14)
Nvidia Tesla
AMD/ATI FireStream
Performance figures:
SP FP performance: 2.72 TFLOPS
DP FP performance: 544 GFLOPS (1/5 of SP FP performance)
Radeon HD 5800 series

                         ATI Radeon HD 4870   ATI Radeon HD 5850   ATI Radeon HD 5870
Manufacturing process    55 nm                40 nm                40 nm
# of transistors         956 million          2.15 billion         2.15 billion
Core clock speed         750 MHz              725 MHz              850 MHz
# of stream processors   800                  1440                 1600
Compute performance      1.2 TFLOPS           2.09 TFLOPS          2.72 TFLOPS
Memory type              GDDR5                GDDR5                GDDR5
Memory clock             900 MHz              1000 MHz             1200 MHz
Memory data rate         3.6 Gbps             4.0 Gbps             4.8 Gbps
Memory bandwidth         115.2 GB/s           128 GB/s             153.6 GB/s
Max board power          160 W                170 W                188 W
Idle board power         90 W                 27 W                 27 W
Architecture overview

20 cores × 16 ALUs/core × 5 EUs/ALU = 1600 EUs (stream processing units)
Figure: Architecture overview [42]
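These counts are consistent with the quoted peak performance, assuming each EU executes one multiply-add (2 FLOPs) per cycle at the 850 MHz engine clock:

    1600 EUs × 2 FLOPs/cycle × 850 MHz = 2.72 TFLOPS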
NVidia’s Fermi
Introduced: Sept. 30, 2009, at Nvidia's GPU Technology Conference
Available: Q1 2010
5.2 Nvidia Fermi (2)
Nvidia: 16 cores
(Streaming Multiprocessors)

1 SM includes 32 ALUs
(called "Cuda cores" by Nvidia)

Cuda core (ALU)
• FX: 32-bit
• SP FP: 32-bit
• DP FP:
  – IEEE 754-2008 compliant
  – needs 2 clock cycles

DP FP performance: ½ of SP FP performance!
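A rough per-cycle comparison with the GT200 (assuming the GT200's 30 SMs, each with the single DP unit mentioned in the quote from [40] at the end of this material):

    Fermi: 16 SMs × 32 Cuda cores × ½ DP result/cycle = 256 DP results/cycle
    GT200: 30 SMs × 1 DP unit/SM                      =  30 DP results/cycle

That is roughly an 8.5× per-cycle gain, in line with the approximately ten-fold DP speed-up estimated in [40] once clock differences are taken into account.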
Figure: Hierarchy of threads [25]
5.2 Nvidia Fermi (8)
Work scheduling
5.2 Nvidia Fermi (13)
A TPC (Texture Processor Cluster) has
• 2 SMs in the G80/G92
• 3 SMs in the GT200

Figure: Assigning thread blocks to streaming multiprocessors (SMs) for execution [12]
5.2 Nvidia Fermi (14)
Figure: Scheduling of warp instructions onto the SPs and SFUs of an SM
Larrabee
• Objectives:
  High-end graphics processing and HPC.
  Not a single product but a base architecture for a number of different products.
• Brief history:
Project started ~ 2005
First unofficial public presentation: 03/2006 (withdrawn)
First brief public presentation 09/07 (Otellini) [29]
First official public presentations: in 2008 (e.g. at SIGGRAPH [27])
Due in ~ 2009
• Performance (targeted):
2 TFlops
5.3 Intel’s Larrabee (2)
Basic architecture
Main extensions
• 64-bit instructions
• 4-way multithreading (with 4 register sets)
• addition of a 16-wide (16 × 32-bit) vector unit (VU)
• increased L1 caches (32 KB vs 8 KB)
• access to its 256 KB local subset of a coherent L2 cache
• a ring network to access the coherent L2 cache and allow interprocessor communication
The Vector Unit

Mask registers
  have one bit per vector lane, to control which lanes of a vector register
  or memory data are read or written and which remain untouched.

VU scatter-gather instructions
  load a VU vector register from 16 non-contiguous data locations anywhere
  in the on-die L1 cache without penalty, or store a VU register similarly.

Numeric conversions
  8-bit and 16-bit integer and 16-bit FP data can be read from or written into
  the L1 cache, with conversion to 32-bit integers without penalty.
  The L1 D$ thus acts as an extension of the register file.
SP FP performance

16 ALUs × 2 operations/cycle = 32 operations/cycle per core

At present no data are available for the clock frequency or the number of cores in Larrabee.

Targeted SP FP performance: 2 TFLOPS
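Purely as an illustration (both the core count and the clock frequency are assumptions, as noted above), the 2 TFLOPS target could be met by, e.g., 32 cores at about 2 GHz:

    32 cores × 32 operations/cycle × 2 GHz ≈ 2 TFLOPS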
5.3 Intel’s Larrabee (14)
Larrabee's native C/C++ compiler allows many existing applications to be recompiled and run
correctly with no modifications.
6. References (1)
6. References
[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008,
Nvidia,
http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf
[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD
[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU,
ASPLOS 2006, June 2008
[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007
http://ati.amd.com/developer/techpapers.html
[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,
June 2008, Nvidia
[12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007,
University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/
lectures/lecture7-threading%20hardware.ppt#256,1,ECE 498AL Lectures 7:
Threading Hardware in G80
[13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008,
http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf
[14]: Nvidia G80, PC Watch, April 16 2007,
http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm
[15]: GeForce 8800GT (G92), PC Watch, Oct. 31 2007,
http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf
[18]: http://en.wikipedia.org/wiki/DirectX
[19]: Dietrich S., “Shader Model 3.0,” April 2004, Nvidia,
http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf
[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006,
Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html
6. References (3)
[21]: Patidar S. & al., “Exploiting the Shader Model 4.0 Architecture,” Center for
Visual Information Technology, IIIT Hyderabad,
http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf
[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia,
http://www.nvidia.com/page/8800_tech_briefs.html
[24]: Fatahalian K., “From Shader Code to a Teraflop: How Shader Cores Work,”
Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008,
[27]: Seiler L. & al., “Larrabee: A Many-Core x86 Architecture for Visual Computing,”
ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008
[29]: Shrout R., IDF Fall 2007 Keynote, Sept. 18, 2007, PC Perspective,
http://www.pcper.com/article.php?aid=453
6. References (4)
[30]: Stokes J., “Larrabee: Intel’s biggest leap ahead since the Pentium Pro,”
Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-
intels-biggest-leap-ahead-since-the-pentium-pro.html
[32]: Hester P., “Multi-Core and Beyond: Evolving the x86 Architecture,” Hot Chips 19,
Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf
[33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1
http://ati.amd.com/technology/streamcomputing/
Stream_Computing_User_Guide.pdf
[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf. Aug. 2007,
http://www.graphicshardware.org/previous/www_2007/presentations/
doggett-radeon2900-gh07.pdf
[35]: Mantor M., “AMD’s Radeon HD 2900,” Hot Chips 19, Aug. 2007,
http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf
[36]: Houston M., “Anatomy of AMD’s TeraScale Graphics Engine,” SIGGRAPH 2008,
http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf
[37]: Mantor M., “Entering the Golden Age of Heterogeneous Computing,” PEEP 2008,
http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf
6. References (5)
[39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009,
http://www.realworldtech.com/includes/templates/articles.cfm?
ArticleID=RWT093009110932&mode=print
[40]: Wasson S., Nvidia’s ‘Fermi’ GPU architecture revealed,
Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1
4. Overview of data parallel accelerators (13)
Microarchitecture of GPGPUs
Figure: Block diagrams of 3-level and two-level GPGPU microarchitectures
(CB: Core Block, CBA: Core Block Array, IN: Interconnection Network, MC: Memory Controller;
the cores with their L1 caches connect through the IN to the L2 caches and memory controllers
of the Global Memory, and via a PCI-E x16 interface to the host)
Terminology (equivalent terms)

Unit             Nvidia                            AMD/ATI
Core Block (CB)  Texture Processor Cluster (TPC)   SIMD Array / SIMD Engine
SIMD core        (Streaming) Multiprocessor        SIMD
ALU              Streaming Processor /             Stream Processing Unit /
                 Thread Processor                  Stream Processor / Scalar ALU
G80/G92 Microarchitecture
5.1 Nvidia’s GPGPU line (6)
Figure: Overview
of the G80 [14]
5.1 Nvidia’s GPGPU line (7)
Figure: Overview
of the G92 [15]
5.1 Nvidia’s GPGPU line (8)
Streaming Processors: SIMT ALUs

Figure: Structure of an SM (I$ L1, multithreaded instruction buffer, constant cache C$ L1,
register file RF, shared memory, operand select, MAD and SFU units)

The Constant Cache
• Immediate address constants
• Indexed address constants
• Constants are stored in DRAM and cached on chip
  – one L1 constant cache per SM
• A constant value can be broadcast to all threads in a warp
  – an extremely efficient way of accessing a value that is common for all
    threads in a block!

Shared Memory
• Each SM has 16 KB of Shared Memory
  – organized as 16 banks of 32-bit words
• CUDA uses Shared Memory as shared storage visible to all threads in a thread block
  – with read and write access
• Not used explicitly for pixel shader programs
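A minimal CUDA sketch of the constant-cache broadcast described above (hypothetical kernel and variable names; cudaMemcpyToSymbol is the standard runtime call for initializing constant memory):

    #include <cuda_runtime.h>

    // A value common to all threads, placed in the constant memory space.
    __constant__ float coeff;

    // All threads of a warp read the same address, so the value is
    // broadcast from the constant cache instead of being fetched
    // separately for each thread.
    __global__ void scale(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= coeff;
    }

    // Host side, before the kernel launch:
    //   float h_coeff = 2.0f;
    //   cudaMemcpyToSymbol(coeff, &h_coeff, sizeof(float));
    //   scale<<<blocks, threadsPerBlock>>>(d_data);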
A program needs to manage the global, constant and texture memory spaces
visible to kernels through calls to the CUDA runtime.
This includes memory allocation and deallocation as well as invoking data transfers
between the CPU and GPU.
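A minimal sketch of this sequence (hypothetical buffer names), using only standard CUDA runtime calls:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        const int N = 1024;
        const size_t size = N * sizeof(float);

        // Host buffers
        float* h_A = (float*)malloc(size);
        float* h_B = (float*)malloc(size);

        // Allocation of device (global memory) buffers
        float *d_A, *d_B;
        cudaMalloc((void**)&d_A, size);
        cudaMalloc((void**)&d_B, size);

        // CPU -> GPU data transfers invoked through the runtime
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // ... kernel launches would go here ...

        // GPU -> CPU transfer and deallocation
        cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);
        cudaFree(d_A); cudaFree(d_B);
        free(h_A); free(h_B);
        return 0;
    }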
5.1 Nvidia’s GPGPU line (15)
Barrier synchronization
Principle of operation
CUDA [11]
A kernel is specified by declaring a C function with the __global__ qualifier; the number
of CUDA threads executing it is given in the <<<...>>> execution configuration of the call.
Execution of kernels
When called, a kernel is executed N times in parallel by N associated CUDA threads,
as opposed to only once, as in the case of regular C functions.
5.1 Nvidia’s GPGPU line (20)
Example
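Following the canonical vector-addition kernel of the CUDA Programming Guide [11] (d_A, d_B and d_C are device buffers, allocated e.g. as in the earlier sketch):

    // Kernel definition: __global__ marks VecAdd as a kernel.
    __global__ void VecAdd(const float* A, const float* B, float* C)
    {
        int i = threadIdx.x;   // each thread handles one vector element
        C[i] = A[i] + B[i];
    }

    // Kernel invocation: one thread block of N threads, i.e. the kernel
    // body is executed N times in parallel, once per thread:
    //   VecAdd<<<1, N>>>(d_A, d_B, d_C);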
Remark
The thread index threadIdx is a three-component vector, so that threads can be identified
by a one-, two- or three-dimensional index, forming one-, two- or three-dimensional
thread blocks.
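For instance, a two-dimensional thread block (following the matrix-addition example of [11]; N is assumed to be the matrix dimension):

    #define N 16

    // Each thread computes one element of the N x N result matrix,
    // identified by the two components of threadIdx.
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    // Invocation with a single N x N thread block:
    //   dim3 threadsPerBlock(N, N);
    //   MatAdd<<<1, threadsPerBlock>>>(A, B, C);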
5.1 Nvidia’s GPGPU line (21)
• grids
• thread blocks
• threads
5.1 Nvidia’s GPGPU line (23)
Figure: Hierarchy of threads [25]
5.1 Nvidia’s GPGPU line (24)
Thread blocks

Barrier synchronization
• used to coordinate memory accesses at synchronization points,
• at synchronization points the execution of the threads is suspended until all
  threads reach this point (barrier synchronization),
• synchronization is achieved by the intrinsic void __syncthreads();
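A minimal sketch (a hypothetical kernel) of such a synchronization point: no thread reads the shared array until every thread of the block has finished writing it:

    // Reverses a block-sized array via shared memory. __syncthreads()
    // suspends each thread until all threads of the block have completed
    // the preceding shared-memory writes.
    __global__ void reverseBlock(float* data, int n)
    {
        __shared__ float tmp[256];
        int t = threadIdx.x;
        tmp[t] = data[t];
        __syncthreads();          // barrier synchronization point
        data[t] = tmp[n - 1 - t];
    }

    // Launch with e.g.: reverseBlock<<<1, 256>>>(d_data, 256);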
5.1 Nvidia’s GPGPU line (28)
GT200
5.1 Nvidia’s GPGPU line (29)
Streaming Multiprocessors: SIMT cores
RV870 (Radeon HD 5870)
Performance figures:
Engine clock speed: 850 MHz
SP FP performance: 2.72 TFLOPS
DP FP performance: 544 GFLOPS (1/5 of SP FP performance)
RV770-RV870 Comparison

                   ATI Radeon HD 4870      ATI Radeon HD 5870      Difference
Die size           263 mm²                 334 mm²                 1.27x
# of transistors   956 million             2.15 billion            2.25x
# of shaders       800                     1600                    2x
Board power        90 W idle, 160 W max    27 W idle, 188 W max    0.3x, 1.17x
The following assessment of the RV870-based Radeon HD 5870 is quoted from the Tech Report
review (http://techreport.com/articles.x/17618/3):
The single biggest improvement from the last generation, though, is in power consumption.
The 5870's peak power draw is rated at 188 W, up a bit from the 4870's 160 W TDP. But idle
power draw on the 5870 is rated at an impressively low 27 W, down precipitously from the
90 W rating of the 4870. Much of the improvement comes from Cypress's ability to put its
GDDR5 memory into a low-power state, something the 4870's first-gen GDDR5 interface
couldn't do. Additionally, a second 5870 board in a CrossFire multi-GPU config can go even
lower, dropping into an ultra-low-power state just below 20 W.

AMD says the plan is for Radeon HD 5870 cards to be available for purchase today at a
price of $379.
Fermi
And each floating-point unit is now capable of producing IEEE 754-2008-compliant double-precision
FP results in two clock cycles, or half the performance of single-precision math. That's a huge step
up from the GT200's lone DP unit per SM—hence our estimate of a ten-fold increase in DP
performance. [40]