
GPGPUs

-
Data Parallel Accelerators

Dezső Sima

Oct. 20. 2009


Ver. 1.0 © Dezső Sima 2009
Contents

1.Introduction

2. Basics of the SIMT execution

3. Overview of GPGPUs

4. Overview of data parallel accelerators

5. Microarchitecture of GPGPUs (examples)

5.1 AMD/ATI RV870 (Cypress)

5.2 Nvidia Fermi

5.3 Intel’s Larrabee

6. References
1. The emergence of GPGPUs
1. Introduction (1)

Representation of objects by triangles

Vertex

Edge Surface

Vertices
• have three spatial coordinates
• carry supplementary information necessary to render the object, such as
• color
• texture
• reflectance properties
• etc.
1. Introduction (2)

Main types of shaders in GPUs

Shaders

Vertex shaders: transform each vertex's 3D position in the virtual space to the 2D coordinate at which it appears on the screen.

Pixel shaders (fragment shaders): calculate the color of the pixels.

Geometry shaders: can add or remove vertices from a mesh.
1. Introduction (3)

DirectX version       Pixel SM         Vertex SM    Supporting OS

8.0  (11/2000)        1.0, 1.1         1.0, 1.1     Windows 2000
8.1  (10/2001)        1.2, 1.3, 1.4    1.0, 1.1     Windows XP / Windows Server 2003
9.0  (12/2002)        2.0              2.0
9.0a (3/2003)         2_A, 2_B         2.x
9.0c (8/2004)         3.0              3.0          Windows XP SP2
10.0 (11/2006)        4.0              4.0          Windows Vista
10.1 (2/2008)         4.1              4.1          Windows Vista SP1 / Windows Server 2008
11   (in development) 5.0              5.0
Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX
and MS’s OSs [18], [21]
1. Introduction (4)

Convergence of important features of the vertex and pixel shader models


Subsequent shader models typically introduce a number of new or enhanced features.
In a given shader model the vertex and pixel shader models differ concerning precision requirements,
instruction sets and programming resources; these differences diminish from one shader model to the next.

Shader model 2 [19]


• Different precision requirements
Vertex shader: FP32 (coordinates)
Pixel shader: FX24 (3 colors x 8)
• Different instructions
• Different resources (e.g. registers)

Shader model 3 [19]

• Unified precision requirements for both shaders (FP32)


with the option to specify partial precision (FP16 or FP24)
by adding a modifier to the shader code
• Different instructions
• Different resources (e.g. registers)
1. Introduction (5)

Shader model 4 (introduced with DirectX10) [20]


• Unified precision requirements for both shaders (FP32)
with the possibility to use new data formats.
• Unified instruction set
• Unified resources (e.g. temporary and constant registers)

Shader architectures of GPUs prior to SM4

GPUs prior to SM4 (DirectX 10):


have separate vertex and pixel units with different features.

Drawback of having separate units for vertex and pixel shading

Inefficiency of the hardware implementation


(Vertex shaders and pixel shaders often have complementary load patterns [21]).
1. Introduction (6)

DirectX version       Pixel SM         Vertex SM    Supporting OS

8.0  (11/2000)        1.0, 1.1         1.0, 1.1     Windows 2000
8.1  (10/2001)        1.2, 1.3, 1.4    1.0, 1.1     Windows XP / Windows Server 2003
9.0  (12/2002)        2.0              2.0
9.0a (3/2003)         2_A, 2_B         2.x
9.0c (8/2004)         3.0              3.0          Windows XP SP2
10.0 (11/2006)        4.0              4.0          Windows Vista
10.1 (2/2008)         4.1              4.1          Windows Vista SP1 / Windows Server 2008
11   (in development) 5.0              5.0

Table: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX


and MS’s OSs [18], [21]
1. Introduction (7)

Unified shader model (introduced in the SM 4.0 of DirectX 10.0)

Unified, programmable shader architecture

The same (programmable) processor can be used to implement all shaders:


• the vertex shader
• the pixel shader and
• the geometry shader (new feature of SM 4)
1. Introduction (8)

Figure: Principle of the unified shader architecture [22]


1. Introduction (9)

Based on its FP32 computing capability and the large number of FP-units available

the unified shader is a prospective candidate for speeding up HPC!

GPUs with unified shader architectures are also termed

GPGPUs
(General Purpose GPUs)

or

cGPUs
(computational GPUs)
1. Introduction (10)

Figure: Peak SP FP performance of Nvidia's GPUs vs Intel's P4 and Core2 processors [11]
1. Introduction (11)

Figure: Bandwidth values of Nvidia's GPUs vs Intel's P4 and Core2 processors [11]
1. Introduction (12)

Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11]
2. Basics of the SIMT execution
2. Basics of the SIMT execution (1)

Main alternatives of data parallel execution

Data parallel execution

SIMD execution
• One-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors.
• Needs an FX/FP SIMD extension of the ISA.
• E.g. superscalars with SIMD extensions.

SIMT execution
• Two-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (matrices).
• Is massively multithreaded, and provides
  • data dependent flow control as well as
  • barrier synchronization.
• Needs an FX/FP SIMT extension of the ISA and the API.
• E.g. 2nd and 3rd generation GPGPUs, data parallel accelerators.

Figure: Main alternatives of data parallel execution


2. Basics of the SIMT execution (2)

Scalar, SIMD and SIMT execution

Scalar execution: the domain of execution is single data elements.

SIMD execution: the domain of execution is the elements of vectors.

SIMT execution: the domain of execution is the elements of matrices (at the programming level).

Figure: Domains of execution in case of scalar, SIMD and SIMT execution

Remark

SIMT execution is also termed SPMD (Single-Program, Multiple-Data) execution (Nvidia).


2. Basics of the SIMT execution (3)

Key components of the implementation of SIMT execution

• Data parallel execution


• Massive multithreading
• Data dependent flow control
• Barrier synchronization
2. Basics of the SIMT execution (4)

Data parallel execution

Performed by SIMT cores

SIMT cores execute the same instruction stream on a number of ALUs

(i.e. all ALUs of a SIMT core typically perform the same operation).

SIMT core

Fetch/Decode

ALU ALU ALU ALU ALU ALU ALU ALU

Figure: Basic layout of a SIMT core

SIMT cores are the basic building blocks of GPGPUs and data parallel accelerators.

During SIMT execution, 2-dimensional matrices are mapped to blocks of SIMT cores.
2. Basics of the SIMT execution (5)

Remark 1

Different manufacturers designate SIMT cores differently, such as

• streaming multiprocessor (Nvidia),


• superscalar shader processor (AMD),
• wide SIMD processor, CPU core (Intel).
2. Basics of the SIMT execution (6)

Each ALU is allocated a working register set (RF)

Fetch/Decode

ALU ALU ALU ALU ALU ALU ALU ALU

RF RF RF RF RF RF RF RF

Figure: Main functional blocks of a SIMT core


2. Basics of the SIMT execution (7)

SIMT ALUs typically perform RRR operations, that is

ALUs take their operands from and write the calculated results to the register set
(RF) allocated to them.

RF

ALU

Figure: Principle of operation of the SIMT ALUs


2. Basics of the SIMT execution (8)

Remark 2

Actually, the register sets (RF) allocated to each ALU are given parts of a
large enough register file.

RF RF RF RF RF RF RF RF

ALU ALU ALU ALU ALU ALU ALU ALU

Figure: Allocation of distinct parts of a large register set as workspaces of the ALUs
2. Basics of the SIMT execution (9)

Basic operation of recent SIMT ALUs

• execute basically SP FP MADD (single precision, i.e. 32-bit, Multiply-Add) instructions
  of the form a×b+c,
• are pipelined, capable of starting a new operation every new clock cycle
  (more precisely, every shader clock cycle),
  that is, without further enhancements their peak performance is 2 SP FP operations/cycle,
• need a few clock cycles, e.g. 2 or 4 shader cycles,
  to present the results of the SP FP MADD operations to the RF.
2. Basics of the SIMT execution (10)

Additional operations provided by SIMT ALUs

• FX operations and FX/FP conversions,


• DP FP operations,
• trigonometric functions (usually supported by special functional units).
2. Basics of the SIMT execution (11)

Massive multithreading

Aim of massive multithreading

to speed up computations by increasing the utilization of available computing resources


in case of stalls (e.g. due to cache misses).

Principle

• Suspend stalled threads from execution and allocate ready-to-run threads for execution.
• When a large enough number of threads is available, long stalls can be hidden.
2. Basics of the SIMT execution (12)

Multithreading is implemented by
creating and managing parallel executable threads for each data element of the
execution domain.

Same instructions
for all data elements

Figure: Parallel executable threads for each element of the execution domain
2. Basics of the SIMT execution (13)

Effective implementation of multithreading

requires that thread switches, called context switches, do not cause cycle penalties.

Achieved by
• providing separate contexts (register space) for each thread, and
• implementing a zero-cycle context switch mechanism.
2. Basics of the SIMT execution (14)

SIMT core

Fetch/Decode

CTX CTX CTX CTX CTX CTX CTX CTX

CTX CTX CTX CTX CTX CTX CTX CTX

CTX CTX CTX CTX CTX CTX CTX CTX


Actual context Register file (RF)
CTX CTX CTX CTX CTX CTX CTX CTX

Context switch CTX CTX CTX CTX CTX CTX CTX CTX

CTX CTX CTX CTX CTX CTX CTX CTX

ALU ALU ALU ALU ALU ALU ALU ALU

Figure: Providing separate thread contexts for each thread allocated for execution in a SIMT ALU
2. Basics of the SIMT execution (15)

Data dependent flow control

Implemented by SIMT branch processing

In SIMT processing both paths of a branch are executed one after the other, such that
on each path the prescribed operations are executed only on those data elements which
fulfill the data condition given for that path (e.g. xi > 0).

Example
2. Basics of the SIMT execution (16)

Figure: Execution of branches [24]


The given condition will be checked separately for each thread
2. Basics of the SIMT execution (17)

First all ALUs meeting the condition execute the prescribed three operations,
then all ALUs missing the condition execute the next two operations.

Figure: Execution of branches [24]
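A minimal CUDA-style sketch of such a data dependent branch (an illustration added here, not taken from the slides): the condition is evaluated per thread, the first path is executed with the non-qualifying threads masked off, then the roles are reversed for the second path.

__global__ void branch_example(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per data element
    if (i < n) {
        if (x[i] > 0.0f)               // data condition, checked separately for each thread
            y[i] = 2.0f * x[i];        // executed first, only by threads with x[i] > 0
        else
            y[i] = -x[i];              // executed afterwards by the remaining threads
    }
}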


2. Basics of the SIMT execution (18)

Figure: Resuming instruction stream processing after executing a branch [24]


2. Basics of the SIMT execution (19)

Barrier synchronization

Makes all threads wait until every thread has completed all prior instructions, before execution proceeds with the next instruction.

Implemented e.g. in AMD’s Intermediate Language (IL) by the fence threads instruction [10].

Remark

In the R600 ISA this instruction is coded by setting the BARRIER field of the Control Flow
(CF) instruction format [7].
2. Basics of the SIMT execution (20)

Principle of SIMT execution


Host Device

Each kernel invocation


is executed as a grid of
kernel0<<<>>>() thread blocks (Block(i,j))

kernel1<<<>>>()

Figure: Hierarchy of
threads [25]
3. Overview of GPGPUs
3. Overview of GPGPUs (1)

Basic implementation alternatives of the SIMT execution

GPGPUs
• Programmable GPUs with appropriate programming environments
• Have display outputs
• E.g. Nvidia's 8800 and GTX lines, AMD's HD 38xx and HD 48xx lines

Data parallel accelerators
• Dedicated units supporting data parallel execution, with an appropriate programming environment
• No display outputs
• Have larger memories than GPGPUs
• E.g. Nvidia's Tesla lines, AMD's FireStream lines

Figure: Basic implementation alternatives of the SIMT execution


3. Overview of GPGPUs (2)

GPGPUs

Nvidia’s line AMD/ATI’s line

90 nm G80 80 nm R600

Shrink Enhanced Shrink Enhanced


arch. arch.
65 nm G92 G200 55 nm RV670 RV770

Enhanced Enhanced
Shrink Shrink arch.
arch.

40 nm Fermi 40 nm RV870

Figure: Overview of Nvidia’s and AMD/ATI’s GPGPU lines


3. Overview of GPGPUs (3)

NVidia
11/06 10/07 6/08 9/09

Cores G80 G92 GT200 Fermi


90 nm/681 mtrs 65 nm/754 mtrs 65 nm/1400 mtrs 40 nm/3000 mtrs

Cards 8800 GTS 8800 GTX 8800 GT GTX260 GTX280


96 ALUs 128 ALUs 112 ALUs 192 ALUs 240 ALUs 512 ALUs
320-bit 384-bit 256-bit 448-bit 512-bit 384-bit

6/07 11/07 6/08

CUDA Version 1.0 Version 1.1 Version 2.0

AMD/ATI
11/05 5/07 11/07 5/08 9/09

Cores R500 R600 R670 RV770 RV870


80 nm/681 mtrs 55 nm/666 mtrs 55 nm/956 mtrs 40 nm/2100 mtrs

Cards (Xbox) HD 2900XT HD 3850 HD 3870 HD 4850 HD 4870 HD 5870


48 ALUs 320 ALUs 320 ALUs 320 ALUs 800 ALUs 800 ALUs 1600 ALUs
512-bit 256-bit 256-bit 256-bit 256-bit 256-bit
12/08

OpenCL+ OpenCL

11/07
Brooks+ Brook+

6/08
RapidMind 3870
support

2005 2006 2007 2008 2009

Figure: Overview of GPGPUs


3. Overview of GPGPUs (4)

8800 GTS 8800 GTX 8800 GT GTX 260 GTX 280


Core G80 G80 G92 GT200 GT200
Introduction 11/06 11/06 10/07 6/08 6/08
IC technology 90 nm 90 nm 65 nm 65 nm 65 nm
Nr. of transistors 681 mtrs 681 mtrs 754 mtrs 1400 mtrs 1400 mtrs
Die area 480 mm2 480 mm2 324 mm2 576 mm2 576 mm2
Core frequency 500 MHz 575 MHz 600 MHz 576 MHz 602 MHz
Computation
No.of ALUs 96 128 112 192 240
Shader frequency 1.2 GHz 1.35 GHz 1.512 GHz 1.242 GHz 1.296 GHz
No. FP32 inst./cycle 3* 3* 3* 3 3 (*: but only in a few issue cases)
Peak FP32 performance 346 GFLOPS 512 GFLOPS 508 GFLOPS 715 GFLOPS 933 GFLOPS
Peak FP64 performance – – – – 77.76 GFLOPS
Memory
Mem. transfer rate (eff) 1600 Mb/s 1800 Mb/s 1800 Mb/s 1998 Mb/s 2214 Mb/s
Mem. interface 320-bit 384-bit 256-bit 448-bit 512-bit
Mem. bandwidth 64 GB/s 86.4 GB/s 57.6 GB/s 111.9 GB/s 141.7 GB/s
Mem. size 320 MB 768 MB 512 MB 896 MB 1.0 GB
Mem. type GDDR3 GDDR3 GDDR3 GDDR3 GDDR3
Mem. channel 6*64-bit 6*64-bit 4*64-bit 8*64-bit 8*64-bit
Mem. contr. Crossbar Crossbar Crossbar Crossbar Crossbar
System
Multi. CPU techn. SLI SLI SLI SLI SLI
Interface PCIe x16 PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16
MS Direct X 10 10 10 10.1 subset 10.1 subset
Table: Main features of Nvidia’s GPGPUs
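Worked example (values taken from the table above): peak FP32 performance = no. of ALUs × FP32 instructions/cycle × shader frequency. For the GTX 280: 240 × 3 × 1.296 GHz ≈ 933 GFLOPS, matching the table entry; counting only the two operations of the MADD gives 240 × 2 × 1.296 GHz ≈ 622 GFLOPS.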
3. Overview of GPGPUs (5)

HD 2900XT HD 3850 HD 3870 HD 4850 HD 4870


Core R600 R670 R670 RV770 RV770
Introduction 5/07 11/07 11/07 5/08 5/08
IC technology 80 nm 55 nm 55 nm 55 nm 55 nm
Nr. of transistors 700 mtrs 666 mtrs 666 mtrs 956 mtrs 956 mtrs
Die area 408 mm2 192 mm2 192 mm2 260 mm2 260 mm2
Core frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz
Computation
No. of ALUs 320 320 320 800 800
Shader frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz
No. FP32 inst./cycle 2 2 2 2 2
Peak FP32 performance 471.6 GFLOPS 429 GFLOPS 496 GFLOPS 1000 GFLOPS 1200 GFLOPS
Peak FP64 performance – – – 200 GFLOPS 240 GFLOPS
Memory
Mem. transfer rate (eff) 1600 Mb/s 1660 Mb/s 2250 Mb/s 2000 Mb/s 3600 Mb/s (GDDR5)
Mem. interface 512-bit 256-bit 256-bit 256-bit 256-bit
Mem. bandwidth 105.6 GB/s 53.1 GB/s 72.0 GB/s 64 GB/s 115.2 GB/s
Mem. size 512 MB 256 MB 512 MB 512 MB 512 MB
Mem. type GDDR3 GDDR3 GDDR4 GDDR3 GDDR3/GDDR5
Mem. channel 8*64-bit 8*32-bit 8*32-bit 4*64-bit 4*64-bit
Mem. contr. Ring bus Ring bus Ring bus Crossbar Crossbar
System
Multi. CPU techn. CrossFire CrossFire X CrossFire X CrossFire X CrossFire X
Interface PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16
MS Direct X 10 10.1 10.1 10.1 10.1
Table: Main features of AMD/ATIs GPGPUs
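Worked example (values taken from the table above): with 2 FP32 instructions/cycle per ALU, the HD 4850 delivers 800 × 2 × 0.625 GHz = 1000 GFLOPS and the HD 4870 delivers 800 × 2 × 0.750 GHz = 1200 GFLOPS, matching the table entries.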
3. Overview of GPGPUs (6)

Price relations (as of 10/2008)

Nvidia

GTX260 ~ 300 $
GTX280 ~ 600 $

AMD/ATI

HD4850 ~ 200 $
HD4870 na
4. Overview of data parallel accelerators
4. Overview of data parallel accelerators (1)

Data parallel accelerators

Implementation alternatives of data parallel accelerators

On card On-die
implementation integration

Recent Future
implementations implementations

E.g. GPU cards


Data-parallel Intel’s Heavendahl
accelerator cards

AMD’s Torrenza AMD’s Fusion


integration technology integration technology

Trend

Figure: Implementation alternatives of dedicated data parallel accelerators


4. Overview of data parallel accelerators (2)

On-card accelerators

Card implementations
• Single cards fitting into a free PCI-E x16 slot of the host computer.
• E.g. Nvidia Tesla C870, Nvidia Tesla C1060, AMD FireStream 9170, AMD FireStream 9250

Desktop implementations
• Usually dual cards mounted into a box, connected to an adapter card that is inserted into a free PCI-E x16 slot of the host PC through a cable.
• E.g. Nvidia Tesla D870

1U server implementations
• Usually 4 cards mounted into a 1U server rack, connected to two adapter cards that are inserted into two free PCI-E x16 slots of a server through two switches and two cables.
• E.g. Nvidia Tesla S870, Nvidia Tesla S1070

Figure:Implementation alternatives of on-card accelerators


4. Overview of data parallel accelerators (3)

FB: Frame Buffer

Figure: Main functional units of Nvidia’s Tesla C870 card [2]


4. Overview of data parallel accelerators (4)

Figure: Nvida’s Tesla C870 and


AMD’s FireStream 9170 cards [2], [3]
4. Overview of data parallel accelerators (5)

Figure: Tesla D870 desktop implementation [4]


4. Overview of data parallel accelerators (6)

Figure: Nvidia’s Tesla D870 desktop implementation [4]


4. Overview of data parallel accelerators (7)

Figure: PCI-E x16 host adapter card of Nvidia’s Tesla D870 desktop [4]
4. Overview of data parallel accelerators (8)

Figure: Concept of Nvidia’s Tesla S870 1U rack server [5]


4. Overview of data parallel accelerators (9)

Figure: Internal layout of Nvidia’s Tesla S870 1U rack [6]


4. Overview of data parallel accelerators (10)

Figure: Connection cable between Nvidia’s Tesla S870 1U rack and the adapter cards
inserted into PCI-E x16 slots of the host server [6]
4. Overview of data parallel accelerators (11)

NVidia Tesla
6/07 6/08

Card C870 C1060


G80-based GT200-based
1.5 GB GDDR3 4 GB GDDR3
0.519 TFLOPS 0.936 TFLOPS

6/07
Desktop D870
G80-based
2*C870 incl.
3 GB GDDR3
1.037 TFLOPS

6/07 6/08

IU Server S870 S1070


G80-based GT200-based
4*C870 incl. 4*C1060
6 GB GDDR3 16 GB GDDR3
2.074 TFLOPS 3.744 TFLOPS

6/07 11/07 6/08

CUDA Version 1.0 Version 1.01 Version 2.0

2007 2008

Figure: Overview of Nvidia’s Tesla family


4. Overview of data parallel accelerators (12)

AMD FireStream
11/07 6/08

Card 9170 9170


RV670-based Shipped
2 GB GDDR3
500 GFLOPS FP32
~200 GFLOPS FP64
6/08 10/08
9250 9250
RV770-based Shipped
1 GB GDDR3
1 TFLOPS FP32
~300 GFLOPS FP64

12/07
Stream Computing
SDK Version 1.0
Brook+
ACM/AMD Core Math Library
CAL (Compute Abstraction Layer)

Rapid Mind

2007 2008

Figure: Overview of AMD/ATI’s FireStream family


4. Overview of data parallel accelerators (13)

Nvidia Tesla cards AMD FireStream cards


Core type C870 C1060 9170 9250

Based on G80 GT200 RV670 RV770

Introduction 6/07 6/08 11/07 6/08

Core
Core frequency 600 MHz 602 MHz 800 MHz 625 MHz

ALU frequency 1350 MHz 1296 MHz 800 MHz 625 MHz

No. of ALUs 128 240 320 800

Peak FP32 performance 518 GFLOPS 933 GFLOPS 512 GFLOPS 1 TFLOPS

Peak FP64 performance – – ~200 GFLOPS ~250 GFLOPS

Memory

Mem. transfer rate (eff) 1600 Mb/s 1600 Mb/s 1600 Mb/s 1986 Mb/s

Mem. interface 384-bit 512-bit 256-bit 256-bit

Mem. bandwidth 76.8 GB/s 102 GB/s 51.2 GB/s 63.5 GB/s

Mem. size 1.5 GB 4 GB 2 GB 1 GB

Mem. type GDDR3 GDDR3 GDDR3 GDDR3

System

Interface PCI-E x16 PCI-E 2.0x16 PCI-E 2.0x16 PCI-E 2.0x16

Power (max) 171 W 200 W 150 W 150 W

Table: Main features of Nvidia’s and AMD/ATI’s data parallel accelerator cards
4. Overview of data parallel accelerators (14)

Price relations (as of 10/2008)

Nvidia Tesla

C870 ~ 1500 $ C1060 ~ 1600 $


D870 ~ 5000 $
S870 ~ 7500 $ S1070 ~ 8000 $

AMD/ATI FireStream

9170 ~ 800 $ 9250 ~ 800 $


5. Microarchitecture of GPGPUs (examples)

5.1 AMD/ATI RV870 (Cypress)


5.2 Nvidia Fermi
5.3 Intel’s Larrabee
5.1 AMD/ATI RV870
5.1 AMD/ATI RV870 (1)

AMD/ATI RV870 (Cypress) Radeon 5870 graphics card

Introduction: Sept. 22 2009


Availability: now

Performance figures:
SP FP performance: 2.72 TFLOPS
DP FP performance: 544 GFLOPS (1/5 of SP FP performance)
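(These figures follow from the card's parameters given below: 1600 stream processing units × 2 FP32 operations/cycle × 0.85 GHz = 2720 GFLOPS = 2.72 TFLOPS; one fifth of this is 544 GFLOPS.)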

OpenCL 1.0 compliant


5.1 AMD/ATI RV870 (2)

Radeon series/5800
ATI Radeon HD ATI Radeon HD ATI Radeon
4870 5850 HD 5870
Manufacturing Process 55-nm 40-nm 40-nm
# of Transistors 956 million 2.15 billion 2.15 billion
Core Clock Speed 750MHz 725MHz 850MHz
# of Stream Processors 800 1440 1600
Compute Performance 1.2 TFLOPS 2.09 TFLOPS 2.72 TFLOPS
Memory Type GDDR5 GDDR5 GDDR5
Memory Clock 900MHz 1000MHz 1200MHz
Memory Data Rate 3.6 Gbps 4.0 Gbps 4.8 Gbps
Memory Bandwidth 115.2 GB/sec 128 GB/sec 153.6 GB/sec
Max Board Power 160W 170W 188W
Idle Board Power 90W 27W 27W

Figure: Radeon Series/5800 [42]


5.1 AMD/ATI RV870 (3)

Radeon 4800 series/5800 series comparison


ATI Radeon HD ATI Radeon HD ATI Radeon
4870 5850 HD 5870
Manufacturing Process 55-nm 40-nm 40-nm
# of Transistors 956 million 2.15 billion 2.15 billion
Core Clock Speed 750MHz 725MHz 850MHz
# of Stream Processors 800 1440 1600
Compute Performance 1.2 TFLOPS 2.09 TFLOPS 2.72 TFLOPS
Memory Type GDDR5 GDDR5 GDDR5
Memory Clock 900MHz 1000MHz 1200MHz
Memory Data Rate 3.6 Gbps 4.0 Gbps 4.8 Gbps
Memory Bandwidth 115.2 GB/sec 128 GB/sec 153.6 GB/sec
Max Board Power 160W 170W 188W
Idle Board Power 90W 27W 27W

Figure: Radeon Series/5800 [42]


5.1 AMD/ATI RV870 (4)

Architecture overview

20 cores

16 ALUs/core

5 EUs/ALU

1600 EUs (= 20 cores × 16 ALUs/core × 5 EUs/ALU)
(Stream processing units)

Figure: Architecture
overview [42]

8x32 = 256 bit


GDDR5
153.6 GB/s
5.1 AMD/ATI RV870 (5)

The 5870 card

Figure: The 5870 card [41]


5.2 Nvidia Fermi
5.2 Nvidia Fermi (1)

NVidia’s Fermi

Introduced: Sept. 30, 2009, at Nvidia's GPU Technology Conference. Availability: Q1 2010.
5.2 Nvidia Fermi (2)

Fermi’s overall structure

NVidia: 16 cores
(Streaming Multiprocessors)

Each core: 32 ALUs

6x Dual Channel GDDR5


(384 bit)

Figure: Fermi’s overall structure [40]


5.2 Nvidia Fermi (3)

Layout of a core (SM)

1 SM includes 32 ALUs
(called "CUDA cores" by Nvidia)

Cuda core
(ALU)

Figure: Layout of a core [40]


5.2 Nvidia Fermi (4)

A single ALU (“Cuda core”)

SP FP: 32-bit, FX: 32-bit

DP FP
• IEEE 754-2008-compliant
• Needs 2 clock cycles

DP FP performance: ½ of SP FP performance!!

Figure: A single ALU [40]


5.2 Nvidia Fermi (5)

Fermi’s system architecture

Figure: Fermi’s system architecture [39]


5.2 Nvidia Fermi (6)

Contrasting Fermi and GT 200

Figure: Contrasting Fermi and GT 200 [39]


5.2 Nvidia Fermi (7)

The execution of programs utilizing GP/GPUs

Host Device

Each kernel invocation


executes a grid of
kernel0<<<>>>() thread blocks (Block(i,j))

kernel1<<<>>>()

Figure: Hierarchy of
threads [25]
5.2 Nvidia Fermi (8)

Global scheduling in Fermi

Figure: Global scheduling in Fermi [39]


5.2 Nvidia Fermi (9)

Microarchitecture of a Fermi core


5.2 Nvidia Fermi (10)

Principle of operation of the G80/G92/Fermi GPGPUs


5.2 Nvidia Fermi (11)

Principle of operation of the G80/G92 GPGPUs

The key point of operation is work scheduling

Work scheduling

• Scheduling thread blocks for execution


• Segmenting thread blocks into warps
• Scheduling warps for execution
5.2 Nvidia Fermi (12)

CUDA Thread Block


Thread scheduling in Nvidia's GPGPUs

• All threads in a block execute the same kernel program (SPMD)
• Programmer declares the block:
  • block size: 1 to 512 concurrent threads
  • block shape: 1D, 2D, or 3D
  • block dimensions in threads
• Threads have thread id numbers within the block
• The thread program uses the thread id to select work and address shared data
• Threads in the same block share data and synchronize while doing their share of the work
• Threads in different blocks cannot cooperate
• Each block can execute in any order relative to other blocks!

Figure: CUDA Thread Block (thread Id #: 0 1 2 3 … m; thread program). Courtesy: John Nickolls, NVIDIA
llinois.edu/ece498/al/lectures/lecture4%20cuda%20threads%20part2%20spring%202009.ppt#316,2,
5.2 Nvidia Fermi (13)

Scheduling thread blocks for execution

• Up to 8 thread blocks can be assigned to an SM for execution.
• A device may run thread blocks sequentially or even in parallel, if it has enough resources for this, or usually by a combination of both.
• A TPC (Thread Processing Cluster, also called Texture Processing Cluster) has
  • 2 SMs in the G80/G92,
  • 3 SMs in the GT200.

Figure: Assigning thread blocks to streaming multiprocessors (SM) for execution [12]
5.2 Nvidia Fermi (14)

Segmenting thread blocks into warps

• Threads are scheduled for execution in groups of 32 threads, called warps.
• For scheduling, each thread block is subdivided into warps.
• At any point of time up to 24 warps can be maintained by the scheduler.

Remark

The number of threads constituting a warp is an implementation decision and not part of the CUDA programming model.

Figure: Segmenting thread blocks into warps [12]


5.2 Nvidia Fermi (15)

Scheduling warps for execution

• The warp scheduler is a zero-overhead scheduler.
• Only those warps are eligible for execution whose next instruction has all operands available.
• Eligible warps are scheduled
  • coarse grained (not indicated in the figure),
  • priority based.
• All threads in a warp execute the same instruction when selected.
• 4 clock cycles are needed to dispatch the same instruction to all threads of a warp on the G80
  (a warp of 32 threads is issued over the 8 streaming processors of an SM, i.e. 32/8 = 4 shader cycles).

Figure: Scheduling warps for execution [12]
(the figure shows the SM's multithreaded warp scheduler issuing over time, e.g., warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96)


5.3 Intel’s Larrabee
5.3 Intel’s Larrabee (1)

Larrabee

Part of Intel’s Tera-Scale Initiative.

• Objectives:
High end graphics processing, HPC
Not a single product but a base architecture for a number of different products.

• Brief history:
Project started ~ 2005
First unofficial public presentation: 03/2006 (withdrawn)
First brief public presentation 09/07 (Otellini) [29]
First official public presentations: in 2008 (e.g. at SIGGRAPH [27])
Due in ~ 2009
• Performance (targeted):
2 TFlops
5.3 Intel’s Larrabee (2)

NI: New Instructions

Figure: Positioning of Larrabee


in Intel’s product portfolio [28]
5.3 Intel's Larrabee (3)

Figure: First public presentation of Larrabee at IDF Fall 2007 [29]


5.3 Intel’s Larrabee (4)

Basic architecture

Figure: Block diagram of the Larrabee [30]

• Cores: in-order x86 IA cores augmented with new instructions


• L2 cache: fully coherent
• Ring bus: 1024 bits wide
5.3 Intel’s Larrabee (5)

Figure: Block diagram of Larrabee’s cores [31]


5.3 Intel’s Larrabee (6)

Larrabee's microarchitecture [27]

Derived from the Pentium's in-order design


5.3 Intel’s Larrabee (7)

Main extensions

• 64-bit instructions
• 4-way multithreaded
(with 4 register sets)
• addition of a 16-wide
(16x32-bit) VU
• increased L1 caches
(32 KB vs 8 KB)
• access to its 256 KB
local subset of a
coherent L2 cache
• ring network to access
the coherent L2 $
and allow interproc.
communication.

Figure: The ancestor of


Larrabee’s cores [28]
5.3 Intel’s Larrabee (8)

New instructions allow explicit cache control

• to prefetch data into the L1 and L2 caches


• to control the eviction of cache lines by reducing their priority;
thus the L2 cache can be used as a scratchpad memory while remaining fully
coherent.
5.3 Intel’s Larrabee (9)

The Scalar Unit


• supports the full ISA of the Pentium
(it can run existing code including OS kernels and applications)
• provides new instructions, e.g. for
• bit count
• bit scan (it finds the next bit set within a register).
5.3 Intel’s Larrabee (10)

The Vector Unit

Mask registers
• have one bit per vector lane, to control which bits of a vector register or memory data are read or written and which remain untouched.

VU scatter-gather instructions
• load a VU vector register from 16 non-contiguous data locations from anywhere in the on-die L1 cache without penalty, or store a VU register similarly.

Numeric conversions
• 8-bit, 16-bit integer and 16-bit FP data can be read from or written into the L1 cache, with conversion to 32-bit integers without penalty;
• the L1 data cache thus becomes, in effect, an extension of the register file.

Figure: Block diagram of the Vector Unit [31]


5.3 Intel’s Larrabee (11)

ALUs

• ALUs execute integer, SP and DP FP instructions


• Multiply-add instructions are available.

Figure: Layout of the 16-wide vector ALU [31]


5.3 Intel’s Larrabee (12)

Task scheduling

performed entirely by software, rather than by hardware as in Nvidia's or AMD/ATI's GPGPUs.
5.3 Intel’s Larrabee (13)

SP FP performance

16 ALUs/core × 2 operations/cycle = 32 operations/core per cycle

At present no data are available on the clock frequency or the number of cores of Larrabee.

Assuming a clock frequency of 2 GHz and 32 cores:

SP FP performance: 32 operations/core × 32 cores × 2 GHz ≈ 2 TFLOPS
5.3 Intel’s Larrabee (14)

Figure: Larrabee’s software stack (Source Intel)

Larrabee’s Native C/C++ compiler allows many available apps to be recompiled and run
correctly with no modifications.
6. References (1)

6. References

[1]: Torricelli F., AMD in HPC, HPC07,


http://www.altairhyperworks.co.uk/html/en-GB/keynote2/Torricelli_AMD.pdf
[2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia

[3] AMD FireStream 9170,


http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html

[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008,
Nvidia,
http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf

[5]: Tesla S870 GPU Computing System, Specification, Nvida,


http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf

[6]: Torres G., Nvidia Tesla Technology, Nov. 2007,


http://www.hardwaresecrets.com/article/495

[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD

[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU,
ASPLOS 2006, June 2008
[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007
http://ati.amd.com/developer/techpapers.html

[10]: Compute Abstraction Layer (CAL) Technology – Intermediate Language (IL),


Version 2.0, Oct. 2008, AMD
6. References (2)

[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,
June 2008, Nvidia
[12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007,
University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/
lectures/lecture7-threading%20hardware.ppt#256,1,ECE 498AL Lectures 7:
Threading Hardware in G80
[13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008,
http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf
[14]: Nvidia G80, Pc Watch, April 16 2007,
http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm
[15]: GeForce 8800GT (G92), PC Watch, Oct. 31 2007,
http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf

[16]: NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008,


http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm
[17]: Shrout R., Nvidia GT200 Revealed – GeForce GTX 280 and GTX 260 Review,
PC Perspective, June 16 2008,
http://www.pcper.com/article.php?aid=577&type=expert&pid=3

[18]: http://en.wikipedia.org/wiki/DirectX
[19]: Dietrich S., "Shader Model 3.0," April 2004, Nvidia,
http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf
[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006,
Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html
6. References (3)

[21]: Patidar S. & al., “Exploiting the Shader Model 4.0 Architecture, Center for
Visual Information Technology, IIIT Hyderabad,
http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf

[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia,
http://www.nvidia.com/page/8800_tech_briefs.html

[23]: Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch,


http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf

[24]: Fatahalian K., “From Shader Code to a Teraflop: How Shader Cores Work,”
Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008,

[25]: Kanter D., “NVIDIA’s GT200: Inside a Parallel Processor,” 09-08-2008

[26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide,


Version 1.1, Nov. 2007, Nvidia

[27]: Seiler L. & al., “Larrabee: A Many-Core x86 Architecture for Visual Computing,”
ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008

[28]: Kogo H., “Larrabee”, PC Watch, Oct. 17, 2008,


http://pc.watch.impress.co.jp/docs/2008/1017/kaigai472.htm

[29]: Shrout R., IDF Fall 2007 Keynote, Sept. 18, 2007, PC Perspective,
http://www.pcper.com/article.php?aid=453
6. References (4)

[30]: Stokes J., "Larrabee: Intel's biggest leap ahead since the Pentium Pro,"
Aug. 04 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-
intels-biggest-leap-ahead-since-the-pentium-pro.html

[31]: Shimpi A. L. & Wilson D., "Intel's Larrabee Architecture Disclosure: A Calculated
First Move," Anandtech, Aug. 4 2008,
http://www.anandtech.com/showdoc.aspx?i=3367&p=2

[32]: Hester P., “Multi_Core and Beyond: Evolving the x86 Architecture,” Hot Chips 19,
Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf

[33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1
http://ati.amd.com/technology/streamcomputing/
Stream_Computing_User_Guide.pdf

[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf. Aug. 2007,
http://www.graphicshardware.org/previous/www_2007/presentations/
doggett-radeon2900-gh07.pdf

[35]: Mantor M., "AMD's Radeon HD 2900," Hot Chips 19, Aug. 2007,
http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf

[36]: Houston M., "Anatomy of AMD's TeraScale Graphics Engine," SIGGRAPH 2008,
http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf

[37]: Mantor M., “Entering the Golden Age of Heterogeneous Computing,” PEEP 2008,
http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf
6. References (5)

[38]: Kogo H., RV770 Overview, PC Watch, July 02 2008,


http://pc.watch.impress.co.jp/docs/2008/0702/kaigai_09.pdf

[39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009,
http://www.realworldtech.com/includes/templates/articles.cfm?
ArticleID=RWT093009110932&mode=print

[40]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed,
Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1

[41]: Wasson S., AMD's Radeon HD 5870 graphics processor,


Tech Report, Sept 23 2009, http://techreport.com/articles.x/17618/1

[42]: Bell B., ATI Radeon HD 5870 Performance Preview ,


Firing Squad, Sept 22 2009, http://www.firingsquad.com/hardware/
ati_radeon_hd_5870_performance_preview/default.asp

5. Microarchitecture and operation

5.1 Nvidia’s GPGPU line


5.2 AMD/ATI’s GPGPU line
5.3 Intel’s Larrabee
5.1 Nvidia’s GPGPU line
5.1 Nvidia’s GPGPU line (1)

Microarchitecture of GPUs

Microarchitecture of GPGPUs

3-level microarchitectures
• Microarchitectures inheriting the structure of previously developed programmable GPUs.
• E.g. Nvidia's and AMD/ATI's GPGPUs.

Two-level microarchitectures
• Dedicated microarchitectures developed to support both graphics and HPC.
• E.g. Intel's Larrabee.

Figure: Alternative layouts of microarchitectures of GPGPUs


5.1 Nvidia’s GPGPU line (2)

Host CPU North Bridge Host memory

Command Processor Unit


Commands
Work Scheduler

CB CB
CB: Core Blocks

CBA
Cores Cores CBA: Core Block Array

1 L1 Cache n L1 Cache

IN: Interconnection
IN
Network

PCI-E x 16 IF
Data

L2 L2

Hub
MC: Memory Controller
1 MC m MC

Display c.
2x32-bit 2x32-bit
Global Memory

Simplified block diagram of recent 3-level GPUs/data-parallel accelerators


(Data parallel accelerators do not include Display controllers)
5.1 Nvidia’s GPGPU line (3)

In these slides                Nvidia                              AMD/ATI

Core (C), SIMT core            Streaming Multiprocessor (SM),      Shader processor,
                               multithreaded processor             thread processor

Core Block (CB)                Texture Processor Cluster (TPC)     SIMD array, SIMD Engine,
                                                                   multiprocessor, SIMD core, SIMD

Core Block Array (CBA)         Streaming Processor Array (SPA)     –

ALU (Arithmetic Logic Unit)    Streaming Processor,                Stream Processing Unit,
                               Thread Processor                    Stream Processor, Scalar ALU

Table: Terminologies used with GPGPUs/Data parallel accelerators


5.1 Nvidia’s GPGPU line (4)

Microarchitecture of Nvidia’s GPGPUs

GPGPUs based on 3-level microarchitectures

Nvidia’s line AMD/ATI’s line

90 nm G80 80 nm R600

Shrink Enhanced Shrink Enhanced


arch. arch.
65 nm G92 G200 55 nm RV670 RV770

Figure: Overview of Nvidia’s and AMD/ATI’s GPGPU lines


5.1 Nvidia’s GPGPU line (5)

G80/G92

Microarchitecture
5.1 Nvidia’s GPGPU line (6)

Figure: Overview
of the G80 [14]
5.1 Nvidia’s GPGPU line (7)

Figure: Overview
of the G92 [15]
5.1 Nvidia’s GPGPU line (8)

Figure: The Core Block of the


G80/G92 [14], [15]
5.1 Nvidia’s GPGPU line (9)

Streaming Processors:
SIMT ALUs

Figure: Block diagram


of G80/G92 cores
[14], [15]
5.1 Nvidia’s GPGPU line (10)

Individual components of the core

SM Register File (RF)
• 8K registers (each 4 bytes wide) deliver 4 operands/clock
• The Load/Store pipe can also read/write the RF

Figure: Register File [12]
(the figure shows the SM datapath: I$ L1, multithreaded instruction buffer, RF, constant cache C$ L1, shared memory, operand select, MAD and SFU units)


5.1 Nvidia’s GPGPU line (11)

Programmer's view of the Register File

• There are 8192 and 16384 registers in each SM in the G80 and the GT200, respectively.
  – This is an implementation decision, not part of CUDA.
• Registers are dynamically partitioned across all thread blocks assigned to the SM.
• Once assigned to a thread block, a register is NOT accessible by threads in other blocks.
• Each thread in the same block only accesses registers assigned to itself.

Figure: The programmer's view of the Register File (e.g. the RF partitioned among 4 thread blocks vs 3 thread blocks) [12]


5.1 Nvidia’s GPGPU line (12)

The Constant Cache

• Immediate address constants
• Indexed address constants
• Constants are stored in DRAM and cached on chip
  – one L1 constant cache per SM
• A constant value can be broadcast to all threads in a warp
  – an extremely efficient way of accessing a value that is common for all threads in a block!

Figure: The constant cache [12]


5.1 Nvidia’s GPGPU line (13)

Shared Memory

• Each SM has 16 KB of Shared Memory
  – organized as 16 banks of 32-bit words
• CUDA uses Shared Memory as shared storage visible to all threads in a thread block
  – read and write access
• Not used explicitly for pixel shader programs

Figure: Shared Memory [12]


5.1 Nvidia’s GPGPU line (14)

A program needs to manage the global, constant and texture memory spaces
visible to kernels through calls to the CUDA runtime.

This includes memory allocation and deallocation as well as invoking data transfers
between the CPU and GPU.
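A minimal host-side sketch of this management (an illustration added here, not taken from the slides; buffer names are arbitrary), using standard CUDA runtime calls:

float *h_A = (float *)malloc(N * sizeof(float));                    // buffer in host memory
float *d_A = NULL;                                                  // buffer in device (global) memory
cudaMalloc((void **)&d_A, N * sizeof(float));                       // allocation on the GPGPU
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);    // CPU -> GPU transfer
// ... kernel launches operating on d_A ...
cudaMemcpy(h_A, d_A, N * sizeof(float), cudaMemcpyDeviceToHost);    // GPU -> CPU transfer
cudaFree(d_A);                                                      // deallocation
free(h_A);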
5.1 Nvidia’s GPGPU line (15)

Figure: Major functional blocks of G80/GT92 ALUs [14], [15]


5.1 Nvidia’s GPGPU line (16)

Barrier synchronization

• used to coordinate memory accesses at synchronization points,
• at synchronization points the execution of the threads is suspended
until all threads reach this point (barrier synchronization),
• synchronization is achieved by calling the void __syncthreads() intrinsic function [11]
(see the sketch below).
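A minimal sketch of its use (an illustration added here, not taken from the slides): all threads of a block first write one element of a shared memory array, and the barrier guarantees that every write has completed before any thread reads an element written by another thread.

__global__ void reverse_in_block(float *d)
{
    __shared__ float s[256];            // per-block shared memory (assumes blockDim.x <= 256)
    int t = threadIdx.x;
    s[t] = d[t];                        // each thread writes one element
    __syncthreads();                    // barrier: wait until all writes are done
    d[t] = s[blockDim.x - 1 - t];       // now it is safe to read elements written by other threads
}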
5.1 Nvidia’s GPGPU line (17)

Principle of operation

Based on Nvidia’s data parallel computing model

Nvidia’s data parallel computing model is specified at different levels of


abstraction

• at the Instruction Set Architecture (ISA) level (not disclosed),
• at the intermediate level (at the level of APIs) (not discussed here),
• at the high-level programming language level, by means of CUDA.
5.1 Nvidia’s GPGPU line (18)

CUDA [11]

• programming language and programming environment that allows


explicit data parallel execution on an attached massively parallel device (GPGPU),
• its underlying principle is to allow the programmer to target portions of the
source code for execution on the GPGPU,
• defined as a set of C-language extensions,

The key element of the language is the notion of kernel


5.1 Nvidia’s GPGPU line (19)

A kernel is specified by

• using the __global__ declaration specifier,


• a number of associated CUDA threads,
• a domain of execution (grid, blocks) using the syntax <<<….>>>

Execution of kernels
when called, a kernel is executed N times in parallel by N associated CUDA threads,
as opposed to only once, as in the case of regular C functions.
5.1 Nvidia’s GPGPU line (20)

Example
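(A sketch of the canonical CUDA vector addition kernel the text below refers to, following the CUDA Programming Guide [11]:)

// Kernel definition
__global__ void VecAdd(const float *A, const float *B, float *C)
{
    int i = threadIdx.x;          // one-dimensional thread index
    C[i] = A[i] + B[i];
}

int main()
{
    // ... allocate and initialize the device vectors A, B, C ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    // ...
}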

The above sample code

• adds two vectors A and B of size N and


• stores the result into vector C
by executing the invoked threads (identified by a one dimensional index i)
in parallel on the attached massively parallel GPGPU, rather than
adding the vectors A and B by executing embedded loops on the conventional CPU.

Remark
The thread index threadIdx is a vector of up to 3 components,
that identifies a one-, two- or three-dimensional thread block.
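A sketch of a kernel using a two-dimensional thread index, following the matrix addition example of the CUDA Programming Guide [11] (N is assumed to be small enough for a single block):

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

// invocation with one block of N x N threads:
//   dim3 threadsPerBlock(N, N);
//   MatAdd<<<1, threadsPerBlock>>>(A, B, C);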
5.1 Nvidia’s GPGPU line (21)

The kernel concept is enhanced by three key abstractions

• the thread concept,


• the memory concept and
• the synchronization concept.
5.1 Nvidia’s GPGPU line (22)

The thread concept

based on a three-level hierarchy of threads (see the sketch after the list):

• grids
• thread blocks
• threads
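A sketch of how this hierarchy appears in a kernel launch (illustrative values only; the kernel and buffer names are placeholders):

int threadsPerBlock = 256;                              // threads per thread block
int blocksPerGrid   = N / threadsPerBlock;              // grid size (N assumed divisible by 256)
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, N);   // grid of thread blocks of threads

// inside the kernel, each thread computes its global index:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;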
5.1 Nvidia’s GPGPU line (23)

The hierarchy of threads


Host Device

Each kernel invocation


is executed as a grid of
kernel0<<<>>>() thread blocks (Block(i,j))

kernel1<<<>>>()

Figure: Hierarchy of
threads [25]
5.1 Nvidia’s GPGPU line (24)

Thread blocks and threads

Thread blocks

• identified by two- or three-dimensional indices,


• equally shaped,
• required to execute independently,
that is they can be scheduled in any order,
• organized into a one- or two dimensional array,
• have a per block shared memory.

Threads of a thread block


• identified by thread IDs
(thread number within a block),
• share data through fast shared memory,
• synchronized to coordinate memory
accesses,
Threads in different thread blocks cannot
communicate or be synchronized.

Figure: Thread blocks and threads [11]


5.1 Nvidia’s GPGPU line (25)

The memory concept

Threads have

• private registers (R/W access)


• per block shared memory (R/W access)
• per grid global memory (R/W access)
• per block constant memory (R access)
• per TPC texture memory (R access)

Shared memory is organized into banks


(16 banks in version 1)

The global, constant and texture


memory spaces can be read from or
written to by the CPU and are
persistent across kernel launches
by the same application.

Figure: Memory concept [26] (revised)


5.1 Nvidia’s GPGPU line (26)

Mapping of the memory spaces of the programming model


to the memory spaces of the streaming processor

A thread block is scheduled for execution


to a particular multithreaded SM
Streaming Multiprocessor 1 (SM 1) SMs are the fundamental
processing units for CUDA thread blocks

An SM incorporates 8 Execution Units


(designated as Processors in the figure)

Figure: Memory spaces of the SM [7]


5.1 Nvidia’s GPGPU line (27)

The synchronization concept

Barrier synchronization
• used to coordinate memory accesses at synchronization points,
• at synchronization points the execution of the threads is suspended
until all threads reach this point (barrier synchronization),
• synchronization is achieved by calling the void __syncthreads() intrinsic function.
5.1 Nvidia’s GPGPU line (28)

GT200
5.1 Nvidia’s GPGPU line (29)

Figure: Block diagram of the GT200 [16]


5.1 Nvidia’s GPGPU line (30)

Figure: The Core Block of the


GT200 [16]
5.1 Nvidia’s GPGPU line (31)

Streaming Multi-
processors:
SIMT cores

Figure: Block diagram


of the GT200 cores [16]
5.1 Nvidia’s GPGPU line (32)

Figure: Major functional blocks of GT200 ALUs [16]


5.1 Nvidia’s GPGPU line (33)

Figure: Die shot of the GT 200 [17]


RV770-RV870 comparison

                   ATI Radeon HD 4870      ATI Radeon HD 5870      Difference
Die size           263 mm2                 334 mm2                 1.27x
# of transistors   956 million             2.15 billion            2.25x
# of shaders       800                     1600                    2x
Board power        90W idle, 160W load     27W idle, 188W max      0.3x, 1.17x
Memory pipeline of a core

Figure: Memory pipeline of a Fermi core

Software environment

Figure: A single CUDA core (Source: Nvidia, http://techreport.com/articles.x/17670/2)

Radeon HD 5870: power consumption and price

The single biggest improvement from the last generation, though, is in power consumption. The
5870's peak power draw is rated at 188W, up a bit from the 4870's 160W TDP. But idle power
draw on the 5870 is rated at an impressively low 27W, down precipitously from the 90W rating of
the 4870. Much of the improvement comes from Cypress's ability to put its GDDR5 memory into a
low-power state, something the 4870's first-gen GDDR5 interface couldn't do. Additionally, the
second 5870 board in CrossFire multi-GPU config can go even lower, dropping into an ultra-low
power state just below 20W.
AMD says the plan is for Radeon HD 5870 cards to be available for purchase today at a price of
$379.
Fermi: double precision floating-point performance

And each floating-point unit is now capable of producing IEEE 754-2008-compliant double-precision
FP results in two clock cycles, or half the performance of single-precision math. That's a huge step
up from the GT200's lone DP unit per SM—hence our estimate of a ten-fold increase in DP
performance.
2. Basics of the SIMT execution (13)

Aim of multithreading

Speeding up computations

• by increased utilization of available computing resources when threads stall due to long latency operations
(achieved by suspending stalled threads from execution and allocating free computing resources to runnable threads)

• by increased utilization of available silicon area for performing computations


rather than for implementing sophisticated cache systems,
(achieved by hiding memory access latencies through multithreading)