
L11-1

6.5930/1
Hardware Architectures for Deep Learning

Accelerator Architecture
(continued)
March 11, 2024

Joel Emer and Vivienne Sze

Massachusetts Institute of Technology


Electrical Engineering & Computer Science
Sze and Emer
L11-2

Operation Sequencing

March 11, 2024 Sze and Emer


L11-3

Accelerator Taxonomy

Accelerator Architecture
• Temporally Programmed: CPU, GPU

March 11, 2024 Sze and Emer


L11-4

Multiprocessor

[Figure: multicore processor with per-core L2 caches, shared L3 caches, and DRAM main memory]

Inter-processing element communication is through the cache hierarchy.

March 11, 2024 Sze and Emer


L11-5

Highly-Parallel Compute Paradigms

Temporal Architecture (SIMD/SIMT):
[Figure: memory hierarchy and register file feeding a regular array of ALUs under centralized control]

Spatial Architecture (Dataflow Processing):
[Figure: memory hierarchy feeding a grid of directly interconnected ALUs]

March 11, 2024 Sze and Emer


L11-6

Spatial Architecture for DNN

[Figure: DRAM feeding a Global Buffer (100 – 500 kB), which feeds a 2-D array of ALUs; each Processing Element (PE) contains an ALU, a Reg File (0.5 – 1.0 kB), and control]

Local Memory Hierarchy
• Global Buffer
• Direct inter-PE network
• PE-local memory (RF)

March 11, 2024 Sze and Emer


L11-9

Accelerator Taxonomy

Accelerator Architecture
• Temporally Programmed: CPU, GPU
• Spatially Programmed: FPGA, RAW, TRIPS, AsAP, WaveScalar, PicoChip, DySER, Triggered Instructions, TTA

March 11, 2024 Sze and Emer


L11-10

Accelerator Taxonomy

Accelerator Architecture
• Temporally Programmed: CPU, GPU
• Spatially Programmed
  – Fine (logic) Grained: FPGA

March 11, 2024 Sze and Emer


L11-11

Field Programmable Gate Arrays

[Figure: FPGA logic cell: a RAM-backed Look Up Table (LUT) feeding a latch]

A LUT is programmed to implement a small truth table, e.g.:

  AND: 00→0, 01→0, 10→0, 11→1
  OR:  00→0, 01→1, 10→1, 11→1
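
For intuition, a minimal software sketch (Python, purely illustrative, not any vendor's tool flow) of a 2-input LUT: the four configuration bits stored in the LUT's RAM select the output for each input combination.

# A 2-input LUT is just a 4-entry table of configuration bits;
# "programming" the FPGA cell means choosing those bits.
AND_LUT = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR_LUT  = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}

def lut(config, a, b):
    """Return the programmed output for inputs (a, b)."""
    return config[(a, b)]

print(lut(AND_LUT, 1, 1), lut(OR_LUT, 0, 1))  # -> 1 1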

March 11, 2024 Sze and Emer


L11-15

Microsoft Project Catapult


Configurable Cloud (MICRO 2016) for Azure

Accelerate and reduce latency for


• Bing search
• Software defined network
• Encryption and Decryption
March 11, 2024 Sze and Emer
L11-16

Microsoft Brainwave Neural Processor

Source: Microsoft
March 11, 2024 Sze and Emer
L11-17

Heterogeneous Blocks
• Add specific purpose logic on FPGA
– Efficient if used (better area, speed, power),
wasted if not

• Soft fabric
– LUT, flops, addition, subtraction, carry logic
– Convert LUT to memories or shift registers

• Memory block (BRAM)


– Configure word and address size (aspect ratio)
– Combine memory blocks to large blocks
– Significant part of FPGA area
– Dual port memories (FIFO)

• Multipliers/MACs → DSP blocks

• CPUs and processing elements

March 11, 2024 Sze and Emer


L11-18

Accelerator Taxonomy

Accelerator Architecture
• Temporally Programmed: CPU, GPU
• Spatially Programmed
  – Fine (logic) Grained: FPGA
  – Coarse (ALU) Grained: TRIPS, RAW, WaveScalar, AsAP, DySER, PicoChip, TTA, Triggered Instructions

March 11, 2024 Sze and Emer


L11-19

Programmable Accelerators

[Figure: a 2-D array of Processing Elements (PEs)]

Many programmable accelerators look like an array of PEs, but have dramatically different architectures, programming models and capabilities.

March 11, 2024 Sze and Emer


L11-20

Accelerator Taxonomy

Accelerator Architecture
• Temporally Programmed: CPU, GPU
• Spatially Programmed
  – Fine (logic) Grained: FPGA
  – Coarse (ALU) Grained
    • Fixed-operation: TPU, NVDLA

March 11, 2024 Sze and Emer


L11-21

Fixed Operation PEs

• Each PE hard-wired to one operation


• Purely pipelined operation
– no backpressure in pipeline

• Attributes
– High-concurrency
– Regular design, but
– Regular parallelism only!
– Allows for systolic communication

March 11, 2024 Sze and Emer


L11-22

Configurable Systolic Array - WARP

Source: WARP Architecture and Implementation, ISCA 1986

March 11, 2024 Sze and Emer


L11-23

Fixed Operation - Google TPU

Systolic array does 8-bit 256x256 matrix-multiply accumulate


Source: Google
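
To make the data movement concrete, here is a minimal cycle-by-cycle sketch (in Python, with illustrative sizes and names; it is not Google's actual microarchitecture) of a weight-stationary systolic array: each PE holds one weight, activations are injected with a skew and move right, and partial sums flow down and exit at the bottom.

def act_at(A, k, n, t, M):
    # Activation present at PE (k, n) during cycle t, given skewed injection
    # into row k at the array's left edge (zero outside the valid window).
    m = t - k - n
    return A[m][k] if 0 <= m < M else 0

def systolic_matmul(A, W):
    """Compute C = A @ W on a K x N weight-stationary systolic array."""
    M, K, N = len(A), len(W), len(W[0])
    C = [[0] * N for _ in range(M)]
    psum = [[0] * N for _ in range(K)]      # psum produced by PE (k, n) last cycle
    for t in range(M + N + K):              # enough cycles to fill and drain
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                incoming = psum[k - 1][n] if k > 0 else 0   # psum from the PE above
                new_psum[k][n] = incoming + act_at(A, k, n, t, M) * W[k][n]
        for n in range(N):                  # result for row m exits column n this cycle
            m = t - n - (K - 1)
            if 0 <= m < M:
                C[m][n] = new_psum[K - 1][n]
        psum = new_psum
    return C

# Quick check against a plain matrix multiply.
A = [[1, 2], [3, 4], [5, 6]]
W = [[1, 0, 2], [0, 1, 3]]
ref = [[sum(A[m][k] * W[k][n] for k in range(2)) for n in range(3)] for m in range(3)]
assert systolic_matmul(A, W) == ref

With 8-bit inputs and wider accumulators, this is the structure scaled up to 256×256 in the TPU.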
March 11, 2024 Sze and Emer
L11-24

Accelerator Taxonomy

Accelerator Architecture
• Temporally Programmed: CPU, GPU
• Spatially Programmed
  – Fine (logic) Grained: FPGA
  – Coarse (ALU) Grained
    • Fixed-operation: TPU, NVDLA

Configured-operation
WARP
DySER
TRIPS
WaveScalar
TTA
March 11, 2024 Sze and Emer
L11-25

Single Configured Operation - DySER

Source: Dynamically Specialized Datapaths for Energy Efficient Computing, HPCA 2011

March 11, 2024 Sze and Emer


L11-26

Accelerator Taxonomy

Accelerator Architecture
• Temporally Programmed: CPU, GPU
• Spatially Programmed
  – Fine (logic) Grained: FPGA
  – Coarse (ALU) Grained
    • Fixed-operation: TPU, NVDLA
    • Configured-operation: WARP, DySER, TRIPS, WaveScalar, TTA
    • PC-based: Wave, RAW, AsAP, PicoChip
March 11, 2024 Sze and Emer
L11-27

PC-based Control – Wave Computing

Source: Wave Computing, Hot Chips ‘17

March 11, 2024 Sze and Emer


L11-28

Accelerator Taxonomy

Accelerator Architecture
• Temporally Programmed: CPU, GPU
• Spatially Programmed
  – Fine (logic) Grained: FPGA
  – Coarse (ALU) Grained
    • Fixed-operation: TPU, NVDLA
    • Configured-operation: WARP, DySER, TRIPS, WaveScalar, TTA
    • PC-based: Wave, RAW, AsAP, PicoChip
March 11, 2024 Sze and Emer
L11-29

Accelerator Taxonomy

Accelerator Architecture
• Temporally Programmed: CPU, GPU
• Spatially Programmed
  – Fine (logic) Grained: FPGA
  – Coarse (ALU) Grained
    • Fixed-operation: TPU, NVDLA
    • Configured-operation: WARP, DySER, TRIPS, WaveScalar, TTA
    • PC-based: Wave, RAW, AsAP, PicoChip
    • Triggered operations: Triggered Instructions
March 11, 2024 Sze and Emer
L11-30

Guarded Actions

reg A; reg B; reg C;

rule X (A > 0 && B != C)
{
  A <= B + 1;
  B <= B - 1;
  C <= B * A;
}

rule Y (…) {…}
rule Z (…) {…}

• Program consists of rules that may perform computations and read/write state
• Each rule specifies conditions (guard) under which it is allowed to fire
• Separates description and execution of data (rule body) from control (guards)
• A scheduler is generated (or provided by hardware) that evaluates the guards and schedules rule execution
• Sources of Parallelism
  – Intra-Rule parallelism
  – Inter-Rule parallelism
  – Scheduler overlap with Rule execution
  – Parallel access to state
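
As a software analogy, a minimal Python sketch of the rule/guard/scheduler structure above (illustrative only; it is not Bluespec or any particular tool): each rule pairs a guard over the state with a body whose writes all read the pre-fire state and commit together.

state = {"A": 3, "B": 5, "C": 0}

rules = [
    # rule X (A > 0 && B != C) { A <= B + 1; B <= B - 1; C <= B * A; }
    ("X",
     lambda s: s["A"] > 0 and s["B"] != s["C"],
     lambda s: {"A": s["B"] + 1, "B": s["B"] - 1, "C": s["B"] * s["A"]}),
]

def step(state, rules):
    """Fire one enabled rule: guards are evaluated, then the body's updates
    (computed from the old state) are committed atomically."""
    for name, guard, body in rules:
        if guard(state):
            return name, {**state, **body(state)}
    return None, state   # no rule enabled this step

for _ in range(3):
    fired, state = step(state, rules)
    print(fired, state)

The scheduler here is a trivial "fire the first enabled rule"; the slide's point is that hardware can evaluate all guards in parallel and exploit intra- and inter-rule parallelism.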

March 11, 2024 Sze and Emer


L11-31

Triggered Instructions (TI)

• Restrict guarded actions down to an efficient ISA core:

doPass when (p_did_cmp && !p_cur_is_larger)
  %out0.data = %in0.first;
  %in0.deq;
  p_did_cmp = false;

Trigger (When can this happen?): read any number of 1-bit predicates
Operation (What does it do?): read/write data regs, channel control
Predicate Op (What can happen next?): write 1-bit preds (data-dependent)

No program counter or branch instructions

March 11, 2024 Sze and Emer


L11-32

Triggered Instruction Scheduler

[Figure: predicate bits p0–p3 feed each instruction's trigger; triggers that evaluate true ("can trigger") go through priority resolution to select the one that "will trigger", whose operation is sent to the datapath]

• Use combinational logic to evaluate triggers in parallel
• Decide winners if more than one instruction is ready
  – Based on architectural fairness policy
  – Could pick multiple non-conflicting instructions to issue (superscalar)
• Note: no wires toggle unless status changes
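
A minimal sketch (Python, illustrative; not the exact model from the Triggered Instructions work) of the "can trigger" / "will trigger" step: every trigger is evaluated against the current predicate bits, and a priority policy picks one ready instruction to issue.

preds = {"p_did_cmp": True, "p_cur_is_larger": False}

instructions = [
    # (priority, name, trigger over the 1-bit predicates); lower priority wins
    (0, "doPass", lambda p: p["p_did_cmp"] and not p["p_cur_is_larger"]),
    (1, "doCmp",  lambda p: not p["p_did_cmp"]),
]

def resolve(preds, instructions):
    # "can trigger": all instructions whose trigger evaluates true, in parallel.
    ready = [(prio, name) for prio, name, trig in instructions if trig(preds)]
    # "will trigger": one winner chosen by the (static-priority) fairness policy.
    return min(ready)[1] if ready else None

print(resolve(preds, instructions))   # -> doPass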

March 11, 2024 Sze and Emer


L11-38

6.5930/1
Hardware Architectures for Deep Learning

Dataflow for DNN Accelerator


Architectures (Part 1)
March 11, 2024

Joel Emer and Vivienne Sze

Massachusetts Institute of Technology


Electrical Engineering & Computer Science
Sze and Emer
L11-39

Goals of Today’s Lecture

• Impact of data movement and memory hierarchy on energy consumption
• Taxonomy of dataflows for CNNs
  – Output Stationary
  – Weight Stationary
  – Input Stationary

March 11, 2024 Sze and Emer


L11-40

Background Reading
• DNN Accelerators
  – Efficient Processing of Deep Neural Networks
    • Chapter 5, through 5.7.1
    • Chapter 5, Section 5.8

All these books and their online/e-book versions are available through MIT libraries.

March 11, 2024 Sze and Emer


L11-42

Dataflow and Memory


Hierarchy

March 11, 2024 Sze and Emer


L11-43

Spatial Compute Paradigm

Spatial Architecture (Dataflow Processing)
[Figure: memory hierarchy feeding a grid of directly interconnected ALUs]

March 11, 2024 Sze and Emer


L11-44

Memory Access is the Bottleneck

[Figure: each MAC* reads a filter weight, an fmap activation, and a partial sum, and writes an updated partial sum]

* multiply-and-accumulate

March 11, 2024 Sze and Emer


L11-45

Memory Access is the Bottleneck

[Figure: MAC* with all reads and writes going directly to DRAM]

* multiply-and-accumulate

Worst Case: all memory R/W are DRAM accesses

• Example: AlexNet [NeurIPS 2012] has 724M MACs
  → 2896M DRAM accesses required (four accesses per MAC: read the filter weight, fmap activation, and partial sum; write the updated partial sum)

March 11, 2024 Sze and Emer


L11-46

Memory Access is the Bottleneck

[Figure: extra levels of local memory hierarchy (Mem) inserted between DRAM and the ALU]

Under what circumstances will these extra levels help?

Computational intensity > 1, i.e., more than one operation per data element transferred
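
For intuition, a back-of-the-envelope sketch (Python; the layer shape is illustrative, not taken from the slides) of computational intensity for a CONV layer, counting MACs per data element touched:

def conv_intensity(M, C, H, W, R, S, P, Q):
    macs = M * P * Q * C * R * S                  # total MACs in the layer
    data = C * H * W + M * C * R * S + M * P * Q  # iacts + weights + outputs
    return macs / data

# A mid-sized CONV layer: intensity >> 1, so data brought into a local level
# once can feed many MACs before it has to be re-fetched from DRAM.
print(conv_intensity(M=128, C=64, H=56, W=56, R=3, S=3, P=54, Q=54))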

March 11, 2024 Sze and Emer


L11-47

Memory Access is the Bottleneck

[Figure: extra levels of local memory hierarchy (Mem) between DRAM and the ALU]

Opportunities: (1) data reuse, (2) local accumulation

March 11, 2024 Sze and Emer


L11-48

Types of Data Reuse in DNN

Convolutional Reuse (CONV layers only, sliding window)
[Figure: a filter sliding over an input fmap]
Reuse: activations, filter weights

March 11, 2024 Sze and Emer


L11-49

Types of Data Reuse in DNN

Convolutional Reuse (CONV layers only, sliding window)
[Figure: a filter sliding over an input fmap]
Reuse: activations, filter weights

Fmap Reuse (CONV and FC layers)
[Figure: multiple filters applied to the same input fmap]
Reuse: activations

March 11, 2024 Sze and Emer


L11-50

Types of Data Reuse in DNN

Convolutional Reuse (CONV layers only, sliding window)
[Figure: a filter sliding over an input fmap]
Reuse: activations, filter weights

Fmap Reuse (CONV and FC layers)
[Figure: multiple filters applied to the same input fmap]
Reuse: activations

Filter Reuse (CONV and FC layers, batch size > 1)
[Figure: the same filter applied to multiple input fmaps]
Reuse: filter weights
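
These three reuse types can be quantified per data element; a minimal sketch (Python, illustrative shapes, N = batch size) assuming unit stride and ignoring edge effects:

def reuse_factors(N, M, C, R, S, P, Q):
    convolutional = P * Q   # each filter weight is applied at every output position
    fmap          = M       # each input activation is used by all M filters
    filt          = N       # each filter weight is reused across the batch
    return convolutional, fmap, filt

print(reuse_factors(N=4, M=128, C=64, R=3, S=3, P=54, Q=54))  # -> (2916, 128, 4)

Convolutional reuse also applies to activations: each input activation is used at up to R×S window positions.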

March 11, 2024 Sze and Emer


L11-51

Memory Access is the Bottleneck

[Figure: extra levels of local memory hierarchy (Mem) between DRAM and the ALU]

Opportunities: (1) data reuse, (2) local accumulation

1) Can reduce DRAM reads of filter/fmap by up to 500×**
** AlexNet CONV layers

March 11, 2024 Sze and Emer


L11-52

Memory Access is the Bottleneck

[Figure: extra levels of local memory hierarchy (Mem) between DRAM and the ALU]

Opportunities: (1) data reuse, (2) local accumulation

1) Can reduce DRAM reads of filter/fmap by up to 500×
2) Partial sum accumulation does NOT have to access DRAM

March 11, 2024 Sze and Emer


L11-53

Memory Access is the Bottleneck

[Figure: extra levels of local memory hierarchy (Mem) between DRAM and the ALU]

Opportunities: (1) data reuse, (2) local accumulation

1) Can reduce DRAM reads of filter/fmap by up to 500×
2) Partial sum accumulation does NOT have to access DRAM

• Example: DRAM access in AlexNet can be reduced from 2896M to 61M (best case)
March 11, 2024 Sze and Emer
L11-54

Leverage Parallelism for Higher Performance

[Figure: the memory hierarchy (DRAM and local Mem) feeding multiple ALUs operating in parallel]

March 11, 2024 Sze and Emer


L11-55

Leverage Parallelism for Spatial Data Reuse

[Figure: the same memory hierarchy feeding multiple ALUs; a value read once from memory can be passed from ALU to ALU and reused spatially]

March 11, 2024 Sze and Emer


L11-56

Spatial Architecture for DNN

[Figure: DRAM feeding a Global Buffer (100 – 500 kB), which feeds a 2-D array of ALUs; each Processing Element (PE) contains an ALU, a Reg File (0.5 – 1.0 kB), and control]

Local Memory Hierarchy
• Global Buffer
• Direct inter-PE network
• PE-local memory (RF)

March 11, 2024 Sze and Emer


L11-57

Low-Cost Local Data Access

[Figure: DRAM, the Global Buffer, neighboring PEs, and the local RF can all deliver data to the ALU that runs a MAC]

Normalized Energy Cost* (data source → ALU)
  ALU (compute only)                 1× (reference)
  RF (0.5 – 1.0 kB) → ALU            1×
  PE → PE (NoC: 200 – 1000 PEs)      2×
  Buffer (100 – 500 kB) → ALU        6×
  DRAM → ALU                         200×
* measured from a commercial 65nm process
March 11, 2024 Sze and Emer
L11-58

Low-Cost Local Data Access

How to exploit (1) data reuse and (2) local accumulation with limited low-cost local storage?

Normalized Energy Cost* (data source → ALU)
  ALU (compute only)                 1× (reference)
  RF (0.5 – 1.0 kB) → ALU            1×
  PE → PE (NoC: 200 – 1000 PEs)      2×
  Buffer (100 – 500 kB) → ALU        6×
  DRAM → ALU                         200×
* measured from a commercial 65nm process
March 11, 2024 Sze and Emer
L11-59

Low-Cost Local Data Access

How to exploit (1) data reuse and (2) local accumulation with limited low-cost local storage?

specialized processing dataflow required!

Normalized Energy Cost* (data source → ALU)
  ALU (compute only)                 1× (reference)
  RF (0.5 – 1.0 kB) → ALU            1×
  PE → PE (NoC: 200 – 1000 PEs)      2×
  Buffer (100 – 500 kB) → ALU        6×
  DRAM → ALU                         200×
* measured from a commercial 65nm process
* measured from a commercial 65nm process
March 11, 2024 Sze and Emer
L11-60

How to Map the Dataflow?

[Figure: a CNN convolution (iacts, weights, partial sums) being mapped onto a spatial architecture with a memory hierarchy and an array of ALUs]

Goal: Increase reuse of input data (input activations and weights) and local accumulation of partial sums

March 11, 2024 Sze and Emer
L11-62

Dataflow Taxonomy

• Output Stationary (OS)


• Weight Stationary (WS)
• Input Stationary (IS)

[Chen et al., ISCA 2016]


March 11, 2024 Sze and Emer
L11-63

Output Stationary (OS)

[Figure: Global Buffer streaming activations and weights to a row of PEs (P0–P7); each PE keeps its psum locally]

• Minimize partial sum R/W energy consumption
  − maximize local accumulation

• Broadcast/multicast filter weights and reuse activations spatially across the PE array

March 11, 2024 Sze and Emer


L11-64

OS Example: ShiDianNao

[Figure: top-level architecture and PE architecture, showing weights, activations, and psums]

• Inputs streamed through array
• Weights broadcast
• Partial sums accumulated in PE and streamed out

[Du et al., ISCA 2015]
March 11, 2024 Sze and Emer
L11-65

OS Example: KU Leuven

[Figure: PE array with activations and weights streamed in]

[Moons et al., VLSI 2016, ISSCC 2017]


March 11, 2024 Sze and Emer
L11-66

1-D Convolution Einsum

$O_q = I_{q+s} \times F_s$

The operational definition of an Einsum says to traverse all valid values of "q" and "s"… but in what order?

Traversal order (fastest to slowest): S, Q

Which "for" loop is outermost? Q

Sze and Emer


L11-67

1-D Convolution

[Figure: Weights (size S) * Inputs (size W) = Outputs (size Q = W − ceil(S/2)†)]

int i[W]; # Input activations
int f[S]; # Filter weights
int o[Q]; # Output activations

for q in [0, Q):
  for s in [0, S):
    o[q] += i[q+s]*f[s]

What dataflow is this? Output stationary

Is it easy to tell the dataflow from the loop nest? Yes, from the outermost loop index

† Assuming: ‘valid’ style convolution
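
A runnable Python version of the loop nest above (a minimal sketch with illustrative sizes; the output size is computed as W − S + 1 for a stride-1 'valid' convolution):

W, S = 8, 3
Q = W - S + 1                      # 'valid' convolution, stride 1

i = list(range(W))                 # input activations
f = [1, 0, -1]                     # filter weights
o = [0] * Q                        # output activations

for q in range(Q):                 # outputs in the outermost loop -> output stationary
    for s in range(S):
        o[q] += i[q + s] * f[s]

print(o)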
March 11, 2024 Sze and Emer
L11-68

Output Stationary - Movie

March 11, 2024 Sze and Emer


L11-69

Output Stationary – Spacetime View

March 11, 2024 Sze and Emer


L11-70

CONV-layer Einsum

$O_{m,p,q} = I_{c,p+r,q+s} \times F_{m,c,r,s}$

Traversal order (fastest to slowest): S, R, Q, P

Parallel Ranks: C, M

Can you write the loop nest? I hope so

March 11, 2024 Sze and Emer


L11-71

CONV Layer OS Loop Nest


int i[C,H,W];   # Input activations
int f[M,C,R,S]; # Filter weights
int o[M,P,Q];   # Output activations

for p in [0, P):
  for q in [0, Q):
    for r in [0, R):
      for s in [0, S):
        parallel-for c in [0, C):
          parallel-for m in [0, M):
            o[m,p,q] += i[c,p+r,q+s]*f[m,c,r,s]
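
A runnable Python version of this loop nest (a sketch: the parallel-for ranks c and m are written as ordinary loops here, whereas on the spatial architecture they would be unrolled across the PE array; sizes are illustrative, matching the small example on the next slide):

import random

C, M, R, S, H, W = 3, 8, 2, 2, 3, 3
P, Q = H - R + 1, W - S + 1        # 'valid' convolution, stride 1

i = [[[random.randint(0, 3) for _ in range(W)] for _ in range(H)] for _ in range(C)]
f = [[[[random.randint(-1, 1) for _ in range(S)] for _ in range(R)] for _ in range(C)] for _ in range(M)]
o = [[[0] * Q for _ in range(P)] for _ in range(M)]

for p in range(P):
    for q in range(Q):
        for r in range(R):
            for s in range(S):
                for c in range(C):        # parallel-for across PEs in hardware
                    for m in range(M):    # parallel-for across PEs in hardware
                        o[m][p][q] += i[c][p + r][q + s] * f[m][c][r][s]

print(o[0])                               # psums for output channel 0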

March 11, 2024 Sze and Emer


L11-72

CONV Layer OS Dataflow

[Figure: M filters (C channels, R×S each) applied to an input fmap (C×H×W) to produce an output fmap (M×P×Q); here M=8, C=3, R=2, S=2, H=3, W=3, P=2, Q=2. The filter overlay marks the current window; shading marks incomplete partial sums]
March 11, 2024 Sze and Emer
L11-73 – L11-81

CONV Layer OS Dataflow (animation)

Cycle through the input fmap and weights while holding the psum of the output fmap in place; once the current output activations are complete, start processing the next output feature activations.

[Animation frames omitted; see the figure on slide L11-72]

March 11, 2024 Sze and Emer


L11-82

Next:
More dataflows

March 11, 2024 Sze and Emer
