6.5930/1
Hardware Architectures for Deep Learning
Accelerator Architecture
(continued)
March 11, 2024
Operation Sequencing
Accelerator Taxonomy
Accelerator Architecture
  – Temporally Programmed: CPU, GPU

[Figure: a multiprocessor (CPU/GPU) with per-core L2 caches, shared L3 slices,
and Memory (DRAM); inter-processing-element communication is through the cache
hierarchy]
Accelerator Taxonomy
Accelerator Architecture
  – Temporally Programmed: CPU, GPU
  – Spatially Programmed: FPGA, RAW, TRIPS, AsAP, WaveScalar, PicoChip,
    DySER, TTA, Triggered Instructions
Accelerator Taxonomy
Accelerator Architecture
  – Temporally Programmed: CPU, GPU
  – Spatially Programmed
     • Fine Grained (logic): FPGA

[Figure: FPGA fabric primitives: LUTs configured by truth tables, latches,
and RAM. Source: Microsoft]
Heterogeneous Blocks
• Add special-purpose logic blocks to the FPGA
  – Efficient when used (better area, speed, power); wasted when not
• Soft fabric
  – LUTs, flops, addition/subtraction carry logic
  – LUTs can be converted to memories or shift registers (see the sketch below)
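To make the LUT idea concrete, here is a minimal Python sketch (an
illustration, not course code): a k-input LUT is just a 2^k-entry truth-table
memory, and reprogramming that memory turns the same hardware into AND, XOR,
or any other k-input function.

    # Minimal LUT sketch (assumption for illustration): the inputs form an
    # address into a 2^k-entry configuration memory; the stored bit is the output.
    class LUT:
        def __init__(self, truth_table):
            self.table = truth_table          # 2^k configuration bits

        def eval(self, *inputs):
            addr = 0
            for bit in inputs:                # inputs form the read address
                addr = (addr << 1) | bit
            return self.table[addr]

    and2 = LUT([0, 0, 0, 1])                  # configured as a 2-input AND
    xor2 = LUT([0, 1, 1, 0])                  # same hardware, now a 2-input XOR
    assert and2.eval(1, 1) == 1 and xor2.eval(1, 1) == 0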
Accelerator Taxonomy
Accelerator Architecture
  – Temporally Programmed: CPU, GPU
  – Spatially Programmed
     • Fine Grained (logic): FPGA
     • Coarse Grained (ALU): TRIPS, RAW, WaveScalar, AsAP, DySER, PicoChip,
       TTA, Triggered Instructions
Programmable Accelerators
[Figure: a 2-D array of processing elements (PEs)]
Many programmable accelerators look like an array of PEs, but they have
dramatically different architectures, programming models, and capabilities.
Accelerator Taxonomy
Accelerator Architecture
  – Temporally Programmed: CPU, GPU
  – Spatially Programmed
     • Fine Grained (logic): FPGA
     • Coarse Grained (ALU)
        – Fixed-operation: TPU, NVDLA
• Attributes of fixed-operation arrays
  – High concurrency
  – Regular design, but supports only regular parallelism
  – Allows for systolic communication (see the sketch below)
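The systolic-communication attribute can be made concrete with a small
simulation. This is a minimal sketch of an output-stationary systolic matrix
multiply (an illustrative assumption; the TPU itself is weight-stationary):
operands hop one PE per cycle, so A[i, k] meets B[k, j] at PE (i, j) at cycle
i + j + k.

    import numpy as np

    def systolic_matmul(A, B):
        # Each PE (i, j) holds C[i, j]; A streams rightward (row i delayed by
        # i cycles), B streams downward (column j delayed by j cycles), so the
        # operand pair (A[i, k], B[k, j]) arrives at PE (i, j) at cycle i + j + k.
        M, K = A.shape
        _, N = B.shape
        C = np.zeros((M, N))
        for t in range(M + N + K - 2):        # cycles for the wavefront to drain
            for i in range(M):
                for j in range(N):
                    k = t - i - j             # which pair (if any) arrives now
                    if 0 <= k < K:
                        C[i, j] += A[i, k] * B[k, j]
        return C

    A = np.arange(6, dtype=float).reshape(2, 3)
    B = np.arange(12, dtype=float).reshape(3, 4)
    assert np.allclose(systolic_matmul(A, B), A @ B)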
Accelerator Taxonomy
Accelerator Architecture
  – Temporally Programmed: CPU, GPU
  – Spatially Programmed
     • Fine Grained (logic): FPGA
     • Coarse Grained (ALU)
        – Fixed-operation: TPU, NVDLA
        – Configured-operation: WARP, DySER, TRIPS, WaveScalar, TTA
Accelerator Taxonomy
Accelerator Architecture
  – Temporally Programmed: CPU, GPU
  – Spatially Programmed
     • Fine Grained (logic): FPGA
     • Coarse Grained (ALU)
        – Fixed-operation: TPU, NVDLA
        – Configured-operation: WARP, DySER, TRIPS, WaveScalar, TTA
        – PC-based: Wave, RAW, AsAP, PicoChip
Guarded Actions
• A program consists of rules that may perform computations and read/write state
• Each rule specifies the conditions (a guard) under which it is allowed to fire
• Separates the description and execution of data (rule bodies) from control (guards)
• A scheduler is generated (or provided by hardware) that evaluates the guards
  and schedules rule execution
• Sources of parallelism
  – Intra-rule parallelism
  – Inter-rule parallelism
  – Scheduler overlap with rule execution
  – Parallel access to state

Example rules:

    reg A; reg B; reg C;

    rule X (A > 0 && B != C)
    {
      A <= B + 1;
      B <= B - 1;
      C <= B * A;
    }

    rule Y (…) {…}
    rule Z (…) {…}

[Figure: a Scheduler block evaluates all guards and selects rules to fire]
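As a software analogy (an illustrative sketch, not the generated hardware
scheduler), guarded actions can be modeled as (guard, body) pairs over shared
state, with a naive scheduler that repeatedly fires an enabled rule until no
guard holds:

    # Software sketch of guarded actions; a real implementation compiles
    # rules to hardware and can fire many non-conflicting rules per cycle.
    state = {"A": 1, "B": 3, "C": 0}

    def guard_x(s):
        return s["A"] > 0 and s["B"] != s["C"]

    def body_x(s):
        # All right-hand sides read the pre-rule state, as in rule X above.
        s["A"], s["B"], s["C"] = s["B"] + 1, s["B"] - 1, s["B"] * s["A"]

    rules = [(guard_x, body_x)]            # rule Y, rule Z would be added here

    # Naive sequential scheduler: evaluate all guards, fire one enabled rule.
    while True:
        enabled = [body for guard, body in rules if guard(state)]
        if not enabled:
            break
        enabled[0](state)
    print(state)                           # terminates once no guard holds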
[Figure: triggered-instruction scheduling. Each slot pairs a trigger with an
operation; predicate bits p0..p3 feed the triggers, and priority resolution
narrows the "can trigger" set to the single "will trigger" operation sent to
the datapath]
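A small sketch (an assumption for illustration) of the "can trigger" to
"will trigger" step: evaluate every trigger against the predicate bits, then
let fixed priority pick the one operation issued to the datapath each cycle.

    # Sketch of trigger priority resolution; the slot predicates and the
    # meaning of the p-bits are made up for illustration.
    slots = [
        (lambda p: p[0] and not p[1], "operation 0"),
        (lambda p: p[1],              "operation 1"),
        (lambda p: p[2] or p[3],      "operation 2"),
    ]

    def will_trigger(p):
        can = [op for trig, op in slots if trig(p)]   # "can trigger" set
        return can[0] if can else None                # fixed priority picks one

    print(will_trigger([False, True, True, False]))   # -> operation 1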
Background Reading
• DNN Accelerators
  – Efficient Processing of Deep Neural Networks
     • Chapter 5, through Section 5.7.1
     • Chapter 5, Section 5.8
All these books and their online/e-book versions are available through
MIT libraries.
Memory Hierarchy
* MAC = multiply-and-accumulate
[Figure: three forms of data reuse in a CONV layer:
 (1) convolutional reuse: filter weights and activations are reused within one
     input fmap as the filter slides;
 (2) fmap reuse: activations are reused across the filters applied to the same
     input fmap;
 (3) filter reuse: filter weights are reused across input fmaps]
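To put rough numbers on these reuse opportunities, here is a worked sketch
with assumed layer shapes (the values are illustrative, not from the slides):
for a CONV layer with batch N, M filters, C channels, R x S filters, and
P x Q outputs, each weight is used about N*P*Q times and each input
activation about M*R*S times.

    # Worked reuse counts (assumed example shapes, for illustration).
    N, M, C, R, S, P, Q = 1, 64, 64, 3, 3, 56, 56

    weight_reuse = N * P * Q    # each filter weight is applied at every output
    act_reuse    = M * R * S    # each iact feeds R*S positions of M filters
    print(weight_reuse, act_reuse)   # 3136 and 576 uses per value (ignoring edges)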
[Figure: accelerator memory hierarchy: DRAM feeding a shared Global Buffer
feeding an array of PEs; to run a MAC*, a PE's ALU must fetch input
activations (iacts), weights, and partial sums, and the question is which
level of the hierarchy each operand should come from]
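The "which level?" question becomes concrete once accesses are counted, since
energy roughly follows access counts and DRAM accesses dominate. A rough
sketch under assumed sizes (the numbers are illustrative, not from the
slides):

    # Illustrative access counting for a tiny 1-D example: R = 2 weights,
    # W = 3 inputs, so Q = 2 outputs and 4 MACs.
    R, W = 2, 3
    Q = W - R + 1
    macs = Q * R                  # 4 MACs
    per_mac = 4                   # iact read + weight read + psum read + write

    dram_only = macs * per_mac    # no buffer: 16 expensive DRAM accesses
    with_buffer_dram = R + W + Q  # each datum crosses DRAM<->buffer once: 7
    with_buffer_gbuf = macs * per_mac   # the 16 accesses now hit the cheaper buffer
    print(dram_only, with_buffer_dram, with_buffer_gbuf)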
Dataflow Taxonomy
[Figure: Output Stationary (OS) dataflow: partial sums (psums) stay resident
in PEs P0..P7 while activations and weights stream from the Global Buffer]
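Loop-nest sketches make the taxonomy concrete (a 1-D illustration with
made-up values): the "stationary" operand is the one the inner loop does not
re-index, so it can sit in a PE register while everything else streams past.

    # Contrasting dataflow loop nests for the same 1-D convolution.
    W_vals, I_vals = [1, 2], [3, 5, 7]
    R, Q = len(W_vals), len(I_vals) - len(W_vals) + 1

    # Output stationary: psum stays resident until complete.
    out_os = []
    for q in range(Q):
        psum = 0                   # psum register: local until finished
        for r in range(R):         # weights and iacts stream past it
            psum += W_vals[r] * I_vals[q + r]
        out_os.append(psum)

    # Weight stationary: a weight stays resident; psums stream past it.
    out_ws = [0] * Q
    for r in range(R):
        w = W_vals[r]              # weight held in the PE register
        for q in range(Q):
            out_ws[q] += w * I_vals[q + r]

    assert out_os == out_ws == [13, 19]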
OS Example: ShiDianNao

OS Example: KU Leuven
[Figure: movement of activations and weights through the PE array]
1-D Convolution
Weights (size S) * Inputs (size W) = Outputs (size Q), where Q = W - ceil(R/2)†
CONV-layer Einsum
O_{m,p,q} = I_{c,p+r,q+s} × F_{m,c,r,s}

Parallel Ranks: C, M
Shapes: C=3, R=2, S=2, H=3, W=3, P=2, Q=2
[Figure legend: filter overlay; incomplete partial sum]
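The same Einsum runs directly as NumPy code. Here is a small check with the
slide's shapes (choosing M = 2, an assumption, since the slide does not give
M), using a sliding-window view so that windows[c, p, q, r, s] = I[c, p+r, q+s].

    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view

    # CONV-layer Einsum check with the slide's shapes (M = 2 is an assumption).
    C, M, R, S, H, W = 3, 2, 2, 2, 3, 3
    P, Q = H - R + 1, W - S + 1                     # 2, 2
    I = np.random.rand(C, H, W)
    F = np.random.rand(M, C, R, S)

    windows = sliding_window_view(I, (R, S), axis=(1, 2))
    O = np.einsum('cpqrs,mcrs->mpq', windows, F)    # reduce over c, r, s

    # Reference: a direct loop nest over the same index space.
    O_ref = np.zeros((M, P, Q))
    for m in range(M):
        for p in range(P):
            for q in range(Q):
                O_ref[m, p, q] = (I[:, p:p+R, q:q+S] * F[m]).sum()
    assert np.allclose(O, O_ref)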
[Animation: the filter steps across the input fmap; each frame highlights the
current filter overlay and the incomplete partial sums accumulating beneath it]
Next:
More dataflows