DNN Accelerators
(EE4690)
Outline
o Accelerator architectures
o Summary
Recap: DNN hardware implementation
o Learned floating-point operations and MAC units
Recap: DNN hardware implementation
o DNN hardware implementation and floating-point operations
Recap: Training vs Inference
Source: Nvidia
Source: Mythic
o Dataflow processing
o A chain of ALUs passes data directly
Key Design Metrics
o Accuracy: Ratio of the number of correct predictions to the total number of predictions
• Describes the quality of results for a given application
Memory Access per MAC
o Memory access is the bottleneck for processing DNN operations
• Each MAC requires three memory reads
• For the filter weight, fmap activation, and partial sum
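For a back-of-the-envelope feel, the sketch below counts worst-case memory accesses for one AlexNet-like conv layer; the layer shape is an illustrative assumption and the three-reads-per-MAC count comes from this slide.

```python
# Minimal sketch (assumed example layer shape; 3 reads per MAC as on the slide).
# Counts worst-case memory accesses if every operand came from off-chip memory.

def conv_macs(out_h, out_w, out_ch, k_h, k_w, in_ch):
    """Number of MACs in one conv layer: every output pixel needs k_h*k_w*in_ch MACs."""
    return out_h * out_w * out_ch * k_h * k_w * in_ch

# Hypothetical AlexNet-like first layer: 55x55x96 outputs, 11x11x3 filters
macs = conv_macs(55, 55, 96, 11, 11, 3)
reads_per_mac = 3            # filter weight, fmap activation, partial sum
worst_case_reads = macs * reads_per_mac
print(f"MACs: {macs:,}, worst-case memory reads: {worst_case_reads:,}")
```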
Memory Access per MAC (Cont…)
o Worst case: all memory accesses go to DRAM
• AlexNet requires ~3000 million DRAM accesses
• DRAM accesses incur significantly higher energy consumption
Memory Access per MAC (Cont…)
o Accelerators minimize DRAM accesses
• By introducing several levels of local memory
• Data reuse using local memory
Memory hierarchy & data movement energy costs
o Fetching data from the RF or neighbouring PEs costs far less energy than DRAM access
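To make the hierarchy argument concrete, the sketch below totals data-movement energy from per-level access counts. The relative costs (RF 1x, neighbouring PE 2x, global buffer 6x, DRAM 200x, normalized to one MAC) are assumed illustrative values, not measured figures.

```python
# Minimal sketch of why the memory hierarchy matters.
# The relative access costs are illustrative assumptions (normalized to one MAC).
RELATIVE_COST = {"RF": 1, "neighbour_PE": 2, "global_buffer": 6, "DRAM": 200}

def movement_energy(access_counts):
    """Total data-movement energy given per-level access counts (in MAC-energy units)."""
    return sum(RELATIVE_COST[level] * n for level, n in access_counts.items())

# Serving most operands from the RF instead of DRAM cuts energy dramatically
all_dram  = movement_energy({"DRAM": 3_000})
mostly_rf = movement_energy({"RF": 2_700, "global_buffer": 250, "DRAM": 50})
print(all_dram, mostly_rf)   # 600000 vs 14200 MAC-energy units
```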
Data Reuse Opportunities
With all data-reuse options exploited, AlexNet can in the best case reduce DRAM accesses from ~3000 million to 61 million
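For intuition, the sketch below estimates the three reuse factors usually distinguished for conv layers (convolutional, fmap, and filter reuse); the layer shape and batch size are assumed examples.

```python
# Minimal sketch of conv-layer data-reuse types (assumed example shapes).
# Each factor is how many MACs one fetched value can serve if kept in local memory.
def reuse_factors(k_h, k_w, out_h, out_w, num_filters, batch):
    return {
        "convolutional reuse (per activation)": k_h * k_w,             # one input pixel hit by many filter positions
        "fmap reuse (per activation)":          num_filters,           # one input fmap reused by every filter
        "filter reuse (per weight)":            batch * out_h * out_w,  # one weight reused for every output pixel and image
    }

print(reuse_factors(k_h=3, k_w=3, out_h=13, out_w=13, num_filters=384, batch=4))
```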
Energy Breakdown across AlexNet Layers
o RF energy dominates in the convolutional layers
o DRAM energy dominates in the fully connected layers
AlexNet Architecture
Source: V. Sze et al., Proceedings of the IEEE, 2017
DNN Accelerator versus General-Purpose Processor
o The mapper translates the DNN shape & size into a hardware-compatible computation mapping for execution, given the dataflow
• Optimizes for energy efficiency
o The compiler translates a program into machine-readable binary code for execution, given the hardware architecture (e.g., x86 or ARM)
• Usually optimizes for performance
Pruning
Methods that produce models that are smaller in size, more memory- and power-efficient, and faster at inference, with minimal loss in accuracy
Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf
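A minimal sketch of one common criterion, magnitude pruning; the threshold-based rule here is an illustrative choice, not the specific method of the cited slides.

```python
# Minimal sketch of magnitude-based weight pruning (illustrative criterion).
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest |w| so that `sparsity` fraction of weights is removed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.random.randn(256, 256).astype(np.float32)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print(f"non-zero weights: {mask.mean():.0%} of original")  # ~10% remain
```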
Pruning (Cont…)
Accuracy drops after pruning; therefore, the network is usually trained and pruned iteratively (train, prune, retrain, prune again) to recover accuracy
Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf
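A sketch of the iterative loop itself; `train_one_epoch` and `evaluate` are hypothetical placeholders for a real training framework, and the sparsity schedule is an assumption.

```python
# Minimal sketch of the iterative train -> prune -> retrain loop (placeholder training).
import numpy as np

def iterative_pruning(weights, train_one_epoch, evaluate, target_sparsity=0.9, steps=5):
    """Gradually increase sparsity and retrain the surviving weights after each pruning step."""
    mask = np.ones_like(weights, dtype=bool)
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps              # ramp sparsity up gradually
        threshold = np.quantile(np.abs(weights), sparsity)
        mask &= np.abs(weights) >= threshold                   # once removed, a weight stays removed
        weights = train_one_epoch(weights * mask) * mask       # retrain, keep pruned weights at zero
        print(f"step {step}: sparsity={1 - mask.mean():.0%}, metric={evaluate(weights):.3f}")
    return weights, mask

# Toy demo with dummy "training" and a dummy metric (fraction of non-zero weights)
w0 = np.random.randn(100, 100).astype(np.float32)
iterative_pruning(w0, train_one_epoch=lambda w: w, evaluate=lambda w: float((w != 0).mean()))
```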
Quantization
o The process of approximating a neural network that uses floating-point numbers by a neural network of low-bit-width numbers
• Reduces both the memory requirements and the computational cost
Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf
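A minimal sketch of one simple scheme, symmetric linear quantization of weights to int8; the scale choice and bit width are illustrative assumptions.

```python
# Minimal sketch of symmetric linear weight quantization to int8.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                      # map the largest |w| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale                  # approximate reconstruction

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())  # small for well-scaled weights
```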
Quantization & Weight Sharing
Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf
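A minimal sketch of weight sharing in the spirit of the cited slides: cluster each layer's weights with k-means and store only a small codebook plus per-weight indices. The cluster count of 16 (i.e., 4-bit indices) is an assumed example.

```python
# Minimal sketch of weight sharing via k-means clustering of weights into a codebook.
import numpy as np
from sklearn.cluster import KMeans

def share_weights(w, n_clusters=16):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(w.reshape(-1, 1))
    codebook = km.cluster_centers_.flatten()              # shared weight values
    indices = km.labels_.reshape(w.shape)                 # 4-bit index per weight
    return codebook, indices

w = np.random.randn(64, 64).astype(np.float32)
codebook, idx = share_weights(w)
w_shared = codebook[idx]                                  # reconstructed (approximate) weights
print("unique weights:", len(codebook), "mean abs error:", np.abs(w - w_shared).mean())
```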
Accelerator Architectures
Neural Processing Unit (NPU)
o The NPU performs the computation of a multilayer perceptron NN
• Used to accelerate general-purpose programs, e.g., Sobel edge detection and the fast Fourier transform (FFT)
o A program segment is accelerated when
• It is frequently executed,
• It is approximable, and
• Its inputs & outputs are well defined
DianNao Architecture
o It consists of the following:
• A computational block: the neural functional unit (NFU)
• An input buffer for input neurons (NBin)
• An output buffer for output neurons (NBout)
• A synapse buffer for synaptic weights (SB)
• A control processor (CP)
o Improves system efficiency by minimizing memory transfer latency
Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356
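A purely software sketch of the idea behind the buffers: process a layer tile by tile so partial sums stay in NBout and never travel back to main memory. The tile sizes and the matrix-vector formulation are simplifying assumptions, not the actual NFU pipeline.

```python
# Minimal sketch of DianNao-style buffered, tiled layer processing.
import numpy as np

def diannao_like_layer(x, W, tile_in=64, tile_out=64):
    """y = W @ x computed tile by tile: NBin holds an input tile, SB a weight tile,
    NBout accumulates partial sums on-chip."""
    n_out, n_in = W.shape
    nbout = np.zeros(n_out, dtype=np.float32)                 # output-neuron buffer
    for o in range(0, n_out, tile_out):
        for i in range(0, n_in, tile_in):
            nbin = x[i:i + tile_in]                            # fetch one input tile into NBin
            sb = W[o:o + tile_out, i:i + tile_in]              # fetch matching weights into SB
            nbout[o:o + tile_out] += sb @ nbin                 # NFU MACs; partial sums stay in NBout
    return nbout

x = np.random.randn(256).astype(np.float32)
W = np.random.randn(128, 256).astype(np.float32)
print(np.allclose(diannao_like_layer(x, W), W @ x, atol=1e-3))  # True
```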
DianNao Series Accelerators (Cont…)
o DaDianNao targets the datacenter scenario and integrates a large on-chip embedded DRAM to avoid long main-memory access times
o ShiDianNao is a DNN accelerator dedicated to CNN applications
• CNN parameters are mapped to SRAM
• Achieves 60x energy efficiency compared with the DianNao architecture
o PuDianNao introduced a software-hardware co-design method to increase on-chip data reuse and PE utilization ratios
• Supports DNNs as well as other ML algorithms, such as k-means and classification trees
Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356
Tensor Processing Units (TPU)
o TPU-1 focuses on inference tasks and is deployed in Google's datacenters
Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356
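A minimal, illustrative model of a TPU-style matrix unit: 8-bit operands with 32-bit accumulation and weights held stationary while activations stream through. The array size and dataflow are simplified assumptions, not the actual microarchitecture.

```python
# Minimal sketch of a weight-stationary 8-bit matrix unit with 32-bit accumulation.
import numpy as np

def matrix_unit(activations_int8, weights_int8):
    """Multiply int8 activations by stationary int8 weights, accumulating in int32."""
    acc = np.zeros((activations_int8.shape[0], weights_int8.shape[1]), dtype=np.int32)
    # One "wave" per input position: each activation column sweeps across the stationary weights
    for k in range(weights_int8.shape[0]):
        acc += activations_int8[:, k:k + 1].astype(np.int32) * weights_int8[k:k + 1, :].astype(np.int32)
    return acc

a = np.random.randint(-128, 127, size=(4, 256), dtype=np.int8)
w = np.random.randint(-128, 127, size=(256, 256), dtype=np.int8)
print(np.array_equal(matrix_unit(a, w), a.astype(np.int32) @ w.astype(np.int32)))  # True
```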
RENO Architecture
o Utilizes a ReRAM crossbar as the computation unit to perform matrix-vector multiplication
• Supports the processing of small datasets, such as MNIST
ReRAM crossbar
Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356
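An idealized sketch of how a ReRAM crossbar computes a matrix-vector product: weights are programmed as cell conductances, inputs are applied as row voltages, and each column current sums their products by Kirchhoff's current law. Wire resistance, device variation, and ADC/DAC quantization are ignored, and the value ranges are assumed for illustration.

```python
# Minimal sketch of ideal analog matrix-vector multiplication on a ReRAM crossbar.
import numpy as np

def crossbar_mvm(conductances, voltages):
    """Column currents of an ideal crossbar: I[j] = sum_i G[i, j] * V[i]."""
    return voltages @ conductances

G = np.random.uniform(0.0, 1.0, size=(128, 10))   # one conductance per weight (assumed range)
V = np.random.uniform(0.0, 0.2, size=128)          # input activations encoded as row voltages
I = crossbar_mvm(G, V)                              # analog dot products, read out per column
print(I.shape)                                      # (10,) one current per output neuron
```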
Neural Network Accelerator Comparison
Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/
Summary
o DNN design metrics
o Accelerator architectures
• Neural Processing Unit (NPU)
• DianNao Series Accelerators
• Tensor Processing Units (TPU)
• RENO Architecture
Thank you
Any questions?