
Hardware Architectures for Artificial Intelligence (EE4690)

Lecture-8: DNN Accelerator

Computer Engineering Lab
Faculty of Electrical Engineering, Mathematics & Computer Science
20th May 2025
Outline
o Recap:
  • DNN hardware implementation
o DNN design metrics
o Memory accesses for DNNs
o Improvement of computing efficiency
  • Data-reuse, pruning, quantization
o Accelerator architectures
o Summary
2
Recap: DNN hardware implementation
o Learned floating-point operation and MAC units
o Explored design aspects of floating-point units for CNNs
  • Design challenges associated with adder, multiplier and MAC units
o Resource utilization of CNNs
  • Area/storage requirements, bandwidth issues and energy consumption
3
Recap: DNN hardware implementation
o DNN hardware implementation and floating-point operation
o Algorithmic optimization techniques
  • Pruning
  • Quantization
  • Sparsity
o Hardware optimization techniques
  • Fixed-point operation
  • Enhanced caching
  • Binary operation
  • Advanced techniques
4
Recap: Training vs Inference
Source: Nvidia

• Inference → forward propagation


• Dataflow of DNN inference is in the form of a chain
• Training → forward propagation + backward propagation
• The chain rule of back-propagation results in many long data dependencies
• Need to improve both training and inference efficiency
5
Recap: Multiply-Accumulate (MAC) Operations
o Neural networks are slow
  • Load/store latency, limited bandwidth, extra precision in most neural network calculations, ...

Source: Mythic

o Matrix multiplications and convolutions dominate over 90% of operations
  • These are the main targets of DNN optimizations, i.e., accelerator designs
  • AlexNet requires ~724 million MACs per inference (see the sketch below)
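To make the MAC count concrete, the following is a minimal sketch (not taken from the slides) that counts multiply-accumulates for a single convolutional layer; the conv1 shapes used below are the standard published AlexNet values, and the function name is ours.

def conv_macs(out_h, out_w, out_c, in_c, k_h, k_w):
    # Each output activation needs in_c * k_h * k_w multiply-accumulates.
    return out_h * out_w * out_c * in_c * k_h * k_w

# AlexNet conv1: 96 filters of 11x11x3 producing a 55x55 output map.
macs_conv1 = conv_macs(55, 55, 96, 3, 11, 11)
print(f"conv1 MACs: {macs_conv1 / 1e6:.0f} M")  # ~105 M; all layers together give the ~724 M quoted above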
6
Recap: Spatial Architecture

o Dataflow processing
o Chain of ALUs passes data directly (sketched below)
o Commonly used for DNNs in ASIC and FPGA-based designs

[Figure: array of Processing Elements (PEs)]
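As a purely conceptual illustration (not a model of any real accelerator), the sketch below mimics a 1-D chain of PEs computing a dot product: each "PE" holds one stationary weight and forwards its partial sum directly to the next PE instead of writing it back to shared memory. All names are ours.

def pe_chain_dot(weights, inputs):
    partial_sum = 0.0                        # partial sum entering the first PE
    for w, x in zip(weights, inputs):        # each iteration plays the role of one PE
        partial_sum = partial_sum + w * x    # MAC, then pass the sum to the neighbouring PE
    return partial_sum                       # result leaves the last PE in the chain

print(pe_chain_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0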

Source: V. Sze, Proceedings of the IEEE, 2017

7
Key Design Metrics
o Accuracy: Ratio of the number of correct predictions to the total number of predictions
  • Describes the quality of the result for a given application

o Energy efficiency: Energy_total = Energy_data + Energy_MAC
  • TOPS/W: Tera (10^12) operations per second per Watt (see the worked example below)
  • Joules/operation

o Throughput: Number of executions in a given time period
  • Processing video at 30 frames/s is necessary for real-time performance

o Latency: Delay between the arrival of the input data and the generation of the result

o Hardware cost: Indicates the monetary cost to build a system
  • More Processing Elements (PEs), memory units, ...
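The following back-of-the-envelope sketch ties these metrics together; the latency and power values are made-up placeholders, and only the ~724 M MACs per AlexNet inference comes from the earlier slide.

total_ops = 2 * 724e6     # ~724 M MACs per inference, counted as 2 ops (multiply + add) per MAC
latency_s = 5e-3          # assumed 5 ms per inference (placeholder)
power_w = 2.0             # assumed 2 W average power (placeholder)

throughput_ops = total_ops / latency_s            # operations per second
inferences_per_s = 1.0 / latency_s                # must exceed 30 for real-time video
energy_per_inference_j = power_w * latency_s      # Joules per inference
joules_per_op = energy_per_inference_j / total_ops
tops_per_watt = (throughput_ops / 1e12) / power_w

print(f"{inferences_per_s:.0f} inf/s, {tops_per_watt:.3f} TOPS/W, {joules_per_op:.2e} J/op")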
8
Purpose of DNN Accelerators
o To improve the throughput and latency of the DNN architecture
  • Reduction in latency of MAC operations as well as data flow
  • More parallel operations by increasing the number of processing elements
  • Pipelining the operations
  • High memory bandwidth

o Within the energy budget (application dependent)
  • Edge devices: high energy efficiency for battery-operated devices
  • Data center: extra cost of cooling to deal with thermal issues

9
Memory Access per MAC
o Memory access is the bottleneck for processing DNN operations
  • Each MAC requires three memory reads
  • For the filter weight, fmap activation, and partial sum

Source: V. Sze, Proceedings of the IEEE, 2017

10
Memory Access per MAC (Cont…)
o Worst case: all memory accesses go to DRAM
  • AlexNet would require ~3000 million DRAM accesses (see the estimate below)
  • DRAM accesses require significantly higher energy than on-chip accesses
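A minimal sketch of the worst-case estimate above, assuming every MAC reads its three operands from DRAM and writes its partial sum back; the per-access energy is an assumed order-of-magnitude figure, not a number from the slides.

macs = 724e6                  # ~724 M MACs per AlexNet inference
reads_per_mac = 3             # filter weight, fmap activation, partial sum
writes_per_mac = 1            # updated partial sum
dram_accesses = macs * (reads_per_mac + writes_per_mac)   # ~2.9 billion, i.e. ~3000 million

energy_per_access_pj = 200    # assumed order of magnitude; DRAM >> on-chip memory access energy
dram_energy_j = dram_accesses * energy_per_access_pj * 1e-12
print(f"{dram_accesses / 1e9:.1f} billion DRAM accesses, ~{dram_energy_j:.2f} J spent on DRAM alone")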

Source: V. Sze, Proceedings of the IEEE, 2017

11
Memory Access per MAC (Cont…)
o Accelerators minimize DRAM accesses
  • By introducing several levels of local memory
  • Data-reuse using local memory

Source: ISCA 2019 tutorials

12
Memory hierarchy & data movement energy costs
o Fetching data from the RF or neighbouring PEs costs little energy

o Accelerators can be designed to support specialized processing dataflows
  • Adapt to DNN shapes & sizes
  • Optimized for energy efficiency

Source: V. Sze, Proceedings of the IEEE, 2017

13
Data Reuse Opportunities

Source: V. Sze, Proceedings of the IEEE, 2017

With all data-reuse options exploited, in the best case AlexNet can reduce DRAM accesses from ~3000 million to ~61 million (a loop-ordering sketch of such reuse follows).
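As an illustration of one reuse pattern (convolutional/weight reuse), the plain-NumPy sketch below uses a loop order in which each filter weight is fetched once and then reused for every output position; it is illustrative only, not a dataflow taken from the cited paper, and all names and shapes are ours.

import numpy as np

def conv2d_weight_reuse(ifmap, weights):
    """ifmap: (C, H, W); weights: (M, C, R, S) -> ofmap: (M, H-R+1, W-S+1)."""
    C, H, W = ifmap.shape
    M, _, R, S = weights.shape
    ofmap = np.zeros((M, H - R + 1, W - S + 1))
    for m in range(M):
        for c in range(C):
            for r in range(R):
                for s in range(S):
                    w = weights[m, c, r, s]            # weight fetched once ...
                    for y in range(H - R + 1):
                        for x in range(W - S + 1):
                            # ... and reused for every output pixel (convolutional reuse)
                            ofmap[m, y, x] += w * ifmap[c, y + r, x + s]
    return ofmap

out = conv2d_weight_reuse(np.random.rand(3, 8, 8), np.random.rand(2, 3, 3, 3))
print(out.shape)  # (2, 6, 6)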
14
Energy Breakdown across AlexNet Layers
o RF energy dominates in the convolutional layers
o DRAM energy dominates in the fully connected layers

[Figure: AlexNet architecture]
Source: V. Sze, Proceedings of the IEEE, 2017

15
DNN Accelerator versus General-Purpose Processor
o The mapper translates the DNN shape & size into a hardware-compatible computation mapping for execution, given the dataflow
  • Optimizes for energy efficiency
o The compiler translates a program into machine-readable binary code for execution on a given hardware architecture (e.g., x86 or ARM)
  • Usually optimizes for performance

(In the figure: DNN accelerator terms in black, general-purpose processor terms in red)

Source: V. Sze, Proceedings of the IEEE, 2017

16
Pruning
Methods that produce models that are smaller, more memory-efficient, more power-efficient, and faster at inference, with minimal loss in accuracy.

Procedure: rank the neurons in the network according to how much they contribute, then remove the low-ranking neurons from the network, resulting in a smaller and faster network (a magnitude-based sketch follows).
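A minimal sketch of one common instantiation of this procedure, magnitude-based pruning, where the "contribution" ranking is simply the absolute weight value; the ranking criterion and sparsity level are assumptions, since the slide does not specify them.

import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with the smallest |w|."""
    threshold = np.quantile(np.abs(weights), sparsity)   # rank weights by magnitude
    mask = np.abs(weights) > threshold                   # keep only the high-ranking weights
    return weights * mask, mask

w = np.random.randn(4, 4)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.75)
print(f"kept {mask.mean():.0%} of the weights")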

Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf

17
Pruning (Cont…)
Accuracy drops after pruning; therefore the network is usually trained, pruned, retrained, and pruned again iteratively to recover the accuracy.

Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf

18
Quantization
o Quantization is the process of approximating a neural network that uses floating-point numbers by a network that uses low-bit-width numbers
  • Reduces both the memory requirement and the computational cost (see the sketch below)

[Figure: example weight values 3.96, 4.02, 4.10, 3.93, 4.1 — values clustered around 4]
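A minimal sketch of uniform (linear) quantization, one simple way to realize this idea; the 8-bit width and single per-tensor scale are assumptions, not prescribed by the slide.

import numpy as np

def quantize_uniform(weights, num_bits=8):
    """Map float weights to signed integers with a single scale factor."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(weights).max() / qmax                  # one scale per tensor (assumption)
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_uniform(w)
w_hat = q.astype(np.float32) * scale                      # dequantize to check the error
print(f"{w.nbytes} -> {q.nbytes} bytes (4x smaller), max abs error {np.abs(w - w_hat).max():.4f}")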

Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf

19
Quantization & Weight Sharing

Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf
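The cited Deep Compression slides realize weight sharing by clustering weights (k-means) into a small codebook of shared values; the NumPy sketch below is an illustrative re-implementation of that idea, with the cluster count and all names chosen by us.

import numpy as np

def share_weights(weights, num_clusters=16, iters=20):
    """Cluster weights with plain k-means; store only a codebook plus per-weight indices."""
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), num_clusters)   # linear codebook init
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(num_clusters):
            if np.any(idx == k):
                centroids[k] = flat[idx == k].mean()
    return centroids[idx].reshape(weights.shape), centroids, idx

w = np.random.randn(64, 64)
w_shared, codebook, idx = share_weights(w)
print(f"unique weight values after sharing: {len(np.unique(w_shared))}")  # at most 16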

20
Accelerator Architectures

21
Neural Processing Unit (NPU)
o The NPU performs the computation of a multilayer perceptron NN
  • Used to accelerate general-purpose programs, e.g., Sobel edge detection and the Fast Fourier Transform (FFT)

o A program segment is accelerated when it is
  • Frequently executed,
  • Approximable,
  • With well-defined inputs & outputs

o Computation tasks are offloaded from the CPU to the NPU at runtime

o The NPU can achieve up to 11x speedup

Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356

22
DianNao Architecture
o It consists of the following:
  • A computational block: the neural functional unit (NFU)
  • An input buffer for input neurons (NBin)
  • An output buffer for output neurons (NBout)
  • A synapse buffer for synaptic weights (SB)
  • A control processor (CP)
o Improves system efficiency by minimizing memory-transfer latency

Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356

23
DianNao Series Accelerators (Cont..)
o DaDianNao targets the datacenter scenario and integrates a large on-chip embedded DRAM to avoid long main-memory access times
o ShiDianNao is a DNN accelerator dedicated to CNN applications
  • CNN parameters are mapped to SRAM
  • Achieves 60x energy efficiency in comparison with the DianNao architecture
o PuDianNao introduced a software-and-hardware co-design method to increase on-chip data reuse and PE utilization ratios
  • Supports DNNs as well as other ML algorithms, such as k-means and classification trees

Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356

24
Tensor Processing Units (TPU)
o TPU-1 focuses on inference tasks, deployed in Google's datacenters

o TPU-2 is for cloud computing
  • Handles both training and inference in the datacenter
  • Introduces vector-processing units

o TPU-3 introduces liquid cooling

o The Edge TPU is for edge computing
  • Targets inference tasks for the Internet of Things (IoT)

Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356

25
RENO Architecture
o Utilizes a ReRAM crossbar as the computation unit to perform matrix-vector multiplication (see the sketch below)
  • Supports the processing of small datasets, like MNIST

[Figure: ReRAM crossbar]
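A minimal, idealized sketch of why a ReRAM crossbar computes a matrix-vector product: the stored cell conductances form the matrix, the applied row voltages form the vector, and the bitline currents are the outputs. Device non-idealities, DACs, and ADCs are ignored, and all names are ours.

import numpy as np

def crossbar_mvm(conductance, voltages):
    """Output currents I = G^T @ V: each column (bitline) sums G[i, j] * V[i] over its rows,
    i.e. Ohm's law per cell plus Kirchhoff's current law along the bitline."""
    return conductance.T @ voltages

G = np.array([[0.2, 0.5],
              [0.1, 0.3],
              [0.4, 0.6]])          # 3 input rows x 2 output columns of programmed conductances
V = np.array([1.0, 0.5, 0.2])       # input activations applied as row voltages
print(crossbar_mvm(G, V))           # analog accumulation along each bitline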

Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356

26
Neural Network Accelerator Comparison

Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/

27
Summary
o DNN design metrics

o Importance of memory accesses for DNNs
  • Any improvement in memory access improves DNN efficiency

o Improvement of computing efficiency
  • Data-reuse, pruning, quantization & weight sharing

o Accelerator architectures
  • Neural Processing Unit (NPU)
  • DianNao Series Accelerators
  • Tensor Processing Units (TPU)
  • RENO Architecture
28
Thank you

Any questions?

29
