DNN Accelerators
(EE4690)
Outline
o Accelerator architectures
o Summary
Recap: DNN hardware implementation
o Learned floating-point operations and MAC units
Recap: DNN hardware implementation
o DNN hardware implementation and floating-point operations
Recap: Training vs Inference
Source: Nvidia
Source: Mythic
o Dataflow processing
o A chain of ALUs passes data directly
Key Design Metrics
o Accuracy: Ratio of the number of correct predictions to the total number of predictions
• Describes the quality of results for a given application
Memory Access per MAC
o Memory access is the bottleneck for processing DNN operations
• Each MAC requires three memory reads
• For the filter weight, fmap activation, and partial sum
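For a back-of-the-envelope feel, the sketch below counts worst-case memory accesses for one AlexNet-like conv layer; the layer shape is an illustrative assumption and the three-reads-per-MAC count comes from this slide.

```python
# Minimal sketch (assumed example layer shape; 3 reads per MAC as on the slide).
# Counts worst-case memory accesses if every operand came from off-chip memory.

def conv_macs(out_h, out_w, out_ch, k_h, k_w, in_ch):
    """Number of MACs in one conv layer: every output pixel needs k_h*k_w*in_ch MACs."""
    return out_h * out_w * out_ch * k_h * k_w * in_ch

# Hypothetical AlexNet-like first layer: 55x55x96 outputs, 11x11x3 filters
macs = conv_macs(55, 55, 96, 11, 11, 3)
reads_per_mac = 3            # filter weight, fmap activation, partial sum
worst_case_reads = macs * reads_per_mac
print(f"MACs: {macs:,}, worst-case memory reads: {worst_case_reads:,}")
```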
Memory Access per MAC (Cont…)
o Worst case: all memory accesses go to DRAM
• AlexNet requires ~3000 million DRAM accesses
• DRAM accesses incur significantly higher energy consumption
Memory Access per MAC (Cont…)
o Accelerators minimize DRAM accesses
• By introducing several levels of local memory
• Data reuse using local memory
Memory hierarchy & data movement energy costs
o Fetching data from the RF or neighbouring PEs costs far less energy than DRAM access
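To make the hierarchy argument concrete, the sketch below totals data-movement energy from per-level access counts. The relative costs (RF 1x, neighbouring PE 2x, global buffer 6x, DRAM 200x, normalized to one MAC) are assumed illustrative values, not measured figures.

```python
# Minimal sketch of why the memory hierarchy matters.
# The relative access costs are illustrative assumptions (normalized to one MAC).
RELATIVE_COST = {"RF": 1, "neighbour_PE": 2, "global_buffer": 6, "DRAM": 200}

def movement_energy(access_counts):
    """Total data-movement energy given per-level access counts (in MAC-energy units)."""
    return sum(RELATIVE_COST[level] * n for level, n in access_counts.items())

# Serving most operands from the RF instead of DRAM cuts energy dramatically
all_dram  = movement_energy({"DRAM": 3_000})
mostly_rf = movement_energy({"RF": 2_700, "global_buffer": 250, "DRAM": 50})
print(all_dram, mostly_rf)   # 600000 vs 14200 MAC-energy units
```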
Data Reuse Opportunities
With all data-reuse options exploited, AlexNet can in the best case reduce DRAM accesses from ~3000 million to 61 million
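For intuition, the sketch below estimates the three reuse factors usually distinguished for conv layers (convolutional, fmap, and filter reuse); the layer shape and batch size are assumed examples.

```python
# Minimal sketch of conv-layer data-reuse types (assumed example shapes).
# Each factor is how many MACs one fetched value can serve if kept in local memory.
def reuse_factors(k_h, k_w, out_h, out_w, num_filters, batch):
    return {
        "convolutional reuse (per activation)": k_h * k_w,             # one input pixel hit by many filter positions
        "fmap reuse (per activation)":          num_filters,           # one input fmap reused by every filter
        "filter reuse (per weight)":            batch * out_h * out_w,  # one weight reused for every output pixel and image
    }

print(reuse_factors(k_h=3, k_w=3, out_h=13, out_w=13, num_filters=384, batch=4))
```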
Energy Breakdown across AlexNet Layers
o RF energy dominates in the convolutional layers
o DRAM energy dominates in the fully connected layers
AlexNet Architecture
Source: V. Sze et al., Proceedings of the IEEE, 2017
DNN Accelerator versus General-Purpose Processor
o The mapper translates the DNN shape & size into a hardware-compatible computation mapping for execution, given the dataflow
• Optimizes for energy efficiency
o The compiler translates a program into machine-readable binary code for execution, given the hardware architecture (e.g., x86 or ARM)
• Usually optimizes for performance
Pruning
Methods that produce models that are smaller in size, more memory- and power-efficient, and faster at inference, with minimal loss in accuracy
Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf
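A minimal sketch of one common criterion, magnitude pruning; the threshold-based rule here is an illustrative choice, not the specific method of the cited slides.

```python
# Minimal sketch of magnitude-based weight pruning (illustrative criterion).
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest |w| so that `sparsity` fraction of weights is removed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.random.randn(256, 256).astype(np.float32)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print(f"non-zero weights: {mask.mean():.0%} of original")  # ~10% remain
```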
Pruning (Cont…)
Accuracy drops after pruning; therefore, the network is usually trained and pruned iteratively (train, prune, retrain, prune again) to recover accuracy
Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf
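A sketch of the iterative loop itself; `train_one_epoch` and `evaluate` are hypothetical placeholders for a real training framework, and the sparsity schedule is an assumption.

```python
# Minimal sketch of the iterative train -> prune -> retrain loop (placeholder training).
import numpy as np

def iterative_pruning(weights, train_one_epoch, evaluate, target_sparsity=0.9, steps=5):
    """Gradually increase sparsity and retrain the surviving weights after each pruning step."""
    mask = np.ones_like(weights, dtype=bool)
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps              # ramp sparsity up gradually
        threshold = np.quantile(np.abs(weights), sparsity)
        mask &= np.abs(weights) >= threshold                   # once removed, a weight stays removed
        weights = train_one_epoch(weights * mask) * mask       # retrain, keep pruned weights at zero
        print(f"step {step}: sparsity={1 - mask.mean():.0%}, metric={evaluate(weights):.3f}")
    return weights, mask

# Toy demo with dummy "training" and a dummy metric (fraction of non-zero weights)
w0 = np.random.randn(100, 100).astype(np.float32)
iterative_pruning(w0, train_one_epoch=lambda w: w, evaluate=lambda w: float((w != 0).mean()))
```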
Quantization
o The process of approximating a neural network that uses floating-point numbers by a neural network of low-bit-width numbers
• Reduces both the memory requirements and the computational cost
Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf
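A minimal sketch of one simple scheme, symmetric linear quantization of weights to int8; the scale choice and bit width are illustrative assumptions.

```python
# Minimal sketch of symmetric linear weight quantization to int8.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                      # map the largest |w| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale                  # approximate reconstruction

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())  # small for well-scaled weights
```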
Quantization & Weight Sharing
Source: https://web.stanford.edu/class/ee380/Abstracts/160106-slides.pdf
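A minimal sketch of weight sharing in the spirit of the cited slides: cluster each layer's weights with k-means and store only a small codebook plus per-weight indices. The cluster count of 16 (i.e., 4-bit indices) is an assumed example.

```python
# Minimal sketch of weight sharing via k-means clustering of weights into a codebook.
import numpy as np
from sklearn.cluster import KMeans

def share_weights(w, n_clusters=16):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(w.reshape(-1, 1))
    codebook = km.cluster_centers_.flatten()              # shared weight values
    indices = km.labels_.reshape(w.shape)                 # 4-bit index per weight
    return codebook, indices

w = np.random.randn(64, 64).astype(np.float32)
codebook, idx = share_weights(w)
w_shared = codebook[idx]                                  # reconstructed (approximate) weights
print("unique weights:", len(codebook), "mean abs error:", np.abs(w - w_shared).mean())
```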
Accelerator Architectures
Neural Processing Unit (NPU)
o The NPU performs the computation of a multilayer perceptron NN
• Used to accelerate general-purpose programs, e.g., Sobel edge detection and the fast Fourier transform (FFT)
o A program segment is accelerated when
• It is frequently executed,
• It is approximable, and
• Its inputs & outputs are well defined
DianNao Architecture
o It consists of the following:
• A computational block: the neural functional unit (NFU)
• An input buffer for input neurons (NBin)
• An output buffer for output neurons (NBout)
• A synapse buffer for synaptic weights (SB)
• A control processor (CP)
o Improves system efficiency by minimizing memory transfer latency
Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356
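A purely software sketch of the idea behind the buffers: process a layer tile by tile so partial sums stay in NBout and never travel back to main memory. The tile sizes and the matrix-vector formulation are simplifying assumptions, not the actual NFU pipeline.

```python
# Minimal sketch of DianNao-style buffered, tiled layer processing.
import numpy as np

def diannao_like_layer(x, W, tile_in=64, tile_out=64):
    """y = W @ x computed tile by tile: NBin holds an input tile, SB a weight tile,
    NBout accumulates partial sums on-chip."""
    n_out, n_in = W.shape
    nbout = np.zeros(n_out, dtype=np.float32)                 # output-neuron buffer
    for o in range(0, n_out, tile_out):
        for i in range(0, n_in, tile_in):
            nbin = x[i:i + tile_in]                            # fetch one input tile into NBin
            sb = W[o:o + tile_out, i:i + tile_in]              # fetch matching weights into SB
            nbout[o:o + tile_out] += sb @ nbin                 # NFU MACs; partial sums stay in NBout
    return nbout

x = np.random.randn(256).astype(np.float32)
W = np.random.randn(128, 256).astype(np.float32)
print(np.allclose(diannao_like_layer(x, W), W @ x, atol=1e-3))  # True
```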
DianNao Series Accelerators (Cont…)
o DaDianNao targets the datacenter scenario and integrates a large on-chip embedded DRAM to avoid long main-memory access times
o ShiDianNao is a DNN accelerator dedicated to CNN applications
• CNN parameters are mapped to SRAM
• Achieves 60x energy efficiency compared with the DianNao architecture
o PuDianNao introduced a software-hardware co-design method to increase on-chip data reuse and PE utilization ratios
• Supports DNNs as well as other ML algorithms, such as k-means and classification trees
Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356
Tensor Processing Units (TPU)
o TPU-1 focuses on inference tasks and is deployed in Google's datacenters
Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356
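A minimal, illustrative model of a TPU-style matrix unit: 8-bit operands with 32-bit accumulation and weights held stationary while activations stream through. The array size and dataflow are simplified assumptions, not the actual microarchitecture.

```python
# Minimal sketch of a weight-stationary 8-bit matrix unit with 32-bit accumulation.
import numpy as np

def matrix_unit(activations_int8, weights_int8):
    """Multiply int8 activations by stationary int8 weights, accumulating in int32."""
    acc = np.zeros((activations_int8.shape[0], weights_int8.shape[1]), dtype=np.int32)
    # One "wave" per input position: each activation column sweeps across the stationary weights
    for k in range(weights_int8.shape[0]):
        acc += activations_int8[:, k:k + 1].astype(np.int32) * weights_int8[k:k + 1, :].astype(np.int32)
    return acc

a = np.random.randint(-128, 127, size=(4, 256), dtype=np.int8)
w = np.random.randint(-128, 127, size=(256, 256), dtype=np.int8)
print(np.array_equal(matrix_unit(a, w), a.astype(np.int32) @ w.astype(np.int32)))  # True
```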
RENO Architecture
o Utilizes a ReRAM crossbar as the computation unit to perform matrix-vector multiplication
• Supports the processing of small datasets, such as MNIST
ReRAM crossbar
Source: https://www.sciencedirect.com/science/article/pii/S2095809919306356
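An idealized sketch of how a ReRAM crossbar computes a matrix-vector product: weights are programmed as cell conductances, inputs are applied as row voltages, and each column current sums their products by Kirchhoff's current law. Wire resistance, device variation, and ADC/DAC quantization are ignored, and the value ranges are assumed for illustration.

```python
# Minimal sketch of ideal analog matrix-vector multiplication on a ReRAM crossbar.
import numpy as np

def crossbar_mvm(conductances, voltages):
    """Column currents of an ideal crossbar: I[j] = sum_i G[i, j] * V[i]."""
    return voltages @ conductances

G = np.random.uniform(0.0, 1.0, size=(128, 10))   # one conductance per weight (assumed range)
V = np.random.uniform(0.0, 0.2, size=128)          # input activations encoded as row voltages
I = crossbar_mvm(G, V)                              # analog dot products, read out per column
print(I.shape)                                      # (10,) one current per output neuron
```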
Neural Network Accelerator Comparison
Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/
Summary
o DNN design metrics
o Accelerator architectures
• Neural Processing Unit (NPU)
• DianNao Series Accelerators
• Tensor Processing Units (TPU)
• RENO Architecture
Thank you
Any questions?