
MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Programmable Interconnects


Hyoukjun Kwon (hyoukjun@gatech.edu), Ananda Samajdar (anandsamajdar@gatech.edu), and Tushar Krishna (tushar@ece.gatech.edu)
Georgia Institute of Technology, Atlanta, Georgia

ABSTRACT

The microarchitecture of DNN inference engines is an active research topic in the computer architecture community because DNN accelerators are needed to maximize performance/watt for mass deployment across phones, cars, and so on. This has led to a flurry of ASIC DNN accelerator proposals in academia over recent years. Industry is also investing heavily, with every major company developing its own neural network accelerator, which has resulted in a myriad of dataflow patterns. We claim that dataflows essentially lead to different kinds of data movement within an accelerator. Thus, to support arbitrary dataflows in accelerators, we propose to make the interconnects programmable. We achieve this by augmenting all compute elements (multipliers and adders) and on-chip buffers with tiny switches, which can be configured at compile time or runtime. Our design, MAERI, connects these switches via a new configurable and non-blocking tree topology to provide not only programmability but also high throughput.

ACM Reference Format:
Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Programmable Interconnects. In Proceedings of SysML '18. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

The microarchitecture of deep neural network (DNN) inference engines is an active research area in the computer architecture community. GPUs provide efficient training platforms with their massive parallelism, multi-core CPUs provide platforms for algorithmic exploration, and FPGAs provide power-efficient and configurable platforms for algorithmic exploration and acceleration. However, for mass deployment across various domains (phones, cars, etc.), DNN accelerators are needed to maximize performance/watt. This has led to a flurry of ASIC proposals for DNN accelerators over recent years [3–7, 11]. Industry is also heavily investing, with every major company developing its own spatial DNN accelerator [1, 8, 9]. One of the practical and open challenges for DNN accelerator designs is programmability, because DNNs can be partitioned in myriad ways (within [4] and across layers [2]) to exploit data reuse, which leads to different dataflow patterns within accelerators.

The DNN Data Flow Graph (DFG) is fundamentally a multi-dimensional multiply-accumulate calculation, as Figure 1 demonstrates. Each dataflow is essentially some transformation of this loop [10, 12], with different optimization potential depending on the neural network layer. Unfortunately, most DNN accelerators cannot exploit the potential of each dataflow because they internally support only a fixed dataflow pattern, the result of a careful co-design of the PEs and the network-on-chip (NoC) (e.g., the TPU [9]). Because of this inflexibility, mapping different dataflows onto an accelerator can lead to compute resource underutilization. Therefore, each new optimization has required a new accelerator design for that optimization [2, 7, 11, 13], which makes the hardening of accelerator designs challenging and uneconomical.

Our insight in this work is that different dataflows essentially lead to different data movement patterns within accelerators. Thus, to support arbitrary dataflows in spatial accelerators, we propose MAERI (Multiply-Accumulate Engine with Reconfigurable Interconnect)¹, a DNN accelerator with programmable interconnects. MAERI augments all compute elements (multipliers and adders) and on-chip buffers with tiny switches, which can be programmed/configured at compile- or run-time to support myriad dataflow scenarios. We connect these switches via a new configurable and non-blocking tree topology.

¹ A version of this paper will appear in Proc. of the 23rd ACM Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar 2018.

2 MAERI BUILDING BLOCKS

Figure 1 shows MAERI's building blocks:

• Prefetch buffer (PB): serves as a cache of DRAM, which stores input activations, weights, intermediate partial sums that could not be fully accumulated, and output activations.
• Activation Units: Lookup tables are used to implement different activation functions (such as ReLU).
• Distribution Tree: A fat-tree is used to distribute activations and weights from the PB to the multipliers.
• Simple Switch (SS): Each node in the distribution tree is a simple 2:1 switch to unicast/multicast inputs/weights.
• Augmented Reduction Tree (ART): A fat-tree augmented with forwarding links is used to reduce partial sums and send outputs to the activation units. An ART with N leaves can support 1 to N/2 simultaneous reductions and is provably non-blocking.
• Adder Switch (AS): Each node in the ART is an adder augmented with a switch to allow data forwarding to peers or to parents.

Figure 1: An overview of MAERI. MAERI is designed to efficiently handle CONV, LSTM, POOL and FC layers. It can also handle cross-layer and sparse mappings. We implement this flexibility using configurable distribution and reduction trees within the fabric. (The figure shows a VGG-16 layer topology, the accelerator controller, prefetch buffer, activation units, the distribution tree of simple 1:2 switches, the multiplier switches with local buffers and forwarding links, and the Augmented Reduction Tree (ART) of adder switches, alongside the full convolution loop nest below and the dataflow optimizations applied to it: loop ordering, loop blocking/tiling, and loop unrolling.)

    for (ki = 0; ki < K; ki++) {              // Filter; Loop_K
      for (ci = 0; ci < C; ci++) {            // Input channel; Loop_C
        for (yi = 0; yi < Y; yi++) {          // Image row; Loop_Y
          for (xi = 0; xi < X; xi++) {        // Image column; Loop_X
            for (ri = 0; ri < R; ri++) {      // Weight filter row; Loop_R
              for (si = 0; si < S; si++) {    // Weight filter column; Loop_S
                O[ki][xi][yi] += W[ki][ci][ri][si] * I[ci][yi][xi]; }}}}}}
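The Introduction's claim that each dataflow is essentially a transformation of this loop can be made concrete with a small example. The sketch below is purely illustrative and is not MAERI's specific mapping: it reorders the loop nest and tiles the filter (K) and image-column (X) dimensions with hypothetical tile sizes TK and TX. Which loops are tiled, reordered, or unrolled onto parallel hardware is exactly what distinguishes one dataflow from another. The indexing mirrors the simplified form used in Figure 1.

    /* Illustrative tiled/reordered variant of the Figure 1 loop nest (C99).
     * TK and TX are hypothetical tile sizes, not parameters from the paper. */
    #define TK 4   /* tile of the filter (K) dimension       */
    #define TX 8   /* tile of the image-column (X) dimension */

    void conv_tiled(int K, int C, int Y, int X, int R, int S,
                    float O[K][X][Y], float W[K][C][R][S], float I[C][Y][X])
    {
        for (int kk = 0; kk < K; kk += TK)              /* tiled Loop_K */
          for (int xx = 0; xx < X; xx += TX)            /* tiled Loop_X */
            for (int ci = 0; ci < C; ci++)              /* Loop_C       */
              for (int yi = 0; yi < Y; yi++)            /* Loop_Y       */
                for (int ki = kk; ki < kk + TK && ki < K; ki++)
                  for (int xi = xx; xi < xx + TX && xi < X; xi++)
                    for (int ri = 0; ri < R; ri++)      /* Loop_R       */
                      for (int si = 0; si < S; si++)    /* Loop_S       */
                        O[ki][xi][yi] += W[ki][ci][ri][si] * I[ci][yi][xi];
    }

A fixed-function accelerator effectively bakes one such ordering into its PE array and NoC; MAERI's programmable switches are meant to let the ordering change per layer without a new hardware design.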

Figure 2: Programming the switches to map a CONV layer in MAERI. W, X, and O represent weights, input activations, and output activations. (The walk-through maps a 2×2 filter (W00, W01, W10, W11) over a 4×4 input activation (X00–X33), computing Oij = W00 × Xij + W01 × Xi(j+1) + W10 × X(i+1)j + W11 × X(i+1)(j+1). Panels: (0) the target CONV layer; (1) an instruction stream from the prefetch buffer programs the switches to create virtual neurons; (2) virtual neuron construction (VN 0–2, four multipliers each); (3) weight and input activation distribution to each VN; (4) output activation calculation as the input windows slide, with outputs such as O00/O10/O20 and O01/O11/O21 reduced and returned to the prefetch buffer.)

• Multiplier Switch (MS): Each multiplier is augmented with a switch to allow data forwarding to neighboring multipliers and data reception from the PB or neighboring MSes.

The distribution and reduction trees provide full non-blocking bandwidth to the compute blocks, but can be pruned to reduce that bandwidth if required to save area and power.

3 MAPPING DATAFLOWS OVER MAERI

The entire accelerator is controlled by a programmable controller which manages the reconfiguration of all three sets of switches (MS, AS, and SS) for mapping the target dataflow. This is done by creating Virtual Neurons (VNs) over the multipliers and adders. The flexibility of the interconnects allows us to create VNs of any size, which provides the ability to map arbitrary dataflows simultaneously. Figure 2 shows a walk-through example of mapping a convolutional layer; recurrent, max-pool, fully-connected, sparse, and other layers can be mapped similarly. (A small illustrative sketch of the VN sizing and arithmetic follows Section 4.)

4 EVALUATIONS AND CONCLUSIONS

MAERI is a spatial accelerator for mapping the arbitrary dataflows that arise in DNNs, whether from the network topology or the chosen mapping, by placing tiny programmable switches next to each on-chip compute and memory engine. It provides 130–283% better utilization across multiple dataflow mappings over baselines with rigid NoC fabrics. MAERI's interconnects are 10–100× smaller than conventional NoCs.
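To make the virtual-neuron construction of Section 3 and the Figure 2 walk-through concrete, the following is a minimal functional sketch, in software only; it is not MAERI's hardware or the controller's instruction format, and names such as N_MULT and vn_size are illustrative assumptions. It sizes VNs from the filter dimensions, checks the ART bound from Section 2 (an ART with N leaves supports at most N/2 simultaneous reductions), and reproduces the per-VN arithmetic Oij = W00 × Xij + W01 × Xi(j+1) + W10 × X(i+1)j + W11 × X(i+1)(j+1) for the 2×2-filter example, with VN 0–2 producing one output row each as in the figure.

    #include <stdio.h>

    /* Functional sketch of the Figure 2 mapping (illustrative, not MAERI RTL):
     * a 2x2 filter slid over a 4x4 input, mapped onto virtual neurons (VNs). */
    #define R 2          /* filter rows                         */
    #define S 2          /* filter columns                      */
    #define N_MULT 12    /* multiplier switches in this example */
    #define IN_DIM 4     /* input activation rows/columns       */

    int main(void)
    {
        /* Each VN groups one filter's worth of multipliers, so its
         * partial sums reduce to a single output inside the ART. */
        int vn_size = R * S;                     /* 4 multipliers per VN */
        int num_vns = N_MULT / vn_size;          /* 12 / 4 = 3 (VN 0-2)  */

        /* ART property from Section 2: N leaves support at most N/2
         * simultaneous reductions, so cap the VN count accordingly. */
        if (num_vns > N_MULT / 2)
            num_vns = N_MULT / 2;

        float W[R][S] = { {1, 2}, {3, 4} };      /* stand-in weight values */
        float X[IN_DIM][IN_DIM];                 /* stand-in input values  */
        for (int i = 0; i < IN_DIM; i++)
            for (int j = 0; j < IN_DIM; j++)
                X[i][j] = (float)(i * IN_DIM + j);

        /* Each step, VN v computes one output:
         * Oij = W00*Xij + W01*Xi(j+1) + W10*X(i+1)j + W11*X(i+1)(j+1),
         * with VN 0-2 assigned to output rows 0-2 as in Figure 2
         * (O00/O10/O20 first, then O01/O11/O21, and so on). */
        for (int j = 0; j + 1 < IN_DIM; j++) {   /* slide the window (column) */
            for (int v = 0; v < num_vns; v++) {  /* one output row per VN     */
                int i = v;
                float o = 0.0f;
                for (int r = 0; r < R; r++)      /* the VN's R*S multipliers  */
                    for (int s = 0; s < S; s++)
                        o += W[r][s] * X[i + r][j + s];
                printf("VN %d -> O%d%d = %.1f\n", v, i, j, o);
            }
        }
        return 0;
    }

In the actual design, the weights and input windows in this sketch would be delivered by the distribution tree and the accumulation would be performed by the adder switches of the ART; the sketch reproduces only the arithmetic and the sizing.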

REFERENCES
[1] Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John
Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon Nam,
Brian Taba, Michael Beakes, Bernard Brezzo, Jente B. Kuang, Rajit Manohar,
William P. Risk, Bryan Jackson, and Dharmendra S. Modha, Truenorth: Design
and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip,
TCADICS 34 (2015), no. 10, 1537–1557.
[2] Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder, Fused-layer CNN
accelerators, 49th Annual IEEE/ACM International Symposium on Microarchitec-
ture (MICRO), 2016.
[3] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen,
and Olivier Temam, Diannao: A small-footprint high-throughput accelerator for
ubiquitous machine-learning, ASPLOS, 2014, pp. 269–284.
[4] Yu-Hsin Chen, Joel Emer, and Vivienne Sze, Eyeriss: A spatial architecture for
energy-efficient dataflow for convolutional neural networks, ISCA, 2016.
[5] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tian-
shi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam, Dadiannao: A machine-
learning supercomputer, MICRO, 2014, pp. 609–622.
[6] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaob-
ing Feng, Yunji Chen, and Olivier Temam, Shidiannao: Shifting vision processing
closer to the sensor, ISCA, 2015.
[7] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz,
and William J Dally, Eie: efficient inference engine on compressed deep neural
network, ISCA, 2016.
[8] Intel, Intel’s new self-learning chip promises to accelerate
artificial intelligence, https://newsroom.intel.com/editorials/
intels-new-self-learning-chip-promises-accelerate-artificial-intelligence/.
[9] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal,
Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al.,
In-datacenter performance analysis of a tensor processing unit, Proceedings of the
44th Annual International Symposium on Computer Architecture, ACM, 2017,
pp. 1–12.
[10] Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li,
Flexflow: A flexible dataflow accelerator architecture for convolutional neural net-
works, HPCA, 2017.
[11] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rang-
harajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and
William J Dally, Scnn: An accelerator for compressed-sparse convolutional neural
networks, Proceedings of the 44th Annual International Symposium on Computer
Architecture, ACM, 2017, pp. 27–40.
[12] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong,
Optimizing fpga-based accelerator design for deep convolutional neural networks,
FPGA, 2015, pp. 161–170.
[13] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo,
Tianshi Chen, and Yunji Chen, Cambricon-x: An accelerator for sparse neural
networks, MICRO, 2016.
