FPGA CNN Project Paper
1 Abstract
This paper explores the vital role of convolution as a foundational element in
Deep Neural Networks (DNNs) for extracting meaningful features from input
data. The focus is on designing a real-time architecture for a high-throughput
processing unit (PU) dedicated to 2D convolution. Three distinct PU topolo-
gies are examined, progressing from a non-pipelined Multiply-Accumulate
(MAC) to a 2-staged pipelined MAC and culminating in a pipelined MAC
with multiple stages, systematically reducing convolution process delays. Be-
yond theoretical considerations, this paper evaluates resource utilization and
latency across different PU topologies on Artix-7 and Zynq-7 FPGA chips,
providing comparative insights using Xilinx Vivado software. Additionally,
the study extends to estimating and comparing throughput across various
kernel and picture sizes, offering real-world performance insights into differ-
ent MAC topologies. The thorough investigation successfully enhanced the
convolution process within the dynamic framework of DNNs, resulting in
a substantial 96% increase in throughput.
2 Introduction
Day by day, the usage of deep learning and deep neural networks (DNNs)
is skyrocketing in fields such as object detection, object recognition,
semantic segmentation, and medical imaging, because their accuracy, achieved
in less time, can exceed human performance. Deep neural networks
(DNNs), a fundamental component of contemporary artificial intelligence, of-
fer a host of advantages across various applications, highlighting their adapt-
ability and efficiency in managing intricate tasks.
In the realm of image and object recognition [2], DNNs play a pivotal role,
propelling advancements in facial recognition, autonomous vehicles, and med-
ical image analysis through their capacity to learn intricate patterns from
data. Natural Language Processing (NLP) has experienced a transformative
leap with DNNs, particularly through transformer models like BERT and
GPT, empowering applications such as sentiment analysis, language transla-
tion, and chatbots with unprecedented accuracy. Speech recognition systems
heavily rely on DNNs, facilitating the development of virtual assistants, tran-
scription services, and voice-controlled devices, offering robust performance
in complex audio environments. DNNs contribute significantly to recommen-
dation systems, enhancing user experience by tailoring content suggestions
in areas like streaming services and e-commerce. In healthcare, these net-
works bring about breakthroughs by analyzing medical images for tasks like
tumor detection and disease classification, thereby improving diagnostic ac-
curacy. Autonomous vehicles leverage the strengths of DNNs for perception
tasks, ensuring robust object detection, precise lane-keeping, and effective
obstacle avoidance. Moreover, DNNs contribute to financial fraud detection,
excel in generative tasks like image synthesis, and prove valuable in anomaly
detection across diverse industries. Their applications extend to robotics,
drug discovery, and climate modeling. The dynamic adaptability of DNNs,
coupled with their ability to learn from vast and complex datasets, ensures
continuous exploration and adaptation across emerging domains, solidifying
their significance in shaping technological landscapes and ushering in un-
precedented advancements.
Due to these extensive applications, there is keen interest in utilizing
DNNs across a wide range of domains. Despite encountering
various challenges, Deep Neural Networks (DNNs) continue to be a sub-
ject of interest and exploration. Overfitting remains a prevalent concern
as DNNs can become overly attuned to training data, compromising their
ability to generalize to new datasets. Data quality and quantity are cru-
cial, demanding extensive, well-labeled datasets, and the presence of biases
can skew model outcomes. The computational demands of training DNNs are
substantial, requiring powerful hardware and substantial time. Interpretabil-
ity is a persistent challenge due to the complex, non-linear nature of deep
neural networks, making it difficult to comprehend their decision-making
processes. Adversarial attacks pose security risks, exploiting vulnerabilities
in DNNs and causing misclassifications. Ethical considerations arise from
biases in training data, potentially leading to unfair outcomes, raising ques-
tions about responsible AI deployment. Transfer learning faces challenges in
adapting pre-trained models to new tasks or domains. Training instability
is a concern, as small changes in conditions can yield significantly different
model performances. A limited theoretical understanding of why DNNs work
in certain situations hinders efforts to design more robust models. Lastly, de-
ploying DNNs in resource-constrained environments remains challenging due
to their high computational requirements. Addressing these multifaceted
challenges is crucial for unlocking the full potential of DNNs across various
applications.
Among these challenges, this paper mainly focuses on the compu-
tational demands of Deep Neural Networks (DNNs), which have necessitated the
development of specialized hardware implementations to ensure efficient and
timely model training and inference. Traditional Central Processing Units
(CPUs) provide versatility but may lack the parallel processing power re-
quired for the massive matrix operations intrinsic to DNN computations.
Graphics Processing Units (GPUs) have gained prominence due to their
ability to parallelize tasks effectively, making them well-suited for train-
ing large models and processing extensive datasets simultaneously. Field-
Programmable Gate Arrays (FPGAs) offer flexibility, enabling the customiza-
tion of circuits to specific DNN architectures, which is advantageous for
evolving or optimizing neural network structures. Application-Specific Inte-
grated Circuits (ASICs) represent a pinnacle in specialized hardware, custom-
designed for particular algorithms, delivering exceptional speed and power
efficiency, albeit with higher development costs. The ongoing exploration
of hardware accelerators, such as Google’s Tensor Processing Units (TPUs),
showcases efforts to optimize DNN performance, emphasizing the need for
hardware architectures that balance computational power, energy efficiency,
and flexibility. As the field advances, these hardware implementations play
a pivotal role in harnessing the full potential of DNNs across a spectrum of
artificial intelligence applications.
Among them, Field-Programmable Gate Arrays (FPGAs) and Application-
Specific Integrated Circuits (ASICs) are favored choices for Deep Neural
Network (DNN) accelerators due to their distinctive features aligning with
the computational demands of neural networks. FPGAs offer a balance be-
tween flexibility and parallelism, making them adaptable to varying DNN
architectures. Their programmable nature allows for rapid prototyping and
reconfiguration, reducing development time and cost. Additionally, FPGAs
strike a compromise between energy efficiency and performance, making them
suitable for scenarios where the hardware needs to support different neural
network models efficiently.
On the other hand, ASICs are custom-designed for specific tasks, provid-
ing unparalleled parallelism and energy efficiency. While the development
of ASICs is more time-consuming and expensive, they excel in performance,
making them ideal for large-scale deployments where dedicated hardware
optimized for a particular set of computations is paramount. The choice
between FPGAs and ASICs depends on factors such as the need for adapt-
ability, development time, energy efficiency, and the scale of deployment.
FPGAs are often preferred during the prototyping phase and in applications
where flexibility is crucial, while ASICs shine in situations demanding max-
imum performance and energy efficiency at scale.
In the realm of Deep Neural Networks (DNNs), 2D convolution stands
out as a paramount process, particularly within Convolutional Neural Net-
works (CNNs), which are a subclass of DNNs. This operation is integral for feature
extraction from spatial data, notably in image-related tasks. It entails slid-
ing a small filter over the input, performing element-wise multiplications and
aggregations to capture spatial hierarchies, patterns, and features. The sig-
nificance of 2D convolution lies in its inherent ability to autonomously learn
and extract meaningful representations from the input, enabling the discern-
ment of complex patterns and structures. This process is indispensable for
various applications in computer vision, including image recognition and ob-
ject detection. The hierarchical and localized nature of 2D convolution serves
as a cornerstone in the success of many deep learning applications, providing
a framework for CNNs to comprehend and interpret intricate features present
in the input data.
Convolutional Neural Networks (CNNs) offer several advantages over tra-
ditional fully connected deep neural networks, especially in tasks involv-
ing grid-like data. CNNs capture spatial hierarchies effectively, recognizing
patterns at different levels of abstraction. They achieve this through param-
eter sharing, reducing the number of parameters and enhancing efficiency.
The concept of translation invariance, enabled by shared weights and pool-
ing layers, makes CNNs robust to variations in position, orientation, or scale
of objects in images. Additionally, local connectivity in CNNs allows them
to focus on capturing local patterns, contributing to their effectiveness in
computer vision applications. While CNNs excel in spatial tasks, the choice
between CNNs and DNNs depends on the nature of the data and the specific
requirements of the task at hand.
In the current work, the implementation of a high-throughput processing
unit for 2D convolution on an FPGA is presented. We have studied how a
pipelining technique can be used to provide high throughput and maximize per-
formance in the convolution processing unit by partitioning the critical path.
Increasing the throughput by means of pipelining in the convolution processing
unit distinguishes this study from former studies. The main contributions
of this work are as follows:
i) This paper explores various convolution techniques that are used for
CNN accelerators. Based on our analysis, the MAC has been used for the
design of the convolution architecture. Thus, this paper proposes a multi-
staged MAC which can be used to construct a faster processing unit
for 2D convolution.
ii) Finally, the proposed MAC, along with the other MACs, is implemented on
FPGA boards, and those MACs are compared in terms of latency and through-
put to determine which MAC is best suited for
the convolution layer.
Convolution is a pivotal process in the field of deep neural networks because
it enables these networks to efficiently process and extract features from
complex data like images, audio, and even text. It also allows deep neural
networks to automatically learn and recognize intricate hierarchical patterns,
such as edges, textures, and object parts in images, making them highly effec-
tive for tasks like image recognition, object detection, and image generation.
The convolution process uses two matrices of data, an input and a filter/kernel,
to produce an output matrix. Here, the input matrix/map may be picture data,
audio data, text data, or processed picture/audio/text data, while the
filter/kernel is the pattern to be detected in the input.
Here is how a 5x5 image is convolved with a 3x3 filter with a stride of 1.
First, the 3x3 filter is multiplied element-wise with the first 3x3 section of the
image, and the nine products are summed to obtain the first output value. The
filter then moves from the first column to the second column, since the stride is 1;
the multiplication and addition are repeated to obtain the second value, and this
continues until the last column value of the first row of the output, since the
filter cannot move further to the right. The filter is then moved back to the first
column and down one row, since the stride is 1, giving the first output value of the second row.
This continues up to the last column value of the second row of the output.
The process repeats up to the last column value of the last row of the output,
i.e., moving the filter to the third row, which places the filter at the bottom-right
corner of the input image. The whole convolution procedure discussed above is
shown in Figure 2.2.
By taking stride and zero padding into consideration, the output size can
be determined using the formula below, where A is the input image size, ZP
is the zero padding, f is the filter/kernel size, and str is the stride.
\[
\text{Output size} = \left\lfloor \frac{A + 2 \cdot ZP - f}{str} \right\rfloor + 1 \qquad (1)
\]
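To make the procedure concrete, the following minimal NumPy sketch (a software model only, not the proposed FPGA architecture) performs the sliding-window convolution described above and sizes its output with Eq. (1); the function name conv2d and the example data are our own choices.

```python
import numpy as np

def conv2d(image, kernel, stride=1, zero_pad=0):
    """Sliding-window 2D convolution as used in CNNs (cross-correlation).

    image: A x A input map, kernel: f x f filter.
    Output size follows Eq. (1): floor((A + 2*ZP - f) / str) + 1.
    """
    A = image.shape[0]
    f = kernel.shape[0]
    padded = np.pad(image, zero_pad)                   # zero padding
    out = (A + 2 * zero_pad - f) // stride + 1         # Eq. (1)
    result = np.zeros((out, out))
    for r in range(out):                               # slide down the rows
        for c in range(out):                           # slide across the columns
            window = padded[r*stride:r*stride+f, c*stride:c*stride+f]
            # element-wise multiply and accumulate (the MAC operation)
            result[r, c] = np.sum(window * kernel)
    return result

# 5x5 image convolved with a 3x3 filter, stride 1 -> 3x3 output
img = np.arange(25).reshape(5, 5)
ker = np.ones((3, 3))
print(conv2d(img, ker).shape)   # (3, 3)
```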
3.2.2 Dilated convolution layer
Dilated convolution [1] allows for an increased receptive field without increas-
ing the number of parameters or the amount of computation required. This
is achieved by introducing gaps or "dilations" between the filter values of the
convolutional kernel, as shown in Figure 2.8. The dilation factor controls the
size of these gaps, which can be thought of as holes in the filter. By increasing
the dilation factor, the receptive field of the layer increases, allowing the
layer to capture features across larger spatial regions. Dilated convolutional
layers are mostly used to detect patterns in a CNN layer that are
far apart.
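As an illustration only, the following sketch shows one common way to realize dilation in software: spreading the kernel taps onto a sparse grid so that a 3x3 kernel with dilation 2 covers a 5x5 receptive field. The helper dilate_kernel is hypothetical and is not part of the proposed hardware design; the dilated kernel could then be fed to a routine such as the conv2d sketch above.

```python
import numpy as np

def dilate_kernel(kernel, dilation=1):
    """Insert (dilation - 1) zero gaps between the kernel taps.

    A 3x3 kernel with dilation 2 behaves like a 5x5 kernel with holes,
    enlarging the receptive field at no extra parameter cost.
    """
    f = kernel.shape[0]
    size = dilation * (f - 1) + 1             # effective receptive field
    dilated = np.zeros((size, size))
    dilated[::dilation, ::dilation] = kernel   # place original taps on a sparse grid
    return dilated

ker = np.ones((3, 3))
print(dilate_kernel(ker, dilation=2).shape)    # (5, 5) effective kernel
```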
The architecture of an accelerator should be well defined such that it can
perform multiple functions and operations quickly and precisely without any
structural hazards, and an accelerator architecture for DNNs should perform all
DNN functions and operations with higher performance than a standard CPU.
The basic architectures of accelerators commonly execute these func-
tions and operations quickly and precisely using a small multi-functional
building unit called a processing element (PE). This PE has multiple func-
tionalities, which are controlled by a controller placed
inside or outside the PE depending on the type of structure for which it was designed.
If the control unit, register file, and computational unit are combined in the PE,
that type of architecture is called a spatial architecture; if the com-
putational unit alone forms the PE, with outside memory, and is controlled by
one main controller, that type of architecture is called a temporal archi-
tecture. Temporal architectures can be seen in CPUs and GPUs, and spatial
architectures in FPGA-based and ASIC-based accelerators, as shown in the
figure.
Thus, an FPGA is preferred in this project. Table 2.1 lists the architectures
of some of the many CNN accelerators and describes
the convolution operation in those architectures. Notably,
the PE in these CNN accelerators is a multiply-accumulate (MAC) unit; the MAC is
the basic and most commonly used PE block in convolution architectures.
In [10] and [9], many MACs are used to build an accelerator for DNN
applications.
Table 2.1: CNN accelerator architectures.

Sl. No.  Name                                       Year  FPGA Platform     Frequency (MHz)  Performance (GOPS)  Power (mW)
1        VIP [11]                                   1996  Altera EPF81500   16               N/A                 N/A
2        Parallel Coprocessor for CNN [27]          2009  Virtex5 LX330T    115              6.74                0.61
3        MAPLE [4]                                  2010  Virtex5 SX240T    125              7                   N/A
4        DC-CNN [6]                                 2010  Virtex5 SX240T    120              16                  1.14
5        NeuFlow [15]                               2011  Virtex6 VLX240T   200              147                 14.7
6        Memory-Centric Accelerator [25]            2013  Virtex6 VLX240T   150              17                  N/A
7        Dynamic Reconfigurable Architecture [18]   2022  Xilinx Zynq 7020  200              N/A                 N/A
Pin Name     Input/Output   Operation
Go           Input          To start the convolution operation
Reset        Input          To restart the operation
Clock        Input          To run the system
Pic data     Input          To send picture/image data for convolution
Filter data  Input          To send filter data for convolution
Done         Output         To indicate the operation completion
Conv data    Output         To receive the final convolved data
3.3.2 MAC
From Figure 2.2, we can see that the output data is nothing but the mul-
tiplication and addition of image data and filter data. So, a module that
can multiply and accumulate the products is required. The MAC is that
module: it multiplies a series of data and accumulates each
product with the stored result. The operation of the MAC is shown in Figure 3.2.
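The following behavioral Python model (ours, not the RTL) captures the MAC operation just described: each step multiplies one operand pair and accumulates the product with the stored result, so one output pixel of an f x f convolution needs f*f such steps.

```python
class MAC:
    """Behavioral model of a multiply-accumulate unit: acc <- acc + a * b."""
    def __init__(self):
        self.acc = 0          # stored result (accumulator register)

    def reset(self):
        self.acc = 0          # cleared before each output pixel

    def step(self, a, b):
        self.acc += a * b     # one multiply-accumulate per clock cycle
        return self.acc

# One 3x3 output pixel takes f*f = 9 MAC operations
mac = MAC()
window = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 0, 1, 0, 1, 0, 1, 0, 1]
for a, b in zip(window, kernel):
    mac.step(a, b)
print(mac.acc)                # 1 + 3 + 5 + 7 + 9 = 25
```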
The structure of the neuron shown in Figure 2.2 can be achieved with a MAC. From
the above section on CNN layers, we can also see that all layers (except the
pooling and ReLU layers) can be designed with MACs. Thus the MAC is
essential in a DNN architecture. To increase the processing speed of the MAC, we
use pipelining. The details of the normal and pipelined
systems are as follows.
increases for pipeline stages and thus increases the performance of the whole
system.
4 Proposed Architecture
To validate the functionality of the 2D convolution PU, we designed a full
2D convolution system with the 2D convolution PU at its heart. This system
needs the following input and output pins, which are represented in the block
diagram shown in Figure 3.1.
This system mainly consists of a datapath and a controller.
The datapath deals with storing and processing data and contains a memory
unit and a processing unit, whereas the controller operates the datapath by sending
control signals after receiving status signals from the datapath as feedback.
4.1 Datapath
As previously stated, the datapath unit is used to generate the output by taking
the input data and processing it. The datapath unit sends
status signals to the controller to inform it of its state, and it is controlled by
control signals from the control unit. The datapath contains a number
of sub-units that are used to store and process the data, which are as
follows:
• Memory unit
• Address generator
• Status generator
• Process unit
Of these sub-units, the memory unit stores the data, the process unit
processes the data, the address generators select the data locations
used to store or process the data, and the status generator sends the status
signals to the controller. More about these sub-units follows.
makes the address for the output and filter memories, respectively. The address
generators also generate addresses based on the zero padding and stride values, as sketched below.
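The paper's address-generation logic is not reproduced here; the sketch below merely illustrates one plausible row-major addressing scheme that accounts for stride and zero padding. The name pic_addresses and the convention of returning None for padded positions (so that a zero is fed to the MAC instead of a memory read) are our assumptions.

```python
def pic_addresses(A, f, stride=1, zero_pad=0):
    """Hypothetical row-major address sequence for the picture memory,
    one f x f window at a time."""
    out = (A + 2 * zero_pad - f) // stride + 1           # Eq. (1)
    for orow in range(out):
        for ocol in range(out):
            for kr in range(f):                          # walk the f x f window
                for kc in range(f):
                    r = orow * stride + kr - zero_pad
                    c = ocol * stride + kc - zero_pad
                    outside = r < 0 or r >= A or c < 0 or c >= A
                    yield None if outside else r * A + c  # None -> padded zero

# First window of a 5x5 picture with a 3x3 filter: addresses 0..2, 5..7, 10..12
print(list(pic_addresses(5, 3))[:9])
```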
• MAC topologies
As shown in Figure 4.3, these are the different topologies used
to compare the speed and performance of the MAC. For the multipli-
cation, a signed Dadda multiplier is used, and for the addition, a CSlA
(carry-select adder) is used. A behavioral sketch of the pipelined topology follows.
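As a rough behavioral illustration of why pipelining helps (not the actual two-stage or five-stage RTL of the MAC topologies in Figure 5), the sketch below registers the product between the multiply and accumulate stages, so the clock period is limited by the slower stage instead of by the multiplier and adder in series.

```python
class PipelinedMAC:
    """Two-stage behavioral model: stage 1 multiplies, stage 2 accumulates.

    A pipeline register between the stages lets the clock period be set by
    the slower stage alone, and a new operand pair can be accepted every
    cycle once the pipeline is full.
    """
    def __init__(self):
        self.prod_reg = 0     # pipeline register between multiplier and adder
        self.acc = 0          # accumulator register

    def clock(self, a, b):
        # Both stages work in the same cycle on different operand pairs.
        self.acc += self.prod_reg      # stage 2: accumulate previous product
        self.prod_reg = a * b          # stage 1: multiply current operands
        return self.acc

mac = PipelinedMAC()
for a, b in [(1, 2), (3, 4), (5, 6)]:
    mac.clock(a, b)
mac.clock(0, 0)                        # one extra cycle to flush the pipeline
print(mac.acc)                         # 2 + 12 + 30 = 44
```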
Figure 5: Topologies of MAC used in the Process Unit (PU).
Figure 6: Signed Dadda multiplier and Adder topologies used in the proposed
methodology.
4.2 Controller
The controller is the brain of the total system; it generates control signals
using status signals. The controller is an FSM (finite state machine) that
changes the state of the entire system based on the status signals coming
from the datapath, and each state is performed in one cycle.
• ASM
This controller is designed using the ASM algorithm shown below.
3. Read data from both the picture and filter memories and perform the MAC operation.
6. Shift the row of the picture address and reset the column of the picture address.
• FSM
Figure 7: ASM of the Controller.
The FSM is as follows:
• Increment the row address and clear the column address
Here, A×A and f×f cycles are required to load the picture data and filter data
into memory, respectively. O×O cycles are used to send the result from the output memory to
the output pins. The remaining clock pulses are used by the processing unit
to load the result into the output memory. Synthesis of this architecture
was then done for the Artix-7 AC701 Evaluation Platform (xc7a200tfbg676-2) and
the Zynq-7 ZC702 Evaluation Board (xc7z020clg484-1), and the results are
as follows.
MAC Component Name          Artix 7 (xc7a200tfbg676-2)     Zynq 7 (xc7z020clg484-1)
                            Delay (ns)   Slices (LUT)      Delay (ns)   Slices (LUT)
Adder                       6.6          41                7.9          41
Dadda Multiplier            8.8          102               10.3         102
5-Stage Dadda Multiplier    1.4          100               1.8          100
Figure 9: ASM of the Controller.
Figure 10: Latency and Throughput of the process unit
Design type                        Artix 7 (xc7a200tfbg676-2)      Zynq 7 (xc7z020clg484-1)
                                   Slice LUTs   Slice registers    Slice LUTs   Slice registers
PU using non-pipelined MAC         531          779                531          779
PU using pipelined MAC             515          796                515          796
PU using proposed pipelined MAC    517          962                517          962
5.3 Throughput analysis
Throughput is the measure of the number of operations or bits generated per second.
Here, Mbps (million bits per second) is used as the unit for throughput
measurement. Throughput is therefore the number of bits generated divided by the
time required (in seconds):

\[
\text{Throughput} = \frac{\text{number of bits generated}}{\text{time required (s)}}
\]

Thus, here we can say that
2) From 2, 3, and 4, the number of clock pulses required for one complete convolution
depends on the filter size. Thus, the throughput depends only on the filter size,
and it is approximately inversely proportional to it for large filter sizes.
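As a sanity check of the tabulated figures, the following sketch applies the definition above to the first table entry. The 20-bit output word width is our assumption, chosen because it makes the tabulated latency and throughput values mutually consistent; it is not stated in the paper.

```python
def throughput_mbps(out_size, latency_ns, bits_per_output=20):
    """Throughput = bits generated / time required.

    bits_per_output = 20 is an assumed output word width (see lead-in).
    """
    bits = out_size * out_size * bits_per_output
    return bits / (latency_ns * 1e-9) / 1e6   # bits per second -> Mbps

# 32x32 picture, 3x3 filter -> 30x30 output, 86400 ns latency (non-pipelined MAC)
print(round(throughput_mbps(30, 86400), 2))   # ~208.33 Mbps, matching the table
```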
The throughput estimates for both boards are given below, and the average through-
put graph is shown in Figure 4.1.
Sl.  Pic Memory   Filter Size  Output Size   Latency for one convolution (ns), PU using:          Throughput (Mbps), PU using:
No.  size (AxA)   (fxf)        (OxO)         non-pipelined  pipelined   proposed pipelined        non-pipelined  pipelined  proposed pipelined
                                             MAC            MAC         MAC                       MAC            MAC        MAC
1    32 x 32      3 x 3        30 x 30       86400          81180       49950                     208.33         221.73     360.36
2    32 x 32      5 x 5        28 x 28       195686.4       173577.6    89924.8                   80.13          90.33      174.37
3    32 x 32      7 x 7        26 x 26       324480         282703.2    137566                    41.67          47.82      98.28
4    32 x 32      9 x 9        24 x 24       453427.2       392025.6    185414.4                  25.41          29.39      62.13
5    32 x 32      11 x 11      22 x 22       566860.8       488162.4    227431.6                  17.08          19.83      42.56
6    32 x 32      13 x 13      20 x 20       652800         560880      259000                    12.25          14.26      30.89
Sl.  Pic Memory   Filter Size  Output Size   Latency for one convolution (ns), PU using:          Throughput (Mbps), PU using:
No.  size (AxA)   (fxf)        (OxO)         non-pipelined  pipelined   proposed pipelined        non-pipelined  pipelined  proposed pipelined
                                             MAC            MAC         MAC                       MAC            MAC        MAC
1    32 x 32      3 x 3        30 x 30       108900         102960      64800                     165.29         174.83     277.78
2    32 x 32      5 x 5        28 x 28       246646.4       220147.2    116659.2                  63.57          71.23      134.41
3    32 x 32      7 x 7        26 x 26       408980         358550.4    178464                    33.06          37.71      75.76
4    32 x 32      9 x 9        24 x 24       571507.2       497203.2    240537.6                  20.16          23.17      47.89
5    32 x 32      11 x 11      22 x 22       714480.8       619132.8    295046.4                  13.55          15.63      32.81
6    32 x 32      13 x 13      20 x 20       822800         711360      336000                    9.72           11.25      23.81
6 Conclusion
This paper presents a real-time implementation of a high-throughput PU
that performs 2D convolution on Artix-7 and Zynq-7 FPGA boards. A
convolution PU is mostly constructed from MACs, or from multipliers and adders. In con-
trast to those techniques, in which more delay usually occurs, the adopted
technique decreases the total delay so that the system works with high
throughput. The proposed PU uses a five-stage Dadda multiplier and a CSlA
adder. The proposed PU improves the throughput by decreasing
the delay of the multiplier, and it is compared with the non-
pipelined MAC and the two-stage pipelined MAC. The results are promis-
ing for the proposed PU, although its latency is a bit higher than that of the two-stage
pipelined MAC. This proposed PU can be further improved by adding more pro-
posed MACs working in parallel to increase the throughput linearly. Regarding
the performance of the proposed PU, the throughput is approximately
inversely proportional to the filter size. The comparison shows that the
proposed PU is faster and has higher throughput than the other PUs, but the
higher latency of the proposed PU relative to the rest remains
a disadvantage.

The authors are thankful
to the editor and the anonymous reviewers for their helpful suggestions and
valuable comments throughout the review process, which have considerably
helped to improve the content of the paper.
References
[1] Lin Bai, Yecheng Lyu, and Xinming Huang. A unified hardware ar-
chitecture for convolutions and deconvolutions in cnn. 2020 IEEE In-
ternational Symposium on Circuits and Systems (ISCAS), pages 1–5,
2020.
[3] Arijit Bhadra and Suman Samui. Design and analysis of high-
throughput two-cycle multiply-accumulate (mac) architectures for fixed-
point arithmetic. In 2022 IEEE Calcutta Conference (CALCON), pages
267–272. IEEE, 2022.
[4] Srihari Cadambi, Abhinandan Majumdar, Michela Becchi, Srimat
Chakradhar, and Hans Peter Graf. A programmable parallel accelerator
for learning and classification. In 2010 19th International Conference on
Parallel Architectures and Compilation Techniques (PACT), pages 273–
283, 2010.
[9] Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. Eyeriss v2: A flexible
and high-performance accelerator for emerging deep neural networks.
CoRR, abs/1807.07928, 2018.
[10] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. Ey-
eriss: An energy-efficient reconfigurable accelerator for deep convolu-
tional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–
138, 2017.
[11] J. Cloutier, E. Cosatto, S. Pigeon, F.R. Boyer, and P.Y. Simard. Vip:
an fpga-based processor for image processing and neural networks. In
Proceedings of Fifth International Conference on Microelectronics for
Neural Networks, pages 330–336, 1996.
[14] Pudi Dhilleswararao, Srinivas Boppu, M. Sabarimalai Manikandan, and
Linga Reddy Cenkeramaddi. Efficient hardware architectures for accel-
erating deep neural networks: Survey. IEEE Access, 10:131788–131828,
2022.
[15] Clement Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eu-
genio Culurciello, and Yann Lecun. Neuflow: A runtime-reconfigurable
dataflow processor for vision. 06 2011.
[18] Hasan Irmak, Federico Corradi, Paul Detterer, Nikolaos Alachiotis, and
Daniel Ziener. A dynamic reconfigurable architecture for hybrid spiking
and convolutional fpga-based neural network designs. Journal of Low
Power Electronics and Applications, 11(3), 2021.
[19] Zheming Jin and Hal Finkel. Exploration of opencl 2d convolution ker-
nels on intel fpga, cpu, and gpu platforms. pages 4460–4465, Los Ange-
les, CA, USA, 2019. IEEE.
[20] Jump and Ahuja. Effective pipelining of digital systems. IEEE Trans-
actions on Computers, 100(9):855–865, 1978.
[21] Yun Liang, Liqiang Lu, Qingcheng Xiao, and Shengen Yan. Evaluating
fast algorithms for convolutional neural networks on fpgas. IEEE Trans-
actions on Computer-Aided Design of Integrated Circuits and Systems,
PP:1–1, 02 2019.
[23] Gangzhao Lu, Weizhe Zhang, and Zheng Wang. Optimizing gpu memory
transactions for convolution operations. pages 399–403, Kobe, Japan,
2020. IEEE.
[24] Mihir Mody, Manu Mathew, Shyam Jagannathan, Arthur Redfern, Ja-
son Jones, and Thorsten Lorenzen. Cnn inference: Vlsi architecture for
convolution layer for 1.2 tops. In 2017 30th IEEE International System-
on-Chip Conference (SOCC), pages 158–162, 2017.
[25] Maurice Peemen, Arnaud A. A. Setio, Bart Mesman, and Henk Cor-
poraal. Memory-centric accelerator design for convolutional neural net-
works. In 2013 IEEE 31st International Conference on Computer Design
(ICCD), pages 13–19, 2013.
[30] Jichen Wang, Jun Lin, and Zhongfeng Wang. Efficient convolution ar-
chitectures for convolutional neural network. In 2016 8th International
Conference on Wireless Communications & Signal Processing (WCSP),
pages 1–5, 2016.