
FPGA-based High Throughput Processing Unit for Efficiently Executing 2D Convolution

December 10, 2024

1 Abstract
This paper explores the vital role of convolution as a foundational element in
Deep Neural Networks (DNNs) for extracting meaningful features from input
data. The focus is on designing a real-time architecture for a high-throughput
processing unit (PU) dedicated to 2D convolution. Three distinct PU topolo-
gies are examined, progressing from a non-pipelined Multiply-Accumulate
(MAC) to a 2-staged pipelined MAC and culminating in a pipelined MAC
with multiple stages, systematically reducing convolution process delays. Be-
yond theoretical considerations, this paper evaluates resource utilization and
latency across different PU topologies on Artix-7 and Zynq-7 FPGA chips,
providing comparative insights using Xilinx Vivado software. Additionally,
the study extends to estimating and comparing throughput across various
kernel and picture sizes, offering real-world performance insights into differ-
ent MAC topologies. The thorough investigation successfully enhanced the
convolution process within the dynamic framework of DNNs, resulting in a
substantial 96% increase in throughput.

2 Introduction
The usage of deep learning and deep neural networks (DNNs) is skyrocketing
day by day in fields such as object detection, object recognition, semantic
segmentation, and medical imaging, because their accuracy, achieved in far
less time, can exceed human-level performance. Deep neural networks
(DNNs), a fundamental component of contemporary artificial intelligence, of-
fer a host of advantages across various applications, highlighting their adapt-
ability and efficiency in managing intricate tasks.
In the realm of image and object recognition [2], DNNs play a pivotal role,
propelling advancements in facial recognition, autonomous vehicles, and med-
ical image analysis through their capacity to learn intricate patterns from
data. Natural Language Processing (NLP) has experienced a transformative
leap with DNNs, particularly through transformer models like BERT and
GPT, empowering applications such as sentiment analysis, language transla-
tion, and chatbots with unprecedented accuracy. Speech recognition systems
heavily rely on DNNs, facilitating the development of virtual assistants, tran-
scription services, and voice-controlled devices, offering robust performance
in complex audio environments. DNNs contribute significantly to recommen-
dation systems, enhancing user experience by tailoring content suggestions
in areas like streaming services and e-commerce. In healthcare, these net-
works bring about breakthroughs by analyzing medical images for tasks like
tumor detection and disease classification, thereby improving diagnostic ac-
curacy. Autonomous vehicles leverage the strengths of DNNs for perception
tasks, ensuring robust object detection, precise lane-keeping, and effective
obstacle avoidance. Moreover, DNNs contribute to financial fraud detection,
excel in generative tasks like image synthesis, and prove valuable in anomaly
detection across diverse industries. Their applications extend to robotics,
drug discovery, and climate modeling. The dynamic adaptability of DNNs,
coupled with their ability to learn from vast and complex datasets, ensures
continuous exploration and adaptation across emerging domains, solidifying
their significance in shaping technological landscapes and ushering in un-
precedented advancements.
Due to these extensive applications in DNN, there is a keen interest among
people to utilize it across a wide range of applications. Despite encountering
various challenges, Deep Neural Networks (DNNs) continue to be a sub-
ject of interest and exploration. Overfitting remains a prevalent concern
as DNNs can become overly attuned to training data, compromising their
ability to generalize to new datasets. Data quality and quantity are cru-
cial, demanding extensive, well-labeled datasets, and the presence of biases
can skew model outcomes. The computational demands of training DNNs are
substantial, requiring powerful hardware and substantial time. Interpretabil-
ity is a persistent challenge due to the complex, non-linear nature of deep
neural networks, making it difficult to comprehend their decision-making
processes. Adversarial attacks pose security risks, exploiting vulnerabilities
in DNNs and causing misclassifications. Ethical considerations arise from
biases in training data, potentially leading to unfair outcomes, raising questions
about responsible AI deployment. Transfer learning faces challenges in
adapting pre-trained models to new tasks or domains. Training instability
is a concern, as small changes in conditions can yield significantly different
model performances. A limited theoretical understanding of why DNNs work
in certain situations hinders efforts to design more robust models. Lastly, de-
ploying DNNs in resource-constrained environments remains challenging due
to their high computational requirements. Addressing these multifaceted
challenges is crucial for unlocking the full potential of DNNs across various
applications.
Among these challenges, this paper mainly focuses on the computational
demands of Deep Neural Networks (DNNs), which have necessitated the
development of specialized hardware implementations to ensure efficient and
timely model training and inference. Traditional Central Processing Units
(CPUs) provide versatility but may lack the parallel processing power re-
quired for the massive matrix operations intrinsic to DNN computations.
Graphics Processing Units (GPUs) have gained prominence due to their
ability to parallelize tasks effectively, making them well-suited for train-
ing large models and processing extensive datasets simultaneously. Field-
Programmable Gate Arrays (FPGAs) offer flexibility, enabling the customiza-
tion of circuits to specific DNN architectures, which is advantageous for
evolving or optimizing neural network structures. Application-Specific Inte-
grated Circuits (ASICs) represent a pinnacle in specialized hardware, custom-
designed for particular algorithms, delivering exceptional speed and power
efficiency, albeit with higher development costs. The ongoing exploration
of hardware accelerators, such as Google’s Tensor Processing Units (TPUs),
showcases efforts to optimize DNN performance, emphasizing the need for
hardware architectures that balance computational power, energy efficiency,
and flexibility. As the field advances, these hardware implementations play
a pivotal role in harnessing the full potential of DNNs across a spectrum of
artificial intelligence applications.
Among them, Field-Programmable Gate Arrays (FPGAs) and Application-
Specific Integrated Circuits (ASICs) are favored choices for Deep Neural
Network (DNN) accelerators due to their distinctive features aligning with
the computational demands of neural networks. FPGAs offer a balance be-
tween flexibility and parallelism, making them adaptable to varying DNN
architectures. Their programmable nature allows for rapid prototyping and
reconfiguration, reducing development time and cost. Additionally, FPGAs
strike a compromise between energy efficiency and performance, making them
suitable for scenarios where the hardware needs to support different neural
network models efficiently.
On the other hand, ASICs are custom-designed for specific tasks, providing
unparalleled parallelism and energy efficiency. While the development
of ASICs is more time-consuming and expensive, they excel in performance,
making them ideal for large-scale deployments where dedicated hardware
optimized for a particular set of computations is paramount. The choice
between FPGAs and ASICs depends on factors such as the need for adapt-
ability, development time, energy efficiency, and the scale of deployment.
FPGAs are often preferred during the prototyping phase and in applications
where flexibility is crucial, while ASICs shine in situations demanding max-
imum performance and energy efficiency at scale.
In the realm of Deep Neural Networks (DNNs), 2D convolution stands
out as a paramount process, particularly within Convolutional Neural Net-
works (CNNs), which are a subclass of DNNs. This operation is integral for feature
extraction from spatial data, notably in image-related tasks. It entails slid-
ing a small filter over the input, performing element-wise multiplications and
aggregations to capture spatial hierarchies, patterns, and features. The sig-
nificance of 2D convolution lies in its inherent ability to autonomously learn
and extract meaningful representations from the input, enabling the discern-
ment of complex patterns and structures. This process is indispensable for
various applications in computer vision, including image recognition and ob-
ject detection. The hierarchical and localized nature of 2D convolution serves
as a cornerstone in the success of many deep learning applications, providing
a framework for CNNs to comprehend and interpret intricate features present
in the input data.
Convolutional Neural Networks (CNNs) offer several advantages over tra-
ditional fully connected deep neural networks (DNNs), especially in tasks involv-
ing grid-like data. CNNs capture spatial hierarchies effectively, recognizing
patterns at different levels of abstraction. They achieve this through param-
eter sharing, reducing the number of parameters and enhancing efficiency.
The concept of translation invariance, enabled by shared weights and pool-
ing layers, makes CNNs robust to variations in position, orientation, or scale
of objects in images. Additionally, local connectivity in CNNs allows them
to focus on capturing local patterns, contributing to their effectiveness in
computer vision applications. While CNNs excel in spatial tasks, the choice
between CNNs and DNNs depends on the nature of the data and the specific
requirements of the task at hand.
In the current work, the implementation of a high-throughput processing
unit for 2D convolution on an FPGA is presented. We have studied how a
pipelining technique can be used to provide high throughput and maximize
performance in the convolution processing unit by partitioning the critical path.
Increasing the throughput by means of pipelining in the convolution processing
unit distinguishes this study from other former studies. The main contributions
of this work are as follows:

i) This paper explored various convolution techniques that are used for
CNN accelerators. Based on our analysis, the MAC has been chosen for
the design of the convolution architecture. Thus, this paper proposes a
multi-staged MAC which can be used to construct a faster processing unit
for 2D convolution.

ii) To build a multi-staged MAC, we propose a multi-staged signed mul-
tiplier in which each stage delay is considerably less than that of the
adder.

iii) Finally, the proposed MAC and the other MACs are implemented on FPGA
boards, and these MACs are compared in terms of latency and through-
put to determine which MAC is best suited for the convolution layer.

This paper is organized into Background and Related Work, Proposed
Architecture, Experimental Results and Estimations, and finally the Conclusion.
The Background and Related Work section covers the convolution process, its
uses, its hardware implementation, related works, and the pipelining method
used in this paper. The Proposed Architecture section describes each module
in the datapath and the controller, with a description of each state in the FSM.
The Experimental Results and Estimations section presents the simulation and
synthesis results along with estimations for different cases in terms of
throughput and time consumption. Finally, the Conclusion concludes the paper.

3 Background and related work


3.1 Convolution Process

Figure 1: Example of the convolution process.

Convolution is a pivotal process in the field of deep neural networks because
it enables these networks to efficiently process and extract features from
complex data like images, audio, and even text. It also allows deep neural
networks to automatically learn and recognize intricate hierarchical patterns,
such as edges, textures, and object parts in images, making them highly effec-
tive for tasks like image recognition, object detection, and image generation.
The convolution process uses two matrices, an input and a filter/kernel, to
produce an output matrix. Here, the input matrix/map may be picture data,
audio data, text data, or processed picture/audio/text data, while the
filter/kernel holds the pattern of weights that is slid over the input.

Figure 2: Convolution Procedure.

In the convolution procedure, the filter/kernel is multiplied element-wise with
a section of the image data and the products are added to form one element of
the output image data. To understand the convolution process in more detail,
the following key points should be known.
Stride is the number of rows/columns by which the filter is moved to compute
the next neuron/output value.
Zero padding specifies how many outer layers of zeroes are added around the
image data before convolution; it is a countermeasure against the reduction in
size of the output image.

Here is how a 5x5 image is convolved with a 3x3 filter with a stride of 1.
First, the 3x3 filter is element-wise multiplied with the first 3x3 section of the
image, and the nine products are added to obtain the first output value. Then
the filter is moved by one column (since the stride is 1), and the multiplication
and addition are repeated to obtain the second value; this continues until the
last column value of the first output row is reached and the filter cannot be
moved further to the right. The filter is then moved back to the first column
and down by one row (since the stride is 1), producing the first output value of
the second row, and the process continues up to the last column value of the
second output row. This procedure continues up to the last column value of
the last output row, i.e. until the filter sits at the bottom-right corner of the
input image. The whole convolution procedure discussed above is shown in
Figure 2.
By taking stride and zero padding into consideration, the output size can be
determined using the formula below, where A is the input image size, ZP is the
zero padding, f is the filter/kernel size, and str is the stride.

    Output size = ⌊(A + 2 · ZP − f) / str⌋ + 1        (1)
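For reference, a minimal software model of formula (1) and of the sliding-window multiply-accumulate procedure described above is given below (a behavioral sketch in Python, not the hardware; the single-channel layout and the example values are assumptions for illustration).

```python
# Behavioral reference model of 2D convolution (not the hardware): slide an
# f x f kernel over an A x A image with the given stride and zero padding.
import numpy as np

def output_size(A, f, ZP=0, stride=1):
    # Formula (1): floor((A + 2*ZP - f) / stride) + 1
    return (A + 2 * ZP - f) // stride + 1

def conv2d(image, kernel, ZP=0, stride=1):
    A, f = image.shape[0], kernel.shape[0]
    O = output_size(A, f, ZP, stride)
    padded = np.pad(image, ZP)            # add ZP outer layers of zeroes
    out = np.zeros((O, O), dtype=np.int64)
    for r in range(O):                    # move the filter down, row by row
        for c in range(O):                # move the filter across, column by column
            window = padded[r * stride:r * stride + f, c * stride:c * stride + f]
            out[r, c] = np.sum(window * kernel)   # element-wise multiply, then add
    return out

# A 5x5 image convolved with a 3x3 filter at stride 1 gives a 3x3 output.
img = np.arange(25).reshape(5, 5)
ker = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
print(output_size(5, 3))   # 3
print(conv2d(img, ker))
```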

3.2 Usage of 2D convolution in DNN


As mentioned in the introduction, convolution is the basic process in a DNN,
as it is used in many different layers. Listed below are the layers in a
CNN/DNN in which the convolution procedure is used. Some of these layers
use convolution with small changes in the procedure, so they can also be
implemented using the proposed architecture, which is discussed in the next
section.

Figure 3: Time taken by the different layers in a DNN [13].

3.2.1 Convolution layer


The convolution layer is the pivotal layer in a DNN/CNN. It is used for many
purposes, such as reducing the image size, detecting the edges of objects in an
image, sharpening an image, and blurring an image. A convolution layer
consists of neurons which contain the convolved result of the kernel weights
and the image or main data. From Figure 3 we can see that the convolution
layer takes the maximum time in a DNN system.

3.2.2 Dilated convolution layer
Dilated Convolution[1] allows for an increased receptive field without increas-
ing the number of parameters or the amount of computation required. This
is achieved by introducing gaps or ”dilations” between the filter values of the
convolutional kernel as shown in Figure 2.8. The dilation factor controls the
size of these gaps, which can be thought of as holes in the filter. By increasing
the dilation factor, the receptive field of the layer increases, allowing for the
layer to capture features across larger spatial regions. Dilated convolutional
layers are mostly used to detect patterns in a CNN layer that are spatially
far apart.

3.2.3 Deconvolution layer


Deconvolutional layers, also known as transpose convolutional layers or up-
sampling layers, are a type of layer commonly used in Convolutional Neural
Networks (CNNs) for tasks such as image segmentation, image generation,
and image-to-image translation. The deconvolutional layer essentially per-
forms an inverse convolution operation, where the input feature map is trans-
formed into a larger output feature map with the same spatial dimensions as
the original input. The deconvolutional layer increases the resolution of the
feature map, making it possible to recover fine details that may have been
lost in previous pooling layers.

3.2.4 Fully connected layer


A fully connected layer is a miniature DNN, as shown in Figure 2.10. Since
the input nodes are much fewer, it is relatively easy to build hardware for
the fully connected layer. One way is to convolve the input with several
different kernels of the same size as the input.

3.3 Hardware implementation


There are many traditional software techniques which can perform these NN
operations using CPUs and GPUs. These techniques take a lot of time because
of the increasing network size of DNNs. In recent trends, GPUs are
comparatively slow next to FPGA and ASIC hardware, because the latter are
specially designed to execute these NN architectures. The hardware blocks
built to perform NN operations, called (hardware) accelerators, are added to
the system to boost performance.

The architecture of an accelerator should be well defined such that it can
perform multiple functions and operations quickly and precisely without any
structural hazards, and an accelerator architecture for DNNs should perform
all DNN functions and operations with higher performance than a normal CPU.

Figure 4: Basic architectures of Accelerators [14].

The basic architectures of accelerators commonly execute these functions and
operations quickly and precisely using a small multi-functional building unit
called a processing element (PE). The PE has multiple functionalities, and
these functionalities are controlled by a controller placed either inside or
outside the PE, depending on the type of structure. If the control unit,
register file, and computational unit are combined in the PE, that type of
architecture is called a spatial architecture; if the computational unit alone
forms the PE, with outside memory and control by one main controller, that
type of architecture is called a temporal architecture. Temporal architectures
can be seen in CPUs and GPUs, and spatial architectures in FPGA-based and
ASIC-based accelerators, as shown in the figure.

3.3.1 FPGA implementation


Among them, the FPGA has the greatest advantage because its functionality
can be changed at the user's will. FPGAs can therefore be used easily and are
available to users at fairly low cost due to their bulk production and the
smaller research effort required compared with other VLSI chip flows. On an
FPGA, the execution time is lower than on a GPU or CPU (refer to [19]);
there is also ongoing work on optimizing GPUs to perform the convolution
operation in less time [23]. Thus, the FPGA is preferred in this project. The
table below lists the architectures of some of the many accelerators for CNNs
and describes the convolution operation in those architectures. Note that the
PE in these CNN accelerators is a multiply-accumulate (MAC) unit; the MAC
is the basic and most used block, as a PE, in convolution architectures. In
[10] and [9], a large number of MACs are used to build an accelerator for
DNN applications.

SL. No. | Name                                     | Year | FPGA Platform    | Frequency (MHz) | Performance (GOPS) | Power (mW)
1       | VIP [11]                                 | 1996 | Altera EPF81500  | 16   | N/A  | N/A
2       | Parallel Coprocessor for CNN [27]        | 2009 | Virtex5 LX330T   | 115  | 6.74 | 0.61
3       | MAPLE [4]                                | 2010 | Virtex5 SX240T   | 125  | 7    | N/A
4       | DC-CNN [6]                               | 2010 | Virtex5 SX240T   | 120  | 16   | 1.14
5       | NeuFlow [15]                             | 2011 | Virtex6 VLX240T  | 200  | 147  | 14.7
6       | Memory-Centric Accelerator [25]          | 2013 | Virtex6 VLX240T  | 150  | 17   | N/A
7       | Dynamic reconfigurable Architecture [18] | 2022 | Xilinx Zynq 7020 | 200  | N/A  | N/A

The following table lists the I/O pins of the 2D convolution system described in Section 4:

Pin Name    | Input/Output | Operation
Go          | Input        | To start the convolution operation
Reset       | Input        | To restart the operation
Clock       | Input        | To run the system
Pic data    | Input        | To send picture/image data for convolution
Filter data | Input        | To send filter data for convolution
Done        | Output       | To indicate the operation completion
Conv data   | Output       | To receive final convoluted data

3.3.2 MAC
From Figure 2, we can see that the output data is nothing but the multipli-
cation and addition of image data and filter data. So, a module which can
multiply and accumulate the multiplied data is required. The MAC is that
module: it multiplies a series of data and accumulates the products with the
stored result. The operation of the MAC is as shown in fig 3.2.
The structure of the neuron shown in Figure 2 can be achieved by a MAC,
and from the above section on CNN layers we can see that all layers (except
the pooling layer and the ReLU layer) can also be designed with MACs. Thus
the MAC is essential in a DNN architecture. To increase the processing speed
of the MAC, we use pipelining. The details of the normal and pipelined
systems follow.
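As a point of reference, a minimal behavioral model of such a MAC (multiply, accumulate into a result register, clear before the next window) could look as follows; the operand values are only illustrative, and this is a software sketch rather than the RTL.

```python
# Minimal behavioral model of a MAC: result <= result + a*b, with a clear
# signal that restarts accumulation before the next convolution window.
class Mac:
    def __init__(self):
        self.result = 0            # models the result/accumulator register

    def clear(self):
        self.result = 0            # 'clear' from the controller

    def step(self, a, b):
        self.result += a * b       # multiply and accumulate in one step
        return self.result

# Accumulating one 3x3 window of image values against filter weights:
mac = Mac()
window  = [1, 2, 3, 4, 5, 6, 7, 8, 9]
weights = [1, 0, -1, 1, 0, -1, 1, 0, -1]
for x, w in zip(window, weights):
    mac.step(x, w)
print(mac.result)   # -6
mac.clear()         # ready for the next window
```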

3.4 Pipeline systems


The main concept in circuit pipelining is to split a bigger and harder task of
the operation process into smaller stages, which helps to enhance performance
by reducing the combinational critical path. The non-pipelined circuit is made
up of combinational logic, an input, and an output. This combinational logic
is partitioned into smaller portions of combinational logic with registers
connected in between to form a pipelined circuit [26]. Essentially, pipelining
allows the design to run at a higher operating frequency at a small cost in
latency caused by initializing the pipeline stages. So, basically, pipelining
breaks an existing slower task into multiple faster subtasks, each dependent on
the previous one, and lets them work in an overlapped fashion to decrease the
clock period. From Figure 2.11, the minimum clock period that should be
provided to the system is Tclk,min = Tcomb for the non-pipelined system and
Tclk,min = Tnew + Treg for the pipelined system, taking one clock cycle and
two clock cycles respectively. Thus the frequency increases with pipeline
stages, and the performance of the whole system increases with it.
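As a numerical illustration of this effect (with assumed combinational and register delays, not the synthesized values), the minimum clock period and maximum frequency of a non-pipelined datapath versus an N-stage pipelined one can be estimated as follows.

```python
# Estimate the minimum clock period and maximum frequency of a datapath
# before and after splitting its combinational logic into N pipeline stages.
def f_max_mhz(t_clk_ns):
    return 1e3 / t_clk_ns                 # 1 / Tclk, in MHz

T_comb = 10.0    # assumed total combinational delay of the task (ns)
T_reg  = 0.5     # assumed register overhead per stage (clk-to-q + setup, ns)
N      = 5       # number of pipeline stages

t_nonpipe = T_comb                        # Tclk,min = Tcomb
t_pipe    = T_comb / N + T_reg            # Tclk,min = Tnew + Treg, Tnew ~ Tcomb / N

print(f"non-pipelined: Tclk = {t_nonpipe:.2f} ns, fmax = {f_max_mhz(t_nonpipe):.1f} MHz")
print(f"{N}-stage pipeline: Tclk = {t_pipe:.2f} ns, fmax = {f_max_mhz(t_pipe):.1f} MHz")
# The pipelined design pays a few extra cycles of latency to fill the pipeline,
# but each cycle is much shorter, so overall throughput improves.
```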

3.5 Related/similar projects and other techniques/methods

There is ongoing research on different algorithms for the convolution layer in
FPGA accelerators [12, 33]; for a more elaborate discussion, readers are
encouraged to go through them. [21] uses the Winograd algorithm to calculate
an array of output-matrix values at once instead of calculating them one by
one. [5] describes three fast memory-shifting and convolution architectures
that decrease memory usage for low-cost FPGA implementation. [24]
synthesized a convolution layer with an array of MACs to boost the
convolution process to 1.2 TOPS. [30] created an efficient hardware
architecture using the parallel fast finite impulse response (FIR) algorithm
(FFA) for CNN implementation. [17] proposed a PU with fewer MAC blocks
that also consumes fewer clock cycles for a convolution process. [8] proposed
an architecture containing multipliers and adders that decreases the resources
used in the convolution process by up to 37.8%. [28] proposed efficient
hardware using a few multiplexers, ALU blocks, and control blocks to increase
the operating frequency; the [28] system was further improved by 43.11% in
power and area by [16].

4 Proposed Architecture
To validate the functionality of the 2D convolution PU, we designed a full
2D convolution system with the 2D convolution PU as its heart. This system
needs the input and output pins represented in the block diagram shown in
figure 3.1.
The system mainly consists of the datapath and the controller. The datapath
deals with storing and processing data and contains the memory unit and the
processing unit, whereas the controller operates the datapath by sending
control signals after receiving status signals from the datapath as feedback.

4.1 Datapath
As previously stated, the datapath unit generates the output by taking the
input data and processing it. The datapath generates status signals to inform
the controller of its state and is driven by the control signals from the
control unit. It contains a few sub-units that are used to store and process
the data, which are as follows:

• Memory unit

• Address generator

• Status generator

• Process unit

Among these sub-units, the memory unit is used to store the data, the process
unit is used to process the data, the address generators choose the data
locations in order to store or process the data, and the status generator sends
the status signals to the controller. More about these sub-units follows.

4.1.1 Memory Unit


In the memory unit, there are three memories: pic memory, filter memory,
and output memory. The pic memory is used to store the image data, the
filter memory is used to store the weights for the convolution process, and the
output memory is used to store the final results after the convolution process.
These memories are N x N vector/2D memories, where N should be the
maximum supported size and should be selected with the zero-padding
extension in mind; for the output memory, the size is calculated as shown in
formula (1). The memory unit uses a reset signal to empty the stored data in
the memories and a load signal for loading input or processed data. Loading
takes the row and column values from the address generator; these row and
column values are used to access particular cells in a memory.

4.1.2 Address Generator


The address generator is used to generate the row and column values to
access a particular cell in the memories. The address generators are basically
counters with a load/count signal for incrementing and a reset signal for
resetting the row/column value. Initially, the address generator is used to
generate addresses for loading the data from the input. For 2D convolution
processing, however, the filter shifting makes it difficult to generate addresses
with the normal row and column counters alone. Thus, we use extra counters
to generate the shift-row and shift-column values. These shift values are used
to shift the filter over the image for the next convolution (MAC operation).
The shift values added to the normal row/column values form the address for
the pic memory (only during processing), while the shift values and the
normal address values form the addresses for the output and filter memories
respectively. The address generators also take the zero padding and stride
values into account when generating addresses.
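A small sketch of this addressing scheme, as we read it from the description above (the names are illustrative rather than those of the RTL, and stride 1 with no zero padding is assumed):

```python
# Sketch of the addressing scheme during processing (stride 1, no zero padding):
# (shift_row, shift_col) is the filter position, (row, col) the position inside it.
def addresses(row, col, shift_row, shift_col, stride=1):
    pic_addr    = (shift_row * stride + row, shift_col * stride + col)  # pic memory
    filter_addr = (row, col)                                            # filter memory
    out_addr    = (shift_row, shift_col)                                # output memory
    return pic_addr, filter_addr, out_addr

# Cells touched while convolving the window at output position (2, 1), 3x3 filter:
for row in range(3):
    for col in range(3):
        pic, flt, out = addresses(row, col, shift_row=2, shift_col=1)
        print(pic, flt, out)
```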

4.1.3 Status Generators


Status generators are used to generate the status signals which convey the
state of the datapath unit to the controller unit. A status generator is
basically a comparator which compares a generated value with a certain
maximum value and asserts the status signal, i.e. the generated row, column,
shift-row and/or shift-column values are compared against the corresponding
maximum row, column, shift-row and/or shift-column values for the pic, filter,
or output memories. The maximum shift values can be determined from the
maximum row and column values, the zero padding, and the stride.

4.1.4 Process Unit(PU)


Finally, the PU is the heart of the datapath unit: it processes the input data
and produces the output data/result. The PU uses a MAC (or an array of
MACs) to process the data, as mentioned in the previous section. The MAC,
or multiply-accumulator, contains two parts: a multiplier and an accumulator
(adder). The MAC gets its data from the pic memory and the filter memory
based on the address generator, multiplies them, and accumulates the product
with the previously stored value held in the result register. The accumulation
continues until a reset from the user or a clear signal from the controller
arrives. Thus, the MAC uses the reset or clear signal to reset the result
register so that the accumulator can start over for the next convolution
window; before the result register is reset, its value is stored in the output
memory cell addressed by the shift values.

• MAC topologies
As shown in Figure 5, these are the different topologies used to compare
the speed and performance of the MAC. For the multiplication, a signed
Dadda multiplier is used, and for the addition, a CSlA (carry select adder)
is used.

Figure 5: Topologies of MAC used in the Process Unit (PU).

– Signed Dadda Multiplier [9]

This proposed methodology uses the Dadda multiplier [7] because it
needs fewer adders for the multiplication operation, it can easily be
pipelined into a small number of stages, and each stage contains
independent additions of bits. The Dadda multiplier is a strategic way
of multiplying two binary numbers using half adders and full adders in
a small number of stages, as shown in Figure 6. But this convolution
requires the multiplication of a positive number with a negative/positive
number; thus, the signed Dadda multiplier is used for this kind of
operation.
– Carry Select Adder [10]

The proposed model uses the adder topology shown in Figure 6. This
carry select adder is used two times: once inside the Dadda multiplier
and once in the accumulator. The carry select adder is used because it
can add two 16-bit numbers faster than the other adders [29]. The carry
select adder computes each 4-bit addition both with and without an
incoming carry and selects between the two using the carry generated
from the previous addition, as shown in Figure 6; for the 4-bit additions
a carry-lookahead adder is used because it is the fastest 4-bit adder. The
addition with a carry of one is generated by the BEC (binary to excess-1
converter), which increments the carry-zero sum, as shown in the figure.
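To make the carry-selection idea concrete, here is a minimal bit-accurate software sketch of a 16-bit carry select adder built from 4-bit groups (a behavioral golden model, not the gate-level design; the plain +1 stands in for the BEC hardware).

```python
# Behavioral golden model of a 16-bit carry select adder built from 4-bit groups.
# Each group's sum is computed assuming carry-in = 0; the carry-in = 1 result is
# obtained by adding 1 (the role the BEC plays in hardware), and the real carry
# from the previous group selects between the two.
def csla16(a, b):
    carry = 0                                 # carry into the least significant group
    result = 0
    for g in range(4):                        # 4-bit groups, least significant first
        ga = (a >> (4 * g)) & 0xF
        gb = (b >> (4 * g)) & 0xF
        sum0 = ga + gb                        # addition assuming carry-in = 0
        sum1 = sum0 + 1                       # addition assuming carry-in = 1 (BEC path)
        chosen = sum1 if carry else sum0      # multiplexer driven by the previous carry
        result |= (chosen & 0xF) << (4 * g)
        carry = (chosen >> 4) & 1             # carry out of this group
    return result & 0xFFFF, carry             # 16-bit sum and final carry out

s, cout = csla16(0x1234, 0x0FFF)
print(hex(s), cout)    # 0x2233 0
s, cout = csla16(0xFFFF, 0x0001)
print(hex(s), cout)    # 0x0 1
```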

Figure 6: Signed Dadda multiplier and Adder topologies used in the proposed
methodology.

4.2 Controller
A controller is the brain of the total system; it generates the control signals
using the status signals. The controller is an FSM (finite state machine) that
changes the state of the entire system based on the status signals coming
from the datapath, and each state is performed in one cycle.

• ASM

This controller is designed using the following ASM algorithm, shown below;
a short software sketch of this control flow is given after the list.

1. load pic data into pic memory

2. load filter data into filter memory

3. read both data from the pic and filter memories and do the MAC operation

4. load result to output memory and shift column of pic address

5. go to 3 if the column is not filled else to 6

6. shift row of pic address and reset the column of pic address

7. go to 3 if the output rows are not filled else to 8

8. send output to output pins
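Below is a minimal software rendering of this ASM flow, assuming stride 1, no zero padding, and already-loaded memories; it is a behavioral sketch, not the controller RTL.

```python
# Software walk-through of the ASM: memories are assumed already loaded
# (steps 1-2); for every output position run f*f MAC steps, store the result
# (steps 3-7), then stream the output out (step 8).
def run_system(pic, flt):
    A, f = len(pic), len(flt)
    O = A - f + 1                              # output size for stride 1, no padding
    out = [[0] * O for _ in range(O)]
    for sr in range(O):                        # shift row of the pic address
        for sc in range(O):                    # shift column of the pic address
            acc = 0                            # clear the result register
            for r in range(f):                 # step 3: MAC over the f x f window
                for c in range(f):
                    acc += pic[sr + r][sc + c] * flt[r][c]
            out[sr][sc] = acc                  # step 4: load result to output memory
    return out                                 # step 8: send output to output pins

pic = [[1, 2, 3, 0],
       [4, 5, 6, 0],
       [7, 8, 9, 0],
       [0, 0, 0, 0]]
flt = [[1, 0],
       [0, 1]]
print(run_system(pic, flt))   # 3x3 output for a 4x4 image and 2x2 filter
```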

• FSM

Figure 7: ASM of the Controller.

Figure 8: Flow chart of the Controller FSM.

And the FSM is as follows:

• Idle state: unselect all memories and do not increment any address

• If go=1 then go to 2 else go to 1
• Select pic memory to write and increment the column address
• If pic-row and pic-col completed go to 4, else if pic-col = 1 go to 3, else
go to 2
• Increment row address and clear column address
• Clear row and column address and unselect pic memory
• Select filter memory to write and increment the column address
• If filter-row and filter-col completed go to 4, else if filter-col = 1 go to 3,
else go to 2
• Increment row address and clear column address
• Clear row and column address and unselect filter memory
• Select result register to write and increment the column address
• If pic-row and pic-col completed go to 4, else if pic-col = 1 go to 3, else
go to 2
• Increment row address and clear column address
• Clear row and column address and unselect result register
• If sft-row and sft-col completed go to 4, else if sft-col = 1 go to 3, else go
to 2
• Select output memory to write and increment the sft-column address
• Select output memory to write, increment sft-row address and clear
sft-column address
• Unselect output memory and clear sft-row and sft-column address
• Increment column address
• If output-row and output-col completed go to 4, else if output-col = 1
go to 3, else go to 2
• Increment row address and clear column address

• Clear row and column address

• Reset all memories and registers to their initial state.

Here, the MAC operation is done differently in the different MAC topologies,
so additional stall cycles were needed when pipelining was implemented, in
line with [26]. All the 'complete' signals are generated by the status
generator, while increment signals are generated to control the address
generators and memory-select signals to select the memories.

5 Experimental Results and Estimations


The architecture discussed above is coded in Verilog with an input image size
of 6x6 and a kernel size of 3x3, so from formula (1) the output size is 4x4.
The code is simulated and executed in Xilinx's Vivado Design Suite. From the
simulation results, we observe that the system consumes the following numbers
of clock cycles to complete one convolution process.

• for the non-pipelined system:

  Cycles = 1 (reset) + 1 (go) + A·A + f·f + [f·f + 1 (clear stall)] · O·O + O·O        (2)

• for the pipelined system:

  Cycles = 1 (reset) + 1 (go) + A·A + f·f + [f·f + 1 (pipeline stall) + 1 (clear stall)] · O·O + O·O        (3)

• for the proposed pipelined system:

  Cycles = 1 (reset) + 1 (go) + A·A + f·f + [f·f + 5 (pipeline stalls) + 1 (clear stall)] · O·O + O·O        (4)

Here, A·A and f·f are the cycles required to load the picture and the filter
data into memory, O·O cycles are used to send the result from the output
memory to the output pins, and the remaining clock cycles are used by the
processing unit to compute and load the results into the output memory.
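A small helper that evaluates equations (2)-(4) for given image, filter, and output sizes (a sketch that simply transcribes the expressions above):

```python
# Clock-cycle count for one full convolution run, per equations (2)-(4):
# reset + go + load image (A*A) + load filter (f*f)
# + per-output MAC cycles (f*f + stalls + clear) * O*O + stream out (O*O).
def cycles(A, f, O, pipeline_stalls):
    return 1 + 1 + A * A + f * f + (f * f + pipeline_stalls + 1) * O * O + O * O

A, f = 6, 3
O = (A - f) // 1 + 1                  # formula (1) with ZP = 0, stride = 1 -> 4
print(cycles(A, f, O, 0))             # non-pipelined MAC, eq. (2)
print(cycles(A, f, O, 1))             # 2-stage pipelined MAC, eq. (3)
print(cycles(A, f, O, 5))             # proposed multi-stage MAC, eq. (4)
```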
Then, synthesis of this architecture is done for the Artix-7 AC701 Evaluation
Platform (xc7a200tfbg676-2) and the ZYNQ-7 ZC702 Evaluation Board
(xc7z020clg484-1), and the results are as follows.

MAC Component Name       | Artix-7 Delay (ns) | Artix-7 Slices (LUT) | Zynq-7 Delay (ns) | Zynq-7 Slices (LUT)
Adder                    | 6.6                | 41                   | 7.9               | 41
Dadda Multiplier         | 8.8                | 102                  | 10.3              | 102
5-Stage Dadda Multiplier | 1.4                | 100                  | 1.8               | 100

Table 1: Performance comparison of MAC components on the Artix-7 (xc7a200tfbg676-2) and Zynq-7 (xc7z020clg484-1) FPGA platforms.

5.1 MAC synthesis results


The latency of the MAC topologies used in the PU is shown in Figure 10(a).
The utilization and delay results from the synthesis of the MAC components
are shown in Table 1.
From Table 1, we can clearly state that the 5-stage Dadda multiplier has a
smaller delay than the normal Dadda multiplier and will therefore make the
MAC operation much faster.

5.2 Convolution layer results


Coming to the timing results of the convolution layer architecture, the system
achieved minimum clock periods of 9.6 ns, 8.2 ns, and 3.7 ns for the
non-pipelined, pipelined, and proposed pipelined topologies respectively on the
Artix-7, whereas on the Zynq-7 the minimum clock periods for the
non-pipelined, pipelined, and proposed pipelined topologies are 12.1 ns, 10.4
ns, and 4.8 ns respectively. The utilization results are given in Table 2.

Figure 9: ASM of the Controller.

Figure 10: Latency and Throughput of the process unit.
Design type                     | Artix-7 Slice LUTs (of 134600) | Artix-7 Slice registers (of 269200) | Zynq-7 Slice LUTs (of 53200) | Zynq-7 Slice registers (of 106400)
PU using non-pipelined MAC      | 531 | 779 | 531 | 779
PU using pipelined MAC          | 515 | 796 | 515 | 796
PU using proposed pipelined MAC | 517 | 962 | 517 | 962

Table 2: Resource utilization of the convolution PU on the Artix-7 (xc7a200tfbg676-2) and Zynq-7 (xc7z020clg484-1) platforms.
5.3 Throughput analysis
Throughput is the measure of the number of operations or bits generated per
second. Here, we use Mbps (million bits per second) as the unit of
throughput, i.e. throughput = (number of output bits generated) / (time
required in seconds). Thus, here we can say that

1) Throughput is independent of the output array size, i.e. it does not
depend on the input image size.

2) From equations (2), (3) and (4), the number of clock pulses required for
one output of the convolution operation depends on the filter size. Thus,
throughput depends only on the filter size, and is roughly inversely
proportional to it for large filter sizes.
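A rough throughput estimator following this reasoning (a sketch only: the per-output cycle count is taken from the bracketed terms of equations (2)-(4), and the output word width used here is an assumption rather than a figure stated in the text):

```python
# Rough throughput estimate in Mbps: bits produced per output value, divided by
# the time needed to compute it (per-output cycles from eqs. (2)-(4) times the
# minimum clock period). The 20-bit output word width is an assumption.
def throughput_mbps(f, stalls, t_clk_ns, bits_per_output=20):
    cycles_per_output = f * f + stalls + 1     # bracketed term of eqs. (2)-(4)
    time_ns = cycles_per_output * t_clk_ns
    return bits_per_output / time_ns * 1e3     # bits per ns -> Mbps

# Proposed pipelined MAC on Artix-7 (3.7 ns clock), 3x3 filter:
print(round(throughput_mbps(3, 5, 3.7), 2))    # ~360 Mbps under these assumptions
# Non-pipelined MAC on Artix-7 (9.6 ns clock), 3x3 filter:
print(round(throughput_mbps(3, 0, 9.6), 2))    # ~208 Mbps under these assumptions
```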

The throughput estimations for both boards are given in Tables 3 and 4
below, and the average throughput graph is shown in Figure 10.

Sl. No. | Pic memory size (A x A) | Filter size (f x f) | Output size (O x O) | Latency, non-pipelined MAC | Latency, pipelined MAC | Latency, proposed MAC | Throughput, non-pipelined MAC | Throughput, pipelined MAC | Throughput, proposed MAC
1 | 32 x 32 | 3 x 3   | 30 x 30 | 86400    | 81180    | 49950    | 208.33 | 221.73 | 360.36
2 | 32 x 32 | 5 x 5   | 28 x 28 | 195686.4 | 173577.6 | 89924.8  | 80.13  | 90.33  | 174.37
3 | 32 x 32 | 7 x 7   | 26 x 26 | 324480   | 282703.2 | 137566   | 41.67  | 47.82  | 98.28
4 | 32 x 32 | 9 x 9   | 24 x 24 | 453427.2 | 392025.6 | 185414.4 | 25.41  | 29.39  | 62.13
5 | 32 x 32 | 11 x 11 | 22 x 22 | 566860.8 | 488162.4 | 227431.6 | 17.08  | 19.83  | 42.56
6 | 32 x 32 | 13 x 13 | 20 x 20 | 652800   | 560880   | 259000   | 12.25  | 14.26  | 30.89

Table 3: Latency (ns) for one convolution and throughput (Mbps) on the Artix-7.
Sl. No. | Pic memory size (A x A) | Filter size (f x f) | Output size (O x O) | Latency, non-pipelined MAC | Latency, pipelined MAC | Latency, proposed MAC | Throughput, non-pipelined MAC | Throughput, pipelined MAC | Throughput, proposed MAC
1 | 32 x 32 | 3 x 3   | 30 x 30 | 108900   | 102960   | 64800    | 165.29 | 174.83 | 277.78
2 | 32 x 32 | 5 x 5   | 28 x 28 | 246646.4 | 220147.2 | 116659.2 | 63.57  | 71.23  | 134.41
3 | 32 x 32 | 7 x 7   | 26 x 26 | 408980   | 358550.4 | 178464   | 33.06  | 37.71  | 75.76
4 | 32 x 32 | 9 x 9   | 24 x 24 | 571507.2 | 497203.2 | 240537.6 | 20.16  | 23.17  | 47.89
5 | 32 x 32 | 11 x 11 | 22 x 22 | 714480.8 | 619132.8 | 295046.4 | 13.55  | 15.63  | 32.81
6 | 32 x 32 | 13 x 13 | 20 x 20 | 822800   | 711360   | 336000   | 9.72   | 11.25  | 23.81

Table 4: Latency (ns) for one convolution and throughput (Mbps) on the Zynq-7.
6 Conclusion
This paper presents a real-time implementation of a high-throughput PU that
performs 2D convolution, on Artix-7 and Zynq-7 FPGA boards. A convolution
PU is usually constructed from MACs, or from multipliers and adders. In
contrast to those techniques, in which the delay is usually larger, the adopted
technique decreases the total delay so that the system works at high
throughput. The proposed PU uses a five-stage Dadda multiplier and a CSlA
adder as its components, and it boosts throughput by decreasing the delay of
the multiplier. The proposed PU is compared with the non-pipelined MAC
and the two-stage pipelined MAC; the results are promising, although its
latency is somewhat higher than that of the two-stage pipelined MAC. The
proposed PU can be improved further by adding more of the proposed MACs
working in parallel to increase the throughput linearly. Regarding the
performance of the proposed PU, the throughput is approximately inversely
proportional to the filter size. The comparison shows that the proposed PU is
faster and has higher throughput than the other PUs, but its higher latency
remains a disadvantage.

The authors are thankful to the editor and the anonymous reviewers for their
helpful suggestions and valuable comments throughout the review process,
which have considerably helped to improve the content of the paper.
References
[1] Lin Bai, Yecheng Lyu, and Xinming Huang. A unified hardware ar-
chitecture for convolutions and deconvolutions in cnn. 2020 IEEE In-
ternational Symposium on Circuits and Systems (ISCAS), pages 1–5,
2020.

[2] K. Benkrid and S. Belkacemi. Design and implementation of a 2d con-


volution core for video applications on fpgas. In Third International
Workshop on Digital and Computational Video, 2002. DCV 2002. Pro-
ceedings., pages 85–92, Clearwater Beach, FL, USA, 2002. IEEE.

[3] Arijit Bhadra and Suman Samui. Design and analysis of high-
throughput two-cycle multiply-accumulate (mac) architectures for fixed-
point arithmetic. In 2022 IEEE Calcutta Conference (CALCON), pages
267–272. IEEE, 2022.

[4] Srihari Cadambi, Abhinandan Majumdar, Michela Becchi, Srimat
Chakradhar, and Hans Peter Graf. A programmable parallel accelerator
for learning and classification. In 2010 19th International Conference on
Parallel Architectures and Compilation Techniques (PACT), pages 273–
283, 2010.

[5] F. Cardells-Tormo, P.-L. Molinet, J. Sempere-Agullo, L. Baldez, and


M. Bautista-Palacios. Area-efficient 2d shift-variant convolvers for fpga-
based digital image processing. In International Conference on Field
Programmable Logic and Applications, 2005., pages 578–581, 2005.

[6] Srimat Chakradhar, Murugan Sankaradass, Venkata Jakkula, and Hari


Cadambi. A dynamically configurable coprocessor for convolutional neu-
ral networks. pages 247–257, 06 2010.

[7] Manash Chanda, Sankalp Jain, and Anup Dandapat. Implementation of


modified low-power 8×8 signed Dadda multiplier. International Journal
on Electronic & Electrical Engineering, 11:8–14, 12 2010.

[8] Jing Chang and Sha Jin. An efficient implementation of 2d convolution


in cnn. IEICE Electronics Express, 14:20161134–20161134, 01 2017.

[9] Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. Eyeriss v2: A flexible
and high-performance accelerator for emerging deep neural networks.
CoRR, abs/1807.07928, 2018.

[10] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. Ey-
eriss: An energy-efficient reconfigurable accelerator for deep convolu-
tional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–
138, 2017.

[11] J. Cloutier, E. Cosatto, S. Pigeon, F.R. Boyer, and P.Y. Simard. Vip:
an fpga-based processor for image processing and neural networks. In
Proceedings of Fifth International Conference on Microelectronics for
Neural Networks, pages 330–336, 1996.

[12] Ben Cope et al. Implementation of 2d convolution on fpga, gpu and


cpu. Imperial College Report, pages 2–5, 2006.

[13] Dimitrios Danopoulos, Christoforos Kachris, and Dimitrios Soudris. Ac-


celeration of image classification with caffe framework using fpga. In
2018 7th International Conference on Modern Circuits and Systems
Technologies (MOCAST), pages 1–4, 2018.

[14] Pudi Dhilleswararao, Srinivas Boppu, M. Sabarimalai Manikandan, and
Linga Reddy Cenkeramaddi. Efficient hardware architectures for accel-
erating deep neural networks: Survey. IEEE Access, 10:131788–131828,
2022.

[15] Clement Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eu-
genio Culurciello, and Yann Lecun. Neuflow: A runtime-reconfigurable
dataflow processor for vision. 06 2011.

[16] K. Taraka Ganesh, B. Venkata Sujith Kumar, B. Sai Mihiraamsh,


G. Akhil, V. Ravitej, and Senthil Murugan. Low power and single mul-
tiplier design for 2d convolutions. In 2021 Second International Confer-
ence on Electronics and Sustainable Communication Systems (ICESC),
pages 1957–1964, Coimbatore, India, 2021. IEEE.

[17] Anakhi Hazarika, Soumyajit Poddar, and Hafizur Rahaman. Hardware


efficient convolution processing unit for deep neural networks. In 2019
2nd International Symposium on Devices, Circuits and Systems (IS-
DCS), pages 1–4, 2019.

[18] Hasan Irmak, Federico Corradi, Paul Detterer, Nikolaos Alachiotis, and
Daniel Ziener. A dynamic reconfigurable architecture for hybrid spiking
and convolutional fpga-based neural network designs. Journal of Low
Power Electronics and Applications, 11(3), 2021.

[19] Zheming Jin and Hal Finkel. Exploration of opencl 2d convolution ker-
nels on intel fpga, cpu, and gpu platforms. pages 4460–4465, Los Ange-
les, CA, USA, 2019. IEEE.

[20] Jump and Ahuja. Effective pipelining of digital systems. IEEE Trans-
actions on Computers, 100(9):855–865, 1978.

[21] Yun Liang, Liqiang Lu, Qingcheng Xiao, and Shengen Yan. Evaluating
fast algorithms for convolutional neural networks on fpgas. IEEE Trans-
actions on Computer-Aided Design of Integrated Circuits and Systems,
PP:1–1, 02 2019.

[22] Yihua Liao. Neural networks in hardware: A survey. Department of


Computer Science, University of California, 2001.

[23] Gangzhao Lu, Weizhe Zhang, and Zheng Wang. Optimizing gpu memory
transactions for convolution operations. pages 399–403, Kobe, Japan,
2020. IEEE.

[24] Mihir Mody, Manu Mathew, Shyam Jagannathan, Arthur Redfern, Ja-
son Jones, and Thorsten Lorenzen. Cnn inference: Vlsi architecture for
convolution layer for 1.2 tops. In 2017 30th IEEE International System-
on-Chip Conference (SOCC), pages 158–162, 2017.

[25] Maurice Peemen, Arnaud A. A. Setio, Bart Mesman, and Henk Cor-
poraal. Memory-centric accelerator design for convolutional neural net-
works. In 2013 IEEE 31st International Conference on Computer Design
(ICCD), pages 13–19, 2013.

[26] C. V. Ramamoorthy and H. F. Li. Pipeline architecture. ACM Comput.


Surv., 9(1):61–102, March 1977.

[27] Murugan Sankaradas, Venkata Jakkula, Srihari Cadambi, Srimat


Chakradhar, Igor Durdanovic, Eric Cosatto, and Hans Peter Graf. A
massively parallel coprocessor for convolutional neural networks. In 2009
20th IEEE International Conference on Application-specific Systems,
Architectures and Processors, pages 53–60, 2009.

[28] Manupoti Sreenivasulu and T. Meenpal. Efficient hardware implemen-


tation of 2d convolution on fpga for image processing application. pages
1–5, Coimbatore, India, 2019. IEEE.

[29] Ramadass Uma, Vidya Vijayan, M Mohanapriya, and Sharon Paul.


Area, delay and power comparison of adder topologies. International
Journal of VLSI Design & Communication Systems, 3(1):153, 2012.

[30] Jichen Wang, Jun Lin, and Zhongfeng Wang. Efficient convolution ar-
chitectures for convolutional neural network. In 2016 8th International
Conference on Wireless Communications & Signal Processing (WCSP),
pages 1–5, 2016.

