CHAPTER 1
INTRODUCTION
1.1 Introduction:
As transistor density continues to grow exponentially, the limited power budget allows only a small fraction of the transistors to be active at any time, a phenomenon referred to as dark silicon. Dark silicon forces us to trade silicon area for energy. Specialized hardware acceleration has emerged as an effective technique to mitigate dark silicon, as it delivers up to several orders of magnitude better energy efficiency than general-purpose processors. Heading towards the big data era, a key challenge in the design of hardware accelerators is how to efficiently transfer data between the memory hierarchy and the accelerators, especially when targeting emerging data-intensive applications (e.g., key-value stores, graph databases, etc.).
However, with the increasing accuracy requirements and complexity of practical applications, the size of neural networks has become explosively large, such as the Baidu Brain with 100 billion neuronal connections and the Google cat-recognizing system with 1 billion neuronal connections. The explosive volume of data makes data centers quite power consuming. In particular, the electricity consumption of data centers in the U.S. is projected to increase to roughly 140 billion kilowatt-hours annually by 2020.
Therefore, it poses significant challenges to implement high-performance deep learning networks at low power cost, especially for large-scale deep learning neural network models. So far, the state-of-the-art means for accelerating deep learning algorithms are Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), and Graphics Processing Units (GPUs). Compared with GPU acceleration, hardware accelerators like FPGAs and ASICs can achieve at least moderate performance with lower power consumption. However, both FPGAs and ASICs have relatively limited computing resources, memory, and I/O bandwidth, so it is challenging to develop complex and massive deep neural networks on hardware accelerators. ASICs also have a longer development cycle, and their flexibility is not satisfying. Chen et al. present a ubiquitous machine-learning hardware accelerator called DianNao, which opens a new paradigm for machine learning hardware accelerators focusing on neural networks. However, DianNao is not implemented on reconfigurable hardware like FPGAs, so it cannot adapt to different application demands. Among current FPGA acceleration research, Ly and Chow designed FPGA-based solutions to accelerate the Restricted Boltzmann Machine (RBM). They created dedicated hardware processing cores optimized for the RBM algorithm. Similarly, Kim et al. also developed an FPGA-based accelerator for the restricted Boltzmann machine.
They use multiple RBM processing modules in parallel, with each module responsible for a relatively small number of nodes. Other similar works also present FPGA-based neural network accelerators. Qi et al. present an FPGA-based accelerator, but it cannot accommodate changing network sizes and topologies. To sum up, these studies focus on implementing a particular deep learning algorithm efficiently, but how to increase the size of the neural networks with a scalable and flexible hardware architecture has not been properly solved.
1.2 Methodology:
In this work, the HDL Designer tool is used to implement the DLAU circuits, and the design is coded in the Verilog language. In the implementation, an image input module is designed in the first stage; this image module is built around a counter. An FSM control module is then designed in which the states are selected by key values: for every key value a different state is assigned, and the states change according to the key values. For the different states, different wait cycles are assigned according to the keyed-in values. ModelSim software is used to obtain the simulation results.
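As an illustration of the control style just described, the following is a minimal Verilog sketch of a key-controlled FSM in which each key value selects a state with its own number of wait cycles. The module name key_fsm, the key width, and the particular wait counts are illustrative assumptions and not the exact DLAU control design.

`timescale 1ns/10ps
module key_fsm (
    input  wire       clk,
    input  wire       rst,
    input  wire [1:0] key_in,   // key value selecting the next state
    output reg        busy      // high while the selected wait cycles elapse
);
    localparam IDLE  = 2'd0,
               WAIT1 = 2'd1,
               WAIT2 = 2'd2;

    reg [1:0] state;
    reg [3:0] wait_cnt;

    always @(posedge clk) begin
        if (rst) begin
            state    <= IDLE;
            wait_cnt <= 4'd0;
            busy     <= 1'b0;
        end else begin
            case (state)
                IDLE: begin
                    busy <= 1'b0;
                    // each key value maps to a state with its own wait count
                    case (key_in)
                        2'd1:    begin state <= WAIT1; wait_cnt <= 4'd3; busy <= 1'b1; end
                        2'd2:    begin state <= WAIT2; wait_cnt <= 4'd7; busy <= 1'b1; end
                        default: state <= IDLE;
                    endcase
                end
                WAIT1, WAIT2: begin
                    if (wait_cnt == 0)
                        state <= IDLE;            // wait cycles elapsed
                    else
                        wait_cnt <= wait_cnt - 1'b1;
                end
                default: state <= IDLE;
            endcase
        end
    end
endmodule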
CHAPTER 2
LITERATURE SURVEY
Chao Wang, Lei Gong, Qi Yu, Xi Li, Yuan Xie, and Xuehai Zhou presented the following work. In the emerging field of machine learning, deep learning shows excellent ability in solving complex learning problems. However, the size of the networks becomes increasingly large due to the demands of practical applications, which poses a significant challenge to constructing high-performance implementations of deep learning neural networks. In order to improve the performance as well as to maintain low power cost, in this paper we design a deep learning accelerator unit (DLAU), which is a scalable accelerator architecture for large-scale deep learning networks using a field-programmable gate array (FPGA) as the hardware prototype. The DLAU accelerator employs three pipelined processing units to improve the throughput and utilizes tile techniques to explore locality for deep learning applications. Experimental results on the state-of-the-art Xilinx FPGA board demonstrate that the DLAU accelerator is able to achieve up to 36.1× speedup compared with the Intel Core2 processor, with a power consumption of 234 mW.
Sadiq M. Sait et al. noted that, with the recent advances in digital technologies and the availability of credible data, an area of artificial intelligence, deep learning, has emerged and has demonstrated its ability and effectiveness in solving complex learning problems not possible before. In particular, convolutional neural networks (CNNs) have demonstrated their effectiveness in image detection and recognition applications. However, they require intensive CPU operations and memory bandwidth that make general CPUs fail to achieve the desired performance levels. Consequently, hardware accelerators that use application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and graphics processing units (GPUs) have been employed to improve the throughput of CNNs. More precisely, FPGAs have recently been adopted for accelerating the implementation of deep learning networks due to their ability to maximize parallelism as well as their energy efficiency. In this paper, we review recent existing techniques for accelerating deep learning networks on FPGAs. We highlight the key features employed by the various techniques for improving acceleration performance. In addition, we provide recommendations for enhancing the utilization of FPGAs for CNN acceleration. The techniques investigated in this paper represent the recent trends in FPGA-based accelerators for deep learning networks. Thus, this review is expected to direct future advances on efficient hardware accelerators and to be useful for deep learning researchers.
Neena Aloysius and M. Geetha observed that the success of traditional methods for solving computer vision problems heavily depends on the feature extraction process, whereas Convolutional Neural Networks (CNNs) provide an alternative for automatically learning domain-specific features. Now every problem in the broader domain of computer vision is re-examined from the perspective of this new methodology. Therefore it is essential to figure out the type of network specific to a problem. In this work, we have done a thorough literature survey of Convolutional Neural Networks, the most widely used framework of deep learning. With AlexNet as the base CNN model, we have reviewed all the variations that emerged over time to suit various applications, together with a short discussion of the available frameworks for implementing them. We hope this article will serve as a guide for any neophyte in the area.
Trupti R. Chavan and Abhijeet V. Nandedkar noted that the use of deep neural networks for artificial intelligence tasks is increasing day by day; however, incremental learning in such networks is a challenging task. This paper deals with learning new classes by using a pre-trained model without training from scratch. The well-known VGGNet architecture is used for classification and can be viewed as a cascaded structure of convolutional layers and a classifier. A hybrid VGGNet model containing an offline and an online trained network is introduced for incremental learning. The offline trained network, which plays an important role in feature extraction, is fixed with a pre-trained conventional network, while the online trained network is adaptable and tuned to learn new classes. The key benefit of such learning is that, without training from scratch, a huge reduction in learning time and computation is achieved. The experimental results obtained on the Caltech 101 dataset show that the performance of this hybrid model is comparable to end-to-end training.
CHAPTER 3
Restricted Boltzmann Machines (RBMs) have been widely used to efficiently train each layer of a deep network. Normally a deep neural network is composed of one input layer, several hidden layers, and one classifier layer. The units in adjacent layers are connected all-to-all with weights. The prediction process consists of feed-forward computation from the given input neurons to the output neurons with the current network configuration. The training process includes pre-training, which locally tunes the connection weights between the units in adjacent layers, and global training, which globally tunes the connection weights with the back-propagation process. Large-scale deep neural networks consist of iterative computations with few conditional branch operations, so they are suitable for parallel optimization in hardware. In this paper we first explore the hot spots using a profiler. The results in Fig. 1 illustrate the percentage of running time spent in Matrix Multiplication (MM), Activation, and Vector operations. For the three representative key operations, feed-forward, Restricted Boltzmann Machine (RBM), and back propagation (BP), matrix multiplication plays a significant role in the overall execution. In particular, it takes 98.6%, 98.2%, and 99.1% of the feed-forward, RBM, and BP operations, respectively. In comparison, the activation function only takes 1.40%, 1.48%, and 0.42% of the three operations. The profiling results demonstrate that the design and implementation of an MM accelerator is able to improve the overall speedup of the system significantly. However, considerable memory bandwidth and computing resources are needed to support the parallel processing; consequently, this poses a significant challenge to FPGA implementations compared with GPU and CPU optimization measures. In order to tackle the problem, in this paper we employ tile techniques to partition the massive input data set into tiled subsets. Each designed hardware accelerator is able to buffer a tiled subset of data for processing. In order to support large-scale neural networks, the accelerator architecture is reused. Moreover, the data access for each tiled subset can run in parallel to the computation of the hardware accelerators.

Algorithm 1: Pseudo code of the tiled inputs
Require: Ni: the number of input neurons; No: the number of output neurons; Tile_Size: the tile size of the input data; batch_size: the batch size of the input data.
for n = 0; n < batch_size; n++ do
  for k = 0; k < Ni; k += Tile_Size do
    for j = 0; j < No; j++ do
      y[n][j] = 0;
      for i = k; i < k + Tile_Size && i < Ni; i++ do
        y[n][j] += w[i][j] * x[n][i];
        if i == Ni - 1 then
          y[n][j] = f(y[n][j]);
        end if
      end for
    end for
  end for
end for

In particular, for each iteration, the output neurons are reused as the input neurons of the next iteration.
To generate the output neurons for each iteration, we need to multiply the input neurons by each column of the weight matrix. As illustrated in Algorithm 1, the input data are partitioned into tiles and then multiplied by the corresponding weights; thereafter the calculated partial sums are accumulated to get the result. Besides the input/output neurons, we also divide the weight matrix into tiles corresponding to the tile size. As a consequence, the hardware cost of the accelerator depends only on the tile size, which saves a significant amount of hardware resources. The tiled technique thus makes it possible to implement large networks with limited hardware. Moreover, the pipelined hardware implementation is another advantage of FPGA technology compared to the GPU architecture, which uses massively parallel SIMD architectures to improve overall performance and throughput. According to the profiling results depicted in Table I, during the prediction process and the training process of deep learning algorithms, the common but important computational parts are matrix multiplication and the activation functions; consequently, in this paper we implement a specialized accelerator to speed up the matrix multiplication and activation functions.
Each circle in the graph above represents a neuron-like unit called a node, and nodes
are simply where calculations take place. The nodes are connected to each other across
layers, but no two nodes of the same layer are linked.
That is, there is no intra-layer communication – this is the restriction in a restricted
Boltzmann machine. Each node is a locus of computation that processes input, and begins by
making stochastic decisions about whether to transmit that input or not. (Stochastic means
“randomly determined”, and in this case, the coefficients that modify inputs are randomly
initialized.)
Each visible node takes a low-level feature from an item in the dataset to be learned.
For example, from a dataset of grayscale images, each visible node would receive one pixel-
value for each pixel in one image. (MNIST images have 784 pixels, so neural nets processing
them must have 784 input nodes on the visible layer.)
Now let’s follow that single pixel value, x, through the two-layer net. At node 1 of the
hidden layer, x is multiplied by a weight and added to a so-called bias. The result of those
two operations is fed into an activation function, which produces the node’s output, or the
strength of the signal passing through it, given input x.
Next, let’s look at how several inputs would combine at one hidden node. Each x is multiplied by a separate weight, the products are summed and added to a bias, and again the result is passed through an activation function to produce the node’s output. For example, with three inputs the node computes f(w1*x1 + w2*x2 + w3*x3 + b).
Because inputs from all visible nodes are being passed to all hidden nodes, an
RBM can be defined as a symmetrical bipartite graph.
3.4 SYMMETRICAL
Symmetrical means that each visible node is connected with each hidden node (see below). Bipartite means the graph has two parts, or layers, and "graph" is the mathematical term for a web of nodes.
At each hidden node, each input x is multiplied by its respective weight w.
That is, a single input x would have three weights here, making 12 weights altogether
(4 input nodes x 3 hidden nodes). The weights between two layers will always form a
matrix where the rows are equal to the input nodes, and the columns are equal to the
output nodes.
Each hidden node receives the four inputs multiplied by their respective
weights. The sum of those products is again added to a bias (which forces at least
some activations to happen), and the result is passed through the activation algorithm
producing one output for each hidden node.
If these two layers were part of a deeper neural network, the outputs of hidden
layer no. 1 would be passed as inputs to hidden layer no. 2, and from there through as
many hidden layers as you like until they reach a final classifying layer. (For simple feed-forward movements, the RBM nodes function as an autoencoder and nothing more.)
3.5 Reconstructions
Because the weights of the RBM are randomly initialized, the difference
between the reconstructions and the original input is often large. You can think of
reconstruction error as the difference between the values of r and the input values, and
that error is then backpropagated against the RBM’s weights, again and again, in an iterative learning process until an error minimum is reached.
As you can see, on its forward pass, an RBM uses inputs to make predictions
about node activations, or the probability of output given a weighted x: p(a|x; w).
But on its backward pass, when activations are fed in and reconstructions, or
guesses about the original data, are spit out, an RBM is attempting to estimate the
probability of inputs x given activations a, which are weighted with the same
coefficients as those used on the forward pass. This second phase can be expressed as p(x|a; w).
Together, those two estimates will lead you to the joint probability distribution
of inputs x and activations a, or p(x, a). Reconstruction does something different from regression, which estimates a continuous value based on many inputs, and different
from classification, which makes guesses about which discrete label to apply to a
given input example.
Let’s talk about probability distributions for a moment. If you’re rolling two
dice, the probability distribution for all outcomes looks like this:
That is, 7s are the most likely because there are more ways to get to 7 (3+4,
1+6, 2+5) than there are ways to arrive at any other sum between 2 and 12. Any
formula attempting to predict the outcome of dice rolls needs to take seven’s greater
frequency into account.
Or take another example: Languages are specific in the probability distribution
of their letters, because each language uses certain letters more than others. In
English, the letters e, t and a are the most common, while in Icelandic, the most
common letters are a, r and n. Attempting to reconstruct Icelandic with a weight set
based on English would lead to a large divergence.
In the same way, image datasets have unique probability distributions for their
pixel values, depending on the kind of images in the set. Pixel values are distributed
differently depending on whether the dataset includes MNIST’s handwritten
numerals:
Imagine for a second an RBM that was only fed images of elephants and dogs,
and which had only two output nodes, one for each animal. The question the RBM is
asking itself on the forward pass is: Given these pixels, should my weights send a
stronger signal to the elephant node or the dog node? And the question the RBM asks
on the backward pass is: Given an elephant, which distribution of pixels should I
expect?
That’s joint probability: the simultaneous probability of x given a and
of a given x, expressed as the shared weights between the two layers of the RBM.
The process of learning reconstructions is, in a sense, learning which groups of pixels
tend to co-occur for a given set of images. The activations produced by nodes of
hidden layers deep in the network represent significant co-occurrences; e.g.
“nonlinear gray tube + big, floppy ears + wrinkles” might be one.
In the two images above, you see reconstructions learned by Deeplearning4j’s
implementation of an RBM. These reconstructions represent what the RBM’s
activations “think” the original data looks like. Geoff Hinton refers to this as a sort of
machine “dreaming”. When rendered during neural net training, such visualizations
are extremely useful heuristics to reassure oneself that the RBM is actually learning.
If it is not, then its hyperparameters, discussed below, should be adjusted.
One last point: You’ll notice that RBMs have two biases. This is one aspect
that distinguishes them from other autoencoders. The hidden bias helps the RBM
produce the activations on the forward pass (since biases impose a floor so that at
least some nodes fire no matter how sparse the data), while the visible layer’s biases
help the RBM learn the reconstructions on the backward pass.
Once this RBM learns the structure of the input data as it relates to the
activations of the first hidden layer, then the data is passed one layer down the net.
Your first hidden layer takes on the role of visible layer. The activations now
effectively become your input, and they are multiplied by weights at the nodes of the
second hidden layer, to produce another set of activations.
This process of creating sequential sets of activations by grouping features and
then grouping groups of features is the basis of a feature hierarchy, by which neural
networks learn more complex and abstract representations of data.
With each new hidden layer, the weights are adjusted until that layer is able to
approximate the input from the previous layer. This is greedy, layerwise and
unsupervised pre-training. It requires no labels to improve the weights of the network,
which means you can train on unlabeled data, untouched by human hands, which is
the vast majority of data in the world. As a rule, algorithms exposed to more data
produce more accurate results, and this is one of the reasons why deep-learning
algorithms are kicking butt.
Because those weights already approximate the features of the data, they are
well positioned to learn better when, in a second step, you try to classify images with
the deep-belief network in a subsequent supervised learning stage.
While RBMs have many uses, proper initialization of weights to facilitate later
learning and classification is one of their chief advantages. In a sense, they
accomplish something similar to back propagation: they push weights to model data
well. You could say that pre-training and back prop are substitutable means to the
same end.
To synthesize restricted Boltzmann machines in one diagram, here is a
symmetrical bipartite and bidirectional graph:
For those interested in studying the structure of RBMs in greater depth, they are one type of undirected graphical model, also called a Markov random field. An example is available at:
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/unsupervised/deepbelief/DeepAutoEncoderExample.java
Parameters & k
The variable k is the number of times you run contrastive divergence.
Contrastive divergence is the method used to calculate the gradient (the slope
representing the relationship between a network’s weights and its error), without
which no learning can occur.
Each time contrastive divergence is run, it is a sample of the Markov chain composing the restricted Boltzmann machine. A typical value is 1. In the example linked above, you can see how RBMs can be created as layers within a more general MultiLayerConfiguration. After each dot you’ll find an additional parameter that affects the structure and performance of the deep neural net; most of those parameters are defined in the Deeplearning4j documentation.
CHAPTER 4
INTRODUCTION TO DLAU
4.1 INTRODUCTION
In the past few years, machine learning has become pervasive in various research fields and commercial applications, and has achieved satisfactory products. The emergence of deep learning has sped up the development of machine learning and artificial intelligence. Consequently, deep learning has become a research hot spot in research organizations.
In general, deep learning uses a multi-layer neural network model to extract high-level features, which are a combination of low-level abstractions, to find the distributed data features, in order to solve complex problems in machine learning.
Currently the most widely used neural models in deep learning are Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), which have been proven to have excellent capability in solving image recognition, voice recognition, and other complex machine learning tasks.
However, with the increasing accuracy requirements and complexity of practical applications, the size of neural networks has become explosively large, such as the Baidu Brain with 100 billion neuronal connections and the Google cat-recognizing system with 1 billion neuronal connections.
The explosive volume of data makes data centers quite power consuming. In particular, the electricity consumption of data centers in the U.S. is projected to increase to roughly 140 billion kilowatt-hours annually by 2020.
Therefore, it poses significant challenges to implement high-performance deep learning networks at low power cost, especially for large-scale deep learning neural network models.
So far, the state-of-the-art means for accelerating deep learning algorithms are Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), and Graphics Processing Units (GPUs). Compared with GPU acceleration, hardware accelerators like FPGAs and ASICs can achieve at least moderate performance with lower power consumption.
However, both FPGAs and ASICs have relatively limited computing resources, memory, and I/O bandwidth, so it is challenging to develop complex and massive deep neural networks on hardware accelerators. ASICs also have a longer development cycle, and their flexibility is not satisfying.
Chen et al. present a ubiquitous machine-learning hardware accelerator called DianNao, which opens a new paradigm for machine learning hardware accelerators focusing on neural networks. However, DianNao is not implemented on reconfigurable hardware like FPGAs, so it cannot adapt to different application demands.
Among current FPGA acceleration research, Ly and Chow designed FPGA-based solutions to accelerate the Restricted Boltzmann Machine (RBM). They created dedicated hardware processing cores optimized for the RBM algorithm.
Similarly, Kim et al. also developed an FPGA-based accelerator for the restricted Boltzmann machine. They use multiple RBM processing modules in parallel, with each module responsible for a relatively small number of nodes. Other similar works also present FPGA-based neural network accelerators. Qi et al. present an FPGA-based accelerator, but it cannot accommodate changing network sizes and topologies.
To sum up, these studies focus on implementing a particular deep learning algorithm efficiently, but how to increase the size of the neural networks with a scalable and flexible hardware architecture has not been properly solved. To tackle these problems, we present a scalable deep learning accelerator unit named DLAU to speed up the kernel computational parts of deep learning algorithms.
In particular, we utilize tile techniques, FIFO buffers, and pipelines to minimize memory transfer operations, and we reuse the computing units to implement large-size neural networks. This approach distinguishes itself from previous work with the following contributions:
1. In order to explore the locality of the deep learning application, we employ tile techniques to partition the large-scale input data. The DLAU architecture can be configured to operate on different sizes of tile data to leverage the trade-off between speedup and hardware cost. Consequently, the FPGA-based accelerator is more scalable to accommodate different machine learning applications.
2. The DLAU accelerator is composed of three fully pipelined processing units, including TMMU, PSAU, and AFAU; a minimal structural sketch of this pipeline is given after this list. Different network topologies such as CNN, DNN, or even emerging neural networks can be composed from these basic modules. Consequently, the scalability of the FPGA-based accelerator is higher than that of an ASIC-based accelerator.
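As a minimal behavioral sketch of how the three units are chained, the following Verilog module pipes a multiply stage, an accumulate stage, and an activation stage in the spirit of TMMU, PSAU, and AFAU. The port names, data widths, handshake signals, and the clamping used in place of the real sigmoid are all illustrative assumptions, not the actual DLAU RTL.

`timescale 1ns/10ps
module dlau_pipeline_sketch (
    input  wire               clk,
    input  wire               rst,
    input  wire signed [15:0] node_in,       // tiled input node value
    input  wire signed [15:0] weight_in,     // matching weight value
    input  wire               valid_in,      // node/weight pair is valid
    input  wire               last_in_tile,  // last element of the current tile
    output reg  signed [31:0] act_out,       // activated output (placeholder)
    output reg                valid_out
);
    // Stage 1 registers (TMMU-like multiply)
    reg signed [31:0] product;
    reg               prod_valid, prod_last;
    // Stage 2 registers (PSAU-like accumulate)
    reg signed [31:0] acc, sum_out;
    reg               sum_done;

    always @(posedge clk) begin
        if (rst) begin
            product <= 0; prod_valid <= 0; prod_last <= 0;
            acc <= 0; sum_out <= 0; sum_done <= 0;
            act_out <= 0; valid_out <= 0;
        end else begin
            // Stage 1: multiply one node value by its weight
            product    <= node_in * weight_in;
            prod_valid <= valid_in;
            prod_last  <= last_in_tile;

            // Stage 2: accumulate products; emit the sum at the tile boundary
            sum_done <= 1'b0;
            if (prod_valid) begin
                if (prod_last) begin
                    sum_out  <= acc + product;
                    sum_done <= 1'b1;
                    acc      <= 0;
                end else begin
                    acc <= acc + product;
                end
            end

            // Stage 3: placeholder activation (the real AFAU evaluates a
            // piecewise-linear approximation of the sigmoid function)
            if (sum_done)
                act_out <= (sum_out < 0) ? 32'sd0 : sum_out;
            valid_out <= sum_done;
        end
    end
endmodule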
Restricted Boltzmann Machines (RBMs) have been widely used to efficiently train each layer of a deep network. Normally a deep neural network is composed of one input layer, several hidden layers, and one classifier layer. The units in adjacent layers are connected all-to-all with weights.
The prediction process consists of feed-forward computation from the given input neurons to the output neurons with the current network configuration. The training process includes pre-training, which locally tunes the connection weights between the units in adjacent layers, and global training, which globally tunes the connection weights with the back-propagation process.
Large-scale deep neural networks consist of iterative computations with few conditional branch operations, so they are suitable for parallel optimization in hardware. In this paper we first explore the hot spots using a profiler. The results in Fig. 1 illustrate the percentage of running time spent in Matrix Multiplication (MM), Activation, and Vector operations.
For the three representative key operations, feed-forward, Restricted Boltzmann Machine (RBM), and back propagation (BP), matrix multiplication plays a significant role in the overall execution. In particular, it takes 98.6%, 98.2%, and 99.1% of the feed-forward, RBM, and BP operations, respectively. In comparison, the activation function only takes 1.40%, 1.48%, and 0.42% of the three operations.
The profiling results demonstrate that the design and implementation of an MM accelerator is able to improve the overall speedup of the system significantly. However, considerable memory bandwidth and computing resources are needed to support the parallel processing; consequently, this poses a significant challenge to FPGA implementations compared with GPU and CPU optimization measures.
In order to tackle the problem, in this paper we employ tile techniques to partition the massive input data set into tiled subsets. Each designed hardware accelerator is able to buffer a tiled subset of data for processing. In order to support large-scale neural networks, the accelerator architecture is reused.
Moreover, the data access for each tiled subset can run in parallel to the computation of the hardware accelerators.

Algorithm 1: Pseudo code of the tiled inputs
Require: Ni: the number of input neurons; No: the number of output neurons; Tile_Size: the tile size of the input data; batch_size: the batch size of the input data.
for n = 0; n < batch_size; n++ do
  for k = 0; k < Ni; k += Tile_Size do
    for j = 0; j < No; j++ do
      y[n][j] = 0;
      for i = k; i < k + Tile_Size && i < Ni; i++ do
        y[n][j] += w[i][j] * x[n][i];
        if i == Ni - 1 then
          y[n][j] = f(y[n][j]);
        end if
      end for
    end for
  end for
end for

In particular, for each iteration, output neurons are reused as the input neurons in the next iteration.
To generate the output neurons for each iteration, we need to multiply the input neurons by each column of the weight matrix. As illustrated in Algorithm 1, the input data are partitioned into tiles and then multiplied by the corresponding weights; thereafter the calculated partial sums are accumulated to get the result.
Besides the input/output neurons, we also divide the weight matrix into tiles corresponding to the tile size. As a consequence, the hardware cost of the accelerator depends only on the tile size, which saves a significant amount of hardware resources.
The tiled technique thus makes it possible to implement large networks with limited hardware. Moreover, the pipelined hardware implementation is another advantage of FPGA technology compared to the GPU architecture, which uses massively parallel SIMD architectures to improve overall performance and throughput.
According to the profiling results depicted in Table I, during the prediction process and the training process of deep learning algorithms, the common but important computational parts are matrix multiplication and the activation functions; consequently, in this paper we implement a specialized accelerator to speed up the matrix multiplication and activation functions.
TMMU employs an input FIFO buffer which receives the transferred data. Fig. 2 illustrates the TMMU schematic diagram, in which we set tile size = 32 as an example. TMMU first reads the weight matrix data from the input buffer into 32 different BRAMs, distributed by the row number of the weight matrix (n = i % 32, where n refers to the BRAM number and i is the row number of the weight matrix).
Then TMMU begins to buffer the tiled node data. The first time, TMMU reads the 32 tiled values into registers Reg_a and starts execution. In parallel with the computation, at every cycle TMMU reads the next node data from the input buffer and saves them to registers Reg_b; consequently, the registers Reg_a and Reg_b can be used alternately (ping-pong buffering).
For the calculation, we use a pipelined binary adder tree structure to optimize the performance. As depicted in Fig. 2, the weight data and the node data are saved in BRAMs and registers.
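As a rough illustration of the ping-pong registers and the binary adder tree described above, the sketch below uses a tile size of 4 with packed 16-bit values. The real TMMU works with a tile size of 32, keeps the weights in BRAMs, and is fed through an input FIFO; all names, widths, and the packing scheme here are assumptions made only for illustration.

`timescale 1ns/10ps
module tmmu_sketch (
    input  wire              clk,
    input  wire              rst,
    input  wire              ping,           // selects which node bank is computed
    input  wire [63:0]       node_tile_in,   // next tile: four packed 16-bit node values
    input  wire [63:0]       weight_tile,    // matching four packed 16-bit weights
    output reg signed [33:0] part_sum        // partial sum for the active tile
);
    reg [63:0] reg_a, reg_b;                 // ping-pong node buffers (Reg_a / Reg_b)
    wire [63:0] active = ping ? reg_b : reg_a;

    // products of the four node/weight pairs held in the active bank
    wire signed [31:0] p0 = $signed(active[15:0])  * $signed(weight_tile[15:0]);
    wire signed [31:0] p1 = $signed(active[31:16]) * $signed(weight_tile[31:16]);
    wire signed [31:0] p2 = $signed(active[47:32]) * $signed(weight_tile[47:32]);
    wire signed [31:0] p3 = $signed(active[63:48]) * $signed(weight_tile[63:48]);

    reg signed [32:0] s0, s1;                // first level of the binary adder tree

    always @(posedge clk) begin
        if (rst) begin
            reg_a <= 0; reg_b <= 0; s0 <= 0; s1 <= 0; part_sum <= 0;
        end else begin
            // load the inactive bank while the active bank is being computed
            if (ping) reg_a <= node_tile_in; else reg_b <= node_tile_in;
            // two-level pipelined binary adder tree over the four products
            s0       <= p0 + p1;
            s1       <= p2 + p3;
            part_sum <= s0 + s1;
        end
    end
endmodule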
The Part Sum Accumulation Unit (PSAU) is responsible for the accumulation operation. The figure below presents the PSAU architecture, which accumulates the partial sums produced by TMMU.
If the partial sum is the final result, PSAU writes the value to the output buffer and sends the result to AFAU in a pipelined manner. PSAU can accumulate one partial sum every clock cycle; therefore the throughput of the PSAU accumulation matches the generation of partial sums in TMMU.
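A simplified sketch of this accumulation behaviour is given below. The handshake signals part_valid and part_last and the chosen widths are assumptions used only to illustrate the one-partial-sum-per-clock accumulation; they are not the exact PSAU interface.

`timescale 1ns/10ps
module psau_sketch (
    input  wire               clk,
    input  wire               rst,
    input  wire               part_valid,   // a partial sum arrives from TMMU
    input  wire               part_last,    // last partial sum for this output neuron
    input  wire signed [33:0] part_sum,
    output reg  signed [39:0] neuron_sum,   // accumulated sum for one output neuron
    output reg                sum_valid
);
    reg signed [39:0] acc;

    always @(posedge clk) begin
        if (rst) begin
            acc <= 0; neuron_sum <= 0; sum_valid <= 0;
        end else begin
            sum_valid <= 1'b0;
            if (part_valid) begin
                if (part_last) begin
                    neuron_sum <= acc + part_sum;   // final value for this neuron
                    sum_valid  <= 1'b1;
                    acc        <= 0;                // restart for the next neuron
                end else begin
                    acc <= acc + part_sum;          // one accumulation per clock cycle
                end
            end
        end
    end
endmodule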
Similar to PSAU, AFAU also has both an input buffer and an output buffer to maintain the throughput with the other processing units. In particular, we use two separate BRAMs to store the values of a and b, the coefficients of the piecewise-linear approximation of the sigmoid function.
The computation of AFAU is pipelined so that it can evaluate the sigmoid function every clock cycle. As a consequence, all three processing units are fully pipelined to ensure the peak throughput of the DLAU accelerator architecture.
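The sketch below shows one way such a piecewise-linear evaluation, f(x) ~ a*x + b with segment-dependent coefficients a and b held in two small memories, can be pipelined. The segment indexing, widths, memory depth, and the placeholder coefficient values are illustrative assumptions and not the actual AFAU implementation.

`timescale 1ns/10ps
module afau_sketch (
    input  wire               clk,
    input  wire               rst,
    input  wire               in_valid,
    input  wire signed [15:0] x,            // fixed-point input from PSAU
    output reg  signed [31:0] y,            // piecewise-linear sigmoid approximation
    output reg                out_valid
);
    // coefficient memories, modelled here as small register arrays
    reg signed [15:0] a_mem [0:15];
    reg signed [15:0] b_mem [0:15];

    wire [3:0] seg = x[15:12];              // crude segment index taken from the MSBs

    reg signed [15:0] a_q, b_q, x_q;
    reg               v_q;
    integer i;

    initial begin
        // placeholder coefficients; a real design would load a fitted sigmoid table
        for (i = 0; i < 16; i = i + 1) begin
            a_mem[i] = 16'sd1;
            b_mem[i] = 16'sd0;
        end
    end

    always @(posedge clk) begin
        if (rst) begin
            a_q <= 0; b_q <= 0; x_q <= 0; v_q <= 0; y <= 0; out_valid <= 0;
        end else begin
            // stage 1: look up the coefficients for the input's segment
            a_q <= a_mem[seg];
            b_q <= b_mem[seg];
            x_q <= x;
            v_q <= in_valid;
            // stage 2: evaluate a*x + b for the selected segment
            y         <= a_q * x_q + b_q;
            out_valid <= v_q;
        end
    end
endmodule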
CHAPTER 5
As seen from the discussion above, the existing design is modelled only on the DLAU architecture for different bit widths and different applications; hence it alone cannot provide an accurate solution. To provide such complex features in one application, we need to know how each module operates for the application chosen.
Our design provides the modules TMMU, PSAU, and AFAU, whose behavior, characteristics, and reliability need to be estimated and analyzed for the chosen application, which drives the design considerations.
Our DLAU architecture is applied to the CNN layers to control their data operations and output generation at each phase of the considered design.
Since the CNN structure itself has multiple layers of filters and boosting techniques, we introduce one such filter and a boosting technique into our design, which reduces the real-time latency of the CNN layers considered. The figure considered here is intended to provide the analysis of the units described next.
5.3 TMMU & PSAU:
From the design point of view, we have considered the PSAU module as a data-accelerated controller.
FLOW DIAGRAM OF PSAU:
5.4 AFAU:
Here the AFAU is represented with a flow diagram, which depicts the modelling of this circuit as an FSM-based design; the flow checks the generated outputs against the created inputs for both the intermediate and the original sections.
Convolution, in turn, often includes back propagation in order to weight the end result more accurately.
Although the layers are casually referred to as convolutions, this is only by convention. Mathematically, the operation is in fact a sliding dot product or cross-correlation. This has significance for the indices in the matrix, in that it affects how the weight is determined at a particular index point.
Pooling
Average pooling uses the average value from each of a cluster of neurons at the prior layer.
MAX-POOLING:
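Max pooling, by contrast, keeps the largest value from each cluster of neurons at the prior layer. The following is a minimal Verilog sketch of a 2x2 max-pooling window using nested comparisons; the module and port names are assumptions, and the appendix lists the fuller maxPooling module actually used in this design.

`timescale 1ns/10ps
module max_pool_2x2 (
    input  wire               clk,
    input  wire               enable,
    input  wire signed [15:0] in0, in1, in2, in3,  // the four values of a 2x2 window
    output reg  signed [15:0] max_out,
    output reg                done
);
    wire signed [15:0] m01 = (in0 > in1) ? in0 : in1;
    wire signed [15:0] m23 = (in2 > in3) ? in2 : in3;

    always @(posedge clk) begin
        if (enable) begin
            max_out <= (m01 > m23) ? m01 : m23;    // maximum of the window
            done    <= 1'b1;
        end else begin
            done    <= 1'b0;
        end
    end
endmodule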
CHAPTER 6
In this section we explain the results of the proposed design and how each analysis of the design model is obtained, considering the following:
Area analysis
Power analysis
Time/Delay analysis (initialization analysis)
Speed analysis.
Using the VHDL/Verilog language we design the required system, which is then processed and subjected to the above-mentioned analyses through:
Synthesis,
Place and Route,
Simulation.
Synthesis
In this process the initially designed Verilog or VHDL program code is converted into a netlist format. We then analyze the complete circuit in terms of its logic elements and its RTL implementation.
In this project we need to control and model each design phase for TX and RX so that the transmission is as fast as possible. This process generates a netlist for each design element.
Simulate:
The simulation process mainly requires inputs and outputs, meaning that the output can be observed with respect to the given input over clock pulses (cycles). In this process we apply specific inputs and observe the outputs, in the form of clock pulses, to obtain the simulated model of the designed circuitry. The observed output depends on the duty cycle:
Duty cycle = Ton / (Ton + Toff)
For example, if the user assumes a total clock period in which Ton is greater than Toff (Ton > Toff), the duty cycle is above 50% and the design obtains improved stability; in this case a larger duty cycle (D.C.) is preferred. If Ton is less than Toff (Ton < Toff), the duty cycle is lower and the achievable stability is correspondingly reduced. In practice these criteria are also affected by the finite state machines, LUTs, and hold times, so the conditions change accordingly.
Simulation Results:
From the waveforms of the ModelSim simulation output we can see the initialization of each data item with respect to the inputs, where each module operates with its own timing circuitry. We then synchronize the circuit using the design clock generated by the user and estimated.
In the figures above and below we observe the values of Xmit_d and Data1_tx as 10010101. This value is the input that we have assigned to the DLAU; here the DLAU acts as the controller and comparator, where each data item from the CNN layers used per the design criteria is controlled and checked accordingly.
Finally, after a few iterations of the clock cycles we are able to retrieve the same data, and the comparison of the original and received data is verified based on the AFAU operation.
CONCLUSION:
As per the proposed design, we have estimated and evaluated the design from an application point of view, where the DLAU is utilized and verified with DL-module-based modelling.
We compare the results accordingly and tabulate them. As per the design consideration, we have verified our design model against the existing design, which is the DLAU, and from an application point of view where the DL module together with the DLAU is considered.
According to the results and the implementation cycle, we have shown the comparisons between the existing and the proposed design models.
Hence, from the result and implementation point of view, we have shown that our proposed method is more reliable and more effective for different kinds of applications where power and area are critical.
AREA UTILIZATION:
As per the design, we have estimated the area at about 40% for the input/output configuration, where the device characteristics change depending upon the design criteria.
As per the design model, we have used fewer than 300 circuit elements such as flip-flops and about 710 look-up tables, which amounts to about 0.6 percent of the total logic elements for the area representation.
POWER UTILIZATION:
The modelled power analysis does not employ any power-reduction schemes. By modelling the signal characteristics we can verify the simulation output results, specifically the output data, which gives the correct fan-in and fan-out estimates for the design. The power utilization for the design under test is shown below:
REFERENCES:
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] J. Hauswald et al., "DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers," in Proc. ISCA, Portland, OR, USA, 2015, pp. 27–40.
[3] C. Zhang et al., "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. FPGA, Monterey, CA, USA, 2015, pp. 161–170.
[4] P. Thibodeau, "Data centers are the new polluters," Computerworld, accessed Apr. 4, 2016. [Online]. Available: http://www.computerworld.com/article/2598562/data-center/data-centers-are-the-new-polluters.html
[5] Q. Yu, C. Wang, X. Ma, X. Li, and X. Zhou, "A deep learning prediction process accelerator based FPGA," in Proc. CCGRID, Shenzhen, China, 2015, pp. 1159–1162.
[6] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. FPGA, Monterey, CA, USA, 2016, pp. 26–35.
Appendix
`resetall
`timescale 1ns/10ps
module main_controler_cnn(
input wire clk_main,
input wire rst,
output wire add_stg2,
output wire add_stg3,
output wire add_stg4,
output wire conv_stage1,
output wire conv_stg2,
output wire [15:0] max_pool_out
);
wire clk;
wire conv2done;
wire [15:0] dbus0;
wire [15:0] dbus1;
wire [15:0] dbus2;
wire [15:0] dbus3;
wire [15:0] dbus4;
wire done;
wire done1;
wire done2;
wire done3;
wire enable;
wire [15:0] main_in;
wire maxPoolingDone;
wire [4:0] output1;
wire [15:0] output10;
wire [15:0] output11;
wire [15:0] output12;
wire [15:0] output13;
wire [15:0] output14;
wire [15:0] output15;
wire [4:0] output2;
wire [5:0] output3;
wire [4:0] output4;
wire [4:0] output5;
wire [5:0] output6;
wire [6:0] output7;
wire [7:0] output8;
.clk (clk),
.enable (enable),
.done (done1) );
adderStage3 U_3(
.input1 (output4),
.input2 (output5),
.output1 (output6),
.clk (clk),
.enable (enable),
.done (done2)
);
adderStage4 U_5(
.input1 (output7),
.input2 (output6),
.output1 (output8),
.clk (clk),
.enable (enable),
.done (done3)
);
main_control_gen U_1(
.clk_main (clk_main),
.conv2done (conv2done),
.done (done),
.done1 (done1),
.done2 (done2),
.done3 (done3),
.maxPoolingDone (maxPoolingDone),
.output15 (output15),
.rst (rst),
.add_stg2 (add_stg2),
.add_stg3 (add_stg3),
.add_stg4 (add_stg4),
.clk (clk),
.conv_stage1 (conv_stage1),
.conv_stg2 (conv_stg2),
.dbus0 (dbus0),
.dbus1 (dbus1),
.dbus2 (dbus2),
.dbus3 (dbus3),
.dbus4 (dbus4),
.enable (enable),
.max_pool_out (max_pool_out)
);
maxPooling U_6(
.clk (clk),
.input1 (output9),
.input2 (output10),
.input3 (output11),
.input4 (main_in),
.enable (enable),
.output1 (output15),
.maxPoolingDone (maxPoolingDone)
);
assign main_in = output4 ^ output5 ^ output6;
endmodule
…………………………………………………………………………………………..
module file_gen(
output wire clk,
input wire clk_main,
output wire enable,
output wire [15:0] file_out,
input wire rst
);
reg [15:0] ram [0:255]; // image memory: 256 sixteen-bit words (a 16x16 image)
reg [15:0] out1;
reg en;
integer i,j;
always@(posedge(clk_main))
begin
if(rst)
begin
for (i =0;i<=255;i=i+1)
ram[i] =0;
out1=0;
end
else
begin
$readmemh("C:\\Users\\Rahul\\Desktop\\dlau\\image1.dat",ram);
for (i=0;i<=15;i=i+1)
for (j=0;j<=15;j=j+1)
begin
out1= ram[0];
#10 out1= ram[1];
#10 out1 = ram[2];
#15 out1 = ram[3];
#20 out1 = ram[4];
……………………………………………………………………………………………
`resetall
`timescale 1ns/10ps
module TMMU_PSAU(
input wire clk_main,
output wire [15:0] dbus0,
output wire [15:0] dbus1,
output wire [15:0] dbus2,
output wire [15:0] dbus3,
output wire [15:0] dbus4,
input wire [15:0] file_out,
input wire rst
);
reg [3:0] r1,r2,r3,r4,r5;
always@(posedge(clk_main))
begin
if (rst)
begin
r1=0;
r2=0;
r3=0;
r4=0;
r5=0;
end
else
begin
r1 = file_out[3:0];
r2 = file_out[5:2];
r3 = file_out[6:3];
r4 = file_out[4:1];
r5= file_out[7:4];
end
end
assign dbus0=r1;
assign dbus1=r2;
assign dbus2=r3;
assign dbus3=r4;
assign dbus4=r5;
endmodule
………………………………………………………………………………………….
module ConvolutionStage1(
input clk,
input [3:0] input2,
input [3:0] input4,
input [3:0] input5,
input [3:0] input6,
input [3:0] input8,
output reg signed [4:0] output1,
output reg signed [4:0] output2,
output reg signed [5:0] output3,
output reg signed [4:0] output4,
output reg signed [4:0] output5,
input enable,
output reg done
);
always @ (posedge clk) begin
if(enable) begin  // when enable is asserted, the stage outputs are cleared
output1 <= 0;
output2 <= 0;
output3 <= 0;
output4 <= 0;
output5 <= 0;
done <= 1'b0;
end
else begin
// the four neighbour taps are negated in 5-bit two's complement (i.e. -input),
// and the centre tap is shifted left by two (multiplied by four), which
// resembles a Laplacian-style edge-detection kernel
output1 <= {1'b1, ~(input2)} + 5'b00001;
output2 <= {1'b1, ~(input4)} + 5'b00001;
output3 <= {2'b00, input5} << 2;
output4 <= {1'b1, ~(input6)} + 5'b00001;
output5 <= {1'b1, ~(input8)} + 5'b00001;
done <= 1'b1;
end
end
endmodule
……………………………………………………………………………………………
module adderStage3(
input [4:0] input1,
input [4:0] input2,
output reg [5:0] output1,
input clk,
input enable,
output reg done
);
always @ (posedge clk) begin
if(enable) begin
output1 <= 0;
done <= 1'b0;
end
else begin
output1 <= {input1[4], input1} + {input2[4], input2};  // sign-extend the 5-bit operands and add to form a 6-bit sum
done <= 1'b1;
end
end
endmodule
…………………………………………………………………………………………..
module ConvolutionStage2(
input clk,
input enable,
input [7:0] input1,
input [7:0] input2,
input [7:0] input3,
input [7:0] input4,
input [7:0] input5,
input [7:0] input6,
input [7:0] input7,
input [7:0] input8,
input [7:0] input9,
input [7:0] input10,
input [7:0] input11,
input [7:0] input12,
output reg signed [15:0] output1,
output reg signed [15:0] output2,
output reg signed [15:0] output3,
output reg signed [15:0] output4,
output reg signed [15:0] output5,
output reg signed [15:0] output6,
output reg done
);
……………………………………………………………………………………………..
maxPoolingDone <= 0;
end
else begin
if($signed(initialMax) < $signed(input1)) begin
if($signed(input2) < $signed(input1)) begin
if($signed(input3) < $signed(input1)) begin
if($signed(input4) < $signed(input1)) begin
output1 <= input1;
maxPoolingDone <= 1;
end
else begin
output1 <= input4;
maxPoolingDone <= 1;
end
end
else begin
if($signed(input3) < $signed(input4)) begin
output1 <= input4;
maxPoolingDone <= 1;
end
else begin
output1 <= input3;
maxPoolingDone <= 1;
end
end
end
else begin
if($signed(input3) < $signed(input2)) begin
if($signed(input4) < $signed(input2)) begin
output1 <= input2;
maxPoolingDone <= 1;
end
else begin
output1 <= input4;
maxPoolingDone <= 1;
end
end
else begin
if($signed(input3) < $signed(input4)) begin
output1 <= input4;
maxPoolingDone <= 1;
end
else begin
output1 <= input3;
maxPoolingDone <= 1;
end
end
end
end
else begin
output1 <= initialMax;
maxPoolingDone <= 1;
end
end
end
endmodule
…………………………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………….end code