
Design and Implementation of DLAU using FPGA

CHAPTER 1

INTRODUCTION

1. Introduction:
As transistor density continues to grow exponentially, the limited power budget
allows only a small fraction of the transistors to be active at a time, which is referred to as
dark silicon. Dark silicon forces us to trade silicon area for energy. Specialized hardware
acceleration has emerged as an effective technique to mitigate dark silicon, as it delivers up to
several orders of magnitude better energy efficiency than general-purpose processors. Heading
towards the big data era, a key challenge in the design of hardware accelerators is how to
efficiently transfer data between the memory hierarchy and the accelerators, especially when
targeting emerging data-intensive applications (e.g., key-value stores, graph databases, etc.).
However, with the increasing accuracy requirements and complexity of practical
applications, the size of neural networks becomes explosively large, such as the
Baidu Brain with 100 billion neuronal connections and the Google cat-recognizing system
with 1 billion neuronal connections. The explosive volume of data makes data centers
quite power consuming. In particular, the electricity consumption of data centers in the U.S.
is projected to increase to roughly 140 billion kilowatt-hours annually by 2020.
Therefore, it poses significant challenges to implement high performance deep
learning networks with low power cost, especially for large scale deep learning neural
network models. So far, the state-of-the-art means for accelerating deep learning algorithms
are Field-Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs),
and Graphic Processing Units (GPUs). Compared with GPU acceleration, hardware
accelerators like FPGAs and ASICs can achieve at least moderate performance with lower
power consumption. However, both FPGA and ASIC have relatively limited computing
resources, memory, and I/O bandwidths; therefore it is challenging to develop complex and
massive deep neural networks using hardware accelerators. ASICs also have a longer
development cycle, and their flexibility is not satisfying. Chen et al. present a ubiquitous
machine-learning hardware accelerator called DianNao, which opens a new paradigm for
machine learning hardware accelerators focusing on neural networks. But DianNao is not
implemented using reconfigurable hardware like FPGAs, therefore it cannot adapt to different
application demands. Among current FPGA acceleration research, Ly and Chow
designed FPGA-based solutions to accelerate the Restricted Boltzmann Machine (RBM).
They created dedicated hardware processing cores which are optimized for the RBM
algorithm. Similarly, Kim et al. also developed an FPGA-based accelerator for the restricted
Boltzmann machine.
They use multiple RBM processing modules in parallel, with each module
responsible for a relatively small number of nodes. Other similar works also present FPGA-based
neural network accelerators. Qi et al. present an FPGA-based accelerator, but it cannot
accommodate changing network sizes and network topologies. To sum up, these studies focus
on implementing a particular deep learning algorithm efficiently, but how to increase the size
of the neural networks with a scalable and flexible hardware architecture has not been properly
solved.

1.1 Aim of the Thesis:


To tackle these problems, we present a scalable deep learning accelerator unit named
DLAU to speed up the kernel computational parts of deep learning algorithms. In particular,
we utilize tile techniques, FIFO buffers, and pipelines to minimize memory transfer
operations, and reuse the computing units to implement large-size neural networks.
1. In order to explore the locality of the deep learning application, we employ tile
techniques to partition the large scale input data. The DLAU architecture can be
configured to operate on different sizes of tile data to leverage the trade-off between
speedup and hardware cost. Consequently the FPGA-based accelerator is more
scalable and can accommodate different machine learning applications.
2. The DLAU accelerator is composed of three fully pipelined processing units,
including TMMU, PSAU, and AFAU. Different network topologies such as CNN,
DNN, or even emerging neural networks can be composed from these basic modules.
Consequently the scalability of the FPGA-based accelerator is higher than that of an
ASIC-based accelerator.


1.2 Methodology:
In this the HDL designer tool is used to implement the DLAU circuits. The
implementation of the code is done by verilog language. In implementation image input
module designed in first stage, the image module is designed with counter after that an FSM
control module designed in this the states are customize by key values. For every key value
we have different states are assigned according to the key values the states are changed. For
different states, different wait cycles are assigned according to key in values. Modelsim
software is utilized to get simulation results.

1.3 Significance of the Work:


By using some transformations, security is provided for DSP circuits. The obfuscation
is implemented in FPGA technology, so the chip area and power dissipation are reduced. The
obfuscated DSP circuits are used in transmission, video compression, wired and wireless
communication, biomedical signal processing and speech processing.

1.4 Organisation of the Thesis:


CHAPTER 2:
LITERATURE SURVEY:

Deploying deep neural networks on mobile devices is a challenging task. Current
model compression methods such as matrix decomposition effectively reduce the deployed
model size, but still cannot satisfy real-time processing requirements. This paper first
discovers that the major obstacle is the excessive execution time of non-tensor layers such as
pooling and normalization, which have no tensor-like trainable parameters. This motivates us
to design a novel acceleration framework, DeepRebirth, through "slimming" existing
consecutive and parallel non-tensor and tensor layers.

The layer slimming is executed at different substructures: (a) streamline slimming, by
merging the consecutive non-tensor and tensor layers vertically; and (b) branch slimming, by
merging non-tensor and tensor branches horizontally. The proposed optimization operations
significantly accelerate the model execution and also greatly reduce the run-time memory
cost, since the slimmed model architecture contains fewer hidden layers. To maximally avoid
accuracy loss, the parameters in the newly generated layers are learned with layer-wise fine-tuning
based on both theoretical analysis and empirical verification. As observed in the experiments,
DeepRebirth achieves more than 3x speed-up and 2.5x run-time memory saving on GoogLeNet
with only a 0.4% drop in top-5 accuracy on ImageNet. Furthermore, by combining with
other model compression techniques, DeepRebirth offers an average inference time of
106.3 ms on the CPU of a Samsung Galaxy S5 with 86.5% top-5 accuracy, 14% faster than
SqueezeNet, which only has a top-5 accuracy of 80.5%.

Chao Wang, Lei Gong, Qi Yu, Xi Li, Yuan Xie and Xuehai Zhou have presented the
following: in the emerging field of machine learning, deep learning shows excellent ability in
solving complex learning problems. However, the size of the networks becomes increasingly
large due to the demands of practical applications, which poses a significant challenge to
constructing high-performance implementations of deep learning neural networks. In order to
improve the performance as well as to maintain low power cost, in this paper we design a
deep learning accelerator unit (DLAU), which is a scalable accelerator architecture for
large-scale deep learning networks using a field-programmable gate array (FPGA) as the
hardware prototype.

The DLAU accelerator employs three pipelined processing units to improve the
throughput and utilizes tile techniques to explore locality for deep learning applications.
Experimental results on a state-of-the-art Xilinx FPGA board demonstrate that the DLAU
accelerator is able to achieve up to 36.1x speedup compared to the Intel Core2 processors,
with a power consumption of 234 mW.

Sadiq M. Sait has described how, with the recent advances in digital technologies and
the availability of credible data, an area of artificial intelligence, deep learning, has emerged
and has demonstrated its ability and effectiveness in solving complex learning problems not
possible before. In particular, convolutional neural networks (CNNs) have demonstrated their
effectiveness in image detection and recognition applications. However, they require
intensive CPU operations and memory bandwidth that make general CPUs fail to achieve the
desired performance levels. Consequently, hardware accelerators that use application-specific
integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and graphic processing
units (GPUs) have been employed to improve the throughput of CNNs.

More precisely, FPGAs have recently been adopted for accelerating the
implementation of deep learning networks due to their ability to maximize parallelism as well
as their energy efficiency. In this paper, we review recent existing techniques for
accelerating deep learning networks on FPGAs. We highlight the key features employed by
the various techniques for improving the acceleration performance. In addition, we provide
recommendations for enhancing the utilization of FPGAs for CNN acceleration. The
techniques investigated in this paper represent the recent trends in FPGA-based accelerators
of deep learning networks. Thus, this review is expected to direct future advances on
efficient hardware accelerators and to be useful for deep learning researchers.

Neena Aloysius and M. Geetha note that the success of traditional methods for solving
computer vision problems heavily depends on the feature extraction process, but Convolutional
Neural Networks (CNNs) have provided an alternative for automatically learning domain-specific
features. Now every problem in the broader domain of computer vision is re-examined from
the perspective of this new methodology. Therefore it is essential to figure out the type of
network specific to a problem. In this work, we have done a thorough literature survey of
Convolutional Neural Networks, which are the most widely used framework of deep learning.
With AlexNet as the base CNN model, we have reviewed all the variations that have emerged
over time to suit various applications, along with a small discussion of the available frameworks
for implementing the same. We hope this article will serve as a guide for any neophyte in the
area.


Trupti R. Chavan and Abhijeet V. Nandedkar observe that the use of deep neural networks
for artificial intelligence tasks is increasing day by day. However, incremental learning in such
networks is a challenging task. This paper deals with learning new classes by using a pre-trained
model without training from scratch. The well-known VGGNet architecture is used for
classification and can be viewed as a cascaded structure of convolutional layers and a classifier.
A hybrid VGGNet model containing offline and online trained networks is introduced for
incremental learning. The offline trained network, which plays an important role in feature
extraction, is fixed with a pre-trained conventional network, while the online trained network
is adaptable and tuned to learn new classes. The key benefit of such learning is that, without
scratch training, a huge reduction in learning time and computation is achieved. The
experimental results obtained on the Caltech 101 dataset show that the performance of this
hybrid model is comparable to end-to-end training.


CHAPTER 3:

ALGORITHMS AND EXISTING TECHNIQUES FOR THE

ACCELERATION UNIT:
3.1 TECHNIQUES UTILIZED FOR DESIGNING THE SCALED
ACCELERATION UNIT:

Tile Techniques and Hot Spot Profiling

Restricted Boltzmann Machines (RBMs) have been widely used to efficiently train
each layer of a deep network. Normally a deep neural network is composed of one input
layer, several hidden layers and one classifier layer. The units in adjacent layers are all-to-all
weighted connected. The prediction process contains feed-forward computation from the given
input neurons to the output neurons with the current network configuration. The training process
includes pre-training, which locally tunes the connection weights between the units in adjacent
layers, and global training, which globally tunes the connection weights with the Back
Propagation process. Large-scale deep neural networks include iterative computations which
have few conditional branch operations, therefore they are suitable for parallel optimization in
hardware. In this paper we first explore the hot spots using the profiler. Results in Fig. 1
illustrate the percentage of running time spent in Matrix Multiplication (MM), Activation,
and Vector operations. For the three representative key operations: feed forward, Restricted
Boltzmann Machine (RBM), and back propagation (BP), matrix multiplication plays a
significant role in the overall execution. In particular, it takes 98.6%, 98.2%, and 99.1% of
the feed forward, RBM, and BP operations. In comparison, the activation function only takes
1.40%, 1.48%, and 0.42% of the three operations. Experimental results on profiling
demonstrate that the design and implementation of MM accelerators is able to improve the
overall speedup of the system significantly. However, considerable memory bandwidth and
computing resources are needed to support the parallel processing; consequently it poses a
significant challenge to FPGA implementations compared with GPU and CPU optimization
measures. In order to tackle the problem, in this paper we employ tile techniques to partition
the massive input data set into tiled subsets. Each designed hardware accelerator is able to
buffer the tiled subset of data for processing. In order to support the large-scale neural
networks, the accelerator architecture is reused. Moreover, the data access for each tiled
subset can run in parallel to the computation of the hardware accelerators.

Algorithm 1: Pseudo code of the tiled inputs
Require: Ni: the number of input neurons; No: the number of output neurons;
         Tile_Size: the tile size of the input data; batch_size: the batch size of the input data
for n = 0; n < batch_size; n++ do
    for k = 0; k < Ni; k += Tile_Size do
        for j = 0; j < No; j++ do
            y[n][j] = 0;
            for i = k; i < k + Tile_Size && i < Ni; i++ do
                y[n][j] += w[i][j] * x[n][i]
                if i == Ni - 1 then
                    y[n][j] = f(y[n][j]);
                end if
            end for
        end for
    end for
end for

In particular, for each iteration, the output neurons are reused as the input neurons in the next
iteration.
To generate the output neurons for each iteration, we need to multiply the input
neurons by each column of the weight matrix. As illustrated in Algorithm 1, the input data are
partitioned into tiles and then multiplied by the corresponding weights. Thereafter the calculated
partial sums are accumulated to get the result. Besides the input/output neurons, we also divide the
weight matrix into tiles corresponding to the tile size. As a consequence, the hardware cost of
the accelerator only depends on the tile size, which saves a significant number of hardware
resources. The tiled technique is thus able to implement large networks with limited hardware.
Moreover, the pipelined hardware implementation is another advantage of FPGA technology
compared to the GPU architecture, which uses massively parallel SIMD architectures to improve
the overall performance and throughput. According to the profiling results depicted in Table I,
during the prediction process and the training process in deep learning algorithms, the common
but important computational parts are matrix multiplication and activation functions; consequently
in this paper we implement a specialized accelerator to speed up the matrix multiplication and
activation functions.
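
To make the tiling scheme concrete in software terms, the following Python sketch (an illustration added for this discussion, not the Verilog implementation; the function name tiled_forward and the use of NumPy are assumptions) follows the description above: partial sums are accumulated tile by tile and the activation f is applied once the last tile of a row has been processed.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def tiled_forward(x, w, tile_size, f=sigmoid):
        # x: (batch_size, Ni) input neurons, w: (Ni, No) weight matrix
        batch_size, Ni = x.shape
        No = w.shape[1]
        y = np.zeros((batch_size, No))
        for n in range(batch_size):
            for k in range(0, Ni, tile_size):
                tile = slice(k, min(k + tile_size, Ni))
                # accumulate the partial sums contributed by this tile of input neurons
                y[n, :] += x[n, tile] @ w[tile, :]
            # activation applied after the last tile (the i == Ni - 1 case in Algorithm 1)
            y[n, :] = f(y[n, :])
        return y

    # example: batch of 1, Ni = 4 input neurons, No = 3 output neurons, Tile_Size = 2
    print(tiled_forward(np.random.rand(1, 4), np.random.rand(4, 3), tile_size=2))

Because only one tile of inputs and weights has to be resident at a time, the storage needed by the loop body depends on the tile size rather than on Ni, which is the property the hardware design exploits.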

3.2 Definition & Structure

Invented by Geoff Hinton, a Restricted Boltzmann Machine is an algorithm useful for
dimensionality reduction, classification, regression, collaborative filtering, feature learning
and topic modeling.
Given their relative simplicity and historical importance, restricted Boltzmann
machines are the first neural network we’ll tackle. In the paragraphs below, we describe in
diagrams and plain language how they work.
RBMs are shallow, two-layer neural nets that constitute the building blocks of deep-belief
networks. The first layer of the RBM is called the visible, or input, layer, and the second is
the hidden layer.


Each circle in the graph above represents a neuron-like unit called a node, and nodes
are simply where calculations take place. The nodes are connected to each other across
layers, but no two nodes of the same layer are linked.
That is, there is no intra-layer communication – this is the restriction in a restricted
Boltzmann machine. Each node is a locus of computation that processes input, and begins by
making stochastic decisions about whether to transmit that input or not. (Stochastic means
“randomly determined”, and in this case, the coefficients that modify inputs are randomly
initialized.)

3.3 GET STARTED WITH DEEP LEARNING

Each visible node takes a low-level feature from an item in the dataset to be learned.
For example, from a dataset of grayscale images, each visible node would receive one pixel-
value for each pixel in one image. (MNIST images have 784 pixels, so neural nets processing
them must have 784 input nodes on the visible layer.)
Now let’s follow that single pixel value, x, through the two-layer net. At node 1 of the
hidden layer, x is multiplied by a weight and added to a so-called bias. The result of those
two operations is fed into an activation function, which produces the node’s output, or the
strength of the signal passing through it, given input x.


activation f((weight w * input x) + bias b ) = output a

Next, let’s look at how several inputs would combine at one hidden node. Each x is
multiplied by a separate weight, the products are summed, added to a bias, and again the
result is passed through an activation function to produce the node’s output.
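
The weighted-sum-plus-bias computation just described can be written out in a few lines of illustrative Python (added here as an aid; the input values, weights and bias are made up, and the sigmoid is one possible choice of activation function):

    import math

    def node_output(inputs, weights, bias):
        # weighted sum of all visible inputs arriving at one hidden node
        s = sum(x * w for x, w in zip(inputs, weights)) + bias
        # activation function: a sigmoid squashes the sum into (0, 1)
        return 1.0 / (1.0 + math.exp(-s))

    # four visible inputs feeding one hidden node
    a = node_output(inputs=[0.2, 0.9, 0.4, 0.7],
                    weights=[0.1, -0.3, 0.25, 0.6],
                    bias=0.05)
    print(a)   # strength of the signal this hidden node passes on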


Because inputs from all visible nodes are being passed to all hidden nodes, an
RBM can be defined as a symmetrical bipartite graph.
3.4 SYMMETRICAL
Symmetrical means that each visible node is connected with each hidden node
(see below). Bipartite means it has two parts, or layers, and "graph" is the
mathematical term for a web of nodes.
At each hidden node, each input x is multiplied by its respective weight w.
That is, a single input x would have three weights here, making 12 weights altogether
(4 input nodes x 3 hidden nodes). The weights between two layers will always form a
matrix where the rows are equal to the input nodes, and the columns are equal to the
output nodes.
Each hidden node receives the four inputs multiplied by their respective
weights. The sum of those products is again added to a bias (which forces at least
some activations to happen), and the result is passed through the activation algorithm
producing one output for each hidden node.

If these two layers were part of a deeper neural network, the outputs of hidden
layer no. 1 would be passed as inputs to hidden layer no. 2, and from there through as
many hidden layers as you like until they reach a final classifying layer. (For simple
feed-forward movements, the RBM nodes function as an autoencoder and nothing more.)

3.5 Reconstructions

But in this introduction to restricted Boltzmann machines, we’ll focus on how


they learn to reconstruct data by themselves in an unsupervised fashion (unsupervised
means without ground-truth labels in a test set), making several forward and
backward passes between the visible layer and hidden layer no. 1 without involving a
deeper network.
In the reconstruction phase, the activations of hidden layer no. 1 become the
input in a backward pass. They are multiplied by the same weights, one per internode
edge, just as x was weight-adjusted on the forward pass.
The sum of those products is added to a visible-layer bias at each visible node,
and the output of those operations is a reconstruction; i.e. an approximation of the
original input. This can be represented by the following diagram:


Because the weights of the RBM are randomly initialized, the difference
between the reconstructions and the original input is often large. You can think of
reconstruction error as the difference between the values of r and the input values, and
that error is then back propagated against the RBM’s weights, again and again, in an
iterative learning process until an error minimum is reached.
As you can see, on its forward pass, an RBM uses inputs to make predictions
about node activations, or the probability of output given a weighted x: p(a|x; w).
But on its backward pass, when activations are fed in and reconstructions, or
guesses about the original data, are spit out, an RBM is attempting to estimate the
probability of inputs x given activations a, which are weighted with the same
coefficients as those used on the forward pass. This second phase can be expressed
as p ( x|a ; w).
Together, those two estimates will lead you to the joint probability distribution
of inputs x and activations a, or p(x, a). Reconstruction does something different from
regression, which estimates a continuous value based on many inputs, and different
from classification, which makes guesses about which discrete label to apply to a
given input example.
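
The two passes can be sketched in Python as follows (an illustration for this discussion only; the layer sizes, the random initialization and the sigmoid activations are assumptions, and the actual weight update, e.g. contrastive divergence, is not shown):

    import numpy as np

    rng = np.random.default_rng(0)
    n_visible, n_hidden = 784, 100
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))   # shared weights
    b_hidden = np.zeros(n_hidden)                            # hidden bias
    b_visible = np.zeros(n_visible)                          # visible bias

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = rng.random(n_visible)             # one input sample (e.g. pixel values)

    # forward pass: p(a | x; w) -- hidden activations from the input
    a = sigmoid(x @ W + b_hidden)

    # backward pass: p(x | a; w) -- reconstruction r using the SAME weights
    r = sigmoid(a @ W.T + b_visible)

    print(np.mean((r - x) ** 2))          # reconstruction error before training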


Reconstruction is making guesses about the probability distribution of the
original input; i.e. the values of many varied points at once. This is known
as generative learning, which must be distinguished from the so-called discriminative
learning performed by classification, which maps inputs to labels, effectively drawing
lines between groups of data points.
Let’s imagine that both the input data and the reconstructions are normal
curves of different shapes, which only partially overlap. To measure the distance
between its estimated probability distribution and the ground-truth distribution of the
input, RBMs use Kullback Leibler Divergence. A thorough explanation of the math
can be found on Wikipedia.
KL-Divergence measures the non-overlapping, or diverging, areas under the
two curves, and an RBM’s optimization algorithm attempts to minimize those areas
so that the shared weights, when multiplied by activations of hidden layer one,
produce a close approximation of the original input. On the left is the probability
distribution of a set of original inputs, p, juxtaposed with the reconstructed
distribution q; on the right, the integration of their differences.
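
For reference, the KL divergence between two discrete distributions p and q can be computed as in the short sketch below (illustrative Python; the two example distributions are invented):

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        # D_KL(p || q) = sum_i p_i * log(p_i / q_i); eps guards against log(0)
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p /= p.sum()
        q /= q.sum()
        return float(np.sum(p * np.log(p / q)))

    original      = [0.1, 0.4, 0.3, 0.2]   # ground-truth distribution p
    reconstructed = [0.2, 0.3, 0.3, 0.2]   # RBM's estimate q
    print(kl_divergence(original, reconstructed))   # 0 only when p and q coincide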

By iteratively adjusting the weights according to the error they produce, an


RBM learns to approximate the original data. You could say that the weights slowly
come to reflect the structure of the input, which is encoded in the activations of the
first hidden layer. The learning process looks like two probability distributions
converging, step by step.


3.6 Probability Distributions

Let’s talk about probability distributions for a moment. If you’re rolling two
dice, the probability distribution for all outcomes looks like this:
That is, 7s are the most likely because there are more ways to get to 7 (3+4,
1+6, 2+5) than there are ways to arrive at any other sum between 2 and 12. Any
formula attempting to predict the outcome of dice rolls needs to take seven’s greater
frequency into account.
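
A quick way to see this distribution is to enumerate all 36 equally likely outcomes, as in the following illustrative snippet (added for this discussion):

    from collections import Counter

    # count how many of the 36 (die1, die2) pairs give each possible sum
    counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
    probabilities = {total: n / 36 for total, n in sorted(counts.items())}
    print(probabilities[7])   # 6/36, about 0.167, the most likely sum
    print(probabilities[2])   # 1/36, about 0.028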
Or take another example: Languages are specific in the probability distribution
of their letters, because each language uses certain letters more than others. In
English, the letters e, t and a are the most common, while in Icelandic, the most
common letters are a, r and n. Attempting to reconstruct Icelandic with a weight set
based on English would lead to a large divergence.
In the same way, image datasets have unique probability distributions for their
pixel values, depending on the kind of images in the set. Pixel values are distributed
differently depending on whether the dataset includes MNIST’s handwritten
numerals:


or the headshots found in Labeled Faces in the Wild:

Imagine for a second an RBM that was only fed images of elephants and dogs,
and which had only two output nodes, one for each animal. The question the RBM is
asking itself on the forward pass is: Given these pixels, should my weights send a
stronger signal to the elephant node or the dog node? And the question the RBM asks
on the backward pass is: Given an elephant, which distribution of pixels should I
expect?
That’s joint probability: the simultaneous probability of x given a and
of a given x, expressed as the shared weights between the two layers of the RBM.
The process of learning reconstructions is, in a sense, learning which groups of pixels
tend to co-occur for a given set of images. The activations produced by nodes of
hidden layers deep in the network represent significant co-occurrences; e.g.
“nonlinear gray tube + big, floppy ears + wrinkles” might be one.
In the two images above, you see reconstructions learned by Deeplearning4j's
implementation of an RBM. These reconstructions represent what the RBM's
activations "think" the original data look like. Geoff Hinton refers to this as a sort of
machine "dreaming". When rendered during neural net training, such visualizations
are extremely useful heuristics to reassure oneself that the RBM is actually learning.
If it is not, then its hyperparameters, discussed below, should be adjusted.


One last point: You’ll notice that RBMs have two biases. This is one aspect
that distinguishes them from other auto encoders. The hidden bias helps the RBM
produce the activations on the forward pass (since biases impose a floor so that at
least some nodes fire no matter how sparse the data), while the visible layer’s biases
help the RBM learn the reconstructions on the backward pass.

3.7 Multiple Layers

Once this RBM learns the structure of the input data as it relates to the
activations of the first hidden layer, then the data is passed one layer down the net.
Your first hidden layer takes on the role of visible layer. The activations now
effectively become your input, and they are multiplied by weights at the nodes of the
second hidden layer, to produce another set of activations.
This process of creating sequential sets of activations by grouping features and
then grouping groups of features is the basis of a feature hierarchy, by which neural
networks learn more complex and abstract representations of data.
With each new hidden layer, the weights are adjusted until that layer is able to
approximate the input from the previous layer. This is greedy, layerwise and
unsupervised pre-training. It requires no labels to improve the weights of the network,
which means you can train on unlabeled data, untouched by human hands, which is
the vast majority of data in the world. As a rule, algorithms exposed to more data
produce more accurate results, and this is one of the reasons why deep-learning
algorithms are kicking butt.
Because those weights already approximate the features of the data, they are
well positioned to learn better when, in a second step, you try to classify images with
the deep-belief network in a subsequent supervised learning stage.
While RBMs have many uses, proper initialization of weights to facilitate later
learning and classification is one of their chief advantages. In a sense, they
accomplish something similar to back propagation: they push weights to model data
well. You could say that pre-training and back prop are substitutable means to the
same end.
To synthesize restricted Boltzmann machines in one diagram, here is a
symmetrical bipartite and bidirectional graph:


For those interested in studying the structure of RBMs in greater depth, they
are one type of undirected graphical model, also called a Markov random field.

3.8 Code Sample: Stacked RBMS

https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-
examples/src/main/java/org/deeplearning4j/examples/unsupervised/deepbelief/DeepA
utoEncoderExample.java
Parameters & k
The variable k is the number of times you run contrastive divergence.
Contrastive divergence is the method used to calculate the gradient (the slope
representing the relationship between a network’s weights and its error), without
which no learning can occur.
Each time contrastive divergence is run, it’s a sample of the Markov Chain
composing the restricted Boltzmann machine. A typical value is 1. In the above
example, you can see how RBMs can be created as layers with a more general Multi
Layer Configuration. After each dot you’ll find an additional parameter that affects
the structure and performance of a deep neural net. Most of those parameters are
defined on this site.


Weight Init, or weight initialization, represents the starting value of the
coefficients that amplify or mute the input signal coming into each node. Proper
weight initialization can save you a lot of training time, because training a net is
nothing more than adjusting the coefficients to transmit the best signals, which allow
the net to classify accurately.
Activation Function refers to one of a set of functions that determine the
threshold(s) at each node above which a signal is passed through the node, and below
which it is blocked. If a node passes the signal through, it is “activated.”
Optimization Algo refers to the manner by which a neural net minimizes
error, or finds a locus of least error, as it adjusts its coefficients step by step. LBFGS,
an acronym whose letters each refer to the last names of its multiple inventors, is an
optimization algorithm that makes use of second-order derivatives to calculate the
slope of gradient along which coefficients are adjusted.
Regularization methods such as L2 help fight overfitting in neural nets.
Regularization essentially punishes large coefficients, since large coefficients by
definition mean the net has learned to pin its results to a few heavily weighted inputs.
Overly strong weights can make it difficult to generalize a net's model when exposed
to new data.
Visible Unit/Hidden Unit refers to the layers of a neural net. The Visible
Unit, or layer, is the layer of nodes where input goes in, and the Hidden Unit is the
layer where those inputs are recombined in more complex features. Both units have
their own so-called transforms, in this case Gaussian for the visible and Rectified
Linear for the hidden, which map the signal coming out of their respective layers onto
a new space.
Loss Function is the way you measure error, or the difference between your
net’s guesses and the correct labels contained in the test set. Here we
use SQUARED_ERROR, which makes all errors positive so they can be summed and
back propagated.
Learning Rate, like momentum, affects how much the neural net adjusts the
coefficients on each iteration as it corrects for error. These two parameters help
determine the size of the steps the net takes down the gradient towards a local
optimum. A large learning rate will make the net learn fast, and maybe overshoot the
optimum. A small learning rate will slow down the learning, which can be inefficient.
Continuous RBMs

A continuous restricted Boltzmann machine is a form of RBM that accepts


continuous input (i.e. numbers cut finer than integers) via a different type of
contrastive divergence sampling. This allows the CRBM to handle things like image
pixels or word-count vectors that are normalized to decimals between zero and one.
It should be noted that every layer of a deep-learning net requires four
elements: the input, the coefficients, a bias and the transform (activation algorithm).
The input is the numeric data, a vector, fed to it from the previous layer (or as the
original data). The coefficients are the weights given to various features that pass
through each node layer.
The bias ensures that some nodes in a layer will be activated no matter what.
The transformation is an additional algorithm that squashes the data after it passes
through each layer in a way that makes gradients easier to compute (and gradients are
necessary for a net to learn).
Those additional algorithms and their combinations can vary layer by layer. An
effective continuous restricted Boltzmann machine employs a Gaussian
transformation on the visible (or input) layer and a rectified-linear-unit transformation
on the hidden layer. That's particularly useful in facial reconstruction. For RBMs
handling binary data, simply make both transformations binary ones.
Gaussian transformations do not work well on RBMs’ hidden layers. The
rectified-linear-unit transformations used instead are capable of representing more
features than binary transformations, which we employ on deep-belief nets.


CHAPTER 4:
INTRODUCTION TO DLAU:

4.1 INTRODUCTION

In the past few years, machine learning has become pervasive in various research
fields and commercial applications, and achieved satisfactory products. The emergence of
deep learning sped up the development of machine learning and artificial intelligence.
Consequently, deep learning has become a research hot spot in research organizations.
In general, deep learning uses a multi-layer neural network model to extract high-
level features which are a combination of low level abstractions to find the distributed data
features, in order to solve complex problems in machine learning.
Currently the most widely used neural models of deep learning are Deep Neural
Networks (DNNs) and Convolution Neural Networks (CNNs) , which have been proved to
have excellent capability in solving picture recognition, voice recognition and other complex
machine learning tasks.
However, with the increasing accuracy requirements and complexity for the practical
applications, the size of the neural networks becomes explosively large scale, such as the
Baidu Brain with 100 Billion neuronal connections, and the Google cat-recognizing system
with 1 Billion neuronal connections.
The explosive volume of data makes data centers quite power consuming. In
particular, the electricity consumption of data centers in the U.S. is projected to increase to
roughly 140 billion kilowatt-hours annually by 2020.
Therefore, it poses significant challenges to implement high performance deep
learning networks with low power cost, especially for large scale deep learning neural
network models.
So far, the state-of-the-art means for accelerating deep learning algorithms are Field-
Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), and
Graphic Processing Units (GPUs). Compared with GPU acceleration, hardware accelerators
like FPGA and ASIC can achieve at least moderate performance with lower power
consumption.


However, both FPGA and ASIC have relatively limited computing resources,
memory, and I/O bandwidths; therefore it is challenging to develop complex and massive
deep neural networks using hardware accelerators. ASICs also have a longer development
cycle, and their flexibility is not satisfying.
Chen et al. present a ubiquitous machine-learning hardware accelerator called
DianNao, which opens a new paradigm for machine learning hardware accelerators focusing
on neural networks. But DianNao is not implemented using reconfigurable hardware like
FPGAs, therefore it cannot adapt to different application demands.
Among current FPGA acceleration research, Ly and Chow designed FPGA-based
solutions to accelerate the Restricted Boltzmann Machine (RBM). They created dedicated
hardware processing cores which are optimized for the RBM algorithm.
Similarly, Kim et al. also developed an FPGA-based accelerator for the restricted Boltzmann
machine. They use multiple RBM processing modules in parallel, with each module
responsible for a relatively small number of nodes. Other similar works also present FPGA-based
neural network accelerators. Qi et al. present an FPGA-based accelerator, but it cannot
accommodate changing network sizes and network topologies.
To sum up, these studies focus on implementing a particular deep learning algorithm
efficiently, but how to increase the size of the neural networks with scalable and flexible
hardware architecture has not been properly solved. To tackle these problems, we present a
scalable deep learning accelerator unit named DLAU to speed up the kernel computational
parts of deep learning algorithms.
In particular, we utilize the tile techniques, FIFO buffers, and pipelines to minimize
memory transfer operations, and reuse the computing units to implement the large-size neural
networks. This approach distinguishes itself from previous literatures with following
contributions:
1. In order to explore the locality of the deep learning application, we employ tile
techniques to partition the large scale input data. The DLAU architecture can be
configured to operate on different sizes of tile data to leverage the trade-off between
speedup and hardware costs. Consequently the FPGA-based accelerator is more scalable
and can accommodate different machine learning applications.
2. The DLAU accelerator is composed of three fully pipelined processing units,
including TMMU, PSAU, and AFAU. Different network topologies such as CNN,
DNN, or even emerging neural networks can be composed from these basic modules.
Consequently the scalability of the FPGA-based accelerator is higher than that of an
ASIC-based accelerator.

4.1 .1 Tile Techniques and Hot Spot Profiling

Restricted Boltzmann Machines (RBMs) have been widely used to efficiently train
each layer of a deep network. Normally a deep neural network is composed of one input
layer, several hidden layers and one classifier layer. The units in adjacent layers are all-to-all
weighted connected.
The prediction process contains feed forward computation from given input neurons
to the output neurons with the current network configurations. Training process includes pre-
training which locally tune the connection weights between the units in adjacent layers, and
global training which globally tune the connection weights with Back Propagation process.
The large-scale deep neural networks include iterative computations which have few
conditional branch operations , therefore they are suitable for parallel optimization in
hardware. In this paper we first explore the hot spots using the profiler. Results in Fig. 1
illustrate the percentage of running time spent in Matrix Multiplication (MM), Activation,
and Vector operations.
For the three representative key operations: feed forward, Restricted Boltzmann
Machine (RBM), and back propagation (BP), matrix multiplication plays a significant role in
the overall execution. In particular, it takes 98.6%, 98.2%, and 99.1% of the feed forward,
RBM, and BP operations. In comparison, the activation function only takes 1.40%, 1.48%,
and 0.42% of the three operations.
Experimental results on profiling demonstrate that the design and implementation of
MM accelerators is able to improve the overall speedup of the system significantly. However,
considerable memory bandwidth and computing resources are needed to support the parallel
processing, consequently it poses a significant challenge to FPGA implementations compared
with GPU and CPU optimization measures.
In order to tackle the problem, in this paper we employ tile techniques to partition the
massive input data set into tiled subsets. Each designed hardware accelerator is able to buffer
the tiled subset of data for processing. In order to support the large-scale neural networks, the
accelerator architecture is reused.


Moreover, the data access for each tiled subset can run in parallel to the computation
of the hardware accelerators.

Algorithm 1: Pseudo code of the tiled inputs
Require: Ni: the number of input neurons; No: the number of output neurons;
         Tile_Size: the tile size of the input data; batch_size: the batch size of the input data
for n = 0; n < batch_size; n++ do
    for k = 0; k < Ni; k += Tile_Size do
        for j = 0; j < No; j++ do
            y[n][j] = 0;
            for i = k; i < k + Tile_Size && i < Ni; i++ do
                y[n][j] += w[i][j] * x[n][i]
                if i == Ni - 1 then
                    y[n][j] = f(y[n][j]);
                end if
            end for
        end for
    end for
end for

In particular, for each iteration, output neurons are reused as the input neurons in next iteration.
To generate the output neurons for each iteration, we need to multiply the input
neurons by each column of the weight matrix. As illustrated in Algorithm 1, the input data are
partitioned into tiles and then multiplied by the corresponding weights. Thereafter the
calculated partial sums are accumulated to get the result.
Besides the input/output neurons, we also divide the weight matrix into tiles
corresponding to the tile size. As a consequence, the hardware cost of the accelerator only
depends on the tile size, which saves a significant number of hardware resources.
The tiled technique is thus able to implement large networks with limited hardware.
Moreover, the pipelined hardware implementation is another advantage of FPGA technology
compared to the GPU architecture, which uses massively parallel SIMD architectures to
improve the overall performance and throughput.
According to the profiling results depicted in Table I, during the prediction process
and the training process in deep learning algorithms, the common but important
computational parts are matrix multiplication and activation functions, consequently in this
paper we implement the specialized accelerator to speed up the matrix multiplication and
activation functions.

4.2 DLAU Architecture and Execution Model

Fig. 1 describes the DLAU system architecture which contains an embedded


processor, a DDR3 memory controller, a DMA module, and the DLAU accelerator. The
embedded processor is responsible for providing programming interface to the users and
communicating with DLAU via JTAG-UART.
In particular it transfers the input data and the weight matrix to internal BRAM
blocks, activates the DLAU accelerator, and returns the results to the user after execution.
The DLAU is integrated as a standalone unit which is flexible and adaptive to accommodate
different applications with configurations.


The DLAU consists of 3 processing units organized in a pipeline manner: Tiled


Matrix Multiplication Unit (TMMU), Part Sum Accumulation Unit (PSAU), and Activation
Function Acceleration Unit (AFAU). For execution, DLAU reads the tiled data from the
memory by DMA, computes with all the three processing units in turn, and then writes the
results back to the memory.
In particular, the DLAU accelerator architecture has the following key features. FIFO
Buffer: each processing unit in DLAU has an input buffer and an output buffer to receive or
send data in FIFO order.
These buffers are employed to prevent the data loss caused by the inconsistent
throughput between the processing units. Tiled Techniques: different machine learning
applications may require specific neural network sizes.
The tile technique is employed to divide the large volume of data into small tiles that
can be cached on chip, therefore the accelerator can be adapted to different neural network
sizes.
Consequently the FPGA-based accelerator is more scalable and can accommodate different
machine learning applications. Pipeline Accelerator: we use a stream-like data passing
mechanism (e.g. AXI-Stream for demonstration) to transfer data between the adjacent
processing units, therefore TMMU, PSAU, and AFAU can compute in a streaming-like
manner.
Of these three computational modules, TMMU is the primary computational unit,
which reads the total weights and tiled nodes data through DMA, performs the calculations,
and then transfers the intermediate Part Sum results to PSAU.
PSAU collects Part Sums and performs accumulation. When the accumulation is
completed, results will be passed to AFAU.
AFAU performs the activation function using piecewise linear interpolation methods.
In the rest of this section, we will detail the implementation of these three processing units
respectively.
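
As a software analogy of this execution model (a sequential Python sketch added for this thesis; the function names tmmu, psau and afau and the toy data are invented, and the real hardware stages run concurrently rather than one after another), data flows from TMMU through a FIFO buffer to PSAU and finally to AFAU:

    from collections import deque
    import math

    def tmmu(weight_cols, tile_inputs):
        # multiply one tile of node data by the weight columns -> one Part Sum per output
        return [sum(w * x for w, x in zip(col, tile_inputs)) for col in weight_cols]

    def psau(acc, part_sums):
        # accumulate the Part Sums produced by TMMU
        return [a + p for a, p in zip(acc, part_sums)]

    def afau(sums):
        # activation function (sigmoid) applied to the accumulated results
        return [1.0 / (1.0 + math.exp(-s)) for s in sums]

    fifo = deque()                 # FIFO buffer between TMMU and PSAU
    acc = [0.0, 0.0]               # running sums for two output neurons

    # two tiles of two input values each; one weight column per output neuron
    tiles = [([[0.1, 0.2], [0.3, 0.4]], [1.0, 0.5]),
             ([[0.5, 0.6], [0.7, 0.8]], [0.2, 0.9])]

    for weight_cols, tile_inputs in tiles:
        fifo.append(tmmu(weight_cols, tile_inputs))   # TMMU writes Part Sums to the FIFO
    while fifo:
        acc = psau(acc, fifo.popleft())               # PSAU drains the FIFO and accumulates
    print(afau(acc))                                  # AFAU produces the final outputs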

4.2.1 TMMU architecture

Tiled Matrix Multiplication Unit (TMMU) is in charge of multiplication and


accumulation operations. TMMU is specially designed to exploit the data locality of the
weights and is responsible for calculating the Part Sums.


TMMU employs an input FIFO buffer which receives the data transferred from DMA
and an output FIFO buffer to send Part Sums to PSAU. Fig. 2 illustrates the TMMU schematic
diagram, in which we set tile size = 32 as an example. TMMU first reads the weight matrix
data from the input buffer into 32 different BRAMs indexed by the row number of the weight
matrix (n = i % 32, where n refers to the number of the BRAM and i is the row number of the
weight matrix).

Then, TMMU begins to buffer the tiled node data. The first time, TMMU reads the
32 tiled values into registers Reg a and starts execution.
In parallel to the computation, at every cycle TMMU reads the next node data from the input
buffer and saves them to the registers Reg b. Consequently the registers Reg a and Reg b can
be used alternately.
For the calculation, we use a pipelined binary adder tree structure to optimize the
performance. As depicted in Fig. 2, the weight data and the node data are saved in BRAMs
and registers.


The pipeline takes advantage of time-sharing the coarse-grained accelerators. As a
consequence, this implementation enables the TMMU unit to produce a Part Sum result every
clock cycle.
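
The alternating Reg a / Reg b scheme and the binary adder tree can be illustrated with the Python sketch below (a software analogy written for this thesis; the names reg_a, reg_b and adder_tree and the toy data are not taken from the Verilog source):

    def adder_tree(products):
        # pairwise (binary-tree) reduction of the products into one Part Sum
        level = list(products)
        while len(level) > 1:
            if len(level) % 2:               # odd count: pad so pairs line up
                level.append(0.0)
            level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        return level[0]

    def tmmu_tile_sums(weight_col, tiles):
        # reg_a holds the tile being computed while reg_b is loaded with the next tile
        reg_a, reg_b, part_sums = None, None, []
        for incoming in tiles + [None]:
            reg_a, reg_b = reg_b, incoming   # the two registers are used alternately
            if reg_a is not None:
                products = [w * x for w, x in zip(weight_col, reg_a)]
                part_sums.append(adder_tree(products))
        return part_sums

    # two tiles of 4 node values against one weight column of length 4
    print(tmmu_tile_sums([0.1, 0.2, 0.3, 0.4], [[1, 2, 3, 4], [5, 6, 7, 8]]))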

4.2.2 PSAU architecture

Part Sum Accumulation Unit (PSAU) is responsible for the accumulation operation.

The figure presents the PSAU architecture, which accumulates the Part Sums produced by TMMU.
If the Part Sum is the final result, PSAU will write the value to output buffer and send
results to AFAU in a pipeline manner.
PSAU can accumulate one Part Sum every clock cycle, therefore the throughput of
PSAU accumulation matches the generation of the Part Sum in TMMU.
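
A minimal software analogy of this behaviour (illustrative Python; the idea of one Part Sum arriving per call and the parameter total_parts are assumptions made for the example):

    def psau(part_sums, total_parts):
        acc, received = 0.0, 0
        for ps in part_sums:               # one Part Sum arrives per clock cycle
            acc += ps
            received += 1
            if received == total_parts:    # final result -> pass to AFAU, then reset
                yield acc
                acc, received = 0.0, 0

    # two outputs, each built from three Part Sums
    print(list(psau([1, 2, 3, 4, 5, 6], total_parts=3)))   # [6, 15]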

4.2.3 AFAU architecture

Finally, the Activation Function Acceleration Unit (AFAU) implements the activation
function using piecewise linear interpolation (y = ai*x + bi, for x in [xi, xi+1)).
This method has been widely applied to implement activation functions with
negligible accuracy loss when the interval between xi and xi+1 is small. Eq. (1) shows
the implementation of the sigmoid function.
For x > 8 and x <= -8, the results are sufficiently close to the bounds of 1 and 0,
respectively. For the cases in -8 < x <= 8, the piecewise linear interpolation of Eq. (1) is applied.


Similar to PSAU, AFAU also has both input buffer and output buffer to maintain the
throughput with other processing units. In particular, we use two separate BRAMs to store
the values of a and b.
The computation of AFAU is pipelined to compute the sigmoid function every clock cycle.
As a consequence, all three processing units are fully pipelined to ensure the peak
throughput of the DLAU accelerator architecture.
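
The piecewise linear scheme can be sketched in Python as below (an illustration added for this thesis; the number of segments and the way the a and b tables are generated here are assumptions, whereas the actual AFAU stores precomputed a and b values in two BRAMs):

    import math

    SEGMENTS = 64                       # assumed number of segments over [-8, 8)
    STEP = 16.0 / SEGMENTS

    def exact_sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # precompute the slope a_i and intercept b_i of each segment
    A, B = [], []
    for i in range(SEGMENTS):
        x0, x1 = -8.0 + i * STEP, -8.0 + (i + 1) * STEP
        a = (exact_sigmoid(x1) - exact_sigmoid(x0)) / (x1 - x0)
        A.append(a)
        B.append(exact_sigmoid(x0) - a * x0)

    def afau_sigmoid(x):
        if x > 8.0:                     # saturate towards 1
            return 1.0
        if x <= -8.0:                   # saturate towards 0
            return 0.0
        i = min(int((x + 8.0) / STEP), SEGMENTS - 1)   # segment lookup
        return A[i] * x + B[i]          # y = a_i * x + b_i

    print(afau_sigmoid(0.5), exact_sigmoid(0.5))       # approximation vs. exact value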


CHAPTER 5

PROPOSED APPLICATION FOR DLAU MODELLING


5.1 Deep Learning Module based DLAU:
Our design specification models the concept of a DL-MODULE (deep learning module), which
provides an estimate of how the DLAU operates on a DL-MODULE consisting of only data
segments or image segments.
The design of the DLM based on the DLAU architecture emphasizes how each such design
module is implemented and applied to particular applications. We propose a controller
architecture for the current CNN model of the DLM-based DLAU architecture, which
emphasizes how each layer is utilized and controlled according to the timing considerations
of the design choices.
The controller operation is applied to the DLAU architecture as TMMU, PSAU and
AFAU. Here the operating characteristics of each module are based on the specific criteria of
the design choice chosen for each layer of the CNN structure. Our design emphasizes the
Laplacian filter and its convolution modifications on the CNN structure.


As seen from the above concept, the existing design is modelled only on DLAU
architectures with different numbers of bits for different applications; hence it cannot provide
an accurate solution by itself.
To provide such complex features in one application we need to know how each
module operates for the chosen application.
Our concept provides the modules TMMU, PSAU and AFAU, whose behavior,
characteristics and reliability need to be estimated and analyzed; the chosen application
drives the design considerations.

5.2 DLAU (TMMU, PSAU & AFAU) Structure Design:

Our DLAU architecture is applied on the CNN layers to control its data
operations and output generation at each phase of the design considered.


Since the CNN structure itself has multiple layers of filters and boosting
techniques, we propose one such filter and one boosting technique in our design, which
reduces the real-time latency of the CNN layers considered.
The figure considered provides the analysis of the following:

FLOWCHART FOR PSAU AND TMMU:

5.3 TMMU & PSAU:
From the design point of view we have considered the PSAU module as a
data-accelerated controller.
FLOW DIAGRAM PSAU:


5.4 AFAU:

Here the concept for the AFAU is represented with a flow diagram, which depicts the
modelling of this circuit as an FSM-based digital display, where it checks the comparative
analysis of the outputs against the inputs for both the intermediate and the original sections.

CNN LAYERS FLOW DIAGRAM:

CONVOLUTION WITH LAPLACIAN FILTER STAGE 1:

A convolutional neural network consists of an input and an output layer, as well as
multiple hidden layers. The hidden layers of a CNN typically consist of a series of
convolutional layers that convolve with a multiplication or other dot product.
The activation function is commonly a ReLU layer, and it is subsequently followed by
additional convolutions such as pooling layers, fully connected layers and normalization
layers, referred to as hidden layers because their inputs and outputs are masked by the
activation function and the final convolution. The final convolution, in turn, often involves
back propagation in order to more precisely weight the end product.
Although the layers are casually referred to as convolutions, this is only by
convention. Mathematically, it is in fact a sliding dot product or cross-correlation. This has
significance for the indices in the matrix, in that it affects how the weight is determined at a
particular index point.
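
A direct software illustration of such a convolution with a Laplacian filter is sketched below (Python added for this discussion; the 3x3 kernel is the common discrete Laplacian and the 6x6 input image is a random placeholder):

    import numpy as np

    LAPLACIAN = np.array([[0,  1, 0],
                          [1, -4, 1],
                          [0,  1, 0]], dtype=float)   # standard 3x3 Laplacian kernel

    def convolve2d(image, kernel):
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for r in range(oh):
            for c in range(ow):
                # sliding dot product of the kernel with the image patch
                out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
        return out

    image = np.random.rand(6, 6)          # placeholder grayscale image
    print(convolve2d(image, LAPLACIAN))   # 4x4 edge-response map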

CONVOLUTION WITH LAPLACIAN FILTER STAGE 2:

Pooling

Convolutional networks may incorporate local or global pooling layers to
streamline the underlying computation. Pooling layers reduce the dimensions of the data
by combining the outputs of neuron clusters at one layer into a single neuron in the following
layer.
Local pooling combines small clusters, typically 2 x 2. Global pooling acts on all of
the neurons of the convolutional layer.
Also, pooling may compute a maximum or an average. Max pooling uses the
maximum value from each of a cluster of neurons at the prior layer.

Average pooling uses the average value from each of a cluster of neurons at
the prior layer.
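
Both pooling variants can be sketched in a few lines of illustrative Python (added here; the helper pool2x2 and the even-sized single-channel feature map are assumptions of the example):

    import numpy as np

    def pool2x2(feature_map, mode="max"):
        h, w = feature_map.shape
        out = np.zeros((h // 2, w // 2))
        for r in range(0, h - 1, 2):
            for c in range(0, w - 1, 2):
                window = feature_map[r:r + 2, c:c + 2]   # 2 x 2 cluster of neurons
                out[r // 2, c // 2] = window.max() if mode == "max" else window.mean()
        return out

    fmap = np.arange(16, dtype=float).reshape(4, 4)
    print(pool2x2(fmap, "max"))    # keeps the strongest activation of each cluster
    print(pool2x2(fmap, "mean"))   # average pooling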

MAX-POOLING :


CHAPTER 6

RESULTS AND DISCUSSION:


MATHEMATICAL AND ANALYTICAL CALCULATION FOR
PROPOSED RESULTS:

In this section we explain the proposed design results and how each analysis of the
design modelling is achieved, by considering the following:
 Area analysis
 Power analysis
 Time/Delay analysis (initialization analysis)
 Speed analysis
By using the VHDL/Verilog language we design the required system, which is then processed
and subjected to the analyses mentioned above through:
 Synthesis,
 Place and Route,
 Simulation.

Synthesis

In this process the initially designed Verilog or VHDL program code is converted into
a netlist format. We then analyze the complete circuit in terms of its logic elements and its
RTL implementation.
In this project we need to control and model each design phase for TX and RX so that
the transmission is as fast as possible. This process generates a netlist for each design element.

Simulate:

The simulation process mainly requires inputs and outputs, meaning that the output can
be observed with respect to the given input over clock pulses (cycles). In this process we
apply specific input stimuli, in the form of clock pulses, to obtain a simulated model of the
designed circuitry. The observed output also depends on the duty cycle of the clock:
Duty cycle (D.C.) = Ton / (Ton + Toff)

For example, if the user assumes a total clock period in which Ton is greater than Toff
(Ton > Toff), the duty cycle is higher and the design achieves better, more stable operation;
a higher duty cycle (D.C.) is therefore preferred. Conversely, when Ton is less than Toff
(Ton < Toff) the duty cycle is lower, and the achievable stability is correspondingly reduced.
In practice these criteria are also affected by the finite state machines, LUTs and hold times
present in the design, so the conditions change accordingly.
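As a simple illustration (the module name, clock signal and timing values below are assumed
purely for this example), a test-bench clock with Ton greater than Toff can be described as
follows:

`timescale 1ns / 1ps
// Illustrative test-bench clock: Ton = 7 ns, Toff = 3 ns, so D.C. = 7/(7 + 3) = 70%.
module clk_gen_tb;
    reg clk = 1'b0;          // generated clock
    always begin
        clk = 1'b1; #7;      // high for Ton
        clk = 1'b0; #3;      // low for Toff
    end
    initial #100 $finish;    // run briefly and stop the simulation
endmodule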

Simulation Results:

From the plotted waveforms of the simulated output we can observe the initialization of
each data value with respect to its inputs, where each module operates with different timing
circuitry. The circuit is then synchronized using the design clock generated by the user and
estimated accordingly.

In the figures above and below, the signals Xmit_d and Data1_tx are observed to take the
value 10010101. This value is the input assigned to the DLAU; here the DLAU acts as the
controller and comparator, where each data word from the CNN layers used in the design is
controlled and checked accordingly.

Finally, after a few iterations of the clk cycles we are able to recover the same data, and
the comparison between the original and received values is verified based on the AFAU
operation.

Comparison of the existing and proposed designs:

SNO   PARAMETERS     EXISTING DESIGN   PROPOSED DESIGN
1     AREA           48%               23%
2     POWER          1.12 W            0.185 W
3     LATENCY        58                22
4     ROUTE DELAY    6.87 ns           2.28 ns
5     TOTAL DELAY    4.78 ns           2.85 ns

CONCLUSION:

As per the proposed design, we have estimated and calculated the design parameters from
an application point of view, in which the DLAU is utilized and verified with DL-MODULE
based modelling.
The results are compared accordingly and tabulated above. Our design model has thus
been verified against the existing DLAU design, with the application case considering the
DL-MODULE together with the DLAU.
According to the results and the implementation cycle, the comparison between the
existing and proposed design models has been presented.
Hence, from the result and implementation point of view, the proposed method is shown
to be more reliable and more effective for the different kinds of applications where power
and area are critical.

AREA UTILIZATION:

As per the design, we have estimated the area at about 40% for the input/output
configuration, where the device characteristics change depending on the design criteria.
The design model uses fewer than 300 circuit elements such as flip-flops and about 710
look-up tables, which corresponds to roughly 0.6 per cent of the total logic elements available
for the area representation.

POWER UTILIZATION:

The modelled power analysis does not rely on any explicit power-reduction schemes. By
modelling the signal characteristics we can verify the simulation output results, in particular
the output data, from which the correct fan-in and fan-out for the design are estimated. The
power utilization for the design under test is shown below:

Appendix

`resetall
`timescale 1ns/10ps
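// Top-level CNN datapath and controller: connects the main control generator, the two
// convolution stages, the three adder stages and the max-pooling unit into one pipeline.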
module main_controler_cnn(
input wire clk_main,
input wire rst,
output wire add_stg2,
output wire add_stg3,
output wire add_stg4,
output wire conv_stage1,
output wire conv_stg2,
output wire [15:0] max_pool_out
);
wire clk;
wire conv2done;
wire [15:0] dbus0;
wire [15:0] dbus1;
wire [15:0] dbus2;
wire [15:0] dbus3;
wire [15:0] dbus4;
wire done;
wire done1;
wire done2;
wire done3;
wire enable;
wire [15:0] main_in;
wire maxPoolingDone;
wire [4:0] output1;
wire [15:0] output10;
wire [15:0] output11;
wire [15:0] output12;
wire [15:0] output13;
wire [15:0] output14;
wire [15:0] output15;
wire [4:0] output2;
wire [5:0] output3;
wire [4:0] output4;
wire [4:0] output5;
wire [5:0] output6;
wire [6:0] output7;
wire [7:0] output8;
wire [15:0] output9;


ConvolutionStage1 U_2(
.clk (clk),
.input2 (dbus0),
.input4 (dbus1),
.input5 (dbus2),
.input6 (dbus3),
.input8 (dbus4),
.output1 (output1),
.output2 (output2),
.output3 (output3),
.output4 (output4),
.output5 (output5),
.enable (enable),
.done (done)
);
ConvolutionStage2 U_4(
.clk (clk),
.enable (enable),
.input1 (output8),
.input2 (output8),
.input3 (output8),
.input4 (output8),
.input5 (output8),
.input6 (output8),
.input7 (output8),
.input8 (output8),
.input9 (output8),
.input10 (output8),
.input11 (output8),
.input12 (output8),
.output1 (output9),
.output2 (output10),
.output3 (output11),
.output4 (output12),
.output5 (output13),
.output6 (output14),
.done (conv2done)
);
adderStage2 U_0(
.input1 (output1),
.input2 (output2),
.input3 (output3),
.output1 (output7),
.clk (clk),
.enable (enable),
.done (done1) );
adderStage3 U_3(
.input1 (output4),
.input2 (output5),
.output1 (output6),
.clk (clk),
.enable (enable),
.done (done2)
);
adderStage4 U_5(
.input1 (output7),
.input2 (output6),
.output1 (output8),
.clk (clk),
.enable (enable),
.done (done3)
);
main_control_gen U_1(
.clk_main (clk_main),
.conv2done (conv2done),
.done (done),
.done1 (done1),
.done2 (done2),
.done3 (done3),
.maxPoolingDone (maxPoolingDone),
.output15 (output15),
.rst (rst),
.add_stg2 (add_stg2),
.add_stg3 (add_stg3),
.add_stg4 (add_stg4),
.clk (clk),
.conv_stage1 (conv_stage1),
.conv_stg2 (conv_stg2),
.dbus0 (dbus0),
.dbus1 (dbus1),
.dbus2 (dbus2),
.dbus3 (dbus3),
.dbus4 (dbus4),
.enable (enable),
.max_pool_out (max_pool_out)
);
maxPooling U_6(
.clk (clk),
.input1 (output9),
.input2 (output10),
.input3 (output11),
.input4 (main_in),
.enable (enable),
.output1 (output15),
.maxPoolingDone (maxPoolingDone)
);
assign main_in = output4 ^ output5 ^ output6;
endmodule

…………………………………………………………………………………………..
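// Test-pattern source: loads 16-bit words from image1.dat via $readmemh and streams them
// out on file_out (behavioural, delay-based code intended for simulation only).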

module file_gen(
output wire clk,
input wire clk_main,
output wire enable,
output wire [15:0] file_out,
input wire rst
);
reg [15:0] ram [15:0];
reg [15:0] out1;
reg en;
integer i,j;
always@(posedge(clk_main))
begin
if(rst)
begin
for (i = 0; i <= 15; i = i + 1)  // bound matched to the 16-entry RAM declared above
ram[i] =0;
out1=0;
end
else
begin
$readmemh("C:\\Users\\Rahul\\Desktop\\dlau\\image1.dat",ram);
for (i=0;i<=15;i=i+1)
for (j=0;j<=15;j=j+1)
begin
out1= ram[0];
#10 out1= ram[1];
#10 out1 = ram[2];
#15 out1 = ram[3];
#20 out1 = ram[4];
#25 out1 = ram[5];


#10 out1 = ram[6];
#35 out1 = ram[7];
#50 out1 = ram[8];
#60 out1 = ram[9];
#70 out1 = ram[10];
#80 out1 = ram[11];
#90 out1 = ram[12];
#100 out1 = ram[13];
#110 out1 = ram[14];
#120 out1 = ram[15];
end
end
end
assign file_out=out1;
assign enable = rst;
assign clk= clk_main;
endmodule

……………………………………………………………………………………………

`resetall
`timescale 1ns/10ps
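// TMMU/PSAU front end: slices overlapping 4-bit fields from the incoming data word and
// registers them onto the five buses (dbus0-dbus4) that feed the first convolution stage.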
module TMMU_PSAU(
input wire clk_main,
output wire [15:0] dbus0,
output wire [15:0] dbus1,
output wire [15:0] dbus2,
output wire [15:0] dbus3,
output wire [15:0] dbus4,
input wire [15:0] file_out,
input wire rst
);
reg [3:0] r1,r2,r3,r4,r5;
always@(posedge(clk_main))
begin
if (rst)
begin
r1=0;
r2=0;
r3=0;
r4=0;
r5=0;
end
else
begin
r1 = file_out[3:0];
r2 = file_out[5:2];
r3 = file_out[6:3];
r4 = file_out[4:1];
r5= file_out[7:4];
end
end
assign dbus0=r1;
assign dbus1=r2;
assign dbus2=r3;
assign dbus3=r4;
assign dbus4=r5;
endmodule

………………………………………………………………………………………….

`timescale 1ns / 1ps
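// Convolution stage 1 (Laplacian taps): the four neighbour samples are two's-complement
// negated ({1'b1, ~x} + 1) and the centre sample is scaled by 4 (left shift by two).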

module ConvolutionStage1(
input clk,
input [3:0] input2,
input [3:0] input4,
input [3:0] input5,
input [3:0] input6,
input [3:0] input8,
output reg signed [4:0] output1,
output reg signed [4:0] output2,
output reg signed [5:0] output3,
output reg signed [4:0] output4,
output reg signed [4:0] output5,
input enable,
output reg done
);
always @ (posedge clk) begin
if(enable) begin
output1 <= 0;
output2 <= 0;
output3 <= 0;
output4 <= 0;
output5 <= 0;
done <= 1'b0;
end
else begin
output1 <= {1'b1, ~(input2)} + 5'b00001;
output2 <= {1'b1, ~(input4)} + 5'b00001;
output3 <= {2'b00, input5} << 2;
output4 <= {1'b1, ~(input6)} + 5'b00001;
output5 <= {1'b1, ~(input8)} + 5'b00001;
done <= 1'b1;
end
end
endmodule

……………………………………………………………………………………………

`timescale 1ns / 1ps
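// Adder stage 3: sign-extends two 5-bit partial sums and adds them into a 6-bit result.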

module adderStage3(
input [4:0] input1,
input [4:0] input2,
output reg [5:0] output1,
input clk,
input enable,
output reg done
);
always @ (posedge clk) begin
if(enable) begin
output1 <= 0;
done <= 1'b0;
end
else begin
output1 <= {input1[4], input1} + {input2[4], input2};
done <= 1'b1;
end
end
endmodule

…………………………………………………………………………………………..

`timescale 1ns / 1ps
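// Adder stage 4: sign-extends the 7-bit and 6-bit partial sums to 8 bits and adds them.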


module adderStage4(
input [6:0] input1,
input [5:0] input2,
output reg [7:0] output1,


input clk,
input enable,
output reg done
);
always @ (posedge clk) begin
if(enable) begin
output1 <= 0;
done <= 1'b0;
end
else begin
output1 <= {input1[6], input1} + {{2{input2[5]}}, input2};
done <= 1'b1;
end
end
endmodule

…………………………………………………………………………………………..

`timescale 1ns / 1ps
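// Convolution stage 2: six signed 8-bit x 8-bit multiplications, each producing a 16-bit product.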

module ConvolutionStage2(
input clk,
input enable,
input [7:0] input1,
input [7:0] input2,
input [7:0] input3,
input [7:0] input4,
input [7:0] input5,
input [7:0] input6,
input [7:0] input7,
input [7:0] input8,
input [7:0] input9,
input [7:0] input10,
input [7:0] input11,
input [7:0] input12,
output reg signed [15:0] output1,
output reg signed [15:0] output2,
output reg signed [15:0] output3,
output reg signed [15:0] output4,
output reg signed [15:0] output5,
output reg signed [15:0] output6,
output reg done
);
always @ (posedge clk) begin


if(enable) begin
output1 <= 0;
output2 <= 0;
output3 <= 0;
output4 <= 0;
output5 <= 0;
output6 <= 0;
done <= 1'b0;
end
else begin

output1 <= {{8{input1[7]}}, input1} * {{8{input7[7]}}, input7};


output2 <= {{8{input2[7]}}, input2} * {{8{input8[7]}}, input8};
output3 <= {{8{input3[7]}}, input3} * {{8{input9[7]}}, input9};
output4 <= {{8{input4[7]}}, input4} * {{8{input10[7]}}, input10};
output5 <= {{8{input5[7]}}, input5} * {{8{input11[7]}}, input11};
output6 <= {{8{input6[7]}}, input6} * {{8{input12[7]}}, input12};
done <= 1'b1;
end
end
endmodule

……………………………………………………………………………………………..

`timescale 1ns / 1ps
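// Max-pooling unit: compares four signed 16-bit window values and registers the largest.
// The comparison tree is rooted at input1, so the initial maximum (0) is returned whenever
// input1 is not greater than zero.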


module maxPooling(
input clk,
input [15:0] input1,
input [15:0] input2,
input [15:0] input3,
input [15:0] input4,
input enable,
output reg signed [15:0] output1,
output reg maxPoolingDone
);
reg [7:0] inputArray [0:7];
reg [15:0] initialMax = 0;
reg [15:0] tempOutput;
always @ (posedge clk) begin
if(enable) begin
output1 <= 0;
maxPoolingDone <= 0;
end
else begin
if($signed(initialMax) < $signed(input1)) begin
if($signed(input2) < $signed(input1)) begin
if($signed(input3) < $signed(input1)) begin
if($signed(input4) < $signed(input1)) begin
output1 <= input1;
maxPoolingDone <= 1;
end
else begin
output1 <= input4;
maxPoolingDone <= 1;
end
end
else begin
if($signed(input3) < $signed(input4)) begin
output1 <= input4;
maxPoolingDone <= 1;
end
else begin
output1 <= input3;
maxPoolingDone <= 1;
end
end
end
else begin
if($signed(input3) < $signed(input2)) begin
if($signed(input4) < $signed(input2)) begin
output1 <= input2;
maxPoolingDone <= 1;
end
else begin
output1 <= input4;
maxPoolingDone <= 1;
end
end
else begin
if($signed(input3) < $signed(input4)) begin
output1 <= input4;
maxPoolingDone <= 1;
end
else begin
output1 <= input3;
maxPoolingDone <= 1;
end
end
end
end
else begin
output1 <= initialMax;
maxPoolingDone <= 1;
end
end
end
endmodule

…………………………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………………………………………………………………………
…………………………………….end code
