Combined Paper
1. INTRODUCTION
Among the several Devanagari-script languages used in India, Marathi compositions are also a rich source of information. Comparatively large amounts of OCR work have been done on scripts and languages such as English, Arabic, Sanskrit, and Kannada; however, very little work has been reported on recognition of Marathi content. Over recent years, handwritten Devanagari text recognition has attracted numerous researchers, and various techniques have been proposed to perform the recognition. Deep neural networks are gaining popularity in the fields of Computer Vision (CV) and Machine Learning (ML). Although recognition of handwritten Devanagari characters is a difficult task, deep learning can be effectively used as a solution to such problems.
In this paper, the authors propose a CNN-based OCR framework that accurately recognizes handwritten Marathi words and produces good-quality printed Marathi text. Because of the limited availability of Marathi training data, they prepared their own training dataset, with the assistance of individuals in the age group of 8 to 45 years. The dataset contains 9,360 words (104 word classes with 90 images each). The training accuracy of the CNN model is 94.76%.
2. CHALLENGES
1. Letters that appear similar. Recognizing individual Marathi characters is a difficult task because some characters in the Devanagari lipi are very similar to one another, such as "ma" and "bha", "va" and "ba", or "sa" and "ra".
2. Variation in writing style. In a handwritten Marathi text document, writing styles vary from person to person. This variation in individual writing style is the key challenge in developing a handwritten OCR system.
Deep learning is a part of the wide field of machine learning; it is based on artificial neural networks that learn to perform tasks from experience over a period of time. The technique is particularly useful for recognizing images and shapes. A Convolutional Neural Network (ConvNet or CNN) is a popular deep learning algorithm that is especially used in image classification. CNNs learn from pictures (a dataset) for classification purposes, which in the future may even eliminate the need for manual classification. Like other neural networks, a CNN is made up of an input layer, convolutional layers, pooling layers, a softmax layer, and fully connected layers. CNNs have been successful in image classification, enabling identification of faces, objects, and traffic signs; they have literally given vision to robots and are also used in self-driving cars.
3. USE OF CNN [1]
1. Convolution
The process of convolution is derived from the mathematical convolution operator. Convolution is used to extract features from the image, and it is the main feature that distinguishes a CNN from an ANN: a CNN maintains the spatial relationship between pixels, whereas an ANN flattens the image into a 1D vector before classifying it. This is why CNNs are more successful than ANNs in the field of image classification.
Working of Convolution
Convolution is performed by sliding a filter (kernel) over the image matrix, multiplying the image pixels with the corresponding filter values, and adding the products to produce the feature map, or convolved image. Before training a CNN we need to specify the number of filters, the size of the filters, and the number and types of layers. At each layer, each image is convolved with a number of filters, producing feature maps. The more filters, the more feature maps, and hence more features extracted at each step; but that does not mean more filters always improve classification accuracy. In fact, too many may lead to overfitting and slow down training. Through the process of backpropagation, the filters (the trainable parameters here) learn to extract the features needed for classification. For a multichannel image, the number of channels in the input image must match the number of channels in the filter; here, however, we have grayscale images (each pixel ranges from 0, black, to 255, white), which have only a single channel. Convolution involves several aspects, illustrated in the sketch after the following list:
1) Depth
The number of filters defines the depth of a convolutional layer. A single image convolved with multiple filters produces multiple feature maps (a volume).
2) Stride
The number of cells by which a filter moves rightward or downward is called the stride. The stride determines the size of the resulting feature maps.
3) Padding
Padding is a very important step during convolution: we pad the images to control the size of the resulting feature map. Using "padding=same" in a convolutional layer avoids size reduction of the convolved image; usually 0 is used as the padded value. If the CNN developer is not careful with padding, the size of the feature map shrinks drastically, causing loss of input information for the later layers of the CNN.
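To make the mechanics concrete, here is a minimal NumPy sketch of 2-D convolution with stride and zero-padding. It is illustrative only; the image size and the edge-detection kernel are our own choices, not taken from the paper.

```python
import numpy as np

def convolve2d(image, kernel, stride=1, pad=0):
    """Slide `kernel` over `image`, multiply element-wise and sum,
    producing one feature map. `pad` zero-pads the borders so the
    output size can be controlled ("same"-style padding)."""
    if pad > 0:
        image = np.pad(image, pad, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(region * kernel)
    return out

img = np.random.randint(0, 256, (36, 96)).astype(float)  # stand-in grayscale image
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)         # illustrative 3x3 kernel
fmap = convolve2d(img, edge_filter, stride=1, pad=1)      # "same"-style padding
print(fmap.shape)  # (36, 96): padding preserved the input size
```

With pad=0 the output would shrink to 34 x 94, which is exactly the size loss the padding discussion above warns about.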
2. ReLU (Rectified Linear Unit)
ReLU is the activation function applied after the convolution step; it introduces non-linearity into the rectified feature map. Since convolution is a linear operation (element-wise multiplication and addition) and most real-world data is non-linear, we use the ReLU activation function, f(x) = max(0, x).
[Figure: graph and visualization of ReLU [1]]
3. Pooling
We slide a 2 x 2 window by 2 cells (the "stride") and take the maximum value in each region; this reduces the dimensionality of the feature map. Pooling does not reduce the volume (depth): if there are 4 feature maps after the convolution + ReLU operations, there will still be 4 (pooled) feature maps after pooling. Pooling reduces the dimensionality of the images and saves computation, thus helping to control overfitting. It also makes the CNN insensitive to small changes and distortions in the input image, since a small distortion cannot change the max (or average) value to a large extent; this eases the classification process.
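A minimal NumPy sketch of 2 x 2 max pooling with stride 2, again illustrative rather than the paper's code:

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Take the maximum over each size x size window; depth (the number
    of feature maps) is unchanged, only height and width shrink."""
    oh = (fmap.shape[0] - size) // stride + 1
    ow = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = fmap[i*stride:i*stride+size,
                             j*stride:j*stride+size].max()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fmap))  # 2x2 result; each entry is the max of one 2x2 region
```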
The output obtained after the convolution and pooling operations represents the features of the input image. The fully connected (output) layer classifies the image on the basis of these features, or combinations of them, after learning from the training dataset.
Compared with a plain ANN, the only difference is that a CNN adds convolution and pooling for feature extraction; these features then act as input to the fully connected layers, which classify the image into its proper category.
Firstly, the trainable parameters, such as the filters and other weights, are initialized randomly. The network then takes in a training image and performs forward propagation: convolution and pooling, followed by forward propagation through the fully connected layers, producing a probability distribution at the output. Since the weights and other trainable parameters are randomly assigned (the network is untrained), these probabilities are random. The total error is calculated at the output layer.
Next, backpropagation computes the gradients of the error with respect to all weights in the network, and gradient descent updates all filter values, weights, and parameters to minimize the output error; each weight is updated according to its share of the error. If the same image is input again, the results will be closer to the expected ones, which shows that the network is learning as the error is reduced. All images are input one by one until the network is trained, meaning the weights and other trainable parameters have been updated so that a new image is classified correctly. If the training set is large enough, the network will (hopefully) generalize well to new images and classify them into the correct categories. A minimal sketch of this training cycle follows.
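The forward/backward/update cycle just described can be sketched in PyTorch on a toy model with random data. This is entirely illustrative: the layer sizes, class count, and learning rate are our own choices, not the paper's.

```python
import torch
import torch.nn as nn

# Toy model: one conv + pool block followed by a fully connected classifier.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 10),   # 10 illustrative classes
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()    # softmax + negative log-likelihood

images = torch.randn(4, 1, 32, 32)        # random stand-in for a batch
labels = torch.randint(0, 10, (4,))

for step in range(5):
    logits = model(images)                # forward propagation
    loss = loss_fn(logits, labels)        # total error at the output layer
    optimizer.zero_grad()
    loss.backward()                       # backpropagation: gradients w.r.t. all weights
    optimizer.step()                      # gradient descent: update filters/weights
    print(step, loss.item())              # the error shrinks as the network learns
```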
4. METHODOLOGY
A CNN comprises numerous convolution layers followed by fully connected layers that connect each neuron in one layer to every neuron in the next. There are several choices a CNN developer must make before training: the number of convolution layers, the number and size of filters, the number of pooling layers and their stride, the number of hidden neurons in the dense layers, the optimization algorithm to be used, and so on.
A. Data Acquisition
A mobile phone document scanner was used to scan the handwritten Marathi text documents. To obtain pictures of reasonable quality, high-resolution cameras were used; the authors scanned the document images with a 16 MP mobile camera. The acquired pictures were stored in one folder for the pre-processing operations.
B. Preprocessing
Document image pre-processing steps are generally used to improve the appearance of the captured images before extracting features. A mobile camera or optical scanner may introduce noise while capturing the document images, such as unwanted shadows, additional dark spots, scattered lines, and variations at the edges. Hence, before starting the actual word recognition process, a clean image must be obtained from the scanned image. Image binarization is generally used to reduce the pixel information of a grayscale image to a small amount: each pixel is replaced with a black pixel if the intensity I(i,j) is less than a certain threshold T (pixel = 0 if I(i,j) < T) or a white pixel if the intensity is greater than T (pixel = 1 if I(i,j) > T).
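A minimal sketch of this global thresholding in NumPy. The threshold T = 128 is an illustrative choice; libraries such as OpenCV provide the same operation via cv2.threshold, including automatic threshold selection with Otsu's method.

```python
import numpy as np

def binarize(gray, T=128):
    """Global thresholding: pixel = 0 (black) if I(i, j) < T, else 1 (white)."""
    return (gray >= T).astype(np.uint8)

gray = np.random.randint(0, 256, (36, 96)).astype(np.uint8)  # stand-in scan
binary = binarize(gray, T=128)
print(binary.min(), binary.max())  # values are now only 0 and 1
```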
C. Skew Correction
D. Feature Extraction
Architecture: C1 -> BN -> RE -> P1 -> C2 -> P2 -> 4C -> DROP -> FC(104)
Ambadas Shinde and Yogesh Dandawate built a 20-layer CNN model for this purpose. The first (input) layer takes in 36 x 96 grayscale images. The first convolution layer has 96 filters of size 3 x 3; after it, batch normalization and a ReLU activation are applied, followed by a pooling layer to reduce the convolved image dimensions. The next convolution layer has 128 filters of size 3 x 3; the previous layer (P1) is not fully connected to this layer, so as to trim the trainable parameters. Then come 4 more convolution layers and, finally, a dropout layer with probability 0.5 before the fully connected layer. The output layer has 104 output classes.
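A PyTorch sketch of the described C1 -> BN -> RE -> P1 -> C2 -> P2 -> 4C -> DROP -> FC(104) pipeline follows. The paper does not state the filter counts of the four middle convolution layers, the pooling sizes, or its parameter-trimming connection scheme, so those are assumed values here.

```python
import torch.nn as nn

layers = [
    nn.Conv2d(1, 96, 3, padding=1), nn.BatchNorm2d(96), nn.ReLU(),  # C1 -> BN -> RE
    nn.MaxPool2d(2),                                                # P1
    nn.Conv2d(96, 128, 3, padding=1), nn.ReLU(),                    # C2: 128 filters
    nn.MaxPool2d(2),                                                # P2
]
for _ in range(4):  # 4 further conv layers; 128 filters is an assumed width
    layers += [nn.Conv2d(128, 128, 3, padding=1), nn.ReLU()]
layers += [
    nn.Dropout(0.5),                # dropout before the fully connected layer
    nn.Flatten(),
    nn.Linear(128 * 9 * 24, 104),   # 36x96 input -> 9x24 after two 2x2 pools
]
model = nn.Sequential(*layers)
```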
I. INTRODUCTION
Character recognition is one of the emerging fields within computer vision. Handwritten text can be identified easily and accurately by humans, even across languages with different patterns, but not by machines: it is difficult for a system to identify the text. Handwriting recognition is the ability of a machine to receive handwritten input from an external source, such as an image, interpret it, and convert it into digital text.
Character recognition involves several steps: acquisition, feature extraction, classification, and recognition. In this approach, the system is trained to find the similarities and the differences among various handwritten samples.
B. Model Overview:
We use a neural network (NN) for our task. It consists of convolutional neural network (CNN) layers, recurrent neural network (RNN) layers, and a final Connectionist Temporal Classification (CTC) layer. In this project, we use 5 CNN layers (feature extraction), 2 RNN layers, and a CTC layer (to calculate the loss).
C. Operations:
CNN: the input image is fed to the CNN layers, which are trained to extract relevant features from it. Each layer consists of three operations: first the convolution operation (a 5x5 filter in the first two layers and a 3x3 filter in the last three layers), then the non-linear ReLU function, and finally a pooling layer that summarizes image regions and outputs a downsized (smaller) version of the input.
RNN: the feature sequence contains 256 features per time-step, and the RNN propagates relevant information through this sequence. The popular Long Short-Term Memory (LSTM) implementation of RNNs is used because it can propagate information over longer distances and provides more robust training characteristics than a vanilla RNN. The RNN output sequence is mapped to a matrix of size 32x80.
CTC: while training the NN, the CTC layer is given the RNN output matrix and the ground-truth text, and it computes the loss value. While inferring, the CTC layer is given only the matrix, and it decodes it into the final text.
Data:
Input: a gray-value image of size 128x32. Usually the images from the dataset do not have exactly this size, so we resize them (without distortion) until they have either a width of 128 or a height of 32. Then we place the image on a (white) target image of size 128x32. Finally, we normalize the gray values of the image, which simplifies the task for the NN.
Fig. 4: Left: an image from the dataset with an arbitrary size, scaled to fit the target image of size 128x32; the empty part of the target image is filled with white.
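A minimal preprocessing sketch along these lines, using Pillow and NumPy. The zero-mean/unit-variance normalization is one common choice; the text above only says that the gray values are normalized.

```python
from PIL import Image
import numpy as np

def preprocess(path, target_w=128, target_h=32):
    """Scale an arbitrary-size word image to fit 128x32 without distortion,
    paste it onto a white canvas, and normalize the gray values."""
    img = Image.open(path).convert("L")          # gray-value image
    w, h = img.size
    scale = min(target_w / w, target_h / h)      # keep the aspect ratio
    img = img.resize((int(w * scale), int(h * scale)))
    canvas = Image.new("L", (target_w, target_h), color=255)  # white target
    canvas.paste(img, (0, 0))
    arr = np.asarray(canvas, dtype=np.float32)
    return (arr - arr.mean()) / (arr.std() + 1e-8)  # zero mean, unit variance
```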
CNN output: Fig. 5 displays the output of the CNN layers, a sequence of length 32 in which each entry contains 256 features. These features are further processed by the RNN layers.
RNN output: Fig. 6 shows a visualization of the RNN output matrix for an image containing the text "little". The matrix shown in the top-most graph contains the scores for the characters, with the CTC blank label as its last entry. Only the last character "e" is not aligned, but this is fine because the CTC operation is segmentation-free and does not care about absolute positions. From the bottom-most graph, showing the scores for the characters "l", "i", "t", "e" and the CTC blank label, the text can easily be decoded: we take the most probable character from each time-step (forming the so-called best path), then merge repeated characters, and finally remove all blanks: "l---ii--t-t--l-...-e" → "l---i--t-t--l-...-e" → "little".
Fig. 6: Top: output matrix of the RNN layers. Middle: input image. Bottom: probabilities for the characters "l", "i", "t", "e" and the CTC blank label.
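A minimal Python sketch of this best-path decoding, assuming the blank is the last entry of each score vector as described above. The toy charset and the one-hot "scores" are illustrative.

```python
import numpy as np

def best_path_decode(matrix, charset):
    """Greedy (best-path) CTC decoding: take the most probable symbol per
    time-step, merge repeats, then drop blanks. The blank is assumed to be
    the LAST entry of each score vector."""
    blank = len(charset)                            # index of the blank label
    best = [scores.argmax() for scores in matrix]   # most probable per step
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:            # merge repeats, skip blanks
            decoded.append(charset[idx])
        prev = idx
    return "".join(decoded)

# Toy usage: charset "lite", blank index 4; '-' below marks the blank.
charset = "lite"
steps = [0, 4, 1, 1, 4, 2, 4, 2, 4, 0, 4, 3]   # l - i i - t - t - l - e
matrix = np.eye(5)[steps]                      # one-hot "scores" per time-step
print(best_path_decode(matrix, charset))       # -> "little"
```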
CNN: for each CNN layer, create a kernel of size k x k to be used in the convolution operation; then apply the ReLU operation and a pooling layer to the results of the convolution.
RNN: create and stack the two RNN layers, then make a bidirectional RNN out of them, so that the input sequence is traversed from front to back and the other way round. As a result, we get two output sequences, forward and backward, which are finally mapped to the output sequence (matrix) that is fed into the CTC layer.
CTC: for the loss calculation, we feed both the ground-truth text and the matrix to the operation; the lengths of the input sequences must also be given. We now have everything needed to build the loss operation and the decoding operation.
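Putting the three pieces together, here is a PyTorch sketch of such a CNN -> bidirectional LSTM -> CTC pipeline. The layer widths and pooling scheme are plausible stand-ins chosen so that a 128x32 input yields the 32x80 output matrix described above; they are not the exact configuration used.

```python
import torch
import torch.nn as nn

class HTRNet(nn.Module):
    """Sketch of the CNN -> bidirectional LSTM -> CTC pipeline (assumed widths)."""
    def __init__(self, n_chars=80):             # 79 characters + CTC blank
        super().__init__()
        self.cnn = nn.Sequential(                # 5 conv layers: 5x5, 5x5, 3x3, 3x3, 3x3
            nn.Conv2d(1, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        # 2 stacked bidirectional LSTM layers over 256 features per time-step
        self.rnn = nn.LSTM(256, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, n_chars)    # map to the 32x80 score matrix

    def forward(self, x):                        # x: (batch, 1, 32, 128)
        f = self.cnn(x)                          # -> (batch, 256, 1, 32)
        f = f.squeeze(2).permute(0, 2, 1)        # -> (batch, 32, 256)
        out, _ = self.rnn(f)                     # forward + backward sequences
        return self.fc(out).log_softmax(-1)      # log-probs for CTC

# Loss: feed the ground-truth labels and the matrix, plus the sequence lengths.
model = HTRNet()
imgs = torch.randn(4, 1, 32, 128)
log_probs = model(imgs).permute(1, 0, 2)         # CTCLoss wants (T, batch, C)
targets = torch.randint(0, 79, (4, 10))          # toy label indices (no blanks)
ctc = nn.CTCLoss(blank=79)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 32, dtype=torch.long),
           target_lengths=torch.full((4,), 10, dtype=torch.long))
```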
Training:
The mean of the loss values of the batch elements is used to train the NN.
If you want to improve the recognition accuracy, you can follow one of these hints:
Remove the cursive writing style from the input images.
Increase the input size (if the input of the NN is large enough, complete text lines can be used).
F. Spell checker
The Python package pyspellchecker lets us find words that may be misspelled and also suggests possible corrections.
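A minimal usage sketch of pyspellchecker; the sample words are our own.

```python
# pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()                       # English dictionary by default
words = ["littel", "hello", "recogniton"]
for word in spell.unknown(words):            # words likely misspelled
    print(word, "->", spell.correction(word),
          "| candidates:", spell.candidates(word))
```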
The proposed network has different layers, which are described below:
A. CNN layers
Most significantly, a CNN is trainable and can produce highly optimized weights and good generalization performance.
B. RNN layer
RNN have a “memory” which remembers all information about what has been calculated. It uses
the same parameters for each input as it performs the same task on all the inputs or hidden layers
to produce the output. This reduces the complexity of parameters, unlike other neural networks.
RNN provides the same weights and biases to all the layers, thus reducing the
complexity and memorizes each previous output by giving each output as input to
the next hidden layer.
Hence these three layers can be joined together such that the weights and bias of
all the hidden layers are the same, into a single recurrent layer.
The formula for calculating current state:
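A tiny NumPy sketch of this recurrence, showing that the same weight matrices are reused at every time-step; the sizes are illustrative.

```python
import numpy as np

# One vanilla-RNN pass: the SAME weights W_hh, W_xh are applied at every
# time-step, which is the parameter sharing described above.
hidden, inp = 4, 3
W_hh = np.random.randn(hidden, hidden) * 0.1
W_xh = np.random.randn(hidden, inp) * 0.1
h = np.zeros(hidden)                       # initial state
for x_t in np.random.randn(5, inp):        # 5 time-steps of input
    h = np.tanh(W_hh @ h + W_xh @ x_t)     # h_t = f(W_hh h_{t-1} + W_xh x_t)
print(h)                                   # final state after the sequence
```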
C. CTC layer
CTC is a loss function that is used to train neural networks. There is no need to align the data, because CTC assigns a probability to any label: it is alignment-free, working by summing the probabilities of all possible alignments between the input and the label.
Blank token:
CTC merges any repeating characters it finds. However, some words have legitimately repeated letters, like the "l" in "hello", which would otherwise inevitably end up being merged away. The way around this is the blank token: it does not mean anything and is simply removed before the final word output is produced. Decoding then proceeds as follows:
1. The CTC network assigns to every input time-step the character with the highest probability.
2. Repeats without a blank token in between get merged.
3. Lastly, the blank tokens get removed.
The CTC network can then assign a probability to a label for the input by summing the probabilities of the characters at each time-step. The CTC algorithm is alignment-free: it does not require an alignment between the input and the output. To get the probability of an output given an input, it sums over the probabilities of all possible alignments between the two.
D. Summary of Dataset
The IAM Handwriting Database contains forms of handwritten English text which can be used to
train and test handwritten text recognizers and to perform writer identification and verification
experiments.
IV. RESULTS
In this project, an image is given as input, and the output is predicted by loading the previously created and saved model.
[Figure: the input image given to the neural network, and the corresponding predicted output.]
Future Work:
In the future we plan to extend this study to a larger variety of datasets. As writing shifts from paper and pen to touch pads, built-in software will be able to automatically detect the text the user writes and convert it into digital text, so that searching and understanding become simpler.
Deep, Big, Simple Neural Nets for Handwritten Digit Recognition
Architecture: We train five MLPs with two to nine hidden layers and varying numbers of hidden units. Each neuron's activation function is a scaled hyperbolic tangent: y(a) = A tanh(B a), where A = 1.7159 and B = 0.6666.
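As a quick sanity check of this activation: with these constants, a unit input maps approximately to a unit output, which is the well-known rationale for this particular scaling.

```python
import numpy as np

# Scaled hyperbolic tangent from the paper: y(a) = A * tanh(B * a)
A, B = 1.7159, 0.6666
y = lambda a: A * np.tanh(B * a)
print(y(1.0))   # ~1.00: 1.7159 * tanh(0.6666) is approximately 1
```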
GPU Implementations:
1. Deformations: using the GPU instead of the CPU, generating the elastic displacement field takes only 3 seconds. Deforming the whole training set is more than 10 times faster, taking 9 instead of the original 93 seconds.
2. Training algorithm: we closely follow the standard BP algorithm, except that BP of deltas and weight updates are disentangled and performed sequentially. The training algorithm involves:
a. Forward propagation
b. Backward propagation
c. Weight updating: the associated algorithm starts by reading the appropriate delta and precomputes all repetitive expressions.
Results:
1. The GPU accelerates the deformation routine by a factor of 10 (only the elastic deformations are GPU-optimized); the forward propagation (FP) and BP routines are sped up by a factor of 40.
2. Most remarkably, the best network has an error rate of only 0.35% (35 out of 10,000 digits). This is significantly better than the best previously published results: 0.39% by Ranzato et al. (2006) and 0.40% by Simard et al. (2003), both obtained by more complex methods.
3. The best test error of this MLP is even lower (0.32%) and may be viewed as the maximum capacity of the network.
Conclusions:
1. This ongoing hardware progress may be more important than advances in algorithms and software (although the future will belong to methods combining the best of both worlds).
2. Current graphics cards (GPUs) are already more than 40 times faster than standard microprocessors when it comes to training big and deep neural networks by the old algorithm, online backpropagation (weight update rates of up to 5 × 10^9/s, and more than 10^15 weight updates per trained network).
Recognition of Handwritten Characters using Deep Convolutional
Neural Network
I. INTRODUCTION
The main difficulty of Indian handwritten recognition is the overlap between characters. These overlapping character shapes are hard to recognize and may lead to a low recognition rate; such factors also increase the complexity of handwritten character recognition. This paper proposes a new approach to identifying handwritten characters for the Telugu language using Deep Learning (DL). The proposed work can enhance the recognition rate of individual characters.
The objective of this work is to effectively extract the topological features for handwritten numeral recognition using a Convolutional Neural Network (CNN) with a deep network. The rest of the paper is organized as follows: Section II describes the present work and its methodology, Section III discusses the experimental setup, and Section IV concludes with remarks on the proposed work.
The dataset collection process was carefully designed and maintained diversity across gender. Samples were collected manually from 280 individuals with various handwriting styles. The collected samples were digitized and segmented to create the numeral database; during segmentation the images are scaled to 32x32 without any loss of information. Given the diversity of writing styles, the aim is to process and extract different features that can distinguish between numerals.
The overall accuracy captured at different epochs is shown in Fig. 4; at the 50th epoch, we measured the highest accuracy of 94%. In our experiments we also captured the recognition rate of each digit, which gives insight into how accurately each digit can be identified. Most of the digits are recognized at more than 95%, except digits 4, 7, and 9, whose accuracy is very low due to ambiguity in their patterns: these digits closely resemble each other, which makes it hard to recognize their individual patterns. The digit with the highest recognition rate is "3", at 98%. The detailed accuracy of each digit is shown in Fig. 5.
IV. CONCLUSION
An efficient handwritten character recognition approach is proposed in this work. The proposed work employs a deep network model to recognize handwritten numerals. The experimental results show that convolutional neural networks perform better at recognition for the Telugu language. The low recognition rates captured for the three digits 4, 7, and 9 could be improved by introducing new kernel methods; the likely reason for them is the overlapping patterns among these digits.
***********