
Image Caption Generator Using CNN and LSTM

Swarnim Tripathi, Ravi Sharma
swarnim0711@gmail.com, ravi.sharma@galgotiasuniversity.edu.in
Galgotias University, India

Abstract - In this paper, we use a CNN and an LSTM to generate captions for images. Image caption generation is a task that combines natural language processing and computer vision to describe the content of an image in English. In this paper we carefully review a number of important concepts of image captioning and its common approaches. We use the Keras library, NumPy, and Jupyter notebooks in the making of this work, together with the Flickr8k dataset and a CNN for image classification.

Keywords - CNN, LSTM, image captioning, deep learning.

INTRODUCTION

Every day we see many photographs: in our surroundings, on social media, and in the newspapers. Humans can interpret photographs by themselves; we can understand a photograph without its designated caption. Machines, on the other hand, must first be trained on images before they can generate a photograph's caption automatically.

Image captioning is useful for many purposes: supporting visually impaired people through real-time text-to-speech feedback about the surrounding scene from a camera feed; improving the social media experience by generating captions for photographs in a feed, along with speech for messages; helping children recognize objects while learning the language; and enabling faster, more detailed image search and indexing, since a caption can accompany every photograph on the world wide web. Image captioning has diverse applications in fields such as biomedicine, commerce, web search, the military, and many others. Social media platforms such as Instagram and Facebook can generate captions from images automatically.

The principal goal of this research paper is to gain some expertise in deep learning strategies. We use two techniques in particular, CNN and LSTM, for image classification and caption generation.

IMAGE CAPTIONING TECHNIQUES

CNN - Convolutional Neural Networks are a class of neural networks designed for data with a grid-like shape, for example a 2D lattice of pixels, which makes them well suited to working with images. A CNN scans an image from the top-left corner to the bottom-right corner, extracts significant features from it, and combines those features to classify the image. It can handle translated, rotated, scaled, and otherwise modified images. A convolutional neural network is a deep learning algorithm that takes an input image, assigns importance to the various components and objects in the image, and distinguishes them from one another.

Fig.1 CNN Architecture

The pre-processing required by a ConvNet is negligible compared with other classification algorithms. While filters are hand-engineered in primitive methods, with sufficient training, ConvNets are able to learn these filters/features themselves. The structure of a convolutional network resembles the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex: individual neurons respond to stimuli only in a limited region of the visual field known as the receptive field, and a collection of such fields tiles the entire visual area.
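The feature-extraction step described above can be sketched with a minimal 2D convolution in NumPy. The image, kernel, and function names here are illustrative assumptions for the sketch, not the paper's actual code; a trained CNN would learn kernels like this one itself.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2D convolution: slide the kernel over the image
    and take a weighted sum at each position (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A tiny image with a vertical edge: dark left half, bright right half.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# A hand-written vertical-edge filter, for illustration only.
kernel = np.array([
    [-1, 1],
    [-1, 1],
], dtype=float)

feature_map = convolve2d(image, kernel)  # strong response at the edge column
```

The resulting 3×3 feature map responds strongly (value 2) exactly where the dark-to-bright edge lies, and is zero elsewhere, which is the sense in which convolution "extracts significant features" from the picture.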
CNN : Architecture - A plain fully connected neural network, in which every neuron in one layer connects to every neuron in the next layer, is inefficient for analyzing large images and video. For a normal-sized image with many picture elements (pixels) and three color tones (RGB, i.e. red, green, and blue), the number of parameters in such a network runs into the millions, which can lead to overfitting.

To limit the number of parameters and focus the network on significant pieces of the image, a CNN uses a 3D arrangement in which each layer of neurons analyzes a small region or "feature" of the image. Rather than every neuron passing its output to the entire next layer, each group of neurons specializes in identifying one part of the image, such as a nose, a left ear, a mouth, or a leg. The final output is a set of scores indicating how likely each of these features is to belong to a class.

Fig.2 Working of CNN

This reduces the complexity of the neural network and requires less computing power. A computer represents an image as a number at each pixel. To compare two images naively, we compare the pixel values at each position; this technique only works when the two images are identical, and the comparison fails as soon as the images differ. In a CNN, image comparison instead takes place piece by piece.

Fig.3 Feature map of a CNN picture

The main reason for using a CNN is that it takes pictures as input and, on the basis of the input pictures, builds a feature map, i.e. it classifies each pixel according to similarities and differences. The CNN classifies the pixels and produces a matrix known as a feature map: a collection of similar pixels placed in a separate category. These matrices play an important role in finding the essence of the objects in the input picture.

How does CNN work?

As discussed previously, a fully connected neural network, in which every input in the preceding layer is connected to every unit in the following layer, is inconvenient for this task. In a CNN, therefore, each neuron is connected only to a specific region of the layer before it, rather than to all the neurons in the previous layer.

More about CNN -

There are three types of layers in a CNN model:

1. Convolutional
2. Pooling
3. Fully connected

In the first layer, the input image is read by the CNN, and on that foundation a feature map is made. That feature map then serves as the input to the following layer, the pooling layer. In the pooling layer, the feature map is broken down into simpler parts so that the context of the picture can be examined more carefully; this layer makes the feature map denser, so as to discover the most critical information about the picture.

The first and second layers, convolutional and pooling, are applied repeatedly, as many times as the picture requires, to obtain dense information about it. The resulting dense feature map is used by the last layer, the fully connected layer, which performs classification: it sorts the pixels with respect to similarities and differences. Classification is carried out to a sufficient degree to capture the essence of the picture and to help identify the objects, persons, and things in it.
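The pooling step described above can be illustrated with a small NumPy sketch (function and variable names are illustrative): 2×2 max pooling halves each spatial dimension of a feature map while keeping the strongest activation in each window, which is how the feature map becomes "denser".

```python
import numpy as np

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling: keep the largest value in each
    size x size window, shrinking the feature map."""
    h, w = fmap.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            out[i // size, j // size] = fmap[i:i+size, j:j+size].max()
    return out

# A toy 4x4 feature map, e.g. the output of a convolutional layer.
fmap = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
], dtype=float)

pooled = max_pool2d(fmap)  # 4x4 -> 2x2, keeping the max of each quadrant
```

Stacking convolution and pooling repeatedly, as the text describes, shrinks the spatial size while concentrating the most critical information, before the fully connected layer performs classification.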
Fig.4 Layers of a scanned picture

These layers help the CNN locate and identify the features of the picture: the vital features present in a fixed-length input image are transformed into a fixed-size output.

CNN techniques are very widely used, for example:

· Computer vision — in the medical sciences, image analysis is done largely through CNNs; the inner structure of the body can be examined effortlessly with their help.

· Mobile phones — CNNs are used for many things, for instance to estimate a person's age, or to unlock the phone by examining the picture from the camera.

· Industry — CNNs are used to establish patents or copyright on specific photographs.

· Pharmaceutical discovery — CNNs are broadly used in drug discovery, analyzing chemical features to find the best drug to cure a particular problem.

The Problem with RNNs (Recurrent Neural Networks):-

RNNs are a family of deep learning algorithms used for a number of complex computer tasks such as object classification and speech recognition. RNNs are designed to handle sequences of events, where the information at each step depends on the preceding steps.

Ideally, we would like RNNs that can handle long sequences of data with high capability. Such RNNs could be used for many real-life problems, such as stock forecasting and robust speech recognition. Yet plain RNNs are rarely used to solve such problems, because of the vanishing gradient problem.

Vanishing Gradient Problem -

The vanishing gradient problem is the main reason that training RNNs is challenging. In general, the architecture of an RNN is such that it stores data only for a short period of time: it is not possible for an RNN to remember all the data values over a long period; it can only retain some of the data for a small period of time. Hence the memory of an RNN is useful only for shorter sequences and short time spans.

The vanishing gradient problem becomes prominent when solving a particular problem requires many time steps: with many time steps, the gradient signal is progressively lost during backpropagation, and the RNN would have to retain values for every step, which is not feasible. This is how the vanishing gradient problem arises.

What can be done to solve this Vanishing Gradient problem with RNNs -

To solve this problem, we use Long short-term memory (LSTM), a variant of the RNN. LSTMs were constructed specifically to overcome the vanishing gradient problem: the exceptional thing about an LSTM is that it can preserve data values over long intervals of time, and hence it can solve the vanishing gradient problem.

Origin of LSTM:-

LSTM was first proposed by two researchers, Sepp Hochreiter and Jürgen Schmidhuber, in 1997. LSTM stands for long short-term memory. Within the deep learning discipline of recurrent neural networks, LSTM holds a crucial place. The special property of an LSTM is that it not only stores the input data but can also make predictions about subsequent data on its own: the network retains the stored data for a particular time period and, on that basis, predicts future values of the data. This is the main reason LSTM is used here instead of a traditional RNN.

LSTMs are constructed in such a manner that the error signal is preserved across time steps. Because the error is carried over many steps, the LSTM keeps learning from the data values again and again, which makes backpropagation through time and through layers effective.
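The vanishing gradient effect described above can be shown numerically with a toy sketch (not the paper's experiment): in backpropagation through time, the gradient is a product of per-step factors, and when those factors are below 1 in magnitude the product shrinks exponentially with the number of time steps.

```python
# Toy illustration: each backprop-through-time step multiplies the
# gradient by a per-step factor; |factor| < 1 makes it vanish over
# long sequences. The factor value 0.5 is an arbitrary assumption.
def gradient_after(steps, factor=0.5):
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

short = gradient_after(5)    # still a usable gradient
long = gradient_after(50)    # effectively zero: the signal has vanished
```

After only 50 steps the gradient is below 1e-15, which is why a plain RNN cannot learn dependencies across long sequences, while the LSTM's gated cell state is designed to keep this signal alive.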

Architecture Of LSTM :-

The architecture of an LSTM is simple: it consists of three major gates, which store data over a longer period of time and thereby overcome the difficulties that RNNs could not solve.

The three major gates of the LSTM are:

· Forget gate — the main work of the forget gate is to filter the data, i.e. to delete the data that will not be needed to solve the task at hand. This gate is responsible for the overall performance of the LSTM; it optimizes the data.

· Input gate — the LSTM pipeline starts at this gate: it takes input from the user and supplies the input data to the other gates.

· Output gate — this gate is responsible for presenting the desired result in the proper form.

Fig.5 Gates in LSTM

As the diagram above shows, the LSTM uses several gates to store the data, then processes it and passes the result to the final gate. An RNN, by contrast, passes the data to the final output without such gated processing. Through these gates, the whole LSTM network can shape the data in many ways, including storing it and reviewing it. The gates are independently capable of making judgements concerning the data, and they make those judgements on their own by opening or closing.

The ability of the LSTM gates to hold data over a period of time is what gives the LSTM its advantage over plain RNNs.

Uses of Long Short-Term Memory Networks :-

LSTMs are mostly used for deep learning tasks that involve forecasting data from preceding data. Two remarkable examples are text prediction and stock market prediction.

Text Prediction - LSTMs are widely used for predicting text. The long-term memory of an LSTM makes it capable of predicting the next words in a sentence on its own: the LSTM first stores data about the words (their feel, their styling, their use in particular situations, and so on) and on that basis predicts the words that follow; the stored input data is retained for future use. The best illustration of text prediction is a chatbot, which is widely utilized by e-commerce websites and mobile applications.

Stock Market Prediction - In the stock market too, an LSTM stores the data, i.e. the trends in how the market behaves at a particular time or instant, and on that account predicts the market's next variations and trends. Predicting variation in the stock market is a problematic task, because market movements are very challenging to forecast. The LSTM model has to be trained so that it gives correct values to its users; for that, a great deal of data has to be stored over a long period, and training can take days.

More about LSTM -

LSTMs are a variant of RNNs with the capacity to hold more data values than a plain RNN, and they are widely in use today in every field. The simplest LSTM diagram consists of the three major gates: the forget gate, the input gate, and the output gate. These gates have the capacity to store data and give out the desired output; whenever the LSTM network is discussed, these three gates come up. The diagram below shows the simplest architecture of the LSTM:
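A single step of the gated computation described above can be sketched in NumPy. The weight shapes and names here are illustrative assumptions; the sketch follows the standard LSTM equations (forget, input, and output gates plus a candidate cell state), not the paper's exact code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev; x] to the four gate
    pre-activations (forget, input, candidate, output) stacked together."""
    z = np.concatenate([h_prev, x])
    n = h_prev.shape[0]
    pre = W @ z + b                      # shape (4n,)
    f = sigmoid(pre[0:n])                # forget gate: what to erase
    i = sigmoid(pre[n:2*n])              # input gate: what to write
    g = np.tanh(pre[2*n:3*n])            # candidate cell values
    o = sigmoid(pre[3*n:4*n])            # output gate: what to reveal
    c = f * c_prev + i * g               # new cell state (long-term memory)
    h = o * np.tanh(c)                   # new hidden state (output)
    return h, c

rng = np.random.default_rng(0)
n, m = 3, 2                              # hidden size, input size (arbitrary)
W = rng.normal(size=(4 * n, n + m))
b = np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=m), np.zeros(n), np.zeros(n), W, b)
```

The additive cell-state update `c = f * c_prev + i * g` is the design choice that lets the gradient flow across many time steps, which is exactly how the LSTM avoids the vanishing gradient problem discussed earlier.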
Fig.6 Working of LSTM

Image Caption Generation Model:-

To prepare an image caption generation model, we combine the two architectures discussed above; the result is called a CNN-LSTM model. We use these two architectures together to obtain a caption for the input pictures:

· CNN - used to extract the important features from the input picture. For this, we use a pre-trained model named Xception.

· LSTM - used to store the features from the CNN model, process them further, and support the generation of a good caption for the picture.

Fig.7 CNN-LSTM model

Project File Architecture :-

For our research, we downloaded a dataset consisting of the following files:

· Flickr8k_Datasets – contains all the pictures on which the model must first be trained; it contains 8091 images.

· Flickr8k_texts – contains text files and pre-formed captions for the pictures.

The following files are set up to run the system and check the working of the CNN-LSTM model:

· Model – this folder contains all the trained models. Training the model is a one-time process.

· Description.txt – contains the picture names and their related captions after preprocessing.

· Feature.p – binds each picture to the features extracted from Xception, the pre-trained CNN model.

· Tokenizers.p – contains the expressions we call tokens, each generalised with an index value.

· Models.png – a diagrammatic representation of the CNN-LSTM model.

· Testing_captions_generator.py – the Python file used to generate captions for the pictures.

· Training_captions_generator.ipynb – a Jupyter notebook (in short, a web-based application), used to train the model and, on that basis, obtain captions for the input pictures.

Fig.8 Project file structure
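The role of the Tokenizers.p file, mapping caption words to integer indices, can be sketched with a minimal pure-Python tokenizer. The class and method names here are hypothetical illustrations, not the project's actual code.

```python
class SimpleTokenizer:
    """Builds a word -> index vocabulary from captions and encodes
    sentences as integer index sequences, as an LSTM expects."""
    def __init__(self):
        self.word_index = {}

    def fit(self, captions):
        for caption in captions:
            for word in caption.lower().split():
                if word not in self.word_index:
                    # Index 0 is reserved for padding, so start at 1.
                    self.word_index[word] = len(self.word_index) + 1

    def encode(self, caption):
        return [self.word_index[w] for w in caption.lower().split()
                if w in self.word_index]

# Toy captions standing in for the Flickr8k caption texts.
captions = ["a dog runs", "a cat sleeps"]
tok = SimpleTokenizer()
tok.fit(captions)
seq = tok.encode("a dog sleeps")
```

In the real project such a vocabulary would be pickled (hence the .p extension) and reused at caption-generation time to turn the LSTM's predicted indices back into words.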
Fig.9 Flickr_Dataset

CONCLUSION

The CNN-LSTM model was built on the idea of generating captions for input pictures. This model can be used in a variety of applications. In this paper we studied the CNN model, RNN models, and LSTM models, and in the end we validated that the model generates captions for the input pictures.

REFERENCES

[1] Abhaya Agarwal and Alon Lavie. 2008. Meteor, m-bleu and m-ter: Evaluation metrics for high-correlation with human rankings of machine translation output. In Proceedings of the Third Workshop on Statistical Machine Translation. Association for Computational Linguistics, 115–118.
[2] Ahmet Aker and Robert Gaizauskas. 2010. Generating image descriptions using dependency relational patterns. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1250–1258.
[3] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision. Springer, 382–398.
[4] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2017. Bottom-up and top-down attention for image captioning and VQA. arXiv preprint arXiv:1707.07998.
[5] Jyoti Aneja, Aditya Deshpande, and Alexander G. Schwing. 2018. Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5561–5570.
[6] Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell, Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, et al. 2016. Deep compositional captioning: Describing novel object categories without paired training data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[7] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).
[8] Shuang Bai and Shan An. 2018. A survey on automatic image caption generation. Neurocomputing.
[9] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Vol. 29. 65–72.
