Engineering Synopsis of
Image Captioning using Deep Learning
Introduction:
People communicate through language, whether written or spoken, and they often use this language to describe the visual world around them. Images and signs are another means of communication and understanding, particularly for physically challenged people. Automatically generating a description of an image in well-formed sentences is a difficult and challenging task, but it can have a great impact on visually impaired people by helping them better understand the images found on the web.
To make this happen, we will combine image and text processing to build a useful Deep Learning application: Image Captioning. Image Captioning refers to the process of generating a textual description of an image, based on the objects and actions in the image.
The objective of this project is to create a system that detects what is happening in an image without being explicitly told what is happening. This can be applied in social media systems, where the machine automatically suggests what the user might write based on an image, or it can be used to explain to blind people what an image is about. This project will be combined with a Flask-based web application.
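The following is a minimal Python sketch of how such a Flask front end could be wired up. The route, the upload folder, the upload.html template, and the generate_caption() placeholder are assumptions made for this synopsis, not the final implementation.

# Minimal Flask sketch of the web front end (illustrative only).
# The generate_caption() helper and the upload.html template are
# placeholders -- the trained captioning model is wired in later.
import os
from flask import Flask, request, render_template

app = Flask(__name__)
UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)

def generate_caption(image_path):
    # Placeholder: the trained CNN-RNN model would be invoked here.
    return "a caption describing " + os.path.basename(image_path)

@app.route("/", methods=["GET", "POST"])
def index():
    caption = None
    if request.method == "POST":
        image = request.files["image"]
        path = os.path.join(UPLOAD_DIR, image.filename)
        image.save(path)
        caption = generate_caption(path)
    return render_template("upload.html", caption=caption)

if __name__ == "__main__":
    app.run(debug=True)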
Feasibility Study: -
Image captioning is a popular research area of Artificial Intelligence (AI) that deals with understanding an image and generating a language description for it. Image understanding requires detecting and recognizing objects, as well as understanding the scene type or location, object properties, and their interactions. Generating well-formed sentences requires both syntactic and semantic understanding of the language.
In traditional machine learning, hand-crafted features such as Local Binary Patterns (LBP), Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), and combinations of such features are widely used. In these techniques, features are extracted from the input data and then passed to a classifier such as a Support Vector Machine (SVM) in order to classify an object. Since hand-crafted features are task specific, extracting features from a large and diverse set of data is not feasible. Moreover, real-world data such as images and videos are complex and have different semantic interpretations.
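For reference, a minimal Python sketch of such a traditional pipeline is shown below. It assumes scikit-image for HOG extraction and scikit-learn for the SVM; the image size and HOG parameters are illustrative only.

# Sketch of the traditional pipeline described above: hand-crafted
# HOG features fed to an SVM classifier (illustrative assumptions).
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def extract_hog(image):
    # Resize to a fixed shape so every image yields a feature vector
    # of the same length, then compute HOG descriptors on grayscale.
    gray = resize(rgb2gray(image), (128, 128))
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def train_svm(images, labels):
    # images: list of RGB arrays; labels: list of class labels.
    features = np.array([extract_hog(img) for img in images])
    clf = SVC(kernel="linear")
    clf.fit(features, labels)
    return clf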
On the other hand, in deep learning based techniques, features are learned automatically from the training data, and these techniques can handle large and diverse sets of images and videos. For example, Convolutional Neural Networks (CNNs) are widely used for feature learning, and a classifier such as Softmax is used for classification. The CNN is generally followed by a Recurrent Neural Network (RNN) in order to generate captions.
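A minimal Keras sketch of this CNN-plus-RNN arrangement is given below. The feature dimension, vocabulary size, and caption length are illustrative assumptions, and the image features are assumed to be extracted beforehand with a pre-trained CNN such as InceptionV3.

# Minimal Keras sketch of the CNN + RNN captioning architecture
# described above (encoder-decoder "merge" style). All sizes are
# illustrative assumptions, not final design choices.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 5000      # assumed vocabulary size
MAX_LEN = 34           # assumed maximum caption length
FEATURE_DIM = 2048     # e.g. InceptionV3 penultimate-layer features

# Image branch: pre-extracted CNN features projected to 256 dims.
img_input = Input(shape=(FEATURE_DIM,))
img_dense = Dense(256, activation="relu")(Dropout(0.5)(img_input))

# Text branch: the partial caption encoded by an embedding + LSTM.
txt_input = Input(shape=(MAX_LEN,))
txt_embed = Embedding(VOCAB_SIZE, 256, mask_zero=True)(txt_input)
txt_lstm = LSTM(256)(Dropout(0.5)(txt_embed))

# Merge both branches and predict the next word with a softmax.
decoder = Dense(256, activation="relu")(add([img_dense, txt_lstm]))
output = Dense(VOCAB_SIZE, activation="softmax")(decoder)

model = Model(inputs=[img_input, txt_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")

At inference time the decoder is called word by word, feeding each predicted word back into the text branch until an end token or the maximum caption length is reached.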
The software we are using to implement our image captioning system is Spyder, which is simple, fun, and productive.
The approach we will be using for this deep learning project is as follows:
We start with requirement gathering, followed by the feasibility study. Coding then begins and continues for 3 weeks.
1st Member: -
(Raunak Jalan): Designing the Module, Coding and Testing, API design
2nd Member: -
(Bhuvaneshwar Choudhary): Requirement Gathering & Analysis, Coding/Testing
Innovation in Project:
The first challenge stems from the compositional nature of natural language and visual
scenes. While the training dataset contains co-occurrences of some objects in their context, a
captioning system should be able to generalize by composing objects in other contexts.
Traditional captioning systems suffer from a lack of compositionality and naturalness, as they often generate captions in a sequential manner, i.e., the next generated word depends on both the previous word and the image feature. This can frequently lead to syntactically correct but semantically irrelevant language structures, as well as to a lack of diversity in the generated captions. We propose to address the compositionality issue with a context-aware attention captioning model, which allows the captioner to compose sentences based on fragments of the observed visual scenes. Specifically, we use a recurrent language model with gated recurrent visual attention that, at every generation step, gives the choice of attending to either visual or textual cues from the last generation step.
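The following Python sketch illustrates one possible form of the gated visual attention step, written with TensorFlow/Keras. The additive scoring function, the sigmoid gate, and all dimensions are assumptions used for illustration rather than the final design.

# Illustrative sketch of a gated visual attention step: additive
# attention over spatial CNN features, with a gate deciding how much
# to rely on visual versus textual context at this generation step.
import tensorflow as tf
from tensorflow.keras import layers

class GatedVisualAttention(layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.w_feat = layers.Dense(units)    # projects image regions
        self.w_hidden = layers.Dense(units)  # projects decoder state
        self.v = layers.Dense(1)             # attention scores
        self.gate = layers.Dense(1, activation="sigmoid")

    def call(self, features, hidden):
        # features: (batch, regions, feat_dim); hidden: (batch, hid_dim)
        hidden_exp = tf.expand_dims(hidden, 1)
        scores = self.v(tf.nn.tanh(self.w_feat(features) +
                                   self.w_hidden(hidden_exp)))
        weights = tf.nn.softmax(scores, axis=1)
        context = tf.reduce_sum(weights * features, axis=1)
        # Gate in [0, 1]: near 1 the decoder attends to the visual
        # context; near 0 it falls back on the textual cues carried
        # in the recurrent state from the last generation step.
        beta = self.gate(hidden)
        return beta * context, weights

The gated context vector is then concatenated with the word embedding of the previously generated word and fed to the recurrent language model at each step.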
System Requirements:
Python 3.7.2
Software Requirements:
Spyder IDE
Python
Hardware Requirements:
CPU: Intel Pentium 4, 2.53 GHz or equivalent
OS: Microsoft Windows 7, 8.1, 10 / MacOS Mojave (version 10.14)
RAM: 2 GB
Storage: 1.4 GB of free disk space
Bibliography: -
https://www.analyticsvidhya.com/blog/2018/04/solving-an-image-captioning-task-using-deep-learning/
https://www.researchgate.net/publication/329037107_Image_Captioning_Based_on_Deep_Neural_Networks
https://medium.com/swlh/image-captioning-in-python-with-keras-870f976e0f18