
Department of Computer Science & Engineering

Synopsis
of
Image Captioning
using Deep Learning

B.E. IV Year – 7th Semester


(Branch: CSE)

SUBMITTED BY:
Raunak Jalan (17BCS2596), IS-1(B)
Bhuvaneshwar Choudhary (17BCS1762), IS-1(B)

SUBMITTED TO:
Er. Ankita Sharma (Assistant Professor)

Introduction:
People communicate through language, whether written or spoken, and they often use
this language to describe the visual world around them. Images and signs are another
means of communication and understanding, especially for physically challenged people.
Automatically generating a description of an image in proper sentences is a very
difficult and challenging task, but it can greatly help visually impaired people to
better understand the images on the web.

To make this happen, we will combine both image and text processing to build a
useful deep learning application: Image Captioning. Image captioning refers to the
process of generating a textual description of an image, based on the objects and
actions in the image.

Objective of the project: -

The objective of this project is to create a system that detects what is happening in an
image without being explicitly told what is happening. This can be applied in social
media systems, where the machine automatically suggests what the user is going to write
based on an image, or it can be used to explain to blind people what an image is about.
The project will be combined with a Flask-based web application.

Feasibility Study: -

Image captioning is a popular research area of Artificial Intelligence (AI) that deals
with understanding an image and generating a language description for it. Image
understanding requires detecting and recognizing objects, as well as understanding the
scene type or location, object properties, and the interactions between objects.
Generating well-formed sentences requires both syntactic and semantic understanding of
the language.

Understanding an image largely depends on obtaining image features. The techniques
used for this purpose can be broadly divided into two categories: (1) traditional machine
learning based techniques and (2) deep machine learning based techniques.

In traditional machine learning, hand-crafted features such as Local Binary Patterns
(LBP), Scale-Invariant Feature Transform (SIFT), the Histogram of Oriented Gradients
(HOG), and combinations of such features are widely used. In these techniques, features
are extracted from the input data and then passed to a classifier such as a Support
Vector Machine (SVM) in order to classify an object. Since hand-crafted features are
task specific, extracting them from a large and diverse set of data is not feasible.
Moreover, real-world data such as images and videos are complex and have different
semantic interpretations.
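
For contrast, here is a minimal sketch of the traditional pipeline just described:
hand-crafted HOG features fed to an SVM classifier. The scikit-image and scikit-learn
libraries and the 128x128 input size are our illustrative assumptions; the synopsis does
not name specific libraries for this baseline.

import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def extract_hog(image):
    # Resize to a fixed size so every image yields a descriptor of the same length.
    image = resize(image, (128, 128))
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), channel_axis=-1)

def train_classifier(images, labels):
    # images: list of RGB arrays; labels: object-class ids (placeholder training data).
    features = np.stack([extract_hog(img) for img in images])
    clf = SVC(kernel="linear")
    clf.fit(features, labels)
    return clf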

On the other hand, in deep machine learning based techniques, features are learned
automatically from the training data, and these techniques can handle a large and
diverse set of images and videos. For example, Convolutional Neural Networks (CNNs)
are widely used for feature learning, and a classifier such as Softmax is used for
classification. A CNN is generally followed by a Recurrent Neural Network (RNN) in
order to generate captions.
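
As a rough illustration of this CNN-to-RNN arrangement, the sketch below pairs a
pretrained CNN encoder with an LSTM decoder in Keras. Keras, InceptionV3, and all sizes
(vocab_size, max_len, the 256-unit layers) are assumptions made here for illustration;
the synopsis does not fix these choices.

from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 5000, 34   # placeholder vocabulary size and caption length

# Encoder: a pretrained CNN supplies a fixed-length image feature vector.
cnn = InceptionV3(weights="imagenet")
encoder = Model(cnn.input, cnn.layers[-2].output)   # drop the Softmax classifier head

# Decoder: the image feature conditions an LSTM language model over partial captions.
img_in = Input(shape=(2048,))                       # InceptionV3 pooled feature size
img_proj = Dense(256, activation="relu")(img_in)

seq_in = Input(shape=(max_len,))
seq_emb = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
seq_enc = LSTM(256)(seq_emb)

merged = add([img_proj, seq_enc])
out = Dense(vocab_size, activation="softmax")(Dense(256, activation="relu")(merged))
decoder = Model([img_in, seq_in], out)
decoder.compile(loss="categorical_crossentropy", optimizer="adam")

At inference time the decoder is run one word at a time, feeding each predicted word
back into the partial caption until an end token is produced.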

Methodology/ Planning of work: -

The software we are using to implement our image captioning system is Spyder,
which is simple, fun, and productive.

The approach we will be using for this deep learning project is as follows:

Step 1 – Take an image as input from the system or a camera.

Step 2 – Use a CNN to extract the features from the input image. This is our image
understanding part.

Step 3 – Linearly transform the feature vector to have the same dimension as the input
dimension of the RNN/LSTM network, which generates the caption. This is our text
generation part.

Step 4 – Send the generated caption as a response to the front end, which is built with
Flask (a minimal sketch follows this list).
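
A minimal sketch of Step 4, assuming a Flask app with a single endpoint. The
generate_caption() helper is a hypothetical stand-in for the CNN+LSTM inference
pipeline of Steps 2 and 3, not code from the synopsis, and the /caption route and
"image" form key are likewise our assumptions.

from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

def generate_caption(image):
    # Hypothetical stand-in: extract CNN features, linearly project them to the
    # LSTM input dimension (Step 3), then decode the caption word by word.
    return "a placeholder caption"

@app.route("/caption", methods=["POST"])
def caption():
    # The front end posts the image as multipart form data under the key "image".
    image = Image.open(request.files["image"].stream)
    return jsonify({"caption": generate_caption(image)})

if __name__ == "__main__":
    app.run(debug=True)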

We are starting with requirement gathering, followed by the feasibility study. The
coding will then begin, continuing as per the timeline below.

Stages of work                                   Timeline

Requirement Gathering and Feasibility Study      2 weeks
Requirement Analysis and Design                  2 weeks
Coding                                           2 weeks
Testing                                          3 weeks

At the end of the project, the following use cases will be covered.

1. On providing any image to the application as input, a relevant and creative caption
in the form of a descriptive sentence is generated.
2. The generated output will describe in a single sentence what is shown in the image:
the objects present, their properties, the actions being performed, the interactions
between the objects, etc.

Module & Team Member wise Distribution of work: -

1st Member: -
(Raunak Jalan): Designing the Module, Coding and Testing, API design
2nd Member: -
(Bhuvaneshwar Choudhary): Requirement Gathering & Analysis, Coding/Testing

Innovation in Project:
The first challenge stems from the compositional nature of natural language and visual
scenes. While the training dataset contains co-occurrences of some objects in their
context, a captioning system should be able to generalize by composing objects in other
contexts. Traditional captioning systems suffer from a lack of compositionality and
naturalness, as they often generate captions in a sequential manner, i.e., the next
generated word depends on both the previous word and the image feature. This can
frequently lead to syntactically correct but semantically irrelevant language
structures, as well as to a lack of diversity in the generated captions. We propose to
address the compositionality issue with a context-aware attention captioning model,
which allows the captioner to compose sentences based on fragments of the observed
visual scenes. Specifically, we will use a recurrent language model with a gated
recurrent visual attention that, at every generation step, chooses between attending to
visual cues and textual cues from the last generation step.
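
To illustrate the gating idea only, the sketch below blends an attended visual context
with the previous hidden (textual) state through a learned sigmoid gate. It is a
simplified reading of the proposed mechanism, with assumed shapes and layer sizes, and
is not the exact model of this project.

import tensorflow as tf

class GatedVisualAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.attend = tf.keras.layers.Dense(1)                    # scores image regions
        self.gate = tf.keras.layers.Dense(1, activation="sigmoid")

    def call(self, image_regions, hidden_state):
        # image_regions: (batch, regions, units); hidden_state: (batch, units)
        h = tf.expand_dims(hidden_state, 1)
        scores = self.attend(tf.nn.tanh(image_regions + h))       # (batch, regions, 1)
        weights = tf.nn.softmax(scores, axis=1)
        visual_ctx = tf.reduce_sum(weights * image_regions, axis=1)
        # g near 1 -> rely on visual cues; g near 0 -> rely on the textual state.
        g = self.gate(hidden_state)
        return g * visual_ctx + (1.0 - g) * hidden_state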

Dependencies and Requirements: -

System Requirements:
 Python 3.7.2

Software Requirements:
 Spyder IDE
 Python

Hardware Requirements:
 CPU: Intel Pentium 4, 2.53 GHz or equivalent
 OS: Microsoft Windows 7, 8.1, or 10 / macOS Mojave (version 10.14)
 RAM: 2 GB
 Storage: 1.4 GB of free disk space
Bibliography: -
 https://www.analyticsvidhya.com/blog/2018/04/solving-an-image-captioning-task-using-deep-learning/
 https://www.researchgate.net/publication/329037107_Image_Captioning_Based_on_Deep_Neural_Networks
 https://medium.com/swlh/image-captioning-in-python-with-keras-870f976e0f18
