
IMAGE CAPTION GENERATION USING DEEP LEARNING

A Project Report Submitted in the partial fulfillment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING

By

K. Bhavani Prasanna (17981A0577)          K. Gayatri (17981A0595)

K. Dinesh Sai Chandu Shetty (17981A0582)  M. Jaya Prakash (17981A05A2)

Under the esteemed guidance of

P.Appala Naidu

Associate Professor

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

RAGHU ENGINEERING COLLEGE


(Autonomous)
Affiliated to JNTU-GV-Vizianagaram
Approved by AICTE, Accredited by NBA, Accredited by NAAC with A+ grade
www.raghuenggcollege.com
2024-2025
RAGHU ENGINEERING COLLEGE
(Autonomous)
Affiliated to JNTU Kakinada
Approved by AICTE, Accredited by NBA, Accredited by NAAC with A+ grade
www.raghuenggcollege.com

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE

This is to certify that the project entitled “Image Caption Generation Using
Deep Learning”, done by K. Bhavani Prasanna, K. Gayatri, K. Dinesh Sai Chandu Shetty, and
M. Jaya Prakash, bearing Regd. Nos. 17981A0577, 17981A0595, 17981A0582, and 17981A05A2,
students of B.Tech in the Department of Computer Science and Engineering, Raghu
Engineering College, during the period 2020-2021, in partial fulfillment for the award of the
Degree of Bachelor of Technology in Computer Science & Engineering of Jawaharlal Nehru
Technological University, Kakinada, is a record of bonafide work carried out under my guidance
and supervision.
The results embodied in this project report have not been submitted to any other University
or Institute for the award of any Degree.

Internal Guide                                Head of the Department

Mr. P. Appala Naidu,                          Dr. P. Appala Naidu,
Dept of CSE,                                  Dept of CSE,
Raghu Engineering College,                    Raghu Engineering College,
Dakamarri (V),                                Dakamarri (V),
Visakhapatnam.                                Visakhapatnam.

External Examiner

DISSERTATION APPROVAL SHEET
This is to certify that the dissertation titled

Image Caption Generation Using Deep Learning

By

K.Bhavani Prasanna K.Gayatri

K.Dinesh Sai Chandu Shetty M.Jaya Prakash


is approved for the degree of Bachelor of Technology.

Mr. P.Appala Naidu (Guide)

Internal Examiner

External Examiner

HOD

Date

DECLARATION
This is to certify that the project titled “Image Caption Generation Using Deep Learning” is
bonafide work done by our team, in partial fulfillment of the requirements for the award of the
degree of B.Tech., and submitted to the Department of Computer Science & Engineering, Raghu
Engineering College, Dakamarri.

We also declare that this project is the result of our own effort, that it has not been copied from
anyone, and that we have taken citations only from the sources mentioned in the references.

This work has not been submitted earlier to any other University or Institute for the award of any
degree.

Place: NAME :
K. BHAVANI PRASANNA
Date: K. GAYATRI
K. DINESH SAI CHANDU SHETTY
M. JAYA PRAKASH

ACKNOWLEDGEMENT
We express our sincere gratitude to our esteemed institute, “Raghu Institute of
Technology”, which has provided us with the opportunity to fulfil our most cherished desire of
reaching our goal.

We take this opportunity, with great pleasure, to put on record our ineffable personal
indebtedness to Sri Raghu Kalidindi, Chairman of Raghu Institute of Technology, for
providing the necessary departmental facilities.

We would like to thank the Principal Dr. Ch. Srinivasu, Dr. A. Vijay Kumar, Dean of
Planning & Development, Dr. E. V. V. Ramanamurthy, Controller of Examinations, and the
Management of “Raghu Engineering College”, for providing the requisite facilities to carry
out the project on the campus.

Our sincere thanks to Dr. P. Appala Naidu, HoD, Department of Computer Science and
Engineering, Raghu Engineering College, for his kind support in the successful completion of
this work.

We sincerely express our deep sense of gratitude to Dr. P. Appala Naidu, Professor,
Department of Computer Science and Engineering, Raghu Engineering College, for his
perspicacity, wisdom, and sagacity coupled with compassion and patience. It is our great
pleasure to submit this work under his wing.

We extend deep-hearted thanks to all faculty members of the Computer Science
department for the value-based imparting of theory and practical subjects, which were used in
the project.

We are thankful to the non-teaching staff of the Department of Computer Science and
Engineering, Raghu Engineering College, for their inexpressible support.

Regards
K. Bhavani Prasanna (17981A0577)

K. Gayatri (17981A0595)

M. Jaya Prakash (17981A05A2)

K. Dinesh Sai Chandu Shetty (17981A0582)

ABSTRACT

Image caption generation involves recognizing the context of an image and describing it in
a natural language such as English. This project involves image detection, image recognition, and
NLP for generating captions using deep learning.
In this Python project, we implement the system using a CNN (Convolutional
Neural Network) and an LSTM (Long Short-Term Memory network). The CNN works as an encoder to
extract features from images, and the LSTM, which is a form of RNN, works as a decoder to
generate words. The image features are extracted with ResNet, a CNN model
trained on the ImageNet dataset, and are then fed into the LSTM model, which
is responsible for generating the image captions.
DATASET: Flickr_8K

TABLE OF CONTENTS

Certificate
Dissertation Approval Sheet
Declaration
Acknowledgement
Abstract
Contents
List of Figures

CHAPTER 1: INTRODUCTION
1.1 Purpose
1.2 Scope
1.3 Motivation
1.4 Fundamental Concepts
    1.4.1 Artificial Intelligence
    1.4.2 Machine Learning
    1.4.3 Deep Learning
1.5 Proposed System Fundamentals
1.6 Proposed Algorithms
    1.6.1 RMSprop Optimizer Algorithm

CHAPTER 2: SYSTEM ANALYSIS
2.1 Existing System
2.2 Proposed System

CHAPTER 3: SYSTEM REQUIREMENTS SPECIFICATION
3.1 Hardware Requirements
3.2 Software Requirements
3.3 Prerequisites

CHAPTER 4: SYSTEM DESIGN
4.1 System Model
4.2 System Architecture
    4.2.1 CNN
    4.2.2 LSTM
    4.2.3 Dataset
    4.2.4 Transfer Model
4.3 UML Diagrams

CHAPTER 5: IMPLEMENTATION
5.1 Technology Description
5.2 Sample Code

CHAPTER 6: SCREEN SHOTS

CHAPTER 7: SYSTEM TESTING
7.1 Introduction to Testing
7.2 Types of Testing

CHAPTER 8: CONCLUSION AND FUTURE ENHANCEMENTS

CHAPTER 9: LITERATURE SURVEY

CHAPTER 10: REFERENCES

LIST OF FIGURES

Fig 1.1 Introduction
Fig 1.3 Motivation
Fig 4.1 System Architecture
Fig 4.2.1 CNN Architecture
Fig 4.2.1.1 Convolution
Fig 4.2.1.1b ReLU Layer
Fig 4.2.1.2 Pooling
Fig 4.2.1.3 Flattening
Fig 4.2.1.4 Full Connection
Fig 4.2.2 LSTM
Fig 4.2.2.1 LSTM Architecture
Fig 4.2.3 Data Set
Fig 4.2.4 Transfer Learning
Fig 4.3.1 Use Case Diagram
Fig 4.3.2 Activity Diagram
Fig 4.3.3 State Chart Diagram
Fig 4.3.4 Sequence Diagram

CHAPTER-1
INTRODUCTION

1. Introduction

Image captioning is one of the challenging problems in AI. It involves two tasks: understanding
the content of an image and generating a caption for that image. Just as humans are able to
detect, identify, and describe images, a computer should also be able to perform these tasks.

1.1 Purpose
The main challenge is to create a model so that a computer can give captions for an image just
as we humans are able to describe an image once we see it and know about it.
1.2 Scope

To develop and train a model so that it can predict captions for a given
image using deep learning approaches.
1.3 Motivation
Look around and you will see a lot of day-to-day objects. You recognize them almost
instantaneously and involuntarily. You don’t have to wait for a few minutes after looking at a
table to understand that it is in fact a table.

1.3.1 How Humans Do It?
Visual processing happens in the ventral visual stream, a hierarchy
of areas in the brain that helps in object recognition. Whenever we look at an object, our
brain extracts its features in such a way that size, orientation, illumination,
perspective, etc. do not matter. You remember an object by its shape and inherent features;
it does not matter how the object is placed, how big or small it is, or which side is visible to
you. Information processing here involves taking data from the visual stream through these
neurons, which helps us detect the object.
1.3.2 How Machines Can Do It?
This is not an easy task for machines, because they must
extract the features of an object at any size and still be able to detect and identify it.
But because of the advancement of technology in the field of Artificial Intelligence, it is possible.
Using technologies such as Deep Learning, Neural Networks, and NLP, algorithms can be applied
for image detection, image recognition, and language processing. The model we
designed uses a CNN for image feature extraction and an LSTM for language generation for a given
image.

1.4 Fundamental Concepts


1.4.1 Artificial Intelligence:
It is the simulation of natural intelligence in machines that are programmed to
learn and mimic the actions of humans. These machines are able to learn with experience
and perform human-like tasks.

1.4.2 Machine Learning:


ML teaches a machine how to make inferences and decisions based on past
experience. It identifies patterns and analyses past data to infer the meaning of these data
points and reach a possible conclusion without involving human experience. This
automation of reaching conclusions by evaluating data saves human time for businesses
and helps them make better decisions. Learning can be supervised or unsupervised.

1.4.3 Deep Learning:


Deep learning is a type of machine learning and artificial intelligence (AI) that
imitates the way humans gain certain types of knowledge. Deep learning is an important
element of data science, which includes statistics and predictive modeling. It is extremely
beneficial to data scientists who are tasked with collecting, analyzing and interpreting large
amounts of data; deep learning makes this process faster and easier.

At its simplest, deep learning can be thought of as a way to automate predictive


analytics. While traditional machine learning algorithms are linear, deep learning algorithms are
stacked in a hierarchy of increasing complexity and abstraction.

To understand deep learning, imagine a toddler whose first word is dog. The toddler learns
what a dog is -- and is not -- by pointing to objects and saying the word dog. The parent says,
"Yes, that is a dog," or, "No, that is not a dog." As the toddler continues to point to objects, he
becomes more aware of the features that all dogs possess. What the toddler does, without
knowing it, is clarify a complex abstraction -- the concept of dog -- by building a hierarchy in
which each level of abstraction is created with knowledge that was gained from the preceding
layer of the hierarchy.

How deep learning works
Computer programs that use deep learning go through much the same process as the toddler
learning to identify the dog. Each algorithm in the hierarchy applies a nonlinear transformation
to its input and uses what it learns to create a statistical model as output. Iterations continue
until the output has reached an acceptable level of accuracy. The number of processing layers
through which data must pass is what inspired the label deep. Object recognition, speech
recognition, and language translation are some of the tasks performed through deep learning.

Deep Neural Networks:

A type of advanced machine learning algorithm, known as an artificial neural network,
underpins most deep learning models. As a result, deep learning may sometimes be referred to as
deep neural learning or deep neural networking.

Neural networks come in several different forms, including Recurrent Neural Networks,
Convolutional Neural Networks, Artificial Neural Networks and Feed Forward Neural
Networks, and each has benefits for specific use cases. However, they all function in somewhat
similar ways -- by feeding data in and letting the model figure out for itself whether it has made
the right interpretation or decision about a given data element.

NLP

NLP is the science of a machine reading, understanding, and interpreting language. Once
a machine understands what the user intends to communicate, it responds accordingly.

1.5 Proposed System Fundamentals

The proposed system combines an image-based model and a language-based model: image feature
extraction is handled mainly by a CNN, while the language generation part produces the textual
description of the image.

1.6 Proposed Algorithms

1.6.1 RMSprop Optimizer Algorithm

RMSprop is a gradient-based optimization technique used in training neural networks. It was
proposed by Geoffrey Hinton, one of the pioneers of back-propagation. Gradients of very complex
functions such as neural networks have a tendency to either vanish or explode as the data
propagates through the function (the vanishing/exploding gradient problem). RMSprop was
developed as a stochastic technique for mini-batch learning.

RMSprop deals with this issue by using a moving average of squared gradients to
normalize the gradient. This normalization balances the step size, decreasing
the step for large gradients to avoid exploding and increasing the step for small gradients to
avoid vanishing.

Simply put, RMSprop uses an adaptive learning rate instead of treating the learning rate as a
fixed hyperparameter: the effective learning rate changes over time.

RMSprop’s update rule, for a weight W and bias b, with decay rate β, learning rate α, and a
small constant ε for numerical stability:

v_dW = β · v_dW + (1 − β) · (dW)²
v_db = β · v_db + (1 − β) · (db)²
W = W − α · dW / √(v_dW + ε)
b = b − α · db / √(v_db + ε)
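To make the update rule concrete, the following is a minimal NumPy sketch added here as an illustration only; it is not part of the original project code, and the helper name rmsprop_step and the toy quadratic objective are assumptions made for the example.

import numpy as np

def rmsprop_step(w, grad, v, beta=0.9, lr=0.01, eps=1e-8):
    # One RMSprop update: keep a moving average of squared gradients
    # and scale the step by its square root.
    v = beta * v + (1 - beta) * grad ** 2
    w = w - lr * grad / np.sqrt(v + eps)
    return w, v

# Toy usage: minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w, v = 0.0, 0.0
for _ in range(2000):
    grad = 2 * (w - 3)
    w, v = rmsprop_step(w, grad, v)
print(round(w, 3))  # moves towards 3.0

In Keras, an equivalent optimizer is available out of the box as keras.optimizers.RMSprop.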

CHAPTER-2
SYSTEM ANALYSIS

2.1 Existing System

Fig 2.1 Architecture of the existing system

 For a given input image, a CNN-RNN model is applied, where the CNN performs feature
extraction on the image. The existing system uses a pre-trained model called Xception.

 The steps for generating captions with object detection and feature extraction using
neural networks are as follows.

 Step 1: Object Detection. In this step, the objects in the input image are detected
using the R-CNN region proposal approach.

 Step 2: Feature Extraction. In this step, the features in the image are extracted using
principal component analysis with NumPy. The CNN is used for scene classification
and the RNN is used for detecting objects and human attributes.

 Step 3: Creating Attributes. In this step, the features extracted by the neural
networks are used to define the attributes with their label strings.

 Step 4: Encoder and Decoder. In this step, the label strings are passed to an
encoder RNN that encodes them into a proper format, and the resulting
variable-length string is passed to a fixed-length decoder that converts it into a
fixed-length descriptive sentence.

2.2 Proposed System:

 Analysis of the existing system helps in framing the problem statement of the proposed
system.

 The proposed system takes an image as input; feature extraction is performed on the image
using a CNN, and the extracted features are passed to the language model, where an LSTM
generates the caption, using beam search to return the most accurate result.

CHAPTER-3
SYSTEM REQUIREMENTS SPECIFICATION

3.1 Hardware Requirements

 System : Intel i3 and above
 RAM : 4 GB (minimum), 8 GB (recommended)
 Usage of GPU : Recommended

3.2 Software Requirements

 Operating System : Windows
 Programming Language : Python 3.6 or above
 Environment : Kaggle Notebook or Google Colab
3.3 Prerequisites
The modules we should import are:
OpenCV
OpenCV is a huge open-source library for computer vision, machine learning, and
image processing, and it now plays a major role in the real-time operation that is very
important in today's systems. Using it, one can process images and videos to identify
objects, faces, or even human handwriting.
Keras
Keras is an API designed for human beings, not machines. Keras follows best practices for
reducing cognitive load: it offers consistent and simple APIs, it minimizes the number of
user actions required for common use cases, and it provides clear and actionable feedback
upon user error.
Pandas
Pandas is mainly used for data analysis. Pandas allows importing data from various file
formats such as comma-separated values, JSON, SQL, and Microsoft Excel. Pandas supports
various data manipulation operations such as merging, reshaping, and selecting, as well as data
cleaning and data wrangling.

CHAPTER-4
SYSTEM DESIGN

4.1 System Model
 Method : Deep Model
 Type : CNN + LSTM
 Dataset Used: Flickr8K Dataset
 Pre-Trained Model Used: ResNet50

Fig 4.1
4.2 SYSTEM ARCHITECTURE

4.2.1 Convolutional Neural Network(CNN): CNN is a Deep Learning


algorithm which takes in an input image and assigns importance (learnable
weights and biases) to various aspects/objects in the image, which helps it
differentiate one image from the other.
The following are the steps involved:
Step 1: Convolution
Convolution takes the input image and a feature detector (filter), typically of size 3×3, 7×7,
or 9×9; here we consider only 3×3. Starting from the top-left of the image, the 3×3 window is
slid from left to right and from top to bottom; at each position the number of matching cells is
counted and written into the corresponding cell of the feature map, and the process is repeated
until all rows of the feature map are filled.

Fig 4.2.1.1
Step 1b: ReLU Layer
The Rectified Linear Unit, or ReLU, is not a separate component of the
convolutional network process but the most commonly used activation function.
The main purpose of applying the rectifier function is to increase the non-linearity of the
images. The function returns 0 if the input is negative; otherwise it returns the value unchanged.
It is written as,
f(x) = max(0, x)

Fig 4.2.1b

Step 2: Pooling

Fig 4.2.1.2
The process of filling in a pooled feature map differs from the one we
used to come up with the regular feature map. This time, a 2×2 box is placed at
the top-left corner and moved along the row.

For every 4 cells the box stands on, the maximum numerical value is found and
inserted into the pooled feature map. In the example above, the box currently
contains a group of cells whose maximum value is 4.

Step 3: Flattening
Before the pooled feature map is fed into the artificial neural network, it must be flattened
into a single vector.

Fig 4.2.1.3

Step 4: Full Connection

Fig 4.2.1.4
This full connection works as follows:

 The neuron in the fully-connected layer detects a certain feature; say, a nose.
 It preserves its value.
 It communicates this value to both the "dog" and the "cat" classes.
 Both classes check out the feature and decide whether it's relevant to them.

Fig 4.2.1: CNN architecture
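The convolution, ReLU, pooling, flattening, and full-connection steps described above map directly onto standard Keras layers. The snippet below is a minimal illustrative sketch, not the project's caption model; the layer sizes, the 64×64 input, and the two-class (dog/cat) output are assumptions chosen only to mirror the example in this section.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Convolution + ReLU -> Pooling -> Flattening -> Full connection
cnn = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),  # Step 1 & 1b
    MaxPooling2D(pool_size=(2, 2)),                                  # Step 2
    Flatten(),                                                       # Step 3
    Dense(128, activation='relu'),                                   # Step 4
    Dense(2, activation='softmax'),                                  # e.g. "dog" vs "cat"
])
cnn.summary()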
4.2.2 Long Short Term Memory(LSTM)
Long Short Term Memory Network is an advanced RNN, a sequential
network, that allows information to persist. It is capable of handling the vanishing
gradient problem faced by RNN.
Architecture:
It has 3 parts
1. Forget Gate
2. Input Gate
3. Output Gate as shown below

i) Forget Gate:
The forget gate decides whether we should keep the information from the previous timestamp or
forget it. Using the notation below, it is computed as

f_t = σ(X_t · U_f + H_{t-1} · W_f)

where
 X_t : input at the current timestamp
 U_f : weight matrix associated with the input
 H_{t-1} : hidden state of the previous timestamp
 W_f : weight matrix associated with the hidden state

ii) Input Gate:

The input gate is used to quantify the importance of the new information carried by the
input. With its own weight matrices U_i and W_i, it takes the analogous form

i_t = σ(X_t · U_i + H_{t-1} · W_i)

iii) Output Gate:

Based on the current expectation, the network has to produce a relevant word to fill in the
blank. That word is the output, and producing it is the function of the output gate:

o_t = σ(X_t · U_o + H_{t-1} · W_o)

Now Calculate the Hidden State

The hidden state is a function of the long-term memory (C_t) and the output gate:

H_t = o_t · tanh(C_t)

If you need the output of the current timestamp, just apply the SoftMax activation to the
hidden state H_t.

Softmax Activation Layer:

o Softmax converts a vector of values to a probability distribution.


o The elements of the output vector are in range (0, 1) and sum to 1.
o Each vector is handled independently. The axis argument sets which axis of the
input the function is applied along.
o Softmax is often used as the activation for the last layer of a classification
network because the result could be interpreted as a probability distribution.
o The softmax of each vector x is computed as

exp(x) / tf.reduce_sum(exp(x)).
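To make the computation above concrete, here is a tiny illustrative sketch in plain NumPy (used instead of TensorFlow purely to show the arithmetic; the scores vector is made up for the example):

import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # approximately [0.659 0.242 0.099]
print(softmax(scores).sum())  # 1.0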

Fig 4.2.2.1: LSTM Architecture

4.2.3 Flickr8k Dataset:

The dataset consists of 8,091 unique images, and each image is mapped to five different
sentences that describe it (a sample line of the caption file is shown below).
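For reference, each line of the Flickr8k caption file pairs an image name (with a #0–#4 caption index) and one caption, separated by a tab; the sample code in Chapter 5 strips the last two characters of the first field to recover the image name. The line below is a made-up example in that format, shown only to illustrate the parsing:

# Hypothetical line from Flickr8k.token.txt (format: <image>#<idx>\t<caption>)
line = '1000268201_693b08cb0e.jpg#0\tA child in a pink dress is climbing up a set of stairs .'
img_name = line.split('\t')[0][:-2]   # -> '1000268201_693b08cb0e.jpg'
caption = line.split('\t')[1]         # -> the caption text
print(img_name, '|', caption)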

4.2.4 Transfer Learning

o Transfer learning is about leveraging feature representations from a pre-trained model, so
you don’t have to train a new model from scratch.
o The pre-trained models are usually trained on massive datasets that are a
standard benchmark in the computer vision community.
o The weights obtained from these models can be reused in other computer vision tasks.
o These models can be used directly to make predictions on new tasks or integrated
into the process of training a new model. Including a pre-trained model in a new
model leads to lower training time and lower generalization error.
o Transfer learning is particularly useful when you have a small training dataset.
Popular models are ResNet50, VGG16, Inception, and Xception, which are trained on
ImageNet as the source dataset.
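As a short illustration of how transfer learning is applied in this project, a pre-trained ResNet50 from keras.applications can be loaded with ImageNet weights and reused purely as a feature extractor (the 2048-dimensional output of its penultimate layer). This is a hedged sketch: the image path is a placeholder and preprocess_input is an addition not shown in the report's own listing.

import numpy as np
from keras.applications import ResNet50
from keras.applications.resnet50 import preprocess_input
from keras.preprocessing import image
from keras.models import Model

base = ResNet50(weights='imagenet', include_top=True)
# Reuse everything up to the penultimate layer as a 2048-d feature extractor
feature_extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

img = image.load_img('example.jpg', target_size=(224, 224))  # placeholder path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = feature_extractor.predict(x)
print(features.shape)  # (1, 2048)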

4.3 UML Diagrams
UML, which stands for Unified Modeling Language, is a way to visually represent the
architecture, design, and implementation of complex software systems. When you’re
writing code, there are thousands of lines in an application, and it’s difficult to keep track
of the relationships and hierarchies within a software system. UML diagrams divide that
software system into components and subcomponents.
Use Case Diagram
A use case diagram is a visual representation of a set of use cases and actors that may be
used to define the functionality and behaviour of a whole application system. The needs of
a system, including internal and external factors, are gathered via use case diagrams. The
majority of these criteria are design-related. As a result, when a system's functionalities are
analysed, use cases are prepared and actors are identified.

Activity Diagram

Activity diagrams visualize the steps performed in a use case—the activities can be
sequential, branched, or concurrent. This type of UML diagram is used to show the
dynamic behavior of a system, but it can also be useful in business process modeling.

State Chart Diagram
State diagrams, simply put, depict states and transitions. A state refers to the different
combinations of information that an object can hold, and this UML diagram can visualize
all possible states and the way the object transitions from one state to the next.

Flow Chart Diagram

A flowchart is a diagram that depicts the steps in a process. It began as a tool for
expressing algorithms and programming logic in computer science, but it has since
expanded to cover all sorts of processes. Flowcharts are now widely used for presenting
data and supporting structured thinking.

CHAPTER-5
IMPLEMENTATION

5.1 Technology Description
The different phases involved in image caption generation are as follows:
Step 01: Preprocessing of Data
This involves image and text preprocessing for the Flickr8K dataset.
Step 02: Image Detection
This is done by extracting features from a pre-trained model called ResNet50; using the CNN,
image classification is performed, and the RMSprop optimizer is used during training so that the
most probable output is obtained.
Step 03: Sentence Generation
This is done by the LSTM, which accepts the image features and generates a caption for the image
using the softmax activation layer.
Step 04: Evaluation
We use accuracy as a metric; when the model predicts using argmax (or beam search), the
caption with the highest score is returned.

5.2 Sample Code

import numpy as np
import pandas as pd
import cv2
import os
from glob import glob
from keras.applications import ResNet50
from keras.models import Model

# Image preprocessing: collect all image paths from the Flickr8k folder
images_path = '../input/flickr8k-sau/Flickr_Data/Images/'
images = glob(images_path + '*.jpg')
len(images)

# ResNet50 (ImageNet weights) up to its penultimate layer = 2048-d feature extractor
incept_model = ResNet50(include_top=True)
last = incept_model.layers[-2].output
modele = Model(inputs=incept_model.input, outputs=last)
modele.summary()

# Extract a feature vector for every image (implied but missing in the original
# listing; images_features is used below). ResNet50 with include_top=True expects 224x224 input.
images_features = {}
for path in images:
    img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (224, 224)).reshape(1, 224, 224, 3)
    images_features[os.path.basename(path)] = modele.predict(img).reshape(2048,)

# Text preprocessing: read the caption file and build {image_name: [captions]}
caption_path = '../input/flickr8k-sau/Flickr_Data/Flickr_TextData/Flickr8k.token.txt'
captions = open(caption_path, 'rb').read().decode('utf-8').split('\n')
len(captions)

captions_dict = {}
for i in captions:
    try:
        img_name = i.split('\t')[0][:-2]   # strip the trailing "#0".."#4" caption index
        caption = i.split('\t')[1]
        if img_name in images_features:
            if img_name not in captions_dict:
                captions_dict[img_name] = [caption]
            else:
                captions_dict[img_name].append(caption)
    except:
        pass
# Visualize a few images together with their captions
import matplotlib.pyplot as plt

for k in list(images_features.keys())[:5]:
    plt.figure()
    img_name = images_path + k
    img = cv2.imread(img_name)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    plt.xlabel(captions_dict[img_name.split('/')[-1]])
    plt.imshow(img)
# Lower-case each caption and add start/end tokens
def preprocessed(txt):
    modified = txt.lower()
    modified = 'startofseq ' + modified + ' endofseq'
    return modified

for k, v in captions_dict.items():
    for vv in v:
        captions_dict[k][v.index(vv)] = preprocessed(vv)

# Count word frequencies across all captions
count_words = {}
for k, vv in captions_dict.items():
    for v in vv:
        for word in v.split():
            if word not in count_words:
                count_words[word] = 0
            else:
                count_words[word] += 1
len(count_words)

# Build the vocabulary: every word with count above THRESH gets an integer index
THRESH = -1
count = 1
new_dict = {}
for k, v in count_words.items():
    if count_words[k] > THRESH:
        new_dict[k] = count
        count += 1

new_dict['<OUT>'] = len(new_dict)
captions_backup = captions_dict.copy()
captions_dict = captions_backup.copy()

# Replace every word with its vocabulary index (<OUT> for unknown words)
for k, vv in captions_dict.items():
    for v in vv:
        encoded = []
        for word in v.split():
            if word not in new_dict:
                encoded.append(new_dict['<OUT>'])
            else:
                encoded.append(new_dict[word])
        captions_dict[k][vv.index(v)] = encoded

from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

# The longest encoded caption determines the padding length
MAX_LEN = 0
for k, vv in captions_dict.items():
    for v in vv:
        if len(v) > MAX_LEN:
            MAX_LEN = len(v)

Batch_size = 5000
VOCAB_SIZE = len(new_dict)

# Build training arrays: for each caption, every prefix predicts the next word
def generator(photo, caption):
    X = []
    y_in = []
    y_out = []

    for k, vv in caption.items():
        for v in vv:
            for i in range(1, len(v)):
                X.append(photo[k])

                in_seq = [v[:i]]
                out_seq = v[i]

                in_seq = pad_sequences(in_seq, maxlen=MAX_LEN, padding='post', truncating='post')[0]
                out_seq = to_categorical([out_seq], num_classes=VOCAB_SIZE)[0]

                y_in.append(in_seq)
                y_out.append(out_seq)

    return X, y_in, y_out

X, y_in, y_out = generator(images_features, captions_dict)
X = np.array(X)
y_in = np.array(y_in, dtype='float64')
y_out = np.array(y_out, dtype='float64')
X.shape, y_in.shape, y_out.shape

MODEL

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical, plot_model
from keras.models import Model, Sequential
from keras.layers import (Input, Dense, Flatten, Convolution2D, Dropout, LSTM,
                          TimeDistributed, Embedding, Bidirectional, Activation,
                          RepeatVector, Concatenate)
from keras.callbacks import ModelCheckpoint

# Inverse vocabulary: index -> word, used later to decode predictions
inv_dict = {v: k for k, v in new_dict.items()}
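# NOTE (added for completeness): the original listing omits the definition of
# `model`, which is saved and used below. A plausible CNN-feature + LSTM merge
# architecture, consistent with the 2048-d ResNet50 features, MAX_LEN and
# VOCAB_SIZE defined above, is sketched here; the layer sizes are assumptions
# (the report's screenshots mention 50 epochs and batch size 512), not the
# authors' exact values.
embedding_size = 128

image_input = Input(shape=(2048,))
image_branch = Dense(embedding_size, activation='relu')(image_input)
image_branch = RepeatVector(MAX_LEN)(image_branch)

caption_input = Input(shape=(MAX_LEN,))
caption_branch = Embedding(VOCAB_SIZE, embedding_size, input_length=MAX_LEN)(caption_input)
caption_branch = LSTM(256, return_sequences=True)(caption_branch)
caption_branch = TimeDistributed(Dense(embedding_size))(caption_branch)

merged = Concatenate()([image_branch, caption_branch])
merged = Bidirectional(LSTM(256, return_sequences=False))(merged)
output = Dense(VOCAB_SIZE, activation='softmax')(merged)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit([X, y_in], y_out, batch_size=512, epochs=50)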


model.save('model.h5')
model.save_weights('mine_model_weights.h5')
np.save('vocab.npy', new_dict)

# Load and preprocess a single test image for the ResNet50 feature extractor
def getImage(x):
    test_img_path = images[x]
    test_img = cv2.imread(test_img_path)
    test_img = cv2.cvtColor(test_img, cv2.COLOR_BGR2RGB)
    test_img = cv2.resize(test_img, (224, 224))   # ResNet50 (include_top=True) expects 224x224
    test_img = np.reshape(test_img, (1, 224, 224, 3))
    return test_img

# Greedy decoding: repeatedly predict the next word until 'endofseq' or 25 words
for i in range(5):
    no = np.random.randint(1500, 7000, (1, 1))[0, 0]
    test_feature = modele.predict(getImage(no)).reshape(1, 2048)

    test_img_path = images[no]
    test_img = cv2.imread(test_img_path)
    test_img = cv2.cvtColor(test_img, cv2.COLOR_BGR2RGB)

    text_inp = ['startofseq']
    count = 0
    caption = ''
    while count < 25:
        count += 1

        encoded = [new_dict[word] for word in text_inp]
        encoded = [encoded]
        encoded = pad_sequences(encoded, padding='post', truncating='post', maxlen=MAX_LEN)

        prediction = np.argmax(model.predict([test_feature, encoded]))
        sampled_word = inv_dict[prediction]
        caption = caption + ' ' + sampled_word

        if sampled_word == 'endofseq':
            break

        text_inp.append(sampled_word)

    plt.figure()
    plt.imshow(test_img)
    plt.xlabel(caption)

CHAPTER-6
SCREEN SHOTS

IMAGE

Visualize With Captions

MODEL

Check whether the model is fitted or not. This can be verified by training it for a number of
epochs.

Epochs:
An epoch means training the neural network with all the training data for one cycle. In
an epoch, we use all of the data exactly once. A forward pass and a backward pass together
are counted as one pass. An epoch is made up of one or more batches, where we use a part
of the dataset to train the neural network.
The screenshot below shows a sample of the training epochs. We trained for 50 epochs with a
batch size of 512, reporting the categorical cross-entropy loss and accuracy for each epoch.
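For example, if the generator above produces n_samples training pairs, one epoch at batch size 512 consists of ceil(n_samples / 512) weight-update steps. A quick way to check this (the variable names here are just for illustration):

import math

n_samples = len(X)        # number of (image-feature, caption-prefix) training pairs
batch_size = 512
steps_per_epoch = math.ceil(n_samples / batch_size)
print(steps_per_epoch)    # batches processed in one epoch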

PREDICTION

Evaluation:
Normally, by selecting a particular decoding approach for the LSTM (beam search or greedy
search), you can obtain output as above using np.argmax, which picks the word with the highest
probability at each step. If you want to evaluate the quality of the generated captions, you can
use metrics such as the BLEU score (Bilingual Evaluation Understudy) – BLEU-1, 2, 3, 4 – as
well as METEOR and ROUGE.
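As a brief illustration of how a BLEU score could be computed for a generated caption against the reference captions, the sketch below uses NLTK (which is not among the project's listed dependencies), and the captions are made-up examples:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    'a dog is running through the grass'.split(),
    'a brown dog runs across a grassy field'.split(),
]
candidate = 'a dog runs through the grass'.split()

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu(references, candidate, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(round(bleu1, 3), round(bleu4, 3))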

CHAPTER-7
TESTING

7.1 Introduction to Testing
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, subassemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets
its requirements and user expectations and does not fail in an unacceptable manner. There
are various types of tests; each test type addresses a specific testing requirement.
7.2 Types of Testing
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly and that program inputs produce valid outputs. All decision branches
and internal code flow should be validated. It is the testing of individual software units of
the application; it is done after the completion of an individual unit, before integration. This
is structural testing that relies on knowledge of the unit's construction and is invasive. Unit
tests perform basic tests at the component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path of a
business process performs accurately to the documented specifications and contains clearly
defined inputs and expected results.
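To make this concrete for the present project, a minimal unit test might look like the hedged sketch below; it uses Python's built-in unittest module, and the function under test mirrors the preprocessed() helper from the sample code in Chapter 5.

import unittest

def preprocessed(txt):
    # Same behaviour as the sample-code helper: lower-case and add start/end tokens
    return 'startofseq ' + txt.lower() + ' endofseq'

class TestPreprocessing(unittest.TestCase):
    def test_adds_start_and_end_tokens(self):
        result = preprocessed('A dog runs')
        self.assertTrue(result.startswith('startofseq '))
        self.assertTrue(result.endswith(' endofseq'))
        self.assertIn('a dog runs', result)  # text is lower-cased

if __name__ == '__main__':
    unittest.main()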
Integration testing
Integration tests are designed to test integrated software components to determine if they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that although the components
were individually satisfactory, as shown by successful unit testing, the combination of
components is correct and consistent. Integration testing is specifically aimed at exposing
the problems that arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that the functions tested are available as
specified by the business and technical requirements, system documentation, and user
manuals.
Functional testing is centered on the following items:
• Valid Input : identified classes of valid input must be accepted.
• Invalid Input : identified classes of invalid input must be rejected.
• Functions : identified functions must be exercised.
• Output : identified classes of application outputs must be exercised.
• Systems/Procedures: interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key functions,
or special test cases. In addition, systematic coverage pertaining to identified business
process flows, data fields, predefined processes, and successive processes must be
considered for testing. Before functional testing is complete, additional tests are identified
and the effective value of the current tests is determined.
System Testing
System testing ensures that the entire integrated software system meets requirements. It
tests a configuration to ensure known and predictable results. An example of system testing
is the configuration oriented system integration test. System testing is based on process
descriptions and flows, emphasizing pre-driven process links and integration points.
White Box Testing
White Box Testing is testing in which the software tester has knowledge of the
inner workings, structure, and language of the software, or at least its purpose.
It is used to test areas that cannot be reached from a black box level.
Black Box Testing
Black Box Testing is testing the software without any knowledge of the inner workings,
structure, or language of the module being tested. Black box tests, as most other kinds of
tests, must be written from a definitive source document, such as a specification or
requirements document. It is testing in which the software under test is treated as a black
box: you cannot “see” into it. The test provides inputs and responds to outputs without
considering how the software works.

Unit Testing
Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted
as two distinct phases.

Test strategy and approach: field testing will be performed manually and functional tests
will be written in detail.

Test objectives:
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.

Features to be tested:
• Verify that the entries are of the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.

Integration Testing
Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.
The task of the integration test is to check that components or software applications,
e.g. components in a software system or – one step up – software applications at
the company level, interact without error.
Test Results:
All the test cases mentioned above passed successfully. No defects encountered.
Acceptance Testing :
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.
Test Results:
All the test cases mentioned above passed successfully. No defects encountered

CHAPTER-8
CONCLUSION & FUTURE ENHANCEMENTS

CONCLUSION AND FUTURE ENHANCEMENTS:

 We have implemented a CNN-LSTM model for building an Image Caption Generator. A CNN-LSTM
architecture has wide-ranging applications, including use cases in the Computer Vision and
Natural Language Processing domains.
 This model also mitigates the vanishing gradient problem by using ResNet50 for the CNN and
an LSTM instead of a plain RNN.
 We can further extend our model to generate captions in multiple languages instead of
English alone.
 It can be further extended with voice output, so that a given image is automatically
captioned aloud. This can be helpful for blind people, who could then understand a scene
simply by having the image scanned and described.
 We can also train the model on different input datasets, such as COCO or Flickr16K, and
with more epochs, provided sufficient GPU capacity is available.
 Features could also be extracted from other pre-trained models such as VGG16, which may
take considerably longer to train but could give more accurate results.
 Different metrics can be applied to ensure the quality of the output.

CHAPTER-9
LITERATURE SURVEY

1) "Detection and Recognition of Objects in Image Caption Generator System: A Deep Learning
Approach" – N. Komal Kumar, D. Vigneswari, A. Mohan, K. Laxman, J. Yuvaraj – describes the
existing model using a CNN-RNN approach and how it can be used.
2) "Feature Extraction using Convolution Neural Networks (CNN) and Deep Learning" –
Manjunath Jogin, Divya, Mohana, Meghana, Madhulika, Apoorva – discusses in detail the
importance of CNNs, the architecture and types of CNNs, and their methodology.
3) "Show and Tell: A Neural Image Caption Generator" –
Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan – describes the
implementation of the LSTM-based caption generator.
4) "Image Captioning Based on Deep Neural Networks" – Shuang Liu, Liang Bai, Yanli
Hu and Haoran Wang – states briefly the importance of using a CNN together with an LSTM
and how it can be evaluated further.
5) "Image Caption Generator using Big Data and Machine Learning" – Dr. Vinayak D. Shinde,
Mahiman P. Dave, Anuj M. Singh, Amit C. Dubey – describes various datasets such as COCO,
Flickr8K and Flickr16K, the VGG16 model, and the background mechanism of LSTM.

CHAPTER-10
REFERENCES

REFERENCES:

[1] L. Fei-Fei, A. Iyer, C. Koch and P. Perona, "What do we perceive in a glance of a real-world scene?", Journal of Vision, 7(1), 2007, pp. 1–29.

[2] G. Ding, M. Chen, S. Zhao, H. Chen, J. Han and Q. Liu, "Neural Image Caption Generation with Weighted Training and Reference".

[3] V. D. Shinde, M. P. Dave, A. M. Singh and A. C. Dubey, "Image Caption Generator using Big Data and Machine Learning".

[4] M. Jogin, Divya, Mohana, Meghana, Madhulika and Apoorva, "Feature Extraction using Convolution Neural Networks (CNN) and Deep Learning".

[5] O. Vinyals, A. Toshev, S. Bengio and D. Erhan, "Show and Tell: A Neural Image Caption Generator", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.

[6] M. Hodosh, P. Young and J. Hockenmaier, "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics", Journal of Artificial Intelligence Research, vol. 47, 2013, pp. 853–899.

[7] R. Staniute and D. Sesok, "A Systematic Literature Review on Image Captioning", Applied Sciences, vol. 9, no. 10, 16 March 2019; A. Farhadi, M. Hejrati, M. A. Sadeghi and P. Young, "Every Picture Tells a Story: Generating Sentences from Images", ACM Digital Library, 2010.

[8] J. Mao, W. Xu, Y. Yang, J. Wang and A. Yuille, "Explain Images with Multimodal Recurrent Neural Networks", arXiv:1410.1090, 2014.

[9] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg and T. L. Berg, "Baby Talk: Understanding and Generating Simple Image Descriptions", in CVPR, 2011.

[10] Y. LeCun, L. Bottou, Y. Bengio et al., "Gradient-Based Learning Applied to Document Recognition", Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[11] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems, vol. 25, no. 2, 2012.

[12] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory", Neural Computation, 9(8), pp. 1735–1780, 1997.

[13] M. Ranzato et al., "Sequence Level Training with Recurrent Neural Networks", Computer Science, 2015.

[14] K. He et al., "Deep Residual Learning for Image Recognition", IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, pp. 770–778, 2016.

[15] "Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures".