
Team 21 Omkar Reddy Gojala Mrinalini Injeti Ramakanth

The document describes a project to generate descriptive captions for images using neural networks. The team used the Flickr8K dataset to train an encoder-decoder model with an InceptionV3 CNN encoder and LSTM decoder. The model was evaluated using BLEU scores, with the LSTM achieving a higher score than RNN. Examples of accurate, funny, and incorrect predictions are provided. Future work ideas involve using more data, visual attention techniques, and an app to aid the visually impaired.


Team 21

Omkar Reddy Gojala


Mrinalini Injeti Ramakanth
 The goal is to generate a descriptive sentence for a given image
 The project was inspired by the works of Andrej Karpathy and Marc Tanti et al. (2017)

[Diagram: image of two dogs → Neural Network → "Two dogs are wrestling in the grass"]

 Potential applications:
 Aiding the visually impaired
 Generating video summaries from individual frames
 We used the Flickr8K dataset for this project

 The Flickr8K dataset contains a variety of images depicting scenes and situations

 The dataset consists of 8000 images, and each image has 5 corresponding descriptions

 We split the data into 6000, 1000, and 1000 images as training, validation, and testing sets respectively

 The images are of different dimensions

Example captions for one image:
• A man riding his bike on a hill
• A man with helmet and backpack standing on dirt bike in a hilly grassy area
• A person rides a motorbike through a grassy field
• Man on motorcycle riding in dry field wearing a helmet and backpack
• The biker is riding through a grassy plain
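The official Flickr8K release ships predefined split files, but the 6000/1000/1000 partition described above can be sketched as a simple shuffle; `split_dataset` and the `img_*.jpg` filenames are hypothetical helpers for illustration:

```python
import random

def split_dataset(image_ids, seed=42):
    """Shuffle 8000 Flickr8K image ids and split into 6000/1000/1000."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    return ids[:6000], ids[6000:7000], ids[7000:8000]

train, val, test = split_dataset([f"img_{i}.jpg" for i in range(8000)])
```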
 Each description is tokenized and converted to lowercase
 Removed non-alphabetic characters and punctuation marks
 We use startseq and endseq as prefix and postfix for each caption respectively
 Filtered out the unique words from the corpus and represented each word by an integer
 To generate fixed-length input sequences, we calculated the length of the longest caption
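The caption-cleaning steps above can be sketched as follows; `clean_caption` and `build_vocab` are hypothetical helper names, not from the slides:

```python
import string

def clean_caption(caption):
    """Lowercase, strip punctuation, keep alphabetic words, add start/end tokens."""
    words = caption.lower().split()
    table = str.maketrans("", "", string.punctuation)
    words = [w.translate(table) for w in words]
    words = [w for w in words if w.isalpha()]  # drops numbers and empty leftovers
    return "startseq " + " ".join(words) + " endseq"

def build_vocab(captions):
    """Map each unique word in the cleaned corpus to an integer (0 is padding)."""
    vocab = sorted({w for c in captions for w in c.split()})
    return {w: i + 1 for i, w in enumerate(vocab)}

cleaned = clean_caption("A man riding his bike on a hill.")
# → "startseq a man riding his bike on a hill endseq"
```

The fixed sequence length is then simply `max(len(c.split()) for c in captions)` over the cleaned corpus.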
 Resized all images to a fixed size of 299x299x3 using OpenCV
 Employed transfer learning using the pre-trained InceptionV3 CNN model to encode images
 We removed the last softmax layer from the InceptionV3 network to extract a 2048-length image vector
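A minimal sketch of the encoding step, assuming TensorFlow/Keras (the framework is not named in the slides). In practice `weights="imagenet"` would be used; `weights=None` here only avoids the download:

```python
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input

# include_top=False with global average pooling drops the classification
# (softmax) head and yields the 2048-length feature vector directly.
encoder = InceptionV3(weights=None, include_top=False, pooling="avg",
                      input_shape=(299, 299, 3))

image = np.random.rand(1, 299, 299, 3).astype("float32")  # stand-in for a resized photo
features = encoder.predict(preprocess_input(image * 255.0), verbose=0)
# features has shape (1, 2048)
```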
 For each image, we train the model by temporally injecting incremental sequences of the description
 In this phase, we essentially create the labels in our training data

Image    Partial Caption                                                      Target Word
Image    startseq                                                             a
Image    startseq a                                                           young
Image    startseq a young                                                     boy
……       ……                                                                   ……
Image    startseq a young boy wearing a helmet and riding a bike in a park    endseq
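The expansion shown in the table can be sketched as below; `make_training_pairs` is a hypothetical helper, and sequences are left-padded with 0 to the fixed length:

```python
def make_training_pairs(caption, word_index, max_len):
    """Expand one caption into (partial sequence, target word) training pairs."""
    seq = [word_index[w] for w in caption.split() if w in word_index]
    pairs = []
    for i in range(1, len(seq)):
        partial = seq[:i]
        padded = [0] * (max_len - len(partial)) + partial  # left-pad to max_len
        pairs.append((padded, seq[i]))
    return pairs

word_index = {"startseq": 1, "a": 2, "young": 3, "boy": 4, "endseq": 5}
pairs = make_training_pairs("startseq a young boy endseq", word_index, max_len=6)
# first pair: ([0, 0, 0, 0, 0, 1], 2)  i.e. "startseq" -> "a"
```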
 We used an encoder-decoder architecture

 The 2048-length image vector is fed to a Dense layer to generate a 256-length image vector

 The 34-length word sequence is fed to an LSTM/RNN to output a 256-length word vector

 The decoder model adds both encoder outputs and feeds the result to a Dense 256 layer

 The last Dense layer has as many nodes as the vocabulary size

 The final softmax layer predicts the next word from the output vocabulary
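The architecture above can be sketched in Keras (assumed framework); `vocab_size` here is an illustrative placeholder, since the actual vocabulary size depends on the corpus:

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 7579, 34  # vocab_size is illustrative only

# Image branch: 2048-length vector -> 256-length vector
img_in = Input(shape=(2048,))
img_vec = Dense(256, activation="relu")(img_in)

# Text branch: 34-step word sequence -> 256-length vector
seq_in = Input(shape=(max_len,))
seq_emb = Embedding(vocab_size, 256, mask_zero=True)(seq_in)  # 0 = padding
seq_vec = LSTM(256)(seq_emb)

# Merge by element-wise addition, then predict the next word
merged = add([img_vec, seq_vec])
hidden = Dense(256, activation="relu")(merged)
out = Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[img_in, seq_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```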
 The caption is predicted word by word
 The image is fed along with the first word (startseq) to the RNN to predict the second word
 Then the same image along with the first word + second word is fed to the RNN to predict the third word, and so on until the last word (endseq) is encountered

[Diagram: Neural Network model predicting the next word at each step]

Step    Input Sequence        Target Word
(i=0)   startseq              little
(i=1)   startseq little       boy
(i=2)   …..                   ….
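The word-by-word loop can be sketched as follows; `predict_next` is a hypothetical stand-in for the trained model's argmax over the vocabulary (the real call also takes the encoded image features):

```python
def greedy_caption(predict_next, max_len=34):
    """Generate a caption word by word until endseq or the length cap."""
    words = ["startseq"]
    for _ in range(max_len):
        nxt = predict_next(words)
        words.append(nxt)
        if nxt == "endseq":
            break
    if words[-1] == "endseq":
        words = words[:-1]  # strip the end marker if the model emitted it
    return " ".join(words[1:])  # drop the start marker

# Toy stand-in that replays a fixed prediction sequence
script = iter(["little", "boy", "endseq"])
caption = greedy_caption(lambda ws: next(script))
# caption == "little boy"
```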
Bilingual Evaluation Understudy (BLEU) Score
 BLEU is a metric for comparing a generated sentence against a reference sentence
 The BLEU score lies between 0 and 1
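A simplified BLEU-1 (clipped unigram precision with brevity penalty) can be computed as below; libraries such as NLTK implement the full n-gram metric, and `bleu1` is a hypothetical helper name:

```python
import math
from collections import Counter

def bleu1(reference, hypothesis):
    """Simplified BLEU-1: clipped unigram precision times brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    ref_counts, hyp_counts = Counter(ref), Counter(hyp)
    # Clip each hypothesis word's count by its count in the reference
    clipped = sum(min(n, ref_counts[w]) for w, n in hyp_counts.items())
    precision = clipped / len(hyp)
    # Penalize hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * precision

score = bleu1("a boy with a blue helmet is riding a bike",
              "little boy rides bike with helmet")
```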

BLEU N-GRAM    LSTM (Long Short-Term Memory)    Simple RNN (Recurrent Neural Network)
BLEU-1         0.572214                         0.364472
BLEU-2         0.339204                         0.181942
BLEU-3         0.237129                         0.103185
BLEU-4         0.116733                         0.085675
Correct Predictions

Actual Caption: a boy with a blue helmet is riding a bike
Predicted Caption: little boy rides bike with helmet

Actual Caption: white fluffy dog running in the dirt
Predicted Caption: white dog runs across the sand

Actual Caption: a boy dribbles a basketball in the gymnasium
Predicted Caption: boy in white shirt is playing basketball
Funny Predictions

Actual Caption: man fly fishing in a small river with steam in the background
Predicted Caption: Man is swinging on a swing

Actual Caption: a woman wearing a black and white outfit while holding her sunglasses
Predicted Caption: man in pink dress is holding her head

Actual Caption: a group of different people are walking in all different directions in a city
Predicted Caption: group of people walking ocean
Predictions that went really wrong!

Actual Caption: A man wearing a red life jacket is holding a purple rope while waterskiing
Predicted Caption: man in white and white and white shorts leash on swing

Actual Caption: A dog is chewing on a metal pole
Predicted Caption: dog is standing in its mouth is playing

Actual Caption: a young hockey player playing in the ice rink
Predicted Caption: chasing player in motorcycle chasing
FUTURE WORK

 We can enhance the predictions by using more training examples, for example the Flickr30k dataset, which has about 31,000 images

 Implement visual attention techniques, which focus on the interesting parts of the image

 Create an application for the visually impaired that converts the generated caption into voice output
