Sample project doc-REC
A Project Report submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
By
K. Bhavani Prasanna, K. Gayatri, K. Dinesh Sai Chandu Shetty, M. Jaya Prakash
Under the guidance of
P. Appala Naidu
Associate Professor
CERTIFICATE
This is to certify that this project entitled “Image Caption Generation Using Deep Learning”, done by K. Bhavani Prasanna, K. Gayatri, K. Dinesh Sai Chandu Shetty and M. Jaya Prakash, bearing Regd. Nos. 17981A0577, 17981A0595, 17981A0582 and 17981A05A2, students of B.Tech in the Department of Computer Science and Engineering, Raghu Engineering College, during the period 2020-2021, in partial fulfillment for the award of the Degree of Bachelor of Technology in Computer Science & Engineering of Jawaharlal Nehru Technological University, Kakinada, is a record of bonafide work carried out under my guidance and supervision.
The results embodied in this project report have not been submitted to any other University
or Institute for the award of any Degree.
External Examiner
DISSERTATION APPROVAL SHEET
This is to certify that the dissertation titled
By
Internal Examiner
External Examiner
HOD
Date
DECLARATION
This is to certify that this project titled “Image Caption Generation Using Deep Learning” is a bonafide work done by our team, in partial fulfillment of the requirements for the award of the degree B.Tech., and submitted to the Department of Computer Science & Engineering, Raghu Engineering College, Dakamarri.
We also declare that this project is a result of our own effort, that it has not been copied from anyone, and that we have taken only citations from the sources mentioned in the references.
This work was not submitted earlier to any other University or Institute for the award of any degree.
Place:
Date:
NAME:
K. BHAVANI PRASANNA
K. GAYATRI
K. DINESH SAI CHANDU SHETTY
M. JAYA PRAKASH
ACKNOWLEDGEMENT
We express our sincere gratitude to our esteemed institute “Raghu Institute of Technology”, which has provided us with an opportunity to fulfil our most cherished desire of reaching our goal.
We take this opportunity with great pleasure to put on record our ineffable personal
indebtedness to Sri Raghu Kalidindi, Chairman of Raghu Institute of Technology for
providing necessary departmental facilities.
We would like to thank the Principal Dr. Ch. Srinivasu, Dr. A. Vijay Kumar, Dean, Planning & Development, Dr. E.V.V. Ramanamurthy, Controller of Examinations, and the Management of “Raghu Engineering College” for providing the requisite facilities to carry out the project on the campus.
Our sincere thanks to Dr. P. Appala Naidu, HoD, Department of Computer Science and Engineering, Raghu Engineering College, for his kind support in the successful completion of this work.
We sincerely express our deep sense of gratitude to Dr. P. Appala Naidu, Professor, Department of Computer Science and Engineering, Raghu Engineering College, for his
perspicacity, wisdom, and sagacity coupled with compassion and patience. It is our great
pleasure to submit this work under his wing.
We are thankful to the non-teaching staff of the Department of Computer Science and
Engineering, Raghu Engineering College, for their inexpressible support.
Regards
K. Bhavani Prasanna (17981A0577)
K. Gayatri (17981A0595)
K. Dinesh Sai Chandu Shetty (17981A0582)
M. Jaya Prakash (17981A05A2)
ABSTRACT
Image caption generation involves recognizing the context of an image and describing it in a natural language such as English. This project involves image detection, image recognition and NLP for generating captions using deep learning.
In this Python project, we implement the system using a CNN (Convolutional Neural Network) and an LSTM (Long Short-Term Memory network). The CNN works as an encoder to extract features from images, and the LSTM, a form of RNN, works as a decoder to generate words. The image features are extracted with ResNet, a CNN model trained on the ImageNet dataset, and are then fed into the LSTM model, which is responsible for generating the image captions.
DATA SET : Flickr_8K
TABLE OF CONTENTS
CONTENT PAGE NUMBER
Certificate i
Dissertation Approval Sheet ii
Declaration iii
Acknowledgement iv
Abstract v
Contents vi
List of figures viii
CHAPTER 1: INTRODUCTION 1
1.1 Purpose 2
1.2 Scope 2
1.3 Motivation 2
1.4 Fundamental concepts 2
1.4.1 AI 2
1.4.2 Machine Learning 4
1.4.3 Deep Learning 5
1.5 Proposed System Fundamentals 6
1.6 Proposed Algorithms 6
1.6.1 RmsProp Optimizer Algorithm 6
4.2.2 LSTM 18
4.2.3 Dataset 21
4.2.4 Transfer Model 21
4.3 UML Diagrams 23
CHAPTER 5 : IMPLEMENTATION 27
5.1 Technology Description 28
5.2 Sample code 29
CHAPTER 6 : SCREEN SHOTS 34
CHAPTER 7 : SYSTEM TESTING 41
7.1 Introduction to Testing 42
7.2 Types of Testing
LIST OF FIGURES
FIGURE PAGE NO
Fig 1.1 Introduction 1
Fig 1.3 Motivation 2
Fig 4.3.3 State Chart Diagram 25
CHAPTER-1
INTRODUCTION
1. Introduction
Image captioning is one of the challenges in AI. It involves two tasks: one is understanding the content of an image and the other is generating a caption for that image. Just as humans are able to detect, identify and describe images, a computer should also be able to perform these tasks.
1.1 Purpose
The main challenge is to create a model so that a computer can give captions for an image, just as we humans are able to describe an image once we see it and know about it.
1.2 Scope
To develop and train a model that can predict captions for a given image using deep learning approaches.
1.3 Motivation
Look around and you will see a lot of day-to-day objects. You recognize them almost instantaneously and involuntarily. You don't have to wait for a few minutes after looking at a table to understand that it is in fact a table.
1.3.1 How Humans Do It?
Visual processing happens in the ventral visual stream, a hierarchy of areas in the brain that helps in object recognition. Whenever we look at an object, our brain extracts its features in such a way that size, orientation, illumination, perspective, etc. do not matter. You remember an object by its shape and inherent features; it does not matter how the object is placed, how big or small it is, or which side is visible to you. Information processing here involves neurons along the visual stream, which helps us detect and recognize the object.
1.3.2 How Machines Can Do It?
This is not an easy task for machines, because they must extract the features of an object whatever its size and still be able to detect and identify it. But with the advancement of technology in the field of Artificial Intelligence, it has become possible. Technologies such as Deep Learning, Neural Networks and NLP provide algorithms for image detection, image recognition and language processing. The model we designed uses a CNN for image feature extraction and an LSTM for generating language for the given image.
and perform human-like tasks.
To understand deep learning, imagine a toddler whose first word is dog. The toddler learns
what a dog is -- and is not -- by pointing to objects and saying the word dog. The parent says,
"Yes, that is a dog," or, "No, that is not a dog." As the toddler continues to point to objects, he
becomes more aware of the features that all dogs possess. What the toddler does, without
knowing it, is clarify a complex abstraction -- the concept of dog -- by building a hierarchy in
which each level of abstraction is created with knowledge that was gained from the preceding
layer of the hierarchy.
How deep learning works
Computer programs that use deep learning go through much the same process as the toddler
learning to identify the dog. Each algorithm in the hierarchy applies a nonlinear transformation
to its input and uses what it learns to create a statistical model as output. Iterations continue
until the output has reached an acceptable level of accuracy. The number of processing layers
through which data must pass is what inspired the label deep. Object recognition, speech
recognition, and language translation are some of the tasks performed through deep learning.
Neural networks come in several different forms, including Recurrent Neural Networks,
Convolutional Neural Networks, Artificial Neural Networks and Feed Forward Neural
Networks, and each has benefits for specific use cases. However, they all function in somewhat
similar ways -- by feeding data in and letting the model figure out for itself whether it has made
the right interpretation or decision about a given data element.
NLP
1.6 Proposed Algorithms
1.6.1 RMSprop Optimizer Algorithm
RMSprop deals with the above issue by using a moving average of squared gradients to normalize the gradient. This normalization balances the step size (momentum), decreasing the step for large gradients to avoid exploding and increasing the step for small gradients to avoid vanishing.
Simply put, RMSprop uses an adaptive learning rate instead of treating the learning rate as a hyperparameter. This means that the learning rate changes over time.
v_dW = β·v_dW + (1 − β)·(dW)²
v_db = β·v_db + (1 − β)·(db)²
W = W − α·dW / √(v_dW + ε)
b = b − α·db / √(v_db + ε)
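For illustration only (this is not the project's training code), a minimal NumPy sketch of this update rule, where W is a weight matrix, dW its gradient, beta the decay rate, alpha the learning rate and eps a small stabilizing constant:

import numpy as np

def rmsprop_step(W, dW, v_dW, alpha=0.001, beta=0.9, eps=1e-8):
    # keep a moving average of squared gradients and divide the gradient
    # by its root before applying the step (adaptive learning rate)
    v_dW = beta * v_dW + (1 - beta) * dW ** 2
    W = W - alpha * dW / (np.sqrt(v_dW) + eps)
    return W, v_dW

# toy usage: parameters with very different gradient scales get comparable steps
W = np.array([1.0, 1.0])
v = np.zeros_like(W)
for _ in range(10):
    dW = np.array([100.0, 0.01])      # one large and one small gradient
    W, v = rmsprop_step(W, dW, v)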
CHAPTER-2
SYSTEM ANALYSIS
2.1 Existing System
The steps for generating captions with object detection and feature extraction using
neural networks are as follows.
Step 1: Object detection. In this step, the objects in the input image are detected using the R-CNN region-proposal approach.
Step 2: Feature extraction. In this step, the features in the image are extracted using principal component analysis with NumPy. A CNN is used for scene classification and an RNN is used for detecting objects and human attributes.
Step 3: Creating attributes. In this step, the features extracted by the neural networks are used to define the attributes with their label strings.
Step 4: Encoder and decoder. In this step, the label strings are fed to an encoder RNN that encodes them into a proper format, and the resulting variable-length string is passed to a fixed-length decoder that converts it into a fixed-length descriptive sentence.
The model takes an image as input; feature extraction is done with a CNN and the result is passed to the language model, where an LSTM generates the caption, using beam search to pick the most accurate result.
CHAPTER-3
SYSTEM REQUIREMENTS SPECIFICATIONS
3.1 Software Requirements
Operating System : Windows
Programming Language : Python 3.6 or above
Environment : Kaggle Notebook or Google Colab
CHAPTER-4
SYSTEM DESIGN
4.1 System Model
Method : Deep Model
Type : CNN + LSTM
Dataset Used: Flickr8K Dataset
Pre-Trained Model Used: ResNet50
Fig 4.1
4.2 SYSTEM ARCHITECTURE
rows are filled.
Fig 4.2.1.1
Step 1b : ReLU Layer
The Rectified Linear Unit (ReLU) is not a separate component of the convolutional network process; it is the most commonly used activation function. The main purpose of applying the rectifier function is to increase the non-linearity of the images. The function returns 0 if the input is negative, otherwise it returns the input value back. It is written as
f(x) = max(0,x)
Fig 4.2.1b
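As a small illustrative sketch (not part of the project code), the rectifier can be applied element-wise to a feature map with NumPy:

import numpy as np

feature_map = np.array([[-2.0, 3.0], [0.5, -1.5]])   # example convolution output
relu_out = np.maximum(0, feature_map)                # f(x) = max(0, x), element-wise
print(relu_out)                                      # [[0.  3. ] [0.5 0. ]]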
Step 2: Pooling
Fig 4.2.1.2
The process of filling in a pooled feature map differs from the one used to create the regular feature map. This time you place a 2x2 box at the top-left corner and move it along the row. For every 4 cells the box covers, you find the maximum numerical value and insert it into the pooled feature map. In the example above, the box currently contains a group of cells whose maximum value is 4.
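A minimal NumPy sketch of this 2x2 max-pooling step with stride 2 (illustrative only; the project itself relies on a pre-trained CNN):

import numpy as np

def max_pool_2x2(fmap):
    # slide a 2x2 window with stride 2 and keep the maximum of each window
    h, w = fmap.shape
    pooled = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            pooled[i // 2, j // 2] = fmap[i:i + 2, j:j + 2].max()
    return pooled

fmap = np.array([[1, 0, 2, 3],
                 [4, 6, 6, 8],
                 [3, 1, 1, 0],
                 [1, 2, 2, 4]], dtype=float)
print(max_pool_2x2(fmap))   # [[6. 8.] [3. 4.]]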
Step 3: Flattening
Before the pooled feature map can be fed into an artificial neural network, it must be flattened into a single vector.
Fig 4.2.1.3
Step 4: Full Connection
Fig 4.2.1.4
This full connection works as follows:
The neuron in the fully-connected layer detects a certain feature, say a nose, and preserves its value.
It communicates this value to both the "dog" and the "cat" classes.
Both classes check the feature and decide whether it is relevant to them.
Fig 4.2.1 : CNN architecture
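Putting the four steps together, a minimal Keras sketch of a convolution → ReLU → pooling → flattening → full-connection pipeline for a two-class (e.g. dog/cat) problem is shown below. The layer sizes are illustrative assumptions only, not the configuration used in this project, which relies on a pre-trained ResNet50 instead:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

cnn = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),  # convolution + ReLU
    MaxPooling2D(pool_size=(2, 2)),                                  # pooling
    Flatten(),                                                       # flattening
    Dense(128, activation='relu'),                                   # full connection
    Dense(2, activation='softmax')                                   # e.g. "dog" vs "cat"
])
cnn.summary()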
4.2.2 Long Short Term Memory(LSTM)
A Long Short-Term Memory network is an advanced RNN, a sequential network that allows information to persist. It is capable of handling the vanishing-gradient problem faced by RNNs.
Architecture:
It has 3 parts
1. Forget Gate
2. Input Gate
3. Output Gate as shown below
i) Forget Gate :
This gate decides whether the information from the previous timestamp should be kept or forgotten.
Output Gate:
Based on the current expectation, we have to give a relevant word to fill in the blank. That word is our output, and producing it is the function of the output gate.
Calculating the hidden state:
The hidden state is a function of the long-term memory (Ct) and the output gate. To obtain the output of the current timestamp, apply the SoftMax activation to the hidden state Ht:
softmax(x) = exp(x) / tf.reduce_sum(exp(x))
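A small NumPy sketch of this last step, assuming the output-gate activation o_t and the cell state C_t have already been computed (illustrative values only):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # exp(x) / sum(exp(x)), shifted for numerical stability
    return e / e.sum()

o_t = np.array([0.9, 0.2, 0.7])      # output gate activation (assumed)
C_t = np.array([1.5, -0.3, 0.8])     # long-term memory / cell state (assumed)
h_t = o_t * np.tanh(C_t)             # hidden state = output gate * tanh(cell state)
probs = softmax(h_t)                 # word probabilities for the current timestamp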
Fig : LSTM Architecture
4.3 UML Diagrams
UML, which stands for Unified Modeling Language, is a way to visually represent the
architecture, design, and implementation of complex software systems. When you’re
writing code, there are thousands of lines in an application, and it’s difficult to keep track
of the relationships and hierarchies within a software system. UML diagrams divide that
software system into components and subcomponents.
Use Case Diagram
A use case diagram is a visual representation of a set of use cases and actors that may be
used to define the functionality and behaviour of a whole application system. The needs of
a system, including internal and external factors, are gathered via use case diagrams. The majority of these requirements are design-related. As a result, when a system's functionalities are analysed, use cases are prepared and actors are identified.
Activity Diagram
Activity diagrams visualize the steps performed in a use case—the activities can be
sequential, branched, or concurrent. This type of UML diagram is used to show the
dynamic behavior of a system, but it can also be useful in business process modeling.
State Chart Diagram
State diagrams, simply put, depict states and transitions. A state refers to the different
combinations of information that an object can hold, and this UML diagram can visualize
all possible states and the way the object transitions from one state to the next.
Flow Chart Diagram
A flowchart is a diagram that depicts the steps in a process. It began as a tool for
expressing algorithms and programming logic in computer science, but it has now
expanded to include all sorts of processes. Flowcharts are now widely used for presenting data and aiding thinking.
CHAPTER-5
IMPLEMENTATION
5.1 Technology Description
The different phases involved in image caption generation are as follows:
Step 01 : Preprocessing of data
This involves image and text preprocessing for the Flickr8K dataset.
Step 02 : Image detection
Features are extracted from a pre-trained model called ResNet50, and image classification is done with the CNN. The class with the highest probability is selected; the model is trained with the RMSprop optimizer.
Step 03 : Sentence generation
This is done by the LSTM, which accepts the image features and generates a caption for them using a softmax activation layer.
Step 04 : Evaluation
We use accuracy as the metric. Captions can be generated greedily with argmax or with beam search, and the caption with the highest score is returned, as sketched below.
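For illustration, a compact sketch of beam-search decoding (not the project's exact code) might look as follows, assuming a helper predict_fn(seq) that returns the next-word probability distribution for a partial caption, and integer ids for the start and end tokens; these names are assumptions of the sketch:

import numpy as np

def beam_search(predict_fn, start_id, end_id, beam_width=3, max_len=25):
    # keep the beam_width most probable partial captions at every step
    beams = [([start_id], 0.0)]                    # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                  # finished captions are kept as they are
                candidates.append((seq, score))
                continue
            probs = predict_fn(seq)                # probability of every vocabulary word
            for w in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(w)], score + np.log(probs[w] + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                             # the most probable caption found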
5.2 Sample Code
import numpy as np
import pandas as pd
import cv2
import os
from glob import glob
from keras.applications import ResNet50
from keras.models import Model
#image Preprocess: load image paths and build a ResNet50 feature extractor
images_path = '../input/flickr8k-sau/Flickr_Data/Images/'
images = glob(images_path + '*.jpg')
len(images)

incept_model = ResNet50(include_top=True)       # pre-trained on ImageNet
last = incept_model.layers[-2].output           # 2048-d feature layer before the classifier
modele = Model(inputs=incept_model.input, outputs=last)
modele.summary()

# extract a 2048-d feature vector per image; this loop is reconstructed, since
# images_features is used below but its construction was not shown in the listing
images_features = {}
for img_path in images:
    img = cv2.imread(img_path)
    img = cv2.resize(img, (224, 224)).reshape(1, 224, 224, 3)
    images_features[img_path.split('/')[-1]] = modele.predict(img).reshape(2048,)
#text preprocess: map each image name to its list of captions
caption_path = '../input/flickr8k-sau/Flickr_Data/Flickr_TextData/Flickr8k.token.txt'
captions = open(caption_path, 'rb').read().decode('utf-8').split('\n')
len(captions)

captions_dict = {}
for i in captions:
    try:
        img_name = i.split('\t')[0][:-2]        # drop the trailing "#n" caption index
        caption = i.split('\t')[1]
        if img_name in images_features:
            if img_name not in captions_dict:
                captions_dict[img_name] = [caption]
            else:
                captions_dict[img_name].append(caption)
    except IndexError:                          # skip malformed or empty lines
        pass
#Visualize a few images together with their captions
import matplotlib.pyplot as plt

for i, k in enumerate(images_features.keys()):
    if i >= 5:
        break
    plt.figure()
    img_name = images_path + k
    img = cv2.imread(img_name)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    plt.imshow(img)
    plt.xlabel(captions_dict[img_name.split('/')[-1]])
def preprocessed(txt):
    modified = txt.lower()
    modified = 'startofseq ' + modified + ' endofseq'
    return modified

for k, v in captions_dict.items():
    for vv in v:
        captions_dict[k][v.index(vv)] = preprocessed(vv)
count_words = {}
for k, vv in captions_dict.items():
    for v in vv:
        for word in v.split():
            # count every occurrence of each word
            count_words[word] = count_words.get(word, 0) + 1
len(count_words)
THRESH = -1
count = 1
new_dict = {}
for k, v in count_words.items():
    if count_words[k] > THRESH:
        new_dict[k] = count
        count += 1
new_dict['<OUT>'] = len(new_dict)
captions_backup = captions_dict.copy()
captions_dict = captions_backup.copy()

for k, vv in captions_dict.items():
    for v in vv:
        encoded = []
        for word in v.split():
            if word not in new_dict:
                encoded.append(new_dict['<OUT>'])
            else:
                encoded.append(new_dict[word])
        captions_dict[k][vv.index(v)] = encoded
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 0
for k, vv in captions_dict.items():
    for v in vv:
        if len(v) > MAX_LEN:
            MAX_LEN = len(v)
            print(v)

Batch_size = 5000
VOCAB_SIZE = len(new_dict)
def generator(photo, caption):
    n_samples = 0
    X = []
    y_in = []
    y_out = []
    for k, vv in caption.items():
        for v in vv:
            for i in range(1, len(v)):
                X.append(photo[k])               # image feature vector
                # partial caption padded to MAX_LEN (padding reconstructed;
                # the original listing left the sequence unpadded)
                in_seq = pad_sequences([v[:i]], maxlen=MAX_LEN, padding='post', truncating='post')[0]
                out_seq = v[i]                   # next word to predict
                y_in.append(in_seq)
                y_out.append(out_seq)
    X = np.array(X)
    y_in = np.array(y_in, dtype='float64')
    y_out = np.array(y_out, dtype='float64')
    return X, y_in, y_out

X, y_in, y_out = generator(images_features, captions_dict)
X.shape, y_in.shape, y_out.shape
MODEL
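The model definition itself is not reproduced in the listing above. As a minimal sketch only, a typical CNN-encoder plus LSTM-decoder of the kind described in Chapter 4, assuming the 2048-dimensional ResNet50 features and the MAX_LEN and VOCAB_SIZE computed earlier, could look like this (layer sizes are assumptions, not the exact configuration used for the screenshots):

from keras.models import Sequential, Model
from keras.layers import Dense, RepeatVector, Embedding, LSTM, TimeDistributed, Activation, Concatenate

embedding_size = 128

image_model = Sequential([
    Dense(embedding_size, input_shape=(2048,), activation='relu'),  # project ResNet features
    RepeatVector(MAX_LEN)                                           # repeat for every timestep
])

language_model = Sequential([
    Embedding(input_dim=VOCAB_SIZE, output_dim=embedding_size, input_length=MAX_LEN),
    LSTM(256, return_sequences=True),
    TimeDistributed(Dense(embedding_size))
])

merged = Concatenate()([image_model.output, language_model.output])  # image + text features
x = LSTM(128, return_sequences=True)(merged)
x = LSTM(512, return_sequences=False)(x)
out = Activation('softmax')(Dense(VOCAB_SIZE)(x))                    # next-word distribution

model = Model(inputs=[image_model.input, language_model.input], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()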
def getImage(x):
    # function header and resizing reconstructed; the original listing began mid-function
    test_img_path = images[x]
    test_img = cv2.imread(test_img_path)
    test_img = cv2.cvtColor(test_img, cv2.COLOR_BGR2RGB)
    test_img = cv2.resize(test_img, (224, 224)).reshape(1, 224, 224, 3)
    return test_img

inv_dict = {v: k for k, v in new_dict.items()}       # index -> word (assumed; not shown above)

for i in range(5):
    no = np.random.randint(1500, 7000, (1, 1))[0, 0]
    test_feature = modele.predict(getImage(no)).reshape(1, 2048)
    test_img_path = images[no]
    test_img = cv2.imread(test_img_path)
    test_img = cv2.cvtColor(test_img, cv2.COLOR_BGR2RGB)

    text_inp = ['startofseq']
    count = 0
    caption = ''
    while count < 25:
        count += 1
        encoded = [new_dict[w] for w in text_inp]
        encoded = pad_sequences([encoded], maxlen=MAX_LEN, padding='post')
        # greedy decoding: pick the most probable next word
        # (the predict/argmax step is assumed; it was not shown in the original listing)
        prediction = np.argmax(model.predict([test_feature, encoded]))
        sampled_word = inv_dict[prediction]
        if sampled_word == 'endofseq':
            break
        caption = caption + ' ' + sampled_word
        text_inp.append(sampled_word)

    plt.figure()
    plt.imshow(test_img)
    plt.xlabel(caption)
CHAPTER-6
SCREEN SHOTS
IMAGE
MODEL
Check whether the model is fitted or not; this can be verified by training it for a number of epochs.
Epochs :
An epoch means training the neural network with all the training data for one cycle; in an epoch, we use all of the data exactly once. A forward pass and a backward pass together are counted as one pass. An epoch is made up of one or more batches, where we use a part of the dataset to train the neural network.
The screenshot below shows a sample of the epochs. We trained for 50 epochs with batch size 512, recording the categorical loss and accuracy for each epoch.
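For illustration, such a training run with Keras could be issued as below; the variable names follow the model and generator sketches in Section 5.2 and are assumptions, not the exact call used to produce the screenshots:

from keras.utils import to_categorical

# one epoch = one full pass over the training data, split into batches of 512
y_out_onehot = to_categorical(y_out, num_classes=VOCAB_SIZE)   # one-hot encode the next words
model.fit([X, y_in], y_out_onehot, epochs=50, batch_size=512)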
PREDICTION
Evaluation :
Normally, by selecting a particular decoding approach for the LSTM (beam search or greedy), you get output as above using np.argmax, which picks the word with the highest probability. If you want a more rigorous check, you can use metrics such as the BLEU score (Bilingual Evaluation Understudy) – BLEU-1, 2, 3, 4 – as well as METEOR and ROUGE.
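A small example of computing BLEU scores with NLTK, using a hypothetical reference caption and generated caption (illustrative only):

from nltk.translate.bleu_score import sentence_bleu

reference = [['a', 'dog', 'is', 'running', 'on', 'the', 'grass']]      # reference caption(s)
candidate = ['a', 'dog', 'runs', 'on', 'the', 'grass']                 # generated caption
bleu1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))      # BLEU-1 (unigram overlap)
bleu2 = sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0))  # BLEU-2
print(bleu1, bleu2)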
CHAPTER-7
TESTING
7.1 Introduction to Testing
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, subassemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests; each test type addresses a specific testing requirement.
7.2 Types of Testing
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches
and internal code flow should be validated. It is the testing of individual software units of the application; it is done after the completion of an individual unit and before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit
tests perform basic tests at component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path of a
business process performs accurately to the documented specifications and contains clearly
defined inputs and expected results.
Integration testing
Integration tests are designed to test integrated software components to determine if they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of
components is correct and consistent. Integration testing is specifically aimed at exposing
the problems that arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that the functions tested are available as specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
• Valid Input : identified classes of valid input must be accepted.
• Invalid Input : identified classes of invalid input must be rejected.
• Functions : identified functions must be exercised.
• Output : identified classes of application outputs must be exercised.
• Systems/Procedures: interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key functions, or special test cases. In addition, systematic coverage pertaining to identified business process flows, data fields, predefined processes, and successive processes must be considered for testing. Before functional testing is complete, additional tests are identified and the effective value of current tests is determined.
System Testing
System testing ensures that the entire integrated software system meets requirements. It
tests a configuration to ensure known and predictable results. An example of system testing
is the configuration oriented system integration test. System testing is based on process
descriptions and flows, emphasizing pre-driven process links and integration points.
White Box Testing
White Box Testing is testing in which the software tester has knowledge of the inner workings, structure and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a black-box level.
Black Box Testing
Black Box Testing is testing the software without any knowledge of the inner workings, structure or language of the module being tested. Black box tests, like most other kinds of tests, must be written from a definitive source document, such as a specification or requirements document. It is testing in which the software under test is treated as a black box: you cannot "see" into it. The test provides inputs and responds to outputs without considering how the software works.
Unit Testing
Unit testing is usually conducted as part of a combined code and unit test phase of the software lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct phases.
Test strategy and approach: Field testing will be performed manually and functional tests will be written in detail.
Test objectives:
• All field entries must work properly.
• Pages must be activated from the identified link.
• The entry screen, messages and responses must not be delayed.
Features to be tested:
• Verify that the entries are of the correct format.
• No duplicate entries should be allowed.
• All links should take the user to the correct page.
Integration Testing
Software integration testing is the incremental integration testing of two or more integrated software components on a single platform to produce failures caused by interface defects. The task of the integration test is to check that components or software applications, e.g. components in a software system or – one step up – software applications at the company level – interact without error.
Test Results:
All the test cases mentioned above passed successfully. No defects encountered.
Acceptance Testing :
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.
Test Results:
All the test cases mentioned above passed successfully. No defects encountered
CHAPTER-8
CONCLUSION & FUTURE ENHANCEMENTS
CONCLUSION AND FUTURE ENHANCEMENTS:
CHAPTER-9
LITERATURE SURVEY
1) "Detection and Recognition of Objects in Image Caption Generator System: A Deep Learning Approach" by N. Komal Kumar, D. Vigneswari, A. Mohan, K. Laxman and J. Yuvaraj describes the existing model using a CNN-RNN approach and how it can be used.
2) "Feature Extraction using Convolution Neural Networks (CNN) and Deep Learning" by Manjunath Jogin, Divya, Mohana, Meghana, Madhulika and Apoorva discusses in detail the importance of CNNs, the architecture and types of CNNs, and their methodology.
3) "Show and Tell: A Neural Image Caption Generator" by Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan describes the implementation of the LSTM-based caption generator.
4) "Image Captioning Based on Deep Neural Networks" by Shuang Liu, Liang Bai, Yanli Hu and Haoran Wang states in brief the importance of using a CNN with an LSTM and how it can be evaluated further.
5) "Image Caption Generator using Big Data and Machine Learning" by Dr. Vinayak D. Shinde, Mahiman P. Dave, Anuj M. Singh and Amit C. Dubey describes various datasets and models such as COCO, Flickr8K, Flickr16K and VGG16, and also the LSTM background mechanism.
CHAPTER-10
REFERENCES
REFERENCES:
[2] Neural Image Caption Generation with Weighted Training and Reference.
[14] He, Kaiming, et al. "Deep Residual Learning for Image Recognition." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, pp. 770-778, 2016.