DOI: 10.5281/zenodo.7923088
1Dr. Dattatray G. Takale, 2Dr. Dattatray S. Galhe, 3Dr. Parishit N. Mahalle, 4Dr. Chitrakant O. Banchhor, 5Prof. Piyush P. Gawali, 6Prof. Gopal Deshmukh, 7Dr. Vajid Khan, 8Prof. Madhuri Karnik
1 Assistant Professor, Department of Computer Engineering, Vishwakarma Institute of Information Technology, SPPU, Pune
2 Associate Professor, Department of Mechanical Engineering, Jaihind College of Engineering, SPPU, Pune
3 Professor and Head, Department of AI & DS, Vishwakarma Institute of Information Technology, SPPU, Pune
4 Assistant Professor, Department of Computer Engineering, Vishwakarma Institute of Information Technology, SPPU, Pune
5 Assistant Professor, Department of Computer Engineering, Vishwakarma Institute of Information Technology, SPPU, Pune
6 Assistant Professor, Department of Computer Engineering, Vishwakarma Institute of Information Technology, SPPU, Pune
7 Associate Professor, Department of Computer Engineering, KJ College of Engineering and Management Research, Pisoli, Pune
8 Assistant Professor, Department of Computer Engineering, Vishwakarma Institute of Information Technology, SPPU, Pune
Email id: dattatray.takale@viit.ac.in
Abstract: In the area of voice-driven image caption generation, the combination of the VGG16 Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks has shown promise. In this study, we present a system that uses this combination to produce captions and audio descriptions for images. High-level features are extracted from the input images with the VGG16 CNN, providing a rich representation of their content. These features are then passed to an LSTM network, which models the sequential structure of language to generate descriptive captions. The well-known Flickr8k dataset, which contains a large collection of photographs paired with human-written captions, serves as the basis for training and evaluating the system. By combining the strengths of the CNN and the LSTM, our method generates accurate and contextually appropriate captions and audio descriptions for a variety of images. Experimental results demonstrate the value of the proposed approach, opening the door to further advances in image captioning and in accessibility for people with visual impairments.
Keywords: Image caption generation, voice synthesis, VGG16, Convolutional Neural Network,
LSTM, Long Short-Term Memory, Flickr8k dataset
INTRODUCTION
Image Caption Generator with Voice is a system that combines computer vision and natural language processing techniques to generate captions for images automatically and then convert them into spoken audio. By offering audio descriptions of visual material, the approach aims to increase accessibility for visually impaired users. The generated text is converted into speech using voice synthesis so that the captions become accessible to people with visual impairments. Text-to-speech algorithms render the captions in a natural, human-sounding voice, allowing users to hear descriptions of the visuals.
The Image Caption Generator with Voice has a number of advantages. First, it bridges the gap between visual and auditory experiences by making it possible for blind users to access and comprehend visual material. Second, because the audio descriptions add context and information about the images, it improves the overall user experience for everyone. It also has applications in a number of domains, including social networking, entertainment, and accessibility technology. The process starts with a computer vision model, typically a convolutional neural network (CNN) such as VGG16 or ResNet. The CNN extracts the key aspects of the image, capturing its visual content and representing it in a compact, informative way. These features are fed into a language model, typically built on recurrent neural networks (RNNs) such as Long Short-Term Memory (LSTM) networks, which generates a written caption from the visual information.
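To make this pipeline concrete, the sketch below shows how a caption can be decoded greedily from a CNN feature vector with a trained LSTM caption model. The names `caption_model`, `tokenizer`, `max_len`, and the `startseq`/`endseq` boundary tokens are hypothetical placeholders for components described later in the paper, not code taken from it.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(caption_model, tokenizer, photo_feature, max_len):
    """Greedy decoding: grow the caption one word at a time."""
    text = "startseq"                          # assumed start-of-sequence token
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        yhat = caption_model.predict([photo_feature, seq], verbose=0)
        word_id = int(np.argmax(yhat))
        word = tokenizer.index_word.get(word_id)
        if word is None or word == "endseq":   # assumed end-of-sequence token
            break
        text += " " + word
    return text.replace("startseq", "").strip()
```

The returned string is what would subsequently be handed to the speech-synthesis stage.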
Individuals with visual impairments face greater difficulty in accessing and comprehending visual information. Combining image caption generation with speech synthesis is a promising way to increase the accessibility and comprehension of visual material. In this study, we present a system that automatically creates captions and audio descriptions for photos by using the VGG16 Convolutional Neural Network (CNN) in conjunction with a Long Short-Term Memory (LSTM) network. Image caption generation produces written descriptions that accurately describe the content and context of a picture, while voice synthesis converts these captions into spoken audio. Our goal is to generate captions and audio descriptions for a wide variety of images that are informative as well as contextually relevant by combining the robust feature extraction of the VGG16 CNN with the sequential modeling and memory-retention capabilities of LSTM networks. The system is trained and evaluated on the "Flickr8k" dataset, a benchmark that is used extensively in the field. Through this work, we aim to make visual material more accessible and inclusive for people with visual impairments, and to contribute to the progress of image captioning methodologies.
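As a small illustration of working with this dataset, the sketch below parses the caption annotations, assuming the commonly distributed Flickr8k token file layout in which each line is "image_name#index<TAB>caption"; the file name is an assumption about the local setup.

```python
from collections import defaultdict

def load_flickr8k_captions(token_path):
    """Parse Flickr8k captions, assuming the 'image.jpg#idx<TAB>caption' layout."""
    captions = defaultdict(list)
    with open(token_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split("\t", 1)
            image_id = image_id.split("#")[0]   # drop the '#0'..'#4' caption index
            captions[image_id].append(caption)
    return captions

# Example usage (path is hypothetical):
# captions = load_flickr8k_captions("Flickr8k.token.txt")
# print(len(captions), "images,", sum(len(v) for v in captions.values()), "captions")
```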
The goal of the Image Caption Generator with Voice system is to automatically produce descriptive captions for images and convert them into spoken audio, in order to make visual information more accessible to visually impaired people and to deepen their comprehension of visual material. The system generates coherent and contextually appropriate captions by combining computer vision methods, such as Convolutional Neural Networks (CNNs), with natural language processing techniques, such as Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks; the latter handle natural language much as CNNs handle images. The end objective is to provide visually impaired users with audio descriptions that faithfully convey the visual content of images, enabling a richer multimedia experience and promoting inclusion. In addition, the system is designed to improve the overall user experience by extending the context of the content being viewed and enriching it with the generated captions and audio descriptions.
Developing an Image Caption Generator with Voice system also comes with several challenges
that need to be addressed:
Accuracy and coherence: Producing captions that are factual, coherent, and that adequately convey the subject matter and setting of a picture remains difficult. It is essential that the automatically produced captions be appropriate to the context and provide a thorough understanding of the picture.
Handling diverse image types: Images differ widely in subject matter, complexity, and style. The system must be robust enough to handle many kinds of photographs, including complicated scenes, abstract artwork, and ambiguous visuals, and produce captions that are acceptable and relevant for each of these varied types of images.
Naturalness of synthesized voice: The quality and naturalness of the synthesized voice
play a significant role in the overall user experience. Ensuring that the audio
descriptions sound natural and human-like is important for engaging the users and
providing an immersive experience.
Handling ambiguous or subjective image interpretations: Some images may have
multiple interpretations or subjective elements. Capturing and conveying such nuances
in the generated captions and audio descriptions can be challenging, as different users
may have different perspectives and preferences.
Scalability and computational efficiency: As the system processes a large number of
images, scalability and computational efficiency become important factors. Optimizing
the algorithms and models to handle large datasets and processing times efficiently is
crucial for real-time or near-real-time performance.
Dataset diversity and bias: The availability of diverse and representative datasets is
essential for training and evaluating the system. Ensuring that the datasets used for
training encompass a wide range of images and perspectives helps reduce biases and
improve the generalization and accuracy of the system.
The remainder of this paper is organized as follows. Section 2 reviews recent strategies for image captioning with voice. Section 3 describes the data used in this study. Section 4 analyzes the results of applying the proposed model to image captioning with voice, and Section 5 concludes the paper.
RELATED WORK
A literature review on the Image Caption Generator with Voice using LSTM and CNN
Algorithms
The study in [1] presents the "Show and Tell" model, a neural image caption generator that combines a convolutional neural network (CNN), used to analyze the image, with a long short-term memory (LSTM) network, used to generate the text. The authors offer an end-to-end trainable model that learns to create captions for photographs using the power of deep learning. The CNN is used to extract image features, and the LSTM constructs the sequence of words that serves as the image description. The model was trained on the Microsoft COCO dataset, which includes a huge number of pictures together with human-written descriptions. The findings suggest that the proposed strategy produces captions that are accurate and relevant for a broad variety of photographs.
In [2], an enhanced model for neural image caption generation is presented, where the improvement comes from incorporating a visual attention mechanism. The authors propose "Show, Attend and Tell", an extension of the "Show and Tell" approach that enables the model to selectively concentrate on different portions of the picture while producing each word of the caption. The attention mechanism directs the model's focus to important regions of the image so that it can match those regions with the corresponding words in the generated caption. The model is trained on the Microsoft COCO dataset, and its performance is evaluated with a variety of metrics, including BLEU, METEOR, and CIDEr. The experiments show that the attention-based model outperforms the baseline in creating captions that are more accurate and contextually appropriate.
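For concreteness, the sketch below illustrates how a sentence-level BLEU score of the kind cited in these studies can be computed for one generated caption against its reference captions. It uses NLTK's implementation with illustrative example sentences; this is one of several possible implementations, not necessarily the evaluation code used in [2].

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference captions and one candidate caption for a single image.
references = [
    "a child in a pink dress is climbing a set of stairs".split(),
    "a little girl climbing the stairs to her playhouse".split(),
]
candidate = "a girl in a pink dress climbing stairs".split()

# BLEU-4 with smoothing, since short captions often lack higher-order n-gram matches.
score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```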
In [3], a framework for producing image descriptions is proposed that brings together visual and semantic representations. The authors describe a deep neural network design that combines a convolutional neural network (CNN) with a long short-term memory (LSTM) network: the CNN encodes the visual information in the picture, and the LSTM generates the sequence of words that describes it. The model is trained on the Microsoft COCO dataset, which contains pictures paired with human-written captions. The authors also introduce a method for aligning the image features with the corresponding word features in the captions, employing a ranking loss to achieve the desired alignment. The experiments show that the proposed technique produces accurate and semantically relevant image descriptions.
The Microsoft COCO (Common Objects in Context) dataset, a large-scale dataset for image captioning, is presented in [4]. The authors outline the approach used to build the dataset, which consisted of collecting photographs from the internet and annotating them with human-written descriptions. The collection includes more than 330,000 photos, each matched with several descriptions, resulting in a total of over 5 million captions. The paper also describes the dataset's evaluation server, which lets researchers submit their generated captions to be scored against the ground-truth captions using BLEU, METEOR, ROUGE, and CIDEr. The Microsoft COCO dataset has become a standard benchmark for image captioning research and, through its widespread use, has been crucial to the field's progress.
By jointly modeling visual embeddings and translation, the authors of [5] propose a model aimed at bridging the gap between videos and language. They present a deep neural network architecture that learns to embed videos into a continuous semantic space and to translate those embeddings into natural-language descriptions. The model consists of a video encoder, a language decoder, and a joint embedding-and-translation module. The video encoder extracts visual features from the input videos, and the language decoder generates the written descriptions. The joint embedding-and-translation module aligns the video and language representations, bridging the two modalities. The proposed model is evaluated on the Microsoft Research Video Description (MSVD) dataset and obtains results competitive with state-of-the-art video description approaches.
In [18], the authors learn visual concepts from images and their captions and generate new captions based on the learned concepts, demonstrating the bidirectional relationship between images and captions.
There are several areas in which research on image caption generation with deep learning is still lacking. First, more effective ways of fusing visual and textual information in multimodal models are needed, because present approaches have limits in capturing the full complexity of both modalities. Second, relatively little attention has been paid to real-time applications, which require captions to be generated quickly and efficiently; models that can caption in real time would be of great use for live video captioning and augmented reality. In addition, evaluations on a wider variety of datasets are needed to better understand the generalization capabilities of captioning models and their effectiveness in real-world settings. The incorporation of contextual information, and the consideration of bias and fairness in caption generation, are further topics that call for more investigation. Addressing these research gaps would help increase the accuracy, relevance, and fairness of image captions produced by deep learning models.
PROPOSED WORK
In the proposed system, a text-to-speech (TTS) engine is used in conjunction with a convolutional neural network (CNN) and a long short-term memory (LSTM) network to create both text and audio captions for images. The architecture of the proposed system comprises image augmentation, feature extraction, text pre-processing, caption generation, and audio description generation.
Image Augmentation: The image dataset is first augmented so that it becomes more comprehensive and the generalization capacity of the model is enhanced. Several augmentation methods, including rotation, cropping, and flipping, are applied to generate variants of the original photos.
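The paper does not prescribe a specific augmentation library; a minimal sketch using Keras's ImageDataGenerator with rotation, flipping, and zoom (standing in for cropping) could look like the following. The directory path and parameter values are assumptions.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rotation and horizontal flips as described above; zoom approximates random cropping.
augmenter = ImageDataGenerator(
    rotation_range=20,        # rotate up to +/-20 degrees
    horizontal_flip=True,     # mirror images left-right
    zoom_range=0.15,          # zoom in/out, approximating random crops
    fill_mode="nearest",
)

# Example: stream augmented 224x224 images from a folder (path is hypothetical).
# batches = augmenter.flow_from_directory("data/flickr8k_images",
#                                         target_size=(224, 224), batch_size=32)
```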
Feature Extraction: The augmented images are then passed through a pre-trained CNN model, in this case VGG16, to extract high-level visual features. The CNN captures the visual representation of each picture, and these features serve as input to the caption generation stage.
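A minimal sketch of this step with Keras's pre-trained VGG16 is shown below. It takes the 4096-dimensional activations of the second-to-last fully connected layer as the image feature, which is a common choice but an assumption on our part, since the paper does not state which layer is used.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Drop the final classification layer; keep the 4096-d fc2 activations as features.
vgg = VGG16(weights="imagenet")
feature_extractor = Model(inputs=vgg.input, outputs=vgg.layers[-2].output)

def extract_features(image_path):
    """Return a (1, 4096) feature vector for one image."""
    img = load_img(image_path, target_size=(224, 224))   # VGG16 expects 224x224 input
    x = img_to_array(img)[np.newaxis, ...]
    x = preprocess_input(x)                               # VGG16-specific preprocessing
    return feature_extractor.predict(x, verbose=0)
```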
Text Pre-processing: Pre-processing operations such as cleaning and tokenization are applied to the text data, which contains the image descriptions or captions. Cleaning includes removing unnecessary characters, punctuation, and special symbols. During tokenization, the text is broken up into individual words, or tokens, producing a sequence of words that can be fed into the LSTM network.
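The sketch below shows one way such cleaning and tokenization could be done with Keras's Tokenizer; the toy captions and the startseq/endseq boundary tokens are conventions we assume, not details specified in the paper.

```python
import re
import string
from tensorflow.keras.preprocessing.text import Tokenizer

def clean_caption(caption):
    """Lowercase, strip punctuation and digits, and add assumed boundary tokens."""
    caption = caption.lower()
    caption = caption.translate(str.maketrans("", "", string.punctuation))
    caption = re.sub(r"\d+", " ", caption)
    words = [w for w in caption.split() if len(w) > 1]
    return "startseq " + " ".join(words) + " endseq"

# Toy captions standing in for the Flickr8k descriptions.
raw_captions = ["A dog runs across the grass.", "Two children play on a beach!"]
cleaned = [clean_caption(c) for c in raw_captions]

# Map every word to an integer index for the LSTM input.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(cleaned)
vocab_size = len(tokenizer.word_index) + 1          # +1 because index 0 is reserved
max_len = max(len(c.split()) for c in cleaned)      # longest caption length
```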
Caption Generation: The LSTM network is trained on the pre-processed text and the visual features extracted from the images. The network exploits its capacity for sequential processing to create descriptive captions from the image and text information. To produce accurate and informative descriptions, the model learns the correlations between the image features and their corresponding captions.
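The paper does not give the exact architecture, so the following is a sketch of a commonly used "merge" design for this step, assuming 4096-dimensional VGG16 features and the vocab_size/max_len values produced by the pre-processing step; the layer sizes are illustrative choices rather than values from the paper.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def build_caption_model(vocab_size, max_len, feature_dim=4096):
    # Image branch: compress the VGG16 feature vector.
    img_in = Input(shape=(feature_dim,))
    img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

    # Text branch: embed the partial caption and run it through an LSTM.
    txt_in = Input(shape=(max_len,))
    txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
    txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

    # Merge both branches and predict the next word of the caption.
    merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
    out = Dense(vocab_size, activation="softmax")(merged)

    model = Model(inputs=[img_in, txt_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```

At inference time, a model built this way can be driven by the greedy decoding loop sketched in the Introduction.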
Audio Description Generation: Finally, the text captions created by the LSTM network are sent to a TTS engine, which converts them into natural-sounding audio descriptions, making it possible for visually impaired users to access and interpret the image content. The audio descriptions provide an additional mode of communication for conveying the visual information contained in the images.
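The paper does not name a specific TTS engine; as one possibility, the Google Text-to-Speech wrapper gTTS can turn a generated caption string into an audio file, as in the sketch below (the caption text and output path are placeholders).

```python
from gtts import gTTS

def caption_to_audio(caption_text, out_path="caption.mp3", lang="en"):
    """Synthesize a spoken version of the generated caption and save it to disk."""
    tts = gTTS(text=caption_text, lang=lang)
    tts.save(out_path)
    return out_path

# Example with a placeholder caption produced by the captioning model.
caption_to_audio("a dog runs across the grass", "example_caption.mp3")
```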
CONCLUSION
A comparison between the proposed system and existing systems was carried out in tabular form, covering several aspects of the result analysis. The proposed system aims to achieve high caption accuracy, high-quality audio descriptions, a varied set of captions, efficient computational utilization, and satisfied users. In contrast, the performance of existing systems varies depending on the particular method used and the metrics chosen for assessment. User feedback and evaluation also play a crucial role in analyzing the usability and efficacy of the proposed system; such evaluation may be limited or not applicable for existing systems that focus largely on text-based captions, whereas the proposed system concentrates primarily on audio-based captions.
References
[1] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image
caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (pp. 3156-3164).
[2] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., ... & Bengio, Y. (2015).
Show, attend and tell: Neural image caption generation with visual attention. In
International Conference on Machine Learning (ICML) (pp. 2048-2057).
[3] Karpathy, A., & Li, F. F. (2015). Deep visual-semantic alignments for generating image
descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (pp. 3128-3137).
[4] Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015).
Microsoft COCO captions: Data collection and evaluation server. arXiv preprint
arXiv:1504.00325.
[5] Pan, Y., Mei, T., Yao, T., Li, H., & Rui, Y. (2016). Jointly modeling embedding and
translation to bridge video and language. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (pp. 4594-4602).
[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[7] J. H. Tan, C. S. Chan, and J. H. Chuah, "COMIC: Toward a compact image captioning model with attention," IEEE Trans. Multimedia, vol. 21, no. 10, pp. 2686-2696, 2019.
[8] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, H. Laga, and M. Bennamoun, "Attention-based image captioning using DenseNet features," in Proc. Int. Conf. Neural Inf. Process. (ICONIP), 2019, pp. 109-117.
[9] T. Qiao, J. Zhang, D. Xu, and D. Tao, "MirrorGAN: Learning text-to-image generation by redescription," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1505-1514.
[10] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, H. Laga, and M. Bennamoun, "Bi-SAN-CAP: Bi-directional self-attention for image captioning," in Proc. Digit. Image Comput., Techn. Appl. (DICTA), Dec. 2019, pp. 1-7.
[11] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., ... & Bengio, Y. (2015).
Show, attend and tell: Neural image caption generation with visual attention. International
Conference on Machine Learning, 2048-2057.
[12] Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., & Yuille, A. (2015). Deep captioning with
multimodal recurrent neural networks (m-rnn). International Conference on Learning
Representations.
[13] Donahue, J., Anne Hendricks, L., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko,
K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition
and description. Conference on Computer Vision and Pattern Recognition, 2625-2634.
[14] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image
caption generator. Conference on Computer Vision and Pattern Recognition, 3156-3164.
[15] Wu, Q., Shen, C., Liu, L., & Dick, A. (2016). Image captioning and visual question
answering based on attributes and their related external knowledge. IEEE Transactions on
Multimedia, 18(8), 1630-1644.
[16] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). Making the V in
VQA matter: Elevating the role of image understanding in Visual Question Answering.
Conference on Computer Vision and Pattern Recognition, 6904-6913.
[17] Ren, Z., Yu, L., Li, F. F., & Kautz, J. (2017). Deep reinforcement learning-based image
captioning with embedding reward. Conference on Computer Vision and Pattern
Recognition, 6278-6286.
[18] Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollár, P., ... & Zhu, R. (2015).
From captions to visual concepts and back. Conference on Computer Vision and Pattern
Recognition, 1473-1482.
[19] Chen, J., Fang, H., & Zhan, X. (2019). Show, edit and tell: A framework for editing image
captions. IEEE Transactions on Multimedia, 21(5), 1295-1306.
[20] Chen, Y., Li, W., & Lin, Z. (2020). Automatic image captioning using visual attention
mechanism and convolutional neural networks. IEEE Transactions on Multimedia, 22(6),
1536-1545.
[21] AA Khan, RM Mulajkar, VN Khan, SK Sonkar, DG Takale. (2022). A Research on
Efficient Spam Detection Technique for IOT Devices Using Machine Learning.
NeuroQuantology, 20(18), 625-631.
[22] SU Kadam, VM Dhede, VN Khan, A Raj, DG Takale. (2022). Machine Learning Methode
for Automatic Potato Disease Detection. NeuroQuantology, 20(16), 2102-2106.
[23] DG Takale, Shubhangi D. Gunjal, VN Khan, Atul Raj, Satish N. Gujar. (2022). Road
Accident Prediction Model Using Data Mining Techniques. NeuroQuantology, 20(16),
2904-2101.
[24] SS Bere, GP Shukla, VN Khan, AM Shah, DG Takale. (2022). Analysis Of Students
Performance Prediction in Online Courses Using Machine Learning Algorithms.
NeuroQuantology, 20(12), 13-19.
[25] R Raut, Y Borole, S Patil, VN Khan, DG Takale. (2022). Skin Disease Classification Using
Machine Learning Algorithms. NeuroQuantology, 20(10), 9624-9629.
[26] SU Kadam, A katri, VN Khan, A Singh, DG Takale, DS. Galhe (2022). Improve The
Performance Of Non-Intrusive Speech Quality Assessment Using Machine Learning
Algorithms. NeuroQuantology, 20(19), 3243-3250.
[27] DG Takale, (2019). A Review on Implementing Energy Efficient clustering protocol for
Wireless sensor Network. Journal of Emerging Technologies and Innovative Research
(JETIR), Volume 6(Issue 1), 310-315.
[28] DG Takale. (2019). A Review on QoS Aware Routing Protocols for Wireless Sensor
Networks. International Journal of Emerging Technologies and Innovative Research,
Volume 6(Issue 1), 316-320.
[29] DG Takale (2019). A Review on Wireless Sensor Network: its Applications and challenges.
Journal of Emerging Technologies and Innovative Research (JETIR), Volume 6(Issue 1 ),
222-226.
[30] DG Takale, et. al (May 2019). Load Balancing Energy Efficient Protocol for Wireless
Sensor Network. International Journal of Research and Analytical Reviews (IJRAR), 153-
158.
[31] DG Takale et.al (2014). A Study of Fault Management Algorithm and Recover the Faulty
Node Using the FNR Algorithms for Wireless Sensor Network. International Journal of
Engineering Research and General Science, Volume 2( Issue 6), 590-595.
[32] DG Takale, (2019). A Review on Data Centric Routing for Wireless sensor Network.
Journal of Emerging Technologies and Innovative Research (JETIR), Volume 6(Issue 1),
304-309.
[33] DG Takale, VN Khan (2023). Machine Learning Techniques for Routing in Wireless
Sensor Network, IJRAR (2023), Volume 10, Issue 1.
[34] DG Takale, GB Deshmukh (2023), Securing the Internet of Things (IOT) Using Deep
Learning and Machine Learning Approaches, International Journal of Scientific Research
in Engineering and Management (IJSREM) on Volume 07, Issue 04 April 2023