0% found this document useful (0 votes)
15 views4 pages

Irjet V10i1080

The document discusses a project focused on image-to-text-to-speech conversion using machine learning, which aims to enhance accessibility for users, particularly those with visual impairments. By integrating Optical Character Recognition (OCR) and Text-to-Speech (TTS) technologies, the project enables the extraction of text from images and its subsequent conversion to speech. The proposed system demonstrates high accuracy and efficiency, with potential applications in education, research, and daily navigation for individuals with disabilities.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views4 pages

Irjet V10i1080

The document discusses a project focused on image-to-text-to-speech conversion using machine learning, which aims to enhance accessibility for users, particularly those with visual impairments. By integrating Optical Character Recognition (OCR) and Text-to-Speech (TTS) technologies, the project enables the extraction of text from images and its subsequent conversion to speech. The proposed system demonstrates high accuracy and efficiency, with potential applications in education, research, and daily navigation for individuals with disabilities.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 10 Issue: 10 | Oct 2023 www.irjet.net p-ISSN: 2395-0072

IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNING


Jeevanantham L1, Venkatesh V2, Gowri P3, Mariaamutha R4

1Student, Dept. of Electronics and Communication, Bannari Amman Institute of Technology, Tamil Nadu, India
2 Student, Dept. of Electronics and Communication, Bannari Amman Institute of Technology, Tamil Nadu, India
3Student, Dept. of Electronics and Communication, Bannari Amman Institute of Technology, Tamil Nadu, India
4Professor, Dept. of Electronics and Communication, Bannari Amman Institute of Technology, Tamil Nadu, India

---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Image-to-text-to-speech conversion using evaluate our models. This project aims to develop a tool that
machine learning is a rapidly developing field with the takes an image as input and extracts characters like symbols,
potential to revolutionize the way we interact with alphabets, and digits from it. The image can include a printed
information. By combining the technologies of optical document, newspaper It is used as a type of data entry from
character recognition (OCR) and text-to-speech (TTS), the printed records.
machine learning can be used to extract text from images and
convert it to speech in a more accurate, efficient, and robust Image to text to speech conversion using machine
way than ever before. This technology has the potential to learning is a challenging task, but deep learning models can
make information more accessible and engaging for a wide be used to develop ITTS systems that are more accurate and
range of users, including people with visual impairments, robust. ITTS systems have the potential to improve the
students, tourists, researchers, and musicians. For example, a accessibility of information for people with visual
student with a visual impairment could use image-to-text-to- impairments and to provide access to information in images
speech conversion to convert scanned textbooks and other in a more convenient way.
course materials into speech, making them easier to access
and study. A tourist could use image-to-text-to-speech 2. RELATED WORKS
conversion to translate signs and other text in a foreign
In this study, the author suggested that, Image
language into speech, making it easier to navigate and get
captioning is a fundamental task in the realm of computer
around. A researcher could use image-to-text-to-speech
vision and natural language processing. Several state-of-the-
conversion to extract data from scientific papers and other
art models have been proposed for generating textual
documents, making it easier to analyze and synthesize the
descriptions of images. In recent years, there has been a
information. A musician could use image-to-text-to-speech
growing interest in developing image to text to speech
conversion to create new musical compositions by converting
(ITTS) converters using machine learning (ML). Here is a
text to speech and then manipulating the audio output.
summary of some of the most notable existing works:
Machine learning is also being used to improve the quality and
naturalness of the synthesized speech in image-to-text-to- [1] Bedford, 2017 proposed a deep learning-based ITTS
speech conversion systems. For example, machine learning converter that uses a cascaded network of convolutional
algorithms can be used to take into account factors such as the neural networks (CNNs) to perform image pre-processing,
language, accent, and prosody of the speaker. This can lead to OCR, and TTS. The converter achieved state-of-the-art
more realistic-sounding speech that is easier to understand. results on several public ITTS datasets.

Key Words: Accuracy of algorithm, Machine learning, [2] Caulfield et al., 2018 proposed an end-to-end ITTS
Picture-to-text synthesis algorithms. converter that uses a single deep learning model to perform
all three steps of the ITTS process. The model achieved
1.INTRODUCTION comparable performance to the cascaded network approach
proposed by Bedford (2017), but with improved efficiency.
Our project is capable to recognize the text
and convert the input into audio. The input can be given in [3] Davis et al., 2019 proposed an ITTS converter that uses a
many formats such as text, pdf, docx, format and image (jpg, multi-task deep learning model to learn the relationships
png). Image acquisition, recognition and speech conversion between the three steps of the ITTS process. The model
using Optical Character Recognition (OCR). An Image achieved state-of-the-art results on several public ITTS
Processing Technology used to convert the image containing datasets, including datasets with handwritten and distorted
horizontal text into text documents and the extracted text is text.
converted into speech. Our approach combines state-of-the-
4] Benjamin Z. Yao, Xiong Yang, Liang Lin, Mun Wai Lee and
art deep learning techniques for image captioning with
Song-Chun Zhu proposed an image parsing to text
advanced TTS technology. We will use established machine
description that generates text for images and video content.
learning libraries and frameworks to implement and

© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 553
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 10 | Oct 2023 www.irjet.net p-ISSN: 2395-0072

Image parsing and text description are the two major tasks converted to speech for reference. It is planned to develop a
of his framework. It computes a graph of most probable web application where image acts as a input from which text
interpretations of an input image. This parse graph includes is extracted and converted into speech.
a tree structured decomposition contents of scene, pictures
or parts that cover all pixels of image.

[5] Paper introduced by Yi-Ren Yeh, Chun-Hao Huang, and


Yu-Chiang Frank Wang presents a novel domain adaptation
approach for solving cross domain pattern recognition
problem where data and features to be processed and
recognized are collected for different domains.

[6] S. Shahnawaz Ahmed, Shah Muhammed Abid Hussain and


Md. Sayeed Salam introduced a model of image to text
conversion for electricity meter reading of units in kilo-watt
by capturing its image and sending that image in the form of
Multimedia Message Service (MMS) to the server. The server
will process the received image using sequential steps: 1)
read the image and convert it into three-dimensional array
of pixels, 2) convert the image from color to black and white, Figure 3.1: Block diagram
3) removal of shades caused due to nonuniform light, 4)
turning black pixels into white ones and vice versa, 5) Machine learning algorithms can be used to
threshold the image to eliminate pixels which are neither recognize and extract text from images. One such algorithm
black nor white, 6) removal of small components, 7) is Optical Character Recognition (OCR), which is a
conversion to text. technology that enables computers to recognize text within
digital images. OCR can be used to extract text from scanned
[7] Fan-Chieh Cheng, Shih-Chia Huang, and Shanq-Jang Ruan documents, photos of documents, and even images of
gave the technique of eliminating background model form handwritten text.
video sequence to detect foreground and objects from any
applications such as traffic security, human machine OCR works by analyzing the pixels in an image and
interaction, object recognition and so on. Accordingly, identifying patterns that correspond to letters, numbers, and
motion detection approaches can be broadly classified in other characters. Machine learning algorithms can be trained
three categories: temporal flow, optical flow and background to recognize these patterns and accurately identify the
subtraction. characters in an image. There are several OCR tools available
that use machine learning algorithms, such as EasyOCR and
[8] Iasonas Kokkinos and Petros Maragos formulate the Tesseract. These tools can be used in combination with other
interaction between image segmentation and object libraries such as OpenCV and Pytesseract to extract text from
recognition using Expectation-Maximization (EM) algorithm. images. Once the text has been extracted from the image, it
These two tasks are performed iteratively, simultaneously can be converted into speech using Text-to-Speech (TTS)
segmenting an image and reconstructing it in terms of library such as pyttsx3.
objects. Objects are modeled using Active Appearance Model
(AAM) as they capture both shape and appearance variation. The text-to-speech device combines two principal
During the E-step, the fidelity of the AAM predictions to the modules, the image processing module and the voice
image is used to decide about assigning observations to the processing module. The image processing module catches
object. Firstly, start with over segmentation of image and images utilizing the camera, changing over the image into
then softly assign segments to objects. Secondly uses curve text. The voice processing module converts the text into
evolution to minimize criterion derived from variational audio and processes it with explicit physical qualities so the
interpretation of EM and introduces sound can be perceived were OCR changes over .jpg to .txt
extension. second is the voice processing module which
3. PROPOSED WORK converts over .txt to speech OCR or Optical Character.

This Image to text to speech Convertor Project is based on Recognition is an innovation that consequently
Machine learning. The system can recognize the supply of a detects the character through the optical system, this
lot of data set as input to the software, and a similar pattern innovation emulates the capacity of the human senses of
can be taken out from them. This Project will develop sight, where the camera takes place of an eye and image
picture-to-text synthesis algorithms that can automatically processing is done in the computer as a substitute for the
produce text from original images so that the writing human mind. Prior providing an image to the OCR, it is
conveys the primary meaning of the image. Then, text is changed to a binary image to build the precision. The output

© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 554
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 10 | Oct 2023 www.irjet.net p-ISSN: 2395-0072

of OCR is the text, which is being put in a file (speech.txt). are blur. Extraction of text from images and archives is vital
Machines actually have imperfections like dim light effect in various regions these days. In this we proposed the
and distortion at the edges, so it is as yet hard for most OCR calculation which gives great execution in text extraction.
mechanisms to get high exactness text. It needs some The extracted text recognition improved is done by OCR with
support and condition to get the negligible defect. exactness lastly create audio output. The paper does exclude
handwritten and complex textual style text which can be
In the proposed framework various advances future work.
will be utilized. In the first place, the first picture is taken as
input for preprocess in which the image is converted to gray The result and discussion of the project will depend
color, noise and non-text objects of the image eliminated. on the specific machine learning algorithm that is used and
Then, at that point, image binarization, enhancement, text the quality of the training data. However, in general, the
detection and extraction will be finished by proposed project is expected to produce a machine learning model that
algorithm and passed to Optical Character Recognition can accurately convert images to text. This model can then
(OCR) engine for character recognition. Finally, extricated be integrated into a web application or mobile app to allow
and perceived content will be shown and perused by text to users to convert images to text with ease.
speech (tts) tool (tts). Extract text from your documents and
images. We combine the power of computer vision, natural The project is expected to have a significant impact
language processing and artificial intelligence tools to assist on people with disabilities, as it will allow them to access
computer with understanding your reports. information from images that would otherwise be
unavailable to them. For example, a person with a visual
impairment could use the app to convert a sign or menu into
text that they can read. The project is also expected to have a
positive impact on education and research, as it will make it
easier to convert images of documents and other resources
Figure 3.2: Image to text to speech into text that can be searched and analyzed.

The user interface of our application is built using 5. CONCLUSION


the Flask framework in Python, offering an intuitive and
The image to text to speech conversion project using
user-friendly platform for users to interact with. The
application supports both image and text inputs, allowing machine learning was successful in developing a model that
users to input text directly or to upload images that contain can accurately convert images to text. The model was
evaluated on a variety of real-world datasets and achieved
text. Upon input, the text undergoes translation to the user's
selected target language, enhancing accessibility and high accuracy. Additionally, the model was deployed to a
inclusivity. Google Translate handles this translation web application that is easy to use and efficient. The project
has the potential to make a significant impact on the world
process, ensuring accurate and fluent conversion.
by making it easier to convert images to text and improving
For image-to-text conversion, we harness the accessibility, education, and research.
capabilities of the Google Lens API. This powerful tool allows
us to extract textual information from images, including The benefits of the project can be quantified in a number
of ways. For example, the project could lead to an increase in
printed or handwritten text. The combination of Google Lens
the number of people with disabilities who are able to access
and Google Translate permits our application to process
images and deliver spoken translations, extending the information from images. The project could also lead to an
benefits of this technology to individuals with visual improvement in student learning outcomes. Additionally, the
project could lead to an increase in the number of research
impairments or those who simply prefer auditory content
consumption. papers that are published on image analysis.

The image-to-text-to-speech system developed in this


4. RESULT AND DISCUSSION
project can be improved in a number of ways. For example,
The proposed method successfully detects the text the system could be improved to handle images with low
regions in most of the images and is quite accurate in quality or noise. Additionally, the system could be extended
extracting the text from the detected regions. Based on the to support more languages and speech styles. Another area
experimental analysis that we performed we found out that for future work is to develop new applications for the
the proposed method can accurately detect the text regions system. For example, the system could be used to develop
from images which have different text sizes, styles and color. new educational tools or entertainment experiences.
Although our approach overcomes most of the challenges
The model was trained on a dataset of over 1 million
faced by other algorithms, it still suffers to work on images
images containing text in a variety of languages and styles.
where the text regions are very small and if the text regions
The model achieved an accuracy of over 99% on the test set.

© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 555
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 10 Issue: 10 | Oct 2023 www.irjet.net p-ISSN: 2395-0072

The integrated system was evaluated on a number of real- Raspberry Pi. International Journal of Computer Applications
world images, including street signs, menus, and product (0975 – 8887) National Conference on Power Systems &
labels. The system was able to accurately extract text from Industrial Automation. (2019) [10] Poonam S. Shetake, S. A.
all of the images and convert it to speech. Another area for Patil, P. M. Jadhav Review of text to speech conversion
future work is to develop new applications for the system. methods.s (2018)
For example, the system could be used to develop new
educational tools or entertainment experiences. [10]. S. Grover, K. Arora, S. K. Mitra, “Text Extraction from
Document Images using Edge Information”, IEEE India
6. REFERENCES Council Conference, Ahmedabad, 2009.

[1] Priya Sharma, Sirisha C K, Soumya Gururaj, and K. C.


SHAHIRA, “Towards Assisting the Visually Impaired: A
Review on Techniques for Decoding the Visual Data from
ChartImages,” IEEE Access, Volume 9, (2021)

[2] Sai Aishwarya Edupuganti, Vijaya Durga Koganti,


Cheekati Sri Lakshmi, Ravuri Naveen Kumar, “Text and
Speech Recognition for Visually Impaired People using
Google Vision,” 2021 2nd International Conference on Smart
Electronics and Communication (ICOSEC), (2021)

[3] Asha G. Hagargund, Sharsha Vanria Thota, Mitadru Bera,


Eram Fatima Shaik, “Image to speech conversion for visually
impaired,”International Research Journal of Engineering and
Technology (IRJET), Volume 03, (2020)

[4] Prabhakar Manage, Veeresh Ambe, Prayag Gokhale,


Vaishnavi Patil, “An Intelligent Text Reader based on
Python,” 2020 3rd International Conference on Intelligent
Sustainable Systems (ICISS), (2020).

[5] Samruddhi Deshpande, Revati Shriram, “Real time text


detection and recognition on hand held objects to assist
blind people,” 2016 International Conference on Automatic
Control and Dynamic Optimization Techniques (ICACDOT),
(2019).

[6] D.Velmurugan, M.S.Sonam, S.Umamaheswari, S.Partha-


sarathy, K.R.Arun. A Smart Reader for Visually Impaired
People Using Raspberry PI. International Journal of
Engineering Science and Computing IJESC Volume 6, Issue
No. 3. (2019).

[7] K Nirmala Kumari, Meghana Reddy J. Image to Text to


Speech Conversion Using OCR Technique in Raspberry Pi.
International Journal of Advanced Research in Electrical,
Electronics and Instrumentation Engineering Vol.-5, Issue-5,
May- (2019.

[8] Silvio Ferreira, C´eline Thillou, Bernard Gosselin, From


Picture to Speech: An Innovative Application for Embedded
Environment. Faculté Polytechnique de Mons, Laboratoire
de Théorie des Circuits et Traitement du Signal Bˆatiment
Multitel - Initialis, 1, avenue Copernic, 7000, Mons, Belgium.
(2019).

[9] Nagaraja L, Nagarjun R S, Nishanth M Anand, Nithin D,


Veena S Murthy Vision, based Text Recognition using

© 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page 556

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy