Report
Davangere 577007
A Project report on
“TEXT-TO-SPEECH CONVERTER”
Submitted in partial fulfillment of the requirements for the award of
the degree of
ACKNOWLEDGEMENT

The successful completion of a project depends on the co-operation and help of many
people, along with those who directly execute its work.
I take this opportunity to acknowledge the valuable assistance and co-operation
received from many sources, from the stage when the project was conceived to
the stage of its completion.
I am also deeply indebted to our beloved Principal, Prof. Nandan G P for providing the
necessary facilities to carry out this work.
I am extremely grateful to our HoD, Mr. Nandan G P, for kindly agreeing to guide
me in the right direction with all his wisdom.
I would also express my heartfelt thanks to our Project Coordinator Prof. Srikanth T
N, Assistant Professor, Department of BCA, SRS FGC for his constant guidance and
devoted support.
I also express my heartfelt thanks to all the faculty members, technical staff, non-
teaching staff of SRS FGC, my family, friends, and well-wishers for their constant
encouragement.
Student Names: Archana G (U13SL1S0041)
Anupama U (U13SL21S0005)
Impana N M (U13SL21S0016)
Sneha S A (U13SL21S0017)
Sowmya S (U13SL21S0052)
ABSTRACT
LIST OF FIGURES
1. Schematic TTS: a simple but general diagram of a TTS system
Sr.No. CHAPTERS
1 INTRODUCTION
1.1 Introduction
1.2 Types of TTS Systems
2 LITERATURE REVIEW
2.1 Literature Review
2.2 Representation and analysis of speech signals
2.3 Problems of Existing System
2.4 Expectation of the new system
2.5 Identification of the Need
2.6 Objectives of Study
2.7 Significance of the Study
2.8 Limitation of Study
2.9 Definition of Terms
2.10 Different applications of TTS
3 METHODOLOGY
3.1 Methodology
3.2 Choice of methodology for the new System
3.3 Domain-specific Synthesis
3.4 Unit Selection Synthesis
3.5 Diphone Synthesis
4 IMPLEMENTATION
4.1 Implementation
4.2 Speech Application Programming Interface (SAPI)
4.3 Java Speech API (JSAPI)
4.4 Requirements
4.5 Language Used
5 RESULT
6 CODING
6.1 HTML
6.2 CSS
6.3 JavaScript
7 REFERENCES
CHAPTER – 1
INTRODUCTION
1.1 Introduction:
Recent progress in speech synthesis has produced synthesizers with very high
intelligibility but the sound quality and naturalness still remain a major problem.
However, the quality of present products has reached an adequate level for
several applications, such as multimedia and telecommunications.
At first sight, this task does not look hard to perform. After all we all have a
deep knowledge of reading rules of our mother tongue. They were transmitted
to us, in a simplified form, at primary school, and we improved them year after
year. But in the context of TTS synthesis, it is impossible to record and store all
the words of the language. Some other method has to be implemented for this
purpose. The quality of a speech synthesizer is judged by its similarity to the
human voice and by its ability to be understood. A text-to-speech synthesizer
allows people with visual impairments and reading disabilities to listen to
written words on a home computer.
Many computer operating systems have included speech synthesizers since the
early 1990s.
1.2 Types of TTS Systems:
Most text-to-speech engines can be categorized by the method they use to
translate phonemes into audible sound. Some TTS systems are listed below:
1. Pre-recorded: whole words or phrases are recorded in advance and played back;
the output is very natural but limited to the recorded vocabulary.
2. Formant: the speech signal is generated entirely from an acoustic model, with
no recorded speech; the output is intelligible at any speed but sounds robotic.
3. Concatenated: short segments of recorded speech (such as diphones or larger
units) are joined together at synthesis time.
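As a rough illustration of the concatenative idea, the sketch below glues pre-stored "units" together. The unit database and its string contents are invented placeholders for this example, not real audio data.

```javascript
// Toy illustration of concatenative synthesis: join pre-stored units.
// The unit database below is an invented placeholder, not real audio.
const unitDatabase = { he: "[he]", llo: "[llo]", wor: "[wor]", ld: "[ld]" };

function concatenate(units) {
  // Look each unit up and join the matches into one "signal" string;
  // unknown units fall back to "?".
  return units.map(u => unitDatabase[u] ?? "?").join("+");
}

console.log(concatenate(["he", "llo"]));   // "[he]+[llo]"
```

A real concatenative engine does the same lookup-and-join over recorded waveform segments, plus smoothing at the joins.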
CHAPTER – 2
LITERATURE REVIEW
The existing system's algorithm is shown below in Figure 4. It shows that the
system does not provide a way for users to annotate text to their own
specification; rather, it simply speaks plain text.
2. Text pre-processing: the system simply speaks the text that is fed
into it, without any pre-processing operation occurring.
It is expected that the new system will reduce and improve on the problems
encountered in the old system. Among other things, the system is expected to
do the following:
The application will build a platform to aid people with disabilities especially
on reading and also help get information easily without any stress.
The project could also help children learn to pronounce words and how to read.
The study will serve as a foundation and guide to other research students
interested in researching on Text-to-Speech systems.
Text to speech also has limitations. The most obvious drawback of text as a
knowledge building and communication tool is that it lacks the inherent
expressiveness of speech. When speech is transcribed into text, it loses many of
its unique qualities, such as tone, rhythm, pace and repetition, that help to reduce
memory demands and support comprehension. A transcript may accurately
record the spoken word, but the strategic and emotive qualities and impact of
speech are diminished on the page. Furthermore, the cognitive demands of
organizing ideas into acceptable syntax, conventions, and presentational form
can pose significant barriers to using text for expression among novice and
expert writers alike.
We think in images or word fragments. Ideas float in and out of our heads —
and rarely in a linear or conventional way. Writing attempts to shape that
free-forming, dynamic process of thought into a single, sequential output of
sentences and paragraphs. Some individuals may have all these creative ideas in
their imaginative mind, but because their mind is so much quicker and richer
than the pen, when they put ink to parchment, the outcome is a blank sheet of
paper.
1. Telephony: reading menus, notifications and account information aloud over the phone.
2. Automotive: hands-free reading of navigation prompts and messages while driving.
4. Medical: assistive communication aids for patients who cannot speak.
5. Industrial: spoken alerts and status announcements in hands-busy or eyes-busy environments.
CHAPTER – 3
METHODOLOGY
3.1 Methodology:
Two methodologies were chosen for the new system. The first is the
Object Oriented Analysis and Development Methodology (OOADM). OOADM
was selected because the system has to be presented in a manner
that is user-friendly and understandable to the user.
Also, since the project is to emulate human behavior, an expert system had to be
used to map knowledge into a knowledge base with a reasoning
procedure. The expert system drives the internal operations of the program,
following a rule-based computation algorithm. The technique is derived
from general principles described by researchers in knowledge engineering
(Murray et al., 1991; 1996).
The output from the best unit-selection systems is often indistinguishable from
real human voices, especially in contexts for which the TTS system has been
tuned. However, maximum naturalness typically requires unit selection speech
databases to be very large, in some systems ranging into the gigabytes of
recorded data, representing dozens of hours of speech. Also, unit selection
algorithms have been known to select segments from a place that results in less-
than-ideal synthesis (e.g. minor words become unclear) even when a better
choice exists in the database.
Text-to-speech synthesis takes place in several steps. The TTS system gets
text as input, which it must first analyze and then transform into a phonetic
description. In a further step it generates the prosody. From the
information now available, it can produce a speech signal.
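The stages just described can be caricatured in a few lines of JavaScript. Everything here is a toy stand-in for illustration: the "phonetic description" is one pseudo-phoneme per letter, and "prosody" is just a pause marker after each word; none of these helpers are real synthesizer or Web Speech API calls.

```javascript
// Toy sketch of the pipeline: text analysis -> phonetic description -> prosody.
// All helpers are illustrative stand-ins, not real synthesis code.
function analyzeText(text) {
  // Text analysis: normalize and split into word tokens.
  return text.toLowerCase().replace(/[^a-z\s]/g, "").split(/\s+/).filter(Boolean);
}

function toPhonetic(words) {
  // Stand-in "phonetic description": one pseudo-phoneme per letter.
  return words.map(w => w.split("").join("-"));
}

function addProsody(phonemes) {
  // Stand-in prosody: mark a pause boundary after each word.
  return phonemes.map(p => p + " |");
}

const description = addProsody(toPhonetic(analyzeText("Hello, world!")));
console.log(description.join(" "));   // "h-e-l-l-o | w-o-r-l-d |"
```

A real system would output phonemes and prosody markers to a signal-processing stage instead of strings, but the flow of data is the same.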
The structure of the text-to-speech synthesizer can be broken down into major
modules:
Prosody Generation: after the pronunciation has been determined, the prosody
is generated. The degree of naturalness of a TTS system is dependent on
prosodic factors like intonation modelling (phrasing and accentuation),
amplitude modelling and duration modelling (including the duration of sound
and the duration of pauses, which determines the length of the syllables and the
tempo of the speech).
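In the browser, the Web Speech API exposes three of these prosodic controls directly on an utterance object: `rate` (tempo), `pitch` (intonation height) and `volume` (amplitude). A minimal sketch follows; the specific values are chosen arbitrarily for illustration, and the call is guarded so the snippet is harmless outside a browser.

```javascript
// Prosodic controls exposed by the Web Speech API:
// rate ~ tempo, pitch ~ intonation height, volume ~ amplitude.
const prosody = { rate: 0.9, pitch: 1.1, volume: 1.0 };  // illustrative values

// Guarded so the snippet only speaks when run in a browser.
if (typeof SpeechSynthesisUtterance !== "undefined") {
  const u = new SpeechSynthesisUtterance("Prosody makes speech sound natural.");
  Object.assign(u, prosody);   // copy rate, pitch and volume onto the utterance
  speechSynthesis.speak(u);
}
```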
Fig.5 Operations of the Natural Language Processing module of a TTS
Synthesizer
The output of the NLP module is passed to the DSP module. This is where the
actual synthesis of the speech signal happens. In concatenative synthesis the
selection and linking of speech segments take place. For individual sounds the
best option (where several appropriate options are available) is selected from a
database and concatenated.
Fig.6 The DSP component of a general concatenation-based synthesizer
CHAPTER – 4
IMPLEMENTATION
4.1 Implementation:
The TTS system converts an arbitrary ASCII text to speech. The first step
involves extracting the phonetic components of the message, and we obtain a
string of symbols representing sound-units (phonemes or allophones),
boundaries between words, phrases and sentences along with a set of prosody
markers (indicating the speed, the intonation etc.). The second step consists of
finding the match between the sequence of symbols and appropriate items
stored in the phonetic inventory and binding them together to form the acoustic
signal for the voice output device.
1. A database containing the parameter values for the sounds within the word,
Fig.8 Data flow diagram of the Speech Synthesis System Using Gane and
Sarson Symbol.
User Interface (Source): This can be a Graphical User Interface (GUI) or the
Command Line Interface (CLI).
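The two steps described in section 4.1 (extracting a sequence of symbols, then matching the symbols against a phonetic inventory and binding them together) can be caricatured in a few lines. The inventory and its numeric entries below are invented for illustration only.

```javascript
// Toy sketch of matching symbols against a phonetic inventory and
// binding the matches into one output sequence. Inventory is invented.
const inventory = new Map([["T", 102], ["AH", 101], ["pause", 0]]);

function bind(symbols) {
  // Keep only symbols the inventory knows, in order, and fetch their entries.
  return symbols.filter(s => inventory.has(s)).map(s => inventory.get(s));
}

console.log(bind(["T", "AH", "pause"]));   // logs [ 102, 101, 0 ]
```

In the real system the bound items are stored speech parameters rather than numbers, and the result is handed to the voice output device.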
Some of the technologies involved in the design of this system include the
following:
4.2 Speech Application Programming Interface (SAPI):
4.4 Requirements:
Hard Disk: 40 GB
Processor: Dual Core
RAM: 512 MB
4.5 Language Used:
While HTML and CSS are not programming languages in the strict sense (some of
you may not agree, but they are not), they are the backbone of web development.
HTML provides the structure, while CSS provides the style and makes pages look
better and more visually appealing.
JavaScript:
There is no doubt that JavaScript is among the most popular languages for web
development. It lets you create web applications end to end, both frontend and
backend, as well as mobile applications. The strength of JavaScript is not just
that it can run in the browser and on the server (using Node.js), but also the
excellent frameworks and libraries it offers for web and app development.
CHAPTER – 5
RESULT
CHAPTER – 6
CODING
6.1 HTML:
index.html:
<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
<meta charset="utf-8">
<title>Text To Speech Converter</title>
<link rel="stylesheet" href="style.css">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
<div class="wrapper">
<header>Text To Speech</header>
<form action="#">
<div class="row">
<label>Enter Text</label>
<textarea></textarea>
</div>
<div class="row">
<label>Select Voice</label>
<div class="outer">
<select></select>
</div>
</div>
<button>Convert To Speech</button>
</form>
</div>
<script src="script.js"></script>
</body>
</html>
6.2 CSS:
style.css:
*{
margin: 0;
padding: 0;
box-sizing: border-box;
font-family: 'Poppins',sans-serif;
}
body{
display: flex;
align-items: center;
justify-content: center;
min-height: 100vh;
background: #5256AD;
}
::selection
{
color: #fff;
background: #5256AD;
}
.wrapper
{
width: 370px;
padding: 25px 30px;
border-radius: 7px;
background: #fff;
box-shadow: 7px 7px 20px rgba(0,0,0,0.05);
}
.wrapper header
{
font-size: 28px;
font-weight: 500;
text-align: center;
}
.wrapper form
{
margin: 35px 0 20px;
}
form .row
{
display: flex;
margin-bottom: 20px;
flex-direction: column;
}
form :where(textarea,select,button)
{
outline: none;
width: 100%;
height: 100%;
border: none;
border-radius: 5px;
}
form button
{
height: 52px;
color: #fff;
font-size: 17px;
cursor: pointer;
margin-top: 10px;
background: #675AFE;
transition: 0.3s ease;
}
form button:hover
{
background: #4534fe;
}
@media(max-width:400px)
{
.wrapper
{
max-width: 345px;
width: 100%;
}
}
6.3 JavaScript:
script.js:
// Grab the UI elements defined in index.html and the browser's speech engine.
// These declarations are required by the functions below but were missing.
const textarea = document.querySelector("textarea"),
voiceList = document.querySelector("select"),
speechBtn = document.querySelector("button");
let synth = speechSynthesis,
isSpeaking = true;
voices();
function voices()
{
for(let voice of synth.getVoices())
{
let selected = voice.name === "Google US English" ? "selected" : "";
let option = `<option value="${voice.name}"
${selected}>${voice.name} (${voice.lang})</option>`;
voiceList.insertAdjacentHTML("beforeend",option);
}
}
synth.addEventListener("voiceschanged", voices);
function textToSpeech(text)
{
let utterance = new SpeechSynthesisUtterance(text);
for(let voice of synth.getVoices())
{
if(voice.name === voiceList.value)
{
utterance.voice = voice;
}
}
synth.speak(utterance);
}
speechBtn.addEventListener("click", e =>
{
e.preventDefault();
if(textarea.value !== "")
{
if(!synth.speaking)
{
textToSpeech(textarea.value);
}
if(textarea.value.length>80)
{
if(isSpeaking)
{
synth.resume();
isSpeaking=false;
speechBtn.innerText="Pause Speech";
}
else
{
synth.pause();
isSpeaking=true;
speechBtn.innerText="Resume Speech";
}
setInterval(()=>{ // poll: once speech has finished, reset the button label
if(!synth.speaking && !isSpeaking)
{
isSpeaking=true;
speechBtn.innerText="Convert To Speech";
}
});
}
else
{
speechBtn.innerText="Convert To Speech";
}
}
});
CHAPTER – 7
REFERENCES
1. https://www.scribd.com/document/572985199/incomplete-report
2. https://youtu.be/cvZrpquLQg?si=K7nMYZNhC4LK4_Rp
3. https://images.app.goo.gl/U14nCWRz734jhU4c7