
CMP6200/DIG6200

INDIVIDUAL UNDERGRADUATE PROJECT


2024–2025

A1: Proposal

REAL-TIME MULTILINGUAL SPEECH TRANSLATION SYSTEM WITH PERSONAL VOICE SYNTHESIS

Course: Smart Library Base: Ebook System with Enhanced AI Summarization


Student Name: Le Thanh Trung
Student Number: 22560023
Supervisor Name: Ph.D. Nguyen Thanh Binh
Contents
1 Introduction
 1.1 Background and Rationale
 1.2 Key Themes/Topics
2 Aim and Objectives
 2.1 Project Aim
 2.2 Project Objectives
3 Project Planning
 3.1 Initial Project Plan
 3.2 Resources
 3.3 Risk Assessments
4 Project Review and Methodology
 4.1 Critique of Past Similar Projects
 4.2 Literature Search Methodology
 4.3 Initial Literature Search Results
5 Bibliography
1 Introduction

1.1 Background and Rationale


Environmental protection is a constant topic of conversation today. There are countless ways to preserve the environment, depending on factors such as individual behaviour, education, and awareness-raising. With the advancement of information technology, digitizing printed materials (documents, books, textbooks, and so on) is one solution that not only helps the environment but also cuts production costs, potentially saving a country or an industry millions, or even billions, of dollars.
The book-digitization industry has already gone through several waves of change. My study, however, focuses not only on digitizing books but also on the UI/UX, aiming to give users an experience close to reading a physical book. What sets this project apart is the combination of AI summarization, text-to-speech technology, and a user-driven platform.
Why is AI summarization a step forward? People value speed and convenience. "Film reviewing" is one example: instead of spending hours reading or watching something, people increasingly want a quick summary of the essential points so they can make informed decisions without a large time commitment.
That is why "Smart Library Base: Ebook System with Enhanced AI Summarization" was created.

1.2 Key Themes/Topics


 AI Summarization: using AI to summarize information; in my case, the content of books (a minimal sketch is shown after this list).

 Text-to-Speech (audio): technology that converts written content into spoken words.

 User-driven platform: allows users to comment on, upload, and modify their own books.

 Enhanced Reading Experience: the biggest difference from other systems is a page-flipping effect that makes reading feel like a physical book.
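
The AI summarization theme could, for example, be prototyped with an off-the-shelf summarization model. The snippet below is only a minimal sketch, assuming the Hugging Face transformers library and the publicly available facebook/bart-large-cnn checkpoint; it illustrates how chapter text might be condensed, not the final design.

# Minimal sketch: summarizing a passage of an ebook with a pretrained model.
# Assumes the Hugging Face "transformers" library and the public
# "facebook/bart-large-cnn" checkpoint; both are illustrative choices.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

chapter_text = (
    "Digitizing books reduces paper consumption and printing costs. "
    "An ebook platform with AI summarization lets readers grasp the key "
    "points of a chapter before deciding whether to read it in full."
)

# max_length / min_length bound the length of the generated summary.
summary = summarizer(chapter_text, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])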


2 Aim and Objectives

2.1 Project Aim


The aim of this project is to create a website that allows users to publish their books (as PDF files or posted online). Users can read books, comment on them, and rate the books they have just read.
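
Since the resources section lists React.js and FastAPI for the web layer, this aim could be served by a small set of HTTP endpoints. The sketch below is only an assumption about how uploading and rating might look; the route names, the in-memory store, and the Rating model are hypothetical, not an agreed design.

# Hypothetical sketch of the book-publishing API, assuming FastAPI.
# Route names and the in-memory store are illustrative only.
from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel

app = FastAPI()
books: dict[int, dict] = {}          # in-memory store, just for the sketch

class Rating(BaseModel):
    stars: int                       # 1-5 rating given by a reader
    comment: str = ""

@app.post("/books")
async def upload_book(title: str, pdf: UploadFile = File(...)):
    """Accept a PDF upload and register it as a new book."""
    book_id = len(books) + 1
    books[book_id] = {"title": title, "filename": pdf.filename, "ratings": []}
    return {"book_id": book_id}

@app.post("/books/{book_id}/ratings")
async def rate_book(book_id: int, rating: Rating):
    """Attach a reader's rating and comment to an existing book."""
    books[book_id]["ratings"].append(rating.dict())
    return {"count": len(books[book_id]["ratings"])}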

2.2 Project Objectives


Guidance: Objectives set out how you are going to achieve your aim. Objectives should be
Specific, Measurable, Achievable, Resourced, and Time-limited (SMART). Objectives should be
presented using bullet points and numbering to enhance readability. Information on SMART
objectives is available on Moodle.

Tips:

 Use action-oriented language for your objectives to make it clear what steps will be taken.
Make sure that each objective aligns directly with your overarching project aim.
 Try to limit yourself to a manageable number of objectives to keep your project focused.
 Use bullet points and numbering to list your objectives clearly.
 Technology Research: complete a review of existing speech recognition (ASR), neural machine translation (NMT), and voice synthesis models.
 Create and Develop a Multilingual ASR System: build a multilingual speech recognition (ASR) system that recognizes and transcribes real-time speech in multiple languages.
 Integrate Real-Time Translation: incorporate a neural machine translation (NMT) system that converts the spoken source language into the target language, adjusting to structural differences between languages.
 Create Personalized Voice Synthesis: develop a text-to-speech (TTS) system that generates speech in the translated language while keeping the speaker's original voice characteristics, such as accent, tone, and emotion.
 Optimize for Real-Time Operation: make the system work seamlessly in real time, providing translation and voice output that allow smooth communication without stops or delays (a pipeline sketch follows this list).
 Test and Gather Feedback: test the system in real-time situations to evaluate its performance, gather user feedback, and improve it until it meets accuracy and speed requirements.
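
To show how the ASR, NMT, and TTS objectives are meant to connect, here is a minimal, purely illustrative pipeline skeleton. The three stage functions are hypothetical placeholders with no real models behind them; the point is the data flow audio -> text -> translated text -> synthesized audio, not a working implementation.

# Illustrative pipeline skeleton for the speech translation objectives.
# transcribe(), translate() and synthesize() are hypothetical placeholders;
# real ASR, NMT and TTS models would be plugged in at these points.
from dataclasses import dataclass

@dataclass
class Utterance:
    audio: bytes          # raw audio captured from the microphone
    source_lang: str      # e.g. "vi"
    target_lang: str      # e.g. "en"

def transcribe(audio: bytes, lang: str) -> str:
    """Placeholder for the multilingual ASR stage."""
    raise NotImplementedError

def translate(text: str, source: str, target: str) -> str:
    """Placeholder for the real-time NMT stage."""
    raise NotImplementedError

def synthesize(text: str, lang: str, speaker_audio: bytes) -> bytes:
    """Placeholder for personalized TTS that reuses the speaker's voice."""
    raise NotImplementedError

def translate_utterance(u: Utterance) -> bytes:
    """Run one utterance through ASR -> NMT -> TTS."""
    text = transcribe(u.audio, u.source_lang)
    translated = translate(text, u.source_lang, u.target_lang)
    return synthesize(translated, u.target_lang, speaker_audio=u.audio)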

3 Project Planning

3.1 Initial Project Plan


Guidance: Identify the tasks and subtasks that are necessary to meet your project objectives.
Provide a brief description for each, outlining its role in achieving the goals you've set. Also, be aware
of the time each task and subtask will require.

Tips:

 While a Gantt chart is a useful tool, it's not mandatory. You may use any table or format
that effectively captures your planning. The key is to clearly display the timeline, tasks, and
their interdependencies.
 Your descriptions should be concise but informative, helping the reader understand the role
and importance of each task and subtask in the context of your project plan.

Each task below lists its description, subtasks, and time estimate.

1. Technology Research (4 weeks)
Description: research ASR, NMT, and TTS models.
Subtasks: identify ASR models (Whisper ASR or DeepSpeech); evaluate NMT models (OpenNMT or Fairseq); explore voice synthesis models (Tacotron 2).

2. Develop ASR System (4 weeks)
Description: build and train an ASR system to recognize real-time speech.
Subtasks: choose ASR models and train on datasets such as LibriSpeech and Common Voice; fine-tune the ASR model for real-time performance across multiple languages.

3. Integrate Real-Time NMT (4 weeks)
Description: incorporate neural machine translation (NMT) for real-time translation.
Subtasks: train or fine-tune NMT models (such as Fairseq) on parallel corpora (OpenSubtitles datasets); implement real-time processing pipelines.

4. Create Personalized Voice Synthesis (4 weeks)
Description: develop the TTS system to generate speech in the translated language while keeping the speaker's characteristics.
Subtasks: select TTS models (Tacotron 2); train on voice datasets (VCTK) for personalized voice output.

5. Optimize for Real-Time Functionality (4 weeks)
Description: ensure smooth real-time operation, balancing performance and quality.
Subtasks: integrate ASR, NMT, and TTS into a seamless pipeline; conduct real-time performance tests and optimize latency.

6. Testing and Feedback Collection (3 weeks)
Description: test the system in various real-time scenarios and collect feedback for refinement.
Subtasks: conduct user testing with bilingual speakers; simulate high-demand scenarios (customer service, healthcare) to evaluate system robustness.

7. Iterative Improvements and Refinement (3 weeks)
Description: make improvements based on feedback, ensuring translation accuracy and voice quality.
Subtasks: optimize translation quality for structural differences between languages; fine-tune the voice synthesis to reflect speaker characteristics more precisely.

8. Final Testing and Documentation (2 weeks)
Description: perform final tests and document the system's capabilities, preparing the project for submission.

Voice synthesis details (Task 4):
o Model: Tacotron 2, FastSpeech 2 for voice generation.
o Datasets: VCTK, LibriTTS (for keeping the speaker's voice characteristics).

(An illustrative transcription snippet using one of the candidate ASR models follows this plan.)
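
As an illustration of Task 2, the snippet below sketches how one of the candidate ASR models could be exercised offline before any real-time work begins. It is a minimal sketch assuming the open-source openai-whisper package and its "base" multilingual checkpoint; the audio file name is hypothetical.

# Minimal sketch of multilingual transcription with Whisper (Task 2).
# Assumes the open-source "openai-whisper" package; "sample.wav" is a
# hypothetical test recording, not a project asset.
import whisper

model = whisper.load_model("base")            # small multilingual checkpoint

# transcribe() detects the spoken language automatically unless one is given.
result = model.transcribe("sample.wav")
print(result["language"])                     # detected language code
print(result["text"])                         # transcribed text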

3.2 Resources
Guidance: Specify the resources required for the successful execution of your project. Resources
can include lab equipment, IT hardware and software, as well as research materials like databases or
library resources. For example, you might need high-computing servers for machine learning projects,
specialised editing software for digital media work, or network simulation tools for networking tasks.

Tips:

 Bear in mind that the university does not provide additional funding for student projects, so you
should account for any costs that are not covered by existing resources. It's advisable to consult
your supervisor regarding these costs, as they might be able to provide or recommend
equipment or software.
 Align your resource requirements closely with your project aim and objectives to ensure that
available resources are sufficient for achieving your goals.

Hardware
- Microphone and speakers: testing and simulating speech input/output.

Software
- Speech recognition (Google Speech-to-Text API or Python SpeechRecognition): building the multilingual speech recognition (ASR) system.
- Neural machine translation tools: integrating multilingual translation models.
- Text-to-speech synthesis system: developing personalized voice synthesis.
- Python (TensorFlow, PyTorch): building and training models.
- Version control (Git, GitHub): managing project code.

Research Resources
- IEEE Xplore, Google Scholar, Birmingham City University Library: access to online academic databases.

Datasets
- LibriSpeech (for ASR), Common Voice, and VCTK (for voice synthesis).

Data Storage for Training Datasets
- Cloud storage services (Google Drive, OneDrive): storing and accessing datasets remotely.

Display System
- Web-based application (React.js and FastAPI): developing the web application.

(A short microphone-capture sketch using one of the listed speech-recognition options follows this list.)
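
To give a concrete sense of the listed speech-recognition software, here is a minimal sketch assuming the Python SpeechRecognition package with its bundled Google Web Speech backend. It only captures one utterance from the microphone and prints the transcription; the language code is an example, and a working microphone (PyAudio installed) is assumed.

# Minimal sketch: capturing one utterance with the SpeechRecognition package.
# Uses the free Google Web Speech backend bundled with the library; the
# language code "vi-VN" is just an example. Requires PyAudio for the mic.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)   # calibrate for background noise
    print("Speak now...")
    audio = recognizer.listen(source)

try:
    text = recognizer.recognize_google(audio, language="vi-VN")
    print("Transcription:", text)
except sr.UnknownValueError:
    print("Speech was not understood.")
except sr.RequestError as err:
    print("Recognition service unavailable:", err)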

3.3 Risk Assessments


Guidance: Contemplate the potential risks that could derail your project timeline. This could
encompass a variety of factors, from unavailability of specific resources like software or equipment to
logistical constraints such as limited access to specialists or test subjects.

Tips:

 Prioritise the identification of risks that directly impact your project aim and objectives. For more
substantial risks, consider implementing a contingency plan or alternative approaches that
could mitigate these challenges.
 Consult your supervisor for expertise and advice on managing identified risks effectively.

Each project has its own set of challenges, and identifying these risks in advance makes it much easier to plan for success.
One major risk is real-time performance. With a large volume of audio to process, the translation and synthesis stages may not keep pace, which would introduce noticeable delays or lag and undermine the goal of smooth communication. To mitigate this, the algorithms should be optimized and tested early against clear performance targets; lighter models can be examined, and cloud-based computing can be used when extra processing power is needed (a small latency-measurement sketch is given after this section).
Another limitation arises from differences in sentence structure between languages. Translating between languages that differ greatly from one another can cause delays or errors. In such cases, an advanced neural machine translation technique that adapts to these differences in real time can be adopted; this flexibility is essential for keeping the system reliable.
By keeping these risks in mind and taking appropriate steps to mitigate them, it will be possible to work much more effectively toward a successful real-time multilingual speech translation system with personalized voice synthesis.
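
Because latency is the main risk identified above, early testing could start with something as simple as timing each stage of the pipeline. The sketch below is an assumption-heavy illustration: the three stage functions are hypothetical stand-ins, and the one-second budget is an example threshold, not a project requirement.

# Illustrative latency check for the risk assessment: time each pipeline stage
# and flag utterances that exceed an example end-to-end budget.
# asr_stage, nmt_stage and tts_stage are hypothetical placeholders.
import time

LATENCY_BUDGET_S = 1.0    # example end-to-end target, not a fixed requirement

def timed(stage_name, fn, *args):
    """Run one stage, returning its output and the elapsed wall-clock time."""
    start = time.perf_counter()
    output = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{stage_name}: {elapsed:.3f} s")
    return output, elapsed

def measure_pipeline(audio, asr_stage, nmt_stage, tts_stage):
    """Time the ASR -> NMT -> TTS chain for one utterance."""
    text, t1 = timed("ASR", asr_stage, audio)
    translated, t2 = timed("NMT", nmt_stage, text)
    speech, t3 = timed("TTS", tts_stage, translated)
    total = t1 + t2 + t3
    if total > LATENCY_BUDGET_S:
        print(f"WARNING: total latency {total:.3f} s exceeds budget")
    return speech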

4 Project Review and Methodology

4.1 Critique of Past Similar Projects


Guidance: Examine past similar projects or final year projects to enhance your understanding of how
to approach your own project. Firstly, discuss the strengths and weaknesses of these past projects.
Secondly, identify what aspects, such as background, methodologies, techniques, or technologies,
are particularly useful for your own project. Finally, explain how you plan to apply or adapt these
useful aspects in your project. The aim is to understand best practices and potential pitfalls,
regardless of whether the projects are directly related to your own subject matter.

Tips:

 If you can't find projects that directly align with your focus, select the closest available options.
Concentrate on what you can learn from them in terms of project planning, methodologies, or
specific techniques, and how you can apply this knowledge to enhance your own project's
robustness.
 "Critique" in this context means a detailed analysis and assessment of something, in this case,
past projects. Look beyond just listing good and bad points; instead, discuss the reasoning
behind these points and their implications for your own work.

Project 1: “LibriS2S: A German-English Speech-to-Speech Translation Corpus” by Pedro Jeuris and Jan Niehues

The LibriS2S project closes a gap in speech-to-speech translation by providing the first publicly available German-English speech-to-speech corpus. Unlike text-based translation datasets, it uses independently created audio for both languages, which avoids biased pronunciation. A model inspired by FastSpeech 2 is trained on this corpus to translate spoken language directly into another language without going through an intermediate text representation; it embeds important linguistic features, such as pitch, energy, and transcripts of the source language, to generate good-quality translations. While this works reasonably well, it lacks real-time processing ability and flexibility for other languages. In my project, I will adapt these techniques to multiple languages and focus on real-time translation, with the goal of enriching personal voice synthesis; I will work to ensure that the system carries the accent, tone, and emotion of a speaker into other languages. (A small pitch-extraction sketch related to these prosodic features follows.)
https://paperswithcode.com/paper/libris2s-a-german-english-speech-to-speech
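
Since LibriS2S-style models condition on prosodic features such as pitch and energy, the snippet below sketches how those two features could be extracted from a recording. It assumes the librosa library; the file name is hypothetical and the frame parameters are library defaults, not values taken from the paper.

# Illustrative extraction of pitch (F0) and energy, the prosodic features
# mentioned for LibriS2S-style models. Assumes the "librosa" library;
# "speech.wav" is a hypothetical recording.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)

# Frame-level fundamental frequency via the pYIN algorithm.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Frame-level energy as the root-mean-square of each analysis frame.
energy = librosa.feature.rms(y=y)[0]

print("voiced frames:", int(np.sum(voiced_flag)))
print("mean F0 (Hz):", float(np.nanmean(f0)))
print("mean RMS energy:", float(energy.mean()))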
Project 2: “Direct speech-to-speech translation with a sequence-to-sequence model” by Ye Jia*, Ron J. Weiss*, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu

This project presents an attention-based sequence-to-sequence neural network that directly translates speech in one language into speech in another language, without relying on an intermediate text representation. The network is trained end to end, learning to map speech spectrograms into target spectrograms in another language corresponding to the translated content (in a different canonical voice). The authors further demonstrate the ability to synthesize translated speech using the voice of the source speaker. They conduct experiments on two Spanish-to-English speech translation datasets and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task. Although the model slightly underperforms the traditional method built from separate speech-to-text and text-to-speech stages, it establishes a feasible framework for direct speech-to-speech translation. This approach can inform my project, particularly in keeping speaker voice characteristics during translation and improving real-time performance. (A short spectrogram-extraction sketch follows.)
https://paperswithcode.com/paper/direct-speech-to-speech-translation-with-a
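
Because this model works on speech spectrograms rather than text, a natural first step in reproducing any part of it would be computing a mel spectrogram from a waveform. The sketch below is a minimal illustration assuming torchaudio; the file name and the mel parameters are example values, not those used in the paper.

# Minimal sketch: computing a log-mel spectrogram, the input/output
# representation used by direct speech-to-speech models. Assumes torchaudio;
# "speech.wav" and the mel parameters are illustrative, not from the paper.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("speech.wav")

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=256,
    n_mels=80,               # 80 mel bins, a common choice for TTS/S2ST models
)

mel = mel_transform(waveform)             # shape: (channels, n_mels, frames)
log_mel = torch.log(mel + 1e-6)           # log compression for stable training

print("log-mel shape:", tuple(log_mel.shape))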

4.2 Literature Search Methodology


Guidance: In this section, articulate your approach for conducting a literature search relevant to your
project. Specify the search terms you intend to use and explain the rationale for choosing them.
Identify the databases you will utilise for your search, including IEEE Xplore, Elsevier, ACM Digital
Library, Web of Science, or Google Scholar. Discuss your strategy for assessing the relevance and
quality of the resources you find. Finally, describe your method for recording your findings, ensuring
they can be easily referenced later. It's advisable to consult with your supervisor at the outset to
ensure you're on the right track.

Tips:

 Be precise with your search terms, as this will significantly affect the quality of resources
you discover.
 Maintain consistency when grading the significance of each resource; you may consider
employing a scoring system or rubric.
 Keep a well-organised record of your findings using citation management software, such
as Mendeley, to facilitate the writing process.

Search Terms:
• Speech-to-Speech Translation
• Neural Machine Translation
• Text-to-Speech Synthesis
• Real-Time Speech Recognition
• Voice Synthesis Models

Methods of Research:
• IEEE Xplore
• Google Scholar
• BCU Online Library
• British Online Library
• Papers with Code

(A small record-keeping sketch illustrating a consistent scoring rubric follows this list.)
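
To keep the grading of sources consistent, as the guidance above suggests, the findings could be logged in a small structured record with a simple scoring rubric. The sketch below is purely hypothetical: the fields and the 1-5 scales are example choices, not a prescribed rubric.

# Hypothetical record-keeping sketch for the literature search: each source is
# scored on example 1-5 scales and sorted by total relevance.
from dataclasses import dataclass, field

@dataclass
class SourceRecord:
    citation: str
    database: str              # e.g. "IEEE Xplore", "Google Scholar"
    relevance: int             # 1-5: closeness to the project topic
    quality: int               # 1-5: venue and peer-review quality
    notes: str = ""
    total: int = field(init=False)

    def __post_init__(self):
        self.total = self.relevance + self.quality

records = [
    SourceRecord("Jia et al., Direct speech-to-speech translation ...",
                 "Papers with Code", relevance=5, quality=5),
    SourceRecord("Jeuris and Niehues, LibriS2S ...",
                 "Papers with Code", relevance=4, quality=4),
]

for r in sorted(records, key=lambda r: r.total, reverse=True):
    print(f"{r.total:>2}  {r.citation}")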

4.3 Initial Literature Search Results


Guidance: Present a few examples of key resources you have identified during your initial literature
search. This doesn't need to be exhaustive but should be sufficient to validate that your search
methodology is effective.

Tips:

 When reporting your initial findings, aim to be clear and concise.


 Include the citation, a brief summary, and your own evaluation of each resource's relevance
or importance to your project.
 This serves as a preliminary validation of your literature search strategy and also prepares you
for a more in-depth literature review later in the project.

“Direct speech-to-speech translation with a sequence-to-sequence model” by Ye Jia*, Ron J. Weiss*, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu
Summary: This project presents a neural network model that translates speech directly from
one language to another without using intermediate text. The model is trained end-to-end,
mapping speech spectrograms into target language spectrograms, maintaining the source
speaker's voice characteristics. The experiments indicate that while it slightly
underperforms compared to traditional cascade models, it lays the groundwork for future
developments in direct speech translation.
Resource relevance: This resource emphasizes direct speech translation and preservation of the speaker's voice, aligning with my goal of developing a multilingual speech translation system.
“LibriS2S: A German-English Speech-to-Speech Translation Corpus” by Pedro Jeuris and Jan Niehues
Summary: This project introduces the LibriS2S corpus, designed to facilitate research in
speech-to-speech translation between German and English. The corpus is unique in that it
uses independently recorded audio to ensure unbiased pronunciation, enabling the
development of models that can directly generate speech signals from source language
inputs.
Resource relevance: This resource will aid in understanding the significance of training
data in speech translation systems and how to curate high-quality datasets for improved
model performance.

5 Bibliography
Guidance: Compile a list of all the references you have used in your proposal, adhering to the
Harvard referencing style. Each citation should be complete, accurate, and in the prescribed format.

Tips:

 For more guidance, consult the BCU learning services.


 Using citation management software like Mendeley can also assist in organising and formatting
your references correctly.
Note: The Bibliography does not count towards the overall word count of the project proposal.

1. A. Lavie, A. Waibel, L. Levin, M. Finke, D. Gates, M. Gavalda, T. Zeppenfeld, and P. Zhan, “JANUS-III: Speech-to-speech translation in multiple languages,” in Proc. ICASSP, 1997.

2. W. Wahlster, Verbmobil: Foundations of Speech-to-Speech Translation. Springer, 2000.

3. S. Nakamura, K. Markov, H. Nakaiwa, G.-i. Kikui, H. Kawai, T. Jitsuhiro, J.-S. Zhang, H. Yamamoto, E. Sumita, and S. Yamamoto, “The ATR multilingual speech-to-speech translation system,” IEEE Transactions on Audio, Speech, and Language Processing, 2006.
