
Selection of Optimal Solution for Example and Model of Retrieval Based Voice Conversion

Zhongxi Ren

Faculty of Data Science, City University of Macau, Macau, 999078, China

Abstract. Since 2010, computing has advanced continuously in the field of speech conversion, and speech-to-text technology has now matured, but the development of timbre conversion and imitation remains imperfect. Recently a new timbre-imitation program has become a focus of attention, yet guidance on its model-training options is still lacking. This paper trains models through hands-on operation of this program and the custom values in its model-training step. Multiple training runs of the Retrieval-Based Voice Conversion (RVC) model are carried out, and the timbre produced by models trained for different numbers of rounds is compared with the sound source. After model training, two evaluation methods were used to check model similarity: an objective method based on the Mel cepstral distortion principle, implemented in software, and a subjective method based on directly collecting human perceptual data. Similarity statistics are obtained for each method, selection criteria for a generally optimal model are derived, and relatively standard training reference values are provided for users.

Keywords: Timbre conversion, Mel cepstral distortion, Model training, Objective evaluation, Subjective evaluation.

1 Introduction

Retrieval-Based Voice Conversion (hereinafter referred to as RVC) is a new type of voice conversion program released in June 2023. Compared with traditional timbre-imitation training programs, RVC eliminates timbre leakage by replacing input-source features with training-set features through top-1 retrieval. At the same time, RVC has the following characteristics: it trains quickly even on relatively weak graphics cards; good results can be obtained from a small amount of data (collecting at least 10 minutes of low-noise speech is recommended); the timbre can be changed by model fusion (with the help of ckpt-merge in the ckpt processing tab); the web interface is easy to use; and the Ultimate Vocal Remover 5 (UVR5) model can be invoked to quickly separate vocals from accompaniment [1]. However, RVC cannot automatically select, during model training, the number of rounds that yields the highest similarity to the training sound source, so the user needs to
manually customize the number of training rounds and screen among multiple model outputs by hand. In this paper, multiple training runs of the RVC model are carried out, and the timbre produced by models trained for different numbers of rounds is compared with the sound source, in order to find the number of training rounds at which the model is most similar to the source.
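To make the retrieval mechanism concrete, the following minimal sketch (in Python, assuming NumPy arrays of content features) illustrates the top-1 retrieval idea: each input-source frame is replaced by its nearest neighbour from a training-set feature bank, so that no source timbre survives in the converted features. This is an illustration of the principle only, not RVC's actual implementation; all names and shapes are assumptions.

```python
import numpy as np

def top1_retrieval(source_feats: np.ndarray, train_feats: np.ndarray) -> np.ndarray:
    """Replace every source frame with its nearest training-set frame.

    source_feats: (T, D) content features of the input speech.
    train_feats:  (N, D) feature bank built from the training data.
    Returns a (T, D) array in which each frame is drawn from the bank.
    """
    # Squared Euclidean distances between all source frames and bank entries.
    d2 = (
        (source_feats ** 2).sum(axis=1, keepdims=True)   # (T, 1)
        - 2.0 * source_feats @ train_feats.T             # (T, N)
        + (train_feats ** 2).sum(axis=1)                 # (N,)
    )
    nearest = d2.argmin(axis=1)       # index of the top-1 match per frame
    return train_feats[nearest]

# Tiny demo with random features (shapes are purely illustrative).
converted = top1_retrieval(np.random.randn(100, 64), np.random.randn(500, 64))
print(converted.shape)                # (100, 64)
```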

2 Model training practice based on RVC

The audio source to be trained is the voice of a cartoon character, chosen because such voices have stable vocal lines and clean source audio. Since the training steps and the values set during training are exactly the same for every model, this section presents the training process of one model in detail: audio acquisition, format conversion, setting program parameters, limiting the number of training rounds, the training process itself, and the output of the training results.

2.1 Audio collection

For source material, RVC requires WAV files, and a total duration of about ten minutes is preferable. Materials with no or minimal background music are best. If background music is unavoidable, third-party software or RVC's built-in Ultimate Vocal Remover 5 routine can separate vocals from accompaniment in batch [2]. It is worth mentioning here that Ultimate Vocal Remover 5 is deep-neural-network-based source separation software whose trained models accurately separate drums, bass, vocals and other parts. Compared with RX10, RipX, and SpectraLayers, UVR5 shows clear advantages in output quality and the range of available models.
When an audio file is imported into the program (all common audio formats are recognized automatically), the UVR5 model generates two files: a vocal file and a background file (Fig. 1).

Fig. 1. Voice separation program based on UVR5 (Original)
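Because RVC expects WAV input, clips collected in other formats must be converted first. A minimal sketch of such a batch conversion using the standard ffmpeg command-line tool is shown below; the folder names and the mono 40 kHz target format are illustrative assumptions and should be matched to the actual model configuration.

```python
import pathlib
import subprocess

SRC_DIR = pathlib.Path("raw_clips")     # hypothetical folder of collected clips
DST_DIR = pathlib.Path("dataset_wav")   # hypothetical output folder
DST_DIR.mkdir(exist_ok=True)

for clip in sorted(SRC_DIR.iterdir()):
    if clip.suffix.lower() in {".mp3", ".m4a", ".flac"}:
        out = DST_DIR / (clip.stem + ".wav")
        # Decode, downmix to mono, resample, and write a PCM WAV file.
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(clip), "-ac", "1", "-ar", "40000", str(out)],
            check=True,
        )
```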



2.2 Training parameter setting

At this stage, to save hard disk space, the storage frequency of this experiment was set to once every 50 rounds (Fig. 2). At the same time, to obtain relatively comprehensive data for testing the similarity of the trained models and to enlarge the result data set, the total number of training rounds was set to a larger value of 500. 10 GB of Graphics Processing Unit (GPU) memory was allocated to the program. It is worth noting that RVC currently supports only NVIDIA graphics cards, because it relies on NVIDIA's CUDA technology. CUDA (Compute Unified Device Architecture) is a programming model developed by NVIDIA to take full advantage of its graphics processing units (GPUs) for large-scale computing. Many deep learning and machine learning libraries, such as TensorFlow and PyTorch, can directly leverage CUDA for efficient numerical computation. AMD graphics cards do not support CUDA because it is proprietary NVIDIA technology; AMD's comparable technology, ROCm, is not as widely supported or adopted as CUDA.

Fig. 2. Parameter setting (Original)
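Because RVC depends on CUDA, it is worth confirming before a run that PyTorch can actually see an NVIDIA GPU and how much memory it offers. A minimal check along these lines:

```python
import torch

# Verify that an NVIDIA GPU is visible to PyTorch before launching training.
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"CUDA device: {name}, {total_gb:.1f} GB total memory")
else:
    print("No CUDA device found - RVC training requires an NVIDIA GPU.")
```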

2.3 Program initiation

• Audio automatic segmentation


At the beginning of a training session, RVC automatically divides the single 10-minute audio file into sentence-length WAV files. The entire segmentation process can complete in as little as a minute or two. This capability rests on RVC's ability to leverage Kaldi and Mozilla DeepSpeech. Kaldi is a powerful open-source speech recognition toolkit that provides a range of tools and libraries for speech recognition; it contains components for training and testing speech recognition models and supports a variety of tasks, including automatic segmentation of audio. DeepSpeech is Mozilla's open-source speech recognition engine, based on deep learning; it provides tools for training and using speech recognition models with high accuracy and performance, and audio can be segmented automatically with it [3]. In addition, RVC can select an appropriate audio segmentation program based on the input audio.
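The effect of this segmentation step can be approximated outside RVC with a short script. The sketch below uses librosa's energy-based splitting to cut one long recording at quiet regions into sentence-length WAV clips; the top_db threshold and the file names are illustrative assumptions, not RVC's internal settings.

```python
import os

import librosa
import soundfile as sf

os.makedirs("segments", exist_ok=True)

# Load the long recording at its native sample rate.
y, sr = librosa.load("source_10min.wav", sr=None)

# Find non-silent intervals; top_db controls how quiet a gap must be to split.
intervals = librosa.effects.split(y, top_db=35)

for i, (start, end) in enumerate(intervals):
    sf.write(f"segments/{i:04d}.wav", y[start:end], sr)
```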

• Training

The RVC program trains from round 1 to round 500, saving every 50 rounds as previously set and storing each saved round's results in the logs folder. Training time varies with the clarity and duration of the audio file. In addition, RVC does not strictly require the source file to be 10 minutes long, but an overly long source file can make training excessively slow and can lead to timbre distortion in the final model [4].
After model training is complete, both the model file and the base model file are written to the logs folder.
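This save-every-50-rounds behaviour corresponds to a standard checkpointing pattern, sketched below in PyTorch. The tiny linear model and reconstruction loss are stand-ins so the loop runs on its own; the real RVC network and losses differ.

```python
import pathlib

import torch
from torch import nn, optim

LOG_DIR = pathlib.Path("logs/demo_voice")   # hypothetical experiment folder
LOG_DIR.mkdir(parents=True, exist_ok=True)

SAVE_EVERY = 50       # storage frequency chosen in section 2.2
TOTAL_EPOCHS = 500    # total number of training rounds

model = nn.Linear(16, 16)                   # stand-in for the real network
optimizer = optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 16)                     # dummy training batch

for epoch in range(1, TOTAL_EPOCHS + 1):
    optimizer.zero_grad()
    loss = ((model(x) - x) ** 2).mean()     # placeholder reconstruction loss
    loss.backward()
    optimizer.step()
    if epoch % SAVE_EVERY == 0:
        # Each saved round becomes one candidate model for later screening.
        torch.save(model.state_dict(), LOG_DIR / f"model_e{epoch}.pth")
```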

3 Results

There is no accepted standard benchmark data set for voice conversion. The Voice Conversion Challenge was held once each in 2016 and 2018, and the data sets it used are expected to become the standard [5]. Evaluation methods for voice conversion divide into objective evaluation and subjective evaluation [6]; this experiment uses both.

3.1 Objective evaluation

The objective evaluation standard for voice conversion is Mel cepstral distortion (MCD), which measures the difference between the MFCC features of the converted speech and the MFCC features of the standard output speech [7]. To measure MCD, the MFCC feature sequence of the converted speech must first be aligned with that of the standard output speech. Suppose the standard output feature of a frame is y and the corresponding feature of the converted speech is ŷ; the MCD of this frame is then defined as:
$$\mathrm{MCD}(y,\hat{y}) \;=\; \frac{10}{\ln 10}\,\sqrt{2\,\lVert y-\hat{y}\rVert^{2}} \qquad (1)$$

The unit of MCD is the decibel (dB): the coefficient 10/ln 10 in Eq. (1) converts the natural-logarithm scale of the MFCC into a common logarithm and expresses the result in decibels. Averaging the MCD over all test data then allows similarity comparison [5]. Next, we use Sound Similar Free, which is based on the MCD algorithm, to compare similarity and find the training-round model that is most similar to the sound source (Fig. 3).

Fig. 3. Single model similarity comparison results (Original)
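For readers who want to reproduce the objective measurement independently of Sound Similar Free (whose internals are not documented here), the sketch below computes an average MCD following Eq. (1): it extracts MFCCs from the reference and converted files, aligns the two sequences with dynamic time warping, and averages the per-frame distortion. Dropping the 0th (energy) coefficient is a common convention adopted here as an assumption.

```python
import librosa
import numpy as np

def mel_cepstral_distortion(ref_wav: str, conv_wav: str, n_mfcc: int = 25) -> float:
    """Average MCD in dB between a reference and a converted utterance."""
    y_ref, sr = librosa.load(ref_wav, sr=None)
    y_cnv, _ = librosa.load(conv_wav, sr=sr)

    # MFCC matrices of shape (n_mfcc - 1, frames); coefficient 0 is dropped.
    mfcc_ref = librosa.feature.mfcc(y=y_ref, sr=sr, n_mfcc=n_mfcc)[1:]
    mfcc_cnv = librosa.feature.mfcc(y=y_cnv, sr=sr, n_mfcc=n_mfcc)[1:]

    # Dynamic time warping aligns the two feature sequences frame by frame.
    _, wp = librosa.sequence.dtw(X=mfcc_ref, Y=mfcc_cnv)

    const = 10.0 / np.log(10.0)   # the 10 / ln 10 coefficient from Eq. (1)
    frame_mcd = [
        const * np.sqrt(2.0 * np.sum((mfcc_ref[:, i] - mfcc_cnv[:, j]) ** 2))
        for i, j in wp
    ]
    return float(np.mean(frame_mcd))

# Example call (file names are hypothetical):
# print(mel_cepstral_distortion("target.wav", "converted.wav"))
```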

Table 1. Timbre similarity between each model's output and the sound source

Rounds    Similarity of Model 1    Similarity of Model 2    Similarity of Model 3
5000      9.1%                     3.4%                     6.4%
10000     33.3%                    26.3%                    29.3%
15000     69%                      59%                      60%
20000     86.9%                    77.9%                    88%
25000     81%                      88%                      87%
30000     76.7%                    71.7%                    61.7%
35000     68.6%                    64.7%                    64.8%
40000     58%                      51%                      50%
The statistics show that, by the objective evaluation method, the models trained for 20,000 and 25,000 rounds are the most similar to the original sound source (Table 1). After the number of training rounds exceeds 25,000, the trained audio gradually becomes distorted. These are only software test results, however; for voice timbre, human hearing is more direct than software metrics, so a subjective assessment is conducted next.

3.2 Subjective evaluation

There are two main criteria for subjective evaluation: the sound quality of the converted speech and its similarity to the target speaker. When evaluating a single system, the mean opinion score (MOS) is generally used [8]. For sound quality, a 5-point scale is typical, with 1 the worst and 5 the best. For similarity, subjects are asked to listen to the source speaker's speech, the target speaker's speech (in varying order), and the converted speech, and to choose among the following four levels:

• The converted speech sounds more like the source speaker, and the listener is very sure;
• The converted speech sounds more like the source speaker, but the listener is uncertain (a "completely unsure" option may also be offered);
• The converted speech sounds more like the target speaker, but the listener is uncertain;
• The converted speech sounds more like the target speaker, and the listener is very sure.

When comparing two systems, they can be evaluated separately and their scores compared; preference tests can also be performed [9]. For sound quality, subjects generally listen to the output of the two systems and choose which is better. For similarity, subjects listen to the output of the two systems (in varying order) together with the target speaker's voice, and choose which system's output sounds more like the target speaker. This latter test is often called the ABX test or XAB test, where A and B refer to the outputs of the two systems and X refers to the target speaker's speech [10].
Here, the experiment set up posts on the network, placing the audio generated by the three models and the original source audio in separate posts, and launched a vote in which users judged whether the two audio clips came from the same person; a total of 103 people participated. The statistical results are as follows:

Table 2. Model 1: results of the similarity network survey for each round

Rounds    Affirmative    Uncertain    Negative
5000      0              0            103
10000     0              8            95
15000     30             13           60
20000     77             20           6
25000     91             5            7
30000     40             60           3
35000     24             66           19
40000     3              10           90

Table 3. Model 2: results of the similarity network survey for each round

Rounds    Affirmative    Uncertain    Negative
5000      0              0            103
10000     1              7            95
15000     28             13           62
20000     79             15           9
25000     96             4            3
30000     37             60           6
35000     28             61           14
40000     1              12           90

Table 4. Model 3: results of the similarity network survey for each round

Rounds    Affirmative    Uncertain    Negative
5000      0              0            103
10000     2              7            94
15000     29             14           60
20000     89             10           4
25000     96             4            3
30000     29             60           14
35000     24             67           18
40000     1              13           89
Judging from the evaluation results, both subjective and objective (Tables 2-4), the models trained for 20,000 and 25,000 rounds have the highest similarity to the original sound source. Considering the subjective results alone, every model except the 5,000-round one was judged by at least some listeners to have a similar timbre, but most listeners considered the 20,000- and 25,000-round models the most similar. In general, the timbre similarity of the model increases gradually from round 0 up to round 25,000, where it peaks. Beyond 25,000 rounds the model becomes overtrained and the timbre gradually distorts; the greater the degree of overtraining, the greater the timbre distortion. The experiment therefore finds that, as the number of training rounds increases, timbre similarity first rises and then falls.
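The rise-then-fall trend can be checked directly from the survey data. The short script below tallies the affirmative votes per round across the three models (numbers copied from Tables 2-4) and prints the affirmative share of the 103 listeners:

```python
# Affirmative votes per round for Models 1-3, copied from Tables 2-4.
votes = {
    5000:  (0, 0, 0),
    10000: (0, 1, 2),
    15000: (30, 28, 29),
    20000: (77, 79, 89),
    25000: (91, 96, 96),
    30000: (40, 37, 29),
    35000: (24, 28, 24),
    40000: (3, 1, 1),
}
N_LISTENERS = 103

for rounds, counts in votes.items():
    share = sum(counts) / (3 * N_LISTENERS)
    print(f"{rounds:>5} rounds: {share:6.1%} affirmative")
# The share peaks at 20000-25000 rounds and falls afterwards,
# matching the first-increase-then-decrease trend described above.
```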

4 Conclusion

This paper applies the RVC program in practice, records the model-training process in detail, and identifies the best training values for this program. Using both an objective and a subjective evaluation method, the optimal number of training rounds is obtained, providing a relatively objective training reference value for the new RVC program. In addition to the detailed record of the training process, this paper introduces the MCD measure for acoustic feature conversion and realizes MCD measurement in software, which may help other researchers studying voice transformation. By measuring MCD, the gap between generated and target speech can be assessed, guiding the improvement and optimization of the model. With further development of the technology, RVC or similar programs could remove the need to screen trained models manually, allowing models to be trained more efficiently and saving time and resources. However, such timbre-imitation programs also carry the risk of abuse by criminals. To ensure the legal and ethical use of the technology, the relevant authorities need to establish strict regulatory mechanisms and laws to prevent potential abuse.

References
1. Qian K, Zhang Y, Chang S, Yang X, Hasegawa-Johnson M. AutoVC: Zero-shot voice style transfer with only autoencoder loss. Proceedings of the 36th International Conference on Machine Learning. 2019. doi:10.48550/arXiv.1905.05879.
2. Toda T, Black AW, Tokuda K. Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter. 2005.
3. Nakashika T, Takiguchi T, Minami Y. Non-parallel training in voice conversion using an adaptive restricted Boltzmann machine. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2016, 24(11).
4. Hwang HT, Tsao Y, Wang HM, Wang YR, Chen SH. Incorporating global variance in the training phase of GMM-based voice conversion. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). 2013.
5. Toda T, Chen LH, Saito D, Villavicencio F, Yamagishi J. The Voice Conversion Challenge 2016. Interspeech, 2016.
6. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. NIPS, 2014.
7. Lorenzo-Trueba J, Yamagishi J, Toda T, Saito D, Villavicencio F, Kinnunen T, Ling Z. The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods. arXiv:1804.04262.
8. Hsu CC, Hwang HT, Wu YC, Tsao Y, Wang HM. Voice conversion from non-parallel corpora using variational auto-encoder. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). 2016.
9. Wu Z, Virtanen T, Kinnunen T, Chng ES. Exemplar-based voice conversion using non-negative spectrogram deconvolution. 8th ISCA Speech Synthesis Workshop. 2013: 201-206.
10. Zhu JY, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. IEEE International Conference on Computer Vision (ICCV). 2017. arXiv:1703.10593.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-
NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/),
which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's
Creative Commons license, unless indicated otherwise in a credit line to the material. If material
is not included in the chapter's Creative Commons license and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder.
