

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 32, 2024

EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma, Shaojun Wang, and Jing Xiao, Fellow, IEEE

Abstract—Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models, which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage high-quality end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the use of external aligners, invertible structures, and the complex training procedures of most previous TTS works. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve better or at least comparable speech quality compared to baseline models, while also providing faster inference speeds and smaller model sizes.

Index Terms—Text-to-speech, speech synthesis, voice conversion, differentiable aligner, VAE, hierarchical-VAE, end-to-end.

Manuscript received 26 May 2023; revised 27 October 2023; accepted 12 February 2024. Date of publication 23 February 2024; date of current version 1 March 2024. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Hema A. Murthy. (Corresponding author: Chenfeng Miao.) The authors are with Ping An Technology, Shanghai 200120, China (e-mail: miao_chenfeng@126.com; qingying.zhu@outlook.com; chenminchuan109@pingan.com.cn; majun@pingan.com.cn; swang.usa@gmail.com; xiaojing661@pingan.com.cn). Digital Object Identifier 10.1109/TASLP.2024.3369528. © 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

I. INTRODUCTION

The Text-to-Speech (TTS) task aims at producing human-like synthetic speech signals from text inputs. In recent years, sparked by the development of autoregressive (AR) models [1], [2], [3] and non-autoregressive (NAR) models [4], [5], [6], neural network systems have come to dominate the TTS field. Conventional neural TTS systems cascade two separate models: an acoustic model that transforms the input text sequence into acoustic features (e.g., a mel-spectrogram) [1], [5], followed by a neural vocoder that transforms the acoustic features into audio waveforms [7], [8]. Although two-stage TTS systems have demonstrated the capability of producing human-like speech, they come with several disadvantages. First of all, the acoustic model and the neural vocoder cannot be optimized jointly, which often hurts the quality of the generated speech. Moreover, the separate training pipeline not only complicates training and deployment but also makes it difficult to model downstream tasks.

Recently, there has been growing interest in developing one-stage text-to-waveform models that can be trained without the need for mel-spectrograms [9], [10], [11]. Among the open-sourced text-to-waveform models, VITS [11] achieves the best model performance and efficiency. However, it still has some drawbacks. Firstly, the MAS method [12] used to learn sequence alignment in VITS is excluded from the standard backpropagation process, which affects training efficiency. Secondly, in order to generate a time-aligned textual representation, VITS simply repeats each hidden text representation according to its corresponding duration. This repetition operation is non-differentiable and thus hurts the quality of the generated speech. Thirdly, VITS utilizes bijective transformations, specifically affine coupling layers [13], to compute latent representations. However, for affine coupling layers, only half of the input data gets updated after each transformation, so one has to stack multiple affine coupling layers to generate meaningful latent representations, which increases the model size and further reduces the model's efficiency. A recent work, NaturalSpeech [14], improves upon VITS by leveraging a learnable differentiable aligner and a bidirectional prior/posterior module. However, training the learnable differentiable aligner requires a warm-up stage, i.e., a pretraining process that relies on external aligners. Although the bidirectional prior/posterior module of NaturalSpeech can reduce the training-inference mismatch caused by the bijective flow module, it further increases the computational cost of training.

EfficientTTS (EFTS) [15] proposed a NAR architecture with differentiable alignment modeling that is optimized jointly with the rest of the model. EFTS develops a family of text-to-mel-spectrogram models and a text-to-waveform model. However, the performance of its text-to-waveform model is close to, but no better than, that of two-stage models. Inspired by EFTS, we propose an end-to-end text-to-waveform TTS system, EfficientTTS 2 (EFTS2), that overcomes the above issues of current one-stage models with competitive performance and higher efficiency. The main contributions of this paper are as follows:

- We propose a differentiable aligner with a hybrid attention mechanism and a variational alignment predictor, which empowers the model to learn expressive time-aligned latent representations and to control the diversity of speech rhythms. (Section IV-A)
- We introduce a 2-layer hierarchical-VAE-based waveform generator that not only produces high-quality outputs but also learns hierarchical and explainable latent variables that control different aspects of the generated speech. (Section IV-B)

- We develop an end-to-end adversarial TTS model, EFTS2, that is fully differentiable and can be trained end-to-end. It is better than or at least comparable to VITS in naturalness, and offers faster inference speed and a smaller model footprint. (Section IV)
- We extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model. The conversion performance of EFTS2-VC is comparable to a state-of-the-art model (YourTTS, [16]) while obtaining significantly faster inference speed and much more expressive speaker-independent latent representations. (Section V)
The rest of this paper is organized as follows: we begin by introducing the motivation behind this work in Section II, followed by a discussion of the background knowledge of EFTS in Section III. In Section IV, we present the proposed TTS model, EFTS2. We then extend our work to the voice conversion task and present EFTS2-VC, a voice conversion model, in Section V. In Section VI, we provide experimental results, and finally, we conclude this paper in Section VIII.

II. MOTIVATION

Our goal is to build an ideal TTS model that enables end-to-end training and high-fidelity speech generation. To achieve this, we consider two major challenges in designing the model:

(i) Differentiable aligner: TTS datasets usually consist of thousands of audio files with corresponding text scripts that are, however, not time-aligned with the audio. Many previous TTS works use either external aligners [5], [6], [17] or non-differentiable internal aligners [11], [12], [18] for alignment modeling, which complicates the training procedure and reduces the model's efficiency. (Here, an external aligner refers to a model or tool, e.g., MFA [19], that is separate from the TTS system and is usually used to obtain the text-audio alignment as a preprocessing step before training the TTS system. An internal aligner, on the other hand, is a module inside the TTS system that learns the alignment during the training phase; no preprocessing of the text-audio alignment is needed before training.) An ideal TTS model requires an internal differentiable aligner that can be optimized jointly with the rest of the network. Soft attention [20] is most commonly used to build an internal differentiable aligner. However, computing soft attention requires autoregressive decoding, which is inefficient for speech generation [9]. [10] proposes to use Gaussian upsampling and Dynamic Time Warping (DTW) for alignment learning, but training such a system is also inefficient. To the best of our knowledge, EFTS is the only NAR framework that enables both differentiable alignment modeling and high-quality speech generation. However, the alignment of EFTS is derived from the aligned position vector, which is less informative than commonly used duration models [5], [6], [11], [12]. In this work, we propose a hybrid attention method that integrates both the aligned position vector and the token duration, significantly improving the model performance.

(ii) Generative modeling framework: The goal of a generative task, such as TTS, is to estimate the probability distribution of the training data, which is usually intractable in practice. Various deep generative frameworks have been proposed to overcome this problem, including Auto-Regressive models (ARs, [20]), Normalizing Flows (NFs, [11], [21]), Denoising Diffusion Probabilistic Models (DDPMs, [22]), Generative Adversarial Networks (GANs, [23]), and Variational Auto-Encoders (VAEs, [24]). However, the first three approaches have certain limitations in either training or inference. The number of generation steps needed by AR models increases linearly with the length of the synthesized sentence, which strongly affects the inference speed for longer input sentences. NFs employ bijective transformations that often result in large model footprints. DDPMs necessitate numerous inference iterations to generate high-quality samples, leading to an inefficient inference process and limiting their use in practical applications. In this work, we propose utilizing a GAN structure with a hierarchical-VAE-based generator, which enables efficient training and high-fidelity generation.

III. BACKGROUND: OVERVIEW OF EFFICIENTTTS

In this part, we briefly describe the underlying previous work EFTS, inspired by which we build our model that simultaneously learns text-audio alignment and speech generation.

Fig. 1. Overall architecture of EFTS. The arrow with a broken line represents computation in the inference phase only.

The architecture of EFTS is shown in Fig. 1. A text encoder encodes the text sequence x ∈ R^{T_1} into a hidden vector x_h ∈ R^{T_1, D}, while a mel-encoder encodes the mel-spectrogram y ∈ R^{T_2, D_mel} into a vector y_h ∈ R^{T_2, D}. During training, the text-mel alignment is computed using a scaled dot-product attention mechanism [25], as in (1), which enables parallel computation:

\alpha = \mathrm{SoftMax}\left( \frac{y_h \cdot x_h^{\top}}{\sqrt{D}} \right) \tag{1}
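To make the attention computation in (1) concrete, the following is a minimal PyTorch sketch (our own illustration, not the released EFTS/EFTS2 code); the tensor shapes and the 192-dimensional hidden size are illustrative assumptions.

import torch

def soft_alignment(x_h: torch.Tensor, y_h: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product text-mel alignment as in (1).
    x_h: (T1, D) text hidden states, y_h: (T2, D) mel hidden states.
    Returns alpha with shape (T1, T2), where each column sums to 1 over tokens."""
    D = x_h.size(-1)
    scores = (y_h @ x_h.t()) / D ** 0.5      # (T2, T1) similarity scores
    alpha = torch.softmax(scores, dim=-1)    # normalize over input tokens
    return alpha.t()                         # transpose so it indexes as alpha[i, j]

x_h = torch.randn(12, 192)   # T1 = 12 tokens, D = 192 (assumed hidden size)
y_h = torch.randn(80, 192)   # T2 = 80 mel frames
alpha = soft_alignment(x_h, y_h)
print(alpha.shape, alpha.sum(dim=0)[:3])     # torch.Size([12, 80]); each column sums to ~1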


However, y_h is unavailable in the inference phase, making the computation of α intractable. To address this problem, [15] introduces the idea of alignment vectors, which are used to reconstruct the attention matrix through a series of non-parametric differentiable transformations:

\alpha \xrightarrow{(3)} \pi \in \mathbb{R}^{T_2} \xrightarrow{(5)-(6)} e \in \mathbb{R}^{T_1} \xrightarrow{(7)} \alpha' \tag{2}

where π ∈ R^{T_2} and e ∈ R^{T_1} are two alignment vectors and α' is the reconstructed alignment matrix. A parametric alignment predictor with output ê, the predicted e, is trained jointly given input x_h, and therefore allows tractable computation of α' in the inference phase based on ê. The alignment vector π is defined as the sum of the input indices weighted by α:

\pi_j = \sum_{i=0}^{T_1 - 1} \alpha_{i,j} \cdot i \tag{3}

where 0 ≤ i ≤ T_1 − 1 and 0 ≤ j ≤ T_2 − 1 are indices of the input and output sequence respectively. Here, π_j can be considered as the expected location that each output timestep attends to over all possible input locations. According to the conclusion of EFTS, π should follow some monotonic constraints, including:

\pi_0 = 0, \qquad 0 \le \Delta\pi_j \le 1, \qquad \pi_{T_2 - 1} = T_1 - 1 \tag{4}

where Δπ_j = π_j − π_{j−1}. Therefore, additional transformations are employed to constrain π to be monotonic ([15], (8)-(10)). It is worth noting that the alignment vector π is a compressed representation of the alignment matrix with the same length as the output. However, for a sequence-to-sequence task with inconsistent input-output lengths like TTS, it is more natural to have an input-level alignment vector during the inference phase. Thus, a differentiable re-sampling method is proposed in EFTS. Let e denote the re-sampled alignment vector; then e is computed as follows:

\gamma_{i,j} = \frac{\exp\left(-\sigma^{-2} (\pi_j - i)^2\right)}{\sum_{n=0}^{T_2 - 1} \exp\left(-\sigma^{-2} (\pi_n - i)^2\right)} \tag{5}

e_i = \sum_{n=0}^{T_2 - 1} \gamma_{i,n} \cdot n \tag{6}

Here σ is a hyper-parameter. In EFTS, e is called the aligned position vector. A Gaussian transformation is used to calculate α' from e:

\alpha'_{i,j} = \frac{\exp\left(-\sigma^{-2} (e_i - j)^2\right)}{\sum_{m=0}^{T_1 - 1} \exp\left(-\sigma^{-2} (e_m - j)^2\right)} \tag{7}

In the training phase, e is used to construct α'. In the inference phase, α' is derived from the predicted alignment vector ê. As a replacement for the original attention matrix α, the reconstructed attention matrix α' is further used to map the hidden vector x_h to the time-aligned representation x_align and produce the outputs.
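As a concrete illustration of the non-parametric pipeline in (2), here is a minimal NumPy sketch (our own reconstruction, not the official implementation) of equations (3) and (5)-(7); the value of σ and the toy sequence lengths are assumptions.

import numpy as np

def aligned_positions(alpha):
    """(3): pi_j = sum_i alpha[i, j] * i, with alpha of shape (T1, T2)."""
    T1 = alpha.shape[0]
    return (np.arange(T1)[:, None] * alpha).sum(axis=0)            # shape (T2,)

def resample_to_inputs(pi, T1, sigma=1.0):
    """(5)-(6): turn the output-level vector pi into the input-level vector e."""
    i = np.arange(T1)[:, None]                                     # (T1, 1)
    logits = -((pi[None, :] - i) ** 2) / sigma ** 2                # (T1, T2)
    gamma = np.exp(logits - logits.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)                      # rows sum to 1 over frames
    return (gamma * np.arange(len(pi))[None, :]).sum(axis=1)       # e, shape (T1,)

def reconstruct_attention(e, T2, sigma=1.0):
    """(7): Gaussian attention around each aligned position e_i."""
    j = np.arange(T2)[None, :]                                     # (1, T2)
    logits = -((e[:, None] - j) ** 2) / sigma ** 2                 # (T1, T2)
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    return w / w.sum(axis=0, keepdims=True)                        # columns sum to 1 over tokens

T1, T2 = 6, 20
alpha = np.random.dirichlet(np.ones(T1), size=T2).T                # toy (T1, T2) attention
pi = aligned_positions(alpha)
e = resample_to_inputs(pi, T1)
alpha_rec = reconstruct_attention(e, T2)
print(pi.shape, e.shape, alpha_rec.shape)                          # (20,), (6,), (6, 20)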

IV. EFFICIENTTTS 2: VARIATIONAL END-TO-END TEXT-TO-SPEECH

To better describe our model, we divide the generator, shown in Fig. 2, into two main blocks: (i) the differentiable aligner, which maps the input hidden state x_h to the time-aligned hidden representation x_align; and (ii) the hierarchical-VAE-based waveform generator, which produces the output waveform y from x_align. More details are discussed in the rest of this section.

Fig. 2. Overall architecture of EFTS2's generator. LP refers to linear projection. The dotted lines refer to training objectives.

A. Differentiable Aligner

Grounded on EFTS, we construct the differentiable aligner with two major improvements: (i) a hybrid attention mechanism and (ii) a variational alignment predictor. The structure of the differentiable aligner is shown in Fig. 2(b).

Hybrid attention mechanism: The performance of the aligner in EFTS heavily depends on the expressiveness of the reconstructed attention matrix α', which is derived from the alignment vector e. Here, e can be considered as the expected aligned position of each input token over all possible output frames. However, in the TTS task, one input token normally attends to multiple output frames. Therefore, it is better to incorporate the boundary positions of each input token when constructing the attention matrix. To this end, we introduce a hybrid attention mechanism that integrates two attention matrices: the first attention matrix α^{(1)} is derived from e as in EFTS ((2)-(7)), and the second attention matrix α^{(2)} is derived from the token boundaries using the following transformations:

\alpha \xrightarrow{(3)} \pi \xrightarrow{(10)} \left(a \in \mathbb{R}^{T_1},\, b \in \mathbb{R}^{T_1}\right) \xrightarrow{(12)} \alpha^{(2)} \tag{8}

where a ∈ R^{T_1} and b ∈ R^{T_1} are the start and end boundaries of the input tokens. We call the process from the attention matrix α to the boundary pairs (a, b) the Attention-to-Boundaries (A2B) transformation, and the process from the boundary pairs (a, b) to the reconstructed attention matrix α^{(2)} the Boundaries-to-Attention (B2A) transformation. Inspired by (5)-(6), the A2B transformation is formulated using the following equations:

\beta_{i,j} = \frac{\exp\left(-\sigma^{-2} (\pi_j - p_i)^2\right)}{\sum_{n=0}^{T_2 - 1} \exp\left(-\sigma^{-2} (\pi_n - p_i)^2\right)} \tag{9}

a_i = \begin{cases} 0, & i = 0 \\ \sum_{n=0}^{T_2 - 1} \beta_{i,n} \cdot n, & i > 0 \end{cases} \qquad b_i = \begin{cases} a_{i+1}, & i < T_1 - 1 \\ T_2 - 1, & i = T_1 - 1 \end{cases} \tag{10}

where p_i = i − 0.5 for 0 < i < T_1. Meanwhile, the B2A transformation is designed as follows:

\mathrm{energy}_{i,j} = -\sigma^{-2} \left( |j - a_i| + |b_i - j| - (b_i - a_i) \right)^2 \tag{11}

\alpha^{(2)}_{i,j} = \frac{\exp\left(\mathrm{energy}_{i,j}\right)}{\sum_{m=0}^{T_1 - 1} \exp\left(\mathrm{energy}_{m,j}\right)} \tag{12}

As can be seen, for the i-th input token with its corresponding boundaries (a_i, b_i), energy_{i,j} reaches its maximum value 0 only if the output position j falls within the boundaries, meaning a_i ≤ j ≤ b_i. For an output position outside of the boundaries, the further it is away from the boundaries, the lower the value of energy_{i,j}, resulting in less attention weight.

Note that the proposed B2A approach works for all TTS models with explicit token durations, and is potentially better than the conventional approaches: (i) compared to the repetition operation [5], [6], [11], [12], the proposed approach is differentiable and enables batch computation; (ii) compared to the popular Gaussian upsampling [10], [26], which considers only the centralized position, the proposed approach employs boundary positions, which is more informative; (iii) compared to the learnable upsampling [14], [27], the proposed approach is monotonic and much easier to train.

In preliminary experiments, we found that the model performance is greatly influenced by the choice of σ in (7) and (12). In order to obtain better model performance, we use a learnable σ in this work. We further map the hidden representation x_h to a time-aligned hidden representation x_align using an approach similar to the multi-head attention mechanism in [25]:

\mathrm{head}^{(i)} = \alpha^{(i)} \cdot \left(x_h W^{(i)}\right) \tag{13}

x_{\mathrm{align}} = \mathrm{Concat}\left(\mathrm{head}^{(1)}, \mathrm{head}^{(2)}\right) W_o \tag{14}

where {W^{(i)}} and W_o are learnable linear transformations. The resulting x_align is then fed into the hierarchical-VAE-based waveform generator as input.

Variational alignment predictor: NAR TTS models generate the entire output speech in parallel, so alignment information is required in advance. To address this problem, many previous NAR models train a duration predictor to predict the duration of each input token [5], [11]. Similarly, EFTS employs an aligned position predictor to predict the aligned position vector e. As opposed to a vanilla deterministic alignment predictor (DAP), in this work we use a variational alignment predictor (VAP) to predict the alignment vector e and the boundary positions a and b. The main motivation is to treat the alignment prediction problem as a generative problem, since one text input can be expressed with different rhythms. Specifically, the VAP encoder receives the relative distances e − a and b − a, and outputs a latent posterior distribution q_φ(z_A | e − a, b − a, x_h) conditioned on x_h, while the VAP decoder estimates the output distribution from z_A, also conditioned on x_h. The prior distribution is a standard Gaussian distribution. For simplicity, both the encoder and the decoder of the VAP are parameterized with non-causal WaveNet residual blocks. The training objective of the VAP is computed as:

L_{\mathrm{align}} = \lambda_1 \left( \left\| d_\theta^{(1)}(z_A) - \log(e - a + \epsilon) \right\|^2 + \left\| d_\theta^{(2)}(z_A) - \log(b - a + \epsilon) \right\|^2 \right) + \lambda_2\, D_{\mathrm{KL}}\left( \mathcal{N}\left(z_A;\, \mu_\phi^{(A)}(e - a, b - a, x_h),\, \sigma_\phi^{(A)}(e - a, b - a, x_h)\right) \,\middle\|\, \mathcal{N}(z_A; 0, I) \right) \tag{15}

where d_θ^{(1)} and d_θ^{(2)} are outputs of the VAP decoder, μ_φ^{(A)} and σ_φ^{(A)} are outputs of the VAP encoder, and ε is a small value used to avoid numerical instabilities. The first term in (15) is the reconstruction loss, which computes the log-scale mean square error (MSE) between the predicted relative distances and the target relative distances. The second term is the KL divergence between the posterior and prior distributions. In the inference phase, the alignment vector ê and the boundary positions â and b̂ are computed as:

\hat{b}_i = \sum_{m=0}^{i} \left( \exp\!\left( \left(d_\theta^{(2)}(z_A)\right)_m \right) - \epsilon \right), \qquad \hat{a}_i = \begin{cases} 0, & i = 0 \\ \hat{b}_{i-1}, & i > 0 \end{cases}, \qquad \hat{e}_i = \exp\!\left( \left(d_\theta^{(1)}(z_A)\right)_i \right) - \epsilon + \hat{a}_i \tag{16}

where z_A is sampled from the standard Gaussian distribution. A stop-gradient operation is added to the inputs of the VAP encoder, which helps the model learn a more accurate alignment in the training phase.
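The following NumPy sketch (our own illustration, not the authors' code) mirrors the A2B transformation (9)-(10) and the B2A transformation (11)-(12); the value of σ, the toy lengths, and the handling of a_0 follow the reconstruction above and are assumptions where the extracted equations were ambiguous.

import numpy as np

def attention_to_boundaries(pi, T1, T2, sigma=1.0):
    """A2B, (9)-(10): start boundaries a and end boundaries b from pi."""
    a = np.zeros(T1)                                      # a_0 = 0 (assumed)
    for i in range(1, T1):
        p_i = i - 0.5
        logits = -((pi - p_i) ** 2) / sigma ** 2          # over output frames
        beta_i = np.exp(logits - logits.max())
        beta_i /= beta_i.sum()
        a[i] = (beta_i * np.arange(T2)).sum()
    b = np.empty(T1)
    b[:-1] = a[1:]                                        # b_i = a_{i+1}
    b[-1] = T2 - 1                                        # last token ends at T2 - 1
    return a, b

def boundaries_to_attention(a, b, T2, sigma=1.0):
    """B2A, (11)-(12): energy peaks at 0 only when a_i <= j <= b_i."""
    j = np.arange(T2)[None, :]
    energy = -((np.abs(j - a[:, None]) + np.abs(b[:, None] - j)
                - (b - a)[:, None]) ** 2) / sigma ** 2    # (T1, T2)
    w = np.exp(energy - energy.max(axis=0, keepdims=True))
    return w / w.sum(axis=0, keepdims=True)               # normalize over tokens

T1, T2 = 6, 20
pi = np.sort(np.random.uniform(0, T1 - 1, size=T2))       # a monotonic toy pi
a, b = attention_to_boundaries(pi, T1, T2)
alpha2 = boundaries_to_attention(a, b, T2)
print(a.round(2), b.round(2), alpha2.shape)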

B. Hierarchical-VAE-Based Waveform Generator

Producing high-quality waveforms from linguistic features (e.g., texts, phonemes, or the hidden linguistic representation x_align) is known to be a particularly challenging problem. This is mainly because linguistic features do not contain enough of the information (e.g., pitch and energy) necessary for waveform generation. A primary idea is to use a VAE-based generator that learns waveform generation from a latent variable z. This z is sampled from an informative posterior distribution q_φ(z|y) parameterized by a network with acoustic features as input. A prior estimator p_θ(z|x_align) with x_align as input is also trained jointly. Training such a system amounts to minimizing the reconstruction error between the real and predicted waveforms and the KL divergence between the prior and the posterior distribution. However, the prior distribution contains no acoustic information, while the learned posterior must be informative with respect to the acoustic information. This information gap makes it hard to minimize the KL divergence between the prior and the posterior distribution. To tackle this problem, we introduce a 2-layer hierarchical VAE structure that enables an informative prior formulation. The hierarchical-VAE-based waveform generator is composed of the following blocks:

- A posterior network, which takes the linear spectrogram y_lin as input and outputs two latent Gaussian posteriors q_φ(z_1 | y_lin) and q_φ(z_2 | y_lin).
- A hierarchical prior network, which consists of two stochastic layers: the first layer receives x_align and outputs a latent Gaussian prior p_θ(z_1 | x_align); the second layer takes a latent variable z_1 and formulates another latent Gaussian prior p_θ(z_2 | z_1), where z_1 is sampled from the posterior distribution q_φ(z_1 | y_lin) in the training phase and from the prior distribution p_θ(z_1 | x_align) in the inference phase.
- A decoder, which produces the waveform from the latent variable z_2, where z_2 is sampled from the posterior distribution q_φ(z_2 | y_lin) in the training phase and from the prior distribution p_θ(z_2 | z_1) in the inference phase.

Therefore, the overall prior and posterior distributions are formulated as:

p_\theta(z \mid x_{\mathrm{align}}) = p_\theta(z_1 \mid x_{\mathrm{align}})\, p_\theta(z_2 \mid z_1) \tag{17}

q_\phi(z \mid y) = q_\phi(z_1 \mid y_{\mathrm{lin}})\, q_\phi(z_2 \mid y_{\mathrm{lin}}) \tag{18}

(We hypothesize that z_1 and z_2 represent different and conditionally independent representations. Therefore, the posterior distribution is formulated as q_φ(z_1 | y_lin) q_φ(z_2 | y_lin) instead of q_φ(z_1 | y_lin) q_φ(z_2 | z_1) in this work. Fig. 4 confirms our assumption. A similar conclusion has been drawn by [28], which states that a hierarchical VAE can learn a latent hierarchy of conditionally independent variables.)

The training objective is:

L_{\mathrm{wav}} = \lambda_3 \left\| y_{\mathrm{mel}} - \hat{y}_{\mathrm{mel}} \right\|_1 + \lambda_4 \Big( D_{\mathrm{KL}}\big( \mathcal{N}(z_1; \mu_\phi^{(1)}(y_{\mathrm{lin}}), \sigma_\phi^{(1)}(y_{\mathrm{lin}})) \,\big\|\, \mathcal{N}(z_1; \mu_\theta^{(1)}(x_{\mathrm{align}}), \sigma_\theta^{(1)}(x_{\mathrm{align}})) \big) + D_{\mathrm{KL}}\big( \mathcal{N}(z_2; \mu_\phi^{(2)}(y_{\mathrm{lin}}), \sigma_\phi^{(2)}(y_{\mathrm{lin}})) \,\big\|\, \mathcal{N}(z_2; \mu_\theta^{(2)}(z_1), \sigma_\theta^{(2)}(z_1)) \big) \Big) \tag{19}

Here, the reconstruction loss is the ℓ_1 loss between the target mel-spectrogram y_mel and the predicted mel-spectrogram ŷ_mel, which is derived from the generated waveform ŷ. In this work, the two prior estimators and the posterior estimator are all parameterized by stacks of non-causal WaveNet residual blocks that output the estimated means and variances, while the decoder is inspired by the generator of HiFi-GAN [29]. Similar to EFTS and VITS, the decoder is trained on sliced latent variables with the corresponding sliced audio segments for memory efficiency. Some previous TTS models [11], [14] also incorporate the VAE framework in end-to-end waveform generation. EFTS2 differs from them in several aspects: (i) EFTS2 uses a 2-layer hierarchical VAE while previous works use a single-layer VAE; (ii) in previous work, the KL divergence between the prior and posterior distributions is estimated between a latent variable (which is just a sample from the posterior distribution) and a multivariate Gaussian distribution, while EFTS2 computes the KL divergence between two multivariate Gaussian distributions, which allows for more efficient training; and (iii) previous works have to accompany the VAE with a bijective flow structure to produce high-quality results, while EFTS2 is bijective-free.

C. Bidirectional Prior/Posterior Training

One limitation of a hierarchical VAE is that it can have a training-inference mismatch. In the training phase, the reconstructed waveform is derived from the posterior, while in the inference phase, the synthesized waveform is predicted from the prior. Inspired by NaturalSpeech [14], we adopt a bidirectional prior/posterior training method to reduce the training-inference information gap. Specifically, we add a new training objective L_bi that minimizes the KL divergence between the enhanced prior and the posterior:

L_{\mathrm{bi}} = D_{\mathrm{KL}}\big( \mathcal{N}(z_2; \mu_\theta^{(2)}(z_1^{(p)}), \sigma_\theta^{(2)}(z_1^{(p)})) \,\big\|\, \mathcal{N}(z_2; \mu_\phi^{(2)}(y_{\mathrm{lin}}), \sigma_\phi^{(2)}(y_{\mathrm{lin}})) \big) \tag{20}

where z_1^{(p)} is sampled from the output distribution of the first prior network p_θ(z_1 | x_align). Unlike NaturalSpeech, which runs the bijective flow module bidirectionally, our method runs the second prior network twice. Since the second prior network is significantly smaller than the flow module in NaturalSpeech, the training of the proposed method is computationally cheap while still allowing for a large batch size per GPU. Note that non-differentiable models such as VITS cannot be trained bidirectionally.
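Because both sides of the KL terms in (19) and (20) are diagonal Gaussians, they can be evaluated in closed form. Below is a minimal PyTorch sketch (our own illustration, not the released code); the shapes and the tanh stand-ins for the second prior network are assumptions.

import torch

def gaussian_kl(mu_q, logs_q, mu_p, logs_p):
    """KL( N(mu_q, exp(logs_q)^2) || N(mu_p, exp(logs_p)^2) ), summed over elements."""
    kl = (logs_p - logs_q
          + 0.5 * (torch.exp(2 * logs_q) + (mu_q - mu_p) ** 2) * torch.exp(-2 * logs_p)
          - 0.5)
    return kl.sum()

# Shapes are illustrative: (batch, channels, frames).
mu_q1, logs_q1 = torch.randn(2, 192, 80), torch.randn(2, 192, 80)   # q(z1 | y_lin)
mu_p1, logs_p1 = torch.randn(2, 192, 80), torch.randn(2, 192, 80)   # p(z1 | x_align)
mu_q2, logs_q2 = torch.randn(2, 192, 80), torch.randn(2, 192, 80)   # q(z2 | y_lin)

z1_post = mu_q1 + torch.randn_like(mu_q1) * torch.exp(logs_q1)      # reparameterized sample
# The second prior p(z2 | z1) would come from a network taking z1; tanh/zeros are stand-ins.
mu_p2, logs_p2 = torch.tanh(z1_post), torch.zeros_like(z1_post)

kl_terms = gaussian_kl(mu_q1, logs_q1, mu_p1, logs_p1) \
         + gaussian_kl(mu_q2, logs_q2, mu_p2, logs_p2)               # KL part of (19)

# Bidirectional term (20): run the second prior again on a sample from p(z1 | x_align).
z1_prior = mu_p1 + torch.randn_like(mu_p1) * torch.exp(logs_p1)
mu_p2_b, logs_p2_b = torch.tanh(z1_prior), torch.zeros_like(z1_prior)
l_bi = gaussian_kl(mu_p2_b, logs_p2_b, mu_q2, logs_q2)
print(kl_terms.item(), l_bi.item())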

Fig. 3. Overall model architecture of EFTS2-VC. LP refers to linear projection. The dotted lines refer to the training objectives.

D. The Overall Model Architecture

The overall model architecture of EFTS2 is based on a GAN, which consists of a generator and multiple discriminators. We follow [29] in implementing the multiple discriminators, whose performance has been experimentally confirmed by many previous works [11], [30]. The feature matching loss L_fm is also employed for training stability. In the training phase, a phoneme sequence x is passed through a phoneme encoder to produce the latent representation x_h, while the corresponding linear spectrogram y_lin is passed through a spectrogram encoder to produce the latent representation y_h and two latent Gaussian posteriors. (Ideally, an end-to-end TTS system should operate on unnormalized text. We use external tools to convert the unnormalized texts to phonemes in this work, and will explore data-driven approaches in the future to address this limitation.) As in EFTS and VITS, the phoneme encoder is parameterized by a stack of feed-forward Transformer blocks. The proposed differentiable aligner receives the latent representations x_h and y_h and outputs the time-aligned latent representation x_align. Then x_align is further fed to the hierarchical-VAE-based waveform generator to produce the output ŷ. The overall training objective of the proposed generator G is:

L_{\mathrm{total}} = L_{\mathrm{align}} + L_{\mathrm{wav}} + L_{\mathrm{bi}} + L_{\mathrm{adv}}(G) + L_{fm}(G) \tag{21}

V. EFTS2-VC: END-TO-END VOICE CONVERSION

Voice conversion (VC) is a task that modifies a source speech signal with the target speaker's timbre while keeping the linguistic content of the source speech unchanged. The proposed voice conversion model, EFTS2-VC (shown in Fig. 3), is built upon EFTS2 with several module differences:

- The alignment predictor is excluded in EFTS2-VC, since there is no need to explicitly provide the text-spectrogram alignment in the inference phase.
- Instead of using e or the token boundaries, the reconstructed attention matrix α' is derived from π, as in (22). This not only simplifies the computation pipeline but also allows the network to obtain a more accurate text-spectrogram alignment. Similar to the TTS model, EFTS2-VC uses multiple reconstructed attentions by employing multiple learnable {σ_π^{(k)} | k = 1, ..., H}.

\alpha'_{i,j} = \frac{\exp\left(-\sigma_\pi^{-2} (\pi_j - i)^2\right)}{\sum_{m=0}^{T_1 - 1} \exp\left(-\sigma_\pi^{-2} (\pi_j - m)^2\right)} \tag{22}

- The speaker embedding of the source waveform, which is extracted from a trained speaker encoder, is introduced as a condition to the spectrogram encoder, the second prior network, and the HiFi-GAN generator.

Again, we consider the hierarchical-VAE-based framework discussed in Section IV-B. During training, the prior distribution p_θ(z_1 | x_align) is estimated by a stack of WaveNet blocks with x_align as the only input. Since x_align is a time-aligned textual representation without any information about the speaker identity, we can conclude that the prior distribution p_θ(z_1 | x_align) contains only textual information and no information about the speaker identity. This conclusion can be further extended to the posterior distribution q_φ(z_1 | y_lin), since the network is trained by minimizing the KL divergence between the prior distribution and the posterior distribution. Therefore, the spectrogram encoder works as a speaker disentanglement network that strips the speaker identity while preserving the textual (or content) information. Then the second prior network and the variational decoder reconstruct the speech from the content information and the input speaker embeddings. During inference, the disentanglement network and the reconstruction network are conditioned on different speaker embeddings. Specifically, the disentanglement network receives the spectrogram and the speaker embedding of a source speaker, and outputs a latent distribution q_φ(z_1 | y_lin). Meanwhile, the reconstruction network produces the output waveform from the latent variable z_1 and the speaker embedding of a target speaker. With these designs, EFTS2-VC performs an ideal conversion that preserves the content information of the source spectrogram while producing output speech that matches the speaker characteristics of the target speaker.

VI. EXPERIMENTS

To measure the performance of EFTS2, we evaluate it on two tasks: single-speaker TTS and multi-speaker TTS. Ablation studies are performed under the single-speaker TTS setting. We also examine the model from several different aspects: comparison of different aligners, model size and inference speed, analysis of the latent hierarchy, and visualization of the attention matrices. Last but not least, we also evaluate the performance of the EFTS2-VC model. The audio samples (https://mcf330.github.io/efts2audiosamples/) and source code (https://github.com/mcf330/efts2code) are available on GitHub.

A. Experimental Setup

Datasets: Three public datasets are used in our experiments: the LJ Speech dataset [31], the VCTK dataset [32], and the LibriTTS dataset [33]. The LJ Speech dataset is an English speech corpus consisting of 13,100 audio clips of a single female speaker. Each audio file is single-channel 16-bit PCM with a sampling rate of 22050 Hz. The VCTK dataset is a multi-speaker English speech corpus that contains 44 hours of audio clips of 108 native speakers with various accents. The original audio format is 16-bit PCM with a sample rate of 44 kHz. The LibriTTS dataset is another multi-speaker English speech dataset, with an original sampling rate of 24,000 Hz. For the multi-speaker TTS task, we train EFTS2 on a minimal version of LibriTTS, train-clean-100, which contains 53 hours of audio clips of 247 native speakers. In our experiments, all audio clips are converted to 16-bit and down-sampled to 22050 Hz. All datasets are randomly split into a training set, a validation set, and a test set.

Preprocessing: The linear spectrograms of the original audio are used as the input of the spectrogram encoder. The FFT size, hop size, and window size used in the Short-Time Fourier Transform (STFT) to obtain linear spectrograms are set to 1024, 256, and 1024 respectively. Before training, the text sequences are converted to phoneme sequences using the open-sourced phonemizer software (https://github.com/bootphon/phonemizer), and the converted sequences are interspersed with a blank token following the implementation of VITS [11].

Configurations: The phoneme encoder of EFTS2 is a stack of 6 Feed-Forward Transformer (FFT) blocks, where each FFT block consists of a multi-head attention layer with 2 attention heads and a convolutional feed-forward layer with a hidden size of 192. The rest of EFTS2 is composed of stacks of non-causal WaveNet residual blocks. The kernel size is 5 and the dilation rate is 1 for all the WaveNet layers. EFTS2-VC shares similar model configurations with EFTS2, except that the variational alignment predictor is excluded from EFTS2-VC. In order to obtain better speech quality, the variances of the variational priors are multiplied by different scaling factors at the inference stage. Specifically, the scaling factor on the variance of the alignment, t_A, is set to 0.7, and the scaling factors on the variances of the latent distributions p_θ(z_1) and p_θ(z_2), denoted as t_1 and t_2, are set to 0.8 and 0.3 respectively. The deterministic alignment predictor (DAP), which takes x_h as input and outputs the alignment vectors (e.g., â, b̂, ê), is parameterized by 2 convolution layers and a linear mapping. Each convolution layer is followed by a layer normalization and a leaky ReLU activation. The trained speaker encoder of EFTS2-VC is a speaker recognition model [34] trained on the VoxCeleb2 [35] dataset; the pre-trained model is publicly available [34]. For the VC task, YourTTS [16], a publicly available pre-trained model trained on the VCTK dataset, is used as the baseline model. For a fair comparison, we down-sampled the generated audios of EFTS2-VC to 16 kHz to match the sample rate of YourTTS' generated audios during the evaluation process. The hyper-parameters of EFTS2 and EFTS2-VC are listed in Table I.

Training: Both EFTS2 and EFTS2-VC are trained on 4 Tesla V100 GPUs with 16 GB memory. The batch size on each GPU is set to 32. The AdamW optimizer [36] with β1 = 0.8 and β2 = 0.99 is used to train the models. The initial learning rate is set to 2 × 10^{-4} and decays at every training epoch with a decay rate of 0.998. Both models converge at around the 500k-th step.
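The preprocessing and optimizer settings above translate directly into a few lines of PyTorch; the sketch below (our own illustration, with a dummy model and random audio standing in for the real pipeline) uses the stated FFT/hop/window sizes of 1024/256/1024, AdamW with betas (0.8, 0.99), an initial learning rate of 2e-4, and a per-epoch decay of 0.998.

import torch

def linear_spectrogram(wav: torch.Tensor, n_fft=1024, hop=256, win=1024):
    window = torch.hann_window(win)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=window, center=True, return_complex=True)
    return spec.abs()                      # magnitude, shape (n_fft // 2 + 1, frames)

wav = torch.randn(22050)                   # 1 s of 22050 Hz audio (dummy)
y_lin = linear_spectrogram(wav)
print(y_lin.shape)                         # (513, frames)

model = torch.nn.Linear(513, 80)           # placeholder for the real generator
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.8, 0.99))
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.998)
for epoch in range(3):                     # training loop skeleton
    loss = model(y_lin.t()).pow(2).mean()  # stand-in loss
    opt.zero_grad(); loss.backward(); opt.step()
    sched.step()                           # decay the learning rate once per epoch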

TABLE I. Hyper-parameters of EFTS2 and EFTS2-VC.

TABLE II. MOS results from baseline models, the ablation studies, and EFTS2 on LJ-Speech. Here, -HVAE represents using a 1-layer VAE, -HA represents using a single attention derived from token boundaries, and -BI represents removing the bidirectional training objective.

Evaluation Metrics: We employed both subjective and objective evaluations to examine the performance of EFTS2 and EFTS2-VC. For the subjective evaluations, we report the following metrics:

- Mean Opinion Score (MOS): Audio samples from the test sets are randomly chosen and provided to raters. Raters are asked to rate the naturalness of the audio samples.
- Comparative Mean Opinion Score (CMOS): Paired audio samples are randomly provided to raters for side-by-side comparison. Audio samples within each pair are generated by different models on the same text. Raters are asked to rate the naturalness of the audio samples.
- Similarity Mean Opinion Score (Sim-MOS): For the voice conversion task, audio samples are provided to raters in groups. Within each group, audio samples are generated by different models while converting to the same target speaker, and one audio clip from the target speaker is also given as a reference. Raters are asked to rate the speaker similarity of the converted samples to the reference audio samples.

For the above three tests, 15 raters were asked to rate the naturalness (or similarity, accordingly) of the given audio samples on a 5-point scale (1 = Bad; 2 = Poor; 3 = Fair; 4 = Good; 5 = Excellent) with rating increments of 0.5. For the objective evaluations, the following metrics are reported:

- Word Error Rate (WER): We evaluated the word error rate using a pretrained automatic speech recognition (ASR) system.
- Mel-Cepstral Distortion (MCD), which measures the difference between the generated mel-cepstra and the ground-truth mel-cepstra [37].
- Cosine Similarity (COS-Sim), which computes the cosine similarity score between the speaker embeddings of the target speech and the converted speech [16].
- Kullback-Leibler Divergence (KLD), which computes the Kullback-Leibler divergence between the latent variables extracted from the spectrogram encoder and the latent variables from the phoneme encoder.

B. Single-Speaker TTS and Ablation Studies

In this subsection, we compare the quality of audio samples generated by EFTS2 and the baseline models. The baseline models are the best-performing publicly available model, VITS [11], and the best-performing model in the EFTS family, EFTS-CNN + HiFi-GAN, with our own implementation. Ablation studies are also conducted to validate our design choices. For each of the model settings, 40 utterances are randomly generated for the test. The evaluation results in both objective metrics (WER and MCD) and subjective MOS on the LJ-Speech dataset [31] are shown in Table II. EFTS2 significantly outperforms the 2-stage EFTS-CNN and is slightly better than VITS, for both objective and subjective evaluations. Although the difference between the MOS score of EFTS2 and that of VITS is not significant, the scores of both models are still comparable to that of the ground-truth audios, meaning that the speech quality of EFTS2 and VITS is very close to natural speech. Ablation studies confirm the importance of our design choices. Removing either the hierarchical-VAE structure (-HVAE) or the hybrid attention mechanism (-HA) leads to a significant MOS decrease and an increase of both WER and MCD. Removing the bidirectional training (-BI) slightly decreases the MOS score. Although removing the variational alignment predictor (-VAP) slightly decreases the WER and MCD, it lacks diversity, which is very important for speech generation.

TABLE III. CMOS results on the VCTK dataset and the LibriTTS dataset.

C. Multi-Speaker TTS

As EFTS2 has a very similar MOS score to VITS on the LJ-Speech dataset, we further compare EFTS2 with the baseline models on two challenging multi-speaker datasets, the VCTK dataset [32] and the LibriTTS dataset [33]. For the multi-speaker setting, the one-hot speaker embeddings are fed to the HiFi-GAN generator and all the non-causal WaveNet residual blocks in EFTS2 as global conditions. The rest of the settings are the same as in the single-speaker setting. Twenty randomly selected sentences are generated by each model to conduct the CMOS test, and the results are shown in Table III. As shown in Table III, the advantage of EFTS2 over VITS is slight but still considerable in a side-by-side evaluation. The improvement of EFTS2 over EFTS-CNN + HiFi-GAN is significant.
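As a reference for the MCD numbers reported in Tables II and VII, the following is a small sketch of one commonly used mel-cepstral distortion definition (in the spirit of [37]); dropping the 0th (energy) coefficient and assuming the frames are already time-aligned are implementation choices on our side, not necessarily the paper's exact setup.

import numpy as np

def mcd(c_ref: np.ndarray, c_gen: np.ndarray) -> float:
    """c_ref, c_gen: (frames, n_mcep) mel-cepstra, already time-aligned."""
    diff = c_ref[:, 1:] - c_gen[:, 1:]                 # drop the 0th coefficient
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / np.log(10.0)) * per_frame.mean())

c_ref = np.random.randn(200, 25)                       # dummy reference mel-cepstra
c_gen = c_ref + 0.05 * np.random.randn(200, 25)        # dummy generated mel-cepstra
print(round(mcd(c_ref, c_gen), 3))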

D. Comparison of Different Aligners

To verify the effectiveness of the proposed aligner, we conducted the CMOS test with the following comparison models:

- Non-differentiable approaches. ND-E-Repetition: external aligner with repeated upsampling [5], [6]; and ND-I-MAS: internal aligner using MAS [11].
- Differentiable approaches. D-E-Gaussian-Central: external aligner using the upsampling approach proposed by EATS [10], [26]; D-E-Gaussian-Boundaries: external aligner using the proposed B2A approach following (12); D-I-Learnable: internal aligner with a single learnable attention [27], where the token boundaries are derived from π following (10); D-I-Gaussian-e: internal aligner whose attention is derived from the alignment vector e [15]; D-I-Gaussian-Boundaries: internal aligner whose attention is derived from the token boundaries following (12); D-Hybrid: the proposed hybrid attention.

All these models are built using the 2-layer hierarchical-VAE-based waveform generator and the deterministic alignment predictor (DAP). The bidirectional prior/posterior training objective is also excluded from training for a fair comparison. For the two non-differentiable models, the first convolutional prior network is excluded; their first variational prior is formulated by first mapping the text hidden representation x_h to an input-level Gaussian prior distribution and then expanding the Gaussian prior through repetition. The phoneme durations are extracted using MAS [11] for all models using external aligners. For this task, 15 sentences were randomly selected and generated by each of the model settings, and the CMOS results between our method (D-Hybrid) and the other aligners are presented in Table IV. We have the following observations: 1) the model using learnable upsampling, D-I-Learnable, does not converge at all, while the other models are able to produce reasonable results; 2) the models with internal aligners outperform those using external aligners even when facilitated with the same upsampling approach, which demonstrates the importance of jointly learning alignment and speech generation. The proposed approach D-I-Gaussian-Boundaries, which uses Gaussian attention derived from token boundaries, significantly outperforms the other upsampling approaches. The best model, D-Hybrid, which combines D-I-Gaussian-e and D-I-Gaussian-Boundaries, further boosts the model performance. 3) D-Hybrid achieves a performance gain of 0.1 over ND-I-MAS, verifying the significance of the proposed differentiable aligner over MAS. We also notice that ND-I-MAS performs worse than VITS, because VITS achieves comparable speech quality to our best-performing model while there is a notable performance gap between ND-I-MAS and D-Hybrid. One assumption is that the repeated latent variable of ND-I-MAS has very similar local representations, which often require a very large receptive field size for the decoder network.

TABLE IV. CMOS results between our hybrid attention and other aligners on the LJ-Speech dataset.

E. Model Size and Inference Speed

The inductive biases of the proposed hierarchical-VAE-based generator make the overall model smaller and significantly faster than the baseline models. The model size and inference speed of EFTS2, along with those of the baseline models, are presented in Table V. Since EFTS2's generator employs a significantly smaller number of convolution blocks than VITS, the inference speed is greatly improved. Specifically, EFTS2 is capable of synthesizing 22.05 kHz speech 101.92× faster than real time, which is 1.5× faster than VITS.

TABLE V. Comparison of the number of parameters and inference speed between EFTS2 and baseline models.

F. Analysis of the Latent Hierarchy

One question from Section IV-B is whether the hierarchical architecture of the proposed generator empowers the model to have controllable diversity in hierarchical and explainable latent variables. To verify this statement, we designed the following experiment. As mentioned above, t_A is the scaling factor on the variance of the alignment, while t_1 and t_2 are two scaling factors applied to the variances of the latent distributions p_θ(z_1) and p_θ(z_2) respectively. In this experiment, we picked three sets of values of (t_1, t_2) and synthesized one fixed sentence 5 times under each set of (t_1, t_2), while fixing t_A = 0 throughout the experiment. In other words, all waveforms generated in this experiment share exactly the same x_align. Then, for each set of (t_1, t_2), 5 pairs of z_1 and z_2 are sampled and used to synthesize waveforms. The F0 contours of these waveforms are visualized in Fig. 4. As shown in the figure, increasing t_1 considerably increases the variation of F0, whereas a large t_2 barely produces any variation in F0 when t_1 = 0. This means essential acoustic features such as F0 are mostly fixed after z_1 is sampled. In other words, p_θ(z_1) is a linguistic latent representation offering variations in the spectrogram-domain acoustic information, while p_θ(z_2) contains the spectrogram-domain acoustic information and offers variations in time-domain information. This is important because, though we did not explicitly give the model any constraint, it still learns hierarchical and explainable latent representations with controllable diversity.
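The role of t_A, t_1 and t_2 is simply to scale the standard deviation used when sampling from the corresponding Gaussian; a minimal sketch (our own illustration) is shown below.

import torch

def sample_with_temperature(mu, log_sigma, t):
    """Draw z = mu + t * sigma * eps, so t acts as a temperature on the variance."""
    return mu + t * torch.exp(log_sigma) * torch.randn_like(mu)

mu, log_sigma = torch.zeros(10000), torch.zeros(10000)   # toy prior N(0, I)
for t in (0.0, 0.3, 0.8):
    z = sample_with_temperature(mu, log_sigma, t)
    print(f"t={t:.1f}  empirical std ~ {z.std().item():.3f}")   # scales roughly as t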

Fig. 4. F0 contours obtained from the test samples generated by EFTS2 with different t_1. Each subplot represents the F0 contours of five utterances, each with a different color, generated under the marked (t_1, t_2) values.

G. Analysis of the Speech Variation and Visualization of the Attention Matrices

To verify that our model has controllable diversity in speech rhythms, we examined how many different utterance lengths the proposed differentiable aligner produces with different values of t_A. One sentence is randomly selected from the test set and generated 100 times for each of t_A = 0.2, 0.5, 1.0. The corresponding histograms of the lengths of the generated utterances are shown in Fig. 5. As can be seen, there is more variation in speech length as the value of t_A increases. In other words, we can control the degree of diversity in the speech rhythms of the synthesized speech by using different values of t_A.

Fig. 5. The histograms of utterance lengths generated by EFTS2 with different values of t_A. The sentence "Mrs. De Mohrenschildt thought that Oswald," is synthesized 100 times for each value of t_A. When t_A is set to 0, the utterances all have a length of 231 frames.

The two attention matrices of EFTS2 are visualized in Fig. 6. As can be seen, in either subplot, the brighter pixels clearly form a continuous diagonal line without skipping or rewinding. This confirms that both of the attention matrices are well learned. The line in subplot 6(a) is smoother and more blurry, while the line in subplot 6(b) is clearer, with distinct yellow blocks. This indicates that α^{(1)} learns a smoother overall alignment, while α^{(2)} learns the specific duration boundaries of the input tokens.
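A small matplotlib sketch (our own illustration, with random data standing in for real model outputs) of the two diagnostics used in this subsection, a length histogram in the style of Fig. 5 and an attention heatmap in the style of Fig. 6, is given below.

import numpy as np
import matplotlib.pyplot as plt

lengths = np.random.normal(231, 8, size=100).round()       # utterance lengths in frames (dummy)
attention = np.random.dirichlet(np.ones(12), size=80).T    # (tokens, frames), columns sum to 1

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.hist(lengths, bins=15)
ax1.set_xlabel("utterance length (frames)")
ax1.set_ylabel("count")
ax2.imshow(attention, aspect="auto", origin="lower")
ax2.set_xlabel("decoder timestep (output frame)")
ax2.set_ylabel("encoder timestep (input token)")
plt.tight_layout()
plt.savefig("diagnostics.png")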

Fig. 6. Visualization of the attention matrices of EFTS2. Each subplot represents one attention matrix, with the horizontal and vertical axes representing the
decoder and encoder timestep accordingly. Here the decoder and encoder timestep could be understood as the index of the output mel-spectrogram sequence and
the input phoneme-token sequence. Values within the matrix are represented by colors corresponding to the color chart on the right. A brighter color corresponds
to a larger value and indicates a larger possibility that this output frame attends to this input token. Subplot (a) is the attention matrix α(1) , which is reconstructed
using e. Subplot (b) is the attention matrix α(2) , which is reconstructed using boundary pairs (a, b).

TABLE VI. MOS and similarity MOS for voice conversion experiments on the VCTK dataset.

TABLE VII. Objective evaluations on the VCTK dataset.

H. Voice Conversion Evaluation

The conversion performance of EFTS2-VC is evaluated on the VCTK dataset [32] with a comparison to the baseline model YourTTS [16]. For both the seen and unseen target speaker settings, 25 converted utterances from each model are collected to conduct the MOS and Sim-MOS tests, and the results are presented in Table VI. EFTS2-VC achieves slightly better MOS scores and comparable Sim-MOS scores on both seen and unseen speakers. Note that the conversion of YourTTS requires running the flow module bidirectionally, which results in a slow conversion speed. On the other hand, EFTS2-VC is significantly faster: it runs 2.15× faster than YourTTS on a Tesla V100 GPU. To further evaluate the ability of EFTS2-VC to disentangle the content information and speaker-related information, objective evaluations on WER and COS-Sim are reported in Table VII. In addition, as both EFTS2-VC and YourTTS address the disentanglement problem by minimizing the KL divergence between the latent representation and the content representation, we also include the KL divergence (KLD) for comparison. All objective evaluations are conducted on the same utterances as in Table VI. As presented in Table VII, EFTS2-VC offers better WER and KLD for both seen and unseen speakers, a better COS-Sim score for seen speakers, and a comparable COS-Sim score for unseen speakers. Specifically, EFTS2-VC achieves significantly lower KLD than YourTTS, indicating that the latent variable of EFTS2-VC is closer to a pure text representation, which demonstrates the superior disentanglement capability of EFTS2-VC.

TABLE VIII. Comparison with previous text-to-waveform models.

VII. ADVANTAGES OF EFTS2

In Table VIII we compare the advantages of EFTS2 with previous text-to-waveform models in terms of training pipeline, differentiability, model performance, and model efficiency. EFTS2 is the only differentiable model that allows for end-to-end training, high-quality generation, and high efficiency.

VIII. CONCLUSION AND DISCUSSION

We presented EfficientTTS 2 (EFTS2), a novel end-to-end TTS model that adopts an adversarial training process, with a generator composed of a differentiable aligner and a hierarchical-VAE-based speech generator. Compared to baseline models, EFTS2 is fully differentiable and enjoys a smaller model size and higher efficiency, while still allowing high-fidelity speech generation with controllable diversity. Moreover, we extend EFTS2 to the VC task and propose a VC model, EFTS2-VC, that is capable of efficient and high-quality end-to-end voice conversion.

The primary goal of this work is to build a competitive TTS model that allows for end-to-end high-quality speech generation. In the meantime, the proposed design choices can easily be incorporated into other TTS frameworks. Firstly, the proposed B2A approach could potentially be a handier replacement for conventional upsampling techniques in nearly all NAR TTS models, given that it is differentiable, informative, and computationally cheap. Secondly, the differentiable aligner may be a superior alternative to any external aligner or non-differentiable aligner, as it improves the uniformity of the model and makes the training process end-to-end. Thirdly, the 2-layer hierarchical-VAE-based waveform generator can potentially outperform the popular flow-VAE-based counterpart [11], [14], since it is more efficient and offers more flexibility in network design. Lastly, and most importantly, the entire architecture of EFTS2 could serve as a practical solution to sequence-to-sequence tasks that have the nature of monotonic alignments. We leave these assumptions to future work while providing our implementations as a research basis for further exploration.

REFERENCES

[1] Y. Wang, R. Skerry-Ryan, D. Stanton, R. J. W. Y. Wu, N. Jaitly, and Z. Yang, "Tacotron: Towards end-to-end speech synthesis," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 4006–4010.
[2] J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 4779–4783.
[3] W. Ping et al., "Deep Voice 3: 2000-speaker neural text-to-speech," in Proc. Int. Conf. Learn. Representations, 2018, pp. 214–217.
[4] C. Miao, S. Liang, M. Chen, J. Ma, S. Wang, and J. Xiao, "Flow-TTS: A non-autoregressive network for text to speech based on flow," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 7209–7213.
[5] Y. Ren et al., "FastSpeech: Fast, robust and controllable text to speech," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, pp. 3171–3180.
[6] Y. Ren et al., "FastSpeech 2: Fast and high-quality end-to-end text to speech," in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–9.
[7] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 5891–5895.
[8] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 6199–6203.
[9] R. J. Weiss, R. Skerry-Ryan, E. Battenberg, S. Mariooryad, and D. P. Kingma, "Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 5679–5683.
[10] J. Donahue, S. Dieleman, M. Binkowski, E. Elsen, and K. Simonyan, "End-to-end adversarial text-to-speech," in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–9.
[11] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 5530–5540.
[12] J. Kim, S. Kim, J. Kong, and S. Yoon, "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search," in Proc. 34th Int. Conf. Neural Inf. Process. Syst., 2020, pp. 8067–8077.
[13] L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear independent components estimation," 2014, arXiv:1410.8516.
[14] X. Tan et al., "NaturalSpeech: End-to-end text to speech synthesis with human-level quality," IEEE Trans. Pattern Anal. Mach. Intell., early access, Jan. 19, 2024, doi: 10.1109/TPAMI.2024.3356232.
[15] C. Miao et al., "EfficientTTS: An efficient and high-quality text-to-speech architecture," in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 7700–7709.
[16] E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti, "YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone," in Proc. 39th Int. Conf. Mach. Learn., 2022, pp. 2709–2720.
[17] N. Chen et al., "WaveGrad 2: Iterative refinement for text-to-speech synthesis," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 3769–3769.
[18] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, "Grad-TTS: A diffusion probabilistic model for text-to-speech," in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 8599–8608.
[19] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 498–502.
[20] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. Int. Conf. Learn. Representations, 2015, pp. 1–9.
[21] D. P. Kingma and P. Dhariwal, "Glow: Generative flow with invertible 1x1 convolutions," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1–9.
[22] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 6840–6851.
[23] I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1–9.
[24] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Representations, 2014, pp. 1–9.
[25] A. Vaswani et al., "Attention is all you need," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[26] J. Shen et al., "Non-Attentive Tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling," 2020, arXiv:2010.04301.
[27] I. Elias et al., "Parallel Tacotron 2: A non-autoregressive neural TTS model with differentiable duration modeling," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 141–145.
[28] R. Child, "Very deep VAEs generalize autoregressive models and can outperform them on images," in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–9.
[29] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 17022–17033.
[30] J. You et al., "GAN Vocoder: Multi-resolution discriminator is all you need," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 2177–2181.
[31] K. Ito, "The LJ Speech dataset," 2017. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/
[32] J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2019.
[33] H. Zen et al., "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in Proc. 20th Annu. Conf. Int. Speech Commun. Assoc., 2019, pp. 1526–1530.
[34] H. S. Heo, B.-J. Lee, J. Huh, and J. S. Chung, "Clova baseline system for the VoxCeleb speaker recognition challenge 2020," 2020, arXiv:2009.14153.
[35] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2018, pp. 1086–1090.
[36] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–9.
[37] R. Kubichek, "Mel-cepstral distance measure for objective speech quality assessment," in Proc. IEEE Pacific Rim Conf. Commun., Comput. Signal Process., 1993, pp. 125–128.
[38] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–9.
