Abstract—Recently, the field of Text-to-Speech (TTS) has been dominated by one-stage text-to-waveform models, which have significantly improved speech quality compared to two-stage models. In this work, we propose EfficientTTS 2 (EFTS2), a one-stage high-quality end-to-end TTS framework that is fully differentiable and highly efficient. Our method adopts an adversarial training process, with a differentiable aligner and a hierarchical-VAE-based waveform generator. These design choices free the model from the use of external aligners, invertible structures, and complex training procedures that most previous TTS works require. Moreover, we extend EFTS2 to the voice conversion (VC) task and propose EFTS2-VC, an end-to-end VC model that allows high-quality speech-to-speech conversion. Experimental results suggest that the two proposed models achieve better or at least comparable speech quality compared to baseline models, while also providing faster inference speeds and smaller model sizes.

Index Terms—Text-to-speech, speech synthesis, voice conversion, differentiable aligner, VAE, hierarchical-VAE, end-to-end.

Manuscript received 26 May 2023; revised 27 October 2023; accepted 12 February 2024. Date of publication 23 February 2024; date of current version 1 March 2024. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Hema A. Murthy. (Corresponding author: Chenfeng Miao.) The authors are with Ping An Technology, Shanghai 200120, China (e-mail: miao_chenfeng@126.com; qingying.zhu@outlook.com; chenminchuan109@pingan.com.cn; majun@pingan.com.cn; swang.usa@gmail.com; xiaojing661@pingan.com.cn). Digital Object Identifier 10.1109/TASLP.2024.3369528. © 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

I. INTRODUCTION

The text-to-speech (TTS) task aims at producing human-like synthetic speech signals from text inputs. In recent years, sparked by the development of autoregressive (AR) models [1], [2], [3] and non-autoregressive (NAR) models [4], [5], [6], neural network systems have dominated the TTS field. Conventional neural TTS systems cascade two separate models: an acoustic model that transforms the input text sequences into acoustic features (e.g., mel-spectrograms) [1], [5], followed by a neural vocoder that transforms the acoustic features into audio waveforms [7], [8]. Although two-stage TTS systems have demonstrated the capability of producing human-like speech, they come with several disadvantages. First of all, the acoustic model and the neural vocoder cannot be optimized jointly, which often hurts the quality of the generated speech. Moreover, the separate training pipeline not only complicates training and deployment but also makes it difficult to model downstream tasks.

Recently, there has been growing interest in developing one-stage text-to-waveform models that can be trained without the need for mel-spectrograms [9], [10], [11]. Among all the open-sourced text-to-waveform models, VITS [11] achieves the best model performance and efficiency. However, it still has some drawbacks. Firstly, the MAS method [12] used to learn sequence alignment in VITS is precluded from the standard back-propagation process, thus affecting training efficiency. Secondly, in order to generate a time-aligned textual representation, VITS simply repeats each hidden text representation by its corresponding duration. This repetition operation is non-differentiable, thus hurting the quality of the generated speech. Thirdly, VITS utilizes bijective transformations, specifically affine coupling layers [13], to compute latent representations. However, for affine coupling layers, only half of the input data gets updated after each transformation. Therefore, one has to stack multiple affine coupling layers to generate meaningful latent representations, which increases the model size and further reduces the model's efficiency. A recent work, NaturalSpeech [14], improves upon VITS by leveraging a learnable differentiable aligner and a bidirectional prior/posterior module. However, the training of the learnable differentiable aligner requires a warm-up stage, which is a pretraining process relying on external aligners. Although the bidirectional prior/posterior module of NaturalSpeech can reduce the training-inference mismatch caused by the bijective flow module, it further increases the model's computational cost of training.

Another recent work, EfficientTTS (EFTS) [15], proposed a NAR architecture with differentiable alignment modeling that is optimized jointly with the rest of the model. In EFTS, a family of text-to-mel-spectrogram models and a text-to-waveform model are developed. However, the performance of the text-to-waveform model is close to, but no better than, two-stage models. Inspired by EFTS, we propose an end-to-end text-to-waveform TTS system, EfficientTTS 2 (EFTS2), that overcomes the above issues of current one-stage models with competitive model performance and higher efficiency. The main contributions of this paper are as follows:
• We propose a differentiable aligner with a hybrid attention mechanism and a variational alignment predictor, which empowers the model to learn expressive time-aligned latent representations and to have controllable diversity in speech rhythms. (Section IV-A)
• We introduce a 2-layer hierarchical-VAE-based waveform generator that not only produces high-quality outputs but also learns hierarchical and explainable latent variables that control different aspects of the generated speech. (Section IV-B)
Fig. 2. Overall architecture of EFTS2’s generator. LP refers to linear projection. The dotted lines refer to training objectives.
where π ∈ R^{T_2} and e ∈ R^{T_1} are two alignment vectors and α is the reconstructed alignment matrix. A parametric alignment predictor, whose output ê is the predicted e, is trained jointly given the input x_h, and therefore allows tractable computation of α in the inference phase based on ê. Let e denote the re-sampled alignment vector; then e is computed as follows:

\gamma_{i,j} = \frac{\exp(-\sigma^{-2}(\pi_j - i)^2)}{\sum_{n=0}^{T_2-1} \exp(-\sigma^{-2}(\pi_n - i)^2)}    (5)

The generator of EFTS2 consists of two main components: (i) the differentiable aligner, which maps the input hidden state x_h to a time-aligned hidden representation x_align; and (ii) the hierarchical-VAE-based waveform generator, which produces the output waveform y from x_align. More details will be discussed in the rest of this section.

A. Differentiable Aligner

Grounded on EFTS, we construct the differentiable aligner with two major improvements: (i) a hybrid attention mechanism and (ii) a variational alignment predictor. The structure of the differentiable aligner is shown in Fig. 2(b).

Hybrid attention mechanism: The performance of the aligner in EFTS heavily depends on the expressiveness of the reconstructed attention matrix α, which is derived from the alignment vector e. Here, e can be considered as the expected aligned position of each input token over all possible output frames. However, in the TTS task, one input token normally attends to multiple output frames. Therefore, it is better to incorporate the boundary positions of each input token when constructing the attention matrix. To this end, we introduce a hybrid attention mechanism that integrates two attention matrices: the first attention matrix α^{(1)} is derived from e as in EFTS ((2)–(7)), and the second attention matrix α^{(2)} is derived from the token boundaries using the following transformations:

\alpha \xrightarrow{(3)} \pi \xrightarrow{(10)} a \in \mathbb{R}^{T_1},\ b \in \mathbb{R}^{T_1} \xrightarrow{(12)} \alpha^{(2)}    (8)

where a ∈ R^{T_1} and b ∈ R^{T_1} are the start and end boundaries of the input tokens. We call the process from the attention matrix α to the boundary pairs (a, b) the Attention-to-Boundaries (A2B) transformation, and the process from the boundary pairs (a, b) to the reconstructed attention matrix α^{(2)} the Boundaries-to-Attention (B2A) transformation. Inspired by ((5)–(6)), the A2B transformation is formulated using the following equations:

\beta_{i,j} = \frac{\exp(-\sigma^{-2}(\pi_j - p_i)^2)}{\sum_{n=0}^{T_2-1} \exp(-\sigma^{-2}(\pi_n - p_i)^2)},    (9)

a_i = \sum_{n=0}^{T_2-1} \beta_{i,n} \cdot n, \qquad b_i = \begin{cases} a_{i+1}, & i < T_1 - 1 \\ T_2 - 1, & i = T_1 - 1 \end{cases}    (10)

where p_i = i - 0.5 for 0 < i < T_1 and p_0 = 0. In the meantime, the B2A transformation is designed as follows:

\mathrm{energy}_{i,j} = -\sigma^{-2} \big( |j - a_i| + |b_i - j| - (b_i - a_i) \big)^2    (11)

\alpha^{(2)}_{i,j} = \frac{\exp(\mathrm{energy}_{i,j})}{\sum_{m=0}^{T_1-1} \exp(\mathrm{energy}_{m,j})}    (12)

As can be seen, for the i-th input token with its corresponding boundaries (a_i, b_i), energy_{i,j} reaches its maximum value 0 only if the output position j falls within the boundaries, i.e., a_i ≤ j ≤ b_i. For an output position outside of the boundaries, the further it is away from the boundaries, the lower the value of energy_{i,j}, resulting in a smaller attention weight.

Note that the proposed B2A approach works for all TTS models with explicit token durations, and is potentially better than the conventional approaches: (i) compared to the repetition operation [5], [6], [11], [12], the proposed approach is differentiable and enables batch computation; (ii) compared to the popular Gaussian upsampling [10], [26] that considers only the centralized position, the proposed approach employs boundary positions, which is more informative; (iii) compared to the learnable upsampling [14], [27], the proposed approach is monotonic and much easier to train.

In preliminary experiments, we found that the model performance is greatly influenced by the choice of σ in (7) and (12). In order to obtain better model performance, we use a learnable σ in this work. We further map the hidden representation x_h to a time-aligned hidden representation x_align using an approach similar to the multi-head attention mechanism in [25]:

head^{(i)} = \alpha^{(i)} \cdot (x_h W^{(i)})    (13)

x_{align} = \mathrm{Concat}(head^{(1)}, head^{(2)})\, W_o    (14)

where {W^{(i)}} and W_o are learnable linear transformations. The resulting x_align is then fed into the hierarchical-VAE-based waveform generator as input.
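To make the A2B and B2A transformations in (9)–(12) and the hybrid combination in (13)–(14) concrete, the following sketch shows one way the computation could be written in PyTorch for a single, unbatched utterance. It is an illustrative sketch under our own assumptions: the tensor shapes, the shared scalar σ, and the helper names a2b, b2a, and hybrid_align are ours, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def a2b(pi: torch.Tensor, sigma: torch.Tensor, T1: int):
    """Attention-to-Boundaries (A2B), following (9)-(10).

    pi:    (T2,) index mapping vector with values in [0, T1 - 1].
    sigma: learnable scalar width (a tensor or a float).
    Returns start/end boundaries a, b of shape (T1,) in output-frame units.
    """
    T2 = pi.shape[0]
    # Boundary grid p: p_0 = 0 and p_i = i - 0.5 for 0 < i < T1.
    p = torch.arange(T1, dtype=pi.dtype, device=pi.device) - 0.5
    p[0] = 0.0
    # beta_{i,j}: softmax over output frames of -(pi_j - p_i)^2 / sigma^2   (9)
    logits = -((pi.unsqueeze(0) - p.unsqueeze(1)) ** 2) / sigma ** 2   # (T1, T2)
    beta = F.softmax(logits, dim=1)
    # a_i: expected output position of the i-th boundary                    (10)
    frames = torch.arange(T2, dtype=pi.dtype, device=pi.device)
    a = beta @ frames                                                  # (T1,)
    # b_i = a_{i+1} for i < T1 - 1, and T2 - 1 for the last token.
    b = torch.cat([a[1:], a.new_tensor([T2 - 1.0])])
    return a, b

def b2a(a: torch.Tensor, b: torch.Tensor, sigma: torch.Tensor, T2: int):
    """Boundaries-to-Attention (B2A), following (11)-(12); returns (T1, T2)."""
    j = torch.arange(T2, dtype=a.dtype, device=a.device).unsqueeze(0)  # (1, T2)
    a_, b_ = a.unsqueeze(1), b.unsqueeze(1)                            # (T1, 1)
    # energy is 0 inside [a_i, b_i] and decreases quadratically outside     (11)
    energy = -((j - a_).abs() + (b_ - j).abs() - (b_ - a_)) ** 2 / sigma ** 2
    # normalise over the input tokens for each output frame                 (12)
    return F.softmax(energy, dim=0)

def hybrid_align(alpha1, alpha2, x_h, W1, W2, Wo):
    """Hybrid attention (13)-(14): combine two attention matrices into x_align.

    alpha1, alpha2: (T1, T2) attention matrices; x_h: (T1, d) hidden states.
    The transposes reflect our shape convention so that x_align has length T2.
    """
    head1 = alpha1.transpose(0, 1) @ (x_h @ W1)    # (T2, d')
    head2 = alpha2.transpose(0, 1) @ (x_h @ W2)    # (T2, d')
    return torch.cat([head1, head2], dim=-1) @ Wo  # (T2, d_out)
```

In practice these operations would run batched over padded sequences with masking, and α^{(1)} would be reconstructed from e as in EFTS; the sketch only illustrates the data flow of the boundary-based branch and the final combination.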
Variational alignment predictor: NAR TTS models generate the entire output speech in parallel, thus alignment information is required in advance. To address this problem, many previous NAR models train a duration predictor to predict the duration of each input token [5], [11]. Similarly, EFTS employs an aligned position predictor to predict the aligned position vector e. As opposed to a vanilla deterministic alignment predictor (DAP), in this work we use a variational alignment predictor (VAP) to predict the alignment vector e and the boundary positions a and b. The main motivation behind this is to treat the alignment prediction problem as a generative problem, since one text input can be expressed with different rhythms. Specifically, the VAP encoder receives the relative distances e − a and b − a, and outputs a latent posterior distribution q_φ(z_A | e − a, b − a, x_h) conditioned on x_h, while the VAP decoder estimates the output distribution from the input z_A, also conditioned on x_h. The prior distribution is a standard Gaussian distribution. For simplicity, both the encoder and the decoder of the VAP are parameterized with non-causal WaveNet residual blocks. The training objective of the VAP is computed as:

\mathcal{L}_{align} = \lambda_1 \big( \| d_\theta^{(1)}(z_A) - \log(e - a + \epsilon) \|^2 + \| d_\theta^{(2)}(z_A) - \log(b - a + \epsilon) \|^2 \big) + \lambda_2\, D_{KL}\big( \mathcal{N}(z_A;\ \mu_\phi^{(A)}(e - a, b - a, x_h),\ \sigma_\phi^{(A)}(e - a, b - a, x_h)) \,\|\, \mathcal{N}(z_A;\ 0, I) \big)    (15)

where d_θ^{(1)} and d_θ^{(2)} are the outputs of the VAP decoder, μ_φ^{(A)} and σ_φ^{(A)} are the outputs of the VAP encoder, and ε is a small value to avoid numerical instabilities. The first term in (15) is the reconstruction loss, which computes the log-scale mean square error (MSE) between the predicted and target relative distances. The second term is the KL divergence between the posterior and prior distributions.
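To make the two terms of (15) concrete, here is a compact sketch of how the VAP loss could be assembled, assuming the encoder returns the posterior mean and log-standard-deviation of z_A and the decoder returns the two log-distance predictions d^{(1)} and d^{(2)}. The default weights, the value of eps, and all names are placeholders of ours.

```python
import torch

def vap_loss(d1, d2, e, a, b, mu_q, logs_q, lambda1=1.0, lambda2=1.0, eps=1e-4):
    """Training objective of the variational alignment predictor, following (15)."""
    # Log-scale MSE between predicted and target relative distances.
    recon = ((d1 - torch.log(e - a + eps)) ** 2).mean() \
          + ((d2 - torch.log(b - a + eps)) ** 2).mean()
    # Closed-form KL( N(mu_q, exp(logs_q)^2) || N(0, I) ).
    kl = (-logs_q + 0.5 * (torch.exp(2.0 * logs_q) + mu_q ** 2) - 0.5).mean()
    return lambda1 * recon + lambda2 * kl
```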
In the inference phase, the alignment vector ê and the boundary positions â and b̂ are computed as:

\hat{b}_i = \sum_{m=0}^{i} \big( \exp((d_\theta^{(2)}(z_A))_m) - \epsilon \big), \qquad \hat{a}_i = \begin{cases} 0, & i = 0 \\ \hat{b}_{i-1}, & i > 0 \end{cases}

\hat{e}_i = \exp((d_\theta^{(1)}(z_A))_i) - \epsilon + \hat{a}_i    (16)

where z_A is sampled from the standard Gaussian distribution. A stop-gradient operation is added to the inputs of the VAP encoder, which helps the model to learn a more accurate alignment in the training phase.
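As a small illustration of the inference-time reconstruction in (16), the sketch below turns the two VAP decoder outputs into boundary positions and aligned positions via a cumulative sum. The tensor shapes, the value of eps, and the function name are our assumptions.

```python
import torch

def vap_infer(d1: torch.Tensor, d2: torch.Tensor, eps: float = 1e-4):
    """Reconstruct (e_hat, a_hat, b_hat) from VAP decoder outputs, following (16).

    d1, d2: decoder outputs of shape (T1,), interpreted as log-scale
            relative distances log(e - a + eps) and log(b - a + eps).
    """
    durations = torch.exp(d2) - eps                       # predicted spans b - a
    b_hat = torch.cumsum(durations, dim=0)                # end boundary of each token
    a_hat = torch.cat([b_hat.new_zeros(1), b_hat[:-1]])   # a_0 = 0, a_i = b_{i-1}
    e_hat = torch.exp(d1) - eps + a_hat                   # expected aligned position
    return e_hat, a_hat, b_hat
```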
B. Hierarchical-VAE-Based Waveform Generator

Producing high-quality waveforms from linguistic features (e.g., texts, phonemes, or the hidden linguistic representation x_align) is known to be a particularly challenging problem. This is mainly because linguistic features do not contain enough of the necessary information (e.g., pitch and energy) for waveform generation. A primary idea is to use a VAE-based generator that learns the waveform generation from a latent variable z. This z is sampled from an informative posterior distribution q_φ(z|y) parameterized by a network with acoustic features as input. A prior estimator p_θ(z|x_align) with x_align as input is also trained jointly. Training such a system amounts to minimizing the reconstruction error between the real and predicted waveforms and the KL divergence between the prior and the posterior distributions. However, the prior distribution contains no acoustic information, while the learned posterior must be informative w.r.t. the acoustic information. The information gap makes it hard to minimize the KL divergence between the prior and the posterior distribution. To tackle this problem, we introduce a 2-layer hierarchical VAE structure that enables an informative prior formulation. The hierarchical-VAE-based waveform generator is composed of the following blocks:
• A posterior network, which takes the linear spectrogram y_lin as input and outputs two latent Gaussian posteriors q_φ(z_1|y_lin) and q_φ(z_2|y_lin).
• A hierarchical prior network, which consists of two stochastic layers: the first layer receives x_align and outputs a latent Gaussian prior p_θ(z_1|x_align); the second layer takes a latent variable z_1 and formulates another latent Gaussian prior p_θ(z_2|z_1), where z_1 is sampled from the posterior distribution q_φ(z_1|y_lin) in the training phase and from the prior distribution p_θ(z_1|x_align) in the inference phase.
• A decoder, which produces the waveform from the latent variable z_2, where z_2 is sampled from the posterior distribution q_φ(z_2|y_lin) in the training phase and from the prior distribution p_θ(z_2|z_1) in the inference phase.

Therefore, the overall prior and posterior distributions are formulated as:²

p_\theta(z|x_{align}) = p_\theta(z_1|x_{align})\, p_\theta(z_2|z_1)    (17)

q_\phi(z|y) = q_\phi(z_1|y_{lin})\, q_\phi(z_2|y_{lin})    (18)

² We hypothesize that z_1 and z_2 represent different and conditionally independent representations. Therefore the posterior distribution is formulated as q_φ(z_1|y_lin) q_φ(z_2|y_lin) instead of q_φ(z_1|y_lin) q_φ(z_2|z_1) in this work. Fig. 4 confirms our assumption. A similar conclusion has been drawn by [28], which states that a hierarchical VAE can learn a latent hierarchy of conditionally independent variables.

The training objective is:

\mathcal{L}_{wav} = \lambda_3 \| y_{mel} - \hat{y}_{mel} \|_1 + \lambda_4 \big( D_{KL}(\mathcal{N}(z_1;\ \mu_\phi^{(1)}(y_{lin}), \sigma_\phi^{(1)}(y_{lin})) \,\|\, \mathcal{N}(z_1;\ \mu_\theta^{(1)}(x_{align}), \sigma_\theta^{(1)}(x_{align}))) + D_{KL}(\mathcal{N}(z_2;\ \mu_\phi^{(2)}(y_{lin}), \sigma_\phi^{(2)}(y_{lin})) \,\|\, \mathcal{N}(z_2;\ \mu_\theta^{(2)}(z_1), \sigma_\theta^{(2)}(z_1))) \big)    (19)

Here, the reconstruction loss is the ℓ_1 loss between the target mel-spectrogram y_mel and the predicted mel-spectrogram ŷ_mel, which is derived from the generated waveform ŷ. In this work, the two prior estimators and the posterior estimator are all parameterized by stacks of non-causal WaveNet residual blocks that output the estimated means and variances, while the decoder is inspired by the generator of HiFi-GAN [29]. Similar to EFTS and VITS, the decoder part is trained on sliced latent variables with corresponding sliced audio segments for memory efficiency. Some previous TTS models [11], [14] also incorporate the VAE framework in end-to-end waveform generation. EFTS2 differs from them in several aspects: (i) EFTS2 uses a 2-layer hierarchical VAE while previous works use a single-layer VAE; (ii) in previous works, the KL divergence between the prior and posterior distributions is estimated between a latent variable (which is just a sample from the posterior distribution) and a multivariate Gaussian distribution, while EFTS2 computes the KL divergence between two multivariate Gaussian distributions, which allows for more efficient training; and (iii) previous works have to accompany the VAE with a bijective flow structure to produce high-quality results, while EFTS2 is bijective-free.

C. Bidirectional Prior/Posterior Training

One limitation of the hierarchical VAE is that it can have a training-inference mismatch. In the training phase, the reconstructed waveform is derived from the posterior, while in the inference phase, the synthesized waveform is predicted from the prior. Inspired by NaturalSpeech [14], we adopt a bidirectional prior/posterior training method to reduce the training-inference information gap. Specifically, we add a new training objective L_bi that minimizes the KL divergence between the enhanced prior and the posterior:

\mathcal{L}_{bi} = D_{KL}\big( \mathcal{N}(z_2;\ \mu_\theta^{(2)}(z_1^{(p)}), \sigma_\theta^{(2)}(z_1^{(p)})) \,\|\, \mathcal{N}(z_2;\ \mu_\phi^{(2)}(y_{lin}), \sigma_\phi^{(2)}(y_{lin})) \big)    (20)

where z_1^{(p)} is sampled from the output distribution of the first prior network p_θ(z_1|x_align). Unlike NaturalSpeech, which runs the bijective flow module bidirectionally, our method runs the second prior network twice. Since the second prior network is significantly smaller than the flow module in NaturalSpeech, the training of the proposed method is computationally cheap while still allowing for a large batch size per GPU. Note that non-differentiable models such as VITS cannot be trained bidirectionally.
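Because every prior and posterior in (19) and (20) is a diagonal Gaussian, the KL terms have a closed form. The helper below is a minimal PyTorch sketch of that computation under our own conventions (a log-standard-deviation parameterization and summation over the last dimension); the function name and the usage comments are illustrative rather than taken from the paper.

```python
import torch

def gaussian_kl(mu_q, logs_q, mu_p, logs_p):
    """KL( N(mu_q, exp(logs_q)^2) || N(mu_p, exp(logs_p)^2) ) for diagonal Gaussians.

    All arguments share the same shape; the element-wise closed form
    KL = log(s_p / s_q) + (s_q^2 + (mu_q - mu_p)^2) / (2 s_p^2) - 1/2
    is summed over the latent dimensions.
    """
    kl = (logs_p - logs_q) \
         + (torch.exp(2.0 * logs_q) + (mu_q - mu_p) ** 2) / (2.0 * torch.exp(2.0 * logs_p)) \
         - 0.5
    return kl.sum(dim=-1)

# Hypothetical usage for the two KL terms of (19) and the bidirectional term (20):
# kl_z1 = gaussian_kl(mu_q1, logs_q1, mu_p1, logs_p1).mean()
# kl_z2 = gaussian_kl(mu_q2, logs_q2, mu_p2, logs_p2).mean()
# l_bi  = gaussian_kl(mu_p2_from_prior_z1, logs_p2_from_prior_z1, mu_q2, logs_q2).mean()
```

Depending on the implementation, a stop-gradient on the posterior statistics may be desirable in the bidirectional term so that (20) mainly updates the prior path; the paper does not spell out this detail.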
Fig. 3. Overall model architecture of EFTS2-VC. LP refers to linear projection. The dotted lines refer to the training objectives.
D. The Overall Model Architecture

The overall model architecture of EFTS2 is based on GAN, consisting of a generator and multiple discriminators. We follow [29] in implementing the multiple discriminators, whose performance has been experimentally confirmed by many previous works [11], [30]. The feature matching loss L_fm is also employed for training stability. In the training phase, a phoneme sequence x is passed through a phoneme encoder to produce the latent representation x_h, while the corresponding linear spectrogram y_lin is passed through a spectrogram encoder to produce the latent representation y_h and two latent Gaussian posteriors.³ Same as EFTS and VITS, the phoneme encoder is parameterized by a stack of feed-forward Transformer blocks. The proposed differentiable aligner receives the latent representations x_h and y_h and outputs the time-aligned latent representation x_align. Then x_align is further fed to the hierarchical-VAE-based waveform generator to produce the output ŷ. The overall training objective of the proposed generator G is:

\mathcal{L}_{total} = \mathcal{L}_{align} + \mathcal{L}_{wav} + \mathcal{L}_{bi} + \mathcal{L}_{adv}(G) + \mathcal{L}_{fm}(G)    (21)

³ Ideally, an end-to-end TTS system should operate on unnormalized text. We use external tools to convert unnormalized text to phonemes in this work, and we will explore data-driven approaches in the future to address this limitation.

V. EFTS2-VC: END-TO-END VOICE CONVERSION

Voice conversion (VC) is a task that modifies a source speech signal with a target speaker's timbre while keeping the linguistic content of the source speech unchanged. The proposed voice conversion model, EFTS2-VC (shown in Fig. 3), is built upon EFTS2 with several module differences:
• The alignment predictor is excluded in EFTS2-VC, since there is no need to explicitly predict the text-spectrogram alignment in the inference phase.
• Instead of using e or the token boundaries, the reconstructed attention matrix α is derived from π ((22); see the sketch after this list). This not only simplifies the computation pipeline but also allows the network to obtain a more accurate text-spectrogram alignment. Similar to the TTS model, EFTS2-VC uses multiple reconstructed attentions by employing multiple learnable {σ_π^{(k)} | k = 1, ..., H}.

\alpha_{i,j} = \frac{\exp(-\sigma_\pi^{-2}(\pi_j - i)^2)}{\sum_{m=0}^{T_1-1} \exp(-\sigma_\pi^{-2}(\pi_j - m)^2)}    (22)

• The speaker embedding of the source waveform, which is extracted from a trained speaker encoder, is introduced as a condition to the spectrogram encoder, the second prior network, and the HiFi-GAN generator.
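The sketch below illustrates how the multi-width attention in (22) could be formed directly from the index mapping vector π. The shapes, the per-head parameterization of σ_π, and the function name are our assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def attention_from_pi(pi: torch.Tensor, sigma_pi: torch.Tensor, T1: int):
    """Reconstruct attention matrices from the IMV pi, following (22).

    pi:       (T2,) index mapping vector over input positions.
    sigma_pi: (H,) learnable widths, one per reconstructed attention matrix.
    Returns a tensor of shape (H, T1, T2), normalised over input positions.
    """
    m = torch.arange(T1, dtype=pi.dtype, device=pi.device)        # (T1,)
    # squared distance between pi_j and every input index m
    dist2 = (pi.unsqueeze(0) - m.unsqueeze(1)) ** 2               # (T1, T2)
    logits = -dist2.unsqueeze(0) / sigma_pi.view(-1, 1, 1) ** 2   # (H, T1, T2)
    return F.softmax(logits, dim=1)   # normalise over the T1 input positions
```

Each of the H reconstructed attention matrices can then be combined with x_h in the same way as (13)–(14).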
Again, we consider the hierarchical-VAE-based framework discussed in Section IV-B. During training, the prior distribution p_θ(z_1|x_align) is estimated by a stack of WaveNet blocks with x_align as the only input. Since x_align is the time-aligned textual representation without any information about the speaker identity, we can conclude that the prior distribution p_θ(z_1|x_align) contains only textual information and no information about the speaker identity. The conclusion can be further extended to the posterior distribution q_φ(z_1|y_lin), since the network is trained by minimizing the KL divergence between the prior and posterior distributions. Therefore, the spectrogram encoder works as a speaker disentanglement network that strips the speaker identity while preserving the textual (or content) information.
Then the second prior network and the variational decoder reconstruct the speech from the content information and the input speaker embeddings. During inference, the disentanglement network and the reconstruction network are conditioned on different speaker embeddings. Specifically, the disentanglement network receives the spectrogram and the speaker embedding of a source speaker, and outputs a latent distribution q_φ(z_1|y_lin). Meanwhile, the reconstruction network produces the output waveform from the latent variable z_1 and the speaker embedding of a target speaker. With these designs, EFTS2-VC performs an ideal conversion that preserves the content information of the source spectrogram while producing output speech that matches the speaker characteristics of the target speaker.

VI. EXPERIMENTS

To measure the performance of EFTS2, we evaluate it on two tasks: single-speaker TTS and multi-speaker TTS. Ablation studies are performed under the single-speaker TTS setting. We also examine the model from several different aspects: comparison of different aligners, model size and inference speed, analysis of the latent hierarchy, and visualization of the attention matrices. Last but not least, we also evaluate the performance of the EFTS2-VC model. The audio samples⁴ and source code⁵ are available on GitHub.

⁴ https://mcf330.github.io/efts2audiosamples/
⁵ https://github.com/mcf330/efts2code

A. Experimental Setup

Datasets: Three public datasets are used in our experiments: the LJ Speech dataset [31], the VCTK dataset [32], and the LibriTTS dataset [33]. The LJ Speech dataset is an English speech corpus consisting of 13,100 audio clips of a single female speaker. Each audio file is single-channel 16-bit PCM with a sampling rate of 22,050 Hz. The VCTK dataset is a multi-speaker English speech corpus that contains 44 hours of audio clips of 108 native speakers with various accents. The original audio format is 16-bit PCM with a sample rate of 44 kHz. The LibriTTS dataset is another multi-speaker English speech dataset with an original sampling rate of 24,000 Hz. For the multi-speaker TTS task, we train EFTS2 on a minimal version of LibriTTS, train-clean-100, which contains 53 hours of audio clips of 247 native speakers. In our experiments, all audio clips are converted to 16-bit and down-sampled to 22,050 Hz. All datasets are randomly split into a training set, a validation set, and a test set.

Preprocessing: The linear spectrograms of the original audio are used as the input of the spectrogram encoder. The FFT size, hop size, and window size used in the short-time Fourier transform (STFT) to obtain the linear spectrograms are set to 1024, 256, and 1024, respectively. Before training, the text sequences are converted to phoneme sequences using the open-source phonemizer software,⁶ and the converted sequences are interspersed with a blank token following the implementation of VITS [11].

⁶ https://github.com/bootphon/phonemizer
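For readers who want to reproduce the feature extraction, a minimal sketch of computing linear spectrograms with the stated STFT settings (FFT size 1024, hop size 256, window size 1024) is given below. The use of torch.stft and a Hann window is our assumption, since the text does not name a specific toolkit.

```python
import torch

def linear_spectrogram(wav: torch.Tensor,
                       n_fft: int = 1024,
                       hop_length: int = 256,
                       win_length: int = 1024) -> torch.Tensor:
    """Magnitude (linear) spectrogram of a mono waveform tensor of shape (T,)."""
    window = torch.hann_window(win_length, device=wav.device)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop_length,
                      win_length=win_length, window=window,
                      center=True, return_complex=True)
    return spec.abs()   # shape: (n_fft // 2 + 1, frames)
```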
Configurations: The phoneme encoder of EFTS2 is a stack of 6 feed-forward Transformer (FFT) blocks, where each FFT block consists of a multi-head attention layer with 2 attention heads and a convolutional feed-forward layer with a hidden size of 192. The rest of EFTS2 is composed of stacks of non-causal WaveNet residual blocks. The kernel size is 5 and the dilation rate is 1 for all the WaveNet layers. EFTS2-VC shares similar model configurations with EFTS2, except that the variational alignment predictor is excluded from EFTS2-VC. In order to obtain better speech quality, the variances of the variational priors are multiplied by different scaling factors at the inference stage. Specifically, the scaling factor on the variance of the alignment, t_A, is set to 0.7, and the scaling factors on the variances of the latent distributions p_θ(z_1) and p_θ(z_2), denoted as t_1 and t_2, are set to 0.8 and 0.3, respectively. The deterministic alignment predictor (DAP), which takes x_h as input and outputs the alignment vectors (e.g., â, b̂, ê), is parameterized by 2 convolution layers and a linear mapping. Each convolution layer is followed by a layer normalization and a leaky ReLU activation. The trained speaker encoder of EFTS2-VC is a speaker recognition model [34] trained on the VoxCeleb2 [35] dataset. The pre-trained model is publicly available [34]. For the VC task, YourTTS [16], a publicly available pre-trained model trained on the VCTK dataset, is used as the baseline model. For a fair comparison, we down-sampled the generated audio of EFTS2-VC to 16 kHz to match the sample rate of YourTTS's generated audio during the evaluation process. The hyper-parameters of EFTS2 and EFTS2-VC are listed in Table I.

Training: Both EFTS2 and EFTS2-VC are trained on 4 Tesla V100 GPUs with 16 GB of memory. The batch size on each GPU is set to 32. The AdamW optimizer [36] with β_1 = 0.8, β_2 = 0.99 is used to train the models. The initial learning rate is set to 2 × 10^{-4} and decays at every training epoch with a decay rate of 0.998. Both models converge at the 500k-th step.
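The optimizer setup described above maps directly onto standard PyTorch components. The following sketch reflects our reading of the stated hyper-parameters (AdamW with β₁ = 0.8, β₂ = 0.99, initial learning rate 2 × 10⁻⁴, and a per-epoch exponential decay of 0.998) and is not taken from the released training code.

```python
import torch

def build_optimizer(model: torch.nn.Module):
    """Optimizer and per-epoch LR schedule matching the reported training setup."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.8, 0.99))
    # Decay the learning rate by a factor of 0.998 after every epoch.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.998)
    return optimizer, scheduler

# Typical loop structure (sketch):
#   for epoch in range(num_epochs):
#       for batch in loader:
#           ...  # forward pass, loss, backward, optimizer.step()
#       scheduler.step()
```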
Evaluation Metrics: We employed both subjective and objective evaluations to examine the performance of EFTS2 and EFTS2-VC. For the subjective evaluations, we report the following metrics:
• Mean Opinion Score (MOS): Audio samples from the test sets are randomly chosen and provided to raters. Raters are asked to rate the naturalness of the audio samples.
• Comparative Mean Opinion Score (CMOS): Paired audio samples are randomly provided to raters for side-by-side comparison. Audio samples within each pair are generated by different models on the same text. Raters are asked to rate the naturalness of the audio samples.
• Similarity Mean Opinion Score (Sim-MOS): For the voice conversion task, audio samples are provided to raters in groups. Within each group, the audio samples are generated by different models while converting to the same target speaker, and one audio clip from the target speaker is also given as a reference. Raters are asked to rate the speaker similarity between each converted sample and the reference.
TABLE I: HYPER-PARAMETERS OF EFTS2 AND EFTS2-VC

TABLE II: MOS RESULTS FROM BASELINE MODELS, THE ABLATION STUDIES, AND EFTS2 ON LJ SPEECH. HERE, -HVAE REPRESENTS USING A 1-LAYER VAE, -HA REPRESENTS USING A SINGLE ATTENTION DERIVED FROM TOKEN BOUNDARIES, AND -BI REPRESENTS REMOVING THE BIDIRECTIONAL TRAINING OBJECTIVE.

TABLE IV: CMOS RESULTS BETWEEN OUR HYBRID ATTENTION AND OTHER ALIGNERS ON THE LJ SPEECH DATASET

• Non-differentiable approaches. ND-E-Repetition: external aligner with repeated upsampling [5], [6]; and ND-I-MAS: internal aligner using MAS [11].
• Differentiable approaches. D-E-Gaussian-Central: external aligner using the upsampling approach proposed by EATS [10], [26]; D-E-Gaussian-Boundaries: external aligner using the proposed B2A approach following (12); D-I-Learnable: internal aligner with a single learnable attention [27], with the token boundaries derived from π following (10); D-I-Gaussian-e: internal aligner whose attention is derived from the alignment vector e [15]; D-I-Gaussian-Boundaries: internal aligner whose attention is derived from the token boundaries following (12); D-Hybrid: the proposed hybrid attention.

All these models are built using the 2-layer hierarchical-VAE-based waveform generator and the deterministic alignment predictor (DAP). The bidirectional prior/posterior training objective is also excluded from training for a fair comparison. For the two non-differentiable models, the first convolutional prior network is excluded. The first variational prior of the two non-differentiable models is formulated by first mapping the text hidden representation x_h to an input-level Gaussian prior distribution, and then expanding the Gaussian prior through repetition. The phoneme durations are extracted using MAS [11] for all the models using external aligners. For this task, 15 sentences were randomly selected and generated under each of the model settings, and the CMOS results between our method (D-Hybrid) and the other aligners are presented in Table IV. We have the following observations: 1) the model using learnable upsampling, D-I-Learnable, does not converge at all, while the other models are able to produce reasonable results; 2) the models with internal aligners outperform those using external aligners even when facilitated with the same upsampling approach, which demonstrates the importance of jointly learning alignment and speech generation. The proposed approach D-I-Gaussian-Boundaries, which uses Gaussian attention derived from token boundaries, significantly outperforms the other upsampling approaches. The best model, D-Hybrid, which combines D-I-Gaussian-e and D-I-Gaussian-Boundaries, further boosts the model performance. 3) D-Hybrid achieves a performance gain of 0.1 over ND-I-MAS, verifying the significance of the proposed differentiable aligner over MAS. We also notice that ND-I-MAS performs worse than VITS: VITS achieves speech quality comparable to our best-performing model, while there is a notable performance gap between ND-I-MAS and D-Hybrid. One assumption is that the repeated latent variable of ND-I-MAS has very similar local representations, which often require a very large receptive field size for the decoder network.

E. Model Size and Inference Speed

The inductive biases of the proposed hierarchical-VAE-based generator make the overall model smaller and significantly faster than the baseline models. The model size and inference speed of EFTS2, along with those of the baseline models, are presented in Table V. Since EFTS2's generator employs a significantly smaller number of convolution blocks than VITS, the inference speed is greatly improved. Specifically, EFTS2 is capable of synthesizing 22.05 kHz speech 101.92× faster than real-time, which is 1.5× faster than VITS.

F. Analysis of the Latent Hierarchy

One question from Section IV-B is whether the hierarchical architecture of the proposed generator empowers the model to have controllable diversity in hierarchical and explainable latent variables. To verify this statement, we designed the following experiment. As mentioned above, t_A is the scaling factor on the variance of the alignment, while t_1 and t_2 are two scaling factors applied to the variances of the latent distributions p_θ(z_1) and p_θ(z_2), respectively. In this experiment, we picked three sets of values of (t_1, t_2) and synthesized one fixed sentence 5 times under each set of (t_1, t_2), while fixing t_A = 0 throughout the experiment. In other words, all waveforms generated in this experiment share the exact same x_align. Then, for each set of (t_1, t_2), 5 pairs of z_1 and z_2 are sampled and used to synthesize waveforms. The F0 contours of these waveforms are visualized in Fig. 4. As shown in the figure, increasing t_1 considerably increases the variation of F0, whereas a large t_2 barely produces noticeable changes.
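The latent-diversity experiment above reduces to sampling each latent variable with a scaling factor applied to its predicted variance. The sketch below shows that sampling step as we read it; the function name and the usage lines are illustrative, not taken from the paper's code.

```python
import torch

def sample_scaled(mu: torch.Tensor, sigma: torch.Tensor, t: float) -> torch.Tensor:
    """Draw z ~ N(mu, t * sigma^2): the variance is multiplied by the scaling factor t.

    Setting t = 0 collapses the distribution to its mean (as done for t_A above),
    while a larger t increases the diversity contributed by this latent variable.
    """
    scaled_sigma = sigma * (t ** 0.5)          # std of the scaled-variance Gaussian
    return mu + scaled_sigma * torch.randn_like(mu)

# e.g. z1 = sample_scaled(mu_p1, sigma_p1, t=0.8)   # t_1
#      z2 = sample_scaled(mu_p2, sigma_p2, t=0.3)   # t_2
```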
Fig. 4. F0 contours obtained from the test samples generated by EFTS2 with different t_1. Each subplot represents the F0 contours of the five utterances, each with different colors, generated under the marked (t_1, t_2) values.
TABLE VI: MOS AND SIMILARITY MOS FOR VOICE CONVERSION EXPERIMENTS ON THE VCTK DATASET

TABLE VII: OBJECTIVE EVALUATIONS ON THE VCTK DATASET

Fig. 6. Visualization of the attention matrices of EFTS2. Each subplot represents one attention matrix, with the horizontal and vertical axes representing the decoder and encoder timesteps, respectively. Here, the decoder and encoder timesteps can be understood as the indices of the output mel-spectrogram sequence and the input phoneme-token sequence. Values within the matrix are represented by colors corresponding to the color chart on the right. A brighter color corresponds to a larger value and indicates a higher probability that this output frame attends to this input token. Subplot (a) is the attention matrix α^{(1)}, which is reconstructed using e. Subplot (b) is the attention matrix α^{(2)}, which is reconstructed using the boundary pairs (a, b).

TABLE VIII: COMPARISON WITH PREVIOUS TEXT-TO-WAVEFORM MODELS
This indicates that α^{(1)} learns a smoother overall alignment, while α^{(2)} learns the specific duration boundaries of the input tokens.

H. Voice Conversion Evaluation

The conversion performance of EFTS2-VC is evaluated on the VCTK dataset [32] with a comparison to the baseline model YourTTS [16]. For both the seen and unseen target-speaker settings, 25 converted utterances from each model are collected to conduct the MOS and Sim-MOS tests, and the results are presented in Table VI. EFTS2-VC achieves slightly better MOS scores and comparable Sim-MOS scores on both seen and unseen speakers. Note that the conversion of YourTTS requires running the flow module bidirectionally, which results in a slow conversion speed. On the other hand, EFTS2-VC is significantly faster: it runs 2.15× faster than YourTTS on a Tesla V100 GPU.

To further evaluate the ability of EFTS2-VC to disentangle the content information and the speaker-related information, objective evaluations on WER and COS-Sim are reported in Table VII. In addition, as both EFTS2-VC and YourTTS address the disentanglement problem by minimizing the KL divergence between the latent representation and the content representation, we also include the KL divergence (KLD) for comparison. All objective evaluations are conducted on the same utterances as in Table VI. As presented in Table VII, EFTS2-VC offers better WER and KLD for both seen and unseen speakers, a better COS-Sim score for seen speakers, and a comparable COS-Sim score for unseen speakers. Specifically, EFTS2-VC achieves significantly lower KLD than YourTTS, indicating that the latent variable of EFTS2-VC is closer to a pure text representation, which demonstrates the superior disentanglement capability of EFTS2-VC.

VII. ADVANTAGES OF EFTS2

In Table VIII we compare the advantages of EFTS2 with previous text-to-waveform models in terms of training pipelines, differentiability, model performance, and model efficiency. EFTS2 is the only differentiable model that allows for end-to-end training, high-quality, and high-efficiency generation.

VIII. CONCLUSION AND DISCUSSION

We presented EfficientTTS 2 (EFTS2), a novel end-to-end TTS model that adopts an adversarial training process, with a generator composed of a differentiable aligner and a hierarchical-VAE-based speech generator. Compared to baseline models, EFTS2 is fully differentiable and enjoys a smaller model size with higher model efficiency, while still allowing high-fidelity speech generation with controllable diversity. Moreover, we extend EFTS2 to the VC task and propose a VC model,
EFTS2-VC, which is capable of efficient and high-quality end-to-end voice conversion.

The primary goal of this work is to build a competitive TTS model that allows for end-to-end high-quality speech generation. In the meantime, the proposed design choices can easily be incorporated into other TTS frameworks. Firstly, the proposed B2A approach could potentially be a handier replacement for conventional upsampling techniques in nearly all NAR TTS models, given that it is differentiable, informative, and computationally cheap. Secondly, the differentiable aligner may be a superior alternative for any external or non-differentiable aligner, as it improves the uniformity of the model and makes the training process end-to-end. Thirdly, the 2-layer hierarchical-VAE-based waveform generator can potentially outperform the popular flow-VAE-based counterpart [11], [14], since it is more efficient and offers more flexibility in network design. Lastly and most importantly, the entire architecture of EFTS2 could serve as a practical solution to sequence-to-sequence tasks that have the nature of monotonic alignments. We leave these assumptions to future work while providing our implementations as a research basis for further exploration.

REFERENCES

[1] Y. Wang, R. Skerry-Ryan, D. Stanton, R. J. W. Y. Wu, N. Jaitly, and Z. Yang, "Tacotron: Towards end-to-end speech synthesis," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 4006–4010.
[2] J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 4779–4783.
[3] W. Ping et al., "Deep Voice 3: 2000-speaker neural text-to-speech," in Proc. Int. Conf. Learn. Representations, 2018, pp. 214–217.
[4] C. Miao, S. Liang, M. Chen, J. Ma, S. Wang, and J. Xiao, "Flow-TTS: A non-autoregressive network for text to speech based on flow," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 7209–7213.
[5] Y. Ren et al., "FastSpeech: Fast, robust and controllable text to speech," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, pp. 3171–3180.
[6] Y. Ren et al., "FastSpeech 2: Fast and high-quality end-to-end text to speech," in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–9.
[7] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 5891–5895.
[8] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 6199–6203.
[9] R. J. Weiss, R. Skerry-Ryan, E. Battenberg, S. Mariooryad, and D. P. Kingma, "Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 5679–5683.
[10] J. Donahue, S. Dieleman, M. Binkowski, E. Elsen, and K. Simonyan, "End-to-end adversarial text-to-speech," in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–9.
[11] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 5530–5540.
[12] J. Kim, S. Kim, J. Kong, and S. Yoon, "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search," in Proc. 34th Int. Conf. Neural Inf. Process. Syst., 2020, pp. 8067–8077.
[13] L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear independent components estimation," 2014, arXiv:1410.8516.
[14] X. Tan et al., "NaturalSpeech: End-to-end text to speech synthesis with human-level quality," IEEE Trans. Pattern Anal. Mach. Intell., early access, Jan. 19, 2024, doi: 10.1109/TPAMI.2024.3356232.
[15] C. Miao et al., "EfficientTTS: An efficient and high-quality text-to-speech architecture," in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 7700–7709.
[16] E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti, "YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone," in Proc. 39th Int. Conf. Mach. Learn., 2022, pp. 2709–2720.
[17] N. Chen et al., "WaveGrad 2: Iterative refinement for text-to-speech synthesis," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 3769–3769.
[18] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, "Grad-TTS: A diffusion probabilistic model for text-to-speech," in Proc. 38th Int. Conf. Mach. Learn., 2021, pp. 8599–8608.
[19] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 498–502.
[20] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. Int. Conf. Learn. Representations, 2015, pp. 1–9.
[21] D. P. Kingma and P. Dhariwal, "Glow: Generative flow with invertible 1x1 convolutions," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1–9.
[22] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 6840–6851.
[23] I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1–9.
[24] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Representations, 2014, pp. 1–9.
[25] A. Vaswani et al., "Attention is all you need," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[26] J. Shen et al., "Non-Attentive Tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling," 2020, arXiv:2010.04301.
[27] I. Elias et al., "Parallel Tacotron 2: A non-autoregressive neural TTS model with differentiable duration modeling," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 141–145.
[28] R. Child, "Very deep VAEs generalize autoregressive models and can outperform them on images," in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–9.
[29] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 17022–17033.
[30] J. You et al., "GAN Vocoder: Multi-resolution discriminator is all you need," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2021, pp. 2177–2181.
[31] K. Ito, "The LJ Speech dataset," 2017. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/
[32] J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2019.
[33] H. Zen et al., "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in Proc. 20th Annu. Conf. Int. Speech Commun. Assoc., 2019, pp. 1526–1530.
[34] H. S. Heo, B.-J. Lee, J. Huh, and J. S. Chung, "Clova baseline system for the VoxCeleb speaker recognition challenge 2020," 2020, arXiv:2009.14153.
[35] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Annu. Conf. Int. Speech Commun. Assoc., 2018, pp. 1086–1090.
[36] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–9.
[37] R. Kubichek, "Mel-cepstral distance measure for objective speech quality assessment," in Proc. IEEE Pacific Rim Conf. Commun. Comput. Signal Process., 1993, pp. 125–128.
[38] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–9.