
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

Rafael Valle¹, Kevin Shih¹, Ryan Prenger¹, Bryan Catanzaro¹

¹NVIDIA Applied Deep Learning Research (ADLR). Correspondence to: Rafael Valle <rafaelvalle@nvidia.com>.

arXiv:2005.05957v3 [cs.SD] 16 Jul 2020

Abstract

In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows insights from IAF and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be manipulated to control many aspects of speech synthesis (pitch, tone, speech rate, cadence, accent). Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. In addition, we provide results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training. Code and pre-trained models will be made publicly available at https://github.com/NVIDIA/flowtron.

1. Introduction

Current speech synthesis methods do not give the user enough control over how speech actually sounds. Automatically converting text to audio that successfully communicates the text was achieved a long time ago (Umeda et al., 1968; Badham et al., 1983). However, communicating only the text information leaves out all of the acoustic properties of the voice that convey much of the meaning and human expressiveness. Nearly all the research into speech synthesis since the 1960s has focused on adding that non-textual information to synthesized speech. But in spite of this, the typical speech synthesis problem is formulated as a text-to-speech problem in which the user inputs only text.

Taming the non-textual information in speech is difficult because the non-textual information is unlabeled. A voice actor may speak the same text with different emphasis or emotion based on context, but it is unclear how to label a particular reading. Without labels for the non-textual information, models have fallen back on unsupervised learning. Recent models have achieved nearly human-level quality, despite treating the non-textual information as a black box. The model's only goal is to match the patterns in the training data (Shen et al., 2017; Arik et al., 2017b;a; Ping et al., 2017). Despite these models' excellent ability to recreate the non-textual information in the training set, the user has no insight into or control over the non-textual information.

It is possible to formulate an unsupervised learning problem in such a way that the user can gain insights into the structure of a data set. One way is to formulate the problem such that the data is assumed to have a representation in some latent space, and have the model learn that representation. This latent space can then be investigated and manipulated to give the user more control over the generative model's output. Such approaches have been popular in image generation for some time now, allowing users to interpolate smoothly between images and to identify portions of the latent space that correlate with various features (Radford et al., 2015; Kingma & Dhariwal, 2018).
In audio, however, approaches have focused on embeddings that remove a large amount of information and are obtained from assumptions about what is interesting. Recent approaches that utilize deep learning for expressive speech synthesis combine text and a learned latent embedding for prosody or global style (Wang et al., 2018; Skerry-Ryan et al., 2018). A variation of this approach is proposed by (Hsu et al., 2018), wherein a Gaussian mixture model (GMM) encoding the audio is added to Tacotron to learn a latent embedding. These approaches control the non-textual information by learning a bank of embeddings or by providing the target output as an input to the model and compressing it. However, these approaches require making assumptions about the dimensionality of the embeddings beforehand and are not guaranteed to contain all the non-textual information it takes to reconstruct speech, including the risk of having dummy dimensions or not enough capacity, as the appendix sections in (Wang et al., 2018; Skerry-Ryan et al., 2018; Hsu et al., 2018) confirm. They also require finding an encoder and embedding that prevents the model from simply learning a complex identity function that ignores other inputs. Furthermore, these approaches focus on fixed-length embeddings under the assumption that variable-length embeddings are not robust to text and speaker perturbations. Finally, most of these approaches do not give the user control over the degree of variability in the synthesized speech.

In this paper we propose Flowtron: an autoregressive flow-based generative network for mel-spectrogram synthesis with control over acoustics and speech. Flowtron learns an invertible function that maps a distribution over mel-spectrograms to a latent z-space parameterized by a spherical Gaussian. With this formalization, we can generate samples containing specific speech characteristics manifested in mel-space by finding and sampling the corresponding region in z-space. In the basic approach, we generate samples by sampling a zero-mean spherical Gaussian prior and control the amount of variation by adjusting its variance. Despite its simplicity, this approach offers more speech variation and control than Tacotron.

In Flowtron, we can access specific regions of mel-spectrogram space by sampling a posterior distribution conditioned on prior evidence from existing samples (Kingma & Dhariwal, 2018; Gambardella et al., 2019). This approach allows us to make a monotonous speaker more expressive by computing the region in z-space associated with expressive speech as it is manifested in the prior evidence. Finally, our formulation also allows us to impose a structure on the z-space and parametrize it with a Gaussian mixture, for example. In this approach, related to (Hsu et al., 2018), speech characteristics in mel-spectrogram space can be associated with individual components. Hence, it is possible to generate samples with specific speech characteristics by selecting a component or a mixture thereof¹.

Although VAE- and GAN-based models (Hsu et al., 2018; Bińkowski et al., 2019; Akuzawa et al., 2018) also provide a latent prior that can be easily manipulated, in Flowtron this comes at no cost in speech quality nor optimization challenges.

We find that Flowtron is able to generalize and produce sharp mel-spectrograms by simply maximizing the likelihood of the data, while not requiring any additional Prenet or Postnet layer (Wang et al., 2017), nor the compound loss functions required by most state-of-the-art models like (Shen et al., 2017; Arik et al., 2017b;a; Ping et al., 2017; Skerry-Ryan et al., 2018; Wang et al., 2018; Bińkowski et al., 2019).

Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. It learns an invertible mapping of data to a latent space that can be manipulated to control many aspects of speech synthesis. Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. In addition, we provide results on control of speech variation, interpolation between samples, and style transfer between seen and unseen speakers with similar and different sentences. To our knowledge, this work is the first to show evidence that normalizing flow models can also be used for text-to-speech synthesis. We hope this will further stimulate developments in normalizing flows.

¹ What is relevant statistically might not be perceptually.

2. Related Work

Earlier approaches to text-to-speech synthesis that achieve human-like results focus on synthesizing acoustic features from text, treating the non-textual information as a black box (Shen et al., 2017; Arik et al., 2017b;a; Ping et al., 2017). Approaches like (Wang et al., 2017; Shen et al., 2017) require adding a critical Prenet layer to help with convergence and improve generalization (Wang et al., 2017). Furthermore, such models require an additional Postnet residual layer and a modified loss to produce "better resolved harmonics and high frequency formant structures, which reduces synthesis artifacts."

One approach to dealing with this lack of labels for underlying non-textual information is to look for hand-engineered statistics based on the audio that we believe are correlated with this underlying information. This is the approach taken by models like (Nishimura et al., 2016; Lee et al., 2019), wherein utterances are conditioned on audio statistics that can be calculated directly from the training data, such as F0 (fundamental frequency). However, in order to use such models, the statistics we hope to approximate must be decided upon a priori, and the target value of these statistics must be determined before synthesis.

Another approach to dealing with the issue of unlabeled non-textual information is to learn a latent embedding for prosody or global style. This is the approach taken by models like (Skerry-Ryan et al., 2018; Wang et al., 2018), wherein a bank of embeddings or a latent embedding space of prosody is learned from unlabelled data. While these approaches have shown promise, manipulating such latent variables only offers coarse control over expressive characteristics of speech.

A mixed approach consists of combining engineered statistics with latent embeddings learned in an unsupervised fashion. This is the approach taken by models like Mellotron (Valle et al., 2019b). In Mellotron, utterances are conditioned on both audio statistics and a latent embedding of acoustic features derived from a reference acoustic representation. Despite its advantages, this approach still requires determining these statistics before synthesis.
3. Flowtron

Flowtron is an autoregressive generative model that generates a sequence of mel-spectrogram frames by producing each mel-spectrogram frame based on previous mel-spectrogram frames, p(x) = ∏_t p(x_t | x_{1:t−1}). Our setup uses a neural network as a generative model by sampling from a simple distribution p(z). We consider two simple distributions with the same number of dimensions as our desired mel-spectrogram: a zero-mean spherical Gaussian and a mixture of spherical Gaussians with fixed or learnable parameters:

    z ∼ N(z; 0, I)    (1)
    z ∼ Σ_k φ̂_k N(z; μ̂_k, Σ̂_k)    (2)

These samples are put through a series of invertible, parametrized transformations f, in our case affine transformations, that transform p(z) into p(x):

    x = f_0 ∘ f_1 ∘ … ∘ f_k(z)    (3)

As illustrated in (Kingma et al., 2016), in autoregressive normalizing flows the t-th variable z′_t only depends on previous timesteps z_{1:t−1}:

    z′_t = f_k(z_{1:t−1})    (4)

By using parametrized affine transformations for f and due to the autoregressive structure, the Jacobian of each of the transformations f is lower triangular, hence its determinant is easy to compute. With this setup we can train Flowtron by maximizing the log-likelihood of the data, which can be done using the change of variables:

    log p_θ(x) = log p_θ(z) + Σ_{i=1}^{k} log |det(J(f_i^{−1}(x)))|    (5)
    z = f_k^{−1} ∘ f_{k−1}^{−1} ∘ … ∘ f_0^{−1}(x)    (6)

For the forward pass through the network, we take the mel-spectrograms as vectors and process them through several "steps of flow" conditioned on the text and speaker ids. A step of flow here consists of an affine coupling layer, described below.

3.1. Affine Coupling Layer

Invertible neural networks are typically constructed using coupling layers (Dinh et al., 2014; 2016; Kingma & Dhariwal, 2018). In our case, we use an affine coupling layer (Dinh et al., 2016). Every input x_{t−1} produces scale and bias terms, s and b respectively, that affine-transform the succeeding input x_t:

    (log s_t, b_t) = NN(x_{1:t−1}, text, speaker)    (7)
    x′_t = s_t · x_t + b_t    (8)

Here NN() can be any autoregressive causal transformation. This can be achieved by time-wise concatenation of a 0-valued vector to the input provided to NN(). The affine coupling layer preserves invertibility for the overall network, even though NN() does not need to be invertible. This follows because the first input of NN() is a constant and, due to the autoregressive nature of the model, the scaling and translation terms s_t and b_t only depend on x_{1:t−1} and the fixed text and speaker vectors. Accordingly, when inverting the network, we can compute s_t and b_t from the preceding input x_{1:t−1}, and then invert x′_t to compute x_t, by simply recomputing NN(x_{1:t−1}, text, speaker).

With an affine coupling layer, only the s_t term changes the volume of the mapping and adds a change-of-variables term to the loss. This term also serves to penalize the model for non-invertible affine mappings:

    log |det(J(f_coupling^{−1}(x)))| = log |s|    (9)

With this setup, it is also possible to revert the ordering of the input x without loss of generality. Hence, we choose to revert the order of the input at every even step of flow and to maintain the original order on odd steps of flow. This allows the model to learn dependencies both forwards and backwards in time while remaining causal and invertible.

3.2. Model Architecture

Our text encoder modifies Tacotron's by replacing batch-norm with instance-norm. Our decoder and NN architecture, depicted in Figure 1, removes the Prenet and Postnet layers that are essential in Tacotron. We use the content-based tanh attention described in (Vinyals et al., 2015). We use the Mel Encoder described in (Hsu et al., 2018) for Flowtron models that predict the parameters of the Gaussian mixture.

Unlike (Ping et al., 2017; Gibiansky et al., 2017), where site-specific speaker embeddings are used, we use a single speaker embedding that is channel-wise concatenated with the encoder outputs at every token. We use a fixed dummy speaker embedding for models not conditioned on speaker id. Finally, we add a dense layer with a sigmoid output to the flow step closest to z. This provides the model with a gating mechanism as early as possible during inference to avoid extra computation.
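To make the mechanics of Eqs. (7)–(9) concrete, the following is a minimal PyTorch sketch of one autoregressive affine coupling step. It is not the released Flowtron code: the paper's NN() is an attention-based decoder conditioned on text and speaker embeddings, whereas here a plain LSTM over a precomputed conditioning sequence stands in for it, and the class and argument names are ours.

```python
# Minimal sketch of an autoregressive affine coupling step (Eqs. 7-9).
# Assumptions: x and z are (batch, time, n_mel); cond is a (batch, time, n_cond)
# conditioning sequence standing in for the text/speaker-attention decoder.
import torch
import torch.nn as nn


class AffineCouplingSketch(nn.Module):
    def __init__(self, n_mel=80, n_cond=128, n_hidden=256):
        super().__init__()
        # Stand-in for NN(): causal because the input is shifted by one frame.
        self.rnn = nn.LSTM(n_mel + n_cond, n_hidden, batch_first=True)
        self.proj = nn.Linear(n_hidden, 2 * n_mel)

    def get_scale_bias(self, x_prev, cond):
        # x_prev holds a zero frame followed by x_{1:t-1}; cond is broadcast over time.
        h, _ = self.rnn(torch.cat([x_prev, cond], dim=-1))
        log_s, b = self.proj(h).chunk(2, dim=-1)
        return log_s, b

    def forward(self, x, cond):
        # Training direction: x -> z, plus the log-det term of Eq. (9).
        x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        log_s, b = self.get_scale_bias(x_prev, cond)
        z = torch.exp(log_s) * x + b
        return z, log_s.sum()

    @torch.no_grad()
    def inverse(self, z, cond):
        # Inference direction: s_t, b_t depend only on already-recovered frames,
        # so recover x frame by frame, recomputing NN() on the growing prefix.
        x = torch.zeros_like(z)
        for t in range(z.shape[1]):
            x_prev = torch.cat([torch.zeros_like(z[:, :1]), x[:, :t]], dim=1)
            log_s, b = self.get_scale_bias(x_prev, cond[:, : t + 1])
            x[:, t] = (z[:, t] - b[:, t]) * torch.exp(-log_s[:, t])
        return x
```

Training such a step would minimize 0.5 * z.pow(2).sum() - log_det (the negative log-likelihood of Eq. (5) up to a constant), summed over all steps of flow; stacking several steps and reversing the time order on alternate steps gives the bidirectional behavior described above.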
Figure 1: Flowtron network. Text and speaker embeddings are channel-wise concatenated. A 0-valued vector is concatenated with x in the time dimension.

3.3. Inference

Once the network is trained, doing inference is simply a matter of randomly sampling z values from a spherical Gaussian, or Gaussian mixture, and running them through the network, reverting the order of the input when necessary. During training we used σ² = 1. The parameters of the Gaussian mixture are either fixed or predicted by Flowtron. In Section 4.3 we explore the effects of different values for σ². In general, we found that sampling z values from a Gaussian with a lower standard deviation than that assumed during training resulted in mel-spectrograms that sounded better, as found in (Kingma & Dhariwal, 2018) and earlier work on likelihood-based generative models (Parmar et al., 2018). During inference we sampled z values from a Gaussian with σ² = 0.5, unless otherwise specified. The text and speaker embeddings are included at each of the coupling layers as before, but now the affine transforms are inverted in time, and these inverses are also guaranteed by the loss.

4. Experiments

This section describes our training setup and provides quantitative and qualitative results. Our quantitative results show that Flowtron has mean opinion scores (MOS) that are comparable to those of state-of-the-art models for text-to-mel-spectrogram synthesis such as Tacotron 2. Our qualitative results display many features that are not possible or not efficient with Tacotron and Tacotron 2 GST. These features include control of the amount of variation in speech, interpolation between samples and style transfer between speakers seen and unseen during training.

We decode all mel-spectrograms into waveforms by using a single pre-trained WaveGlow (Prenger et al., 2019) model trained on a single speaker and available on GitHub (Valle et al., 2019a). During inference we used σ² = 0.7. In consonance with (Valle et al., 2019b), our results suggest that WaveGlow can be used as a universal decoder.

Although we provide images to illustrate our results, they can best be appreciated by listening. Hence, we ask the readers to visit our website² to listen to Flowtron samples.

² https://nv-adlr.github.io/Flowtron

4.1. Training Setup

We train our Flowtron, Tacotron 2 and Tacotron 2 GST models using a dataset that combines the LJSpeech (LJS) dataset (Ito et al., 2017) with two proprietary single-speaker datasets with 20 and 10 hours each (Sally and Helen). We will refer to this combined dataset as LSH. We also train a Flowtron model on the train-clean-100 subset of LibriTTS (Zen et al., 2019) with 123 speakers and 25 minutes on average per speaker. Speakers with less than 5 minutes of data and files that are longer than 10 seconds are filtered out. For each dataset we use at least 180 randomly chosen samples for the validation set and the remainder for the training set.

The models are trained on uniformly sampled normalized text and ARPAbet encodings obtained from the CMU Pronouncing Dictionary (Weide, 1998). We do not perform any data augmentation. We adapt the public Tacotron 2 and Tacotron 2 GST repos to include speaker embeddings as described in Section 3.

We use a sampling rate of 22050 Hz and mel-spectrograms with 80 bins using librosa mel filter defaults. We apply the STFT with an FFT size of 1024, a window size of 1024 samples and a hop size of 256 samples (∼12 ms).

We use the ADAM (Kingma & Ba, 2014) optimizer with default parameters, a 1e-4 learning rate and 1e-6 weight decay for Flowtron, and a 1e-3 learning rate and 1e-5 weight decay for the other models, following guidelines in (Wang et al., 2017). We anneal the learning rate once the generalization error starts to plateau and stop training once the generalization error stops significantly decreasing or starts increasing. The Flowtron models with 2 steps of flow were trained on the LSH dataset for approximately 1000 epochs and then fine-tuned on LibriTTS for 500 epochs. Tacotron 2 and Tacotron 2 GST are trained for approximately 500 epochs. Each model is trained on a single NVIDIA DGX-1 with 8 GPUs.

We find it faster to first learn to attend on a Flowtron model with a single step of flow and large amounts of data than with multiple steps of flow and less data. After the model has learned to attend, we transfer its parameters to models with more steps of flow and speakers with less data. Thus, we first train a Flowtron model with a single step of flow on the LSH dataset with many hours per speaker. Then we fine-tune this model to Flowtron models with more steps of flow. Finally, these models are fine-tuned on LibriTTS with an optional new speaker embedding.
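For reference, the spectrogram parameters above (22050 Hz audio, 80 mel bins, FFT/window size 1024, hop size 256) can be approximated with librosa as in the sketch below. The exact preprocessing pipeline, including the log compression, is not specified here, so treat those details as assumptions rather than the released code.

```python
# Rough librosa-based approximation of the mel-spectrogram settings in Section 4.1.
import librosa
import numpy as np

def mel_spectrogram(path, sr=22050, n_fft=1024, win_length=1024,
                    hop_length=256, n_mels=80):
    y, _ = librosa.load(path, sr=sr)                       # resample to 22050 Hz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, win_length=win_length,
        hop_length=hop_length, n_mels=n_mels)              # librosa mel filter defaults
    return np.log(np.clip(mel, 1e-5, None))                # assumed log compression

# One frame every hop_length / sr ≈ 11.6 ms, matching the ~12 ms stated above.
```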
4.2. Mean Opinion Score Comparison

We provide results that compare mean opinion scores (MOS) from real data from the LJS dataset, samples from a Flowtron with 2 steps of flow, and samples from our implementation of Tacotron 2, both trained on LSH. Although the models evaluated are multi-speaker, we only compute mean opinion scores on LJS. In addition, we use the mean opinion scores provided in (Prenger et al., 2019) for ground truth data from the LJS dataset.

We crowd-sourced mean opinion score (MOS) tests on Amazon Mechanical Turk. Raters first had to pass a hearing test to be eligible. Then they listened to an utterance, after which they rated pleasantness on a five-point scale. We used 30 volume-normalized utterances from all speakers disjoint from the training set for evaluation, and randomly chose the utterances for each subject.

The mean opinion scores are shown in Table 1 with 95% confidence intervals computed over approximately 250 scores per source. The results roughly match our subjective qualitative assessment. The larger advantage of Flowtron is in the control over the amount of speech variation and the manipulation of the latent space.

Source       Flows   Mean Opinion Score (MOS)
Real         -       4.274 ± 0.1340
Flowtron     3       3.665 ± 0.1634
Tacotron 2   -       3.521 ± 0.1721

Table 1: Mean Opinion Score (MOS) evaluations with 95% confidence intervals for various sources.

4.3. Sampling the Prior

The simplest approach to generate samples with Flowtron is to sample from a prior distribution z ∼ N(0, σ²) and adjust σ² to control the amount of variation. Whereas σ² = 0 completely removes variation and produces outputs based on the model bias, increasing the value of σ² will increase the amount of variation in speech.

4.3.1. Speech Variation

To showcase the amount of variation and the control thereof in Flowtron, we synthesize 10 mel-spectrograms and sample the Gaussian prior with σ² ∈ {0.0, 0.5, 1.0}. All samples are generated conditioned on a fixed speaker, Sally, and the text "How much variation is there?" to illustrate the relationship between σ² and variability.

Our results show that despite all the variability added by increasing σ², all the samples synthesized with Flowtron still produce high-quality speech. Figure 2 also shows that unlike most SOTA models (Shen et al., 2017; Arik et al., 2017b;a; Ping et al., 2017; Skerry-Ryan et al., 2018; Wang et al., 2018; Bińkowski et al., 2019), Flowtron generates sharp harmonics and well-resolved formants without a compound loss nor Prenet or Postnet layers.

Figure 2: Mel-spectrograms generated with Flowtron using different σ² (panels: σ² = 0, σ² = 0.5, σ² = 1). This parameter can be adjusted to control mel-spectrogram variability during inference.
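As a concrete illustration of this prior-sampling control, the sketch below scales a standard-normal draw by the chosen variance before running the inverse flow. The interface names (flow.inverse, text_ids, speaker_id) are placeholders for whatever a trained model exposes, not the released API.

```python
# Minimal sketch of prior sampling (Section 4.3): z ~ N(0, sigma_sq * I).
import torch

def synthesize(flow, text_ids, speaker_id, n_frames, n_mels=80, sigma_sq=0.5):
    # sigma_sq = 0 collapses to the deterministic "model bias" output;
    # larger sigma_sq yields more variation in pitch, rhythm, and duration.
    z = (sigma_sq ** 0.5) * torch.randn(1, n_frames, n_mels)
    with torch.no_grad():
        return flow.inverse(z, text_ids, speaker_id)

# e.g. mels = [synthesize(flow, text, spk, 400, sigma_sq=s) for s in (0.0, 0.5, 1.0)]
```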
Now we show that adjusting σ² is a simple and valuable approach that provides more variation and control than Tacotron, without sacrificing speech quality. For this, we synthesize 10 samples with Tacotron 2 using different values for the Prenet dropout probability p ∈ {0.45, 0.5, 0.55}. We scale the outputs of the dropout layer such that the mean of the output remains equal to the mean with p = 0.5, the value used during training. Although we also provide samples computed with values of p ∈ [0, 1] in our supplemental material, we do not include them in our results because they are unintelligible.

In Figure 3 below we provide scatter plots of sample duration in seconds. Our results show that whereas σ² = 0 produces samples with no variation in duration, larger values of σ² produce samples with more variation in duration. Humans manipulate word and sentence length to express themselves, hence this is valuable.

Figure 3: Sample duration in seconds given parameters σ² and p. These results show that Flowtron provides more variation in sample duration than Tacotron 2.

In Figure 4 we provide scatter plots of F0 contours extracted with the YIN algorithm (De Cheveigné & Kawahara, 2002), with minimum F0, maximum F0 and harmonicity threshold equal to 80 Hz, 400 Hz and 0.3 respectively. Our results show a behavior similar to the previous sample duration analysis. As expected, σ² = 0 provides no variation in the F0 contour³, while increasing the value of σ² will increase the amount of variation in F0 contours.

Our results in Figure 4 also show that the samples produced with Flowtron are considerably less monotonous than the samples produced with Tacotron 2. Whereas increasing σ² considerably increases variation in F0, modifying p barely produces any variation. This is valuable because expressive speech is associated with non-monotonic F0 contours.

Figure 4: F0 contours obtained from samples generated by Flowtron (σ² ∈ {0, 0.5, 1}) and Tacotron 2 (p ∈ {0.45, 0.5, 0.55}). Flowtron provides more expressivity than Tacotron 2.

³ Variations at σ² = 0 are due to different z values used by WaveGlow.

4.3.2. Interpolation Between Samples

With Flowtron we can perform interpolation in z-space to achieve interpolation in mel-spectrogram space. For this experiment we evaluate Flowtron models with and without speaker embeddings. For the experiment with speaker embeddings we choose the Sally speaker and the phrase "It is well known that deep generative models have a rich latent space.". We generate mel-spectrograms by sampling z ∼ N(0, 0.8) twice and interpolating between them over 100 steps.

For the experiment without speaker embeddings we interpolate between Sally and Helen using the phrase "We are testing this model.". First, we perform inference by sampling z ∼ N(0, 0.5) until we find two z values, z_h and z_s, that produce mel-spectrograms with Helen's and Sally's voices respectively. We then generate samples by performing inference while linearly interpolating between z_h and z_s.

Our same-speaker interpolation samples show that Flowtron is able to interpolate between multiple samples while producing correct alignment maps. In addition, our different-speaker interpolation samples show that Flowtron is able to blur the boundaries between two speakers, creating a speaker that combines the characteristics of both.
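A small sketch of this interpolation experiment is given below: linearly interpolate between two latent tensors and decode each intermediate point with the inverse flow. The names flow.inverse, text_ids and speaker_id are placeholders, and both z tensors are assumed to share the same shape (same number of frames).

```python
# Sketch of z-space interpolation (Section 4.3.2).
import torch

def interpolate_latents(flow, z_a, z_b, text_ids, speaker_id, steps=100):
    mels = []
    with torch.no_grad():
        for alpha in torch.linspace(0.0, 1.0, steps):
            z = (1.0 - alpha) * z_a + alpha * z_b   # straight line in z-space
            mels.append(flow.inverse(z, text_ids, speaker_id))
    return mels
```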
4.4. Sampling the Posterior

In this approach we generate samples with Flowtron by sampling a posterior distribution conditioned on prior evidence containing speech characteristics of interest, as described in (Gambardella et al., 2019; Kingma & Dhariwal, 2018). In this experiment, we collect prior evidence z_e by performing a forward pass with the speaker id to be used during inference⁴, an observed mel-spectrogram and text from a set of samples with characteristics of interest. If necessary, we time-concatenate each z_e with itself to fulfill minimum length requirements defined according to the text length to be said during inference.

⁴ To remove this speaker's information from z_e.

Tacotron 2 GST (Wang et al., 2018) has an equivalent posterior sampling approach, in which during inference the model is conditioned on a weighted sum of global style tokens (posterior) queried through an embedding of existing audio samples (prior). For Tacotron 2 GST, we evaluate two approaches: in one we use a single sample to query a style token; in the other we use an average style token computed over multiple samples.

4.4.1. Seen Speaker Without Alignments

In this experiment we compare Sally samples from Flowtron and Tacotron 2 GST generated by conditioning on the posterior computed over 30 Helen samples with the highest variance in fundamental frequency. The goal is to make a monotonic speaker sound expressive. Our experiments show that by sampling from the posterior or interpolating between the posterior and a standard Gaussian prior, Flowtron is able to make a monotonic speaker gradually sound more expressive. On the other hand, Tacotron 2 GST is barely able to alter the characteristics of the monotonic speaker.

4.4.2. Seen Speaker With Alignments

We use a Flowtron model with speaker embeddings to illustrate Flowtron's ability to learn and transfer acoustic characteristics that are hard to express algorithmically but easy to perceive acoustically. We select a female speaker from LibriTTS with a distinguished nasal voice and oscillation in F0 as our source speaker and transfer her style to a male speaker, also from LibriTTS, with acoustic characteristics that sound different from the female speaker. Unlike the previous experiment, this time the text and the alignment maps are transferred from the female to the male speaker.

Figure 5 is an attempt to visualize the transfer of these acoustic qualities. It shows that after the transfer, the lower partials of the male speaker oscillate more and become more similar to the female speaker.

Figure 5: Mel-spectrograms from a female speaker, a male speaker, and a sample where we transfer the acoustic characteristics from the female speaker to the male speaker (panels: female, transfer, male). The transferred sample is more similar to the female speaker than to the male speaker.

4.4.3. Unseen Speaker Style

We compare samples generated with Flowtron and Tacotron 2 GST with speaker embeddings in which we modify a speaker's style by using data from the same speaker but from a style not seen during training. Whereas Sally's data used during training consists of news article readings, the evaluation samples contain Sally's interpretation of the somber and vampiresque novel Born of Darkness.

Our samples show that Tacotron 2 GST fails to emulate the somber style from the Born of Darkness data. We show that Flowtron succeeds in transferring not only the somber style in the evaluation data, but also the long pauses associated with the narrative style.
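These style-transfer experiments all rely on the posterior-evidence procedure described at the start of Section 4.4. The sketch below is a much-simplified illustration of that idea, assuming a flow whose forward pass returns z and whose inverse synthesizes mel-spectrograms; the evidence averaging and the blending weight alpha are simplifications of the posterior in (Gambardella et al., 2019), and the helper names are placeholders, not the paper's exact computation.

```python
# Simplified sketch of posterior-evidence sampling (Section 4.4).
import torch

def evidence_latent(flow, mels, texts, speaker_id, n_frames):
    zs = []
    with torch.no_grad():
        for mel, text in zip(mels, texts):
            z, _ = flow(mel, text, speaker_id)             # forward pass: mel -> z
            reps = -(-n_frames // z.shape[1])               # ceil division
            zs.append(z.repeat(1, reps, 1)[:, :n_frames])   # tile to the target length
    return torch.stack(zs).mean(dim=0)                      # average evidence latent

def sample_with_evidence(flow, z_e, text_ids, speaker_id, alpha=0.8, sigma_sq=0.5):
    # alpha = 0 recovers plain prior sampling; alpha -> 1 moves toward the evidence.
    z = alpha * z_e + (1.0 - alpha) * (sigma_sq ** 0.5) * torch.randn_like(z_e)
    with torch.no_grad():
        return flow.inverse(z, text_ids, speaker_id)
```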
4.4.4. Unseen Speaker

In this experiment we compare Flowtron and Tacotron 2 GST samples in which we transfer the speaking style of a speaker not seen during training. Both models use speaker embeddings.

For these experiments, we consider two speakers. The first comes from speaker ID 03 from RAVDESS, a dataset with emotion labels. We focus on the label "surprised". The second speaker is Richard Feynman, using a set of 10 audio samples collected from the web.

For each experiment, we use the Sally speaker and the sentences "Humans are walking on the street?" and "Surely you are joking mister Feynman.", which do not exist in RAVDESS nor in the audio samples from Richard Feynman.

The samples generated with Tacotron 2 GST are not able to emulate the surprised style from RAVDESS nor Feynman's prosody and acoustic characteristics. Flowtron, on the other hand, is able to make Sally sound surprised, which is drastically different from the monotonous baseline. Likewise, Flowtron is able to pick up on the prosody and articulation details particular to Feynman's speaking style, and transfer them to Sally.

4.5. Sampling the Gaussian Mixture

In this last section we showcase visualizations and samples from Flowtron Gaussian Mixture (GM). First we investigate how different mixture components and speakers are correlated. Then we provide sound examples in which we modulate speech characteristics by translating one of the dimensions of an individual component.

4.5.1. Visualizing Assignments

For the first experiment, we train a Flowtron Gaussian Mixture on LSH with 2 steps of flow, speaker embeddings and fixed mean and covariance (Flowtron GM-A). We obtain mixture component assignments per mel-spectrogram by performing a forward pass and averaging the component assignment over time and samples. Figure 6 shows that whereas most speakers are equally assigned to all components, component 7 is almost exclusively assigned to Helen's data.

Figure 6: Component assignments for Flowtron GM-A. Unlike LJS and Sally, Helen is almost exclusively assigned to component 7.

In the second experiment, we train a Flowtron Gaussian Mixture on LibriTTS with 1 step of flow, without speaker embeddings and with predicted mean and covariance (Flowtron GM-B). Figure 7 shows that Flowtron GM assigns more probability to component 7 when the speaker is male than when it's female. Conversely, the model assigns more probability to component 6 when the speaker is female than when it's male.

Figure 7: Component assignments for Flowtron GM-B. Components 7 and 8 are assigned different probabilities according to gender, suggesting that the information stored in the components is gender dependent.

4.5.2. Translating Dimensions

In this subsection, we use the model Flowtron GM-A described previously. We focus on selecting a single mixture component and translating one of its dimensions by adding an offset.

The samples in our supplementary material show that we are able to modulate specific speech characteristics like pitch and word duration. Although the samples generated by translating one of the dimensions associated with pitch height have different pitch contours, they have the same duration. Similarly, our samples show that translating the dimension associated with the length of the first word does not modulate the pitch of the first word. This provides evidence that we can modulate these attributes by manipulating these dimensions and that the model is able to learn a disentangled representation of these speech attributes.
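The two manipulations in Section 4.5 can be sketched as follows: sample z from a single mixture component, and translate one dimension of that sample by a constant offset before decoding. The component parameters, the dimension index, and the helper names are hypothetical; the paper does not say which dimension maps to which attribute.

```python
# Sketch of Gaussian-mixture sampling and dimension translation (Section 4.5).
import torch

def sample_component(means, sigmas, k, n_frames):
    # means, sigmas: (n_components, n_mels); sample every frame from component k.
    return means[k] + sigmas[k] * torch.randn(1, n_frames, means.shape[1])

def translate_dimension(z, dim, offset):
    z = z.clone()
    z[..., dim] += offset        # shift a single latent dimension by a constant
    return z

# z = sample_component(means, sigmas, k=7, n_frames=400)
# mel = flow.inverse(translate_dimension(z, dim=3, offset=1.0), text_ids, speaker_id)
```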
5. Discussion

In this paper we propose a new text-to-mel-spectrogram synthesis model based on autoregressive flows that is optimized by maximizing the likelihood and allows for control of speech variation and style transfer. Our results show that samples generated with Flowtron achieve mean opinion scores that are similar to samples generated with state-of-the-art text-to-speech synthesis models. In addition, we demonstrate that at no extra cost and without a compound loss term, our model learns a latent space that stores non-textual information. Our experiments show that Flowtron gives the user the possibility to transfer characteristics from a source sample or speaker to a target speaker, for example making a monotonic speaker sound more expressive.

Our results show that despite all the variability added by increasing σ², the samples synthesized with Flowtron still produce high-quality speech. Our results show that Flowtron learns a latent space over non-textual features that can be investigated and manipulated to give the user more control over the generative model's output. We provide many examples that showcase this, including increasing variation in mel-spectrograms in a controllable manner, transferring the style from speakers seen and unseen during training to another speaker using sentences with similar or different text, and making a monotonic speaker sound more expressive.

Flowtron produces expressive speech without labeled data or ever seeing expressive data. It pushes text-to-speech synthesis beyond the expressive limits of personal assistants. It opens new avenues for speech synthesis in human-computer interaction and the arts, where realism and expressivity are of utmost importance. To our knowledge, this work is the first to demonstrate the advantages of using normalizing flow models in text-to-mel-spectrogram synthesis.

References

Akuzawa, K., Iwasawa, Y., and Matsuo, Y. Expressive speech synthesis via modeling expressions with variational autoencoder. arXiv preprint arXiv:1804.02135, 2018.

Arik, S., Diamos, G., Gibiansky, A., Miller, J., Peng, K., Ping, W., Raiman, J., and Zhou, Y. Deep Voice 2: Multi-speaker neural text-to-speech. arXiv preprint arXiv:1705.08947, 2017a.

Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al. Deep Voice: Real-time neural text-to-speech. arXiv preprint arXiv:1702.07825, 2017b.

Badham, J., Lasker, L., Parkes, W. F., Rubinstein, A. B., Broderick, M., Coleman, D., and Wood, J. WarGames, 1983.

Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., Casagrande, N., Cobo, L. C., and Simonyan, K. High fidelity speech synthesis with adversarial networks. arXiv preprint arXiv:1909.11646, 2019.

De Cheveigné, A. and Kawahara, H. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917–1930, 2002.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Gambardella, A., Baydin, A. G., and Torr, P. H. Transflow learning: Repurposing flow models without retraining. arXiv preprint arXiv:1911.13270, 2019.

Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., Raiman, J., and Zhou, Y. Deep Voice 2: Multi-speaker neural text-to-speech. In Advances in Neural Information Processing Systems, pp. 2962–2970, 2017.

Hsu, W.-N., Zhang, Y., Weiss, R. J., Zen, H., Wu, Y., Wang, Y., Cao, Y., Jia, Y., Chen, Z., Shen, J., et al. Hierarchical generative modeling for controllable speech synthesis. arXiv preprint arXiv:1810.07217, 2018.

Ito, K. et al. The LJ Speech dataset, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.

Lee, J., Choi, H.-S., Jeon, C.-B., Koo, J., and Lee, K. Adversarially trained end-to-end Korean singing voice synthesis system. arXiv preprint arXiv:1908.01919, 2019.

Nishimura, M., Hashimoto, K., Oura, K., Nankaku, Y., and Tokuda, K. Singing voice synthesis based on deep neural networks. In Interspeech 2016, pp. 2478–2482, 2016. doi: 10.21437/Interspeech.2016-1027. URL http://dx.doi.org/10.21437/Interspeech.2016-1027.

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. Image Transformer. arXiv preprint arXiv:1802.05751, 2018.

Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., and Miller, J. Deep Voice 3: 2000-speaker neural text-to-speech. arXiv preprint arXiv:1710.07654, 2017.

Prenger, R., Valle, R., and Catanzaro, B. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. IEEE, 2019.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. arXiv preprint arXiv:1712.05884, 2017.

Skerry-Ryan, R., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., Shor, J., Weiss, R. J., Clark, R., and Saurous, R. A. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. arXiv preprint arXiv:1803.09047, 2018.

Umeda, N., Matsui, E., Suzuki, T., and Omura, H. Synthesis of fairy tales using an analog vocal tract. In Proceedings of the 6th International Congress on Acoustics, pp. B159–162, 1968.

Valle, R., Li, J., Prenger, R., and Catanzaro, B. Mellotron github repo, 2019a. URL https://github.com/NVIDIA/mellotron.

Valle, R., Li, J., Prenger, R., and Catanzaro, B. Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens. arXiv preprint arXiv:1910.11997, 2019b.

Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pp. 2773–2781, 2015.

Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al. Tacotron: A fully end-to-end text-to-speech synthesis model. arXiv preprint arXiv:1703.10135, 2017.

Wang, Y., Stanton, D., Zhang, Y., Skerry-Ryan, R., Battenberg, E., Shor, J., Xiao, Y., Ren, F., Jia, Y., and Saurous, R. A. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017, 2018.

Weide, R. L. The CMU Pronouncing Dictionary. URL: http://www.speech.cs.cmu.edu/cgi-bin/cmudict, 1998.

Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., and Wu, Y. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882, 2019.