Skerry-Ryan et al., 2018; Hsu et al., 2018) confirm. They also require finding an encoder and embedding that prevents the model from simply learning a complex identity function that ignores other inputs. Furthermore, these approaches focus on fixed-length embeddings under the assumption that variable-length embeddings are not robust to text and speaker perturbations. Finally, most of these approaches do not give the user control over the degree of variability in the synthesized speech.

In this paper we propose Flowtron: an autoregressive flow-based generative network for mel-spectrogram synthesis with control over acoustics and speech. Flowtron learns an invertible function that maps a distribution over mel-spectrograms to a latent z-space parameterized by a spherical Gaussian. With this formalization, we can generate samples containing specific speech characteristics manifested in mel-space by finding and sampling the corresponding region in z-space. In the basic approach, we generate samples by sampling a zero-mean spherical Gaussian prior and control the amount of variation by adjusting its variance. Despite its simplicity, this approach offers more speech variation and control than Tacotron.
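As a minimal illustration of this basic approach, the sketch below draws z from a zero-mean spherical Gaussian whose standard deviation sets the amount of variation and hands it to the generative direction of the flow. The names (`sample_prior`, `inverse_flow`, `n_frames`, `n_mel_channels`) are illustrative placeholders, not the repository's actual API.

```python
import torch

def sample_prior(n_frames: int, n_mel_channels: int, sigma: float = 0.5):
    """Draw z ~ N(0, sigma^2 I); larger sigma yields more speech variation."""
    return sigma * torch.randn(n_frames, n_mel_channels)

# Hypothetical usage: `inverse_flow` stands in for the generative direction
# f(z) conditioned on text and speaker; it is not the repository's API.
# z = sample_prior(n_frames=400, n_mel_channels=80, sigma=0.5)
# mel = inverse_flow(z, text, speaker)
```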
In Flowtron, we can access specific regions of mel-spectrogram space by sampling a posterior distribution conditioned on prior evidence from existing samples (Kingma & Dhariwal, 2018; Gambardella et al., 2019). This approach allows us to make a monotonous speaker more expressive by computing the region in z-space associated with expressive speech as it is manifested in the prior evidence. Finally, our formulation also allows us to impose structure on the z-space and parametrize it with a Gaussian mixture, for example. In this approach, related to (Hsu et al., 2018), speech characteristics in mel-spectrogram space can be associated with individual mixture components. Hence, it is possible to generate samples with specific speech characteristics by selecting a component or a mixture thereof¹.

¹ What is relevant statistically might not be relevant perceptually.
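The sketch below shows one plausible realization of the prior-evidence idea, not the paper's exact procedure: latents obtained by passing existing expressive utterances through the flow are averaged, and synthesis then samples around that region of z-space. All names (`expressive_prior`, `inverse_flow`, `z_evidence`) are illustrative assumptions.

```python
import torch

def expressive_prior(z_evidence: torch.Tensor, sigma: float = 0.3):
    """z_evidence: [n_samples, n_frames, n_mel] latents computed from
    expressive utterances (x -> z). Returns a sampler centered on the
    region of z-space those utterances occupy."""
    center = z_evidence.mean(dim=0)          # average over the evidence
    def sample():
        return center + sigma * torch.randn_like(center)
    return sample

# Hypothetical usage:
# sampler = expressive_prior(z_from_expressive_clips, sigma=0.3)
# mel = inverse_flow(sampler(), text, speaker)   # placeholder generative call
```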
Although VAE- and GAN-based models (Hsu et al., 2018; Bińkowski et al., 2019; Akuzawa et al., 2018) also provide a latent prior that can be easily manipulated, in Flowtron this comes at no cost in speech quality nor optimization challenges.

We find that Flowtron is able to generalize and produce sharp mel-spectrograms by simply maximizing the likelihood of the data, while not requiring any additional Prenet or Postnet layer (Wang et al., 2017), nor the compound loss functions required by most state-of-the-art models (Shen et al., 2017; Arik et al., 2017b;a; Ping et al., 2017; Skerry-Ryan et al., 2018; Wang et al., 2018; Bińkowski et al., 2019).

Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. It learns an invertible mapping to a latent space that can be manipulated to control many aspects of speech synthesis. Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. In addition, we provide results on control of speech variation, interpolation between samples, and style transfer between seen and unseen speakers with similar and different sentences. To our knowledge, this work is the first to show evidence that normalizing flow models can also be used for text-to-speech synthesis. We hope this will further stimulate developments in normalizing flows.

2. Related Work

Earlier approaches to text-to-speech synthesis that achieve human-like results focus on synthesizing acoustic features from text, treating the non-textual information as a black box (Shen et al., 2017; Arik et al., 2017b;a; Ping et al., 2017). Approaches like (Wang et al., 2017; Shen et al., 2017) require adding a critical Prenet layer to help with convergence and improve generalization (Wang et al., 2017). Furthermore, such models require an additional Postnet residual layer and a modified loss to produce "better resolved harmonics and high frequency formant structures, which reduces synthesis artifacts."

One approach to dealing with this lack of labels for underlying non-textual information is to look for hand-engineered statistics based on the audio that we believe are correlated with this underlying information. This is the approach taken by models like (Nishimura et al., 2016; Lee et al., 2019), wherein utterances are conditioned on audio statistics that can be calculated directly from the training data, such as F0 (fundamental frequency). However, in order to use such models, the statistics we hope to approximate must be decided upon a priori, and the target value of these statistics must be determined before synthesis.

Another approach to dealing with the issue of unlabeled non-textual information is to learn a latent embedding for prosody or global style. This is the approach taken by models like (Skerry-Ryan et al., 2018; Wang et al., 2018), wherein a bank of embeddings or a latent embedding space of prosody is learned from unlabelled data. While these approaches have shown promise, manipulating such latent variables only offers coarse control over expressive characteristics of speech.

A mixed approach consists of combining engineered statistics with latent embeddings learned in an unsupervised fashion. This is the approach taken by models like Mellotron (Valle et al., 2019b). In Mellotron, utterances are conditioned on both audio statistics and a latent embedding of acoustic features derived from a reference acoustic representation. Despite its advantages, this approach still requires determining these statistics before synthesis.
3. Flowtron

Flowtron is an autoregressive generative model that generates a sequence of mel-spectrogram frames by producing each frame based on the previous ones, $p(x) = \prod_t p(x_t \mid x_{1:t-1})$. Our setup uses a neural network as a generative model by sampling from a simple distribution p(z). We consider two simple distributions with the same number of dimensions as our desired mel-spectrogram: a zero-mean spherical Gaussian and a mixture of spherical Gaussians with fixed or learnable parameters:

$z \sim \mathcal{N}(z; 0, I)$  (1)

$z \sim \sum_k \hat{\phi}_k \, \mathcal{N}(z; \hat{\mu}_k, \hat{\Sigma}_k)$  (2)

These samples are put through a series of invertible, parametrized transformations f, in our case affine transformations, that transform p(z) into p(x):

$x = f_0 \circ f_1 \circ \ldots f_k(z)$  (3)

As illustrated in (Kingma et al., 2016), in autoregressive normalizing flows the t-th variable $z'_t$ only depends on previous timesteps $z_{1:t-1}$:

$z'_t = f_k(z_{1:t-1})$  (4)

By using parametrized affine transformations for f, and due to the autoregressive structure, the Jacobian determinant of each of the transformations f is lower triangular, hence easy to compute. With this setup we can train Flowtron by maximizing the log-likelihood of the data, which can be done using the change of variables:

$\log p_\theta(x) = \log p_\theta(z) + \sum_{i=1}^{k} \log |\det(J(f_i^{-1}(x)))|$  (5)

$z = f_k^{-1} \circ f_{k-1}^{-1} \circ \ldots f_0^{-1}(x)$  (6)
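To make Equations 5-6 concrete, here is a minimal sketch of the training objective, assuming each inverse step of flow returns its output together with the log-scale term of Equation 9; it is not the repository's training loop, and `inverse_steps` is a placeholder for Flowtron's actual steps of flow.

```python
import math
import torch

def flow_negative_log_likelihood(x, inverse_steps):
    """x: [batch, n_frames, n_mel] mel-spectrograms.
    inverse_steps: callables applied in order (f_0^{-1} first, Eq. 6), each
    returning (output, log_s) with log_s the coupling log-scale (Eq. 9).
    Returns the per-element negative log-likelihood of Eq. 5 under the
    zero-mean spherical Gaussian prior of Eq. 1."""
    z, log_det = x, x.new_zeros(())
    for f_inv in inverse_steps:                # z = f_k^{-1} o ... o f_0^{-1}(x)
        z, log_s = f_inv(z)
        log_det = log_det + log_s.sum()        # sum_i log|det J(f_i^{-1}(x))|
    log_prior = -0.5 * (z ** 2).sum() - 0.5 * z.numel() * math.log(2 * math.pi)
    return -(log_prior + log_det) / z.numel()
```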
For the forward pass through the network, we take the mel-spectrograms as vectors and process them through several "steps of flow" conditioned on the text and speaker ids. A step of flow here consists of an affine coupling layer, described below.

3.1. Affine Coupling Layer

Invertible neural networks are typically constructed using coupling layers (Dinh et al., 2014; 2016; Kingma & Dhariwal, 2018). In our case, we use an affine coupling layer (Dinh et al., 2016). Every input $x_{t-1}$ produces scale and bias terms, s and b respectively, that affine-transform the succeeding input $x_t$:

$(\log s_t, b_t) = NN(x_{1:t-1}, \text{text}, \text{speaker})$  (7)

$x'_t = s_t \odot x_t + b_t$  (8)

Here NN() can be any autoregressive causal transformation. This can be achieved by time-wise concatenation of a 0-valued vector to the input provided to NN(). The affine coupling layer preserves invertibility for the overall network, even though NN() does not need to be invertible. This follows because the first input of NN() is a constant and, due to the autoregressive nature of the model, the scaling and translation terms $s_t$ and $b_t$ only depend on $x_{1:t-1}$ and the fixed text and speaker vectors. Accordingly, when inverting the network, we can compute $s_t$ and $b_t$ from the preceding input $x_{1:t-1}$, and then invert $x'_t$ to compute $x_t$, by simply recomputing $NN(x_{1:t-1}, \text{text}, \text{speaker})$.

With an affine coupling layer, only the $s_t$ term changes the volume of the mapping and adds a change of variables term to the loss. This term also serves to penalize the model for non-invertible affine mappings:

$\log |\det(J(f^{-1}_{\mathrm{coupling}}(x)))| = \log |s|$  (9)

With this setup, it is also possible to revert the ordering of the input x without loss of generality. Hence, we choose to revert the order of the input at every even step of flow and to maintain the original order on odd steps of flow. This allows the model to learn dependencies both forwards and backwards in time while remaining causal and invertible.
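The toy module below sketches Equations 7-9 for a single step of flow: during training every frame is transformed in parallel from its causal context, while inference recovers the frames one at a time by recomputing the same scale and bias. `ToyCoupling` and its small LSTM are illustrative stand-ins, not Flowtron's attention-based decoder NN().

```python
import torch
import torch.nn as nn

class ToyCoupling(nn.Module):
    """Stand-in for one affine coupling step (Eqs. 7-8); the real NN() is an
    attention-based decoder conditioned on text and speaker."""
    def __init__(self, n_mel=80, n_cond=640):
        super().__init__()
        self.rnn = nn.LSTM(n_mel + n_cond, 64, batch_first=True)
        self.proj = nn.Linear(64, 2 * n_mel)              # -> (log s_t, b_t)

    def scale_and_bias(self, x_prev, cond):
        h, _ = self.rnn(torch.cat((x_prev, cond), dim=-1))
        log_s, b = self.proj(h).chunk(2, dim=-1)
        return log_s, b

    def forward(self, x, cond):
        """Training direction. x: [B, T, n_mel], cond: [B, T, n_cond]."""
        # shift right: position t only sees frames x_{1:t-1} (0-vector at t=0)
        x_prev = torch.cat((torch.zeros_like(x[:, :1]), x[:, :-1]), dim=1)
        log_s, b = self.scale_and_bias(x_prev, cond)
        return torch.exp(log_s) * x + b, log_s            # x'_t = s_t * x_t + b_t

    def inverse(self, x_out, cond):
        """Inference direction: recover x_t sequentially, since s_t and b_t
        depend only on the already-recovered frames x_{1:t-1}."""
        x = torch.zeros_like(x_out)
        for t in range(x_out.size(1)):
            x_prev = torch.cat((torch.zeros_like(x[:, :1]), x[:, :t]), dim=1)
            log_s, b = self.scale_and_bias(x_prev, cond[:, : t + 1])
            x[:, t] = (x_out[:, t] - b[:, -1]) * torch.exp(-log_s[:, -1])
        return x
```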
3.2. Model architecture

Our text encoder modifies Tacotron's by replacing batch-norm with instance-norm. Our decoder and NN architecture, depicted in Figure 1, removes the Prenet and Postnet layers used by Tacotron. We use the content-based tanh attention described in (Vinyals et al., 2015). We use the Mel Encoder described in (Hsu et al., 2018) for Flowtron models that predict the parameters of the Gaussian mixture.

Unlike (Ping et al., 2017; Gibiansky et al., 2017), where site-specific speaker embeddings are used, we use a single speaker embedding that is channel-wise concatenated with the encoder outputs at every token. We use a fixed dummy speaker embedding for models not conditioned on speaker id. Finally, we add a dense layer with a sigmoid output to the flow step closest to z. This provides the model with a gating mechanism as early as possible during inference to avoid extra computation.
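As an illustration of the speaker conditioning described above, the sketch below broadcasts a single speaker embedding across all text tokens and concatenates it channel-wise with the encoder outputs; the class name and dimensions are assumptions for the example, not the repository's code.

```python
import torch
import torch.nn as nn

class ToySpeakerConditioning(nn.Module):
    """One speaker embedding, channel-wise concatenated with the text
    encoder outputs at every token (illustrative sketch only)."""
    def __init__(self, n_speakers=4, d_speaker=128, d_text=512):
        super().__init__()
        self.speaker_table = nn.Embedding(n_speakers, d_speaker)

    def forward(self, text_enc, speaker_id):
        """text_enc: [B, n_tokens, d_text]; speaker_id: [B]. Models without
        speaker conditioning can pass a fixed dummy id."""
        spk = self.speaker_table(speaker_id)                     # [B, d_speaker]
        spk = spk.unsqueeze(1).expand(-1, text_enc.size(1), -1)  # one copy per token
        return torch.cat((text_enc, spk), dim=-1)                # channel-wise concat
```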
[Figure: mel-spectrogram samples for different values of σ²; recovered panel labels include "(a) σ² = 0" and "(c) Flowtron σ² = 1". Partially recovered caption: unlike (Shen et al., 2017; Arik et al., 2017b;a; Ping et al., 2017; Skerry-Ryan et al., 2018; Wang et al., 2018; Bińkowski et al., 2019), Flowtron generates sharp harmonics and well resolved formants without a compound loss nor Prenet or Postnet layers.]

4.2. Mean Opinion Score comparison

We provide results comparing mean opinion scores (MOS) for real data from the LJS dataset, samples from a Flowtron with 2 steps of flow, and samples from our implementation of Tacotron 2, both models trained on LSH. Although the models evaluated are multi-speaker, we only compute mean opinion scores on LJS. In addition, we use the mean opinion scores provided in (Prenger et al., 2019) for ground truth data from the LJS dataset.

We crowd-sourced mean opinion score (MOS) tests on Amazon Mechanical Turk. Raters first had to pass a hearing test to be eligible. Then they listened to an utterance, after which they rated pleasantness on a five-point scale. We used 30 volume-normalized utterances from all speakers, disjoint from the training set, for evaluation, and randomly chose the utterances for each subject.

The mean opinion scores are shown in Table 1 with 95% confidence intervals computed over approximately 250 scores per source. The results roughly match our subjective qualitative assessment. The larger advantage of Flowtron is in the control over the amount of speech variation and the manipulation of the latent space.
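As a reference for how intervals of this kind are obtained, the snippet below computes a mean opinion score with a normal-approximation 95% confidence interval; the ratings shown are illustrative placeholders, not the paper's data.

```python
import math

def mos_with_ci(scores, z=1.96):
    """Mean opinion score and a normal-approximation 95% confidence interval."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, z * math.sqrt(var / n)

ratings = [4, 5, 3, 4, 4, 5, 4, 3, 5, 4]   # placeholder ratings on a 1-5 scale
mean, half_width = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```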
For each experiment, we use the Sally speaker and the sentences "Humans are walking on the street?" and "Surely you are joking mister Feynman.", which do not exist in RAVDESS nor in the audio samples from Richard Feynman.

The samples generated with Tacotron 2 GST are not able to emulate the surprised style from RAVDESS nor Feynman's prosody and acoustic characteristics. Flowtron, on the other hand, is able to make Sally sound surprised, which is drastically different from the monotonous baseline. Likewise, Flowtron is able to pick up on the prosody and articulation details particular to Feynman's speaking style, and transfer them to Sally.
4.5. Sampling the Gaussian Mixture

In this last section we showcase visualizations and samples from Flowtron Gaussian Mixture (GM). First we investigate how different mixture components and speakers are correlated. Then we provide sound examples in which we modulate speech characteristics by translating one of the dimensions of an individual component.

4.5.1. Visualizing Assignments

For the first experiment, we train a Flowtron Gaussian Mixture on LSH with 2 steps of flow, speaker embeddings, and fixed mean and covariance (Flowtron GM-A). We obtain mixture component assignments per mel-spectrogram by performing a forward pass and averaging the component assignment over time and samples. Figure 6 shows that whereas most speakers are equally assigned to all components, component 7 is almost exclusively assigned to Helen's data.

[Figure 6: Component assignments for Flowtron GM-A. Unlike LJS and Sally, Helen is almost exclusively assigned to component 7.]
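A sketch of one plausible way to compute such assignments is shown below: given the latents z of an utterance (obtained from a forward pass) and the mixture parameters, per-frame posterior responsibilities are computed and averaged over time; averaging these vectors over a speaker's utterances yields plots like Figures 6 and 7. This is an assumed computation for illustration, not the exact evaluation code.

```python
import math
import torch

def component_assignments(z, weights, means, variances):
    """z: [n_frames, n_dims] latents for one utterance.
    weights: [K] mixture weights, means: [K, n_dims], variances: [K]
    (spherical). Returns the average responsibility per component."""
    diff = z.unsqueeze(1) - means.unsqueeze(0)                      # [T, K, D]
    log_prob = (-0.5 * (diff ** 2 / variances.view(1, -1, 1)).sum(-1)
                - 0.5 * z.size(1) * torch.log(2 * math.pi * variances).view(1, -1))
    log_post = torch.log(weights).view(1, -1) + log_prob            # unnormalized
    resp = torch.softmax(log_post, dim=1)                           # [T, K]
    return resp.mean(dim=0)                                         # average over time
```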
In the second experiment, we train a Flowtron Gaussian Mixture on LibriTTS with 1 step of flow, without speaker embeddings, and with predicted mean and covariance (Flowtron GM-B). Figure 7 shows that Flowtron GM assigns more probability to component 7 when the speaker is male than when it's female. Conversely, the model assigns more probability to component 6 when the speaker is female than when it's male.

[Figure 7: Component assignments for Flowtron GM-B. Components 7 and 8 are assigned different probabilities according to gender, suggesting that the information stored in the components is gender dependent.]

4.5.2. Translating Dimensions

In this subsection, we use the model Flowtron GM-A described previously. We focus on selecting a single mixture component and translating one of its dimensions by adding an offset.
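As a small illustration of this manipulation, the sketch below draws z from a single spherical component and adds an offset to one latent dimension before synthesis; the dimension index and offset are hypothetical values, and `inverse_flow` is a placeholder for the generative call.

```python
import torch

def sample_translated_component(mean, std, dim, offset, sigma=1.0):
    """Sample z from one mixture component (given its mean and per-component
    std) and translate a single dimension by `offset`."""
    z = mean + sigma * std * torch.randn_like(mean)
    z[..., dim] = z[..., dim] + offset
    return z

# Hypothetical usage, e.g. a dimension found to correlate with pitch height:
# z = sample_translated_component(means[k], stds[k], dim=12, offset=1.5)
# mel = inverse_flow(z, text, speaker)
```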
The samples in our supplementary material show that we are able to modulate specific speech characteristics like pitch and word duration. Although the samples generated by translating one of the dimensions associated with pitch height have different pitch contours, they have the same duration. Similarly, our samples show that translating the dimension associated with the length of the first word does not modulate the pitch of the first word. This provides evidence that we can modulate these attributes by manipulating these dimensions and that the model is able to learn a disentangled representation of these speech attributes.

5. Discussion

In this paper we propose a new text to mel-spectrogram synthesis model based on autoregressive flows that is optimized by maximizing the likelihood and allows for control of speech variation and style transfer. Our results show that samples generated with Flowtron achieve mean opinion scores that are similar to samples generated with state-of-the-art text-to-speech synthesis models. In addition, we demonstrate that at no extra cost and without a compound loss term, our model learns a latent space that stores non-textual information. Our experiments show that Flowtron gives the user the possibility to transfer characteristics from a source sample or speaker to a target speaker, for example making a monotonic speaker sound more expressive.

Our results show that despite all the variability added by increasing σ², the samples synthesized with Flowtron still produce high-quality speech. They also show that Flowtron learns a latent space over non-textual features that can be investigated and manipulated to give the user more control over the generative model's output. We provide many examples that showcase this, including increasing variation in mel-spectrograms in a controllable manner, transferring the style from speakers seen and unseen during training to another speaker using sentences with similar or different text, and making a monotonic speaker sound more expressive.

Flowtron produces expressive speech without labeled data or ever seeing expressive data. It pushes text-to-speech synthesis beyond the expressive limits of personal assistants. It opens new avenues for speech synthesis in human-computer interaction and the arts, where realism and expressivity are of utmost importance. To our knowledge, this work is the first to demonstrate the advantages of using normalizing flow models in text to mel-spectrogram synthesis.

References

Akuzawa, K., Iwasawa, Y., and Matsuo, Y. Expressive speech synthesis via modeling expressions with variational autoencoder. arXiv preprint arXiv:1804.02135, 2018.

Arik, S., Diamos, G., Gibiansky, A., Miller, J., Peng, K., Ping, W., Raiman, J., and Zhou, Y. Deep voice 2: Multi-speaker neural text-to-speech. arXiv preprint arXiv:1705.08947, 2017a.

Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al. Deep voice: Real-time neural text-to-speech. arXiv preprint arXiv:1702.07825, 2017b.

Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., Raiman, J., and Zhou, Y. Deep voice 2: Multi-speaker neural text-to-speech. In Advances in Neural Information Processing Systems, pp. 2962–2970, 2017.

Hsu, W.-N., Zhang, Y., Weiss, R. J., Zen, H., Wu, Y., Wang, Y., Cao, Y., Jia, Y., Chen, Z., Shen, J., et al. Hierarchical generative modeling for controllable speech synthesis. arXiv preprint arXiv:1810.07217, 2018.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.

Lee, J., Choi, H.-S., Jeon, C.-B., Koo, J., and Lee, K. Adversarially trained end-to-end Korean singing voice synthesis system. arXiv preprint arXiv:1908.01919, 2019.

Nishimura, M., Hashimoto, K., Oura, K., Nankaku, Y., and Tokuda, K. Singing voice synthesis based on deep neural networks. In Interspeech 2016, pp. 2478–2482, 2016. doi: 10.21437/Interspeech.2016-1027. URL http://dx.doi.org/10.21437/Interspeech.2016-1027.

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. Image transformer. arXiv preprint arXiv:1802.05751, 2018.

Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., and Miller, J. Deep voice 3: 2000-speaker neural text-to-speech. arXiv preprint arXiv:1710.07654, 2017.

Prenger, R., Valle, R., and Catanzaro, B. Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. IEEE, 2019.

Valle, R., Li, J., Prenger, R., and Catanzaro, B. Mellotron GitHub repository, 2019a. URL https://github.com/NVIDIA/mellotron.

Valle, R., Li, J., Prenger, R., and Catanzaro, B. Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens. arXiv preprint arXiv:1910.11997, 2019b.

Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pp. 2773–2781, 2015.

Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.

Wang, Y., Stanton, D., Zhang, Y., Skerry-Ryan, R., Battenberg, E., Shor, J., Xiao, Y., Ren, F., Jia, Y., and Saurous, R. A. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017, 2018.

Weide, R. L. The CMU pronouncing dictionary. URL: http://www.speech.cs.cmu.edu/cgi-bin/cmudict, 1998.

Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., and Wu, Y. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882, 2019.