
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Wei Deng, Siyi Zhou, Jingchen Shu, Jinchao Wang, Lu Wang
Artificial Intelligence Platform Department, bilibili, China
{xuanwu,zhousiyi02,shujingchen,wangjinchao,wanglu08}@bilibili.com

Abstract

Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning capabilities. Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise models, with several novel improvements. Specifically, for Chinese scenarios we adopt a hybrid modeling method that combines characters and pinyin, making the pronunciations of polyphonic and long-tail characters controllable. We also performed a comparative analysis of Vector Quantization (VQ) and Finite-Scalar Quantization (FSQ) with respect to codebook utilization for acoustic speech tokens. To further enhance the quality and stability of voice cloning, we introduce a conformer-based speech conditional encoder and replace the speech-code decoder with BigVGAN2. Compared with XTTS, IndexTTS achieves significant improvements in naturalness, content consistency, and zero-shot voice cloning. Compared with popular open-source TTS systems such as Fish-Speech, CosyVoice2, FireRedTTS and F5-TTS, IndexTTS has a relatively simple training process, more controllable usage, and faster inference speed, while its performance surpasses that of these systems. Our demos are available at https://index-tts.github.io.

Index Terms: LLM based zero-shot TTS, industrial-level text-to-speech, polyphone controllable

1. Introduction

Text-to-speech (TTS) synthesis has extensive applications in fields such as human-computer interaction, education, and entertainment. For example, in video creation scenarios in recent years, TTS can assist users in quickly generating video dubbing, saving recording time and thus playing a crucial role in the creation process. Many creators hope for personalized and highly natural speech synthesis services that meet the needs of different scenarios.

TTS systems that are based on large language models and can be trained on massive amounts of general speech data demonstrate impressive performance in speech generation, such as XTTS[1], Fish-Speech[2], CosyVoice2[3], FireRedTTS[4] and F5-TTS[5]. Compared to traditional systems that rely on more intricate manual designs, such as Mega-TTS 2[6] and YourTTS[7], these systems have achieved significant improvements in naturalness, particularly in zero-shot voice cloning. Generative TTS powered by big data can be roughly classified into three categories. The first is the neural codec language model. To ensure the quality of synthesized audio, it typically employs a multi-codebook codec along with a high-frame-rate configuration, as in VALL-E[8]. This architecture is simple and straightforward, yet it has drawbacks: longer training and inference times, along with compromised stability. The second is end-to-end diffusion-based TTS, a non-autoregressive (NAR) approach; F5-TTS[5] and Seed-TTS[9] are cases in point. It yields high-quality synthesized audio and is suitable for voice editing, but it is difficult to stream and therefore unsuited to real-time use. Finally, the hybrid architecture typically uses a single codebook and a low-bitrate codec, generating high-quality audio through a standalone decoder such as a diffusion model or HiFi-GAN[10]. It balances performance and generation quality and offers good stability. Given the success of large language models, tokenization is the trend of the future, and for industrial-level applications stability is crucial. We therefore opt for the hybrid architecture, using a single-codebook codec and reconstructing high-quality voice through a speech decoder, as in XTTS, Fish-Speech and CosyVoice2.

Based on XTTS[1] and Tortoise[11], we have made several improvements, which mainly include the following. We remove the front-end G2P module and use raw text as input, along with a BPE-based text tokenizer. This simplifies input preprocessing, facilitates multi-language expansion, and enables end-to-end learning of word or polyphone pronunciations through big-data context integration. To address the pronunciation control of polyphones and low-frequency characters in Chinese scenarios, which inevitably occur in real-world video creation, we propose a hybrid character-pinyin modeling approach. This allows video creators to correct pronunciations by directly inputting pinyin. Moreover, VQ[12] may suffer from low utilization of the quantization codebook due to codebook collapse, so we conducted a comparative analysis between VQ and FSQ[13] in terms of their codebook utilization for acoustic token representation, achieving nearly 100% codebook utilization. Finally, we have made significant improvements in prosody naturalness, the similarity of zero-shot voice cloning, and system stability. The main improvements and contributions are summarized as follows:

• In Chinese scenarios, we have introduced a character-pinyin hybrid modeling approach. This allows for quick correction of mispronounced characters.
• We develop the IndexTTS system, incorporating a conformer conditioning encoder and a BigVGAN2[14]-based speech-code decoder. This improves training stability, voice timbre similarity, and sound quality.
• We release all test sets, including those for polysyllabic words as well as the subjective and objective test sets1.

1 https://github.com/index-tts/index-tts
2. IndexTTS System

Similar to XTTS[1], our system incorporates a speech-to-codec VQ-VAE[12] codec, a text-to-codec language model, and a latent-to-audio decoder, as depicted in Figure 1.

2.1. Text tokenizer

Currently, our system supports two languages, Chinese and English. We directly use the raw text as input, which is tokenized by a BPE-based text tokenizer; this makes it convenient to extend the system to other languages. Due to the large number of polyphonic characters in Chinese, we adopt a hybrid modeling approach of Chinese characters and pinyin in Chinese-related scenarios. The vocabulary size of the text tokenizer is 12,000. It encompasses 8,400 Chinese characters along with their corresponding 1,721 pinyin, English word pieces, and several special symbols. During training, we randomly select a certain proportion of non-polyphonic characters and replace them with pinyin. An example of the preprocessing process is presented in Table 1.

2.2. Neural Speech Tokenizer

Vector Quantization (VQ) is a powerful tool for speech coding, but it may suffer from codebook collapse[13]. The codebook utilization of VQ and FSQ is analyzed in the experiments below. We increased the parameters of the Variational Autoencoder (VAE) to around 50M. The VAE receives a mel-spectrogram as input and encodes each frame with VQ using approximately 8192 codes. The sampling rate of the input audio is 24 kHz, and the token rate output by the speech tokenizer is 25 Hz.
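For readers who want a concrete picture of the quantizer comparison, the following is a minimal PyTorch-style sketch of FSQ using the level setting [8, 8, 8, 6, 5] reported in Section 3.2.2. The class name, tensor shapes, and the surrounding encoder/decoder are illustrative assumptions rather than the authors' released codec; the point it shows is that every one of the prod(levels) implicit codes is reachable by rounding, which is why FSQ cannot suffer the codebook collapse that affects a learned VQ codebook.

import torch
import torch.nn as nn


class FSQ(nn.Module):
    """Minimal sketch of Finite Scalar Quantization (FSQ, Mentzer et al., 2023).

    Each latent dimension is bounded and rounded onto a small set of integer
    levels, so the implicit codebook (prod(levels) entries) is reachable by
    construction. Levels follow the paper's [8, 8, 8, 6, 5]; everything else
    here is an illustrative assumption, not the authors' exact codec.
    """

    def __init__(self, levels=(8, 8, 8, 6, 5)):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))
        self.register_buffer("half_width",
                             torch.tensor([l // 2 for l in levels], dtype=torch.float32))
        # Mixed-radix weights that turn per-dimension integers into one code id.
        radix = torch.cumprod(torch.tensor((1,) + levels[:-1], dtype=torch.float32), dim=0)
        self.register_buffer("radix", radix)

    def _bound(self, z, eps=1e-3):
        # Bound each dimension so rounding yields exactly levels[i] values;
        # the 0.5 offset handles even level counts, as in the FSQ paper.
        half_l = (self.levels - 1) * (1 - eps) / 2
        offset = (self.levels % 2 == 0).float() * 0.5
        shift = torch.atanh(offset / half_l)
        return torch.tanh(z + shift) * half_l - offset

    def forward(self, z):
        """z: (batch, frames, len(levels)) continuous encoder output."""
        bounded = self._bound(z)
        rounded = torch.round(bounded)
        # Straight-through estimator: rounded values forward, identity gradient back.
        quantized = bounded + (rounded - bounded).detach()
        # Integer code ids per frame, handy for measuring codebook utilization.
        ints = rounded + self.half_width                  # each dim now in [0, level - 1]
        code_ids = (ints * self.radix).sum(dim=-1).long()
        return quantized / self.half_width, code_ids      # values roughly in [-1, 1]


if __name__ == "__main__":
    fsq = FSQ()
    z = torch.randn(2, 100, 5)                            # 2 utterances, 100 frames at 25 Hz
    q, ids = fsq(z)
    print(q.shape, ids.shape, int(ids.max()))             # ids < 8*8*8*6*5 = 15360

Because the code index is a deterministic function of the bounded latent, utilization depends only on what the encoder outputs, not on how a separate codebook is updated; this is the property the VQ-versus-FSQ comparison in Section 3.3.2 probes.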
2.3. Large Language Model for TTS

The text-to-codec large language model (LLM) is based on the decoder-only transformer architecture, similar to XTTS. It generates a series of audio mel tokens from the input series of text tokens. The LLM is also conditioned by a transformer-based conditioning encoder, which we replace with a Conformer encoder with a subsampling rate of 2. We found that this replacement enhances timbre similarity and training stability.

The training processes of the conditional LLM can be broadly categorized into the following types. The input sequence is structured as follows ([BT] and [ET] indicate the beginning and end of the text token sequence; [BA] and [EA] denote the start and end of the audio token sequence):

• SEQ1: "[BT], prompt text, text, [ET], [BA], prompt audio, audio, [EA]". Systems such as VALL-E and Fish-Speech concatenate all the tokens of the prompt and the target.
• SEQ2: "[BT], text, [ET], [BA], audio, [EA]". For instance, CosyVoice2 directly generates audio tokens from the text token series.
• SEQ3: "speaker info, [BT], text, [ET], [BA], audio, [EA]". For example, in XTTS[1], CosyVoice[3] and Tortoise[11], the speaker information of the prompt audio is compressed into one or 32 latent vectors, which serve as the conditions for the LLM.

SEQ1 and SEQ2 must rely on the text corresponding to the prompt audio during inference: the inference input prefix sequence is constructed as "[BT], prompt text, text, [ET], [BA], prompt audio". In comparison, SEQ3 only requires the prompt audio; its inference input prefix sequence is "speaker info, [BT], text, [ET], [BA]". The autoregressive generation of the LM starts from this input prefix sequence until the end-of-sequence token "[EA]" is detected.

We adopt SEQ3 (a minimal sketch of this prefix construction is given at the end of this subsection). It is worth emphasizing that not relying on prompt text is crucial in certain scenarios. For example, in cross-language voice cloning, if prompt text must be provided or identified through a multilingual ASR system, usability will be significantly limited. Additionally, conditioning on both the prompt text and the audio token series substantially increases the inference time.

We also found that, compared to single-speaker encoding vectors as in Tortoise[11] and CosyVoice[3], or speech-prompting methods like VALL-E, the Conformer-based Perceiver demonstrates superior ability in capturing speaker characteristics. Moreover, it ensures consistent model outputs across different runs, effectively mitigating the speaker shifting that may occur between model executions. The Perceiver also offers the advantage of utilizing multiple references without imposing length restrictions. This flexibility enables it to comprehensively capture diverse aspects of the target speaker. Furthermore, it even allows for the integration of features from other speakers, thereby facilitating the creation of a truly unique voice.
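As a rough illustration of the SEQ3 format, the sketch below assembles the inference prefix "speaker info, [BT], text, [ET], [BA]" and decodes greedily until [EA]. The special-token ids, the speech_llm callable, and the speaker-latent interface are hypothetical placeholders, not the actual IndexTTS implementation.

from typing import Callable, List

import torch

# Hypothetical special-token ids; the real vocabulary layout is not published here.
BT, ET, BA, EA = 1, 2, 3, 4


@torch.no_grad()
def generate_seq3(text_tokens: List[int],
                  speaker_latents: torch.Tensor,
                  speech_llm: Callable[..., torch.Tensor],
                  max_audio_tokens: int = 1500) -> List[int]:
    """SEQ3 decoding sketch: the prefix is 'speaker info, [BT], text, [ET], [BA]'
    and generation runs autoregressively until the end-of-audio token [EA].
    Only the prompt audio (via speaker_latents) is needed; no prompt text."""
    prefix = [BT] + list(text_tokens) + [ET] + [BA]
    audio_tokens: List[int] = []
    for _ in range(max_audio_tokens):
        # speech_llm is assumed to return per-position logits over audio codes,
        # conditioned on the speaker latents and the token sequence so far.
        logits = speech_llm(speaker_latents, torch.tensor(prefix + audio_tokens))
        next_token = int(torch.argmax(logits[-1]))
        if next_token == EA:
            break
        audio_tokens.append(next_token)
    return audio_tokens

In practice the prefix would be embedded and concatenated with the Perceiver-derived speaker latents inside the model; the sketch only makes the ordering of the segments and the [EA] stopping rule explicit.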
2.4. Speech Decoder

The last stage converts the SpeechLLM output into a waveform. One option is to utilize a flow matching[15] or diffusion-based[9] model to transform the speech code generated by the SpeechLLM into an intermediate representation such as the mel spectrogram[11][9], followed by a vocoder, such as HiFi-GAN, to convert the mel spectrogram into a waveform. This method can generate high-quality audio, but it suffers from slow inference and is complex to stream. The second approach is to directly convert the SpeechLLM output, conditioned on a speaker embedding, into the final waveform. We adopt the second approach: based on the BigVGAN2[14] vocoder, we directly reconstruct the audio from the last hidden state of the SpeechLLM, conditioned on the speaker embedding. The latent sampling rate is 25 Hz. It is interpolated to 100 Hz and then input into BigVGAN2. Subsequently, the signal is decoded by BigVGAN2 and finally output at a sampling rate of 24 kHz.
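The following is a hedged sketch of the latent upsampling step described above: the 25 Hz last hidden states are linearly interpolated by a factor of four to 100 Hz before being passed to the vocoder. The bigvgan2 argument is a stand-in for the speaker-conditioned BigVGAN2 decoder rather than its real interface; at a 24 kHz output rate, each 100 Hz frame corresponds to a 240-sample hop.

import torch
import torch.nn.functional as F


def decode_latents(hidden_states: torch.Tensor, bigvgan2) -> torch.Tensor:
    """hidden_states: (batch, frames, dim) last hidden states of the SpeechLLM at 25 Hz.
    bigvgan2 is a placeholder callable for the BigVGAN2-based decoder; the real
    model is additionally conditioned on a speaker embedding (Section 2.4)."""
    latents = hidden_states.transpose(1, 2)              # (batch, dim, frames) for interpolation
    latents = F.interpolate(latents, scale_factor=4.0,   # 25 Hz -> 100 Hz
                            mode="linear", align_corners=False)
    waveform = bigvgan2(latents)                          # (batch, samples) at 24 kHz,
    return waveform                                       # i.e. 240 samples per 100 Hz frame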
3. Experiments

3.1. Dataset

All training data was collected from the internet, with an initial 120,000 hours of raw audio. After voice separation, speaker segmentation, and filtering using Demucs[16], we obtained 34,000 hours of high-quality Chinese-English bilingual data. The dataset includes 25,000 hours of Chinese and 9,000 hours of English audio. We then use ASR (Automatic Speech Recognition) to generate pseudo-labels for the corresponding audio. Finally, we emphasize that punctuation marks are added to the ASR results based on text semantics and speech pauses to create the final training texts. This approach allows users to control pauses flexibly, beyond relying solely on text semantics.

3.2. Experimental Settings

3.2.1. Mixed training of Chinese characters and pinyin

We randomly select 50% of the training samples. For each sample, we randomly pick 20% of the Chinese characters; if a character is not a polyphonic character, we replace it with its corresponding pinyin. The resulting text may include Chinese characters, pinyin, English words, and punctuation marks. It is then directly tokenized by the BPE tokenizer.
Figure 1: An overview of IndexTTS. A text-to-speech language model, conditioned on prompt speech and text tokens, generates acoustic tokens, and the BigVGAN2 decoder converts the LLM output latent into a waveform. (In the figure, S and E mark the start and end of the text token sequence, and B and T mark the start and end of the speech token sequence.)

Table 1: Preprocessing Examples for Training Samples Combining Chinese Characters and Pinyin

Input: 晕眩是一种感觉,I want to go to the supermarket!


Mix Pinyin: 晕 XUAN4 是 一 种 GAN3 觉 , I WANT TO GO TO THE SUPERMARKET !
BPE Tokens: 晕, XUAN4, 是, 一, 种, GAN3, 觉, ,, I, WANT, TO, GO, TO, THE, SUPER, M, AR, KE, T, !
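The preprocessing illustrated in Table 1 can be sketched with a few lines of Python, following the proportions given in Section 3.2.1 (50% of samples, 20% of characters, non-polyphonic characters only). The use of pypinyin, the heteronym check, and the whitespace handling are illustrative assumptions, not the authors' exact pipeline.

import random

from pypinyin import Style, pinyin  # pypinyin is assumed here purely for illustration


def mix_pinyin(text: str, sample_prob: float = 0.5, char_prob: float = 0.2) -> str:
    """Randomly replace non-polyphonic Chinese characters with tone-numbered,
    upper-cased pinyin, as in Table 1. Proportions follow Section 3.2.1:
    50% of samples are processed, and 20% of the characters within them."""
    if random.random() > sample_prob:
        return text
    out = []
    for ch in text:
        readings = pinyin(ch, style=Style.TONE3, heteronym=True)[0]
        is_chinese = "\u4e00" <= ch <= "\u9fff"
        # Polyphonic characters are kept as characters; their pinyin is only
        # supplied explicitly by the user at inference time.
        if is_chinese and len(readings) == 1 and random.random() < char_prob:
            out.append(" " + readings[0].upper() + " ")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(mix_pinyin("晕眩是一种感觉,I want to go to the supermarket!", sample_prob=1.0))

With the example input, 眩 and 感 (single-reading characters) may become XUAN4 and GAN3 while the polyphonic 晕 stays as a character, matching the "Mix Pinyin" row of Table 1.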

3.2.2. Speech Codec Training

In training the speech codec, we only replace Vector Quantization with Finite Scalar Quantization, keeping the other model configurations unchanged. The FSQ levels are [8, 8, 8, 6, 5]; the dimension of the VQ codebook is 512, and it contains 8192 codes. Considering that the size and diversity of the training data might affect the utilization rate of the VQ codebook, we also conduct training on a 6,000-hour subset and on the entire training dataset, respectively.

3.2.3. Evaluation Settings

We evaluate IndexTTS on four test sets. The first two clean test sets are the LibriSpeech[17] and AISHELL-1[18] test corpora. The last two sets are composed of 2,000 Chinese samples and 1,000 English samples selected from the CommonVoice[19] test dataset. In each set, each speaker has more than two samples.

During the evaluation, for each sample, another sample from the same speaker is randomly selected as the condition prompt. We use Paraformer[20] ASR to recognize the synthesis results of the Chinese test sets and Whisper-large V3[21] for the English test sets, in order to evaluate content consistency. Regarding speaker similarity (SS), we utilize the ERes2Net model2 to extract speaker embeddings from both the prompt and the generated utterances. The raw cosine similarity between these embeddings is then regarded as the measure of speaker similarity.

Additionally, to evaluate the pronunciation correction capability for polyphonic characters, we constructed a challenging Chinese polyphonic character test set comprising 2,500 entries.

2 https://www.modelscope.cn/models/iic/speech_eres2net_sv_zh-cn_16k-common

3.3. Experimental Results

3.3.1. Controllability of polyphonic characters

We conducted tests on 2,500 sentences that contain polyphonic characters. The test results are presented in Table 2. Specifically, the inputs of A1 are characters only; 465 synthesized audio samples contain pronunciation errors on polyphonic characters, accounting for 18.6% of the total. Among these erroneous samples, 437 (94.0%) can be accurately corrected by incorporating the correct pinyin as mixed inputs, as shown in A2. The remaining 28 errors, accounting for 1.1% of the full test set, could not be corrected by pinyin, possibly because errors introduced by the training data have been reinforced in the SpeechLLM.

Table 2: Error and Correction Statistics for Polyphonic Character Pronunciation

          Sentences   Percentage
Total     2500        100%
A1        465         18.6%
A2        437         94.0%

3.3.2. Evaluating the Codec Quantizer

We compared VQ and FSQ in terms of codebook utilization under varying training data scales (6k and 34k hours) and evaluated them on the four test sets above. Results show that with 6k hours of training data, VQ reaches a low codebook utilization rate of 55%. However, when the training data reaches 34k hours, there is little difference between VQ and FSQ, and VQ's utilization rate can also approach 100%. In addition, 50% of the codes cover more than 80% of the total occurrences of the tokens that appear in the training data.
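The two statistics above, the codebook utilization rate and the share of token occurrences covered by the most frequent half of the used codes, reduce to simple counting over the emitted code ids. The sketch below is generic; the codebook size shown is just the 8192-code VQ configuration from Section 3.2.2 and would differ for the implicit FSQ codebook.

from collections import Counter
from typing import Iterable, Tuple


def codebook_stats(code_ids: Iterable[int], codebook_size: int) -> Tuple[float, float]:
    """Return (utilization rate, share of occurrences covered by the top 50% of used codes)."""
    counts = Counter(code_ids)
    used = len(counts)
    utilization = used / codebook_size
    freqs = sorted(counts.values(), reverse=True)
    top_half = freqs[: max(1, used // 2)]
    coverage = sum(top_half) / sum(freqs)
    return utilization, coverage


if __name__ == "__main__":
    # Toy example; in practice code_ids would be every speech token produced
    # by the tokenizer over the 6k- or 34k-hour training set.
    demo = [0, 0, 1, 1, 1, 2, 3, 3, 3, 3]
    print(codebook_stats(demo, codebook_size=8192))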
Table 3: Word Error Rate (WER) and Speaker Similarity (SS) Results for IndexTTS and Baseline Models

                aishell1 test        commonvoice zh       commonvoice en       librispeech test clean   AVG
Model           CER(%)↓    SS↑       CER(%)↓    SS↑       WER(%)↓    SS↑       WER(%)↓    SS↑           WER(%)↓    SS↑
Human           2.0        0.846     9.5        0.809     10.0       0.820     2.4        0.858         5.1        0.836
CosyVoice2      1.8        0.796     9.1        0.743     7.3        0.742     4.9        0.837         5.9        0.788
F5TTS           3.9        0.743     11.7       0.747     5.4        0.746     7.8        0.828         8.2        0.779
Fishspeech      2.4        0.488     11.4       0.552     8.8        0.622     8.0        0.701         8.3        0.612
FireRedTTS      2.2        0.579     11.0       0.593     16.3       0.587     5.7        0.698         7.7        0.631
XTTS            3.0        0.573     11.4       0.586     7.1        0.648     3.5        0.761         6.0        0.663
IndexTTS        1.3        0.744     7.0        0.742     5.3        0.758     2.1        0.823         3.7        0.776

Table 4: MOS Scores for Zero-Shot Cloned Voice

Model           Prosody   Timbre   Quality   AVG
CosyVoice2      3.67      4.05     3.73      3.81
F5TTS           3.56      3.88     3.56      3.66
Fishspeech      3.40      3.63     3.69      3.57
FireRedTTS      3.79      3.72     3.60      3.70
XTTS            3.23      2.99     3.10      3.11
IndexTTS        3.79      4.20     4.05      4.01

Figure 2: Comparison of the distribution of codebook utilization rates of VQ and FSQ under different training data scales.

3.3.3. Comparison Results with Baselines

We select several of the most popular open-source zero-shot TTS models for comparison, including XTTS[1], CosyVoice2 (non-streaming)[22], FishSpeech[2], FireRedTTS[4] and F5-TTS[5]. The evaluation methodology encompasses both objective and subjective metrics: the word error rate (WER) for content consistency, the speaker embedding similarity (SS) for the fidelity of voice cloning, and the mean opinion score (MOS) for perceptual quality.

The objective WER and SS results for IndexTTS and the baseline models across the four test sets are presented in Table 3. IndexTTS significantly outperforms all other open-source models, demonstrating its robustness and stability. Regarding the SS metric, the performance gap between IndexTTS, CosyVoice2, and F5-TTS is minimal, yet these three models exhibit clear advantages over the other compared models.

To evaluate the perceptual quality of the synthesized audio, we carried out a MOS study covering three dimensions: prosody, timbre similarity, and sound quality. We conducted a double-blind evaluation on 100 samples randomly selected from the complete test set to ensure unbiased results. In the subjective evaluation, we place greater emphasis on the similarity between the synthesized audio and the prompt audio across all aspects. For example, if the sample speeches contain stutters or pauses, we assign a lower score to synthesized results that exhibit overly smooth prosody. We also consider how well the sound-field characteristics of the prompt audio are reproduced. The results are shown in Table 4: IndexTTS outperforms the baselines in nearly all evaluation dimensions, with significant advantages in timbre similarity and sound quality.

Moreover, we randomly selected 200 test samples and measured the total time taken by each model to synthesize them, along with the GPU resource consumption. The results are presented in Table 5.

Table 5: GPU Utilization Rate and Test Duration in Experimental Evaluation

Model           Duration(s)   GPU Util
CosyVoice2      805           48.41%
F5TTS           320           42.13%
Fishspeech      756           71.43%
FireRedTTS      732           92.65%
XTTS            488           87.65%
IndexTTS        397           28.47%

3.4. Conclusion

The IndexTTS system we developed is a GPT-style text-to-speech (TTS) model. It is capable of correcting the pronunciation of Chinese characters using pinyin and of controlling pauses at any position through punctuation marks. We enhanced multiple modules of the system, including the speaker condition feature representation, and integrated BigVGAN2 to optimize audio quality. Trained on tens of thousands of hours of data, our system achieves state-of-the-art performance, outperforming current popular TTS systems such as XTTS, CosyVoice2, Fish-Speech, and F5-TTS.

3.5. Limitations

Several limitations of this work should be acknowledged. Currently, our system does not support instructed voice generation, is limited to Chinese and English, and has insufficient capability to replicate rich emotional expressions. In future work, we plan to extend the system to additional languages, enhance emotion replication through methods such as reinforcement learning, and incorporate the ability to control hyper-realistic paralinguistic expressions, including laughter, hesitation, and surprise.
4. References

[1] E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi et al., "XTTS: A massively multilingual zero-shot text-to-speech model," arXiv preprint arXiv:2406.04904, 2024.

[2] S. Liao, Y. Wang, T. Li, Y. Cheng, R. Zhang, R. Zhou, and Y. Xing, "Fish-Speech: Leveraging large language models for advanced multilingual text-to-speech synthesis," arXiv preprint arXiv:2411.01156, 2024.

[3] Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., "CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens," arXiv preprint arXiv:2407.05407, 2024.

[4] H.-H. Guo, K. Liu, F.-Y. Shen, Y.-C. Wu, F.-L. Xie, K. Xie, and K.-T. Xu, "FireRedTTS: A foundation text-to-speech framework for industry-level generative speech applications," arXiv preprint arXiv:2409.03283, 2024.

[5] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, "F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching," arXiv preprint arXiv:2410.06885, 2024.

[6] Z. Jiang, J. Liu, Y. Ren, J. He, C. Zhang, Z. Ye, P. Wei, C. Wang, X. Yin, Z. Ma et al., "Mega-TTS 2: Zero-shot text-to-speech with arbitrary length speech prompts," arXiv preprint arXiv:2307.07218, 2023.

[7] E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti, "YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone," in International Conference on Machine Learning. PMLR, 2022, pp. 2709–2720.

[8] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., "Neural codec language models are zero-shot text to speech synthesizers," arXiv preprint arXiv:2301.02111, 2023.

[9] P. Anastassiou, J. Chen, J. Chen, Y. Chen, Z. Chen, Z. Chen, J. Cong, L. Deng, C. Ding, L. Gao et al., "Seed-TTS: A family of high-quality versatile speech generation models," arXiv preprint arXiv:2406.02430, 2024.

[10] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.

[11] J. Betker, "Better speech synthesis through scaling," arXiv preprint arXiv:2305.07243, 2023.

[12] A. Van Den Oord, O. Vinyals et al., "Neural discrete representation learning," Advances in Neural Information Processing Systems, vol. 30, 2017.

[13] F. Mentzer, D. Minnen, E. Agustsson, and M. Tschannen, "Finite scalar quantization: VQ-VAE made simple," arXiv preprint arXiv:2309.15505, 2023.

[14] S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, "BigVGAN: A universal neural vocoder with large-scale training," arXiv preprint arXiv:2206.04658, 2022.

[15] S. Mehta, R. Tu, J. Beskow, É. Székely, and G. E. Henter, "Matcha-TTS: A fast TTS architecture with conditional flow matching," in ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11341–11345.

[16] A. Défossez, N. Usunier, L. Bottou, and F. Bach, "Demucs: Deep extractor for music sources with extra unlabeled data remixed," arXiv preprint arXiv:1909.01174, 2019.

[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.

[18] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). IEEE, 2017, pp. 1–5.

[19] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," arXiv preprint arXiv:1912.06670, 2019.

[20] Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, "Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition," arXiv preprint arXiv:2206.08317, 2022.

[21] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.

[22] Z. Du, Y. Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao, Y. Yang, C. Gao, H. Wang et al., "CosyVoice 2: Scalable streaming speech synthesis with large language models," arXiv preprint arXiv:2412.10117, 2024.
