
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Wei Deng, Siyi Zhou, Jingchen Shu, Jinchao Wang, Lu Wang
Artificial Intelligence Platform Department, bilibili, China
{xuanwu,zhousiyi02,shujingchen,wangjinchao,wanglu08}@bilibili.com

Abstract

Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning capabilities. Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise models, with several novel improvements. Specifically, for Chinese scenarios we adopt a hybrid modeling method that combines characters and pinyin, making the pronunciations of polyphonic and long-tail characters controllable. We also performed a comparative analysis of Vector Quantization (VQ) and Finite-Scalar Quantization (FSQ) with respect to codebook utilization for acoustic speech tokens. To further enhance the quality and stability of voice cloning, we introduce a conformer-based speech conditional encoder and replace the speech-code decoder with BigVGAN2. Compared with XTTS, IndexTTS achieves significant improvements in naturalness, content consistency, and zero-shot voice cloning. Compared with popular open-source TTS systems such as Fish-Speech, CosyVoice2, FireRedTTS and F5-TTS, IndexTTS has a relatively simple training process, more controllable usage, and faster inference speed, while its performance surpasses that of these systems. Our demos are available at https://index-tts.github.io.

Index Terms: LLM based zero-shot TTS, industrial-level text-to-speech, polyphone controllable

1. Introduction

Text-to-speech (TTS) synthesis has extensive applications in fields such as human-computer interaction, education, and entertainment. For example, in video creation scenarios in recent years, TTS can assist users in quickly generating video dubbing, saving recording time and thus playing a crucial role in the creation process. Many creators hope for personalized and highly natural speech synthesis services that meet the needs of different scenarios.

TTS systems that are based on large language models and can be trained on massive amounts of general speech data demonstrate impressive performance in speech generation, such as XTTS[1], Fish-Speech[2], CosyVoice2[3], FireRedTTS[4] and F5-TTS[5]. Compared to traditional systems that rely on more intricate manual designs, such as Mega-TTS 2[6] and YourTTS[7], these systems have achieved significant improvements in naturalness, particularly in zero-shot voice cloning. Generative TTS powered by big data can be roughly classified into three categories. The first is the neural codec language model. To ensure the quality of synthesized audio, it typically employs a multi-codebook codec along with a high-frame-rate configuration, as in VALL-E[8]. This architecture is simple and straightforward, yet it has drawbacks: longer training and inference times, along with compromised stability. The second is end-to-end diffusion-based TTS, a non-autoregressive (NAR) approach; F5-TTS[5] and Seed-TTS[9] are cases in point. It yields high-quality synthesized audio and is suitable for voice editing, but it is difficult to stream and therefore unsuited to real-time use. Finally, the hybrid architecture typically uses a single codebook and a low-bitrate codec, generating high-quality audio through a standalone decoder such as a diffusion model or HiFi-GAN[10]. It balances performance and generation quality and offers good stability. Given the success of large language models, tokenization is the trend of the future, and for industrial-level applications stability is crucial. We therefore opt for the hybrid architecture, using a single-codebook codec and reconstructing high-quality voice through a speech decoder, as in XTTS, Fish-Speech and CosyVoice2.

Based on XTTS[1] and Tortoise[11], we have made several improvements, which mainly include the following. We remove the front-end G2P module and use raw text as input, along with a BPE-based text tokenizer. This simplifies input preprocessing, facilitates multi-language expansion, and enables end-to-end learning of word or polyphone pronunciations through big-data context integration. To address the pronunciation control of polyphones and low-frequency characters in Chinese scenarios, which inevitably occur in real-world video creation, we propose a hybrid character-pinyin modeling approach. This allows video creators to correct pronunciations by directly inputting pinyin. Moreover, VQ[12] may suffer from low utilization of the quantization codebook due to codebook collapse, so we conducted a comparative analysis between VQ and FSQ[13] in terms of their codebook utilization for acoustic token representation, achieving nearly 100% codebook utilization. Finally, we have made significant improvements in prosody naturalness, the similarity of zero-shot voice cloning, and system stability. The main improvements and contributions are summarized as follows:

• In Chinese scenarios, we have introduced a character-pinyin hybrid modeling approach. This allows for quick correction of mispronounced characters.
• We develop the IndexTTS system, incorporating a conformer conditioning encoder and a BigVGAN2[14]-based speech-code decoder. This improves training stability, voice timbre similarity, and sound quality.
• We release all test sets, including those for polysyllabic words as well as the subjective and objective test sets1.

1 https://github.com/index-tts/index-tts
2. IndexTTS System

Similar to XTTS[1], our system incorporates a speech-to-codec VQ-VAE[12] codec, a text-to-codec language model, and a latent-to-audio decoder, as depicted in Figure 1.

2.1. Text tokenizer

Currently, our system supports two languages, Chinese and English. We directly use the raw text as input, which is tokenized by a BPE-based text tokenizer; this makes it convenient to extend the system to other languages. Due to the large number of polyphonic characters in Chinese, we adopt a hybrid modeling approach of Chinese characters and pinyin in Chinese-related scenarios. The vocabulary size of the text tokenizer is 12,000. It encompasses 8,400 Chinese characters along with their corresponding 1,721 pinyin, English word pieces, and several special symbols. During training, we randomly select a certain proportion of non-polyphonic characters and replace them with pinyin. An example of the preprocessing process is presented in Table 1.

2.2. Neural Speech Tokenizer

Vector Quantization (VQ) is a powerful tool for speech coding, but it may suffer from codebook collapse[13]. The codebook utilization of VQ and FSQ is analyzed in the experiments below. We increased the parameters of the Variational Autoencoder (VAE) to around 50M. The VAE receives a mel-spectrogram as input and encodes each frame with VQ using approximately 8192 codes. The sampling rate of the input audio is 24 kHz, and the token rate output by the speech tokenizer is 25 Hz.
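For readers who want a concrete picture of the quantizer comparison, the following is a minimal PyTorch-style sketch of FSQ using the level setting [8, 8, 8, 6, 5] reported in Section 3.2.2. The class name, tensor shapes, and the surrounding encoder/decoder are illustrative assumptions rather than the authors' released codec; the point it shows is that every one of the prod(levels) implicit codes is reachable by rounding, which is why FSQ cannot suffer the codebook collapse that affects a learned VQ codebook.

import torch
import torch.nn as nn


class FSQ(nn.Module):
    """Minimal sketch of Finite Scalar Quantization (FSQ, Mentzer et al., 2023).

    Each latent dimension is bounded and rounded onto a small set of integer
    levels, so the implicit codebook (prod(levels) entries) is reachable by
    construction. Levels follow the paper's [8, 8, 8, 6, 5]; everything else
    here is an illustrative assumption, not the authors' exact codec.
    """

    def __init__(self, levels=(8, 8, 8, 6, 5)):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels, dtype=torch.float32))
        self.register_buffer("half_width",
                             torch.tensor([l // 2 for l in levels], dtype=torch.float32))
        # Mixed-radix weights that turn per-dimension integers into one code id.
        radix = torch.cumprod(torch.tensor((1,) + levels[:-1], dtype=torch.float32), dim=0)
        self.register_buffer("radix", radix)

    def _bound(self, z, eps=1e-3):
        # Bound each dimension so rounding yields exactly levels[i] values;
        # the 0.5 offset handles even level counts, as in the FSQ paper.
        half_l = (self.levels - 1) * (1 - eps) / 2
        offset = (self.levels % 2 == 0).float() * 0.5
        shift = torch.atanh(offset / half_l)
        return torch.tanh(z + shift) * half_l - offset

    def forward(self, z):
        """z: (batch, frames, len(levels)) continuous encoder output."""
        bounded = self._bound(z)
        rounded = torch.round(bounded)
        # Straight-through estimator: rounded values forward, identity gradient back.
        quantized = bounded + (rounded - bounded).detach()
        # Integer code ids per frame, handy for measuring codebook utilization.
        ints = rounded + self.half_width                  # each dim now in [0, level - 1]
        code_ids = (ints * self.radix).sum(dim=-1).long()
        return quantized / self.half_width, code_ids      # values roughly in [-1, 1]


if __name__ == "__main__":
    fsq = FSQ()
    z = torch.randn(2, 100, 5)                            # 2 utterances, 100 frames at 25 Hz
    q, ids = fsq(z)
    print(q.shape, ids.shape, int(ids.max()))             # ids < 8*8*8*6*5 = 15360

Because the code index is a deterministic function of the bounded latent, utilization depends only on what the encoder outputs, not on how a separate codebook is updated; this is the property the VQ-versus-FSQ comparison in Section 3.3.2 probes.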
2.3. Large Language Model for TTS

The text-to-codec large language model (LLM) is based on the decoder-only transformer architecture, similar to XTTS. It generates a series of audio mel tokens from the input series of text tokens. The LLM is also conditioned by a transformer-based conditioning encoder, which we replace with a Conformer encoder with a subsampling rate of 2. We found that this replacement enhances timbre similarity and training stability.

The training processes of the conditional LLM can be broadly categorized into the following types. The input sequence is structured as follows ([BT] and [ET] indicate the beginning and end of the text token sequence; [BA] and [EA] denote the start and end of the audio token sequence):

• SEQ1: "[BT], prompt text, text, [ET], [BA], prompt audio, audio, [EA]". Systems such as VALL-E and Fish-Speech concatenate all the tokens of the prompt and the target.
• SEQ2: "[BT], text, [ET], [BA], audio, [EA]". For instance, CosyVoice2 directly generates audio tokens from the text token series.
• SEQ3: "speaker info, [BT], text, [ET], [BA], audio, [EA]". For example, in XTTS[1], CosyVoice[3] and Tortoise[11], the speaker information of the prompt audio is compressed into one or 32 latent vectors, which serve as the conditions for the LLM.

SEQ1 and SEQ2 must rely on the text corresponding to the prompt audio during inference: the inference input prefix sequence is constructed as "[BT], prompt text, text, [ET], [BA], prompt audio". In comparison, SEQ3 only requires the prompt audio; its inference input prefix sequence is "speaker info, [BT], text, [ET], [BA]". The autoregressive generation of the LM starts from this input prefix sequence until the end-of-sequence token "[EA]" is detected.

We adopt SEQ3 (a minimal sketch of this prefix construction is given at the end of this subsection). It is worth emphasizing that not relying on prompt text is crucial in certain scenarios. For example, in cross-language voice cloning, if prompt text must be provided or identified through a multilingual ASR system, usability will be significantly limited. Additionally, conditioning on both the prompt text and the audio token series substantially increases the inference time.

We also found that, compared to single-speaker encoding vectors as in Tortoise[11] and CosyVoice[3], or speech-prompting methods like VALL-E, the Conformer-based Perceiver demonstrates superior ability in capturing speaker characteristics. Moreover, it ensures consistent model outputs across different runs, effectively mitigating the speaker shifting that may occur between model executions. The Perceiver also offers the advantage of utilizing multiple references without imposing length restrictions. This flexibility enables it to comprehensively capture diverse aspects of the target speaker. Furthermore, it even allows for the integration of features from other speakers, thereby facilitating the creation of a truly unique voice.
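As a rough illustration of the SEQ3 format, the sketch below assembles the inference prefix "speaker info, [BT], text, [ET], [BA]" and decodes greedily until [EA]. The special-token ids, the speech_llm callable, and the speaker-latent interface are hypothetical placeholders, not the actual IndexTTS implementation.

from typing import Callable, List

import torch

# Hypothetical special-token ids; the real vocabulary layout is not published here.
BT, ET, BA, EA = 1, 2, 3, 4


@torch.no_grad()
def generate_seq3(text_tokens: List[int],
                  speaker_latents: torch.Tensor,
                  speech_llm: Callable[..., torch.Tensor],
                  max_audio_tokens: int = 1500) -> List[int]:
    """SEQ3 decoding sketch: the prefix is 'speaker info, [BT], text, [ET], [BA]'
    and generation runs autoregressively until the end-of-audio token [EA].
    Only the prompt audio (via speaker_latents) is needed; no prompt text."""
    prefix = [BT] + list(text_tokens) + [ET] + [BA]
    audio_tokens: List[int] = []
    for _ in range(max_audio_tokens):
        # speech_llm is assumed to return per-position logits over audio codes,
        # conditioned on the speaker latents and the token sequence so far.
        logits = speech_llm(speaker_latents, torch.tensor(prefix + audio_tokens))
        next_token = int(torch.argmax(logits[-1]))
        if next_token == EA:
            break
        audio_tokens.append(next_token)
    return audio_tokens

In practice the prefix would be embedded and concatenated with the Perceiver-derived speaker latents inside the model; the sketch only makes the ordering of the segments and the [EA] stopping rule explicit.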
2.4. Speech Decoder

The last stage converts the SpeechLLM output into a waveform. One option is to utilize a flow matching[15] or diffusion-based[9] model to transform the speech code generated by the SpeechLLM into an intermediate representation such as the mel spectrogram[11][9], followed by a vocoder, such as HiFi-GAN, to convert the mel spectrogram into a waveform. This method can generate high-quality audio, but it suffers from slow inference and is complex to stream. The second approach is to directly convert the SpeechLLM output, conditioned on a speaker embedding, into the final waveform. We adopt the second approach: based on the BigVGAN2[14] vocoder, we directly reconstruct the audio from the last hidden state of the SpeechLLM, conditioned on the speaker embedding. The latent sampling rate is 25 Hz. It is interpolated to 100 Hz and then input into BigVGAN2. Subsequently, the signal is decoded by BigVGAN2 and finally output at a sampling rate of 24 kHz.
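The following is a hedged sketch of the latent upsampling step described above: the 25 Hz last hidden states are linearly interpolated by a factor of four to 100 Hz before being passed to the vocoder. The bigvgan2 argument is a stand-in for the speaker-conditioned BigVGAN2 decoder rather than its real interface; at a 24 kHz output rate, each 100 Hz frame corresponds to a 240-sample hop.

import torch
import torch.nn.functional as F


def decode_latents(hidden_states: torch.Tensor, bigvgan2) -> torch.Tensor:
    """hidden_states: (batch, frames, dim) last hidden states of the SpeechLLM at 25 Hz.
    bigvgan2 is a placeholder callable for the BigVGAN2-based decoder; the real
    model is additionally conditioned on a speaker embedding (Section 2.4)."""
    latents = hidden_states.transpose(1, 2)              # (batch, dim, frames) for interpolation
    latents = F.interpolate(latents, scale_factor=4.0,   # 25 Hz -> 100 Hz
                            mode="linear", align_corners=False)
    waveform = bigvgan2(latents)                          # (batch, samples) at 24 kHz,
    return waveform                                       # i.e. 240 samples per 100 Hz frame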
3. Experiments

3.1. Dataset

All training data was collected from the internet, with an initial 120,000 hours of raw audio. After voice separation, speaker segmentation, and filtering using Demucs[16], we obtained 34,000 hours of high-quality Chinese-English bilingual data. The dataset includes 25,000 hours of Chinese and 9,000 hours of English audio. We then use ASR (Automatic Speech Recognition) to generate pseudo-labels for the corresponding audio. Finally, we emphasize that punctuation marks are added to the ASR results based on text semantics and speech pauses to create the final training texts. This approach allows users to control pauses flexibly, beyond relying solely on text semantics.

3.2. Experimental Settings

3.2.1. Mixed training of Chinese characters and pinyin

We randomly select 50% of the training samples. For each sample, we randomly pick 20% of the Chinese characters; if a character is not a polyphonic character, we replace it with its corresponding pinyin. The resulting text may include Chinese characters, pinyin, English words, and punctuation marks. It is then directly tokenized by the BPE tokenizer.
Figure 1: An overview of IndexTTS. A text-to-speech language model, conditioned on prompt speech and text tokens, generates acoustic tokens, and the BigVGAN2 decoder converts the LLM output latent into a waveform. (In the figure, S and E mark the start and end of the text token sequence, and B and T mark the start and end of the speech token sequence.)

Table 1: Preprocessing Examples for Training Samples Combining Chinese Characters and Pinyin

Input: 晕眩是一种感觉,I want to go to the supermarket!


Mix Pinyin: 晕 XUAN4 是 一 种 GAN3 觉 , I WANT TO GO TO THE SUPERMARKET !
BPE Tokens: 晕, XUAN4, 是, 一, 种, GAN3, 觉, ,, I, WANT, TO, GO, TO, THE, SUPER, M, AR, KE, T, !
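The preprocessing illustrated in Table 1 can be sketched with a few lines of Python, following the proportions given in Section 3.2.1 (50% of samples, 20% of characters, non-polyphonic characters only). The use of pypinyin, the heteronym check, and the whitespace handling are illustrative assumptions, not the authors' exact pipeline.

import random

from pypinyin import Style, pinyin  # pypinyin is assumed here purely for illustration


def mix_pinyin(text: str, sample_prob: float = 0.5, char_prob: float = 0.2) -> str:
    """Randomly replace non-polyphonic Chinese characters with tone-numbered,
    upper-cased pinyin, as in Table 1. Proportions follow Section 3.2.1:
    50% of samples are processed, and 20% of the characters within them."""
    if random.random() > sample_prob:
        return text
    out = []
    for ch in text:
        readings = pinyin(ch, style=Style.TONE3, heteronym=True)[0]
        is_chinese = "\u4e00" <= ch <= "\u9fff"
        # Polyphonic characters are kept as characters; their pinyin is only
        # supplied explicitly by the user at inference time.
        if is_chinese and len(readings) == 1 and random.random() < char_prob:
            out.append(" " + readings[0].upper() + " ")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(mix_pinyin("晕眩是一种感觉,I want to go to the supermarket!", sample_prob=1.0))

With the example input, 眩 and 感 (single-reading characters) may become XUAN4 and GAN3 while the polyphonic 晕 stays as a character, matching the "Mix Pinyin" row of Table 1.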

3.2.2. Speech Codec Training

In training the speech codec, we only replace Vector Quantization with Finite Scalar Quantization, keeping the other model configurations unchanged. The FSQ levels are [8, 8, 8, 6, 5]; the dimension of the VQ codebook is 512, and it contains 8192 codes. Considering that the size and diversity of the training data might affect the utilization rate of the VQ codebook, we also conduct training on a 6,000-hour subset and on the entire training dataset, respectively.

3.2.3. Evaluation Settings

We evaluate IndexTTS on four test sets. The first two clean test sets are the LibriSpeech[17] and AISHELL-1[18] test corpora. The last two sets are composed of 2,000 Chinese samples and 1,000 English samples selected from the CommonVoice[19] test dataset. In each set, each speaker has more than two samples.

During the evaluation, for each sample, another sample from the same speaker is randomly selected as the condition prompt. We use Paraformer[20] ASR to recognize the synthesis results of the Chinese test sets and Whisper-large V3[21] for the English test sets, in order to evaluate content consistency. Regarding speaker similarity (SS), we utilize the ERes2Net model2 to extract speaker embeddings from both the prompt and the generated utterances. The raw cosine similarity between these embeddings is then regarded as the measure of speaker similarity.

Additionally, to evaluate the pronunciation correction capability for polyphonic characters, we constructed a challenging Chinese polyphonic character test set comprising 2,500 entries.

2 https://www.modelscope.cn/models/iic/speech_eres2net_sv_zh-cn_16k-common

3.3. Experimental Results

3.3.1. Controllability of polyphonic characters

We conducted tests on 2,500 sentences that contain polyphonic characters. The test results are presented in Table 2. Specifically, the inputs of A1 are characters only; 465 synthesized audio samples contain pronunciation errors on polyphonic characters, accounting for 18.6% of the total. Among these erroneous samples, 437 (94.0%) can be accurately corrected by incorporating the correct pinyin as mixed inputs, as shown in A2. The remaining 28 errors, accounting for 1.1% of the full test set, could not be corrected by pinyin, possibly because errors introduced by the training data have been reinforced in the SpeechLLM.

Table 2: Error and Correction Statistics for Polyphonic Character Pronunciation

          Sentences   Percentage
Total     2500        100%
A1        465         18.6%
A2        437         94.0%

3.3.2. Evaluating the Codec Quantizer

We compared VQ and FSQ in terms of codebook utilization under varying training data scales (6k and 34k hours) and evaluated them on the four test sets above. Results show that with 6k hours of training data, VQ reaches a low codebook utilization rate of 55%. However, when the training data reaches 34k hours, there is little difference between VQ and FSQ, and VQ's utilization rate can also approach 100%. In addition, 50% of the codes cover more than 80% of the total occurrences of the tokens that appear in the training data.
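The two statistics above, the codebook utilization rate and the share of token occurrences covered by the most frequent half of the used codes, reduce to simple counting over the emitted code ids. The sketch below is generic; the codebook size shown is just the 8192-code VQ configuration from Section 3.2.2 and would differ for the implicit FSQ codebook.

from collections import Counter
from typing import Iterable, Tuple


def codebook_stats(code_ids: Iterable[int], codebook_size: int) -> Tuple[float, float]:
    """Return (utilization rate, share of occurrences covered by the top 50% of used codes)."""
    counts = Counter(code_ids)
    used = len(counts)
    utilization = used / codebook_size
    freqs = sorted(counts.values(), reverse=True)
    top_half = freqs[: max(1, used // 2)]
    coverage = sum(top_half) / sum(freqs)
    return utilization, coverage


if __name__ == "__main__":
    # Toy example; in practice code_ids would be every speech token produced
    # by the tokenizer over the 6k- or 34k-hour training set.
    demo = [0, 0, 1, 1, 1, 2, 3, 3, 3, 3]
    print(codebook_stats(demo, codebook_size=8192))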
Table 3: Word Error Rate (WER) and Speaker Similarity (SS) Results for IndexTTS and Baseline Models

                aishell1 test        commonvoice zh       commonvoice en       librispeech test clean   AVG
Model           CER(%)↓    SS↑       CER(%)↓    SS↑       WER(%)↓    SS↑       WER(%)↓    SS↑           WER(%)↓    SS↑
Human           2.0        0.846     9.5        0.809     10.0       0.820     2.4        0.858         5.1        0.836
CosyVoice2      1.8        0.796     9.1        0.743     7.3        0.742     4.9        0.837         5.9        0.788
F5TTS           3.9        0.743     11.7       0.747     5.4        0.746     7.8        0.828         8.2        0.779
Fishspeech      2.4        0.488     11.4       0.552     8.8        0.622     8.0        0.701         8.3        0.612
FireRedTTS      2.2        0.579     11.0       0.593     16.3       0.587     5.7        0.698         7.7        0.631
XTTS            3.0        0.573     11.4       0.586     7.1        0.648     3.5        0.761         6.0        0.663
IndexTTS        1.3        0.744     7.0        0.742     5.3        0.758     2.1        0.823         3.7        0.776

Table 4: MOS Scores for Zero-Shot Cloned Voice

Model           Prosody   Timbre   Quality   AVG
CosyVoice2      3.67      4.05     3.73      3.81
F5TTS           3.56      3.88     3.56      3.66
Fishspeech      3.40      3.63     3.69      3.57
FireRedTTS      3.79      3.72     3.60      3.70
XTTS            3.23      2.99     3.10      3.11
IndexTTS        3.79      4.20     4.05      4.01

Figure 2: Comparison of the distribution of codebook utilization rates of VQ and FSQ under different training data scales.

3.3.3. Comparison Results with Baselines

We select several of the most popular open-source zero-shot TTS models for comparison, including XTTS[1], CosyVoice2 (non-streaming)[22], FishSpeech[2], FireRedTTS[4] and F5-TTS[5]. The evaluation methodology encompasses both objective and subjective metrics: the word error rate (WER) for content consistency, the speaker embedding similarity (SS) for the fidelity of voice cloning, and the mean opinion score (MOS) for perceptual quality.

The objective WER and SS results for IndexTTS and the baseline models across the four test sets are presented in Table 3. IndexTTS significantly outperforms all other open-source models, demonstrating its robustness and stability. Regarding the SS metric, the performance gap between IndexTTS, CosyVoice2, and F5-TTS is minimal, yet these three models exhibit clear advantages over the other compared models.

To evaluate the perceptual quality of the synthesized audio, we carried out a MOS study covering three dimensions: prosody, timbre similarity, and sound quality. We conducted a double-blind evaluation on 100 samples randomly selected from the complete test set to ensure unbiased results. In the subjective evaluation, we place greater emphasis on the similarity between the synthesized audio and the prompt audio across all aspects. For example, if the sample speeches contain stutters or pauses, we assign a lower score to synthesized results that exhibit overly smooth prosody. We also consider how well the sound-field characteristics of the prompt audio are reproduced. The results are shown in Table 4: IndexTTS outperforms the baselines in nearly all evaluation dimensions, with significant advantages in timbre similarity and sound quality.

Moreover, we randomly selected 200 test samples and measured the total time taken by each model to synthesize them, along with the GPU resource consumption. The results are presented in Table 5.

Table 5: GPU Utilization Rate and Test Duration in Experimental Evaluation

Model           Duration(s)   GPU Util
CosyVoice2      805           48.41%
F5TTS           320           42.13%
Fishspeech      756           71.43%
FireRedTTS      732           92.65%
XTTS            488           87.65%
IndexTTS        397           28.47%

3.4. Conclusion

The IndexTTS system we developed is a GPT-style text-to-speech (TTS) model. It is capable of correcting the pronunciation of Chinese characters using pinyin and of controlling pauses at any position through punctuation marks. We enhanced multiple modules of the system, including the speaker condition feature representation, and integrated BigVGAN2 to optimize audio quality. Trained on tens of thousands of hours of data, our system achieves state-of-the-art performance, outperforming current popular TTS systems such as XTTS, CosyVoice2, Fish-Speech, and F5-TTS.

3.5. Limitations

Several limitations of this work should be acknowledged. Currently, our system does not support instructed voice generation, is limited to Chinese and English, and has insufficient capability to replicate rich emotional expressions. In future work, we plan to extend the system to additional languages, enhance emotion replication through methods such as reinforcement learning, and incorporate the ability to control hyper-realistic paralinguistic expressions, including laughter, hesitation, and surprise.
4. References

[1] E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi et al., "XTTS: A massively multilingual zero-shot text-to-speech model," arXiv preprint arXiv:2406.04904, 2024.

[2] S. Liao, Y. Wang, T. Li, Y. Cheng, R. Zhang, R. Zhou, and Y. Xing, "Fish-Speech: Leveraging large language models for advanced multilingual text-to-speech synthesis," arXiv preprint arXiv:2411.01156, 2024.

[3] Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., "CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens," arXiv preprint arXiv:2407.05407, 2024.

[4] H.-H. Guo, K. Liu, F.-Y. Shen, Y.-C. Wu, F.-L. Xie, K. Xie, and K.-T. Xu, "FireRedTTS: A foundation text-to-speech framework for industry-level generative speech applications," arXiv preprint arXiv:2409.03283, 2024.

[5] Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, "F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching," arXiv preprint arXiv:2410.06885, 2024.

[6] Z. Jiang, J. Liu, Y. Ren, J. He, C. Zhang, Z. Ye, P. Wei, C. Wang, X. Yin, Z. Ma et al., "Mega-TTS 2: Zero-shot text-to-speech with arbitrary length speech prompts," arXiv preprint arXiv:2307.07218, 2023.

[7] E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti, "YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone," in International Conference on Machine Learning. PMLR, 2022, pp. 2709–2720.

[8] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., "Neural codec language models are zero-shot text to speech synthesizers," arXiv preprint arXiv:2301.02111, 2023.

[9] P. Anastassiou, J. Chen, J. Chen, Y. Chen, Z. Chen, Z. Chen, J. Cong, L. Deng, C. Ding, L. Gao et al., "Seed-TTS: A family of high-quality versatile speech generation models," arXiv preprint arXiv:2406.02430, 2024.

[10] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.

[11] J. Betker, "Better speech synthesis through scaling," arXiv preprint arXiv:2305.07243, 2023.

[12] A. Van Den Oord, O. Vinyals et al., "Neural discrete representation learning," Advances in Neural Information Processing Systems, vol. 30, 2017.

[13] F. Mentzer, D. Minnen, E. Agustsson, and M. Tschannen, "Finite scalar quantization: VQ-VAE made simple," arXiv preprint arXiv:2309.15505, 2023.

[14] S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, "BigVGAN: A universal neural vocoder with large-scale training," arXiv preprint arXiv:2206.04658, 2022.

[15] S. Mehta, R. Tu, J. Beskow, É. Székely, and G. E. Henter, "Matcha-TTS: A fast TTS architecture with conditional flow matching," in ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11341–11345.

[16] A. Défossez, N. Usunier, L. Bottou, and F. Bach, "Demucs: Deep extractor for music sources with extra unlabeled data remixed," arXiv preprint arXiv:1909.01174, 2019.

[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.

[18] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). IEEE, 2017, pp. 1–5.

[19] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," arXiv preprint arXiv:1912.06670, 2019.

[20] Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, "Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition," arXiv preprint arXiv:2206.08317, 2022.

[21] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.

[22] Z. Du, Y. Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao, Y. Yang, C. Gao, H. Wang et al., "CosyVoice 2: Scalable streaming speech synthesis with large language models," arXiv preprint arXiv:2412.10117, 2024.
