IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Wei Deng, Siyi Zhou, Jingchen Shu, Jinchao Wang, Lu Wang
Artificial Intelligence Platform Department, bilibili, China
{xuanwu,zhousiyi02,shujingchen,wangjinchao,wanglu08}@bilibili.com
[Figure 1 diagram. Labels: Speaker Vector; Text-Speech Language Model; BigVGAN2 Decoder; Text Token; S: Start of Text; B: Start of Speech; T: End of Speech; Text; Prompt Speech; GT Speech]
Figure 1: An overview of IndexTTS. A text-to-speech language model conditioned on prompt speech and text tokens generates acoustic
tokens, and the BigVGAN2 decoder converts the LLM output latent into a waveform.
Table 1: Preprocessing Examples for Training Samples Combining Chinese Characters and Pinyin
Table 2: Error and Correction Statistics for Polyphonic Character Pronunciation

         Sentences   Percentage
Total    2500        100%
A1       465         18.6%
A2       437         94.0%

ity (SS), we utilize the ERes2Net2 model to extract the speaker embeddings from both the prompt and the generated utterances.
The raw cosine similarity between these embeddings is then regarded as the measure of speaker similarity.

Additionally, to evaluate the pronunciation correction capability for polyphonic characters, we constructed a challenging
Chinese polyphonic character test set comprising 2,500 entries.
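
The speaker-similarity measure described above reduces to a cosine similarity between two embedding vectors. The following is a minimal sketch of that computation only, not the evaluation code used in this work. The function name `cosine_similarity` and the toy four-dimensional vectors are illustrative assumptions; in practice the embeddings would be produced by the speaker encoder for the prompt and the generated utterance.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the raw cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings standing in for speaker-encoder outputs.
prompt_emb = [0.2, -0.1, 0.7, 0.4]
generated_emb = [0.25, -0.05, 0.6, 0.5]

# Score in [-1, 1]; values closer to 1 indicate more similar speakers.
ss = cosine_similarity(prompt_emb, generated_emb)
```

Because the score is scale-invariant, the two embeddings do not need to be length-normalized beforehand.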