Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
Technical Report
Shijia Liao1 Yuxuan Wang1 Tianyu Li1 Yifan Cheng1 Ruoyi Zhang1 Rongzhi Zhou1 Yijin Xing1
1 Fish Audio
{lengyue,honst,stardust}@fish.audio, yf_cheng@hust.edu.cn,
potato_zhang@nuist.edu.cn, laziman@fish.audio, rcell233@outlook.com
November 5, 2024
Abstract
Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features,
handling polyphonic expressions, and producing natural-sounding multilingual speech - capabilities
that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework
that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the
stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence generation tasks. This
architecture improves codebook processing efficiency while maintaining high-fidelity outputs, making
it particularly effective for AI interactions and voice cloning.
Fish-Speech leverages Large Language Models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion and thereby streamlining the synthesis pipeline and enhancing multilingual support. Additionally, we develop FF-GAN, a vocoder built on GFSQ, to achieve superior compression ratios and near-100% codebook utilization.
Our approach addresses key limitations of current TTS systems while providing a foundation for
more sophisticated, context-aware speech synthesis. Experimental results show that Fish-Speech
significantly outperforms baseline models in handling complex linguistic scenarios and voice cloning
tasks, demonstrating its potential to advance TTS technology in AI applications. The implementation
is open source at https://github.com/fishaudio/fish-speech.
1 Introduction
The past decade has seen remarkable progress in Text-to-Speech (TTS) systems, transforming applications from virtual
assistants to educational tools. Current TTS architectures, such as VALL-E [Wang et al. [2023]], VITS [Kim et al.
[2021]], Fastspeech [Ren et al. [2020]] typically rely on grapheme-to-phoneme (G2P) conversion [Klatt [1987]] to
convert text into phonetic representations before synthesis. While effective, this approach struggles with context-
dependent polyphonic words and cross-lingual generalization due to complex phonetic rules. Recent advances in
zero-shot voice conversion such as YourTTS [Casanova et al. [2022]] and unified speech generation model UniAudio
[Yang et al. [2023]] have shown the potential of neural architectures in handling various speech tasks. Additionally,
flow-based models such as CosyVoice [Du et al. [2024]] and Matcha-TTS [Mehta et al. [2024]] have demonstrated promising
results in natural speech synthesis. However, most of these solutions disentangle semantic and acoustic features as a trade-off to
improve stability, which weakens voice cloning and the model's understanding of background acoustic context.
As demand grows for multilingual TTS systems, the limitations of G2P-based approaches become more apparent. The
need for language-specific phonetic rules and lexicons hinders scalability and complicates system maintenance. Recent
research has explored the use of Large Language Models (LLMs) for direct linguistic feature extraction, eliminating the
need for explicit G2P conversion [Betker [2023]].
We introduce Fish-Speech, a novel TTS framework featuring a serial fast-slow dual autoregressive (Dual-AR) architecture.
This design improves the stability of grouped finite scalar vector quantization (GFSQ) in sequence generation
while maintaining high-quality output. By incorporating LLMs into the TTS pipeline, Fish-Speech simplifies the
synthesis process and better handles polyphonic characters and multilingual text. The model trains on 720,000 hours of
multilingual audio data, enabling it to learn diverse linguistic patterns and pronunciation variations.
To improve synthesis quality, we develop Firefly-GAN (FFGAN), a new vocoder architecture based on Grouped Finite
Scalar Vector Quantization (GFSQ). FFGAN combines Finite Scalar Quantization (FSQ) [Mentzer et al. [2023]] and
Group Vector Quantization (GVQ) to optimize compression ratios and codebook usage. Our evaluations show 100%
codebook utilization, representing state-of-the-art performance in this field.
The primary contributions of this work are as follows:
• We introduce Fish-Speech, a novel TTS framework that leverages LLMs and a Dual-AR architecture to replace
traditional G2P conversion, providing robust and scalable multilingual speech synthesis.
• We present FFGAN, an advanced vocoder that integrates multiple vector quantization techniques to achieve
high-fidelity speech synthesis with optimized compression ratios and codebook utilization.
• We develop fish-tech acceleration methodologies that enable the system to achieve real-time factors of approximately
1:5 on consumer-grade NVIDIA RTX 4060 mobile platforms and 1:15 on high-performance NVIDIA RTX 4090
configurations, with a first-packet latency of 150 ms, far lower than other TTS systems built on DiT and flow-matching structures.
We encourage readers to listen to our Fish-Speech 1.4 samples. We also highly recommend visiting our
online synthesis site fish.audio to try out the different voices created by the community.
2 Related Work
2.1 Text-to-Speech Systems
Text-to-Speech (TTS) systems have evolved dramatically from basic phoneme-based models to sophisticated end-to-end
neural approaches that directly convert text to speech [Tan et al. [2021]]. This transformation, driven by advances
in deep learning and increased computational power, has led to major improvements in speech naturalness, prosody
control, and cross-language capability [Ren et al. [2019]]. Modern TTS systems now serve diverse applications, from
intelligent assistants to accessibility tools and human-computer interfaces [Capes et al. [2017]].
Neural vocoders have played a key role in improving speech synthesis quality. WaveNet [Van Den Oord et al. [2016]]
first introduced autoregressive models for audio generation, followed by more efficient architectures like WaveRNN
[Kalchbrenner et al. [2018]] and WaveGrad [Chen et al. [2020]]. HiFi-GAN [Kong et al. [2020]] later introduced
adversarial training, setting new standards in audio quality and computational efficiency. EVA-GAN [Liao et al. [2024]]
is a recent GAN-based vocoder developed at NVIDIA; it uses a Context Aware Module (CAM) for
improved performance with minimal computational overhead. EVA-GAN shows superior performance over existing
state-of-the-art vocoders in both objective and subjective metrics, particularly in spectral continuity and high-frequency
reconstruction.
Vector Quantization (VQ) has become essential in modern speech synthesis. VQ-VAE [Van Den Oord et al. [2017]]
showed the effectiveness of discrete latent representations for audio generation, while SoundStream [Zeghidour et al.
[2021]] and EnCodec [Défossez et al. [2022]] further improved these techniques for high-quality audio compression
and synthesis.
Large Language Models (LLMs) are increasingly important in speech processing. A growing number of TTS systems use
BERT or HuBERT representations as an intermediate structure, including Parler-TTS [Lacombe et al. [2024]], MeloTTS
[Zhao et al. [2023]], E3-TTS [Gao et al. [2023]], and XTTS [Casanova et al. [2024]], all of which achieve improved
synthesis quality.
Multilingual speech synthesis faces unique challenges in maintaining consistent quality across languages. Recent
solutions include unified multilingual models [Liu and Mak [2019]], cross-lingual transfer learning [Nekvinda and
Dušek [2020]], and language-agnostic representations [Li et al. [2019]].
3 Methods
Fish-Speech is a novel Text-to-Speech (TTS) framework that addresses key limitations of current non-grapheme-to-
phoneme (non-G2P) TTS systems. The framework is specifically designed to handle multi-emotional and multilingual
speech synthesis, with a focus on meeting the demands of advanced AI conversational agents.
Building on recent advances in vector quantization and condition representation [Kumar et al. [2024], Chen et al.
[2023], Wang et al. [2019]], we introduce a Grouped Finite Scalar Vector Quantization (GFSQ) technique. This method
efficiently encodes latent conditions, enabling better capture and reproduction of subtle speech variations. Our approach
achieves 100% codebook utilization, maximizing the effectiveness of the quantization space.
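As a concrete illustration of the utilization claim, the sketch below shows one way to measure codebook utilization from the integer codes emitted by a quantizer. The tensor shapes and the 1024-entry codebook size are illustrative assumptions, not the released configuration.

```python
import torch

def codebook_utilization(indices: torch.Tensor, codebook_size: int) -> float:
    """Fraction of codebook entries that are used at least once.

    `indices` holds a quantizer's integer codes for a batch of utterances,
    e.g. shape (batch, groups, time). Utilization of 1.0 means every entry
    of the codebook appears somewhere in the encoded data.
    """
    used = torch.unique(indices.flatten())
    return used.numel() / codebook_size

# Hypothetical usage: codes produced by a GFSQ encoder with 1024 entries per group.
codes = torch.randint(0, 1024, (8, 4, 500))   # stand-in for real encoder output
print(f"utilization = {codebook_utilization(codes, 1024):.2%}")
```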
We also develop a dual autoregressive (dual-AR) architecture that solves two major challenges in current TTS systems.
First, it improves the stability of code generation, a common issue in existing frameworks. Second, it offers better
generation efficiency compared to Diffusion Transformers (DiT), making it well-suited for real-time applications. Finally,
and most importantly, it is ready for voice-agent use, a capability we will release in the near future.
3.1 Dual Autoregressive Architecture

This section describes the Dual Autoregressive (Dual-AR) architecture (Fig. 2) of Fish-Speech, a TTS system
designed to handle complex linguistic features, polyphonic words, and natural-sounding multilingual synthesis. The
Dual-AR architecture improves the stability and computational efficiency of codebook processing during sequence
generation, particularly when using Grouped Finite Scalar Vector Quantization (GFSQ).
Slow Transformer

The Slow Transformer operates at a higher level of abstraction, processing input text embeddings to capture global
linguistic structures and semantic content. It is responsible for generating intermediate hidden states and predicting
semantic tokens with high precision.
Given an input sequence of tokens $x = [x_1, x_2, \ldots, x_T]$, the Slow Transformer generates hidden states $h \in \mathbb{R}^{T \times D}$ and token logits $z$ through the following transformations:

$h = \mathrm{SlowTransformer}(x)$  (1)

$z = W_{\mathrm{tok}} \cdot \mathrm{Norm}(h)$  (2)

where $\mathrm{Norm}(\cdot)$ denotes layer normalization and $W_{\mathrm{tok}}$ denotes the learnable parameters of the token prediction layer.
Fast Transformer
The Fast Transformer refines the Slow Transformer’s output through codebook embedding processing, capturing
detailed acoustic features needed for natural speech. It processes residual information and optimizes codebook usage.
The Fast Transformer takes as input the concatenated sequence of hidden states $h$ and codebook embeddings $c$:

$\tilde{h} = [h; c]$  (3)

$h_{\mathrm{fast}} = \mathrm{FastTransformer}(\tilde{h})$  (4)

$y = W_{\mathrm{cbk}} \cdot \mathrm{Norm}(h_{\mathrm{fast}})$  (5)

where $[h; c]$ denotes the concatenation of $h$ and $c$, $W_{\mathrm{cbk}}$ comprises the learnable parameters of the codebook prediction layer, and $y$ denotes the resulting codebook logits.
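To make Eqs. (1)-(5) concrete, the following PyTorch sketch composes a slow and a fast transformer in the described order. The module sizes, the use of nn.TransformerEncoder as a stand-in for the autoregressive stacks, and the concatenation along the time axis are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class DualARSketch(nn.Module):
    """Illustrative fast-slow Dual-AR composition following Eqs. (1)-(5)."""

    def __init__(self, vocab_size=32000, codebook_size=1024, d_model=512):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.code_emb = nn.Embedding(codebook_size, d_model)
        # Stand-ins for the slow/fast autoregressive transformers.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.slow = nn.TransformerEncoder(layer, num_layers=6)
        self.fast = nn.TransformerEncoder(layer, num_layers=2)
        self.norm = nn.LayerNorm(d_model)
        self.w_tok = nn.Linear(d_model, vocab_size)      # W_tok
        self.w_cbk = nn.Linear(d_model, codebook_size)   # W_cbk

    def forward(self, text_ids, code_ids):
        h = self.slow(self.text_emb(text_ids))                     # Eq. (1)
        z = self.w_tok(self.norm(h))                               # Eq. (2): semantic-token logits
        h_tilde = torch.cat([h, self.code_emb(code_ids)], dim=1)   # Eq. (3): [h; c]
        h_fast = self.fast(h_tilde)                                # Eq. (4)
        y = self.w_cbk(self.norm(h_fast))                          # Eq. (5): codebook logits
        return z, y

model = DualARSketch()
z, y = model(torch.randint(0, 32000, (1, 16)), torch.randint(0, 1024, (1, 32)))
print(z.shape, y.shape)
```

In the actual system both stacks decode autoregressively step by step; the sketch only shows how the hidden states, semantic-token logits, and codebook logits relate to one another.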
The Dual-AR design offers the following advantages:

1. Enhanced Sequence Generation Stability: The hierarchical processing of global and local information
significantly improves the stability of GFSQ in sequence generation tasks.
2. Optimized Codebook Processing: The Fast Transformer implements an efficient mechanism for codebook
embedding processing that achieves improved performance without significant computational overhead,
particularly for models of scale 7B or larger.
3. Superior Speech Synthesis Quality: The synergistic interaction between Slow and Fast Transformers enables
high-fidelity speech synthesis capable of handling complex linguistic phenomena.
4. Advanced Multilingual Processing: The integration with Large Language Models (LLMs) for linguistic
feature generation eliminates traditional grapheme-to-phoneme conversion dependencies, thereby streamlining
the synthesis pipeline and enhancing multilingual capabilities. Mixing text data into training further enhances
comprehension.
3.2 Firefly-GAN
Firefly-GAN (FF-GAN) is an enhanced version of the EVA-GAN architecture with significant structural improvements.
It replaces the conventional convolutional and transposed convolutional components of HiFi-GAN [Kong et al. [2020]]
with a more efficient design, substituting a ParallelBlock for the Multi-Receptive Field (MRF) module; this block is
specifically engineered for typo-codebook vocoder applications. Through the integration of Grouped Finite Scalar
Vector Quantization (GFSQ), FF-GAN achieves better sequence generation stability, handles complex linguistic
variations more effectively, and facilitates natural multilingual synthesis. These architectural innovations distinguish
FF-GAN from traditional vocoders, particularly in text-to-speech synthesis and voice cloning applications within
AI agent frameworks.
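The ParallelBlock internals are not specified in this section; the sketch below gives one plausible reading of a parallel replacement for HiFi-GAN's MRF, with several dilated convolutional branches applied side by side and their outputs averaged into a residual update. The kernel sizes, dilations, and averaging choice are assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class ParallelBlockSketch(nn.Module):
    """Hypothetical ParallelBlock: parallel residual conv branches, outputs averaged."""

    def __init__(self, channels=512, kernel_sizes=(3, 7, 11), dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in kernel_sizes:
            layers = []
            for d in dilations:
                layers += [nn.LeakyReLU(0.1),
                           nn.Conv1d(channels, channels, k, dilation=d,
                                     padding=(k - 1) * d // 2)]
            self.branches.append(nn.Sequential(*layers))

    def forward(self, x):
        # Average the parallel branches instead of chaining MRF sub-blocks.
        return x + torch.stack([b(x) for b in self.branches], dim=0).mean(dim=0)

x = torch.randn(1, 512, 100)           # (batch, channels, frames)
print(ParallelBlockSketch()(x).shape)  # torch.Size([1, 512, 100])
```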
Downsampling

A downsampling function $f_{\mathrm{down}}$ is applied to the input tensor $z$, yielding a downsampled tensor $z_d \in \mathbb{R}^{B \times C_d \times L_d}$:

$z_d = f_{\mathrm{down}}(z)$  (6)
GFSQ Process

• Feature Grouping: the input feature matrix $Z$ is divided into $G$ groups:

  $Z = [Z^{(1)}, Z^{(2)}, \ldots, Z^{(G)}]$  (7)

• Scalar Quantization: each scalar $z^{(g)}_{b,c,l}$ is quantized as

  $\hat{z}^{(g)}_{b,c,l} = Q(z^{(g)}_{b,c,l})$  (8)

• Index Generation: each quantized scalar maps to a codebook index $k^{(g)}_{b,c,l}$.

• Decoding: quantized values are recovered from the group codebooks:

  $\hat{z}^{(g)}_{b,c,l} = \mathrm{Codebook}^{(g)}[k^{(g)}_{b,c,l}]$  (9)
Upsampling

An upsampling function $f_{\mathrm{up}}$ restores the quantized, downsampled tensor to its original size, yielding the final quantized tensor $z_q \in \mathbb{R}^{B \times C \times L}$:

$z_q = f_{\mathrm{up}}(z_{q_d})$  (11)

The goal is for $z_q$ to approximate the original input $z$ as closely as possible:

$z_q \approx z$  (12)
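The following is a minimal sketch of the grouped finite scalar quantization step (Eqs. 7-9), following the general FSQ recipe of bounding each scalar and rounding it to a small set of levels; the group count, level count, and tanh bounding are assumptions rather than the exact FF-GAN configuration. The downsampling and upsampling stages (Eqs. 6 and 11) wrap this quantizer and are omitted for brevity.

```python
import torch

def gfsq_quantize(z: torch.Tensor, groups: int = 4, levels: int = 8):
    """Grouped finite scalar quantization sketch for a (B, C, L) feature tensor.

    Channels are split into `groups`; each scalar is bounded with tanh and
    rounded to `levels` uniform values. The straight-through trick keeps the
    operation differentiable. Returns the quantized tensor and integer indices.
    """
    B, C, L = z.shape
    assert C % groups == 0
    zg = z.view(B, groups, C // groups, L)          # Eq. (7): split into G groups
    bounded = torch.tanh(zg)                        # squash into (-1, 1)
    step = 2.0 / (levels - 1)
    idx = torch.round((bounded + 1.0) / step).long()   # index k in [0, levels - 1]
    quant = idx.float() * step - 1.0                   # Eqs. (8)/(9): decode the index
    quant = bounded + (quant - bounded).detach()       # straight-through estimator
    return quant.view(B, C, L), idx

z = torch.randn(2, 16, 100)
z_hat, codes = gfsq_quantize(z)
print(z_hat.shape, codes.unique().numel(), "distinct levels used")
```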
3.2.3 Conclusion
Our GFSQ-based implementation achieves nearly 100% codebook utilization and obtains better objective and subjective
scores in our internal ablations than other quantization techniques such as RFSQ, RVQ, and GRFSQ. FF-GAN
significantly enhances stability in typo-codebook operations and ensures comprehensive retention of intermediate-variable
information in multi-emotional and multilingual tasks.

FF-GAN's approach to typo-codebook stability has already been used in various song and music generation
applications. The framework's performance and architecture position it as a reference model for future
AI agent development.
4.1 Training

Fish-Speech uses a three-stage training approach: initial pre-training with large batches of standard data, followed by
SFT using smaller batches of high-quality data, and finally DPO training using manually labeled positive and negative
sample pairs.
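For the final stage, the sketch below shows the standard direct preference optimization (DPO) loss applied to manually labeled positive/negative pairs; the value of beta and the way sequence log-probabilities are obtained are assumptions, since the report does not specify them.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta: float = 0.1):
    """Standard DPO loss on labeled positive/negative sample pairs.

    Each argument is the sequence log-probability of a sample under the policy
    being trained (logp_*) or under the frozen reference model (ref_logp_*).
    """
    pos_margin = logp_pos - ref_logp_pos
    neg_margin = logp_neg - ref_logp_neg
    return -F.logsigmoid(beta * (pos_margin - neg_margin)).mean()

# Toy usage with made-up log-probabilities for four preference pairs.
loss = dpo_loss(torch.tensor([-10., -12., -9., -11.]),
                torch.tensor([-14., -13., -15., -12.]),
                torch.tensor([-11., -12., -10., -11.]),
                torch.tensor([-13., -12., -14., -12.]))
print(loss.item())
```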
The training infrastructure was split into two components (Fig. 1): AR training used 8x H100 80GB GPUs for
one week, while vocoder training used 8x RTX 4090 GPUs for an additional week. Note that these timelines
exclude the DPO stage.
4.2 Inference
Our inference strategy follows the architecture in Fig. 1. Using fish-tech acceleration methodologies, including
KV-caching [Pope et al. [2023]] and torch.compile, the system achieves real-time factors of approximately 1:5 on
consumer-grade NVIDIA RTX 4060 mobile platforms and 1:15 on high-performance NVIDIA RTX 4090 configurations.
These optimizations significantly reduce inference latency, achieving a first-packet latency of 150 ms.
Furthermore, the system can process information in a streaming fashion, making it easy to integrate with modern
AI tools and deploy in diverse scenarios.
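As an illustration of how these figures can be reproduced, the sketch below measures first-packet latency and real-time factor for a streaming synthesizer; synthesize_streaming is a hypothetical interface standing in for the actual streaming API, and the torch.compile call is optional.

```python
import time
import torch

def measure_streaming(model, text: str, sample_rate: int = 44100):
    """Measure first-packet latency and real-time factor for a streaming TTS model.

    `model.synthesize_streaming` is a hypothetical generator yielding 1-D audio
    chunks; substitute the actual streaming API of the system under test.
    """
    model = torch.compile(model)          # optional graph compilation of the forward pass
    start = time.perf_counter()
    first_packet, total_samples = None, 0
    for chunk in model.synthesize_streaming(text):
        if first_packet is None:
            first_packet = time.perf_counter() - start
        total_samples += chunk.numel()
    wall = time.perf_counter() - start
    audio_seconds = total_samples / sample_rate
    print(f"first-packet latency: {first_packet * 1000:.0f} ms, "
          f"RTF = {wall / audio_seconds:.3f} (1:{audio_seconds / wall:.1f})")
```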
5 Dataset
Our training data includes a large collection of speech samples from both public sources and our own data collection
process. The dataset contains about 720,000 hours of speech across different languages, with 300,000 hours each of
English and Mandarin Chinese as the main components. We also included 20,000 hours each of other language families:
Germanic (German), Romance (French, Italian), East Asian (Japanese, Korean), and Semitic (Arabic).
We carefully balanced the data across languages to help the model learn multiple languages at once. This approach
helps the model perform well when generating mixed-language content. The large size and variety of our dataset
significantly improve the model's ability to handle multiple languages naturally.
6 Experimental Evaluation
We conducted an evaluation of the speaker cloning task to assess the effect of our architecture compared to the
baseline models.¹ The evaluation methodology includes both objective and subjective metrics: word error rate (WER)
for intelligibility assessment, speaker embedding similarity for voice-cloning fidelity assessment, and mean
opinion score (MOS) for perceptual quality quantification. This evaluation framework aims to assess the model's ability
to preserve speaker identity while maintaining high-fidelity speech synthesis.
Table 1: Word Error Rate (WER) Results for Voice Cloning Tasks
Analysis of Table 1 shows that our model achieves a WER of 6.89% in voice cloning tasks, which is not only much
lower than that of the baseline models but also lower than that of the ground truth recordings (9.22%). This result provides
strong evidence for our model's ability in voice cloning scenarios. The gap between our model and competing models
(ranging from 11.92% to 22.20%) underscores the improved synthesis stability and content fidelity of our methodology.
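Per the evaluation footnotes, transcriptions for WER were produced with the OpenAI Whisper-medium model. A minimal sketch of such a WER computation using the openai-whisper and jiwer packages follows; the file names and lower-casing normalization are assumptions.

```python
import whisper                    # openai-whisper package
from jiwer import wer

def clone_wer(pairs):
    """Compute corpus-level WER for (reference_text, audio_path) pairs
    transcribed with the Whisper-medium ASR model."""
    model = whisper.load_model("medium")
    refs, hyps = [], []
    for ref_text, audio_path in pairs:
        hyps.append(model.transcribe(audio_path)["text"].strip().lower())
        refs.append(ref_text.strip().lower())
    return wer(refs, hyps)

# Hypothetical usage on two synthesized utterances.
print(clone_wer([("the birch canoe slid on the smooth planks", "sample_001.wav"),
                 ("glue the sheet to the dark blue background", "sample_002.wav")]))
```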
Table 2: Speaker similarity scores for different models, including ground truth.
Table 2 shows the effect of our typo-codebook strategy on speaker similarity metrics. Our fish-speech model achieves
similarity scores of 0.914 and 0.762 on Resemblyzer and SpeechBrain, respectively, which are remarkably close to the
ground truth performance (0.921 and 0.770). The gap of only 0.76% from the ground truth in Resemblyzer and 1.04%
in SpeechBrain evaluations shows the superior ability of our model to capture natural speech characteristics. The results
strongly suggest that our typo-codebook architecture enables a more comprehensive capture of acoustic states, leading
to improved timbral fidelity of synthesized speech. Our approach significantly outperforms baseline models such as
F5-TTS (0.905 and 0.787) and rechoo (0.887 and 0.636). The consistent performance in both evaluation frameworks
proves the validity of our method in preserving speaker features, which is crucial for high-quality text-to-speech
synthesis and agent tasks.
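The two similarity columns in Table 2 correspond to Resemblyzer embeddings and a SpeechBrain speaker-verification model. A hedged sketch of both measurements is given below; the specific SpeechBrain checkpoint (spkrec-ecapa-voxceleb) and the cosine-similarity scoring for Resemblyzer are assumptions about the exact protocol.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
from speechbrain.inference.speaker import SpeakerRecognition

def resemblyzer_similarity(ref_wav: str, clone_wav: str) -> float:
    """Cosine similarity between Resemblyzer d-vectors (embeddings are L2-normalized)."""
    encoder = VoiceEncoder()
    ref = encoder.embed_utterance(preprocess_wav(ref_wav))
    clone = encoder.embed_utterance(preprocess_wav(clone_wav))
    return float(np.dot(ref, clone))

def speechbrain_similarity(ref_wav: str, clone_wav: str) -> float:
    """Verification score from a pretrained ECAPA-TDNN model (assumed checkpoint)."""
    verifier = SpeakerRecognition.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        savedir="pretrained_models/spkrec-ecapa-voxceleb")
    score, _ = verifier.verify_files(ref_wav, clone_wav)
    return float(score)

print(resemblyzer_similarity("reference.wav", "cloned.wav"))
print(speechbrain_similarity("reference.wav", "cloned.wav"))
```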
Table 3: Five-scale Mean Opinion Score (MOS) Ratings of Cloned Voice Quality
To evaluate the perceptual quality of synthesized audio, we conducted a comprehensive Mean Opinion Score (MOS)
listening test with naive listeners who had no prior experience with audio processing. The evaluation followed a
double-blind, randomized protocol to ensure unbiased ratings. The results show that Fish-Speech achieved significantly
higher subjective scores than the other baseline models (p < 0.05), demonstrating superior performance in terms
of speech naturalness and speaker similarity. This result on human perception metrics strongly suggests that
Fish-Speech better captures and reproduces the natural characteristics of human speech, especially in the context of
voice cloning tasks.
7 Conclusion
Our research represents a significant advance in the field of text-to-speech (TTS) by introducing a novel multilingual
and multi-emotional stabilization solution. The core innovation lies in our development of a typo-codebook vocoder
integrated with a dual autoregressive (dual-AR) generation architecture. This architectural combination shows stability
in the synthesis process while preserving acoustic features within the generated speech. Furthermore, our work utilizes a
non-grapheme-to-phoneme (non-G2P) structure, an approach that effectively addresses limitations inherent in traditional
phoneme-based systems and provides a robust foundation for cross-lingual and emotionally diverse TTS applications,
particularly in the context of AI agent interactions.

Footnotes:
1. For experimental validation, we constrained our analysis to monolingual voice cloning scenarios, excluding cross-lingual
synthesis tasks. The evaluation corpus comprised 10 distinct speaker identities (covering different languages), with 30
synthesized utterances per speaker, yielding an evaluation set of 300 samples.
2. Results were obtained using the OpenAI Whisper-medium ASR model for transcription evaluation.
3. Experiments were conducted using the SpeechBrain framework [Ravanelli et al. [2024]], version 1.0.1, as the underlying
speech processing toolkit.
8 Future Work
Building on these foundations, we propose several directions for future research. We plan to improve the performance
of our model by integrating reinforcement learning techniques, with a focus on cross-lingual generalization and
emotional stability. We are also developing the Fish Agent application, an end-to-end language model based on our
Fish-Speech framework. A preliminary demonstration of this system is currently available at fish.audio/demo/live. We
remain committed to the open source community and will continue to maintain and extend our codebase to provide
broader access to these technologies for researchers and developers.
References
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming
Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint
arXiv:2301.02111, 2023.
Jaehyeon Kim, Jungil Kong, and Juhee Son. Conditional variational autoencoder with adversarial learning for end-to-end
text-to-speech. In International Conference on Machine Learning, pages 5530–5540. PMLR, 2021.
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality
end-to-end text to speech. arXiv preprint arXiv:2006.04558, 2020.
Dennis H Klatt. Review of text-to-speech conversion for english. The Journal of the Acoustical Society of America, 82
(3):737–793, 1987.
Edresson Casanova, Julian Weber, Christopher D Shulby, Arnaldo Candido Junior, Eren Gölge, and Moacir A Ponti.
Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone. In International Conference
on Machine Learning, pages 2709–2720. PMLR, 2022.
Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao,
Jiang Bian, Xixin Wu, et al. Uniaudio: An audio foundation model toward universal audio generation. arXiv preprint
arXiv:2310.00704, 2023.
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma,
et al. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens.
arXiv preprint arXiv:2407.05407, 2024.
Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter. Matcha-tts: A fast tts architecture with
conditional flow matching. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 11341–11345. IEEE, 2024.
James Betker. Better speech synthesis through scaling. arXiv preprint arXiv:2305.07243, 2023.
Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made
simple. arXiv preprint arXiv:2309.15505, 2023.
Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu. A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561,
2021.
Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Almost unsupervised text to speech and automatic
speech recognition. In International conference on machine learning, pages 5410–5419. PMLR, 2019.
Tim Capes, Paul Coles, Alistair Conkie, Ladan Golipour, Abie Hadjitarkhani, Qiong Hu, Nancy Huddleston, Melvyn
Hunt, Jiangchuan Li, Matthias Neeracher, et al. Siri on-device deep learning-guided unit selection text-to-speech
system. In Interspeech, pages 4011–4015, 2017.
Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbren-
ner, Andrew Senior, Koray Kavukcuoglu, et al. Wavenet: A generative model for raw audio. arXiv preprint
arXiv:1609.03499, 12, 2016.
Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg,
Aaron Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In International Conference
on Machine Learning, pages 2410–2419. PMLR, 2018.
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating
gradients for waveform generation. arXiv preprint arXiv:2009.00713, 2020.
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high
fidelity speech synthesis. Advances in neural information processing systems, 33:17022–17033, 2020.
Shijia Liao, Shiyi Lan, and Arun George Zachariah. Eva-gan: Enhanced various audio generation via scalable generative
adversarial networks. arXiv preprint arXiv:2402.00892, 2024.
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information
processing systems, 30, 2017.
Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end
neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv
preprint arXiv:2210.13438, 2022.
Yoach Lacombe, Vaibhav Srivastav, and Sanchit Gandhi. Parler-TTS. https://github.com/huggingface/parler-tts, 2024.
Wenliang Zhao, Xumin Yu, and Zengyi Qin. Melotts: High-quality multi-lingual multi-accent text-to-speech, 2023.
URL https://github.com/myshell-ai/MeloTTS.
Yuan Gao, Nobuyuki Morioka, Yu Zhang, and Nanxin Chen. E3 tts: Easy end-to-end diffusion-based text to speech. In
2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023.
Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer,
Reuben Morais, Samuel Olayemi, et al. Xtts: a massively multilingual zero-shot text-to-speech model. arXiv preprint
arXiv:2406.04904, 2024.
Zhaoyu Liu and Brian Mak. Cross-lingual multi-speaker text-to-speech synthesis for voice cloning without using
parallel corpus for unseen speakers. arXiv preprint arXiv:1911.11601, 2019.
Tomáš Nekvinda and Ondřej Dušek. One model, many languages: Meta-learning for multilingual text-to-speech. arXiv
preprint arXiv:2008.00768, 2020.
Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, and William Chan. Bytes are all you need: End-to-end multilingual
speech recognition and synthesis with bytes. In ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 5621–5625. IEEE, 2019.
Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio
compression with improved rvqgan. Advances in Neural Information Processing Systems, 36, 2024.
Li-Wei Chen, Shinji Watanabe, and Alexander Rudnicky. A vector quantized approach for text to speech synthesis on
real-world spontaneous speech. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages
12644–12652, 2023.
Xin Wang, Shinji Takaki, Junichi Yamagishi, Simon King, and Keiichi Tokuda. A vector quantized variational
autoencoder (VQ-VAE) autoregressive neural F0 model for statistical parametric speech synthesis. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 28:157–170, 2019.
A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. Attention is all you need in
speech separation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 21–25. IEEE, 2021.
Andrew G Howard. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint
arXiv:1704.04861, 2017.
F Yu. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao,
Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and
Systems, 5:606–624, 2023.
Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang,
Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao,
Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Pierre Champion, Aku Rouhe, Rudolf Braun, Florian Mai,
Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Xuechen Liu, Sangeet Sagar, Jarod Duret, Salima
Mdhaffar, Gaelle Laperriere, Mickael Rouvier, Renato De Mori, and Yannick Esteve. Open-source conversational ai
with SpeechBrain 1.0, 2024. URL https://arxiv.org/abs/2407.00463.
A Training Details
We trained our model on an NVIDIA H100 GPU with the following hyperparameters:
Optimization:
Training Config: