Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
Technical Report
Shijia Liao1 Yuxuan Wang1 Tianyu Li1 Yifan Cheng1 Ruoyi Zhang1 Rongzhi Zhou1 Yijin Xing1
1 Fish Audio
{lengyue,honst,stardust}@fish.audio, yf_cheng@hust.edu.cn,
potato_zhang@nuist.edu.cn, laziman@fish.audio, rcell233@outlook.com
November 5, 2024
Abstract
Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features,
handling polyphonic expressions, and producing natural-sounding multilingual speech - capabilities
that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework
that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the
stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence generation tasks. This
architecture improves codebook processing efficiency while maintaining high-fidelity outputs, making
it particularly effective for AI interactions and voice cloning.
Fish-Speech leverages Large Language Models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion and thereby streamlining the synthesis pipeline and enhancing multilingual support. Additionally, we develop FF-GAN, a vocoder built on GFSQ, to achieve superior compression ratios and near-100% codebook utilization.
Our approach addresses key limitations of current TTS systems while providing a foundation for
more sophisticated, context-aware speech synthesis. Experimental results show that Fish-Speech
significantly outperforms baseline models in handling complex linguistic scenarios and voice cloning
tasks, demonstrating its potential to advance TTS technology in AI applications. The implementation
is open source at https://github.com/fishaudio/fish-speech.
1 Introduction
The past decade has seen remarkable progress in Text-to-Speech (TTS) systems, transforming applications from virtual
assistants to educational tools. Current TTS architectures, such as VALL-E [Wang et al. [2023]], VITS [Kim et al.
[2021]], Fastspeech [Ren et al. [2020]] typically rely on grapheme-to-phoneme (G2P) conversion [Klatt [1987]] to
convert text into phonetic representations before synthesis. While effective, this approach struggles with context-
dependent polyphonic words and cross-lingual generalization due to complex phonetic rules. Recent advances in
zero-shot voice conversion such as YourTTS [Casanova et al. [2022]] and unified speech generation model UniAudio
[Yang et al. [2023]] have shown the potential of neural architectures in handling various speech tasks. Additionally,
flow-based models such as CosyVoice [Du et al. [2024]] and Matcha-TTS [Mehta et al. [2024]] have demonstrated promising
results in natural speech synthesis. However, most of these solutions disentangle semantic and acoustic features as a trade-off to
improve stability, which weakens voice cloning and the model's understanding of background acoustic context.
As demand grows for multilingual TTS systems, the limitations of G2P-based approaches become more apparent. The
need for language-specific phonetic rules and lexicons hinders scalability and complicates system maintenance. Recent
research has explored the use of Large Language Models (LLMs) for direct linguistic feature extraction, eliminating the
need for explicit G2P conversion [Betker [2023]].
We introduce Fish-Speech, a novel TTS framework featuring a serial fast-slow dual autoregressive (Dual-AR) architecture.
This design improves the stability of grouped finite scalar vector quantization (GFSQ) in sequence generation
while maintaining high-quality output. By incorporating LLMs into the TTS pipeline, Fish-Speech simplifies the
synthesis process and better handles polyphonic characters and multilingual text. The model trains on 720,000 hours of
multilingual audio data, enabling it to learn diverse linguistic patterns and pronunciation variations.
To improve synthesis quality, we develop Firefly-GAN (FFGAN), a new vocoder architecture based on Grouped Finite
Scalar Vector Quantization (GFSQ). FFGAN combines Finite Scalar Quantization (FSQ) [Mentzer et al. [2023]] and
Group Vector Quantization (GVQ) to optimize compression ratios and codebook usage. Our evaluations show 100%
codebook utilization, representing state-of-the-art performance in this field.
The primary contributions of this work are as follows:
• We introduce Fish-Speech, a novel TTS framework that leverages LLMs and a Dual-AR architecture to replace
traditional G2P conversion, providing robust and scalable multilingual speech synthesis.
• We present FFGAN, an advanced vocoder that integrates multiple vector quantization techniques to achieve
high-fidelity speech synthesis with optimized compression ratios and codebook utilization.
• We develop fish-tech acceleration methodologies that enable the system to achieve real-time factors of approximately
1:5 on consumer-grade NVIDIA RTX 4060 mobile platforms and 1:15 on high-performance NVIDIA RTX 4090
configurations, with a first-packet latency of 150 ms, far lower than other TTS systems built on DiT and flow-matching structures.
We encourage readers to listen to our Fish-Speech 1.4 samples. We also highly recommend visiting our
online synthesis site fish.audio to try out the different voices created by the community.
2 Related Work
2.1 Text-to-Speech Systems
Text-to-Speech (TTS) systems have evolved dramatically from basic phoneme-based models to sophisticated end-to-end
neural approaches that directly convert text to speech [Tan et al. [2021]]. This transformation, driven by advances
in deep learning and increased computational power, has led to major improvements in speech naturalness, prosody
control, and cross-language capability [Ren et al. [2019]]. Modern TTS systems now serve diverse applications, from
intelligent assistants to accessibility tools and human-computer interfaces [Capes et al. [2017]].
Neural vocoders have played a key role in improving speech synthesis quality. WaveNet [Van Den Oord et al. [2016]]
first introduced autoregressive models for audio generation, followed by more efficient architectures like WaveRNN
[Kalchbrenner et al. [2018]] and WaveGrad [Chen et al. [2020]]. HiFi-GAN [Kong et al. [2020]] later introduced
adversarial training, setting new standards in audio quality and computational efficiency. EVA-GAN [Liao et al. [2024]]
is a recent GAN-based vocoder developed at NVIDIA; it uses a Context Aware Module (CAM) for
improved performance with minimal computational overhead. EVA-GAN shows superior performance over existing
state-of-the-art vocoders in both objective and subjective metrics, particularly in spectral continuity and high-frequency
reconstruction.
Vector Quantization (VQ) has become essential in modern speech synthesis. VQ-VAE [Van Den Oord et al. [2017]]
showed the effectiveness of discrete latent representations for audio generation, while SoundStream [Zeghidour et al.
[2021]] and EnCodec [Défossez et al. [2022]] further improved these techniques for high-quality audio compression
and synthesis.
Large Language Models (LLMs) are increasingly important in speech processing. A growing number of TTS systems use
BERT or HuBERT representations as an intermediate structure, including Parler-TTS [Lacombe et al. [2024]], MeloTTS
[Zhao et al. [2023]], E3-TTS [Gao et al. [2023]], and XTTS [Casanova et al. [2024]], all of which achieve improved
synthesis quality.
Multilingual speech synthesis faces unique challenges in maintaining consistent quality across languages. Recent
solutions include unified multilingual models [Liu and Mak [2019]], cross-lingual transfer learning [Nekvinda and
Dušek [2020]], and language-agnostic representations [Li et al. [2019]].
3 Methods
Fish-Speech is a novel Text-to-Speech (TTS) framework that addresses key limitations of current non-grapheme-to-
phoneme (non-G2P) TTS systems. The framework is specifically designed to handle multi-emotional and multilingual
speech synthesis, with a focus on meeting the demands of advanced AI conversational agents.
Building on recent advances in vector quantization and condition representation [Kumar et al. [2024], Chen et al.
[2023], Wang et al. [2019]], we introduce a Grouped Finite Scalar Vector Quantization (GFSQ) technique. This method
efficiently encodes latent conditions, enabling better capture and reproduction of subtle speech variations. Our approach
achieves 100% codebook utilization, maximizing the effectiveness of the quantization space.
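As a concrete illustration of the utilization claim, the sketch below shows one way to measure codebook utilization from the integer codes emitted by a quantizer. The tensor shapes and the 1024-entry codebook size are illustrative assumptions, not the released configuration.

```python
import torch

def codebook_utilization(indices: torch.Tensor, codebook_size: int) -> float:
    """Fraction of codebook entries that are used at least once.

    `indices` holds a quantizer's integer codes for a batch of utterances,
    e.g. shape (batch, groups, time). Utilization of 1.0 means every entry
    of the codebook appears somewhere in the encoded data.
    """
    used = torch.unique(indices.flatten())
    return used.numel() / codebook_size

# Hypothetical usage: codes produced by a GFSQ encoder with 1024 entries per group.
codes = torch.randint(0, 1024, (8, 4, 500))   # stand-in for real encoder output
print(f"utilization = {codebook_utilization(codes, 1024):.2%}")
```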
We also develop a dual autoregressive (dual-AR) architecture that solves two major challenges in current TTS systems.
First, it improves the stability of code generation, a common issue in existing frameworks. Second, it offers better
generation efficiency compared to Diffusion Transformers (DiT), making it well-suited for real-time applications. Finally,
and most importantly, it is ready for voice-agent use, a capability we will release in the near future.
3.1 Dual Autoregressive Architecture

This section describes the Dual Autoregressive (Dual-AR) architecture (Fig. 2) of Fish-Speech, a TTS system
designed to handle complex linguistic features, polyphonic words, and natural-sounding multilingual synthesis. The
Dual-AR architecture improves the stability and computational efficiency of codebook processing during sequence
generation, particularly when using Grouped Finite Scalar Vector Quantization (GFSQ).
Slow Transformer

The Slow Transformer operates at a higher level of abstraction, processing input text embeddings to capture global
linguistic structures and semantic content. It is responsible for generating intermediate hidden states and predicting
semantic tokens with high precision.
Given an input sequence of tokens $x = [x_1, x_2, \ldots, x_T]$, the Slow Transformer generates hidden states $h \in \mathbb{R}^{T \times D}$ and token logits $z$ through the following transformations:

$h = \mathrm{SlowTransformer}(x)$  (1)

$z = W_{\mathrm{tok}} \cdot \mathrm{Norm}(h)$  (2)

where $\mathrm{Norm}(\cdot)$ denotes layer normalization and $W_{\mathrm{tok}}$ denotes the learnable parameters of the token prediction layer.
Fast Transformer
The Fast Transformer refines the Slow Transformer’s output through codebook embedding processing, capturing
detailed acoustic features needed for natural speech. It processes residual information and optimizes codebook usage.
The Fast Transformer takes as input the concatenated sequence of hidden states $h$ and codebook embeddings $c$:

$\tilde{h} = [h; c]$  (3)

$h_{\mathrm{fast}} = \mathrm{FastTransformer}(\tilde{h})$  (4)

$y = W_{\mathrm{cbk}} \cdot \mathrm{Norm}(h_{\mathrm{fast}})$  (5)

where $[h; c]$ denotes the concatenation of $h$ and $c$, $W_{\mathrm{cbk}}$ comprises the learnable parameters of the codebook prediction layer, and $y$ denotes the resulting codebook logits.
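To make Eqs. (1)-(5) concrete, the following PyTorch sketch composes a slow and a fast transformer in the described order. The module sizes, the use of nn.TransformerEncoder as a stand-in for the autoregressive stacks, and the concatenation along the time axis are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class DualARSketch(nn.Module):
    """Illustrative fast-slow Dual-AR composition following Eqs. (1)-(5)."""

    def __init__(self, vocab_size=32000, codebook_size=1024, d_model=512):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.code_emb = nn.Embedding(codebook_size, d_model)
        # Stand-ins for the slow/fast autoregressive transformers.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.slow = nn.TransformerEncoder(layer, num_layers=6)
        self.fast = nn.TransformerEncoder(layer, num_layers=2)
        self.norm = nn.LayerNorm(d_model)
        self.w_tok = nn.Linear(d_model, vocab_size)      # W_tok
        self.w_cbk = nn.Linear(d_model, codebook_size)   # W_cbk

    def forward(self, text_ids, code_ids):
        h = self.slow(self.text_emb(text_ids))                     # Eq. (1)
        z = self.w_tok(self.norm(h))                               # Eq. (2): semantic-token logits
        h_tilde = torch.cat([h, self.code_emb(code_ids)], dim=1)   # Eq. (3): [h; c]
        h_fast = self.fast(h_tilde)                                # Eq. (4)
        y = self.w_cbk(self.norm(h_fast))                          # Eq. (5): codebook logits
        return z, y

model = DualARSketch()
z, y = model(torch.randint(0, 32000, (1, 16)), torch.randint(0, 1024, (1, 32)))
print(z.shape, y.shape)
```

In the actual system both stacks decode autoregressively step by step; the sketch only shows how the hidden states, semantic-token logits, and codebook logits relate to one another.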
The Dual-AR design offers the following advantages:

1. Enhanced Sequence Generation Stability: The hierarchical processing of global and local information
significantly improves the stability of GFSQ in sequence generation tasks.
2. Optimized Codebook Processing: The Fast Transformer implements an efficient mechanism for codebook
embedding processing that achieves improved performance without significant computational overhead,
particularly for models of scale 7B or larger.
3. Superior Speech Synthesis Quality: The synergistic interaction between Slow and Fast Transformers enables
high-fidelity speech synthesis capable of handling complex linguistic phenomena.
4. Advanced Multilingual Processing: The integration with Large Language Models (LLMs) for linguistic
feature generation eliminates traditional grapheme-to-phoneme conversion dependencies, thereby streamlining
the synthesis pipeline and enhancing multilingual capabilities. Mixing text data into training further enhances
comprehension.
3.2 Firefly-GAN
Firefly-GAN (FF-GAN) is an enhanced version of the EVA-GAN architecture with significant structural improvements.
It replaces the conventional convolutional and transposed convolutional components of HiFi-GAN [Kong et al. [2020]]
with a more efficient design, substituting a ParallelBlock for the Multi-Receptive Field (MRF) module; this block is
specifically engineered for typo-codebook vocoder applications. Through the integration of Grouped Finite Scalar
Vector Quantization (GFSQ), FF-GAN achieves better sequence generation stability, handles complex linguistic
variations more effectively, and facilitates natural multilingual synthesis. These architectural innovations distinguish
FF-GAN from traditional vocoders, particularly in text-to-speech synthesis and voice cloning applications within
AI agent frameworks.
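The ParallelBlock internals are not specified in this section; the sketch below gives one plausible reading of a parallel replacement for HiFi-GAN's MRF, with several dilated convolutional branches applied side by side and their outputs averaged into a residual update. The kernel sizes, dilations, and averaging choice are assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class ParallelBlockSketch(nn.Module):
    """Hypothetical ParallelBlock: parallel residual conv branches, outputs averaged."""

    def __init__(self, channels=512, kernel_sizes=(3, 7, 11), dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in kernel_sizes:
            layers = []
            for d in dilations:
                layers += [nn.LeakyReLU(0.1),
                           nn.Conv1d(channels, channels, k, dilation=d,
                                     padding=(k - 1) * d // 2)]
            self.branches.append(nn.Sequential(*layers))

    def forward(self, x):
        # Average the parallel branches instead of chaining MRF sub-blocks.
        return x + torch.stack([b(x) for b in self.branches], dim=0).mean(dim=0)

x = torch.randn(1, 512, 100)           # (batch, channels, frames)
print(ParallelBlockSketch()(x).shape)  # torch.Size([1, 512, 100])
```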
Downsampling

A downsampling function $f_{\mathrm{down}}$ is applied to the input tensor $z$, yielding a downsampled tensor $z_d \in \mathbb{R}^{B \times C_d \times L_d}$:

$z_d = f_{\mathrm{down}}(z)$  (6)
GFSQ Process

• Feature Grouping: the input feature matrix $Z$ is divided into $G$ groups:

  $Z = [Z^{(1)}, Z^{(2)}, \ldots, Z^{(G)}]$  (7)

• Scalar Quantization: each scalar $z^{(g)}_{b,c,l}$ is quantized as

  $\hat{z}^{(g)}_{b,c,l} = Q(z^{(g)}_{b,c,l})$  (8)

• Index Generation: each quantized scalar maps to a codebook index $k^{(g)}_{b,c,l}$.

• Decoding: quantized values are recovered from the group codebooks:

  $\hat{z}^{(g)}_{b,c,l} = \mathrm{Codebook}^{(g)}[k^{(g)}_{b,c,l}]$  (9)
Upsampling

An upsampling function $f_{\mathrm{up}}$ restores the quantized, downsampled tensor to its original size, yielding the final quantized tensor $z_q \in \mathbb{R}^{B \times C \times L}$:

$z_q = f_{\mathrm{up}}(z_{q_d})$  (11)

The goal is for $z_q$ to approximate the original input $z$ as closely as possible:

$z_q \approx z$  (12)
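The following is a minimal sketch of the grouped finite scalar quantization step (Eqs. 7-9), following the general FSQ recipe of bounding each scalar and rounding it to a small set of levels; the group count, level count, and tanh bounding are assumptions rather than the exact FF-GAN configuration. The downsampling and upsampling stages (Eqs. 6 and 11) wrap this quantizer and are omitted for brevity.

```python
import torch

def gfsq_quantize(z: torch.Tensor, groups: int = 4, levels: int = 8):
    """Grouped finite scalar quantization sketch for a (B, C, L) feature tensor.

    Channels are split into `groups`; each scalar is bounded with tanh and
    rounded to `levels` uniform values. The straight-through trick keeps the
    operation differentiable. Returns the quantized tensor and integer indices.
    """
    B, C, L = z.shape
    assert C % groups == 0
    zg = z.view(B, groups, C // groups, L)          # Eq. (7): split into G groups
    bounded = torch.tanh(zg)                        # squash into (-1, 1)
    step = 2.0 / (levels - 1)
    idx = torch.round((bounded + 1.0) / step).long()   # index k in [0, levels - 1]
    quant = idx.float() * step - 1.0                   # Eqs. (8)/(9): decode the index
    quant = bounded + (quant - bounded).detach()       # straight-through estimator
    return quant.view(B, C, L), idx

z = torch.randn(2, 16, 100)
z_hat, codes = gfsq_quantize(z)
print(z_hat.shape, codes.unique().numel(), "distinct levels used")
```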
3.2.3 Conclusion
Our GFSQ-based implementation achieves nearly 100% codebook utilization and obtains better objective and subjective
scores in our internal ablations than other quantization techniques such as RFSQ, RVQ, and GRFSQ. FF-GAN
significantly enhances stability in typo-codebook operations and ensures comprehensive retention of intermediate-variable
information in multi-emotional and multilingual tasks.

FF-GAN's approach to typo-codebook stability has already been used in various song and music generation
applications. The framework's performance and architecture position it as a reference model for future
AI agent development.
4.1 Training

Fish-Speech uses a three-stage training approach: initial pre-training with large batches of standard data, followed by
SFT using smaller batches of high-quality data, and finally DPO training using manually labeled positive and negative
sample pairs.
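For the final stage, the sketch below shows the standard direct preference optimization (DPO) loss applied to manually labeled positive/negative pairs; the value of beta and the way sequence log-probabilities are obtained are assumptions, since the report does not specify them.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta: float = 0.1):
    """Standard DPO loss on labeled positive/negative sample pairs.

    Each argument is the sequence log-probability of a sample under the policy
    being trained (logp_*) or under the frozen reference model (ref_logp_*).
    """
    pos_margin = logp_pos - ref_logp_pos
    neg_margin = logp_neg - ref_logp_neg
    return -F.logsigmoid(beta * (pos_margin - neg_margin)).mean()

# Toy usage with made-up log-probabilities for four preference pairs.
loss = dpo_loss(torch.tensor([-10., -12., -9., -11.]),
                torch.tensor([-14., -13., -15., -12.]),
                torch.tensor([-11., -12., -10., -11.]),
                torch.tensor([-13., -12., -14., -12.]))
print(loss.item())
```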
The training infrastructure was split into two components (Fig. 1): AR training used 8x H100 80GB GPUs for
one week, while vocoder training used 8x RTX 4090 GPUs for an additional week. Note that these timelines
exclude the DPO stage.
4.2 Inference
Our inference strategy follows the architecture in Fig. 1. Using fish-tech acceleration methodologies, including
KV-caching [Pope et al. [2023]] and torch.compile, the system achieves real-time factors of approximately 1:5 on
consumer-grade NVIDIA RTX 4060 mobile platforms and 1:15 on high-performance NVIDIA RTX 4090 configurations.
These optimizations significantly reduce inference latency, achieving a first-packet latency of 150 ms.
Furthermore, the system can process information in a streaming fashion, making it easy to integrate with modern
AI tools and deploy in diverse scenarios.
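As an illustration of how these figures can be reproduced, the sketch below measures first-packet latency and real-time factor for a streaming synthesizer; synthesize_streaming is a hypothetical interface standing in for the actual streaming API, and the torch.compile call is optional.

```python
import time
import torch

def measure_streaming(model, text: str, sample_rate: int = 44100):
    """Measure first-packet latency and real-time factor for a streaming TTS model.

    `model.synthesize_streaming` is a hypothetical generator yielding 1-D audio
    chunks; substitute the actual streaming API of the system under test.
    """
    model = torch.compile(model)          # optional graph compilation of the forward pass
    start = time.perf_counter()
    first_packet, total_samples = None, 0
    for chunk in model.synthesize_streaming(text):
        if first_packet is None:
            first_packet = time.perf_counter() - start
        total_samples += chunk.numel()
    wall = time.perf_counter() - start
    audio_seconds = total_samples / sample_rate
    print(f"first-packet latency: {first_packet * 1000:.0f} ms, "
          f"RTF = {wall / audio_seconds:.3f} (1:{audio_seconds / wall:.1f})")
```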
5 Dataset
Our training data includes a large collection of speech samples from both public sources and our own data collection
process. The dataset contains about 720,000 hours of speech across different languages, with 300,000 hours each of
English and Mandarin Chinese as the main components. We also included 20,000 hours each of other language families:
Germanic (German), Romance (French, Italian), East Asian (Japanese, Korean), and Semitic (Arabic).
We carefully balanced the data across languages to help the model learn multiple languages at once. This approach
helps the model perform well when generating mixed-language content. The large size and variety of our dataset
significantly improve the model's ability to handle multiple languages naturally.
6 Experimental Evaluation
We conducted an evaluation of the speaker cloning task to assess the effect of our architecture compared to the
baseline models.¹ The evaluation methodology includes both objective and subjective metrics: word error rate (WER)
for intelligibility assessment, speaker embedding similarity for voice-cloning fidelity assessment, and mean
opinion score (MOS) for perceptual quality quantification. This evaluation framework aims to assess the model's ability
to preserve speaker identity while maintaining high-fidelity speech synthesis.
Table 1: Word Error Rate (WER) Results for Voice Cloning Tasks
Analysis of Table 1 shows that our model achieves a WER of 6.89% in voice cloning tasks, which is not only much
lower than that of the baseline models but also lower than that of the ground truth recordings (9.22%). This result provides
strong evidence for our model's ability in voice cloning scenarios. The gap between our model and competing models
(ranging from 11.92% to 22.20%) underscores the improved synthesis stability and content fidelity of our methodology.
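Per the evaluation footnotes, transcriptions for WER were produced with the OpenAI Whisper-medium model. A minimal sketch of such a WER computation using the openai-whisper and jiwer packages follows; the file names and lower-casing normalization are assumptions.

```python
import whisper                    # openai-whisper package
from jiwer import wer

def clone_wer(pairs):
    """Compute corpus-level WER for (reference_text, audio_path) pairs
    transcribed with the Whisper-medium ASR model."""
    model = whisper.load_model("medium")
    refs, hyps = [], []
    for ref_text, audio_path in pairs:
        hyps.append(model.transcribe(audio_path)["text"].strip().lower())
        refs.append(ref_text.strip().lower())
    return wer(refs, hyps)

# Hypothetical usage on two synthesized utterances.
print(clone_wer([("the birch canoe slid on the smooth planks", "sample_001.wav"),
                 ("glue the sheet to the dark blue background", "sample_002.wav")]))
```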
Table 2: Speaker similarity scores for different models, including ground truth.
Table 2 shows the effect of our typo-codebook strategy on speaker similarity metrics. Our fish-speech model achieves
similarity scores of 0.914 and 0.762 on Resemblyzer and SpeechBrain, respectively, which are remarkably close to the
ground truth performance (0.921 and 0.770). The gap of only 0.76% from the ground truth in Resemblyzer and 1.04%
in SpeechBrain evaluations shows the superior ability of our model to capture natural speech characteristics. The results
strongly suggest that our typo-codebook architecture enables a more comprehensive capture of acoustic states, leading
to improved timbral fidelity of synthesized speech. Our approach significantly outperforms baseline models such as
F5-TTS (0.905 and 0.787) and rechoo (0.887 and 0.636). The consistent performance in both evaluation frameworks
proves the validity of our method in preserving speaker features, which is crucial for high-quality text-to-speech
synthesis and agent tasks.
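The two similarity columns in Table 2 correspond to Resemblyzer embeddings and a SpeechBrain speaker-verification model. A hedged sketch of both measurements is given below; the specific SpeechBrain checkpoint (spkrec-ecapa-voxceleb) and the cosine-similarity scoring for Resemblyzer are assumptions about the exact protocol.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
from speechbrain.inference.speaker import SpeakerRecognition

def resemblyzer_similarity(ref_wav: str, clone_wav: str) -> float:
    """Cosine similarity between Resemblyzer d-vectors (embeddings are L2-normalized)."""
    encoder = VoiceEncoder()
    ref = encoder.embed_utterance(preprocess_wav(ref_wav))
    clone = encoder.embed_utterance(preprocess_wav(clone_wav))
    return float(np.dot(ref, clone))

def speechbrain_similarity(ref_wav: str, clone_wav: str) -> float:
    """Verification score from a pretrained ECAPA-TDNN model (assumed checkpoint)."""
    verifier = SpeakerRecognition.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        savedir="pretrained_models/spkrec-ecapa-voxceleb")
    score, _ = verifier.verify_files(ref_wav, clone_wav)
    return float(score)

print(resemblyzer_similarity("reference.wav", "cloned.wav"))
print(speechbrain_similarity("reference.wav", "cloned.wav"))
```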
Table 3: Five-scale Mean Opinion Score (MOS) Ratings of Cloned Voice Quality
To evaluate the perceptual quality of synthesized audio, we conducted a comprehensive Mean Opinion Score (MOS)
listening test with naive listeners who had no prior experience with audio processing. The evaluation followed a
double-blind, randomized protocol to ensure unbiased ratings. The results show that Fish-Speech achieved significantly
higher subjective scores than the other baseline models (p < 0.05), demonstrating superior performance in terms
of speech naturalness and speaker similarity. This result on human perception metrics strongly suggests that
Fish-Speech better captures and reproduces the natural characteristics of human speech, especially in the context of
voice cloning tasks.
7 Conclusion
Our research represents a significant advance in the field of text-to-speech (TTS) by introducing a novel multilingual
and multi-emotional stabilization solution. The core innovation lies in our development of a typo-codebook vocoder
integrated with a dual autoregressive (dual-AR) generation architecture. This architectural combination shows stability
in the synthesis process while preserving acoustic features within the generated speech. Furthermore, our work utilizes a
non-grapheme-to-phoneme (non-G2P) structure, an approach that effectively addresses limitations inherent in traditional
phoneme-based systems and provides a robust foundation for cross-lingual and emotionally diverse TTS applications,
particularly in the context of AI agent interactions.

Footnotes:
1. For experimental validation, we constrained our analysis to monolingual voice cloning scenarios, excluding cross-lingual
synthesis tasks. The evaluation corpus comprised 10 distinct speaker identities (covering different languages), with 30
synthesized utterances per speaker, yielding an evaluation set of 300 samples.
2. Results were obtained using the OpenAI Whisper-medium ASR model for transcription evaluation.
3. Experiments were conducted using the SpeechBrain framework [Ravanelli et al. [2024]], version 1.0.1, as the underlying
speech processing toolkit.
8 Future Work
Building on these foundations, we propose several directions for future research. We plan to improve the performance
of our model by integrating reinforcement learning techniques, with a focus on cross-lingual generalization and
emotional stability. We are also developing the Fish Agent application, an end-to-end language model based on our
Fish-Speech framework. A preliminary demonstration of this system is currently available at fish.audio/demo/live. We
remain committed to the open source community and will continue to maintain and extend our codebase to provide
broader access to these technologies for researchers and developers.
References
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming
Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint
arXiv:2301.02111, 2023.
Jaehyeon Kim, Jungil Kong, and Juhee Son. Conditional variational autoencoder with adversarial learning for end-to-end
text-to-speech. In International Conference on Machine Learning, pages 5530–5540. PMLR, 2021.
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality
end-to-end text to speech. arXiv preprint arXiv:2006.04558, 2020.
Dennis H Klatt. Review of text-to-speech conversion for english. The Journal of the Acoustical Society of America, 82
(3):737–793, 1987.
Edresson Casanova, Julian Weber, Christopher D Shulby, Arnaldo Candido Junior, Eren Gölge, and Moacir A Ponti.
Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone. In International Conference
on Machine Learning, pages 2709–2720. PMLR, 2022.
Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao,
Jiang Bian, Xixin Wu, et al. Uniaudio: An audio foundation model toward universal audio generation. arXiv preprint
arXiv:2310.00704, 2023.
Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma,
et al. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens.
arXiv preprint arXiv:2407.05407, 2024.
Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter. Matcha-tts: A fast tts architecture with
conditional flow matching. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 11341–11345. IEEE, 2024.
James Betker. Better speech synthesis through scaling. arXiv preprint arXiv:2305.07243, 2023.
Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made
simple. arXiv preprint arXiv:2309.15505, 2023.
Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu. A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561,
2021.
Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Almost unsupervised text to speech and automatic
speech recognition. In International conference on machine learning, pages 5410–5419. PMLR, 2019.
Tim Capes, Paul Coles, Alistair Conkie, Ladan Golipour, Abie Hadjitarkhani, Qiong Hu, Nancy Huddleston, Melvyn
Hunt, Jiangchuan Li, Matthias Neeracher, et al. Siri on-device deep learning-guided unit selection text-to-speech
system. In Interspeech, pages 4011–4015, 2017.
Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbren-
ner, Andrew Senior, Koray Kavukcuoglu, et al. Wavenet: A generative model for raw audio. arXiv preprint
arXiv:1609.03499, 12, 2016.
Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg,
Aaron Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In International Conference
on Machine Learning, pages 2410–2419. PMLR, 2018.
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating
gradients for waveform generation. arXiv preprint arXiv:2009.00713, 2020.
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high
fidelity speech synthesis. Advances in neural information processing systems, 33:17022–17033, 2020.
Shijia Liao, Shiyi Lan, and Arun George Zachariah. Eva-gan: Enhanced various audio generation via scalable generative
adversarial networks. arXiv preprint arXiv:2402.00892, 2024.
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information
processing systems, 30, 2017.
Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end
neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv
preprint arXiv:2210.13438, 2022.
Yoach Lacombe, Vaibhav Srivastav, and Sanchit Gandhi. Parler-TTS. https://github.com/huggingface/parler-tts, 2024.
Wenliang Zhao, Xumin Yu, and Zengyi Qin. Melotts: High-quality multi-lingual multi-accent text-to-speech, 2023.
URL https://github.com/myshell-ai/MeloTTS.
Yuan Gao, Nobuyuki Morioka, Yu Zhang, and Nanxin Chen. E3 tts: Easy end-to-end diffusion-based text to speech. In
2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023.
Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer,
Reuben Morais, Samuel Olayemi, et al. Xtts: a massively multilingual zero-shot text-to-speech model. arXiv preprint
arXiv:2406.04904, 2024.
Zhaoyu Liu and Brian Mak. Cross-lingual multi-speaker text-to-speech synthesis for voice cloning without using
parallel corpus for unseen speakers. arXiv preprint arXiv:1911.11601, 2019.
Tomáš Nekvinda and Ondřej Dušek. One model, many languages: Meta-learning for multilingual text-to-speech. arXiv
preprint arXiv:2008.00768, 2020.
Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, and William Chan. Bytes are all you need: End-to-end multilingual
speech recognition and synthesis with bytes. In ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 5621–5625. IEEE, 2019.
Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio
compression with improved rvqgan. Advances in Neural Information Processing Systems, 36, 2024.
Li-Wei Chen, Shinji Watanabe, and Alexander Rudnicky. A vector quantized approach for text to speech synthesis on
real-world spontaneous speech. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages
12644–12652, 2023.
Xin Wang, Shinji Takaki, Junichi Yamagishi, Simon King, and Keiichi Tokuda. A vector quantized variational
autoencoder (VQ-VAE) autoregressive neural F0 model for statistical parametric speech synthesis. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 28:157–170, 2019.
A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. Attention is all you need in
speech separation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 21–25. IEEE, 2021.
Andrew G Howard. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint
arXiv:1704.04861, 2017.
F Yu. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao,
Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and
Systems, 5:606–624, 2023.
Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang,
Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao,
Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Pierre Champion, Aku Rouhe, Rudolf Braun, Florian Mai,
Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Xuechen Liu, Sangeet Sagar, Jarod Duret, Salima
Mdhaffar, Gaelle Laperriere, Mickael Rouvier, Renato De Mori, and Yannick Esteve. Open-source conversational ai
with SpeechBrain 1.0, 2024. URL https://arxiv.org/abs/2407.00463.
A Training Details
We trained our model on an NVIDIA H100 GPU with the following hyperparameters:
Optimization:
Training Config: