> [!IMPORTANT]
> **License Notice**: This codebase is released under the Apache License, and all model weights are released under the CC-BY-NC-SA-4.0 License. Please refer to the LICENSE file for details.
> [!WARNING]
> **Legal Disclaimer**: We do not assume any responsibility for illegal use of this codebase. Please refer to your local laws regarding the DMCA and other related regulations.
True human-like Text-to-Speech and Voice Cloning
FishAudio-S1 is an expressive text-to-speech (TTS) and voice cloning model developed by Fish Audio, designed to generate speech that sounds natural, realistic, and emotionally rich — not robotic, not flat, and not constrained to studio-style narration.
FishAudio-S1 focuses on how humans actually speak: with emotion, variation, pauses, and intent.
We are excited to announce that we have rebranded to Fish Audio, introducing a revolutionary new series of advanced text-to-speech models that builds upon the foundation of Fish-Speech.
We are proud to release FishAudio-S1 (also known as OpenAudio S1) as the first model in this series, delivering significant improvements in quality, performance, and capabilities.
FishAudio-S1 comes in two versions: FishAudio-S1 and FishAudio-S1-mini. Both models are now available on Fish Audio Playground (for FishAudio-S1) and Hugging Face (for FishAudio-S1-mini).
Visit the Fish Audio website for the live playground and the tech report.
| Model | Size | Availability | Description |
|---|---|---|---|
| FishAudio-S1 | 4B parameters | fish.audio | Full-featured flagship model with maximum quality and stability |
| FishAudio-S1-mini | 0.5B parameters | Hugging Face | Open-source distilled model with core capabilities |
Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).
These are the official documents for Fish Speech; follow the instructions to get started.
We use the Seed-TTS Eval metrics to evaluate model performance. FishAudio-S1 achieves a 0.008 WER and 0.004 CER on English text, significantly better than previous models (English, automatic evaluation based on OpenAI gpt-4o-transcribe; speaker distance measured with Revai/pyannote-wespeaker-voxceleb-resnet34-LM).
| Model | Word Error Rate (WER) | Character Error Rate (CER) | Speaker Distance |
|---|---|---|---|
| S1 | 0.008 | 0.004 | 0.332 |
| S1-mini | 0.011 | 0.005 | 0.380 |
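WER and CER in the table above are standard edit-distance metrics. A minimal sketch of how such figures are computed, using word- and character-level Levenshtein distance normalized by reference length (this is an illustration, not the official Seed-TTS Eval code):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over token sequences,
    # using a single rolling row to keep memory at O(len(hyp)).
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    ref, hyp = reference.replace(" ", ""), hypothesis.replace(" ", "")
    return edit_distance(ref, hyp) / len(ref)
```

Both metrics divide by the reference length, so a WER of 0.008 means roughly 8 word-level errors per 1,000 reference words in the transcribed output.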
FishAudio-S1 has achieved the #1 ranking on TTS-Arena2, a benchmark for text-to-speech evaluation.
FishAudio-S1 generates speech that sounds natural and conversational rather than robotic or overly polished. The model captures subtle variations in timing, emphasis, and prosody, avoiding the “studio recording” effect common in traditional TTS systems.
FishAudio-S1 is the first TTS model to support open-domain, fine-grained emotion control through explicit emotion and tone markers. You can precisely steer how a voice sounds:
- Basic emotions:
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
- Advanced emotions:
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
- Tone markers:
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
- Special audio effects:
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
You can also use onomatopoeia such as `Ha,ha,ha` directly in the text to shape delivery; many other cases are waiting to be explored.
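As the lists above suggest, markers are written inline in the input text, in parentheses, immediately before the span they affect. A minimal sketch of composing and sanity-checking such marked-up input (the marker subset below is taken from the lists above; any synthesis call itself is omitted):

```python
import re

# A small subset of the marker vocabulary listed above, for illustration.
KNOWN_MARKERS = {
    "angry", "sad", "excited", "whispering", "soft tone",
    "laughing", "sighing", "in a hurry tone",
}

def extract_markers(text: str) -> list:
    """Return every (marker) token found in a marked-up TTS input string."""
    return re.findall(r"\(([^)]+)\)", text)

def validate(text: str) -> None:
    """Print a warning for markers outside the known vocabulary."""
    for marker in extract_markers(text):
        if marker not in KNOWN_MARKERS:
            print(f"warning: unknown marker ({marker})")

prompt = "(excited) We finally shipped it! (whispering) Don't tell anyone yet."
print(extract_markers(prompt))  # ['excited', 'whispering']
```

Validating markers up front is optional; unrecognized parenthesized text is simply more likely to be read literally than interpreted as a control signal.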
FishAudio-S1 supports high-quality multilingual text-to-speech without requiring phonemes or language-specific preprocessing.
Languages supporting emotion markers include: English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, and Portuguese.
The list is constantly expanding; check Fish Audio for the latest releases.
FishAudio-S1 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
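Since cloning quality depends heavily on the reference clip, it can help to verify that a sample falls inside the 10–30 second window mentioned above before submitting it. A minimal sketch using only the standard library, assuming the reference is a PCM WAV file (the helper that writes a silent clip exists purely to demonstrate the check):

```python
import wave

def reference_duration_ok(path: str, lo: float = 10.0, hi: float = 30.0) -> bool:
    """Check that a PCM WAV reference clip is within the recommended window."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return lo <= duration <= hi

def write_silent_wav(path: str, seconds: float, rate: int = 8000) -> None:
    """Write a mono 16-bit silent WAV file, used here only for demonstration."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))

write_silent_wav("reference.wav", 15.0)
print(reference_duration_ok("reference.wav"))  # True: a 15 s clip passes
```

For compressed formats (MP3, FLAC, etc.) a decoding library would be needed to measure duration; the window check itself is the same.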
- **Zero-shot & Few-shot TTS:** Input a 10- to 30-second vocal sample to generate high-quality TTS output. For detailed guidelines, see Voice Cloning Best Practices.
- **Multilingual & Cross-lingual Support:** Simply copy and paste multilingual text into the input box; no need to worry about the language. Currently supports English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish.
- **No Phoneme Dependency:** The model has strong generalization capabilities and does not rely on phonemes for TTS. It can handle text in any language script.
- **Highly Accurate:** Achieves a low CER (Character Error Rate) of around 0.4% and WER (Word Error Rate) of around 0.8% on Seed-TTS Eval.
- **Fast:** Accelerated by torch compile, the real-time factor is approximately 1:7 on an Nvidia RTX 4090 GPU.
- **WebUI Inference:** Features an easy-to-use, Gradio-based web UI compatible with Chrome, Firefox, Edge, and other browsers.
- **Deploy-Friendly:** Easily set up an inference server with native support for Linux and Windows (macOS support coming soon), minimizing performance loss.
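The real-time factor quoted in the feature list is generation time relative to the duration of audio produced. A minimal sketch of how such a figure is reported (the numbers below are illustrative placeholders, not benchmark results):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock generation time / duration of synthesized audio.
    An RTF of 1:7 (about 0.143) means roughly 7 seconds of audio are
    produced per second of compute."""
    return generation_seconds / audio_seconds

# Illustrative numbers only: 2 s of compute producing 14 s of audio.
print(round(real_time_factor(2.0, 14.0), 3))  # 0.143
```

Values below 1.0 mean faster-than-real-time synthesis, which is what makes streaming or interactive use practical.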
@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}