> [!IMPORTANT]
> **License Notice**: This codebase is released under the Apache License, and all model weights are released under the CC-BY-NC-SA-4.0 License. Please refer to the LICENSE file for details.
> [!WARNING]
> **Legal Disclaimer**: We do not assume any responsibility for illegal use of this codebase. Please refer to your local laws regarding the DMCA and other related regulations.
True human-like Text-to-Speech and Voice Cloning
FishAudio-S1 is an expressive text-to-speech (TTS) and voice cloning model developed by Fish Audio, designed to generate speech that sounds natural, realistic, and emotionally rich — not robotic, not flat, and not constrained to studio-style narration.
FishAudio-S1 focuses on how humans actually speak: with emotion, variation, pauses, and intent.
We are excited to announce that we have rebranded to Fish Audio, introducing a revolutionary new series of advanced text-to-speech models that builds upon the foundation of Fish-Speech.
We are proud to release FishAudio-S1 (also known as OpenAudio S1) as the first model in this series, delivering significant improvements in quality, performance, and capabilities.
FishAudio-S1 comes in two versions: FishAudio-S1 and FishAudio-S1-mini. Both models are now available on Fish Audio Playground (for FishAudio-S1) and Hugging Face (for FishAudio-S1-mini).
Visit the Fish Audio website for the live playground and the tech report.
| Model | Size | Availability | Description |
|---|---|---|---|
| FishAudio-S1 | 4B parameters | fish.audio | Full-featured flagship model with maximum quality and stability |
| FishAudio-S1-mini | 0.5B parameters | Hugging Face | Open-source distilled model with core capabilities |
Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).
These are the official documents for Fish Speech; follow the instructions to get started.
We use the Seed-TTS Eval metrics to evaluate model performance. FishAudio-S1 achieves a 0.008 WER and 0.004 CER on English text, significantly better than previous models (English, automatic evaluation based on OpenAI gpt-4o-transcribe; speaker distance measured with Revai/pyannote-wespeaker-voxceleb-resnet34-LM).
| Model | Word Error Rate (WER) | Character Error Rate (CER) | Speaker Distance |
|---|---|---|---|
| S1 | 0.008 | 0.004 | 0.332 |
| S1-mini | 0.011 | 0.005 | 0.380 |
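WER and CER in the table above are standard edit-distance metrics. A minimal sketch of how such figures are computed, using word- and character-level Levenshtein distance normalized by reference length (this is an illustration, not the official Seed-TTS Eval code):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over token sequences,
    # using a single rolling row to keep memory at O(len(hyp)).
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    ref, hyp = reference.replace(" ", ""), hypothesis.replace(" ", "")
    return edit_distance(ref, hyp) / len(ref)
```

Both metrics divide by the reference length, so a WER of 0.008 means roughly 8 word-level errors per 1,000 reference words in the transcribed output.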
FishAudio-S1 has achieved the #1 ranking on TTS-Arena2, a benchmark for text-to-speech evaluation.
FishAudio-S1 generates speech that sounds natural and conversational rather than robotic or overly polished. The model captures subtle variations in timing, emphasis, and prosody, avoiding the “studio recording” effect common in traditional TTS systems.
FishAudio-S1 is the first TTS model to support open-domain, fine-grained emotion control through explicit emotion and tone markers. You can precisely steer how a voice sounds:
- Basic emotions:
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
- Advanced emotions:
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
- Tone markers:
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
- Special audio effects:
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
You can also use onomatopoeia such as `Ha,ha,ha` directly in the text to shape delivery; many other cases are waiting to be explored.
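As the lists above suggest, markers are written inline in the input text, in parentheses, immediately before the span they affect. A minimal sketch of composing and sanity-checking such marked-up input (the marker subset below is taken from the lists above; any synthesis call itself is omitted):

```python
import re

# A small subset of the marker vocabulary listed above, for illustration.
KNOWN_MARKERS = {
    "angry", "sad", "excited", "whispering", "soft tone",
    "laughing", "sighing", "in a hurry tone",
}

def extract_markers(text: str) -> list:
    """Return every (marker) token found in a marked-up TTS input string."""
    return re.findall(r"\(([^)]+)\)", text)

def validate(text: str) -> None:
    """Print a warning for markers outside the known vocabulary."""
    for marker in extract_markers(text):
        if marker not in KNOWN_MARKERS:
            print(f"warning: unknown marker ({marker})")

prompt = "(excited) We finally shipped it! (whispering) Don't tell anyone yet."
print(extract_markers(prompt))  # ['excited', 'whispering']
```

Validating markers up front is optional; unrecognized parenthesized text is simply more likely to be read literally than interpreted as a control signal.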
FishAudio-S1 supports high-quality multilingual text-to-speech without requiring phonemes or language-specific preprocessing.
Languages supporting emotion markers include: English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, and Portuguese.
The list is constantly expanding; check Fish Audio for the latest releases.
FishAudio-S1 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
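Since cloning quality depends heavily on the reference clip, it can help to verify that a sample falls inside the 10–30 second window mentioned above before submitting it. A minimal sketch using only the standard library, assuming the reference is a PCM WAV file (the helper that writes a silent clip exists purely to demonstrate the check):

```python
import wave

def reference_duration_ok(path: str, lo: float = 10.0, hi: float = 30.0) -> bool:
    """Check that a PCM WAV reference clip is within the recommended window."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return lo <= duration <= hi

def write_silent_wav(path: str, seconds: float, rate: int = 8000) -> None:
    """Write a mono 16-bit silent WAV file, used here only for demonstration."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))

write_silent_wav("reference.wav", 15.0)
print(reference_duration_ok("reference.wav"))  # True: a 15 s clip passes
```

For compressed formats (MP3, FLAC, etc.) a decoding library would be needed to measure duration; the window check itself is the same.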
- **Zero-shot & Few-shot TTS:** Input a 10- to 30-second vocal sample to generate high-quality TTS output. For detailed guidelines, see Voice Cloning Best Practices.
- **Multilingual & Cross-lingual Support:** Simply copy and paste multilingual text into the input box; no need to worry about the language. Currently supports English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish.
- **No Phoneme Dependency:** The model has strong generalization capabilities and does not rely on phonemes for TTS. It can handle text in any language script.
- **Highly Accurate:** Achieves a low CER (Character Error Rate) of around 0.4% and WER (Word Error Rate) of around 0.8% on Seed-TTS Eval.
- **Fast:** Accelerated by torch compile, the real-time factor is approximately 1:7 on an Nvidia RTX 4090 GPU.
- **WebUI Inference:** Features an easy-to-use, Gradio-based web UI compatible with Chrome, Firefox, Edge, and other browsers.
- **Deploy-Friendly:** Easily set up an inference server with native support for Linux and Windows (macOS support coming soon), minimizing performance loss.
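The real-time factor quoted in the feature list is generation time relative to the duration of audio produced. A minimal sketch of how such a figure is reported (the numbers below are illustrative placeholders, not benchmark results):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock generation time / duration of synthesized audio.
    An RTF of 1:7 (about 0.143) means roughly 7 seconds of audio are
    produced per second of compute."""
    return generation_seconds / audio_seconds

# Illustrative numbers only: 2 s of compute producing 14 s of audio.
print(round(real_time_factor(2.0, 14.0), 3))  # 0.143
```

Values below 1.0 mean faster-than-real-time synthesis, which is what makes streaming or interactive use practical.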
@misc{fish-speech-v1.4,
title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
year={2024},
eprint={2411.01156},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.01156},
}