fishaudio/fish-speech

Important

License Notice
This codebase is released under the Apache License, and all model weights are released under the CC-BY-NC-SA-4.0 license. Please refer to LICENSE for more details.

Warning

Legal Disclaimer
We do not assume any responsibility for illegal use of this codebase. Please consult your local laws regarding the DMCA and other related regulations.

FishAudio-S1

True human-like Text-to-Speech and Voice Cloning

FishAudio-S1 is an expressive text-to-speech (TTS) and voice cloning model developed by Fish Audio, designed to generate speech that sounds natural, realistic, and emotionally rich — not robotic, not flat, and not constrained to studio-style narration.

FishAudio-S1 focuses on how humans actually speak: with emotion, variation, pauses, and intent.

Announcement 🎉

We are excited to announce that we have rebranded to Fish Audio — introducing a revolutionary new series of advanced Text-to-Speech models that builds upon the foundation of Fish-Speech.

We are proud to release FishAudio-S1 (also known as OpenAudio S1) as the first model in this series, delivering significant improvements in quality, performance, and capabilities.

FishAudio-S1 comes in two versions: FishAudio-S1 and FishAudio-S1-mini. Both models are now available on Fish Audio Playground (for FishAudio-S1) and Hugging Face (for FishAudio-S1-mini).

Visit the Fish Audio website for the live playground and the tech report.

Model Variants

| Model | Size | Availability | Description |
|---|---|---|---|
| FishAudio-S1 | 4B parameters | fish.audio | Full-featured flagship model with maximum quality and stability |
| FishAudio-S1-mini | 0.5B parameters | Hugging Face | Open-source distilled model with core capabilities |

Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).
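
If you want to try the open-weights S1-mini locally, a minimal sketch of fetching the checkpoint with `huggingface_hub` is shown below. The repo id used here is an assumption for illustration; confirm the exact id on the Hugging Face model page.

```python
# Sketch: download the S1-mini weights from Hugging Face.
# The repo id below is an assumption -- verify it on the model page.
from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download(
    repo_id="fishaudio/openaudio-s1-mini",        # assumed repo id
    local_dir="checkpoints/openaudio-s1-mini",
)
print(f"Weights downloaded to: {checkpoint_dir}")
```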

Start Here

Here are the official documents for Fish Speech; follow the instructions to get started easily.

Highlights

Excellent TTS quality

We use Seed TTS Eval metrics to evaluate model performance. FishAudio S1 achieves a 0.008 WER and 0.004 CER on English text, significantly better than previous models (English, automatic evaluation, transcription by OpenAI gpt-4o-transcribe, speaker distance measured with Revai/pyannote-wespeaker-voxceleb-resnet34-LM).

| Model | Word Error Rate (WER) | Character Error Rate (CER) | Speaker Distance |
|---|---|---|---|
| S1 | 0.008 | 0.004 | 0.332 |
| S1-mini | 0.011 | 0.005 | 0.380 |
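
For reference, WER and CER are edit-distance metrics computed between the intended text and an ASR transcription of the synthesized audio. The sketch below illustrates how they are typically computed with the `jiwer` library; it is illustrative only and not the official Seed-TTS-Eval scoring code.

```python
# Sketch: word/character error rate between a reference transcript and an
# ASR transcription of the synthesized audio. Illustrative only -- not the
# official Seed-TTS-Eval pipeline.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"  # hypothetical ASR output

wer = jiwer.wer(reference, hypothesis)  # word error rate
cer = jiwer.cer(reference, hypothesis)  # character error rate
print(f"WER={wer:.3f}  CER={cer:.3f}")
```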

Best Model in TTS-Arena2 🏆

FishAudio S1 has achieved the #1 ranking on TTS-Arena2, the benchmark for text-to-speech evaluation:

TTS-Arena2 Ranking

True Human-Like Speech

FishAudio-S1 generates speech that sounds natural and conversational rather than robotic or overly polished. The model captures subtle variations in timing, emphasis, and prosody, avoiding the “studio recording” effect common in traditional TTS systems.

Emotion Control and Expressiveness

FishAudio S1 is the first TTS model to support open-domain, fine-grained emotion control through explicit emotion and tone markers. You can now precisely steer how a voice sounds:

  • Basic emotions:
(angry) (sad) (excited) (surprised) (satisfied) (delighted) 
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
  • Advanced emotions:
(disdainful) (unhappy) (anxious) (hysterical) (indifferent) 
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
  • Tone markers:
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
  • Special audio effects:
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)

You can also write onomatopoeia such as Ha,ha,ha directly in the text to control laughter; there are many other cases waiting to be explored.
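
Markers are written inline in the input text, immediately before the span they should affect. A minimal sketch is below; passing the text to an actual inference entry point (WebUI text box, API server, or local script) is left as a placeholder rather than a specific fish-speech API call.

```python
# Sketch: emotion and tone markers are embedded directly in the input text,
# immediately before the span they should affect.
text = (
    "(excited) We finally shipped the release! "
    "(whispering) Keep it quiet until the announcement... "
    "(sighing) (relaxed) Now I can finally take a break. Ha,ha,ha."
)

# Pass `text` to whichever inference entry point you use (WebUI text box,
# API server request, or local inference script); the markers travel with
# the text, and no extra parameters are needed.
print(text)
```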

Multilingual Support

FishAudio-S1 supports high-quality multilingual text-to-speech without requiring phonemes or language-specific preprocessing.

Languages supporting emotion markers include: English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, and Portuguese.

The list is constantly expanding; check Fish Audio for the latest releases.

Rapid Voice Cloning

FishAudio-S1 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
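
The sketch below illustrates the general shape of a cloning request against a locally hosted inference server: a short reference clip plus its transcript, sent alongside the new text to synthesize. The endpoint path and JSON field names are assumptions for illustration only; consult the official API documentation for the actual schema.

```python
# Sketch: voice cloning request against a locally hosted inference server.
# The endpoint path and JSON field names below are assumptions for
# illustration -- consult the official API documentation for the real schema.
import base64
import requests

with open("reference.wav", "rb") as f:            # 10-30 s clip of the target voice
    reference_audio = base64.b64encode(f.read()).decode()

payload = {
    "text": "Hello! This is my cloned voice speaking a brand new sentence.",
    "reference_audio": reference_audio,                       # assumed field name
    "reference_text": "Transcript of the reference clip.",    # assumed field name
}

resp = requests.post("http://127.0.0.1:8080/v1/tts", json=payload)  # assumed endpoint
resp.raise_for_status()
with open("cloned_output.wav", "wb") as f:
    f.write(resp.content)
```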

Features

  1. Zero-shot & Few-shot TTS: Input a 10 to 30-second vocal sample to generate high-quality TTS output. For detailed guidelines, see Voice Cloning Best Practices.

  2. Multilingual & Cross-lingual Support: Simply copy and paste multilingual text into the input box—no need to worry about the language. Currently supports English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish.

  3. No Phoneme Dependency: The model has strong generalization capabilities and does not rely on phonemes for TTS. It can handle text in any language script.

  4. Highly Accurate: Achieves a low CER (Character Error Rate) of around 0.4% and WER (Word Error Rate) of around 0.8% for Seed-TTS Eval.

  5. Fast: Accelerated by torch.compile, the real-time factor is approximately 1:7 on an Nvidia RTX 4090 GPU (see the RTF measurement sketch after this list).

  6. WebUI Inference: Features an easy-to-use, Gradio-based web UI compatible with Chrome, Firefox, Edge, and other browsers.

  7. Deploy-Friendly: Easily set up an inference server with native support for Linux and Windows (macOS support coming soon), minimizing performance loss.
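
As referenced in item 5 above, the real-time factor (RTF) is the ratio of wall-clock generation time to the duration of the audio produced; a factor of roughly 1:7 means one second of compute yields about seven seconds of audio. A minimal sketch of measuring it is below; `generate_audio()` and the sample rate are hypothetical placeholders, not parts of the fish-speech API.

```python
# Sketch: measuring the real-time factor (RTF) of a TTS run.
# `generate_audio` and SAMPLE_RATE are hypothetical placeholders for your
# actual inference call and the model's output sample rate.
import time

SAMPLE_RATE = 44_100  # placeholder sample rate (Hz)

def generate_audio(text: str) -> list[float]:
    """Stand-in for the real inference call; returns raw audio samples."""
    return [0.0] * SAMPLE_RATE * 5  # pretend we produced 5 s of audio

start = time.perf_counter()
samples = generate_audio("Measuring how fast synthesis runs.")
elapsed = time.perf_counter() - start

audio_seconds = len(samples) / SAMPLE_RATE
rtf = elapsed / audio_seconds
print(f"Generated {audio_seconds:.1f}s of audio in {elapsed:.2f}s (RTF={rtf:.3f})")
```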

Media & Demos

Social Media

Latest Demo on X

Interactive Demos

Try FishAudio S1 Try S1 Mini

Video Showcases

FishAudio S1 Video

Credits

Tech Report (V1.4)

@misc{fish-speech-v1.4,
      title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
      author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
      year={2024},
      eprint={2411.01156},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2411.01156},
}