We release the training and inference code for ConversationTTS, along with the first checkpoint, which was trained for 1.5 epochs on about 200,000 hours of speech data.
```bash
wget https://huggingface.co/AudioFoundation/SpeechFoundation/resolve/main/ckpt1.checkpoint
```
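Once downloaded, you can sanity-check the file before running inference. A minimal sketch, assuming it is a standard PyTorch checkpoint (the inference code handles the actual loading):

```python
# Minimal sketch: inspect the downloaded checkpoint, assuming it is a
# standard PyTorch checkpoint file. The inference code performs the
# real loading; this only verifies the download looks sane.
import torch

ckpt = torch.load("ckpt1.checkpoint", map_location="cpu")
print(type(ckpt))             # typically a dict of tensors / sub-dicts
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g. model weights, training step, etc.
```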
We use large-scale TTS corpora such as Emilia-Yodas, WenetSpeech, MLS, and People's Speech, plus a large collection of podcast data covering English, Chinese, and Cantonese. Distinct speaker labels (e.g., [1], [2]) indicate different speakers in a conversation. This first version is trained on only about 200,000 hours of data; we will release updated checkpoints trained on more than 500,000 hours.
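For illustration, here is a sketch of how a two-speaker conversation might be serialized with such labels. The turn contents are invented and the exact separators used in our data pipeline may differ:

```python
# Hypothetical illustration of speaker-labeled conversation text.
# The exact serialization in the training pipeline may differ.
turns = [
    (1, "Hey, did you listen to the new episode?"),
    (2, "I did! The interview section was great."),
    (1, "Same here. I liked the closing discussion."),
]

# Prefix each turn with its speaker label, e.g. "[1] ..." / "[2] ...".
text = " ".join(f"[{speaker}] {utterance}" for speaker, utterance in turns)
print(text)
```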
Install and run ConversationTTS locally.
- 💿 Installation & Usage: 📄 Instructions
Please refer to the following documents to prepare the data, train the model, and evaluate its performance.
- Data Preparation (a hedged manifest sketch follows this list)
- Training
- Evaluation (in development)
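As a loose illustration of data preparation, a JSONL manifest pairing audio files with speaker-labeled transcripts might look like the sketch below. The field names and file paths here are hypothetical; follow the Data Preparation document for the actual schema expected by the training code.

```python
# Hypothetical JSONL manifest sketch; see the Data Preparation doc for
# the actual schema expected by the training code.
import json

examples = [
    {
        "audio": "data/podcast_000001.wav",  # path to the waveform
        "text": "[1] Welcome back to the show. [2] Thanks for having me.",
        "duration": 7.4,                     # seconds
    },
]

with open("train_manifest.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```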
- [Dongchao Yang]
- [Dading Cong]
- [Jiankun Zhao]
- [Yuanyuan Wang]
- [Helin Wang]
If you find this work useful, please consider contributing to this repo and citing this work:
All datasets, listening samples, source code, pretrained checkpoints, and the evaluation toolkit are licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
See the LICENSE file for details.
This implementation builds on UniAudio, CSM, Moshi, and RSTNet. We appreciate their awesome work.
If you find this repo helpful or interesting, consider dropping a ⭐ — it really helps and means a lot!