CrispTTS is a versatile command-line Text-to-Speech (TTS) tool designed for synthesizing German speech using a variety of popular local and cloud-based TTS engines. Its modular architecture allows for easy maintenance and straightforward addition of new TTS handlers.
| Project | Role |
|---|---|
| Susurrus | Python GUI + CLI — 30+ ASR, 12 TTS, translation |
| CrispASR | C++ ASR/TTS engine — 24+ backends, ggml inference |
| CrispTTS | This repo — Python TTS CLI with 28+ handlers |
| CrisperWeaver | Flutter transcription app — desktop + mobile |
NOTE: This project is experimental and a work in progress. Some Python-only models may be broken due to dependency conflicts. The CrispASR-based handlers (crispasr_*) are the most reliable: they use native C++ inference with no Python ML dependencies.
- 28+ TTS Engine Support:
  - CrispASR native C++ engines (7 backends, auto-download, no Python ML deps):
    - Kokoro (multilingual, Apache 2.0)
    - Orpheus + Kartoffel-Orpheus DE (19 German speakers, llama3.2 license)
    - Qwen3-TTS (voice cloning + voice design, Apache 2.0)
    - Chatterbox (CFM synthesis, MIT)
    - VibeVoice TTS (voice cloning)
    - IndexTTS (zero-shot cloning, Apache 2.0)
    - VoxCPM2 (48 kHz, 30 languages, Apache 2.0)
  - Microsoft Edge TTS (cloud-based, requires edge-tts)
  - Coqui TTS (XTTS v2, VITS, etc.)
  - Piper (local ONNX, requires piper-tts)
  - Orpheus GGUF (local, requires llama-cpp-python)
  - Orpheus via LM Studio / Ollama API
  - OuteTTS (LlamaCPP or HF backend)
  - SpeechT5 (German fine-tune via HF Transformers)
  - FastPitch (German via NeMo)
  - mlx-audio (Bark, Kokoro, Dia; Apple Silicon)
  - LLaSA (hybrid, German, multilingual variants)
  - F5-TTS (MLX/PyTorch)
  - Kokoro ONNX (lightweight)
  - TTS.cpp (GGUF models)
  - Zonos (acoustic conditioning)
  - Chatterbox Python (Kartoffelbox)
- AI Audio Watermarking & Provenance (EU AI Act compliant):
  - Spread-spectrum watermark (always on, imperceptible, ~-46 dB)
  - AudioSeal neural watermark (optional upgrade via pip install audioseal or CrispASR GGUF)
  - WAV LIST/INFO and MP3 ID3v2 metadata marking audio as AI-generated
  - C2PA content credentials signing (optional, pip install c2pa-python)
  - Voice-cloning consent gate (--i-have-rights)
- CrispASR Integration:
  - --verify: ASR roundtrip verification of TTS output quality
  - --translate: Pre-synthesis translation (EN→DE via m2m100/MadLad)
  - --speech-speed: Rate multiplier (maps to CrispASR --pace)
  - --trim-silence: Remove leading/trailing silence from output
  - --tts-steps: Diffusion model inference steps (quality vs speed)
  - --tts-language: Override language for multilingual models
  - --pitch-shift: Pitch shift in Hz for FastPitch backends
  - --instruct: Natural-language voice descriptions (Qwen3-TTS VoiceDesign)
  - --stream: Stream audio playback during synthesis
  - --output-sample-rate: Resample output to target sample rate
- OpenAI-Compatible API Server (--server):
  - POST /v1/audio/speech: drop-in replacement for OpenAI TTS
  - GET /v1/audio/models: list all configured models
- Text Input Flexibility: Synthesize from CLI, .txt, .md, .html, .pdf, .epub
- Smart Text Chunking: Automatic sentence-boundary splitting for long texts
- Customizable Output: Save audio to .wav, .mp3, .flac, or .opus
- Direct Playback: Play synthesized audio immediately
- Voice Selection: Override default voices/speakers for most models
- Model Parameter Tuning: JSON-formatted parameters for fine-tuning
- Comprehensive Testing:
  - --test-all: Test all models with default voices
  - --test-all-speakers: Test all models with all configured voices
  - 160+ unit and live tests
- Modular Design: config.py + utils.py + handlers/ + main.py
- Logging: Configurable logging levels
- Automatic Patching: Runtime monkeypatches for library compatibility
crisptts_project/
├── main.py # Main CLI application script
├── config.py # Model configurations and global constants
├── utils.py # Shared utility functions and classes
├── watermark.py # Audio watermarking, metadata, consent gate, C2PA
├── chunking.py # Smart sentence-boundary text splitting
├── server.py # OpenAI-compatible HTTP API server
├── decoder.py # User-provided decoder for Orpheus models (if used)
├── handlers/ # Package for individual TTS engine handlers
│ ├── __init__.py # Makes 'handlers' a package, exports handler functions
│ ├── edge_handler.py # Edge TTS cloud service handler
│ ├── piper_handler.py # Piper TTS (ONNX) handler
│ ├── orpheus_gguf_handler.py # Local Orpheus GGUF model handler
│ ├── orpheus_api_handler.py # Handlers for LM Studio and Ollama API
│ ├── outetts_handler.py # OuteTTS model handler
│ ├── speecht5_handler.py # SpeechT5 model handler
│ ├── nemo_handler.py # NeMo FastPitch handler
│ ├── coqui_tts_handler.py # Coqui TTS handler (for XTTS, VITS etc.)
│ ├── kartoffel_handler.py # Orpheus "Kartoffel" Transformers handler
│ ├── kokoro_onnx_handler.py # Kokoro (multilingual but no German) ONNX handler
│ ├── llasa_hybrid_handler.py # LLaSA Hybrid handler
│ ├── tts_cpp_handler.py # TTS.cpp handler supporting GGUF models
│ └── mlx_audio_handler.py # Handler for mlx-audio library (e.g., Bark)
├── requirements.txt # Python package dependencies
└── README.md # This documentation file
- Python and pip for installing packages
- For mlx-audio-based models: an Apple Silicon Mac is required for GPU acceleration
- For TTS.cpp: a C++ compiler and CMake are required to build the engine
- Clone/Download Files:
  git clone https://github.com/CrispStrobe/CrispTTS
- Create a Virtual Environment (Recommended):
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install Dependencies: A requirements.txt file is provided. Install the necessary packages:
  pip install -r requirements.txt
  Note: Some libraries like PyTorch, NeMo, LlamaCPP, and mlx-audio can have specific installation needs depending on your OS and hardware (e.g., CUDA for Nvidia GPUs, Metal for Apple Silicon). Please refer to their official documentation if you encounter issues. Ensure you have ffmpeg installed and available in your system's PATH if you encounter issues with audio file format conversions or direct playback (some underlying libraries might need it).
- Install and Build Engine-Specific Dependencies (required for certain handlers):
  For TTS.cpp: Clone and build the TTS.cpp project separately.
  git clone https://github.com/mmwillet/TTS.cpp.git
  cd TTS.cpp
  cmake -B build
  cmake --build build --config Release
  cd ..
  For kokoro-onnx: Install the Python package and download model files.
  pip install kokoro-onnx
- Environment Variables (Optional but Recommended):
  - HF_TOKEN: If you need to download models from gated or private Hugging Face repositories, set this environment variable with your Hugging Face API token:
    export HF_TOKEN="your_huggingface_token_here"
  - GGML_METAL_NDEBUG=1: Set automatically by main.py to reduce verbose Metal logs from llama-cpp-python on macOS.
The config.py file is central to defining which TTS models are available and their default settings.
- GERMAN_TTS_MODELS Dictionary: This is the primary configuration structure. Each key is a unique MODEL_ID used in the CLI. The value is a dictionary containing:
  - "handler_function_key" (Optional, defaults to MODEL_ID): The key used to look up the synthesis function in handlers.ALL_HANDLERS
  - Specific parameters for that model (e.g., model_repo_id, default_voice_id, API URLs, onnx_repo_id, etc.)
  - "notes": A brief description of the model
- mlx-audio Bark Configuration Example: To use the mlx-audio Bark model, your configuration might look like this, enabling the dual-source strategy for voice prompts (main model from mlx-community, voice NPYs from suno):
  "mlx_audio_bark_de": {
      "handler_function_key": "mlx_audio",
      "mlx_model_path": "mlx-community/bark-small",  # Main MLX model
      # Voice prompts will be fetched by the patched handler from "suno/bark-small"
      "default_voice_id": "v2/de_speaker_3",
      "available_voices": ["v2/de_speaker_0", "v2/de_speaker_1", "v2/de_speaker_3", "..."],
      "lang_code": "de",
      "sample_rate": 24000,
      "notes": "mlx-audio (Bark) with main model from mlx-community/bark-small and voices from suno/bark-small (via patch)."
  },
- Global Constants: API URLs, default voice names, and sample rates are also defined here
- Adding/Modifying Models: To add a new variation of an existing engine or a completely new engine (after creating its handler), add a new entry to GERMAN_TTS_MODELS
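As a rough illustration of the structure described above, the sketch below defines a stand-in dictionary and resolves the handler key for two entries. The model ID `my_engine_de`, the repository name, the voice values, and the `resolve_handler_key` helper are assumptions made for this example, not names taken from config.py. Only the keys `handler_function_key`, `model_repo_id`, `default_voice_id`, and `notes`, and the fallback to the MODEL_ID, follow the description.

```python
# Sketch only: a stand-in for the GERMAN_TTS_MODELS dictionary in config.py.
GERMAN_TTS_MODELS = {
    # Hypothetical entry; "my_engine_de" and all of its values are placeholders.
    "my_engine_de": {
        "handler_function_key": "my_engine",          # optional key
        "model_repo_id": "example-org/my-engine-de",  # placeholder repo ID
        "default_voice_id": "speaker_0",
        "notes": "Example entry for illustration only.",
    },
    # Without "handler_function_key", the MODEL_ID itself is the lookup key.
    "edge": {
        "default_voice_id": "de-DE-ConradNeural",     # illustrative value
        "notes": "Entry that relies on the default handler key.",
    },
}


def resolve_handler_key(model_id: str) -> str:
    """Return the key used to look up the handler (hypothetical helper)."""
    model_config = GERMAN_TTS_MODELS[model_id]
    return model_config.get("handler_function_key", model_id)


print(resolve_handler_key("my_engine_de"))  # my_engine
print(resolve_handler_key("edge"))          # edge
```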
All interactions are done through main.py from your project's root directory.
python main.py [ACTION_FLAG | --model-id <MODEL_ID> [OPTIONS]]

List all available models:
python main.py --list-models

Get information about voices for a specific model:
python main.py --voice-info edge
python main.py --voice-info mlx_audio_bark_de

Synthesize text using a specific model:
python main.py --model-id edge --input-text "Hallo, wie geht es Ihnen heute?" --output-file hallo_edge.mp3 --play-direct

Synthesize text using mlx-audio Bark (German):
python main.py --model-id mlx_audio_bark_de --input-text "Das ist ein Test mit Bark auf Apple Silicon." --output-file bark_test_de.wav

Use a specific German voice (if supported by the model):
python main.py --model-id edge --input-text "Ein Test mit einer anderen Stimme." --german-voice-id de-DE-ConradNeural --output-file conrad_test.mp3
Check --voice-info <MODEL_ID> for available voice IDs/formats for that model.

Synthesize text from a file:
python main.py --model-id piper_local --input-file ./my_text.txt --output-file piper_output.wav
Supported input file types: .txt, .md, .html, .pdf, .epub.

Use model-specific parameters (as a JSON string):
python main.py --model-id orpheus_gguf --input-text "Ein Test." --model-params "{\"temperature\": 0.8, \"n_gpu_layers\": -1}" --output-file orpheus_custom.wav

Test all configured models with default voices:
python main.py --input-text "Dies ist ein kurzer Test für alle Modelle." --test-all --output-dir ./test_results

Test all models with all their configured available voices/speakers:
python main.py --input-text "Ein Test für alle Stimmen." --test-all-speakers --output-dir ./test_results_all_speakers

Speech speed and pitch control:
python main.py --model-id crispasr_kokoro --input-text "Schneller sprechen." --speech-speed 1.3 --output-file fast.wav
python main.py --model-id crispasr_kokoro --input-text "Höher." --pitch-shift 50 --output-file high.wav

Silence trimming and resampling:
python main.py --model-id crispasr_kokoro --input-text "Test." --trim-silence --output-sample-rate 16000 --output-file trimmed_16k.wav

VoiceDesign (generate voices from text descriptions):
python main.py --model-id crispasr_qwen3_tts_voicedesign --instruct "A calm elderly man" --input-text "Hallo" --output-file calm.wav

Streaming playback (hear audio while it generates):
python main.py --model-id crispasr_kokoro --input-text "Dies wird sofort abgespielt." --stream

Run as OpenAI-compatible API server:
python main.py --server --server-port 8880
# Then: curl -X POST http://localhost:8880/v1/audio/speech \
#   -H "Content-Type: application/json" \
#   -d '{"model":"crispasr_kokoro","input":"Hallo Welt","voice":"af_heart"}' \
#   --output speech.wav

Change Logging Level (for debugging):
python main.py --model-id edge --input-text "Debug Test." --loglevel DEBUG

Override API URLs (for API-based models like Orpheus LM Studio/Ollama):
python main.py --model-id orpheus_lm_studio --input-text "Hallo API" --lm-studio-api-url http://localhost:5000/v1/completions
python main.py --model-id orpheus_ollama --input-text "Hallo Ollama" --ollama-api-url http://localhost:11223/api/generate --ollama-model-name my-orpheus-ollama-model

Refer to the output of python main.py --list-models for the currently configured models and their notes. The script supports integration with:
- Microsoft Edge TTS
- Piper TTS
- Orpheus GGUF (via llama-cpp-python)
- Orpheus via LM Studio API
- Orpheus via Ollama API
- OuteTTS (LlamaCPP and Hugging Face ONNX backends)
- SpeechT5 (Hugging Face Transformers)
- FastPitch (NeMo / Hugging Face)
- Coqui TTS (XTTS, VITS, etc.)
- Orpheus "Kartoffel" (Transformers-based)
- LLaSA Hybrid (Experimental MLX + PyTorch)
- mlx-audio (e.g., Bark for Apple Silicon)
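For scripted use, the command lines in this section can be assembled in Python instead of typed by hand, which avoids manually escaping the JSON passed to --model-params. The sketch below is an illustration and not part of CrispTTS: `build_command` is a hypothetical helper, and it only uses flags shown in the usage examples (--model-id, --input-text, --output-file, --model-params).

```python
import json
import shlex
import sys


def build_command(model_id, text, output_file, model_params=None):
    """Build an argument list for main.py (hypothetical helper)."""
    cmd = [sys.executable, "main.py",
           "--model-id", model_id,
           "--input-text", text,
           "--output-file", output_file]
    if model_params:
        # json.dumps produces the JSON string; no shell escaping is needed
        # when the list is later passed to subprocess without a shell.
        cmd += ["--model-params", json.dumps(model_params)]
    return cmd


cmd = build_command("orpheus_gguf", "Ein Test.", "orpheus_custom.wav",
                    {"temperature": 0.8, "n_gpu_layers": -1})
# Print a shell-safe rendering of the arguments for logging or copy/paste.
print(shlex.join(cmd[1:]))
# To actually run it from the project root:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list keeps quoting concerns out of the calling code; `shlex.join` is used only to display the command.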
The modular design makes it easy to add support for new TTS engines:
- Create a New Handler File: In the handlers/ directory, create a new Python file (e.g., my_new_tts_handler.py)
- Implement Synthesis Function: Inside this file, write a function that takes the standard arguments: model_config, text, voice_id_override, model_params_override, output_file_str, play_direct. This function should handle all aspects of using the new TTS engine.
- Update handlers/__init__.py: Import your new function and add it to the ALL_HANDLERS dictionary.
- Update config.py: Add a new entry to GERMAN_TTS_MODELS for your new engine.
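A minimal sketch of the steps above, assuming a handler whose job is to write a WAV file. The function name `synthesize_with_my_new_tts`, the key `my_new_tts`, the `amplitude` parameter, and the tone that stands in for real synthesis are all placeholders. The boolean return value is also an assumption, since the expected return type is not specified here. Only the argument list and the registration in ALL_HANDLERS follow the description.

```python
import math
import struct
import wave

SAMPLE_RATE = 24000  # assumed output rate for this placeholder engine


def synthesize_with_my_new_tts(model_config, text, voice_id_override,
                               model_params_override, output_file_str,
                               play_direct):
    """Placeholder handler: writes a short tone instead of real speech."""
    # A real handler would pass the voice to its engine; it is unused here.
    voice = voice_id_override or model_config.get("default_voice_id")
    params = dict(model_params_override or {})
    if not text or not output_file_str:
        return False
    # Stand-in for a real engine call: 50 ms of a 220 Hz tone per character.
    n_samples = (SAMPLE_RATE // 20) * len(text)
    amplitude = min(max(float(params.get("amplitude", 0.3)), 0.0), 1.0)
    frames = b"".join(
        struct.pack("<h", int(32767 * amplitude *
                              math.sin(2 * math.pi * 220 * i / SAMPLE_RATE)))
        for i in range(n_samples)
    )
    with wave.open(output_file_str, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(frames)
    # play_direct is ignored in this sketch; a real handler would honour it.
    return True


# In handlers/__init__.py the function would be registered along these lines:
ALL_HANDLERS = {"my_new_tts": synthesize_with_my_new_tts}
```

A matching GERMAN_TTS_MODELS entry would then set "handler_function_key" to "my_new_tts", or use that string as its MODEL_ID.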
For all Orpheus-based models (GGUF local, LM Studio API, Ollama API, Kartoffel), this project relies on a user-provided decoder.py file located in the project's root directory. This file must contain a function:
def convert_to_audio(multiframe_tokens: list[int], total_token_count: int) -> bytes | None:
    # Your implementation here: convert Orpheus token IDs to raw PCM audio bytes
    # (16-bit, 24000 Hz, mono).
    # Return audio frame bytes, or None/empty bytes on error.
    pass
If this file or function is missing, Orpheus models will not produce audible output, and a placeholder will be used.
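The expected return format (16-bit, 24000 Hz, mono PCM bytes) can be illustrated independently of the token decoding itself. The sketch below is not an Orpheus decoder: `float_to_pcm16` is a hypothetical helper that shows how floating-point samples in [-1.0, 1.0], as a decoder might produce them, could be packed into that byte layout. Little-endian byte order is an assumption (it is the usual layout for WAV data).

```python
import struct


def float_to_pcm16(samples):
    """Pack floats in [-1.0, 1.0] as little-endian signed 16-bit mono PCM."""
    if not samples:
        return None  # mirrors the "None on error" convention
    # Clip out-of-range values so they cannot overflow the 16-bit range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


pcm = float_to_pcm16([0.0, 1.0, -1.0, 0.5])
print(len(pcm))  # 8 bytes: four samples, two bytes each
```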
CrispTTS is a synthesis tool — it does not bundle or redistribute
any voice/model weights. Each model is downloaded at runtime from its
upstream repository into a local cache (Piper voices from
rhasspy/piper-voices,
Coqui models via the TTS library, etc.). You obtain the weights directly
from the source, under that source's terms.
You are responsible for honouring each voice's license for whatever
you produce. Licenses vary per voice and are not uniform across
rhasspy/piper-voices — check the upstream MODEL_CARD (and, where it
only says "See URL", the underlying dataset), because the card fields are
self-reported. Notable cases among the German Piper voices CrispTTS lists:
- thorsten, kerstin — CC0 (public domain).
- eva_k, karlsson, ramona — M-AILABS, BSD-style (commercial OK; retain the copyright notice).
- mls — CC-BY 4.0 (attribution required).
- pavoque — CC BY-NC-SA 4.0 (non-commercial) — do not use the output commercially.
For a redistributable, pre-curated permissive-only GGUF set (the same
voices minus the non-commercial/restricted ones, converted for the
CrispASR/CrisperWeaver native runtime), see
cstr/piper-voices-GGUF.
CrispTTS automatically marks all synthesized audio as AI-generated using a multi-layered provenance system ported from CrispASR. The EU AI Act's Article 50 transparency obligations take effect on 2 August 2026.
| Layer | What | Status | Install |
|---|---|---|---|
| Spread-spectrum | Frequency-domain watermark (32 bins, alpha=0.08, ~38 dB SNR) | Always active | Built-in (numpy) |
| AudioSeal | Neural watermark (Meta, 16-bit message, sample-rate aware) | Auto-detected | pip install audioseal |
| WAV/MP3 metadata | LIST/INFO + ID3v2 TXXX tags | Always active | Built-in |
| C2PA credentials | Signed provenance manifests (trainedAlgorithmicMedia) | Opt-in | pip install c2pa-python |
| Spoken disclaimer | AI disclosure prepended to voice-cloned audio | Auto for cloning | Built-in |
| Consent gate | Voice-cloning attestation + audit logging | Required for cloning | Built-in |
| Post-embed verification | Watermark detection after file write | Always active | Built-in |
| Feature | CrispTTS | CrispASR | CrisperWeaver |
|---|---|---|---|
| Spread-spectrum watermark | numpy (Python) | C++ header-only | Dart LSB + native FFI |
| AudioSeal neural watermark | Python + crispasr GGUF | C++ ggml (GGUF) | via CrispASR FFI |
| WAV LIST/INFO metadata | ISFT + ICMT | ISFT + ICMT | ISFT + ICMT + IART + ICRD |
| MP3 ID3v2 tags | TXXX (AI_GENERATED) | TXXX (AI_GENERATED) | TXXX (AI_GENERATED) |
| C2PA content credentials | c2pa-python (optional) | c2pa-c (compile-time) | — |
| Spoken AI disclaimer | Edge TTS / beep fallback | Native TTS (cached) | Beep marker |
| Voice-cloning consent gate | --i-have-rights CLI | --i-have-rights CLI + server JSON | GDPR Art. 9(2)(a) consent files |
| Consent audit logging | [CONSENT] stderr | [CONSENT] stderr | [CONSENT] log + .consent.json |
| Post-embed verification | detect after save | detect after save | detect after embed |
| Watermark detection CLI | --detect-watermark | --detect-watermark | detect in service |
| Cross-project detection | Yes (shared PRNG key) | Yes (shared PRNG key) | Yes (via CrispASR FFI) |
# Default: spread-spectrum watermark + metadata (no extra deps)
python main.py --model-id edge --input-text "Hallo" --output-file out.mp3
# With AudioSeal neural watermark (auto-detected if installed)
pip install audioseal
python main.py --model-id edge --input-text "Hallo" --output-file out.mp3
# With C2PA content credentials
pip install c2pa-python
python main.py --c2pa-cert cert.pem --c2pa-key key.pem --model-id edge --input-text "Hallo" --output-file out.mp3
# Voice-cloning models require consent attestation (spoken disclaimer auto-prepended)
python main.py --model-id coqui_xtts_v2 --i-have-rights --input-text "Hallo" --output-file out.wav
# Detect watermark in existing audio
python main.py --detect-watermark out.wav
# Disable watermarking (debug only)
python main.py --no-watermark --model-id edge --input-text "Hallo" --output-file out.mp3

# Detect the watermark from Python
from watermark import watermark_detect
import soundfile as sf
pcm, sr = sf.read("out.wav", dtype="float32")
confidence = watermark_detect(pcm, sample_rate=sr)
print(f"Watermark confidence: {confidence:.3f}")  # >0.65 = AI-generated

The spread-spectrum watermark uses the same PRNG seed (0x437269737041535F), FFT parameters, and bin selection as CrispASR's C++ implementation and CrisperWeaver's native FFI path. Audio watermarked by any project in the ecosystem can be detected by the others.
CrispTTS includes an OpenAI-compatible HTTP server for integration with applications that use the OpenAI TTS SDK.
# Start the server
python main.py --server --server-port 8880
# Or run directly
python server.py --host 0.0.0.0 --port 8880

| Method | Path | Description |
|---|---|---|
| POST | /v1/audio/speech | Synthesize audio (OpenAI-compatible) |
| GET | /v1/audio/models | List available models and voices |
| GET | /health | Health check |
Request body for POST /v1/audio/speech:
{
  "model": "crispasr_kokoro",
  "input": "Hallo, wie geht es Ihnen?",
  "voice": "af_heart",
  "response_format": "wav",
  "speed": 1.0
}
Response: audio bytes with appropriate Content-Type header. All output is automatically watermarked.
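A Python client for this endpoint might look like the following sketch, which uses only the standard library. It assumes the server is listening on localhost:8880 as in the start-up example; `build_speech_request` and `synthesize` are hypothetical helper names, and the default field values are the ones from the request body above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8880"  # assumed host/port from the server example


def build_speech_request(text, model="crispasr_kokoro", voice="af_heart",
                         response_format="wav", speed=1.0):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": model, "input": text, "voice": voice,
               "response_format": response_format, "speed": speed}
    return urllib.request.Request(
        BASE_URL + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, output_path, **kwargs):
    """Send the request and write the returned audio bytes to output_path."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        with open(output_path, "wb") as out:
            out.write(resp.read())


request = build_speech_request("Hallo, wie geht es Ihnen?")
print(request.get_method(), request.full_url)
# synthesize("Hallo, wie geht es Ihnen?", "speech.wav")  # needs a running server
```

Separating request construction from sending makes the payload easy to inspect without a running server.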
espeak-ng for Kokoro: The Kokoro backend requires espeak-ng for phonemization. Install via:
pip install py-espeak-ng # installs espeak-ng CLI to ~/.local/bin
# or system-wide: apt install espeak-ng
CrispASR voice paths: The CrispASR binary auto-downloads models, but older binary versions need full paths for voice packs. Use the cached path directly:
python main.py --model-id crispasr_kokoro \
  --german-voice-id ~/.cache/crispasr/kokoro-voice-af_heart.gguf \
  --input-text "Test" --output-file out.wav
Missing Libraries: If a specific TTS engine fails, ensure you have installed all its required libraries via pip install -r requirements.txt and completed any extra steps mentioned in that engine's documentation.
mlx-audio Bark Specifics:
- This handler currently requires the main MLX model to be from a repository like mlx-community/bark-small (which should provide MLX-compatible .safetensors or model files)
- The voice prompts (speaker embeddings) are fetched from suno/bark-small by default, via a monkey patch included in mlx_audio_handler.py. That repository has a comprehensive set of speaker prompts as separate .npy files. This dual-source setup is necessary because mlx-community/bark-small has limited voice prompt files in the required format
- If mlx-audio's load_model function reports "No safetensors found" for the main mlx_model_path, you may need to convert the target Bark model to MLX format using python -m mlx_audio.tts.convert and point mlx_model_path to the local converted directory. The voice prompt patch in the handler is designed to work with either an HF repo ID or a local path for mlx_model_path when determining how to fetch/locate the .npy prompts from suno/bark-small or a speaker_embeddings subfolder
API Keys/Servers: API-based models require the respective servers (LM Studio, Ollama) to be running and accessible.
Model Downloads: First-time use of a model that needs to be downloaded from Hugging Face Hub might take some time. Ensure you have an internet connection. Set HF_TOKEN for gated models.
Verbose Output: Use --loglevel DEBUG for detailed diagnostic information if you encounter issues.
RAM Usage: Local GGUF and large Transformer models can be memory-intensive. Ensure your system has sufficient RAM.
Paths: When providing paths for --input-file, --output-file, or speaker WAV files (--german-voice-id), use appropriate relative or absolute paths.