CrispTTS: Modular German Text-to-Speech Synthesizer

CrispTTS is a versatile command-line Text-to-Speech (TTS) tool designed for synthesizing German speech using a variety of popular local and cloud-based TTS engines. Its modular architecture allows for easy maintenance and straightforward addition of new TTS handlers.

Part of the Crisp ecosystem

| Project | Role |
| --- | --- |
| Susurrus | Python GUI + CLI: 30+ ASR, 12 TTS, translation |
| CrispASR | C++ ASR/TTS engine: 24+ backends, ggml inference |
| CrispTTS | This repo: Python TTS CLI with 28+ handlers |
| CrisperWeaver | Flutter transcription app: desktop + mobile |

NOTE: This project is experimental and a work in progress. Some Python-only models may be broken due to dependency conflicts. The CrispASR-based handlers (crispasr_*) are the most reliable, because they use native C++ inference with no Python ML dependencies.

Features

28+ TTS Engine Support:
- CrispASR native C++ engines (7 backends, auto-download, no Python ML deps):
  - Kokoro (multilingual, Apache 2.0)
  - Orpheus + Kartoffel-Orpheus DE (19 German speakers, llama3.2 license)
  - Qwen3-TTS (voice cloning + voice design, Apache 2.0)
  - Chatterbox (CFM synthesis, MIT)
  - VibeVoice TTS (voice cloning)
  - IndexTTS (zero-shot cloning, Apache 2.0)
  - VoxCPM2 (48 kHz, 30 languages, Apache 2.0)
- Microsoft Edge TTS (cloud-based, requires edge-tts)
- Coqui TTS (XTTS v2, VITS, etc.)
- Piper (local ONNX, requires piper-tts)
- Orpheus GGUF (local, requires llama-cpp-python)
- Orpheus via LM Studio / Ollama API
- OuteTTS (LlamaCPP or HF backend)
- SpeechT5 (German fine-tune via HF Transformers)
- FastPitch (German via NeMo)
- mlx-audio (Bark, Kokoro, Dia — Apple Silicon)
- LLaSA (hybrid, German, multilingual variants)
- F5-TTS (MLX/PyTorch)
- Kokoro ONNX (lightweight)
- TTS.cpp (GGUF models)
- Zonos (acoustic conditioning)
- Chatterbox Python (Kartoffelbox)
AI Audio Watermarking & Provenance (EU AI Act compliant):
- Spread-spectrum watermark (always on, imperceptible, ~-46 dB)
- AudioSeal neural watermark (optional upgrade via pip install audioseal or CrispASR GGUF)
- WAV LIST/INFO and MP3 ID3v2 metadata marking audio as AI-generated
- C2PA content credentials signing (optional, pip install c2pa-python)
- Voice-cloning consent gate (--i-have-rights)
CrispASR Integration:
- --verify: ASR roundtrip verification of TTS output quality
- --translate: Pre-synthesis translation (EN→DE via m2m100/MadLad)
- --speech-speed: Rate multiplier (maps to CrispASR --pace)
- --trim-silence: Remove leading/trailing silence from output
- --tts-steps: Diffusion model inference steps (quality vs speed)
- --tts-language: Override language for multilingual models
- --pitch-shift: Pitch shift in Hz for FastPitch backends
- --instruct: Natural-language voice descriptions (Qwen3-TTS VoiceDesign)
- --stream: Stream audio playback during synthesis
- --output-sample-rate: Resample output to target sample rate
OpenAI-Compatible API Server (--server):
- POST /v1/audio/speech — drop-in replacement for OpenAI TTS
- GET /v1/audio/models — list all configured models
Text Input Flexibility: Synthesize from CLI, .txt, .md, .html, .pdf, .epub
Smart Text Chunking: Automatic sentence-boundary splitting for long texts
Customizable Output: Save audio to .wav, .mp3, .flac, or .opus
Direct Playback: Play synthesized audio immediately
Voice Selection: Override default voices/speakers for most models
Model Parameter Tuning: JSON-formatted parameters for fine-tuning
Comprehensive Testing:
- --test-all: Test all models with default voices
- --test-all-speakers: Test all models with all configured voices
- 160+ unit and live tests
Modular Design: config.py + utils.py + handlers/ + main.py
Logging: Configurable logging levels
Automatic Patching: Runtime monkeypatches for library compatibility
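
To make the "Smart Text Chunking" feature above more concrete, here is a minimal sketch of sentence-boundary splitting. It is not the code in chunking.py: the function name `chunk_text`, the `max_chars` parameter, and the simple punctuation rule are all assumptions chosen for illustration.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Group sentences into chunks of at most max_chars characters.

    Hypothetical helper, not the project's chunking.py. A single sentence
    longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A rule this simple will also split after German abbreviations such as "z. B.", so a real implementation likely needs extra handling for those cases.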

Project Structure

crisptts_project/
├── main.py                     # Main CLI application script
├── config.py                   # Model configurations and global constants
├── utils.py                    # Shared utility functions and classes
├── watermark.py                # Audio watermarking, metadata, consent gate, C2PA
├── chunking.py                 # Smart sentence-boundary text splitting
├── server.py                   # OpenAI-compatible HTTP API server
├── decoder.py                  # User-provided decoder for Orpheus models (if used)
├── handlers/                   # Package for individual TTS engine handlers
│   ├── __init__.py             # Makes 'handlers' a package, exports handler functions
│   ├── edge_handler.py         # Edge TTS cloud service handler
│   ├── piper_handler.py        # Piper TTS (ONNX) handler
│   ├── orpheus_gguf_handler.py # Local Orpheus GGUF model handler
│   ├── orpheus_api_handler.py  # Handlers for LM Studio and Ollama API
│   ├── outetts_handler.py      # OuteTTS model handler
│   ├── speecht5_handler.py     # SpeechT5 model handler
│   ├── nemo_handler.py         # NeMo FastPitch handler
│   ├── coqui_tts_handler.py    # Coqui TTS handler (for XTTS, VITS etc.)
│   ├── kartoffel_handler.py    # Orpheus "Kartoffel" Transformers handler
│   ├── kokoro_onnx_handler.py  # Kokoro (multilingual but no German) ONNX handler
│   ├── llasa_hybrid_handler.py # LLaSA Hybrid handler
│   ├── tts_cpp_handler.py      # TTS.cpp handler supporting GGUF models
│   └── mlx_audio_handler.py    # Handler for mlx-audio library (e.g., Bark)
├── requirements.txt            # Python package dependencies
└── README.md                   # This documentation file

Setup and Installation

Prerequisites

Python and pip for installing packages
For mlx-audio based models: an Apple Silicon Mac is required for GPU acceleration
For TTS.cpp: a C++ compiler and CMake are required to build the engine

Installation Steps

Clone/Download Files

git clone https://github.com/CrispStrobe/CrispTTS

Create a Virtual Environment (Recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install Dependencies: A requirements.txt file is provided. Install the necessary packages:
```
pip install -r requirements.txt
```
Note: Some libraries, such as PyTorch, NeMo, LlamaCPP, and mlx-audio, have installation requirements that depend on your OS and hardware (e.g., CUDA for Nvidia GPUs, Metal for Apple Silicon). Refer to their official documentation if you encounter issues. If audio format conversion or direct playback fails, make sure ffmpeg is installed and available in your system's PATH, since some underlying libraries may need it.
Install and Build Engine-Specific Dependencies (required for certain handlers):

For TTS.cpp: Clone and build the TTS.cpp project separately.
```
git clone https://github.com/mmwillet/TTS.cpp.git
cd TTS.cpp
cmake -B build
cmake --build build --config Release
cd ..
```
Update the tts_cpp_executable_path in config.py to point to ./TTS.cpp/build/cli

For kokoro-onnx: Install the Python package and download model files.
```
pip install kokoro-onnx
```
Download the .onnx model and voices.bin from the kokoro-onnx GitHub releases page.

Update the paths in config.py to point to your downloaded files.
Environment Variables (Optional but Recommended):
- HF_TOKEN: If you need to download models from gated or private Hugging Face repositories, set this environment variable with your Hugging Face API token:
```
export HF_TOKEN="your_huggingface_token_here"
```
- GGML_METAL_NDEBUG=1: Set automatically by main.py to reduce verbose Metal logs from llama-cpp-python on macOS.

Configuration (`config.py`)

The config.py file is central to defining which TTS models are available and their default settings.

GERMAN_TTS_MODELS Dictionary: This is the primary configuration structure. Each key is a unique MODEL_ID used in the CLI. The value is a dictionary containing:
- "handler_function_key" (Optional, defaults to MODEL_ID): The key used to look up the synthesis function in handlers.ALL_HANDLERS
- Specific parameters for that model (e.g., model_repo_id, default_voice_id, API URLs, onnx_repo_id, etc.)
- "notes": A brief description of the model
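
The lookup rule described above (use "handler_function_key" if present, otherwise fall back to the MODEL_ID) can be sketched as follows. The two dictionaries here are small hypothetical stand-ins for config.GERMAN_TTS_MODELS and handlers.ALL_HANDLERS, and `resolve_handler` is an illustrative name, not a function confirmed to exist in main.py.

```python
from typing import Callable


def synthesize_with_edge(*args, **kwargs) -> None:
    """Placeholder for the real Edge TTS handler."""


def synthesize_with_mlx_audio(*args, **kwargs) -> None:
    """Placeholder for the real mlx-audio handler."""


# Hypothetical stand-in for handlers.ALL_HANDLERS.
ALL_HANDLERS: dict[str, Callable] = {
    "edge": synthesize_with_edge,
    "mlx_audio": synthesize_with_mlx_audio,
}

# Hypothetical stand-in for config.GERMAN_TTS_MODELS.
GERMAN_TTS_MODELS: dict[str, dict] = {
    "edge": {"notes": "No handler_function_key: falls back to the model ID."},
    "mlx_audio_bark_de": {
        "handler_function_key": "mlx_audio",
        "notes": "Explicit key shared by all mlx-audio model variants.",
    },
}


def resolve_handler(model_id: str) -> Callable:
    """Return the synthesis function configured for model_id."""
    if model_id not in GERMAN_TTS_MODELS:
        raise ValueError(f"Unknown model ID: {model_id}")
    model_config = GERMAN_TTS_MODELS[model_id]
    # The handler key is optional and defaults to the model ID itself.
    handler_key = model_config.get("handler_function_key", model_id)
    if handler_key not in ALL_HANDLERS:
        raise ValueError(f"No handler registered under key: {handler_key}")
    return ALL_HANDLERS[handler_key]
```

Sharing one handler key across several model IDs is what lets a single handler serve multiple variants of the same engine.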

mlx-audio Bark Configuration Example: To use the mlx-audio Bark model, your configuration might look like this, enabling the dual-source strategy for voice prompts (main model from mlx-community, voice NPYs from suno):

"mlx_audio_bark_de": {
    "handler_function_key": "mlx_audio",
    "mlx_model_path": "mlx-community/bark-small", # Main MLX model
    # Voice prompts will be fetched by the patched handler from "suno/bark-small"
    "default_voice_id": "v2/de_speaker_3", 
    "available_voices": ["v2/de_speaker_0", "v2/de_speaker_1", "v2/de_speaker_3", "..."],
    "lang_code": "de",
    "sample_rate": 24000,
    "notes": "mlx-audio (Bark) with main model from mlx-community/bark-small and voices from suno/bark-small (via patch)."
},

Global Constants: API URLs, default voice names, and sample rates are also defined here.
Adding/Modifying Models: To add a new variation of an existing engine, or a completely new engine (after creating its handler), add a new entry to GERMAN_TTS_MODELS.

Usage (`main.py`)

All interactions are done through main.py from your project's root directory.

Basic Command Structure

python main.py [ACTION_FLAG | --model-id <MODEL_ID> [OPTIONS]]

Common Examples

List all available models:

python main.py --list-models

Get information about voices for a specific model:

python main.py --voice-info edge
python main.py --voice-info mlx_audio_bark_de

Synthesize text using a specific model:

python main.py --model-id edge --input-text "Hallo, wie geht es Ihnen heute?" --output-file hallo_edge.mp3 --play-direct

Synthesize text using mlx-audio Bark (German):

python main.py --model-id mlx_audio_bark_de --input-text "Das ist ein Test mit Bark auf Apple Silicon." --output-file bark_test_de.wav

Use a specific German voice (if supported by the model):

python main.py --model-id edge --input-text "Ein Test mit einer anderen Stimme." --german-voice-id de-DE-ConradNeural --output-file conrad_test.mp3

Check --voice-info <MODEL_ID> for available voice IDs/formats for that model.

Synthesize text from a file:

python main.py --model-id piper_local --input-file ./my_text.txt --output-file piper_output.wav

Supported input file types: .txt, .md, .html, .pdf, .epub.

Use model-specific parameters (as a JSON string):

python main.py --model-id orpheus_gguf --input-text "Ein Test." --model-params "{\"temperature\": 0.8, \"n_gpu_layers\": -1}" --output-file orpheus_custom.wav
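
Escaping the JSON by hand is error-prone, so it can help to build the argument programmatically. The sketch below only constructs and prints the command shown above; it assumes a POSIX shell for the printed form and does not run main.py.

```python
import json
import shlex

# Parameters taken from the example above.
params = {"temperature": 0.8, "n_gpu_layers": -1}

command = [
    "python", "main.py",
    "--model-id", "orpheus_gguf",
    "--input-text", "Ein Test.",
    # json.dumps produces valid JSON, so no manual quote escaping is needed.
    "--model-params", json.dumps(params),
    "--output-file", "orpheus_custom.wav",
]

# shlex.join quotes each argument for a POSIX shell (Python 3.8+).
print(shlex.join(command))
```

The same `command` list can also be passed directly to `subprocess.run`, which avoids shell quoting entirely.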

Test all configured models with default voices:

python main.py --input-text "Dies ist ein kurzer Test für alle Modelle." --test-all --output-dir ./test_results

Test all models with all their configured available voices/speakers:

python main.py --input-text "Ein Test für alle Stimmen." --test-all-speakers --output-dir ./test_results_all_speakers

Speech speed and pitch control:

python main.py --model-id crispasr_kokoro --input-text "Schneller sprechen." --speech-speed 1.3 --output-file fast.wav
python main.py --model-id crispasr_kokoro --input-text "Höher." --pitch-shift 50 --output-file high.wav

Silence trimming and resampling:

python main.py --model-id crispasr_kokoro --input-text "Test." --trim-silence --output-sample-rate 16000 --output-file trimmed_16k.wav

VoiceDesign — generate voices from text descriptions:

python main.py --model-id crispasr_qwen3_tts_voicedesign --instruct "A calm elderly man" --input-text "Hallo" --output-file calm.wav

Streaming playback (hear audio while it generates):

python main.py --model-id crispasr_kokoro --input-text "Dies wird sofort abgespielt." --stream

Run as OpenAI-compatible API server:

python main.py --server --server-port 8880
# Then: curl -X POST http://localhost:8880/v1/audio/speech \
#   -H "Content-Type: application/json" \
#   -d '{"model":"crispasr_kokoro","input":"Hallo Welt","voice":"af_heart"}' \
#   --output speech.wav

Change Logging Level (for debugging):

python main.py --model-id edge --input-text "Debug Test." --loglevel DEBUG

Override API URLs (for API-based models like Orpheus LM Studio/Ollama):

python main.py --model-id orpheus_lm_studio --input-text "Hallo API" --lm-studio-api-url http://localhost:5000/v1/completions
python main.py --model-id orpheus_ollama --input-text "Hallo Ollama" --ollama-api-url http://localhost:11223/api/generate --ollama-model-name my-orpheus-ollama-model

Supported TTS Engines

Refer to the output of python main.py --list-models for the currently configured models and their notes. The script supports integration with:

Microsoft Edge TTS
Piper TTS
Orpheus GGUF (via llama-cpp-python)
Orpheus via LM Studio API
Orpheus via Ollama API
OuteTTS (LlamaCPP and Hugging Face ONNX backends)
SpeechT5 (Hugging Face Transformers)
FastPitch (NeMo / Hugging Face)
Coqui TTS (XTTS, VITS, etc.)
Orpheus "Kartoffel" (Transformers-based)
LLaSA Hybrid (Experimental MLX + PyTorch)
mlx-audio (e.g., Bark for Apple Silicon)

Adding New TTS Handlers

The modular design makes it easy to add support for new TTS engines:

Create a New Handler File: In the handlers/ directory, create a new Python file (e.g., my_new_tts_handler.py)
Implement Synthesis Function: Inside this file, write a function that takes the standard arguments: model_config, text, voice_id_override, model_params_override, output_file_str, play_direct. This function should handle all aspects of using the new TTS engine.
Update handlers/__init__.py: Import your new function and add it to the ALL_HANDLERS dictionary.
Update config.py: Add a new entry to GERMAN_TTS_MODELS for your new engine.
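
The steps above can be sketched as a skeleton handler. Everything engine-specific here is a placeholder: the name `synthesize_with_my_new_tts`, the key `my_new_tts`, the config fields, and the local `ALL_HANDLERS` dictionary are hypothetical. The return value and the exact type of `model_params_override` are not specified in this README, so the sketch returns nothing and leaves that argument unused. Instead of real speech it writes a short test tone, so the example stays runnable without any TTS library.

```python
import math
import struct
import wave


def synthesize_with_my_new_tts(model_config, text, voice_id_override,
                               model_params_override, output_file_str, play_direct):
    """Hypothetical handler skeleton using the standard argument list."""
    sample_rate = model_config.get("sample_rate", 24000)
    # An explicit override wins over the configured default voice.
    voice = voice_id_override or model_config.get("default_voice_id")
    print(f"[my_new_tts] voice={voice!r}, {len(text)} characters")

    # A real handler would call the TTS engine here. This placeholder
    # generates 0.2 seconds of a 440 Hz tone as 16-bit mono PCM.
    n_samples = sample_rate // 5
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * 440 * i / sample_rate)))
        for i in range(n_samples)
    )

    if output_file_str:
        with wave.open(output_file_str, "wb") as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)  # 2 bytes = 16-bit samples
            wav_file.setframerate(sample_rate)
            wav_file.writeframes(frames)
    # play_direct is ignored in this sketch; a real handler would play the audio.


# Stand-in for the registration step in handlers/__init__.py.
ALL_HANDLERS = {"my_new_tts": synthesize_with_my_new_tts}

# Stand-in for the matching entry in config.GERMAN_TTS_MODELS.
EXAMPLE_MODEL_ENTRY = {
    "my_new_tts": {
        "default_voice_id": "example_voice",
        "sample_rate": 24000,
        "notes": "Hypothetical engine used to illustrate the handler interface.",
    }
}
```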

`decoder.py` Requirement for Orpheus

For all Orpheus-based models (GGUF local, LM Studio API, Ollama API, Kartoffel), this project relies on a user-provided decoder.py file located in the project's root directory. This file must contain a function:

def convert_to_audio(multiframe_tokens: list[int], total_token_count: int) -> bytes | None:
    # Your implementation here to convert Orpheus token IDs to raw PCM audio bytes
    # (16-bit, 24000 Hz, mono)
    # Return audio frame bytes, or None/empty bytes on error.
    pass

If this file or function is missing, Orpheus models will not produce audible output, and a placeholder will be used.
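
The token decoding itself depends on the Orpheus audio codec and is not shown here. The sketch below covers only the final step implied by the contract above: turning decoded samples into raw 16-bit PCM bytes. The helper name `floats_to_pcm16`, the float input range, and the little-endian byte order are assumptions, not details confirmed by this README.

```python
import struct


def floats_to_pcm16(samples: list[float]) -> bytes:
    """Pack float samples in [-1.0, 1.0] as signed 16-bit PCM bytes.

    Hypothetical helper; little-endian byte order is assumed.
    """
    # Clip first so out-of-range values cannot overflow the int16 range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)
```

At 24000 Hz mono, one second of audio in this format is 48000 bytes (24000 samples of 2 bytes each).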

Voice & model licensing

CrispTTS is a synthesis tool — it does not bundle or redistribute any voice/model weights. Each model is downloaded at runtime from its upstream repository into a local cache (Piper voices from rhasspy/piper-voices, Coqui models via the TTS library, etc.). You obtain the weights directly from the source, under that source's terms.

You are responsible for honouring each voice's license for whatever you produce. Licenses vary per voice and are not uniform across rhasspy/piper-voices — check the upstream MODEL_CARD (and, where it only says "See URL", the underlying dataset), because the card fields are self-reported. Notable cases among the German Piper voices CrispTTS lists:

thorsten, kerstin — CC0 (public domain).
eva_k, karlsson, ramona — M-AILABS, BSD-style (commercial OK; retain the copyright notice).
mls — CC-BY 4.0 (attribution required).
pavoque — CC BY-NC-SA 4.0 (non-commercial) — do not use the output commercially.

For a redistributable, pre-curated permissive-only GGUF set (the same voices minus the non-commercial/restricted ones, converted for the CrispASR/CrisperWeaver native runtime), see cstr/piper-voices-GGUF.

Audio Watermarking & Provenance (EU AI Act Art. 50)

CrispTTS automatically marks all synthesized audio as AI-generated using a multi-layered provenance system ported from CrispASR. Article 50 transparency obligations take effect 2 August 2026.

Layers

| Layer | What | Status | Install |
| --- | --- | --- | --- |
| Spread-spectrum | Frequency-domain watermark (32 bins, alpha=0.08, ~38 dB SNR) | Always active | Built-in (numpy) |
| AudioSeal | Neural watermark (Meta, 16-bit message, sample-rate aware) | Auto-detected | `pip install audioseal` |
| WAV/MP3 metadata | LIST/INFO + ID3v2 TXXX tags | Always active | Built-in |
| C2PA credentials | Signed provenance manifests (`trainedAlgorithmicMedia`) | Opt-in | `pip install c2pa-python` |
| Spoken disclaimer | AI disclosure prepended to voice-cloned audio | Auto for cloning | Built-in |
| Consent gate | Voice-cloning attestation + audit logging | Required for cloning | Built-in |
| Post-embed verification | Watermark detection after file write | Always active | Built-in |
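
The WAV metadata layer refers to the RIFF LIST/INFO chunk. As an illustration of what such a chunk looks like on disk, here is a minimal sketch that builds one with ISFT (software) and ICMT (comment) entries. It is not the code in watermark.py: the function names, the UTF-8 encoding, and the example tag values are assumptions.

```python
import struct


def _info_subchunk(chunk_id: bytes, text: str) -> bytes:
    """Build one INFO sub-chunk: 4-byte ID, uint32 size, NUL-terminated text."""
    data = text.encode("utf-8") + b"\x00"
    # RIFF chunks are word-aligned: odd-sized data gets one pad byte,
    # which is not counted in the size field.
    padding = b"\x00" if len(data) % 2 else b""
    return chunk_id + struct.pack("<I", len(data)) + data + padding


def build_list_info_chunk(software: str, comment: str) -> bytes:
    """Build a RIFF 'LIST' chunk of type 'INFO' with ISFT and ICMT entries."""
    body = (
        b"INFO"
        + _info_subchunk(b"ISFT", software)
        + _info_subchunk(b"ICMT", comment)
    )
    return b"LIST" + struct.pack("<I", len(body)) + body


# Example tag values are placeholders, not the strings CrispTTS writes.
chunk = build_list_info_chunk("CrispTTS", "AI-generated")
```

Appending such a chunk to a WAV file also requires updating the size field in the file's RIFF header, which this sketch leaves out.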

Compliance comparison across the Crisp ecosystem

| Feature | CrispTTS | CrispASR | CrisperWeaver |
| --- | --- | --- | --- |
| Spread-spectrum watermark | numpy (Python) | C++ header-only | Dart LSB + native FFI |
| AudioSeal neural watermark | Python + crispasr GGUF | C++ ggml (GGUF) | via CrispASR FFI |
| WAV LIST/INFO metadata | ISFT + ICMT | ISFT + ICMT | ISFT + ICMT + IART + ICRD |
| MP3 ID3v2 tags | TXXX (AI_GENERATED) | TXXX (AI_GENERATED) | TXXX (AI_GENERATED) |
| C2PA content credentials | c2pa-python (optional) | c2pa-c (compile-time) | - |
| Spoken AI disclaimer | Edge TTS / beep fallback | Native TTS (cached) | Beep marker |
| Voice-cloning consent gate | `--i-have-rights` CLI | `--i-have-rights` CLI + server JSON | GDPR Art. 9(2)(a) consent files |
| Consent audit logging | `[CONSENT]` stderr | `[CONSENT]` stderr | `[CONSENT]` log + `.consent.json` |
| Post-embed verification | detect after save | detect after save | detect after embed |
| Watermark detection CLI | `--detect-watermark` | `--detect-watermark` | detect in service |
| Cross-project detection | Yes (shared PRNG key) | Yes (shared PRNG key) | Yes (via CrispASR FFI) |

Usage

# Default: spread-spectrum watermark + metadata (no extra deps)
python main.py --model-id edge --input-text "Hallo" --output-file out.mp3

# With AudioSeal neural watermark (auto-detected if installed)
pip install audioseal
python main.py --model-id edge --input-text "Hallo" --output-file out.mp3

# With C2PA content credentials
pip install c2pa-python
python main.py --c2pa-cert cert.pem --c2pa-key key.pem --model-id edge --input-text "Hallo" --output-file out.mp3

# Voice-cloning models require consent attestation (spoken disclaimer auto-prepended)
python main.py --model-id coqui_xtts_v2 --i-have-rights --input-text "Hallo" --output-file out.wav

# Detect watermark in existing audio
python main.py --detect-watermark out.wav

# Disable watermarking (debug only)
python main.py --no-watermark --model-id edge --input-text "Hallo" --output-file out.mp3

Detection (Python API)

from watermark import watermark_detect
import soundfile as sf

pcm, sr = sf.read("out.wav", dtype="float32")
confidence = watermark_detect(pcm, sample_rate=sr)
print(f"Watermark confidence: {confidence:.3f}")  # >0.65 = AI-generated

Cross-compatibility

The spread-spectrum watermark uses the same PRNG seed (0x437269737041535F), FFT parameters, and bin selection as CrispASR's C++ implementation and CrisperWeaver's native FFI path. Audio watermarked by any project in the ecosystem can be detected by the others.

API Server

CrispTTS includes an OpenAI-compatible HTTP server for integration with applications that use the OpenAI TTS SDK.

# Start the server
python main.py --server --server-port 8880

# Or run directly
python server.py --host 0.0.0.0 --port 8880

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | `/v1/audio/speech` | Synthesize audio (OpenAI-compatible) |
| GET | `/v1/audio/models` | List available models and voices |
| GET | `/health` | Health check |

Request format (POST /v1/audio/speech)

{
  "model": "crispasr_kokoro",
  "input": "Hallo, wie geht es Ihnen?",
  "voice": "af_heart",
  "response_format": "wav",
  "speed": 1.0
}

Response: audio bytes with appropriate Content-Type header. All output is automatically watermarked.
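
For clients that do not use the OpenAI SDK, the request above can also be sent with the Python standard library. This is a minimal sketch that assumes the server is running locally on port 8880, as in the earlier examples. The helper names `build_speech_request` and `synthesize_to_file` and the timeout value are illustrative.

```python
import json
import urllib.request


def build_speech_request(text: str,
                         model: str = "crispasr_kokoro",
                         voice: str = "af_heart",
                         base_url: str = "http://localhost:8880") -> urllib.request.Request:
    """Build a POST request for /v1/audio/speech using the documented fields."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",
        "speed": 1.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, output_path: str, **kwargs) -> None:
    """Send the request and write the returned audio bytes to output_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(output_path, "wb") as out:
        out.write(audio)
```

With the server running, `synthesize_to_file("Hallo Welt", "speech.wav")` would save the response body as a WAV file.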

Troubleshooting & Notes

espeak-ng for Kokoro: The Kokoro backend requires espeak-ng for phonemization. Install via:

pip install py-espeak-ng     # installs espeak-ng CLI to ~/.local/bin
# or system-wide: apt install espeak-ng

CrispASR voice paths: The CrispASR binary auto-downloads models, but older binary versions need the full path to a voice pack. In that case, pass the cached path directly:

python main.py --model-id crispasr_kokoro \
  --german-voice-id ~/.cache/crispasr/kokoro-voice-af_heart.gguf \
  --input-text "Test" --output-file out.wav

Missing Libraries: If a specific TTS engine fails, ensure you have installed all of its required libraries via pip install -r requirements.txt and completed any extra steps mentioned in its documentation.

mlx-audio Bark Specifics:

This handler currently requires the main MLX model to come from a repository like mlx-community/bark-small (which should provide MLX-compatible .safetensors or model files).
Voice prompts (speaker embeddings) are fetched from suno/bark-small by default, via a monkey patch included in mlx_audio_handler.py. That repository provides a comprehensive set of speaker prompts as separate .npy files. The dual-source setup is necessary because mlx-community/bark-small has only limited voice prompt files in the required format.
If mlx-audio's load_model function reports "No safetensors found" for the main mlx_model_path, you may need to convert the target Bark model to MLX format using python -m mlx_audio.tts.convert and point mlx_model_path to the local converted directory. The voice prompt patch is designed to work whether mlx_model_path is an HF repo ID or a local path: it either fetches the .npy prompts from suno/bark-small or locates them in a speaker_embeddings subfolder.

API Keys/Servers: API-based models require the respective servers (LM Studio, Ollama) to be running and accessible.

Model Downloads: First-time use of a model that needs to be downloaded from Hugging Face Hub might take some time. Ensure you have an internet connection. Set HF_TOKEN for gated models.

Verbose Output: Use --loglevel DEBUG for detailed diagnostic information if you encounter issues.

RAM Usage: Local GGUF and large Transformer models can be memory-intensive. Ensure your system has sufficient RAM.

Paths: When providing paths for --input-file, --output-file, or speaker WAV files (--german-voice-id), use appropriate relative or absolute paths.
