
fix(crispasr): write piper TTS WAV at the model's native sample rate#10277

Merged
mudler merged 1 commit into master from fix/crispasr-piper-samplerate
Jun 12, 2026

@localai-bot
Collaborator

Problem

CrispASR's piper backend returns PCM at the voice's native sample rate (read from the GGUF piper.sample_rate key: 16 kHz for x_low/low, 22.05 kHz for medium/high) and does not resample. The Go WAV encoder in the crispasr backend hardcoded 24000 Hz, so every piper voice was written with a wrong header and played back too fast and at too high a pitch (about +9% for the 22.05 kHz voices, +50% for the 16 kHz voices).
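The size of the shift follows directly from the ratio between the rate in the header and the rate the samples were generated at. A minimal sketch of that arithmetic; `playbackFactor` is a hypothetical helper written for this illustration and is not part of the backend:

```go
package main

import "fmt"

// playbackFactor reports how much faster than intended audio plays when PCM
// generated at nativeHz is labelled as headerHz in the WAV header.
// Hypothetical helper, for illustration only.
func playbackFactor(headerHz, nativeHz int) float64 {
	return float64(headerHz) / float64(nativeHz)
}

func main() {
	const hardcoded = 24000 // the rate the old encoder always wrote
	for _, native := range []int{16000, 22050} {
		f := playbackFactor(hardcoded, native)
		fmt.Printf("native %d Hz labelled %d Hz: %.3fx speed (%+.1f%%)\n",
			native, hardcoded, f, (f-1)*100)
	}
}
```

The factor is 24000/22050 ≈ 1.088 for the 22.05 kHz voices and exactly 1.5 for the 16 kHz voices.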

The session-level C-ABI (crispasr_session_synthesize) only returns the sample buffer + count, not the rate, so the rate must be recovered on the Go side.

Fix

  • piperSampleRate() reads piper.sample_rate (u32) from the model's GGUF metadata via the already-vendored gguf-parser-go.
  • Load() stores it on the CrispASR struct, falling back to the 24 kHz default for the other CrispASR TTS engines (vibevoice / orpheus / chatterbox / qwen3-tts) that emit 24 kHz and carry no such key.
  • writeWAV(dst, pcm, rate) (was writeWAV24k) uses the stored rate for both the encoder and the audio.Format.

Pure Go change; no shim/C rebuild needed.
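The shape of the fix can be sketched with the standard library alone. This is not the PR's code: the real writeWAV takes a destination and goes through an audio encoder and audio.Format, while the sketch below assumes mono 16-bit PCM and writes the 44-byte RIFF header by hand. The names `resolveRate`, `encodeWAV16`, and `defaultSampleRate` are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// defaultSampleRate mirrors the 24 kHz fallback described above (assumed name).
const defaultSampleRate = 24000

// resolveRate returns the rate found in the model metadata, or the default
// when the key is absent or zero. Hypothetical helper.
func resolveRate(metaRate uint32, found bool) int {
	if !found || metaRate == 0 {
		return defaultSampleRate
	}
	return int(metaRate)
}

// encodeWAV16 builds a mono 16-bit PCM WAV in memory, writing the given
// rate into the header instead of a hardcoded constant.
func encodeWAV16(pcm []int16, rate int) []byte {
	const channels, bitsPerSample = 1, 16
	blockAlign := channels * bitsPerSample / 8
	dataLen := len(pcm) * blockAlign

	var buf bytes.Buffer
	le := binary.LittleEndian
	// Writes to a bytes.Buffer with fixed-size values do not fail,
	// so the returned errors are ignored in this sketch.
	buf.WriteString("RIFF")
	binary.Write(&buf, le, uint32(36+dataLen)) // RIFF chunk size
	buf.WriteString("WAVE")
	buf.WriteString("fmt ")
	binary.Write(&buf, le, uint32(16))              // fmt chunk size
	binary.Write(&buf, le, uint16(1))               // audio format: PCM
	binary.Write(&buf, le, uint16(channels))        // channel count
	binary.Write(&buf, le, uint32(rate))            // sample rate (bytes 24..27)
	binary.Write(&buf, le, uint32(rate*blockAlign)) // byte rate
	binary.Write(&buf, le, uint16(blockAlign))      // bytes per sample frame
	binary.Write(&buf, le, uint16(bitsPerSample))
	buf.WriteString("data")
	binary.Write(&buf, le, uint32(dataLen))
	binary.Write(&buf, le, pcm)
	return buf.Bytes()
}

func main() {
	pcm := []int16{0, 1000, -1000, 0}
	piper := encodeWAV16(pcm, resolveRate(22050, true)) // piper voice with the key
	other := encodeWAV16(pcm, resolveRate(0, false))    // engine without the key
	fmt.Println(binary.LittleEndian.Uint32(piper[24:28]),
		binary.LittleEndian.Uint32(other[24:28]))
}
```

Keeping the fallback in one small function means engines without the key continue to get 24 kHz without any per-engine branching.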

Tests

  • Unit specs: craft minimal in-memory GGUFs (22050 / 16000 / non-piper / garbage) and decode the produced WAV header — no network or model needed.
  • Env-gated e2e spec (CRISPASR_PIPER_MODEL_PATH), same convention as the other model-backed specs.
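To illustrate the crafted-GGUF approach used by the unit specs, here is a rough sketch that builds a header-only GGUF in memory and reads one uint32 key back. It is a hand-rolled stand-in rather than the PR's specs or the gguf-parser-go API: it assumes the GGUF v3 little-endian layout (magic, version, tensor count, KV count, then key/type/value entries) and deliberately handles only uint32 values. `craftGGUF` and `readUint32Key` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

const ggufTypeUint32 = 4 // GGUF metadata value type id for uint32

// craftGGUF builds a header-only GGUF v3 blob with one uint32 metadata entry.
func craftGGUF(key string, val uint32) []byte {
	var b bytes.Buffer
	le := binary.LittleEndian
	b.WriteString("GGUF")
	binary.Write(&b, le, uint32(3))        // version
	binary.Write(&b, le, uint64(0))        // tensor count
	binary.Write(&b, le, uint64(1))        // metadata KV count
	binary.Write(&b, le, uint64(len(key))) // key length
	b.WriteString(key)
	binary.Write(&b, le, uint32(ggufTypeUint32))
	binary.Write(&b, le, val)
	return b.Bytes()
}

// readUint32Key scans the metadata for want. It only understands uint32
// values and returns an error on any other value type or malformed input.
func readUint32Key(data []byte, want string) (uint32, bool, error) {
	r := bytes.NewReader(data)
	le := binary.LittleEndian
	var hdr struct {
		Magic   [4]byte
		Version uint32
		Tensors uint64
		KVs     uint64
	}
	if err := binary.Read(r, le, &hdr); err != nil {
		return 0, false, err
	}
	if string(hdr.Magic[:]) != "GGUF" {
		return 0, false, errors.New("not a GGUF file")
	}
	for i := uint64(0); i < hdr.KVs; i++ {
		var klen uint64
		if err := binary.Read(r, le, &klen); err != nil {
			return 0, false, err
		}
		if klen > uint64(r.Len()) {
			return 0, false, errors.New("key length exceeds input")
		}
		key := make([]byte, klen)
		if _, err := io.ReadFull(r, key); err != nil {
			return 0, false, err
		}
		var typ, val uint32
		if err := binary.Read(r, le, &typ); err != nil {
			return 0, false, err
		}
		if typ != ggufTypeUint32 {
			return 0, false, fmt.Errorf("unsupported value type %d", typ)
		}
		if err := binary.Read(r, le, &val); err != nil {
			return 0, false, err
		}
		if string(key) == want {
			return val, true, nil
		}
	}
	return 0, false, nil
}

func main() {
	rate, ok, err := readUint32Key(craftGGUF("piper.sample_rate", 22050), "piper.sample_rate")
	fmt.Println(rate, ok, err)
}
```

Because the blob is built in memory, cases like a missing key or garbage input can be exercised without a model file or network access.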

Verified e2e: built libgocrispasr-fallback.so from the current pin and synthesized en_GB-cori-medium through backend:piper → WAV header is 22050 Hz (old code: 24000).

Split out as a standalone correctness fix from in-progress work to add piper voices to the gallery.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

CrispASR's piper backend returns PCM at the voice's native rate (from the GGUF
piper.sample_rate key: 16 kHz for x_low/low, 22.05 kHz for medium/high) and does
not resample, but the Go WAV encoder hardcoded 24000 Hz. Every piper voice was
therefore written with a wrong header and played back at the wrong pitch/speed.

Read piper.sample_rate from the model's GGUF metadata at Load via the vendored
gguf-parser-go and use it for the WAV header, falling back to the 24 kHz default
for the other CrispASR TTS engines (vibevoice/orpheus/chatterbox/qwen3-tts) that
emit 24 kHz and carry no such key.

Adds unit specs (minimal crafted GGUFs + WAV-header decode) and an env-gated
end-to-end spec (CRISPASR_PIPER_MODEL_PATH). Verified e2e: en_GB-cori-medium
synthesizes a 22050 Hz WAV through backend:piper.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@mudler merged commit 46ba706 into master Jun 12, 2026
58 of 69 checks passed
@mudler deleted the fix/crispasr-piper-samplerate branch June 12, 2026 21:10
2 participants