fix(crispasr): write piper TTS WAV at the model's native sample rate #10277
Merged
CrispASR's piper backend returns PCM at the voice's native rate (from the GGUF piper.sample_rate key: 16 kHz for x_low/low, 22.05 kHz for medium/high) and does not resample, but the Go WAV encoder hardcoded 24000 Hz. Every piper voice was therefore written with a wrong header and played back at the wrong pitch/speed.

Read piper.sample_rate from the model's GGUF metadata at Load via the vendored gguf-parser-go and use it for the WAV header, falling back to the 24 kHz default for the other CrispASR TTS engines (vibevoice/orpheus/chatterbox/qwen3-tts) that emit 24 kHz and carry no such key.

Adds unit specs (minimal crafted GGUFs + WAV-header decode) and an env-gated end-to-end spec (CRISPASR_PIPER_MODEL_PATH). Verified e2e: en_GB-cori-medium synthesizes a 22050 Hz WAV through backend:piper.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Problem
CrispASR's piper backend returns PCM at the voice's native sample rate (read from the GGUF `piper.sample_rate` key: 16 kHz for `x_low`/`low`, 22.05 kHz for `medium`/`high`) and does not resample. The Go WAV encoder in the crispasr backend hardcoded 24000 Hz, so every piper voice was written with a wrong header and played back at the wrong pitch/speed (~+9% for medium voices).

The session-level C-ABI (`crispasr_session_synthesize`) only returns the sample buffer + count, not the rate, so the rate must be recovered on the Go side.

Fix
- `piperSampleRate()` reads `piper.sample_rate` (u32) from the model's GGUF metadata via the already-vendored `gguf-parser-go`.
- `Load()` stores it on the `CrispASR` struct, falling back to the 24 kHz default for the other CrispASR TTS engines (vibevoice / orpheus / chatterbox / qwen3-tts) that emit 24 kHz and carry no such key.
- `writeWAV(dst, pcm, rate)` (was `writeWAV24k`) uses the stored rate for both the encoder and the `audio.Format`.

Pure Go change; no shim/C rebuild needed.
Tests
- Unit specs: minimal crafted GGUFs + WAV-header decode.
- Env-gated end-to-end spec (`CRISPASR_PIPER_MODEL_PATH`), same convention as the other model-backed specs.

Verified e2e: built `libgocrispasr-fallback.so` from the current pin and synthesized `en_GB-cori-medium` through `backend:piper` → WAV header is 22050 Hz (old code: 24000).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]