feat(crispasr): bundle espeak-ng and add piper TTS voices to the gallery#10283
Merged
CrispASR's piper backend phonemizes non-English text via espeak-ng (dlopen, the MIT-clean path; English uses a built-in G2P). The FROM scratch crispasr image shipped none of it, so non-English piper voices loaded but failed synthesis with "phonemization failed".

Bundle the espeak-ng runtime so they work:

- Dockerfile.golang: install espeak-ng-data + libespeak-ng1 and its libpcaudio0 / libsonic0 deps in the crispasr builder (espeak's dlopen fails without the latter two).
- package.sh: copy libespeak-ng.so.1, libpcaudio.so.0, libsonic.so.0 into package/lib/ and the espeak-ng-data dir into the package root.
- run.sh: export CRISPASR_ESPEAK_DATA_PATH so the bundled data is found.

Add 9 single-speaker piper voices (de/en/it, incl. Italian paola + riccardo) to the gallery, run through backend:piper, hosted at LocalAI-Community/piper-voices-GGUF (converted from rhasspy/piper-voices with CrispASR's convert-piper-to-gguf.py). Only single-speaker low/medium voices are included; the engine does not yet support multi-speaker or high-quality piper decoders.

All 9 verified end-to-end: each synthesizes a WAV at the model's native sample rate using only the image-bundled espeak payload.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
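The packaging and launch steps described in the commit message could look roughly like the sketch below. It is not the actual content of package.sh or run.sh: the function names `bundle_espeak` and `espeak_env` and all directory arguments are hypothetical. Only the three library file names, the `espeak-ng-data` directory name, and the `CRISPASR_ESPEAK_DATA_PATH` variable come from the change itself.

```shell
#!/bin/sh
# Hypothetical sketch only; the real package.sh / run.sh may differ.

# bundle_espeak SRC_LIB_DIR SRC_DATA_DIR PACKAGE_DIR
# Copies the espeak-ng runtime into a package tree (paths are assumptions).
bundle_espeak() {
    src_lib=$1
    src_data=$2
    pkg=$3

    mkdir -p "$pkg/lib" "$pkg/espeak-ng-data"

    # libespeak-ng needs libpcaudio and libsonic at dlopen time,
    # so all three shared objects are shipped together.
    for lib in libespeak-ng.so.1 libpcaudio.so.0 libsonic.so.0; do
        # -L dereferences symlinks so the real file lands in the package
        cp -L "$src_lib/$lib" "$pkg/lib/" || return 1
    done

    # Copy the directory contents (not the directory itself) into the
    # package root, so repeated runs do not nest the data one level deeper.
    cp -R "$src_data/." "$pkg/espeak-ng-data/" || return 1
}

# espeak_env PACKAGE_DIR
# Exports the variable the backend reads to locate the bundled data.
espeak_env() {
    CRISPASR_ESPEAK_DATA_PATH="$1/espeak-ng-data"
    export CRISPASR_ESPEAK_DATA_PATH
}
```

A launcher would call `espeak_env` with its own package directory before starting the backend binary. How the bundled `.so` files end up on the loader's search path is not covered here.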
Adds Piper TTS voices to the gallery, run through the `crispasr` backend's `backend:piper` engine, and bundles espeak-ng into the backend image so non-English voices work.

Pairs with #10277 (the piper WAV sample-rate fix) for correct playback rate.
espeak-ng bundling
CrispASR's piper backend phonemizes non-English text via espeak-ng (loaded through the MIT-clean dlopen path; English uses a built-in CMUdict/LTS G2P). The `FROM scratch` crispasr image shipped none of it, so German/Italian/etc. voices loaded but failed synthesis with `piper_tts: phonemization failed`.

- `Dockerfile.golang`: install `espeak-ng-data libespeak-ng1 libpcaudio0 libsonic0` in the crispasr builder. The `dlopen` of `libespeak-ng.so.1` fails unless `libpcaudio.so.0` and `libsonic.so.0` are also present (confirmed via strace).
- `package.sh`: copy the three `.so` files into `package/lib/` and the `espeak-ng-data/` dir into the package root.
- `run.sh`: export `CRISPASR_ESPEAK_DATA_PATH` so the bundled data is found.

No CrispASR rebuild/flag needed: the dlopen path is already compiled in (`CRISPASR_WITH_ESPEAK_NG=AUTO`).

Voices (9, single-speaker)
Hosted at `LocalAI-Community/piper-voices-GGUF`, converted from rhasspy/piper-voices with CrispASR's `models/convert-piper-to-gguf.py`.

Only single-speaker, low/medium voices are included: the CrispASR piper engine currently segfaults on multi-speaker models (mls, thorsten_emotional, libritts_r) and `high`-quality decoders (thorsten-high).

Verification
Built the crispasr image (`make docker-build-crispasr`), extracted its package, and confirmed every voice synthesizes a WAV at the model's native sample rate using only the image-bundled espeak payload (the build host has no system espeak-ng).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
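One way to spot-check the sample-rate part of the verification is to read the rate straight from the WAV header. The helper below is a minimal sketch and not the procedure used for this PR. The name `wav_sample_rate` is hypothetical, and it assumes a canonical RIFF/WAVE file whose `fmt ` chunk directly follows the 12-byte RIFF header, which puts the sample rate at byte offset 24.

```shell
#!/bin/sh
# Hypothetical helper; assumes a canonical WAV header layout
# (RIFF header, then the "fmt " chunk, with no chunks in between).

# wav_sample_rate FILE
# Prints the sample rate stored in bytes 24..27 of the header.
wav_sample_rate() {
    # Read the four bytes individually as unsigned decimals and replace
    # the positional parameters with them, so the little-endian value
    # can be assembled without depending on host byte order.
    set -- $(od -An -tu1 -j24 -N4 "$1")
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}
```

Comparing the printed value with the sample rate declared for the voice (for example in its model metadata) shows whether the synthesized file is at the model's native rate. A file shorter than 28 bytes makes the arithmetic fail instead of printing a number.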