feat(crispasr): add CrispASR backend — multi-architecture ASR + TTS by localai-bot · Pull Request #10099 · mudler/LocalAI

localai-bot · 2026-05-30T23:32:02Z

Summary

Adds a new crispasr Go backend wrapping CrispASR (a whisper.cpp fork on ggml, MIT) via its C-ABI loaded with purego (CGO_ENABLED=0). Supports CrispASR's many ASR architectures plus TTS, behind LocalAI's standard transcription / speech RPCs. One binary serves ASR or TTS depending on the loaded model; architecture is auto-detected from the GGUF (or forced via an option). Does not modify the existing whisper backend.

ASR

CrispASR session API (crispasr_session_open / crispasr_session_transcribe_lang) — not the whisper-only whisper_full() path; CMake links the full backend set.
AudioTranscription + AudioTranscriptionStream.
e2e: Whisper, Parakeet (auto-detect), Moonshine (explicit backend: + tokenizer companion) all transcribed jfk.wav correctly.

TTS

TTS / TTSStream via crispasr_session_synthesize → 24 kHz mono WAV at req.Dst.
All 4 gallery TTS engines e2e-verified (real models → valid 24 kHz/mono/16-bit WAVs): vibevoice (built-in voice), chatterbox (built-in voice, needs codec:<s3gen>), qwen3-tts CustomVoice (backend:qwen3-tts+codec:<tokenizer>+speaker:vivian), orpheus (codec:<snac>+speaker:tara).
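
As an illustration of the output format named above (24 kHz, mono, 16-bit WAV), here is a minimal sketch that encodes float PCM into a WAV byte slice using only the Go standard library. The function name `encodeWAV24k` and the clamping policy are assumptions for this example; the backend itself writes its file through go-audio/wav, as the commit messages further down state.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

const sampleRate = 24000 // Hz, the rate quoted in the PR description

// encodeWAV24k converts float PCM in [-1, 1] into a 16-bit mono WAV.
// Hypothetical helper: it only illustrates the container layout.
func encodeWAV24k(pcm []float32) []byte {
	var buf bytes.Buffer
	le := binary.LittleEndian
	dataLen := uint32(len(pcm) * 2) // 2 bytes per 16-bit sample

	buf.WriteString("RIFF")
	binary.Write(&buf, le, 36+dataLen) // size of everything after this field
	buf.WriteString("WAVEfmt ")
	binary.Write(&buf, le, uint32(16))           // fmt chunk size
	binary.Write(&buf, le, uint16(1))            // audio format: PCM
	binary.Write(&buf, le, uint16(1))            // channels: mono
	binary.Write(&buf, le, uint32(sampleRate))   // sample rate
	binary.Write(&buf, le, uint32(sampleRate*2)) // byte rate
	binary.Write(&buf, le, uint16(2))            // block align
	binary.Write(&buf, le, uint16(16))           // bits per sample
	buf.WriteString("data")
	binary.Write(&buf, le, dataLen)

	for _, s := range pcm {
		// Clamp so out-of-range floats do not wrap around as int16.
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		binary.Write(&buf, le, int16(s*32767))
	}
	return buf.Bytes()
}

func main() {
	wav := encodeWAV24k([]float32{0, 0.5, -0.5})
	// 44-byte header plus 6 bytes of sample data.
	fmt.Println(len(wav), string(wav[:4]))
}
```

Writes to a `bytes.Buffer` do not fail for fixed-size values, which is why the sketch ignores the `binary.Write` return values; code writing to a real file should check them.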

Model options (existing `options:` config)

backend:<name> — force an explicit CrispASR architecture.
codec:<file> — load a companion (tokenizer / SNAC / s3gen) via crispasr_session_set_codec_path.
speaker:<name> — select a baked speaker (crispasr_session_set_speaker_name).
voice:<path> (+voice_text:<ref>) — load a voice pack GGUF, or a WAV zero-shot clone (crispasr_session_set_voice).
Relative codec:/voice: paths resolve against the model dir. All e2e-verified.
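
The option handling described above can be sketched roughly as follows. This is an assumption-laden illustration, not the backend's actual code: the `modelOptions` struct, `parseOptions`, and `resolve` are hypothetical names, and only the option prefixes and the "relative paths resolve against the model dir" rule come from this PR.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// modelOptions collects the option kinds listed in the PR description.
// The type and its field names are illustrative only.
type modelOptions struct {
	Backend   string
	Codec     string
	Speaker   string
	Voice     string
	VoiceText string
}

// parseOptions splits each "prefix:value" entry on the first colon only,
// so values that themselves contain colons are kept intact.
func parseOptions(opts []string, modelDir string) modelOptions {
	var out modelOptions
	for _, o := range opts {
		key, val, ok := strings.Cut(o, ":")
		if !ok || val == "" {
			continue // not a prefix:value option
		}
		switch key {
		case "backend":
			out.Backend = val
		case "codec":
			out.Codec = resolve(modelDir, val)
		case "speaker":
			out.Speaker = val
		case "voice":
			out.Voice = resolve(modelDir, val)
		case "voice_text":
			out.VoiceText = val
		}
	}
	return out
}

// resolve anchors a relative path at the model directory and leaves
// absolute paths untouched.
func resolve(modelDir, p string) string {
	if filepath.IsAbs(p) {
		return p
	}
	return filepath.Join(modelDir, p)
}

func main() {
	o := parseOptions([]string{"backend:qwen3-tts", "codec:tokenizer.gguf", "speaker:vivian"}, "/models")
	fmt.Printf("%+v\n", o)
}
```

Unknown prefixes are silently skipped in this sketch; whether the real backend logs or rejects them is not stated in the PR.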

Model gallery — 36 `-crispasr` entries (32 ASR + 4 TTS), virtual-model style

Inline in gallery/index.yaml against the shared virtual.yaml base (no per-model files). Every entry verified reachable; companion files wired from CrispASR's registry; each TTS entry carries the verified options: set.

ASR: parakeet (+8), canary, cohere, qwen3 (+1.7b), voxtral, voxtral4b, granite (+3), fastconformer-ctc, wav2vec2 (+de), vibevoice, hubert, data2vec, glm-asr, kyutai-stt, firered-asr, moonshine (+de/tiny-de/streaming), mimo-asr.
TTS: vibevoice-tts, chatterbox-tts, qwen3-tts-customvoice, orpheus-tts.
Excluded (no working load path in v0.6.11): funasr/paraformer/sensevoice/omniasr/indextts/voxcpm2/mega-asr/kokoro.

Plumbing

Pinned to CrispASR v0.6.11; bump_deps tracks upstream. Idempotent Makefile patch: ${CMAKE_SOURCE_DIR} → ${PROJECT_SOURCE_DIR} so vendored llama.cpp resolves under add_subdirectory.
Full whisper-equivalent CI build matrix; backend gallery entries in backend/index.yaml; pref-only entry in GET /backends/known.
Gated behavioral tests (transcription + TTS synthesis), env-gated per the kokoros convention (CI runs unit tests only).
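
A rough sketch of what env-gating can look like: the suite checks for fixture variables and skips when any is unset. The variable names `CRISPASR_MODEL_PATH` and `CRISPASR_AUDIO_PATH` are hypothetical placeholders (the PR only says the variables use a `CRISPASR_*` prefix), and `missingFixtures` is an illustrative helper, not a function from the test suite.

```go
package main

import (
	"fmt"
	"os"
)

// missingFixtures returns the names whose value is empty according to the
// supplied lookup function. Passing the lookup in keeps the helper testable.
func missingFixtures(getenv func(string) string, names ...string) []string {
	var missing []string
	for _, n := range names {
		if getenv(n) == "" {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	// Hypothetical variable names; the real suite may use different ones.
	m := missingFixtures(os.Getenv, "CRISPASR_MODEL_PATH", "CRISPASR_AUDIO_PATH")
	if len(m) > 0 {
		fmt.Println("skipping behavioral specs, unset:", m)
		return
	}
	fmt.Println("fixtures present, running behavioral specs")
}
```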

Test Plan

golangci-lint/go vet/build clean; suites skip cleanly without fixtures (CI: golangci-lint + Yamllint + build-test all green)
Runtime e2e: 3 ASR architectures (whisper/parakeet/moonshine) transcribe; all 4 TTS engines (vibevoice/chatterbox/qwen3-tts-customvoice/orpheus) synthesize valid WAVs — against real GGUFs
CI backend-matrix image build across all platforms

Follow-ups

Phase 3: dedicated Diarize / VAD / VoiceEmbed/VoiceVerify RPCs; real-time streaming via crispasr_stream_*
kokoro voicepack; qwen3-tts base WAV-clone presets; more architectures as upstream extends the session router

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

…ld files) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Mirror the whisper Go backend registration for the new crispasr backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks, BACKEND_CRISPASR definition, docker-build target generation, and the docker-build-backends aggregate target. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64, CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Add the crispasr meta anchor and its full set of image gallery entries (cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64, L4T cuda13 arm64, plus -development variants), mirroring the whisper backend gallery block. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in backend/go/crispasr/Makefile. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Mirror the whisper Go backend: its AudioTranscription test is gated on model/audio fixtures and skips in CI, so building crispasr (the heaviest ggml compile in the tree) inside the unit-test lane adds a long compile for zero coverage. The backend image build in backend-matrix.yml remains the authoritative compile check. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

The metal-crispasr gallery entries and capabilities.metal mapping reference -metal-darwin-arm64-crispasr, which is only produced by an includeDarwin entry. Mirror whisper's darwin metal entry so the tag actually gets built. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

The shim used whisper_full(), which in CrispASR is the whisper-only path: libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR, Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI, which auto-detects the architecture from the GGUF and dispatches to the matching backend.

Rewrite the C shim around crispasr_session_open / _transcribe_lang / _result_* and add get_backend() so the selected backend is logged. load_model now takes a threads param (session_open binds n_threads at open).

The session result is segment+word based with no token IDs and no per-decode callback, so drop n_tokens / get_token_id / get_segment_speaker_turn_next / set_new_segment_callback. set_abort is kept for API parity but is best-effort: the session transcribe is blocking with no abort hook.

Update the purego bindings and gocrispasr.go to match: tokens are left empty, speaker-turn handling is removed, and AudioTranscriptionStream emits one delta per non-empty segment after the blocking decode returns (no progressive streaming via the session API), preserving the concat(deltas) == final.Text invariant.

crispasr_session_set_translate is exported by libcrispasr but not declared in crispasr.h, so it is forward-declared in the shim alongside the open/transcribe/result functions.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
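
One way to keep the concat(deltas) == final.Text invariant with per-segment emission is to let each delta carry its own separator and to build the final text from the deltas themselves. The sketch below assumes segments arrive as plain strings and are joined with single spaces; `segmentDeltas` is a hypothetical name and the real joining rule in gocrispasr.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// segmentDeltas returns one delta per non-empty segment plus the final text.
// The final text is the concatenation of the deltas, so the invariant
// concat(deltas) == final holds by construction.
func segmentDeltas(segments []string) (deltas []string, final string) {
	for _, s := range segments {
		s = strings.TrimSpace(s)
		if s == "" {
			continue // empty segments produce no delta
		}
		if len(deltas) > 0 {
			s = " " + s // the separator travels with the delta
		}
		deltas = append(deltas, s)
	}
	return deltas, strings.Join(deltas, "")
}

func main() {
	deltas, final := segmentDeltas([]string{"And so", "", " my fellow Americans "})
	fmt.Printf("%q\n%q\n", deltas, final)
}
```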

The shim's crispasr_session_* dispatch calls into the per-architecture backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer, sensevoice, ...), which CrispASR builds as static archives. Linking only crispasr + ggml dead-stripped every backend object from the final module (nm backend-symbol count: 0), leaving a whisper-only .so.

Link the same backend set as crispasr-cli so the static archives are pulled in. After this the module carries the backend symbols (nm count 407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to every compiled-in architecture.

Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to ${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone and as a subproject; the sed is idempotent.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
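
The idempotency claim for the path rewrite can be illustrated in Go, even though the PR performs it with sed in the Makefile. The sketch assumes the rewrite is a plain substring replacement; `patchCMake` is a hypothetical name and the sample CMake line is made up for the demonstration.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	oldRef = "${CMAKE_SOURCE_DIR}/examples/talk-llama"
	newRef = "${PROJECT_SOURCE_DIR}/examples/talk-llama"
)

// patchCMake replaces every occurrence of the old reference. Because the
// replacement text no longer contains oldRef, a second run changes nothing.
func patchCMake(src string) string {
	return strings.ReplaceAll(src, oldRef, newRef)
}

func main() {
	// Illustrative input line, not copied from CrispASR's CMakeLists.txt.
	src := "set(LLAMA_DIR ${CMAKE_SOURCE_DIR}/examples/talk-llama)"
	once := patchCMake(src)
	twice := patchCMake(once)
	fmt.Println(once)
	fmt.Println(once == twice)
}
```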

…ack) Register the new symbol set (drop the removed token/speaker/callback funcs, add get_backend; load_model now takes 2 args). The session transcribe is blocking with no abort hook, so a mid-decode cancel can't interrupt it: change the cancellation spec to cancel the context before the call and assert codes.Canceled from the pre-call ctx.Err() check, dropping the <5s mid-decode timing assertion. The streaming spec still holds with per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text). Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
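
The pre-call cancellation check that the revised spec asserts on can be sketched as below. This is a simplified stand-in using only the standard library: it returns an error wrapping `context.Canceled`, whereas the backend reports the gRPC status codes.Canceled. The names `transcribe` and `demoCanceled` are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// transcribe checks the context before entering the blocking decode.
// Once decode starts it cannot be interrupted, so this is the only
// point where a cancellation is honored in this sketch.
func transcribe(ctx context.Context, decode func() string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", fmt.Errorf("transcription canceled before decode: %w", err)
	}
	return decode(), nil
}

// demoCanceled cancels the context first, then calls transcribe. It reports
// whether the error wraps context.Canceled and whether decode was invoked.
func demoCanceled() (isCanceled bool, decodeCalled bool) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := transcribe(ctx, func() string {
		decodeCalled = true
		return "text"
	})
	return errors.Is(err, context.Canceled), decodeCalled
}

func main() {
	fmt.Println(demoCanceled()) // prints: true false
}
```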

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

The crispasr backend loads models via crispasr_session_open, which auto-detects the backend from the GGUF general.architecture using crispasr_detect_backend_from_gguf. Architectures not in that detect map cannot be opened, so those gallery entries fail to load.

Removed entries whose architecture is not wired into CrispASR v0.6.11's session auto-detect router (they can be re-added when upstream maps them):

- Not in the detect map: data2vec, firered-asr, funasr, fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr, moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b}, paraformer, sensevoice.
- Pending verification (filename-heuristic routed, not arch-detected): parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the fastconformer-ctc backend by a filename heuristic in the model registry, which implies general.architecture is not a mapped string.

Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py writes general.architecture="parakeet" unconditionally and encodes the rnnt/ctc distinction in metadata fields, so they session-auto-detect.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
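
To make the failure mode concrete, here is a toy model of an architecture detect map. It is not CrispASR's detector: the map contents are placeholders (only the "parakeet" architecture string is named in this commit message), and `detectBackend` is a hypothetical function.

```go
package main

import "fmt"

// detectMap stands in for the detector's architecture table.
// "parakeet" comes from the commit message; "whisper" is an assumed entry.
var detectMap = map[string]string{
	"parakeet": "parakeet",
	"whisper":  "whisper",
}

// detectBackend maps a GGUF general.architecture value to a backend name,
// failing for any architecture the table does not list.
func detectBackend(arch string) (string, error) {
	backend, ok := detectMap[arch]
	if !ok {
		return "", fmt.Errorf("architecture %q is not in the detect map", arch)
	}
	return backend, nil
}

func main() {
	for _, arch := range []string{"parakeet", "some-unmapped-arch"} {
		backend, err := detectBackend(arch)
		fmt.Printf("%s -> %q, err=%v\n", arch, backend, err)
	}
}
```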

Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse the already-open g_session (crispasr_session_open auto-detects a TTS model) and dispatch to the upstream synthesis call, which returns malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we do not set, so it returns NULL here and surfaces as an error Go-side. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Bind the new shim functions via purego and implement TTS, TTSStream and a writeWAV24k helper. synthesize copies the C-owned PCM out before freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via go-audio/wav. CrispASR has no progressive synth, so TTSStream synthesizes fully, encodes to WAV, and emits the bytes as a single chunk; it owns the results-channel close (the gRPC server wrapper ranges until close), mirroring vibevoice-cpp's TTSStream. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
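
The single-chunk streaming pattern with producer-owned channel close might look like this in isolation. The sketch assumes the consumer ranges over the channel until it is closed, as the commit message describes for the gRPC server wrapper; `streamOnce`, `collect`, and `errSynth` are hypothetical names.

```go
package main

import "fmt"

// errSynth is a stand-in failure used to exercise the error path.
var errSynth = fmt.Errorf("synthesis failed")

// streamOnce synthesizes everything up front, sends the encoded audio as
// one chunk, and closes the channel on every path so a ranging consumer
// always terminates.
func streamOnce(synth func() ([]byte, error), results chan<- []byte) error {
	defer close(results) // the producer owns the close
	audio, err := synth()
	if err != nil {
		return err
	}
	results <- audio
	return nil
}

// collect plays the consumer: it ranges until close and counts chunks.
func collect(synth func() ([]byte, error)) (chunks, total int, err error) {
	results := make(chan []byte)
	done := make(chan error, 1)
	go func() { done <- streamOnce(synth, results) }()
	for c := range results {
		chunks++
		total += len(c)
	}
	return chunks, total, <-done
}

func main() {
	chunks, total, err := collect(func() ([]byte, error) {
		return []byte("RIFF....WAVE"), nil
	})
	fmt.Println(chunks, total, err)
}
```

Closing in a `defer` covers the error path as well; without it, a failed synthesis would leave the consumer blocked on the range loop.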

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox, and orpheus require companion codec/s3gen/SNAC paths (set_codec_path / set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2 aren't in the session auto-detect map. Those are follow-ups. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…int) The crispasr Go file is entirely new, so new-from-merge-base lints every line (unlike the grandfathered whisper backend it was forked from):

- handle os.RemoveAll / fh.Close return values in AudioTranscription
- annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…mpanion files) Add two model-config options to the CrispASR backend via opts.Options:

- backend:<name> selects an explicit CrispASR backend (bypassing auto-detect) by routing load_model through crispasr_session_open_explicit, unlocking architectures the detector won't pick on its own (qwen3, cohere, granite, voxtral, moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.).
- codec:<path> loads a companion file (qwen3-tts codec, orpheus SNAC, chatterbox s3gen, or mimo-asr tokenizer) via the universal crispasr_session_set_codec_path setter after the session opens. A relative path resolves against the model directory. rc==0 means success or not-applicable; only a negative rc is fatal.

The C shim load_model gains a backend_name argument and a new set_codec_path entry point; the Go bridge parses the prefix:value options and registers the new symbol. The vad_only path is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
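
The two decisions in this commit, which open call to route through and how to read the codec setter's return code, can be sketched as follows. The helper names `openEntryPoint` and `codecError` are hypothetical, and treating a positive rc as non-fatal is an assumption: the commit message only defines rc==0 and negative values.

```go
package main

import "fmt"

// openEntryPoint names the C-ABI function a load would be routed through:
// the explicit variant when a backend was forced, auto-detect otherwise.
func openEntryPoint(backendName string) string {
	if backendName != "" {
		return "crispasr_session_open_explicit"
	}
	return "crispasr_session_open"
}

// codecError interprets the setter's return code. Zero means success or
// not-applicable; only negative values are fatal. Positive values are
// treated as non-fatal here, which is an assumption of this sketch.
func codecError(rc int, path string) error {
	if rc < 0 {
		return fmt.Errorf("crispasr: loading codec %q failed (rc=%d)", path, rc)
	}
	return nil
}

func main() {
	fmt.Println(openEntryPoint(""))
	fmt.Println(openEntryPoint("moonshine"))
	fmt.Println(codecError(0, "snac.gguf"))
	fmt.Println(codecError(-1, "snac.gguf"))
}
```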

…plicit arch + companions) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

The crispasr entries are just backend + model + a couple options, fully expressed inline via overrides:/files: in gallery/index.yaml. Point each url: at the shared gallery/virtual.yaml (the established 'virtual' model trick) and drop the 36 redundant per-model gallery/*-crispasr.yaml files. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Real e2e showed qwen3-tts/orpheus/chatterbox don't synthesize through the current shim: the codec: companion loads fine, but these engines additionally need a voice pack / voice prompt / reference clip (qwen3-tts base errors 'no voice'; chatterbox is zero-shot cloning; orpheus uses named voices) that the backend doesn't wire. (qwen3-tts also can't auto-detect: its GGUF arch is 'qwen3tts', unmapped by the detector — would need backend:qwen3-tts.) Removed to avoid shipping non-working gallery entries; vibevoice-tts (built-in voice, e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…ce packs/prompts) speaker:<name> -> crispasr_session_set_speaker_name (baked speakers: qwen3-tts CustomVoice, orpheus). voice:<path>(+voice_text:<ref>) -> crispasr_session_set_voice (voice-pack GGUF, or WAV zero-shot clone with ref text). Applied at Load as the default voice; req.Voice still overrides the speaker per request. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
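
The precedence rule stated here (the Load-time option is the default, req.Voice overrides it per request) reduces to a small selection function. `effectiveSpeaker` is a hypothetical name, and treating an empty request voice as "no override" is an assumption of this sketch.

```go
package main

import "fmt"

// effectiveSpeaker returns the per-request voice when one is given and
// falls back to the speaker configured at load time otherwise.
func effectiveSpeaker(loadDefault, requestVoice string) string {
	if requestVoice != "" {
		return requestVoice
	}
	return loadDefault
}

func main() {
	fmt.Println(effectiveSpeaker("tara", ""))
	fmt.Println(effectiveSpeaker("tara", "vivian"))
}
```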

…-customvoice, orpheus) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler added 22 commits May 30, 2026 22:49

feat(crispasr): backend source files (Go gRPC server, C-ABI shim, bui…

8c5f8e9

…ld files) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

polish(crispasr): brand error strings + fix stale shim comment

2336043

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

ci(crispasr): bump CRISPASR_VERSION via bump_deps workflow

96f5d63

Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in backend/go/crispasr/Makefile. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

ci(crispasr): place hipblas matrix entry next to whisper twin

47e022f

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

feat(crispasr): register crispasr as pref-only ASR backend + test

d2579c0

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

test(crispasr): port whisper behavioral suite (cancellation + streaming)

0f3721d

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

test(crispasr): fix skip message env var names to CRISPASR_*

7c1b4fe

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

feat(gallery): add CrispASR ASR model entries (-crispasr)

8dfe33b

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

feat(crispasr): log when a TTS voice override is not honored

4f66ecf

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

test(crispasr): gated TTS synthesis spec

10e8100

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

localai-bot mentioned this pull request May 31, 2026

feat(crispasr): TTS support (Phase 2) — vibevoice synthesis #10102

Merged

3 tasks

mudler changed the title ~~feat(crispasr): add CrispASR ASR backend (Phase 0+1: scaffolding + transcription)~~ feat(crispasr): add CrispASR ASR/TTS backend May 31, 2026

localai-bot changed the title ~~feat(crispasr): add CrispASR ASR/TTS backend~~ feat(crispasr): add CrispASR backend — multi-architecture ASR + TTS May 31, 2026

mudler added 4 commits May 31, 2026 07:38

feat(gallery): expand CrispASR models via backend:/codec: options (ex…

9d019ca

…plicit arch + companions) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler added 2 commits May 31, 2026 08:22

feat(gallery): re-add e2e-verified TTS engines (chatterbox, qwen3-tts…

5f208a4

…-customvoice, orpheus) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler merged commit 76fe0bb into master May 31, 2026
68 checks passed

mudler deleted the feat/crispasr-backend branch May 31, 2026 10:11

localai-bot added the enhancement New feature or request label Jun 10, 2026

BrewTestBot mentioned this pull request Jun 10, 2026

localai 4.4.0 Homebrew/homebrew-core#287347

Merged

mudler merged 29 commits into master from feat/crispasr-backend
