
feat(crispasr): add CrispASR backend — multi-architecture ASR + TTS #10099

Merged
mudler merged 29 commits into master from feat/crispasr-backend
May 31, 2026

Conversation


@localai-bot localai-bot commented May 30, 2026

Collaborator

Summary

Adds a new crispasr Go backend wrapping CrispASR (a whisper.cpp fork on ggml, MIT) via its C-ABI loaded with purego (CGO_ENABLED=0). Supports CrispASR's many ASR architectures plus TTS, behind LocalAI's standard transcription / speech RPCs. One binary serves ASR or TTS depending on the loaded model; architecture is auto-detected from the GGUF (or forced via an option). Does not modify the existing whisper backend.

ASR

  • CrispASR session API (crispasr_session_open / crispasr_session_transcribe_lang) — not the whisper-only whisper_full() path; CMake links the full backend set.
  • AudioTranscription + AudioTranscriptionStream.
  • e2e: Whisper, Parakeet (auto-detect), Moonshine (explicit backend: + tokenizer companion) all transcribed jfk.wav correctly.

TTS

  • TTS / TTSStream via crispasr_session_synthesize → 24 kHz mono WAV at req.Dst.
  • All 4 gallery TTS engines e2e-verified (real models → valid 24 kHz/mono/16-bit WAVs): vibevoice (built-in voice), chatterbox (built-in voice, needs codec:<s3gen>), qwen3-tts CustomVoice (backend:qwen3-tts+codec:<tokenizer>+speaker:vivian), orpheus (codec:<snac>+speaker:tara).

Model options (existing options: config)

  • backend:<name> — force an explicit CrispASR architecture.
  • codec:<file> — load a companion (tokenizer / SNAC / s3gen) via crispasr_session_set_codec_path.
  • speaker:<name> — select a baked speaker (crispasr_session_set_speaker_name).
  • voice:<path> (+voice_text:<ref>) — load a voice pack GGUF, or a WAV zero-shot clone (crispasr_session_set_voice).
    Relative codec:/voice: paths resolve against the model dir. All e2e-verified.
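
The option handling described above can be sketched roughly as follows. This is a minimal illustration, not the backend's code: the `modelOptions` struct, the `parseOptions` and `resolve` helpers, and the file names in `main` are all hypothetical. It only shows the `prefix:value` split and the rule that relative `codec:` / `voice:` paths resolve against the model directory.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// modelOptions holds the parsed prefix:value options. The struct and its
// field names are illustrative, not the backend's actual types.
type modelOptions struct {
	Backend, Codec, Speaker, Voice, VoiceText string
}

// parseOptions splits each entry on its first ':' and resolves relative
// codec:/voice: paths against modelDir. Unknown prefixes are ignored.
func parseOptions(opts []string, modelDir string) modelOptions {
	var o modelOptions
	for _, opt := range opts {
		key, val, ok := strings.Cut(opt, ":")
		if !ok || val == "" {
			continue
		}
		switch key {
		case "backend":
			o.Backend = val
		case "codec":
			o.Codec = resolve(modelDir, val)
		case "speaker":
			o.Speaker = val
		case "voice":
			o.Voice = resolve(modelDir, val)
		case "voice_text":
			o.VoiceText = val
		}
	}
	return o
}

// resolve leaves absolute paths alone and joins relative ones onto dir.
func resolve(dir, p string) string {
	if filepath.IsAbs(p) {
		return p
	}
	return filepath.Join(dir, p)
}

func main() {
	// "tokenizer.gguf" and "/models" are placeholder names.
	o := parseOptions([]string{"backend:qwen3-tts", "codec:tokenizer.gguf", "speaker:vivian"}, "/models")
	fmt.Printf("%+v\n", o)
}
```

Splitting on the first colon only means a `voice_text:` value may itself contain colons.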

Model gallery — 36 -crispasr entries (32 ASR + 4 TTS), virtual-model style

Inline in gallery/index.yaml against the shared virtual.yaml base (no per-model files). Every entry verified reachable; companion files wired from CrispASR's registry; each TTS entry carries the verified options: set.

  • ASR: parakeet (+8), canary, cohere, qwen3 (+1.7b), voxtral, voxtral4b, granite (+3), fastconformer-ctc, wav2vec2 (+de), vibevoice, hubert, data2vec, glm-asr, kyutai-stt, firered-asr, moonshine (+de/tiny-de/streaming), mimo-asr.
  • TTS: vibevoice-tts, chatterbox-tts, qwen3-tts-customvoice, orpheus-tts.
  • Excluded (no working load path in v0.6.11): funasr/paraformer/sensevoice/omniasr/indextts/voxcpm2/mega-asr/kokoro.

Plumbing

  • Pinned to CrispASR v0.6.11; bump_deps tracks upstream. Idempotent Makefile patch: ${CMAKE_SOURCE_DIR} → ${PROJECT_SOURCE_DIR} so vendored llama.cpp resolves under add_subdirectory.
  • Full whisper-equivalent CI build matrix; backend gallery entries in backend/index.yaml; pref-only entry in GET /backends/known.
  • Gated behavioral tests (transcription + TTS synthesis), env-gated per the kokoros convention (CI runs unit tests only).

Test Plan

  • golangci-lint/go vet/build clean; suites skip cleanly without fixtures (CI: golangci-lint + Yamllint + build-test all green)
  • Runtime e2e: 3 ASR architectures (whisper/parakeet/moonshine) transcribe; all 4 TTS engines (vibevoice/chatterbox/qwen3-tts-customvoice/orpheus) synthesize valid WAVs — against real GGUFs
  • CI backend-matrix image build across all platforms

Follow-ups

  • Phase 3: dedicated Diarize / VAD / VoiceEmbed/VoiceVerify RPCs; real-time streaming via crispasr_stream_*
  • kokoro voicepack; qwen3-tts base WAV-clone presets; more architectures as upstream extends the session router

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

mudler added 22 commits May 30, 2026 22:49
…ld files)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror the whisper Go backend registration for the new crispasr
backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks,
BACKEND_CRISPASR definition, docker-build target generation, and the
docker-build-backends aggregate target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64,
CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T
arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add the crispasr meta anchor and its full set of image gallery entries
(cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64,
L4T cuda13 arm64, plus -development variants), mirroring the whisper
backend gallery block.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in
backend/go/crispasr/Makefile.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror the whisper Go backend: its AudioTranscription test is gated on
model/audio fixtures and skips in CI, so building crispasr (the heaviest
ggml compile in the tree) inside the unit-test lane adds a long compile
for zero coverage. The backend image build in backend-matrix.yml remains
the authoritative compile check.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The metal-crispasr gallery entries and capabilities.metal mapping
reference -metal-darwin-arm64-crispasr, which is only produced by an
includeDarwin entry. Mirror whisper's darwin metal entry so the tag
actually gets built.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The shim used whisper_full(), which in CrispASR is the whisper-only path:
libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture
transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR,
Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI,
which auto-detects the architecture from the GGUF and dispatches to the
matching backend.

Rewrite the C shim around crispasr_session_open / _transcribe_lang /
_result_* and add get_backend() so the selected backend is logged.
load_model now takes a threads param (session_open binds n_threads at
open). The session result is segment+word based with no token IDs and no
per-decode callback, so drop n_tokens / get_token_id /
get_segment_speaker_turn_next / set_new_segment_callback. set_abort is
kept for API parity but is best-effort: the session transcribe is blocking
with no abort hook.

Update the purego bindings and gocrispasr.go to match: tokens are left
empty, speaker-turn handling is removed, and AudioTranscriptionStream
emits one delta per non-empty segment after the blocking decode returns
(no progressive streaming via the session API), preserving the
concat(deltas) == final.Text invariant.
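
A rough sketch of that per-segment emission, assuming the segments arrive as plain strings after the blocking decode returns. The `segmentDeltas` helper is hypothetical; it builds the final text from the same deltas, which is one simple way to make the invariant hold by construction.

```go
package main

import (
	"fmt"
	"strings"
)

// segmentDeltas returns one delta per non-empty segment plus the final text.
// The final text is built from the emitted deltas, so concatenating the
// deltas always reproduces it exactly.
func segmentDeltas(segments []string) (deltas []string, final string) {
	var sb strings.Builder
	for _, seg := range segments {
		if strings.TrimSpace(seg) == "" {
			continue // skip empty or whitespace-only segments
		}
		deltas = append(deltas, seg)
		sb.WriteString(seg)
	}
	return deltas, sb.String()
}

func main() {
	deltas, final := segmentDeltas([]string{" And so,", "", " my fellow Americans"})
	fmt.Println(len(deltas), strings.Join(deltas, "") == final)
}
```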

crispasr_session_set_translate is exported by libcrispasr but not declared
in crispasr.h, so it is forward-declared in the shim alongside the
open/transcribe/result functions.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The shim's crispasr_session_* dispatch calls into the per-architecture
backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer,
sensevoice, ...), which CrispASR builds as static archives. Linking only
crispasr + ggml dead-stripped every backend object from the final module
(nm backend-symbol count: 0), leaving a whisper-only .so.

Link the same backend set as crispasr-cli so the static archives are
pulled in. After this the module carries the backend symbols (nm count
407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to
every compiled-in architecture.

Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to
${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR
locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when
CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend
dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone
and as a subproject; the sed is idempotent.
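
The commit does this rewrite with sed; the Go restatement below is only meant to show why it is idempotent. The `rewriteSourceDir` function is hypothetical. Because the replacement text no longer contains the search pattern, a second pass finds nothing to change.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteSourceDir swaps the CMAKE_SOURCE_DIR-based talk-llama path for the
// PROJECT_SOURCE_DIR-based one. Re-running it on its own output is a no-op.
func rewriteSourceDir(cmake string) string {
	return strings.ReplaceAll(cmake,
		"${CMAKE_SOURCE_DIR}/examples/talk-llama",
		"${PROJECT_SOURCE_DIR}/examples/talk-llama")
}

func main() {
	in := "add_subdirectory(${CMAKE_SOURCE_DIR}/examples/talk-llama/llama)"
	once := rewriteSourceDir(in)
	twice := rewriteSourceDir(once)
	fmt.Println(once == twice)
}
```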

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ack)

Register the new symbol set (drop the removed token/speaker/callback funcs,
add get_backend; load_model now takes 2 args). The session transcribe is
blocking with no abort hook, so a mid-decode cancel can't interrupt it:
change the cancellation spec to cancel the context before the call and
assert codes.Canceled from the pre-call ctx.Err() check, dropping the
<5s mid-decode timing assertion. The streaming spec still holds with
per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The crispasr backend loads models via crispasr_session_open, which
auto-detects the backend from the GGUF general.architecture using
crispasr_detect_backend_from_gguf. Architectures not in that detect
map cannot be opened, so those gallery entries fail to load.

Removed entries whose architecture is not wired into CrispASR
v0.6.11's session auto-detect router (they can be re-added when
upstream maps them):

- Not in the detect map: data2vec, firered-asr, funasr,
  fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr,
  moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b},
  paraformer, sensevoice.
- Pending verification (filename-heuristic routed, not arch-detected):
  parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the
  fastconformer-ctc backend by a filename heuristic in the model
  registry, which implies general.architecture is not a mapped string.

Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py
writes general.architecture="parakeet" unconditionally and encodes the
rnnt/ctc distinction in metadata fields, so they session-auto-detect.
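
As an illustration of why unmapped architectures fail to load, here is a minimal lookup sketch. The `detectMap` table and `pickBackend` function are hypothetical stand-ins for the upstream detector; only the `"parakeet"` entry comes from the text above, and the real table is larger.

```go
package main

import "fmt"

// detectMap is a hypothetical stand-in for the upstream detect table.
// Only the "parakeet" entry is taken from the commit message.
var detectMap = map[string]string{
	"parakeet": "parakeet",
}

// pickBackend returns the backend for a GGUF general.architecture string,
// or an error when the architecture is unmapped and cannot be auto-opened.
func pickBackend(arch string) (string, error) {
	if b, ok := detectMap[arch]; ok {
		return b, nil
	}
	return "", fmt.Errorf("architecture %q is not in the detect map", arch)
}

func main() {
	for _, arch := range []string{"parakeet", "some-unmapped-arch"} {
		b, err := pickBackend(arch)
		fmt.Println(b, err)
	}
}
```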

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse
the already-open g_session (crispasr_session_open auto-detects a TTS
model) and dispatch to the upstream synthesis call, which returns
malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we
do not set, so it returns NULL here and surfaces as an error Go-side.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Bind the new shim functions via purego and implement TTS, TTSStream and
a writeWAV24k helper. synthesize copies the C-owned PCM out before
freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via
go-audio/wav. CrispASR has no progressive synth, so TTSStream
synthesizes fully, encodes to WAV, and emits the bytes as a single
chunk; it owns the results-channel close (the gRPC server wrapper ranges
until close), mirroring vibevoice-cpp's TTSStream.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox,
and orpheus require companion codec/s3gen/SNAC paths (set_codec_path /
set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2
aren't in the session auto-detect map. Those are follow-ups.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…int)

The crispasr Go file is entirely new, so new-from-merge-base lints every
line (unlike the grandfathered whisper backend it was forked from):
- handle os.RemoveAll / fh.Close return values in AudioTranscription
- annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler changed the title from "feat(crispasr): add CrispASR ASR backend (Phase 0+1: scaffolding + transcription)" to "feat(crispasr): add CrispASR ASR/TTS backend" May 31, 2026
@localai-bot localai-bot changed the title from "feat(crispasr): add CrispASR ASR/TTS backend" to "feat(crispasr): add CrispASR backend — multi-architecture ASR + TTS" May 31, 2026
mudler added 4 commits May 31, 2026 07:38
…mpanion files)

Add two model-config options to the CrispASR backend via opts.Options:

- backend:<name> selects an explicit CrispASR backend (bypassing
  auto-detect) by routing load_model through
  crispasr_session_open_explicit, unlocking architectures the
  detector won't pick on its own (qwen3, cohere, granite, voxtral,
  moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.).
- codec:<path> loads a companion file (qwen3-tts codec, orpheus SNAC,
  chatterbox s3gen, or mimo-asr tokenizer) via the universal
  crispasr_session_set_codec_path setter after the session opens. A
  relative path resolves against the model directory. rc==0 means
  success or not-applicable; only a negative rc is fatal.

The C shim load_model gains a backend_name argument and a new
set_codec_path entry point; the Go bridge parses the prefix:value
options and registers the new symbol. The vad_only path is unchanged.
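
The return-code rule for the codec setter can be expressed as a small check. The `checkCodecRC` helper is hypothetical. The commit message only specifies zero and negative values, so treating a positive rc as non-fatal is an assumption of this sketch.

```go
package main

import "fmt"

// checkCodecRC interprets the setter's return code: zero means success or
// not-applicable, and only a negative value is fatal. Positive values are
// not described in the commit message; this sketch treats them as non-fatal.
func checkCodecRC(rc int, path string) error {
	if rc < 0 {
		return fmt.Errorf("setting codec path %q failed: rc=%d", path, rc)
	}
	return nil
}

func main() {
	fmt.Println(checkCodecRC(0, "snac.gguf"))  // placeholder file name
	fmt.Println(checkCodecRC(-1, "snac.gguf")) // placeholder file name
}
```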

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…plicit arch + companions)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The crispasr entries are just backend + model + a couple options, fully
expressed inline via overrides:/files: in gallery/index.yaml. Point each
url: at the shared gallery/virtual.yaml (the established 'virtual' model
trick) and drop the 36 redundant per-model gallery/*-crispasr.yaml files.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Real e2e showed qwen3-tts/orpheus/chatterbox don't synthesize through the
current shim: the codec: companion loads fine, but these engines additionally
need a voice pack / voice prompt / reference clip (qwen3-tts base errors
'no voice'; chatterbox is zero-shot cloning; orpheus uses named voices) that
the backend doesn't wire. (qwen3-tts also can't auto-detect: its GGUF arch is
'qwen3tts', unmapped by the detector — would need backend:qwen3-tts.) Removed
to avoid shipping non-working gallery entries; vibevoice-tts (built-in voice,
e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler added 2 commits May 31, 2026 08:22
…ce packs/prompts)

speaker:<name> -> crispasr_session_set_speaker_name (baked speakers: qwen3-tts
CustomVoice, orpheus). voice:<path>(+voice_text:<ref>) -> crispasr_session_set_voice
(voice-pack GGUF, or WAV zero-shot clone with ref text). Applied at Load as the
default voice; req.Voice still overrides the speaker per request.
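
The precedence between the load-time default and the per-request voice amounts to a simple fallback. The `effectiveSpeaker` helper below is a hypothetical illustration of that rule, not the backend's code.

```go
package main

import "fmt"

// effectiveSpeaker returns the per-request voice when one is given and
// otherwise falls back to the speaker configured at load time.
func effectiveSpeaker(loadDefault, reqVoice string) string {
	if reqVoice != "" {
		return reqVoice
	}
	return loadDefault
}

func main() {
	fmt.Println(effectiveSpeaker("vivian", ""))     // no override: load default
	fmt.Println(effectiveSpeaker("vivian", "tara")) // request overrides
}
```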

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…-customvoice, orpheus)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler merged commit 76fe0bb into master May 31, 2026
68 checks passed
@mudler mudler deleted the feat/crispasr-backend branch May 31, 2026 10:11
@localai-bot localai-bot added the enhancement New feature or request label Jun 10, 2026
