On-device speech recognition + text-to-speech. No cloud. 40+ model families, one app.
CrisperWeaver is a cross-platform Flutter app for fully-offline audio transcription and speech synthesis. Drop in a file, paste a URL, or record with the mic — audio never leaves the device. 40+ open-weight ASR families and 20+ TTS families are supported through a single unified engine (CrispASR): Whisper (+ Distil-Whisper), Parakeet (TDT/CTC/RNNT variants + Japanese), Canary, Voxtral, Qwen3-ASR (+ Mega-ASR LoRA), Cohere, Granite (3.x + 4.1 family), FastConformer-CTC, Canary-CTC, Wav2Vec2 / HuBERT / Data2Vec, OmniASR (CTC + LLM + streaming), FireRed, Kyutai-STT, GLM-ASR, Moonshine (+ German variants), VibeVoice ASR, MiMo ASR, Gemma4-E2B (140+ languages), FunASR (+ 31-lang MLT), Paraformer-ZH, SenseVoice (built-in language ID) — plus Kokoro / VibeVoice / Qwen3-TTS (0.6B + 1.7B + VoiceDesign + Gwen-TTS) / Orpheus (+ German variants) / Chatterbox (+ Turbo + Kartoffelbox + Lahgtna Arabic) / IndexTTS / VoxCPM2 / CosyVoice3 / F5-TTS / Bark / CSM / Dia / FastPitch / MeloTTS / OuteTTS / Parler-TTS / Pocket TTS / SpeechT5 / KugelAudio / Piper for synthesis, Pyannote v3 for ML diarisation, FireRedPunc + fullstop-punc + BiLSTM truecaser for punctuation/capitalisation restoration, CLD3 + GlotLID + FastText LID-176 for text language detection, Silero LID for audio language detection, and FireRed/MarbleNet/Whisper-VAD-EncDec VAD options.
| Project | Role |
|---|---|
| CrispASR | C++ ASR + TTS engine powering this app — 40+ ASR + 20+ TTS backends, CLI + C-ABI |
| CrisperWeaver | This app — Flutter GUI for CrispASR |
| CrispEmbed | Text embedding engine (ggml) — XLM-R, Qwen3-Embed, Gemma3, dense + sparse + ColBERT |
| Susurrus | Python ASR GUI with 9 backends (faster-whisper, mlx-whisper, voxtral, ...) |
- Transcribe any audio file — WAV / MP3 / FLAC decoded on-device, no ffmpeg required.
- Batch-process multiple files — drop many onto the window or onto the Batch Queue card, "Transcribe all" drains serially; each run lands in History.
- Record from the mic and transcribe live.
- Paste a URL to a remote file; CrisperWeaver downloads and processes it.
- Drag and drop files onto the transcription screen (desktop) or directly on the batch queue.
- Receive shared audio from the OS share sheet (Android / iOS / macOS).
- Choose your model family and quantisation — q4_0 / q5_0 / q4_k / q5_k / q6_k / q8_0 variants plus f16 originals. Model picker filters by name + backend; Model Management screen auto-probes HuggingFace to discover every available quant.
- Add from HuggingFace repo — link-icon button on the Models screen accepts any
OWNER/NAMEHF repo (e.g.cstr/voxtral-mini-3b-2507-GGUF), lists the GGUF / .bin files under your chosen backend, and registers them as downloadable models. Mirror ofcrispasr --hf-repoon the CLI side; escape hatch for repos not yet in the baked catalogue. - Download and manage models from a built-in browser — parallel queue, resume, SHA-1 verify, cancel, delete. Companion files (codec / voice / tokenizer GGUFs) auto-download alongside their parent model, including for runtime-discovered HF quants.
- Advanced decoding knobs — translate-to-English (Whisper), beam search, initial-prompt vocabulary bias (huge win for domain audio), audio Q&A prompt for instruct-tuned LLM backends (Voxtral / Qwen3), source + target language pickers for true speech translation.
- Tune the decoder live — best-of-N slider (1–10, picks the highest-scoring of N decodes; works on every backend) and decoder temperature slider for every backend
crispasr_session_set_temperaturehonours (canary, cohere, parakeet, moonshine, voxtral, qwen3, granite, glm-asr, gemma4, omniasr-llm, kyutai-stt). - Tune the VAD — pick between Silero (bundled), FireRedVAD (F1 97.57%), MarbleNet, Whisper-VAD-EncDec; live sliders for threshold, min-speech-ms, min-silence-ms, speech-pad-ms.
- Pick the diarisation method — vad-turns (default, mono-friendly), pyannote (ML, GGUF), stereo energy, stereo cross-correlation.
- Pick the language-detection method — Whisper-encoder LID (reuses an existing model) or Silero 95-langs (faster + smaller).
- See live performance numbers — real-time factor, words per second, wall-clock.
- Get word-level timestamps and language auto-detection (via Whisper).
- Stream transcription from long-running mic input — partial transcripts appear in the output card while you talk (10 s sliding window / 3 s step).
- Watch long files transcribe in real time — chunked Whisper splits >60 s files into 30 s windows and streams segments through as each finishes, instead of waiting until the end.
- Export to
.txt,.srt,.vtt,.json,.csv,.lrc(lyrics), or.wts(Whisper Text Segments debug) through the system share sheet. - Review history — every run is persisted as JSON and browseable / re-exportable, with speaker renames preserved across launches.
- Rename diariser speakers — tap a speaker chip in the output to override the auto-assigned label ("Speaker 1" → "Alice"); the new name persists in history.
- See where storage went — Settings → Storage breakdown lists per-backend disk usage with a one-click "delete all of X" action.
- Diagnose with logs — in-app viewer with filter / search / copy / export, optional file sink.
- Synthesize speech (TTS) — pick a downloaded TTS model + voice + codec on the Synthesize screen, type text, hit Synthesize; output plays in-app and saves as WAV. Advanced section adds trim-silence, 0.25×–4× speed slider, reference-transcript field for runtime voice cloning, and natural-language voice-design prompts (qwen3-tts VoiceDesign).
- Translate text — dedicated Translate screen wired to CrispASR's
crispasr_session_translate_text. Pick from M2M-100 (100 langs, any-to-any), WMT21 Dense (en↔X, two separate checkpoints), or MADLAD-400 (419 langs); source/target language dropdowns with one-click swap, max-tokens slider, copy-to-clipboard. - Bake your own voice — Bake voice screen drives CrispASR's
bake-chatterbox-voice-from-wav.pyviaProcess.start: pick a WAV reference, set the output filename, hit Bake. Stdout + stderr stream into a live tail panel + the in-app Log viewer; the resulting GGUF is dropped straight into the models directory so it shows up in the voicepack picker on the next open. Desktop-only — mobile has no Python runtime. Available from the cake icon in the Synthesize app-bar. - Restore punctuation — FireRedPunc (ZH+EN) or fullstop-punc (EN/DE/FR/IT) toggle in Advanced Options; turns
wav2vec2 / fastconformer-ctc / firered-asrlowercase output into properly punctuated text. Picker chooses which family runs when both are downloaded. - Use CrisperWeaver in English or German — full i18n scaffold via
flutter_localizations, 866 keys per locale covering every user-facing string (guarded by an ARB-consistency test). - Capture system audio — "Transcribe what's playing in Zoom / YouTube / any app." ScreenCaptureKit on macOS 13+,
parecon Linux, ffmpeg-WASAPI loopback on Windows, MediaProjection on Android 10+; iOS deliberately unsupported (Apple's sandbox forbids it). - Boost custom vocabulary — chip list in Advanced Options; routed per-backend-class via
initial_prompt(whisper / moonshine),setAskprefix (audio-LLM backends), or no-op (CTC-style with explanatory helper text). - Edit any segment inline — long-press → edit dialog. Edits update both AppState AND the on-disk history JSON so they survive a reload. Edited segments get a pencil-icon marker.
- Search history — substring filter on title + transcript with yellow-highlighted matches; auto-expand of matching entries.
- Edit audio with waveform sync — dedicated audio editor screen with trim / cut middle / split into chapters + an optional collapsible transcript pane on the same screen. Tap a segment → playhead jumps. Long-press a segment → Select / Trim to / Mark for split. Tap the waveform → matching segment highlights. Pure-Dart, no FFmpeg, identical on every platform.
- Tidy any transcript — pure-Dart deterministic pass (filler removal, repeat collapsing, sentence-case, punctuation fix, optional annotation strip) with a before/after preview of the first three segments. Optional opt-in LLM pass on top via a three-mode selector: Off (deterministic only), Cloud (bring your own OpenAI-compatible endpoint — OpenAI / Anthropic via proxy / OpenRouter / Groq / Cerebras / Together / local llama-server; key stays on device), or Local (point at a GGUF chat model on disk and run it on-device via Metal / CUDA / CPU through libcrispasr's chat ABI — no network, no API key, model stays warm across passes).
- Summarise meetings — Action Items / Key Topics / Decisions as structured Markdown, via either the cloud BYOK endpoint or the on-device chat model — same three-mode chooser as the cleanup pass.
- Save presets — bundle
(backend, modelId, language, AdvancedOptions)into a named preset. One tap restores all four atomically. JSON-backed with schema-versioned migration so a stale preset on a newer build doesn't crash. - Push-to-transcribe from anywhere — desktop-only system hotkey (macOS / Linux / Windows). Push-to-talk OR toggle behaviour; combo parser handles modifier aliases (cmd / command / win / super → meta, ctrl → control, option → alt).
- Clone a voice in three steps — guided wizard launched from the Synthesize screen: record 10 s OR pick a WAV → type the reference transcript → hand off to Synthesize with everything pre-populated. Runs on top of the existing chatterbox / indextts / qwen3-tts-base / vibevoice runtime-cloning surface.
- Whisper subtitle formatting — tokens-per-segment cap + split-on-word toggle in Advanced Options. Produces SRT-friendly short subtitle lines instead of long-paragraph segments. Split-on-punct additionally breaks at sentence-ending punctuation (. ! ?) — works with any backend, not just whisper.
- Semantic transcript search — search your history by meaning, not just keywords. When a small embedding model is downloaded (~23 MB), History search uses real vector embeddings (cosine similarity) instead of substring matching. Persisted embeddings avoid re-encoding. Cross-modal audio embeddings supported with larger models.
- Subtitle overlay / teleprompter — fullscreen always-on-top transparent overlay showing live streaming transcription as subtitles. Font size, position, and background controls. On macOS the window floats above other apps via platform channel.
- Compare transcripts side by side — re-transcribe the same audio with two models in parallel (two single-worker pools via
Future.wait) and see a word-level diff with Jaccard similarity stats. Helps pick the best model for your domain. - Tag transcript segments — annotate segments with bookmark, action-item, question, important, highlight, decision, or follow-up tags. Tags persist in history and render as emoji badges.
- Keyboard-navigate transcripts — J/K to jump segments, Space to play/pause, Enter to edit, Tab to jump to next low-confidence word. Desktop power-user feature.
- Export to note-taking tools — Obsidian (YAML frontmatter + timestamped bullets), Notion (speaker headers), Logseq (indented blocks), YouTube chapters (HH:MM:SS lines).
- Watch a folder for new audio — point CrisperWeaver at a directory and new audio files are auto-transcribed. Duplicate files are auto-skipped via audio fingerprinting. Desktop-only; configurable in Settings.
- TTS pronunciation lexicon — user-editable word-to-pronunciation override table for proper nouns, acronyms, and domain terms. Applied before TTS synthesis.
- Detect chapters automatically — topic-shift detection via vocabulary analysis, exportable as YouTube chapters or Podcasting 2.0 JSON.
- Confidence heatmap — toggle in the transcript output menu to color-code words by decoder confidence (background gradient: green → yellow → orange → red).
- Synthetic content compliance — every TTS-generated WAV is watermarked (native CrispASR spread-spectrum, upgradeable to AudioSeal neural watermark via GGUF; pure-Dart LSB fallback on older dylibs) and carries LIST INFO provenance metadata identifying it as AI-generated. MP3 exports carry ID3v2.3 TXXX frames for AI provenance (
AI_GENERATED,GENERATOR,AI_CONTENT_NOTICE). Voice-cloned TTS output is prepended with a 3-beep 880 Hz audio disclaimer (EU AI Act Art. 50(4)). Post-embed watermark verification detects the watermark immediately after embedding and warns if absent. Consent attestation audit logging ([CONSENT] ts=… model=… voice=… attestation="user consent") records every voice-cloned synthesis in the same format used by CrispASR and CrispTTS. Speaker enrollment requires explicit biometric consent (GDPR Art. 9); consent records are persisted and deleted alongside speaker profiles. Exports support an optional machine-readable disclosure notice. See About → Synthetic Content Compliance in the app.
One dispatcher (CrispasrSession) handles every backend; bundled libcrispasr reports at startup which are linked in the current build, and the Models screen filter chips group them by kind (ASR / TTS / Voices / Codecs / Post-processors). The Model Management screen also probes CrispASR's built-in C-side registry on every open, so any backend the bundled libcrispasr knows about appears even if it isn't hardcoded in the app catalog.
| Family | Sizes | Languages | Notes |
|---|---|---|---|
| Whisper | tiny → large-v3 + q4_0/q5_0/q8_0 | 99 | Word-level ts, lang-detect, streaming |
| Distil-Whisper | distil-large-v3 (f16 / q5_0 / q4_k) | 99 | ~6× faster than large-v3, ~50% smaller |
| Parakeet (NVIDIA) | TDT 0.6b v2/v3, 1.1b; CTC 0.6b/1.1b; TDT+CTC 110m/1.1b; RNNT 0.6b/1.1b; JA 0.6b | 25 EU (auto-detect) / JA | Fast, native word timestamps |
| Canary (NVIDIA) | 1b-v2 | 25 EU (explicit src/tgt) | Speech translation X ↔ en |
| Qwen3-ASR | 0.6b, 1.7b | 30 + 22 Chinese dialects | Multilingual |
| Mega-ASR | 1.7b (Qwen3-ASR + robustness LoRA) | 30 + 22 Chinese dialects | LoRA merged offline, qwen3 runtime |
| Cohere | 03-2026 | 13 | High-accuracy Conformer decoder |
| Granite Speech | 3.2-8b, 3.3-2b/8b, 4.0-1b, 4.1-2b | en fr de es pt ja | Instruction-tuned |
| Granite Speech 4.1 | 2B / 4.1+ / 4.1-NAR | en fr de es pt ja | Instruction-tuned + parallel decode |
| FastConformer-CTC | small → xxlarge | en | Low-latency CTC |
| Canary-CTC | 1b | 25 EU | CTC variant of canary |
| Voxtral Mini | 3B (2507), 4B realtime (2602) | 8 / 13 | Speech translation; realtime 4B |
| Wav2Vec2 / HuBERT / Data2Vec | large-xlsr-53 (EN, DE, multi) + HuBERT-Large + Data2Vec-Base | per-model | Self-supervised CTC family |
| OmniASR | CTC 300M/1B, LLM 300M/1B | 1600+ languages | CTC + LLM variants |
| OmniASR LLM unlim. | 300M v2 streaming | 1600+ languages | Streaming, 15 s protocol, unlimited audio |
| FireRed ASR2 | aed-2b | zh / en | AED-style |
| Kyutai STT | 1b | en | Streaming-style |
| GLM-ASR Nano | nano | multilingual | GLM-family |
| Moonshine | tiny / base + streaming + DE variants | en / de | Tiny CPU-friendly, BPE tokenizer companion |
| VibeVoice ASR | large | multilingual | Large multilingual ASR (~4.5 GB) |
| MiMo ASR | 2.5B + tokenizer companion | en zh | XiaomiMiMo, two-file (model + codec) |
| Gemma4-E2B | 2B (q4_k) | 140+ languages | USM Conformer + Gemma-4 35L |
| FunASR Nano | 2512 (f16 / q4_k), MLT 2512 | zh + en (nano) / multi (mlt) | Alibaba's compact ASR, 70-block SANM |
| Paraformer-ZH | base (f16 / q4_k / q8_0) | zh | FunASR family, NAR Mandarin ASR |
| SenseVoice | small (f16 / q4_k / q8_0) | zh en ja ko yue | Built-in LID + use_itn punctuation |
| MOSS-Audio | 4B Instruct (q4_k ~3.8 GB) | multilingual | ASR + audio QA + scene description (Whisper enc + Qwen3 LLM) |
| Family | Sizes | Notes |
|---|---|---|
| Kokoro | 82M + voicepacks | Multilingual, espeak-ng phonemiser bundled |
| VibeVoice | realtime 0.5B (f32 + tokenizer), 1.5B base | + voicepack via setVoice; 1.5B supports runtime WAV cloning |
| Qwen3-TTS | 0.6B base + customvoice + codec, 1.7B base/CV/VoiceDesign, Gwen-TTS (Vietnamese) | Customvoice has 9 baked speakers; VoiceDesign + Parler-TTS accept natural-language voice descriptions via setInstruct |
| Orpheus | 3B + SNAC codec + German variants (lex-au, Kartoffel natural/synthetic) | 8 baked English speakers; SNAC via setCodecPath |
| Chatterbox | 850 MB base + turbo / Kartoffelbox (DE) / Lahgtna (Arabic) | T3 AR + S3Gen flow-matching; voice cloning via baked GGUF |
| IndexTTS | 1.6 GB | GPT-2 AR + BigVGAN; zero-shot WAV cloning, ZH+EN |
| VoxCPM2 | q4_k (1.6 GB) + f16 | Tokenizer-free diffusion AR; zero-shot, 29 languages, 48 kHz native |
| CosyVoice3 | 0.5B (LLM + flow + HiFT + voices) | 9 languages + 18 Chinese dialects; zero-shot voice cloning |
| F5-TTS | v1 Base (~953 MB) | DiT flow-matching; zero-shot voice clone from 3-15 s WAV |
| Bark | small (~500 MB) | 3-stage GPT-2; multilingual, 10 German speakers |
| CSM | 1B (q4_k ~1.4 GB) | Sesame CSM-1B conversational TTS, single EN voice |
| Dia | 1.6B (f16 ~3 GB) + DAC codec | Dialogue TTS with [S1]/[S2] speaker tags, English |
| FastPitch | 60M (~120 MB) | NVIDIA non-autoregressive parallel TTS, deterministic, EN |
| MeloTTS | v2 (4 EN speakers) + v3 + BERT companion | VITS2, 44.1 kHz |
| OuteTTS | 0.3 1B + WavTokenizer decoder | OLMo-1B + VQ-GAN; voice clone via JSON speaker, EN |
| Parler-TTS | Mini v1.1 (~900 MB) | Describe voice in natural language via setInstruct, EN |
| Pocket TTS | 100M (~220 MB) | Kyutai continuous-latent AR; voice clone from WAV, EN |
| SpeechT5 | 80M (~300 MB) | Microsoft AR mel decoder + HiFi-GAN, EN |
| Zonos | v0.1 (q4_k ~872 MB, f16 ~3.1 GB) + DAC codec | Zyphra 500M — emotion, pitch, rate control, speaker cloning, 44.1 kHz |
| KugelAudio | 0 Open (f16 ~14 GB) | Large TTS model |
| Piper | 15-60 MB per voice | VITS, 250+ community voices, 30+ languages |
| Family | Languages | Notes |
|---|---|---|
| PCS | 47 languages | All-in-one: punctuation + truecasing + sentence boundaries (XLM-R) |
| FireRedPunc | ZH + EN | BERT-based punctuation + capitalisation |
| Fullstop-punc | EN / DE / FR / IT | Multilingual punctuation restoration |
| Truecaser LSTM | DE / EN / ES / RU | BiLSTM character-level truecasing (97.9% F1 German) |
| Family | Role | Notes |
|---|---|---|
| Pyannote v3 seg | Diarisation | ML segmentation, up to 3 speakers per slice |
| TitaNet-Large | Speaker embedding | 192-d L2-normalised; pair with enrolled speakers |
| Silero LID 95 | Audio language ID | 95-language classifier, ~16 MB GGUF — faster than Whisper LID |
| ECAPA-TDNN LID | Audio language ID | 107 languages, ~42 MB |
| FireRed LID | Audio language ID | 120 languages, ~300 MB |
| CLD3 | Text language ID | 109 languages, ~440 KB — powers Translate auto-detect |
| GlotLID v3 | Text language ID | 2102 languages (ISO 639-3), ~250 MB |
| FastText LID-176 | Text language ID | 176 languages, ~63 MB |
| FireRedVAD | VAD | F1 97.57%, ~3 MB |
| MarbleNet VAD | VAD | Small (~500 KB), EN/DE/FR/ES/RU/ZH |
| Whisper-VAD-EncDec | VAD (experimental) | Japanese-trained (works on EN too), ~22 MB |
Downloads pull f16 from ggerganov/whisper.cpp and quantised variants from cstr/whisper-ggml-quants and other cstr/*-GGUF repos. Skip-checksum toggle in Settings for custom or mirrored GGUFs.
| Platform | State |
|---|---|
| macOS | ✅ Released — .app.zip, Metal-enabled, all 24+ backend dylibs (incl. kokoro / orpheus / mimo-asr) bundled, espeak-ng auto-bundled for kokoro phonemisation |
| Linux | ✅ Released — .tar.gz bundle |
| Windows | ✅ Released — portable .zip + installable .msix with Explorer "Open With" registration for audio + subtitle types |
| Android | ✅ Released — real-ASR APK (arm64-v8a) with libwhisper.so cross-built in CI. External models dir picker requests "All files access" runtime permission (Android 11+) so reinstalls don't lose access to /storage/emulated/0/... paths. |
| iOS | |
| Web (PWA) | ✅ Live at crisperweaver-web.vercel.app — ASR + TTS via CrispASR HF Space cloud engine (9 ASR + 5 TTS backends). Client-side text embeddings via CrispEmbed WASM (~19 MB model, ~50-100ms/sentence). Auto-deploys from GitHub Actions. |
Roadmap and blockers: see PLAN.md.
- Flutter 3.38.x (
stable) - Xcode + CocoaPods for iOS / macOS
- JDK 17 + Android SDK for Android
- CMake + a C++ toolchain to build
libwhisperfrom CrispASR - Optional: Metal (macOS), CUDA / Vulkan (Linux/Windows), Core ML (iOS), NNAPI (Android)
mkdir crisperweaver && cd crisperweaver
git clone https://github.com/CrispStrobe/CrispASR.git
git clone https://github.com/CrispStrobe/CrisperWeaver.gitpubspec.yaml refers to the Dart FFI package via path: ../CrispASR/flutter/crispasr.
Each desktop platform has an end-to-end script that configures + builds CrispASR's libwhisper / whisper.dll, runs flutter build, and bundles every needed dynamic library next to the runner. They expect the sibling CrispASR checkout described above.
# macOS
./scripts/build_macos.sh release
# Linux
./scripts/build_linux.sh release
# Windows (cmd.exe — picks up pwsh if installed, falls back to powershell.exe)
build_windows.bat release
# or directly:
pwsh -File scripts\build_windows.ps1 releasePass debug instead of release for a debug build; add --rebuild-cmake (or -RebuildCmake on Windows) to force a fresh CrispASR cmake configure. Each script's output ends with a path to the runnable bundle / .app / .exe.
The bundlers can also be invoked standalone (scripts/bundle_macos_dylibs.sh, scripts/bundle_linux_libs.sh, scripts/bundle_windows_dlls.ps1) if you've already built CrispASR and flutter build <platform> separately and just want to drop the libs in place.
At runtime, CrispASREngine resolves the library by probing platform-specific names — crispasr.dll / libcrispasr.dylib / libcrispasr.so first, then the whisper-named alias — under the bundle, the user's CrispASR checkout, system lib dirs, and any user-supplied override path.
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=ON -DWHISPER_METAL=ON
cmake --build build --parallel --target crispasr
cd ../CrisperWeaver
flutter pub get
flutter run -d macos # or: linux, windows, android, iosflutter run picks up the freshly-built CrispASR shared library directly from ../CrispASR/build/src/ for fast inner-loop work.
- First run: open Settings → Manage models and download the model you want. Default pick is Whisper base (~140 MB, covers 99 languages).
- Transcribe a file: back to the main screen, drop a file or click the picker. Language auto-detects by default.
- Record from the mic: use the recorder card. Stop → transcribe.
- Stream from mic (Whisper): toggle Stream in the recorder. Partial text arrives as you speak.
- Synthesize speech (TTS): tap the Synthesize icon in the app-bar (next to Models). Pick a downloaded TTS model + voice + codec, type text, hit Synthesize — output plays in-app and can be saved as WAV via the share sheet.
- Export the result: share-sheet button → pick TXT / SRT / VTT / JSON.
- Browse past runs: History in the drawer — everything is persisted as JSON under
<app-docs>/history/.
Useful Settings:
- Preferred engine — CrispASR is the default.
- Default model / language — pre-select across new transcriptions.
- Custom models directory — point CrisperWeaver at an existing GGUF
library on an external disk (the Models directory picker), e.g.
/Volumes/backups/ai/crispasr, to reuse already-downloaded models instead of re-fetching into the app sandbox. Cross-platform; the model picker reads flat GGUFs from that folder and "Use default" restores the sandbox<app-docs>/models/whisper_cpplocation. (On Android 11+ an external/shared-storage path needs the one-time "All files access" grant.) - Skip checksum verification — custom / mirrored GGUFs.
- Log level + file sink —
<app-docs>/logs/session.logfor bug reports.
The default test pass is fast and offline:
flutter analyze # 0 issues — see analysis_options.yaml for the strict rule set
flutter test # 378 tests, ~20 s — widget unit + service / persistence / batch / dispatchtest/backend_dispatch_test.dart always runs the cheap dispatch
checks (every backend appears in CrispasrSession.availableBackends(),
opens with bogus paths fail cleanly). The expensive end-to-end synth /
transcribe roundtrips are tagged slow and skip themselves silently
unless their CRISPASR_TEST_<BACKEND>_MODEL env var points at a
downloaded GGUF. Opt in with:
M=/path/to/crispasr-models
CRISPASR_TEST_KOKORO_MODEL=$M/kokoro-82m-q8_0.gguf \
CRISPASR_TEST_KOKORO_VOICE=$M/kokoro-voice-af_heart.gguf \
CRISPASR_TEST_QWEN3_TTS_MODEL=$M/qwen3-tts-12hz-0.6b-customvoice-q8_0.gguf \
CRISPASR_TEST_QWEN3_TTS_CODEC=$M/qwen3-tts-tokenizer-12hz.gguf \
CRISPASR_TEST_VIBEVOICE_MODEL=$M/vibevoice-realtime-0.5b-tts-f32-tokenizer.gguf \
CRISPASR_TEST_VIBEVOICE_VOICE=$M/vibevoice-voice-en-Emma_woman.gguf \
CRISPASR_TEST_ORPHEUS_MODEL=$M/orpheus-3b-base-q8_0.gguf \
CRISPASR_TEST_ORPHEUS_CODEC=$M/snac-24khz.gguf \
CRISPASR_TEST_MIMO_ASR_MODEL=$M/mimo/mimo-asr-q4_k.gguf \
CRISPASR_TEST_MIMO_ASR_TOKENIZER=$M/mimo/mimo-tokenizer-q4_k.gguf \
CRISPASR_TEST_COSYVOICE3_MODEL=$M/cosyvoice3-llm-q4_k.gguf \
CRISPASR_TEST_WHISPER_MODEL=$M/ggml-tiny.bin \
flutter test --tags slow test/backend_dispatch_test.dartThe cosyvoice3 roundtrip needs its flow / hift / voices companions
sitting beside the LLM GGUF (auto-discovered by filename). There is also
an f5-tts roundtrip (CRISPASR_TEST_F5TTS_MODEL=$M/f5-tts-v1-base-f16.gguf,
zero-shot clone from test/jfk.wav) — left out of the line above on
purpose: its DiT synth is extremely slow (~50 min for one short sentence
on M1 Metal), so opt into it only deliberately and on its own.
A single sweep takes ~25 min on M1 Metal end-to-end (full sweep — 6 backends including a 3 B Llama and a 4 GB f32 vibevoice). Each backend skips if its env var isn't set, so partial sweeps are cheap. Coverage:
flutter test --coverage # writes coverage/lcov.info
genhtml coverage/lcov.info -o coverage/html # if lcov is installedSee dart_test.yaml for the tag config and PLAN.md §5.18 for the
test-suite speed roadmap (the architectural win is a persistent
ggml-metal pipeline cache via MTLBinaryArchive, which would cut
the cold-start kernel-JIT cost from minutes to seconds).
ci.yml— on push / PR:flutter analyze+flutter teston Ubuntu and macOS, plus debug.appand Linux bundle uploaded as workflow artifacts. Checks out siblingCrispStrobe/CrispASRautomatically. Slow tests (--tags slow) are skipped by default; CI can opt in by setting theCRISPASR_TEST_*env vars and adding--tags slowto the test command. Analyzer is configured to error onuse_build_context_synchronously,avoid_print,unused_*,inference_failure_*, anddeprecated_member_use— a regression fails the build instead of silently piling up.release.yml— on avX.Y.Ztag (or manual dispatch): builds and uploadscrisper_weaver-macos.zip— Metal-enabled.app, ad-hoc signed, all 24+ backend dylibs (kokoro / orpheus / mimo-asr included as of CrispASRba7d6ed) bundled.crisper_weaver-linux-x64.tar.gz— GTK-3 desktop bundle with all backend.so's.crisper_weaver-windows-x64.zip— portable zip withrunner.exe+whisper.dll+ sibling backend DLLs.crisper_weaver-windows-x64.msix— installable MSIX that registers "Open With → CrisperWeaver" for audio + subtitle types in Explorer. Unsigned for now (see Windows install below).crisper_weaver-android-arm64.apk— real ASR. CrispASR cross-built viabuild-android.sh --abi arm64-v8a;libwhisper.soand sibling backend.so's dropped intojniLibs/arm64-v8a/.crisper_weaver-ios-unsigned.ipa— sideload-compatible (see below).
Both workflows honour CRISPASR_REPO / CRISPASR_REF env vars at the top of each file, in case you maintain a fork of CrispASR.
We don't pay for the Apple Developer Program yet, so the iOS IPA in releases is unsigned. To install it on your own device, use a sideload service:
- SideStore — iOS app installer that uses your own Apple ID for signing (free: 7-day re-sign; paid ADP: 1-year). Self-hosted via StosVPN or pair-with-computer.
- AltStore — the original self-sign flow; requires a desktop-side companion (AltServer).
- Feather — open-source alternative.
All three accept the .ipa directly: download crisper_weaver-ios-unsigned.ipa from the release page, open in SideStore (Files → share to SideStore), tap Install. The free-tier 7-day limit means you'll need to re-sign weekly unless you also have a paid Apple Developer account.
Two artefacts ship per release; pick one:
crisper_weaver-windows-x64.zip— portable. Unzip anywhere, runcrisper_weaver.exe. No installer, no file associations, doesn't survive across machines cleanly. Good for quick trials and tech users.crisper_weaver-windows-x64.msix— installable. Lands inApps & features, uninstalls cleanly, and registers "Open With → CrisperWeaver" in Explorer for.wav/.mp3/.m4a/.flac/.ogg/.aac/.opus/.wma/.srt/.vtt. Recommended for everyday use.
The MSIX is not signed by a Microsoft Store-trusted CA yet (Microsoft Store registration is on the roadmap), so the first install needs a one-time trust step:
- Download
crisper_weaver-windows-x64.msixfrom the release page. - Right-click → Properties → tick Unblock (under the General tab, bottom right).
- Double-click the
.msix→ Install. Windows Security may warn ("Untrusted publisher") — accept once. - After install, right-click any audio file in Explorer → Open With → CrisperWeaver.
Once we publish to Microsoft Store, that warning goes away — winget install CrisperWeaver and the Store listing will both ship the same MSIX, signed.
The short version:
- CoreML acceleration for Whisper on iOS/macOS (enable
WHISPER_USE_COREMLinside CrispASR, ship the paired.mlmodelcnext to the.bin). - Swap the MFCC/k-means diarization stopgap for the shared-lib
crispasr_diarize_segments_abiintroduced in CrispASR v0.4.5 (pyannote GGUF + energy/xcorr/vad-turns in one call). - Wire the shared
crispasr_detect_language_pcm(v0.4.6) for auto-language on backends that lack native LID. - Wire
crispasr_align_words_abi(v0.4.7) to give word-level timestamps to LLM-based backends (qwen3 / voxtral / granite) that don't emit them natively. - Expose more CrispASR power-user knobs in the UI: beam search / best-of-N, temperature, source/target languages for translation backends, audio Q&A mode, streaming UI (see PLAN.md §5.8). VAD and
initialPromptalready shipped in v0.1.7. - Windows CI job (build script + DLL bundler shipped —
scripts/build_windows.ps1; CI workflow still TODO). - Android CI: cross-build
libwhisper.so+ sibling backends, drop intojniLibs/, ship a real-ASR APK. - iOS
pod install+ device-build CI verification. - Finish i18n migration for widgets + older Settings strings.
- Backend-specific UX (source/target language UI for Canary, Voxtral audio Q&A mode, Granite
--askflag, etc.).
Full per-item breakdown with file paths, risks, and verification steps: PLAN.md.
Technical learnings collected during development (FFI quirks, dylib bundling, macOS chrome, CI patterns): LEARNINGS.md.
CrisperWeaver is GNU AGPL-3.0-or-later. See LICENSE for full text and the in-app About screen for the auto-aggregated third-party license list (showLicensePage).
Copyright © Christian Ströbele · Stuttgart, Germany.