v0.9.2 — MTP everywhere (MTPLX + GGUF) · chaosengine-cli + E2E · in-app accelerator UX
Headline: full-surface MTP speculative decoding (MTPLX on Apple Silicon + GGUF MTP via llama.cpp PR #22673), chaosengine-cli automation wrapper + phased E2E test suite, in-app accelerator install UX (FU-056 9-phase), torch in-place upgrade with rollback, plus a deep polish pass across Image / Video Studio.
54 commits rolled up since v0.9.0. Three feature pillars stand out: MTP speculative decoding lands end-to-end on both Apple Silicon (via MTPLX) and CUDA / GGUF (via llama.cpp PR #22673), delivering measurable wins on Qwen3.6-27B-MTP and Qwen3.5-class models. The new chaosengine-cli automation wrapper + phased E2E test suite turn the app into a scriptable / headless surface — every Studio + Chat operation is now reachable from the shell. And FU-056 brings every accelerator install in-band (Nunchaku, SageAttention, DFlash CUDA, TriAttention, kvpress, vLLM-via-WSL2) so users never drop to PowerShell. Plus 30+ smaller polish items.
Highlights
MTP speculative decoding — both lanes
MTPLX (Apple Silicon, Sonata's native MTP)
- Native MTP head loader for trained-with-MTP models (Qwen3.6 / Qwen3.5 / DeepSeek V3/R1 / Gemma-4) on Apple Silicon. Runs in-process via the MTPLX engine wrapper.
- Live bench: M4 Max × Qwen3.6-27B-4bit baseline 15.3 → MTPLX-MTP 23.3 tok/s (+52%). 60–66% draft acceptance.
- Burst profile for thermal headroom — temporarily widens draft block + fan-control burst window.
- llama-server upgrade-hint endpoint surfaces a one-click prompt when the bundled `llama-server` lags upstream MTP features.
- MTPLX quickstart install flow (no longer requires manual venv); `depth=3` default for the right speed / quality tradeoff out of the box.
- Tracker: PR #54, FU-028 (closes the MLX side).
GGUF MTP via llama.cpp PR #22673 (FU-047)
- llama.cpp PR #22673 merged 2026-05-16 — `--spec-type draft-mtp --spec-draft-n-max N` for any model with baked-in MTP heads.
- Live-validated: 20.9 tok/s vs 13.8 Q8_0 baseline (+51%) on `ggml-org/Qwen3.6-27B-MTP-GGUF`, 60–72% acceptance.
- Wired in `LlamaCppEngine._build_command` gated on `_llama_server_supports("--spec-type")`. New `is_mtp_gguf_repo` helper, `ggufMtpAvailable` capability flag, 4 aliases (`ggml-org/...` canonical + `am17an/...` author preview), catalog entries for `ggml-org/Qwen3.6-27B-MTP-GGUF` (29 GB Q8_0) + `ggml-org/Qwen3.6-35B-A3B-MTP-GGUF` (37 GB Q8_0 MoE).
- Tensor-level MTP probe recognises `nextn_predict` marker so even unlabelled MTP GGUFs route correctly.
- ThermalForge hint surfaces when an MTP draft sustains thermal pressure long enough to merit a profile change.
- Pre-build gate now asserts the bundled `llama-server` exposes `--spec-type draft-mtp`.
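The gating described above can be sketched roughly as follows. This is a minimal illustration, not the project's `LlamaCppEngine._build_command`: the `build_command` function, its parameters, and the default draft depth of 3 are assumptions made for the example. Only the `--spec-type draft-mtp` and `--spec-draft-n-max` flags come from the notes above.

```python
from typing import Callable, List


def build_command(
    binary: str,
    model_path: str,
    is_mtp_model: bool,
    supports_flag: Callable[[str], bool],
    draft_n_max: int = 3,  # illustrative default, not taken from the project
) -> List[str]:
    """Assemble a llama-server argv, adding MTP flags only when supported."""
    cmd = [binary, "--model", model_path]
    # Gate on both conditions: the model carries MTP heads AND the bundled
    # server binary advertises the flag. Otherwise fall back to plain decoding.
    if is_mtp_model and supports_flag("--spec-type"):
        cmd += ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(draft_n_max)]
    return cmd
```

Probing the binary before appending the flags means an older bundled `llama-server` still launches, just without speculative decoding.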
FU-048 prefer-GGUF-MTP routing preference
- When both MTPLX and GGUF MTP are available for a model, the `RuntimeController._select_engine` heuristic prefers the faster lane unless the user pinned otherwise. Settings → Speculative decoding backend (auto/prefer-gguf-mtp/prefer-mtplx).
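One way to read the routing preference is sketched below. The function name, the lane labels `"mtplx"` / `"gguf-mtp"`, and the use of an estimated tok/s figure to decide which lane is "faster" are all assumptions for illustration. The actual `RuntimeController._select_engine` heuristic is not shown in these notes.

```python
from typing import Dict, Optional

# Maps the documented setting values to a hypothetical lane label.
_PINNED_LANE = {"prefer-mtplx": "mtplx", "prefer-gguf-mtp": "gguf-mtp"}


def select_spec_backend(preference: str, available: Dict[str, float]) -> Optional[str]:
    """Pick a speculative-decoding lane.

    `available` maps each usable lane to an estimated tokens/second figure.
    A pinned preference wins when that lane is usable; otherwise ("auto",
    or the pinned lane is missing) the fastest available lane is chosen.
    """
    if not available:
        return None
    pinned = _PINNED_LANE.get(preference)
    if pinned in available:
        return pinned
    return max(available, key=available.get)
```

For example, `select_spec_backend("auto", {"mtplx": 12.0, "gguf-mtp": 10.0})` picks `"mtplx"`, while `"prefer-gguf-mtp"` with the same inputs picks `"gguf-mtp"`. The throughput values here are placeholders.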
chaosengine-cli automation wrapper + phased E2E test suite
The whole app is now scriptable end-to-end.
- `scripts/chaosengine-cli` — single-binary CLI wrapper covering chat, image generation, video generation, model library, runtime control, diagnostics, HTML Challenge, benchmarks, downloads, installs. Every operation reachable from the desktop UI is reachable from the shell.
- Phased E2E test suite (`scripts/e2e_test_suite.py`) — 8 phases, 36 checks in total, exercising the full vertical stack against a live backend on port 8876:
  - Phase 0: environment probe (health, routes, GPU status, MTPLX status, inventory)
  - Phase 1: Chat (MLX native cache + TurboQuant + DFlash + MTPLX spec-dec + GGUF llama.cpp + long-context cache + fused-attention flag)
  - Phase 2: Chat Compare two-model
  - Phase 3: HTML Challenge create + delete
  - Phase 4: Image Studio catalog + library + runtime + live generate
  - Phase 5: Video Studio catalog + library + mlx-runtime + live generate
  - Phase 6: Setup probes (mtplx-status, longlive-status, wan-status, wan-inventory, gpu-bundle, turbo-update, vllm-wsl-status, fu-056-capability-flags)
  - Phase 7: Diagnostics + cleanup (snapshot, log-tail, orphan workers, runtime state)
- Writes CSV + Markdown reports to `~/.chaosengine/test-results/`.
- `--smoke` mode for sub-60s CI; full sweep ~3 minutes.
- Pre-build gate integrates the smoke suite — release-blocking on phase failures.
- Memory-gate refusals correctly classified as SKIP (with reason) not FAIL — pre-build ships green on memory-constrained hosts.
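The SKIP-versus-FAIL classification might look something like the sketch below. The response shape (a `code` field equal to `"memory_gate_refused"` and a `detail` string) is a hypothetical stand-in, and so are the `CheckResult` and `classify_check` names. The real suite's payloads are not documented here.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable


@dataclass
class CheckResult:
    name: str
    status: str  # "PASS", "FAIL", or "SKIP"
    reason: str = ""


def classify_check(name: str, http_status: int, body: Dict[str, Any]) -> CheckResult:
    """Turn one backend response into a PASS / SKIP / FAIL verdict."""
    if 200 <= http_status < 300:
        return CheckResult(name, "PASS")
    # A memory-gate refusal means the host is too constrained to run the
    # job, not that the code is broken, so record it as SKIP with a reason.
    if body.get("code") == "memory_gate_refused":
        return CheckResult(name, "SKIP", body.get("detail", "memory gate refused the job"))
    return CheckResult(name, "FAIL", body.get("detail", f"HTTP {http_status}"))


def is_release_blocking(results: Iterable[CheckResult]) -> bool:
    """Only genuine failures block the pre-build gate; skips do not."""
    return any(r.status == "FAIL" for r in results)
```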
FU-056 — In-app accelerator install UX (9 phases)
Bring every CUDA-side accelerator install in-band so users never drop to PowerShell. Install affordances live next to the thing they accelerate.
- Phase 1 — capability probe (backend_service/inference/accelerators.py): lazy importability + version helpers for nunchaku, sageattention, dflash-mlx, dflash-cuda, triattention, kvpress + a Windows-only `wsl2_available()` shell probe. 11 new fields on `BackendCapabilities`. 25 unit tests pin the present/absent/broken-install matrix.
- Phase 2 — `<AcceleratorCard>` component + catalog: reusable React component with `pill` / `card` / `row` variants. Platform-gated visibility, install state badge, click-through to install.
- Phase 3 — Image Studio accelerator surfaces: contextual pills on Discover + Models rows showing applicable accelerators per repo (Nunchaku for FLUX, SageAttention CUDA-only, etc.).
- Phase 4 — Video Studio accelerator surfaces: same pattern for Wan / LTX / HunyuanVideo + LongLive bundle install row.
- Phase 5 — Chat composer DFlash install nudge: when the loaded model has a registered drafter but `dflash-mlx` isn't installed, the composer surfaces a one-click install pill.
- Phase 6 — Diagnostics Boost Pack panel: one-stop view of every accelerator with current install state + per-package install button.
- Phase 7 — per-variant `recommendedAccelerators` catalog metadata + i18n strings across all 10 locales.
- Phase 8 — Windows vLLM-via-WSL2 bridge: WSL2 detector + isolated venv install + remote subprocess engine. 4 live-e2e issues caught + fixed during integration.
- Phase 9 — cache-strategy-matrix runner integration + pre-build gate.
- Hide platform-incompatible catalog variants entirely (extends FU-034 "hide unrecoverable options" policy): Windows / Linux users no longer see MLX / mlx-video / mflux / MTPLX entries; Apple Silicon users no longer see vLLM / nunchaku / CUDA-only entries.
- Hide MTPLX block on non-Apple-Silicon hosts in Chat Compare + HTML Challenge + Launch Modal.
- Backend `--port` / `--host` CLI args properly honored; WSL test scripts pinned.
End-state UX: fresh user installs ChaosEngineAI → downloads FLUX → sees "Nunchaku +3× available [Install]" pill on the catalog card → one click → 90s later first generation runs at SVDQuant speed, no terminal required.
Torch in-place upgrade with rollback
- Detection: backend detects when bundled torch is older than the latest available stable.
- UI pill: Settings → Setup surfaces "Torch upgrade available" with one-click install.
- Background job pattern: long-running install runs in the same background-job pattern as LongLive / Wan-convert. Progress + log tail + cancel.
- Rollback: if the new wheel fails to import (CUDA mismatch, ABI break), the runner restores the pre-upgrade snapshot of the affected packages.
- Rebuilds extras (`bitsandbytes`, `torchao`, `nunchaku`, `sageattention`) against the new torch so the dependent stack stays coherent.
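The snapshot / install / verify / restore flow can be sketched generically as below. The four callables are hypothetical hooks supplied by the caller. This shows the control flow only and is not the project's upgrade runner.

```python
from typing import Callable, Dict

Snapshot = Dict[str, str]  # package name -> version recorded before the upgrade


def upgrade_with_rollback(
    take_snapshot: Callable[[], Snapshot],
    install_new: Callable[[], None],
    verify_import: Callable[[], None],
    restore: Callable[[Snapshot], None],
) -> bool:
    """Run an upgrade; restore the recorded snapshot if any step raises.

    Returns True when the upgrade sticks and False when it was rolled back.
    """
    snapshot = take_snapshot()
    try:
        install_new()
        # An import failure here (for example a CUDA mismatch or ABI break)
        # is what triggers the rollback.
        verify_import()
    except Exception:
        restore(snapshot)
        return False
    return True
```

Taking the snapshot before touching anything keeps the restore step independent of how far the install got before failing.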
FU-049 → FU-055 — matrix expansion + star favourites + storage explorer
- FU-049 Python 3.14 support gate — tracker row documenting wheel coverage status, plan when gate opens.
- FU-050 cache-strategy matrix runner: reasoning-channel capture (so Qwen3.5/3.6/R1 with `<think>` blocks no longer produce empty `full_text`) + `DEFAULT_MAX_TOKENS` bumped 96 → 512. Fixed three stale paths in the runner (route moved under `/api/chat/`, `runtime.loadedModel.cacheStrategy` lookup, FU-030 legacy-alias assertion).
- FU-051 `effectiveCacheStrategy` field in `/api/models/load` response — preserves the user's literal request for telemetry while exposing the registry-coerced canonical id.
- FU-052 matrix expansion: vLLM cells (CUDA), MTPLX MLX cell, GGUF MTP cell. 15 cells total (was 9). `BackendCapabilities` extended with `mtplx_available`, `gguf_mtp_available`, `vllm_available`.
- FU-053 library status false-positive: Wan 2.2 distill variants were marked "installed" when only the base repo was on disk. New `_distill_transformer_validation_error` helper.
- FU-054 same-repo variants show actual on-disk size + "shares repo with N other variants" badge (Wan 2.2 TI2V 5B GGUF Q4_K_M + Q6_K + Q8_0 now show real per-file sizes instead of the per-row "31.9 GB" repetition).
- FU-055 storage explorer panel in Diagnostics — new `GET /api/diagnostics/storage-top?limit=20` walks every model directory + returns top consumers with reveal-in-finder + delete actions. Cycle protection so symlinked dirs don't double-count.
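A directory walk with cycle protection, as described for the storage explorer, could be sketched like this. The `dir_size` and `top_consumers` names are illustrative. The approach shown (tracking device and inode pairs while following symlinks) is one common technique and not necessarily what the endpoint does internally.

```python
import os
from typing import List, Tuple


def dir_size(root: str) -> int:
    """Sum file sizes under root, following symlinks without double-counting."""
    seen_dirs = set()
    seen_files = set()
    total = 0
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        st = os.stat(dirpath)
        key = (st.st_dev, st.st_ino)
        if key in seen_dirs:
            dirnames[:] = []  # already visited via another path: prune the walk
            continue
        seen_dirs.add(key)
        for filename in filenames:
            try:
                fst = os.stat(os.path.join(dirpath, filename))
            except OSError:
                continue  # broken symlink or file removed mid-walk
            fkey = (fst.st_dev, fst.st_ino)
            if fkey in seen_files:
                continue  # same file reached through a link
            seen_files.add(fkey)
            total += fst.st_size
    return total


def top_consumers(model_dirs: List[str], limit: int = 20) -> List[Tuple[str, int]]:
    """Return the largest directories first, capped at `limit` entries."""
    sizes = [(d, dir_size(d)) for d in model_dirs]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:limit]
```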
v0.9.2 UX polish pass (this release)
- LTX-2 mlx-video dimension snap — `Height must be divisible by 32, got 432` crash on 16:9 / 9:16 / 21:9 presets fixed. Backend snaps width + height to nearest multiple of 32 (bias up on tie), surfaces snap in `runtimeNote`, reports actual rendered dims.
- Warm-cache phase fix — second-generation of the same model variant no longer flashes "Loading…". `generate()` pre-computes `variant_key`, calls `_is_variant_loaded()`, begins on `PHASE_ENCODING` with "Reusing {modelName}" when the pipeline is cached. Symmetric fix in image + video.
- mlx-video progress wiring — `VideoRuntimeManager` mlx branch now wraps `mlx.generate()` with `VIDEO_PROGRESS.begin/finish` + `on_progress` adapter. Subprocess stdout fractions feed `set_step(int(fraction * total_steps), total)`. `fraction=None` preserves the counter (was jittering back to 0.5 on every non-step line). New `ProgressTracker.set_message()` updates message without resetting step or phase.
- mlx-video device memory — `/api/video/mlx-runtime` now populates `deviceMemoryGb` (was always `null`, frontend defaulted to 16 GB on a 64 GB M4 Max).
- CUDA-only FP8 layerwise toggle hidden on Apple Silicon — Image Studio + Video Studio.
- Studio dropdown platform filter — Nunchaku INT4 (CUDA) variants disappear from Image Studio dropdown on macOS; mlx-video / mflux entries disappear on Win/Linux. Symmetric in Video Studio.
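The dimension snap in the first item above is simple arithmetic. A minimal sketch follows, with `snap_to_multiple` as an illustrative name. The floor of one full multiple for very small inputs is an added assumption, not something the notes specify.

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round to the nearest multiple, choosing the larger one on a tie."""
    lower = (value // multiple) * multiple
    upper = lower + multiple
    # Strictly closer to the lower multiple: round down. Equal distance
    # (a tie) or closer to the upper multiple: round up.
    snapped = lower if (value - lower) < (upper - value) else upper
    # Assumption: never return 0 for tiny inputs.
    return max(multiple, snapped)


# 432 sits exactly between 416 and 448, so the tie rule rounds it up.
height = snap_to_multiple(432)  # 448
width = snap_to_multiple(768)   # 768 (already a multiple of 32)
```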
FU-057 → FU-061 — pin bumps + tracker hygiene
- FU-057 (deferred): dflash-mlx v0.1.6 + v0.1.7 released upstream. v0.1.7 README live-validates Qwen3.6 27B 4-bit M5 Max bench at 2.78–3.06× over mlx-lm baseline with adaptive M block size. Migration is not a drop-in — `stream_dflash_generate` signature reshaped, `configure_full_attention_split` removed, `resolve_target_ops` moved. Full migration plan documented in tracker; cheap re-anchor path to v0.1.5.1 reachable tag noted. Current `fada1eb` pin still installable (commit reachable in upstream object pool).
- FU-058: vLLM floor `>=0.8.0` → `>=0.21.0`. Upstream brings Gemma4 MTP (#41745), MTP for MiMo-V2.5 (#41905), EAGLE for Mistral (#41024), TurboQuant hybrid model (#39931), spec-dec with thinking budget (#34668), Qwen3.5/Mamba hybrid Model Runner V2. Breaking build changes upstream (C++20 required, transformers v4 deprecated) consumed transparently — no code changes needed on our side.
- FU-059: nunchaku pin `>=1.2.1` (unsatisfiable — version reset upstream) → `>=0.16.0`. Setup-tab "Install Nunchaku" now resolves to the actual 0.16.1 wheel on PyPI.
- FU-060: memory-pressure gate mocked in `test_video_routes.py` + `test_backend_service.py` setUp/tearDown. Tests deterministic regardless of host load (was flaky on busy dev boxes where host pressure legitimately >92%). `scripts/e2e_test_suite.py` phases 4+5 also treat memory-gate refusals as SKIP (with reason), not FAIL — pre-build ships green on memory-constrained hosts.
- FU-061: "Watching upstream" badge + disabled download CTA for tracked-only seeds (ERNIE-Image, Nucleus-Image, Z-Image, HiDream, GLM-Image, FLUX.2 family) that lack Studio pipeline routing. Backend `_is_launchable_image_repo(repo)` helper; frontend `trackedOnly?: boolean` on `ImageModelVariant` + badge + disabled IconActionButton with explanatory tooltip.
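The FU-060 mocking pattern can be illustrated with the standard library's `unittest.mock`. The `MemoryGate` class and the test case below are hypothetical stand-ins for the backend's real gate and test modules. Only the setUp/tearDown placement and the 92% figure come from the notes.

```python
import unittest
from unittest import mock


class MemoryGate:
    """Hypothetical stand-in for the backend's memory-pressure gate."""

    THRESHOLD = 92.0  # percent; the notes mention hosts legitimately above 92%

    @staticmethod
    def pressure_percent() -> float:
        raise RuntimeError("reads live host state; tests are expected to mock this")

    @classmethod
    def admits(cls) -> bool:
        return cls.pressure_percent() <= cls.THRESHOLD


class VideoRouteTests(unittest.TestCase):
    def setUp(self) -> None:
        # Pin the reading so the outcome does not depend on host load.
        self._patcher = mock.patch.object(
            MemoryGate, "pressure_percent", return_value=10.0
        )
        self._patcher.start()

    def tearDown(self) -> None:
        # Always undo the patch so later tests see the real gate again.
        self._patcher.stop()

    def test_gate_admits_under_mocked_pressure(self) -> None:
        self.assertTrue(MemoryGate.admits())
```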
Cache strategies — TaylorSeer + PAB surfaced
TaylorSeer and PyramidAttentionBroadcast (PAB) diffusion cache strategies now selectable from the Image / Video Studio cache pickers. Closes the FU-026 follow-up — all four diffusers 0.38 native cache configs (TaylorSeer / MagCache / PAB / FasterCache) are exposed.
Bug fixes worth calling out
- TurboQuant ArraysCache slots — preserved for hybrid-attn models (was breaking certain Wan / video DiTs under TurboQuant).
- Library entry pruning — vanished library entries (deleted on disk) disappear from `/api/workspace` + load lookups without waiting for the next full rescan. Trust explicit path over broken catalog match.
- llama.cpp mmproj resolver — scoped to the model's own directory (was picking up unrelated mmproj files from sibling repos).
- Tracked-seed model classifier — added image-family keywords (`ernie-image`, `nucleus-image`, `z-image`, `hidream`, `glm-image`) so tracked seeds don't leak into Chat → My Models tab.
Cross-platform — Windows + WSL CUDA fixes (post-merge)
A Windows-side pytest sweep + matching WSL2 + RTX 4090 dry run after the staging merge caught 19 platform-portability test infra bugs (Windows 16, WSL 3) plus a chat-tab UX gap. All landed on main as part of the v0.9.2 ship.
- Chat empty-state banner (FU-056 follow-up): banner showed "No model loaded yet. Pick one from Models" even while the header strip showed "LOADING MODEL…" for a model already in flight. `ChatThread.tsx` hides the banner when `serverLoading` is non-null (the `ModelLoadingProgress` bubble already conveys state). `ChatEmptyStateBanner.tsx` reworded the "models present but none loaded" branch to "A model needs to be loaded before you can chat." + "Load Model" — actionable copy.
- Windows test-suite fixes (1526 → 1528 / 16 fails → 0):
  - `test_cache_strategies.py`: 2 turboquant tests imported `mlx_lm` / `turboquant_mlx` at function scope without skip guards — added `@unittest.skipUnless(_MLX_LM_AVAILABLE)`.
  - `test_mtplx_engine_integration.py`: 5 tests build a `#!/usr/bin/env bash` wrapper — class-level `@unittest.skipIf(sys.platform == "win32")` since Windows can't honour the bash shebang.
  - `test_sdcpp_image.py` + `test_sdcpp_video.py`: 3 tests asserted str equality against `"/tmp/sd"` but the source does `str(Path(...))` which yields `"\tmp\sd"` on Windows. Centralized via `_FAKE_SD_BIN = str(Path("/tmp/sd"))` so both sides of the assertion stay platform-agnostic.
  - Misc cross-platform shims in `test_gpu.py`, `test_preview_vae.py`, `test_mlx_video_wan_convert.py`.
- WSL test-suite fixes (1510 → 1544 / 3 fails → 0): same MLX/turboquant skip guards apply on non-Apple-Silicon Linux too.
- vLLM engine signature regression gate — new `tests/test_vllm_engine.py` (5 cases) pins the engine.generate() signature against the controller's full call shape, sampler forwarding into SamplingParams, `repeat_penalty` renaming, and the `temperature=0 → 0.01` bump (vLLM forbids exact zero). Tests skip cleanly when the vLLM wheel isn't installed (macOS / Windows CPU-only paths).
- WSL + CUDA test plan — new `docs/WSL_CUDA_TESTING.md`. 7-phase practical guide, live-validated against WSL2 Ubuntu 24.04 + RTX 4090 (CUDA 12.6.85 toolkit, GPU passthrough):
  - Phase C (pytest with CUDA): 1544 / 0 / 3
  - Phase D (E2E full): 7 / 0 / 1
  - Phase E (matrix `--full`): vLLM native Qwen3-0.6B PASS, SHA `d18c2b8cb410`
  - Phase G (real workload): Qwen3-0.6B via vLLM on RTX 4090, 24 GB VRAM allocated, 44% GPU utilization confirmed via `nvidia-smi`
  - Captures 13 install / runtime gotchas surfaced during the dry run (don't pre-pin torch before vllm; `tr -d '\r'` not `sed 's/\r$//'` for CRLF; `CHAOSENGINE_REQUIRE_AUTH=0` for headless E2E; ninja must be on PATH at backend launch for flashinfer JIT; etc.).
Docs
- MkDocs site under `docs/` (Read-The-Docs–style theme) — sources for MTPLX setup, `chaosengine-cli` reference, E2E suite, cache strategy guide, contributor docs.
- Auto-deploy to chaosengineai.com/docs/ via GitHub Actions.
- README accuracy pass — full diffusion cache list (incl. TaylorSeer / PAB), DFlash families, LTX series. README surfaces MTPLX + `chaosengine-cli` + E2E suite for new users.
Tooling
- App version sync gate in `pre-build-check.{sh,mjs}` — pins `pyproject.toml`, `package.json`, `src-tauri/Cargo.toml`, `src-tauri/tauri.conf.json` to the same version. Caught a real drift bug during release prep.
- Cross-strategy matrix expansion — 9 → 15 cells with platform-correct skip reasons (no false fails on non-applicable platforms).
- Pre-build now includes E2E smoke as a gate.
Migration notes
- Existing users: no action required.
- Apple Silicon users: Studio dropdowns now hide Nunchaku INT4 (CUDA) variants. If you previously had one selected, the runtime falls back to the family default (e.g. FLUX.1 Dev base) cleanly.
- Win/Linux users: Studio dropdowns now hide mlx-video / mflux MLX entries. Same fallback shape.
- Spec-dec users on Apple Silicon: if you previously ran an MTP-capable model without spec-dec, MTPLX will now auto-engage if the model has trained MTP heads. Toggle in Settings → Speculative decoding backend if you want to pin a specific lane.
- GGUF MTP users: a rebuild of the bundled `llama-server` is required to pick up PR #22673 (`--spec-type draft-mtp`). Pre-build gate flags this.
- Tracked-only seed users: ERNIE-Image / Nucleus-Image / Z-Image / HiDream / GLM-Image / FLUX.2 Discover rows now show a "Watching upstream" badge + disabled download. These models don't yet have Studio pipeline routing — they were always download-only.
- Devs: torch in-place upgrade is opt-in via the Settings → Setup pill. Rollback runs automatically on import failure. `chaosengine-cli` ships at `scripts/chaosengine-cli`; the phased E2E suite runs against any backend on port 8876.
Stats
- 57 commits since v0.9.0
- pytest: macOS 1541 / Windows 1528 / WSL 1544 — all green
- vitest: 441 passed / 35 files
- TypeScript: 0 errors
- pre-build: 13 / 13 gates green
- E2E full suite: 8 / 8 phases · 36 / 36 checks · 0 fail · 194.5s (macOS); 7 / 0 / 1 (WSL2 + RTX 4090 dry run)
- Cross-strategy matrix `--full` on WSL2 + RTX 4090: vLLM native Qwen3-0.6B PASS (SHA `d18c2b8cb410`)
- i18n: 100 % locale parity across all 10 shipping locales (2024 keys × 10)