v0.9.2 — MTP everywhere (MTPLX + GGUF) · chaosengine-cli + E2E · in-app accelerator UX

@cryptopoly cryptopoly released this 18 May 17:24
· 14 commits to staging since this release

Headline: full-surface MTP speculative decoding (MTPLX on Apple Silicon + GGUF MTP via llama.cpp PR #22673), chaosengine-cli automation wrapper + phased E2E test suite, in-app accelerator install UX (FU-056 9-phase), torch in-place upgrade with rollback, plus a deep polish pass across Image / Video Studio.

54 commits rolled up since v0.9.0. Three feature pillars stand out: MTP speculative decoding lands end-to-end on both Apple Silicon (via MTPLX) and CUDA / GGUF (via llama.cpp PR #22673), delivering measurable wins on Qwen3.6-27B-MTP and Qwen3.5-class models. The new chaosengine-cli automation wrapper + phased E2E test suite turn the app into a scriptable / headless surface — every Studio + Chat operation is now reachable from the shell. And FU-056 brings every accelerator install in-band (Nunchaku, SageAttention, DFlash CUDA, TriAttention, kvpress, vLLM-via-WSL2) so users never drop to PowerShell. Plus 30+ smaller polish items.


Highlights

MTP speculative decoding — both lanes

MTPLX (Apple Silicon, Sonata's native MTP)

  • Native MTP head loader for trained-with-MTP models (Qwen3.6 / Qwen3.5 / DeepSeek V3/R1 / Gemma-4) on Apple Silicon. Runs in-process via the MTPLX engine wrapper.
  • Live bench: M4 Max × Qwen3.6-27B-4bit baseline 15.3 → MTPLX-MTP 23.3 tok/s (+52%). 60–66% draft acceptance.
  • Burst profile for thermal headroom — temporarily widens draft block + fan-control burst window.
  • llama-server upgrade-hint endpoint surfaces a one-click prompt when the bundled llama-server lags upstream MTP features.
  • MTPLX quickstart install flow (no longer requires manual venv); depth=3 default for the right speed / quality tradeoff out of the box.
  • Tracker: PR #54, FU-028 (closes the MLX side).
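The headline percentage can be checked directly from the two throughput figures quoted above. The helper name below is illustrative, not part of the codebase.

```python
def speedup_percent(baseline_tps: float, accelerated_tps: float) -> float:
    """Relative throughput gain of the accelerated run over the baseline, in percent."""
    return (accelerated_tps / baseline_tps - 1.0) * 100.0


# Figures from the M4 Max bench: 15.3 tok/s baseline, 23.3 tok/s with MTPLX-MTP.
mtplx_gain = speedup_percent(15.3, 23.3)
print(f"MTPLX-MTP gain: +{mtplx_gain:.0f}%")
```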

GGUF MTP via llama.cpp PR #22673 (FU-047)

  • llama.cpp PR #22673 merged 2026-05-16 — --spec-type draft-mtp --spec-draft-n-max N for any model with baked-in MTP heads.
  • Live-validated: 20.9 tok/s vs 13.8 Q8_0 baseline (+51%) on ggml-org/Qwen3.6-27B-MTP-GGUF, 60–72% acceptance.
  • Wired in LlamaCppEngine._build_command gated on _llama_server_supports("--spec-type"). New is_mtp_gguf_repo helper, ggufMtpAvailable capability flag, 4 aliases (ggml-org/... canonical + am17an/... author preview), catalog entries for ggml-org/Qwen3.6-27B-MTP-GGUF (29 GB Q8_0) + ggml-org/Qwen3.6-35B-A3B-MTP-GGUF (37 GB Q8_0 MoE).
  • Tensor-level MTP probe recognises nextn_predict marker so even unlabelled MTP GGUFs route correctly.
  • ThermalForge hint surfaces when an MTP draft sustains thermal pressure long enough to merit a profile change.
  • Pre-build gate now asserts the bundled llama-server exposes --spec-type draft-mtp.
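A minimal sketch of the gating described above, assuming the server's capabilities are detected by searching its help text and that tensor names are already available as strings. The function names, the tensor names in the usage note, and the default draft length are hypothetical; only the two flags and the nextn_predict marker come from these notes.

```python
from typing import Iterable, List

MTP_TENSOR_MARKER = "nextn_predict"  # marker named in the release notes


def has_mtp_tensors(tensor_names: Iterable[str]) -> bool:
    """True if any tensor name carries the MTP marker."""
    return any(MTP_TENSOR_MARKER in name for name in tensor_names)


def build_spec_args(help_text: str, tensor_names: Iterable[str],
                    draft_n_max: int = 3) -> List[str]:
    """Return extra llama-server flags, or [] when MTP cannot be used.

    help_text is assumed to be the captured output of `llama-server --help`;
    draft_n_max=3 is an illustrative default, not a documented one.
    """
    server_supports_mtp = "--spec-type" in help_text
    if not (server_supports_mtp and has_mtp_tensors(tensor_names)):
        return []
    return ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(draft_n_max)]
```

With a help text that lists --spec-type and a tensor list containing a name such as "blk.60.nextn_predict.weight" (a made-up example), the function returns the four flag tokens; in every other case it returns an empty list, so the engine falls back to plain decoding.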

FU-048 prefer-GGUF-MTP routing preference

  • When both MTPLX and GGUF MTP are available for a model, the RuntimeController._select_engine heuristic prefers the faster lane unless the user pinned otherwise. Settings → Speculative decoding backend (auto / prefer-gguf-mtp / prefer-mtplx).
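One way to picture the preference logic is the sketch below. It is not the actual RuntimeController._select_engine code: the function name, the lane labels, and the idea of passing a per-lane throughput estimate are assumptions made for illustration. Only the three setting values come from these notes.

```python
from typing import Dict, Optional

VALID_PREFERENCES = ("auto", "prefer-gguf-mtp", "prefer-mtplx")


def select_spec_backend(preference: str,
                        measured_tps: Dict[str, float]) -> Optional[str]:
    """Pick 'mtplx' or 'gguf-mtp' from the lanes that are available.

    measured_tps maps each available lane to a throughput estimate
    (hypothetical input; the real heuristic may use other signals).
    """
    if preference not in VALID_PREFERENCES:
        raise ValueError(f"unknown preference: {preference!r}")
    if not measured_tps:
        return None  # no MTP lane available for this model
    pinned = {"prefer-gguf-mtp": "gguf-mtp", "prefer-mtplx": "mtplx"}.get(preference)
    if pinned in measured_tps:
        return pinned  # honour the user's pin when that lane exists
    # "auto", or a pin whose lane is unavailable: take the faster lane.
    return max(measured_tps, key=measured_tps.get)
```

A pinned lane that is not available falls through to the fastest remaining lane rather than failing, which is one plausible reading of "prefers the faster lane unless the user pinned otherwise".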

chaosengine-cli automation wrapper + phased E2E test suite

The whole app is now scriptable end-to-end.

  • scripts/chaosengine-cli — single-binary CLI wrapper covering chat, image generation, video generation, model library, runtime control, diagnostics, HTML Challenge, benchmarks, downloads, installs. Every operation reachable from the desktop UI is reachable from the shell.
  • Phased E2E test suite (scripts/e2e_test_suite.py), with 36 checks across 8 phases, exercises the full vertical stack against a live backend on port 8876:
    • Phase 0: environment probe (health, routes, GPU status, MTPLX status, inventory)
    • Phase 1: Chat (MLX native cache + TurboQuant + DFlash + MTPLX spec-dec + GGUF llama.cpp + long-context cache + fused-attention flag)
    • Phase 2: Chat Compare two-model
    • Phase 3: HTML Challenge create + delete
    • Phase 4: Image Studio catalog + library + runtime + live generate
    • Phase 5: Video Studio catalog + library + mlx-runtime + live generate
    • Phase 6: Setup probes (mtplx-status, longlive-status, wan-status, wan-inventory, gpu-bundle, turbo-update, vllm-wsl-status, fu-056-capability-flags)
    • Phase 7: Diagnostics + cleanup (snapshot, log-tail, orphan workers, runtime state)
  • Writes CSV + Markdown reports to ~/.chaosengine/test-results/.
  • --smoke mode for sub-60s CI; full sweep ~3 minutes.
  • Pre-build gate integrates the smoke suite — release-blocking on phase failures.
  • Memory-gate refusals correctly classified as SKIP (with reason) not FAIL — pre-build ships green on memory-constrained hosts.
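The SKIP-versus-FAIL rule can be sketched as follows. How the real suite recognises a memory-gate refusal is not stated here, so the substring check, the CheckResult type, and the CSV column names are all assumptions.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    phase: int
    name: str
    status: str  # "PASS", "FAIL" or "SKIP"
    reason: str = ""


def classify(phase: int, name: str, ok: bool, error: str = "") -> CheckResult:
    """Map a raw check outcome to PASS / SKIP / FAIL."""
    if ok:
        return CheckResult(phase, name, "PASS")
    lowered = error.lower()
    # Assumption: a memory-gate refusal is recognisable from the error text.
    if "memory" in lowered and "gate" in lowered:
        return CheckResult(phase, name, "SKIP", reason=error)
    return CheckResult(phase, name, "FAIL", reason=error)


def release_blocked(results: List[CheckResult]) -> bool:
    """Only FAIL blocks the release; SKIP does not."""
    return any(r.status == "FAIL" for r in results)


def to_csv(results: List[CheckResult]) -> str:
    """Render results as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["phase", "name", "status", "reason"])
    for r in results:
        writer.writerow([r.phase, r.name, r.status, r.reason])
    return buffer.getvalue()
```

Carrying the refusal text in the reason column keeps a skipped generate check visible in the report instead of silently passing.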

FU-056 — In-app accelerator install UX (9 phases)

Bring every CUDA-side accelerator install in-band so users never drop to PowerShell. Install affordances live next to the thing they accelerate.

  • Phase 1 — capability probe (backend_service/inference/accelerators.py): lazy importability + version helpers for nunchaku, sageattention, dflash-mlx, dflash-cuda, triattention, kvpress + a Windows-only wsl2_available() shell probe. 11 new fields on BackendCapabilities. 25 unit tests pin the present/absent/broken-install matrix.
  • Phase 2 — <AcceleratorCard> component + catalog: reusable React component with pill / card / row variants. Platform-gated visibility, install state badge, click-through to install.
  • Phase 3 — Image Studio accelerator surfaces: contextual pills on Discover + Models rows showing applicable accelerators per repo (Nunchaku for FLUX, SageAttention CUDA-only, etc.).
  • Phase 4 — Video Studio accelerator surfaces: same pattern for Wan / LTX / HunyuanVideo + LongLive bundle install row.
  • Phase 5 — Chat composer DFlash install nudge: when the loaded model has a registered drafter but dflash-mlx isn't installed, the composer surfaces a one-click install pill.
  • Phase 6 — Diagnostics Boost Pack panel: one-stop view of every accelerator with current install state + per-package install button.
  • Phase 7 — per-variant recommendedAccelerators catalog metadata + i18n strings across all 10 locales.
  • Phase 8 — Windows vLLM-via-WSL2 bridge: WSL2 detector + isolated venv install + remote subprocess engine. 4 live-e2e issues caught + fixed during integration.
  • Phase 9 — cache-strategy-matrix runner integration + pre-build gate.
  • Hide platform-incompatible catalog variants entirely (extends FU-034 "hide unrecoverable options" policy): Windows / Linux users no longer see MLX / mlx-video / mflux / MTPLX entries; Apple Silicon users no longer see vLLM / nunchaku / CUDA-only entries.
  • Hide MTPLX block on non-Apple-Silicon hosts in Chat Compare + HTML Challenge + Launch Modal.
  • Backend --port / --host CLI args properly honored; WSL test scripts pinned.

End-state UX: fresh user installs ChaosEngineAI → downloads FLUX → sees "Nunchaku +3× available [Install]" pill on the catalog card → one click → 90s later first generation runs at SVDQuant speed, no terminal required.
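The Phase 1 capability probe distinguishes present, absent, and broken installs. A minimal sketch of that three-way check using only the standard library is shown below; probe_package and its return shape are hypothetical and do not mirror the actual helpers in backend_service/inference/accelerators.py.

```python
import importlib
import importlib.util
from importlib import metadata
from typing import Optional


def probe_package(module_name: str, dist_name: Optional[str] = None) -> dict:
    """Classify a package as absent, broken, or present (with version if known).

    Nothing is imported until this function is called, so the probe stays lazy.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        spec = None
    if spec is None:
        return {"state": "absent", "version": None}
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # e.g. a missing shared library or an ABI mismatch
        return {"state": "broken", "version": None, "error": repr(exc)}
    try:
        version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        version = None  # importable but carries no distribution metadata
    return {"state": "present", "version": version}
```

The "broken" branch is the interesting one for install UX: the files are on disk, so a plain file check would report success, but the import itself fails.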

Torch in-place upgrade with rollback

  • Detection: backend detects when bundled torch is older than the latest available stable.
  • UI pill: Settings → Setup surfaces "Torch upgrade available" with one-click install.
  • Background job pattern: long-running install runs in the same background-job pattern as LongLive / Wan-convert. Progress + log tail + cancel.
  • Rollback: if the new wheel fails to import (CUDA mismatch, ABI break), the runner restores the pre-upgrade snapshot of the affected packages.
  • Rebuilds extras (bitsandbytes, torchao, nunchaku, sageattention) against the new torch so the dependent stack stays coherent.
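The upgrade-then-verify-then-restore flow can be outlined with injected callables, which keeps the control flow testable without touching a real environment. Every name below is hypothetical; in practice the callables would presumably wrap package-manager invocations and an import check in a subprocess.

```python
from typing import Callable, Dict

Snapshot = Dict[str, str]  # package name -> pinned version


def upgrade_with_rollback(snapshot: Callable[[], Snapshot],
                          install: Callable[[], None],
                          verify: Callable[[], bool],
                          restore: Callable[[Snapshot], None]) -> str:
    """Run an upgrade and restore the previous versions if it does not verify."""
    before = snapshot()          # record versions of the affected packages
    try:
        install()                # may raise if the wheel cannot be installed
        if verify():             # e.g. "does the new torch import cleanly?"
            return "upgraded"
    except Exception:
        pass                     # treat any install/verify error as a failed upgrade
    restore(before)              # put the pre-upgrade versions back
    return "rolled-back"
```

Taking the snapshot before anything is modified is what makes the restore step well defined even when the install fails halfway.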

FU-049 → FU-055 — matrix expansion + star favourites + storage explorer

  • FU-049 Python 3.14 support gate — tracker row documenting wheel coverage status, plan when gate opens.
  • FU-050 cache-strategy matrix runner: reasoning-channel capture (so Qwen3.5/3.6/R1 with <think> blocks no longer produce empty full_text) + DEFAULT_MAX_TOKENS bumped 96 → 512. Fixed three stale paths in the runner (route moved under /api/chat/, runtime.loadedModel.cacheStrategy lookup, FU-030 legacy-alias assertion).
  • FU-051 effectiveCacheStrategy field in /api/models/load response — preserves the user's literal request for telemetry while exposing the registry-coerced canonical id.
  • FU-052 matrix expansion: vLLM cells (CUDA), MTPLX MLX cell, GGUF MTP cell. 15 cells total (was 9). BackendCapabilities extended with mtplx_available, gguf_mtp_available, vllm_available.
  • FU-053 library status false-positive: Wan 2.2 distill variants were marked "installed" when only the base repo was on disk. New _distill_transformer_validation_error helper.
  • FU-054 same-repo variants show actual on-disk size + "shares repo with N other variants" badge (Wan 2.2 TI2V 5B GGUF Q4_K_M + Q6_K + Q8_0 now show real per-file sizes instead of the per-row "31.9 GB" repetition).
  • FU-055 storage explorer panel in Diagnostics — new GET /api/diagnostics/storage-top?limit=20 walks every model directory + returns top consumers with reveal-in-finder + delete actions. Cycle protection so symlinked dirs don't double-count.
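The cycle protection mentioned for FU-055 can be approximated by remembering each directory's (device, inode) pair while walking. This is a sketch under that assumption, not the endpoint's actual implementation; the function names are illustrative.

```python
import os
from typing import List, Set, Tuple


def dir_size(root: str, seen: Set[Tuple[int, int]]) -> int:
    """Sum file sizes under root, visiting each real directory at most once."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        try:
            st = os.stat(dirpath)
        except OSError:
            dirnames[:] = []
            continue
        key = (st.st_dev, st.st_ino)
        if key in seen:
            dirnames[:] = []  # already counted via another path: do not descend
            continue
        seen.add(key)
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file vanished or is unreadable
    return total


def top_consumers(models_root: str, limit: int = 20) -> List[Tuple[str, int]]:
    """Return the largest immediate subdirectories as (name, bytes) pairs."""
    seen: Set[Tuple[int, int]] = set()
    sizes = []
    with os.scandir(models_root) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_dir():
                sizes.append((entry.name, dir_size(entry.path, seen)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:limit]
```

Sharing one seen set across all top-level directories means a directory reachable through a symlink from two models is attributed to whichever is walked first. The guard covers directories only; a symlinked file would still be counted at each location.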

v0.9.2 UX polish pass (this release)

  • LTX-2 mlx-video dimension snap: the "Height must be divisible by 32, got 432" crash on 16:9 / 9:16 / 21:9 presets is fixed. Backend snaps width + height to nearest multiple of 32 (bias up on tie), surfaces snap in runtimeNote, reports actual rendered dims.
  • Warm-cache phase fix — second-generation of the same model variant no longer flashes "Loading…". generate() pre-computes variant_key, calls _is_variant_loaded(), begins on PHASE_ENCODING with "Reusing {modelName}" when the pipeline is cached. Symmetric fix in image + video.
  • mlx-video progress wiring: VideoRuntimeManager mlx branch now wraps mlx.generate() with VIDEO_PROGRESS.begin/finish + on_progress adapter. Subprocess stdout fractions feed set_step(int(fraction * total_steps), total). fraction=None preserves the counter (was jittering back to 0.5 on every non-step line). New ProgressTracker.set_message() updates message without resetting step or phase.
  • mlx-video device memory: /api/video/mlx-runtime now populates deviceMemoryGb (was always null, frontend defaulted to 16 GB on a 64 GB M4 Max).
  • CUDA-only FP8 layerwise toggle hidden on Apple Silicon — Image Studio + Video Studio.
  • Studio dropdown platform filter — Nunchaku INT4 (CUDA) variants disappear from Image Studio dropdown on macOS; mlx-video / mflux entries disappear on Win/Linux. Symmetric in Video Studio.
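The dimension snap rule from the first item above (nearest multiple of 32, bias up on a tie) is small enough to write out. The function name is illustrative, and the floor of one full multiple for very small inputs is an added assumption.

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Snap value to the nearest multiple, choosing the larger one on a tie."""
    lower = (value // multiple) * multiple
    if value == lower:
        return max(value, multiple)  # already aligned
    upper = lower + multiple
    snapped = upper if (upper - value) <= (value - lower) else lower
    return max(snapped, multiple)  # assumption: never snap down to zero


# 432 sits exactly between 416 and 448, so the tie rule picks 448.
print(snap_to_multiple(432))
```

For the failing height of 432 the two candidates are 416 and 448, each 16 away, so the tie rule decides the result.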

FU-057 → FU-061 — pin bumps + tracker hygiene

  • FU-057 (deferred): dflash-mlx v0.1.6 + v0.1.7 released upstream. v0.1.7 README live-validates Qwen3.6 27B 4-bit M5 Max bench at 2.78–3.06× over mlx-lm baseline with adaptive M block size. Migration is not a drop-in: stream_dflash_generate signature reshaped, configure_full_attention_split removed, resolve_target_ops moved. Full migration plan documented in tracker; cheap re-anchor path to v0.1.5.1 reachable tag noted. Current fada1eb pin still installable (commit reachable in upstream object pool).
  • FU-058: vLLM floor >=0.8.0 → >=0.21.0. Upstream brings Gemma4 MTP (#41745), MTP for MiMo-V2.5 (#41905), EAGLE for Mistral (#41024), TurboQuant hybrid model (#39931), spec-dec with thinking budget (#34668), Qwen3.5/Mamba hybrid Model Runner V2. Breaking build changes upstream (C++20 required, transformers v4 deprecated) consumed transparently; no code changes needed on our side.
  • FU-059: nunchaku pin >=1.2.1 (unsatisfiable — version reset upstream) → >=0.16.0. Setup-tab "Install Nunchaku" now resolves to the actual 0.16.1 wheel on PyPI.
  • FU-060: memory-pressure gate mocked in test_video_routes.py + test_backend_service.py setUp/tearDown. Tests deterministic regardless of host load (was flaky on busy dev boxes where host pressure legitimately >92%). scripts/e2e_test_suite.py phases 4+5 also treat memory-gate refusals as SKIP (with reason), not FAIL — pre-build ships green on memory-constrained hosts.
  • FU-061: "Watching upstream" badge + disabled download CTA for tracked-only seeds (ERNIE-Image, Nucleus-Image, Z-Image, HiDream, GLM-Image, FLUX.2 family) that lack Studio pipeline routing. Backend _is_launchable_image_repo(repo) helper; frontend trackedOnly?: boolean on ImageModelVariant + badge + disabled IconActionButton with explanatory tooltip.
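The nunchaku pin problem in FU-059 comes down to a floor that no published version can meet. The check below is a deliberately simplified illustration that handles only plain dotted numeric versions and a single ">=" floor; real resolvers follow the full Python version-specifier rules, and the helper names are made up.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted version such as '0.16.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_floor(available: str, floor_spec: str) -> bool:
    """Check one '>=' floor against one available version."""
    if not floor_spec.startswith(">="):
        raise ValueError("only '>=' floors are handled in this sketch")
    return parse_version(available) >= parse_version(floor_spec[2:])


# The 0.16.1 wheel cannot satisfy the old floor but does satisfy the new one.
print(satisfies_floor("0.16.1", ">=1.2.1"), satisfies_floor("0.16.1", ">=0.16.0"))
```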

Cache strategies — TaylorSeer + PAB surfaced

TaylorSeer and PyramidAttentionBroadcast (PAB) diffusion cache strategies now selectable from the Image / Video Studio cache pickers. Closes the FU-026 follow-up — all four diffusers 0.38 native cache configs (TaylorSeer / MagCache / PAB / FasterCache) are exposed.

Bug fixes worth calling out

  • TurboQuant ArraysCache slots — preserved for hybrid-attn models (was breaking certain Wan / video DiTs under TurboQuant).
  • Library entry pruning — vanished library entries (deleted on disk) disappear from /api/workspace + load lookups without waiting for the next full rescan. Trust explicit path over broken catalog match.
  • llama.cpp mmproj resolver — scoped to the model's own directory (was picking up unrelated mmproj files from sibling repos).
  • Tracked-seed model classifier — added image-family keywords (ernie-image, nucleus-image, z-image, hidream, glm-image) so tracked seeds don't leak into Chat → My Models tab.

Cross-platform — Windows + WSL CUDA fixes (post-merge)

A Windows-side pytest sweep + matching WSL2 + RTX 4090 dry run after the staging merge caught 19 platform-portability test infra bugs (Windows 16, WSL 3) plus a chat-tab UX gap. All landed on main as part of the v0.9.2 ship.

  • Chat empty-state banner (FU-056 follow-up): banner showed "No model loaded yet. Pick one from Models" even while the header strip showed "LOADING MODEL…" for a model already in flight. ChatThread.tsx hides the banner when serverLoading is non-null (the ModelLoadingProgress bubble already conveys state). ChatEmptyStateBanner.tsx reworded the "models present but none loaded" branch to "A model needs to be loaded before you can chat." + "Load Model" — actionable copy.
  • Windows test-suite fixes (1526 → 1528 / 16 fails → 0):
    • test_cache_strategies.py: 2 turboquant tests imported mlx_lm / turboquant_mlx at function scope without skip guards — added @unittest.skipUnless(_MLX_LM_AVAILABLE).
    • test_mtplx_engine_integration.py: 5 tests build a #!/usr/bin/env bash wrapper — class-level @unittest.skipIf(sys.platform == "win32") since Windows can't honour the bash shebang.
    • test_sdcpp_image.py + test_sdcpp_video.py: 3 tests asserted str equality against "/tmp/sd" but the source does str(Path(...)) which yields "\tmp\sd" on Windows. Centralized via _FAKE_SD_BIN = str(Path("/tmp/sd")) so both sides of the assertion stay platform-agnostic.
    • Misc cross-platform shims in test_gpu.py, test_preview_vae.py, test_mlx_video_wan_convert.py.
  • WSL test-suite fixes (1510 → 1544 / 3 fails → 0): same MLX/turboquant skip guards apply on non-Apple-Silicon Linux too.
  • vLLM engine signature regression gate — new tests/test_vllm_engine.py (5 cases) pins the engine.generate() signature against the controller's full call shape, sampler forwarding into SamplingParams, repeat_penalty renaming, and the temperature=0 → 0.01 bump (vLLM forbids exact zero). Tests skip cleanly when the vLLM wheel isn't installed (macOS / Windows CPU-only paths).
  • WSL + CUDA test plan — new docs/WSL_CUDA_TESTING.md. 7-phase practical guide, live-validated against WSL2 Ubuntu 24.04 + RTX 4090 (CUDA 12.6.85 toolkit, GPU passthrough):
    • Phase C (pytest with CUDA): 1544 / 0 / 3
    • Phase D (E2E full): 7 / 0 / 1
    • Phase E (matrix --full): vLLM native Qwen3-0.6B PASS, SHA d18c2b8cb410
    • Phase G (real workload): Qwen3-0.6B via vLLM on RTX 4090, 24 GB VRAM allocated, 44% GPU utilization confirmed via nvidia-smi
    • Captures 13 install / runtime gotchas surfaced during the dry run (don't pre-pin torch before vllm; tr -d '\r' not sed 's/\r$//' for CRLF; CHAOSENGINE_REQUIRE_AUTH=0 for headless E2E; ninja must be on PATH at backend launch for flashinfer JIT; etc.).
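The path-equality failure in the sdcpp tests is easy to reproduce on any host with the pure path classes, which apply one platform's rules without needing that platform. The fake_sd_bin helper below is illustrative and only mirrors the idea behind _FAKE_SD_BIN.

```python
from pathlib import PurePosixPath, PureWindowsPath

LITERAL = "/tmp/sd"

# The same literal stringifies differently under each platform's rules.
posix_form = str(PurePosixPath(LITERAL))      # forward slashes kept
windows_form = str(PureWindowsPath(LITERAL))  # separators become backslashes


def fake_sd_bin(path_cls) -> str:
    """Build the expected value through the same path class as the code under test."""
    return str(path_cls(LITERAL))


# Comparing like with like holds on both flavours; comparing against the raw
# literal only holds on POSIX.
assert fake_sd_bin(PureWindowsPath) == windows_form
assert fake_sd_bin(PurePosixPath) == LITERAL
assert windows_form != LITERAL
```

Deriving the expected string through the same conversion as the code under test removes the platform dependence from the assertion, instead of special-casing Windows in each test.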

Docs

  • MkDocs site under docs/ (Read-The-Docs–style theme) — sources for MTPLX setup, chaosengine-cli reference, E2E suite, cache strategy guide, contributor docs.
  • Auto-deploy to chaosengineai.com/docs/ via GitHub Actions.
  • README accuracy pass — full diffusion cache list (incl. TaylorSeer / PAB), DFlash families, LTX series. README surfaces MTPLX + chaosengine-cli + E2E suite for new users.

Tooling

  • App version sync gate in pre-build-check.{sh,mjs} — pins pyproject.toml, package.json, src-tauri/Cargo.toml, src-tauri/tauri.conf.json to the same version. Caught a real drift bug during release prep.
  • Cross-strategy matrix expansion — 9 → 15 cells with platform-correct skip reasons (no false fails on non-applicable platforms).
  • Pre-build now includes E2E smoke as a gate.
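A version sync gate over the four files named above might look like the sketch below. It is not the actual pre-build-check script: the regex-based TOML read is a simplification that takes the first top-level version line, and the location of the version key in tauri.conf.json (top level, or nested under "package") is an assumption.

```python
import json
import re
from pathlib import Path
from typing import Dict

TOML_VERSION = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)


def toml_version(path: Path) -> str:
    """Return the first `version = "..."` line found in a TOML file."""
    match = TOML_VERSION.search(path.read_text())
    if match is None:
        raise ValueError(f"no version field in {path}")
    return match.group(1)


def json_version(path: Path) -> str:
    """Return the version from a JSON manifest (top level, else under 'package')."""
    data = json.loads(path.read_text())
    return data.get("version") or data.get("package", {})["version"]


def read_versions(root: Path) -> Dict[str, str]:
    """Collect the version string from each of the four pinned files."""
    return {
        "pyproject.toml": toml_version(root / "pyproject.toml"),
        "package.json": json_version(root / "package.json"),
        "src-tauri/Cargo.toml": toml_version(root / "src-tauri" / "Cargo.toml"),
        "src-tauri/tauri.conf.json": json_version(root / "src-tauri" / "tauri.conf.json"),
    }


def versions_in_sync(root: Path) -> bool:
    """True when all four files declare the same version."""
    return len(set(read_versions(root).values())) == 1
```

Returning the per-file mapping from read_versions makes a failing gate easy to report, since the drifting file is visible by name.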

Migration notes

  • Existing users: no action required.
  • Apple Silicon users: Studio dropdowns now hide Nunchaku INT4 (CUDA) variants. If you previously had one selected, the runtime falls back to the family default (e.g. FLUX.1 Dev base) cleanly.
  • Win/Linux users: Studio dropdowns now hide mlx-video / mflux MLX entries. Same fallback shape.
  • Spec-dec users on Apple Silicon: if you previously ran an MTP-capable model without spec-dec, MTPLX will now auto-engage if the model has trained MTP heads. Toggle in Settings → Speculative decoding backend if you want to pin a specific lane.
  • GGUF MTP users: a rebuild of the bundled llama-server is required to pick up PR #22673 (--spec-type draft-mtp). Pre-build gate flags this.
  • Tracked-only seed users: ERNIE-Image / Nucleus-Image / Z-Image / HiDream / GLM-Image / FLUX.2 Discover rows now show a "Watching upstream" badge + disabled download. These models don't yet have Studio pipeline routing — they were always download-only.
  • Devs: torch in-place upgrade is opt-in via the Settings → Setup pill. Rollback runs automatically on import failure. chaosengine-cli ships at scripts/chaosengine-cli; the phased E2E suite runs against any backend on port 8876.

Stats

  • 57 commits since v0.9.0
  • pytest: macOS 1541 / Windows 1528 / WSL 1544 — all green
  • vitest: 441 passed / 35 files
  • TypeScript: 0 errors
  • pre-build: 13 / 13 gates green
  • E2E full suite: 8 / 8 phases · 36 / 36 checks · 0 fail · 194.5s (macOS); 7 / 0 / 1 (WSL2 + RTX 4090 dry run)
  • Cross-strategy matrix --full on WSL2 + RTX 4090: vLLM native Qwen3-0.6B PASS (SHA d18c2b8cb410)
  • i18n: 100 % locale parity across all 10 shipping locales (2024 keys × 10)