feat(M32d): KV cache for qwen3_moe inference path — 19× speedup by noahgift · Pull Request #1832 · paiml/aprender

noahgift · 2026-05-20T05:34:05Z

Summary

Implements M32d KV cache for the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.1 → v1.2.0). Empirical: 19× speedup on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.

The operator switched from Option (b), the engineer-driven plan (#1829), to Option (a), in-session implementation. This PR supersedes #1829.

Empirical results

| Metric | Pre-M32d | Post-M32d | Speedup |
|---|---|---|---|
| Sustained throughput (32 tok) | ~0.5 tok/s | 9.62 tok/s | 19× |
| Wall on 4 tokens | 1002ms | 553ms | 1.8× |
| Greedy output equivalence | n/a | byte-identical | ✓ |

All measurements on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.

Implementation

New: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` step-for-step EXCEPT the FFN block, which calls `moe_ffn_forward_layer` (router → top-k → per-expert SwiGLU) instead of dense gate/up/down dispatch.
Rewrite: `run_qwen3_moe_generate` now uses cache-aware decode: prefill per-prompt-token + decode per-output-token, both via the new single-token function.
Visibility fix: `single_cache_final_output` in `ffn_block.rs` → `pub(crate)` so MoE path reuses the dense final-norm + LM head unchanged.
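
To illustrate why a cache-aware single-token step can reproduce the full-prefill result, here is a toy single-head sketch. It is not the code in `forward_qwen3_moe.rs`: `ToyKvCache`, `attend`, `step_with_cache`, and `step_full_recompute` are hypothetical names, and the sketch omits RoPE, GQA, normalization, and the MoE FFN entirely.

```rust
/// Toy single-head KV cache: one key and one value vector per past position.
/// (Hypothetical type, not the crate's `OwnedQuantizedKVCache`.)
#[derive(Default)]
struct ToyKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scaled softmax attention of query `q` over the given keys and values.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut out = vec![0.0; values[0].len()];
    for (w, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / sum) * x;
        }
    }
    out
}

/// Cache-aware step: append this token's K/V once, then attend over the cache.
fn step_with_cache(cache: &mut ToyKvCache, q: &[f32], k: &[f32], v: &[f32]) -> Vec<f32> {
    cache.keys.push(k.to_vec());
    cache.values.push(v.to_vec());
    attend(q, &cache.keys, &cache.values)
}

/// Legacy-style step: rebuild K/V for the whole prefix on every token.
/// Each tuple is (query, key, value) for one position.
fn step_full_recompute(tokens: &[(Vec<f32>, Vec<f32>, Vec<f32>)]) -> Vec<f32> {
    let keys: Vec<Vec<f32>> = tokens.iter().map(|t| t.1.clone()).collect();
    let values: Vec<Vec<f32>> = tokens.iter().map(|t| t.2.clone()).collect();
    let q = &tokens.last().expect("at least one token").0;
    attend(q, &keys, &values)
}
```

In this toy both paths run the same arithmetic in the same order, so their outputs match exactly. The cached path does constant K/V work per new token, while the recompute path redoes the whole prefix.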

Risk surfaces from scope doc (all cleared)

✅ Numerical equivalence — byte-identical greedy outputs on 4-token reference
✅ Dense path regression — `forward_single_with_cache` untouched
✅ RoPE position offset — handled via `position` parameter (same pattern)
✅ GQA expansion — handled via `kv_dim()` + first-token edge case explicit
✅ Expert routing under cache — confirmed unaffected
◯ Streaming SSE — structurally enabled; not wired (separate follow-up)
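
The "expert routing under cache" item can be illustrated with a small routing sketch. The function below takes only the current token's router logits, so nothing in it depends on cached state. `top_k_route` is a hypothetical helper, and renormalizing the kept probabilities is an assumption here, not something this PR states about `moe_ffn_forward_layer`.

```rust
/// Select the `k` highest-probability experts from one token's router logits
/// and return (expert index, weight) pairs whose weights sum to 1.
/// Assumption: weights are renormalized over the kept experts.
fn top_k_route(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stable softmax over all experts.
    let max = router_logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = router_logits.iter().map(|l| (l - max).exp()).collect();
    let total: f32 = exps.iter().sum();

    // Rank experts by probability, highest first, and keep the top k.
    let mut ranked: Vec<(usize, f32)> =
        exps.iter().map(|e| e / total).enumerate().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.truncate(k);

    // Renormalize so the kept weights sum to 1.
    let kept: f32 = ranked.iter().map(|(_, p)| p).sum();
    ranked.into_iter().map(|(i, p)| (i, p / kept)).collect()
}
```

Because the input is a per-token logit vector, calling this once per decoded token gives the same routing whether earlier positions came from a cache or from a full recompute, provided the hidden state feeding the router is the same.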

New tests

`crates/aprender-serve/tests/moe_kv_cache_equivalence.rs` — generates 4 tokens via cache-on AND legacy full-prefill, asserts greedy outputs byte-identical
`crates/aprender-serve/tests/m32d_perf.rs` — asserts ≥ 5 tok/s sustained on 32-token gen (pinned floor)

Both tests are `#[ignore]`d and env-gated on `QWEN3_MOE_GGUF_PATH`.
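
For readers unfamiliar with the pattern, an ignored, env-gated throughput test might look roughly like the sketch below. This is not the content of `m32d_perf.rs`: `TPS_FLOOR`, `gated_model_path`, `generate_stub`, and `perf_floor_sketch` are hypothetical names, and the stub does not load a model.

```rust
use std::time::{Duration, Instant};

/// Hypothetical floor constant, mirroring the ≥ 5 tok/s figure quoted above.
const TPS_FLOOR: f64 = 5.0;

/// Sustained throughput in tokens per second.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    tokens as f64 / elapsed.as_secs_f64()
}

/// Returns the model path only when the gating variable is set and non-empty.
fn gated_model_path(var: &str) -> Option<String> {
    std::env::var(var).ok().filter(|p| !p.is_empty())
}

/// Hypothetical stand-in for real generation; returns the token count produced.
fn generate_stub(_model_path: &str, n: usize) -> usize {
    n
}

#[test]
#[ignore] // only runs with `cargo test -- --ignored`
fn perf_floor_sketch() {
    // Skip quietly when no model file is configured.
    let Some(path) = gated_model_path("QWEN3_MOE_GGUF_PATH") else {
        eprintln!("QWEN3_MOE_GGUF_PATH not set; skipping");
        return;
    };
    let start = Instant::now();
    let generated = generate_stub(&path, 32);
    let tps = tokens_per_second(generated, start.elapsed());
    assert!(tps >= TPS_FLOOR, "throughput {tps:.2} tok/s below floor");
}
```

Gating on both `#[ignore]` and an environment variable keeps the default `cargo test` run fast and independent of a multi-gigabyte model file.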

V1_001 + V1_003 regression

Existing #1819 cargo test still passes: 9.39s wall, content "Human: What", no matmul guard fire.

Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at sustained 9.62 tok/s. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.
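
As a rough sanity check on the per-turn budget, the arithmetic below compares decode time for a full `APR_AGENT_MAX_TOKENS_CAP=1024` response against `APR_TIMEOUT_S=900`. It is a back-of-envelope sketch only: `decode_seconds` and `fits_budget` are hypothetical helpers, prefill time is ignored, and throughput is treated as constant, which is an approximation for the pre-M32d path.

```rust
/// Seconds of pure decode time for `tokens` output tokens at `tok_per_s`.
fn decode_seconds(tokens: u32, tok_per_s: f64) -> f64 {
    f64::from(tokens) / tok_per_s
}

/// True when decoding `tokens` at `tok_per_s` finishes within `timeout_s`.
fn fits_budget(tokens: u32, tok_per_s: f64, timeout_s: f64) -> bool {
    decode_seconds(tokens, tok_per_s) <= timeout_s
}
```

Under these assumptions, 1024 tokens at 9.62 tok/s takes about 106 s, well inside the 900 s timeout, whereas 1024 tokens at ~0.5 tok/s would take about 2048 s and exceed it.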

Test plan

`cargo check -p aprender-serve --lib --features cuda` — clean
`cargo check --test moe_kv_cache_equivalence --test m32d_perf` — clean
`QWEN3_MOE_GGUF_PATH=... cargo test moe_kv_cache_equivalence --release -- --ignored` — PASS, 4 tokens byte-identical
`QWEN3_MOE_GGUF_PATH=... cargo test m32d_perf --release -- --ignored` — PASS, 9.62 tok/s sustained
`QWEN3_MOE_GGUF_PATH=... cargo test qwen3_moe_serve_dispatch_v1 --release -- --ignored` (regression check using the #1819 test, "fix(#1789): V1_001 + V1_003 integration test against real Qwen3-MoE GGUF"): PASS
CI (sovereign-ci full workflow)
Companion-side CCPA Phase 6 V1_004 discharge bench (operator-coordinated, ~10 hr wall)

🤖 Generated with Claude Code

…spatch ready (#254)

Upstream M32d KV cache for qwen3_moe inference path shipped at paiml/aprender#1832 (open; in CI). Operator flipped from Option (b) engineer-driven (#1829) to Option (a) in-session implementation.

Empirical (on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M):

- Pre-M32d: ~0.5 tok/s (bench timed out on every per-turn budget)
- Post-M32d: 9.62 tok/s sustained on 32-token gen (19× speedup)
- Numerical equivalence vs full-prefill: byte-identical greedy outputs
- V1_001 + V1_003 (#1819 cargo test) regression: stable

V1_004 prerequisite (M32d KV cache) NOW MET. Bench discharge is operator-actionable on a tractable ~10hr wall.

## Files

### NEW: `evidence/phase-6/m32d-shipped-2026-05-20.md`

- Upstream empirical results table
- New cargo tests pinning the invariants (equivalence + perf floor)
- V1_004 dispatch checklist (7 operator steps)
- Cross-references to all upstream PRs

### MODIFIED: `evidence/phase-6/1.5b-calibration-run.md`

- aprender#1789 line: V1_004 status flipped from BLOCKED to "prerequisite MET 2026-05-20 via M32d"
- Updated PR list with all 7 follow-up PRs (#1806, #1807, #1812, #1814, #1819, #1826, #1832)
- Added cross-reference to m32d-shipped-2026-05-20.md

## What this is NOT

- NOT a CCPA-side code change (bench script + analyzer + harness unchanged)
- NOT the V1_004 bench dispatch itself (operator-coordinated, ~10hr wall)
- NOT a new CCPA contract gate (V1_004 is unchanged; only its prerequisite flipped)

Mechanical doc update. M-counter NOT bumped per the discipline doctrine.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Implements M32d KV cache support on the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).

## Empirical results

On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:

- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on every per-turn budget, with 5 timeout-class dispatches recorded in paiml/claude-code-parity-apr evidence/phase-6/30b-moe-empirical-2026-05-19.md)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation (19× speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs full-prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms; ~2× speedup even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still passes (9.39s wall, content "Human: What", no matmul guard fire)

## Implementation

**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs:~165) step-for-step EXCEPT at the FFN block, where it calls `moe_ffn_forward_layer` (router → top-k expert select → per-expert SwiGLU → weighted sum → down projection) instead of the dense gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm, RoPE at `position`, GQA-aware cached attention, output projection, residual) is byte-identical to the dense reference.

**Generate loop rewrite**: `crates/aprender-serve/src/infer/qwen3_moe_generate.rs::run_qwen3_moe_generate` now:

1. Allocates `OwnedQuantizedKVCache` sized to `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
2. Prefill: per prompt token, calls `forward_single_qwen3_moe_with_cache` (cache fills incrementally; final iteration's logits seed decode)
3. Decode: greedy-argmax → append → next cache-aware forward
4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full

**Visibility fix**: `single_cache_final_output` in `ffn_block.rs` bumped to `pub(crate)` so the MoE function can reuse the dense final-norm + LM head path unchanged. Same edit applied to the orphan `debug.rs` duplicate for hygiene (it's not in the build graph but mirrors ffn_block.rs).

## New tests (both `#[ignore]`'d, env-gated)

- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs`: Generates 4 tokens via M32d cache-on path AND a legacy full-prefill loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms perf numbers in eprintln output.
- `crates/aprender-serve/tests/m32d_perf.rs`: Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s. Floor pinned via `M32D_TPS_FLOOR` constant. Catches future KV-cache regressions.

Activation:

```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
cargo test --test moe_kv_cache_equivalence --test m32d_perf \
-p aprender-serve --features cuda --release -- --ignored --nocapture
```

## Risk assessment vs scope doc

All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md` were addressed:

1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache` not touched (only its sibling `single_cache_final_output` visibility bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed to `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as dense reference); first-token edge case (empty cache) explicitly handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected; router reads from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into the chat handler (separate follow-up contract).

## Contract bump

v1.1.1 → v1.2.0:

- V1_004 entry gains `prerequisite_status` field documenting M32d shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry

## Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s sustained. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.

## Supersedes

#1829 (Option b engineer-playbook + V1_004 status formalization). Operator flipped from Option (b) to Option (a) in-session; this PR delivers the actual implementation. #1829 can be closed as superseded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
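
The generate-loop steps described in this commit message (cache sizing, per-token prefill, greedy decode, stop conditions) can be sketched as follows. This is a simplified illustration, not the body of `run_qwen3_moe_generate`: `cache_capacity`, `argmax`, and `generate` are hypothetical names, and the `forward` closure stands in for the single-token cache-aware forward pass.

```rust
/// Cache capacity rule: max(context length, prompt_len + max_tokens + 8).
fn cache_capacity(context_length: usize, prompt_len: usize, max_tokens: usize) -> usize {
    context_length.max(prompt_len + max_tokens + 8)
}

/// Index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Greedy generation over a hypothetical `forward(token, position) -> logits`.
fn generate<F>(
    prompt: &[u32],
    max_tokens: usize,
    stop_tokens: &[u32],
    context_length: usize,
    mut forward: F,
) -> Vec<u32>
where
    F: FnMut(u32, usize) -> Vec<f32>,
{
    let capacity = cache_capacity(context_length, prompt.len(), max_tokens);
    let mut position = 0usize;
    let mut logits: Vec<f32> = Vec::new();

    // Prefill: one forward call per prompt token; the last logits seed decode.
    for &token in prompt {
        logits = forward(token, position);
        position += 1;
    }

    // Decode: greedy argmax, append, then the next single-token forward.
    let mut output = Vec::new();
    while !logits.is_empty() && output.len() < max_tokens && position < capacity {
        let next = argmax(&logits);
        if stop_tokens.contains(&next) {
            break;
        }
        output.push(next);
        logits = forward(next, position);
        position += 1;
    }
    output
}
```

Each loop iteration makes exactly one single-token forward call, so the per-token cost stays roughly flat instead of growing with the prefix length.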

…2d (#1837)

Registers a new provable contract for temperature + top_k + top_p sampling on the qwen3_moe inference path. Sibling to qwen3-moe-streaming-sse-v1 (#1835); independent follow-up to M32d (#1832).

## Motivation

`run_qwen3_moe_generate` currently does unconditional greedy argmax. QuantizedGenerateConfig.temperature/top_k/top_p are silently ignored. HTTP chat requests with non-zero temperature get the SAME output every time, which is unacceptable for production chat.

## Why now

The V1_004 bench (paiml/claude-code-parity-apr Phase 6 against Qwen3-Coder-30B-A3B) is hitting 900s per-turn timeout because the 30B-MoE is verbose. Temperature scaling (e.g. 0.3) could concentrate probability mass on high-confidence tokens and reduce rambling. Greedy-only forces ONE point on the spectrum.

## Falsification gates

- V1_001: greedy-fallback (temperature=0 OR top_k=1) → deterministic
- V1_002: temperature>0 + fixed seed → deterministic
- V1_003: temperature>0 + different seeds → different outputs
- V1_004: top_k=1 with high temperature → still equivalent to greedy

## Implementation phases (engineer playbook)

- Phase 1 (~2hr): lift dense-path sampling block into reusable helper
- Phase 2 (~1hr): wire into run_qwen3_moe_generate decode loop
- Phase 3 (~2-3hr): cargo test battery + optional companion sub-bench

Total ~5-6 hours; operator-actionable any time. Independent of qwen3-moe-streaming-sse-v1.

NOT in scope:

- Repetition penalty (separate qwen3-moe-repetition-penalty-v1 contract)
- Mirostat / logit bias / streaming (separate concerns)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
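
The falsification gates in this contract can be made concrete with a small seeded sampler. The sketch below is not the dense-path sampling block the commit refers to: `next_uniform` and `sample` are hypothetical, the xorshift generator is a stand-in for whatever RNG the crate uses, and top_p is left out for brevity.

```rust
/// xorshift64 step mapped to a uniform f32 in [0, 1). `state` must be nonzero.
fn next_uniform(state: &mut u64) -> f32 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    ((*state >> 40) as f32) / ((1u64 << 24) as f32)
}

/// Temperature + top_k sampling with a greedy fallback.
/// `top_k == 0` means "no top_k limit" in this sketch.
fn sample(logits: &[f32], temperature: f32, top_k: usize, rng: &mut u64) -> usize {
    // Rank candidates by logit, highest first (stable, so ties keep index order).
    let mut ranked: Vec<(usize, f32)> = logits.iter().cloned().enumerate().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));

    // Greedy fallback: temperature = 0 or top_k = 1 always picks the argmax.
    if temperature <= 0.0 || top_k == 1 {
        return ranked[0].0;
    }
    if top_k > 0 {
        ranked.truncate(top_k);
    }

    // Temperature-scaled softmax weights over the surviving candidates.
    let max = ranked[0].1;
    let weights: Vec<f32> = ranked
        .iter()
        .map(|(_, l)| ((l - max) / temperature).exp())
        .collect();
    let total: f32 = weights.iter().sum();

    // Inverse-CDF draw using the seeded generator.
    let mut threshold = next_uniform(rng) * total;
    for ((idx, _), w) in ranked.iter().zip(&weights) {
        if threshold < *w {
            return *idx;
        }
        threshold -= w;
    }
    ranked[ranked.len() - 1].0
}
```

With this shape, a fixed seed reproduces the same token sequence and `top_k=1` collapses to greedy regardless of temperature, which is the behavior the V1_001, V1_002, and V1_004 gates describe.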

#1835)

Registers a new provable contract for per-token SSE streaming on the qwen3_moe chat-completions path. This is the natural follow-up to #1832 (M32d KV cache): pre-M32d, streaming was meaningless because full-prefill-per-token mode took ~30 minutes per 256-token completion. Post-M32d at 9.62 tok/s sustained, per-token SSE emits become valuable for chat UX.

## Falsification gates

- V1_001: chat-completions with stream=true emits per-token SSE events (not buffered into pregenerated SSE)
- V1_002: stream=false still returns a single JSON response (regression)
- V1_003: streaming throughput ≥ 2 tok/s median inter-event time

## Implementation phases (engineer playbook)

- Phase 1 (~2hr): callback variant of run_qwen3_moe_generate
- Phase 2 (~4hr): wire into try_qwen3_moe_backend in cuda_chat_backend.rs
- Phase 3 (~2hr): cargo integration test

Total ~6-8 hours, operator-actionable once #1832 merges.

NOT in scope:

- MoE inference correctness (covered by qwen3-moe-serve-dispatch-v1)
- KV cache mechanics (M32d / #1832)
- Streaming for dense models (already exists via OwnedQuantizedModelCachedSync)
- Tool-call streaming (separate contract)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…3-moe-streaming-sse-v1) (#1854)

## Summary

Discharges qwen3-moe-streaming-sse-v1.yaml v1.0.0 (landed in #1835). Post-M32d (#1832), MoE per-token generation amortizes to ~100ms; this contract codifies that `stream=true` on a qwen3_moe model emits SSE events per-token instead of buffering the full completion.

## Changes

- `infer/qwen3_moe_generate.rs`: add `run_qwen3_moe_generate_streaming`, a callback variant of `run_qwen3_moe_generate`. Mirrors the non-streaming function step-for-step, but invokes `on_token(u32) -> bool` after each decoded token (BEFORE the stop check, so the client sees every sampled token). Callback returning `false` short-circuits the loop for client disconnect handling.
- `api/cuda_chat_backend.rs`: in `try_qwen3_moe_backend`, branch on `request.stream`. If true, spin up an mpsc channel, run the streaming variant on a `spawn_blocking` worker, and route the channel through the dense path's `true_streaming_sse_response` helper. Non-streaming path unchanged.
- `api/openai_handlers.rs`: promote `true_streaming_sse_response` from `fn` to `pub(crate) fn` so the MoE backend can reuse the same SSE framing as the dense path. No behavior change.
- `tests/qwen3_moe_streaming_sse_v1.rs`: env-gated integration tests (`QWEN3_MOE_GGUF_PATH`, `#[ignore]`'d) discharging V1_001 + V1_003:
  * V1_001: streaming callback fires per-token, captured tokens equal the non-streaming greedy baseline.
  * V1_003: median inter-token gap < 500ms (≥2 tok/s floor, well below M32d's ~5 tok/s).
  * Bonus: callback returning `false` short-circuits the loop.

V1_002 (`stream=false` regression) is covered by `qwen3_moe_serve_dispatch_v1.rs`.

## Why

#1832 made KV cache available → per-token gen amortizes to ~100ms. Before this PR, MoE `stream=true` requests on qwen3_moe still went through `run_qwen3_moe_generate` (synchronous) and the client got the full response in a single late SSE event, a UX regression vs the dense path. Now the client sees the first token within `prefill_wall + 100ms` and subsequent tokens stream at ~M32d throughput.

## Test plan

- [x] `cargo check -p aprender-serve --lib`: clean
- [x] `cargo test -p aprender-serve --lib qwen3_moe_generate`: 12/12 pass
- [ ] Operator-dispatched: `QWEN3_MOE_GGUF_PATH=… cargo test --test qwen3_moe_streaming_sse_v1 -- --ignored --nocapture` (env-gated, requires GGUF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
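
The callback-plus-channel shape described in this commit can be sketched with the standard library alone. This is an illustration, not the code in `cuda_chat_backend.rs`: `decode_streaming` and `stream_over_channel` are hypothetical, a pre-baked token slice stands in for the decode loop, and `std::thread::spawn` stands in for the `spawn_blocking` worker.

```rust
use std::sync::mpsc;
use std::thread;

/// Walks `tokens` as if they were being decoded one at a time.
/// `on_token` fires BEFORE the stop check, so the stop token is also emitted.
/// Returning `false` from the callback short-circuits the loop.
/// Returns how many tokens were handed to the callback.
fn decode_streaming<F>(tokens: &[u32], stop_token: u32, mut on_token: F) -> usize
where
    F: FnMut(u32) -> bool,
{
    let mut emitted = 0;
    for &tok in tokens {
        emitted += 1;
        if !on_token(tok) {
            break; // consumer went away (e.g. client disconnect)
        }
        if tok == stop_token {
            break;
        }
    }
    emitted
}

/// Runs the decode loop on a worker thread and forwards tokens over a channel.
fn stream_over_channel(tokens: Vec<u32>, stop_token: u32) -> Vec<u32> {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        // A failed send means the receiver was dropped, so stop decoding.
        decode_streaming(&tokens, stop_token, |tok| tx.send(tok).is_ok())
    });
    // The iterator ends once the worker finishes and drops the sender.
    let received: Vec<u32> = rx.iter().collect();
    worker.join().expect("worker thread panicked");
    received
}
```

Mapping a failed `send` to a `false` return gives disconnect handling without extra plumbing: once the receiving side is dropped, the next token stops the loop.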

…3 falsifiers PASS) (#1855)

* feat: qwen3-moe streaming SSE, per-token emit when stream=true (qwen3-moe-streaming-sse-v1)

* evidence: qwen3-moe-streaming-sse-v1 DISCHARGED on gx10 Blackwell

## Summary

Operator-dispatched verification of `contracts/qwen3-moe-streaming-sse-v1.yaml` v1.0.0 (shipped via #1854). All three falsifiers PASS on Qwen3-Coder-30B-A3B (qwen3moe arch) running on gx10 Blackwell GB10.

## Results

| Falsifier | Test | Verdict |
|-----------|------|---------|
| V1_001 | v1_001_callback_fires_per_token | PASS |
| V1_002 | (regression guard via serve-dispatch test) | GUARD |
| V1_003 | v1_003_inter_token_latency_floor | PASS |

V1_003 throughput on real 30B-MoE: **median 338 ms inter-token gap** over 32 callbacks (floor 500 ms), distribution p_min=250 ms / p_max=518 ms. That is ≈ 3 tok/s streamed, comfortably above the 2 tok/s contract floor and consistent with M32d's KV-cache-amortized per-token cost.

Plus the negative-path `callback_stop_short_circuits` test confirmed that returning `false` from the per-token callback short-circuits the decode loop (client-disconnect handling).

## Artifacts

- `findings.json`: machine-readable discharge record
- `gx10-sse-smoke.log`: full cargo test stdout/stderr (549 lines)

Both captured from `/home/noah/runs/sse-smoke-20260521-080640/` on gx10.

## Reproducer

```bash
QWEN3_MOE_GGUF_PATH=/path/to/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
cargo test --test qwen3_moe_streaming_sse_v1 \
-p aprender-serve --features cuda --release \
-- --ignored --nocapture
```

Binary commit: 6bff4ce (post-#1854 merge).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* evidence: add gx10 cargo test log (qwen3-moe-streaming-sse-v1 discharge)

Force-added (matches `.log` gitignore pattern but this one is a load-bearing discharge artifact, not a temp file).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
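
The V1_003 check reduces to a median over inter-token gaps compared against a floor. The sketch below shows one way to compute it; `median_ms`, `meets_latency_floor`, and `tokens_per_second_from_gap` are hypothetical helpers, not the functions in `tests/qwen3_moe_streaming_sse_v1.rs`.

```rust
/// Median of a set of inter-token gaps in milliseconds; `None` when empty.
fn median_ms(gaps_ms: &[u64]) -> Option<f64> {
    if gaps_ms.is_empty() {
        return None;
    }
    let mut sorted = gaps_ms.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid] as f64
    } else {
        // Even count: average the two middle samples.
        (sorted[mid - 1] + sorted[mid]) as f64 / 2.0
    })
}

/// True when the median gap is strictly below `floor_ms`.
fn meets_latency_floor(gaps_ms: &[u64], floor_ms: f64) -> bool {
    matches!(median_ms(gaps_ms), Some(m) if m < floor_ms)
}

/// Converts a per-token gap in milliseconds into tokens per second.
fn tokens_per_second_from_gap(gap_ms: f64) -> f64 {
    1000.0 / gap_ms
}
```

Applied to the reported median, `tokens_per_second_from_gap(338.0)` is about 2.96, which matches the "≈ 3 tok/s" figure above. Using the median rather than the mean keeps a single slow outlier, such as the 518 ms maximum, from failing the gate on its own.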

noahgift mentioned this pull request May 20, 2026

spec: qwen3-moe-streaming-sse-v1 — follow-up contract to M32d KV cache #1835

Merged

2 tasks

noahgift force-pushed the feat/m32d-moe-kv-cache branch from 1433ac7 to 4a04aae Compare May 20, 2026 05:53

noahgift enabled auto-merge (squash) May 20, 2026 06:04

noahgift mentioned this pull request May 20, 2026

refactor(forward): lift attention_layer_with_cache helper (M32d Day 1 prep, #1830 PR-1 of 4) #1831

Closed

2 tasks

Merge branch 'main' into feat/m32d-moe-kv-cache

0709293

noahgift mentioned this pull request May 20, 2026

spec: qwen3-moe-sampling-v1 — temperature/top_k/top_p follow-up to M32d #1837

Merged

2 tasks

Merge branch 'main' into feat/m32d-moe-kv-cache

52cd096

noahgift merged commit 6762827 into main May 20, 2026
10 checks passed

noahgift deleted the feat/m32d-moe-kv-cache branch May 20, 2026 08:24

noahgift mentioned this pull request May 20, 2026

M32d: KV cache for qwen3_moe inference path (engineer-driven, 1-2 week) #1830

Closed

noahgift mentioned this pull request May 21, 2026

feat: qwen3-moe streaming SSE impl (qwen3-moe-streaming-sse-v1 V1_001+V1_003) #1854

Merged

3 tasks

noahgift merged 3 commits into main from feat/m32d-moe-kv-cache
