feat(M32d): KV cache for qwen3_moe inference path — 19× speedup#1832

Merged
noahgift merged 3 commits into
mainfrom
feat/m32d-moe-kv-cache
May 20, 2026
@noahgift

Contributor

Summary

Implements M32d KV cache for the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.1 → v1.2.0). Empirical: 19× speedup on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.

Operator flipped from Option (b) engineer-driven (#1829) to Option (a) in-session. This PR supersedes #1829.

Empirical results

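| Metric                        | Pre-M32d   | Post-M32d      | Speedup |
|-------------------------------|------------|----------------|---------|
| Sustained throughput (32 tok) | ~0.5 tok/s | 9.62 tok/s     | 19×     |
| Wall on 4 tokens              | 1002ms     | 553ms          | 1.8×    |
| Greedy output equivalence     |            | byte-identical |         |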

All measurements on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.

Implementation

  • New: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` step-for-step EXCEPT the FFN block, which calls `moe_ffn_forward_layer` (router → top-k → per-expert SwiGLU) instead of dense gate/up/down dispatch.
  • Rewrite: `run_qwen3_moe_generate` now uses cache-aware decode: prefill per-prompt-token + decode per-output-token, both via the new single-token function.
  • Visibility fix: `single_cache_final_output` in `ffn_block.rs` → `pub(crate)` so MoE path reuses the dense final-norm + LM head unchanged.

Risk surfaces from scope doc (5 cleared; streaming SSE deferred to a follow-up)

  1. ✅ Numerical equivalence — byte-identical greedy outputs on 4-token reference
  2. ✅ Dense path regression — `forward_single_with_cache` untouched
  3. ✅ RoPE position offset — handled via `position` parameter (same pattern)
  4. ✅ GQA expansion — handled via `kv_dim()` + first-token edge case explicit
  5. ✅ Expert routing under cache — confirmed unaffected
  6. ◯ Streaming SSE — structurally enabled; not wired (separate follow-up)

New tests

  • `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs` — generates 4 tokens via cache-on AND legacy full-prefill, asserts greedy outputs byte-identical
  • `crates/aprender-serve/tests/m32d_perf.rs` — asserts ≥ 5 tok/s sustained on 32-token gen (pinned floor)

Both `#[ignore]` + env-gated on `QWEN3_MOE_GGUF_PATH`.
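
As a minimal sketch of the gating pattern described above (not the actual test source), a test can be marked `#[ignore]` and additionally return early when the environment variable is absent. The helper names `path_from_value` and `gated_model_path` and the test name are hypothetical; only `QWEN3_MOE_GGUF_PATH` comes from this PR.

```rust
use std::env;
use std::path::PathBuf;

/// Pure helper: treats a missing or blank value as "not configured".
fn path_from_value(value: Option<String>) -> Option<PathBuf> {
    match value {
        Some(v) if !v.trim().is_empty() => Some(PathBuf::from(v)),
        _ => None,
    }
}

/// Reads the gating variable from the process environment.
fn gated_model_path(var_name: &str) -> Option<PathBuf> {
    path_from_value(env::var(var_name).ok())
}

// Hypothetical test body: skipped by default because of #[ignore], and a
// no-op when the variable is unset even if run with `-- --ignored`.
#[test]
#[ignore]
fn example_env_gated_test() {
    let Some(path) = gated_model_path("QWEN3_MOE_GGUF_PATH") else {
        eprintln!("QWEN3_MOE_GGUF_PATH not set; skipping");
        return;
    };
    // A real test would load the GGUF at `path` here.
    assert!(!path.as_os_str().is_empty());
}
```

The double gate means a plain `cargo test` never touches the model file, and a run with `--ignored` on a machine without the GGUF reports a skip rather than a failure.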

V1_001 + V1_003 regression

Existing #1819 cargo test still passes: 9.39s wall, content "Human: What", no matmul guard fire.

Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at sustained 9.62 tok/s. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.

Test plan

  • `cargo check -p aprender-serve --lib --features cuda` — clean
  • `cargo check --test moe_kv_cache_equivalence --test m32d_perf` — clean
  • `QWEN3_MOE_GGUF_PATH=... cargo test moe_kv_cache_equivalence --release -- --ignored` — PASS, 4 tokens byte-identical
  • `QWEN3_MOE_GGUF_PATH=... cargo test m32d_perf --release -- --ignored` — PASS, 9.62 tok/s sustained
  • `QWEN3_MOE_GGUF_PATH=... cargo test qwen3_moe_serve_dispatch_v1 --release -- --ignored` (regression check, #1819 test) — PASS
  • CI (sovereign-ci full workflow)
  • Companion-side CCPA Phase 6 V1_004 discharge bench (operator-coordinated, ~10 hr wall)

🤖 Generated with Claude Code

noahgift added a commit to paiml/claude-code-parity-apr that referenced this pull request May 20, 2026
…spatch ready (#254)

Upstream M32d KV cache for qwen3_moe inference path shipped at
paiml/aprender#1832 (open; in CI). Operator flipped from Option (b)
engineer-driven (#1829) to Option (a) in-session implementation.

Empirical (on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M):
- Pre-M32d: ~0.5 tok/s (bench timed out on every per-turn budget)
- Post-M32d: 9.62 tok/s sustained on 32-token gen (19× speedup)
- Numerical equivalence vs full-prefill: byte-identical greedy outputs
- V1_001 + V1_003 (#1819 cargo test) regression: stable

V1_004 prerequisite (M32d KV cache) NOW MET. Bench discharge is
operator-actionable on a tractable ~10hr wall.

## Files

### NEW: `evidence/phase-6/m32d-shipped-2026-05-20.md`

- Upstream empirical results table
- New cargo tests pinning the invariants (equivalence + perf floor)
- V1_004 dispatch checklist (7 operator steps)
- Cross-references to all upstream PRs

### MODIFIED: `evidence/phase-6/1.5b-calibration-run.md`

- aprender#1789 line: V1_004 status flipped from BLOCKED to "prerequisite MET 2026-05-20 via M32d"
- Updated PR list with all 7 follow-up PRs (#1806, #1807, #1812, #1814, #1819, #1826, #1832)
- Added cross-reference to m32d-shipped-2026-05-20.md

## What this is NOT

- NOT a CCPA-side code change (bench script + analyzer + harness unchanged)
- NOT the V1_004 bench dispatch itself (operator-coordinated, ~10hr wall)
- NOT a new CCPA contract gate (V1_004 is unchanged; only its prerequisite flipped)

Mechanical doc update. M-counter NOT bumped per the discipline doctrine.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Implements M32d KV cache support on the qwen3_moe inference path.
Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004
in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).

## Empirical results

On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:

- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on
  every per-turn budget; 5 timeout-class dispatches recorded in
  paiml/claude-code-parity-apr
  `evidence/phase-6/30b-moe-empirical-2026-05-19.md`)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation
  (19× speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs full-
  prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms;
  ~2× speedup even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still
  passes (9.39s wall, content "Human: What", no matmul guard fire)

## Implementation

**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache`
in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`.
Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs:
~165) step-for-step EXCEPT at the FFN block, where it calls
`moe_ffn_forward_layer` (router → top-k expert select → per-expert
SwiGLU → weighted sum → down projection) instead of the dense
gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm,
RoPE at `position`, GQA-aware cached attention, output projection,
residual) is byte-identical to the dense reference.
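
To make the FFN substitution concrete, here is a minimal sketch of top-k expert routing with a SwiGLU expert, under stated assumptions. It is not the body of `moe_ffn_forward_layer`: the `Expert` struct, the dense `Vec<Vec<f32>>` weights, the renormalization of the top-k weights, and applying the down projection inside each expert before the weighted sum are all illustrative choices, and the real function works on quantized tensors.

```rust
/// Numerically stable softmax over router logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Selects the `k` most probable experts and renormalizes their weights
/// so they sum to 1 (renormalization is an assumption of this sketch).
fn top_k_experts(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> =
        softmax(router_logits).into_iter().enumerate().collect();
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);
    let total: f32 = indexed.iter().map(|&(_, p)| p).sum();
    indexed.into_iter().map(|(i, p)| (i, p / total)).collect()
}

/// Hypothetical dense expert; rows of each matrix are output neurons.
struct Expert {
    gate: Vec<Vec<f32>>,
    up: Vec<Vec<f32>>,
    down: Vec<Vec<f32>>,
}

fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
        .collect()
}

fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU: down(silu(gate(x)) * up(x)).
fn swiglu(e: &Expert, x: &[f32]) -> Vec<f32> {
    let g = matvec(&e.gate, x);
    let u = matvec(&e.up, x);
    let h: Vec<f32> = g.iter().zip(&u).map(|(&g, &u)| silu(g) * u).collect();
    matvec(&e.down, &h)
}

/// Routes one token's hidden state through its top-k experts and returns
/// the weighted sum of their outputs.
fn moe_ffn(router: &[Vec<f32>], experts: &[Expert], x: &[f32], k: usize) -> Vec<f32> {
    let logits = matvec(router, x);
    let mut out = vec![0.0f32; x.len()];
    for (idx, weight) in top_k_experts(&logits, k) {
        let y = swiglu(&experts[idx], x);
        for (o, v) in out.iter_mut().zip(&y) {
            *o += weight * v;
        }
    }
    out
}
```

Because the router only consumes the current token's hidden state `x`, nothing in this block depends on cached keys or values, which is why only the attention block needs cache-aware changes.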

**Generate loop rewrite**:
`crates/aprender-serve/src/infer/qwen3_moe_generate.rs::run_qwen3_moe_generate` now:
  1. Allocates `OwnedQuantizedKVCache` sized to
     `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
  2. Prefill: per prompt token, calls
     `forward_single_qwen3_moe_with_cache` (cache fills incrementally;
     final iteration's logits seed decode)
  3. Decode: greedy-argmax → append → next cache-aware forward
  4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full
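
The four steps above can be sketched as follows. This is an illustrative loop, not the code of `run_qwen3_moe_generate`: `ToyCache`, `generate`, the `forward` closure, and the `min_context` parameter (standing in for `REALIZR_CONTEXT_LENGTH`) are hypothetical, and whether a stop token is appended to the output is an assumption (here it is not).

```rust
/// Hypothetical stand-in for a KV cache: one entry per processed position.
struct ToyCache {
    entries: Vec<u32>,
    capacity: usize,
}

impl ToyCache {
    fn new(capacity: usize) -> Self {
        Self { entries: Vec::with_capacity(capacity), capacity }
    }
    fn is_full(&self) -> bool {
        self.entries.len() >= self.capacity
    }
}

/// Index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// `forward(token, position, cache)` stands in for one cache-aware
/// single-token forward pass returning logits.
fn generate<F>(
    prompt: &[u32],
    max_tokens: usize,
    stop_tokens: &[u32],
    min_context: usize,
    mut forward: F,
) -> Vec<u32>
where
    F: FnMut(u32, usize, &mut ToyCache) -> Vec<f32>,
{
    // 1. Size the cache for the prompt, the output, and a small margin.
    let capacity = min_context.max(prompt.len() + max_tokens + 8);
    let mut cache = ToyCache::new(capacity);

    // 2. Prefill: one forward per prompt token; the last logits seed decode.
    let mut logits = Vec::new();
    for (position, &token) in prompt.iter().enumerate() {
        logits = forward(token, position, &mut cache);
    }

    let mut output = Vec::new();
    if logits.is_empty() {
        return output; // empty prompt: nothing to decode from
    }

    // 3 + 4. Decode greedily until a stop token, max_tokens, or a full cache.
    while output.len() < max_tokens && !cache.is_full() {
        let next = argmax(&logits);
        if stop_tokens.contains(&next) {
            break;
        }
        output.push(next);
        if output.len() == max_tokens {
            break; // skip a forward pass whose logits would go unused
        }
        let position = prompt.len() + output.len() - 1;
        logits = forward(next, position, &mut cache);
    }
    output
}
```

Each generated token costs exactly one single-token forward pass, instead of re-running the whole sequence as the full-prefill loop did.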

**Visibility fix**: `single_cache_final_output` in `ffn_block.rs`
bumped to `pub(crate)` so the MoE function can reuse the dense final-
norm + LM head path unchanged. Same edit applied to the orphan
`debug.rs` duplicate for hygiene (it's not in the build graph but mirrors
ffn_block.rs).

## New tests (both `#[ignore]`'d, env-gated)

- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs` —
  Generates 4 tokens via M32d cache-on path AND a legacy full-prefill
  loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms
  perf numbers in eprintln output.
- `crates/aprender-serve/tests/m32d_perf.rs` —
  Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s.
  Floor pinned via `M32D_TPS_FLOOR` constant. Catches future
  KV-cache regressions.

Activation:
```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test moe_kv_cache_equivalence --test m32d_perf \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

## Risk assessment vs scope doc

All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md`
were addressed:

1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale
   logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache`
   not touched (only its sibling `single_cache_final_output` visibility
   bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed
   to `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as
   dense reference); first-token edge case (empty cache) explicitly
   handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected — router reads
   from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into
   the chat handler (separate follow-up contract).
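
Risk 3 is easiest to see with a small rotary-embedding sketch. The function below is an assumption-laden illustration, not the project's `apply_rope`: the name `apply_rope_sketch`, the interleaved pairing of `(x[2i], x[2i+1])`, and the explicit `base` parameter are chosen for brevity, and the real kernel may pair dimensions differently.

```rust
/// Rotates each consecutive pair of one attention head by the angle
/// `position * base^(-2i/d)`, where `d` is the head dimension.
fn apply_rope_sketch(head: &mut [f32], position: usize, base: f32) {
    let d = head.len();
    for i in 0..d / 2 {
        let freq = base.powf(-(2.0 * i as f32) / d as f32);
        let angle = position as f32 * freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (head[2 * i], head[2 * i + 1]);
        head[2 * i] = a * cos - b * sin;
        head[2 * i + 1] = a * sin + b * cos;
    }
}
```

With a KV cache, keys are rotated once at the position where they were produced and then stored, so each new query must be rotated at its true absolute position rather than at index 0. The query-key dot product then depends only on the distance between the two positions, which is the property that keeps cached decoding consistent with full prefill.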

## Contract bump

v1.1.1 → v1.2.0:
- V1_004 entry gains `prerequisite_status` field documenting M32d
  shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry

## Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s
sustained. Acceptance: `evidence/under-contract/scores.json` with
`student_pass_rate > 0` discharges V1_004 + lifts CCPA M280
suspension.

## Supersedes

#1829 (Option b engineer-playbook + V1_004 status
formalization) — operator flipped from Option (b) to Option (a)
in-session; this PR delivers the actual implementation. #1829 can be
closed as superseded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 20, 2026
…2d (#1837)

Registers a new provable contract for temperature + top_k + top_p
sampling on the qwen3_moe inference path. Sibling to
qwen3-moe-streaming-sse-v1 (#1835); independent follow-up to
M32d (#1832).

## Motivation

`run_qwen3_moe_generate` currently does unconditional greedy argmax.
QuantizedGenerateConfig.temperature/top_k/top_p are silently ignored.
HTTP chat requests with non-zero temperature get the SAME output every
time — unacceptable for production chat.

## Why now

The V1_004 bench (paiml/claude-code-parity-apr Phase 6 against
Qwen3-Coder-30B-A3B) is hitting 900s per-turn timeout because the
30B-MoE is verbose. Temperature scaling (e.g. 0.3) could concentrate
probability mass on high-confidence tokens and reduce rambling.
Greedy-only forces ONE point on the spectrum.

## Falsification gates

- V1_001: greedy-fallback (temperature=0 OR top_k=1) → deterministic
- V1_002: temperature>0 + fixed seed → deterministic
- V1_003: temperature>0 + different seeds → different outputs
- V1_004: top_k=1 with high temperature → still equivalent to greedy
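
The four gates can be illustrated with a minimal sampler sketch. It is not the dense-path sampling block the contract refers to: `XorShift64`, `sample`, and the convention that `top_k == 0` means "no limit" are assumptions, `top_p` is omitted for brevity, and `logits` is assumed non-empty.

```rust
/// Tiny deterministic PRNG (xorshift64*), standing in for whatever seeded
/// RNG the real sampler uses.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        Self(seed.max(1)) // the state must be non-zero
    }
    /// Uniform value in [0, 1) built from the top 24 bits of the output.
    fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let r = self.0.wrapping_mul(0x2545_F491_4F6C_DD1D);
        (r >> 40) as f32 / (1u64 << 24) as f32
    }
}

/// Index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Temperature + top-k sampling with a greedy fallback.
fn sample(logits: &[f32], temperature: f32, top_k: usize, rng: &mut XorShift64) -> usize {
    // Greedy fallback: temperature = 0 or top_k = 1 ignores the RNG.
    if temperature <= 0.0 || top_k == 1 {
        return argmax(logits);
    }
    // Candidate indices sorted by logit, highest first.
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    if top_k > 0 {
        idx.truncate(top_k);
    }
    // Unnormalized softmax weights at the given temperature.
    let max = logits[idx[0]];
    let weights: Vec<f32> = idx
        .iter()
        .map(|&i| ((logits[i] - max) / temperature).exp())
        .collect();
    let total: f32 = weights.iter().sum();
    // Inverse-CDF draw over the candidates.
    let mut threshold = rng.next_f32() * total;
    for (&i, &w) in idx.iter().zip(&weights) {
        if threshold < w {
            return i;
        }
        threshold -= w;
    }
    *idx.last().unwrap() // guards against rounding at the upper edge
}
```

In this sketch V1_001 and V1_004 hold by construction because the greedy branch never consults the RNG, while V1_002 and V1_003 follow from the RNG state being the only source of randomness.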

## Implementation phases (engineer playbook)

- Phase 1 (~2hr): lift dense-path sampling block into reusable helper
- Phase 2 (~1hr): wire into run_qwen3_moe_generate decode loop
- Phase 3 (~2-3hr): cargo test battery + optional companion sub-bench

Total ~5-6 hours; operator-actionable any time. Independent of
qwen3-moe-streaming-sse-v1.

NOT in scope:
- Repetition penalty (separate qwen3-moe-repetition-penalty-v1 contract)
- Mirostat / logit bias / streaming (separate concerns)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 6762827 into main May 20, 2026
10 checks passed
@noahgift noahgift deleted the feat/m32d-moe-kv-cache branch May 20, 2026 08:24
noahgift added a commit that referenced this pull request May 20, 2026
#1835)

Registers a new provable contract for per-token SSE streaming on the
qwen3_moe chat-completions path. This is the natural follow-up to
#1832 (M32d KV cache) — pre-M32d, streaming was
meaningless because full-prefill-per-token mode took ~30 minutes per
256-token completion. Post-M32d at 9.62 tok/s sustained, per-token
SSE emits become valuable for chat UX.

## Falsification gates

- V1_001: chat-completions with stream=true emits per-token SSE events
  (not buffered into pregenerated SSE)
- V1_002: stream=false still returns a single JSON response (regression)
- V1_003: streaming throughput ≥ 2 tok/s (median inter-event time under 500 ms)

## Implementation phases (engineer playbook)

- Phase 1 (~2hr): callback variant of run_qwen3_moe_generate
- Phase 2 (~4hr): wire into try_qwen3_moe_backend in cuda_chat_backend.rs
- Phase 3 (~2hr): cargo integration test

Total ~6-8 hours, operator-actionable once #1832 merges.

NOT in scope:
- MoE inference correctness (covered by qwen3-moe-serve-dispatch-v1)
- KV cache mechanics (M32d / #1832)
- Streaming for dense models (already exists via OwnedQuantizedModelCachedSync)
- Tool-call streaming (separate contract)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 21, 2026
…3-moe-streaming-sse-v1) (#1854)

## Summary

Discharges qwen3-moe-streaming-sse-v1.yaml v1.0.0 (landed in #1835).
Post-M32d (#1832), MoE per-token generation amortizes to ~100ms; this
contract codifies that `stream=true` on a qwen3_moe model emits SSE
events per-token instead of buffering the full completion.

## Changes

- `infer/qwen3_moe_generate.rs`: add `run_qwen3_moe_generate_streaming`
  — callback variant of `run_qwen3_moe_generate`. Mirrors the
  non-streaming function step-for-step, but invokes `on_token(u32) -> bool`
  after each decoded token (BEFORE the stop check, so the client sees
  every sampled token). Callback returning `false` short-circuits the
  loop for client disconnect handling.

- `api/cuda_chat_backend.rs`: in `try_qwen3_moe_backend`, branch on
  `request.stream`. If true, spin up an mpsc channel, run the streaming
  variant on a `spawn_blocking` worker, and route the channel through
  the dense path's `true_streaming_sse_response` helper. Non-streaming
  path unchanged.

- `api/openai_handlers.rs`: promote `true_streaming_sse_response` from
  `fn` to `pub(crate) fn` so the MoE backend can reuse the same SSE
  framing as the dense path. No behavior change.

- `tests/qwen3_moe_streaming_sse_v1.rs`: env-gated integration tests
  (`QWEN3_MOE_GGUF_PATH`, `#[ignore]`'d) discharging V1_001 + V1_003:
    * V1_001: streaming callback fires per-token, captured tokens
      equal the non-streaming greedy baseline.
    * V1_003: median inter-token gap < 500ms (a ≥2 tok/s floor, set well
      below the ≥ 5 tok/s floor pinned by the M32d perf test).
    * Bonus: callback returning `false` short-circuits the loop.

V1_002 (`stream=false` regression) is covered by
`qwen3_moe_serve_dispatch_v1.rs`.
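
The callback-plus-channel shape described in the changes above can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's code: `generate_streaming`, `stream_tokens`, and the `next_token` closure are hypothetical, and `std::thread` with `std::sync::mpsc` stands in for the `spawn_blocking` worker and the channel used by the real backend.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical decode loop: `next_token(step)` stands in for one
/// cache-aware forward pass plus token selection.
fn generate_streaming<N, C>(
    max_tokens: usize,
    stop_token: u32,
    mut next_token: N,
    mut on_token: C,
) -> Vec<u32>
where
    N: FnMut(usize) -> u32,
    C: FnMut(u32) -> bool,
{
    let mut out = Vec::new();
    for step in 0..max_tokens {
        let token = next_token(step);
        out.push(token);
        // The callback runs BEFORE the stop check, so the consumer sees
        // every token; returning false short-circuits the loop.
        if !on_token(token) {
            break;
        }
        if token == stop_token {
            break;
        }
    }
    out
}

/// Runs generation on a worker thread and forwards tokens over a channel.
/// A failed send means the receiver was dropped (client disconnect), which
/// makes the callback return false and stops decoding.
fn stream_tokens(max_tokens: usize, stop_token: u32) -> mpsc::Receiver<u32> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        generate_streaming(
            max_tokens,
            stop_token,
            |step| step as u32 + 100, // toy token source
            |token| tx.send(token).is_ok(),
        );
    });
    rx
}
```

Tying the callback's return value to the success of `send` gives disconnect handling without extra signalling: once the consumer goes away, the next send fails and the worker stops generating.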

## Why

#1832 made KV cache available → per-token gen amortizes to ~100ms.
Before this PR, MoE `stream=true` requests on qwen3_moe still went
through `run_qwen3_moe_generate` (synchronous) and the client got the
full response in a single late SSE event — UX regression vs dense path.
Now the client sees the first token within `prefill_wall + 100ms` and
subsequent tokens stream at ~M32d throughput.

## Test plan
- [x] `cargo check -p aprender-serve --lib` — clean
- [x] `cargo test -p aprender-serve --lib qwen3_moe_generate` — 12/12 pass
- [ ] Operator-dispatched: `QWEN3_MOE_GGUF_PATH=… cargo test --test qwen3_moe_streaming_sse_v1 -- --ignored --nocapture` (env-gated, requires GGUF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 21, 2026
…3 falsifiers PASS) (#1855)

* feat: qwen3-moe streaming SSE — per-token emit when stream=true (qwen3-moe-streaming-sse-v1)

* evidence: qwen3-moe-streaming-sse-v1 DISCHARGED on gx10 Blackwell

## Summary

Operator-dispatched verification of `contracts/qwen3-moe-streaming-sse-v1.yaml`
v1.0.0 (shipped via #1854). All three falsifiers PASS on Qwen3-Coder-30B-A3B
(qwen3moe arch) running on gx10 Blackwell GB10.

## Results

| Falsifier   | Test                                      | Verdict |
|-------------|-------------------------------------------|---------|
| V1_001      | v1_001_callback_fires_per_token            | PASS    |
| V1_002      | (regression guard via serve-dispatch test) | GUARD   |
| V1_003      | v1_003_inter_token_latency_floor           | PASS    |

V1_003 throughput on real 30B-MoE: **median 338 ms inter-token gap** over
32 callbacks (floor 500 ms), distribution p_min=250 ms / p_max=518 ms.
≈ 3 tok/s streamed — comfortably above the 2 tok/s contract floor and
consistent with M32d's KV-cache-amortized per-token cost.

Plus the negative-path `callback_stop_short_circuits` test confirmed
that returning `false` from the per-token callback short-circuits the
decode loop (client-disconnect handling).
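
As a rough sketch of how a median inter-token gap can be checked against the 500 ms floor, assuming the gaps have already been collected as `Duration` values: the helper names `median_gap`, `implied_tokens_per_second`, and `passes_floor` are hypothetical and are not taken from the test file.

```rust
use std::time::Duration;

/// Median of the observed gaps; for an even count, the mean of the two
/// middle values. Returns None when no gaps were recorded.
fn median_gap(gaps: &[Duration]) -> Option<Duration> {
    if gaps.is_empty() {
        return None;
    }
    let mut sorted = gaps.to_vec();
    sorted.sort();
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2
    })
}

/// Streaming rate implied by a median gap.
fn implied_tokens_per_second(median: Duration) -> f64 {
    1.0 / median.as_secs_f64()
}

/// True when the median gap is strictly below the contract floor.
fn passes_floor(gaps: &[Duration], floor: Duration) -> bool {
    median_gap(gaps).map_or(false, |m| m < floor)
}
```

A 338 ms median corresponds to 1 / 0.338 ≈ 2.96 tok/s, which matches the ≈ 3 tok/s figure reported above. Using the median rather than the maximum is what lets a single 518 ms outlier coexist with a PASS against the 500 ms floor.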

## Artifacts

- `findings.json` — machine-readable discharge record
- `gx10-sse-smoke.log` — full cargo test stdout/stderr (549 lines)

Both captured from `/home/noah/runs/sse-smoke-20260521-080640/` on gx10.

## Reproducer

```bash
QWEN3_MOE_GGUF_PATH=/path/to/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  cargo test --test qwen3_moe_streaming_sse_v1 \
    -p aprender-serve --features cuda --release \
    -- --ignored --nocapture
```

Binary commit: 6bff4ce (post-#1854 merge).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* evidence: add gx10 cargo test log (qwen3-moe-streaming-sse-v1 discharge)

Force-added (matches `.log` gitignore pattern but this one is a
load-bearing discharge artifact, not a temp file).

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>