fix(#1789): V1_001 + V1_003 integration test against real Qwen3-MoE GGUF #1819
Merged
…" key (PMAT-698j)
THE root-cause bug behind the entire Phase 3 cuda dispatch cascade
(PMAT-698e..i, 6 prior PRs). Discovered by PMAT-698i's [FWD-CACHE]
diagnostic logging.
The `warm!` macro in pre_warm_for_model:
    macro_rules! warm {
        ($key:expr, $kernel:expr) => {{
            let ptx = $kernel.emit_ptx_for_target(&target);
            self.get_or_compile("silu_forward", &ptx)?;  // <-- HARDCODED
            count += 1;
        }};
    }
Every single `warm!()` call targeted the hashmap key "silu_forward",
so every call after the first collided with the first call's entry:
1. warm!("batched_rmsnorm_fwd_896", BatchedVectorizedRmsNormKernel...)
   → cache["silu_forward"] = BatchedVectorizedRmsNorm PTX
2. warm!("gemm_forward_...", ...)
   → cache["silu_forward"] already Occupied → returns existing entry,
     new PTX silently discarded
3-23. same: all subsequent kernels never actually pre-warm.
At runtime, every kernel looks up its real cache key:
    let key = format!("batched_rmsnorm_fwd_{hidden_size}_eps{eps_bits:08x}");
    match cache.get_cached(&key) { Some(m) => m, None => JIT }
— and cache-MISSES because the cache contains exactly one entry
under "silu_forward". JIT fires for every "pre-warmed" kernel during
the first forward pass — exactly when Blackwell sm_121's CUDA driver
crashes on cuModuleLoadData during active GPU work.
PMAT-698i's [FWD-CACHE] logging surfaced this: every kernel that was
"supposed to be pre-warmed" emitted [FWD-CACHE] Compiling at runtime,
proving the cache had nothing in it under those keys.
Fix: pass $key through to get_or_compile. One-argument change
("silu_forward" → $key).
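The collision described above can be reproduced with a plain `HashMap`. The following is a minimal sketch, not the project's code: `ModuleCache`, `pre_warm_buggy`, and `pre_warm_fixed` are hypothetical names, PTX is modelled as a string, and plain functions stand in for the `warm!` macro. It assumes `get_or_compile` has entry-style semantics (an occupied key returns the existing value and drops the new one), as the numbered walkthrough implies.

```rust
use std::collections::HashMap;

/// Toy stand-in for the module cache: key -> compiled PTX text.
/// (Hypothetical type; the real cache stores compiled CUDA modules.)
#[derive(Default)]
struct ModuleCache {
    modules: HashMap<String, String>,
}

impl ModuleCache {
    /// Entry semantics: if `key` is occupied, the existing value is
    /// returned and the new `ptx` is silently dropped.
    fn get_or_compile(&mut self, key: &str, ptx: &str) -> &String {
        self.modules
            .entry(key.to_string())
            .or_insert_with(|| ptx.to_string())
    }

    fn get_cached(&self, key: &str) -> Option<&String> {
        self.modules.get(key)
    }
}

/// Buggy pre-warm: ignores the caller's key and always uses "silu_forward".
fn pre_warm_buggy(cache: &mut ModuleCache, kernels: &[(&str, &str)]) {
    for (_key, ptx) in kernels {
        cache.get_or_compile("silu_forward", ptx);
    }
}

/// Fixed pre-warm: each kernel is stored under its own key.
fn pre_warm_fixed(cache: &mut ModuleCache, kernels: &[(&str, &str)]) {
    for (key, ptx) in kernels {
        cache.get_or_compile(key, ptx);
    }
}
```

With the buggy variant, the cache ends up with a single entry holding the first kernel's PTX, and a runtime lookup under any real key misses. With the fixed variant, each runtime lookup hits.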
This explains the entire PMAT-698e..i cascade:
- PMAT-698e (workspace cap): legit independent bug
- PMAT-698f (APR magic): legit independent bug
- PMAT-698g (non-LoRA backward pre-warm): would have been fine IF
  forward pre-warm worked; the backward kernels were correctly stored
  under their real keys (backward macro doesn't have the typo).
  Defense-in-depth, still valuable.
- PMAT-698h (rms_norm_gamma_reduce): same defense-in-depth.
- PMAT-698j (THIS): the root cause.
The previous PMAT-698g/h fixes are still correct (they covered backward
gaps that exist independently). This PR addresses the forward cache,
which was the dominant source of post-pre-warm JIT events.
Test plan:
- [x] cargo check --features cuda — clean build
- [x] 366 autograd lib tests pass
- [ ] Live gx10 dispatch (post-merge) shows ZERO [FWD-CACHE] Compiling
events post-pre-warm (all 23 forward kernels now actually cached)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Formal cargo-test discharge of FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_001
+ V1_003 from `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.0 →
v1.1.1). Pins the chat-completions MoE dispatch invariant into CI as
an opt-in integration test.
## What the test does
`crates/aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs`:
- Loads a real Qwen3-MoE GGUF (via QWEN3_MOE_GGUF_PATH env var)
- Builds AppState with `with_quantized_model_and_vocab` + attaches
  retained mmap via `with_mapped_gguf_model` (Option B path)
- Creates the router via `realizar::api::create_router`
- POSTs `/v1/chat/completions` with max_tokens=4, temperature=0
- Asserts:
  - HTTP 200 (V1_001: dispatch returns non-error)
  - Non-empty `choices[0].message.content` (V1_001: actual generation)
  - Body does NOT contain "InvalidShape" or "matmul weight has EMPTY
    data buffer" (V1_003: #1790 defensive guard did not fire, which
    proves the MoE path was taken, not dense)
Gated `#[ignore]` by default. Activated by:
```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
cargo test --test qwen3_moe_serve_dispatch_v1 \
-p aprender-serve --features cuda --release -- --ignored --nocapture
```
If `QWEN3_MOE_GGUF_PATH` is unset, the test prints a SKIP message and
passes, so it does not block CI on hosts without a real qwen3_moe GGUF.
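The opt-in gating can be sketched roughly as follows. This is an illustration only: `model_path_from` and `dispatch_smoke` are hypothetical names, and treating a blank value the same as an unset one is an assumption, not something the PR states.

```rust
use std::path::PathBuf;

/// Interprets the raw value of an env var such as QWEN3_MOE_GGUF_PATH.
/// Returns None (meaning "skip") when the variable is unset or blank.
/// (Blank-as-unset is an assumption made for this sketch.)
fn model_path_from(raw: Option<String>) -> Option<PathBuf> {
    raw.map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty())
        .map(PathBuf::from)
}

/// Shape of an opt-in test: ignored by default, and a passing no-op
/// when the env var is missing even if `--ignored` is given.
#[test]
#[ignore]
fn dispatch_smoke() {
    let Some(path) = model_path_from(std::env::var("QWEN3_MOE_GGUF_PATH").ok()) else {
        eprintln!("SKIP: QWEN3_MOE_GGUF_PATH not set");
        return;
    };
    // A real test body would load the GGUF at `path` and exercise the router.
    assert!(!path.as_os_str().is_empty());
}
```

Keeping the env-var interpretation in a small pure function makes the skip rule testable without touching the process environment.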
## Empirical evidence (this PR)
Test passed against `/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf`
in 7.84s wall. Response body:
```json
{"id":"chatcmpl-q4k-1779208970299","object":"chat.completion","model":"qwen3-moe-v1-001",
"choices":[{"message":{"role":"assistant","content":"Human: What"}}],
"usage":{"prompt_tokens":13,"completion_tokens":4,"total_tokens":17}}
```
Non-empty `content` + no `InvalidShape` → V1_001 + V1_003 cargo-test
discharged.
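The pass criteria (HTTP 200, non-empty content, no guard marker in the body) can be expressed as one small check. This is a hedged sketch: `check_dispatch` and `GUARD_MARKERS` are hypothetical names, and `content` is assumed to have already been extracted from `choices[0].message.content` by whatever JSON parser the real test uses.

```rust
/// Substrings whose presence in the body means the defensive guard fired.
const GUARD_MARKERS: [&str; 2] = ["InvalidShape", "matmul weight has EMPTY data buffer"];

/// Returns Ok(()) when all three conditions hold, otherwise an error
/// string prefixed with the falsifier ID it violates.
fn check_dispatch(status: u16, content: &str, raw_body: &str) -> Result<(), String> {
    if status != 200 {
        return Err(format!("V1_001: expected HTTP 200, got {status}"));
    }
    if content.is_empty() {
        return Err("V1_001: empty choices[0].message.content".to_string());
    }
    if let Some(marker) = GUARD_MARKERS.iter().find(|m| raw_body.contains(**m)) {
        return Err(format!("V1_003: guard marker present: {marker}"));
    }
    Ok(())
}
```

For the response recorded above, `check_dispatch(200, "Human: What", body)` would return `Ok(())`, since the body contains neither marker.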
## Contract bump
`qwen3-moe-serve-dispatch-v1.yaml` v1.1.0 → v1.1.1:
- V1_001 evidence updated with new cargo-test command + empirical run record
- V1_003 evidence updated to same
- status_history appends v1.1.1 entry noting formal discharge
V1_004 (companion-side CCPA Phase 6 bench non-zero pass rate) remains
BLOCKED on M32d KV cache work — independent contract gate.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 20, 2026
Implements M32d KV cache support on the qwen3_moe inference path.
Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004
in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).
## Empirical results
On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:
- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on
  every per-turn budget; 5 timeout-class dispatches recorded in
  paiml/claude-code-parity-apr
  evidence/phase-6/30b-moe-empirical-2026-05-19.md)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation (19×
  speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs
  full-prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms;
  ~2× speedup even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still
  passes (9.39s wall, content "Human: What", no matmul guard fire)
## Implementation
**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache`
in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`.
Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs:~165)
step-for-step EXCEPT at the FFN block, where it calls
`moe_ffn_forward_layer` (router → top-k expert select → per-expert
SwiGLU → weighted sum → down projection) instead of the dense
gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm,
RoPE at `position`, GQA-aware cached attention, output projection,
residual) is byte-identical to the dense reference.
**Generate loop rewrite**:
`crates/aprender-serve/src/infer/qwen3_moe_generate.rs::run_qwen3_moe_generate`
now:
1. Allocates `OwnedQuantizedKVCache` sized to
   `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
2. Prefill: per prompt token, calls `forward_single_qwen3_moe_with_cache`
   (cache fills incrementally; final iteration's logits seed decode)
3. Decode: greedy-argmax → append → next cache-aware forward
4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full
**Visibility fix**: `single_cache_final_output` in `ffn_block.rs` bumped
to `pub(crate)` so the MoE function can reuse the dense final-norm + LM
head path unchanged. Same edit applied to the orphan `debug.rs`
duplicate for hygiene (it's not in the build graph but mirrors
ffn_block.rs).
## New tests (both `#[ignore]`'d, env-gated)
- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs`: Generates
  4 tokens via M32d cache-on path AND a legacy full-prefill loop.
  Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms perf
  numbers in eprintln output.
- `crates/aprender-serve/tests/m32d_perf.rs`: Generates 32 tokens;
  asserts sustained throughput ≥ 5 tok/s. Floor pinned via
  `M32D_TPS_FLOOR` constant. Catches future KV-cache regressions.
Activation:
```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
cargo test --test moe_kv_cache_equivalence --test m32d_perf \
-p aprender-serve --features cuda --release -- --ignored --nocapture
```
## Risk assessment vs scope doc
All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md`
were addressed:
1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale
   logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache`
   not touched (only its sibling `single_cache_final_output` visibility
   bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed to
   `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as
   dense reference); first-token edge case (empty cache) explicitly
   handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected; router reads
   from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into
   the chat handler (separate follow-up contract).
## Contract bump
v1.1.1 → v1.2.0:
- V1_004 entry gains `prerequisite_status` field documenting M32d
  shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry
## Companion-side downstream
paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:
```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```
Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s
sustained. Acceptance: `evidence/under-contract/scores.json` with
`student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.
## Supersedes
#1829 (Option b engineer-playbook + V1_004 status formalization):
operator flipped from Option (b) to Option (a) in-session; this PR
delivers the actual implementation. #1829 can be closed as superseded.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
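The four generate-loop steps from the M32d commit message (cache sizing, per-token prefill, greedy decode, stop conditions) can be sketched as below. This is a simplified illustration under stated assumptions, not the contents of `run_qwen3_moe_generate`: the `forward` closure is a hypothetical stand-in for the cache-aware single-token forward pass, `context_length` stands in for the configured context constant, and the stop token is assumed not to be appended to the output.

```rust
/// Index of the largest logit (greedy decoding). Ties resolve to the
/// first maximum; an empty slice yields 0.
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Capacity rule from step 1: max(context_length, prompt_len + max_tokens + 8).
fn cache_capacity(context_length: usize, prompt_len: usize, max_tokens: usize) -> usize {
    context_length.max(prompt_len + max_tokens + 8)
}

/// `forward(token, position)` is a hypothetical stand-in for the
/// cache-aware forward pass; it returns logits for the next token.
fn generate<F>(
    prompt: &[u32],
    max_tokens: usize,
    stop_tokens: &[u32],
    context_length: usize,
    mut forward: F,
) -> Vec<u32>
where
    F: FnMut(u32, usize) -> Vec<f32>,
{
    let capacity = cache_capacity(context_length, prompt.len(), max_tokens);
    let mut logits = Vec::new();
    // Prefill: one forward per prompt token; the last logits seed decode.
    for (position, &token) in prompt.iter().enumerate() {
        logits = forward(token, position);
    }
    let mut generated = Vec::new();
    let mut position = prompt.len();
    // Decode: greedy argmax, append, then the next cache-aware forward.
    while generated.len() < max_tokens && position < capacity && !logits.is_empty() {
        let next = argmax(&logits);
        if stop_tokens.contains(&next) {
            break; // assumption: the stop token itself is not emitted
        }
        generated.push(next);
        logits = forward(next, position);
        position += 1;
    }
    generated
}
```

Each decode step here costs one single-token forward rather than a full re-prefill of the sequence, which is the structural change the reported speedup is attributed to.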
noahgift added a commit that referenced this pull request on May 20, 2026
… C-prep) (#1840)
* feat(M32d): KV cache for qwen3_moe inference path (19× speedup)
* chore(distill): add DATASET_DIR env var → --dataset flag (Phase 4 Stage C-prep)
Threads the Phase 4 Stage B-2 `apr distill --dataset <DIR>` flag
through the gx10 dispatch script. When `DATASET_DIR` is set, the script
passes the directory to apr distill, which drives training from
real-corpus .bin shards via ShardBatchSource. When unset (default), the
pipeline falls back to SyntheticBatchSource (Phase 3 smoke semantics).
Preamble now surfaces which mode the dispatch is in:
  dataset: /path/to/shards (Phase 4 Stage B-2: real corpus via --dataset)
  dataset: (synthetic — Phase 3 smoke semantics)
Validates the directory exists on gx10 before launching apr distill;
fails fast with a clear message otherwise.
Usage:
  STEPS=100 DATASET_DIR=/home/noah/data/codeparrot-shards-trial \
    bash scripts/dispatch-distill-phase-3-gx10.sh
Depends on PR #1839 (Stage B-2) landing first so `apr distill --dataset`
exists on the rebuilt gx10 binary. With #1839 unmerged the script's
invocation falls back to a clap error.
Phase 4 ladder progress:
  Stage A (#1833)      ✅ MERGED + verified
  Stage B-1 (#1836)    ✅ MERGED
  Stage B-2 (#1839)    🟡 in CI
  Stage C-prep (THIS)  dispatch script + 10 pre-staged shards
  Stage C              run on gx10 with --dataset
  Stage D              50K-step Phase 4 dispatch
  Stage E              HumanEval pass@1
  Stage F              publish v2
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
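The `DATASET_DIR` fallback is implemented in a shell script, but its selection logic can be modelled in a few lines of Rust. The sketch below is illustrative only: `BatchMode`, `select_mode`, and `distill_args` are hypothetical names, treating an empty value as unset is an assumption, and the script's directory-exists validation is deliberately left out.

```rust
use std::path::PathBuf;

#[derive(Debug, PartialEq)]
enum BatchMode {
    /// Real-corpus shards, passed through as `--dataset <DIR>`.
    Shards(PathBuf),
    /// Default when DATASET_DIR is unset (Phase 3 smoke semantics).
    Synthetic,
}

/// Chooses the mode from the optional DATASET_DIR value.
/// (Empty-as-unset is an assumption made for this sketch.)
fn select_mode(dataset_dir: Option<&str>) -> BatchMode {
    match dataset_dir {
        Some(dir) if !dir.is_empty() => BatchMode::Shards(PathBuf::from(dir)),
        _ => BatchMode::Synthetic,
    }
}

/// Extra CLI arguments implied by the chosen mode.
fn distill_args(mode: &BatchMode) -> Vec<String> {
    match mode {
        BatchMode::Shards(dir) => vec!["--dataset".to_string(), dir.display().to_string()],
        BatchMode::Synthetic => Vec::new(),
    }
}
```

Modelling the mode as an enum keeps the "no flag at all" default explicit, which matches the described behaviour of falling back to synthetic batches when nothing is set.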
Summary
Formal cargo-test discharge of FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_001 + V1_003 from `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.0 → v1.1.1). Empirical evidence previously only via direct curl smoke test; this PR pins the invariant into CI as an opt-in integration test.
What the test does
`crates/aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs`:
Gated `#[ignore]` by default. CI-safe: skips with eprintln when env var missing.
Empirical evidence (this PR)
Passed in 7.84s wall:
{"id":"chatcmpl-q4k-1779208970299","object":"chat.completion","model":"qwen3-moe-v1-001",
"choices":[{"message":{"role":"assistant","content":"Human: What"}}],
"usage":{"prompt_tokens":13,"completion_tokens":4,"total_tokens":17}}
Contract bump
v1.1.0 → v1.1.1 — V1_001 + V1_003 evidence fields updated with the new cargo command + recorded pass time; `status_history` appends v1.1.1 entry.
V1_004 (companion-side CCPA Phase 6 bench non-zero pass rate) remains independently BLOCKED on M32d KV cache work — see paiml/claude-code-parity-apr `evidence/phase-6/30b-moe-empirical-2026-05-19.md`.
Test plan
🤖 Generated with Claude Code