fix(#1789): V1_001 + V1_003 integration test against real Qwen3-MoE GGUF by noahgift · Pull Request #1819 · paiml/aprender

noahgift · 2026-05-19T16:45:15Z

Summary

Formal cargo-test discharge of FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_001 + V1_003 from `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.0 → v1.1.1). Empirical evidence previously came only from a direct curl smoke test; this PR pins the invariant into CI as an opt-in integration test.

What the test does

`crates/aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs`:

- Loads a real Qwen3-MoE GGUF (via `QWEN3_MOE_GGUF_PATH` env var)
- Builds AppState via `with_quantized_model_and_vocab` + `with_mapped_gguf_model` (Option B path)
- Creates the router via `realizar::api::create_router`
- POSTs `/v1/chat/completions` with max_tokens=4, temperature=0
- Asserts:
  - HTTP 200 (V1_001)
  - Non-empty `choices[0].message.content` (V1_001)
  - No "InvalidShape" or "matmul weight has EMPTY data buffer" in body (V1_003: the #1790 guard, "fix(serve): #1789 matmul defensive guard against empty / undersized weights", did not fire, which proves the MoE path was taken, not the dense one)

Gated `#[ignore]` by default. CI-safe: skips with eprintln when env var missing.
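
The gating and the V1_003 body check described above can be sketched roughly as follows. This is a minimal illustration, not the code in `qwen3_moe_serve_dispatch_v1.rs`: the helper names (`model_path_from`, `guard_fired`, `run_gated`, `run_from_env`) are hypothetical, and the HTTP round trip is replaced by a caller-supplied closure.

```rust
use std::env;
use std::path::PathBuf;

/// Marker strings whose presence in the body would mean the dense matmul guard fired.
const GUARD_MARKERS: [&str; 2] = ["InvalidShape", "matmul weight has EMPTY data buffer"];

/// Returns the model path when the variable is set and non-blank, otherwise None.
fn model_path_from(var: Option<String>) -> Option<PathBuf> {
    var.filter(|v| !v.trim().is_empty()).map(PathBuf::from)
}

/// True when the response body contains any of the guard markers.
fn guard_fired(body: &str) -> bool {
    GUARD_MARKERS.iter().any(|marker| body.contains(marker))
}

/// Skeleton of the gated check. `fetch_body` is a hypothetical stand-in for
/// "build the router and POST /v1/chat/completions".
/// Returns None when skipped, Some(passed) otherwise.
fn run_gated(var: Option<String>, fetch_body: impl Fn(&PathBuf) -> String) -> Option<bool> {
    let Some(path) = model_path_from(var) else {
        eprintln!("SKIP: QWEN3_MOE_GGUF_PATH not set");
        return None; // skipping counts as a pass on hosts without the model
    };
    let body = fetch_body(&path);
    Some(!body.is_empty() && !guard_fired(&body))
}

/// Same check, reading the real environment variable named in this PR.
fn run_from_env(fetch_body: impl Fn(&PathBuf) -> String) -> Option<bool> {
    run_gated(env::var("QWEN3_MOE_GGUF_PATH").ok(), fetch_body)
}
```

In a real test the function would carry `#[test]` and `#[ignore]`, so it only runs under `-- --ignored`; the early return covers the case where it is run that way without the env var.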

Empirical evidence (this PR)

QWEN3_MOE_GGUF_PATH=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  cargo test --test qwen3_moe_serve_dispatch_v1 \
  -p aprender-serve --features cuda --release -- --ignored --nocapture

Passed in 7.84s wall:

{"id":"chatcmpl-q4k-1779208970299","object":"chat.completion","model":"qwen3-moe-v1-001",
 "choices":[{"message":{"role":"assistant","content":"Human: What"}}],
 "usage":{"prompt_tokens":13,"completion_tokens":4,"total_tokens":17}}

Contract bump

v1.1.0 → v1.1.1 — V1_001 + V1_003 evidence fields updated with the new cargo command + recorded pass time; `status_history` appends v1.1.1 entry.

V1_004 (companion-side CCPA Phase 6 bench non-zero pass rate) remains independently BLOCKED on M32d KV cache work — see paiml/claude-code-parity-apr `evidence/phase-6/30b-moe-empirical-2026-05-19.md`.

Test plan

`cargo check --test qwen3_moe_serve_dispatch_v1` — clean
`cargo test --test qwen3_moe_serve_dispatch_v1` without env var — skipped via #[ignore]
`QWEN3_MOE_GGUF_PATH=... cargo test ... --release -- --ignored` — PASS in 7.84s
CI (sovereign-ci full workflow; test stays #[ignore] on CI)

🤖 Generated with Claude Code

…" key (PMAT-698j)

THE root-cause bug behind the entire Phase 3 cuda dispatch cascade (PMAT-698e..i, 6 prior PRs). Discovered by PMAT-698i's [FWD-CACHE] diagnostic logging.

The `warm!` macro in pre_warm_for_model:

    macro_rules! warm {
        ($key:expr, $kernel:expr) => {{
            let ptx = $kernel.emit_ptx_for_target(&target);
            self.get_or_compile("silu_forward", &ptx)?; // <-- HARDCODED
            count += 1;
        }};
    }

Every single `warm!()` call stored its compiled module under the hashmap key "silu_forward", colliding on the first call:

1. warm!("batched_rmsnorm_fwd_896", BatchedVectorizedRmsNormKernel...) → cache["silu_forward"] = BatchedVectorizedRmsNorm PTX
2. warm!("gemm_forward_...", ...) → cache["silu_forward"] already Occupied → returns existing entry, new PTX silently discarded
3-23. same: all subsequent kernels never actually pre-warm.

At runtime, every kernel looks up its real cache key:

    let key = format!("batched_rmsnorm_fwd_{hidden_size}_eps{eps_bits:08x}");
    match cache.get_cached(&key) { Some(m) => m, None => JIT }

and cache-MISSES because the cache contains exactly one entry under "silu_forward". JIT fires for every "pre-warmed" kernel during the first forward pass, exactly when Blackwell sm_121's CUDA driver crashes on cuModuleLoadData during active GPU work.

PMAT-698i's [FWD-CACHE] logging surfaced this: every kernel that was "supposed to be pre-warmed" emitted [FWD-CACHE] Compiling at runtime, proving the cache had nothing in it under those keys.

Fix: pass $key through to get_or_compile. One-line change ("silu_forward" → &key).

This explains the entire PMAT-698e..i cascade:

- PMAT-698e (workspace cap): legit independent bug
- PMAT-698f (APR magic): legit independent bug
- PMAT-698g (non-LoRA backward pre-warm): would have been fine IF forward pre-warm worked; the backward kernels were correctly stored under their real keys (backward macro doesn't have the typo). Defense-in-depth, still valuable.
- PMAT-698h (rms_norm_gamma_reduce): same defense-in-depth.
- PMAT-698j (THIS): the root cause.

The previous PMAT-698g/h fixes are still correct (they covered backward gaps that exist independently). This PR addresses the forward cache, which was the dominant source of post-pre-warm JIT events.

Test plan:

- [x] cargo check --features cuda (clean build)
- [x] 366 autograd lib tests pass
- [ ] Live gx10 dispatch (post-merge) shows ZERO [FWD-CACHE] Compiling events post-pre-warm (all 23 forward kernels now actually cached)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
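
The key collision described in this commit message can be reproduced in isolation. The sketch below is an assumption-laden toy, not the project's cache: `ModuleCache`, `warm_buggy!`, `warm_fixed!`, and `pre_warm` are hypothetical names, and PTX is modelled as a plain string. It only assumes that `get_or_compile` keeps an existing entry when the key is already occupied, as the message states.

```rust
use std::collections::HashMap;

/// Toy stand-in for a compiled-module cache keyed by kernel name.
struct ModuleCache {
    modules: HashMap<String, String>,
}

impl ModuleCache {
    fn new() -> Self {
        Self { modules: HashMap::new() }
    }

    /// Stores `ptx` under `key` only if the key is vacant; an occupied key keeps its old value.
    fn get_or_compile(&mut self, key: &str, ptx: &str) -> &String {
        self.modules.entry(key.to_string()).or_insert_with(|| ptx.to_string())
    }

    fn get_cached(&self, key: &str) -> Option<&String> {
        self.modules.get(key)
    }
}

/// Buggy shape: accepts `$key` but ignores it in favour of a hardcoded literal.
macro_rules! warm_buggy {
    ($cache:expr, $key:expr, $ptx:expr) => {{
        $cache.get_or_compile("silu_forward", $ptx);
    }};
}

/// Fixed shape: threads `$key` through to the cache.
macro_rules! warm_fixed {
    ($cache:expr, $key:expr, $ptx:expr) => {{
        $cache.get_or_compile($key, $ptx);
    }};
}

/// Pre-warms three illustrative kernels with either macro and returns the cache.
fn pre_warm(fixed: bool) -> ModuleCache {
    let mut cache = ModuleCache::new();
    let kernels = [
        ("batched_rmsnorm_fwd_896", "rmsnorm ptx"),
        ("gemm_forward", "gemm ptx"),
        ("silu_forward", "silu ptx"),
    ];
    for (key, ptx) in kernels {
        if fixed {
            warm_fixed!(cache, key, ptx);
        } else {
            warm_buggy!(cache, key, ptx);
        }
    }
    cache
}
```

With the buggy macro the cache ends up with a single entry holding the first kernel's PTX, so every later lookup by the real key misses; with the fixed macro each kernel is found under its own key.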

Formal cargo-test discharge of FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_001 + V1_003 from `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.0 → v1.1.1). Pins the chat-completions MoE dispatch invariant into CI as an opt-in integration test.

## What the test does

`crates/aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs`:

- Loads a real Qwen3-MoE GGUF (via QWEN3_MOE_GGUF_PATH env var)
- Builds AppState with `with_quantized_model_and_vocab` + attaches retained mmap via `with_mapped_gguf_model` (Option B path)
- Creates the router via `realizar::api::create_router`
- POSTs `/v1/chat/completions` with max_tokens=4, temperature=0
- Asserts:
  - HTTP 200 (V1_001: dispatch returns non-error)
  - Non-empty `choices[0].message.content` (V1_001: actual generation)
  - Body does NOT contain "InvalidShape" or "matmul weight has EMPTY data buffer" (V1_003: #1790 defensive guard did not fire, which proves MoE path was taken, not dense)

Gated `#[ignore]` by default. Activated by:

```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test qwen3_moe_serve_dispatch_v1 \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

If `QWEN3_MOE_GGUF_PATH` is unset, the test prints a SKIP message and passes, so it does not block CI on hosts without a real qwen3_moe GGUF.

## Empirical evidence (this PR)

Test passed against `/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf` in 7.84s wall. Response body:

```json
{"id":"chatcmpl-q4k-1779208970299","object":"chat.completion","model":"qwen3-moe-v1-001",
 "choices":[{"message":{"role":"assistant","content":"Human: What"}}],
 "usage":{"prompt_tokens":13,"completion_tokens":4,"total_tokens":17}}
```

Non-empty `content` + no `InvalidShape` → V1_001 + V1_003 cargo-test discharged.

## Contract bump

`qwen3-moe-serve-dispatch-v1.yaml` v1.1.0 → v1.1.1:

- V1_001 evidence updated with new cargo-test command + empirical run record
- V1_003 evidence updated to same
- status_history appends v1.1.1 entry noting formal discharge

V1_004 (companion-side CCPA Phase 6 bench non-zero pass rate) remains BLOCKED on M32d KV cache work, an independent contract gate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Implements M32d KV cache support on the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).

## Empirical results

On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:

- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on every per-turn budget, with 5 timeout-class dispatches recorded in paiml/claude-code-parity-apr evidence/phase-6/30b-moe-empirical-2026-05-19.md)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation (19× speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs full-prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms; ~2× speedup even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still passes (9.39s wall, content "Human: What", no matmul guard fire)

## Implementation

**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs:~165) step-for-step EXCEPT at the FFN block, where it calls `moe_ffn_forward_layer` (router → top-k expert select → per-expert SwiGLU → weighted sum → down projection) instead of the dense gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm, RoPE at `position`, GQA-aware cached attention, output projection, residual) is byte-identical to the dense reference.

**Generate loop rewrite**: `crates/aprender-serve/src/infer/qwen3_moe_generate.rs::run_qwen3_moe_generate` now:

1. Allocates `OwnedQuantizedKVCache` sized to `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
2. Prefill: per prompt token, calls `forward_single_qwen3_moe_with_cache` (cache fills incrementally; final iteration's logits seed decode)
3. Decode: greedy-argmax → append → next cache-aware forward
4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full

**Visibility fix**: `single_cache_final_output` in `ffn_block.rs` bumped to `pub(crate)` so the MoE function can reuse the dense final-norm + LM head path unchanged. Same edit applied to the orphan `debug.rs` duplicate for hygiene (it's not in the build graph but mirrors ffn_block.rs).

## New tests (both `#[ignore]`'d, env-gated)

- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs`: generates 4 tokens via M32d cache-on path AND a legacy full-prefill loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms perf numbers in eprintln output.
- `crates/aprender-serve/tests/m32d_perf.rs`: generates 32 tokens; asserts sustained throughput ≥ 5 tok/s. Floor pinned via `M32D_TPS_FLOOR` constant. Catches future KV-cache regressions.

Activation:

```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test moe_kv_cache_equivalence --test m32d_perf \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

## Risk assessment vs scope doc

All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md` were addressed:

1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache` not touched (only its sibling `single_cache_final_output` visibility bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed to `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as dense reference); first-token edge case (empty cache) explicitly handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected, since the router reads from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into the chat handler (separate follow-up contract).

## Contract bump

v1.1.1 → v1.2.0:

- V1_004 entry gains `prerequisite_status` field documenting M32d shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry

## Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  PHASE6_COMPLIANCE_ENFORCED=1 \
  PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
  APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
  APR_AGENT_MAX_TOKENS_CAP=1024 \
  bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s sustained. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.

## Supersedes

#1829 (Option b engineer-playbook + V1_004 status formalization): the operator flipped from Option (b) to Option (a) in-session; this PR delivers the actual implementation. #1829 can be closed as superseded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
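
The prefill/decode loop and the cache-on vs full-prefill equivalence check can be illustrated with a toy model. Everything below is a hypothetical sketch: the "model" is integer arithmetic rather than attention, and names such as `KvCache`, `project`, and `generate_cached` are invented here. Only the cache sizing rule `max(context_len, prompt_len + max_tokens + 8)` and the loop structure are taken from the description above.

```rust
/// Toy cache: one integer per token position (a real cache holds per-layer K/V tensors).
struct KvCache {
    entries: Vec<u64>,
    capacity: usize,
}

impl KvCache {
    fn new(capacity: usize) -> Self {
        Self { entries: Vec::with_capacity(capacity), capacity }
    }
    fn is_full(&self) -> bool {
        self.entries.len() >= self.capacity
    }
}

const VOCAB: u64 = 97;

/// Toy stand-in for "project the token to K/V and apply RoPE at `position`".
fn project(token: u32, position: usize) -> u64 {
    (token as u64 + 1) * (position as u64 + 3)
}

/// Cache-aware step: appends this token's entry, then picks the next token greedily.
fn forward_with_cache(token: u32, position: usize, cache: &mut KvCache) -> u32 {
    cache.entries.push(project(token, position));
    (cache.entries.iter().sum::<u64>() % VOCAB) as u32
}

/// Legacy step: recomputes every position from the full token history.
fn forward_full_prefill(tokens: &[u32]) -> u32 {
    let sum: u64 = tokens.iter().enumerate().map(|(pos, &t)| project(t, pos)).sum();
    (sum % VOCAB) as u32
}

/// Sizing rule quoted in the description.
fn cache_capacity(context_len: usize, prompt_len: usize, max_tokens: usize) -> usize {
    context_len.max(prompt_len + max_tokens + 8)
}

fn generate_cached(prompt: &[u32], max_tokens: usize, context_len: usize) -> Vec<u32> {
    let mut cache = KvCache::new(cache_capacity(context_len, prompt.len(), max_tokens));
    let mut next = 0;
    // Prefill: one cache-aware forward per prompt token; the last result seeds decode.
    for (pos, &tok) in prompt.iter().enumerate() {
        next = forward_with_cache(tok, pos, &mut cache);
    }
    let mut out = Vec::new();
    // Decode: emit the greedy token, then feed it back through the cache-aware step.
    for step in 0..max_tokens {
        if cache.is_full() {
            break;
        }
        out.push(next);
        next = forward_with_cache(next, prompt.len() + step, &mut cache);
    }
    out
}

fn generate_full_prefill(prompt: &[u32], max_tokens: usize) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..max_tokens {
        let next = forward_full_prefill(&tokens);
        tokens.push(next);
    }
    tokens[prompt.len()..].to_vec()
}

/// Throughput as it would be compared against a floor such as `M32D_TPS_FLOOR`.
fn tokens_per_second(tokens: usize, elapsed_secs: f64) -> f64 {
    tokens as f64 / elapsed_secs
}
```

The equivalence test then reduces to comparing `generate_cached(&prompt, 4, ctx)` with `generate_full_prefill(&prompt, 4)`. The cached path does constant bookkeeping per new token while the legacy path reprocesses the whole history, which is why the gap widens as sequences get longer.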

… C-prep) (#1840)

* feat(M32d): KV cache for qwen3_moe inference path — 19× speedup

* chore(distill): add DATASET_DIR env var → --dataset flag (Phase 4 Stage C-prep)

Threads the Phase 4 Stage B-2 `apr distill --dataset <DIR>` flag through the gx10 dispatch script. When `DATASET_DIR` is set, the script passes the directory to apr distill, which drives training from real-corpus .bin shards via ShardBatchSource. When unset (default), the pipeline falls back to SyntheticBatchSource (Phase 3 smoke semantics).

Preamble now surfaces which mode the dispatch is in:

    dataset: /path/to/shards (Phase 4 Stage B-2: real corpus via --dataset)
    dataset: (synthetic — Phase 3 smoke semantics)

Validates the directory exists on gx10 before launching apr distill; fails fast with a clear message otherwise.

Usage:

    STEPS=100 DATASET_DIR=/home/noah/data/codeparrot-shards-trial \
      bash scripts/dispatch-distill-phase-3-gx10.sh

Depends on PR #1839 (Stage B-2) landing first so `apr distill --dataset` exists on the rebuilt gx10 binary. With #1839 unmerged, the script's invocation fails with a clap error.

Phase 4 ladder progress:

    Stage A      (#1833)  ✅ MERGED + verified
    Stage B-1    (#1836)  ✅ MERGED
    Stage B-2    (#1839)  🟡 in CI
    Stage C-prep (THIS)   dispatch script + 10 pre-staged shards
    Stage C               run on gx10 with --dataset
    Stage D               50K-step Phase 4 dispatch
    Stage E               HumanEval pass@1
    Stage F               publish v2

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
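
The mode selection the dispatch script performs can be sketched as below. The actual script is bash; this Rust version is only an illustration of the decision logic, and the names `BatchMode`, `batch_mode`, and `distill_args` are hypothetical. It assumes that an empty `DATASET_DIR` is treated like an unset one, and it shows only the `--dataset` flag named in the commit message, not the script's other arguments.

```rust
use std::path::Path;

/// Which batch source the dispatch would select.
#[derive(Debug, PartialEq)]
enum BatchMode {
    /// Real-corpus shards read from the given directory.
    RealCorpus(String),
    /// Synthetic batches (the default when no directory is given).
    Synthetic,
}

/// Picks the mode from an optional DATASET_DIR value, failing fast on a missing directory.
fn batch_mode(dataset_dir: Option<&str>) -> Result<BatchMode, String> {
    match dataset_dir {
        // Assumption: an empty value behaves like an unset variable.
        None | Some("") => Ok(BatchMode::Synthetic),
        Some(dir) if Path::new(dir).is_dir() => Ok(BatchMode::RealCorpus(dir.to_string())),
        Some(dir) => Err(format!("DATASET_DIR is not an existing directory: {dir}")),
    }
}

/// Builds the subcommand arguments: `--dataset <DIR>` is added only in real-corpus mode.
fn distill_args(mode: &BatchMode) -> Vec<String> {
    let mut args = vec!["distill".to_string()];
    if let BatchMode::RealCorpus(dir) = mode {
        args.push("--dataset".to_string());
        args.push(dir.clone());
    }
    args
}
```

Validating the directory before building the argument list mirrors the fail-fast behaviour described above: a bad path is reported up front rather than surfacing later as a training error.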

noahgift and others added 2 commits May 19, 2026 18:36

noahgift mentioned this pull request May 19, 2026

test(m-gpu-moe-3): FALSIFY-Q4K-REAL-WEIGHT-006 — 🚨 ROOT CAUSE FOUND: CUDA Q4_K matvec 237,775× divergence vs CPU (#1583 PR-3k DISCHARGE) #1821

Merged

2 tasks

noahgift merged commit 207dfde into main May 19, 2026
11 checks passed

noahgift deleted the fix/1789-v1-001-integration-test branch May 19, 2026 17:15

This was referenced May 19, 2026

spec(M32d): KV cache for qwen3_moe inference path — scope + operator decision doc #1826

Merged

M32d: KV cache for qwen3_moe inference path (engineer-driven, 1-2 week) #1830

Closed

feat(M32d): KV cache for qwen3_moe inference path — 19× speedup #1832

Merged
