docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.1 → v1.7.2 — M-GPU-MOE-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN by noahgift · Pull Request #1747 · paiml/aprender

noahgift · 2026-05-17T11:29:32Z

Summary

Terminal amendment for the M-GPU-MOE-3 PR-3 sub-cascade of #1583. Captures the cascade outcome in the canonical contract location and marks L47 as KNOWN_DIVERGENCE_NOT_BENIGN so future engineers don't have to re-derive the cascade narrative from GitHub comments.

YAML-only — production hot paths byte-unchanged.

Cascade outcome

| PR | Result |
|---|---|
| PR-2 #1737 | ✅ fp64 q6k_gemv acc: 47/48 layers cos ≥ 0.99 |
| PR-3 verify | ✅ ran on RTX 4090: L47 cliff surfaces |
| PR-3b #1739 | ✅ v1.7.0 → v1.7.1 |
| PR-3c #1740 | ✅ scope doc + L47 sub-cascade |
| PR-3d | ❌ H(i) qtype-mismatch FALSIFIED |
| PR-3e #1741 | ✅ L47 first divergent router (cos 0.9926) |
| PR-3e2 #1743 | ✅ H(ii) CONFIRMED: 2-of-8 expert swap at L47 |
| PR-3f1 | ❌ fp64 gate softmax FALSIFIED |
| PR-3f2 | ❌ f64 weighted-sum FALSIFIED |
| PR-3g #1745 | ✅ L47 NOT BENIGN: 3/4 prompts agree, 1/4 flips |
| This PR (PR-3-final) | ✅ v1.7.1 → v1.7.2 closing amendment |

Root cause (by elimination)

Per-expert SwiGLU f32 intermediates:

1. gate_proj @ hidden    ← fp64 acc thanks to PR-2 ✅
2. silu(gate)            ← f32 ✗
3. silu(gate) × up_proj  ← f32 multiply on 8192-element vector ✗
4. down_proj @ above     ← fp64 acc thanks to PR-2 ✅

Fix scope = PR-3h (multi-week; requires unfusing and re-fusing the GPU SwiGLU kernel). Parked behind the PR-4 throughput cascade.
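To make steps 2 and 3 above concrete, here is a minimal CPU-only sketch that computes `silu(gate) × up` once with f32 intermediates and once with f64 intermediates, then reports the largest elementwise gap. The function names and the synthetic inputs are assumptions for illustration; this is not the code in `expert_swiglu_quantized` or `expert_swiglu_cuda`, and it does not reproduce the L47 numbers.

```rust
/// SiLU in f32: x * sigmoid(x), written as x / (1 + e^-x).
fn silu_f32(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SiLU in f64, same formula at higher precision.
fn silu_f64(x: f64) -> f64 {
    x / (1.0 + (-x).exp())
}

/// Steps 2-3 with f32 intermediates: every operation rounds to f32.
fn swiglu_mid_f32(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu_f32(g) * u).collect()
}

/// Steps 2-3 with f64 intermediates: round to f32 only once, at the end.
fn swiglu_mid_f64(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter()
        .zip(up)
        .map(|(&g, &u)| (silu_f64(g as f64) * u as f64) as f32)
        .collect()
}

/// Largest absolute elementwise difference between two vectors.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

fn main() {
    // Synthetic 8192-element inputs (hypothetical values, not model data).
    let n = 8192;
    let gate: Vec<f32> = (0..n).map(|i| (i as f32 * 0.37).sin() * 4.0).collect();
    let up: Vec<f32> = (0..n).map(|i| (i as f32 * 0.11).cos() * 2.0).collect();

    let lo = swiglu_mid_f32(&gate, &up);
    let hi = swiglu_mid_f64(&gate, &up);
    println!("max |f32 - f64| = {:e}", max_abs_diff(&lo, &hi));
}
```

The per-element gap is tiny, but it feeds the `down_proj` reduction and then the router, where a small shift can be enough to change a top-k decision.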

What flips

metadata.version 1.7.1 → 1.7.2
bottom-of-file version: "1.7.1" → "1.7.2"
bottom-of-file status comment refreshed
New amendment_history v1.7.2 entry with full cascade narrative, regression-gate test names, and KNOWN_DIVERGENCE_NOT_BENIGN classification rationale

Status

metadata.status unchanged at ACTIVE_ALGORITHM_LEVEL — the algorithm is bound on main; ACTIVE_RUNTIME still gates on throughput (PR-4) + L47 fix (PR-3h)
AC_GPU_MOE_001 text refresh: 47/48 layers ALGORITHM_LEVEL_DISCHARGED; L47 marked KNOWN_DIVERGENCE_NOT_BENIGN

Why KNOWN_DIVERGENCE_NOT_BENIGN, not KNOWN_BUG

L47 is a numerical-precision artifact, not a correctness bug. CPU and GPU follow the same algorithm against the same weights; only the order of f32 accumulation inside the per-expert SwiGLU differs. Both pick legitimate top-8 expert sets at L47 — neither is wrong — but the small score-perturbation crosses a top-k boundary. Same class as the gemv reduction-order variance that PR-2 fixed, one call-stack level higher.
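The top-k boundary effect can be sketched in a few lines. The scores below are hypothetical (4 experts, top-2 rather than the real top-8), and `top_k_sorted` is an illustrative helper, not a function from the repository. A perturbation of 2e-5 on one score is enough to swap an expert in the selected set, even though both selections are valid for their respective inputs.

```rust
/// Indices of the k largest scores, returned in ascending index order
/// so two selections can be compared as sets.
fn top_k_sorted(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Sort indices by descending score; total_cmp gives a total order on f32.
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    idx.truncate(k);
    idx.sort_unstable();
    idx
}

fn main() {
    // Hypothetical router scores: experts 1 and 2 are nearly tied.
    let cpu = [0.40_f32, 0.30001, 0.30000, 0.10];

    // Same scores with a small perturbation on expert 2, standing in for
    // a precision-level difference between two backends.
    let mut gpu = cpu;
    gpu[2] += 2.0e-5;

    let cpu_set = top_k_sorted(&cpu, 2);
    let gpu_set = top_k_sorted(&gpu, 2);
    println!("cpu top-2 = {:?}, gpu top-2 = {:?}", cpu_set, gpu_set);
    println!("sets match: {}", cpu_set == gpu_set);
}
```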

Regression gate for PR-3h

When PR-3h lands, expect:

falsify_qw3_moe_l47_router_indices (#1743): L47 sorted top-8 CPU == GPU
falsify_qw3_moe_gpu_argmax_agreement (#1745): 4/4 prompts argmax agree

Both #[ignore] + #[cfg(feature = "cuda")] gated.
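As a rough illustration of what the argmax-agreement gate checks, the sketch below counts how many prompts produce the same greedy token on two backends. The helper names and the logits are assumptions made up for this example; the real falsifiers run the model on CUDA hardware and are not reproduced here.

```rust
/// Index of the largest logit (first one wins on ties), or None if empty.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if best.map_or(true, |(_, b)| v > b) {
            best = Some((i, v));
        }
    }
    best.map(|(i, _)| i)
}

/// Returns (prompts whose argmax agrees, prompts compared).
fn argmax_agreement(cpu: &[Vec<f32>], gpu: &[Vec<f32>]) -> (usize, usize) {
    let agree = cpu
        .iter()
        .zip(gpu)
        .filter(|pair| argmax(pair.0) == argmax(pair.1))
        .count();
    (agree, cpu.len().min(gpu.len()))
}

fn main() {
    // Synthetic final-position logits for 4 prompts (hypothetical values).
    let cpu = vec![
        vec![0.1_f32, 0.9, 0.0],
        vec![0.7, 0.2, 0.1],
        vec![0.2, 0.3, 0.5],
        vec![0.50, 0.49, 0.01],
    ];
    // The last prompt has a near-tie that resolves differently.
    let gpu = vec![
        vec![0.1_f32, 0.9, 0.0],
        vec![0.7, 0.2, 0.1],
        vec![0.2, 0.3, 0.5],
        vec![0.49, 0.50, 0.01],
    ];

    let (agree, total) = argmax_agreement(&cpu, &gpu);
    println!("{agree}/{total} prompts agree");
    // A gate in the spirit of the one described above would require agree == total.
}
```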

Test plan

pv validate → 0 errors, valid
No production hot-path changes (YAML-only)

🤖 Generated with Claude Code

…-3 PR-2 verified, L47 surfaced

Hardware-verification amendment after M-GPU-MOE-3 PR-2 landed on main (#1737, 88ce47f; q6k_gemv fp64 accumulators). PR-3 ran the per-layer FALSIFY-QW3-MOE-PER-LAYER-001 falsifier on lambda-vector (RTX 4090) against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M on 2026-05-17.

Result: 47/48 decoder layers cos ≥ 0.99 (PASS). One layer (L47, the final decoder layer) sits at cos=0.961236, 3σ below the L40-L46 cluster (~0.998). Full 48-layer cos vector logged in GitHub comment on #1583 (issuecomment-4470195446).

The 7 originally-cited problem layers (L7/L9/L12/L20/L23/L29/L46, v1.7.0 amendment lines 41-45) ALL lifted above 0.99: PR-2 was a real win. L47 was previously undetected because no per-layer falsifier existed in-tree; PR-1 of this cascade (#1713) closed that gap and surfaced the L47 anomaly.

WHAT FLIPS:
- metadata.version 1.7.0 → 1.7.1
- bottom-of-file version: "1.7.0" → "1.7.1"
- bottom-of-file status comment refreshed: "1.x cascade DISCHARGED — wgpu (2) + throughput (3) PENDING" → "47/48 layers cos≥0.99 post-PR #1737; L47 single-layer cascade PENDING"
- AC_GPU_MOE_001 stage status text refresh (text-only; not yet refactored into a new amendment_history entry since this PR is scoped to the v1.7.1 amendment block only).

WHAT STAYS PENDING:
- L47 single-layer cascade: root cause unknown. Three candidate hypotheses captured in the v1.7.1 amendment block (qtype mismatch, MoE expert distribution, stride/shape boundary). Forthcoming PR-3c surfaces §85 (or next-available section) covering the L47 cascade. Forthcoming PR-3d+: per-tensor histogram on L47 before authoring fix.
- M-GPU-MOE-2 (wgpu fallback): unchanged
- M-GPU-MOE-3 PR-4 throughput: unchanged

YAML-ONLY: Production hot paths byte-unchanged. Additive-purity invariant pinned in v1.1.0 still holds.

Contract validates via:

    cargo run -p aprender-contracts-cli --bin pv -- \
      validate contracts/qwen3-moe-forward-gpu-v1.yaml
    → 0 error(s), 0 warning(s), Contract is valid.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
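The per-layer gate described in this commit message (cos ≥ 0.99 between CPU and GPU activations) can be sketched as follows. This is an assumed shape for such a check, with hypothetical function names and toy data; it is not the in-tree FALSIFY-QW3-MOE-PER-LAYER-001 implementation.

```rust
/// Cosine similarity between two activation vectors, accumulated in f64.
/// Returns NaN if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    let (mut dot, mut na, mut nb) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Layers whose CPU/GPU cosine is below the threshold (NaN also counts as failing).
fn layers_below(cpu: &[Vec<f32>], gpu: &[Vec<f32>], threshold: f64) -> Vec<(usize, f64)> {
    cpu.iter()
        .zip(gpu)
        .enumerate()
        .map(|(i, (c, g))| (i, cosine(c, g)))
        .filter(|&(_, cos)| !(cos >= threshold))
        .collect()
}

fn main() {
    // Toy per-layer activations (hypothetical; real vectors come from the model).
    let cpu = vec![vec![1.0_f32, 2.0, 3.0], vec![1.0, 0.0, 0.0], vec![0.5, 0.5, 0.5]];
    let gpu = vec![vec![1.0_f32, 2.0, 3.0], vec![1.0, 0.5, 0.0], vec![0.5, 0.5, 0.5]];

    for (layer, cos) in layers_below(&cpu, &gpu, 0.99) {
        println!("L{layer} below threshold: cos = {cos:.6}");
    }
}
```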

…final-contract-v1.7.2

…-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN

Terminal amendment for the M-GPU-MOE-3 PR-3 sub-cascade. After v1.7.1 surfaced L47 as a single-layer cliff (cos=0.961236 post fp64 q6k_gemv acc, PR-2 #1737), the cascade ran a 5-step falsifier sequence (PRs #1737, #1739-1745 + 4 #1583 comments) to pin the root cause and verify user-visible impact.

OUTCOME
- PR-3 ✅ 47/48 layers cos ≥ 0.99, L47 alone at 0.961236
- PR-3d ❌ H(i) qtype-mismatch FALSIFIED
- PR-3e ✅ #1741: L47 first divergent router (cos 0.9926)
- PR-3e2 ✅ #1743: H(ii) CONFIRMED, 2-of-8 expert swap at L47
- PR-3f1 ❌ fp64 gate softmax FALSIFIED (drift upstream)
- PR-3f2 ❌ f64 weighted-sum FALSIFIED (drift upstream)
- PR-3g ✅ #1745: multi-prompt argmax: 3/4 agree, 1/4 disagrees → L47 NOT BENIGN (~25% prompt-dependent impact)

ROOT CAUSE (by elimination)
Per-expert SwiGLU f32 intermediates:
1. gate_proj @ hidden ← fp64 acc thanks to PR-2 ✅
2. silu(gate) ← f32 ✗
3. silu(gate) × up_proj ← f32 multiply on 8192-element vector ✗
4. down_proj @ above ← fp64 acc thanks to PR-2 ✅

Fix scope = PR-3h: promote silu × up multiply + intermediate state to f64 in both expert_swiglu_quantized (CPU, simple) and expert_swiglu_cuda (GPU, requires unfusing and re-fusing the SwiGLU kernel). Multi-week kernel work.

STATUS FLIPS
- metadata.version: 1.7.1 → 1.7.2
- metadata.status: ACTIVE_ALGORITHM_LEVEL (unchanged)
- AC_GPU_MOE_001: 47/48 layers ALGORITHM_LEVEL_DISCHARGED + L47 KNOWN_DIVERGENCE_NOT_BENIGN

WHAT STAYS PENDING
- PR-3h fp64 per-expert SwiGLU (multi-week)
- M-GPU-MOE-2 wgpu fallback (#1582)
- M-GPU-MOE-3 PR-4 throughput (independent of L47 fix; unblocked by this amendment)

WHY NOT KNOWN_BUG
L47 is a numerical-precision artifact, not a correctness bug. CPU and GPU follow the same algorithm against the same weights; only the order of f32 accumulation inside the per-expert SwiGLU differs. Both pick legitimate top-8 sets at L47 (neither is wrong), but the small score-perturbation crosses a top-k boundary. Same class as gemv reduction-order variance, one call-stack level higher.

REGRESSION GATE FOR PR-3h
- falsify_qw3_moe_l47_router_indices (#1743): expect CPU L47 sorted top-8 == GPU L47 sorted top-8
- falsify_qw3_moe_gpu_argmax_agreement (#1745): expect 4/4 prompts argmax agreement

YAML-ONLY
Production hot paths byte-unchanged. Additive-purity invariant pinned in v1.1.0 still holds.

Contract validates via:

    cargo run -p aprender-contracts-cli --bin pv -- \
      validate contracts/qwen3-moe-forward-gpu-v1.yaml
    → 0 error(s), 0 warning(s), Contract is valid.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rward_qwen3_moe (#1749) (#1751)

Closes #1749.

Pre-fix, `apr bench` against any MoE GGUF (Qwen3-Coder-30B-A3B-Instruct etc.) routed through the dense `forward_single_with_cache` path which calls `matmul_fused.rs:211` on tensor names that don't exist on MoE models (the 3D `*_exps` tensors are stored at different names than the 2D dense `ffn_{gate,up,down}.weight` the dense path looks up). Result: hundreds of parallel thread panics: `index out of bounds: len=0 but index ≈ 91M`.

This PR adds MoE detection via `gguf.expert_count().is_some()` and routes to the MoE-aware forward path:

- CPU: realizar::gguf::OwnedQuantizedModel::forward_qwen3_moe
- CUDA: realizar::gguf::OwnedQuantizedModelCuda::forward_qwen3_moe_cuda

Both helpers do not currently expose a KV cache, so the bench runs them **autoregressively with re-prefill**: each iteration runs full forward over `prompt + previously-generated tokens` and appends the argmax to the prompt for the next iter. O(N²) in N tokens but bounded by `--max-tokens` (default 32).

This is intentionally a stop-gap to unblock M-GPU-MOE-3 PR-4 throughput measurement. True KV-cache MoE decoding is the actual PR-4 work; this PR makes `apr bench` produce a real (if pessimistic) tok/s number for MoE GGUFs instead of panicking.

## Empirical (lambda-vector RTX 4090, 2026-05-17)

    apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
      --max-tokens 8 --warmup 1 --iterations 4 --json
    → total_time_ms: 87085 ; total_tokens: 4
    → 0.046 tok/s effective (auto-regressive re-prefill cost dominates)
    → 22.8s ttft, 22.2s p50, 32.5s p99

The 0.046 tok/s is the upper bound on what `apr bench` can currently measure for MoE without KV cache. PR-4's job is to add the cache and push this to ≥ 150 tok/s.

## What's in this PR

crates/apr-cli/src/commands/bench_moe.rs (new):
- `is_moe_gguf(&GGUFModel)` predicate
- `run_gguf_moe_benchmark`: loads MappedGGUFModel + N Qwen3MoeQuantizedLayer descriptors + (optionally) wraps in OwnedQuantizedModelCuda, then dispatches to the CUDA or CPU bench helper.
- `run_cuda_moe_benchmark`: autoregressive forward_qwen3_moe_cuda + greedy argmax decode.
- `run_cpu_moe_benchmark`: autoregressive forward_qwen3_moe.

crates/apr-cli/src/commands/bench.rs:
+ `include!("bench_moe.rs")` after the existing `include!("bench_safetensors.rs")` (same pattern as the other bench sub-files).

crates/apr-cli/src/commands/benchmark.rs:
+ In `run_gguf_benchmark`, after parsing the GGUF and tokenising the prompt, check `is_moe_gguf(&gguf)`. If true, log the detection (`expert_count` + top-k) and tail-call `run_gguf_moe_benchmark`. Otherwise fall through to the existing dense path.

## What's NOT in this PR

- True KV-cache MoE decoding (= M-GPU-MOE-3 PR-4 throughput target)
- Streaming/per-token JSON output for MoE (existing JSON output works; just reflects the autoregressive re-prefill cost)
- MoE bench for SafeTensors / APR formats (only GGUF MoE supported today; the other formats don't have MoE production paths in the realizar inference engine)

## Cross-refs

- #1583 M-GPU-MOE-3: PR-4 throughput unblocks on this
- #1747 contract qwen3-moe-forward-gpu-v1 v1.7.2 (just merged): L47 known-divergence + cascade pause point
- The MoE bench helpers reuse `forward_qwen3_moe[_cuda]` directly, which means PR-2 #1737's fp64 q6k_gemv acc is in effect; this bench measures *post-fp64-acc* throughput, not the pre-fix path.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
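The autoregressive re-prefill loop described in this commit message can be sketched generically. Here the model is abstracted as a closure from a token context to logits; `greedy_reprefill`, `tokens_per_second`, and the toy model are assumptions for illustration and do not mirror the signatures of `forward_qwen3_moe` or `run_cpu_moe_benchmark`. The sketch also counts how many positions are re-processed, which is where the O(N²) cost comes from, and redoes the tok/s arithmetic from the figures quoted above.

```rust
/// Index of the largest logit (first one wins on ties), or None if empty.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if best.map_or(true, |(_, b)| v > b) {
            best = Some((i, v));
        }
    }
    best.map(|(i, _)| i)
}

/// Greedy decode without a KV cache: every step re-runs the forward pass
/// over the whole context. Returns the final token sequence and the total
/// number of positions processed across all steps.
fn greedy_reprefill<F>(prompt: &[u32], max_tokens: usize, mut forward: F) -> (Vec<u32>, usize)
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut tokens = prompt.to_vec();
    let mut positions = 0;
    for _ in 0..max_tokens {
        let logits = forward(&tokens); // full forward over prompt + generated
        positions += tokens.len();
        match argmax(&logits) {
            Some(next) => tokens.push(next as u32),
            None => break,
        }
    }
    (tokens, positions)
}

/// Effective throughput from a generated-token count and wall time in ms.
fn tokens_per_second(generated: usize, total_ms: u64) -> f64 {
    generated as f64 / (total_ms as f64 / 1000.0)
}

fn main() {
    // Toy stand-in for the model: always predicts (last token + 1) mod 8.
    let toy_forward = |ctx: &[u32]| -> Vec<f32> {
        let mut logits = vec![0.0_f32; 8];
        let last = *ctx.last().unwrap_or(&0) as usize;
        logits[(last + 1) % 8] = 1.0;
        logits
    };

    let (tokens, positions) = greedy_reprefill(&[1, 2, 3], 4, toy_forward);
    println!("tokens = {tokens:?}, positions processed = {positions}");

    // Figures quoted in the commit message: 4 tokens in 87085 ms.
    println!("{:.3} tok/s", tokens_per_second(4, 87_085));
}
```

With a KV cache, each step would process one new position instead of the whole context, so the processed-position count would grow linearly rather than quadratically.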

noahgift and others added 3 commits May 17, 2026 11:47

Merge remote-tracking branch 'origin/main' into docs/m-gpu-moe-3-pr3-…

3a5e88c

…final-contract-v1.7.2

noahgift enabled auto-merge (squash) May 17, 2026 11:29

noahgift merged commit cbe22d0 into main May 17, 2026
11 checks passed

noahgift deleted the docs/m-gpu-moe-3-pr3-final-contract-v1.7.2 branch May 17, 2026 11:55

noahgift merged 3 commits into main from docs/m-gpu-moe-3-pr3-final-contract-v1.7.2
