docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.1 → v1.7.2 — M-GPU-MOE-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN #1747

Merged: noahgift merged 3 commits into main from docs/m-gpu-moe-3-pr3-final-contract-v1.7.2 on May 17, 2026

@noahgift

Summary

Terminal amendment for the M-GPU-MOE-3 PR-3 sub-cascade of #1583. Captures the cascade outcome in the canonical contract location and marks L47 as KNOWN_DIVERGENCE_NOT_BENIGN so future engineers don't have to re-derive the cascade narrative from GitHub comments.

YAML-only — production hot paths byte-unchanged.

Cascade outcome

| PR | Result |
|---|---|
| PR-2 #1737 | ✅ fp64 q6k_gemv acc; 47/48 layers ≥0.99 |
| PR-3 verify | ✅ ran on RTX 4090; L47 cliff surfaces |
| PR-3b #1739 | ✅ v1.7.0 → v1.7.1 |
| PR-3c #1740 | ✅ scope doc + L47 sub-cascade |
| PR-3d | ❌ H(i) qtype-mismatch FALSIFIED |
| PR-3e #1741 | ✅ L47 first divergent router (cos 0.9926) |
| PR-3e2 #1743 | ✅ H(ii) CONFIRMED: 2-of-8 expert swap at L47 |
| PR-3f1 | ❌ fp64 gate softmax FALSIFIED |
| PR-3f2 | ❌ f64 weighted-sum FALSIFIED |
| PR-3g #1745 | ✅ L47 NOT BENIGN: 3/4 prompts agree, 1/4 flips |
| This PR (PR-3-final) | ✅ v1.7.1 → v1.7.2 closing amendment |
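
The `cos` figures in the table are read here as cosine similarities between CPU and GPU per-layer outputs, checked against the 0.99 threshold. A minimal sketch of that metric follows; the function names are hypothetical and are not the in-tree falsifier.

```rust
/// Cosine similarity between two equal-length activation vectors.
/// Accumulates in f64 so the metric itself adds as little rounding noise as possible.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Indices of layers whose cosine falls below `threshold` (hypothetical helper).
fn failing_layers(cos_per_layer: &[f64], threshold: f64) -> Vec<usize> {
    cos_per_layer
        .iter()
        .enumerate()
        .filter_map(|(i, &c)| if c < threshold { Some(i) } else { None })
        .collect()
}
```

Calling `failing_layers` on a 48-entry vector with a threshold of 0.99 would list every layer that misses the gate.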

Root cause (by elimination)

Per-expert SwiGLU f32 intermediates:

1. gate_proj @ hidden    ← fp64 acc thanks to PR-2 ✅
2. silu(gate)            ← f32 ✗
3. silu(gate) × up_proj  ← f32 multiply on 8192-element vector ✗
4. down_proj @ above     ← fp64 acc thanks to PR-2 ✅

Fix scope = PR-3h (multi-week; requires unfusing and re-fusing the GPU SwiGLU kernel). Parked behind the PR-4 throughput cascade.
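
To make steps 2 and 3 concrete, the sketch below computes `silu(gate) × up` once with f32 intermediates and once with f64 intermediates. It is a scalar illustration only: the function names are hypothetical and this is not the fused kernel or the in-tree CPU path.

```rust
/// SiLU activation, x * sigmoid(x), evaluated in f32.
fn silu_f32(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SiLU activation evaluated in f64.
fn silu_f64(x: f64) -> f64 {
    x / (1.0 + (-x).exp())
}

/// Steps 2-3 with f32 intermediates.
fn swiglu_mid_f32(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu_f32(g) * u).collect()
}

/// Steps 2-3 with inputs widened to f64 and the product kept in f64,
/// so a down projection with f64 accumulation can consume it unrounded.
fn swiglu_mid_f64(gate: &[f32], up: &[f32]) -> Vec<f64> {
    gate.iter()
        .zip(up)
        .map(|(&g, &u)| silu_f64(g as f64) * u as f64)
        .collect()
}

/// Largest element-wise gap between the two variants.
fn max_abs_diff(a: &[f32], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (x as f64 - y).abs())
        .fold(0.0, f64::max)
}
```

The per-element gap is tiny; the point of the cascade is that such gaps, carried through the down projection and the router, can still be enough to move a score across a top-k boundary.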

What flips

  • metadata.version 1.7.1 → 1.7.2
  • bottom-of-file version: "1.7.1" → "1.7.2"
  • bottom-of-file status comment refreshed
  • New amendment_history v1.7.2 entry with full cascade narrative, regression-gate test names, and KNOWN_DIVERGENCE_NOT_BENIGN classification rationale

Status

  • metadata.status unchanged at ACTIVE_ALGORITHM_LEVEL — the algorithm is bound on main; ACTIVE_RUNTIME still gates on throughput (PR-4) + L47 fix (PR-3h)
  • AC_GPU_MOE_001 text refresh: 47/48 layers ALGORITHM_LEVEL_DISCHARGED; L47 marked KNOWN_DIVERGENCE_NOT_BENIGN

Why KNOWN_DIVERGENCE_NOT_BENIGN, not KNOWN_BUG

L47 is a numerical-precision artifact, not a correctness bug. CPU and GPU follow the same algorithm against the same weights; only the order of f32 accumulation inside the per-expert SwiGLU differs. Both pick legitimate top-8 expert sets at L47 — neither is wrong — but the small score-perturbation crosses a top-k boundary. Same class as the gemv reduction-order variance that PR-2 fixed, one call-stack level higher.
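
The boundary-crossing effect can be shown with a toy router. In this sketch, which uses made-up scores and a hypothetical `sorted_top_k` helper rather than the real router code, a perturbation far smaller than the scores themselves changes which expert is selected.

```rust
/// Indices of the `k` largest scores, returned in ascending index order
/// so two selections can be compared as sets.
fn sorted_top_k(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Descending by score; total_cmp gives a total order over f32.
    // sort_by is stable, so equal scores keep their lower-index-first order.
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    idx.truncate(k);
    idx.sort_unstable();
    idx
}

/// Returns the top-k selections before and after nudging one score.
fn top_k_before_after(
    scores: &[f32],
    k: usize,
    nudge_index: usize,
    nudge: f32,
) -> (Vec<usize>, Vec<usize>) {
    let before = sorted_top_k(scores, k);
    let mut perturbed = scores.to_vec();
    perturbed[nudge_index] += nudge;
    (before, sorted_top_k(&perturbed, k))
}
```

With scores `[0.30, 0.10, 0.2000, 0.2001]` and `k = 2`, nudging index 2 by `0.0002` swaps expert 3 out for expert 2, even though both selections are valid top-2 sets for their respective inputs.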

Regression gate for PR-3h

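When PR-3h lands, expect:

  • falsify_qw3_moe_l47_router_indices (#1743): CPU L47 sorted top-8 == GPU L47 sorted top-8
  • falsify_qw3_moe_gpu_argmax_agreement (#1745): 4/4 prompts argmax agreement

Both #[ignore] + #[cfg(feature = "cuda")] gated.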
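
The two checks named for this gate (sorted top-8 equality and per-prompt argmax agreement) reduce to small comparisons. The helpers below are a hypothetical sketch of those comparisons, not the bodies of the in-tree tests, and they omit the `#[ignore]` and `cuda` feature gating.

```rust
/// Index of the largest logit (the first one wins on ties); None for empty input.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<usize> = None;
    for (i, &v) in logits.iter().enumerate() {
        match best {
            Some(b) if logits[b] >= v => {}
            _ => best = Some(i),
        }
    }
    best
}

/// Order-insensitive comparison of two expert-index selections.
fn same_expert_set(cpu: &[usize], gpu: &[usize]) -> bool {
    let (mut a, mut b) = (cpu.to_vec(), gpu.to_vec());
    a.sort_unstable();
    b.sort_unstable();
    a == b
}

/// Number of prompts for which CPU and GPU logits pick the same next token.
fn argmax_agreement(cpu: &[Vec<f32>], gpu: &[Vec<f32>]) -> usize {
    cpu.iter()
        .zip(gpu)
        .filter(|(c, g)| argmax(c) == argmax(g))
        .count()
}
```

Under this reading, the gate passes when `same_expert_set` holds for the L47 selections and `argmax_agreement` equals the number of prompts.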

Test plan

  • pv validate → 0 errors, valid
  • No production hot-path changes (YAML-only)

🤖 Generated with Claude Code

noahgift and others added 3 commits May 17, 2026 11:47
…-3 PR-2 verified, L47 surfaced

Hardware-verification amendment after M-GPU-MOE-3 PR-2 landed on main
(#1737, 88ce47f — q6k_gemv fp64 accumulators).

PR-3 ran the per-layer FALSIFY-QW3-MOE-PER-LAYER-001 falsifier on
lambda-vector (RTX 4090) against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M
on 2026-05-17. Result: 47/48 decoder layers cos ≥ 0.99 (PASS). One
layer (L47, the final decoder layer) sits at cos=0.961236 — 3σ below
the L40-L46 cluster (~0.998). Full 48-layer cos vector logged in
GitHub comment on #1583 (issuecomment-4470195446).

The 7 originally-cited problem layers (L7/L9/L12/L20/L23/L29/L46,
v1.7.0 amendment lines 41-45) ALL lifted above 0.99 — PR-2 was a
real win. L47 was previously undetected because no per-layer
falsifier existed in-tree; PR-1 of this cascade (#1713) closed that
gap and surfaced the L47 anomaly.

WHAT FLIPS:

  metadata.version 1.7.0 → 1.7.1
  bottom-of-file version: "1.7.0" → "1.7.1"
  bottom-of-file status comment refreshed:
    "1.x cascade DISCHARGED — wgpu (2) + throughput (3) PENDING"
    → "47/48 layers cos≥0.99 post-PR #1737; L47 single-layer cascade PENDING"

  AC_GPU_MOE_001 stage status text refresh (text-only — not yet
  refactored into a new amendment_history entry since this PR is
  scoped to the v1.7.1 amendment block only).

WHAT STAYS PENDING:

  - L47 single-layer cascade — root cause unknown. Three candidate
    hypotheses captured in the v1.7.1 amendment block (qtype mismatch,
    MoE expert distribution, stride/shape boundary). Forthcoming PR-3c
    surfaces §85 (or next-available section) covering the L47 cascade.
    Forthcoming PR-3d+: per-tensor histogram on L47 before authoring
    fix.
  - M-GPU-MOE-2 (wgpu fallback) — unchanged
  - M-GPU-MOE-3 PR-4 throughput — unchanged

YAML-ONLY:

  Production hot paths byte-unchanged. Additive-purity invariant
  pinned in v1.1.0 still holds. Contract validates via:
    cargo run -p aprender-contracts-cli --bin pv -- \
      validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s), Contract is valid.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN

Terminal amendment for the M-GPU-MOE-3 PR-3 sub-cascade.

After v1.7.1 surfaced L47 as a single-layer cliff (cos=0.961236 post
fp64 q6k_gemv acc, PR-2 #1737), the cascade ran a 5-step falsifier
sequence (PRs #1737, #1739-1745 + 4 #1583 comments) to pin the root
cause and verify user-visible impact.

OUTCOME

  PR-3   ✅ 47/48 layers cos ≥ 0.99, L47 alone at 0.961236
  PR-3d  ❌ H(i) qtype-mismatch FALSIFIED
  PR-3e  ✅ #1741 — L47 first divergent router (cos 0.9926)
  PR-3e2 ✅ #1743 — H(ii) CONFIRMED, 2-of-8 expert swap at L47
  PR-3f1 ❌ fp64 gate softmax FALSIFIED — drift upstream
  PR-3f2 ❌ f64 weighted-sum FALSIFIED — drift upstream
  PR-3g  ✅ #1745 — multi-prompt argmax: 3/4 agree, 1/4 disagrees
                    → L47 NOT BENIGN (~25% prompt-dependent impact)

ROOT CAUSE (by elimination)

  Per-expert SwiGLU f32 intermediates:
    1. gate_proj @ hidden   ← fp64 acc thanks to PR-2 ✅
    2. silu(gate)           ← f32 ✗
    3. silu(gate) × up_proj ← f32 multiply on 8192-element vector ✗
    4. down_proj @ above    ← fp64 acc thanks to PR-2 ✅

  Fix scope = PR-3h: promote silu × up multiply + intermediate state
  to f64 in both expert_swiglu_quantized (CPU, simple) and
  expert_swiglu_cuda (GPU, requires unfusing/refusing the SwiGLU
  kernel). Multi-week kernel work.

STATUS FLIPS

  metadata.version:  1.7.1 → 1.7.2
  metadata.status:   ACTIVE_ALGORITHM_LEVEL (unchanged)
  AC_GPU_MOE_001:    47/48 layers ALGORITHM_LEVEL_DISCHARGED + L47
                     KNOWN_DIVERGENCE_NOT_BENIGN

WHAT STAYS PENDING

  - PR-3h fp64 per-expert SwiGLU (multi-week)
  - M-GPU-MOE-2 wgpu fallback (#1582)
  - M-GPU-MOE-3 PR-4 throughput (independent of L47 fix; unblocked
    by this amendment)

WHY NOT KNOWN_BUG

  L47 is a numerical-precision artifact, not a correctness bug. CPU
  and GPU follow the same algorithm against the same weights; only
  the order of f32 accumulation inside the per-expert SwiGLU differs.
  Both pick legitimate top-8 sets at L47 — neither is wrong — but
  the small score-perturbation crosses a top-k boundary. Same class
  as gemv reduction-order variance, one call-stack level higher.

REGRESSION GATE FOR PR-3h

  - falsify_qw3_moe_l47_router_indices (#1743): expect CPU L47 sorted
    top-8 == GPU L47 sorted top-8
  - falsify_qw3_moe_gpu_argmax_agreement (#1745): expect 4/4 prompts
    argmax agreement

YAML-ONLY

  Production hot paths byte-unchanged. Additive-purity invariant pinned
  in v1.1.0 still holds. Contract validates via:
    cargo run -p aprender-contracts-cli --bin pv -- \
      validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s), Contract is valid.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 17, 2026 11:29
@noahgift noahgift merged commit cbe22d0 into main May 17, 2026
11 checks passed
@noahgift noahgift deleted the docs/m-gpu-moe-3-pr3-final-contract-v1.7.2 branch May 17, 2026 11:55
noahgift added a commit that referenced this pull request May 17, 2026
…rward_qwen3_moe (#1749) (#1751)

Closes #1749. Pre-fix, `apr bench` against any MoE GGUF
(Qwen3-Coder-30B-A3B-Instruct etc.) routed through the dense
`forward_single_with_cache` path which calls `matmul_fused.rs:211`
on tensor names that don't exist on MoE models (the 3D `*_exps`
tensors are stored at different names than the 2D dense
`ffn_{gate,up,down}.weight` the dense path looks up). Result:
hundreds of parallel thread panics — `index out of bounds: len=0 but
index ≈ 91M`.

This PR adds MoE detection via `gguf.expert_count().is_some()` and
routes to the MoE-aware forward path:

  CPU:  realizar::gguf::OwnedQuantizedModel::forward_qwen3_moe
  CUDA: realizar::gguf::OwnedQuantizedModelCuda::forward_qwen3_moe_cuda
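
The routing decision described above can be sketched as a small pure function. The enum, the function name, and the `Option<usize>` type for the expert count are assumptions made for illustration; only the `expert_count().is_some()` test comes from this commit.

```rust
/// Which forward path the bench should take (hypothetical enum).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ForwardPath {
    Dense,
    MoeCpu,
    MoeCuda,
}

/// Mirrors the described check: a present expert count marks the model as MoE.
/// `expert_count` stands in for the value returned by `gguf.expert_count()`.
fn select_forward_path(expert_count: Option<usize>, use_cuda: bool) -> ForwardPath {
    match (expert_count.is_some(), use_cuda) {
        (false, _) => ForwardPath::Dense,
        (true, false) => ForwardPath::MoeCpu,
        (true, true) => ForwardPath::MoeCuda,
    }
}
```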

Neither helper currently exposes a KV cache, so the bench runs
them **autoregressively with re-prefill**: each iteration runs a full
forward pass over `prompt + previously-generated tokens` and appends the
argmax to the prompt for the next iteration. This is O(N²) in N tokens
but bounded by `--max-tokens` (default 32).
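
A minimal sketch of that loop follows, with the model abstracted as a closure from the full token sequence to next-token logits. The function name and closure signature are hypothetical and do not match the real `forward_qwen3_moe` signatures; the second return value counts token positions processed, which is where the quadratic cost shows up.

```rust
/// Greedy decoding without a KV cache: every step re-runs the forward pass
/// over the whole sequence so far. Returns the final token sequence and the
/// total number of token positions pushed through `forward`.
fn greedy_reprefill<F>(prompt: &[u32], max_tokens: usize, mut forward: F) -> (Vec<u32>, usize)
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut tokens = prompt.to_vec();
    let mut positions_processed = 0usize;
    for _ in 0..max_tokens {
        // Full re-prefill: cost grows with the current sequence length.
        let logits = forward(&tokens);
        positions_processed += tokens.len();
        // Greedy argmax over the returned logits.
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i as u32)
            .expect("forward must return at least one logit");
        tokens.push(next);
    }
    (tokens, positions_processed)
}
```

For a prompt of length P and N generated tokens, the loop processes N·P + N(N−1)/2 positions, compared with P + N positions when a cache is available.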

This is intentionally a stop-gap to unblock M-GPU-MOE-3 PR-4
throughput measurement. True KV-cache MoE decoding is the actual
PR-4 work; this PR makes `apr bench` produce a real (if pessimistic)
tok/s number for MoE GGUFs instead of panicking.

## Empirical (lambda-vector RTX 4090, 2026-05-17)

  apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
    --max-tokens 8 --warmup 1 --iterations 4 --json

  → total_time_ms: 87085 ; total_tokens: 4
  → 0.046 tok/s effective (auto-regressive re-prefill cost dominates)
  → 22.8s ttft, 22.2s p50, 32.5s p99

The 0.046 tok/s is the upper bound on what `apr bench` can currently
measure for MoE without KV cache. PR-4's job is to add the cache and
push this to ≥ 150 tok/s.
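
The effective rate quoted above follows directly from the two JSON fields. A small sketch of the arithmetic, using hypothetical helper names:

```rust
/// Effective throughput from the bench's total token count and wall time.
fn tokens_per_second(total_tokens: u64, total_time_ms: u64) -> f64 {
    total_tokens as f64 / (total_time_ms as f64 / 1000.0)
}

/// Ratio between a target rate and a measured rate.
fn required_speedup(target_tok_s: f64, measured_tok_s: f64) -> f64 {
    target_tok_s / measured_tok_s
}
```

With `total_tokens = 4` and `total_time_ms = 87085`, `tokens_per_second` gives roughly 0.046 tok/s, so reaching 150 tok/s implies a speedup on the order of 3,000×.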

## What's in this PR

  crates/apr-cli/src/commands/bench_moe.rs (new):
    - `is_moe_gguf(&GGUFModel)` predicate
    - `run_gguf_moe_benchmark` — loads MappedGGUFModel + N
      Qwen3MoeQuantizedLayer descriptors + (optionally) wraps in
      OwnedQuantizedModelCuda, then dispatches to the CUDA or CPU
      bench helper.
    - `run_cuda_moe_benchmark` — autoregressive
      forward_qwen3_moe_cuda + greedy argmax decode.
    - `run_cpu_moe_benchmark` — autoregressive forward_qwen3_moe.

  crates/apr-cli/src/commands/bench.rs:
    + `include!("bench_moe.rs")` after the existing
      `include!("bench_safetensors.rs")` (same pattern as the other
      bench sub-files).

  crates/apr-cli/src/commands/benchmark.rs:
    + In `run_gguf_benchmark`, after parsing the GGUF and tokenising
      the prompt, check `is_moe_gguf(&gguf)`. If true, log the
      detection (`expert_count` + top-k) and tail-call
      `run_gguf_moe_benchmark`. Otherwise fall through to the
      existing dense path.

## What's NOT in this PR

  - True KV-cache MoE decoding (= M-GPU-MOE-3 PR-4 throughput target)
  - Streaming/per-token JSON output for MoE (existing JSON output
    works; just reflects the autoregressive re-prefill cost)
  - MoE bench for SafeTensors / APR formats (only GGUF MoE supported
    today; the other formats don't have MoE production paths in the
    realizar inference engine)

## Cross-refs

- #1583 M-GPU-MOE-3 — PR-4 throughput unblocks on this
- #1747 contract qwen3-moe-forward-gpu-v1 v1.7.2 (just merged) —
  L47 known-divergence + cascade pause point
- The MoE bench helpers reuse `forward_qwen3_moe[_cuda]` directly,
  which means PR-2 #1737's fp64 q6k_gemv acc is in effect; this
  bench measures *post-fp64-acc* throughput, not the pre-fix path.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>