feat(test): M-GPU-MOE-3 PR-1 — per-layer CPU vs GPU MoE FFN out cosine falsifier (refs #1583) #1713

Merged
noahgift merged 1 commit into `main` from `docs/m-gpu-moe-3-scope-note` on May 16, 2026

@noahgift

Summary

PR-1 of the M-GPU-MOE-3 cascade per issue #1583. Adds the per-layer
cosine falsifier that makes "L7 cos < 0.99" reproducible against any
commit on `main`. Does NOT fix the divergence — that's PR-2 in
`../trueno` (fp64 accumulator in `Q6KGemvKernel`). This PR is the
regression gate.

What this PR adds

`crates/aprender-serve/tests/qwen3_moe_per_layer_gpu_parity.rs`

New `#[ignore]`, `#[cfg(feature = "cuda")]` test
`falsify_qw3_moe_per_layer_001_cosine_per_layer`:

  • Two `SaveTensorPlan`s capturing `MoeFfnOut` for all 48 layers,
    one per backend (CPU LAZY-FUSED-MATVEC vs GPU `q6k_gemv`).
  • Runs:
    • `OwnedQuantizedModel::forward_qwen3_moe_traced_with_plan` (CPU)
    • `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced_with_plan` (GPU,
      already on main as M-MOE-SUB-2 step (b))
  • 1-token prompt (`[785]`) to bound runtime to ~5 min on RTX 4090.
  • Reads back the `APRT`-magic stage files, computes per-layer cosine
    in f64 to avoid test-side rounding noise polluting the signal.
  • Prints the full 48-element cos vector (with markers on layers below
    0.99) regardless of pass/fail.
  • Asserts every layer ≥ 0.99.
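
The per-layer comparison described in the list above can be sketched as follows. This is a minimal illustration, not the test's actual code: `cosine_f64` and `layers_below` are hypothetical names, and reading the `APRT` stage files is omitted. It shows the f32 activations being promoted to f64 before the dot product and norms are accumulated, so the metric itself should add no meaningful rounding noise.

```rust
/// Cosine similarity of two f32 activation vectors, accumulated in f64.
/// Hypothetical helper for illustration; not taken from the test source.
fn cosine_f64(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        // Promote each element before multiplying so products are exact in f64.
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None; // cosine is undefined for a zero vector
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// Indices of layers whose cosine falls below `threshold` (0.99 in this PR).
fn layers_below(cos: &[f64], threshold: f64) -> Vec<usize> {
    cos.iter()
        .enumerate()
        .filter(|&(_, &c)| c < threshold)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Toy vectors standing in for one layer's CPU and GPU `MoeFfnOut`.
    let cpu = [1.0f32, 2.0, 3.0];
    let gpu = [1.0f32, 2.0, 3.5];
    let c = cosine_f64(&cpu, &gpu).expect("equal, non-zero inputs");
    println!("cos = {c:.6}");
    println!("below 0.99: {:?}", layers_below(&[0.9999, 0.961236, 0.998], 0.99));
}
```

Returning `Option` keeps mismatched lengths and zero vectors from silently producing `NaN`, which would otherwise compare as "not below 0.99" and pass the gate.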

`docs/specifications/aprender-gpu/m-gpu-moe-3-scope.md`

Cascade scope doc documenting:

  • The reduction-order divergence root cause (CPU SIMD 4-acc rayon
    row-parallel vs GPU 32-thread interleaved warp-shuffle).
  • Ranked fix space: fp64 acc → Kahan → contiguous chunking → 4-acc-on-GPU.
  • 5-PR cascade map.
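
To make the reduction-order root cause concrete, here is a small CPU-only sketch. It does not reproduce the AVX2 or warp-shuffle code paths; the function names and the input vector are assumptions chosen so the effect is visible with four elements. It sums the same f32 values in two different orders, first with f32 accumulators and then with f64 accumulators.

```rust
/// Left-to-right sum with a single f32 accumulator.
fn sum_sequential_f32(v: &[f32]) -> f32 {
    v.iter().fold(0.0f32, |acc, &x| acc + x)
}

/// Strided sum: lane `l` adds elements l, l + lanes, ..., then lanes are combined.
fn sum_strided_f32(v: &[f32], lanes: usize) -> f32 {
    assert!(lanes > 0);
    let mut partial = vec![0.0f32; lanes];
    for (i, &x) in v.iter().enumerate() {
        partial[i % lanes] += x;
    }
    partial.iter().fold(0.0f32, |acc, &p| acc + p)
}

/// Same left-to-right order, but with an f64 accumulator.
fn sum_sequential_f64(v: &[f32]) -> f64 {
    v.iter().map(|&x| f64::from(x)).sum()
}

/// Same strided order, but with f64 lane accumulators.
fn sum_strided_f64(v: &[f32], lanes: usize) -> f64 {
    assert!(lanes > 0);
    let mut partial = vec![0.0f64; lanes];
    for (i, &x) in v.iter().enumerate() {
        partial[i % lanes] += f64::from(x);
    }
    partial.iter().sum()
}

fn main() {
    // 1e8 + 1 is not representable in f32, so the order of additions matters.
    let v = [1.0e8f32, 1.0, -1.0e8, 1.0];
    println!("f32 sequential: {}", sum_sequential_f32(&v));
    println!("f32 strided:    {}", sum_strided_f32(&v, 2));
    println!("f64 sequential: {}", sum_sequential_f64(&v));
    println!("f64 strided:    {}", sum_strided_f64(&v, 2));
}
```

With this input the single-accumulator f32 sum returns 1.0 and the two-lane f32 sum returns 2.0, while both f64 variants return 2.0. That is the motivation for ranking the fp64 accumulator first in the fix space: it narrows the gap without forcing both backends onto one reduction order.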

How to run

```
cargo test --release --features cuda \
  -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
  -- --ignored --nocapture
```

Requires cached `Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf` at one of
the canonical paths. Skipped on non-CUDA hosts.

Expected outcome

The test FAILS on current main per the issue body's claim of 7-8
layers with cos ∈ [0.94, 0.987]. The diagnostic output identifies the
actual divergent layers — the issue body's enumeration ("L7, L9, L12,
L20, L23, L29, L46") becomes a verified-on-main baseline or gets
corrected.

Once PR-2 (fp64 acc) lands upstream in `../trueno`, re-running this
test should show all 48 layers ≥ 0.99 and the `qwen3-moe-forward-gpu-v1`
contract can flip v1.7.0 → v1.8.0 ACTIVE_RUNTIME (PR-5 of the cascade).

What this PR is NOT

  • Does NOT fix the divergence (PR-2 in `../trueno`).
  • Does NOT touch any production forward path —
    `forward_qwen3_moe` (CPU) and `forward_qwen3_moe_cuda` (GPU) are
    byte-for-byte unchanged.
  • Uses existing `SaveTensorPlan` plumbing — no new fields on
    `LayerActivation` or `ForwardTrace`.

Test verified

  • `cargo check -p aprender-serve --features cuda` clean.
  • `cargo test --no-run --features cuda --test qwen3_moe_per_layer_gpu_parity`
    builds successfully.
  • The `--ignored` test cannot run in CI (requires 17 GB GGUF +
    RTX 4090); same convention as
    `qwen3_moe_gpu_parity.rs::falsify_qw3_moe_gpu_parity_001_cosine_vs_cpu`.

🤖 Generated with Claude Code

…e falsifier (refs #1583)

Adds the first PR of the M-GPU-MOE-3 cascade per issue #1583
("throughput ≥150 tok/s on RTX 4090 + fp-accumulator-order alignment").

## What this PR adds

1. `crates/aprender-serve/tests/qwen3_moe_per_layer_gpu_parity.rs`:
   new `#[ignore]`, `#[cfg(feature = "cuda")]` test
   `falsify_qw3_moe_per_layer_001_cosine_per_layer`.

   Mechanism:
   - Two `SaveTensorPlan`s capturing `MoeFfnOut` for all 48 layers,
     one per backend.
   - `OwnedQuantizedModel::forward_qwen3_moe_traced_with_plan` (CPU
     LAZY-FUSED-MATVEC, `fused_q6k_parallel_matvec` + AVX2 4-way acc).
   - `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced_with_plan`
     (GPU `q6k_gemv` warp-shuffle reduction, M-MOE-SUB-2 step (b)
     already on main).
   - 1-token prompt (`[785]`) to bound runtime to ~5 min on RTX 4090.
   - Reads back the `APRT`-magic stage files, computes per-layer cos
     in f64 to avoid test-side rounding noise.
   - Prints the full 48-element cos vector (including which layers
     drop below 0.99) regardless of pass/fail.
   - Asserts every layer ≥ 0.99.

2. `docs/specifications/aprender-gpu/m-gpu-moe-3-scope.md`:
   cascade scope doc + map. Documents:
   - The reduction-order divergence root cause (CPU SIMD 4-acc rayon
     row-parallel vs. GPU 32-thread interleaved-position warp-shuffle).
   - Ranked fix space: fp64 acc (PR-2) → Kahan → contiguous chunking
     → 4-acc-on-GPU.
   - PR sequence: PR-1 falsifier (this PR), PR-2 fp64 acc in trueno,
     PR-3 chunking (if needed), PR-4 throughput, PR-5 contract
     v1.7.0 → v1.8.0 ACTIVE_RUNTIME.
   - Why upstream `forward_qwen3_moe_cuda_traced` already exists
     (M-MOE-SUB-2 step (b), `contracts/trace-moe-gpu-sub-stages-v1.yaml`).

## How to run

```
cargo test --release --features cuda \
  -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
  -- --ignored --nocapture
```

Requires cached `Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf` at one of
the canonical paths in `CANONICAL_QWEN3_CODER_GGUF_PATHS` (lambda-labs
host typically has it at `/home/noah/models/`).

## Expected outcome on main

The test FAILS on current main per the issue body's claim of 7-8
divergent layers in [0.94, 0.987]. Once PR-2 (fp64 acc in
`Q6KGemvKernel`) lands upstream in `../trueno`, re-running this test
should show all 48 layers ≥ 0.99 and the M-GPU-MOE-3 cascade can flip
the `qwen3-moe-forward-gpu-v1` contract to ACTIVE_RUNTIME.

## What this PR is NOT

- It does NOT fix the divergence (that's PR-2 in `../trueno`).
- It does NOT touch any production forward path. `forward_qwen3_moe`
  (CPU) and `forward_qwen3_moe_cuda` (GPU) are unchanged byte-for-byte.
- It uses the existing `SaveTensorPlan` plumbing (no new struct fields
  on `LayerActivation` or `ForwardTrace`).

## Test verified

- `cargo check -p aprender-serve --features cuda` clean.
- `cargo test --no-run --features cuda --test qwen3_moe_per_layer_gpu_parity`
  builds successfully.
- The `--ignored` test cannot be run in CI (requires 17 GB GGUF +
  RTX 4090); follows the same convention as
  `qwen3_moe_gpu_parity.rs::falsify_qw3_moe_gpu_parity_001_cosine_vs_cpu`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 16, 2026 13:48
@noahgift noahgift merged commit e5514a5 into main May 16, 2026
11 checks passed
@noahgift noahgift deleted the docs/m-gpu-moe-3-scope-note branch May 16, 2026 14:07
noahgift added a commit that referenced this pull request May 17, 2026
) (#1737)

Closes 7-layer MoE cosine regression (L7/L9/L12/L20/L23/L29/L46 at
0.94–0.987 vs CPU fused_q6k_parallel_matvec) by promoting the per-lane
accumulator path in q6k_gemv to f64, mirroring the GH-561 pattern
already shipped in nf4_rmsnorm_gemv and the NF4 GEMM kernel.

Kernel changes (`crates/aprender-gpu/src/kernels/quantize/q6k/gemv.rs`):
- `acc` and `thread_partial` are now `mov_f64_imm_zero` (was `mov_f32_imm`)
- Inner 8-FMA-per-thread loop uses `fma_f64_acc_inplace`: x_val + dequant
  (both f32) are promoted to f64 inside the helper; thread_partial stays
  f64 across iterations
- Per-superblock `add_f64_inplace(acc, thread_partial)` (new helper)
- Final `cvt.rn.f32.f64` just before the warp-reduce — the 5
  shfl-down-add reductions stay f32 (shfl.sync.down.b32 is the only
  primitive we expose) and the final `st.global.f32` is unchanged
- Cost: ~16K extra cvt.rn.f64.f32 + ~64 f64 adds per output element;
  amortised over the ~64 superblocks of FMAs the precision win
  dominates
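
The accumulation scheme in the kernel changes above can be emulated on the CPU as a rough sketch. This is not the PTX builder code: `warp_dot_f64_partials` is a hypothetical name, the Q6_K dequantisation and superblock structure are left out, and the lane assignment is a simple stride. It only illustrates the order of operations: per-lane f64 FMA accumulation, a single downcast to f32, then five f32 shuffle-down-add steps.

```rust
const LANES: usize = 32;

/// CPU emulation of one warp computing a dot product with f64 per-lane
/// partials and an f32 tree reduction. Illustrative only.
fn warp_dot_f64_partials(x: &[f32], w: &[f32]) -> f32 {
    assert_eq!(x.len(), w.len());

    // Each lane accumulates its strided share in f64 (the "fma.f64" part).
    let mut partial = [0.0f64; LANES];
    for (i, (&xv, &wv)) in x.iter().zip(w).enumerate() {
        let lane = i % LANES;
        partial[lane] = f64::from(xv).mul_add(f64::from(wv), partial[lane]);
    }

    // One downcast per lane just before the reduction (the "cvt.rn.f32.f64" part).
    let mut lane_f32 = [0.0f32; LANES];
    for (dst, &src) in lane_f32.iter_mut().zip(partial.iter()) {
        *dst = src as f32;
    }

    // Five shuffle-down-add steps in f32: offsets 16, 8, 4, 2, 1.
    let mut offset = LANES / 2;
    while offset > 0 {
        for lane in 0..offset {
            lane_f32[lane] += lane_f32[lane + offset];
        }
        offset /= 2;
    }
    lane_f32[0]
}

fn main() {
    let x = vec![1.0f32; 64];
    let w = vec![2.0f32; 64];
    println!("dot = {}", warp_dot_f64_partials(&x, &w));
}
```

Because the final reduction stays in f32, the result is still not bit-identical to an all-f64 sum; the sketch only shows why the long inner accumulation, where most rounding error would build up, is the part that benefits from f64.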

Builder addition (`crates/aprender-gpu/src/ptx/builder/inplace_ops.rs`):
- New `add_f64_inplace(dst, src)` — `dst += src`, both f64, round-to-nearest.
  Pairs with the existing `mov_f64_imm_zero`, `fma_f64_acc_inplace`, and
  `cvt_f32_f64_rn` GH-561 helpers.

Falsification test:
- `falsify_m_gpu_moe_3_pr2_kernel_emits_fp64_accumulators` asserts the
  emitted PTX contains `fma.f64`, `add.f64`, `mov.f64`/`mov.b64`, the
  `cvt.rn.f32.f64` downcast, the unchanged `shfl.sync.down.b32`
  warp-reduce, and the final `st.global.f32`. If this drifts back to
  all-`.f32` the 7-layer regression returns.
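
A string-level emit assertion of this kind might look like the sketch below. The opcode strings are the ones named in the description above; the `REQUIRED` table, the `missing_opcodes` helper, and the sample strings in `main` are hypothetical and are not real kernel output or the in-tree test.

```rust
/// Opcode groups the emitted PTX is expected to contain. A group passes if
/// any one of its alternatives is present (e.g. `mov.f64` or `mov.b64`).
const REQUIRED: &[&[&str]] = &[
    &["fma.f64"],
    &["add.f64"],
    &["mov.f64", "mov.b64"],
    &["cvt.rn.f32.f64"],
    &["shfl.sync.down.b32"],
    &["st.global.f32"],
];

/// Returns the first alternative of every group that is absent from `ptx`.
fn missing_opcodes(ptx: &str) -> Vec<&'static str> {
    REQUIRED
        .iter()
        .filter(|alts| !alts.iter().any(|op| ptx.contains(*op)))
        .map(|alts| alts[0])
        .collect()
}

fn main() {
    // Stand-in text, not real PTX: just enough tokens to exercise the check.
    let fp64_like = "mov.b64 fma.f64 add.f64 cvt.rn.f32.f64 shfl.sync.down.b32 st.global.f32";
    let all_f32 = "mov.f32 fma.rn.f32 shfl.sync.down.b32 st.global.f32";
    println!("fp64-like missing: {:?}", missing_opcodes(fp64_like));
    println!("all-f32 missing:   {:?}", missing_opcodes(all_f32));
}
```

Reporting the list of missing groups, rather than a bare boolean, makes a failure message say which part of the fp64 path regressed.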

Test status (host without CUDA driver):
- 82/82 q6k unit tests pass (81 existing + this new emit assertion)
- 0 q6k regressions
- Broader cargo-test failures are pre-existing CUDA-driver tests
  (cublas/cuda_graph/driver/memory_fuzz) requiring a live GPU — not
  touched by this PR
- Cosine measurement on the 7 problem MoE layers is the next
  cascade-step (PR-3) and runs on lambda-labs / gx10

References:
- GH-561 (fp32 accumulator order divergence — same root cause class)
- #1583 (M-GPU-MOE-3 umbrella)
- PR-1 of this cascade was #1713 (per-layer CPU vs GPU MoE FFN out
  cosine falsifier) — this PR ships the kernel fix that falsifier was
  authored to validate

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 17, 2026
…-3 PR-2 verified, L47 surfaced (#1739)

Hardware-verification amendment after M-GPU-MOE-3 PR-2 landed on main
(#1737, 88ce47f — q6k_gemv fp64 accumulators).

PR-3 ran the per-layer FALSIFY-QW3-MOE-PER-LAYER-001 falsifier on
lambda-vector (RTX 4090) against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M
on 2026-05-17. Result: 47/48 decoder layers cos ≥ 0.99 (PASS). One
layer (L47, the final decoder layer) sits at cos=0.961236 — 3σ below
the L40-L46 cluster (~0.998). Full 48-layer cos vector logged in
GitHub comment on #1583 (issuecomment-4470195446).

The 7 originally-cited problem layers (L7/L9/L12/L20/L23/L29/L46,
v1.7.0 amendment lines 41-45) ALL lifted above 0.99 — PR-2 was a
real win. L47 was previously undetected because no per-layer
falsifier existed in-tree; PR-1 of this cascade (#1713) closed that
gap and surfaced the L47 anomaly.

WHAT FLIPS:

  metadata.version 1.7.0 → 1.7.1
  bottom-of-file version: "1.7.0" → "1.7.1"
  bottom-of-file status comment refreshed:
    "1.x cascade DISCHARGED — wgpu (2) + throughput (3) PENDING"
    → "47/48 layers cos≥0.99 post-PR #1737; L47 single-layer cascade PENDING"

  AC_GPU_MOE_001 stage status text refresh (text-only — not yet
  refactored into a new amendment_history entry since this PR is
  scoped to the v1.7.1 amendment block only).

WHAT STAYS PENDING:

  - L47 single-layer cascade — root cause unknown. Three candidate
    hypotheses captured in the v1.7.1 amendment block (qtype mismatch,
    MoE expert distribution, stride/shape boundary). Forthcoming PR-3c
    surfaces §85 (or next-available section) covering the L47 cascade.
    Forthcoming PR-3d+: per-tensor histogram on L47 before authoring
    fix.
  - M-GPU-MOE-2 (wgpu fallback) — unchanged
  - M-GPU-MOE-3 PR-4 throughput — unchanged

YAML-ONLY:

  Production hot paths byte-unchanged. Additive-purity invariant
  pinned in v1.1.0 still holds. Contract validates via:
    cargo run -p aprender-contracts-cli --bin pv -- \
      validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s), Contract is valid.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 17, 2026
…e surfaced (#1583) (#1740)

PR-3c of the M-GPU-MOE-3 cascade. Updates m-gpu-moe-3-scope.md with
the actual landed state and the new L47 single-layer sub-cascade.

WHAT CHANGED

  Cascade table now reflects shipped PRs:
    PR-1 ✅ shipped (#1713) — per-layer cos falsifier
    PR-2 ✅ shipped (#1737) — fp64 accumulators in Q6KGemvKernel.
                              Note: in-tree at
                              crates/aprender-gpu/src/kernels/quantize/q6k/gemv.rs
                              (the original "../trueno" reference was
                              stale after the monorepo consolidation
                              subsumed trueno-gpu).
    PR-3  ✅ ran (manual hardware verification on lambda-vector RTX 4090,
                  2026-05-17) — 47/48 layers cos ≥ 0.99, L47 alone at
                  cos=0.961236. Evidence in #1583 comment-4470195446.
    PR-3b ✅ shipped (#1739) — contract v1.7.0 → v1.7.1.
    PR-3c ✅ this update.
    PR-3d ✅ ran — H(i) qtype-mismatch FALSIFIED. apr tensors shows
                   L0, L46, L47 have identical shapes + qtypes.
                   Evidence in #1583 comment-4470216021.

  New sub-cascade for L47:
    PR-3e — pending: routing-divergence falsifier for H(ii).
            Hypothesis: per-layer cosine is ACCUMULATED drift, not
            per-kernel divergence. By L47 the CPU-vs-GPU hidden state
            has drifted by ~0.002. If that drift straddles a top-k
            expert boundary at L47, CPU and GPU pick different expert
            sets and the FFN output diverges by O(1) — matching the
            0.961 cliff. The falsifier extends SaveTensorStage::MoeRouter
            (or adds a sibling stage) to persist top-k EXPERT INDICES
            alongside the weights, then asserts CPU index set == GPU
            index set at L47.
    PR-3f+ — pending: L47 fix based on PR-3e outcome.
             - If H(ii) confirmed: deterministic tie-breaking in
               expert ordering OR fp64 MoE gate softmax OR f64 expert
               selection with f32 post-conversion.
             - If H(ii) dead: per-expert weight cancellation pathology
               investigation (capture FfnGate + FfnUp + FfnSwigl at L47).

  Parallel work:
    PR-4 (throughput ≥150 tok/s + VRAM ≤95%) — independent of L47
                                                 sub-cascade.
    PR-5 (contract v1.7.1 → v1.8.0 ACTIVE_RUNTIME) — gates on PR-3f+
                                                     AND PR-4.

REPRODUCTION

  cargo test --release --features cuda \
    -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
    -- --ignored --nocapture

  27.92s on RTX 4090. Test source:
  crates/aprender-serve/tests/qwen3_moe_per_layer_gpu_parity.rs

DOC-ONLY PR

  No code changes. Production hot paths byte-unchanged.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 17, 2026
…GN (#1583)

PR-3g of the M-GPU-MOE-3 cascade. Adds the canonical "is L47 actually
user-visible" probe: it runs 4 canonical prompts through both CPU and
GPU full forwards and reports argmax agreement (it prints a verdict
rather than hard-asserting).

## Result (lambda-vector RTX 4090, 2026-05-17)

  PROMPT             | CPU argmax (val)     | GPU argmax (val)
  canonical_3tok     |    944 ( 13.7270)  |    944 ( 14.4133)  ✓
  single_tok_785     |    220 ( 15.5523)  |     25 ( 18.5098)  ✗ MISMATCH
  multi_tok_short    |    315 ( 26.2279)  |    315 ( 25.5230)  ✓
  multi_tok_code     |    198 ( 17.7453)  |    198 ( 17.8433)  ✓

**3/4 prompts agree, 1 disagrees.** L47 cliff is NOT benign — the
expert-set divergence DOES flip the top-1 predicted token for some
prompts (~25% in this small sample). Option E (Accept) is off the
table; must pursue Option C (fp64 in per-expert SwiGLU).
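
The agreement check behind the table above reduces to comparing the top-1 index of two logit vectors. A minimal sketch follows, assuming hypothetical helper names (`argmax`, `argmax_agrees`) and toy logits rather than the real vocabulary-sized outputs.

```rust
/// Index and value of the largest logit, or `None` for an empty slice.
/// `total_cmp` gives a total order, so a NaN cannot cause a panic here
/// (a positive NaN would sort above every finite value).
fn argmax(logits: &[f32]) -> Option<(usize, f32)> {
    logits
        .iter()
        .copied()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))
}

/// True when both backends pick the same top-1 token id.
fn argmax_agrees(cpu_logits: &[f32], gpu_logits: &[f32]) -> bool {
    match (argmax(cpu_logits), argmax(gpu_logits)) {
        (Some((c, _)), Some((g, _))) => c == g,
        _ => false,
    }
}

fn main() {
    // Toy logits: values differ slightly between backends, winner is the same.
    let cpu = [0.10f32, 13.727, 2.0];
    let gpu = [0.12f32, 14.413, 2.1];
    println!("cpu argmax = {:?}", argmax(&cpu));
    println!("gpu argmax = {:?}", argmax(&gpu));
    println!("agree = {}", argmax_agrees(&cpu, &gpu));
}
```

Comparing indices rather than values matters here: in the table the agreeing prompts still differ in the winning logit's value, and only the index decides which token is emitted under greedy decoding.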

## What this PR adds

  crates/aprender-serve/tests/qwen3_moe_gpu_parity.rs:
    + new test `falsify_qw3_moe_gpu_argmax_agreement` — multi-prompt
      probe that builds CPU + GPU models once, runs 4 canonical prompts
      through both full forwards, and prints argmax agreement table +
      verdict. PROBE not hard-assert; prints "BENIGN" if all agree or
      "NOT BENIGN" + disagreeing prompts otherwise.

## Cascade context

- PR-1 #1713 ✅ per-layer cos falsifier
- PR-2 #1737 ✅ q6k_gemv fp64 accumulators
- PR-3   ✅ hardware verify — 47/48 PASS, L47 surfaces
- PR-3b #1739 ✅ contract v1.7.0 → v1.7.1
- PR-3c #1740 ✅ scope-doc + L47 sub-cascade
- PR-3d   ✅ H(i) qtype-mismatch FALSIFIED
- PR-3e #1741 ✅ router-weight probe
- PR-3e2 #1743 ✅ H(ii) CONFIRMED (2-of-8 expert swap)
- PR-3f1 ❌ falsified (fp64 softmax) — dropped
- PR-3f2 ❌ falsified (f64 weighted-sum) — dropped
- PR-3g  ✅ **THIS PR** — L47 NOT BENIGN, must pursue fix
- PR-3h  pending — Option C fp64 in per-expert SwiGLU intermediates

## Why the cascade kept eliminating candidates

The 3-falsifier sequence ruled out the "easy" fix locations:

1. PR-3f1 (gate softmax precision) — drift upstream of softmax
2. PR-3f2 (weighted-sum precision) — drift upstream of weighted-sum
3. **Remaining**: drift inside each per-expert SwiGLU's intermediate
   chain (silu × up at f32, down-proj at f32 except its q6k_gemv acc
   which PR-2 already promoted to fp64)

PR-3h must promote the silu(gate) × up element-wise multiply and the
hidden-dim×4 intermediate state to f64. ~30-50 LOC across both CPU
and CUDA expert_swiglu helpers.

## Reproduction

  cargo test --release --features cuda \
    -p aprender-serve --test qwen3_moe_gpu_parity \
    falsify_qw3_moe_gpu_argmax_agreement \
    -- --ignored --nocapture

~25s on RTX 4090.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 17, 2026
…router weights diverge (#1583) (#1741)

PR-3e of the M-GPU-MOE-3 cascade. Adds `falsify_qw3_moe_l47_router_probe`
to disambiguate H(ii) routing-divergence from post-routing divergence
at L47.

## Result (lambda-vector RTX 4090, 2026-05-17)

  L## | MoeRouter | MoeFfnOut
  L00 | 1.000000  | 0.999999
  L01 | 1.000000  | 1.000000
  ...
  L46 | 1.000000  | 0.998498
  L47 | 0.992558  | 0.961236  <-- FfnOut BELOW 0.99

**Dispositive evidence:** L0..L46 router cos = 1.000000 (identical to
six decimal places between CPU and GPU). L47 router cos = 0.992558,
the FIRST and ONLY layer where router weights diverge. This is a sharp
transition, exactly the signature of an accumulated-drift threshold
being crossed at the L47 softmax/top-k boundary.

## Verdict

H(ii) routing-divergence is INCONCLUSIVE from weights alone (router
cos 0.992558 is between 0.99 and 0.995). The saved MoeRouter tensor
is `[k=8]` post-softmax+renormalize WEIGHTS in descending order — if
CPU and GPU pick DIFFERENT 8 experts, both vectors are still sorted
descending and cos can be near 1.0 even with disjoint sets. Indices
are required to definitively confirm or falsify SET divergence.
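
Why the weights alone cannot settle the question is easy to show with made-up numbers. In the sketch below the weight values and expert indices are invented for illustration (they are not measurements), and `cosine_f64` and `same_expert_set` are hypothetical helpers. Two backends select completely disjoint experts, yet the sorted weight vectors are identical, so the cosine is 1.

```rust
/// Cosine similarity accumulated in f64 (inputs assumed equal-length, non-zero).
fn cosine_f64(a: &[f32], b: &[f32]) -> f64 {
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Order-insensitive comparison of two top-k index lists.
fn same_expert_set(a: &[u32], b: &[u32]) -> bool {
    let (mut a, mut b) = (a.to_vec(), b.to_vec());
    a.sort_unstable();
    b.sort_unstable();
    a == b
}

fn main() {
    // Invented post-softmax weights, already sorted in descending order.
    let cpu_weights = [0.30f32, 0.20, 0.15, 0.10, 0.09, 0.07, 0.05, 0.04];
    let gpu_weights = cpu_weights;
    // Invented expert ids: the two selections share no expert at all.
    let cpu_experts = [0u32, 1, 2, 3, 4, 5, 6, 7];
    let gpu_experts = [8u32, 9, 10, 11, 12, 13, 14, 15];

    println!("weight cos = {:.6}", cosine_f64(&cpu_weights, &gpu_weights));
    println!("same set   = {}", same_expert_set(&cpu_experts, &gpu_experts));
}
```

A weight-only cosine therefore bounds divergence from one side only: a low value proves the routing differs, but a high value says nothing about which experts were chosen.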

PR-3e2 will add `SaveTensorStage::MoeRouterIndices` to persist the
top-k INDICES alongside the weights. Once indices are captured, the
test can assert CPU set == GPU set at L47 to lock in H(ii).

## What's in this PR

  crates/aprender-serve/tests/qwen3_moe_per_layer_gpu_parity.rs:
    + new helper `make_router_and_ffn_out_plan` capturing both
      `moe_router` and `moe_ffn_out` stages for all 48 layers
    + new test `falsify_qw3_moe_l47_router_probe` printing the
      per-layer router+ffn_out cos vector side-by-side and a
      verdict line classifying H(ii) status (FALSIFIED / ALIVE /
      INCONCLUSIVE)
    + same `#[ignore]` + `#![cfg(feature = "cuda")]` gates as the
      existing falsifier — runs only on RTX 4090 with the cached
      30B GGUF

This is a PROBE, not a hard-fail falsifier. The test prints the
verdict; it does not assert. The verdict drives the next PR's
investigation target.

## Cascade context

- PR-1 #1713 — per-layer cos falsifier
- PR-2 #1737 — q6k_gemv fp64 accumulators
- PR-3 hardware-verify (manual) — 47/48 PASS, L47 surfaces
- PR-3b #1739 — contract v1.7.0 → v1.7.1 (in flight)
- PR-3c #1740 — scope-doc update + L47 sub-cascade (in flight)
- PR-3d — H(i) qtype-mismatch FALSIFIED (#1583 comment-4470216021)
- PR-3e — **this PR** — H(ii) router-weight probe
- PR-3e2 — pending: capture top-k INDICES via new
              `SaveTensorStage::MoeRouterIndices` variant
- PR-3f+  — pending: L47 fix based on PR-3e/PR-3e2 outcome

## Reproduction

  cargo test --release --features cuda \
    -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
    falsify_qw3_moe_l47_router_probe \
    -- --ignored --nocapture

7s on RTX 4090 (after build).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 17, 2026
…-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN (#1747)

* docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.0 → v1.7.1 — M-GPU-MOE-3 PR-2 verified, L47 surfaced

Hardware-verification amendment after M-GPU-MOE-3 PR-2 landed on main
(#1737, 88ce47f — q6k_gemv fp64 accumulators).

PR-3 ran the per-layer FALSIFY-QW3-MOE-PER-LAYER-001 falsifier on
lambda-vector (RTX 4090) against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M
on 2026-05-17. Result: 47/48 decoder layers cos ≥ 0.99 (PASS). One
layer (L47, the final decoder layer) sits at cos=0.961236 — 3σ below
the L40-L46 cluster (~0.998). Full 48-layer cos vector logged in
GitHub comment on #1583 (issuecomment-4470195446).

The 7 originally-cited problem layers (L7/L9/L12/L20/L23/L29/L46,
v1.7.0 amendment lines 41-45) ALL lifted above 0.99 — PR-2 was a
real win. L47 was previously undetected because no per-layer
falsifier existed in-tree; PR-1 of this cascade (#1713) closed that
gap and surfaced the L47 anomaly.

WHAT FLIPS:

  metadata.version 1.7.0 → 1.7.1
  bottom-of-file version: "1.7.0" → "1.7.1"
  bottom-of-file status comment refreshed:
    "1.x cascade DISCHARGED — wgpu (2) + throughput (3) PENDING"
    → "47/48 layers cos≥0.99 post-PR #1737; L47 single-layer cascade PENDING"

  AC_GPU_MOE_001 stage status text refresh (text-only — not yet
  refactored into a new amendment_history entry since this PR is
  scoped to the v1.7.1 amendment block only).

WHAT STAYS PENDING:

  - L47 single-layer cascade — root cause unknown. Three candidate
    hypotheses captured in the v1.7.1 amendment block (qtype mismatch,
    MoE expert distribution, stride/shape boundary). Forthcoming PR-3c
    surfaces §85 (or next-available section) covering the L47 cascade.
    Forthcoming PR-3d+: per-tensor histogram on L47 before authoring
    fix.
  - M-GPU-MOE-2 (wgpu fallback) — unchanged
  - M-GPU-MOE-3 PR-4 throughput — unchanged

YAML-ONLY:

  Production hot paths byte-unchanged. Additive-purity invariant
  pinned in v1.1.0 still holds. Contract validates via:
    cargo run -p aprender-contracts-cli --bin pv -- \
      validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s), Contract is valid.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.1 → v1.7.2 — M-GPU-MOE-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN

Terminal amendment for the M-GPU-MOE-3 PR-3 sub-cascade.

After v1.7.1 surfaced L47 as a single-layer cliff (cos=0.961236 post
fp64 q6k_gemv acc, PR-2 #1737), the cascade ran a 5-step falsifier
sequence (PRs #1737, #1739-1745 + 4 #1583 comments) to pin the root
cause and verify user-visible impact.

OUTCOME

  PR-3   ✅ 47/48 layers cos ≥ 0.99, L47 alone at 0.961236
  PR-3d  ❌ H(i) qtype-mismatch FALSIFIED
  PR-3e  ✅ #1741 — L47 first divergent router (cos 0.9926)
  PR-3e2 ✅ #1743 — H(ii) CONFIRMED, 2-of-8 expert swap at L47
  PR-3f1 ❌ fp64 gate softmax FALSIFIED — drift upstream
  PR-3f2 ❌ f64 weighted-sum FALSIFIED — drift upstream
  PR-3g  ✅ #1745 — multi-prompt argmax: 3/4 agree, 1/4 disagrees
                    → L47 NOT BENIGN (~25% prompt-dependent impact)

ROOT CAUSE (by elimination)

  Per-expert SwiGLU f32 intermediates:
    1. gate_proj @ hidden   ← fp64 acc thanks to PR-2 ✅
    2. silu(gate)           ← f32 ✗
    3. silu(gate) × up_proj ← f32 multiply on 8192-element vector ✗
    4. down_proj @ above    ← fp64 acc thanks to PR-2 ✅

  Fix scope = PR-3h: promote silu × up multiply + intermediate state
  to f64 in both expert_swiglu_quantized (CPU, simple) and
  expert_swiglu_cuda (GPU, requires unfusing/refusing the SwiGLU
  kernel). Multi-week kernel work.
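
  Steps 2 and 3 of the chain above can be sketched on the CPU as follows. This is an illustration of the precision change only, under stated assumptions: `silu_mul_f32` and `silu_mul_f64` are hypothetical names, not the in-tree `expert_swiglu_quantized` or `expert_swiglu_cuda` helpers, and the projections around them are omitted.

```rust
/// silu(gate) * up with every intermediate held in f32.
fn silu_mul_f32(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter()
        .zip(up)
        .map(|(&g, &u)| (g / (1.0 + (-g).exp())) * u)
        .collect()
}

/// Same computation with inputs promoted to f64 and the result kept in f64,
/// so the following down-projection could consume unrounded intermediates.
fn silu_mul_f64(gate: &[f32], up: &[f32]) -> Vec<f64> {
    gate.iter()
        .zip(up)
        .map(|(&g, &u)| {
            let (g, u) = (f64::from(g), f64::from(u));
            (g / (1.0 + (-g).exp())) * u
        })
        .collect()
}

fn main() {
    let gate = [1.0f32, -2.5, 0.0, 7.25];
    let up = [2.0f32, 0.5, 3.0, -1.0];
    let lo = silu_mul_f32(&gate, &up);
    let hi = silu_mul_f64(&gate, &up);
    for (l, h) in lo.iter().zip(&hi) {
        println!("f32 = {l:.9}  f64 = {h:.15}  diff = {:e}", f64::from(*l) - h);
    }
}
```

  The per-element differences are tiny; the concern described above is that such differences, fed through the down-projection and accumulated over many layers, can move a router score across a top-k boundary.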

STATUS FLIPS

  metadata.version:  1.7.1 → 1.7.2
  metadata.status:   ACTIVE_ALGORITHM_LEVEL (unchanged)
  AC_GPU_MOE_001:    47/48 layers ALGORITHM_LEVEL_DISCHARGED + L47
                     KNOWN_DIVERGENCE_NOT_BENIGN

WHAT STAYS PENDING

  - PR-3h fp64 per-expert SwiGLU (multi-week)
  - M-GPU-MOE-2 wgpu fallback (#1582)
  - M-GPU-MOE-3 PR-4 throughput (independent of L47 fix; unblocked
    by this amendment)

WHY NOT KNOWN_BUG

  L47 is a numerical-precision artifact, not a correctness bug. CPU
  and GPU follow the same algorithm against the same weights; only
  the order of f32 accumulation inside the per-expert SwiGLU differs.
  Both pick legitimate top-8 sets at L47 — neither is wrong — but
  the small score-perturbation crosses a top-k boundary. Same class
  as gemv reduction-order variance, one call-stack level higher.

REGRESSION GATE FOR PR-3h

  - falsify_qw3_moe_l47_router_indices (#1743): expect CPU L47 sorted
    top-8 == GPU L47 sorted top-8
  - falsify_qw3_moe_gpu_argmax_agreement (#1745): expect 4/4 prompts
    argmax agreement

YAML-ONLY

  Production hot paths byte-unchanged. Additive-purity invariant pinned
  in v1.1.0 still holds. Contract validates via:
    cargo run -p aprender-contracts-cli --bin pv -- \
      validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s), Contract is valid.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 17, 2026
…sifier — H(ii) CONFIRMED (#1583)

PR-3e2 of the M-GPU-MOE-3 cascade. Adds `SaveTensorStage::MoeRouterIndices`
to definitively confirm or falsify H(ii) expert-set divergence at L47.

  L47 sorted top-8:
    cpu = [  2,  20,  36,  57,  60,  73, 111, 120 ]
    gpu = [  2,  12,  36,  57,  60, 103, 111, 120 ]
                  ^^^                ^^^
                  cpu-only={20, 73}; gpu-only={12, 103}

CPU and GPU agree on 6 of 8 experts at L47 but disagree on 2 (mild
H(ii) confirmation). All other 47 layers produce IDENTICAL expert
SETS between CPU and GPU.
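
The set comparison can be reproduced from the two index lists reported above. The sketch below uses those L47 values verbatim; the `expert_set_diff` helper is a hypothetical name, not the in-tree test code.

```rust
use std::collections::BTreeSet;

/// Returns (cpu_only, gpu_only): experts selected by one backend but not the other.
fn expert_set_diff(cpu: &[u32], gpu: &[u32]) -> (Vec<u32>, Vec<u32>) {
    let c: BTreeSet<u32> = cpu.iter().copied().collect();
    let g: BTreeSet<u32> = gpu.iter().copied().collect();
    (
        c.difference(&g).copied().collect(),
        g.difference(&c).copied().collect(),
    )
}

fn main() {
    // L47 sorted top-8 indices as reported in this commit message.
    let cpu = [2u32, 20, 36, 57, 60, 73, 111, 120];
    let gpu = [2u32, 12, 36, 57, 60, 103, 111, 120];
    let (cpu_only, gpu_only) = expert_set_diff(&cpu, &gpu);
    println!("cpu-only = {cpu_only:?}; gpu-only = {gpu_only:?}");
    println!("sets equal = {}", cpu_only.is_empty() && gpu_only.is_empty());
}
```

Using sets rather than comparing the lists position by position keeps the check independent of the order in which each backend emits its top-k.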

Root cause: by L47 the accumulated post-routing drift from per-expert
q6k_gemv fp64 accumulation through 47 layers of MoeFfnOut has
perturbed the gate input enough that two boundary expert scores
swap. The resulting FFN output diverges by O(1) because the disjoint
experts produce unrelated outputs.

Candidate fixes for PR-3f+:

- **Deterministic tie-breaking**: sort top-k by (-prob, +index)
- **fp64 gate softmax**: W_gate @ x → softmax → renormalize at fp64
- **Reorder-stable top-k**: stable partial sort + ε-tolerance on the
  (k+1)-th vs k-th score boundary
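
The first candidate, ordering by (-prob, +index), can be sketched as below. `top_k_deterministic` is a hypothetical helper, and a full sort is used for clarity where a real router would more likely use a partial selection.

```rust
/// Top-k expert indices ordered by descending probability, with ties broken
/// by ascending index so the result does not depend on the sort algorithm.
fn top_k_deterministic(probs: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| {
        probs[b]
            .total_cmp(&probs[a]) // higher probability first
            .then(a.cmp(&b)) // equal probability: lower index first
    });
    idx.truncate(k);
    idx
}

fn main() {
    // Experts 1 and 3 tie, as do experts 0 and 2.
    let probs = [0.2f32, 0.4, 0.2, 0.4, 0.1];
    println!("top-3 = {:?}", top_k_deterministic(&probs, 3));
}
```

Note the limit of this approach: it only stabilises exact ties. If the CPU and GPU scores themselves differ by a small amount around the k-th position, both backends can apply the same ordering rule and still select different experts.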

  inference_trace/save_tensor_stage.rs:
    + `MoeRouterIndices` enum variant + "moe_router_indices" name
    + `is_index_payload(&self)` helper
    + `ALL` array 22 → 23; per_layer count 20 → 21; tests renamed

  gguf/qwen3_moe_load.rs + gguf/cuda/moe_ffn_forward_layer_cuda.rs:
    + traced `_with_router` helpers now return
      `(output, weights, indices)` instead of `(output, weights)`

  gguf/inference/forward/forward_qwen3_moe_traced.rs (CPU)
  gguf/cuda/forward_qwen3_moe_cuda_traced.rs (CUDA):
    + capture `last_router_top_k_indices` from helper
    + emit `MoeRouterIndices` stage (indices cast to f32, lossless
      for num_experts ≤ 2^24)

  tests/qwen3_moe_per_layer_gpu_parity.rs:
    + helpers `make_router_indices_plan` + `read_indices_stage_file`
    + new test `falsify_qw3_moe_l47_router_indices` — definitive
      H(ii) falsifier; captures top-k INDICES at every layer for
      both CPU and GPU, sorts each, asserts set equality, prints
      L47-specific verdict
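
The "lossless for num_experts ≤ 2^24" remark above follows from f32 having a 24-bit significand: every integer up to 2^24 is exactly representable. A small sketch of the round trip, with hypothetical helper names (`index_to_f32`, `f32_to_index`) that are not the in-tree API:

```rust
/// Largest integer below which every u32 round-trips through f32 exactly.
const F32_EXACT_INT_LIMIT: u32 = 1 << 24;

/// Encodes an expert index as f32, refusing values that could lose precision.
fn index_to_f32(i: u32) -> Option<f32> {
    if i <= F32_EXACT_INT_LIMIT {
        Some(i as f32)
    } else {
        None
    }
}

/// Decodes an f32 payload element back to an index.
fn f32_to_index(v: f32) -> u32 {
    v as u32
}

fn main() {
    for i in [0u32, 20, 73, 127, F32_EXACT_INT_LIMIT] {
        let encoded = index_to_f32(i).expect("within exact range");
        println!("{i} -> {encoded} -> {}", f32_to_index(encoded));
    }
    // One past the limit no longer survives the cast.
    let lossy = (F32_EXACT_INT_LIMIT + 1) as f32 as u32;
    println!("{} -> {lossy}", F32_EXACT_INT_LIMIT + 1);
}
```

Reusing the existing f32 stage-file format this way avoids a second payload type, at the cost of needing a marker such as `is_index_payload` so readers know to cast back.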

- PR-1 #1713 ✅ per-layer cos falsifier
- PR-2 #1737 ✅ q6k_gemv fp64 accumulators
- PR-3   ✅ hardware verify (47/48 PASS, L47 surfaces)
- PR-3b #1739 ✅ contract v1.7.0 → v1.7.1
- PR-3c #1740 ✅ scope-doc + L47 sub-cascade
- PR-3d  ✅ H(i) qtype-mismatch FALSIFIED
- PR-3e #1741 ✅ router-weight probe
- PR-3e2 ✅ **THIS PR** — H(ii) CONFIRMED
- PR-3f+  pending — apply one of the 3 candidate fixes

  cargo test --release --features cuda \
    -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
    falsify_qw3_moe_l47_router_indices \
    -- --ignored --nocapture

29.5s on RTX 4090.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 17, 2026
…GN (#1583) (#1745)

PR-3g of the M-GPU-MOE-3 cascade. Adds the canonical "is L47 actually
user-visible" probe: it runs 4 canonical prompts through both CPU and
GPU full forwards and reports argmax agreement (it prints a verdict
rather than hard-asserting).

## Result (lambda-vector RTX 4090, 2026-05-17)

  PROMPT             | CPU argmax (val)     | GPU argmax (val)
  canonical_3tok     |    944 ( 13.7270)  |    944 ( 14.4133)  ✓
  single_tok_785     |    220 ( 15.5523)  |     25 ( 18.5098)  ✗ MISMATCH
  multi_tok_short    |    315 ( 26.2279)  |    315 ( 25.5230)  ✓
  multi_tok_code     |    198 ( 17.7453)  |    198 ( 17.8433)  ✓

**3/4 prompts agree, 1 disagrees.** L47 cliff is NOT benign — the
expert-set divergence DOES flip the top-1 predicted token for some
prompts (~25% in this small sample). Option E (Accept) is off the
table; must pursue Option C (fp64 in per-expert SwiGLU).

## What this PR adds

  crates/aprender-serve/tests/qwen3_moe_gpu_parity.rs:
    + new test `falsify_qw3_moe_gpu_argmax_agreement` — multi-prompt
      probe that builds CPU + GPU models once, runs 4 canonical prompts
      through both full forwards, and prints argmax agreement table +
      verdict. PROBE not hard-assert; prints "BENIGN" if all agree or
      "NOT BENIGN" + disagreeing prompts otherwise.

## Cascade context

- PR-1 #1713 ✅ per-layer cos falsifier
- PR-2 #1737 ✅ q6k_gemv fp64 accumulators
- PR-3   ✅ hardware verify — 47/48 PASS, L47 surfaces
- PR-3b #1739 ✅ contract v1.7.0 → v1.7.1
- PR-3c #1740 ✅ scope-doc + L47 sub-cascade
- PR-3d   ✅ H(i) qtype-mismatch FALSIFIED
- PR-3e #1741 ✅ router-weight probe
- PR-3e2 #1743 ✅ H(ii) CONFIRMED (2-of-8 expert swap)
- PR-3f1 ❌ falsified (fp64 softmax) — dropped
- PR-3f2 ❌ falsified (f64 weighted-sum) — dropped
- PR-3g  ✅ **THIS PR** — L47 NOT BENIGN, must pursue fix
- PR-3h  pending — Option C fp64 in per-expert SwiGLU intermediates

## Why the cascade kept eliminating candidates

The 3-falsifier sequence ruled out the "easy" fix locations:

1. PR-3f1 (gate softmax precision) — drift upstream of softmax
2. PR-3f2 (weighted-sum precision) — drift upstream of weighted-sum
3. **Remaining**: drift inside each per-expert SwiGLU's intermediate
   chain (silu × up at f32, down-proj at f32 except its q6k_gemv acc
   which PR-2 already promoted to fp64)

PR-3h must promote the silu(gate) × up element-wise multiply and the
hidden-dim×4 intermediate state to f64. ~30-50 LOC across both CPU
and CUDA expert_swiglu helpers.

## Reproduction

  cargo test --release --features cuda \
    -p aprender-serve --test qwen3_moe_gpu_parity \
    falsify_qw3_moe_gpu_argmax_agreement \
    -- --ignored --nocapture

~25s on RTX 4090.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 18, 2026
…sifier — H(ii) CONFIRMED (#1583) (#1743)

* feat(m-gpu-moe-3): PR-3e2 MoeRouterIndices stage + L47 expert-set falsifier — H(ii) CONFIRMED (#1583)

PR-3e2 of the M-GPU-MOE-3 cascade. Adds `SaveTensorStage::MoeRouterIndices`
to definitively confirm or falsify H(ii) expert-set divergence at L47.

  L47 sorted top-8:
    cpu = [  2,  20,  36,  57,  60,  73, 111, 120 ]
    gpu = [  2,  12,  36,  57,  60, 103, 111, 120 ]
                  ^^^                ^^^
                  cpu-only={20, 73}; gpu-only={12, 103}

CPU and GPU agree on 6 of 8 experts at L47 but disagree on 2 (mild
H(ii) confirmation). All other 47 layers produce IDENTICAL expert
SETS between CPU and GPU.

Root cause: by L47 the accumulated post-routing drift from per-expert
q6k_gemv fp64 accumulation through 47 layers of MoeFfnOut has
perturbed the gate input enough that two boundary expert scores
swap. The resulting FFN output diverges by O(1) because the disjoint
experts produce unrelated outputs.

Candidate fixes for PR-3f+:

- **Deterministic tie-breaking**: sort top-k by (-prob, +index)
- **fp64 gate softmax**: W_gate @ x → softmax → renormalize at fp64
- **Reorder-stable top-k**: stable partial sort + ε-tolerance on the
  (k+1)-th vs k-th score boundary

  inference_trace/save_tensor_stage.rs:
    + `MoeRouterIndices` enum variant + "moe_router_indices" name
    + `is_index_payload(&self)` helper
    + `ALL` array 22 → 23; per_layer count 20 → 21; tests renamed

  gguf/qwen3_moe_load.rs + gguf/cuda/moe_ffn_forward_layer_cuda.rs:
    + traced `_with_router` helpers now return
      `(output, weights, indices)` instead of `(output, weights)`

  gguf/inference/forward/forward_qwen3_moe_traced.rs (CPU)
  gguf/cuda/forward_qwen3_moe_cuda_traced.rs (CUDA):
    + capture `last_router_top_k_indices` from helper
    + emit `MoeRouterIndices` stage (indices cast to f32, lossless
      for num_experts ≤ 2^24)

  tests/qwen3_moe_per_layer_gpu_parity.rs:
    + helpers `make_router_indices_plan` + `read_indices_stage_file`
    + new test `falsify_qw3_moe_l47_router_indices` — definitive
      H(ii) falsifier; captures top-k INDICES at every layer for
      both CPU and GPU, sorts each, asserts set equality, prints
      L47-specific verdict

- PR-1 #1713 ✅ per-layer cos falsifier
- PR-2 #1737 ✅ q6k_gemv fp64 accumulators
- PR-3   ✅ hardware verify (47/48 PASS, L47 surfaces)
- PR-3b #1739 ✅ contract v1.7.0 → v1.7.1
- PR-3c #1740 ✅ scope-doc + L47 sub-cascade
- PR-3d  ✅ H(i) qtype-mismatch FALSIFIED
- PR-3e #1741 ✅ router-weight probe
- PR-3e2 ✅ **THIS PR** — H(ii) CONFIRMED
- PR-3f+  pending — apply one of the 3 candidate fixes

  cargo test --release --features cuda \
    -p aprender-serve --test qwen3_moe_per_layer_gpu_parity \
    falsify_qw3_moe_l47_router_indices \
    -- --ignored --nocapture

29.5s on RTX 4090.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(m-gpu-moe-3): update save_tensor_plan tests for 23 stages (PR-3e2 #1583)

PR-3e2 added `SaveTensorStage::MoeRouterIndices` (22 → 23 stages) but
missed updating the parallel tests in `save_tensor_plan.rs` that
asserted on the constant `22`. Workspace-test CI surfaced this:

  test inference_trace::save_tensor_plan::tests::
        all_keyword_expands_to_twenty_two_stages ... FAILED
  test inference_trace::save_tensor_plan::tests::
        all_keyword_case_insensitive ... FAILED

Two fixes:
1. Rename `all_keyword_expands_to_twenty_two_stages` →
   `all_keyword_expands_to_all_stages` and assert against
   `SaveTensorStage::ALL.len()` (currently 23) instead of the hardcoded
   `22`. Future stage additions won't require touching this test.
2. Same change in `all_keyword_case_insensitive` — assert against
   `SaveTensorStage::ALL.len()` instead of `22`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>