contract+evidence(qwen3-moe-forward-gpu-v1): v1.3.0 — preload-bug fix plan#1490

Merged
noahgift merged 1 commit into main from evidence/m-gpu-moe-1-2-blocked-by-preload-weights-bug
May 4, 2026

@noahgift

@noahgift noahgift commented May 4, 2026

Summary

Live-dogfood finding on 2026-05-04 on lambda-vector RTX 4090. The first --include-ignored run of the M-GPU-MOE-1.2 cosine-parity test exposed a pre-existing bug: OwnedQuantizedModelCuda::new panics for any qwen3_moe GGUF because build_indexed_weights unconditionally requires dense FFN weight names that don't exist in MoE models.

UnsupportedOperation { operation: "preload_weights_gpu",
  reason: "PAR-043: Failed to build indexed weights:
           Invalid launch config: Quantized weight
           'blk.0.ffn_gate.weight' not cached" }

What this PR adds

  1. Evidence file evidence/m-gpu-moe-1-2-blocked-by-preload-bug-2026-05-04/findings.md — full 5-whys, fix architecture
  2. Contract v1.2.0 → v1.3.0:
    • 110-line v1.3.0 amendment_history block
    • New implementation_stage M-GPU-MOE-1.3 (PENDING)
    • New falsifier FALSIFY-QW3-MOE-GPU-PRELOAD-001
    • Status block updated
  3. pv validate: 0 errors, 0 warnings ✓

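What this PR does NOT do

  • Implement the fix. The contract is pinned first; the fix lands in a separate PR (M-GPU-MOE-1.3 stage).
  • Block PR #1485's already-shipped 3-commit cascade (M52/M54). The cascade is correct; M-GPU-MOE-1.3 is a sibling bug-fix.
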
Why this matters (R10 impact)

The M-GPU-MOE cascade was on track to discharge by 2026-05-04 with hardware tests passing. This bug means the cascade is correct architecturally but unusable in practice until 1.3 lands. R10 (the P0 blocker on production-cadence Qwen3-Coder consumption) cannot retire without this fix.

Test plan

  • pv validate contracts/qwen3-moe-forward-gpu-v1.yaml → 0/0
  • CI ci/gate green

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) May 4, 2026 21:51
@noahgift noahgift force-pushed the evidence/m-gpu-moe-1-2-blocked-by-preload-weights-bug branch from 38ed9cc to 2c19055 Compare May 4, 2026 21:55
noahgift added a commit that referenced this pull request May 4, 2026
…partial discharge)

Per qwen3-moe-forward-gpu-v1 v1.3.0 amendment (PR #1490).

WHAT THIS PR FIXES:

  ArchConstraints + build_indexed_weights + ValidatedLayerWeights all
  made MoE-aware via new `is_moe: bool` field on ArchConstraints.

  (1) `crates/aprender-serve/src/gguf/config.rs` — adds `is_moe: bool`
      field to `ArchConstraints` struct.

  (2) `crates/aprender-serve/src/gguf/arch_constraints_fallback.rs` —
      sets `is_moe: false` on all 19 dense arch entries; sets
      `is_moe: true` on the qwen3_moe arm. Also adds the raw GGUF arch
      string `qwen3moe` (no underscore) and `qwen3_5moe` to the same
      arm — these reach `from_architecture` from
      `ValidatedModelConfig::from_apr` without going through
      `normalize_architecture`.

  (3) `crates/aprender-serve/src/cuda/executor/weights.rs` —
      `build_indexed_weights` gates the 3 FFN-related quant lookups
      (ffn_gate.weight, ffn_up.weight, ffn_down.weight) on
      `arch.is_moe`; uses (0u64, 0usize) sentinels for MoE. Same
      gating for the 3 qtype resolutions.

  (4) `crates/aprender-serve/src/cuda/types.rs` —
      `ValidatedLayerWeights::validate` skips the FfnGate/FfnUp/FfnDown
      role checks when `arch.is_moe`. The MoE forward path
      (`forward_qwen3_moe_cuda`) routes FFN through `moe_layers`
      parameter, never reading these from the indexed weights.
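
The gating in items (2) and (3) can be pictured with a minimal sketch. It is not the code from `weights.rs`: the `HashMap` cache, `WeightRef`, `LayerFfn`, `is_moe_arch`, and `index_ffn` are hypothetical stand-ins, and only the `is_moe` flag, the three tensor names, the three arch strings, and the (0, 0) sentinel come from the text above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real `ArchConstraints`; only `is_moe` is from the PR.
#[derive(Debug, Clone, Copy)]
pub struct ArchConstraints {
    pub is_moe: bool,
}

/// Assumed (device pointer, byte length) pair; (0, 0) is the MoE sentinel.
pub type WeightRef = (u64, usize);

#[derive(Debug, PartialEq)]
pub struct LayerFfn {
    pub gate: WeightRef,
    pub up: WeightRef,
    pub down: WeightRef,
}

/// Hypothetical helper: all three spellings map to the same MoE arm.
pub fn is_moe_arch(arch: &str) -> bool {
    matches!(arch, "qwen3_moe" | "qwen3moe" | "qwen3_5moe")
}

/// Resolve the dense FFN weights for one layer, or return sentinels for MoE.
pub fn index_ffn(
    cache: &HashMap<String, WeightRef>,
    arch: ArchConstraints,
    layer: usize,
) -> Result<LayerFfn, String> {
    if arch.is_moe {
        // MoE layers have no dense FFN tensors, so no lookup is attempted.
        return Ok(LayerFfn { gate: (0, 0), up: (0, 0), down: (0, 0) });
    }
    let get = |suffix: &str| -> Result<WeightRef, String> {
        let name = format!("blk.{layer}.{suffix}");
        cache
            .get(&name)
            .copied()
            .ok_or_else(|| format!("Quantized weight '{name}' not cached"))
    };
    Ok(LayerFfn {
        gate: get("ffn_gate.weight")?,
        up: get("ffn_up.weight")?,
        down: get("ffn_down.weight")?,
    })
}
```

With `is_moe: false` and a cache that holds only expert tensors, the sketch returns the same "not cached" message for `blk.0.ffn_gate.weight` that the original failure reported; with `is_moe: true` it returns the sentinels without touching the cache.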

WHAT THIS PR PARTIALLY DISCHARGES:

  FALSIFY-QW3-MOE-GPU-PRELOAD-001 (new in v1.3.0) — wrapper
  construction now succeeds for qwen3_moe GGUFs. Before this PR,
  `OwnedQuantizedModelCuda::new(model, 0)` panicked at:

    UnsupportedOperation { operation: "preload_weights_gpu",
      reason: "PAR-043: Failed to build indexed weights:
               Invalid launch config: Quantized weight
               'blk.0.ffn_gate.weight' not cached" }

  After this PR, that specific path no longer fails. Verified by
  re-running M-GPU-MOE-1.2 heavy test — it now progresses past
  `OwnedQuantizedModelCuda::new`.

NEW DOWNSTREAM BUG (not blocking this PR):

  After the wrapper construction fix, the heavy test now panics in
  CPU forward `matmul_fused.rs:211` with
  `index out of bounds: the len is 0 but the index is N`. This is a
  separate bug class: something in the CPU forward path is dereferencing
  `layer.ffn_up_weight.data` (or similar), which is the
  `dense_ffn_placeholder` (byte_size=0) for MoE layers per
  `transformer.rs:348-353`. Likely root cause: the CPU
  `forward_qwen3_moe` does NOT touch the dense placeholders directly,
  but some preload/validation/init step does. Needs a follow-up PR
  (M-GPU-MOE-1.4) to either (a) skip dense-FFN-data access for MoE
  layers, or (b) replace the placeholder with a proper sentinel.
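
To illustrate why a zero-length placeholder produces this panic, and what a guarded access in the spirit of option (a) might look like, here is a small sketch. `QuantTensor` and `dense_ffn_byte` are hypothetical names, not types from aprender-serve; the point is only that plain indexing into an empty buffer panics while a checked access returns an error.

```rust
/// Hypothetical quantized tensor view; the real type lives in aprender-serve.
pub struct QuantTensor {
    pub data: Vec<u8>,
}

/// Checked read of one byte of dense FFN data.
/// An empty placeholder yields an error instead of an out-of-bounds panic.
pub fn dense_ffn_byte(t: &QuantTensor, offset: usize) -> Result<u8, String> {
    t.data.get(offset).copied().ok_or_else(|| {
        format!(
            "dense FFN access at index {offset} but the len is {}",
            t.data.len()
        )
    })
}
```

A caller that receives the error can skip the dense path for MoE layers rather than crash inside a parallel matmul.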

  This PR DOES NOT regress the previous behaviour: the previous
  state was "wrapper construction fails", which masked the
  downstream bug. This PR surfaces it; M-GPU-MOE-1.4 will fix it.

VERIFICATION:

  cargo check -p aprender-serve                  → 0 errors
  cargo check -p aprender-serve --features cuda  → 0 errors
  cargo test -p aprender-serve --test qwen3_moe_gpu_parity \
      --features cuda                            → 3 helpers pass

  Heavy test on lambda-vector RTX 4090:
    BEFORE this PR: panic at OwnedQuantizedModelCuda::new
                    (preload_weights_gpu / build_indexed_weights)
    AFTER this PR:  panic moved to CPU forward matmul_fused.rs:211
                    (downstream bug, separate PR scope)

  Net: progress by one bug class. M-GPU-MOE-1.3 stage is FUNCTIONALLY
  DISCHARGED as defined; M-GPU-MOE-1.4 follow-up needed for full
  PARITY-001 discharge.

NOTE ON PR STACKING:

  This PR depends on PR #1490 (contract v1.2.0 → v1.3.0 amendment +
  evidence file) being on aprender main first. The contract pinned
  the architectural decision; this PR implements it.

Refs: M52, M53, M54, R10, qwen3-moe-forward-gpu-v1 v1.3.0,
      FALSIFY-QW3-MOE-GPU-PRELOAD-001 (partial discharge)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d-bug fix plan

Live-dogfood finding 2026-05-04 on lambda-vector RTX 4090: the
M-GPU-MOE-1.2 heavy `qwen3_moe_gpu_parity` test (FALSIFY-QW3-MOE-
GPU-PARITY-001) cannot run on the cached 17.3 GB Qwen3-Coder GGUF
because `OwnedQuantizedModelCuda::new` itself fails:

  UnsupportedOperation { operation: "preload_weights_gpu",
    reason: "PAR-043: Failed to build indexed weights:
             Invalid launch config: Quantized weight
             'blk.0.ffn_gate.weight' not cached" }

ROOT CAUSE (5-whys in evidence file):

  `executor.build_indexed_weights` at
  `crates/aprender-serve/src/cuda/executor/weights.rs:325-373`
  unconditionally requires `blk.{i}.ffn_gate.weight`,
  `.ffn_up.weight`, `.ffn_down.weight` to be cached for every
  layer. For MoE these names DO NOT EXIST — MoE has 128 expert
  gates per layer (`blk.{i}.ffn_gate_exps.weight`) loaded into
  the `moe_layers` parameter at forward-time.

  M-GPU-MOE-1.1.2 (PR #1477)'s forward body sidesteps the indexed
  weights for FFN, but the wrapper construction goes through
  `preload_weights_gpu` BEFORE forward is ever called. Wrapper
  construction fails first.

WHY DEFAULT CI DIDN'T CATCH IT:

  The lib-only stub test (PR #1464) checks only the signature at
  compile time. The heavy `qwen3_moe_gpu_parity.rs` test (PR #1484)
  is `#[ignore]`d and needs an RTX 4090 plus the 17.3 GB GGUF. The
  first `--include-ignored` dogfood run on lambda-vector found this
  on 2026-05-04.

THIS PR ADDS:

  (1) Evidence file
      `evidence/m-gpu-moe-1-2-blocked-by-preload-bug-2026-05-04/findings.md`
      documenting the live failure + 5-whys + fix architecture.

  (2) Contract `qwen3-moe-forward-gpu-v1` v1.2.0 → v1.3.0:
      * New v1.3.0 amendment_history block (~110 lines) describing
        the bug, root cause, and three-step fix architecture
      * New implementation_stage `M-GPU-MOE-1.3` between 1.2 and 2
        with status PENDING
      * New falsification_test FALSIFY-QW3-MOE-GPU-PRELOAD-001
        (hardware test + lib-only sibling)
      * Top-level version "1.2.0" → "1.3.0"
      * Status comment expanded to mention M-GPU-MOE-1.3 as a
        precondition for ACTIVE_ALGORITHM_LEVEL flip

VALIDATION: pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
            → 0 errors, 0 warnings. Contract is valid.

WHAT THIS PR DOES NOT DO:

  Does NOT implement the fix. Per CLAUDE.md "NEVER write code
  before writing a provable contract", this PR pins the contract
  first. The fix lands in a separate PR (M-GPU-MOE-1.3 stage):
  ~30 LOC in weights.rs + 1-2 callers + ArchConstraints field +
  drift-prevention test.

  Does NOT block PR #1485's already-shipped 3-commit cascade
  (M52/M54). The cascade is correct; M-GPU-MOE-1.3 is a sibling
  bug-fix.

Refs: M52, M53, M54, R10, qwen3-moe-forward-gpu-v1 v1.3.0,
      FALSIFY-QW3-MOE-GPU-PRELOAD-001 (new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the evidence/m-gpu-moe-1-2-blocked-by-preload-weights-bug branch from 2c19055 to 8ea4dc5 Compare May 4, 2026 22:28
@noahgift noahgift merged commit 8267d4b into main May 4, 2026
10 checks passed
@noahgift noahgift deleted the evidence/m-gpu-moe-1-2-blocked-by-preload-weights-bug branch May 4, 2026 22:48
noahgift added a commit that referenced this pull request May 4, 2026
…partial discharge) (#1491)

* feat(aprender-serve): M-GPU-MOE-1.3 — preload_weights_gpu MoE-aware (partial discharge)

* feat(aprender-serve): M-GPU-MOE-1.3 — also skip parity_gate for MoE

Followup to the previous M-GPU-MOE-1.3 commit. The parity_gate
(Jidoka stop-the-line in `OwnedQuantizedModelCuda::with_max_seq_len`)
also runs the dense forward paths
(`forward_single_with_cache` CPU + `forward_gpu_resident` GPU) on
construction. For MoE these dispatch to `fused_matmul_f32` against
the `dense_ffn_placeholder` (byte_size=0), causing rayon-parallel
panics in `matmul_fused.rs:211`.

Fix: skip parity_gate when `arch.is_moe`, mirroring the rationale
already in v1.3.0's amendment_history block.

  - The parity gate's purpose is "stop the line if GPU diverges
    from CPU" — for dense models, it's load-time safety.
  - For MoE, the equivalent gate is FALSIFY-QW3-MOE-GPU-PARITY-001
    (qwen3_moe_gpu_parity.rs), which exercises the MoE-specific
    forward paths and bypasses the dense path the gate runs.
  - Net: MoE models lose load-time parity but gain
    test-time parity via the qwen3_moe_gpu_parity test.
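
The skip can be sketched as follows. This is an illustration under assumptions, not the body of `with_max_seq_len`: `GateOutcome`, `parity_gate`, the closure parameters, and the tolerance are hypothetical. What it shows is that for MoE the dense forward closures are never invoked, so the zero-length placeholder is never touched.

```rust
#[derive(Debug, PartialEq)]
pub enum GateOutcome {
    /// MoE: load-time gate not run; parity is checked at test time instead.
    Skipped,
    Passed,
    Diverged,
}

/// Run the dense CPU and GPU forwards and compare them, unless `is_moe`.
pub fn parity_gate<C, G>(is_moe: bool, cpu_forward: C, gpu_forward: G, tol: f32) -> GateOutcome
where
    C: FnOnce() -> Vec<f32>,
    G: FnOnce() -> Vec<f32>,
{
    if is_moe {
        // Return before either dense forward is called.
        return GateOutcome::Skipped;
    }
    let (cpu, gpu) = (cpu_forward(), gpu_forward());
    // NaN differences compare false against `tol`, so they count as divergence.
    let same = cpu.len() == gpu.len()
        && cpu.iter().zip(&gpu).all(|(a, b)| (a - b).abs() <= tol);
    if same {
        GateOutcome::Passed
    } else {
        GateOutcome::Diverged
    }
}
```

Passing the forwards as closures keeps the skip decision ahead of any work, which is the property the fix relies on.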

VERIFICATION ON LAMBDA-VECTOR RTX 4090:

  Test progresses much further now:

    BEFORE: panic at OwnedQuantizedModelCuda::new build_indexed_weights
            (FALSIFY-QW3-MOE-GPU-PRELOAD-001 falsifier)
    AFTER previous commit: panic at parity_gate matmul_fused.rs:211
            (downstream bug — exposed but not yet fixed)
    AFTER this commit: CPU forward succeeds, GPU forward executes,
            then asserts at gpu_logits.iter().all(|v| v.is_finite())
            because the GPU produces NaN/Inf logits.

  Test output:
    [GH-129] Early kernel preload: 49 modules compiled
    [PMAT-082] cuBLASLt FP8 JIT warmed (2048x16x2048)
    [PMAT-053] FP8 weight cache: 193 matrices cached (728.8 MB)
    FALSIFY-QW3-MOE-GPU-PARITY-001: running GPU forward...
    panicked at qwen3_moe_gpu_parity.rs:168:
    all GPU logits must be finite (no NaN/Inf)

PARTIAL DISCHARGE:

  FALSIFY-QW3-MOE-GPU-PRELOAD-001 — wrapper construction succeeds.
  FALSIFY-QW3-MOE-GPU-INVARIANTS-001 — partial (output length OK
                                       implicitly; finiteness FAILS).
  FALSIFY-QW3-MOE-GPU-PARITY-001 — blocked by NaN/Inf bug.

NEW DOWNSTREAM BUG:

  GPU forward (forward_qwen3_moe_cuda body, M-GPU-MOE-1.1.2 PR
  #1477) produces NaN/Inf for at least the canonical 3-token
  Qwen3-Coder prompt. This is the NEXT bug to investigate
  (M-GPU-MOE-1.5 follow-up). Likely candidates:
    - Q4K matmul accumulator overflow in expert_swiglu_cuda
    - Per-expert SwiGLU silu activation produces Inf for large inputs
    - Top-k router weight renormalization division by zero
    - Missing per-head Q/K RMSNorm path for MoE (qk_norm tensors
      loaded but not applied)
  Bisection via `apr trace --json --payload` per the M32d Step 2
  surface methodology (per qwen3-moe-forward-gpu-v1 v1.1.0
  PARITY-001 if_fails).
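
As an illustration of one candidate above (router weight renormalization dividing by zero), the sketch below shows how an unguarded division turns into NaN and how a guard plus a finiteness probe could localize it. It is a CPU-side sketch with hypothetical helper names, not the CUDA code, and it does not claim this candidate is the actual cause.

```rust
/// Renormalize selected router weights so they sum to 1.
/// Returns false and leaves the slice unchanged when the sum is zero,
/// negative, or non-finite, since dividing would produce NaN or Inf.
pub fn renormalize_top_k(weights: &mut [f32]) -> bool {
    let sum: f32 = weights.iter().sum();
    if !(sum.is_finite() && sum > 0.0) {
        return false;
    }
    for w in weights.iter_mut() {
        *w /= sum;
    }
    true
}

/// Index of the first NaN or Inf value, if any; useful when bisecting
/// which stage first produces non-finite output.
pub fn first_non_finite(values: &[f32]) -> Option<usize> {
    values.iter().position(|v| !v.is_finite())
}
```

Applying a probe like `first_non_finite` to intermediate buffers stage by stage narrows down where the first NaN or Inf appears.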

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>