feat(aprender-serve): forward_qwen3_moe_cuda — M-GPU-MOE-1.0-redo on correct type #1464

Merged
noahgift merged 2 commits into main from feat/qwen3-moe-forward-cuda-m-stage-1-0-redo on May 4, 2026

@noahgift noahgift commented May 4, 2026

Summary

Why this PR

PR #1460 placed `forward_qwen3_moe_gpu` on `OwnedQuantizedModel` (the CPU-only type). Code archaeology after #1460 landed showed that `OwnedQuantizedModelCuda` already exists, with `forward_cuda` doing CPU attention + CUDA FFN, which is the established pattern this contract should extend. The v1.1.0 amendment (#1462) recorded option D as the architectural decision. This PR implements that decision by adding the stub on the correct type.

Test plan

  • `cargo check -p aprender-serve --features cuda` — compiles
  • `cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda` — passes
  • `pv validate contracts/qwen3-moe-forward-gpu-v1.yaml` — 0 errors, 0 warnings
  • M-GPU-MOE-1.1 PR: per-expert CUDA dispatch via `self.executor` (separate)

🤖 Generated with Claude Code

…correct type

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D amendment (PR #1462
squash 4495407), the GPU MoE forward path lives on
OwnedQuantizedModelCuda, NOT OwnedQuantizedModel. The first-cut
M-GPU-MOE-1.0 stub from PR #1460 (4d9e5ae) was placed on the
wrong type; this PR ships the redo on the correct type.

What this PR ships
==================

  crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda.rs  NEW

    impl OwnedQuantizedModelCuda {
        pub fn forward_qwen3_moe_cuda(
            &self,
            token_ids: &[u32],
            moe_layers: &[Qwen3MoeQuantizedLayer],
            num_experts: usize,
            num_experts_per_tok: usize,
            moe_intermediate: usize,
            _data: &[u8],
        ) -> Result<Vec<f32>>
    }

  Behavior at M-GPU-MOE-1.0-redo:
    1. Validate preconditions (token_ids non-empty, moe_layers length
       matches self.model.layers.len(), num_experts/num_experts_per_tok/
       moe_intermediate > 0, num_experts_per_tok ≤ num_experts).
    2. Return RealizarError::UnsupportedOperation pointing at
       qwen3-moe-forward-gpu-v1 v1.1.0 + listing pending stages
       M-GPU-MOE-1.1+.

  + 1 unit test (signature drift gate)
  + uses.rs gets `include!("forward_qwen3_moe_cuda.rs");`
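
The two-step stub behavior listed above can be sketched as follows. This is a minimal illustration, not the code in this PR: `StubError`, `validate_moe_preconditions`, `forward_stub`, and the message strings are hypothetical stand-ins for `RealizarError` and the method body, `num_model_layers` stands in for `self.model.layers.len()`, and reporting bad inputs as `InvalidShape` is an assumption.

```rust
/// Hypothetical stand-in for the crate's error type (RealizarError in the PR).
#[derive(Debug, PartialEq)]
pub enum StubError {
    InvalidShape(String),
    UnsupportedOperation(String),
}

/// Step 1: check the preconditions listed in the commit message.
pub fn validate_moe_preconditions(
    num_tokens: usize,
    num_moe_layers: usize,
    num_model_layers: usize,
    num_experts: usize,
    num_experts_per_tok: usize,
    moe_intermediate: usize,
) -> Result<(), StubError> {
    if num_tokens == 0 {
        return Err(StubError::InvalidShape(
            "token_ids must be non-empty".to_string(),
        ));
    }
    if num_moe_layers != num_model_layers {
        return Err(StubError::InvalidShape(format!(
            "moe_layers has {} entries but the model has {} layers",
            num_moe_layers, num_model_layers
        )));
    }
    if num_experts == 0 || num_experts_per_tok == 0 || moe_intermediate == 0 {
        return Err(StubError::InvalidShape(
            "num_experts, num_experts_per_tok and moe_intermediate must be > 0".to_string(),
        ));
    }
    if num_experts_per_tok > num_experts {
        return Err(StubError::InvalidShape(format!(
            "num_experts_per_tok ({}) exceeds num_experts ({})",
            num_experts_per_tok, num_experts
        )));
    }
    Ok(())
}

/// Step 2: even valid inputs fail at the stub stage, with an error that
/// names the contract and the pending stages.
pub fn forward_stub(
    num_tokens: usize,
    num_moe_layers: usize,
    num_model_layers: usize,
    num_experts: usize,
    num_experts_per_tok: usize,
    moe_intermediate: usize,
) -> Result<Vec<f32>, StubError> {
    validate_moe_preconditions(
        num_tokens,
        num_moe_layers,
        num_model_layers,
        num_experts,
        num_experts_per_tok,
        moe_intermediate,
    )?;
    Err(StubError::UnsupportedOperation(
        "qwen3-moe-forward-gpu-v1 v1.1.0: pending stages M-GPU-MOE-1.1+".to_string(),
    ))
}
```

Validating before returning the "unsupported" error means callers already see the final input contract while the compute path is still pending.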

Why on OwnedQuantizedModelCuda (not OwnedQuantizedModel)
=========================================================

Per the v1.1.0 amendment's option D decision: this method must
extend the existing OwnedQuantizedModelCuda CPU-attention + CUDA-FFN
pattern (forward_cuda in cuda.rs at line 18), not invent a new
substrate. OwnedQuantizedModelCuda already wraps OwnedQuantizedModel
+ holds CudaExecutor + GPU buffers (embed_buf, prefix_cache).

Naming follows existing precedent: `forward_cuda` is the existing
method on this type, so `forward_qwen3_moe_cuda` slots in cleanly.

Wrong-type stub (#1460) status
==============================

The OwnedQuantizedModel::forward_qwen3_moe_gpu function from #1460
remains on main. It returns the same UnsupportedOperation but on
the wrong type. A separate cleanup PR can either delete it or
update its doc-comment to point at this new variant. Not blocking.

Implementation stages updated
=============================

  M-GPU-MOE-0    Contract scaffold v1.0.0                SHIPPED ✓
  M-GPU-MOE-0.5  v1.1.0 option D amendment              SHIPPED ✓
  M-GPU-MOE-1.0-redo  Stub on OwnedQuantizedModelCuda   SHIPPED ✓ (THIS PR)
  M-GPU-MOE-1.1  Per-expert CUDA dispatch via            PENDING
                 self.executor (gemm_q4k for gate/up,
                 gemm_q6k for down)
  M-GPU-MOE-1.2  Cosine-vs-CPU parity gate ≥0.99        PENDING
  M-GPU-MOE-2    wgpu fallback                            PENDING
  M-GPU-MOE-3    Throughput ≥150 + VRAM ≤ 95%            PENDING

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles
  $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda
  test ... ok. 1 passed; 0 failed
  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

Refs PR #1460 squash 4d9e5ae (wrong-type stub on OwnedQuantizedModel)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs claude-code-parity-apr POC M49 (P0 elevation)
Refs claude-code-parity-apr POC M50 (M-GPU-MOE-0 SHIPPED)
Refs M32b precedent (CPU sibling staging: stub → forward impl)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 4, 2026 09:52
@noahgift noahgift merged commit 9c721ec into main May 4, 2026
10 checks passed
@noahgift noahgift deleted the feat/qwen3-moe-forward-cuda-m-stage-1-0-redo branch May 4, 2026 12:23
noahgift added a commit that referenced this pull request May 4, 2026
… GPU helper

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D — first concrete GPU
compute for the contract. Mirrors the M32c.2.2.* CPU staging where
per-expert byte slicer + per-expert SwiGLU helper landed BEFORE the
full moe_ffn_forward_layer integration.

What this PR ships
==================

  crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs  NEW

    pub(crate) fn expert_swiglu_cuda(
        executor: &mut crate::cuda::CudaExecutor,
        gate_bytes: &[u8],   // Q4_K, [intermediate, hidden_dim]
        up_bytes:   &[u8],   // Q4_K, [intermediate, hidden_dim]
        down_bytes: &[u8],   // Q6_K, [hidden_dim, intermediate]
        hidden: &[f32],
        hidden_dim: usize,
        intermediate: usize,
    ) -> Result<Vec<f32>>

  Body (mirrors CPU sibling moe_ffn_forward_layer per-expert loop):
    1. gate_out      = executor.q4k_matvec(gate_bytes, hidden, ..., m=intermediate, k=hidden_dim)
    2. up_out        = executor.q4k_matvec(up_bytes,   hidden, ..., m=intermediate, k=hidden_dim)
    3. ffn_inner[i]  = silu(gate_out[i]) * up_out[i]   (CPU element-wise)
    4. expert_out    = executor.q6k_gemv(down_bytes, ffn_inner, ..., n=hidden_dim, k=intermediate)

  + 2 unit tests (signature drift gate + InvalidShape rejection)
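
The four-step body above can be checked against a plain f32 reference. The sketch below assumes the weights are already dequantized to row-major f32 (the real helper takes Q4_K/Q6_K bytes and calls the executor); `matvec` and `expert_swiglu_reference` are hypothetical names, not functions from the crate.

```rust
/// SiLU (swish) activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Dense row-major matrix-vector product: `w` is [rows, cols], `x` has `cols` entries.
fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    w.chunks_exact(cols)
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// f32 reference for one expert: down * (silu(gate * h) ⊙ (up * h)).
/// `gate` and `up` are [intermediate, hidden_dim]; `down` is [hidden_dim, intermediate].
pub fn expert_swiglu_reference(
    gate: &[f32],
    up: &[f32],
    down: &[f32],
    hidden: &[f32],
    hidden_dim: usize,
    intermediate: usize,
) -> Vec<f32> {
    // Steps 1 and 2: the two projections into the intermediate dimension.
    let gate_out = matvec(gate, hidden, intermediate, hidden_dim);
    let up_out = matvec(up, hidden, intermediate, hidden_dim);
    // Step 3: element-wise SiLU gating.
    let ffn_inner: Vec<f32> = gate_out
        .iter()
        .zip(&up_out)
        .map(|(g, u)| silu(*g) * u)
        .collect();
    // Step 4: project back down to hidden_dim.
    matvec(down, &ffn_inner, hidden_dim, intermediate)
}
```

A reference like this is the kind of baseline a later parity check could compare GPU output against, one expert at a time.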

Why "naive per-expert dispatch" is the M-GPU-MOE-1.1.0 baseline
===============================================================

The fused dequant+matmul + sparse expert batching path is M-GPU-MOE-3.
The contract (qwen3-moe-forward-gpu-v1 implementation_stages) stages
correctness before performance:

  M-GPU-MOE-1.1.0 (this)  Per-expert via existing primitives        SHIPPED ✓
                          - silu via CPU elementwise (small)
                          - element-wise gate*up via CPU
                          - matmuls via existing q4k/q6k GPU kernels
  M-GPU-MOE-1.1.1         Full forward integration in
                          OwnedQuantizedModelCuda::forward_qwen3_moe_cuda
                          (router + per-token loop + per-expert
                          dispatch + weighted aggregation)         PENDING
  M-GPU-MOE-1.2           Cosine-vs-CPU parity gate ≥0.99          PENDING
                          (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2             wgpu fallback                            PENDING
  M-GPU-MOE-3             Fused kernels + sparse batching          PENDING
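
The cosine-vs-CPU parity gate in the stage table can be pictured as below. This is a sketch only: the source does not say which vectors the gate compares (final logits are assumed here), and `cosine_similarity` and `passes_parity_gate` are hypothetical names rather than the FALSIFY-QW3-MOE-GPU-PARITY-001 test itself.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns None for mismatched lengths, empty input, or a zero-norm vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some((dot / (norm_a.sqrt() * norm_b.sqrt())) as f32)
}

/// The gate passes only when the similarity is defined and meets the threshold
/// (0.99 in the contract's stage table).
pub fn passes_parity_gate(cpu: &[f32], gpu: &[f32], threshold: f32) -> bool {
    cosine_similarity(cpu, gpu).map_or(false, |c| c >= threshold)
}
```

Treating an undefined similarity as a failure keeps an all-zero GPU output from slipping through the gate.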

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles
  $ cargo test -p aprender-serve --features cuda --lib expert_swiglu_cuda
  test ... ok. 2 passed; 0 failed
  $ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  0 error(s), 0 warning(s)

Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs PR #1464 (M-GPU-MOE-1.0-redo stub on OwnedQuantizedModelCuda)
Refs M32c.2.2.0 + M32c.2.2.1 (CPU per-expert sub-milestone precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation + risk row)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…1469)

* feat(aprender-serve): expert_swiglu_cuda — M-GPU-MOE-1.1.0 per-expert GPU helper

* feat(aprender-serve): moe_ffn_forward_layer_cuda — M-GPU-MOE-1.1.1 single-layer GPU MoE FFN

Mirrors CPU sibling moe_ffn_forward_layer (qwen3_moe_load.rs:363)
step-for-step: F32 router on CPU, softmax + top-k + renormalize on
CPU, per-expert SwiGLU dispatched through expert_swiglu_cuda
(M-GPU-MOE-1.1.0), weighted aggregation on CPU.

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D: GPU MoE forward path
on OwnedQuantizedModelCuda, reusing existing CudaExecutor primitives
(q4k_matvec for gate/up, q6k_gemv for down) per expert.

Composes the M-GPU-MOE-1.1.0 helper into the layer-level structure
that the next stage M-GPU-MOE-1.1.2 (forward_qwen3_moe_cuda full
integration) will call once per token per layer.
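
The CPU-side routing described above (softmax, top-k, renormalize, weighted aggregation) can be sketched as follows. This assumes the router logits are already computed; `route_top_k` and `moe_ffn_layer_sketch` are hypothetical names, and the `run_expert` closure stands in for the per-expert SwiGLU dispatch.

```rust
/// Softmax over the router logits, keep the k largest probabilities,
/// then rescale the kept probabilities so they sum to 1.
pub fn route_top_k(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    assert!(k > 0 && k <= router_logits.len());
    // Subtract the max before exponentiating for numerical stability.
    let max = router_logits
        .iter()
        .copied()
        .fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = router_logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> = exps.iter().map(|e| e / sum).enumerate().collect();
    // Sort by probability, largest first (stable, so ties keep the lower index first).
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k);
    let kept: f32 = probs.iter().map(|(_, p)| p).sum();
    for (_, p) in probs.iter_mut() {
        *p /= kept;
    }
    probs
}

/// Weighted sum of the selected experts' outputs for one token.
pub fn moe_ffn_layer_sketch(
    router_logits: &[f32],
    k: usize,
    hidden_dim: usize,
    mut run_expert: impl FnMut(usize) -> Vec<f32>,
) -> Vec<f32> {
    let mut out = vec![0.0f32; hidden_dim];
    for (expert_id, weight) in route_top_k(router_logits, k) {
        let expert_out = run_expert(expert_id);
        assert_eq!(expert_out.len(), hidden_dim);
        for (o, e) in out.iter_mut().zip(&expert_out) {
            *o += weight * e;
        }
    }
    out
}
```

Only the selected experts are ever evaluated, which is why the per-expert dispatch is the only part that needs the GPU at this stage.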

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles

Refs PR #1465 (expert_swiglu_cuda M-GPU-MOE-1.1.0)
Refs M32c.2.2.2.0 (CPU sibling moe_ffn_forward_layer precedent)
Refs claude-code-parity-apr POC M49 / R10 (P0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…-MOE-1.1.2

Replaces the M-GPU-MOE-1.0-redo stub body with the full forward
integration. forward_qwen3_moe_cuda now mirrors the CPU sibling
OwnedQuantizedModel::forward_qwen3_moe (forward_qwen3_moe.rs)
line-for-line, with one difference: the per-layer FFN section
routes through moe_ffn_forward_layer_cuda which dispatches per-
expert matmuls to self.executor (CudaExecutor) via the
expert_swiglu_cuda helper.

Per qwen3-moe-forward-gpu-v1 v1.1.0 option D — extends the existing
OwnedQuantizedModelCuda CPU-attention + CUDA-FFN pattern (forward_cuda
in cuda.rs:18). Attention path stays on CPU; only FFN matmuls go to
GPU. M-GPU-MOE-3 fuses dispatch into a single sparse-expert kernel
for ~5× throughput.

Signature changes
=================

  - &self → &mut self  (executor needs mutable access for its kernel cache)
  - _data → data       (passed to moe_ffn_forward_layer_cuda for
                          expert_byte_slice)
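
Why a kernel cache forces `&mut self` can be shown with a toy model. Everything in this sketch is hypothetical (`ToyExecutor`, `KernelHandle`, `ToyModelCuda`); it illustrates the borrow rule only and is not how `CudaExecutor` is implemented.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a compiled GPU kernel handle.
#[derive(Debug, Clone, PartialEq)]
pub struct KernelHandle {
    pub name: String,
}

/// Hypothetical executor with a lazily filled kernel cache.
#[derive(Default)]
pub struct ToyExecutor {
    kernel_cache: HashMap<String, KernelHandle>,
    pub compile_count: usize,
}

impl ToyExecutor {
    /// A lookup may insert into the cache, so it has to take `&mut self`.
    pub fn kernel(&mut self, name: &str) -> &KernelHandle {
        if !self.kernel_cache.contains_key(name) {
            self.compile_count += 1;
            self.kernel_cache.insert(
                name.to_string(),
                KernelHandle { name: name.to_string() },
            );
        }
        &self.kernel_cache[name]
    }
}

/// Hypothetical wrapper that owns the executor, like the CUDA model type.
#[derive(Default)]
pub struct ToyModelCuda {
    executor: ToyExecutor,
}

impl ToyModelCuda {
    /// Has to be `&mut self`: `self.executor.kernel(..)` would not compile
    /// through a shared `&self` borrow.
    pub fn forward(&mut self) -> usize {
        self.executor.kernel("q4k_matvec");
        self.executor.kernel("q6k_gemv");
        self.executor.kernel("q4k_matvec"); // cache hit, no recompilation
        self.executor.compile_count
    }
}
```

Interior mutability (for example a `RefCell` or `Mutex` around the cache) would keep `&self`, at the cost of runtime borrow checks or locking; the signature change avoids that.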

Forward body structure (mirrors CPU sibling step-for-step):
  1. Embed (CPU)                                              — self.model.embed
  2. Per-layer:
     2a. Attention norm (CPU)                                 — ops::rms_norm
     2b. QKV projection (CPU)                                 — self.model.qkv_matmul
     2c. Per-head Q/K RMSNorm + RoPE (M32d Step 5/5b)         — ops::apply_per_head_rms_norm
     2d. Causal attention + output proj (CPU)                 — self.model.causal_attention
     2e. Residual                                              — element-wise CPU
     2f. Pre-FFN norm (CPU)                                   — ops::rms_norm
     2g. **MoE FFN on GPU**                                   — moe_ffn_forward_layer_cuda
                                                                  → expert_swiglu_cuda
                                                                  → self.executor.q4k_matvec
                                                                                .q6k_gemv
     2h. Residual                                              — element-wise CPU
  3. Final norm (CPU)
  4. LM head — last token (CPU)

Implementation stages updated
=============================

  M-GPU-MOE-0    Contract scaffold v1.0.0                SHIPPED ✓
  M-GPU-MOE-0.5  v1.1.0 option D amendment              SHIPPED ✓
  M-GPU-MOE-1.0-redo  Stub on OwnedQuantizedModelCuda    SHIPPED ✓ (#1464)
  M-GPU-MOE-1.1.0     expert_swiglu_cuda helper          SHIPPED ✓ (via #1469 squash)
  M-GPU-MOE-1.1.1     moe_ffn_forward_layer_cuda          SHIPPED ✓ (#1469)
  M-GPU-MOE-1.1.2     forward_qwen3_moe_cuda full integ   SHIPPED ✓ (THIS PR)
  M-GPU-MOE-1.2       Cosine-vs-CPU parity gate ≥0.99     PENDING
                      (FALSIFY-QW3-MOE-GPU-PARITY-001)
  M-GPU-MOE-2         wgpu fallback                        PENDING
  M-GPU-MOE-3         Throughput ≥150 + VRAM ≤ 95%         PENDING

Verification
============

  $ cargo check -p aprender-serve --features cuda
  ✓ Compiles
  $ cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda
  test ... ok. 1 passed

Refs PR #1469 squash 77b9f0d (helpers landed)
Refs PR #1462 squash 4495407 (v1.1.0 option D amendment)
Refs claude-code-parity-apr POC M49 / R10 (P0 elevation)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…-MOE-1.1.2 (#1477)

noahgift added a commit that referenced this pull request May 4, 2026
…QuantizedModelWgpu)

Pre-implementation architecture amendment for M-GPU-MOE-2 (wgpu
fallback). Mirrors the v1.1.0 option D amendment that pinned the
CUDA substrate before M-GPU-MOE-1.0 implementation; this one pins
the wgpu substrate before any wgpu code lands.

Why now: M-GPU-MOE-1 is in flight (1.0-redo SHIPPED, 1.1.1 SHIPPED,
1.1.2 OPEN as PR #1477, 1.2 test scaffold OPEN as PR #1484).
Choosing the wgpu seam early prevents the wrong-type-stub waste
that bit M-GPU-MOE-1.0 (PR #1460 placed forward_qwen3_moe_gpu on
OwnedQuantizedModel; one cycle later #1464 redid it on
OwnedQuantizedModelCuda — option D).

FOUR options considered:
  (I)   OwnedQuantizedModelWgpu wrapper type (analog of v1.1.0 option D) — CHOSEN
  (II)  GpuExecutor trait abstracting CUDA + wgpu — REJECTED (over-engineered)
  (III) Backend enum inside renamed OwnedQuantizedModelGpu — REJECTED (invasive)
  (IV)  Defer wgpu indefinitely — REJECTED (violates CLAUDE.md backend-agnostic mandate)

Option I picks wgpu by code-path symmetry, not by trait abstraction:
new file tree at `crates/aprender-serve/src/gguf/wgpu/` mirrors
`crates/aprender-serve/src/gguf/cuda/` line-for-line. A maintenance-mode
reviewer can verify a parity bug by diff rather than by elaborate test
infrastructure.

M-GPU-MOE-2 decomposed into four substages mirroring M-GPU-MOE-1.x:
  M-GPU-MOE-2.0 stub on OwnedQuantizedModelWgpu
  M-GPU-MOE-2.1 per-expert wgpu dispatch helpers (expert_swiglu_wgpu,
                moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2 full forward integration (replaces 2.0 stub body)
  M-GPU-MOE-2.3 cosine-vs-CPU parity test on hardware with wgpu

Two new blockers documented:
  - wgpu adapter selection probe for non-NVIDIA hardware
  - trueno-gpu Q6_K QuantizeKernel coverage check before 2.1

Companion-spec records this as M52 (no companion contract bump).

Validation:
  pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s). Contract is valid.

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 option I.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
Implements M-GPU-MOE-2.0 per qwen3-moe-forward-gpu-v1 v1.2.0 option I
(see PR #1485 amendment). Analog of M-GPU-MOE-1.0-redo (PR #1464) for
the wgpu backend.

WHAT THIS PR ADDS:

  * crates/aprender-serve/src/gguf/wgpu_backend/mod.rs — new module
    with OwnedQuantizedModelWgpu struct + new() + stub method
    forward_qwen3_moe_wgpu(). Mirrors cuda/mod.rs structure.

  * crates/aprender-serve/src/gguf/wgpu_model.rs — re-export shim
    `pub use super::wgpu_backend::OwnedQuantizedModelWgpu`. Mirrors
    cuda_model.rs.

  * crates/aprender-serve/src/gguf/mod.rs — adds the two new modules
    behind `#[cfg(feature = "gpu")]` (the existing wgpu feature
    flag — `gpu = ["trueno/gpu"]` per Cargo.toml line 208).

WHY MODULE NAMED `wgpu_backend`:

The Rust ecosystem already has a `wgpu` crate. A module named `wgpu`
inside the same crate would shadow it inside the file's body. The
public re-export still presents `OwnedQuantizedModelWgpu` (no ugly
suffix) thanks to wgpu_model.rs.
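
The module layout described above can be sketched in a single file. This is an illustration under assumptions: the inline `gguf` module, the `label` field, and the body of `new()` are made up for the example, and the `gpu` feature gate is left out so the snippet compiles on its own.

```rust
pub mod gguf {
    /// Named `wgpu_backend` rather than `wgpu`, so paths that start with
    /// `wgpu::` inside this module tree keep referring to the external crate.
    pub mod wgpu_backend {
        /// Toy stand-in for the real wrapper type.
        pub struct OwnedQuantizedModelWgpu {
            pub label: &'static str,
        }

        impl OwnedQuantizedModelWgpu {
            pub fn new() -> Self {
                Self { label: "wgpu stub" }
            }
        }
    }

    /// Re-export shim: callers name the type through `wgpu_model` and never
    /// see the `_backend` suffix.
    pub mod wgpu_model {
        pub use super::wgpu_backend::OwnedQuantizedModelWgpu;
    }
}
```

Both paths name the same type, so code written against the shim keeps working if the backing module is reorganized later.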

WHY THIS IS A STUB:

Same staging discipline as M-GPU-MOE-1.0-redo — contract first,
scaffold second, implementation third. The body of
forward_qwen3_moe_wgpu validates preconditions (mirroring the cuda
sibling's boundary) then returns RealizarError::UnsupportedOperation
whose reason points at the v1.2.0 amendment block for the M-GPU-MOE-2
staging plan. Until M-GPU-MOE-2.2 lands, callers on non-CUDA
hardware fall back to OwnedQuantizedModel::forward_qwen3_moe (CPU
LAZY-FUSED-MATVEC, ~30 tok/s).

VERIFICATION:

  cargo check -p aprender-serve                  → 0 errors (default)
  cargo check -p aprender-serve --features cuda  → 0 errors (cuda)
  cargo check -p aprender-serve --features gpu   → 0 errors (wgpu)
  cargo test -p aprender-serve --lib --features gpu \
      owned_quantized_model_wgpu_tests           → 1 passed

Lib unit test asserts the function signature exists and matches the
cuda sibling step-for-step (compile-time checks via fn pointer
coercion — no runtime model construction needed at the stub stage).
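
A signature drift gate built on fn pointer coercion can look like the sketch below. The types and the two-parameter `forward_moe` signature are toy stand-ins (the real methods take more arguments); only the technique is meant to carry over.

```rust
/// Toy stand-ins for the two backend wrapper types.
pub struct ToyModelCuda;
pub struct ToyModelWgpu;

impl ToyModelCuda {
    pub fn forward_moe(&mut self, token_ids: &[u32], num_experts: usize) -> Result<Vec<f32>, String> {
        let _ = (token_ids, num_experts);
        Err("cuda stub".to_string())
    }
}

impl ToyModelWgpu {
    pub fn forward_moe(&mut self, token_ids: &[u32], num_experts: usize) -> Result<Vec<f32>, String> {
        let _ = (token_ids, num_experts);
        Err("wgpu stub".to_string())
    }
}

/// Compile-time gate: each assignment coerces a method to an explicit fn
/// pointer type. If a parameter or the return type of either method drifts,
/// this function stops compiling. No model has to be constructed.
pub fn signature_drift_gate() {
    let _cuda: fn(&mut ToyModelCuda, &[u32], usize) -> Result<Vec<f32>, String> =
        ToyModelCuda::forward_moe;
    let _wgpu: fn(&mut ToyModelWgpu, &[u32], usize) -> Result<Vec<f32>, String> =
        ToyModelWgpu::forward_moe;
}
```

Because the two pointer types differ only in the receiver, a reviewer can compare them by eye to confirm the backends stay in step.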

DEPENDS ON: PR #1485 (qwen3-moe-forward-gpu-v1 v1.2.0 option I
amendment). Branch is stacked on the v1.2.0 contract branch; once
#1485 lands on main, this PR rebases onto main directly.

NEXT STAGES per v1.2.0:

  M-GPU-MOE-2.1  per-expert wgpu dispatch helpers
                 (expert_swiglu_wgpu, moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2  full forward integration mirror of cuda sibling
  M-GPU-MOE-2.3  cosine-vs-CPU parity test on wgpu hardware

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 :: M-GPU-MOE-2.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…1487)

noahgift added a commit that referenced this pull request May 4, 2026
…1487)

Implements M-GPU-MOE-2.0 per qwen3-moe-forward-gpu-v1 v1.2.0 option I
(see PR #1485 amendment). Analog of M-GPU-MOE-1.0-redo (PR #1464) for
the wgpu backend.

WHAT THIS PR ADDS:

  * crates/aprender-serve/src/gguf/wgpu_backend/mod.rs — new module
    with OwnedQuantizedModelWgpu struct + new() + stub method
    forward_qwen3_moe_wgpu(). Mirrors cuda/mod.rs structure.

  * crates/aprender-serve/src/gguf/wgpu_model.rs — re-export shim
    `pub use super::wgpu_backend::OwnedQuantizedModelWgpu`. Mirrors
    cuda_model.rs.

  * crates/aprender-serve/src/gguf/mod.rs — adds the two new modules
    behind `#[cfg(feature = "gpu")]` (the existing wgpu feature
    flag, `gpu = ["trueno/gpu"]`, per Cargo.toml line 208).

WHY MODULE NAMED `wgpu_backend`:

The Rust ecosystem already has a `wgpu` crate. A module named `wgpu`
inside the same crate would shadow it inside the file's body. The
public re-export still presents `OwnedQuantizedModelWgpu` (no ugly
suffix) thanks to wgpu_model.rs.
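
The layout described above can be sketched in a single file. This is a minimal illustration, assuming inline modules in place of the real `wgpu_backend/mod.rs` and `wgpu_model.rs` files; the struct field and the `adapter_label` method are hypothetical and exist only to make the sketch compile and run.

```rust
// Sketch only: inline modules stand in for the separate files in the PR.
pub mod wgpu_backend {
    // Stand-in for the real wrapper type; the field is hypothetical.
    pub struct OwnedQuantizedModelWgpu {
        pub adapter_name: String,
    }

    impl OwnedQuantizedModelWgpu {
        pub fn new(adapter_name: &str) -> Self {
            Self { adapter_name: adapter_name.to_string() }
        }

        // Hypothetical helper, not part of the PR.
        pub fn adapter_label(&self) -> String {
            format!("wgpu:{}", self.adapter_name)
        }
    }
}

// Re-export shim: callers name the type through `wgpu_model` and never
// see the `_backend` suffix of the implementing module.
pub mod wgpu_model {
    pub use super::wgpu_backend::OwnedQuantizedModelWgpu;
}
```

Because the implementing module is not called `wgpu`, a path such as `wgpu::Device` written in the same scope would still refer to the external crate.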

WHY THIS IS A STUB:

Same staging discipline as M-GPU-MOE-1.0-redo — contract first,
scaffold second, implementation third. The body of
forward_qwen3_moe_wgpu validates preconditions (mirroring the cuda
sibling's boundary) then returns RealizarError::UnsupportedOperation
whose reason points at the v1.2.0 amendment block for the M-GPU-MOE-2
staging plan. Until M-GPU-MOE-2.2 lands, callers on non-CUDA
hardware fall back to OwnedQuantizedModel::forward_qwen3_moe (CPU
LAZY-FUSED-MATVEC, ~30 tok/s).
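
A stub of this shape might look roughly like the following. It is a sketch under stated assumptions: `SketchError` is a local stand-in for `RealizarError`, the specific precondition checks are guesses at what a boundary check could cover, and the reduced parameter list is not the real signature.

```rust
// Local stand-in for the crate's error type (assumption, not the real enum).
#[derive(Debug, PartialEq)]
pub enum SketchError {
    InvalidInput(String),
    UnsupportedOperation { operation: String, reason: String },
}

// Hypothetical stub: validate inputs first, then refuse with a pointer to
// the staging plan. The real method takes more parameters.
pub fn forward_moe_wgpu_stub(
    token_ids: &[u32],
    num_experts: usize,
    num_experts_per_tok: usize,
) -> Result<Vec<f32>, SketchError> {
    // Assumed precondition: there must be at least one token to run.
    if token_ids.is_empty() {
        return Err(SketchError::InvalidInput(
            "token_ids must not be empty".to_string(),
        ));
    }
    // Assumed precondition: top-k routing must fit inside the expert count.
    if num_experts == 0 || num_experts_per_tok == 0 || num_experts_per_tok > num_experts {
        return Err(SketchError::InvalidInput(format!(
            "invalid routing: {num_experts_per_tok} of {num_experts} experts"
        )));
    }
    // Inputs are valid, but the body is intentionally not implemented yet.
    Err(SketchError::UnsupportedOperation {
        operation: "forward_qwen3_moe_wgpu".to_string(),
        reason: "see contract v1.2.0 amendment, M-GPU-MOE-2 staging".to_string(),
    })
}
```

Validating before refusing means a caller with bad inputs gets the same error class from the stub as from the eventual implementation, so fallback logic keyed on `UnsupportedOperation` does not mask input bugs.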

VERIFICATION:

  cargo check -p aprender-serve                  → 0 errors (default)
  cargo check -p aprender-serve --features cuda  → 0 errors (cuda)
  cargo check -p aprender-serve --features gpu   → 0 errors (wgpu)
  cargo test -p aprender-serve --lib --features gpu \
      owned_quantized_model_wgpu_tests           → 1 passed

Lib unit test asserts the function signature exists and matches the
cuda sibling step-for-step (compile-time checks via fn pointer
coercion — no runtime model construction needed at the stub stage).
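
The fn-pointer technique can be illustrated as follows. This is a sketch, assuming simplified stand-in types (`CudaModel`, `WgpuModel`, `MoeLayer`, `Unsupported`) and a shortened parameter list; none of these names come from the crate.

```rust
// Hypothetical stand-ins: these are NOT the real aprender-serve types.
pub struct MoeLayer;
pub struct CudaModel;
pub struct WgpuModel;

#[derive(Debug, PartialEq)]
pub struct Unsupported(pub &'static str);

impl CudaModel {
    pub fn forward_moe(
        &self,
        _token_ids: &[u32],
        _layers: &[MoeLayer],
        _num_experts: usize,
    ) -> Result<Vec<f32>, Unsupported> {
        Err(Unsupported("cuda sketch"))
    }
}

impl WgpuModel {
    pub fn forward_moe(
        &self,
        _token_ids: &[u32],
        _layers: &[MoeLayer],
        _num_experts: usize,
    ) -> Result<Vec<f32>, Unsupported> {
        Err(Unsupported("wgpu sketch"))
    }
}

// One shared pointer type, generic only over the receiver.
pub type ForwardMoe<M> = fn(&M, &[u32], &[MoeLayer], usize) -> Result<Vec<f32>, Unsupported>;

// Each coercion compiles only if the method has exactly this signature,
// so a drift in either backend becomes a build error, not a runtime one.
pub fn cuda_forward_ptr() -> ForwardMoe<CudaModel> {
    CudaModel::forward_moe
}

pub fn wgpu_forward_ptr() -> ForwardMoe<WgpuModel> {
    WgpuModel::forward_moe
}
```

No model is constructed from real weights here; the check is carried entirely by the type system, which is why it can run in default CI without GPU hardware.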

DEPENDS ON: PR #1485 (qwen3-moe-forward-gpu-v1 v1.2.0 option I
amendment). Branch is stacked on the v1.2.0 contract branch; once
#1485 lands on main, this PR rebases onto main directly.

NEXT STAGES per v1.2.0:

  M-GPU-MOE-2.1  per-expert wgpu dispatch helpers
                 (expert_swiglu_wgpu, moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2  full forward integration mirror of cuda sibling
  M-GPU-MOE-2.3  cosine-vs-CPU parity test on wgpu hardware

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 :: M-GPU-MOE-2.0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…arity test (#1485)

* contract(qwen3-moe-forward-gpu-v1): v1.1.0 → v1.2.0 — option I (OwnedQuantizedModelWgpu)

Pre-implementation architecture amendment for M-GPU-MOE-2 (wgpu
fallback). Mirrors the v1.1.0 option D amendment that pinned the
CUDA substrate before M-GPU-MOE-1.0 implementation; this one pins
the wgpu substrate before any wgpu code lands.

Why now: M-GPU-MOE-1 is in flight (1.0-redo SHIPPED, 1.1.1 SHIPPED,
1.1.2 OPEN as PR #1477, 1.2 test scaffold OPEN as PR #1484).
Choosing the wgpu seam early prevents the wrong-type-stub waste
that bit M-GPU-MOE-1.0 (PR #1460 placed forward_qwen3_moe_gpu on
OwnedQuantizedModel; one cycle later #1464 redid it on
OwnedQuantizedModelCuda — option D).

FOUR options considered:
  (I)   OwnedQuantizedModelWgpu wrapper type (analog of v1.1.0 option D) — CHOSEN
  (II)  GpuExecutor trait abstracting CUDA + wgpu — REJECTED (over-engineered)
  (III) Backend enum inside renamed OwnedQuantizedModelGpu — REJECTED (invasive)
  (IV)  Defer wgpu indefinitely — REJECTED (violates CLAUDE.md backend-agnostic mandate)

Option I picks wgpu by code-path symmetry, not by trait abstraction:
new file tree at `crates/aprender-serve/src/gguf/wgpu/` mirrors
`crates/aprender-serve/src/gguf/cuda/` line-for-line. A maintenance-mode
reviewer can check for a parity bug by diff rather than by elaborate
test infrastructure.

M-GPU-MOE-2 decomposed into four substages mirroring M-GPU-MOE-1.x:
  M-GPU-MOE-2.0 stub on OwnedQuantizedModelWgpu
  M-GPU-MOE-2.1 per-expert wgpu dispatch helpers (expert_swiglu_wgpu,
                moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2 full forward integration (replaces 2.0 stub body)
  M-GPU-MOE-2.3 cosine-vs-CPU parity test on hardware with wgpu

Two new blockers documented:
  - wgpu adapter selection probe for non-NVIDIA hardware
  - trueno-gpu Q6_K QuantizeKernel coverage check before 2.1

Companion-spec records this as M52 (no companion contract bump).

Validation:
  pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s). Contract is valid.

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 option I.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(aprender-serve): OwnedQuantizedModelWgpu stub — M-GPU-MOE-2.0 (#1487)

Implements M-GPU-MOE-2.0 per qwen3-moe-forward-gpu-v1 v1.2.0 option I
(see PR #1485 amendment). Analog of M-GPU-MOE-1.0-redo (PR #1464) for
the wgpu backend.

WHAT THIS PR ADDS:

  * crates/aprender-serve/src/gguf/wgpu_backend/mod.rs — new module
    with OwnedQuantizedModelWgpu struct + new() + stub method
    forward_qwen3_moe_wgpu(). Mirrors cuda/mod.rs structure.

  * crates/aprender-serve/src/gguf/wgpu_model.rs — re-export shim
    `pub use super::wgpu_backend::OwnedQuantizedModelWgpu`. Mirrors
    cuda_model.rs.

  * crates/aprender-serve/src/gguf/mod.rs — adds the two new modules
    behind `#[cfg(feature = "gpu")]` (the existing wgpu feature
    flag, `gpu = ["trueno/gpu"]`, per Cargo.toml line 208).

WHY MODULE NAMED `wgpu_backend`:

The Rust ecosystem already has a `wgpu` crate. A module named `wgpu`
inside the same crate would shadow it inside the file's body. The
public re-export still presents `OwnedQuantizedModelWgpu` (no ugly
suffix) thanks to wgpu_model.rs.

WHY THIS IS A STUB:

Same staging discipline as M-GPU-MOE-1.0-redo — contract first,
scaffold second, implementation third. The body of
forward_qwen3_moe_wgpu validates preconditions (mirroring the cuda
sibling's boundary) then returns RealizarError::UnsupportedOperation
whose reason points at the v1.2.0 amendment block for the M-GPU-MOE-2
staging plan. Until M-GPU-MOE-2.2 lands, callers on non-CUDA
hardware fall back to OwnedQuantizedModel::forward_qwen3_moe (CPU
LAZY-FUSED-MATVEC, ~30 tok/s).

VERIFICATION:

  cargo check -p aprender-serve                  → 0 errors (default)
  cargo check -p aprender-serve --features cuda  → 0 errors (cuda)
  cargo check -p aprender-serve --features gpu   → 0 errors (wgpu)
  cargo test -p aprender-serve --lib --features gpu \
      owned_quantized_model_wgpu_tests           → 1 passed

Lib unit test asserts the function signature exists and matches the
cuda sibling step-for-step (compile-time checks via fn pointer
coercion — no runtime model construction needed at the stub stage).

DEPENDS ON: PR #1485 (qwen3-moe-forward-gpu-v1 v1.2.0 option I
amendment). Branch is stacked on the v1.2.0 contract branch; once
#1485 lands on main, this PR rebases onto main directly.

NEXT STAGES per v1.2.0:

  M-GPU-MOE-2.1  per-expert wgpu dispatch helpers
                 (expert_swiglu_wgpu, moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2  full forward integration mirror of cuda sibling
  M-GPU-MOE-2.3  cosine-vs-CPU parity test on wgpu hardware

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 :: M-GPU-MOE-2.0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* test(aprender-serve): qwen3_moe_wgpu_parity — M-GPU-MOE-2.3 cosine ≥0.99 falsifier (wgpu) (#1488)

wgpu sibling of `qwen3_moe_gpu_parity.rs` (M-GPU-MOE-1.2, PR #1484).
Asserts cosine ≥ 0.99 between APR's CPU `forward_qwen3_moe` reference
and the wgpu `OwnedQuantizedModelWgpu::forward_qwen3_moe_wgpu`
integration on the same prompt.

Same falsifier ID as the cuda sibling
(FALSIFY-QW3-MOE-GPU-PARITY-001) — wgpu is a SECOND backend
implementing the same contract gate, not a different gate. Same
threshold (≥ 0.99), same canonical 17.3 GB Qwen3-Coder GGUF, same
3-token canonical prompt as the cuda test.
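
The cosine gate itself is small enough to sketch. The helper below is an illustration, not the test file's actual code: the `Option` return for mismatched or zero-norm inputs and the `passes_parity_gate` name are assumptions, while the 0.99 threshold comes from the falsifier described above.

```rust
// Cosine similarity between two logit vectors, accumulated in f64.
// Returns None when the inputs cannot be compared meaningfully.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    // A zero vector has no direction, so the cosine is undefined.
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

// Hypothetical gate wrapper: true only when the cosine is defined and
// meets the 0.99 threshold named in the falsifier.
pub fn passes_parity_gate(reference: &[f32], candidate: &[f32]) -> bool {
    matches!(cosine_similarity(reference, candidate), Some(c) if c >= 0.99)
}
```

Treating an undefined cosine as a failure keeps the gate conservative: an all-zero output from a broken backend cannot pass by accident.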

CI WIRING:

  - #[cfg(feature = "gpu")] gates the file (matches the gate on
    OwnedQuantizedModelWgpu in gguf/mod.rs)
  - #[ignore] on the heavy test (CI default skips; explicit
    `--include-ignored` runs it on a wgpu-capable adapter — Apple
    Silicon Metal, AMD Vulkan, Intel ARC Vulkan)
  - 2 helper unit tests (cosine_similarity sanity coverage) DO run
    by default

WHEN THE TEST PASSES:

  - M-GPU-MOE-2.0 stub returns UnsupportedOperation, so this test
    currently panics at the wgpu forward call (correct behaviour
    for a falsifier against an incomplete impl).
  - M-GPU-MOE-2.1 (per-expert wgpu helpers via trueno-gpu
    QuantizeKernel + GemmKernel compute pipelines) + M-GPU-MOE-2.2
    (full forward integration analog of forward_qwen3_moe_cuda)
    must both land before this test passes on hardware.
  - On hardware with wgpu support, run with --include-ignored to
    exercise. PASS discharges FALSIFY-QW3-MOE-GPU-PARITY-001 for
    the wgpu backend (cuda backend discharged by sibling test).

DEPENDS ON: PR #1485 (v1.2.0 amendment + M-GPU-MOE-2.0 stub).
Branch is stacked on the v1.2.0 contract branch; once #1485 lands
on main, this PR's base flips to main automatically.

Refs: M52, M53, R10, qwen3-moe-forward-gpu-v1 v1.2.0 ::
M-GPU-MOE-2.3 + FALSIFY-QW3-MOE-GPU-PARITY-001 (wgpu).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…d-bug fix plan

Live-dogfood finding 2026-05-04 on lambda-vector RTX 4090: the
M-GPU-MOE-1.2 heavy `qwen3_moe_gpu_parity` test (FALSIFY-QW3-MOE-
GPU-PARITY-001) cannot run on the cached 17.3 GB Qwen3-Coder GGUF
because `OwnedQuantizedModelCuda::new` itself fails:

  UnsupportedOperation { operation: "preload_weights_gpu",
    reason: "PAR-043: Failed to build indexed weights:
             Invalid launch config: Quantized weight
             'blk.0.ffn_gate.weight' not cached" }

ROOT CAUSE (5-whys in evidence file):

  `executor.build_indexed_weights` at
  `crates/aprender-serve/src/cuda/executor/weights.rs:325-373`
  unconditionally requires `blk.{i}.ffn_gate.weight`,
  `.ffn_up.weight`, `.ffn_down.weight` to be cached for every
  layer. For MoE these names DO NOT EXIST — MoE has 128 expert
  gates per layer (`blk.{i}.ffn_gate_exps.weight`) loaded into
  the `moe_layers` parameter at forward-time.

  M-GPU-MOE-1.1.2 (PR #1477)'s forward body sidesteps the indexed
  weights for FFN, but the wrapper construction goes through
  `preload_weights_gpu` BEFORE forward is ever called. Wrapper
  construction fails first.
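
  The failure mode can be reproduced in miniature. The sketch below is not the code in weights.rs and not the planned fix; the function names and the `is_moe` flag are hypothetical, and the only facts taken from the report are the dense tensor names and `blk.{i}.ffn_gate_exps.weight`.

```rust
use std::collections::HashSet;

// Dense FFN tensor names a preload step would require for one layer.
// For MoE the sketch requires none, since expert weights arrive later
// through the forward call (assumption about how a fix could branch).
pub fn required_ffn_weights(layer: usize, is_moe: bool) -> Vec<String> {
    if is_moe {
        return Vec::new();
    }
    ["ffn_gate", "ffn_up", "ffn_down"]
        .iter()
        .map(|name| format!("blk.{layer}.{name}.weight"))
        .collect()
}

// First required tensor that is absent from the cache, if any.
pub fn first_missing_weight(
    cached: &HashSet<String>,
    num_layers: usize,
    is_moe: bool,
) -> Option<String> {
    (0..num_layers)
        .flat_map(|layer| required_ffn_weights(layer, is_moe))
        .find(|name| !cached.contains(name))
}
```

  With a cache holding only `blk.0.ffn_gate_exps.weight`, the unconditional (dense) path reports `blk.0.ffn_gate.weight` as missing, which matches the error text above, while the MoE-aware path reports nothing missing.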

WHY DEFAULT CI DIDN'T CATCH IT:

  Lib-only stub test (PR #1464) only checks signature at compile
  time. Heavy `qwen3_moe_gpu_parity.rs` (PR #1484) is `#[ignore]`d
  + needs RTX 4090 + 17.3 GB GGUF. First `--include-ignored`
  dogfood on lambda-vector found this 2026-05-04.

THIS PR ADDS:

  (1) Evidence file
      `evidence/m-gpu-moe-1-2-blocked-by-preload-bug-2026-05-04/findings.md`
      documenting the live failure + 5-whys + fix architecture.

  (2) Contract `qwen3-moe-forward-gpu-v1` v1.2.0 → v1.3.0:
      * New v1.3.0 amendment_history block (~110 lines) describing
        the bug, root cause, and three-step fix architecture
      * New implementation_stage `M-GPU-MOE-1.3` between 1.2 and 2
        with status PENDING
      * New falsification_test FALSIFY-QW3-MOE-GPU-PRELOAD-001
        (hardware test + lib-only sibling)
      * Top-level version "1.2.0" → "1.3.0"
      * Status comment expanded to mention M-GPU-MOE-1.3 as a
        precondition for ACTIVE_ALGORITHM_LEVEL flip

VALIDATION: pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
            → 0 errors, 0 warnings. Contract is valid.

WHAT THIS PR DOES NOT DO:

  Does NOT implement the fix. Per CLAUDE.md "NEVER write code
  before writing a provable contract", this PR pins the contract
  first. The fix lands in a separate PR (M-GPU-MOE-1.3 stage):
  ~30 LOC in weights.rs + 1-2 callers + ArchConstraints field +
  drift-prevention test.

  Does NOT block PR #1485's already-shipped 3-commit cascade
  (M52/M54). The cascade is correct; M-GPU-MOE-1.3 is a sibling
  bug-fix.

Refs: M52, M53, M54, R10, qwen3-moe-forward-gpu-v1 v1.3.0,
      FALSIFY-QW3-MOE-GPU-PRELOAD-001 (new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…d-bug fix plan (#1490)

Live-dogfood finding 2026-05-04 on lambda-vector RTX 4090: the
M-GPU-MOE-1.2 heavy `qwen3_moe_gpu_parity` test (FALSIFY-QW3-MOE-
GPU-PARITY-001) cannot run on the cached 17.3 GB Qwen3-Coder GGUF
because `OwnedQuantizedModelCuda::new` itself fails:

  UnsupportedOperation { operation: "preload_weights_gpu",
    reason: "PAR-043: Failed to build indexed weights:
             Invalid launch config: Quantized weight
             'blk.0.ffn_gate.weight' not cached" }

ROOT CAUSE (5-whys in evidence file):

  `executor.build_indexed_weights` at
  `crates/aprender-serve/src/cuda/executor/weights.rs:325-373`
  unconditionally requires `blk.{i}.ffn_gate.weight`,
  `.ffn_up.weight`, `.ffn_down.weight` to be cached for every
  layer. For MoE these names DO NOT EXIST — MoE has 128 expert
  gates per layer (`blk.{i}.ffn_gate_exps.weight`) loaded into
  the `moe_layers` parameter at forward-time.

  M-GPU-MOE-1.1.2 (PR #1477)'s forward body sidesteps the indexed
  weights for FFN, but the wrapper construction goes through
  `preload_weights_gpu` BEFORE forward is ever called. Wrapper
  construction fails first.

WHY DEFAULT CI DIDN'T CATCH IT:

  Lib-only stub test (PR #1464) only checks signature at compile
  time. Heavy `qwen3_moe_gpu_parity.rs` (PR #1484) is `#[ignore]`d
  + needs RTX 4090 + 17.3 GB GGUF. First `--include-ignored`
  dogfood on lambda-vector found this 2026-05-04.

THIS PR ADDS:

  (1) Evidence file
      `evidence/m-gpu-moe-1-2-blocked-by-preload-bug-2026-05-04/findings.md`
      documenting the live failure + 5-whys + fix architecture.

  (2) Contract `qwen3-moe-forward-gpu-v1` v1.2.0 → v1.3.0:
      * New v1.3.0 amendment_history block (~110 lines) describing
        the bug, root cause, and three-step fix architecture
      * New implementation_stage `M-GPU-MOE-1.3` between 1.2 and 2
        with status PENDING
      * New falsification_test FALSIFY-QW3-MOE-GPU-PRELOAD-001
        (hardware test + lib-only sibling)
      * Top-level version "1.2.0" → "1.3.0"
      * Status comment expanded to mention M-GPU-MOE-1.3 as a
        precondition for ACTIVE_ALGORITHM_LEVEL flip

VALIDATION: pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
            → 0 errors, 0 warnings. Contract is valid.

WHAT THIS PR DOES NOT DO:

  Does NOT implement the fix. Per CLAUDE.md "NEVER write code
  before writing a provable contract", this PR pins the contract
  first. The fix lands in a separate PR (M-GPU-MOE-1.3 stage):
  ~30 LOC in weights.rs + 1-2 callers + ArchConstraints field +
  drift-prevention test.

  Does NOT block PR #1485's already-shipped 3-commit cascade
  (M52/M54). The cascade is correct; M-GPU-MOE-1.3 is a sibling
  bug-fix.

Refs: M52, M53, M54, R10, qwen3-moe-forward-gpu-v1 v1.3.0,
      FALSIFY-QW3-MOE-GPU-PRELOAD-001 (new).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>