feat(aprender-serve): forward_qwen3_moe_traced_with_plan — M-MOE-SUB-2 step (a)#1516
Merged
…2 step (a)
Wires SaveTensorStage::MoeRouter + MoeFfnOut emission into the CPU
traced MoE forward path per
`contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.1.0 step (a).
## Design
Adds new method `forward_qwen3_moe_traced_with_plan` accepting
`Option<&SaveTensorPlan>`. The existing `forward_qwen3_moe_traced`
becomes a thin one-line delegate passing `None` — public API
unchanged, zero-cost when no plan is set (single Option discriminant
check).
When the plan selects MoeRouter or MoeFfnOut for a given layer, the
last sequence position's MoE forward is dispatched through
`moe_ffn_forward_layer_with_router` (M68 helper) to obtain top-k
router weights without re-running the MoE forward. All other
sequence positions (and the last position when neither stage is
selected) continue using the production `moe_ffn_forward_layer` so
trace cost stays minimal.
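
The delegate-plus-last-position rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Stage`, `DispatchPath`, the `selected` field, `wants_moe_stage`, and the two function names are hypothetical stand-ins, and the sketch only records which path each (layer, position) pair would take instead of running any forward pass.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the two stages named in the PR text.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Stage {
    MoeRouter,
    MoeFfnOut,
}

/// Hypothetical stand-in for the real plan type: a set of (layer, stage) pairs.
pub struct SaveTensorPlan {
    pub selected: HashSet<(usize, Stage)>,
}

impl SaveTensorPlan {
    /// True when either MoE stage is selected for `layer`.
    fn wants_moe_stage(&self, layer: usize) -> bool {
        self.selected.contains(&(layer, Stage::MoeRouter))
            || self.selected.contains(&(layer, Stage::MoeFfnOut))
    }
}

/// Which MoE FFN implementation a given (layer, position) would use.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DispatchPath {
    Production,
    WithRouter,
}

/// Plan-aware body: only the last sequence position of a selected layer
/// takes the router-returning path; everything else stays on production.
pub fn traced_with_plan(
    num_layers: usize,
    seq_len: usize,
    plan: Option<&SaveTensorPlan>,
) -> Vec<Vec<DispatchPath>> {
    (0..num_layers)
        .map(|layer| {
            // With `None` this is a single discriminant check per layer.
            let capture = plan.is_some_and(|p| p.wants_moe_stage(layer));
            (0..seq_len)
                .map(|pos| {
                    if capture && pos + 1 == seq_len {
                        DispatchPath::WithRouter
                    } else {
                        DispatchPath::Production
                    }
                })
                .collect()
        })
        .collect()
}

/// No-plan delegate: the pre-existing entry point keeps its signature.
pub fn traced(num_layers: usize, seq_len: usize) -> Vec<Vec<DispatchPath>> {
    traced_with_plan(num_layers, seq_len, None)
}
```

With a plan that selects `MoeRouter` for layer 1 only, a 2-layer, 3-token call would put exactly one position (layer 1, position 2) on the router-returning path.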
## What this discharges
Per FALSIFY-MOE-SUB-002 contract:
- Helper byte-identity preserved: `moe_ffn_forward_layer_with_router`
produces the same `output` Vec as production (asserted by step c
M68 unit tests).
- Production `forward_qwen3_moe` / `forward_qwen3_moe_cuda` hot paths
unchanged byte-for-byte.
- `forward_qwen3_moe_traced` public API unchanged (delegate
pattern).
- Plan-aware code path emits MoeRouter as `[num_experts_per_tok]`
+ MoeFfnOut as `[hidden_dim]` to disk via existing
`maybe_save_stage` machinery (same machinery used by
`forward_traced_with_plan` for SHIP-007 SaveTensor).
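
To make the `[num_experts_per_tok]` shape of the MoeRouter tensor concrete, here is a generic top-k routing sketch. It assumes softmax over the router logits followed by renormalisation over the selected experts; whether `moe_ffn_forward_layer_with_router` normalises this way is an assumption, and `top_k_router` is a hypothetical name.

```rust
/// Returns `(expert_index, weight)` for the `k` largest router logits.
/// The result has length `min(k, logits.len())`, i.e. `num_experts_per_tok`
/// entries in the usual case.
pub fn top_k_router(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stable softmax over all experts.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().map(|&e| e / sum).enumerate().collect();

    // Keep the k most probable experts (stable sort: ties keep index order).
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k);

    // Assumption: weights are renormalised to sum to 1 over the chosen experts.
    let top_sum: f32 = probs.iter().map(|p| p.1).sum();
    for p in probs.iter_mut() {
        p.1 /= top_sum;
    }
    probs
}
```

For four experts and `k = 2`, the returned vector holds two weights, which is the per-token payload a MoeRouter capture of shape `[num_experts_per_tok]` would carry under these assumptions.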
Full discharge of FALSIFY-MOE-SUB-002 needs M-MOE-SUB-2 step (b)
(GPU sibling) + M-MOE-SUB-3 (live bisection on lambda-vector RTX
4090) + M-GPU-MOE-1.4 (fix at bisected stage).
## Verification
$ cargo build -p aprender-serve --release
clean
$ cargo test -p aprender-serve --release --lib gguf::qwen3_moe_load
8 passed
$ cargo clippy -p aprender-serve --lib --release -- -D warnings
clean
$ rustfmt --check forward_qwen3_moe_traced.rs
clean
## What this does NOT ship
- M-MOE-SUB-2 step (b): NEW `forward_qwen3_moe_cuda_traced.rs`
GPU sibling — separate PR.
- Wiring in `apr trace` CLI dispatch site to actually pass a plan
through to `forward_qwen3_moe_traced_with_plan` — separate PR
(current `apr trace` for MoE still calls `forward_qwen3_moe_traced`
with no plan).
- End-to-end SaveTensor verification on lambda-vector RTX 4090 —
exercised via M-MOE-SUB-3.
Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.1.0 step (a),
M68 helper PR #1507 (squash 0f22c78),
M-GPU-MOE-1.4 NaN/Inf bisection plan
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 5, 2026
…) M-MOE-SUB-2 step (a) CLI completion: connects clap surface (--save-tensor / --save-tensor-layers / --save-tensor-dir, PR-A #1405) through to forward_qwen3_moe_traced_with_plan (M74, PR #1516 squash 3138d13) for Qwen3-MoE-arch GGUF models.

## What ships
1. New pub fn `run_save_tensor_gguf_moe(path, stages, dir, layers)` in `crates/apr-cli/src/commands/trace_save_tensor.rs`. Mirrors the structure of the existing `run_save_tensor_apr` for APR models, but loads via `MappedGGUFModel` / `OwnedQuantizedModel`, validates `qwen3_moe` arch (rejects dense GGUF with a clear error), reads MoE config from GGUF metadata (`expert_count`, `expert_used_count`, `expert_feed_forward_length`), loads per-layer `Qwen3MoeQuantizedLayer` descriptors, then dispatches to `forward_qwen3_moe_traced_with_plan` with the plan derived from the CLI args.
2. Dispatch wireup in `dispatch.rs::dispatch_diagnostic_commands` under the `Commands::Trace` arm. The previous code dispatched `--save-tensor` for `.apr` only and printed a stub for other extensions; now `.gguf` dispatches to the new `run_save_tensor_gguf_moe` function. Other extensions (.safetensors) still print the stub pending SHIP-007 PR-E.

## What this does NOT ship
- Dense GGUF SaveTensor wireup (still falls through to stub).
- M-MOE-SUB-2 step (b) GPU sibling `forward_qwen3_moe_cuda_traced.rs`: separate PR.
- Live verification on lambda-vector RTX 4090 against the cached 17.3 GB Qwen3-Coder GGUF: exercised in M-MOE-SUB-3.

## Hot path safety
Production `forward_qwen3_moe` / `forward_qwen3_moe_cuda` hot paths byte-unchanged (additive-purity invariant pinned in v1.1.0). Production `forward_qwen3_moe_traced` (no plan) also unchanged; the new wireup uses the M74 sibling `forward_qwen3_moe_traced_with_plan`.

## Verification
$ cargo build -p apr-cli --release --features inference
clean
$ cargo clippy -p apr-cli --lib --release --features inference \
    -- -D warnings
clean
$ rustfmt --check trace_save_tensor.rs dispatch.rs
clean
$ cargo test -p apr-cli --release --lib commands::trace_save_tensor
5 passed (existing tests preserved)

## Falsifier impact
- FALSIFY-MOE-SUB-002 (byte-identity preservation): still partial; needs M-MOE-SUB-2 step (b) GPU sibling for full discharge.
- M-MOE-SUB-3 live bisection: now unblocked operationally. Invoking `apr trace --save-tensor moe_router,moe_ffn_out --save-tensor-layers 0..48 --save-tensor-dir <dir> <gguf>` on lambda-vector RTX 4090 will produce per-layer MoeRouter + MoeFfnOut tensor files for the cached Qwen3-Coder GGUF, ready for diff vs the GPU sibling output once step (b) ships.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.1.0 step (a) CLI completion,
M68 helper PR #1507 (squash 0f22c78),
M74 forward_qwen3_moe_traced_with_plan PR #1516 (squash 3138d13)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
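
The extension-based dispatch described under "What ships" can be pictured with the sketch below. It is an illustration only: `SaveTensorRoute` and `route_save_tensor` are hypothetical names, and the real code in `dispatch.rs` is not shown in this commit message. The `qwen3_moe` arch check that rejects dense GGUF happens after loading and is not modelled here.

```rust
use std::path::Path;

/// Hypothetical routing outcome for `apr trace --save-tensor`.
#[derive(Debug, PartialEq, Eq)]
pub enum SaveTensorRoute {
    /// `.apr`: the existing `run_save_tensor_apr` path.
    Apr,
    /// `.gguf`: the new `run_save_tensor_gguf_moe` path.
    GgufMoe,
    /// Anything else (for example `.safetensors`): still the stub.
    Stub,
}

/// Chooses a route from the file extension alone.
pub fn route_save_tensor(path: &Path) -> SaveTensorRoute {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some("apr") => SaveTensorRoute::Apr,
        Some("gguf") => SaveTensorRoute::GgufMoe,
        _ => SaveTensorRoute::Stub,
    }
}
```

The match is case-sensitive in this sketch; whether the real dispatcher normalises extension case is not stated in the source.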
noahgift added a commit that referenced this pull request on May 6, 2026
…p (b) (#1523)

GPU traced sibling of forward_qwen3_moe_cuda. Mirrors the CPU traced sibling (M32d Step 2 + PR #1516 _with_plan extension) but routes per-layer MoE FFN through the GPU dispatch so `apr trace --gpu --json --payload --save-tensor` can run the same SaveTensorPlan against both CPU and GPU forward paths, capture per-stage activations at MoeRouter and MoeFfnOut, and bisect the M-GPU-MOE-1.4 NaN/Inf poisoning to its first divergence point.

What ships
==========
- `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced(token_ids, moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data)` → Result<ForwardTrace>. No-plan delegate.
- `forward_qwen3_moe_cuda_traced_with_plan(..., plan: Option<&SaveTensorPlan>)` → Result<ForwardTrace>. The plan-aware body.
- New file `crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda_traced.rs` (~430 LOC including doc-comments).
- `include!()` registered in `cuda/uses.rs`.
- Lib-only signature drift gate test `forward_qwen3_moe_cuda_traced_signature_drift_gate`.

End-to-end byte-identity vs production sibling exercised by the heavy `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 against the cached 17.3 GB Qwen3-Coder GGUF (M-MOE-SUB-3).

Hot path safety
===============
Production `forward_qwen3_moe_cuda` is unchanged byte-for-byte. This is a parallel slow path used only by `apr trace --gpu`. The per-token loop dispatches the GPU MoE FFN identically to production for non-capture positions; the LAST sequence position uses `moe_ffn_forward_layer_cuda_with_router` (PR #1522) when the plan selects MoeRouter or MoeFfnOut so the router weights can be emitted without recomputation.

Closes (a)+(b)+(c.gpu) of step (b)
==================================
The triplet for M-MOE-SUB-2 is now complete:
- step (a) CPU body: PR #1516 (forward_qwen3_moe_traced_with_plan)
- step (a) CLI wireup: PR #1521 (apr trace --save-tensor for GGUF MoE)
- step (b) GPU body: THIS PR
- step (c) CPU helper: PR #1507 (moe_ffn_forward_layer_with_router)
- step (c.gpu) GPU helper: PR #1522 (moe_ffn_forward_layer_cuda_with_router)

M-MOE-SUB-3 next: heavy parity test on lambda-vector RTX 4090, diff CPU vs GPU at MoeRouter and MoeFfnOut, identify first divergence.

Falsifier
=========
FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages). This PR DOES NOT discharge it (heavy parity test required at M-MOE-SUB-3); it provides the GPU traced sibling that will run the test.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (b)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
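
One common way to build a signature drift gate like the one listed under "What ships" is to coerce the function to an explicit fn-pointer type, so any change to its parameters or return type becomes a compile error. The sketch below assumes that approach; the source does not show how `forward_qwen3_moe_cuda_traced_signature_drift_gate` is actually written, and the types and the simplified two-argument signature here are hypothetical.

```rust
/// Hypothetical stand-ins for the real trace and plan types.
pub struct ForwardTrace {
    pub tokens_seen: usize,
}
pub struct SaveTensorPlan;

/// Simplified stand-in for a plan-aware traced forward function.
pub fn forward_traced_with_plan(
    token_ids: &[u32],
    plan: Option<&SaveTensorPlan>,
) -> Result<ForwardTrace, String> {
    let _ = plan; // the sketch ignores the plan
    Ok(ForwardTrace { tokens_seen: token_ids.len() })
}

/// The pinned signature. Editing the function without editing this alias
/// (or vice versa) stops the crate from compiling.
pub type PinnedSignature =
    fn(&[u32], Option<&SaveTensorPlan>) -> Result<ForwardTrace, String>;

/// Drift gate: the coercion below is checked at compile time, so the
/// "test" is passing as soon as this function type-checks.
pub fn signature_drift_gate() -> PinnedSignature {
    let pinned: PinnedSignature = forward_traced_with_plan;
    pinned
}
```

A gate of this kind needs no GPU or model file, which fits the "lib-only" description, while the byte-identity check is left to the heavy test.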
noahgift added a commit that referenced this pull request on May 6, 2026
…-SUB-3 (#1524)

Per `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.2.0 step M-MOE-SUB-3: heavy diagnostic test that runs CPU-traced + GPU-traced forward bodies (M-MOE-SUB-2 step (a) PR #1516 + step (b) PR #1523) with a SaveTensorPlan capturing MoeRouter and MoeFfnOut for every layer, then computes per-layer per-stage cosine similarity to identify the first layer where the GPU diverges from the CPU.

What ships
==========
- New heavy test `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff` in `crates/aprender-serve/tests/qwen3_moe_gpu_per_stage_diff.rs`.
- 6 light unit tests for the verdict classifier (`Match`, `Diverge`, `NanGpu`, `NanCpu`, `NanBoth`, `Missing`); all 6 pass.
- Skip-if-not-present pattern for the canonical 17.3 GB Qwen3-Coder GGUF (matches the existing `qwen3_moe_gpu_parity` test convention).
- `#[ignore]` + `#[cfg(feature = "cuda")]` gating per the repo's heavy-test convention.

Invocation
==========
cargo test -p aprender-serve --features cuda \
    --test qwen3_moe_gpu_per_stage_diff \
    -- --include-ignored --nocapture

What the harness does NOT do (yet)
==================================
- Does NOT assert pass criteria. The full FALSIFY-QW3-MOE-GPU-PARITY-001 cosine threshold lives in the existing qwen3_moe_gpu_parity test; this is a diagnostic harness for the M-GPU-MOE-1.4 NaN/Inf bisection per qwen3-moe-forward-gpu-v1 v1.4.0 amendment_history block.
- Does NOT clean up `/tmp/moe-sub-{cpu,gpu}-<pid>/` dirs (operator inspects them for raw bytes if cosine is ambiguous).

Falsifier
=========
FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages). This PR ships the harness; an operator-dispatched run on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall) will produce the layer-by-layer divergence table and pinpoint the first M-GPU-MOE-1.4 bug-origin candidate. Threshold-based discharge (`cosine ≥ 0.99`) becomes meaningful AFTER the bug is fixed.

M-MOE-SUB-3 status after this PR
================================
- Test harness: SHIPPED (this PR)
- Run on lambda-vector + interpret table: operator-dispatched
- Promote to FALSIFY-MOE-SUB-002 DISCHARGED: gated on M-GPU-MOE-1.4 fix

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (a) PR #1516, step (b) PR #1523
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4 NaN/Inf bisection
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
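
A per-stage verdict classifier along the lines described above might look like the following sketch. The six variant names and the `0.99` cosine threshold come from the commit message; the function names, the decision order, and the choice to treat a length mismatch as `Missing` are assumptions made for illustration.

```rust
/// Verdict names as listed in the commit message.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Match,
    Diverge,
    NanGpu,
    NanCpu,
    NanBoth,
    Missing,
}

/// True if any element is NaN or +/-Inf.
fn has_non_finite(values: &[f32]) -> bool {
    values.iter().any(|x| !x.is_finite())
}

/// Cosine similarity accumulated in f64; NaN when either vector is all zeros.
pub fn cosine(a: &[f32], b: &[f32]) -> f64 {
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// Classifies one (layer, stage) pair of captured tensors.
pub fn classify(cpu: Option<&[f32]>, gpu: Option<&[f32]>, threshold: f64) -> Verdict {
    // Assumption: an absent tensor or a shape mismatch is reported as Missing.
    let (cpu, gpu) = match (cpu, gpu) {
        (Some(c), Some(g)) if c.len() == g.len() => (c, g),
        _ => return Verdict::Missing,
    };
    match (has_non_finite(cpu), has_non_finite(gpu)) {
        (true, true) => Verdict::NanBoth,
        (true, false) => Verdict::NanCpu,
        (false, true) => Verdict::NanGpu,
        // A NaN cosine (zero vectors) fails the comparison and lands in Diverge.
        (false, false) if cosine(cpu, gpu) >= threshold => Verdict::Match,
        (false, false) => Verdict::Diverge,
    }
}
```

Checking for non-finite values before computing the cosine keeps a poisoned GPU tensor from being reported as a plain divergence, which matters when the goal is to find the first layer where NaN/Inf appears.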
noahgift added a commit that referenced this pull request on May 6, 2026
…scade complete on main (#1525)

Promotes status PROPOSED → ACTIVE_ALGORITHM_LEVEL after all 5 cascade PRs land. M-MOE-SUB-1, M-MOE-SUB-2 (a + b + c + c.gpu), M-MOE-SUB-3 (harness) status: PENDING → SHIPPED. M-MOE-SUB-4 stays PENDING (optional, only needed if M-MOE-SUB-3's diff doesn't pinpoint at MoeRouter / MoeFfnOut granularity).

Cited PRs (chronological)
=========================
- #1507: moe_ffn_forward_layer_with_router (CPU helper, step c)
- #1516: forward_qwen3_moe_traced_with_plan (CPU body, step a)
- #1521: apr trace --save-tensor GGUF MoE CLI wireup (step a CLI)
- #1522: moe_ffn_forward_layer_cuda_with_router (GPU helper, step c.gpu)
- #1523: forward_qwen3_moe_cuda_traced (GPU body, step b)
- #1524: heavy diff harness (M-MOE-SUB-3)

What's left
===========
- Operator-dispatched run of `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff` on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall) → produces layer-by-layer divergence table.
- M-MOE-SUB-3 ALGORITHM_LEVEL → FUNCTIONAL upon operator run.
- FALSIFY-MOE-SUB-003 → DISCHARGED gated on M-GPU-MOE-1.4 root-cause fix.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Wires `SaveTensorStage::MoeRouter` + `MoeFfnOut` emission into the CPU traced MoE forward path per `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.1.0 step (a).
Design
Adds new method `forward_qwen3_moe_traced_with_plan` accepting `Option<&SaveTensorPlan>`. The existing `forward_qwen3_moe_traced` becomes a thin one-line delegate passing `None` — public API unchanged, zero-cost when no plan is set.
When plan selects MoeRouter or MoeFfnOut for a layer, the last-position MoE forward dispatches through `moe_ffn_forward_layer_with_router` (M68 helper) to obtain top-k router weights without re-running the MoE forward. Other positions use the production `moe_ffn_forward_layer`.
Discharges (partial)
FALSIFY-MOE-SUB-002:
Full discharge needs step (b) GPU sibling + M-MOE-SUB-3 live bisection + M-GPU-MOE-1.4 fix.
Refs: `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.1.0 step (a),
M68 helper PR #1507 (squash 0f22c78)
🤖 Generated with Claude Code