contract(trace-moe-gpu-sub-stages-v1): v1.0.0 → v1.1.0 — clarify M-MOE-SUB-2 traced target#1503
Merged
noahgift merged 1 commit intoMay 5, 2026
Conversation
…E-SUB-2 traced target
Code archaeology found pre-existing `forward_qwen3_moe_traced` (M32d
Step 2 work) — the natural extension point for SaveTensorPlan
capture. v1.0.0's M-MOE-SUB-2 description was ambiguous: "wire into
both forward_qwen3_moe AND forward_qwen3_moe_cuda" could be read as
modifying the production hot paths, which would force every dense
caller to plumb None and add a per-token branch.
v1.1.0 clarifies:
(a) Extend `forward_qwen3_moe_traced` (CPU traced sibling,
pre-existing) to accept Option<&SaveTensorPlan>.
(b) Author NEW `forward_qwen3_moe_cuda_traced` (GPU traced
sibling, ~150-300 LOC mirroring CPU).
(c) Add sibling `moe_ffn_forward_layer_with_router` to
qwen3_moe_load.rs (production sibling stays byte-identical).
(d) Production hot paths (forward_qwen3_moe + forward_qwen3_moe_cuda)
MUST NOT be modified — preserves "additive purity" invariant.
UPDATED FIELDS:
* metadata.version: 1.0.0 → 1.1.0
* metadata.description: 23-line v1.1.0 amendment block at top
* implementation_stages.M-MOE-SUB-2: expanded into 3-part (a)/(b)/(c)
description; added explicit blockers ("must NOT modify production")
* implementation_stages.M-MOE-SUB-3: clarified targets the **traced**
forward bodies (NOT the production hot paths)
VALIDATION: pv validate exits 0 errors, 0 warnings.
Per CLAUDE.md "NEVER write code before writing a provable contract"
— this amendment pins the architectural decision (don't touch
production hot paths) BEFORE M-MOE-SUB-2 implementation lands.
Refs: M-GPU-MOE-1.4 step (a) instrumentation, R10,
qwen3-moe-forward-gpu-v1 v1.4.0,
forward_qwen3_moe_traced (M32d Step 2 pre-existing).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cae4641 to
087a52b
Compare
noahgift
added a commit
that referenced
this pull request
May 5, 2026
… step (c) (#1507) Author additive-pure sibling of `moe_ffn_forward_layer` per `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.1.0 step (c). ## What ships `moe_ffn_forward_layer_with_router` returns `(output, router_top_k_weights)`: - `output: Vec<f32>` — `[hidden_dim]` aggregated MoE FFN output (the MoeFfnOut SaveTensorStage capture target). - `router_top_k_weights: Vec<f32>` — `[num_experts_per_tok]` post-softmax + renormalize top-k expert weights (the MoeRouter SaveTensorStage capture target). The helper enables traced forward bodies (M-MOE-SUB-2 steps a/b upcoming) to capture both `MoeRouter` and `MoeFfnOut` stages without a second router computation. ## Hot path safety Production `moe_ffn_forward_layer` is unchanged byte-for-byte. The helper duplicates the router/softmax/top-k logic to satisfy the v1.1.0 amendment's additive-purity invariant: "MUST NOT modify production forward_qwen3_moe / forward_qwen3_moe_cuda hot paths". Drift between sibling functions is mitigated by: 1. Two new unit tests asserting the helper's input-validation error messages match the production sibling's error class for the same shape/qtype boundary violations (`hidden.len() != hidden_dim` and `router qtype != F32`). 2. End-to-end byte-identity for realistic GGUF inputs is exercised by the heavy parity tests at `qwen3_moe_gpu_parity.rs` (out of scope for unit tests since they require the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct GGUF on lambda-vector RTX 4090). ## What this discharges - FALSIFY-MOE-SUB-002 (byte-identity preservation): partial — the helper exists and validates inputs symmetrically with production. Full discharge needs M-MOE-SUB-2 steps (a)+(b) to wire it into traced CPU + GPU forward paths. ## Verification $ cargo test -p aprender-serve --release --lib gguf::qwen3_moe_load::tests 8 passed (including 2 new helper tests) $ cargo clippy -p aprender-serve --lib --release -- -D warnings clean $ rustfmt --check crates/aprender-serve/src/gguf/qwen3_moe_load.rs clean ## What this does NOT ship - M-MOE-SUB-2 step (a): extending `forward_qwen3_moe_traced` to call this helper (CPU-side traced wireup) — separate PR. - M-MOE-SUB-2 step (b): NEW `forward_qwen3_moe_cuda_traced.rs` GPU sibling — separate PR. - Real-GGUF byte-identity test — exercised end-to-end via heavy parity tests, not unit tests. Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.1.0 (M64 companion-spec record, aprender PR #1503) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Code archaeology found pre-existing
forward_qwen3_moe_traced(M32d Step 2 work). v1.0.0's M-MOE-SUB-2 description was ambiguous about whether to modify production hot paths. v1.1.0 clarifies: extend the traced sibling functions, NOT production.Changes
forward_qwen3_moe_traced+ (b) NEWforward_qwen3_moe_cuda_traced+ (c) siblingmoe_ffn_forward_layer_with_router; explicit blocker "MUST NOT modify production hot paths"Validation
Test plan
🤖 Generated with Claude Code