contract(qwen3-moe-forward-v1): v1.1.0 → v1.2.0 — M32c.2.2.2.1 integration strategy #1123
Merged
Records the design decision that gates the final wire-up for `apr run`
to produce tokens against Qwen3-Coder-30B-A3B-Instruct. Three integration
approaches were considered:
  (A) Add `mmap` + `moe_layers` fields to OwnedQuantizedModel
      → 99 construction sites must be updated (verified by grep).
      Massive blast radius for one feature.
  (B) Parallel `run_qwen3_moe_generate` function
      → ~300 LOC + helper extraction; zero touch to OwnedQuantizedModel.
  (C) Wrapper struct `Qwen3MoeOwnedModel`
      → ~150 LOC + same helper extraction as (B), PLUS new struct.
      Strictly more work than (B) for the same prerequisite.
DECISION: (B) — parallel function with attention/RoPE/KV refactored
into reusable helpers. Rationale:
  * (A)'s 99-site blast radius fails the CI-must-stay-green invariant.
  * (C)'s struct adds work without removing the helper-extraction
    prerequisite, so it's strictly worse than (B).
  * (B) makes the MoE forward path self-contained and ground-truth-
    validatable against llama.cpp without entangling the dense path.
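
A minimal sketch of the shape option (B) implies: the MoE path is selected by architecture string and lives beside the dense path, so nothing that constructs OwnedQuantizedModel has to change. `ModelInfo`, `GeneratePath`, and `select_path` are hypothetical names used only for illustration; they are not taken from the codebase.

```rust
/// Hypothetical stand-in for model metadata; only the architecture
/// string matters for this illustration.
struct ModelInfo {
    arch: String,
}

/// The two generation paths that exist side by side under option (B).
#[derive(Debug, PartialEq)]
enum GeneratePath {
    /// Existing dense path, left untouched.
    Dense,
    /// New parallel path for Qwen3-MoE models.
    Qwen3Moe,
}

/// Route on the architecture string. Every architecture other than
/// "qwen3_moe" keeps using the dense path exactly as before.
fn select_path(info: &ModelInfo) -> GeneratePath {
    match info.arch.as_str() {
        "qwen3_moe" => GeneratePath::Qwen3Moe,
        _ => GeneratePath::Dense,
    }
}
```

Because the match has a catch-all arm for the dense path, adding the MoE arm cannot change behavior for any existing architecture.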
IMPLEMENTATION SLICES (M32c.2.2.2.1.0 → M32c.2.2.2.1.4):
  .1.0 — Extract apply_rope, causal_attention, qkv_matmul from
         OwnedQuantizedModel into pure functions in a new
         gguf::inference::common module. Existing forward() refactored
         to call them. Behavior unchanged.
  .1.1 — New gguf/qwen3_moe_forward.rs: forward_one_token(mapped,
         moe_layers, transformer_state, token_id, position, kv_cache).
  .1.2 — run_qwen3_moe_generate(mapped, tokens, config): full
         inference loop using .1.1's per-token forward.
  .1.3 — Dispatch: run_inference for arch == "qwen3_moe" calls
         run_qwen3_moe_generate. Removes M32c.2.1's gguf_gpu_generate.rs
         short-circuit.
  .1.4 — Falsifier FALSIFY-QW3-MOE-FORWARD-003: `apr run
         <qwen3-coder>.gguf --prompt "Hi" -n 8` exits 0 + stdout
         matches /\\S/. Live test on lambda-vector RTX 4090.
Each sub-slice is a small, atomically-testable PR. .1.0 is contract-
preserving. .1.1-.1.2 are additive. .1.3 is the dispatch flip. .1.4
is the live falsifier that closes FALSIFY-QW3-MOE-FORWARD-003.
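
The pass condition for the .1.4 falsifier can be sketched as a pure predicate over a finished process's exit code and captured stdout. The function name is hypothetical, and treating "matches a non-whitespace pattern" as "contains at least one non-whitespace character" (via `char::is_whitespace`) is an assumption about the intended check, not the project's actual test harness.

```rust
/// Returns true when a run satisfies both conditions named for the
/// falsifier: exit status 0, and stdout containing at least one
/// non-whitespace character.
///
/// Assumption: "non-whitespace" is approximated with Rust's
/// `char::is_whitespace` (Unicode White_Space property).
fn falsifier_003_passes(exit_code: i32, stdout: &str) -> bool {
    exit_code == 0 && stdout.chars().any(|c| !c.is_whitespace())
}
```

In a live test, the two arguments would presumably come from `std::process::Command::output()`, using `status.code()` and a lossy UTF-8 decode of `stdout`.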
After .1.4: M32d numerical parity vs llama.cpp Q4_K + HF FP16 is the
last gate before flipping this contract DRAFT → ACTIVE_RUNTIME, which
in turn unblocks companion-repo FALSIFY-CCPA-013 measured
tool-dispatch parity score.
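
One simple way a numerical parity gate could be expressed is as a maximum absolute difference between two logit vectors, compared against a tolerance. This is only a sketch: the contract text above does not say which metric or tolerance M32d uses, so the function names and the `tol` parameter here are assumptions.

```rust
/// Largest absolute element-wise difference between two logit vectors.
/// Returns None when the lengths differ or any difference is not finite
/// (NaN or infinity), so a broken run cannot silently pass.
fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let mut worst = 0.0_f32;
    for (x, y) in a.iter().zip(b.iter()) {
        let d = (x - y).abs();
        if !d.is_finite() {
            return None;
        }
        worst = worst.max(d);
    }
    Some(worst)
}

/// True when both vectors are comparable and differ by at most `tol`
/// everywhere. `tol` is a caller-chosen, hypothetical threshold.
fn within_tolerance(a: &[f32], b: &[f32], tol: f32) -> bool {
    matches!(max_abs_diff(a, b), Some(d) if d <= tol)
}
```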
NO RUST CHANGES in this PR — contract-only per CLAUDE.md CB-1400
contract-first design.
Validation:
$ pv validate contracts/qwen3-moe-forward-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.
$ cargo test -p aprender-contracts --lib lint_passes_on_real_contracts
test result: ok. 1 passed; 0 failed
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on Apr 29, 2026
…ntizedModel (#1124)

* feat(realizar): M32c.2.2.2.1.1 — forward_qwen3_moe method on OwnedQuantizedModel

Per qwen3-moe-forward-v1 v1.2.0 (M32c.2.2.2.1 integration strategy, PR #1123), this is the per-token forward pass for Qwen3-MoE-arch GGUF models. Mirrors `OwnedQuantizedModel::forward` step-for-step except the FFN site, which calls M32c.2.2.2.0's `moe_ffn_forward_layer` instead of the dense SwiGLU/GELU dispatch.

WHAT THIS PR SHIPS
==================
* `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs` (NEW): impl block on `OwnedQuantizedModel` adding `forward_qwen3_moe(token_ids, moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data) -> Result<Vec<f32>>`. Reuses existing `&self` methods for embedding, attention norm, qkv_matmul, RoPE (apply_rope), causal_attention, attn_output proj, output_norm, and lm_head — ZERO duplication of attention code. The ONLY new logic is the per-position FFN call to `moe_ffn_forward_layer`.
* `crates/aprender-serve/src/gguf/inference/forward/mod.rs`: register the new module.
* `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (NEW): F-QW3-MOE-C22211-001 — full 1-token forward against the cached 17.3 GB Qwen3-Coder GGUF. Asserts:
  - logits.len() == vocab_size (151936)
  - all logits finite (no NaN/Inf)
  - argmax in valid range
  Skipped when no GGUF cached (slow live test; runs minutes per call due to mmap fault-in + 48 layers × 128-expert routing × top-8 × per-expert Q4_K/Q6_K matmul).

WHY THIS DESIGN (PER v1.2.0)
============================
Three integration approaches were considered (see contract):
  (A) Add fields to OwnedQuantizedModel: 99-site blast radius.
  (B) Parallel run_qwen3_moe_generate function: ~300 LOC + helpers.
  (C) Wrapper struct: ~150 LOC + helpers.
The chosen path is a HYBRID of (B) and (C): a method on OwnedQuantizedModel that reuses existing `&self` primitives (no duplication, no new struct) and takes mmap data + moe_layers as PARAMETERS (no field add, no 99-site touch). Strictly less work than the originally planned (B) while preserving the "CI-must-stay-green" invariant.

WHAT M32c.2.2.2.1.2 (NEXT PR) WILL SHIP
========================================
* `run_qwen3_moe_generate(mapped, tokens, gen_config, config)`: full inference loop calling forward_qwen3_moe per token, sampling, detokenizing — sibling to `run_gguf_generate`.

WHAT M32c.2.2.2.1.3 WILL SHIP
==============================
* Dispatch flip in `run_inference`: arch == "qwen3_moe" routes to run_qwen3_moe_generate. Removes M32c.2.1's gguf_gpu_generate.rs short-circuit.

WHAT M32c.2.2.2.1.4 WILL SHIP
==============================
* Live falsifier — `apr run <qwen3-coder>.gguf -p "Hi" -n 8` exits 0 + stdout matches /\\S/. Discharges FALSIFY-QW3-MOE-FORWARD-003.

NO BEHAVIOR CHANGE in this PR. The new method is additive; existing forward path is untouched. Compilation clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(realizar): doc list indentation in forward_qwen3_moe (clippy)

Pre-CI lint caught a doc-list-item-without-indentation warning at line 18 of forward_qwen3_moe.rs. Continuation line for a list item must be indented to align with the item's text, not the bullet.

Caught by: ci/lint job in PR #1124.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
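
The three assertions listed for F-QW3-MOE-C22211-001 (length equals vocab size, all logits finite, argmax in range) can be sketched as a standalone check over a logits slice. `check_logits` and `argmax` are hypothetical helpers written for illustration; the real test in `qwen3_moe_forward_one_token.rs` may be structured differently.

```rust
/// Index of the largest value; the first index wins on ties.
/// Returns None for an empty slice.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        match best {
            // Keep the current best when the new value is not larger.
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i)
}

/// Sanity checks mirroring the three assertions above. On success,
/// returns the argmax token id, which is below `vocab_size` by
/// construction because it is an index into a slice of that length.
fn check_logits(logits: &[f32], vocab_size: usize) -> Result<usize, String> {
    if logits.len() != vocab_size {
        return Err(format!(
            "expected {} logits, got {}",
            vocab_size,
            logits.len()
        ));
    }
    if let Some(i) = logits.iter().position(|v| !v.is_finite()) {
        return Err(format!("non-finite logit at index {}", i));
    }
    argmax(logits).ok_or_else(|| "empty logits".to_string())
}
```

For the model named above, the call would be `check_logits(&logits, 151936)`. Running the finiteness check before the argmax matters, since a NaN would otherwise make the comparison in `argmax` unreliable.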
Summary
Records the design decision for the final integration step that gates `apr run` producing tokens. After comparing 3 approaches (touch 99 OwnedQuantizedModel sites / parallel function / wrapper struct), the parallel `run_qwen3_moe_generate` function was chosen as strictly less work than the alternatives while preserving CI-must-stay-green.

Implementation slices (M32c.2.2.2.1.0 → .1.4)
* `apply_rope` / `causal_attention` / `qkv_matmul` into reusable helpers (contract-preserving)
* `qwen3_moe_forward.rs` module — per-token forward
* `run_qwen3_moe_generate` — full inference loop
* `run_inference` routes qwen3_moe to new function
* `apr run` produces tokens (FALSIFY-QW3-MOE-FORWARD-003)

What this PR does NOT ship
No Rust code change. Contract-first per CLAUDE.md. Implementation lands in the 5 sub-slices.

Test plan
* `pv validate contracts/qwen3-moe-forward-v1.yaml` clean
* `lint_passes_on_real_contracts` PASS

🤖 Generated with Claude Code