contract(qwen3-moe-forward-v1): v1.1.0 → v1.2.0 — M32c.2.2.2.1 integration strategy by noahgift · Pull Request #1123 · paiml/aprender

noahgift · 2026-04-29T05:09:25Z

Summary

Records the design decision for the final integration step that gates apr run producing tokens. Three approaches were compared: touching 99 OwnedQuantizedModel construction sites, a parallel function, and a wrapper struct. The parallel run_qwen3_moe_generate function was chosen as strictly less work than the alternatives while preserving the CI-must-stay-green invariant.

Implementation slices (M32c.2.2.2.1.0 → .1.4)

.1.0: Extract apply_rope / causal_attention / qkv_matmul into reusable helpers (contract-preserving)
.1.1: New qwen3_moe_forward.rs module — per-token forward
.1.2: run_qwen3_moe_generate — full inference loop
.1.3: Dispatch flip — run_inference routes qwen3_moe to new function
.1.4: Live falsifier — apr run produces tokens (FALSIFY-QW3-MOE-FORWARD-003)
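
The dispatch flip in slice .1.3 can be pictured with a small sketch. The enum and function names below are hypothetical and exist only for illustration; the real `run_inference` takes model data and configuration that this PR does not show. The only detail taken from the PR is that the architecture string `qwen3_moe` routes to the new path.

```rust
/// Hypothetical tag for the two generation paths (not a type from aprender).
#[derive(Debug, PartialEq, Eq)]
pub enum GeneratePath {
    /// The existing dense generation path.
    Dense,
    /// The new path that would call `run_qwen3_moe_generate`.
    Qwen3Moe,
}

/// Pick a generation path from the model's architecture string.
/// Assumption: every architecture other than "qwen3_moe" stays on the
/// existing dense path.
pub fn select_generate_path(arch: &str) -> GeneratePath {
    match arch {
        "qwen3_moe" => GeneratePath::Qwen3Moe,
        _ => GeneratePath::Dense,
    }
}
```

Keeping the routing decision in one small function like this would let the flip be tested without loading a model.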

What this PR does NOT ship

No Rust code change. Contract-first per CLAUDE.md. Implementation lands in the 5 sub-slices.

Test plan

pv validate contracts/qwen3-moe-forward-v1.yaml clean
lint_passes_on_real_contracts PASS
CI green
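
The live falsifier in slice .1.4 is outside this PR's test plan, but its pass condition (exit code 0 and stdout matching a non-whitespace pattern) is simple enough to sketch. The `RunOutcome` type and `falsifier_passes` function are hypothetical names, and `char::is_whitespace` is used here as a stand-in for the regex `\S`, which may differ slightly in which characters count as whitespace.

```rust
/// Hypothetical record of one `apr run` invocation, reduced to the two
/// observables the falsifier looks at.
pub struct RunOutcome {
    pub exit_code: i32,
    pub stdout: String,
}

/// True when the process exited 0 and stdout contains at least one
/// non-whitespace character.
pub fn falsifier_passes(outcome: &RunOutcome) -> bool {
    outcome.exit_code == 0 && outcome.stdout.chars().any(|c| !c.is_whitespace())
}
```

In a real test the `RunOutcome` would presumably be filled from `std::process::Command` output; that wiring is omitted here.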

🤖 Generated with Claude Code

…ation strategy

Records the design decision that gates the final wire-up for `apr run` to produce tokens against Qwen3-Coder-30B-A3B-Instruct.

Three integration approaches were considered:

(A) Add `mmap` + `moe_layers` fields to OwnedQuantizedModel → 99 construction sites must be updated (verified by grep). Massive blast radius for one feature.
(B) Parallel `run_qwen3_moe_generate` function → ~300 LOC + helper extraction; zero touch to OwnedQuantizedModel.
(C) Wrapper struct `Qwen3MoeOwnedModel` → ~150 LOC + same helper extraction as (B), PLUS new struct. Strictly more work than (B) for the same prerequisite.

DECISION: (B), a parallel function with attention/RoPE/KV refactored into reusable helpers.

Rationale:
* (A)'s 99-site blast radius fails the CI-must-stay-green invariant.
* (C)'s struct adds work without removing the helper-extraction prerequisite, so it's strictly worse than (B).
* (B) makes the MoE forward path self-contained and ground-truth-validatable against llama.cpp without entangling the dense path.

IMPLEMENTATION SLICES (M32c.2.2.2.1.0 → M32c.2.2.2.1.4):

.1.0: Extract apply_rope, causal_attention, qkv_matmul from OwnedQuantizedModel into pure functions in a new gguf::inference::common module. Existing forward() refactored to call them. Behavior unchanged.
.1.1: New gguf/qwen3_moe_forward.rs: forward_one_token(mapped, moe_layers, transformer_state, token_id, position, kv_cache).
.1.2: run_qwen3_moe_generate(mapped, tokens, config): full inference loop using .1.1's per-token forward.
.1.3: Dispatch: run_inference for arch == "qwen3_moe" calls run_qwen3_moe_generate. Removes M32c.2.1's gguf_gpu_generate.rs short-circuit.
.1.4: Falsifier FALSIFY-QW3-MOE-FORWARD-003: `apr run <qwen3-coder>.gguf --prompt "Hi" -n 8` exits 0 + stdout matches /\\S/. Live test on lambda-vector RTX 4090.

Each sub-slice is a small, atomically-testable PR. .1.0 is contract-preserving. .1.1-.1.2 are additive. .1.3 is the dispatch flip. .1.4 is the live falsifier that closes FALSIFY-QW3-MOE-FORWARD-003.

After .1.4: M32d numerical parity vs llama.cpp Q4_K + HF FP16 is the last gate before flipping this contract DRAFT → ACTIVE_RUNTIME, which in turn unblocks companion-repo FALSIFY-CCPA-013 measured tool-dispatch parity score.

NO RUST CHANGES in this PR: contract-only per CLAUDE.md CB-1400 contract-first design.

Validation:
  $ pv validate contracts/qwen3-moe-forward-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.
  $ cargo test -p aprender-contracts --lib lint_passes_on_real_contracts
  test result: ok. 1 passed; 0 failed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
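
The helper extraction planned for slice .1.0 (a method body moved into a pure function, with the method delegating to it) might look roughly like the sketch below. Everything here is an assumption for illustration: `DenseModel` is a hypothetical stand-in for the real model type, the signature is invented, and the rotation uses the interleaved-pair RoPE layout, whereas the actual aprender helper may use a split-half layout or a different signature.

```rust
/// Rotate one attention-head vector in place with rotary position embeddings.
/// Assumed layout: interleaved pairs (x[2i], x[2i+1]); f32 activations.
pub fn apply_rope(x: &mut [f32], position: usize, theta_base: f32) {
    let dim = x.len();
    assert!(dim % 2 == 0, "head dimension must be even");
    for i in 0..dim / 2 {
        // Angle for pair i: position * theta_base^(-2i / dim).
        let freq = theta_base.powf(-(2.0 * i as f32) / dim as f32);
        let angle = position as f32 * freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}

/// Hypothetical model type; only the field needed by the example is shown.
pub struct DenseModel {
    pub rope_theta: f32,
}

impl DenseModel {
    /// After the refactor the method only forwards to the pure function,
    /// so callers of the method see unchanged behavior.
    pub fn apply_rope(&self, x: &mut [f32], position: usize) {
        apply_rope(x, position, self.rope_theta);
    }
}
```

A pure function of this shape can be called from both the dense path and a separate MoE path without either one needing access to the other's state.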

…ntizedModel (#1124)

* feat(realizar): M32c.2.2.2.1.1 — forward_qwen3_moe method on OwnedQuantizedModel

Per qwen3-moe-forward-v1 v1.2.0 (M32c.2.2.2.1 integration strategy, PR #1123), this is the per-token forward pass for Qwen3-MoE-arch GGUF models. Mirrors `OwnedQuantizedModel::forward` step-for-step except the FFN site, which calls M32c.2.2.2.0's `moe_ffn_forward_layer` instead of the dense SwiGLU/GELU dispatch.

WHAT THIS PR SHIPS
==================

* `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs` (NEW): impl block on `OwnedQuantizedModel` adding `forward_qwen3_moe(token_ids, moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data) -> Result<Vec<f32>>`. Reuses existing `&self` methods for embedding, attention norm, qkv_matmul, RoPE (apply_rope), causal_attention, attn_output proj, output_norm, and lm_head, with ZERO duplication of attention code. The ONLY new logic is the per-position FFN call to `moe_ffn_forward_layer`.
* `crates/aprender-serve/src/gguf/inference/forward/mod.rs`: register the new module.
* `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (NEW): F-QW3-MOE-C22211-001, a full 1-token forward against the cached 17.3 GB Qwen3-Coder GGUF. Asserts:
  - logits.len() == vocab_size (151936)
  - all logits finite (no NaN/Inf)
  - argmax in valid range
  Skipped when no GGUF cached (slow live test; runs minutes per call due to mmap fault-in + 48 layers × 128-expert routing × top-8 × per-expert Q4_K/Q6_K matmul).

WHY THIS DESIGN (PER v1.2.0)
============================

Three integration approaches were considered (see contract):

(A) Add fields to OwnedQuantizedModel: 99-site blast radius.
(B) Parallel run_qwen3_moe_generate function: ~300 LOC + helpers.
(C) Wrapper struct: ~150 LOC + helpers.

The chosen path is a HYBRID of (B) and (C): a method on OwnedQuantizedModel that reuses existing `&self` primitives (no duplication, no new struct) and takes mmap data + moe_layers as PARAMETERS (no field add, no 99-site touch). Strictly less work than the originally planned (B) while preserving the "CI-must-stay-green" invariant.

WHAT M32c.2.2.2.1.2 (NEXT PR) WILL SHIP
========================================

* `run_qwen3_moe_generate(mapped, tokens, gen_config, config)`: full inference loop calling forward_qwen3_moe per token, sampling, detokenizing; sibling to `run_gguf_generate`.

WHAT M32c.2.2.2.1.3 WILL SHIP
==============================

* Dispatch flip in `run_inference`: arch == "qwen3_moe" routes to run_qwen3_moe_generate. Removes M32c.2.1's gguf_gpu_generate.rs short-circuit.

WHAT M32c.2.2.2.1.4 WILL SHIP
==============================

* Live falsifier: `apr run <qwen3-coder>.gguf -p "Hi" -n 8` exits 0 + stdout matches /\\S/. Discharges FALSIFY-QW3-MOE-FORWARD-003.

NO BEHAVIOR CHANGE in this PR. The new method is additive; existing forward path is untouched. Compilation clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(realizar): doc list indentation in forward_qwen3_moe (clippy)

Pre-CI lint caught a doc-list-item-without-indentation warning at line 18 of forward_qwen3_moe.rs. Continuation line for a list item must be indented to align with the item's text, not the bullet.

Caught by: ci/lint job in PR #1124.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
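
The three assertions listed for the one-token test (logit count equals vocabulary size, all logits finite, argmax in range) can be sketched as a standalone check. `check_logits` is a hypothetical helper, not a function from the aprender test file, and it works on any `f32` slice rather than on real model output.

```rust
/// Check a logits vector the way the one-token test describes and return
/// the argmax index on success. Hypothetical helper for illustration.
pub fn check_logits(logits: &[f32], vocab_size: usize) -> Result<usize, String> {
    if logits.len() != vocab_size {
        return Err(format!("expected {vocab_size} logits, got {}", logits.len()));
    }
    if let Some(i) = logits.iter().position(|v| !v.is_finite()) {
        return Err(format!("non-finite logit at index {i}"));
    }
    // All values are finite here, so a total order gives a meaningful argmax.
    // The index is below logits.len(), which equals vocab_size, so the
    // "argmax in valid range" property holds by construction.
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .ok_or_else(|| "empty logits".to_string())
}
```

With the vocabulary size quoted above, a call would look like `check_logits(&logits, 151936)`.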

noahgift merged commit bd08718 into main Apr 29, 2026
11 checks passed

noahgift deleted the feat/m32c-2-2-2-1-forward-qwen3-moe-method branch April 29, 2026 05:36

noahgift merged 1 commit into main from feat/m32c-2-2-2-1-forward-qwen3-moe-method
