feat(realizar): M32c.2.2.2.1.1 — forward_qwen3_moe method on OwnedQuantizedModel #1124
Merged
…ntizedModel

Per qwen3-moe-forward-v1 v1.2.0 (M32c.2.2.2.1 integration strategy, PR #1123), this is the per-token forward pass for Qwen3-MoE-arch GGUF models. Mirrors `OwnedQuantizedModel::forward` step-for-step except the FFN site, which calls M32c.2.2.2.0's `moe_ffn_forward_layer` instead of the dense SwiGLU/GELU dispatch.

WHAT THIS PR SHIPS
==================

* `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs` (NEW): impl block on `OwnedQuantizedModel` adding `forward_qwen3_moe(token_ids, moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data) -> Result<Vec<f32>>`. Reuses existing `&self` methods for embedding, attention norm, qkv_matmul, RoPE (apply_rope), causal_attention, attn_output proj, output_norm, and lm_head: ZERO duplication of attention code. The ONLY new logic is the per-position FFN call to `moe_ffn_forward_layer`.
* `crates/aprender-serve/src/gguf/inference/forward/mod.rs`: register the new module.
* `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (NEW): F-QW3-MOE-C22211-001, a full 1-token forward against the cached 17.3 GB Qwen3-Coder GGUF. Asserts:
  - logits.len() == vocab_size (151936)
  - all logits finite (no NaN/Inf)
  - argmax in valid range

  Skipped when no GGUF is cached (slow live test; runs minutes per call due to mmap fault-in + 48 layers × 128-expert routing × top-8 × per-expert Q4_K/Q6_K matmul).

WHY THIS DESIGN (PER v1.2.0)
============================

Three integration approaches were considered (see contract):

  (A) Add fields to OwnedQuantizedModel: 99-site blast radius.
  (B) Parallel run_qwen3_moe_generate function: ~300 LOC + helpers.
  (C) Wrapper struct: ~150 LOC + helpers.

The chosen path is a HYBRID of (B) and (C): a method on OwnedQuantizedModel that reuses existing `&self` primitives (no duplication, no new struct) and takes mmap data + moe_layers as PARAMETERS (no field add, no 99-site touch). Strictly less work than the originally planned (B) while preserving the "CI-must-stay-green" invariant.

WHAT M32c.2.2.2.1.2 (NEXT PR) WILL SHIP
========================================

* `run_qwen3_moe_generate(mapped, tokens, gen_config, config)`: full inference loop calling forward_qwen3_moe per token, sampling, detokenizing; sibling to `run_gguf_generate`.

WHAT M32c.2.2.2.1.3 WILL SHIP
==============================

* Dispatch flip in `run_inference`: arch == "qwen3_moe" routes to run_qwen3_moe_generate. Removes M32c.2.1's gguf_gpu_generate.rs short-circuit.

WHAT M32c.2.2.2.1.4 WILL SHIP
==============================

* Live falsifier: `apr run <qwen3-coder>.gguf -p "Hi" -n 8` exits 0 + stdout matches /\S/. Discharges FALSIFY-QW3-MOE-FORWARD-003.

NO BEHAVIOR CHANGE in this PR. The new method is additive; existing forward path is untouched. Compilation clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
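The three assertions listed for F-QW3-MOE-C22211-001 can be sketched as a standalone check. This is only an illustration under stated assumptions: `check_logits` is a hypothetical helper name, and the real test obtains its logits by calling `forward_qwen3_moe` on the cached GGUF rather than taking a slice as input.

```rust
/// Hypothetical helper (not from the PR): applies the three checks the
/// commit message lists and returns the argmax index on success.
fn check_logits(logits: &[f32], vocab_size: usize) -> Result<usize, String> {
    // Check 1: one logit per vocabulary entry.
    if logits.is_empty() || logits.len() != vocab_size {
        return Err(format!(
            "expected {vocab_size} logits, got {}",
            logits.len()
        ));
    }
    // Check 2: every logit is finite (rejects NaN and +/-Inf).
    if let Some(i) = logits.iter().position(|v| !v.is_finite()) {
        return Err(format!("non-finite logit at index {i}"));
    }
    // Check 3: argmax lies in 0..vocab_size. The first maximum wins on ties.
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    Ok(best)
}
```

Because the length check runs first, the argmax is in range by construction; the finite check is what keeps the `>` comparison meaningful, since any comparison against NaN is false.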
Pre-CI lint caught a doc-list-item-without-indentation warning at line 18 of forward_qwen3_moe.rs. Continuation line for a list item must be indented to align with the item's text, not the bullet. Caught by: ci/lint job in PR #1124.
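For readers unfamiliar with this warning, a doc comment that satisfies the rule looks roughly like the sketch below. The function is hypothetical and unrelated to forward_qwen3_moe.rs. The warning text resembles Clippy's `doc_lazy_continuation` lint, but the commit message does not name the lint, so treat that identification as an assumption.

```rust
/// Sums the finite values in `xs`.
///
/// Steps:
/// - skip any NaN or infinite entry, so that one bad value cannot
///   poison the total (this continuation line is indented to align
///   with the item's text, not with the bullet)
/// - add up what remains
fn finite_sum(xs: &[f32]) -> f32 {
    xs.iter().copied().filter(|v| v.is_finite()).sum()
}
```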
This was referenced Apr 29, 2026
noahgift added a commit that referenced this pull request on Apr 29, 2026
…SIFY-QW3-MOE-FORWARD-003 (#1127)

## What ships

Adds `crates/apr-cli/tests/qwen3_moe_apr_run_live_falsifier.rs` (F-QW3-MOE-C22214-001), an integration test that invokes the user-facing `apr` binary as a subprocess and asserts:

1. exit 0
2. stdout contains ≥1 non-whitespace character

against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf with a fresh date-tagged prompt.

This pins the M32c.2.2.2.1.3 dispatch flip (PR #1126, squash a902eea) in CI / regression-prevention. Without it, a future regression that re-routed qwen3_moe back to the dense `run_gguf_generate` path (which produces garbage on MoE weights) would slip through CI silently: there would be no signal at the `apr run` user-facing surface.

## Live evidence (lambda-vector RTX 4090, 2026-04-29)

```
running 1 test
test f_qw3_moe_c22214_001_apr_run_emits_at_least_one_non_whitespace_char ... F-QW3-MOE-C22214-001: live `apr run` against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
F-QW3-MOE-C22214-001: elapsed = 130.945370974s
stdout (first 200B):
=== APR Run ===
Source: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
Output: .
Completed in 130.83s (cached)
stderr (first 200B):
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
F-QW3-MOE-C22214-001: PASS
ok

test result: ok. 1 passed; 0 failed; 0 ignored
```

Token quality vs llama.cpp Q4_K (cosine on logits) is M32d. This test asserts ONLY emit/exit-0, which is the discharge gate for FALSIFY-QW3-MOE-FORWARD-003.

## Skip path

CI runners (and any host without the cached GGUF) print:

    F-QW3-MOE-C22214-001: SKIP — no cached Qwen3-Coder GGUF at any of [...]

and return success. Same skip pattern as `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (M32c.2.2.2.1.1 in-process forward primitive).

## Contract chain status

| Milestone | Deliverable | Status |
|---|---|---|
| M32a | qwen3-moe-forward-v1 contract scaffold | SHIPPED (#1099) |
| M32b | arch-aware FFN load refuses qwen3_moe | SHIPPED (#1100) |
| M32c.1+ | MoE descriptor load + per-expert byte slicer | SHIPPED |
| M32c.2.2.2.1.1 | forward_qwen3_moe method | SHIPPED (#1124) |
| M32c.2.2.2.1.2 | run_qwen3_moe_generate function | SHIPPED (#1125) |
| M32c.2.2.2.1.3 | dispatch flip + Q4_K_M qtype dispatch | SHIPPED (#1126) |
| M32c.2.2.2.1.4 | live `apr run` falsifier | THIS PR |
| M32d | numerical parity vs llama.cpp | PENDING |

After M32d the contract flips DRAFT → ACTIVE_RUNTIME, which unblocks the companion-repo FALSIFY-CCPA-013 measured tool-dispatch parity gate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
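The skip path and the two assertions described in this commit can be sketched as follows. This is a rough illustration, not the contents of `qwen3_moe_apr_run_live_falsifier.rs`: the helper names, the candidate-path list, and the fixed prompt are assumptions (the real test uses a fresh date-tagged prompt), and only the `apr run <model> -p ... -n 8` shape is taken from the commit messages quoted on this page.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

/// Hypothetical: returns the first candidate path that exists as a file.
fn first_existing(candidates: &[PathBuf]) -> Option<&Path> {
    candidates.iter().map(PathBuf::as_path).find(|p| p.is_file())
}

/// True if the bytes contain at least one non-whitespace character,
/// approximating the "stdout matches /\S/" condition.
fn has_non_whitespace(stdout: &[u8]) -> bool {
    String::from_utf8_lossy(stdout)
        .chars()
        .any(|c| !c.is_whitespace())
}

/// Skips (returning success) when no model is cached; otherwise runs the
/// CLI as a subprocess and checks exit status and stdout.
fn run_or_skip(candidates: &[PathBuf]) -> Result<&'static str, String> {
    let Some(model) = first_existing(candidates) else {
        println!("SKIP: no cached GGUF at any of {candidates:?}");
        return Ok("skipped");
    };
    // Assumes an `apr` binary is on PATH; the prompt here is a placeholder.
    let output = Command::new("apr")
        .arg("run")
        .arg(model)
        .args(["-p", "Hi", "-n", "8"])
        .output()
        .map_err(|e| format!("failed to spawn apr: {e}"))?;
    if !output.status.success() {
        return Err(format!("apr exited with {}", output.status));
    }
    if !has_non_whitespace(&output.stdout) {
        return Err("stdout had no non-whitespace character".to_string());
    }
    Ok("passed")
}
```

Returning success on the skip path is what lets the same test live in CI, where the 17.3 GB model is absent, and on a workstation where it is cached.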
Summary
Per-token forward pass for Qwen3-MoE. Mirrors `OwnedQuantizedModel::forward` step-for-step except the FFN site, which calls M32c.2.2.2.0's `moe_ffn_forward_layer` instead of dense SwiGLU.

Design (per v1.2.0 contract)

Originally v1.2.0 specified a parallel `run_qwen3_moe_generate` (~300 LOC + helpers). Found a strictly-less-work HYBRID: a method on `OwnedQuantizedModel` that:

* Reuses existing `&self` primitives (qkv_matmul, apply_rope, causal_attention, fused_matmul): ZERO attention/RoPE/KV duplication
* Takes `mmap data + moe_layers + MoE config` as parameters: ZERO field-add (no 99-site blast radius)

This achieves the v1.2.0 goal (separable MoE forward without OwnedQuantizedModel field-add) without the helper extraction step, since `&self` methods are already serviceable helpers.

Test plan

* `cargo build -p aprender-serve --test qwen3_moe_forward_one_token --features cuda` clean

Stage map

* `run_qwen3_moe_generate` (full inference loop)

🤖 Generated with Claude Code
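The shape of the hybrid design can be illustrated with a toy model. Everything in this sketch is hypothetical (`ToyModel`, `ToyMoeLayer`, the arithmetic, and the single fixed expert index); it only shows the structural idea of adding a second forward method that reuses a shared `&self` primitive and receives MoE state as parameters, so the struct gains no new fields and the existing path is untouched.

```rust
/// Toy stand-in for `OwnedQuantizedModel`; the field is hypothetical.
struct ToyModel {
    scale: f32,
}

/// Hypothetical per-layer MoE descriptor supplied by the caller.
struct ToyMoeLayer {
    expert_bias: Vec<f32>,
}

impl ToyModel {
    /// Shared primitive, standing in for the attention/RoPE helpers.
    fn attention(&self, x: &[f32]) -> Vec<f32> {
        x.iter().map(|v| v * self.scale).collect()
    }

    /// Existing dense path: unchanged by the addition below.
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        // Toy dense FFN: clamp negatives to zero.
        self.attention(x).iter().map(|v| v.max(0.0)).collect()
    }

    /// New MoE path: same primitive, but the MoE state arrives as
    /// parameters instead of living in `ToyModel` fields.
    fn forward_moe(
        &self,
        x: &[f32],
        moe_layers: &[ToyMoeLayer],
        expert: usize,
    ) -> Result<Vec<f32>, String> {
        let mut h = self.attention(x);
        for (i, layer) in moe_layers.iter().enumerate() {
            // Toy "expert FFN": add the selected expert's bias.
            let bias = layer
                .expert_bias
                .get(expert)
                .copied()
                .ok_or_else(|| format!("layer {i}: no expert {expert}"))?;
            for v in h.iter_mut() {
                *v += bias;
            }
        }
        Ok(h)
    }
}
```

Since `ToyModel` itself is unchanged, code that constructs it does not need touching, which is the toy analogue of avoiding the 99-site blast radius.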