feat(realizar): M32c.2.2.2.1.1 — forward_qwen3_moe method on OwnedQuantizedModel by noahgift · Pull Request #1124 · paiml/aprender

noahgift · 2026-04-29T06:00:31Z

Summary

Per-token forward pass for Qwen3-MoE — mirrors OwnedQuantizedModel::forward step-for-step except the FFN site, which calls M32c.2.2.2.0's moe_ffn_forward_layer instead of dense SwiGLU.
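
To make the "FFN site" swap concrete, here is a minimal sketch of what a top-k MoE FFN does for one token. It assumes the common softmax-then-top-k gating with renormalized weights; the PR does not describe the internals of `moe_ffn_forward_layer`, so `route_top_k`, `moe_ffn_token`, and the `expert_ffn` closure are hypothetical names used only for illustration.

```rust
/// Hypothetical helper: pick the `k` highest-scoring experts for one token
/// and return (expert_index, weight) pairs whose weights sum to 1.
fn route_top_k(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stable softmax over all experts.
    let max = router_logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = router_logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    // Sort expert indices by descending probability and keep the top k.
    let mut order: Vec<usize> = (0..exps.len()).collect();
    order.sort_by(|&a, &b| exps[b].total_cmp(&exps[a]));
    order.truncate(k);

    // Renormalize the kept probabilities so they sum to 1 (an assumption).
    let kept: f32 = order.iter().map(|&i| exps[i] / sum).sum();
    order.into_iter().map(|i| (i, exps[i] / sum / kept)).collect()
}

/// Hypothetical MoE FFN for one token: weighted sum of the selected experts'
/// outputs. `expert_ffn(e, x)` stands in for the per-expert quantized FFN.
fn moe_ffn_token<F>(x: &[f32], router_logits: &[f32], k: usize, expert_ffn: F) -> Vec<f32>
where
    F: Fn(usize, &[f32]) -> Vec<f32>,
{
    let mut out = vec![0.0f32; x.len()];
    for (e, w) in route_top_k(router_logits, k) {
        let y = expert_ffn(e, x);
        for (o, v) in out.iter_mut().zip(y) {
            *o += w * v;
        }
    }
    out
}
```

A dense SwiGLU would apply one FFN to every token; here only `k` experts run per token, which is why the rest of the forward pass can stay identical.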

Design (per v1.2.0 contract)

v1.2.0 originally specified a parallel run_qwen3_moe_generate function (~300 LOC + helpers). A HYBRID turned out to be strictly less work: a method on OwnedQuantizedModel that:

* Reuses existing &self primitives (qkv_matmul, apply_rope, causal_attention, fused_matmul): ZERO attention/RoPE/KV duplication
* Takes mmap data + moe_layers + MoE config as parameters: ZERO field-add (no 99-site blast radius)

This achieves the v1.2.0 goal (separable MoE forward without OwnedQuantizedModel field-add) without the helper extraction step, since &self methods are already serviceable helpers.
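
The shape of the hybrid can be sketched with toy types. This is not the real `OwnedQuantizedModel`; `ToyModel`, `ToyMoeLayer`, and the arithmetic placeholders are hypothetical and only show the pattern: the new method reuses an existing `&self` primitive and receives MoE state as parameters, so the struct gains no fields.

```rust
/// Toy stand-in for a model that already owns its dense weights.
struct ToyModel {
    hidden: usize,
}

/// Toy stand-in for per-layer MoE descriptors that live outside the model.
struct ToyMoeLayer {
    scale: f32,
}

impl ToyModel {
    /// Existing `&self` primitive, shared by every forward variant.
    fn attention(&self, x: &[f32]) -> Vec<f32> {
        x.iter().map(|v| v + 1.0).collect()
    }

    /// Existing dense forward: attention, then a dense FFN placeholder.
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        let h = self.attention(x);
        h.iter().map(|v| v * 2.0).collect()
    }

    /// New MoE forward: same attention primitive, but the FFN step is driven
    /// by caller-supplied layers. No new fields on `ToyModel`.
    fn forward_moe(&self, x: &[f32], moe_layers: &[ToyMoeLayer]) -> Result<Vec<f32>, String> {
        if x.len() != self.hidden {
            return Err(format!("expected {} inputs, got {}", self.hidden, x.len()));
        }
        let mut h = x.to_vec();
        for layer in moe_layers {
            h = self.attention(&h);
            // MoE FFN placeholder: only this step differs from `forward`.
            for v in h.iter_mut() {
                *v *= layer.scale;
            }
        }
        Ok(h)
    }
}
```

Because the struct definition is untouched, every existing construction site keeps compiling, which is the point of avoiding the field-add.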

Test plan

cargo build -p aprender-serve --test qwen3_moe_forward_one_token --features cuda builds clean
M32c.1/c.2/c.2.1/c.2.2.0/c.2.2.1/c.2.2.2.0 regression tests still pass
CI green
Live test (slow, runs minutes; runs only when cached GGUF present)
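
The "runs only when cached GGUF present" behaviour can be sketched as below. The real test's candidate paths and helper names are not shown in this PR, so `first_existing` and `run_live_or_skip` are hypothetical; only `std` is used.

```rust
use std::path::{Path, PathBuf};

/// Return the first candidate path that exists on disk, if any.
fn first_existing(candidates: &[PathBuf]) -> Option<&Path> {
    candidates.iter().map(PathBuf::as_path).find(|p| p.exists())
}

/// Hypothetical test body: returns true if the live check ran,
/// false if it was skipped because no candidate file was found.
fn run_live_or_skip(candidates: &[PathBuf], live: impl FnOnce(&Path)) -> bool {
    match first_existing(candidates) {
        Some(path) => {
            live(path);
            true
        }
        None => {
            // Skipping counts as success so CI hosts without the model stay green.
            eprintln!("SKIP: no cached GGUF at any of {candidates:?}");
            false
        }
    }
}
```

Returning early with a message, instead of failing, keeps the test in the default suite while leaving the multi-minute path to hosts that have the model cached.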

Stage map

M32c.2.2.2.0 full-layer dispatch: ✅ merged
v1.2.0 integration strategy: ✅ merged
M32c.2.2.2.1.1 forward_qwen3_moe method (this PR)
M32c.2.2.2.1.2 run_qwen3_moe_generate (full inference loop)
M32c.2.2.2.1.3 dispatch flip (replace gguf_gpu_generate short-circuit)
M32c.2.2.2.1.4 live falsifier (FALSIFY-QW3-MOE-FORWARD-003)
M32d numerical parity → flips contract DRAFT → ACTIVE_RUNTIME

🤖 Generated with Claude Code

…ntizedModel

Per qwen3-moe-forward-v1 v1.2.0 (M32c.2.2.2.1 integration strategy, PR #1123), this is the per-token forward pass for Qwen3-MoE-arch GGUF models. Mirrors `OwnedQuantizedModel::forward` step-for-step except the FFN site, which calls M32c.2.2.2.0's `moe_ffn_forward_layer` instead of the dense SwiGLU/GELU dispatch.

WHAT THIS PR SHIPS
==================

* `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs` (NEW): impl block on `OwnedQuantizedModel` adding `forward_qwen3_moe(token_ids, moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data) -> Result<Vec<f32>>`. Reuses existing `&self` methods for embedding, attention norm, qkv_matmul, RoPE (apply_rope), causal_attention, attn_output proj, output_norm, and lm_head: ZERO duplication of attention code. The ONLY new logic is the per-position FFN call to `moe_ffn_forward_layer`.
* `crates/aprender-serve/src/gguf/inference/forward/mod.rs`: register the new module.
* `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (NEW): F-QW3-MOE-C22211-001, a full 1-token forward against the cached 17.3 GB Qwen3-Coder GGUF. Asserts:
  - logits.len() == vocab_size (151936)
  - all logits finite (no NaN/Inf)
  - argmax in valid range

  Skipped when no GGUF cached (slow live test; runs minutes per call due to mmap fault-in + 48 layers × 128-expert routing × top-8 × per-expert Q4_K/Q6_K matmul).

WHY THIS DESIGN (PER v1.2.0)
============================

Three integration approaches were considered (see contract):

(A) Add fields to OwnedQuantizedModel: 99-site blast radius.
(B) Parallel run_qwen3_moe_generate function: ~300 LOC + helpers.
(C) Wrapper struct: ~150 LOC + helpers.

The chosen path is a HYBRID of (B) and (C): a method on OwnedQuantizedModel that reuses existing `&self` primitives (no duplication, no new struct) and takes mmap data + moe_layers as PARAMETERS (no field add, no 99-site touch). Strictly less work than the originally planned (B) while preserving the "CI-must-stay-green" invariant.

WHAT M32c.2.2.2.1.2 (NEXT PR) WILL SHIP
========================================

* `run_qwen3_moe_generate(mapped, tokens, gen_config, config)`: full inference loop calling forward_qwen3_moe per token, sampling, detokenizing; sibling to `run_gguf_generate`.

WHAT M32c.2.2.2.1.3 WILL SHIP
==============================

* Dispatch flip in `run_inference`: arch == "qwen3_moe" routes to run_qwen3_moe_generate. Removes M32c.2.1's gguf_gpu_generate.rs short-circuit.

WHAT M32c.2.2.2.1.4 WILL SHIP
==============================

* Live falsifier: `apr run <qwen3-coder>.gguf -p "Hi" -n 8` exits 0 + stdout matches /\\S/. Discharges FALSIFY-QW3-MOE-FORWARD-003.

NO BEHAVIOR CHANGE in this PR. The new method is additive; existing forward path is untouched. Compilation clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
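
The three assertions listed for F-QW3-MOE-C22211-001 can be sketched as a small checker. `check_logits` is a hypothetical helper, not code from the test file; it works on any `&[f32]` and returns the argmax so a caller can inspect it.

```rust
/// Validate a logits vector the way the test description lists it:
/// correct length, all values finite, and a well-defined argmax.
/// Returns the argmax index on success.
fn check_logits(logits: &[f32], vocab_size: usize) -> Result<usize, String> {
    if logits.len() != vocab_size {
        return Err(format!("len {} != vocab_size {}", logits.len(), vocab_size));
    }
    if let Some(i) = logits.iter().position(|v| !v.is_finite()) {
        return Err(format!("non-finite logit at index {i}"));
    }
    // `total_cmp` gives a total order on f32, so max_by cannot panic.
    let argmax = logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .ok_or_else(|| "empty logits".to_string())?;
    Ok(argmax)
}
```

With the argmax computed from the slice itself, the "valid range" check holds by construction once the length check has passed; the finiteness check is the one that actually catches a broken forward pass.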

Pre-CI lint caught a doc-list-item-without-indentation warning at line 18 of forward_qwen3_moe.rs. Continuation line for a list item must be indented to align with the item's text, not the bullet. Caught by: ci/lint job in PR #1124.
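
For reference, a doc comment that satisfies this rule looks like the sketch below. The warning text matches Clippy's `doc_lazy_continuation` lint, though the commit does not name the lint, so treat that attribution as an assumption. `triangular` is a hypothetical function used only to carry the comment.

```rust
/// Sum the first `n` positive integers.
///
/// Notes:
/// * A long list item whose text wraps onto a second line; the
///   continuation is indented to line up with the item's text.
/// * A short item that fits on one line.
fn triangular(n: u64) -> u64 {
    n * (n + 1) / 2
}
```

Starting the wrapped line flush with the `*` instead would still render as part of the item in most Markdown parsers, which is why the lint flags it as ambiguous.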

…SIFY-QW3-MOE-FORWARD-003 (#1127)

## What ships

Adds `crates/apr-cli/tests/qwen3_moe_apr_run_live_falsifier.rs` (F-QW3-MOE-C22214-001), an integration test that invokes the user-facing `apr` binary as a subprocess and asserts:

1. exit 0
2. stdout contains ≥1 non-whitespace character

against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf with a fresh date-tagged prompt.

This pins the M32c.2.2.2.1.3 dispatch flip (PR #1126, squash a902eea) in CI / regression-prevention. Without it, a future regression that re-routed qwen3_moe back to the dense `run_gguf_generate` path (which produces garbage on MoE weights) would slip through CI silently; there'd be no signal at the `apr run` user-facing surface.

## Live evidence (lambda-vector RTX 4090, 2026-04-29)

```
running 1 test
test f_qw3_moe_c22214_001_apr_run_emits_at_least_one_non_whitespace_char ... F-QW3-MOE-C22214-001: live `apr run` against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
F-QW3-MOE-C22214-001: elapsed = 130.945370974s
stdout (first 200B):
=== APR Run ===
Source: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
Output: .
Completed in 130.83s (cached)
stderr (first 200B):
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
F-QW3-MOE-C22214-001: PASS
ok

test result: ok. 1 passed; 0 failed; 0 ignored
```

Token quality vs llama.cpp Q4_K (cosine on logits) is M32d. This test asserts ONLY emit/exit-0, the discharge gate for FALSIFY-QW3-MOE-FORWARD-003.

## Skip path

CI runners (and any host without the cached GGUF) print:

    F-QW3-MOE-C22214-001: SKIP — no cached Qwen3-Coder GGUF at any of [...]

and return success. Same skip pattern as `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (M32c.2.2.2.1.1 in-process forward primitive).

## Contract chain status

    M32a            qwen3-moe-forward-v1 contract scaffold        SHIPPED (#1099)
    M32b            arch-aware FFN load refuses qwen3_moe         SHIPPED (#1100)
    M32c.1+         MoE descriptor load + per-expert byte slicer  SHIPPED
    M32c.2.2.2.1.1  forward_qwen3_moe method                      SHIPPED (#1124)
    M32c.2.2.2.1.2  run_qwen3_moe_generate function               SHIPPED (#1125)
    M32c.2.2.2.1.3  dispatch flip + Q4_K_M qtype dispatch         SHIPPED (#1126)
    M32c.2.2.2.1.4  live `apr run` falsifier                      THIS PR
    M32d            numerical parity vs llama.cpp                 PENDING

After M32d the contract flips DRAFT → ACTIVE_RUNTIME, which unblocks the companion-repo FALSIFY-CCPA-013 measured tool-dispatch parity gate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
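
The two assertions of the live falsifier (exit 0, at least one non-whitespace character on stdout) can be sketched with `std::process::Command`. This is an illustrative sketch, not the contents of `qwen3_moe_apr_run_live_falsifier.rs`; `passes` and `run_and_check` are hypothetical names.

```rust
use std::process::Command;

/// The two checks from the description: the process exited successfully
/// and stdout holds at least one non-whitespace character.
fn passes(exit_ok: bool, stdout: &[u8]) -> bool {
    exit_ok
        && String::from_utf8_lossy(stdout)
            .chars()
            .any(|c| !c.is_whitespace())
}

/// Run a binary as a subprocess and apply `passes` to its result.
/// Returns an I/O error if the program cannot be spawned at all.
fn run_and_check(program: &str, args: &[&str]) -> std::io::Result<bool> {
    let out = Command::new(program).args(args).output()?;
    Ok(passes(out.status.success(), &out.stdout))
}
```

Keeping the predicate separate from the subprocess call lets the check itself be unit-tested on byte slices without the multi-minute model run.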

noahgift and others added 2 commits April 29, 2026 07:59

noahgift merged commit 10c74c4 into main Apr 29, 2026
10 checks passed

noahgift deleted the feat/m32c-2-2-2-1-1-forward-qwen3-moe-method branch April 29, 2026 06:53

This was referenced Apr 29, 2026

feat(realizar): M32c.2.2.2.1.2 — run_qwen3_moe_generate full inference loop #1125

Merged

test(realizar): M32c.2.2.2.1.4 — live apr run falsifier pinning FALSIFY-QW3-MOE-FORWARD-003 #1127

Merged
