
feat(realizar): M32c.2.2.2.1.1 — forward_qwen3_moe method on OwnedQuantizedModel #1124

Merged
noahgift merged 2 commits into main from feat/m32c-2-2-2-1-1-forward-qwen3-moe-method
Apr 29, 2026


@noahgift

Contributor

Summary

Per-token forward pass for Qwen3-MoE — mirrors OwnedQuantizedModel::forward step-for-step except the FFN site, which calls M32c.2.2.2.0's moe_ffn_forward_layer instead of dense SwiGLU.
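
To make the difference at the FFN site concrete, here is a minimal sketch of a top-k mixture-of-experts FFN on plain `f32` weights. It is not the body of `moe_ffn_forward_layer`: the `Expert` struct, `route_top_k`, `moe_ffn_sketch`, and the row-major layouts are all hypothetical, and quantized (Q4_K/Q6_K) matmuls are replaced by a naive mat-vec. The routing shown (softmax over the selected top-k router logits) is the common formulation and is assumed here, not confirmed by this PR.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Naive row-major mat-vec: `w` is `rows x cols`, `x` has `cols` entries.
fn matvec(w: &[f32], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// Hypothetical unquantized expert: `gate` and `up` are `inter x hidden`,
/// `down` is `hidden x inter`.
struct Expert {
    gate: Vec<f32>,
    up: Vec<f32>,
    down: Vec<f32>,
}

/// Select the `k` largest router logits and softmax over just those.
/// Returns (expert index, weight) pairs whose weights sum to 1.
/// Ties are broken in favor of the lower index (stable sort).
fn route_top_k(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<usize> = (0..router_logits.len()).collect();
    idx.sort_by(|&a, &b| router_logits[b].total_cmp(&router_logits[a]));
    idx.truncate(k);
    let max = idx
        .iter()
        .map(|&i| router_logits[i])
        .fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = idx.iter().map(|&i| (router_logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    idx.into_iter().zip(exps).map(|(i, e)| (i, e / sum)).collect()
}

/// MoE FFN for one position: weighted sum of the selected experts' SwiGLU outputs.
/// `router` is `num_experts x hidden`.
fn moe_ffn_sketch(x: &[f32], router: &[f32], experts: &[Expert], k: usize, inter: usize) -> Vec<f32> {
    let hidden = x.len();
    let logits = matvec(router, experts.len(), hidden, x);
    let mut out = vec![0.0f32; hidden];
    for (e, weight) in route_top_k(&logits, k) {
        let ex = &experts[e];
        let g = matvec(&ex.gate, inter, hidden, x);
        let u = matvec(&ex.up, inter, hidden, x);
        // SwiGLU: silu(gate(x)) * up(x), then project back down.
        let h: Vec<f32> = g.iter().zip(&u).map(|(gv, uv)| silu(*gv) * uv).collect();
        let d = matvec(&ex.down, hidden, inter, &h);
        for (o, dv) in out.iter_mut().zip(d) {
            *o += weight * dv;
        }
    }
    out
}
```

A dense SwiGLU layer is the special case of a single expert with weight 1; everything outside this function (attention, RoPE, norms) is unaffected by the switch.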

Design (per v1.2.0 contract)

v1.2.0 originally specified a parallel run_qwen3_moe_generate (~300 LOC + helpers). A hybrid that is strictly less work turned out to be possible: a method on OwnedQuantizedModel that:

  • Reuses existing &self primitives (qkv_matmul, apply_rope, causal_attention, fused_matmul) — ZERO attention/RoPE/KV duplication
  • Takes mmap data + moe_layers + MoE config as parameters — ZERO field-add (no 99-site blast radius)

This achieves the v1.2.0 goal (separable MoE forward without OwnedQuantizedModel field-add) without the helper extraction step, since &self methods are already serviceable helpers.
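
The shape of that hybrid can be sketched as follows. This is an illustration only: `Model`, `MoeArgs`, `moe_ffn_stub`, and the toy arithmetic are hypothetical stand-ins, not the real `OwnedQuantizedModel` or its primitives. The point is that the MoE path reuses the same `&self` primitive as the dense path and receives MoE state as arguments, so the struct definition (and every site that constructs it) stays unchanged.

```rust
/// Stand-in for the existing model type. No MoE fields are added to it.
struct Model {
    scale: f32,
}

/// MoE state owned by the caller and passed in per call (hypothetical grouping;
/// the real method takes these as separate parameters).
struct MoeArgs<'a> {
    data: &'a [u8], // e.g. the mmap'd GGUF bytes
    num_experts: usize,
    num_experts_per_tok: usize,
}

/// Toy replacement for the MoE FFN call.
fn moe_ffn_stub(h: &[f32], moe: &MoeArgs<'_>) -> Vec<f32> {
    assert!(moe.num_experts_per_tok <= moe.num_experts);
    let _bytes = moe.data; // a real implementation would slice expert weights here
    h.iter().map(|v| v * moe.num_experts_per_tok as f32).collect()
}

impl Model {
    /// Shared primitive, used unchanged by both forward paths.
    fn attention(&self, x: &[f32]) -> Vec<f32> {
        x.iter().map(|v| v * self.scale).collect()
    }

    /// Toy dense FFN.
    fn dense_ffn(&self, x: &[f32]) -> Vec<f32> {
        x.iter().map(|v| v + 1.0).collect()
    }

    /// Existing dense path: untouched.
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        self.dense_ffn(&self.attention(x))
    }

    /// New additive path: same attention primitive, different FFN site.
    fn forward_moe(&self, x: &[f32], moe: &MoeArgs<'_>) -> Vec<f32> {
        moe_ffn_stub(&self.attention(x), moe)
    }
}
```

Because `forward_moe` only adds a method, existing constructors and callers of `Model` compile without modification, which is the property the "no field-add" bullet is after.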

Test plan

  • cargo build -p aprender-serve --test qwen3_moe_forward_one_token --features cuda clean
  • M32c.1/c.2/c.2.1/c.2.2.0/c.2.2.1/c.2.2.2.0 regression tests still pass
  • CI green
  • Live test (slow, runs minutes; runs only when cached GGUF present)

Stage map

  • M32c.2.2.2.0 full-layer dispatch: ✅ merged
  • v1.2.0 integration strategy: ✅ merged
  • M32c.2.2.2.1.1 forward_qwen3_moe method (this PR)
  • M32c.2.2.2.1.2 run_qwen3_moe_generate (full inference loop)
  • M32c.2.2.2.1.3 dispatch flip (replace gguf_gpu_generate short-circuit)
  • M32c.2.2.2.1.4 live falsifier (FALSIFY-QW3-MOE-FORWARD-003)
  • M32d numerical parity → flips contract DRAFT → ACTIVE_RUNTIME

🤖 Generated with Claude Code

noahgift and others added 2 commits April 29, 2026 07:59
…ntizedModel

Per qwen3-moe-forward-v1 v1.2.0 (M32c.2.2.2.1 integration strategy, PR #1123),
this is the per-token forward pass for Qwen3-MoE-arch GGUF models. Mirrors
`OwnedQuantizedModel::forward` step-for-step except the FFN site, which
calls M32c.2.2.2.0's `moe_ffn_forward_layer` instead of the dense
SwiGLU/GELU dispatch.

WHAT THIS PR SHIPS
==================
* `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`
  (NEW): impl block on `OwnedQuantizedModel` adding
  `forward_qwen3_moe(token_ids, moe_layers, num_experts,
  num_experts_per_tok, moe_intermediate, data) -> Result<Vec<f32>>`.
  Reuses existing `&self` methods for embedding, attention norm,
  qkv_matmul, RoPE (apply_rope), causal_attention, attn_output proj,
  output_norm, and lm_head — ZERO duplication of attention code. The
  ONLY new logic is the per-position FFN call to `moe_ffn_forward_layer`.

* `crates/aprender-serve/src/gguf/inference/forward/mod.rs`: register
  the new module.

* `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (NEW):
  F-QW3-MOE-C22211-001 — full 1-token forward against the cached
  17.3 GB Qwen3-Coder GGUF. Asserts:
    - logits.len() == vocab_size (151936)
    - all logits finite (no NaN/Inf)
    - argmax in valid range
  Skipped when no GGUF cached (slow live test; runs minutes per call
  due to mmap fault-in + 48 layers × 128-expert routing × top-8 ×
  per-expert Q4_K/Q6_K matmul).
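
The three assertions listed for F-QW3-MOE-C22211-001 can be sketched as a single helper. `check_logits` is a hypothetical name and this is not the test's actual code; it only shows one way to express "right length, all finite, argmax in range" over a logits slice.

```rust
/// Validate a logits vector and return its argmax.
/// Errors describe which of the three checks failed.
fn check_logits(logits: &[f32], vocab_size: usize) -> Result<usize, String> {
    // 1. Length must match the vocabulary size.
    if logits.len() != vocab_size {
        return Err(format!("expected {} logits, got {}", vocab_size, logits.len()));
    }
    // 2. Every logit must be finite (no NaN, no +/-Inf).
    if let Some(i) = logits.iter().position(|v| !v.is_finite()) {
        return Err(format!("non-finite logit {} at index {}", logits[i], i));
    }
    // 3. Argmax must exist and index into the vocabulary.
    //    `max_by` returns the last maximum on exact ties.
    let argmax = logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .ok_or("empty logits")?;
    if argmax >= vocab_size {
        return Err(format!("argmax {} out of range", argmax));
    }
    Ok(argmax)
}
```

In the live test described above, `vocab_size` would be 151936. The range check is redundant once the length check passes, but it keeps the three stated assertions explicit.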

WHY THIS DESIGN (PER v1.2.0)
============================
Three integration approaches were considered (see contract):
  (A) Add fields to OwnedQuantizedModel: 99-site blast radius.
  (B) Parallel run_qwen3_moe_generate function: ~300 LOC + helpers.
  (C) Wrapper struct: ~150 LOC + helpers.

The chosen path is a HYBRID of (B) and (C): a method on
OwnedQuantizedModel that reuses existing `&self` primitives (no
duplication, no new struct) and takes mmap data + moe_layers as
PARAMETERS (no field add, no 99-site touch). Strictly less work
than the originally planned (B) while preserving the
"CI-must-stay-green" invariant.

WHAT M32c.2.2.2.1.2 (NEXT PR) WILL SHIP
========================================
* `run_qwen3_moe_generate(mapped, tokens, gen_config, config)`:
  full inference loop calling forward_qwen3_moe per token, sampling,
  detokenizing — sibling to `run_gguf_generate`.
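
A rough sketch of such a loop, assuming greedy (argmax) sampling and a forward function that is re-run over the full token sequence each step. Both are simplifying assumptions: the real `run_qwen3_moe_generate` signature is the one listed above, and its sampler, KV-cache handling, and detokenization are not shown in this PR. `generate_greedy` and `argmax` are hypothetical names.

```rust
/// Index of the largest logit (last one wins on exact ties); None if empty.
fn argmax(logits: &[f32]) -> Option<u32> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
}

/// Greedy generation: call `forward` on the tokens so far, pick the argmax,
/// append it, and repeat. Stops at `max_new_tokens` or when `eos` is sampled
/// (the EOS token itself is not returned).
fn generate_greedy<F>(
    mut forward: F,
    prompt: &[u32],
    max_new_tokens: usize,
    eos: Option<u32>,
) -> Vec<u32>
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut tokens = prompt.to_vec();
    let mut generated = Vec::new();
    for _ in 0..max_new_tokens {
        let logits = forward(&tokens);
        let Some(next) = argmax(&logits) else { break };
        if Some(next) == eos {
            break;
        }
        tokens.push(next);
        generated.push(next);
    }
    generated
}
```

Passing the forward pass as a closure keeps the loop independent of the model type, so the same loop could in principle wrap either the dense or the MoE forward method.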

WHAT M32c.2.2.2.1.3 WILL SHIP
==============================
* Dispatch flip in `run_inference`: arch == "qwen3_moe" routes to
  run_qwen3_moe_generate. Removes M32c.2.1's gguf_gpu_generate.rs
  short-circuit.
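
The dispatch flip amounts to a branch on the architecture string. The sketch below uses a hypothetical `GeneratePath` enum and `select_generate_path` function; the real `run_inference` is not shown in this PR. Note that the text above writes the architecture as "qwen3_moe" while the log output later in this thread prints 'qwen3moe'; accepting both spellings here is an assumption, not confirmed behavior.

```rust
/// Which generate function a model should be routed to (hypothetical).
#[derive(Debug, PartialEq)]
enum GeneratePath {
    /// The existing dense path (run_gguf_generate in the text above).
    DenseGguf,
    /// The MoE path (run_qwen3_moe_generate in the text above).
    Qwen3Moe,
}

/// Route by architecture name. Anything that is not Qwen3-MoE keeps
/// the existing dense path, so the change is additive.
fn select_generate_path(arch: &str) -> GeneratePath {
    match arch {
        // Both spellings accepted as an assumption; see the lead-in.
        "qwen3_moe" | "qwen3moe" => GeneratePath::Qwen3Moe,
        _ => GeneratePath::DenseGguf,
    }
}
```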

WHAT M32c.2.2.2.1.4 WILL SHIP
==============================
* Live falsifier — `apr run <qwen3-coder>.gguf -p "Hi" -n 8`
  exits 0 + stdout matches /\S/. Discharges
  FALSIFY-QW3-MOE-FORWARD-003.

NO BEHAVIOR CHANGE in this PR. The new method is additive; existing
forward path is untouched. Compilation clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Pre-CI lint caught a doc-list-item-without-indentation warning at
line 18 of forward_qwen3_moe.rs. Continuation line for a list item
must be indented to align with the item's text, not the bullet.

Caught by: ci/lint job in PR #1124.
@noahgift noahgift merged commit 10c74c4 into main Apr 29, 2026
10 checks passed
@noahgift noahgift deleted the feat/m32c-2-2-2-1-1-forward-qwen3-moe-method branch April 29, 2026 06:53
noahgift added a commit that referenced this pull request Apr 29, 2026
…SIFY-QW3-MOE-FORWARD-003 (#1127)

## What ships

Adds `crates/apr-cli/tests/qwen3_moe_apr_run_live_falsifier.rs` —
F-QW3-MOE-C22214-001, an integration test that invokes the user-facing
`apr` binary as a subprocess and asserts:

  1. exit 0
  2. stdout contains ≥1 non-whitespace character

against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
with a fresh date-tagged prompt.
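
A minimal sketch of that kind of subprocess check, using only `std::process`. The helper names `run_apr` and `output_is_acceptable` are hypothetical, and the binary path and model path are supplied by the caller; the `run`, `-p`, and `-n 8` arguments follow the `apr run <qwen3-coder>.gguf -p "Hi" -n 8` invocation quoted in PR #1124. This is not the actual test source.

```rust
use std::process::{Command, Output};

/// The two assertions: the process exited successfully and stdout has at
/// least one non-whitespace character.
fn output_is_acceptable(status_ok: bool, stdout: &[u8]) -> bool {
    status_ok
        && String::from_utf8_lossy(stdout)
            .chars()
            .any(|c| !c.is_whitespace())
}

/// Spawn the CLI as a child process and capture its output.
/// `apr_bin` and `model` are caller-supplied paths (assumptions here).
fn run_apr(apr_bin: &str, model: &str, prompt: &str) -> std::io::Result<Output> {
    Command::new(apr_bin)
        .args(["run", model, "-p", prompt, "-n", "8"])
        .output()
}

/// Combine the two: run the binary, then apply both assertions.
fn apr_run_passes(apr_bin: &str, model: &str, prompt: &str) -> std::io::Result<bool> {
    let out = run_apr(apr_bin, model, prompt)?;
    Ok(output_is_acceptable(out.status.success(), &out.stdout))
}
```

Keeping the predicate separate from the spawn makes the pass/fail rule checkable without the 17.3 GB model present.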

This pins the M32c.2.2.2.1.3 dispatch flip (PR #1126,
squash a902eea) in CI / regression-prevention. Without it, a
future regression that re-routed qwen3_moe back to the dense
`run_gguf_generate` path (which produces garbage on MoE weights)
would slip through CI silently — there'd be no signal at the
`apr run` user-facing surface.

## Live evidence (lambda-vector RTX 4090, 2026-04-29)

```
running 1 test
test f_qw3_moe_c22214_001_apr_run_emits_at_least_one_non_whitespace_char ...
F-QW3-MOE-C22214-001: live `apr run` against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
F-QW3-MOE-C22214-001: elapsed = 130.945370974s
  stdout (first 200B): === APR Run ===

Source: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf

Output:
.

Completed in 130.83s (cached)

  stderr (first 200B): [BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'

F-QW3-MOE-C22214-001: PASS
ok

test result: ok. 1 passed; 0 failed; 0 ignored
```

Token quality vs llama.cpp Q4_K (cosine on logits) is M32d. This
test asserts ONLY emit/exit-0 — the discharge gate for
FALSIFY-QW3-MOE-FORWARD-003.
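
For reference, cosine similarity between two logit vectors can be computed as below. This is only a sketch of the metric named above; how M32d will actually obtain the llama.cpp logits, and what threshold it will require, is not stated here. `cosine_similarity` is a hypothetical helper that accumulates in `f64` to limit rounding error over a large vocabulary.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns None if the lengths differ, the input is empty,
/// or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}
```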

## Skip path

CI runners (and any host without the cached GGUF) print:

  F-QW3-MOE-C22214-001: SKIP — no cached Qwen3-Coder GGUF at any of [...]

and return success. Same skip pattern as
`crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs`
(M32c.2.2.2.1.1 in-process forward primitive).
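
The skip pattern can be sketched like this: probe a list of candidate paths, and if none exists, log a SKIP line and return early so the test still reports success. `first_existing` and `live_test` are hypothetical names, and the candidate list is whatever the caller passes; the real test's search locations are elided in the message above and are not reproduced here.

```rust
use std::path::PathBuf;

/// First candidate that exists as a regular file, if any.
fn first_existing(candidates: &[PathBuf]) -> Option<&PathBuf> {
    candidates.iter().find(|p| p.is_file())
}

/// Returns true if the live body ran, false if the test was skipped.
/// Either way the caller treats the test as passed.
fn live_test(candidates: &[PathBuf]) -> bool {
    let Some(model) = first_existing(candidates) else {
        eprintln!("SKIP: no cached GGUF at any of {:?}", candidates);
        return false;
    };
    eprintln!("running live test against {}", model.display());
    // ... the slow, model-dependent assertions would go here ...
    true
}
```

Returning success on skip keeps CI green on runners without the model, at the cost that a skipped run provides no signal; the SKIP line in the log is the only way to tell the two apart.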

## Contract chain status

  M32a            qwen3-moe-forward-v1 contract scaffold        SHIPPED (#1099)
  M32b            arch-aware FFN load refuses qwen3_moe         SHIPPED (#1100)
  M32c.1+         MoE descriptor load + per-expert byte slicer  SHIPPED
  M32c.2.2.2.1.1  forward_qwen3_moe method                      SHIPPED (#1124)
  M32c.2.2.2.1.2  run_qwen3_moe_generate function               SHIPPED (#1125)
  M32c.2.2.2.1.3  dispatch flip + Q4_K_M qtype dispatch         SHIPPED (#1126)
  M32c.2.2.2.1.4  live `apr run` falsifier                      THIS PR
  M32d            numerical parity vs llama.cpp                 PENDING

After M32d the contract flips DRAFT → ACTIVE_RUNTIME, which
unblocks the companion-repo FALSIFY-CCPA-013 measured tool-dispatch
parity gate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>