feat(realizar): M32c.2.2.2.1.2 — run_qwen3_moe_generate full inference loop #1125
Merged
Composes M32c.2.2.2.1.1's `OwnedQuantizedModel::forward_qwen3_moe` into
an autoregressive token-by-token generation loop. Sibling of
`run_gguf_generate` for `qwen3_moe` arch.
WHAT THIS PR SHIPS
==================
* `crates/aprender-serve/src/gguf/keys.rs`: 3 new GGUF metadata key
constants — `expert_count`, `expert_used_count`,
`expert_feed_forward_length`.
* `crates/aprender-serve/src/gguf/metadata.rs`: 3 new accessors on
`GGUFModel` — `expert_count()`, `expert_used_count()`,
`expert_feed_forward_length()`. Each reads `{arch}.<key>` and
returns `Option<usize>`. None for dense models.
* `crates/aprender-serve/src/infer/qwen3_moe_generate.rs` (NEW):
  - `pub fn run_qwen3_moe_generate(mapped, model, input_tokens,
    gen_config) -> Result<Vec<u32>>`
  - Reads MoE config from metadata
  - Builds per-layer `Qwen3MoeQuantizedLayer` descriptors via
    `load_qwen3_moe_layer` (M32c.1)
  - Generation loop: full-prefill per token via `forward_qwen3_moe`
    (M32c.2.2.2.1.1) → greedy argmax → append → repeat
  - Stops on configured `stop_tokens`
* `crates/aprender-serve/src/infer/mod.rs`: registers the new module.
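
The accessor pattern described above (read `{arch}.<key>`, return `Option<usize>`, `None` for dense models) can be sketched as follows. This is a minimal illustration, not the code in `metadata.rs`: `MetadataSketch` and its `HashMap` storage are hypothetical stand-ins for `GGUFModel`, and the numeric values in the usage note are made up.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Hypothetical stand-in for the GGUF metadata table (assumption: only
/// unsigned integer values are modelled here).
pub struct MetadataSketch {
    pub arch: String,
    pub values: HashMap<String, u64>,
}

impl MetadataSketch {
    /// Build a table from `(full_key, value)` pairs.
    pub fn new(arch: &str, pairs: &[(&str, u64)]) -> Self {
        MetadataSketch {
            arch: arch.to_string(),
            values: pairs.iter().map(|&(k, v)| (k.to_string(), v)).collect(),
        }
    }

    /// Look up `{arch}.<key>`; `None` when the key is absent.
    fn arch_usize(&self, key: &str) -> Option<usize> {
        let full_key = format!("{}.{}", self.arch, key);
        self.values
            .get(&full_key)
            .and_then(|&v| usize::try_from(v).ok())
    }

    pub fn expert_count(&self) -> Option<usize> {
        self.arch_usize("expert_count")
    }

    pub fn expert_used_count(&self) -> Option<usize> {
        self.arch_usize("expert_used_count")
    }

    pub fn expert_feed_forward_length(&self) -> Option<usize> {
        self.arch_usize("expert_feed_forward_length")
    }
}
```

With a table built as `MetadataSketch::new("qwen3moe", &[("qwen3moe.expert_count", 128)])`, `expert_count()` yields `Some(128)` while the other two accessors yield `None`; a table for a dense arch with no expert keys yields `None` for all three.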
DESIGN NOTES
============
* No KV cache — full prefill per token. Catastrophically slow on
Qwen3-Coder-30B-A3B (~minutes per token) but CORRECT. KV cache
with mmap-borrow expert tensors is M32d follow-up.
* Greedy argmax sampling only. Top-p/top-k/temperature are M32
follow-up; the goal of this slice is "tokens emit", not
"tokens have nice distribution".
* No tracing/profiling integration. Goal is the
FALSIFY-QW3-MOE-FORWARD-003 minimum: `apr run -n 8` exits 0 + stdout
matches /\S/. Latency is accepted.
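
The first two design notes (no KV cache, greedy argmax only) amount to a very small loop. The sketch below shows its shape under stated assumptions: `forward` is a hypothetical closure standing in for `forward_qwen3_moe`, the function names are illustrative, and whether the real function returns the prompt together with the generated tokens, or appends the stop token, is not stated in this PR.

```rust
/// Index of the largest logit; ties resolve to the lowest index.
/// NaN logits are not handled specially in this sketch.
pub fn greedy_argmax(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        let better = match best {
            None => true,
            Some((_, b)) => v > b,
        };
        if better {
            best = Some((i, v));
        }
    }
    best.map(|(i, _)| i as u32)
}

/// Greedy generation with no KV cache: every step re-runs `forward`
/// over the whole sequence so far (full prefill per token).
pub fn generate_greedy<F>(
    mut forward: F,
    input_tokens: &[u32],
    max_new_tokens: usize,
    stop_tokens: &[u32],
) -> Vec<u32>
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut tokens = input_tokens.to_vec();
    for _ in 0..max_new_tokens {
        // Cost grows with sequence length because nothing is cached.
        let logits = forward(&tokens);
        let next = match greedy_argmax(&logits) {
            Some(t) => t,
            None => break, // empty logits: nothing to sample
        };
        // Assumption: the stop token is appended before stopping.
        tokens.push(next);
        if stop_tokens.contains(&next) {
            break;
        }
    }
    tokens
}
```

Because each step recomputes the full sequence, the work per token scales with the current length, which is the latency trade-off the notes accept in exchange for a simple, checkable loop.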
WHAT M32c.2.2.2.1.3 (NEXT PR) WILL SHIP
========================================
* Dispatch flip: in run_inference's GGUF format branch, detect
arch == "qwen3_moe" and call run_qwen3_moe_generate. Removes
M32c.2.1's gguf_gpu_generate.rs short-circuit.
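
The planned dispatch flip is a branch on the architecture string. A rough sketch, assuming hypothetical names (`GenerationPath`, `select_generation_path`) that do not come from the codebase; accepting the `qwen3moe` spelling, which appears in the stderr excerpt of the follow-up commit, is also an assumption of this sketch.

```rust
/// Hypothetical tag for which generation function handles a GGUF model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GenerationPath {
    /// Existing dense path (`run_gguf_generate`).
    Dense,
    /// New MoE path (`run_qwen3_moe_generate`).
    Qwen3Moe,
}

/// Pick the generation path from the architecture string.
pub fn select_generation_path(arch: &str) -> GenerationPath {
    match arch {
        // Assumption: both spellings are treated as the MoE arch.
        "qwen3_moe" | "qwen3moe" => GenerationPath::Qwen3Moe,
        _ => GenerationPath::Dense,
    }
}
```

Keeping the decision in one pure function makes the routing testable without loading a model.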
WHAT M32c.2.2.2.1.4 WILL SHIP
==============================
* Live falsifier: `apr run <qwen3-coder>.gguf -p "Hi" -n 8`
exits 0 + stdout matches /\S/. Discharges
FALSIFY-QW3-MOE-FORWARD-003.
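
The pass condition for that falsifier (exit 0 and stdout matching /\S/) can be expressed as a small predicate. This is a sketch under assumptions, not the planned test: `falsifier_passes` and `run_and_check` are hypothetical names, and "matches /\S/" is approximated as "contains at least one non-whitespace character".

```rust
use std::process::Command;

/// True when the process exited with code 0 and its stdout contains at
/// least one non-whitespace character.
pub fn falsifier_passes(exit_code: Option<i32>, stdout: &[u8]) -> bool {
    let text = String::from_utf8_lossy(stdout);
    exit_code == Some(0) && text.chars().any(|c| !c.is_whitespace())
}

/// Run a prepared command and apply the predicate to its result.
/// The caller supplies the binary and arguments.
pub fn run_and_check(cmd: &mut Command) -> std::io::Result<bool> {
    let out = cmd.output()?;
    Ok(falsifier_passes(out.status.code(), &out.stdout))
}
```

Separating the predicate from the process launch lets the pass/fail rule be unit-tested without the model file being present.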
NO BEHAVIOR CHANGE in this PR for the existing dense path. Pure
addition of a new function + 3 new metadata accessors.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on Apr 29, 2026
…SIFY-QW3-MOE-FORWARD-003 (#1127)

## What ships

Adds `crates/apr-cli/tests/qwen3_moe_apr_run_live_falsifier.rs` (F-QW3-MOE-C22214-001), an integration test that invokes the user-facing `apr` binary as a subprocess and asserts:

1. exit 0
2. stdout contains ≥1 non-whitespace character

against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf with a fresh date-tagged prompt.

This pins the M32c.2.2.2.1.3 dispatch flip (PR #1126, squash a902eea) in CI as regression prevention. Without it, a future regression that re-routed qwen3_moe back to the dense `run_gguf_generate` path (which produces garbage on MoE weights) would slip through CI silently: there would be no signal at the `apr run` user-facing surface.

## Live evidence (lambda-vector RTX 4090, 2026-04-29)

```
running 1 test
test f_qw3_moe_c22214_001_apr_run_emits_at_least_one_non_whitespace_char ... F-QW3-MOE-C22214-001: live `apr run` against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
F-QW3-MOE-C22214-001: elapsed = 130.945370974s
stdout (first 200B):
=== APR Run ===
Source: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
Output:
.
Completed in 130.83s (cached)
stderr (first 200B):
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
F-QW3-MOE-C22214-001: PASS
ok

test result: ok. 1 passed; 0 failed; 0 ignored
```

Token quality vs llama.cpp Q4_K (cosine on logits) is M32d. This test asserts ONLY emit/exit-0, the discharge gate for FALSIFY-QW3-MOE-FORWARD-003.

## Skip path

CI runners (and any host without the cached GGUF) print:

    F-QW3-MOE-C22214-001: SKIP — no cached Qwen3-Coder GGUF at any of [...]

and return success. Same skip pattern as `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (M32c.2.2.2.1.1 in-process forward primitive).
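
The skip path described above follows a common pattern for tests that depend on a large local artifact: search a list of candidate paths, run the real check if one exists, otherwise report a skip and succeed. A minimal sketch, assuming hypothetical helper names (`first_cached_model`, `run_or_skip`) and a simplified skip message; the candidate paths used by the actual test are not listed in this commit.

```rust
use std::path::{Path, PathBuf};

/// First candidate that exists as a regular file, if any.
pub fn first_cached_model(candidates: &[PathBuf]) -> Option<PathBuf> {
    candidates.iter().find(|p| p.is_file()).cloned()
}

/// Run `check` against the cached model when one is found; otherwise
/// print a skip notice and report success so CI hosts without the
/// model do not fail.
pub fn run_or_skip<F>(candidates: &[PathBuf], check: F) -> bool
where
    F: FnOnce(&Path) -> bool,
{
    match first_cached_model(candidates) {
        Some(path) => check(&path),
        None => {
            println!("SKIP: no cached model at any of {:?}", candidates);
            true
        }
    }
}
```

The trade-off of this pattern is that a skipped run is indistinguishable from a pass in the summary line, so the printed notice is the only evidence that the check did not execute.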
## Contract chain status

| Stage | Deliverable | Status |
|---|---|---|
| M32a | qwen3-moe-forward-v1 contract scaffold | SHIPPED (#1099) |
| M32b | arch-aware FFN load refuses qwen3_moe | SHIPPED (#1100) |
| M32c.1+ | MoE descriptor load + per-expert byte slicer | SHIPPED |
| M32c.2.2.2.1.1 | forward_qwen3_moe method | SHIPPED (#1124) |
| M32c.2.2.2.1.2 | run_qwen3_moe_generate function | SHIPPED (#1125) |
| M32c.2.2.2.1.3 | dispatch flip + Q4_K_M qtype dispatch | SHIPPED (#1126) |
| M32c.2.2.2.1.4 | live `apr run` falsifier | THIS PR |
| M32d | numerical parity vs llama.cpp | PENDING |

After M32d the contract flips DRAFT → ACTIVE_RUNTIME, which unblocks the companion-repo FALSIFY-CCPA-013 measured tool-dispatch parity gate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
## Summary

Composes M32c.2.2.2.1.1's `forward_qwen3_moe` into an autoregressive generation loop. Sibling of `run_gguf_generate` for the `qwen3_moe` arch.

## What's NEW

* `expert_count()`, `expert_used_count()`, `expert_feed_forward_length()`
* `run_qwen3_moe_generate(mapped, model, input_tokens, gen_config) -> Result<Vec<u32>>`

## Stage map

* `run_inference` routes qwen3_moe
* `apr run` produces tokens (FALSIFY-QW3-MOE-FORWARD-003)

## Test plan

* `cargo build --lib -p aprender-serve --features cuda` clean
* `cargo clippy -p aprender-serve --release --features cuda` clean on new files

🤖 Generated with Claude Code