feat(realizar): M32c.2.2.2.1.2 — run_qwen3_moe_generate full inference loop by noahgift · Pull Request #1125 · paiml/aprender

noahgift · 2026-04-29T06:59:43Z

Summary

Composes M32c.2.2.2.1.1's forward_qwen3_moe into an autoregressive generation loop. Sibling of run_gguf_generate for qwen3_moe arch.

What's NEW

3 GGUF metadata accessors: expert_count(), expert_used_count(), expert_feed_forward_length()
run_qwen3_moe_generate(mapped, model, input_tokens, gen_config) -> Result<Vec<u32>>
Greedy argmax sampling, full-prefill-per-token (no KV cache; M32d work)
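
The loop described above (full prefill per token, greedy argmax, stop on configured tokens) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `greedy_generate`, `argmax`, and the `forward` closure are hypothetical names, with `forward` standing in for `forward_qwen3_moe`. Whether the real function returns the prompt together with the generated tokens, or appends the stop token, is not stated in the PR, so both behaviours here are assumptions.

```rust
/// Index of the largest logit; ties resolve to the lowest index.
/// NaN entries are skipped. Returns None for an empty (or all-NaN) slice.
fn argmax(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        match best {
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i as u32)
}

/// Hypothetical stand-in for the generation loop. `forward` maps the full
/// token sequence to the logits for the next position.
fn greedy_generate<F>(
    mut forward: F,
    input_tokens: &[u32],
    max_new_tokens: usize,
    stop_tokens: &[u32],
) -> Vec<u32>
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut tokens = input_tokens.to_vec();
    for _ in 0..max_new_tokens {
        // No KV cache: the whole sequence is re-run for every new token.
        let logits = forward(&tokens);
        let Some(next) = argmax(&logits) else { break };
        // Assumption: the stop token itself is not appended to the output.
        if stop_tokens.contains(&next) {
            break;
        }
        tokens.push(next);
    }
    tokens
}
```

Because every step re-runs the forward pass over the whole sequence, the cost per new token grows with sequence length, which is the latency trade-off the PR accepts until KV caching lands in M32d.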

Stage map

M32c.2.2.2.1.1 forward_qwen3_moe method: ✅ merged (feat(realizar): M32c.2.2.2.1.1 — forward_qwen3_moe method on OwnedQuantizedModel #1124)
M32c.2.2.2.1.2 run_qwen3_moe_generate (this PR)
M32c.2.2.2.1.3 dispatch flip — run_inference routes qwen3_moe
M32c.2.2.2.1.4 live falsifier — apr run produces tokens (FALSIFY-QW3-MOE-FORWARD-003)
M32d numerical parity vs llama.cpp / HF FP16

Test plan

cargo build --lib -p aprender-serve --features cuda: builds clean
cargo clippy -p aprender-serve --release --features cuda: clean on new files
CI green

🤖 Generated with Claude Code

…e loop

Composes M32c.2.2.2.1.1's `OwnedQuantizedModel::forward_qwen3_moe` into an
autoregressive token-by-token generation loop. Sibling of `run_gguf_generate`
for `qwen3_moe` arch.

WHAT THIS PR SHIPS
==================

* `crates/aprender-serve/src/gguf/keys.rs`: 3 new GGUF metadata key
  constants: `expert_count`, `expert_used_count`,
  `expert_feed_forward_length`.
* `crates/aprender-serve/src/gguf/metadata.rs`: 3 new accessors on
  `GGUFModel`: `expert_count()`, `expert_used_count()`,
  `expert_feed_forward_length()`. Each reads `{arch}.<key>` and returns
  `Option<usize>`. None for dense models.
* `crates/aprender-serve/src/infer/qwen3_moe_generate.rs` (NEW):
  - `pub fn run_qwen3_moe_generate(mapped, model, input_tokens, gen_config)
    -> Result<Vec<u32>>`
  - Reads MoE config from metadata
  - Builds per-layer Qwen3MoeQuantizedLayer descriptors via
    load_qwen3_moe_layer (M32c.1)
  - Generation loop: full-prefill per token via forward_qwen3_moe
    (M32c.2.2.2.1.1) → greedy argmax → append → repeat
  - Stops on configured stop_tokens
* `crates/aprender-serve/src/infer/mod.rs`: registers the new module.

DESIGN NOTES
============

* No KV cache: full prefill per token. Catastrophically slow on
  Qwen3-Coder-30B-A3B (~minutes per token) but CORRECT. KV cache with
  mmap-borrow expert tensors is M32d follow-up.
* Greedy argmax sampling only. Top-p/top-k/temperature are M32 follow-up;
  the goal of this slice is "tokens emit", not "tokens have nice
  distribution".
* No tracing/profiling integration. Goal is the
  FALSIFY-QW3-MOE-FORWARD-003 minimum: `apr run -n 8` exits 0 + stdout
  matches /\S/. Latency is accepted.

WHAT M32c.2.2.2.1.3 (NEXT PR) WILL SHIP
========================================

* Dispatch flip: in run_inference's GGUF format branch, detect
  arch == "qwen3_moe" and call run_qwen3_moe_generate. Removes M32c.2.1's
  gguf_gpu_generate.rs short-circuit.

WHAT M32c.2.2.2.1.4 WILL SHIP
==============================

* Live falsifier: `apr run <qwen3-coder>.gguf -p "Hi" -n 8` exits 0 +
  stdout matches /\S/. Discharges FALSIFY-QW3-MOE-FORWARD-003.

NO BEHAVIOR CHANGE in this PR for the existing dense path. Pure addition of
a new function + 3 new metadata accessors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
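
The accessor pattern from the commit message (read `{arch}.<key>`, return `Option<usize>`, `None` for dense models) might look roughly like the sketch below. `ToyGgufMetadata`, `arch_usize`, and `is_moe` are hypothetical names invented for illustration; the real `GGUFModel` type and its storage are not shown in this PR. The `qwen3moe` prefix is borrowed from the architecture string in the falsifier's stderr output, and the numeric values in the usage note are illustrative only.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Hypothetical, heavily simplified stand-in for parsed GGUF metadata:
/// only the architecture name and integer-valued keys are modelled.
struct ToyGgufMetadata {
    architecture: String,
    integers: HashMap<String, u64>,
}

impl ToyGgufMetadata {
    /// Look up `{arch}.<key>`; None when the key is absent or the value
    /// does not fit in usize.
    fn arch_usize(&self, key: &str) -> Option<usize> {
        let full_key = format!("{}.{}", self.architecture, key);
        let raw = *self.integers.get(&full_key)?;
        usize::try_from(raw).ok()
    }

    fn expert_count(&self) -> Option<usize> {
        self.arch_usize("expert_count")
    }

    fn expert_used_count(&self) -> Option<usize> {
        self.arch_usize("expert_used_count")
    }

    fn expert_feed_forward_length(&self) -> Option<usize> {
        self.arch_usize("expert_feed_forward_length")
    }

    /// Assumption: a model counts as MoE when it declares at least one expert.
    fn is_moe(&self) -> bool {
        self.expert_count().map_or(false, |n| n > 0)
    }
}
```

With a map holding `qwen3moe.expert_count = 128`, `expert_count()` yields `Some(128)`; on a dense model with no such key it yields `None`, which gives a caller a cheap way to tell the two paths apart.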

…SIFY-QW3-MOE-FORWARD-003 (#1127)

## What ships

Adds `crates/apr-cli/tests/qwen3_moe_apr_run_live_falsifier.rs` (F-QW3-MOE-C22214-001), an integration test that invokes the user-facing `apr` binary as a subprocess and asserts:

1. exit 0
2. stdout contains ≥1 non-whitespace character

against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf with a fresh date-tagged prompt.

This pins the M32c.2.2.2.1.3 dispatch flip (PR #1126, squash a902eea) in CI / regression-prevention. Without it, a future regression that re-routed qwen3_moe back to the dense `run_gguf_generate` path (which produces garbage on MoE weights) would slip through CI silently: there'd be no signal at the `apr run` user-facing surface.

## Live evidence (lambda-vector RTX 4090, 2026-04-29)

```
running 1 test
test f_qw3_moe_c22214_001_apr_run_emits_at_least_one_non_whitespace_char ...
F-QW3-MOE-C22214-001: live `apr run` against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
F-QW3-MOE-C22214-001: elapsed = 130.945370974s
stdout (first 200B):
=== APR Run ===
Source: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
Output:
.
Completed in 130.83s (cached)
stderr (first 200B):
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
F-QW3-MOE-C22214-001: PASS
ok

test result: ok. 1 passed; 0 failed; 0 ignored
```

Token quality vs llama.cpp Q4_K (cosine on logits) is M32d. This test asserts ONLY emit/exit-0, the discharge gate for FALSIFY-QW3-MOE-FORWARD-003.

## Skip path

CI runners (and any host without the cached GGUF) print:

    F-QW3-MOE-C22214-001: SKIP — no cached Qwen3-Coder GGUF at any of [...]

and return success. Same skip pattern as `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (M32c.2.2.2.1.1 in-process forward primitive).

## Contract chain status

| Stage | Deliverable | Status |
| --- | --- | --- |
| M32a | qwen3-moe-forward-v1 contract scaffold | SHIPPED (#1099) |
| M32b | arch-aware FFN load refuses qwen3_moe | SHIPPED (#1100) |
| M32c.1+ | MoE descriptor load + per-expert byte slicer | SHIPPED |
| M32c.2.2.2.1.1 | forward_qwen3_moe method | SHIPPED (#1124) |
| M32c.2.2.2.1.2 | run_qwen3_moe_generate function | SHIPPED (#1125) |
| M32c.2.2.2.1.3 | dispatch flip + Q4_K_M qtype dispatch | SHIPPED (#1126) |
| M32c.2.2.2.1.4 | live `apr run` falsifier | THIS PR |
| M32d | numerical parity vs llama.cpp | PENDING |

After M32d the contract flips DRAFT → ACTIVE_RUNTIME, which unblocks the companion-repo FALSIFY-CCPA-013 measured tool-dispatch parity gate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
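
The falsifier's shape (skip when no cached model exists, otherwise spawn `apr` and check exit status plus non-whitespace stdout) could be sketched as below. This is an illustrative approximation, not the contents of `qwen3_moe_apr_run_live_falsifier.rs`: `Outcome`, `first_existing`, `judge`, and `run_falsifier` are hypothetical names, and the candidate paths are supplied by the caller rather than taken from the real test. The `apr run <model> -p "Hi" -n 8` invocation is the one quoted in the commit message.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

#[derive(Debug, PartialEq)]
enum Outcome {
    Skip,
    Pass,
    Fail(String),
}

/// First candidate path that exists as a regular file, if any.
fn first_existing(candidates: &[PathBuf]) -> Option<&Path> {
    candidates.iter().map(PathBuf::as_path).find(|p| p.is_file())
}

/// The two assertions: exit 0 and at least one non-whitespace character.
fn judge(exit_ok: bool, stdout: &[u8]) -> Outcome {
    if !exit_ok {
        return Outcome::Fail("non-zero exit".to_string());
    }
    let text = String::from_utf8_lossy(stdout);
    if text.chars().any(|c| !c.is_whitespace()) {
        Outcome::Pass
    } else {
        Outcome::Fail("stdout has no non-whitespace character".to_string())
    }
}

/// Hypothetical driver: skips (successfully) when no model file is present.
fn run_falsifier(apr_bin: &str, candidates: &[PathBuf]) -> Outcome {
    let Some(model) = first_existing(candidates) else {
        println!("SKIP: no cached GGUF at any of {candidates:?}");
        return Outcome::Skip;
    };
    let result = Command::new(apr_bin)
        .arg("run")
        .arg(model)
        .args(["-p", "Hi", "-n", "8"])
        .output();
    match result {
        Ok(out) => judge(out.status.success(), &out.stdout),
        Err(e) => Outcome::Fail(format!("failed to spawn {apr_bin}: {e}")),
    }
}
```

Treating a missing model as `Skip` rather than `Fail` is what lets the same test pass on CI runners that lack the 17.3 GB file while still exercising the full path on a developer machine.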

noahgift merged commit 16dcfe7 into main Apr 29, 2026
11 checks passed

noahgift deleted the feat/m32c-2-2-2-1-2-run-qwen3-moe-generate branch April 29, 2026 07:26

noahgift mentioned this pull request Apr 29, 2026

test(realizar): M32c.2.2.2.1.4 — live apr run falsifier pinning FALSIFY-QW3-MOE-FORWARD-003 #1127

Merged

5 tasks
