feat(realizar): M32c.2.2.2.1.2 — run_qwen3_moe_generate full inference loop #1125

Merged
noahgift merged 1 commit into main from feat/m32c-2-2-2-1-2-run-qwen3-moe-generate on Apr 29, 2026

@noahgift

Summary

Composes M32c.2.2.2.1.1's forward_qwen3_moe into an autoregressive generation loop. Sibling of run_gguf_generate for qwen3_moe arch.

What's NEW

  • 3 GGUF metadata accessors: expert_count(), expert_used_count(), expert_feed_forward_length()
  • run_qwen3_moe_generate(mapped, model, input_tokens, gen_config) -> Result<Vec<u32>>
  • Greedy argmax sampling, full-prefill-per-token (no KV cache; M32d work)

Stage map

Test plan

  • cargo build --lib -p aprender-serve --features cuda clean
  • cargo clippy -p aprender-serve --release --features cuda clean on new files
  • CI green

🤖 Generated with Claude Code

…e loop

Composes M32c.2.2.2.1.1's `OwnedQuantizedModel::forward_qwen3_moe` into
an autoregressive token-by-token generation loop. Sibling of
`run_gguf_generate` for `qwen3_moe` arch.

WHAT THIS PR SHIPS
==================
* `crates/aprender-serve/src/gguf/keys.rs`: 3 new GGUF metadata key
  constants — `expert_count`, `expert_used_count`,
  `expert_feed_forward_length`.

* `crates/aprender-serve/src/gguf/metadata.rs`: 3 new accessors on
  `GGUFModel` — `expert_count()`, `expert_used_count()`,
  `expert_feed_forward_length()`. Each reads `{arch}.<key>` and
  returns `Option<usize>`. None for dense models.

* `crates/aprender-serve/src/infer/qwen3_moe_generate.rs` (NEW):
  - `pub fn run_qwen3_moe_generate(mapped, model, input_tokens,
    gen_config) -> Result<Vec<u32>>`
  - Reads MoE config from metadata
  - Builds per-layer Qwen3MoeQuantizedLayer descriptors via
    load_qwen3_moe_layer (M32c.1)
  - Generation loop: full-prefill per token via forward_qwen3_moe
    (M32c.2.2.2.1.1) → greedy argmax → append → repeat
  - Stops on configured stop_tokens

* `crates/aprender-serve/src/infer/mod.rs`: registers the new module.
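The `{arch}.<key>` lookup behind the three accessors can be sketched as follows. This is a minimal illustration, not the code in `metadata.rs`: `MetadataSketch`, its `values` map, and the `u32` value type are hypothetical stand-ins for the real `GGUFModel` internals, which this PR text does not show.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `GGUFModel`; only the pieces needed to
/// illustrate the `{arch}.<key>` lookup are modeled here.
pub struct MetadataSketch {
    /// Architecture prefix used to build the full metadata key.
    pub arch: String,
    /// Assumed storage: full key -> unsigned integer value.
    pub values: HashMap<String, u32>,
}

impl MetadataSketch {
    /// Look up `{arch}.<key>`; `None` when the key is absent,
    /// which is the expected result for dense models.
    fn arch_usize(&self, key: &str) -> Option<usize> {
        let full_key = format!("{}.{}", self.arch, key);
        self.values.get(&full_key).map(|v| *v as usize)
    }

    pub fn expert_count(&self) -> Option<usize> {
        self.arch_usize("expert_count")
    }

    pub fn expert_used_count(&self) -> Option<usize> {
        self.arch_usize("expert_used_count")
    }

    pub fn expert_feed_forward_length(&self) -> Option<usize> {
        self.arch_usize("expert_feed_forward_length")
    }
}
```

Returning `Option<usize>` lets a caller treat a missing key as "not an MoE model" without a separate architecture check.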

DESIGN NOTES
============
* No KV cache — full prefill per token. Catastrophically slow on
  Qwen3-Coder-30B-A3B (~minutes per token) but CORRECT. KV cache
  with mmap-borrow expert tensors is M32d follow-up.
* Greedy argmax sampling only. Top-p/top-k/temperature are M32
  follow-up; the goal of this slice is "tokens emit", not
  "tokens have nice distribution".
* No tracing/profiling integration. Goal is the
  FALSIFY-QW3-MOE-FORWARD-003 minimum: `apr run -n 8` exits 0 and
  stdout matches /\S/. Latency is accepted.
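A minimal sketch of the loop shape described in these notes: every step re-runs the forward pass over the whole sequence (no KV cache) and picks the next token by greedy argmax. The names `greedy_argmax` and `generate_greedy`, the closure standing in for `forward_qwen3_moe`, and the choices to return prompt plus generated tokens and to drop the stop token are all assumptions made for illustration; the real `run_qwen3_moe_generate` may differ on each point.

```rust
/// Index of the largest logit; ties resolve to the lowest index.
/// Returns `None` for an empty slice. NaN logits are not handled specially.
pub fn greedy_argmax(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        let better = match best {
            None => true,
            Some((_, b)) => v > b,
        };
        if better {
            best = Some((i, v));
        }
    }
    best.map(|(i, _)| i as u32)
}

/// Greedy generation with a full prefill per token.
/// `forward` is a hypothetical stand-in for the model's forward pass:
/// it receives the entire token sequence and returns last-position logits.
pub fn generate_greedy<F>(
    input_tokens: &[u32],
    max_new_tokens: usize,
    stop_tokens: &[u32],
    mut forward: F,
) -> Vec<u32>
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut tokens = input_tokens.to_vec();
    for _ in 0..max_new_tokens {
        // No KV cache: the whole sequence is recomputed on every step.
        let logits = forward(&tokens);
        let next = match greedy_argmax(&logits) {
            Some(t) => t,
            None => break,
        };
        // Assumption: the stop token itself is not appended.
        if stop_tokens.contains(&next) {
            break;
        }
        tokens.push(next);
    }
    tokens
}
```

Because each step reprocesses all previous tokens, the work per step grows with sequence length, which is the cost the KV-cache follow-up is meant to remove.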

WHAT M32c.2.2.2.1.3 (NEXT PR) WILL SHIP
========================================
* Dispatch flip: in run_inference's GGUF format branch, detect
  arch == "qwen3_moe" and call run_qwen3_moe_generate. Removes
  M32c.2.1's gguf_gpu_generate.rs short-circuit.
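The planned dispatch flip amounts to a branch on the architecture string. The sketch below is hypothetical: `GenerationPath` and `select_generation_path` are illustrative names, and the literal `"qwen3_moe"` is taken from the description above rather than from the actual `run_inference` code.

```rust
/// Hypothetical marker for which generation function would be called.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum GenerationPath {
    /// Existing dense path (`run_gguf_generate`).
    Dense,
    /// New MoE path (`run_qwen3_moe_generate`).
    Qwen3Moe,
}

/// Choose a path from the GGUF architecture string.
/// Assumption: the architecture is compared as an exact string match.
pub fn select_generation_path(arch: &str) -> GenerationPath {
    if arch == "qwen3_moe" {
        GenerationPath::Qwen3Moe
    } else {
        GenerationPath::Dense
    }
}
```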

WHAT M32c.2.2.2.1.4 WILL SHIP
==============================
* Live falsifier: `apr run <qwen3-coder>.gguf -p "Hi" -n 8`
  exits 0 and stdout matches /\S/. Discharges
  FALSIFY-QW3-MOE-FORWARD-003.

NO BEHAVIOR CHANGE in this PR for the existing dense path. Pure
addition of a new function + 3 new metadata accessors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 16dcfe7 into main Apr 29, 2026
11 checks passed
@noahgift noahgift deleted the feat/m32c-2-2-2-1-2-run-qwen3-moe-generate branch April 29, 2026 07:26
noahgift added a commit that referenced this pull request Apr 29, 2026
…SIFY-QW3-MOE-FORWARD-003 (#1127)

## What ships

Adds `crates/apr-cli/tests/qwen3_moe_apr_run_live_falsifier.rs` —
F-QW3-MOE-C22214-001, an integration test that invokes the user-facing
`apr` binary as a subprocess and asserts:

  1. exit 0
  2. stdout contains ≥1 non-whitespace character

against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
with a fresh date-tagged prompt.
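The two assertions can be sketched with `std::process::Command`. This is an illustration under stated assumptions, not the contents of `qwen3_moe_apr_run_live_falsifier.rs`: `has_non_whitespace` and `run_and_check` are hypothetical helper names, and the program and arguments are supplied by the caller.

```rust
use std::process::Command;

/// True when the captured stdout holds at least one non-whitespace
/// character, i.e. it would match the regex `\S`.
pub fn has_non_whitespace(stdout: &[u8]) -> bool {
    String::from_utf8_lossy(stdout)
        .chars()
        .any(|c| !c.is_whitespace())
}

/// Run a binary as a subprocess and apply both checks:
/// exit status 0 and non-whitespace stdout.
pub fn run_and_check(program: &str, args: &[&str]) -> Result<(), String> {
    let output = Command::new(program)
        .args(args)
        .output()
        .map_err(|e| format!("failed to spawn {}: {}", program, e))?;
    if !output.status.success() {
        return Err(format!("non-zero exit: {:?}", output.status.code()));
    }
    if !has_non_whitespace(&output.stdout) {
        return Err("stdout contained only whitespace".to_string());
    }
    Ok(())
}
```

Checking the user-facing binary rather than an in-process function is what lets the test catch a dispatch regression that never reaches the new code path.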

This pins the M32c.2.2.2.1.3 dispatch flip (PR #1126,
squash a902eea) in CI / regression-prevention. Without it, a
future regression that re-routed qwen3_moe back to the dense
`run_gguf_generate` path (which produces garbage on MoE weights)
would slip through CI silently — there'd be no signal at the
`apr run` user-facing surface.

## Live evidence (lambda-vector RTX 4090, 2026-04-29)

```
running 1 test
test f_qw3_moe_c22214_001_apr_run_emits_at_least_one_non_whitespace_char ...
F-QW3-MOE-C22214-001: live `apr run` against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
F-QW3-MOE-C22214-001: elapsed = 130.945370974s
  stdout (first 200B): === APR Run ===

Source: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf

Output:
.

Completed in 130.83s (cached)

  stderr (first 200B): [BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'

F-QW3-MOE-C22214-001: PASS
ok

test result: ok. 1 passed; 0 failed; 0 ignored
```

Token quality vs llama.cpp Q4_K (cosine on logits) is M32d. This
test asserts ONLY emit/exit-0 — the discharge gate for
FALSIFY-QW3-MOE-FORWARD-003.
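For reference, "cosine on logits" means comparing two logit vectors by the cosine of the angle between them. The function below is a generic sketch of that metric only; M32d is still pending, so this is not the project's parity harness, and `cosine_similarity` is a hypothetical name.

```rust
/// Cosine similarity between two logit vectors, accumulated in f64.
/// Returns `None` for empty, mismatched-length, or zero-norm inputs.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}
```

A value near 1.0 means the two implementations rank and scale the vocabulary almost identically at that position; any pass threshold would be a decision for M32d.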

## Skip path

CI runners (and any host without the cached GGUF) print:

  F-QW3-MOE-C22214-001: SKIP — no cached Qwen3-Coder GGUF at any of [...]

and return success. Same skip pattern as
`crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs`
(M32c.2.2.2.1.1 in-process forward primitive).
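A skip path of this kind can be sketched as a search over candidate model paths that returns early when none exists. The helper names `first_existing` and `run_or_skip`, the candidate list, and the exact message wording are hypothetical; the real test's path list is elided in the text above and is not reproduced here.

```rust
use std::path::{Path, PathBuf};

/// First candidate path that exists on this host, if any.
pub fn first_existing(candidates: &[PathBuf]) -> Option<&Path> {
    candidates.iter().map(|p| p.as_path()).find(|p| p.exists())
}

/// Run `body` against the first existing candidate, or print a SKIP
/// line and return without failing. Returns true when `body` ran.
pub fn run_or_skip<F>(test_id: &str, candidates: &[PathBuf], body: F) -> bool
where
    F: FnOnce(&Path),
{
    match first_existing(candidates) {
        Some(path) => {
            body(path);
            true
        }
        None => {
            // Illustrative wording; skipping counts as success.
            println!("{}: SKIP: no cached GGUF at any of {:?}", test_id, candidates);
            false
        }
    }
}
```

The trade-off of this pattern is that a host without the model reports success, so the live check only guards against regressions on machines that have the file cached.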

## Contract chain status

  M32a            qwen3-moe-forward-v1 contract scaffold         SHIPPED (#1099)
  M32b            arch-aware FFN load refuses qwen3_moe          SHIPPED (#1100)
  M32c.1+         MoE descriptor load + per-expert byte slicer   SHIPPED
  M32c.2.2.2.1.1  forward_qwen3_moe method                       SHIPPED (#1124)
  M32c.2.2.2.1.2  run_qwen3_moe_generate function                SHIPPED (#1125)
  M32c.2.2.2.1.3  dispatch flip + Q4_K_M qtype dispatch          SHIPPED (#1126)
  M32c.2.2.2.1.4  live `apr run` falsifier                       THIS PR
  M32d            numerical parity vs llama.cpp                  PENDING

After M32d the contract flips DRAFT → ACTIVE_RUNTIME, which
unblocks the companion-repo FALSIFY-CCPA-013 measured tool-dispatch
parity gate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>