contract(qwen3-moe-forward-v1): v1.1.0 → v1.2.0 — M32c.2.2.2.1 integration strategy #1123

Merged
noahgift merged 1 commit into main from feat/m32c-2-2-2-1-forward-qwen3-moe-method on Apr 29, 2026

@noahgift

Contributor

Summary

Records the design decision for the final integration step that gates apr run producing tokens. After comparing three approaches (touching 99 OwnedQuantizedModel construction sites, a parallel function, or a wrapper struct), the parallel run_qwen3_moe_generate function was chosen because it requires less work than the alternatives while preserving the CI-must-stay-green invariant.

Implementation slices (M32c.2.2.2.1.0 → .1.4)

  • .1.0: Extract apply_rope / causal_attention / qkv_matmul into reusable helpers (contract-preserving)
  • .1.1: New qwen3_moe_forward.rs module — per-token forward
  • .1.2: run_qwen3_moe_generate — full inference loop
  • .1.3: Dispatch flip — run_inference routes qwen3_moe to new function
  • .1.4: Live falsifier — apr run produces tokens (FALSIFY-QW3-MOE-FORWARD-003)

What this PR does NOT ship

No Rust code change. Contract-first per CLAUDE.md. Implementation lands in the 5 sub-slices.

Test plan

  • pv validate contracts/qwen3-moe-forward-v1.yaml clean
  • lint_passes_on_real_contracts PASS
  • CI green

🤖 Generated with Claude Code

…ation strategy

Records the design decision that gates the final wire-up for `apr run`
to produce tokens against Qwen3-Coder-30B-A3B-Instruct. Three integration
approaches were considered:

  (A) Add `mmap` + `moe_layers` fields to OwnedQuantizedModel
      → 99 construction sites must be updated (verified by grep).
        Massive blast radius for one feature.

  (B) Parallel `run_qwen3_moe_generate` function
      → ~300 LOC + helper extraction; zero touch to OwnedQuantizedModel.

  (C) Wrapper struct `Qwen3MoeOwnedModel`
      → ~150 LOC + same helper extraction as (B), PLUS new struct.
        Strictly more work than (B) for the same prerequisite.

DECISION: (B) — parallel function with attention/RoPE/KV refactored
into reusable helpers. Rationale:
  * (A)'s 99-site blast radius fails the CI-must-stay-green invariant.
  * (C)'s struct adds work without removing the helper-extraction
    prerequisite, so it's strictly worse than (B).
  * (B) makes the MoE forward path self-contained and ground-truth-
    validatable against llama.cpp without entangling the dense path.
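
The difference between (A) and (B) can be illustrated with a minimal sketch. `ModelSketch`, `MoeLayerSketch`, and `run_moe_sketch` are hypothetical stand-ins invented for this example, not the real `OwnedQuantizedModel` API. The point is only that a free function can take the MoE state as parameters, so no struct literal that builds the model has to change.

```rust
/// Hypothetical stand-in for the dense model type (not the real struct).
struct ModelSketch {
    hidden_dim: usize,
}

/// Hypothetical stand-in for per-layer MoE state.
struct MoeLayerSketch {
    scale: f32,
}

// Approach (A) would add a `moe_layers` field to `ModelSketch`, which forces
// every `ModelSketch { .. }` literal in the code base to be edited.
//
// Approach (B), sketched here: a parallel free function that borrows the
// model and receives the MoE state as arguments.
fn run_moe_sketch(model: &ModelSketch, moe_layers: &[MoeLayerSketch], token_id: u32) -> Vec<f32> {
    // Toy "embedding": a vector filled with the token id.
    let mut hidden = vec![token_id as f32; model.hidden_dim];
    // Toy "layer": scale the hidden state once per MoE layer.
    for layer in moe_layers {
        for h in hidden.iter_mut() {
            *h *= layer.scale;
        }
    }
    hidden
}
```

Existing call sites that construct `ModelSketch { hidden_dim: .. }` compile unchanged in this sketch, which is the property the rationale above relies on.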

IMPLEMENTATION SLICES (M32c.2.2.2.1.0 → M32c.2.2.2.1.4):

  .1.0 — Extract apply_rope, causal_attention, qkv_matmul from
         OwnedQuantizedModel into pure functions in a new
         gguf::inference::common module. Existing forward() refactored
         to call them. Behavior unchanged.
  .1.1 — New gguf/qwen3_moe_forward.rs: forward_one_token(mapped,
         moe_layers, transformer_state, token_id, position, kv_cache).
  .1.2 — run_qwen3_moe_generate(mapped, tokens, config): full
         inference loop using .1.1's per-token forward.
  .1.3 — Dispatch: run_inference for arch == "qwen3_moe" calls
         run_qwen3_moe_generate. Removes M32c.2.1's gguf_gpu_generate.rs
         short-circuit.
  .1.4 — Falsifier FALSIFY-QW3-MOE-FORWARD-003: `apr run
         <qwen3-coder>.gguf --prompt "Hi" -n 8` exits 0 + stdout
         matches /\\S/. Live test on lambda-vector RTX 4090.

Each sub-slice is a small, atomically-testable PR. .1.0 is contract-
preserving. .1.1-.1.2 are additive. .1.3 is the dispatch flip. .1.4
is the live falsifier that closes FALSIFY-QW3-MOE-FORWARD-003.
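
The pass condition for .1.4 (exit code 0 and stdout matching /\S/) can be sketched as a pure predicate. Both function names below are hypothetical and are not taken from the repository. The sketch also assumes that Rust's Unicode-aware `char::is_whitespace` is an acceptable stand-in for the regex class `\S`, which may differ slightly from a given regex engine.

```rust
/// Hypothetical helper: true when `stdout` contains at least one
/// non-whitespace character, which is what matching /\S/ means.
fn stdout_matches_non_whitespace(stdout: &str) -> bool {
    stdout.chars().any(|c| !c.is_whitespace())
}

/// Hypothetical pass rule for the live falsifier: the process exited
/// with code 0 and printed something other than whitespace.
fn falsifier_passes(exit_code: i32, stdout: &str) -> bool {
    exit_code == 0 && stdout_matches_non_whitespace(stdout)
}
```

A real test would presumably obtain `exit_code` and `stdout` by spawning `apr run`; that part is omitted here because it depends on a cached model and GPU host.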

After .1.4: M32d numerical parity vs llama.cpp Q4_K + HF FP16 is the
last gate before flipping this contract DRAFT → ACTIVE_RUNTIME, which
in turn unblocks companion-repo FALSIFY-CCPA-013 measured
tool-dispatch parity score.

NO RUST CHANGES in this PR — contract-only per CLAUDE.md CB-1400
contract-first design.

Validation:
  $ pv validate contracts/qwen3-moe-forward-v1.yaml
  0 error(s), 0 warning(s)
  Contract is valid.

  $ cargo test -p aprender-contracts --lib lint_passes_on_real_contracts
  test result: ok. 1 passed; 0 failed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift merged commit bd08718 into main on Apr 29, 2026
11 checks passed
noahgift deleted the feat/m32c-2-2-2-1-forward-qwen3-moe-method branch on April 29, 2026 05:36
noahgift added a commit that referenced this pull request on Apr 29, 2026
…ntizedModel (#1124)

* feat(realizar): M32c.2.2.2.1.1 — forward_qwen3_moe method on OwnedQuantizedModel

Per qwen3-moe-forward-v1 v1.2.0 (M32c.2.2.2.1 integration strategy, PR #1123),
this is the per-token forward pass for Qwen3-MoE-arch GGUF models. Mirrors
`OwnedQuantizedModel::forward` step-for-step except the FFN site, which
calls M32c.2.2.2.0's `moe_ffn_forward_layer` instead of the dense
SwiGLU/GELU dispatch.
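
For readers unfamiliar with the FFN site being swapped, the following is a minimal sketch of a generic top-k mixture-of-experts FFN. It is not the implementation of `moe_ffn_forward_layer`: the names `route_top_k` and `moe_ffn_sketch` are hypothetical, the experts are passed in as a closure, and the choice to apply softmax over all experts and then renormalize the kept weights is an assumption about the routing scheme.

```rust
/// Softmax over the router logits, keep the `k` most probable experts,
/// and renormalize their weights so they sum to 1 (assumed scheme).
fn route_top_k(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let max = router_logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = router_logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().enumerate().map(|(i, &e)| (i, e / sum)).collect();
    // Sort by probability, highest first.
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k);
    let kept: f32 = probs.iter().map(|p| p.1).sum();
    probs.iter().map(|&(i, p)| (i, p / kept)).collect()
}

/// Weighted sum of the selected experts' outputs for one position.
/// `expert(idx, hidden)` is a hypothetical callback standing in for the
/// per-expert quantized matmuls.
fn moe_ffn_sketch<F>(hidden: &[f32], router_logits: &[f32], k: usize, expert: F) -> Vec<f32>
where
    F: Fn(usize, &[f32]) -> Vec<f32>,
{
    let mut out = vec![0.0f32; hidden.len()];
    for (idx, weight) in route_top_k(router_logits, k) {
        let y = expert(idx, hidden);
        for (o, v) in out.iter_mut().zip(y.iter()) {
            *o += weight * v;
        }
    }
    out
}
```

With 128 experts and top-8 routing, as described for this model, only 8 of the 128 expert callbacks would run per position in this sketch.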

WHAT THIS PR SHIPS
==================
* `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`
  (NEW): impl block on `OwnedQuantizedModel` adding
  `forward_qwen3_moe(token_ids, moe_layers, num_experts,
  num_experts_per_tok, moe_intermediate, data) -> Result<Vec<f32>>`.
  Reuses existing `&self` methods for embedding, attention norm,
  qkv_matmul, RoPE (apply_rope), causal_attention, attn_output proj,
  output_norm, and lm_head — ZERO duplication of attention code. The
  ONLY new logic is the per-position FFN call to `moe_ffn_forward_layer`.

* `crates/aprender-serve/src/gguf/inference/forward/mod.rs`: register
  the new module.

* `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (NEW):
  F-QW3-MOE-C22211-001 — full 1-token forward against the cached
  17.3 GB Qwen3-Coder GGUF. Asserts:
    - logits.len() == vocab_size (151936)
    - all logits finite (no NaN/Inf)
    - argmax in valid range
  Skipped when no GGUF cached (slow live test; runs minutes per call
  due to mmap fault-in + 48 layers × 128-expert routing × top-8 ×
  per-expert Q4_K/Q6_K matmul).
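
The three assertions listed for F-QW3-MOE-C22211-001 can be sketched as one checking function. `check_logits` is a hypothetical name and this is not the code in `qwen3_moe_forward_one_token.rs`; it only shows one way to express the length, finiteness, and argmax checks.

```rust
/// Hypothetical checker mirroring the three assertions: correct length,
/// all values finite, and an argmax that is a valid vocabulary index.
/// Returns the argmax on success.
fn check_logits(logits: &[f32], vocab_size: usize) -> Result<usize, String> {
    if logits.len() != vocab_size {
        return Err(format!("expected {} logits, got {}", vocab_size, logits.len()));
    }
    if let Some(i) = logits.iter().position(|x| !x.is_finite()) {
        return Err(format!("non-finite logit at index {}", i));
    }
    let argmax = logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .ok_or_else(|| "empty logits".to_string())?;
    // The range check is implied by construction but kept explicit here.
    if argmax >= vocab_size {
        return Err(format!("argmax {} out of range", argmax));
    }
    Ok(argmax)
}
```

In the live test the expected `vocab_size` would be 151936, per the description above.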

WHY THIS DESIGN (PER v1.2.0)
============================
Three integration approaches were considered (see contract):
  (A) Add fields to OwnedQuantizedModel: 99-site blast radius.
  (B) Parallel run_qwen3_moe_generate function: ~300 LOC + helpers.
  (C) Wrapper struct: ~150 LOC + helpers.

The chosen path is a HYBRID of (B) and (C): a method on
OwnedQuantizedModel that reuses existing `&self` primitives (no
duplication, no new struct) and takes mmap data + moe_layers as
PARAMETERS (no field add, no 99-site touch). Strictly less work
than the originally planned (B) while preserving the
"CI-must-stay-green" invariant.

WHAT M32c.2.2.2.1.2 (NEXT PR) WILL SHIP
========================================
* `run_qwen3_moe_generate(mapped, tokens, gen_config, config)`:
  full inference loop calling forward_qwen3_moe per token, sampling,
  detokenizing — sibling to `run_gguf_generate`.
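
The shape of such a loop can be sketched as follows. This is not the planned `run_qwen3_moe_generate`: the per-token forward is passed in as a closure, greedy argmax stands in for whatever sampling the real function uses, and the names `argmax` and `generate_greedy` plus the `eos` parameter are assumptions made for the example. Detokenization is left out.

```rust
/// Index of the largest logit, or None for an empty slice.
fn argmax(logits: &[f32]) -> Option<u32> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
}

/// Hypothetical greedy generation loop. `forward` receives every token so
/// far and returns logits for the next position.
fn generate_greedy<F>(
    prompt: &[u32],
    max_new_tokens: usize,
    eos: Option<u32>,
    mut forward: F,
) -> Result<Vec<u32>, String>
where
    F: FnMut(&[u32]) -> Result<Vec<f32>, String>,
{
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new_tokens {
        let logits = forward(&tokens)?;
        let next = argmax(&logits).ok_or("forward returned empty logits")?;
        tokens.push(next);
        // Stop early once the assumed end-of-sequence token is produced.
        if Some(next) == eos {
            break;
        }
    }
    // Return only the newly generated tokens.
    Ok(tokens[prompt.len()..].to_vec())
}
```

Passing the forward pass as a closure keeps the loop testable without a model file, which matters given that a single real forward call is described above as taking minutes.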

WHAT M32c.2.2.2.1.3 WILL SHIP
==============================
* Dispatch flip in `run_inference`: arch == "qwen3_moe" routes to
  run_qwen3_moe_generate. Removes M32c.2.1's gguf_gpu_generate.rs
  short-circuit.
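
A minimal sketch of the dispatch decision, assuming it reduces to a match on the architecture string. `GeneratePath` and `select_generate_path` are hypothetical names; the real `run_inference` presumably calls the generate functions directly rather than returning an enum.

```rust
/// Hypothetical label for which generate function would be called.
#[derive(Debug, PartialEq)]
enum GeneratePath {
    /// Would route to run_qwen3_moe_generate.
    Qwen3Moe,
    /// Would route to the existing dense path.
    Dense,
}

/// Route on the architecture string; everything except "qwen3_moe"
/// keeps the existing dense path in this sketch.
fn select_generate_path(arch: &str) -> GeneratePath {
    match arch {
        "qwen3_moe" => GeneratePath::Qwen3Moe,
        _ => GeneratePath::Dense,
    }
}
```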

WHAT M32c.2.2.2.1.4 WILL SHIP
==============================
* Live falsifier — `apr run <qwen3-coder>.gguf -p "Hi" -n 8`
  exits 0 + stdout matches /\\S/. Discharges
  FALSIFY-QW3-MOE-FORWARD-003.

NO BEHAVIOR CHANGE in this PR. The new method is additive; existing
forward path is untouched. Compilation clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(realizar): doc list indentation in forward_qwen3_moe (clippy)

Pre-CI lint caught a doc-list-item-without-indentation warning at
line 18 of forward_qwen3_moe.rs. Continuation line for a list item
must be indented to align with the item's text, not the bullet.

Caught by: ci/lint job in PR #1124.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>