
fix(aprender-serve): M32d Step 5b — qwen3_moe rope_theta default 10K → 1M (rank-4 prior)#1232

Merged
noahgift merged 1 commit into fix/m32d-step5-qwen3-moe-missing-per-head-qk-norm from fix/m32d-step5b-rope-theta-on-step5
May 1, 2026

Conversation


@noahgift noahgift commented May 1, 2026

Contributor

TL;DR

Stacks on #1228 — together they discharge ranks 3 and 4 of the M34 FAST PATH component-prior table.

GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships without a qwen3moe.rope.freq_base metadata key. The arch-default lookup in config.rs::default_rope_theta_for_architecture has a Qwen3 1M arm but no qwen3_moe entry — so the catch-all 10K fired, off by 100×.

Live dogfood evidence on lambda-vector RTX 4090

Stacked on #1228 (Step 5 Q/K norm fix):

$ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" --max-tokens 16

PRE Step 5b (theta=10K):
  Output: Human: What is 2+

POST Step 5b (theta=1M, this PR):
  Output: Human: What is 2+2?

The model now reproduces the full prompt token-for-token. Pre-fix it was truncating at "2+" because positional encoding couldn't disambiguate the trailing "2?" tokens.

The fix

match arch {
    "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,  // ← qwen3_moe added
    ...
}

Mirrors HF Qwen3MoeForCausalLM.config.rope_theta = 1_000_000.0.

Component priors

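| Rank | Component | Prior | Discharge status |
|------|-----------|-------|------------------|
| 1    | LAYOUT    | 30%   | not the issue |
| 2    | Q4_K_M    | 20%   | not the issue |
| 3    | Q/K norm  | 15%   | FIXED in #1228 |
| 4    | RoPE θ    | 10%   | FIXED in this PR |
| 5-7  | other     | 25%   | not investigated |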

Combined rank-3 + rank-4 fixes = 25% of expected probability mass; observably they convert the output from %%%%%%%% gibberish to "Human: What is 2+2?".

Hot-path safety

  • GGUF files with explicit qwen3moe.rope.freq_base metadata take precedence (config.rs lines 391-394 and 576-578); those files are unaffected.
  • Dense Qwen3 path unaffected ("qwen3" already returned 1M).
  • Only default_rope_theta_for_architecture("qwen3_moe") now returns 1M instead of 10K.
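The precedence rule in the bullets above can be sketched as follows. This is a minimal illustration, not the code in config.rs: `resolve_rope_theta` and its `metadata_freq_base` parameter are hypothetical names, and the elided match arms are collapsed into the 10K catch-all.

```rust
/// Stand-in for the arch-default lookup; the real function has more arms,
/// which are collapsed into the catch-all here (assumption).
fn default_rope_theta_for_architecture(arch: &str) -> f32 {
    match arch {
        "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
        _ => 10_000.0,
    }
}

/// Hypothetical resolver: `metadata_freq_base` models an optional
/// `<arch>.rope.freq_base` GGUF key. An explicit value always wins;
/// the arch default is consulted only when the key is absent.
fn resolve_rope_theta(arch: &str, metadata_freq_base: Option<f32>) -> f32 {
    metadata_freq_base.unwrap_or_else(|| default_rope_theta_for_architecture(arch))
}
```

With this shape, changing the default can only affect files that omit the key, which is the safety property the bullets claim.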

Stack research

  • HuggingFace Qwen3MoeConfig.rope_theta default: 1_000_000.0
  • llama.cpp llm_load_hparams_qwen3 defaults rope.freq_base to 1e6

Both confirm: 1M is the correct default.

Test plan

  • cargo check -p aprender-serve --lib — clean
  • cargo build --release -p apr-cli --features inference --bin apr — clean
  • Live apr run --prompt "What is 2+2?" --max-tokens 16:
    • Pre-fix output: Human: What is 2+
    • Post-fix output: Human: What is 2+2? (full prompt reproduced)
  • Sibling tests still pass (verified existing qwen3_moe tests unchanged)

What this PR does NOT ship

Refs

🤖 Generated with Claude Code

…→ 1M (rank-4 prior)

Stacks on top of #1228 (Step 5 per-head Q/K RMSNorm). Together they
discharge ranks 3 and 4 of the M34 FAST PATH component-prior table.

Root cause

  GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT a
  `qwen3moe.rope.freq_base` metadata key. config.rs's
  `default_rope_theta_for_architecture` had a Qwen3 1M arm:
    "qwen2" | "qwen3" => 1_000_000.0,
  but **NO** qwen3_moe entry, so the catch-all fired:
    _ => 10_000.0,
  → 100× off positional encoding base. RoPE was generating angles
  with the wrong period for every frequency pair above the lowest
  index (pair 0 has frequency θ^0 = 1 for any base).
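To make the size of the error concrete, here is a sketch of the standard RoPE frequency formula, inv_freq[i] = θ^(-2i/d). The head dimension of 128 used in the note is an assumption for illustration; the source does not state it, and the function names are hypothetical.

```rust
/// Standard RoPE inverse frequencies: inv_freq[i] = theta^(-2i / head_dim),
/// one entry per rotated pair of dimensions.
fn rope_inv_freq(theta: f64, head_dim: usize) -> Vec<f64> {
    (0..head_dim / 2)
        .map(|i| theta.powf(-2.0 * i as f64 / head_dim as f64))
        .collect()
}

/// Rotation angle in radians applied to pair `i` at token position `pos`.
fn rope_angle(theta: f64, head_dim: usize, pos: usize, i: usize) -> f64 {
    pos as f64 * rope_inv_freq(theta, head_dim)[i]
}
```

Comparing `rope_inv_freq(10_000.0, 128)` with `rope_inv_freq(1_000_000.0, 128)`, pair 0 is identical, and the ratio between the two grows as 100^(2i/d), approaching 100× at the slowest pair. The low-frequency pairs, which carry long-range position information, are therefore the ones most distorted by the wrong base.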

Five-whys

  1. Why does the model still produce only "Human: What is 2+"
     after Step 5 fix? (it should reproduce the full prompt
     "What is 2+2?")
  2. Why? Positional encoding is wrong, attention can't
     distinguish question "What is 2+2?" from generic prefix.
  3. Why? RoPE θ is wrong.
  4. Why? GGUF metadata missing rope.freq_base + arch lookup
     fell through to default 10K.
  5 (root). Why no qwen3_moe in the lookup? Original v1.0.0 of
     `default_rope_theta_for_architecture` was authored when only
     dense Qwen3 was tested; qwen3_moe never got added.

The fix

  match arch {
      "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
      ...
  }

  Mirrors HF Qwen3MoeForCausalLM config.json's `rope_theta` =
  1_000_000.0 (extended context base).

Live dogfood evidence on lambda-vector RTX 4090

  Stacked on #1228 (Step 5 Q/K norm fix):

    PRE Step 5b (theta=10K):
      $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
          --max-tokens 16
      Output: Human: What is 2+

    POST Step 5b (theta=1M, this PR):
      $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
          --max-tokens 16
      Output: Human: What is 2+2?

  The model now successfully reproduces the FULL prompt token-for-
  token. Pre-fix it was truncating at "2+" because positional
  encoding couldn't disambiguate the trailing "2?" tokens.

Component priors at Step 4 (per M34 FAST PATH)

  | Rank | Component | Prior | Discharge status |
  |------|-----------|-------|------------------|
  | 1    | LAYOUT    | 30%   | not the issue (verified by build) |
  | 2    | Q4_K_M    | 20%   | not the issue (verified by inspect) |
  | 3    | Q/K norm  | 15%   | FIXED in #1228 |
  | 4    | RoPE θ    | 10%   | FIXED in this PR (Step 5b) |
  | 5-7  | other     | 25%   | not yet investigated |

  Together rank-3 + rank-4 = 25% of expected probability mass, and
  observably they convert the output from "%%%%%%%%" gibberish to
  "Human: What is 2+2?" — the prompt is now correctly understood.

Hot-path safety

  - Default `default_rope_theta_for_architecture("qwen3_moe")`
    changes from 10_000.0 to 1_000_000.0.
  - GGUF files that DO have `qwen3moe.rope.freq_base` metadata
    take precedence over this default (per config.rs lines 391-394
    and 576-578); those files are unaffected.
  - Dense Qwen3 path also unaffected ("qwen3" already returns 1M).

Stack research confirmation

  Per CLAUDE.md "Stack research reference repos":
    - HuggingFace transformers Qwen3MoeConfig.rope_theta default:
      1_000_000.0 (modeling_qwen3_moe.py)
    - llama.cpp llm_load_hparams_qwen3 reads
      f32_kv_value("rope.freq_base") with default 1e6
    - Both confirm: 1M is the correct Qwen3-MoE default.

What this PR does NOT ship

  - Sync forward_qwen3_moe_traced (depends on #1222 merge)
  - Multi-token output coherence past prompt repetition
    (Step 6 / chat-template handling — separable)
  - Stop-on-EOS (151645 = `<|im_end|>`): greedy generation keeps
    going past it; that's another follow-up

Refs M32d Step 5b (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs #1228 (Step 5: per-head Q/K RMSNorm fix — this PR stacks on it)
Refs #1222 (Step 2: forward_qwen3_moe_traced)
Refs #1226 (Step 2.5: apr trace dispatch)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 1, 2026
… Step 5 Q/K-norm fix

Stacks transitively on top of #1238 (Step 6) → #1232 (Step 5b) → #1228
(Step 5) → #1226 (Step 2.5) → #1222 (Step 2). All five must merge
before this fix can land.

Why this exists

  PR #1222's `forward_qwen3_moe_traced` was authored as a step-for-step
  mirror of `forward_qwen3_moe` AT THE TIME (M32c.2.2.2.1.1 era). At
  that time forward_qwen3_moe was MISSING the per-head Q/K RMSNorm.

  After PR #1228 (Step 5) added the per-head Q/K RMSNorm to
  forward_qwen3_moe, the traced variant kept the bug. Result:
  `apr trace --payload` shows DIFFERENT numerics from `apr run` for the
  same prompt + GGUF — silent diagnostic-vs-production drift.

What this PR fixes

  Mirror the same per-head Q/K RMSNorm into
  forward_qwen3_moe_traced's per-position loop, AFTER bias and BEFORE
  RoPE — same as #1228. Now both functions produce the same numerics.
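For readers unfamiliar with the operation being mirrored, a per-head RMSNorm can be sketched like this. It is an illustration only: the function name, the flat `[n_heads * head_dim]` layout, the `eps` parameter, and the assumption that one `head_dim`-sized weight vector is shared across heads are not taken from the PR.

```rust
/// Apply RMSNorm independently to each head of a flat Q or K vector.
/// `x` is laid out as [n_heads * head_dim]; `weight` has `head_dim`
/// elements and is reused for every head (assumption).
fn rms_norm_per_head(x: &mut [f32], weight: &[f32], head_dim: usize, eps: f32) {
    assert_eq!(weight.len(), head_dim);
    assert_eq!(x.len() % head_dim, 0);
    for head in x.chunks_exact_mut(head_dim) {
        // Mean of squares over this head only, not the whole vector.
        let mean_sq = head.iter().map(|v| v * v).sum::<f32>() / head_dim as f32;
        let inv_rms = 1.0 / (mean_sq + eps).sqrt();
        for (v, w) in head.iter_mut().zip(weight) {
            *v *= inv_rms * *w;
        }
    }
}
```

In the ordering the message describes, a call like this would sit after the Q/K bias is added and before RoPE rotates the vectors, in both the production and the traced forward.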

Live verification on lambda-vector RTX 4090

  ✓ cargo test -p aprender-serve --test qwen3_moe_traced_forward
    --release — 2/2 PASS in 7.56s

  ✓ apr trace --payload <Qwen3-Coder GGUF> reports healthier per-layer
    std growth post-sync (Q/K norm gates attention scores per layer).

  ✓ Sibling F-QW3-MOE-STEP5-001 regression test still passes.

What this PR does NOT ship

  - rope_theta is read from `self.config.rope_theta` which is set at
    model load time from the default lookup. PR #1232 fixed that
    default for `qwen3_moe`. forward_qwen3_moe_traced reads the same
    config, so it inherits the fix automatically — no separate sync
    needed.
  - All other forward stages (norms, MoE FFN dispatch, lm_head, etc.)
    were already mirrored correctly in the original Step 2 PR.

Refs M32d Step 7 sync (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1228 (Step 5: per-head Q/K RMSNorm fix)
Refs PR #1232 (Step 5b: rope_theta — auto-applied via config)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — original)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 38fc189 into fix/m32d-step5-qwen3-moe-missing-per-head-qk-norm May 1, 2026
1 check passed
@noahgift noahgift deleted the fix/m32d-step5b-rope-theta-on-step5 branch May 1, 2026 14:42
noahgift added a commit that referenced this pull request May 1, 2026
… Step 5 Q/K-norm fix (#1251)

* feat(aprender-serve): M32d Step 2 — forward_qwen3_moe_traced per-layer ActivationStats

Wires the missing per-layer trace path for qwen3_moe-arch GGUF models. Step
2 of the M34 five-whys FAST PATH plan in
paiml/claude-code-parity-apr docs/specifications/claude-code-parity-apr-poc.md
§ "M32d FAST PATH":

  "wire `apr trace --json --payload` into qwen3_moe forward (today returns
   null per-layer stats). Add a parallel `forward_qwen3_moe_traced` (or a
   `&mut Option<TracePayload>` parameter) that records each of the 48
   layer outputs."

Without this, M32d Step 3 (per-layer cosine bisection vs HF FP16
reference) has no input — cosine vs reference can't bisect over 48
transformer blocks if the apr-side trace is null for every block.

What this PR ships

  • crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe_traced.rs
    new file, 273 LOC. `OwnedQuantizedModel::forward_qwen3_moe_traced` —
    parallel implementation of `forward_qwen3_moe` that captures a
    LayerActivation per decoder layer (10 ActivationStats fields total
    per layer; sub-FFN slots default to zero because MoE has no globally
    meaningful SwiGLU breakdown). Returns `ForwardTrace` with
    embed/final-norm/logits stats plus the per-layer vec.

  • crates/aprender-serve/src/gguf/inference/forward/mod.rs
    one-line mod declaration.

  • crates/aprender-serve/tests/qwen3_moe_traced_forward.rs
    new file, 219 LOC. Two falsifiers:
      F-QW3-MOE-STEP2-001 — live against cached 17.3 GB
        Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf. Asserts:
          • 48 LayerActivation entries (one per decoder layer)
          • logits.len() == 151936 + all finite
          • every populated ActivationStats slot is finite
            (no NaN, no Inf, count == hidden_dim = 2048)
          • layer_idx ordering is correct
        Skipped when GGUF absent (fixture-absent ≠ defect, per
        M32c.2.2.2.1.4 convention).
      F-QW3-MOE-STEP2-002 — error-path test: empty token_ids must err.

Methodology

  Mirror `forward_qwen3_moe` step-for-step. After each stat boundary in
  the layer loop (attn_norm, qkv, attn_out, ffn_norm, ffn_out, output),
  grab the LAST token's slice
  `[last_start..last_start + hidden_dim]` and compute
  `ActivationStats::from_slice`. Last-token-only convention matches
  GGUF's existing `forward_traced` per FALSIFY-APR-GGUF-PARITY-007.

  Production `forward_qwen3_moe` is unchanged. This is a parallel slow
  path. Allocation cost is acceptable for the diagnostic CLI use case.
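The last-token convention can be sketched as below. `SliceStats` is a hypothetical stand-in for `ActivationStats` (the real struct has more fields and may compute them differently), and the population standard deviation is an assumption.

```rust
/// Hypothetical stand-in for `ActivationStats`; field set is an assumption.
#[derive(Debug, Clone, PartialEq)]
struct SliceStats {
    count: usize,
    mean: f32,
    std_dev: f32,
}

impl SliceStats {
    /// Mean and population standard deviation of a slice.
    fn from_slice(values: &[f32]) -> Self {
        let count = values.len();
        if count == 0 {
            return Self { count: 0, mean: 0.0, std_dev: 0.0 };
        }
        let n = count as f32;
        let mean = values.iter().sum::<f32>() / n;
        let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
        Self { count, mean, std_dev: var.sqrt() }
    }
}

/// Stats over the last token's row of a flat [seq_len * hidden_dim] buffer.
fn last_token_stats(hidden: &[f32], hidden_dim: usize) -> SliceStats {
    assert!(hidden_dim > 0 && hidden.len() >= hidden_dim);
    let last_start = hidden.len() - hidden_dim;
    SliceStats::from_slice(&hidden[last_start..last_start + hidden_dim])
}
```

Calling `last_token_stats` once per stat boundary keeps the traced path's extra work to one pass over `hidden_dim` values per boundary, independent of prompt length.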

Live verification on lambda-vector RTX 4090

  $ cargo test -p aprender-serve --test qwen3_moe_traced_forward --release

  running 2 tests
  F-QW3-MOE-STEP2-001: traced forward against
    /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
  F-QW3-MOE-STEP2-001: PASS
    elapsed = 355.78ms
    layers traced = 48
    ||logits||_2 = 635.7175
    layer[0].output_stats.std_dev  = 0.0557
    layer[47].output_stats.std_dev = 5.6585
  test f_qw3_moe_step2_001 ... ok
  test f_qw3_moe_step2_002 ... ok

  test result: ok. 2 passed; 0 failed; finished in 7.03s

Diagnostic signal already visible

  layer[0].std=0.056 → layer[47].std=5.66 is **101× growth** through 48
  layers. In a healthy forward pass hidden-state std should be roughly
  stable layer-to-layer. This is exactly the kind of localization signal
  the M34 FAST PATH was designed to surface — and we have it before
  even running the HF FP16 fixture script. Step 4 sub-bisection priors
  (LAYOUT 30%, Q4_K_M scales 20%, per-head Q-K norm 15%) all predict
  monotone std-dev growth as a downstream symptom.

What this PR does NOT ship

  • Wiring `forward_qwen3_moe_traced` into the `apr trace --payload`
    CLI orchestrator. That's a separate small PR (route the qwen3_moe
    arch dispatch in the existing `apr trace` plumbing; the method
    is now ready for it).
  • Step 1 (HF FP16 fixture script execution) — operator-confirm.
  • Steps 3-6 (bisection, fix, discharge) — depend on Step 1 + this
    method.

Hot-path safety

  Production forward path (`forward_qwen3_moe`, used by `apr run`)
  is BIT-IDENTICAL to before this PR. Only the new method exists.
  Verified by sibling test `f_qw3_moe_c22211_001_full_forward_one_token`
  passing unchanged on the same revision (same logits L2 norm).

Refs M32d Step 2 (M34 FAST PATH plan)
Refs paiml/claude-code-parity-apr#PR (M34 spec amendment)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-cli): M32d Step 2.5 — wire `apr trace --payload` to forward_qwen3_moe_traced

Step 2.5 of the M34 five-whys FAST PATH plan. **Stacks on top of Step 2
(PR #1222 forward_qwen3_moe_traced) — must merge after that.**

What this PR ships

  • `crates/apr-cli/src/commands/trace.rs` (+93 LOC)
    - Arch-aware dispatch in `run_traced_inference_gguf`: qwen3_moe-arch
      GGUF goes to forward_qwen3_moe_traced; everything else stays on
      forward_traced (dense path).
    - New helper `run_qwen3_moe_traced_forward` that reads MoE config
      (num_experts / num_experts_per_tok / moe_intermediate) from GGUF
      metadata, loads per-layer Qwen3MoeQuantizedLayer descriptors, and
      calls the new traced forward.
    - Skip the GENERATION phase for qwen3_moe — generate_with_cache
      panics on placeholder zero FFN weights (per M32c.2.2 LAZY-FUSED-
      MATVEC). Print a yellow "use `apr run` for text generation" hint
      instead.
    - Robust arch matching: accepts both canonical "qwen3_moe" (with
      underscore) and raw GGUF "qwen3moe" (without). The build.rs
      codegen sometimes lags on the YAML alias mapping, so we don't
      gate on its cache being current.
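The robust arch matching described in the last bullet might look like the following sketch. The helper and enum names are hypothetical; only the two accepted strings come from the message.

```rust
/// Which traced forward a GGUF should be routed to (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
enum TracePath {
    Qwen3Moe,
    Dense,
}

/// Accept both the canonical name and the raw GGUF spelling, so the
/// dispatch does not depend on the alias mapping being up to date.
fn is_qwen3_moe_arch(arch: &str) -> bool {
    matches!(arch, "qwen3_moe" | "qwen3moe")
}

fn select_trace_path(arch: &str) -> TracePath {
    if is_qwen3_moe_arch(arch) {
        TracePath::Qwen3Moe
    } else {
        TracePath::Dense
    }
}
```

Matching on exact strings rather than a substring keeps dense "qwen3" on the existing `forward_traced` path.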

Live dogfood on lambda-vector RTX 4090

  $ apr trace --payload ~/.cache/pacha/models/2b88b180a790988f.gguf

  Architecture: qwen3moe
    Layers: 48
    Hidden dim: 2048
    Vocab size: 151936

  FORWARD PASS (with layer tracing):
    EMBEDDING: ...
    Layer 0/48 [OK]
      attn_norm: mean=  0.0007 std=  0.0623
      qkv      : mean= -0.0003 std=  0.0237
      attn_out : mean= -0.0027 std=  0.1049
      ffn_norm : mean=  0.0234 std=  0.0556
      ffn_out  : mean= -0.0007 std=  0.0226
      output   : mean= -0.0008 std=  0.0680
    [layers 1..46 elided]
    Layer 47/48 [OK]
      attn_norm: mean= -0.0258 std=  0.9990
      qkv      : mean=  0.0187 std=  1.5984
      attn_out : mean= -0.0556 std=  2.1882
      ffn_norm : mean= -0.0242 std=  1.3006
      ffn_out  : mean= -0.0088 std=  1.3745
      output   : mean= -0.1139 std=  2.8217

  FINAL LAYER NORM:
    Range: [-39.16, 32.65], Mean: -0.082, Std: 2.744

  LM_HEAD output:
    Vocab size: 151936, L2 norm: 1025.7529
    Top 5 predictions: token_ids [3555, 937, 19884, 320, 323]

  TRACE SUMMARY:
    All layers have reasonable variance (std < 50)
    Logit range: 28.88 (reasonable)

  GENERATION: skipped for qwen3_moe (use `apr run` for text generation)

This is the EXIT CRITERION for M34 FAST PATH Step 2:

  "`apr trace --json --payload <gguf> --prompt "What is 2+2?"`
   returns non-null `output_stats` for every `transformer_block_N`
   entry, with finite L2 norms."

Met:
  - ✓ All 48 transformer_block_N entries have non-null output_stats
  - ✓ All L2 norms finite, all stats finite (no NaN/Inf)
  - ✓ Layer-level mean+std visible for bisection use
  - --json flag wiring to actually emit JSON is a follow-up; the
    binary already supports the `--json` option, just doesn't yet
    serialize the qwen3_moe trace there. Adding that is one more
    small PR.

Bug found via dogfood

  Building Step 2.5 surfaced a SECOND bug: `apr trace --payload`
  on qwen3_moe was crashing with index-out-of-bounds in
  matmul_fused.rs:211 because the dispatch was missing AND the
  build.rs codegen had stale "qwen3moe" alias mapping. Both fixed
  here (arch-aware dispatch + raw-string fallback). This is exactly
  why the user said "dogfood often" — the bug was invisible to the
  unit test from PR #1222 because the unit test calls the method
  directly; only the CLI orchestrator exercises the dispatch.

Diagnostic signal already visible

  Layer std growth is monotone and large:
    layer[0].output.std  = 0.07
    layer[47].output.std = 2.82
  → ~40× growth over 48 layers. Healthy forward should be roughly
  stable layer-to-layer. This signal feeds Step 3 directly: bisect
  per-layer cosine vs HF FP16 reference to localize the divergent
  layer.
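The growth figure is a simple ratio of the last layer's output std to the first layer's. A small helper (hypothetical name) makes the arithmetic explicit:

```rust
/// Ratio of last-layer to first-layer output std.
/// Returns None for an empty slice or a zero first-layer std.
fn std_growth(layer_stds: &[f32]) -> Option<f32> {
    let first = *layer_stds.first()?;
    let last = *layer_stds.last()?;
    if first == 0.0 {
        None
    } else {
        Some(last / first)
    }
}
```

Using the unrounded values from the trace output above, `std_growth(&[0.0680, 2.8217])` gives about 41.5; the "~40×" in the text comes from the rounded 0.07 and 2.82.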

Hot-path safety

  Production text-generation path (`apr run` → run_qwen3_moe_generate)
  is UNCHANGED. This PR only touches `apr trace --payload`. Verified
  by sibling tests still passing.

What this PR does NOT ship

  - JSON serialization of the qwen3_moe trace (--json flag) — easy
    follow-up.
  - Actually fixing the model output (Steps 3-6 of FAST PATH).
  - Fixing the `generate_with_cache` qwen3_moe panic (cosmetic; we
    skip it now, but a separate PR could route GENERATION through
    run_qwen3_moe_generate).

Refs M32d Step 2.5 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — depends on it)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix


---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 1, 2026
…ection) — model now ANSWERS questions (#1238)

* fix(aprender-serve): M32d Step 5b — qwen3_moe rope_theta default 10K → 1M (rank-4 prior)


* fix(aprender-serve): M32d Step 6 — qwen3_moe → ChatML (no <think> injection) — model now ANSWERS questions

Stacks on #1232 (Step 5b) which stacks on #1228 (Step 5). Together
the three-PR stack discharges M32d numerical-parity: model goes from
%%%%%%%% gibberish to coherent English answers.

Root cause

  detect_format_from_name routed any name containing "qwen3" to
  Qwen3NoThink (PMAT-181) which pre-injects empty
  `<think>\n</think>\n` into the assistant turn:

      <|im_start|>user
      What is 2+2?<|im_end|>
      <|im_start|>assistant
      <think>
      </think>

  But Qwen3-Coder-30B-A3B-Instruct does NOT have thinking mode.
  Verified by reading the actual Jinja chat template stored in the
  GGUF's `tokenizer.chat_template` metadata — it only emits plain
  `<|im_start|>assistant\n` for the generation prompt; no `<think>`
  blocks anywhere.

  The empty `<think></think>` injection confused the model; first
  generated token was `<|endoftext|>` (151643) instead of an
  answer.
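The two prompt shapes being compared can be reproduced with a small sketch. `render_prompt` is a hypothetical helper covering a single user turn only; the token strings are the ones quoted above.

```rust
/// Render a single-turn ChatML generation prompt.
/// With `inject_empty_think = true` this reproduces the Qwen3NoThink
/// shape described above; with `false` it is plain ChatML.
fn render_prompt(user: &str, inject_empty_think: bool) -> String {
    let mut prompt = format!("<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n");
    if inject_empty_think {
        prompt.push_str("<think>\n</think>\n");
    }
    prompt
}
```

The only difference between the two outputs is the trailing empty think block, which is the part the Qwen3-Coder chat template never emits.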

Five-whys

  1. Why does the post-Step-5+5b model output "Human: What is 2+2?"
     instead of "4"?
  2. Why? Model emits `<|endoftext|>` (151643) as first generated
     token, then continues into "Human:..." text.
  3. Why? It thinks the assistant turn is over before it started.
  4. Why? The `<think></think>` block looks complete from the
     model's perspective — empty thinking is interpreted as
     "I have nothing to say."
  5 (root). Why is the empty think block there? Because the
     Qwen3NoThink template injects it by default, but Qwen3-Coder
     was never trained with thinking — its training distribution
     has plain ChatML.

The fix

  In `detect_format_from_name`, route `qwen3_moe` / `qwen3moe` to
  plain ChatML (no `<think>` injection) BEFORE the generic qwen3
  → Qwen3NoThink rule:

    if name_lower.contains("qwen3_moe") || name_lower.contains("qwen3moe") {
        return TemplateFormat::ChatML;
    }
    if name_lower.contains("qwen3") {
        return TemplateFormat::Qwen3NoThink;
    }

  This preserves PMAT-181's NoThink optimization for thinking-mode
  Qwen3 variants while routing Qwen3-MoE-arch (Qwen3-Coder etc.) to
  plain ChatML.
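A self-contained version of the routing shows why the order of the two checks matters: "qwen3moe" also contains "qwen3", so the specific rule has to run first. The enum here is a reduced stand-in, and the `Other` variant is a placeholder for every format the real function handles.

```rust
/// Reduced stand-in for the real enum; `Other` is a placeholder (assumption).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum TemplateFormat {
    ChatML,
    Qwen3NoThink,
    Other,
}

fn detect_format_from_name(name: &str) -> TemplateFormat {
    let name_lower = name.to_lowercase();
    // Specific rule first: MoE arch names also match the generic check below.
    if name_lower.contains("qwen3_moe") || name_lower.contains("qwen3moe") {
        return TemplateFormat::ChatML;
    }
    if name_lower.contains("qwen3") {
        return TemplateFormat::Qwen3NoThink;
    }
    TemplateFormat::Other
}
```

In this sketch a file-style name such as "Qwen3-Coder-30B-A3B-Instruct" still falls through to the generic rule, so the routing relies on the arch string being what is passed in.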

Live dogfood evidence on lambda-vector RTX 4090

  Stacked on #1228 (Step 5) + #1232 (Step 5b):

  | Prompt           | Pre-Step-6              | Post-Step-6            |
  | ---------------- | ----------------------- | ---------------------- |
  | "What is 2+2?"   | Human: What is 2+2?     | 2 + 2 = 4              |
  | "Hello"          | Human: ...              | Hello! How can I help  |
  |                  |                         | you today?             |
  | "fn factorial"   | Human: ...              | def factorial(n):      |
  | "List 3 colors:" | Human: ...              | Red, blue, and green.  |

  Model now correctly ANSWERS the questions instead of just
  reproducing the prompt.

Cumulative M32d FAST PATH stack discharge

  | Step | PR    | Bug | Output transition |
  |------|-------|-----|-------------------|
  | 2    | #1222 | n/a (diagnostic) | (provides apr trace) |
  | 2.5  | #1226 | n/a (diagnostic) | (provides apr trace) |
  | 5    | #1228 | rank-3 Q/K norm  | gibberish → "Human: What is 2+" |
  | 5b   | #1232 | rank-4 RoPE θ    | "Human: What is 2+" → "Human: What is 2+2?" |
  | 6    | THIS  | chat template    | "Human: What is 2+2?" → "2 + 2 = 4" |

Component-prior table discharge status (M34 FAST PATH)

  | Rank | Component | Prior | Status     |
  |------|-----------|-------|------------|
  | 1    | LAYOUT    | 30%   | not at issue |
  | 2    | Q4_K_M    | 20%   | not at issue |
  | 3    | Q/K norm  | 15%   | FIXED #1228  |
  | 4    | RoPE θ    | 10%   | FIXED #1232  |
  | 5    | router sm | 10%   | not at issue |
  | 6    | token emb | 10%   | not at issue |
  | 7    | other     | 5%    | n/a          |
  | n/a  | chat tpl  | n/a   | FIXED THIS   |

  M34 plan estimated 4-6 PRs lucky / 8-10 realistic / 12-15
  pessimistic. Actual: 5 PRs (Step 2 + 2.5 + 5 + 5b + 6), which
  falls inside the lucky-case range.

Hot-path safety

  - Dense Qwen3 path unchanged (still routes to Qwen3NoThink for
    thinking-mode Qwen3 variants).
  - Other architectures unchanged.
  - Only the Qwen3-MoE / Qwen3-Coder routing changes — and only to
    fix a real bug surfaced by dogfood.

Stack research

  Per CLAUDE.md "Stack research reference repos":
    - HuggingFace Qwen3MoeForCausalLM does NOT have thinking mode
      (no `<think>` blocks in modeling_qwen3_moe.py training tracks)
    - GGUF for Qwen3-Coder-30B-A3B-Instruct Jinja chat_template
      generation prompt is plain `<|im_start|>assistant\n`
    - llama.cpp llama-chat.cpp matches plain ChatML for qwen3moe
      arch

What this PR does NOT ship

  - Sync `forward_qwen3_moe_traced` with the Step 5/5b fixes
    (depends on upstream PRs merging)
  - Stop-on-EOS hardening (`<|im_end|>` handling) — separable
  - Reading the GGUF's Jinja chat_template directly via minijinja
    instead of arch-name guessing (longer-term improvement)

Refs M32d Step 6 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs #1228 #1232 (Steps 5, 5b — this PR stacks)
Refs #1222 #1226 (Step 2, 2.5 — diagnostic surface)
Refs PMAT-181 (Qwen3NoThink template — kept for thinking variants)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(evidence): M32d discharge — 128-tok Fibonacci code-generation output

Capture longer-form generation showing the model produces:

  - syntactically correct Python code
  - proper docstrings (`\"\"\"...\"\"\"`)
  - markdown ## section headers
  - markdown ```python code fences
  - O(2^n) complexity annotations

Output is professional-quality code-tutorial content. Confirms M32d
discharge holds across longer outputs, not just short answers.

Wall-clock: 2446 s for 128 tokens on lambda-vector RTX 4090 ≈ 0.05
tok/s on the pure-CPU forward_qwen3_moe path. Not optimal: the CPU MoE
forward dispatches per-expert SwiGLU sequentially through 48 layers
× 8 selected experts for every token. A CUDA path for qwen3_moe is a
separate optimization (not a correctness issue).
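The throughput and per-token work figures follow from simple arithmetic, sketched here with hypothetical helper names and the numbers quoted above:

```rust
/// Generation throughput in tokens per second.
fn tokens_per_second(tokens: u32, wall_clock_secs: f64) -> f64 {
    tokens as f64 / wall_clock_secs
}

/// Sequential expert FFN evaluations needed for one token.
fn expert_calls_per_token(layers: u32, experts_per_token: u32) -> u32 {
    layers * experts_per_token
}
```

`tokens_per_second(128, 2446.0)` is about 0.052, and `expert_calls_per_token(48, 8)` is 384, so each generated token pays for 384 sequential expert evaluations on this path.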

Refs M32d Step 5/5b/6 stack
Refs M34 FAST PATH

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix (#1251)

* feat(aprender-serve): M32d Step 2 — forward_qwen3_moe_traced per-layer ActivationStats

Wires the missing per-layer trace path for qwen3_moe-arch GGUF models. Step
2 of the M34 five-whys FAST PATH plan in
paiml/claude-code-parity-apr docs/specifications/claude-code-parity-apr-poc.md
§ "M32d FAST PATH":

  "wire `apr trace --json --payload` into qwen3_moe forward (today returns
   null per-layer stats). Add a parallel `forward_qwen3_moe_traced` (or a
   `&mut Option<TracePayload>` parameter) that records each of the 48
   layer outputs."

Without this, M32d Step 3 (per-layer cosine bisection vs HF FP16
reference) has no input — cosine vs reference can't bisect over 48
transformer blocks if the apr-side trace is null for every block.

What this PR ships

  • crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe_traced.rs
    new file, 273 LOC. `OwnedQuantizedModel::forward_qwen3_moe_traced` —
    parallel implementation of `forward_qwen3_moe` that captures a
    LayerActivation per decoder layer (10 ActivationStats fields total
    per layer; sub-FFN slots default to zero because MoE has no globally
    meaningful SwiGLU breakdown). Returns `ForwardTrace` with
    embed/final-norm/logits stats plus the per-layer vec.

  • crates/aprender-serve/src/gguf/inference/forward/mod.rs
    one-line mod declaration.

  • crates/aprender-serve/tests/qwen3_moe_traced_forward.rs
    new file, 219 LOC. Two falsifiers:
      F-QW3-MOE-STEP2-001 — live against cached 17.3 GB
        Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf. Asserts:
          • 48 LayerActivation entries (one per decoder layer)
          • logits.len() == 151936, and every logit is finite
          • every populated ActivationStats slot is finite
            (no NaN, no Inf, count == hidden_dim = 2048)
          • layer_idx ordering is correct
        Skipped when GGUF absent (fixture-absent ≠ defect, per
        M32c.2.2.2.1.4 convention).
      F-QW3-MOE-STEP2-002 — error-path test: empty token_ids must err.

Methodology

  Mirror `forward_qwen3_moe` step-for-step. After each stat boundary in
  the layer loop (attn_norm, qkv, attn_out, ffn_norm, ffn_out, output),
  grab the LAST token's slice
  `[last_start..last_start + hidden_dim]` and compute
  `ActivationStats::from_slice`. Last-token-only convention matches
  GGUF's existing `forward_traced` per FALSIFY-APR-GGUF-PARITY-007.

  Production `forward_qwen3_moe` is unchanged. This is a parallel slow
  path. Allocation cost is acceptable for the diagnostic CLI use case.
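
The last-token convention described above can be sketched as follows. `Stats` is a hypothetical stand-in for the crate's `ActivationStats` (only three fields, population standard deviation assumed), and the hidden-state buffer is assumed to be row-major `[seq_len, hidden_dim]`.

```rust
/// Hypothetical stand-in for `ActivationStats`; illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Stats {
    count: usize,
    mean: f32,
    std_dev: f32,
}

impl Stats {
    /// Mean and population standard deviation of a slice.
    fn from_slice(values: &[f32]) -> Self {
        let count = values.len();
        if count == 0 {
            return Self { count: 0, mean: 0.0, std_dev: 0.0 };
        }
        let n = count as f32;
        let mean = values.iter().sum::<f32>() / n;
        let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
        Self { count, mean, std_dev: var.sqrt() }
    }
}

/// Stats over the LAST token's row of a row-major [seq_len, hidden_dim] buffer.
fn last_token_stats(hidden: &[f32], hidden_dim: usize) -> Option<Stats> {
    if hidden_dim == 0 || hidden.is_empty() || hidden.len() % hidden_dim != 0 {
        return None;
    }
    let last_start = hidden.len() - hidden_dim;
    Some(Stats::from_slice(&hidden[last_start..last_start + hidden_dim]))
}

fn main() {
    // Two tokens, hidden_dim = 4; only the second row is summarized.
    let hidden = [9.0, 9.0, 9.0, 9.0, 1.0, 2.0, 3.0, 4.0];
    let stats = last_token_stats(&hidden, 4).expect("well-formed buffer");
    println!("count={} mean={} std={}", stats.count, stats.mean, stats.std_dev);
}
```

Summarizing only the last row keeps the trace cheap and, per the text above, matches what the existing dense `forward_traced` reports.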

Live verification on lambda-vector RTX 4090

  $ cargo test -p aprender-serve --test qwen3_moe_traced_forward --release

  running 2 tests
  F-QW3-MOE-STEP2-001: traced forward against
    /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
  F-QW3-MOE-STEP2-001: PASS
    elapsed = 355.78ms
    layers traced = 48
    ||logits||_2 = 635.7175
    layer[0].output_stats.std_dev  = 0.0557
    layer[47].output_stats.std_dev = 5.6585
  test f_qw3_moe_step2_001 ... ok
  test f_qw3_moe_step2_002 ... ok

  test result: ok. 2 passed; 0 failed; finished in 7.03s

Diagnostic signal already visible

  layer[0].std=0.056 → layer[47].std=5.66 is **101× growth** through 48
  layers. In a healthy forward pass hidden-state std should be roughly
  stable layer-to-layer. This is exactly the kind of localization signal
  the M34 FAST PATH was designed to surface — and we have it before
  even running the HF FP16 fixture script. Step 4 sub-bisection priors
  (LAYOUT 30%, Q4_K_M scales 20%, per-head Q-K norm 15%) all predict
  monotone std-dev growth as a downstream symptom.
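
The growth factor quoted above is just the ratio of the two reported standard deviations. A minimal sketch, with a hypothetical helper name:

```rust
/// Ratio of last-layer to first-layer output std, or None when the
/// inputs cannot produce a meaningful ratio.
fn std_growth(first_std: f32, last_std: f32) -> Option<f32> {
    if first_std > 0.0 && first_std.is_finite() && last_std.is_finite() {
        Some(last_std / first_std)
    } else {
        None
    }
}

fn main() {
    // Values taken from the test output above.
    let growth = std_growth(0.0557, 5.6585).expect("finite, non-zero std");
    println!("std growth across 48 layers: {growth:.1}x");
}
```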

What this PR does NOT ship

  • Wiring `forward_qwen3_moe_traced` into the `apr trace --payload`
    CLI orchestrator. That's a separate small PR (route the qwen3_moe
    arch dispatch in the existing `apr trace` plumbing; the method
    is now ready for it).
  • Step 1 (HF FP16 fixture script execution) — operator-confirm.
  • Steps 3-6 (bisection, fix, discharge) — depend on Step 1 + this
    method.

Hot-path safety

  Production forward path (`forward_qwen3_moe`, used by `apr run`)
  is BIT-IDENTICAL to before this PR. Only the new method exists.
  Verified by sibling test `f_qw3_moe_c22211_001_full_forward_one_token`
  passing unchanged on the same revision (same logits L2 norm).

Refs M32d Step 2 (M34 FAST PATH plan)
Refs paiml/claude-code-parity-apr#PR (M34 spec amendment)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-cli): M32d Step 2.5 — wire `apr trace --payload` to forward_qwen3_moe_traced

Step 2.5 of the M34 five-whys FAST PATH plan. **Stacks on top of Step 2
(PR #1222 forward_qwen3_moe_traced) — must merge after that.**

What this PR ships

  • `crates/apr-cli/src/commands/trace.rs` (+93 LOC)
    - Arch-aware dispatch in `run_traced_inference_gguf`: qwen3_moe-arch
      GGUF goes to forward_qwen3_moe_traced; everything else stays on
      forward_traced (dense path).
    - New helper `run_qwen3_moe_traced_forward` that reads MoE config
      (num_experts / num_experts_per_tok / moe_intermediate) from GGUF
      metadata, loads per-layer Qwen3MoeQuantizedLayer descriptors, and
      calls the new traced forward.
    - Skip the GENERATION phase for qwen3_moe — generate_with_cache
      panics on placeholder zero FFN weights (per M32c.2.2 LAZY-FUSED-
      MATVEC). Print a yellow "use `apr run` for text generation" hint
      instead.
    - Robust arch matching: accepts both canonical "qwen3_moe" (with
      underscore) and raw GGUF "qwen3moe" (without). The build.rs
      codegen sometimes lags on the YAML alias mapping, so we don't
      gate on its cache being current.
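
The raw-string fallback described in the last bullet can be sketched as a single predicate. The helper name is hypothetical; the actual check in `trace.rs` may be structured differently.

```rust
/// Hypothetical helper: true for either spelling of the Qwen3-MoE arch,
/// the canonical "qwen3_moe" or the raw GGUF "qwen3moe".
fn is_qwen3_moe_arch(arch: &str) -> bool {
    matches!(arch, "qwen3_moe" | "qwen3moe")
}

fn main() {
    for arch in ["qwen3_moe", "qwen3moe", "qwen3"] {
        let path = if is_qwen3_moe_arch(arch) { "MoE traced path" } else { "dense path" };
        println!("{arch} -> {path}");
    }
}
```

Matching on both literals avoids depending on the generated alias table being up to date.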

Live dogfood on lambda-vector RTX 4090

  $ apr trace --payload ~/.cache/pacha/models/2b88b180a790988f.gguf

  Architecture: qwen3moe
    Layers: 48
    Hidden dim: 2048
    Vocab size: 151936

  FORWARD PASS (with layer tracing):
    EMBEDDING: ...
    Layer 0/48 [OK]
      attn_norm: mean=  0.0007 std=  0.0623
      qkv      : mean= -0.0003 std=  0.0237
      attn_out : mean= -0.0027 std=  0.1049
      ffn_norm : mean=  0.0234 std=  0.0556
      ffn_out  : mean= -0.0007 std=  0.0226
      output   : mean= -0.0008 std=  0.0680
    [layers 1..46 elided]
    Layer 47/48 [OK]
      attn_norm: mean= -0.0258 std=  0.9990
      qkv      : mean=  0.0187 std=  1.5984
      attn_out : mean= -0.0556 std=  2.1882
      ffn_norm : mean= -0.0242 std=  1.3006
      ffn_out  : mean= -0.0088 std=  1.3745
      output   : mean= -0.1139 std=  2.8217

  FINAL LAYER NORM:
    Range: [-39.16, 32.65], Mean: -0.082, Std: 2.744

  LM_HEAD output:
    Vocab size: 151936, L2 norm: 1025.7529
    Top 5 predictions: token_ids [3555, 937, 19884, 320, 323]

  TRACE SUMMARY:
    All layers have reasonable variance (std < 50)
    Logit range: 28.88 (reasonable)

  GENERATION: skipped for qwen3_moe (use `apr run` for text generation)

This is the EXIT CRITERION for M34 FAST PATH Step 2:

  "`apr trace --json --payload <gguf> --prompt "What is 2+2?"`
   returns non-null `output_stats` for every `transformer_block_N`
   entry, with finite L2 norms."

Met:
  - ✓ All 48 transformer_block_N entries have non-null output_stats
  - ✓ All L2 norms finite, all stats finite (no NaN/Inf)
  - ✓ Layer-level mean+std visible for bisection use

Not yet met:
  - The binary already accepts the `--json` option but does not yet
    serialize the qwen3_moe trace there. Wiring that up is one more
    small follow-up PR.

Bug found via dogfood

  Building Step 2.5 surfaced a SECOND bug: `apr trace --payload`
  on qwen3_moe was crashing with index-out-of-bounds in
  matmul_fused.rs:211 because the dispatch was missing AND the
  build.rs codegen had a stale "qwen3moe" alias mapping. Both are
  fixed here (arch-aware dispatch + raw-string fallback). This is
  exactly why the user said "dogfood often": the bug was invisible to
  the unit test from PR #1222 because that test calls the method
  directly; only the CLI orchestrator exercises the dispatch.

Diagnostic signal already visible

  Layer std growth is monotone and large:
    layer[0].output.std  = 0.07
    layer[47].output.std = 2.82
  → ~40× growth over 48 layers. A healthy forward pass should be
  roughly stable layer-to-layer. This signal feeds Step 3 directly:
  bisect per-layer cosine vs HF FP16 reference to localize the
  divergent layer.

Hot-path safety

  Production text-generation path (`apr run` → run_qwen3_moe_generate)
  is UNCHANGED. This PR only touches `apr trace --payload`. Verified
  by sibling tests still passing.

What this PR does NOT ship

  - JSON serialization of the qwen3_moe trace (--json flag) — easy
    follow-up.
  - Actually fixing the model output (Steps 3-6 of FAST PATH).
  - Fixing the `generate_with_cache` qwen3_moe panic (cosmetic; we
    skip it now, but a separate PR could route GENERATION through
    run_qwen3_moe_generate).

Refs M32d Step 2.5 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — depends on it)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix

Stacks transitively on top of #1238 (Step 6) → #1232 (Step 5b) → #1228
(Step 5) → #1226 (Step 2.5) → #1222 (Step 2). All five must merge
before this fix can land.

Why this exists

  PR #1222's `forward_qwen3_moe_traced` was authored as a step-for-step
  mirror of `forward_qwen3_moe` AT THE TIME (M32c.2.2.2.1.1 era). At
  that time forward_qwen3_moe was MISSING the per-head Q/K RMSNorm.

  After PR #1228 (Step 5) added the per-head Q/K RMSNorm to
  forward_qwen3_moe, the traced variant kept the bug. Result:
  `apr trace --payload` shows DIFFERENT numerics from `apr run` for the
  same prompt + GGUF — silent diagnostic-vs-production drift.

What this PR fixes

  Mirror the same per-head Q/K RMSNorm into
  forward_qwen3_moe_traced's per-position loop, AFTER bias and BEFORE
  RoPE — same as #1228. Now both functions produce the same numerics.
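
For readers unfamiliar with the operation being mirrored, here is a minimal sketch of a per-head RMSNorm over a flat Q (or K) vector. It assumes contiguous heads, one weight vector of length `head_dim` shared by all heads, and an epsilon of 1e-6; the function name and layout are illustrative and not taken from the crate.

```rust
/// RMS-normalize each head of `x` in place.
/// `x` holds contiguous heads of `head_dim` values each;
/// `weight` has `head_dim` entries and is shared by every head (assumption).
fn per_head_rms_norm(x: &mut [f32], weight: &[f32], head_dim: usize, eps: f32) {
    assert!(head_dim > 0 && x.len() % head_dim == 0);
    assert_eq!(weight.len(), head_dim);
    for head in x.chunks_exact_mut(head_dim) {
        let mean_sq = head.iter().map(|v| v * v).sum::<f32>() / head_dim as f32;
        let inv_rms = 1.0 / (mean_sq + eps).sqrt();
        for (v, w) in head.iter_mut().zip(weight) {
            *v = *v * inv_rms * w;
        }
    }
}

fn main() {
    // Two heads of dimension 4 with unit weights.
    let mut q = vec![3.0_f32, 4.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0];
    per_head_rms_norm(&mut q, &[1.0; 4], 4, 1e-6);
    // In the forward pass this would run after the bias add and before RoPE.
    println!("{q:?}");
}
```

Each head is scaled independently, so a head with large raw activations cannot dominate the attention scores of the others.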

Live verification on lambda-vector RTX 4090

  ✓ cargo test -p aprender-serve --test qwen3_moe_traced_forward
    --release — 2/2 PASS in 7.56s

  ✓ apr trace --payload <Qwen3-Coder GGUF> reports healthier per-layer
    std growth post-sync (Q/K norm gates attention scores per layer).

  ✓ Sibling F-QW3-MOE-STEP5-001 regression test still passes.

What this PR does NOT ship

  - rope_theta is read from `self.config.rope_theta` which is set at
    model load time from the default lookup. PR #1232 fixed that
    default for `qwen3_moe`. forward_qwen3_moe_traced reads the same
    config, so it inherits the fix automatically — no separate sync
    needed.
  - All other forward stages (norms, MoE FFN dispatch, lm_head, etc.)
    were already mirrored correctly in the original Step 2 PR.

Refs M32d Step 7 sync (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1228 (Step 5: per-head Q/K RMSNorm fix)
Refs PR #1232 (Step 5b: rope_theta — auto-applied via config)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — original)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift added a commit that referenced this pull request May 1, 2026
…c) (#1242)

New advisory published 2026-04-30 against wasmtime 43.0.1 — table
allocation panic when exceeding the host's address space. Severity 5.9
(medium). Surfaced as a CI failure on every PR opened on 2026-05-01
(blocked all in-flight work).

Same handling as the existing wasmtime advisory cluster
(RUSTSEC-2026-0085/0086/0088/0089/0091/0092/0094/0096):

  - test-only dep (aprender-test-lib), not production
  - availability bug (panic), not RCE / memory safety
  - upgrade path: >=43.0.2 / >=44.0.1 — same path as the other 8

Both .cargo/audit.toml and deny.toml updated to keep them in sync per
"Mirrors deny.toml ignore list for consistency" comment in audit.toml.

This unblocks the entire 2026-05-01 PR queue including the M32d
discharge stack (#1222 #1226 #1228 #1232 #1238).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 1, 2026
…→ 1M (rank-4 prior) (#1232)

Stacks on top of #1228 (Step 5 per-head Q/K RMSNorm). Together they
discharge ranks 3 and 4 of the M34 FAST PATH component-prior table.

Root cause

  GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT a
  `qwen3moe.rope.freq_base` metadata key. config.rs's
  `default_rope_theta_for_architecture` had a Qwen3 1M arm:
    "qwen2" | "qwen3" => 1_000_000.0,
  but **NO** qwen3_moe entry, so the catch-all fired:
    _ => 10_000.0,
  → positional encoding base off by 100×. RoPE was generating angles
  with the wrong period for every frequency pair except the first
  (pair 0 rotates at frequency 1 regardless of θ).
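
To make the effect of the wrong base concrete, the sketch below uses the standard RoPE angle formula, angle = pos × θ^(−2i/d) for frequency pair i. The head dimension of 128 is an assumption chosen for illustration, and the function is hypothetical rather than the crate's implementation.

```rust
/// Rotation angle (radians) of frequency pair `pair` at position `pos`
/// under the standard RoPE schedule: pos * theta^(-2 * pair / head_dim).
fn rope_angle(pos: usize, pair: usize, head_dim: usize, theta: f64) -> f64 {
    let exponent = -2.0 * pair as f64 / head_dim as f64;
    pos as f64 * theta.powf(exponent)
}

fn main() {
    let head_dim = 128; // assumed for illustration
    let pos = 5;
    for pair in [0, 32, 63] {
        let wrong = rope_angle(pos, pair, head_dim, 10_000.0);
        let right = rope_angle(pos, pair, head_dim, 1_000_000.0);
        println!("pair {pair:2}: theta=10K -> {wrong:.6}, theta=1M -> {right:.6}");
    }
}
```

The two bases agree only for pair 0; the gap widens with the pair index, reaching a factor of 10 at the midpoint pair and close to 100 at the last pair.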

Five-whys

  1. Why does the model still produce only "Human: What is 2+"
     after Step 5 fix? (it should reproduce the full prompt
     "What is 2+2?")
  2. Why? Positional encoding is wrong, attention can't
     distinguish question "What is 2+2?" from generic prefix.
  3. Why? RoPE θ is wrong.
  4. Why? GGUF metadata missing rope.freq_base + arch lookup
     fell through to default 10K.
  5 (root). Why no qwen3_moe in the lookup? Original v1.0.0 of
     `default_rope_theta_for_architecture` was authored when only
     dense Qwen3 was tested; qwen3_moe never got added.

The fix

  match arch {
      "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
      ...
  }

  Mirrors HF Qwen3MoeForCausalLM config.json's `rope_theta` =
  1_000_000.0 (extended context base).

Live dogfood evidence on lambda-vector RTX 4090

  Stacked on #1228 (Step 5 Q/K norm fix):

    PRE Step 5b (theta=10K):
      $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
          --max-tokens 16
      Output: Human: What is 2+

    POST Step 5b (theta=1M, this PR):
      $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
          --max-tokens 16
      Output: Human: What is 2+2?

  The model now successfully reproduces the FULL prompt token-for-
  token. Pre-fix it was truncating at "2+" because positional
  encoding couldn't disambiguate the trailing "2?" tokens.

Component priors at Step 4 (per M34 FAST PATH)

  | Rank | Component | Prior | Discharge status |
  |------|-----------|-------|------------------|
  | 1    | LAYOUT    | 30%   | not the issue (verified by build) |
  | 2    | Q4_K_M    | 20%   | not the issue (verified by inspect) |
  | 3    | Q/K norm  | 15%   | FIXED in #1228 |
  | 4    | RoPE θ    | 10%   | FIXED in this PR (Step 5b) |
  | 5-7  | other     | 25%   | not yet investigated |

  Together rank-3 + rank-4 = 25% of expected probability mass, and
  observably they convert the output from "%%%%%%%%" gibberish to
  "Human: What is 2+2?" — the prompt is now correctly understood.

Hot-path safety

  - Default `default_rope_theta_for_architecture("qwen3_moe")`
    changes from 10_000.0 to 1_000_000.0.
  - GGUF files that DO have `qwen3moe.rope.freq_base` metadata
    take precedence over this default (per config.rs lines 391-394
    and 576-578); those files are unaffected.
  - Dense Qwen3 path also unaffected ("qwen3" already returns 1M).
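
The precedence rule in the list above can be sketched as follows. Only the match arms quoted in this message are reproduced; the real `default_rope_theta_for_architecture` has further arms that are elided in the source, and `resolve_rope_theta` is a hypothetical name for the metadata-first lookup.

```rust
/// Arch default, limited to the arms quoted in this commit message.
fn default_rope_theta_for_architecture(arch: &str) -> f32 {
    match arch {
        "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
        _ => 10_000.0,
    }
}

/// Hypothetical resolver: explicit GGUF metadata wins over the arch default.
fn resolve_rope_theta(metadata_freq_base: Option<f32>, arch: &str) -> f32 {
    metadata_freq_base.unwrap_or_else(|| default_rope_theta_for_architecture(arch))
}

fn main() {
    // No metadata key: the arch default applies.
    println!("{}", resolve_rope_theta(None, "qwen3_moe"));
    // Metadata key present: the file's own value is used unchanged.
    println!("{}", resolve_rope_theta(Some(500_000.0), "qwen3_moe"));
}
```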

Stack research confirmation

  Per CLAUDE.md "Stack research reference repos":
    - HuggingFace transformers Qwen3MoeConfig.rope_theta default:
      1_000_000.0 (modeling_qwen3_moe.py)
    - llama.cpp llm_load_hparams_qwen3 reads
      f32_kv_value("rope.freq_base") with default 1e6
    - Both confirm: 1M is the correct Qwen3-MoE default.

What this PR does NOT ship

  - Sync forward_qwen3_moe_traced (depends on #1222 merge)
  - Multi-token output coherence past prompt repetition
    (Step 6 / chat-template handling — separable)
  - Stop-on-EOS (151645 = `<|im_end|>`): greedy generation keeps
    going past it; that's another follow-up

Refs M32d Step 5b (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs #1228 (Step 5: per-head Q/K RMSNorm fix — this PR stacks on it)
Refs #1222 (Step 2: forward_qwen3_moe_traced)
Refs #1226 (Step 2.5: apr trace dispatch)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 1, 2026
…ection) — model now ANSWERS questions (#1238)

* fix(aprender-serve): M32d Step 6 — qwen3_moe → ChatML (no <think> injection) — model now ANSWERS questions

Stacks on #1232 (Step 5b) which stacks on #1228 (Step 5). Together
the three-PR stack discharges M32d numerical-parity: model goes from
%%%%%%%% gibberish to coherent English answers.

Root cause

  detect_format_from_name routed any name containing "qwen3" to
  Qwen3NoThink (PMAT-181) which pre-injects empty
  `<think>\n</think>\n` into the assistant turn:

      <|im_start|>user
      What is 2+2?<|im_end|>
      <|im_start|>assistant
      <think>
      </think>

  But Qwen3-Coder-30B-A3B-Instruct does NOT have thinking mode.
  Verified by reading the actual Jinja chat template stored in the
  GGUF's `tokenizer.chat_template` metadata — it only emits plain
  `<|im_start|>assistant\n` for the generation prompt; no `<think>`
  blocks anywhere.

  The empty `<think></think>` injection confused the model; the
  first generated token was `<|endoftext|>` (151643) instead of an
  answer.
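
The two prompt shapes discussed above differ only in the trailing think block. A minimal sketch, assuming a single user turn; `render_chatml` is a hypothetical helper, not the crate's template code.

```rust
/// Render a single-turn ChatML generation prompt. When `inject_empty_think`
/// is true, the empty think block described above is appended.
fn render_chatml(user: &str, inject_empty_think: bool) -> String {
    let mut prompt =
        format!("<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n");
    if inject_empty_think {
        prompt.push_str("<think>\n</think>\n");
    }
    prompt
}

fn main() {
    println!("--- plain ChatML ---\n{}", render_chatml("What is 2+2?", false));
    println!("--- with empty think block ---\n{}", render_chatml("What is 2+2?", true));
}
```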

Five-whys

  1. Why does the post-Step-5+5b model output "Human: What is 2+2?"
     instead of "4"?
  2. Why? Model emits `<|endoftext|>` (151643) as first generated
     token, then continues into "Human:..." text.
  3. Why? It thinks the assistant turn is over before it started.
  4. Why? The `<think></think>` block looks complete from the
     model's perspective — empty thinking is interpreted as
     "I have nothing to say."
  5 (root). Why is the empty think block there? Because the
     Qwen3NoThink template injects it by default, but Qwen3-Coder
     was never trained with thinking — its training distribution
     has plain ChatML.

The fix

  In `detect_format_from_name`, route `qwen3_moe` / `qwen3moe` to
  plain ChatML (no `<think>` injection) BEFORE the generic qwen3
  → Qwen3NoThink rule:

    if name_lower.contains("qwen3_moe") || name_lower.contains("qwen3moe") {
        return TemplateFormat::ChatML;
    }
    if name_lower.contains("qwen3") {
        return TemplateFormat::Qwen3NoThink;
    }

  This preserves PMAT-181's NoThink optimization for thinking-mode
  Qwen3 variants while routing Qwen3-MoE-arch (Qwen3-Coder etc.) to
  plain ChatML.
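
The ordering matters because "qwen3moe" also contains "qwen3". The sketch below wraps the two checks from the snippet above in a self-contained function; the `Other` variant and the reduced enum are placeholders for everything the real `detect_format_from_name` handles beyond these two rules.

```rust
/// Reduced, illustrative enum; `Other` stands in for all remaining formats.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TemplateFormat {
    ChatML,
    Qwen3NoThink,
    Other,
}

fn detect_format_from_name(name: &str) -> TemplateFormat {
    let name_lower = name.to_lowercase();
    // The specific MoE check must come first: "qwen3moe" also contains "qwen3".
    if name_lower.contains("qwen3_moe") || name_lower.contains("qwen3moe") {
        return TemplateFormat::ChatML;
    }
    if name_lower.contains("qwen3") {
        return TemplateFormat::Qwen3NoThink;
    }
    TemplateFormat::Other
}

fn main() {
    for name in ["qwen3moe", "qwen3_moe", "qwen3", "llama"] {
        println!("{name} -> {:?}", detect_format_from_name(name));
    }
}
```

In this sketch a string that contains neither MoE spelling still falls through to the generic qwen3 rule, so the new branch only helps when the string passed in carries the architecture name.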

Live dogfood evidence on lambda-vector RTX 4090

  Stacked on #1228 (Step 5) + #1232 (Step 5b):

  | Prompt           | Pre-Step-6          | Post-Step-6                      |
  | ---------------- | ------------------- | -------------------------------- |
  | "What is 2+2?"   | Human: What is 2+2? | 2 + 2 = 4                        |
  | "Hello"          | Human: ...          | Hello! How can I help you today? |
  | "fn factorial"   | Human: ...          | def factorial(n):                |
  | "List 3 colors:" | Human: ...          | Red, blue, and green.            |

  Model now correctly ANSWERS the questions instead of just
  reproducing the prompt.

Cumulative M32d FAST PATH stack discharge

  | Step | PR    | Bug | Output transition |
  |------|-------|-----|-------------------|
  | 2    | #1222 | n/a (diagnostic) | (provides apr trace) |
  | 2.5  | #1226 | n/a (diagnostic) | (provides apr trace) |
  | 5    | #1228 | rank-3 Q/K norm  | gibberish → "Human: What is 2+" |
  | 5b   | #1232 | rank-4 RoPE θ    | "Human: What is 2+" → "Human: What is 2+2?" |
  | 6    | THIS  | chat template    | "Human: What is 2+2?" → "2 + 2 = 4" |

Component-prior table discharge status (M34 FAST PATH)

  | Rank | Component | Prior | Status     |
  |------|-----------|-------|------------|
  | 1    | LAYOUT    | 30%   | not at issue |
  | 2    | Q4_K_M    | 20%   | not at issue |
  | 3    | Q/K norm  | 15%   | FIXED #1228  |
  | 4    | RoPE θ    | 10%   | FIXED #1232  |
  | 5    | router sm | 10%   | not at issue |
  | 6    | token emb | 10%   | not at issue |
  | 7    | other     | 5%    | n/a          |
  | n/a  | chat tpl  | n/a   | FIXED THIS   |

  M34 plan estimated 4-6 PRs lucky / 8-10 realistic / 12-15
  pessimistic. Actual: 5 PRs (Step 2 + 2.5 + 5 + 5b + 6).
  Came in at lucky-case bound.

Hot-path safety

  - Dense Qwen3 path unchanged (still routes to Qwen3NoThink for
    thinking-mode Qwen3 variants).
  - Other architectures unchanged.
  - Only the Qwen3-MoE / Qwen3-Coder routing changes — and only to
    fix a real bug surfaced by dogfood.

Stack research

  Per CLAUDE.md "Stack research reference repos":
    - HuggingFace Qwen3MoeForCausalLM does NOT have thinking mode
      (no `<think>` blocks in modeling_qwen3_moe.py training tracks)
    - GGUF for Qwen3-Coder-30B-A3B-Instruct Jinja chat_template
      generation prompt is plain `<|im_start|>assistant\n`
    - llama.cpp llama-chat.cpp matches plain ChatML for qwen3moe
      arch

What this PR does NOT ship

  - Sync `forward_qwen3_moe_traced` with the Step 5/5b fixes
    (depends on upstream PRs merging)
  - Stop-on-EOS hardening (`<|im_end|>` handling) — separable
  - Reading the GGUF's Jinja chat_template directly via minijinja
    instead of arch-name guessing (longer-term improvement)

Refs M32d Step 6 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs #1228 #1232 (Steps 5, 5b — this PR stacks)
Refs #1222 #1226 (Step 2, 2.5 — diagnostic surface)
Refs PMAT-181 (Qwen3NoThink template — kept for thinking variants)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix (#1251)

* feat(apr-cli): M32d Step 2.5 — wire `apr trace --payload` to forward_qwen3_moe_traced

Step 2.5 of the M34 five-whys FAST PATH plan. **Stacks on top of Step 2
(PR #1222 forward_qwen3_moe_traced) — must merge after that.**

What this PR ships

  • `crates/apr-cli/src/commands/trace.rs` (+93 LOC)
    - Arch-aware dispatch in `run_traced_inference_gguf`: qwen3_moe-arch
      GGUF goes to forward_qwen3_moe_traced; everything else stays on
      forward_traced (dense path).
    - New helper `run_qwen3_moe_traced_forward` that reads MoE config
      (num_experts / num_experts_per_tok / moe_intermediate) from GGUF
      metadata, loads per-layer Qwen3MoeQuantizedLayer descriptors, and
      calls the new traced forward.
    - Skip the GENERATION phase for qwen3_moe — generate_with_cache
      panics on placeholder zero FFN weights (per M32c.2.2 LAZY-FUSED-
      MATVEC). Print a yellow "use `apr run` for text generation" hint
      instead.
    - Robust arch matching: accepts both canonical "qwen3_moe" (with
      underscore) and raw GGUF "qwen3moe" (without). The build.rs
      codegen sometimes lags on the YAML alias mapping, so we don't
      gate on its cache being current.

Live dogfood on lambda-vector RTX 4090

  $ apr trace --payload ~/.cache/pacha/models/2b88b180a790988f.gguf

  Architecture: qwen3moe
    Layers: 48
    Hidden dim: 2048
    Vocab size: 151936

  FORWARD PASS (with layer tracing):
    EMBEDDING: ...
    Layer 0/48 [OK]
      attn_norm: mean=  0.0007 std=  0.0623
      qkv      : mean= -0.0003 std=  0.0237
      attn_out : mean= -0.0027 std=  0.1049
      ffn_norm : mean=  0.0234 std=  0.0556
      ffn_out  : mean= -0.0007 std=  0.0226
      output   : mean= -0.0008 std=  0.0680
    [layers 1..46 elided]
    Layer 47/48 [OK]
      attn_norm: mean= -0.0258 std=  0.9990
      qkv      : mean=  0.0187 std=  1.5984
      attn_out : mean= -0.0556 std=  2.1882
      ffn_norm : mean= -0.0242 std=  1.3006
      ffn_out  : mean= -0.0088 std=  1.3745
      output   : mean= -0.1139 std=  2.8217

  FINAL LAYER NORM:
    Range: [-39.16, 32.65], Mean: -0.082, Std: 2.744

  LM_HEAD output:
    Vocab size: 151936, L2 norm: 1025.7529
    Top 5 predictions: token_ids [3555, 937, 19884, 320, 323]

  TRACE SUMMARY:
    All layers have reasonable variance (std < 50)
    Logit range: 28.88 (reasonable)

  GENERATION: skipped for qwen3_moe (use `apr run` for text generation)

This is the EXIT CRITERION for M34 FAST PATH Step 2:

  "`apr trace --json --payload <gguf> --prompt "What is 2+2?"`
   returns non-null `output_stats` for every `transformer_block_N`
   entry, with finite L2 norms."

Met:
  - ✓ All 48 transformer_block_N entries have non-null output_stats
  - ✓ All L2 norms finite, all stats finite (no NaN/Inf)
  - ✓ Layer-level mean+std visible for bisection use

Not yet met:
  - `--json` output. The binary already accepts the `--json` option
    but does not yet serialize the qwen3_moe trace; wiring that up is
    a small follow-up PR.
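
The non-null and finiteness parts of the exit criterion could be checked mechanically along these lines. This is a sketch under assumptions: `LayerStats` and `exit_criterion_met` are hypothetical names, and the real trace structure may carry different fields.

```rust
/// Hypothetical per-block statistics, mirroring the values printed by the trace.
#[derive(Clone, Copy, Debug)]
struct LayerStats {
    mean: f32,
    std: f32,
    l2_norm: f32,
}

/// True when every expected transformer block has stats (non-null)
/// and every recorded value is finite (no NaN/Inf).
fn exit_criterion_met(blocks: &[Option<LayerStats>], expected_layers: usize) -> bool {
    blocks.len() == expected_layers
        && blocks.iter().all(|b| match b {
            Some(s) => s.mean.is_finite() && s.std.is_finite() && s.l2_norm.is_finite(),
            None => false,
        })
}
```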

Bug found via dogfood

  Building Step 2.5 surfaced a SECOND bug: `apr trace --payload`
  on qwen3_moe was crashing with index-out-of-bounds in
  matmul_fused.rs:211 because the dispatch was missing AND the
  build.rs codegen had a stale "qwen3moe" alias mapping. Both are
  fixed here (arch-aware dispatch + raw-string fallback). This is
  exactly why the user said "dogfood often": the bug was invisible to
  the unit test from PR #1222 because that test calls the method
  directly; only the CLI orchestrator exercises the dispatch.

Diagnostic signal already visible

  Layer std growth is monotone and large:
    layer[0].output.std  = 0.07
    layer[47].output.std = 2.82
  → ~40× growth over 48 layers. A healthy forward pass should be
  roughly stable layer-to-layer. This signal feeds Step 3 directly:
  bisect per-layer cosine similarity against an HF FP16 reference to
  localize the divergent layer.
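
The growth figure can be reproduced from the per-layer `output` std values in the trace. A minimal sketch, assuming the stds have been collected into a slice in layer order; `std_growth` is a hypothetical helper, not part of the CLI.

```rust
/// Ratio of the last layer's output std to the first layer's.
/// Returns None for an empty slice, a non-positive first std,
/// or non-finite endpoints.
fn std_growth(output_stds: &[f32]) -> Option<f32> {
    let first = *output_stds.first()?;
    let last = *output_stds.last()?;
    if first > 0.0 && first.is_finite() && last.is_finite() {
        Some(last / first)
    } else {
        None
    }
}
```

With the unrounded values from the trace above (0.0680 for layer 0 and 2.8217 for layer 47) the ratio lands slightly above 41, consistent with the ~40× quoted from the rounded figures.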

Hot-path safety

  Production text-generation path (`apr run` → run_qwen3_moe_generate)
  is UNCHANGED. This PR only touches `apr trace --payload`. Verified
  by sibling tests still passing.

What this PR does NOT ship

  - JSON serialization of the qwen3_moe trace (--json flag) — easy
    follow-up.
  - Actually fixing the model output (Steps 3-6 of FAST PATH).
  - Fixing the `generate_with_cache` qwen3_moe panic (cosmetic; we
    skip it now, but a separate PR could route GENERATION through
    run_qwen3_moe_generate).

Refs M32d Step 2.5 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — depends on it)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix

Stacks transitively on top of #1238 (Step 6) → #1232 (Step 5b) → #1228
(Step 5) → #1226 (Step 2.5) → #1222 (Step 2). All five must merge
before this fix can land.

Why this exists

  PR #1222's `forward_qwen3_moe_traced` was authored as a step-for-step
  mirror of `forward_qwen3_moe` AT THE TIME (M32c.2.2.2.1.1 era). At
  that time forward_qwen3_moe was MISSING the per-head Q/K RMSNorm.

  After PR #1228 (Step 5) added the per-head Q/K RMSNorm to
  forward_qwen3_moe, the traced variant kept the bug. Result:
  `apr trace --payload` shows DIFFERENT numerics from `apr run` for the
  same prompt + GGUF — silent diagnostic-vs-production drift.

What this PR fixes

  Mirror the same per-head Q/K RMSNorm into
  forward_qwen3_moe_traced's per-position loop, AFTER bias and BEFORE
  RoPE — same as #1228. Now both functions produce the same numerics.
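
For orientation, a per-head RMSNorm over a flat Q (or K) vector can be sketched as follows. This is an illustration, not the code from #1228: the function name, the `eps` parameter, and the assumption of one weight vector of length `head_dim` shared across heads are not taken from the PR. Only the placement (after bias, before RoPE) comes from the text above.

```rust
/// Normalize each head's slice of `x` in place by its root-mean-square,
/// then scale by a weight vector of length `head_dim` (assumed shared by all heads).
fn per_head_rms_norm(x: &mut [f32], weight: &[f32], head_dim: usize, eps: f32) {
    assert_eq!(weight.len(), head_dim);
    assert_eq!(x.len() % head_dim, 0);
    for head in x.chunks_exact_mut(head_dim) {
        // Mean of squares over this head only, not over the full hidden dim.
        let mean_sq = head.iter().map(|v| v * v).sum::<f32>() / head_dim as f32;
        let inv_rms = 1.0 / (mean_sq + eps).sqrt();
        for (v, w) in head.iter_mut().zip(weight) {
            *v = *v * inv_rms * *w;
        }
    }
}
```

In a per-position loop this would run on Q and K once the projection bias has been added, with RoPE applied to the normalized result.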

Live verification on lambda-vector RTX 4090

  ✓ cargo test -p aprender-serve --test qwen3_moe_traced_forward
    --release — 2/2 PASS in 7.56s

  ✓ apr trace --payload <Qwen3-Coder GGUF> reports healthier per-layer
    std growth post-sync (Q/K norm gates attention scores per layer).

  ✓ Sibling F-QW3-MOE-STEP5-001 regression test still passes.

What this PR does NOT ship

  - rope_theta is read from `self.config.rope_theta` which is set at
    model load time from the default lookup. PR #1232 fixed that
    default for `qwen3_moe`. forward_qwen3_moe_traced reads the same
    config, so it inherits the fix automatically — no separate sync
    needed.
  - All other forward stages (norms, MoE FFN dispatch, lm_head, etc.)
    were already mirrored correctly in the original Step 2 PR.
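
Why no separate rope_theta sync is needed can be seen from a simplified model of the load-time resolution. This is a sketch: the match arm and the 10K catch-all follow the #1232 description, while `default_rope_theta` and `resolve_rope_theta` are hypothetical names standing in for the real config code.

```rust
/// Simplified arch-default lookup (arms as described in PR #1232).
fn default_rope_theta(arch: &str) -> f32 {
    match arch {
        "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
        // Catch-all default for architectures without their own arm.
        _ => 10_000.0,
    }
}

/// Explicit GGUF metadata wins; otherwise fall back to the arch default.
/// The result is stored once in the model config at load time.
fn resolve_rope_theta(metadata_freq_base: Option<f32>, arch: &str) -> f32 {
    metadata_freq_base.unwrap_or_else(|| default_rope_theta(arch))
}
```

Because the value is resolved once and stored in the config, any forward function that reads `config.rope_theta` sees the same number, traced or not.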

Refs M32d Step 7 sync (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1228 (Step 5: per-head Q/K RMSNorm fix)
Refs PR #1232 (Step 5b: rope_theta — auto-applied via config)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — original)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
