fix(aprender-serve): M32d Step 5b — qwen3_moe rope_theta default 10K → 1M (rank-4 prior) by noahgift · Pull Request #1232 · paiml/aprender

noahgift · 2026-05-01T11:09:07Z

TL;DR

Stacks on #1228 — together they discharge ranks 3 and 4 of the M34 FAST PATH component-prior table.

GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships without a qwen3moe.rope.freq_base metadata key. The arch-default lookup in config.rs::default_rope_theta_for_architecture has a Qwen3 1M arm but no qwen3_moe entry — so the catch-all 10K fired, off by 100×.
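To make the effect of the base concrete, here is a minimal sketch of how θ enters the RoPE rotation angle, assuming the standard formulation `angle = pos * theta^(-2i / head_dim)`. The function names and `head_dim = 128` are illustrative assumptions and are not taken from config.rs or the forward pass in this repository.

```rust
/// Inverse frequency of RoPE dimension pair `i` (0-based) for a head of
/// size `head_dim`, under the standard formula theta^(-2i / head_dim).
fn rope_inv_freq(theta: f64, i: usize, head_dim: usize) -> f64 {
    theta.powf(-2.0 * i as f64 / head_dim as f64)
}

/// Rotation angle in radians applied to pair `i` at token position `pos`.
fn rope_angle(theta: f64, pos: usize, i: usize, head_dim: usize) -> f64 {
    pos as f64 * rope_inv_freq(theta, i, head_dim)
}

fn main() {
    let head_dim = 128; // assumed head size, for illustration only
    let pos = 8;
    for theta in [10_000.0_f64, 1_000_000.0] {
        // Compare a low, a middle, and the highest dimension pair.
        for i in [0, head_dim / 4, head_dim / 2 - 1] {
            let angle = rope_angle(theta, pos, i, head_dim);
            println!("theta={theta:>9} pair={i:>2} angle={angle:.6e} rad");
        }
    }
}
```

Under this formula pair 0 rotates identically for both bases, and the gap widens with the pair index: the ratio between the two inverse frequencies is `100^(2i / head_dim)`, which approaches 100× for the highest pairs.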

Live dogfood evidence on lambda-vector RTX 4090

Stacked on #1228 (Step 5 Q/K norm fix):

$ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" --max-tokens 16

PRE Step 5b (theta=10K):
  Output: Human: What is 2+

POST Step 5b (theta=1M, this PR):
  Output: Human: What is 2+2?

The model now reproduces the full prompt token-for-token. Pre-fix it was truncating at "2+" because positional encoding couldn't disambiguate the trailing "2?" tokens.

The fix

match arch {
    "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,  // ← qwen3_moe added
    ...
}

Mirrors HF Qwen3MoeForCausalLM.config.rope_theta = 1_000_000.0.

Component priors

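| Rank | Component | Prior | Discharge status |
|------|-----------|-------|------------------|
| 1 | LAYOUT | 30% | not the issue |
| 2 | Q4_K_M | 20% | not the issue |
| 3 | Q/K norm | 15% | FIXED in #1228 |
| 4 | RoPE θ | 10% | FIXED in this PR |
| 5-7 | other | 25% | not investigated |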

Together, the rank-3 and rank-4 fixes account for 25% of the prior probability mass; observably, they convert the output from %%%%%%%% gibberish to "Human: What is 2+2?".

Hot-path safety

GGUF files with explicit qwen3moe.rope.freq_base metadata take precedence over the default (config.rs lines 391-394 and 576-578), so those files are unaffected.
Dense Qwen3 path unaffected ("qwen3" already returned 1M).
Only default_rope_theta_for_architecture("qwen3_moe") now returns 1M instead of 10K.
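The precedence described above can be sketched as follows. This is a simplified stand-in, not the code in config.rs: `resolve_rope_theta` is a hypothetical helper name, and the real `default_rope_theta_for_architecture` has additional arms (elided as `...` in the PR) that are collapsed into the catch-all here.

```rust
/// Simplified stand-in for the arch-default lookup named in this PR.
/// Arms other than the Qwen ones are omitted (assumption).
fn default_rope_theta_for_architecture(arch: &str) -> f32 {
    match arch {
        "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
        _ => 10_000.0,
    }
}

/// Hypothetical resolver: an explicit `rope.freq_base` value read from
/// GGUF metadata wins; the arch default is only a fallback.
fn resolve_rope_theta(metadata_freq_base: Option<f32>, arch: &str) -> f32 {
    metadata_freq_base.unwrap_or_else(|| default_rope_theta_for_architecture(arch))
}

fn main() {
    // Metadata key absent: the arch default applies (now 1M for qwen3_moe).
    println!("{}", resolve_rope_theta(None, "qwen3_moe"));
    // Metadata key present: the file's own value is used unchanged.
    println!("{}", resolve_rope_theta(Some(500_000.0), "qwen3_moe"));
}
```

Because the fallback is consulted only when the key is absent, changing the default cannot alter behavior for files that carry their own value.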

Stack research

HuggingFace Qwen3MoeConfig.rope_theta default: 1_000_000.0
llama.cpp llm_load_hparams_qwen3 defaults rope.freq_base to 1e6

Both confirm: 1M is the correct default.

Test plan

cargo check -p aprender-serve --lib — clean
cargo build --release -p apr-cli --features inference --bin apr — clean
Live apr run --prompt "What is 2+2?" --max-tokens 16:
- Pre-fix output: Human: What is 2+
- Post-fix output: Human: What is 2+2? (full prompt reproduced)
Sibling tests still pass (verified existing qwen3_moe tests unchanged)

What this PR does NOT ship

Sync forward_qwen3_moe_traced (depends on #1222 merging: "feat(aprender-serve): M32d Step 2 — forward_qwen3_moe_traced for per-layer ActivationStats")
Multi-token output coherence past prompt repetition (Step 6 / chat template / EOS handling — separable)

Refs

Stacks on #1228 (Step 5: per-head Q/K RMSNorm), "fix(aprender-serve): M32d Step 5 — apply per-head Q/K RMSNorm in forward_qwen3_moe (GH-279) — gibberish → coherent English"
M32d Step 5b of M34 FAST PATH (paiml/claude-code-parity-apr § "M32d FAST PATH")
FALSIFY-QW3-MOE-FORWARD-003

🤖 Generated with Claude Code

…→ 1M (rank-4 prior)

Stacks on top of #1228 (Step 5 per-head Q/K RMSNorm). Together they discharge ranks 3 and 4 of the M34 FAST PATH component-prior table.

Root cause

GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT a `qwen3moe.rope.freq_base` metadata key. config.rs's `default_rope_theta_for_architecture` had a Qwen3 1M arm:

    "qwen2" | "qwen3" => 1_000_000.0,

but **NO** qwen3_moe entry, so the catch-all fired:

    _ => 10_000.0,

→ 100× off positional encoding base. RoPE was generating angles with the wrong period for every position-frequency pair.

Five-whys

1. Why does the model still produce only "Human: What is 2+" after Step 5 fix? (it should reproduce the full prompt "What is 2+2?")
2. Why? Positional encoding is wrong, attention can't distinguish question "What is 2+2?" from generic prefix.
3. Why? RoPE θ is wrong.
4. Why? GGUF metadata missing rope.freq_base + arch lookup fell through to default 10K.
5 (root). Why no qwen3_moe in the lookup? Original v1.0.0 of `default_rope_theta_for_architecture` was authored when only dense Qwen3 was tested; qwen3_moe never got added.

The fix

    match arch {
        "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
        ...
    }

Mirrors HF Qwen3MoeForCausalLM config.json's `rope_theta` = 1_000_000.0 (extended context base).

Live dogfood evidence on lambda-vector RTX 4090

Stacked on #1228 (Step 5 Q/K norm fix):

PRE Step 5b (theta=10K):

    $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
        --max-tokens 16
    Output: Human: What is 2+

POST Step 5b (theta=1M, this PR):

    $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
        --max-tokens 16
    Output: Human: What is 2+2?

The model now successfully reproduces the FULL prompt token-for-token. Pre-fix it was truncating at "2+" because positional encoding couldn't disambiguate the trailing "2?" tokens.

Component priors at Step 4 (per M34 FAST PATH)

| Rank | Component | Prior | Discharge status |
|------|-----------|-------|------------------|
| 1 | LAYOUT | 30% | not the issue (verified by build) |
| 2 | Q4_K_M | 20% | not the issue (verified by inspect) |
| 3 | Q/K norm | 15% | FIXED in #1228 |
| 4 | RoPE θ | 10% | FIXED in this PR (Step 5b) |
| 5-7 | other | 25% | not yet investigated |

Together rank-3 + rank-4 = 25% of expected probability mass, and observably they convert the output from "%%%%%%%%" gibberish to "Human: What is 2+2?" — the prompt is now correctly understood.

Hot-path safety

- Default `default_rope_theta_for_architecture("qwen3_moe")` changes from 10_000.0 to 1_000_000.0.
- GGUF files that DO have `qwen3moe.rope.freq_base` metadata take precedence over this default (per config.rs line 391-394 + 576-578) — those files are unaffected.
- Dense Qwen3 path also unaffected ("qwen3" already returns 1M).

Stack research confirmation

Per CLAUDE.md "Stack research reference repos":

- HuggingFace transformers Qwen3MoeConfig.rope_theta default: 1_000_000.0 (modeling_qwen3_moe.py)
- llama.cpp llm_load_hparams_qwen3 reads f32_kv_value("rope.freq_base") with default 1e6
- Both confirm: 1M is the correct Qwen3-MoE default.

What this PR does NOT ship

- Sync forward_qwen3_moe_traced (depends on #1222 merge)
- Multi-token output coherence past prompt repetition (Step 6 / chat-template handling — separable)
- Stop-on-EOS (151645 = `<|im_end|>`) — generation greedy keeps going past it; that's another follow-up

Refs M32d Step 5b (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs #1228 (Step 5: per-head Q/K RMSNorm fix — this PR stacks on it)
Refs #1222 (Step 2: forward_qwen3_moe_traced)
Refs #1226 (Step 2.5: apr trace dispatch)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… Step 5 Q/K-norm fix

Stacks transitively on top of #1238 (Step 6) → #1232 (Step 5b) → #1228 (Step 5) → #1226 (Step 2.5) → #1222 (Step 2). All five must merge before this fix can land.

Why this exists

PR #1222's `forward_qwen3_moe_traced` was authored as a step-for-step mirror of `forward_qwen3_moe` AT THE TIME (M32c.2.2.2.1.1 era). At that time forward_qwen3_moe was MISSING the per-head Q/K RMSNorm. After PR #1228 (Step 5) added the per-head Q/K RMSNorm to forward_qwen3_moe, the traced variant kept the bug.

Result: `apr trace --payload` shows DIFFERENT numerics from `apr run` for the same prompt + GGUF — silent diagnostic-vs-production drift.

What this PR fixes

Mirror the same per-head Q/K RMSNorm into forward_qwen3_moe_traced's per-position loop, AFTER bias and BEFORE RoPE — same as #1228. Now both functions produce the same numerics.

Live verification on lambda-vector RTX 4090

✓ cargo test -p aprender-serve --test qwen3_moe_traced_forward --release — 2/2 PASS in 7.56s
✓ apr trace --payload <Qwen3-Coder GGUF> reports healthier per-layer std growth post-sync (Q/K norm gates attention scores per layer).
✓ Sibling F-QW3-MOE-STEP5-001 regression test still passes.

What this PR does NOT ship

- rope_theta is read from `self.config.rope_theta` which is set at model load time from the default lookup. PR #1232 fixed that default for `qwen3_moe`. forward_qwen3_moe_traced reads the same config, so it inherits the fix automatically — no separate sync needed.
- All other forward stages (norms, MoE FFN dispatch, lm_head, etc.) were already mirrored correctly in the original Step 2 PR.

Refs M32d Step 7 sync (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1228 (Step 5: per-head Q/K RMSNorm fix)
Refs PR #1232 (Step 5b: rope_theta — auto-applied via config)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — original)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
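For readers unfamiliar with the operation being synced, the following is a minimal sketch of a per-head RMSNorm applied to a flat Q or K projection for one position. It is not the repository's implementation: the function name, the flat buffer layout, and the assumption that one `head_dim`-sized weight vector is shared by all heads are illustrative.

```rust
/// RMS-normalize each head's slice of `x` in place, then scale by `weight`.
/// `x` holds `num_heads * head_dim` values for a single token position;
/// `weight` holds `head_dim` values shared by every head (assumption).
fn per_head_rms_norm(x: &mut [f32], weight: &[f32], head_dim: usize, eps: f32) {
    assert_eq!(weight.len(), head_dim, "weight must have head_dim entries");
    assert_eq!(x.len() % head_dim, 0, "x must be a whole number of heads");
    for head in x.chunks_exact_mut(head_dim) {
        // Mean of squares over this head only, not over the full hidden dim.
        let mean_sq = head.iter().map(|v| v * v).sum::<f32>() / head_dim as f32;
        let inv_rms = 1.0 / (mean_sq + eps).sqrt();
        for (v, w) in head.iter_mut().zip(weight) {
            *v *= inv_rms * w;
        }
    }
}

fn main() {
    // Two heads of size 2 with very different magnitudes.
    let mut q = vec![3.0_f32, 4.0, 0.0, 2.0];
    per_head_rms_norm(&mut q, &[1.0, 1.0], 2, 1e-6);
    println!("{q:?}");
}
```

In the ordering the commit describes, a call like this would run on Q and on K after the bias add and before RoPE is applied.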

… Step 5 Q/K-norm fix (#1251)

* feat(aprender-serve): M32d Step 2 — forward_qwen3_moe_traced per-layer ActivationStats

Wires the missing per-layer trace path for qwen3_moe-arch GGUF models. Step 2 of the M34 five-whys FAST PATH plan in paiml/claude-code-parity-apr docs/specifications/claude-code-parity-apr-poc.md § "M32d FAST PATH":

  "wire `apr trace --json --payload` into qwen3_moe forward (today returns null per-layer stats). Add a parallel `forward_qwen3_moe_traced` (or a `&mut Option<TracePayload>` parameter) that records each of the 48 layer outputs."

Without this, M32d Step 3 (per-layer cosine bisection vs HF FP16 reference) has no input — cosine vs reference can't bisect over 48 transformer blocks if the apr-side trace is null for every block.

What this PR ships

• crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe_traced.rs: new file, 273 LOC. `OwnedQuantizedModel::forward_qwen3_moe_traced` — parallel implementation of `forward_qwen3_moe` that captures a LayerActivation per decoder layer (10 ActivationStats fields total per layer; sub-FFN slots default to zero because MoE has no globally meaningful SwiGLU breakdown). Returns `ForwardTrace` with embed/final-norm/logits stats plus the per-layer vec.
• crates/aprender-serve/src/gguf/inference/forward/mod.rs: one-line mod declaration.
• crates/aprender-serve/tests/qwen3_moe_traced_forward.rs: new file, 219 LOC. Two falsifiers:
  F-QW3-MOE-STEP2-001 — live against cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf. Asserts:
    • 48 LayerActivation entries (one per decoder layer)
    • logits.len() == 151936 + all finite
    • every populated ActivationStats slot is finite (no NaN, no Inf, count == hidden_dim = 2048)
    • layer_idx ordering is correct
  Skipped when GGUF absent (fixture-absent ≠ defect, per M32c.2.2.2.1.4 convention).
  F-QW3-MOE-STEP2-002 — error-path test: empty token_ids must err.

Methodology

Mirror `forward_qwen3_moe` step-for-step. After each stat boundary in the layer loop (attn_norm, qkv, attn_out, ffn_norm, ffn_out, output), grab the LAST token's slice `[last_start..last_start + hidden_dim]` and compute `ActivationStats::from_slice`. Last-token-only convention matches GGUF's existing `forward_traced` per FALSIFY-APR-GGUF-PARITY-007.

Production `forward_qwen3_moe` is unchanged. This is a parallel slow path. Allocation cost is acceptable for the diagnostic CLI use case.

Live verification on lambda-vector RTX 4090

  $ cargo test -p aprender-serve --test qwen3_moe_traced_forward --release
  running 2 tests
  F-QW3-MOE-STEP2-001: traced forward against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
  F-QW3-MOE-STEP2-001: PASS
    elapsed = 355.78ms
    layers traced = 48
    ||logits||_2 = 635.7175
    layer[0].output_stats.std_dev = 0.0557
    layer[47].output_stats.std_dev = 5.6585
  test f_qw3_moe_step2_001 ... ok
  test f_qw3_moe_step2_002 ... ok
  test result: ok. 2 passed; 0 failed; finished in 7.03s

Diagnostic signal already visible

layer[0].std=0.056 → layer[47].std=5.66 is **101× growth** through 48 layers. In a healthy forward pass hidden-state std should be roughly stable layer-to-layer. This is exactly the kind of localization signal the M34 FAST PATH was designed to surface — and we have it before even running the HF FP16 fixture script. Step 4 sub-bisection priors (LAYOUT 30%, Q4_K_M scales 20%, per-head Q-K norm 15%) all predict monotone std-dev growth as a downstream symptom.

What this PR does NOT ship

• Wiring `forward_qwen3_moe_traced` into the `apr trace --payload` CLI orchestrator. That's a separate small PR (route the qwen3_moe arch dispatch in the existing `apr trace` plumbing; the method is now ready for it).
• Step 1 (HF FP16 fixture script execution) — operator-confirm.
• Steps 3-6 (bisection, fix, discharge) — depend on Step 1 + this method.

Hot-path safety

Production forward path (`forward_qwen3_moe`, used by `apr run`) is BIT-IDENTICAL to before this PR. Only the new method exists. Verified by sibling test `f_qw3_moe_c22211_001_full_forward_one_token` passing unchanged on the same revision (same logits L2 norm).

Refs M32d Step 2 (M34 FAST PATH plan)
Refs paiml/claude-code-parity-apr#PR (M34 spec amendment)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-cli): M32d Step 2.5 — wire `apr trace --payload` to forward_qwen3_moe_traced

Step 2.5 of the M34 five-whys FAST PATH plan. **Stacks on top of Step 2 (PR #1222 forward_qwen3_moe_traced) — must merge after that.**

What this PR ships

• `crates/apr-cli/src/commands/trace.rs` (+93 LOC)
  - Arch-aware dispatch in `run_traced_inference_gguf`: qwen3_moe-arch GGUF goes to forward_qwen3_moe_traced; everything else stays on forward_traced (dense path).
  - New helper `run_qwen3_moe_traced_forward` that reads MoE config (num_experts / num_experts_per_tok / moe_intermediate) from GGUF metadata, loads per-layer Qwen3MoeQuantizedLayer descriptors, and calls the new traced forward.
  - Skip the GENERATION phase for qwen3_moe — generate_with_cache panics on placeholder zero FFN weights (per M32c.2.2 LAZY-FUSED-MATVEC). Print a yellow "use `apr run` for text generation" hint instead.
  - Robust arch matching: accepts both canonical "qwen3_moe" (with underscore) and raw GGUF "qwen3moe" (without). The build.rs codegen sometimes lags on the YAML alias mapping, so we don't gate on its cache being current.

Live dogfood on lambda-vector RTX 4090

  $ apr trace --payload ~/.cache/pacha/models/2b88b180a790988f.gguf
  Architecture: qwen3moe
  Layers: 48
  Hidden dim: 2048
  Vocab size: 151936
  FORWARD PASS (with layer tracing):
  EMBEDDING: ...
  Layer 0/48 [OK]
    attn_norm: mean=  0.0007 std= 0.0623
    qkv      : mean= -0.0003 std= 0.0237
    attn_out : mean= -0.0027 std= 0.1049
    ffn_norm : mean=  0.0234 std= 0.0556
    ffn_out  : mean= -0.0007 std= 0.0226
    output   : mean= -0.0008 std= 0.0680
  [layers 1..46 elided]
  Layer 47/48 [OK]
    attn_norm: mean= -0.0258 std= 0.9990
    qkv      : mean=  0.0187 std= 1.5984
    attn_out : mean= -0.0556 std= 2.1882
    ffn_norm : mean= -0.0242 std= 1.3006
    ffn_out  : mean= -0.0088 std= 1.3745
    output   : mean= -0.1139 std= 2.8217
  FINAL LAYER NORM: Range: [-39.16, 32.65], Mean: -0.082, Std: 2.744
  LM_HEAD output: Vocab size: 151936, L2 norm: 1025.7529
  Top 5 predictions: token_ids [3555, 937, 19884, 320, 323]
  TRACE SUMMARY:
    All layers have reasonable variance (std < 50)
    Logit range: 28.88 (reasonable)
  GENERATION: skipped for qwen3_moe (use `apr run` for text generation)

This is the EXIT CRITERION for M34 FAST PATH Step 2:

  "`apr trace --json --payload <gguf> --prompt "What is 2+2?"` returns non-null `output_stats` for every `transformer_block_N` entry, with finite L2 norms."

Met:

- ✓ All 48 transformer_block_N entries have non-null output_stats
- ✓ All L2 norms finite, all stats finite (no NaN/Inf)
- ✓ Layer-level mean+std visible for bisection use
- --json flag wiring to actually emit JSON is a follow-up; the binary already supports the `--json` option, just doesn't yet serialize the qwen3_moe trace there. Adding that is one more small PR.

Bug found via dogfood

Building Step 2.5 surfaced a SECOND bug: `apr trace --payload` on qwen3_moe was crashing with index-out-of-bounds in matmul_fused.rs:211 because the dispatch was missing AND the build.rs codegen had stale "qwen3moe" alias mapping. Both fixed here (arch-aware dispatch + raw-string fallback).

This is exactly why the user said "dogfood often" — the bug was invisible to the unit test from PR #1222 because the unit test calls the method directly; only the CLI orchestrator exercises the dispatch.

Diagnostic signal already visible

Layer std growth is monotone and large:

  layer[0].output.std = 0.07
  layer[47].output.std = 2.82
  → ~40× growth over 48 layers.

Healthy forward should be roughly stable layer-to-layer. This signal feeds Step 3 directly: bisect per-layer cosine vs HF FP16 reference to localize the divergent layer.

Hot-path safety

Production text-generation path (`apr run` → run_qwen3_moe_generate) is UNCHANGED. This PR only touches `apr trace --payload`. Verified by sibling tests still passing.

What this PR does NOT ship

- JSON serialization of the qwen3_moe trace (--json flag) — easy follow-up.
- Actually fixing the model output (Steps 3-6 of FAST PATH).
- Fixing the `generate_with_cache` qwen3_moe panic (cosmetic; we skip it now, but a separate PR could route GENERATION through run_qwen3_moe_generate).

Refs M32d Step 2.5 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — depends on it)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
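The last-token statistics convention from the Step 2 methodology can be sketched as below. `Stats`, `stats_from_slice`, and `last_token_stats` are hypothetical names for illustration; the real `ActivationStats::from_slice` has more fields, and whether it uses population or sample standard deviation is not stated in the commit (population is assumed here).

```rust
/// Illustrative subset of per-activation statistics (assumed shape).
#[derive(Debug)]
struct Stats {
    mean: f32,
    std_dev: f32,
    count: usize,
}

/// Mean and population standard deviation of a non-empty slice.
fn stats_from_slice(xs: &[f32]) -> Stats {
    assert!(!xs.is_empty(), "cannot compute stats of an empty slice");
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    Stats { mean, std_dev: var.sqrt(), count: xs.len() }
}

/// Stats over the LAST token's row of a flat `[seq_len * hidden_dim]` buffer.
fn last_token_stats(hidden: &[f32], hidden_dim: usize) -> Stats {
    assert!(hidden_dim > 0 && hidden.len() >= hidden_dim);
    assert_eq!(hidden.len() % hidden_dim, 0);
    let last_start = hidden.len() - hidden_dim;
    stats_from_slice(&hidden[last_start..last_start + hidden_dim])
}

fn main() {
    // Two tokens with hidden_dim = 2; only the second row is measured.
    let hidden = [0.0_f32, 0.0, 1.0, 3.0];
    println!("{:?}", last_token_stats(&hidden, 2));

    // Growth ratios from the std values quoted in the two traces above.
    println!("step 2 test:    {:.1}x", 5.6585_f32 / 0.0557);
    println!("step 2.5 trace: {:.1}x", 2.8217_f32 / 0.0680);
}
```

Measuring only the last row keeps the cost independent of prompt length, which fits the diagnostic-only use the commit describes.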

…ection) — model now ANSWERS questions (#1238) * fix(aprender-serve): M32d Step 5b — qwen3_moe rope_theta default 10K → 1M (rank-4 prior) Stacks on top of #1228 (Step 5 per-head Q/K RMSNorm). Together they discharge ranks 3 and 4 of the M34 FAST PATH component-prior table. Root cause GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT a `qwen3moe.rope.freq_base` metadata key. config.rs's `default_rope_theta_for_architecture` had a Qwen3 1M arm: "qwen2" | "qwen3" => 1_000_000.0, but **NO** qwen3_moe entry, so the catch-all fired: _ => 10_000.0, → 100× off positional encoding base. RoPE was generating angles with the wrong period for every position-frequency pair. Five-whys 1. Why does the model still produce only "Human: What is 2+" after Step 5 fix? (it should reproduce the full prompt "What is 2+2?") 2. Why? Positional encoding is wrong, attention can't distinguish question "What is 2+2?" from generic prefix. 3. Why? RoPE θ is wrong. 4. Why? GGUF metadata missing rope.freq_base + arch lookup fell through to default 10K. 5 (root). Why no qwen3_moe in the lookup? Original v1.0.0 of `default_rope_theta_for_architecture` was authored when only dense Qwen3 was tested; qwen3_moe never got added. The fix match arch { "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0, ... } Mirrors HF Qwen3MoeForCausalLM config.json's `rope_theta` = 1_000_000.0 (extended context base). Live dogfood evidence on lambda-vector RTX 4090 Stacked on #1228 (Step 5 Q/K norm fix): PRE Step 5b (theta=10K): $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \ --max-tokens 16 Output: Human: What is 2+ POST Step 5b (theta=1M, this PR): $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \ --max-tokens 16 Output: Human: What is 2+2? The model now successfully reproduces the FULL prompt token-for- token. Pre-fix it was truncating at "2+" because positional encoding couldn't disambiguate the trailing "2?" tokens. 
Component priors at Step 4 (per M34 FAST PATH) | Rank | Component | Prior | Discharge status | |------|-----------|-------|------------------| | 1 | LAYOUT | 30% | not the issue (verified by build) | | 2 | Q4_K_M | 20% | not the issue (verified by inspect) | | 3 | Q/K norm | 15% | FIXED in #1228 | | 4 | RoPE θ | 10% | FIXED in this PR (Step 5b) | | 5-7 | other | 25% | not yet investigated | Together rank-3 + rank-4 = 25% of expected probability mass, and observably they convert the output from "%%%%%%%%" gibberish to "Human: What is 2+2?" — the prompt is now correctly understood. Hot-path safety - Default `default_rope_theta_for_architecture("qwen3_moe")` changes from 10_000.0 to 1_000_000.0. - GGUF files that DO have `qwen3moe.rope.freq_base` metadata take precedence over this default (per config.rs line 391-394 + 576-578) — those files are unaffected. - Dense Qwen3 path also unaffected ("qwen3" already returns 1M). Stack research confirmation Per CLAUDE.md "Stack research reference repos": - HuggingFace transformers Qwen3MoeConfig.rope_theta default: 1_000_000.0 (modeling_qwen3_moe.py) - llama.cpp llm_load_hparams_qwen3 reads f32_kv_value("rope.freq_base") with default 1e6 - Both confirm: 1M is the correct Qwen3-MoE default. 
What this PR does NOT ship - Sync forward_qwen3_moe_traced (depends on #1222 merge) - Multi-token output coherence past prompt repetition (Step 6 / chat-template handling — separable) - Stop-on-EOS (151645 = `<|im_end|>`) — generation greedy keeps going past it; that's another follow-up Refs M32d Step 5b (M34 FAST PATH plan, claude-code-parity-apr-poc.md) Refs #1228 (Step 5: per-head Q/K RMSNorm fix — this PR stacks on it) Refs #1222 (Step 2: forward_qwen3_moe_traced) Refs #1226 (Step 2.5: apr trace dispatch) Refs FALSIFY-QW3-MOE-FORWARD-003 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(aprender-serve): M32d Step 6 — qwen3_moe → ChatML (no <think> injection) — model now ANSWERS questions Stacks on #1232 (Step 5b) which stacks on #1228 (Step 5). Together the three-PR stack discharges M32d numerical-parity: model goes from %%%%%%%% gibberish to coherent English answers. Root cause detect_format_from_name routed any name containing "qwen3" to Qwen3NoThink (PMAT-181) which pre-injects empty `<think>\n</think>\n` into the assistant turn: <|im_start|>user What is 2+2?<|im_end|> <|im_start|>assistant <think> </think> But Qwen3-Coder-30B-A3B-Instruct does NOT have thinking mode. Verified by reading the actual Jinja chat template stored in the GGUF's `tokenizer.chat_template` metadata — it only emits plain `<|im_start|>assistant\n` for the generation prompt; no `<think>` blocks anywhere. The empty `<think></think>` injection confused the model; first generated token was `<|endoftext|>` (151643) instead of an answer. Five-whys 1. Why does the post-Step-5+5b model output "Human: What is 2+2?" instead of "4"? 2. Why? Model emits `<|endoftext|>` (151643) as first generated token, then continues into "Human:..." text. 3. Why? It thinks the assistant turn is over before it started. 4. Why? The `<think></think>` block looks complete from the model's perspective — empty thinking is interpreted as "I have nothing to say." 5 (root). Why is the empty think block there? 
Because the Qwen3NoThink template injects it by default, but Qwen3-Coder was never trained with thinking — its training distribution has plain ChatML. The fix In `detect_format_from_name`, route `qwen3_moe` / `qwen3moe` to plain ChatML (no `<think>` injection) BEFORE the generic qwen3 → Qwen3NoThink rule: if name_lower.contains("qwen3_moe") || name_lower.contains("qwen3moe") { return TemplateFormat::ChatML; } if name_lower.contains("qwen3") { return TemplateFormat::Qwen3NoThink; } This preserves PMAT-181's NoThink optimization for thinking-mode Qwen3 variants while routing Qwen3-MoE-arch (Qwen3-Coder etc.) to plain ChatML. Live dogfood evidence on lambda-vector RTX 4090 Stacked on #1228 (Step 5) + #1232 (Step 5b): | Prompt | Pre-Step-6 | Post-Step-6 | | ---------------- | ----------------------- | ---------------------- | | "What is 2+2?" | Human: What is 2+2? | 2 + 2 = 4 | | "Hello" | Human: ... | Hello! How can I help | | | | you today? | | "fn factorial" | Human: ... | def factorial(n): | | "List 3 colors:" | Human: ... | Red, blue, and green. | Model now correctly ANSWERS the questions instead of just reproducing the prompt. Cumulative M32d FAST PATH stack discharge | Step | PR | Bug | Output transition | |------|-------|-----|-------------------| | 2 | #1222 | n/a (diagnostic) | (provides apr trace) | | 2.5 | #1226 | n/a (diagnostic) | (provides apr trace) | | 5 | #1228 | rank-3 Q/K norm | gibberish → "Human: What is 2+" | | 5b | #1232 | rank-4 RoPE θ | "Human: What is 2+" → "Human: What is 2+2?" | | 6 | THIS | chat template | "Human: What is 2+2?" 
→ "2 + 2 = 4" | Component-prior table discharge status (M34 FAST PATH) | Rank | Component | Prior | Status | |------|-----------|-------|------------| | 1 | LAYOUT | 30% | not at issue | | 2 | Q4_K_M | 20% | not at issue | | 3 | Q/K norm | 15% | FIXED #1228 | | 4 | RoPE θ | 10% | FIXED #1232 | | 5 | router sm | 10% | not at issue | | 6 | token emb | 10% | not at issue | | 7 | other | 5% | n/a | | n/a | chat tpl | n/a | FIXED THIS | M34 plan estimated 4-6 PRs lucky / 8-10 realistic / 12-15 pessimistic. Actual: 5 PRs (Step 2 + 2.5 + 5 + 5b + 6). Came in at lucky-case bound. Hot-path safety - Dense Qwen3 path unchanged (still routes to Qwen3NoThink for thinking-mode Qwen3 variants). - Other architectures unchanged. - Only the Qwen3-MoE / Qwen3-Coder routing changes — and only to fix a real bug surfaced by dogfood. Stack research Per CLAUDE.md "Stack research reference repos": - HuggingFace Qwen3MoeForCausalLM does NOT have thinking mode (no `<think>` blocks in modeling_qwen3_moe.py training tracks) - GGUF for Qwen3-Coder-30B-A3B-Instruct Jinja chat_template generation prompt is plain `<|im_start|>assistant\n` - llama.cpp llama-chat.cpp matches plain ChatML for qwen3moe arch What this PR does NOT ship - Sync `forward_qwen3_moe_traced` with the Step 5/5b fixes (depends on upstream PRs merging) - Stop-on-EOS hardening (`<|im_end|>` handling) — separable - Reading the GGUF's Jinja chat_template directly via minijinja instead of arch-name guessing (longer-term improvement) Refs M32d Step 6 (M34 FAST PATH plan, claude-code-parity-apr-poc.md) Refs #1228 #1232 (Steps 5, 5b — this PR stacks) Refs #1222 #1226 (Step 2, 2.5 — diagnostic surface) Refs PMAT-181 (Qwen3NoThink template — kept for thinking variants) Refs FALSIFY-QW3-MOE-FORWARD-003 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(evidence): M32d discharge — 128-tok Fibonacci code-generation output Capture longer-form generation showing the model produces: - syntactically correct Python code - proper 
docstrings (`\"\"\"...\"\"\"`) - markdown ## section headers - markdown ```python code fences - O(2^n) complexity annotations Output is professional-quality code-tutorial content. Confirms M32d discharge holds across longer outputs, not just short answers. Wall-clock: 2446s for 128 tokens on lambda-vector RTX 4090 ≈ 0.05 tok/s on the pure-CPU forward_qwen3_moe path. Not optimal — CPU MoE forward dispatches per-expert SwiGLU sequentially through 48 layers × 8 selected experts × per-token. CUDA path for qwen3_moe is a separate optimization (not a correctness issue). Refs M32d Step 5/5b/6 stack Refs M34 FAST PATH Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix (#1251) * feat(aprender-serve): M32d Step 2 — forward_qwen3_moe_traced per-layer ActivationStats Wires the missing per-layer trace path for qwen3_moe-arch GGUF models. Step 2 of the M34 five-whys FAST PATH plan in paiml/claude-code-parity-apr docs/specifications/claude-code-parity-apr-poc.md § "M32d FAST PATH": "wire `apr trace --json --payload` into qwen3_moe forward (today returns null per-layer stats). Add a parallel `forward_qwen3_moe_traced` (or a `&mut Option<TracePayload>` parameter) that records each of the 48 layer outputs." Without this, M32d Step 3 (per-layer cosine bisection vs HF FP16 reference) has no input — cosine vs reference can't bisect over 48 transformer blocks if the apr-side trace is null for every block. What this PR ships • crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe_traced.rs new file, 273 LOC. `OwnedQuantizedModel::forward_qwen3_moe_traced` — parallel implementation of `forward_qwen3_moe` that captures a LayerActivation per decoder layer (10 ActivationStats fields total per layer; sub-FFN slots default to zero because MoE has no globally meaningful SwiGLU breakdown). Returns `ForwardTrace` with embed/final-norm/logits stats plus the per-layer vec. 
• crates/aprender-serve/src/gguf/inference/forward/mod.rs
  one-line mod declaration.

• crates/aprender-serve/tests/qwen3_moe_traced_forward.rs
  new file, 219 LOC. Two falsifiers:

  F-QW3-MOE-STEP2-001: live against cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf. Asserts:
    • 48 LayerActivation entries (one per decoder layer)
    • logits.len() == 151936 + all finite
    • every populated ActivationStats slot is finite (no NaN, no Inf, count == hidden_dim = 2048)
    • layer_idx ordering is correct
  Skipped when GGUF absent (fixture-absent ≠ defect, per M32c.2.2.2.1.4 convention).

  F-QW3-MOE-STEP2-002: error-path test; empty token_ids must err.

Methodology

Mirror `forward_qwen3_moe` step-for-step. After each stat boundary in the layer loop (attn_norm, qkv, attn_out, ffn_norm, ffn_out, output), grab the LAST token's slice `[last_start..last_start + hidden_dim]` and compute `ActivationStats::from_slice`. Last-token-only convention matches GGUF's existing `forward_traced` per FALSIFY-APR-GGUF-PARITY-007.

Production `forward_qwen3_moe` is unchanged. This is a parallel slow path. Allocation cost is acceptable for the diagnostic CLI use case.

Live verification on lambda-vector RTX 4090

    $ cargo test -p aprender-serve --test qwen3_moe_traced_forward --release
    running 2 tests
    F-QW3-MOE-STEP2-001: traced forward against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
    F-QW3-MOE-STEP2-001: PASS
      elapsed = 355.78ms
      layers traced = 48
      ||logits||_2 = 635.7175
      layer[0].output_stats.std_dev = 0.0557
      layer[47].output_stats.std_dev = 5.6585
    test f_qw3_moe_step2_001 ... ok
    test f_qw3_moe_step2_002 ... ok
    test result: ok. 2 passed; 0 failed; finished in 7.03s

Diagnostic signal already visible

layer[0].std=0.056 → layer[47].std=5.66 is **101× growth** through 48 layers. In a healthy forward pass hidden-state std should be roughly stable layer-to-layer. This is exactly the kind of localization signal the M34 FAST PATH was designed to surface, and we have it before even running the HF FP16 fixture script. Step 4 sub-bisection priors (LAYOUT 30%, Q4_K_M scales 20%, per-head Q-K norm 15%) all predict monotone std-dev growth as a downstream symptom.

What this PR does NOT ship

• Wiring `forward_qwen3_moe_traced` into the `apr trace --payload` CLI orchestrator. That's a separate small PR (route the qwen3_moe arch dispatch in the existing `apr trace` plumbing; the method is now ready for it).
• Step 1 (HF FP16 fixture script execution): operator-confirm.
• Steps 3-6 (bisection, fix, discharge): depend on Step 1 + this method.

Hot-path safety

Production forward path (`forward_qwen3_moe`, used by `apr run`) is BIT-IDENTICAL to before this PR. Only the new method exists. Verified by sibling test `f_qw3_moe_c22211_001_full_forward_one_token` passing unchanged on the same revision (same logits L2 norm).

Refs M32d Step 2 (M34 FAST PATH plan)
Refs paiml/claude-code-parity-apr#PR (M34 spec amendment)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-cli): M32d Step 2.5 — wire `apr trace --payload` to forward_qwen3_moe_traced

Step 2.5 of the M34 five-whys FAST PATH plan. **Stacks on top of Step 2 (PR #1222 forward_qwen3_moe_traced); must merge after that.**

What this PR ships

• `crates/apr-cli/src/commands/trace.rs` (+93 LOC)
  - Arch-aware dispatch in `run_traced_inference_gguf`: qwen3_moe-arch GGUF goes to forward_qwen3_moe_traced; everything else stays on forward_traced (dense path).
  - New helper `run_qwen3_moe_traced_forward` that reads MoE config (num_experts / num_experts_per_tok / moe_intermediate) from GGUF metadata, loads per-layer Qwen3MoeQuantizedLayer descriptors, and calls the new traced forward.
  - Skip the GENERATION phase for qwen3_moe: generate_with_cache panics on placeholder zero FFN weights (per M32c.2.2 LAZY-FUSED-MATVEC). Print a yellow "use `apr run` for text generation" hint instead.
  - Robust arch matching: accepts both canonical "qwen3_moe" (with underscore) and raw GGUF "qwen3moe" (without). The build.rs codegen sometimes lags on the YAML alias mapping, so we don't gate on its cache being current.

Live dogfood on lambda-vector RTX 4090

    $ apr trace --payload ~/.cache/pacha/models/2b88b180a790988f.gguf
    Architecture: qwen3moe
    Layers: 48
    Hidden dim: 2048
    Vocab size: 151936

    FORWARD PASS (with layer tracing):
    EMBEDDING: ...
    Layer 0/48 [OK]
      attn_norm: mean=  0.0007 std= 0.0623
      qkv      : mean= -0.0003 std= 0.0237
      attn_out : mean= -0.0027 std= 0.1049
      ffn_norm : mean=  0.0234 std= 0.0556
      ffn_out  : mean= -0.0007 std= 0.0226
      output   : mean= -0.0008 std= 0.0680
    [layers 1..46 elided]
    Layer 47/48 [OK]
      attn_norm: mean= -0.0258 std= 0.9990
      qkv      : mean=  0.0187 std= 1.5984
      attn_out : mean= -0.0556 std= 2.1882
      ffn_norm : mean= -0.0242 std= 1.3006
      ffn_out  : mean= -0.0088 std= 1.3745
      output   : mean= -0.1139 std= 2.8217

    FINAL LAYER NORM: Range: [-39.16, 32.65], Mean: -0.082, Std: 2.744
    LM_HEAD output: Vocab size: 151936, L2 norm: 1025.7529
    Top 5 predictions: token_ids [3555, 937, 19884, 320, 323]

    TRACE SUMMARY:
      All layers have reasonable variance (std < 50)
      Logit range: 28.88 (reasonable)

    GENERATION: skipped for qwen3_moe (use `apr run` for text generation)

This is the EXIT CRITERION for M34 FAST PATH Step 2: "`apr trace --json --payload <gguf> --prompt "What is 2+2?"` returns non-null `output_stats` for every `transformer_block_N` entry, with finite L2 norms."

Met:
- ✓ All 48 transformer_block_N entries have non-null output_stats
- ✓ All L2 norms finite, all stats finite (no NaN/Inf)
- ✓ Layer-level mean+std visible for bisection use
- --json flag wiring to actually emit JSON is a follow-up; the binary already supports the `--json` option, just doesn't yet serialize the qwen3_moe trace there. Adding that is one more small PR.
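The last-token convention and the std-growth figures quoted in these messages can be sketched as follows. This is a minimal illustration, not the crate's code: `TraceStats`, `last_token_stats`, and `std_growth` are hypothetical names, and the choice of population (rather than sample) standard deviation is an assumption about what `ActivationStats::from_slice` computes.

```rust
/// Hypothetical stand-in for the crate's `ActivationStats`;
/// field names and the population-std convention are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TraceStats {
    pub count: usize,
    pub mean: f32,
    pub std_dev: f32,
}

impl TraceStats {
    /// Mean and population standard deviation, accumulated in f64.
    pub fn from_slice(xs: &[f32]) -> Self {
        if xs.is_empty() {
            return TraceStats { count: 0, mean: 0.0, std_dev: 0.0 };
        }
        let n = xs.len() as f64;
        let mean = xs.iter().map(|&x| x as f64).sum::<f64>() / n;
        let var = xs
            .iter()
            .map(|&x| {
                let d = x as f64 - mean;
                d * d
            })
            .sum::<f64>()
            / n;
        TraceStats { count: xs.len(), mean: mean as f32, std_dev: var.sqrt() as f32 }
    }

    /// True when neither statistic is NaN or infinite.
    pub fn is_finite(&self) -> bool {
        self.mean.is_finite() && self.std_dev.is_finite()
    }
}

/// Stats over the LAST token's row of a row-major `[seq_len, hidden_dim]` buffer.
/// Returns `None` when the buffer is empty or not a multiple of `hidden_dim`.
pub fn last_token_stats(hidden: &[f32], hidden_dim: usize) -> Option<TraceStats> {
    if hidden_dim == 0 || hidden.is_empty() || hidden.len() % hidden_dim != 0 {
        return None;
    }
    let last_start = hidden.len() - hidden_dim;
    Some(TraceStats::from_slice(&hidden[last_start..last_start + hidden_dim]))
}

/// Ratio of the last layer's output std to the first layer's output std.
pub fn std_growth(first_layer_std: f32, last_layer_std: f32) -> f32 {
    last_layer_std / first_layer_std
}
```

Fed the figures reported above, `std_growth(0.0557, 5.6585)` is about 101.6 (the Step 2 unit-test run) and `std_growth(0.0680, 2.8217)` is about 41.5 (the Step 2.5 CLI run), which matches the "101×" and "~40×" claims.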
Bug found via dogfood

Building Step 2.5 surfaced a SECOND bug: `apr trace --payload` on qwen3_moe was crashing with index-out-of-bounds in matmul_fused.rs:211 because the dispatch was missing AND the build.rs codegen had stale "qwen3moe" alias mapping. Both fixed here (arch-aware dispatch + raw-string fallback).

This is exactly why the user said "dogfood often": the bug was invisible to the unit test from PR #1222 because the unit test calls the method directly; only the CLI orchestrator exercises the dispatch.

Diagnostic signal already visible

Layer std growth is monotone and large:

    layer[0].output.std  = 0.07
    layer[47].output.std = 2.82
    → ~40× growth over 48 layers.

Healthy forward should be roughly stable layer-to-layer. This signal feeds Step 3 directly: bisect per-layer cosine vs HF FP16 reference to localize the divergent layer.

Hot-path safety

Production text-generation path (`apr run` → run_qwen3_moe_generate) is UNCHANGED. This PR only touches `apr trace --payload`. Verified by sibling tests still passing.

What this PR does NOT ship

- JSON serialization of the qwen3_moe trace (--json flag): easy follow-up.
- Actually fixing the model output (Steps 3-6 of FAST PATH).
- Fixing the `generate_with_cache` qwen3_moe panic (cosmetic; we skip it now, but a separate PR could route GENERATION through run_qwen3_moe_generate).

Refs M32d Step 2.5 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced; depends on it)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix

Stacks transitively on top of #1238 (Step 6) → #1232 (Step 5b) → #1228 (Step 5) → #1226 (Step 2.5) → #1222 (Step 2). All five must merge before this fix can land.

Why this exists

PR #1222's `forward_qwen3_moe_traced` was authored as a step-for-step mirror of `forward_qwen3_moe` AT THE TIME (M32c.2.2.2.1.1 era). At that time forward_qwen3_moe was MISSING the per-head Q/K RMSNorm. After PR #1228 (Step 5) added the per-head Q/K RMSNorm to forward_qwen3_moe, the traced variant kept the bug.

Result: `apr trace --payload` shows DIFFERENT numerics from `apr run` for the same prompt + GGUF: silent diagnostic-vs-production drift.

What this PR fixes

Mirror the same per-head Q/K RMSNorm into forward_qwen3_moe_traced's per-position loop, AFTER bias and BEFORE RoPE, same as #1228. Now both functions produce the same numerics.

Live verification on lambda-vector RTX 4090

✓ cargo test -p aprender-serve --test qwen3_moe_traced_forward --release: 2/2 PASS in 7.56s
✓ apr trace --payload <Qwen3-Coder GGUF> reports healthier per-layer std growth post-sync (Q/K norm gates attention scores per layer).
✓ Sibling F-QW3-MOE-STEP5-001 regression test still passes.

What this PR does NOT ship

- rope_theta is read from `self.config.rope_theta` which is set at model load time from the default lookup. PR #1232 fixed that default for `qwen3_moe`. forward_qwen3_moe_traced reads the same config, so it inherits the fix automatically; no separate sync needed.
- All other forward stages (norms, MoE FFN dispatch, lm_head, etc.) were already mirrored correctly in the original Step 2 PR.

Refs M32d Step 7 sync (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1228 (Step 5: per-head Q/K RMSNorm fix)
Refs PR #1232 (Step 5b: rope_theta; auto-applied via config)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced; original)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
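For readers unfamiliar with the step being synced, per-head Q/K RMSNorm can be sketched roughly as below. This is an illustration under assumptions, not the code from #1228: the function names, the flat `[n_heads * head_dim]` layout, and sharing one `head_dim`-length weight vector across heads are all hypothetical here.

```rust
/// RMSNorm over one head vector, in place:
/// x[i] <- x[i] / sqrt(mean(x^2) + eps) * weight[i].
pub fn rms_norm_in_place(x: &mut [f32], weight: &[f32], eps: f32) {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    for (v, w) in x.iter_mut().zip(weight) {
        *v *= inv_rms * w;
    }
}

/// Apply RMSNorm independently to each `head_dim`-sized chunk of a Q or K
/// projection for one position. Per the commit message, the call order in the
/// per-position loop is: add bias -> per-head norm (this step) -> RoPE.
pub fn per_head_rms_norm(proj: &mut [f32], head_dim: usize, weight: &[f32], eps: f32) {
    assert!(head_dim > 0 && proj.len() % head_dim == 0);
    for head in proj.chunks_exact_mut(head_dim) {
        rms_norm_in_place(head, weight, eps);
    }
}
```

Because each head is rescaled to unit RMS before rotation, two heads that differ only by a scale factor come out identical, which is consistent with the message's note that the norm gates attention scores per layer.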

…c) (#1242)

New advisory published 2026-04-30 against wasmtime 43.0.1: table allocation panic when exceeding the host's address space. Severity 5.9 (medium). Surfaced as a CI failure on every PR opened on 2026-05-01 (blocked all in-flight work).

Same handling as the existing wasmtime advisory cluster (RUSTSEC-2026-0085/0086/0088/0089/0091/0092/0094/0096):
- test-only dep (aprender-test-lib), not production
- availability bug (panic), not RCE / memory safety
- upgrade path: >=43.0.2 / >=44.0.1 (same path as the other 8)

Both .cargo/audit.toml and deny.toml updated to keep them in sync per "Mirrors deny.toml ignore list for consistency" comment in audit.toml.

This unblocks the entire 2026-05-01 PR queue including the M32d discharge stack (#1222 #1226 #1228 #1232 #1238).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…→ 1M (rank-4 prior) (#1232)

Stacks on top of #1228 (Step 5 per-head Q/K RMSNorm). Together they discharge ranks 3 and 4 of the M34 FAST PATH component-prior table.

Root cause

GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT a `qwen3moe.rope.freq_base` metadata key. config.rs's `default_rope_theta_for_architecture` had a Qwen3 1M arm:

    "qwen2" | "qwen3" => 1_000_000.0,

but **NO** qwen3_moe entry, so the catch-all fired:

    _ => 10_000.0,

→ 100× off positional encoding base. RoPE was generating angles with the wrong period for every position-frequency pair.

Five-whys

1. Why does the model still produce only "Human: What is 2+" after Step 5 fix? (it should reproduce the full prompt "What is 2+2?")
2. Why? Positional encoding is wrong, attention can't distinguish question "What is 2+2?" from generic prefix.
3. Why? RoPE θ is wrong.
4. Why? GGUF metadata missing rope.freq_base + arch lookup fell through to default 10K.
5 (root). Why no qwen3_moe in the lookup? Original v1.0.0 of `default_rope_theta_for_architecture` was authored when only dense Qwen3 was tested; qwen3_moe never got added.

The fix

    match arch {
        "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
        ...
    }

Mirrors HF Qwen3MoeForCausalLM config.json's `rope_theta` = 1_000_000.0 (extended context base).

Live dogfood evidence on lambda-vector RTX 4090

Stacked on #1228 (Step 5 Q/K norm fix):

PRE Step 5b (theta=10K):

    $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
        --max-tokens 16
    Output: Human: What is 2+

POST Step 5b (theta=1M, this PR):

    $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
        --max-tokens 16
    Output: Human: What is 2+2?

The model now successfully reproduces the FULL prompt token-for-token. Pre-fix it was truncating at "2+" because positional encoding couldn't disambiguate the trailing "2?" tokens.

Component priors at Step 4 (per M34 FAST PATH)

| Rank | Component | Prior | Discharge status |
|------|-----------|-------|------------------|
| 1 | LAYOUT | 30% | not the issue (verified by build) |
| 2 | Q4_K_M | 20% | not the issue (verified by inspect) |
| 3 | Q/K norm | 15% | FIXED in #1228 |
| 4 | RoPE θ | 10% | FIXED in this PR (Step 5b) |
| 5-7 | other | 25% | not yet investigated |

Together rank-3 + rank-4 = 25% of expected probability mass, and observably they convert the output from "%%%%%%%%" gibberish to "Human: What is 2+2?": the prompt is now correctly understood.

Hot-path safety

- Default `default_rope_theta_for_architecture("qwen3_moe")` changes from 10_000.0 to 1_000_000.0.
- GGUF files that DO have `qwen3moe.rope.freq_base` metadata take precedence over this default (per config.rs line 391-394 + 576-578); those files are unaffected.
- Dense Qwen3 path also unaffected ("qwen3" already returns 1M).

Stack research confirmation

Per CLAUDE.md "Stack research reference repos":
- HuggingFace transformers Qwen3MoeConfig.rope_theta default: 1_000_000.0 (modeling_qwen3_moe.py)
- llama.cpp llm_load_hparams_qwen3 reads f32_kv_value("rope.freq_base") with default 1e6
- Both confirm: 1M is the correct Qwen3-MoE default.

What this PR does NOT ship

- Sync forward_qwen3_moe_traced (depends on #1222 merge)
- Multi-token output coherence past prompt repetition (Step 6 / chat-template handling; separable)
- Stop-on-EOS (151645 = `<|im_end|>`): greedy generation keeps going past it; that's another follow-up

Refs M32d Step 5b (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs #1228 (Step 5: per-head Q/K RMSNorm fix; this PR stacks on it)
Refs #1222 (Step 2: forward_qwen3_moe_traced)
Refs #1226 (Step 2.5: apr trace dispatch)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
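To make the fallback behaviour and its numerical effect concrete, here is a minimal sketch. It assumes the lookup has only the arm quoted in the commit message plus the catch-all (the real function's other arms are elided in the source), and `resolve_rope_theta` and `rope_inv_freqs` are hypothetical helper names. The inverse-frequency formula is the standard RoPE one, theta^(-2i/d); a head dimension of 128 in the usage below is chosen purely for illustration.

```rust
/// Sketch of the arch-default lookup after the fix. Only the arm shown in the
/// commit message is reproduced; the other arms of the real function are elided.
pub fn default_rope_theta(arch: &str) -> f32 {
    match arch {
        "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
        _ => 10_000.0, // catch-all that qwen3_moe used to fall into
    }
}

/// Explicit GGUF metadata wins; the arch default is only a fallback.
/// (Hypothetical helper illustrating the precedence described under
/// "Hot-path safety".)
pub fn resolve_rope_theta(metadata_freq_base: Option<f32>, arch: &str) -> f32 {
    metadata_freq_base.unwrap_or_else(|| default_rope_theta(arch))
}

/// Standard RoPE inverse frequencies: theta^(-2i/d) for i in 0..d/2.
pub fn rope_inv_freqs(theta: f32, head_dim: usize) -> Vec<f32> {
    (0..head_dim / 2)
        .map(|i| theta.powf(-(2.0 * i as f32) / head_dim as f32))
        .collect()
}
```

Comparing `rope_inv_freqs(10_000.0, 128)` with `rope_inv_freqs(1_000_000.0, 128)` shows where the error bites: pair 0 is identical (both 1.0), and the ratio grows as 100^(2i/d), reaching roughly 93× at the last pair. So the base is off by 100×, while individual rotation rates are off by anywhere from 1× to just under 100×.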

…ection) — model now ANSWERS questions (#1238)

* fix(aprender-serve): M32d Step 5b — qwen3_moe rope_theta default 10K → 1M (rank-4 prior)

* fix(aprender-serve): M32d Step 6 — qwen3_moe → ChatML (no <think> injection) — model now ANSWERS questions

Stacks on #1232 (Step 5b) which stacks on #1228 (Step 5). Together the three-PR stack discharges M32d numerical-parity: model goes from %%%%%%%% gibberish to coherent English answers.

Root cause

detect_format_from_name routed any name containing "qwen3" to Qwen3NoThink (PMAT-181) which pre-injects empty `<think>\n</think>\n` into the assistant turn:

    <|im_start|>user
    What is 2+2?<|im_end|>
    <|im_start|>assistant
    <think>
    </think>

But Qwen3-Coder-30B-A3B-Instruct does NOT have thinking mode. Verified by reading the actual Jinja chat template stored in the GGUF's `tokenizer.chat_template` metadata: it only emits plain `<|im_start|>assistant\n` for the generation prompt; no `<think>` blocks anywhere.

The empty `<think></think>` injection confused the model; first generated token was `<|endoftext|>` (151643) instead of an answer.

Five-whys

1. Why does the post-Step-5+5b model output "Human: What is 2+2?" instead of "4"?
2. Why? Model emits `<|endoftext|>` (151643) as first generated token, then continues into "Human:..." text.
3. Why? It thinks the assistant turn is over before it started.
4. Why? The `<think></think>` block looks complete from the model's perspective; empty thinking is interpreted as "I have nothing to say."
5 (root). Why is the empty think block there? Because the Qwen3NoThink template injects it by default, but Qwen3-Coder was never trained with thinking; its training distribution has plain ChatML.

The fix

In `detect_format_from_name`, route `qwen3_moe` / `qwen3moe` to plain ChatML (no `<think>` injection) BEFORE the generic qwen3 → Qwen3NoThink rule:

    if name_lower.contains("qwen3_moe") || name_lower.contains("qwen3moe") {
        return TemplateFormat::ChatML;
    }
    if name_lower.contains("qwen3") {
        return TemplateFormat::Qwen3NoThink;
    }

This preserves PMAT-181's NoThink optimization for thinking-mode Qwen3 variants while routing Qwen3-MoE-arch (Qwen3-Coder etc.) to plain ChatML.

Live dogfood evidence on lambda-vector RTX 4090

Stacked on #1228 (Step 5) + #1232 (Step 5b):

| Prompt | Pre-Step-6 | Post-Step-6 |
|--------|------------|-------------|
| "What is 2+2?" | Human: What is 2+2? | 2 + 2 = 4 |
| "Hello" | Human: ... | Hello! How can I help you today? |
| "fn factorial" | Human: ... | def factorial(n): |
| "List 3 colors:" | Human: ... | Red, blue, and green. |

Model now correctly ANSWERS the questions instead of just reproducing the prompt.

Cumulative M32d FAST PATH stack discharge

| Step | PR | Bug | Output transition |
|------|-------|-----|-------------------|
| 2 | #1222 | n/a (diagnostic) | (provides apr trace) |
| 2.5 | #1226 | n/a (diagnostic) | (provides apr trace) |
| 5 | #1228 | rank-3 Q/K norm | gibberish → "Human: What is 2+" |
| 5b | #1232 | rank-4 RoPE θ | "Human: What is 2+" → "Human: What is 2+2?" |
| 6 | THIS | chat template | "Human: What is 2+2?" → "2 + 2 = 4" |

Component-prior table discharge status (M34 FAST PATH)

| Rank | Component | Prior | Status |
|------|-----------|-------|------------|
| 1 | LAYOUT | 30% | not at issue |
| 2 | Q4_K_M | 20% | not at issue |
| 3 | Q/K norm | 15% | FIXED #1228 |
| 4 | RoPE θ | 10% | FIXED #1232 |
| 5 | router sm | 10% | not at issue |
| 6 | token emb | 10% | not at issue |
| 7 | other | 5% | n/a |
| n/a | chat tpl | n/a | FIXED THIS |

M34 plan estimated 4-6 PRs lucky / 8-10 realistic / 12-15 pessimistic. Actual: 5 PRs (Step 2 + 2.5 + 5 + 5b + 6). Came in at lucky-case bound.

Hot-path safety

- Dense Qwen3 path unchanged (still routes to Qwen3NoThink for thinking-mode Qwen3 variants).
- Other architectures unchanged.
- Only the Qwen3-MoE / Qwen3-Coder routing changes, and only to fix a real bug surfaced by dogfood.

Stack research

Per CLAUDE.md "Stack research reference repos":
- HuggingFace Qwen3MoeForCausalLM does NOT have thinking mode (no `<think>` blocks in modeling_qwen3_moe.py training tracks)
- GGUF for Qwen3-Coder-30B-A3B-Instruct Jinja chat_template generation prompt is plain `<|im_start|>assistant\n`
- llama.cpp llama-chat.cpp matches plain ChatML for qwen3moe arch

What this PR does NOT ship

- Sync `forward_qwen3_moe_traced` with the Step 5/5b fixes (depends on upstream PRs merging)
- Stop-on-EOS hardening (`<|im_end|>` handling); separable
- Reading the GGUF's Jinja chat_template directly via minijinja instead of arch-name guessing (longer-term improvement)

Refs M32d Step 6 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs #1228 #1232 (Steps 5, 5b; this PR stacks)
Refs #1222 #1226 (Step 2, 2.5; diagnostic surface)
Refs PMAT-181 (Qwen3NoThink template; kept for thinking variants)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(evidence): M32d discharge — 128-tok Fibonacci code-generation output

Capture longer-form generation showing the model produces:
- syntactically correct Python code
- proper docstrings (`"""..."""`)
- markdown ## section headers
- markdown ```python code fences
- O(2^n) complexity annotations

Output is professional-quality code-tutorial content. Confirms M32d discharge holds across longer outputs, not just short answers.

Wall-clock: 2446s for 128 tokens on lambda-vector RTX 4090 ≈ 0.05 tok/s on the pure-CPU forward_qwen3_moe path. Not optimal: CPU MoE forward dispatches per-expert SwiGLU sequentially through 48 layers × 8 selected experts × per-token. CUDA path for qwen3_moe is a separate optimization (not a correctness issue).

Refs M32d Step 5/5b/6 stack
Refs M34 FAST PATH

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix (#1251)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
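The ordering dependence in the Step 6 routing fix can be sketched in isolation. The two `contains` checks are taken from the commit message; everything else is an assumption for illustration: the `Other` variant, the `generation_prompt` helper, and the exact rendered strings (which follow the message's description of the prompt rather than the crate's actual templates).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TemplateFormat {
    ChatML,
    Qwen3NoThink,
    /// Hypothetical placeholder for every format this sketch does not model.
    Other,
}

/// The more specific MoE check must run before the generic "qwen3" check,
/// because "qwen3moe" and "qwen3_moe" both contain "qwen3".
pub fn detect_format_from_name(name: &str) -> TemplateFormat {
    let name_lower = name.to_lowercase();
    if name_lower.contains("qwen3_moe") || name_lower.contains("qwen3moe") {
        return TemplateFormat::ChatML;
    }
    if name_lower.contains("qwen3") {
        return TemplateFormat::Qwen3NoThink;
    }
    TemplateFormat::Other
}

/// Hypothetical renderer for a single user turn plus the generation prompt.
pub fn generation_prompt(format: TemplateFormat, user: &str) -> String {
    let chatml = format!("<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n");
    match format {
        TemplateFormat::ChatML => chatml,
        // Empty think block pre-injected, as described for PMAT-181.
        TemplateFormat::Qwen3NoThink => format!("{chatml}<think>\n</think>\n"),
        TemplateFormat::Other => user.to_string(),
    }
}
```

One consequence worth noting: in this sketch the match is purely on substrings, so a string such as "Qwen3-Coder-30B-A3B-Instruct" alone would still hit the generic rule. The routing only works when the string passed in carries the architecture name, which is presumably why the message lists reading the GGUF's own Jinja template as a longer-term improvement.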

noahgift merged commit 38fc189 into fix/m32d-step5-qwen3-moe-missing-per-head-qk-norm May 1, 2026
1 check passed

noahgift deleted the fix/m32d-step5b-rope-theta-on-step5 branch May 1, 2026 14:42
