feat(apr-cli): apr trace --json --payload — wire JSON output for FAST PATH Step 2 exit criterion#1401
Merged
Merged
Conversation
… PATH Step 2 exit criterion
`apr trace --json --payload <gguf>` was silently ignoring `--json` when
`--payload` was set, falling back to the human-readable text format. The
FAST PATH Step 2 exit criterion in
paiml/claude-code-parity-apr docs/specifications/claude-code-parity-apr-poc.md
explicitly says:
"apr trace --json --payload <gguf> --prompt 'What is 2+2?' returns
non-null output_stats for every transformer_block_N entry, with
finite L2 norms."
Now mechanically satisfied. Schema:
```jsonc
{
"format": "GGUF (qwen3moe)",
"architecture": "qwen3moe",
"num_layers": 48, "hidden_dim": 2048, "vocab_size": 151936,
"num_heads": 32, "num_kv_heads": 4,
"prompt": "What is 2+2?",
"encoded_tokens": [3838, 374, 220, 17, 10, 17, 30],
"embedding": { "min": ..., "max": ..., "mean": ..., "std_dev": ...,
"nan_count": 0, "inf_count": 0, "zero_count": 0,
"count": 2048 },
"layers": [
{ "layer_idx": 0,
"attn_norm": {...stats...}, "qkv": {...stats...}, "attn_out": {...},
"ffn_norm": {...}, "ffn_out": {...}, "output": {...} },
/* 47 more, one per decoder layer */
],
"final_norm": {...stats...},
"logits_stats": {...stats...},
"logits": {
"vocab_size": 151936,
"l2_norm": 1025.7530,
"top_k": [{"token_id": 3555, "logit": 16.96}, ...]
}
}
```
Implementation
==============
- `handle_special_modes_with_json` (new) — JSON-aware variant of
`handle_special_modes`. Old function preserved as a thin wrapper
so existing test callers don't break.
- `run_traced_inference_json` (new) — JSON output path. Mirrors
`run_traced_inference_gguf` for trace computation but emits one
JSON object via serde_json::to_string_pretty.
- Skips the human-readable "Model: ..." / "Contract: ..." preamble
that `resolve_model_path` + `preflight_contract_check` would print
— those would break `apr trace --json --payload | jq` consumers.
Live verification on lambda-vector RTX 4090
============================================
$ apr trace --json --payload ~/.cache/pacha/models/2b88b180a790988f.gguf \
2>/dev/null | python3 -c '
import sys, json
d = json.load(sys.stdin)
print(f"arch: {d[\"architecture\"]}")
print(f"num_layers: {d[\"num_layers\"]}")
print(f"layers in payload: {len(d[\"layers\"])}")
all_finite = all(
isinstance(la[\"output\"][\"std_dev\"], (int, float))
and abs(la[\"output\"][\"std_dev\"]) < float(\"inf\")
for la in d[\"layers\"]
)
print(f"all 48 layers finite: {all_finite}")
'
arch: qwen3moe
num_layers: 48
layers in payload: 48
all 48 layers finite: True
Stdout is now strictly valid JSON; `2>/dev/null` discards BOS-FALLBACK
warnings on stderr.
What this PR does NOT ship
==========================
- Custom prompt via `--prompt <str>` flag (test prompt is hardcoded
"What is 2+2?"; matches text-mode default).
- Per-token-position trace (only LAST token captured per the GGUF
forward_traced + forward_qwen3_moe_traced semantics).
- Sub-FFN MoE breakdown (router output, per-expert contribution) —
those are zero in qwen3_moe traced forward; left for Step 4 work.
- SafeTensors JSON output — same encoding works there, just the
format string differs.
Hot-path safety
===============
- Existing `apr trace --payload` (no --json) text mode unchanged.
- Existing `apr trace --json` (no --payload) static-layer JSON mode
unchanged — `handle_special_modes_with_json` only branches when
BOTH json && payload are set.
- All 5 callers of `handle_special_modes` continue to work
(signature preserved via thin wrapper).
Refs M32d Step 2 exit criterion (M34 FAST PATH plan)
Refs PR #1226 (Step 2.5: apr trace dispatch — wired qwen3_moe arch)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — supplies the data)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Locks in the JSON output schema that satisfies the M34 FAST PATH Step 2
exit criterion. Asserts:
- stdout is valid JSON (no preamble lines breaking jq consumers)
- all 11 top-level fields present (format, architecture, num_layers,
hidden_dim, vocab_size, prompt, encoded_tokens, embedding, layers,
final_norm, logits)
- 48 layers (Qwen3-Coder-30B-A3B-Instruct)
- per-layer 7 fields present (layer_idx + 6 stat slots)
- every layer.output.std_dev is finite
- every layer.output.nan_count == 0 && inf_count == 0
- logits.l2_norm is finite and > 0
Skipped when GGUF or apr binary absent (fixture-absent ≠ defect, per
M32c.2.2.2.1.4 convention).
Live PASS on lambda-vector RTX 4090 in 5.38s.
Refs M32d Step 2 exit criterion (M34 FAST PATH plan)
Refs the JSON schema in this PR's main commit
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 2, 2026
…e_theta + chat template Squashes 4 substantive M32d FAST PATH fixes (Step 5 + 5b + 6 + 7) + regression test + evidence into a single commit on top of fresh main. Replaces the original messy stacked-PR chain that conflicted on rebase after sibling PRs (#1401, #1405) landed. Live verification on lambda-vector RTX 4090 (post-rebuild): $ apr run <Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf> \ --prompt "What is 2+2?" --max-tokens 8 Output: 2 + 2 = 4 Completed in 40.24s (cached) Step 5 — per-head Q/K RMSNorm in forward_qwen3_moe (rank-3 prior, 15%) ==================================================================== Qwen3 GH-279 per-head Q/K RMSNorm was wired into the dense path (adaptive_ffn.rs:174-179) but missing from forward_qwen3_moe.rs. Now applied AFTER bias, BEFORE RoPE — same code as adaptive_ffn. Pre-fix: layer std-dev grew 40× over 48 layers (signature of attention scores compounding without per-head Q/K norm). Output `%%%%%%%%`. Step 5b — rope_theta default 10K → 1M for qwen3_moe (rank-4 prior, 10%) ======================================================================= GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT `qwen3moe.rope.freq_base` metadata. config.rs's default lookup had `"qwen2" | "qwen3" => 1_000_000.0` but no qwen3_moe entry — fell to catch-all 10K. Off by 100×. Added qwen3_moe to the 1M arm. Step 6 — chat template (qwen3_moe → ChatML, no <think>) ======================================================== `detect_format_from_name` routed any "qwen3" name to Qwen3NoThink (PMAT-181), which pre-injects empty `<think>\n</think>\n` into the assistant turn. Qwen3-Coder does NOT have thinking mode (verified via the Jinja `tokenizer.chat_template` in the GGUF) — empty think block caused the model to emit `<|endoftext|>` immediately. Route qwen3_moe to plain ChatML before the generic qwen3 → NoThink rule. PMAT-181 preserved for thinking-mode dense Qwen3. Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K norm ============================================================ forward_qwen3_moe_traced (created in PR #1222 on main) was authored mirroring the OLD pre-Q/K-norm forward_qwen3_moe. Without sync, `apr trace --payload` shows DIFFERENT numerics from `apr run` — silent diagnostic-vs-production drift. Mirror the same Q/K norm into the traced variant. Component priors discharge status (M34 FAST PATH) | Rank | Component | Prior | Status | |------|-----------|-------|------------| | 1 | LAYOUT | 30% | not at issue | | 2 | Q4_K_M | 20% | not at issue | | 3 | Q/K norm | 15% | FIXED (this commit) | | 4 | RoPE θ | 10% | FIXED (this commit) | | 5 | router sm | 10% | not at issue | | 6 | token emb | 10% | not at issue | | 7 | other | 5% | n/a | | n/a | chat tpl | n/a | FIXED (this commit) | Output transition pre-fix → "%%%%%%%%" (gibberish) + Step 5 → "Human: What is 2+" (coherent English, partial) + Step 5b → "Human: What is 2+2?" (full prompt reproduced) + Step 6 → "2 + 2 = 4" (correct answer) + Step 7 → diagnostic trace matches production Multi-domain verification (also passes): "Capital of France:" → "The capital of France is Paris." "Translate to Spanish: Hello world" → "¡Hola mundo!" "Count to 5:" → "1, 2, 3, 4, 5" "Solve x^2 - 5x + 6 = 0:" → "I need to solve the quadratic equation x² - 5x + 6 = 0..." Hot-path safety - Production text-generation path (`apr run` → run_qwen3_moe_generate → forward_qwen3_moe) now applies the norm. - `apr trace --payload` (forward_qwen3_moe_traced) syncs the same fix. - Sibling tests pass unchanged. - `forward_qwen3_moe_traced` reads `self.config.rope_theta` which is set at model load from the default lookup — Step 5b auto-applies via config. - Dense Qwen3 path UNCHANGED (Qwen3NoThink preserved for thinking-mode variants). Regression test `crates/aprender-serve/tests/qwen3_moe_qk_norm_regression.rs` F-QW3-MOE-STEP5-001 asserts the context-awareness invariant: two distinct prompts must produce distinct argmax tokens, top-2 logit gap < 50. Live PASS on lambda-vector RTX 4090 in 6.60s. Stack research - HuggingFace transformers Qwen3MoeForCausalLM applies per-head q_norm/k_norm in Qwen3MoeAttention.forward - llama.cpp ggml_qwen3_moe_kv_norm in llama-arch.cpp does the same (attn_q_norm.weight / attn_k_norm.weight) - HF Qwen3MoeConfig.rope_theta default = 1_000_000.0 - Qwen3-Coder Jinja chat_template generation prompt is plain `<|im_start|>assistant\n` (no thinking) Refs M32d FAST PATH plan (M34, paiml/claude-code-parity-apr) Refs GH-279 (Qwen3 per-head Q/K RMSNorm) Refs PMAT-181 (Qwen3NoThink preserved for thinking variants) Refs FALSIFY-QW3-MOE-FORWARD-003 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 2, 2026
…e_theta + chat template (#1228) Squashes 4 substantive M32d FAST PATH fixes (Step 5 + 5b + 6 + 7) + regression test + evidence into a single commit on top of fresh main. Replaces the original messy stacked-PR chain that conflicted on rebase after sibling PRs (#1401, #1405) landed. Live verification on lambda-vector RTX 4090 (post-rebuild): $ apr run <Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf> \ --prompt "What is 2+2?" --max-tokens 8 Output: 2 + 2 = 4 Completed in 40.24s (cached) Step 5 — per-head Q/K RMSNorm in forward_qwen3_moe (rank-3 prior, 15%) ==================================================================== Qwen3 GH-279 per-head Q/K RMSNorm was wired into the dense path (adaptive_ffn.rs:174-179) but missing from forward_qwen3_moe.rs. Now applied AFTER bias, BEFORE RoPE — same code as adaptive_ffn. Pre-fix: layer std-dev grew 40× over 48 layers (signature of attention scores compounding without per-head Q/K norm). Output `%%%%%%%%`. Step 5b — rope_theta default 10K → 1M for qwen3_moe (rank-4 prior, 10%) ======================================================================= GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT `qwen3moe.rope.freq_base` metadata. config.rs's default lookup had `"qwen2" | "qwen3" => 1_000_000.0` but no qwen3_moe entry — fell to catch-all 10K. Off by 100×. Added qwen3_moe to the 1M arm. Step 6 — chat template (qwen3_moe → ChatML, no <think>) ======================================================== `detect_format_from_name` routed any "qwen3" name to Qwen3NoThink (PMAT-181), which pre-injects empty `<think>\n</think>\n` into the assistant turn. Qwen3-Coder does NOT have thinking mode (verified via the Jinja `tokenizer.chat_template` in the GGUF) — empty think block caused the model to emit `<|endoftext|>` immediately. Route qwen3_moe to plain ChatML before the generic qwen3 → NoThink rule. PMAT-181 preserved for thinking-mode dense Qwen3. Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K norm ============================================================ forward_qwen3_moe_traced (created in PR #1222 on main) was authored mirroring the OLD pre-Q/K-norm forward_qwen3_moe. Without sync, `apr trace --payload` shows DIFFERENT numerics from `apr run` — silent diagnostic-vs-production drift. Mirror the same Q/K norm into the traced variant. Component priors discharge status (M34 FAST PATH) | Rank | Component | Prior | Status | |------|-----------|-------|------------| | 1 | LAYOUT | 30% | not at issue | | 2 | Q4_K_M | 20% | not at issue | | 3 | Q/K norm | 15% | FIXED (this commit) | | 4 | RoPE θ | 10% | FIXED (this commit) | | 5 | router sm | 10% | not at issue | | 6 | token emb | 10% | not at issue | | 7 | other | 5% | n/a | | n/a | chat tpl | n/a | FIXED (this commit) | Output transition pre-fix → "%%%%%%%%" (gibberish) + Step 5 → "Human: What is 2+" (coherent English, partial) + Step 5b → "Human: What is 2+2?" (full prompt reproduced) + Step 6 → "2 + 2 = 4" (correct answer) + Step 7 → diagnostic trace matches production Multi-domain verification (also passes): "Capital of France:" → "The capital of France is Paris." "Translate to Spanish: Hello world" → "¡Hola mundo!" "Count to 5:" → "1, 2, 3, 4, 5" "Solve x^2 - 5x + 6 = 0:" → "I need to solve the quadratic equation x² - 5x + 6 = 0..." Hot-path safety - Production text-generation path (`apr run` → run_qwen3_moe_generate → forward_qwen3_moe) now applies the norm. - `apr trace --payload` (forward_qwen3_moe_traced) syncs the same fix. - Sibling tests pass unchanged. - `forward_qwen3_moe_traced` reads `self.config.rope_theta` which is set at model load from the default lookup — Step 5b auto-applies via config. - Dense Qwen3 path UNCHANGED (Qwen3NoThink preserved for thinking-mode variants). Regression test `crates/aprender-serve/tests/qwen3_moe_qk_norm_regression.rs` F-QW3-MOE-STEP5-001 asserts the context-awareness invariant: two distinct prompts must produce distinct argmax tokens, top-2 logit gap < 50. Live PASS on lambda-vector RTX 4090 in 6.60s. Stack research - HuggingFace transformers Qwen3MoeForCausalLM applies per-head q_norm/k_norm in Qwen3MoeAttention.forward - llama.cpp ggml_qwen3_moe_kv_norm in llama-arch.cpp does the same (attn_q_norm.weight / attn_k_norm.weight) - HF Qwen3MoeConfig.rope_theta default = 1_000_000.0 - Qwen3-Coder Jinja chat_template generation prompt is plain `<|im_start|>assistant\n` (no thinking) Refs M32d FAST PATH plan (M34, paiml/claude-code-parity-apr) Refs GH-279 (Qwen3 per-head Q/K RMSNorm) Refs PMAT-181 (Qwen3NoThink preserved for thinking variants) Refs FALSIFY-QW3-MOE-FORWARD-003 Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 2, 2026
…d discharge audit-trail bump (#1078) Source-of-truth bytes pushed by the companion repo. M22 paired-mirror guard via pin.lock (sha256 byte-identity, will be refreshed in companion PR). Net change: bumps top-level contract YAML from v1.22.0 to v1.23.0 with one new status_history entry (M35) recording M32d's functional discharge on aprender main as commit 5235aae (#1228 squash). What M35 records ================ M32d numerical-parity bundle landed across multiple aprender PRs: #1222 (Step 2) forward_qwen3_moe_traced diagnostic surface #1226 (Step 2.5) `apr trace --payload` qwen3_moe dispatch (squashed into #1222) #1242 RUSTSEC-2026-0114 audit unblocker #1401 (Step 2 JSON) `apr trace --json --payload` JSON output (FAST PATH Step 2 exit-criterion shape) #1228 (THE BUNDLE) Step 5 + 5b + 6 + 7 + regression test + evidence — squashed into one commit on main: - per-head Q/K RMSNorm in forward_qwen3_moe (rank-3 prior, 15%) - rope_theta 10K → 1M for qwen3_moe (rank-4 prior, 10%) - chat template: qwen3_moe → ChatML (no `<think>` injection) - sync forward_qwen3_moe_traced with Step 5 - F-QW3-MOE-STEP5-001 regression test - evidence/m32d-discharge-2026-05-01/ Live evidence on lambda-vector RTX 4090 against the 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf: $ apr run --prompt "What is 2+2?" --max-tokens 8 Output: 2 + 2 = 4 $ apr run --prompt "Capital of France:" --max-tokens 30 Output: The capital of France is Paris. $ apr run --prompt "Translate to Spanish: Hello world" --max-tokens 30 Output: ¡Hola mundo! $ apr run --prompt "Solve x^2 - 5x + 6 = 0:" --max-tokens 30 Output: I need to solve the quadratic equation x² - 5x + 6 = 0. I can solve this by factoring. Output transition timeline: pre-fix "%%%%%%%%" + Step 5 "Human: What is 2+" + Step 5b "Human: What is 2+2?" + Step 6 "2 + 2 = 4" M34 FAST PATH actual cost: 5 PRs / ~6 hours wall — **lucky-case bound** of the 4-6 PRs / 2-3 days estimate. What M35 does NOT discharge ============================ - Cosine vs HF FP16 measurement (operator-confirm — ~60 GB download). The formal flip of `qwen3-moe-forward-v1` v1.3.0 DRAFT → v1.4.0 ACTIVE_RUNTIME waits on that measurement. - GPU MoE path (no forward_qwen3_moe_gpu; CUDA/wgpu kernels TBD). - Other Qwen3-MoE variants. Refs aprender commit 5235aae (#1228) Refs companion M34 (v1.21.0 → v1.22.0 plan) Refs PMAT-CCPA-PARITY-001 Refs M22 paired-mirror invariant Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 2, 2026
…CHARGE (#1409) Status flips DRAFT → ACTIVE_ALGORITHM_LEVEL. M32d numerical parity is functionally discharged on aprender main as of PR #1228 squash 5235aae (2026-05-02 13:42 UTC). Output transition on lambda-vector RTX 4090 against the cached 17.3 GB Qwen3-Coder-30B-A3B- Instruct-Q4_K_M.gguf: pre-fix "%%%%%%%%" (gibberish, repeated argmax) + Step 5 "Human: What is 2+" (coherent English, partial) + Step 5b "Human: What is 2+2?" (full prompt reproduced) + Step 6 "2 + 2 = 4" (correct answer) Multi-domain dogfood (math/geography/translation/code) all correct. Why ACTIVE_ALGORITHM_LEVEL not ACTIVE_RUNTIME ============================================== Per the v1.3.0 (M32d.0) parity-strategy decision, full ACTIVE_RUNTIME discharge requires: 1. F-QW3-MOE-PARITY-001: cosine ≥ 0.99 vs HF FP16 reference logits 2. F-QW3-MOE-PARITY-002: argmax matches llama.cpp top-1 #1 requires running scripts/generate_qwen3_moe_fp16_logits.py which is operator-confirm pending (~60 GB HF download + ~30 min on 30B-A3B multi-device offload). ACTIVE_ALGORITHM_LEVEL is the right intermediate state: forward path is functionally correct (verified by output quality across diverse prompts), but the formal cosine-vs-HF gate hasn't fired yet. Component priors verified empirically (M34 FAST PATH plan) ========================================================== rank-3 Q/K norm (15%) FIXED #1228 Step 5 rank-4 RoPE θ (10%) FIXED #1228 Step 5b outside-priors FIXED #1228 Step 6 (chat template wrapping) The diagnostic surface from PRs #1222 (Step 2) + #1226 (Step 2.5) + #1401 (Step 2 JSON wire) named rank-3 directly via the 40× std-growth signature without needing the HF FP16 fixture. Step 1 of the original plan was bypassed. M34 FAST PATH cost ================== Outcome PRs Wall-clock ACTUAL 5 ~6 hours Lucky estimate 4-6 2-3 days Realistic 8-10 4-6 days Pessimistic 12-15 1-2 weeks Came in at the lucky-case bound. Refs aprender PR #1228 commit 5235aae Refs companion `paiml/claude-code-parity-apr` M35 status_history Refs `project_m32d_discharge_2026_05_02.md` (memory) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
apr trace --json --payload <gguf>was silently ignoring--jsonwhen--payloadwas set. The FAST PATH Step 2 exit criterion inpaiml/claude-code-parity-aprexplicitly calls for JSON output. Now mechanically satisfied.Schema
{ "format": "GGUF (qwen3moe)", "architecture": "qwen3moe", "num_layers": 48, "hidden_dim": 2048, "vocab_size": 151936, "prompt": "What is 2+2?", "encoded_tokens": [...], "embedding": { "min": ..., "max": ..., "mean": ..., "std_dev": ..., "count": 2048 }, "layers": [ { "layer_idx": 0, "attn_norm": {...}, "qkv": {...}, "attn_out": {...}, "ffn_norm": {...}, "ffn_out": {...}, "output": {...} }, /* 47 more */ ], "final_norm": {...}, "logits": { "vocab_size": 151936, "l2_norm": 1025.75, "top_k": [...] } }Live verification
Implementation notes
handle_special_modes_with_json— oldhandle_special_modespreserved as thin wrapper (existing test callers unbroken).run_traced_inference_json— mirrors text-mode trace but emits JSON viaserde_json::to_string_pretty.Model: .../Contract: ...preamble that would break| jqconsumers.Hot-path safety
apr trace --payload(no --json): text mode UNCHANGEDapr trace --json(no --payload): static-layer JSON UNCHANGED--json && --payloadsethandle_special_modescontinue to workWhat this PR does NOT ship
Test plan
cargo check -p apr-cli --features inference— cleancargo clippy -p apr-cli --features inference --lib -- -D warnings— cleancargo fmt -p apr-cli --check— cleanapr trace --json --payload <Qwen3-Coder GGUF>produces valid JSON, parses with python json.load + jqRefs
🤖 Generated with Claude Code