feat(apr-cli): SHIP-007 PR-A — `apr trace --save-tensor` clap surface by noahgift · Pull Request #1405 · paiml/aprender

noahgift · 2026-05-02T10:41:21Z

Summary

Add --save-tensor, --save-tensor-dir, --save-tensor-layers to Commands::Trace
Bind the CLI signature defined in contracts/apr-cli-trace-save-tensor-v1.yaml at the binary boundary
Stub eprintln in dispatch site; PR-B will plumb to forward_traced

This is PR-A of a 4-PR cascade authorized by the operator on 2026-05-02:

Path forward to ship MODEL-1 is SHIP-007 layer-0 stage diff (extend apr trace --save-tensor, multi-PR work, do both [paths] in parallel)

PR	Scope
A (this PR)	clap surface + dispatch stub
B	wire flags through to `forward_traced` in aprender-serve
C	integration test on real APR teacher
D	`apr diff --stage` extension consuming saved tensors

The helpers (save_tensor_stage.rs, save_tensor_paths.rs, save_tensor_compose.rs) shipped in PRs #1133–#1137 already exist in crates/aprender-serve/src/inference_trace/. PR-A only adds the CLI signature so the contract is bound and apr trace --help renders correctly.

Five Whys

Why ship clap surface alone? Splits contract-binding from forward-pass wiring; reviewer can verify each layer in isolation, and the binary already prints --help exactly as the contract specifies.
Why not stub PR-A and PR-B together? forward_traced lives in aprender-serve and threads layer-id through three call sites; bundling makes the diff harder to bisect when MODEL-1 cosine drift later regresses.
Why 0..1 default? SHIP-007 root cause is pinned at layer-0 qkv matmul (memory: project_2026_04_27_session2_handoff.md). Default keeps disk usage bounded; users opt into wider ranges explicitly.
Why eprintln stub instead of returning error? Contract gate is "CLI accepts the flag" not "feature works"; error would mask the surface from existing test harnesses. Stub message names PR-B as the next step.
Why now in the SHIP-TWO loop? Path forward to ship MODEL-1 is layer-0 stage diff (per operator directive 2026-05-02); PR-A is step 1/4.

Ship % update

MODEL-1: ~64% (unchanged — PR-A is surface only; element-wise diff pending PR-B/C/D)
MODEL-2: blocked on Stack v2 corpus access (operator action — visit https://huggingface.co/datasets/bigcode/starcoderdata, accept terms)

Test plan

cargo check -p apr-cli clean
cargo check -p apr-cli --tests clean
cargo test -p apr-cli --lib test_parse_trace_command pass
cargo test -p apr-cli --lib test_dispatch_diagnostic_routes_trace pass
apr trace --help renders all 3 new flags with full descriptions
CI green (gate + workspace-test)

🤖 Generated with Claude Code

Add three CLI flags to `Commands::Trace` so the contract signature in `contracts/apr-cli-trace-save-tensor-v1.yaml` is bound at the binary boundary: --save-tensor <STAGES> comma-separated, or `all` --save-tensor-dir <DIR> override output directory --save-tensor-layers <RANGE> `START..END` (default `0..1`) Dispatch site emits a stub eprintln when `--save-tensor` is set; the forward_traced plumbing lands in PR-B. The helpers in `crates/aprender-serve/src/inference_trace/save_tensor*.rs` already exist and will be wired through next. ## Five Whys 1. Why ship clap surface alone? — Splits the contract-binding from the forward-pass wiring; reviewer can verify each layer in isolation, and the binary already prints `--help` exactly as the contract specifies. 2. Why not stub PR-A and PR-B together? — `forward_traced` lives in `aprender-serve` and threads layer-id through three call sites; bundling those changes with the apr-cli signature change makes the diff harder to bisect when MODEL-1 cosine drift later regresses. 3. Why `0..1` default? — SHIP-007 root cause is pinned at layer-0 qkv matmul (memory: `project_2026_04_27_session2_handoff.md`). Default keeps disk usage bounded; users opt into wider ranges explicitly. 4. Why eprintln stub instead of returning error? — Contract gate is "CLI accepts the flag" not "feature works"; an error would mask the surface from existing test harnesses. Stub message names PR-B as the next step so it isn't mistaken for a complete feature. 5. Why now in the SHIP-TWO loop? — Path forward to ship MODEL-1 is layer-0 stage diff (per operator directive 2026-05-02); PR-A is step 1/4 of the user-authorized parallel track alongside Stack v2. ## Ship % update MODEL-1: ~64% (unchanged — PR-A is surface only; element-wise diff pending PR-B/PR-C/PR-D). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…tion (#1407) Add 6 integration tests in `tests/save_tensor_plan_integration.rs` that drive `SaveTensorPlan` (PR-B, #1406) end-to-end through the public `save_tensor::*` writer + `save_tensor_paths::*` directory helpers. These pin the plan ↔ writer contract from the OUTSIDE so the upcoming `forward_traced` wiring (PR-C-real) has a stable target. The tests mirror the future dispatch flow: build plan from CLI strings → iterate forward-pass-style `(stage, layer, data)` tuples → assert files appear exactly where the plan predicted, and only for selected stages. ## Tests (6/6 pass) | Test | What it pins | |------|-------------| | `plan_three_stages_layer_zero_writes_three_files` | multi-stage selection + skip unselected | | `plan_layer_range_filter_excludes_out_of_range` | END-exclusive range honoured end-to-end | | `plan_whole_model_stage_writes_to_root_not_layer_dir` | final_norm/lm_head bypass layer-N segment | | `plan_unselected_stage_produces_no_file` | should_save==false → zero I/O | | `plan_byte_determinism_across_two_runs` | FALSIFY-APR-TRACE-SAVE-002 at integration boundary | | `plan_all_keyword_writes_18_per_layer_files_for_one_layer` | `all` keyword → 18 stages routed correctly | ## Five Whys 1. **Why integration tests now, not in PR-B?** PR-B unit tests verify `should_save`/`stage_path` in isolation. These integration tests verify the plan ACTUALLY drives the writer correctly when used the way `forward_traced` will use it. 2. **Why ship before PR-C-real?** Locks down the plan ↔ writer contract so PR-C-real (forward_traced surgery) only has to plumb the call sites; if a future PR breaks the contract, these tests fail BEFORE the model loads. 3. **Why a separate file from `save_tensor_integration.rs`?** That file exercises the writer in isolation; this file exercises the plan-driven flow. Mixing would obscure which level of the contract regressed when a test fails. 4. **Why include the `all` keyword test?** SHIP-007 layer-0 stage diff may run `--save-tensor all --save-tensor-layers 0..1` for an exhaustive bisection; routing 18 stages through the plan is the single most stressed path and deserves its own pin. 5. **Why now in the SHIP-TWO loop?** Ship-cadence move that closes the contract surface PR-C-real depends on, while PR-A (#1405) and PR-B (#1406) CI is still in flight. Stack-friendly: branched off PR-B, merges cleanly when both land. ## Stack relationship This PR depends on PR-B (#1406) — SaveTensorPlan must be in the tree. Branched off `feat/ship-007-pr-b-save-tensor-plan`. GitHub will set PR-B as the base; once PR-B merges to main, this PR auto-rebases. ## Ship % update MODEL-1: ~64% (unchanged — integration tests are preparatory; the forward_traced surgery in PR-C-real and the `apr trace --save-tensor` dispatch wiring in PR-D are still pending). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…e_theta + chat template Squashes 4 substantive M32d FAST PATH fixes (Step 5 + 5b + 6 + 7) + regression test + evidence into a single commit on top of fresh main. Replaces the original messy stacked-PR chain that conflicted on rebase after sibling PRs (#1401, #1405) landed. Live verification on lambda-vector RTX 4090 (post-rebuild): $ apr run <Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf> \ --prompt "What is 2+2?" --max-tokens 8 Output: 2 + 2 = 4 Completed in 40.24s (cached) Step 5 — per-head Q/K RMSNorm in forward_qwen3_moe (rank-3 prior, 15%) ==================================================================== Qwen3 GH-279 per-head Q/K RMSNorm was wired into the dense path (adaptive_ffn.rs:174-179) but missing from forward_qwen3_moe.rs. Now applied AFTER bias, BEFORE RoPE — same code as adaptive_ffn. Pre-fix: layer std-dev grew 40× over 48 layers (signature of attention scores compounding without per-head Q/K norm). Output `%%%%%%%%`. Step 5b — rope_theta default 10K → 1M for qwen3_moe (rank-4 prior, 10%) ======================================================================= GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT `qwen3moe.rope.freq_base` metadata. config.rs's default lookup had `"qwen2" | "qwen3" => 1_000_000.0` but no qwen3_moe entry — fell to catch-all 10K. Off by 100×. Added qwen3_moe to the 1M arm. Step 6 — chat template (qwen3_moe → ChatML, no <think>) ======================================================== `detect_format_from_name` routed any "qwen3" name to Qwen3NoThink (PMAT-181), which pre-injects empty `<think>\n</think>\n` into the assistant turn. Qwen3-Coder does NOT have thinking mode (verified via the Jinja `tokenizer.chat_template` in the GGUF) — empty think block caused the model to emit `<|endoftext|>` immediately. Route qwen3_moe to plain ChatML before the generic qwen3 → NoThink rule. PMAT-181 preserved for thinking-mode dense Qwen3. Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K norm ============================================================ forward_qwen3_moe_traced (created in PR #1222 on main) was authored mirroring the OLD pre-Q/K-norm forward_qwen3_moe. Without sync, `apr trace --payload` shows DIFFERENT numerics from `apr run` — silent diagnostic-vs-production drift. Mirror the same Q/K norm into the traced variant. Component priors discharge status (M34 FAST PATH) | Rank | Component | Prior | Status | |------|-----------|-------|------------| | 1 | LAYOUT | 30% | not at issue | | 2 | Q4_K_M | 20% | not at issue | | 3 | Q/K norm | 15% | FIXED (this commit) | | 4 | RoPE θ | 10% | FIXED (this commit) | | 5 | router sm | 10% | not at issue | | 6 | token emb | 10% | not at issue | | 7 | other | 5% | n/a | | n/a | chat tpl | n/a | FIXED (this commit) | Output transition pre-fix → "%%%%%%%%" (gibberish) + Step 5 → "Human: What is 2+" (coherent English, partial) + Step 5b → "Human: What is 2+2?" (full prompt reproduced) + Step 6 → "2 + 2 = 4" (correct answer) + Step 7 → diagnostic trace matches production Multi-domain verification (also passes): "Capital of France:" → "The capital of France is Paris." "Translate to Spanish: Hello world" → "¡Hola mundo!" "Count to 5:" → "1, 2, 3, 4, 5" "Solve x^2 - 5x + 6 = 0:" → "I need to solve the quadratic equation x² - 5x + 6 = 0..." Hot-path safety - Production text-generation path (`apr run` → run_qwen3_moe_generate → forward_qwen3_moe) now applies the norm. - `apr trace --payload` (forward_qwen3_moe_traced) syncs the same fix. - Sibling tests pass unchanged. - `forward_qwen3_moe_traced` reads `self.config.rope_theta` which is set at model load from the default lookup — Step 5b auto-applies via config. - Dense Qwen3 path UNCHANGED (Qwen3NoThink preserved for thinking-mode variants). Regression test `crates/aprender-serve/tests/qwen3_moe_qk_norm_regression.rs` F-QW3-MOE-STEP5-001 asserts the context-awareness invariant: two distinct prompts must produce distinct argmax tokens, top-2 logit gap < 50. Live PASS on lambda-vector RTX 4090 in 6.60s. Stack research - HuggingFace transformers Qwen3MoeForCausalLM applies per-head q_norm/k_norm in Qwen3MoeAttention.forward - llama.cpp ggml_qwen3_moe_kv_norm in llama-arch.cpp does the same (attn_q_norm.weight / attn_k_norm.weight) - HF Qwen3MoeConfig.rope_theta default = 1_000_000.0 - Qwen3-Coder Jinja chat_template generation prompt is plain `<|im_start|>assistant\n` (no thinking) Refs M32d FAST PATH plan (M34, paiml/claude-code-parity-apr) Refs GH-279 (Qwen3 per-head Q/K RMSNorm) Refs PMAT-181 (Qwen3NoThink preserved for thinking variants) Refs FALSIFY-QW3-MOE-FORWARD-003 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…1406) * feat(aprender-serve): SHIP-007 PR-B — SaveTensorPlan plan-builder Add `inference_trace::save_tensor_plan` (`SaveTensorPlan` + `PlanParseError`) that converts the three `apr trace --save-tensor*` argument strings into a validated, ready-to-execute plan: - `--save-tensor <STAGES>` comma-list, or the literal `all` - `--save-tensor-dir <DIR>` output root - `--save-tensor-layers <RANGE>` Rust `START..END`, END exclusive The plan exposes `should_save(stage, layer)` and `stage_path(stage, layer)` queries. Per-layer stages obey the layer range; whole-model stages (`final_norm`, `lm_head`) ignore the range and are saved iff selected. Pure module — no I/O, no transformer state, no `apr-cli` coupling. PR-C threads the plan into `AprTransformer::forward_traced` as a side-channel. 28 unit tests covering: provenance pin, pass band (whitespace, `all`, `ALL`/`aLL`, wide ranges), fail band (unknown stage, empty token, malformed range, negative start, END<=START, garbage END), should_save semantics (in-range, out-of-range, unselected, whole-model bypass, default `0..1`), stage_path semantics (per-layer 0/3, whole-model with sentinel), edge cases (`layer_output` alias, duplicate preservation, min range, high layer index, Clone+Eq). ## Five Whys 1. Why split plan-builder from forward-pass wiring? — Argument validation should fail BEFORE multi-second model load; the only way to keep the error surface tight is to parse eagerly. Splitting also lets PR-C touch `forward_traced` internals without touching argument parsing. 2. Why pure (no I/O)? — `ensure_layer_dir`/file open are stateful side effects best owned by the writer (PR-C). Tests stay fast and deterministic; `cargo test --lib` runs the 28 cases in 0.00s. 3. Why `0..1` default? — Matches the clap default in PR-A (#1405). SHIP-007 root cause is pinned at layer-0 qkv matmul; default keeps disk-write blast radius bounded; users opt in to wider ranges. 4. Why `all` keyword? — Contract §invariants line 63 specifies "comma-delimited; multiple stages MAY be saved in one run". `all` is the obvious shorthand for the 18-stage exhaustive bisection the SHIP-007 hypothesis chain (§24-§32) is heading toward. 5. Why now in the SHIP-TWO loop? — Operator directive 2026-05-02: "Path forward to ship MODEL-1 is SHIP-007 layer-0 stage diff (extend apr trace --save-tensor, multi-PR work)". PR-B is step 2/4. ## Independence from PR-A PR-B does NOT modify `apr-cli`. The branch is off `origin/main` (which does not yet contain PR-A's clap fields). PR-A and PR-B can land in either order; PR-C will rebase off whichever is already in main and introduce the dispatch wiring. ## Ship % update MODEL-1: ~64% (unchanged — plan-builder is preparatory; element-wise diff still pending PR-C/D forward_traced wiring). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(aprender-serve): SHIP-007 PR-C-prep — SaveTensorPlan I/O integration (#1407) Add 6 integration tests in `tests/save_tensor_plan_integration.rs` that drive `SaveTensorPlan` (PR-B, #1406) end-to-end through the public `save_tensor::*` writer + `save_tensor_paths::*` directory helpers. These pin the plan ↔ writer contract from the OUTSIDE so the upcoming `forward_traced` wiring (PR-C-real) has a stable target. The tests mirror the future dispatch flow: build plan from CLI strings → iterate forward-pass-style `(stage, layer, data)` tuples → assert files appear exactly where the plan predicted, and only for selected stages. ## Tests (6/6 pass) | Test | What it pins | |------|-------------| | `plan_three_stages_layer_zero_writes_three_files` | multi-stage selection + skip unselected | | `plan_layer_range_filter_excludes_out_of_range` | END-exclusive range honoured end-to-end | | `plan_whole_model_stage_writes_to_root_not_layer_dir` | final_norm/lm_head bypass layer-N segment | | `plan_unselected_stage_produces_no_file` | should_save==false → zero I/O | | `plan_byte_determinism_across_two_runs` | FALSIFY-APR-TRACE-SAVE-002 at integration boundary | | `plan_all_keyword_writes_18_per_layer_files_for_one_layer` | `all` keyword → 18 stages routed correctly | ## Five Whys 1. **Why integration tests now, not in PR-B?** PR-B unit tests verify `should_save`/`stage_path` in isolation. These integration tests verify the plan ACTUALLY drives the writer correctly when used the way `forward_traced` will use it. 2. **Why ship before PR-C-real?** Locks down the plan ↔ writer contract so PR-C-real (forward_traced surgery) only has to plumb the call sites; if a future PR breaks the contract, these tests fail BEFORE the model loads. 3. **Why a separate file from `save_tensor_integration.rs`?** That file exercises the writer in isolation; this file exercises the plan-driven flow. Mixing would obscure which level of the contract regressed when a test fails. 4. **Why include the `all` keyword test?** SHIP-007 layer-0 stage diff may run `--save-tensor all --save-tensor-layers 0..1` for an exhaustive bisection; routing 18 stages through the plan is the single most stressed path and deserves its own pin. 5. **Why now in the SHIP-TWO loop?** Ship-cadence move that closes the contract surface PR-C-real depends on, while PR-A (#1405) and PR-B (#1406) CI is still in flight. Stack-friendly: branched off PR-B, merges cleanly when both land. ## Stack relationship This PR depends on PR-B (#1406) — SaveTensorPlan must be in the tree. Branched off `feat/ship-007-pr-b-save-tensor-plan`. GitHub will set PR-B as the base; once PR-B merges to main, this PR auto-rebases. ## Ship % update MODEL-1: ~64% (unchanged — integration tests are preparatory; the forward_traced surgery in PR-C-real and the `apr trace --save-tensor` dispatch wiring in PR-D are still pending). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…e_theta + chat template (#1228) Squashes 4 substantive M32d FAST PATH fixes (Step 5 + 5b + 6 + 7) + regression test + evidence into a single commit on top of fresh main. Replaces the original messy stacked-PR chain that conflicted on rebase after sibling PRs (#1401, #1405) landed. Live verification on lambda-vector RTX 4090 (post-rebuild): $ apr run <Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf> \ --prompt "What is 2+2?" --max-tokens 8 Output: 2 + 2 = 4 Completed in 40.24s (cached) Step 5 — per-head Q/K RMSNorm in forward_qwen3_moe (rank-3 prior, 15%) ==================================================================== Qwen3 GH-279 per-head Q/K RMSNorm was wired into the dense path (adaptive_ffn.rs:174-179) but missing from forward_qwen3_moe.rs. Now applied AFTER bias, BEFORE RoPE — same code as adaptive_ffn. Pre-fix: layer std-dev grew 40× over 48 layers (signature of attention scores compounding without per-head Q/K norm). Output `%%%%%%%%`. Step 5b — rope_theta default 10K → 1M for qwen3_moe (rank-4 prior, 10%) ======================================================================= GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT `qwen3moe.rope.freq_base` metadata. config.rs's default lookup had `"qwen2" | "qwen3" => 1_000_000.0` but no qwen3_moe entry — fell to catch-all 10K. Off by 100×. Added qwen3_moe to the 1M arm. Step 6 — chat template (qwen3_moe → ChatML, no <think>) ======================================================== `detect_format_from_name` routed any "qwen3" name to Qwen3NoThink (PMAT-181), which pre-injects empty `<think>\n</think>\n` into the assistant turn. Qwen3-Coder does NOT have thinking mode (verified via the Jinja `tokenizer.chat_template` in the GGUF) — empty think block caused the model to emit `<|endoftext|>` immediately. Route qwen3_moe to plain ChatML before the generic qwen3 → NoThink rule. PMAT-181 preserved for thinking-mode dense Qwen3. Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K norm ============================================================ forward_qwen3_moe_traced (created in PR #1222 on main) was authored mirroring the OLD pre-Q/K-norm forward_qwen3_moe. Without sync, `apr trace --payload` shows DIFFERENT numerics from `apr run` — silent diagnostic-vs-production drift. Mirror the same Q/K norm into the traced variant. Component priors discharge status (M34 FAST PATH) | Rank | Component | Prior | Status | |------|-----------|-------|------------| | 1 | LAYOUT | 30% | not at issue | | 2 | Q4_K_M | 20% | not at issue | | 3 | Q/K norm | 15% | FIXED (this commit) | | 4 | RoPE θ | 10% | FIXED (this commit) | | 5 | router sm | 10% | not at issue | | 6 | token emb | 10% | not at issue | | 7 | other | 5% | n/a | | n/a | chat tpl | n/a | FIXED (this commit) | Output transition pre-fix → "%%%%%%%%" (gibberish) + Step 5 → "Human: What is 2+" (coherent English, partial) + Step 5b → "Human: What is 2+2?" (full prompt reproduced) + Step 6 → "2 + 2 = 4" (correct answer) + Step 7 → diagnostic trace matches production Multi-domain verification (also passes): "Capital of France:" → "The capital of France is Paris." "Translate to Spanish: Hello world" → "¡Hola mundo!" "Count to 5:" → "1, 2, 3, 4, 5" "Solve x^2 - 5x + 6 = 0:" → "I need to solve the quadratic equation x² - 5x + 6 = 0..." Hot-path safety - Production text-generation path (`apr run` → run_qwen3_moe_generate → forward_qwen3_moe) now applies the norm. - `apr trace --payload` (forward_qwen3_moe_traced) syncs the same fix. - Sibling tests pass unchanged. - `forward_qwen3_moe_traced` reads `self.config.rope_theta` which is set at model load from the default lookup — Step 5b auto-applies via config. - Dense Qwen3 path UNCHANGED (Qwen3NoThink preserved for thinking-mode variants). Regression test `crates/aprender-serve/tests/qwen3_moe_qk_norm_regression.rs` F-QW3-MOE-STEP5-001 asserts the context-awareness invariant: two distinct prompts must produce distinct argmax tokens, top-2 logit gap < 50. Live PASS on lambda-vector RTX 4090 in 6.60s. Stack research - HuggingFace transformers Qwen3MoeForCausalLM applies per-head q_norm/k_norm in Qwen3MoeAttention.forward - llama.cpp ggml_qwen3_moe_kv_norm in llama-arch.cpp does the same (attn_q_norm.weight / attn_k_norm.weight) - HF Qwen3MoeConfig.rope_theta default = 1_000_000.0 - Qwen3-Coder Jinja chat_template generation prompt is plain `<|im_start|>assistant\n` (no thinking) Refs M32d FAST PATH plan (M34, paiml/claude-code-parity-apr) Refs GH-279 (Qwen3 per-head Q/K RMSNorm) Refs PMAT-181 (Qwen3NoThink preserved for thinking variants) Refs FALSIFY-QW3-MOE-FORWARD-003 Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…h_save_tensor` wrapper (#1408) * feat(aprender-serve): SHIP-007 PR-B — SaveTensorPlan plan-builder Add `inference_trace::save_tensor_plan` (`SaveTensorPlan` + `PlanParseError`) that converts the three `apr trace --save-tensor*` argument strings into a validated, ready-to-execute plan: - `--save-tensor <STAGES>` comma-list, or the literal `all` - `--save-tensor-dir <DIR>` output root - `--save-tensor-layers <RANGE>` Rust `START..END`, END exclusive The plan exposes `should_save(stage, layer)` and `stage_path(stage, layer)` queries. Per-layer stages obey the layer range; whole-model stages (`final_norm`, `lm_head`) ignore the range and are saved iff selected. Pure module — no I/O, no transformer state, no `apr-cli` coupling. PR-C threads the plan into `AprTransformer::forward_traced` as a side-channel. 28 unit tests covering: provenance pin, pass band (whitespace, `all`, `ALL`/`aLL`, wide ranges), fail band (unknown stage, empty token, malformed range, negative start, END<=START, garbage END), should_save semantics (in-range, out-of-range, unselected, whole-model bypass, default `0..1`), stage_path semantics (per-layer 0/3, whole-model with sentinel), edge cases (`layer_output` alias, duplicate preservation, min range, high layer index, Clone+Eq). ## Five Whys 1. Why split plan-builder from forward-pass wiring? — Argument validation should fail BEFORE multi-second model load; the only way to keep the error surface tight is to parse eagerly. Splitting also lets PR-C touch `forward_traced` internals without touching argument parsing. 2. Why pure (no I/O)? — `ensure_layer_dir`/file open are stateful side effects best owned by the writer (PR-C). Tests stay fast and deterministic; `cargo test --lib` runs the 28 cases in 0.00s. 3. Why `0..1` default? — Matches the clap default in PR-A (#1405). SHIP-007 root cause is pinned at layer-0 qkv matmul; default keeps disk-write blast radius bounded; users opt in to wider ranges. 4. Why `all` keyword? — Contract §invariants line 63 specifies "comma-delimited; multiple stages MAY be saved in one run". `all` is the obvious shorthand for the 18-stage exhaustive bisection the SHIP-007 hypothesis chain (§24-§32) is heading toward. 5. Why now in the SHIP-TWO loop? — Operator directive 2026-05-02: "Path forward to ship MODEL-1 is SHIP-007 layer-0 stage diff (extend apr trace --save-tensor, multi-PR work)". PR-B is step 2/4. ## Independence from PR-A PR-B does NOT modify `apr-cli`. The branch is off `origin/main` (which does not yet contain PR-A's clap fields). PR-A and PR-B can land in either order; PR-C will rebase off whichever is already in main and introduce the dispatch wiring. ## Ship % update MODEL-1: ~64% (unchanged — plan-builder is preparatory; element-wise diff still pending PR-C/D forward_traced wiring). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(aprender-serve): SHIP-007 PR-C-prep — SaveTensorPlan I/O integration (#1407) Add 6 integration tests in `tests/save_tensor_plan_integration.rs` that drive `SaveTensorPlan` (PR-B, #1406) end-to-end through the public `save_tensor::*` writer + `save_tensor_paths::*` directory helpers. These pin the plan ↔ writer contract from the OUTSIDE so the upcoming `forward_traced` wiring (PR-C-real) has a stable target. The tests mirror the future dispatch flow: build plan from CLI strings → iterate forward-pass-style `(stage, layer, data)` tuples → assert files appear exactly where the plan predicted, and only for selected stages. ## Tests (6/6 pass) | Test | What it pins | |------|-------------| | `plan_three_stages_layer_zero_writes_three_files` | multi-stage selection + skip unselected | | `plan_layer_range_filter_excludes_out_of_range` | END-exclusive range honoured end-to-end | | `plan_whole_model_stage_writes_to_root_not_layer_dir` | final_norm/lm_head bypass layer-N segment | | `plan_unselected_stage_produces_no_file` | should_save==false → zero I/O | | `plan_byte_determinism_across_two_runs` | FALSIFY-APR-TRACE-SAVE-002 at integration boundary | | `plan_all_keyword_writes_18_per_layer_files_for_one_layer` | `all` keyword → 18 stages routed correctly | ## Five Whys 1. **Why integration tests now, not in PR-B?** PR-B unit tests verify `should_save`/`stage_path` in isolation. These integration tests verify the plan ACTUALLY drives the writer correctly when used the way `forward_traced` will use it. 2. **Why ship before PR-C-real?** Locks down the plan ↔ writer contract so PR-C-real (forward_traced surgery) only has to plumb the call sites; if a future PR breaks the contract, these tests fail BEFORE the model loads. 3. **Why a separate file from `save_tensor_integration.rs`?** That file exercises the writer in isolation; this file exercises the plan-driven flow. Mixing would obscure which level of the contract regressed when a test fails. 4. **Why include the `all` keyword test?** SHIP-007 layer-0 stage diff may run `--save-tensor all --save-tensor-layers 0..1` for an exhaustive bisection; routing 18 stages through the plan is the single most stressed path and deserves its own pin. 5. **Why now in the SHIP-TWO loop?** Ship-cadence move that closes the contract surface PR-C-real depends on, while PR-A (#1405) and PR-B (#1406) CI is still in flight. Stack-friendly: branched off PR-B, merges cleanly when both land. ## Stack relationship This PR depends on PR-B (#1406) — SaveTensorPlan must be in the tree. Branched off `feat/ship-007-pr-b-save-tensor-plan`. GitHub will set PR-B as the base; once PR-B merges to main, this PR auto-rebases. ## Ship % update MODEL-1: ~64% (unchanged — integration tests are preparatory; the forward_traced surgery in PR-C-real and the `apr trace --save-tensor` dispatch wiring in PR-D are still pending). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): SHIP-007 PR-C-real step 1 — `forward_traced_with_save_tensor` wrapper Add public method `AprTransformer::forward_traced_with_save_tensor( tokens, plan)` that delegates to existing `forward_traced` and additionally writes the **embedding** stage to disk if the supplied [`SaveTensorPlan`] (PR-B, #1406) selects it. This is the first of several SHIP-007 forward_traced wiring steps. Step 1 covers only the embedding stage — the one stage that can be re-extracted by calling `self.embed(token_ids)` a second time, so we ship working save-tensor I/O without modifying the 360-line `forward_traced` body. Subsequent steps will thread `Option<&SaveTensorPlan>` through `forward_traced` itself so per-layer stages (qkv_matmul, ffn_gate, …) emit during the single forward pass without re-runs. ## Five Whys 1. **Why a wrapper instead of modifying forward_traced directly?** Keeps the high-risk forward-pass surgery out of step 1. The wrapper compiles in 4.5 s and connects cleanly to `forward_traced`'s existing return value; modifying the 360-line body would be a large, risky diff for a single ship-cadence iteration. 2. **Why only the embedding stage?** Embedding is the only stage that can be cheaply re-extracted post-hoc via `self.embed()` (token-table lookup, no matmuls). Other stages live inside `forward_traced` and require threading the plan through to capture. 3. **Why error-typed as `RealizarError::IoError`?** The existing project error type already has an `IoError { message }` variant; using it keeps the `Result<ForwardTrace>` signature compatible with `forward_traced` so callers don't need a `From<>` impl. 4. **Why no integration test on a real model in this PR?** Building a live `AprTransformer` requires loading a real .apr file — appropriate for PR-D (`apr trace --save-tensor` on canonical 7B teacher), not for a step-1 unit-level wrapper. The plan ↔ writer contract is already pinned by PR-C-prep (#1407). 5. **Why now in the SHIP-TWO loop?** Operator directive 2026-05-02 — "Path forward to ship MODEL-1 is SHIP-007 layer-0 stage diff (extend apr trace --save-tensor, multi-PR work)". Step 1 establishes the public method that PR-D's dispatch will call. ## Stack relationship This PR depends on PR-B (#1406) — `SaveTensorPlan` must be in the tree. Branched off `feat/ship-007-pr-b-save-tensor-plan`. GitHub will auto-rebase once PR-B merges to main. ## Ship % update MODEL-1: ~64% (unchanged — wrapper is preparatory; effective tensor capture for any layer-0 stage other than embedding still requires follow-up PRs threading the plan through forward_traced). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(aprender-serve): SHIP-007 PR-C-real-step1 — clippy doc list overindentation CI lint failed with `error: doc list item overindented` on the "## Role in the cascade" bulleted list (4 occurrences). The continuation lines inside the **This file** bullet were indented 25 spaces; clippy expects 2-space continuation indent. Reformat the bullet so the continuation text uses 2-space indent and the file body stays inside one bullet. Also update PR-B/PR-B-prep status from OPEN → MERGED (both landed this session). No code changes, doc-only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…age tensors (#1413) Closes the `apr_diff_values_compat` invariant of `apr-cli-trace-save-tensor-v1` at PARTIAL_ALGORITHM_LEVEL via a new `diff_05_aprt_stage.rs` include slot. When both inputs to `apr diff --values` start with magic bytes `APRT` (the 12-byte header written by `apr trace --save-tensor`), the dispatch now bypasses the RosettaStone whole-model walker and runs an element-wise stage-tensor diff: - max|diff| with index - RMS diff - Cosine similarity (f64-accumulated for numerical stability) - Top-K divergences sorted by |a - b| Both JSON and pretty text output are supported. Mismatched dim_product or layer fields fail-fast with a diagnostic error so callers don't silently compare incompatible stages. ## Five Whys (why now, why this scope) 1. **Why is this needed?** `apr trace --save-tensor` (PR-A #1405, PR-B #1406, PR-C-prep #1407) writes per-stage f32 tensors as `APRT`-prefixed files. Without an APRT-aware diff, layer-0 stage-by-stage element-wise bisection per `feedback_model_1_ships_gpu_only.md` is gated on external tooling — exactly the kind of muda the APR-MONO §26.8 rule forbids. 2. **Why extend `apr diff` and not write a new subcommand?** The `apr_diff_values_compat` invariant in `apr-cli-trace-save-tensor-v1` already names `apr diff --values` as the verifier. Extending the existing flag keeps the contract surface stable. 3. **Why an include!() file instead of inlining into diff.rs?** diff.rs already follows that pattern (diff_accumulator, diff_output_json_text, diff_04). Keeping APRT logic in `diff_05_aprt_stage.rs` lets it be audited / removed independently and doesn't grow the parent file. 4. **Why pin via `provenance_pin_pr_d_rev1`?** Future renames of either `is_aprt_stage_file` or the file path break the include!() chain; the pin makes that visible at test-time and forces a contract bump. 5. **Why now?** Tokenization of the 27 GB Stack v1.2 Python corpus is running in the background for MODEL-2 (PR #1412 merged). The SHIP-007 PR-C-real cascade for MODEL-1 needs PR-D infrastructure ready when step 2 (forward_traced threading) lands. PR-D is independent and can merge in parallel with #1408. ## Verification - `cargo test -p apr-cli --lib commands::diff::aprt` → 11/11 PASS - is_aprt_stage_file: detects/rejects/truncated/missing (4 tests) - compute_aprt_stage_stats: identical=zero, known max/RMS, top-K sort (3) - run_aprt_stage_diff: dim/layer mismatch errors, identical succeeds (3) - provenance_pin_pr_d_rev1 (1) - `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors ## Contract update `apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0: - New FALSIFY-APR-TRACE-SAVE-009 binding `apr_diff_values_compat` at PARTIAL_ALGORITHM_LEVEL with 4-line `algorithm_evidence` block citing this PR's unit tests. ## Ship % update MODEL-1: ~64% → ~66% (PR-D is small but discharges 1 PARTIAL invariant and clears infrastructure blocker for SHIP-007 step E). MODEL-2: corpus tokenization in progress (~33h ETA). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

) M-MOE-SUB-2 step (a) CLI completion: connects clap surface (--save-tensor / --save-tensor-layers / --save-tensor-dir, PR-A #1405) through to forward_qwen3_moe_traced_with_plan (M74, PR #1516 squash 3138d13) for Qwen3-MoE-arch GGUF models. ## What ships 1. New pub fn `run_save_tensor_gguf_moe(path, stages, dir, layers)` in `crates/apr-cli/src/commands/trace_save_tensor.rs`. Mirrors the structure of the existing `run_save_tensor_apr` for APR models, but loads via `MappedGGUFModel` / `OwnedQuantizedModel`, validates `qwen3_moe` arch (rejects dense GGUF with a clear error), reads MoE config from GGUF metadata (`expert_count`, `expert_used_count`, `expert_feed_forward_length`), loads per-layer `Qwen3MoeQuantizedLayer` descriptors, then dispatches to `forward_qwen3_moe_traced_with_plan` with the plan derived from the CLI args. 2. Dispatch wireup in `dispatch.rs::dispatch_diagnostic_commands` under the `Commands::Trace` arm. The previous code dispatched `--save-tensor` for `.apr` only and printed a stub for other extensions; now `.gguf` dispatches to the new `run_save_tensor_gguf_moe` function. Other extensions (.safetensors) still print the stub pending SHIP-007 PR-E. ## What this does NOT ship - Dense GGUF SaveTensor wireup (still falls through to stub). - M-MOE-SUB-2 step (b) GPU sibling `forward_qwen3_moe_cuda_traced.rs` — separate PR. - Live verification on lambda-vector RTX 4090 against the cached 17.3 GB Qwen3-Coder GGUF — exercised in M-MOE-SUB-3. ## Hot path safety Production `forward_qwen3_moe` / `forward_qwen3_moe_cuda` hot paths byte-unchanged (additive-purity invariant pinned in v1.1.0). Production `forward_qwen3_moe_traced` (no plan) also unchanged — the new wireup uses the M74 sibling `forward_qwen3_moe_traced_with_plan`. ## Verification $ cargo build -p apr-cli --release --features inference clean $ cargo clippy -p apr-cli --lib --release --features inference \ -- -D warnings clean $ rustfmt --check trace_save_tensor.rs dispatch.rs clean $ cargo test -p apr-cli --release --lib commands::trace_save_tensor 5 passed (existing tests preserved) ## Falsifier impact - FALSIFY-MOE-SUB-002 (byte-identity preservation): still partial — needs M-MOE-SUB-2 step (b) GPU sibling for full discharge. - M-MOE-SUB-3 live bisection: now unblocked operationally — invoking `apr trace --save-tensor moe_router,moe_ffn_out --save-tensor-layers 0..48 --save-tensor-dir <dir> <gguf>` on lambda-vector RTX 4090 will produce per-layer MoeRouter + MoeFfnOut tensor files for the cached Qwen3-Coder GGUF, ready for diff vs the GPU sibling output once step (b) ships. Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.1.0 step (a) CLI completion, M68 helper PR #1507 (squash 0f22c78), M74 forward_qwen3_moe_traced_with_plan PR #1516 (squash 3138d13) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

This was referenced May 2, 2026

feat(aprender-serve): SHIP-007 PR-B — SaveTensorPlan plan-builder #1406

Merged

test(aprender-serve): SHIP-007 PR-C-prep — SaveTensorPlan I/O integration tests #1407

Merged

Merge branch 'main' into feat/ship-007-pr-a-trace-save-tensor-clap

ac69d5a

noahgift enabled auto-merge (squash) May 2, 2026 12:17

noahgift merged commit c22cdc3 into main May 2, 2026
10 checks passed

noahgift deleted the feat/ship-007-pr-a-trace-save-tensor-clap branch May 2, 2026 12:35

This was referenced May 3, 2026

contract(apr-cli-trace-save-tensor-v1): v1.2.0 → v1.3.0 — FALSIFY-011 records CLI dispatch wire-up PARTIAL discharge #1418

Merged

feat(apr-cli): wire --save-tensor through GGUF MoE trace dispatch — M-MOE-SUB-2 step (a) CLI #1521

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(apr-cli): SHIP-007 PR-A — `apr trace --save-tensor` clap surface#1405

feat(apr-cli): SHIP-007 PR-A — `apr trace --save-tensor` clap surface#1405
noahgift merged 2 commits into
mainfrom
feat/ship-007-pr-a-trace-save-tensor-clap

noahgift commented May 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 2, 2026

Summary

Five Whys

Ship % update

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant