feat(apr-cli): SHIP-007 PR-A — apr trace --save-tensor clap surface#1405
Merged
Conversation
Add three CLI flags to `Commands::Trace` so the contract signature in `contracts/apr-cli-trace-save-tensor-v1.yaml` is bound at the binary boundary: --save-tensor <STAGES> comma-separated, or `all` --save-tensor-dir <DIR> override output directory --save-tensor-layers <RANGE> `START..END` (default `0..1`) Dispatch site emits a stub eprintln when `--save-tensor` is set; the forward_traced plumbing lands in PR-B. The helpers in `crates/aprender-serve/src/inference_trace/save_tensor*.rs` already exist and will be wired through next. ## Five Whys 1. Why ship clap surface alone? — Splits the contract-binding from the forward-pass wiring; reviewer can verify each layer in isolation, and the binary already prints `--help` exactly as the contract specifies. 2. Why not stub PR-A and PR-B together? — `forward_traced` lives in `aprender-serve` and threads layer-id through three call sites; bundling those changes with the apr-cli signature change makes the diff harder to bisect when MODEL-1 cosine drift later regresses. 3. Why `0..1` default? — SHIP-007 root cause is pinned at layer-0 qkv matmul (memory: `project_2026_04_27_session2_handoff.md`). Default keeps disk usage bounded; users opt into wider ranges explicitly. 4. Why eprintln stub instead of returning error? — Contract gate is "CLI accepts the flag" not "feature works"; an error would mask the surface from existing test harnesses. Stub message names PR-B as the next step so it isn't mistaken for a complete feature. 5. Why now in the SHIP-TWO loop? — Path forward to ship MODEL-1 is layer-0 stage diff (per operator directive 2026-05-02); PR-A is step 1/4 of the user-authorized parallel track alongside Stack v2. ## Ship % update MODEL-1: ~64% (unchanged — PR-A is surface only; element-wise diff pending PR-B/PR-C/PR-D). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 2, 2026
noahgift
added a commit
that referenced
this pull request
May 2, 2026
…tion (#1407) Add 6 integration tests in `tests/save_tensor_plan_integration.rs` that drive `SaveTensorPlan` (PR-B, #1406) end-to-end through the public `save_tensor::*` writer + `save_tensor_paths::*` directory helpers. These pin the plan ↔ writer contract from the OUTSIDE so the upcoming `forward_traced` wiring (PR-C-real) has a stable target. The tests mirror the future dispatch flow: build plan from CLI strings → iterate forward-pass-style `(stage, layer, data)` tuples → assert files appear exactly where the plan predicted, and only for selected stages. ## Tests (6/6 pass) | Test | What it pins | |------|-------------| | `plan_three_stages_layer_zero_writes_three_files` | multi-stage selection + skip unselected | | `plan_layer_range_filter_excludes_out_of_range` | END-exclusive range honoured end-to-end | | `plan_whole_model_stage_writes_to_root_not_layer_dir` | final_norm/lm_head bypass layer-N segment | | `plan_unselected_stage_produces_no_file` | should_save==false → zero I/O | | `plan_byte_determinism_across_two_runs` | FALSIFY-APR-TRACE-SAVE-002 at integration boundary | | `plan_all_keyword_writes_18_per_layer_files_for_one_layer` | `all` keyword → 18 stages routed correctly | ## Five Whys 1. **Why integration tests now, not in PR-B?** PR-B unit tests verify `should_save`/`stage_path` in isolation. These integration tests verify the plan ACTUALLY drives the writer correctly when used the way `forward_traced` will use it. 2. **Why ship before PR-C-real?** Locks down the plan ↔ writer contract so PR-C-real (forward_traced surgery) only has to plumb the call sites; if a future PR breaks the contract, these tests fail BEFORE the model loads. 3. **Why a separate file from `save_tensor_integration.rs`?** That file exercises the writer in isolation; this file exercises the plan-driven flow. Mixing would obscure which level of the contract regressed when a test fails. 4. **Why include the `all` keyword test?** SHIP-007 layer-0 stage diff may run `--save-tensor all --save-tensor-layers 0..1` for an exhaustive bisection; routing 18 stages through the plan is the single most stressed path and deserves its own pin. 5. **Why now in the SHIP-TWO loop?** Ship-cadence move that closes the contract surface PR-C-real depends on, while PR-A (#1405) and PR-B (#1406) CI is still in flight. Stack-friendly: branched off PR-B, merges cleanly when both land. ## Stack relationship This PR depends on PR-B (#1406) — SaveTensorPlan must be in the tree. Branched off `feat/ship-007-pr-b-save-tensor-plan`. GitHub will set PR-B as the base; once PR-B merges to main, this PR auto-rebases. ## Ship % update MODEL-1: ~64% (unchanged — integration tests are preparatory; the forward_traced surgery in PR-C-real and the `apr trace --save-tensor` dispatch wiring in PR-D are still pending). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 2, 2026
…e_theta + chat template Squashes 4 substantive M32d FAST PATH fixes (Step 5 + 5b + 6 + 7) + regression test + evidence into a single commit on top of fresh main. Replaces the original messy stacked-PR chain that conflicted on rebase after sibling PRs (#1401, #1405) landed. Live verification on lambda-vector RTX 4090 (post-rebuild): $ apr run <Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf> \ --prompt "What is 2+2?" --max-tokens 8 Output: 2 + 2 = 4 Completed in 40.24s (cached) Step 5 — per-head Q/K RMSNorm in forward_qwen3_moe (rank-3 prior, 15%) ==================================================================== Qwen3 GH-279 per-head Q/K RMSNorm was wired into the dense path (adaptive_ffn.rs:174-179) but missing from forward_qwen3_moe.rs. Now applied AFTER bias, BEFORE RoPE — same code as adaptive_ffn. Pre-fix: layer std-dev grew 40× over 48 layers (signature of attention scores compounding without per-head Q/K norm). Output `%%%%%%%%`. Step 5b — rope_theta default 10K → 1M for qwen3_moe (rank-4 prior, 10%) ======================================================================= GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT `qwen3moe.rope.freq_base` metadata. config.rs's default lookup had `"qwen2" | "qwen3" => 1_000_000.0` but no qwen3_moe entry — fell to catch-all 10K. Off by 100×. Added qwen3_moe to the 1M arm. Step 6 — chat template (qwen3_moe → ChatML, no <think>) ======================================================== `detect_format_from_name` routed any "qwen3" name to Qwen3NoThink (PMAT-181), which pre-injects empty `<think>\n</think>\n` into the assistant turn. Qwen3-Coder does NOT have thinking mode (verified via the Jinja `tokenizer.chat_template` in the GGUF) — empty think block caused the model to emit `<|endoftext|>` immediately. Route qwen3_moe to plain ChatML before the generic qwen3 → NoThink rule. PMAT-181 preserved for thinking-mode dense Qwen3. Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K norm ============================================================ forward_qwen3_moe_traced (created in PR #1222 on main) was authored mirroring the OLD pre-Q/K-norm forward_qwen3_moe. Without sync, `apr trace --payload` shows DIFFERENT numerics from `apr run` — silent diagnostic-vs-production drift. Mirror the same Q/K norm into the traced variant. Component priors discharge status (M34 FAST PATH) | Rank | Component | Prior | Status | |------|-----------|-------|------------| | 1 | LAYOUT | 30% | not at issue | | 2 | Q4_K_M | 20% | not at issue | | 3 | Q/K norm | 15% | FIXED (this commit) | | 4 | RoPE θ | 10% | FIXED (this commit) | | 5 | router sm | 10% | not at issue | | 6 | token emb | 10% | not at issue | | 7 | other | 5% | n/a | | n/a | chat tpl | n/a | FIXED (this commit) | Output transition pre-fix → "%%%%%%%%" (gibberish) + Step 5 → "Human: What is 2+" (coherent English, partial) + Step 5b → "Human: What is 2+2?" (full prompt reproduced) + Step 6 → "2 + 2 = 4" (correct answer) + Step 7 → diagnostic trace matches production Multi-domain verification (also passes): "Capital of France:" → "The capital of France is Paris." "Translate to Spanish: Hello world" → "¡Hola mundo!" "Count to 5:" → "1, 2, 3, 4, 5" "Solve x^2 - 5x + 6 = 0:" → "I need to solve the quadratic equation x² - 5x + 6 = 0..." Hot-path safety - Production text-generation path (`apr run` → run_qwen3_moe_generate → forward_qwen3_moe) now applies the norm. - `apr trace --payload` (forward_qwen3_moe_traced) syncs the same fix. - Sibling tests pass unchanged. - `forward_qwen3_moe_traced` reads `self.config.rope_theta` which is set at model load from the default lookup — Step 5b auto-applies via config. - Dense Qwen3 path UNCHANGED (Qwen3NoThink preserved for thinking-mode variants). Regression test `crates/aprender-serve/tests/qwen3_moe_qk_norm_regression.rs` F-QW3-MOE-STEP5-001 asserts the context-awareness invariant: two distinct prompts must produce distinct argmax tokens, top-2 logit gap < 50. Live PASS on lambda-vector RTX 4090 in 6.60s. Stack research - HuggingFace transformers Qwen3MoeForCausalLM applies per-head q_norm/k_norm in Qwen3MoeAttention.forward - llama.cpp ggml_qwen3_moe_kv_norm in llama-arch.cpp does the same (attn_q_norm.weight / attn_k_norm.weight) - HF Qwen3MoeConfig.rope_theta default = 1_000_000.0 - Qwen3-Coder Jinja chat_template generation prompt is plain `<|im_start|>assistant\n` (no thinking) Refs M32d FAST PATH plan (M34, paiml/claude-code-parity-apr) Refs GH-279 (Qwen3 per-head Q/K RMSNorm) Refs PMAT-181 (Qwen3NoThink preserved for thinking variants) Refs FALSIFY-QW3-MOE-FORWARD-003 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 2, 2026
…1406) * feat(aprender-serve): SHIP-007 PR-B — SaveTensorPlan plan-builder Add `inference_trace::save_tensor_plan` (`SaveTensorPlan` + `PlanParseError`) that converts the three `apr trace --save-tensor*` argument strings into a validated, ready-to-execute plan: - `--save-tensor <STAGES>` comma-list, or the literal `all` - `--save-tensor-dir <DIR>` output root - `--save-tensor-layers <RANGE>` Rust `START..END`, END exclusive The plan exposes `should_save(stage, layer)` and `stage_path(stage, layer)` queries. Per-layer stages obey the layer range; whole-model stages (`final_norm`, `lm_head`) ignore the range and are saved iff selected. Pure module — no I/O, no transformer state, no `apr-cli` coupling. PR-C threads the plan into `AprTransformer::forward_traced` as a side-channel. 28 unit tests covering: provenance pin, pass band (whitespace, `all`, `ALL`/`aLL`, wide ranges), fail band (unknown stage, empty token, malformed range, negative start, END<=START, garbage END), should_save semantics (in-range, out-of-range, unselected, whole-model bypass, default `0..1`), stage_path semantics (per-layer 0/3, whole-model with sentinel), edge cases (`layer_output` alias, duplicate preservation, min range, high layer index, Clone+Eq). ## Five Whys 1. Why split plan-builder from forward-pass wiring? — Argument validation should fail BEFORE multi-second model load; the only way to keep the error surface tight is to parse eagerly. Splitting also lets PR-C touch `forward_traced` internals without touching argument parsing. 2. Why pure (no I/O)? — `ensure_layer_dir`/file open are stateful side effects best owned by the writer (PR-C). Tests stay fast and deterministic; `cargo test --lib` runs the 28 cases in 0.00s. 3. Why `0..1` default? — Matches the clap default in PR-A (#1405). SHIP-007 root cause is pinned at layer-0 qkv matmul; default keeps disk-write blast radius bounded; users opt in to wider ranges. 4. Why `all` keyword? — Contract §invariants line 63 specifies "comma-delimited; multiple stages MAY be saved in one run". `all` is the obvious shorthand for the 18-stage exhaustive bisection the SHIP-007 hypothesis chain (§24-§32) is heading toward. 5. Why now in the SHIP-TWO loop? — Operator directive 2026-05-02: "Path forward to ship MODEL-1 is SHIP-007 layer-0 stage diff (extend apr trace --save-tensor, multi-PR work)". PR-B is step 2/4. ## Independence from PR-A PR-B does NOT modify `apr-cli`. The branch is off `origin/main` (which does not yet contain PR-A's clap fields). PR-A and PR-B can land in either order; PR-C will rebase off whichever is already in main and introduce the dispatch wiring. ## Ship % update MODEL-1: ~64% (unchanged — plan-builder is preparatory; element-wise diff still pending PR-C/D forward_traced wiring). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(aprender-serve): SHIP-007 PR-C-prep — SaveTensorPlan I/O integration (#1407) Add 6 integration tests in `tests/save_tensor_plan_integration.rs` that drive `SaveTensorPlan` (PR-B, #1406) end-to-end through the public `save_tensor::*` writer + `save_tensor_paths::*` directory helpers. These pin the plan ↔ writer contract from the OUTSIDE so the upcoming `forward_traced` wiring (PR-C-real) has a stable target. The tests mirror the future dispatch flow: build plan from CLI strings → iterate forward-pass-style `(stage, layer, data)` tuples → assert files appear exactly where the plan predicted, and only for selected stages. ## Tests (6/6 pass) | Test | What it pins | |------|-------------| | `plan_three_stages_layer_zero_writes_three_files` | multi-stage selection + skip unselected | | `plan_layer_range_filter_excludes_out_of_range` | END-exclusive range honoured end-to-end | | `plan_whole_model_stage_writes_to_root_not_layer_dir` | final_norm/lm_head bypass layer-N segment | | `plan_unselected_stage_produces_no_file` | should_save==false → zero I/O | | `plan_byte_determinism_across_two_runs` | FALSIFY-APR-TRACE-SAVE-002 at integration boundary | | `plan_all_keyword_writes_18_per_layer_files_for_one_layer` | `all` keyword → 18 stages routed correctly | ## Five Whys 1. **Why integration tests now, not in PR-B?** PR-B unit tests verify `should_save`/`stage_path` in isolation. These integration tests verify the plan ACTUALLY drives the writer correctly when used the way `forward_traced` will use it. 2. **Why ship before PR-C-real?** Locks down the plan ↔ writer contract so PR-C-real (forward_traced surgery) only has to plumb the call sites; if a future PR breaks the contract, these tests fail BEFORE the model loads. 3. **Why a separate file from `save_tensor_integration.rs`?** That file exercises the writer in isolation; this file exercises the plan-driven flow. Mixing would obscure which level of the contract regressed when a test fails. 4. **Why include the `all` keyword test?** SHIP-007 layer-0 stage diff may run `--save-tensor all --save-tensor-layers 0..1` for an exhaustive bisection; routing 18 stages through the plan is the single most stressed path and deserves its own pin. 5. **Why now in the SHIP-TWO loop?** Ship-cadence move that closes the contract surface PR-C-real depends on, while PR-A (#1405) and PR-B (#1406) CI is still in flight. Stack-friendly: branched off PR-B, merges cleanly when both land. ## Stack relationship This PR depends on PR-B (#1406) — SaveTensorPlan must be in the tree. Branched off `feat/ship-007-pr-b-save-tensor-plan`. GitHub will set PR-B as the base; once PR-B merges to main, this PR auto-rebases. ## Ship % update MODEL-1: ~64% (unchanged — integration tests are preparatory; the forward_traced surgery in PR-C-real and the `apr trace --save-tensor` dispatch wiring in PR-D are still pending). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 2, 2026
…e_theta + chat template (#1228) Squashes 4 substantive M32d FAST PATH fixes (Step 5 + 5b + 6 + 7) + regression test + evidence into a single commit on top of fresh main. Replaces the original messy stacked-PR chain that conflicted on rebase after sibling PRs (#1401, #1405) landed. Live verification on lambda-vector RTX 4090 (post-rebuild): $ apr run <Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf> \ --prompt "What is 2+2?" --max-tokens 8 Output: 2 + 2 = 4 Completed in 40.24s (cached) Step 5 — per-head Q/K RMSNorm in forward_qwen3_moe (rank-3 prior, 15%) ==================================================================== Qwen3 GH-279 per-head Q/K RMSNorm was wired into the dense path (adaptive_ffn.rs:174-179) but missing from forward_qwen3_moe.rs. Now applied AFTER bias, BEFORE RoPE — same code as adaptive_ffn. Pre-fix: layer std-dev grew 40× over 48 layers (signature of attention scores compounding without per-head Q/K norm). Output `%%%%%%%%`. Step 5b — rope_theta default 10K → 1M for qwen3_moe (rank-4 prior, 10%) ======================================================================= GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT `qwen3moe.rope.freq_base` metadata. config.rs's default lookup had `"qwen2" | "qwen3" => 1_000_000.0` but no qwen3_moe entry — fell to catch-all 10K. Off by 100×. Added qwen3_moe to the 1M arm. Step 6 — chat template (qwen3_moe → ChatML, no <think>) ======================================================== `detect_format_from_name` routed any "qwen3" name to Qwen3NoThink (PMAT-181), which pre-injects empty `<think>\n</think>\n` into the assistant turn. Qwen3-Coder does NOT have thinking mode (verified via the Jinja `tokenizer.chat_template` in the GGUF) — empty think block caused the model to emit `<|endoftext|>` immediately. Route qwen3_moe to plain ChatML before the generic qwen3 → NoThink rule. PMAT-181 preserved for thinking-mode dense Qwen3. Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K norm ============================================================ forward_qwen3_moe_traced (created in PR #1222 on main) was authored mirroring the OLD pre-Q/K-norm forward_qwen3_moe. Without sync, `apr trace --payload` shows DIFFERENT numerics from `apr run` — silent diagnostic-vs-production drift. Mirror the same Q/K norm into the traced variant. Component priors discharge status (M34 FAST PATH) | Rank | Component | Prior | Status | |------|-----------|-------|------------| | 1 | LAYOUT | 30% | not at issue | | 2 | Q4_K_M | 20% | not at issue | | 3 | Q/K norm | 15% | FIXED (this commit) | | 4 | RoPE θ | 10% | FIXED (this commit) | | 5 | router sm | 10% | not at issue | | 6 | token emb | 10% | not at issue | | 7 | other | 5% | n/a | | n/a | chat tpl | n/a | FIXED (this commit) | Output transition pre-fix → "%%%%%%%%" (gibberish) + Step 5 → "Human: What is 2+" (coherent English, partial) + Step 5b → "Human: What is 2+2?" (full prompt reproduced) + Step 6 → "2 + 2 = 4" (correct answer) + Step 7 → diagnostic trace matches production Multi-domain verification (also passes): "Capital of France:" → "The capital of France is Paris." "Translate to Spanish: Hello world" → "¡Hola mundo!" "Count to 5:" → "1, 2, 3, 4, 5" "Solve x^2 - 5x + 6 = 0:" → "I need to solve the quadratic equation x² - 5x + 6 = 0..." Hot-path safety - Production text-generation path (`apr run` → run_qwen3_moe_generate → forward_qwen3_moe) now applies the norm. - `apr trace --payload` (forward_qwen3_moe_traced) syncs the same fix. - Sibling tests pass unchanged. - `forward_qwen3_moe_traced` reads `self.config.rope_theta` which is set at model load from the default lookup — Step 5b auto-applies via config. - Dense Qwen3 path UNCHANGED (Qwen3NoThink preserved for thinking-mode variants). Regression test `crates/aprender-serve/tests/qwen3_moe_qk_norm_regression.rs` F-QW3-MOE-STEP5-001 asserts the context-awareness invariant: two distinct prompts must produce distinct argmax tokens, top-2 logit gap < 50. Live PASS on lambda-vector RTX 4090 in 6.60s. Stack research - HuggingFace transformers Qwen3MoeForCausalLM applies per-head q_norm/k_norm in Qwen3MoeAttention.forward - llama.cpp ggml_qwen3_moe_kv_norm in llama-arch.cpp does the same (attn_q_norm.weight / attn_k_norm.weight) - HF Qwen3MoeConfig.rope_theta default = 1_000_000.0 - Qwen3-Coder Jinja chat_template generation prompt is plain `<|im_start|>assistant\n` (no thinking) Refs M32d FAST PATH plan (M34, paiml/claude-code-parity-apr) Refs GH-279 (Qwen3 per-head Q/K RMSNorm) Refs PMAT-181 (Qwen3NoThink preserved for thinking variants) Refs FALSIFY-QW3-MOE-FORWARD-003 Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…h_save_tensor` wrapper (#1408) * feat(aprender-serve): SHIP-007 PR-B — SaveTensorPlan plan-builder Add `inference_trace::save_tensor_plan` (`SaveTensorPlan` + `PlanParseError`) that converts the three `apr trace --save-tensor*` argument strings into a validated, ready-to-execute plan: - `--save-tensor <STAGES>` comma-list, or the literal `all` - `--save-tensor-dir <DIR>` output root - `--save-tensor-layers <RANGE>` Rust `START..END`, END exclusive The plan exposes `should_save(stage, layer)` and `stage_path(stage, layer)` queries. Per-layer stages obey the layer range; whole-model stages (`final_norm`, `lm_head`) ignore the range and are saved iff selected. Pure module — no I/O, no transformer state, no `apr-cli` coupling. PR-C threads the plan into `AprTransformer::forward_traced` as a side-channel. 28 unit tests covering: provenance pin, pass band (whitespace, `all`, `ALL`/`aLL`, wide ranges), fail band (unknown stage, empty token, malformed range, negative start, END<=START, garbage END), should_save semantics (in-range, out-of-range, unselected, whole-model bypass, default `0..1`), stage_path semantics (per-layer 0/3, whole-model with sentinel), edge cases (`layer_output` alias, duplicate preservation, min range, high layer index, Clone+Eq). ## Five Whys 1. Why split plan-builder from forward-pass wiring? — Argument validation should fail BEFORE multi-second model load; the only way to keep the error surface tight is to parse eagerly. Splitting also lets PR-C touch `forward_traced` internals without touching argument parsing. 2. Why pure (no I/O)? — `ensure_layer_dir`/file open are stateful side effects best owned by the writer (PR-C). Tests stay fast and deterministic; `cargo test --lib` runs the 28 cases in 0.00s. 3. Why `0..1` default? — Matches the clap default in PR-A (#1405). SHIP-007 root cause is pinned at layer-0 qkv matmul; default keeps disk-write blast radius bounded; users opt in to wider ranges. 4. Why `all` keyword? — Contract §invariants line 63 specifies "comma-delimited; multiple stages MAY be saved in one run". `all` is the obvious shorthand for the 18-stage exhaustive bisection the SHIP-007 hypothesis chain (§24-§32) is heading toward. 5. Why now in the SHIP-TWO loop? — Operator directive 2026-05-02: "Path forward to ship MODEL-1 is SHIP-007 layer-0 stage diff (extend apr trace --save-tensor, multi-PR work)". PR-B is step 2/4. ## Independence from PR-A PR-B does NOT modify `apr-cli`. The branch is off `origin/main` (which does not yet contain PR-A's clap fields). PR-A and PR-B can land in either order; PR-C will rebase off whichever is already in main and introduce the dispatch wiring. ## Ship % update MODEL-1: ~64% (unchanged — plan-builder is preparatory; element-wise diff still pending PR-C/D forward_traced wiring). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(aprender-serve): SHIP-007 PR-C-prep — SaveTensorPlan I/O integration (#1407) Add 6 integration tests in `tests/save_tensor_plan_integration.rs` that drive `SaveTensorPlan` (PR-B, #1406) end-to-end through the public `save_tensor::*` writer + `save_tensor_paths::*` directory helpers. These pin the plan ↔ writer contract from the OUTSIDE so the upcoming `forward_traced` wiring (PR-C-real) has a stable target. The tests mirror the future dispatch flow: build plan from CLI strings → iterate forward-pass-style `(stage, layer, data)` tuples → assert files appear exactly where the plan predicted, and only for selected stages. ## Tests (6/6 pass) | Test | What it pins | |------|-------------| | `plan_three_stages_layer_zero_writes_three_files` | multi-stage selection + skip unselected | | `plan_layer_range_filter_excludes_out_of_range` | END-exclusive range honoured end-to-end | | `plan_whole_model_stage_writes_to_root_not_layer_dir` | final_norm/lm_head bypass layer-N segment | | `plan_unselected_stage_produces_no_file` | should_save==false → zero I/O | | `plan_byte_determinism_across_two_runs` | FALSIFY-APR-TRACE-SAVE-002 at integration boundary | | `plan_all_keyword_writes_18_per_layer_files_for_one_layer` | `all` keyword → 18 stages routed correctly | ## Five Whys 1. **Why integration tests now, not in PR-B?** PR-B unit tests verify `should_save`/`stage_path` in isolation. These integration tests verify the plan ACTUALLY drives the writer correctly when used the way `forward_traced` will use it. 2. **Why ship before PR-C-real?** Locks down the plan ↔ writer contract so PR-C-real (forward_traced surgery) only has to plumb the call sites; if a future PR breaks the contract, these tests fail BEFORE the model loads. 3. **Why a separate file from `save_tensor_integration.rs`?** That file exercises the writer in isolation; this file exercises the plan-driven flow. Mixing would obscure which level of the contract regressed when a test fails. 4. **Why include the `all` keyword test?** SHIP-007 layer-0 stage diff may run `--save-tensor all --save-tensor-layers 0..1` for an exhaustive bisection; routing 18 stages through the plan is the single most stressed path and deserves its own pin. 5. **Why now in the SHIP-TWO loop?** Ship-cadence move that closes the contract surface PR-C-real depends on, while PR-A (#1405) and PR-B (#1406) CI is still in flight. Stack-friendly: branched off PR-B, merges cleanly when both land. ## Stack relationship This PR depends on PR-B (#1406) — SaveTensorPlan must be in the tree. Branched off `feat/ship-007-pr-b-save-tensor-plan`. GitHub will set PR-B as the base; once PR-B merges to main, this PR auto-rebases. ## Ship % update MODEL-1: ~64% (unchanged — integration tests are preparatory; the forward_traced surgery in PR-C-real and the `apr trace --save-tensor` dispatch wiring in PR-D are still pending). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): SHIP-007 PR-C-real step 1 — `forward_traced_with_save_tensor` wrapper Add public method `AprTransformer::forward_traced_with_save_tensor( tokens, plan)` that delegates to existing `forward_traced` and additionally writes the **embedding** stage to disk if the supplied [`SaveTensorPlan`] (PR-B, #1406) selects it. This is the first of several SHIP-007 forward_traced wiring steps. Step 1 covers only the embedding stage — the one stage that can be re-extracted by calling `self.embed(token_ids)` a second time, so we ship working save-tensor I/O without modifying the 360-line `forward_traced` body. Subsequent steps will thread `Option<&SaveTensorPlan>` through `forward_traced` itself so per-layer stages (qkv_matmul, ffn_gate, …) emit during the single forward pass without re-runs. ## Five Whys 1. **Why a wrapper instead of modifying forward_traced directly?** Keeps the high-risk forward-pass surgery out of step 1. The wrapper compiles in 4.5 s and connects cleanly to `forward_traced`'s existing return value; modifying the 360-line body would be a large, risky diff for a single ship-cadence iteration. 2. **Why only the embedding stage?** Embedding is the only stage that can be cheaply re-extracted post-hoc via `self.embed()` (token-table lookup, no matmuls). Other stages live inside `forward_traced` and require threading the plan through to capture. 3. **Why error-typed as `RealizarError::IoError`?** The existing project error type already has an `IoError { message }` variant; using it keeps the `Result<ForwardTrace>` signature compatible with `forward_traced` so callers don't need a `From<>` impl. 4. **Why no integration test on a real model in this PR?** Building a live `AprTransformer` requires loading a real .apr file — appropriate for PR-D (`apr trace --save-tensor` on canonical 7B teacher), not for a step-1 unit-level wrapper. The plan ↔ writer contract is already pinned by PR-C-prep (#1407). 5. **Why now in the SHIP-TWO loop?** Operator directive 2026-05-02 — "Path forward to ship MODEL-1 is SHIP-007 layer-0 stage diff (extend apr trace --save-tensor, multi-PR work)". Step 1 establishes the public method that PR-D's dispatch will call. ## Stack relationship This PR depends on PR-B (#1406) — `SaveTensorPlan` must be in the tree. Branched off `feat/ship-007-pr-b-save-tensor-plan`. GitHub will auto-rebase once PR-B merges to main. ## Ship % update MODEL-1: ~64% (unchanged — wrapper is preparatory; effective tensor capture for any layer-0 stage other than embedding still requires follow-up PRs threading the plan through forward_traced). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(aprender-serve): SHIP-007 PR-C-real-step1 — clippy doc list overindentation CI lint failed with `error: doc list item overindented` on the "## Role in the cascade" bulleted list (4 occurrences). The continuation lines inside the **This file** bullet were indented 25 spaces; clippy expects 2-space continuation indent. Reformat the bullet so the continuation text uses 2-space indent and the file body stays inside one bullet. Also update PR-B/PR-B-prep status from OPEN → MERGED (both landed this session). No code changes, doc-only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…age tensors (#1413) Closes the `apr_diff_values_compat` invariant of `apr-cli-trace-save-tensor-v1` at PARTIAL_ALGORITHM_LEVEL via a new `diff_05_aprt_stage.rs` include slot. When both inputs to `apr diff --values` start with magic bytes `APRT` (the 12-byte header written by `apr trace --save-tensor`), the dispatch now bypasses the RosettaStone whole-model walker and runs an element-wise stage-tensor diff: - max|diff| with index - RMS diff - Cosine similarity (f64-accumulated for numerical stability) - Top-K divergences sorted by |a - b| Both JSON and pretty text output are supported. Mismatched dim_product or layer fields fail-fast with a diagnostic error so callers don't silently compare incompatible stages. ## Five Whys (why now, why this scope) 1. **Why is this needed?** `apr trace --save-tensor` (PR-A #1405, PR-B #1406, PR-C-prep #1407) writes per-stage f32 tensors as `APRT`-prefixed files. Without an APRT-aware diff, layer-0 stage-by-stage element-wise bisection per `feedback_model_1_ships_gpu_only.md` is gated on external tooling — exactly the kind of muda the APR-MONO §26.8 rule forbids. 2. **Why extend `apr diff` and not write a new subcommand?** The `apr_diff_values_compat` invariant in `apr-cli-trace-save-tensor-v1` already names `apr diff --values` as the verifier. Extending the existing flag keeps the contract surface stable. 3. **Why an include!() file instead of inlining into diff.rs?** diff.rs already follows that pattern (diff_accumulator, diff_output_json_text, diff_04). Keeping APRT logic in `diff_05_aprt_stage.rs` lets it be audited / removed independently and doesn't grow the parent file. 4. **Why pin via `provenance_pin_pr_d_rev1`?** Future renames of either `is_aprt_stage_file` or the file path break the include!() chain; the pin makes that visible at test-time and forces a contract bump. 5. **Why now?** Tokenization of the 27 GB Stack v1.2 Python corpus is running in the background for MODEL-2 (PR #1412 merged). The SHIP-007 PR-C-real cascade for MODEL-1 needs PR-D infrastructure ready when step 2 (forward_traced threading) lands. PR-D is independent and can merge in parallel with #1408. ## Verification - `cargo test -p apr-cli --lib commands::diff::aprt` → 11/11 PASS - is_aprt_stage_file: detects/rejects/truncated/missing (4 tests) - compute_aprt_stage_stats: identical=zero, known max/RMS, top-K sort (3) - run_aprt_stage_diff: dim/layer mismatch errors, identical succeeds (3) - provenance_pin_pr_d_rev1 (1) - `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors ## Contract update `apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0: - New FALSIFY-APR-TRACE-SAVE-009 binding `apr_diff_values_compat` at PARTIAL_ALGORITHM_LEVEL with 4-line `algorithm_evidence` block citing this PR's unit tests. ## Ship % update MODEL-1: ~64% → ~66% (PR-D is small but discharges 1 PARTIAL invariant and clears infrastructure blocker for SHIP-007 step E). MODEL-2: corpus tokenization in progress (~33h ETA). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 5, 2026
) M-MOE-SUB-2 step (a) CLI completion: connects clap surface (--save-tensor / --save-tensor-layers / --save-tensor-dir, PR-A #1405) through to forward_qwen3_moe_traced_with_plan (M74, PR #1516 squash 3138d13) for Qwen3-MoE-arch GGUF models. ## What ships 1. New pub fn `run_save_tensor_gguf_moe(path, stages, dir, layers)` in `crates/apr-cli/src/commands/trace_save_tensor.rs`. Mirrors the structure of the existing `run_save_tensor_apr` for APR models, but loads via `MappedGGUFModel` / `OwnedQuantizedModel`, validates `qwen3_moe` arch (rejects dense GGUF with a clear error), reads MoE config from GGUF metadata (`expert_count`, `expert_used_count`, `expert_feed_forward_length`), loads per-layer `Qwen3MoeQuantizedLayer` descriptors, then dispatches to `forward_qwen3_moe_traced_with_plan` with the plan derived from the CLI args. 2. Dispatch wireup in `dispatch.rs::dispatch_diagnostic_commands` under the `Commands::Trace` arm. The previous code dispatched `--save-tensor` for `.apr` only and printed a stub for other extensions; now `.gguf` dispatches to the new `run_save_tensor_gguf_moe` function. Other extensions (.safetensors) still print the stub pending SHIP-007 PR-E. ## What this does NOT ship - Dense GGUF SaveTensor wireup (still falls through to stub). - M-MOE-SUB-2 step (b) GPU sibling `forward_qwen3_moe_cuda_traced.rs` — separate PR. - Live verification on lambda-vector RTX 4090 against the cached 17.3 GB Qwen3-Coder GGUF — exercised in M-MOE-SUB-3. ## Hot path safety Production `forward_qwen3_moe` / `forward_qwen3_moe_cuda` hot paths byte-unchanged (additive-purity invariant pinned in v1.1.0). Production `forward_qwen3_moe_traced` (no plan) also unchanged — the new wireup uses the M74 sibling `forward_qwen3_moe_traced_with_plan`. ## Verification $ cargo build -p apr-cli --release --features inference clean $ cargo clippy -p apr-cli --lib --release --features inference \ -- -D warnings clean $ rustfmt --check trace_save_tensor.rs dispatch.rs clean $ cargo test -p apr-cli --release --lib commands::trace_save_tensor 5 passed (existing tests preserved) ## Falsifier impact - FALSIFY-MOE-SUB-002 (byte-identity preservation): still partial — needs M-MOE-SUB-2 step (b) GPU sibling for full discharge. - M-MOE-SUB-3 live bisection: now unblocked operationally — invoking `apr trace --save-tensor moe_router,moe_ffn_out --save-tensor-layers 0..48 --save-tensor-dir <dir> <gguf>` on lambda-vector RTX 4090 will produce per-layer MoeRouter + MoeFfnOut tensor files for the cached Qwen3-Coder GGUF, ready for diff vs the GPU sibling output once step (b) ships. Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.1.0 step (a) CLI completion, M68 helper PR #1507 (squash 0f22c78), M74 forward_qwen3_moe_traced_with_plan PR #1516 (squash 3138d13) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
--save-tensor,--save-tensor-dir,--save-tensor-layerstoCommands::Tracecontracts/apr-cli-trace-save-tensor-v1.yamlat the binary boundaryforward_tracedThis is PR-A of a 4-PR cascade authorized by the operator on 2026-05-02:
forward_tracedin aprender-serveapr diff --stageextension consuming saved tensorsThe helpers (
save_tensor_stage.rs,save_tensor_paths.rs,save_tensor_compose.rs) shipped in PRs #1133–#1137 already exist incrates/aprender-serve/src/inference_trace/. PR-A only adds the CLI signature so the contract is bound andapr trace --helprenders correctly.Five Whys
--helpexactly as the contract specifies.forward_tracedlives inaprender-serveand threads layer-id through three call sites; bundling makes the diff harder to bisect when MODEL-1 cosine drift later regresses.0..1default? SHIP-007 root cause is pinned at layer-0 qkv matmul (memory:project_2026_04_27_session2_handoff.md). Default keeps disk usage bounded; users opt into wider ranges explicitly.Ship % update
Test plan
cargo check -p apr-clicleancargo check -p apr-cli --testscleancargo test -p apr-cli --lib test_parse_trace_commandpasscargo test -p apr-cli --lib test_dispatch_diagnostic_routes_tracepassapr trace --helprenders all 3 new flags with full descriptions🤖 Generated with Claude Code