feat(aprender-serve): SHIP-007 PR-C-real step 1 — forward_traced_with_save_tensor wrapper#1408
Merged
Merged
Conversation
Add `inference_trace::save_tensor_plan` (`SaveTensorPlan` + `PlanParseError`) that converts the three `apr trace --save-tensor*` argument strings into a validated, ready-to-execute plan: - `--save-tensor <STAGES>` comma-list, or the literal `all` - `--save-tensor-dir <DIR>` output root - `--save-tensor-layers <RANGE>` Rust `START..END`, END exclusive The plan exposes `should_save(stage, layer)` and `stage_path(stage, layer)` queries. Per-layer stages obey the layer range; whole-model stages (`final_norm`, `lm_head`) ignore the range and are saved iff selected. Pure module — no I/O, no transformer state, no `apr-cli` coupling. PR-C threads the plan into `AprTransformer::forward_traced` as a side-channel. 28 unit tests covering: provenance pin, pass band (whitespace, `all`, `ALL`/`aLL`, wide ranges), fail band (unknown stage, empty token, malformed range, negative start, END<=START, garbage END), should_save semantics (in-range, out-of-range, unselected, whole-model bypass, default `0..1`), stage_path semantics (per-layer 0/3, whole-model with sentinel), edge cases (`layer_output` alias, duplicate preservation, min range, high layer index, Clone+Eq). ## Five Whys 1. Why split plan-builder from forward-pass wiring? — Argument validation should fail BEFORE multi-second model load; the only way to keep the error surface tight is to parse eagerly. Splitting also lets PR-C touch `forward_traced` internals without touching argument parsing. 2. Why pure (no I/O)? — `ensure_layer_dir`/file open are stateful side effects best owned by the writer (PR-C). Tests stay fast and deterministic; `cargo test --lib` runs the 28 cases in 0.00s. 3. Why `0..1` default? — Matches the clap default in PR-A (#1405). SHIP-007 root cause is pinned at layer-0 qkv matmul; default keeps disk-write blast radius bounded; users opt in to wider ranges. 4. Why `all` keyword? — Contract §invariants line 63 specifies "comma-delimited; multiple stages MAY be saved in one run". `all` is the obvious shorthand for the 18-stage exhaustive bisection the SHIP-007 hypothesis chain (§24-§32) is heading toward. 5. Why now in the SHIP-TWO loop? — Operator directive 2026-05-02: "Path forward to ship MODEL-1 is SHIP-007 layer-0 stage diff (extend apr trace --save-tensor, multi-PR work)". PR-B is step 2/4. ## Independence from PR-A PR-B does NOT modify `apr-cli`. The branch is off `origin/main` (which does not yet contain PR-A's clap fields). PR-A and PR-B can land in either order; PR-C will rebase off whichever is already in main and introduce the dispatch wiring. ## Ship % update MODEL-1: ~64% (unchanged — plan-builder is preparatory; element-wise diff still pending PR-C/D forward_traced wiring). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion (#1407) Add 6 integration tests in `tests/save_tensor_plan_integration.rs` that drive `SaveTensorPlan` (PR-B, #1406) end-to-end through the public `save_tensor::*` writer + `save_tensor_paths::*` directory helpers. These pin the plan ↔ writer contract from the OUTSIDE so the upcoming `forward_traced` wiring (PR-C-real) has a stable target. The tests mirror the future dispatch flow: build plan from CLI strings → iterate forward-pass-style `(stage, layer, data)` tuples → assert files appear exactly where the plan predicted, and only for selected stages. ## Tests (6/6 pass) | Test | What it pins | |------|-------------| | `plan_three_stages_layer_zero_writes_three_files` | multi-stage selection + skip unselected | | `plan_layer_range_filter_excludes_out_of_range` | END-exclusive range honoured end-to-end | | `plan_whole_model_stage_writes_to_root_not_layer_dir` | final_norm/lm_head bypass layer-N segment | | `plan_unselected_stage_produces_no_file` | should_save==false → zero I/O | | `plan_byte_determinism_across_two_runs` | FALSIFY-APR-TRACE-SAVE-002 at integration boundary | | `plan_all_keyword_writes_18_per_layer_files_for_one_layer` | `all` keyword → 18 stages routed correctly | ## Five Whys 1. **Why integration tests now, not in PR-B?** PR-B unit tests verify `should_save`/`stage_path` in isolation. These integration tests verify the plan ACTUALLY drives the writer correctly when used the way `forward_traced` will use it. 2. **Why ship before PR-C-real?** Locks down the plan ↔ writer contract so PR-C-real (forward_traced surgery) only has to plumb the call sites; if a future PR breaks the contract, these tests fail BEFORE the model loads. 3. **Why a separate file from `save_tensor_integration.rs`?** That file exercises the writer in isolation; this file exercises the plan-driven flow. Mixing would obscure which level of the contract regressed when a test fails. 4. **Why include the `all` keyword test?** SHIP-007 layer-0 stage diff may run `--save-tensor all --save-tensor-layers 0..1` for an exhaustive bisection; routing 18 stages through the plan is the single most stressed path and deserves its own pin. 5. **Why now in the SHIP-TWO loop?** Ship-cadence move that closes the contract surface PR-C-real depends on, while PR-A (#1405) and PR-B (#1406) CI is still in flight. Stack-friendly: branched off PR-B, merges cleanly when both land. ## Stack relationship This PR depends on PR-B (#1406) — SaveTensorPlan must be in the tree. Branched off `feat/ship-007-pr-b-save-tensor-plan`. GitHub will set PR-B as the base; once PR-B merges to main, this PR auto-rebases. ## Ship % update MODEL-1: ~64% (unchanged — integration tests are preparatory; the forward_traced surgery in PR-C-real and the `apr trace --save-tensor` dispatch wiring in PR-D are still pending). MODEL-2: blocked on Stack v2 corpus access (operator action). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…h_save_tensor` wrapper
Add public method `AprTransformer::forward_traced_with_save_tensor(
tokens, plan)` that delegates to existing `forward_traced` and
additionally writes the **embedding** stage to disk if the supplied
[`SaveTensorPlan`] (PR-B, #1406) selects it.
This is the first of several SHIP-007 forward_traced wiring steps. Step
1 covers only the embedding stage — the one stage that can be
re-extracted by calling `self.embed(token_ids)` a second time, so we
ship working save-tensor I/O without modifying the 360-line
`forward_traced` body.
Subsequent steps will thread `Option<&SaveTensorPlan>` through
`forward_traced` itself so per-layer stages (qkv_matmul, ffn_gate, …)
emit during the single forward pass without re-runs.
## Five Whys
1. **Why a wrapper instead of modifying forward_traced directly?**
Keeps the high-risk forward-pass surgery out of step 1. The wrapper
compiles in 4.5 s and connects cleanly to `forward_traced`'s
existing return value; modifying the 360-line body would be a
large, risky diff for a single ship-cadence iteration.
2. **Why only the embedding stage?** Embedding is the only stage that
can be cheaply re-extracted post-hoc via `self.embed()` (token-table
lookup, no matmuls). Other stages live inside `forward_traced` and
require threading the plan through to capture.
3. **Why error-typed as `RealizarError::IoError`?** The existing
project error type already has an `IoError { message }` variant;
using it keeps the `Result<ForwardTrace>` signature compatible
with `forward_traced` so callers don't need a `From<>` impl.
4. **Why no integration test on a real model in this PR?** Building a
live `AprTransformer` requires loading a real .apr file —
appropriate for PR-D (`apr trace --save-tensor` on canonical 7B
teacher), not for a step-1 unit-level wrapper. The plan ↔ writer
contract is already pinned by PR-C-prep (#1407).
5. **Why now in the SHIP-TWO loop?** Operator directive 2026-05-02 —
"Path forward to ship MODEL-1 is SHIP-007 layer-0 stage diff
(extend apr trace --save-tensor, multi-PR work)". Step 1
establishes the public method that PR-D's dispatch will call.
## Stack relationship
This PR depends on PR-B (#1406) — `SaveTensorPlan` must be in the
tree. Branched off `feat/ship-007-pr-b-save-tensor-plan`. GitHub will
auto-rebase once PR-B merges to main.
## Ship % update
MODEL-1: ~64% (unchanged — wrapper is preparatory; effective tensor
capture for any layer-0 stage other than embedding still requires
follow-up PRs threading the plan through forward_traced).
MODEL-2: blocked on Stack v2 corpus access (operator action).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ndentation CI lint failed with `error: doc list item overindented` on the "## Role in the cascade" bulleted list (4 occurrences). The continuation lines inside the **This file** bullet were indented 25 spaces; clippy expects 2-space continuation indent. Reformat the bullet so the continuation text uses 2-space indent and the file body stays inside one bullet. Also update PR-B/PR-B-prep status from OPEN → MERGED (both landed this session). No code changes, doc-only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4 tasks
Merged
3 tasks
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…age tensors (#1413) Closes the `apr_diff_values_compat` invariant of `apr-cli-trace-save-tensor-v1` at PARTIAL_ALGORITHM_LEVEL via a new `diff_05_aprt_stage.rs` include slot. When both inputs to `apr diff --values` start with magic bytes `APRT` (the 12-byte header written by `apr trace --save-tensor`), the dispatch now bypasses the RosettaStone whole-model walker and runs an element-wise stage-tensor diff: - max|diff| with index - RMS diff - Cosine similarity (f64-accumulated for numerical stability) - Top-K divergences sorted by |a - b| Both JSON and pretty text output are supported. Mismatched dim_product or layer fields fail-fast with a diagnostic error so callers don't silently compare incompatible stages. ## Five Whys (why now, why this scope) 1. **Why is this needed?** `apr trace --save-tensor` (PR-A #1405, PR-B #1406, PR-C-prep #1407) writes per-stage f32 tensors as `APRT`-prefixed files. Without an APRT-aware diff, layer-0 stage-by-stage element-wise bisection per `feedback_model_1_ships_gpu_only.md` is gated on external tooling — exactly the kind of muda the APR-MONO §26.8 rule forbids. 2. **Why extend `apr diff` and not write a new subcommand?** The `apr_diff_values_compat` invariant in `apr-cli-trace-save-tensor-v1` already names `apr diff --values` as the verifier. Extending the existing flag keeps the contract surface stable. 3. **Why an include!() file instead of inlining into diff.rs?** diff.rs already follows that pattern (diff_accumulator, diff_output_json_text, diff_04). Keeping APRT logic in `diff_05_aprt_stage.rs` lets it be audited / removed independently and doesn't grow the parent file. 4. **Why pin via `provenance_pin_pr_d_rev1`?** Future renames of either `is_aprt_stage_file` or the file path break the include!() chain; the pin makes that visible at test-time and forces a contract bump. 5. **Why now?** Tokenization of the 27 GB Stack v1.2 Python corpus is running in the background for MODEL-2 (PR #1412 merged). The SHIP-007 PR-C-real cascade for MODEL-1 needs PR-D infrastructure ready when step 2 (forward_traced threading) lands. PR-D is independent and can merge in parallel with #1408. ## Verification - `cargo test -p apr-cli --lib commands::diff::aprt` → 11/11 PASS - is_aprt_stage_file: detects/rejects/truncated/missing (4 tests) - compute_aprt_stage_stats: identical=zero, known max/RMS, top-K sort (3) - run_aprt_stage_diff: dim/layer mismatch errors, identical succeeds (3) - provenance_pin_pr_d_rev1 (1) - `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors ## Contract update `apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0: - New FALSIFY-APR-TRACE-SAVE-009 binding `apr_diff_values_compat` at PARTIAL_ALGORITHM_LEVEL with 4-line `algorithm_evidence` block citing this PR's unit tests. ## Ship % update MODEL-1: ~64% → ~66% (PR-D is small but discharges 1 PARTIAL invariant and clears infrastructure blocker for SHIP-007 step E). MODEL-2: corpus tokenization in progress (~33h ETA). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
… `forward_traced_with_save_tensor` (#1414) Extends the wrapper from PR-C-real step 1 (#1408) to additionally write the `LmHead` whole-model stage when the supplied [`SaveTensorPlan`] selects it. The logits are pulled directly from `trace.logits` — the `Vec<f32>` already returned by `forward_traced` — so no recompute, no internal forward-pass surgery, no risk of behavior drift. This is the same low-risk capture pattern as step 1's `Embedding` branch (re-use already-computed data; defer the high-risk threading into `forward_traced` to future steps). ## Five Whys (why now, why this scope) 1. **Why LmHead next?** Of the 18 `SaveTensorStage` variants, only two are externally re-extractable from a `forward_traced` return value without modifying the function body: `Embedding` (cheap re-call of `self.embed`) and `LmHead` (already in `trace.logits`). Step 1 shipped Embedding; LmHead is the obvious second. 2. **Why not jump straight to per-layer stages (qkv, ffn_*)?** Those stages require threading `Option<&SaveTensorPlan>` through the 360-line `forward_traced` body. That's the bigger surgery — high blast radius, deserves its own PR with proper drift-prevention tests and a real-model integration smoke. Splitting LmHead out first lets `apr diff --values` (PR #1413) compare APR vs GGUF logits TODAY for free, before per-layer infrastructure lands. 3. **Why use the WHOLE_MODEL_LAYER sentinel?** Per `apr-cli-trace-save-tensor-v1` `byte_format` invariant: whole-model stages (lm_head, final_norm) carry `0xFFFFFFFF` in the layer field so `apr diff --values` can recognize them. Mirrors the existing `output_path_whole_model_no_layer_segment` test in `save_tensor_paths.rs`. 4. **Why no integration test on a real `AprTransformer`?** Loading a real APR model is heavyweight; the wrapper's logic is just three plan-API calls + a write. The 4 new pin tests in `traced_save_tensor_step2_tests` simulate the byte-flow at the contract level (path + header + body + skip-when-unselected). Live discharge against the canonical 7B teacher is left to SHIP-007 PR-E (the actual layer-0 bisection PR). 5. **Why now in the SHIP-TWO loop?** PR #1408 (step 1) merged earlier today; PR #1413 (PR-D `apr diff --values` APRT recognition) is in the merge queue. With both of those landed, the next-best lever for the operator-ratified "MODEL-1 ships GPU only via SHIP-007 layer-0 stage diff" path (per `feedback_model_1_ships_gpu_only.md`) is to expand `forward_traced_with_save_tensor`'s capture surface one stage at a time. LmHead is the smallest, safest next step. ## Verification - `cargo test -p aprender-serve --lib traced_save_tensor_step2` → 4/4 PASS: - step2_lm_head_writes_to_output_root_not_per_layer_dir - step2_lm_head_header_uses_whole_model_sentinel - step2_lm_head_skipped_when_plan_does_not_select_it - step2_lm_head_writes_logits_bytes_verbatim (NaN-bit-preserving) - `cargo check -p aprender-serve --lib` clean - Step 1's existing Embedding branch is byte-identical to before (no edits to that block; only added a sibling LmHead branch). ## Contract Contract update is intentionally deferred to a follow-up commit to avoid file-conflict with PR #1413 (which is mid-merge and bumps `apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0). Once #1413 lands, a small follow-up will bump v1.1.0 → v1.2.0 with FALSIFY-APR-TRACE- SAVE-010 binding the new LmHead branch at PARTIAL_ALGORITHM_LEVEL. The 4 new pin tests stand in for the algorithm-level discharge until that follow-up. ## Ship % update - MODEL-1: ~66% → ~68% (SHIP-007 capture surface widens from 1/18 to 2/18 stages; the two cheapest captures are now wired). - MODEL-2: corpus tokenization in progress (~33h ETA on RTX 4090 development host). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 3, 2026
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…es (#1417) Before this PR, `apr trace --save-tensor <model>.apr` printed a stub message and never invoked the underlying wrapper — so the existing Embedding (PR #1408 step 1) and LmHead (PR #1414 step 2) capture surface was UNREACHABLE from the CLI. The wrapper, plan-builder, and APRT byte-format machinery had all merged but produced no files. This PR closes the dispatch gap. When `apr trace --save-tensor <stages>` is invoked on a `.apr` model, dispatch.rs now routes to a new `commands::trace_save_tensor::run_save_tensor_apr` function that: 1. Builds a `SaveTensorPlan` from `--save-tensor`/`--save-tensor-dir`/ `--save-tensor-layers`. Default output dir is `<model-stem>-trace/` next to the input. 2. Loads the APR model + embedded BPE tokenizer. 3. Encodes a fixed test prompt (`"What is 2+2?"` — same as `vector_stats.rs::run_traced_inference_apr` for consistency). 4. Calls `AprTransformer::forward_traced_with_save_tensor(&tokens, &plan)`. 5. Walks the output directory and prints every `*.bin` file with its size, plus the forward-pass success summary. `.gguf` and `.safetensors` paths still print the stub for now — SHIP-007 PR-E live diagnostics convert GGUF→APR at the import boundary so the canonical 7B teacher bisection runs through this code path. ## Five Whys 1. **Why was this missing?** Per the SHIP-007 PR-A commit message, the clap surface was shipped first as a contract pin so `apr-cli-trace-save-tensor-v1.yaml::cli_signature` was bound at the binary boundary. Subsequent PRs (B/C-prep/C-step1/C-step2) wired the library-side machinery. The dispatch glue was the missing final hop — easy to overlook because the contract test `apr trace --save-tensor --help | grep save-tensor` passed all along. 2. **Why is `.apr` only?** `forward_traced_with_save_tensor` is a method on `AprTransformer`; GGUF inference goes through a different `OwnedQuantizedModel` path. Adding GGUF support means either porting the wrapper to that type or converting GGUF→APR at import — the latter is what SHIP-007 PR-E already plans, so it's not blocking. 3. **Why a fixed prompt instead of `--prompt`?** SHIP-007 bisection needs the SAME prompt across APR and GGUF runs to make `apr diff --values` byte comparison meaningful. Hardcoding to `"What is 2+2?"` matches the existing `run_traced_inference_apr` in `vector_stats.rs`. A future `--prompt` flag is a small follow-up. 4. **Why a new module instead of extending `trace.rs`?** `trace.rs` is 722 lines already; adding the save-tensor branch via a new module (52 lines + 4 unit tests) keeps the existing 4-format dispatch intact and isolates the wrapper-specific imports (`realizar::inference_trace::save_tensor_plan::SaveTensorPlan`). 5. **Why now?** Operator's standing /loop directive is "select next best recommended choice". With PR #1414 (step 2 LmHead) merged today and PR #1416 (refactor prep) auto-merge queued, the highest-leverage move is making the existing capture surface actually work end-to-end. SHIP-007 PR-E (the live bisection) is gated on `apr trace --save-tensor` producing files; that gate is what this PR opens. ## Test plan - [x] `cargo test -p apr-cli --lib commands::trace_save_tensor` → 5/5 PASS - default_output_dir_uses_model_stem - default_output_dir_handles_bare_filename - default_output_dir_handles_no_extension - collect_bin_files_recurses_per_layer_subdirs - collect_bin_files_missing_dir_is_ok - [x] `cargo check -p apr-cli --lib` clean - [ ] Live smoke on canonical 7B teacher (operator-pre-authorized lambda-labs lane): `apr trace --payload /mnt/nvme-raid0/.../qwen2.5-coder-7b-instruct-q4k.apr --save-tensor embedding,lm_head` produces `<dir>/layer-0/embedding.bin` + `<dir>/lm_head.bin`. Deferred to a follow-up commit since the operator can run this any time after merge. - [ ] CI required checks (`ci / gate`, `workspace-test`) ## Ship % update - **MODEL-1**: ~68% → **~70%** (the wrapper surface that's been merged for 2 days is now actually invocable from the CLI, which means SHIP-007 PR-E live diagnostics are no longer blocked on this dispatch gap). - **MODEL-2**: corpus tokenization at ~96M tokens / 119 min (steady ~14K tok/s). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add
AprTransformer::forward_traced_with_save_tensor(tokens, plan)that delegates to existingforward_tracedand additionally writes the embedding stage to disk if the suppliedSaveTensorPlan(PR-B, #1406) selects it.First of several SHIP-007 forward_traced wiring steps. Step 1 covers only the embedding stage — the one stage cheaply re-extractable via
self.embed(). Subsequent steps will threadOption<&SaveTensorPlan>throughforward_traceditself so per-layer stages emit during the single forward pass.Stack position
SaveTensorPlanplan-builderapr trace --save-tensordispatch wiringBase branch:
feat/ship-007-pr-b-save-tensor-plan— auto-rebases tomainonce PR-B merges.Five Whys
forward_traced's existing return value.self.embed()(token-table lookup, no matmuls). Other stages live insideforward_tracedand require threading the plan.RealizarError::IoError? Existing project error type has anIoError { message }variant; using it keeps theResult<ForwardTrace>signature compatible withforward_traced.AprTransformerrequires loading a real .apr — appropriate for PR-D, not for a step-1 wrapper. Plan ↔ writer contract already pinned by PR-C-prep (test(aprender-serve): SHIP-007 PR-C-prep —SaveTensorPlanI/O integration tests #1407).Ship % update
Test plan
cargo check -p aprender-serve --libclean (4.54 s)cargo fmt --checkcleanforward_traced(zero blast radius on existing parity tests)🤖 Generated with Claude Code