feat(aprender-serve): SHIP-007 PR-C-real step 2 — LmHead capture in forward_traced wrapper by noahgift · Pull Request #1414 · paiml/aprender

noahgift · 2026-05-03T07:58:25Z

Summary

Extends the SHIP-007 PR-C-real step 1 wrapper (#1408) to capture the LmHead (final logits) whole-model stage. The logits come straight from trace.logits (already returned by forward_traced) — no recompute, no surgery on the 360-line function body.

This is the same low-risk pattern as step 1's Embedding branch (re-use already-computed data; defer the high-risk per-layer threading into forward_traced to future steps).

What changed

crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs:
- Added LmHead branch using WHOLE_MODEL_LAYER sentinel.
- 4 unit tests in new traced_save_tensor_step2_tests module pinning path layout, header sentinel, skip-when-unselected, and bit-identical logits round-trip (NaN-bit preserving).
- Updated doc-comment to reflect step-1+step-2 capture surface.
Step 1's Embedding branch is byte-identical to before (only added a sibling block).

Five Whys

Why LmHead next? Of 18 SaveTensorStage variants, only Embedding and LmHead are externally re-extractable from a forward_traced return value without modifying the function body. Step 1 shipped Embedding; LmHead is the obvious second.
Why not jump to per-layer stages? Those require threading Option<&SaveTensorPlan> through the 360-line forward body — high blast radius, deserves its own PR.
Why WHOLE_MODEL_LAYER sentinel? Per apr-cli-trace-save-tensor-v1 byte_format invariant: whole-model stages carry 0xFFFFFFFF in the layer field so apr diff --values (PR feat(apr-cli): SHIP-007 PR-D — apr diff --values recognizes APRT stage tensors #1413) can recognize them.
Why no live integration test on a real AprTransformer? Loading APR is heavyweight; the wrapper's logic is three plan-API calls + a write. The 4 new pin tests simulate the byte-flow at the contract level. Live discharge is left to SHIP-007 PR-E.
Why now? PR feat(aprender-serve): SHIP-007 PR-C-real step 1 — forward_traced_with_save_tensor wrapper #1408 (step 1) merged today; PR feat(apr-cli): SHIP-007 PR-D — apr diff --values recognizes APRT stage tensors #1413 (PR-D APRT diff) is in the merge queue. Per feedback_model_1_ships_gpu_only.md, the SHIP-007 layer-0 stage-diff path needs the wrapper's capture surface expanded one stage at a time. LmHead is the smallest, safest next step.

Test plan

cargo test -p aprender-serve --lib traced_save_tensor_step2 → 4/4 PASS
- step2_lm_head_writes_to_output_root_not_per_layer_dir
- step2_lm_head_header_uses_whole_model_sentinel
- step2_lm_head_skipped_when_plan_does_not_select_it
- step2_lm_head_writes_logits_bytes_verbatim (NaN-bit-preserving)
cargo check -p aprender-serve --lib clean
CI required checks (ci / gate, workspace-test)

Contract

Contract update is intentionally deferred to a follow-up commit to avoid file-conflict with PR #1413 (mid-merge; bumps apr-cli-trace-save-tensor-v1 v1.0.0 → v1.1.0). Once #1413 lands, a small follow-up will bump v1.1.0 → v1.2.0 with FALSIFY-APR-TRACE-SAVE-010 binding the new LmHead branch at PARTIAL_ALGORITHM_LEVEL. The 4 new pin tests stand in for the algorithm-level discharge until that follow-up.

Ship % update

MODEL-1: ~66% → ~68% (SHIP-007 capture surface 1/18 → 2/18 stages).
MODEL-2: full Stack v1.2 corpus tokenization in progress in background (~33h ETA, ~14K tok/s).

🤖 Generated with Claude Code

… `forward_traced_with_save_tensor` Extends the wrapper from PR-C-real step 1 (#1408) to additionally write the `LmHead` whole-model stage when the supplied [`SaveTensorPlan`] selects it. The logits are pulled directly from `trace.logits` — the `Vec<f32>` already returned by `forward_traced` — so no recompute, no internal forward-pass surgery, no risk of behavior drift. This is the same low-risk capture pattern as step 1's `Embedding` branch (re-use already-computed data; defer the high-risk threading into `forward_traced` to future steps). ## Five Whys (why now, why this scope) 1. **Why LmHead next?** Of the 18 `SaveTensorStage` variants, only two are externally re-extractable from a `forward_traced` return value without modifying the function body: `Embedding` (cheap re-call of `self.embed`) and `LmHead` (already in `trace.logits`). Step 1 shipped Embedding; LmHead is the obvious second. 2. **Why not jump straight to per-layer stages (qkv, ffn_*)?** Those stages require threading `Option<&SaveTensorPlan>` through the 360-line `forward_traced` body. That's the bigger surgery — high blast radius, deserves its own PR with proper drift-prevention tests and a real-model integration smoke. Splitting LmHead out first lets `apr diff --values` (PR #1413) compare APR vs GGUF logits TODAY for free, before per-layer infrastructure lands. 3. **Why use the WHOLE_MODEL_LAYER sentinel?** Per `apr-cli-trace-save-tensor-v1` `byte_format` invariant: whole-model stages (lm_head, final_norm) carry `0xFFFFFFFF` in the layer field so `apr diff --values` can recognize them. Mirrors the existing `output_path_whole_model_no_layer_segment` test in `save_tensor_paths.rs`. 4. **Why no integration test on a real `AprTransformer`?** Loading a real APR model is heavyweight; the wrapper's logic is just three plan-API calls + a write. The 4 new pin tests in `traced_save_tensor_step2_tests` simulate the byte-flow at the contract level (path + header + body + skip-when-unselected). Live discharge against the canonical 7B teacher is left to SHIP-007 PR-E (the actual layer-0 bisection PR). 5. **Why now in the SHIP-TWO loop?** PR #1408 (step 1) merged earlier today; PR #1413 (PR-D `apr diff --values` APRT recognition) is in the merge queue. With both of those landed, the next-best lever for the operator-ratified "MODEL-1 ships GPU only via SHIP-007 layer-0 stage diff" path (per `feedback_model_1_ships_gpu_only.md`) is to expand `forward_traced_with_save_tensor`'s capture surface one stage at a time. LmHead is the smallest, safest next step. ## Verification - `cargo test -p aprender-serve --lib traced_save_tensor_step2` → 4/4 PASS: - step2_lm_head_writes_to_output_root_not_per_layer_dir - step2_lm_head_header_uses_whole_model_sentinel - step2_lm_head_skipped_when_plan_does_not_select_it - step2_lm_head_writes_logits_bytes_verbatim (NaN-bit-preserving) - `cargo check -p aprender-serve --lib` clean - Step 1's existing Embedding branch is byte-identical to before (no edits to that block; only added a sibling LmHead branch). ## Contract Contract update is intentionally deferred to a follow-up commit to avoid file-conflict with PR #1413 (which is mid-merge and bumps `apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0). Once #1413 lands, a small follow-up will bump v1.1.0 → v1.2.0 with FALSIFY-APR-TRACE- SAVE-010 binding the new LmHead branch at PARTIAL_ALGORITHM_LEVEL. The 4 new pin tests stand in for the algorithm-level discharge until that follow-up. ## Ship % update - MODEL-1: ~66% → ~68% (SHIP-007 capture surface widens from 1/18 to 2/18 stages; the two cheapest captures are now wired). - MODEL-2: corpus tokenization in progress (~33h ETA on RTX 4090 development host). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… records LmHead step-2 PARTIAL discharge (#1415) Follow-up to PR #1414 (`forward_traced_with_save_tensor` step 2). Adds FALSIFY-APR-TRACE-SAVE-010 binding the LmHead branch at PARTIAL_ALGORITHM_LEVEL; the algorithm-level evidence cites the four new pin tests in `traced_save_tensor_step2_tests`: - step2_lm_head_writes_to_output_root_not_per_layer_dir - step2_lm_head_header_uses_whole_model_sentinel - step2_lm_head_skipped_when_plan_does_not_select_it - step2_lm_head_writes_logits_bytes_verbatim (NaN-bit preserving) `binds_to: byte_format` because step 2 invokes the same write_tensor_file path with `WHOLE_MODEL_LAYER` sentinel as the existing `byte_format` equation specifies. Live discharge against the canonical 7B teacher is deferred to SHIP-007 PR-E (layer-0 bisection). ## Five Whys 1. **Why a separate contract follow-up?** The PR #1414 commit needed to land before this bump to avoid file-conflict with PR #1413 (which independently bumped v1.0.0 → v1.1.0 with FALSIFY-009). 2. **Why `binds_to: byte_format` and not `cli_signature`?** The wrapper doesn't add a new clap surface (PR-A already did that); it adds a new branch that emits files conforming to the existing byte-format equation. The new branch's verbatim f32 LE round-trip + NaN preservation is exactly the property `byte_format` invariants pin. 3. **Why PARTIAL_ALGORITHM_LEVEL not full discharge?** The 4 unit tests simulate the wrapper's byte-flow at the contract level using synthetic plans and fake logits — they do NOT instantiate a full AprTransformer or load a real APR model. Live discharge requires SHIP-007 PR-E. 4. **Why bump to v1.2.0?** Adding a new falsification test (FALSIFY-010) that binds an additional invariant is a minor schema change. Per semver, that's a minor bump. 5. **Why `pv validate` clean even with two new falsifiers in 24h?** The contract uses metadata.kind=schema, so falsification_tests entries are flexible; pv validates structure, IDs are unique, and binds_to references are valid. ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors - v1.0.0 → v1.1.0 (PR #1413, FALSIFY-009 binding apr_diff_values_compat) - v1.1.0 → v1.2.0 (this PR, FALSIFY-010 binding LmHead step-2 capture) ## Ship % update - MODEL-1: ~68% (unchanged — this is paperwork that records yesterday's algorithm-level discharge of step 2; the actual capture surface expansion happened in PR #1414). - MODEL-2: corpus tokenization at ~46.5M tokens / 56 min (steady ~14K tok/s); ~33h ETA for full 27 GB Stack v1.2 corpus. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…es (#1417) Before this PR, `apr trace --save-tensor <model>.apr` printed a stub message and never invoked the underlying wrapper — so the existing Embedding (PR #1408 step 1) and LmHead (PR #1414 step 2) capture surface was UNREACHABLE from the CLI. The wrapper, plan-builder, and APRT byte-format machinery had all merged but produced no files. This PR closes the dispatch gap. When `apr trace --save-tensor <stages>` is invoked on a `.apr` model, dispatch.rs now routes to a new `commands::trace_save_tensor::run_save_tensor_apr` function that: 1. Builds a `SaveTensorPlan` from `--save-tensor`/`--save-tensor-dir`/ `--save-tensor-layers`. Default output dir is `<model-stem>-trace/` next to the input. 2. Loads the APR model + embedded BPE tokenizer. 3. Encodes a fixed test prompt (`"What is 2+2?"` — same as `vector_stats.rs::run_traced_inference_apr` for consistency). 4. Calls `AprTransformer::forward_traced_with_save_tensor(&tokens, &plan)`. 5. Walks the output directory and prints every `*.bin` file with its size, plus the forward-pass success summary. `.gguf` and `.safetensors` paths still print the stub for now — SHIP-007 PR-E live diagnostics convert GGUF→APR at the import boundary so the canonical 7B teacher bisection runs through this code path. ## Five Whys 1. **Why was this missing?** Per the SHIP-007 PR-A commit message, the clap surface was shipped first as a contract pin so `apr-cli-trace-save-tensor-v1.yaml::cli_signature` was bound at the binary boundary. Subsequent PRs (B/C-prep/C-step1/C-step2) wired the library-side machinery. The dispatch glue was the missing final hop — easy to overlook because the contract test `apr trace --save-tensor --help | grep save-tensor` passed all along. 2. **Why is `.apr` only?** `forward_traced_with_save_tensor` is a method on `AprTransformer`; GGUF inference goes through a different `OwnedQuantizedModel` path. Adding GGUF support means either porting the wrapper to that type or converting GGUF→APR at import — the latter is what SHIP-007 PR-E already plans, so it's not blocking. 3. **Why a fixed prompt instead of `--prompt`?** SHIP-007 bisection needs the SAME prompt across APR and GGUF runs to make `apr diff --values` byte comparison meaningful. Hardcoding to `"What is 2+2?"` matches the existing `run_traced_inference_apr` in `vector_stats.rs`. A future `--prompt` flag is a small follow-up. 4. **Why a new module instead of extending `trace.rs`?** `trace.rs` is 722 lines already; adding the save-tensor branch via a new module (52 lines + 4 unit tests) keeps the existing 4-format dispatch intact and isolates the wrapper-specific imports (`realizar::inference_trace::save_tensor_plan::SaveTensorPlan`). 5. **Why now?** Operator's standing /loop directive is "select next best recommended choice". With PR #1414 (step 2 LmHead) merged today and PR #1416 (refactor prep) auto-merge queued, the highest-leverage move is making the existing capture surface actually work end-to-end. SHIP-007 PR-E (the live bisection) is gated on `apr trace --save-tensor` producing files; that gate is what this PR opens. ## Test plan - [x] `cargo test -p apr-cli --lib commands::trace_save_tensor` → 5/5 PASS - default_output_dir_uses_model_stem - default_output_dir_handles_bare_filename - default_output_dir_handles_no_extension - collect_bin_files_recurses_per_layer_subdirs - collect_bin_files_missing_dir_is_ok - [x] `cargo check -p apr-cli --lib` clean - [ ] Live smoke on canonical 7B teacher (operator-pre-authorized lambda-labs lane): `apr trace --payload /mnt/nvme-raid0/.../qwen2.5-coder-7b-instruct-q4k.apr --save-tensor embedding,lm_head` produces `<dir>/layer-0/embedding.bin` + `<dir>/lm_head.bin`. Deferred to a follow-up commit since the operator can run this any time after merge. - [ ] CI required checks (`ci / gate`, `workspace-test`) ## Ship % update - **MODEL-1**: ~68% → **~70%** (the wrapper surface that's been merged for 2 days is now actually invocable from the CLI, which means SHIP-007 PR-E live diagnostics are no longer blocked on this dispatch gap). - **MODEL-2**: corpus tokenization at ~96M tokens / 119 min (steady ~14K tok/s). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

… records CLI dispatch wire-up PARTIAL discharge Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to: cli_signature`. Before PR #1417, `apr trace --save-tensor` only printed a stub and never invoked `forward_traced_with_save_tensor`. The contract test `apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was already passing at the binary-boundary level — but the dispatch glue was missing, leaving Embedding + LmHead capture surface unreachable from the CLI for 2 days post-step-2 merge. FALSIFY-011 extends the existing `cli_signature` invariant from "the flag is recognized" to "the flag actually produces files". ## Five Whys 1. **Why a separate contract bump?** Avoids file-conflict with the in-flight refactor PR #1416 (which only touches `crates/aprender-serve/`). My contract change is isolated to `contracts/apr-cli-trace-save-tensor-v1.yaml`. 2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the byte format or determinism — it makes the CLI surface that the `cli_signature` equation already specified actually invocable. Same equation, expanded discharge level. 3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path resolution (3) and recursive *.bin walking (2) — algorithm-level. A live discharge against the canonical 7B teacher is operator- gated by post-merge smoke (~30s for a 7B forward + 2 file writes). 4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test that binds an existing invariant is a minor schema change per semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's discharge timeline: - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2) - v1.3.0 (this PR): cli_signature → end-to-end dispatch 5. **Why now?** Records the algorithm-level discharge so when the operator runs the live smoke post-#1417-merge, the contract ledger doesn't lag the code. Same paperwork pattern as #1415 (which followed #1414). ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors, 0 warnings ## Ship % update - MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417). - MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady ~14K tok/s; ~33h ETA total). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…7 forward_traced threading Centralizes the boilerplate that `forward_traced_with_save_tensor` inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007 PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through `AprTransformer::forward_traced` itself, every per-layer stage will call this single function instead of repeating the `should_save → ensure_layer_dir → File::create → write_tensor_file → flush` chain at each capture point. ## What changed - New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`: - `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage, layer, values) -> io::Result<()>` — the gated entry point. Cheap no-op when plan is None or stage/layer not selected. Forwards to `write_stage_file` when the gate passes. - `pub fn write_stage_file(output_dir, stage, layer, values) -> io::Result<()>` — the unconditional write, exposed separately for tests and any future callers that have already gated. - 7 unit tests pinning: None=no-op, unselected-stage=no-op, per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes out-of-range, NaN-bit-preserving f32 LE round-trip, missing parent dirs auto-created. - `crates/aprender-serve/src/inference_trace/mod.rs`: register the new module. - `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`: - Replace 60-line Embedding+LmHead inline blocks with two calls to `maybe_save_stage`. Net 50-line shrink. - Wrapper's behavior is byte-identical: same API surface, same file layout, same NaN preservation. Existing 4 `traced_save_tensor_step2_tests` tests still PASS. ## Five Whys 1. **Why now?** PR #1414 (step 2) merged earlier today landed the second copy of the inline block. Pre-step-3 is the right time to factor — before 15 more capture points get added inside the 360-line `forward_traced` body. 2. **Why a new module instead of inlining in `apr_transformer/`?** The helper has zero coupling to `AprTransformer` (it takes a plan + stage + values). Living next to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches the existing `inference_trace::save_tensor_*` family pattern. 3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?** Step 3 will thread this through `forward_traced`, which is also called from non-instrumented contexts (HTTP serving, training evals). The `Option` lets a single `forward_traced` body serve both — `maybe_save_stage(None, ...)` is a single discriminant compare in hot paths. 4. **Why expose `write_stage_file` separately from `maybe_save_stage`?** Test ergonomics (the unit tests need to verify the unconditional write path, not just the gated path) and forward-compatibility for a future `forward_traced_inner` that does its own `should_save` filtering inside a hot loop and wants to skip the option indirection. 5. **Why no contract bump in this PR?** This is a pure refactor — no behavior change, no new invariants. The existing `byte_format`, `determinism`, and `apr_diff_values_compat` invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through exactly one function instead of two now, which makes future contract obligations easier to satisfy. PR #1415 (already in flight) bumps the contract for step 2; step 3 will bump again to v1.3.0. ## Test plan - [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` → 7/7 PASS - [x] `cargo test -p aprender-serve --lib traced_save_tensor_step2_tests` → 4/4 PASS (existing tests unchanged behavior verified) - [x] `cargo test -p aprender-serve --test save_tensor_plan_integration` → 6/6 PASS (contract-level integration unchanged) - [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D warnings` clean ## Ship % update - MODEL-1: ~68% (unchanged — pure refactor; no new capture surface). Step 3 (per-layer threading inside `forward_traced`) is now a trivial follow-up: each capture point becomes one line: `maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx, &gate)?;`. - MODEL-2: corpus tokenization still running (~83 min elapsed, 46.5M tokens, ~33h ETA). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… records CLI dispatch wire-up PARTIAL discharge (#1418) Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to: cli_signature`. Before PR #1417, `apr trace --save-tensor` only printed a stub and never invoked `forward_traced_with_save_tensor`. The contract test `apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was already passing at the binary-boundary level — but the dispatch glue was missing, leaving Embedding + LmHead capture surface unreachable from the CLI for 2 days post-step-2 merge. FALSIFY-011 extends the existing `cli_signature` invariant from "the flag is recognized" to "the flag actually produces files". ## Five Whys 1. **Why a separate contract bump?** Avoids file-conflict with the in-flight refactor PR #1416 (which only touches `crates/aprender-serve/`). My contract change is isolated to `contracts/apr-cli-trace-save-tensor-v1.yaml`. 2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the byte format or determinism — it makes the CLI surface that the `cli_signature` equation already specified actually invocable. Same equation, expanded discharge level. 3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path resolution (3) and recursive *.bin walking (2) — algorithm-level. A live discharge against the canonical 7B teacher is operator- gated by post-merge smoke (~30s for a 7B forward + 2 file writes). 4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test that binds an existing invariant is a minor schema change per semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's discharge timeline: - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2) - v1.3.0 (this PR): cli_signature → end-to-end dispatch 5. **Why now?** Records the algorithm-level discharge so when the operator runs the live smoke post-#1417-merge, the contract ledger doesn't lag the code. Same paperwork pattern as #1415 (which followed #1414). ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors, 0 warnings ## Ship % update - MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417). - MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady ~14K tok/s; ~33h ETA total). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…7 forward_traced threading Centralizes the boilerplate that `forward_traced_with_save_tensor` inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007 PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through `AprTransformer::forward_traced` itself, every per-layer stage will call this single function instead of repeating the `should_save → ensure_layer_dir → File::create → write_tensor_file → flush` chain at each capture point. ## What changed - New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`: - `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage, layer, values) -> io::Result<()>` — the gated entry point. Cheap no-op when plan is None or stage/layer not selected. Forwards to `write_stage_file` when the gate passes. - `pub fn write_stage_file(output_dir, stage, layer, values) -> io::Result<()>` — the unconditional write, exposed separately for tests and any future callers that have already gated. - 7 unit tests pinning: None=no-op, unselected-stage=no-op, per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes out-of-range, NaN-bit-preserving f32 LE round-trip, missing parent dirs auto-created. - `crates/aprender-serve/src/inference_trace/mod.rs`: register the new module. - `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`: - Replace 60-line Embedding+LmHead inline blocks with two calls to `maybe_save_stage`. Net 50-line shrink. - Wrapper's behavior is byte-identical: same API surface, same file layout, same NaN preservation. Existing 4 `traced_save_tensor_step2_tests` tests still PASS. ## Five Whys 1. **Why now?** PR #1414 (step 2) merged earlier today landed the second copy of the inline block. Pre-step-3 is the right time to factor — before 15 more capture points get added inside the 360-line `forward_traced` body. 2. **Why a new module instead of inlining in `apr_transformer/`?** The helper has zero coupling to `AprTransformer` (it takes a plan + stage + values). Living next to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches the existing `inference_trace::save_tensor_*` family pattern. 3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?** Step 3 will thread this through `forward_traced`, which is also called from non-instrumented contexts (HTTP serving, training evals). The `Option` lets a single `forward_traced` body serve both — `maybe_save_stage(None, ...)` is a single discriminant compare in hot paths. 4. **Why expose `write_stage_file` separately from `maybe_save_stage`?** Test ergonomics (the unit tests need to verify the unconditional write path, not just the gated path) and forward-compatibility for a future `forward_traced_inner` that does its own `should_save` filtering inside a hot loop and wants to skip the option indirection. 5. **Why no contract bump in this PR?** This is a pure refactor — no behavior change, no new invariants. The existing `byte_format`, `determinism`, and `apr_diff_values_compat` invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through exactly one function instead of two now, which makes future contract obligations easier to satisfy. PR #1415 (already in flight) bumps the contract for step 2; step 3 will bump again to v1.3.0. ## Test plan - [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` → 7/7 PASS - [x] `cargo test -p aprender-serve --lib traced_save_tensor_step2_tests` → 4/4 PASS (existing tests unchanged behavior verified) - [x] `cargo test -p aprender-serve --test save_tensor_plan_integration` → 6/6 PASS (contract-level integration unchanged) - [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D warnings` clean ## Ship % update - MODEL-1: ~68% (unchanged — pure refactor; no new capture surface). Step 3 (per-layer threading inside `forward_traced`) is now a trivial follow-up: each capture point becomes one line: `maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx, &gate)?;`. - MODEL-2: corpus tokenization still running (~83 min elapsed, 46.5M tokens, ~33h ETA). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…7 forward_traced threading (#1416) * refactor(aprender-serve): extract maybe_save_stage helper for SHIP-007 forward_traced threading Centralizes the boilerplate that `forward_traced_with_save_tensor` inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007 PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through `AprTransformer::forward_traced` itself, every per-layer stage will call this single function instead of repeating the `should_save → ensure_layer_dir → File::create → write_tensor_file → flush` chain at each capture point. ## What changed - New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`: - `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage, layer, values) -> io::Result<()>` — the gated entry point. Cheap no-op when plan is None or stage/layer not selected. Forwards to `write_stage_file` when the gate passes. - `pub fn write_stage_file(output_dir, stage, layer, values) -> io::Result<()>` — the unconditional write, exposed separately for tests and any future callers that have already gated. - 7 unit tests pinning: None=no-op, unselected-stage=no-op, per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes out-of-range, NaN-bit-preserving f32 LE round-trip, missing parent dirs auto-created. - `crates/aprender-serve/src/inference_trace/mod.rs`: register the new module. - `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`: - Replace 60-line Embedding+LmHead inline blocks with two calls to `maybe_save_stage`. Net 50-line shrink. - Wrapper's behavior is byte-identical: same API surface, same file layout, same NaN preservation. Existing 4 `traced_save_tensor_step2_tests` tests still PASS. ## Five Whys 1. **Why now?** PR #1414 (step 2) merged earlier today landed the second copy of the inline block. Pre-step-3 is the right time to factor — before 15 more capture points get added inside the 360-line `forward_traced` body. 2. **Why a new module instead of inlining in `apr_transformer/`?** The helper has zero coupling to `AprTransformer` (it takes a plan + stage + values). Living next to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches the existing `inference_trace::save_tensor_*` family pattern. 3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?** Step 3 will thread this through `forward_traced`, which is also called from non-instrumented contexts (HTTP serving, training evals). The `Option` lets a single `forward_traced` body serve both — `maybe_save_stage(None, ...)` is a single discriminant compare in hot paths. 4. **Why expose `write_stage_file` separately from `maybe_save_stage`?** Test ergonomics (the unit tests need to verify the unconditional write path, not just the gated path) and forward-compatibility for a future `forward_traced_inner` that does its own `should_save` filtering inside a hot loop and wants to skip the option indirection. 5. **Why no contract bump in this PR?** This is a pure refactor — no behavior change, no new invariants. The existing `byte_format`, `determinism`, and `apr_diff_values_compat` invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through exactly one function instead of two now, which makes future contract obligations easier to satisfy. PR #1415 (already in flight) bumps the contract for step 2; step 3 will bump again to v1.3.0. ## Test plan - [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` → 7/7 PASS - [x] `cargo test -p aprender-serve --lib traced_save_tensor_step2_tests` → 4/4 PASS (existing tests unchanged behavior verified) - [x] `cargo test -p aprender-serve --test save_tensor_plan_integration` → 6/6 PASS (contract-level integration unchanged) - [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D warnings` clean ## Ship % update - MODEL-1: ~68% (unchanged — pure refactor; no new capture surface). Step 3 (per-layer threading inside `forward_traced`) is now a trivial follow-up: each capture point becomes one line: `maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx, &gate)?;`. - MODEL-2: corpus tokenization still running (~83 min elapsed, 46.5M tokens, ~33h ETA). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced (#1421) End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher captures all 16 stages in a single forward pass: - 14 per-layer: Embedding, AttnNorm, QkvMatmul, QkvBias, Attention, AttnOut, PostAttnResidual, FfnNorm, FfnGate, FfnUp, FfnSilu, FfnSwigl, FfnOut, PostFfnResidual - 2 whole-model: FinalNorm, LmHead Buffer sizes verified against Qwen2.5-Coder-7B config: - 100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header - 530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates) - 129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim) - 608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head) ## What changed `AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)` is the new private impl that threads the plan through every natural capture point. `forward_traced(tokens)` becomes a 1-line wrapper that calls it with `None` (zero-overhead — `maybe_save_stage` early-returns on `None`). `forward_traced_with_save_tensor` is now a pure delegator: no more double-embed or post-loop re-emission of LmHead. All 6 forward_traced regression tests pass. ## Five Whys 1. Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per `apr-cli-trace-save-tensor-v1.yaml`. 2. Why all 16 stages, not subset? The bisection target (root cause of MODEL-1 `apr run` divergence) is unknown; capturing every stage lets the operator element-wise diff against a reference at any of 16 points. 3. Why single-pass via `Option<&SaveTensorPlan>` (not separate method)? Avoids double-compute of embeddings and the wrapper's double-bookkeeping. DRY. 4. Why `Option<&SaveTensorPlan>` (not always-required)? Existing `forward_traced(tokens)` is called from many test sites and the trait `TracedForward`. Optional preserves the public API. 5. Why only inserted `maybe_save_stage` calls (no helpers refactor)? Each capture is a bare `emit(Stage, layer, &buf)?` — adding helpers would mask the natural buffer-name → stage-name pairing and make audit harder. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 3, 2026 07:58

Merge branch 'main' into feat/ship-007-pr-c-real-step2-lm-head

e95805e

noahgift mentioned this pull request May 3, 2026

contract(apr-cli-trace-save-tensor-v1): v1.1.0 → v1.2.0 — FALSIFY-010 records LmHead step-2 PARTIAL #1415

Merged

2 tasks

noahgift merged commit 943eac1 into main May 3, 2026
10 checks passed

noahgift deleted the feat/ship-007-pr-c-real-step2-lm-head branch May 3, 2026 08:41

noahgift mentioned this pull request May 3, 2026

refactor(aprender-serve): extract maybe_save_stage helper for SHIP-007 forward_traced threading #1416

Merged

5 tasks

This was referenced May 3, 2026

feat(apr-cli): wire apr trace --save-tensor end-to-end for .apr files #1417

Merged

contract(apr-cli-trace-save-tensor-v1): v1.2.0 → v1.3.0 — FALSIFY-011 records CLI dispatch wire-up PARTIAL discharge #1418

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(aprender-serve): SHIP-007 PR-C-real step 2 — LmHead capture in forward_traced wrapper#1414

feat(aprender-serve): SHIP-007 PR-C-real step 2 — LmHead capture in forward_traced wrapper#1414
noahgift merged 2 commits into
mainfrom
feat/ship-007-pr-c-real-step2-lm-head

noahgift commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 3, 2026

Summary

What changed

Five Whys

Test plan

Contract

Ship % update

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant