feat(aprender-serve): SHIP-007 PR-C-real step 2 — LmHead capture in forward_traced wrapper#1414
Merged
Merged
Conversation
… `forward_traced_with_save_tensor` Extends the wrapper from PR-C-real step 1 (#1408) to additionally write the `LmHead` whole-model stage when the supplied [`SaveTensorPlan`] selects it. The logits are pulled directly from `trace.logits` — the `Vec<f32>` already returned by `forward_traced` — so no recompute, no internal forward-pass surgery, no risk of behavior drift. This is the same low-risk capture pattern as step 1's `Embedding` branch (re-use already-computed data; defer the high-risk threading into `forward_traced` to future steps). ## Five Whys (why now, why this scope) 1. **Why LmHead next?** Of the 18 `SaveTensorStage` variants, only two are externally re-extractable from a `forward_traced` return value without modifying the function body: `Embedding` (cheap re-call of `self.embed`) and `LmHead` (already in `trace.logits`). Step 1 shipped Embedding; LmHead is the obvious second. 2. **Why not jump straight to per-layer stages (qkv, ffn_*)?** Those stages require threading `Option<&SaveTensorPlan>` through the 360-line `forward_traced` body. That's the bigger surgery — high blast radius, deserves its own PR with proper drift-prevention tests and a real-model integration smoke. Splitting LmHead out first lets `apr diff --values` (PR #1413) compare APR vs GGUF logits TODAY for free, before per-layer infrastructure lands. 3. **Why use the WHOLE_MODEL_LAYER sentinel?** Per `apr-cli-trace-save-tensor-v1` `byte_format` invariant: whole-model stages (lm_head, final_norm) carry `0xFFFFFFFF` in the layer field so `apr diff --values` can recognize them. Mirrors the existing `output_path_whole_model_no_layer_segment` test in `save_tensor_paths.rs`. 4. **Why no integration test on a real `AprTransformer`?** Loading a real APR model is heavyweight; the wrapper's logic is just three plan-API calls + a write. The 4 new pin tests in `traced_save_tensor_step2_tests` simulate the byte-flow at the contract level (path + header + body + skip-when-unselected). Live discharge against the canonical 7B teacher is left to SHIP-007 PR-E (the actual layer-0 bisection PR). 5. **Why now in the SHIP-TWO loop?** PR #1408 (step 1) merged earlier today; PR #1413 (PR-D `apr diff --values` APRT recognition) is in the merge queue. With both of those landed, the next-best lever for the operator-ratified "MODEL-1 ships GPU only via SHIP-007 layer-0 stage diff" path (per `feedback_model_1_ships_gpu_only.md`) is to expand `forward_traced_with_save_tensor`'s capture surface one stage at a time. LmHead is the smallest, safest next step. ## Verification - `cargo test -p aprender-serve --lib traced_save_tensor_step2` → 4/4 PASS: - step2_lm_head_writes_to_output_root_not_per_layer_dir - step2_lm_head_header_uses_whole_model_sentinel - step2_lm_head_skipped_when_plan_does_not_select_it - step2_lm_head_writes_logits_bytes_verbatim (NaN-bit-preserving) - `cargo check -p aprender-serve --lib` clean - Step 1's existing Embedding branch is byte-identical to before (no edits to that block; only added a sibling LmHead branch). ## Contract Contract update is intentionally deferred to a follow-up commit to avoid file-conflict with PR #1413 (which is mid-merge and bumps `apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0). Once #1413 lands, a small follow-up will bump v1.1.0 → v1.2.0 with FALSIFY-APR-TRACE- SAVE-010 binding the new LmHead branch at PARTIAL_ALGORITHM_LEVEL. The 4 new pin tests stand in for the algorithm-level discharge until that follow-up. ## Ship % update - MODEL-1: ~66% → ~68% (SHIP-007 capture surface widens from 1/18 to 2/18 stages; the two cheapest captures are now wired). - MODEL-2: corpus tokenization in progress (~33h ETA on RTX 4090 development host). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2 tasks
5 tasks
noahgift
added a commit
that referenced
this pull request
May 3, 2026
… records LmHead step-2 PARTIAL discharge (#1415) Follow-up to PR #1414 (`forward_traced_with_save_tensor` step 2). Adds FALSIFY-APR-TRACE-SAVE-010 binding the LmHead branch at PARTIAL_ALGORITHM_LEVEL; the algorithm-level evidence cites the four new pin tests in `traced_save_tensor_step2_tests`: - step2_lm_head_writes_to_output_root_not_per_layer_dir - step2_lm_head_header_uses_whole_model_sentinel - step2_lm_head_skipped_when_plan_does_not_select_it - step2_lm_head_writes_logits_bytes_verbatim (NaN-bit preserving) `binds_to: byte_format` because step 2 invokes the same write_tensor_file path with `WHOLE_MODEL_LAYER` sentinel as the existing `byte_format` equation specifies. Live discharge against the canonical 7B teacher is deferred to SHIP-007 PR-E (layer-0 bisection). ## Five Whys 1. **Why a separate contract follow-up?** The PR #1414 commit needed to land before this bump to avoid file-conflict with PR #1413 (which independently bumped v1.0.0 → v1.1.0 with FALSIFY-009). 2. **Why `binds_to: byte_format` and not `cli_signature`?** The wrapper doesn't add a new clap surface (PR-A already did that); it adds a new branch that emits files conforming to the existing byte-format equation. The new branch's verbatim f32 LE round-trip + NaN preservation is exactly the property `byte_format` invariants pin. 3. **Why PARTIAL_ALGORITHM_LEVEL not full discharge?** The 4 unit tests simulate the wrapper's byte-flow at the contract level using synthetic plans and fake logits — they do NOT instantiate a full AprTransformer or load a real APR model. Live discharge requires SHIP-007 PR-E. 4. **Why bump to v1.2.0?** Adding a new falsification test (FALSIFY-010) that binds an additional invariant is a minor schema change. Per semver, that's a minor bump. 5. **Why `pv validate` clean even with two new falsifiers in 24h?** The contract uses metadata.kind=schema, so falsification_tests entries are flexible; pv validates structure, IDs are unique, and binds_to references are valid. ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors - v1.0.0 → v1.1.0 (PR #1413, FALSIFY-009 binding apr_diff_values_compat) - v1.1.0 → v1.2.0 (this PR, FALSIFY-010 binding LmHead step-2 capture) ## Ship % update - MODEL-1: ~68% (unchanged — this is paperwork that records yesterday's algorithm-level discharge of step 2; the actual capture surface expansion happened in PR #1414). - MODEL-2: corpus tokenization at ~46.5M tokens / 56 min (steady ~14K tok/s); ~33h ETA for full 27 GB Stack v1.2 corpus. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…es (#1417) Before this PR, `apr trace --save-tensor <model>.apr` printed a stub message and never invoked the underlying wrapper — so the existing Embedding (PR #1408 step 1) and LmHead (PR #1414 step 2) capture surface was UNREACHABLE from the CLI. The wrapper, plan-builder, and APRT byte-format machinery had all merged but produced no files. This PR closes the dispatch gap. When `apr trace --save-tensor <stages>` is invoked on a `.apr` model, dispatch.rs now routes to a new `commands::trace_save_tensor::run_save_tensor_apr` function that: 1. Builds a `SaveTensorPlan` from `--save-tensor`/`--save-tensor-dir`/ `--save-tensor-layers`. Default output dir is `<model-stem>-trace/` next to the input. 2. Loads the APR model + embedded BPE tokenizer. 3. Encodes a fixed test prompt (`"What is 2+2?"` — same as `vector_stats.rs::run_traced_inference_apr` for consistency). 4. Calls `AprTransformer::forward_traced_with_save_tensor(&tokens, &plan)`. 5. Walks the output directory and prints every `*.bin` file with its size, plus the forward-pass success summary. `.gguf` and `.safetensors` paths still print the stub for now — SHIP-007 PR-E live diagnostics convert GGUF→APR at the import boundary so the canonical 7B teacher bisection runs through this code path. ## Five Whys 1. **Why was this missing?** Per the SHIP-007 PR-A commit message, the clap surface was shipped first as a contract pin so `apr-cli-trace-save-tensor-v1.yaml::cli_signature` was bound at the binary boundary. Subsequent PRs (B/C-prep/C-step1/C-step2) wired the library-side machinery. The dispatch glue was the missing final hop — easy to overlook because the contract test `apr trace --save-tensor --help | grep save-tensor` passed all along. 2. **Why is `.apr` only?** `forward_traced_with_save_tensor` is a method on `AprTransformer`; GGUF inference goes through a different `OwnedQuantizedModel` path. Adding GGUF support means either porting the wrapper to that type or converting GGUF→APR at import — the latter is what SHIP-007 PR-E already plans, so it's not blocking. 3. **Why a fixed prompt instead of `--prompt`?** SHIP-007 bisection needs the SAME prompt across APR and GGUF runs to make `apr diff --values` byte comparison meaningful. Hardcoding to `"What is 2+2?"` matches the existing `run_traced_inference_apr` in `vector_stats.rs`. A future `--prompt` flag is a small follow-up. 4. **Why a new module instead of extending `trace.rs`?** `trace.rs` is 722 lines already; adding the save-tensor branch via a new module (52 lines + 4 unit tests) keeps the existing 4-format dispatch intact and isolates the wrapper-specific imports (`realizar::inference_trace::save_tensor_plan::SaveTensorPlan`). 5. **Why now?** Operator's standing /loop directive is "select next best recommended choice". With PR #1414 (step 2 LmHead) merged today and PR #1416 (refactor prep) auto-merge queued, the highest-leverage move is making the existing capture surface actually work end-to-end. SHIP-007 PR-E (the live bisection) is gated on `apr trace --save-tensor` producing files; that gate is what this PR opens. ## Test plan - [x] `cargo test -p apr-cli --lib commands::trace_save_tensor` → 5/5 PASS - default_output_dir_uses_model_stem - default_output_dir_handles_bare_filename - default_output_dir_handles_no_extension - collect_bin_files_recurses_per_layer_subdirs - collect_bin_files_missing_dir_is_ok - [x] `cargo check -p apr-cli --lib` clean - [ ] Live smoke on canonical 7B teacher (operator-pre-authorized lambda-labs lane): `apr trace --payload /mnt/nvme-raid0/.../qwen2.5-coder-7b-instruct-q4k.apr --save-tensor embedding,lm_head` produces `<dir>/layer-0/embedding.bin` + `<dir>/lm_head.bin`. Deferred to a follow-up commit since the operator can run this any time after merge. - [ ] CI required checks (`ci / gate`, `workspace-test`) ## Ship % update - **MODEL-1**: ~68% → **~70%** (the wrapper surface that's been merged for 2 days is now actually invocable from the CLI, which means SHIP-007 PR-E live diagnostics are no longer blocked on this dispatch gap). - **MODEL-2**: corpus tokenization at ~96M tokens / 119 min (steady ~14K tok/s). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
… records CLI dispatch wire-up PARTIAL discharge Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to: cli_signature`. Before PR #1417, `apr trace --save-tensor` only printed a stub and never invoked `forward_traced_with_save_tensor`. The contract test `apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was already passing at the binary-boundary level — but the dispatch glue was missing, leaving Embedding + LmHead capture surface unreachable from the CLI for 2 days post-step-2 merge. FALSIFY-011 extends the existing `cli_signature` invariant from "the flag is recognized" to "the flag actually produces files". ## Five Whys 1. **Why a separate contract bump?** Avoids file-conflict with the in-flight refactor PR #1416 (which only touches `crates/aprender-serve/`). My contract change is isolated to `contracts/apr-cli-trace-save-tensor-v1.yaml`. 2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the byte format or determinism — it makes the CLI surface that the `cli_signature` equation already specified actually invocable. Same equation, expanded discharge level. 3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path resolution (3) and recursive *.bin walking (2) — algorithm-level. A live discharge against the canonical 7B teacher is operator- gated by post-merge smoke (~30s for a 7B forward + 2 file writes). 4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test that binds an existing invariant is a minor schema change per semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's discharge timeline: - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2) - v1.3.0 (this PR): cli_signature → end-to-end dispatch 5. **Why now?** Records the algorithm-level discharge so when the operator runs the live smoke post-#1417-merge, the contract ledger doesn't lag the code. Same paperwork pattern as #1415 (which followed #1414). ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors, 0 warnings ## Ship % update - MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417). - MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady ~14K tok/s; ~33h ETA total). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
… records CLI dispatch wire-up PARTIAL discharge Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to: cli_signature`. Before PR #1417, `apr trace --save-tensor` only printed a stub and never invoked `forward_traced_with_save_tensor`. The contract test `apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was already passing at the binary-boundary level — but the dispatch glue was missing, leaving Embedding + LmHead capture surface unreachable from the CLI for 2 days post-step-2 merge. FALSIFY-011 extends the existing `cli_signature` invariant from "the flag is recognized" to "the flag actually produces files". ## Five Whys 1. **Why a separate contract bump?** Avoids file-conflict with the in-flight refactor PR #1416 (which only touches `crates/aprender-serve/`). My contract change is isolated to `contracts/apr-cli-trace-save-tensor-v1.yaml`. 2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the byte format or determinism — it makes the CLI surface that the `cli_signature` equation already specified actually invocable. Same equation, expanded discharge level. 3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path resolution (3) and recursive *.bin walking (2) — algorithm-level. A live discharge against the canonical 7B teacher is operator- gated by post-merge smoke (~30s for a 7B forward + 2 file writes). 4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test that binds an existing invariant is a minor schema change per semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's discharge timeline: - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2) - v1.3.0 (this PR): cli_signature → end-to-end dispatch 5. **Why now?** Records the algorithm-level discharge so when the operator runs the live smoke post-#1417-merge, the contract ledger doesn't lag the code. Same paperwork pattern as #1415 (which followed #1414). ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors, 0 warnings ## Ship % update - MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417). - MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady ~14K tok/s; ~33h ETA total). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…7 forward_traced threading
Centralizes the boilerplate that `forward_traced_with_save_tensor`
inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007
PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through
`AprTransformer::forward_traced` itself, every per-layer stage will
call this single function instead of repeating the
`should_save → ensure_layer_dir → File::create → write_tensor_file →
flush` chain at each capture point.
## What changed
- New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`:
- `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage,
layer, values) -> io::Result<()>` — the gated entry point. Cheap
no-op when plan is None or stage/layer not selected. Forwards to
`write_stage_file` when the gate passes.
- `pub fn write_stage_file(output_dir, stage, layer, values) ->
io::Result<()>` — the unconditional write, exposed separately
for tests and any future callers that have already gated.
- 7 unit tests pinning: None=no-op, unselected-stage=no-op,
per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin
with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes
out-of-range, NaN-bit-preserving f32 LE round-trip, missing
parent dirs auto-created.
- `crates/aprender-serve/src/inference_trace/mod.rs`: register the
new module.
- `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`:
- Replace 60-line Embedding+LmHead inline blocks with two calls
to `maybe_save_stage`. Net 50-line shrink.
- Wrapper's behavior is byte-identical: same API surface, same
file layout, same NaN preservation. Existing 4
`traced_save_tensor_step2_tests` tests still PASS.
## Five Whys
1. **Why now?** PR #1414 (step 2) merged earlier today landed the
second copy of the inline block. Pre-step-3 is the right time to
factor — before 15 more capture points get added inside the
360-line `forward_traced` body.
2. **Why a new module instead of inlining in
`apr_transformer/`?** The helper has zero coupling to
`AprTransformer` (it takes a plan + stage + values). Living next
to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches
the existing `inference_trace::save_tensor_*` family pattern.
3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?**
Step 3 will thread this through `forward_traced`, which is also
called from non-instrumented contexts (HTTP serving, training
evals). The `Option` lets a single `forward_traced` body serve
both — `maybe_save_stage(None, ...)` is a single discriminant
compare in hot paths.
4. **Why expose `write_stage_file` separately from
`maybe_save_stage`?** Test ergonomics (the unit tests need to
verify the unconditional write path, not just the gated path)
and forward-compatibility for a future `forward_traced_inner`
that does its own `should_save` filtering inside a hot loop and
wants to skip the option indirection.
5. **Why no contract bump in this PR?** This is a pure refactor —
no behavior change, no new invariants. The existing
`byte_format`, `determinism`, and `apr_diff_values_compat`
invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through
exactly one function instead of two now, which makes future
contract obligations easier to satisfy. PR #1415 (already in
flight) bumps the contract for step 2; step 3 will bump again
to v1.3.0.
## Test plan
- [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` →
7/7 PASS
- [x] `cargo test -p aprender-serve --lib
traced_save_tensor_step2_tests` → 4/4 PASS (existing tests
unchanged behavior verified)
- [x] `cargo test -p aprender-serve --test
save_tensor_plan_integration` → 6/6 PASS (contract-level
integration unchanged)
- [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D
warnings` clean
## Ship % update
- MODEL-1: ~68% (unchanged — pure refactor; no new capture surface).
Step 3 (per-layer threading inside `forward_traced`) is now a
trivial follow-up: each capture point becomes one line:
`maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx,
&gate)?;`.
- MODEL-2: corpus tokenization still running (~83 min elapsed,
46.5M tokens, ~33h ETA).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
… records CLI dispatch wire-up PARTIAL discharge (#1418) Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to: cli_signature`. Before PR #1417, `apr trace --save-tensor` only printed a stub and never invoked `forward_traced_with_save_tensor`. The contract test `apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was already passing at the binary-boundary level — but the dispatch glue was missing, leaving Embedding + LmHead capture surface unreachable from the CLI for 2 days post-step-2 merge. FALSIFY-011 extends the existing `cli_signature` invariant from "the flag is recognized" to "the flag actually produces files". ## Five Whys 1. **Why a separate contract bump?** Avoids file-conflict with the in-flight refactor PR #1416 (which only touches `crates/aprender-serve/`). My contract change is isolated to `contracts/apr-cli-trace-save-tensor-v1.yaml`. 2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the byte format or determinism — it makes the CLI surface that the `cli_signature` equation already specified actually invocable. Same equation, expanded discharge level. 3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path resolution (3) and recursive *.bin walking (2) — algorithm-level. A live discharge against the canonical 7B teacher is operator- gated by post-merge smoke (~30s for a 7B forward + 2 file writes). 4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test that binds an existing invariant is a minor schema change per semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's discharge timeline: - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2) - v1.3.0 (this PR): cli_signature → end-to-end dispatch 5. **Why now?** Records the algorithm-level discharge so when the operator runs the live smoke post-#1417-merge, the contract ledger doesn't lag the code. Same paperwork pattern as #1415 (which followed #1414). ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors, 0 warnings ## Ship % update - MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417). - MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady ~14K tok/s; ~33h ETA total). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…7 forward_traced threading
Centralizes the boilerplate that `forward_traced_with_save_tensor`
inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007
PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through
`AprTransformer::forward_traced` itself, every per-layer stage will
call this single function instead of repeating the
`should_save → ensure_layer_dir → File::create → write_tensor_file →
flush` chain at each capture point.
## What changed
- New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`:
- `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage,
layer, values) -> io::Result<()>` — the gated entry point. Cheap
no-op when plan is None or stage/layer not selected. Forwards to
`write_stage_file` when the gate passes.
- `pub fn write_stage_file(output_dir, stage, layer, values) ->
io::Result<()>` — the unconditional write, exposed separately
for tests and any future callers that have already gated.
- 7 unit tests pinning: None=no-op, unselected-stage=no-op,
per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin
with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes
out-of-range, NaN-bit-preserving f32 LE round-trip, missing
parent dirs auto-created.
- `crates/aprender-serve/src/inference_trace/mod.rs`: register the
new module.
- `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`:
- Replace 60-line Embedding+LmHead inline blocks with two calls
to `maybe_save_stage`. Net 50-line shrink.
- Wrapper's behavior is byte-identical: same API surface, same
file layout, same NaN preservation. Existing 4
`traced_save_tensor_step2_tests` tests still PASS.
## Five Whys
1. **Why now?** PR #1414 (step 2) merged earlier today landed the
second copy of the inline block. Pre-step-3 is the right time to
factor — before 15 more capture points get added inside the
360-line `forward_traced` body.
2. **Why a new module instead of inlining in
`apr_transformer/`?** The helper has zero coupling to
`AprTransformer` (it takes a plan + stage + values). Living next
to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches
the existing `inference_trace::save_tensor_*` family pattern.
3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?**
Step 3 will thread this through `forward_traced`, which is also
called from non-instrumented contexts (HTTP serving, training
evals). The `Option` lets a single `forward_traced` body serve
both — `maybe_save_stage(None, ...)` is a single discriminant
compare in hot paths.
4. **Why expose `write_stage_file` separately from
`maybe_save_stage`?** Test ergonomics (the unit tests need to
verify the unconditional write path, not just the gated path)
and forward-compatibility for a future `forward_traced_inner`
that does its own `should_save` filtering inside a hot loop and
wants to skip the option indirection.
5. **Why no contract bump in this PR?** This is a pure refactor —
no behavior change, no new invariants. The existing
`byte_format`, `determinism`, and `apr_diff_values_compat`
invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through
exactly one function instead of two now, which makes future
contract obligations easier to satisfy. PR #1415 (already in
flight) bumps the contract for step 2; step 3 will bump again
to v1.3.0.
## Test plan
- [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` →
7/7 PASS
- [x] `cargo test -p aprender-serve --lib
traced_save_tensor_step2_tests` → 4/4 PASS (existing tests
unchanged behavior verified)
- [x] `cargo test -p aprender-serve --test
save_tensor_plan_integration` → 6/6 PASS (contract-level
integration unchanged)
- [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D
warnings` clean
## Ship % update
- MODEL-1: ~68% (unchanged — pure refactor; no new capture surface).
Step 3 (per-layer threading inside `forward_traced`) is now a
trivial follow-up: each capture point becomes one line:
`maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx,
&gate)?;`.
- MODEL-2: corpus tokenization still running (~83 min elapsed,
46.5M tokens, ~33h ETA).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 3, 2026
…7 forward_traced threading (#1416) * refactor(aprender-serve): extract maybe_save_stage helper for SHIP-007 forward_traced threading Centralizes the boilerplate that `forward_traced_with_save_tensor` inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007 PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through `AprTransformer::forward_traced` itself, every per-layer stage will call this single function instead of repeating the `should_save → ensure_layer_dir → File::create → write_tensor_file → flush` chain at each capture point. ## What changed - New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`: - `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage, layer, values) -> io::Result<()>` — the gated entry point. Cheap no-op when plan is None or stage/layer not selected. Forwards to `write_stage_file` when the gate passes. - `pub fn write_stage_file(output_dir, stage, layer, values) -> io::Result<()>` — the unconditional write, exposed separately for tests and any future callers that have already gated. - 7 unit tests pinning: None=no-op, unselected-stage=no-op, per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes out-of-range, NaN-bit-preserving f32 LE round-trip, missing parent dirs auto-created. - `crates/aprender-serve/src/inference_trace/mod.rs`: register the new module. - `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`: - Replace 60-line Embedding+LmHead inline blocks with two calls to `maybe_save_stage`. Net 50-line shrink. - Wrapper's behavior is byte-identical: same API surface, same file layout, same NaN preservation. Existing 4 `traced_save_tensor_step2_tests` tests still PASS. ## Five Whys 1. **Why now?** PR #1414 (step 2) merged earlier today landed the second copy of the inline block. Pre-step-3 is the right time to factor — before 15 more capture points get added inside the 360-line `forward_traced` body. 2. **Why a new module instead of inlining in `apr_transformer/`?** The helper has zero coupling to `AprTransformer` (it takes a plan + stage + values). Living next to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches the existing `inference_trace::save_tensor_*` family pattern. 3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?** Step 3 will thread this through `forward_traced`, which is also called from non-instrumented contexts (HTTP serving, training evals). The `Option` lets a single `forward_traced` body serve both — `maybe_save_stage(None, ...)` is a single discriminant compare in hot paths. 4. **Why expose `write_stage_file` separately from `maybe_save_stage`?** Test ergonomics (the unit tests need to verify the unconditional write path, not just the gated path) and forward-compatibility for a future `forward_traced_inner` that does its own `should_save` filtering inside a hot loop and wants to skip the option indirection. 5. **Why no contract bump in this PR?** This is a pure refactor — no behavior change, no new invariants. The existing `byte_format`, `determinism`, and `apr_diff_values_compat` invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through exactly one function instead of two now, which makes future contract obligations easier to satisfy. PR #1415 (already in flight) bumps the contract for step 2; step 3 will bump again to v1.3.0. ## Test plan - [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` → 7/7 PASS - [x] `cargo test -p aprender-serve --lib traced_save_tensor_step2_tests` → 4/4 PASS (existing tests unchanged behavior verified) - [x] `cargo test -p aprender-serve --test save_tensor_plan_integration` → 6/6 PASS (contract-level integration unchanged) - [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D warnings` clean ## Ship % update - MODEL-1: ~68% (unchanged — pure refactor; no new capture surface). Step 3 (per-layer threading inside `forward_traced`) is now a trivial follow-up: each capture point becomes one line: `maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx, &gate)?;`. - MODEL-2: corpus tokenization still running (~83 min elapsed, 46.5M tokens, ~33h ETA). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced (#1421) End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher captures all 16 stages in a single forward pass: - 14 per-layer: Embedding, AttnNorm, QkvMatmul, QkvBias, Attention, AttnOut, PostAttnResidual, FfnNorm, FfnGate, FfnUp, FfnSilu, FfnSwigl, FfnOut, PostFfnResidual - 2 whole-model: FinalNorm, LmHead Buffer sizes verified against Qwen2.5-Coder-7B config: - 100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header - 530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates) - 129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim) - 608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head) ## What changed `AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)` is the new private impl that threads the plan through every natural capture point. `forward_traced(tokens)` becomes a 1-line wrapper that calls it with `None` (zero-overhead — `maybe_save_stage` early-returns on `None`). `forward_traced_with_save_tensor` is now a pure delegator: no more double-embed or post-loop re-emission of LmHead. All 6 forward_traced regression tests pass. ## Five Whys 1. Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per `apr-cli-trace-save-tensor-v1.yaml`. 2. Why all 16 stages, not subset? The bisection target (root cause of MODEL-1 `apr run` divergence) is unknown; capturing every stage lets the operator element-wise diff against a reference at any of 16 points. 3. Why single-pass via `Option<&SaveTensorPlan>` (not separate method)? Avoids double-compute of embeddings and the wrapper's double-bookkeeping. DRY. 4. Why `Option<&SaveTensorPlan>` (not always-required)? Existing `forward_traced(tokens)` is called from many test sites and the trait `TracedForward`. Optional preserves the public API. 5. Why only inserted `maybe_save_stage` calls (no helpers refactor)? Each capture is a bare `emit(Stage, layer, &buf)?` — adding helpers would mask the natural buffer-name → stage-name pairing and make audit harder. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Extends the SHIP-007 PR-C-real step 1 wrapper (#1408) to capture the
LmHead(final logits) whole-model stage. The logits come straight fromtrace.logits(already returned byforward_traced) — no recompute, no surgery on the 360-line function body.This is the same low-risk pattern as step 1's
Embeddingbranch (re-use already-computed data; defer the high-risk per-layer threading intoforward_tracedto future steps).What changed
crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs:LmHeadbranch usingWHOLE_MODEL_LAYERsentinel.traced_save_tensor_step2_testsmodule pinning path layout, header sentinel, skip-when-unselected, and bit-identical logits round-trip (NaN-bit preserving).Embeddingbranch is byte-identical to before (only added a sibling block).Five Whys
SaveTensorStagevariants, onlyEmbeddingandLmHeadare externally re-extractable from aforward_tracedreturn value without modifying the function body. Step 1 shipped Embedding; LmHead is the obvious second.Option<&SaveTensorPlan>through the 360-line forward body — high blast radius, deserves its own PR.apr-cli-trace-save-tensor-v1byte_formatinvariant: whole-model stages carry0xFFFFFFFFin the layer field soapr diff --values(PR feat(apr-cli): SHIP-007 PR-D — apr diff --values recognizes APRT stage tensors #1413) can recognize them.AprTransformer? Loading APR is heavyweight; the wrapper's logic is three plan-API calls + a write. The 4 new pin tests simulate the byte-flow at the contract level. Live discharge is left to SHIP-007 PR-E.forward_traced_with_save_tensorwrapper #1408 (step 1) merged today; PR feat(apr-cli): SHIP-007 PR-D — apr diff --values recognizes APRT stage tensors #1413 (PR-D APRT diff) is in the merge queue. Perfeedback_model_1_ships_gpu_only.md, the SHIP-007 layer-0 stage-diff path needs the wrapper's capture surface expanded one stage at a time. LmHead is the smallest, safest next step.Test plan
cargo test -p aprender-serve --lib traced_save_tensor_step2→ 4/4 PASScargo check -p aprender-serve --libcleanci / gate,workspace-test)Contract
Contract update is intentionally deferred to a follow-up commit to avoid file-conflict with PR #1413 (mid-merge; bumps
apr-cli-trace-save-tensor-v1v1.0.0 → v1.1.0). Once #1413 lands, a small follow-up will bump v1.1.0 → v1.2.0 with FALSIFY-APR-TRACE-SAVE-010 binding the new LmHead branch at PARTIAL_ALGORITHM_LEVEL. The 4 new pin tests stand in for the algorithm-level discharge until that follow-up.Ship % update
🤖 Generated with Claude Code