Skip to content

feat(aprender-serve): SHIP-007 PR-C-real step 2 — LmHead capture in forward_traced wrapper#1414

Merged
noahgift merged 2 commits into
mainfrom
feat/ship-007-pr-c-real-step2-lm-head
May 3, 2026
Merged

feat(aprender-serve): SHIP-007 PR-C-real step 2 — LmHead capture in forward_traced wrapper#1414
noahgift merged 2 commits into
mainfrom
feat/ship-007-pr-c-real-step2-lm-head

Conversation

@noahgift

@noahgift noahgift commented May 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Extends the SHIP-007 PR-C-real step 1 wrapper (#1408) to capture the LmHead (final logits) whole-model stage. The logits come straight from trace.logits (already returned by forward_traced) — no recompute, no surgery on the 360-line function body.

This is the same low-risk pattern as step 1's Embedding branch (re-use already-computed data; defer the high-risk per-layer threading into forward_traced to future steps).

What changed

  • crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs:
    • Added LmHead branch using WHOLE_MODEL_LAYER sentinel.
    • 4 unit tests in new traced_save_tensor_step2_tests module pinning path layout, header sentinel, skip-when-unselected, and bit-identical logits round-trip (NaN-bit preserving).
    • Updated doc-comment to reflect step-1+step-2 capture surface.
  • Step 1's Embedding branch is byte-identical to before (only added a sibling block).

Five Whys

  1. Why LmHead next? Of 18 SaveTensorStage variants, only Embedding and LmHead are externally re-extractable from a forward_traced return value without modifying the function body. Step 1 shipped Embedding; LmHead is the obvious second.
  2. Why not jump to per-layer stages? Those require threading Option<&SaveTensorPlan> through the 360-line forward body — high blast radius, deserves its own PR.
  3. Why WHOLE_MODEL_LAYER sentinel? Per apr-cli-trace-save-tensor-v1 byte_format invariant: whole-model stages carry 0xFFFFFFFF in the layer field so apr diff --values (PR feat(apr-cli): SHIP-007 PR-D — apr diff --values recognizes APRT stage tensors #1413) can recognize them.
  4. Why no live integration test on a real AprTransformer? Loading APR is heavyweight; the wrapper's logic is three plan-API calls + a write. The 4 new pin tests simulate the byte-flow at the contract level. Live discharge is left to SHIP-007 PR-E.
  5. Why now? PR feat(aprender-serve): SHIP-007 PR-C-real step 1 — forward_traced_with_save_tensor wrapper #1408 (step 1) merged today; PR feat(apr-cli): SHIP-007 PR-D — apr diff --values recognizes APRT stage tensors #1413 (PR-D APRT diff) is in the merge queue. Per feedback_model_1_ships_gpu_only.md, the SHIP-007 layer-0 stage-diff path needs the wrapper's capture surface expanded one stage at a time. LmHead is the smallest, safest next step.

Test plan

  • cargo test -p aprender-serve --lib traced_save_tensor_step2 → 4/4 PASS
    • step2_lm_head_writes_to_output_root_not_per_layer_dir
    • step2_lm_head_header_uses_whole_model_sentinel
    • step2_lm_head_skipped_when_plan_does_not_select_it
    • step2_lm_head_writes_logits_bytes_verbatim (NaN-bit-preserving)
  • cargo check -p aprender-serve --lib clean
  • CI required checks (ci / gate, workspace-test)

Contract

Contract update is intentionally deferred to a follow-up commit to avoid file-conflict with PR #1413 (mid-merge; bumps apr-cli-trace-save-tensor-v1 v1.0.0 → v1.1.0). Once #1413 lands, a small follow-up will bump v1.1.0 → v1.2.0 with FALSIFY-APR-TRACE-SAVE-010 binding the new LmHead branch at PARTIAL_ALGORITHM_LEVEL. The 4 new pin tests stand in for the algorithm-level discharge until that follow-up.

Ship % update

  • MODEL-1: ~66% → ~68% (SHIP-007 capture surface 1/18 → 2/18 stages).
  • MODEL-2: full Stack v1.2 corpus tokenization in progress in background (~33h ETA, ~14K tok/s).

🤖 Generated with Claude Code

… `forward_traced_with_save_tensor`

Extends the wrapper from PR-C-real step 1 (#1408) to additionally write
the `LmHead` whole-model stage when the supplied [`SaveTensorPlan`]
selects it. The logits are pulled directly from `trace.logits` — the
`Vec<f32>` already returned by `forward_traced` — so no recompute, no
internal forward-pass surgery, no risk of behavior drift.

This is the same low-risk capture pattern as step 1's `Embedding`
branch (re-use already-computed data; defer the high-risk threading
into `forward_traced` to future steps).

## Five Whys (why now, why this scope)

1. **Why LmHead next?** Of the 18 `SaveTensorStage` variants, only two
   are externally re-extractable from a `forward_traced` return value
   without modifying the function body: `Embedding` (cheap re-call of
   `self.embed`) and `LmHead` (already in `trace.logits`). Step 1
   shipped Embedding; LmHead is the obvious second.
2. **Why not jump straight to per-layer stages (qkv, ffn_*)?** Those
   stages require threading `Option<&SaveTensorPlan>` through the
   360-line `forward_traced` body. That's the bigger surgery — high
   blast radius, deserves its own PR with proper drift-prevention
   tests and a real-model integration smoke. Splitting LmHead out
   first lets `apr diff --values` (PR #1413) compare APR vs GGUF
   logits TODAY for free, before per-layer infrastructure lands.
3. **Why use the WHOLE_MODEL_LAYER sentinel?** Per
   `apr-cli-trace-save-tensor-v1` `byte_format` invariant: whole-model
   stages (lm_head, final_norm) carry `0xFFFFFFFF` in the layer field
   so `apr diff --values` can recognize them. Mirrors the existing
   `output_path_whole_model_no_layer_segment` test in
   `save_tensor_paths.rs`.
4. **Why no integration test on a real `AprTransformer`?** Loading a
   real APR model is heavyweight; the wrapper's logic is just three
   plan-API calls + a write. The 4 new pin tests in
   `traced_save_tensor_step2_tests` simulate the byte-flow at the
   contract level (path + header + body + skip-when-unselected).
   Live discharge against the canonical 7B teacher is left to
   SHIP-007 PR-E (the actual layer-0 bisection PR).
5. **Why now in the SHIP-TWO loop?** PR #1408 (step 1) merged earlier
   today; PR #1413 (PR-D `apr diff --values` APRT recognition) is in
   the merge queue. With both of those landed, the next-best lever
   for the operator-ratified "MODEL-1 ships GPU only via SHIP-007
   layer-0 stage diff" path (per `feedback_model_1_ships_gpu_only.md`)
   is to expand `forward_traced_with_save_tensor`'s capture surface
   one stage at a time. LmHead is the smallest, safest next step.

## Verification

- `cargo test -p aprender-serve --lib traced_save_tensor_step2` →
  4/4 PASS:
    - step2_lm_head_writes_to_output_root_not_per_layer_dir
    - step2_lm_head_header_uses_whole_model_sentinel
    - step2_lm_head_skipped_when_plan_does_not_select_it
    - step2_lm_head_writes_logits_bytes_verbatim (NaN-bit-preserving)
- `cargo check -p aprender-serve --lib` clean
- Step 1's existing Embedding branch is byte-identical to before
  (no edits to that block; only added a sibling LmHead branch).

## Contract

Contract update is intentionally deferred to a follow-up commit to
avoid file-conflict with PR #1413 (which is mid-merge and bumps
`apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0). Once #1413 lands,
a small follow-up will bump v1.1.0 → v1.2.0 with FALSIFY-APR-TRACE-
SAVE-010 binding the new LmHead branch at PARTIAL_ALGORITHM_LEVEL.
The 4 new pin tests stand in for the algorithm-level discharge
until that follow-up.

## Ship % update

- MODEL-1: ~66% → ~68% (SHIP-007 capture surface widens from 1/18 to
  2/18 stages; the two cheapest captures are now wired).
- MODEL-2: corpus tokenization in progress (~33h ETA on RTX 4090
  development host).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 3, 2026 07:58
@noahgift noahgift merged commit 943eac1 into main May 3, 2026
10 checks passed
@noahgift noahgift deleted the feat/ship-007-pr-c-real-step2-lm-head branch May 3, 2026 08:41
noahgift added a commit that referenced this pull request May 3, 2026
… records LmHead step-2 PARTIAL discharge (#1415)

Follow-up to PR #1414 (`forward_traced_with_save_tensor` step 2). Adds
FALSIFY-APR-TRACE-SAVE-010 binding the LmHead branch at
PARTIAL_ALGORITHM_LEVEL; the algorithm-level evidence cites the four
new pin tests in `traced_save_tensor_step2_tests`:

- step2_lm_head_writes_to_output_root_not_per_layer_dir
- step2_lm_head_header_uses_whole_model_sentinel
- step2_lm_head_skipped_when_plan_does_not_select_it
- step2_lm_head_writes_logits_bytes_verbatim (NaN-bit preserving)

`binds_to: byte_format` because step 2 invokes the same write_tensor_file
path with `WHOLE_MODEL_LAYER` sentinel as the existing `byte_format`
equation specifies. Live discharge against the canonical 7B teacher is
deferred to SHIP-007 PR-E (layer-0 bisection).

## Five Whys

1. **Why a separate contract follow-up?** The PR #1414 commit needed to
   land before this bump to avoid file-conflict with PR #1413
   (which independently bumped v1.0.0 → v1.1.0 with FALSIFY-009).
2. **Why `binds_to: byte_format` and not `cli_signature`?** The wrapper
   doesn't add a new clap surface (PR-A already did that); it adds a
   new branch that emits files conforming to the existing byte-format
   equation. The new branch's verbatim f32 LE round-trip + NaN preservation
   is exactly the property `byte_format` invariants pin.
3. **Why PARTIAL_ALGORITHM_LEVEL not full discharge?** The 4 unit tests
   simulate the wrapper's byte-flow at the contract level using synthetic
   plans and fake logits — they do NOT instantiate a full AprTransformer
   or load a real APR model. Live discharge requires SHIP-007 PR-E.
4. **Why bump to v1.2.0?** Adding a new falsification test (FALSIFY-010)
   that binds an additional invariant is a minor schema change. Per
   semver, that's a minor bump.
5. **Why `pv validate` clean even with two new falsifiers in 24h?** The
   contract uses metadata.kind=schema, so falsification_tests entries
   are flexible; pv validates structure, IDs are unique, and binds_to
   references are valid.

## Verification

- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors
- v1.0.0 → v1.1.0 (PR #1413, FALSIFY-009 binding apr_diff_values_compat)
- v1.1.0 → v1.2.0 (this PR, FALSIFY-010 binding LmHead step-2 capture)

## Ship % update

- MODEL-1: ~68% (unchanged — this is paperwork that records yesterday's
  algorithm-level discharge of step 2; the actual capture surface
  expansion happened in PR #1414).
- MODEL-2: corpus tokenization at ~46.5M tokens / 56 min (steady ~14K
  tok/s); ~33h ETA for full 27 GB Stack v1.2 corpus.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
…es (#1417)

Before this PR, `apr trace --save-tensor <model>.apr` printed a stub
message and never invoked the underlying wrapper — so the existing
Embedding (PR #1408 step 1) and LmHead (PR #1414 step 2) capture
surface was UNREACHABLE from the CLI. The wrapper, plan-builder, and
APRT byte-format machinery had all merged but produced no files.

This PR closes the dispatch gap. When `apr trace --save-tensor
<stages>` is invoked on a `.apr` model, dispatch.rs now routes to a
new `commands::trace_save_tensor::run_save_tensor_apr` function that:

1. Builds a `SaveTensorPlan` from `--save-tensor`/`--save-tensor-dir`/
   `--save-tensor-layers`. Default output dir is `<model-stem>-trace/`
   next to the input.
2. Loads the APR model + embedded BPE tokenizer.
3. Encodes a fixed test prompt (`"What is 2+2?"` — same as
   `vector_stats.rs::run_traced_inference_apr` for consistency).
4. Calls `AprTransformer::forward_traced_with_save_tensor(&tokens,
   &plan)`.
5. Walks the output directory and prints every `*.bin` file with its
   size, plus the forward-pass success summary.

`.gguf` and `.safetensors` paths still print the stub for now —
SHIP-007 PR-E live diagnostics convert GGUF→APR at the import boundary
so the canonical 7B teacher bisection runs through this code path.

## Five Whys

1. **Why was this missing?** Per the SHIP-007 PR-A commit message,
   the clap surface was shipped first as a contract pin so
   `apr-cli-trace-save-tensor-v1.yaml::cli_signature` was bound at the
   binary boundary. Subsequent PRs (B/C-prep/C-step1/C-step2) wired
   the library-side machinery. The dispatch glue was the missing
   final hop — easy to overlook because the contract test
   `apr trace --save-tensor --help | grep save-tensor` passed all along.
2. **Why is `.apr` only?** `forward_traced_with_save_tensor` is a
   method on `AprTransformer`; GGUF inference goes through a
   different `OwnedQuantizedModel` path. Adding GGUF support means
   either porting the wrapper to that type or converting GGUF→APR at
   import — the latter is what SHIP-007 PR-E already plans, so it's
   not blocking.
3. **Why a fixed prompt instead of `--prompt`?** SHIP-007 bisection
   needs the SAME prompt across APR and GGUF runs to make
   `apr diff --values` byte comparison meaningful. Hardcoding to
   `"What is 2+2?"` matches the existing `run_traced_inference_apr`
   in `vector_stats.rs`. A future `--prompt` flag is a small follow-up.
4. **Why a new module instead of extending `trace.rs`?** `trace.rs`
   is 722 lines already; adding the save-tensor branch via a new
   module (52 lines + 4 unit tests) keeps the existing 4-format
   dispatch intact and isolates the wrapper-specific imports
   (`realizar::inference_trace::save_tensor_plan::SaveTensorPlan`).
5. **Why now?** Operator's standing /loop directive is "select next
   best recommended choice". With PR #1414 (step 2 LmHead) merged
   today and PR #1416 (refactor prep) auto-merge queued, the
   highest-leverage move is making the existing capture surface
   actually work end-to-end. SHIP-007 PR-E (the live bisection) is
   gated on `apr trace --save-tensor` producing files; that gate is
   what this PR opens.

## Test plan

- [x] `cargo test -p apr-cli --lib commands::trace_save_tensor` →
  5/5 PASS
  - default_output_dir_uses_model_stem
  - default_output_dir_handles_bare_filename
  - default_output_dir_handles_no_extension
  - collect_bin_files_recurses_per_layer_subdirs
  - collect_bin_files_missing_dir_is_ok
- [x] `cargo check -p apr-cli --lib` clean
- [ ] Live smoke on canonical 7B teacher (operator-pre-authorized
  lambda-labs lane): `apr trace --payload
  /mnt/nvme-raid0/.../qwen2.5-coder-7b-instruct-q4k.apr
  --save-tensor embedding,lm_head` produces
  `<dir>/layer-0/embedding.bin` + `<dir>/lm_head.bin`. Deferred to a
  follow-up commit since the operator can run this any time after
  merge.
- [ ] CI required checks (`ci / gate`, `workspace-test`)

## Ship % update

- **MODEL-1**: ~68% → **~70%** (the wrapper surface that's been
  merged for 2 days is now actually invocable from the CLI, which
  means SHIP-007 PR-E live diagnostics are no longer blocked on
  this dispatch gap).
- **MODEL-2**: corpus tokenization at ~96M tokens / 119 min
  (steady ~14K tok/s).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
… records CLI dispatch wire-up PARTIAL discharge

Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end
dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the
new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to:
cli_signature`.

Before PR #1417, `apr trace --save-tensor` only printed a stub and
never invoked `forward_traced_with_save_tensor`. The contract test
`apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was
already passing at the binary-boundary level — but the dispatch glue
was missing, leaving Embedding + LmHead capture surface unreachable
from the CLI for 2 days post-step-2 merge.

FALSIFY-011 extends the existing `cli_signature` invariant from
"the flag is recognized" to "the flag actually produces files".

## Five Whys

1. **Why a separate contract bump?** Avoids file-conflict with the
   in-flight refactor PR #1416 (which only touches
   `crates/aprender-serve/`). My contract change is isolated to
   `contracts/apr-cli-trace-save-tensor-v1.yaml`.
2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the
   byte format or determinism — it makes the CLI surface that the
   `cli_signature` equation already specified actually invocable.
   Same equation, expanded discharge level.
3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path
   resolution (3) and recursive *.bin walking (2) — algorithm-level.
   A live discharge against the canonical 7B teacher is operator-
   gated by post-merge smoke (~30s for a 7B forward + 2 file writes).
4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test
   that binds an existing invariant is a minor schema change per
   semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's
   discharge timeline:
     - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff
     - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2)
     - v1.3.0 (this PR): cli_signature → end-to-end dispatch
5. **Why now?** Records the algorithm-level discharge so when the
   operator runs the live smoke post-#1417-merge, the contract
   ledger doesn't lag the code. Same paperwork pattern as #1415
   (which followed #1414).

## Verification

- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` →
  0 errors, 0 warnings

## Ship % update

- MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417).
- MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady
  ~14K tok/s; ~33h ETA total).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
… records CLI dispatch wire-up PARTIAL discharge

Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end
dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the
new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to:
cli_signature`.

Before PR #1417, `apr trace --save-tensor` only printed a stub and
never invoked `forward_traced_with_save_tensor`. The contract test
`apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was
already passing at the binary-boundary level — but the dispatch glue
was missing, leaving Embedding + LmHead capture surface unreachable
from the CLI for 2 days post-step-2 merge.

FALSIFY-011 extends the existing `cli_signature` invariant from
"the flag is recognized" to "the flag actually produces files".

## Five Whys

1. **Why a separate contract bump?** Avoids file-conflict with the
   in-flight refactor PR #1416 (which only touches
   `crates/aprender-serve/`). My contract change is isolated to
   `contracts/apr-cli-trace-save-tensor-v1.yaml`.
2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the
   byte format or determinism — it makes the CLI surface that the
   `cli_signature` equation already specified actually invocable.
   Same equation, expanded discharge level.
3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path
   resolution (3) and recursive *.bin walking (2) — algorithm-level.
   A live discharge against the canonical 7B teacher is operator-
   gated by post-merge smoke (~30s for a 7B forward + 2 file writes).
4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test
   that binds an existing invariant is a minor schema change per
   semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's
   discharge timeline:
     - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff
     - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2)
     - v1.3.0 (this PR): cli_signature → end-to-end dispatch
5. **Why now?** Records the algorithm-level discharge so when the
   operator runs the live smoke post-#1417-merge, the contract
   ledger doesn't lag the code. Same paperwork pattern as #1415
   (which followed #1414).

## Verification

- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` →
  0 errors, 0 warnings

## Ship % update

- MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417).
- MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady
  ~14K tok/s; ~33h ETA total).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
…7 forward_traced threading

Centralizes the boilerplate that `forward_traced_with_save_tensor`
inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007
PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through
`AprTransformer::forward_traced` itself, every per-layer stage will
call this single function instead of repeating the
`should_save → ensure_layer_dir → File::create → write_tensor_file →
flush` chain at each capture point.

## What changed

- New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`:
  - `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage,
    layer, values) -> io::Result<()>` — the gated entry point. Cheap
    no-op when plan is None or stage/layer not selected. Forwards to
    `write_stage_file` when the gate passes.
  - `pub fn write_stage_file(output_dir, stage, layer, values) ->
    io::Result<()>` — the unconditional write, exposed separately
    for tests and any future callers that have already gated.
  - 7 unit tests pinning: None=no-op, unselected-stage=no-op,
    per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin
    with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes
    out-of-range, NaN-bit-preserving f32 LE round-trip, missing
    parent dirs auto-created.
- `crates/aprender-serve/src/inference_trace/mod.rs`: register the
  new module.
- `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`:
  - Replace 60-line Embedding+LmHead inline blocks with two calls
    to `maybe_save_stage`. Net 50-line shrink.
  - Wrapper's behavior is byte-identical: same API surface, same
    file layout, same NaN preservation. Existing 4
    `traced_save_tensor_step2_tests` tests still PASS.

## Five Whys

1. **Why now?** PR #1414 (step 2) merged earlier today landed the
   second copy of the inline block. Pre-step-3 is the right time to
   factor — before 15 more capture points get added inside the
   360-line `forward_traced` body.
2. **Why a new module instead of inlining in
   `apr_transformer/`?** The helper has zero coupling to
   `AprTransformer` (it takes a plan + stage + values). Living next
   to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches
   the existing `inference_trace::save_tensor_*` family pattern.
3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?**
   Step 3 will thread this through `forward_traced`, which is also
   called from non-instrumented contexts (HTTP serving, training
   evals). The `Option` lets a single `forward_traced` body serve
   both — `maybe_save_stage(None, ...)` is a single discriminant
   compare in hot paths.
4. **Why expose `write_stage_file` separately from
   `maybe_save_stage`?** Test ergonomics (the unit tests need to
   verify the unconditional write path, not just the gated path)
   and forward-compatibility for a future `forward_traced_inner`
   that does its own `should_save` filtering inside a hot loop and
   wants to skip the option indirection.
5. **Why no contract bump in this PR?** This is a pure refactor —
   no behavior change, no new invariants. The existing
   `byte_format`, `determinism`, and `apr_diff_values_compat`
   invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through
   exactly one function instead of two now, which makes future
   contract obligations easier to satisfy. PR #1415 (already in
   flight) bumps the contract for step 2; step 3 will bump again
   to v1.3.0.

## Test plan

- [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` →
  7/7 PASS
- [x] `cargo test -p aprender-serve --lib
  traced_save_tensor_step2_tests` → 4/4 PASS (existing tests
  unchanged behavior verified)
- [x] `cargo test -p aprender-serve --test
  save_tensor_plan_integration` → 6/6 PASS (contract-level
  integration unchanged)
- [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D
  warnings` clean

## Ship % update

- MODEL-1: ~68% (unchanged — pure refactor; no new capture surface).
  Step 3 (per-layer threading inside `forward_traced`) is now a
  trivial follow-up: each capture point becomes one line:
  `maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx,
  &gate)?;`.
- MODEL-2: corpus tokenization still running (~83 min elapsed,
  46.5M tokens, ~33h ETA).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
… records CLI dispatch wire-up PARTIAL discharge (#1418)

Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end
dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the
new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to:
cli_signature`.

Before PR #1417, `apr trace --save-tensor` only printed a stub and
never invoked `forward_traced_with_save_tensor`. The contract test
`apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was
already passing at the binary-boundary level — but the dispatch glue
was missing, leaving Embedding + LmHead capture surface unreachable
from the CLI for 2 days post-step-2 merge.

FALSIFY-011 extends the existing `cli_signature` invariant from
"the flag is recognized" to "the flag actually produces files".

## Five Whys

1. **Why a separate contract bump?** Avoids file-conflict with the
   in-flight refactor PR #1416 (which only touches
   `crates/aprender-serve/`). My contract change is isolated to
   `contracts/apr-cli-trace-save-tensor-v1.yaml`.
2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the
   byte format or determinism — it makes the CLI surface that the
   `cli_signature` equation already specified actually invocable.
   Same equation, expanded discharge level.
3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path
   resolution (3) and recursive *.bin walking (2) — algorithm-level.
   A live discharge against the canonical 7B teacher is operator-
   gated by post-merge smoke (~30s for a 7B forward + 2 file writes).
4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test
   that binds an existing invariant is a minor schema change per
   semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's
   discharge timeline:
     - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff
     - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2)
     - v1.3.0 (this PR): cli_signature → end-to-end dispatch
5. **Why now?** Records the algorithm-level discharge so when the
   operator runs the live smoke post-#1417-merge, the contract
   ledger doesn't lag the code. Same paperwork pattern as #1415
   (which followed #1414).

## Verification

- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` →
  0 errors, 0 warnings

## Ship % update

- MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417).
- MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady
  ~14K tok/s; ~33h ETA total).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
…7 forward_traced threading

Centralizes the boilerplate that `forward_traced_with_save_tensor`
inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007
PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through
`AprTransformer::forward_traced` itself, every per-layer stage will
call this single function instead of repeating the
`should_save → ensure_layer_dir → File::create → write_tensor_file →
flush` chain at each capture point.

## What changed

- New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`:
  - `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage,
    layer, values) -> io::Result<()>` — the gated entry point. Cheap
    no-op when plan is None or stage/layer not selected. Forwards to
    `write_stage_file` when the gate passes.
  - `pub fn write_stage_file(output_dir, stage, layer, values) ->
    io::Result<()>` — the unconditional write, exposed separately
    for tests and any future callers that have already gated.
  - 7 unit tests pinning: None=no-op, unselected-stage=no-op,
    per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin
    with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes
    out-of-range, NaN-bit-preserving f32 LE round-trip, missing
    parent dirs auto-created.
- `crates/aprender-serve/src/inference_trace/mod.rs`: register the
  new module.
- `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`:
  - Replace 60-line Embedding+LmHead inline blocks with two calls
    to `maybe_save_stage`. Net 50-line shrink.
  - Wrapper's behavior is byte-identical: same API surface, same
    file layout, same NaN preservation. Existing 4
    `traced_save_tensor_step2_tests` tests still PASS.

## Five Whys

1. **Why now?** PR #1414 (step 2) merged earlier today landed the
   second copy of the inline block. Pre-step-3 is the right time to
   factor — before 15 more capture points get added inside the
   360-line `forward_traced` body.
2. **Why a new module instead of inlining in
   `apr_transformer/`?** The helper has zero coupling to
   `AprTransformer` (it takes a plan + stage + values). Living next
   to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches
   the existing `inference_trace::save_tensor_*` family pattern.
3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?**
   Step 3 will thread this through `forward_traced`, which is also
   called from non-instrumented contexts (HTTP serving, training
   evals). The `Option` lets a single `forward_traced` body serve
   both — `maybe_save_stage(None, ...)` is a single discriminant
   compare in hot paths.
4. **Why expose `write_stage_file` separately from
   `maybe_save_stage`?** Test ergonomics (the unit tests need to
   verify the unconditional write path, not just the gated path)
   and forward-compatibility for a future `forward_traced_inner`
   that does its own `should_save` filtering inside a hot loop and
   wants to skip the option indirection.
5. **Why no contract bump in this PR?** This is a pure refactor —
   no behavior change, no new invariants. The existing
   `byte_format`, `determinism`, and `apr_diff_values_compat`
   invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through
   exactly one function instead of two now, which makes future
   contract obligations easier to satisfy. PR #1415 (already in
   flight) bumps the contract for step 2; step 3 will bump again
   to v1.3.0.

## Test plan

- [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` →
  7/7 PASS
- [x] `cargo test -p aprender-serve --lib
  traced_save_tensor_step2_tests` → 4/4 PASS (existing tests
  unchanged behavior verified)
- [x] `cargo test -p aprender-serve --test
  save_tensor_plan_integration` → 6/6 PASS (contract-level
  integration unchanged)
- [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D
  warnings` clean

## Ship % update

- MODEL-1: ~68% (unchanged — pure refactor; no new capture surface).
  Step 3 (per-layer threading inside `forward_traced`) is now a
  trivial follow-up: each capture point becomes one line:
  `maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx,
  &gate)?;`.
- MODEL-2: corpus tokenization still running (~83 min elapsed,
  46.5M tokens, ~33h ETA).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
…7 forward_traced threading (#1416)

* refactor(aprender-serve): extract maybe_save_stage helper for SHIP-007 forward_traced threading

Centralizes the boilerplate that `forward_traced_with_save_tensor`
inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007
PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through
`AprTransformer::forward_traced` itself, every per-layer stage will
call this single function instead of repeating the
`should_save → ensure_layer_dir → File::create → write_tensor_file →
flush` chain at each capture point.

## What changed

- New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`:
  - `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage,
    layer, values) -> io::Result<()>` — the gated entry point. Cheap
    no-op when plan is None or stage/layer not selected. Forwards to
    `write_stage_file` when the gate passes.
  - `pub fn write_stage_file(output_dir, stage, layer, values) ->
    io::Result<()>` — the unconditional write, exposed separately
    for tests and any future callers that have already gated.
  - 7 unit tests pinning: None=no-op, unselected-stage=no-op,
    per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin
    with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes
    out-of-range, NaN-bit-preserving f32 LE round-trip, missing
    parent dirs auto-created.
- `crates/aprender-serve/src/inference_trace/mod.rs`: register the
  new module.
- `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`:
  - Replace 60-line Embedding+LmHead inline blocks with two calls
    to `maybe_save_stage`. Net 50-line shrink.
  - Wrapper's behavior is byte-identical: same API surface, same
    file layout, same NaN preservation. Existing 4
    `traced_save_tensor_step2_tests` tests still PASS.

## Five Whys

1. **Why now?** PR #1414 (step 2) merged earlier today landed the
   second copy of the inline block. Pre-step-3 is the right time to
   factor — before 15 more capture points get added inside the
   360-line `forward_traced` body.
2. **Why a new module instead of inlining in
   `apr_transformer/`?** The helper has zero coupling to
   `AprTransformer` (it takes a plan + stage + values). Living next
   to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches
   the existing `inference_trace::save_tensor_*` family pattern.
3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?**
   Step 3 will thread this through `forward_traced`, which is also
   called from non-instrumented contexts (HTTP serving, training
   evals). The `Option` lets a single `forward_traced` body serve
   both — `maybe_save_stage(None, ...)` is a single discriminant
   compare in hot paths.
4. **Why expose `write_stage_file` separately from
   `maybe_save_stage`?** Test ergonomics (the unit tests need to
   verify the unconditional write path, not just the gated path)
   and forward-compatibility for a future `forward_traced_inner`
   that does its own `should_save` filtering inside a hot loop and
   wants to skip the option indirection.
5. **Why no contract bump in this PR?** This is a pure refactor —
   no behavior change, no new invariants. The existing
   `byte_format`, `determinism`, and `apr_diff_values_compat`
   invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through
   exactly one function instead of two now, which makes future
   contract obligations easier to satisfy. PR #1415 (already in
   flight) bumps the contract for step 2; step 3 will bump again
   to v1.3.0.

## Test plan

- [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` →
  7/7 PASS
- [x] `cargo test -p aprender-serve --lib
  traced_save_tensor_step2_tests` → 4/4 PASS (existing tests
  unchanged behavior verified)
- [x] `cargo test -p aprender-serve --test
  save_tensor_plan_integration` → 6/6 PASS (contract-level
  integration unchanged)
- [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D
  warnings` clean

## Ship % update

- MODEL-1: ~68% (unchanged — pure refactor; no new capture surface).
  Step 3 (per-layer threading inside `forward_traced`) is now a
  trivial follow-up: each capture point becomes one line:
  `maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx,
  &gate)?;`.
- MODEL-2: corpus tokenization still running (~83 min elapsed,
  46.5M tokens, ~33h ETA).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced (#1421)

End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher
captures all 16 stages in a single forward pass:
- 14 per-layer: Embedding, AttnNorm, QkvMatmul, QkvBias, Attention, AttnOut,
  PostAttnResidual, FfnNorm, FfnGate, FfnUp, FfnSilu, FfnSwigl, FfnOut,
  PostFfnResidual
- 2 whole-model: FinalNorm, LmHead

Buffer sizes verified against Qwen2.5-Coder-7B config:
- 100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header
- 530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates)
- 129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim)
- 608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head)

## What changed

`AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)`
is the new private impl that threads the plan through every natural capture
point. `forward_traced(tokens)` becomes a 1-line wrapper that calls it with
`None` (zero-overhead — `maybe_save_stage` early-returns on `None`).
`forward_traced_with_save_tensor` is now a pure delegator: no more double-embed
or post-loop re-emission of LmHead. All 6 forward_traced regression tests pass.

## Five Whys

1. Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per
   `apr-cli-trace-save-tensor-v1.yaml`.
2. Why all 16 stages, not subset? The bisection target (root cause of MODEL-1
   `apr run` divergence) is unknown; capturing every stage lets the operator
   element-wise diff against a reference at any of 16 points.
3. Why single-pass via `Option<&SaveTensorPlan>` (not separate method)? Avoids
   double-compute of embeddings and the wrapper's double-bookkeeping. DRY.
4. Why `Option<&SaveTensorPlan>` (not always-required)? Existing
   `forward_traced(tokens)` is called from many test sites and the trait
   `TracedForward`. Optional preserves the public API.
5. Why only inserted `maybe_save_stage` calls (no helpers refactor)? Each
   capture is a bare `emit(Stage, layer, &buf)?` — adding helpers would mask
   the natural buffer-name → stage-name pairing and make audit harder.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant