contract(apr-cli-trace-save-tensor-v1): v1.1.0 → v1.2.0 — FALSIFY-010 records LmHead step-2 PARTIAL #1415

Merged
noahgift merged 2 commits into main from chore/contract-trace-save-tensor-v1.2.0-step2
May 3, 2026

@noahgift noahgift commented May 3, 2026

Summary

Follow-up paperwork to PR #1414 (SHIP-007 PR-C-real step 2 — LmHead capture). Adds FALSIFY-APR-TRACE-SAVE-010 binding the LmHead branch at PARTIAL_ALGORITHM_LEVEL with explicit algorithm-level evidence pointing at the four new pin tests in `traced_save_tensor_step2_tests`.

The bump was deferred from PR #1414 itself to avoid file-conflict with PR #1413 (which independently bumped v1.0.0 → v1.1.0 with FALSIFY-009). With #1413 now merged on main, this is the natural follow-up.

What changed

  • `contracts/apr-cli-trace-save-tensor-v1.yaml` v1.1.0 → v1.2.0
  • New FALSIFY-APR-TRACE-SAVE-010 binding `byte_format` (the equation that already specifies WHOLE_MODEL_LAYER + f32 LE + NaN preservation — exactly what step 2's LmHead branch invokes via `write_tensor_file`).
  • 6-line `algorithm_evidence` block citing the 4 unit tests + impl path + deferred-live-discharge note pointing at SHIP-007 PR-E.

Test plan

  • `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors
  • CI required checks (`ci / gate`, `workspace-test`)

Five Whys

  1. Why a separate contract follow-up? The contract bump could not ride along with PR #1414 (SHIP-007 PR-C-real step 2, LmHead capture in the `forward_traced` wrapper): PR #1413 (SHIP-007 PR-D, `apr diff --values` recognizes APRT stage tensors) had to land first to avoid a file conflict on `metadata.version`.
  2. Why `binds_to: byte_format`? Step 2 doesn't add a new clap surface; it invokes the same write path with `WHOLE_MODEL_LAYER` sentinel that `byte_format` invariants already specify (NaN-bit preservation, f32 LE, 12-byte header).
  3. Why PARTIAL_ALGORITHM_LEVEL not full? Pin tests use synthetic plans + fake logits; live discharge against canonical 7B teacher is deferred to SHIP-007 PR-E (layer-0 bisection).
  4. Why v1.2.0? Adding a new falsification test that binds an existing invariant is a minor schema change per semver.
  5. Why now? Records the algorithm-level discharge while the operator is still building MODEL-2 corpus context, keeping the contract ledger in sync with what is already in code (`forward_traced_with_save_tensor` step 2 in PR #1414).

Ship % update

🤖 Generated with Claude Code

… records LmHead step-2 PARTIAL discharge

Follow-up to PR #1414 (`forward_traced_with_save_tensor` step 2). Adds
FALSIFY-APR-TRACE-SAVE-010 binding the LmHead branch at
PARTIAL_ALGORITHM_LEVEL; the algorithm-level evidence cites the four
new pin tests in `traced_save_tensor_step2_tests`:

- step2_lm_head_writes_to_output_root_not_per_layer_dir
- step2_lm_head_header_uses_whole_model_sentinel
- step2_lm_head_skipped_when_plan_does_not_select_it
- step2_lm_head_writes_logits_bytes_verbatim (NaN-bit preserving)

`binds_to: byte_format` because step 2 invokes the same write_tensor_file
path with `WHOLE_MODEL_LAYER` sentinel as the existing `byte_format`
equation specifies. Live discharge against the canonical 7B teacher is
deferred to SHIP-007 PR-E (layer-0 bisection).
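
The `byte_format` properties cited above (12-byte header, f32 little-endian payload, NaN-bit preservation, whole-model sentinel) can be illustrated with a minimal sketch. The header layout used here (4-byte `APRT` magic, u32 layer, u32 element count) and the sentinel value `u32::MAX` are assumptions for illustration only; this PR states just that the header is 12 bytes. `encode_tensor` and `decode_payload` are hypothetical names, not the crate's `write_tensor_file`.

```rust
use std::io::{self, Write};

/// Hypothetical sentinel for whole-model stages; the real value is not given in this PR.
pub const WHOLE_MODEL_LAYER: u32 = u32::MAX;

/// Writes an assumed 12-byte header (magic, layer, element count) followed by
/// the values as f32 little-endian. `to_le_bytes` copies the raw bit pattern,
/// so NaN payload bits are written unchanged.
pub fn encode_tensor<W: Write>(mut out: W, layer: u32, values: &[f32]) -> io::Result<()> {
    out.write_all(b"APRT")?;
    out.write_all(&layer.to_le_bytes())?;
    out.write_all(&(values.len() as u32).to_le_bytes())?;
    for v in values {
        out.write_all(&v.to_le_bytes())?;
    }
    out.flush()
}

/// Reads the payload back, skipping the assumed 12-byte header.
/// Panics if `bytes` is shorter than the header.
pub fn decode_payload(bytes: &[u8]) -> Vec<f32> {
    bytes[12..]
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

Comparing `to_bits()` rather than the float values is what makes a NaN check meaningful, since `NaN != NaN` under ordinary float equality.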

## Five Whys

1. **Why a separate contract follow-up?** The bump was deferred from
   PR #1414 to avoid a file conflict with PR #1413 (which independently
   bumped v1.0.0 → v1.1.0 with FALSIFY-009); with #1413 merged, this
   follow-up records it.
2. **Why `binds_to: byte_format` and not `cli_signature`?** The wrapper
   doesn't add a new clap surface (PR-A already did that); it adds a
   new branch that emits files conforming to the existing byte-format
   equation. The new branch's verbatim f32 LE round-trip + NaN preservation
   is exactly the property `byte_format` invariants pin.
3. **Why PARTIAL_ALGORITHM_LEVEL not full discharge?** The 4 unit tests
   simulate the wrapper's byte-flow at the contract level using synthetic
   plans and fake logits — they do NOT instantiate a full AprTransformer
   or load a real APR model. Live discharge requires SHIP-007 PR-E.
4. **Why bump to v1.2.0?** Adding a new falsification test (FALSIFY-010)
   that binds an existing invariant (`byte_format`) is a
   backward-compatible addition, which semver treats as a minor bump.
5. **Why is `pv validate` clean even with two new falsifiers in 24h?** The
   contract uses metadata.kind=schema, so falsification_tests entries
   are flexible; pv checks structure, ID uniqueness, and that binds_to
   references are valid.

## Verification

- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors
- v1.0.0 → v1.1.0 (PR #1413, FALSIFY-009 binding apr_diff_values_compat)
- v1.1.0 → v1.2.0 (this PR, FALSIFY-010 binding LmHead step-2 capture)

## Ship % update

- MODEL-1: ~68% (unchanged — this is paperwork that records yesterday's
  algorithm-level discharge of step 2; the actual capture surface
  expansion happened in PR #1414).
- MODEL-2: corpus tokenization at ~46.5M tokens / 56 min (steady ~14K
  tok/s); ~33h ETA for full 27 GB Stack v1.2 corpus.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 3, 2026 08:26
@noahgift noahgift merged commit 5f88d88 into main May 3, 2026
10 checks passed
@noahgift noahgift deleted the chore/contract-trace-save-tensor-v1.2.0-step2 branch May 3, 2026 09:08
noahgift added a commit that referenced this pull request May 3, 2026
… records CLI dispatch wire-up PARTIAL discharge

Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end
dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the
new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to:
cli_signature`.

Before PR #1417, `apr trace --save-tensor` only printed a stub and
never invoked `forward_traced_with_save_tensor`. The contract test
`apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was
already passing at the binary-boundary level — but the dispatch glue
was missing, leaving Embedding + LmHead capture surface unreachable
from the CLI for 2 days post-step-2 merge.

FALSIFY-011 extends the existing `cli_signature` invariant from
"the flag is recognized" to "the flag actually produces files".

## Five Whys

1. **Why a separate contract bump?** Avoids file-conflict with the
   in-flight refactor PR #1416 (which only touches
   `crates/aprender-serve/`). My contract change is isolated to
   `contracts/apr-cli-trace-save-tensor-v1.yaml`.
2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the
   byte format or determinism — it makes the CLI surface that the
   `cli_signature` equation already specified actually invocable.
   Same equation, expanded discharge level.
3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path
   resolution (3) and recursive *.bin walking (2) — algorithm-level.
   A live discharge against the canonical 7B teacher is operator-
   gated by post-merge smoke (~30s for a 7B forward + 2 file writes).
4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test
   that binds an existing invariant is a minor schema change per
   semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's
   discharge timeline:
     - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff
     - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2)
     - v1.3.0 (this PR): cli_signature → end-to-end dispatch
5. **Why now?** Records the algorithm-level discharge so when the
   operator runs the live smoke post-#1417-merge, the contract
   ledger doesn't lag the code. Same paperwork pattern as #1415
   (which followed #1414).
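
The "recursive *.bin walking" that two of the unit tests cover can be sketched as follows. This is an illustration using only the standard library, not the code from PR #1417; `collect_bin_files` is a hypothetical name.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collects every `*.bin` file under `root`.
/// The result is sorted so callers see a stable order.
pub fn collect_bin_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                // Descend later; an explicit stack avoids recursion.
                pending.push(path);
            } else if path.extension().map_or(false, |ext| ext == "bin") {
                found.push(path);
            }
        }
    }
    found.sort();
    Ok(found)
}
```

An explicit work stack keeps the walk iterative, so deep `layer-N/` trees cannot overflow the call stack.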

## Verification

- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` →
  0 errors, 0 warnings

## Ship % update

- MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417).
- MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady
  ~14K tok/s; ~33h ETA total).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
…7 forward_traced threading

Centralizes the boilerplate that `forward_traced_with_save_tensor`
inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007
PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through
`AprTransformer::forward_traced` itself, every per-layer stage will
call this single function instead of repeating the
`should_save → ensure_layer_dir → File::create → write_tensor_file →
flush` chain at each capture point.

## What changed

- New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`:
  - `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage,
    layer, values) -> io::Result<()>` — the gated entry point. Cheap
    no-op when plan is None or stage/layer not selected. Forwards to
    `write_stage_file` when the gate passes.
  - `pub fn write_stage_file(output_dir, stage, layer, values) ->
    io::Result<()>` — the unconditional write, exposed separately
    for tests and any future callers that have already gated.
  - 7 unit tests pinning: None=no-op, unselected-stage=no-op,
    per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin
    with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes
    out-of-range, NaN-bit-preserving f32 LE round-trip, missing
    parent dirs auto-created.
- `crates/aprender-serve/src/inference_trace/mod.rs`: register the
  new module.
- `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`:
  - Replace 60-line Embedding+LmHead inline blocks with two calls
    to `maybe_save_stage`. Net 50-line shrink.
  - Wrapper's behavior is byte-identical: same API surface, same
    file layout, same NaN preservation. Existing 4
    `traced_save_tensor_step2_tests` tests still PASS.
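
A minimal sketch of the gated/unconditional split described above. The `Stage` enum, the `SaveTensorPlan` fields, the file stems, the sentinel value, and the header layout are all stand-ins assumed for illustration; only the two function names, the `Option<&SaveTensorPlan>` parameter, and the `layer-N/<stage>.bin` versus `<root>/<stage>.bin` layout come from this commit message.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::ops::Range;
use std::path::{Path, PathBuf};

/// Hypothetical sentinel marking a whole-model stage (actual value not stated here).
pub const WHOLE_MODEL_LAYER: u32 = u32::MAX;

/// Stand-in for the crate's stage enum; only two variants are sketched.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Stage { FfnGate, LmHead }

impl Stage {
    fn file_stem(self) -> &'static str {
        match self { Stage::FfnGate => "ffn_gate", Stage::LmHead => "lm_head" }
    }
}

/// Stand-in for `SaveTensorPlan`; the real struct's fields are not shown in the PR.
pub struct SaveTensorPlan {
    pub output_dir: PathBuf,
    pub stages: Vec<Stage>,
    pub layers: Range<u32>,
}

impl SaveTensorPlan {
    fn should_save(&self, stage: Stage, layer: u32) -> bool {
        self.stages.contains(&stage)
            && (layer == WHOLE_MODEL_LAYER || self.layers.contains(&layer))
    }
}

/// Unconditional write: creates missing parent directories, then writes an
/// assumed 12-byte header followed by the values as f32 little-endian.
pub fn write_stage_file(output_dir: &Path, stage: Stage, layer: u32, values: &[f32]) -> io::Result<()> {
    let dir = if layer == WHOLE_MODEL_LAYER {
        output_dir.to_path_buf()
    } else {
        output_dir.join(format!("layer-{layer}"))
    };
    fs::create_dir_all(&dir)?;
    let mut out = BufWriter::new(File::create(dir.join(format!("{}.bin", stage.file_stem())))?);
    out.write_all(b"APRT")?;
    out.write_all(&layer.to_le_bytes())?;
    out.write_all(&(values.len() as u32).to_le_bytes())?;
    for v in values {
        out.write_all(&v.to_le_bytes())?;
    }
    out.flush()
}

/// Gated entry point: returns immediately when there is no plan or the plan
/// does not select this stage and layer.
pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage: Stage, layer: u32, values: &[f32]) -> io::Result<()> {
    match plan {
        Some(p) if p.should_save(stage, layer) => write_stage_file(&p.output_dir, stage, layer, values),
        _ => Ok(()),
    }
}
```

With this shape, a capture point inside a forward pass reduces to a single `maybe_save_stage(plan, stage, layer, &buf)?;` call, and the `None` case touches no filesystem state.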

## Five Whys

1. **Why now?** PR #1414 (step 2) merged earlier today landed the
   second copy of the inline block. Pre-step-3 is the right time to
   factor — before 15 more capture points get added inside the
   360-line `forward_traced` body.
2. **Why a new module instead of inlining in
   `apr_transformer/`?** The helper has zero coupling to
   `AprTransformer` (it takes a plan + stage + values). Living next
   to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches
   the existing `inference_trace::save_tensor_*` family pattern.
3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?**
   Step 3 will thread this through `forward_traced`, which is also
   called from non-instrumented contexts (HTTP serving, training
   evals). The `Option` lets a single `forward_traced` body serve
   both — `maybe_save_stage(None, ...)` is a single discriminant
   compare in hot paths.
4. **Why expose `write_stage_file` separately from
   `maybe_save_stage`?** Test ergonomics (the unit tests need to
   verify the unconditional write path, not just the gated path)
   and forward-compatibility for a future `forward_traced_inner`
   that does its own `should_save` filtering inside a hot loop and
   wants to skip the option indirection.
5. **Why no contract bump in this PR?** This is a pure refactor —
   no behavior change, no new invariants. The existing
   `byte_format`, `determinism`, and `apr_diff_values_compat`
   invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through
   exactly one function instead of two now, which makes future
   contract obligations easier to satisfy. PR #1415 (already in
   flight) bumps the contract for step 2; step 3 will bump again
   to v1.3.0.

## Test plan

- [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` →
  7/7 PASS
- [x] `cargo test -p aprender-serve --lib
  traced_save_tensor_step2_tests` → 4/4 PASS (existing tests
  unchanged behavior verified)
- [x] `cargo test -p aprender-serve --test
  save_tensor_plan_integration` → 6/6 PASS (contract-level
  integration unchanged)
- [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D
  warnings` clean

## Ship % update

- MODEL-1: ~68% (unchanged — pure refactor; no new capture surface).
  Step 3 (per-layer threading inside `forward_traced`) is now a
  trivial follow-up: each capture point becomes one line:
  `maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx,
  &gate)?;`.
- MODEL-2: corpus tokenization still running (~83 min elapsed,
  46.5M tokens, ~33h ETA).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
… records CLI dispatch wire-up PARTIAL discharge (#1418)

noahgift added a commit that referenced this pull request May 3, 2026
…7 forward_traced threading (#1416)

* refactor(aprender-serve): extract maybe_save_stage helper for SHIP-007 forward_traced threading

* feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced (#1421)

End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher
captures all 16 stages in a single forward pass:
- 14 per-layer: Embedding, AttnNorm, QkvMatmul, QkvBias, Attention, AttnOut,
  PostAttnResidual, FfnNorm, FfnGate, FfnUp, FfnSilu, FfnSwigl, FfnOut,
  PostFfnResidual
- 2 whole-model: FinalNorm, LmHead

Buffer sizes verified against Qwen2.5-Coder-7B config:
- 100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header
- 530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates)
- 129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim)
- 608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head)
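
The arithmetic behind these sizes can be checked with a small helper. `expected_file_size` is a hypothetical name for illustration; the 12-byte header and the dimensions are the figures stated above.

```rust
/// Size in bytes of the header that precedes every payload (as stated above).
pub const HEADER_BYTES: usize = 12;

/// Expected file size for `rows × cols` f32 values plus the header.
pub fn expected_file_size(rows: usize, cols: usize) -> usize {
    rows * cols * std::mem::size_of::<f32>() + HEADER_BYTES
}
```

For example, `expected_file_size(7, 3_584)` covers the hidden-dimension stages and `expected_file_size(1, 152_064)` the whole-model `lm_head`, which holds a single row of logits. The stated `qkv_dim` of 4,608 implies `kv_dim` = 512, since 3,584 + 2 × 512 = 4,608.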

## What changed

`AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)`
is the new private impl that threads the plan through every natural capture
point. `forward_traced(tokens)` becomes a 1-line wrapper that calls it with
`None` (zero-overhead — `maybe_save_stage` early-returns on `None`).
`forward_traced_with_save_tensor` is now a pure delegator: no more double-embed
or post-loop re-emission of LmHead. All 6 forward_traced regression tests pass.
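
The delegation pattern described above, sketched under heavy simplification. `Model`, `Plan`, `emit`, and the toy arithmetic are stand-ins assumed for illustration; the real `AprTransformer` methods, their return types, and `SaveTensorPlan` are not shown in this commit message.

```rust
use std::cell::RefCell;

/// Stand-in for `SaveTensorPlan`: records (stage, element count) instead of writing files.
pub struct Plan {
    pub captured: RefCell<Vec<(&'static str, usize)>>,
}

/// Stand-in for the transformer; the "forward pass" is toy arithmetic.
pub struct Model {
    pub scale: f32,
}

impl Model {
    /// Public entry point with the unchanged signature: no capture.
    pub fn forward_traced(&self, tokens: &[u32]) -> Vec<f32> {
        self.forward_traced_with_plan(tokens, None)
    }

    /// Capture entry point: a pure delegator that passes the plan down.
    pub fn forward_traced_with_save_tensor(&self, tokens: &[u32], plan: &Plan) -> Vec<f32> {
        self.forward_traced_with_plan(tokens, Some(plan))
    }

    /// Single body serving both callers; each buffer is emitted once, where it is produced.
    fn forward_traced_with_plan(&self, tokens: &[u32], plan: Option<&Plan>) -> Vec<f32> {
        let embedded: Vec<f32> = tokens.iter().map(|&t| t as f32).collect();
        emit(plan, "embedding", &embedded);
        let logits: Vec<f32> = embedded.iter().map(|x| x * self.scale).collect();
        emit(plan, "lm_head", &logits);
        logits
    }
}

/// Early-returns when no plan is supplied, so the uninstrumented path does no capture work.
fn emit(plan: Option<&Plan>, stage: &'static str, values: &[f32]) {
    if let Some(p) = plan {
        p.captured.borrow_mut().push((stage, values.len()));
    }
}
```

Because both public methods share one body, the instrumented and uninstrumented paths cannot drift apart, and no stage is computed twice just to capture it.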

## Five Whys

1. Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per
   `apr-cli-trace-save-tensor-v1.yaml`.
2. Why all 16 stages, not subset? The bisection target (root cause of MODEL-1
   `apr run` divergence) is unknown; capturing every stage lets the operator
   element-wise diff against a reference at any of 16 points.
3. Why single-pass via `Option<&SaveTensorPlan>` (not separate method)? Avoids
   double-compute of embeddings and the wrapper's double-bookkeeping. DRY.
4. Why `Option<&SaveTensorPlan>` (not always-required)? Existing
   `forward_traced(tokens)` is called from many test sites and the trait
   `TracedForward`. Optional preserves the public API.
5. Why only inserted `maybe_save_stage` calls (no helpers refactor)? Each
   capture is a bare `emit(Stage, layer, &buf)?` — adding helpers would mask
   the natural buffer-name → stage-name pairing and make audit harder.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>