refactor(aprender-serve): extract maybe_save_stage helper for SHIP-007 forward_traced threading by noahgift · Pull Request #1416 · paiml/aprender

noahgift · 2026-05-03T09:01:41Z

Summary

Centralizes the boilerplate that forward_traced_with_save_tensor inlines twice (Embedding step 1 in #1408, LmHead step 2 in #1414). Pre-surgery for SHIP-007 PR-C-real step 3, where the plan threads through the 360-line forward_traced body and 15+ capture points each become a single line: maybe_save_stage(plan, stage, layer, &values)?;.

What changed

New crates/aprender-serve/src/inference_trace/save_tensor_emit.rs:
- pub fn maybe_save_stage(Option<&Plan>, stage, layer, values) — gated entry; cheap no-op when None or unselected
- pub fn write_stage_file(output_dir, stage, layer, values) — unconditional write
- 7 unit tests pinning all branches + NaN-bit round-trip
traced_save_tensor.rs: 60 inline lines → 2 helper calls; net 50-line shrink, byte-identical behavior
inference_trace/mod.rs: registered new module

Five Whys

Why now? PR feat(aprender-serve): SHIP-007 PR-C-real step 2 — LmHead capture in forward_traced wrapper #1414 (step 2) just landed the second copy of the inline block. Pre-step-3 is the right time to factor — before 15 more capture points get added.
Why a new module? Zero coupling to AprTransformer (takes plan+stage+values). Matches the inference_trace::save_tensor_* family pattern.
Why Option<&Plan> not &Plan? Step 3 threads this through forward_traced, which is also called from non-instrumented contexts (HTTP serving, training evals). None is a single discriminant compare in hot paths.
Why expose write_stage_file separately? Test ergonomics + forward-compat for a future forward_traced_inner that gates internally and skips the option indirection.
Why no contract bump? Pure refactor — no behavior change, no new invariants. Existing byte_format, determinism, apr_diff_values_compat invariants flow through exactly one function instead of two, which makes future obligations easier to satisfy.

Test plan

cargo test -p aprender-serve --lib save_tensor_emit::tests → 7/7 PASS
cargo test -p aprender-serve --lib traced_save_tensor_step2_tests → 4/4 PASS (behavior preserved)
cargo test -p aprender-serve --test save_tensor_plan_integration → 6/6 PASS
cargo clippy -p aprender-serve --lib --no-deps -- -D warnings clean
CI required checks (ci / gate, workspace-test)

Ship % update

MODEL-1: ~68% (unchanged — pure refactor; no new capture surface). Step 3 surgery is now trivial: each per-layer capture point becomes one line.
MODEL-2: corpus tokenization steady (~83 min, 46.5M tokens, ~33h ETA).

🤖 Generated with Claude Code

…es (#1417) Before this PR, `apr trace --save-tensor <model>.apr` printed a stub message and never invoked the underlying wrapper — so the existing Embedding (PR #1408 step 1) and LmHead (PR #1414 step 2) capture surface was UNREACHABLE from the CLI. The wrapper, plan-builder, and APRT byte-format machinery had all merged but produced no files. This PR closes the dispatch gap. When `apr trace --save-tensor <stages>` is invoked on a `.apr` model, dispatch.rs now routes to a new `commands::trace_save_tensor::run_save_tensor_apr` function that: 1. Builds a `SaveTensorPlan` from `--save-tensor`/`--save-tensor-dir`/ `--save-tensor-layers`. Default output dir is `<model-stem>-trace/` next to the input. 2. Loads the APR model + embedded BPE tokenizer. 3. Encodes a fixed test prompt (`"What is 2+2?"` — same as `vector_stats.rs::run_traced_inference_apr` for consistency). 4. Calls `AprTransformer::forward_traced_with_save_tensor(&tokens, &plan)`. 5. Walks the output directory and prints every `*.bin` file with its size, plus the forward-pass success summary. `.gguf` and `.safetensors` paths still print the stub for now — SHIP-007 PR-E live diagnostics convert GGUF→APR at the import boundary so the canonical 7B teacher bisection runs through this code path. ## Five Whys 1. **Why was this missing?** Per the SHIP-007 PR-A commit message, the clap surface was shipped first as a contract pin so `apr-cli-trace-save-tensor-v1.yaml::cli_signature` was bound at the binary boundary. Subsequent PRs (B/C-prep/C-step1/C-step2) wired the library-side machinery. The dispatch glue was the missing final hop — easy to overlook because the contract test `apr trace --save-tensor --help | grep save-tensor` passed all along. 2. **Why is `.apr` only?** `forward_traced_with_save_tensor` is a method on `AprTransformer`; GGUF inference goes through a different `OwnedQuantizedModel` path. Adding GGUF support means either porting the wrapper to that type or converting GGUF→APR at import — the latter is what SHIP-007 PR-E already plans, so it's not blocking. 3. **Why a fixed prompt instead of `--prompt`?** SHIP-007 bisection needs the SAME prompt across APR and GGUF runs to make `apr diff --values` byte comparison meaningful. Hardcoding to `"What is 2+2?"` matches the existing `run_traced_inference_apr` in `vector_stats.rs`. A future `--prompt` flag is a small follow-up. 4. **Why a new module instead of extending `trace.rs`?** `trace.rs` is 722 lines already; adding the save-tensor branch via a new module (52 lines + 4 unit tests) keeps the existing 4-format dispatch intact and isolates the wrapper-specific imports (`realizar::inference_trace::save_tensor_plan::SaveTensorPlan`). 5. **Why now?** Operator's standing /loop directive is "select next best recommended choice". With PR #1414 (step 2 LmHead) merged today and PR #1416 (refactor prep) auto-merge queued, the highest-leverage move is making the existing capture surface actually work end-to-end. SHIP-007 PR-E (the live bisection) is gated on `apr trace --save-tensor` producing files; that gate is what this PR opens. ## Test plan - [x] `cargo test -p apr-cli --lib commands::trace_save_tensor` → 5/5 PASS - default_output_dir_uses_model_stem - default_output_dir_handles_bare_filename - default_output_dir_handles_no_extension - collect_bin_files_recurses_per_layer_subdirs - collect_bin_files_missing_dir_is_ok - [x] `cargo check -p apr-cli --lib` clean - [ ] Live smoke on canonical 7B teacher (operator-pre-authorized lambda-labs lane): `apr trace --payload /mnt/nvme-raid0/.../qwen2.5-coder-7b-instruct-q4k.apr --save-tensor embedding,lm_head` produces `<dir>/layer-0/embedding.bin` + `<dir>/lm_head.bin`. Deferred to a follow-up commit since the operator can run this any time after merge. - [ ] CI required checks (`ci / gate`, `workspace-test`) ## Ship % update - **MODEL-1**: ~68% → **~70%** (the wrapper surface that's been merged for 2 days is now actually invocable from the CLI, which means SHIP-007 PR-E live diagnostics are no longer blocked on this dispatch gap). - **MODEL-2**: corpus tokenization at ~96M tokens / 119 min (steady ~14K tok/s). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

… records CLI dispatch wire-up PARTIAL discharge Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to: cli_signature`. Before PR #1417, `apr trace --save-tensor` only printed a stub and never invoked `forward_traced_with_save_tensor`. The contract test `apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was already passing at the binary-boundary level — but the dispatch glue was missing, leaving Embedding + LmHead capture surface unreachable from the CLI for 2 days post-step-2 merge. FALSIFY-011 extends the existing `cli_signature` invariant from "the flag is recognized" to "the flag actually produces files". ## Five Whys 1. **Why a separate contract bump?** Avoids file-conflict with the in-flight refactor PR #1416 (which only touches `crates/aprender-serve/`). My contract change is isolated to `contracts/apr-cli-trace-save-tensor-v1.yaml`. 2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the byte format or determinism — it makes the CLI surface that the `cli_signature` equation already specified actually invocable. Same equation, expanded discharge level. 3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path resolution (3) and recursive *.bin walking (2) — algorithm-level. A live discharge against the canonical 7B teacher is operator- gated by post-merge smoke (~30s for a 7B forward + 2 file writes). 4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test that binds an existing invariant is a minor schema change per semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's discharge timeline: - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2) - v1.3.0 (this PR): cli_signature → end-to-end dispatch 5. **Why now?** Records the algorithm-level discharge so when the operator runs the live smoke post-#1417-merge, the contract ledger doesn't lag the code. Same paperwork pattern as #1415 (which followed #1414). ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors, 0 warnings ## Ship % update - MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417). - MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady ~14K tok/s; ~33h ETA total). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… records CLI dispatch wire-up PARTIAL discharge (#1418) Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to: cli_signature`. Before PR #1417, `apr trace --save-tensor` only printed a stub and never invoked `forward_traced_with_save_tensor`. The contract test `apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was already passing at the binary-boundary level — but the dispatch glue was missing, leaving Embedding + LmHead capture surface unreachable from the CLI for 2 days post-step-2 merge. FALSIFY-011 extends the existing `cli_signature` invariant from "the flag is recognized" to "the flag actually produces files". ## Five Whys 1. **Why a separate contract bump?** Avoids file-conflict with the in-flight refactor PR #1416 (which only touches `crates/aprender-serve/`). My contract change is isolated to `contracts/apr-cli-trace-save-tensor-v1.yaml`. 2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the byte format or determinism — it makes the CLI surface that the `cli_signature` equation already specified actually invocable. Same equation, expanded discharge level. 3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path resolution (3) and recursive *.bin walking (2) — algorithm-level. A live discharge against the canonical 7B teacher is operator- gated by post-merge smoke (~30s for a 7B forward + 2 file writes). 4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test that binds an existing invariant is a minor schema change per semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's discharge timeline: - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2) - v1.3.0 (this PR): cli_signature → end-to-end dispatch 5. **Why now?** Records the algorithm-level discharge so when the operator runs the live smoke post-#1417-merge, the contract ledger doesn't lag the code. Same paperwork pattern as #1415 (which followed #1414). ## Verification - `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors, 0 warnings ## Ship % update - MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417). - MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady ~14K tok/s; ~33h ETA total). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…7 forward_traced threading Centralizes the boilerplate that `forward_traced_with_save_tensor` inlines twice (Embedding step 1, LmHead step 2). When the SHIP-007 PR-C-real step-3+ surgery threads `Option<&SaveTensorPlan>` through `AprTransformer::forward_traced` itself, every per-layer stage will call this single function instead of repeating the `should_save → ensure_layer_dir → File::create → write_tensor_file → flush` chain at each capture point. ## What changed - New `crates/aprender-serve/src/inference_trace/save_tensor_emit.rs`: - `pub fn maybe_save_stage(plan: Option<&SaveTensorPlan>, stage, layer, values) -> io::Result<()>` — the gated entry point. Cheap no-op when plan is None or stage/layer not selected. Forwards to `write_stage_file` when the gate passes. - `pub fn write_stage_file(output_dir, stage, layer, values) -> io::Result<()>` — the unconditional write, exposed separately for tests and any future callers that have already gated. - 7 unit tests pinning: None=no-op, unselected-stage=no-op, per-layer→layer-N/<stage>.bin, whole-model→<root>/<stage>.bin with WHOLE_MODEL_LAYER sentinel, layer-range filter excludes out-of-range, NaN-bit-preserving f32 LE round-trip, missing parent dirs auto-created. - `crates/aprender-serve/src/inference_trace/mod.rs`: register the new module. - `crates/aprender-serve/src/apr_transformer/traced_save_tensor.rs`: - Replace 60-line Embedding+LmHead inline blocks with two calls to `maybe_save_stage`. Net 50-line shrink. - Wrapper's behavior is byte-identical: same API surface, same file layout, same NaN preservation. Existing 4 `traced_save_tensor_step2_tests` tests still PASS. ## Five Whys 1. **Why now?** PR #1414 (step 2) merged earlier today landed the second copy of the inline block. Pre-step-3 is the right time to factor — before 15 more capture points get added inside the 360-line `forward_traced` body. 2. **Why a new module instead of inlining in `apr_transformer/`?** The helper has zero coupling to `AprTransformer` (it takes a plan + stage + values). Living next to `save_tensor`, `save_tensor_paths`, `save_tensor_plan` matches the existing `inference_trace::save_tensor_*` family pattern. 3. **Why `Option<&SaveTensorPlan>` instead of `&SaveTensorPlan`?** Step 3 will thread this through `forward_traced`, which is also called from non-instrumented contexts (HTTP serving, training evals). The `Option` lets a single `forward_traced` body serve both — `maybe_save_stage(None, ...)` is a single discriminant compare in hot paths. 4. **Why expose `write_stage_file` separately from `maybe_save_stage`?** Test ergonomics (the unit tests need to verify the unconditional write path, not just the gated path) and forward-compatibility for a future `forward_traced_inner` that does its own `should_save` filtering inside a hot loop and wants to skip the option indirection. 5. **Why no contract bump in this PR?** This is a pure refactor — no behavior change, no new invariants. The existing `byte_format`, `determinism`, and `apr_diff_values_compat` invariants in `apr-cli-trace-save-tensor-v1.yaml` flow through exactly one function instead of two now, which makes future contract obligations easier to satisfy. PR #1415 (already in flight) bumps the contract for step 2; step 3 will bump again to v1.3.0. ## Test plan - [x] `cargo test -p aprender-serve --lib save_tensor_emit::tests` → 7/7 PASS - [x] `cargo test -p aprender-serve --lib traced_save_tensor_step2_tests` → 4/4 PASS (existing tests unchanged behavior verified) - [x] `cargo test -p aprender-serve --test save_tensor_plan_integration` → 6/6 PASS (contract-level integration unchanged) - [x] `cargo clippy -p aprender-serve --lib --no-deps -- -D warnings` clean ## Ship % update - MODEL-1: ~68% (unchanged — pure refactor; no new capture surface). Step 3 (per-layer threading inside `forward_traced`) is now a trivial follow-up: each capture point becomes one line: `maybe_save_stage(plan, SaveTensorStage::FfnGate, layer_idx, &gate)?;`. - MODEL-2: corpus tokenization still running (~83 min elapsed, 46.5M tokens, ~33h ETA). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rPlan threading through forward_traced (#1421) End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher captures all 16 stages in a single forward pass: - 14 per-layer: Embedding, AttnNorm, QkvMatmul, QkvBias, Attention, AttnOut, PostAttnResidual, FfnNorm, FfnGate, FfnUp, FfnSilu, FfnSwigl, FfnOut, PostFfnResidual - 2 whole-model: FinalNorm, LmHead Buffer sizes verified against Qwen2.5-Coder-7B config: - 100,364 B = 7 tokens × 3,584 hidden_dim × 4 + 12-byte APRT header - 530,444 B = 7 tokens × 18,944 intermediate_dim × 4 + 12 (FFN intermediates) - 129,036 B = 7 tokens × 4,608 qkv_dim × 4 + 12 (qkv_dim = hidden + 2*kv_dim) - 608,268 B = 152,064 vocab × 4 + 12 (whole-model lm_head) ## What changed `AprTransformer::forward_traced_with_plan(tokens, plan: Option<&SaveTensorPlan>)` is the new private impl that threads the plan through every natural capture point. `forward_traced(tokens)` becomes a 1-line wrapper that calls it with `None` (zero-overhead — `maybe_save_stage` early-returns on `None`). `forward_traced_with_save_tensor` is now a pure delegator: no more double-embed or post-loop re-emission of LmHead. All 6 forward_traced regression tests pass. ## Five Whys 1. Why thread plan through forward_traced? Step 3 of SHIP-007 PR-C-real per `apr-cli-trace-save-tensor-v1.yaml`. 2. Why all 16 stages, not subset? The bisection target (root cause of MODEL-1 `apr run` divergence) is unknown; capturing every stage lets the operator element-wise diff against a reference at any of 16 points. 3. Why single-pass via `Option<&SaveTensorPlan>` (not separate method)? Avoids double-compute of embeddings and the wrapper's double-bookkeeping. DRY. 4. Why `Option<&SaveTensorPlan>` (not always-required)? Existing `forward_traced(tokens)` is called from many test sites and the trait `TracedForward`. Optional preserves the public API. 5. Why only inserted `maybe_save_stage` calls (no helpers refactor)? Each capture is a bare `emit(Stage, layer, &buf)?` — adding helpers would mask the natural buffer-name → stage-name pairing and make audit harder. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…discharge for FALSIFY-009/010/011 End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher (RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421): - 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual - 2 whole-model (root/*): final_norm, lm_head All 16 file sizes match `12 + 4 * dim_product` for their stage type (3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab). Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL: - FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format) - FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture) - FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up) `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns 0 errors / 0 warnings. Five Whys 1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically verified in single live run". DISCHARGED requires either bytewise equivalence vs an oracle OR repeatable run-to-run cross-machine verification. SHIP-007 PR-C-real step 3 just ships the surface; the oracle comparison (APR vs HF Transformers reference) is the next leg. 2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence` blocks; bumping all three together at FUNCTIONAL is the natural semver event. 3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)? Drift-prevention: future readers need to see BOTH the algorithm-level tests that pin the impl AND the live byte-counts/file-counts that validate the impl runs end-to-end on the canonical teacher. 4. Why hand-cite the 16 stage names in the contract? They're the surface over which the next milestone (SHIP-007 layer-0 element-wise bisection vs HF reference) will diff — making them visible in the contract is the drift-prevention pin. 5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED` tracks the document's lifecycle, not the falsifier maturity. Promoting to ACTIVE requires a separate decision after the spec audit (out of scope for this paperwork commit). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…discharge for FALSIFY-009/010/011 (#1422) End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher (RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421): - 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual - 2 whole-model (root/*): final_norm, lm_head All 16 file sizes match `12 + 4 * dim_product` for their stage type (3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab). Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL: - FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format) - FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture) - FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up) `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns 0 errors / 0 warnings. Five Whys 1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically verified in single live run". DISCHARGED requires either bytewise equivalence vs an oracle OR repeatable run-to-run cross-machine verification. SHIP-007 PR-C-real step 3 just ships the surface; the oracle comparison (APR vs HF Transformers reference) is the next leg. 2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence` blocks; bumping all three together at FUNCTIONAL is the natural semver event. 3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)? Drift-prevention: future readers need to see BOTH the algorithm-level tests that pin the impl AND the live byte-counts/file-counts that validate the impl runs end-to-end on the canonical teacher. 4. Why hand-cite the 16 stage names in the contract? They're the surface over which the next milestone (SHIP-007 layer-0 element-wise bisection vs HF reference) will diff — making them visible in the contract is the drift-prevention pin. 5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED` tracks the document's lifecycle, not the falsifier maturity. Promoting to ACTIVE requires a separate decision after the spec audit (out of scope for this paperwork commit). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

@v

…+ empirical bug location (#1423) * contract(apr-cli-trace-save-tensor-v1): v1.3.0 → v1.4.0 — FUNCTIONAL discharge for FALSIFY-009/010/011 End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher (RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421): - 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual - 2 whole-model (root/*): final_norm, lm_head All 16 file sizes match `12 + 4 * dim_product` for their stage type (3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab). Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL: - FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format) - FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture) - FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up) `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns 0 errors / 0 warnings. Five Whys 1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically verified in single live run". DISCHARGED requires either bytewise equivalence vs an oracle OR repeatable run-to-run cross-machine verification. SHIP-007 PR-C-real step 3 just ships the surface; the oracle comparison (APR vs HF Transformers reference) is the next leg. 2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence` blocks; bumping all three together at FUNCTIONAL is the natural semver event. 3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)? Drift-prevention: future readers need to see BOTH the algorithm-level tests that pin the impl AND the live byte-counts/file-counts that validate the impl runs end-to-end on the canonical teacher. 4. Why hand-cite the 16 stage names in the contract? They're the surface over which the next milestone (SHIP-007 layer-0 element-wise bisection vs HF reference) will diff — making them visible in the contract is the drift-prevention pin. 5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED` tracks the document's lifecycle, not the falsifier maturity. Promoting to ACTIVE requires a separate decision after the spec audit (out of scope for this paperwork commit). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(scripts): SHIP-007 layer-0 oracle bisection — HF FP16 reference stage generator Authors `scripts/generate_qwen25_coder_fp16_stages.py` — a Python tool that runs `Qwen/Qwen2.5-Coder-7B-Instruct` at FP16 with forward hooks attached to each natural per-layer module and dumps the activations in the same APRT byte format that `apr trace --save-tensor` produces. Output layout mirrors the APR side (`layer-0/<stage>.bin` + `<stage>.bin`) so `apr diff --values <apr>.bin <hf>.bin` works element-wise without any rewriting. Captured 13/16 stages directly: - Per-layer (11): embedding, attn_norm, attn_out, post_ffn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out - Whole-model (2): final_norm, lm_head Skipped 3/16 (qkv_matmul / qkv_bias / attention) — these need deeper instrumentation since HF stores Q/K/V separately + RoPE is internal to self_attn. Deferred to a follow-up; the 13 captured stages already cover all major points along the forward pass. Five Whys 1. Why need an HF FP16 reference? SHIP-007 layer-0 element-wise diff needs a ground-truth oracle to compare APR Q4K against; FP16 is the closest published reference for this model. 2. Why not just use the existing `qwen2.5-coder-7b-instruct-q4k.safetensors` on disk? That's the same Q4K data we already feed into APR — diffing it against APR would only catch APR-side bugs that change weights, not bugs in forward computation. We need an INDEPENDENT reference. 3. Why hooks instead of direct model code edits? HF's modeling_qwen2.py is auto-loaded via `trust_remote_code=True`; the hooks let us inspect every stage without forking HF's source. 4. Why APRT byte format (not torch.save)? `apr diff --values` already recognizes APRT files (PR #1413) — using the same format makes the diff a one-liner. Drift-prevention: same format on both sides keeps comparison shape-agnostic. 5. Why skip qkv_matmul/qkv_bias/attention now? Discharging the discoverable 13 stages is high-leverage; the remaining 3 require manual q+k+v concatenation and Q@Kᵀ@v re-derivation. Worth a follow-up PR but blocking on it would delay every other stage's bisection signal. Note: This script is NOT auto-run in CI — it requires HF cache containing `Qwen/Qwen2.5-Coder-7B-Instruct` (~15 GB). Confirmed already cached at ~/.cache/huggingface/hub/ on noah-Lambda-Vector 2026-05-03. Operator runs it once via `uv run --with torch --with transformers` to produce the fixture; downstream `apr diff` passes are deterministic byte comparisons. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): layer-0 oracle bisection — attn_out is the first diverging stage End-to-end empirical bisection on canonical Qwen2.5-Coder-7B-Instruct teacher with HF FP16 ground truth (CPU forward, HF cache hit). Element-wise diff every shared layer-0 stage between APR Q4K and HF FP16: | Stage | Cosine sim | Status | |-------------------|---------------|--------------------------------------------| | embedding | 1.0000000000 | bit-identical (correct) | | attn_norm | 0.9999999483 | within Q4K noise (correct) | | **attn_out** | **0.9966403** | **FIRST DROP — bug is in attention block** | | ffn_* (downstream)| 0.996-0.999 | carries drift (downstream artifacts) | | final_norm | 0.9932669898 | (whole-model — accumulates 28 layers) | | lm_head | 0.9969170161 | (whole-model — last-token logits) | This narrows the SHIP-007 root cause to the layer-0 attention block, specifically between RMSNorm output (cos=0.99999995, correct) and post-O-proj attention output (cos=0.9966, wrong). Possible bug sites within the block: 1. qkv_matmul (Q4K matmul × QKV weights) — needs HF-side capture 2. qkv_bias 3. RoPE on Q/K 4. Q@Kᵀ scaled-dot-product 5. Softmax with causal mask 6. softmax @ V 7. O-projection (Q4K matmul × O-proj weight) Next milestone: extend `scripts/generate_qwen25_coder_fp16_stages.py` with qkv_matmul / qkv_bias / attention capture (currently deferred to PARTIAL coverage), re-run diff, pinpoint the divergent kernel. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(scripts): SHIP-007 v2 — qkv_matmul/qkv_bias/attention captures narrow bug to attention math (#1424) Extends `generate_qwen25_coder_fp16_stages.py` with HF-side captures for the 3 stages previously deferred. Refines the SHIP-007 layer-0 bisection from "inside attention block" (v1) to "between qkv_bias and attention" (v2). ## Refined cosine table | Stage | Cosine | Δ from prev | Status | |---------------|-----------|-------------|-----------------------| | attn_norm | 0.9999999 | -5e-8 | RMSNorm correct | | qkv_matmul | 0.99969 | -3.1e-4 | Q4K matmul noise (OK) | | qkv_bias | 0.9999975 | +2.8e-4 | bias dampens | | **attention** | 0.99858 | **-1.4e-3** | **← bug is here** | | attn_out | 0.99664 | -1.9e-3 | O-proj amplifies | Bug is **between qkv_bias and attention** = inside the attention math: RoPE / Q@Kᵀ / scale / causal mask / softmax / V@. NOT in QKV matmul (acceptable Q4K noise). NOT in QKV bias add (within FP precision). O-projection adds its own ~1.9e-3 cosine drop — secondary. ## Implementation New HF-side hooks: - `make_qkv_hook` on q_proj/k_proj/v_proj — concat outputs to derive qkv_bias (post-bias) and qkv_matmul (post-bias minus per-Linear bias) - `hook_o_proj_pre` (forward_pre_hook) on o_proj — captures its INPUT, which is APR's "attention" stage (post softmax(Q@Kᵀ)@v, pre-O-proj) Script now produces 15 stage files (was 12 in v1). ## Why qkv_matmul cos=0.99969 < qkv_bias cos=0.9999975 Mathematical artifact, not a bug: - qkv_matmul = Q4K_matmul(Q4K_input × Q4K_weight) — has ~3e-4 cosine noise vs FP16 - qkv_bias = qkv_matmul + bias (deterministic FP16 bias vector) - Adding deterministic vector dominates direction → relative noise dampens - Both APR and HF add the same bias values → cos increases on both sides equally Confirmed via: bias subtraction matches (HF - bias ≈ APR pre-bias on each side). ## Five Whys 1. Why need qkv stage captures? v1 only narrowed bug to "attention block" — not enough to drive a fix. We need to know if the bug is in the projections or the attention math. 2. Why is qkv_matmul cos lower than qkv_bias? See above — bias addition is a known mathematical artifact with deterministic vectors. 3. Why is the bug between qkv_bias and attention specifically? Cos=0.9999975 → 0.99858 is a 70× factor, far above Q4K floor. The intermediate ops (RoPE, scale, softmax, mask, V@) introduce real divergence. 4. Why O-proj adds another 1.9e-3 drop? Q4K_matmul of attention × O-proj weight is the same Q4K-vs-FP16 floor as qkv_matmul. Acceptable. 5. Why narrow further to RoPE/scale/softmax/mask/V@? Each is a candidate. Without finer-grained captures inside HF's monolithic Qwen2Attention, v2 cannot bisect further. Future work: instrument HF's attention internals OR cross-reference candle/pytorch for the algebraic spec of each sub-op. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): v3+v4 — APR attention audit + DECISIVE PIVOT to GPU execution path (#1425) * evidence(ship-007): v3 — APR attention code audit vs HF Qwen2 reference Cross-referenced APR's attention forward (`inference.rs` + `helpers.rs`) against HF Transformers Qwen2 to identify the algebraic source of the v2-measured 1.4e-3 cosine drop between qkv_bias and attention. ## Audit result: NO algebraic bug in APR attention Verified MATCHES vs HF Qwen2: - RoPE rotation formula (split-half, x[i]=x1·cos − x2·sin / x[i+½d]=x1·sin + x2·cos) - RoPE freq formula (1/theta^(2i/d)) - rope_theta value (1000000.0 from `metadata.rope_theta`) - Attention scale (1/sqrt(head_dim)) - Causal mask (`for j in 0..=i` triangular) - Softmax (f32 max-subtract) - QKV bias position (post-matmul, pre-RoPE) - GQA-7:1 head indexing (`kv_head = head/group_size`) ## Refined hypothesis The 1.4e-3 cosine drop is most likely **systematic precision loss from Q4K dequant compounding through attention math**, NOT a structural algorithmic bug. Specifically: 1. APR's `forward_traced` uses F32 dequantized Q4K weights (per `inference.rs:38` comment "Q4K layers not used in traced forward"). 2. The Q4K dequant is lossy (~1e-3 RMS per element). 3. When these slightly-off Q values are dotted against slightly-off K values (also from Q4K dequant), the product compounds the error. 4. This compounding produces cos=0.99858 at attention output — consistent with systematic precision loss, not a bug. ## Implication for SHIP-007 fix If this hypothesis is right, the layer-0 attention bisection has hit the natural noise floor of Q4K-vs-FP16 comparison. The actual `apr run` quality issue may be: (a) Further downstream — accumulating drift through 28 layers (b) NOT a forward-pass bug at all — could be sampling/decoding config (c) Q4K kernel-specific — `apr run` uses Q4K kernels (faster path) while `forward_traced` uses F32 dequant (more accurate path); the two might diverge in how the kernel handles edge cases ## Next narrowing tests 1. Run `apr trace --save-tensor` on the FP16 safetensors version of the teacher; if cos improves to >0.999 across all stages, confirms (a)/(c). 2. Multi-layer cosine sweep (layers 0/1/13/27) to characterize drift growth. 3. argmax-flip check on lm_head — if APR top-1 token matches HF top-1, the drift is "noise" not bug-relevant. Evidence: `evidence/ship-007-layer-0-oracle-bisection-2026-05-03/findings-v3-attention-code-audit.md` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * evidence(ship-007): v4 — DECISIVE PIVOT, bug pinpointed to GPU execution path Two falsifying live tests run on canonical 7B teacher reframe SHIP-007 fundamentally: ## Test 1: lm_head argmax MATCHES APR forward_traced top-1 token: 220 (' ') HF FP16 top-1 token: 220 (' ') First 3 top-5 ranks identical: [220, 576, 2014] → The cos=0.998 forward divergence in v1/v2/v3 is NOT bug-relevant for greedy decoding. It's just systematic precision noise. ## Test 2: `apr run --temperature 0.0` produces gibberish $ apr run qwen2.5-coder-7b-instruct-q4k.apr --prompt 'What is 2+2?' \ --max-tokens 16 --temperature 0.0 ampiezza = 0.5 diametro = 10 → Italian-looking gibberish, NOT '4', NOT a coherent answer. ## Test 3: even `--max-tokens 1` disagrees with forward_traced $ apr run [...] --max-tokens 1 --temperature 0.0 ampie → Single-step apr run produces different first token than forward_traced (which argmaxed to 220 ' '). ## The pivot The SHIP-007 bug is NOT in the forward pass instrumented by `forward_traced`. It's in the `apr run` GPU/wgpu hybrid execution path: | Path | Backend | Weights | Output for "What is 2+2?" | |------------------|--------------------------------|---------|---------------------------| | forward_traced | CPU scalar-loop matmul | F32 | argmax=220 (' ', matches HF) | | apr run | CUDA graph (646 kernels) + wgpu | F32 | "ampie..." (gibberish) | Both paths use the same F32 weights (apr run dequantizes Q4K to F32 before GPU upload, per PMAT-333 log line). The divergence is in **kernel implementations** — CPU scalar loops vs CUDA/wgpu kernels. ## All previous findings invalidated - v1 "bug is in attention block" — INVALID (was just Q4K precision noise) - v2 "bug is between qkv_bias and attention" — INVALID (same) - v3 "no algebraic bug, must be precision" — PARTIALLY CORRECT (forward_traced IS correct), but missed that the actual broken path is `apr run` GPU. The forward_traced bisection chain (cos drops at attention) is now understood as a RED HERRING — it captures a different code path than the buggy one. ## Next narrowing 1. Force `apr run` to use CPU (env var or feature flag) — does it match forward_traced? If yes, confirms GPU/wgpu parity bug. 2. Element-wise diff GPU layer-0 attention output vs CPU forward_traced. 3. Audit `realizar/src/quantize/fused_*` and CUDA graph kernels for forward-pass bugs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 3, 2026 09:01

noahgift mentioned this pull request May 3, 2026

feat(apr-cli): wire apr trace --save-tensor end-to-end for .apr files #1417

Merged

4 tasks

noahgift force-pushed the refactor/inference-trace-maybe-save-stage branch from 24db57d to 8d1d9fe Compare May 3, 2026 11:04

noahgift mentioned this pull request May 3, 2026

feat(aprender-serve): SHIP-007 PR-C-real step 3 — per-layer SaveTensorPlan threading through forward_traced #1421

Merged

6 tasks

noahgift and others added 2 commits May 3, 2026 13:41

noahgift force-pushed the refactor/inference-trace-maybe-save-stage branch from 7b9c6bc to 6da9e9d Compare May 3, 2026 11:41

noahgift mentioned this pull request May 3, 2026

contract(apr-cli-trace-save-tensor-v1): v1.3.0 → v1.4.0 — FUNCTIONAL discharge for FALSIFY-009/010/011 #1422

Merged

4 tasks

noahgift merged commit 872ab80 into main May 3, 2026
10 checks passed

noahgift deleted the refactor/inference-trace-maybe-save-stage branch May 3, 2026 11:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(aprender-serve): extract maybe_save_stage helper for SHIP-007 forward_traced threading#1416

refactor(aprender-serve): extract maybe_save_stage helper for SHIP-007 forward_traced threading#1416
noahgift merged 2 commits into
mainfrom
refactor/inference-trace-maybe-save-stage

noahgift commented May 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 3, 2026

Summary

What changed

Five Whys

Test plan

Ship % update

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant