feat(apr-cli): SHIP-007 PR-D — apr diff --values recognizes APRT stage tensors#1413

Merged
noahgift merged 2 commits into main from feat/apr-diff-values-aprt-stage-tensor-v1 on May 3, 2026

@noahgift noahgift commented May 3, 2026

Summary

  • Closes the apr_diff_values_compat invariant of apr-cli-trace-save-tensor-v1 at PARTIAL_ALGORITHM_LEVEL.
  • New diff_05_aprt_stage.rs include slot: when both inputs to apr diff --values start with magic bytes APRT, dispatch bypasses the whole-model RosettaStone walker and runs an element-wise stage-tensor diff (max|diff|, RMS, cosine sim, top-K).
  • Mismatched dim_product or layer → fail-fast error (no silent compare of incompatible stages).
  • Contract apr-cli-trace-save-tensor-v1 v1.0.0 → v1.1.0 with new FALSIFY-APR-TRACE-SAVE-009.

Why now

The SHIP-007 PR-A→B→C cascade for MODEL-1 layer-0 stage-by-stage element-wise bisection (per feedback_model_1_ships_gpu_only.md) needs PR-D infrastructure ready in parallel with PR #1408 (PR-C-real step 1). PR-D is CLI-only — no dependency on forward_traced threading — so it merges independently.

apr trace --save-tensor writes APRT-prefixed per-stage f32 tensors. Without this PR, callers must either parse the 12-byte header by hand or shell out to a Python script — exactly the kind of muda APR-MONO §26.8 forbids.

What changed

  • crates/apr-cli/src/commands/diff_05_aprt_stage.rs (new): is_aprt_stage_file, compute_aprt_stage_stats, run_aprt_stage_diff + 11 unit tests (provenance pin, magic detection, stats correctness, error cases).
  • crates/apr-cli/src/commands/diff.rs: detect APRT magic on both --values inputs and dispatch before the RosettaStone path; legacy callers (model-vs-model diff) unchanged.
  • contracts/apr-cli-trace-save-tensor-v1.yaml: v1.0.0 → v1.1.0; new FALSIFY-APR-TRACE-SAVE-009 with algorithm_evidence citing the new unit tests.
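
The dispatch rule in the list above can be sketched as follows. This is an illustration only, assuming the magic is the four ASCII bytes `APRT` at offset 0. `has_aprt_magic`, `choose_diff_path`, and `DiffPath` are hypothetical names, not the functions in `diff_05_aprt_stage.rs`; the real `is_aprt_stage_file` works on a file path and may apply further checks.

```rust
/// Four-byte magic that prefixes stage-tensor files (taken from the PR text).
const APRT_MAGIC: &[u8] = b"APRT";

/// Which diff implementation to run (hypothetical enum for illustration).
#[derive(Debug, PartialEq, Eq)]
enum DiffPath {
    /// Element-wise stage-tensor diff.
    StageTensor,
    /// Legacy whole-model walker.
    WholeModel,
}

/// True when the buffer starts with the `APRT` magic.
/// Buffers shorter than four bytes are rejected because `starts_with`
/// returns false when the needle is longer than the slice.
fn has_aprt_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(APRT_MAGIC)
}

/// Take the stage-tensor path only when BOTH inputs carry the magic;
/// every other combination stays on the legacy path.
fn choose_diff_path(a: &[u8], b: &[u8]) -> DiffPath {
    if has_aprt_magic(a) && has_aprt_magic(b) {
        DiffPath::StageTensor
    } else {
        DiffPath::WholeModel
    }
}
```

Requiring the magic on both inputs is what keeps model-vs-model callers on their existing code path.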

Test plan

  • cargo test -p apr-cli --lib commands::diff::aprt → 11/11 PASS
  • cargo clippy -p apr-cli --lib --no-deps -- -D warnings clean
  • pv validate contracts/apr-cli-trace-save-tensor-v1.yaml → 0 errors
  • CI required checks (ci / gate, workspace-test)

Five Whys

  1. Why now? SHIP-007 cascade needs PR-D ready when PR-C-real step 2 lands.
  2. Why extend apr diff instead of new subcommand? Contract apr_diff_values_compat already names apr diff --values as the verifier.
  3. Why an include!() file? diff.rs follows that pattern (diff_accumulator, diff_output_json_text, diff_04).
  4. Why no live integration smoke? The infrastructure for an end-to-end live run (apr trace --save-tensor X.bin) requires SHIP-007 PR-C-real step 2 (#1408 stacked) to be merged. The unit tests pin the byte-format contract via synthetic APRT fixtures, which is sufficient at PARTIAL_ALGORITHM_LEVEL per the contract's own discharge ladder.
  5. Why dogfood realizar::inference_trace::save_tensor::read_tensor_file instead of inline parsing? Reusing the same parser the writer uses is the canonical way to prevent format drift. apr-cli already imports from realizar via the default inference feature.

Ship % update

  • MODEL-1: ~64% → ~66% (1 invariant DISCHARGED-at-algorithm; infrastructure clear for SHIP-007 step E live diffing).
  • MODEL-2: full Stack v1.2 Python corpus tokenization running in background (~33h ETA).

🤖 Generated with Claude Code

…age tensors

Closes the `apr_diff_values_compat` invariant of `apr-cli-trace-save-tensor-v1`
at PARTIAL_ALGORITHM_LEVEL via a new `diff_05_aprt_stage.rs` include slot.

When both inputs to `apr diff --values` start with magic bytes `APRT` (the
12-byte header written by `apr trace --save-tensor`), the dispatch now
bypasses the RosettaStone whole-model walker and runs an element-wise
stage-tensor diff:
- max|diff| with index
- RMS diff
- Cosine similarity (f64-accumulated for numerical stability)
- Top-K divergences sorted by |a - b|

Both JSON and pretty text output are supported. Mismatched dim_product or
layer fields fail-fast with a diagnostic error so callers don't silently
compare incompatible stages.
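
A minimal sketch of the four statistics listed above, assuming both tensors are already decoded into `f32` slices. `StageStats` and `stage_stats` are hypothetical names and are not the PR's `compute_aprt_stage_stats`. Returning a cosine of 0.0 when either norm is zero is an assumption made here, not documented behavior.

```rust
/// Summary of an element-wise comparison (hypothetical struct).
#[derive(Debug)]
struct StageStats {
    max_abs_diff: f32,
    max_index: usize,
    rms_diff: f64,
    cosine: f64,
    /// (index, |a - b|) pairs, largest difference first.
    top_k: Vec<(usize, f32)>,
}

/// Compute max|diff| with index, RMS diff, cosine similarity and top-K.
/// Fails fast on a length mismatch instead of comparing a prefix.
fn stage_stats(a: &[f32], b: &[f32], k: usize) -> Result<StageStats, String> {
    if a.len() != b.len() {
        return Err(format!("dim_product mismatch: {} vs {}", a.len(), b.len()));
    }
    if a.is_empty() {
        return Err("empty tensors".to_string());
    }
    // Accumulate in f64 so long sums of f32 products do not lose precision.
    let (mut sum_sq, mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64, 0.0f64);
    let mut diffs: Vec<(usize, f32)> = Vec::with_capacity(a.len());
    for (i, (&x, &y)) in a.iter().zip(b.iter()).enumerate() {
        diffs.push((i, (x - y).abs()));
        let (x, y) = (f64::from(x), f64::from(y));
        sum_sq += (x - y) * (x - y);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    // Stable sort, largest |a - b| first; ties keep the lower index first.
    diffs.sort_by(|l, r| r.1.total_cmp(&l.1));
    let (max_index, max_abs_diff) = diffs[0];
    diffs.truncate(k);
    let denom = norm_a.sqrt() * norm_b.sqrt();
    // Assumption: report 0.0 when either vector has zero norm.
    let cosine = if denom > 0.0 { dot / denom } else { 0.0 };
    Ok(StageStats {
        max_abs_diff,
        max_index,
        rms_diff: (sum_sq / a.len() as f64).sqrt(),
        cosine,
        top_k: diffs,
    })
}
```

For `a = [1, 2, 3, 4]` and `b = [1, 2, 3, 6]` this reports a maximum difference of 2 at index 3 and an RMS difference of 1.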

## Five Whys (why now, why this scope)

1. **Why is this needed?** `apr trace --save-tensor` (PR-A #1405, PR-B #1406,
   PR-C-prep #1407) writes per-stage f32 tensors as `APRT`-prefixed files.
   Without an APRT-aware diff, layer-0 stage-by-stage element-wise
   bisection per `feedback_model_1_ships_gpu_only.md` is gated on external
   tooling — exactly the kind of muda the APR-MONO §26.8 rule forbids.
2. **Why extend `apr diff` and not write a new subcommand?** The
   `apr_diff_values_compat` invariant in `apr-cli-trace-save-tensor-v1`
   already names `apr diff --values` as the verifier. Extending the
   existing flag keeps the contract surface stable.
3. **Why an include!() file instead of inlining into diff.rs?** diff.rs
   already follows that pattern (diff_accumulator, diff_output_json_text,
   diff_04). Keeping APRT logic in `diff_05_aprt_stage.rs` lets it be
   audited / removed independently and doesn't grow the parent file.
4. **Why pin via `provenance_pin_pr_d_rev1`?** Future renames of either
   `is_aprt_stage_file` or the file path break the include!() chain;
   the pin makes that visible at test-time and forces a contract bump.
5. **Why now?** Tokenization of the 27 GB Stack v1.2 Python corpus is
   running in the background for MODEL-2 (PR #1412 merged). The SHIP-007
   PR-C-real cascade for MODEL-1 needs PR-D infrastructure ready when
   step 2 (forward_traced threading) lands. PR-D is independent and can
   merge in parallel with #1408.

## Verification

- `cargo test -p apr-cli --lib commands::diff::aprt` → 11/11 PASS
  - is_aprt_stage_file: detects/rejects/truncated/missing (4 tests)
  - compute_aprt_stage_stats: identical=zero, known max/RMS, top-K sort (3)
  - run_aprt_stage_diff: dim/layer mismatch errors, identical succeeds (3)
  - provenance_pin_pr_d_rev1 (1)
- `cargo clippy -p apr-cli --lib --no-deps -- -D warnings` clean
- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors

## Contract update

`apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0:
- New FALSIFY-APR-TRACE-SAVE-009 binding `apr_diff_values_compat` at
  PARTIAL_ALGORITHM_LEVEL with 4-line `algorithm_evidence` block citing
  this PR's unit tests.

## Ship % update

MODEL-1: ~64% → ~66% (PR-D is small but discharges 1 PARTIAL invariant
and clears infrastructure blocker for SHIP-007 step E).
MODEL-2: corpus tokenization in progress (~33h ETA).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 3, 2026 07:36
@noahgift noahgift merged commit e9294fa into main May 3, 2026
10 checks passed
@noahgift noahgift deleted the feat/apr-diff-values-aprt-stage-tensor-v1 branch May 3, 2026 08:11
noahgift added a commit that referenced this pull request May 3, 2026
… `forward_traced_with_save_tensor` (#1414)

Extends the wrapper from PR-C-real step 1 (#1408) to additionally write
the `LmHead` whole-model stage when the supplied [`SaveTensorPlan`]
selects it. The logits are pulled directly from `trace.logits` — the
`Vec<f32>` already returned by `forward_traced` — so no recompute, no
internal forward-pass surgery, no risk of behavior drift.

This is the same low-risk capture pattern as step 1's `Embedding`
branch (re-use already-computed data; defer the high-risk threading
into `forward_traced` to future steps).

## Five Whys (why now, why this scope)

1. **Why LmHead next?** Of the 18 `SaveTensorStage` variants, only two
   are externally re-extractable from a `forward_traced` return value
   without modifying the function body: `Embedding` (cheap re-call of
   `self.embed`) and `LmHead` (already in `trace.logits`). Step 1
   shipped Embedding; LmHead is the obvious second.
2. **Why not jump straight to per-layer stages (qkv, ffn_*)?** Those
   stages require threading `Option<&SaveTensorPlan>` through the
   360-line `forward_traced` body. That's the bigger surgery — high
   blast radius, deserves its own PR with proper drift-prevention
   tests and a real-model integration smoke. Splitting LmHead out
   first lets `apr diff --values` (PR #1413) compare APR vs GGUF
   logits TODAY for free, before per-layer infrastructure lands.
3. **Why use the WHOLE_MODEL_LAYER sentinel?** Per
   `apr-cli-trace-save-tensor-v1` `byte_format` invariant: whole-model
   stages (lm_head, final_norm) carry `0xFFFFFFFF` in the layer field
   so `apr diff --values` can recognize them. Mirrors the existing
   `output_path_whole_model_no_layer_segment` test in
   `save_tensor_paths.rs`.
4. **Why no integration test on a real `AprTransformer`?** Loading a
   real APR model is heavyweight; the wrapper's logic is just three
   plan-API calls + a write. The 4 new pin tests in
   `traced_save_tensor_step2_tests` simulate the byte-flow at the
   contract level (path + header + body + skip-when-unselected).
   Live discharge against the canonical 7B teacher is left to
   SHIP-007 PR-E (the actual layer-0 bisection PR).
5. **Why now in the SHIP-TWO loop?** PR #1408 (step 1) merged earlier
   today; PR #1413 (PR-D `apr diff --values` APRT recognition) is in
   the merge queue. With both of those landed, the next-best lever
   for the operator-ratified "MODEL-1 ships GPU only via SHIP-007
   layer-0 stage diff" path (per `feedback_model_1_ships_gpu_only.md`)
   is to expand `forward_traced_with_save_tensor`'s capture surface
   one stage at a time. LmHead is the smallest, safest next step.

## Verification

- `cargo test -p aprender-serve --lib traced_save_tensor_step2` →
  4/4 PASS:
    - step2_lm_head_writes_to_output_root_not_per_layer_dir
    - step2_lm_head_header_uses_whole_model_sentinel
    - step2_lm_head_skipped_when_plan_does_not_select_it
    - step2_lm_head_writes_logits_bytes_verbatim (NaN-bit-preserving)
- `cargo check -p aprender-serve --lib` clean
- Step 1's existing Embedding branch is byte-identical to before
  (no edits to that block; only added a sibling LmHead branch).
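
The "bytes verbatim (NaN-bit-preserving)" property checked by the last test can be illustrated with a small round trip. This is a sketch assuming the body is a flat run of little-endian `f32` values; the helper names are hypothetical and unrelated to the real `write_tensor_file`.

```rust
/// Serialize f32 values as little-endian bytes via their raw bit patterns,
/// so NaN payloads and the sign of zero are carried through unchanged.
fn f32s_to_le_bytes(values: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 4);
    for v in values {
        out.extend_from_slice(&v.to_bits().to_le_bytes());
    }
    out
}

/// Inverse of `f32s_to_le_bytes`; returns None if the length is not a
/// multiple of four bytes.
fn le_bytes_to_f32s(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_bits(u32::from_le_bytes([c[0], c[1], c[2], c[3]])))
            .collect(),
    )
}
```

Comparing `to_bits()` values rather than the floats themselves is what makes a NaN check meaningful, since `NaN != NaN` under ordinary float equality.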

## Contract

Contract update is intentionally deferred to a follow-up commit to
avoid file-conflict with PR #1413 (which is mid-merge and bumps
`apr-cli-trace-save-tensor-v1` v1.0.0 → v1.1.0). Once #1413 lands,
a small follow-up will bump v1.1.0 → v1.2.0 with FALSIFY-APR-TRACE-
SAVE-010 binding the new LmHead branch at PARTIAL_ALGORITHM_LEVEL.
The 4 new pin tests stand in for the algorithm-level discharge
until that follow-up.

## Ship % update

- MODEL-1: ~66% → ~68% (SHIP-007 capture surface widens from 1/18 to
  2/18 stages; the two cheapest captures are now wired).
- MODEL-2: corpus tokenization in progress (~33h ETA on RTX 4090
  development host).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
… records LmHead step-2 PARTIAL discharge (#1415)

Follow-up to PR #1414 (`forward_traced_with_save_tensor` step 2). Adds
FALSIFY-APR-TRACE-SAVE-010 binding the LmHead branch at
PARTIAL_ALGORITHM_LEVEL; the algorithm-level evidence cites the four
new pin tests in `traced_save_tensor_step2_tests`:

- step2_lm_head_writes_to_output_root_not_per_layer_dir
- step2_lm_head_header_uses_whole_model_sentinel
- step2_lm_head_skipped_when_plan_does_not_select_it
- step2_lm_head_writes_logits_bytes_verbatim (NaN-bit preserving)

`binds_to: byte_format` because step 2 invokes the same write_tensor_file
path with `WHOLE_MODEL_LAYER` sentinel as the existing `byte_format`
equation specifies. Live discharge against the canonical 7B teacher is
deferred to SHIP-007 PR-E (layer-0 bisection).
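
A sketch of how a header carrying the `WHOLE_MODEL_LAYER` sentinel might be encoded, decoded, and checked for comparability. The field order (magic, then `layer`, then `dim_product`) and the little-endian `u32` widths are assumptions chosen to fill the 12 bytes; the commit text names the fields and the sentinel value but not the layout. All function and struct names are hypothetical.

```rust
const APRT_MAGIC: [u8; 4] = *b"APRT";
const HEADER_LEN: usize = 12;
/// Sentinel in the layer field for whole-model stages (value from the text).
const WHOLE_MODEL_LAYER: u32 = 0xFFFF_FFFF;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct StageHeader {
    layer: u32,
    dim_product: u32,
}

/// Assumed layout: bytes 0..4 magic, 4..8 layer (LE), 8..12 dim_product (LE).
fn encode_header(h: StageHeader) -> [u8; HEADER_LEN] {
    let mut out = [0u8; HEADER_LEN];
    out[..4].copy_from_slice(&APRT_MAGIC);
    out[4..8].copy_from_slice(&h.layer.to_le_bytes());
    out[8..12].copy_from_slice(&h.dim_product.to_le_bytes());
    out
}

fn decode_header(bytes: &[u8]) -> Result<StageHeader, String> {
    if bytes.len() < HEADER_LEN {
        return Err(format!("truncated header: {} bytes", bytes.len()));
    }
    if bytes[..4] != APRT_MAGIC {
        return Err("missing APRT magic".to_string());
    }
    let layer = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    let dim_product = u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]);
    Ok(StageHeader { layer, dim_product })
}

/// Fail fast when two stage files cannot be meaningfully compared.
fn check_comparable(a: StageHeader, b: StageHeader) -> Result<(), String> {
    if a.layer != b.layer {
        return Err(format!("layer mismatch: {:#x} vs {:#x}", a.layer, b.layer));
    }
    if a.dim_product != b.dim_product {
        return Err(format!("dim_product mismatch: {} vs {}", a.dim_product, b.dim_product));
    }
    Ok(())
}
```

Under this layout an `lm_head` capture would carry `layer = 0xFFFFFFFF`, so comparing it against a layer-0 file is rejected before any element is read.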

## Five Whys

1. **Why a separate contract follow-up?** The PR #1414 commit needed to
   land before this bump to avoid file-conflict with PR #1413
   (which independently bumped v1.0.0 → v1.1.0 with FALSIFY-009).
2. **Why `binds_to: byte_format` and not `cli_signature`?** The wrapper
   doesn't add a new clap surface (PR-A already did that); it adds a
   new branch that emits files conforming to the existing byte-format
   equation. The new branch's verbatim f32 LE round-trip + NaN preservation
   is exactly the property `byte_format` invariants pin.
3. **Why PARTIAL_ALGORITHM_LEVEL not full discharge?** The 4 unit tests
   simulate the wrapper's byte-flow at the contract level using synthetic
   plans and fake logits — they do NOT instantiate a full AprTransformer
   or load a real APR model. Live discharge requires SHIP-007 PR-E.
4. **Why bump to v1.2.0?** Adding a new falsification test (FALSIFY-010)
   that binds an additional invariant is a minor schema change. Per
   semver, that's a minor bump.
5. **Why `pv validate` clean even with two new falsifiers in 24h?** The
   contract uses metadata.kind=schema, so falsification_tests entries
   are flexible; pv validates structure, IDs are unique, and binds_to
   references are valid.

## Verification

- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` → 0 errors
- v1.0.0 → v1.1.0 (PR #1413, FALSIFY-009 binding apr_diff_values_compat)
- v1.1.0 → v1.2.0 (this PR, FALSIFY-010 binding LmHead step-2 capture)

## Ship % update

- MODEL-1: ~68% (unchanged — this is paperwork that records yesterday's
  algorithm-level discharge of step 2; the actual capture surface
  expansion happened in PR #1414).
- MODEL-2: corpus tokenization at ~46.5M tokens / 56 min (steady ~14K
  tok/s); ~33h ETA for full 27 GB Stack v1.2 corpus.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
… records CLI dispatch wire-up PARTIAL discharge (#1418)

Follow-up paperwork to PR #1417 (`apr trace --save-tensor` end-to-end
dispatch for .apr files). Adds FALSIFY-APR-TRACE-SAVE-011 binding the
new dispatch wire-up at PARTIAL_ALGORITHM_LEVEL with `binds_to:
cli_signature`.

Before PR #1417, `apr trace --save-tensor` only printed a stub and
never invoked `forward_traced_with_save_tensor`. The contract test
`apr trace --save-tensor --help | grep save-tensor` (FALSIFY-001) was
already passing at the binary-boundary level — but the dispatch glue
was missing, leaving Embedding + LmHead capture surface unreachable
from the CLI for 2 days post-step-2 merge.

FALSIFY-011 extends the existing `cli_signature` invariant from
"the flag is recognized" to "the flag actually produces files".

## Five Whys

1. **Why a separate contract bump?** Avoids file-conflict with the
   in-flight refactor PR #1416 (which only touches
   `crates/aprender-serve/`). My contract change is isolated to
   `contracts/apr-cli-trace-save-tensor-v1.yaml`.
2. **Why `binds_to: cli_signature`?** PR #1417 doesn't change the
   byte format or determinism — it makes the CLI surface that the
   `cli_signature` equation already specified actually invocable.
   Same equation, expanded discharge level.
3. **Why PARTIAL_ALGORITHM_LEVEL?** The 5 unit tests cover path
   resolution (3) and recursive *.bin walking (2) — algorithm-level.
   A live discharge against the canonical 7B teacher is operator-
   gated by post-merge smoke (~30s for a 7B forward + 2 file writes).
4. **Why bump v1.2.0 → v1.3.0?** Adding a new falsification test
   that binds an existing invariant is a minor schema change per
   semver. v1.0.0 → v1.1.0 → v1.2.0 → v1.3.0 records each step's
   discharge timeline:
     - v1.1.0 (PR #1413): apr_diff_values_compat → APRT-aware diff
     - v1.2.0 (PR #1415): byte_format → LmHead capture (step 2)
     - v1.3.0 (this PR): cli_signature → end-to-end dispatch
5. **Why now?** Records the algorithm-level discharge so when the
   operator runs the live smoke post-#1417-merge, the contract
   ledger doesn't lag the code. Same paperwork pattern as #1415
   (which followed #1414).

## Verification

- `pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` →
  0 errors, 0 warnings

## Ship % update

- MODEL-1: ~70% (unchanged — pure paperwork; code is in PR #1417).
- MODEL-2: corpus tokenization at ~115M tokens / 143 min (steady
  ~14K tok/s; ~33h ETA total).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 3, 2026
…+ empirical bug location (#1423)

* contract(apr-cli-trace-save-tensor-v1): v1.3.0 → v1.4.0 — FUNCTIONAL discharge for FALSIFY-009/010/011

End-to-end live smoke on canonical Qwen2.5-Coder-7B-Instruct-Q4K teacher
(RTX 4090 lambda-labs, 2026-05-03) produced all 16 APRT stage files in a
single forward pass via SHIP-007 PR-C-real step 3 (PRs #1416 + #1421):

- 14 per-layer (layer-0/*): embedding, attn_norm, qkv_matmul, qkv_bias,
  attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up,
  ffn_silu, ffn_swigl, ffn_out, post_ffn_residual
- 2 whole-model (root/*): final_norm, lm_head

All 16 file sizes match `12 + 4 * dim_product` for their stage type
(3584 hidden / 18944 intermediate / 4608 qkv / 152064 vocab).
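
The size rule is simple enough to check by hand. The sketch below assumes `dim_product` equals the listed width for each stage type (one captured position); the function names are hypothetical.

```rust
const HEADER_LEN: u64 = 12;
const BYTES_PER_F32: u64 = 4;

/// Expected on-disk size of a stage file: 12-byte header plus f32 body.
fn expected_file_size(dim_product: u64) -> u64 {
    HEADER_LEN + BYTES_PER_F32 * dim_product
}

/// Recover dim_product from a file size, or None if the size cannot be
/// a 12-byte header followed by a whole number of f32 values.
fn dim_product_from_size(file_size: u64) -> Option<u64> {
    let body = file_size.checked_sub(HEADER_LEN)?;
    if body % BYTES_PER_F32 == 0 {
        Some(body / BYTES_PER_F32)
    } else {
        None
    }
}
```

Under that assumption the four widths give 14,348 bytes (hidden, 3584), 75,788 bytes (intermediate, 18944), 18,444 bytes (qkv, 4608), and 608,268 bytes (vocab, 152064).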

Three FALSIFY entries promoted PARTIAL_ALGORITHM_LEVEL → FUNCTIONAL:
- FALSIFY-APR-TRACE-SAVE-009 (apr_diff_values_compat — APRT byte format)
- FALSIFY-APR-TRACE-SAVE-010 (LmHead step-2 capture)
- FALSIFY-APR-TRACE-SAVE-011 (CLI dispatch wire-up)

`pv validate contracts/apr-cli-trace-save-tensor-v1.yaml` returns
0 errors / 0 warnings.

Five Whys
1. Why FUNCTIONAL not DISCHARGED? FUNCTIONAL means "behavior empirically
   verified in single live run". DISCHARGED requires either bytewise
   equivalence vs an oracle OR repeatable run-to-run cross-machine
   verification. SHIP-007 PR-C-real step 3 just ships the surface; the
   oracle comparison (APR vs HF Transformers reference) is the next leg.
2. Why bump on PR #1421 merge, not on a single follow-up commit? Each of
   FALSIFY-009/010/011 was already at PARTIAL with separate `_evidence`
   blocks; bumping all three together at FUNCTIONAL is the natural
   semver event.
3. Why `functional_evidence` block (alongside existing `algorithm_evidence`)?
   Drift-prevention: future readers need to see BOTH the algorithm-level
   tests that pin the impl AND the live byte-counts/file-counts that
   validate the impl runs end-to-end on the canonical teacher.
4. Why hand-cite the 16 stage names in the contract? They're the surface
   over which the next milestone (SHIP-007 layer-0 element-wise bisection
   vs HF reference) will diff — making them visible in the contract is
   the drift-prevention pin.
5. Why no v1.5.0 status: ACTIVE bump? The metadata `status: PROPOSED`
   tracks the document's lifecycle, not the falsifier maturity. Promoting
   to ACTIVE requires a separate decision after the spec audit (out of
   scope for this paperwork commit).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(scripts): SHIP-007 layer-0 oracle bisection — HF FP16 reference stage generator

Authors `scripts/generate_qwen25_coder_fp16_stages.py` — a Python tool that
runs `Qwen/Qwen2.5-Coder-7B-Instruct` at FP16 with forward hooks attached
to each natural per-layer module and dumps the activations in the same
APRT byte format that `apr trace --save-tensor` produces. Output layout
mirrors the APR side (`layer-0/<stage>.bin` + `<stage>.bin`) so
`apr diff --values <apr>.bin <hf>.bin` works element-wise without any
rewriting.

Captured 13/16 stages directly:
- Per-layer (11): embedding, attn_norm, attn_out, post_ffn_residual,
  ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out
- Whole-model (2): final_norm, lm_head

Skipped 3/16 (qkv_matmul / qkv_bias / attention) — these need
deeper instrumentation since HF stores Q/K/V separately + RoPE is
internal to self_attn. Deferred to a follow-up; the 13 captured
stages already cover all major points along the forward pass.

Five Whys
1. Why need an HF FP16 reference? SHIP-007 layer-0 element-wise diff
   needs a ground-truth oracle to compare APR Q4K against; FP16 is
   the closest published reference for this model.
2. Why not just use the existing `qwen2.5-coder-7b-instruct-q4k.safetensors`
   on disk? That's the same Q4K data we already feed into APR — diffing
   it against APR would only catch APR-side bugs that change weights, not
   bugs in forward computation. We need an INDEPENDENT reference.
3. Why hooks instead of direct model code edits? HF's modeling_qwen2.py
   is auto-loaded via `trust_remote_code=True`; the hooks let us inspect
   every stage without forking HF's source.
4. Why APRT byte format (not torch.save)? `apr diff --values` already
   recognizes APRT files (PR #1413) — using the same format makes the
   diff a one-liner. Drift-prevention: same format on both sides keeps
   comparison shape-agnostic.
5. Why skip qkv_matmul/qkv_bias/attention now? Discharging the discoverable
   13 stages is high-leverage; the remaining 3 require manual q+k+v
   concatenation and softmax(Q@Kᵀ)@V re-derivation. Worth a follow-up PR but
   blocking on it would delay every other stage's bisection signal.

Note: This script is NOT auto-run in CI — it requires HF cache containing
`Qwen/Qwen2.5-Coder-7B-Instruct` (~15 GB). Confirmed already cached at
~/.cache/huggingface/hub/ on noah-Lambda-Vector 2026-05-03. Operator runs
it once via `uv run --with torch --with transformers` to produce the
fixture; downstream `apr diff` passes are deterministic byte comparisons.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* evidence(ship-007): layer-0 oracle bisection — attn_out is the first diverging stage

End-to-end empirical bisection on canonical Qwen2.5-Coder-7B-Instruct
teacher with HF FP16 ground truth (CPU forward, HF cache hit).

Element-wise diff of every shared layer-0 stage between APR Q4K and HF FP16:

| Stage             | Cosine sim    | Status                                     |
|-------------------|---------------|--------------------------------------------|
| embedding         | 1.0000000000  | bit-identical (correct)                    |
| attn_norm         | 0.9999999483  | within Q4K noise (correct)                 |
| **attn_out**      | **0.9966403** | **FIRST DROP — bug is in attention block** |
| ffn_* (downstream)| 0.996-0.999   | carries drift (downstream artifacts)       |
| final_norm        | 0.9932669898  | (whole-model — accumulates 28 layers)      |
| lm_head           | 0.9969170161  | (whole-model — last-token logits)          |

This narrows the SHIP-007 root cause to the layer-0 attention block,
specifically between RMSNorm output (cos=0.99999995, correct) and
post-O-proj attention output (cos=0.9966, wrong).
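
The bisection step itself amounts to scanning the stages in forward order and stopping at the first cosine below a tolerance. A minimal sketch, where `first_divergent_stage` is a hypothetical helper and the 0.999 threshold used in the usage note is an assumption, not a value from the evidence:

```rust
/// Return the name of the first stage, in forward-pass order, whose cosine
/// similarity falls below `min_cosine`, or None if every stage passes.
fn first_divergent_stage<'a>(stages: &[(&'a str, f64)], min_cosine: f64) -> Option<&'a str> {
    stages
        .iter()
        .find(|(_, cos)| *cos < min_cosine)
        .map(|(name, _)| *name)
}
```

Fed the first three rows of the table (`embedding` 1.0, `attn_norm` 0.9999999483, `attn_out` 0.9966403) with a threshold of 0.999, it returns `attn_out`. The result depends on the threshold, which has to sit above the observed drop and below the expected quantization noise.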

Possible bug sites within the block:
1. qkv_matmul (Q4K matmul × QKV weights) — needs HF-side capture
2. qkv_bias
3. RoPE on Q/K
4. Q@Kᵀ scaled-dot-product
5. Softmax with causal mask
6. softmax @ V
7. O-projection (Q4K matmul × O-proj weight)

Next milestone: extend `scripts/generate_qwen25_coder_fp16_stages.py`
with qkv_matmul / qkv_bias / attention capture (currently deferred to
PARTIAL coverage), re-run diff, pinpoint the divergent kernel.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(scripts): SHIP-007 v2 — qkv_matmul/qkv_bias/attention captures narrow bug to attention math (#1424)

Extends `generate_qwen25_coder_fp16_stages.py` with HF-side captures for
the 3 stages previously deferred. Refines the SHIP-007 layer-0 bisection
from "inside attention block" (v1) to "between qkv_bias and attention" (v2).

## Refined cosine table

| Stage         | Cosine    | Δ from prev | Status                |
|---------------|-----------|-------------|-----------------------|
| attn_norm     | 0.9999999 | -5e-8       | RMSNorm correct       |
| qkv_matmul    | 0.99969   | -3.1e-4     | Q4K matmul noise (OK) |
| qkv_bias      | 0.9999975 | +2.8e-4     | bias dampens          |
| **attention** | 0.99858   | **-1.4e-3** | **← bug is here**     |
| attn_out      | 0.99664   | -1.9e-3     | O-proj amplifies      |

Bug is **between qkv_bias and attention** = inside the attention math:
RoPE / Q@Kᵀ / scale / causal mask / softmax / V@.

NOT in QKV matmul (acceptable Q4K noise).
NOT in QKV bias add (within FP precision).
O-projection adds its own ~1.9e-3 cosine drop — secondary.

## Implementation

New HF-side hooks:
- `make_qkv_hook` on q_proj/k_proj/v_proj — concat outputs to derive
  qkv_bias (post-bias) and qkv_matmul (post-bias minus per-Linear bias)
- `hook_o_proj_pre` (forward_pre_hook) on o_proj — captures its INPUT,
  which is APR's "attention" stage (post softmax(Q@Kᵀ)@v, pre-O-proj)

Script now produces 15 stage files (was 12 in v1).

## Why qkv_matmul cos=0.99969 < qkv_bias cos=0.9999975

Mathematical artifact, not a bug:
- qkv_matmul = Q4K_matmul(Q4K_input × Q4K_weight) — has ~3e-4 cosine noise vs FP16
- qkv_bias = qkv_matmul + bias (deterministic FP16 bias vector)
- Adding deterministic vector dominates direction → relative noise dampens
- Both APR and HF add the same bias values → cos increases on both sides equally

Confirmed via: bias subtraction matches (HF - bias ≈ APR pre-bias on each side).
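
The dampening effect can be reproduced with toy numbers. The sketch below is not derived from the model: the vectors, the noise, and the bias magnitude are made up for illustration, and it assumes the bias is large relative to the noisy activations.

```rust
/// Cosine similarity of two equal-length vectors.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// Element-wise sum of two equal-length vectors.
fn add(a: &[f64], b: &[f64]) -> Vec<f64> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// Returns (cosine before bias, cosine after bias) for a toy example in
/// which the same bias vector is added to a reference and a noisy copy.
fn bias_dampening_demo() -> (f64, f64) {
    let reference = [1.0, -1.0, 0.5, -0.5];
    let noise = [0.05, 0.05, -0.05, 0.05]; // stands in for quantization error
    let noisy = add(&reference, &noise);
    let bias = [3.0, 3.0, 3.0, 3.0]; // identical on both sides
    let before = cosine(&noisy, &reference);
    let after = cosine(&add(&noisy, &bias), &add(&reference, &bias));
    (before, after)
}
```

With these numbers the cosine rises from about 0.998 to about 0.9999 even though the absolute error is unchanged, which is the same direction of movement as the `qkv_matmul` to `qkv_bias` rows.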

## Five Whys

1. Why need qkv stage captures? v1 only narrowed bug to "attention block" —
   not enough to drive a fix. We need to know if the bug is in the projections
   or the attention math.
2. Why is qkv_matmul cos lower than qkv_bias? See above — bias addition
   is a known mathematical artifact with deterministic vectors.
3. Why is the bug between qkv_bias and attention specifically? Cos=0.9999975
   → 0.99858 is a 70× factor, far above Q4K floor. The intermediate ops
   (RoPE, scale, softmax, mask, V@) introduce real divergence.
4. Why O-proj adds another 1.9e-3 drop? Q4K_matmul of attention × O-proj
   weight is the same Q4K-vs-FP16 floor as qkv_matmul. Acceptable.
5. Why narrow further to RoPE/scale/softmax/mask/V@? Each is a candidate.
   Without finer-grained captures inside HF's monolithic Qwen2Attention,
   v2 cannot bisect further. Future work: instrument HF's attention internals
   OR cross-reference candle/pytorch for the algebraic spec of each sub-op.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* evidence(ship-007): v3+v4 — APR attention audit + DECISIVE PIVOT to GPU execution path (#1425)

* evidence(ship-007): v3 — APR attention code audit vs HF Qwen2 reference

Cross-referenced APR's attention forward (`inference.rs` + `helpers.rs`)
against HF Transformers Qwen2 to identify the algebraic source of the
v2-measured 1.4e-3 cosine drop between qkv_bias and attention.

## Audit result: NO algebraic bug in APR attention

Verified MATCHES vs HF Qwen2:
- RoPE rotation formula (split-half: x[i] = x1·cos − x2·sin and x[i+d/2] = x1·sin + x2·cos)
- RoPE freq formula (1/theta^(2i/d))
- rope_theta value (1000000.0 from `metadata.rope_theta`)
- Attention scale (1/sqrt(head_dim))
- Causal mask (`for j in 0..=i` triangular)
- Softmax (f32 max-subtract)
- QKV bias position (post-matmul, pre-RoPE)
- GQA-7:1 head indexing (`kv_head = head/group_size`)
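
For reference, the audited formulas can be written out as a minimal sketch. This is an illustration of the algebra listed above, not APR's actual code from `inference.rs` or `helpers.rs`; the function names and signatures are hypothetical:

```rust
/// Split-half RoPE applied in place to one head vector `x` (len = head_dim, even)
/// at token position `pos`, with inv_freq_i = 1 / theta^(2i/d).
/// Hypothetical signature for illustration only.
fn rope_split_half(x: &mut [f32], pos: usize, theta: f32) {
    let d = x.len();
    assert!(d % 2 == 0, "head_dim must be even");
    let half = d / 2;
    for i in 0..half {
        let inv_freq = 1.0 / theta.powf(2.0 * i as f32 / d as f32);
        let (sin, cos) = (pos as f32 * inv_freq).sin_cos();
        let (x1, x2) = (x[i], x[i + half]);
        x[i] = x1 * cos - x2 * sin;
        x[i + half] = x1 * sin + x2 * cos;
    }
}

/// Maps a query head to its shared KV head under grouped-query attention
/// (group_size = 7 for the 7:1 layout).
fn kv_head(head: usize, group_size: usize) -> usize {
    head / group_size
}

/// Causal softmax for query position `i`: only keys 0..=i are visible.
/// Scores are scaled by 1/sqrt(head_dim) and the max is subtracted in f32.
fn causal_softmax_row(scores: &[f32], i: usize, head_dim: usize) -> Vec<f32> {
    let scale = 1.0 / (head_dim as f32).sqrt();
    let visible = &scores[..=i];
    let max = visible
        .iter()
        .fold(f32::NEG_INFINITY, |m, &s| m.max(s * scale));
    let exps: Vec<f32> = visible.iter().map(|&s| (s * scale - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Two quick sanity properties follow from the formulas: position 0 leaves the vector unchanged, and the rotation preserves the vector norm at any position.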

## Refined hypothesis

The 1.4e-3 cosine drop is most likely **systematic precision loss from
Q4K dequant compounding through attention math**, NOT a structural
algorithmic bug. Specifically:

1. APR's `forward_traced` uses F32 dequantized Q4K weights (per
   `inference.rs:38` comment "Q4K layers not used in traced forward").
2. The Q4K dequant is lossy (~1e-3 RMS per element).
3. When these slightly-off Q values are dotted against slightly-off K
   values (also from Q4K dequant), the product compounds the error.
4. This compounding produces cos=0.99858 at attention output — consistent
   with systematic precision loss, not a bug.
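
The compounding in step 3 can be made explicit. As a sketch, let $q$ and $k$ be the FP16 reference vectors and $\delta_q$, $\delta_k$ the perturbations introduced by the dequantized Q4K weights (these symbols are assumptions for illustration, not measured quantities):

$$
(q + \delta_q)\cdot(k + \delta_k) - q\cdot k = q\cdot\delta_k + \delta_q\cdot k + \delta_q\cdot\delta_k
$$

By Cauchy-Schwarz, with relative errors $\varepsilon_q = \lVert\delta_q\rVert / \lVert q\rVert$ and $\varepsilon_k = \lVert\delta_k\rVert / \lVert k\rVert$:

$$
\left| (q + \delta_q)\cdot(k + \delta_k) - q\cdot k \right| \le \lVert q\rVert\,\lVert k\rVert\,(\varepsilon_q + \varepsilon_k + \varepsilon_q\varepsilon_k)
$$

Both first-order terms contribute to every attention score, and the perturbed scores then pass through softmax and the weighted sum over V, which carries its own dequant perturbation.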

## Implication for SHIP-007 fix

If this hypothesis is right, the layer-0 attention bisection has hit
the natural noise floor of Q4K-vs-FP16 comparison. The actual `apr run`
quality issue may be:
(a) Further downstream — accumulating drift through 28 layers
(b) NOT a forward-pass bug at all — could be sampling/decoding config
(c) Q4K kernel-specific — `apr run` uses Q4K kernels (faster path) while
    `forward_traced` uses F32 dequant (more accurate path); the two might
    diverge in how the kernel handles edge cases

## Next narrowing tests

1. Run `apr trace --save-tensor` on the FP16 safetensors version of the
   teacher; if cos improves to >0.999 across all stages, confirms (a)/(c).
2. Multi-layer cosine sweep (layers 0/1/13/27) to characterize drift growth.
3. argmax-flip check on lm_head — if APR top-1 token matches HF top-1,
   the drift is "noise" not bug-relevant.
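
The argmax-flip check in item 3 could look roughly like the following. This is a minimal sketch over two logit vectors already loaded into memory; `top_k`, `argmax_matches`, and `shared_prefix_len` are hypothetical helpers, not existing apr-cli functions:

```rust
/// Indices of the `k` largest logits, highest first.
/// Ties break toward the lower token id so the result is deterministic.
fn top_k(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]).then(a.cmp(&b)));
    idx.truncate(k);
    idx
}

/// True when both logit vectors agree on the greedy (top-1) token.
fn argmax_matches(a: &[f32], b: &[f32]) -> bool {
    top_k(a, 1) == top_k(b, 1)
}

/// Number of leading ranks on which two top-k lists agree.
fn shared_prefix_len(a: &[usize], b: &[usize]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}
```

If `argmax_matches` holds, greedy decoding picks the same token on both sides, so the cosine drift does not change the output at that step. `shared_prefix_len` on the two top-5 lists gives a coarser view of how far down the ranking the agreement extends.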

Evidence: `evidence/ship-007-layer-0-oracle-bisection-2026-05-03/findings-v3-attention-code-audit.md`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* evidence(ship-007): v4 — DECISIVE PIVOT, bug pinpointed to GPU execution path

Three falsifying live tests run on the canonical 7B teacher fundamentally
reframe SHIP-007:

## Test 1: lm_head argmax MATCHES
APR forward_traced top-1 token: 220 (' ')
HF FP16            top-1 token: 220 (' ')
First 3 top-5 ranks identical: [220, 576, 2014]

→ The cos=0.998 forward divergence in v1/v2/v3 is NOT bug-relevant for
  greedy decoding. It's just systematic precision noise.

## Test 2: `apr run --temperature 0.0` produces gibberish
$ apr run qwen2.5-coder-7b-instruct-q4k.apr --prompt 'What is 2+2?' \
    --max-tokens 16 --temperature 0.0
ampiezza = 0.5
diametro = 10

→ Italian-looking gibberish, NOT '4', NOT a coherent answer.

## Test 3: even `--max-tokens 1` disagrees with forward_traced
$ apr run [...] --max-tokens 1 --temperature 0.0
ampie

→ Single-step apr run produces different first token than
  forward_traced (which argmaxed to 220 ' ').

## The pivot

The SHIP-007 bug is NOT in the forward pass instrumented by
`forward_traced`. It's in the `apr run` GPU/wgpu hybrid execution path:

| Path           | Backend                         | Weights | Output for "What is 2+2?"    |
|----------------|---------------------------------|---------|------------------------------|
| forward_traced | CPU scalar-loop matmul          | F32     | argmax=220 (' ', matches HF) |
| apr run        | CUDA graph (646 kernels) + wgpu | F32     | "ampie..." (gibberish)       |

Both paths use the same F32 weights (apr run dequantizes Q4K to F32 before
GPU upload, per PMAT-333 log line). The divergence is in **kernel
implementations** — CPU scalar loops vs CUDA/wgpu kernels.

## All previous findings invalidated

- v1 "bug is in attention block" — INVALID (was just Q4K precision noise)
- v2 "bug is between qkv_bias and attention" — INVALID (same)
- v3 "no algebraic bug, must be precision" — PARTIALLY CORRECT (forward_traced
  IS correct), but missed that the actual broken path is `apr run` GPU.

The forward_traced bisection chain (cos drops at attention) is now understood
as a RED HERRING — it captures a different code path than the buggy one.

## Next narrowing

1. Force `apr run` to use CPU (env var or feature flag) — does it match
   forward_traced? If yes, confirms GPU/wgpu parity bug.
2. Element-wise diff GPU layer-0 attention output vs CPU forward_traced.
3. Audit `realizar/src/quantize/fused_*` and CUDA graph kernels for
   forward-pass bugs.
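
The element-wise diff in item 2 could be sketched as below, assuming both layer-0 attention outputs are already available as flat f32 slices. `StageDiff` and `stage_diff` are hypothetical names for illustration and are not the `compute_aprt_stage_stats` implementation:

```rust
/// Summary statistics for an element-wise comparison of two stage tensors.
/// Hypothetical type for illustration only.
#[derive(Debug)]
struct StageDiff {
    max_abs_diff: f32,
    worst_index: usize,
    rms_diff: f64,
    cosine: f64,
}

/// Compares a CPU reference against a GPU output, failing fast on shape mismatch.
/// `cosine` is NaN if either input is all zeros.
fn stage_diff(cpu: &[f32], gpu: &[f32]) -> Result<StageDiff, String> {
    if cpu.len() != gpu.len() {
        return Err(format!("length mismatch: {} vs {}", cpu.len(), gpu.len()));
    }
    if cpu.is_empty() {
        return Err("empty tensors".to_string());
    }
    let (mut dot, mut na, mut nb, mut sq) = (0.0f64, 0.0f64, 0.0f64, 0.0f64);
    let (mut max_abs_diff, mut worst_index) = (0.0f32, 0usize);
    for (i, (&a, &b)) in cpu.iter().zip(gpu).enumerate() {
        let d = (a - b).abs();
        if d > max_abs_diff {
            max_abs_diff = d;
            worst_index = i;
        }
        // Accumulate in f64 so the statistics do not add their own rounding noise.
        let (a, b) = (a as f64, b as f64);
        dot += a * b;
        na += a * a;
        nb += b * b;
        sq += (a - b) * (a - b);
    }
    Ok(StageDiff {
        max_abs_diff,
        worst_index,
        rms_diff: (sq / cpu.len() as f64).sqrt(),
        cosine: dot / (na.sqrt() * nb.sqrt()),
    })
}
```

`worst_index` points at the single element with the largest deviation, which helps distinguish a uniform precision floor from a localized kernel fault such as a wrong stride or a misindexed head.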

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
