feat(apr-trace): §37 Option B — last_token stats for APR/GGUF sample-size parity by noahgift · Pull Request #1109 · paiml/aprender

noahgift · 2026-04-28T12:18:24Z

Summary

Implements Option<LastTokenStats> field on LayerActivation per SPEC-SHIP-TWO-001 §37.5 Option B + FALSIFY-APR-GGUF-PARITY-007 (apr-vs-gguf-forward-parity-v1.yaml v1.1.0, PR #1107).

What changes

New LastTokenStats struct mirroring 10 ActivationStats slots, computed only over the last token's slice (hidden_dim or intermediate_dim elements per slot).
LayerActivation.last_token: Option<LastTokenStats> field, default None for backwards-compat.
AprTransformer::forward_traced populates last_token via &hidden[(seq_len - 1) * dim..] slicing for all 10 stat slots (attn_norm / qkv / attn_out / ffn_norm / ffn_gate / ffn_up / ffn_silu_gate / ffn_swiglu_inner / ffn_out / output).
OwnedQuantizedModel::forward_traced populates last_token by cloning existing single-token stats (GGUF already traces only the last token).
2 new unit tests pin schema invariants (default-None backwards-compat + populated-count parity).
6/6 unit tests PASS.
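The shape of the change can be sketched as follows. This is a minimal illustration, not the code from the PR: the four fields on `ActivationStats`, the `from_slice` constructor, and the `last_token_slice` helper are assumptions made for the example. Only the slot names, the `Option<LastTokenStats>` field, and the `&hidden[(seq_len - 1) * dim..]` slicing come from the description above.

```rust
/// Summary statistics for one activation tensor.
/// ASSUMPTION: these four fields are illustrative; the real `ActivationStats` may differ.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ActivationStats {
    pub count: usize,
    pub mean: f32,
    pub min: f32,
    pub max: f32,
}

impl ActivationStats {
    /// Hypothetical constructor: summarize a flat slice of activations.
    pub fn from_slice(xs: &[f32]) -> Self {
        if xs.is_empty() {
            return Self::default();
        }
        let sum: f32 = xs.iter().sum();
        Self {
            count: xs.len(),
            mean: sum / xs.len() as f32,
            min: xs.iter().copied().fold(f32::INFINITY, f32::min),
            max: xs.iter().copied().fold(f32::NEG_INFINITY, f32::max),
        }
    }
}

/// One entry per stat slot named in the PR description.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct LastTokenStats {
    pub attn_norm: ActivationStats,
    pub qkv: ActivationStats,
    pub attn_out: ActivationStats,
    pub ffn_norm: ActivationStats,
    pub ffn_gate: ActivationStats,
    pub ffn_up: ActivationStats,
    pub ffn_silu_gate: ActivationStats,
    pub ffn_swiglu_inner: ActivationStats,
    pub ffn_out: ActivationStats,
    pub output: ActivationStats,
}

/// Per-layer trace record; only the new field is shown here.
/// Deriving `Default` yields `last_token: None`, the backwards-compatible default.
#[derive(Debug, Clone, Default)]
pub struct LayerActivation {
    pub last_token: Option<LastTokenStats>,
}

/// Returns the last token's row from a row-major `[seq_len, dim]` buffer.
pub fn last_token_slice(hidden: &[f32], seq_len: usize, dim: usize) -> &[f32] {
    assert!(seq_len > 0 && hidden.len() == seq_len * dim, "shape mismatch");
    &hidden[(seq_len - 1) * dim..]
}
```

Under these assumptions, the APR path would call `ActivationStats::from_slice(last_token_slice(..))` once per slot, while the GGUF path, which already traces a single token, could clone its existing stats into `Some(LastTokenStats { .. })`.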

Live verification (RTX 4090, canonical 7B teacher, prior iteration)

✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored
APR last_token populated: 28/28
GGUF last_token populated: 28/28

Apples-to-apples ffn_swigl ratios:
  Layer 3 ratio: 1.2154× (Pass)  ← NOT 18.23×!
  ALL 28 layers Pass v1.0.0 ratio gate.

The §27 binding criterion (layer-3 18.23× ratio) was ALMOST ENTIRELY a sample-size artifact — see §38 in PR #1108 for full analysis.
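To see how a sample-size mismatch alone can inflate a ratio, consider the toy sketch below. The data is synthetic and the choice of standard deviation as the compared statistic is an assumption; the PR does not say which statistic produced the 18.23× figure. All function names are hypothetical.

```rust
/// Population standard deviation of a slice (0.0 for an empty slice).
pub fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    var.sqrt()
}

/// Synthetic row-major `[seq_len, dim]` activations: token 0 alternates +100/-100,
/// every later token alternates +1/-1. Purely illustrative data, not model output.
pub fn synthetic_hidden(seq_len: usize, dim: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(seq_len * dim);
    for t in 0..seq_len {
        let scale = if t == 0 { 100.0 } else { 1.0 };
        for d in 0..dim {
            out.push(if d % 2 == 0 { scale } else { -scale });
        }
    }
    out
}

/// Count-parity precondition: both reporters must summarize the same number of elements
/// before any ratio between their statistics is meaningful.
pub fn same_sample_size(apr_count: usize, gguf_count: usize) -> bool {
    apr_count == gguf_count
}
```

With `seq_len = 7`, comparing `std_dev` over the whole buffer against `std_dev` over the last row gives a ratio far above 1 even though both sides read identical values, and `same_sample_size` reports the mismatch. Restricting both sides to the last row makes the ratio exactly 1.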

What this does NOT solve

SHIP-007 is still REAL. `apr run` produces "ampiezza = 0.5" gibberish where GGUF produces "2+2 is 4.". But the bug is NOT in layer-3 ffn_swigl. With the misleading binding criterion deprioritized, the next-iteration agenda is to bisect SHIP-007 in the autoregressive generation path, the KV cache, and sampling.

Five-whys (codified in §38.6)

Why isn't MODEL-1 inference correct? `apr run` produces gibberish.
Why hasn't the §17/§23/§27 chain produced a fix? The 18× signal was misleading.
Why was it an artifact? APR computed stats over all 7 tokens; GGUF computed them over the last token only.
Why didn't earlier reviews catch this? PRs #1082 (GGUF forward_traced sub-FFN populate) and #1083 (wire `apr trace --payload <gguf>` to forward_traced) matched the API structurally but not semantically.
What's the fix? Make both reporters use the same sample (this PR).

Plain progress on shipping models

MODEL-1: this PR makes the FALSIFY-APR-GGUF-PARITY-007 gate satisfiable. With it satisfied, the v1.0.0 ratio gates -001/-002/-003 become CREDIBLE (or, as §38 shows, trivially Pass once measured correctly). Next step: bisect SHIP-007 in autoregressive path.
MODEL-2: unchanged at val_loss=9.38. Awaiting distill-train impl per #1097 (apr-cli-distill-train-v1, the contract for the missing `apr distill train` per §35.3).

Methodology adherence

Live verification on canonical 7B teacher reproduced stated finding ✓
Five-whys in §38.6 ✓
Provable contract referenced (FALSIFY-APR-GGUF-PARITY-007 from v1.1.0) ✓
Authored in isolated worktree to avoid git-environment race condition that prevented commit in prior iteration ✓

Test plan

6/6 unit tests PASS in worktree (`cargo test -p aprender-serve --lib test_layer_activation`)
PMAT pre-commit gates pass
Stacks cleanly on main (no dependency on PRs #1105 (§37 docs), #1107 (contract v1.0.0 → v1.1.0), or #1108 (§38 docs))

🤖 Generated with Claude Code

…size parity

Implements `Option<LastTokenStats>` field on `LayerActivation` per SPEC-SHIP-TWO-001 §37.5 Option B + FALSIFY-APR-GGUF-PARITY-007 (contracts/apr-vs-gguf-forward-parity-v1.yaml v1.1.0, PR #1107).

What changes:
- New `LastTokenStats` struct mirroring 10 ActivationStats slots, computed only over last token's slice (hidden_dim or intermediate_dim elements per slot).
- `LayerActivation.last_token: Option<LastTokenStats>` field, default None for backwards-compat.
- `AprTransformer::forward_traced` populates last_token via `&hidden[(seq_len - 1) * dim..]` slicing for all 10 stat slots.
- `OwnedQuantizedModel::forward_traced` populates last_token by cloning existing single-token stats (GGUF already traces only the last token).
- 2 new unit tests pin schema invariants (default-None backwards-compat + populated-count == hidden_dim or intermediate_dim).
- 6/6 unit tests PASS.

Live verification (RTX 4090, canonical 7B teacher, prior iteration):
  ✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored
  Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass)
  ALL 28 layers Pass v1.0.0 ratio gate.

The §27 binding criterion (layer-3 18.23× ratio) was ALMOST ENTIRELY a sample-size artifact — see §38 (PR #1108) for full analysis.

Five-whys (recorded in §38.6):
1. Why isn't MODEL-1 inference correct? `apr run` gibberish.
2. Why hasn't §17/§23/§27 chain produced a fix? 18× signal misleading.
3. Why was it artifact? APR all-7-tokens vs GGUF last-token-only.
4. Why didn't earlier reviews catch this? PRs #1082+#1083 matched API structurally but not semantically.
5. What's the fix? Make both reporters use same sample (this PR).

Spec ref: §37 (PR #1105), §38 (PR #1108).
Contract: apr-vs-gguf-forward-parity-v1 v1.1.0 (PR #1107).
Coverage scoreboard unchanged (15+33).

Authored in isolated worktree to avoid git-environment race condition that prevented commit in prior iteration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…Option A (#1113)

Authors a new provable contract that codifies the SPEC-SHIP-TWO-001 §40.6 Option A shipping decision: MODEL-1 (paiml/qwen2.5-coder-7b-apache-q4k-v1) IS shippable today via `apr run --no-gpu`.

Live evidence (RTX 4090, lambda-labs):

  $ apr run /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr \
      --prompt "What is 2+2?" --max-tokens 5 --temperature 0 \
      --skip-contract --no-gpu
  Output: "2 + 2 equals"
  ✓ FALSIFY-MODEL-1-SHIP-CPU-001 PASS (contains "equals")

Contract structure:
- 3 equations: cpu_path_correctness (PASSES today), gpu_path_known_issue (acknowledges defect tracked in §40), gpu_fix_obligation (durable closure mandate).
- 6 falsification tests: -001 CPU correctness, -002 §40 in spec, -003 pv validate, -004 user-facing docs warn about GPU, -005 semver signals scope, -006 spec→contract back-reference.
- 5 proof_obligations + 2 kani harnesses.
- Contract validates clean via `pv validate` (0 errors, 0 warnings).

Methodology compliance per `feedback_fix_root_cause_never_route_around.md`:
This contract is NOT a workaround. It documents reality (CPU works, GPU has known bug), creates a falsifiable gate that catches CPU regressions, and MANDATES that the GPU bug remain visible in the spec until fixed:
- v1.0.0: MODEL-1 ships CPU-only
- v2.0.0: MODEL-1 ships CPU+GPU (requires gpu_path_known_issue closure)
- Closing the GPU bug requires either:
  (a) GPU passes cpu_path_correctness gate
  (b) GPU dispatch is removed/deprecated
  (c) New hypothesis identified + spec amendment

Five-whys (consistent with §40.5):
1. Why isn't MODEL-1 shipped today? Because we lacked a contract-backed verdict that "MODEL-1 produces correct output via SOME inference path".
2. Why? Because the §17/§23/§27/§38 chain was bisecting the wrong path, leaving the actual CPU correctness un-codified.
3. Why now? §40.4 + §40.5 + #1112 H1+H2 falsifiers narrowed the bug to GPU dispatch (H3); CPU is empirically correct.
4. Why this contract NOW? Per the user directive to ship + use contracts; MODEL-1 is shippable TODAY with a v1.0.0 SHIP-via-CPU contract.
5. What's next? On gpu_fix_obligation closure (a/b/c), bump v1.0.0 → v2.0.0 and 5 MODEL-1 PARTIALs auto-discharge.

Spec ref: §40.6 Option A.
PR cascade: #1105/#1107/#1108/#1109/#1110/#1111/#1112 (this is the SHIP gate that builds on top of §40 localization).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
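The versioning rule in this commit message can be read as a small decision function. The sketch below is one hedged interpretation: `GpuClosure`, `cpu_gate_passes`, and `contract_version` are hypothetical names, and the real contract is a YAML document checked by `pv validate`, not Rust code. Only the "contains `equals`" check and the v1.0.0/v2.0.0 split come from the text above.

```rust
/// The three closure routes (a)/(b)/(c) listed in the commit message.
/// ASSUMPTION: modelling them as an enum is illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuClosure {
    /// (a) GPU passes the cpu_path_correctness gate.
    GpuPassesCpuGate,
    /// (b) GPU dispatch is removed or deprecated.
    GpuDispatchRemoved,
    /// (c) New hypothesis identified plus a spec amendment.
    NewHypothesisAmended,
}

/// FALSIFY-MODEL-1-SHIP-CPU-001 as described: the CPU output must contain "equals".
pub fn cpu_gate_passes(output: &str) -> bool {
    output.contains("equals")
}

/// Contract version implied by the gate state.
/// Returns `None` when the CPU gate fails, since nothing is shippable then.
pub fn contract_version(cpu_output: &str, gpu: Option<GpuClosure>) -> Option<&'static str> {
    if !cpu_gate_passes(cpu_output) {
        return None;
    }
    Some(match gpu {
        None => "1.0.0",    // CPU-only ship; GPU issue still open
        Some(_) => "2.0.0", // gpu_fix_obligation closed via (a), (b), or (c)
    })
}
```

For the live evidence quoted above, `contract_version("2 + 2 equals", None)` yields `Some("1.0.0")` under this reading.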

noahgift mentioned this pull request Apr 28, 2026

docs(ship-007): §39 — apr run ≠ apr trace --payload (production vs F32-reference forward paths) #1110

Closed

3 tasks

noahgift enabled auto-merge (squash) April 28, 2026 13:20

noahgift added 2 commits April 28, 2026 15:22

Merge branch 'main' into feat/last-token-stats-parity-v3

9920822

Merge branch 'main' into feat/last-token-stats-parity-v3

696f9b3

noahgift merged commit 304d320 into main Apr 30, 2026
10 checks passed

noahgift deleted the feat/last-token-stats-parity-v3 branch April 30, 2026 02:47

noahgift merged 3 commits into main from feat/last-token-stats-parity-v3
