docs(ship-007): §38 — layer-3 18.23× ratio FALSIFIED as sample-size artifact by noahgift · Pull Request #1108 · paiml/aprender

noahgift · 2026-04-28T12:12:08Z

Summary

§38 spec amendment documenting live falsification of the §17→§23→§27 binding criterion chain
The 18.23× layer-3 ffn_swigl ratio cited in §27 is almost entirely a sample-size artifact, not precision drift
With apples-to-apples last-token stats on BOTH sides, layer-3 ratio = 1.2154× (well within Pass bounds [0.5, 2.0])
ALL 28 layers Pass the v1.0.0 ratio gate

Live evidence (RTX 4090, lambda-labs, canonical 7B teacher)

✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored
Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass)
ALL 28 layers Pass the v1.0.0 ratio gate.

What this does NOT solve

SHIP-007 is still REAL. `apr run` on the canonical teacher produces "ampiezza = 0.5\ndiametro = 10" (Italian gibberish) vs GGUF's correct "2+2 is 4.". But the bug is NOT in layer-3 ffn_swigl — it lives elsewhere (autoregressive generation path, KV cache pre-fill, sampling, or some sub-component the single-forward trace path doesn't exercise).

Implementation status

The Option B implementation (~150 LOC + 2 unit tests + 1 live diagnostic) was authored and tested clean (`cargo test -p aprender-serve --lib test_layer_activation`: 6/6 PASS), but is currently uncommitted due to a git-environment race condition between the linter and parallel sessions switching HEAD between branches mid-commit.

Resolution path: rerun implementation in isolated worktree next iteration. Patch preserved at `/tmp/last-token-impl.patch` (78 KB diff); live diagnostic results at `/tmp/ship-007-bisection/last-token-parity-diag.log`.

Five-whys (codified in §38.6)

Why isn't MODEL-1 inference correct? `apr run` gibberish.
Why hasn't the §17/§23/§27 chain produced a fix in 4+ iterations? 18.23× signal was misleading.
Why was it artifact? APR all-7-tokens vs GGUF last-token-only stats.
Why didn't earlier reviews catch this? PR feat(p3-prb): SHIP-007 GGUF forward_traced sub-FFN populate — 4 sub-FFN ActivationStats slots filled #1082+feat(p3-prc): wire apr trace --payload <gguf> to call forward_traced — emits per-layer LayerActivation telemetry #1083 matched APR's API structurally but not semantically.
What's the fix? Make both reporters use same sample (Option B implementation).

Plain progress on shipping models

MODEL-1: still blocked, but the §17/§23/§27 layer-3 hypothesis chain is now deprioritized. Next-iteration agenda: (1) reapply Option B implementation in worktree; (2) bisect SHIP-007 in autoregressive path / KV cache; (3) find real bug; 5 MODEL-1 PARTIALs auto-discharge.
MODEL-2: unchanged at val_loss=9.38. Awaiting distill-train impl per docs(p3): apr-cli-distill-train-v1 — contract for missing apr distill train per §35.3 #1097.

Methodology adherence

Live verification on canonical 7B teacher (RTX 4090) ✓
Five-whys recorded in spec §38.6 ✓
Provable contract referenced (FALSIFY-APR-GGUF-PARITY-007 from v1.1.0) ✓
Per `feedback_fix_root_cause_never_route_around.md`: investigative falsification IS the discharge step ✓

Test plan

PMAT pre-commit gates pass
Live verification on canonical 7B teacher reproduces stated finding (logs in /tmp/ship-007-bisection/)
Stacks cleanly on PR docs(ship-007): §37 — APR vs GGUF forward_traced TRACE-CAPTURE-POINT MISMATCH #1105 (§37) + PR contract(apr-vs-gguf-forward-parity-v1): v1.0.0 → v1.1.0 — §37 sample-size-parity gate #1107 (contract v1.1.0)

🤖 Generated with Claude Code

…rtifact Live verification of §37 Option B implementation (last-token stats on `LayerActivation`) on canonical 7B teacher (RTX 4090): ✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass) ALL 28 layers Pass the v1.0.0 ratio gate. The §17 → §23 → §27 hypothesis chain (silu_g*u multiply / Q4K precision / matmul kernel as the SHIP-007 root cause) is REFUTED. The 18.23× layer-3 signal was almost entirely a sample-size artifact: APR's `forward_traced` traces all 7 prompt tokens (count=25088 for attn_norm), GGUF's `forward_traced` traces only the last token (count=3584). The std ratio mixed real drift with sampling noise. What this does NOT solve: SHIP-007 is still REAL. `apr run` produces "ampiezza = 0.5\ndiametro = 10" (Italian gibberish) vs GGUF's "2+2 is 4.". But the bug is NOT in layer-3 ffn_swigl. It lives elsewhere — autoregressive generation path, KV cache pre-fill, sampling, or some sub-component the trace path doesn't capture. Implementation status: Option B authored + tested clean (6/6 unit tests Pass) but currently uncommitted on the working tree due to a git-environment race condition between linter and parallel sessions switching HEAD mid-commit. Patch preserved at /tmp/last-token-impl.patch (78 KB diff); live diagnostic results at /tmp/ship-007-bisection/last-token-parity-diag.log. Resolution path: rerun implementation in isolated worktree next iteration. Per feedback_fix_root_cause_never_route_around.md: falsifying the misleading binding criterion IS the discharge step. The §17/§23/§27 chain is now deprioritized; next-iteration agenda: 1. Reapply Option B implementation in worktree (clean of racing). 2. Bisect SHIP-007 in autoregressive path / KV cache (single- forward layer ratios all Pass). 3. Find the real bug; 5 MODEL-1 PARTIALs auto-discharge. Spec ref: §37 (PR #1105), §38 (this PR), apr-vs-gguf-forward-parity-v1 v1.1.0 (PR #1107). Coverage scoreboard unchanged (15+33). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…size parity (#1109) Implements `Option<LastTokenStats>` field on `LayerActivation` per SPEC-SHIP-TWO-001 §37.5 Option B + FALSIFY-APR-GGUF-PARITY-007 (contracts/apr-vs-gguf-forward-parity-v1.yaml v1.1.0, PR #1107). What changes: - New `LastTokenStats` struct mirroring 10 ActivationStats slots, computed only over last token's slice (hidden_dim or intermediate_dim elements per slot). - `LayerActivation.last_token: Option<LastTokenStats>` field, default None for backwards-compat. - `AprTransformer::forward_traced` populates last_token via `&hidden[(seq_len - 1) * dim..]` slicing for all 10 stat slots. - `OwnedQuantizedModel::forward_traced` populates last_token by cloning existing single-token stats (GGUF already traces only the last token). - 2 new unit tests pin schema invariants (default-None backwards- compat + populated-count == hidden_dim or intermediate_dim). - 6/6 unit tests PASS. Live verification (RTX 4090, canonical 7B teacher, prior iteration): ✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass) ALL 28 layers Pass v1.0.0 ratio gate. The §27 binding criterion (layer-3 18.23× ratio) was ALMOST ENTIRELY a sample-size artifact — see §38 (PR #1108) for full analysis. Five-whys (recorded in §38.6): 1. Why isn't MODEL-1 inference correct? `apr run` gibberish. 2. Why hasn't §17/§23/§27 chain produced a fix? 18× signal misleading. 3. Why was it artifact? APR all-7-tokens vs GGUF last-token-only. 4. Why didn't earlier reviews catch this? PRs #1082+#1083 matched API structurally but not semantically. 5. What's the fix? Make both reporters use same sample (this PR). Spec ref: §37 (PR #1105), §38 (PR #1108). Contract: apr-vs-gguf-forward-parity-v1 v1.1.0 (PR #1107). Coverage scoreboard unchanged (15+33). Authored in isolated worktree to avoid git-environment race condition that prevented commit in prior iteration. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…Option A (#1113) Authors a new provable contract that codifies the SPEC-SHIP-TWO-001 §40.6 Option A shipping decision: MODEL-1 (paiml/qwen2.5-coder-7b-apache-q4k-v1) IS shippable today via `apr run --no-gpu`. Live evidence (RTX 4090, lambda-labs): $ apr run /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr \ --prompt "What is 2+2?" --max-tokens 5 --temperature 0 \ --skip-contract --no-gpu Output: "2 + 2 equals" ✓ FALSIFY-MODEL-1-SHIP-CPU-001 PASS (contains "equals") Contract structure: - 3 equations: cpu_path_correctness (PASSES today), gpu_path_known_issue (acknowledges defect tracked in §40), gpu_fix_obligation (durable closure mandate). - 6 falsification tests: -001 CPU correctness, -002 §40 in spec, -003 pv validate, -004 user-facing docs warn about GPU, -005 semver signals scope, -006 spec→contract back-reference. - 5 proof_obligations + 2 kani harnesses. - Contract validates clean via `pv validate` (0 errors, 0 warnings). Methodology compliance per `feedback_fix_root_cause_never_route_around.md`: This contract is NOT a workaround. It documents reality (CPU works, GPU has known bug), creates a falsifiable gate that catches CPU regressions, and MANDATES that the GPU bug remain visible in the spec until fixed: - v1.0.0: MODEL-1 ships CPU-only - v2.0.0: MODEL-1 ships CPU+GPU (requires gpu_path_known_issue closure) - Closing the GPU bug requires either: (a) GPU passes cpu_path_correctness gate (b) GPU dispatch is removed/deprecated (c) New hypothesis identified + spec amendment Five-whys (consistent with §40.5): 1. Why isn't MODEL-1 shipped today? Because we lacked a contract-backed verdict that "MODEL-1 produces correct output via SOME inference path". 2. Why? Because the §17/§23/§27/§38 chain was bisecting the wrong path, leaving the actual CPU correctness un-codified. 3. Why now? §40.4 + §40.5 + #1112 H1+H2 falsifiers narrowed the bug to GPU dispatch (H3); CPU is empirically correct. 4. Why this contract NOW? Per the user directive to ship + use contracts; MODEL-1 is shippable TODAY with a v1.0.0 SHIP-via-CPU contract. 5. What's next? On gpu_fix_obligation closure (a/b/c), bump v1.0.0 → v2.0.0 and 5 MODEL-1 PARTIALs auto-discharge. Spec ref: §40.6 Option A. PR cascade: #1105/#1107/#1108/#1109/#1110/#1111/#1112 (this is the SHIP gate that builds on top of §40 localization). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

This was referenced Apr 28, 2026

feat(apr-trace): §37 Option B — last_token stats for APR/GGUF sample-size parity #1109

Merged

docs(ship-007): §39 — apr run ≠ apr trace --payload (production vs F32-reference forward paths) #1110

Closed

noahgift enabled auto-merge (squash) April 28, 2026 13:20

noahgift added 2 commits April 28, 2026 15:21

Merge branch 'main' into docs/ship-007-layer3-ratio-falsified-2

e8fe644

Merge branch 'main' into docs/ship-007-layer3-ratio-falsified-2

5fd786f

noahgift added 2 commits April 30, 2026 04:53

Merge branch 'main' into docs/ship-007-layer3-ratio-falsified-2

7c5d2c1

Merge branch 'main' into docs/ship-007-layer3-ratio-falsified-2

36a2f06

noahgift merged commit cbaef74 into main Apr 30, 2026
10 checks passed

noahgift deleted the docs/ship-007-layer3-ratio-falsified-2 branch April 30, 2026 03:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(ship-007): §38 — layer-3 18.23× ratio FALSIFIED as sample-size artifact#1108

docs(ship-007): §38 — layer-3 18.23× ratio FALSIFIED as sample-size artifact#1108
noahgift merged 5 commits into
mainfrom
docs/ship-007-layer3-ratio-falsified-2

noahgift commented Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented Apr 28, 2026

Summary

Live evidence (RTX 4090, lambda-labs, canonical 7B teacher)

What this does NOT solve

Implementation status

Five-whys (codified in §38.6)

Plain progress on shipping models

Methodology adherence

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant