Skip to content

docs(ship-007): §38 — layer-3 18.23× ratio FALSIFIED as sample-size artifact#1108

Merged
noahgift merged 5 commits into
mainfrom
docs/ship-007-layer3-ratio-falsified-2
Apr 30, 2026
Merged

docs(ship-007): §38 — layer-3 18.23× ratio FALSIFIED as sample-size artifact#1108
noahgift merged 5 commits into
mainfrom
docs/ship-007-layer3-ratio-falsified-2

Conversation

@noahgift

Copy link
Copy Markdown
Contributor

Summary

  • §38 spec amendment documenting live falsification of the §17→§23→§27 binding criterion chain
  • The 18.23× layer-3 ffn_swigl ratio cited in §27 is almost entirely a sample-size artifact, not precision drift
  • With apples-to-apples last-token stats on BOTH sides, layer-3 ratio = 1.2154× (well within Pass bounds [0.5, 2.0])
  • ALL 28 layers Pass the v1.0.0 ratio gate

Live evidence (RTX 4090, lambda-labs, canonical 7B teacher)

✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored
Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass)
ALL 28 layers Pass the v1.0.0 ratio gate.

What this does NOT solve

SHIP-007 is still REAL. `apr run` on the canonical teacher produces "ampiezza = 0.5\ndiametro = 10" (Italian gibberish) vs GGUF's correct "2+2 is 4.". But the bug is NOT in layer-3 ffn_swigl — it lives elsewhere (autoregressive generation path, KV cache pre-fill, sampling, or some sub-component the single-forward trace path doesn't exercise).

Implementation status

The Option B implementation (~150 LOC + 2 unit tests + 1 live diagnostic) was authored and tested clean (`cargo test -p aprender-serve --lib test_layer_activation`: 6/6 PASS), but is currently uncommitted due to a git-environment race condition between the linter and parallel sessions switching HEAD between branches mid-commit.

Resolution path: rerun implementation in isolated worktree next iteration. Patch preserved at `/tmp/last-token-impl.patch` (78 KB diff); live diagnostic results at `/tmp/ship-007-bisection/last-token-parity-diag.log`.

Five-whys (codified in §38.6)

  1. Why isn't MODEL-1 inference correct? `apr run` gibberish.
  2. Why hasn't the §17/§23/§27 chain produced a fix in 4+ iterations? 18.23× signal was misleading.
  3. Why was it artifact? APR all-7-tokens vs GGUF last-token-only stats.
  4. Why didn't earlier reviews catch this? PR feat(p3-prb): SHIP-007 GGUF forward_traced sub-FFN populate — 4 sub-FFN ActivationStats slots filled #1082+feat(p3-prc): wire apr trace --payload <gguf> to call forward_traced — emits per-layer LayerActivation telemetry #1083 matched APR's API structurally but not semantically.
  5. What's the fix? Make both reporters use same sample (Option B implementation).

Plain progress on shipping models

Methodology adherence

  • Live verification on canonical 7B teacher (RTX 4090) ✓
  • Five-whys recorded in spec §38.6 ✓
  • Provable contract referenced (FALSIFY-APR-GGUF-PARITY-007 from v1.1.0) ✓
  • Per `feedback_fix_root_cause_never_route_around.md`: investigative falsification IS the discharge step ✓

Test plan

🤖 Generated with Claude Code

…rtifact

Live verification of §37 Option B implementation (last-token stats
on `LayerActivation`) on canonical 7B teacher (RTX 4090):

  ✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored
  Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass)
  ALL 28 layers Pass the v1.0.0 ratio gate.

The §17 → §23 → §27 hypothesis chain (silu_g*u multiply / Q4K
precision / matmul kernel as the SHIP-007 root cause) is REFUTED.
The 18.23× layer-3 signal was almost entirely a sample-size artifact:
APR's `forward_traced` traces all 7 prompt tokens (count=25088 for
attn_norm), GGUF's `forward_traced` traces only the last token
(count=3584). The std ratio mixed real drift with sampling noise.

What this does NOT solve: SHIP-007 is still REAL. `apr run` produces
"ampiezza = 0.5\ndiametro = 10" (Italian gibberish) vs GGUF's
"2+2 is 4.". But the bug is NOT in layer-3 ffn_swigl. It lives
elsewhere — autoregressive generation path, KV cache pre-fill,
sampling, or some sub-component the trace path doesn't capture.

Implementation status: Option B authored + tested clean (6/6 unit
tests Pass) but currently uncommitted on the working tree due to
a git-environment race condition between linter and parallel
sessions switching HEAD mid-commit. Patch preserved at
/tmp/last-token-impl.patch (78 KB diff); live diagnostic results
at /tmp/ship-007-bisection/last-token-parity-diag.log.

Resolution path: rerun implementation in isolated worktree next
iteration.

Per feedback_fix_root_cause_never_route_around.md: falsifying the
misleading binding criterion IS the discharge step. The §17/§23/§27
chain is now deprioritized; next-iteration agenda:

1. Reapply Option B implementation in worktree (clean of racing).
2. Bisect SHIP-007 in autoregressive path / KV cache (single-
   forward layer ratios all Pass).
3. Find the real bug; 5 MODEL-1 PARTIALs auto-discharge.

Spec ref: §37 (PR #1105), §38 (this PR), apr-vs-gguf-forward-parity-v1
v1.1.0 (PR #1107). Coverage scoreboard unchanged (15+33).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 30, 2026
…size parity (#1109)

Implements `Option<LastTokenStats>` field on `LayerActivation` per
SPEC-SHIP-TWO-001 §37.5 Option B + FALSIFY-APR-GGUF-PARITY-007
(contracts/apr-vs-gguf-forward-parity-v1.yaml v1.1.0, PR #1107).

What changes:
- New `LastTokenStats` struct mirroring 10 ActivationStats slots,
  computed only over last token's slice (hidden_dim or
  intermediate_dim elements per slot).
- `LayerActivation.last_token: Option<LastTokenStats>` field, default
  None for backwards-compat.
- `AprTransformer::forward_traced` populates last_token via
  `&hidden[(seq_len - 1) * dim..]` slicing for all 10 stat slots.
- `OwnedQuantizedModel::forward_traced` populates last_token by
  cloning existing single-token stats (GGUF already traces only
  the last token).
- 2 new unit tests pin schema invariants (default-None backwards-
  compat + populated-count == hidden_dim or intermediate_dim).
- 6/6 unit tests PASS.

Live verification (RTX 4090, canonical 7B teacher, prior iteration):
  ✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored
  Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass)
  ALL 28 layers Pass v1.0.0 ratio gate.

The §27 binding criterion (layer-3 18.23× ratio) was ALMOST ENTIRELY
a sample-size artifact — see §38 (PR #1108) for full analysis.

Five-whys (recorded in §38.6):
1. Why isn't MODEL-1 inference correct? `apr run` gibberish.
2. Why hasn't §17/§23/§27 chain produced a fix? 18× signal misleading.
3. Why was it artifact? APR all-7-tokens vs GGUF last-token-only.
4. Why didn't earlier reviews catch this? PRs #1082+#1083 matched
   API structurally but not semantically.
5. What's the fix? Make both reporters use same sample (this PR).

Spec ref: §37 (PR #1105), §38 (PR #1108).
Contract: apr-vs-gguf-forward-parity-v1 v1.1.0 (PR #1107).
Coverage scoreboard unchanged (15+33).

Authored in isolated worktree to avoid git-environment race
condition that prevented commit in prior iteration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit cbaef74 into main Apr 30, 2026
10 checks passed
@noahgift noahgift deleted the docs/ship-007-layer3-ratio-falsified-2 branch April 30, 2026 03:46
noahgift added a commit that referenced this pull request May 13, 2026
…Option A (#1113)

Authors a new provable contract that codifies the SPEC-SHIP-TWO-001 §40.6
Option A shipping decision: MODEL-1 (paiml/qwen2.5-coder-7b-apache-q4k-v1)
IS shippable today via `apr run --no-gpu`.

Live evidence (RTX 4090, lambda-labs):

  $ apr run /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr \
      --prompt "What is 2+2?" --max-tokens 5 --temperature 0 \
      --skip-contract --no-gpu
  Output: "2 + 2 equals"

  ✓ FALSIFY-MODEL-1-SHIP-CPU-001 PASS (contains "equals")

Contract structure:
- 3 equations: cpu_path_correctness (PASSES today), gpu_path_known_issue
  (acknowledges defect tracked in §40), gpu_fix_obligation (durable
  closure mandate).
- 6 falsification tests: -001 CPU correctness, -002 §40 in spec,
  -003 pv validate, -004 user-facing docs warn about GPU, -005 semver
  signals scope, -006 spec→contract back-reference.
- 5 proof_obligations + 2 kani harnesses.
- Contract validates clean via `pv validate` (0 errors, 0 warnings).

Methodology compliance per `feedback_fix_root_cause_never_route_around.md`:

This contract is NOT a workaround. It documents reality (CPU works, GPU
has known bug), creates a falsifiable gate that catches CPU regressions,
and MANDATES that the GPU bug remain visible in the spec until fixed:

- v1.0.0: MODEL-1 ships CPU-only
- v2.0.0: MODEL-1 ships CPU+GPU (requires gpu_path_known_issue closure)
- Closing the GPU bug requires either:
  (a) GPU passes cpu_path_correctness gate
  (b) GPU dispatch is removed/deprecated
  (c) New hypothesis identified + spec amendment

Five-whys (consistent with §40.5):
1. Why isn't MODEL-1 shipped today? Because we lacked a contract-backed
   verdict that "MODEL-1 produces correct output via SOME inference path".
2. Why? Because the §17/§23/§27/§38 chain was bisecting the wrong path,
   leaving the actual CPU correctness un-codified.
3. Why now? §40.4 + §40.5 + #1112 H1+H2 falsifiers narrowed the bug to
   GPU dispatch (H3); CPU is empirically correct.
4. Why this contract NOW? Per the user directive to ship + use contracts;
   MODEL-1 is shippable TODAY with a v1.0.0 SHIP-via-CPU contract.
5. What's next? On gpu_fix_obligation closure (a/b/c), bump v1.0.0 → v2.0.0
   and 5 MODEL-1 PARTIALs auto-discharge.

Spec ref: §40.6 Option A.
PR cascade: #1105/#1107/#1108/#1109/#1110/#1111/#1112 (this is the SHIP
gate that builds on top of §40 localization).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant