docs(ship-007): §16 APR forward CPU path isolated as root cause #1063

Merged
noahgift merged 3 commits into main from docs/ship-007-apr-forward-isolation on Apr 26, 2026

@noahgift

Summary

What §16 contains

  1. §16.1 — full live trace evidence with both run outputs side-by-side
  2. §16.2 — elimination table (7 suspects ruled out, with citations to #1059, #1061, and existing parity tests)
  3. §16.3 — surviving suspects (layer-composition glue in forward_single_with_scratch, multi-layer KV cache layout, RoPE setup, LM head)
  4. §16.4 — falsifiable next investigation step: apr trace --payload --layer 0 bisection across 28 layers (1-2 sessions, not multi-PR)
  5. §16.5 — methodological continuation: zero eprintln! per feedback_apr_trace_not_eprintln.md

Why this matters

All 5 transitively-blocked MODEL-1 PARTIALs (SHIP-002/005/006/007/008) discharge once this single bug is fixed (per §15.7 blast-radius inventory). The next root-cause-fix PR is now much more focused: start with crates/aprender-serve/src/gguf/inference/forward/single_cache.rs and the APR-specific forward_single_with_scratch path.

Test plan

  • §16 added at end of spec, before END OF SPECIFICATION marker
  • Atomic-next-action banner updated v2.59.0 → v2.61.0
  • PMAT pre-commit gates pass (complexity, SATD, docs)
  • Investigation evidence is reproducible: apr trace <model.apr|.gguf> --payload

🤖 Generated with Claude Code

…cause — spec v2.59.0 → v2.61.0

Live `apr trace --payload` on the canonical paiml/qwen2.5-coder-7b-apache-q4k-v1
teacher (noah-Lambda-Vector RTX 4090, 2026-04-26) ran twice on CPU with the
same prompt "What is 2+2?", same encoded tokens [3838, 374, 220, 17, 10, 17, 30],
same embedded BPE tokenizer:

  APR teacher  → top-1 token=220 (" "), logit=16.7368  ← WRONG
  GGUF teacher → " 2+2 is 4."                          ← CORRECT

Combined with §15.4 (PR #1061 — GPU GQA-7:1 attention parity tests all PASS),
this eliminates: GPU stack, GQA attention kernel, tokenizer, loader-side data
layout, Q4K dequantization, RMSNorm, embedding lookup. Surviving suspects are
all in the APR-format CPU forward path:

  - Layer-composition glue in forward_single_with_scratch
  - Multi-layer KV cache layout (across-layer indexing)
  - Position embedding (RoPE) layout / sin/cos cache
  - LM head projection

§16.4 specifies the falsifiable next investigation step: `apr trace --payload
--layer 0` bisection across 28 layers. This is a 1-2 session task, not a
multi-PR effort. Whatever fix lands also discharges all 5 transitively-blocked
MODEL-1 PARTIALs (SHIP-002/005/006/007/008) per §15.7's blast-radius inventory.
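
The layer bisection can be sketched as a plain binary search. This is a minimal illustration, not code from the repository: `first_divergent_layer` and its `diverges` predicate are hypothetical names, and the sketch assumes that once a layer diverges, every later layer diverges too. In practice the predicate would wrap a comparison of per-layer APR and GGUF statistics taken from `apr trace --payload`.

```rust
/// Binary search for the first layer in `0..n_layers` for which
/// `diverges(layer)` is true.
///
/// ASSUMPTION (not from the spec): divergence is monotonic, i.e. every
/// layer after the first divergent one also diverges.
/// Returns `None` when no layer diverges.
fn first_divergent_layer<F>(n_layers: usize, mut diverges: F) -> Option<usize>
where
    F: FnMut(usize) -> bool,
{
    // Half-open search window [lo, hi).
    let (mut lo, mut hi) = (0usize, n_layers);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if diverges(mid) {
            hi = mid; // first divergence is at `mid` or earlier
        } else {
            lo = mid + 1; // first divergence is after `mid`
        }
    }
    if lo < n_layers {
        Some(lo)
    } else {
        None
    }
}
```

With 28 layers this needs at most 5 probes. If divergence does not persist into later layers, the monotonicity assumption fails and a linear scan over all layers is the safer choice.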

Spec v2.59.0 → v2.61.0 (jumps v2.60.0; reserved for #1062 conflict-merge).
No coverage tally change — investigation-recording amendment, not rule
promotion.

Methodological continuation per feedback_apr_trace_not_eprintln.md: zero
eprintln! added, exact same `apr trace --payload` primitive used in §15.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 71c1152 into main Apr 26, 2026
10 checks passed
@noahgift noahgift deleted the docs/ship-007-apr-forward-isolation branch April 26, 2026 11:04
noahgift added a commit that referenced this pull request Apr 26, 2026
…t 17× anomaly site — spec v2.65.0 → v2.66.0

§17.4 specified the falsifier next step as sub-layer bisection of
{ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added
the 4 new ActivationStats fields. §21 records the **first run of
the bisection on the canonical 7B teacher**.

## What §21 contains (8 subsections)

- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2×
  layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters — silu(g) and u individually normal at
  layer 3, but their elementwise product is 17× — implies an
  unusual positive correlation or alignment bug
- §21.5 Refined surviving suspect surface — element-wise multiply
  correctness (`inference.rs:163`) + off-by-one slice indexing as
  newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare
  APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on
  PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)

## Per-layer ffn_swigl progression (key data)

| Layer | ffn_swigl std | Note           |
|------:|--------------:|:---------------|
| 0     | 0.088         |                |
| 1     | 0.061         |                |
| 2     | 0.071         |                |
| **3** | **1.222**     | 17.2× layer 2  |
| 4     | 0.390         |                |
| 5-25  | ~0.15-0.55    |                |
| 26    | 1.452         |                |
| 27    | 2.247         |                |

Layer 3 stands out among the early and middle layers: layers 0-2 and 4-25
all sit in the 0.06-0.55 band, so the 1.22 value is anomalous there. Only
the final two layers (26 and 27) also exceed that band.
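
The 17.2× figure is the ratio of one layer's std to the previous layer's. A minimal sketch of that arithmetic follows; `std_dev` and `largest_jump` are hypothetical helper names, and the population (divide-by-n) formula is an assumption, since the tally does not say which variant `apr trace` reports.

```rust
/// Population standard deviation (divides by n); 0.0 for an empty slice.
/// ASSUMPTION: the trace may use a different variant (e.g. sample std).
fn std_dev(xs: &[f64]) -> f64 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    var.sqrt()
}

/// Finds the layer with the largest std jump relative to the layer before it.
/// Returns `(layer_index, stds[i] / stds[i - 1])`, skipping zero denominators.
fn largest_jump(stds: &[f64]) -> Option<(usize, f64)> {
    let mut best: Option<(usize, f64)> = None;
    for (i, pair) in stds.windows(2).enumerate() {
        let (prev, cur) = (pair[0], pair[1]);
        if prev == 0.0 {
            continue; // ratio undefined
        }
        let ratio = cur / prev;
        if best.map_or(true, |(_, r)| ratio > r) {
            best = Some((i + 1, ratio));
        }
    }
    best
}
```

Fed the layer 0-4 values from the table (`[0.088, 0.061, 0.071, 1.222, 0.390]`), `largest_jump` picks layer 3 with a ratio of 1.222 / 0.071 ≈ 17.2.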

## Bug surface narrowing (across §15→§16→§17→§21)

- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)

The fix surface is now: `inference.rs:160-164`, specifically the
`ffn_hidden.push(silu_g * u)` element-wise multiply.
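
For reference, the element-wise multiply under suspicion has the following general shape. This is a standalone sketch of a SwiGLU product, not the contents of `inference.rs`; the function names `silu` and `swiglu` are hypothetical.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^(-x)).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU product: out[i] = silu(gate[i]) * up[i].
/// Panics when the two inputs differ in length.
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate/up length mismatch");
    gate.iter()
        .zip(up.iter())
        .map(|(&g, &u)| silu(g) * u)
        .collect()
}
```

The length check catches mismatched buffers, but it would not catch two equal-length slices that are offset by one element against each other, which is why slice indexing remains a separate suspect.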

Spec v2.65.0 → v2.66.0. No coverage tally change — investigation-
recording, not a discharge.

Evidence persisted to:
- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv

Stacks under #1070 (§20) which is under #1068 (§19) which is under
#1067 (§18) which is under #1064 (§17) which is under #1063 (§16).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 27, 2026
…irmed APR-side at inference.rs:160-164 — spec v2.71.0 → v2.72.0 (#1084)

Live evidence on noah-Lambda-Vector RTX 4090 2026-04-27.
Built apr from PR #1083 branch (commits 77c016b + c657968
+ f249464 from PR A+B+C cascade). Ran `apr trace --payload`
on canonical 7B teacher in BOTH formats with identical prompt
+ tokenizer.

Result:
| Layer | APR ffn_swigl std | GGUF ffn_swigl std | Ratio |
|------:|------------------:|-------------------:|------:|
| 3     | 1.2216            | 0.0670             | 18.23x |

§26.4 binding criterion threshold: ≥10x → APR-side bug.
**Observed 18.23x, about 1.8x the threshold (8.23x above it): a decisive verdict.**
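
The binding criterion reduces to one division and one comparison. A minimal sketch, assuming hypothetical names (`Verdict`, `classify`) and treating anything under the threshold simply as "below threshold", since the commit message only defines the ≥10x case:

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    /// Ratio met or exceeded the threshold.
    AprSideBug,
    /// Ratio under the threshold; the source does not define this case.
    BelowThreshold,
}

/// Computes apr_std / gguf_std and compares it against `threshold`.
/// Returns `None` when `gguf_std` is zero (ratio undefined).
fn classify(apr_std: f64, gguf_std: f64, threshold: f64) -> Option<(f64, Verdict)> {
    if gguf_std == 0.0 {
        return None;
    }
    let ratio = apr_std / gguf_std;
    let verdict = if ratio >= threshold {
        Verdict::AprSideBug
    } else {
        Verdict::BelowThreshold
    };
    Some((ratio, verdict))
}
```

Calling `classify(1.2216, 0.0670, 10.0)` reproduces the layer-3 row: a ratio of about 18.23 and the APR-side verdict.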

The investigation chain that started in §15.4 (GPU GQA
elimination) has reached its conclusion at §27:

§15.4 → §16 → §17 → §23 → §27 (this)
"Whole forward path" → "GPU eliminated" → "(layer=3, FFN sub-block)"
→ "(layer=3, ffn_swigl)" → "**APR-side at inference.rs:160-164**"

Cascade-damping signature confirmed:
- Layers 0-2: ratio ~1.1x (normal)
- Layer 3: 18.23x (anomaly)
- Layers 4-5: 3.3-4.5x (cascade)
- Layer 6+: ~1x (recovered)
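
The spike-then-recover shape can be checked mechanically. The sketch below is illustrative only: `cascade_span` is a hypothetical helper, and the `recovered_below` cutoff is an assumption, as the commit message gives ratio ranges but no explicit recovery threshold.

```rust
/// Locates the peak APR/GGUF ratio and the first later layer whose ratio
/// has dropped back under `recovered_below`.
///
/// Returns `None` for empty input, otherwise `(peak_layer, recovery_layer)`
/// where `recovery_layer` is `None` if the ratios never recover.
fn cascade_span(ratios: &[f64], recovered_below: f64) -> Option<(usize, Option<usize>)> {
    // Layer index with the largest ratio.
    let (peak, _) = ratios
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))?;
    // First layer after the peak that is back under the cutoff.
    let recovery = ratios[peak + 1..]
        .iter()
        .position(|&r| r < recovered_below)
        .map(|offset| peak + 1 + offset);
    Some((peak, recovery))
}
```

As an example, with illustrative per-layer ratios drawn from the ranges above (`[1.1, 1.1, 1.1, 18.23, 4.5, 3.3, 1.0]`; the exact values for layers 0-2 and 4-6 are placeholders, not measured data) and an assumed cutoff of 2.0, the function reports a peak at layer 3 and recovery at layer 6.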

This is consistent with a localized perturbation (off-by-one,
buffer aliasing, or an F32-vs-Q4K dequant defect specific to
layer 3) rather than persistent residual-stream corruption.

Per §17.5, SHIP-007 fix discharges 5 MODEL-1 PARTIALs at once
(SHIP-002/005/006/007/008). §26.5 expected coverage flip: 33+12
→ 28+17 when fix lands.

§27 does NOT discharge by itself — it locates the bug for fixing.
Next investigation reads `inference.rs:160-164` and tests 4 hypotheses:
1. Off-by-one slice indexing
2. Buffer aliasing (scratch reuse pattern)
3. F32-vs-Q4K dequant defect at layer-3 input range
4. Activation overflow (SiLU saturation amplifies multiply)

Methodology held throughout: zero eprintln!, zero route-arounds,
apr is canonical (§26.8), all instrumentation via `apr trace
--payload`. Lambda-labs lane pre-authorized.

Evidence persisted to evidence/ship-007-apr-vs-gguf-2026-04-27/:
- apr-trace.txt (13.5 KB)
- gguf-trace.txt (13.7 KB)
- binding-criterion-summary.json

Note: §27 reproduction requires PR #1081 + #1082 + #1083
cascade to merge first (the apr trace --payload <gguf> wiring
is in PR C). Evidence was generated with a local build of PR
#1083 branch.

Spec v2.71.0 → v2.72.0. Coverage flip pending fix.

Spec: SPEC-SHIP-TWO-001 §26.4 P3 verdict
References:
- §15.4 (PR #1062) — GPU GQA eliminated
- §16 (PR #1063) — APR CPU isolated
- §17 (PR #1064) — layer-3 FFN sub-block
- §23 (PR #1075) — layer-3 ffn_swigl named
- §26.8 (PR #1079) — apr-is-canonical methodology rule
- PR #1081 (P3 PR A scaffold)
- PR #1082 (P3 PR B sub-FFN populate)
- PR #1083 (P3 PR C CLI wiring)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>