docs(ship-007): §17 layer-3 ffn_out anomaly identified, first divergent layer named #1064
Merged
This was referenced Apr 26, 2026
….0 → v2.62.0
Executed §16.4's first iteration ("apr trace --payload --layer 0 on
both APR and GGUF teachers, bisect through 28 layers") against the
APR teacher's existing per-layer telemetry. The full 28-layer
ffn_out std progression on paiml/qwen2.5-coder-7b-apache-q4k-v1
(prompt "What is 2+2?") shows a discontinuity at layer 3, reported as 31× (the rounded values below give 11.46 / 0.22 ≈ 52×):
Layer 2: ffn_out std=0.22
Layer 3: ffn_out std=11.46 ← 31× spike
Layer 4: ffn_out std=3.84 ← damps in 1 layer (one-off perturbation)
Median: ffn_out std=0.5–2.0
The residual stream's output std jumps 0.72 → 11.78 at layer 3 and
stays elevated. Three signals point at layer 3 ffn_out specifically:
(a) magnitude 31× isn't architecture-driven (SHIP-003 PR #1059's
339-tensor cosine sweep proved underlying weights are byte-equivalent
to SafeTensors); (b) damps in one layer (one-off perturbation
pattern, not stable feature); (c) mean shift -0.082 is 100× median
magnitude, suggesting sign-bias defect not magnitude defect.
§17.3 narrows §16.3's four candidates: layer-composition glue in
forward_single_with_scratch at layer 3 FFN is "most likely". Three
new §17.3 candidates added: Q4K dequant under load on 18944-dim FFN;
SiLU numerical stability under SwiGLU `silu(gate) * up`; fused
gate+up matvec dispatch defect (per CLAUDE.md FFN section).
§17.4 specifies sub-layer bisection: emit gate_proj_out, silu(gate_proj_out),
silu(gate_proj_out) * up_proj_out, and down_proj_out separately. Whichever
sub-tensor first shows the 31× std discontinuity relative to the GGUF path
is the bug site. This requires the §15.5 TraceStep enum extension, which is
now load-bearing for the fix.
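The sub-layer bisection can be sketched in isolation. The following is a minimal illustration, not the project's code: `mean_std`, `silu`, `SubFfnStats`, and `sub_ffn_stats` are hypothetical names, and it assumes the conventional SwiGLU ordering `silu(gate) * up` on already-computed gate and up projections. It reports mean and population std for each element-wise stage so the first stage whose std jumps can be identified.

```rust
/// Population mean and standard deviation of a slice (0, 0 for empty input).
fn mean_std(xs: &[f32]) -> (f32, f32) {
    if xs.is_empty() {
        return (0.0, 0.0);
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    (mean, var.sqrt())
}

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Hypothetical container: (mean, std) for each element-wise SwiGLU stage.
struct SubFfnStats {
    gate: (f32, f32),
    silu_gate: (f32, f32),
    swiglu: (f32, f32),
}

/// Computes stats for gate, silu(gate), and silu(gate) * up.
/// Assumes `gate` and `up` are the outputs of the two projections.
fn sub_ffn_stats(gate: &[f32], up: &[f32]) -> SubFfnStats {
    assert_eq!(gate.len(), up.len(), "gate and up must have equal length");
    let silu_gate: Vec<f32> = gate.iter().map(|&g| silu(g)).collect();
    let swiglu: Vec<f32> = silu_gate.iter().zip(up).map(|(s, u)| s * u).collect();
    SubFfnStats {
        gate: mean_std(gate),
        silu_gate: mean_std(&silu_gate),
        swiglu: mean_std(&swiglu),
    }
}
```

Calling `sub_ffn_stats(&gate, &up)` once per layer and comparing each field against the same field from the reference path mirrors the comparison §17.4 describes; the down projection is omitted here because it needs the weight matrix.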
Spec v2.61.0 → v2.62.0. No coverage tally change.
Methodologically: zero eprintln!, zero bash workarounds, third re-use
of `apr trace --payload` primitive without modification (after §15
and §16). Per feedback_apr_trace_not_eprintln.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures the §17 falsifier evidence as raw artifacts:
evidence/ship-007-layer-3-anomaly/
├── apr-trace-payload-7b-2026-04-26.txt    # 274 lines, all 28 layers
├── gguf-trace-payload-7b-2026-04-26.txt   # 34 lines, final decode only
└── discharge-evidence-v1.json             # JSON summary
Precise measurement: layer-3 ffn_out std = 11.459 / layer-2 ffn_out
std = 0.216 → 53× spike (§17 stated 31×; the actual ratio is even more
extreme). The output residual stream's std jumps 0.7159 (layer 2) →
11.7756 (layer 3) → 25+ (layers 9-19) and never recovers below 13.
This matches the realizar/aprender-serve CLAUDE.md FFN verification
checklist note: "Verify FFN output doesn't cause catastrophic
cancellation". The layer-3 spike is that catastrophic cancellation pattern.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
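The layer-over-layer ratio quoted above is simple to recompute from the per-layer stds. Below is a small sketch under stated assumptions: `first_discontinuity` is a hypothetical helper, the input is a list of `(layer, std)` pairs already parsed from the trace, and the threshold is chosen by the caller.

```rust
/// Returns the first (layer, ratio) whose std is at least `threshold`
/// times the std of the preceding entry, or None if no such jump exists.
fn first_discontinuity(stds: &[(usize, f64)], threshold: f64) -> Option<(usize, f64)> {
    stds.windows(2).find_map(|w| {
        let (_, prev) = w[0];
        let (layer, cur) = w[1];
        // Skip zero or negative baselines to avoid dividing by zero.
        if prev > 0.0 && cur / prev >= threshold {
            Some((layer, cur / prev))
        } else {
            None
        }
    })
}
```

With the three values reported in this thread, `first_discontinuity(&[(2, 0.216), (3, 11.459), (4, 3.84)], 10.0)` flags layer 3 with a ratio of about 53.05.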
noahgift added a commit that referenced this pull request on Apr 26, 2026
…t 17× anomaly site — spec v2.65.0 → v2.66.0
§17.4 specified the falsifier next step as sub-layer bisection of
{ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added
the 4 new ActivationStats fields. §21 records the **first run of
the bisection on the canonical 7B teacher**.
## What §21 contains (8 subsections)
- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2×
layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters — silu(g) and u individually normal at
layer 3, but their elementwise product is 17× — implies an
unusual positive correlation or alignment bug
- §21.5 Refined surviving suspect surface — element-wise multiply
correctness (`inference.rs:163`) + off-by-one slice indexing as
newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare
APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on
PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)
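The §21.4 observation (each factor looks normal, the product does not) is what an indexing misalignment would produce, since per-tensor statistics do not depend on element order while the element-wise product does. The toy sketch below illustrates that point only; the data is invented for the illustration, `std_dev` and `shifted_product_std` are hypothetical helpers, and it is not evidence about the actual defect.

```rust
/// Population standard deviation; assumes a non-empty slice.
fn std_dev(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    (xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n).sqrt()
}

/// Std of the element-wise product a[i] * b[(i + shift) % n].
/// shift = 0 is the correctly aligned product; shift = 1 models an
/// off-by-one read of the second operand.
fn shifted_product_std(a: &[f64], b: &[f64], shift: usize) -> f64 {
    assert_eq!(a.len(), b.len(), "operands must have equal length");
    let n = a.len();
    let prod: Vec<f64> = (0..n).map(|i| a[i] * b[(i + shift) % n]).collect();
    std_dev(&prod)
}
```

For `a = [2, 0, 2, 0]` and `b = [0, 2, 0, 2]`, both inputs have std 1.0 regardless of shift, yet the aligned product has std 0.0 and the shifted product has std 2.0. This is why comparing only per-tensor stats of `silu(g)` and `u` cannot rule out a slice-indexing bug.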
## Per-layer ffn_swigl progression (key data)
| Layer | ffn_swigl std | Note |
|------:|--------------:|:-----|
| 0 | 0.088 | |
| 1 | 0.061 | |
| 2 | 0.071 | |
| **3** | **1.222** | 17.2× layer 2 |
| 4 | 0.390 | |
| 5-25 | ~0.15-0.55 | |
| 26 | 1.452 | |
| 27 | 2.247 | |
Layer 3 stands out among layers 0-25: every other layer in that range
has ffn_swigl in the 0.06-0.55 band, so the 1.22 value is anomalous.
Only the final two layers (26 and 27) exceed it.
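The band check can be expressed directly against the table. This is a minimal sketch, assuming the rows have been loaded as `(layer, std)` pairs; `outside_band` is a hypothetical helper, and layers 5-25 are left out of the example call because the table gives only a range for them.

```rust
/// Returns the layers whose std falls outside the inclusive band [lo, hi].
fn outside_band(rows: &[(u32, f64)], lo: f64, hi: f64) -> Vec<u32> {
    rows.iter()
        .filter(|(_, s)| *s < lo || *s > hi)
        .map(|(layer, _)| *layer)
        .collect()
}
```

Applied to the explicit rows (layers 0-4, 26, 27) with the band 0.06 to 0.55, it returns layers 3, 26, and 27, which is why the layer-3 claim is best read as relative to its neighbours rather than to the whole stack.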
## Bug surface narrowing (across §15→§16→§17→§21)
- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)
The suspect surface is now `inference.rs:160-164`, specifically the
`ffn_hidden.push(silu_g * u)` element-wise multiply (per §21.7, not yet
pinned to a single code line).
Spec v2.65.0 → v2.66.0. No coverage tally change — investigation-
recording, not a discharge.
Evidence persisted to:
- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv
Stacks under #1070 (§20) which is under #1068 (§19) which is under
#1067 (§18) which is under #1064 (§17) which is under #1063 (§16).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on Apr 27, 2026
…irmed APR-side at inference.rs:160-164, spec v2.71.0 → v2.72.0 (#1084)

Live evidence on noah-Lambda-Vector RTX 4090, 2026-04-27. Built apr from the PR #1083 branch (commits 77c016b + c657968 + f249464 from the PR A+B+C cascade). Ran `apr trace --payload` on the canonical 7B teacher in both formats with identical prompt and tokenizer. Result:

| Layer | APR ffn_swigl std | GGUF ffn_swigl std | Ratio |
|------:|------------------:|-------------------:|------:|
| 3 | 1.2216 | 0.0670 | 18.23x |

§26.4 binding criterion threshold: ≥10x → APR-side bug. **Observed 18.23x, about 1.8x the threshold (8.23 above it): a decisive verdict.**

The investigation chain that started in §15.4 (GPU GQA elimination) has reached its conclusion at §27:

§15.4 → §16 → §17 → §23 → §27 (this)
"Whole forward path" → "GPU eliminated" → "(layer=3, FFN sub-block)" → "(layer=3, ffn_swigl)" → "**APR-side at inference.rs:160-164**"

Cascade-damping signature confirmed:
- Layers 0-2: ratio ~1.1x (normal)
- Layer 3: 18.23x (anomaly)
- Layers 4-5: 3.3-4.5x (cascade)
- Layer 6+: ~1x (recovered)

This is consistent with a localized perturbation (off-by-one, buffer aliasing, or an F32-vs-Q4K dequant defect specific to layer 3) rather than persistent residual-stream corruption.

Per §17.5, the SHIP-007 fix discharges 5 MODEL-1 PARTIALs at once (SHIP-002/005/006/007/008). §26.5 expected coverage flip: 33+12 → 28+17 when the fix lands. §27 does NOT discharge by itself; it locates the bug for fixing.

Next investigation reads `inference.rs:160-164` and tests 4 hypotheses:
1. Off-by-one slice indexing
2. Buffer aliasing (scratch reuse pattern)
3. F32-vs-Q4K dequant defect at layer-3 input range
4. Activation overflow (SiLU saturation amplifies multiply)

Methodology held throughout: zero eprintln!, zero route-arounds, apr is canonical (§26.8), all instrumentation via `apr trace --payload`. Lambda-labs lane pre-authorized.

Evidence persisted to evidence/ship-007-apr-vs-gguf-2026-04-27/:
- apr-trace.txt (13.5 KB)
- gguf-trace.txt (13.7 KB)
- binding-criterion-summary.json

Note: §27 reproduction requires the PR #1081 + #1082 + #1083 cascade to merge first (the `apr trace --payload <gguf>` wiring is in PR C). Evidence was generated with a local build of the PR #1083 branch.

Spec v2.71.0 → v2.72.0. Coverage flip pending fix.

Spec: SPEC-SHIP-TWO-001 §26.4 P3 verdict

References:
- §15.4 (PR #1062): GPU GQA eliminated
- §16 (PR #1063): APR CPU isolated
- §17 (PR #1064): layer-3 FFN sub-block
- §23 (PR #1075): layer-3 ffn_swigl named
- §26.8 (PR #1079): apr-is-canonical methodology rule
- PR #1081 (P3 PR A scaffold)
- PR #1082 (P3 PR B sub-FFN populate)
- PR #1083 (P3 PR C CLI wiring)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
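The §26.4 binding criterion reduces to a ratio test on the two layer-3 stds. A minimal sketch follows, assuming only what is stated in the commit message (a ratio of at least 10x implicates the APR side); `binding_verdict` is a hypothetical name, and what the spec concludes below the threshold is not given here, so the sketch only reports whether the threshold was met.

```rust
/// Returns (ratio, threshold_met) for an APR-vs-GGUF std comparison,
/// or None when the GGUF baseline is not a positive number.
fn binding_verdict(apr_std: f64, gguf_std: f64, threshold: f64) -> Option<(f64, bool)> {
    if gguf_std <= 0.0 {
        return None; // no meaningful ratio against a zero baseline
    }
    let ratio = apr_std / gguf_std;
    Some((ratio, ratio >= threshold))
}
```

For the reported values, `binding_verdict(1.2216, 0.0670, 10.0)` yields a ratio of about 18.23 with the threshold met, which is roughly 1.8 times the 10x bar.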
Summary
- `apr trace --payload` already emits per-layer mean/std for all 28 transformer blocks. Examining the progression revealed a 31× discontinuity at layer 3: `ffn_out` std=11.46 (vs layer 2 std=0.22 and layer-4-26 median 0.5–2.0).
- §17.3 suspects: layer-composition glue in `forward_single_with_scratch` at layer 3 FFN ("most likely") + 3 new suspects (Q4K dequant under load on 18944-dim FFN; SiLU numerical stability under SwiGLU; fused gate+up matvec dispatch defect).
What §17 contains
- §17.4 sub-layer bisection over `gate_proj_out` / `silu(gate_proj_out)` / `down_proj_out` (requires the §15.5 TraceStep enum extension, now load-bearing)
- Third re-use of the `apr trace --payload` primitive without modification
Why this matters
The bug surface for SHIP-007 is now a single layer index (3) and a single sub-block (FFN), narrowed from §16.3's 28×4 candidate space. The next root-cause fix PR can be much more focused than §16's "APR forward CPU path" surface.
Stacks under
Test plan
- `apr trace <model.apr> --payload` emits the per-layer stats used in §17.1
🤖 Generated with Claude Code