docs(ship-007): §21 sub-FFN bisection — layer-3 ffn_swigl first 17× anomaly site (v2.66.0)#1072

Closed
noahgift wants to merge 2 commits into main from docs/ship-007-21-sub-ffn-bisection-result

@noahgift

Summary

Key finding — sub-FFN std at layer 3

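| Sub-FFN slot | Layer 1 / layer 2 baseline | Layer 3 | Ratio vs layer 2 |
|--------------|---------------------------:|--------:|-----------------:|
| ffn_norm     | 0.85 / 0.86                | 1.00    | 1.16× (normal) |
| ffn_gate     | 1.50 / 1.99                | 1.92    | 0.97× (normal) |
| ffn_up       | 1.10 / 0.94                | 1.34    | 1.42× (small) |
| ffn_silu     | 0.043 / 0.052              | 0.168   | 3.2× (precursor) |
| ffn_swigl    | 0.061 / 0.071              | 1.222   | 17.2× ← first anomaly |
| ffn_out      | 0.345 / 0.216              | 11.459  | 53× (cascaded) |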

At layer 3, gate and up are each individually normal, yet their element-wise product silu(g) * u lands at 17× the layer-2 baseline. The suspected bug site is inference.rs:160-164, specifically the ffn_hidden.push(silu_g * u) element-wise multiply, possibly with off-by-one indexing.
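
For readers without the source open, the quantity being measured can be sketched as follows. This is a minimal illustration and not the code at inference.rs:160-164: the function names `silu`, `swiglu_inner`, and `std_dev` are hypothetical, and it assumes the trace reports a population standard deviation over the activation vector.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^(-x)).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU inner product: silu(gate[i]) * up[i].
/// Zipping the two slices keeps the indices aligned by construction,
/// which rules out the off-by-one pairing hypothesised above.
fn swiglu_inner(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate and up must have equal length");
    gate.iter()
        .zip(up.iter())
        .map(|(&g, &u)| silu(g) * u)
        .collect()
}

/// Population standard deviation, accumulated in f64 for stability.
/// Returns NaN for an empty slice.
fn std_dev(xs: &[f32]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().map(|&x| x as f64).sum::<f64>() / n;
    let var = xs.iter().map(|&x| (x as f64 - mean).powi(2)).sum::<f64>() / n;
    var.sqrt()
}
```

Calling `std_dev(&swiglu_inner(&gate, &up))` on a layer's gate and up projections would give a number comparable in spirit to the ffn_swigl column, under the assumptions stated above.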

What §21 contains (8 subsections)

  1. §21.1 Live trace command (apr trace --payload)
  2. §21.2 28-layer × 6-field std table
  3. §21.3 The first divergent sub-FFN slot is ffn_swigl
  4. §21.4 Why ffn_swigl is anomalous despite gate/up being normal individually
  5. §21.5 Refined suspect surface — element-wise multiply correctness named
  6. §21.6 Falsifiable next step — extend GGUF-path telemetry, compare APR vs GGUF layer-3 ffn_swigl
  7. §21.7 What §21 is NOT (depends on PR #1066, "feat(sub-ffn-telemetry): 4 new ActivationStats fields on LayerActivation — implements trace-ffn-sub-block-v1.yaml", in cascade)
  8. §21.8 Methodological alignment (4th re-use of apr trace --payload)

Stacks under

Evidence files

evidence/ship-007-layer-3-anomaly/
├── sub-ffn-bisection-2026-04-26.txt    # 386-line full apr trace output
└── sub-ffn-per-layer-stds.csv          # 28-layer × 6-field summary

Test plan

  • §21 added at end of spec, before END OF SPECIFICATION marker
  • Atomic-next-action banner updated v2.65.0 → v2.66.0
  • PMAT pre-commit gates pass
  • Sub-FFN telemetry fields populated correctly on SwiGLU path (verified via live trace; new layer block emits 10 lines instead of 6)

Why this matters

§21.6 specifies a focused falsifier: extend GGUF-path forward_traced with the same 4 fields, then compare APR vs GGUF layer-3 ffn_swigl directly. This is the next single-session task, and it disambiguates between (a) "normal model behavior at layer 3" and (b) "APR-side bug in element-wise multiply". Per §17.5, whatever fix lands also discharges all 5 transitively-blocked MODEL-1 PARTIALs (SHIP-002/005/006/007/008).

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) April 26, 2026 10:21
@noahgift noahgift force-pushed the docs/ship-007-21-sub-ffn-bisection-result branch 3 times, most recently from 2aaa5f0 to 5d067e1 Compare April 26, 2026 13:05
noahgift and others added 2 commits April 26, 2026 15:31
…2.64.0 → v2.65.0

§19 verified `apr pretrain --device cuda` is wired but the canonical
apr binary lacked `--features cuda`. §20 records the next step:
**rebuild + live dispatch + evidence capture** on RTX 4090.

## What §20 contains (9 subsections)

1. §20.1 — Rebuild (40s incremental, `--features cuda` enabled apr-cli)
2. §20.2 — Live dispatch command + 100-step JSONL output
3. §20.3 — wall_ms statistics: median=264.74ms (47% headroom under
   GATE-GPUTRAIN-004's 500ms budget)
4. §20.4 — nvidia-smi PID 1658504 / 6636 MiB GPU memory captured mid-run
5. §20.5 — Gate-by-gate impact table (GATE-GPUTRAIN-002/003/004/005)
6. §20.6 — Evidence files at evidence/task-132-residual-b/
7. §20.7 — Long-path status: §19.5 step (a) DONE
8. §20.8 — What §20 is NOT (contract bump is follow-up PR)
9. §20.9 — Methodological alignment (live-evidence pattern, not chain-of-thought)

## Live evidence captured

- 100 real CUDA training steps on noah-Lambda-Vector RTX 4090
- Real corpus: /mnt/nvme-raid0/data/csn-python-shards
- Real tokenizer: /mnt/nvme-raid0/models/model-2-tokenizer-v1 (vocab=50,257)
- wall_ms median: 264.74 ms (range 257.86–467.66 with step 0 = 467.66
  kernel-warmup outlier)
- train_loss step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing)
- val_loss=10.31 triggered GATE-TRAIN-005 ship-blocker abort at epoch
  boundary (correct behavior for fresh-init 370M before convergence)
- nvidia-smi PID 1658504 / 6636 MiB stable mid-run
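
The headroom figure quoted for GATE-GPUTRAIN-004 follows from the median and the 500 ms budget. A minimal sketch of that arithmetic, with hypothetical helper names `median` and `headroom` (the spec's own tooling is not shown in this PR):

```rust
/// Median of a sample; returns None for an empty slice.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        // Even count: average the two central values.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Fraction of the budget left unused: (budget - observed) / budget.
fn headroom(observed_ms: f64, budget_ms: f64) -> f64 {
    (budget_ms - observed_ms) / budget_ms
}
```

With the reported median of 264.74 ms, `headroom(264.74, 500.0)` evaluates to about 0.4705, which matches the 47% stated in §20.3.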

## Spec progression

v2.64.0 → v2.65.0. Coverage tally update is **pending** the contract
bump for `gpu-training-backend-v1.yaml` GATE-GPUTRAIN-004
PARTIAL_ALGORITHM_LEVEL → ACTIVE_WITH_LIVE_EVIDENCE (separate
follow-up PR; §20 records the data, the contract amendment captures
the durable verdict).

## Stacks under

- #1068 (§19 — task #132 correction)
- #1067 (§18 — training status snapshot)
- Concrete progress on §19.4 Residual B (live evidence half)
- Pairs with PR #1069 (wall_ms code half — provided the JSONL field
  used for the GATE-GPUTRAIN-004 timing data)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t 17× anomaly site — spec v2.65.0 → v2.66.0

§17.4 specified the falsifier next step as sub-layer bisection of
{ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added
the 4 new ActivationStats fields. §21 records the **first run of
the bisection on the canonical 7B teacher**.

## What §21 contains (8 subsections)

- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2×
  layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters — silu(g) and u individually normal at
  layer 3, but their elementwise product is 17× — implies an
  unusual positive correlation or alignment bug
- §21.5 Refined surviving suspect surface — element-wise multiply
  correctness (`inference.rs:163`) + off-by-one slice indexing as
  newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare
  APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on
  PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)
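
Why the pairing matters can be shown with a toy example. The sketch below is an assumption-laden illustration, not the APR forward path: `product_with_shift` is a hypothetical helper that reads `u` at a wrapping offset, so `shift = 0` is the correct pairing and `shift = 1` mimics an off-by-one. The per-factor std is identical in both cases because the same values are used, only in a different order.

```rust
/// Population standard deviation of a slice.
fn std_dev(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    (xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n).sqrt()
}

/// Element-wise product s[i] * u[(i + shift) % n].
/// shift = 0 is the intended pairing; any other shift misaligns the factors.
fn product_with_shift(s: &[f64], u: &[f64], shift: usize) -> Vec<f64> {
    assert_eq!(s.len(), u.len(), "factors must have equal length");
    let n = u.len();
    s.iter()
        .enumerate()
        .map(|(i, &a)| a * u[(i + shift) % n])
        .collect()
}
```

For `s = [0, 0, 2, 0]` and `u = [0, 0, 3, 0]`, the aligned product is `[0, 0, 6, 0]` while the shifted product is all zeros, so the product's std changes even though neither factor's std does. In this toy case misalignment lowers the std; the direction depends on the data, which is why the per-factor statistics alone cannot separate correlation from an indexing bug.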

## Per-layer ffn_swigl progression (key data)

| Layer | ffn_swigl std | Note            |
|------:|--------------:|-----------------|
| 0     | 0.088         |                 |
| 1     | 0.061         |                 |
| 2     | 0.071         |                 |
| **3** | **1.222**     | ← 17.2× layer 2 |
| 4     | 0.390         |                 |
| 5-25  | ~0.15-0.55    |                 |
| 26    | 1.452         |                 |
| 27    | 2.247         |                 |

Layer 3 stands out among the early and middle layers: every other
layer from 0 through 25 sits in the 0.06-0.55 band, and only the
final two layers (26-27) exceed it. The 1.22 value at layer 3 is
anomalous.
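
The "first anomaly" reading of this table can be reproduced mechanically. A minimal sketch, assuming the anomaly criterion is a layer-over-previous-layer ratio above some threshold; the function name `first_spike` and the threshold value are hypothetical, not taken from the spec:

```rust
/// Returns (layer index, ratio) for the first layer whose std exceeds
/// `threshold` times the std of the layer immediately before it.
/// A zero std in the previous layer yields an infinite or NaN ratio.
fn first_spike(stds: &[f64], threshold: f64) -> Option<(usize, f64)> {
    stds.windows(2).enumerate().find_map(|(i, w)| {
        let ratio = w[1] / w[0];
        // windows(2) pairs layer i with layer i + 1.
        (ratio > threshold).then_some((i + 1, ratio))
    })
}
```

Fed the layer 0-4 values from the table (`[0.088, 0.061, 0.071, 1.222, 0.390]`) with a threshold of 10, this returns layer 3 with a ratio of roughly 17.2.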

## Bug surface narrowing (across §15→§16→§17→§21)

- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)

The fix surface is now: `inference.rs:160-164`, specifically the
`ffn_hidden.push(silu_g * u)` element-wise multiply.

Spec v2.65.0 → v2.66.0. No coverage tally change — investigation-
recording, not a discharge.

Evidence persisted to:
- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv

Stacks under #1070 (§20) which is under #1068 (§19) which is under
#1067 (§18) which is under #1064 (§17) which is under #1063 (§16).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the docs/ship-007-21-sub-ffn-bisection-result branch from 5d067e1 to 1719aac Compare April 26, 2026 13:31
@noahgift

Re-authoring as §23 v2.67.0 since §22 (PR #1074) is now on main with v2.66.0 banner. The sub-FFN bisection finding deserves its own section number; this PR's content survives in the new PR.

@noahgift noahgift closed this Apr 26, 2026
auto-merge was automatically disabled April 26, 2026 17:59

Pull request was closed

noahgift added a commit that referenced this pull request Apr 26, 2026
…t 17× anomaly site — spec v2.66.0 → v2.67.0 (#1075)

§17.4 specified sub-layer bisection of FFN as the falsifier next
step. PR #1066 added the 4 sub-FFN ActivationStats fields. §23
records the first run on the canonical 7B teacher post-#1066-merge.

(Originally authored as §21 in the closed PR #1072. Re-numbered as
§23 because §22 (PR #1074) landed first with v2.66.0 banner; this
PR brings v2.67.0.)

## Key finding

Live `apr trace --payload` on `paiml/qwen2.5-coder-7b-apache-q4k-v1`
teacher (CPU, prompt "What is 2+2?") layer-3 sub-FFN std:

| Sub-FFN slot | L1-2 baseline | L3 | Ratio |
|--------------|--------------:|----:|------:|
| ffn_norm     | 0.85 / 0.86   | 1.00 | 1.16× normal |
| ffn_gate     | 1.50 / 1.99   | 1.92 | 0.97× normal |
| ffn_up       | 1.10 / 0.94   | 1.34 | 1.42× small |
| ffn_silu     | 0.043 / 0.052 | 0.168 | 3.2× precursor |
| **ffn_swigl** | **0.061 / 0.071** | **1.222** | **17.2× anomaly** |
| ffn_out      | 0.345 / 0.216 | 11.459 | 53× cascade |

Gate/up individually normal at layer 3. Element-wise multiply at
inference.rs:163 `ffn_hidden.push(silu_g * u)` is the named bug
site (possibly off-by-one slice indexing).

## Bug surface narrowing chain
- §15.4: GPU GQA kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out)
- **§23: layer 3 ffn_swigl named (17× first anomaly site)**

## Falsifiable next investigation step (§23.6)

Extend `OwnedQuantizedModel::forward_traced` (the GGUF path; needs
to be authored per `project_ship_007_gguf_forward_traced_plan.md`)
with same 4 sub-FFN fields. Compare APR vs GGUF layer-3 ffn_swigl
directly:
- ≈0.07 → APR-side bug pinned to inference.rs:160-164
- ≈1.22 → spike is normal model behavior; bug elsewhere
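
The two-way decision rule above can be written down as a small classifier. This is a sketch under stated assumptions: the `Verdict` enum, the `classify` function, and the relative tolerance are all hypothetical, since §23.6 only gives the two approximate target values.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    /// GGUF matches the layer-2 baseline: the spike is specific to APR.
    AprSideBug,
    /// GGUF reproduces the APR spike: the spike is model behavior.
    NormalModelBehavior,
    /// GGUF matches neither value within tolerance.
    Inconclusive,
}

/// Relative closeness test against the larger of the two magnitudes.
fn near(a: f64, b: f64, rel_tol: f64) -> bool {
    (a - b).abs() <= rel_tol * a.abs().max(b.abs())
}

/// Classifies a GGUF layer-3 ffn_swigl std against the APR layer-3 value
/// and the layer-2 baseline.
fn classify(gguf_std: f64, apr_std: f64, baseline_std: f64, rel_tol: f64) -> Verdict {
    if near(gguf_std, apr_std, rel_tol) {
        Verdict::NormalModelBehavior
    } else if near(gguf_std, baseline_std, rel_tol) {
        Verdict::AprSideBug
    } else {
        Verdict::Inconclusive
    }
}
```

With the APR value 1.222 and the layer-2 baseline 0.071, a GGUF reading of 0.07 classifies as `AprSideBug` and a reading of 1.2 as `NormalModelBehavior` at a 20% tolerance. The explicit `Inconclusive` arm is a design choice of this sketch: a reading such as 0.5 supports neither hypothesis cleanly and probably deserves a closer look.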

## Evidence persisted
- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv (28-layer × 6-field summary)

Spec v2.66.0 → v2.67.0. No coverage tally change.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…P-007 layer-3 H1/H2 unblock

Authors the GGUF-side sub-FFN telemetry contract that unblocks the
SHIP-007 layer-3 ffn_swigl bisection (memory:
project_ship_007_layer_3_swiglu_bisection.md).

BACKGROUND:
SHIP-007 §21 (aprender PR #1072) narrowed the bug to
"(layer=3, ffn_swigl element-wise multiply)" on the APR forward
path. APR layer-3 ffn_swigl std = 1.222 (17.2× layer-2 baseline).
But §21 cannot distinguish:

H1: Token-position-dependent correlation (NORMAL model behavior)
H2: APR-side bug (forward path produces wrong VALUES vs GGUF)

without GGUF-side per-layer sub-FFN telemetry. Currently NO
forward_traced method exists on OwnedQuantizedModel.

THIS CONTRACT:
Pins the architecture for adding GGUF-side traced forward
mirroring `trace-ffn-sub-block-v1` (APR sibling). Same 5 sub-FFN
fields (gate_proj_out, up_proj_out, silu_gate, swiglu_inner,
ffn_down_out). Same equation. Cross-comparison enabled by schema
parity (PO-TRACE-FFN-GGUF-002).

Implementation stages (multi-PR cascade, deliberate session):
- M-FFN-GGUF-0: contract scaffold (this PR)
- M-FFN-GGUF-1: LayerActivation struct on GGUF side
- M-FFN-GGUF-2: NEW forward_traced on OwnedQuantizedModel
- M-FFN-GGUF-3: heavy comparison harness APR vs GGUF layer-3
- M-FFN-GGUF-4: SHIP-007 fix PR cites H1 or H2

4 falsification tests defined:
- FFN-GGUF-001: forward_traced exists on OwnedQuantizedModel
- FFN-GGUF-002: byte-identity vs production forward
- FFN-GGUF-003: APR-vs-GGUF layer-3 ffn_swigl std distinguishes H1/H2
- FFN-GGUF-004: SHIP-007 fix PR cites bisected hypothesis
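
A byte-identity check of the kind FFN-GGUF-002 names could look like the sketch below. It assumes the comparison is between two `f32` output buffers (for example, logits from the traced and production forward passes); the contract text does not say which buffers are compared, and `first_bit_mismatch` is a hypothetical helper.

```rust
/// Returns the index of the first element whose bit pattern differs,
/// or None if the two buffers are byte-identical.
/// A length mismatch is reported at the end of the shorter buffer.
fn first_bit_mismatch(a: &[f32], b: &[f32]) -> Option<usize> {
    if a.len() != b.len() {
        return Some(a.len().min(b.len()));
    }
    // Compare raw bits so that 0.0 vs -0.0 counts as a difference and
    // NaNs with the same payload count as equal.
    a.iter()
        .zip(b)
        .position(|(x, y)| x.to_bits() != y.to_bits())
}
```

Comparing bit patterns instead of using `==` is deliberate here: float equality treats `0.0` and `-0.0` as equal and any NaN as unequal to itself, neither of which matches what "byte-identity" means.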

PATTERN PRECEDENT:
Mirrors the proven trace-moe-gpu-sub-stages-v1 cascade closure
(M50→M87, 2026-05-04→06): contract scaffold first, then
implementation stages, then heavy harness, then fix. Bug class
is similar: bisection identified a SURFACE but cannot distinguish
hypotheses without sibling telemetry.

This contract is the parallel-track autonomous work that should
have been authored during M-GPU-MOE CI dead time but wasn't. With
M-GPU-MOE-1.x cascade now closed, the SHIP-007 layer-3 cascade
can proceed in parallel-or-serial fashion.

Discharges 5 transitively-blocked MODEL-1 PARTIALs per
ship-two-models-spec.md §17.5: SHIP-002, SHIP-005, SHIP-006,
SHIP-007, SHIP-008.

YAML-only — production hot paths byte-unchanged.

`pv validate` 0/0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…P-007 layer-3 H1/H2 unblock (#1532)
