docs(ship-007): §21 sub-FFN bisection — layer-3 ffn_swigl first 17× anomaly site (v2.66.0) by noahgift · Pull Request #1072 · paiml/aprender

noahgift · 2026-04-26T10:21:19Z

Summary

§21 records the first run of §17.4's sub-layer bisection on the canonical 7B teacher (with PR #1066's sub-FFN telemetry implementation deployed locally).
The bug surface narrows from §17's "(layer=3, FFN sub-block)" to "(layer=3, ffn_swigl)", the first 17× anomaly site.
Spec v2.65.0 → v2.66.0.

Key finding — sub-FFN std at layer 3

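| Sub-FFN slot | Layer 1 / layer 2 baseline std | Layer 3 std | Ratio (layer 3 ÷ layer 2) |
|--------------|-------------------------------:|------------:|---------------------------|
| ffn_norm | 0.85 / 0.86 | 1.00 | 1.16× (normal) |
| ffn_gate | 1.50 / 1.99 | 1.92 | 0.97× (normal) |
| ffn_up | 1.10 / 0.94 | 1.34 | 1.42× (small) |
| ffn_silu | 0.043 / 0.052 | 0.168 | 3.2× (precursor) |
| ffn_swigl | 0.061 / 0.071 | 1.222 | 17.2× ← first anomaly |
| ffn_out | 0.345 / 0.216 | 11.459 | 53× (cascaded) |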
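The Ratio column is consistent with dividing each layer-3 std by its layer-2 baseline. A minimal Rust sketch of that arithmetic follows, using only the values in the table. The type `SlotStd`, the helpers `ratio`, `first_anomaly`, and `table_rows`, and the 10× cutoff are hypothetical names and choices made for illustration; none of them come from the spec or from `apr trace`.

```rust
/// One row of the layer-3 sub-FFN table: slot name, layer-2 std, layer-3 std.
/// (Hypothetical type, for illustration only.)
struct SlotStd {
    name: &'static str,
    layer2: f64,
    layer3: f64,
}

/// Ratio of the layer-3 std to the layer-2 baseline.
fn ratio(row: &SlotStd) -> f64 {
    row.layer3 / row.layer2
}

/// First slot, in forward-pass order, whose ratio meets or exceeds `threshold`.
fn first_anomaly(rows: &[SlotStd], threshold: f64) -> Option<&SlotStd> {
    rows.iter().find(|r| ratio(r) >= threshold)
}

/// Values copied from the table, listed in forward-pass order.
fn table_rows() -> Vec<SlotStd> {
    vec![
        SlotStd { name: "ffn_norm", layer2: 0.86, layer3: 1.00 },
        SlotStd { name: "ffn_gate", layer2: 1.99, layer3: 1.92 },
        SlotStd { name: "ffn_up", layer2: 0.94, layer3: 1.34 },
        SlotStd { name: "ffn_silu", layer2: 0.052, layer3: 0.168 },
        SlotStd { name: "ffn_swigl", layer2: 0.071, layer3: 1.222 },
        SlotStd { name: "ffn_out", layer2: 0.216, layer3: 11.459 },
    ]
}

fn main() {
    let rows = table_rows();
    for r in &rows {
        println!("{:<10} {:>6.2}x", r.name, ratio(r));
    }
    // The 10x cutoff is an assumed, illustrative threshold, not a spec value.
    if let Some(r) = first_anomaly(&rows, 10.0) {
        println!("first anomaly: {}", r.name);
    }
}
```

With a cutoff between the 3.2× precursor and the 17.2× spike, the scan stops at ffn_swigl rather than at the larger but downstream ffn_out value.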

Gate and up are each normal at layer 3, yet their element-wise product silu(g) * u lands at 17× the layer-2 baseline. The suspected bug site is inference.rs:160-164, specifically the ffn_hidden.push(silu_g * u) element-wise multiply, possibly with off-by-one indexing.
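To see how the product's std can move while the gate and up stds stay put, here is a small self-contained Rust sketch. It is not the code in inference.rs; `silu`, `swiglu_inner`, and `std_dev` are hypothetical helpers and the input vectors are made up. It only shows that pairing the same gate and up values differently (for example after a one-position shift) changes the std of silu(g) * u, which is compatible both with an alignment bug and with genuine correlation in the model.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU inner product: silu(gate[i]) * up[i].
/// Returns None when the slices differ in length, which would indicate misalignment.
fn swiglu_inner(gate: &[f32], up: &[f32]) -> Option<Vec<f32>> {
    if gate.len() != up.len() {
        return None;
    }
    Some(gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect())
}

/// Population standard deviation of a slice.
fn std_dev(xs: &[f32]) -> f32 {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    (xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n).sqrt()
}

fn main() {
    // Made-up activations: both `up` vectors contain the same values,
    // so their individual std is the same; only the pairing with `gate` differs.
    let gate = [2.0_f32, -2.0, 2.0, -2.0];
    let up_aligned = [3.0_f32, 0.1, 3.0, 0.1];
    let up_shifted = [0.1_f32, 3.0, 0.1, 3.0]; // same values, shifted by one

    let aligned = swiglu_inner(&gate, &up_aligned).expect("equal lengths");
    let shifted = swiglu_inner(&gate, &up_shifted).expect("equal lengths");

    println!("std(up aligned) = {:.3}", std_dev(&up_aligned));
    println!("std(up shifted) = {:.3}", std_dev(&up_shifted));
    println!("std(product, aligned) = {:.3}", std_dev(&aligned));
    println!("std(product, shifted) = {:.3}", std_dev(&shifted));
}
```

Because the marginal statistics alone cannot separate these cases, a per-element comparison against a reference forward path is the more decisive check.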

What §21 contains (8 subsections)

§21.1 Live trace command (apr trace --payload)
§21.2 28-layer × 6-field std table
§21.3 The first divergent sub-FFN slot is ffn_swigl
§21.4 Why ffn_swigl is anomalous despite gate/up being normal individually
§21.5 Refined suspect surface — element-wise multiply correctness named
§21.6 Falsifiable next step — extend GGUF-path telemetry, compare APR vs GGUF layer-3 ffn_swigl
§21.7 What §21 is NOT (depends on PR #1066 in cascade)
§21.8 Methodological alignment (4th re-use of apr trace --payload)

Stacks under

docs(ship-two-001): §20 live CUDA training dispatch evidence — spec v2.65.0 #1070 (§20 — live CUDA training dispatch)
docs(ship-two-001): §19 task #132 correction (CUDA training shipped) #1068 (§19 — task #132 correction)
docs(ship-two-001): §18 training status snapshot as chain-of-thought #1067 (§18 — training status snapshot)
feat(sub-ffn-telemetry): 4 new ActivationStats fields on LayerActivation — implements trace-ffn-sub-block-v1.yaml #1066 (sub-FFN telemetry impl — provided the new fields used by §21)
docs(ship-007): §17 layer-3 ffn_out anomaly identified — first divergent layer named #1064 (§17 — layer-3 FFN sub-block named)
docs(ship-007): §16 APR forward CPU path isolated as root cause #1063 (§16 — APR forward CPU path isolated)

Evidence files

evidence/ship-007-layer-3-anomaly/
├── sub-ffn-bisection-2026-04-26.txt    # 386-line full apr trace output
└── sub-ffn-per-layer-stds.csv          # 28-layer × 6-field summary

Test plan

§21 added at end of spec, before END OF SPECIFICATION marker
Atomic-next-action banner updated v2.65.0 → v2.66.0
PMAT pre-commit gates pass
Sub-FFN telemetry fields populated correctly on SwiGLU path (verified via live trace; new layer block emits 10 lines instead of 6)

Why this matters

§21.6 specifies a focused falsifier: extend the GGUF-path forward_traced with the same 4 fields, then compare APR vs GGUF layer-3 ffn_swigl directly. This is the next single-session task, and it disambiguates between (a) "normal model behavior at layer 3" and (b) "APR-side bug in element-wise multiply". Per §17.5, whatever fix lands also discharges all 5 transitively-blocked MODEL-1 PARTIALs (SHIP-002/005/006/007/008).
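The decision rule behind that falsifier can be sketched in a few lines of Rust. This is a hedged illustration, not the comparison harness: `Verdict`, `classify`, and the relative tolerance are hypothetical, and the only numbers taken from this PR are the APR layer-3 ffn_swigl std (1.222) and the layer-2 baseline (0.071). A GGUF value near the APR spike supports reading (a); a GGUF value near the baseline supports reading (b).

```rust
/// Possible outcomes of the APR-vs-GGUF layer-3 ffn_swigl comparison.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum Verdict {
    /// (a) GGUF shows the same spike: normal model behavior at layer 3.
    NormalModelBehavior,
    /// (b) GGUF stays near the baseline: APR-side bug.
    AprSideBug,
    /// Neither pattern matches within tolerance; needs a closer look.
    Inconclusive,
}

/// Classify a GGUF-side std against the APR spike and the layer-2 baseline.
/// `rel_tol` is an assumed relative tolerance, not a value from the spec.
fn classify(apr_std: f64, gguf_std: f64, baseline_std: f64, rel_tol: f64) -> Verdict {
    let close = |x: f64, target: f64| (x - target).abs() <= rel_tol * target.abs();
    if close(gguf_std, apr_std) {
        Verdict::NormalModelBehavior
    } else if close(gguf_std, baseline_std) {
        Verdict::AprSideBug
    } else {
        Verdict::Inconclusive
    }
}

fn main() {
    let apr_layer3 = 1.222; // APR layer-3 ffn_swigl std from this PR
    let baseline = 0.071; // APR layer-2 ffn_swigl std from this PR
    // The GGUF values below are placeholders; no GGUF measurement exists yet.
    for gguf in [0.07, 1.22, 0.40] {
        println!("gguf std {gguf:.3} -> {:?}", classify(apr_layer3, gguf, baseline, 0.25));
    }
}
```

Keeping an explicit `Inconclusive` branch avoids forcing an intermediate GGUF value into either hypothesis.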

🤖 Generated with Claude Code

…2.64.0 → v2.65.0

§19 verified `apr pretrain --device cuda` is wired but the canonical apr binary lacked `--features cuda`. §20 records the next step: **rebuild + live dispatch + evidence capture** on RTX 4090.

## What §20 contains (9 subsections)

1. §20.1: Rebuild (40s incremental, `--features cuda` enabled apr-cli)
2. §20.2: Live dispatch command + 100-step JSONL output
3. §20.3: wall_ms statistics: median=264.74ms (47% headroom under GATE-GPUTRAIN-004's 500ms budget)
4. §20.4: nvidia-smi PID 1658504 / 6636 MiB GPU memory captured mid-run
5. §20.5: Gate-by-gate impact table (GATE-GPUTRAIN-002/003/004/005)
6. §20.6: Evidence files at evidence/task-132-residual-b/
7. §20.7: Long-path status: §19.5 step (a) DONE
8. §20.8: What §20 is NOT (contract bump is follow-up PR)
9. §20.9: Methodological alignment (live-evidence pattern, not chain-of-thought)

## Live evidence captured

- 100 real CUDA training steps on noah-Lambda-Vector RTX 4090
- Real corpus: /mnt/nvme-raid0/data/csn-python-shards
- Real tokenizer: /mnt/nvme-raid0/models/model-2-tokenizer-v1 (vocab=50,257)
- wall_ms median: 264.74 ms (range 257.86–467.66 with step 0 = 467.66 kernel-warmup outlier)
- train_loss step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing)
- val_loss=10.31 triggered GATE-TRAIN-005 ship-blocker abort at epoch boundary (correct behavior for fresh-init 370M before convergence)
- nvidia-smi PID 1658504 / 6636 MiB stable mid-run

## Spec progression

v2.64.0 → v2.65.0. Coverage tally update is **pending** the contract bump for `gpu-training-backend-v1.yaml` GATE-GPUTRAIN-004 PARTIAL_ALGORITHM_LEVEL → ACTIVE_WITH_LIVE_EVIDENCE (separate follow-up PR; §20 records the data, the contract amendment captures the durable verdict).

## Stacks under

- #1068 (§19: task #132 correction)
- #1067 (§18: training status snapshot)
- Concrete progress on §19.4 Residual B (live evidence half)
- Pairs with PR #1069 (wall_ms code half; provided the JSONL field used for the GATE-GPUTRAIN-004 timing data)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…t 17× anomaly site — spec v2.65.0 → v2.66.0

§17.4 specified the falsifier next step as sub-layer bisection of {ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added the 4 new ActivationStats fields. §21 records the **first run of the bisection on the canonical 7B teacher**.

## What §21 contains (8 subsections)

- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2× layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters: silu(g) and u are individually normal at layer 3, but their element-wise product is 17×, which implies an unusual positive correlation or an alignment bug
- §21.5 Refined surviving suspect surface: element-wise multiply correctness (`inference.rs:163`) + off-by-one slice indexing as newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)

## Per-layer ffn_swigl progression (key data)

| Layer | ffn_swigl std | Note |
|------:|--------------:|------|
| 0 | 0.088 | |
| 1 | 0.061 | |
| 2 | 0.071 | |
| **3** | **1.222** | ← 17.2× layer 2 |
| 4 | 0.390 | |
| 5-25 | ~0.15-0.55 | |
| 26 | 1.452 | |
| 27 | 2.247 | |

Layer 3 stands out specifically: both above and below it, ffn_swigl is in the 0.06-0.55 band. The 1.22 value is anomalous.

## Bug surface narrowing (across §15→§16→§17→§21)

- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)

The fix surface is now: `inference.rs:160-164`, specifically the `ffn_hidden.push(silu_g * u)` element-wise multiply.

Spec v2.65.0 → v2.66.0. No coverage tally change: investigation-recording, not a discharge.

Evidence persisted to:

- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv

Stacks under #1070 (§20) which is under #1068 (§19) which is under #1067 (§18) which is under #1064 (§17) which is under #1063 (§16).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-04-26T17:59:35Z

Re-authoring as §23 v2.67.0 since §22 (PR #1074) is now on main with v2.66.0 banner. The sub-FFN bisection finding deserves its own section number; this PR's content survives in the new PR.

…t 17× anomaly site — spec v2.66.0 → v2.67.0 (#1075)

§17.4 specified sub-layer bisection of FFN as the falsifier next step. PR #1066 added the 4 sub-FFN ActivationStats fields. §23 records the first run on the canonical 7B teacher post-#1066-merge.

(Originally authored as §21 in the closed PR #1072. Re-numbered as §23 because §22 (PR #1074) landed first with v2.66.0 banner; this PR brings v2.67.0.)

## Key finding

Live `apr trace --payload` on `paiml/qwen2.5-coder-7b-apache-q4k-v1` teacher (CPU, prompt "What is 2+2?") layer-3 sub-FFN std:

| Sub-FFN slot | L1-2 baseline | L3 | Ratio |
|--------------|--------------:|----:|------:|
| ffn_norm | 0.85 / 0.86 | 1.00 | 1.16× normal |
| ffn_gate | 1.50 / 1.99 | 1.92 | 0.97× normal |
| ffn_up | 1.10 / 0.94 | 1.34 | 1.42× small |
| ffn_silu | 0.043 / 0.052 | 0.168 | 3.2× precursor |
| **ffn_swigl** | **0.061 / 0.071** | **1.222** | **17.2× anomaly** |
| ffn_out | 0.345 / 0.216 | 11.459 | 53× cascade |

Gate/up individually normal at layer 3. Element-wise multiply at inference.rs:163 `ffn_hidden.push(silu_g * u)` is the named bug site (possibly off-by-one slice indexing).

## Bug surface narrowing chain

- §15.4: GPU GQA kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out)
- **§23: layer 3 ffn_swigl named (17× first anomaly site)**

## Falsifiable next investigation step (§23.6)

Extend `OwnedQuantizedModel::forward_traced` (the GGUF path; needs to be authored per `project_ship_007_gguf_forward_traced_plan.md`) with same 4 sub-FFN fields. Compare APR vs GGUF layer-3 ffn_swigl directly:

- ≈0.07 → APR-side bug pinned to inference.rs:160-164
- ≈1.22 → spike is normal model behavior; bug elsewhere

## Evidence persisted

- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv (28-layer × 6-field summary)

Spec v2.66.0 → v2.67.0. No coverage tally change.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…P-007 layer-3 H1/H2 unblock

Authors the GGUF-side sub-FFN telemetry contract that unblocks the SHIP-007 layer-3 ffn_swigl bisection (memory: project_ship_007_layer_3_swiglu_bisection.md).

BACKGROUND: SHIP-007 §21 (aprender PR #1072) narrowed the bug to "(layer=3, ffn_swigl element-wise multiply)" on the APR forward path. APR layer-3 ffn_swigl std = 1.222 (17.2× layer-2 baseline). But §21 cannot distinguish:

- H1: Token-position-dependent correlation (NORMAL model behavior)
- H2: APR-side bug (forward path produces wrong VALUES vs GGUF)

without GGUF-side per-layer sub-FFN telemetry. Currently NO forward_traced method exists on OwnedQuantizedModel.

THIS CONTRACT: Pins the architecture for adding GGUF-side traced forward mirroring `trace-ffn-sub-block-v1` (APR sibling). Same 5 sub-FFN fields (gate_proj_out, up_proj_out, silu_gate, swiglu_inner, ffn_down_out). Same equation. Cross-comparison enabled by schema parity (PO-TRACE-FFN-GGUF-002).

Implementation stages (multi-PR cascade, deliberate session):

- M-FFN-GGUF-0: contract scaffold (this PR)
- M-FFN-GGUF-1: LayerActivation struct on GGUF side
- M-FFN-GGUF-2: NEW forward_traced on OwnedQuantizedModel
- M-FFN-GGUF-3: heavy comparison harness APR vs GGUF layer-3
- M-FFN-GGUF-4: SHIP-007 fix PR cites H1 or H2

4 falsification tests defined:

- FFN-GGUF-001: forward_traced exists on OwnedQuantizedModel
- FFN-GGUF-002: byte-identity vs production forward
- FFN-GGUF-003: APR-vs-GGUF layer-3 ffn_swigl std distinguishes H1/H2
- FFN-GGUF-004: SHIP-007 fix PR cites bisected hypothesis

PATTERN PRECEDENT: Mirrors the proven trace-moe-gpu-sub-stages-v1 cascade closure (M50→M87, 2026-05-04→06): contract scaffold first, then implementation stages, then heavy harness, then fix. Bug class is similar: bisection identified a SURFACE but cannot distinguish hypotheses without sibling telemetry.

This contract is the parallel-track autonomous work that should have been authored during M-GPU-MOE CI dead time but wasn't. With M-GPU-MOE-1.x cascade now closed, the SHIP-007 layer-3 cascade can proceed in parallel-or-serial fashion.

Discharges 5 transitively-blocked MODEL-1 PARTIALs per ship-two-models-spec.md §17.5: SHIP-002, SHIP-005, SHIP-006, SHIP-007, SHIP-008.

YAML-only: production hot paths byte-unchanged. `pv validate` 0/0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>


noahgift enabled auto-merge (squash) April 26, 2026 10:21

noahgift force-pushed the docs/ship-007-21-sub-ffn-bisection-result branch 3 times, most recently from 2aaa5f0 to 5d067e1 Compare April 26, 2026 13:05

noahgift and others added 2 commits April 26, 2026 15:31

noahgift force-pushed the docs/ship-007-21-sub-ffn-bisection-result branch from 5d067e1 to 1719aac Compare April 26, 2026 13:31

noahgift closed this Apr 26, 2026

auto-merge was automatically disabled April 26, 2026 17:59
Pull request was closed

noahgift mentioned this pull request Apr 26, 2026

docs(ship-007): §23 sub-FFN bisection — layer-3 ffn_swigl first 17× anomaly site (v2.67.0) #1075

Merged

noahgift mentioned this pull request May 6, 2026

contract(trace-ffn-sub-block-gguf-v1): v1.0.0 PROPOSED scaffold — SHIP-007 layer-3 H1/H2 unblock #1532

Merged

noahgift wants to merge 2 commits into main from docs/ship-007-21-sub-ffn-bisection-result
