docs(ship-007): §21 sub-FFN bisection — layer-3 ffn_swigl first 17× anomaly site (v2.66.0) #1072
Closed
noahgift wants to merge 2 commits into
2aaa5f0 to
5d067e1
Compare
…2.64.0 → v2.65.0

§19 verified `apr pretrain --device cuda` is wired but the canonical apr binary lacked `--features cuda`. §20 records the next step: **rebuild + live dispatch + evidence capture** on RTX 4090.

## What §20 contains (9 subsections)

1. §20.1 — Rebuild (40s incremental, `--features cuda` enabled apr-cli)
2. §20.2 — Live dispatch command + 100-step JSONL output
3. §20.3 — wall_ms statistics: median=264.74ms (47% headroom under GATE-GPUTRAIN-004's 500ms budget)
4. §20.4 — nvidia-smi PID 1658504 / 6636 MiB GPU memory captured mid-run
5. §20.5 — Gate-by-gate impact table (GATE-GPUTRAIN-002/003/004/005)
6. §20.6 — Evidence files at evidence/task-132-residual-b/
7. §20.7 — Long-path status: §19.5 step (a) DONE
8. §20.8 — What §20 is NOT (contract bump is follow-up PR)
9. §20.9 — Methodological alignment (live-evidence pattern, not chain-of-thought)

## Live evidence captured

- 100 real CUDA training steps on noah-Lambda-Vector RTX 4090
- Real corpus: /mnt/nvme-raid0/data/csn-python-shards
- Real tokenizer: /mnt/nvme-raid0/models/model-2-tokenizer-v1 (vocab=50,257)
- wall_ms median: 264.74 ms (range 257.86–467.66 with step 0 = 467.66 kernel-warmup outlier)
- train_loss step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing)
- val_loss=10.31 triggered GATE-TRAIN-005 ship-blocker abort at epoch boundary (correct behavior for fresh-init 370M before convergence)
- nvidia-smi PID 1658504 / 6636 MiB stable mid-run

## Spec progression

v2.64.0 → v2.65.0. Coverage tally update is **pending** the contract bump for `gpu-training-backend-v1.yaml` GATE-GPUTRAIN-004 PARTIAL_ALGORITHM_LEVEL → ACTIVE_WITH_LIVE_EVIDENCE (separate follow-up PR; §20 records the data, the contract amendment captures the durable verdict).

## Stacks under

- #1068 (§19 — task #132 correction)
- #1067 (§18 — training status snapshot)
- Concrete progress on §19.4 Residual B (live evidence half)
- Pairs with PR #1069 (wall_ms code half — provided the JSONL field used for the GATE-GPUTRAIN-004 timing data)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
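The headroom figure quoted for GATE-GPUTRAIN-004 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `median_ms` and `headroom` are hypothetical helper names, and the sample array in `main` is made up apart from the reported step-0 outlier; it is not the real 100-step JSONL data.

```rust
/// Median of a slice of wall-clock samples in milliseconds.
/// Returns None for an empty slice.
fn median_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    } else {
        Some(sorted[mid])
    }
}

/// Fraction of the budget left unused by the measured value.
fn headroom(measured_ms: f64, budget_ms: f64) -> f64 {
    1.0 - measured_ms / budget_ms
}

fn main() {
    // Illustrative samples: 467.66 is the reported warm-up outlier,
    // the other values are invented for the sketch.
    let samples = [467.66, 260.0, 264.0, 266.0, 258.0];
    let med = median_ms(&samples).expect("non-empty samples");
    println!("median of illustrative samples = {med:.2} ms");

    // Reported median (264.74 ms) against the 500 ms gate budget.
    println!("headroom = {:.1}%", headroom(264.74, 500.0) * 100.0);
}
```

Using the median rather than the mean keeps the single kernel-warmup step from distorting the gate comparison.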
…t 17× anomaly site — spec v2.65.0 → v2.66.0
§17.4 specified the falsifier next step as sub-layer bisection of
{ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added
the 4 new ActivationStats fields. §21 records the **first run of
the bisection on the canonical 7B teacher**.
## What §21 contains (8 subsections)
- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2×
layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters — gate and up are individually normal at
layer 3 (silu(g) shows only the 3.2× precursor), but the elementwise
product silu(g)*u is 17× — implies an unusual positive correlation or
an alignment bug
- §21.5 Refined surviving suspect surface — element-wise multiply
correctness (`inference.rs:163`) + off-by-one slice indexing as
newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare
APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on
PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)
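The §21.4 argument relies on the fact that per-tensor std values say nothing about how two tensors are paired element by element. The toy sketch below (hypothetical helper names, hand-picked values, not trace data) shows two pairings of the same `u` values: the std of `u` is identical in both cases, yet the std of the product differs by 3×. This is why both a genuine correlation and a slice misalignment remain open explanations.

```rust
/// Population standard deviation; 0.0 for an empty slice.
fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    var.sqrt()
}

/// Std of the element-wise product a[i] * b[i].
fn product_std(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    let prod: Vec<f32> = a.iter().zip(b).map(|(x, y)| x * y).collect();
    std_dev(&prod)
}

fn main() {
    // Toy stand-in for silu(g): half the lanes are gated off.
    let silu_g = [0.0, 0.0, 1.0, 1.0];
    // Same multiset of `u` values, paired differently with the gate.
    let u_large_on_open = [-1.0, 1.0, -3.0, 3.0];
    let u_large_on_closed = [-3.0, 3.0, -1.0, 1.0];

    println!("std(u), pairing A: {}", std_dev(&u_large_on_open));
    println!("std(u), pairing B: {}", std_dev(&u_large_on_closed));
    println!("std(silu_g * u), A: {}", product_std(&silu_g, &u_large_on_open));
    println!("std(silu_g * u), B: {}", product_std(&silu_g, &u_large_on_closed));
}
```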
## Per-layer ffn_swigl progression (key data)
| Layer | ffn_swigl std | Note |
|------:|--------------:|:-----|
| 0 | 0.088 | |
| 1 | 0.061 | |
| 2 | 0.071 | |
| **3** | **1.222** | 17.2× layer 2 |
| 4 | 0.390 | |
| 5-25 | ~0.15-0.55 | |
| 26 | 1.452 | |
| 27 | 2.247 | |
Layer 3 stands out specifically: layers 0-2 and 4-25 all sit in the
0.06-0.55 band, and ffn_swigl does not exceed 1.0 again until the final
two layers (26-27). The 1.22 value at layer 3 is anomalous.
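The "first anomaly site" reading of the table can be expressed as a scan for the first layer whose std jumps by some factor over the previous layer. The sketch below is a minimal illustration: `first_spike` is a hypothetical helper, the 10× threshold is an assumption chosen for the example, and only the five exactly reported layer values (0-4) are used because layers 5-25 are given as a range.

```rust
/// Returns (layer index, ratio to previous layer) for the first layer whose
/// std is at least `threshold` times the previous layer's std.
fn first_spike(stds: &[f64], threshold: f64) -> Option<(usize, f64)> {
    stds.windows(2).enumerate().find_map(|(i, w)| {
        let ratio = w[1] / w[0];
        // Skip zero baselines so the ratio is always well defined.
        (w[0] > 0.0 && ratio >= threshold).then_some((i + 1, ratio))
    })
}

fn main() {
    // ffn_swigl std for layers 0..=4, copied from the table above.
    let ffn_swigl_std = [0.088, 0.061, 0.071, 1.222, 0.390];
    // Assumed threshold: a 10x layer-over-layer jump counts as a spike.
    match first_spike(&ffn_swigl_std, 10.0) {
        Some((layer, ratio)) => println!("first spike at layer {layer}: {ratio:.1}x"),
        None => println!("no spike found"),
    }
}
```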
## Bug surface narrowing (across §15→§16→§17→§21)
- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)
The fix surface is now: `inference.rs:160-164`, specifically the
`ffn_hidden.push(silu_g * u)` element-wise multiply.
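To make the off-by-one hypothesis concrete, the sketch below pairs a reference element-wise multiply with a deliberately misaligned variant and measures the difference. It is not the code at `inference.rs:160-164`: only the `ffn_hidden.push(silu_g * u)` expression comes from the source, and `swiglu_inner`, `swiglu_inner_shifted`, and `max_abs_diff` are hypothetical names introduced for illustration.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Reference SwiGLU inner product: out[i] = silu(gate[i]) * up[i].
fn swiglu_inner(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate and up must have equal length");
    let mut ffn_hidden = Vec::with_capacity(gate.len());
    for (&g, &u) in gate.iter().zip(up.iter()) {
        let silu_g = silu(g);
        ffn_hidden.push(silu_g * u);
    }
    ffn_hidden
}

/// Deliberately wrong variant for illustration: pairs silu(gate[i]) with
/// up[i + 1] (wrapping), mimicking an off-by-one slice offset.
fn swiglu_inner_shifted(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate and up must have equal length");
    let n = up.len();
    (0..n).map(|i| silu(gate[i]) * up[(i + 1) % n]).collect()
}

/// Largest absolute element-wise difference between two equal-length slices.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

fn main() {
    let gate = [1.0, 2.0, 3.0];
    let up = [1.0, 0.0, -1.0];
    let reference = swiglu_inner(&gate, &up);
    let shifted = swiglu_inner_shifted(&gate, &up);
    println!("max |reference - shifted| = {}", max_abs_diff(&reference, &shifted));
}
```

An element-level comparison like this catches a misalignment that std-only telemetry might hide, since a shifted pairing leaves the marginal statistics of `gate` and `up` unchanged.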
Spec v2.65.0 → v2.66.0. No coverage tally change — investigation-
recording, not a discharge.
Evidence persisted to:
- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv
Stacks under #1070 (§20) which is under #1068 (§19) which is under
#1067 (§18) which is under #1064 (§17) which is under #1063 (§16).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5d067e1 to
1719aac
Compare
Contributor
Author
Re-authoring as §23 v2.67.0 since §22 (PR #1074) is now on main with the v2.66.0 banner. The sub-FFN bisection finding deserves its own section number; this PR's content survives in the new PR.
auto-merge was automatically disabled
April 26, 2026 17:59
Pull request was closed
noahgift
added a commit
that referenced
this pull request
Apr 26, 2026
…t 17× anomaly site — spec v2.66.0 → v2.67.0 (#1075)

§17.4 specified sub-layer bisection of FFN as the falsifier next step. PR #1066 added the 4 sub-FFN ActivationStats fields. §23 records the first run on the canonical 7B teacher post-#1066-merge.

(Originally authored as §21 in the closed PR #1072. Re-numbered as §23 because §22 (PR #1074) landed first with v2.66.0 banner; this PR brings v2.67.0.)

## Key finding

Live `apr trace --payload` on `paiml/qwen2.5-coder-7b-apache-q4k-v1` teacher (CPU, prompt "What is 2+2?") layer-3 sub-FFN std:

| Sub-FFN slot | L1-2 baseline | L3 | Ratio |
|--------------|--------------:|----:|------:|
| ffn_norm | 0.85 / 0.86 | 1.00 | 1.16× normal |
| ffn_gate | 1.50 / 1.99 | 1.92 | 0.97× normal |
| ffn_up | 1.10 / 0.94 | 1.34 | 1.42× small |
| ffn_silu | 0.043 / 0.052 | 0.168 | 3.2× precursor |
| **ffn_swigl** | **0.061 / 0.071** | **1.222** | **17.2× anomaly** |
| ffn_out | 0.345 / 0.216 | 11.459 | 53× cascade |

Gate/up individually normal at layer 3. Element-wise multiply at inference.rs:163 `ffn_hidden.push(silu_g * u)` is the named bug site (possibly off-by-one slice indexing).

## Bug surface narrowing chain

- §15.4: GPU GQA kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out)
- **§23: layer 3 ffn_swigl named (17× first anomaly site)**

## Falsifiable next investigation step (§23.6)

Extend `OwnedQuantizedModel::forward_traced` (the GGUF path; needs to be authored per `project_ship_007_gguf_forward_traced_plan.md`) with same 4 sub-FFN fields. Compare APR vs GGUF layer-3 ffn_swigl directly:

- ≈0.07 → APR-side bug pinned to inference.rs:160-164
- ≈1.22 → spike is normal model behavior; bug elsewhere

## Evidence persisted

- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv (28-layer × 6-field summary)

Spec v2.66.0 → v2.67.0. No coverage tally change.
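The two-outcome falsifier in §23.6 can be written down as a small decision rule. This is a hedged sketch, not part of the repository: the `Verdict` enum, the `classify` function, and the 25% relative tolerance are all assumptions made for illustration; only the reference values 0.07 and 1.222 come from the commit message above.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    /// GGUF stays near the layer-2 baseline, so the spike is APR-specific.
    AprSideBug,
    /// GGUF reproduces the spike, so the spike is model behavior.
    NormalModelBehavior,
    /// GGUF matches neither reference value within tolerance.
    Inconclusive,
}

/// Classifies a GGUF layer-3 ffn_swigl std against the two reference values.
/// `rel_tol` is a relative tolerance and is an assumption of this sketch.
fn classify(gguf_std: f64, baseline_std: f64, apr_std: f64, rel_tol: f64) -> Verdict {
    let near = |x: f64, target: f64| ((x - target) / target).abs() <= rel_tol;
    if near(gguf_std, apr_std) {
        Verdict::NormalModelBehavior
    } else if near(gguf_std, baseline_std) {
        Verdict::AprSideBug
    } else {
        Verdict::Inconclusive
    }
}

fn main() {
    // Reference values from the commit message: baseline ~0.07, APR layer 3 = 1.222.
    // The GGUF inputs below are hypothetical; no GGUF measurement exists yet.
    for gguf in [0.07, 1.2, 0.4] {
        println!("gguf std {gguf} -> {:?}", classify(gguf, 0.07, 1.222, 0.25));
    }
}
```

Keeping an explicit `Inconclusive` outcome avoids forcing an intermediate GGUF value into either hypothesis.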
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
3 tasks
noahgift
added a commit
that referenced
this pull request
May 6, 2026
…P-007 layer-3 H1/H2 unblock (#1532)

Authors the GGUF-side sub-FFN telemetry contract that unblocks the SHIP-007 layer-3 ffn_swigl bisection (memory: project_ship_007_layer_3_swiglu_bisection.md).

BACKGROUND: SHIP-007 §21 (aprender PR #1072) narrowed the bug to "(layer=3, ffn_swigl element-wise multiply)" on the APR forward path. APR layer-3 ffn_swigl std = 1.222 (17.2× layer-2 baseline). But §21 cannot distinguish:

- H1: Token-position-dependent correlation (NORMAL model behavior)
- H2: APR-side bug (forward path produces wrong VALUES vs GGUF)

without GGUF-side per-layer sub-FFN telemetry. Currently NO forward_traced method exists on OwnedQuantizedModel.

THIS CONTRACT: Pins the architecture for adding GGUF-side traced forward mirroring `trace-ffn-sub-block-v1` (APR sibling). Same 5 sub-FFN fields (gate_proj_out, up_proj_out, silu_gate, swiglu_inner, ffn_down_out). Same equation. Cross-comparison enabled by schema parity (PO-TRACE-FFN-GGUF-002).

Implementation stages (multi-PR cascade, deliberate session):

- M-FFN-GGUF-0: contract scaffold (this PR)
- M-FFN-GGUF-1: LayerActivation struct on GGUF side
- M-FFN-GGUF-2: NEW forward_traced on OwnedQuantizedModel
- M-FFN-GGUF-3: heavy comparison harness APR vs GGUF layer-3
- M-FFN-GGUF-4: SHIP-007 fix PR cites H1 or H2

4 falsification tests defined:

- FFN-GGUF-001: forward_traced exists on OwnedQuantizedModel
- FFN-GGUF-002: byte-identity vs production forward
- FFN-GGUF-003: APR-vs-GGUF layer-3 ffn_swigl std distinguishes H1/H2
- FFN-GGUF-004: SHIP-007 fix PR cites bisected hypothesis

PATTERN PRECEDENT: Mirrors the proven trace-moe-gpu-sub-stages-v1 cascade closure (M50→M87, 2026-05-04→06): contract scaffold first, then implementation stages, then heavy harness, then fix. Bug class is similar: bisection identified a SURFACE but cannot distinguish hypotheses without sibling telemetry.

This contract is the parallel-track autonomous work that should have been authored during M-GPU-MOE CI dead time but wasn't. With M-GPU-MOE-1.x cascade now closed, the SHIP-007 layer-3 cascade can proceed in parallel-or-serial fashion.

Discharges 5 transitively-blocked MODEL-1 PARTIALs per ship-two-models-spec.md §17.5: SHIP-002, SHIP-005, SHIP-006, SHIP-007, SHIP-008.

YAML-only — production hot paths byte-unchanged. `pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
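Falsification test FFN-GGUF-002 asks for byte-identity between the traced and the production forward pass. One way such a check could look is sketched below. The helper names are hypothetical, and the inputs are plain `f32` slices rather than real model outputs, since `forward_traced` does not exist on the GGUF side yet.

```rust
/// True when both slices have the same length and every pair of elements
/// has the same bit pattern (so 0.0 and -0.0 count as different).
fn bitwise_identical(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

/// Index of the first differing element, or the common length when one
/// slice is a strict prefix of the other; None when fully identical.
fn first_mismatch(a: &[f32], b: &[f32]) -> Option<usize> {
    let common = a.len().min(b.len());
    (0..common)
        .find(|&i| a[i].to_bits() != b[i].to_bits())
        .or_else(|| (a.len() != b.len()).then_some(common))
}

fn main() {
    // Stand-ins for logits from the production and traced forward passes.
    let production = [0.25_f32, -1.5, 3.0];
    let traced = [0.25_f32, -1.5, 3.0];
    println!("identical: {}", bitwise_identical(&production, &traced));
    println!("first mismatch: {:?}", first_mismatch(&production, &traced));
}
```

Comparing bit patterns instead of using `==` is stricter than a tolerance check: it distinguishes `0.0` from `-0.0` and treats NaN values with the same bit pattern as equal, which matches a byte-identity requirement.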
Summary
Key finding — sub-FFN std at layer 3
Both gate and up are individually normal at layer 3. Their element-wise product
`silu(g) * u` lands at 17× the layer-2 baseline. The bug site is `inference.rs:160-164`, specifically the `ffn_hidden.push(silu_g * u)` element-wise multiply, possibly with off-by-one indexing.
What §21 contains (8 subsections)
Stacks under
Evidence files
Test plan
Why this matters
§21.6 specifies a focused falsifier: extend the GGUF-path `forward_traced` with the same 4 fields, then compare APR vs GGUF layer-3 ffn_swigl directly. This is the next 1-session task, and it disambiguates between (a) "normal model behavior at layer 3" and (b) "APR-side bug in element-wise multiply". Per §17.5: whatever fix lands also discharges all 5 transitively-blocked MODEL-1 PARTIALs (SHIP-002/005/006/007/008).
🤖 Generated with Claude Code