feat(sub-ffn-telemetry): 4 new ActivationStats fields on LayerActivation — implements trace-ffn-sub-block-v1.yaml #1066
Merged
…ml — 4 new ActivationStats fields on LayerActivation
Implements `contracts/trace-ffn-sub-block-v1.yaml` v1.0.0 (PR #1065).
Load-bearing for the SHIP-007 fix per ship-two-models-spec.md §15.5 + §17.4.
Closes the per-layer telemetry gap that was the §17.4 falsifier prerequisite.
## What changed
`LayerActivation` (apr_transformer/mod.rs) gains 4 new `ActivationStats`
fields between `ffn_norm_stats` and `ffn_out_stats`:
- ffn_gate_stats (post-gate-proj-matmul; OBL-SUB-FFN-001)
- ffn_up_stats (post-up-proj-matmul; OBL-SUB-FFN-002)
- ffn_silu_gate_stats (post-SiLU on gate; OBL-SUB-FFN-003)
- ffn_swiglu_inner_stats (post-elementwise silu(gate)*up; OBL-SUB-FFN-004)
`forward_traced` (CPU path, inference.rs) populates all 4 on the SwiGLU
path; on the GELU/non-gated path, only `ffn_up_stats` is populated (other
3 stay default-zero — there's no SwiGLU sub-structure to capture).
GPU TracedForward (forward_from_model.rs, gpu_forward_pass.rs) zero-fills
the 4 new fields per OBL-SUB-FFN-008 (FALSIFY-SUB-FFN-004 follow-up —
GPU-side capture is staged separately).
CLI renderer (apr-cli/src/commands/vector_stats.rs) emits 4 new lines per
layer in computation order (ffn_gate/up/silu/swigl) BETWEEN ffn_norm and
ffn_out, suppressed when default-zero (GPU path until follow-up).
## Backward compat (FALSIFY-SUB-FFN-002)
`ffn_out_stats` semantics are byte-identical pre/post: same matmul output,
same residual contribution. All 720 existing apr_transformer tests pass
without modification.
## New tests
`tests/forward_traced.rs::test_sub_ffn_telemetry_swiglu_path_populates_all_4_fields`
asserts SwiGLU path populates all 4 fields with the correct
intermediate_dim count (FALSIFY-SUB-FFN-001 + FALSIFY-SUB-FFN-006).
`tests/forward_traced.rs::test_sub_ffn_telemetry_gelu_path_only_up_populated`
asserts GELU path leaves 3 fields default-zero and only populates
ffn_up_stats with pre-GELU values.
## What this enables
§17.4's falsifier next step: run `apr trace --payload` on the canonical 7B
teacher and identify whichever of {ffn_gate, ffn_up, ffn_silu_gate,
ffn_swiglu_inner, ffn_out} carries the layer-3 53× spike. Whichever
sub-tensor first shows the discontinuity is the SHIP-007 bug site.
## Outstanding follow-ups (per contract)
- OBL-SUB-FFN-008: GPU sub-FFN telemetry capture — staged for follow-up
- Live evidence on the canonical 7B teacher pinning the layer-3 spike to a
  sub-FFN slot — staged for follow-up
Closes contract OBL-SUB-FFN-001..007 algorithm-level. Live evidence
discharge pending follow-up `apr trace` run.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
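For readers who have not opened the diff, the capture order described in this PR can be sketched roughly as follows. This is a minimal illustration, not the code in inference.rs: `Stats`, `SubFfnStats`, and `trace_swiglu` are hypothetical stand-ins, and the real `ActivationStats` presumably carries more than count/mean/std.

```rust
/// Hypothetical stand-in for the PR's `ActivationStats` (count/mean/std only).
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Stats {
    count: usize,
    mean: f32,
    std: f32,
}

impl Stats {
    /// Population mean and standard deviation; empty input stays default-zero.
    fn from_slice(xs: &[f32]) -> Self {
        if xs.is_empty() {
            return Self::default();
        }
        let n = xs.len() as f32;
        let mean = xs.iter().sum::<f32>() / n;
        let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
        Self { count: xs.len(), mean, std: var.sqrt() }
    }
}

/// The four sub-FFN slots named in this PR, in computation order.
#[derive(Debug, Default)]
struct SubFfnStats {
    ffn_gate_stats: Stats,
    ffn_up_stats: Stats,
    ffn_silu_gate_stats: Stats,
    ffn_swiglu_inner_stats: Stats,
}

fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Given the gate and up projection outputs, compute silu(gate) * up and
/// record stats after every intermediate stage.
fn trace_swiglu(gate: &[f32], up: &[f32]) -> (Vec<f32>, SubFfnStats) {
    assert_eq!(gate.len(), up.len(), "gate and up must share intermediate_dim");
    let silu_gate: Vec<f32> = gate.iter().map(|&g| silu(g)).collect();
    let inner: Vec<f32> = silu_gate.iter().zip(up).map(|(s, u)| s * u).collect();
    let stats = SubFfnStats {
        ffn_gate_stats: Stats::from_slice(gate),
        ffn_up_stats: Stats::from_slice(up),
        ffn_silu_gate_stats: Stats::from_slice(&silu_gate),
        ffn_swiglu_inner_stats: Stats::from_slice(&inner),
    };
    (inner, stats)
}

fn main() {
    let gate = [0.0_f32, 1.0, -1.0, 2.0];
    let up = [1.0_f32, 1.0, 1.0, 1.0];
    let (inner, stats) = trace_swiglu(&gate, &up);
    println!("{inner:?}\n{stats:#?}");
}
```

On a GELU/non-gated path the same struct would keep three slots at `Stats::default()`, which is what lets a renderer suppress them.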
noahgift added a commit that referenced this pull request on Apr 26, 2026
…t — spec v2.62.0 → v2.63.0
§18 walks the deduction chain that connects the spec's two-model goal to
the current state, so future sessions can re-enter the work without
re-reading every prior section.
## Section structure
- §18.1 Why are we training models at all?
- §18.2 What does "DISCHARGED" mean here, and where are we?
- §18.3 MODEL-1 — five fully discharged, five blocked on one bug
- §18.4 MODEL-2 — three discharged, nine blocked on convergence
- §18.5 What's blocking the convergence run?
- §18.6 The SHIP-007 narrowing — chain of deductions (ASCII diagram)
- §18.7 What "knowing" looks like at each step
- §18.8 What's the next observable state-change? (two parallel paths)
- §18.9 Methodological invariant (5-step loop)
## Key durable facts captured
- Coverage tally: 33 PARTIAL + 12 DISCHARGED across 45 levers
- MODEL-1: 5/10 ACs DISCHARGED (SHIP-001/003/004/009/010); 5/10 PARTIAL
  all transitively gated on SHIP-007
- MODEL-2: 3/12 ACs DISCHARGED (SHIP-011/021/022); 9/12 PARTIAL gated on a
  converged 370M run; the convergence run itself is blocked at task #132
  (`apr pretrain --device cuda` not yet wired through
  `TransformerTrainer::new`)
- GPUTRAIN suite: 7/7 DISCHARGED (full closure)
- SHIP-007 narrowing diagram: §15 hypothesis → §15.4 GPU GQA kernel
  ELIMINATED → §16 GPU stack ELIMINATED (CPU APR vs CPU GGUF divergence)
  → §17 layer 3 FFN sub-block named (53× spike) → sub-FFN bisection
  (PR #1066)
- Two parallel paths to next observable state-change: (a) short — SHIP-007
  bug site named via sub-FFN trace; (b) long — task #132 + Stack v2
  tokenization + convergence
## What this is NOT
This is investigation-recording, not a discharge. Coverage tally
unchanged. The chain-of-thought view is intentionally narrative-style to
make the deduction structure obvious to future sessions and to external
readers asking "where are we?"
Spec v2.62.0 → v2.63.0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on Apr 26, 2026
…t 17× anomaly site — spec v2.65.0 → v2.66.0
§17.4 specified the falsifier next step as sub-layer bisection of
{ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added
the 4 new ActivationStats fields. §21 records the **first run of
the bisection on the canonical 7B teacher**.
## What §21 contains (8 subsections)
- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2×
layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters — silu(g) and u individually normal at
layer 3, but their elementwise product is 17× — implies an
unusual positive correlation or alignment bug
- §21.5 Refined surviving suspect surface — element-wise multiply
correctness (`inference.rs:163`) + off-by-one slice indexing as
newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare
APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on
PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)
## Per-layer ffn_swigl progression (key data)
| Layer | ffn_swigl std |
|------:|--------------:|
| 0 | 0.088 |
| 1 | 0.061 |
| 2 | 0.071 |
| **3** | **1.222** | ← 17.2× layer 2
| 4 | 0.390 |
| 5-25 | ~0.15-0.55 |
| 26 | 1.452 |
| 27 | 2.247 |
Layer 3 stands out specifically: its neighbors on both sides (layers
0-2 and 4-25) sit in the 0.06-0.55 band, so the 1.22 value is
anomalous there. Only the final layers 26-27 exceed it.
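The "first anomaly site" reading of the table can be reproduced with a small scan over consecutive layers. This is a sketch only: `first_spike` is a hypothetical helper, and the 10× threshold is an assumption chosen for illustration, not a value taken from the spec.

```rust
/// Return the first layer index whose std is at least `factor` times the
/// std of the layer before it, together with that ratio.
fn first_spike(stds: &[f32], factor: f32) -> Option<(usize, f32)> {
    stds.windows(2).enumerate().find_map(|(i, pair)| {
        let ratio = pair[1] / pair[0];
        (ratio >= factor).then_some((i + 1, ratio))
    })
}

fn main() {
    // ffn_swigl std for layers 0..=4, copied from the table in this message.
    let stds = [0.088_f32, 0.061, 0.071, 1.222, 0.390];
    match first_spike(&stds, 10.0) {
        Some((layer, ratio)) => println!("first spike at layer {layer}: {ratio:.1}x"),
        None => println!("no spike at this threshold"),
    }
}
```

A layer-to-layer ratio is used rather than an absolute cutoff because layers 26-27 are larger in absolute terms but are not preceded by a comparable jump in the summarized data.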
## Bug surface narrowing (across §15→§16→§17→§21)
- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)
The fix surface is now: `inference.rs:160-164`, specifically the
`ffn_hidden.push(silu_g * u)` element-wise multiply.
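Why an indexing slip is a plausible candidate at this site: shifting one operand of an element-wise product leaves each operand's own statistics untouched but can change the product's spread arbitrarily. The toy below illustrates that property with made-up values; `product_with_offset` and the wrapping offset are hypothetical and say nothing about what inference.rs actually does.

```rust
/// Population standard deviation.
fn std_dev(xs: &[f32]) -> f32 {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    (xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n).sqrt()
}

/// Element-wise product where `up` is read at a wrapping offset.
/// Offset 0 is the correct alignment; offset 1 models an off-by-one read.
fn product_with_offset(silu_gate: &[f32], up: &[f32], offset: usize) -> Vec<f32> {
    assert_eq!(silu_gate.len(), up.len());
    let n = up.len();
    silu_gate
        .iter()
        .enumerate()
        .map(|(i, s)| s * up[(i + offset) % n])
        .collect()
}

fn main() {
    // Toy data: large gate values line up with zero `up` values and vice versa.
    let silu_gate = [2.0_f32, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0];
    let up = [0.0_f32, 3.0, 0.0, 3.0, 0.0, 3.0, 0.0, 3.0];

    let aligned = product_with_offset(&silu_gate, &up, 0);
    let shifted = product_with_offset(&silu_gate, &up, 1);

    // The operands' own stds are identical in both cases; only the product moves.
    println!("std(silu_gate) = {}", std_dev(&silu_gate));
    println!("std(up)        = {}", std_dev(&up));
    println!("std(aligned)   = {}", std_dev(&aligned));
    println!("std(shifted)   = {}", std_dev(&shifted));
}
```

The same signature (normal inputs, abnormal product) also arises from a genuine correlation between silu(g) and u, which is why a reference comparison is still needed to tell the two apart.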
Spec v2.65.0 → v2.66.0. No coverage tally change — investigation-
recording, not a discharge.
Evidence persisted to:
- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv
Stacks under #1070 (§20) which is under #1068 (§19) which is under
#1067 (§18) which is under #1064 (§17) which is under #1063 (§16).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on Apr 26, 2026
…1068) * docs(ship-two-001): §18 — training status snapshot as chain-of-thought — spec v2.62.0 → v2.63.0 §18 walks the deduction chain that connects the spec's two-model goal to the current state, so future sessions can re-enter the work without re-reading every prior section. ## Section structure - §18.1 Why are we training models at all? - §18.2 What does "DISCHARGED" mean here, and where are we? - §18.3 MODEL-1 — five fully discharged, five blocked on one bug - §18.4 MODEL-2 — three discharged, nine blocked on convergence - §18.5 What's blocking the convergence run? - §18.6 The SHIP-007 narrowing — chain of deductions (ASCII diagram) - §18.7 What "knowing" looks like at each step - §18.8 What's the next observable state-change? (two parallel paths) - §18.9 Methodological invariant (5-step loop) ## Key durable facts captured - Coverage tally: 33 PARTIAL + 12 DISCHARGED across 45 levers - MODEL-1: 5/10 ACs DISCHARGED (SHIP-001/003/004/009/010); 5/10 PARTIAL all transitively gated on SHIP-007 - MODEL-2: 3/12 ACs DISCHARGED (SHIP-011/021/022); 9/12 PARTIAL gated on a converged 370M run; the convergence run itself is blocked at task #132 (`apr pretrain --device cuda` not yet wired through `TransformerTrainer::new`) - GPUTRAIN suite: 7/7 DISCHARGED (full closure) - SHIP-007 narrowing diagram: §15 hypothesis → §15.4 GPU GQA kernel ELIMINATED → §16 GPU stack ELIMINATED (CPU APR vs CPU GGUF divergence) → §17 layer 3 FFN sub-block named (53× spike) → sub-FFN bisection (PR #1066) - Two parallel paths to next observable state-change: (a) short — SHIP-007 bug site named via sub-FFN trace; (b) long — task #132 + Stack v2 tokenization + convergence ## What this is NOT This is investigation-recording, not a discharge. Coverage tally unchanged. The chain-of-thought view is intentionally narrative-style to make the deduction structure obvious to future sessions and to external readers asking "where are we?" Spec v2.62.0 → v2.63.0. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs(ship-two-001): §19 — task #132 correction (CUDA training has shipped) — spec v2.63.0 → v2.64.0
§18.5 (just landed in PR #1067) stated:
> Training compute is the real risk — `apr pretrain --device cuda`
> is NOT functional today (task #132).
A sub-agent investigation on 2026-04-26 confirmed this premise was
outdated by ~5 days. Task #132 closed at commit f7ad114 (2026-04-21).
## What's actually on disk
The CLI dispatch path is wired:
    apr pretrain --device {cpu|cuda|auto}
      → resolve_device()  (entrenar::train::device, train/device.rs:110)
      → drive_real(...)   (apr-cli/src/commands/pretrain.rs:252-301)
          ├── Device::Cuda → drive_real_cuda(...)  (pretrain.rs:336-364)
          │     → CudaTransformerTrainer::new(cfg)
          │       (transformer_trainer/cuda_trainer.rs:2156-2244)
          └── Device::Cpu  → drive_real_cpu(...)   (pretrain.rs:307-325)
                → TransformerTrainer::new(cfg)
GPU kernels invoked from cuda branch (all in aprender-train/src/autograd/):
- forward: gemm_forward, rms_norm_forward, pre_warm_forward_kernels
- backward: gemm_backward_a/b, rms_norm_backward
- optimizer/loss: adamw_step_cuda, fused_cross_entropy_cuda
- AMP: GradScaler
D2H per step bounded to ~512 B (loss_partials). AdamW state lives on GPU.
## Live smoke test on noah-Lambda-Vector RTX 4090
    $ /mnt/nvme-raid0/targets/aprender/release/apr pretrain \
        --dataset ... --tokenizer ... --run-dir ... \
        --device cuda --synthetic --num-steps 4 --json
    error: --device `cuda` requested but CUDA runtime is not available on
    this host (contract gpu-training-backend-v1 GATE-GPUTRAIN-002: no
    silent CPU fallback). Rebuild with `--features cuda` or pass
    `--device cpu`.
The graceful contract-cited error proves: CLI parses --device cuda
correctly; dispatch path emits GATE-GPUTRAIN-002 when binary lacks the
`cuda` feature. Per `feedback_cuda_feature_footgun.md`, this is a
rebuild-time issue, not a code-architecture gap.
## Three real residuals (post-§19)
(A) INV-TRAIN-003 GPU AdamW-state sha256 — small PR
(B) GATE-GPUTRAIN-004/005 live evidence — small PR + operator dispatch
(C) Operator authorization for 10K-step run — decision, not engineering
## Methodological lesson (§19.8)
§15→§17 narrowing was good chain-of-thought (each deduction a
falsifiable result on live evidence). §18.5 was bad chain-of-thought —
the premise was inherited from a stale memory entry without
re-verification. Going forward (per `feedback_no_guessing.md`): when a
§18-style status snapshot cites a memory entry as evidence for a gap, the
memory entry's claims must be re-verified against the code at write-time.
Spec v2.63.0 → v2.64.0. No coverage tally change. Memory entry
`project_task_132_cuda_training_backend_gap.md` description updated
separately to reflect the closed status.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
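The "no silent CPU fallback" behavior that the smoke test exercises can be sketched as a small resolver. Everything here is an assumption for illustration: this `resolve_device` only borrows its name from the commit message, its signature is invented, the treatment of `auto` is a guess, and the error text is a paraphrase rather than the CLI's output.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Requested {
    Cpu,
    Cuda,
    Auto,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Device {
    Cpu,
    Cuda,
}

/// Hypothetical resolver: an explicit `cuda` request must fail loudly when
/// CUDA is unavailable instead of quietly running on the CPU.
fn resolve_device(requested: Requested, cuda_available: bool) -> Result<Device, String> {
    match (requested, cuda_available) {
        (Requested::Cpu, _) => Ok(Device::Cpu),
        (Requested::Cuda, true) => Ok(Device::Cuda),
        (Requested::Cuda, false) => Err(
            "cuda requested but unavailable (GATE-GPUTRAIN-002: no silent CPU fallback)"
                .to_string(),
        ),
        // Assumption: `auto` prefers CUDA and may fall back, because the
        // caller did not ask for a specific device.
        (Requested::Auto, true) => Ok(Device::Cuda),
        (Requested::Auto, false) => Ok(Device::Cpu),
    }
}

fn main() {
    for requested in [Requested::Cpu, Requested::Cuda, Requested::Auto] {
        println!("{requested:?} without CUDA -> {:?}", resolve_device(requested, false));
    }
}
```

Returning an error for the explicit request is what makes a missing `cuda` build feature visible at the CLI instead of showing up later as an unexpectedly slow run.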
noahgift added a commit that referenced this pull request on Apr 26, 2026
…t 17× anomaly site — spec v2.66.0 → v2.67.0 (#1075)
§17.4 specified sub-layer bisection of FFN as the falsifier next step.
PR #1066 added the 4 sub-FFN ActivationStats fields. §23 records the
first run on the canonical 7B teacher post-#1066-merge.
(Originally authored as §21 in the closed PR #1072. Re-numbered as §23
because §22 (PR #1074) landed first with v2.66.0 banner; this PR brings
v2.67.0.)
## Key finding
Live `apr trace --payload` on `paiml/qwen2.5-coder-7b-apache-q4k-v1`
teacher (CPU, prompt "What is 2+2?") layer-3 sub-FFN std:
| Sub-FFN slot  | L1 / L2 baseline  | L3        | Ratio             |
|---------------|------------------:|----------:|------------------:|
| ffn_norm      | 0.85 / 0.86       | 1.00      | 1.16× normal      |
| ffn_gate      | 1.50 / 1.99       | 1.92      | 0.97× normal      |
| ffn_up        | 1.10 / 0.94       | 1.34      | 1.42× small       |
| ffn_silu      | 0.043 / 0.052     | 0.168     | 3.2× precursor    |
| **ffn_swigl** | **0.061 / 0.071** | **1.222** | **17.2× anomaly** |
| ffn_out       | 0.345 / 0.216     | 11.459    | 53× cascade       |
Gate/up individually normal at layer 3. Element-wise multiply at
inference.rs:163 `ffn_hidden.push(silu_g * u)` is the named bug site
(possibly off-by-one slice indexing).
## Bug surface narrowing chain
- §15.4: GPU GQA kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out)
- **§23: layer 3 ffn_swigl named (17× first anomaly site)**
## Falsifiable next investigation step (§23.6)
Extend `OwnedQuantizedModel::forward_traced` (the GGUF path; needs to be
authored per `project_ship_007_gguf_forward_traced_plan.md`) with same 4
sub-FFN fields. Compare APR vs GGUF layer-3 ffn_swigl directly:
- ≈0.07 → APR-side bug pinned to inference.rs:160-164
- ≈1.22 → spike is normal model behavior; bug elsewhere
## Evidence persisted
- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv (28-layer × 6-field summary)
Spec v2.66.0 → v2.67.0. No coverage tally change.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
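The two-outcome falsifier in §23.6 can be written down as an explicit decision rule, which also forces a third outcome for values that match neither reference. This is a sketch: `classify`, `Verdict`, and the relative tolerance are hypothetical; only the 0.07 and 1.22 reference values come from the message above.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    /// GGUF layer-3 ffn_swigl looks like the layer-2 baseline: APR diverges.
    AprSideBug,
    /// GGUF shows the same spike: the spike is model behavior, bug elsewhere.
    NormalModelBehavior,
    /// Neither reference value matches within tolerance.
    Inconclusive,
}

/// Compare the GGUF layer-3 ffn_swigl std against the two reference values,
/// using a relative tolerance (an assumed parameter, not from the spec).
fn classify(gguf_std: f32, rel_tol: f32) -> Verdict {
    const BASELINE: f32 = 0.07;
    const SPIKE: f32 = 1.22;
    let near = |target: f32| ((gguf_std - target) / target).abs() <= rel_tol;
    if near(BASELINE) {
        Verdict::AprSideBug
    } else if near(SPIKE) {
        Verdict::NormalModelBehavior
    } else {
        Verdict::Inconclusive
    }
}

fn main() {
    for observed in [0.071_f32, 1.20, 0.40] {
        println!("gguf std {observed} -> {:?}", classify(observed, 0.25));
    }
}
```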
noahgift added a commit that referenced this pull request on Apr 27, 2026
…=GGUF (trace point mismatch) — v2.76 → v2.77
§31 hypothesized that APR's qkv_bias was the divergence introducer
(std=10.24). Live byte-compare via diag_compare_qkv_bias.rs proves this
WRONG:
| Tensor | APR mean | APR std   | GGUF mean | GGUF std  | max diff | RMS diff |
|--------|---------:|----------:|----------:|----------:|---------:|---------:|
| q_bias | 0.127345 |  3.258061 |  0.127345 |  3.258061 | 0.000000 | 0.000000 |
| k_bias | 1.549082 | 29.464621 |  1.549082 | 29.464621 | 0.000000 | 0.000000 |
| v_bias | 0.005926 |  0.186556 |  0.005926 |  0.186556 | 0.000000 | 0.000000 |
APR ≡ GGUF byte-for-byte. The std=10.24 IS Qwen2.5-7B's actual trained
qkv_bias values; both formats store/load them correctly.
So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)?
**TRACE-CAPTURE-POINT MISMATCH.**
GGUF (gguf/inference/forward/traced.rs:144):
- qkv_stats captured AFTER scratch_attention_block writes scratch.qkv
- BUT BEFORE the per-Q/K/V bias add at results.rs:216-226
- = PRE-BIAS measurement → std=1.14
APR (apr_transformer/pmat-260.rs:331-334):
- matmul writes `qkv` then `add_bias(qkv, bias)` in-place
- Trace captured AFTER bias add
- = POST-BIAS measurement → std=10.33
Both forward passes correctly apply qkv_bias. The 9× gap exists only in
the TRACE STATISTICS, not in the actual computation.
Where IS the actual SHIP-007 bug then? Per existing trace, layer 3
ffn_gate diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly
amplifies to 18× ffn_swigl ratio. That's STILL where the bug surface is —
exactly as §28 originally said. §30's investigation (which refuted §28)
only tested LAYER 0 QKV matmul. LAYER-3 FFN ffn_gate matmul is a
DIFFERENT code path.
PR E v3 scope: Run §31-style bisection AT LAYER 3 with the proper trace
capture points, comparing APR sub-FFN stats vs GGUF sub-FFN stats (both
captured at matched points per PR #1066/#1067 forward_traced sub-FFN
slots).
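The capture-point effect is easy to reproduce in isolation: the same matmul output and the same bias yield very different std values depending on whether stats are taken before or after the in-place bias add. The values and the `capture_pre_and_post_bias` helper below are made up for illustration and are not taken from either forward pass.

```rust
/// Population standard deviation.
fn std_dev(xs: &[f32]) -> f32 {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    (xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n).sqrt()
}

/// Return (pre-bias std, post-bias std) for one qkv buffer.
/// The buffer is mutated in place, mirroring an in-place bias add.
fn capture_pre_and_post_bias(qkv: &mut [f32], bias: &[f32]) -> (f32, f32) {
    assert_eq!(qkv.len(), bias.len());
    let pre = std_dev(qkv); // capture point A: straight after the matmul
    for (x, b) in qkv.iter_mut().zip(bias) {
        *x += b;
    }
    let post = std_dev(qkv); // capture point B: after the bias add
    (pre, post)
}

fn main() {
    // Small matmul output, large trained bias (toy values).
    let mut qkv = [0.5_f32, -0.5, 0.5, -0.5];
    let bias = [10.0_f32, -10.0, 10.0, -10.0];
    let (pre, post) = capture_pre_and_post_bias(&mut qkv, &bias);
    println!("pre-bias std = {pre}, post-bias std = {post}");
}
```

Two traces that disagree only on the capture point would report a large ratio here even though the final buffers are identical.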
Methodology lesson (§32.5): when stat-bisection finds a "smoking gun," ALWAYS verify with byte-level comparison against the reference. Stats can mislead when measurement points differ. Toyota Way: verify physical state (byte equality), not just symptoms (statistical gaps). Files: - crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable) - evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt - §32 spec section (6 subsections) - §31 marked SUPERSEDED in spec - Header v2.76.0 → v2.77.0 Coverage scoreboard unchanged (15+33). Will flip when LAYER-3-specific bisection localizes the actual divergence point. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
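The capture-point mismatch described in this commit can be reproduced in miniature. The block below is a minimal sketch, not the project's code: `std_dev`, `add_bias`, and `capture_both_points` are hypothetical helpers, and the inputs are made-up values chosen for illustration. It shows that the same statistic taken before and after an in-place bias add can differ by an order of magnitude even though the underlying computation is identical.

```rust
/// Population standard deviation of a slice (0.0 for an empty slice).
fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    var.sqrt()
}

/// Hypothetical stand-in for a bias-add step: adds `bias` to `qkv` element-wise, in place.
fn add_bias(qkv: &mut [f32], bias: &[f32]) {
    for (x, b) in qkv.iter_mut().zip(bias) {
        *x += *b;
    }
}

/// Runs one "forward step" and records the std at both candidate capture points.
/// Returns (pre_bias_std, post_bias_std).
fn capture_both_points(matmul_out: &[f32], bias: &[f32]) -> (f32, f32) {
    // Capture point 1: after the matmul, before the bias add.
    let pre_bias_std = std_dev(matmul_out);

    let mut qkv = matmul_out.to_vec();
    add_bias(&mut qkv, bias);

    // Capture point 2: after the bias add.
    let post_bias_std = std_dev(&qkv);
    (pre_bias_std, post_bias_std)
}
```

Two tracers that each report only one of these two numbers would appear to disagree, which is why comparing traces requires matched capture points or a direct byte-level comparison of the tensors involved.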
noahgift added a commit that referenced this pull request on Apr 27, 2026

…=10.24) - v2.75 → v2.76 (#1090)

* docs(ship-two-001): §31 - SHIP-007 root cause PINNED to qkv_bias (std=10.24) - spec v2.75.0 → v2.76.0

Live three-stage bisection on canonical 7B teacher pinpoints the divergence point exactly. Per §30.4's falsifiable next-investigation step, captured layer-0 qkv at four stages with prompt "What is 2+2?":

| Stage | mean | std | Match GGUF (1.14)? |
|-------|------|-----|---------------------|
| Embedding | 1e-5 | 0.0174 | OK (input) |
| Post-RMSNorm | -8e-5 | 0.221 | OK (input) |
| Post-matmul, pre-bias | -0.0159 | 0.925 | YES - Q4K tolerance |
| qkv_bias (the bias itself) | +0.272 | 10.243 | ⚠ ~10× too large |
| Post-bias | +0.256 | 10.329 | matches APR trace blowup |

The 9× std blowup happens ENTIRELY at the qkv_bias addition step (pmat-260.rs:332-334). Pre-bias matmul output matches GGUF; post-bias matches APR's existing trace. K-part bias is most extreme (post-bias std=29.49).

PR E v2 is now scoped to ONE specific investigation per §31.4:
- dump APR's `blk.0.attn_q.bias` / `attn_k.bias` / `attn_v.bias` bytes
- dump GGUF's same 3 tensors
- byte-compare:
  - if APR != GGUF, the GGUF→APR converter is broken
  - if APR == GGUF, the loader (`load_qkv_bias`) is misinterpreting

§31 falsification chain (now closed at the root):
§15.4 GPU eliminated → §16 APR CPU isolated → §17 (layer 3, FFN) → §23 (layer 3, ffn_swigl) → §27 ratio 18.23× → §28 "F32 vs Q4K matmul precision" (REFUTED in §30 by direct kernel comparison) → §31 qkv_bias std=10.24 introduces 9× layer-0 gap (PINNED)

The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages found it in one pass.

Drift-prevention test for next session (per §31.5): assert per-layer `|APR qkv_bias.std() - GGUF qkv_bias.std()| / max(eps, GGUF) < 0.10`.

Files:
- crates/aprender-serve/examples/diag_qkv_bisection_layer0.rs (rerunnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txt
- evidence/ship-007-qkv-bisection-2026-04-27/findings.md (full analysis)
- §31 spec section (8 subsections)
- Header: v2.75.0 → v2.76.0

Coverage scoreboard unchanged (15+33). Will flip to 20+28 when PR E v2 lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §32 - §31 REFUTED, qkv_bias is byte-identical APR=GGUF (trace point mismatch) - v2.76 → v2.77

§31 hypothesized that APR's qkv_bias was the divergence introducer (std=10.24). Live byte-compare via diag_compare_qkv_bias.rs proves this WRONG:

| Tensor | APR mean | APR std | GGUF mean | GGUF std | max diff | RMS diff |
|--------|---------:|--------:|----------:|---------:|---------:|---------:|
| q_bias | 0.127345 | 3.258061 | 0.127345 | 3.258061 | 0.000000 | 0.000000 |
| k_bias | 1.549082 | 29.464621 | 1.549082 | 29.464621 | 0.000000 | 0.000000 |
| v_bias | 0.005926 | 0.186556 | 0.005926 | 0.186556 | 0.000000 | 0.000000 |

APR ≡ GGUF byte-for-byte. The std=10.24 IS Qwen2.5-7B's actual trained qkv_bias values; both formats store/load them correctly.

So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)? **TRACE-CAPTURE-POINT MISMATCH.**

GGUF (gguf/inference/forward/traced.rs:144):
- qkv_stats captured AFTER scratch_attention_block writes scratch.qkv
- BUT BEFORE the per-Q/K/V bias add at results.rs:216-226
- = PRE-BIAS measurement → std=1.14

APR (apr_transformer/pmat-260.rs:331-334):
- matmul writes `qkv` then `add_bias(qkv, bias)` in-place
- Trace captured AFTER bias add
- = POST-BIAS measurement → std=10.33

Both forward passes correctly apply qkv_bias. The 9× gap exists only in the TRACE STATISTICS, not in the actual computation.

Where IS the actual SHIP-007 bug then? Per existing trace, layer 3 ffn_gate diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly amplifies to 18× ffn_swigl ratio. That's STILL where the bug surface is, exactly as §28 originally said. §30's investigation (which refuted §28) only tested LAYER 0 QKV matmul. LAYER-3 FFN ffn_gate matmul is a DIFFERENT code path.

PR E v3 scope: Run §31-style bisection AT LAYER 3 with the proper trace capture points, comparing APR sub-FFN stats vs GGUF sub-FFN stats (both captured at matched points per PR #1066/#1067 forward_traced sub-FFN slots).

Methodology lesson (§32.5): when stat-bisection finds a "smoking gun," ALWAYS verify with byte-level comparison against the reference. Stats can mislead when measurement points differ. Toyota Way: verify physical state (byte equality), not just symptoms (statistical gaps).

Files:
- crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt
- §32 spec section (6 subsections)
- §31 marked SUPERSEDED in spec
- Header v2.76.0 → v2.77.0

Coverage scoreboard unchanged (15+33). Will flip when LAYER-3-specific bisection localizes the actual divergence point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §32 follow-up - layer-3 ffn_gate/up/down Q4K bytes ARE byte-identical APR=GGUF

Per §32.4 next-step ("layer-3 sub-FFN bisection"), ran a byte-level comparison of layer-3 Q4K weight bytes between the canonical 7B teacher's .apr and .gguf files. Result:

- ffn_gate.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
- ffn_up.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
- ffn_down.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
- layer-0 ffn_gate Q4K (sanity) → ✓ APR ≡ GGUF byte-for-byte

So the layer-3 ffn_gate divergence (APR std=1.92 vs GGUF std=1.41 = 1.36× ratio per existing trace) does NOT come from differing weight bytes. This eliminates the GGUF→APR converter as the bug surface for layer 3.

Combined with §30 (layer-0 QKV matmul kernel agrees) and §32 (qkv_bias byte-identical), the elimination chain is now:
- QKV matmul kernel: ✓ correct (§30)
- QKV bias bytes: ✓ correct (§32)
- Layer-3 FFN weight bytes: ✓ correct (this commit)

The remaining hypothesis: cumulative layer-by-layer F32 precision drift through residual connections. APR layer 0 output std=0.40 vs GGUF=0.36 (10% drift). Layer 1 input = layer 0 output. After 3 layers of accumulating ~1% Q4K rounding diffs, layer-3 ffn_gate input is sufficiently different between the two formats to push silu into different saturation regions, producing the 18× ffn_swigl ratio.

Note for the apr trace --payload sub-FFN telemetry: GGUF's forward_traced in PR A (#1081, merged) currently leaves the 4 sub-FFN slots at default (zero). PR B (#1082, still DIRTY) clones scratch_swiglu_ffn_traced to populate them. The existing apr-trace.txt and gguf-trace.txt evidence files (2026-04-27) were generated when PR B was applied locally to the binary; those numbers are valid but require PR B to land on main for reproducibility.

Files added:
- crates/aprender-serve/examples/diag_compare_layer3_ffn.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_layer3_ffn.txt

Coverage scoreboard unchanged. Investigation continues; PR E v3 scope narrows further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
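The §31.5 drift-prevention assertion quoted in this commit can be sketched as a small helper. This is an illustration under stated assumptions: it reads `GGUF` in the denominator as the GGUF-side std, it uses the population standard deviation, and `rel_std_gap` and `within_drift_budget` are hypothetical names that do not come from the repository.

```rust
/// Population standard deviation of a slice (0.0 for an empty slice).
fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    var.sqrt()
}

/// Relative std gap: |std(apr) - std(gguf)| / max(eps, std(gguf)).
/// Assumption: the reference in the denominator is the GGUF-side std.
fn rel_std_gap(apr: &[f32], gguf: &[f32], eps: f32) -> f32 {
    let apr_std = std_dev(apr);
    let gguf_std = std_dev(gguf);
    (apr_std - gguf_std).abs() / gguf_std.max(eps)
}

/// True when the relative gap stays under the 0.10 budget from §31.5.
/// The eps value is an arbitrary illustrative choice.
fn within_drift_budget(apr: &[f32], gguf: &[f32]) -> bool {
    rel_std_gap(apr, gguf, 1e-6) < 0.10
}
```

As the §32 result shows, matching statistics are weaker evidence than byte equality, so a check like this would complement a byte-level comparison and not replace it.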
noahgift added a commit that referenced this pull request on Apr 28, 2026

… - emits per-layer LayerActivation telemetry (#1083)

P3 PR C: completes the SHIP-007 §26.4 P3 chain by wiring the new forward_traced method (PR A scaffold + PR B sub-FFN populate) into the apr-cli trace dispatch. Without this, `apr trace --payload <model.gguf>` only does generation+garbage-detection; it does NOT emit the per-layer telemetry needed for the §23 layer-3 ffn_swigl APR-vs-GGUF bisection.

Changes:

1. crates/apr-cli/src/commands/trace.rs::run_traced_inference_gguf
   Now calls model.forward_traced(&test_tokens) BEFORE generation, prints embed/per-layer/final-norm/logit/summary stats via the existing vector_stats helpers. Falls back gracefully on Err (e.g., encoder-decoder models from PR A's guard).
2. crates/apr-cli/src/commands/vector_stats.rs
   4 helpers flipped from private to pub(crate) so trace.rs GGUF dispatch can reuse them (they were already used by the APR dispatch in run_traced_inference_apr):
   - print_layer_activations
   - print_logit_predictions
   - print_trace_summary
   - print_activation_stats / print_activation_stats_colored

Output format matches the APR side exactly, so `apr trace --payload <file>.apr` and `apr trace --payload <file>.gguf` produce side-by-side comparable per-layer stat blocks. The §23 layer-3 ffn_swigl line emits as `ffn_swigl: ...` between ffn_silu and ffn_out (already handled by print_layer_activations:137-142 suppression-when-zero pattern from PR #1066).

After this PR + PR A + PR B all merge, the §26.4 binding criterion becomes runnable on noah-Lambda-Vector RTX 4090:

```
$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr | grep -A1 "Layer 3"
$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.gguf | grep -A1 "Layer 3"
```

Outcome:
- ratio ≥10× → SHIP-007 bug is APR-side at apr_transformer/inference.rs:160-164
- ratio <2× → 17× spike is normal Qwen2.5 trained behavior

Either discharges 5 MODEL-1 PARTIALs at once per §17.5 (SHIP-002/005/006/007/008).

Stacked on PR #1082 (PR B), which is stacked on PR #1081 (PR A). Will retarget to main once both merge.

Validated:
- `cargo check -p apr-cli --features inference` exits 0
- `cargo clippy -p apr-cli --features inference -- -D warnings` exits 0

Spec: SPEC-SHIP-TWO-001 §26.4 P3 final wiring step

References:
- PR #1081 (P3 PR A: GGUF forward_traced scaffold)
- PR #1082 (P3 PR B: sub-FFN populate)
- §23 (layer-3 ffn_swigl is the first 17× anomaly site, APR side)
- project_ship_007_gguf_forward_traced_plan.md (CLI wiring step)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
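The two outcome branches of the §26.4 criterion can be written as a small decision helper. This is a sketch under assumptions: the "ratio" is taken to be the APR std divided by the GGUF std for the layer-3 ffn_swigl line, and the commit message does not say how to read a ratio between 2× and 10×, so that band is labeled `Inconclusive` here. `Verdict` and `classify_ratio` are hypothetical names.

```rust
/// Possible readings of the layer-3 APR-vs-GGUF ffn_swigl ratio.
#[derive(Debug, PartialEq)]
enum Verdict {
    /// ratio >= 10x: the divergence is on the APR side.
    AprSideBug,
    /// ratio < 2x: the spike is normal trained behavior.
    NormalTrainedBehavior,
    /// 2x <= ratio < 10x: not covered by the stated criterion (assumption).
    Inconclusive,
}

/// Classifies the ratio apr_std / gguf_std against the two stated thresholds.
/// The denominator is clamped to avoid dividing by zero when the GGUF slot is empty.
fn classify_ratio(apr_std: f32, gguf_std: f32) -> Verdict {
    let ratio = apr_std / gguf_std.max(f32::EPSILON);
    if ratio >= 10.0 {
        Verdict::AprSideBug
    } else if ratio < 2.0 {
        Verdict::NormalTrainedBehavior
    } else {
        Verdict::Inconclusive
    }
}
```

Keeping the middle band explicit avoids silently forcing an ambiguous measurement into one of the two conclusions.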
## Summary

- Implements `contracts/trace-ffn-sub-block-v1.yaml` v1.0.0 (authored in contract(trace-ffn-sub-block-v1): pre-commit schema for sub-FFN telemetry (SHIP-007 load-bearing) #1065).
- Adds 4 new `ActivationStats` fields to `LayerActivation` between `ffn_norm_stats` and `ffn_out_stats`: `ffn_gate_stats`, `ffn_up_stats`, `ffn_silu_gate_stats`, `ffn_swiglu_inner_stats`.

## What this implements (per contract obligations)

- `ffn_gate_stats` populated on SwiGLU path (post-gate-proj-matmul)
- `ffn_up_stats` populated on SwiGLU & GELU paths (post-up-proj-matmul; pre-GELU on GELU path)
- `ffn_silu_gate_stats` populated on SwiGLU path (post-SiLU on gate)
- `ffn_swiglu_inner_stats` populated on SwiGLU path (post-elementwise multiply)
- `ffn_out_stats` semantics preserved (FALSIFY-SUB-FFN-002, backward compat)

## Test plan

- `cargo test -p aprender-serve --lib --release apr_transformer` → 720/720 passed (was 718; added 2 new tests)
- `cargo build -p apr-cli` succeeds
- `test_sub_ffn_telemetry_swiglu_path_populates_all_4_fields`: asserts SwiGLU path populates all 4 fields with correct intermediate_dim count (covers FALSIFY-SUB-FFN-001/006)
- `test_sub_ffn_telemetry_gelu_path_only_up_populated`: asserts GELU path leaves 3 fields default-zero and populates only ffn_up_stats

## What this enables

§17.4's falsifier next step: run `apr trace --payload` on the canonical 7B teacher; whichever of {ffn_gate, ffn_up, ffn_silu_gate, ffn_swiglu_inner, ffn_out} first shows the layer-3 53× spike is the SHIP-007 bug site. Per §17.5: whatever fix lands also discharges all 5 transitively-blocked MODEL-1 PARTIALs (SHIP-002/005/006/007/008).

## Stacks under

## Outstanding follow-ups (per contract)

- `apr trace --payload` on the canonical 7B teacher to pin the layer-3 53× spike to a specific sub-FFN slot; staged for follow-up after this PR + contract YAML both land

🤖 Generated with Claude Code
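For readers unfamiliar with where the four slots sit inside a SwiGLU block, the capture points listed in the summary can be sketched as follows. This is a minimal illustration, not the code in `inference.rs`: `Stats` is a hypothetical stand-in for `ActivationStats` (whose real fields are not shown on this page), `SubFfnTrace`, `trace_swiglu`, and `trace_non_gated` are hypothetical names, and the gate/up projection matmuls are assumed to have already produced the `gate` and `up` slices.

```rust
/// Hypothetical summary statistics; a stand-in for the real `ActivationStats`.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Stats {
    count: usize,
    mean: f32,
    std: f32,
}

/// Computes count, mean, and population std; default (all zero) for empty input.
fn stats(xs: &[f32]) -> Stats {
    if xs.is_empty() {
        return Stats::default();
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    Stats { count: xs.len(), mean, std: var.sqrt() }
}

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// The four sub-FFN slots, named after the fields listed in the summary.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct SubFfnTrace {
    ffn_gate_stats: Stats,
    ffn_up_stats: Stats,
    ffn_silu_gate_stats: Stats,
    ffn_swiglu_inner_stats: Stats,
}

/// SwiGLU path: record stats after each intermediate, in computation order.
/// Returns the silu(gate) * up vector (the input to the down projection) and the trace.
fn trace_swiglu(gate: &[f32], up: &[f32]) -> (Vec<f32>, SubFfnTrace) {
    let silu_gate: Vec<f32> = gate.iter().map(|&g| silu(g)).collect();
    let inner: Vec<f32> = silu_gate.iter().zip(up).map(|(s, u)| s * u).collect();
    let trace = SubFfnTrace {
        ffn_gate_stats: stats(gate),
        ffn_up_stats: stats(up),
        ffn_silu_gate_stats: stats(&silu_gate),
        ffn_swiglu_inner_stats: stats(&inner),
    };
    (inner, trace)
}

/// Non-gated (GELU) path: only the up-projection slot is populated, pre-activation;
/// the other three slots stay at their default-zero value.
fn trace_non_gated(up: &[f32]) -> SubFfnTrace {
    SubFfnTrace { ffn_up_stats: stats(up), ..SubFfnTrace::default() }
}
```

Recording each intermediate separately is what lets a later bisection say which step first shows a spike, instead of only seeing it at `ffn_out`.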