feat(sub-ffn-telemetry): 4 new ActivationStats fields on LayerActivation — implements trace-ffn-sub-block-v1.yaml by noahgift · Pull Request #1066 · paiml/aprender

noahgift · 2026-04-26T07:54:21Z

Summary

Implements contracts/trace-ffn-sub-block-v1.yaml v1.0.0, authored in #1065 (contract(trace-ffn-sub-block-v1): pre-commit schema for sub-FFN telemetry, SHIP-007 load-bearing).
Load-bearing for the SHIP-007 fix per ship-two-models-spec.md §15.5 + §17.4. Closes the per-layer telemetry gap that was §17.4's falsifier prerequisite.
Adds 4 new ActivationStats fields to LayerActivation between ffn_norm_stats and ffn_out_stats: ffn_gate_stats, ffn_up_stats, ffn_silu_gate_stats, ffn_swiglu_inner_stats.

What this implements (per contract obligations)

OBL-SUB-FFN-001 ✅ ffn_gate_stats populated on SwiGLU path (post-gate-proj-matmul)
OBL-SUB-FFN-002 ✅ ffn_up_stats populated on SwiGLU & GELU paths (post-up-proj-matmul; pre-GELU on GELU path)
OBL-SUB-FFN-003 ✅ ffn_silu_gate_stats populated on SwiGLU path (post-SiLU on gate)
OBL-SUB-FFN-004 ✅ ffn_swiglu_inner_stats populated on SwiGLU path (post-elementwise multiply)
OBL-SUB-FFN-005 ✅ ffn_out_stats semantics preserved (FALSIFY-SUB-FFN-002 — backward compat)
OBL-SUB-FFN-006 ✅ Renderer emits 4 new lines per SwiGLU-path layer in computation order
OBL-SUB-FFN-007 ✅ Doc-comments on all 4 new fields cite the contract
OBL-SUB-FFN-008 Staged — GPU TracedForward zero-fills the 4 new fields with code comment pointing to follow-up
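
The four capture points named in the obligations above can be sketched as follows. This is a minimal illustration, not the code in `inference.rs`: the `ActivationStats` shape (count/mean/std), the `SubFfnStats` wrapper, and the `trace_swiglu` / `trace_gelu_up` helpers are assumptions made for the example; only the four field names and their capture order come from this PR.

```rust
/// Hypothetical summary of one captured tensor; the real `ActivationStats`
/// may carry more fields.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct ActivationStats {
    pub count: usize,
    pub mean: f32,
    pub std: f32,
}

impl ActivationStats {
    /// Population mean/std over a slice; an empty slice yields the default (all zero).
    pub fn from_slice(xs: &[f32]) -> Self {
        if xs.is_empty() {
            return Self::default();
        }
        let n = xs.len() as f32;
        let mean = xs.iter().sum::<f32>() / n;
        let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
        Self { count: xs.len(), mean, std: var.sqrt() }
    }
}

/// Hypothetical holder for the 4 new slots (in the PR they live on `LayerActivation`).
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct SubFfnStats {
    pub ffn_gate_stats: ActivationStats,
    pub ffn_up_stats: ActivationStats,
    pub ffn_silu_gate_stats: ActivationStats,
    pub ffn_swiglu_inner_stats: ActivationStats,
}

fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU path: `gate` and `up` are the outputs of the two projection matmuls.
/// Returns silu(gate) * up together with stats for all four slots.
pub fn trace_swiglu(gate: &[f32], up: &[f32]) -> (Vec<f32>, SubFfnStats) {
    assert_eq!(gate.len(), up.len(), "gate and up must share intermediate_dim");
    let silu_gate: Vec<f32> = gate.iter().map(|&g| silu(g)).collect();
    let inner: Vec<f32> = silu_gate.iter().zip(up).map(|(s, u)| s * u).collect();
    let stats = SubFfnStats {
        ffn_gate_stats: ActivationStats::from_slice(gate),
        ffn_up_stats: ActivationStats::from_slice(up),
        ffn_silu_gate_stats: ActivationStats::from_slice(&silu_gate),
        ffn_swiglu_inner_stats: ActivationStats::from_slice(&inner),
    };
    (inner, stats)
}

/// GELU / non-gated path: only `ffn_up_stats` is populated (pre-GELU values);
/// the other three slots stay default-zero.
pub fn trace_gelu_up(up: &[f32]) -> SubFfnStats {
    SubFfnStats { ffn_up_stats: ActivationStats::from_slice(up), ..Default::default() }
}
```

Under these assumptions, each populated slot reports a `count` equal to the intermediate dimension, which is the property the new SwiGLU test checks.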

Test plan

cargo test -p aprender-serve --lib --release apr_transformer → 720/720 passed (was 718; added 2 new tests)
cargo build -p apr-cli succeeds
PMAT pre-commit gates pass (complexity, SATD, docs)
New test test_sub_ffn_telemetry_swiglu_path_populates_all_4_fields — asserts SwiGLU path populates all 4 fields with correct intermediate_dim count (covers FALSIFY-SUB-FFN-001/006)
New test test_sub_ffn_telemetry_gelu_path_only_up_populated — asserts GELU path leaves 3 fields default-zero and populates only ffn_up_stats

What this enables

§17.4's falsifier next step: run apr trace --payload on the canonical 7B teacher; whichever of {ffn_gate, ffn_up, ffn_silu_gate, ffn_swiglu_inner, ffn_out} first shows the layer-3 53× spike is the SHIP-007 bug site. Per §17.5: whatever fix lands also discharges all 5 transitively-blocked MODEL-1 PARTIALs (SHIP-002/005/006/007/008).

Stacks under

#1065, contract(trace-ffn-sub-block-v1): pre-commit schema for sub-FFN telemetry (SHIP-007 load-bearing), the sub-FFN contract envelope. Both can land independently: this PR's tests depend only on the LayerActivation struct changes, not on the contract YAML.

Outstanding follow-ups (per contract)

OBL-SUB-FFN-008: GPU sub-FFN telemetry capture (FALSIFY-SUB-FFN-004) — staged for follow-up PR
Live evidence: run apr trace --payload on the canonical 7B teacher to pin the layer-3 53× spike to a specific sub-FFN slot — staged for follow-up after this PR + contract YAML both land

🤖 Generated with Claude Code

…ml — 4 new ActivationStats fields on LayerActivation

Implements `contracts/trace-ffn-sub-block-v1.yaml` v1.0.0 (PR #1065). Load-bearing for the SHIP-007 fix per ship-two-models-spec.md §15.5 + §17.4. Closes the per-layer telemetry gap that was the §17.4 falsifier prerequisite.

## What changed

`LayerActivation` (apr_transformer/mod.rs) gains 4 new `ActivationStats` fields between `ffn_norm_stats` and `ffn_out_stats`:

- ffn_gate_stats (post-gate-proj-matmul; OBL-SUB-FFN-001)
- ffn_up_stats (post-up-proj-matmul; OBL-SUB-FFN-002)
- ffn_silu_gate_stats (post-SiLU on gate; OBL-SUB-FFN-003)
- ffn_swiglu_inner_stats (post-elementwise silu(gate)*up; OBL-SUB-FFN-004)

`forward_traced` (CPU path, inference.rs) populates all 4 on the SwiGLU path. On the GELU/non-gated path, only `ffn_up_stats` is populated; the other 3 stay default-zero because there is no SwiGLU sub-structure to capture.

GPU TracedForward (forward_from_model.rs, gpu_forward_pass.rs) zero-fills the 4 new fields per OBL-SUB-FFN-008 (FALSIFY-SUB-FFN-004 follow-up; GPU-side capture is staged separately).

CLI renderer (apr-cli/src/commands/vector_stats.rs) emits 4 new lines per layer in computation order (ffn_gate/up/silu/swigl) between ffn_norm and ffn_out, suppressed when default-zero (GPU path until follow-up).

## Backward compat (FALSIFY-SUB-FFN-002)

`ffn_out_stats` semantics are byte-identical pre/post: same matmul output, same residual contribution. All 718 pre-existing apr_transformer tests pass without modification (720 with the 2 new tests).

## New tests

`tests/forward_traced.rs::test_sub_ffn_telemetry_swiglu_path_populates_all_4_fields` asserts the SwiGLU path populates all 4 fields with the correct intermediate_dim count (FALSIFY-SUB-FFN-001 + FALSIFY-SUB-FFN-006).

`tests/forward_traced.rs::test_sub_ffn_telemetry_gelu_path_only_up_populated` asserts the GELU path leaves 3 fields default-zero and only populates ffn_up_stats with pre-GELU values.

## What this enables

§17.4's falsifier next step: run `apr trace --payload` on the canonical 7B teacher and identify whichever of {ffn_gate, ffn_up, ffn_silu_gate, ffn_swiglu_inner, ffn_out} carries the layer-3 53× spike. Whichever sub-tensor first shows the discontinuity is the SHIP-007 bug site.

## Outstanding follow-ups (per contract)

- OBL-SUB-FFN-008: GPU sub-FFN telemetry capture, staged for follow-up
- Live evidence on the canonical 7B teacher pinning the layer-3 spike to a sub-FFN slot, staged for follow-up

Closes contract OBL-SUB-FFN-001..007 algorithm-level. Live evidence discharge pending follow-up `apr trace` run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
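
The renderer rule described in this commit (four lines in computation order, each suppressed when its slot is default-zero) could look roughly like the sketch below. The `render_sub_ffn` function, the `ActivationStats` shape, and the output format are assumptions for illustration; only the labels `ffn_gate`, `ffn_up`, `ffn_silu`, `ffn_swigl` and the suppression rule come from the commit message.

```rust
/// Hypothetical stats shape; the real `ActivationStats` may differ.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct ActivationStats {
    pub count: usize,
    pub mean: f32,
    pub std: f32,
}

/// Render sub-FFN lines in the order given, skipping any slot that is still
/// default-zero (for example the GPU path, or the 3 unused slots on GELU).
pub fn render_sub_ffn(slots: &[(&str, ActivationStats)]) -> Vec<String> {
    let mut lines = Vec::new();
    for (label, stats) in slots {
        if *stats == ActivationStats::default() {
            continue; // suppressed: nothing was captured for this slot
        }
        lines.push(format!(
            "{}: mean={:.4} std={:.4} n={}",
            label, stats.mean, stats.std, stats.count
        ));
    }
    lines
}
```

Passing the slots as an ordered slice keeps the computation order explicit at the call site, so the renderer itself stays free of per-architecture branching.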

…t — spec v2.62.0 → v2.63.0

§18 walks the deduction chain that connects the spec's two-model goal to the current state, so future sessions can re-enter the work without re-reading every prior section.

## Section structure

- §18.1 Why are we training models at all?
- §18.2 What does "DISCHARGED" mean here, and where are we?
- §18.3 MODEL-1: five fully discharged, five blocked on one bug
- §18.4 MODEL-2: three discharged, nine blocked on convergence
- §18.5 What's blocking the convergence run?
- §18.6 The SHIP-007 narrowing: chain of deductions (ASCII diagram)
- §18.7 What "knowing" looks like at each step
- §18.8 What's the next observable state-change? (two parallel paths)
- §18.9 Methodological invariant (5-step loop)

## Key durable facts captured

- Coverage tally: 33 PARTIAL + 12 DISCHARGED across 45 levers
- MODEL-1: 5/10 ACs DISCHARGED (SHIP-001/003/004/009/010); 5/10 PARTIAL, all transitively gated on SHIP-007
- MODEL-2: 3/12 ACs DISCHARGED (SHIP-011/021/022); 9/12 PARTIAL, gated on a converged 370M run; the convergence run itself is blocked at task #132 (`apr pretrain --device cuda` not yet wired through `TransformerTrainer::new`)
- GPUTRAIN suite: 7/7 DISCHARGED (full closure)
- SHIP-007 narrowing diagram: §15 hypothesis → §15.4 GPU GQA kernel ELIMINATED → §16 GPU stack ELIMINATED (CPU APR vs CPU GGUF divergence) → §17 layer 3 FFN sub-block named (53× spike) → sub-FFN bisection (PR #1066)
- Two parallel paths to next observable state-change: (a) short: SHIP-007 bug site named via sub-FFN trace; (b) long: task #132 + Stack v2 tokenization + convergence

## What this is NOT

This is investigation-recording, not a discharge. Coverage tally unchanged. The chain-of-thought view is intentionally narrative-style to make the deduction structure obvious to future sessions and to external readers asking "where are we?"

Spec v2.62.0 → v2.63.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…t 17× anomaly site — spec v2.65.0 → v2.66.0

§17.4 specified the falsifier next step as sub-layer bisection of {ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added the 4 new ActivationStats fields. §21 records the **first run of the bisection on the canonical 7B teacher**.

## What §21 contains (8 subsections)

- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2× layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters: silu(g) and u are individually normal at layer 3, but their elementwise product is 17×, which implies an unusual positive correlation or an alignment bug
- §21.5 Refined surviving suspect surface: element-wise multiply correctness (`inference.rs:163`) + off-by-one slice indexing as newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)

## Per-layer ffn_swigl progression (key data)

| Layer | ffn_swigl std | Note |
|------:|--------------:|------|
| 0 | 0.088 | |
| 1 | 0.061 | |
| 2 | 0.071 | |
| **3** | **1.222** | 17.2× layer 2 |
| 4 | 0.390 | |
| 5-25 | ~0.15-0.55 | |
| 26 | 1.452 | |
| 27 | 2.247 | |

Layer 3 stands out among the early and middle layers: layers 0-2 and 4-25 all sit in the 0.06-0.55 band, and only the final layers 26-27 exceed it. The 1.22 value is anomalous.

## Bug surface narrowing (across §15→§16→§17→§21)

- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)

The fix surface is now: `inference.rs:160-164`, specifically the `ffn_hidden.push(silu_g * u)` element-wise multiply.

Spec v2.65.0 → v2.66.0. No coverage tally change: investigation-recording, not a discharge.

Evidence persisted to:

- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv

Stacks under #1070 (§20) which is under #1068 (§19) which is under #1067 (§18) which is under #1064 (§17) which is under #1063 (§16).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
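
The "17.2× layer 2" figure is a ratio of consecutive per-layer standard deviations. A small sketch of that check, using only the layer 0-4 values quoted in the table (layers 5-25 are given as a range there, so they are left out); the `largest_jump` helper is hypothetical and not part of the repository.

```rust
/// Return the layer index and ratio of the largest std increase relative to
/// the previous layer. Hypothetical helper for illustration only.
pub fn largest_jump(stds: &[f32]) -> Option<(usize, f32)> {
    let mut best: Option<(usize, f32)> = None;
    for i in 1..stds.len() {
        if stds[i - 1] <= 0.0 {
            continue; // skip unpopulated (default-zero) slots
        }
        let ratio = stds[i] / stds[i - 1];
        if best.map_or(true, |(_, r)| ratio > r) {
            best = Some((i, ratio));
        }
    }
    best
}

/// ffn_swigl std for layers 0..=4 as reported in the commit message.
pub const FFN_SWIGL_STD_L0_L4: [f32; 5] = [0.088, 0.061, 0.071, 1.222, 0.390];
```

Calling `largest_jump(&FFN_SWIGL_STD_L0_L4)` selects layer 3, since 1.222 / 0.071 is roughly 17.2.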

…1068)

* docs(ship-two-001): §18 — training status snapshot as chain-of-thought — spec v2.62.0 → v2.63.0

§18 walks the deduction chain that connects the spec's two-model goal to the current state, so future sessions can re-enter the work without re-reading every prior section.

## Section structure

- §18.1 Why are we training models at all?
- §18.2 What does "DISCHARGED" mean here, and where are we?
- §18.3 MODEL-1: five fully discharged, five blocked on one bug
- §18.4 MODEL-2: three discharged, nine blocked on convergence
- §18.5 What's blocking the convergence run?
- §18.6 The SHIP-007 narrowing: chain of deductions (ASCII diagram)
- §18.7 What "knowing" looks like at each step
- §18.8 What's the next observable state-change? (two parallel paths)
- §18.9 Methodological invariant (5-step loop)

## Key durable facts captured

- Coverage tally: 33 PARTIAL + 12 DISCHARGED across 45 levers
- MODEL-1: 5/10 ACs DISCHARGED (SHIP-001/003/004/009/010); 5/10 PARTIAL, all transitively gated on SHIP-007
- MODEL-2: 3/12 ACs DISCHARGED (SHIP-011/021/022); 9/12 PARTIAL, gated on a converged 370M run; the convergence run itself is blocked at task #132 (`apr pretrain --device cuda` not yet wired through `TransformerTrainer::new`)
- GPUTRAIN suite: 7/7 DISCHARGED (full closure)
- SHIP-007 narrowing diagram: §15 hypothesis → §15.4 GPU GQA kernel ELIMINATED → §16 GPU stack ELIMINATED (CPU APR vs CPU GGUF divergence) → §17 layer 3 FFN sub-block named (53× spike) → sub-FFN bisection (PR #1066)
- Two parallel paths to next observable state-change: (a) short: SHIP-007 bug site named via sub-FFN trace; (b) long: task #132 + Stack v2 tokenization + convergence

## What this is NOT

This is investigation-recording, not a discharge. Coverage tally unchanged. The chain-of-thought view is intentionally narrative-style to make the deduction structure obvious to future sessions and to external readers asking "where are we?"

Spec v2.62.0 → v2.63.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §19 — task #132 correction (CUDA training has shipped) — spec v2.63.0 → v2.64.0

§18.5 (just landed in PR #1067) stated:

> Training compute is the real risk — `apr pretrain --device cuda`
> is NOT functional today (task #132).

A sub-agent investigation on 2026-04-26 confirmed this premise was outdated by ~5 days. Task #132 closed at commit f7ad114 (2026-04-21).

## What's actually on disk

The CLI dispatch path is wired:

    apr pretrain --device {cpu|cuda|auto}
      → resolve_device() (entrenar::train::device, train/device.rs:110)
      → drive_real(...) (apr-cli/src/commands/pretrain.rs:252-301)
          ├── Device::Cuda → drive_real_cuda(...) (pretrain.rs:336-364)
          │     → CudaTransformerTrainer::new(cfg)
          │       (transformer_trainer/cuda_trainer.rs:2156-2244)
          └── Device::Cpu → drive_real_cpu(...) (pretrain.rs:307-325)
                → TransformerTrainer::new(cfg)

GPU kernels invoked from cuda branch (all in aprender-train/src/autograd/):

- forward: gemm_forward, rms_norm_forward, pre_warm_forward_kernels
- backward: gemm_backward_a/b, rms_norm_backward
- optimizer/loss: adamw_step_cuda, fused_cross_entropy_cuda
- AMP: GradScaler

D2H per step bounded to ~512 B (loss_partials). AdamW state lives on GPU.

## Live smoke test on noah-Lambda-Vector RTX 4090

    $ /mnt/nvme-raid0/targets/aprender/release/apr pretrain \
        --dataset ... --tokenizer ... --run-dir ... \
        --device cuda --synthetic --num-steps 4 --json
    error: --device `cuda` requested but CUDA runtime is not available
    on this host (contract gpu-training-backend-v1 GATE-GPUTRAIN-002:
    no silent CPU fallback). Rebuild with `--features cuda` or pass
    `--device cpu`.

The graceful contract-cited error proves two things: the CLI parses --device cuda correctly, and the dispatch path emits GATE-GPUTRAIN-002 when the binary lacks the `cuda` feature. Per `feedback_cuda_feature_footgun.md`, this is a rebuild-time issue, not a code-architecture gap.

## Three real residuals (post-§19)

- (A) INV-TRAIN-003 GPU AdamW-state sha256: small PR
- (B) GATE-GPUTRAIN-004/005 live evidence: small PR + operator dispatch
- (C) Operator authorization for 10K-step run: decision, not engineering

## Methodological lesson (§19.8)

§15→§17 narrowing was good chain-of-thought (each deduction a falsifiable result on live evidence). §18.5 was bad chain-of-thought: the premise was inherited from a stale memory entry without re-verification.

Going forward (per `feedback_no_guessing.md`): when a §18-style status snapshot cites a memory entry as evidence for a gap, the memory entry's claims must be re-verified against the code at write-time.

Spec v2.63.0 → v2.64.0. No coverage tally change. Memory entry `project_task_132_cuda_training_backend_gap.md` description updated separately to reflect the closed status.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
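
The "no silent CPU fallback" behaviour cited as GATE-GPUTRAIN-002 can be illustrated with a small resolver. This is a sketch under assumptions: the signature below (a string plus an availability flag) is invented for the example and is not the real `resolve_device()` in `train/device.rs`, and the `auto` branch picking CUDA when available is a guess at the intended semantics.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Device {
    Cpu,
    Cuda,
}

/// Hypothetical resolver: an explicit `cuda` request fails loudly when the
/// runtime is missing instead of quietly running on the CPU.
pub fn resolve_device(requested: &str, cuda_available: bool) -> Result<Device, String> {
    match requested {
        "cpu" => Ok(Device::Cpu),
        "cuda" if cuda_available => Ok(Device::Cuda),
        "cuda" => Err(String::from(
            "--device `cuda` requested but CUDA runtime is not available \
             (GATE-GPUTRAIN-002: no silent CPU fallback)",
        )),
        // Assumption: `auto` prefers CUDA and may fall back, because the
        // caller did not ask for a specific backend.
        "auto" => Ok(if cuda_available { Device::Cuda } else { Device::Cpu }),
        other => Err(format!("unknown device `{other}`")),
    }
}
```

Returning `Err` for the explicit request is what makes the smoke-test output above a useful signal: the failure names the gate rather than producing a CPU run that looks like a GPU run.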

…t 17× anomaly site — spec v2.66.0 → v2.67.0 (#1075)

§17.4 specified sub-layer bisection of FFN as the falsifier next step. PR #1066 added the 4 sub-FFN ActivationStats fields. §23 records the first run on the canonical 7B teacher post-#1066-merge.

(Originally authored as §21 in the closed PR #1072. Re-numbered as §23 because §22 (PR #1074) landed first with v2.66.0 banner; this PR brings v2.67.0.)

## Key finding

Live `apr trace --payload` on `paiml/qwen2.5-coder-7b-apache-q4k-v1` teacher (CPU, prompt "What is 2+2?") layer-3 sub-FFN std:

| Sub-FFN slot | L1-2 baseline | L3 | Ratio |
|--------------|--------------:|----:|------:|
| ffn_norm | 0.85 / 0.86 | 1.00 | 1.16× normal |
| ffn_gate | 1.50 / 1.99 | 1.92 | 0.97× normal |
| ffn_up | 1.10 / 0.94 | 1.34 | 1.42× small |
| ffn_silu | 0.043 / 0.052 | 0.168 | 3.2× precursor |
| **ffn_swigl** | **0.061 / 0.071** | **1.222** | **17.2× anomaly** |
| ffn_out | 0.345 / 0.216 | 11.459 | 53× cascade |

Gate/up individually normal at layer 3. Element-wise multiply at inference.rs:163 `ffn_hidden.push(silu_g * u)` is the named bug site (possibly off-by-one slice indexing).

## Bug surface narrowing chain

- §15.4: GPU GQA kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out)
- **§23: layer 3 ffn_swigl named (17× first anomaly site)**

## Falsifiable next investigation step (§23.6)

Extend `OwnedQuantizedModel::forward_traced` (the GGUF path; needs to be authored per `project_ship_007_gguf_forward_traced_plan.md`) with the same 4 sub-FFN fields. Compare APR vs GGUF layer-3 ffn_swigl directly:

- ≈0.07 → APR-side bug pinned to inference.rs:160-164
- ≈1.22 → spike is normal model behavior; bug elsewhere

## Evidence persisted

- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv (28-layer × 6-field summary)

Spec v2.66.0 → v2.67.0. No coverage tally change.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
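
The bisection logic behind the key-finding table (walk the slots in computation order and stop at the first one whose layer-3 std jumps by more than a threshold relative to layer 2) can be sketched as below. The `first_anomalous_slot` helper and the choice of threshold are assumptions for illustration; the numbers are the layer-2 and layer-3 values quoted in the table.

```rust
/// (slot label, layer-2 std, layer-3 std), in computation order, as reported
/// in the commit message.
pub const LAYER3_SLOTS: [(&str, f32, f32); 6] = [
    ("ffn_norm", 0.86, 1.00),
    ("ffn_gate", 1.99, 1.92),
    ("ffn_up", 0.94, 1.34),
    ("ffn_silu", 0.052, 0.168),
    ("ffn_swigl", 0.071, 1.222),
    ("ffn_out", 0.216, 11.459),
];

/// Return the first slot (in the order given) whose ratio of layer-3 std to
/// baseline std meets the threshold. Hypothetical helper, not repository code.
pub fn first_anomalous_slot<'a>(
    slots: &[(&'a str, f32, f32)],
    threshold: f32,
) -> Option<(&'a str, f32)> {
    slots
        .iter()
        .filter(|&&(_, base, _)| base > 0.0) // ignore unpopulated baselines
        .map(|&(name, base, l3)| (name, l3 / base))
        .find(|&(_, ratio)| ratio >= threshold)
}
```

With a 10× threshold the first hit is `ffn_swigl`; lowering the threshold to 3× surfaces `ffn_silu` first, which matches the "3.2× precursor" label in the table.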

…=GGUF (trace point mismatch) — v2.76 → v2.77

§31 hypothesized that APR's qkv_bias was the divergence introducer (std=10.24). Live byte-compare via diag_compare_qkv_bias.rs proves this WRONG:

| Tensor | APR mean | APR std | GGUF mean | GGUF std | max diff | RMS diff |
|--------|---------:|--------:|----------:|---------:|---------:|---------:|
| q_bias | 0.127345 | 3.258061 | 0.127345 | 3.258061 | 0.000000 | 0.000000 |
| k_bias | 1.549082 | 29.464621 | 1.549082 | 29.464621 | 0.000000 | 0.000000 |
| v_bias | 0.005926 | 0.186556 | 0.005926 | 0.186556 | 0.000000 | 0.000000 |

APR ≡ GGUF byte-for-byte. The std=10.24 IS Qwen2.5-7B's actual trained qkv_bias values; both formats store/load them correctly.

So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)? **TRACE-CAPTURE-POINT MISMATCH.**

GGUF (gguf/inference/forward/traced.rs:144):

- qkv_stats captured AFTER scratch_attention_block writes scratch.qkv
- BUT BEFORE the per-Q/K/V bias add at results.rs:216-226
- = PRE-BIAS measurement → std=1.14

APR (apr_transformer/pmat-260.rs:331-334):

- matmul writes `qkv` then `add_bias(qkv, bias)` in-place
- Trace captured AFTER bias add
- = POST-BIAS measurement → std=10.33

Both forward passes correctly apply qkv_bias. The 9× gap exists only in the TRACE STATISTICS, not in the actual computation.

Where IS the actual SHIP-007 bug then? Per existing trace, layer 3 ffn_gate diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly amplifies to 18× ffn_swigl ratio. That's STILL where the bug surface is, exactly as §28 originally said. §30's investigation (which refuted §28) only tested LAYER 0 QKV matmul. LAYER-3 FFN ffn_gate matmul is a DIFFERENT code path.

PR E v3 scope: Run §31-style bisection AT LAYER 3 with the proper trace capture points, comparing APR sub-FFN stats vs GGUF sub-FFN stats (both captured at matched points per PR #1066/#1067 forward_traced sub-FFN slots).

Methodology lesson (§32.5): when stat-bisection finds a "smoking gun," ALWAYS verify with byte-level comparison against the reference. Stats can mislead when measurement points differ. Toyota Way: verify physical state (byte equality), not just symptoms (statistical gaps).

Files:

- crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt
- §32 spec section (6 subsections)
- §31 marked SUPERSEDED in spec
- Header v2.76.0 → v2.77.0

Coverage scoreboard unchanged (15+33). Will flip when LAYER-3-specific bisection localizes the actual divergence point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
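
The two ideas in this commit (the same tensor gives very different statistics depending on whether it is measured before or after the bias add, and byte equality is the stronger check) can be sketched as below. All function names here are hypothetical; `diag_compare_qkv_bias.rs` itself is not reproduced, and the toy numbers in the usage note are made up for illustration.

```rust
/// Population standard deviation of a slice (0.0 for an empty slice).
pub fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    (xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n).sqrt()
}

/// Bit-exact comparison of two f32 tensors: stricter than comparing summary
/// statistics, and unaffected by where a trace happens to be captured.
pub fn bytes_identical(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

/// Measure the same qkv tensor at two capture points: after the matmul
/// (pre-bias) and after the bias add (post-bias).
pub fn qkv_std_pre_and_post_bias(matmul_out: &[f32], bias: &[f32]) -> (f32, f32) {
    assert_eq!(matmul_out.len(), bias.len());
    let post: Vec<f32> = matmul_out.iter().zip(bias).map(|(x, b)| x + b).collect();
    (std_dev(matmul_out), std_dev(&post))
}
```

With a toy matmul output of `[0.5, -0.5, 1.0, -1.0]` and a large-magnitude bias of `[10.0, -10.0, 20.0, -20.0]`, the post-bias std is many times the pre-bias std even though both traces describe one correct computation, which is the shape of the mismatch described above.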

…=10.24) — v2.75 → v2.76 (#1090)

* docs(ship-two-001): §31 — SHIP-007 root cause PINNED to qkv_bias (std=10.24) — spec v2.75.0 → v2.76.0

Live three-stage bisection on canonical 7B teacher pinpoints the divergence point exactly. Per §30.4's falsifiable next-investigation step, captured layer-0 qkv at four stages with prompt "What is 2+2?":

| Stage | mean | std | Match GGUF (1.14)? |
|-------|------|-----|---------------------|
| Embedding | 1e-5 | 0.0174 | OK (input) |
| Post-RMSNorm | -8e-5 | 0.221 | OK (input) |
| Post-matmul, pre-bias | -0.0159 | 0.925 | YES (Q4K tolerance) |
| qkv_bias (the bias itself) | +0.272 | 10.243 | ⚠ ~10× too large |
| Post-bias | +0.256 | 10.329 | matches APR trace blowup |

The 9× std blowup happens ENTIRELY at the qkv_bias addition step (pmat-260.rs:332-334). Pre-bias matmul output matches GGUF; post-bias matches APR's existing trace. K-part bias is most extreme (post-bias std=29.49).

PR E v2 is now scoped to ONE specific investigation per §31.4:

- dump APR's `blk.0.attn_q.bias` / `attn_k.bias` / `attn_v.bias` bytes
- dump GGUF's same 3 tensors
- byte-compare:
  - if APR != GGUF, the GGUF→APR converter is broken
  - if APR == GGUF, the loader (`load_qkv_bias`) is misinterpreting

§31 falsification chain (now closed at the root):

    §15.4 GPU eliminated
      → §16 APR CPU isolated
      → §17 (layer 3, FFN)
      → §23 (layer 3, ffn_swigl)
      → §27 ratio 18.23×
      → §28 "F32 vs Q4K matmul precision" (REFUTED in §30 by direct kernel comparison)
      → §31 qkv_bias std=10.24 introduces 9× layer-0 gap (PINNED)

The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages found it in one pass.

Drift-prevention test for next session (per §31.5): assert per-layer

    |APR qkv_bias.std() - GGUF qkv_bias.std()| / max(eps, GGUF) < 0.10.

Files:

- crates/aprender-serve/examples/diag_qkv_bisection_layer0.rs (rerunnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txt
- evidence/ship-007-qkv-bisection-2026-04-27/findings.md (full analysis)
- §31 spec section (8 subsections)
- Header: v2.75.0 → v2.76.0

Coverage scoreboard unchanged (15+33). Will flip to 20+28 when PR E v2 lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §32 — §31 REFUTED, qkv_bias is byte-identical APR=GGUF (trace point mismatch) — v2.76 → v2.77

§31 hypothesized that APR's qkv_bias was the divergence introducer (std=10.24). Live byte-compare via diag_compare_qkv_bias.rs proves this WRONG:

| Tensor | APR mean | APR std | GGUF mean | GGUF std | max diff | RMS diff |
|--------|---------:|--------:|----------:|---------:|---------:|---------:|
| q_bias | 0.127345 | 3.258061 | 0.127345 | 3.258061 | 0.000000 | 0.000000 |
| k_bias | 1.549082 | 29.464621 | 1.549082 | 29.464621 | 0.000000 | 0.000000 |
| v_bias | 0.005926 | 0.186556 | 0.005926 | 0.186556 | 0.000000 | 0.000000 |

APR ≡ GGUF byte-for-byte. The std=10.24 IS Qwen2.5-7B's actual trained qkv_bias values; both formats store/load them correctly.

So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)? **TRACE-CAPTURE-POINT MISMATCH.**

GGUF (gguf/inference/forward/traced.rs:144):

- qkv_stats captured AFTER scratch_attention_block writes scratch.qkv
- BUT BEFORE the per-Q/K/V bias add at results.rs:216-226
- = PRE-BIAS measurement → std=1.14

APR (apr_transformer/pmat-260.rs:331-334):

- matmul writes `qkv` then `add_bias(qkv, bias)` in-place
- Trace captured AFTER bias add
- = POST-BIAS measurement → std=10.33

Both forward passes correctly apply qkv_bias. The 9× gap exists only in the TRACE STATISTICS, not in the actual computation.

Where IS the actual SHIP-007 bug then? Per existing trace, layer 3 ffn_gate diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly amplifies to 18× ffn_swigl ratio. That's STILL where the bug surface is, exactly as §28 originally said. §30's investigation (which refuted §28) only tested LAYER 0 QKV matmul. LAYER-3 FFN ffn_gate matmul is a DIFFERENT code path.

PR E v3 scope: Run §31-style bisection AT LAYER 3 with the proper trace capture points, comparing APR sub-FFN stats vs GGUF sub-FFN stats (both captured at matched points per PR #1066/#1067 forward_traced sub-FFN slots).

Methodology lesson (§32.5): when stat-bisection finds a "smoking gun," ALWAYS verify with byte-level comparison against the reference. Stats can mislead when measurement points differ. Toyota Way: verify physical state (byte equality), not just symptoms (statistical gaps).

Files:

- crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt
- §32 spec section (6 subsections)
- §31 marked SUPERSEDED in spec
- Header v2.76.0 → v2.77.0

Coverage scoreboard unchanged (15+33). Will flip when LAYER-3-specific bisection localizes the actual divergence point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §32 follow-up — layer-3 ffn_gate/up/down Q4K bytes ARE byte-identical APR=GGUF

Per §32.4 next-step ("layer-3 sub-FFN bisection"), ran a byte-level comparison of layer-3 Q4K weight bytes between the canonical 7B teacher's .apr and .gguf files. Result:

    ffn_gate.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
    ffn_up.weight Q4K (38,191,104 bytes)   → ✓ APR ≡ GGUF byte-for-byte
    ffn_down.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
    layer-0 ffn_gate Q4K (sanity)          → ✓ APR ≡ GGUF byte-for-byte

So the layer-3 ffn_gate divergence (APR std=1.92 vs GGUF std=1.41 = 1.36× ratio per existing trace) does NOT come from differing weight bytes. This eliminates the GGUF→APR converter as the bug surface for layer 3.

Combined with §30 (layer-0 QKV matmul kernel agrees) and §32 (qkv_bias byte-identical), the elimination chain is now:

- QKV matmul kernel: ✓ correct (§30)
- QKV bias bytes: ✓ correct (§32)
- Layer-3 FFN weight bytes: ✓ correct (this commit)

The remaining hypothesis: cumulative layer-by-layer F32 precision drift through residual connections. APR layer 0 output std=0.40 vs GGUF=0.36 (10% drift). Layer 1 input = layer 0 output. After 3 layers of accumulating ~1% Q4K rounding diffs, layer-3 ffn_gate input is sufficiently different between the two formats to push silu into different saturation regions, producing the 18× ffn_swigl ratio.

Note for the apr trace --payload sub-FFN telemetry: GGUF's forward_traced in PR A (#1081, merged) currently leaves the 4 sub-FFN slots at default (zero). PR B (#1082, still DIRTY) clones scratch_swiglu_ffn_traced to populate them. The existing apr-trace.txt and gguf-trace.txt evidence files (2026-04-27) were generated when PR B was applied locally to the binary. Those numbers are valid but require PR B to land on main for reproducibility.

Files added:

- crates/aprender-serve/examples/diag_compare_layer3_ffn.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_layer3_ffn.txt

Coverage scoreboard unchanged. Investigation continues; PR E v3 scope narrows further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
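
The drift-prevention assertion proposed in §31.5, `|APR std - GGUF std| / max(eps, GGUF) < 0.10`, is simple enough to sketch directly. The function names and the `eps` value are assumptions; the 0.10 budget and the formula are taken from the commit message.

```rust
/// Relative difference between two std values, guarded against a zero
/// reference by `eps`. Mirrors the formula quoted in §31.5.
pub fn relative_std_drift(apr_std: f32, gguf_std: f32, eps: f32) -> f32 {
    (apr_std - gguf_std).abs() / gguf_std.abs().max(eps)
}

/// True when the APR value stays within the 10% budget of the GGUF reference.
/// The `eps` of 1e-6 is an assumed value, not one stated in the spec.
pub fn within_drift_budget(apr_std: f32, gguf_std: f32) -> bool {
    relative_std_drift(apr_std, gguf_std, 1e-6) < 0.10
}
```

Applied to the numbers in this thread, the byte-identical q_bias std (3.258061 on both sides) passes, while the mismatched-capture-point layer-0 qkv stats (10.33 vs 1.14) fail. Per §32, such a check is only meaningful when both sides are measured at the same capture point.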

… - emits per-layer LayerActivation telemetry (#1083)

P3 PR C: completes the SHIP-007 §26.4 P3 chain by wiring the new forward_traced method (PR A scaffold + PR B sub-FFN populate) into the apr-cli trace dispatch. Without this, `apr trace --payload <model.gguf>` only does generation+garbage-detection; it does NOT emit the per-layer telemetry needed for the §23 layer-3 ffn_swigl APR-vs-GGUF bisection.

Changes:

1. crates/apr-cli/src/commands/trace.rs::run_traced_inference_gguf
   Now calls model.forward_traced(&test_tokens) BEFORE generation, prints embed/per-layer/final-norm/logit/summary stats via the existing vector_stats helpers. Falls back gracefully on Err (e.g., encoder-decoder models from PR A's guard).

2. crates/apr-cli/src/commands/vector_stats.rs
   4 helpers flipped from private to pub(crate) so trace.rs GGUF dispatch can reuse them (they were already used by the APR dispatch in run_traced_inference_apr):
   - print_layer_activations
   - print_logit_predictions
   - print_trace_summary
   - print_activation_stats / print_activation_stats_colored

Output format matches the APR side exactly, so `apr trace --payload <file>.apr` and `apr trace --payload <file>.gguf` produce side-by-side comparable per-layer stat blocks. The §23 layer-3 ffn_swigl line emits as `ffn_swigl: ...` between ffn_silu and ffn_out (already handled by print_layer_activations:137-142 suppression-when-zero pattern from PR #1066).

After this PR + PR A + PR B all merge, the §26.4 binding criterion becomes runnable on noah-Lambda-Vector RTX 4090:

```
$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr | grep -A1 "Layer 3"
$ apr trace --payload /mnt/.../qwen2.5-coder-7b-instruct-q4k.gguf | grep -A1 "Layer 3"
```

Outcome:
- ratio ≥10× → SHIP-007 bug is APR-side at apr_transformer/inference.rs:160-164
- ratio <2× → 17× spike is normal Qwen2.5 trained behavior

Either discharges 5 MODEL-1 PARTIALs at once per §17.5 (SHIP-002/005/006/007/008).
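The outcome rule above can be written down as a small classifier. This is a sketch under stated assumptions: the ratio is taken to be the APR ffn_swigl std divided by the GGUF ffn_swigl std, the `Inconclusive` band between 2× and 10× is not specified by the commit message and is added here only so the function is total, and a zero GGUF std (for example, zero-filled slots) is treated as inconclusive. `Verdict`, `std_ratio`, and `classify` are hypothetical names.

```rust
/// Possible readings of the layer-3 ffn_swigl comparison.
#[derive(Debug, PartialEq)]
enum Verdict {
    AprSideBug,      // ratio >= 10x
    TrainedBehavior, // ratio < 2x
    Inconclusive,    // assumption: anything in between, or an unusable ratio
}

/// APR/GGUF std ratio, or None when the denominator is unusable
/// (zero or non-finite, e.g. a slot left at its zero default).
fn std_ratio(apr_std: f32, gguf_std: f32) -> Option<f32> {
    if gguf_std > 0.0 && apr_std.is_finite() && gguf_std.is_finite() {
        Some(apr_std / gguf_std)
    } else {
        None
    }
}

/// Apply the two thresholds stated in the Outcome list.
fn classify(apr_std: f32, gguf_std: f32) -> Verdict {
    match std_ratio(apr_std, gguf_std) {
        Some(r) if r >= 10.0 => Verdict::AprSideBug,
        Some(r) if r < 2.0 => Verdict::TrainedBehavior,
        _ => Verdict::Inconclusive,
    }
}

fn main() {
    // Illustrative inputs only; these are not measured values.
    for (apr, gguf) in [(17.0_f32, 1.0_f32), (1.2, 1.0), (5.0, 1.0), (3.0, 0.0)] {
        println!("apr_std={apr} gguf_std={gguf} -> {:?}", classify(apr, gguf));
    }
}
```

Guarding the denominator matters here because a trace produced without the sub-FFN populate step would report zeros for these slots, and a division by zero would otherwise read as an arbitrarily large ratio.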
Stacked on PR #1082 (PR B), which is stacked on PR #1081 (PR A). Will retarget to main once both merge.

Validated:
- `cargo check -p apr-cli --features inference` exits 0
- `cargo clippy -p apr-cli --features inference -- -D warnings` exits 0

Spec: SPEC-SHIP-TWO-001 §26.4 P3 final wiring step

References:
- PR #1081 (P3 PR A: GGUF forward_traced scaffold)
- PR #1082 (P3 PR B: sub-FFN populate)
- §23 (layer-3 ffn_swigl is the first 17× anomaly site, APR side)
- project_ship_007_gguf_forward_traced_plan.md (CLI wiring step)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) April 26, 2026 07:54

noahgift mentioned this pull request Apr 26, 2026

docs(ship-two-001): §18 training status snapshot as chain-of-thought #1067

Closed

4 tasks

Merge branch 'main' into feat/sub-ffn-telemetry-impl

20b4bca

noahgift mentioned this pull request Apr 26, 2026

contract(gpu-training-backend-v1): GATE-GPUTRAIN-004 verdict pending → pass (v1.4 → v1.5) #1071

Merged

4 tasks

Merge branch 'main' into feat/sub-ffn-telemetry-impl

0a5123a

noahgift mentioned this pull request Apr 26, 2026

docs(ship-007): §21 sub-FFN bisection — layer-3 ffn_swigl first 17× anomaly site (v2.66.0) #1072

Closed

4 tasks

Merge branch 'main' into feat/sub-ffn-telemetry-impl

31a89d6

noahgift merged commit 3af9ff2 into main Apr 26, 2026
10 checks passed

noahgift deleted the feat/sub-ffn-telemetry-impl branch April 26, 2026 11:44

noahgift mentioned this pull request Apr 26, 2026

docs(ship-007): §23 sub-FFN bisection — layer-3 ffn_swigl first 17× anomaly site (v2.67.0) #1075

Merged

noahgift mentioned this pull request May 6, 2026

contract(trace-ffn-sub-block-gguf-v1): v1.0.0 → v1.1.0 — §27 evidence integrated, M-FFN-GGUF-3 DISCHARGED #1534

Merged

3 tasks

noahgift merged 4 commits into main from feat/sub-ffn-telemetry-impl
