docs(ship-two-001): §18 training status snapshot as chain-of-thought by noahgift · Pull Request #1067 · paiml/aprender

noahgift · 2026-04-26T08:08:23Z

Summary

Adds §18 of ship-two-models-spec.md as a chain-of-thought training status snapshot.
Walks the deduction chain that connects the spec's two-model goal to the current state, so future sessions can re-enter the work without re-reading every prior section.
Spec v2.62.0 → v2.63.0. No coverage tally change — chain-of-thought recording, not a discharge.

What §18 contains (9 subsections)

§18.1 Why are we training models at all? (the Sovereign AI Stack Proof framing)
§18.2 What does "DISCHARGED" mean here? (PARTIAL_ALGORITHM_LEVEL vs DISCHARGED vs unbound)
§18.3 MODEL-1 status table (5/10 DISCHARGED, 5/10 PARTIAL — all gated on §17)
§18.4 MODEL-2 status table (3/12 DISCHARGED, 9/12 PARTIAL — all gated on convergence)
§18.5 What's blocking the convergence run? (chain backward to task #132)
§18.6 The SHIP-007 narrowing — chain of deductions (ASCII diagram showing §15 → §15.4 → §16 → §17)
§18.7 What "knowing" looks like at each step (Genchi Genbutsu / falsifiability discipline)
§18.8 What's the next observable state-change? (two parallel paths: short SHIP-007 fix vs long MODEL-2 convergence)
§18.9 Methodological invariant (5-step loop: live evidence → contract → drift-prevention test → spec amendment → auto-merge PR)

Why the chain-of-thought style

The spec has grown to 17 prior sections. A reader landing here cold needs the deduction structure, not just the facts. §18 makes the conditional reasoning explicit: "if you accept §15.4, then §16's hypothesis is justified; if you accept §16, then §17 narrows; if §17, then sub-FFN bisection (PR #1066) is the load-bearing next step."

Stacks under

docs(ship-007): §17 layer-3 ffn_out anomaly identified — first divergent layer named #1064 (§17 — layer-3 ffn_out anomaly)
docs(ship-007): §16 APR forward CPU path isolated as root cause #1063 (§16 — APR forward CPU path)

Test plan

§18 added at end of spec, before END OF SPECIFICATION marker
Atomic-next-action banner updated v2.62.0 → v2.63.0
PMAT pre-commit gates pass (complexity, SATD, docs)
All status numbers in §18 match prior sections (33+12 coverage; 5/10 MODEL-1; 3/12 MODEL-2; 7/7 GPUTRAIN)

🤖 Generated with Claude Code

…pped) — spec v2.63.0 → v2.64.0

§18.5 (just landed in PR #1067) stated:

> Training compute is the real risk — `apr pretrain --device cuda`
> is NOT functional today (task #132).

A sub-agent investigation on 2026-04-26 confirmed this premise was outdated by ~5 days. Task #132 closed at commit f7ad114 (2026-04-21).

## What's actually on disk

The CLI dispatch path is wired:

    apr pretrain --device {cpu|cuda|auto}
      → resolve_device() (entrenar::train::device, train/device.rs:110)
      → drive_real(...) (apr-cli/src/commands/pretrain.rs:252-301)
          ├── Device::Cuda → drive_real_cuda(...) (pretrain.rs:336-364)
          │     → CudaTransformerTrainer::new(cfg)
          │       (transformer_trainer/cuda_trainer.rs:2156-2244)
          └── Device::Cpu → drive_real_cpu(...) (pretrain.rs:307-325)
                → TransformerTrainer::new(cfg)

GPU kernels invoked from cuda branch (all in aprender-train/src/autograd/):

- forward: gemm_forward, rms_norm_forward, pre_warm_forward_kernels
- backward: gemm_backward_a/b, rms_norm_backward
- optimizer/loss: adamw_step_cuda, fused_cross_entropy_cuda
- AMP: GradScaler D2H per step bounded to ~512 B (loss_partials). AdamW state lives on GPU.

## Live smoke test on noah-Lambda-Vector RTX 4090

    $ /mnt/nvme-raid0/targets/aprender/release/apr pretrain \
        --dataset ... --tokenizer ... --run-dir ... \
        --device cuda --synthetic --num-steps 4 --json
    error: --device `cuda` requested but CUDA runtime is not available on
    this host (contract gpu-training-backend-v1 GATE-GPUTRAIN-002: no silent
    CPU fallback). Rebuild with `--features cuda` or pass `--device cpu`.

The graceful contract-cited error proves:

- CLI parses --device cuda correctly;
- dispatch path emits GATE-GPUTRAIN-002 when binary lacks the `cuda` feature.

Per `feedback_cuda_feature_footgun.md`, this is a rebuild-time issue, not a code-architecture gap.

## Three real residuals (post-§19)

- (A) INV-TRAIN-003 GPU AdamW-state sha256 — small PR
- (B) GATE-GPUTRAIN-004/005 live evidence — small PR + operator dispatch
- (C) Operator authorization for 10K-step run — decision, not engineering

## Methodological lesson (§19.8)

§15→§17 narrowing was good chain-of-thought (each deduction a falsifiable result on live evidence). §18.5 was bad chain-of-thought — the premise was inherited from a stale memory entry without re-verification.

Going forward (per `feedback_no_guessing.md`): when a §18-style status snapshot cites a memory entry as evidence for a gap, the memory entry's claims must be re-verified against the code at write-time.

Spec v2.63.0 → v2.64.0. No coverage tally change.

Memory entry `project_task_132_cuda_training_backend_gap.md` description updated separately to reflect the closed status.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
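The "no silent CPU fallback" rule cited in the smoke test can be illustrated with a minimal sketch. This is not the project's `resolve_device()` (which is Rust, in `train/device.rs`); the `Device` enum, the `CudaUnavailableError` class, the `cuda_available` parameter, and the behavior of `auto` are all assumptions made for illustration.

```python
from enum import Enum


class Device(Enum):
    CPU = "cpu"
    CUDA = "cuda"


class CudaUnavailableError(RuntimeError):
    """Raised instead of silently falling back to CPU (hypothetical name)."""


def resolve_device(requested: str, cuda_available: bool) -> Device:
    """Map a --device value to a Device, refusing a silent CPU fallback."""
    if requested == "cpu":
        return Device.CPU
    if requested == "cuda":
        if not cuda_available:
            # Mirrors the GATE-GPUTRAIN-002 behavior quoted in the smoke test:
            # an explicit cuda request without CUDA is an error, not a fallback.
            raise CudaUnavailableError(
                "--device `cuda` requested but CUDA runtime is not available"
            )
        return Device.CUDA
    if requested == "auto":
        # Assumption: `auto` prefers CUDA when present, otherwise uses CPU.
        return Device.CUDA if cuda_available else Device.CPU
    raise ValueError(f"unknown device: {requested!r}")
```

Under these assumptions, `resolve_device("cuda", cuda_available=False)` raises rather than returning `Device.CPU`, which is the property the contract-cited error message demonstrates.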

…2.64.0 → v2.65.0

§19 verified `apr pretrain --device cuda` is wired but the canonical apr binary lacked `--features cuda`. §20 records the next step: **rebuild + live dispatch + evidence capture** on RTX 4090.

## What §20 contains (9 subsections)

1. §20.1 — Rebuild (40s incremental, `--features cuda` enabled apr-cli)
2. §20.2 — Live dispatch command + 100-step JSONL output
3. §20.3 — wall_ms statistics: median=264.74ms (47% headroom under GATE-GPUTRAIN-004's 500ms budget)
4. §20.4 — nvidia-smi PID 1658504 / 6636 MiB GPU memory captured mid-run
5. §20.5 — Gate-by-gate impact table (GATE-GPUTRAIN-002/003/004/005)
6. §20.6 — Evidence files at evidence/task-132-residual-b/
7. §20.7 — Long-path status: §19.5 step (a) DONE
8. §20.8 — What §20 is NOT (contract bump is follow-up PR)
9. §20.9 — Methodological alignment (live-evidence pattern, not chain-of-thought)

## Live evidence captured

- 100 real CUDA training steps on noah-Lambda-Vector RTX 4090
- Real corpus: /mnt/nvme-raid0/data/csn-python-shards
- Real tokenizer: /mnt/nvme-raid0/models/model-2-tokenizer-v1 (vocab=50,257)
- wall_ms median: 264.74 ms (range 257.86–467.66 with step 0 = 467.66 kernel-warmup outlier)
- train_loss step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing)
- val_loss=10.31 triggered GATE-TRAIN-005 ship-blocker abort at epoch boundary (correct behavior for fresh-init 370M before convergence)
- nvidia-smi PID 1658504 / 6636 MiB stable mid-run

## Spec progression

v2.64.0 → v2.65.0. Coverage tally update is **pending** the contract bump for `gpu-training-backend-v1.yaml` GATE-GPUTRAIN-004 PARTIAL_ALGORITHM_LEVEL → ACTIVE_WITH_LIVE_EVIDENCE (separate follow-up PR; §20 records the data, the contract amendment captures the durable verdict).

## Stacks under

- #1068 (§19 — task #132 correction)
- #1067 (§18 — training status snapshot)
- Concrete progress on §19.4 Residual B (live evidence half)
- Pairs with PR #1069 (wall_ms code half — provided the JSONL field used for the GATE-GPUTRAIN-004 timing data)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
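The headroom figure in §20.3 follows from the median step time and the 500 ms budget. A minimal sketch of that arithmetic is below, assuming each JSONL record is an object carrying a numeric `wall_ms` field. Only the field name comes from the commit message; the `step` key, the function name, and the sample records are hypothetical.

```python
import json
import statistics


def wall_ms_summary(jsonl_text: str, budget_ms: float = 500.0) -> dict:
    """Summarise per-step wall_ms values from JSONL step records."""
    values = [
        json.loads(line)["wall_ms"]
        for line in jsonl_text.splitlines()
        if line.strip()  # skip blank lines between records
    ]
    median = statistics.median(values)
    return {
        "steps": len(values),
        "median_ms": median,
        "min_ms": min(values),
        "max_ms": max(values),
        # Headroom: share of the budget left unused by the median step.
        "headroom_pct": 100.0 * (budget_ms - median) / budget_ms,
    }


# Three illustrative records, not the real 100-step log.
sample = "\n".join(
    json.dumps({"step": i, "wall_ms": ms})
    for i, ms in enumerate([467.66, 257.86, 264.74])
)
summary = wall_ms_summary(sample)
```

With a median of 264.74 ms, the headroom works out to 100 × (500 − 264.74) / 500, or about 47%, matching the figure quoted in §20.3.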

…t 17× anomaly site — spec v2.65.0 → v2.66.0

§17.4 specified the falsifier next step as sub-layer bisection of {ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added the 4 new ActivationStats fields. §21 records the **first run of the bisection on the canonical 7B teacher**.

## What §21 contains (8 subsections)

- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2× layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters — silu(g) and u individually normal at layer 3, but their elementwise product is 17× — implies an unusual positive correlation or alignment bug
- §21.5 Refined surviving suspect surface — element-wise multiply correctness (`inference.rs:163`) + off-by-one slice indexing as newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)

## Per-layer ffn_swigl progression (key data)

| Layer | ffn_swigl std | Note |
|------:|--------------:|:-----|
| 0 | 0.088 | |
| 1 | 0.061 | |
| 2 | 0.071 | |
| **3** | **1.222** | 17.2× layer 2 |
| 4 | 0.390 | |
| 5-25 | ~0.15-0.55 | |
| 26 | 1.452 | |
| 27 | 2.247 | |

Layer 3 stands out specifically — in layers 0-2 and 4-25, ffn_swigl stays in the 0.06-0.55 band, and only the final layers 26 and 27 exceed it. The 1.22 value at layer 3 is anomalous.

## Bug surface narrowing (across §15→§16→§17→§21)

- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)

The fix surface is now: `inference.rs:160-164`, specifically the `ffn_hidden.push(silu_g * u)` element-wise multiply.

Spec v2.65.0 → v2.66.0. No coverage tally change — investigation-recording, not a discharge.

Evidence persisted to:

- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv

Stacks under #1070 (§20) which is under #1068 (§19) which is under #1067 (§18) which is under #1064 (§17) which is under #1063 (§16).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
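The bisection looks for the first layer whose `ffn_swigl` standard deviation jumps relative to the layer before it. A minimal sketch of that check is below, using the std values quoted for layers 0 to 4. It is an illustration, not the project's trace tooling: the function names and the 10× threshold are assumptions, and `swiglu` is the textbook `silu(g) * u` product rather than a copy of `inference.rs`.

```python
import math
from typing import Optional, Sequence


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(gate: Sequence[float], up: Sequence[float]) -> list:
    """Element-wise silu(g) * u over two equally sized vectors."""
    if len(gate) != len(up):
        # A length mismatch would hide exactly the kind of slice or
        # alignment error that §21.5 names as a candidate.
        raise ValueError("gate and up must have the same length")
    return [silu(g) * u for g, u in zip(gate, up)]


def first_spike(stds: Sequence[float], factor: float = 10.0) -> Optional[int]:
    """Return the first layer whose std exceeds factor x the previous layer's."""
    for layer in range(1, len(stds)):
        if stds[layer] > factor * stds[layer - 1]:
            return layer
    return None


# ffn_swigl std for layers 0..4, as quoted in the commit message.
layer_stds = [0.088, 0.061, 0.071, 1.222, 0.390]
spike_layer = first_spike(layer_stds)
spike_ratio = layer_stds[3] / layer_stds[2]
```

On these five values the check flags layer 3, and the layer-3 to layer-2 ratio is about 17.2, consistent with the figure in §21.3. Choosing a relative threshold rather than an absolute one keeps the check independent of the overall activation scale.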

…t — spec v2.62.0 → v2.63.0

§18 walks the deduction chain that connects the spec's two-model goal to the current state, so future sessions can re-enter the work without re-reading every prior section.

## Section structure

- §18.1 Why are we training models at all?
- §18.2 What does "DISCHARGED" mean here, and where are we?
- §18.3 MODEL-1 — five fully discharged, five blocked on one bug
- §18.4 MODEL-2 — three discharged, nine blocked on convergence
- §18.5 What's blocking the convergence run?
- §18.6 The SHIP-007 narrowing — chain of deductions (ASCII diagram)
- §18.7 What "knowing" looks like at each step
- §18.8 What's the next observable state-change? (two parallel paths)
- §18.9 Methodological invariant (5-step loop)

## Key durable facts captured

- Coverage tally: 33 PARTIAL + 12 DISCHARGED across 45 levers
- MODEL-1: 5/10 ACs DISCHARGED (SHIP-001/003/004/009/010); 5/10 PARTIAL all transitively gated on SHIP-007
- MODEL-2: 3/12 ACs DISCHARGED (SHIP-011/021/022); 9/12 PARTIAL gated on a converged 370M run; the convergence run itself is blocked at task #132 (`apr pretrain --device cuda` not yet wired through `TransformerTrainer::new`)
- GPUTRAIN suite: 7/7 DISCHARGED (full closure)
- SHIP-007 narrowing diagram: §15 hypothesis → §15.4 GPU GQA kernel ELIMINATED → §16 GPU stack ELIMINATED (CPU APR vs CPU GGUF divergence) → §17 layer 3 FFN sub-block named (53× spike) → sub-FFN bisection (PR #1066)
- Two parallel paths to next observable state-change: (a) short — SHIP-007 bug site named via sub-FFN trace; (b) long — task #132 + Stack v2 tokenization + convergence

## What this is NOT

This is investigation-recording, not a discharge. Coverage tally unchanged. The chain-of-thought view is intentionally narrative-style to make the deduction structure obvious to future sessions and to external readers asking "where are we?"

Spec v2.62.0 → v2.63.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…1068)

* docs(ship-two-001): §18 — training status snapshot as chain-of-thought — spec v2.62.0 → v2.63.0
* docs(ship-two-001): §19 — task #132 correction (CUDA training has shipped) — spec v2.63.0 → v2.64.0

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-04-26T13:04:05Z

§18 content is already on main via PR #1068's merge (which had §18 stacked). Closing as redundant.

…2.64.0 → v2.65.0 §19 verified `apr pretrain --device cuda` is wired but the canonical apr binary lacked `--features cuda`. §20 records the next step: **rebuild + live dispatch + evidence capture** on RTX 4090. ## What §20 contains (9 subsections) 1. §20.1 — Rebuild (40s incremental, `--features cuda` enabled apr-cli) 2. §20.2 — Live dispatch command + 100-step JSONL output 3. §20.3 — wall_ms statistics: median=264.74ms (47% headroom under GATE-GPUTRAIN-004's 500ms budget) 4. §20.4 — nvidia-smi PID 1658504 / 6636 MiB GPU memory captured mid-run 5. §20.5 — Gate-by-gate impact table (GATE-GPUTRAIN-002/003/004/005) 6. §20.6 — Evidence files at evidence/task-132-residual-b/ 7. §20.7 — Long-path status: §19.5 step (a) DONE 8. §20.8 — What §20 is NOT (contract bump is follow-up PR) 9. §20.9 — Methodological alignment (live-evidence pattern, not chain-of-thought) ## Live evidence captured - 100 real CUDA training steps on noah-Lambda-Vector RTX 4090 - Real corpus: /mnt/nvme-raid0/data/csn-python-shards - Real tokenizer: /mnt/nvme-raid0/models/model-2-tokenizer-v1 (vocab=50,257) - wall_ms median: 264.74 ms (range 257.86–467.66 with step 0 = 467.66 kernel-warmup outlier) - train_loss step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing) - val_loss=10.31 triggered GATE-TRAIN-005 ship-blocker abort at epoch boundary (correct behavior for fresh-init 370M before convergence) - nvidia-smi PID 1658504 / 6636 MiB stable mid-run ## Spec progression v2.64.0 → v2.65.0. Coverage tally update is **pending** the contract bump for `gpu-training-backend-v1.yaml` GATE-GPUTRAIN-004 PARTIAL_ALGORITHM_LEVEL → ACTIVE_WITH_LIVE_EVIDENCE (separate follow-up PR; §20 records the data, the contract amendment captures the durable verdict). 
## Stacks under - #1068 (§19 — task #132 correction) - #1067 (§18 — training status snapshot) - Concrete progress on §19.4 Residual B (live evidence half) - Pairs with PR #1069 (wall_ms code half — provided the JSONL field used for the GATE-GPUTRAIN-004 timing data) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…t 17× anomaly site (spec v2.65.0 → v2.66.0)

§17.4 specified the falsifier next step as sub-layer bisection of {ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added the 4 new ActivationStats fields. §21 records the **first run of the bisection on the canonical 7B teacher**.

## What §21 contains (8 subsections)

- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2× layer 2; ffn_silu shows a 3.2× precursor; ffn_out shows the 53× cascade)
- §21.4 Why this matters: silu(g) and u are individually normal at layer 3, but their element-wise product spikes 17×, which implies an unusual positive correlation or an alignment bug
- §21.5 Refined surviving suspect surface: element-wise multiply correctness (`inference.rs:163`), plus off-by-one slice indexing as a newly named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, comparing APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (it doesn't pin to a code line yet, and it depends on PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)

## Per-layer ffn_swigl progression (key data)

| Layer | ffn_swigl std | Note |
|------:|--------------:|:-----|
| 0 | 0.088 | |
| 1 | 0.061 | |
| 2 | 0.071 | |
| **3** | **1.222** | 17.2× layer 2 |
| 4 | 0.390 | |
| 5-25 | ~0.15-0.55 | |
| 26 | 1.452 | |
| 27 | 2.247 | |

Layer 3 stands out among its neighbors: layers 0-2 and 4-25 all sit in the 0.06-0.55 band, so the 1.22 value is anomalous. Layers 26 and 27 also exceed that band, but layer 3 is the first site where it happens.

## Bug surface narrowing (across §15→§16→§17→§21)

- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)

The suspected fix surface is now `inference.rs:160-164`, specifically the `ffn_hidden.push(silu_g * u)` element-wise multiply.

Spec v2.65.0 → v2.66.0. No coverage tally change: this is investigation-recording, not a discharge.

Evidence persisted to:

- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv

Stacks under #1070 (§20), which is under #1068 (§19), which is under #1067 (§18), which is under #1064 (§17), which is under #1063 (§16).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
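To make the §21 measurement concrete, here is a minimal Rust sketch of the quantity being bisected: the element-wise product silu(g) * u, its standard deviation, and a scan for the first layer whose std jumps relative to the previous one. The function names (`silu`, `std_dev`, `swiglu`, `first_spike`) are hypothetical illustrations and are not taken from `inference.rs`; the length assertion is an assumption about how a slice misalignment could be surfaced early.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Population standard deviation of a slice (0.0 for an empty slice).
fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    var.sqrt()
}

/// Element-wise SwiGLU hidden activation: silu(gate[i]) * up[i].
/// Panics when the lengths differ (hypothetical guard, not the project's code).
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate/up length mismatch");
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}

/// Index of the first layer whose std exceeds `factor` times the
/// previous layer's std, or None when no such jump exists.
fn first_spike(stds: &[f32], factor: f32) -> Option<usize> {
    stds.windows(2)
        .position(|w| w[1] > factor * w[0])
        .map(|i| i + 1)
}
```

Fed the first five rows of the table above, `first_spike(&[0.088, 0.061, 0.071, 1.222, 0.390], 10.0)` reports layer 3. The factor of 10 is an arbitrary illustrative threshold, not a value from the spec.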

…2.64.0 → v2.65.0

§19 verified that `apr pretrain --device cuda` is wired, but the canonical apr binary lacked `--features cuda`. §20 records the next step: **rebuild + live dispatch + evidence capture** on RTX 4090.

## What §20 contains (9 subsections)

1. §20.1: Rebuild (40s incremental, `--features cuda` enabled apr-cli)
2. §20.2: Live dispatch command + 100-step JSONL output
3. §20.3: wall_ms statistics: median=264.74ms (47% headroom under GATE-GPUTRAIN-004's 500ms budget)
4. §20.4: nvidia-smi PID 1658504 / 6636 MiB GPU memory captured mid-run
5. §20.5: Gate-by-gate impact table (GATE-GPUTRAIN-002/003/004/005)
6. §20.6: Evidence files at evidence/task-132-residual-b/
7. §20.7: Long-path status: §19.5 step (a) DONE
8. §20.8: What §20 is NOT (contract bump is a follow-up PR)
9. §20.9: Methodological alignment (live-evidence pattern, not chain-of-thought)

## Live evidence captured

- 100 real CUDA training steps on noah-Lambda-Vector RTX 4090
- Real corpus: /mnt/nvme-raid0/data/csn-python-shards
- Real tokenizer: /mnt/nvme-raid0/models/model-2-tokenizer-v1 (vocab=50,257)
- wall_ms median: 264.74 ms (range 257.86–467.66, with step 0 = 467.66 as a kernel-warmup outlier)
- train_loss step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing)
- val_loss=10.31 triggered the GATE-TRAIN-005 ship-blocker abort at the epoch boundary (correct behavior for a fresh-init 370M model before convergence)
- nvidia-smi PID 1658504 / 6636 MiB stable mid-run

## Spec progression

v2.64.0 → v2.65.0. Coverage tally update is **pending** the contract bump for `gpu-training-backend-v1.yaml` GATE-GPUTRAIN-004 PARTIAL_ALGORITHM_LEVEL → ACTIVE_WITH_LIVE_EVIDENCE (separate follow-up PR; §20 records the data, the contract amendment captures the durable verdict).

## Stacks under

- #1068 (§19: task #132 correction)
- #1067 (§18: training status snapshot)
- Concrete progress on §19.4 Residual B (live evidence half)
- Pairs with PR #1069 (wall_ms code half; it provided the JSONL field used for the GATE-GPUTRAIN-004 timing data)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
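The §20.3 headroom figure can be re-derived from the per-step wall_ms samples. The sketch below is a minimal Rust illustration under stated assumptions: `median` and `headroom` are hypothetical helper names, the samples are assumed to be already parsed out of the JSONL into `f64` values, and the 500 ms budget is the GATE-GPUTRAIN-004 number quoted above.

```rust
/// Median of a set of samples; None when the slice is empty.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut v = samples.to_vec();
    // total_cmp gives a total order on f64, so sorting cannot panic on NaN.
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 0 {
        (v[mid - 1] + v[mid]) / 2.0
    } else {
        v[mid]
    })
}

/// Fraction of the budget left unused by the median step time.
fn headroom(median_ms: f64, budget_ms: f64) -> f64 {
    (budget_ms - median_ms) / budget_ms
}
```

With the reported median, `headroom(264.74, 500.0)` evaluates to about 0.4705, which matches the 47% quoted in §20.3. Using the median rather than the mean keeps the step-0 warmup outlier (467.66 ms) from skewing the gate check.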

…=GGUF (trace point mismatch), v2.76 → v2.77

§31 hypothesized that APR's qkv_bias was the divergence introducer (std=10.24). A live byte-compare via diag_compare_qkv_bias.rs proves this WRONG:

| Tensor | APR mean | APR std | GGUF mean | GGUF std | max diff | RMS diff |
|--------|---------:|--------:|----------:|---------:|---------:|---------:|
| q_bias | 0.127345 | 3.258061 | 0.127345 | 3.258061 | 0.000000 | 0.000000 |
| k_bias | 1.549082 | 29.464621 | 1.549082 | 29.464621 | 0.000000 | 0.000000 |
| v_bias | 0.005926 | 0.186556 | 0.005926 | 0.186556 | 0.000000 | 0.000000 |

APR ≡ GGUF byte-for-byte. The std=10.24 IS Qwen2.5-7B's actual trained qkv_bias values; both formats store and load them correctly.

So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)? **TRACE-CAPTURE-POINT MISMATCH.**

GGUF (gguf/inference/forward/traced.rs:144):

- qkv_stats captured AFTER scratch_attention_block writes scratch.qkv
- BUT BEFORE the per-Q/K/V bias add at results.rs:216-226
- = PRE-BIAS measurement → std=1.14

APR (apr_transformer/pmat-260.rs:331-334):

- matmul writes `qkv`, then `add_bias(qkv, bias)` in-place
- Trace captured AFTER the bias add
- = POST-BIAS measurement → std=10.33

Both forward passes correctly apply qkv_bias. The 9× gap exists only in the TRACE STATISTICS, not in the actual computation.

Where IS the actual SHIP-007 bug then? Per the existing trace, layer 3 ffn_gate diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly amplifies to an 18× ffn_swigl ratio. That's STILL where the bug surface is, exactly as §28 originally said. §30's investigation (which refuted §28) only tested the LAYER 0 QKV matmul. The LAYER-3 FFN ffn_gate matmul is a DIFFERENT code path.

PR E v3 scope: run a §31-style bisection AT LAYER 3 with the proper trace capture points, comparing APR sub-FFN stats vs GGUF sub-FFN stats (both captured at matched points per PR #1066/#1067 forward_traced sub-FFN slots).

Methodology lesson (§32.5): when stat-bisection finds a "smoking gun," ALWAYS verify with a byte-level comparison against the reference. Stats can mislead when measurement points differ. Toyota Way: verify physical state (byte equality), not just symptoms (statistical gaps).

Files:

- crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt
- §32 spec section (6 subsections)
- §31 marked SUPERSEDED in spec
- Header v2.76.0 → v2.77.0

Coverage scoreboard unchanged (15+33). Will flip when the LAYER-3-specific bisection localizes the actual divergence point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
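The byte-level check behind the §32 table can be sketched in a few lines of Rust. This is an illustration only, not the contents of diag_compare_qkv_bias.rs: `DiffReport` and `compare_tensors` are hypothetical names, and both tensors are assumed to be already dequantized into `f32` slices of equal length.

```rust
/// Summary of how two tensors differ (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct DiffReport {
    bit_identical: bool,
    max_abs_diff: f32,
    rms_diff: f32,
}

/// Compare two tensors element by element.
/// Returns None when the lengths differ or the tensors are empty.
fn compare_tensors(a: &[f32], b: &[f32]) -> Option<DiffReport> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    // Bit equality is stricter than numeric equality (it separates +0.0 from -0.0).
    let bit_identical = a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits());
    let mut max_abs = 0.0f32;
    let mut sum_sq = 0.0f64;
    for (x, y) in a.iter().zip(b) {
        let d = (x - y).abs();
        max_abs = max_abs.max(d);
        // Accumulate in f64 to limit rounding error on large tensors.
        sum_sq += f64::from(d) * f64::from(d);
    }
    let rms = (sum_sq / a.len() as f64).sqrt() as f32;
    Some(DiffReport {
        bit_identical,
        max_abs_diff: max_abs,
        rms_diff: rms,
    })
}
```

A result with `bit_identical == true` corresponds to the all-zero "max diff" and "RMS diff" columns in the table above. Matching summary statistics alone would not prove that, which is the point of the §32.5 lesson.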

…=10.24), v2.75 → v2.76 (#1090)

* docs(ship-two-001): §31, SHIP-007 root cause PINNED to qkv_bias (std=10.24), spec v2.75.0 → v2.76.0

Live three-stage bisection on the canonical 7B teacher pinpoints the divergence point exactly. Per §30.4's falsifiable next-investigation step, captured layer-0 qkv at four stages with the prompt "What is 2+2?":

| Stage | mean | std | Match GGUF (1.14)? |
|-------|------|-----|---------------------|
| Embedding | 1e-5 | 0.0174 | OK (input) |
| Post-RMSNorm | -8e-5 | 0.221 | OK (input) |
| Post-matmul, pre-bias | -0.0159 | 0.925 | YES (Q4K tolerance) |
| qkv_bias (the bias itself) | +0.272 | 10.243 | ⚠ ~10× too large |
| Post-bias | +0.256 | 10.329 | matches APR trace blowup |

The 9× std blowup happens ENTIRELY at the qkv_bias addition step (pmat-260.rs:332-334). Pre-bias matmul output matches GGUF; post-bias matches APR's existing trace. The K-part bias is most extreme (post-bias std=29.49).

PR E v2 is now scoped to ONE specific investigation per §31.4:

- dump APR's `blk.0.attn_q.bias` / `attn_k.bias` / `attn_v.bias` bytes
- dump GGUF's same 3 tensors
- byte-compare:
  - if APR != GGUF, the GGUF→APR converter is broken
  - if APR == GGUF, the loader (`load_qkv_bias`) is misinterpreting them

§31 falsification chain (now closed at the root):

- §15.4 GPU eliminated
- → §16 APR CPU isolated
- → §17 (layer 3, FFN)
- → §23 (layer 3, ffn_swigl)
- → §27 ratio 18.23×
- → §28 "F32 vs Q4K matmul precision" (REFUTED in §30 by direct kernel comparison)
- → §31 qkv_bias std=10.24 introduces the 9× layer-0 gap (PINNED)

The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages found it in one pass.

Drift-prevention test for next session (per §31.5): assert per-layer |APR qkv_bias.std() - GGUF qkv_bias.std()| / max(eps, GGUF) < 0.10.

Files:

- crates/aprender-serve/examples/diag_qkv_bisection_layer0.rs (rerunnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txt
- evidence/ship-007-qkv-bisection-2026-04-27/findings.md (full analysis)
- §31 spec section (8 subsections)
- Header: v2.75.0 → v2.76.0

Coverage scoreboard unchanged (15+33). Will flip to 20+28 when PR E v2 lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §32, §31 REFUTED, qkv_bias is byte-identical APR=GGUF (trace point mismatch), v2.76 → v2.77

* docs(ship-two-001): §32 follow-up, layer-3 ffn_gate/up/down Q4K bytes ARE byte-identical APR=GGUF

Per the §32.4 next step ("layer-3 sub-FFN bisection"), ran a byte-level comparison of layer-3 Q4K weight bytes between the canonical 7B teacher's .apr and .gguf files.

Result:

- ffn_gate.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
- ffn_up.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
- ffn_down.weight Q4K (38,191,104 bytes) → ✓ APR ≡ GGUF byte-for-byte
- layer-0 ffn_gate Q4K (sanity) → ✓ APR ≡ GGUF byte-for-byte

So the layer-3 ffn_gate divergence (APR std=1.92 vs GGUF std=1.41, a 1.36× ratio per the existing trace) does NOT come from differing weight bytes. This eliminates the GGUF→APR converter as the bug surface for layer 3.

Combined with §30 (layer-0 QKV matmul kernel agrees) and §32 (qkv_bias byte-identical), the elimination chain is now:

- QKV matmul kernel: ✓ correct (§30)
- QKV bias bytes: ✓ correct (§32)
- Layer-3 FFN weight bytes: ✓ correct (this commit)

The remaining hypothesis: cumulative layer-by-layer F32 precision drift through residual connections. APR layer 0 output std=0.40 vs GGUF=0.36 (10% drift). Layer 1 input = layer 0 output. After 3 layers of accumulating ~1% Q4K rounding diffs, the layer-3 ffn_gate input is sufficiently different between the two formats to push silu into different saturation regions, producing the 18× ffn_swigl ratio.

Note for the apr trace --payload sub-FFN telemetry: GGUF's forward_traced in PR A (#1081, merged) currently leaves the 4 sub-FFN slots at default (zero). PR B (#1082, still DIRTY) clones scratch_swiglu_ffn_traced to populate them. The existing apr-trace.txt and gguf-trace.txt evidence files (2026-04-27) were generated with PR B applied locally to the binary; those numbers are valid but require PR B to land on main for reproducibility.

Files added:

- crates/aprender-serve/examples/diag_compare_layer3_ffn.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_layer3_ffn.txt

Coverage scoreboard unchanged. Investigation continues; PR E v3 scope narrows further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
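The drift-prevention assertion quoted from §31.5 above, |APR std - GGUF std| / max(eps, GGUF std) < 0.10, can be written out as a small Rust sketch. The helper names (`std_dev`, `relative_std_drift`, `within_drift_budget`) and the choice of population standard deviation are assumptions for illustration; the project's actual test may differ.

```rust
/// Population standard deviation of a slice (0.0 for an empty slice).
fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    var.sqrt()
}

/// Relative std gap between the APR tensor and the GGUF reference:
/// |std(apr) - std(gguf)| / max(eps, std(gguf)).
fn relative_std_drift(apr: &[f32], gguf: &[f32], eps: f32) -> f32 {
    let s_apr = std_dev(apr);
    let s_gguf = std_dev(gguf);
    // eps guards against division by zero when the reference is constant.
    (s_apr - s_gguf).abs() / s_gguf.max(eps)
}

/// True when the relative drift stays under `threshold` (0.10 in §31.5).
fn within_drift_budget(apr: &[f32], gguf: &[f32], eps: f32, threshold: f32) -> bool {
    relative_std_drift(apr, gguf, eps) < threshold
}
```

As the §32 refutation shows, a check like this only has meaning when both statistics are captured at the same point in the forward pass; a pre-bias std compared against a post-bias std would fail it without any computation being wrong.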

noahgift enabled auto-merge (squash) April 26, 2026 08:08

noahgift force-pushed the docs/ship-007-18-training-status-cot branch from 7bcab89 to f78ec68 April 26, 2026 11:13

noahgift force-pushed the docs/ship-007-18-training-status-cot branch from 57657e9 to 06505b1 April 26, 2026 12:39

noahgift closed this Apr 26, 2026

auto-merge was automatically disabled April 26, 2026 13:04
Pull request was closed

noahgift wants to merge 1 commit into main from docs/ship-007-18-training-status-cot
