docs(ship-two-001): §20 live CUDA training dispatch evidence — spec v2.65.0 by noahgift · Pull Request #1070 · paiml/aprender

noahgift · 2026-04-26T09:16:54Z

Summary

§20 records the live CUDA training dispatch on noah-Lambda-Vector RTX 4090 — concrete progress on §19.4 Residual B's "live evidence" half.
Step (a) of §19.5's corrected long path ("rebuild canonical apr binary with --features cuda") is DONE.
100 real CUDA training steps executed; median wall_ms = 264.74 ms (47% headroom under GATE-GPUTRAIN-004's 500ms budget).
Spec v2.64.0 → v2.65.0.

Live evidence captured (RTX 4090)

wall_ms statistics: min=257.86, median=264.74, max=467.66 (step 0 kernel warmup), steady-state 260-270 ms — well below GATE-GPUTRAIN-004's 500ms budget
nvidia-smi PID 1658504 / 6636 MiB captured mid-run, confirming GPU residency (no silent CPU fallback, GATE-GPUTRAIN-002 enforced)
train_loss progression: step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing — correct direction for fresh-init 370M)
GATE-TRAIN-005 ship-blocker fired at epoch boundary (val_loss=10.31 > 10.0 — correct behavior, the 100 steps are insufficient for convergence)
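
The wall_ms figures above can be reproduced from a step-metrics JSONL file with a few lines of Python. This is a minimal sketch, not the project's tooling: it assumes each JSONL line is an object with a numeric `wall_ms` field (the field name comes from PR #1069; the `step` key and the helper name `summarize_wall_ms` are hypothetical), and the sample records are synthetic, not the captured evidence.

```python
import json
import statistics

BUDGET_MS = 500.0  # GATE-GPUTRAIN-004 per-step budget quoted in this PR


def summarize_wall_ms(jsonl_text, budget_ms=BUDGET_MS):
    """Summarize per-step wall-clock times from JSONL step metrics.

    Assumes every non-empty line is a JSON object with a numeric `wall_ms`.
    """
    times = [
        json.loads(line)["wall_ms"]
        for line in jsonl_text.splitlines()
        if line.strip()
    ]
    median = statistics.median(times)
    return {
        "steps": len(times),
        "min": min(times),
        "median": median,
        "max": max(times),
        "headroom_pct": 100.0 * (budget_ms - median) / budget_ms,
        "within_budget": median < budget_ms,
    }


# Synthetic records for illustration only (step 0 mimics a warmup outlier).
sample = "\n".join(
    json.dumps({"step": i, "wall_ms": ms})
    for i, ms in enumerate([467.66, 260.0, 264.74, 270.0, 257.86])
)
summary = summarize_wall_ms(sample)
print(summary)
```

Using the median rather than the mean keeps the step-0 kernel-warmup outlier from skewing the comparison against the budget.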

Evidence files

evidence/task-132-residual-b/
├── cuda-50step-2026-04-26.json     # 100-step JSONL with wall_ms (from PR #1069 contract bump)
└── nvidia-smi-during-run.csv       # PID 1658504 / 6636 MiB
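
A residency check against the nvidia-smi capture could look like the sketch below. The column layout is an assumption: the PR does not say how the CSV was produced, so the sample mirrors the two-column output of `nvidia-smi --query-compute-apps=pid,used_memory --format=csv`. The helper name `pid_resident` is hypothetical.

```python
import csv
import io


def pid_resident(csv_text, pid, min_mib=1):
    """Return True if `pid` is listed with at least `min_mib` MiB of GPU memory.

    Assumes two columns: process ID, then used memory such as "6636 MiB".
    """
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        row_pid = int(row[0])
        used_mib = int(row[1].split()[0])  # "6636 MiB" -> 6636
        if row_pid == pid and used_mib >= min_mib:
            return True
    return False


# Sample in the assumed layout, using the PID and memory figure quoted in this PR.
sample = "pid, used_gpu_memory [MiB]\n1658504, 6636 MiB\n"
print(pid_resident(sample, 1658504))
```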

Gate-by-gate impact

| Gate | Prior | Post-§20 | Evidence |
|------|-------|----------|----------|
| GATE-GPUTRAIN-002 (no silent CPU fallback) | PARTIAL | ACTIVE_WITH_LIVE_EVIDENCE | Rebuild produces GPU-residency-bound run; non-CUDA build still fails contract-cited at GATE-002 |
| GATE-GPUTRAIN-003 (PID in nvidia-smi) | ACTIVE | CONFIRMED | PID 1658504, 6636 MiB stable |
| GATE-GPUTRAIN-004 (per-step latency < 500ms) | PARTIAL | DISCHARGEABLE | Median 264.74 ms across 100 real steps |
| GATE-GPUTRAIN-005 (train_loss decreases) | PARTIAL | OBSERVED | step 0→99: Δ=−0.52 |

Stacks under

docs(ship-two-001): §19 task #132 correction (CUDA training shipped) #1068 (§19: task #132 correction)
docs(ship-two-001): §18 training status snapshot as chain-of-thought #1067 (§18 — training status snapshot)
Pairs with feat(pretrain): add wall_ms to StepMetrics — Residual B per spec §19.4 #1069 (wall_ms code) — JSONL field used here is from that contract bump

Coverage tally update

The coverage tally update is pending the contract bump for gpu-training-backend-v1.yaml, which moves GATE-GPUTRAIN-004 from PARTIAL_ALGORITHM_LEVEL to ACTIVE_WITH_LIVE_EVIDENCE. §20 records the data; the contract amendment captures the durable verdict (separate follow-up PR).

🤖 Generated with Claude Code

…t 17× anomaly site (spec v2.65.0 → v2.66.0)

§17.4 specified the falsifier next step as sub-layer bisection of {ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added the 4 new ActivationStats fields. §21 records the **first run of the bisection on the canonical 7B teacher**.

## What §21 contains (8 subsections)

- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2× layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters: silu(g) and u are individually normal at layer 3, but the std of their elementwise product is 17× the layer-2 value, which implies an unusual positive correlation or an alignment bug
- §21.5 Refined surviving suspect surface: element-wise multiply correctness (`inference.rs:163`) + off-by-one slice indexing as newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)

## Per-layer ffn_swigl progression (key data)

| Layer | ffn_swigl std | Note |
|------:|--------------:|:-----|
| 0 | 0.088 | |
| 1 | 0.061 | |
| 2 | 0.071 | |
| **3** | **1.222** | ← 17.2× layer 2 |
| 4 | 0.390 | |
| 5-25 | ~0.15-0.55 | |
| 26 | 1.452 | |
| 27 | 2.247 | |

Layer 3 stands out specifically: in the layers before it (0-2) and after it (4-25), ffn_swigl stays in the 0.06-0.55 band. The 1.22 value is anomalous.

## Bug surface narrowing (across §15→§16→§17→§21)

- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)

The fix surface is now: `inference.rs:160-164`, specifically the `ffn_hidden.push(silu_g * u)` element-wise multiply.

Spec v2.65.0 → v2.66.0. No coverage tally change: this is investigation recording, not a discharge.

Evidence persisted to:

- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv

Stacks under #1070 (§20), which is under #1068 (§19), which is under #1067 (§18), which is under #1064 (§17), which is under #1063 (§16).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
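
The "17.2× layer 2" figure is a ratio of consecutive per-layer standard deviations. A minimal sketch of that scan follows, using the layer 0-4 ffn_swigl values quoted in the commit message above. The helper name `first_spike` and the 10× threshold are illustrative assumptions, not part of the spec or of the project's bisection tooling.

```python
def first_spike(stds, threshold=10.0):
    """Return (layer, ratio) for the first layer whose std exceeds
    `threshold` times the previous layer's std, or None if there is none."""
    for layer in range(1, len(stds)):
        ratio = stds[layer] / stds[layer - 1]
        if ratio > threshold:
            return layer, ratio
    return None


# ffn_swigl std for layers 0-4, copied from the commit message above.
ffn_swigl_std = [0.088, 0.061, 0.071, 1.222, 0.390]

layer, ratio = first_spike(ffn_swigl_std)
print(layer, round(ratio, 1))  # prints "3 17.2"
```

A layer-over-layer ratio flags the first place the statistic jumps. It would not flag layers 26 and 27 on its own, since their values rise gradually out of the preceding band.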

…→ pass, spec §20 + #1059 evidence, v1.4.0 → v1.5.0

GATE-GPUTRAIN-004 (370M step-time budget < 500ms on RTX 4090) was marked `verdict: pending` despite its paired falsification test FALSIFY-GPUTRAIN-005 being DISCHARGED with median 101.30 ms (20.3% of budget) since 2026-04-24.

This contract bump flips the gate to `verdict: pass` with a `verdict_basis` field citing both:

1. **FALSIFY-GPUTRAIN-005 evidence** (canonical config seq_len=2048 batch=1): median 101.30 ms across 25 steps on noah-Lambda-Vector RTX 4090, in `evidence/task-132/`.
2. **§20 evidence** (PR #1070, different config seq_len=512): median 264.74 ms across 100 steps, in `evidence/task-132-residual-b/`.

Both are well under the 500ms ceiling. Two evidence files at different config bands demonstrate that budget compliance is robust at this margin.

Contract version v1.4.0 → v1.5.0 (additive metadata, no rule change). `pv validate`: 0 errors, 0 warnings.

This is a contract-cosmetic flip: GATE-GPUTRAIN-004's underlying invariant has been satisfied since 2026-04-24, and the field still read `verdict: pending` only because the gate's own pointer to the evidence was missing.

References:

- spec §20 (PR #1070): live evidence capture 2026-04-26
- spec §19.4 Residual B: this is the contractual durable verdict
- evidence/task-132/rtx4090-370m-step-budget-and-repro.json
- evidence/task-132-residual-b/cuda-50step-2026-04-26.json

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
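
The budget percentages cited in the verdict basis are simple arithmetic on the two medians. The sketch below only re-derives them from the numbers quoted in the commit message; the helper name `budget_fraction_pct` and the dictionary labels are illustrative, and nothing here reflects the contract's actual YAML schema.

```python
BUDGET_MS = 500.0  # GATE-GPUTRAIN-004 ceiling quoted in the commit message


def budget_fraction_pct(median_ms, budget_ms=BUDGET_MS):
    """Return the median step time as a percentage of the budget."""
    return 100.0 * median_ms / budget_ms


# Medians quoted in the commit message; labels are for display only.
evidence = {
    "FALSIFY-GPUTRAIN-005 (seq_len=2048, 25 steps)": 101.30,
    "spec §20 (seq_len=512, 100 steps)": 264.74,
}

for label, median_ms in evidence.items():
    pct = budget_fraction_pct(median_ms)
    passed = median_ms < BUDGET_MS
    print(f"{label}: {median_ms:.2f} ms = {pct:.1f}% of budget, pass={passed}")
```

The second figure, about 52.9% of budget, is the same measurement the §20 summary reports as 47% headroom.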



…2.64.0 → v2.65.0

§19 verified `apr pretrain --device cuda` is wired but the canonical apr binary lacked `--features cuda`. §20 records the next step: **rebuild + live dispatch + evidence capture** on RTX 4090.

## What §20 contains (9 subsections)

1. §20.1: Rebuild (40s incremental, `--features cuda` enabled apr-cli)
2. §20.2: Live dispatch command + 100-step JSONL output
3. §20.3: wall_ms statistics, median=264.74 ms (47% headroom under GATE-GPUTRAIN-004's 500ms budget)
4. §20.4: nvidia-smi PID 1658504 / 6636 MiB GPU memory captured mid-run
5. §20.5: Gate-by-gate impact table (GATE-GPUTRAIN-002/003/004/005)
6. §20.6: Evidence files at evidence/task-132-residual-b/
7. §20.7: Long-path status, §19.5 step (a) DONE
8. §20.8: What §20 is NOT (contract bump is follow-up PR)
9. §20.9: Methodological alignment (live-evidence pattern, not chain-of-thought)

## Live evidence captured

- 100 real CUDA training steps on noah-Lambda-Vector RTX 4090
- Real corpus: /mnt/nvme-raid0/data/csn-python-shards
- Real tokenizer: /mnt/nvme-raid0/models/model-2-tokenizer-v1 (vocab=50,257)
- wall_ms median: 264.74 ms (range 257.86–467.66 with step 0 = 467.66 kernel-warmup outlier)
- train_loss step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing)
- val_loss=10.31 triggered GATE-TRAIN-005 ship-blocker abort at epoch boundary (correct behavior for fresh-init 370M before convergence)
- nvidia-smi PID 1658504 / 6636 MiB stable mid-run

## Spec progression

v2.64.0 → v2.65.0. Coverage tally update is **pending** the contract bump for `gpu-training-backend-v1.yaml` GATE-GPUTRAIN-004 PARTIAL_ALGORITHM_LEVEL → ACTIVE_WITH_LIVE_EVIDENCE (separate follow-up PR; §20 records the data, the contract amendment captures the durable verdict).

## Stacks under

- #1068 (§19: task #132 correction)
- #1067 (§18: training status snapshot)
- Concrete progress on §19.4 Residual B (live evidence half)
- Pairs with PR #1069 (wall_ms code half, which provided the JSONL field used for the GATE-GPUTRAIN-004 timing data)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>


noahgift enabled auto-merge (squash) April 26, 2026 09:17

This was referenced Apr 26, 2026

contract(gpu-training-backend-v1): GATE-GPUTRAIN-004 verdict pending → pass (v1.4 → v1.5) #1071

Merged

docs(ship-007): §21 sub-FFN bisection — layer-3 ffn_swigl first 17× anomaly site (v2.66.0) #1072

Closed

noahgift force-pushed the docs/ship-007-20-cuda-live-evidence branch from a8353bd to 1de30ff Compare April 26, 2026 11:14

noahgift force-pushed the docs/ship-007-20-cuda-live-evidence branch from e49cf4d to 5966029 Compare April 26, 2026 12:40

noahgift force-pushed the docs/ship-007-20-cuda-live-evidence branch from 5966029 to a72f6f0 Compare April 26, 2026 13:04

noahgift force-pushed the docs/ship-007-20-cuda-live-evidence branch from a72f6f0 to ea1d39d Compare April 26, 2026 13:31

noahgift merged commit f1ab869 into main Apr 26, 2026
10 checks passed

noahgift deleted the docs/ship-007-20-cuda-live-evidence branch April 26, 2026 13:48

noahgift merged 1 commit into main from docs/ship-007-20-cuda-live-evidence
