docs(ship-two-001): §20 live CUDA training dispatch evidence — spec v2.65.0 #1070

Merged
noahgift merged 1 commit into main from docs/ship-007-20-cuda-live-evidence
Apr 26, 2026

Summary

  • §20 records the live CUDA training dispatch on noah-Lambda-Vector RTX 4090 — concrete progress on §19.4 Residual B's "live evidence" half.
  • Step (a) of §19.5's corrected long path ("rebuild canonical apr binary with --features cuda") is DONE.
  • 100 real CUDA training steps executed; median wall_ms = 264.74 ms (47% headroom under GATE-GPUTRAIN-004's 500ms budget).
  • Spec v2.64.0 → v2.65.0.

Live evidence captured (RTX 4090)

  • wall_ms statistics: min=257.86, median=264.74, max=467.66 (step 0 kernel warmup), steady-state 260-270 ms — well below GATE-GPUTRAIN-004's 500ms budget
  • nvidia-smi PID 1658504 / 6636 MiB captured mid-run, confirming GPU residency (no silent CPU fallback, GATE-GPUTRAIN-002 enforced)
  • train_loss progression: step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing — correct direction for fresh-init 370M)
  • GATE-TRAIN-005 ship-blocker fired at epoch boundary (val_loss=10.31 > 10.0; this is correct behavior, since 100 steps are insufficient for convergence)
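
The wall_ms figures above can be re-derived from a JSONL log with a few lines of Python. This is a minimal sketch, not the project's tooling: it assumes one JSON object per line with a `wall_ms` field (the field name comes from this PR; the `step` field and the helper name `summarize_wall_ms` are hypothetical). The three-record sample is built only from the reported min, median and max, so it stands in for the real 100-record file.

```python
import json
from statistics import median

BUDGET_MS = 500.0  # GATE-GPUTRAIN-004 per-step budget quoted in this PR


def summarize_wall_ms(jsonl_text, budget_ms=BUDGET_MS):
    """Return min/median/max of wall_ms plus headroom under the budget."""
    values = [
        json.loads(line)["wall_ms"]
        for line in jsonl_text.splitlines()
        if line.strip()  # tolerate blank lines in the log
    ]
    med = median(values)
    return {
        "min": min(values),
        "median": med,
        "max": max(values),
        "headroom_pct": round(100.0 * (1.0 - med / budget_ms), 1),
        "within_budget": med < budget_ms,
    }


# Illustrative sample only: the reported min, median and max values.
# The "step" field is an assumption about the record layout.
sample = "\n".join([
    '{"step": 0, "wall_ms": 467.66}',
    '{"step": 1, "wall_ms": 257.86}',
    '{"step": 2, "wall_ms": 264.74}',
])
summary = summarize_wall_ms(sample)
```

Using the median rather than the mean keeps the step 0 kernel-warmup outlier from skewing the comparison against the budget.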

Evidence files

evidence/task-132-residual-b/
├── cuda-50step-2026-04-26.json     # 100-step JSONL with wall_ms (from PR #1069 contract bump)
└── nvidia-smi-during-run.csv       # PID 1658504 / 6636 MiB
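
A residency check against the captured CSV could look like the sketch below. The column layout of `nvidia-smi-during-run.csv` is not shown in this PR; the sample assumes the two-column form (`pid`, used memory with a `MiB` suffix) that `nvidia-smi --query-compute-apps=pid,used_memory --format=csv` typically emits, and the helper name `gpu_memory_mib_for_pid` is hypothetical.

```python
import csv
import io


def gpu_memory_mib_for_pid(csv_text, pid):
    """Return the MiB used by `pid`, or None if the PID is not listed."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) >= 2 and row[0].strip() == str(pid):
            return int(row[1].split()[0])  # "6636 MiB" -> 6636
    return None


# Assumed layout; the PID and memory figure are the ones reported above.
sample = "pid, used_gpu_memory [MiB]\n1658504, 6636 MiB\n"
resident_mib = gpu_memory_mib_for_pid(sample, 1658504)
```

A `None` result would mean the training process was not on the GPU when the snapshot was taken, which is the silent-CPU-fallback case GATE-GPUTRAIN-002 is meant to rule out.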

Gate-by-gate impact

| Gate | Prior | Post-§20 | Evidence |
|------|-------|----------|----------|
| GATE-GPUTRAIN-002 (no silent CPU fallback) | PARTIAL | ACTIVE_WITH_LIVE_EVIDENCE | Rebuild produces GPU-residency-bound run; non-CUDA build still fails contract-cited at GATE-002 |
| GATE-GPUTRAIN-003 (PID in nvidia-smi) | ACTIVE | CONFIRMED | PID 1658504, 6636 MiB stable |
| GATE-GPUTRAIN-004 (per-step latency < 500ms) | PARTIAL | DISCHARGEABLE | Median 264.74 ms across 100 real steps |
| GATE-GPUTRAIN-005 (train_loss decreases) | PARTIAL | OBSERVED | step 0→99: Δ=−0.52 |

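Stacks under

  • #1068 (§19: task #132 correction)
  • #1067 (§18: training status snapshot)
  • Pairs with PR #1069 (wall_ms code half, which provided the JSONL field used for the GATE-GPUTRAIN-004 timing data)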

Coverage tally update

The coverage tally update is pending the contract bump for gpu-training-backend-v1.yaml, which moves GATE-GPUTRAIN-004 from PARTIAL_ALGORITHM_LEVEL to ACTIVE_WITH_LIVE_EVIDENCE. §20 records the data; the contract amendment, in a separate follow-up PR, captures the durable verdict.

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) April 26, 2026 09:17
@noahgift noahgift force-pushed the docs/ship-007-20-cuda-live-evidence branch from a8353bd to 1de30ff Compare April 26, 2026 11:14
noahgift added a commit that referenced this pull request Apr 26, 2026
…t 17× anomaly site — spec v2.65.0 → v2.66.0

§17.4 specified the falsifier next step as sub-layer bisection of
{ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added
the 4 new ActivationStats fields. §21 records the **first run of
the bisection on the canonical 7B teacher**.

## What §21 contains (8 subsections)

- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2×
  layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters — silu(g) and u individually normal at
  layer 3, but their elementwise product is 17× — implies an
  unusual positive correlation or alignment bug
- §21.5 Refined surviving suspect surface — element-wise multiply
  correctness (`inference.rs:163`) + off-by-one slice indexing as
  newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare
  APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on
  PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)

## Per-layer ffn_swigl progression (key data)

| Layer | ffn_swigl std |
|------:|--------------:|
| 0     | 0.088         |
| 1     | 0.061         |
| 2     | 0.071         |
| **3** | **1.222**     |  ← 17.2× layer 2
| 4     | 0.390         |
| 5-25  | ~0.15-0.55    |
| 26    | 1.452         |
| 27    | 2.247         |

Layer 3 stands out among the early and middle layers: layers 0-2 and
4-25 all sit in the 0.06-0.55 band, so the 1.22 value is anomalous.
Layers 26 and 27 are also elevated (1.452 and 2.247), but layer 3 is
the first site.
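
The "first anomaly site" reading of the table can be reproduced mechanically. The sketch below is an illustration only: the helper name `first_spike` and the 10× threshold are assumptions, and the input is just the five per-layer values listed individually in the table above (the 5-25 range row is omitted because it is not broken out per layer).

```python
def first_spike(stds, factor=10.0):
    """Return (layer, ratio) for the first layer whose std exceeds
    `factor` times the previous layer's std, or None if there is none."""
    for layer in range(1, len(stds)):
        ratio = stds[layer] / stds[layer - 1]
        if ratio > factor:
            return layer, round(ratio, 1)
    return None


# ffn_swigl std for layers 0..4, copied from the table above.
swigl_std = [0.088, 0.061, 0.071, 1.222, 0.390]
spike = first_spike(swigl_std)  # (3, 17.2)
```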

## Bug surface narrowing (across §15→§16→§17→§21)

- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)

The suspect surface is now `inference.rs:160-164`, specifically the
`ffn_hidden.push(silu_g * u)` element-wise multiply (a candidate, not
yet a confirmed fault; see §21.7).
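
To make the two named candidates concrete, here is a small Python analogue of the element-wise step. It is not the code in `inference.rs`, which this PR does not show; `silu` and `swiglu_hidden` are hypothetical names and the inputs are toy values. It shows the multiply itself and how an off-by-one view of `u` pairs each gate value with the wrong element while still producing a plausible-looking vector.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_hidden(gate, up):
    """Element-wise silu(gate[i]) * up[i]; rejects mismatched lengths
    so a mis-sliced `up` fails loudly instead of silently truncating."""
    if len(gate) != len(up):
        raise ValueError(f"length mismatch: gate={len(gate)} up={len(up)}")
    return [silu(g) * u for g, u in zip(gate, up)]


gate = [0.0, 1.0, -1.0]
up = [2.0, 3.0, 4.0]
aligned = swiglu_hidden(gate, up)
# An off-by-one view of `up`: lengths still match, values do not.
shifted = swiglu_hidden(gate[:-1], up[1:])
```

Because the shifted case passes the length check, a per-element comparison against a reference path (the APR vs GGUF comparison proposed in §21.6) is what would expose it.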

Spec v2.65.0 → v2.66.0. No coverage tally change — investigation-
recording, not a discharge.

Evidence persisted to:
- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv

Stacks under #1070 (§20) which is under #1068 (§19) which is under
#1067 (§18) which is under #1064 (§17) which is under #1063 (§16).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the docs/ship-007-20-cuda-live-evidence branch from e49cf4d to 5966029 Compare April 26, 2026 12:40
@noahgift noahgift force-pushed the docs/ship-007-20-cuda-live-evidence branch from 5966029 to a72f6f0 Compare April 26, 2026 13:04
noahgift added a commit that referenced this pull request Apr 26, 2026
…→ pass — spec §20 + #1059 evidence — v1.4.0 → v1.5.0

GATE-GPUTRAIN-004 (370M step-time budget < 500ms on RTX 4090) was
marked `verdict: pending` despite its paired falsification test
FALSIFY-GPUTRAIN-005 being DISCHARGED with median 101.30 ms
(20.3% of budget) since 2026-04-24.

This contract bump flips the gate to `verdict: pass` with a
`verdict_basis` field citing both:

1. **FALSIFY-GPUTRAIN-005 evidence** (canonical config seq_len=2048
   batch=1): median 101.30 ms across 25 steps on
   noah-Lambda-Vector RTX 4090 — `evidence/task-132/`.
2. **§20 evidence** (PR #1070, different config seq_len=512):
   median 264.74 ms across 100 steps — `evidence/task-132-residual-b/`.

Both well under the 500ms ceiling. Two evidence files at different
config bands demonstrate budget compliance is robust at this margin.
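
The budget fractions for the two evidence sets follow directly from the medians. A minimal sketch, with a hypothetical helper name and dictionary keys chosen for illustration:

```python
BUDGET_MS = 500.0  # GATE-GPUTRAIN-004 ceiling


def percent_of_budget(median_ms, budget_ms=BUDGET_MS):
    """Median step time as a percentage of the budget, to one decimal."""
    return round(100.0 * median_ms / budget_ms, 1)


# Medians quoted in this commit message.
evidence = {"FALSIFY-GPUTRAIN-005": 101.30, "section-20": 264.74}
usage = {name: percent_of_budget(ms) for name, ms in evidence.items()}
```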

Contract version v1.4.0 → v1.5.0 (additive metadata, no rule
change). `pv validate`: 0 errors, 0 warnings.

This is a contract-cosmetic flip: GATE-GPUTRAIN-004's underlying
invariant has been satisfied since 2026-04-24; the `verdict: pending`
field was stale because only the gate's own pointer to the evidence
was missing.

References:
- spec §20 (PR #1070): live evidence capture 2026-04-26
- spec §19.4 Residual B: this is the contractual durable verdict
- evidence/task-132/rtx4090-370m-step-budget-and-repro.json
- evidence/task-132-residual-b/cuda-50step-2026-04-26.json

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…→ pass — spec §20 + #1059 evidence — v1.4.0 → v1.5.0 (#1071)

…2.64.0 → v2.65.0

§19 verified `apr pretrain --device cuda` is wired but the canonical
apr binary lacked `--features cuda`. §20 records the next step:
**rebuild + live dispatch + evidence capture** on RTX 4090.

## What §20 contains (9 subsections)

1. §20.1 — Rebuild (40s incremental, `--features cuda` enabled apr-cli)
2. §20.2 — Live dispatch command + 100-step JSONL output
3. §20.3 — wall_ms statistics: median=264.74ms (47% headroom under
   GATE-GPUTRAIN-004's 500ms budget)
4. §20.4 — nvidia-smi PID 1658504 / 6636 MiB GPU memory captured mid-run
5. §20.5 — Gate-by-gate impact table (GATE-GPUTRAIN-002/003/004/005)
6. §20.6 — Evidence files at evidence/task-132-residual-b/
7. §20.7 — Long-path status: §19.5 step (a) DONE
8. §20.8 — What §20 is NOT (contract bump is follow-up PR)
9. §20.9 — Methodological alignment (live-evidence pattern, not chain-of-thought)

## Live evidence captured

- 100 real CUDA training steps on noah-Lambda-Vector RTX 4090
- Real corpus: /mnt/nvme-raid0/data/csn-python-shards
- Real tokenizer: /mnt/nvme-raid0/models/model-2-tokenizer-v1 (vocab=50,257)
- wall_ms median: 264.74 ms (range 257.86–467.66 with step 0 = 467.66
  kernel-warmup outlier)
- train_loss step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing)
- val_loss=10.31 triggered GATE-TRAIN-005 ship-blocker abort at epoch
  boundary (correct behavior for fresh-init 370M before convergence)
- nvidia-smi PID 1658504 / 6636 MiB stable mid-run
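
The two loss observations above reduce to a pair of simple checks. This sketch is illustrative: the function names are hypothetical, and the 10.0 ceiling is the value quoted for GATE-TRAIN-005 in the PR summary (val_loss=10.31 > 10.0), not read from the contract file.

```python
VAL_LOSS_CEILING = 10.0  # threshold quoted for GATE-TRAIN-005 in this PR


def loss_delta(first, last):
    """Change in train_loss from the first to the last step."""
    return round(last - first, 2)


def ship_blocked(val_loss, ceiling=VAL_LOSS_CEILING):
    """True when validation loss is above the ship-blocker ceiling."""
    return val_loss > ceiling


delta = loss_delta(11.02, 10.50)   # negative means the loss decreased
blocked = ship_blocked(10.31)
```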

## Spec progression

v2.64.0 → v2.65.0. Coverage tally update is **pending** the contract
bump for `gpu-training-backend-v1.yaml` GATE-GPUTRAIN-004
PARTIAL_ALGORITHM_LEVEL → ACTIVE_WITH_LIVE_EVIDENCE (separate
follow-up PR; §20 records the data, the contract amendment captures
the durable verdict).

## Stacks under

- #1068 (§19 — task #132 correction)
- #1067 (§18 — training status snapshot)
- Concrete progress on §19.4 Residual B (live evidence half)
- Pairs with PR #1069 (wall_ms code half — provided the JSONL field
  used for the GATE-GPUTRAIN-004 timing data)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the docs/ship-007-20-cuda-live-evidence branch from a72f6f0 to ea1d39d Compare April 26, 2026 13:31
@noahgift noahgift merged commit f1ab869 into main Apr 26, 2026
10 checks passed
@noahgift noahgift deleted the docs/ship-007-20-cuda-live-evidence branch April 26, 2026 13:48