docs(ship-two-001): §19 task #132 correction (CUDA training shipped) #1068

Merged
noahgift merged 2 commits into main from docs/ship-007-19-task-132-correction
Apr 26, 2026
@noahgift


Summary

What §19 contains (8 subsections)

  1. §19.1 What's actually on disk today (call graph: CLI → resolve_device → drive_real_cuda → CudaTransformerTrainer)
  2. §19.2 GPU kernels actually invoked (forward/backward/optimizer/loss/AMP all live in aprender-train/src/autograd/)
  3. §19.3 Smoke test on noah-Lambda-Vector — graceful GATE-GPUTRAIN-002 error when binary lacks --features cuda proves wiring exists
  4. §19.4 Residual work table — 3 real gaps (INV-TRAIN-003 sha256 / GATE-GPUTRAIN-004,005 live evidence / operator authorization)
  5. §19.5 Corrected §18.8 long path — much shorter than originally stated
  6. §19.6 Why §18.5 was wrong (inherited from stale memory entry)
  7. §19.7 No coverage tally change — but PARTIALs are now correctly scoped
  8. §19.8 Methodological lesson — binding rule per feedback_no_guessing.md: future §18-style status sections must re-verify memory-cited claims against code

Why this matters

The two-model goal becomes much more tractable than §18 implied. The 9 MODEL-2 PARTIALs are now correctly scoped: blocked on (a) rebuild apr with --features cuda [one-time], (b) two small PRs [Residuals A+B], (c) Stack v2 tokenization [data-engineering], (d) operator authorization [decision]. The "wire CUDA training" load-bearing complaint is resolved.

Stacks under

Test plan

  • §19 added at end of spec, before END OF SPECIFICATION marker
  • Atomic-next-action banner updated v2.63.0 → v2.64.0
  • PMAT pre-commit gates pass (complexity, SATD, docs)
  • Live smoke test on noah-Lambda-Vector confirmed graceful error message (proves CLI dispatch wired)
  • Memory entry project_task_132_cuda_training_backend_gap.md description + MEMORY.md index entry updated separately to prevent the stale-claim recurrence

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) April 26, 2026 08:16
@noahgift noahgift force-pushed the docs/ship-007-19-task-132-correction branch from 5dd4f85 to 8cf4edd Compare April 26, 2026 11:14
noahgift added a commit that referenced this pull request Apr 26, 2026
…2.64.0 → v2.65.0

§19 verified `apr pretrain --device cuda` is wired but the canonical
apr binary lacked `--features cuda`. §20 records the next step:
**rebuild + live dispatch + evidence capture** on RTX 4090.

## What §20 contains (9 subsections)

1. §20.1 — Rebuild (40s incremental, `--features cuda` enabled apr-cli)
2. §20.2 — Live dispatch command + 100-step JSONL output
3. §20.3 — wall_ms statistics: median=264.74ms (47% headroom under
   GATE-GPUTRAIN-004's 500ms budget)
4. §20.4 — nvidia-smi PID 1658504 / 6636 MiB GPU memory captured mid-run
5. §20.5 — Gate-by-gate impact table (GATE-GPUTRAIN-002/003/004/005)
6. §20.6 — Evidence files at evidence/task-132-residual-b/
7. §20.7 — Long-path status: §19.5 step (a) DONE
8. §20.8 — What §20 is NOT (contract bump is follow-up PR)
9. §20.9 — Methodological alignment (live-evidence pattern, not chain-of-thought)

## Live evidence captured

- 100 real CUDA training steps on noah-Lambda-Vector RTX 4090
- Real corpus: /mnt/nvme-raid0/data/csn-python-shards
- Real tokenizer: /mnt/nvme-raid0/models/model-2-tokenizer-v1 (vocab=50,257)
- wall_ms median: 264.74 ms (range 257.86–467.66 with step 0 = 467.66
  kernel-warmup outlier)
- train_loss step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing)
- val_loss=10.31 triggered GATE-TRAIN-005 ship-blocker abort at epoch
  boundary (correct behavior for fresh-init 370M before convergence)
- nvidia-smi PID 1658504 / 6636 MiB stable mid-run
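
The timing and loss figures above can be rechecked with a few lines of arithmetic. The following is a minimal sketch, not the project's evidence tooling: `median_ms` and `headroom_pct` are hypothetical helper names, and the input is only the three timings quoted above (min, median, max), not the full 100-step JSONL log.

```rust
/// Median of a slice of step timings in milliseconds.
/// Returns None for an empty slice.
fn median_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp is a total order on f64, so the sort cannot panic on NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}

/// Headroom under a budget, expressed as a percentage of the budget.
fn headroom_pct(observed_ms: f64, budget_ms: f64) -> f64 {
    (budget_ms - observed_ms) / budget_ms * 100.0
}

fn main() {
    // Only the three wall_ms values quoted above (hypothetical stand-in for the log).
    let quoted = [467.66, 257.86, 264.74];
    let median = median_ms(&quoted).expect("non-empty sample");
    // 500 ms is the GATE-GPUTRAIN-004 budget stated above.
    let headroom = headroom_pct(median, 500.0);
    println!("median = {median:.2} ms, headroom = {headroom:.1}%");

    // Loss delta between the quoted step 0 and step 99 values.
    let delta = 10.50_f64 - 11.02_f64;
    println!("train_loss delta = {delta:.2}");
}
```

Feeding the real per-step `wall_ms` values from the JSONL output into `median_ms` would reproduce the median directly; that parsing step is left out here because the log format is not shown.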

## Spec progression

v2.64.0 → v2.65.0. Coverage tally update is **pending** the contract
bump for `gpu-training-backend-v1.yaml` GATE-GPUTRAIN-004
PARTIAL_ALGORITHM_LEVEL → ACTIVE_WITH_LIVE_EVIDENCE (separate
follow-up PR; §20 records the data, the contract amendment captures
the durable verdict).

## Stacks under

- #1068 (§19 — task #132 correction)
- #1067 (§18 — training status snapshot)
- Concrete progress on §19.4 Residual B (live evidence half)
- Pairs with PR #1069 (wall_ms code half — provided the JSONL field
  used for the GATE-GPUTRAIN-004 timing data)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…t 17× anomaly site — spec v2.65.0 → v2.66.0

§17.4 specified the falsifier next step as sub-layer bisection of
{ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added
the 4 new ActivationStats fields. §21 records the **first run of
the bisection on the canonical 7B teacher**.

## What §21 contains (8 subsections)

- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2×
  layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters — silu(g) and u individually normal at
  layer 3, but their elementwise product is 17× — implies an
  unusual positive correlation or alignment bug
- §21.5 Refined surviving suspect surface — element-wise multiply
  correctness (`inference.rs:163`) + off-by-one slice indexing as
  newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare
  APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on
  PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)

## Per-layer ffn_swigl progression (key data)

| Layer | ffn_swigl std | Note          |
|------:|--------------:|:--------------|
| 0     | 0.088         |               |
| 1     | 0.061         |               |
| 2     | 0.071         |               |
| **3** | **1.222**     | 17.2× layer 2 |
| 4     | 0.390         |               |
| 5-25  | ~0.15-0.55    |               |
| 26    | 1.452         |               |
| 27    | 2.247         |               |

Layer 3 stands out among the early and middle layers: layers 0-2 and
4-25 all sit in the 0.06-0.55 band, so the 1.22 value is anomalous.
Only the final two layers (26 and 27) reach comparable or higher values.
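
The layer-over-layer comparison behind the 17.2× figure can be sketched as follows. This is an illustrative helper, not the trace tooling used for the bisection: `first_spike` is a hypothetical name, and the 10× threshold is an assumption chosen for the example, not a value from the spec.

```rust
/// Returns the first layer whose std exceeds `factor` times the previous
/// layer's std, together with the observed ratio.
/// A previous std of zero yields an infinite or NaN ratio; callers with
/// such data would need to handle that case separately.
fn first_spike(stds: &[f64], factor: f64) -> Option<(usize, f64)> {
    stds.windows(2).enumerate().find_map(|(i, pair)| {
        let ratio = pair[1] / pair[0];
        if ratio > factor {
            Some((i + 1, ratio))
        } else {
            None
        }
    })
}

fn main() {
    // ffn_swigl std for layers 0 through 4, copied from the table above.
    let stds = [0.088, 0.061, 0.071, 1.222, 0.390];
    // Illustrative threshold (assumption).
    match first_spike(&stds, 10.0) {
        Some((layer, ratio)) => {
            println!("first spike at layer {layer}: {ratio:.1}x previous layer")
        }
        None => println!("no spike above threshold"),
    }
}
```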

## Bug surface narrowing (across §15→§16→§17→§21)

- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)

The suspect surface is now `inference.rs:160-164`, specifically the
`ffn_hidden.push(silu_g * u)` element-wise multiply; per §21.7 this
is not yet pinned to a single code line.
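
For readers unfamiliar with the operation under suspicion, here is a minimal sketch of an element-wise SwiGLU step. It is not the code at `inference.rs:160-164`: the function names are hypothetical, the length check is one possible guard against the misalignment class of bug named in §21.5, and `std_dev` assumes a population standard deviation, which may differ from how ActivationStats computes it.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU: out[i] = silu(gate[i]) * up[i].
/// Returns an error rather than silently truncating when lengths differ.
fn swiglu(gate: &[f32], up: &[f32]) -> Result<Vec<f32>, String> {
    if gate.len() != up.len() {
        return Err(format!(
            "length mismatch: gate={} up={}",
            gate.len(),
            up.len()
        ));
    }
    Ok(gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect())
}

/// Population standard deviation (assumed definition, see lead-in).
fn std_dev(values: &[f32]) -> f32 {
    if values.is_empty() {
        return 0.0;
    }
    let n = values.len() as f32;
    let mean = values.iter().sum::<f32>() / n;
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    var.sqrt()
}

fn main() {
    // Toy inputs for illustration only; not activations from the 7B teacher.
    let gate = [0.5_f32, -1.0, 2.0, 0.0];
    let up = [1.0_f32, 0.5, -0.25, 3.0];
    let hidden = swiglu(&gate, &up).expect("equal lengths");
    println!("swiglu = {hidden:?}, std = {}", std_dev(&hidden));
}
```

Running the same computation on gate and up slices captured from both the APR and GGUF paths would be one way to compare the layer-3 product directly, in the spirit of the §21.6 next step.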

Spec v2.65.0 → v2.66.0. No coverage tally change — investigation-
recording, not a discharge.

Evidence persisted to:
- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv

Stacks under #1070 (§20) which is under #1068 (§19) which is under
#1067 (§18) which is under #1064 (§17) which is under #1063 (§16).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift and others added 2 commits April 26, 2026 14:40
…t — spec v2.62.0 → v2.63.0

§18 walks the deduction chain that connects the spec's two-model goal
to the current state, so future sessions can re-enter the work without
re-reading every prior section.

## Section structure

- §18.1 Why are we training models at all?
- §18.2 What does "DISCHARGED" mean here, and where are we?
- §18.3 MODEL-1 — five fully discharged, five blocked on one bug
- §18.4 MODEL-2 — three discharged, nine blocked on convergence
- §18.5 What's blocking the convergence run?
- §18.6 The SHIP-007 narrowing — chain of deductions (ASCII diagram)
- §18.7 What "knowing" looks like at each step
- §18.8 What's the next observable state-change? (two parallel paths)
- §18.9 Methodological invariant (5-step loop)

## Key durable facts captured

- Coverage tally: 33 PARTIAL + 12 DISCHARGED across 45 levers
- MODEL-1: 5/10 ACs DISCHARGED (SHIP-001/003/004/009/010); 5/10 PARTIAL
  all transitively gated on SHIP-007
- MODEL-2: 3/12 ACs DISCHARGED (SHIP-011/021/022); 9/12 PARTIAL gated
  on a converged 370M run; the convergence run itself is blocked at
  task #132 (`apr pretrain --device cuda` not yet wired through
  `TransformerTrainer::new`)
- GPUTRAIN suite: 7/7 DISCHARGED (full closure)
- SHIP-007 narrowing diagram: §15 hypothesis → §15.4 GPU GQA kernel
  ELIMINATED → §16 GPU stack ELIMINATED (CPU APR vs CPU GGUF
  divergence) → §17 layer 3 FFN sub-block named (53× spike) →
  sub-FFN bisection (PR #1066)
- Two parallel paths to next observable state-change: (a) short —
  SHIP-007 bug site named via sub-FFN trace; (b) long — task #132 +
  Stack v2 tokenization + convergence

## What this is NOT

This is investigation-recording, not a discharge. Coverage tally
unchanged. The chain-of-thought view is intentionally narrative-style
to make the deduction structure obvious to future sessions and to
external readers asking "where are we?"

Spec v2.62.0 → v2.63.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pped) — spec v2.63.0 → v2.64.0

§18.5 (just landed in PR #1067) stated:

> Training compute is the real risk — `apr pretrain --device cuda`
> is NOT functional today (task #132).

A sub-agent investigation on 2026-04-26 found this premise was about
5 days out of date: task #132 closed at commit f7ad114
(2026-04-21).

## What's actually on disk

The CLI dispatch path is wired:
  apr pretrain --device {cpu|cuda|auto}
    → resolve_device() (entrenar::train::device, train/device.rs:110)
    → drive_real(...) (apr-cli/src/commands/pretrain.rs:252-301)
       ├── Device::Cuda → drive_real_cuda(...) (pretrain.rs:336-364)
       │     → CudaTransformerTrainer::new(cfg)
       │       (transformer_trainer/cuda_trainer.rs:2156-2244)
       └── Device::Cpu → drive_real_cpu(...) (pretrain.rs:307-325)
             → TransformerTrainer::new(cfg)

GPU kernels invoked from cuda branch (all in aprender-train/src/autograd/):
- forward: gemm_forward, rms_norm_forward, pre_warm_forward_kernels
- backward: gemm_backward_a/b, rms_norm_backward
- optimizer/loss: adamw_step_cuda, fused_cross_entropy_cuda
- AMP: GradScaler

Device-to-host (D2H) transfer per step is bounded to ~512 B
(loss_partials). AdamW state lives on GPU.

## Live smoke test on noah-Lambda-Vector RTX 4090

  $ /mnt/nvme-raid0/targets/aprender/release/apr pretrain \
      --dataset ... --tokenizer ... --run-dir ... \
      --device cuda --synthetic --num-steps 4 --json
  error: --device `cuda` requested but CUDA runtime is not available
  on this host (contract gpu-training-backend-v1 GATE-GPUTRAIN-002:
  no silent CPU fallback). Rebuild with `--features cuda` or pass
  `--device cpu`.

The graceful contract-cited error proves: CLI parses --device cuda
correctly; dispatch path emits GATE-GPUTRAIN-002 when binary lacks
the `cuda` feature. Per `feedback_cuda_feature_footgun.md`, this
is a rebuild-time issue, not a code-architecture gap.
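
The "no silent CPU fallback" behaviour can be sketched as below. This is a simplified illustration, not the actual `resolve_device()` in `train/device.rs`: the signature, the `cuda_available` flag, and the assumption that `auto` picks CUDA when available and CPU otherwise are all hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Device {
    Cpu,
    Cuda,
}

/// Resolve a requested device string against runtime availability.
/// An explicit `cuda` request never falls back to CPU; it errors instead.
fn resolve_device(requested: &str, cuda_available: bool) -> Result<Device, String> {
    match (requested, cuda_available) {
        ("cpu", _) => Ok(Device::Cpu),
        ("cuda", true) => Ok(Device::Cuda),
        ("cuda", false) => Err(
            "--device `cuda` requested but CUDA runtime is not available \
             (GATE-GPUTRAIN-002: no silent CPU fallback)"
                .to_string(),
        ),
        // Assumed semantics for `auto`: prefer CUDA, otherwise CPU.
        ("auto", true) => Ok(Device::Cuda),
        ("auto", false) => Ok(Device::Cpu),
        (other, _) => Err(format!("unknown device `{other}`")),
    }
}

fn main() {
    for (req, avail) in [("cuda", false), ("cuda", true), ("auto", false), ("cpu", true)] {
        println!(
            "{req} (cuda_available={avail}) -> {:?}",
            resolve_device(req, avail)
        );
    }
}
```

Returning an error for the explicit `cuda` case keeps a misbuilt binary from quietly training on CPU, which is the failure the smoke test above exercises.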

## Three real residuals (post-§19)

(A) INV-TRAIN-003 GPU AdamW-state sha256 — small PR
(B) GATE-GPUTRAIN-004/005 live evidence — small PR + operator dispatch
(C) Operator authorization for 10K-step run — decision, not engineering

## Methodological lesson (§19.8)

§15→§17 narrowing was good chain-of-thought (each deduction a
falsifiable result on live evidence). §18.5 was bad chain-of-
thought — the premise was inherited from a stale memory entry
without re-verification. Going forward (per
`feedback_no_guessing.md`): when a §18-style status snapshot
cites a memory entry as evidence for a gap, the memory entry's
claims must be re-verified against the code at write-time.

Spec v2.63.0 → v2.64.0. No coverage tally change.

Memory entry `project_task_132_cuda_training_backend_gap.md`
description updated separately to reflect the closed status.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the docs/ship-007-19-task-132-correction branch from 0fe9cb6 to ab42eac Compare April 26, 2026 12:40
@noahgift noahgift merged commit 0d58387 into main Apr 26, 2026
10 checks passed
@noahgift noahgift deleted the docs/ship-007-19-task-132-correction branch April 26, 2026 13:01
noahgift added a commit that referenced this pull request Apr 26, 2026
…2.64.0 → v2.65.0 (#1070)
