docs(ship-two-001): §18 training status snapshot as chain-of-thought #1067

Closed
noahgift wants to merge 1 commit into main from docs/ship-007-18-training-status-cot

@noahgift


Summary

  • Adds §18 of ship-two-models-spec.md as a chain-of-thought training status snapshot.
  • Walks the deduction chain that connects the spec's two-model goal to the current state, so future sessions can re-enter the work without re-reading every prior section.
  • Spec v2.62.0 → v2.63.0. No coverage tally change — chain-of-thought recording, not a discharge.

What §18 contains (9 subsections)

  1. §18.1 Why are we training models at all? (the Sovereign AI Stack Proof framing)
  2. §18.2 What does "DISCHARGED" mean here? (PARTIAL_ALGORITHM_LEVEL vs DISCHARGED vs unbound)
  3. §18.3 MODEL-1 status table (5/10 DISCHARGED, 5/10 PARTIAL — all gated on §17)
  4. §18.4 MODEL-2 status table (3/12 DISCHARGED, 9/12 PARTIAL — all gated on convergence)
  5. §18.5 What's blocking the convergence run? (chain backward to task #132)
  6. §18.6 The SHIP-007 narrowing — chain of deductions (ASCII diagram showing §15 → §15.4 → §16 → §17)
  7. §18.7 What "knowing" looks like at each step (Genchi Genbutsu / falsifiability discipline)
  8. §18.8 What's the next observable state-change? (two parallel paths: short SHIP-007 fix vs long MODEL-2 convergence)
  9. §18.9 Methodological invariant (5-step loop: live evidence → contract → drift-prevention test → spec amendment → auto-merge PR)

Why the chain-of-thought style

The spec has grown to 17 prior sections. A reader landing here cold needs the deduction structure, not just the facts. §18 makes the conditional reasoning explicit: "if you accept §15.4, then §16's hypothesis is justified; if you accept §16, then §17 narrows; if §17, then sub-FFN bisection (PR #1066) is the load-bearing next step."

Stacks under

Test plan

  • §18 added at end of spec, before END OF SPECIFICATION marker
  • Atomic-next-action banner updated v2.62.0 → v2.63.0
  • PMAT pre-commit gates pass (complexity, SATD, docs)
  • All status numbers in §18 match prior sections (33+12 coverage; 5/10 MODEL-1; 3/12 MODEL-2; 7/7 GPUTRAIN)

🤖 Generated with Claude Code

@noahgift enabled auto-merge (squash) April 26, 2026 08:08
@noahgift force-pushed the docs/ship-007-18-training-status-cot branch from 7bcab89 to f78ec68 April 26, 2026 11:13
noahgift added a commit that referenced this pull request Apr 26, 2026
…pped) — spec v2.63.0 → v2.64.0

§18.5 (just landed in PR #1067) stated:

> Training compute is the real risk — `apr pretrain --device cuda`
> is NOT functional today (task #132).

A sub-agent investigation on 2026-04-26 confirmed this premise was
outdated by ~5 days. Task #132 closed at commit f7ad114
(2026-04-21).

## What's actually on disk

The CLI dispatch path is wired:
  apr pretrain --device {cpu|cuda|auto}
    → resolve_device() (entrenar::train::device, train/device.rs:110)
    → drive_real(...) (apr-cli/src/commands/pretrain.rs:252-301)
       ├── Device::Cuda → drive_real_cuda(...) (pretrain.rs:336-364)
       │     → CudaTransformerTrainer::new(cfg)
       │       (transformer_trainer/cuda_trainer.rs:2156-2244)
       └── Device::Cpu → drive_real_cpu(...) (pretrain.rs:307-325)
             → TransformerTrainer::new(cfg)

GPU kernels invoked from the CUDA branch (all in aprender-train/src/autograd/):
- forward: gemm_forward, rms_norm_forward, pre_warm_forward_kernels
- backward: gemm_backward_a/b, rms_norm_backward
- optimizer/loss: adamw_step_cuda, fused_cross_entropy_cuda
- AMP: GradScaler

Device-to-host (D2H) transfer per step is bounded to ~512 B
(loss_partials). AdamW state stays on the GPU.

## Live smoke test on noah-Lambda-Vector RTX 4090

  $ /mnt/nvme-raid0/targets/aprender/release/apr pretrain \
      --dataset ... --tokenizer ... --run-dir ... \
      --device cuda --synthetic --num-steps 4 --json
  error: --device `cuda` requested but CUDA runtime is not available
  on this host (contract gpu-training-backend-v1 GATE-GPUTRAIN-002:
  no silent CPU fallback). Rebuild with `--features cuda` or pass
  `--device cpu`.

The graceful, contract-citing error shows two things: the CLI parses
`--device cuda` correctly, and the dispatch path emits
GATE-GPUTRAIN-002 when the binary lacks the `cuda` feature. Per
`feedback_cuda_feature_footgun.md`, this is a rebuild-time issue,
not a code-architecture gap.
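
The "no silent CPU fallback" rule can be sketched as a small resolution function. This is a minimal illustration, not the actual `resolve_device()` in `train/device.rs`: the `Requested`, `Device`, and `CudaUnavailable` types, the `resolve` name, and the `auto` behaviour (use CUDA when available, otherwise CPU) are all assumptions made for the example.

```rust
use std::fmt;

/// Device requested on the command line (hypothetical mirror of
/// `--device {cpu|cuda|auto}`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Requested {
    Cpu,
    Cuda,
    Auto,
}

/// Device actually used for training.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Device {
    Cpu,
    Cuda,
}

/// Error returned when CUDA was requested explicitly but is unavailable.
#[derive(Debug, PartialEq, Eq)]
pub struct CudaUnavailable;

impl fmt::Display for CudaUnavailable {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "--device `cuda` requested but CUDA runtime is not available \
             (GATE-GPUTRAIN-002: no silent CPU fallback)"
        )
    }
}

/// Resolve the requested device. `cuda_available` stands in for whatever
/// the real binary checks (assumed here: built with the `cuda` feature and
/// a usable runtime).
pub fn resolve(requested: Requested, cuda_available: bool) -> Result<Device, CudaUnavailable> {
    match (requested, cuda_available) {
        (Requested::Cpu, _) => Ok(Device::Cpu),
        (Requested::Cuda, true) => Ok(Device::Cuda),
        // An explicit CUDA request never degrades to CPU.
        (Requested::Cuda, false) => Err(CudaUnavailable),
        // Assumption: `auto` is the only mode allowed to pick CPU on its own.
        (Requested::Auto, true) => Ok(Device::Cuda),
        (Requested::Auto, false) => Ok(Device::Cpu),
    }
}
```

Returning an error for the explicit-CUDA case, rather than a warning plus CPU training, is what makes a missing `--features cuda` build visible at dispatch time instead of showing up later as an unexpectedly slow run.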

## Three real residuals (post-§19)

(A) INV-TRAIN-003 GPU AdamW-state sha256 — small PR
(B) GATE-GPUTRAIN-004/005 live evidence — small PR + operator dispatch
(C) Operator authorization for 10K-step run — decision, not engineering

## Methodological lesson (§19.8)

§15→§17 narrowing was good chain-of-thought (each deduction a
falsifiable result on live evidence). §18.5 was bad chain-of-
thought — the premise was inherited from a stale memory entry
without re-verification. Going forward (per
`feedback_no_guessing.md`): when a §18-style status snapshot
cites a memory entry as evidence for a gap, the memory entry's
claims must be re-verified against the code at write-time.

Spec v2.63.0 → v2.64.0. No coverage tally change.

Memory entry `project_task_132_cuda_training_backend_gap.md`
description updated separately to reflect the closed status.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…2.64.0 → v2.65.0

§19 verified `apr pretrain --device cuda` is wired but the canonical
apr binary lacked `--features cuda`. §20 records the next step:
**rebuild + live dispatch + evidence capture** on RTX 4090.

## What §20 contains (9 subsections)

1. §20.1 — Rebuild (40s incremental, `--features cuda` enabled apr-cli)
2. §20.2 — Live dispatch command + 100-step JSONL output
3. §20.3 — wall_ms statistics: median=264.74ms (47% headroom under
   GATE-GPUTRAIN-004's 500ms budget)
4. §20.4 — nvidia-smi PID 1658504 / 6636 MiB GPU memory captured mid-run
5. §20.5 — Gate-by-gate impact table (GATE-GPUTRAIN-002/003/004/005)
6. §20.6 — Evidence files at evidence/task-132-residual-b/
7. §20.7 — Long-path status: §19.5 step (a) DONE
8. §20.8 — What §20 is NOT (contract bump is follow-up PR)
9. §20.9 — Methodological alignment (live-evidence pattern, not chain-of-thought)

## Live evidence captured

- 100 real CUDA training steps on noah-Lambda-Vector RTX 4090
- Real corpus: /mnt/nvme-raid0/data/csn-python-shards
- Real tokenizer: /mnt/nvme-raid0/models/model-2-tokenizer-v1 (vocab=50,257)
- wall_ms median: 264.74 ms (range 257.86–467.66 with step 0 = 467.66
  kernel-warmup outlier)
- train_loss step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing)
- val_loss=10.31 triggered GATE-TRAIN-005 ship-blocker abort at epoch
  boundary (correct behavior for fresh-init 370M before convergence)
- nvidia-smi PID 1658504 / 6636 MiB stable mid-run
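
The headroom figure follows from the median and the 500 ms budget: (500 − 264.74) / 500 ≈ 47%. A minimal sketch of that arithmetic is below; the function names are hypothetical, and the real evidence pipeline may compute its statistics differently.

```rust
/// Median of per-step wall-clock timings in milliseconds; `None` if empty.
pub fn median_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order on f64, so sorting cannot panic on NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    })
}

/// Fraction of the budget left unused by `observed_ms`
/// (0.47 means 47% headroom). Negative when over budget.
pub fn headroom(observed_ms: f64, budget_ms: f64) -> f64 {
    (budget_ms - observed_ms) / budget_ms
}
```

Using the median rather than the mean keeps the step-0 kernel-warmup outlier (467.66 ms) from skewing the figure compared against the gate.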

## Spec progression

v2.64.0 → v2.65.0. Coverage tally update is **pending** the contract
bump for `gpu-training-backend-v1.yaml` GATE-GPUTRAIN-004
PARTIAL_ALGORITHM_LEVEL → ACTIVE_WITH_LIVE_EVIDENCE (separate
follow-up PR; §20 records the data, the contract amendment captures
the durable verdict).

## Stacks under

- #1068 (§19 — task #132 correction)
- #1067 (§18 — training status snapshot)
- Concrete progress on §19.4 Residual B (live evidence half)
- Pairs with PR #1069 (wall_ms code half — provided the JSONL field
  used for the GATE-GPUTRAIN-004 timing data)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…t 17× anomaly site — spec v2.65.0 → v2.66.0

§17.4 specified the falsifier next step as sub-layer bisection of
{ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added
the 4 new ActivationStats fields. §21 records the **first run of
the bisection on the canonical 7B teacher**.

## What §21 contains (8 subsections)

- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2×
  layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters — silu(g) and u individually normal at
  layer 3, but their elementwise product is 17× — implies an
  unusual positive correlation or alignment bug
- §21.5 Refined surviving suspect surface — element-wise multiply
  correctness (`inference.rs:163`) + off-by-one slice indexing as
  newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare
  APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on
  PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)

## Per-layer ffn_swigl progression (key data)

| Layer | ffn_swigl std | Note          |
|------:|--------------:|:--------------|
| 0     | 0.088         |               |
| 1     | 0.061         |               |
| 2     | 0.071         |               |
| **3** | **1.222**     | 17.2× layer 2 |
| 4     | 0.390         |               |
| 5-25  | ~0.15-0.55    |               |
| 26    | 1.452         |               |
| 27    | 2.247         |               |

Layer 3 stands out among the early layers: layers 0-2 and 4-25 all
sit in the 0.06-0.55 band, so the 1.22 value is anomalous. Only the
final layers 26-27 exceed it.
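
One way to flag the first anomaly site mechanically is to scan for the first layer whose std jumps by more than some factor over the previous layer. The sketch below is illustrative only: `first_spike` is a hypothetical name, the threshold is a free parameter, and the actual bisection tooling may use a different criterion.

```rust
/// Returns `(layer, ratio)` for the first layer whose std exceeds
/// `threshold` times the previous layer's std, or `None` if no layer does.
/// `stds[i]` is the std for layer `i`.
pub fn first_spike(stds: &[f64], threshold: f64) -> Option<(usize, f64)> {
    stds.windows(2)
        .enumerate()
        // Window i covers layers (i, i + 1); report the later layer.
        .map(|(i, w)| (i + 1, w[1] / w[0]))
        .find(|&(_, ratio)| ratio > threshold)
}
```

Applied to the layer 0-4 values from the table (0.088, 0.061, 0.071, 1.222, 0.390) with a threshold of 10, this reports layer 3 with a ratio of about 17.2. A layer-to-layer ratio only catches abrupt jumps, so it would not by itself flag the gradual rise toward layers 26-27.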

## Bug surface narrowing (across §15→§16→§17→§21)

- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)

The surviving suspect surface is now `inference.rs:160-164`,
specifically the `ffn_hidden.push(silu_g * u)` element-wise multiply;
per §21.7 this does not yet pin the bug to a code line.
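
For reference, the operation under suspicion is small enough to write out. The sketch below is a standalone reference version of the element-wise SwiGLU product with a std helper, not the code in `inference.rs`: the function names are hypothetical, and using the population (rather than sample) standard deviation is an assumption about how the telemetry is computed.

```rust
/// SiLU activation: x * sigmoid(x).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU product `silu(gate[i]) * up[i]`.
/// Returns `None` when the slices differ in length, so a mis-sized or
/// shifted slice is rejected instead of being silently truncated by `zip`.
pub fn swiglu(gate: &[f32], up: &[f32]) -> Option<Vec<f32>> {
    if gate.len() != up.len() {
        return None;
    }
    Some(gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect())
}

/// Population standard deviation (assumed convention); 0.0 for empty input.
pub fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    var.sqrt()
}
```

A reference like this could be fed identical layer-3 gate and up vectors from both the APR and GGUF paths; if the resulting stds agree, the multiply itself is unlikely to be at fault and attention shifts to how the inputs are sliced.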

Spec v2.65.0 → v2.66.0. No coverage tally change — investigation-
recording, not a discharge.

Evidence persisted to:
- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv

Stacks under #1070 (§20) which is under #1068 (§19) which is under
#1067 (§18) which is under #1064 (§17) which is under #1063 (§16).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t — spec v2.62.0 → v2.63.0

§18 walks the deduction chain that connects the spec's two-model goal
to the current state, so future sessions can re-enter the work without
re-reading every prior section.

## Section structure

- §18.1 Why are we training models at all?
- §18.2 What does "DISCHARGED" mean here, and where are we?
- §18.3 MODEL-1 — five fully discharged, five blocked on one bug
- §18.4 MODEL-2 — three discharged, nine blocked on convergence
- §18.5 What's blocking the convergence run?
- §18.6 The SHIP-007 narrowing — chain of deductions (ASCII diagram)
- §18.7 What "knowing" looks like at each step
- §18.8 What's the next observable state-change? (two parallel paths)
- §18.9 Methodological invariant (5-step loop)

## Key durable facts captured

- Coverage tally: 33 PARTIAL + 12 DISCHARGED across 45 levers
- MODEL-1: 5/10 ACs DISCHARGED (SHIP-001/003/004/009/010); 5/10 PARTIAL
  all transitively gated on SHIP-007
- MODEL-2: 3/12 ACs DISCHARGED (SHIP-011/021/022); 9/12 PARTIAL gated
  on a converged 370M run; the convergence run itself is blocked at
  task #132 (`apr pretrain --device cuda` not yet wired through
  `TransformerTrainer::new`)
- GPUTRAIN suite: 7/7 DISCHARGED (full closure)
- SHIP-007 narrowing diagram: §15 hypothesis → §15.4 GPU GQA kernel
  ELIMINATED → §16 GPU stack ELIMINATED (CPU APR vs CPU GGUF
  divergence) → §17 layer 3 FFN sub-block named (53× spike) →
  sub-FFN bisection (PR #1066)
- Two parallel paths to next observable state-change: (a) short —
  SHIP-007 bug site named via sub-FFN trace; (b) long — task #132 +
  Stack v2 tokenization + convergence

## What this is NOT

This is investigation-recording, not a discharge. Coverage tally
unchanged. The chain-of-thought view is intentionally narrative-style
to make the deduction structure obvious to future sessions and to
external readers asking "where are we?"

Spec v2.62.0 → v2.63.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift force-pushed the docs/ship-007-18-training-status-cot branch from 57657e9 to 06505b1 April 26, 2026 12:39
noahgift added a commit that referenced this pull request Apr 26, 2026
…1068)

* docs(ship-two-001): §18 — training status snapshot as chain-of-thought — spec v2.62.0 → v2.63.0


* docs(ship-two-001): §19 — task #132 correction (CUDA training has shipped) — spec v2.63.0 → v2.64.0

§18.5 (just landed in PR #1067) stated:

> Training compute is the real risk — `apr pretrain --device cuda`
> is NOT functional today (task #132).

A sub-agent investigation on 2026-04-26 confirmed this premise was
outdated by ~5 days. Task #132 closed at commit f7ad114
(2026-04-21).

## What's actually on disk

The CLI dispatch path is wired:
  apr pretrain --device {cpu|cuda|auto}
    → resolve_device() (entrenar::train::device, train/device.rs:110)
    → drive_real(...) (apr-cli/src/commands/pretrain.rs:252-301)
       ├── Device::Cuda → drive_real_cuda(...) (pretrain.rs:336-364)
       │     → CudaTransformerTrainer::new(cfg)
       │       (transformer_trainer/cuda_trainer.rs:2156-2244)
       └── Device::Cpu → drive_real_cpu(...) (pretrain.rs:307-325)
             → TransformerTrainer::new(cfg)

GPU kernels invoked from cuda branch (all in aprender-train/src/autograd/):
- forward: gemm_forward, rms_norm_forward, pre_warm_forward_kernels
- backward: gemm_backward_a/b, rms_norm_backward
- optimizer/loss: adamw_step_cuda, fused_cross_entropy_cuda
- AMP: GradScaler

D2H per step bounded to ~512 B (loss_partials). AdamW state lives
on GPU.

## Live smoke test on noah-Lambda-Vector RTX 4090

  $ /mnt/nvme-raid0/targets/aprender/release/apr pretrain \
      --dataset ... --tokenizer ... --run-dir ... \
      --device cuda --synthetic --num-steps 4 --json
  error: --device `cuda` requested but CUDA runtime is not available
  on this host (contract gpu-training-backend-v1 GATE-GPUTRAIN-002:
  no silent CPU fallback). Rebuild with `--features cuda` or pass
  `--device cpu`.

The graceful contract-cited error proves: CLI parses --device cuda
correctly; dispatch path emits GATE-GPUTRAIN-002 when binary lacks
the `cuda` feature. Per `feedback_cuda_feature_footgun.md`, this
is a rebuild-time issue, not a code-architecture gap.

## Three real residuals (post-§19)

(A) INV-TRAIN-003 GPU AdamW-state sha256 — small PR
(B) GATE-GPUTRAIN-004/005 live evidence — small PR + operator dispatch
(C) Operator authorization for 10K-step run — decision, not engineering

## Methodological lesson (§19.8)

§15→§17 narrowing was good chain-of-thought (each deduction a
falsifiable result on live evidence). §18.5 was bad
chain-of-thought: the premise was inherited from a stale memory
entry without re-verification. Going forward (per
`feedback_no_guessing.md`): when a §18-style status snapshot
cites a memory entry as evidence for a gap, the memory entry's
claims must be re-verified against the code at write-time.

Spec v2.63.0 → v2.64.0. No coverage tally change.

Memory entry `project_task_132_cuda_training_backend_gap.md`
description updated separately to reflect the closed status.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift

Contributor Author

§18 content is already on main via PR #1068's merge (which had §18 stacked). Closing as redundant.

@noahgift noahgift closed this Apr 26, 2026
auto-merge was automatically disabled April 26, 2026 13:04

Pull request was closed

noahgift added a commit that referenced this pull request Apr 26, 2026
…2.64.0 → v2.65.0

§19 verified `apr pretrain --device cuda` is wired but the canonical
apr binary lacked `--features cuda`. §20 records the next step:
**rebuild + live dispatch + evidence capture** on RTX 4090.

## What §20 contains (9 subsections)

1. §20.1 — Rebuild (40s incremental, `--features cuda` enabled apr-cli)
2. §20.2 — Live dispatch command + 100-step JSONL output
3. §20.3 — wall_ms statistics: median=264.74ms (47% headroom under
   GATE-GPUTRAIN-004's 500ms budget)
4. §20.4 — nvidia-smi PID 1658504 / 6636 MiB GPU memory captured mid-run
5. §20.5 — Gate-by-gate impact table (GATE-GPUTRAIN-002/003/004/005)
6. §20.6 — Evidence files at evidence/task-132-residual-b/
7. §20.7 — Long-path status: §19.5 step (a) DONE
8. §20.8 — What §20 is NOT (contract bump is follow-up PR)
9. §20.9 — Methodological alignment (live-evidence pattern, not chain-of-thought)

## Live evidence captured

- 100 real CUDA training steps on noah-Lambda-Vector RTX 4090
- Real corpus: /mnt/nvme-raid0/data/csn-python-shards
- Real tokenizer: /mnt/nvme-raid0/models/model-2-tokenizer-v1 (vocab=50,257)
- wall_ms median: 264.74 ms (range 257.86–467.66 ms; the 467.66 ms
  maximum is step 0, a kernel-warmup outlier)
- train_loss step 0=11.02 → step 99=10.50 (Δ=−0.52, decreasing)
- val_loss=10.31 triggered GATE-TRAIN-005 ship-blocker abort at epoch
  boundary (correct behavior for fresh-init 370M before convergence)
- nvidia-smi PID 1658504 / 6636 MiB stable mid-run
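
The 47% headroom figure follows from the reported median and the 500 ms budget. A minimal sketch of that arithmetic, assuming the per-step `wall_ms` values have already been parsed from the JSONL output (the function names are hypothetical, not part of `apr`):

```rust
/// Median of a non-empty sample. Sorts a copy and averages the two
/// middle values when the length is even.
pub fn median(samples: &[f64]) -> f64 {
    assert!(!samples.is_empty(), "median of an empty sample is undefined");
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("wall_ms must not be NaN"));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}

/// Fraction of the budget left unused (0.47 means 47% headroom).
pub fn headroom(observed_ms: f64, budget_ms: f64) -> f64 {
    (budget_ms - observed_ms) / budget_ms
}
```

With the values above, `headroom(264.74, 500.0)` is (500 − 264.74) / 500 ≈ 0.4705. Using the median rather than the mean keeps the step-0 warmup outlier from skewing the figure.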

## Spec progression

v2.64.0 → v2.65.0. Coverage tally update is **pending** the contract
bump for `gpu-training-backend-v1.yaml` GATE-GPUTRAIN-004
PARTIAL_ALGORITHM_LEVEL → ACTIVE_WITH_LIVE_EVIDENCE (separate
follow-up PR; §20 records the data, the contract amendment captures
the durable verdict).

## Stacks under

- #1068 (§19 — task #132 correction)
- #1067 (§18 — training status snapshot)
- Concrete progress on §19.4 Residual B (live evidence half)
- Pairs with PR #1069 (wall_ms code half — provided the JSONL field
  used for the GATE-GPUTRAIN-004 timing data)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…t 17× anomaly site — spec v2.65.0 → v2.66.0

§17.4 specified the falsifier next step as sub-layer bisection of
{ffn_gate_out, silu(g), silu(g)*u, ffn_down_out}. PR #1066 added
the 4 new ActivationStats fields. §21 records the **first run of
the bisection on the canonical 7B teacher**.

## What §21 contains (8 subsections)

- §21.1 Live trace command + 10-line per-layer block
- §21.2 Per-layer std table (28 layers × 6 fields)
- §21.3 The first divergent sub-FFN slot is **ffn_swigl** (17.2×
  layer 2; ffn_silu shows 3.2× precursor; ffn_out shows 53× cascade)
- §21.4 Why this matters — silu(g) and u individually normal at
  layer 3, but their elementwise product is 17× — implies an
  unusual positive correlation or alignment bug
- §21.5 Refined surviving suspect surface — element-wise multiply
  correctness (`inference.rs:163`) + off-by-one slice indexing as
  newly-named candidate
- §21.6 Falsifiable next step: GGUF-path sub-FFN telemetry, compare
  APR vs GGUF layer-3 ffn_swigl directly
- §21.7 What §21 is NOT (doesn't pin to a code line yet, depends on
  PR #1066 in cascade)
- §21.8 Methodological alignment (live-evidence pattern)
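
The §21.4 observation (each factor looks normal, the product does not) can be sketched as follows. This is an illustrative model of the three sub-FFN slots, not the code in `inference.rs`; the function names are hypothetical.

```rust
/// SiLU activation: x * sigmoid(x).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Population standard deviation (0.0 for an empty slice).
pub fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean) * (x - mean)).sum::<f32>() / n;
    var.sqrt()
}

/// Std of each sub-FFN slot for one layer:
/// (gate output, silu(gate), silu(gate) * up).
pub fn sub_ffn_stds(gate: &[f32], up: &[f32]) -> (f32, f32, f32) {
    assert_eq!(gate.len(), up.len(), "gate and up must have the same length");
    let silu_g: Vec<f32> = gate.iter().map(|&g| silu(g)).collect();
    let swigl: Vec<f32> = silu_g.iter().zip(up).map(|(s, u)| s * u).collect();
    (std_dev(gate), std_dev(&silu_g), std_dev(&swigl))
}
```

For example, with `gate = [0, 2, 0, 2]` the product std is large when `up = [0, 3, 0, 3]` and exactly zero when `up` is shifted by one element to `[3, 0, 3, 0]`, even though the gate statistics are unchanged. That is why element alignment is a plausible suspect when only the product slot diverges.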

## Per-layer ffn_swigl progression (key data)

| Layer | ffn_swigl std |
|------:|--------------:|
| 0     | 0.088         |
| 1     | 0.061         |
| 2     | 0.071         |
| **3** | **1.222** (17.2× layer 2) |
| 4     | 0.390         |
| 5-25  | ~0.15-0.55    |
| 26    | 1.452         |
| 27    | 2.247         |

Among layers 0-25, layer 3 stands out: every other layer in that range
has ffn_swigl in the 0.06-0.55 band, so the 1.22 value is anomalous.
Layers 26 and 27 are also elevated (1.452 and 2.247).
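
The 17.2× figure is the ratio of layer 3's std to layer 2's. A small sketch of that layer-over-layer check, using the first five values from the table; the 5× threshold and the function name are assumptions for illustration:

```rust
/// ffn_swigl std for layers 0 through 4, copied from the table above.
pub const FFN_SWIGL_STD_L0_TO_L4: [f64; 5] = [0.088, 0.061, 0.071, 1.222, 0.390];

/// Returns (layer index, ratio to the previous layer) for the first layer
/// whose std exceeds `threshold` times the previous layer's std, if any.
/// A previous std of exactly 0.0 yields an infinite ratio.
pub fn first_spike(stds: &[f64], threshold: f64) -> Option<(usize, f64)> {
    stds.windows(2).enumerate().find_map(|(i, w)| {
        let ratio = w[1] / w[0];
        if ratio > threshold {
            Some((i + 1, ratio))
        } else {
            None
        }
    })
}
```

Calling `first_spike(&FFN_SWIGL_STD_L0_TO_L4, 5.0)` reports layer 3 with a ratio of 1.222 / 0.071 ≈ 17.2.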

## Bug surface narrowing (across §15→§16→§17→§21)

- §15: candidate space = whole forward path
- §15.4: GPU GQA attention kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs CPU GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out spike)
- **§21: layer 3 ffn_swigl named** (17× spike, first anomaly site)

The surviving suspect surface is now `inference.rs:160-164`, in
particular the `ffn_hidden.push(silu_g * u)` element-wise multiply.

Spec v2.65.0 → v2.66.0. No coverage tally change — investigation-
recording, not a discharge.

Evidence persisted to:
- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv

Stacks under #1070 (§20) which is under #1068 (§19) which is under
#1067 (§18) which is under #1064 (§17) which is under #1063 (§16).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…2.64.0 → v2.65.0 (#1070)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 27, 2026
…=GGUF (trace point mismatch) — v2.76 → v2.77

§31 hypothesized that APR's qkv_bias was the divergence introducer (std=10.24).
Live byte-compare via diag_compare_qkv_bias.rs proves this WRONG:

| Tensor | APR mean | APR std | GGUF mean | GGUF std | max diff | RMS diff |
|--------|---------:|--------:|----------:|---------:|---------:|---------:|
| q_bias | 0.127345 | 3.258061 | 0.127345 | 3.258061 | 0.000000 | 0.000000 |
| k_bias | 1.549082 | 29.464621 | 1.549082 | 29.464621 | 0.000000 | 0.000000 |
| v_bias | 0.005926 | 0.186556 | 0.005926 | 0.186556 | 0.000000 | 0.000000 |

APR ≡ GGUF byte-for-byte. The std=10.24 is a property of Qwen2.5-7B's
actual trained qkv_bias values; both formats store and load them correctly.

So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)?
**TRACE-CAPTURE-POINT MISMATCH.**

GGUF (gguf/inference/forward/traced.rs:144):
  - qkv_stats captured AFTER scratch_attention_block writes scratch.qkv
  - BUT BEFORE the per-Q/K/V bias add at results.rs:216-226
  - = PRE-BIAS measurement → std=1.14

APR (apr_transformer/pmat-260.rs:331-334):
  - matmul writes `qkv` then `add_bias(qkv, bias)` in-place
  - Trace captured AFTER bias add
  - = POST-BIAS measurement → std=10.33

Both forward passes correctly apply qkv_bias. The 9× gap exists only in
the TRACE STATISTICS, not in the actual computation.
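
A minimal sketch of how a capture-point mismatch produces different statistics from identical computation. The `CapturePoint` enum and `project_with_trace` are hypothetical names, not the APR or GGUF tracing code:

```rust
/// Population standard deviation (0.0 for an empty slice).
pub fn std_dev(xs: &[f32]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean) * (x - mean)).sum::<f32>() / n;
    var.sqrt()
}

/// Where the trace statistic is sampled relative to the bias add.
pub enum CapturePoint {
    PreBias,
    PostBias,
}

/// Adds the bias and returns (output, traced std). The output does not
/// depend on `capture`; only the recorded statistic does.
pub fn project_with_trace(
    matmul_out: &[f32],
    bias: &[f32],
    capture: CapturePoint,
) -> (Vec<f32>, f32) {
    assert_eq!(matmul_out.len(), bias.len(), "bias length must match");
    let pre_bias_std = std_dev(matmul_out);
    let out: Vec<f32> = matmul_out.iter().zip(bias).map(|(x, b)| x + b).collect();
    let traced = match capture {
        CapturePoint::PreBias => pre_bias_std,
        CapturePoint::PostBias => std_dev(&out),
    };
    (out, traced)
}
```

With `matmul_out = [1, -1, 1, -1]` and `bias = [10, -10, 10, -10]`, both capture points return the same output vector, but the traced std is 1.0 pre-bias and 11.0 post-bias.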

Where is the actual SHIP-007 bug, then? Per the existing trace, layer 3
ffn_gate diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly
amplifies to an 18× ffn_swigl ratio. That is still where the bug surface
is, exactly as §28 originally said.

§30's investigation (which refuted §28) only tested LAYER 0 QKV matmul.
LAYER-3 FFN ffn_gate matmul is a DIFFERENT code path. PR E v3 scope:

  Run §31-style bisection AT LAYER 3 with the proper trace capture points,
  comparing APR sub-FFN stats vs GGUF sub-FFN stats (both captured at
  matched points per PR #1066/#1067 forward_traced sub-FFN slots).

Methodology lesson (§32.5): when stat-bisection finds a "smoking gun,"
ALWAYS verify with byte-level comparison against the reference. Stats can
mislead when measurement points differ. Toyota Way: verify physical state
(byte equality), not just symptoms (statistical gaps).
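
The byte-level check can be sketched as below. This is not `diag_compare_qkv_bias.rs`; it assumes both tensors are already dequantized to `f32` slices, and the struct and function names are hypothetical.

```rust
/// Result of comparing two tensors element by element.
pub struct TensorDiff {
    /// True when every element has the same bit pattern.
    pub bytes_equal: bool,
    pub max_abs_diff: f32,
    pub rms_diff: f32,
}

/// Compares two equally sized f32 tensors.
pub fn compare_tensors(a: &[f32], b: &[f32]) -> TensorDiff {
    assert_eq!(a.len(), b.len(), "tensors must have the same element count");
    // Bit-pattern equality is stricter than `==` on floats: it
    // distinguishes 0.0 from -0.0 and treats identical NaN bits as equal.
    let bytes_equal = a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits());
    let mut max_abs_diff = 0.0f32;
    let mut sum_sq = 0.0f64;
    for (x, y) in a.iter().zip(b) {
        let d = (x - y).abs();
        max_abs_diff = max_abs_diff.max(d);
        sum_sq += f64::from(d) * f64::from(d);
    }
    let rms_diff = if a.is_empty() {
        0.0
    } else {
        (sum_sq / a.len() as f64).sqrt() as f32
    };
    TensorDiff { bytes_equal, max_abs_diff, rms_diff }
}
```

Reporting `bytes_equal` alongside the numeric diffs separates "physically identical" from "numerically close", which is the distinction the lesson above turns on.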

Files:
- crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt
- §32 spec section (6 subsections)
- §31 marked SUPERSEDED in spec
- Header v2.76.0 → v2.77.0

Coverage scoreboard unchanged (15+33). Will flip when LAYER-3-specific
bisection localizes the actual divergence point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 27, 2026
…=10.24) — v2.75 → v2.76 (#1090)

* docs(ship-two-001): §31 — SHIP-007 root cause PINNED to qkv_bias (std=10.24) — spec v2.75.0 → v2.76.0

Live three-stage bisection on canonical 7B teacher pinpoints the divergence
point exactly. Per §30.4's falsifiable next-investigation step, captured layer-0
qkv at four stages with prompt "What is 2+2?":

| Stage | mean | std | Match GGUF (1.14)? |
|-------|------|-----|---------------------|
| Embedding | 1e-5 | 0.0174 | OK (input) |
| Post-RMSNorm | -8e-5 | 0.221 | OK (input) |
| Post-matmul, pre-bias | -0.0159 | 0.925 | YES — Q4K tolerance |
| qkv_bias (the bias itself) | +0.272 | 10.243 | ⚠ ~10× too large |
| Post-bias | +0.256 | 10.329 | matches APR trace blowup |

The 9× std blowup happens ENTIRELY at the qkv_bias addition step
(pmat-260.rs:332-334). Pre-bias matmul output matches GGUF; post-bias
matches APR's existing trace. K-part bias is most extreme (post-bias
std=29.49).

PR E v2 is now scoped to ONE specific investigation per §31.4:

  - dump APR's `blk.0.attn_q.bias` / `attn_k.bias` / `attn_v.bias` bytes
  - dump GGUF's same 3 tensors
  - byte-compare:
    - if APR != GGUF, the GGUF→APR converter is broken
    - if APR == GGUF, the loader (`load_qkv_bias`) is misinterpreting

§31 falsification chain (now closed at the root):

  §15.4 GPU eliminated → §16 APR CPU isolated → §17 (layer 3, FFN)
  → §23 (layer 3, ffn_swigl) → §27 ratio 18.23×
  → §28 "F32 vs Q4K matmul precision" (REFUTED in §30 by direct kernel
    comparison)
  → §31 qkv_bias std=10.24 introduces 9× layer-0 gap (PINNED)

The bug was 3 layers upstream of where §27/§28 looked. Bisection-by-stages
found it in one pass.

Drift-prevention test for next session (per §31.5): assert per-layer
|APR qkv_bias.std() - GGUF qkv_bias.std()| / max(eps, GGUF) < 0.10.
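
The proposed assertion can be sketched as a small helper. Two points are assumptions here: `max(eps, GGUF)` is read as `max(eps, GGUF std)`, and the `eps` value of 1e-12 is a placeholder. The function names are hypothetical.

```rust
/// Relative drift between two standard deviations:
/// |apr_std - gguf_std| / max(eps, gguf_std).
pub fn relative_std_drift(apr_std: f64, gguf_std: f64, eps: f64) -> f64 {
    (apr_std - gguf_std).abs() / gguf_std.max(eps)
}

/// True when the drift is under the 0.10 bound quoted above.
/// The eps value is an assumption, chosen only to avoid division by zero.
pub fn within_drift_budget(apr_std: f64, gguf_std: f64) -> bool {
    relative_std_drift(apr_std, gguf_std, 1e-12) < 0.10
}
```

A per-layer test would call `within_drift_budget` once per bias tensor and fail on the first layer that returns `false`.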

Files:
- crates/aprender-serve/examples/diag_qkv_bisection_layer0.rs (rerunnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_qkv_bisection_layer0.txt
- evidence/ship-007-qkv-bisection-2026-04-27/findings.md (full analysis)
- §31 spec section (8 subsections)
- Header: v2.75.0 → v2.76.0

Coverage scoreboard unchanged (15+33). Will flip to 20+28 when PR E v2 lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §32 — §31 REFUTED, qkv_bias is byte-identical APR=GGUF (trace point mismatch) — v2.76 → v2.77

§31 hypothesized that APR's qkv_bias was the divergence introducer (std=10.24).
Live byte-compare via diag_compare_qkv_bias.rs proves this WRONG:

| Tensor | APR mean | APR std | GGUF mean | GGUF std | max diff | RMS diff |
|--------|---------:|--------:|----------:|---------:|---------:|---------:|
| q_bias | 0.127345 | 3.258061 | 0.127345 | 3.258061 | 0.000000 | 0.000000 |
| k_bias | 1.549082 | 29.464621 | 1.549082 | 29.464621 | 0.000000 | 0.000000 |
| v_bias | 0.005926 | 0.186556 | 0.005926 | 0.186556 | 0.000000 | 0.000000 |

APR ≡ GGUF byte-for-byte. The std=10.24 is a property of Qwen2.5-7B's
actual trained qkv_bias values; both formats store and load them correctly.

So why the 9× layer-0 qkv std gap (APR=10.33 vs GGUF=1.14)?
**TRACE-CAPTURE-POINT MISMATCH.**

GGUF (gguf/inference/forward/traced.rs:144):
  - qkv_stats captured AFTER scratch_attention_block writes scratch.qkv
  - BUT BEFORE the per-Q/K/V bias add at results.rs:216-226
  - = PRE-BIAS measurement → std=1.14

APR (apr_transformer/pmat-260.rs:331-334):
  - matmul writes `qkv` then `add_bias(qkv, bias)` in-place
  - Trace captured AFTER bias add
  - = POST-BIAS measurement → std=10.33

Both forward passes correctly apply qkv_bias. The 9× gap exists only in
the TRACE STATISTICS, not in the actual computation.

Where IS the actual SHIP-007 bug then? Per existing trace, layer 3 ffn_gate
diverges 1.36× (APR=1.92 vs GGUF=1.41), which silu non-linearly amplifies
to 18× ffn_swigl ratio. That's STILL where the bug surface is — exactly as
§28 originally said.

§30's investigation (which refuted §28) only tested LAYER 0 QKV matmul.
LAYER-3 FFN ffn_gate matmul is a DIFFERENT code path. PR E v3 scope:

  Run §31-style bisection AT LAYER 3 with the proper trace capture points,
  comparing APR sub-FFN stats vs GGUF sub-FFN stats (both captured at
  matched points per PR #1066/#1067 forward_traced sub-FFN slots).

Methodology lesson (§32.5): when stat-bisection finds a "smoking gun,"
ALWAYS verify with byte-level comparison against the reference. Stats can
mislead when measurement points differ. Toyota Way: verify physical state
(byte equality), not just symptoms (statistical gaps).

Files:
- crates/aprender-serve/examples/diag_compare_qkv_bias.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_qkv_bias.txt
- §32 spec section (6 subsections)
- §31 marked SUPERSEDED in spec
- Header v2.76.0 → v2.77.0

Coverage scoreboard unchanged (15+33). Will flip when LAYER-3-specific
bisection localizes the actual divergence point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(ship-two-001): §32 follow-up — layer-3 ffn_gate/up/down Q4K bytes ARE byte-identical APR=GGUF

Per §32.4 next-step ("layer-3 sub-FFN bisection"), ran a byte-level
comparison of layer-3 Q4K weight bytes between the canonical 7B teacher's
.apr and .gguf files. Result:

  ffn_gate.weight Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  ffn_up.weight   Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  ffn_down.weight Q4K (38,191,104 bytes)  →  ✓ APR ≡ GGUF byte-for-byte
  layer-0 ffn_gate Q4K (sanity)           →  ✓ APR ≡ GGUF byte-for-byte

So the layer-3 ffn_gate divergence (APR std=1.92 vs GGUF std=1.41 = 1.36×
ratio per existing trace) does NOT come from differing weight bytes.
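
A sketch of the two checks involved: the expected Q4K payload size and the byte comparison itself. It assumes the GGML Q4_K layout (256 weights per super-block, 144 bytes per super-block) and Qwen2.5-7B's published dimensions (hidden 3584, intermediate 18944), neither of which is stated in this commit. Function names are hypothetical and are not taken from diag_compare_layer3_ffn.rs.

```rust
/// Assumed Q4_K super-block geometry.
pub const Q4K_BLOCK_ELEMS: usize = 256;
pub const Q4K_BLOCK_BYTES: usize = 144;

/// Expected byte size of a rows x cols Q4K tensor, or None if the element
/// count is not a whole number of super-blocks (or the size overflows).
pub fn q4k_bytes(rows: usize, cols: usize) -> Option<usize> {
    let elems = rows.checked_mul(cols)?;
    if elems % Q4K_BLOCK_ELEMS != 0 {
        return None;
    }
    (elems / Q4K_BLOCK_ELEMS).checked_mul(Q4K_BLOCK_BYTES)
}

/// Offset of the first differing byte, or None if the buffers are identical.
/// A length mismatch with an equal common prefix reports the shorter length.
pub fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    a.iter()
        .zip(b)
        .position(|(x, y)| x != y)
        .or(if a.len() != b.len() { Some(common) } else { None })
}
```

Under those assumptions, `q4k_bytes(18944, 3584)` is 3584 × 18944 / 256 × 144 = 38,191,104, matching the size reported for each of the three FFN tensors. Reporting an offset rather than a bare boolean makes a failing comparison easier to map back to a specific super-block.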

This eliminates the GGUF→APR converter as the bug surface for layer 3.
Combined with §30 (layer-0 QKV matmul kernel agrees) and §32 (qkv_bias
byte-identical), the elimination chain is now:

  - QKV matmul kernel: ✓ correct (§30)
  - QKV bias bytes: ✓ correct (§32)
  - Layer-3 FFN weight bytes: ✓ correct (this commit)

The remaining hypothesis: cumulative layer-by-layer F32 precision drift
through residual connections. APR layer 0 output std=0.40 vs GGUF=0.36
(0.40/0.36 ≈ 1.11, roughly 10% drift). Because layer 1 input = layer 0
output, the drift compounds. After 3 layers of accumulating ~1% Q4K rounding
diffs, the layer-3 ffn_gate input differs enough between the two formats to
push silu into different saturation regions, producing the 18× ffn_swigl ratio.

Note on the apr trace --payload sub-FFN telemetry: GGUF's forward_traced
in PR A (#1081, merged) currently leaves the 4 sub-FFN slots at their default
(zero). PR B (#1082, still DIRTY) clones scratch_swiglu_ffn_traced to
populate them. The existing apr-trace.txt and gguf-trace.txt evidence
files (2026-04-27) were generated from a binary built with PR B applied
locally. Those numbers are valid, but reproducing them requires PR B to
land on main.

Files added:
- crates/aprender-serve/examples/diag_compare_layer3_ffn.rs (re-runnable)
- evidence/ship-007-qkv-bisection-2026-04-27/diag_compare_layer3_ffn.txt

Coverage scoreboard unchanged. Investigation continues; PR E v3 scope
narrows further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>