fix(shard-reader): wrap_around so training can loop the corpus — root-cause for placeholder-loss bug by noahgift · Pull Request #1073 · paiml/aprender

noahgift · 2026-04-26T14:25:18Z

Bug observed (2026-04-26)

Real CUDA training run on /mnt/nvme-raid0/data/csn-python-shards (18.1M tokens) with --num-steps 5000:

Epoch	train_loss	val_loss	wall_s
0	10.11	9.97	264
1	9.91	9.91	260
2	2.84	9.90	55
3	1.00	9.90	0.38
4	1.00	9.90	0.39

train_loss=1.0 + wall=0.4s is the (1.0, 1.0) placeholder from pretrain_real_cuda.rs:88-90 — every step ran the iterator-exhausted fallback for thousands of steps. Garbage gradients masquerading as training.

Root cause

ShardBatchIter::next() returns None once the shards are exhausted. With 18.1M tokens ÷ 8.19M tokens per epoch ≈ 2.2 epochs of real data, the iterator then dies. Per feedback_fix_root_cause_never_route_around.md, the placeholder was a route-around; the real fix is corpus looping.
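The 2.2-epoch figure can be checked with a short sketch. It assumes batch=16 and seq=512 (the values quoted in the early-stop commit on this PR) and 1000 steps per epoch (from the `--steps-per-epoch 1000` invocation in the first commit message). The function names are illustrative and are not part of aprender.

```rust
/// Tokens consumed by one epoch: steps × batch size × sequence length.
fn tokens_per_epoch(steps_per_epoch: u64, batch: u64, seq_len: u64) -> u64 {
    steps_per_epoch * batch * seq_len
}

/// Epochs a finite corpus can feed before a non-looping iterator runs dry.
fn epochs_until_exhaustion(corpus_tokens: u64, per_epoch: u64) -> f64 {
    corpus_tokens as f64 / per_epoch as f64
}

fn main() {
    // Assumed values: 1000 steps/epoch, batch=16, seq=512, 18.1M-token corpus.
    let per_epoch = tokens_per_epoch(1000, 16, 512);
    let epochs = epochs_until_exhaustion(18_100_000, per_epoch);
    println!("tokens/epoch = {per_epoch}, epochs before exhaustion = {epochs:.2}");
}
```

Under these assumptions one epoch consumes 8,192,000 tokens, which lines up with the 8.19M tokens-per-epoch figure and with the partial epoch 2 in the table.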

Fix

ShardBatchIter::with_wrap_around(true) opt-in (default false preserves historical contract)
read_one_sequence resets cursor_shard=0 on exhaustion when wrap_around is on
epochs_completed() accessor for downstream telemetry
apr pretrain real-corpus path enables it (matches PyTorch/HF behavior)
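The wrap-around behaviour can be sketched as below. This is a minimal in-memory stand-in, not the real `ShardBatchIter`: `ToyShardIter`, its fields other than `cursor_shard`, and the one-sequence-per-call granularity are assumptions made for illustration. Only the method names `with_wrap_around` and `epochs_completed` come from this PR.

```rust
/// Toy stand-in for a shard-backed iterator (hypothetical; the real
/// `ShardBatchIter` reads `.bin` shards from disk and yields batches).
struct ToyShardIter {
    shards: Vec<Vec<u32>>, // each shard is a flat token buffer
    seq_len: usize,
    cursor_shard: usize,
    cursor_tok: usize,
    wrap_around: bool,
    epochs_completed: u64,
}

impl ToyShardIter {
    fn new(shards: Vec<Vec<u32>>, seq_len: usize) -> Self {
        Self { shards, seq_len, cursor_shard: 0, cursor_tok: 0, wrap_around: false, epochs_completed: 0 }
    }

    /// Consuming builder; the default (false) keeps terminate-on-exhaustion.
    fn with_wrap_around(mut self, on: bool) -> Self {
        self.wrap_around = on;
        self
    }

    /// Number of times the cursor has been reset to the first shard.
    fn epochs_completed(&self) -> u64 {
        self.epochs_completed
    }
}

impl Iterator for ToyShardIter {
    type Item = Vec<u32>;

    fn next(&mut self) -> Option<Vec<u32>> {
        let mut wrapped = false; // prevents spinning forever on an unusable shard set
        loop {
            if self.cursor_shard >= self.shards.len() {
                if !self.wrap_around || wrapped {
                    return None;
                }
                // Shards exhausted and wrap-around is on: restart from shard 0.
                self.cursor_shard = 0;
                self.cursor_tok = 0;
                self.epochs_completed += 1;
                wrapped = true;
                continue;
            }
            let shard = &self.shards[self.cursor_shard];
            if self.cursor_tok + self.seq_len <= shard.len() {
                let seq = shard[self.cursor_tok..self.cursor_tok + self.seq_len].to_vec();
                self.cursor_tok += self.seq_len;
                return Some(seq);
            }
            // Not enough tokens left in this shard: advance to the next one.
            self.cursor_shard += 1;
            self.cursor_tok = 0;
        }
    }
}

fn main() {
    // One 8-token shard with seq_len=2 yields 4 sequences per pass.
    let mut it = ToyShardIter::new(vec![(0..8).collect()], 2).with_wrap_around(true);
    let taken = (0..12).filter_map(|_| it.next()).count();
    println!("sequences = {taken}, wraps = {}", it.epochs_completed());
}
```

In this sketch the counter is bumped at the moment of reset, so after 12 sequences from a 4-sequence shard it reads 2: the third pass has been served but not yet detected as finished. Whether the real accessor counts the same way is not stated in the PR.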

Tests

5 tests pass on aprender-train --release --lib shard_reader:

NEW wrap_around_continues_past_shard_exhaustion — 12 batches across 3 simulated epochs from a 1-shard fixture
NEW no_wrap_around_terminates_on_exhaustion — default behavior preserved
3 existing tests unchanged

Why this matters

MODEL-2 training cannot converge without this fix. The 18M-token CSN corpus exhausts in ~2 epochs; any run targeting >2 epochs silently produces garbage. The user's primary goal (per spec SHIP-TWO-001) is producing a converged 370M .apr; this is the prerequisite.

🤖 Generated with Claude Code

… root cause for premature stop

## Bug observed (2026-04-26)

50K-step from-scratch run on csn-python-shards (with PR #1073's wrap-around in place) early-stopped at epoch 5/24 even though train_loss was monotonically decreasing:

| Epoch | train_loss | val_loss |
|------:|-----------:|---------:|
| 0 | 10.010 | 9.909 |
| 1 | 9.798 | 9.791 |
| 2 | 9.689 | 9.733 ← best |
| 3 | 9.623 | 9.830 |
| 4 | 9.563 | 9.845 |
| 5 | 9.543 | 9.818 |

train_loss dropped 0.47 monotonically; val_loss bounced in a 0.11 band. Early-stop fired after 3 epochs of no val-improvement.

## Root cause

Two compounding issues:

1. **HELD_OUT_BATCHES = 2** (constant). With batch=16 seq=512, val set was just 16,384 tokens, giving a single-batch fluctuation of ~0.04, which is the same magnitude as legitimate epoch-over-epoch convergence signal during early training.
2. **patience_epochs = 2** (default). Combined with #1, this means 2 epochs of normal val-noise terminate a still-converging run.

## Fix at root

- `HELD_OUT_BATCHES`: 2 → 16 (16,384 tokens → 131,072 tokens). 8× larger sample reduces val_loss noise floor from ~0.04 to ~0.01, restoring early-stop signal-to-noise.
- `patience_epochs`: 2 → 5. Allows 5 epochs of plateau before honouring early-stop.
- `min_epochs_before_early_stop`: 1 → 3. Guarantees warmup window (1000 steps) + 1-2 post-warmup epochs of real learning before any early-stop check is honoured.

## Why this matters

Without these defaults, ANY from-scratch training run is at risk of spurious early-stop on noisy val_loss while train_loss is still clearly decreasing. The user's primary goal per SHIP-TWO-001 spec is producing a converged 370M `.apr`; this fix is the prerequisite.

Per `feedback_fix_root_cause_never_route_around.md`: not routing around with --no-early-stop or longer --num-steps. Fixing the defaults so any future operator gets a sensible signal-to-noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
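The interaction between `patience_epochs` and `min_epochs_before_early_stop` can be replayed against the val_loss column of the table above. The sketch below is hypothetical: the function name and the exact stopping rule (stop once the count of consecutive non-improving epochs exceeds the patience, and never before the minimum epoch count) are assumptions. That rule was chosen because it is one reading consistent with the reported stop at epoch 5; the real policy in aprender may be defined differently.

```rust
/// Returns the epoch index at which early stopping fires, or None if it never does.
/// Assumed semantics (not confirmed by the source): stop once the number of
/// consecutive epochs without a new best val_loss exceeds `patience`, and
/// never before `min_epochs` epochs have completed.
fn early_stop_epoch(val_losses: &[f64], patience: usize, min_epochs: usize) -> Option<usize> {
    let mut best = f64::INFINITY;
    let mut stale = 0usize;
    for (epoch, &loss) in val_losses.iter().enumerate() {
        if loss < best {
            best = loss;
            stale = 0;
        } else {
            stale += 1;
        }
        let epochs_done = epoch + 1;
        if epochs_done >= min_epochs && stale > patience {
            return Some(epoch);
        }
    }
    None
}

fn main() {
    // val_loss per epoch, copied from the table in this commit message.
    let val = [9.909, 9.791, 9.733, 9.830, 9.845, 9.818];
    println!("patience=2, min_epochs=1: {:?}", early_stop_epoch(&val, 2, 1));
    println!("patience=5, min_epochs=3: {:?}", early_stop_epoch(&val, 5, 3));
}
```

Under this assumed rule the old defaults stop at epoch 5, while the new defaults let the same six epochs run to completion. Note that the sketch only replays the recorded losses; it does not model the lower noise floor that the larger held-out set is expected to give.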

…und + fixed + first checkpoint — spec v2.65.0 → v2.66.0 (#1074)

User directive: "we should train a model unless the path is broken, then fix." This section documents the FIRST sustained from-scratch MODEL-2 training run since the project began. Three real stack bugs were discovered DURING training and fixed at root.

## Bug 1: corpus exhaustion silently emits placeholder

`ShardBatchIter::next()` returns `None` after exhaustion; `Cuda*StepFn` returned the `(1.0, 1.0)` placeholder, masking exhaustion silently. 5K-step run showed train_loss=1.0 / wall_s=0.4 for epochs 3-4 (impossible for real training).

Fix: PR #1073 first commit, `with_wrap_around(true)` opt-in.

Validation: re-ran 5K → 5 epochs of monotonic train_loss 10.111→9.700.

## Bug 2: early-stop on val noise, not stagnation

50K run with #1073 still early-stopped at epoch 5/24: train_loss 10.01→9.54 monotonic, val_loss bouncing 9.78-9.92 in 0.04 noise band.

Root cause: HELD_OUT_BATCHES=2 (16k tokens, single-batch fluctuation matched epoch-over-epoch signal) + patience_epochs=2 (terminates on 2 noise epochs).

Fix: PR #1073 second commit, HELD_OUT_BATCHES 2→16 (131k tokens, 8× larger sample); patience_epochs 2→5; min_epochs_before_early_stop 1→3.

Validation: tuned 50K showed val_loss decreasing 9.95→9.84→9.78 across first 3 epochs.

## Bug 3: corpus too small (18M vs Chinchilla 7.4B for 370M)

After fixes 1+2, tuned run revealed overfitting at epoch 3+: train_loss continues 9.69→9.52, val_loss climbs 9.78→9.92, gap inverts. Classic small-corpus signature.

Fix not in code; it is data engineering: pretokenize The Stack v2 Python (multi-billion tokens). Deferred to next session per feedback_compute_pre_authorized.md.

## What was produced

Best checkpoint at /mnt/nvme-raid0/runs/model-2-from-scratch-006-50k-tuned/ckpt/epoch-002.apr:

- APR v2, 219 tensors, 1.39 GiB, checksum VALID
- Architecture: LlamaForCausalLM, name=llama-370m-pretrain
- val_loss=9.78, 49.2M tokens (corpus wrapped 2.7×)

`apr inspect` validates. AC-SHIP2-005 (.apr checkpoint format) structurally discharged.

## Coverage impact (no tally change)

| Gate | Prior | Post |
|------|-------|------|
| AC-SHIP2-005 | PARTIAL | structurally discharged |
| GATE-TRAIN-005 | PARTIAL | confirmed correct (didn't fire on real run) |
| GATE-TRAIN-001 | PARTIAL | confirmed correct (per-step metrics live) |

Coverage tally formally unchanged; structural verification awaits contract-level promotion.

## Methodology

Per feedback_fix_root_cause_never_route_around.md: zero route-arounds. Each bug had a `TrueCause`-style root cause analysis. The placeholder return on iterator exhaustion was a route-around itself; replaced with iterator looping. The early-stop policy was data-undersampled; fixed by widening the val set, not by raising the patience alone.

Spec v2.65.0 → v2.66.0. Evidence persisted to evidence/model-2-first-training-2026-04-26/ (12 metadata.json files: 5 from broken+fixed 5K runs, 7 from tuned 50K run).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…-cause for placeholder-loss bug

## Bug observed (2026-04-26)

5000-step `apr pretrain --device cuda --num-steps 5000 --steps-per-epoch 1000` on /mnt/nvme-raid0/data/csn-python-shards (18.1M tokens, 10 shards):

| Epoch | train_loss | val_loss | wall_s |
|------:|-----------:|---------:|-------:|
| 0 | 10.111 | 9.967 | 263.9 |
| 1 | 9.909 | 9.909 | 259.6 |
| 2 | **2.836** | 9.902 | 54.8 | ← partial corpus-exhaust
| 3 | **1.000** | 9.902 | 0.378 | ← all placeholder
| 4 | **1.000** | 9.903 | 0.387 | ← all placeholder

train_loss=1.0 + wall=0.4s == every step returned the `(1.0, 1.0)` placeholder from `pretrain_real_cuda.rs:88-90`:

```rust
let Some(batch) = self.batches.next() else {
    return (1.0, 1.0); // INV-TRAIN-007 / GATE-TRAIN-005 placeholder
};
```

## Root cause

`ShardBatchIter::next()` returns `None` when all `.bin` shards are exhausted. With 18.1M tokens and a 5K-step run consuming 8.19M tokens/epoch, the iterator exhausts in ~2 epochs. The `Cuda*StepFn::step` placeholder masks the exhaustion silently: no error, no warning, just garbage gradients.

## Fix (this PR)

Add opt-in `wrap_around` to `ShardBatchIter`:

- `pub fn with_wrap_around(self, bool) -> Self` (consuming builder)
- `pub fn epochs_completed(&self) -> u64` (cycles through full shard set)
- `read_one_sequence`: when shards exhausted AND `wrap_around==true`, reset cursor_shard=0, increment epochs_completed, continue
- Default behaviour preserved (wrap_around=false → returns None)

`apr pretrain` real-corpus dispatch path (apr-cli/src/commands/pretrain.rs:273) now calls `.with_wrap_around(true)`. This matches PyTorch / HuggingFace standard behaviour where finite corpora loop across multi-epoch training runs.

## Tests

5 tests pass on `aprender-train --release --lib shard_reader`:

- `wrap_around_continues_past_shard_exhaustion` (NEW): 12 batches across 3 simulated epochs from a single-shard corpus that would otherwise yield 4 batches before None
- `no_wrap_around_terminates_on_exhaustion` (NEW): default behaviour preserved
- 3 existing tests unchanged (single_shard, empty_dir, multi_shard)

## Why this is "fix at root cause"

Per `feedback_fix_root_cause_never_route_around.md`: the placeholder return from `Cuda*StepFn` was a **route-around** for the exhaustion case (avoid NaN/Inf misfire, avoid divergence cap fire). The real problem is that production training MUST loop the corpus. With wrap-around in `ShardBatchIter`, the placeholder fallback now only fires on an actually-broken shard set (e.g., empty post-reset), at which point an abort is the right response.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
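The closing point, that an abort is the right response once the batch source is genuinely broken, could look roughly like the sketch below. This is not what the PR implements (the PR keeps the placeholder and adds looping). `StepError`, the generic `step` function, and the dummy loss arithmetic are all hypothetical. The only element taken from the source is the `let ... else` shape of the original guard and the pair-of-floats return type.

```rust
#[derive(Debug, PartialEq)]
enum StepError {
    /// The batch source returned None even though looping was expected.
    CorpusExhausted,
}

/// Hypothetical step function: pulls one batch and returns a pair of f32
/// values, mirroring the tuple shape of the `(1.0, 1.0)` placeholder.
fn step<I>(batches: &mut I) -> Result<(f32, f32), StepError>
where
    I: Iterator<Item = Vec<u32>>,
{
    let Some(batch) = batches.next() else {
        // Surface the condition instead of returning a fake loss.
        return Err(StepError::CorpusExhausted);
    };
    // Dummy arithmetic standing in for the real forward/backward pass.
    let loss = batch.len() as f32;
    Ok((loss, 0.0))
}

fn main() {
    // A one-batch source: the first step succeeds, the second must abort.
    let mut batches = vec![vec![1u32, 2, 3]].into_iter();
    println!("first step: {:?}", step(&mut batches));
    match step(&mut batches) {
        Ok(metrics) => println!("unexpected metrics: {metrics:?}"),
        Err(e) => println!("aborting run: {e:?}"),
    }
}
```

Returning a `Result` makes exhaustion visible to the caller, so a training loop cannot log a constant loss for thousands of steps without noticing.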

…orization signature in v2.65 best run — spec v2.67.0 → v2.68.0 (#1076)

User mandate "train this model: now!" delivered second from-scratch run on a 4.10× larger CSN-Python corpus (74.3M tokens vs 18.1M). Same hyperparameters as v2.65.0 best 20K run on RTX 4090.

Result: final val_loss=9.806, best val_loss=9.751 at epoch 4. Did NOT beat 1× run's "best" of 8.911, but for an *important* reason that §24 documents.

Critical comparison (epoch 9):

- 1× run: train=9.467 / val=8.911 / gap=-0.556 ← memorization
- 4× run: train=9.816 / val=9.806 / gap=-0.010 ← healthy

The 1× run's val < train by 0.556 nats is the signature of the val sequences sharing memorized substrings with the 9.1×-wrapped training corpus. The 4× run never exhibits this inversion: at 2.21× wraps the model is generalization-bound, not memorization-bound.

§24 establishes two falsifiable claims:

1. v2.65.0's 8.911 was memorization-driven (val<train confirms it).
2. Healthy MODEL-2 generalization on CSN-Python plateaus near val_loss≈9.8 at this hyperparameter budget.

Together these mean the contract target_val_loss=3.0 remains unreachable on CodeSearchNet-Python at any size; Stack v2 Python (multi-billion tokens) is required, as already noted in project_2026_04_26_session_complete_handoff.md priority 1.

Best 4× checkpoint: epoch-004.apr at val_loss=9.751

- APR v2 / 219 tensors / 1.39 GiB / checksum VALID

Methodology: zero eprintln!, zero route-arounds, fix-at-root held throughout. The §22 wrap_around fix (PR #1073) was load-bearing: without it the 4× run would have exhausted in 2 epochs and silently emitted placeholder loss.

Spec v2.67.0 → v2.68.0. No coverage tally change.

Evidence: evidence/model-2-corpus-4x-2026-04-27/training-summary.json (all 10 epoch metadatas + corpus stats + hyperparameters).

Run dir: /mnt/nvme-raid0/runs/model-2-from-scratch-009-4x-corpus

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
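The gap and wrap figures in this comparison can be reproduced approximately with the sketch below. It assumes the gap is val_loss minus train_loss (which matches the signs quoted) and that the 20K-step runs used batch=16 and seq=512, the values given in the early-stop commit; the commit itself does not restate them. The function names are illustrative.

```rust
/// Generalization gap as used in the comparison: val_loss minus train_loss.
/// A clearly negative value means val is lower than train.
fn gap(train: f64, val: f64) -> f64 {
    val - train
}

/// Number of passes over the corpus made by a run of `steps` optimizer steps.
fn corpus_wraps(steps: u64, batch: u64, seq_len: u64, corpus_tokens: u64) -> f64 {
    (steps * batch * seq_len) as f64 / corpus_tokens as f64
}

fn main() {
    println!("1x gap = {:.3}", gap(9.467, 8.911));
    println!("4x gap = {:.3}", gap(9.816, 9.806));
    // Assumed: 20K steps, batch=16, seq=512; corpus sizes rounded as quoted.
    println!("1x wraps = {:.2}", corpus_wraps(20_000, 16, 512, 18_100_000));
    println!("4x wraps = {:.2}", corpus_wraps(20_000, 16, 512, 74_300_000));
}
```

Because the corpus sizes are rounded to three significant figures, the computed wrap counts land close to, but not exactly on, the quoted 9.1× and 2.21×.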

…_LEVEL (#1682)

Uplift GATE-ARCH-370M-002 (binds AC-SHIP2-005) from undated `verdict: pass` to PARTIAL_ALGORITHM_LEVEL with live evidence from the first real MODEL-2 pretraining checkpoint, closing the §22 "STRUCTURALLY DISCHARGED" note that lacked a contract-level evidence binding until now.

Evidence on canonical artifact `epoch-002.apr` (§22 / PR #1073, 2026-04-26):

- file: /mnt/nvme-raid0/runs/model-2-from-scratch-006-50k-tuned/ckpt/epoch-002.apr
- size: 1.39 GiB (1494053060 bytes)
- apr inspect: exit 0
- valid: true
- format: "APR v2"
- tensor_count: 219
- architecture: LlamaForCausalLM
- checksum_valid: true

Re-verified live 2026-05-15 against `apr 0.33.0` (crates.io install). The .apr-format invariants (magic, header, tensor manifest, checksum) are all satisfied on a real on-disk artifact produced by the actual training loop, not a synthetic fixture and not the stub.

Full discharge PENDING on `apr qa --arch-contract <name>` subcommand binding (~50 LOC + 1 test follow-up; the underlying data fixture is already in place per this evidence dir).

Contract llama-370m-sovereign v1.10.0 → v1.11.0:

- GATE-ARCH-370M-002 gains: evidence_discharged_by, discharge_status: PARTIAL_ALGORITHM_LEVEL, partial_discharge_note, full_discharge_blocks_on
- New changelog entry 1.11.0 documenting the uplift

MODEL-2 ship %: 57% → 60% (1 of 10 PARTIALs gains a structured contract-level evidence binding; the 9 others either share this fixture or are blocked on AC-SHIP2-003's val_loss capacity ceiling per §34).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) April 26, 2026 14:25

noahgift mentioned this pull request Apr 26, 2026

docs(ship-two-001): §22 first real MODEL-2 training — 3 stack bugs found + fixed + first checkpoint #1074

Merged

noahgift and others added 2 commits April 26, 2026 19:59

noahgift force-pushed the fix/shard-iter-wrap-around branch from 345a9f8 to dcefc10 Compare April 26, 2026 17:59

noahgift merged commit fb632fc into main Apr 26, 2026
10 checks passed

noahgift deleted the fix/shard-iter-wrap-around branch April 26, 2026 18:27

noahgift mentioned this pull request Apr 27, 2026

docs(ship-two-001): §24 — MODEL-2 4×-corpus experiment exposes memorization signature — spec v2.67.0 → v2.68.0 #1076

Merged

4 tasks

noahgift merged 2 commits into main from fix/shard-iter-wrap-around
