fix(shard-reader): wrap_around so training can loop the corpus — root-cause for placeholder-loss bug#1073
Merged
noahgift added a commit that referenced this pull request on Apr 26, 2026
… root cause for premature stop

## Bug observed (2026-04-26)

50K-step from-scratch run on csn-python-shards (with PR #1073's wrap-around in place) early-stopped at epoch 5/24 even though train_loss was monotonically decreasing:

| Epoch | train_loss | val_loss |
|------:|-----------:|---------:|
| 0 | 10.010 | 9.909 |
| 1 | 9.798 | 9.791 |
| 2 | 9.689 | 9.733 ← best |
| 3 | 9.623 | 9.830 |
| 4 | 9.563 | 9.845 |
| 5 | 9.543 | 9.818 |

train_loss dropped 0.47 monotonically; val_loss bounced in a 0.11 band. Early-stop fired after 3 epochs of no val-improvement.

## Root cause

Two compounding issues:

1. **HELD_OUT_BATCHES = 2** (constant). With batch=16 seq=512, the val set was just 16,384 tokens. Single-batch fluctuation is ~0.04, which is the same magnitude as the legitimate epoch-over-epoch convergence signal during early training.
2. **patience_epochs = 2** (default). Combined with #1, this means 2 epochs of normal val-noise terminate a still-converging run.

## Fix at root

- `HELD_OUT_BATCHES`: 2 → 16 (16,384 tokens → 131,072 tokens). The 8× larger sample reduces the val_loss noise floor from ~0.04 to ~0.01, restoring early-stop signal-to-noise.
- `patience_epochs`: 2 → 5. Allows 5 epochs of plateau before honouring early-stop.
- `min_epochs_before_early_stop`: 1 → 3. Guarantees the warmup window (1000 steps) + 1-2 post-warmup epochs of real learning before any early-stop check is honoured.

## Why this matters

Without these defaults, ANY from-scratch training run is at risk of spurious early-stop on noisy val_loss while train_loss is still clearly decreasing. The user's primary goal per SHIP-TWO-001 spec is producing a converged 370M `.apr`; this fix is the prerequisite.

Per `feedback_fix_root_cause_never_route_around.md`: not routing around with --no-early-stop or longer --num-steps. Fixing the defaults so any future operator gets a sensible signal-to-noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
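The interaction between `patience_epochs` and `min_epochs_before_early_stop` can be sketched as follows. This is a minimal illustration, not the project's implementation: the `EarlyStop` type, the `observe` and `first_stop_epoch` names, and the exact stop rule (fire once the count of non-improving epochs exceeds the patience) are assumptions, chosen so that the old defaults reproduce the stop at epoch 5 shown in the table.

```rust
/// Hypothetical early-stop tracker. Field names mirror the settings named in
/// the commit message; the stop rule itself is an assumption.
struct EarlyStop {
    patience_epochs: u32,
    min_epochs_before_early_stop: u32,
    best_val_loss: f32,
    epochs_since_best: u32,
}

impl EarlyStop {
    fn new(patience_epochs: u32, min_epochs_before_early_stop: u32) -> Self {
        Self {
            patience_epochs,
            min_epochs_before_early_stop,
            best_val_loss: f32::INFINITY,
            epochs_since_best: 0,
        }
    }

    /// Records one epoch's val_loss and returns true when training should stop.
    fn observe(&mut self, epoch: u32, val_loss: f32) -> bool {
        if val_loss < self.best_val_loss {
            self.best_val_loss = val_loss;
            self.epochs_since_best = 0;
        } else {
            self.epochs_since_best += 1;
        }
        // Assumed rule: never stop before the minimum epoch, and only stop
        // once the non-improving streak is longer than the patience.
        epoch >= self.min_epochs_before_early_stop
            && self.epochs_since_best > self.patience_epochs
    }
}

/// Returns the first epoch at which the tracker fires, if any.
fn first_stop_epoch(stopper: &mut EarlyStop, val_losses: &[f32]) -> Option<u32> {
    val_losses
        .iter()
        .enumerate()
        .find_map(|(epoch, &loss)| stopper.observe(epoch as u32, loss).then_some(epoch as u32))
}
```

Feeding the six val_loss values from the table into `first_stop_epoch` with `EarlyStop::new(2, 1)` fires at epoch 5 under this rule, while `EarlyStop::new(5, 3)` does not fire within those six epochs.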
noahgift added a commit that referenced this pull request on Apr 26, 2026
…und + fixed + first checkpoint, spec v2.65.0 → v2.66.0 (#1074)

User directive: "we should train a model unless the path is broken, then fix."

This section documents the FIRST sustained from-scratch MODEL-2 training run since the project began. Three real stack bugs were discovered DURING training and fixed at root.

## Bug 1: corpus exhaustion silently emits placeholder

`ShardBatchIter::next()` returned `None` after exhaustion; `Cuda*StepFn` returned the `(1.0, 1.0)` placeholder, masking exhaustion silently. The 5K-step run showed train_loss=1.0 / wall_s=0.4 for epochs 3-4 (impossible for real training).

Fix: PR #1073 first commit, `with_wrap_around(true)` opt-in.

Validation: re-ran 5K → 5 epochs of monotonic train_loss 10.111→9.700.

## Bug 2: early-stop on val noise, not stagnation

The 50K run with #1073 still early-stopped at epoch 5/24: train_loss 10.01→9.54 monotonic, val_loss bouncing 9.78-9.92 in a 0.04 noise band.

Root cause: HELD_OUT_BATCHES=2 (16k tokens, single-batch fluctuation matched epoch-over-epoch signal) + patience_epochs=2 (terminates on 2 noise epochs).

Fix: PR #1073 second commit. HELD_OUT_BATCHES 2→16 (131k tokens, 8× larger sample); patience_epochs 2→5; min_epochs_before_early_stop 1→3.

Validation: the tuned 50K run showed val_loss decreasing 9.95→9.84→9.78 across the first 3 epochs.

## Bug 3: corpus too small (18M vs Chinchilla 7.4B for 370M)

After fixes 1+2, the tuned run revealed overfitting at epoch 3+: train_loss continues 9.69→9.52, val_loss climbs 9.78→9.92, gap inverts. Classic small-corpus signature.

Fix not in code; it is data engineering: pretokenize The Stack v2 Python (multi-billion tokens). Deferred to next session per feedback_compute_pre_authorized.md.

## What was produced

Best checkpoint at /mnt/nvme-raid0/runs/model-2-from-scratch-006-50k-tuned/ckpt/epoch-002.apr:

- APR v2, 219 tensors, 1.39 GiB, checksum VALID
- Architecture: LlamaForCausalLM, name=llama-370m-pretrain
- val_loss=9.78, 49.2M tokens (corpus wrapped 2.7×)

`apr inspect` validates. AC-SHIP2-005 (.apr checkpoint format) structurally discharged.

## Coverage impact (no tally change)

| Gate | Prior | Post |
|------|-------|------|
| AC-SHIP2-005 | PARTIAL | structurally discharged |
| GATE-TRAIN-005 | PARTIAL | confirmed correct (didn't fire on real run) |
| GATE-TRAIN-001 | PARTIAL | confirmed correct (per-step metrics live) |

Coverage tally formally unchanged; structural verification awaits contract-level promotion.

## Methodology

Per feedback_fix_root_cause_never_route_around.md: zero route-arounds. Each bug had a `TrueCause`-style root cause analysis. The placeholder return on iterator exhaustion was a route-around itself; replaced with iterator looping. The early-stop policy was data-undersampled; fixed by widening the val set, not by raising the patience alone.

Spec v2.65.0 → v2.66.0. Evidence persisted to evidence/model-2-first-training-2026-04-26/ (12 metadata.json files: 5 from broken+fixed 5K runs, 7 from tuned 50K run).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
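The token counts quoted for Bugs 2 and 3 can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the noise estimate assumes per-token losses are roughly independent (so noise shrinks with the square root of the sample size), and the 20-tokens-per-parameter figure is the commonly cited Chinchilla rule of thumb rather than a value stated in the commit.

```rust
/// Tokens in the held-out set: batches × batch size × sequence length.
fn held_out_tokens(held_out_batches: u64, batch_size: u64, seq_len: u64) -> u64 {
    held_out_batches * batch_size * seq_len
}

/// Expected noise after growing the sample by `sample_growth`, assuming the
/// standard error scales as 1/sqrt(n) (an assumption, not a measured result).
fn scaled_noise(noise: f64, sample_growth: f64) -> f64 {
    noise / sample_growth.sqrt()
}

/// Rule-of-thumb compute-optimal token budget: about 20 tokens per parameter.
fn chinchilla_tokens(params: u64) -> u64 {
    20 * params
}
```

With batch=16 and seq=512, `held_out_tokens(2, 16, 512)` gives 16,384 and `held_out_tokens(16, 16, 512)` gives 131,072. `scaled_noise(0.04, 8.0)` comes out near 0.014, in line with the "~0.01" quoted, and `chinchilla_tokens(370_000_000)` gives the 7.4B figure set against the 18M-token corpus.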
…-cause for placeholder-loss bug
## Bug observed (2026-04-26)
5000-step `apr pretrain --device cuda --num-steps 5000 --steps-per-epoch
1000` on /mnt/nvme-raid0/data/csn-python-shards (18.1M tokens, 10 shards):
| Epoch | train_loss | val_loss | wall_s | Notes |
|------:|-----------:|---------:|-------:|:------|
| 0 | 10.111 | 9.967 | 263.9 | |
| 1 | 9.909 | 9.909 | 259.6 | |
| 2 | **2.836** | 9.902 | 54.8 | ← partial corpus-exhaust |
| 3 | **1.000** | 9.902 | 0.378 | ← all placeholder |
| 4 | **1.000** | 9.903 | 0.387 | ← all placeholder |
train_loss=1.0 with wall=0.4s means every step returned the
`(1.0, 1.0)` placeholder from `pretrain_real_cuda.rs:88-90`:
```rust
let Some(batch) = self.batches.next() else {
return (1.0, 1.0); // INV-TRAIN-007 / GATE-TRAIN-005 placeholder
};
```
## Root cause
`ShardBatchIter::next()` returns `None` when all `.bin` shards are
exhausted. With 18.1M tokens and a 5K-step run consuming 8.19M
tokens/epoch, the iterator exhausts in ~2 epochs. The
`Cuda*StepFn::step` placeholder masks the exhaustion silently —
no error, no warning, just garbage gradients.
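The exhaustion point follows from simple arithmetic, sketched below. The helper names are hypothetical, and batch=16 with seq=512 is taken from the follow-up commit's description; it is an assumption that the 5K run used the same values, though it is consistent with 8.19M tokens per 1000-step epoch.

```rust
/// Tokens consumed by one training epoch.
fn tokens_per_epoch(steps_per_epoch: u64, batch_size: u64, seq_len: u64) -> u64 {
    steps_per_epoch * batch_size * seq_len
}

/// Number of epochs of fresh data a corpus supplies before it is exhausted.
fn epochs_until_exhaustion(corpus_tokens: u64, tokens_per_epoch: u64) -> f64 {
    corpus_tokens as f64 / tokens_per_epoch as f64
}
```

`tokens_per_epoch(1000, 16, 512)` is 8,192,000, and `epochs_until_exhaustion(18_100_000, 8_192_000)` is about 2.21, which matches the partial epoch 2 and the all-placeholder epochs 3 and 4 in the table.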
## Fix (this PR)
Add opt-in `wrap_around` to `ShardBatchIter`:
- `pub fn with_wrap_around(self, bool) -> Self` (consuming builder)
- `pub fn epochs_completed(&self) -> u64` (cycles through full shard set)
- `read_one_sequence`: when shards exhausted AND `wrap_around==true`,
reset cursor_shard=0, increment epochs_completed, continue
- Default behaviour preserved (wrap_around=false → returns None)
`apr pretrain` real-corpus dispatch path (apr-cli/src/commands/
pretrain.rs:273) now calls `.with_wrap_around(true)`. This matches
PyTorch / HuggingFace standard behaviour where finite corpora loop
across multi-epoch training runs.
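The wrap-around behaviour described above can be sketched on an in-memory stand-in. `ShardSeqIter` is a hypothetical simplification, not the real `ShardBatchIter`: shards are plain vectors of token ids rather than `.bin` files, and the guard against an all-empty shard set is an assumption about how the "empty post-reset" case might be handled.

```rust
/// Hypothetical in-memory stand-in for `ShardBatchIter`: each shard is a
/// vector of items, walked shard by shard.
struct ShardSeqIter {
    shards: Vec<Vec<u32>>,
    cursor_shard: usize,
    cursor_seq: usize,
    wrap_around: bool,
    epochs_completed: u64,
}

impl ShardSeqIter {
    fn new(shards: Vec<Vec<u32>>) -> Self {
        Self { shards, cursor_shard: 0, cursor_seq: 0, wrap_around: false, epochs_completed: 0 }
    }

    /// Consuming builder; the default (`false`) keeps terminate-on-exhaustion.
    fn with_wrap_around(mut self, wrap_around: bool) -> Self {
        self.wrap_around = wrap_around;
        self
    }

    /// Number of full passes over the shard set completed so far.
    fn epochs_completed(&self) -> u64 {
        self.epochs_completed
    }
}

impl Iterator for ShardSeqIter {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        // A shard set with no data can never yield, wrap-around or not.
        if self.shards.iter().all(|s| s.is_empty()) {
            return None;
        }
        loop {
            if self.cursor_shard >= self.shards.len() {
                if !self.wrap_around {
                    return None; // default behaviour: exhaustion ends iteration
                }
                // Wrap: reset the cursor and count the completed pass.
                self.cursor_shard = 0;
                self.cursor_seq = 0;
                self.epochs_completed += 1;
                continue;
            }
            let shard = &self.shards[self.cursor_shard];
            if self.cursor_seq < shard.len() {
                let item = shard[self.cursor_seq];
                self.cursor_seq += 1;
                return Some(item);
            }
            // Current shard is done; move on to the next one.
            self.cursor_shard += 1;
            self.cursor_seq = 0;
        }
    }
}
```

In this sketch the pass counter is bumped lazily, on the first `next()` call after the last shard runs dry, so it reflects passes that have actually been followed by more data.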
## Tests
5 tests pass on `aprender-train --release --lib shard_reader`:
- `wrap_around_continues_past_shard_exhaustion` (NEW) — 12 batches
across 3 simulated epochs from a single-shard corpus that would
otherwise yield 4 batches before None
- `no_wrap_around_terminates_on_exhaustion` (NEW) — default
behaviour preserved
- 3 existing tests unchanged (single_shard, empty_dir, multi_shard)
## Why this is "fix at root cause"
Per `feedback_fix_root_cause_never_route_around.md`: the placeholder
return from `Cuda*StepFn` was a **route-around** for the exhaustion
case (avoid NaN/Inf misfire, avoid divergence cap fire). The real
problem is that production training MUST loop the corpus. With wrap-
around in `ShardBatchIter`, the placeholder fallback now only fires
on an actually-broken shard set (e.g., empty post-reset), at which
point an abort is the right response.
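One way to turn the remaining fallback into an abort is to surface exhaustion as an error instead of a placeholder loss. The sketch below is hypothetical: the real `Cuda*StepFn::step` signature is not shown in this PR, `StepError` is an invented type, the batch mean stands in for the actual forward/backward pass, and the meaning of the second tuple element is not specified in the source.

```rust
/// Hypothetical error type for a step that cannot obtain a batch.
#[derive(Debug, PartialEq)]
enum StepError {
    CorpusExhausted { step: u64 },
}

/// Runs one step over the next batch, or reports exhaustion to the caller
/// instead of returning a `(1.0, 1.0)` placeholder.
fn step<I>(batches: &mut I, step_idx: u64) -> Result<(f32, f32), StepError>
where
    I: Iterator<Item = Vec<f32>>,
{
    let Some(batch) = batches.next() else {
        return Err(StepError::CorpusExhausted { step: step_idx });
    };
    // Stand-in for real training: use the batch mean as a fake loss.
    let loss = batch.iter().sum::<f32>() / batch.len().max(1) as f32;
    // Second element is a dummy value for the unspecified second metric.
    Ok((loss, 0.0))
}
```

A training loop built on this shape can propagate the error with `?` and stop, so an empty shard set fails loudly rather than logging a loss of 1.0.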
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
345a9f8 to dcefc10
noahgift added a commit that referenced this pull request on Apr 27, 2026
…orization signature in v2.65 best run, spec v2.67.0 → v2.68.0 (#1076)

User mandate "train this model: now!" delivered a second from-scratch run on a 4.10× larger CSN-Python corpus (74.3M tokens vs 18.1M). Same hyperparameters as the v2.65.0 best 20K run on RTX 4090.

Result: final val_loss=9.806, best val_loss=9.751 at epoch 4. Did NOT beat the 1× run's "best" of 8.911, but for an *important* reason that §24 documents.

Critical comparison (epoch 9):

- 1× run: train=9.467 / val=8.911 / gap=-0.556 ← memorization
- 4× run: train=9.816 / val=9.806 / gap=-0.010 ← healthy

The 1× run's val < train by 0.556 nats is the signature of the val sequences sharing memorized substrings with the 9.1×-wrapped training corpus. The 4× run never exhibits this inversion: at 2.21× wraps the model is generalization-bound, not memorization-bound.

§24 establishes two falsifiable claims:

1. v2.65.0's 8.911 was memorization-driven (val<train confirms it).
2. Healthy MODEL-2 generalization on CSN-Python plateaus near val_loss≈9.8 at this hyperparameter budget.

Together these mean the contract target_val_loss=3.0 remains unreachable on CodeSearchNet-Python at any size. Stack v2 Python (multi-billion tokens) is required, as already noted in project_2026_04_26_session_complete_handoff.md priority 1.

Best 4× checkpoint: epoch-004.apr at val_loss=9.751

- APR v2 / 219 tensors / 1.39 GiB / checksum VALID

Methodology: zero eprintln!, zero route-arounds, fix-at-root held throughout. The §22 wrap_around fix (PR #1073) was load-bearing: without it the 4× run would have exhausted in 2 epochs and silently emitted placeholder loss.

Spec v2.67.0 → v2.68.0. No coverage tally change.

Evidence: evidence/model-2-corpus-4x-2026-04-27/training-summary.json (all 10 epoch metadatas + corpus stats + hyperparameters).

Run dir: /mnt/nvme-raid0/runs/model-2-from-scratch-009-4x-corpus

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
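The gap and wrap figures in the comparison can be reproduced with two small helpers. These are hypothetical names for illustration, and the wrap calculation assumes the 20K-step runs used batch=16 and seq=512 (values quoted for an earlier run, not restated in this commit).

```rust
/// Generalization gap as reported above: val_loss minus train_loss.
/// A strongly negative value means val is lower than train.
fn generalization_gap(train_loss: f64, val_loss: f64) -> f64 {
    val_loss - train_loss
}

/// How many times a run of `steps` steps passes over the corpus.
fn corpus_wraps(steps: u64, batch_size: u64, seq_len: u64, corpus_tokens: u64) -> f64 {
    (steps * batch_size * seq_len) as f64 / corpus_tokens as f64
}
```

Under those assumptions, `corpus_wraps(20_000, 16, 512, 18_100_000)` is roughly 9.05 and `corpus_wraps(20_000, 16, 512, 74_300_000)` is roughly 2.21, consistent with the 9.1× and 2.21× wraps quoted. `generalization_gap(9.467, 8.911)` gives -0.556 and `generalization_gap(9.816, 9.806)` gives -0.010.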
noahgift added a commit that referenced this pull request on May 15, 2026
…_LEVEL (#1682)

Uplift GATE-ARCH-370M-002 (binds AC-SHIP2-005) from undated `verdict: pass` to PARTIAL_ALGORITHM_LEVEL with live evidence from the first real MODEL-2 pretraining checkpoint, closing the §22 "STRUCTURALLY DISCHARGED" note that lacked a contract-level evidence binding until now.

Evidence on canonical artifact `epoch-002.apr` (§22 / PR #1073, 2026-04-26):

- file: /mnt/nvme-raid0/runs/model-2-from-scratch-006-50k-tuned/ckpt/epoch-002.apr
- size: 1.39 GiB (1494053060 bytes)
- apr inspect: exit 0
- valid: true
- format: "APR v2"
- tensor_count: 219
- architecture: LlamaForCausalLM
- checksum_valid: true

Re-verified live 2026-05-15 against `apr 0.33.0` (crates.io install).

The .apr-format invariants (magic, header, tensor manifest, checksum) are all satisfied on a real on-disk artifact produced by the actual training loop (not a synthetic fixture, not the stub). Full discharge PENDING on `apr qa --arch-contract <name>` subcommand binding (~50 LOC + 1 test follow-up; the underlying data fixture is already in place per this evidence dir).

Contract llama-370m-sovereign v1.10.0 → v1.11.0:

- GATE-ARCH-370M-002 gains: evidence_discharged_by, discharge_status: PARTIAL_ALGORITHM_LEVEL, partial_discharge_note, full_discharge_blocks_on
- New changelog entry 1.11.0 documenting the uplift

MODEL-2 ship %: 57% → 60% (1 of 10 PARTIALs gains a structured contract-level evidence binding; the 9 others either share this fixture or are blocked on AC-SHIP2-003's val_loss capacity ceiling per §34).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
## Bug observed (2026-04-26)

Real CUDA training run on `/mnt/nvme-raid0/data/csn-python-shards` (18.1M tokens) with `--num-steps 5000`: `train_loss=1.0 + wall=0.4s` is the `(1.0, 1.0)` placeholder from `pretrain_real_cuda.rs:88-90`. Every step ran the iterator-exhausted fallback for thousands of steps. Garbage gradients masquerading as training.

## Root cause

`ShardBatchIter::next()` returns `None` when shards are exhausted. With 18.1M tokens ÷ 8.19M tokens per epoch ≈ 2.2 epochs of real data, the iterator then dies. Per `feedback_fix_root_cause_never_route_around.md`, the placeholder was a route-around; the real fix is corpus looping.

## Fix

- `ShardBatchIter::with_wrap_around(true)` opt-in (default false preserves historical contract)
- `read_one_sequence` resets `cursor_shard=0` on exhaustion when wrap_around is on
- `epochs_completed()` accessor for downstream telemetry
- `apr pretrain` real-corpus path enables it (matches PyTorch/HF behavior)

## Tests

5 pass on `aprender-train --release --lib shard_reader`:

- `wrap_around_continues_past_shard_exhaustion`: 12 batches across 3 simulated epochs from a 1-shard fixture
- `no_wrap_around_terminates_on_exhaustion`: default behavior preserved

## Why this matters

MODEL-2 training cannot converge without this fix. The 18M-token CSN corpus exhausts in ~2 epochs; any run targeting >2 epochs silently produces garbage. The user's primary goal (per spec SHIP-TWO-001) is producing a converged 370M `.apr`; this is the prerequisite.

🤖 Generated with Claude Code