
fix(shard-reader): wrap_around so training can loop the corpus — root-cause for placeholder-loss bug#1073

Merged
noahgift merged 2 commits into main from fix/shard-iter-wrap-around on Apr 26, 2026

Conversation

@noahgift

Contributor

Bug observed (2026-04-26)

Real CUDA training run on /mnt/nvme-raid0/data/csn-python-shards (18.1M tokens) with --num-steps 5000:

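| Epoch | train_loss | val_loss | wall_s |
|------:|-----------:|---------:|-------:|
| 0     | 10.11      | 9.97     | 264    |
| 1     | 9.91       | 9.91     | 260    |
| 2     | 2.84       | 9.90     | 55     |
| 3     | 1.00       | 9.90     | 0.38   |
| 4     | 1.00       | 9.90     | 0.39   |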

train_loss=1.0 + wall=0.4s is the (1.0, 1.0) placeholder from pretrain_real_cuda.rs:88-90 — every step ran the iterator-exhausted fallback for thousands of steps. Garbage gradients masquerading as training.
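
The signature described here (loss pinned at exactly 1.0 with a near-zero wall time) can be flagged mechanically. The following is a minimal sketch, not something this PR adds: `EpochStats`, `looks_like_placeholder`, and the 1% wall-time threshold are all hypothetical names and values chosen for illustration.

```rust
/// Per-epoch telemetry as reported in the table above (hypothetical struct).
struct EpochStats {
    train_loss: f64,
    wall_s: f64,
}

/// Heuristic guard (an assumption, not part of the PR): an epoch looks like
/// placeholder output when train_loss equals the 1.0 placeholder value AND
/// the wall time is under 1% of a known-good baseline epoch.
fn looks_like_placeholder(epoch: &EpochStats, baseline_wall_s: f64) -> bool {
    let loss_is_placeholder = (epoch.train_loss - 1.0).abs() < 1e-6;
    let suspiciously_fast = epoch.wall_s < 0.01 * baseline_wall_s;
    loss_is_placeholder && suspiciously_fast
}
```

Using epoch 0 (about 264 s) as the baseline, epochs 3 and 4 would be flagged, while epoch 2 (partial exhaustion, train_loss 2.84) would not, so a guard like this catches only the fully degenerate case.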

Root cause

ShardBatchIter::next() returns None when the shards are exhausted. 18.1M tokens ÷ 8.19M tokens-per-epoch ≈ 2.2 epochs of real data, after which the iterator dies. Per feedback_fix_root_cause_never_route_around.md, the placeholder was a route-around; the real fix is corpus looping.
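
The exhaustion point can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, and batch=16 × seq=512 is assumed from the validation-set description elsewhere in this thread. Combined with `--steps-per-epoch 1000`, those values reproduce the 8.19M tokens-per-epoch figure.

```rust
/// Tokens consumed by one training epoch.
fn tokens_per_epoch(steps_per_epoch: u64, batch_size: u64, seq_len: u64) -> u64 {
    steps_per_epoch * batch_size * seq_len
}

/// Fractional number of epochs a non-wrapping iterator can serve
/// before it runs out of corpus.
fn epochs_until_exhaustion(corpus_tokens: u64, tokens_per_epoch: u64) -> f64 {
    corpus_tokens as f64 / tokens_per_epoch as f64
}
```

With 1000 steps, batch 16, and sequence length 512, `tokens_per_epoch` gives 8,192,000; dividing the 18.1M-token corpus by that gives about 2.2, which matches epoch 2 being the first partially degenerate epoch.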

Fix

  • ShardBatchIter::with_wrap_around(true) opt-in (default false preserves historical contract)
  • read_one_sequence resets cursor_shard=0 on exhaustion when wrap_around is on
  • epochs_completed() accessor for downstream telemetry
  • apr pretrain real-corpus path enables it (matches PyTorch/HF behavior)

Tests

5 pass on aprender-train --release --lib shard_reader:

  • NEW wrap_around_continues_past_shard_exhaustion — 12 batches across 3 simulated epochs from a 1-shard fixture
  • NEW no_wrap_around_terminates_on_exhaustion — default behavior preserved
  • 3 existing tests unchanged

Why this matters

MODEL-2 training cannot converge without this fix. The 18M-token CSN corpus exhausts in ~2 epochs; any run targeting >2 epochs silently produces garbage. The user's primary goal (per spec SHIP-TWO-001) is producing a converged 370M .apr; this is the prerequisite.

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) April 26, 2026 14:25
noahgift added a commit that referenced this pull request Apr 26, 2026
… root cause for premature stop

## Bug observed (2026-04-26)

50K-step from-scratch run on csn-python-shards (with PR #1073's
wrap-around in place) early-stopped at epoch 5/24 even though
train_loss was monotonically decreasing:

| Epoch | train_loss | val_loss |
|------:|-----------:|---------:|
| 0     | 10.010     | 9.909    |
| 1     | 9.798      | 9.791    |
| 2     | 9.689      | 9.733 ← best |
| 3     | 9.623      | 9.830    |
| 4     | 9.563      | 9.845    |
| 5     | 9.543      | 9.818    |

train_loss dropped 0.47 monotonically; val_loss bounced in a 0.11
band. Early-stop fired after 3 epochs of no val-improvement.

## Root cause

Two compounding issues:

1. **HELD_OUT_BATCHES = 2** (constant). With batch=16 seq=512,
   val set was just 16,384 tokens — single-batch fluctuation
   ~0.04, which is the same magnitude as legitimate epoch-over-
   epoch convergence signal during early training.
2. **patience_epochs = 2** (default). Combined with #1, this means
   2 epochs of normal val-noise terminate a still-converging run.
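
How a short patience interacts with a noisy val_loss can be replayed on the table above. The exact comparison rule used by `apr` is not shown in this PR, so the sketch below assumes one: stop once the count of consecutive epochs without a new best exceeds `patience`. That assumption was chosen because it reproduces the observed stop at epoch 5. The function name is hypothetical.

```rust
/// Returns the epoch index at which early-stop would fire, or None if it
/// never fires on the given history.
/// Assumed rule (not confirmed by the PR): stop when the number of
/// consecutive epochs without a new best val_loss exceeds `patience`,
/// provided at least `min_epochs` epochs have run.
fn early_stop_epoch(val_losses: &[f64], patience: usize, min_epochs: usize) -> Option<usize> {
    let mut best = f64::INFINITY;
    let mut since_best = 0usize;
    for (epoch, &loss) in val_losses.iter().enumerate() {
        if loss < best {
            best = loss;
            since_best = 0;
        } else {
            since_best += 1;
        }
        let epochs_run = epoch + 1;
        if epochs_run >= min_epochs && since_best > patience {
            return Some(epoch);
        }
    }
    None
}
```

On the val_loss column `[9.909, 9.791, 9.733, 9.830, 9.845, 9.818]`, patience 2 with min_epochs 1 returns `Some(5)`. A patience of 5 with min_epochs 3 returns `None` over the same six epochs, so the run would have continued.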

## Fix at root

- `HELD_OUT_BATCHES`: 2 → 16 (16,384 tokens → 131,072 tokens).
  8× larger sample reduces val_loss noise floor from ~0.04 to
  ~0.01, restoring early-stop signal-to-noise.
- `patience_epochs`: 2 → 5. Allows 5 epochs of plateau before
  honouring early-stop.
- `min_epochs_before_early_stop`: 1 → 3. Guarantees warmup
  window (1000 steps) + 1-2 post-warmup epochs of real learning
  before any early-stop check is honoured.
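
The token counts and the noise-floor estimate can be sanity-checked as follows. This is a sketch, assuming the noise behaves like a standard error that shrinks as 1/sqrt(n) with sample size, which is a simplification because tokens within a sequence are correlated. Function names are hypothetical.

```rust
/// Tokens in the held-out set for a given number of validation batches.
fn held_out_tokens(batches: u64, batch_size: u64, seq_len: u64) -> u64 {
    batches * batch_size * seq_len
}

/// Expected noise after growing the sample by `factor`,
/// ASSUMING standard-error scaling of 1/sqrt(n).
fn scaled_noise(base_noise: f64, factor: f64) -> f64 {
    base_noise / factor.sqrt()
}
```

Under that assumption an 8× larger sample turns ~0.04 into roughly 0.014, which is the same order as the ~0.01 quoted above.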

## Why this matters

Without these defaults, ANY from-scratch training run is at risk of
spurious early-stop on noisy val_loss while train_loss is still
clearly decreasing. The user's primary goal per SHIP-TWO-001 spec
is producing a converged 370M `.apr`; this fix is the prerequisite.

Per `feedback_fix_root_cause_never_route_around.md`: not routing
around with --no-early-stop or longer --num-steps. Fixing the
defaults so any future operator gets a sensible signal-to-noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…und + fixed + first checkpoint — spec v2.65.0 → v2.66.0 (#1074)

User directive: "we should train a model unless the path is broken,
then fix." This section documents the FIRST sustained from-scratch
MODEL-2 training run since the project began. Three real stack bugs
were discovered DURING training and fixed at root.

## Bug 1 — corpus exhaustion silently emits placeholder

`ShardBatchIter::next()→None` after exhaust; `Cuda*StepFn` returned
`(1.0, 1.0)` placeholder, masking exhaustion silently. 5K-step run
showed train_loss=1.0 / wall_s=0.4 for epochs 3-4 (impossible for
real training).

Fix: PR #1073 first commit — `with_wrap_around(true)` opt-in.
Validation: re-ran 5K → 5 epochs of monotonic train_loss
10.111→9.700.

## Bug 2 — early-stop on val noise not stagnation

50K run with #1073 still early-stopped at epoch 5/24 — train_loss
10.01→9.54 monotonic, val_loss bouncing 9.78-9.92 in 0.04 noise band.

Root cause: HELD_OUT_BATCHES=2 (16k tokens, single-batch fluctuation
matched epoch-over-epoch signal) + patience_epochs=2 (terminates on
2 noise epochs).

Fix: PR #1073 second commit — HELD_OUT_BATCHES 2→16 (131k tokens,
8× larger sample); patience_epochs 2→5; min_epochs_before_early_stop
1→3. Validation: tuned 50K showed val_loss decreasing 9.95→9.84→9.78
across first 3 epochs.

## Bug 3 — corpus too small (18M vs Chinchilla 7.4B for 370M)

After fixes 1+2, tuned run revealed overfitting at epoch 3+:
train_loss continues 9.69→9.52, val_loss climbs 9.78→9.92, gap
inverts. Classic small-corpus signature.

Fix not in code — data engineering: pretokenize The Stack v2 Python
(multi-billion tokens). Deferred to next session per
feedback_compute_pre_authorized.md.
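
The 7.4B figure is consistent with the common rule of thumb of about 20 training tokens per parameter. The sketch below encodes that heuristic as an assumption; the constant and function names are hypothetical and not taken from the codebase.

```rust
/// Assumed rule of thumb: ~20 training tokens per model parameter.
const TOKENS_PER_PARAM: u64 = 20;

/// Token budget suggested by the heuristic for a model of `params` parameters.
fn compute_optimal_tokens(params: u64) -> u64 {
    params * TOKENS_PER_PARAM
}

/// Fraction of that budget covered by a corpus of `corpus_tokens` tokens.
fn corpus_coverage(corpus_tokens: u64, params: u64) -> f64 {
    corpus_tokens as f64 / compute_optimal_tokens(params) as f64
}
```

For 370M parameters the budget is 7.4B tokens, so the 18.1M-token corpus covers roughly 0.24% of it.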

## What was produced

Best checkpoint at /mnt/nvme-raid0/runs/model-2-from-scratch-006-50k-tuned/
ckpt/epoch-002.apr:
- APR v2, 219 tensors, 1.39 GiB, checksum VALID
- Architecture: LlamaForCausalLM, name=llama-370m-pretrain
- val_loss=9.78, 49.2M tokens (corpus wrapped 2.7×)

`apr inspect` validates. AC-SHIP2-005 (.apr checkpoint format)
structurally discharged.

## Coverage impact (no tally change)

| Gate | Prior | Post |
|------|-------|------|
| AC-SHIP2-005 | PARTIAL | structurally discharged |
| GATE-TRAIN-005 | PARTIAL | confirmed correct (didn't fire on real run) |
| GATE-TRAIN-001 | PARTIAL | confirmed correct (per-step metrics live) |

Coverage tally formally unchanged; structural verification awaits
contract-level promotion.

## Methodology

Per feedback_fix_root_cause_never_route_around.md: zero route-arounds.
Each bug had a `TrueCause`-style root cause analysis. The placeholder
return on iterator exhaustion was a route-around itself; replaced
with iterator looping. The early-stop policy was data-undersampled;
fixed by widening the val set, not by raising the patience alone.

Spec v2.65.0 → v2.66.0.

Evidence persisted to evidence/model-2-first-training-2026-04-26/
(12 metadata.json files: 5 from broken+fixed 5K runs, 7 from tuned
50K run).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift and others added 2 commits April 26, 2026 19:59
…-cause for placeholder-loss bug

## Bug observed (2026-04-26)

5000-step `apr pretrain --device cuda --num-steps 5000 --steps-per-epoch
1000` on /mnt/nvme-raid0/data/csn-python-shards (18.1M tokens, 10 shards):

| Epoch | train_loss | val_loss | wall_s | note |
|------:|-----------:|---------:|-------:|:-----|
| 0     | 10.111     | 9.967    | 263.9  |      |
| 1     | 9.909      | 9.909    | 259.6  |      |
| 2     | **2.836**  | 9.902    | 54.8   | partial corpus-exhaust |
| 3     | **1.000**  | 9.902    | 0.378  | all placeholder |
| 4     | **1.000**  | 9.903    | 0.387  | all placeholder |

train_loss=1.0 + wall=0.4s == every step returned the
`(1.0, 1.0)` placeholder from `pretrain_real_cuda.rs:88-90`:

```rust
let Some(batch) = self.batches.next() else {
    return (1.0, 1.0);  // INV-TRAIN-007 / GATE-TRAIN-005 placeholder
};
```

## Root cause

`ShardBatchIter::next()` returns `None` when all `.bin` shards are
exhausted. With 18.1M tokens and a 5K-step run consuming 8.19M
tokens/epoch, the iterator exhausts in ~2 epochs. The
`Cuda*StepFn::step` placeholder masks the exhaustion silently —
no error, no warning, just garbage gradients.

## Fix (this PR)

Add opt-in `wrap_around` to `ShardBatchIter`:

- `pub fn with_wrap_around(self, bool) -> Self` (consuming builder)
- `pub fn epochs_completed(&self) -> u64` (cycles through full shard set)
- `read_one_sequence`: when shards exhausted AND `wrap_around==true`,
  reset cursor_shard=0, increment epochs_completed, continue
- Default behaviour preserved (wrap_around=false → returns None)

`apr pretrain` real-corpus dispatch path (apr-cli/src/commands/
pretrain.rs:273) now calls `.with_wrap_around(true)`. This matches
PyTorch / HuggingFace standard behaviour where finite corpora loop
across multi-epoch training runs.
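
The wrap-around behaviour can be illustrated with a toy in-memory iterator. This is a minimal sketch and not the real `ShardBatchIter`: `WrapIter` is a hypothetical type, shards are plain `Vec<u32>` rather than `.bin` files, and it yields single sequences rather than batches. Only the builder name, the accessor name, and the cursor-reset idea come from the description above.

```rust
/// Toy stand-in for a shard reader: each inner Vec is one shard of token ids.
struct WrapIter {
    shards: Vec<Vec<u32>>,
    seq_len: usize,
    cursor_shard: usize,
    cursor_tok: usize,
    wrap_around: bool,
    epochs_completed: u64,
}

impl WrapIter {
    fn new(shards: Vec<Vec<u32>>, seq_len: usize) -> Self {
        Self { shards, seq_len, cursor_shard: 0, cursor_tok: 0, wrap_around: false, epochs_completed: 0 }
    }

    /// Consuming builder; wrap-around stays off unless requested.
    fn with_wrap_around(mut self, on: bool) -> Self {
        self.wrap_around = on;
        self
    }

    /// Number of times the cursor has been reset to the first shard.
    fn epochs_completed(&self) -> u64 {
        self.epochs_completed
    }
}

impl Iterator for WrapIter {
    type Item = Vec<u32>;

    fn next(&mut self) -> Option<Vec<u32>> {
        // Allow at most one wrap per call so an empty corpus still terminates.
        let mut wrapped = false;
        loop {
            if self.cursor_shard >= self.shards.len() {
                if !self.wrap_around || wrapped {
                    return None;
                }
                self.cursor_shard = 0;
                self.cursor_tok = 0;
                self.epochs_completed += 1;
                wrapped = true;
                continue;
            }
            let shard = &self.shards[self.cursor_shard];
            if self.cursor_tok + self.seq_len <= shard.len() {
                let seq = shard[self.cursor_tok..self.cursor_tok + self.seq_len].to_vec();
                self.cursor_tok += self.seq_len;
                return Some(seq);
            }
            // Remaining tail is shorter than seq_len: advance to the next shard.
            self.cursor_shard += 1;
            self.cursor_tok = 0;
        }
    }
}
```

With one 8-token shard and `seq_len = 2`, the default iterator yields 4 sequences and then `None`. With `.with_wrap_around(true)`, taking 12 sequences succeeds and the fifth sequence equals the first. In this sketch the counter is incremented at the moment of the reset, so it reads 2 after those 12 sequences; the real accessor may count differently.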

## Tests

5 tests pass on `aprender-train --release --lib shard_reader`:
- `wrap_around_continues_past_shard_exhaustion` (NEW) — 12 batches
  across 3 simulated epochs from a single-shard corpus that would
  otherwise yield 4 batches before None
- `no_wrap_around_terminates_on_exhaustion` (NEW) — default
  behaviour preserved
- 3 existing tests unchanged (single_shard, empty_dir, multi_shard)

## Why this is "fix at root cause"

Per `feedback_fix_root_cause_never_route_around.md`: the placeholder
return from `Cuda*StepFn` was a **route-around** for the exhaustion
case (avoid NaN/Inf misfire, avoid divergence cap fire). The real
problem is that production training MUST loop the corpus. With wrap-
around in `ShardBatchIter`, the placeholder fallback now only fires
on an actually-broken shard set (e.g., empty post-reset), at which
point an abort is the right response.
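
One way to make the abort explicit is to surface exhaustion as an error instead of a numeric pair. The sketch below is not what this PR implements: `StepError` and `next_batch_or_abort` are hypothetical names, shown only to illustrate how a missing batch can be made impossible to confuse with a real step result.

```rust
use std::fmt;

/// Hypothetical error type; not part of the PR.
#[derive(Debug, PartialEq)]
enum StepError {
    CorpusExhausted { step: u64 },
}

impl fmt::Display for StepError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StepError::CorpusExhausted { step } => {
                write!(f, "batch iterator exhausted at step {}", step)
            }
        }
    }
}

impl std::error::Error for StepError {}

/// Pull the next batch or report exhaustion explicitly, so the caller
/// must handle the failure rather than train on a placeholder value.
fn next_batch_or_abort<I, B>(batches: &mut I, step: u64) -> Result<B, StepError>
where
    I: Iterator<Item = B>,
{
    batches.next().ok_or(StepError::CorpusExhausted { step })
}
```

A training loop built this way would propagate the error with `?` and stop, which matches the "abort is the right response" position for a genuinely broken shard set.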

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… root cause for premature stop

@noahgift noahgift force-pushed the fix/shard-iter-wrap-around branch from 345a9f8 to dcefc10 Compare April 26, 2026 17:59
@noahgift noahgift merged commit fb632fc into main Apr 26, 2026
10 checks passed
@noahgift noahgift deleted the fix/shard-iter-wrap-around branch April 26, 2026 18:27
noahgift added a commit that referenced this pull request Apr 27, 2026
…orization signature in v2.65 best run — spec v2.67.0 → v2.68.0 (#1076)

User mandate "train this model: now!" delivered second from-scratch
run on a 4.10× larger CSN-Python corpus (74.3M tokens vs 18.1M).
Same hyperparameters as v2.65.0 best 20K run on RTX 4090.

Result: final val_loss=9.806, best val_loss=9.751 at epoch 4. Did
NOT beat 1× run's "best" of 8.911 — but for an *important* reason
that §24 documents.

Critical comparison (epoch 9):
- 1× run: train=9.467 / val=8.911 / gap=-0.556 ← memorization
- 4× run: train=9.816 / val=9.806 / gap=-0.010 ← healthy

The 1× run's val < train by 0.556 nats is the signature of the
val sequences sharing memorized substrings with the 9.1×-wrapped
training corpus. The 4× run never exhibits this inversion — at
2.21× wraps the model is generalization-bound, not memorization-
bound.
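
The gap comparison can be written down as a small check. This is a sketch: the function names and the tolerance value are hypothetical, and a negative gap is treated here as a warning sign rather than proof of memorization.

```rust
/// Generalization gap as used in the comparison above: val_loss minus train_loss.
fn gap(train_loss: f64, val_loss: f64) -> f64 {
    val_loss - train_loss
}

/// Flags a run whose val_loss sits below train_loss by more than
/// `tolerance` nats (the threshold is an assumption for illustration).
fn looks_memorized(train_loss: f64, val_loss: f64, tolerance: f64) -> bool {
    gap(train_loss, val_loss) < -tolerance
}
```

With a tolerance of 0.1 nats, the 1× run at epoch 9 (train 9.467, val 8.911) is flagged and the 4× run (train 9.816, val 9.806) is not.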

§24 establishes two falsifiable claims:
1. v2.65.0's 8.911 was memorization-driven (val<train confirms it).
2. Healthy MODEL-2 generalization on CSN-Python plateaus near
   val_loss≈9.8 at this hyperparameter budget.

Together these mean the contract target_val_loss=3.0 remains
unreachable on CodeSearchNet-Python at any size — Stack v2 Python
(multi-billion tokens) is required, as already noted in
project_2026_04_26_session_complete_handoff.md priority 1.

Best 4× checkpoint: epoch-004.apr at val_loss=9.751
- APR v2 / 219 tensors / 1.39 GiB / checksum VALID

Methodology: zero eprintln!, zero route-arounds, fix-at-root held
throughout. The §22 wrap_around fix (PR #1073) was load-bearing:
without it the 4× run would have exhausted after ~4.5 epochs
(10 epochs at 2.21× wraps) and silently emitted placeholder loss.

Spec v2.67.0 → v2.68.0. No coverage tally change.

Evidence: evidence/model-2-corpus-4x-2026-04-27/training-summary.json
(all 10 epoch metadatas + corpus stats + hyperparameters).

Run dir: /mnt/nvme-raid0/runs/model-2-from-scratch-009-4x-corpus

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 15, 2026
…_LEVEL (#1682)

Uplift GATE-ARCH-370M-002 (binds AC-SHIP2-005) from undated `verdict: pass`
to PARTIAL_ALGORITHM_LEVEL with live evidence from the first real MODEL-2
pretraining checkpoint, closing the §22 "STRUCTURALLY DISCHARGED" note
that lacked a contract-level evidence binding until now.

Evidence on canonical artifact `epoch-002.apr` (§22 / PR #1073, 2026-04-26):
  - file:           /mnt/nvme-raid0/runs/model-2-from-scratch-006-50k-tuned/ckpt/epoch-002.apr
  - size:           1.39 GiB (1494053060 bytes)
  - apr inspect:    exit 0
  - valid:          true
  - format:         "APR v2"
  - tensor_count:   219
  - architecture:   LlamaForCausalLM
  - checksum_valid: true

Re-verified live 2026-05-15 against `apr 0.33.0` (crates.io install).

The .apr-format invariants (magic, header, tensor manifest, checksum)
are all satisfied on a real on-disk artifact produced by the actual
training loop — not a synthetic fixture, not the stub. Full discharge
PENDING on `apr qa --arch-contract <name>` subcommand binding
(~50 LOC + 1 test follow-up; the underlying data fixture is already
in place per this evidence dir).

Contract llama-370m-sovereign v1.10.0 → v1.11.0:
  - GATE-ARCH-370M-002 gains:
      evidence_discharged_by, discharge_status: PARTIAL_ALGORITHM_LEVEL,
      partial_discharge_note, full_discharge_blocks_on
  - New changelog entry 1.11.0 documenting the uplift

MODEL-2 ship %: 57% → 60% (1 of 10 PARTIALs gains a structured
contract-level evidence binding; the 9 others either share this fixture
or are blocked on AC-SHIP2-003's val_loss capacity ceiling per §34).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>