docs(ship-two-001): §22 first real MODEL-2 training - 3 stack bugs found + fixed + first checkpoint #1074
Merged
…und + fixed + first checkpoint - spec v2.65.0 → v2.66.0

User directive: "we should train a model unless the path is broken, then fix."

This section documents the FIRST sustained from-scratch MODEL-2 training run since the project began. Three real stack bugs were discovered DURING training and fixed at root.

## Bug 1: corpus exhaustion silently emits placeholder

`ShardBatchIter::next()→None` after exhaust; `Cuda*StepFn` returned `(1.0, 1.0)` placeholder, masking exhaustion silently. 5K-step run showed train_loss=1.0 / wall_s=0.4 for epochs 3-4 (impossible for real training).

Fix: PR #1073 first commit, `with_wrap_around(true)` opt-in.

Validation: re-ran 5K → 5 epochs of monotonic train_loss 10.111→9.700.

## Bug 2: early-stop on val noise not stagnation

50K run with #1073 still early-stopped at epoch 5/24: train_loss 10.01→9.54 monotonic, val_loss bouncing 9.78-9.92 in 0.04 noise band.

Root cause: HELD_OUT_BATCHES=2 (16k tokens, single-batch fluctuation matched epoch-over-epoch signal) + patience_epochs=2 (terminates on 2 noise epochs).

Fix: PR #1073 second commit, HELD_OUT_BATCHES 2→16 (131k tokens, 8× larger sample); patience_epochs 2→5; min_epochs_before_early_stop 1→3.

Validation: tuned 50K showed val_loss decreasing 9.95→9.84→9.78 across first 3 epochs.

## Bug 3: corpus too small (18M vs Chinchilla 7.4B for 370M)

After fixes 1+2, tuned run revealed overfitting at epoch 3+: train_loss continues 9.69→9.52, val_loss climbs 9.78→9.92, gap inverts. Classic small-corpus signature.

Fix not in code; data engineering: pretokenize The Stack v2 Python (multi-billion tokens). Deferred to next session per feedback_compute_pre_authorized.md.

## What was produced

Best checkpoint at /mnt/nvme-raid0/runs/model-2-from-scratch-006-50k-tuned/ckpt/epoch-002.apr:

- APR v2, 219 tensors, 1.39 GiB, checksum VALID
- Architecture: LlamaForCausalLM, name=llama-370m-pretrain
- val_loss=9.78, 49.2M tokens (corpus wrapped 2.7×)

`apr inspect` validates. AC-SHIP2-005 (.apr checkpoint format) structurally discharged.

## Coverage impact (no tally change)

| Gate | Prior | Post |
|------|-------|------|
| AC-SHIP2-005 | PARTIAL | structurally discharged |
| GATE-TRAIN-005 | PARTIAL | confirmed correct (didn't fire on real run) |
| GATE-TRAIN-001 | PARTIAL | confirmed correct (per-step metrics live) |

Coverage tally formally unchanged; structural verification awaits contract-level promotion.

## Methodology

Per feedback_fix_root_cause_never_route_around.md: zero route-arounds. Each bug had a `TrueCause`-style root cause analysis. The placeholder return on iterator exhaustion was a route-around itself; replaced with iterator looping. The early-stop policy was data-undersampled; fixed by widening the val set, not by raising the patience alone.

Spec v2.65.0 → v2.66.0. Evidence persisted to evidence/model-2-first-training-2026-04-26/ (12 metadata.json files: 5 from broken+fixed 5K runs, 7 from tuned 50K run).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
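The Bug 1 fix described above (opt-in wrap-around instead of a placeholder loss) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's `ShardBatchIter`: the `WrappingBatchIter` type, its in-memory `Vec<Vec<u32>>` batches, the `epochs_completed` counter, and the `run_steps` driver are all hypothetical names introduced here. Only the `with_wrap_around(true)` builder name comes from the PR text.

```rust
/// Hypothetical stand-in for a shard-backed batch iterator.
/// Batches are held in memory purely for illustration.
pub struct WrappingBatchIter {
    batches: Vec<Vec<u32>>,
    cursor: usize,
    wrap_around: bool,
    epochs_completed: usize,
}

impl WrappingBatchIter {
    pub fn new(batches: Vec<Vec<u32>>) -> Self {
        Self { batches, cursor: 0, wrap_around: false, epochs_completed: 0 }
    }

    /// Opt in to looping over the corpus (builder name taken from the PR text).
    pub fn with_wrap_around(mut self, enabled: bool) -> Self {
        self.wrap_around = enabled;
        self
    }

    /// Number of times the iterator has wrapped back to the first batch.
    pub fn epochs_completed(&self) -> usize {
        self.epochs_completed
    }
}

impl Iterator for WrappingBatchIter {
    type Item = Vec<u32>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.batches.is_empty() {
            return None;
        }
        if self.cursor == self.batches.len() {
            if !self.wrap_around {
                return None; // exhausted: the caller must handle this explicitly
            }
            self.cursor = 0;
            self.epochs_completed += 1;
        }
        let batch = self.batches[self.cursor].clone();
        self.cursor += 1;
        Some(batch)
    }
}

/// Hypothetical step driver: exhaustion surfaces as an error,
/// never as a placeholder loss value.
pub fn run_steps<I, F>(iter: &mut I, steps: usize, mut step_fn: F) -> Result<Vec<f32>, String>
where
    I: Iterator<Item = Vec<u32>>,
    F: FnMut(&[u32]) -> f32,
{
    let mut losses = Vec::with_capacity(steps);
    for step in 0..steps {
        let batch = iter
            .next()
            .ok_or_else(|| format!("corpus exhausted at step {}", step))?;
        losses.push(step_fn(&batch));
    }
    Ok(losses)
}
```

With wrap-around enabled, `run_steps` can request more steps than the corpus has batches; with it disabled, the same request returns an `Err` naming the step at which data ran out, so a flat `train_loss=1.0` can no longer hide exhaustion.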
noahgift added a commit that referenced this pull request on Apr 26, 2026
…t 17× anomaly site - spec v2.66.0 → v2.67.0 (#1075)

§17.4 specified sub-layer bisection of FFN as the falsifier next step. PR #1066 added the 4 sub-FFN ActivationStats fields. §23 records the first run on the canonical 7B teacher post-#1066-merge.

(Originally authored as §21 in the closed PR #1072. Re-numbered as §23 because §22 (PR #1074) landed first with v2.66.0 banner; this PR brings v2.67.0.)

## Key finding

Live `apr trace --payload` on `paiml/qwen2.5-coder-7b-apache-q4k-v1` teacher (CPU, prompt "What is 2+2?") layer-3 sub-FFN std:

| Sub-FFN slot | L1-2 baseline | L3 | Ratio |
|--------------|--------------:|----:|------:|
| ffn_norm | 0.85 / 0.86 | 1.00 | 1.16× normal |
| ffn_gate | 1.50 / 1.99 | 1.92 | 0.97× normal |
| ffn_up | 1.10 / 0.94 | 1.34 | 1.42× small |
| ffn_silu | 0.043 / 0.052 | 0.168 | 3.2× precursor |
| **ffn_swigl** | **0.061 / 0.071** | **1.222** | **17.2× anomaly** |
| ffn_out | 0.345 / 0.216 | 11.459 | 53× cascade |

Gate/up individually normal at layer 3. Element-wise multiply at inference.rs:163 `ffn_hidden.push(silu_g * u)` is the named bug site (possibly off-by-one slice indexing).

## Bug surface narrowing chain

- §15.4: GPU GQA kernel ELIMINATED
- §16: GPU stack ELIMINATED (CPU APR vs GGUF)
- §17: layer 3 FFN sub-block named (53× ffn_out)
- **§23: layer 3 ffn_swigl named (17× first anomaly site)**

## Falsifiable next investigation step (§23.6)

Extend `OwnedQuantizedModel::forward_traced` (the GGUF path; needs to be authored per `project_ship_007_gguf_forward_traced_plan.md`) with same 4 sub-FFN fields. Compare APR vs GGUF layer-3 ffn_swigl directly:

- ≈0.07 → APR-side bug pinned to inference.rs:160-164
- ≈1.22 → spike is normal model behavior; bug elsewhere

## Evidence persisted

- evidence/ship-007-layer-3-anomaly/sub-ffn-bisection-2026-04-26.txt (386 lines)
- evidence/ship-007-layer-3-anomaly/sub-ffn-per-layer-stds.csv (28-layer × 6-field summary)

Spec v2.66.0 → v2.67.0. No coverage tally change.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
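The bisection arithmetic in the commit message above can be reproduced with a short sketch. The std values are copied from its table; the Ratio column appears to be the layer-3 std divided by the layer-2 value (the second number in each baseline pair), which is an inference from the figures rather than something the source states. The function names, the `Verdict` enum, the anomaly threshold passed by the caller, and the tolerances in `classify_gguf_swigl` are hypothetical choices made for illustration only.

```rust
/// (slot name, layer-2 std, layer-3 std), copied from the table above.
pub const SUB_FFN_STDS: [(&str, f64, f64); 6] = [
    ("ffn_norm", 0.86, 1.00),
    ("ffn_gate", 1.99, 1.92),
    ("ffn_up", 0.94, 1.34),
    ("ffn_silu", 0.052, 0.168),
    ("ffn_swigl", 0.071, 1.222),
    ("ffn_out", 0.216, 11.459),
];

/// Layer-3 std divided by the layer-2 baseline, per slot, in pipeline order.
pub fn ratios(rows: &[(&'static str, f64, f64)]) -> Vec<(&'static str, f64)> {
    rows.iter().map(|&(name, base, l3)| (name, l3 / base)).collect()
}

/// First slot (in pipeline order) whose ratio reaches `threshold`.
/// The threshold is the caller's assumption, not a value from the spec.
pub fn first_anomaly(rows: &[(&'static str, f64, f64)], threshold: f64) -> Option<&'static str> {
    ratios(rows)
        .into_iter()
        .find(|&(_, ratio)| ratio >= threshold)
        .map(|(name, _)| name)
}

/// Outcome of the proposed APR-vs-GGUF comparison at layer-3 ffn_swigl.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    AprSideBug,
    SpikeIsNormal,
    Inconclusive,
}

/// Applies the decision rule from the falsifiable next step:
/// about 0.07 points at the APR path, about 1.22 means the spike is normal.
/// The tolerances (0.03 and 0.30) are illustrative assumptions.
pub fn classify_gguf_swigl(gguf_std: f64) -> Verdict {
    if (gguf_std - 0.07).abs() <= 0.03 {
        Verdict::AprSideBug
    } else if (gguf_std - 1.22).abs() <= 0.30 {
        Verdict::SpikeIsNormal
    } else {
        Verdict::Inconclusive
    }
}
```

Calling `first_anomaly(&SUB_FFN_STDS, 10.0)` walks the slots in pipeline order, so a downstream cascade such as `ffn_out` cannot be reported ahead of the slot where the deviation first appears.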
Summary
User mandate: "we should train a model unless the path is broken, then fix."
Delivered this session:
- … feedback_fix_root_cause_never_route_around.md)
- `.apr` checkpoint (1.39 GiB, 219 tensors, checksum VALID, arch=LlamaForCausalLM)

§22 contains 8 subsections
- `ShardBatchIter` corpus exhaustion silently emits `(1.0, 1.0)` placeholder for 1000s of steps. Fix at root: `with_wrap_around(true)` opt-in (PR #1073, "fix(shard-reader): wrap_around so training can loop the corpus - root-cause for placeholder-loss bug")
- `epoch-002.apr` at val_loss=9.78, 49.2M tokens, format-validated

Spec progression
v2.65.0 → v2.66.0. Coverage tally formally unchanged (structural verification awaits contract-level promotion).
Evidence persisted
12 metadata.json files at evidence/model-2-first-training-2026-04-26/: 5 from the broken+fixed 5K runs, 7 from the tuned 50K run.
Stacks under
None (independent spec amendment; PR #1073 not yet merged but referenced)
Why this matters
This is the first time MODEL-2 has been trained for real. Prior sessions had only synthetic-drive runs or broken-path attempts. The fixes for bugs 1 and 2 in PR #1073 are the prerequisite for any future operator dispatching real MODEL-2 training; without them, runs silently report placeholder losses after the corpus is exhausted (bug 1) or terminate prematurely on validation noise (bug 2). Bug 3 (corpus too small) is a data-engineering task and is not fixed in code.
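The premature termination in bug 2 comes down to how patience, a minimum epoch count, and held-out sample size interact. The sketch below is a minimal illustration under assumptions, not the project's early-stop code: the `EarlyStop` struct, `observe`, `stop_epoch`, and `held_out_tokens` are hypothetical names, and 8,192 tokens per batch is inferred from "2 batches = 16k tokens" rather than stated in the source. Only the parameter names `patience_epochs` and `min_epochs_before_early_stop` and their old and new values come from the PR text.

```rust
/// Hypothetical patience-based early-stop policy.
pub struct EarlyStop {
    patience_epochs: usize,
    min_epochs_before_early_stop: usize,
    best_val_loss: f64,
    epochs_since_best: usize,
    epochs_seen: usize,
}

impl EarlyStop {
    pub fn new(patience_epochs: usize, min_epochs_before_early_stop: usize) -> Self {
        Self {
            patience_epochs,
            min_epochs_before_early_stop,
            best_val_loss: f64::INFINITY,
            epochs_since_best: 0,
            epochs_seen: 0,
        }
    }

    /// Records one epoch's validation loss; returns true when training should stop.
    pub fn observe(&mut self, val_loss: f64) -> bool {
        self.epochs_seen += 1;
        if val_loss < self.best_val_loss {
            self.best_val_loss = val_loss;
            self.epochs_since_best = 0;
        } else {
            self.epochs_since_best += 1;
        }
        self.epochs_seen >= self.min_epochs_before_early_stop
            && self.epochs_since_best >= self.patience_epochs
    }
}

/// 1-based epoch at which the policy stops, or None if it never fires.
pub fn stop_epoch(policy: &mut EarlyStop, val_losses: &[f64]) -> Option<usize> {
    val_losses
        .iter()
        .position(|&loss| policy.observe(loss))
        .map(|index| index + 1)
}

/// Size of the held-out sample in tokens.
/// `tokens_per_batch` is supplied by the caller; 8192 is an inferred value.
pub fn held_out_tokens(batches: usize, tokens_per_batch: usize) -> usize {
    batches * tokens_per_batch
}
```

As a usage illustration with made-up losses (not taken from the runs), the sequence `[9.90, 9.80, 9.88, 9.85, 9.78, 9.83]` stops `EarlyStop::new(2, 1)` at epoch 4, one epoch before the new best at epoch 5, while `EarlyStop::new(5, 3)` lets the run continue. Widening the held-out set from 2 to 16 batches reduces the noise itself, which is why the source treats it as the primary fix and the patience change as secondary.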
🤖 Generated with Claude Code