docs(ship-two-001): §33 — MODEL-2 codeparrot retrain val_loss=9.3837 confirms §25 corpus-diversity hypothesis #1094

Merged
noahgift merged 1 commit into main from docs/spec-33-codeparrot-retrain-success
Apr 28, 2026
@noahgift

Contributor

Summary

P1 corpus pipeline complete end-to-end. P2 MODEL-2 retrain on the 565.6M-token codeparrot Python+permissive corpus (7.6× the 4× CSN-Python baseline) pushes val_loss from the 9.7507 plateau to 9.3837, a 0.367-nat (3.8%) improvement.

§25 had concluded:

"There is no LR/step configuration that beats val_loss=9.75 on CSN-Python — only Stack v2 will move the needle."

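§33 confirms this empirically: with the same training configuration, changing only the corpus moved val_loss below the 9.75 plateau.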
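
To put the 0.367-nat figure from the summary in more familiar terms, the sketch below converts the loss delta into a perplexity ratio. It assumes val_loss is a mean per-token cross-entropy in nats (the summary reports the delta in nats, but the exact reduction is not stated here), so perplexity is taken as `exp(val_loss)`. The constant names are illustrative only.

```python
import math

# Values quoted in the summary above.
PLATEAU_VAL_LOSS = 9.7507   # 4x CSN-Python plateau
RETRAIN_VAL_LOSS = 9.3837   # codeparrot retrain, best epoch

# Absolute improvement in nats.
delta_nats = PLATEAU_VAL_LOSS - RETRAIN_VAL_LOSS

# ASSUMPTION: val_loss is mean per-token cross-entropy in nats,
# so perplexity = exp(val_loss) and the ratio depends only on the delta.
perplexity_ratio = math.exp(-delta_nats)

print(f"delta: {delta_nats:.4f} nats")
print(f"perplexity ratio (retrain / plateau): {perplexity_ratio:.3f}")
```

Under that assumption, a 0.367-nat drop corresponds to a perplexity ratio of roughly 0.69, which is a larger relative change than the raw loss values suggest.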

Pipeline (all stack-canonical, zero muda)

| Phase | Outcome |
|-------|---------|
| P1.0 contract authored | #1080, #1089 |
| P1.1 apr pull dataset extension | #1089 MERGED |
| P1.4 codeparrot pull | 80 shards / 27 GB |
| P1.5a parquet → JSONL filter | 405,904 rows / 3.17 GB |
| P1.5b BPE encode-corpus | 57 shards / 565.6M tokens / 10h |
| P2 MODEL-2 retrain on RTX 4090 | EARLY_STOP at 51 ep / 47 min |

Total wall time from contract authoring to val_loss=9.3837: ~14 hours.
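
For readers unfamiliar with the P1.5a step, the following is a minimal sketch of what a row-level "permissive license" filter that emits JSONL might look like. It is not the `apr` implementation. The field names `license` and `content`, the allow-list in `PERMISSIVE`, and the function name `filter_rows_to_jsonl` are all hypothetical, and the sketch operates on already-decoded rows (plain dicts) rather than reading parquet.

```python
import io
import json

# ASSUMPTION: an illustrative allow-list; the real pipeline's list is not given here.
PERMISSIVE = {"mit", "apache-2.0", "bsd-3-clause"}


def filter_rows_to_jsonl(rows, out, allowed=PERMISSIVE):
    """Write one JSON object per kept row to `out`; return the kept count.

    `rows` is any iterable of dicts with hypothetical `license` and
    `content` keys. Rows whose license is not in `allowed` are skipped.
    """
    kept = 0
    for row in rows:
        if str(row.get("license", "")).lower() in allowed:
            out.write(json.dumps({"content": row["content"]}) + "\n")
            kept += 1
    return kept


# Toy rows, purely illustrative.
toy_rows = [
    {"license": "MIT", "content": "print('a')"},
    {"license": "gpl-3.0", "content": "print('b')"},
    {"license": "apache-2.0", "content": "print('c')"},
]
buf = io.StringIO()
kept = filter_rows_to_jsonl(toy_rows, buf)
print(kept, "rows kept")
```

Writing one object per line keeps the output streamable, so a downstream encoder can process shards without loading a whole file.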

Training curve highlights

| Epoch | val_loss |
|-------|----------|
| 0 | 10.0698 |
| 10 | 9.5657 |
| 20 | 9.4771 |
| 30 | 9.42x |
| 44 | 9.3837 ← BEST |
| 50 | 9.3889 (EARLY_STOP) |

Full per-epoch metadata in evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json.
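
The EARLY_STOP entry in the curve above can be illustrated with a generic patience-based rule. This is a minimal sketch under stated assumptions, not the trainer's actual logic: the function name `find_early_stop`, the `patience` and `min_delta` parameters, and the toy loss values are hypothetical.

```python
def find_early_stop(val_losses, patience, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a patience-based early stop.

    Training stops at the first epoch that is `patience` epochs past the
    best epoch without improving on it by more than `min_delta`.
    `stop_epoch` is None if the rule never triggers.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None


# Toy curve, purely illustrative (not the values in all-epochs.json).
toy_losses = [10.0, 9.6, 9.5, 9.45, 9.46, 9.47, 9.48]
best_epoch, stop_epoch = find_early_stop(toy_losses, patience=3)
print(best_epoch, stop_epoch)
```

Under this definition, a best epoch of 44 followed by a stop at epoch 50 would be consistent with a patience of 6 epochs, though the actual patience setting is not stated in this PR.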

Methodology proven (§26.8 pays off)

The §26.8 stack-tool-extension rule paid off concretely:

  • 6h authoring cost (P1.0 contract + P1.1 impl) → permanent apr capability
  • Every future dataset pull benefits
  • §33's val_loss=9.3837 is downstream proof of the methodology

Coverage impact

§33 is binding evidence for SHIP-021 (corpus diversity binding) — promotion to DISCHARGED is deferred to a separate PR that updates the SHIP-021 contract atomically. Spec scoreboard unchanged (15+33) in this PR per "one coverage flip per PR" methodology.

Files

  • Spec: §33 added (~80 lines, 8 subsections)
  • Evidence:
    • evidence/model-2-codeparrot-retrain-2026-04-28/launch.log
    • evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json
  • Best checkpoint: /mnt/nvme-raid0/runs/model-2-from-scratch-010-codeparrot/ckpt/epoch-044.apr (live on RTX 4090 host)

Test plan

  • Spec self-consistent: header v2.78.0 references new §33
  • Evidence files commit cleanly (launch.log force-added past .gitignore)
  • Training data live, reproducible via the launch script in /mnt/nvme-raid0/data/codeparrot-python-permissive-shards/

Next session

Per §33.4: re-train with --num-steps 200000 and looser early-stop patience to push val_loss further. With 565.6M tokens available and only 83.5M (15%) seen at EARLY_STOP, there's significant headroom.
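
The headroom claim follows directly from the two token counts quoted above. A short check, using only those figures (the variable names are illustrative):

```python
# Token counts quoted in this PR.
TOTAL_TOKENS = 565.6e6   # encoded corpus size
SEEN_TOKENS = 83.5e6     # tokens consumed by the time of EARLY_STOP

fraction_seen = SEEN_TOKENS / TOTAL_TOKENS
remaining_tokens = TOTAL_TOKENS - SEEN_TOKENS

print(f"seen: {fraction_seen:.1%} of corpus")
print(f"unseen: {remaining_tokens / 1e6:.1f}M tokens")
```

This reproduces the roughly 15% figure and shows that about 482M tokens of the corpus were never seen by the model in this run.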

🤖 Generated with Claude Code

…confirms §25 corpus-diversity hypothesis — v2.77 → v2.78

P1 corpus pipeline complete end-to-end. P2 MODEL-2 retrain on 565.6M-token
codeparrot Python+permissive corpus (7.6× the 4× CSN-Python baseline)
pushes val_loss from the 9.7507 plateau to 9.3837, a 0.367-nat (3.8%)
improvement with the SAME training configuration.

§25 had concluded (after 80K-step LR-budget falsification on 4× CSN-Python):
  "There is no LR/step configuration that beats val_loss=9.75 on
   CSN-Python — only Stack v2 will move the needle."

§33 confirms this empirically. The corpus-diversity binding criterion of
§26.9 is satisfied.

## Pipeline (all stack-canonical, no muda)

| Phase | Outcome |
|-------|---------|
| P1.0 contract authored (PROPOSED → ACTIVE) | #1080, #1089 |
| P1.1 apr pull dataset extension | #1089 MERGED |
| P1.4 codeparrot pull | 80 shards / 27 GB |
| P1.5a parquet → JSONL filter | 405,904 rows / 3.17 GB |
| P1.5b BPE encode-corpus | 57 shards / 565.6M tokens / 10h |
| P2 MODEL-2 retrain on RTX 4090 | EARLY_STOP at 51 ep / 47 min |

Total wall time from contract authoring to val_loss=9.3837: ~14 hours.

## Training curve highlights

- epoch 0: train=9.7567, val=10.0698 (init)
- epoch 10: train=9.4610, val=9.5657 (post-warmup)
- epoch 30: train=9.2x, val=9.42x
- epoch 44: val=9.3837 (BEST)
- epoch 50: train=9.2093, val=9.3889 (EARLY_STOP next)

Full per-epoch metadata in evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json.

## Coverage impact

§33 is binding evidence for SHIP-021 (corpus diversity binding) — promotion
to DISCHARGED is deferred to a separate PR that updates the SHIP-021
contract atomically. Spec scoreboard unchanged (15+33) in this PR.

## Files

- evidence/model-2-codeparrot-retrain-2026-04-28/launch.log
- evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json
- §33 spec section (8 subsections, ~80 lines)
- Header: v2.77.0 → v2.78.0

## Methodology landed

The §26.8 stack-tool-extension rule paid off concretely:
- 6h authoring cost (P1.0 contract + P1.1 impl) → permanent apr capability
- Every future dataset pull benefits
- §33's val_loss=9.3837 is downstream proof of the methodology

This commit represents the first cycle in §22→§33 where the spec amendment
has the same priority as the empirical result.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) April 28, 2026 02:28
@noahgift noahgift merged commit 52da8e5 into main Apr 28, 2026
11 checks passed
@noahgift noahgift deleted the docs/spec-33-codeparrot-retrain-success branch April 28, 2026 02:52