test(distill): fixture-driven integration tests for ShardBatchSource (F-DISTILL-SHARD-BATCH-001/002) #1841
Merged
…(F-DISTILL-SHARD-BATCH-001/002)

Closes the cross-component contract gap that the Blackwell cascade post-mortem (lesson #2) identified: cache machinery is silent on divergences between producer and consumer until live dispatch surfaces the failure. Same risk class for ShardBatchSource: its wrap-around / cursor / chunk semantics need fixture-driven verification.

Adds two tests gated on the `shard-batch-source` feature:

F-DISTILL-SHARD-BATCH-001: happy path
  Writes a tiny .bin shard with [0, 1, ..., 4095] tokens, opens via ShardBatchSource::from_dir, asserts:
  - batch shape (4 rows × 16 tokens)
  - all returned tokens lie in [0, 4096) (fixture range)
  - labels in same range
  Catches: any cursor-off-by-one or layout swap that produces garbage outside the fixture range.

F-DISTILL-SHARD-BATCH-002: wrap-around
  Writes only 128 tokens (enough for ~1.88 batches at bs=4, seq=16), consumes 5 batches in a row. Asserts no error, since wrap_around=true is the default for ShardBatchSource.
  Catches: regression where the iterator returns None on exhaustion despite the constructor setting wrap_around.

Test plan:
- [x] 63 distill lib tests pass (was 61; 2 new)
- [x] `cargo test --features shard-batch-source` clean

These two tests would have caught most ShardBatchSource bugs at PR-time instead of at gx10-dispatch-time, where each failure costs 5-15 min.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 20, 2026
…ASSES (Phase 4 ladder) (#1845)

2026-05-20 12:34 UTC: first end-to-end Phase 4 dispatch with real corpus (.bin shards via ShardBatchSource). 0.5B Qwen2.5-Coder teacher → 0.5B student on Blackwell GB10 (sm_121), 100-step trial.

  initial_loss = 15.6094
  final_loss   = 6.0095   ← Δ = -9.60 (-62% reduction)
  124 steps, 232.4s, 1.87 sec/step

This is the first real-corpus Phase 4 dispatch. The synthetic Phase 3 victory (#1828, -0.47 over 62 steps) and the seq_len=256 Stage A smoke (#1833, -6.80) both predicted Phase 4 readiness; Stage C confirms it with strictly better convergence on real data (codeparrot Python tokenized to Qwen vocab, 10 shards / 383 MB).

What this validates:
- ShardBatchSource (PR #1836, PMAT-PHASE4-STAGE-B-1) reads .bin shards correctly and produces non-degenerate batches
- Pipeline integration (PR #1839, PMAT-PHASE4-STAGE-B-2) swaps from synthetic → real source via with_batch_source() cleanly
- Dispatch script DATASET_DIR knob (PR #1840) end-to-end through gx10
- Full Phase 4 readiness for the 50K-step Stage D run (compute-gated, requires user check-in per autonomous-mode rule)

Cascade math:
  Stage A: Δloss = -6.80 over 62 steps (synthetic, seq=256)
  Stage C: Δloss = -9.60 over 124 steps (real corpus, seq=256)

Per-step loss decrease:
  Stage A: -0.110/step
  Stage C: -0.077/step

Stage A's per-step rate is higher because synthetic data has zero variance: every batch is the same identity-mapping task. Real-corpus Stage C has higher variance but covers more concepts, so absolute delta is larger.

Phase 4 ladder progress:
  Stage A (#1833)              ✅ MERGED + verified
  Stage B-1 (#1836)            ✅ MERGED
  Stage B-2 (#1839)            ✅ MERGED
  Stage C-prep (#1840)         ✅ MERGED
  Stage B-1.5 tests (#1841)    🟡 in CI
  Stage C trial (THIS evidence) ✅ PASSED 2026-05-20
  Stage D 50K dispatch         ⏳ awaiting user check-in (28h GB10 compute)
  Stage E HumanEval pass@1     ⏳ Phase 5 (turnkey post-Stage-D)
  Stage F publish v2           ⏳ Phase 6 (turnkey post-Stage-E)

Evidence:
- evidence/distill-stage-c-trial/dispatch.json: dispatch manifest
- evidence/distill-stage-c-trial/launch-victory.txt: full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-123259/
Trained checkpoint: student-trained.apr/model.safetensors

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
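The cascade-math figures quoted in the commit message above can be rechecked with a few lines of arithmetic. The sketch below is only a sanity check on the reported numbers; the function names `per_step` and `relative_reduction` are made up for illustration and are not part of the distill crate.

```rust
/// Average loss change per optimizer step, given a total loss delta.
fn per_step(delta_loss: f64, steps: u32) -> f64 {
    delta_loss / f64::from(steps)
}

/// Fraction of the initial loss that was removed by training.
fn relative_reduction(initial_loss: f64, final_loss: f64) -> f64 {
    (initial_loss - final_loss) / initial_loss
}
```

With the reported values, `per_step(-6.80, 62)` is about -0.110 and `per_step(-9.60, 124)` is about -0.077, and `relative_reduction(15.6094, 6.0095)` is about 0.615, which matches the rounded -62% in the log.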
Closes a class of pre-Phase-4 contract bugs
Per the Blackwell cascade post-mortem lesson #2 — "Property test: pre_warm_keys() ⊇ runtime_keys() would have caught the regression". Same risk class for ShardBatchSource: cursor/wrap/chunk semantics are silent on divergences until live dispatch, where each failure costs 5-15 min.
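The superset property quoted from the post-mortem can be sketched as a plain set check. This is a minimal illustration, assuming cache keys can be represented as strings; the real return types of `pre_warm_keys()` and `runtime_keys()` are not shown in this PR, and `missing_keys` and `covers` are hypothetical helper names.

```rust
use std::collections::HashSet;

/// Keys the runtime asked for that the pre-warm step never produced.
/// An empty result means pre_warm ⊇ runtime holds.
fn missing_keys(pre_warm: &HashSet<String>, runtime: &HashSet<String>) -> Vec<String> {
    let mut missing: Vec<String> = runtime.difference(pre_warm).cloned().collect();
    missing.sort(); // deterministic order makes failure messages stable
    missing
}

/// True when every runtime key was pre-warmed.
fn covers(pre_warm: &HashSet<String>, runtime: &HashSet<String>) -> bool {
    runtime.is_subset(pre_warm)
}
```

A property test would generate or collect both key sets and assert that `missing_keys` is empty; reporting the missing keys rather than a bare boolean makes the divergence visible at PR time.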
Two new tests
F-DISTILL-SHARD-BATCH-001 — happy path
Writes a tiny `.bin` shard with `[0, 1, ..., 4095]` tokens, opens via `ShardBatchSource::from_dir`, asserts:
- batch shape (4 rows × 16 tokens)
- all returned tokens lie in `[0, 4096)` (fixture range)
- labels in same range

Catches: any cursor-off-by-one or layout swap that produces garbage outside the fixture range.
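The shape of such a fixture test can be sketched without the real crate. The code below is a toy stand-in, not `ShardBatchSource`: it assumes the shard stores tokens as little-endian `u32` values and that each row consumes `seq_len + 1` tokens so labels are the inputs shifted by one. Both are assumptions for illustration, and `write_fixture_shard`, `read_shard`, `ToyBatch`, and `first_batch` are hypothetical names.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Writes `tokens` as little-endian u32 values (assumed on-disk layout).
fn write_fixture_shard(path: &Path, tokens: &[u32]) -> io::Result<()> {
    let mut bytes = Vec::with_capacity(tokens.len() * 4);
    for t in tokens {
        bytes.extend_from_slice(&t.to_le_bytes());
    }
    fs::write(path, bytes)
}

/// Reads back a shard written by `write_fixture_shard`.
fn read_shard(path: &Path) -> io::Result<Vec<u32>> {
    let bytes = fs::read(path)?;
    Ok(bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}

/// Hypothetical batch: one input row and one label row per sequence.
struct ToyBatch {
    inputs: Vec<Vec<u32>>,
    labels: Vec<Vec<u32>>,
}

/// Cuts the first batch out of `tokens`, or returns None if there are too few.
/// Each row takes seq_len + 1 tokens: inputs are the first seq_len,
/// labels are the same window shifted right by one (assumed layout).
fn first_batch(tokens: &[u32], rows: usize, seq_len: usize) -> Option<ToyBatch> {
    let stride = seq_len + 1;
    if tokens.len() < rows * stride {
        return None;
    }
    let mut inputs = Vec::with_capacity(rows);
    let mut labels = Vec::with_capacity(rows);
    for r in 0..rows {
        let chunk = &tokens[r * stride..(r + 1) * stride];
        inputs.push(chunk[..seq_len].to_vec());
        labels.push(chunk[1..].to_vec());
    }
    Some(ToyBatch { inputs, labels })
}
```

Because the fixture is the identity sequence `0..4096`, any token at or above 4096 in `inputs` or `labels` can only come from a misread offset or a layout swap, which is what makes the range assertion a cheap but effective check.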
F-DISTILL-SHARD-BATCH-002 — wrap-around
Writes only 128 tokens (enough for ~1.88 batches at bs=4, seq=16), consumes 5 batches in a row. Asserts no error — wrap_around=true is the default.
Catches: regression where the iterator returns None on exhaustion despite the constructor setting wrap_around.
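The ~1.88 figure is consistent with each row consuming `seq + 1` tokens (inputs plus a shifted label), since 128 / (4 × 17) ≈ 1.88; that reading is an inference, not something the PR states. Under that assumption, the wrap-around behaviour being tested can be sketched with a toy cursor. `WrappingCursor` and `batches_available` are hypothetical names and do not reflect the real `ShardBatchSource` internals.

```rust
/// Toy stand-in for a shard-backed token source with an optional wrap.
struct WrappingCursor {
    tokens: Vec<u32>,
    pos: usize,
    wrap_around: bool,
}

impl WrappingCursor {
    fn new(tokens: Vec<u32>, wrap_around: bool) -> Self {
        WrappingCursor { tokens, pos: 0, wrap_around }
    }

    /// Returns the next `n` tokens. When the end of the shard is reached,
    /// restarts from token 0 if `wrap_around` is set, otherwise returns None.
    fn take(&mut self, n: usize) -> Option<Vec<u32>> {
        if self.tokens.is_empty() {
            return None;
        }
        let mut out = Vec::with_capacity(n);
        while out.len() < n {
            if self.pos == self.tokens.len() {
                if !self.wrap_around {
                    return None;
                }
                self.pos = 0;
            }
            out.push(self.tokens[self.pos]);
            self.pos += 1;
        }
        Some(out)
    }
}

/// How many full-or-partial batches a shard holds, assuming each row
/// consumes seq_len + 1 tokens.
fn batches_available(total_tokens: usize, rows: usize, seq_len: usize) -> f64 {
    total_tokens as f64 / (rows * (seq_len + 1)) as f64
}
```

With 128 tokens and 68 tokens per batch, five consecutive calls to `take(68)` cross the end of the shard twice, so a source that ignores `wrap_around` fails on the second call while a wrapping one keeps returning full batches.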
Test plan
- [x] 63 distill lib tests pass (was 61; 2 new)
- [x] `cargo test --features shard-batch-source` clean

Phase 4 ladder
🤖 Generated with Claude Code