test(distill): fixture-driven integration tests for ShardBatchSource (F-DISTILL-SHARD-BATCH-001/002) by noahgift · Pull Request #1841 · paiml/aprender

noahgift · 2026-05-20T08:59:11Z

Closes a class of pre-Phase-4 contract bugs

This follows lesson #2 of the Blackwell cascade post-mortem: "Property test: pre_warm_keys() ⊇ runtime_keys() would have caught the regression." ShardBatchSource falls in the same risk class: divergences in its cursor, wrap-around, and chunk semantics stay silent until live dispatch, where each failure costs 5-15 min.

Two new tests

F-DISTILL-SHARD-BATCH-001 — happy path

Writes a tiny .bin shard containing tokens [0, 1, ..., 4095], opens it via ShardBatchSource::from_dir, and asserts:

- batch shape (4 rows × 16 tokens)
- all returned tokens lie in [0, 4096) (the fixture range)
- labels lie in the same range

Catches: any cursor off-by-one or layout swap that produces garbage outside the fixture range.
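To make the fixture idea concrete, here is a minimal, self-contained sketch of a happy-path check. It does not use aprender's ShardBatchSource: `MiniShardSource`, `write_shard`, and `fixture_path` are hypothetical stand-ins, and both the on-disk layout (headerless little-endian u32 token ids) and the row layout (seq + 1 consecutive tokens per row, labels = inputs shifted by one) are assumptions made for illustration.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for ShardBatchSource; NOT the aprender API.
/// Assumes a shard is a headerless file of little-endian u32 token ids.
pub struct MiniShardSource {
    tokens: Vec<u32>,
    cursor: usize,
    batch_size: usize,
    seq_len: usize,
}

/// Writes `tokens` to `path` as raw little-endian u32 (assumed layout).
pub fn write_shard(path: &Path, tokens: &[u32]) -> io::Result<()> {
    let bytes: Vec<u8> = tokens.iter().flat_map(|t| t.to_le_bytes()).collect();
    fs::write(path, bytes)
}

/// Hypothetical helper: fixture location inside the OS temp dir.
pub fn fixture_path(name: &str) -> PathBuf {
    std::env::temp_dir().join(format!("{}_{}", std::process::id(), name))
}

impl MiniShardSource {
    pub fn open(path: &Path, batch_size: usize, seq_len: usize) -> io::Result<Self> {
        let bytes = fs::read(path)?;
        // Decode every 4-byte chunk back into a u32 token id.
        let tokens: Vec<u32> = bytes
            .chunks_exact(4)
            .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        Ok(Self { tokens, cursor: 0, batch_size, seq_len })
    }

    /// Returns (inputs, labels): batch_size rows of seq_len tokens each.
    /// Each row reads seq_len + 1 consecutive tokens (assumed layout);
    /// labels are the inputs shifted left by one position.
    pub fn next_batch(&mut self) -> Option<(Vec<Vec<u32>>, Vec<Vec<u32>>)> {
        let window = self.seq_len + 1;
        if self.cursor + self.batch_size * window > self.tokens.len() {
            return None; // exhausted; this sketch does not wrap
        }
        let mut inputs = Vec::with_capacity(self.batch_size);
        let mut labels = Vec::with_capacity(self.batch_size);
        for _ in 0..self.batch_size {
            let row = &self.tokens[self.cursor..self.cursor + window];
            inputs.push(row[..self.seq_len].to_vec());
            labels.push(row[1..].to_vec());
            self.cursor += window;
        }
        Some((inputs, labels))
    }
}
```

Because the fixture stores each token's own position as its value, a stricter variant could assert exact values (for example `inputs[r][c] == r * 17 + c` under the layout assumed here) rather than only the [0, 4096) range.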

F-DISTILL-SHARD-BATCH-002 — wrap-around

Writes only 128 tokens (enough for ~1.88 batches at bs=4, seq=16) and consumes 5 batches in a row. Asserts no error, since wrap_around=true is the default.

Catches: a regression where the iterator returns None on exhaustion even though the constructor sets wrap_around.
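The ~1.88 figure is consistent with each row consuming seq + 1 = 17 tokens (16 inputs plus one extra so labels can be shifted): 4 × 17 = 68 tokens per batch and 128 / 68 ≈ 1.88, whereas 64 tokens per batch would give exactly 2. The sketch below illustrates the wrap-around contract the test guards, using a hypothetical `WrapCursor` rather than the real ShardBatchSource; restarting at token 0 on exhaustion is an assumed wrap policy.

```rust
/// Hypothetical cursor over an in-memory token stream; NOT the aprender API.
pub struct WrapCursor {
    tokens: Vec<u32>,
    cursor: usize,
    batch_size: usize,
    seq_len: usize,
    wrap_around: bool,
}

impl WrapCursor {
    pub fn new(tokens: Vec<u32>, batch_size: usize, seq_len: usize, wrap_around: bool) -> Self {
        Self { tokens, cursor: 0, batch_size, seq_len, wrap_around }
    }

    /// Tokens consumed per batch: each row needs seq_len + 1 (inputs plus shifted label).
    pub fn tokens_per_batch(&self) -> usize {
        self.batch_size * (self.seq_len + 1)
    }

    /// Returns the next batch as rows of seq_len + 1 tokens, or None when the
    /// stream is exhausted and wrap_around is off.
    pub fn next_batch(&mut self) -> Option<Vec<Vec<u32>>> {
        let window = self.seq_len + 1;
        if self.tokens.len() < window {
            return None; // not even one row fits; wrapping cannot help
        }
        let mut rows = Vec::with_capacity(self.batch_size);
        for _ in 0..self.batch_size {
            if self.cursor + window > self.tokens.len() {
                if !self.wrap_around {
                    return None;
                }
                self.cursor = 0; // assumed policy: restart from the first token
            }
            rows.push(self.tokens[self.cursor..self.cursor + window].to_vec());
            self.cursor += window;
        }
        Some(rows)
    }
}
```

With `wrap_around = false` the same 128-token stream yields one full batch and then `None`, which is the behavior F-DISTILL-SHARD-BATCH-002 should flag if it ever appeared while wrap_around is enabled.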

Test plan

- [x] 63 distill lib tests pass (was 61; 2 new)
- [x] `cargo test --features shard-batch-source` clean

Phase 4 ladder

| Stage | PR | Status |
|---|---|---|
| A | #1833 | ✅ MERGED + verified |
| B-1 | #1836 | ✅ MERGED |
| B-2 | #1839 | 🟡 in CI |
| C-prep | #1840 | 🟡 in CI |
| B-1.5 hardening | THIS | fixture tests for ShardBatchSource |
| C | (next) | live trial w/ --dataset |
| D | (compute-gated) | 50K-step Phase 4 |
| E | (Phase 5) | HumanEval pass@1 |
| F | (Phase 6) | publish v2 |

🤖 Generated with Claude Code

…(F-DISTILL-SHARD-BATCH-001/002)

Closes the cross-component contract gap that the Blackwell cascade post-mortem (lesson #2) identified: cache machinery is silent on divergences between producer and consumer until live dispatch surfaces the failure. ShardBatchSource is in the same risk class: its wrap-around / cursor / chunk semantics need fixture-driven verification.

Adds two tests gated on the `shard-batch-source` feature:

F-DISTILL-SHARD-BATCH-001: happy path
Writes a tiny .bin shard with [0, 1, ..., 4095] tokens, opens it via ShardBatchSource::from_dir, and asserts:
- batch shape (4 rows × 16 tokens)
- all returned tokens lie in [0, 4096) (fixture range)
- labels in same range
Catches: any cursor off-by-one or layout swap that produces garbage outside the fixture range.

F-DISTILL-SHARD-BATCH-002: wrap-around
Writes only 128 tokens (enough for ~1.88 batches at bs=4, seq=16) and consumes 5 batches in a row. Asserts no error, since wrap_around=true is the default for ShardBatchSource.
Catches: a regression where the iterator returns None on exhaustion despite the constructor setting wrap_around.

Test plan:
- [x] 63 distill lib tests pass (was 61; 2 new)
- [x] `cargo test --features shard-batch-source` clean

These two tests would have caught most ShardBatchSource bugs at PR time instead of at gx10 dispatch time, where each failure costs 5-15 min.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ASSES (Phase 4 ladder) (#1845)

2026-05-20 12:34 UTC: first end-to-end Phase 4 dispatch with a real corpus (.bin shards via ShardBatchSource). 0.5B Qwen2.5-Coder teacher → 0.5B student on Blackwell GB10 (sm_121), 100-step trial.

initial_loss = 15.6094
final_loss = 6.0095 ← Δ = -9.60 (-62% reduction)
124 steps, 232.4s, 1.87 sec/step

This is the first real-corpus Phase 4 dispatch. The synthetic Phase 3 victory (#1828, -0.47 over 62 steps) and the seq_len=256 Stage A smoke (#1833, -6.80) both predicted Phase 4 readiness; Stage C confirms it with a larger absolute loss reduction on real data (codeparrot Python tokenized to Qwen vocab, 10 shards / 383 MB).

What this validates:
- ShardBatchSource (PR #1836, PMAT-PHASE4-STAGE-B-1) reads .bin shards correctly and produces non-degenerate batches
- Pipeline integration (PR #1839, PMAT-PHASE4-STAGE-B-2) swaps cleanly from the synthetic to the real source via with_batch_source()
- The dispatch script's DATASET_DIR knob (PR #1840) works end-to-end through gx10
- Full Phase 4 readiness for the 50K-step Stage D run (compute-gated; requires user check-in per the autonomous-mode rule)

Cascade math:
Stage A: Δloss = -6.80 over 62 steps (synthetic, seq=256)
Stage C: Δloss = -9.60 over 124 steps (real corpus, seq=256)

Per-step loss decrease:
Stage A: -0.110/step
Stage C: -0.077/step

Stage A's per-step rate is higher because synthetic data has zero variance: every batch is the same identity-mapping task. Real-corpus Stage C has higher variance but covers more concepts, so its absolute delta is larger.
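The headline numbers can be re-derived from the logged endpoints. This small helper (`run_stats` and `projected_hours` are illustrative names, not part of aprender) reproduces Δ ≈ -9.60, a change of about -61.5% (which the log rounds to -62%), 1.87 sec/step, and the per-step rates of -0.077 (Stage C) and -0.110 (Stage A). It also projects roughly 26 h of pure step time for 50K steps, assuming the Stage C step time carries over unchanged to Stage D.

```rust
/// Summary statistics derived from a run's logged endpoints (illustrative helper).
pub struct RunStats {
    pub delta: f64,         // final_loss - initial_loss (negative = improvement)
    pub pct_change: f64,    // delta as a percentage of initial_loss
    pub per_step: f64,      // average loss change per step
    pub secs_per_step: f64, // wall-clock seconds per step
}

pub fn run_stats(initial_loss: f64, final_loss: f64, steps: u32, wall_secs: f64) -> RunStats {
    let delta = final_loss - initial_loss;
    RunStats {
        delta,
        pct_change: 100.0 * delta / initial_loss,
        per_step: delta / steps as f64,
        secs_per_step: wall_secs / steps as f64,
    }
}

/// Pure step time in hours for `steps` steps at `secs_per_step` (ignores any overhead).
pub fn projected_hours(steps: u32, secs_per_step: f64) -> f64 {
    steps as f64 * secs_per_step / 3600.0
}
```

For example, `run_stats(15.6094, 6.0095, 124, 232.4)` gives the Stage C figures, and `-6.80 / 62.0` gives the Stage A per-step rate.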
Phase 4 ladder progress:
- Stage A (#1833) ✅ MERGED + verified
- Stage B-1 (#1836) ✅ MERGED
- Stage B-2 (#1839) ✅ MERGED
- Stage C-prep (#1840) ✅ MERGED
- Stage B-1.5 tests (#1841) 🟡 in CI
- Stage C trial (THIS evidence) ✅ PASSED 2026-05-20
- Stage D 50K dispatch ⏳ awaiting user check-in (28h GB10 compute)
- Stage E HumanEval pass@1 ⏳ Phase 5 (turnkey post-Stage-D)
- Stage F publish v2 ⏳ Phase 6 (turnkey post-Stage-E)

Evidence:
- evidence/distill-stage-c-trial/dispatch.json: dispatch manifest
- evidence/distill-stage-c-trial/launch-victory.txt: full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-123259/
Trained checkpoint: student-trained.apr/model.safetensors

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 20, 2026 08:59

noahgift added 2 commits May 20, 2026 11:18

Merge branch 'main' into test/distill-shard-batch-source-fixture (56e57ba)
Merge branch 'main' into test/distill-shard-batch-source-fixture (73c8f32)

noahgift mentioned this pull request May 20, 2026

evidence(distill): Stage C — first real-corpus distillation on GB10 PASSES #1845

Merged

noahgift added 3 commits May 20, 2026 12:41

Merge branch 'main' into test/distill-shard-batch-source-fixture (4ba814b)
Merge branch 'main' into test/distill-shard-batch-source-fixture (3f9a17f)
Merge branch 'main' into test/distill-shard-batch-source-fixture (a1fe7e2)
Merge branch 'main' into test/distill-shard-batch-source-fixture (be30744)

noahgift merged commit 4f5c12a into main May 20, 2026
10 checks passed

noahgift deleted the test/distill-shard-batch-source-fixture branch May 20, 2026 12:46

noahgift merged 7 commits into main from test/distill-shard-batch-source-fixture
