
test(distill): fixture-driven integration tests for ShardBatchSource (F-DISTILL-SHARD-BATCH-001/002) #1841

Merged
noahgift merged 7 commits into main from test/distill-shard-batch-source-fixture
May 20, 2026

Conversation

@noahgift

Contributor

Closes a class of pre-Phase-4 contract bugs

This follows lesson #2 of the Blackwell cascade post-mortem: "Property test: pre_warm_keys() ⊇ runtime_keys() would have caught the regression". ShardBatchSource carries the same risk class: divergences in its cursor, wrap, and chunk semantics stay silent until live dispatch, where each failure costs 5-15 min.
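The superset property quoted from the post-mortem can be sketched as a plain set check. This is a minimal illustration only: `missing_keys` and the string keys are hypothetical, and the real `pre_warm_keys()` / `runtime_keys()` signatures are not shown in this PR.

```rust
use std::collections::HashSet;

/// Returns the runtime keys that were never pre-warmed, sorted.
/// An empty result means pre_warm ⊇ runtime holds.
/// (Hypothetical helper; key type assumed to be a string.)
fn missing_keys(pre_warm: &[&str], runtime: &[&str]) -> Vec<String> {
    let warmed: HashSet<&str> = pre_warm.iter().copied().collect();
    let mut missing: Vec<String> = runtime
        .iter()
        .copied()
        .filter(|k| !warmed.contains(k))
        .map(|k| k.to_string())
        .collect();
    missing.sort();
    missing.dedup();
    missing
}
```

Returning the missing keys rather than a bare boolean makes a failing property test report exactly which key diverged.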

Two new tests

F-DISTILL-SHARD-BATCH-001 — happy path

Writes a tiny .bin shard with [0, 1, ..., 4095] tokens, opens via ShardBatchSource::from_dir, asserts:

  • batch shape (4 rows × 16 tokens)
  • all returned tokens lie in [0, 4096) (fixture range)
  • labels in same range

Catches: any cursor-off-by-one or layout swap that produces garbage outside the fixture range.
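The happy-path check above can be sketched without the real API. Everything here is an assumption for illustration: the little-endian `u32` shard layout, the `seq + 1` row stride for next-token labels, and the helper names `encode_shard`, `decode_shard`, and `read_batch`. The actual test goes through `ShardBatchSource::from_dir`, whose signature is not shown in this PR.

```rust
/// Encode tokens as little-endian u32 bytes (assumed shard layout).
fn encode_shard(tokens: &[u32]) -> Vec<u8> {
    tokens.iter().flat_map(|t| t.to_le_bytes()).collect()
}

/// Decode little-endian u32 bytes back into tokens.
fn decode_shard(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Read `rows` rows of `seq` inputs plus next-token labels from `cursor`.
/// Each row consumes `seq + 1` tokens (assumed layout).
fn read_batch(
    tokens: &[u32],
    cursor: usize,
    rows: usize,
    seq: usize,
) -> Option<(Vec<Vec<u32>>, Vec<Vec<u32>>)> {
    let stride = seq + 1;
    if cursor + rows * stride > tokens.len() {
        return None;
    }
    let mut inputs = Vec::with_capacity(rows);
    let mut labels = Vec::with_capacity(rows);
    for r in 0..rows {
        let row = &tokens[cursor + r * stride..cursor + (r + 1) * stride];
        inputs.push(row[..seq].to_vec()); // tokens 0..seq
        labels.push(row[1..].to_vec()); // same row shifted by one
    }
    Some((inputs, labels))
}

/// True if every token in every row is below `limit`.
fn all_in_range(rows: &[Vec<u32>], limit: u32) -> bool {
    rows.iter().flatten().all(|&t| t < limit)
}

/// Mirrors the three assertions: shape, token range, label range.
fn check_happy_path() -> bool {
    let fixture: Vec<u32> = (0..4096).collect();
    let tokens = decode_shard(&encode_shard(&fixture));
    let (inputs, labels) = match read_batch(&tokens, 0, 4, 16) {
        Some(batch) => batch,
        None => return false,
    };
    let shape_ok = inputs.len() == 4 && inputs.iter().all(|r| r.len() == 16);
    shape_ok && all_in_range(&inputs, 4096) && all_in_range(&labels, 4096)
}
```

A range check like this is deliberately loose: it flags reads that land outside the fixture, but it would not notice rows returned in the wrong order.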

F-DISTILL-SHARD-BATCH-002 — wrap-around

Writes only 128 tokens (enough for ~1.88 batches at bs=4, seq=16), consumes 5 batches in a row. Asserts no error — wrap_around=true is the default.

Catches: regression where the iterator returns None on exhaustion despite the constructor setting wrap_around.
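The wrap-around behaviour can be sketched with a small cursor type. `WrappingSource` and `batches_served` are hypothetical stand-ins, not the real ShardBatchSource. The 68-token batch size used in the note below is an inference: ~1.88 batches from 128 tokens is consistent with each of the 4 rows reading seq + 1 = 17 tokens (128 / 68 ≈ 1.88), but the PR does not state that layout.

```rust
/// Minimal wrapping cursor over a token buffer (illustrative only).
struct WrappingSource {
    tokens: Vec<u32>,
    cursor: usize,
    wrap_around: bool,
}

impl WrappingSource {
    /// Return the next `n` tokens. With `wrap_around` the read continues
    /// from the start of the buffer; without it, exhaustion yields None.
    fn next_tokens(&mut self, n: usize) -> Option<Vec<u32>> {
        let len = self.tokens.len();
        if len == 0 {
            return None;
        }
        if !self.wrap_around && self.cursor + n > len {
            return None;
        }
        let out: Vec<u32> = (0..n)
            .map(|i| self.tokens[(self.cursor + i) % len])
            .collect();
        self.cursor += n;
        if self.wrap_around {
            self.cursor %= len;
        }
        Some(out)
    }
}

/// Count how many of `want` consecutive batches are served.
fn batches_served(total_tokens: usize, per_batch: usize, wrap_around: bool, want: usize) -> usize {
    let mut src = WrappingSource {
        tokens: (0..total_tokens as u32).collect(),
        cursor: 0,
        wrap_around,
    };
    (0..want)
        .take_while(|_| src.next_tokens(per_batch).is_some())
        .count()
}
```

Under these assumptions, `batches_served(128, 68, true, 5)` serves all 5 batches, while the same call with `wrap_around = false` stops after 1, which is the regression the test is meant to catch.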

Test plan

  • 63 distill lib tests pass (was 61; 2 new)
  • cargo test --features shard-batch-source clean

Phase 4 ladder

| Stage | PR | Status |
|---|---|---|
| A | #1833 | ✅ MERGED + verified |
| B-1 | #1836 | ✅ MERGED |
| B-2 | #1839 | 🟡 in CI |
| C-prep | #1840 | 🟡 in CI |
| B-1.5 hardening | THIS | fixture tests for ShardBatchSource |
| C | (next) | live trial w/ --dataset |
| D | (compute-gated) | 50K-step Phase 4 |
| E | (Phase 5) | HumanEval pass@1 |
| F | (Phase 6) | publish v2 |

🤖 Generated with Claude Code

…(F-DISTILL-SHARD-BATCH-001/002)

Closes the cross-component contract gap that the Blackwell cascade
post-mortem (lesson #2) identified: cache machinery is silent on
divergences between producer and consumer, until live dispatch
surfaces the failure. Same risk class for ShardBatchSource: its
wrap-around / cursor / chunk semantics need fixture-driven verification.

Adds two tests gated on `shard-batch-source` feature:

  F-DISTILL-SHARD-BATCH-001 — happy path
    Writes a tiny .bin shard with [0, 1, ..., 4095] tokens, opens via
    ShardBatchSource::from_dir, asserts:
      - batch shape (4 rows × 16 tokens)
      - all returned tokens lie in [0, 4096) (fixture range)
      - labels in same range
    Catches: any cursor-off-by-one or layout swap that produces
    garbage outside the fixture range.

  F-DISTILL-SHARD-BATCH-002 — wrap-around
    Writes only 128 tokens (enough for ~1.88 batches at bs=4, seq=16),
    consumes 5 batches in a row. Asserts no error — wrap_around=true is
    the default for ShardBatchSource. Catches: regression where the
    iterator returns None on exhaustion despite the constructor
    setting wrap_around.

Test plan:
- [x] 63 distill lib tests pass (was 61; 2 new)
- [x] `cargo test --features shard-batch-source` clean

These two tests would have caught most ShardBatchSource bugs at PR-time
instead of at gx10-dispatch-time, where each failure costs 5-15min.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 20, 2026 08:59
noahgift added a commit that referenced this pull request May 20, 2026
…ASSES (Phase 4 ladder) (#1845)

2026-05-20 12:34 UTC — first end-to-end Phase 4 dispatch with real corpus
(.bin shards via ShardBatchSource). 0.5B Qwen2.5-Coder teacher → 0.5B
student on Blackwell GB10 (sm_121), 100-step trial.

  initial_loss = 15.6094
  final_loss   =  6.0095   ← Δ = -9.60 (62% reduction)
  124 steps, 232.4s, 1.87 sec/step

This is the first real-corpus Phase 4 dispatch. The synthetic Phase 3
victory (#1828, -0.47 over 62 steps) and the seq_len=256 Stage A smoke
(#1833, -6.80) both predicted Phase 4 readiness; Stage C confirms it
with strictly better convergence on real data (codeparrot Python
tokenized to Qwen vocab, 10 shards / 383 MB).

What this validates:
- ShardBatchSource (PR #1836, PMAT-PHASE4-STAGE-B-1) reads .bin shards
  correctly and produces non-degenerate batches
- Pipeline integration (PR #1839, PMAT-PHASE4-STAGE-B-2) swaps from
  synthetic → real source via with_batch_source() cleanly
- Dispatch script DATASET_DIR knob (PR #1840) works end-to-end through gx10
- Full Phase 4 readiness for the 50K-step Stage D run (compute-gated,
  requires user check-in per autonomous-mode rule)

Cascade math:
  Stage A:  Δloss = -6.80 over 62 steps  (synthetic, seq=256)
  Stage C:  Δloss = -9.60 over 124 steps (real corpus, seq=256)
  Per-step loss decrease:
    Stage A: -0.110/step
    Stage C: -0.077/step
  Stage A's per-step rate is higher because synthetic data has zero
  variance — every batch is the same identity-mapping task. Real-corpus
  Stage C has higher variance but covers more concepts, so absolute
  delta is larger.
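
The arithmetic in this commit message can be rechecked with a few one-line helpers. The function names are hypothetical and the inputs are the figures reported above; nothing here comes from the training code itself.

```rust
/// Average loss change per step (negative means the loss went down).
fn per_step_delta(delta_loss: f64, steps: u32) -> f64 {
    delta_loss / f64::from(steps)
}

/// Loss reduction as a fraction of the initial loss.
fn relative_reduction(initial: f64, final_loss: f64) -> f64 {
    (initial - final_loss) / initial
}

/// Average wall-clock seconds per step.
fn secs_per_step(total_secs: f64, steps: u32) -> f64 {
    total_secs / f64::from(steps)
}
```

With the reported numbers, `per_step_delta(-6.80, 62)` is about -0.110 and `per_step_delta(-9.60, 124)` is about -0.077. `relative_reduction(15.6094, 6.0095)` is about 0.615, and `secs_per_step(232.4, 124)` is about 1.87.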

Phase 4 ladder progress:
  Stage A (#1833)              ✅ MERGED + verified
  Stage B-1 (#1836)            ✅ MERGED
  Stage B-2 (#1839)            ✅ MERGED
  Stage C-prep (#1840)         ✅ MERGED
  Stage B-1.5 tests (#1841)    🟡 in CI
  Stage C trial (THIS evidence) ✅ PASSED 2026-05-20
  Stage D 50K dispatch          ⏳ awaiting user check-in (28h GB10 compute)
  Stage E HumanEval pass@1      ⏳ Phase 5 (turnkey post-Stage-D)
  Stage F publish v2            ⏳ Phase 6 (turnkey post-Stage-E)

Evidence:
- evidence/distill-stage-c-trial/dispatch.json — dispatch manifest
- evidence/distill-stage-c-trial/launch-victory.txt — full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-123259/
Trained checkpoint: student-trained.apr/model.safetensors

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift merged commit 4f5c12a into main May 20, 2026
10 checks passed
noahgift deleted the test/distill-shard-batch-source-fixture branch May 20, 2026 12:46
