fix(distill): save trained GPU weights + periodic checkpoints (PMAT-699 P0) by noahgift · Pull Request #1856 · paiml/aprender

noahgift · 2026-05-21T16:12:26Z

P0 — Stage D 2026-05-20 produced 200-byte empty checkpoint despite 25h compute

Two P0 defects surfaced by Stage D Phase 4 dispatch: training ran for 25h on Blackwell GB10 and reached final_loss=3.58 (real convergence), but student-trained.apr/model.safetensors was a 200-byte empty placeholder. The trained weights lived only in CudaStudentProvider's GPU memory and were lost when the process exited.

Root cause — defect pair

Defect 1 — empty export:
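pipeline.export() serializes the student_weights HashMap, which PMAT-698f's APR short-circuit leaves empty. That short-circuit is correct on the read side, but the write side then ships nothing: the trained GPU weights in CudaStudentProvider were never pulled back to disk.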

Defect 2 — zero intermediate checkpoints:
The 25h run made no periodic saves. A crash at step 49999 would have had the same outcome as the one we observed: no usable weights on disk.

Fix

1. New trait method StudentLogitsProvider::save_checkpoint(&mut self, path: &Path) -> Result<()> with a no-op default (preserves FixtureStudent behavior).
2. CudaStudentProvider::save_checkpoint(path) delegates to trainer.save_apr(path, "albor-distilled-v2", "Qwen2ForCausalLM"). GPU weights → APR v2 file with full metadata.
3. pipeline.export() calls self.student.save_checkpoint(<output_dir>/model.apr) after the metadata sidecar write. CUDA path produces a real APR; fixture path still writes only the metadata-only safetensors.
4. Periodic checkpointing in pipeline.train():
   - Every APR_DISTILL_CHECKPOINT_EVERY steps (default 5000)
   - Files: <output_dir>/ckpt-step-{N:06}.apr
   - Write failures are logged but don't fail training
   - Set to 0 to disable (smoke tests)
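
The fix can be pictured with a minimal sketch. It is not the code from this PR: the trait is reduced to the one new method, `std::io::Result` stands in for whatever `Result` alias the crate uses, and the helpers `checkpoint_every`, `checkpoint_path`, `checkpoint_due`, `maybe_checkpoint`, and `train_loop_sketch` are hypothetical names. Falling back to 5000 on an unparsable value and skipping step 0 are also assumptions.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the PR's `StudentLogitsProvider` trait.
/// Only the checkpoint hook is modeled here.
trait StudentLogitsProvider {
    /// No-op default: providers without real weights (for example a
    /// fixture) inherit this and write nothing.
    fn save_checkpoint(&mut self, _path: &Path) -> io::Result<()> {
        Ok(())
    }
}

/// Fixture-style provider that relies on the no-op default.
struct FixtureStudent;
impl StudentLogitsProvider for FixtureStudent {}

/// Parse the checkpoint interval. Unset or unparsable input falls back
/// to the documented default of 5000 (assumption); 0 disables saves.
fn checkpoint_every(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(5000)
}

/// Build `<output_dir>/ckpt-step-{N:06}.apr` for a given step.
fn checkpoint_path(output_dir: &Path, step: u64) -> PathBuf {
    output_dir.join(format!("ckpt-step-{step:06}.apr"))
}

/// Return the checkpoint path when `step` falls on an interval boundary.
fn checkpoint_due(output_dir: &Path, step: u64, every: u64) -> Option<PathBuf> {
    if every == 0 || step == 0 || step % every != 0 {
        return None;
    }
    Some(checkpoint_path(output_dir, step))
}

/// Save if due. A write failure is logged and swallowed so training
/// continues. Returns true only when the save call succeeded.
fn maybe_checkpoint<S: StudentLogitsProvider>(
    student: &mut S,
    output_dir: &Path,
    step: u64,
    every: u64,
) -> bool {
    match checkpoint_due(output_dir, step, every) {
        None => false,
        Some(path) => match student.save_checkpoint(&path) {
            Ok(()) => true,
            Err(e) => {
                eprintln!("checkpoint write failed at step {step}: {e}");
                false
            }
        },
    }
}

/// Skeleton of a training loop; returns how many checkpoints were saved.
fn train_loop_sketch<S: StudentLogitsProvider>(
    student: &mut S,
    output_dir: &Path,
    total_steps: u64,
) -> usize {
    let raw = std::env::var("APR_DISTILL_CHECKPOINT_EVERY").ok();
    let every = checkpoint_every(raw.as_deref());
    (1..=total_steps)
        // The forward/backward pass for `step` would run before this call.
        .filter(|&step| maybe_checkpoint(student, output_dir, step, every))
        .count()
}
```

Swallowing the error inside `maybe_checkpoint` reflects the stated policy that a failed intermediate write must not abort a long run; the final `model.apr` export is a separate call and can still report its own failure.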

Test plan

- [x] cargo check (both with/without cuda feature) clean
- [x] 61 distill lib tests pass (fixture path semantics preserved)
- [ ] Live re-dispatch (Phase 4 short re-train per user direction): verify model.apr non-empty AND ckpt-step-NNNNNN.apr every 5000 steps
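
The pending live check could be scripted along these lines. This is a sketch under assumptions, not tooling from the repository: `expected_artifacts` and `missing_or_too_small` are hypothetical helpers, and the `min_bytes` threshold is a caller-chosen value (the failed run's placeholder was 200 bytes, so anything at or below that would be suspect). It applies to the CUDA path only, since the fixture path does not write a real model.apr.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Files a finished run is expected to leave behind: `model.apr` plus
/// one `ckpt-step-{N:06}.apr` per completed interval.
fn expected_artifacts(output_dir: &Path, total_steps: u64, every: u64) -> Vec<PathBuf> {
    let mut files = vec![output_dir.join("model.apr")];
    if every > 0 {
        let mut step = every;
        while step <= total_steps {
            files.push(output_dir.join(format!("ckpt-step-{step:06}.apr")));
            step += every;
        }
    }
    files
}

/// Return every expected file that is missing, unreadable, or no larger
/// than `min_bytes`. An empty result means the check passed.
fn missing_or_too_small(
    output_dir: &Path,
    total_steps: u64,
    every: u64,
    min_bytes: u64,
) -> Vec<PathBuf> {
    expected_artifacts(output_dir, total_steps, every)
        .into_iter()
        .filter(|p| fs::metadata(p).map(|m| m.len() <= min_bytes).unwrap_or(true))
        .collect()
}
```

For a 10,000-step re-train at the default interval, `missing_or_too_small(dir, 10_000, 5000, 200)` would inspect model.apr, ckpt-step-005000.apr, and ckpt-step-010000.apr.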

What this unblocks

Phase 5 (HumanEval) + Phase 6 (publish) — both require a model with real weights. Per user direction (2026-05-21): land this P0 fix, then dispatch a 5-10K-step re-train (~5h) to produce a usable Phase 5/6 checkpoint.

🤖 Generated with Claude Code

…99 P0)

Two P0 defects surfaced by Stage D 2026-05-20: 25h of GB10 training produced final_loss=3.58 (real convergence) but a 200-byte empty model.safetensors. Root cause is a defect-pair:

Defect 1: pipeline.export() serializes the `student_weights` HashMap, which PMAT-698f's APR short-circuit returns empty (the read side correctly handles APR; the write side then ships nothing). The trained GPU weights in CudaStudentProvider were never pulled back to disk.

Defect 2: zero intermediate checkpoints across the 25h run. A crash at step 49999 would have produced identical loss to what we observed.

Fix:

1. New trait method `StudentLogitsProvider::save_checkpoint(&mut self, path: &Path) -> Result<()>` with a no-op default (preserves FixtureStudent behavior).
2. `CudaStudentProvider::save_checkpoint(path)` delegates to `trainer.save_apr(path, "albor-distilled-v2", "Qwen2ForCausalLM")`. GPU weights → APR v2 file with full metadata.
3. `pipeline.export()` calls `self.student.save_checkpoint(<output_dir>/model.apr)` after the metadata sidecar write. CUDA path produces a real APR; fixture path still writes only the metadata-only safetensors.
4. Periodic checkpointing in `pipeline.train()`:
   - Every APR_DISTILL_CHECKPOINT_EVERY steps (default 5000)
   - Files: `<output_dir>/ckpt-step-{N:06}.apr`
   - Write failures are logged but don't fail training (loss progress is more valuable than pristine intermediate snapshots)
   - Set to 0 to disable (smoke tests)

Test plan:

- [x] cargo check (both with/without cuda feature) clean
- [x] 61 distill lib tests pass (FALSIFY-APR-DISTILL-TRAIN-001/002 unchanged; fixture path semantics preserved via no-op default)
- [ ] Live gx10 re-dispatch (Phase 4 short re-train, per user direction): verify model.apr is written with non-trivial size AND ckpt-step-NNNNNN.apr appears every 5000 steps

Unblocks Phase 5 (HumanEval) and Phase 6 (publish), which both require a model with real weights. Stage D 2026-05-20 produced loss-verdict evidence (-66% over 25h) but no usable model. Per user direction (2026-05-21): land this P0 fix, then dispatch a 5-10K-step re-train (~5h) to produce a usable Phase 5/6 checkpoint. Future runs benefit from periodic checkpoints automatically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 21, 2026 16:12

noahgift merged commit 2deb0da into main May 21, 2026
11 checks passed

noahgift deleted the fix/distill-checkpoint-export-pmat-699 branch May 21, 2026 16:34

noahgift merged 1 commit into main from fix/distill-checkpoint-export-pmat-699
