feat(distill): wire ProgressCallback into Pipeline — close training-monitoring gap (PMAT-705)#1881
Closed
noahgift wants to merge 3 commits into
…onitoring gap (PMAT-705)
The PMAT-704 cascade post-mortem revealed a hard observability gap: long
distill dispatches were silent between "blocks uploaded" and
"Distillation complete." There was no way to distinguish "training
progressing silently" from "training hung." A 1.5 h gx10 hang at step 0
was invisible until terminal state.
Root cause: `aprender-train` ships a full callback infrastructure
(`ProgressCallback`, `MonitorCallback`, `CheckpointCallback`,
`EarlyStoppingCallback`, etc.) but `aprender-train-distill::Pipeline`
had zero callback references. Per-step loss was computed in `kd_step`
and discarded.
## Fix
`crates/aprender-train-distill/src/pipeline.rs`:
* `Pipeline` gains `callbacks: Vec<Box<dyn entrenar::train::TrainerCallback>>`
field and `with_callback()` builder method.
* Training loop fires the full lifecycle:
- `on_train_begin` once before the first step.
- `on_epoch_begin` at each epoch boundary.
- `on_step_end` after grad application with current loss + step
counter (the load-bearing observability point).
- `on_epoch_end` at each epoch boundary.
- `on_train_end` once after the loop.
* `CallbackAction::Stop` from any callback terminates the training
loop within one step (early-stopping support).
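The lifecycle above can be sketched as a toy loop. This is a minimal illustration, not the code in `pipeline.rs`: the `Callback` trait, `Action` enum, `Recorder`, `StopAfter`, and `run_loop` are hypothetical stand-ins, and their signatures are assumptions rather than the real `TrainerCallback` API. The sketch borrows callbacks as `&mut dyn Callback` for brevity, whereas the PR stores them as `Vec<Box<dyn ...>>`.

```rust
/// Hypothetical stand-in for the action a callback returns.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Continue,
    Stop,
}

/// Hypothetical callback trait. Method names follow the commit message;
/// the signatures are assumptions made for this sketch.
pub trait Callback {
    fn on_train_begin(&mut self) {}
    fn on_epoch_begin(&mut self, _epoch: usize) {}
    fn on_step_end(&mut self, _step: usize, _loss: f32) -> Action {
        Action::Continue
    }
    fn on_epoch_end(&mut self, _epoch: usize) {}
    fn on_train_end(&mut self) {}
}

/// Records the name of every lifecycle event it sees.
#[derive(Default)]
pub struct Recorder {
    pub events: Vec<&'static str>,
}

impl Callback for Recorder {
    fn on_train_begin(&mut self) { self.events.push("train_begin"); }
    fn on_epoch_begin(&mut self, _epoch: usize) { self.events.push("epoch_begin"); }
    fn on_step_end(&mut self, _step: usize, _loss: f32) -> Action {
        self.events.push("step");
        Action::Continue
    }
    fn on_epoch_end(&mut self, _epoch: usize) { self.events.push("epoch_end"); }
    fn on_train_end(&mut self) { self.events.push("train_end"); }
}

/// Requests a stop once the given number of steps has completed.
pub struct StopAfter(pub usize);

impl Callback for StopAfter {
    fn on_step_end(&mut self, step: usize, _loss: f32) -> Action {
        if step >= self.0 { Action::Stop } else { Action::Continue }
    }
}

/// Runs `epochs * steps_per_epoch` steps unless a callback returns `Stop`.
/// `step_fn` stands in for the real per-step loss computation.
/// Returns the number of steps completed.
pub fn run_loop(
    epochs: usize,
    steps_per_epoch: usize,
    callbacks: &mut [&mut dyn Callback],
    mut step_fn: impl FnMut(usize) -> f32,
) -> usize {
    let mut step = 0;
    for cb in callbacks.iter_mut() { cb.on_train_begin(); }
    'train: for epoch in 0..epochs {
        for cb in callbacks.iter_mut() { cb.on_epoch_begin(epoch); }
        for _ in 0..steps_per_epoch {
            let loss = step_fn(step);
            step += 1;
            // Every callback sees the step, even if an earlier one asked to stop.
            let mut stop = false;
            for cb in callbacks.iter_mut() {
                if cb.on_step_end(step, loss) == Action::Stop { stop = true; }
            }
            // In this sketch a stop skips `on_epoch_end` for the current epoch.
            if stop { break 'train; }
        }
        for cb in callbacks.iter_mut() { cb.on_epoch_end(epoch); }
    }
    for cb in callbacks.iter_mut() { cb.on_train_end(); }
    step
}
```

Whether `on_epoch_end` should still fire after a `Stop` is a design choice the commit message does not specify; the sketch skips it and always fires `on_train_end`.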
`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:
* Attaches a default `ProgressCallback` honoring the env var
`APR_DISTILL_LOG_EVERY` (default 10). `APR_DISTILL_LOG_EVERY=0`
disables per-step logs but keeps epoch boundary lines.
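The `APR_DISTILL_LOG_EVERY` semantics can be sketched as two small pure functions. The helper names `resolve_log_every` and `should_log_step` are hypothetical, and falling back to the default on an unparsable value is an assumption; the commit message only specifies the default of 10 and the meaning of 0.

```rust
/// Hypothetical helper: resolve the per-step logging interval from the raw
/// value of `APR_DISTILL_LOG_EVERY` (`None` when the variable is unset).
/// Assumption: an unparsable value falls back to the default.
pub fn resolve_log_every(raw: Option<&str>) -> usize {
    const DEFAULT_LOG_EVERY: usize = 10;
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(DEFAULT_LOG_EVERY)
}

/// Hypothetical helper: decide whether a given step should emit a log line.
/// An interval of 0 disables per-step logs entirely.
pub fn should_log_step(step: usize, log_every: usize) -> bool {
    log_every != 0 && step % log_every == 0
}
```

A caller could feed it the environment with `resolve_log_every(std::env::var("APR_DISTILL_LOG_EVERY").ok().as_deref())`; taking `Option<&str>` keeps the parsing testable without touching process-global state.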
## Contract
`contracts/distill-pipeline-observability-v1.yaml`:
* 3 equations: callback_lifecycle, progress_log_format, default_attachment.
* 4 falsifiers (FT-OBSERV-001..004): per-step callback fires; log_every
honored; APR_DISTILL_LOG_EVERY=0 silences per-step; callback overhead
bounded.
* 2 Kani harnesses (callback_ordering, Stop_terminates_loop).
* qa_gate F-OBSERV-001.
Validates clean: `pv validate` reports 0 errors, 0 warnings.
## Unit tests
`crates/aprender-train-distill/src/pipeline.rs` tests module:
* `pmat_705_observability::falsify_observ_001_per_step_callback_fires`
* `pmat_705_observability::falsify_observ_001_step_count_matches`
* `pmat_705_observability::callback_ordering_preserved`
All 3 pass. They use a recording callback that captures every lifecycle
event and verify that the event counts match the expected (epochs ×
steps_per_epoch) shape.
## gx10 dispatch verification
Dispatch on gx10 emits:
[PMAT-705] ProgressCallback attached (log every 1 steps)
…confirming the wiring fires. Full per-step output on the 7B teacher
path requires PMAT-704 (#1879) to also land — current main routes Q4K
teachers through Bug B's CPU-bound `RealizarQ4KTeacher` which hangs at
step 0 (no progress to log). The PMAT-705 wiring is correct
independent of teacher backend; the unit tests prove it.
## Cascade context
Sixth fix in the PMAT-701 family:
- #1863 PMAT-701 Bug A: allocator autodetect Grace Blackwell
- #1869 PMAT-701 Bug B: RealizarQ4KTeacher (demoted by #1879)
- #1871 SPEC-DISTILL §86 + dispatch script default
- #1874 PMAT-702: apr eval no-fake-pass on broken models
- #1877 PMAT-703: teacher vocab alignment (superseded by #1879's TruncatingTeacher)
- #1879 PMAT-704: cuBLAS default + opt-in Realizar fallback
- #1880 SPEC-DISTILL §87: PMAT-704 post-mortem
- **This PR**: PMAT-705 — ProgressCallback wired into Pipeline
Together: the distill → eval pipeline is honest end-to-end AND
observable end-to-end.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 22, 2026
auto-merge was automatically disabled May 23, 2026 04:37
Pull request was closed
noahgift added a commit that referenced this pull request May 23, 2026
When the operator sets `APR_DISTILL_MAX_STEPS=N` (default unset), the distill training loop runs at most N steps, prints a per-run summary, and exits without writing a final output model. Lets operators validate the cascade end-to-end in ~60 s before committing to a 30-50 h Stage D production run.
The PMAT-704 cascade post-mortem found that the 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with no per-step output. PMAT-705 (#1881) added ProgressCallback to surface per-step loss during normal runs. PMAT-706 adds the complementary EARLY-BREAK so operators don't have to wait through the full epoch budget to see if something's wrong.
`crates/aprender-train-distill/src/pipeline.rs`:
* Reads `APR_DISTILL_MAX_STEPS` env var. Empty/unset = old behavior (no regression). N > 0 = run at most N steps then break. N = 0 or non-integer = early Err with clear message.
* Optional `APR_DISTILL_PROJECT_TO_STEPS` env var (default 50000) controls the projected-wall-time target in the summary.
* `train()` early-breaks the inner loop when step >= max_steps, prints two `[SMOKE]` summary lines (loss trajectory + projected wall time at the observed throughput), and returns empty weights / shapes via the normal Result path.
* `execute()` detects smoke mode (env var set) and short-circuits the export step: no `model.safetensors` / output.apr is written, so downstream tools (`apr eval`, `apr run`) can't accidentally consume a smoke result.
[PMAT-706] smoke mode: APR_DISTILL_MAX_STEPS=N (early-break after N steps; no final output.apr written)
...
[SMOKE] N steps in T.TTs: initial_loss=X.XXXX, final_loss=Y.YYYY, throughput=Z.ZZ step/s
[SMOKE] projected full-run wall time (50000 steps): H.HHh / W.W min / S.Ss
[PMAT-706] smoke mode: skipping export — no model.safetensors / output.apr written
`contracts/apr-distill-smoke-validation-v1.yaml`:
* 3 equations: early_break_condition (off-by-one tight), smoke_summary_format, no_side_effects.
* 4 falsifiers (FT-SMOKE-001..004) covering exact step count, no-regression when unset, summary line format, no output.apr written.
* 2 Kani harnesses (count is tight; 0 steps is degenerate, not panic).
* qa_gate F-SMOKE-001.
* Validates clean: `pv validate` 0 errors, 0 warnings.
`pipeline::tests::pmat_706_smoke_validation`:
* `falsify_smoke_001_exact_step_count`: N=10 returns metrics.steps_completed == 10
* `falsify_smoke_002_no_regression_when_unset`: unset → full epochs run
* `falsify_smoke_004_no_output_in_smoke`: output_path empty + no model.* files
* `smoke_zero_steps_returns_err`: N=0 returns Err
Tests share global env state; serialized via a Mutex (ENV_LOCK) so they don't race in parallel threads. All 4 PASS.
This closes the diagnostic loop on the PMAT-704 cascade post-mortem lesson. Per memory `feedback_a_priori_theoretical_falsification.md`: 30 min of math saves 8 h of GPU. PMAT-706 is the runtime analog: 60 s of smoke saves 8 h of staring at a silent process.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
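The `APR_DISTILL_MAX_STEPS` rules and the wall-time projection described in this commit can be sketched as follows. Both function names and the error wording are hypothetical, and the projection formula (target steps divided by observed throughput) is an assumption inferred from the `[SMOKE]` summary lines, not taken from the actual code.

```rust
/// Hypothetical helper: interpret the raw value of `APR_DISTILL_MAX_STEPS`.
/// Unset or empty means "no limit" (`Ok(None)`), N > 0 means "at most N
/// steps", and 0 or a non-integer is rejected early.
pub fn parse_max_steps(raw: Option<&str>) -> Result<Option<u64>, String> {
    match raw.map(str::trim) {
        None | Some("") => Ok(None),
        Some(s) => match s.parse::<u64>() {
            Ok(0) => Err("APR_DISTILL_MAX_STEPS must be > 0".to_string()),
            Ok(n) => Ok(Some(n)),
            Err(_) => Err(format!(
                "APR_DISTILL_MAX_STEPS must be a positive integer, got {s:?}"
            )),
        },
    }
}

/// Hypothetical helper: project the wall time in seconds for `target_steps`
/// at the throughput observed over the smoke run. Returns `None` when no
/// throughput can be measured.
pub fn projected_seconds(steps_done: u64, elapsed_secs: f64, target_steps: u64) -> Option<f64> {
    if steps_done == 0 || elapsed_secs <= 0.0 {
        return None;
    }
    let throughput = steps_done as f64 / elapsed_secs; // steps per second
    Some(target_steps as f64 / throughput)
}
```

Under these assumptions, a smoke run of 10 steps in 5 s projects 25000 s for the default 50000-step target, which the summary would then report in hours, minutes, and seconds.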
noahgift added a commit that referenced this pull request May 23, 2026
…#1888)
Summary
Closes the training-monitoring gap surfaced by the PMAT-704 cascade. A 1.5 h gx10 hang at step 0 went unnoticed because the distill pipeline emits zero output between "blocks uploaded" and "Distillation complete."
Root cause: `aprender-train` ships a full callback infrastructure (`ProgressCallback`, `MonitorCallback`, `CheckpointCallback`, etc.) but `aprender-train-distill::Pipeline` had zero callback references. Per-step loss was computed in `kd_step` and discarded.
Fix
`crates/aprender-train-distill/src/pipeline.rs`:
`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:
Contract
`contracts/distill-pipeline-observability-v1.yaml` (validates clean):
Unit tests
3 new tests in `pipeline::tests::pmat_705_observability`, all PASS:
```
falsify_observ_001_per_step_callback_fires .... ok
falsify_observ_001_step_count_matches ......... ok
callback_ordering_preserved ................... ok
```
The tests use a recording callback that captures every lifecycle event and verify that the counts match (epochs × steps_per_epoch).
gx10 verification
Dispatch emits:
```
[PMAT-705] ProgressCallback attached (log every 1 steps)
```
Full per-step output on the 7B teacher path needs #1879 (PMAT-704) to land too — current main routes Q4K through Bug B's CPU-bound path which hangs at step 0 (no steps → no logs). The PMAT-705 wiring is correct independent of teacher backend; unit tests prove it.
Cascade
Sixth fix in the PMAT-701 family:
Together the distill → eval pipeline is honest AND observable end-to-end.
Test plan
🤖 Generated with Claude Code