feat(distill): wire ProgressCallback into Pipeline — close training-monitoring gap (PMAT-705) by noahgift · Pull Request #1881 · paiml/aprender

noahgift · 2026-05-22T11:09:53Z

Summary

Closes the training-monitoring gap surfaced by the PMAT-704 cascade. A 1.5 h gx10 hang at step 0 went unnoticed because the distill pipeline emits zero output between "blocks uploaded" and "Distillation complete."

Root cause: `aprender-train` ships a full callback infrastructure (`ProgressCallback`, `MonitorCallback`, `CheckpointCallback`, etc.) but `aprender-train-distill::Pipeline` had zero callback references. Per-step loss was computed in `kd_step` and discarded.

Fix

`crates/aprender-train-distill/src/pipeline.rs`:

`Pipeline` gains a `callbacks: Vec<Box<dyn entrenar::train::TrainerCallback>>` field + `with_callback()` builder.
Training loop wires full lifecycle: `on_train_begin` → `on_epoch_begin` → `on_step_end` × N (with current loss + step counter) → `on_epoch_end` → `on_train_end`.
`CallbackAction::Stop` terminates the loop within one step (early-stopping support).
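
The lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the code from `pipeline.rs`: the `TrainerCallback` trait here is a hypothetical stand-in whose method names follow the lifecycle list but whose signatures are assumed, and `StopBelow`, `run_loop`, and the `step_loss` closure are invented for the example. Whether `on_epoch_end` fires for an epoch interrupted by `Stop` is not stated in the PR; this sketch skips it.

```rust
/// What a callback asks the training loop to do next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CallbackAction {
    Continue,
    Stop,
}

/// Hypothetical stand-in for the real callback trait. Method names follow
/// the lifecycle in the PR description; the signatures are assumptions.
pub trait TrainerCallback {
    fn on_train_begin(&mut self) {}
    fn on_epoch_begin(&mut self, _epoch: usize) {}
    fn on_step_end(&mut self, _step: usize, _loss: f32) -> CallbackAction {
        CallbackAction::Continue
    }
    fn on_epoch_end(&mut self, _epoch: usize) {}
    fn on_train_end(&mut self) {}
}

/// Hypothetical early-stopping callback: stops once loss falls below a threshold.
pub struct StopBelow {
    pub threshold: f32,
}

impl TrainerCallback for StopBelow {
    fn on_step_end(&mut self, _step: usize, loss: f32) -> CallbackAction {
        if loss < self.threshold {
            CallbackAction::Stop
        } else {
            CallbackAction::Continue
        }
    }
}

/// Runs the loop and returns the number of completed steps.
/// `step_loss` stands in for the real per-step KD computation.
pub fn run_loop(
    callbacks: &mut [Box<dyn TrainerCallback>],
    epochs: usize,
    steps_per_epoch: usize,
    mut step_loss: impl FnMut(usize) -> f32,
) -> usize {
    let mut step = 0;
    for cb in callbacks.iter_mut() {
        cb.on_train_begin();
    }
    'train: for epoch in 0..epochs {
        for cb in callbacks.iter_mut() {
            cb.on_epoch_begin(epoch);
        }
        for _ in 0..steps_per_epoch {
            let loss = step_loss(step);
            step += 1;
            // Every callback sees the step; any single Stop ends training.
            let mut stop = false;
            for cb in callbacks.iter_mut() {
                if cb.on_step_end(step, loss) == CallbackAction::Stop {
                    stop = true;
                }
            }
            if stop {
                // Assumption: the interrupted epoch gets no on_epoch_end.
                break 'train;
            }
        }
        for cb in callbacks.iter_mut() {
            cb.on_epoch_end(epoch);
        }
    }
    for cb in callbacks.iter_mut() {
        cb.on_train_end();
    }
    step
}
```

Checking the stop flag right after the per-step callback fan-out is what keeps termination "within one step": no further loss computation happens once any callback has returned `Stop`.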

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

Attaches a default `ProgressCallback` that honors the `APR_DISTILL_LOG_EVERY` env var (default 10). Set it to 0 to disable per-step logs (epoch boundaries still log).
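
One plausible way to resolve the logging interval is sketched below. The function names are hypothetical, and the fallback to the default for empty or unparsable values is an assumption; the PR only states the default of 10 and that 0 disables per-step logs.

```rust
/// Default logging interval stated in the PR description.
pub const DEFAULT_LOG_EVERY: u64 = 10;

/// Resolves the interval from the raw env value.
/// Assumption: unset, empty, or unparsable values fall back to the default.
pub fn resolve_log_every(raw: Option<&str>) -> u64 {
    raw.map(str::trim)
        .filter(|s| !s.is_empty())
        .and_then(|s| s.parse::<u64>().ok())
        .unwrap_or(DEFAULT_LOG_EVERY)
}

/// True when a per-step line should be printed for this 1-based step.
/// An interval of 0 silences per-step output entirely.
pub fn should_log_step(step: u64, log_every: u64) -> bool {
    log_every != 0 && step % log_every == 0
}

/// Reads the real environment variable named in the PR.
pub fn log_every_from_env() -> u64 {
    resolve_log_every(std::env::var("APR_DISTILL_LOG_EVERY").ok().as_deref())
}
```

Keeping the parsing separate from the `std::env::var` call lets the rule be tested without touching process-global environment state.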

Contract

`contracts/distill-pipeline-observability-v1.yaml` (validates clean):

3 equations: callback lifecycle, progress log format, default attachment
4 falsifiers: per-step callback fires; log_every honored; APR_DISTILL_LOG_EVERY=0 silences; overhead bounded
2 Kani harnesses (callback ordering, Stop terminates loop)
qa_gate F-OBSERV-001

Unit tests

3 new tests in `pipeline::tests::pmat_705_observability`, all PASS:

```
falsify_observ_001_per_step_callback_fires .... ok
falsify_observ_001_step_count_matches ......... ok
callback_ordering_preserved ................... ok
```

The tests use a recording callback that captures every lifecycle event and verify that the event counts match the expected (epochs × steps_per_epoch) shape.
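
The recording approach might look something like the sketch below. It is not the test code from the PR: `Event`, `Recorder`, and `drive` are hypothetical names, and `drive` is a simplified stand-in for the pipeline loop that pushes events directly with a dummy loss of 0.0.

```rust
/// Lifecycle events a recording callback can observe.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    TrainBegin,
    EpochBegin(usize),
    StepEnd { step: usize, loss: f32 },
    EpochEnd(usize),
    TrainEnd,
}

/// Hypothetical recorder: appends every event it sees, in order.
#[derive(Default)]
pub struct Recorder {
    pub events: Vec<Event>,
}

impl Recorder {
    /// Number of per-step events captured so far.
    pub fn step_count(&self) -> usize {
        self.events
            .iter()
            .filter(|e| matches!(e, Event::StepEnd { .. }))
            .count()
    }
}

/// Stand-in for the pipeline loop: emits the lifecycle in order.
pub fn drive(rec: &mut Recorder, epochs: usize, steps_per_epoch: usize) {
    rec.events.push(Event::TrainBegin);
    let mut step = 0;
    for epoch in 0..epochs {
        rec.events.push(Event::EpochBegin(epoch));
        for _ in 0..steps_per_epoch {
            step += 1;
            rec.events.push(Event::StepEnd { step, loss: 0.0 });
        }
        rec.events.push(Event::EpochEnd(epoch));
    }
    rec.events.push(Event::TrainEnd);
}
```

With 3 epochs of 4 steps, a test in this style would expect 12 `StepEnd` events, `TrainBegin` first, and `TrainEnd` last.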

gx10 verification

Dispatch emits:

```
[PMAT-705] ProgressCallback attached (log every 1 steps)
```

Full per-step output on the 7B teacher path needs #1879 (PMAT-704) to land too — current main routes Q4K through Bug B's CPU-bound path which hangs at step 0 (no steps → no logs). The PMAT-705 wiring is correct independent of teacher backend; unit tests prove it.

Cascade

Sixth fix in the PMAT-701 family:

#1863 PMAT-701 Bug A: allocator autodetect (Grace Blackwell)
#1869 PMAT-701 Bug B: RealizarQ4KTeacher (demoted by #1879)
#1871 SPEC-DISTILL §86
#1874 PMAT-702: apr eval no-fake-pass
#1877 PMAT-703: vocab alignment (superseded by #1879)
#1879 PMAT-704: cuBLAS default
#1880 SPEC-DISTILL §87 post-mortem
This PR: PMAT-705 ProgressCallback

Together the distill → eval pipeline is honest AND observable end-to-end.

Test plan

`cargo build --release --features cuda -p apr-cli` — clean
`cargo test -p aprender-train-distill --lib pipeline::tests::pmat_705` — 3/3 PASS
`cargo fmt -p aprender-train-distill -p apr-cli --check` — clean
`pv validate contracts/distill-pipeline-observability-v1.yaml` — clean
gx10 dispatch confirms `[PMAT-705] ProgressCallback attached` marker fires
CI: `ci / gate` + `workspace-test` green
Follow-up: capture per-step loss trajectory from a 7B run once #1879 (PMAT-704) lands

🤖 Generated with Claude Code

…onitoring gap (PMAT-705)

The PMAT-704 cascade post-mortem revealed a hard observability gap: long distill dispatches were silent between "blocks uploaded" and "Distillation complete." There was no way to distinguish "training progressing silently" from "training hung." A 1.5 h gx10 hang at step 0 was invisible until terminal state.

Root cause: `aprender-train` ships a full callback infrastructure (`ProgressCallback`, `MonitorCallback`, `CheckpointCallback`, `EarlyStoppingCallback`, etc.) but `aprender-train-distill::Pipeline` had zero callback references. Per-step loss was computed in `kd_step` and discarded.

## Fix

`crates/aprender-train-distill/src/pipeline.rs`:

* `Pipeline` gains `callbacks: Vec<Box<dyn entrenar::train::TrainerCallback>>` field and `with_callback()` builder method.
* Training loop fires the full lifecycle:
  - `on_train_begin` once before the first step.
  - `on_epoch_begin` at each epoch boundary.
  - `on_step_end` after grad application with current loss + step counter (the load-bearing observability point).
  - `on_epoch_end` at each epoch boundary.
  - `on_train_end` once after the loop.
* `CallbackAction::Stop` from any callback terminates the training loop within one step (early-stopping support).

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

* Attaches a default `ProgressCallback` honoring the env var `APR_DISTILL_LOG_EVERY` (default 10). `APR_DISTILL_LOG_EVERY=0` disables per-step logs but keeps epoch boundary lines.

## Contract

`contracts/distill-pipeline-observability-v1.yaml`:

* 3 equations: callback_lifecycle, progress_log_format, default_attachment.
* 4 falsifiers (FT-OBSERV-001..004): per-step callback fires; log_every honored; APR_DISTILL_LOG_EVERY=0 silences per-step; callback overhead bounded.
* 2 Kani harnesses (callback_ordering, Stop_terminates_loop).
* qa_gate F-OBSERV-001.

Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Unit tests

`crates/aprender-train-distill/src/pipeline.rs` tests module:

* `pmat_705_observability::falsify_observ_001_per_step_callback_fires`
* `pmat_705_observability::falsify_observ_001_step_count_matches`
* `pmat_705_observability::callback_ordering_preserved`

All 3 PASS. Use a recording callback that captures every lifecycle event and verify event counts match the expected (epochs × steps_per_epoch) shape.

## gx10 dispatch verification

Dispatch on gx10 emits:

    [PMAT-705] ProgressCallback attached (log every 1 steps)

…confirming the wiring fires. Full per-step output on the 7B teacher path requires PMAT-704 (#1879) to also land: current main routes Q4K teachers through Bug B's CPU-bound `RealizarQ4KTeacher`, which hangs at step 0 (no progress to log). The PMAT-705 wiring is correct independent of teacher backend; the unit tests prove it.

## Cascade context

Sixth fix in the PMAT-701 family:

- #1863 PMAT-701 Bug A: allocator autodetect Grace Blackwell
- #1869 PMAT-701 Bug B: RealizarQ4KTeacher (demoted by #1879)
- #1871 SPEC-DISTILL §86 + dispatch script default
- #1874 PMAT-702: apr eval no-fake-pass on broken models
- #1877 PMAT-703: teacher vocab alignment (superseded by #1879's TruncatingTeacher)
- #1879 PMAT-704: cuBLAS default + opt-in Realizar fallback
- #1880 SPEC-DISTILL §87: PMAT-704 post-mortem
- **This PR**: PMAT-705, ProgressCallback wired into Pipeline

Together: the distill → eval pipeline is honest end-to-end AND observable end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-705

noahgift · 2026-05-23T04:37:25Z

Subsumed by #1897 (PMAT-702..705 distill cascade bundle for v0.35.x hiatus-prep). Squash-merge preserves the per-PR commit message — see commit log on #1897.

When the operator sets `APR_DISTILL_MAX_STEPS=N` (default unset), the distill training loop runs at most N steps, prints a per-run summary, and exits without writing a final output model. Lets operators validate the cascade end-to-end in ~60 s before committing to a 30-50 h Stage D production run.

The PMAT-704 cascade post-mortem found that the 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with no per-step output. PMAT-705 (#1881) added ProgressCallback to surface per-step loss during normal runs. PMAT-706 adds the complementary EARLY-BREAK so operators don't have to wait through the full epoch budget to see if something's wrong.

`crates/aprender-train-distill/src/pipeline.rs`:

* Reads `APR_DISTILL_MAX_STEPS` env var. Empty/unset = old behavior (no regression). N > 0 = run at most N steps then break. N = 0 or non-integer = early Err with clear message.
* Optional `APR_DISTILL_PROJECT_TO_STEPS` env var (default 50000) controls the projected-wall-time target in the summary.
* `train()` early-breaks the inner loop when step >= max_steps, prints two `[SMOKE]` summary lines (loss trajectory + projected wall time at the observed throughput), and returns empty weights / shapes via the normal Result path.
* `execute()` detects smoke mode (env var set) and short-circuits the export step: no `model.safetensors` / output.apr is written, so downstream tools (`apr eval`, `apr run`) can't accidentally consume a smoke result.

Smoke-mode output:

    [PMAT-706] smoke mode: APR_DISTILL_MAX_STEPS=N (early-break after N steps; no final output.apr written)
    ...
    [SMOKE] N steps in T.TTs: initial_loss=X.XXXX, final_loss=Y.YYYY, throughput=Z.ZZ step/s
    [SMOKE] projected full-run wall time (50000 steps): H.HHh / W.W min / S.Ss
    [PMAT-706] smoke mode: skipping export — no model.safetensors / output.apr written

`contracts/apr-distill-smoke-validation-v1.yaml`:

* 3 equations: early_break_condition (off-by-one tight), smoke_summary_format, no_side_effects.
* 4 falsifiers (FT-SMOKE-001..004) covering exact step count, no-regression when unset, summary line format, no output.apr written.
* 2 Kani harnesses (count is tight; 0 steps is degenerate, not panic).
* qa_gate F-SMOKE-001.
* Validates clean: `pv validate` 0 errors, 0 warnings.

`pipeline::tests::pmat_706_smoke_validation`:

* `falsify_smoke_001_exact_step_count`: N=10 returns metrics.steps_completed == 10
* `falsify_smoke_002_no_regression_when_unset`: unset → full epochs run
* `falsify_smoke_004_no_output_in_smoke`: output_path empty + no model.* files
* `smoke_zero_steps_returns_err`: N=0 returns Err

Tests share global env state; serialized via a Mutex (ENV_LOCK) so they don't race in parallel threads. All 4 PASS.

This closes the diagnostic loop on the PMAT-704 cascade post-mortem lesson. Per memory `feedback_a_priori_theoretical_falsification.md`: 30 min of math saves 8 h of GPU. PMAT-706 is the runtime analog: 60 s of smoke saves 8 h of staring at a silent process.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
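
The smoke-mode rules in this commit message (env parsing, the `step >= max_steps` break, and the wall-time projection) could be sketched as below. All function names are hypothetical, the error strings are invented wording, and treating whitespace-only values as empty is an assumption; only the unset / N > 0 / N = 0 / non-integer behavior and the summary line shape come from the text.

```rust
/// Parses a raw `APR_DISTILL_MAX_STEPS` value.
/// Unset or empty: normal mode (`None`). Positive integer: step cap.
/// Zero or non-integer: error (message wording is assumed).
pub fn parse_max_steps(raw: Option<&str>) -> Result<Option<u64>, String> {
    let s = match raw.map(str::trim) {
        None | Some("") => return Ok(None),
        Some(s) => s,
    };
    match s.parse::<u64>() {
        Ok(0) => Err("APR_DISTILL_MAX_STEPS must be > 0".to_string()),
        Ok(n) => Ok(Some(n)),
        Err(_) => Err(format!("APR_DISTILL_MAX_STEPS is not an integer: {s:?}")),
    }
}

/// Early-break condition: stop once the completed-step count reaches the cap.
pub fn should_break(steps_done: u64, max_steps: Option<u64>) -> bool {
    max_steps.map_or(false, |m| steps_done >= m)
}

/// Projected wall time in seconds for `target_steps` at the observed
/// throughput. Returns `None` when no throughput can be computed.
pub fn projected_secs(steps_done: u64, elapsed_secs: f64, target_steps: u64) -> Option<f64> {
    if steps_done == 0 || elapsed_secs <= 0.0 {
        return None;
    }
    let throughput = steps_done as f64 / elapsed_secs; // step/s
    Some(target_steps as f64 / throughput)
}

/// Formats the projection in the `H.HHh / W.W min / S.Ss` shape shown above.
pub fn format_projection(target_steps: u64, secs: f64) -> String {
    format!(
        "[SMOKE] projected full-run wall time ({} steps): {:.2}h / {:.1} min / {:.1}s",
        target_steps,
        secs / 3600.0,
        secs / 60.0,
        secs
    )
}
```

For instance, 10 steps observed in 20 s is 0.5 step/s, so a 50000-step target projects to 100000 s, or about 27.78 h.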


noahgift enabled auto-merge (squash) May 22, 2026 11:09

Merge branch 'main' into feat/distill-pipeline-progress-callback-pmat…

18ae74f

…-705

This was referenced May 22, 2026

chore(distill): Stage D dispatch wrapper with PMAT-701 lessons baked in #1883

Closed

feat(distill): APR_DISTILL_MAX_STEPS smoke-validation mode (PMAT-706) #1888

Merged

Merge branch 'main' into feat/distill-pipeline-progress-callback-pmat…

2d17380

…-705

noahgift mentioned this pull request May 23, 2026

chore: bundle PMAT-702..705 distill cascade (subsumes #1874, #1877, #1879, #1881) #1897

Closed

noahgift closed this May 23, 2026

auto-merge was automatically disabled May 23, 2026 04:37
Pull request was closed

noahgift mentioned this pull request May 23, 2026

chore: mega-bundle hiatus close-out (subsumes #1880, #1883, #1886, #1891, #1896, #1897) #1898

Merged

noahgift mentioned this pull request May 23, 2026

release: v0.35.2 — hiatus close-out drain #1899

Merged

noahgift wants to merge 3 commits into main from feat/distill-pipeline-progress-callback-pmat-705
