
feat(distill): wire ProgressCallback into Pipeline — close training-monitoring gap (PMAT-705)#1881

Closed
noahgift wants to merge 3 commits into main from feat/distill-pipeline-progress-callback-pmat-705


@noahgift


Summary

Closes the training-monitoring gap surfaced by the PMAT-704 cascade. A 1.5 h gx10 hang at step 0 went unnoticed because the distill pipeline emits zero output between "blocks uploaded" and "Distillation complete."

Root cause: `aprender-train` ships a full callback infrastructure (`ProgressCallback`, `MonitorCallback`, `CheckpointCallback`, etc.) but `aprender-train-distill::Pipeline` had zero callback references. Per-step loss was computed in `kd_step` and discarded.

Fix

`crates/aprender-train-distill/src/pipeline.rs`:

  • `Pipeline` gains `callbacks: Vec<Box<dyn entrenar::train::TrainerCallback>>` field + `with_callback()` builder.
  • Training loop wires full lifecycle: `on_train_begin` → `on_epoch_begin` → `on_step_end` × N (with current loss + step counter) → `on_epoch_end` → `on_train_end`.
  • `CallbackAction::Stop` terminates the loop within one step (early-stopping support).
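
The lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the code in `pipeline.rs`: the trait, its method signatures, the `StopAt` callback, and the `kd_step` closure are hypothetical stand-ins, since the real `TrainerCallback` API is not shown in this PR. The sketch also assumes that a `Stop` skips `on_epoch_end` for the interrupted epoch but still fires `on_train_end`; the PR does not specify that detail.

```rust
/// Hypothetical stand-in for the real `CallbackAction` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CallbackAction {
    Continue,
    Stop,
}

/// Hypothetical stand-in for the trainer callback trait (signatures assumed).
pub trait TrainerCallback {
    fn on_train_begin(&mut self) {}
    fn on_epoch_begin(&mut self, _epoch: usize) {}
    fn on_step_end(&mut self, _step: usize, _loss: f32) -> CallbackAction {
        CallbackAction::Continue
    }
    fn on_epoch_end(&mut self, _epoch: usize) {}
    fn on_train_end(&mut self) {}
}

#[derive(Default)]
pub struct Pipeline {
    callbacks: Vec<Box<dyn TrainerCallback>>,
}

impl Pipeline {
    /// Builder-style registration, mirroring `with_callback()` from the PR.
    pub fn with_callback(mut self, cb: Box<dyn TrainerCallback>) -> Self {
        self.callbacks.push(cb);
        self
    }

    /// Runs the loop and returns the number of completed steps.
    /// `kd_step` is a placeholder closure that returns the loss for a step.
    pub fn train(
        &mut self,
        epochs: usize,
        steps_per_epoch: usize,
        mut kd_step: impl FnMut(usize) -> f32,
    ) -> usize {
        let mut step = 0usize;
        for cb in self.callbacks.iter_mut() {
            cb.on_train_begin();
        }
        'training: for epoch in 0..epochs {
            for cb in self.callbacks.iter_mut() {
                cb.on_epoch_begin(epoch);
            }
            for _ in 0..steps_per_epoch {
                let loss = kd_step(step);
                step += 1;
                // Every callback sees the step; any Stop ends the loop
                // before the next step starts.
                let mut stop = false;
                for cb in self.callbacks.iter_mut() {
                    if cb.on_step_end(step, loss) == CallbackAction::Stop {
                        stop = true;
                    }
                }
                if stop {
                    break 'training;
                }
            }
            for cb in self.callbacks.iter_mut() {
                cb.on_epoch_end(epoch);
            }
        }
        for cb in self.callbacks.iter_mut() {
            cb.on_train_end();
        }
        step
    }
}

/// Example callback: requests a stop once the step counter reaches `limit`.
pub struct StopAt {
    pub limit: usize,
}

impl TrainerCallback for StopAt {
    fn on_step_end(&mut self, step: usize, _loss: f32) -> CallbackAction {
        if step >= self.limit {
            CallbackAction::Stop
        } else {
            CallbackAction::Continue
        }
    }
}
```

With `StopAt { limit: 3 }` attached, a run configured for 2 epochs of 5 steps completes 3 steps; with no callbacks attached it completes all 10.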

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

  • Attaches default `ProgressCallback` honoring `APR_DISTILL_LOG_EVERY` env (default 10). Set to 0 to disable per-step logs (epoch boundaries still log).
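
One way the `APR_DISTILL_LOG_EVERY` handling could look is sketched below. The helper names are hypothetical, and falling back to the default on an unparseable value is an assumption; the PR only states the default of 10 and that 0 disables per-step logs.

```rust
/// Parses the raw value of `APR_DISTILL_LOG_EVERY`.
/// Missing value: default of 10 (from the PR description).
/// Unparseable value: also falls back to 10 (assumption, not confirmed by the PR).
pub fn parse_log_every(raw: Option<&str>) -> usize {
    const DEFAULT_LOG_EVERY: usize = 10;
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(DEFAULT_LOG_EVERY)
}

/// Reads the setting from the process environment.
pub fn log_every_from_env() -> usize {
    parse_log_every(std::env::var("APR_DISTILL_LOG_EVERY").ok().as_deref())
}

/// Decides whether a per-step line is printed for `step`.
/// `log_every == 0` silences per-step output; epoch-boundary lines are
/// handled elsewhere and are not affected by this check.
pub fn should_log_step(step: usize, log_every: usize) -> bool {
    log_every != 0 && step % log_every == 0
}
```

Keeping the parse step separate from the environment read lets it be tested without touching process-wide state.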

Contract

`contracts/distill-pipeline-observability-v1.yaml` (validates clean):

  • 3 equations: callback lifecycle, progress log format, default attachment
  • 4 falsifiers: per-step callback fires; log_every honored; APR_DISTILL_LOG_EVERY=0 silences; overhead bounded
  • 2 Kani harnesses (callback ordering, Stop terminates loop)
  • qa_gate F-OBSERV-001

Unit tests

3 new tests in `pipeline::tests::pmat_705_observability`, all PASS:

```
falsify_observ_001_per_step_callback_fires .... ok
falsify_observ_001_step_count_matches ......... ok
callback_ordering_preserved ................... ok
```

The tests use a recording callback that captures every lifecycle event and verify that the event counts match the expected epochs × steps_per_epoch.
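
A recording callback of this kind might look like the sketch below. The `Event` enum, `RecordingCallback`, and the `drive` function (which stands in for the pipeline's training loop) are hypothetical names for illustration; the actual test code is not shown in this PR.

```rust
/// Lifecycle events captured by the recorder (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    TrainBegin,
    EpochBegin(usize),
    StepEnd { step: usize, loss: f32 },
    EpochEnd(usize),
    TrainEnd,
}

/// Records every event it sees, in order.
#[derive(Default)]
pub struct RecordingCallback {
    pub events: Vec<Event>,
}

impl RecordingCallback {
    /// Number of `StepEnd` events recorded so far.
    pub fn step_count(&self) -> usize {
        self.events
            .iter()
            .filter(|e| matches!(e, Event::StepEnd { .. }))
            .count()
    }
}

/// Stand-in for the training loop: emits the lifecycle in the order the
/// PR describes, with a dummy loss value per step.
pub fn drive(rec: &mut RecordingCallback, epochs: usize, steps_per_epoch: usize) {
    rec.events.push(Event::TrainBegin);
    let mut step = 0usize;
    for epoch in 0..epochs {
        rec.events.push(Event::EpochBegin(epoch));
        for _ in 0..steps_per_epoch {
            step += 1;
            rec.events.push(Event::StepEnd { step, loss: 1.0 / step as f32 });
        }
        rec.events.push(Event::EpochEnd(epoch));
    }
    rec.events.push(Event::TrainEnd);
}
```

For 2 epochs of 3 steps the recorder holds 6 `StepEnd` events and 12 events in total, bracketed by `TrainBegin` and `TrainEnd`.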

gx10 verification

Dispatch emits:

```
[PMAT-705] ProgressCallback attached (log every 1 steps)
```

Full per-step output on the 7B teacher path needs #1879 (PMAT-704) to land too — current main routes Q4K through Bug B's CPU-bound path which hangs at step 0 (no steps → no logs). The PMAT-705 wiring is correct independent of teacher backend; unit tests prove it.

Cascade

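Sixth fix in the PMAT-701 family:

  • #1863 PMAT-701 Bug A: allocator autodetect Grace Blackwell
  • #1869 PMAT-701 Bug B: RealizarQ4KTeacher (demoted by #1879)
  • #1871 SPEC-DISTILL §86 + dispatch script default
  • #1874 PMAT-702: apr eval no-fake-pass on broken models
  • #1877 PMAT-703: teacher vocab alignment (superseded by #1879's TruncatingTeacher)
  • #1879 PMAT-704: cuBLAS default + opt-in Realizar fallback
  • #1880 SPEC-DISTILL §87: PMAT-704 post-mortem
  • This PR: PMAT-705, ProgressCallback wired into Pipeline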

Together the distill → eval pipeline is honest AND observable end-to-end.

Test plan

  • `cargo build --release --features cuda -p apr-cli` — clean
  • `cargo test -p aprender-train-distill --lib pipeline::tests::pmat_705` — 3/3 PASS
  • `cargo fmt -p aprender-train-distill -p apr-cli --check` — clean
  • `pv validate contracts/distill-pipeline-observability-v1.yaml` — clean
  • gx10 dispatch confirms `[PMAT-705] ProgressCallback attached` marker fires
  • CI: `ci / gate` + `workspace-test` green
  • Follow-up: capture per-step loss trajectory from a 7B run once #1879 (PMAT-704: default Q4K teacher to CudaTrainerTeacher (cuBLAS), reverting Bug B's slow path) lands

🤖 Generated with Claude Code

…onitoring gap (PMAT-705)

The PMAT-704 cascade post-mortem revealed a hard observability gap: long
distill dispatches were silent between "blocks uploaded" and
"Distillation complete." There was no way to distinguish "training
progressing silently" from "training hung." A 1.5 h gx10 hang at step 0
was invisible until terminal state.

Root cause: `aprender-train` ships a full callback infrastructure
(`ProgressCallback`, `MonitorCallback`, `CheckpointCallback`,
`EarlyStoppingCallback`, etc.) but `aprender-train-distill::Pipeline`
had zero callback references. Per-step loss was computed in `kd_step`
and discarded.

## Fix

`crates/aprender-train-distill/src/pipeline.rs`:

* `Pipeline` gains `callbacks: Vec<Box<dyn entrenar::train::TrainerCallback>>`
  field and `with_callback()` builder method.
* Training loop fires the full lifecycle:
  - `on_train_begin` once before the first step.
  - `on_epoch_begin` at each epoch boundary.
  - `on_step_end` after grad application with current loss + step
    counter (the load-bearing observability point).
  - `on_epoch_end` at each epoch boundary.
  - `on_train_end` once after the loop.
* `CallbackAction::Stop` from any callback terminates the training
  loop within one step (early-stopping support).

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

* Attaches a default `ProgressCallback` honoring the env var
  `APR_DISTILL_LOG_EVERY` (default 10). `APR_DISTILL_LOG_EVERY=0`
  disables per-step logs but keeps epoch boundary lines.

## Contract

`contracts/distill-pipeline-observability-v1.yaml`:

* 3 equations: callback_lifecycle, progress_log_format, default_attachment.
* 4 falsifiers (FT-OBSERV-001..004): per-step callback fires; log_every
  honored; APR_DISTILL_LOG_EVERY=0 silences per-step; callback overhead
  bounded.
* 2 Kani harnesses (callback_ordering, Stop_terminates_loop).
* qa_gate F-OBSERV-001.

Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Unit tests

`crates/aprender-train-distill/src/pipeline.rs` tests module:

* `pmat_705_observability::falsify_observ_001_per_step_callback_fires`
* `pmat_705_observability::falsify_observ_001_step_count_matches`
* `pmat_705_observability::callback_ordering_preserved`

All 3 PASS. The tests use a recording callback that captures every
lifecycle event and verify that event counts match the expected
(epochs × steps_per_epoch) shape.

## gx10 dispatch verification

Dispatch on gx10 emits:

  [PMAT-705] ProgressCallback attached (log every 1 steps)

…confirming the wiring fires. Full per-step output on the 7B teacher
path requires PMAT-704 (#1879) to also land — current main routes Q4K
teachers through Bug B's CPU-bound `RealizarQ4KTeacher` which hangs at
step 0 (no progress to log). The PMAT-705 wiring is correct
independent of teacher backend; the unit tests prove it.

## Cascade context

Sixth fix in the PMAT-701 family:

  - #1863 PMAT-701 Bug A: allocator autodetect Grace Blackwell
  - #1869 PMAT-701 Bug B: RealizarQ4KTeacher (demoted by #1879)
  - #1871 SPEC-DISTILL §86 + dispatch script default
  - #1874 PMAT-702: apr eval no-fake-pass on broken models
  - #1877 PMAT-703: teacher vocab alignment (superseded by #1879's TruncatingTeacher)
  - #1879 PMAT-704: cuBLAS default + opt-in Realizar fallback
  - #1880 SPEC-DISTILL §87: PMAT-704 post-mortem
  - **This PR**: PMAT-705 — ProgressCallback wired into Pipeline

Together: the distill → eval pipeline is honest end-to-end AND
observable end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 22, 2026 11:09
@noahgift


Subsumed by #1897 (PMAT-702..705 distill cascade bundle for v0.35.x hiatus-prep). Squash-merge preserves the per-PR commit message — see commit log on #1897.

@noahgift noahgift closed this May 23, 2026
auto-merge was automatically disabled May 23, 2026 04:37

Pull request was closed

noahgift added a commit that referenced this pull request May 23, 2026
When the operator sets `APR_DISTILL_MAX_STEPS=N` (default unset), the
distill training loop runs at most N steps, prints a per-run summary,
and exits without writing a final output model. This lets operators
validate the cascade end-to-end in ~60 s before committing to a 30-50 h
Stage D production run.

The PMAT-704 cascade post-mortem found that the 7B vocab-aligned
500-step validation hung at step 0 for 1.5 h with no per-step output.
PMAT-705 (#1881) added ProgressCallback to surface per-step loss
during normal runs. PMAT-706 adds the complementary EARLY-BREAK so
operators don't have to wait through the full epoch budget to see if
something's wrong.

`crates/aprender-train-distill/src/pipeline.rs`:

* Reads `APR_DISTILL_MAX_STEPS` env var. Empty/unset = old behavior
  (no regression). N > 0 = run at most N steps then break. N = 0 or
  non-integer = early Err with clear message.
* Optional `APR_DISTILL_PROJECT_TO_STEPS` env var (default 50000)
  controls the projected-wall-time target in the summary.
* `train()` early-breaks the inner loop when step >= max_steps,
  prints two `[SMOKE]` summary lines (loss trajectory + projected
  wall time at the observed throughput), and returns empty weights /
  shapes via the normal Result path.
* `execute()` detects smoke mode (env var set) and short-circuits
  the export step — no `model.safetensors` / output.apr is written,
  so downstream tools (`apr eval`, `apr run`) can't accidentally
  consume a smoke result.
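
The `APR_DISTILL_MAX_STEPS` rules listed above (unset or empty means no limit, N > 0 limits the run, 0 or a non-integer is an error) can be sketched as follows. Function names and error strings are hypothetical; only the three-way behavior and the `step >= max_steps` break condition come from the commit message.

```rust
/// Interprets the raw value of `APR_DISTILL_MAX_STEPS`.
/// - unset or empty: `Ok(None)`, i.e. the old unlimited behavior
/// - positive integer N: `Ok(Some(N))`
/// - 0 or non-integer: `Err` with a descriptive message
pub fn parse_max_steps(raw: Option<&str>) -> Result<Option<usize>, String> {
    match raw.map(str::trim) {
        None | Some("") => Ok(None),
        Some(s) => match s.parse::<usize>() {
            Ok(0) => Err("APR_DISTILL_MAX_STEPS must be > 0".to_string()),
            Ok(n) => Ok(Some(n)),
            Err(_) => Err(format!(
                "APR_DISTILL_MAX_STEPS must be a positive integer, got {s:?}"
            )),
        },
    }
}

/// Early-break check for the inner loop: true once `step` has reached the
/// limit, never true when no limit is configured.
pub fn should_break(step: usize, max_steps: Option<usize>) -> bool {
    matches!(max_steps, Some(n) if step >= n)
}
```

Rejecting 0 up front, rather than treating it as "run nothing", keeps a mistyped value from silently producing an empty smoke run.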

  [PMAT-706] smoke mode: APR_DISTILL_MAX_STEPS=N (early-break after N steps; no final output.apr written)
  ...
  [SMOKE] N steps in T.TTs: initial_loss=X.XXXX, final_loss=Y.YYYY, throughput=Z.ZZ step/s
  [SMOKE] projected full-run wall time (50000 steps): H.HHh / W.W min / S.Ss
  [PMAT-706] smoke mode: skipping export — no model.safetensors / output.apr written
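
The projection in the second `[SMOKE]` line is plain arithmetic: observed throughput is steps divided by elapsed seconds, and the projected wall time is the target step count divided by that throughput. The sketch below assumes the decimal places implied by the placeholders (`H.HH`, `W.W`, `S.S`); the function names are hypothetical.

```rust
/// Projected wall time in seconds for `target_steps`, extrapolated from
/// `steps` completed in `elapsed_secs`. Returns `None` when no throughput
/// can be computed (no steps, or a non-positive elapsed time).
pub fn projected_secs(steps: usize, elapsed_secs: f64, target_steps: usize) -> Option<f64> {
    if steps == 0 || elapsed_secs <= 0.0 {
        return None;
    }
    let throughput = steps as f64 / elapsed_secs; // steps per second
    Some(target_steps as f64 / throughput)
}

/// Formats the projection in the shape of the sample line above
/// (hours with 2 decimals, minutes and seconds with 1 decimal).
pub fn format_projection(target_steps: usize, secs: f64) -> String {
    format!(
        "[SMOKE] projected full-run wall time ({} steps): {:.2}h / {:.1} min / {:.1}s",
        target_steps,
        secs / 3600.0,
        secs / 60.0,
        secs
    )
}
```

For example, 10 steps in 20 s is 0.5 step/s, so the default 50000-step target projects to 100000 s, or about 27.78 h.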

`contracts/apr-distill-smoke-validation-v1.yaml`:

* 3 equations: early_break_condition (off-by-one tight), smoke_summary_format,
  no_side_effects.
* 4 falsifiers (FT-SMOKE-001..004) covering exact step count, no-regression
  when unset, summary line format, no output.apr written.
* 2 Kani harnesses (count is tight; 0 steps is degenerate, not panic).
* qa_gate F-SMOKE-001.
* Validates clean: `pv validate` 0 errors, 0 warnings.

`pipeline::tests::pmat_706_smoke_validation`:

  * `falsify_smoke_001_exact_step_count` — N=10 returns metrics.steps_completed == 10
  * `falsify_smoke_002_no_regression_when_unset` — unset → full epochs run
  * `falsify_smoke_004_no_output_in_smoke` — output_path empty + no model.* files
  * `smoke_zero_steps_returns_err` — N=0 returns Err

Tests share global env state; serialized via a Mutex (ENV_LOCK) so
they don't race in parallel threads. All 4 PASS.
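
Serializing env-dependent tests behind a lock can be sketched like this. Only the `ENV_LOCK` name and the use of a `Mutex` come from the commit message; the `with_env_var` helper is a hypothetical illustration. The `unsafe` blocks are there because `std::env::set_var` and `remove_var` are unsafe functions in the Rust 2024 edition; on earlier editions they are redundant but harmless.

```rust
use std::sync::Mutex;

/// Global lock so tests that mutate the process environment run one at a time.
static ENV_LOCK: Mutex<()> = Mutex::new(());

/// Runs `body` with `key` set to `value` (or removed when `value` is `None`),
/// then restores the previous value. The lock is held for the whole call.
/// Note: if `body` panics, the previous value is not restored.
#[allow(unused_unsafe)]
pub fn with_env_var<T>(key: &str, value: Option<&str>, body: impl FnOnce() -> T) -> T {
    // Recover from a poisoned lock so one failed test does not fail the rest.
    let _guard = ENV_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    let previous = std::env::var(key).ok();
    unsafe {
        match value {
            Some(v) => std::env::set_var(key, v),
            None => std::env::remove_var(key),
        }
    }
    let out = body();
    unsafe {
        match previous {
            Some(v) => std::env::set_var(key, v),
            None => std::env::remove_var(key),
        }
    }
    out
}
```

A test would wrap its body in `with_env_var("APR_DISTILL_MAX_STEPS", Some("10"), || { ... })`, so a parallel test that expects the variable to be unset cannot observe the temporary value.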

This closes the diagnostic loop on the PMAT-704 cascade post-mortem
lesson. Per memory `feedback_a_priori_theoretical_falsification.md`:
30 min of math saves 8 h of GPU. PMAT-706 is the runtime analog:
60 s of smoke saves 8 h of staring at a silent process.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 23, 2026
…#1888)
