
chore(distill): Stage D dispatch wrapper with PMAT-701 lessons baked in #1883

Closed
noahgift wants to merge 4 commits into main from chore/stage-d-dispatch-wrapper

@noahgift

Summary

`scripts/dispatch-distill-stage-d.sh` is the operator entrypoint for Phase 4 Stage D production training. It captures the lessons of the PMAT-701 cascade post-mortem in a single dispatchable wrapper.

What it bakes in

| Lesson | Default behavior |
| --- | --- |
| PMAT-704 cuBLAS default (#1879) | `APR_DISTILL_TEACHER_BACKEND=auto`; operators can opt in to `realizar-q4k` for memory-constrained dGPUs |
| PMAT-705 per-step monitoring (#1881) | `APR_DISTILL_LOG_EVERY=50`; visible loss progress without spam |
| PMAT-699 P0 checkpointing | every 5000 steps (survives kill / crash) |
| PMAT-703 vocab alignment | auto-applies inside cuda backend (no operator config) |
| Disk preflight | requires ≥ 15 GB free; gx10 was 98% full when PMAT-704 incident surfaced |
| Teacher metadata validation | requires stamped APR (apr-leaderboard cache); fails fast on the GGUF-import path that broke during PMAT-704 |
| 10 s alive check | catches early validation errors before operator walks away |
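
For illustration, the 10 s alive check in the last row could look roughly like the sketch below. This is not the code in `scripts/dispatch-distill-stage-d.sh`; the function names `alive_after` and `dispatch_and_verify` and the variables `LOG_FILE` and `ALIVE_CHECK_S` are hypothetical, and the sketch assumes the training job is launched as a background child of the wrapper.

```shell
#!/usr/bin/env bash

# Hypothetical helper: return 0 if PID $1 is still running after $2 seconds.
alive_after() {
    local pid="$1" wait_s="$2"
    sleep "$wait_s"
    # kill -0 delivers no signal; it only tests that the process exists
    # and that we are allowed to signal it.
    kill -0 "$pid" 2>/dev/null
}

# Hypothetical helper: launch "$@" in the background with output going to
# $LOG_FILE, then verify it survives the first $ALIVE_CHECK_S seconds.
dispatch_and_verify() {
    local log_file="${LOG_FILE:-/tmp/stage-d.log}"   # assumed log location
    local wait_s="${ALIVE_CHECK_S:-10}"              # assumed variable name
    nohup "$@" >"$log_file" 2>&1 &
    local pid=$!
    if alive_after "$pid" "$wait_s"; then
        echo "dispatch: pid ${pid} alive after ${wait_s}s, log at ${log_file}"
    else
        echo "dispatch: pid ${pid} exited within ${wait_s}s, last log lines:" >&2
        tail -n 20 "$log_file" >&2
        return 1
    fi
}
```

Printing the tail of the log on failure is a design choice of this sketch: an early validation error then shows up in the operator's terminal rather than only in the log file.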

Override env vars

`STEPS`, `BATCH_SIZE`, `LR`, `T`, `ALPHA`, `DATASET_DIR`, `APR_DISTILL_LOG_EVERY`, `APR_DISTILL_CHECKPOINT_EVERY`, `APR_DISTILL_TEACHER_BACKEND`, `DISK_FREE_REQUIRED_GB`, `DRY_RUN`.

The wrapper is intentionally separate from `dispatch-distill-phase-3-gx10.sh`, which remains the Phase 3 smoke entrypoint. SPEC-DISTILL-001 §86 and `feedback_smoke_defaults_leak_into_production.md` codify why the two should NOT share defaults.

QA

Cascade context

Companion to the PMAT-701 family of fixes. Ready to dispatch once #1879 (PMAT-704 cuBLAS default) and #1881 (PMAT-705 ProgressCallback) land — without those, this wrapper would default to the slow / silent path.

🤖 Generated with Claude Code

…s baked in

`scripts/dispatch-distill-stage-d.sh` is the operator entrypoint for
Phase 4 Stage D production training. Captures the PMAT-701 cascade
post-mortem lessons in a single dispatchable wrapper:

* **cuBLAS default** (PMAT-704 / #1879). `APR_DISTILL_TEACHER_BACKEND=auto`
  by default; operators can opt into the slower memory-constrained
  Realizar path via `APR_DISTILL_TEACHER_BACKEND=realizar-q4k`.
* **Per-step monitoring** (PMAT-705 / #1881). `APR_DISTILL_LOG_EVERY=50`
  default — visible loss progress without log spam. Operators can set
  =1 for verbose mode or =0 to silence.
* **PMAT-699 P0 checkpointing** every 5000 steps (durability — survives
  kill / crash).
* **PMAT-703 vocab alignment** auto-applies inside the cuda backend
  when teacher.vocab > student.vocab (no operator config needed).
* **Disk preflight**: requires ≥ 15 GB free on /home/noah (Stage D 50K
  writes ~12 GB of checkpoints; PMAT-704 cascade post-mortem caught
  gx10 at 98 % full). Fails fast with cleanup candidates listed.
* **Teacher / student validation**: requires stamped APR metadata
  (apr-leaderboard checkpoint by default — the dispatch-script's
  `apr import --preserve-q4k` path fails the cuda backend's
  metadata-required check, surfaced by PMAT-704 incident).
* **Process-alive check**: 10 s post-dispatch verification catches
  early validation errors so the operator doesn't walk away from a
  failed dispatch.
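
The disk preflight bullet above can be sketched as follows. This is an assumption-laden illustration, not the wrapper's actual code: `free_gb` and `disk_preflight` are hypothetical names, and the cleanup-candidate listing mentioned in the bullet is not reproduced here.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print free space (whole GB) on the filesystem holding $1.
free_gb() {
    # -P forces one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    # Field 4 of the data line is the available block count.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Hypothetical helper: succeed only if at least $2 GB are free under $1.
disk_preflight() {
    local dir="$1" required_gb="$2" avail_gb
    avail_gb="$(free_gb "$dir")"
    # Refuse to continue if df output could not be parsed into an integer.
    if ! [[ "$avail_gb" =~ ^[0-9]+$ ]]; then
        echo "preflight: could not determine free space under ${dir}" >&2
        return 1
    fi
    if (( avail_gb < required_gb )); then
        echo "preflight: ${avail_gb} GB free under ${dir}, need >= ${required_gb} GB" >&2
        return 1
    fi
    echo "preflight: ${avail_gb} GB free under ${dir} (>= ${required_gb} GB required)"
}

# Example call, using the documented default threshold:
#   disk_preflight /home/noah "${DISK_FREE_REQUIRED_GB:-15}" || exit 1
```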

The wrapper is intentionally separate from `dispatch-distill-phase-3-gx10.sh`
which remains the Phase 3 smoke entrypoint. Stage D is production scope
and shouldn't inherit smoke defaults (see SPEC-DISTILL-001 §86 +
memory `feedback_smoke_defaults_leak_into_production.md`).

## Override env vars

* `STEPS` (default 50000)
* `BATCH_SIZE` (default 32)
* `LR` (default 1.5e-5)
* `T` (default 4.0)
* `ALPHA` (default 0.3)
* `DATASET_DIR` (unset → synthetic; set to a `.bin` shard dir for real corpus)
* `APR_DISTILL_LOG_EVERY` (default 50)
* `APR_DISTILL_CHECKPOINT_EVERY` (default 5000)
* `APR_DISTILL_TEACHER_BACKEND` (default `auto`)
* `DISK_FREE_REQUIRED_GB` (default 15)
* `DRY_RUN=1` to plan only
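
One common way to implement overrides like these in bash is `${VAR:-default}` expansion. The sketch below uses the defaults listed above; the function names `resolve_stage_d_config` and `print_plan` are hypothetical, and treating an unset `DRY_RUN` as `0` is an assumption, since the list only documents `DRY_RUN=1`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: apply the documented default to every override
# that the operator left unset (or set to the empty string).
resolve_stage_d_config() {
    STEPS="${STEPS:-50000}"
    BATCH_SIZE="${BATCH_SIZE:-32}"
    LR="${LR:-1.5e-5}"
    T="${T:-4.0}"
    ALPHA="${ALPHA:-0.3}"
    DATASET_DIR="${DATASET_DIR:-}"   # empty means synthetic data
    APR_DISTILL_LOG_EVERY="${APR_DISTILL_LOG_EVERY:-50}"
    APR_DISTILL_CHECKPOINT_EVERY="${APR_DISTILL_CHECKPOINT_EVERY:-5000}"
    APR_DISTILL_TEACHER_BACKEND="${APR_DISTILL_TEACHER_BACKEND:-auto}"
    DISK_FREE_REQUIRED_GB="${DISK_FREE_REQUIRED_GB:-15}"
    DRY_RUN="${DRY_RUN:-0}"          # assumption: unset behaves like 0
}

# Hypothetical helper: print the resolved plan, e.g. for DRY_RUN=1.
print_plan() {
    printf 'steps=%s batch=%s lr=%s T=%s alpha=%s\n' \
        "$STEPS" "$BATCH_SIZE" "$LR" "$T" "$ALPHA"
    printf 'dataset=%s\n' "${DATASET_DIR:-synthetic}"
    printf 'log_every=%s checkpoint_every=%s backend=%s\n' \
        "$APR_DISTILL_LOG_EVERY" "$APR_DISTILL_CHECKPOINT_EVERY" \
        "$APR_DISTILL_TEACHER_BACKEND"
    printf 'disk_free_required_gb=%s dry_run=%s\n' \
        "$DISK_FREE_REQUIRED_GB" "$DRY_RUN"
}
```

Because `:-` only substitutes for unset or empty values, an explicit `APR_DISTILL_LOG_EVERY=0` is preserved rather than replaced by 50. A plan-only run would presumably be invoked as `DRY_RUN=1 STEPS=1000 bash scripts/dispatch-distill-stage-d.sh`.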

## QA

* `bash -n scripts/dispatch-distill-stage-d.sh` — syntax-ok
* `bashrs lint scripts/dispatch-distill-stage-d.sh` — 0 errors
  (warnings are df-non-determinism + path-traversal-ln, both expected
  for an operator-supplied path dispatcher)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 22, 2026 13:44
@noahgift

Subsumed by #1898 (mega-bundle hiatus close-out). Squash-merge preserves the per-PR commit message — see #1898 commit log.

noahgift closed this May 23, 2026
auto-merge was automatically disabled May 23, 2026 07:09

Pull request was closed

noahgift added a commit that referenced this pull request May 23, 2026
, #1896, #1897) (#1898)

* docs(spec): SPEC-DISTILL-001 §87 — PMAT-704 post-mortem on Bug B wrong turn (#1880)

* chore(distill): Stage D dispatch wrapper with PMAT-701 lessons baked in (#1883)

* chore(distill): Phase 5 HumanEval dispatch wrapper (#1886)

* chore: bundle PMAT-702..705 distill cascade + clippy fix (#1897)

* fix(cli): point 7B qwen models to single-file GGUF artifacts and align caches (#1891)

* fix(chat): preserve original path in FileNotFound for filesystem paths

PR #1891 wrapped all path_arg through HF alias resolution. For inputs
that look like filesystem paths (absolute or starts with ./, ../) and
don't exist, the alias resolver was rewriting them as hf:// URIs and
returning a mangled path in the FileNotFound error.

Fix: short-circuit with the original path_arg in the error BEFORE alias
resolution kicks in. Preserves the contract that test_run_file_not_found
and test_run_nonexistent_path_without_trace assert.

Closes the workspace-test failure on bundle PR #1898.