chore(distill): default to MODEL-1 7B teacher + SPEC-DISTILL-001 §86 (PMAT-701 follow-up)#1871

Merged
noahgift merged 3 commits into main from chore/distill-phase4-7b-teacher-default
May 22, 2026
@noahgift

Summary

Now that PMAT-701 Bug A (#1863) and Bug B (#1869) have landed, the MODEL-1 7B teacher (`paiml/qwen2.5-coder-7b-apache-q4k-v1`) is feasible on Grace Blackwell GB10. This PR is the small `chore` change that flips the dispatch script default + records the why in SPEC-DISTILL-001 §86.

Changes

  • `scripts/dispatch-distill-phase-3-gx10.sh`: `TEACHER_REPO` default changes from `Qwen/Qwen2.5-Coder-0.5B-Instruct` (smoke fallback) → `paiml/qwen2.5-coder-7b-apache-q4k-v1` (the spec's intended teacher). Smoke-only callers override with `TEACHER_REPO=...`. Old comment about the 1.5B Block-0 OOM is replaced with the PMAT-701 fix references.
  • `docs/specifications/aprender-train/distillation-epic-spec.md`: new §86 amendment documenting the 5-whys, the two fixed bugs, the new falsifier `F-DISTILL-V2-001-TEACHER-DIVERGENCE` (preflight reject when STEPS>=5000 and teacher==student without an explicit override), and the discharge of the prior Stage D 50K + 10K runs as no-KD. Spec version bumped 1.1.0 → 1.2.0.
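The preflight rule behind `F-DISTILL-V2-001-TEACHER-DIVERGENCE` can be sketched in a few lines of bash. This is an illustration of the rule as described above, not the check shipped in the repository: the function name, its argument order, and the `1` override flag are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the F-DISTILL-V2-001-TEACHER-DIVERGENCE preflight.
# Function name, argument order, and the override flag are assumptions
# made for illustration; the real dispatch script may differ.

preflight_teacher_divergence() {
  local steps="$1" teacher="$2" student="$3" override="${4:-0}"

  # Short smoke runs are exempt: the rule only applies at STEPS >= 5000.
  if [ "$steps" -lt 5000 ]; then
    return 0
  fi

  # Long runs with teacher == student are rejected unless explicitly overridden.
  if [ "$teacher" = "$student" ] && [ "$override" != "1" ]; then
    echo "F-DISTILL-V2-001-TEACHER-DIVERGENCE: teacher == student ($teacher) at STEPS=$steps" >&2
    return 1
  fi
  return 0
}

# Example: the configuration of the discharged Stage D 50K run is rejected.
if preflight_teacher_divergence 50000 \
    "Qwen/Qwen2.5-Coder-0.5B-Instruct" "Qwen/Qwen2.5-Coder-0.5B-Instruct"; then
  echo "accepted"
else
  echo "rejected"
fi
```

Running the check before any model download or GPU allocation keeps the failure cheap: a misconfigured dispatch stops in milliseconds rather than after hours of compute.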

Why this matters

The Phase 4 Stage D 50K (25 h) and 10K (5 h) runs on 2026-05-20/21 silently inherited the Phase 3 smoke workaround of TEACHER_REPO == STUDENT_INIT == 0.5B. The KD signal was ~zero (the KL divergence between identical distributions is zero), so 30 hours of compute fine-tuned the base model toward gibberish on a synthetic-ish corpus. The §86 amendment makes that mistake hard to repeat.
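To spell out why the signal vanishes, write $p_T$ and $p_S$ for the teacher and student next-token distributions over the vocabulary $V$. The notation, the forward-KL direction, and temperature 1 are assumptions for illustration; the PR does not state the exact loss form.

$$
D_{\mathrm{KL}}(p_T \,\|\, p_S) = \sum_{v \in V} p_T(v)\,\log\frac{p_T(v)}{p_S(v)},
\qquad
\frac{\partial D_{\mathrm{KL}}}{\partial z_v} = p_S(v) - p_T(v)
$$

Here $z$ denotes the student logits. When $p_T = p_S$, every log term is $\log 1 = 0$ and every gradient component is $0$, so a student initialized from the same checkpoint as its teacher receives no distillation gradient at the start of training.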

Test plan

  • `bash -n scripts/dispatch-distill-phase-3-gx10.sh` — syntax-ok
  • Spec markdown renders cleanly
  • CI: `ci / gate` + `workspace-test` green
  • Operator: when ready, re-dispatch Stage D with the new defaults — `STEPS=50000 ./scripts/dispatch-distill-phase-3-gx10.sh`. Compute estimate ~50 h on GB10 (slower than the previous 0.5B-teacher run because realizar's 7B forward is heavier; this is acceptable given the falsifier-quality gain).
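How the new default and the smoke override interact can be sketched with the standard `${VAR:-default}` expansion. The helper name `resolve_teacher_repo` is hypothetical, and the real script may set the default differently.

```shell
#!/usr/bin/env bash
# Sketch of the env-var default pattern; resolve_teacher_repo is a
# hypothetical helper, not a function from the dispatch script.

resolve_teacher_repo() {
  # ${VAR:-default} keeps a caller-supplied value and falls back to the
  # 7B teacher when TEACHER_REPO is unset or empty.
  echo "${TEACHER_REPO:-paiml/qwen2.5-coder-7b-apache-q4k-v1}"
}

# Full run: no override, so the 7B teacher is selected.
( unset TEACHER_REPO; resolve_teacher_repo )

# Smoke run: the caller overrides the teacher explicitly.
( TEACHER_REPO="Qwen/Qwen2.5-Coder-0.5B-Instruct"; resolve_teacher_repo )
```

The first call prints the 7B repo and the second prints the 0.5B smoke fallback, so a smoke-only caller must now opt in to the small teacher rather than getting it silently.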

🤖 Generated with Claude Code

…(PMAT-701 follow-up)

The Phase 4 Stage D 50K + 10K runs (2026-05-20/21) silently inherited the
Phase 3 smoke workaround of TEACHER_REPO == STUDENT_INIT == 0.5B. Result: no
KD signal, 30 h of compute that fine-tuned the base model toward gibberish on
a small corpus. Documented in `evidence/distill-7b-teacher-loadtest-gx10/findings.json`
+ this spec amendment.

Now that PMAT-701 Bug A (PR #1863) and Bug B (PR #1869) have landed, the 7B
Q4K teacher is feasible on Grace Blackwell GB10:

* PR #1863: trueno-gpu allocator autodetects unified-memory devices (Grace,
  Tegra) and routes to cuMemAllocManaged so the full 128 GB pool is reachable.
* PR #1869: new RealizarQ4KTeacher keeps Q4K teacher weights quantized on
  the GPU (no F32 dequant at upload), eliminating the OOM kill that
  terminated the first training step.

This PR flips the dispatch script's default and codifies the why in spec §86:

* `scripts/dispatch-distill-phase-3-gx10.sh` — TEACHER_REPO default changes
  from `Qwen/Qwen2.5-Coder-0.5B-Instruct` (smoke fallback) to
  `paiml/qwen2.5-coder-7b-apache-q4k-v1` (the MODEL-1 teacher the spec was
  designed around). Smoke-only callers override with the env var.
* `docs/specifications/aprender-train/distillation-epic-spec.md` — adds §86
  documenting the 5-whys, the fix references, and a new falsifier
  F-DISTILL-V2-001-TEACHER-DIVERGENCE that rejects future Phase-4-class
  dispatches where teacher == student unless an explicit override is set.
* Spec version bumped to 1.2.0 with changelog entry.

The §86 amendment also notes that the existing 50K + 10K Stage D runs do NOT
count toward AC-DISTILL-003 — they're discharged as no-KD baselines, and a
re-dispatched 50K run with the 7B teacher is required for a real Phase 4
verdict.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 97d9b80 into main May 22, 2026
10 checks passed
@noahgift noahgift deleted the chore/distill-phase4-7b-teacher-default branch May 22, 2026 14:06
noahgift added a commit that referenced this pull request May 22, 2026
…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of
the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a
wrong turn — the realizar `_cuda` forward path is CPU-bound and
unusable as a distillation teacher on Grace Blackwell GB10. The 7B
vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU
at 0% utilization — empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer
  SIGKILL on the explicit-managed path), with file/line citations
  pointing to the CPU-heavy ops in
  crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: conflated two failures, missed the cheap dispatch-flip
  experiment that would have rejected Bug B's hypothesis in 5 minutes.
* Fix references: PR #1879 (PMAT-704) — cuBLAS default,
  RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k
  opt-in fallback.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`,
  `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877,
  #1879.

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86
(via PR #1871, also pending merge) and §87 (this PR). The amendment
notes the §86 cross-reference and explains the order-of-operations
in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>