feat(distill): StudentLogitsProvider trait + FixtureStudent (SPEC-DISTILL-001 Phase 2b) by noahgift · Pull Request #1791 · paiml/aprender

noahgift · 2026-05-18T12:20:33Z

Summary

Mirrors Phase 1's TeacherLogitsProvider for the student side. Besides vocab_size, the student trait has two working methods: logits_for_batch (forward) and apply_kd_gradient (backward + optimizer step, seeded by Phase 2a's kd_logit_gradient). FixtureStudent implements both for CPU-only unit testing; Phase 2c will add a CudaStudentProvider that wraps CudaTransformerTrainer.

Stacks on top of #1788 (Phase 2a kd_step).

What this PR adds

pub trait StudentLogitsProvider {
    fn vocab_size(&self) -> usize;
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>])
        -> Result<Vec<Vec<f32>>>;
    fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>])
        -> Result<()>;
}

pub struct FixtureStudent { /* in-memory logits + LR */ }

FixtureStudent::apply_kd_gradient averages the gradient across batch elements (canonical SGD batch averaging) and subtracts the scaled gradient from its internal logits buffer. Not a real model — it's a logit-space optimization fixture that validates the KD pipeline's gradient direction without needing CUDA.
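The batch-averaged update described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `SketchStudent`, its `String` error type, and the choice to treat an empty gradient batch as a no-op are all assumptions made for the example.

```rust
/// Hypothetical stand-in for the fixture: one logits vector acts as the
/// student's only "parameters". Not the PR's FixtureStudent.
pub struct SketchStudent {
    logits: Vec<f32>,
    learning_rate: f32,
}

impl SketchStudent {
    pub fn new(vocab_size: usize, learning_rate: f32) -> Self {
        Self { logits: vec![0.0; vocab_size], learning_rate }
    }

    pub fn logits(&self) -> &[f32] {
        &self.logits
    }

    /// Averages the gradient rows over the batch, then takes one SGD step
    /// on the internal logits buffer.
    pub fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> Result<(), String> {
        // Assumption: an empty batch leaves the student unchanged.
        if gradient.is_empty() {
            return Ok(());
        }
        // Validate every row before mutating anything.
        let vocab = self.logits.len();
        if let Some(row) = gradient.iter().find(|row| row.len() != vocab) {
            return Err(format!("gradient row has {} entries, expected {}", row.len(), vocab));
        }
        // lr / batch_size folds the batch average into the step size.
        let scale = self.learning_rate / gradient.len() as f32;
        for row in gradient {
            for (logit, g) in self.logits.iter_mut().zip(row) {
                *logit -= scale * g;
            }
        }
        Ok(())
    }
}
```

With a learning rate of 0.5 and two gradient rows `[1, 0]` and `[3, 0]`, the averaged gradient is `[2, 0]`, so the first logit moves from 0 to -1 and the second stays at 0.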

Falsifiers pinned

ID	Statement	Status
F-DISTILL-STUDENT-001	One KD step moves student logits toward teacher's preferred token	✓
F-DISTILL-STUDENT-002	10 sequential KD steps decrease loss to < 90% of initial	✓

Plus 5 sanity tests: vocab_size reporting, batch broadcast, shape validation, in-place update, batch averaging math.

All 57 aprender-train-distill lib tests pass (was 50 — 7 new).
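To make the two falsifiers concrete, here is a rough sketch of the properties they pin, written as a standalone logit-space loop rather than against the crate's types. The helper names (`softmax`, `kl_loss`, `run_kd_steps`) are hypothetical, and the loop covers only the alpha = 0 (pure KL) case.

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| ((x - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// Pure-KL distillation loss (alpha = 0): T^2 * KL(softmax(t/T) || softmax(s/T)).
fn kl_loss(student: &[f32], teacher: &[f32], temperature: f32) -> f32 {
    let p = softmax(student, temperature);
    let q = softmax(teacher, temperature);
    let kl: f32 = q
        .iter()
        .zip(&p)
        .map(|(&qi, &pi)| if qi > 0.0 { qi * (qi / pi).ln() } else { 0.0 })
        .sum();
    temperature * temperature * kl
}

/// Runs `steps` SGD updates directly on the student logits.
/// Returns the loss before the first step followed by the loss after each step.
pub fn run_kd_steps(
    student: &mut [f32],
    teacher: &[f32],
    temperature: f32,
    learning_rate: f32,
    steps: usize,
) -> Vec<f32> {
    let mut losses = vec![kl_loss(student, teacher, temperature)];
    for _ in 0..steps {
        let p = softmax(student, temperature);
        let q = softmax(teacher, temperature);
        for i in 0..student.len() {
            // Gradient of the T^2-scaled KL term: T * (softmax(s/T) - softmax(t/T)).
            student[i] -= learning_rate * temperature * (p[i] - q[i]);
        }
        losses.push(kl_loss(student, teacher, temperature));
    }
    losses
}
```

Starting from uniform student logits with a teacher that prefers token 5, one step should raise the student's logit at index 5 (the STUDENT-001 property), and ten steps at LR = 0.5 should bring the final loss well below 90% of the initial loss (the STUDENT-002 property).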

Architecture

┌─────────────────┐  logits_for_batch  ┌─────────────────┐
│ TeacherProvider │ ─────────────────► │   kd_step       │
│  (Phase 1/1b)   │                    │  (Phase 2a)     │
└─────────────────┘                    │                 │
                                       │ kd_logit_grad   │
┌─────────────────┐  logits_for_batch  │                 │
│ StudentProvider │ ◄─────────────────►│                 │
│  (THIS PR)      │  apply_kd_gradient │                 │
└─────────────────┘                    └─────────────────┘

After all three abstractions are in main, Phase 2c implements CudaStudentProvider (wraps CudaTransformerTrainer) and the pipeline integrates all three. With Phase 2c landed, end-to-end GPU distillation is unblocked (Phase 3 — 500-step E2E smoke).
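The data flow in the diagram might look roughly like the sketch below. Pipeline integration is Phase 2c and is not part of this PR, so everything beyond the two trait shapes is an assumption: `KdResult` (a `String`-error alias standing in for the crate's `Result`), the `ConstTeacher` and `SgdStudent` fixtures, and `distill_step` with its caller-supplied `gradient_fn` are hypothetical names for illustration only.

```rust
// Assumption: a String-error alias stands in for the crate's Result type.
type KdResult<T> = std::result::Result<T, String>;

pub trait TeacherLogitsProvider {
    fn vocab_size(&self) -> usize;
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> KdResult<Vec<Vec<f32>>>;
}

pub trait StudentLogitsProvider {
    fn vocab_size(&self) -> usize;
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> KdResult<Vec<Vec<f32>>>;
    fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> KdResult<()>;
}

/// Hypothetical fixture: returns the same logits row for every batch element.
pub struct ConstTeacher(pub Vec<f32>);

impl TeacherLogitsProvider for ConstTeacher {
    fn vocab_size(&self) -> usize {
        self.0.len()
    }
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> KdResult<Vec<Vec<f32>>> {
        Ok(vec![self.0.clone(); input_ids.len()])
    }
}

/// Hypothetical fixture: batch-averaged SGD directly on a logits buffer.
pub struct SgdStudent {
    pub logits: Vec<f32>,
    pub learning_rate: f32,
}

impl StudentLogitsProvider for SgdStudent {
    fn vocab_size(&self) -> usize {
        self.logits.len()
    }
    fn logits_for_batch(&mut self, input_ids: &[Vec<u32>]) -> KdResult<Vec<Vec<f32>>> {
        Ok(vec![self.logits.clone(); input_ids.len()])
    }
    fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>]) -> KdResult<()> {
        if gradient.is_empty() {
            return Ok(());
        }
        if gradient.iter().any(|row| row.len() != self.logits.len()) {
            return Err("gradient width != vocab_size".to_string());
        }
        let scale = self.learning_rate / gradient.len() as f32;
        for row in gradient {
            for (logit, g) in self.logits.iter_mut().zip(row) {
                *logit -= scale * g;
            }
        }
        Ok(())
    }
}

/// One teacher -> gradient -> student round trip for a single batch.
pub fn distill_step<G>(
    teacher: &mut dyn TeacherLogitsProvider,
    student: &mut dyn StudentLogitsProvider,
    input_ids: &[Vec<u32>],
    gradient_fn: G,
) -> KdResult<()>
where
    G: Fn(&[f32], &[f32]) -> Vec<f32>,
{
    if teacher.vocab_size() != student.vocab_size() {
        return Err(format!(
            "vocab mismatch: teacher {} vs student {}",
            teacher.vocab_size(),
            student.vocab_size()
        ));
    }
    let teacher_logits = teacher.logits_for_batch(input_ids)?;
    let student_logits = student.logits_for_batch(input_ids)?;
    // One gradient row per batch element, computed from (student, teacher) logits.
    let gradient: Vec<Vec<f32>> = student_logits
        .iter()
        .zip(&teacher_logits)
        .map(|(s, t)| gradient_fn(s.as_slice(), t.as_slice()))
        .collect();
    student.apply_kd_gradient(&gradient)
}
```

Passing the gradient as a closure keeps the sketch independent of the KD math; in the real pipeline that role would presumably be filled by Phase 2a's kd_step or kd_logit_gradient.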

Test plan

7 new student_provider tests pass (2 falsifiers + 5 sanity)
All 57 aprender-train-distill lib tests pass
cargo check -p aprender-train-distill clean
Phase 2c (PMAT-696): CudaStudentProvider + Pipeline integration

🤖 Generated with Claude Code

…L-001 Phase 1b, PMAT-693)

Adds the real teacher backend the SPEC-DISTILL-001 Phase 1b ticket scopes: CudaTrainerTeacher wraps entrenar's CudaTransformerTrainer in inference-only mode, delegates logits_for_batch to forward_logits() per batch element, returns shape [batch, vocab_size].

Gated behind a new `cuda` feature on aprender-train-distill that propagates to entrenar/cuda. Without the feature, only FixtureTeacher (Phase 1) is available: sufficient for unit tests but not for real training. Real distillation runs (Phase 4) require --features cuda.

Surface
=======

    #[cfg(feature = "cuda")]
    pub struct CudaTrainerTeacher { /* wraps CudaTransformerTrainer */ }

    impl CudaTrainerTeacher {
        pub fn for_inference(
            checkpoint_dir: impl AsRef<Path>,
            model_config: TransformerConfig,
        ) -> Result<Self> { ... }
    }

    impl TeacherLogitsProvider for CudaTrainerTeacher {
        fn vocab_size(&self) -> usize { ... }
        fn logits_for_batch(&mut self, input_ids: &[Vec<u32>])
            -> Result<Vec<Vec<f32>>> { ... }
    }

Defensive checks
================

- forward_logits returning None → EntrenarError::Internal with a clear "likely missing weights or CUDA init failure" message
- logits.len() != vocab_size → EntrenarError::Internal flagging TransformerConfig vs checkpoint vocab drift (the common silent failure mode for loaded-from-disk distillation runs)

Tests
=====

All 6 teacher_provider tests pass under both --features (none) and --features cuda. Compile gates verified:

    cargo check -p aprender-train-distill                  # clean
    cargo check -p aprender-train-distill --features cuda  # clean

What's next
===========

Phase 2 (PMAT-694, follow-up): wire CudaTransformerTrainer's KD-loss backward into the student path, replacing the remaining build_synthetic_logits call site for the student in pipeline.rs::train().

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
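The two defensive checks named in the Phase 1b commit message could be sketched like this. `SketchError` and `validate_logits` are hypothetical; the real code reports `EntrenarError::Internal`, whose exact shape is not shown in this PR, so a local enum stands in for it.

```rust
/// Hypothetical stand-in for EntrenarError::Internal.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    Internal(String),
}

/// Validates the output of one forward pass against the configured vocab size.
pub fn validate_logits(
    logits: Option<Vec<f32>>,
    vocab_size: usize,
) -> Result<Vec<f32>, SketchError> {
    // Check 1: the forward pass produced nothing at all.
    let logits = logits.ok_or_else(|| {
        SketchError::Internal(
            "forward_logits returned None: likely missing weights or CUDA init failure"
                .to_string(),
        )
    })?;
    // Check 2: the config and the loaded checkpoint disagree on vocab size.
    if logits.len() != vocab_size {
        return Err(SketchError::Internal(format!(
            "logits length {} != vocab_size {}: TransformerConfig vs checkpoint vocab drift",
            logits.len(),
            vocab_size
        )));
    }
    Ok(logits)
}
```

Failing loudly on the length mismatch matters because a wrong-width logits row would otherwise flow into the KD loss and only surface later as a shape error or a silently wrong gradient.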

…ent (SPEC-DISTILL-001 Phase 2)

Wires Phase 1's teacher provider into a per-batch KD orchestration step that produces both the combined α·CE + (1-α)·T²·KL scalar loss (for telemetry) and the KD-aware logit-space gradient (Phase 2b plug point).

What this PR adds
=================

New module `aprender-train-distill::kd_step`:

    pub fn kd_loss(
        student_logits: &[f32],
        teacher_logits: &[f32],
        label: usize,
        temperature: f32,
        alpha: f32,
    ) -> f32;

    pub fn kd_logit_gradient(
        student_logits: &[f32],
        teacher_logits: &[f32],
        label: usize,
        temperature: f32,
        alpha: f32,
    ) -> Vec<f32>;

    pub fn kd_step<F: FnMut(&[u32]) -> Vec<f32>>(
        teacher: &mut dyn TeacherLogitsProvider,
        input_ids: &[Vec<u32>],
        labels: &[usize],
        temperature: f32,
        alpha: f32,
        compute_student_logits: F,
    ) -> Result<(f32, Vec<Vec<f32>>)>;

The gradient is the Hinton et al. 2015 §2 derivation:

    ∂L/∂s = α · (softmax(s) - one_hot(label))
          + (1-α) · T · (softmax(s/T) - softmax(t/T))

(T factor, not T²: one T factor is absorbed by the softmax derivative chain rule.)

Scope: Phase 2 vs Phase 2b
==========================

Phase 2 (this PR) ships the orchestration math, all in pure Rust on the CPU. The output `Vec<Vec<f32>>` of per-batch gradients is what Phase 2b will plumb into `CudaTransformerTrainer.forward_backward_kd_batch` as the backward-pass seed (replacing the CE-only gradient currently used by forward_backward_batch).

Splitting Phase 2 into 2a/2b lets us land the orchestration layer + its tests now, separate from extending the complex GPU trainer code path.

Falsifiers pinned
=================

3 KD-step falsifiers + 6 sanity tests, all passing:

- F-DISTILL-KDSTEP-001 (alpha=1 → pure CE)
- F-DISTILL-KDSTEP-002 (student==teacher → zero KL gradient under alpha=0)
- F-DISTILL-KDSTEP-003 (loss monotone in student-teacher divergence)
- softmax unit-sum + non-negative
- CE gradient correct sign at label vs non-label positions
- kd_step orchestration end-to-end
- kd_step empty-batch sanity
- kd_step vocab-mismatch error path
- kd_loss alpha=1 collapses to pure CE

All 50 aprender-train-distill lib tests pass (was 41, 9 new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
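The gradient formula quoted in the Phase 2 commit message translates fairly directly into code. The sketch below follows that formula term by term; `kd_logit_gradient_sketch` and `softmax` are hypothetical names, and the panic on mismatched lengths is an assumption (the real kd_step reports a vocab mismatch as an error instead).

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| ((x - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// dL/ds = alpha * (softmax(s) - one_hot(label))
///       + (1 - alpha) * T * (softmax(s/T) - softmax(t/T))
pub fn kd_logit_gradient_sketch(
    student_logits: &[f32],
    teacher_logits: &[f32],
    label: usize,
    temperature: f32,
    alpha: f32,
) -> Vec<f32> {
    // Assumption: mismatched lengths are a programming error in this sketch.
    assert_eq!(student_logits.len(), teacher_logits.len(), "vocab size mismatch");
    let p_hard = softmax(student_logits, 1.0); // CE term uses T = 1
    let p_soft = softmax(student_logits, temperature);
    let q_soft = softmax(teacher_logits, temperature);
    (0..student_logits.len())
        .map(|i| {
            let one_hot = if i == label { 1.0 } else { 0.0 };
            let ce_term = p_hard[i] - one_hot;
            let kl_term = temperature * (p_soft[i] - q_soft[i]);
            alpha * ce_term + (1.0 - alpha) * kl_term
        })
        .collect()
}
```

Two properties fall out of the formula and line up with the listed falsifiers: with alpha = 0 and identical student and teacher logits the gradient is all zeros, and with alpha = 1 the gradient is negative at the label index and positive elsewhere. Each term is a difference of two probability vectors, so the components also sum to roughly zero.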

…TILL-001 Phase 2b)

Mirrors Phase 1's TeacherLogitsProvider for the student side. The student has two methods: logits_for_batch (forward) and apply_kd_gradient (backward + optimizer step). FixtureStudent implements both for CPU-only unit testing; Phase 2c will add a CudaStudentProvider that wraps CudaTransformerTrainer.

What this PR adds
=================

    pub trait StudentLogitsProvider {
        fn vocab_size(&self) -> usize;
        fn logits_for_batch(&mut self, input_ids: &[Vec<u32>])
            -> Result<Vec<Vec<f32>>>;
        fn apply_kd_gradient(&mut self, gradient: &[Vec<f32>])
            -> Result<()>;
    }

    pub struct FixtureStudent {
        vocab_size: usize,
        logits: Vec<f32>,     // current student parameters
        learning_rate: f32,
    }

FixtureStudent's apply_kd_gradient averages the gradient across batch elements (canonical SGD batch averaging) and subtracts the scaled gradient from its internal logits buffer. This isn't a real model: it's a logit-space optimization fixture that lets us validate that the KD pipeline's gradient direction is correct without needing CUDA.

Falsifiers pinned
=================

7 student_provider tests (2 falsifiers + 5 sanity), all passing:

- F-DISTILL-STUDENT-001: one KD step moves student logits toward teacher's preferred token. Setup: uniform student, teacher prefers token 5, alpha=0 (pure KL signal). After one step, student logit at index 5 must be strictly greater than before.
- F-DISTILL-STUDENT-002: 10 sequential KD steps strictly decrease per-step KD loss. With LR=0.5, loss after 10 steps < 90% of initial. Validates the gradient direction is correct (descent, not ascent).

The 5 sanity tests cover vocab_size reporting, batch broadcast, shape validation, in-place logit update, and batch averaging math.

Architecture
============

Stacks on top of #1788 (kd_step). Pipeline integration that combines TeacherLogitsProvider, StudentLogitsProvider, and kd_step is Phase 2c.

Phase 2c (PMAT-696, follow-up): CudaStudentProvider that wraps CudaTransformerTrainer for production runs. Once it lands, end-to-end GPU distillation is unblocked.

Tests
=====

All 57 aprender-train-distill lib tests pass (was 50, 7 new).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-18T16:40:14Z

Subsumed by #1797 squash-merge (chain-PR leapfrog pattern per memory rule). All content landed on main at aee8716.

#1798)

Adds a chown step BEFORE the cargo step that runs `docker run --rm` as root and chowns the per-RUN target dir + cargo registry to noah:1000.

## Why

Docker's bind-mount creates missing host directories with the daemon's uid (root). Since #1693 switched to per-RUN target dirs (`/mnt/nvme-raid0/targets/aprender-ci/<PR>/run-<RUN_ID>`), every fresh run gets a root-owned target dir. Cargo (running as uid 1000 inside the container) cannot write to it and fails with:

    error: failed to create directory `/workspace/target/debug`: No such file or directory (os error 2)

The existing post-job chown (line 245) was meant to fix this for the NEXT run's git-clean, but per-RUN paths invalidate that since each run gets a brand-new root-owned dir. First-runs always fail.

This was observed across 6+ in-flight PRs (#1784, #1791-#1797) on 2026-05-18: every "infrastructure flake" turned out to be the same ownership bug at different cargo entry points.

## Fix

Pre-cargo chown step. Idempotent (`|| true`). Runs the existing sovereign-ci image as root for the chown, then exits; adds maybe 2s to runs.

Matches the pattern of the post-job chown step that already exists; just moves it to BEFORE cargo as well.

## Manual one-shot

The 6 currently-stuck PRs were unblocked by manually chowning their per-RUN dirs on the runner host:

    ssh intel sudo chown -R 1000:1000 \
        /mnt/nvme-raid0/targets/aprender-ci/{1792,1793,1794,1796,1797,main}/run-*

After this PR lands, future runs will fix themselves.

Co-authored-by: Noah Gift <claude@noahgift.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 18, 2026 12:20

noahgift mentioned this pull request May 18, 2026

feat(distill): pipeline integration teacher + student + kd_step end-to-end (SPEC-DISTILL-001 Phase 2c) #1792

Closed

5 tasks

noahgift and others added 3 commits May 18, 2026 14:41

noahgift force-pushed the feat/distill-phase-2b-kd-backward-pmat-695 branch from 8dcc09c to 3c28fcb Compare May 18, 2026 12:41

noahgift mentioned this pull request May 18, 2026

chore(distill): Phase 3 smoke-run dispatch + watch scripts for gx10 (SPEC-DISTILL-001) #1795

Closed

3 tasks

noahgift added 4 commits May 18, 2026 15:42

Merge branch 'main' into feat/distill-phase-2b-kd-backward-pmat-695

e8804a2

Merge branch 'main' into feat/distill-phase-2b-kd-backward-pmat-695

529c52d

Merge branch 'main' into feat/distill-phase-2b-kd-backward-pmat-695

e7d175e

Merge branch 'main' into feat/distill-phase-2b-kd-backward-pmat-695

0a36883

noahgift closed this May 18, 2026

auto-merge was automatically disabled May 18, 2026 16:40
Pull request was closed

noahgift deleted the feat/distill-phase-2b-kd-backward-pmat-695 branch May 18, 2026 16:40

noahgift wants to merge 7 commits into main from feat/distill-phase-2b-kd-backward-pmat-695
