docs(p3): apr-cli-distill-train-v1 — contract for missing apr distill train per §35.3 #1097
Merged
…till train implementation per §35.3

Triggering observation 2026-04-28 §35: live execution of `apr distill` on the canonical 7B teacher + §33 best student finished in ~45 seconds. Source at distill.rs:1464 confirmed that the Standard strategy is just `tensor_clone()`, with no gradient training. The 45-second wall time and the 192-byte output delta vs input were the falsification signal.

§34.5 had recommended distillation as the path past the val_loss=9.38 capacity ceiling. §35 found the in-tree implementation isn't ready. This contract is the §26.8 stack-tool-extension that unblocks §34.5.

## Contract structure

3 equations + 1 drift-prevention gate:
- kl_divergence_logit_loss with T*T scaling (Hinton et al. 2015)
- alpha_weighted_loss combining KL + CE
- precompute_train_stages 3-stage pipeline (per existing --stage skeleton)
- drift_prevention_output_must_be_trained (gates against the §35 stub behavior: output must differ from input by more than the Q4K tolerance)

9 falsification tests:
- FALSIFY-001: real training (not stub): student tensors differ post-train
- FALSIFY-002: KL loss decreases over epochs
- FALSIFY-003: temperature scaling preserves softmax ranking
- FALSIFY-004: alpha=1 reduces to pure KD
- FALSIFY-005: precompute is byte-deterministic
- FALSIFY-006: stage train resumes from precompute cache
- FALSIFY-007: pv validates this contract
- FALSIFY-008: 3-surface drift (cli + registry + test)
- FALSIFY-009: end-to-end smoke beats from-scratch baseline

`pv validate` exits 0 (verified live).

## Implementation cost

~600-1200 LOC + 9 tests:
1. distill.rs:1464: replace tensor_clone() with a real KD pipeline
2. Stage `precompute`: forward teacher over corpus, save logits per-token
3. Stage `train`: load student + cached teacher logits, KL+CE loss, backprop via aprender-train autograd, optimizer step, checkpoint
4. Per-epoch metadata: train_loss, val_loss, kl_loss, ce_loss, total_loss
5. CI fixture: small teacher (qwen2.5-0.5b) + 50M student on 1M tokens

Once shipped, this runs §34.5's 7B→370M distillation in ~2-4h GPU time and is expected to push MODEL-2 below the §34 9.38 capacity ceiling toward the spec target of val_loss=3.0.

Status: PROPOSED. Implementation deferred to multi-day Rust task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
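For reviewers who want the first two contract equations in concrete form, here is a minimal Rust sketch. It assumes the usual Hinton-style formulation: KL(teacher || student) over temperature-softened softmaxes, scaled by T², then blended with hard-label cross-entropy as `alpha * KL + (1 - alpha) * CE`, which matches FALSIFY-004 (alpha=1 reduces to pure KD). The function signatures and the helper `softmax_with_temperature` are hypothetical; they borrow the contract's equation names but are not taken from distill.rs or the contract YAML.

```rust
/// Numerically stable softmax of `logits / temperature`.
/// Hypothetical helper, not an existing aprender API.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    let scaled: Vec<f32> = logits.iter().map(|&z| z / temperature).collect();
    // Subtract the max before exponentiating to avoid overflow.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&z| (z - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// T^2 * KL(teacher || student) over temperature-softened distributions.
/// Assumed reading of `kl_divergence_logit_loss` for one token position.
fn kl_divergence_logit_loss(
    teacher_logits: &[f32],
    student_logits: &[f32],
    temperature: f32,
) -> f32 {
    assert_eq!(teacher_logits.len(), student_logits.len(), "vocab size mismatch");
    let p = softmax_with_temperature(teacher_logits, temperature);
    let q = softmax_with_temperature(student_logits, temperature);
    let kl: f32 = p
        .iter()
        .zip(q.iter())
        .filter(|(pi, _)| **pi > 0.0) // terms with p = 0 contribute nothing
        .map(|(&pi, &qi)| pi * (pi.ln() - qi.max(f32::MIN_POSITIVE).ln()))
        .sum();
    temperature * temperature * kl
}

/// Hard-label cross-entropy of the student at temperature 1.
fn cross_entropy_loss(student_logits: &[f32], target: usize) -> f32 {
    let q = softmax_with_temperature(student_logits, 1.0);
    -q[target].max(f32::MIN_POSITIVE).ln()
}

/// Assumed reading of `alpha_weighted_loss`: alpha weights the KD term.
fn alpha_weighted_loss(kl_loss: f32, ce_loss: f32, alpha: f32) -> f32 {
    assert!((0.0..=1.0).contains(&alpha), "alpha must be in [0, 1]");
    alpha * kl_loss + (1.0 - alpha) * ce_loss
}
```

With this shape, identical teacher and student logits give a KL term of zero, and dividing logits by any positive temperature leaves their ordering unchanged, which is the property FALSIFY-003 checks. A real implementation would average over tokens and run on the autograd tensors rather than plain slices.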
noahgift added two commits that referenced this pull request on Apr 28, 2026. They share the message below; the second one's title ends with (#1098).

… — v2.80 → v2.81

Landmark section in plain prose for readers who don't want to chase the §15→§35 hypothesis chain. Each model is blocked by a single concrete problem.

MODEL-1: numerical bug at layer 3 of FFN, an 18× std anomaly vs the GGUF reference. Three theories were tested and refuted today (matmul kernel via §30, qkv_bias via §32, layer-3 weight bytes via #1082 byte-compare). The actual bug is cumulative F32 precision drift through residuals. Fix path: with PR #1082 merged + PR #1083 in flight, run `apr trace --payload` on the canonical 7B teacher in both formats and bisect layer-by-layer.

MODEL-2: trained end-to-end today. val_loss=9.38 (spec target 3.0). The 370M from-scratch model has converged: 4x more steps yielded the same outcome (§34). Capacity is the binding constraint, not corpus or compute. Path forward: distillation from the shipped MODEL-1 7B teacher. `apr distill` is currently a stub (§35); the contract is authored as #1097, and the implementation is a multi-day Rust task.

Both blockers are fixable with code, not training time:
- MODEL-1: bisect with new sub-FFN telemetry, then fix at root
- MODEL-2: implement `apr distill --stage train`, then run 2-4h distillation

Today's session: 11 PRs landed (6 spec amendments + 4 contracts + 1 impl + 2 SHIP-007 sub-FFN telemetry PRs) plus full P1.0→P2 pipeline executed end-to-end with zero muda.

Header v2.80.0 → v2.81.0. No coverage flip; landmark only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
## Summary

Per §35.3 and the §26.8 stack-tool-extension methodology, this contract defines the missing real-training behavior in `apr distill`.

§35 discovered that the `apr distill` Standard strategy is currently a stub (`distill.rs:1464` just does `tensor_clone()`, no gradient training). §34.5 recommended distillation as the path past the val_loss=9.38 capacity ceiling. This contract unblocks §34.5.

## Contract structure

3 equations + 1 drift-prevention gate:

- `kl_divergence_logit_loss` with T² scaling (Hinton et al. 2015)
- `alpha_weighted_loss` combining KL + CE
- `precompute_train_stages` 3-stage pipeline
- `drift_prevention_output_must_be_trained`: gate against §35 stub behavior

9 falsification tests (FALSIFY-001 through FALSIFY-009, itemized in the commit message above).

`pv validate contracts/apr-cli-distill-train-v1.yaml` exits 0 (verified live).

## Implementation cost

~600-1200 LOC + 9 tests, multi-day Rust task. Once shipped, enables §34.5's 7B→370M distillation in ~2-4h GPU time.
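As an illustration of what the drift-prevention gate is meant to catch, the sketch below compares student tensors before and after a run and rejects the output when nothing moved by more than a tolerance, which is exactly what a `tensor_clone()` stub would produce. The function signatures, the flat `(&str, Vec<f32>)` tensor representation, the tolerance value, and the rule that one changed tensor is enough to pass are all assumptions for illustration; the contract's actual Q4K tolerance and pass criterion are defined in the YAML, not here.

```rust
/// Largest absolute element-wise difference between two equal-length tensors.
/// Hypothetical helper for illustration only.
fn max_abs_delta(before: &[f32], after: &[f32]) -> f32 {
    assert_eq!(before.len(), after.len(), "tensor lengths must match");
    before
        .iter()
        .zip(after.iter())
        .map(|(&b, &a)| (a - b).abs())
        .fold(0.0_f32, f32::max)
}

/// Returns the largest observed delta if any tensor changed by more than
/// `tolerance`, and an error if the output is indistinguishable from the input.
/// Assumed interpretation of `drift_prevention_output_must_be_trained`.
fn drift_prevention_output_must_be_trained(
    before: &[(&str, Vec<f32>)],
    after: &[(&str, Vec<f32>)],
    tolerance: f32,
) -> Result<f32, String> {
    if before.len() != after.len() {
        return Err(format!(
            "tensor count changed: {} vs {}",
            before.len(),
            after.len()
        ));
    }
    let mut largest = 0.0_f32;
    for ((name_b, data_b), (name_a, data_a)) in before.iter().zip(after.iter()) {
        if name_b != name_a || data_b.len() != data_a.len() {
            return Err(format!("tensor mismatch: {name_b} vs {name_a}"));
        }
        largest = largest.max(max_abs_delta(data_b, data_a));
    }
    if largest > tolerance {
        Ok(largest)
    } else {
        Err(format!(
            "output looks untrained: max delta {largest} <= tolerance {tolerance}"
        ))
    }
}
```

Calling the gate with a cloned tensor set returns `Err`, while a set in which any weight moved past the tolerance returns `Ok` with the largest delta. A production check would presumably operate on the dequantized checkpoint tensors rather than in-memory vectors.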
## Test plan

- `pv validate contracts/apr-cli-distill-train-v1.yaml` exits 0

🤖 Generated with Claude Code