docs(p3): apr-cli-distill-train-v1 — contract for missing apr distill train per §35.3 by noahgift · Pull Request #1097 · paiml/aprender

noahgift · 2026-04-28T05:11:42Z

Summary

Per §35.3 + §26.8 stack-tool-extension methodology, this contract defines the missing real-training behavior in apr distill.

§35 discovered that apr distill Standard strategy is currently a stub (distill.rs:1464 just does tensor_clone(), no gradient training). §34.5 recommended distillation as the path past the val_loss=9.38 capacity ceiling. This contract unblocks §34.5.

Contract structure

3 equations + 1 drift-prevention gate:

kl_divergence_logit_loss with T² scaling (Hinton et al. 2015)
alpha_weighted_loss combining KL + CE
precompute_train_stages 3-stage pipeline
drift_prevention_output_must_be_trained — gate against §35 stub behavior
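
For orientation, the first two equations can be written out in the standard Hinton et al. (2015) form. This is a sketch, not a copy of the contract YAML, which is not reproduced here. The symbols $z^{\text{teacher}}$, $z^{\text{student}}$, $p$, $q$ are illustrative, and the direction of the $\alpha$ weighting is an assumption inferred from FALSIFY-004 (alpha=1 reduces to pure KD).

```latex
\begin{aligned}
p_i &= \frac{\exp(z^{\text{teacher}}_i / T)}{\sum_j \exp(z^{\text{teacher}}_j / T)}, \qquad
q_i = \frac{\exp(z^{\text{student}}_i / T)}{\sum_j \exp(z^{\text{student}}_j / T)} \\
\mathcal{L}_{\text{KL}} &= T^2 \sum_i p_i \log \frac{p_i}{q_i} \\
\mathcal{L} &= \alpha \, \mathcal{L}_{\text{KL}} + (1 - \alpha) \, \mathcal{L}_{\text{CE}}
\end{aligned}
```

The $T^2$ factor compensates for soft-target gradients shrinking as $1/T^2$, so the KL term keeps a comparable weight against the CE term when the temperature changes.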

9 falsification tests:

FALSIFY-001: real training (not stub) — tensors differ post-train
FALSIFY-002: KL loss decreases over epochs
FALSIFY-003: temperature scaling preserves softmax ranking
FALSIFY-004: alpha=1 reduces to pure KD
FALSIFY-005: precompute is byte-deterministic
FALSIFY-006: stage train resumes from precompute cache
FALSIFY-007: pv validates this contract
FALSIFY-008: 3-surface drift gate
FALSIFY-009: end-to-end smoke beats from-scratch baseline
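
FALSIFY-003 and FALSIFY-004 are properties of the loss itself, so they can be checked without any training loop. The following is a minimal std-only sketch, assuming the standard Hinton-style loss; the function names (`softmax_t`, `kd_loss`, `ce_loss`, `distill_loss`, `argsort_desc`) are hypothetical and are not taken from distill.rs or aprender-train.

```rust
/// Numerically stable softmax of `logits / t`.
fn softmax_t(logits: &[f64], t: f64) -> Vec<f64> {
    let scaled: Vec<f64> = logits.iter().map(|z| z / t).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|z| (z - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// T^2 * KL(teacher_T || student_T) over temperature-softened distributions.
fn kd_loss(teacher: &[f64], student: &[f64], t: f64) -> f64 {
    let p = softmax_t(teacher, t);
    let q = softmax_t(student, t);
    let kl: f64 = p
        .iter()
        .zip(&q)
        .filter(|(pi, _)| **pi > 0.0) // 0 * ln(0) is treated as 0
        .map(|(pi, qi)| pi * (pi / qi).ln())
        .sum();
    t * t * kl
}

/// Cross-entropy of the student (at T = 1) against a hard label.
fn ce_loss(student: &[f64], label: usize) -> f64 {
    -softmax_t(student, 1.0)[label].ln()
}

/// alpha * KD + (1 - alpha) * CE. The weighting direction is an assumption
/// chosen so that alpha = 1 gives pure KD, as FALSIFY-004 requires.
fn distill_loss(teacher: &[f64], student: &[f64], label: usize, t: f64, alpha: f64) -> f64 {
    alpha * kd_loss(teacher, student, t) + (1.0 - alpha) * ce_loss(student, label)
}

/// Indices sorted by descending value (used to compare rankings).
fn argsort_desc(xs: &[f64]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..xs.len()).collect();
    idx.sort_by(|&a, &b| xs[b].partial_cmp(&xs[a]).unwrap());
    idx
}

fn main() {
    let teacher = [2.0, 0.5, -1.0, 3.5];
    let student = [1.0, 0.0, 0.5, 2.0];

    // FALSIFY-003: dividing logits by T > 0 must not change the ranking.
    let base = argsort_desc(&softmax_t(&teacher, 1.0));
    for t in [0.5, 2.0, 4.0] {
        assert_eq!(argsort_desc(&softmax_t(&teacher, t)), base);
    }

    // FALSIFY-004: alpha = 1 reduces the combined loss to the KD term.
    let t = 2.0;
    let full = distill_loss(&teacher, &student, 3, t, 1.0);
    assert!((full - kd_loss(&teacher, &student, t)).abs() < 1e-12);

    // Sanity check: identical logits give zero KL.
    assert!(kd_loss(&teacher, &teacher, t).abs() < 1e-12);
    println!("loss property checks passed");
}
```

A real implementation would operate on batched tensors through the autograd stack; plain `f64` slices are used here only to keep the properties easy to inspect.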

pv validate contracts/apr-cli-distill-train-v1.yaml exits 0 (verified live).

Implementation cost

~600-1200 LOC + 9 tests, multi-day Rust task. Once shipped, enables §34.5's 7B→370M distillation in ~2-4h GPU time.

Test plan

pv validate contracts/apr-cli-distill-train-v1.yaml exits 0

🤖 Generated with Claude Code

…till train implementation per §35.3

Triggering observation 2026-04-28 §35: live execution of `apr distill` on the canonical 7B teacher + §33 best student finished in ~45 seconds. Source at distill.rs:1464 confirmed Standard strategy is just `tensor_clone()`, with no gradient training. The 45-second wall time + 192-byte output delta vs input was the falsification signal.

§34.5 had recommended distillation as the path past the val_loss=9.38 capacity ceiling. §35 found the in-tree implementation isn't ready. This contract is the §26.8 stack-tool-extension that unblocks §34.5.

## Contract structure

3 equations + 1 drift-prevention gate:
- kl_divergence_logit_loss with T*T scaling (Hinton et al. 2015)
- alpha_weighted_loss combining KL + CE
- precompute_train_stages 3-stage pipeline (per existing --stage skeleton)
- drift_prevention_output_must_be_trained (gates against the §35 stub behavior: output must differ from input by >Q4K tolerance)

9 falsification tests:
- FALSIFY-001: real training (not stub), student tensors differ post-train
- FALSIFY-002: KL loss decreases over epochs
- FALSIFY-003: temperature scaling preserves softmax ranking
- FALSIFY-004: alpha=1 reduces to pure KD
- FALSIFY-005: precompute is byte-deterministic
- FALSIFY-006: stage train resumes from precompute cache
- FALSIFY-007: pv validates this contract
- FALSIFY-008: 3-surface drift (cli + registry + test)
- FALSIFY-009: end-to-end smoke beats from-scratch baseline

`pv validate` exits 0 (verified live).

## Implementation cost

~600-1200 LOC + 9 tests:
1. distill.rs:1464: replace tensor_clone() with real KD pipeline
2. Stage `precompute`: forward teacher over corpus, save logits per-token
3. Stage `train`: load student + cached teacher logits, KL+CE loss, backprop via aprender-train autograd, optimizer step, checkpoint
4. Per-epoch metadata: train_loss, val_loss, kl_loss, ce_loss, total_loss
5. CI fixture: small teacher (qwen2.5-0.5b) + 50M student on 1M tokens

Once shipped, runs §34.5's 7B→370M distillation in ~2-4h GPU time and is expected to push MODEL-2 below the §34 9.38 capacity ceiling toward the spec target of val_loss=3.0.

Status: PROPOSED. Implementation deferred to multi-day Rust task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
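
The drift-prevention gate (output must differ from input by more than a tolerance) could be checked along the following lines. This is a hedged sketch: `max_abs_diff`, `output_was_trained`, and `TOL` are hypothetical names, the numeric value of the Q4K tolerance is not given here so `TOL` is a placeholder, and "at least one tensor moved" is an assumed reading of the gate.

```rust
/// Placeholder threshold. The contract ties the real value to Q4K tolerance,
/// which is not specified here, so this number is an assumption.
const TOL: f32 = 1e-3;

/// Largest element-wise absolute difference between two equally sized tensors.
fn max_abs_diff(before: &[f32], after: &[f32]) -> f32 {
    assert_eq!(before.len(), after.len(), "tensor shapes must match");
    before
        .iter()
        .zip(after)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0, f32::max)
}

/// Gate: true only if the tensor lists line up and at least one tensor
/// moved by more than `tol`. A pure clone of the input always fails.
fn output_was_trained(before: &[Vec<f32>], after: &[Vec<f32>], tol: f32) -> bool {
    before.len() == after.len()
        && before
            .iter()
            .zip(after)
            .any(|(b, a)| max_abs_diff(b, a) > tol)
}

fn main() {
    let student_in = vec![vec![0.10_f32, -0.20, 0.30], vec![1.00, 2.00]];

    // Stub behavior: the output is a clone of the input, so the gate rejects it.
    let stub_out = student_in.clone();
    assert!(!output_was_trained(&student_in, &stub_out, TOL));

    // Simulated training step: one weight moves well past the tolerance.
    let mut trained_out = student_in.clone();
    trained_out[1][0] += 0.05;
    assert!(output_was_trained(&student_in, &trained_out, TOL));

    println!("drift gate checks passed");
}
```

Comparing tensor values rather than file bytes matters here: the stub run described above still produced a 192-byte output delta, so a byte-level comparison alone would not catch it.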

… — v2.80 → v2.81

Landmark section in plain prose for readers who don't want to chase the §15→§35 hypothesis chain. Each model is blocked by a single concrete problem.

MODEL-1: numerical bug at layer 3 of FFN. 18× std anomaly vs GGUF reference. Three theories tested and refuted today (matmul kernel via §30, qkv_bias via §32, layer-3 weight bytes via #1082 byte-compare). Actual bug is cumulative F32 precision drift through residuals. Fix path: with PR #1082 merged + PR #1083 in flight, run apr trace --payload on canonical 7B teacher in both formats and bisect layer-by-layer.

MODEL-2: trained end-to-end today. val_loss=9.38 (spec target 3.0). 370M from-scratch has converged: 4x more steps yielded the same outcome (§34). Capacity is the binding constraint, not corpus or compute. Path forward: distillation from shipped MODEL-1 7B teacher. apr distill is currently a stub (§35); contract authored as #1097, impl is multi-day Rust task.

Both blockers are fixable with code, not training time:
- MODEL-1: bisect with new sub-FFN telemetry, then fix at root
- MODEL-2: implement apr distill --stage train, then run 2-4h distillation

Today's session: 11 PRs landed (6 spec amendments + 4 contracts + 1 impl + 2 SHIP-007 sub-FFN telemetry PRs) plus full P1.0→P2 pipeline executed end-to-end with zero muda.

Header v2.80.0 → v2.81.0. No coverage flip, landmark only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>


noahgift enabled auto-merge (squash) April 28, 2026 05:11

noahgift merged commit 3732fb5 into main Apr 28, 2026
11 checks passed

noahgift deleted the docs/apr-cli-distill-train-contract branch April 28, 2026 05:35

noahgift mentioned this pull request Apr 28, 2026

docs(ship-two-001): §36 — plain-language status of what's left to ship the two models #1098

Merged

2 tasks

noahgift merged 1 commit into main from docs/apr-cli-distill-train-contract
