
docs(p3): apr-cli-distill-train-v1 — contract for missing apr distill train per §35.3 #1097

Merged
noahgift merged 1 commit into main from docs/apr-cli-distill-train-contract
Apr 28, 2026
Conversation

@noahgift

Contributor

Summary

Per §35.3 + §26.8 stack-tool-extension methodology, this contract defines the missing real-training behavior in apr distill.

§35 discovered that the Standard strategy of apr distill is currently a stub: distill.rs:1464 only calls tensor_clone() and performs no gradient training. §34.5 recommended distillation as the path past the val_loss=9.38 capacity ceiling, so this contract unblocks §34.5.

Contract structure

3 equations + 1 drift-prevention gate:

  • kl_divergence_logit_loss with T² scaling (Hinton et al. 2015)
  • alpha_weighted_loss combining KL + CE
  • precompute_train_stages 3-stage pipeline
  • drift_prevention_output_must_be_trained — gate against §35 stub behavior
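
For reference, the first two equations are conventionally written as below. This is the standard formulation from the cited Hinton et al. (2015) paper, not a copy of the contract YAML. The symbols z_t (teacher logits), z_s (student logits), y (ground-truth token) and the convention that α weights the KD term (chosen to agree with FALSIFY-004) are assumptions.

```latex
\mathcal{L}_{\mathrm{KD}}
  = T^{2}\,\mathrm{KL}\!\left(
      \mathrm{softmax}(z_t / T)\;\big\|\;\mathrm{softmax}(z_s / T)
    \right)

\mathcal{L}
  = \alpha\,\mathcal{L}_{\mathrm{KD}}
  + (1 - \alpha)\,\mathcal{L}_{\mathrm{CE}}(z_s, y)
```

The T² factor compensates for the soft-target gradients shrinking as 1/T², so the KD and CE terms stay on a comparable scale when the temperature changes.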

9 falsification tests:

  • FALSIFY-001: real training (not stub) — tensors differ post-train
  • FALSIFY-002: KL loss decreases over epochs
  • FALSIFY-003: temperature scaling preserves softmax ranking
  • FALSIFY-004: alpha=1 reduces to pure KD
  • FALSIFY-005: precompute is byte-deterministic
  • FALSIFY-006: stage train resumes from precompute cache
  • FALSIFY-007: pv validates this contract
  • FALSIFY-008: 3-surface drift gate
  • FALSIFY-009: end-to-end smoke beats from-scratch baseline
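
A minimal sketch of how FALSIFY-003 and FALSIFY-004 could be checked on plain `f32` slices. Everything here is hypothetical: the function names (`softmax_t`, `kd_loss`, `ce_loss`, `combined_loss`, `ranking`) and the convention that alpha weights the KD term are assumptions for illustration, not the aprender API or the contract's wording.

```rust
/// Numerically stable softmax of `logits / temperature`.
fn softmax_t(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    let scaled: Vec<f32> = logits.iter().map(|z| z / temperature).collect();
    let max = scaled.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|z| (z - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// T^2 * KL(teacher || student) on temperature-softened distributions.
fn kd_loss(teacher: &[f32], student: &[f32], temperature: f32) -> f32 {
    let p = softmax_t(teacher, temperature);
    let q = softmax_t(student, temperature);
    let mut kl = 0.0f32;
    for (&pi, &qi) in p.iter().zip(q.iter()) {
        if pi > 0.0 {
            kl += pi * (pi.ln() - qi.ln());
        }
    }
    temperature * temperature * kl
}

/// Cross-entropy of the student against a hard target index (T = 1).
fn ce_loss(student: &[f32], target: usize) -> f32 {
    -softmax_t(student, 1.0)[target].ln()
}

/// Assumed convention: alpha weights the KD term, (1 - alpha) the CE term.
fn combined_loss(
    teacher: &[f32],
    student: &[f32],
    target: usize,
    temperature: f32,
    alpha: f32,
) -> f32 {
    alpha * kd_loss(teacher, student, temperature) + (1.0 - alpha) * ce_loss(student, target)
}

/// Indices sorted by descending value; used to compare rankings.
fn ranking(values: &[f32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..values.len()).collect();
    idx.sort_by(|&a, &b| values[b].total_cmp(&values[a]));
    idx
}
```

Under these assumptions, FALSIFY-003 amounts to comparing `ranking(&softmax_t(z, 1.0))` with `ranking(&softmax_t(z, t))` for any positive `t`, and FALSIFY-004 to checking that `combined_loss(.., alpha = 1.0)` equals `kd_loss(..)`.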

pv validate contracts/apr-cli-distill-train-v1.yaml exits 0 (verified live).

Implementation cost

~600-1200 LOC + 9 tests, multi-day Rust task. Once shipped, enables §34.5's 7B→370M distillation in ~2-4h GPU time.

Test plan

  • pv validate contracts/apr-cli-distill-train-v1.yaml exits 0

🤖 Generated with Claude Code

…till train implementation per §35.3

Triggering observation (2026-04-28, §35): a live run of `apr distill`
on the canonical 7B teacher + §33 best student finished in ~45 seconds.
The source at distill.rs:1464 confirmed that the Standard strategy is
just `tensor_clone()`, with no gradient training. The 45-second
wall-clock time and the 192-byte output delta vs the input were the
falsification signal.

§34.5 had recommended distillation as the path past the val_loss=9.38
capacity ceiling. §35 found the in-tree implementation isn't ready.
This contract is the §26.8 stack-tool-extension that unblocks §34.5.

## Contract structure

3 equations + 1 drift-prevention gate:
- kl_divergence_logit_loss with T*T scaling (Hinton et al. 2015)
- alpha_weighted_loss combining KL + CE
- precompute_train_stages 3-stage pipeline (per existing --stage skeleton)
- drift_prevention_output_must_be_trained (gates against the §35 stub
  behavior — output must differ from input by >Q4K tolerance)
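
The drift-prevention gate above can be sketched as an element-wise
comparison between the input and output student tensors. This is only an
illustration: `output_was_trained` is a hypothetical name, and the
`tolerance` argument stands in for the Q4K tolerance, whose actual value
is not stated here.

```rust
/// Largest absolute element-wise difference between two tensors.
fn max_abs_diff(input: &[f32], output: &[f32]) -> f32 {
    assert_eq!(input.len(), output.len(), "tensor shapes must match");
    input
        .iter()
        .zip(output.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max)
}

/// Gate: passes only if some weight moved by more than `tolerance`.
/// A stub that clones the student (or differs only by quantization
/// noise below the tolerance) fails the gate.
fn output_was_trained(input: &[f32], output: &[f32], tolerance: f32) -> bool {
    max_abs_diff(input, output) > tolerance
}
```

A cloned tensor yields a maximum difference of exactly 0.0, so the gate
rejects the §35 stub behavior for any non-negative tolerance.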

9 falsification tests:
- FALSIFY-001: real training (not stub) — student tensors differ post-train
- FALSIFY-002: KL loss decreases over epochs
- FALSIFY-003: temperature scaling preserves softmax ranking
- FALSIFY-004: alpha=1 reduces to pure KD
- FALSIFY-005: precompute is byte-deterministic
- FALSIFY-006: stage train resumes from precompute cache
- FALSIFY-007: pv validates this contract
- FALSIFY-008: 3-surface drift (cli + registry + test)
- FALSIFY-009: end-to-end smoke beats from-scratch baseline
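
As an illustration of what "byte-deterministic" in FALSIFY-005 could
mean in practice, the sketch below compares two precompute runs by their
serialized bytes rather than by float equality. The little-endian layout
and the names `serialize_logits` and `runs_are_byte_identical` are
assumptions; the real cache format is not described here.

```rust
/// Serialize logits as consecutive little-endian f32 values
/// (hypothetical cache layout, for illustration only).
fn serialize_logits(logits: &[f32]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(logits.len() * 4);
    for value in logits {
        bytes.extend_from_slice(&value.to_le_bytes());
    }
    bytes
}

/// True when two precompute runs produced byte-identical caches.
fn runs_are_byte_identical(run_a: &[f32], run_b: &[f32]) -> bool {
    serialize_logits(run_a) == serialize_logits(run_b)
}
```

Comparing bytes is stricter than comparing floats: `0.0 == -0.0` holds
for `f32` values, but their byte patterns differ, so a sign flip in a
zero logit would still be reported as non-determinism.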

`pv validate` exits 0 (verified live).

## Implementation cost

~600-1200 LOC + 9 tests:
1. distill.rs:1464 — replace tensor_clone() with real KD pipeline
2. Stage `precompute`: forward teacher over corpus, save logits per-token
3. Stage `train`: load student + cached teacher logits, KL+CE loss,
   backprop via aprender-train autograd, optimizer step, checkpoint
4. Per-epoch metadata: train_loss, val_loss, kl_loss, ce_loss, total_loss
5. CI fixture: small teacher (qwen2.5-0.5b) + 50M student on 1M tokens
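
The per-epoch metadata in item 4 and the FALSIFY-002 check could be
sketched as follows. The struct name `EpochMetrics` and the helper
`kl_loss_decreased` are hypothetical, and comparing the last epoch with
the first (rather than requiring a strictly monotone decrease) is an
assumption made here to tolerate minibatch noise.

```rust
/// One record per epoch, with the five fields listed in item 4.
#[derive(Debug, Clone, PartialEq)]
struct EpochMetrics {
    train_loss: f32,
    val_loss: f32,
    kl_loss: f32,
    ce_loss: f32,
    total_loss: f32,
}

impl EpochMetrics {
    /// Guards against NaN/inf losses leaking into checkpoints.
    fn all_finite(&self) -> bool {
        [
            self.train_loss,
            self.val_loss,
            self.kl_loss,
            self.ce_loss,
            self.total_loss,
        ]
        .iter()
        .all(|v| v.is_finite())
    }
}

/// FALSIFY-002 style check: the KL loss at the last epoch is lower
/// than at the first. Needs at least two epochs to say anything.
fn kl_loss_decreased(history: &[EpochMetrics]) -> bool {
    match (history.first(), history.last()) {
        (Some(first), Some(last)) if history.len() >= 2 => last.kl_loss < first.kl_loss,
        _ => false,
    }
}
```

A run whose KL loss stays flat, as a clone-only stub would produce,
fails this check.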

Once shipped, runs §34.5's 7B→370M distillation in ~2-4h GPU time and
is expected to push MODEL-2 below the §34 9.38 capacity ceiling toward
the spec target of val_loss=3.0.

Status: PROPOSED. Implementation deferred to multi-day Rust task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) April 28, 2026 05:11
@noahgift noahgift merged commit 3732fb5 into main Apr 28, 2026
11 checks passed
@noahgift noahgift deleted the docs/apr-cli-distill-train-contract branch April 28, 2026 05:35
noahgift added a commit that referenced this pull request Apr 28, 2026
… — v2.80 → v2.81 (#1098)

Landmark section in plain prose for readers who don't want to chase the
§15→§35 hypothesis chain. Each model is blocked by a single concrete
problem.

MODEL-1: numerical bug at layer 3 of FFN. 18× std anomaly vs GGUF reference.
Three theories tested+refuted today (matmul kernel via §30, qkv_bias via §32,
layer-3 weight bytes via #1082 byte-compare). Actual bug is cumulative F32
precision drift through residuals. Fix path: with PR #1082 merged + PR #1083
in flight, run apr trace --payload on canonical 7B teacher in both formats
and bisect layer-by-layer.

MODEL-2: trained end-to-end today. val_loss=9.38 (spec target 3.0). 370M
from-scratch has converged: 4x more steps yielded the same outcome (§34).
Capacity is the binding constraint, not corpus or compute. Path forward:
distillation from the shipped MODEL-1 7B teacher. apr distill is currently
a stub (§35); contract authored as #1097, impl is a multi-day Rust task.

Both blockers are fixable with code, not training time:
- MODEL-1: bisect with new sub-FFN telemetry, then fix at root
- MODEL-2: implement apr distill --stage train, then run 2-4h distillation

Today's session: 11 PRs landed (6 spec amendments + 4 contracts + 1 impl
+ 2 SHIP-007 sub-FFN telemetry PRs) plus full P1.0→P2 pipeline executed
end-to-end with zero muda.

Header v2.80.0 → v2.81.0. No coverage flip — landmark only.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>