docs(ship-two-001): §36 — plain-language status of what's left to ship the two models #1098
Merged
… — v2.80 → v2.81

Landmark section in plain prose for readers who don't want to chase the §15→§35 hypothesis chain. Each model is blocked by a single concrete problem.

MODEL-1: numerical bug at layer 3 of FFN. 18× std anomaly vs GGUF reference. Three theories tested and refuted today (matmul kernel via §30, qkv_bias via §32, layer-3 weight bytes via #1082 byte-compare). Actual bug is cumulative F32 precision drift through residuals. Fix path: with PR #1082 merged and PR #1083 in flight, run apr trace --payload on the canonical 7B teacher in both formats and bisect layer-by-layer.

MODEL-2: trained end-to-end today. val_loss=9.38 (spec target 3.0). The 370M from-scratch model has converged: 4× more steps yielded the same outcome (§34). Capacity is the binding constraint, not corpus or compute. Path forward: distillation from the shipped MODEL-1 7B teacher. apr distill is currently a stub (§35); contract authored as #1097, impl is a multi-day Rust task.

Both blockers are fixable with code, not training time:
- MODEL-1: bisect with new sub-FFN telemetry, then fix at root
- MODEL-2: implement apr distill --stage train, then run 2-4h distillation

Today's session: 11 PRs landed (6 spec amendments + 4 contracts + 1 impl + 2 SHIP-007 sub-FFN telemetry PRs) plus full P1.0→P2 pipeline executed end-to-end with zero muda.

Header v2.80.0 → v2.81.0. No coverage flip; landmark only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Landmark section in plain prose for readers who don't want to chase the §15→§35 hypothesis chain. Each of the two models is blocked by a single concrete problem.
Plain-language status
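MODEL-1 (Qwen2.5-Coder-7B Apache-Q4K, already published) has a numerical bug at layer 3 of the FFN: its outputs have 18× the standard deviation of the GGUF reference on the same prompt. Three theories were tested and refuted today: the matmul kernel (§30), qkv_bias (§32), and the layer-3 weight bytes (#1082 byte-compare).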
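To make the "18× too spread" figure concrete, here is a minimal sketch of one way such a spread ratio could be computed for a single layer. The function name `std_ratio` and the sample activations are hypothetical; nothing here reflects the actual output of the apr tooling, and the real comparison may use a different statistic.

```python
import statistics


def std_ratio(candidate, reference):
    """Return std(candidate) / std(reference) using population std.

    Both arguments are flat sequences of activation values for the same
    layer and prompt (hypothetical inputs, not a real trace format).
    """
    ref_std = statistics.pstdev(reference)
    if ref_std == 0.0:
        raise ValueError("reference activations have zero spread")
    return statistics.pstdev(candidate) / ref_std


# Made-up activations: the candidate is the reference scaled by 18,
# so its spread is 18 times wider by construction.
reference = [-1.0, -0.5, 0.0, 0.5, 1.0]
candidate = [18.0 * x for x in reference]

ratio = std_ratio(candidate, reference)
print(f"std ratio: {ratio:.2f}")
```

A ratio near 1 would indicate the two formats agree in spread at that layer; a ratio far from 1 marks a layer worth inspecting.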
Actual bug: cumulative F32 precision drift through residual connections. Fix path: with PR #1082 (sub-FFN telemetry) merged and PR #1083 (CLI wiring) in flight, run `apr trace --payload` on the canonical 7B teacher in both formats and bisect layer-by-layer.
MODEL-2 (paiml/albor-llama-370m-python-v1, trained today): val_loss=9.38, spec target 3.0. The 370M-from-scratch architecture has converged: 4× more steps yielded the same outcome (§34). Capacity is the binding constraint. Path: distillation from the shipped MODEL-1 7B teacher. `apr distill` is currently a stub (§35); contract authored as #1097, impl is a multi-day Rust task.
What's left
- MODEL-1: bisect with the new sub-FFN telemetry, then fix at root.
- MODEL-2: implement `apr distill --stage train`, then run 2-4h distillation.
Both blockers are fixable with code, not training time or compute.
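The layer-by-layer bisect for MODEL-1 can be pictured as a binary search over layer indices. The sketch below is an illustration only: `first_divergent_layer`, the `ratios` list, and `THRESHOLD` are hypothetical names and values, and it assumes that once a layer diverges every later layer diverges too (plausible for cumulative drift, but not confirmed by the source).

```python
def first_divergent_layer(num_layers, diverges):
    """Return the lowest layer index where diverges(layer) is True.

    Returns None if the last layer does not diverge. Assumes divergence
    is monotone: once it appears, it persists in all later layers.
    """
    if num_layers == 0 or not diverges(num_layers - 1):
        return None
    lo, hi = 0, num_layers - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if diverges(mid):
            hi = mid        # first bad layer is at mid or earlier
        else:
            lo = mid + 1    # first bad layer is after mid
    return lo


# Made-up per-layer std ratios (candidate / reference), for illustration.
ratios = [1.0, 1.01, 1.02, 18.0, 17.5, 19.2, 18.4, 18.1]
THRESHOLD = 2.0  # hypothetical cutoff for "diverged"

layer = first_divergent_layer(len(ratios), lambda i: ratios[i] > THRESHOLD)
print(f"first divergent layer: {layer}")
```

If divergence turns out not to be monotone, a plain linear scan over all layers is the safer fallback; the binary search only saves comparisons when each per-layer check is expensive.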
Today's session output
11 PRs landed in 24 hours:
Plus full P1.0 → P2 corpus pipeline (565.6M tokens, val_loss=9.38) executed end-to-end with zero muda.
Test plan
🤖 Generated with Claude Code