spec(ship-two-models): v2.94.0 - §49 MODEL-2 strategy pivot: from-scratch → pretrained-init #1461
Merged
…tch → pretrained-init

After operator asked "why aren't we training models?" mid-session, re-diagnosed MODEL-2 from spec §34's "capacity-limited at val_loss=9.38" framing to the empirically correct "data-limited" diagnosis. Pivots strategy from "MODEL-2 = 370M from-scratch" to "MODEL-2 = pretrained 0.5B-class fine-tuned" (industry standard for production small code-LMs).

## Live evidence (this session, 2026-05-04)

500-step `apr pretrain --mode from-scratch --device cuda` smoke on RTX 4090:

```
Run Result: OK CONVERGED final val_loss=9.7255 after 5 epoch(s)
```

Vs §24 memory's 80K-step run on the same 4× corpus: val_loss=9.7507. **Within 0.026 across 160× difference in step count.** Ceiling is corpus-bound, not step-bound.

## Why "from-scratch" is the methodology defect

SmolLM-360M (similar 360M param count) hits val_loss ~2.9, but it was trained on **1T tokens**. MODEL-2 saw 565M, ~1800× less. The 370M-from-scratch math doesn't reach val_loss=3.0 at this corpus scale.

Industry: StableCode ← StableLM, Qwen2.5-Coder ← Qwen2.5, DeepSeek-Coder ← DeepSeek-LLM. **Nobody trains 0.5B from scratch for production code-LMs.**

## Pre-conditions verified

- ✅ Qwen2.5-Coder-0.5B-Instruct in HF cache (950 MB, model.safetensors + config + tokenizer)
- ✅ `apr convert <safetensors> --quantize fp16` → APR file: 290 tensors, 942 MiB, 5.1 sec
- ✅ APR file at `/mnt/nvme-raid0/models/qwen2.5-coder-0.5b-instruct-fp16.apr`
- ✅ RTX 4090 + cuBLAS + custom PTX backward (sm_89, no Blackwell JIT bug)
- ✅ codeparrot+CSN-Python tokenized corpus (565M tokens)

## Implementation gap (next PR)

`apr pretrain` lacks `--init <model.apr>` flag. §49.6 step 4 calls out the ~50 LOC fix. Not in this commit.

## Net effects

- Spec v2.93.0 → **v2.94.0**.
- §34's "capacity-limited" framing RETIRED in favor of §49.1's data-limited diagnosis.
- §36.2's "distillation is the only path" REFINED: pretrained-init is load-bearing; distillation is multiplicative on top.
- **MODEL-2 ship %**: stays **57%** until first fine-tune produces val_loss < 9.38 evidence.
- **MODEL-1 ship %**: unchanged at 91%.

## Five whys (paraphrased from §49.5)

1. **Why now?** Operator pivot. Cascade was real but didn't move ship %.
2. **Why is "from-scratch" wrong?** Math: 370M × 565M tokens ≠ SmolLM territory.
3. **Why Qwen2.5-Coder-0.5B-Instruct?** Already cached, code-pretrained, similar params, permissive license.
4. **Why retain target val_loss=3.0?** Right product target (~exp(3) = ~20 perplexity = good code completion).
5. **Why isn't this just a rename?** Different init = qualitatively different artifact (good code completion vs near-random tokens).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
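The arithmetic behind the data-limited diagnosis can be checked directly. The sketch below is illustrative only: the function names are hypothetical, the inputs are the figures quoted in this description (1T vs 565M tokens, val_loss 9.7507 vs 9.7255, target val_loss 3.0), and it assumes val_loss is a mean cross-entropy in nats, which is what the `exp(3)` conversion implies.

```rust
/// How many times larger the `big` token budget is than the `small` one.
fn token_ratio(big: f64, small: f64) -> f64 {
    big / small
}

/// Perplexity implied by a mean cross-entropy loss measured in nats.
fn perplexity(val_loss: f64) -> f64 {
    val_loss.exp()
}

/// Absolute gap between two validation losses.
fn loss_gap(a: f64, b: f64) -> f64 {
    (a - b).abs()
}
```

With the quoted figures, `token_ratio(1e12, 565e6)` comes out near 1770 (the "~1800×" above), `loss_gap(9.7507, 9.7255)` is about 0.025, and `perplexity(3.0)` is about 20.1.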
noahgift added a commit that referenced this pull request May 4, 2026
…MODEL-2 pivot (#1470)

Authors `contracts/apr-pretrain-from-init-v1.yaml` v1.0.0 PROPOSED, the contract layer driving §49 step 4 (the wire-up PR for `apr pretrain --init <model.apr>`).

§49 (spec v2.94.0, 2026-05-04) retired the from-scratch MODEL-2 strategy after 80K-step LR-budget falsification (§24.8) confirmed the corpus-bottleneck floor at val_loss=9.75. The pivot is to fine-tune a Qwen2.5-class pretrained checkpoint that has already paid the 1T-token data tax.

This contract pins the new flag's semantics:
- flag absent → existing pretrain behavior (no regression on §24/§25)
- flag present → load weights from APR file, train with that init
- flag composes with --mode {finetune, from-scratch}
- missing/corrupted/shape-mismatched APR → fail-fast non-zero exit
- load-bearing claim: init_loss(step=0) ≤ 6.0 < from_scratch_loss(step=0)
- drift-prevention: clap field + unit test + integration test in same PR

10 falsifiers (FALSIFY-APR-PRETRAIN-INIT-001..010), 6 proof obligations, 2 kani harnesses. pv validate exits 0 with 0 errors / 0 warnings.

The contract is INDEPENDENT of the SHIP-007 / Qwen2-0.5B `apr run` gibberish defect (memory entry 2026-05-04). Training forward path uses a different loader and dispatch surface than `apr run` inference; the init flag's correctness is provable independently. If the gibberish defect also infects the training forward path, FALSIFY-APR-PRETRAIN-INIT-006 (init_loss <= 6.0) will catch it as a side effect.

Five Whys:
1. Why a contract before the impl? `pv validate` rejects schema drift; the falsifier list pins what the wire-up PR must satisfy. Contract-first prevents the wire-up shipping with silent gaps.
2. Why now? Spec §49 was authored 2026-05-04 (PR #1461) and roadmap step 3 is "author this contract". Step 4 (wire-up) blocks on it.
3. Why scope to PROPOSED? No impl exists yet. Promotion to ACTIVE/FUNCTIONAL/DISCHARGED gated on the wire-up PR landing and the 10 falsifiers compile-binding (PARTIAL_ALGORITHM_LEVEL).
4. Why 10 falsifiers? Each error class is a distinct silent-failure mode. The §24 retrospective showed `--synthetic` defaulted silently to `true` for months; a contract pinning "no silent fallback" must enumerate ALL silent paths so CI catches each.
5. Why not block on the SHIP-007 / 0.5B gibberish? The two are orthogonal. SHIP-007 is `apr run` inference; this contract is `apr pretrain` training-forward. Separating keeps the wire-up PR small and lets §49 progress while SHIP-007 stays under operator-gated investigation.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §49 step 5 (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
- SPEC-SHIP-TWO-001 §49: MODEL-2 strategy pivot (#1461)
- contracts/training-loop-pretrain-v1.yaml v1.5.0 ACTIVE (parent)
- feedback_cli_subcommand_three_surface_drift.md
- feedback_no_guessing.md
- memory:project_qwen2_0_5b_is_ship_007_manifestation.md (orthogonal)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
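The load-bearing claim `init_loss(step=0) ≤ 6.0 < from_scratch_loss(step=0)` reads as a two-sided check. The following is a minimal sketch, not the contract's actual falsifier harness: both function names are hypothetical, and the uniform-guess baseline `ln(vocab_size)` is an assumption used here only to supply a plausible from-scratch loss at step 0.

```rust
/// Cross-entropy (in nats) of a uniform guess over `vocab_size` tokens.
/// Assumption: a randomly initialized model starts near this value.
fn uniform_baseline_loss(vocab_size: u32) -> f64 {
    (vocab_size as f64).ln()
}

/// Hypothetical check for the contract's claim:
/// init_loss(step=0) <= 6.0 < from_scratch_loss(step=0).
fn init_claim_holds(init_loss: f64, from_scratch_loss: f64) -> bool {
    const THRESHOLD: f64 = 6.0; // bound quoted in the contract text
    init_loss <= THRESHOLD && THRESHOLD < from_scratch_loss
}
```

Under this reading, a step-0 loss near the from-scratch floor (for example 9.75) fails the left half of the check, which is how a broken or silently skipped weight load would surface.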
noahgift added a commit that referenced this pull request May 4, 2026
Implements `apr pretrain --init <PATH>` per the contract authored in PR #1470 (`apr-pretrain-from-init-v1` v1.0.0 PROPOSED). This is §49 step 4 of the MODEL-2 pretrained-init pivot.

Spec §49 (v2.94.0, 2026-05-04, PR #1461) retired the from-scratch MODEL-2 strategy after §24.8's 80K-step LR-budget falsification confirmed val_loss=9.75 as a corpus-bottleneck floor on the 565M-token corpus. Fine-tuning a Qwen2.5-class pretrained checkpoint (which has already paid the 1T-token data tax) is the load-bearing path. This PR adds the flag that loads weights from an APR file as the initial weights for the pretrain optimizer.

What this PR adds:
1. Clap field `init: Option<PathBuf>` on the `Pretrain` variant in `extended_commands.rs:635`. Optional; absence preserves existing pretrain behavior (no §24/§25 regression).
2. Plumbing through `dispatch_analysis.rs:346` to `commands::pretrain::run` (new `init: Option<&Path>` param).
3. New helper `validate_init_apr_path()` in `pretrain.rs`:
   a. Open file → FALSIFY-003 (missing → exit non-zero)
   b. Read 4 magic bytes → FALSIFY-004 (read fails → exit non-zero)
   c. Compare magic bytes vs APR\0 / APRN → FALSIFY-004
   d. If valid magic → return "not yet wired" error pointing at §49 step 5 (no silent random-init fallback)
4. 7 new unit tests in `pretrain::tests`:
   - pretrain_init_flag_absent_parses_to_none (FALSIFY-001/002)
   - pretrain_init_flag_parses_path (FALSIFY-001)
   - pretrain_init_missing_file_errors (FALSIFY-003)
   - pretrain_init_bad_magic_errors (FALSIFY-004)
   - pretrain_init_empty_file_errors (FALSIFY-004 edge)
   - pretrain_init_valid_apr_rejected_until_step5 (partial-state guard)
   - pretrain_init_v1_magic_aprn_recognised (v1 magic acceptance)
5. Contract status bump: PROPOSED → PARTIAL_ALGORITHM_LEVEL via v1.0.0 → v1.1.0 metadata update + changelog entry.

Test results (cargo test -p apr-cli --lib commands::pretrain::):
21 passed; 0 failed; 0 ignored

Step 5 follow-up scope (~150 LOC):
- Architecture matching: read APR header, compare vocab/hidden/layers/heads against pretrain target → discharges FALSIFY-005
- Actual weight load: read tensor shards, materialize into optimizer's initial state → discharges FALSIFY-006/009/010
- LIVE 500-step fine-tune on Qwen2.5-Coder-0.5B-Instruct.apr → DISCHARGED (val_loss < 9.38)

Five Whys:
1. Why a small partial-state PR instead of full step 4+5? §49 step 4 was scoped at ~50 LOC for "wire the flag"; step 5 does the full weight load. Splitting keeps each PR small, reviewable, and lets CI catch silent-fallback regressions between the two steps.
2. Why have validate_init_apr_path() reject EVERY valid APR right now? Honors the contract's no-silent-fallback invariant. If we accepted valid APRs and silently used random init while step 5 is open, an operator could ship a "fine-tune" run that's actually a from-scratch run: exactly the §24 silent-default defect class this PR is built to prevent.
3. Why a custom error message naming "§49 step 5" instead of just "not implemented"? Operators tracing a failure to the source PR can find the next-step contract obligations by grep'ing the spec; "not implemented" gives them no thread to pull. The error message IS the breadcrumb to the next-cycle work.
4. Why bump the contract from PROPOSED to PARTIAL_ALGORITHM_LEVEL in the same PR? Atomicity: the contract describes the flag's algorithm at the LEVEL we have impl evidence for. PROPOSED means "no impl"; PARTIAL means "compile-bound + algorithm-bound at sub-falsifier granularity". This PR delivers exactly that. Leaving the contract at PROPOSED while the impl is on main creates a drift between status and reality.
5. Why not implement step 5 in the same PR? The MappedAprModel architecture extraction is a deeper plumbing question (header reading, GGUF qtype decoding, optimizer state initialization) that warrants its own commit + review. Going small + atomic is the Toyota Way single-piece flow.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §49 step 5 weight-load impl + LIVE 500-step fine-tune (FALSIFY-006)

Refs:
- SPEC-SHIP-TWO-001 §49: MODEL-2 strategy pivot (#1461)
- contracts/apr-pretrain-from-init-v1.yaml v1.0.0 → v1.1.0
- PR #1470: contract authoring (merged)
- feedback_cli_subcommand_three_surface_drift.md (3-surface rule)
- feedback_no_guessing.md
- memory:project_qwen2_0_5b_is_ship_007_manifestation.md (orthogonal)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
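The four checks listed for `validate_init_apr_path()` can be sketched with the standard library alone. This is a hypothetical re-creation from the description above, not the helper that lives in `pretrain.rs`: the function name, the `String` error type, and the exact error wording are all assumptions; only the two magic values and the order of the checks come from the commit message.

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

/// Magic values named in the commit message: "APR\0" and the v1 magic "APRN".
const APR_MAGICS: [[u8; 4]; 2] = [*b"APR\0", *b"APRN"];

/// Hypothetical sketch of the four checks; not the real helper.
fn validate_init_apr_path_sketch(path: &Path) -> Result<(), String> {
    // (a) A missing or unopenable file fails fast (FALSIFY-003).
    let mut file = File::open(path)
        .map_err(|e| format!("cannot open init file {}: {}", path.display(), e))?;

    // (b) Fewer than 4 readable bytes fails fast (FALSIFY-004).
    let mut magic = [0u8; 4];
    file.read_exact(&mut magic)
        .map_err(|e| format!("cannot read magic bytes: {}", e))?;

    // (c) An unknown magic fails fast (FALSIFY-004).
    if !APR_MAGICS.contains(&magic) {
        return Err(format!("bad magic {:?}: not an APR file", magic));
    }

    // (d) A valid magic is still rejected until weight loading exists,
    // so there is never a silent random-init fallback.
    Err("--init is not yet wired: see §49 step 5".to_string())
}
```

Note that the sketch has no `Ok(())` path at all, which mirrors the partial-state guard: until step 5 lands, every input produces an error that names where to look next.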
noahgift added a commit that referenced this pull request May 4, 2026
…oupling finding (#1472)

Adds §50 documenting the architecture-mismatch finding caught after §49.6 steps 3+4 landed (PR #1470 contract + PR #1471 wire-up). The remaining §49.6 step 5 was scoped at "0 LOC, just run apr pretrain --init"; that assumption is empirically wrong.

Empirical finding (§50.1): pretrain_real.rs:38-46 HARDCODES Llama370MConfig::* for every architectural constant. Qwen2.5-Coder-0.5B-Instruct has different shape across the board:

Param               | Llama370M   | Qwen2.5-Coder-0.5B
--------------------|-------------|-------------------
hidden_size         | 1024        | 896
num_attention_heads | 16          | 14
num_kv_heads        | 4 (GQA-4:1) | 2 (GQA-7:1)
intermediate_size   | 2816        | 4864
vocab_size          | 50_257      | 151_936
rope_theta          | 10_000      | 1_000_000

Every tensor mismatches. Loading Qwen2.5 weights into a Llama370M-shaped optimizer is a category error.

Three options surfaced (§50.3):
A: Find/build a Llama-shaped 0.5B pretrained checkpoint (~5K LOC + multi-week training; recreates §24/§25 corpus problem)
B: Make trainer architecture-polymorphic (~200-400 LOC; preserves §24/§25 falsification; recommended)
C: Replace Llama370MConfig with Qwen2_5_Coder_0_5B_Config outright (~300 LOC; deletes a working falsification path)

Recommendation (§50.5): Option B. It preserves §24/§25 falsification evidence, exercises TransformerConfig's designed polymorphism, and binds each new component (qwen2_0_5b constructor, GQA-7:1 attention, Qwen tokenizer surface) to its own falsifier.

Re-scoped roadmap (§50.4), 8 sub-steps replacing original step 5:
5a. Author apr-pretrain-arch-polymorphic-v1.yaml contract (~80 LOC)
5b. TransformerConfig::qwen2_0_5b() constructor (~40 LOC)
5c. Extract arch from init APR file metadata (~80 LOC)
5d. Qwen tokenizer-vocab compatibility check (~30 LOC)
5e. GQA-7:1 attention forward-pass verification (~50 LOC)
5f. Wire actual weight load (~120 LOC)
5g. LIVE 500-step smoke fine-tune (operator dispatch) 0 LOC
5h. Stamp + publish as MODEL-2 v2 (~10 LOC)

Total: ~410 LOC + 1 LIVE training run.

Five Whys (§50.6):
1. Why didn't §49 catch this? §49 was authored from strategy/data-budget reasoning; the 0-LOC step-5 cost implicitly assumed polymorphism. Live source inspection (this section's empirical move) revealed pretrain_real.rs:38-46 predates the assumption.
2. Why catch this NOW and not in step 5 implementation? Per feedback_no_guessing.md: read live source before forming implementation plan. Surfacing the mismatch BEFORE writing 200 LOC of weight-load code that fails at runtime is the cheapest place to pay cost-of-defect. The §50-prior wrong-premise PRs (#1466/#1467/#1468 closed) on the SHIP-007 / 0.5B gibberish track were the same defect class.
3. Why option B over A or C? Preserves §24/§25 falsification evidence (we KEEP knowing from-scratch fails at 9.75; we just don't ship it as MODEL-2). Exercises the polymorphism TransformerConfig was designed for. Each new component becomes its own falsifier rather than a hidden coupling.
4. Why is FALSIFY-005 the right place to fail-fast? PR #1470 already pinned "Architecture mismatch is FAIL-FAST, not silent-truncate". Step 4 (PR #1471) doesn't enforce arch matching yet; it returns "not yet wired" before getting there. So FALSIFY-005 is currently UNBOUND but its discharge gate is well-defined: read APR header, compare against pretrain target, error with names of mismatched fields.
5. Why isn't this a "punt"? A punt would say "blocked, await operator". This amendment names three options with LOC estimates, recommends one with reasoning, gives a concrete 8-step roadmap with falsifier discharge mapped to each sub-step. The work IS shippable; it's just bigger than 0 LOC.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38). Sub-steps 5a-5f can each individually move 1% with falsifier discharge (architecture-polymorphic infrastructure shipped == evidence that the §49 path is REACHABLE, not just theoretical).

Refs:
- §49: MODEL-2 strategy pivot (PR #1461)
- PR #1470: apr-pretrain-from-init-v1 v1.0.0 PROPOSED contract
- PR #1471: apr pretrain --init clap field + magic-byte validate
- feedback_no_guessing.md: read source before forming hypothesis
- feedback_fix_root_cause_never_route_around.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
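The discharge gate described for FALSIFY-005 (compare the init checkpoint's shape against the pretrain target and error with the names of mismatched fields) could look roughly like this. It is a sketch under stated assumptions: `ArchSketch`, `gqa_ratio`, and `mismatched_fields` are hypothetical names, the struct models only the six constants from the §50.1 table, and the real trainer reads these values from `Llama370MConfig` and the APR header rather than from constants.

```rust
/// The six architectural constants from the §50.1 comparison table.
#[derive(Debug, Clone, PartialEq)]
struct ArchSketch {
    hidden_size: u32,
    num_attention_heads: u32,
    num_kv_heads: u32,
    intermediate_size: u32,
    vocab_size: u32,
    rope_theta: u32,
}

const LLAMA_370M: ArchSketch = ArchSketch {
    hidden_size: 1024,
    num_attention_heads: 16,
    num_kv_heads: 4,
    intermediate_size: 2816,
    vocab_size: 50_257,
    rope_theta: 10_000,
};

const QWEN2_5_CODER_0_5B: ArchSketch = ArchSketch {
    hidden_size: 896,
    num_attention_heads: 14,
    num_kv_heads: 2,
    intermediate_size: 4864,
    vocab_size: 151_936,
    rope_theta: 1_000_000,
};

/// Query heads per KV head, i.e. the n in "GQA-n:1".
fn gqa_ratio(arch: &ArchSketch) -> u32 {
    arch.num_attention_heads / arch.num_kv_heads
}

/// Names of every field where `init` differs from `target`.
/// An empty result means the shapes are compatible.
fn mismatched_fields(target: &ArchSketch, init: &ArchSketch) -> Vec<&'static str> {
    let pairs = [
        ("hidden_size", target.hidden_size, init.hidden_size),
        ("num_attention_heads", target.num_attention_heads, init.num_attention_heads),
        ("num_kv_heads", target.num_kv_heads, init.num_kv_heads),
        ("intermediate_size", target.intermediate_size, init.intermediate_size),
        ("vocab_size", target.vocab_size, init.vocab_size),
        ("rope_theta", target.rope_theta, init.rope_theta),
    ];
    pairs
        .iter()
        .filter(|(_, t, i)| t != i)
        .map(|(name, _, _)| *name)
        .collect()
}
```

Comparing the two constants above reports all six fields, which matches the "every tensor mismatches" finding; a fail-fast caller would turn a non-empty list into a non-zero exit instead of truncating or padding tensors.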
noahgift added a commit that referenced this pull request May 4, 2026
… 5b + DEFECT FIX (#1474)

§50.4 step 5b authored a contract assuming `qwen2_0_5b()` did not exist. Live source inspection during impl revealed the constructor ALREADY EXISTS at `transformer/config.rs:156`. Reading the HF config byte-for-byte (per `feedback_no_guessing.md`) revealed a real defect:

HF config (Qwen2.5-Coder-0.5B-Instruct): tie_word_embeddings: true
Existing code (qwen2_0_5b): tie_word_embeddings: false

Fix: 1 LOC change `false → true`.

Per Qwen scaling-law convention verified against the HF cache:
- Qwen2.5-Coder-0.5B: tie=true (HF cache 2026-05-04 ✓)
- Qwen2.5-Coder-1.5B: tie=true (HF cache 2026-05-04 ✓; inherits via `..Self::qwen2_0_5b()` spread)
- Qwen2.5-Coder-7B: tie=false (HF cache 2026-05-04 ✓; explicit in qwen2_7b())

Why the defect matters: tied vs untied embeddings is a load-bearing architectural property. With tie=false (current bug), if an operator fine-tunes from a Qwen2.5-0.5B init checkpoint, the lm_head will be allocated as a separate tensor that doesn't get loaded (because the APR file only contains the embed_tokens tensor; they share weights). The result: lm_head random-initialized and untrained, producing silent gibberish at val time. This is exactly the §49 / §50 failure class the contract was authored to prevent.

What this PR adds:
1. Fix `tie_word_embeddings: false → true` in `qwen2_0_5b()` at `transformer/config.rs:156-174`
2. Add docstring noting the empirical verification + HF cache path + Qwen scaling-law quirk
3. Add 3 new unit tests in `transformer::config::tests`:
   - `qwen2_0_5b_matches_hf_config_2026_05_04` (FALSIFY-001 byte-identity verification, 11 fields)
   - `qwen2_1_5b_inherits_tie_word_embeddings_from_0_5b` (drift-prevention; catches future spread-split refactors)
   - `qwen2_7b_does_not_tie_embeddings` (drift-prevention; pins the 7B Qwen scaling-law quirk against silent flips)

Test results (cargo test -p aprender-train --lib transformer::config::tests::qwen2):
3 passed; 0 failed; 0 ignored

Discharges FALSIFY-APR-PRETRAIN-ARCH-001 in PR #1473's contract.

Five Whys:
1. Why was the constructor already there but with the wrong tie setting? Likely authored before the spec-§49 use case became the load-bearing target. The constants for `qwen2_0_5b` were correct for inference, but tie_word_embeddings is mostly a training-pipeline concern: it determines whether lm_head is a separate trainable parameter or shares with embed_tokens.
2. Why didn't pmat query / cargo test catch this earlier? Existing tests pinned shape (hidden, layers, heads, etc.) but no test verified `tie_word_embeddings`. This PR adds the missing drift-prevention test that catches the defect class.
3. Why fix this in the same PR as the test (not a separate fix)? Toyota Way: the test IS the discharge mechanism for FALSIFY-001. A test that passed against the (defective) status quo would be a liar. Fixing first + testing second guarantees the test pins correct behavior, not whatever happened to be in the code.
4. Why also pin qwen2_1_5b (inheritance) and qwen2_7b (anti-spread)? Those are drift-prevention. The spread-inheritance pattern `..Self::qwen2_0_5b()` is fragile: a future refactor could split the inheritance chain and silently flip tie_word_embeddings back to false on 1.5B. Test catches that. Similarly, an over-enthusiastic refactor could homogenize 7B with 0.5B (incorrectly setting 7B's tie=true). Test catches that too.
5. Why §50.4 step 5b was overscoped at 40 LOC: §50 was authored under the assumption that the constructor didn't exist. Live source inspection (per `feedback_no_guessing.md`) revealed the foundation was already there, just with one defect. This is the same lesson as §50 itself: read source before authoring scope. The contract from PR #1473 is still valid; only the LOC estimate in §50.4's table was wrong.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
- SPEC-SHIP-TWO-001 §49: MODEL-2 strategy pivot (#1461)
- SPEC-SHIP-TWO-001 §50: architecture-coupling finding (#1472, in flight)
- PR #1470: apr-pretrain-from-init-v1 contract (merged)
- PR #1471: apr pretrain --init wire-up (merged)
- PR #1473: apr-pretrain-arch-polymorphic-v1 contract (in flight)
- HF config: ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-Coder-{0.5B,1.5B,7B}-Instruct/.../config.json
- feedback_no_guessing.md: read source before forming hypothesis

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
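The inheritance pattern behind the three drift-prevention tests can be shown in miniature. This is a sketch, not the code in `transformer/config.rs`: `ConfigSketch` and `required_output_tensors` are hypothetical, only two fields are modeled, and the 1.5B and 7B hidden sizes are filled in from the public HF configs as an assumption that should be checked against the cache before relying on them.

```rust
/// Minimal stand-in for the real config type; only two fields are modeled.
#[derive(Debug, Clone, PartialEq)]
struct ConfigSketch {
    hidden_size: u32,
    tie_word_embeddings: bool,
}

impl ConfigSketch {
    /// 0.5B: tied embeddings (the value the fix sets).
    fn qwen2_0_5b() -> Self {
        Self { hidden_size: 896, tie_word_embeddings: true }
    }

    /// 1.5B: overrides the shape and inherits `tie_word_embeddings`
    /// from the 0.5B constructor through struct-update syntax.
    fn qwen2_1_5b() -> Self {
        Self { hidden_size: 1536, ..Self::qwen2_0_5b() } // hidden size assumed
    }

    /// 7B: sets the flag explicitly, so it cannot drift with the 0.5B value.
    fn qwen2_7b() -> Self {
        Self { hidden_size: 3584, tie_word_embeddings: false } // hidden size assumed
    }

    /// Tensors a checkpoint must supply for the embedding/output pair.
    /// With tied embeddings, lm_head reuses embed_tokens and is not stored.
    fn required_output_tensors(&self) -> Vec<&'static str> {
        if self.tie_word_embeddings {
            vec!["embed_tokens"]
        } else {
            vec!["embed_tokens", "lm_head"]
        }
    }
}
```

Flipping the 0.5B flag to `false` in this sketch makes `required_output_tensors()` ask for an `lm_head` tensor that a tied checkpoint does not contain, which is the mechanism behind the random-initialized `lm_head` described above. It also silently flips the 1.5B value, which is why the inheritance gets its own test.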
Summary
`apr convert` works (290 tensors → 942 MiB APR file).

Why now
Operator asked "why aren't we training models?" after 11 SHIP-007 cascade PRs without ship-% movement. Re-diagnosed MODEL-2: §34's "capacity-limited" framing is wrong — it's data-limited. SmolLM-360M (similar params) needed 1T tokens to hit val_loss ~2.9; MODEL-2 saw 565M (~1800× less). From-scratch math doesn't reach val_loss=3.0 at this corpus scale.
Industry precedent
Nobody trains 0.5B from scratch for production code-LMs at <1T tokens because the math doesn't work.
Implementation gap (next PR)
`apr pretrain` lacks `--init <model.apr>` flag. §49.6 step 4 calls out a ~50 LOC fix. Not in this commit.

Test plan
`evidence/model-2-strategy-pivot-2026-05-04/findings.md`

Plain ship %
🤖 Generated with Claude Code