spec(ship-two-models): v2.94.0 — §49 MODEL-2 strategy pivot: from-scratch → pretrained-init by noahgift · Pull Request #1461 · paiml/aprender

noahgift · 2026-05-04T07:58:46Z

Summary

Spec v2.93.0 → v2.94.0.
§49 amendment pivots MODEL-2 strategy from "370M from-scratch on 565M tokens" (math doesn't work) to "pretrained 0.5B-class fine-tuned on 565M tokens" (industry standard).
Live evidence: 500-step run hits val_loss=9.7255, ~identical to §24's 80K-step val_loss=9.7507. Ceiling is corpus-bound, not step-bound.
Pre-conditions verified: Qwen2.5-Coder-0.5B-Instruct in HF cache, apr convert works (290 tensors → 942 MiB APR file).

Why now

Operator asked "why aren't we training models?" after 11 SHIP-007 cascade PRs without ship-% movement. Re-diagnosed MODEL-2: §34's "capacity-limited" framing is wrong — it's data-limited. SmolLM-360M (similar params) needed 1T tokens to hit val_loss ~2.9; MODEL-2 saw 565M (~1800× less). From-scratch math doesn't reach val_loss=3.0 at this corpus scale.
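The ratios quoted above can be re-derived with a few lines of arithmetic. The sketch below is only a sanity check on the numbers in this description; the function and constant names are illustrative and do not come from the aprender codebase.

```rust
/// Ratio between a reference budget and the budget actually used.
fn budget_ratio(reference: f64, actual: f64) -> f64 {
    reference / actual
}

/// Absolute gap between two validation losses.
fn loss_gap(a: f64, b: f64) -> f64 {
    (a - b).abs()
}

// Figures quoted in the PR description (names are hypothetical).
const SMOLLM_TOKENS: f64 = 1.0e12; // "1T tokens"
const MODEL2_TOKENS: f64 = 565.0e6; // "565M tokens"
const SHORT_RUN_STEPS: f64 = 500.0;
const LONG_RUN_STEPS: f64 = 80_000.0;
const SHORT_RUN_VAL_LOSS: f64 = 9.7255;
const LONG_RUN_VAL_LOSS: f64 = 9.7507;
```

Calling `budget_ratio(SMOLLM_TOKENS, MODEL2_TOKENS)` gives a value just under 1770, which the description rounds to ~1800×. `budget_ratio(LONG_RUN_STEPS, SHORT_RUN_STEPS)` is 160, and `loss_gap(SHORT_RUN_VAL_LOSS, LONG_RUN_VAL_LOSS)` is about 0.025.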

Industry precedent

Model	Init
Qwen2.5-Coder-0.5B	← Qwen2.5 base
StableCode-3B	← StableLM
DeepSeek-Coder-1.3B	← DeepSeek-LLM
StarCoder2-3B	from-scratch on 3.3T tokens
SmolLM-360M	from-scratch on 1T tokens

None of these models was trained from scratch on fewer than 1T tokens: the two from-scratch entries used 1T and 3.3T tokens, far beyond MODEL-2's 565M-token corpus.

Implementation gap (next PR)

apr pretrain lacks an --init <model.apr> flag. §49.6 step 4 calls out a ~50 LOC fix. Not in this commit.

Test plan

Live evidence captured: 500-step from-scratch ≈ 80K-step from-scratch
Qwen2.5-Coder-0.5B-Instruct converted to APR (5.1 sec, 290 tensors)
Pre-conditions documented in evidence/model-2-strategy-pivot-2026-05-04/findings.md
CI green
Auto-merge

Plain ship %

MODEL-2: 57% (unchanged) — stays here until first fine-tune produces val_loss < 9.38 evidence
MODEL-1: 91% (unchanged)

🤖 Generated with Claude Code

…tch → pretrained-init

After operator asked "why aren't we training models?" mid-session, re-diagnosed MODEL-2 from spec §34's "capacity-limited at val_loss=9.38" framing to the empirically correct "data-limited" diagnosis. Pivots strategy from "MODEL-2 = 370M from-scratch" to "MODEL-2 = pretrained 0.5B-class fine-tuned" (industry standard for production small code-LMs).

## Live evidence (this session, 2026-05-04)

500-step `apr pretrain --mode from-scratch --device cuda` smoke on RTX 4090:

```
Run Result: OK CONVERGED final val_loss=9.7255 after 5 epoch(s)
```

Vs §24 memory's 80K-step run on the same 4× corpus: val_loss=9.7507. **Within 0.026 across a 160× difference in step count.** The ceiling is corpus-bound, not step-bound.

## Why "from-scratch" is the methodology defect

SmolLM-360M (similar 360M param count) hits val_loss ~2.9, but it was trained on **1T tokens**. MODEL-2 saw 565M, ~1800× less. The 370M-from-scratch math doesn't reach val_loss=3.0 at this corpus scale.

Industry: StableCode ← StableLM, Qwen2.5-Coder ← Qwen2.5, DeepSeek-Coder ← DeepSeek-LLM. **Nobody trains 0.5B from scratch for production code-LMs.**

## Pre-conditions verified

- ✅ Qwen2.5-Coder-0.5B-Instruct in HF cache (950 MB, model.safetensors + config + tokenizer)
- ✅ `apr convert <safetensors> --quantize fp16` → APR file: 290 tensors, 942 MiB, 5.1 sec
- ✅ APR file at `/mnt/nvme-raid0/models/qwen2.5-coder-0.5b-instruct-fp16.apr`
- ✅ RTX 4090 + cuBLAS + custom PTX backward (sm_89, no Blackwell JIT bug)
- ✅ codeparrot+CSN-Python tokenized corpus (565M tokens)

## Implementation gap (next PR)

`apr pretrain` lacks an `--init <model.apr>` flag. §49.6 step 4 calls out the ~50 LOC fix. Not in this commit.

## Net effects

- Spec v2.93.0 → **v2.94.0**.
- §34's "capacity-limited" framing RETIRED in favor of §49.1's data-limited diagnosis.
- §36.2's "distillation is the only path" REFINED: pretrained-init is load-bearing; distillation is multiplicative on top.
- **MODEL-2 ship %**: stays **57%** until first fine-tune produces val_loss < 9.38 evidence.
- **MODEL-1 ship %**: unchanged at 91%.

## Five whys (paraphrased from §49.5)

1. **Why now?** Operator pivot. Cascade was real but didn't move ship %.
2. **Why is "from-scratch" wrong?** Math: 370M × 565M tokens ≠ SmolLM territory.
3. **Why Qwen2.5-Coder-0.5B-Instruct?** Already cached, code-pretrained, similar params, permissive license.
4. **Why retain target val_loss=3.0?** Right product target (~exp(3) = ~20 perplexity = good code completion).
5. **Why isn't this just a rename?** Different init = qualitatively different artifact (good code completion vs near-random tokens).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
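The "exp(3) ≈ 20 perplexity" target and the "near-random tokens" description can be checked by converting loss to perplexity. The sketch below assumes val_loss is a mean cross-entropy in nats (the commit does not state the unit) and uses the 50_257-entry vocabulary listed for the Llama370M config elsewhere in this PR thread. The function names are illustrative.

```rust
/// Perplexity implied by a cross-entropy loss, assuming the loss is in nats.
fn perplexity(loss_nats: f64) -> f64 {
    loss_nats.exp()
}

/// Loss of a model that guesses uniformly over the whole vocabulary.
fn uniform_loss(vocab_size: u32) -> f64 {
    f64::from(vocab_size).ln()
}
```

Under these assumptions `perplexity(3.0)` is roughly 20, while `perplexity(9.7255)` is in the tens of thousands, the same order of magnitude as the uniform-guess baseline for a 50_257-token vocabulary (`uniform_loss(50_257)` is about 10.8).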

…MODEL-2 pivot (#1470)

Authors `contracts/apr-pretrain-from-init-v1.yaml` v1.0.0 PROPOSED, the contract layer driving §49 step 4 (the wire-up PR for `apr pretrain --init <model.apr>`).

§49 (spec v2.94.0, 2026-05-04) retired the from-scratch MODEL-2 strategy after 80K-step LR-budget falsification (§24.8) confirmed the corpus-bottleneck floor at val_loss=9.75. The pivot is to fine-tune a Qwen2.5-class pretrained checkpoint that has already paid the 1T-token data tax.

This contract pins the new flag's semantics:
- flag absent → existing pretrain behavior (no regression on §24/§25)
- flag present → load weights from APR file, train with that init
- flag composes with --mode {finetune, from-scratch}
- missing/corrupted/shape-mismatched APR → fail-fast non-zero exit
- load-bearing claim: init_loss(step=0) ≤ 6.0 < from_scratch_loss(step=0)
- drift-prevention: clap field + unit test + integration test in same PR

10 falsifiers (FALSIFY-APR-PRETRAIN-INIT-001..010), 6 proof obligations, 2 kani harnesses. pv validate exits 0 with 0 errors / 0 warnings.

The contract is INDEPENDENT of the SHIP-007 / Qwen2-0.5B `apr run` gibberish defect (memory entry 2026-05-04). Training forward path uses a different loader and dispatch surface than `apr run` inference; the init flag's correctness is provable independently. If the gibberish defect also infects the training forward path, FALSIFY-APR-PRETRAIN-INIT-006 (init_loss <= 6.0) will catch it as a side effect.

Five Whys:
1. Why a contract before the impl? `pv validate` rejects schema drift; the falsifier list pins what the wire-up PR must satisfy. Contract-first prevents the wire-up shipping with silent gaps.
2. Why now? Spec §49 was authored 2026-05-04 (PR #1461) and roadmap step 3 is "author this contract". Step 4 (wire-up) blocks on it.
3. Why scope to PROPOSED? No impl exists yet. Promotion to ACTIVE/FUNCTIONAL/DISCHARGED gated on the wire-up PR landing and the 10 falsifiers compile-binding (PARTIAL_ALGORITHM_LEVEL).
4. Why 10 falsifiers? Each error class is a distinct silent-failure mode. The §24 retrospective showed `--synthetic` defaulted silently to `true` for months; a contract pinning "no silent fallback" must enumerate ALL silent paths so CI catches each.
5. Why not block on the SHIP-007 / 0.5B gibberish? The two are orthogonal. SHIP-007 is `apr run` inference; this contract is `apr pretrain` training-forward. Separating keeps the wire-up PR small and lets §49 progress while SHIP-007 stays under operator-gated investigation.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §49 step 5 (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
- SPEC-SHIP-TWO-001 §49: MODEL-2 strategy pivot (#1461)
- contracts/training-loop-pretrain-v1.yaml v1.5.0 ACTIVE (parent)
- feedback_cli_subcommand_three_surface_drift.md
- feedback_no_guessing.md
- memory:project_qwen2_0_5b_is_ship_007_manifestation.md (orthogonal)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
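The flag semantics and the load-bearing loss claim from this contract can be stated as two small predicates. This is a minimal sketch under stated assumptions: `InitSource`, `resolve_init`, and `init_claim_holds` are hypothetical names for illustration, not items from the contract YAML or from `apr-cli`.

```rust
use std::path::Path;

/// Where the initial weights come from (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum InitSource<'a> {
    /// Flag absent: existing pretrain behavior is preserved.
    ExistingBehavior,
    /// Flag present: weights are loaded from the given APR file.
    FromApr(&'a Path),
}

/// Map the optional `--init` value onto the two cases the contract lists.
fn resolve_init<'a>(init: Option<&'a Path>) -> InitSource<'a> {
    match init {
        None => InitSource::ExistingBehavior,
        Some(path) => InitSource::FromApr(path),
    }
}

/// Threshold taken from the contract text: init_loss(step=0) <= 6.0.
const INIT_LOSS_CEILING: f64 = 6.0;

/// Checks init_loss(step=0) <= 6.0 < from_scratch_loss(step=0).
fn init_claim_holds(init_loss: f64, from_scratch_loss: f64) -> bool {
    init_loss <= INIT_LOSS_CEILING && INIT_LOSS_CEILING < from_scratch_loss
}
```

A step-0 loss that stays near the from-scratch value would fail `init_claim_holds`, which is how a silently ignored init file (or a broken forward path) would show up in this framing.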

Implements `apr pretrain --init <PATH>` per the contract authored in PR #1470 (`apr-pretrain-from-init-v1` v1.0.0 PROPOSED). This is §49 step 4 of the MODEL-2 pretrained-init pivot.

Spec §49 (v2.94.0, 2026-05-04, PR #1461) retired the from-scratch MODEL-2 strategy after §24.8's 80K-step LR-budget falsification confirmed val_loss=9.75 as a corpus-bottleneck floor on the 565M-token corpus. Fine-tuning a Qwen2.5-class pretrained checkpoint (which has already paid the 1T-token data tax) is the load-bearing path. This PR adds the flag that loads weights from an APR file as the initial weights for the pretrain optimizer.

What this PR adds:
1. Clap field `init: Option<PathBuf>` on the `Pretrain` variant in `extended_commands.rs:635`. Optional: absence preserves existing pretrain behavior (no §24/§25 regression).
2. Plumbing through `dispatch_analysis.rs:346` to `commands::pretrain::run` (new `init: Option<&Path>` param).
3. New helper `validate_init_apr_path()` in `pretrain.rs`:
   a. Open file → FALSIFY-003 (missing → exit non-zero)
   b. Read 4 magic bytes → FALSIFY-004 (read fails → exit non-zero)
   c. Compare magic bytes vs `APR\0` / `APRN` → FALSIFY-004
   d. If valid magic → return "not yet wired" error pointing at §49 step 5 (no silent random-init fallback)
4. 7 new unit tests in `pretrain::tests`:
   - pretrain_init_flag_absent_parses_to_none (FALSIFY-001/002)
   - pretrain_init_flag_parses_path (FALSIFY-001)
   - pretrain_init_missing_file_errors (FALSIFY-003)
   - pretrain_init_bad_magic_errors (FALSIFY-004)
   - pretrain_init_empty_file_errors (FALSIFY-004 edge)
   - pretrain_init_valid_apr_rejected_until_step5 (partial-state guard)
   - pretrain_init_v1_magic_aprn_recognised (v1 magic acceptance)
5. Contract status bump: PROPOSED → PARTIAL_ALGORITHM_LEVEL via v1.0.0 → v1.1.0 metadata update + changelog entry.

Test results (cargo test -p apr-cli --lib commands::pretrain::): 21 passed; 0 failed; 0 ignored

Step 5 follow-up scope (~150 LOC):
- Architecture matching: read APR header, compare vocab/hidden/layers/heads against pretrain target → discharges FALSIFY-005
- Actual weight load: read tensor shards, materialize into optimizer's initial state → discharges FALSIFY-006/009/010
- LIVE 500-step fine-tune on Qwen2.5-Coder-0.5B-Instruct.apr → DISCHARGED (val_loss < 9.38)

Five Whys:
1. Why a small partial-state PR instead of full step 4+5? §49 step 4 was scoped at ~50 LOC for "wire the flag"; step 5 does the full weight load. Splitting keeps each PR small, reviewable, and lets CI catch silent-fallback regressions between the two steps.
2. Why have validate_init_apr_path() reject EVERY valid APR right now? Honors the contract's no-silent-fallback invariant. If we accepted valid APRs and silently used random init while step 5 is open, an operator could ship a "fine-tune" run that's actually a from-scratch run, exactly the §24 silent-default defect class this PR is built to prevent.
3. Why a custom error message naming "§49 step 5" instead of just "not implemented"? Operators tracing a failure to the source PR can find the next-step contract obligations by grep'ing the spec; "not implemented" gives them no thread to pull. The error message IS the breadcrumb to the next-cycle work.
4. Why bump the contract from PROPOSED to PARTIAL_ALGORITHM_LEVEL in the same PR? Atomicity: the contract describes the flag's algorithm at the LEVEL we have impl evidence for. PROPOSED means "no impl"; PARTIAL means "compile-bound + algorithm-bound at sub-falsifier granularity". This PR delivers exactly that. Leaving the contract at PROPOSED while the impl is on main creates a drift between status and reality.
5. Why not implement step 5 in the same PR? The MappedAprModel architecture extraction is a deeper plumbing question (header reading, GGUF qtype decoding, optimizer state initialization) that warrants its own commit + review. Going small + atomic is the Toyota Way single-piece flow.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §49 step 5 weight-load impl + LIVE 500-step fine-tune (FALSIFY-006)

Refs:
- SPEC-SHIP-TWO-001 §49: MODEL-2 strategy pivot (#1461)
- contracts/apr-pretrain-from-init-v1.yaml v1.0.0 → v1.1.0
- PR #1470: contract authoring (merged)
- feedback_cli_subcommand_three_surface_drift.md (3-surface rule)
- feedback_no_guessing.md
- memory:project_qwen2_0_5b_is_ship_007_manifestation.md (orthogonal)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
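The four validation steps listed for `validate_init_apr_path()` can be sketched with the standard library alone. This is not the code from `pretrain.rs`: the error enum, its variant names, and the function name below are assumptions made for illustration, and only the step order and the two magic values (`APR\0`, `APRN`) come from the commit message.

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

// Magic prefixes named in the commit message.
const MAGIC_APR: [u8; 4] = *b"APR\0";
const MAGIC_APRN: [u8; 4] = *b"APRN";

/// Hypothetical error type; the real helper's error type is not shown in the PR.
#[derive(Debug, PartialEq)]
enum InitError {
    /// Step a: the file could not be opened (FALSIFY-003).
    Missing,
    /// Step b: fewer than 4 bytes could be read (FALSIFY-004).
    UnreadableMagic,
    /// Step c: the 4 bytes are not a known APR magic (FALSIFY-004).
    BadMagic([u8; 4]),
    /// Step d: valid magic, but weight loading is not wired yet.
    NotYetWired,
}

/// Sketch of the partial-state guard: every path returns an error, so a
/// valid APR file can never fall through to a silent random-init run.
fn validate_init_sketch(path: &Path) -> Result<(), InitError> {
    let mut file = File::open(path).map_err(|_| InitError::Missing)?;
    let mut magic = [0u8; 4];
    file.read_exact(&mut magic)
        .map_err(|_| InitError::UnreadableMagic)?;
    if magic != MAGIC_APR && magic != MAGIC_APRN {
        return Err(InitError::BadMagic(magic));
    }
    Err(InitError::NotYetWired)
}
```

Returning an error even for a well-formed file mirrors the design choice explained in the second "why": until the weight load exists, rejecting loudly is safer than accepting the path and training from random weights.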

…oupling finding (#1472)

Adds §50 documenting the architecture-mismatch finding caught after §49.6 steps 3+4 landed (PR #1470 contract + PR #1471 wire-up). The remaining §49.6 step 5 was scoped at "0 LOC, just run apr pretrain --init"; that assumption is empirically wrong.

Empirical finding (§50.1): pretrain_real.rs:38-46 HARDCODES Llama370MConfig::* for every architectural constant. Qwen2.5-Coder-0.5B-Instruct has a different shape across the board:

Param               | Llama370M   | Qwen2.5-Coder-0.5B
--------------------|-------------|-------------------
hidden_size         | 1024        | 896
num_attention_heads | 16          | 14
num_kv_heads        | 4 (GQA-4:1) | 2 (GQA-7:1)
intermediate_size   | 2816        | 4864
vocab_size          | 50_257      | 151_936
rope_theta          | 10_000      | 1_000_000

Every tensor mismatches. Loading Qwen2.5 weights into a Llama370M-shaped optimizer is a category error.

Three options surfaced (§50.3):
A: Find/build a Llama-shaped 0.5B pretrained checkpoint (~5K LOC + multi-week training; recreates §24/§25 corpus problem)
B: Make trainer architecture-polymorphic (~200-400 LOC; preserves §24/§25 falsification; recommended)
C: Replace Llama370MConfig with Qwen2_5_Coder_0_5B_Config outright (~300 LOC; deletes a working falsification path)

Recommendation (§50.5): Option B. It preserves §24/§25 falsification evidence, exercises TransformerConfig's designed polymorphism, and binds each new component (qwen2_0_5b constructor, GQA-7:1 attention, Qwen tokenizer surface) to its own falsifier.

Re-scoped roadmap (§50.4), 8 sub-steps replacing original step 5:
5a. Author apr-pretrain-arch-polymorphic-v1.yaml contract (~80 LOC)
5b. TransformerConfig::qwen2_0_5b() constructor (~40 LOC)
5c. Extract arch from init APR file metadata (~80 LOC)
5d. Qwen tokenizer-vocab compatibility check (~30 LOC)
5e. GQA-7:1 attention forward-pass verification (~50 LOC)
5f. Wire actual weight load (~120 LOC)
5g. LIVE 500-step smoke fine-tune (operator dispatch) 0 LOC
5h. Stamp + publish as MODEL-2 v2 (~10 LOC)
Total: ~410 LOC + 1 LIVE training run.

Five Whys (§50.6):
1. Why didn't §49 catch this? §49 was authored from strategy/data-budget reasoning; the 0-LOC step-5 cost implicitly assumed polymorphism. Live source inspection (this section's empirical move) revealed pretrain_real.rs:38-46 predates the assumption.
2. Why catch this NOW and not in step 5 implementation? Per feedback_no_guessing.md: read live source before forming implementation plan. Surfacing the mismatch BEFORE writing 200 LOC of weight-load code that fails at runtime is the cheapest place to pay cost-of-defect. The §50-prior wrong-premise PRs (#1466/#1467/#1468 closed) on the SHIP-007 / 0.5B gibberish track were the same defect class.
3. Why option B over A or C? Preserves §24/§25 falsification evidence (we KEEP knowing from-scratch fails at 9.75; we just don't ship it as MODEL-2). Exercises the polymorphism TransformerConfig was designed for. Each new component becomes its own falsifier rather than a hidden coupling.
4. Why is FALSIFY-005 the right place to fail-fast? PR #1470 already pinned "Architecture mismatch is FAIL-FAST, not silent-truncate". Step 4 (PR #1471) doesn't enforce arch matching yet; it returns "not yet wired" before getting there. So FALSIFY-005 is currently UNBOUND but its discharge gate is well-defined: read APR header, compare against pretrain target, error with names of mismatched fields.
5. Why isn't this a "punt"? A punt would say "blocked, await operator". This amendment names three options with LOC estimates, recommends one with reasoning, gives a concrete 8-step roadmap with falsifier discharge mapped to each sub-step. The work IS shippable; it's just bigger than 0 LOC.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38). Sub-steps 5a-5f can each individually move 1% with falsifier discharge (architecture-polymorphic infrastructure shipped == evidence that the §49 path is REACHABLE, not just theoretical).

Refs:
- §49: MODEL-2 strategy pivot (PR #1461)
- PR #1470: apr-pretrain-from-init-v1 v1.0.0 PROPOSED contract
- PR #1471: apr pretrain --init clap field + magic-byte validate
- feedback_no_guessing.md: read source before forming hypothesis
- feedback_fix_root_cause_never_route_around.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
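The FALSIFY-005 discharge gate described above (compare the init file's architecture against the pretrain target and report the mismatched field names) might look like the following. The struct, constants, and function names are hypothetical; the numeric values are copied from the §50.1 table, and the real APR header layout is not modeled here.

```rust
/// Architectural constants from the §50.1 table (hypothetical struct).
#[derive(Debug, Clone, Copy, PartialEq)]
struct ArchSketch {
    hidden_size: u32,
    num_attention_heads: u32,
    num_kv_heads: u32,
    intermediate_size: u32,
    vocab_size: u32,
    rope_theta: u32,
}

const LLAMA_370M: ArchSketch = ArchSketch {
    hidden_size: 1024,
    num_attention_heads: 16,
    num_kv_heads: 4,
    intermediate_size: 2816,
    vocab_size: 50_257,
    rope_theta: 10_000,
};

const QWEN25_CODER_0_5B: ArchSketch = ArchSketch {
    hidden_size: 896,
    num_attention_heads: 14,
    num_kv_heads: 2,
    intermediate_size: 4864,
    vocab_size: 151_936,
    rope_theta: 1_000_000,
};

/// Query heads per key/value head (4 for Llama370M, 7 for Qwen2.5-Coder-0.5B).
fn gqa_ratio(arch: &ArchSketch) -> u32 {
    arch.num_attention_heads / arch.num_kv_heads
}

/// Names of every field where the init checkpoint differs from the target.
fn mismatched_fields(target: &ArchSketch, init: &ArchSketch) -> Vec<&'static str> {
    let checks = [
        ("hidden_size", target.hidden_size == init.hidden_size),
        ("num_attention_heads", target.num_attention_heads == init.num_attention_heads),
        ("num_kv_heads", target.num_kv_heads == init.num_kv_heads),
        ("intermediate_size", target.intermediate_size == init.intermediate_size),
        ("vocab_size", target.vocab_size == init.vocab_size),
        ("rope_theta", target.rope_theta == init.rope_theta),
    ];
    checks.iter().filter(|c| !c.1).map(|c| c.0).collect()
}

/// Fail fast, naming the mismatched fields, instead of truncating silently.
fn check_arch(target: &ArchSketch, init: &ArchSketch) -> Result<(), String> {
    let bad = mismatched_fields(target, init);
    if bad.is_empty() {
        Ok(())
    } else {
        Err(format!("architecture mismatch in: {}", bad.join(", ")))
    }
}
```

With the two configurations from the table, `check_arch(&LLAMA_370M, &QWEN25_CODER_0_5B)` returns an error that names all six fields, matching the "every tensor mismatches" finding.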

… 5b + DEFECT FIX (#1474)

§50.4 step 5b authored a contract assuming `qwen2_0_5b()` did not exist. Live source inspection during impl revealed the constructor ALREADY EXISTS at `transformer/config.rs:156`. Reading the HF config byte-for-byte (per `feedback_no_guessing.md`) revealed a real defect:

HF config (Qwen2.5-Coder-0.5B-Instruct): tie_word_embeddings: true
Existing code (qwen2_0_5b): tie_word_embeddings: false

Fix: 1 LOC change `false → true`.

Per Qwen scaling-law convention verified against the HF cache:
- Qwen2.5-Coder-0.5B: tie=true (HF cache 2026-05-04 ✓)
- Qwen2.5-Coder-1.5B: tie=true (HF cache 2026-05-04 ✓; inherits via `..Self::qwen2_0_5b()` spread)
- Qwen2.5-Coder-7B: tie=false (HF cache 2026-05-04 ✓; explicit in qwen2_7b())

Why the defect matters: tied vs untied embeddings is a load-bearing architectural property. With tie=false (current bug), if an operator fine-tunes from a Qwen2.5-0.5B init checkpoint, the lm_head will be allocated as a separate tensor that doesn't get loaded (because the APR file only contains the embed_tokens tensor; they share weights). The result: lm_head random-initialized and untrained, producing silent gibberish at val time. This is exactly the §49 / §50 failure class the contract was authored to prevent.

What this PR adds:
1. Fix `tie_word_embeddings: false → true` in `qwen2_0_5b()` at `transformer/config.rs:156-174`
2. Add docstring noting the empirical verification + HF cache path + Qwen scaling-law quirk
3. Add 3 new unit tests in `transformer::config::tests`:
   - `qwen2_0_5b_matches_hf_config_2026_05_04` (FALSIFY-001 byte-identity verification, 11 fields)
   - `qwen2_1_5b_inherits_tie_word_embeddings_from_0_5b` (drift-prevention; catches future spread-split refactors)
   - `qwen2_7b_does_not_tie_embeddings` (drift-prevention; pins the 7B Qwen scaling-law quirk against silent flips)

Test results (cargo test -p aprender-train --lib transformer::config::tests::qwen2): 3 passed; 0 failed; 0 ignored

Discharges FALSIFY-APR-PRETRAIN-ARCH-001 in PR #1473's contract.

Five Whys:
1. Why was the constructor already there but with the wrong tie setting? Likely authored before the spec-§49 use case became the load-bearing target. The constants for `qwen2_0_5b` were correct for inference, but tie_word_embeddings is mostly a training-pipeline concern: it determines whether lm_head is a separate trainable parameter or shares with embed_tokens.
2. Why didn't pmat query / cargo test catch this earlier? Existing tests pinned shape (hidden, layers, heads, etc.) but no test verified `tie_word_embeddings`. This PR adds the missing drift-prevention test that catches the defect class.
3. Why fix this in the same PR as the test (not a separate fix)? Toyota Way: the test IS the discharge mechanism for FALSIFY-001. A test that passed against the (defective) status quo would be a liar. Fixing first + testing second guarantees the test pins correct behavior, not whatever happened to be in the code.
4. Why also pin qwen2_1_5b (inheritance) and qwen2_7b (anti-spread)? Those are drift-prevention. The spread-inheritance pattern `..Self::qwen2_0_5b()` is fragile: a future refactor could split the inheritance chain and silently flip tie_word_embeddings back to false on 1.5B. Test catches that. Similarly, an over-enthusiastic refactor could homogenize 7B with 0.5B (incorrectly setting 7B's tie=true). Test catches that too.
5. Why §50.4 step 5b was overscoped at 40 LOC: §50 was authored under the assumption that the constructor didn't exist. Live source inspection (per `feedback_no_guessing.md`) revealed the foundation was already there, just with one defect. This is the same lesson as §50 itself: read source before authoring scope. The contract from PR #1473 is still valid; only the LOC estimate in §50.4's table was wrong.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
- SPEC-SHIP-TWO-001 §49: MODEL-2 strategy pivot (#1461)
- SPEC-SHIP-TWO-001 §50: architecture-coupling finding (#1472, in flight)
- PR #1470: apr-pretrain-from-init-v1 contract (merged)
- PR #1471: apr pretrain --init wire-up (merged)
- PR #1473: apr-pretrain-arch-polymorphic-v1 contract (in flight)
- HF config: ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-Coder-{0.5B,1.5B,7B}-Instruct/.../config.json
- feedback_no_guessing.md: read source before forming hypothesis

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
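The failure mode described under "Why the defect matters" can be illustrated by listing which output-side tensors a trainer expects for each `tie_word_embeddings` setting and checking them against what a tied checkpoint provides. This is a sketch only: the function names are hypothetical, and the tensor names follow the usual Hugging Face naming rather than anything confirmed about the APR file layout.

```rust
// Tensor names in the common Hugging Face style (assumed for illustration).
const EMBED_TOKENS: &str = "model.embed_tokens.weight";
const LM_HEAD: &str = "lm_head.weight";

/// Embedding-related tensors the trainer allocates for a given tie setting.
fn required_tensors(tie_word_embeddings: bool) -> Vec<&'static str> {
    if tie_word_embeddings {
        // The output projection reuses the embedding matrix.
        vec![EMBED_TOKENS]
    } else {
        // The output projection is a separate trainable tensor.
        vec![EMBED_TOKENS, LM_HEAD]
    }
}

/// Tensors the trainer needs that the checkpoint does not provide; these
/// would keep their random initialization after the load.
fn left_uninitialized(tie_word_embeddings: bool, checkpoint: &[&str]) -> Vec<&'static str> {
    required_tensors(tie_word_embeddings)
        .into_iter()
        .filter(|name| !checkpoint.contains(name))
        .collect()
}
```

For a tied checkpoint that stores only the embedding matrix, `left_uninitialized(false, &[EMBED_TOKENS])` returns the `lm_head` entry, which corresponds to the randomly initialized head described above, while `left_uninitialized(true, &[EMBED_TOKENS])` is empty.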

noahgift enabled auto-merge (squash) May 4, 2026 07:58

noahgift merged commit ff689bd into main May 4, 2026
11 checks passed

noahgift deleted the spec/v2-94-model-2-pretrained-init-pivot branch May 4, 2026 08:22

noahgift mentioned this pull request May 4, 2026

contract(apr-pretrain-from-init-v1): v1.0.0 PROPOSED — §49 step 3 of MODEL-2 pivot #1470

Merged

5 tasks

noahgift mentioned this pull request May 4, 2026

feat(apr-cli): wire apr pretrain --init <model.apr> — §49 step 4 #1471

Merged

6 tasks

noahgift mentioned this pull request May 4, 2026

spec(ship-two-models): v2.94.0 → v2.95.0 — §50 MODEL-2 architecture-coupling finding #1472

Merged

4 tasks

noahgift merged 1 commit into main from spec/v2-94-model-2-pretrained-init-pivot
