spec(ship-two-models): v2.94.0 → v2.95.0 — §50 MODEL-2 architecture-coupling finding by noahgift · Pull Request #1472 · paiml/aprender

noahgift · 2026-05-04T14:17:46Z

Summary

Adds §50 documenting the architecture-mismatch finding caught after §49.6 steps 3+4 landed (PR #1470 contract + PR #1471 wire-up). The remaining §49.6 step 5 was scoped at "0 LOC, just run apr pretrain --init" — that assumption is empirically wrong.

The empirical finding

pretrain_real.rs:38-46 HARDCODES Llama370MConfig::* for every architectural constant. Qwen2.5-Coder-0.5B-Instruct has a different shape across the board:

| Param | Llama370M | Qwen2.5-Coder-0.5B |
|---|---|---|
| hidden_size | 1024 | 896 |
| num_attention_heads | 16 | 14 |
| num_kv_heads | 4 (GQA-4:1) | 2 (GQA-7:1) |
| intermediate_size | 2816 | 4864 |
| vocab_size | 50_257 | 151_936 |
| rope_theta | 10_000 | 1_000_000 |

Every tensor mismatches. Loading Qwen2.5 weights into a Llama370M-shaped optimizer is a category error — --init <Qwen2.5-Coder-0.5B-Instruct.apr> will fail at FALSIFY-005 (architecture mismatch) the moment step 5's arch-check runs.

Three options + recommendation

| Option | Description | LOC | Risk |
|---|---|---|---|
| A | Find/build Llama-shaped 0.5B checkpoint | ~5K + multi-week training | High: recreates §24/§25 corpus problem |
| B ⭐ | Make trainer arch-polymorphic | ~200-400 | Medium: preserves §24/§25 falsification |
| C | Replace Llama370MConfig with Qwen-only | ~300 | Medium: deletes a working falsification |

Recommend Option B — preserves §24/§25 falsification evidence (we KEEP knowing from-scratch fails at 9.75; we just don't ship it as MODEL-2). Exercises TransformerConfig's designed polymorphism. Each new component (Qwen tokenizer, GQA-7:1, extracted-arch loader) becomes its own falsifier.

Re-scoped roadmap (replaces single step 5)

| # | Step | LOC | Falsifier |
|---|---|---|---|
| 5a | Author `apr-pretrain-arch-polymorphic-v1.yaml` contract | ~80 | New |
| 5b | `TransformerConfig::qwen2_0_5b()` constructor | ~40 | architecture-requirements-v1 |
| 5c | Refactor `pretrain_real::build_transformer_config()` to read from init APR metadata | ~80 | apr-pretrain-from-init-v1 FALSIFY-005 |
| 5d | Qwen tokenizer-vocab compatibility check | ~30 | gate-arch-370M-011 update |
| 5e | GQA-7:1 attention forward-pass verification | ~50 | gqa-kernel-v1 |
| 5f | Wire actual weight load | ~120 | FALSIFY-006/009/010 |
| 5g | LIVE 500-step smoke fine-tune (operator dispatch) | 0 | FALSIFY-006 DISCHARGED |
| 5h | Stamp + publish | ~10 | (existing) |

Total: ~410 LOC + 1 LIVE training run — not 0 LOC.

Plain ship-% update

MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
MODEL-2: unchanged at 57% until 5g produces val_loss < 9.38. Sub-steps 5a-5f can each individually move 1% with falsifier discharge.
Coverage tally unchanged (architecture finding, not falsifier flip).

Five Whys

Why didn't §49 catch this? §49 was authored from strategy/data-budget reasoning. The 0-LOC step-5 cost implicitly assumed polymorphism. Live source inspection (this section's empirical move) revealed pretrain_real.rs:38-46 predates the assumption.
Why catch this NOW and not in step 5 implementation? Per feedback_no_guessing.md: read the live source before forming the implementation plan. Surfacing the mismatch BEFORE writing 200 LOC that fail at runtime is the cheapest place to pay the cost of the defect. The wrong-premise PRs that preceded §50 (#1466, #1467, #1468; all closed) on the SHIP-007 / 0.5B gibberish track were the same defect class.
Why option B over A or C? Preserves §24/§25 falsification evidence. Exercises TransformerConfig's designed polymorphism. Each new component becomes its own falsifier rather than a hidden coupling.
Why is FALSIFY-005 the right place to fail-fast? PR #1470 already pinned "Architecture mismatch is FAIL-FAST, not silent-truncate". Step 4 (PR #1471) doesn't enforce arch matching yet. So FALSIFY-005 is currently UNBOUND, but its discharge gate is well-defined: read the APR header, compare it against the pretrain target, and error with the names of the mismatched fields.
Why isn't this a "punt"? A punt would say "blocked, await operator". This amendment names three options with LOC estimates, recommends one with reasoning, gives a concrete 8-step roadmap with falsifier discharge mapped to each sub-step.
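The FALSIFY-005 discharge gate described above (read the header, compare against the target, report the mismatched field names) can be sketched roughly as follows. This is a minimal illustration, not the project's code: `ArchSummary`, `mismatched_fields`, and `check_arch` are hypothetical names, and the only values taken from this PR are the six constants in the comparison table.

```rust
/// Hypothetical stand-in for the six architectural constants in the table above.
#[derive(Debug, Clone, PartialEq)]
pub struct ArchSummary {
    pub hidden_size: usize,
    pub num_attention_heads: usize,
    pub num_kv_heads: usize,
    pub intermediate_size: usize,
    pub vocab_size: usize,
    pub rope_theta: f64,
}

/// Llama370M column of the table.
pub fn llama_370m() -> ArchSummary {
    ArchSummary {
        hidden_size: 1024,
        num_attention_heads: 16,
        num_kv_heads: 4,
        intermediate_size: 2816,
        vocab_size: 50_257,
        rope_theta: 10_000.0,
    }
}

/// Qwen2.5-Coder-0.5B column of the table.
pub fn qwen2_5_coder_0_5b() -> ArchSummary {
    ArchSummary {
        hidden_size: 896,
        num_attention_heads: 14,
        num_kv_heads: 2,
        intermediate_size: 4864,
        vocab_size: 151_936,
        rope_theta: 1_000_000.0,
    }
}

/// Returns the name of every field on which `init` and `target` disagree.
pub fn mismatched_fields(init: &ArchSummary, target: &ArchSummary) -> Vec<&'static str> {
    let mut bad = Vec::new();
    if init.hidden_size != target.hidden_size { bad.push("hidden_size"); }
    if init.num_attention_heads != target.num_attention_heads { bad.push("num_attention_heads"); }
    if init.num_kv_heads != target.num_kv_heads { bad.push("num_kv_heads"); }
    if init.intermediate_size != target.intermediate_size { bad.push("intermediate_size"); }
    if init.vocab_size != target.vocab_size { bad.push("vocab_size"); }
    if init.rope_theta != target.rope_theta { bad.push("rope_theta"); }
    bad
}

/// Fail-fast gate: an error that names the mismatched fields, never a silent truncate.
pub fn check_arch(init: &ArchSummary, target: &ArchSummary) -> Result<(), String> {
    let bad = mismatched_fields(init, target);
    if bad.is_empty() {
        Ok(())
    } else {
        Err(format!("architecture mismatch on: {}", bad.join(", ")))
    }
}
```

Under these assumptions, comparing the Qwen column against the Llama370M column reports all six fields, which is the "every tensor mismatches" situation the finding describes.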

Test plan

Spec text validates — markdown lint OK
Pre-commit quality gates pass
CI checks (gate, test, lint, coverage, security)
No spec-to-code drift introduced (spec-only change)

Refs

§49: MODEL-2 strategy pivot (PR #1461)
PR #1470: apr-pretrain-from-init-v1 v1.0.0 PROPOSED contract (merged)
PR #1471: apr pretrain --init clap field + magic-byte validate (merged)
feedback_no_guessing.md: read source before forming hypothesis
feedback_fix_root_cause_never_route_around.md

🤖 Generated with Claude Code

…oupling finding

Adds §50 documenting the architecture-mismatch finding caught after §49.6 steps 3+4 landed (PR #1470 contract + PR #1471 wire-up). The remaining §49.6 step 5 was scoped at "0 LOC, just run apr pretrain --init"; that assumption is empirically wrong.

Empirical finding (§50.1): pretrain_real.rs:38-46 HARDCODES Llama370MConfig::* for every architectural constant. Qwen2.5-Coder-0.5B-Instruct has a different shape across the board:

Param               | Llama370M   | Qwen2.5-Coder-0.5B
--------------------|-------------|-------------------
hidden_size         | 1024        | 896
num_attention_heads | 16          | 14
num_kv_heads        | 4 (GQA-4:1) | 2 (GQA-7:1)
intermediate_size   | 2816        | 4864
vocab_size          | 50_257      | 151_936
rope_theta          | 10_000      | 1_000_000

Every tensor mismatches. Loading Qwen2.5 weights into a Llama370M-shaped optimizer is a category error.

Three options surfaced (§50.3):
A: Find/build a Llama-shaped 0.5B pretrained checkpoint (~5K LOC + multi-week training; recreates §24/§25 corpus problem)
B: Make trainer architecture-polymorphic (~200-400 LOC; preserves §24/§25 falsification; recommended)
C: Replace Llama370MConfig with Qwen2_5_Coder_0_5B_Config outright (~300 LOC; deletes a working falsification path)

Recommendation (§50.5): Option B. It preserves §24/§25 falsification evidence, exercises TransformerConfig's designed polymorphism, and binds each new component (qwen2_0_5b constructor, GQA-7:1 attention, Qwen tokenizer surface) to its own falsifier.

Re-scoped roadmap (§50.4), 8 sub-steps replacing original step 5:
5a. Author apr-pretrain-arch-polymorphic-v1.yaml contract (~80 LOC)
5b. TransformerConfig::qwen2_0_5b() constructor (~40 LOC)
5c. Extract arch from init APR file metadata (~80 LOC)
5d. Qwen tokenizer-vocab compatibility check (~30 LOC)
5e. GQA-7:1 attention forward-pass verification (~50 LOC)
5f. Wire actual weight load (~120 LOC)
5g. LIVE 500-step smoke fine-tune (operator dispatch) 0 LOC
5h. Stamp + publish as MODEL-2 v2 (~10 LOC)
Total: ~410 LOC + 1 LIVE training run.

Five Whys (§50.6):
1. Why didn't §49 catch this? §49 was authored from strategy/data-budget reasoning; the 0-LOC step-5 cost implicitly assumed polymorphism. Live source inspection (this section's empirical move) revealed pretrain_real.rs:38-46 predates the assumption.
2. Why catch this NOW and not in step 5 implementation? Per feedback_no_guessing.md: read live source before forming implementation plan. Surfacing the mismatch BEFORE writing 200 LOC of weight-load code that fails at runtime is the cheapest place to pay cost-of-defect. The §50-prior wrong-premise PRs (#1466/#1467/#1468 closed) on the SHIP-007 / 0.5B gibberish track were the same defect class.
3. Why option B over A or C? Preserves §24/§25 falsification evidence (we KEEP knowing from-scratch fails at 9.75; we just don't ship it as MODEL-2). Exercises the polymorphism TransformerConfig was designed for. Each new component becomes its own falsifier rather than a hidden coupling.
4. Why is FALSIFY-005 the right place to fail-fast? PR #1470 already pinned "Architecture mismatch is FAIL-FAST, not silent-truncate". Step 4 (PR #1471) doesn't enforce arch matching yet; it returns "not yet wired" before getting there. So FALSIFY-005 is currently UNBOUND but its discharge gate is well-defined: read APR header, compare against pretrain target, error with names of mismatched fields.
5. Why isn't this a "punt"? A punt would say "blocked, await operator". This amendment names three options with LOC estimates, recommends one with reasoning, gives a concrete 8-step roadmap with falsifier discharge mapped to each sub-step. The work IS shippable; it's just bigger than 0 LOC.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38). Sub-steps 5a-5f can each individually move 1% with falsifier discharge (architecture-polymorphic infrastructure shipped == evidence that the §49 path is REACHABLE, not just theoretical).

Refs:
- §49: MODEL-2 strategy pivot (PR #1461)
- PR #1470: apr-pretrain-from-init-v1 v1.0.0 PROPOSED contract
- PR #1471: apr pretrain --init clap field + magic-byte validate
- feedback_no_guessing.md: read source before forming hypothesis
- feedback_fix_root_cause_never_route_around.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…#1478)

Adds `falsify_apr_pretrain_arch_004_gqa_7_1_forward_pass_smoke`, which constructs a tiny GQA-7:1 transformer (kv_heads=2, query_heads=14, hidden=112=14*8, head_dim=8, mimicking Qwen2.5-Coder-0.5B's GQA ratio) and verifies that the forward pass:
- runs without panic
- returns the correct shape (seq_len * vocab_size)
- produces all-finite logits (no NaN, no Inf)

Discharges from `apr-pretrain-arch-polymorphic-v1` (PR #1473):
- FALSIFY-APR-PRETRAIN-ARCH-004 at SMOKE level: kernel handles GQA-7:1 without per-ratio specialization. Full numerical-parity vs GQA-1:1 reference (cosine ≥ 0.9999) is a FUNCTIONAL-level discharge, not PARTIAL_ALGORITHM_LEVEL.

Why this matters: the existing aprender-train Llama370M codepath only empirically exercised GQA-4:1 (kv_heads=4, query_heads=16). Qwen2.5-0.5B (the §49 fine-tune init source) uses GQA-7:1. Without this test, a future refactor of the attention kernel could silently break the 7:1 case while keeping 4:1 working, which is exactly the §24 silent-failure class. The test runs in <1ms (tiny shape: hidden=112, vocab=256, layers=1).

Drift-prevention: also asserts the GQA ratio at construction time, so a typo in num_attention_heads or num_kv_heads is caught before the forward pass even runs.

Test results (cargo test -p aprender-train --lib transformer::model::tests::falsify_apr_pretrain_arch_004): 1 passed; 0 failed; 0 ignored

Five Whys:
1. Why a smoke test, not a numerical-parity test? PARTIAL_ALGORITHM_LEVEL requires only "compile + run + finite". FUNCTIONAL would require cosine vs reference. Smoke is the right scope for §50.4 step 5e; full parity is a follow-up if FALSIFY-006 (init_loss < 6.0) ever fails on the LIVE 500-step run.
2. Why num_attention_heads=14 (Qwen2.5-0.5B exact) and not e.g. 7 (smaller test model)? The Qwen2.5-0.5B-canonical 14/2=7 ratio is the load-bearing GQA shape. A 7/1 ratio would also be 7:1 but wouldn't exercise the multi-query-head-per-kv-head dispatch on more than one query group. 14/2 forces 2 query groups, each with 7 heads: the actual production shape.
3. Why use_bias=true and tie_word_embeddings=true? Mirror the Qwen2 scaling-law defaults verified by PR #1474 (the `qwen2_0_5b()` HF config check). If the test used the Llama defaults (use_bias=false, tie=false), it wouldn't catch a regression in the bias-add or embedding-tie code paths under the Qwen variant.
4. Why include the all-finite check, not just shape? §24's retrospective showed silent NaN propagation through GQA can produce loss=NaN that the divergence guard catches LATE (multiple steps in). The smoke test catches it at the first forward pass, before any optimizer state corrupts.
5. Why is this a SEPARATE test, not an extension of `test_transformer_tiny_forward`? The existing tiny() config uses defaults that may include GQA=1:1 (no GQA at all). A separate test makes the GQA-7:1 assertion auditable: `cargo test gqa_7_1` finds it directly, and contract drift between this test and FALSIFY-004 is detectable via grep.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
- SPEC-SHIP-TWO-001 §50: MODEL-2 architecture-coupling (PR #1472, MERGED)
- PR #1473: apr-pretrain-arch-polymorphic-v1 contract (in flight)
- PR #1474: qwen2_0_5b tie_word_embeddings fix (in flight)
- PR #1475: build_transformer_config polymorphic dispatch (in flight)
- PR #1476: preflight_tokenizer_vocab_matches_target (in flight)
- feedback_no_guessing.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
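The shape arithmetic in the #1478 commit message (14 query heads over 2 KV heads, hidden=112, head_dim=8) can be checked with a few helper functions. The sketch below is illustrative only: the function names are hypothetical, and the contiguous query-to-KV mapping (`q_head / ratio`) is the conventional GQA grouping, assumed here rather than confirmed from the aprender-train kernel.

```rust
/// Query heads per KV head, or an error if the head counts do not divide evenly.
/// (Hypothetical helper; mirrors the "assert the ratio at construction time" idea.)
pub fn gqa_ratio(num_attention_heads: usize, num_kv_heads: usize) -> Result<usize, String> {
    if num_kv_heads == 0 || num_attention_heads % num_kv_heads != 0 {
        return Err(format!(
            "invalid GQA shape: {num_attention_heads} query heads over {num_kv_heads} kv heads"
        ));
    }
    Ok(num_attention_heads / num_kv_heads)
}

/// KV head shared by a given query head, assuming contiguous grouping.
pub fn kv_head_for_query(q_head: usize, ratio: usize) -> usize {
    q_head / ratio
}

/// Per-head dimension, or None if hidden_size is not a multiple of the head count.
pub fn head_dim(hidden_size: usize, num_attention_heads: usize) -> Option<usize> {
    if num_attention_heads == 0 || hidden_size % num_attention_heads != 0 {
        None
    } else {
        Some(hidden_size / num_attention_heads)
    }
}
```

With the test's shape, query heads 0 through 6 map to KV head 0 and heads 7 through 13 map to KV head 1, which is why 14/2 exercises two query groups while a 7/1 shape would exercise only one.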

… §50.4 step 5d (#1476)

Refactors `preflight_tokenizer_vocab_matches_model(tokenizer_dir)` → `preflight_tokenizer_vocab_matches_target(tokenizer_dir, target_vocab_size)` so the GATE-ARCH-370M-011 preflight no longer hardcodes the Llama370M baseline. When --init wires up (§50.4 step 5f), the caller passes the EXTRACTED arch's vocab_size (e.g., 151_936 for Qwen2.5-0.5B); for now (init=None), the only existing caller passes Llama370MConfig::VOCAB_SIZE explicitly, preserving regression-free behavior on the §24/§25 from-scratch path.

Discharges from `apr-pretrain-arch-polymorphic-v1` (PR #1473):
- FALSIFY-APR-PRETRAIN-ARCH-005: Qwen tokenizer (151_936) PASSES preflight when target is Qwen-shaped
- FALSIFY-APR-PRETRAIN-ARCH-006: Qwen tokenizer (151_936) FAILS preflight when target is Llama-shaped (the silent-pass class)

What this PR adds:
1. Renamed function `preflight_tokenizer_vocab_matches_model` → `preflight_tokenizer_vocab_matches_target` with new `target_vocab_size: usize` parameter
2. Updated 4 callers (1 production at line 361 + 3 existing tests) to pass `Llama370MConfig::VOCAB_SIZE` explicitly: same behavior, now visible at the call site
3. 2 new unit tests in `commands::pretrain::tests`:
   - preflight_qwen_vocab_passes_with_qwen_target (FALSIFY-005)
   - preflight_qwen_vocab_fails_with_llama_target (FALSIFY-006 with error-message assertion that names BOTH vocab sizes)
4. Updated docstring noting the §50 polymorphism + cross-references to contract `apr-pretrain-arch-polymorphic-v1`

Test results (cargo test -p apr-cli --lib commands::pretrain::tests::preflight): 5 passed; 0 failed; 0 ignored
- preflight_accepts_matching_vocab (regression-free, unchanged behavior)
- preflight_rejects_tokenizer_vocab_mismatch (regression-free)
- preflight_rejects_missing_vocab_json (regression-free)
- preflight_qwen_vocab_passes_with_qwen_target (NEW, FALSIFY-005)
- preflight_qwen_vocab_fails_with_llama_target (NEW, FALSIFY-006)

Five Whys:
1. Why polymorphic preflight NOW rather than at step 5f? Each step gets its own falsifier discharge. Step 5d's invariant ("preflight gates by EXTRACTED vocab when --init is set, by Llama370M vocab when --init is absent") is independently testable WITHOUT actually reading an APR file. Authoring the test now pins the algorithm before the I/O integration arrives.
2. Why rename `_matches_model` → `_matches_target`? Old name implied "matches the model" with a fixed/canonical model. New name reflects the polymorphic dispatch where the target depends on call-site context. The rename is a one-time cost; staying with the old name would ossify the misleading abstraction.
3. Why pass target as parameter rather than extract from PretrainConfig? Decoupling: PretrainConfig already exists and is wired through many callers. Adding a new field to PretrainConfig would create a parallel drift surface (every constructor of PretrainConfig must remember to set it correctly). A function parameter forces every call site to decide explicitly, which is exactly the contract's "no silent defaults" invariant.
4. Why 2 new tests not 1? The two falsifiers (-005 and -006) are mutually exclusive proofs of the polymorphism:
   - 005 (positive): Qwen+Qwen target = pass
   - 006 (negative): Qwen+Llama target = fail
   Without the negative case, a regression that always returns Ok would silently pass the positive case. The pair pins the dispatch.
5. Why does FALSIFY-006 assert BOTH vocab sizes appear in the error message? Operator-experience: when a fine-tune fails with "tokenizer vocab mismatch", the operator needs to see WHICH tokenizer (151_936) and WHICH target (50_257), not just an abstract "they don't match" error. The dual-name requirement prevents lossy error messages during the §49 strategy pivot.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
- SPEC-SHIP-TWO-001 §50: MODEL-2 architecture-coupling (#1472, in flight)
- PR #1473: apr-pretrain-arch-polymorphic-v1 contract (in flight)
- PR #1474: qwen2_0_5b tie_word_embeddings fix (in flight)
- PR #1475: build_transformer_config polymorphic dispatch (in flight)
- feedback_no_guessing.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
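The positive/negative falsifier pair from the #1476 commit message can be illustrated with a reduced form of the check. This is a sketch under assumptions: the real `preflight_tokenizer_vocab_matches_target` takes a tokenizer directory, whereas the hypothetical `check_vocab_matches_target` below takes the tokenizer's vocab size directly so the dispatch logic can be shown without file I/O. The error wording is also an assumption.

```rust
/// Hypothetical, I/O-free reduction of the preflight: the caller supplies both sizes.
/// There is no default for `target_vocab_size`, so every call site must decide explicitly.
pub fn check_vocab_matches_target(
    tokenizer_vocab_size: usize,
    target_vocab_size: usize,
) -> Result<(), String> {
    if tokenizer_vocab_size == target_vocab_size {
        Ok(())
    } else {
        // Name BOTH sizes so the operator can see which side is wrong.
        Err(format!(
            "tokenizer vocab mismatch: tokenizer has {tokenizer_vocab_size} entries, \
             target expects {target_vocab_size}"
        ))
    }
}
```

A Qwen-sized tokenizer (151_936) against a Qwen-sized target passes, and the same tokenizer against the Llama370M size (50_257) fails with both numbers in the message. An implementation that always returned `Ok` would pass the first case and fail the second, which is why the commit keeps the two tests as a pair.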

… (7/8 falsifiers bound) (#1480)

Same-day continuation cycle landed 8 PRs across the §50.4 architecture-polymorphic infrastructure track. §51 records the cascade-complete state and pinpoints the remaining MODEL-2 ship-% gate (step 5g LIVE).

Falsifier-discharge scoreboard for `apr-pretrain-arch-polymorphic-v1`:

| ID  | What it pins                          | PR    | Status  |
|-----|---------------------------------------|-------|---------|
| 001 | qwen2_0_5b matches HF + tie fix       | #1474 | PARTIAL |
| 002 | init=None preserves Llama370M         | #1475 | PARTIAL |
| 003 | init=Some pass-through                | #1475 | PARTIAL |
| 004 | GQA-7:1 forward smoke                 | #1478 | MERGED  |
| 005 | Qwen tokenizer + Qwen target = pass   | #1476 | MERGED  |
| 006 | Qwen tokenizer + Llama target = fail  | #1476 | MERGED  |
| 007 | encoder/decoder family mismatch       | #1479 | PARTIAL |
| 008 | pv validate                           | #1473 | PARTIAL |

7 of 8 falsifiers PARTIAL_ALGORITHM_LEVEL or MERGED.

Remaining work:
- 5f.2: wire APR file open + tensor materialization (~80 LOC). DELIBERATELY DEFERRED this cycle; doing 5f.2 now means rebasing onto 4 in-flight PRs as they land
- 5g: LIVE 500-step smoke fine-tune (operator dispatch). THE LOAD-BEARING TEST that moves MODEL-2 ship-%
- 5h: stamp + publish

Per §47-§48 lesson: "infrastructure shipped ≠ ship-% movement." Cascade-complete state means the polymorphic foundation is in place; ship-% movement still requires the LIVE empirical check.

Five Whys:
1. Why a snapshot now? Multiple PRs in cascade auto-merge create cognitive load. A spec snapshot captures both the achievement (7 falsifiers bound) and the remaining gate (step 5g LIVE). Without it, future operators waste cycles re-deriving the state.
2. Why focus on falsifier scoreboard rather than total LOC? Falsifier discharge is the actual contract obligation. 7/8 invariants pinned means CI now catches regressions in the polymorphic-init path.
3. Why mention 5f.2 explicitly as deliberately deferred? Naming the deferral makes it not a punt. Step 5f.2 has a clear "when": after the 4 in-flight PRs cascade-merge, then 5f.2 lands clean.
4. Why call out infrastructure ≠ ship-%? The §47-§48 cascade taught the same lesson: "11 SHIP-007 cascade PRs landed but no ship-% movement." Operator-facing ship-% is the LIVE check.
5. Why is FALSIFY-006 LIVE the load-bearing claim? init_loss(step=0) ≤ 6.0 vs from_scratch_loss(step=0) ≥ 9.5 proves end-to-end correctness in one number. No other falsifier can substitute.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Spec amendment cadence: §41 → §42 → §43 → §44 → §45 → §46 → §47 → §48 → §49 → §50 → §51. Eleven amendments since 2026-05-03. Same-day spec hygiene rather than letting the cascade-complete state remain implicit.

Refs:
- SPEC-SHIP-TWO-001 §50: architecture-coupling finding (PR #1472, MERGED)
- PR #1473: apr-pretrain-arch-polymorphic-v1 contract (in flight)
- PR #1474: qwen2_0_5b tie_word_embeddings fix (in flight)
- PR #1475: build_transformer_config polymorphic dispatch (in flight)
- PR #1476: preflight_tokenizer_vocab_matches_target (MERGED)
- PR #1478: GQA-7:1 forward-pass smoke test (MERGED)
- PR #1479: validate_pretrain_init_arch_compatible (in flight)
- feedback_no_guessing.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

… 5b + DEFECT FIX (#1474)

§50.4 step 5b authored a contract assuming `qwen2_0_5b()` did not exist. Live source inspection during impl revealed the constructor ALREADY EXISTS at `transformer/config.rs:156`. Reading the HF config byte-for-byte (per `feedback_no_guessing.md`) revealed a real defect:

HF config (Qwen2.5-Coder-0.5B-Instruct): tie_word_embeddings: true
Existing code (qwen2_0_5b): tie_word_embeddings: false

Fix: 1 LOC change `false → true`.

Per Qwen scaling-law convention verified against the HF cache:
- Qwen2.5-Coder-0.5B: tie=true (HF cache 2026-05-04 ✓)
- Qwen2.5-Coder-1.5B: tie=true (HF cache 2026-05-04 ✓; inherits via `..Self::qwen2_0_5b()` spread)
- Qwen2.5-Coder-7B: tie=false (HF cache 2026-05-04 ✓; explicit in qwen2_7b())

Why the defect matters: tied vs untied embeddings is a load-bearing architectural property. With tie=false (current bug), if an operator fine-tunes from a Qwen2.5-0.5B init checkpoint, the lm_head will be allocated as a separate tensor that doesn't get loaded (because the APR file only contains the embed_tokens tensor; they share weights). The result: lm_head random-initialized and untrained, producing silent gibberish at val time. This is exactly the §49 / §50 failure class the contract was authored to prevent.

What this PR adds:
1. Fix `tie_word_embeddings: false → true` in `qwen2_0_5b()` at `transformer/config.rs:156-174`
2. Add docstring noting the empirical verification + HF cache path + Qwen scaling-law quirk
3. Add 3 new unit tests in `transformer::config::tests`:
   - `qwen2_0_5b_matches_hf_config_2026_05_04` (FALSIFY-001 byte-identity verification, 11 fields)
   - `qwen2_1_5b_inherits_tie_word_embeddings_from_0_5b` (drift-prevention; catches future spread-split refactors)
   - `qwen2_7b_does_not_tie_embeddings` (drift-prevention; pins the 7B Qwen scaling-law quirk against silent flips)

Test results (cargo test -p aprender-train --lib transformer::config::tests::qwen2): 3 passed; 0 failed; 0 ignored

Discharges FALSIFY-APR-PRETRAIN-ARCH-001 in PR #1473's contract.

Five Whys:
1. Why was the constructor already there but with the wrong tie setting? Likely authored before the spec-§49 use case became the load-bearing target. The constants for `qwen2_0_5b` were correct for inference, but tie_word_embeddings is mostly a training-pipeline concern: it determines whether lm_head is a separate trainable parameter or shares with embed_tokens.
2. Why didn't pmat query / cargo test catch this earlier? Existing tests pinned shape (hidden, layers, heads, etc.) but no test verified `tie_word_embeddings`. This PR adds the missing drift-prevention test that catches the defect class.
3. Why fix this in the same PR as the test (not a separate fix)? Toyota Way: the test IS the discharge mechanism for FALSIFY-001. A test that passed against the (defective) status quo would be a liar. Fixing first + testing second guarantees the test pins correct behavior, not whatever happened to be in the code.
4. Why also pin qwen2_1_5b (inheritance) and qwen2_7b (anti-spread)? Those are drift-prevention. The spread-inheritance pattern `..Self::qwen2_0_5b()` is fragile: a future refactor could split the inheritance chain and silently flip tie_word_embeddings back to false on 1.5B. Test catches that. Similarly, an over-enthusiastic refactor could homogenize 7B with 0.5B (incorrectly setting 7B's tie=true). Test catches that too.
5. Why §50.4 step 5b was overscoped at 40 LOC: §50 was authored under the assumption that the constructor didn't exist. Live source inspection (per `feedback_no_guessing.md`) revealed the foundation was already there, just with one defect. This is the same lesson as §50 itself: read source before authoring scope. The contract from PR #1473 is still valid; only the LOC estimate in §50.4's table was wrong.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
- SPEC-SHIP-TWO-001 §49: MODEL-2 strategy pivot (#1461)
- SPEC-SHIP-TWO-001 §50: architecture-coupling finding (#1472, in flight)
- PR #1470: apr-pretrain-from-init-v1 contract (merged)
- PR #1471: apr pretrain --init wire-up (merged)
- PR #1473: apr-pretrain-arch-polymorphic-v1 contract (in flight)
- HF config: ~/.cache/huggingface/hub/models--Qwen--Qwen2.5-Coder-{0.5B,1.5B,7B}-Instruct/.../config.json
- feedback_no_guessing.md: read source before forming hypothesis

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
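The spread-inheritance fragility discussed in the #1474 commit message comes from Rust's struct update syntax: any field not named explicitly is copied from the base expression. The sketch below uses a hypothetical two-field `MiniConfig` (not the real `TransformerConfig`) to show how 1.5B inherits the tie setting from 0.5B while 7B sets it explicitly; only the three tie values are taken from the commit message.

```rust
/// Hypothetical two-field stand-in for the real config struct.
#[derive(Debug, Clone, PartialEq)]
pub struct MiniConfig {
    pub label: &'static str,
    pub tie_word_embeddings: bool,
}

impl MiniConfig {
    pub fn qwen2_0_5b() -> Self {
        Self { label: "qwen2-0.5b", tie_word_embeddings: true }
    }

    /// Inherits every unnamed field, including `tie_word_embeddings`, from 0.5B.
    pub fn qwen2_1_5b() -> Self {
        Self { label: "qwen2-1.5b", ..Self::qwen2_0_5b() }
    }

    /// Sets the tie flag explicitly, so a change to 0.5B cannot leak into 7B.
    pub fn qwen2_7b() -> Self {
        Self { label: "qwen2-7b", tie_word_embeddings: false }
    }

    /// With untied embeddings the trainer expects a separate lm_head tensor.
    pub fn expects_separate_lm_head(&self) -> bool {
        !self.tie_word_embeddings
    }
}
```

In this sketch, flipping the 0.5B value silently flips 1.5B as well, which is the drift the inheritance test is meant to pin, while the 7B constructor is unaffected.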

… §50.4 step 5c (#1475)

Adds `pretrain_real::build_transformer_config(init: Option<&TransformerConfig>)` which dispatches between the §24/§25 from-scratch baseline (Llama370M) and a caller-extracted init config (e.g., Qwen2.5-Coder-0.5B).

Discharges from `apr-pretrain-arch-polymorphic-v1` (PR #1473):
- FALSIFY-APR-PRETRAIN-ARCH-002: init=None preserves Llama370M baseline byte-for-byte (no §24/§25 regression)
- FALSIFY-APR-PRETRAIN-ARCH-003: init=Some passes through caller-extracted config (no silent defaults)

Design decision: the polymorphic builder takes a `TransformerConfig`, NOT a path to an APR file. The caller (apr-cli, in step 5d/5f) is responsible for reading the APR file and producing the config, typically via `TransformerConfig::from_apr_metadata()` (already at config.rs:264). This decoupling keeps `aprender-train` free of `aprender-serve` (the APR loader) as a build dep, preserving compile-graph hygiene.

What this PR adds:
1. `pub fn build_transformer_config(init: Option<&TransformerConfig>) -> TransformerConfig` at pretrain_real.rs (~25 LOC including doc comment)
2. 3 unit tests in pretrain_real::tests:
   - build_transformer_config_no_init_matches_llama370m (FALSIFY-002)
   - build_transformer_config_qwen_init_matches_input (FALSIFY-003)
   - build_transformer_config_dispatch_mutually_exclusive (drift-prevention)

Test results (cargo test -p aprender-train --lib train::pretrain_real::tests::build_transformer_config): 3 passed; 0 failed; 0 ignored

Five Whys:
1. Why a polymorphic builder NOW, not when pretrain_real was first written? §24/§25 only exercised the from-scratch path (Llama370M architecture). §49 added the pretrained-init use case where the init checkpoint can have ARBITRARY shape. Without the dispatch, §49.6 step 5g would fail at the first forward pass with shape errors when an operator points --init at a non-Llama-shaped APR.
2. Why take TransformerConfig (not Path) as input? Decoupling: the APR-reading is `aprender-serve`'s job (or apr-cli's, since it's the one that reads --init <PATH>). `aprender-train` only needs to know "given a config, build the trainer". Mixing concerns would force aprender-train to depend on aprender-serve, which is a multi-thousand-line cross-crate dep just for header reading.
3. Why are the 3 tests sufficient at PARTIAL_ALGORITHM_LEVEL?
   - FALSIFY-002 pins regression-free behavior on the from-scratch path (the load-bearing claim that the polymorphic dispatch doesn't break what works today)
   - FALSIFY-003 pins pass-through behavior (no silent defaults that would corrupt init weights)
   - The dispatch-mutually-exclusive test catches a future refactor that accidentally always returns Llama370M (a silent-fallback regression class, the same defect §24's --synthetic default introduced)
4. Why not also wire this into drive_real / drive_real_cuda? That wires the actual init path. It's step 5f (weight load) work; this PR is just the pure-function dispatch. Keeping each PR small prevents the `feedback_no_guessing.md` failure mode of writing 200 LOC at once and finding mid-PR that the foundation is wrong.
5. Why is FALSIFY-APR-PRETRAIN-INIT-005 (arch mismatch) NOT yet discharged? The current dispatch is "init=Some passes through". There's no validation that the caller-provided config is COMPATIBLE with the pretrain target. That validation lives in the apr-cli wire (step 5d preflight + step 5f weight load) and exits non-zero with a clear error if architectures mismatch.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
- SPEC-SHIP-TWO-001 §50: MODEL-2 architecture-coupling finding (#1472, in flight)
- SPEC-SHIP-TWO-001 §50.4: re-scoped roadmap (steps 5a-5h)
- PR #1473: apr-pretrain-arch-polymorphic-v1 contract (in flight)
- PR #1474: qwen2_0_5b tie_word_embeddings fix (in flight)
- feedback_no_guessing.md: read source before forming hypothesis

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
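The dispatch that the #1475 commit message describes is small enough to sketch in full. The function signature below is the one quoted in the commit message; everything else is an assumption for illustration, including the reduced four-field `TransformerConfig` and the `llama370m_baseline` helper, which stand in for the real types in aprender-train.

```rust
/// Reduced stand-in for the real TransformerConfig (assumed fields only).
#[derive(Debug, Clone, PartialEq)]
pub struct TransformerConfig {
    pub hidden_size: usize,
    pub num_attention_heads: usize,
    pub num_kv_heads: usize,
    pub vocab_size: usize,
}

/// Hypothetical helper holding the from-scratch baseline values from the §50 table.
pub fn llama370m_baseline() -> TransformerConfig {
    TransformerConfig {
        hidden_size: 1024,
        num_attention_heads: 16,
        num_kv_heads: 4,
        vocab_size: 50_257,
    }
}

/// init=None keeps the baseline; init=Some passes the caller's config through unchanged.
pub fn build_transformer_config(init: Option<&TransformerConfig>) -> TransformerConfig {
    match init {
        Some(cfg) => cfg.clone(),
        None => llama370m_baseline(),
    }
}
```

Because the builder takes a config rather than a file path, it stays a pure function, and the three behaviours the commit lists (baseline preserved, pass-through, the two arms differ) can each be asserted without touching an APR file.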

…step 5f.1

Adds `pretrain_real::validate_pretrain_init_arch_compatible(cfg)`, which fail-fast rejects an init `TransformerConfig` whose architecture family is incompatible with the decoder-only pretrain trainer.

Discharges from `apr-pretrain-arch-polymorphic-v1` (PR #1473):
- FALSIFY-APR-PRETRAIN-ARCH-007: a wrong-arch APR (e.g., a CodeBERT/RoBERTa encoder model) is FAIL-FAST, not silent-truncate

Why this matters: §49 wires `--init <PATH>` to load weights from any APR file. Without this gate, an operator who points --init at e.g. microsoft/codebert-base.apr would silently load encoder weights into a decoder-shaped trainer, producing nonsense gradients that the divergence guard catches LATE (multiple epochs in). This gate catches the family mismatch BEFORE any trainer allocation.

Step 5f decomposition: this is step 5f.1, the arch-family gate. Step 5f.2 (~80 LOC, follow-up) does the actual weight materialization into optimizer state. Splitting keeps each PR small + reviewable.

What this PR adds:
1. `pub fn validate_pretrain_init_arch_compatible(cfg: &TransformerConfig) -> Result<(), String>` (~30 LOC including doc comment) at pretrain_real.rs:35
2. 3 unit tests in `pretrain_real::tests`:
   - validate_pretrain_init_arch_accepts_decoder (FALSIFY-007 negative)
   - validate_pretrain_init_arch_rejects_encoder (FALSIFY-007 positive, load-bearing)
   - validate_pretrain_init_arch_accepts_llama370m_baseline (drift-prevention, catches over-rejection regression)

The encoder-rejection test asserts FOUR string contents in the error:
- "FALSIFY-APR-PRETRAIN-ARCH-007": falsifier id (auditability)
- "Encoder": names the architecture family
- "decoder-only": explains why this is wrong
- "RobertaModel": names the offending hf_architecture

Operator-experience parity: when the gate fires, the error tells the operator exactly what they did wrong + how the trainer differs.

Test results (cargo test -p aprender-train --lib train::pretrain_real::tests::validate_pretrain_init_arch): 3 passed; 0 failed; 0 ignored

Five Whys:
1. Why a separate function rather than baking the check into build_transformer_config? Decoupling: build_transformer_config is a pure pass-through dispatch; adding arch validation would conflate "which config?" with "is this config valid?". Two functions, two concerns, two test surfaces.
2. Why focus this PR on JUST the arch-family check (step 5f.1) and not the full weight materialization (step 5f)? Single-piece flow. Step 5f's full scope (~120 LOC) splits naturally into 5f.1 (this PR, ~30 LOC + 3 tests) + 5f.2 (~80 LOC, the actual weight load). Each PR has its own falsifier discharge; CI catches regressions between them.
3. Why FOUR string assertions in the encoder-rejection error? Each piece of the error text serves a distinct operator need:
   - falsifier id → audit (which contract did this fail?)
   - architecture family → what (encoder vs decoder)
   - "decoder-only" → why (the trainer is decoder-only)
   - hf_architecture → which model (RobertaModel/CodeBERT/...)
   Lossy error messages erode operator trust; the contract pins all four to prevent message rot.
4. Why include the Llama370M baseline drift-prevention test? §24's retrospective showed silent over-rejection (every input rejected, even valid ones) is the symmetric defect to silent under-rejection (every input accepted, even invalid ones). The 3 tests cover both halves of the dispatch.
5. Why is FALSIFY-006 (init_loss < 6.0) NOT yet discharged? That requires the actual weight materialization (step 5f.2) PLUS a LIVE training run (step 5g). Step 5f.1 is just the gate; the load-bearing init_loss measurement is downstream.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
- SPEC-SHIP-TWO-001 §50: MODEL-2 architecture-coupling (PR #1472, MERGED)
- PR #1473: apr-pretrain-arch-polymorphic-v1 contract (in flight)
- PR #1474: qwen2_0_5b tie_word_embeddings fix (in flight)
- PR #1475: build_transformer_config polymorphic dispatch (in flight)
- PR #1476: preflight_tokenizer_vocab_matches_target (in flight)
- PR #1478: GQA-7:1 forward-pass smoke test (MERGED)
- feedback_no_guessing.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
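
The gate described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `pretrain_real.rs`: the `ArchFamily` enum, the `InitConfig` struct, and the function name `validate_init_arch` are hypothetical stand-ins, since the real `TransformerConfig` fields are not shown on this page. Only the four asserted substrings come from the commit message.

```rust
// Hypothetical stand-ins: the real `TransformerConfig` in aprender-train is
// not shown in this PR, so these types are assumptions for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ArchFamily {
    Decoder,
    Encoder,
}

#[derive(Debug, Clone)]
pub struct InitConfig {
    pub family: ArchFamily,
    /// Architecture string as it appears in an HF config, e.g. "RobertaModel".
    pub hf_architecture: String,
}

/// Rejects any init config that is not decoder-only, so the mismatch is
/// reported before a trainer would allocate anything.
pub fn validate_init_arch(cfg: &InitConfig) -> Result<(), String> {
    match cfg.family {
        ArchFamily::Decoder => Ok(()),
        // The message carries all four pieces the commit message lists:
        // falsifier id, family name, the "decoder-only" reason, and the
        // offending hf_architecture.
        family => Err(format!(
            "FALSIFY-APR-PRETRAIN-ARCH-007: init checkpoint is {:?} ({}), \
             but the pretrain trainer is decoder-only",
            family, cfg.hf_architecture
        )),
    }
}
```

Returning `Result<(), String>` mirrors the signature quoted above; keeping the check in its own function means it can be called before any config dispatch or weight load.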

…tep 5a (#1473)

Authors `contracts/apr-pretrain-arch-polymorphic-v1.yaml` v1.0.0 PROPOSED, the contract layer driving §50.4 steps 5b-5f (the architecture-polymorphic pretrain trainer that unblocks fine-tuning from a Qwen2.5-class init checkpoint).

Per §50 (PR #1472), the existing pretrain trainer `pretrain_real.rs:38-46` HARDCODES every architectural constant from `Llama370MConfig`. Loading Qwen2.5-Coder-0.5B-Instruct weights into this fixed shape is a category error (vocab 50K vs 152K, hidden 1024 vs 896, GQA 4:1 vs 7:1, etc.).

This contract pins the polymorphic builder + 4 invariants:
1. arch_extraction_signature: init=None preserves §24/§25 baseline; init=Some extracts all 10 fields from APR header, no silent defaults
2. qwen2_0_5b_constructor: TransformerConfig::qwen2_0_5b() returns a config matching HF config.json byte-for-byte (vocab=151_936, hidden=896, GQA-7:1, rope_theta=1e6, use_bias=true, tie_word_embeddings=true)
3. gqa_7_to_1_invariants: attention kernel handles GQA-7:1 without per-ratio specialization; cosine ≥ 0.9999 vs GQA-1:1 reference
4. qwen_tokenizer_vocab_compatibility: preflight gates by EXTRACTED vocab (151_936 for Qwen) when --init present, falls back to Llama370MConfig::VOCAB_SIZE (50_257) when absent

8 falsifiers (FALSIFY-APR-PRETRAIN-ARCH-001..008), 6 proof obligations, 2 kani harnesses. `pv validate` exits 0 with 0 errors / 0 warnings.

This contract DOES NOT replace apr-pretrain-from-init-v1; the two compose. apr-pretrain-from-init-v1 pins the --init flag's CLI surface + magic-byte validation; this contract pins the architecture extraction algorithm that --init's weight load depends on. FALSIFY-APR-PRETRAIN-INIT-005 (arch mismatch) becomes DISCHARGED when this contract's FALSIFY-007 lands.

Five Whys:
1. Why a contract before the impl? §50.4 step 5a is THE first step of the re-scoped roadmap. The contract pins what 5b-5f must satisfy; without it, the impl PRs would each pick their own arbitrary semantics for "extract arch from APR". Contract-first prevents 5-PR scope drift.
2. Why 8 falsifiers, not 4? Each of the 4 equations decomposes into 2 falsifiable claims: (existence + correctness) for the constructor, (init=None + init=Some) for the builder, (forward-pass + reference-comparison) for GQA-7:1, (positive + negative case) for the tokenizer surface. 8 covers every silent-failure mode the §24 retrospective showed is possible.
3. Why also pin GQA-7:1 here, not just in gqa-kernel-v1? The existing gqa-kernel-v1 covers GQA generally; what's NEW is that the Llama370M codepath empirically only exercised 4:1 (kv=4, q=16). Qwen2.5 exercises 7:1 (kv=2, q=14). FALSIFY-004 makes this transition contract-bound rather than tribal knowledge.
4. Why not just delete Llama370MConfig outright? Per §50.3 Option C analysis: that deletes the §24/§25 falsification evidence (we KEEP knowing from-scratch fails at val_loss=9.75 on the existing corpus). The polymorphic builder preserves both paths: Llama370M for the from-scratch baseline, Qwen2.5 (or any future init) for the fine-tune path.
5. Why is FALSIFY-007 (encoder/decoder mismatch) load-bearing? Without it, an operator who points --init at e.g. CodeBERT (an encoder) would silently load weights into a decoder-shaped trainer, producing nonsense gradients. The error message must name the architecture-family mismatch, not crash later with cryptic shape errors during the first forward pass.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
- SPEC-SHIP-TWO-001 §50: MODEL-2 architecture-coupling finding (#1472)
- SPEC-SHIP-TWO-001 §50.4: re-scoped roadmap (steps 5a-5h)
- contracts/apr-pretrain-from-init-v1.yaml v1.1.0 PARTIAL (#1471, sibling)
- contracts/training-loop-pretrain-v1.yaml v1.5.0 ACTIVE (parent)
- contracts/architecture-requirements-v1.yaml (sibling)
- contracts/gqa-kernel-v1.yaml (sibling, GQA ratio invariants)
- feedback_no_guessing.md
- feedback_fix_root_cause_never_route_around.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
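
The two dispatch rules in this contract (init=None keeps the Llama370M baseline, init=Some uses the extracted architecture) and the GQA ratios it mentions can be illustrated with a small sketch. The struct `ArchConfig` and the functions `build_config` and `expected_vocab` are hypothetical names chosen for illustration; only the numeric values are taken from the comparison table in the PR description, and the real `TransformerConfig` carries more fields than shown here.

```rust
// Hypothetical stand-in for the real `TransformerConfig`; it holds only the
// six fields listed in the PR description's comparison table.
#[derive(Debug, Clone, PartialEq)]
pub struct ArchConfig {
    pub hidden_size: usize,
    pub num_attention_heads: usize,
    pub num_kv_heads: usize,
    pub intermediate_size: usize,
    pub vocab_size: usize,
    pub rope_theta: f64,
}

impl ArchConfig {
    /// From-scratch baseline shape (values from the PR table).
    pub fn llama_370m() -> Self {
        Self {
            hidden_size: 1024,
            num_attention_heads: 16,
            num_kv_heads: 4,
            intermediate_size: 2816,
            vocab_size: 50_257,
            rope_theta: 10_000.0,
        }
    }

    /// Qwen2.5-Coder-0.5B shape (values from the PR table).
    pub fn qwen2_0_5b() -> Self {
        Self {
            hidden_size: 896,
            num_attention_heads: 14,
            num_kv_heads: 2,
            intermediate_size: 4864,
            vocab_size: 151_936,
            rope_theta: 1_000_000.0,
        }
    }

    /// Query heads per KV head, or None if the heads do not divide evenly.
    pub fn gqa_ratio(&self) -> Option<usize> {
        if self.num_kv_heads == 0 || self.num_attention_heads % self.num_kv_heads != 0 {
            None
        } else {
            Some(self.num_attention_heads / self.num_kv_heads)
        }
    }
}

/// init=None preserves the baseline; init=Some takes the extracted config
/// as-is, with no silent defaults mixed in.
pub fn build_config(init: Option<&ArchConfig>) -> ArchConfig {
    match init {
        Some(cfg) => cfg.clone(),
        None => ArchConfig::llama_370m(),
    }
}

/// Vocab size the tokenizer preflight would gate against.
pub fn expected_vocab(init: Option<&ArchConfig>) -> usize {
    build_config(init).vocab_size
}
```

With these values, `gqa_ratio` gives 4 for the Llama370M shape (16 query heads over 4 KV heads) and 7 for the Qwen2.5 shape (14 over 2), which is the 4:1 to 7:1 transition the contract pins.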

…1481)

Adds the read-half of `apr pretrain --init` weight load: a thin wrapper over `aprender::format::converter::load_model_tensors` that returns a `BTreeMap<String, (Vec<f32>, Vec<usize>)>` of tensor blobs keyed by HF naming convention.

Per `apr-pretrain-arch-polymorphic-v1` §init_load_semantics (PR #1473): "Loader is REUSED, not reimplemented." This function does not duplicate APR parsing; it forwards to the same machinery `apr export` and `apr inspect` use.

Discharges from `apr-pretrain-arch-polymorphic-v1`:
- §init_load_semantics invariant (loader reuse): satisfied
- FALSIFY-006 (init_loss < 6.0) at READ-COMPILE-BIND level

Step 5f decomposition:
- 5f.1 (PR #1479): encoder/decoder family validator (~30 LOC)
- 5f.2 (this PR): APR file open + tensor read (~30 LOC + 2 tests)
- 5f.3 (next): populate trainer parameters from BTreeMap (~50 LOC)
- 5g (operator): LIVE 500-step fine-tune → DISCHARGES MODEL-2 ship-%

Step 5f.2 is intentionally narrow: it only does the READ. Population into trainer parameter slots (5f.3) reconciles HF naming convention (e.g., `model.embed_tokens.weight`) against the trainer's internal parameter naming. That's a separate concern with its own falsifier.

What this PR adds:
1. `pub fn load_init_tensors_from_apr(path) -> Result<BTreeMap<...>>` at pretrain_real.rs:35 (~25 LOC including doc comment)
2. 2 unit tests in `pretrain_real::tests`:
   - load_init_tensors_missing_file_errors_with_falsifier_id (FALSIFY-006 fail-fast path; asserts error message contains falsifier id + offending path for operator-experience)
   - load_init_tensors_signature_compile_bind (drift-prevention: catches a future signature change that would break step 5f.3's BTreeMap consumer)

Test results (cargo test -p aprender-train --lib train::pretrain_real::tests::load_init_tensors): 2 passed; 0 failed; 0 ignored

Five Whys:
1. Why decompose step 5f.2 to JUST the read? Single-piece flow. Read → Validate → Populate are three distinct concerns. Step 5f.1 did validation (#1479); 5f.2 does read; 5f.3 will do populate. Each PR has one falsifier discharge story.
2. Why use load_model_tensors and not write a new parser? The contract pins "Loader is reused, not reimplemented." Writing a new parser would create a parallel format-decoder that drifts from the canonical one. The same lesson as the LAYOUT-001/002 hits: parallel format code paths produce silent format-drift bugs.
3. Why return BTreeMap<String, (Vec<f32>, Vec<usize>)> rather than a trainer-parameter-shaped struct? Decoupling: the read shouldn't know about TransformerTrainer's internal parameter names. Step 5f.3's job is to map HF names → trainer slots; if 5f.2 baked that mapping in, every change to TransformerTrainer would break the read.
4. Why include the signature-compile-bind test? It's a compile-time check that drives step 5f.3's expectations. If a future refactor changes the return type (e.g., from BTreeMap to HashMap, or from Vec<usize> to Box<[usize]>), step 5f.3's consumer code stops compiling; that is caught here, not at the integration point.
5. Why is FALSIFY-006 NOT yet at PARTIAL_ALGORITHM_LEVEL after this PR? Because step 5f.2 only does the read; FALSIFY-006 requires the LIVE init_loss < 6.0 check, which needs steps 5f.3 + 5g. This PR moves FALSIFY-006 from UNBOUND → READ-COMPILE-BIND, a sub-level of PARTIAL_ALGORITHM_LEVEL. Full PARTIAL discharge happens at 5f.3 when the populate step exists.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%; first ship-% movement gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
- SPEC-SHIP-TWO-001 §50, §51: MODEL-2 architecture-coupling + cascade snapshot (PR #1472, #1480 MERGED)
- PR #1473: apr-pretrain-arch-polymorphic-v1 contract (in flight)
- PR #1474: qwen2_0_5b tie_word_embeddings fix (MERGED)
- PR #1475: build_transformer_config polymorphic dispatch (in flight)
- PR #1476: preflight_tokenizer_vocab_matches_target (MERGED)
- PR #1478: GQA-7:1 forward-pass smoke test (MERGED)
- PR #1479: validate_pretrain_init_arch_compatible (in flight)
- feedback_no_guessing.md
- feedback_falsifier_first_cascade_pattern.md (this turn's pattern)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
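
To make the return shape concrete, here is a small sketch of what a consumer of that `BTreeMap` might check. It does not call `load_model_tensors` (its signature is not shown on this page); instead it illustrates the map type, a missing-file error that names the falsifier id and the path, and a shape-consistency check. The names `InitTensors`, `check_init_path`, and `check_shapes` are hypothetical, and the full id string "FALSIFY-APR-PRETRAIN-ARCH-006" is an assumption based on the 001..008 naming pattern quoted earlier.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Tensor name (HF convention) mapped to (flat f32 data, shape).
/// Same type as the return value quoted in the commit message.
pub type InitTensors = BTreeMap<String, (Vec<f32>, Vec<usize>)>;

/// Fail fast when the init checkpoint is missing. The falsifier id string
/// is an assumed spelling, not copied from the real error message.
pub fn check_init_path(path: &Path) -> Result<(), String> {
    if path.is_file() {
        Ok(())
    } else {
        Err(format!(
            "FALSIFY-APR-PRETRAIN-ARCH-006: init checkpoint not found: {}",
            path.display()
        ))
    }
}

/// Verifies that every tensor's element count equals the product of its
/// shape dimensions, which a populate step would rely on.
pub fn check_shapes(tensors: &InitTensors) -> Result<(), String> {
    for (name, (data, shape)) in tensors {
        let expected: usize = shape.iter().product();
        if data.len() != expected {
            return Err(format!(
                "tensor {}: {} values but shape {:?} implies {}",
                name,
                data.len(),
                shape,
                expected
            ));
        }
    }
    Ok(())
}
```

A `BTreeMap` iterates in sorted key order, so any diagnostics a later populate step emits would appear in a stable order across runs.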

…ION-COMPLETE; contract v1.1.0 → v1.2.0 FUNCTIONAL (#1495)

§50.4 cascade INTEGRATION-COMPLETE on main with PR #1494 merging at 2026-05-05T01:48:14Z. The `apr pretrain --init <PATH>` flow is now end-to-end functional on CPU; the legacy "not yet wired" Err is RETIRED; step 5g LIVE is the only remaining gate before MODEL-2 ship-% can move from 57% → ≥58%.

Spec amendment §53:
- Updated falsifier scoreboard: 6/8 INTEGRATION (001/002/003/005/006/007 via live CLI dispatch); 2/8 PARTIAL_ALGORITHM_LEVEL (004 forward-pass smoke + 008 contract validation are inherently algorithm-level).
- Step roadmap: 5a-5f.4 ✅ MERGED; 5f.5 (CUDA wireup) NOT YET STARTED; 5g (LIVE 500-step fine-tune) operator-dispatchable on RTX 4090.
- Cascade ships statistics: 11 PRs over 2 days (#1471/#1472/#1473/#1474/#1475/#1476/#1478/#1479/#1481/#1482/#1483/#1486/#1494).
- MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57% (gated on 5g empirical val_loss < 9.38 evidence).
- 3 CI andon classes documented as feedback memories during cascade (workspace-test missing-binary, trueno SIGSEGV-on-cleanup, auto-merge behind-state).

Contract apr-pretrain-arch-polymorphic-v1 v1.1.0 → v1.2.0 FUNCTIONAL:
- All 8 falsifiers PASS on main; 6/8 reach INTEGRATION via the user-facing `apr pretrain --init` flow.
- verification_summary updated: tested 7 → 8; status partial → functional.
- Added §52 + §53 references.
- Promotion to DISCHARGED still requires §50.4 step 5g LIVE empirical 500-step fine-tune on canonical Qwen2.5-Coder-0.5B-Instruct.apr producing val_loss < 9.38.

`pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml` exits 0.

Refs: SPEC-SHIP-TWO-001 §50.4 cascade, PR #1494 merge commit 9afca16

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 4, 2026 14:17

This was referenced May 4, 2026

contract(apr-pretrain-arch-polymorphic-v1): v1.0.0 PROPOSED — §50.4 step 5a #1473

Merged

fix(aprender-train): qwen2_0_5b tie_word_embeddings true — §50.4 step 5b + DEFECT FIX #1474

Merged

Merge branch 'main' into spec/ship-two-models-v2-95-arch-coupling

db53632

noahgift mentioned this pull request May 4, 2026

feat(aprender-train): build_transformer_config polymorphic dispatch — §50.4 step 5c #1475

Merged

4 tasks

noahgift merged commit 7f9e256 into main May 4, 2026
10 checks passed

noahgift deleted the spec/ship-two-models-v2-95-arch-coupling branch May 4, 2026 15:37

This was referenced May 4, 2026

feat(apr-cli): polymorphic preflight_tokenizer_vocab_matches_target — §50.4 step 5d #1476

Merged

test(aprender-train): GQA-7:1 forward-pass smoke test — §50.4 step 5e #1478

Merged

noahgift mentioned this pull request May 4, 2026

feat(aprender-train): validate_pretrain_init_arch_compatible — §50.4 step 5f.1 #1479

Merged

noahgift mentioned this pull request May 4, 2026

spec(ship-two-models): v2.95.0 → v2.96.0 — §51 §50.4 cascade snapshot (7/8 falsifiers bound) #1480

Merged

noahgift mentioned this pull request May 4, 2026

feat(aprender-train): load_init_tensors_from_apr — §50.4 step 5f.2 #1481

Merged

noahgift mentioned this pull request May 5, 2026

spec(ship-two-models): v2.97 → v2.98 — §53 §50.4 cascade INTEGRATION-COMPLETE; contract v1.1 → v1.2 FUNCTIONAL #1495

Merged

4 tasks

noahgift merged 2 commits into main from spec/ship-two-models-v2-95-arch-coupling
