spec(ship-two-models): v2.95.0 — §50 LAYOUT-001/002 in safetensors→APR FFN import#1467
Closed
noahgift wants to merge 1 commit into
…tensors→APR FFN import

Documents the empirical root cause for Qwen2-0.5B-Instruct gibberish, found via `apr diff` in one command. Blocks §49 MODEL-2 pretrained-init pivot until the fix lands.

§50 sections:
- 50.1: Empirical root cause via `apr diff` (literal "DIAGNOSIS" line)
- 50.2: 7B vs 0.5B variable isolation (GGUF-import vs SafeTensors-import)
- 50.3: Five Whys with code-line citations
- 50.4: Falsified hypotheses table (4 wrong theories, evidence)
- 50.5: Methodology lesson (apr-tools-first BEFORE code reading)
- 50.6: PR #1463 MERGED + #1466 OPEN summary
- 50.7: Bounded next-session fix scope (~50 LOC) with regression guard plan
- 50.8: Ship-% impact (M1=91% unchanged, M2=57% blocked on fix)
- 50.9: Related contracts (tensor-layout-v1, tied-embeddings-v1)

Cross-refs CLAUDE.md "LAYOUT-001/002 Tensor Layout Safety" section, which flags this bug class as "occurred 100+ times".

No code changes: spec amendment only. Companion to evidence PR #1466, which ships the findings.md + gguf-trace JSON files.

Coverage tally unchanged this cycle (PR #1463 added unit tests for QA gate, not contract falsifiers). M1=91%, M2=57%.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closing — premise was wrong. The contract tensor-layout-v1.yaml metadata explicitly states 'safetensors: layout=row-major (HuggingFace native format - same layout as APR)' and the FFN tensor entries document the [TRANSPOSED] shape labels between GGUF and APR as the contract-specified behavior, NOT a defect. The |
auto-merge was automatically disabled
May 4, 2026 12:14
Pull request was closed
noahgift added a commit that referenced this pull request May 4, 2026
…oupling finding (#1472)

Adds §50 documenting the architecture-mismatch finding caught after §49.6 steps 3+4 landed (PR #1470 contract + PR #1471 wire-up). The remaining §49.6 step 5 was scoped at "0 LOC, just run apr pretrain --init"; that assumption is empirically wrong.

Empirical finding (§50.1): pretrain_real.rs:38-46 HARDCODES Llama370MConfig::* for every architectural constant. Qwen2.5-Coder-0.5B-Instruct has a different shape across the board:

Param               | Llama370M    | Qwen2.5-Coder-0.5B
--------------------|--------------|-------------------
hidden_size         | 1024         | 896
num_attention_heads | 16           | 14
num_kv_heads        | 4 (GQA-4:1)  | 2 (GQA-7:1)
intermediate_size   | 2816         | 4864
vocab_size          | 50_257       | 151_936
rope_theta          | 10_000       | 1_000_000

Every tensor mismatches. Loading Qwen2.5 weights into a Llama370M-shaped optimizer is a category error.

Three options surfaced (§50.3):
- A: Find/build a Llama-shaped 0.5B pretrained checkpoint (~5K LOC + multi-week training; recreates §24/§25 corpus problem)
- B: Make trainer architecture-polymorphic (~200-400 LOC; preserves §24/§25 falsification; recommended)
- C: Replace Llama370MConfig with Qwen2_5_Coder_0_5B_Config outright (~300 LOC; deletes a working falsification path)

Recommendation (§50.5): Option B. It preserves §24/§25 falsification evidence, exercises TransformerConfig's designed polymorphism, and binds each new component (qwen2_0_5b constructor, GQA-7:1 attention, Qwen tokenizer surface) to its own falsifier.

Re-scoped roadmap (§50.4): 8 sub-steps replacing original step 5:
- 5a. Author apr-pretrain-arch-polymorphic-v1.yaml contract (~80 LOC)
- 5b. TransformerConfig::qwen2_0_5b() constructor (~40 LOC)
- 5c. Extract arch from init APR file metadata (~80 LOC)
- 5d. Qwen tokenizer-vocab compatibility check (~30 LOC)
- 5e. GQA-7:1 attention forward-pass verification (~50 LOC)
- 5f. Wire actual weight load (~120 LOC)
- 5g. LIVE 500-step smoke fine-tune (operator dispatch; 0 LOC)
- 5h. Stamp + publish as MODEL-2 v2 (~10 LOC)

Total: ~410 LOC + 1 LIVE training run.

Five Whys (§50.6):
1. Why didn't §49 catch this? §49 was authored from strategy/data-budget reasoning; the 0-LOC step-5 cost implicitly assumed polymorphism. Live source inspection (this section's empirical move) revealed pretrain_real.rs:38-46 predates the assumption.
2. Why catch this NOW and not in step 5 implementation? Per feedback_no_guessing.md: read live source before forming implementation plan. Surfacing the mismatch BEFORE writing 200 LOC of weight-load code that fails at runtime is the cheapest place to pay cost-of-defect. The §50-prior wrong-premise PRs (#1466/#1467/#1468 closed) on the SHIP-007 / 0.5B gibberish track were the same defect class.
3. Why option B over A or C? Preserves §24/§25 falsification evidence (we KEEP knowing from-scratch fails at 9.75; we just don't ship it as MODEL-2). Exercises the polymorphism TransformerConfig was designed for. Each new component becomes its own falsifier rather than a hidden coupling.
4. Why is FALSIFY-005 the right place to fail-fast? PR #1470 already pinned "Architecture mismatch is FAIL-FAST, not silent-truncate". Step 4 (PR #1471) doesn't enforce arch matching yet; it returns "not yet wired" before getting there. So FALSIFY-005 is currently UNBOUND but its discharge gate is well-defined: read APR header, compare against pretrain target, error with names of mismatched fields.
5. Why isn't this a "punt"? A punt would say "blocked, await operator". This amendment names three options with LOC estimates, recommends one with reasoning, and gives a concrete 8-step roadmap with falsifier discharge mapped to each sub-step. The work IS shippable; it's just bigger than 0 LOC.

Plain ship-% update:
- MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
- MODEL-2: unchanged at 57%. First ship-% movement is gated on §50.4 step 5g (LIVE 500-step fine-tune producing val_loss < 9.38). Sub-steps 5a-5f can each individually move 1% with falsifier discharge (architecture-polymorphic infrastructure shipped == evidence that the §49 path is REACHABLE, not just theoretical).

Refs:
- §49: MODEL-2 strategy pivot (PR #1461)
- PR #1470: apr-pretrain-from-init-v1 v1.0.0 PROPOSED contract
- PR #1471: apr pretrain --init clap field + magic-byte validate
- feedback_no_guessing.md: read source before forming hypothesis
- feedback_fix_root_cause_never_route_around.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
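The FALSIFY-005 discharge gate described in the commit message (read the APR header, compare against the pretrain target, error with the names of the mismatched fields) might look roughly like the sketch below. This is an illustration under assumptions, not the project's code: `ArchConfig`, `mismatched_fields`, and `check_arch` are hypothetical names, the real `TransformerConfig` API is not shown in this thread, and reading the APR header is left out entirely. The numeric values are copied from the Llama370M vs Qwen2.5-Coder-0.5B table in the commit message.

```rust
/// Hypothetical stand-in for the architectural constants that an init
/// checkpoint and the pretrain target must agree on.
#[derive(Debug, Clone, PartialEq)]
pub struct ArchConfig {
    pub hidden_size: usize,
    pub num_attention_heads: usize,
    pub num_kv_heads: usize,
    pub intermediate_size: usize,
    pub vocab_size: usize,
    pub rope_theta: f64,
}

impl ArchConfig {
    /// Values from the Llama370M column of the commit-message table.
    pub fn llama_370m() -> Self {
        Self {
            hidden_size: 1024,
            num_attention_heads: 16,
            num_kv_heads: 4,
            intermediate_size: 2816,
            vocab_size: 50_257,
            rope_theta: 10_000.0,
        }
    }

    /// Values from the Qwen2.5-Coder-0.5B column of the same table.
    pub fn qwen2_5_coder_0_5b() -> Self {
        Self {
            hidden_size: 896,
            num_attention_heads: 14,
            num_kv_heads: 2,
            intermediate_size: 4864,
            vocab_size: 151_936,
            rope_theta: 1_000_000.0,
        }
    }

    /// Query heads per KV head, i.e. the N in "GQA-N:1".
    pub fn gqa_ratio(&self) -> usize {
        self.num_attention_heads / self.num_kv_heads
    }
}

/// Collects the name of every field where `init` differs from `target`.
pub fn mismatched_fields(init: &ArchConfig, target: &ArchConfig) -> Vec<&'static str> {
    let mut out = Vec::new();
    if init.hidden_size != target.hidden_size {
        out.push("hidden_size");
    }
    if init.num_attention_heads != target.num_attention_heads {
        out.push("num_attention_heads");
    }
    if init.num_kv_heads != target.num_kv_heads {
        out.push("num_kv_heads");
    }
    if init.intermediate_size != target.intermediate_size {
        out.push("intermediate_size");
    }
    if init.vocab_size != target.vocab_size {
        out.push("vocab_size");
    }
    if init.rope_theta != target.rope_theta {
        out.push("rope_theta");
    }
    out
}

/// Fails fast with the mismatched field names instead of silently
/// truncating or reshaping tensors.
pub fn check_arch(init: &ArchConfig, target: &ArchConfig) -> Result<(), String> {
    let bad = mismatched_fields(init, target);
    if bad.is_empty() {
        Ok(())
    } else {
        Err(format!("architecture mismatch in fields: {}", bad.join(", ")))
    }
}
```

Calling `check_arch(&ArchConfig::qwen2_5_coder_0_5b(), &ArchConfig::llama_370m())` returns an `Err` that names all six fields, which matches the "every tensor mismatches" observation. Returning the field names rather than a bare boolean is the design choice the commit message asks for: the error itself tells the operator which constants diverge.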
Summary
Spec amendment v2.94.0 → v2.95.0 documenting §50: empirical root cause for Qwen2-0.5B-Instruct gibberish, found via `apr diff` in one command. Blocks §49 MODEL-2 pretrained-init pivot until the fix lands (separate next-session PR).
What changed
No code changes
Spec amendment only. Companion to evidence PR #1466 (findings.md + gguf-trace-coherent-logits.json). The fix itself is bounded and clear but multi-path; needs careful next-session implementation to avoid breaking the working 7B teacher.
Test plan
🤖 Generated with Claude Code