contract(tensor-layout-v1): v2.0.0 → v2.1.0 — FALSIFY-013 safetensors FFN round-trip drift gate#1468

Closed
noahgift wants to merge 1 commit into main from contract/tensor-layout-v1-ffn-falsifier

noahgift commented May 4, 2026

Summary

Codifies the spec §50 finding as a falsifiable contract gate. Post-import, `apr diff <apr> <gguf>` MUST report IDENTICAL for FFN tensors (`mlp.{down,gate,up}_proj.weight`), never [TRANSPOSED]. Status: PARTIAL_ALGORITHM_LEVEL (algorithm-bound to live evidence); flips to DISCHARGED when the safetensors→APR FFN transpose fix lands.

Five Whys (recap from §50)

  1. Why does 0.5B `apr run` produce gibberish? FFN matmul reads weights in wrong orientation.
  2. Why? APR FFN tensors stored as `[out, in]` (HF SafeTensors convention), not `[in, out]` (kernel expectation).
  3. Why? Safetensors→APR import preserved HF shape labels without transposing.
  4. Why? `needs_transpose` scaffolding at `f16_convert.rs:100-127` is `#[allow(dead_code)]`.
  5. Why undetected? No round-trip falsification gate. This PR adds it.

Validation

```
$ pv validate contracts/tensor-layout-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.
```

Algorithm evidence captured

  • diagnosis_line: 'Values identical, shapes transposed (format layout diff)'
  • 3 affected FFN tensors (down/gate/up _proj)
  • 6 unaffected tensors as control (q/k/v/o _proj + 2 norms)
  • 7B status: works (GGUF-imported)
  • 0.5B status: gibberish (safetensors-imported)
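
The evidence above distinguishes tensors whose values and shape labels both match from tensors whose values match but whose shape labels are swapped. A minimal Rust sketch of that three-way classification follows. The `Verdict` enum and `classify` function are hypothetical names for illustration only; this is not the actual `apr diff` implementation, and it assumes 2-D tensors compared as flat `f32` buffers.

```rust
/// Verdict for one tensor pair, mirroring the tags quoted in this PR.
/// (Hypothetical type; the real `apr diff` internals are not shown in this thread.)
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Identical,
    Transposed,
    Different,
}

/// Compare the shape labels and flat values of two 2-D tensors.
/// Assumption: "values identical" means the flat buffers are element-wise equal
/// (so any NaN makes the pair count as different).
pub fn classify(
    shape_a: [usize; 2],
    data_a: &[f32],
    shape_b: [usize; 2],
    data_b: &[f32],
) -> Verdict {
    let same_values = data_a == data_b;
    let swapped_labels = shape_a[0] == shape_b[1] && shape_a[1] == shape_b[0];

    if same_values && shape_a == shape_b {
        Verdict::Identical
    } else if same_values && swapped_labels {
        // Values identical, shapes transposed (format layout diff).
        Verdict::Transposed
    } else {
        Verdict::Different
    }
}
```

Square tensors always land in `Identical` here because the equal-shape check runs first; a real tool would need extra information to tell those cases apart.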

Discharge criterion

When the fix lands and `apr diff` reports IDENTICAL on round-trip, flip status PARTIAL → DISCHARGED. Coverage tally will increment +1.
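
As a rough illustration of the discharge criterion, the sketch below derives a gate status from a per-tensor report. Everything here is hypothetical (`Tag`, `Status`, `gate_status`, and the tensor-name suffix matching); the contract YAML and the real report format are not shown in this PR.

```rust
/// Per-tensor tag as it might be parsed from a diff report (hypothetical).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Tag {
    Identical,
    Transposed,
}

/// Gate status names taken from the PR text.
#[derive(Debug, PartialEq)]
pub enum Status {
    PartialAlgorithmLevel,
    Discharged,
}

/// The three FFN projection weights named in the summary.
pub const FFN_SUFFIXES: [&str; 3] = [
    "mlp.down_proj.weight",
    "mlp.gate_proj.weight",
    "mlp.up_proj.weight",
];

/// The gate discharges only when at least one FFN tensor is present
/// and every FFN tensor is tagged IDENTICAL. Non-FFN tensors are ignored.
pub fn gate_status(report: &[(&str, Tag)]) -> Status {
    let ffn: Vec<_> = report
        .iter()
        .filter(|(name, _)| FFN_SUFFIXES.iter().any(|s| name.ends_with(*s)))
        .collect();

    if !ffn.is_empty() && ffn.iter().all(|(_, tag)| *tag == Tag::Identical) {
        Status::Discharged
    } else {
        Status::PartialAlgorithmLevel
    }
}
```

Treating an empty FFN set as not discharged is a design choice in this sketch, so that a report missing the FFN tensors cannot pass the gate vacuously.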

Cross-refs

Test plan

  • `pv validate` clean
  • Pre-commit quality gates pass
  • CI `ci / gate` and `workspace-test`

🤖 Generated with Claude Code

… FFN round-trip drift gate

Codifies the §50 finding (spec v2.95.0) as a falsifiable contract gate:
post-import, `apr diff <apr> <gguf>` MUST report IDENTICAL for FFN tensors
(`mlp.{down,gate,up}_proj.weight`), never [TRANSPOSED].

Status: PARTIAL_ALGORITHM_LEVEL — algorithm-bound to live evidence in
`evidence/qwen2-0.5b-bisection-2026-05-04/findings.md` but not yet
DISCHARGED (requires the safetensors→APR FFN transpose fix to land,
~50 LOC bounded scope per spec §50.7).

algorithm_evidence block captures:
- diagnosis_line: 'Values identical, shapes transposed (format layout diff)'
- affected_tensors: 3 FFN proj weights with shape labels swapped
- unaffected_tensors: 4 attn projections + 2 norms (IDENTICAL)
- seven_b_status: works (GGUF-imported, inherits transposed FFN layout)
- half_b_status: gibberish (safetensors-imported, preserves HF [out, in])

Discharge criterion: when the fix lands and `apr diff` reports IDENTICAL
on the round-trip, flip status PARTIAL → DISCHARGED. This is the gate
that prevents §49 MODEL-2 strategy A from regressing in the future.

Validation: `pv validate contracts/tensor-layout-v1.yaml` → 0 errors.

Cross-refs:
- spec §50 (docs/specifications/aprender-train/ship-two-models-spec.md)
- PR #1466 evidence (open auto-merge)
- PR #1467 spec amendment v2.95.0 (open auto-merge)
- CLAUDE.md "## LAYOUT-001/002 Tensor Layout Safety" (this contract is SOURCE OF TRUTH)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 4, 2026 12:11
noahgift commented May 4, 2026

Closing: the premise was wrong. Re-reading the existing tensor-layout-v1 contract metadata reveals: 'safetensors: layout=row-major (HuggingFace native format - same layout as APR)'. The [TRANSPOSED] tag from `apr diff` reports contract-compliant shape-label differences between formats, NOT a defect. The 0.5B APR file's [intermediate, hidden] layout for ffn_gate IS what the contract specifies. Six hypotheses have now been falsified for the Qwen2-0.5B gibberish, so the root cause lies elsewhere. Spec §50 in PR #1467 also needs correction. Authoring a fix from this premise would break MODEL-1.
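
One way a shape-label difference can coexist with identical bytes is sketched below. It assumes, without confirmation from this thread, that one format lists dimensions outermost-first (row-major labels `[rows, cols]`) while the other lists them innermost-first (labels `[cols, rows]`, first dimension contiguous). The function names are hypothetical.

```rust
/// Row-major offset for shape labels [rows, cols]: the column index varies fastest.
pub fn offset_row_major(shape: [usize; 2], r: usize, c: usize) -> usize {
    r * shape[1] + c
}

/// Offset when dimensions are listed innermost-first: labels are [cols, rows]
/// and the FIRST listed dimension is the contiguous one.
pub fn offset_innermost_first(shape: [usize; 2], i0: usize, i1: usize) -> usize {
    i1 * shape[0] + i0
}

/// True if both labelings address the same flat element for every (row, col).
pub fn same_bytes(rows: usize, cols: usize) -> bool {
    (0..rows).all(|r| {
        (0..cols).all(|c| {
            offset_row_major([rows, cols], r, c)
                == offset_innermost_first([cols, rows], c, r)
        })
    })
}
```

Under that assumption the two labelings always agree, so swapped labels alone say nothing about whether the stored data was transposed.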

noahgift closed this May 4, 2026
auto-merge was automatically disabled May 4, 2026 12:14

Pull request was closed

noahgift added a commit that referenced this pull request May 4, 2026
…oupling finding (#1472)

Adds §50 documenting the architecture-mismatch finding caught after §49.6
steps 3+4 landed (PR #1470 contract + PR #1471 wire-up). The remaining
§49.6 step 5 was scoped at "0 LOC, just run apr pretrain --init" — that
assumption is empirically wrong.

Empirical finding (§50.1):
  pretrain_real.rs:38-46 HARDCODES Llama370MConfig::* for every
  architectural constant. Qwen2.5-Coder-0.5B-Instruct has different
  shape across the board:

    Param               | Llama370M   | Qwen2.5-Coder-0.5B
    --------------------|-------------|-------------------
    hidden_size         | 1024        | 896
    num_attention_heads | 16          | 14
    num_kv_heads        | 4 (GQA-4:1) | 2 (GQA-7:1)
    intermediate_size   | 2816        | 4864
    vocab_size          | 50_257      | 151_936
    rope_theta          | 10_000      | 1_000_000

  Every tensor mismatches. Loading Qwen2.5 weights into a Llama370M-
  shaped optimizer is a category error.
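
A fail-fast check over the constants in the table above could look like the following sketch. The values are copied from the table; `ArchConfig`, the two constructor functions, `mismatched_fields`, and `check_arch` are hypothetical names and are not the project's actual `TransformerConfig` API.

```rust
/// Architectural constants from the comparison table (hypothetical struct).
#[derive(Debug, Clone, PartialEq)]
pub struct ArchConfig {
    pub hidden_size: usize,
    pub num_attention_heads: usize,
    pub num_kv_heads: usize,
    pub intermediate_size: usize,
    pub vocab_size: usize,
    pub rope_theta: f64,
}

pub fn llama_370m() -> ArchConfig {
    ArchConfig {
        hidden_size: 1024,
        num_attention_heads: 16,
        num_kv_heads: 4,
        intermediate_size: 2816,
        vocab_size: 50_257,
        rope_theta: 10_000.0,
    }
}

pub fn qwen2_5_coder_0_5b() -> ArchConfig {
    ArchConfig {
        hidden_size: 896,
        num_attention_heads: 14,
        num_kv_heads: 2,
        intermediate_size: 4864,
        vocab_size: 151_936,
        rope_theta: 1_000_000.0,
    }
}

/// Names of every field where the init checkpoint and the training target disagree.
pub fn mismatched_fields(init: &ArchConfig, target: &ArchConfig) -> Vec<&'static str> {
    let mut out = Vec::new();
    if init.hidden_size != target.hidden_size { out.push("hidden_size"); }
    if init.num_attention_heads != target.num_attention_heads { out.push("num_attention_heads"); }
    if init.num_kv_heads != target.num_kv_heads { out.push("num_kv_heads"); }
    if init.intermediate_size != target.intermediate_size { out.push("intermediate_size"); }
    if init.vocab_size != target.vocab_size { out.push("vocab_size"); }
    if init.rope_theta != target.rope_theta { out.push("rope_theta"); }
    out
}

/// Fail fast with the mismatched field names instead of silently truncating.
pub fn check_arch(init: &ArchConfig, target: &ArchConfig) -> Result<(), String> {
    let bad = mismatched_fields(init, target);
    if bad.is_empty() {
        Ok(())
    } else {
        Err(format!("architecture mismatch: {}", bad.join(", ")))
    }
}
```

Calling `check_arch(&qwen2_5_coder_0_5b(), &llama_370m())` in this sketch returns an error naming all six fields, which matches the "every tensor mismatches" observation.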

Three options surfaced (§50.3):
  A: Find/build a Llama-shaped 0.5B pretrained checkpoint
     (~5K LOC + multi-week training; recreates §24/§25 corpus problem)
  B: Make trainer architecture-polymorphic
     (~200-400 LOC; preserves §24/§25 falsification; recommended)
  C: Replace Llama370MConfig with Qwen2_5_Coder_0_5B_Config outright
     (~300 LOC; deletes a working falsification path)

Recommendation (§50.5): Option B — preserves §24/§25 falsification
evidence, exercises TransformerConfig's designed polymorphism, binds
each new component (qwen2_0_5b constructor, GQA-7:1 attention, Qwen
tokenizer surface) to its own falsifier.

Re-scoped roadmap (§50.4) — 8 sub-steps replacing original step 5:
  5a. Author apr-pretrain-arch-polymorphic-v1.yaml contract  (~80 LOC)
  5b. TransformerConfig::qwen2_0_5b() constructor           (~40 LOC)
  5c. Extract arch from init APR file metadata              (~80 LOC)
  5d. Qwen tokenizer-vocab compatibility check              (~30 LOC)
  5e. GQA-7:1 attention forward-pass verification           (~50 LOC)
  5f. Wire actual weight load                              (~120 LOC)
  5g. LIVE 500-step smoke fine-tune (operator dispatch)        0 LOC
  5h. Stamp + publish as MODEL-2 v2                         (~10 LOC)

  Total: ~410 LOC + 1 LIVE training run.
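
Sub-step 5e concerns the GQA-7:1 grouping (14 query heads sharing 2 KV heads). A minimal sketch of the usual contiguous-group head mapping is shown below; `kv_head_for` is a hypothetical helper, and it assumes query heads are grouped contiguously, which this commit message does not state explicitly.

```rust
/// Map a query head index to the KV head it shares under grouped-query attention.
/// Returns None when the head counts do not divide evenly or the index is out of range.
/// Assumption: query heads are assigned to KV heads in contiguous groups.
pub fn kv_head_for(q_head: usize, num_q_heads: usize, num_kv_heads: usize) -> Option<usize> {
    if num_kv_heads == 0 || num_q_heads % num_kv_heads != 0 || q_head >= num_q_heads {
        return None;
    }
    // Group size is 7 for 14 query heads over 2 KV heads, and 4 for 16 over 4.
    let group = num_q_heads / num_kv_heads;
    Some(q_head / group)
}
```

With the Qwen2.5-Coder-0.5B head counts, query heads 0 to 6 map to KV head 0 and heads 7 to 13 map to KV head 1; with the Llama370M counts each KV head serves four query heads.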

Five Whys (§50.6):
  1. Why didn't §49 catch this? §49 was authored from strategy/
     data-budget reasoning; the 0-LOC step-5 cost implicitly
     assumed polymorphism. Live source inspection (this section's
     empirical move) revealed pretrain_real.rs:38-46 predates the
     assumption.
  2. Why catch this NOW and not in step 5 implementation? Per
     feedback_no_guessing.md: read live source before forming
     implementation plan. Surfacing the mismatch BEFORE writing
     200 LOC of weight-load code that fails at runtime is the
     cheapest place to pay cost-of-defect. The §50-prior wrong-
     premise PRs (#1466/#1467/#1468 closed) on the SHIP-007 / 0.5B
     gibberish track were the same defect class.
  3. Why option B over A or C? Preserves §24/§25 falsification
     evidence (we KEEP knowing from-scratch fails at 9.75; we just
     don't ship it as MODEL-2). Exercises the polymorphism
     TransformerConfig was designed for. Each new component becomes
     its own falsifier rather than a hidden coupling.
  4. Why is FALSIFY-005 the right place to fail-fast? PR #1470
     already pinned "Architecture mismatch is FAIL-FAST, not silent-
     truncate". Step 4 (PR #1471) doesn't enforce arch matching yet
     — returns "not yet wired" before getting there. So FALSIFY-005
     is currently UNBOUND but its discharge gate is well-defined:
     read APR header, compare against pretrain target, error with
     names of mismatched fields.
  5. Why isn't this a "punt"? A punt would say "blocked, await
     operator". This amendment names three options with LOC
     estimates, recommends one with reasoning, gives a concrete 8-
     step roadmap with falsifier discharge mapped to each sub-step.
     The work IS shippable; it's just bigger than 0 LOC.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38).
    Sub-steps 5a-5f can each individually move 1% with falsifier
    discharge (architecture-polymorphic infrastructure shipped ==
    evidence that the §49 path is REACHABLE, not just theoretical).

Refs:
  - §49 — MODEL-2 strategy pivot (PR #1461)
  - PR #1470 — apr-pretrain-from-init-v1 v1.0.0 PROPOSED contract
  - PR #1471 — apr pretrain --init clap field + magic-byte validate
  - feedback_no_guessing.md — read source before forming hypothesis
  - feedback_fix_root_cause_never_route_around.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>