test(aprender-train): GQA-7:1 forward-pass smoke test — §50.4 step 5e#1478

Merged
noahgift merged 1 commit into main from feat/gqa-7-1-forward-pass-test
May 4, 2026

Conversation

@noahgift noahgift commented May 4, 2026

Adds GQA-7:1 forward-pass smoke test discharging FALSIFY-APR-PRETRAIN-ARCH-004 at smoke level. Tiny shape (hidden=112, vocab=256, layers=1). Test verifies: runs without panic, correct shape, all-finite logits. The existing Llama370M codepath only exercised GQA-4:1; this pins that the kernel handles 7:1 (Qwen2.5-Coder-0.5B canonical) without per-ratio specialization. PR #1473 contract obligation. Plain ship-%: MODEL-1=91%, MODEL-2=57% (unchanged; gated on step 5g LIVE fine-tune). Builds on PRs #1472 (merged) + #1473/#1474/#1475/#1476 (in flight).

Adds `falsify_apr_pretrain_arch_004_gqa_7_1_forward_pass_smoke` that
constructs a tiny GQA-7:1 transformer (kv_heads=2, query_heads=14,
hidden=112=14*8, head_dim=8 — mimicking Qwen2.5-Coder-0.5B's GQA ratio)
and verifies the forward pass:
  - runs without panic
  - returns the correct shape (seq_len * vocab_size)
  - produces all-finite logits (no NaN, no Inf)
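
The shape and finiteness checks listed above can be sketched as a small helper. This is an illustration only, not the test from the PR: `check_logits` is a hypothetical name, and the flat `seq_len * vocab_size` layout is taken from the description.

```rust
/// Hypothetical helper (not from the PR): validates a flat logits buffer
/// against the two post-conditions named in the description.
fn check_logits(logits: &[f32], seq_len: usize, vocab_size: usize) -> Result<(), String> {
    // Shape check: the description expects seq_len * vocab_size values.
    let expected = seq_len * vocab_size;
    if logits.len() != expected {
        return Err(format!(
            "shape mismatch: got {} values, expected {}",
            logits.len(),
            expected
        ));
    }
    // Finiteness check: f32::is_finite is false for NaN, +Inf and -Inf.
    if let Some(pos) = logits.iter().position(|v| !v.is_finite()) {
        return Err(format!("non-finite logit at flat index {}", pos));
    }
    Ok(())
}
```

Reporting the first offending index, rather than a bare boolean, makes a failing run easier to trace back to a token position.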

Discharges from `apr-pretrain-arch-polymorphic-v1` (PR #1473):
  - FALSIFY-APR-PRETRAIN-ARCH-004 at SMOKE level: kernel handles GQA-7:1
    without per-ratio specialization. Full numerical-parity vs GQA-1:1
    reference (cosine ≥ 0.9999) is a FUNCTIONAL-level discharge, not
    PARTIAL_ALGORITHM_LEVEL.
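
For reference, the FUNCTIONAL-level parity criterion mentioned above (cosine ≥ 0.9999) could be checked roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and accumulating in `f64` is a choice made here, not something the contract specifies.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns None when the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Threshold quoted in the description for a FUNCTIONAL-level discharge.
const PARITY_MIN_COSINE: f64 = 0.9999;

/// Hypothetical gate: true only when the candidate logits are close enough
/// in direction to the reference logits.
fn passes_parity(candidate: &[f32], reference: &[f32]) -> bool {
    matches!(cosine_similarity(candidate, reference), Some(c) if c >= PARITY_MIN_COSINE)
}
```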

Why this matters: the existing aprender-train Llama370M codepath only
empirically exercised GQA-4:1 (kv_heads=4, query_heads=16). Qwen2.5-0.5B
(the §49 fine-tune init source) uses GQA-7:1. Without this test, a
future refactor of the attention kernel could silently break the 7:1
case while keeping 4:1 working — exactly the §24 silent-failure class.

The test runs in <1ms (tiny shape: hidden=112, vocab=256, layers=1).
Drift-prevention: also asserts the GQA ratio at construction time, so a
typo in num_attention_heads or num_kv_heads is caught before the forward
pass even runs.
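
A construction-time ratio assertion of the kind described here might look like the sketch below. `GqaShape` and its fields are hypothetical stand-ins; the crate's actual `TransformerConfig` is not shown in this PR and may differ.

```rust
/// Hypothetical stand-in for the attention-shape part of a config.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct GqaShape {
    hidden_size: usize,
    num_attention_heads: usize,
    num_kv_heads: usize,
}

impl GqaShape {
    /// Rejects inconsistent shapes before any forward pass can run.
    fn new(
        hidden_size: usize,
        num_attention_heads: usize,
        num_kv_heads: usize,
        expected_ratio: usize,
    ) -> Result<Self, String> {
        if num_attention_heads == 0 || num_kv_heads == 0 {
            return Err("head counts must be non-zero".to_string());
        }
        if num_attention_heads % num_kv_heads != 0 {
            return Err(format!(
                "{} query heads do not divide evenly over {} kv heads",
                num_attention_heads, num_kv_heads
            ));
        }
        let ratio = num_attention_heads / num_kv_heads;
        if ratio != expected_ratio {
            return Err(format!("GQA ratio is {}:1, expected {}:1", ratio, expected_ratio));
        }
        if hidden_size % num_attention_heads != 0 {
            return Err(format!(
                "hidden_size {} is not a multiple of {} heads",
                hidden_size, num_attention_heads
            ));
        }
        Ok(Self { hidden_size, num_attention_heads, num_kv_heads })
    }

    /// Per-head dimension, e.g. 112 / 14 = 8 for the tiny test shape.
    fn head_dim(&self) -> usize {
        self.hidden_size / self.num_attention_heads
    }

    /// Query heads per kv head, e.g. 14 / 2 = 7.
    fn ratio(&self) -> usize {
        self.num_attention_heads / self.num_kv_heads
    }
}
```

With the values from the description, `GqaShape::new(112, 14, 2, 7)` succeeds, while a typo such as 16 query heads is rejected because the ratio becomes 8:1.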

Test results (cargo test -p aprender-train --lib transformer::model::tests::falsify_apr_pretrain_arch_004):
    1 passed; 0 failed; 0 ignored

Five Whys:

  1. Why a smoke test, not a numerical-parity test? PARTIAL_ALGORITHM_LEVEL
     requires only "compile + run + finite". FUNCTIONAL would require
     cosine vs reference. Smoke is the right scope for §50.4 step 5e
     — full parity is a follow-up if FALSIFY-006 (init_loss < 6.0)
     ever fails on the LIVE 500-step run.

  2. Why num_attention_heads=14 (Qwen2.5-0.5B exact) and not e.g. 7
     (smaller test model)? The Qwen2.5-0.5B-canonical 14/2=7 ratio is
     the load-bearing GQA shape. A 7/1 ratio would also be 7:1 but
     wouldn't exercise the multi-query-head-per-kv-head dispatch on
     more than one query group. 14/2 forces 2 query groups, each
     with 7 heads — the actual production shape.

  3. Why use_bias=true and tie_word_embeddings=true? Mirror the Qwen2
     scaling-law defaults verified by PR #1474 (the `qwen2_0_5b()` HF
     config check). If the test used the Llama defaults (use_bias=false,
     tie=false), it wouldn't catch a regression in the bias-add or
     embedding-tie code paths under the Qwen variant.

  4. Why include the all-finite check, not just shape? §24's
     retrospective showed silent NaN propagation through GQA can
     produce loss=NaN that the divergence guard catches LATE (multiple
     steps in). The smoke test catches it at the first forward pass,
     before any optimizer state is corrupted.

  5. Why is this a SEPARATE test, not an extension of
     `test_transformer_tiny_forward`? The existing tiny() config uses
     defaults that may include GQA=1:1 (no GQA at all). A separate
     test makes the GQA-7:1 assertion auditable — `cargo test
     gqa_7_1` finds it directly, and contract drift between this
     test and FALSIFY-004 is detectable via grep.
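
To make the second point concrete, the sketch below maps query heads to kv heads and counts the resulting groups. It assumes the common contiguous grouping convention (query head `q` shares kv head `q / ratio`); the aprender-train kernel may index differently, and both function names are hypothetical.

```rust
/// Maps a query head to the kv head it shares, assuming contiguous groups.
/// Returns None for shapes that do not divide evenly or out-of-range heads.
fn kv_head_for(query_head: usize, num_query_heads: usize, num_kv_heads: usize) -> Option<usize> {
    if num_kv_heads == 0
        || num_query_heads % num_kv_heads != 0
        || query_head >= num_query_heads
    {
        return None;
    }
    let heads_per_group = num_query_heads / num_kv_heads;
    Some(query_head / heads_per_group)
}

/// Counts how many query heads land on each kv head.
fn group_sizes(num_query_heads: usize, num_kv_heads: usize) -> Option<Vec<usize>> {
    if num_query_heads == 0 || num_kv_heads == 0 {
        return None;
    }
    let mut sizes = vec![0_usize; num_kv_heads];
    for q in 0..num_query_heads {
        sizes[kv_head_for(q, num_query_heads, num_kv_heads)?] += 1;
    }
    Some(sizes)
}
```

Under this convention, 14/2 yields two groups of seven heads, so query heads 6 and 7 fall on different kv heads. A 7/1 shape has a single group and never crosses a group boundary.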

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §50 — MODEL-2 architecture-coupling (PR #1472, MERGED)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (in flight)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - PR #1476 — preflight_tokenizer_vocab_matches_target (in flight)
  - feedback_no_guessing.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 4, 2026 15:51
noahgift merged commit b78d860 into main May 4, 2026
11 checks passed
noahgift deleted the feat/gqa-7-1-forward-pass-test branch May 4, 2026 16:17
noahgift added a commit that referenced this pull request May 4, 2026
… (7/8 falsifiers bound) (#1480)

Same-day continuation cycle landed 8 PRs across the §50.4 architecture-
polymorphic infrastructure track. §51 records the cascade-complete
state and pinpoints the remaining MODEL-2 ship-% gate (step 5g LIVE).

Falsifier-discharge scoreboard for `apr-pretrain-arch-polymorphic-v1`:

  | ID  | What it pins                         | PR    | Status  |
  |-----|--------------------------------------|-------|---------|
  | 001 | qwen2_0_5b matches HF + tie fix      | #1474 | PARTIAL |
  | 002 | init=None preserves Llama370M        | #1475 | PARTIAL |
  | 003 | init=Some pass-through               | #1475 | PARTIAL |
  | 004 | GQA-7:1 forward smoke                | #1478 | MERGED  |
  | 005 | Qwen tokenizer + Qwen target = pass  | #1476 | MERGED  |
  | 006 | Qwen tokenizer + Llama target = fail | #1476 | MERGED  |
  | 007 | encoder/decoder family mismatch      | #1479 | PARTIAL |
  | 008 | pv validate                          | #1473 | PARTIAL |

7 of 8 falsifiers PARTIAL_ALGORITHM_LEVEL or MERGED.

Remaining work:
  - 5f.2 — wire APR file open + tensor materialization (~80 LOC)
           DELIBERATELY DEFERRED this cycle; doing 5f.2 now means
           rebasing onto 4 in-flight PRs as they land
  - 5g  — LIVE 500-step smoke fine-tune (operator dispatch)
          THE LOAD-BEARING TEST that moves MODEL-2 ship-%
  - 5h  — stamp + publish

Per §47-§48 lesson: "infrastructure shipped ≠ ship-% movement."
Cascade-complete state means the polymorphic foundation is in place;
ship-% movement still requires the LIVE empirical check.

Five Whys:
  1. Why a snapshot now? Multiple PRs in cascade auto-merge create
     cognitive load. A spec snapshot captures both the achievement
     (7 falsifiers bound) and the remaining gate (step 5g LIVE).
     Without it, future operators waste cycles re-deriving the state.
  2. Why focus on falsifier scoreboard rather than total LOC? Falsifier
     discharge is the actual contract obligation. 7/8 invariants pinned
     means CI now catches regressions in the polymorphic-init path.
  3. Why mention 5f.2 explicitly as deliberately deferred? Naming the
     deferral makes it not a punt. Step 5f.2 has a clear "when": after
     the 4 in-flight PRs cascade-merge, then 5f.2 lands clean.
  4. Why call out infrastructure ≠ ship-%? The §47-§48 cascade taught
     the same lesson — "11 SHIP-007 cascade PRs landed but no ship-%
     movement." Operator-facing ship-% is the LIVE check.
  5. Why is FALSIFY-006 LIVE the load-bearing claim? init_loss(step=0)
     ≤ 6.0 vs from_scratch_loss(step=0) ≥ 9.5 proves end-to-end
     correctness in one number. No other falsifier can substitute.
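
The step-0 comparison in the fifth point can be written as a two-threshold gate. The sketch below is illustrative: the function names are hypothetical and the thresholds are copied from the text above. The `uniform_cross_entropy` helper only records the standard fact that a uniform prediction over V tokens has cross-entropy ln V, which is roughly where an untrained model tends to start; it does not claim to reproduce how the 9.5 bound was chosen.

```rust
/// Cross-entropy in nats of a uniform prediction over `vocab_size` tokens: ln(V).
fn uniform_cross_entropy(vocab_size: usize) -> f64 {
    (vocab_size as f64).ln()
}

/// Thresholds quoted in the commit message.
const INIT_LOSS_MAX: f64 = 6.0;
const FROM_SCRATCH_LOSS_MIN: f64 = 9.5;

/// Hypothetical gate: true when the initialized run starts low enough and the
/// from-scratch baseline starts high enough. Non-finite losses fail the gate.
fn init_weights_took_effect(init_loss: f64, from_scratch_loss: f64) -> bool {
    init_loss.is_finite()
        && from_scratch_loss.is_finite()
        && init_loss <= INIT_LOSS_MAX
        && from_scratch_loss >= FROM_SCRATCH_LOSS_MIN
}
```

If the init weights were silently dropped or mis-mapped, the initialized run would be expected to start near the from-scratch loss and fail the first bound.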

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Spec amendment cadence: §41 → §42 → §43 → §44 → §45 → §46 → §47 →
§48 → §49 → §50 → §51. Eleven amendments since 2026-05-03. Same-day
spec hygiene rather than letting the cascade-complete state remain
implicit.

Refs:
  - SPEC-SHIP-TWO-001 §50 — architecture-coupling finding (PR #1472, MERGED)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (in flight)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - PR #1476 — preflight_tokenizer_vocab_matches_target (MERGED)
  - PR #1478 — GQA-7:1 forward-pass smoke test (MERGED)
  - PR #1479 — validate_pretrain_init_arch_compatible (in flight)
  - feedback_no_guessing.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…step 5f.1

Adds `pretrain_real::validate_pretrain_init_arch_compatible(cfg)` that
fail-fast rejects an init `TransformerConfig` whose architecture family
is incompatible with the decoder-only pretrain trainer.

Discharges from `apr-pretrain-arch-polymorphic-v1` (PR #1473):
  - FALSIFY-APR-PRETRAIN-ARCH-007 — wrong-arch APR (e.g., CodeBERT/
    RoBERTa encoder model) is FAIL-FAST not silent-truncate

Why this matters: §49 wires `--init <PATH>` to load weights from any
APR file. Without this gate, an operator who points --init at e.g.
microsoft/codebert-base.apr would silently load encoder weights into
a decoder-shaped trainer, producing nonsense gradients that the
divergence guard catches LATE (multiple epochs in). This gate catches
the family mismatch BEFORE any trainer allocation.

Step 5f decomposition: this is step 5f.1 — the arch-family gate.
Step 5f.2 (~80 LOC, follow-up) does the actual weight materialization
into optimizer state. Splitting keeps each PR small + reviewable.

What this PR adds:

  1. `pub fn validate_pretrain_init_arch_compatible(cfg: &TransformerConfig)
     -> Result<(), String>` (~30 LOC including doc comment) at
     pretrain_real.rs:35

  2. 3 unit tests in `pretrain_real::tests`:
       - validate_pretrain_init_arch_accepts_decoder       (FALSIFY-007 negative)
       - validate_pretrain_init_arch_rejects_encoder       (FALSIFY-007 positive,
                                                            load-bearing)
       - validate_pretrain_init_arch_accepts_llama370m_baseline (drift-prevention,
                                                                 catches over-rejection
                                                                 regression)

The encoder-rejection test asserts FOUR string contents in the error:
  - "FALSIFY-APR-PRETRAIN-ARCH-007" — falsifier id (auditability)
  - "Encoder"                      — names the architecture family
  - "decoder-only"                 — explains why this is wrong
  - "RobertaModel"                 — names the offending hf_architecture
Operator-experience parity: when the gate fires, the error tells the
operator exactly what they did wrong + how the trainer differs.
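
A gate that satisfies the four string assertions could be shaped like the sketch below. `ArchFamily`, `InitConfig`, and `validate_init_arch` are hypothetical stand-ins for illustration; the real function takes the crate's `TransformerConfig`, whose fields are not shown here, and its exact error wording may differ.

```rust
/// Hypothetical architecture family tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArchFamily {
    Decoder,
    Encoder,
}

/// Hypothetical stand-in for the fields of the init config that the gate reads.
struct InitConfig {
    family: ArchFamily,
    hf_architecture: String,
}

/// Fail-fast gate: accepts decoder configs, rejects encoder configs with an
/// error that names the falsifier id, the family, the reason, and the model.
fn validate_init_arch(cfg: &InitConfig) -> Result<(), String> {
    match cfg.family {
        ArchFamily::Decoder => Ok(()),
        ArchFamily::Encoder => Err(format!(
            "FALSIFY-APR-PRETRAIN-ARCH-007: init architecture family is {:?} \
             (hf_architecture={}), but the pretrain trainer is decoder-only",
            cfg.family, cfg.hf_architecture
        )),
    }
}
```

Checking each substring separately in the test, rather than comparing the whole message, leaves room to reword the sentence without losing any of the four pieces.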

Test results (cargo test -p aprender-train --lib train::pretrain_real::tests::validate_pretrain_init_arch):
    3 passed; 0 failed; 0 ignored

Five Whys:

  1. Why a separate function rather than baking the check into
     build_transformer_config? Decoupling: build_transformer_config is
     a pure pass-through dispatch; adding arch validation would conflate
     "which config?" with "is this config valid?". Two functions, two
     concerns, two test surfaces.

  2. Why focus this PR on JUST the arch-family check (step 5f.1) and
     not the full weight materialization (step 5f)? Single-piece flow.
     Step 5f's full scope (~120 LOC) splits naturally into 5f.1 (this
     PR, ~30 LOC + 3 tests) + 5f.2 (~80 LOC, the actual weight load).
     Each PR has its own falsifier discharge; CI catches regressions
     between them.

  3. Why FOUR string assertions in the encoder-rejection error? Each
     piece of the error text serves a distinct operator need:
       - falsifier id → audit (which contract did this fail?)
       - architecture family → what (encoder vs decoder)
       - "decoder-only" → why (the trainer is decoder-only)
       - hf_architecture → which model (RobertaModel/CodeBERT/...)
     Lossy error messages erode operator trust; the contract pins
     all four to prevent message rot.

  4. Why include the Llama370M baseline drift-prevention test? §24's
     retrospective showed silent over-rejection (every input rejected,
     even valid ones) is the symmetric defect to silent under-rejection
     (every input accepted, even invalid ones). The 3 tests cover both
     halves of the dispatch.

  5. Why is FALSIFY-006 (init_loss < 6.0) NOT yet discharged? That
     requires the actual weight materialization (step 5f.2) PLUS a
     LIVE training run (step 5g). Step 5f.1 is just the gate; the
     load-bearing init_loss measurement is downstream.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §50 — MODEL-2 architecture-coupling (PR #1472, MERGED)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (in flight)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - PR #1476 — preflight_tokenizer_vocab_matches_target (in flight)
  - PR #1478 — GQA-7:1 forward-pass smoke test (MERGED)
  - feedback_no_guessing.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…1481)

Adds the read-half of `apr pretrain --init` weight load: a thin
wrapper over `aprender::format::converter::load_model_tensors` that
returns a `BTreeMap<String, (Vec<f32>, Vec<usize>)>` of tensor blobs
keyed by HF naming convention.

Per `apr-pretrain-arch-polymorphic-v1` §init_load_semantics (PR #1473):
"Loader is REUSED, not reimplemented." This function does not duplicate
APR parsing — it forwards to the same machinery `apr export` and
`apr inspect` use.

Discharges from `apr-pretrain-arch-polymorphic-v1`:
  - §init_load_semantics invariant (loader reuse): satisfied
  - FALSIFY-006 (init_loss < 6.0) at READ-COMPILE-BIND level

Step 5f decomposition:
  - 5f.1 (PR #1479): encoder/decoder family validator (~30 LOC)
  - 5f.2 (this PR): APR file open + tensor read (~30 LOC + 2 tests)
  - 5f.3 (next):    populate trainer parameters from BTreeMap (~50 LOC)
  - 5g  (operator): LIVE 500-step fine-tune → DISCHARGES MODEL-2 ship-%

Step 5f.2 is intentionally narrow — it only does the READ. Population
into trainer parameter slots (5f.3) reconciles HF naming convention
(e.g., `model.embed_tokens.weight`) against the trainer's internal
parameter naming. That's a separate concern with its own falsifier.

What this PR adds:

  1. `pub fn load_init_tensors_from_apr(path) -> Result<BTreeMap<...>>`
     at pretrain_real.rs:35 (~25 LOC including doc comment)
  2. 2 unit tests in `pretrain_real::tests`:
       - load_init_tensors_missing_file_errors_with_falsifier_id
         (FALSIFY-006 fail-fast path; asserts error message contains
          falsifier id + offending path for operator-experience)
       - load_init_tensors_signature_compile_bind
         (drift-prevention: catches a future signature change that
          would break step 5f.3's BTreeMap consumer)

Test results (cargo test -p aprender-train --lib train::pretrain_real::tests::load_init_tensors):
    2 passed; 0 failed; 0 ignored
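
The two tests above can be pictured with a stand-in loader. This sketch does not call `aprender::format::converter::load_model_tensors`; the body is a placeholder, the `&Path` parameter type is an assumption, and the id string `FALSIFY-APR-PRETRAIN-ARCH-006` is inferred from the naming pattern of the other falsifiers rather than quoted from the source.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Return type named in the commit message: tensor name -> (data, shape).
type InitTensors = BTreeMap<String, (Vec<f32>, Vec<usize>)>;

/// Stand-in loader (hypothetical body): fails fast on a missing file and
/// otherwise returns an empty map where the real wrapper would forward
/// to the shared APR reader.
fn load_init_tensors(path: &Path) -> Result<InitTensors, String> {
    if !path.is_file() {
        return Err(format!(
            "FALSIFY-APR-PRETRAIN-ARCH-006: init file not found: {}",
            path.display()
        ));
    }
    Ok(InitTensors::new())
}

/// Compile-bind check: this stops compiling if the signature drifts,
/// for example from BTreeMap to HashMap.
fn signature_compile_bind() {
    let _bound: fn(&Path) -> Result<InitTensors, String> = load_init_tensors;
}

/// Consistency check a consumer might run: each tensor's element count
/// must equal the product of its shape.
fn shapes_consistent(tensors: &InitTensors) -> bool {
    tensors
        .values()
        .all(|(data, shape)| data.len() == shape.iter().product::<usize>())
}
```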

Five Whys:

  1. Why decompose step 5f.2 to JUST the read? Single-piece flow.
     Read → Validate → Populate are three distinct concerns. Step 5f.1
     did validation (#1479); 5f.2 does read; 5f.3 will do populate.
     Each PR has one falsifier discharge story.

  2. Why use load_model_tensors and not write a new parser? The contract
     pins "Loader is reused, not reimplemented." Writing a new parser
     would create a parallel format-decoder that drifts from the canonical
     one. The same lesson as the LAYOUT-001/002 hits — parallel format
     code paths produce silent format-drift bugs.

  3. Why return BTreeMap<String, (Vec<f32>, Vec<usize>)> rather than a
     trainer-parameter-shaped struct? Decoupling: the read shouldn't
     know about TransformerTrainer's internal parameter names. Step
     5f.3's job is to map HF names → trainer slots; if 5f.2 baked that
     mapping in, every change to TransformerTrainer would break the read.

  4. Why include the signature-compile-bind test? It's a compile-time
     check that drives step 5f.3's expectations. If a future refactor
     changes the return type (e.g., from BTreeMap to HashMap, or from
     Vec<usize> to Box<[usize]>), step 5f.3's consumer code stops
     compiling — caught here, not at the integration point.

  5. Why is FALSIFY-006 NOT yet at PARTIAL_ALGORITHM_LEVEL after this
     PR? Because step 5f.2 only does the read; FALSIFY-006 requires
     the LIVE init_loss < 6.0 check, which needs steps 5f.3 + 5g.
     This PR moves FALSIFY-006 from UNBOUND → READ-COMPILE-BIND, a
     sub-level of PARTIAL_ALGORITHM_LEVEL. Full PARTIAL discharge
     happens at 5f.3 when the populate step exists.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §50, §51 — MODEL-2 architecture-coupling +
    cascade snapshot (PR #1472, #1480 MERGED)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (MERGED)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - PR #1476 — preflight_tokenizer_vocab_matches_target (MERGED)
  - PR #1478 — GQA-7:1 forward-pass smoke test (MERGED)
  - PR #1479 — validate_pretrain_init_arch_compatible (in flight)
  - feedback_no_guessing.md
  - feedback_falsifier_first_cascade_pattern.md (this turn's pattern)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…step 5f.1 (#1479)
noahgift added a commit that referenced this pull request May 4, 2026
…PARTIAL_ALGORITHM_LEVEL — §50.4 cascade snapshot (#1482)

## Summary

Bump `apr-pretrain-arch-polymorphic-v1` contract status from PROPOSED to
PARTIAL_ALGORITHM_LEVEL. All 8 FALSIFY-APR-PRETRAIN-ARCH-* falsifiers
are now bound to executable tests across the §50.4 cascade.

## Falsifier scoreboard (post-§51 snapshot)

| ID         | Rule                                          | PR                | Status                |
|------------|-----------------------------------------------|-------------------|-----------------------|
| FALSIFY-001 | qwen2_0_5b matches HF config                  | #1474 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-002 | build_transformer_config(None) → Llama370M    | #1475 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-003 | build_transformer_config(Some) extracts 10    | #1475 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-004 | GQA-7:1 forward-pass smoke                    | #1478 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-005 | Qwen tokenizer passes with --init Qwen        | #1476 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-006 | Qwen tokenizer fails without --init           | #1476 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-007 | encoder-arch APR fail-fast                    | #1479 open (auto-merge armed) | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-008 | contract self-validates via pv                | this PR (validates clean) | PARTIAL_ALGORITHM_LEVEL |

## Test plan

- [x] pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml exits 0
- [x] All 8 falsifiers cite a concrete test path or PR
- [x] Changelog entry under metadata.changelog with version/date/change

## Why now

Per `feedback_falsifier_first_cascade_pattern.md`: when a saturated
auto-merge queue (≥4 PRs) blocks more impl PRs, switch to non-conflict
work. This contract bump:
  - touches only one YAML file (no Rust/test source)
  - cannot conflict with #1479 / #1481 (impl PRs)
  - audit-trails the cascade scoreboard

Promotion to FUNCTIONAL is gated on #1479 landing (FALSIFY-007 PASS).
Promotion to DISCHARGED is gated on §50.4 step 5g LIVE empirical run.
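
The promotion gates described above form an ordered ladder. A minimal sketch, assuming a simple enum and evidence struct (both hypothetical; the real status lives in the contract YAML), could encode it as:

```rust
/// Contract status levels named in this PR, in promotion order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ContractStatus {
    Proposed,
    PartialAlgorithmLevel,
    Functional,
    Discharged,
}

/// Hypothetical summary of the evidence each promotion is gated on.
struct Evidence {
    all_falsifiers_bound: bool,
    falsify_007_passes_on_main: bool,
    live_run_passed: bool,
}

/// Highest status the evidence supports; each gate must hold before the next.
fn highest_status(e: &Evidence) -> ContractStatus {
    if !e.all_falsifiers_bound {
        return ContractStatus::Proposed;
    }
    if !e.falsify_007_passes_on_main {
        return ContractStatus::PartialAlgorithmLevel;
    }
    if !e.live_run_passed {
        return ContractStatus::Functional;
    }
    ContractStatus::Discharged
}
```

Deriving `Ord` lets a check such as `status >= ContractStatus::Functional` follow the declaration order of the variants.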

## Five Whys

1. Why bump status now? — 7/8 falsifiers bound on main + 8th bound on
   open PR; PROPOSED is stale.
2. Why not wait for #1479 to land first? — §51 snapshot recorded "7/8
   PARTIAL bound" 2 hours ago; the 8th binding is the contract-self
   validation, which is met by THIS PR's `pv validate` output.
3. Why not bundle with #1479? — Different file, different review scope,
   different concern (status semantics vs. impl).
4. Why not skip the bump? — Operator-facing scoreboard is in the YAML;
   stale PROPOSED implies "not yet started" which contradicts §51.
5. Why YAML changelog instead of just version? — Changelog records
   THIS bump's reasoning so future operators don't re-derive it from
   git log.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…ION-COMPLETE; contract v1.1.0 → v1.2.0 FUNCTIONAL (#1495)

§50.4 cascade INTEGRATION-COMPLETE on main with PR #1494 merging at
2026-05-05T01:48:14Z. The `apr pretrain --init <PATH>` flow is now
end-to-end functional on CPU; the legacy "not yet wired" Err is
RETIRED; step 5g LIVE is the only remaining gate before MODEL-2 ship-%
can move from 57% → ≥58%.

Spec amendment §53:
- Updated falsifier scoreboard: 6/8 INTEGRATION (001/002/003/005/006/007
  via live CLI dispatch); 2/8 PARTIAL_ALGORITHM_LEVEL (004 forward-pass
  smoke + 008 contract validation are inherently algorithm-level).
- Step roadmap: 5a-5f.4 ✅ MERGED; 5f.5 (CUDA wireup) NOT YET STARTED;
  5g (LIVE 500-step fine-tune) operator-dispatchable on RTX 4090.
- Cascade ship statistics: 13 PRs over 2 days
  (#1471/#1472/#1473/#1474/#1475/#1476/#1478/#1479/#1481/#1482/#1483/#1486/#1494).
- MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57%
  (gated on 5g empirical val_loss < 9.38 evidence).
- 3 CI andon classes documented as feedback memories during cascade
  (workspace-test missing-binary, trueno SIGSEGV-on-cleanup, auto-merge
  behind-state).

Contract apr-pretrain-arch-polymorphic-v1 v1.1.0 → v1.2.0 FUNCTIONAL:
- All 8 falsifiers PASS on main; 6/8 reach INTEGRATION via the
  user-facing `apr pretrain --init` flow.
- verification_summary updated: tested 7 → 8; status partial →
  functional.
- Added §52 + §53 references.
- Promotion to DISCHARGED still requires §50.4 step 5g LIVE empirical
  500-step fine-tune on canonical Qwen2.5-Coder-0.5B-Instruct.apr
  producing val_loss < 9.38.

`pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml` exits 0.

Refs: SPEC-SHIP-TWO-001 §50.4 cascade, PR #1494 merge commit 9afca16

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>