
feat(aprender-train): validate_pretrain_init_arch_compatible — §50.4 step 5f.1#1479

Merged
noahgift merged 1 commit into main from feat/validate-pretrain-init-arch-compatible
May 4, 2026

@noahgift noahgift commented May 4, 2026

Contributor

Adds `pretrain_real::validate_pretrain_init_arch_compatible(cfg)`, which fail-fast rejects an init TransformerConfig whose architecture family is incompatible with the decoder-only pretrain trainer. Discharges FALSIFY-APR-PRETRAIN-ARCH-007 from the PR #1473 contract.

Encoder configs (CodeBERT/RoBERTa/BERT) are rejected with an error message naming the falsifier id, the architecture family, the decoder-only requirement, and the hf_architecture (e.g. RobertaModel). 3 unit tests verify decoder accept, encoder reject, and Llama370M baseline accept (drift-prevention).

Step 5f decomposition: this is 5f.1 (~30 LOC arch gate); 5f.2 (~80 LOC actual weight load) is the follow-up.

Plain ship-%: MODEL-1=91%, MODEL-2=57% (unchanged; gated on step 5g LIVE fine-tune).

Builds on PRs #1472 and #1478 (MERGED), with #1473/#1474/#1475/#1476 in flight.

@noahgift noahgift enabled auto-merge (squash) May 4, 2026 16:21
noahgift added a commit that referenced this pull request May 4, 2026
… (7/8 falsifiers bound) (#1480)

Same-day continuation cycle landed 8 PRs across the §50.4 architecture-
polymorphic infrastructure track. §51 records the cascade-complete
state and pinpoints the remaining MODEL-2 ship-% gate (step 5g LIVE).

Falsifier-discharge scoreboard for `apr-pretrain-arch-polymorphic-v1`:

  | ID  | What it pins                         | PR    | Status  |
  |-----|--------------------------------------|-------|---------|
  | 001 | qwen2_0_5b matches HF + tie fix      | #1474 | PARTIAL |
  | 002 | init=None preserves Llama370M        | #1475 | PARTIAL |
  | 003 | init=Some pass-through               | #1475 | PARTIAL |
  | 004 | GQA-7:1 forward smoke                | #1478 | MERGED  |
  | 005 | Qwen tokenizer + Qwen target = pass  | #1476 | MERGED  |
  | 006 | Qwen tokenizer + Llama target = fail | #1476 | MERGED  |
  | 007 | encoder/decoder family mismatch      | #1479 | PARTIAL |
  | 008 | pv validate                          | #1473 | PARTIAL |

7 of 8 falsifiers PARTIAL_ALGORITHM_LEVEL or MERGED.

Remaining work:
  - 5f.2 — wire APR file open + tensor materialization (~80 LOC)
           DELIBERATELY DEFERRED this cycle; doing 5f.2 now means
           rebasing onto 4 in-flight PRs as they land
  - 5g  — LIVE 500-step smoke fine-tune (operator dispatch)
          THE LOAD-BEARING TEST that moves MODEL-2 ship-%
  - 5h  — stamp + publish

Per §47-§48 lesson: "infrastructure shipped ≠ ship-% movement."
Cascade-complete state means the polymorphic foundation is in place;
ship-% movement still requires the LIVE empirical check.

Five Whys:
  1. Why a snapshot now? Multiple PRs in cascade auto-merge create
     cognitive load. A spec snapshot captures both the achievement
     (7 falsifiers bound) and the remaining gate (step 5g LIVE).
     Without it, future operators waste cycles re-deriving the state.
  2. Why focus on falsifier scoreboard rather than total LOC? Falsifier
     discharge is the actual contract obligation. 7/8 invariants pinned
     means CI now catches regressions in the polymorphic-init path.
  3. Why mention 5f.2 explicitly as deliberately deferred? Naming the
     deferral makes it not a punt. Step 5f.2 has a clear "when": after
     the 4 in-flight PRs cascade-merge, then 5f.2 lands clean.
  4. Why call out infrastructure ≠ ship-%? The §47-§48 cascade taught
     the same lesson — "11 SHIP-007 cascade PRs landed but no ship-%
     movement." Operator-facing ship-% is the LIVE check.
  5. Why is FALSIFY-006 LIVE the load-bearing claim? init_loss(step=0)
     ≤ 6.0 vs from_scratch_loss(step=0) ≥ 9.5 proves end-to-end
     correctness in one number. No other falsifier can substitute.
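
The step-0 comparison in item 5 can be sketched as a single predicate. This is a minimal illustration, not the project's code: `init_load_is_evident` and the two constant names are hypothetical, and only the thresholds 6.0 and 9.5 come from the text above.

```rust
/// Thresholds quoted in the commit message above.
const INIT_LOSS_MAX: f32 = 6.0;
const FROM_SCRATCH_LOSS_MIN: f32 = 9.5;

/// Hypothetical helper: true when the step-0 losses show that pretrained
/// weights actually reached the trainer (low loss with --init, high without).
/// Non-finite losses never count as evidence.
fn init_load_is_evident(init_loss: f32, from_scratch_loss: f32) -> bool {
    init_loss.is_finite()
        && from_scratch_loss.is_finite()
        && init_loss <= INIT_LOSS_MAX
        && from_scratch_loss >= FROM_SCRATCH_LOSS_MIN
}
```

The gap between the two thresholds is what makes the check informative: a randomly initialized model predicts close to uniformly over its vocabulary, so its step-0 cross-entropy sits near ln(V), while a correctly loaded pretrained model should start far below that.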

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Spec amendment cadence: §41 → §42 → §43 → §44 → §45 → §46 → §47 →
§48 → §49 → §50 → §51. Eleven amendments since 2026-05-03. Same-day
spec hygiene rather than letting the cascade-complete state remain
implicit.

Refs:
  - SPEC-SHIP-TWO-001 §50 — architecture-coupling finding (PR #1472, MERGED)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (in flight)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - PR #1476 — preflight_tokenizer_vocab_matches_target (MERGED)
  - PR #1478 — GQA-7:1 forward-pass smoke test (MERGED)
  - PR #1479 — validate_pretrain_init_arch_compatible (in flight)
  - feedback_no_guessing.md

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the feat/validate-pretrain-init-arch-compatible branch from 4ed5894 to d9b8e20 Compare May 4, 2026 18:29
noahgift added a commit that referenced this pull request May 4, 2026
…1481)

Adds the read-half of `apr pretrain --init` weight load: a thin
wrapper over `aprender::format::converter::load_model_tensors` that
returns a `BTreeMap<String, (Vec<f32>, Vec<usize>)>` of tensor blobs
keyed by HF naming convention.

Per `apr-pretrain-arch-polymorphic-v1` §init_load_semantics (PR #1473):
"Loader is REUSED, not reimplemented." This function does not duplicate
APR parsing — it forwards to the same machinery `apr export` and
`apr inspect` use.

Discharges from `apr-pretrain-arch-polymorphic-v1`:
  - §init_load_semantics invariant (loader reuse): satisfied
  - FALSIFY-006 (init_loss < 6.0) at READ-COMPILE-BIND level

Step 5f decomposition:
  - 5f.1 (PR #1479): encoder/decoder family validator (~30 LOC)
  - 5f.2 (this PR): APR file open + tensor read (~30 LOC + 2 tests)
  - 5f.3 (next):    populate trainer parameters from BTreeMap (~50 LOC)
  - 5g  (operator): LIVE 500-step fine-tune → DISCHARGES MODEL-2 ship-%

Step 5f.2 is intentionally narrow — it only does the READ. Population
into trainer parameter slots (5f.3) reconciles HF naming convention
(e.g., `model.embed_tokens.weight`) against the trainer's internal
parameter naming. That's a separate concern with its own falsifier.

What this PR adds:

  1. `pub fn load_init_tensors_from_apr(path) -> Result<BTreeMap<...>>`
     at pretrain_real.rs:35 (~25 LOC including doc comment)
  2. 2 unit tests in `pretrain_real::tests`:
       - load_init_tensors_missing_file_errors_with_falsifier_id
         (FALSIFY-006 fail-fast path; asserts error message contains
          falsifier id + offending path for operator-experience)
       - load_init_tensors_signature_compile_bind
         (drift-prevention: catches a future signature change that
          would break step 5f.3's BTreeMap consumer)
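
A minimal sketch of the read half and its two tests, under stated assumptions: `read_tensors_stub` is a hypothetical stand-in for the shared loader (the real function forwards to `aprender::format::converter::load_model_tensors`, whose signature is not shown here), the `String` error type is assumed, and the exact error wording is illustrative. Only the return shape and the "falsifier id + offending path" requirement come from the text above.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Tensor name -> (flat f32 data, shape), matching the return type named above.
type InitTensors = BTreeMap<String, (Vec<f32>, Vec<usize>)>;

/// Hypothetical stand-in for the shared APR loader; returns one fake tensor.
fn read_tensors_stub(_path: &Path) -> Result<InitTensors, String> {
    let mut tensors = InitTensors::new();
    tensors.insert(
        "model.embed_tokens.weight".to_string(),
        (vec![0.0; 8], vec![2, 4]),
    );
    Ok(tensors)
}

/// Sketch of the read half: fail fast on a missing file, otherwise delegate
/// to the shared loader instead of re-implementing APR parsing.
fn load_init_tensors_from_apr(path: &Path) -> Result<InitTensors, String> {
    if !path.is_file() {
        return Err(format!(
            "FALSIFY-006: init APR not found at {}",
            path.display()
        ));
    }
    read_tensors_stub(path)
}

/// Compile-time signature pin: this line stops compiling if the argument or
/// return type of the loader drifts (e.g. BTreeMap replaced by HashMap).
const _SIGNATURE_PIN: fn(&Path) -> Result<InitTensors, String> = load_init_tensors_from_apr;
```

The signature pin costs nothing at run time; it exists only so that a type change is reported at the loader rather than at the downstream consumer.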

Test results (cargo test -p aprender-train --lib train::pretrain_real::tests::load_init_tensors):
    2 passed; 0 failed; 0 ignored

Five Whys:

  1. Why decompose step 5f.2 to JUST the read? Single-piece flow.
     Read → Validate → Populate are three distinct concerns. Step 5f.1
     did validation (#1479); 5f.2 does read; 5f.3 will do populate.
     Each PR has one falsifier discharge story.

  2. Why use load_model_tensors and not write a new parser? The contract
     pins "Loader is reused, not reimplemented." Writing a new parser
     would create a parallel format-decoder that drifts from the canonical
     one. The same lesson as the LAYOUT-001/002 hits — parallel format
     code paths produce silent format-drift bugs.

  3. Why return BTreeMap<String, (Vec<f32>, Vec<usize>)> rather than a
     trainer-parameter-shaped struct? Decoupling: the read shouldn't
     know about TransformerTrainer's internal parameter names. Step
     5f.3's job is to map HF names → trainer slots; if 5f.2 baked that
     mapping in, every change to TransformerTrainer would break the read.

  4. Why include the signature-compile-bind test? It's a compile-time
     check that drives step 5f.3's expectations. If a future refactor
     changes the return type (e.g., from BTreeMap to HashMap, or from
     Vec<usize> to Box<[usize]>), step 5f.3's consumer code stops
     compiling — caught here, not at the integration point.

  5. Why is FALSIFY-006 NOT yet at PARTIAL_ALGORITHM_LEVEL after this
     PR? Because step 5f.2 only does the read; FALSIFY-006 requires
     the LIVE init_loss < 6.0 check, which needs steps 5f.3 + 5g.
     This PR moves FALSIFY-006 from UNBOUND → READ-COMPILE-BIND, a
     sub-level of PARTIAL_ALGORITHM_LEVEL. Full PARTIAL discharge
     happens at 5f.3 when the populate step exists.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §50, §51 — MODEL-2 architecture-coupling +
    cascade snapshot (PR #1472, #1480 MERGED)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (MERGED)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - PR #1476 — preflight_tokenizer_vocab_matches_target (MERGED)
  - PR #1478 — GQA-7:1 forward-pass smoke test (MERGED)
  - PR #1479 — validate_pretrain_init_arch_compatible (in flight)
  - feedback_no_guessing.md
  - feedback_falsifier_first_cascade_pattern.md (this turn's pattern)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…step 5f.1

Adds `pretrain_real::validate_pretrain_init_arch_compatible(cfg)` that
fail-fast rejects an init `TransformerConfig` whose architecture family
is incompatible with the decoder-only pretrain trainer.

Discharges from `apr-pretrain-arch-polymorphic-v1` (PR #1473):
  - FALSIFY-APR-PRETRAIN-ARCH-007 — wrong-arch APR (e.g., CodeBERT/
    RoBERTa encoder model) is FAIL-FAST not silent-truncate

Why this matters: §49 wires `--init <PATH>` to load weights from any
APR file. Without this gate, an operator who points --init at e.g.
microsoft/codebert-base.apr would silently load encoder weights into
a decoder-shaped trainer, producing nonsense gradients that the
divergence guard catches LATE (multiple epochs in). This gate catches
the family mismatch BEFORE any trainer allocation.

Step 5f decomposition: this is step 5f.1 — the arch-family gate.
Step 5f.2 (~80 LOC, follow-up) does the actual weight materialization
into optimizer state. Splitting keeps each PR small + reviewable.

What this PR adds:

  1. `pub fn validate_pretrain_init_arch_compatible(cfg: &TransformerConfig)
     -> Result<(), String>` (~30 LOC including doc comment) at
     pretrain_real.rs:35

  2. 3 unit tests in `pretrain_real::tests`:
       - validate_pretrain_init_arch_accepts_decoder       (FALSIFY-007 negative)
       - validate_pretrain_init_arch_rejects_encoder       (FALSIFY-007 positive,
                                                            load-bearing)
       - validate_pretrain_init_arch_accepts_llama370m_baseline (drift-prevention,
                                                                 catches over-rejection
                                                                 regression)

The encoder-rejection test asserts FOUR string contents in the error:
  - "FALSIFY-APR-PRETRAIN-ARCH-007" — falsifier id (auditability)
  - "Encoder"                      — names the architecture family
  - "decoder-only"                 — explains why this is wrong
  - "RobertaModel"                 — names the offending hf_architecture
Operator-experience parity: when the gate fires, the error tells the
operator exactly what they did wrong + how the trainer differs.
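
The gate and its four-part error message might look roughly like the sketch below. The `ArchFamily` enum, the two-field `TransformerConfig`, and the message wording are assumptions made for illustration; the real struct layout is not shown in this PR text. Only the function signature and the four required substrings are taken from the description above.

```rust
/// Hypothetical architecture-family tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArchFamily {
    Decoder,
    Encoder,
}

/// Minimal stand-in for `TransformerConfig` with only the fields the gate needs.
struct TransformerConfig {
    family: ArchFamily,
    hf_architecture: String,
}

/// Accepts decoder-family configs; rejects everything else with an error that
/// names the falsifier id, the family, the decoder-only requirement, and the
/// offending hf_architecture.
fn validate_pretrain_init_arch_compatible(cfg: &TransformerConfig) -> Result<(), String> {
    match cfg.family {
        ArchFamily::Decoder => Ok(()),
        family => Err(format!(
            "FALSIFY-APR-PRETRAIN-ARCH-007: init architecture family {:?} \
             (hf_architecture = {}) is incompatible with the decoder-only pretrain trainer",
            family, cfg.hf_architecture
        )),
    }
}
```

Matching on the family and treating `Decoder` as the only accepting arm means any family added later is rejected by default, which keeps the gate fail-closed.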

Test results (cargo test -p aprender-train --lib train::pretrain_real::tests::validate_pretrain_init_arch):
    3 passed; 0 failed; 0 ignored

Five Whys:

  1. Why a separate function rather than baking the check into
     build_transformer_config? Decoupling: build_transformer_config is
     a pure pass-through dispatch; adding arch validation would conflate
     "which config?" with "is this config valid?". Two functions, two
     concerns, two test surfaces.

  2. Why focus this PR on JUST the arch-family check (step 5f.1) and
     not the full weight materialization (step 5f)? Single-piece flow.
     Step 5f's full scope (~120 LOC) splits naturally into 5f.1 (this
     PR, ~30 LOC + 3 tests) + 5f.2 (~80 LOC, the actual weight load).
     Each PR has its own falsifier discharge; CI catches regressions
     between them.

  3. Why FOUR string assertions in the encoder-rejection error? Each
     piece of the error text serves a distinct operator need:
       - falsifier id → audit (which contract did this fail?)
       - architecture family → what (encoder vs decoder)
       - "decoder-only" → why (the trainer is decoder-only)
       - hf_architecture → which model (RobertaModel/CodeBERT/...)
     Lossy error messages erode operator trust; the contract pins
     all four to prevent message rot.

  4. Why include the Llama370M baseline drift-prevention test? §24's
     retrospective showed silent over-rejection (every input rejected,
     even valid ones) is the symmetric defect to silent under-rejection
     (every input accepted, even invalid ones). The 3 tests cover both
     halves of the dispatch.

  5. Why is FALSIFY-006 (init_loss < 6.0) NOT yet discharged? That
     requires the actual weight materialization (step 5f.2) PLUS a
     LIVE training run (step 5g). Step 5f.1 is just the gate; the
     load-bearing init_loss measurement is downstream.

Plain ship-% update:
  - MODEL-1: unchanged at 91% (SHIP-007 cascade infrastructure track)
  - MODEL-2: unchanged at 57% — first ship-% movement gated on §50.4
    step 5g (LIVE 500-step fine-tune producing val_loss < 9.38)

Refs:
  - SPEC-SHIP-TWO-001 §50 — MODEL-2 architecture-coupling (PR #1472, MERGED)
  - PR #1473 — apr-pretrain-arch-polymorphic-v1 contract (in flight)
  - PR #1474 — qwen2_0_5b tie_word_embeddings fix (in flight)
  - PR #1475 — build_transformer_config polymorphic dispatch (in flight)
  - PR #1476 — preflight_tokenizer_vocab_matches_target (in flight)
  - PR #1478 — GQA-7:1 forward-pass smoke test (MERGED)
  - feedback_no_guessing.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the feat/validate-pretrain-init-arch-compatible branch from 7069a83 to 55040a5 Compare May 4, 2026 19:34
@noahgift noahgift merged commit 96653ff into main May 4, 2026
10 checks passed
@noahgift noahgift deleted the feat/validate-pretrain-init-arch-compatible branch May 4, 2026 20:01
noahgift added a commit that referenced this pull request May 4, 2026
… 5f.3

Add `populate_trainer_from_init_tensors(transformer, init_tensors)` —
the population half of `apr pretrain --init`. Iterates the model's
`named_parameters()` set, looks up each name in the init BTreeMap (HF
naming preserved by §50.4 step 5f.2's loader), validates length, and
calls `Transformer::set_named_parameter()`.

`apr-pretrain-arch-polymorphic-v1` §init_load_semantics:
  - Population invariant: "Init tensors populate trainer parameters
    byte-equivalent to source"
  - FALSIFY-APR-PRETRAIN-INIT-007 (population step) at PARTIAL_ALGORITHM_LEVEL

1. **Why strict on missing-required?** Architecture mismatch (e.g., init
   from a different model family) would silently leave random init for
   absent parameters, which §28's SHIP-007 lesson teaches us is the
   exact class of "silent gibberish" defect that hides for many epochs.
2. **Why strict on length-mismatch?** A length mismatch indicates the
   from_apr_metadata extractor misread a shape — populating regardless
   would silently truncate or pad, masking the bug.
3. **Why permissive on extra-init-entries?** Tied embeddings: a Qwen2.5
   APR may publish a separate `lm_head.weight` that the trainer's tied
   model omits. Failing on extra entries would force operators to
   pre-strip APRs, which is muda.
4. **Why FALSIFIER ID in error message?** §28 lesson — falsifier IDs in
   error messages turn opaque CI failures into self-explaining defects.
5. **Why two functions (load, then populate) rather than one fused function?** Decoupling keeps
   `aprender-train` free of `aprender-serve` (the APR loader): the
   loader is a free function in §50.4 step 5f.2; this is the consumer.
   Two-step composition is testable independently (and is, in this PR).

- `populate_trainer_from_init_tensors_happy_path`: every param matched
  → returns Ok(N) where N = named_parameters().len()
- `populate_trainer_from_init_tensors_extra_entries_silently_ignored`:
  fictitious extra entry must NOT cause Err (tied-embeddings safety)
- `populate_trainer_from_init_tensors_rejects_length_mismatch`: wrong
  flat length → Err naming the param + falsifier ID
- `populate_trainer_from_init_tensors_rejects_missing_required_param`:
  missing required → Err with "not present in init APR" + falsifier ID
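
The three policies above (strict on missing, strict on length, permissive on extras) can be sketched as follows. The `Transformer` struct here is a hypothetical stand-in that stores flat parameter buffers by name; the real `named_parameters()` and `set_named_parameter()` signatures are not shown in this PR text, so the ones below are assumptions, as is the error wording beyond the quoted "not present in init APR".

```rust
use std::collections::BTreeMap;

type InitTensors = BTreeMap<String, (Vec<f32>, Vec<usize>)>;

/// Hypothetical stand-in for the trainer's model: named flat parameter buffers.
struct Transformer {
    params: BTreeMap<String, Vec<f32>>,
}

impl Transformer {
    fn named_parameters(&self) -> Vec<String> {
        self.params.keys().cloned().collect()
    }

    fn set_named_parameter(&mut self, name: &str, data: &[f32]) {
        if let Some(slot) = self.params.get_mut(name) {
            slot.copy_from_slice(data);
        }
    }
}

fn populate_trainer_from_init_tensors(
    transformer: &mut Transformer,
    init_tensors: &InitTensors,
) -> Result<usize, String> {
    const ID: &str = "FALSIFY-APR-PRETRAIN-INIT-007";
    let names = transformer.named_parameters();
    // Validate every parameter first so a failure leaves the model untouched.
    for name in &names {
        let (data, _shape) = init_tensors
            .get(name)
            .ok_or_else(|| format!("{}: parameter {} not present in init APR", ID, name))?;
        let expected = transformer.params[name].len();
        if data.len() != expected {
            return Err(format!(
                "{}: parameter {} has length {}, expected {}",
                ID,
                name,
                data.len(),
                expected
            ));
        }
    }
    // Only model parameters are visited, so extra init entries
    // (e.g. an untied lm_head.weight) are ignored without error.
    for name in &names {
        transformer.set_named_parameter(name, &init_tensors[name].0);
    }
    Ok(names.len())
}
```

Driving the loop from the model's parameter names rather than from the init map is what makes the extra-entry case permissive for free. The validate-then-write split is a choice of this sketch, not something the PR text states.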

All 12 tests pass; cargo clippy --lib clean.

- [x] `cargo test -p aprender-train --lib train::pretrain_real::tests` (12/12 pass)
- [x] `cargo clippy -p aprender-train --lib -- -D warnings` (clean)
- [x] No new dependencies; pure aprender-train + autograd Tensor + std::collections::BTreeMap

Step 5f.3 caps the §50.4 step-5f sub-cascade:
  5f.1 — encoder-family validator (PR #1479, awaiting CI)
  5f.2 — load_init_tensors_from_apr (PR #1481 MERGED)
  5f.3 — THIS PR (populate_trainer_from_init_tensors)
Remaining roadmap: 5g (LIVE 500-step fine-tune, operator dispatch),
5h (stamp + publish as MODEL-2 v2).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…PARTIAL_ALGORITHM_LEVEL — §50.4 cascade snapshot (#1482)

## Summary

Bump `apr-pretrain-arch-polymorphic-v1` contract status from PROPOSED to
PARTIAL_ALGORITHM_LEVEL. All 8 FALSIFY-APR-PRETRAIN-ARCH-* falsifiers
are now bound to executable tests across the §50.4 cascade.

## Falsifier scoreboard (post-§51 snapshot)

| ID         | Rule                                          | PR                | Status                |
|------------|-----------------------------------------------|-------------------|-----------------------|
| FALSIFY-001 | qwen2_0_5b matches HF config                  | #1474 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-002 | build_transformer_config(None) → Llama370M    | #1475 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-003 | build_transformer_config(Some) extracts 10    | #1475 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-004 | GQA-7:1 forward-pass smoke                    | #1478 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-005 | Qwen tokenizer passes with --init Qwen        | #1476 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-006 | Qwen tokenizer fails without --init           | #1476 merged      | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-007 | encoder-arch APR fail-fast                    | #1479 open (auto-merge armed) | PARTIAL_ALGORITHM_LEVEL |
| FALSIFY-008 | contract self-validates via pv                | this PR (validates clean) | PARTIAL_ALGORITHM_LEVEL |

## Test plan

- [x] pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml exits 0
- [x] All 8 falsifiers cite a concrete test path or PR
- [x] Changelog entry under metadata.changelog with version/date/change

## Why now

Per `feedback_falsifier_first_cascade_pattern.md`: when a saturated
auto-merge queue (≥4 PRs) blocks more impl PRs, switch to non-conflict
work. This contract bump:
  - touches only one YAML file (no Rust/test source)
  - cannot conflict with #1479 / #1481 (impl PRs)
  - audit-trails the cascade scoreboard

Promotion to FUNCTIONAL is gated on #1479 landing (FALSIFY-007 PASS).
Promotion to DISCHARGED is gated on §50.4 step 5g LIVE empirical run.

## Five Whys

1. Why bump status now? — 7/8 falsifiers bound on main + 8th bound on
   open PR; PROPOSED is stale.
2. Why not wait for #1479 land first? — §51 snapshot recorded "7/8
   PARTIAL bound" 2 hours ago; the 8th binding is the contract-self
   validation, which is met by THIS PR's `pv validate` output.
3. Why not bundle with #1479? — Different file, different review scope,
   different concern (status semantics vs. impl).
4. Why not skip the bump? — Operator-facing scoreboard is in the YAML;
   stale PROPOSED implies "not yet started" which contradicts §51.
5. Why YAML changelog instead of just version? — Changelog records
   THIS bump's reasoning so future operators don't re-derive it from
   git log.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…LETE + 5f.4 wireup gap identified (#1486)

## Summary

Same-day continuation of §51 cascade landed PR #1479 (FALSIFY-007
encoder/decoder validator) and PR #1481 (load_init_tensors_from_apr).
PR #1483 (5f.3 populate) and PR #1482 (contract status bump) are
MERGEABLE in queue. All 8 falsifiers in `apr-pretrain-arch-polymorphic-v1`
are now PARTIAL_ALGORITHM_LEVEL bound on main or about to land.

§52 records:
1. Updated falsifier scoreboard (8/8 vs §51's 7/8)
2. NEW step 5f.4 (CLI wireup, ~150 LOC) identified via live source
   inspection of `apr-cli/src/commands/pretrain.rs:259-297`
3. Step 5g LIVE 500-step fine-tune is now gated on 5f.4 landing first

## Why now

Per `feedback_falsifier_first_cascade_pattern.md`: when a saturated
auto-merge queue blocks more impl PRs (#1483 + #1482 both in queue
touching pretrain_real.rs), switch to non-conflicting work. This spec
amendment touches one markdown file with no PR conflicts.

## Five Whys (§52.8 in body)

1. Why didn't §50 catch 5f.4? — top-down arch-coupling lens missed the
   CLI-dispatch seam.
2. Why is 5f.4 separate from 5f.3? — different crate (apr-cli vs
   aprender-train).
3. Why must 5f.4 be atomic? — removing "not yet wired" Err without the
   wireup produces silent random-init (§28 SHIP-007 defect class).
4. Why ~150 LOC? — 4 levels of plumbing + new builder + tests + CUDA.
5. Why call 5f.4 out in spec? — without §52, readers would assume 5g
   is dispatchable; spec is the source of truth.

## Test plan

- [x] Single markdown file, no Rust changes
- [x] Falsifier scoreboard table updated to 8/8 PARTIAL_ALGORITHM_LEVEL
- [x] Step roadmap table adds 5f.4 between 5f.3 and 5g
- [x] Cadence preserved: §41 → ... → §51 → §52 (12 amendments since 2026-05-03)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
… 5f.3 (#1483)

noahgift added a commit that referenced this pull request May 5, 2026
…ep 5f.4 (#1494)

## Summary

Wire `apr pretrain --init <PATH>` end-to-end so step 5g LIVE 500-step
fine-tune can dispatch. Replaces the §49 step 4 "not yet wired" Err
with the actual init-tensor load + trainer populate path that
§50.4 steps 5f.1/5f.2/5f.3 made possible.

## Architecture

Two functions added/changed:

1. `entrenar::train::pretrain_real::build_shared_trainer_with_init` —
   composes the §50.4 step-5f machinery (5c polymorphic dispatch +
   5f.1 encoder rejection + 5f.2 load + 5f.3 populate) into a single
   trainer-builder entry. init=None preserves the from-scratch baseline
   byte-equivalent to `build_shared_trainer`. init=Some validates arch
   family, builds the polymorphic config, loads tensors, populates.

2. `apr-cli/src/commands/pretrain.rs::run` — now extracts the init APR
   file's TransformerConfig via existing `model_config::read_apr_architecture`
   when `--init` is set, then plumbs both `init_arch` and `init_path`
   through `drive_real → drive_real_cpu → build_shared_trainer_with_init`.
   The polymorphic preflight (§50.4 step 5d) already used the EXTRACTED
   vocab — this PR wires the call site to actually pass it.
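
The ordering that the builder composes (paired-argument check, arch gate, tensor load, populate) can be sketched as below. All types here are hypothetical stand-ins, the load step is reduced to a file-existence check, and the error strings are illustrative; only the function name, the init=None behaviour, and the gate-before-load ordering come from the description above.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArchFamily {
    Decoder,
    Encoder,
}

/// Minimal stand-in config; `label` only identifies which shape was used.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TransformerConfig {
    family: ArchFamily,
    label: String,
}

/// Hypothetical trainer handle recording how it was built.
#[derive(Debug, PartialEq, Eq)]
struct Trainer {
    config: TransformerConfig,
    initialized_from_apr: bool,
}

/// Placeholder for the from-scratch baseline shape.
fn llama370m_baseline() -> TransformerConfig {
    TransformerConfig { family: ArchFamily::Decoder, label: "Llama370M".to_string() }
}

fn build_shared_trainer_with_init(
    init_arch: Option<&TransformerConfig>,
    init_path: Option<&Path>,
) -> Result<Trainer, String> {
    match (init_arch, init_path) {
        // init=None: same result as the from-scratch path.
        (None, None) => Ok(Trainer { config: llama370m_baseline(), initialized_from_apr: false }),
        (Some(arch), Some(path)) => {
            // 1. Arch-family gate runs before any file I/O.
            if arch.family != ArchFamily::Decoder {
                return Err(format!(
                    "FALSIFY-APR-PRETRAIN-ARCH-007: {:?} init is incompatible with the decoder-only trainer",
                    arch.family
                ));
            }
            // 2. Tensor load, stubbed here as an existence check.
            if !path.is_file() {
                return Err(format!("init APR not readable: {}", path.display()));
            }
            // 3. The populate step would run here on the loaded tensors.
            Ok(Trainer { config: arch.clone(), initialized_from_apr: true })
        }
        // Supplying only one of the two arguments is a caller bug.
        _ => Err("init_arch and init_path must be supplied together".to_string()),
    }
}
```

Because the gate precedes the load, an encoder config paired with a nonexistent path still reports the family mismatch, which is the failure ordering a test can pin.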

## What this PR DOES NOT do

- **CUDA path** (~80 LOC follow-up as 5f.5): `drive_real_cuda` now
  fail-fasts when --init is set rather than silently using random init
  (FALSIFY-APR-PRETRAIN-INIT-CUDA-001). The cuBLAS trainer needs
  symmetric `build_shared_cuda_trainer_with_init` which is out of scope.
- **Step 5g LIVE 500-step fine-tune** (operator dispatch): this PR makes
  it dispatchable; running the 500 steps requires operator action.
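
The fail-closed CUDA behaviour in the first bullet amounts to a guard like the one sketched here. `Device` and `check_init_device_supported` are hypothetical names and the message wording is illustrative; the falsifier id and the `--init` / `--device cuda` combination are taken from this PR text.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Device {
    Cpu,
    Cuda,
}

/// Hypothetical dispatch guard: refuse `--init` on CUDA until a CUDA builder
/// exists, rather than silently training from random init.
fn check_init_device_supported(device: Device, init_path: Option<&Path>) -> Result<(), String> {
    match (device, init_path) {
        (Device::Cuda, Some(path)) => Err(format!(
            "FALSIFY-APR-PRETRAIN-INIT-CUDA-001: --init {} is not supported with \
             --device cuda yet; refusing to fall back to random init",
            path.display()
        )),
        _ => Ok(()),
    }
}
```

CUDA without `--init` and CPU with or without `--init` pass through unchanged, so the guard only affects the one combination that is not yet wired.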

## Discharges (per apr-pretrain-arch-polymorphic-v1)

- §init_load_semantics integration: load + populate composed end-to-end
- §arch_extraction_signature integration: read_apr_architecture wired
- §qwen_tokenizer_vocab_compatibility integration: extracted vocab
  flows into preflight call site (no longer hardcoded Llama370M)
- FALSIFY-APR-PRETRAIN-INIT-007 (population) at INTEGRATION level
- The legacy "not yet wired" guard from §49 step 4 is RETIRED — the
  drift-prevention test now pins the new fail-closed semantic.

## Tests (8 new across 2 crates, all pass)

- `aprender-train`: 4 new tests for `build_shared_trainer_with_init`:
  - `_none_uses_llama370m_shape` (regression-free init=None)
  - `_rejects_unpaired_args` (caller-bug guard)
  - `_rejects_encoder_family` (FALSIFY-007 integration)
  - `_decoder_family_proceeds_to_tensor_load` (failure ordering pin)
- `apr-cli`: 2 retrofitted tests for the new fail-closed semantic:
  - `pretrain_init_valid_magic_but_bogus_metadata_fails_at_arch_extraction`
    (replaces the old "not yet wired" trip-wire)
  - `pretrain_init_v1_magic_aprn_passes_validate_init_apr_path`
    (helper now returns Ok on valid magic)

19/19 pretrain_real tests pass. 23/23 apr-cli pretrain tests pass.
cargo clippy --lib -- -D warnings clean across both crates.

## Five Whys

1. **Why was 5f.4 needed at all?** §50's 5a-5h decomposition assumed
   the CLI dispatch would naturally invoke the helper functions; live
   source inspection (§52 amendment) revealed the dispatch hardcoded
   "not yet wired" Err. 5f.4 is the explicit wireup.
2. **Why is removing the safety Err so load-bearing?** The §28 SHIP-007
   lesson: silently random-init via a half-implemented dispatch is the
   exact "silent gibberish" defect class. Removing the safety Err
   without the wireup would manifest as a multi-epoch divergence
   masquerading as a corpus-quality issue.
3. **Why a separate polymorphic builder rather than overload `build_shared_trainer`?**
   `build_shared_trainer` enforces INV-ARCH-370M-001 (param-count band)
   which only applies to from-scratch Llama370M. The polymorphic builder
   sidesteps it by design — Qwen2.5-0.5B is 0.5B params, outside the
   band by intent.
4. **Why fail-fast on `--init` + `--device cuda` rather than silently
   ignore?** Same reasoning as Why 2: silent CUDA random-init would
   fall into the same "silent gibberish" defect class. The 5f.5 follow-up
   wires the symmetric CUDA path; until then, fail-closed.
5. **Why couldn't this be inside #1483 (the populate PR)?** Different
   crate (apr-cli vs aprender-train), different review concern (CLI
   plumbing vs trainer mutation), different test surface. One atomic
   PR per file/crate boundary.

## Test plan

- [x] `cargo test -p aprender-train --lib train::pretrain_real::tests` (19/19 pass)
- [x] `cargo test -p apr-cli --lib commands::pretrain` (23/23 pass)
- [x] `cargo clippy -p aprender-train -p apr-cli --lib -- -D warnings` (clean)
- [x] `cargo check -p apr-cli --lib` (clean)
- [ ] Operator-dispatched: `apr pretrain --init <Qwen2.5-Coder-0.5B>.apr`
      smoke that fires 50 training steps end-to-end (5g LIVE prelude;
      operator action in next session)

## Cascade context

This is the §52-identified gap closing the §50.4 step 5f sub-cascade:
- 5f.1 encoder validator: PR #1479 ✅ MERGED
- 5f.2 load_init_tensors_from_apr: PR #1481 ✅ MERGED
- 5f.3 populate_trainer_from_init_tensors: PR #1483 (mergeable, in queue)
- **5f.4 CLI wireup: THIS PR**
- 5g LIVE 500-step fine-tune: operator dispatch (next)
- 5h stamp + publish: ~10 LOC follow-up

Once 5f.4 lands AND 5g produces val_loss < 9.38 evidence, MODEL-2 ship % moves 57% → ≥58%.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…ION-COMPLETE; contract v1.1.0 → v1.2.0 FUNCTIONAL (#1495)

§50.4 cascade INTEGRATION-COMPLETE on main with PR #1494 merging at
2026-05-05T01:48:14Z. The `apr pretrain --init <PATH>` flow is now
end-to-end functional on CPU; the legacy "not yet wired" Err is
RETIRED; step 5g LIVE is the only remaining gate before MODEL-2 ship-%
can move from 57% → ≥58%.

Spec amendment §53:
- Updated falsifier scoreboard: 6/8 INTEGRATION (001/002/003/005/006/007
  via live CLI dispatch); 2/8 PARTIAL_ALGORITHM_LEVEL (004 forward-pass
  smoke + 008 contract validation are inherently algorithm-level).
- Step roadmap: 5a-5f.4 ✅ MERGED; 5f.5 (CUDA wireup) NOT YET STARTED;
  5g (LIVE 500-step fine-tune) operator-dispatchable on RTX 4090.
- Cascade ship statistics: 13 PRs over 2 days
  (#1471/#1472/#1473/#1474/#1475/#1476/#1478/#1479/#1481/#1482/#1483/#1486/#1494).
- MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57%
  (gated on 5g empirical val_loss < 9.38 evidence).
- 3 CI andon classes documented as feedback memories during cascade
  (workspace-test missing-binary, trueno SIGSEGV-on-cleanup, auto-merge
  behind-state).

Contract apr-pretrain-arch-polymorphic-v1 v1.1.0 → v1.2.0 FUNCTIONAL:
- All 8 falsifiers PASS on main; 6/8 reach INTEGRATION via the
  user-facing `apr pretrain --init` flow.
- verification_summary updated: tested 7 → 8; status partial →
  functional.
- Added §52 + §53 references.
- Promotion to DISCHARGED still requires §50.4 step 5g LIVE empirical
  500-step fine-tune on canonical Qwen2.5-Coder-0.5B-Instruct.apr
  producing val_loss < 9.38.

`pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml` exits 0.

Refs: SPEC-SHIP-TWO-001 §50.4 cascade, PR #1494 merge commit 9afca16

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…-003/004/007 drift (round 2) (#1509)

* contract(apr-pretrain-arch-polymorphic-v1): v1.5 → v1.6 — fix FALSIFY-003/004/007 drift (round 2)

Second-round test-reference drift correction. §57's drift sweep
(this contract's v1.4 → v1.5 bump in PR #1505) caught FALSIFY-005/006
but a more thorough audit (cross-referencing every `test:` field
against the source-code function-name registry) surfaced three
additional dangling references.

## Drift inventory (round 2)

  | Falsifier | v1.5.0 cited test                                           | Exists? | Actual test                                                  |
  | ---       | ---                                                         | ---     | ---                                                          |
  | 003       | build_transformer_config_qwen_init_matches_constructor      | ❌       | build_transformer_config_qwen_init_matches_input             |
  | 004       | transformer::attention::tests::gqa_7_to_1_matches_full_mha  | ❌       | transformer::model::tests::falsify_apr_pretrain_arch_004_*   |
  | 007       | build_transformer_config_encoder_init_errors                | ❌       | validate_pretrain_init_arch_rejects_encoder                  |

## Why §57 (PR #1505) didn't catch these

§57's grep audited test-name SUFFIXES and FRAGMENTS, which produced
false-negatives on:
  - `_init_matches_constructor` vs `_init_matches_input` — both end
    in `_matches_<word>` so a fragment grep counted the contract's
    name as "not dangling"
  - `transformer::attention::tests::` vs `transformer::model::tests::` —
    module-path drift not just function-name drift; only fully-
    qualified path comparison catches this
  - `_encoder_init_errors` vs `validate_pretrain_init_arch_rejects_encoder` —
    the contract's name was a guess at the impl name; impl PR #1479
    chose a completely different convention
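
The first false-negative class above can be reproduced with a minimal sketch. The exact grep used in §57 is not shown here, so `fragment_bound` is a hypothetical stand-in that assumes the audit matched on the shared `_matches_` stem; `exact_bound` is the strict comparison it is contrasted with.

```rust
/// Hypothetical fragment-style audit: a cited test counts as "bound" when
/// any known test shares its stem up to and including `_matches_`.
/// This is an assumed reconstruction, not the actual §57 grep.
fn fragment_bound(cited: &str, known: &[&str]) -> bool {
    const MARKER: &str = "_matches_";
    match cited.rfind(MARKER) {
        Some(idx) => {
            let stem = &cited[..idx + MARKER.len()];
            known.iter().any(|k| k.starts_with(stem))
        }
        // No marker in the name: fall back to an exact comparison.
        None => known.contains(&cited),
    }
}

/// Strict audit: the cited name must equal a known test name exactly.
fn exact_bound(cited: &str, known: &[&str]) -> bool {
    known.contains(&cited)
}
```

With `known = ["build_transformer_config_qwen_init_matches_input"]`, the cited name `build_transformer_config_qwen_init_matches_constructor` is accepted by `fragment_bound` and rejected by `exact_bound`, which is the false negative described above.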

## How this round was found

Used a stricter audit: for every `cargo test ... ::tests::<name>`
in contracts, grep `fn <name>` in the actual source tree. If the
fn doesn't exist, drift. This catches drift that PR #1505's
fragment-based audit missed.

## Resolution

Update FALSIFY-003/004/007 `test:` fields to the actual function
names. No falsifier semantics change. 11 falsifiers all PASS;
contract status remains FUNCTIONAL.

## Verification

  $ cargo test -p aprender-train --lib -- build_transformer_config_qwen_init_matches_input
    test result: ok. 1 passed
  $ cargo test -p aprender-train --lib -- falsify_apr_pretrain_arch_004_gqa_7_1_forward_pass_smoke
    test result: ok. 1 passed
  $ cargo test -p aprender-train --lib -- validate_pretrain_init_arch_rejects_encoder
    test result: ok. 1 passed
  $ pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml
    0 error(s), 0 warning(s)

## Five Whys

1. Why did §57's sweep miss these? It used a name-fragment grep
   (`::tests::[a-z_]+`), which produced false negatives on names that
   differ only in their final word, such as `_constructor` ↔ `_input`.
2. Why is module-path drift a separate class? Because grep against
   the `[a-z_]+` regex captures the FUNCTION name, not the
   `::module::tests::` path. A function with the right name in the
   wrong module passes that audit but fails actual test invocation.
3. Why fix in a separate PR rather than amending PR #1505? PR #1505
   already merged. Per `feedback_falsifier_first_cascade_pattern.md`
   the cleanest cadence is one-bump-per-PR.
4. Why bump to v1.6.0? Same pattern as PR #1505's v1.4 → v1.5: the
   test-binding INVARIANT was broken in v1.5.0 (residual drift) and
   v1.6.0 restores it.
5. Why now (during the 5g.1 wait)? The 5g.1 run is compute-bound with
   ~10hr remaining, so this puts otherwise idle time to use. Each drift
   fix is small (~30 LOC), reduces drift risk for future agents, and
   restores the falsifier-binding invariant. The alternative
   (manufacturing bigger work) would risk introducing defects the
   contract base doesn't catch yet.
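
The module-path class from Why 2 can be separated from plain missing tests by comparing fully qualified paths. The sketch below assumes a list of known `module::tests::fn` paths is available; `Binding` and `classify` are hypothetical names introduced only for illustration.

```rust
/// Outcome of checking one cited test path against the known paths.
#[derive(Debug, PartialEq)]
enum Binding {
    /// The fully qualified path exists as cited.
    Exact,
    /// A function with the same name exists, but under a different module path.
    ModuleDrift,
    /// No function with that name exists anywhere in the known paths.
    Missing,
}

fn classify(cited: &str, known_paths: &[&str]) -> Binding {
    if known_paths.contains(&cited) {
        return Binding::Exact;
    }
    // The function name is the last `::`-separated segment of the path.
    let cited_fn = cited.rsplit("::").next().unwrap_or(cited);
    let same_fn_elsewhere = known_paths
        .iter()
        .any(|p| p.rsplit("::").next() == Some(cited_fn));
    if same_fn_elsewhere {
        Binding::ModuleDrift
    } else {
        Binding::Missing
    }
}
```

A function-name-only audit would report `ModuleDrift` cases as bound, even though invoking the test by its cited path would run nothing.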

## Net effects

- Contract v1.5.0 → v1.6.0 FUNCTIONAL.
- 11 falsifiers, all PASS — same count, but FALSIFY-003/004/007
  now reference tests that actually exist.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3.

This is the SECOND round of drift sweep on this contract. Together
with PRs #1502/#1504/#1505/#1506 (round 1), all known
test-reference drift is closed across the §50.4 cascade contracts.
A future spec amendment could codify a `pv lint --strict-test-binding`
enforcement that prevents drift at contract-merge time.

Refs: SPEC-SHIP-TWO-001 §50.4 cascade,
      contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.6.0,
      PR #1505 (round 1 partial fix), PR #1502/#1504/#1506 (sibling fixes)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* contract(apr-pretrain-arch-polymorphic-v1): also fix FALSIFY-001 (round 2.5 — surfaced by PR #1511)

Round 2 (initial commit on this branch) fixed FALSIFY-003/004/007.
Sub-agent PR #1511 (`pv lint --strict-test-binding`) surfaced a 4th
drift in this same contract:

  FALSIFY-001 cited `qwen2_0_5b_matches_hf_config`
    → does NOT exist on main.
  Actual: `qwen2_0_5b_matches_hf_config_2026_05_04`
    (date-suffix added by impl PR #1474 / commit 9af6e71 — May 4).

The earlier round-2 audit (which focused on suffix + module-path
drift) didn't catch this because it belongs to a different drift
class, DATE-SUFFIX drift: the real Rust test is the function name
plus a `_<date>` suffix, but the contract cited only the prefix.
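
A minimal sketch of how this class could be flagged, assuming the list of real test names is known. `date_suffix_candidates` is a hypothetical helper, and treating any run of digits and underscores as a date is a simplifying assumption.

```rust
/// Returns known test names that equal `cited` plus a `_<digits/underscores>`
/// suffix, e.g. a cited prefix whose real test carries `_2026_05_04`.
fn date_suffix_candidates<'a>(cited: &str, known: &[&'a str]) -> Vec<&'a str> {
    known
        .iter()
        .copied()
        .filter(|k| {
            k.strip_prefix(cited)
                .and_then(|rest| rest.strip_prefix('_'))
                .map_or(false, |tail| {
                    // Require a non-empty tail made only of digits and `_`.
                    !tail.is_empty()
                        && tail.chars().all(|c| c.is_ascii_digit() || c == '_')
                })
        })
        .collect()
}
```

For the cited name `qwen2_0_5b_matches_hf_config`, a known test `qwen2_0_5b_matches_hf_config_2026_05_04` is returned as a candidate, while an exact match or a non-numeric suffix is not.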

Updates:
- FALSIFY-001 test ref: append `_2026_05_04` suffix.
- v1.6.0 changelog updated to record 4 fixes (was 3).
- Verified: cargo test qwen2_0_5b_matches_hf_config_2026_05_04 PASS.
- pv lint --strict-test-binding contracts/apr-pretrain-arch-polymorphic-v1.yaml: 0 PV-VER-002 (down from 4 pre-fix).

This consolidates round 2 into a single commit on the same branch
+ PR (#1509) rather than spawning a round-3 PR for one extra fix.
The lint hardening in #1511 is what made finding the 4th drift
trivial; future drift will be caught at contract-merge time once
#1511 lands.

Refs: SPEC-SHIP-TWO-001 §50.4 cascade,
      PR #1511 (sub-agent's pv lint --strict-test-binding),
      Issue #1510

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>