fix(pretrain): default to drive_real; add GATE-TRAIN-010 falsifier #949

Merged
noahgift merged 4 commits into main from fix/pretrain-synthetic-default-routing
Apr 21, 2026
@noahgift

Summary

  • Root cause: apr pretrain shipped with #[arg(long, default_value = "true")] synthetic: bool and no --synthetic=false companion. Every invocation silently routed to drive_synthetic (scripted loss), so tasks #119 / #124 / #125 captured synthetic output and mislabeled it as real compute. val_loss=18 at target=10 and val_loss=4.2 at target=3 match the synthetic formula (target × 1.8, linearly decayed), proving no real compute ran.
  • Fix: action = ArgAction::SetTrue → absent = false = drive_real, present = true = drive_synthetic.
  • Contract-first: bump training-loop-pretrain-v1 v1.3.0 → v1.4.0 with INV-TRAIN-010 (CLI default routing) and GATE-TRAIN-010 (ship_blocking, binds AC-SHIP2-003). Two parse-level falsifier tests assert the routing discriminator byte-for-byte.
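
The routing change described above can be modeled without any dependencies. The sketch below is not the clap code from `extended_commands.rs`; the `Drive` enum and the `synthetic_flag`, `synthetic_flag_before_fix`, and `route` functions are hypothetical names used only to illustrate the "absent = false, present = true" semantics that `ArgAction::SetTrue` is described as providing, next to the pre-fix behavior as this PR describes it.

```rust
/// Which training driver an invocation reaches (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
enum Drive {
    Real,
    Synthetic,
}

/// Models the fixed flag: absent => false, present => true.
fn synthetic_flag(args: &[&str]) -> bool {
    args.iter().any(|a| *a == "--synthetic")
}

/// Models the pre-fix behavior as described in this PR: the default was
/// "true" and there was no way to switch it off, so the value was always true.
fn synthetic_flag_before_fix(_args: &[&str]) -> bool {
    true
}

/// The routing discriminator: the flag value alone picks the driver.
fn route(synthetic: bool) -> Drive {
    if synthetic {
        Drive::Synthetic
    } else {
        Drive::Real
    }
}

fn main() {
    let plain = ["apr", "pretrain"];
    let flagged = ["apr", "pretrain", "--synthetic"];

    // Before the fix, a plain invocation still reached the synthetic driver.
    assert_eq!(route(synthetic_flag_before_fix(&plain)), Drive::Synthetic);

    // After the fix, the default reaches real compute and the flag opts in.
    assert_eq!(route(synthetic_flag(&plain)), Drive::Real);
    assert_eq!(route(synthetic_flag(&flagged)), Drive::Synthetic);
}
```

The two post-fix assertions correspond to the two routing cases the falsifier tests are said to cover; the real tests go through `Cli::try_parse_from` rather than a hand-rolled scan.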

Why parse-level tests?

Existing mode_* tests call run(...) directly with explicit synthetic: bool — they never exercise the clap surface, which is exactly why the defect slipped through task #105. The new tests drive Cli::try_parse_from so any future regression in the clap attribute fails a unit test before it reaches CI.

Test plan

  • cargo test -p apr-cli --features training --lib pretrain → 8/8 pass (debug + release)
  • pv validate contracts/training-loop-pretrain-v1.yaml → 0 errors, 0 warnings
  • cargo build -p apr-cli --features training → clean
  • CI ci / gate + workspace-test green
  • After merge: rebuild canonical binary on lambda-labs, re-dispatch task #126

Unblocks

Task #126 — MODEL-2 from-scratch pretrain on lambda-labs RTX 4090.

🤖 Generated with Claude Code

Contract training-loop-pretrain-v1 v1.3.0 → v1.4.0.

Root cause (tasks #119 / #124 / #125): `extended_commands.rs`
shipped `#[arg(long, default_value = "true")] synthetic: bool`
with no `--synthetic=false` companion. Every `apr pretrain`
invocation silently routed to drive_synthetic (scripted loss),
so every "real-compute" capture was actually synthetic —
val_loss values (18 at target=10, 4.2 at target=3) match the
synthetic formula target × 1.8 linear-decayed.

Fix:
  * extended_commands.rs: flip to `action = ArgAction::SetTrue`
    so absent = false (real), present = true (synthetic).
  * contracts/training-loop-pretrain-v1.yaml:
      - bump 1.3.0 → 1.4.0
      - add INV-TRAIN-010 (CLI default routing must reach drive_real)
      - add GATE-TRAIN-010 (ship_blocking, binds AC-SHIP2-003)
  * crates/apr-cli/src/commands/pretrain.rs:
      - cli_pretrain_defaults_to_real_compute
      - cli_pretrain_synthetic_flag_routes_to_synthetic
    Parse on a 16 MiB worker thread because the `Commands` enum
    overflows the default 2 MiB test-thread stack in debug.
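
The worker-thread workaround mentioned above can be sketched with the standard library alone. This is a minimal illustration, assuming a helper named `run_on_big_stack` (hypothetical, not from the PR); the closure body stands in for the real `Cli::try_parse_from` call, which is not reproduced here.

```rust
use std::thread;

/// 16 MiB, the stack size quoted in the commit message.
const PARSE_STACK_BYTES: usize = 16 * 1024 * 1024;

/// Runs `f` on a freshly spawned thread with an enlarged stack and
/// returns its result. Panics if the thread cannot be spawned or if
/// `f` itself panics, which is acceptable inside a unit test.
fn run_on_big_stack<T, F>(f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(PARSE_STACK_BYTES)
        .spawn(f)
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked")
}

fn main() {
    // Stand-in for parsing the CLI: scan for the flag on the big stack.
    let synthetic = run_on_big_stack(|| {
        let args = ["apr", "pretrain"];
        args.iter().any(|a| *a == "--synthetic")
    });
    assert!(!synthetic);
}
```

Spawning through `thread::Builder` keeps the enlarged stack local to the one test that needs it instead of raising the stack size for the whole test run.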

Verification:
  * cargo test -p apr-cli --features training --lib pretrain
    → 8/8 pass (debug + release)
  * pv validate contracts/training-loop-pretrain-v1.yaml
    → 0 errors, 0 warnings

Unblocks task #126 (MODEL-2 real-compute dispatch).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) April 20, 2026 19:37
noahgift merged commit 29607ed into main Apr 21, 2026
10 checks passed
noahgift deleted the fix/pretrain-synthetic-default-routing branch April 21, 2026 04:14
noahgift added a commit that referenced this pull request Apr 21, 2026
#961)

Task #128. First real MODEL-2 from-scratch compute dispatch (post PR #949
canonical rebuild at commit 29607ed, 2026-04-21) reached drive_real and
immediately fired `Embedding::forward token_id 50241 >= vocab_size 50000.
N-09 OOB escape.` — provably garbage gradients on 257 out-of-range tokens.

Root cause decomposes 3 ways:
  D1: tokenizer artifact at vocab=50_257 (GPT-2 standard: 50_000 merges +
      256 bytes + 1 <|endoftext|>) violates BOTH tokenizer-bpe-v1 and
      llama-370m-sovereign-v1 contracts (both pin 50_000 exact)
  D2: Llama370MConfig::VOCAB_SIZE is a hardcoded const — `--vocab-size`
      CLI flag is silently ignored by the model-side config path
  D3: no pre-flight gate caught the mismatch → N-09 OOB escape guard masked
      garbage tokens as escape id, producing silent data corruption
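
The numbers in D1 can be checked with a few lines of arithmetic. The constants below are taken from this commit message; the function names are hypothetical and exist only for the illustration.

```rust
/// GPT-2 style vocabulary breakdown, as quoted in D1.
const MERGES: u32 = 50_000;
const BYTE_TOKENS: u32 = 256;
const SPECIAL_TOKENS: u32 = 1; // <|endoftext|>

/// Vocabulary size pinned by the two contracts named in D1.
const MODEL_VOCAB: u32 = 50_000;

/// Size of the tokenizer artifact's vocabulary.
fn tokenizer_vocab() -> u32 {
    MERGES + BYTE_TOKENS + SPECIAL_TOKENS
}

/// Number of token ids the tokenizer can emit that the model cannot embed.
fn out_of_range_ids() -> u32 {
    tokenizer_vocab() - MODEL_VOCAB
}

/// True when a token id falls outside the model's embedding table.
fn is_out_of_range(token_id: u32) -> bool {
    token_id >= MODEL_VOCAB
}

fn main() {
    assert_eq!(tokenizer_vocab(), 50_257);
    assert_eq!(out_of_range_ids(), 257);
    // The id reported in the error message above.
    assert!(is_out_of_range(50_241));
    // The largest id the model can embed.
    assert!(!is_out_of_range(49_999));
}
```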

This PR ships the orthogonal D3 fix only. D1/D2 require an A/B decision
(bump both contracts to 50_257, OR retrain tokenizer to 50_000) that is
deferred to a follow-up. The pre-flight gate falsifies any future
mismatch regardless of which number wins.

Contract-first sequence:
  1. contracts/model-families/llama-370m-sovereign-v1.yaml v1.3.0 → v1.4.0:
     new GATE-ARCH-370M-011 + FALSIFY-ARCH-370M-011, ship_blocking,
     binds_to AC-SHIP2-003. `pv validate` → 0 errors.
  2. aprender-train: pure `assert_tokenizer_vocab_matches_model` helper
     (algorithm-level proof, no I/O) + `falsify_gate_arch_370m_011_*`
     test covering match, mismatch (50_257 vs 50_000), and edge cases.
  3. apr-cli: `preflight_tokenizer_vocab_matches_model` wrapper reads
     <tokenizer_dir>/vocab.json, counts entries, delegates to pure helper.
     Called at top of `drive_real` before ShardBatchIter::new — refuses
     to dispatch a real training step on mismatch. Synthetic drive is
     intentionally exempt (never touches the real model).
  4. Tests: 3 new CLI tests (accept match, reject 50257 vs 50000, reject
     missing vocab.json) + fix to existing real_mode_empty_dataset_dir
     test that now stages a valid vocab.json so the shard-iterator
     failure path is still reachable.
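
Step 2 above can be sketched as a pure comparison. The function name `assert_tokenizer_vocab_matches_model` comes from this commit message, but its signature, the `VocabMismatch` error type, and the field names below are assumptions made for illustration; the actual helper in aprender-train may differ. The CLI wrapper that reads `vocab.json` is deliberately left out so the sketch stays free of I/O.

```rust
/// Hypothetical error type describing a tokenizer/model vocabulary mismatch.
#[derive(Debug, PartialEq, Eq)]
struct VocabMismatch {
    tokenizer_vocab: usize,
    model_vocab: usize,
}

/// Pure check with no I/O: succeeds only when both sizes are equal.
/// The signature is assumed, not copied from the repository.
fn assert_tokenizer_vocab_matches_model(
    tokenizer_vocab: usize,
    model_vocab: usize,
) -> Result<(), VocabMismatch> {
    if tokenizer_vocab == model_vocab {
        Ok(())
    } else {
        Err(VocabMismatch {
            tokenizer_vocab,
            model_vocab,
        })
    }
}

fn main() {
    // Matching sizes pass the gate.
    assert!(assert_tokenizer_vocab_matches_model(50_000, 50_000).is_ok());

    // The mismatch from this incident is rejected before any training step.
    let err = assert_tokenizer_vocab_matches_model(50_257, 50_000).unwrap_err();
    assert_eq!(err.tokenizer_vocab, 50_257);
    assert_eq!(err.model_vocab, 50_000);
}
```

Keeping the comparison separate from the file-reading wrapper is what lets the falsifier test exercise match and mismatch cases without staging a tokenizer directory.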

Verification:
  cargo test -p aprender-train --lib llama_370m → 10 passed
  cargo test -p apr-cli --lib commands::pretrain  → 11 passed
  pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → valid

Unblocks task #126 (re-dispatch) pending D1/D2 A/B decision.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>