fix(pretrain): default to drive_real; add GATE-TRAIN-010 falsifier by noahgift · Pull Request #949 · paiml/aprender

noahgift · 2026-04-20T19:37:38Z

Summary

Root cause: apr pretrain shipped with #[arg(long, default_value = "true")] synthetic: bool and no --synthetic=false companion. Every invocation silently routed to drive_synthetic (scripted loss), so tasks #119 / #124 / #125 captured synthetic output and mislabeled it as real compute. val_loss=18 at target=10 and val_loss=4.2 at target=3 match the synthetic formula target × 1.8 linear-decayed, which shows that no real compute ran.
Fix: action = ArgAction::SetTrue → absent = false = drive_real, present = true = drive_synthetic.
Contract-first: bump training-loop-pretrain-v1 v1.3.0 → v1.4.0 with INV-TRAIN-010 (CLI default routing) and GATE-TRAIN-010 (ship_blocking, binds AC-SHIP2-003). Two parse-level falsifier tests assert the routing discriminator byte-for-byte.

Why parse-level tests?

Existing mode_* tests call run(...) directly with explicit synthetic: bool — they never exercise the clap surface, which is exactly why the defect slipped through task #105. The new tests drive Cli::try_parse_from so any future regression in the clap attribute fails a unit test before it reaches CI.
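The gap between mode-level and parse-level tests can be sketched without clap. The block below is a minimal model, not the code in this PR: `synthetic_old`, `synthetic_new`, and `route` are hypothetical stand-ins for the clap attribute before and after the fix and for the dispatch in `run(...)`. Only the names `drive_real` / `drive_synthetic` and the flag semantics come from the description above.

```rust
/// Which driver `apr pretrain` dispatches to (names from the PR text).
#[derive(Debug, PartialEq)]
pub enum Drive {
    Real,
    Synthetic,
}

/// Stand-in for the old attribute, `default_value = "true"` with no way to
/// pass `false`: absent falls back to the default, present is also true.
pub fn synthetic_old(argv: &[&str]) -> bool {
    let default = true;
    if argv.contains(&"--synthetic") { true } else { default }
}

/// Stand-in for `action = ArgAction::SetTrue`: absent = false, present = true.
pub fn synthetic_new(argv: &[&str]) -> bool {
    argv.contains(&"--synthetic")
}

/// Routing discriminator: false reaches real compute, true the scripted loss.
pub fn route(synthetic: bool) -> Drive {
    if synthetic { Drive::Synthetic } else { Drive::Real }
}

#[cfg(test)]
mod falsifiers {
    use super::*;

    /// Mode-level test: passes no matter how the flag is parsed.
    #[test]
    fn mode_real_called_directly() {
        assert_eq!(route(false), Drive::Real);
    }

    /// Parse-level test: starts from argv, so a wrong default is visible.
    #[test]
    fn default_argv_routes_to_real() {
        let argv = ["apr", "pretrain"];
        assert_eq!(route(synthetic_new(&argv)), Drive::Real);
        // The old default can never reach Drive::Real from the same argv.
        assert_eq!(route(synthetic_old(&argv)), Drive::Synthetic);
    }
}
```

In this model the mode-level test stays green under both parsers, while the parse-level test is the only one that distinguishes them. That is the property the new `Cli::try_parse_from` tests are described as providing.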

Test plan

cargo test -p apr-cli --features training --lib pretrain → 8/8 pass (debug + release)
pv validate contracts/training-loop-pretrain-v1.yaml → 0 errors, 0 warnings
cargo build -p apr-cli --features training → clean
CI ci / gate + workspace-test green
After merge: rebuild canonical binary on lambda-labs, re-dispatch task #126

Unblocks

Task #126 — MODEL-2 from-scratch pretrain on lambda-labs RTX 4090.

🤖 Generated with Claude Code

Contract training-loop-pretrain-v1 v1.3.0 → v1.4.0.

Root cause (tasks #119 / #124 / #125): `extended_commands.rs` shipped `#[arg(long, default_value = "true")] synthetic: bool` with no `--synthetic=false` companion. Every `apr pretrain` invocation silently routed to drive_synthetic (scripted loss), so every "real-compute" capture was actually synthetic: val_loss values (18 at target=10, 4.2 at target=3) match the synthetic formula target × 1.8 linear-decayed.

Fix:
* extended_commands.rs: flip to `action = ArgAction::SetTrue` so absent = false (real), present = true (synthetic).
* contracts/training-loop-pretrain-v1.yaml:
  - bump 1.3.0 → 1.4.0
  - add INV-TRAIN-010 (CLI default routing must reach drive_real)
  - add GATE-TRAIN-010 (ship_blocking, binds AC-SHIP2-003)
* crates/apr-cli/src/commands/pretrain.rs:
  - cli_pretrain_defaults_to_real_compute
  - cli_pretrain_synthetic_flag_routes_to_synthetic
  Parse on a 16 MiB worker thread because the `Commands` enum overflows the default 2 MiB test-thread stack in debug.

Verification:
* cargo test -p apr-cli --features training --lib pretrain → 8/8 pass (debug + release)
* pv validate contracts/training-loop-pretrain-v1.yaml → 0 errors, 0 warnings

Unblocks task #126 (MODEL-2 real-compute dispatch).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
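The 16 MiB worker-thread detail can be illustrated with the standard library alone. This is a sketch under assumptions: `on_big_stack` and `PARSE_STACK_BYTES` are hypothetical names, and the closure stands in for whatever parse call the real tests make. Only the 16 MiB figure and the reason for it come from the commit message.

```rust
use std::thread;

/// Stack size quoted in the commit message: 16 MiB.
pub const PARSE_STACK_BYTES: usize = 16 * 1024 * 1024;

/// Run `f` on a dedicated thread with a 16 MiB stack and return its result.
/// A panic inside `f` is re-raised on the caller, so a failing assertion in
/// the closure still fails the surrounding test.
pub fn on_big_stack<T, F>(f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let handle = thread::Builder::new()
        .name("parse-worker".into())
        .stack_size(PARSE_STACK_BYTES)
        .spawn(f)
        .expect("failed to spawn worker thread");
    match handle.join() {
        Ok(value) => value,
        Err(payload) => std::panic::resume_unwind(payload),
    }
}
```

A test would wrap its parse-and-assert body in `on_big_stack(|| { ... })`, so the deep frames are paid for on the worker's stack rather than on the test thread's default stack.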

(#961) Task #128.

First real MODEL-2 from-scratch compute dispatch (post PR #949 canonical rebuild at commit 29607ed, 2026-04-21) reached drive_real and immediately fired `Embedding::forward token_id 50241 >= vocab_size 50000. N-09 OOB escape.`: provably garbage gradients on 257 out-of-range tokens.

Root cause decomposes 3 ways:

D1: tokenizer artifact at vocab=50_257 (GPT-2 standard: 50_000 merges + 256 bytes + 1 <|endoftext|>) violates BOTH tokenizer-bpe-v1 and llama-370m-sovereign-v1 contracts (both pin 50_000 exact)
D2: Llama370MConfig::VOCAB_SIZE is a hardcoded const; the `--vocab-size` CLI flag is silently ignored by the model-side config path
D3: no pre-flight gate caught the mismatch → N-09 OOB escape guard masked garbage tokens as escape id, producing silent data corruption

This PR ships the orthogonal D3 fix only. D1/D2 require an A/B decision (bump both contracts to 50_257, OR retrain tokenizer to 50_000) that is deferred to a follow-up. The pre-flight gate falsifies any future mismatch regardless of which number wins.

Contract-first sequence:
1. contracts/model-families/llama-370m-sovereign-v1.yaml v1.3.0 → v1.4.0: new GATE-ARCH-370M-011 + FALSIFY-ARCH-370M-011, ship_blocking, binds_to AC-SHIP2-003. `pv validate` → 0 errors.
2. aprender-train: pure `assert_tokenizer_vocab_matches_model` helper (algorithm-level proof, no I/O) + `falsify_gate_arch_370m_011_*` test covering match, mismatch (50_257 vs 50_000), and edge cases.
3. apr-cli: `preflight_tokenizer_vocab_matches_model` wrapper reads <tokenizer_dir>/vocab.json, counts entries, delegates to pure helper. Called at top of `drive_real` before ShardBatchIter::new; refuses to dispatch a real training step on mismatch. Synthetic drive is intentionally exempt (never touches the real model).
4. Tests: 3 new CLI tests (accept match, reject 50257 vs 50000, reject missing vocab.json) + fix to existing real_mode_empty_dataset_dir test that now stages a valid vocab.json so the shard-iterator failure path is still reachable.

Verification:
cargo test -p aprender-train --lib llama_370m → 10 passed
cargo test -p apr-cli --lib commands::pretrain → 11 passed
pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → valid

Unblocks task #126 (re-dispatch) pending D1/D2 A/B decision.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
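The pure parity check and the vocab arithmetic from D1 can be sketched as follows. The function name `assert_tokenizer_vocab_matches_model` is taken from the commit message, but its signature, the `VocabParityError` type, and the `out_of_range_ids` helper are assumptions made for illustration; the real helper in aprender-train may differ.

```rust
/// Hypothetical error type; the real helper's return type is not shown in the PR.
#[derive(Debug, PartialEq)]
pub enum VocabParityError {
    Mismatch { tokenizer: usize, model: usize },
}

/// Pure check with no I/O: the tokenizer's entry count must equal the model's
/// embedding row count exactly.
pub fn assert_tokenizer_vocab_matches_model(
    tokenizer_vocab: usize,
    model_vocab: usize,
) -> Result<(), VocabParityError> {
    if tokenizer_vocab != model_vocab {
        return Err(VocabParityError::Mismatch {
            tokenizer: tokenizer_vocab,
            model: model_vocab,
        });
    }
    Ok(())
}

/// GPT-2 layout from the commit message: merges + byte tokens + <|endoftext|>.
pub const GPT2_VOCAB: usize = 50_000 + 256 + 1;

/// Value both contracts pin, per the commit message.
pub const MODEL_VOCAB: usize = 50_000;

/// Number of token ids that would index past the embedding table.
pub fn out_of_range_ids(tokenizer_vocab: usize, model_vocab: usize) -> usize {
    tokenizer_vocab.saturating_sub(model_vocab)
}
```

With these values, `out_of_range_ids(GPT2_VOCAB, MODEL_VOCAB)` is 257, matching the count of out-of-range tokens reported above, and the id 50241 from the error message falls inside that range. Returning an error before any batch is read, rather than masking ids at embedding time, is what turns the silent corruption in D3 into a refusal to dispatch.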

noahgift enabled auto-merge (squash) April 20, 2026 19:37

noahgift added 3 commits April 21, 2026 04:54

Merge branch 'main' into fix/pretrain-synthetic-default-routing

2e40a41

Merge branch 'main' into fix/pretrain-synthetic-default-routing

2946ced

Merge branch 'main' into fix/pretrain-synthetic-default-routing

5930082

noahgift merged commit 29607ed into main Apr 21, 2026
10 checks passed

noahgift deleted the fix/pretrain-synthetic-default-routing branch April 21, 2026 04:14

noahgift mentioned this pull request Apr 21, 2026

feat(gate-arch-370m-011): pre-flight tokenizer↔model vocab parity gate (task #128) #961

Merged

9 tasks

noahgift merged 4 commits into main from fix/pretrain-synthetic-default-routing
