feat(gate-arch-370m-011): pre-flight tokenizer↔model vocab parity gate (task #128) by noahgift · Pull Request #961 · paiml/aprender

noahgift · 2026-04-21T04:48:42Z

Summary

Ships orthogonal pre-flight gate GATE-ARCH-370M-011 that refuses apr pretrain real-compute dispatch when tokenizer.vocab_size != Llama370MConfig::VOCAB_SIZE.
Discovered after re-dispatching task #126 post-PR #949 (fix(pretrain): default to drive_real; add GATE-TRAIN-010 falsifier): the first real MODEL-2 run at commit 29607ed fired `Embedding::forward token_id 50241 >= vocab_size 50000. N-09 OOB escape.` on step 1, which silently corrupts gradients via the N-09 escape path.
Root cause decomposes 3 ways (tokenizer artifact @50_257 vs contracts pinned @50_000, hardcoded VOCAB_SIZE const, no pre-flight gate). This PR fixes only the gate (D3). D1/D2 need an A/B decision and will follow.
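
To make the failure mode concrete, here is a minimal sketch of how an out-of-bounds escape can hide a vocab mismatch. It is not the actual `Embedding::forward` code: `ESCAPE_ID`, `lookup_row`, and `count_masked` are hypothetical names, and the choice of row 0 as the escape row is an assumption made only for illustration.

```rust
/// Hypothetical escape row; the real N-09 escape id is not stated in this PR.
const ESCAPE_ID: usize = 0;

/// Illustrative lookup: an out-of-range token id is silently remapped to
/// ESCAPE_ID instead of failing, so training continues on the wrong row.
fn lookup_row(token_id: usize, vocab_size: usize) -> usize {
    if token_id >= vocab_size {
        ESCAPE_ID
    } else {
        token_id
    }
}

/// Counts how many ids in a batch would be remapped by the escape.
fn count_masked(token_ids: &[usize], vocab_size: usize) -> usize {
    token_ids.iter().filter(|&&id| id >= vocab_size).count()
}
```

With a 50_257-entry tokenizer and a 50_000-row embedding, every id from 50_000 through 50_256 takes the escape branch, which is why a pre-flight refusal is safer than relying on the guard.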

Changes

contracts/model-families/llama-370m-sovereign-v1.yaml v1.3.0 → v1.4.0: new GATE-ARCH-370M-011 + FALSIFY-ARCH-370M-011, ship_blocking: true, binds_to: AC-SHIP2-003. pv validate → 0 errors.
crates/aprender-train/src/models/llama_370m.rs: pure algorithm-level helper assert_tokenizer_vocab_matches_model + falsify_gate_arch_370m_011_helper_rejects_mismatch test covering match, 50_257 vs 50_000 mismatch, and edge cases.
crates/apr-cli/src/commands/pretrain.rs: CLI wrapper preflight_tokenizer_vocab_matches_model reads <tokenizer_dir>/vocab.json, delegates to pure helper. Called at top of drive_real before ShardBatchIter::new. Synthetic drive is intentionally exempt (never touches the real model). 3 new tests + 1 fixed test (staged valid vocab.json so the intended shard-iterator failure path is still reachable).
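
For readers who want a feel for the pure helper, the sketch below shows one plausible shape. Only the function name `assert_tokenizer_vocab_matches_model` comes from this PR; the signature, the `VocabMismatch` error type, and the message wording are assumptions, not the code that was merged.

```rust
use std::fmt;

/// Hypothetical error type carrying both sizes so the refusal message can
/// cite them; the real helper's error type is not shown in this PR.
#[derive(Debug, PartialEq, Eq)]
pub struct VocabMismatch {
    pub tokenizer_vocab: usize,
    pub model_vocab: usize,
}

impl fmt::Display for VocabMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Assumed wording: names the gate id and both vocab sizes.
        write!(
            f,
            "GATE-ARCH-370M-011: tokenizer vocab_size {} != model VOCAB_SIZE {}",
            self.tokenizer_vocab, self.model_vocab
        )
    }
}

/// Pure check with no I/O: Ok only when the two sizes are exactly equal.
pub fn assert_tokenizer_vocab_matches_model(
    tokenizer_vocab: usize,
    model_vocab: usize,
) -> Result<(), VocabMismatch> {
    if tokenizer_vocab == model_vocab {
        Ok(())
    } else {
        Err(VocabMismatch {
            tokenizer_vocab,
            model_vocab,
        })
    }
}
```

Keeping the comparison free of I/O lets the CLI wrapper own the `vocab.json` read while the equality rule itself stays unit-testable in isolation.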

Verification

cargo test -p aprender-train --lib llama_370m → 10 passed
cargo test -p apr-cli --lib commands::pretrain → 11 passed (4 new)
cargo run -p aprender-contracts-cli --bin pv -- validate contracts/model-families/llama-370m-sovereign-v1.yaml → valid
cargo clippy -p aprender-train -p apr-cli --lib -- -D warnings → clean
cargo fmt -- --check on affected files → clean

Test plan

CI ci / gate passes
CI workspace-test passes
Post-merge: dispatch apr pretrain --mode from-scratch with mismatched tokenizer → verify refusal message cites GATE-ARCH-370M-011 and both vocab sizes
Post-merge: follow-up PR to resolve D1/D2 (A: bump contracts to 50_257, B: retrain tokenizer to 50_000)

Unblocks

task #126 (MODEL-2 from-scratch real-compute dispatch), pending D1/D2 A/B decision

🤖 Generated with Claude Code

Task #128.

First real MODEL-2 from-scratch compute dispatch (post PR #949 canonical rebuild at commit 29607ed, 2026-04-21) reached drive_real and immediately fired `Embedding::forward token_id 50241 >= vocab_size 50000. N-09 OOB escape.`, producing provably garbage gradients on 257 out-of-range tokens.

Root cause decomposes 3 ways:

D1: tokenizer artifact at vocab=50_257 (GPT-2 standard: 50_000 merges + 256 bytes + 1 <|endoftext|>) violates BOTH tokenizer-bpe-v1 and llama-370m-sovereign-v1 contracts (both pin 50_000 exact)
D2: Llama370MConfig::VOCAB_SIZE is a hardcoded const; the `--vocab-size` CLI flag is silently ignored by the model-side config path
D3: no pre-flight gate caught the mismatch → N-09 OOB escape guard masked garbage tokens as escape id, producing silent data corruption

This PR ships the orthogonal D3 fix only. D1/D2 require an A/B decision (bump both contracts to 50_257, OR retrain tokenizer to 50_000) that is deferred to a follow-up. The pre-flight gate falsifies any future mismatch regardless of which number wins.

Contract-first sequence:

1. contracts/model-families/llama-370m-sovereign-v1.yaml v1.3.0 → v1.4.0: new GATE-ARCH-370M-011 + FALSIFY-ARCH-370M-011, ship_blocking, binds_to AC-SHIP2-003. `pv validate` → 0 errors.
2. aprender-train: pure `assert_tokenizer_vocab_matches_model` helper (algorithm-level proof, no I/O) + `falsify_gate_arch_370m_011_*` test covering match, mismatch (50_257 vs 50_000), and edge cases.
3. apr-cli: `preflight_tokenizer_vocab_matches_model` wrapper reads <tokenizer_dir>/vocab.json, counts entries, delegates to pure helper. Called at top of `drive_real` before ShardBatchIter::new; refuses to dispatch a real training step on mismatch. Synthetic drive is intentionally exempt (never touches the real model).
4. Tests: 3 new CLI tests (accept match, reject 50257 vs 50000, reject missing vocab.json) + fix to existing real_mode_empty_dataset_dir test that now stages a valid vocab.json so the shard-iterator failure path is still reachable.

Verification:

cargo test -p aprender-train --lib llama_370m → 10 passed
cargo test -p apr-cli --lib commands::pretrain → 11 passed
pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → valid

Unblocks task #126 (re-dispatch) pending D1/D2 A/B decision.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

#131) (#984)

Closes the 3-way parity defect surfaced by task #126 at commit 29607ed. The MODEL-2 tokenizer artifact (trained in task #118) has 50_257 entries per GPT-2 industry convention (50_000 BPE merges + 256 byte-level fallback tokens + 1 sentinel). Both pre-existing contracts pinned 50_000 exactly, and Llama370MConfig::VOCAB_SIZE was hardcoded 50_000. GATE-ARCH-370M-011 (PR #961) now catches this mismatch pre-flight; previously it slipped through silently via the N-09 OOB escape guard in Embedding::forward.

Option A (this PR) honors the real-world artifact and burns no tokenizer re-training compute. Option B (retrain tokenizer at 50_000) was rejected as muda.

Changes are byte-in-lockstep across three artifacts:

- contracts/tokenizer-bpe-v1.yaml 1.1.0 → 1.2.0:
  vocab_size 50000 → 50257
  merge_rules_count_expected 49996 → 49997
  INV-BPE-004 / GATE-BPE-004 window [49992, 50000] → [49993, 50001]
- contracts/model-families/llama-370m-sovereign-v1.yaml 1.4.0 → 1.5.0:
  architecture.vocab_size 50000 → 50257
  INV-ARCH-370M-006 desc + falsifier updated
  INV-ARCH-370M-009 shape example [50000, 1024] → [50257, 1024]
- crates/aprender-train/src/models/llama_370m.rs:
  pub const VOCAB_SIZE: usize = 50_000 → 50_257
  validate() const-fn assert updated
  falsify_gate_arch_370m_011_helper_rejects_mismatch counter-example switched to VOCAB_SIZE - 1 so the test exercises a real mismatch
- crates/apr-cli/src/commands/pretrain.rs:
  preflight_rejects_tokenizer_vocab_mismatch counter-example updated in lockstep; uses Llama370MConfig::VOCAB_SIZE.to_string() assertions so the test no longer rots on future vocab changes
- crates/apr-cli/src/tokenize_commands.rs:
  `apr tokenize train --vocab-size` default 50000 → 50257 (GPT-2 canon)

Param count: INV-ARCH-370M-001 band [366M, 374M] still holds; delta is (257 × 1024) = 263_168 = 0.07%.

Validation:

- `pv validate` both contracts: 0 errors
- `cargo test -p aprender-train --lib models::llama_370m`: 9/9 pass
- `cargo test -p apr-cli --lib commands::pretrain`: 11/11 pass
- `cargo fmt --check` (touched files): clean

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
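
The parameter-count delta quoted in the commit message above can be re-derived with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and it counts a single [vocab, 1024] matrix, matching the figure in the message rather than making any claim about how many vocab-sized matrices the model actually holds.

```rust
const OLD_VOCAB: u64 = 50_000;
const NEW_VOCAB: u64 = 50_257;
/// Hidden width taken from the [50257, 1024] shape example in the contract.
const HIDDEN: u64 = 1024;

/// Extra parameters from growing one [vocab, hidden] matrix by 257 rows.
fn embedding_param_delta() -> u64 {
    (NEW_VOCAB - OLD_VOCAB) * HIDDEN
}

/// Delta as a percentage of a given total parameter count.
fn delta_percent(total_params: u64) -> f64 {
    embedding_param_delta() as f64 / total_params as f64 * 100.0
}
```

Evaluated at either end of the [366M, 374M] band, the percentage rounds to 0.07%, consistent with the claim that INV-ARCH-370M-001 still holds.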

noahgift enabled auto-merge (squash) April 21, 2026 04:48

noahgift added 2 commits April 21, 2026 07:36

Merge branch 'main' into feat/gate-arch-370m-011-vocab-preflight

4ad07b6

Merge branch 'main' into feat/gate-arch-370m-011-vocab-preflight

c49f3c8

noahgift merged commit 15c0bf5 into main Apr 21, 2026
10 checks passed

noahgift deleted the feat/gate-arch-370m-011-vocab-preflight branch April 21, 2026 07:49

noahgift mentioned this pull request Apr 21, 2026

feat(vocab-50257): bump 370M vocab_size 50_000 → 50_257 (Option A, task #131) #984

Merged

5 tasks

noahgift merged 3 commits into main from feat/gate-arch-370m-011-vocab-preflight
