feat(vocab-50257): bump 370M vocab_size 50_000 → 50_257 (Option A, task #131) by noahgift · Pull Request #984 · paiml/aprender

noahgift · 2026-04-21T08:28:28Z

Summary

Closes the 3-way parity defect surfaced by task #126 at commit 29607ed. The MODEL-2 tokenizer artifact (trained in task #118) has 50_257 entries per GPT-2 industry convention (50_000 BPE merges + 256 byte-level fallback tokens + 1 sentinel). Both pre-existing contracts pinned 50_000 exactly, and Llama370MConfig::VOCAB_SIZE was hardcoded to 50_000. GATE-ARCH-370M-011 (PR #961) now catches this mismatch pre-flight; previously it slipped through silently via the N-09 OOB escape guard in Embedding::forward.

Option A (this PR) honors the real-world artifact and burns no tokenizer re-training compute. Option B (retrain tokenizer at 50_000) was rejected as muda.

Byte-lockstep changes across three artifacts

contracts/tokenizer-bpe-v1.yaml 1.1.0 → 1.2.0: vocab_size: 50257, merge_rules_count_expected: 49997, INV-BPE-001/004 bounds updated, v1.2.0 changelog entry added.
contracts/model-families/llama-370m-sovereign-v1.yaml 1.4.0 → 1.5.0: vocab_size: 50257, INV-ARCH-370M-006 / INV-ARCH-370M-009 equations + falsifiers updated, v1.5.0 changelog entry added.
crates/aprender-train/src/models/llama_370m.rs: pub const VOCAB_SIZE: usize = 50_257, validate() const panic + all doctests updated.
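
The const-panic pattern mentioned for validate() can be sketched as follows. This is a minimal illustration, not the crate's actual code: the struct body, the CONTRACT_VOCAB_SIZE name, and the assertion message are assumptions; only the VOCAB_SIZE value and the idea of a const validate() come from this PR.

```rust
/// Stand-in for the real config type; only the constant matters here.
pub struct Llama370MConfig;

/// Hypothetical mirror of the `vocab_size` pinned in the YAML contracts.
pub const CONTRACT_VOCAB_SIZE: usize = 50_257;

impl Llama370MConfig {
    pub const VOCAB_SIZE: usize = 50_257;

    /// A `const fn` may panic; when it is evaluated in a const context,
    /// a failed assertion becomes a compile error instead of a runtime one.
    pub const fn validate() {
        assert!(
            Self::VOCAB_SIZE == CONTRACT_VOCAB_SIZE,
            "VOCAB_SIZE drifted from the contract vocab_size"
        );
    }
}

// Forces `validate()` to run at compile time on every build.
const _: () = Llama370MConfig::validate();
```

With this shape, changing one constant without the other stops the build, which is one way to keep code and contract in lockstep.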

INV-ARCH-370M-001 param-count band [366M, 374M] still holds — delta of (257 × 1024) = 263_168 params = 0.07% of lower bound.
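
The arithmetic above can be checked with a short sketch. The constant and function names are illustrative; the hidden size of 1024 is taken from the [50257, 1024] shape example, and the calculation follows the PR in counting a single vocab-by-hidden matrix.

```rust
pub const OLD_VOCAB: usize = 50_000;
pub const NEW_VOCAB: usize = 50_257;
/// Hidden size, taken from the `[50257, 1024]` shape example.
pub const HIDDEN_DIM: usize = 1024;
/// Lower edge of the INV-ARCH-370M-001 band [366M, 374M].
pub const BAND_LOWER: usize = 366_000_000;

/// Extra parameters added by the 257 new vocabulary rows.
pub const fn embedding_param_delta() -> usize {
    (NEW_VOCAB - OLD_VOCAB) * HIDDEN_DIM
}

/// The delta expressed as a percentage of the band's lower bound.
pub fn delta_percent_of_lower_bound() -> f64 {
    embedding_param_delta() as f64 / BAND_LOWER as f64 * 100.0
}
```

The delta evaluates to 263_168 parameters, roughly 0.07% of 366M, so the bump is small relative to the 8M-wide band.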

Test plan

cargo test -p aprender-train --lib models::llama_370m — 9/9 pass
cargo test -p apr-cli --lib commands::pretrain — 11/11 pass (incl. rewritten preflight_rejects_tokenizer_vocab_mismatch that uses VOCAB_SIZE - 1 as the rot-proof counter-example)
pv validate contracts/tokenizer-bpe-v1.yaml — green
pv validate contracts/model-families/llama-370m-sovereign-v1.yaml — green
CI required checks (workspace-test + ci/gate) green on this PR
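
The "rot-proof counter-example" idea can be sketched like this. The function names and the error text are hypothetical stand-ins, not the apr-cli implementation; the point is that the mismatching value and the expected strings are derived from VOCAB_SIZE rather than written as literals.

```rust
/// Stand-in for the real config type; only the constant matters here.
pub struct Llama370MConfig;

impl Llama370MConfig {
    pub const VOCAB_SIZE: usize = 50_257;
}

/// Hypothetical pre-flight check: rejects a tokenizer whose vocabulary
/// size differs from the model's, naming both sizes in the error.
pub fn preflight_vocab_check(tokenizer_vocab: usize) -> Result<(), String> {
    let expected = Llama370MConfig::VOCAB_SIZE;
    if tokenizer_vocab == expected {
        Ok(())
    } else {
        Err(format!(
            "tokenizer vocab_size {tokenizer_vocab} != model vocab_size {expected}"
        ))
    }
}

/// Counter-example derived from the constant. It differs from VOCAB_SIZE
/// for any future value, whereas a literal such as 50_257 would stop
/// being a mismatch the moment the constant is bumped to it.
pub fn mismatched_vocab() -> usize {
    Llama370MConfig::VOCAB_SIZE - 1
}
```

A test built on mismatched_vocab() and VOCAB_SIZE.to_string() keeps exercising a real mismatch after later vocab changes, without edits to the test itself.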

Unblocks

Task #126: MODEL-2 from-scratch real-compute pretrain on lambda-labs RTX 4090

feat(vocab-50257): bump 370M vocab_size 50_000 → 50_257 (Option A, task #131)

Closes the 3-way parity defect surfaced by task #126 at commit 29607ed. The MODEL-2 tokenizer artifact (trained in task #118) has 50_257 entries per GPT-2 industry convention (50_000 BPE merges + 256 byte-level fallback tokens + 1 sentinel). Both pre-existing contracts pinned 50_000 exactly, and Llama370MConfig::VOCAB_SIZE was hardcoded to 50_000. GATE-ARCH-370M-011 (PR #961) now catches this mismatch pre-flight; previously it slipped through silently via the N-09 OOB escape guard in Embedding::forward.

Option A (this PR) honors the real-world artifact and burns no tokenizer re-training compute. Option B (retrain tokenizer at 50_000) was rejected as muda.

Changes are byte-in-lockstep across three artifacts:

- contracts/tokenizer-bpe-v1.yaml 1.1.0 → 1.2.0:
    vocab_size 50000 → 50257
    merge_rules_count_expected 49996 → 49997
    INV-BPE-004 / GATE-BPE-004 window [49992, 50000] → [49993, 50001]
- contracts/model-families/llama-370m-sovereign-v1.yaml 1.4.0 → 1.5.0:
    architecture.vocab_size 50000 → 50257
    INV-ARCH-370M-006 desc + falsifier updated
    INV-ARCH-370M-009 shape example [50000, 1024] → [50257, 1024]
- crates/aprender-train/src/models/llama_370m.rs:
    pub const VOCAB_SIZE: usize = 50_000 → 50_257
    validate() const-fn assert updated
    falsify_gate_arch_370m_011_helper_rejects_mismatch counter-example switched to VOCAB_SIZE - 1 so the test exercises a real mismatch
- crates/apr-cli/src/commands/pretrain.rs:
    preflight_rejects_tokenizer_vocab_mismatch counter-example updated in lockstep; uses Llama370MConfig::VOCAB_SIZE.to_string() assertions so the test no longer rots on future vocab changes
- crates/apr-cli/src/tokenize_commands.rs:
    `apr tokenize train --vocab-size` default 50000 → 50257 (GPT-2 canon)

Param count: INV-ARCH-370M-001 band [366M, 374M] still holds; the delta is (257 × 1024) = 263_168 = 0.07%.

Validation:

- `pv validate` both contracts: 0 errors
- `cargo test -p aprender-train --lib models::llama_370m`: 9/9 pass
- `cargo test -p apr-cli --lib commands::pretrain`: 11/11 pass
- `cargo fmt --check` (touched files): clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
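
The INV-BPE-004 / GATE-BPE-004 windows quoted in the commit message are consistent with a symmetric tolerance of 4 around merge_rules_count_expected. That tolerance is inferred from the numbers and is not stated in the PR; the names below are illustrative, and this is only a sketch of how such a window check might look.

```rust
/// Expected merge-rule count after this change (from the contract diff).
pub const MERGE_RULES_EXPECTED: usize = 49_997;
/// Assumption: half-width inferred from the quoted windows, not stated in the PR.
pub const TOLERANCE: usize = 4;

/// Inclusive (low, high) window around an expected merge count.
pub const fn merge_window(expected: usize, tol: usize) -> (usize, usize) {
    (expected - tol, expected + tol)
}

/// True when an observed merge count falls inside the current window.
pub fn merge_count_in_window(count: usize) -> bool {
    let (low, high) = merge_window(MERGE_RULES_EXPECTED, TOLERANCE);
    low <= count && count <= high
}
```

Under this assumption, merge_window(49_996, 4) reproduces the old [49992, 50000] window and merge_window(49_997, 4) reproduces the new [49993, 50001] one.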

noahgift enabled auto-merge (squash) April 21, 2026 08:28

Merge branch 'main' into feat/vocab-50257-parity-option-a

fbeff4b

noahgift merged commit f7ad114 into main Apr 21, 2026
10 checks passed

noahgift deleted the feat/vocab-50257-parity-option-a branch April 21, 2026 09:01

noahgift mentioned this pull request Apr 21, 2026

feat(task-132): Phase 0 — GPU training backend contract + spec v2.23.0 #989

Merged

4 tasks

noahgift merged 2 commits into main from feat/vocab-50257-parity-option-a
