feat(vocab-50257): bump 370M vocab_size 50_000 → 50_257 (Option A, task #131) #984

Merged
noahgift merged 2 commits into main from feat/vocab-50257-parity-option-a on Apr 21, 2026
@noahgift

Contributor

Summary

Closes the 3-way parity defect surfaced by task #126 at commit 29607ed. The MODEL-2 tokenizer artifact (trained in task #118) has 50_257 entries per GPT-2 industry convention (50_000 BPE merges + 256 byte-level fallback tokens + 1 sentinel). Both pre-existing contracts pinned 50_000 exactly, and Llama370MConfig::VOCAB_SIZE was hardcoded 50_000 — a mismatch that GATE-ARCH-370M-011 (PR #961) now catches pre-flight but previously let through silently via the N-09 OOB escape guard in Embedding::forward.
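The arithmetic behind the mismatch can be sketched as follows. This is a minimal illustration, not code from the repository: the constant and function names are hypothetical, and only the numbers (50_000 + 256 + 1, and the old 50_000-row embedding) come from the summary above.

```rust
// Hypothetical names; only the numeric values are taken from the PR summary.
pub const BPE_MERGES: usize = 50_000;
pub const BYTE_FALLBACK_TOKENS: usize = 256;
pub const SENTINEL_TOKENS: usize = 1;

/// Total tokenizer entries under the breakdown quoted in the summary.
pub const TOKENIZER_VOCAB: usize = BPE_MERGES + BYTE_FALLBACK_TOKENS + SENTINEL_TOKENS;

/// Number of token ids the tokenizer can emit that have no row in an
/// embedding table with `embedding_rows` rows (0 when the table is large enough).
pub fn out_of_bounds_ids(tokenizer_vocab: usize, embedding_rows: usize) -> usize {
    tokenizer_vocab.saturating_sub(embedding_rows)
}

/// Largest token id the tokenizer can emit (ids are assumed to be 0-based).
pub fn max_token_id(tokenizer_vocab: usize) -> usize {
    tokenizer_vocab - 1
}
```

Under these assumptions, `out_of_bounds_ids(TOKENIZER_VOCAB, 50_000)` is 257 and `max_token_id(TOKENIZER_VOCAB)` is 50_256, so every id from 50_000 through 50_256 would have fallen outside the old embedding table.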

Option A (this PR) honors the real-world artifact and burns no tokenizer re-training compute. Option B (retrain tokenizer at 50_000) was rejected as muda.

Byte-lockstep changes across three artifacts

  • contracts/tokenizer-bpe-v1.yaml 1.1.0 → 1.2.0: vocab_size: 50257, merge_rules_count_expected: 49997, INV-BPE-001/004 bounds updated, v1.2.0 changelog entry added.
  • contracts/model-families/llama-370m-sovereign-v1.yaml 1.4.0 → 1.5.0: vocab_size: 50257, INV-ARCH-370M-006 / INV-ARCH-370M-009 equations + falsifiers updated, v1.5.0 changelog entry added.
  • crates/aprender-train/src/models/llama_370m.rs: pub const VOCAB_SIZE: usize = 50_257, validate() const panic + all doctests updated.
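For readers unfamiliar with the pattern, a const-evaluated `validate()` might look roughly like the sketch below. The type name `Llama370MConfigSketch`, its single field, and the signatures of `new()` and `validate()` are assumptions made for illustration; the real `Llama370MConfig` in `llama_370m.rs` may be shaped differently. Only the `VOCAB_SIZE` value comes from this PR.

```rust
/// Hypothetical stand-in for the real config type; the field layout is an assumption.
pub struct Llama370MConfigSketch {
    pub vocab_size: usize,
}

impl Llama370MConfigSketch {
    /// Value introduced by this PR.
    pub const VOCAB_SIZE: usize = 50_257;

    /// Builds a config pinned to the contract value.
    pub const fn new() -> Self {
        Self { vocab_size: Self::VOCAB_SIZE }
    }

    /// Panics if the config drifts from the contract value. When called in a
    /// const context, the panic surfaces as a compile error instead.
    pub const fn validate(&self) {
        assert!(
            self.vocab_size == Self::VOCAB_SIZE,
            "vocab_size must equal Llama370MConfigSketch::VOCAB_SIZE"
        );
    }
}

// Forces the check to run at compile time for the default config.
const _: () = Llama370MConfigSketch::new().validate();
```

The appeal of this design is that a drift between the constant and a constructed config fails the build rather than a training run.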

INV-ARCH-370M-001 param-count band [366M, 374M] still holds — delta of (257 × 1024) = 263_168 params = 0.07% of lower bound.
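The delta can be checked with a few lines of arithmetic. The sketch below uses hypothetical names and, like the figure above, counts a single [vocab, 1024] matrix; the band limits are read as 366_000_000 and 374_000_000.

```rust
// Hypothetical names; values are taken from the PR text.
pub const OLD_VOCAB: usize = 50_000;
pub const NEW_VOCAB: usize = 50_257;
pub const HIDDEN_DIM: usize = 1024;
pub const BAND_LOWER: usize = 366_000_000;
pub const BAND_UPPER: usize = 374_000_000;

/// Extra parameters from growing one [vocab, hidden] matrix.
pub fn embedding_param_delta() -> usize {
    (NEW_VOCAB - OLD_VOCAB) * HIDDEN_DIM
}

/// The delta expressed as a percentage of the band's lower bound.
pub fn delta_percent_of_lower_bound() -> f64 {
    embedding_param_delta() as f64 / BAND_LOWER as f64 * 100.0
}

/// Width of the allowed band, for comparison with the delta.
pub fn band_width() -> usize {
    BAND_UPPER - BAND_LOWER
}
```

Here `embedding_param_delta()` evaluates to 263_168 and the percentage comes out at roughly 0.07, which is small next to the 8M-wide band. Whether the band still holds also depends on where the previous total sat inside it, which this sketch does not model.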

Test plan

  • cargo test -p aprender-train --lib models::llama_370m: 9/9 pass
  • cargo test -p apr-cli --lib commands::pretrain: 11/11 pass (incl. rewritten preflight_rejects_tokenizer_vocab_mismatch that uses VOCAB_SIZE - 1 as the rot-proof counter-example)
  • pv validate contracts/tokenizer-bpe-v1.yaml — green
  • pv validate contracts/model-families/llama-370m-sovereign-v1.yaml — green
  • CI required checks (workspace-test + ci/gate) green on this PR
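The "rot-proof" idea in the test plan is to derive the counter-example from the constant rather than hard-coding a number, so the mismatch stays a mismatch after any future vocab change. A minimal sketch of that idea follows; `preflight_check` and `counter_example` are hypothetical helpers, not the actual functions in `pretrain.rs`.

```rust
/// Value introduced by this PR.
pub const VOCAB_SIZE: usize = 50_257;

/// Hypothetical pre-flight check: accept only a tokenizer whose vocab size
/// matches the model constant, and name both sizes in the error.
pub fn preflight_check(tokenizer_vocab: usize) -> Result<(), String> {
    if tokenizer_vocab == VOCAB_SIZE {
        Ok(())
    } else {
        Err(format!(
            "tokenizer vocab {} does not match model vocab {}",
            tokenizer_vocab, VOCAB_SIZE
        ))
    }
}

/// A mismatching size derived from the constant, so it can never
/// accidentally equal VOCAB_SIZE after a future bump.
pub fn counter_example() -> usize {
    VOCAB_SIZE - 1
}
```

A test built this way would call `preflight_check(counter_example())`, expect an `Err`, and assert that the message contains `VOCAB_SIZE.to_string()` rather than a literal such as "50257".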

Unblocks

#131)

Closes the 3-way parity defect surfaced by task #126 at commit 29607ed.
The MODEL-2 tokenizer artifact (trained in task #118) has 50_257 entries
per GPT-2 industry convention (50_000 BPE merges + 256 byte-level
fallback tokens + 1 sentinel). Both pre-existing contracts pinned
50_000 exactly, and Llama370MConfig::VOCAB_SIZE was hardcoded 50_000 —
a mismatch that GATE-ARCH-370M-011 (PR #961) now catches pre-flight
but previously let through silently via the N-09 OOB escape guard in
Embedding::forward.

Option A (this PR) honors the real-world artifact and burns no
tokenizer re-training compute. Option B (retrain tokenizer at 50_000)
was rejected as muda.

Changes are byte-in-lockstep across three artifacts:
- contracts/tokenizer-bpe-v1.yaml 1.1.0 → 1.2.0:
    vocab_size 50000 → 50257
    merge_rules_count_expected 49996 → 49997
    INV-BPE-004 / GATE-BPE-004 window [49992, 50000] → [49993, 50001]
- contracts/model-families/llama-370m-sovereign-v1.yaml 1.4.0 → 1.5.0:
    architecture.vocab_size 50000 → 50257
    INV-ARCH-370M-006 desc + falsifier updated
    INV-ARCH-370M-009 shape example [50000, 1024] → [50257, 1024]
- crates/aprender-train/src/models/llama_370m.rs:
    pub const VOCAB_SIZE: usize = 50_000 → 50_257
    validate() const-fn assert updated
    falsify_gate_arch_370m_011_helper_rejects_mismatch counter-example
    switched to VOCAB_SIZE - 1 so the test exercises a real mismatch
- crates/apr-cli/src/commands/pretrain.rs:
    preflight_rejects_tokenizer_vocab_mismatch counter-example updated
    in lockstep; uses Llama370MConfig::VOCAB_SIZE.to_string() assertions
    so the test no longer rots on future vocab changes
- crates/apr-cli/src/tokenize_commands.rs:
    `apr tokenize train --vocab-size` default 50000 → 50257 (GPT-2 canon)

Param count: INV-ARCH-370M-001 band [366M, 374M] still holds — delta
is (257 × 1024) = 263_168 = 0.07%.

Validation:
- `pv validate` both contracts: 0 errors
- `cargo test -p aprender-train --lib models::llama_370m`: 9/9 pass
- `cargo test -p apr-cli --lib commands::pretrain`: 11/11 pass
- `cargo fmt --check` (touched files): clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) April 21, 2026 08:28
@noahgift noahgift merged commit f7ad114 into main Apr 21, 2026
10 checks passed
@noahgift noahgift deleted the feat/vocab-50257-parity-option-a branch April 21, 2026 09:01