spec(ship-two-models): v2.98 → v2.99 — §54 step 5g multi-step prereqs + live preflight smoke by noahgift · Pull Request #1496 · paiml/aprender

noahgift · 2026-05-05T02:36:06Z

Summary

§53 closed with "step 5g LIVE remains" framing 5g as a single operator dispatch. Live source inspection of the post-#1494 binary plus an actual smoke run on canonical inputs revealed step 5g has multi-step prerequisites that were NOT enumerated in §50's original 8-step decomposition.

Live empirical smoke (commit 92c7e23 apr binary, 2026-05-05T04:31Z):
```
apr pretrain --init <Qwen2.5-Coder-0.5B-Instruct-fp16.apr> \
--tokenizer <legacy 50257-vocab dir> \
--dataset <legacy codeparrot shards>
→ CORRECT FAIL-FAST: GATE-ARCH-370M-011 (INV-ARCH-370M-006) violated:
tokenizer vocab_size (50257) != model vocab_size (151936)
```

This is the FIRST end-to-end runtime evidence that the §50.4 cascade's polymorphic preflight works in the user-facing CLI. FALSIFY-APR-PRETRAIN-ARCH-005/006 are now LIVE-INTEGRATION (not just unit-test PARTIAL).

But the smoke also surfaces 5g's true scope.

Re-scoped 5g roadmap

| Step | What it does | LOC / wall | Status |
|------|--------------|------------|--------|
| 5g.0 | Extract Qwen tokenizer vocab.json + merges.txt from HF cache tokenizer.json | ~50 LOC + ~5 min | NEXT PR |
| 5g.1 | Re-tokenize codeparrot corpus with Qwen vocab | 0 LOC + ~10 hr operator-dispatch | gated on 5g.0 |
| 5g.2 | LIVE 500-step fine-tune dispatch | 0 LOC + ~20-60 min | gated on 5g.1 |
| 5g.3 | val_loss < 9.38 verdict; flip MODEL-2 ship % 57% → ≥58% | 0 LOC | gated on 5g.2 |

Methodology takeaway

Top-down spec planning consistently underestimates scope-coupling between heterogeneous code paths. This is the third instance of the same lesson:

- §50 found §49's "0 LOC" was 8-step (architectural coupling)
- §52 found §50's "5f weight load" was 2-step (CLI dispatch coupling)
- §54 found §53's "5g LIVE" is 4-step (tokenizer-format coupling)

Five Whys

1. Why didn't §50 enumerate 5g.0? Top-down decomposition under-counted the tokenizer-format axis (HF `tokenizer.json` vs aprender's `vocab.json` + `merges.txt`).
2. Why does aprender require vocab.json + merges.txt rather than reading tokenizer.json? Historical: BPE loader was authored against GPT-2's released format. HF `tokenizer.json` came later.
3. Why extraction (5g.0) instead of a tokenizer.json reader? Extraction is ~50 LOC of Python; reader integration is ~200 LOC of Rust + tests. Extraction is the cheaper ship-% path; reader is the principled follow-up.
4. Why is 5g.1's 10-hour wall acceptable? Per `feedback_compute_pre_authorized.md`, named training/tokenization runs are pre-authorized below the 48-hour threshold.
5. Why is the smoke load-bearing for the spec? It's the FIRST live evidence on canonical model + corpus + binary that the §50.4 cascade does what it claims. Unit tests prove the algorithm; the smoke proves the integration.

Net effects

- Spec v2.98.0 → v2.99.0.
- §50.4 roadmap: 5a-5f.4 INTEGRATION-COMPLETE; 5g re-scoped to 5g.0/5g.1/5g.2/5g.3.
- MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57% until 5g.3.
- Coverage tally: snapshot, no contract status flip. Contract `apr-pretrain-arch-polymorphic-v1` v1.2.0 FUNCTIONAL is reinforced; promotion to DISCHARGED waits for the 5g.3 val_loss measurement.

Files changed

`docs/specifications/aprender-train/ship-two-models-spec.md` — §54 amendment.
`evidence/section-54-5g-prereqs-2026-05-05/preflight-fail-fast-smoke.md` — live smoke evidence.

Test plan

PMAT pre-commit quality gates pass
CI gate green (workspace-test, ci/gate)
Auto-merge fires on green CI

🤖 Generated with Claude Code

…requisites + live preflight smoke

§53 closed with "step 5g LIVE remains" framing 5g as a single operator dispatch. Live source inspection of the post-#1494 binary plus an actual smoke run revealed step 5g has multi-step prerequisites that were NOT enumerated in §50's original 8-step decomposition.

Live empirical smoke on canonical inputs:

    apr pretrain --init <Qwen2.5-Coder-0.5B-Instruct-fp16.apr> \
        --tokenizer <legacy 50257-vocab dir> \
        --dataset <legacy codeparrot shards>
    → CORRECT FAIL-FAST: GATE-ARCH-370M-011 (INV-ARCH-370M-006) violated:
      tokenizer vocab_size (50257) != model vocab_size (151936)

This is the FIRST end-to-end runtime evidence that the §50.4 cascade's polymorphic preflight (PR #1476 + #1494) works in the user-facing CLI:

- Read --init APR metadata: vocab=151936, hidden=896, layers=24
- target_vocab = init_arch.vocab_size = 151936 (NOT legacy 50257)
- Tokenizer dir vocab.json count = 50257
- Mismatch → fail-fast before trainer allocation

But the smoke also surfaces 5g's true scope. A Qwen-vocab tokenizer dir + Qwen-tokenized corpus must exist BEFORE the preflight passes. Neither exists on this host today.

Step 5g re-scoped:

- 5g.0: Qwen tokenizer extraction (~50 LOC, ~5min wall) [next PR]
- 5g.1: Qwen-tokenized corpus (0 LOC, ~10hr wall, operator-dispatch)
- 5g.2: LIVE 500-step fine-tune (0 LOC, ~20-60min, operator-dispatch)
- 5g.3: val_loss < 9.38 verdict; flip MODEL-2 ship % 57% → ≥58%

Methodology takeaway: top-down spec planning consistently underestimates scope-coupling between heterogeneous code paths. This is the third instance of the same lesson:

- §50 found §49's "0 LOC" was 8-step (architectural coupling)
- §52 found §50's "5f weight load" was 2-step (CLI dispatch coupling)
- §54 found §53's "5g LIVE" is 4-step (tokenizer-format coupling)

Falsifier scoreboard impact:

- FALSIFY-APR-PRETRAIN-ARCH-005/006 reach LIVE-INTEGRATION level (proven via real CLI dispatch, not just unit tests)
- Contract `apr-pretrain-arch-polymorphic-v1` v1.2.0 FUNCTIONAL is reinforced; promotion to DISCHARGED waits for 5g.3 val_loss measurement

Net effects:

- Spec v2.98.0 → v2.99.0
- MODEL-1 ship % unchanged at 91%
- MODEL-2 ship % unchanged at 57% (gated on 5g.3)
- Coverage tally: snapshot, no contract status flip

Refs: SPEC-SHIP-TWO-001 §50.4 step 5g, PR #1476 + #1494, evidence/section-54-5g-prereqs-2026-05-05/preflight-fail-fast-smoke.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
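The preflight trace in this commit message (count the entries in the tokenizer dir's vocab.json, compare against the model's vocab size, fail before any trainer allocation) can be illustrated with a minimal Python sketch. This is not the aprender implementation, which is Rust; `count_vocab_entries` and `strict_vocab_preflight` are hypothetical names, and the error text only mirrors the wording of the smoke output.

```python
import json
from pathlib import Path


def count_vocab_entries(tokenizer_dir: Path) -> int:
    """Return the number of token -> id entries in <tokenizer_dir>/vocab.json."""
    with open(tokenizer_dir / "vocab.json", encoding="utf-8") as fh:
        return len(json.load(fh))


def strict_vocab_preflight(tokenizer_dir: Path, model_vocab: int) -> None:
    """Hypothetical stand-in for the strict (==) check; raises on any mismatch."""
    tokenizer_vocab = count_vocab_entries(tokenizer_dir)
    if tokenizer_vocab != model_vocab:
        # Wording mirrors the smoke output; the gate id is omitted here.
        raise ValueError(
            f"tokenizer vocab_size ({tokenizer_vocab}) != "
            f"model vocab_size ({model_vocab})"
        )
```

Called with a 50257-entry tokenizer dir and `model_vocab=151936`, the sketch raises before anything else runs, which is the ordering the smoke demonstrates.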

Author contract apr-cli-tokenize-import-hf-v1 v1.0.0 PARTIAL_ALGORITHM_LEVEL + implement `apr tokenize import-hf <tokenizer.json> --output <DIR>` that extracts vocab.json + merges.txt + manifest.json from a HuggingFace BPE tokenizer.json into aprender's two-file layout.

This is the prerequisite step from §54's 5g re-scoping: aprender's GPT-2-style BPE loader requires vocab.json + merges.txt; public Qwen2.5/Llama2/Mistral tokenizers distribute as a single tokenizer.json. Without import-hf, fine-tuning from these checkpoints is blocked.

## What ships

Contract `contracts/apr-cli-tokenize-import-hf-v1.yaml` v1.0.0 (5 falsifiers):

- FALSIFY-TOK-IMPORT-HF-001: command registered in dispatch surface
- FALSIFY-TOK-IMPORT-HF-002: BPE input produces non-empty vocab+merges
- FALSIFY-TOK-IMPORT-HF-003: vocab count == |tokenizer.json:model.vocab|
- FALSIFY-TOK-IMPORT-HF-004: merges.txt is one merge per line in order
- FALSIFY-TOK-IMPORT-HF-005: non-BPE input fails fast (Unigram/WordPiece)

Subcommand `apr tokenize import-hf`:

    --input <FILE>            HF tokenizer.json (BPE required)
    --output <DIR>            output dir for vocab.json + merges.txt + manifest.json
    --include-added-tokens    also emit added_tokens (e.g., <|im_start|>) in vocab.json

Output dir layout (drop-in compatible with apr tokenize encode-corpus and apr pretrain --tokenizer):

    vocab.json       JSON object: token-string → integer-id
    merges.txt       #version: 0.2 header + one space-separated merge per line
    manifest.json    provenance: source path, sha256, counts, timestamp

5 unit tests cover FALSIFY-002..005 + the --include-added-tokens path.

## LIVE smoke (canonical input)

    apr tokenize import-hf \
        --input <Qwen2.5-Coder-0.5B-Instruct/tokenizer.json> \
        --output /tmp/qwen-0.5b-tokenizer-extracted --json
    → effective_vocab=151643, merges=151387, added_tokens=22, sha256 captured

Files written: 3.2 MiB vocab.json, 1.6 MiB merges.txt, 534 B manifest.json

Evidence file: evidence/section-50-4-step-5g-0-import-hf-2026-05-05/live-extraction-smoke.md

## Five Whys

1. Why a new subcommand rather than a script? Per feedback_stack_tool_extension_not_cli_shim.md: when apr lacks a feature, extend apr in-tree. Authored as an apr subcommand with a provable contract; one-off scripts would have been muda.
2. Why under `apr tokenize` rather than `apr import` or top-level? `apr tokenize` already handles tokenizer artifacts (plan/apply/train/encode-corpus); import-hf is symmetric: it produces the same output shape (vocab.json + merges.txt) as `train` does, just from an external HF source rather than from a corpus. `apr import` is for model files, not tokenizers.
3. Why default to BPE-only (excluding added_tokens)? The aprender BPE loader is a state machine; added tokens (e.g., <|endoftext|>) are control-string substitutions handled differently. Default mode keeps the BPE machine pure; --include-added-tokens is the explicit opt-in for when downstream consumption needs the unified vocab.
4. Why fail-fast on non-BPE? Unigram and WordPiece have different state machines; silent extraction would produce a vocab.json that the BPE loader accepts but tokenizes incorrectly. Fail-fast names the actual model.type and cites the contract id (auditability).
5. Why not also handle the polymorphic preflight gap (151643/151936)? That's a separate concern: the preflight's strict `==` semantic is correct for from-scratch models but too strict for Qwen-style models with reserved slots. §55 follow-up; out of 5g.0 scope.

## Net effects

- Contract apr-cli-tokenize-import-hf-v1 v1.0.0 PARTIAL_ALGORITHM_LEVEL.
- 1 new apr subcommand (`apr tokenize import-hf`) wired into dispatch.
- 5 new unit tests + 1 live smoke evidence file.
- Unblocks 5g.1 (Qwen-tokenized corpus pretokenization) modulo the §55 preflight strictness finding (vocab gap 151665 vs 151936).
- MODEL-1 ship %: unchanged at 91%.
- MODEL-2 ship %: unchanged at 57% until 5g.3 val_loss < 9.38.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496, which re-scoped 5g into 5g.0/.../5g.3), contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.2.0 FUNCTIONAL (sibling)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
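To make the extraction concrete, here is a minimal Python sketch of the same idea. It is not the `apr tokenize import-hf` implementation (that is an in-tree Rust subcommand). The function name `import_hf_tokenizer` and the manifest field names are assumptions chosen for illustration. The `tokenizer.json` fields it reads (`model.type`, `model.vocab`, `model.merges`, `added_tokens`) are the standard Hugging Face serialization.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def import_hf_tokenizer(src: Path, out_dir: Path,
                        include_added_tokens: bool = False) -> dict:
    """Split an HF BPE tokenizer.json into vocab.json + merges.txt + manifest.json."""
    raw = src.read_bytes()
    doc = json.loads(raw)
    model = doc.get("model") or {}

    # Fail fast on non-BPE models (e.g. Unigram, WordPiece), naming the actual type.
    model_type = model.get("type")
    if model_type != "BPE":
        raise ValueError(f"model.type is {model_type!r}, expected 'BPE'")

    vocab = dict(model["vocab"])  # token string -> integer id
    added = doc.get("added_tokens") or []
    if include_added_tokens:
        for tok in added:
            vocab.setdefault(tok["content"], tok["id"])

    # Merges are serialized as "left right" strings or as [left, right] pairs,
    # depending on the tokenizers version that wrote the file; order is preserved.
    merges = [m if isinstance(m, str) else " ".join(m) for m in model["merges"]]

    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "vocab.json").write_text(
        json.dumps(vocab, ensure_ascii=False), encoding="utf-8")
    (out_dir / "merges.txt").write_text(
        "#version: 0.2\n" + "".join(f"{m}\n" for m in merges), encoding="utf-8")

    # Manifest field names below are illustrative assumptions.
    manifest = {
        "source": str(src),
        "sha256": hashlib.sha256(raw).hexdigest(),
        "vocab_count": len(vocab),
        "merges_count": len(merges),
        "added_tokens_count": len(added),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

The default leaves `added_tokens` out of vocab.json, matching the BPE-only default described in the Five Whys; passing `include_added_tokens=True` folds them in.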

…xation (#1500)

§54 LIVE smoke surfaced that public Qwen2.5-Coder-0.5B-Instruct's tokenizer.json materializes 151643 BPE entries + 22 added = 151665 effective strings, but config.json declares vocab_size=151936 (271 reserved/special slots not in tokenizer.json). Strict equality preflight was correct for §24/§25 from-scratch but too strict for HF-distributed pretrained checkpoints with reserved slots.

## Relaxation

    init=Some → tokenizer_vocab ≤ model_vocab   (RELAXED, admits HF reserved slots)
    init=None → tokenizer_vocab == model_vocab  (UNCHANGED, §24/§25 baseline)

Safety: tokenizer-emitted ids ∈ [0, tokenizer_vocab) ⊆ [0, model_vocab). Reserved high-id slots are never indexed at training time. N-09 OOB escape impossible.

Symmetric guard: tokenizer_vocab > model_vocab MUST FAIL even under init=Some (FALSIFY-APR-PRETRAIN-ARCH-010: bound is ≤, not <).

## What ships

- Helper: assert_tokenizer_vocab_within_model_bound (aprender-train), symmetric to assert_tokenizer_vocab_matches_model
- Wireup: preflight_tokenizer_vocab_matches_target now takes init_is_some: bool; drive_real passes init_arch.is_some() to route to relaxed/strict.
- Contract: apr-pretrain-arch-polymorphic-v1 v1.2.0 → v1.3.0 FUNCTIONAL
  - qwen_tokenizer_vocab_compatibility refined formula + invariants
  - +FALSIFY-009 (relaxed accept)
  - +FALSIFY-010 (oversize reject, OOB safety)
  - total: 10 falsifiers, all PASS
- Tests (4 new, all PASS):
  - falsify_apr_pretrain_arch_009_relaxed_bound_accepts_qwen_reserved_slots
  - falsify_apr_pretrain_arch_010_relaxed_bound_rejects_oversized_tokenizer
  - preflight_qwen_reserved_slots_pass_under_polymorphic_init
  - preflight_oversized_tokenizer_rejected_even_under_polymorphic_init
- Spec amendment §55 + LIVE smoke evidence.

## LIVE smoke (this branch's apr binary + §54 extracted Qwen tokenizer)

    timeout 30 apr pretrain --init <Qwen.apr> --tokenizer <extracted-dir> ...
    → exit=124 (timeout), AFTER preflight passed
    → Configuration printed + Device: cpu + (proceeded to weight load)
    → No GATE-ARCH-370M-011 violations

Evidence: evidence/section-55-relaxed-preflight-2026-05-05/relaxed-preflight-passes-smoke.md

## Five Whys

1. Why did §54 not catch this? §54 used legacy 50257 tokenizer (not §54's own extracted Qwen tokenizer); within-Qwen mismatch only surfaces after 5g.0 lands.
2. Why bound is ≤ not ==? HF checkpoints standardly declare vocab > tokenizer materialized; strict-equality would fail every Qwen/Llama2/Mistral.
3. Why preserve strict equality on from-scratch? §24/§25 evidence was gathered under strict; weakening retroactively could mask future from-scratch tokenizer-drift bugs.
4. Why new helper rather than mode parameter on existing? External callers (training-loop-pretrain-v1 contract) explicitly want strict; mode param would be backward-incompatible.
5. Why pin both FALSIFY-009 + FALSIFY-010? Bound is ≤, not <. Without FALSIFY-010, a regression to tokenizer_vocab > model_vocab would silently restore N-09 OOB risk.

## Net effects

- Spec v2.99.0 → v3.00.0 (cascade pivots from infrastructure to LIVE prerequisites).
- Contract apr-pretrain-arch-polymorphic-v1 v1.2.0 → v1.3.0 FUNCTIONAL.
- 5g.0.1 lands as single PR; 5g.1 unblocked.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3 val_loss < 9.38.
- Coverage tally: +2 PARTIAL_ALGORITHM_LEVEL falsifiers + LIVE-INTEGRATION reinforcement of FALSIFY-005/009.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496) for the gap finding, §55 (this PR) for the resolution.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
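The relaxed/strict routing described in this commit message reduces to a small predicate. The following Python sketch illustrates that rule under the stated semantics only; it is not the Rust helper pair named above, and `preflight_vocab_bound` is a hypothetical name.

```python
def preflight_vocab_bound(tokenizer_vocab: int, model_vocab: int,
                          init_is_some: bool) -> None:
    """Raise ValueError if the tokenizer/model vocab sizes are incompatible.

    init_is_some=True  -> relaxed bound: tokenizer_vocab <= model_vocab
    init_is_some=False -> strict bound:  tokenizer_vocab == model_vocab
    """
    if init_is_some:
        # Relaxed: reserved high-id slots are allowed, oversize tokenizers are not.
        if tokenizer_vocab > model_vocab:
            raise ValueError(
                f"tokenizer vocab_size ({tokenizer_vocab}) > "
                f"model vocab_size ({model_vocab})"
            )
    elif tokenizer_vocab != model_vocab:
        # Strict: from-scratch runs must match exactly.
        raise ValueError(
            f"tokenizer vocab_size ({tokenizer_vocab}) != "
            f"model vocab_size ({model_vocab})"
        )


# Qwen case from this PR: 151643 BPE + 22 added = 151665 <= 151936, so it passes.
preflight_vocab_bound(151643 + 22, 151936, init_is_some=True)
```

The same sizes with `init_is_some=False` raise, which corresponds to the unchanged from-scratch baseline.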

…ted; full run is ~17hr operator-dispatch (#1501)

§55 (in-flight PR #1500) closes the polymorphic preflight strictness gap and unblocks 5g.1 dispatch. §56 records the LIVE smoke that validates 5g.1's correctness end-to-end before committing to the multi-hour full run.

    apr tokenize encode-corpus \
        --corpus <python-permissive-5k.jsonl> \
        --tokenizer /tmp/qwen-0.5b-tokenizer-extracted \
        --output <smoke-shards> --shard-tokens 1000000
    → 13 valid u32 shards (12 full × ~1M + 1 partial = ~13M tokens for 5000 docs)
    → ~110 sec / M-token single-thread
    → No errors; shard rotation correct
    → Killed before manifest.json write (sufficient evidence accumulated)

Throughput comparison:

    Legacy 50257-vocab: ~64 sec / M-token  → 9.99 hr for 565M (validated)
    Qwen 151643-vocab:  ~110 sec / M-token → ~17 hr for 565M (projected)

Qwen is ~70% slower per-token because the BPE merge table is 3× larger (151387 vs 49997 merges); per-character merge-table search dominates encoding cost. Below the 48hr feedback_compute_pre_authorized.md ceiling, so 5g.1 full run is pre-authorized.

Step status:

- 5g.0 ✅ MERGED PR #1497 (apr tokenize import-hf)
- 5g.0.1 in-flight PR #1500 (§55 polymorphic preflight relaxation)
- 5g.1 CORRECTNESS-VALIDATED (this PR), full run pending operator
- 5g.2 gated on 5g.1 full run
- 5g.3 gated on 5g.2 (val_loss < 9.38 verdict)

Five Whys:

1. Why smoke before full run? ~17hr non-trivial; smoke proves chain correctness before committing to long wall.
2. Why 5000 docs? Smallest slice that exercises shard rotation (12M tokens > 10 shards).
3. Why kill smoke instead of complete? 13 shards = sufficient evidence; finishing wouldn't add information.
4. Why Qwen 70% slower? BPE merge-table size dominates encoding cost.
5. Why not parallelize? Out of 5g.1 scope; single-thread wall is below 48hr ceiling; ROI negative for current cycle.

Net effects:

- Spec v3.00.0 → v3.01.0 (assumes §55 lands first; safe either way, since §56 has no code/contract changes).
- 5g.1 reaches CORRECTNESS-VALIDATED state.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496), §55 (PR #1500 in-flight), §56 (this PR), evidence/section-56-5g-1-smoke-2026-05-05/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
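The wall-time projection in this commit message is plain arithmetic on the measured per-million-token rates. A short Python sketch, with `projected_wall_hours` as a hypothetical helper name and the rounded rates from the text as inputs:

```python
def projected_wall_hours(sec_per_million_tokens: float,
                         corpus_million_tokens: float) -> float:
    """Single-thread wall time in hours at a fixed per-million-token rate."""
    return sec_per_million_tokens * corpus_million_tokens / 3600.0


CORPUS_M_TOKENS = 565   # corpus size in millions of tokens, from the text
CEILING_HOURS = 48      # pre-authorization ceiling, from the text

legacy_hours = projected_wall_hours(64, CORPUS_M_TOKENS)   # legacy 50257-vocab rate
qwen_hours = projected_wall_hours(110, CORPUS_M_TOKENS)    # Qwen-vocab rate
slowdown = 110 / 64 - 1                                     # relative per-token slowdown

print(f"legacy ≈ {legacy_hours:.2f} h, qwen ≈ {qwen_hours:.2f} h, "
      f"slowdown ≈ {slowdown:.0%}, under ceiling: {qwen_hours < CEILING_HOURS}")
```

With the rounded 64 sec / M-token rate the legacy figure comes out near 10.04 hr, slightly above the validated 9.99 hr, which is consistent with the rate being quoted as approximate. The Qwen projection lands near 17.3 hr and the slowdown near 72%, matching the "~17 hr" and "~70%" figures.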


noahgift enabled auto-merge (squash) May 5, 2026 02:36

noahgift merged commit c89e9c9 into main May 5, 2026
11 checks passed

noahgift deleted the spec/section-54-step-5g-prereqs branch May 5, 2026 02:59

noahgift mentioned this pull request May 5, 2026

feat(apr-cli): apr tokenize import-hf — §50.4 step 5g.0 #1497

Merged

7 tasks

noahgift mentioned this pull request May 5, 2026

feat(apr-cli + aprender-train + spec): §55 polymorphic preflight relaxation v1.2 → v1.3 FUNCTIONAL #1500

Merged

7 tasks

noahgift merged 1 commit into main from spec/section-54-step-5g-prereqs
