spec(ship-two-models): v2.98 → v2.99, §54 step 5g multi-step prereqs + live preflight smoke (#1496)
Merged
…requisites + live preflight smoke

§53 closed with "step 5g LIVE remains", framing 5g as a single operator dispatch. Live source inspection of the post-#1494 binary plus an actual smoke run revealed that step 5g has multi-step prerequisites that were NOT enumerated in §50's original 8-step decomposition.

Live empirical smoke on canonical inputs:

    apr pretrain --init <Qwen2.5-Coder-0.5B-Instruct-fp16.apr> \
      --tokenizer <legacy 50257-vocab dir> \
      --dataset <legacy codeparrot shards>
    → CORRECT FAIL-FAST:
      GATE-ARCH-370M-011 (INV-ARCH-370M-006) violated:
      tokenizer vocab_size (50257) != model vocab_size (151936)

This is the FIRST end-to-end runtime evidence that the §50.4 cascade's polymorphic preflight (PR #1476 + #1494) works in the user-facing CLI:
- Read --init APR metadata: vocab=151936, hidden=896, layers=24
- target_vocab = init_arch.vocab_size = 151936 (NOT legacy 50257)
- Tokenizer dir vocab.json count = 50257
- Mismatch → fail-fast before trainer allocation

But the smoke also surfaces 5g's true scope. A Qwen-vocab tokenizer dir + Qwen-tokenized corpus must exist BEFORE the preflight passes. Neither exists on this host today.

Step 5g re-scoped:
  5g.0: Qwen tokenizer extraction (~50 LOC, ~5min wall) [next PR]
  5g.1: Qwen-tokenized corpus (0 LOC, ~10hr wall, operator-dispatch)
  5g.2: LIVE 500-step fine-tune (0 LOC, ~20-60min, operator-dispatch)
  5g.3: val_loss < 9.38 verdict; flip MODEL-2 ship % 57% → ≥58%

Methodology takeaway: top-down spec planning consistently underestimates scope-coupling between heterogeneous code paths. This is the third instance of the same lesson:
- §50 found §49's "0 LOC" was 8-step (architectural coupling)
- §52 found §50's "5f weight load" was 2-step (CLI dispatch coupling)
- §54 found §53's "5g LIVE" is 4-step (tokenizer-format coupling)

Falsifier scoreboard impact:
- FALSIFY-APR-PRETRAIN-ARCH-005/006 reach LIVE-INTEGRATION level (proven via real CLI dispatch, not just unit tests)
- Contract `apr-pretrain-arch-polymorphic-v1` v1.2.0 FUNCTIONAL is reinforced; promotion to DISCHARGED waits for 5g.3 val_loss measurement

Net effects:
- Spec v2.98.0 → v2.99.0
- MODEL-1 ship % unchanged at 91%
- MODEL-2 ship % unchanged at 57% (gated on 5g.3)
- Coverage tally: snapshot, no contract status flip

Refs: SPEC-SHIP-TWO-001 §50.4 step 5g, PR #1476 + #1494, evidence/section-54-5g-prereqs-2026-05-05/preflight-fail-fast-smoke.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
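The fail-fast described in this commit can be pictured as a small check that counts the entries in the tokenizer directory's vocab.json and compares the count with the vocabulary size read from the --init checkpoint. The following is a minimal sketch in Python, not the aprender implementation: the names `PreflightError`, `count_tokenizer_vocab`, `strict_vocab_preflight`, and `demo` are hypothetical, and only the gate id and the message wording are taken from the smoke output above.

```python
import json
import tempfile
from pathlib import Path


class PreflightError(Exception):
    """Hypothetical error type, raised before any trainer allocation."""


def count_tokenizer_vocab(tokenizer_dir: Path) -> int:
    # vocab.json is a JSON object mapping token-string -> integer id,
    # so its length is the tokenizer's vocabulary size.
    with open(tokenizer_dir / "vocab.json", encoding="utf-8") as f:
        return len(json.load(f))


def strict_vocab_preflight(tokenizer_dir: Path, model_vocab: int) -> None:
    # Strict equality, as in the smoke run: any mismatch is fatal.
    tokenizer_vocab = count_tokenizer_vocab(tokenizer_dir)
    if tokenizer_vocab != model_vocab:
        raise PreflightError(
            "GATE-ARCH-370M-011 (INV-ARCH-370M-006) violated: "
            f"tokenizer vocab_size ({tokenizer_vocab}) != "
            f"model vocab_size ({model_vocab})"
        )


def demo() -> str:
    # Toy stand-in: a 3-token tokenizer dir checked against a 5-slot model.
    with tempfile.TemporaryDirectory() as tmp:
        tok_dir = Path(tmp)
        (tok_dir / "vocab.json").write_text(
            json.dumps({"a": 0, "b": 1, "c": 2}), encoding="utf-8"
        )
        try:
            strict_vocab_preflight(tok_dir, model_vocab=5)
        except PreflightError as err:
            return str(err)
    return "passed"
```

Calling `demo()` returns the violation message for the toy 3-vs-5 mismatch; the real run reported the same shape of message for 50257 vs 151936.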
noahgift added a commit that referenced this pull request on May 5, 2026
Author contract apr-cli-tokenize-import-hf-v1 v1.0.0 PARTIAL_ALGORITHM_LEVEL + implement `apr tokenize import-hf <tokenizer.json> --output <DIR>` that extracts vocab.json + merges.txt + manifest.json from a HuggingFace BPE tokenizer.json into aprender's two-file layout.

This is the prerequisite step from §54's 5g re-scoping: aprender's GPT-2-style BPE loader requires vocab.json + merges.txt; public Qwen2.5/Llama2/Mistral tokenizers distribute as a single tokenizer.json. Without import-hf, fine-tuning from these checkpoints is blocked.

## What ships

Contract `contracts/apr-cli-tokenize-import-hf-v1.yaml` v1.0.0 (5 falsifiers):
  FALSIFY-TOK-IMPORT-HF-001: command registered in dispatch surface
  FALSIFY-TOK-IMPORT-HF-002: BPE input produces non-empty vocab+merges
  FALSIFY-TOK-IMPORT-HF-003: vocab count == |tokenizer.json:model.vocab|
  FALSIFY-TOK-IMPORT-HF-004: merges.txt is one merge per line in order
  FALSIFY-TOK-IMPORT-HF-005: non-BPE input fails fast (Unigram/WordPiece)

Subcommand `apr tokenize import-hf`:
  --input <FILE>            HF tokenizer.json (BPE required)
  --output <DIR>            output dir for vocab.json + merges.txt + manifest.json
  --include-added-tokens    also emit added_tokens (e.g., <|im_start|>) in vocab.json

Output dir layout (drop-in compatible with apr tokenize encode-corpus and apr pretrain --tokenizer):
  vocab.json       JSON object: token-string → integer-id
  merges.txt       #version: 0.2 header + one space-separated merge per line
  manifest.json    provenance: source path, sha256, counts, timestamp

5 unit tests cover FALSIFY-002..005 + the --include-added-tokens path.

## LIVE smoke (canonical input)

    apr tokenize import-hf \
      --input <Qwen2.5-Coder-0.5B-Instruct/tokenizer.json> \
      --output /tmp/qwen-0.5b-tokenizer-extracted --json
    → effective_vocab=151643, merges=151387, added_tokens=22, sha256 captured

Files written: 3.2 MiB vocab.json, 1.6 MiB merges.txt, 534 B manifest.json
Evidence file: evidence/section-50-4-step-5g-0-import-hf-2026-05-05/live-extraction-smoke.md

## Five Whys

1. Why a new subcommand rather than a script?
   Per feedback_stack_tool_extension_not_cli_shim.md: when apr lacks a feature, extend apr in-tree. Authored as an apr subcommand with a provable contract; one-off scripts would have been muda.
2. Why under `apr tokenize` rather than `apr import` or top-level?
   `apr tokenize` already handles tokenizer artifacts (plan/apply/train/encode-corpus); import-hf is symmetric: it produces the same output shape (vocab.json + merges.txt) as `train` does, just from an external HF source rather than from a corpus. `apr import` is for model files, not tokenizers.
3. Why default to BPE-only (excluding added_tokens)?
   The aprender BPE loader is a state machine; added tokens (e.g., <|endoftext|>) are control-string substitutions handled differently. Default mode keeps the BPE machine pure; --include-added-tokens is the explicit opt-in for when downstream consumption needs the unified vocab.
4. Why fail-fast on non-BPE?
   Unigram and WordPiece have different state machines; silent extraction would produce a vocab.json that the BPE loader accepts but tokenizes incorrectly. Fail-fast names the actual model.type and cites the contract id (auditability).
5. Why not also handle the polymorphic preflight gap (151643/151936)?
   That's a separate concern: the preflight's strict `==` semantic is correct for from-scratch models but too strict for Qwen-style models with reserved slots. §55 follow-up; out of 5g.0 scope.

## Net effects

- Contract apr-cli-tokenize-import-hf-v1 v1.0.0 PARTIAL_ALGORITHM_LEVEL.
- 1 new apr subcommand (`apr tokenize import-hf`) wired into dispatch.
- 5 new unit tests + 1 live smoke evidence file.
- Unblocks 5g.1 (Qwen-tokenized corpus pretokenization) modulo the §55 preflight strictness finding (vocab gap 151665 vs 151936).
- MODEL-1 ship %: unchanged at 91%.
- MODEL-2 ship %: unchanged at 57% until 5g.3 val_loss < 9.38.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496, which re-scoped 5g into 5g.0/.../5g.3), contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.2.0 FUNCTIONAL (sibling)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
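To make the extraction concrete, here is a minimal Python sketch of the transformation this commit describes: read a HuggingFace tokenizer.json, refuse anything whose model.type is not BPE, and write vocab.json, merges.txt (with the `#version: 0.2` header), and manifest.json. It is an illustration under assumptions, not the `apr tokenize import-hf` code: the function names, the manifest field names, and the error type are hypothetical. The only external fact relied on is the tokenizer.json layout (`model.type`, `model.vocab`, `model.merges`, `added_tokens`), where merges may appear either as "a b" strings or as two-element lists.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def import_hf_tokenizer(tokenizer_json: Path, out_dir: Path,
                        include_added_tokens: bool = False) -> dict:
    raw = tokenizer_json.read_bytes()
    doc = json.loads(raw)
    model = doc.get("model", {})
    model_type = model.get("type")
    if model_type != "BPE":
        # Fail fast and name the model type that was actually found.
        raise ValueError(f"expected model.type 'BPE', found {model_type!r}")

    vocab = dict(model["vocab"])
    added = doc.get("added_tokens", [])
    if include_added_tokens:
        for tok in added:
            vocab.setdefault(tok["content"], tok["id"])

    # Normalise both serialisations to one space-separated merge per line.
    merges = [m if isinstance(m, str) else " ".join(m) for m in model["merges"]]

    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "vocab.json").write_text(
        json.dumps(vocab, ensure_ascii=False), encoding="utf-8")
    (out_dir / "merges.txt").write_text(
        "#version: 0.2\n" + "".join(m + "\n" for m in merges), encoding="utf-8")
    manifest = {  # field names are assumptions for this sketch
        "source": str(tokenizer_json),
        "sha256": hashlib.sha256(raw).hexdigest(),
        "vocab_count": len(vocab),
        "merges_count": len(merges),
        "added_tokens_count": len(added),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


def demo() -> tuple:
    # Toy tokenizer: 3 BPE entries, 1 merge, 1 added token.
    toy = {
        "model": {"type": "BPE",
                  "vocab": {"a": 0, "b": 1, "ab": 2},
                  "merges": [["a", "b"]]},
        "added_tokens": [{"id": 3, "content": "<|x|>"}],
    }
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "tokenizer.json"
        src.write_text(json.dumps(toy), encoding="utf-8")
        plain = import_hf_tokenizer(src, Path(tmp) / "plain")
        unified = import_hf_tokenizer(src, Path(tmp) / "unified",
                                      include_added_tokens=True)
        merges_txt = (Path(tmp) / "plain" / "merges.txt").read_text(encoding="utf-8")
        return plain["vocab_count"], unified["vocab_count"], merges_txt
```

In the toy run the default mode keeps only the three BPE entries, while the opt-in mode also carries the added token, which mirrors the 151643 vs 151665 split reported for the Qwen tokenizer.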
noahgift added a commit that referenced this pull request on May 5, 2026
…xation (#1500)

§54 LIVE smoke surfaced that public Qwen2.5-Coder-0.5B-Instruct's tokenizer.json materializes 151643 BPE entries + 22 added = 151665 effective strings, but config.json declares vocab_size=151936 (271 reserved/special slots not in tokenizer.json). Strict equality preflight was correct for §24/§25 from-scratch but too strict for HF-distributed pretrained checkpoints with reserved slots.

## Relaxation

  init=Some → tokenizer_vocab ≤ model_vocab (RELAXED, admits HF reserved slots)
  init=None → tokenizer_vocab == model_vocab (UNCHANGED, §24/§25 baseline)

Safety: tokenizer-emitted ids ∈ [0, tokenizer_vocab) ⊆ [0, model_vocab). Reserved high-id slots are never indexed at training time. N-09 OOB escape impossible.

Symmetric guard: tokenizer_vocab > model_vocab MUST FAIL even under init=Some (FALSIFY-APR-PRETRAIN-ARCH-010; bound is ≤, not <).

## What ships

Helper: assert_tokenizer_vocab_within_model_bound (aprender-train), symmetric to assert_tokenizer_vocab_matches_model
Wireup: preflight_tokenizer_vocab_matches_target now takes init_is_some: bool; drive_real passes init_arch.is_some() to route to relaxed/strict.
Contract: apr-pretrain-arch-polymorphic-v1 v1.2.0 → v1.3.0 FUNCTIONAL
  qwen_tokenizer_vocab_compatibility refined formula + invariants
  +FALSIFY-009 (relaxed accept)
  +FALSIFY-010 (oversize reject, OOB safety)
  total: 10 falsifiers, all PASS
Tests (4 new, all PASS):
  falsify_apr_pretrain_arch_009_relaxed_bound_accepts_qwen_reserved_slots
  falsify_apr_pretrain_arch_010_relaxed_bound_rejects_oversized_tokenizer
  preflight_qwen_reserved_slots_pass_under_polymorphic_init
  preflight_oversized_tokenizer_rejected_even_under_polymorphic_init
Spec amendment §55 + LIVE smoke evidence.

## LIVE smoke (this branch's apr binary + §54 extracted Qwen tokenizer)

    timeout 30 apr pretrain --init <Qwen.apr> --tokenizer <extracted-dir> ...
    → exit=124 (timeout), AFTER preflight passed
    → Configuration printed + Device: cpu + (proceeded to weight load)
    → No GATE-ARCH-370M-011 violations

Evidence: evidence/section-55-relaxed-preflight-2026-05-05/relaxed-preflight-passes-smoke.md

## Five Whys

1. Why did §54 not catch this?
   §54 used legacy 50257 tokenizer (not §54's own extracted Qwen tokenizer); within-Qwen mismatch only surfaces after 5g.0 lands.
2. Why bound is ≤ not ==?
   HF checkpoints standardly declare vocab > tokenizer materialized; strict-equality would fail every Qwen/Llama2/Mistral.
3. Why preserve strict equality on from-scratch?
   §24/§25 evidence was gathered under strict; weakening retroactively could mask future from-scratch tokenizer-drift bugs.
4. Why new helper rather than mode parameter on existing?
   External callers (training-loop-pretrain-v1 contract) explicitly want strict; mode param would be backward-incompatible.
5. Why pin both FALSIFY-009 + FALSIFY-010?
   Bound is ≤, not <. Without FALSIFY-010, a regression to tokenizer_vocab > model_vocab would silently restore N-09 OOB risk.

## Net effects

- Spec v2.99.0 → v3.00.0 (cascade pivots from infrastructure to LIVE prerequisites).
- Contract apr-pretrain-arch-polymorphic-v1 v1.2.0 → v1.3.0 FUNCTIONAL.
- 5g.0.1 lands as single PR; 5g.1 unblocked.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3 val_loss < 9.38.
- Coverage tally: +2 PARTIAL_ALGORITHM_LEVEL falsifiers + LIVE-INTEGRATION reinforcement of FALSIFY-005/009.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496) for the gap finding, §55 (this PR) for the resolution.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
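The relaxation rule in this commit reduces to a two-branch comparison, which the Python sketch below spells out. It is illustrative only: the actual helpers named above live in aprender-train and their signatures are not shown in this conversation, so `check_tokenizer_vocab` and its error type are hypothetical names chosen for the example. The numbers in the usage lines are the Qwen figures quoted above (151665 effective tokenizer strings, 151936 model slots).

```python
def check_tokenizer_vocab(tokenizer_vocab: int, model_vocab: int,
                          init_is_some: bool) -> None:
    """Raise ValueError if the tokenizer cannot be used with the model.

    init_is_some=True  -> relaxed bound: tokenizer_vocab <= model_vocab
    init_is_some=False -> strict bound:  tokenizer_vocab == model_vocab
    """
    if init_is_some:
        ok = tokenizer_vocab <= model_vocab
        relation = "<="
    else:
        ok = tokenizer_vocab == model_vocab
        relation = "=="
    if not ok:
        raise ValueError(
            f"tokenizer vocab_size ({tokenizer_vocab}) must be {relation} "
            f"model vocab_size ({model_vocab})"
        )


def passes(tokenizer_vocab: int, model_vocab: int, init_is_some: bool) -> bool:
    # Convenience wrapper for the usage lines below.
    try:
        check_tokenizer_vocab(tokenizer_vocab, model_vocab, init_is_some)
        return True
    except ValueError:
        return False


# Fine-tune from a pretrained checkpoint: reserved slots are admitted.
qwen_under_init = passes(151665, 151936, init_is_some=True)
# From-scratch training: the same gap is still rejected.
qwen_from_scratch = passes(151665, 151936, init_is_some=False)
# Oversized tokenizer: rejected even under init, since ids could index
# past the embedding table.
oversized_under_init = passes(151937, 151936, init_is_some=True)
```

Keeping the oversize case as its own check matters because every id the tokenizer can emit lies in [0, tokenizer_vocab); the relaxed branch is safe only while that range stays inside [0, model_vocab).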
noahgift added a commit that referenced this pull request on May 5, 2026
…ted; full run is ~17hr operator-dispatch (#1501)

§55 (in-flight PR #1500) closes the polymorphic preflight strictness gap and unblocks 5g.1 dispatch. §56 records the LIVE smoke that validates 5g.1's correctness end-to-end before committing to the multi-hour full run.

    apr tokenize encode-corpus \
      --corpus <python-permissive-5k.jsonl> \
      --tokenizer /tmp/qwen-0.5b-tokenizer-extracted \
      --output <smoke-shards> --shard-tokens 1000000
    → 13 valid u32 shards (12 full × ~1M + 1 partial = ~13M tokens for 5000 docs)
    → ~110 sec / M-token single-thread
    → No errors; shard rotation correct
    → Killed before manifest.json write (sufficient evidence accumulated)

    Legacy 50257-vocab: ~64 sec / M-token  → 9.99 hr for 565M (validated)
    Qwen 151643-vocab:  ~110 sec / M-token → ~17 hr for 565M (projected)

Qwen is ~70% slower per-token because the BPE merge table is 3× larger (151387 vs 49997 merges); per-character merge-table search dominates encoding cost. Below the 48hr feedback_compute_pre_authorized.md ceiling, so 5g.1 full run is pre-authorized.

  5g.0    ✅ MERGED PR #1497 (apr tokenize import-hf)
  5g.0.1  in-flight PR #1500 (§55 polymorphic preflight relaxation)
  5g.1    CORRECTNESS-VALIDATED (this PR), full run pending operator
  5g.2    gated on 5g.1 full run
  5g.3    gated on 5g.2 (val_loss < 9.38 verdict)

1. Why smoke before full run?
   ~17hr non-trivial; smoke proves chain correctness before committing to long wall.
2. Why 5000 docs?
   Smallest slice that exercises shard rotation (12M tokens > 10 shards).
3. Why kill smoke instead of complete?
   13 shards = sufficient evidence; finishing wouldn't add information.
4. Why Qwen 70% slower?
   BPE merge-table size dominates encoding cost.
5. Why not parallelize?
   Out of 5g.1 scope; single-thread wall is below 48hr ceiling; ROI negative for current cycle.

- Spec v3.00.0 → v3.01.0 (assumes §55 lands first; safe either way, since §56 has no code/contract changes).
- 5g.1 reaches CORRECTNESS-VALIDATED state.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496), §55 (PR #1500 in-flight), §56 (this PR), evidence/section-56-5g-1-smoke-2026-05-05/

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
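The wall-clock projection in this commit is plain arithmetic on the measured throughput, and the short Python sketch below reproduces it. The inputs (565M tokens, ~64 and ~110 sec per M-token, the 48hr ceiling) are the figures quoted above; the function and variable names are hypothetical. Because the rates are rounded, the legacy result lands near, not exactly on, the validated 9.99 hr.

```python
def projected_hours(corpus_m_tokens: float, sec_per_m_token: float) -> float:
    # Single-thread wall time: (M-tokens) x (seconds per M-token) -> hours.
    return corpus_m_tokens * sec_per_m_token / 3600.0


CORPUS_M_TOKENS = 565        # full corpus size, in millions of tokens
LEGACY_SEC_PER_M = 64        # legacy 50257-vocab tokenizer (measured, approx.)
QWEN_SEC_PER_M = 110         # Qwen tokenizer (smoke measurement, approx.)
CEILING_HOURS = 48           # pre-authorized compute ceiling

legacy_hours = projected_hours(CORPUS_M_TOKENS, LEGACY_SEC_PER_M)  # about 10.0
qwen_hours = projected_hours(CORPUS_M_TOKENS, QWEN_SEC_PER_M)      # about 17.3
slowdown = QWEN_SEC_PER_M / LEGACY_SEC_PER_M - 1.0                  # about 0.72
within_ceiling = qwen_hours < CEILING_HOURS
```

The same function can be reused to re-project the run if a later measurement changes the per-M-token rate.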
Summary
§53 closed with "step 5g LIVE remains" framing 5g as a single operator dispatch. Live source inspection of the post-#1494 binary plus an actual smoke run on canonical inputs revealed step 5g has multi-step prerequisites that were NOT enumerated in §50's original 8-step decomposition.
Live empirical smoke (commit 92c7e23 apr binary, 2026-05-05T04:31Z):
```
apr pretrain --init <Qwen2.5-Coder-0.5B-Instruct-fp16.apr> \
--tokenizer <legacy 50257-vocab dir> \
--dataset <legacy codeparrot shards>
→ CORRECT FAIL-FAST: GATE-ARCH-370M-011 (INV-ARCH-370M-006) violated:
tokenizer vocab_size (50257) != model vocab_size (151936)
```
This is the FIRST end-to-end runtime evidence that the §50.4 cascade's polymorphic preflight works in the user-facing CLI. FALSIFY-APR-PRETRAIN-ARCH-005/006 are now LIVE-INTEGRATION (not just unit-test PARTIAL).
But the smoke also surfaces 5g's true scope.
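Re-scoped 5g roadmap
A Qwen-vocab tokenizer dir and a Qwen-tokenized corpus must exist BEFORE the preflight passes. Neither exists on this host today.
- 5g.0: Qwen tokenizer extraction (~50 LOC, ~5min wall) [next PR]
- 5g.1: Qwen-tokenized corpus (0 LOC, ~10hr wall, operator-dispatch)
- 5g.2: LIVE 500-step fine-tune (0 LOC, ~20-60min, operator-dispatch)
- 5g.3: val_loss < 9.38 verdict; flip MODEL-2 ship % 57% → ≥58%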
Methodology takeaway
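Top-down spec planning consistently underestimates scope-coupling between heterogeneous code paths. Third instance of the same lesson:
- §50 found §49's "0 LOC" was 8-step (architectural coupling)
- §52 found §50's "5f weight load" was 2-step (CLI dispatch coupling)
- §54 found §53's "5g LIVE" is 4-step (tokenizer-format coupling)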
Five Whys
Net effects
Files changed
Test plan
🤖 Generated with Claude Code