spec(ship-two-models): v2.98 → v2.99 — §54 step 5g multi-step prereqs + live preflight smoke #1496

Merged: noahgift merged 1 commit into main from spec/section-54-step-5g-prereqs on May 5, 2026
@noahgift noahgift commented May 5, 2026

Summary

§53 closed with "step 5g LIVE remains", framing 5g as a single operator dispatch. Live source inspection of the post-#1494 binary plus an actual smoke run on canonical inputs revealed that step 5g has multi-step prerequisites that were NOT enumerated in §50's original 8-step decomposition.

Live empirical smoke (commit 92c7e23 apr binary, 2026-05-05T04:31Z):
```
apr pretrain --init <Qwen2.5-Coder-0.5B-Instruct-fp16.apr> \
--tokenizer <legacy 50257-vocab dir> \
--dataset <legacy codeparrot shards>
→ CORRECT FAIL-FAST: GATE-ARCH-370M-011 (INV-ARCH-370M-006) violated:
tokenizer vocab_size (50257) != model vocab_size (151936)
```

This is the FIRST end-to-end runtime evidence that the §50.4 cascade's polymorphic preflight works in the user-facing CLI. FALSIFY-APR-PRETRAIN-ARCH-005/006 are now LIVE-INTEGRATION (not just unit-test PARTIAL).

But the smoke also surfaces 5g's true scope.

Re-scoped 5g roadmap

| Step | What it does | LOC / wall | Status |
|------|--------------|------------|--------|
| 5g.0 | Extract Qwen tokenizer vocab.json + merges.txt from HF cache tokenizer.json | ~50 LOC + ~5 min | NEXT PR |
| 5g.1 | Re-tokenize codeparrot corpus with Qwen vocab | 0 LOC + ~10 hr operator-dispatch | gated on 5g.0 |
| 5g.2 | LIVE 500-step fine-tune dispatch | 0 LOC + ~20-60 min | gated on 5g.1 |
| 5g.3 | val_loss < 9.38 verdict; flip MODEL-2 ship % 57% → ≥58% | 0 LOC | gated on 5g.2 |

Methodology takeaway

Top-down spec planning consistently underestimates scope-coupling between heterogeneous code paths. Third instance of the same lesson:

  • §50 found §49's "0 LOC" was 8-step (architectural coupling)
  • §52 found §50's "5f weight load" was 2-step (CLI dispatch coupling)
  • §54 found §53's "5g LIVE" is 4-step (tokenizer-format coupling)

Five Whys

  1. Why didn't §50 enumerate 5g.0? Top-down decomposition under-counted the tokenizer-format axis (HF `tokenizer.json` vs aprender's `vocab.json` + `merges.txt`).
  2. Why does aprender require vocab.json + merges.txt rather than reading tokenizer.json? Historical: BPE loader was authored against GPT-2's released format. HF `tokenizer.json` came later.
  3. Why extraction (5g.0) instead of a tokenizer.json reader? Extraction is ~50 LOC of Python; reader integration is ~200 LOC of Rust + tests. Extraction is the cheaper ship-% path; reader is the principled follow-up.
  4. Why is 5g.1's 10-hour wall acceptable? Per `feedback_compute_pre_authorized.md`, named training/tokenization runs are pre-authorized below the 48-hour threshold.
  5. Why is the smoke load-bearing for the spec? It's the FIRST live evidence on canonical model + corpus + binary that the §50.4 cascade does what it claims. Unit tests prove the algorithm; the smoke proves the integration.

Net effects

  • Spec v2.98.0 → v2.99.0.
  • §50.4 roadmap: 5a-5f.4 INTEGRATION-COMPLETE; 5g re-scoped to 5g.0/5g.1/5g.2/5g.3.
  • MODEL-1 ship % unchanged at 91%; MODEL-2 ship % unchanged at 57% until 5g.3.
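  • Coverage tally: snapshot, no contract status flip. Contract `apr-pretrain-arch-polymorphic-v1` v1.2.0 FUNCTIONAL is reinforced; promotion to DISCHARGED waits for the 5g.3 val_loss measurement.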

Files changed

  • `docs/specifications/aprender-train/ship-two-models-spec.md` — §54 amendment.
  • `evidence/section-54-5g-prereqs-2026-05-05/preflight-fail-fast-smoke.md` — live smoke evidence.

Test plan

  • PMAT pre-commit quality gates pass
  • CI gate green (workspace-test, ci/gate)
  • Auto-merge fires on green CI

🤖 Generated with Claude Code

…requisites + live preflight smoke

§53 closed with "step 5g LIVE remains" framing 5g as a single operator
dispatch. Live source inspection of the post-#1494 binary plus an
actual smoke run revealed step 5g has multi-step prerequisites that
were NOT enumerated in §50's original 8-step decomposition.

Live empirical smoke on canonical inputs:
  apr pretrain --init <Qwen2.5-Coder-0.5B-Instruct-fp16.apr>
               --tokenizer <legacy 50257-vocab dir>
               --dataset <legacy codeparrot shards>
  → CORRECT FAIL-FAST: GATE-ARCH-370M-011 (INV-ARCH-370M-006)
    violated: tokenizer vocab_size (50257) != model vocab_size (151936)

This is the FIRST end-to-end runtime evidence that the §50.4 cascade's
polymorphic preflight (PR #1476 + #1494) works in the user-facing CLI:
  - Read --init APR metadata: vocab=151936, hidden=896, layers=24
  - target_vocab = init_arch.vocab_size = 151936 (NOT legacy 50257)
  - Tokenizer dir vocab.json count = 50257
  - Mismatch → fail-fast before trainer allocation
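
The four preflight steps listed above can be sketched as follows. This is a minimal illustration in Python, not the aprender implementation: `PreflightError`, `count_vocab`, `preflight_strict`, and the `init_metadata` dict shape are hypothetical names introduced here, and the real check lives in the Rust `apr pretrain` path.

```python
import json
from pathlib import Path


class PreflightError(Exception):
    """Raised when tokenizer and model vocab sizes disagree (hypothetical type)."""


def count_vocab(tokenizer_dir: Path) -> int:
    # vocab.json is a JSON object mapping token string -> integer id,
    # so the number of keys is the tokenizer's vocab size.
    with open(Path(tokenizer_dir) / "vocab.json", encoding="utf-8") as f:
        return len(json.load(f))


def preflight_strict(init_metadata: dict, tokenizer_dir: Path) -> int:
    # The target vocab comes from the --init checkpoint metadata,
    # not from a legacy default such as 50257.
    target_vocab = init_metadata["vocab_size"]
    tokenizer_vocab = count_vocab(tokenizer_dir)
    # Fail fast before any trainer state would be allocated.
    if tokenizer_vocab != target_vocab:
        raise PreflightError(
            f"tokenizer vocab_size ({tokenizer_vocab}) != "
            f"model vocab_size ({target_vocab})"
        )
    return target_vocab
```

Under these assumptions, a 50257-entry `vocab.json` checked against metadata declaring `vocab_size=151936` raises before anything else runs, which is the behaviour the smoke observed.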

But the smoke also surfaces 5g's true scope. A Qwen-vocab tokenizer dir
+ Qwen-tokenized corpus must exist BEFORE the preflight passes. Neither
exists on this host today.

Step 5g re-scoped:
  5g.0 — Qwen tokenizer extraction (~50 LOC, ~5min wall) [next PR]
  5g.1 — Qwen-tokenized corpus (0 LOC, ~10hr wall, operator-dispatch)
  5g.2 — LIVE 500-step fine-tune (0 LOC, ~20-60min, operator-dispatch)
  5g.3 — val_loss < 9.38 verdict; flip MODEL-2 ship % 57% → ≥58%

Methodology takeaway: top-down spec planning consistently
underestimates scope-coupling between heterogeneous code paths. This
is the third instance of the same lesson:
  - §50 found §49's "0 LOC" was 8-step (architectural coupling)
  - §52 found §50's "5f weight load" was 2-step (CLI dispatch coupling)
  - §54 found §53's "5g LIVE" is 4-step (tokenizer-format coupling)

Falsifier scoreboard impact:
  - FALSIFY-APR-PRETRAIN-ARCH-005/006 reach LIVE-INTEGRATION level
    (proven via real CLI dispatch, not just unit tests)
  - Contract `apr-pretrain-arch-polymorphic-v1` v1.2.0 FUNCTIONAL is
    reinforced; promotion to DISCHARGED waits for 5g.3 val_loss measurement

Net effects:
  - Spec v2.98.0 → v2.99.0
  - MODEL-1 ship % unchanged at 91%
  - MODEL-2 ship % unchanged at 57% (gated on 5g.3)
  - Coverage tally: snapshot, no contract status flip

Refs: SPEC-SHIP-TWO-001 §50.4 step 5g, PR #1476 + #1494,
      evidence/section-54-5g-prereqs-2026-05-05/preflight-fail-fast-smoke.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 5, 2026 02:36
@noahgift noahgift merged commit c89e9c9 into main May 5, 2026
11 checks passed
@noahgift noahgift deleted the spec/section-54-step-5g-prereqs branch May 5, 2026 02:59
noahgift added a commit that referenced this pull request May 5, 2026
Author contract apr-cli-tokenize-import-hf-v1 v1.0.0 PARTIAL_ALGORITHM_LEVEL
+ implement `apr tokenize import-hf <tokenizer.json> --output <DIR>`
that extracts vocab.json + merges.txt + manifest.json from a HuggingFace
BPE tokenizer.json into aprender's two-file layout.

This is the prerequisite step from §54's 5g re-scoping: aprender's
GPT-2-style BPE loader requires vocab.json + merges.txt; public
Qwen2.5/Llama2/Mistral tokenizers distribute as a single tokenizer.json.
Without import-hf, fine-tuning from these checkpoints is blocked.

## What ships

Contract `contracts/apr-cli-tokenize-import-hf-v1.yaml` v1.0.0 (5 falsifiers):
  FALSIFY-TOK-IMPORT-HF-001 — command registered in dispatch surface
  FALSIFY-TOK-IMPORT-HF-002 — BPE input produces non-empty vocab+merges
  FALSIFY-TOK-IMPORT-HF-003 — vocab count == |tokenizer.json:model.vocab|
  FALSIFY-TOK-IMPORT-HF-004 — merges.txt is one merge per line in order
  FALSIFY-TOK-IMPORT-HF-005 — non-BPE input fails fast (Unigram/WordPiece)

Subcommand `apr tokenize import-hf`:
  --input <FILE>             HF tokenizer.json (BPE required)
  --output <DIR>             output dir for vocab.json + merges.txt + manifest.json
  --include-added-tokens     also emit added_tokens (e.g., <|im_start|>) in vocab.json

Output dir layout (drop-in compatible with apr tokenize encode-corpus
and apr pretrain --tokenizer):
  vocab.json    — JSON object: token-string → integer-id
  merges.txt    — #version: 0.2 header + one space-separated merge per line
  manifest.json — provenance: source path, sha256, counts, timestamp
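
For orientation, the extraction can be sketched in a few lines of Python. This is an illustrative sketch only, not the `apr tokenize import-hf` source: the function name `import_hf` and the manifest field names are assumptions. It relies on the HF `tokenizer.json` layout (`model.type`, `model.vocab`, `model.merges`, top-level `added_tokens`) and tolerates merges serialized either as `"left right"` strings or as two-element lists.

```python
import hashlib
import json
import time
from pathlib import Path


def import_hf(tokenizer_json, out_dir, include_added_tokens=False):
    raw = Path(tokenizer_json).read_bytes()
    doc = json.loads(raw)
    model = doc.get("model", {})
    model_type = model.get("type")
    # Fail fast on Unigram/WordPiece: only BPE maps onto vocab.json + merges.txt.
    if model_type != "BPE":
        raise ValueError(f"unsupported model.type {model_type!r}; BPE required")

    vocab = dict(model["vocab"])
    added = doc.get("added_tokens", [])
    if include_added_tokens:
        # Opt-in: fold control tokens such as <|im_start|> into the vocab.
        for tok in added:
            vocab.setdefault(tok["content"], tok["id"])

    # Normalize both merge serializations to one space-separated line each.
    merges = [m if isinstance(m, str) else " ".join(m) for m in model["merges"]]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "vocab.json").write_text(
        json.dumps(vocab, ensure_ascii=False), encoding="utf-8"
    )
    (out / "merges.txt").write_text(
        "#version: 0.2\n" + "".join(m + "\n" for m in merges), encoding="utf-8"
    )
    # Manifest field names below are hypothetical.
    manifest = {
        "source": str(tokenizer_json),
        "sha256": hashlib.sha256(raw).hexdigest(),
        "vocab_count": len(vocab),
        "merges_count": len(merges),
        "added_tokens_count": len(added),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Merge order is preserved by construction, since the list is written in the order it appears in `tokenizer.json`.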

5 unit tests cover FALSIFY-002..005 + the --include-added-tokens path.

## LIVE smoke (canonical input)

apr tokenize import-hf \
  --input <Qwen2.5-Coder-0.5B-Instruct/tokenizer.json> \
  --output /tmp/qwen-0.5b-tokenizer-extracted --json
  → effective_vocab=151643, merges=151387, added_tokens=22, sha256 captured
  Files written: 3.2 MiB vocab.json, 1.6 MiB merges.txt, 534 B manifest.json
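
The counts reported above could be re-derived from the output directory with a small check like the one below. It is a sketch under the layout described in this commit (`vocab.json` as a token-to-id object, `merges.txt` with a `#version` header line); the helper name `recount` and the returned keys are hypothetical.

```python
import json
from pathlib import Path


def recount(out_dir):
    out = Path(out_dir)
    vocab = json.loads((out / "vocab.json").read_text(encoding="utf-8"))
    # Split on "\n" only, so unusual characters inside tokens never split a merge.
    lines = (out / "merges.txt").read_text(encoding="utf-8").split("\n")
    # Drop the "#version: 0.2" header if present; it is not a merge rule.
    if lines and lines[0].startswith("#version"):
        lines = lines[1:]
    merges = [ln for ln in lines if ln]
    return {
        "vocab": len(vocab),
        "merges": len(merges),
        "max_id": max(vocab.values(), default=-1),
    }
```

Run against the extracted Qwen directory, the `vocab` and `merges` values would be expected to match the `effective_vocab` and `merges` figures printed by the smoke.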

Evidence file: evidence/section-50-4-step-5g-0-import-hf-2026-05-05/live-extraction-smoke.md

## Five Whys

1. Why a new subcommand rather than a script? Per
   feedback_stack_tool_extension_not_cli_shim.md: when apr lacks a
   feature, extend apr in-tree. Authored as an apr subcommand with a
   provable contract; one-off scripts would have been muda.
2. Why under `apr tokenize` rather than `apr import` or top-level?
   `apr tokenize` already handles tokenizer artifacts (plan/apply/train/
   encode-corpus); import-hf is symmetric — produces the same output
   shape (vocab.json + merges.txt) as `train` does, just from an
   external HF source rather than from a corpus. `apr import` is for
   model files, not tokenizers.
3. Why default to BPE-only (excluding added_tokens)? The aprender
   BPE loader is a state machine; added tokens (e.g., <|endoftext|>)
   are control-string substitutions handled differently. Default mode
   keeps the BPE machine pure; --include-added-tokens is the explicit
   opt-in for when downstream consumption needs the unified vocab.
4. Why fail-fast on non-BPE? Unigram and WordPiece have different
   state machines; silent extraction would produce a vocab.json that
   the BPE loader accepts but tokenizes incorrectly. Fail-fast names
   the actual model.type and cites the contract id (auditability).
5. Why not also handle the polymorphic preflight gap (151643/151936)?
   That's a separate concern — the preflight's strict `==` semantic
   is correct for from-scratch models but too strict for Qwen-style
   models with reserved slots. §55 follow-up; out of 5g.0 scope.

## Net effects

- Contract apr-cli-tokenize-import-hf-v1 v1.0.0 PARTIAL_ALGORITHM_LEVEL.
- 1 new apr subcommand (`apr tokenize import-hf`) wired into dispatch.
- 5 new unit tests + 1 live smoke evidence file.
- Unblocks 5g.1 (Qwen-tokenized corpus pretokenization) modulo the
  §55 preflight strictness finding (vocab gap 151665 vs 151936).
- MODEL-1 ship %: unchanged at 91%.
- MODEL-2 ship %: unchanged at 57% until 5g.3 val_loss < 9.38.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496 — re-scoped 5g into 5g.0/.../5g.3),
      contracts/apr-pretrain-arch-polymorphic-v1.yaml v1.2.0 FUNCTIONAL (sibling)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…xation (#1500)

§54 LIVE smoke surfaced that public Qwen2.5-Coder-0.5B-Instruct's
tokenizer.json materializes 151643 BPE entries + 22 added = 151665
effective strings, but config.json declares vocab_size=151936
(271 reserved/special slots not in tokenizer.json). Strict equality
preflight was correct for §24/§25 from-scratch but too strict for
HF-distributed pretrained checkpoints with reserved slots.

## Relaxation

  init=Some  → tokenizer_vocab ≤ model_vocab  (RELAXED, admits HF reserved slots)
  init=None  → tokenizer_vocab == model_vocab (UNCHANGED, §24/§25 baseline)

Safety: tokenizer-emitted ids ∈ [0, tokenizer_vocab) ⊆ [0, model_vocab).
Reserved high-id slots are never indexed at training time. N-09 OOB
escape impossible.

Symmetric guard: tokenizer_vocab > model_vocab MUST FAIL even under
init=Some (FALSIFY-APR-PRETRAIN-ARCH-010 — bound is ≤, not <).
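
The two regimes and the symmetric guard can be written down as one routing function. This Python sketch only mirrors the rule stated above; `preflight_vocab` is a hypothetical name, and the shipped helpers are the Rust functions named under "What ships".

```python
def preflight_vocab(tokenizer_vocab: int, model_vocab: int, init_is_some: bool) -> None:
    if init_is_some:
        # Relaxed: pretrained checkpoints may reserve high-id slots.
        ok = tokenizer_vocab <= model_vocab
        relation = "<="
    else:
        # Strict: from-scratch runs keep the equality baseline.
        ok = tokenizer_vocab == model_vocab
        relation = "=="
    if not ok:
        raise ValueError(
            f"tokenizer vocab_size ({tokenizer_vocab}) must be {relation} "
            f"model vocab_size ({model_vocab})"
        )


# Figures from the Qwen2.5-Coder-0.5B-Instruct smoke.
BPE_ENTRIES, ADDED_TOKENS, MODEL_VOCAB = 151643, 22, 151936
effective = BPE_ENTRIES + ADDED_TOKENS   # 151665
reserved = MODEL_VOCAB - effective       # 271

# Accepted under init=Some: every emitted id is below model_vocab.
preflight_vocab(effective, MODEL_VOCAB, init_is_some=True)
```

Because the bound is `<=`, a tokenizer exactly as large as the model passes in both regimes, while one entry more is rejected even under `init=Some`.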

## What ships

Helper:
  assert_tokenizer_vocab_within_model_bound (aprender-train)
  symmetric to assert_tokenizer_vocab_matches_model

Wireup:
  preflight_tokenizer_vocab_matches_target now takes init_is_some: bool;
  drive_real passes init_arch.is_some() to route to relaxed/strict.

Contract:
  apr-pretrain-arch-polymorphic-v1 v1.2.0 → v1.3.0 FUNCTIONAL
  qwen_tokenizer_vocab_compatibility refined formula + invariants
  +FALSIFY-009 (relaxed accept) +FALSIFY-010 (oversize reject — OOB safety)
  total: 10 falsifiers, all PASS

Tests (4 new, all PASS):
  falsify_apr_pretrain_arch_009_relaxed_bound_accepts_qwen_reserved_slots
  falsify_apr_pretrain_arch_010_relaxed_bound_rejects_oversized_tokenizer
  preflight_qwen_reserved_slots_pass_under_polymorphic_init
  preflight_oversized_tokenizer_rejected_even_under_polymorphic_init

Spec amendment §55 + LIVE smoke evidence.

## LIVE smoke (this branch's apr binary + §54 extracted Qwen tokenizer)

  timeout 30 apr pretrain --init <Qwen.apr> --tokenizer <extracted-dir> ...
  → exit=124 (timeout), AFTER preflight passed
  → Configuration printed + Device: cpu + (proceeded to weight load)
  → No GATE-ARCH-370M-011 violations

Evidence: evidence/section-55-relaxed-preflight-2026-05-05/relaxed-preflight-passes-smoke.md

## Five Whys

1. Why did §54 not catch this? The §54 smoke used the legacy 50257-vocab
   tokenizer, not the Qwen tokenizer extracted in 5g.0; the within-Qwen
   mismatch only surfaces after 5g.0 lands.
2. Why bound is ≤ not ==? HF checkpoints standardly declare vocab > tokenizer
   materialized; strict-equality would fail every Qwen/Llama2/Mistral.
3. Why preserve strict equality on from-scratch? §24/§25 evidence was
   gathered under strict; weakening retroactively could mask future
   from-scratch tokenizer-drift bugs.
4. Why new helper rather than mode parameter on existing? External
   callers (training-loop-pretrain-v1 contract) explicitly want strict;
   mode param would be backward-incompatible.
5. Why pin both FALSIFY-009 + FALSIFY-010? Bound is ≤, not <. Without
   FALSIFY-010, a regression to tokenizer_vocab > model_vocab would
   silently restore N-09 OOB risk.

## Net effects

- Spec v2.99.0 → v3.00.0 (cascade pivots from infrastructure to LIVE prerequisites).
- Contract apr-pretrain-arch-polymorphic-v1 v1.2.0 → v1.3.0 FUNCTIONAL.
- 5g.0.1 lands as single PR; 5g.1 unblocked.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3 val_loss < 9.38.
- Coverage tally: +2 PARTIAL_ALGORITHM_LEVEL falsifiers + LIVE-INTEGRATION
  reinforcement of FALSIFY-005/009.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496) for the gap finding,
      §55 (this PR) for the resolution.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…ted; full run is ~17hr operator-dispatch (#1501)

§55 (in-flight PR #1500) closes the polymorphic preflight strictness
gap and unblocks 5g.1 dispatch. §56 records the LIVE smoke that
validates 5g.1's correctness end-to-end before committing to the
multi-hour full run.

  apr tokenize encode-corpus \
    --corpus <python-permissive-5k.jsonl> \
    --tokenizer /tmp/qwen-0.5b-tokenizer-extracted \
    --output <smoke-shards> --shard-tokens 1000000

  → 13 valid u32 shards (12 full × ~1M + 1 partial = ~13M tokens for 5000 docs)
  → ~110 sec / M-token single-thread
  → No errors; shard rotation correct
  → Killed before manifest.json write (sufficient evidence accumulated)
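
Shard rotation of the kind exercised here can be illustrated as follows. This is a sketch, not the `encode-corpus` implementation: the `shard-NNNNN.bin` naming and the little-endian byte order are assumptions, and only the u32 element width comes from the smoke output.

```python
import struct
from pathlib import Path


def write_shards(token_ids, out_dir, shard_tokens):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths, buf = [], []

    def flush():
        # Hypothetical file name; each token id is stored as a little-endian u32.
        path = out / f"shard-{len(paths):05d}.bin"
        path.write_bytes(struct.pack(f"<{len(buf)}I", *buf))
        paths.append(path)
        buf.clear()

    for tid in token_ids:
        buf.append(tid)
        # Rotate to a new shard once the current one is full.
        if len(buf) == shard_tokens:
            flush()
    # The final shard may be partial.
    if buf:
        flush()
    return paths
```

With `shard_tokens=1000000`, a stream of roughly 12 to 13 million ids yields 12 full shards plus one partial shard, the shape reported above.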

  Legacy 50257-vocab:   ~64 sec / M-token  →  9.99 hr for 565M (validated)
  Qwen 151643-vocab:    ~110 sec / M-token →  ~17 hr for 565M (projected)

Qwen is ~70% slower per-token because the BPE merge table is 3× larger
(151387 vs 49997 merges); per-character merge-table search dominates
encoding cost. Below the 48hr feedback_compute_pre_authorized.md
ceiling, so 5g.1 full run is pre-authorized.
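
The projection is straight multiplication from the per-M-token rates; the short Python check below reproduces it. The inputs are the rounded figures quoted above, so the legacy result lands near, not exactly on, the validated 9.99 hr.

```python
def projected_hours(total_tokens_m: float, sec_per_m_token: float) -> float:
    # Wall time in hours for a corpus measured in millions of tokens.
    return total_tokens_m * sec_per_m_token / 3600.0


CORPUS_M_TOKENS = 565
legacy_hr = projected_hours(CORPUS_M_TOKENS, 64)    # about 10.04
qwen_hr = projected_hours(CORPUS_M_TOKENS, 110)     # about 17.26
slowdown = 110 / 64 - 1                             # about 0.72, i.e. ~70% slower
merge_ratio = 151387 / 49997                        # about 3.03

print(f"legacy {legacy_hr:.2f} hr, qwen {qwen_hr:.2f} hr, "
      f"slowdown {slowdown:.0%}, merges x{merge_ratio:.2f}")
```

Both projections sit well under the 48 hr ceiling.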

  5g.0  ✅ MERGED PR #1497  (apr tokenize import-hf)
  5g.0.1 in-flight PR #1500 (§55 polymorphic preflight relaxation)
  5g.1  CORRECTNESS-VALIDATED (this PR), full run pending operator
  5g.2  gated on 5g.1 full run
  5g.3  gated on 5g.2 (val_loss < 9.38 verdict)

1. Why smoke before full run? ~17hr non-trivial; smoke proves chain
   correctness before committing to long wall.
2. Why 5000 docs? Smallest slice that exercises shard rotation (~12M
   tokens, i.e. more than 10 shards at 1M tokens each).
3. Why kill smoke instead of complete? 13 shards = sufficient evidence;
   finishing wouldn't add information.
4. Why Qwen 70% slower? BPE merge-table size dominates encoding cost.
5. Why not parallelize? Out of 5g.1 scope; single-thread wall is below
   48hr ceiling; ROI negative for current cycle.

- Spec v3.00.0 → v3.01.0 (assumes §55 lands first; safe either way —
  §56 has no code/contract changes).
- 5g.1 reaches CORRECTNESS-VALIDATED state.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496), §55 (PR #1500 in-flight),
      §56 (this PR), evidence/section-56-5g-1-smoke-2026-05-05/

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>