feat(apr-cli + aprender-train + spec): §55 polymorphic preflight relaxation v1.2 → v1.3 FUNCTIONAL#1500
Merged
Conversation
…xation §54 LIVE smoke surfaced that public Qwen2.5-Coder-0.5B-Instruct's tokenizer.json materializes 151643 BPE entries + 22 added = 151665 effective strings, but config.json declares vocab_size=151936 (271 reserved/special slots not in tokenizer.json). Strict equality preflight was correct for §24/§25 from-scratch but too strict for HF-distributed pretrained checkpoints with reserved slots. ## Relaxation init=Some → tokenizer_vocab ≤ model_vocab (RELAXED, admits HF reserved slots) init=None → tokenizer_vocab == model_vocab (UNCHANGED, §24/§25 baseline) Safety: tokenizer-emitted ids ∈ [0, tokenizer_vocab) ⊆ [0, model_vocab). Reserved high-id slots are never indexed at training time. N-09 OOB escape impossible. Symmetric guard: tokenizer_vocab > model_vocab MUST FAIL even under init=Some (FALSIFY-APR-PRETRAIN-ARCH-010 — bound is ≤, not <). ## What ships Helper: assert_tokenizer_vocab_within_model_bound (aprender-train) symmetric to assert_tokenizer_vocab_matches_model Wireup: preflight_tokenizer_vocab_matches_target now takes init_is_some: bool; drive_real passes init_arch.is_some() to route to relaxed/strict. Contract: apr-pretrain-arch-polymorphic-v1 v1.2.0 → v1.3.0 FUNCTIONAL qwen_tokenizer_vocab_compatibility refined formula + invariants +FALSIFY-009 (relaxed accept) +FALSIFY-010 (oversize reject — OOB safety) total: 10 falsifiers, all PASS Tests (4 new, all PASS): falsify_apr_pretrain_arch_009_relaxed_bound_accepts_qwen_reserved_slots falsify_apr_pretrain_arch_010_relaxed_bound_rejects_oversized_tokenizer preflight_qwen_reserved_slots_pass_under_polymorphic_init preflight_oversized_tokenizer_rejected_even_under_polymorphic_init Spec amendment §55 + LIVE smoke evidence. ## LIVE smoke (this branch's apr binary + §54 extracted Qwen tokenizer) timeout 30 apr pretrain --init <Qwen.apr> --tokenizer <extracted-dir> ... → exit=124 (timeout), AFTER preflight passed → Configuration printed + Device: cpu + (proceeded to weight load) → No GATE-ARCH-370M-011 violations Evidence: evidence/section-55-relaxed-preflight-2026-05-05/relaxed-preflight-passes-smoke.md ## Five Whys 1. Why did §54 not catch this? §54 used legacy 50257 tokenizer (not §54's own extracted Qwen tokenizer); within-Qwen mismatch only surfaces after 5g.0 lands. 2. Why bound is ≤ not ==? HF checkpoints standardly declare vocab > tokenizer materialized; strict-equality would fail every Qwen/Llama2/Mistral. 3. Why preserve strict equality on from-scratch? §24/§25 evidence was gathered under strict; weakening retroactively could mask future from-scratch tokenizer-drift bugs. 4. Why new helper rather than mode parameter on existing? External callers (training-loop-pretrain-v1 contract) explicitly want strict; mode param would be backward-incompatible. 5. Why pin both FALSIFY-009 + FALSIFY-010? Bound is ≤, not <. Without FALSIFY-010, a regression to tokenizer_vocab > model_vocab would silently restore N-09 OOB risk. ## Net effects - Spec v2.99.0 → v3.00.0 (cascade pivots from infrastructure to LIVE prerequisites). - Contract apr-pretrain-arch-polymorphic-v1 v1.2.0 → v1.3.0 FUNCTIONAL. - 5g.0.1 lands as single PR; 5g.1 unblocked. - MODEL-1 ship % unchanged at 91%. - MODEL-2 ship % unchanged at 57% until 5g.3 val_loss < 9.38. - Coverage tally: +2 PARTIAL_ALGORITHM_LEVEL falsifiers + LIVE-INTEGRATION reinforcement of FALSIFY-005/009. Refs: SPEC-SHIP-TWO-001 §54 (PR #1496) for the gap finding, §55 (this PR) for the resolution. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
6 tasks
noahgift
added a commit
that referenced
this pull request
May 5, 2026
…ted; full run is ~17hr operator-dispatch §55 (in-flight PR #1500) closes the polymorphic preflight strictness gap and unblocks 5g.1 dispatch. §56 records the LIVE smoke that validates 5g.1's correctness end-to-end before committing to the multi-hour full run. apr tokenize encode-corpus \ --corpus <python-permissive-5k.jsonl> \ --tokenizer /tmp/qwen-0.5b-tokenizer-extracted \ --output <smoke-shards> --shard-tokens 1000000 → 13 valid u32 shards (12 full × ~1M + 1 partial = ~13M tokens for 5000 docs) → ~110 sec / M-token single-thread → No errors; shard rotation correct → Killed before manifest.json write (sufficient evidence accumulated) Legacy 50257-vocab: ~64 sec / M-token → 9.99 hr for 565M (validated) Qwen 151643-vocab: ~110 sec / M-token → ~17 hr for 565M (projected) Qwen is ~70% slower per-token because the BPE merge table is 3× larger (151387 vs 49997 merges); per-character merge-table search dominates encoding cost. Below the 48hr feedback_compute_pre_authorized.md ceiling, so 5g.1 full run is pre-authorized. 5g.0 ✅ MERGED PR #1497 (apr tokenize import-hf) 5g.0.1 in-flight PR #1500 (§55 polymorphic preflight relaxation) 5g.1 CORRECTNESS-VALIDATED (this PR), full run pending operator 5g.2 gated on 5g.1 full run 5g.3 gated on 5g.2 (val_loss < 9.38 verdict) 1. Why smoke before full run? ~17hr non-trivial; smoke proves chain correctness before committing to long wall. 2. Why 5000 docs? Smallest slice that exercises shard rotation (12M tokens > 10 shards). 3. Why kill smoke instead of complete? 13 shards = sufficient evidence; finishing wouldn't add information. 4. Why Qwen 70% slower? BPE merge-table size dominates encoding cost. 5. Why not parallelize? Out of 5g.1 scope; single-thread wall is below 48hr ceiling; ROI negative for current cycle. - Spec v3.00.0 → v3.01.0 (assumes §55 lands first; safe either way — §56 has no code/contract changes). - 5g.1 reaches CORRECTNESS-VALIDATED state. - MODEL-1 ship % unchanged at 91%. - MODEL-2 ship % unchanged at 57% until 5g.3. Refs: SPEC-SHIP-TWO-001 §54 (PR #1496), §55 (PR #1500 in-flight), §56 (this PR), evidence/section-56-5g-1-smoke-2026-05-05/ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 5, 2026
…ted; full run is ~17hr operator-dispatch (#1501) §55 (in-flight PR #1500) closes the polymorphic preflight strictness gap and unblocks 5g.1 dispatch. §56 records the LIVE smoke that validates 5g.1's correctness end-to-end before committing to the multi-hour full run. apr tokenize encode-corpus \ --corpus <python-permissive-5k.jsonl> \ --tokenizer /tmp/qwen-0.5b-tokenizer-extracted \ --output <smoke-shards> --shard-tokens 1000000 → 13 valid u32 shards (12 full × ~1M + 1 partial = ~13M tokens for 5000 docs) → ~110 sec / M-token single-thread → No errors; shard rotation correct → Killed before manifest.json write (sufficient evidence accumulated) Legacy 50257-vocab: ~64 sec / M-token → 9.99 hr for 565M (validated) Qwen 151643-vocab: ~110 sec / M-token → ~17 hr for 565M (projected) Qwen is ~70% slower per-token because the BPE merge table is 3× larger (151387 vs 49997 merges); per-character merge-table search dominates encoding cost. Below the 48hr feedback_compute_pre_authorized.md ceiling, so 5g.1 full run is pre-authorized. 5g.0 ✅ MERGED PR #1497 (apr tokenize import-hf) 5g.0.1 in-flight PR #1500 (§55 polymorphic preflight relaxation) 5g.1 CORRECTNESS-VALIDATED (this PR), full run pending operator 5g.2 gated on 5g.1 full run 5g.3 gated on 5g.2 (val_loss < 9.38 verdict) 1. Why smoke before full run? ~17hr non-trivial; smoke proves chain correctness before committing to long wall. 2. Why 5000 docs? Smallest slice that exercises shard rotation (12M tokens > 10 shards). 3. Why kill smoke instead of complete? 13 shards = sufficient evidence; finishing wouldn't add information. 4. Why Qwen 70% slower? BPE merge-table size dominates encoding cost. 5. Why not parallelize? Out of 5g.1 scope; single-thread wall is below 48hr ceiling; ROI negative for current cycle. - Spec v3.00.0 → v3.01.0 (assumes §55 lands first; safe either way — §56 has no code/contract changes). - 5g.1 reaches CORRECTNESS-VALIDATED state. - MODEL-1 ship % unchanged at 91%. - MODEL-2 ship % unchanged at 57% until 5g.3. Refs: SPEC-SHIP-TWO-001 §54 (PR #1496), §55 (PR #1500 in-flight), §56 (this PR), evidence/section-56-5g-1-smoke-2026-05-05/ Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 9, 2026
… (PMAT-CODE-PRETRAIN-INIT-FINETUNE-001)
Adds contracts/apr-pretrain-init-finetune-v1.yaml v1.0.0 DRAFT, the
falsifier scaffold for SHIP-TWO §56.4 step 5g.2 — the LIVE 500-step
fine-tune dispatch that flips MODEL-2 ship % 57% → ≥58%.
Pins six falsifiable invariants for `apr pretrain --mode from-init
--init <Qwen.apr> --shards-dir <5g.1-corpus> --steps 500 --device cuda`:
- FALSIFY-001 (ship-blocking): exit code == 0
- FALSIFY-002 (advisory): wall ≤ 3600 s on RTX 4090
- FALSIFY-003 (ship-blocking): step-0 loss ≤ 0.7 × ln(151936) ≈ 8.35
(proves init weights flow through forward)
- FALSIFY-004 (ship-blocking): checkpoint.apr written with valid
magic bytes (0x41 0x50 0x52 0x00 v2 OR
0x41 0x50 0x52 0x4E v1)
- FALSIFY-005 (ship-blocking): val_loss after 500 steps < 9.38
(the §34 370M-from-scratch ceiling)
- FALSIFY-006 (advisory): no CUDA OOM / illegal-address / launch-
OoR errors during run
Five-Whys (why this contract first, then live dispatch):
1. Why a contract before the dispatch? Per CLAUDE.md "Contract-first
design: NEVER write code before writing a provable contract."
Even though 5g.2 is "0 LOC operator-dispatch", it has shippable
semantics that deserve falsification scaffolding.
2. Why these particular six gates? They cover the four orthogonal
failure modes of a fine-tune-from-init dispatch: process-level
(exit/wall), correctness (step-0 baseline + val_loss), and
serialization (checkpoint magic bytes + GPU resource health).
3. Why DRAFT status (not PROPOSED, not ACTIVE)? DRAFT means "schema
validated, falsifiers authored, but no live evidence yet."
Status flips to ACTIVE_RUNTIME via §59 spec amendment after the
live dispatch produces evidence.
4. Why a separate contract from apr-pretrain-from-init-v1? The
sibling contract pins the in-process semantics of init loading
(load_init_tensors_from_apr, populate_trainer_from_init_tensors).
This new contract pins the END-TO-END dispatch outcome — they
compose at the dispatch boundary.
5. Why the val_loss < 9.38 threshold (not 5.0 or 7.0)? §34's 200K-
step retrain confirmed val_loss=9.38 as the 370M-from-scratch
capacity ceiling on this corpus. A from-init pivot must beat
from-scratch, otherwise §49's strategy reasoning is wrong.
Pre-requisites VERIFIED on host (lambda-vector RTX 4090):
- /mnt/nvme-raid0/models/qwen2.5-coder-0.5b-instruct-fp16.apr exists
- /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen has
228 shards / 2.278B tokens (manifest.json reconstructed by PR #1575)
- `apr pretrain --init <PATH>` end-to-end runnable per §53 (#1494 MERGED)
- Polymorphic preflight per §55 (#1500 MERGED)
Quality gates:
- `pv validate contracts/apr-pretrain-init-finetune-v1.yaml`: 0 errors
- `pv lint --strict-test-binding`: 9/9 gates PASS
SHIP-TWO impact:
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 prep work)
- MODEL-2 ship %: unchanged at 57% (this PR is contract-only;
ship-% flips on §59 amendment after live verdict)
- Unblocks: §59 spec amendment recording 5g.2 dispatch result
Next steps (follow-ups, NOT this PR):
- LIVE dispatch on RTX 4090 (~20-60 min wall, pre-authorized per
feedback_compute_pre_authorized.md)
- §59 spec amendment v3.05.0 → v3.06.0 with verdict + ship-% flip
- Contract status DRAFT → ACTIVE_RUNTIME
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 9, 2026
… (PMAT-CODE-PRETRAIN-INIT-FINETUNE-001) (#1576) Adds contracts/apr-pretrain-init-finetune-v1.yaml v1.0.0 DRAFT, the falsifier scaffold for SHIP-TWO §56.4 step 5g.2 — the LIVE 500-step fine-tune dispatch that flips MODEL-2 ship % 57% → ≥58%. Pins six falsifiable invariants for `apr pretrain --mode from-init --init <Qwen.apr> --shards-dir <5g.1-corpus> --steps 500 --device cuda`: - FALSIFY-001 (ship-blocking): exit code == 0 - FALSIFY-002 (advisory): wall ≤ 3600 s on RTX 4090 - FALSIFY-003 (ship-blocking): step-0 loss ≤ 0.7 × ln(151936) ≈ 8.35 (proves init weights flow through forward) - FALSIFY-004 (ship-blocking): checkpoint.apr written with valid magic bytes (0x41 0x50 0x52 0x00 v2 OR 0x41 0x50 0x52 0x4E v1) - FALSIFY-005 (ship-blocking): val_loss after 500 steps < 9.38 (the §34 370M-from-scratch ceiling) - FALSIFY-006 (advisory): no CUDA OOM / illegal-address / launch- OoR errors during run Five-Whys (why this contract first, then live dispatch): 1. Why a contract before the dispatch? Per CLAUDE.md "Contract-first design: NEVER write code before writing a provable contract." Even though 5g.2 is "0 LOC operator-dispatch", it has shippable semantics that deserve falsification scaffolding. 2. Why these particular six gates? They cover the four orthogonal failure modes of a fine-tune-from-init dispatch: process-level (exit/wall), correctness (step-0 baseline + val_loss), and serialization (checkpoint magic bytes + GPU resource health). 3. Why DRAFT status (not PROPOSED, not ACTIVE)? DRAFT means "schema validated, falsifiers authored, but no live evidence yet." Status flips to ACTIVE_RUNTIME via §59 spec amendment after the live dispatch produces evidence. 4. Why a separate contract from apr-pretrain-from-init-v1? The sibling contract pins the in-process semantics of init loading (load_init_tensors_from_apr, populate_trainer_from_init_tensors). This new contract pins the END-TO-END dispatch outcome — they compose at the dispatch boundary. 5. Why the val_loss < 9.38 threshold (not 5.0 or 7.0)? §34's 200K- step retrain confirmed val_loss=9.38 as the 370M-from-scratch capacity ceiling on this corpus. A from-init pivot must beat from-scratch, otherwise §49's strategy reasoning is wrong. Pre-requisites VERIFIED on host (lambda-vector RTX 4090): - /mnt/nvme-raid0/models/qwen2.5-coder-0.5b-instruct-fp16.apr exists - /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen has 228 shards / 2.278B tokens (manifest.json reconstructed by PR #1575) - `apr pretrain --init <PATH>` end-to-end runnable per §53 (#1494 MERGED) - Polymorphic preflight per §55 (#1500 MERGED) Quality gates: - `pv validate contracts/apr-pretrain-init-finetune-v1.yaml`: 0 errors - `pv lint --strict-test-binding`: 9/9 gates PASS SHIP-TWO impact: - MODEL-1 ship %: unchanged at 91% (this is MODEL-2 prep work) - MODEL-2 ship %: unchanged at 57% (this PR is contract-only; ship-% flips on §59 amendment after live verdict) - Unblocks: §59 spec amendment recording 5g.2 dispatch result Next steps (follow-ups, NOT this PR): - LIVE dispatch on RTX 4090 (~20-60 min wall, pre-authorized per feedback_compute_pre_authorized.md) - §59 spec amendment v3.05.0 → v3.06.0 with verdict + ship-% flip - Contract status DRAFT → ACTIVE_RUNTIME Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements §55 per spec §54 (PR #1496) finding. Refines the polymorphic-path preflight semantic from strict equality to relaxed bound (`tokenizer_vocab ≤ model_vocab`) when `--init` is set, while preserving strict equality for the from-scratch path. Bumps contract v1.2.0 → v1.3.0 FUNCTIONAL. Unblocks 5g.1 (corpus retokenize with Qwen vocab).
Why
§54 LIVE smoke + 5g.0 (PR #1497) chain surfaced that public Qwen2.5-Coder-0.5B-Instruct's `tokenizer.json` materializes 151643 BPE entries + 22 added = 151665 effective, but `config.json` declares vocab_size = 151936. The 271-slot gap is reserved/special slots — the lm_head + embedding layers have weights for IDs 151665..151935, but no tokenizer string maps to them. This is the canonical HF reserved-slot pattern across Qwen2.5/Llama2/Mistral.
Strict equality preflight was correct for §24/§25 from-scratch (where tokenizer is trained to exactly match) but wrong for HF-distributed pretrained checkpoints.
What ships
LIVE smoke
```
$ timeout 30 apr pretrain \
--init <Qwen2.5-Coder-0.5B-Instruct-fp16.apr> \
--tokenizer /tmp/qwen-0.5b-tokenizer-extracted \
--mode finetune --num-steps 1 --device cpu --vocab-size 151936
exit=124 # timeout, AFTER preflight passed
No GATE-ARCH-370M-011 violations in output
```
Process proceeded past preflight to weight load (timeout fired at 30s mid-load). FALSIFY-APR-PRETRAIN-ARCH-009 LIVE-INTEGRATION DISCHARGE.
Five Whys
Test plan
🤖 Generated with Claude Code