Skip to content

feat(apr-cli + aprender-train + spec): §55 polymorphic preflight relaxation v1.2 → v1.3 FUNCTIONAL#1500

Merged
noahgift merged 3 commits into
mainfrom
feat/section-55-polymorphic-preflight-relaxation
May 5, 2026
Merged

feat(apr-cli + aprender-train + spec): §55 polymorphic preflight relaxation v1.2 → v1.3 FUNCTIONAL#1500
noahgift merged 3 commits into
mainfrom
feat/section-55-polymorphic-preflight-relaxation

Conversation

@noahgift

@noahgift noahgift commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements §55 per spec §54 (PR #1496) finding. Refines the polymorphic-path preflight semantic from strict equality to relaxed bound (`tokenizer_vocab ≤ model_vocab`) when `--init` is set, while preserving strict equality for the from-scratch path. Bumps contract v1.2.0 → v1.3.0 FUNCTIONAL. Unblocks 5g.1 (corpus retokenize with Qwen vocab).

Why

§54 LIVE smoke + 5g.0 (PR #1497) chain surfaced that public Qwen2.5-Coder-0.5B-Instruct's `tokenizer.json` materializes 151643 BPE entries + 22 added = 151665 effective, but `config.json` declares vocab_size = 151936. The 271-slot gap is reserved/special slots — the lm_head + embedding layers have weights for IDs 151665..151935, but no tokenizer string maps to them. This is the canonical HF reserved-slot pattern across Qwen2.5/Llama2/Mistral.

Strict equality preflight was correct for §24/§25 from-scratch (where tokenizer is trained to exactly match) but wrong for HF-distributed pretrained checkpoints.

What ships

Artifact Change
`aprender-train/src/models/llama_370m.rs` New helper `assert_tokenizer_vocab_within_model_bound`
`apr-cli/src/commands/pretrain.rs` `preflight_tokenizer_vocab_matches_target` takes `init_is_some: bool`; `drive_real` routes through it
Contract v1.2.0 → v1.3.0 FUNCTIONAL refined `qwen_tokenizer_vocab_compatibility` formula + invariants
FALSIFY-APR-PRETRAIN-ARCH-009 NEW: relaxed bound accepts HF reserved slots
FALSIFY-APR-PRETRAIN-ARCH-010 NEW: relaxed bound rejects oversized (OOB safety)
4 new tests (2 helper + 2 integration) All PASS
`evidence/section-55-relaxed-preflight-2026-05-05/` LIVE smoke evidence
`docs/specifications/aprender-train/ship-two-models-spec.md` §55 amendment

LIVE smoke

```
$ timeout 30 apr pretrain \
--init <Qwen2.5-Coder-0.5B-Instruct-fp16.apr> \
--tokenizer /tmp/qwen-0.5b-tokenizer-extracted \
--mode finetune --num-steps 1 --device cpu --vocab-size 151936
exit=124 # timeout, AFTER preflight passed

No GATE-ARCH-370M-011 violations in output

```

Process proceeded past preflight to weight load (timeout fired at 30s mid-load). FALSIFY-APR-PRETRAIN-ARCH-009 LIVE-INTEGRATION DISCHARGE.

Five Whys

  1. Why did §54 not catch this? §54 used legacy 50257-vocab tokenizer (not §54's own extracted Qwen tokenizer); within-Qwen mismatch only surfaces after 5g.0 lands.
  2. Why bound is ≤ not ==? HF checkpoints standardly declare vocab > tokenizer materialized; strict-equality would fail every Qwen/Llama2/Mistral.
  3. Why preserve strict equality on from-scratch? §24/§25 evidence was under strict; weakening retroactively could mask future from-scratch tokenizer-drift bugs.
  4. Why new helper rather than mode param on existing? External callers (`training-loop-pretrain-v1` contract, `llama-370m-sovereign-v1`) explicitly want strict; mode param would be backward-incompatible.
  5. Why pin both FALSIFY-009 (accept) and FALSIFY-010 (reject)? Bound is ≤, not <. Without FALSIFY-010, a regression to `tokenizer_vocab > model_vocab` would silently restore N-09 OOB risk.

Test plan

  • `pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml` exits 0
  • PMAT pre-commit quality gates pass
  • 3/3 aprender-train llama_370m helper tests pass (existing strict + 2 new relaxed)
  • 7/7 apr-cli preflight tests pass (5 existing + 2 new integration)
  • LIVE smoke: relaxed preflight passes on §54-extracted Qwen tokenizer
  • CI gate green (workspace-test, ci/gate)
  • Auto-merge fires on green CI

🤖 Generated with Claude Code

…xation

§54 LIVE smoke surfaced that public Qwen2.5-Coder-0.5B-Instruct's
tokenizer.json materializes 151643 BPE entries + 22 added = 151665
effective strings, but config.json declares vocab_size=151936
(271 reserved/special slots not in tokenizer.json). Strict equality
preflight was correct for §24/§25 from-scratch but too strict for
HF-distributed pretrained checkpoints with reserved slots.

## Relaxation

  init=Some  → tokenizer_vocab ≤ model_vocab  (RELAXED, admits HF reserved slots)
  init=None  → tokenizer_vocab == model_vocab (UNCHANGED, §24/§25 baseline)

Safety: tokenizer-emitted ids ∈ [0, tokenizer_vocab) ⊆ [0, model_vocab).
Reserved high-id slots are never indexed at training time. N-09 OOB
escape impossible.

Symmetric guard: tokenizer_vocab > model_vocab MUST FAIL even under
init=Some (FALSIFY-APR-PRETRAIN-ARCH-010 — bound is ≤, not <).

## What ships

Helper:
  assert_tokenizer_vocab_within_model_bound (aprender-train)
  symmetric to assert_tokenizer_vocab_matches_model

Wireup:
  preflight_tokenizer_vocab_matches_target now takes init_is_some: bool;
  drive_real passes init_arch.is_some() to route to relaxed/strict.

Contract:
  apr-pretrain-arch-polymorphic-v1 v1.2.0 → v1.3.0 FUNCTIONAL
  qwen_tokenizer_vocab_compatibility refined formula + invariants
  +FALSIFY-009 (relaxed accept) +FALSIFY-010 (oversize reject — OOB safety)
  total: 10 falsifiers, all PASS

Tests (4 new, all PASS):
  falsify_apr_pretrain_arch_009_relaxed_bound_accepts_qwen_reserved_slots
  falsify_apr_pretrain_arch_010_relaxed_bound_rejects_oversized_tokenizer
  preflight_qwen_reserved_slots_pass_under_polymorphic_init
  preflight_oversized_tokenizer_rejected_even_under_polymorphic_init

Spec amendment §55 + LIVE smoke evidence.

## LIVE smoke (this branch's apr binary + §54 extracted Qwen tokenizer)

  timeout 30 apr pretrain --init <Qwen.apr> --tokenizer <extracted-dir> ...
  → exit=124 (timeout), AFTER preflight passed
  → Configuration printed + Device: cpu + (proceeded to weight load)
  → No GATE-ARCH-370M-011 violations

Evidence: evidence/section-55-relaxed-preflight-2026-05-05/relaxed-preflight-passes-smoke.md

## Five Whys

1. Why did §54 not catch this? §54 used legacy 50257 tokenizer (not §54's
   own extracted Qwen tokenizer); within-Qwen mismatch only surfaces
   after 5g.0 lands.
2. Why bound is ≤ not ==? HF checkpoints standardly declare vocab > tokenizer
   materialized; strict-equality would fail every Qwen/Llama2/Mistral.
3. Why preserve strict equality on from-scratch? §24/§25 evidence was
   gathered under strict; weakening retroactively could mask future
   from-scratch tokenizer-drift bugs.
4. Why new helper rather than mode parameter on existing? External
   callers (training-loop-pretrain-v1 contract) explicitly want strict;
   mode param would be backward-incompatible.
5. Why pin both FALSIFY-009 + FALSIFY-010? Bound is ≤, not <. Without
   FALSIFY-010, a regression to tokenizer_vocab > model_vocab would
   silently restore N-09 OOB risk.

## Net effects

- Spec v2.99.0 → v3.00.0 (cascade pivots from infrastructure to LIVE prerequisites).
- Contract apr-pretrain-arch-polymorphic-v1 v1.2.0 → v1.3.0 FUNCTIONAL.
- 5g.0.1 lands as single PR; 5g.1 unblocked.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3 val_loss < 9.38.
- Coverage tally: +2 PARTIAL_ALGORITHM_LEVEL falsifiers + LIVE-INTEGRATION
  reinforcement of FALSIFY-005/009.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496) for the gap finding,
      §55 (this PR) for the resolution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 5, 2026 03:59
@noahgift noahgift merged commit d4beb7d into main May 5, 2026
10 checks passed
@noahgift noahgift deleted the feat/section-55-polymorphic-preflight-relaxation branch May 5, 2026 05:06
noahgift added a commit that referenced this pull request May 5, 2026
…ted; full run is ~17hr operator-dispatch

§55 (in-flight PR #1500) closes the polymorphic preflight strictness
gap and unblocks 5g.1 dispatch. §56 records the LIVE smoke that
validates 5g.1's correctness end-to-end before committing to the
multi-hour full run.

  apr tokenize encode-corpus \
    --corpus <python-permissive-5k.jsonl> \
    --tokenizer /tmp/qwen-0.5b-tokenizer-extracted \
    --output <smoke-shards> --shard-tokens 1000000

  → 13 valid u32 shards (12 full × ~1M + 1 partial = ~13M tokens for 5000 docs)
  → ~110 sec / M-token single-thread
  → No errors; shard rotation correct
  → Killed before manifest.json write (sufficient evidence accumulated)

  Legacy 50257-vocab:   ~64 sec / M-token  →  9.99 hr for 565M (validated)
  Qwen 151643-vocab:    ~110 sec / M-token →  ~17 hr for 565M (projected)

Qwen is ~70% slower per-token because the BPE merge table is 3× larger
(151387 vs 49997 merges); per-character merge-table search dominates
encoding cost. Below the 48hr feedback_compute_pre_authorized.md
ceiling, so 5g.1 full run is pre-authorized.

  5g.0  ✅ MERGED PR #1497  (apr tokenize import-hf)
  5g.0.1 in-flight PR #1500 (§55 polymorphic preflight relaxation)
  5g.1  CORRECTNESS-VALIDATED (this PR), full run pending operator
  5g.2  gated on 5g.1 full run
  5g.3  gated on 5g.2 (val_loss < 9.38 verdict)

1. Why smoke before full run? ~17hr non-trivial; smoke proves chain
   correctness before committing to long wall.
2. Why 5000 docs? Smallest slice that exercises shard rotation (12M
   tokens > 10 shards).
3. Why kill smoke instead of complete? 13 shards = sufficient evidence;
   finishing wouldn't add information.
4. Why Qwen 70% slower? BPE merge-table size dominates encoding cost.
5. Why not parallelize? Out of 5g.1 scope; single-thread wall is below
   48hr ceiling; ROI negative for current cycle.

- Spec v3.00.0 → v3.01.0 (assumes §55 lands first; safe either way —
  §56 has no code/contract changes).
- 5g.1 reaches CORRECTNESS-VALIDATED state.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496), §55 (PR #1500 in-flight),
      §56 (this PR), evidence/section-56-5g-1-smoke-2026-05-05/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…ted; full run is ~17hr operator-dispatch (#1501)

§55 (in-flight PR #1500) closes the polymorphic preflight strictness
gap and unblocks 5g.1 dispatch. §56 records the LIVE smoke that
validates 5g.1's correctness end-to-end before committing to the
multi-hour full run.

  apr tokenize encode-corpus \
    --corpus <python-permissive-5k.jsonl> \
    --tokenizer /tmp/qwen-0.5b-tokenizer-extracted \
    --output <smoke-shards> --shard-tokens 1000000

  → 13 valid u32 shards (12 full × ~1M + 1 partial = ~13M tokens for 5000 docs)
  → ~110 sec / M-token single-thread
  → No errors; shard rotation correct
  → Killed before manifest.json write (sufficient evidence accumulated)

  Legacy 50257-vocab:   ~64 sec / M-token  →  9.99 hr for 565M (validated)
  Qwen 151643-vocab:    ~110 sec / M-token →  ~17 hr for 565M (projected)

Qwen is ~70% slower per-token because the BPE merge table is 3× larger
(151387 vs 49997 merges); per-character merge-table search dominates
encoding cost. Below the 48hr feedback_compute_pre_authorized.md
ceiling, so 5g.1 full run is pre-authorized.

  5g.0  ✅ MERGED PR #1497  (apr tokenize import-hf)
  5g.0.1 in-flight PR #1500 (§55 polymorphic preflight relaxation)
  5g.1  CORRECTNESS-VALIDATED (this PR), full run pending operator
  5g.2  gated on 5g.1 full run
  5g.3  gated on 5g.2 (val_loss < 9.38 verdict)

1. Why smoke before full run? ~17hr non-trivial; smoke proves chain
   correctness before committing to long wall.
2. Why 5000 docs? Smallest slice that exercises shard rotation (12M
   tokens > 10 shards).
3. Why kill smoke instead of complete? 13 shards = sufficient evidence;
   finishing wouldn't add information.
4. Why Qwen 70% slower? BPE merge-table size dominates encoding cost.
5. Why not parallelize? Out of 5g.1 scope; single-thread wall is below
   48hr ceiling; ROI negative for current cycle.

- Spec v3.00.0 → v3.01.0 (assumes §55 lands first; safe either way —
  §56 has no code/contract changes).
- 5g.1 reaches CORRECTNESS-VALIDATED state.
- MODEL-1 ship % unchanged at 91%.
- MODEL-2 ship % unchanged at 57% until 5g.3.

Refs: SPEC-SHIP-TWO-001 §54 (PR #1496), §55 (PR #1500 in-flight),
      §56 (this PR), evidence/section-56-5g-1-smoke-2026-05-05/

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 9, 2026
… (PMAT-CODE-PRETRAIN-INIT-FINETUNE-001)

Adds contracts/apr-pretrain-init-finetune-v1.yaml v1.0.0 DRAFT, the
falsifier scaffold for SHIP-TWO §56.4 step 5g.2 — the LIVE 500-step
fine-tune dispatch that flips MODEL-2 ship % 57% → ≥58%.

Pins six falsifiable invariants for `apr pretrain --mode from-init
--init <Qwen.apr> --shards-dir <5g.1-corpus> --steps 500 --device cuda`:

- FALSIFY-001 (ship-blocking): exit code == 0
- FALSIFY-002 (advisory):     wall ≤ 3600 s on RTX 4090
- FALSIFY-003 (ship-blocking): step-0 loss ≤ 0.7 × ln(151936) ≈ 8.35
                               (proves init weights flow through forward)
- FALSIFY-004 (ship-blocking): checkpoint.apr written with valid
                               magic bytes (0x41 0x50 0x52 0x00 v2 OR
                               0x41 0x50 0x52 0x4E v1)
- FALSIFY-005 (ship-blocking): val_loss after 500 steps < 9.38
                               (the §34 370M-from-scratch ceiling)
- FALSIFY-006 (advisory):     no CUDA OOM / illegal-address / launch-
                               OoR errors during run

Five-Whys (why this contract first, then live dispatch):

1. Why a contract before the dispatch? Per CLAUDE.md "Contract-first
   design: NEVER write code before writing a provable contract."
   Even though 5g.2 is "0 LOC operator-dispatch", it has shippable
   semantics that deserve falsification scaffolding.
2. Why these particular six gates? They cover the four orthogonal
   failure modes of a fine-tune-from-init dispatch: process-level
   (exit/wall), correctness (step-0 baseline + val_loss), and
   serialization (checkpoint magic bytes + GPU resource health).
3. Why DRAFT status (not PROPOSED, not ACTIVE)? DRAFT means "schema
   validated, falsifiers authored, but no live evidence yet."
   Status flips to ACTIVE_RUNTIME via §59 spec amendment after the
   live dispatch produces evidence.
4. Why a separate contract from apr-pretrain-from-init-v1? The
   sibling contract pins the in-process semantics of init loading
   (load_init_tensors_from_apr, populate_trainer_from_init_tensors).
   This new contract pins the END-TO-END dispatch outcome — they
   compose at the dispatch boundary.
5. Why the val_loss < 9.38 threshold (not 5.0 or 7.0)? §34's 200K-
   step retrain confirmed val_loss=9.38 as the 370M-from-scratch
   capacity ceiling on this corpus. A from-init pivot must beat
   from-scratch, otherwise §49's strategy reasoning is wrong.

Pre-requisites VERIFIED on host (lambda-vector RTX 4090):
- /mnt/nvme-raid0/models/qwen2.5-coder-0.5b-instruct-fp16.apr exists
- /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen has
  228 shards / 2.278B tokens (manifest.json reconstructed by PR #1575)
- `apr pretrain --init <PATH>` end-to-end runnable per §53 (#1494 MERGED)
- Polymorphic preflight per §55 (#1500 MERGED)

Quality gates:
- `pv validate contracts/apr-pretrain-init-finetune-v1.yaml`: 0 errors
- `pv lint --strict-test-binding`: 9/9 gates PASS

SHIP-TWO impact:
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 prep work)
- MODEL-2 ship %: unchanged at 57% (this PR is contract-only;
  ship-% flips on §59 amendment after live verdict)
- Unblocks: §59 spec amendment recording 5g.2 dispatch result

Next steps (follow-ups, NOT this PR):
- LIVE dispatch on RTX 4090 (~20-60 min wall, pre-authorized per
  feedback_compute_pre_authorized.md)
- §59 spec amendment v3.05.0 → v3.06.0 with verdict + ship-% flip
- Contract status DRAFT → ACTIVE_RUNTIME

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 9, 2026
… (PMAT-CODE-PRETRAIN-INIT-FINETUNE-001) (#1576)

Adds contracts/apr-pretrain-init-finetune-v1.yaml v1.0.0 DRAFT, the
falsifier scaffold for SHIP-TWO §56.4 step 5g.2 — the LIVE 500-step
fine-tune dispatch that flips MODEL-2 ship % 57% → ≥58%.

Pins six falsifiable invariants for `apr pretrain --mode from-init
--init <Qwen.apr> --shards-dir <5g.1-corpus> --steps 500 --device cuda`:

- FALSIFY-001 (ship-blocking): exit code == 0
- FALSIFY-002 (advisory):     wall ≤ 3600 s on RTX 4090
- FALSIFY-003 (ship-blocking): step-0 loss ≤ 0.7 × ln(151936) ≈ 8.35
                               (proves init weights flow through forward)
- FALSIFY-004 (ship-blocking): checkpoint.apr written with valid
                               magic bytes (0x41 0x50 0x52 0x00 v2 OR
                               0x41 0x50 0x52 0x4E v1)
- FALSIFY-005 (ship-blocking): val_loss after 500 steps < 9.38
                               (the §34 370M-from-scratch ceiling)
- FALSIFY-006 (advisory):     no CUDA OOM / illegal-address / launch-
                               OoR errors during run

Five-Whys (why this contract first, then live dispatch):

1. Why a contract before the dispatch? Per CLAUDE.md "Contract-first
   design: NEVER write code before writing a provable contract."
   Even though 5g.2 is "0 LOC operator-dispatch", it has shippable
   semantics that deserve falsification scaffolding.
2. Why these particular six gates? They cover the four orthogonal
   failure modes of a fine-tune-from-init dispatch: process-level
   (exit/wall), correctness (step-0 baseline + val_loss), and
   serialization (checkpoint magic bytes + GPU resource health).
3. Why DRAFT status (not PROPOSED, not ACTIVE)? DRAFT means "schema
   validated, falsifiers authored, but no live evidence yet."
   Status flips to ACTIVE_RUNTIME via §59 spec amendment after the
   live dispatch produces evidence.
4. Why a separate contract from apr-pretrain-from-init-v1? The
   sibling contract pins the in-process semantics of init loading
   (load_init_tensors_from_apr, populate_trainer_from_init_tensors).
   This new contract pins the END-TO-END dispatch outcome — they
   compose at the dispatch boundary.
5. Why the val_loss < 9.38 threshold (not 5.0 or 7.0)? §34's 200K-
   step retrain confirmed val_loss=9.38 as the 370M-from-scratch
   capacity ceiling on this corpus. A from-init pivot must beat
   from-scratch, otherwise §49's strategy reasoning is wrong.

Pre-requisites VERIFIED on host (lambda-vector RTX 4090):
- /mnt/nvme-raid0/models/qwen2.5-coder-0.5b-instruct-fp16.apr exists
- /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen has
  228 shards / 2.278B tokens (manifest.json reconstructed by PR #1575)
- `apr pretrain --init <PATH>` end-to-end runnable per §53 (#1494 MERGED)
- Polymorphic preflight per §55 (#1500 MERGED)

Quality gates:
- `pv validate contracts/apr-pretrain-init-finetune-v1.yaml`: 0 errors
- `pv lint --strict-test-binding`: 9/9 gates PASS

SHIP-TWO impact:
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 prep work)
- MODEL-2 ship %: unchanged at 57% (this PR is contract-only;
  ship-% flips on §59 amendment after live verdict)
- Unblocks: §59 spec amendment recording 5g.2 dispatch result

Next steps (follow-ups, NOT this PR):
- LIVE dispatch on RTX 4090 (~20-60 min wall, pre-authorized per
  feedback_compute_pre_authorized.md)
- §59 spec amendment v3.05.0 → v3.06.0 with verdict + ship-% flip
- Contract status DRAFT → ACTIVE_RUNTIME

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant