spec(ship-two-models): v3.00 → v3.01 — §56 5g.1 LIVE smoke validated; ~17hr full run operator-dispatch#1501
Merged
Conversation
…ted; full run is ~17hr operator-dispatch §55 (in-flight PR #1500) closes the polymorphic preflight strictness gap and unblocks 5g.1 dispatch. §56 records the LIVE smoke that validates 5g.1's correctness end-to-end before committing to the multi-hour full run. apr tokenize encode-corpus \ --corpus <python-permissive-5k.jsonl> \ --tokenizer /tmp/qwen-0.5b-tokenizer-extracted \ --output <smoke-shards> --shard-tokens 1000000 → 13 valid u32 shards (12 full × ~1M + 1 partial = ~13M tokens for 5000 docs) → ~110 sec / M-token single-thread → No errors; shard rotation correct → Killed before manifest.json write (sufficient evidence accumulated) Legacy 50257-vocab: ~64 sec / M-token → 9.99 hr for 565M (validated) Qwen 151643-vocab: ~110 sec / M-token → ~17 hr for 565M (projected) Qwen is ~70% slower per-token because the BPE merge table is 3× larger (151387 vs 49997 merges); per-character merge-table search dominates encoding cost. Below the 48hr feedback_compute_pre_authorized.md ceiling, so 5g.1 full run is pre-authorized. 5g.0 ✅ MERGED PR #1497 (apr tokenize import-hf) 5g.0.1 in-flight PR #1500 (§55 polymorphic preflight relaxation) 5g.1 CORRECTNESS-VALIDATED (this PR), full run pending operator 5g.2 gated on 5g.1 full run 5g.3 gated on 5g.2 (val_loss < 9.38 verdict) 1. Why smoke before full run? ~17hr non-trivial; smoke proves chain correctness before committing to long wall. 2. Why 5000 docs? Smallest slice that exercises shard rotation (12M tokens > 10 shards). 3. Why kill smoke instead of complete? 13 shards = sufficient evidence; finishing wouldn't add information. 4. Why Qwen 70% slower? BPE merge-table size dominates encoding cost. 5. Why not parallelize? Out of 5g.1 scope; single-thread wall is below 48hr ceiling; ROI negative for current cycle. - Spec v3.00.0 → v3.01.0 (assumes §55 lands first; safe either way — §56 has no code/contract changes). - 5g.1 reaches CORRECTNESS-VALIDATED state. - MODEL-1 ship % unchanged at 91%. - MODEL-2 ship % unchanged at 57% until 5g.3. Refs: SPEC-SHIP-TWO-001 §54 (PR #1496), §55 (PR #1500 in-flight), §56 (this PR), evidence/section-56-5g-1-smoke-2026-05-05/ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
7f2f316 to
789d5f6
Compare
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
§55 (in-flight PR #1500) closed the polymorphic preflight strictness gap and unblocked 5g.1 dispatch. §56 records the LIVE smoke that validates 5g.1's correctness end-to-end before committing to the multi-hour full run.
Smoke result
```
apr tokenize encode-corpus \
--corpus python-permissive-5k.jsonl \
--tokenizer /tmp/qwen-0.5b-tokenizer-extracted \
--output smoke-shards --shard-tokens 1000000
→ 13 valid u32 shards (~13M tokens for 5000 docs)
→ ~110 sec / M-token single-thread
→ No errors; shard rotation correct
→ Killed when sufficient evidence accumulated
```
Throughput
Qwen is ~70% slower per-token because the BPE merge table is 3× larger. Below the 48hr `feedback_compute_pre_authorized.md` ceiling, so 5g.1 full run is pre-authorized.
Updated 5g roadmap
Five Whys
Net effects
Test plan
🤖 Generated with Claude Code