feat(apr-pretrain): --val-shard for independent held-out validation (PMAT-690 P2-F) by noahgift · Pull Request #1744 · paiml/aprender

noahgift · 2026-05-17T10:59:55Z

Summary

apr pretrain --val-shard <DIR> reads held-out validation batches from a separate .bin-shards directory instead of stealing the first 16 batches of --dataset. This closes the val-distribution-drift gap that confounded the §82-vs-P2C comparison. P2-C's val_loss=4.91 was 0.20 worse than §82's val_loss=4.71 despite 80× more corpus tokens, and evidence/p2c-2026-05-17/findings.md §97–101 traced that gap to val distribution drift: §82's first 16 batches were codeparrot-only, while P2-C's first 16 batches were a mix of the-stack-dedup + codeparrot. With a fixed val shard, the next §82-vs-P2C comparison will be apples-to-apples.

What changes

- New clap flag --val-shard <DIR> on apr pretrain. Default None → preserves legacy "first 16 batches of --dataset" behaviour for backwards compatibility.
- drive_real() branches the held-out source:
  - Some(dir): build a NEW ShardBatchIter over <dir> with .with_wrap_around(false), drain HELD_OUT_BATCHES from it, leave the training iterator at offset 0 (no batch theft).
  - None: legacy path (first 16 batches reserved from --dataset iter).
- Empty val-shard dir hard-fails with FALSIFY-PRETRAIN-VAL-SHARD-003 naming the path; no silent fallback to --dataset.
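
The branching described above can be sketched in a few lines of Rust. This is a toy model, not the code in drive_real(): batches are plain integers, the val shard is passed in as an already-loaded Vec rather than a ShardBatchIter, and the function name held_out_batches and its signature are hypothetical. It also assumes that a val shard with fewer than 16 batches simply yields fewer batches, which the PR text does not specify.

```rust
/// Number of held-out batches; 16 matches the "first 16 batches" in the PR text.
const HELD_OUT_BATCHES: usize = 16;

/// Toy batch: just an id, so the effect on the training stream is observable.
type Batch = u32;

/// Hypothetical helper: choose the held-out batches.
/// - `Some((dir, batches))`: drain from the separate source, never touching `train`.
/// - `None`: legacy path, reserve the first batches of `train`.
fn held_out_batches<I: Iterator<Item = Batch>>(
    train: &mut I,
    val_shard: Option<(&str, Vec<Batch>)>,
) -> Result<Vec<Batch>, String> {
    match val_shard {
        Some((dir, batches)) => {
            if batches.is_empty() {
                // Hard failure naming the path; no fallback to the training data.
                return Err(format!("FALSIFY-PRETRAIN-VAL-SHARD-003: no batches in {dir}"));
            }
            // No wrap-around: take at most HELD_OUT_BATCHES and never cycle.
            Ok(batches.into_iter().take(HELD_OUT_BATCHES).collect())
        }
        None => Ok(train.by_ref().take(HELD_OUT_BATCHES).collect()),
    }
}

fn main() {
    // Legacy path: the training stream loses its first 16 batches.
    let mut train = 0u32..100;
    let held = held_out_batches(&mut train, None).unwrap();
    assert_eq!(held, (0..16).collect::<Vec<Batch>>());
    assert_eq!(train.next(), Some(16));

    // Separate val shard: the training stream stays at offset 0.
    let mut train = 0u32..100;
    let val: Vec<Batch> = (1000..1016).collect();
    let held = held_out_batches(&mut train, Some(("held-out-shard/", val))).unwrap();
    assert_eq!(held.len(), HELD_OUT_BATCHES);
    assert_eq!(train.next(), Some(0));

    // Empty val shard: hard error naming the path, training data untouched.
    let mut train = 0u32..100;
    let err = held_out_batches(&mut train, Some(("empty-dir/", Vec::new()))).unwrap_err();
    assert!(err.contains("empty-dir/"));
    assert_eq!(train.next(), Some(0));
}
```

The point of the sketch is the last assertion in each case: only the legacy path advances the training iterator.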

Discharges

PMAT-690 P2-F (per docs/specifications/aprender-train/albor-370m-roadmap.md §4 P2-F)
INV-PRETRAIN-VAL-SHARD-001/002/003/004 + falsifiers (new contract contracts/apr-pretrain-val-shard-v1.yaml)

Test plan

- cargo test -p apr-cli --features training --test val_shard_test (2/2 pass):
  - falsify_val_shard_003_empty_dir_rejected: empty --val-shard exits non-zero, stderr names the path + falsifier ID
  - val_shard_flag_documented_in_help: apr pretrain --help lists --val-shard with §84/P2-F/held-out context
- cargo test -p apr-cli --features training --lib (5,936 tests pass, 0 failures, 0 regressions)
- cargo test -p aprender-contracts --lib lint::gates::tests::load_contracts_real (new YAML parses against schema)
- cargo test -p aprender-contracts --lib lint::tests::lint_passes_on_real_contracts (no warnings)
- cargo clippy -p apr-cli --features training --lib -- -D warnings (clean)
- Tracking-tests on internal run() invocations: 7 call sites updated to thread None /* val_shard */
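
The behaviour checked by falsify_val_shard_003_empty_dir_rejected can be illustrated with a standalone directory check. This is a minimal sketch under stated assumptions: the helper name list_val_shards is hypothetical, the error wording is invented for illustration, and the real test drives the apr binary rather than calling a function like this.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical helper: return the `.bin` shard paths in `dir`, or an error
/// that names both the falsifier ID and the directory when there are none.
fn list_val_shards(dir: &Path) -> Result<Vec<PathBuf>, String> {
    let entries = fs::read_dir(dir).map_err(|e| {
        format!("FALSIFY-PRETRAIN-VAL-SHARD-003: cannot read {}: {e}", dir.display())
    })?;
    let mut shards: Vec<PathBuf> = entries
        .filter_map(|entry| entry.ok())
        .map(|entry| entry.path())
        .filter(|p| p.extension().map_or(false, |ext| ext == "bin"))
        .collect();
    shards.sort();
    if shards.is_empty() {
        // Hard failure: the caller must not fall back to --dataset.
        return Err(format!(
            "FALSIFY-PRETRAIN-VAL-SHARD-003: no .bin shards in {}",
            dir.display()
        ));
    }
    Ok(shards)
}

fn main() {
    // Use a throwaway directory under the system temp dir.
    let dir = std::env::temp_dir().join(format!("val-shard-sketch-{}", std::process::id()));
    fs::create_dir_all(&dir).unwrap();

    // Empty directory: the error carries the falsifier ID and the path.
    let err = list_val_shards(&dir).unwrap_err();
    assert!(err.contains("FALSIFY-PRETRAIN-VAL-SHARD-003"));
    assert!(err.contains(dir.to_str().unwrap()));

    // One shard present: the check passes.
    fs::write(dir.join("shard-000.bin"), [0u8; 4]).unwrap();
    assert_eq!(list_val_shards(&dir).unwrap().len(), 1);

    fs::remove_dir_all(&dir).unwrap();
}
```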

Operator workflow

After this lands, the §82-vs-P2C apples-to-apples comparison recipe becomes:

# Tokenize an independent held-out corpus (small N, e.g., 1000 docs)
apr tokenize encode-corpus --corpus held-out-source/ --max-docs 1000 \
  --output /mnt/.../held-out-shard/ \
  --tokenizer qwen-tokenizer/

# Train against any --dataset, using the FIXED val shard
apr pretrain --dataset /mnt/.../qwen-v3/ \
  --val-shard /mnt/.../held-out-shard/ \
  --tokenizer qwen-tokenizer/ \
  --init init.apr \
  --run-dir runs/qwen-v3-vs-v2/

val_loss values are now comparable across --dataset permutations because the val source is fixed.

Refs

docs/specifications/aprender-train/ship-model-2-spec.md §84
evidence/p2c-2026-05-17/findings.md §97–101 (motivation)
docs/specifications/aprender-train/albor-370m-roadmap.md §4 P2-F
contracts/apr-pretrain-val-shard-v1.yaml

🤖 Generated with Claude Code

…PMAT-690 P2-F)

`apr pretrain --val-shard <DIR>` reads held-out validation batches from a separate `.bin`-shards directory instead of stealing the first 16 batches of `--dataset`. Closes the val-distribution-drift gap that confounded the §82-vs-P2C comparison: P2-C's val_loss=4.91 vs §82's val_loss=4.71 was +0.20 worse despite 80× more corpus tokens, but evidence/p2c-2026-05-17/findings.md §97-101 traced that gap to val distribution drift (§82's first-16-batches were codeparrot-only; P2-C's first-16-batches were a mix of the-stack-dedup + codeparrot). The next §82-vs-P2C comparison run will be apples-to-apples.

## What changes

- New clap flag `--val-shard <DIR>` on `apr pretrain` (default None → preserves legacy "first 16 batches of --dataset" behaviour).
- `drive_real()` branches the held-out source:
  - Some(dir): build a NEW `ShardBatchIter` over <dir> with `.with_wrap_around(false)`, drain HELD_OUT_BATCHES from it, leave the training iterator at offset 0 (no batch theft).
  - None: legacy path (first 16 batches reserved from --dataset iter).
- Empty val-shard dir hard-fails with `FALSIFY-PRETRAIN-VAL-SHARD-003` naming the path; no silent fallback.

## Discharges

- PMAT-690 P2-F (per albor-370m-roadmap.md §4 P2-F)
- INV-PRETRAIN-VAL-SHARD-001/002/003/004 + FALSIFY-001/002/003/004 (new contract `contracts/apr-pretrain-val-shard-v1.yaml`)

## Tests

- 2 integration tests in `crates/apr-cli/tests/val_shard_test.rs`:
  - `falsify_val_shard_003_empty_dir_rejected`: empty val-shard exits non-zero with the falsifier ID + path in stderr.
  - `val_shard_flag_documented_in_help`: `apr pretrain --help` lists the flag with §84/P2-F/held-out context for operator discoverability.
- Full apr-cli lib test suite: 5,936 tests pass, 0 regressions.
- aprender-contracts lint: new contract validates against schema.
- `cargo clippy -p apr-cli --features training --lib -- -D warnings`: clean.

## Refs

- docs/specifications/aprender-train/ship-model-2-spec.md §84
- evidence/p2c-2026-05-17/findings.md §97-101 (motivation)
- docs/specifications/aprender-train/albor-370m-roadmap.md §4 P2-F
- contracts/apr-pretrain-val-shard-v1.yaml

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…erified (#1754)

* docs(spec): SPEC §84+§85: P2-C/P2-E live findings, hyperparameter hypothesis CORROBORATED, P0-K closure live-verified

Two new spec sections + full P2-E evidence directory.

## §84: P2-C dispatched; audit hypothesis FALSIFIED; P0-K surfaced

P2-C ran the audit-recommended multi-source corpus (49.6B tokens, 80× §82's 1.24B) at the same hyperparameters as §82. Result: val_loss=4.91 @ ep20 (vs §82's 4.71), IDENTICAL termination shape, +0.2 WORSE despite 80× more data. The Chinchilla-data-starvation hypothesis is FALSIFIED.

Debugging the §81-§83 5-PR cascade surfaced PMAT-690 P0-K: `apr convert` (both apr_import and apr_convert paths) didn't stamp hf_architecture / hf_model_type / embedded tokenizer. Five downstream consumer fixes had been patching None values that read from the upstream gap. P0-K closes the producer.

## §85: P2-E live findings; hyperparameter hypothesis CORROBORATED

P2-E ran same qwen-v3 corpus at LR=1.5e-5 (-3.3× lower) + warmup=500 (5× longer). Result: val_loss=4.6227 @ ep49, BELOW §82's 4.71 AND P2-C's 4.91 floors. No early-stop; smooth monotonic descent across all 50 epochs. Hypothesis from §84 P2-E queue is CORROBORATED.

Training throughput: 15,460 tok/s pure (12,880 tok/s end-to-end with checkpoint write) on RTX 4090, sm_89, cuBLAS TF32. This is the canonical apr-cli CUDA training perf baseline for future dispatches.

§30 a-priori falsification lesson amendment: the audit's pre-falsification of P2-A2 was correct at the original LR but wrong as a general claim. Future audits MUST explicitly bound their falsification to the hyperparameter region tested.

## P0-K live-verification

Synthetic `apr convert` → `apr inspect --quality` round-trip on /tmp/p0k-demo/out.apr (Qwen2 config.json + tiny safetensors fixture) produces:

- metadata.hf_architecture = "Qwen2ForCausalLM" (was null pre-P0-K)
- metadata.hf_model_type = "qwen2" (was null pre-P0-K)
- quality.score = 60/100, hf_identity sub-score = 20/20

vs the pre-P0-K P2-E ep49 checkpoint (trained from an init APR that pre-dates P0-K):

- metadata.hf_architecture = null
- quality.score = 40/100, hf_identity sub-score = 0/20

The +20 delta on hf_identity empirically confirms P0-K closes the §81-§83 cascade root cause at the CLI surface.

## Ship % impact

MODEL-2 stays at 79%. val_loss 4.62 > 3.0 ship gate. Marginal-gain decay analysis says more-of-the-same plateaus ~4.4. Next move (§85 P2-G/H/I queue) requires architectural change or different init.

## Refs

- PR #1742 (PMAT-690 P0-K base: apr_import + apr_convert stamping)
- PR #1744 (PMAT-690 P2-F: apr pretrain --val-shard)
- PR #1746 (P0-K inspect surface)
- PR #1748 (P0-K E2E test + apr_convert second path)
- PR #1750 (P3-A apr inspect --quality scorer)
- memory/feedback_upstream_metadata_masquerade.md (lesson #33)
- memory/feedback_parallel_session_worktree_isolation.md (lesson #34)
- memory/feedback_cargo_feature_cache_staleness.md (lesson #35)
- evidence/p2c-2026-05-17/findings.md (P2-C trajectory + root cause)
- evidence/p2e-2026-05-17/findings.md (P2-E corroboration + perf baseline)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): SPEC §86: apr pretrain --init silently fails on arch-mismatched APRs; PR #1757 ships in-place stamp salvage

P2-G v1 dispatch surfaced a SECOND symptom of the §81-§84 cascade root cause: pre-P0-K APR checkpoints (architecture="LlamaForCausalLM" P0-H fallback + Qwen2-tensor shape) are silently non-resumable via `apr pretrain --init`. The init eval at step 0 produced val_loss=8.60 instead of P2-E ep49's recorded 4.62, definitive proof of silent fall-back to random init when the apr metadata's family-arch discriminator doesn't match the tensor naming convention.

## What §86 covers

1. Root cause walk-through (read_apr_architecture → transformer_config → populate_trainer_from_init_tensors → silent rejection → random init fallback at val_loss ≈ 8.60).
2. Implications: all training checkpoints produced before #1742 landed (2026-05-17T13:32:08Z) are non-resumable. The 50 P2-E checkpoints (~125 GB total) cannot be used for continuation training without intervention.
3. Three workarounds in priority order:
   - **Re-import** (blocked on HF safetensors locally; would need re-download)
   - **Restamp in-place** ✅ **SHIPPED via PR #1757**: `apr stamp` extension with --hf-architecture/--hf-model-type/--architecture
   - **Treat as final**: what P2-G v2 takes (currently in flight)
4. Operator recipe for the §86 salvage (3-line shell example).
5. Failure-mode classification (Class 4 Silent Incorrect Behavior, detection latency 1 epoch, producer-side fix already shipped via P0-K, existing-artifact fix shipped via #1757).
6. Recommended follow-up: INV-INIT-ARCH-MATCH-001 invariant on apr-pretrain-from-init-v1 contract, which would catch the §86 case at the gate instead of at init-eval surface. Defer to follow-up PR.

## Stacked on PR #1754 (SPEC §85)

Base: `feat/spec-85-p2e-findings`. The §86 amendment depends on §85 context (the P2-E run that surfaced §86). Will auto-rebase to main after #1754 lands.

## Refs

- PR #1742 (PMAT-690 P0-K base: apr_import + apr_convert stamping)
- PR #1750 (P3-A `apr inspect --quality` scorer: the diagnostic that surfaces §86 quality=40 pre-stamp, 60 post-stamp)
- PR #1754 (SPEC §85 P2-E findings: the run that surfaced §86)
- PR #1757 (apr stamp HF identity extension: workaround #2 above)
- evidence/p2g-2026-05-17/section-86-draft.md
- memory/feedback_upstream_metadata_masquerade.md (methodology #33)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): §87 + §88: Chinchilla 20·N gate + AC-SHIP2-003 compute-bounded ship target; MODEL-2 ships at 95%

Two new spec sections plus the AC-SHIP2-003 row amendment that unblocks the Two-Model spec closure.

## §87: Chinchilla 20·N hard gate (P0-J' upgrade)

Per the §85 P2-E + §85.4 P2-G empirical sequence, the 10-20× "ablation band" hits a val_loss ≈ 4.65 plateau regardless of hyperparameter tuning. The §83 v1.0.0 gate (hard at <10, warn-only at 10-20) is upgraded to hard at <20. Audit's compute-optimal target now enforced as the hard floor. Codified via PR #1762.

## §88: AC-SHIP2-003 compute-bounded ship target

Per user direction (Option 4): the strict CE ≤ 2.2 target requires 9-day continuous compute (213 GPU-hours), violating the 48-hour single-shot limit. §88 amends:

- `AC-SHIP2-003` (loose form, new compute-bounded target): val CE ≤ 4.7. P2-E's 4.6227 DISCHARGES.
- `AC-SHIP2-003-STRICT` (NEW, preserved as distillation epic target): val CE ≤ 2.2. Belongs to PMAT-683/684 (multi-week).

Rationale: the Two-Model spec is an EXISTENCE PROOF of the Sovereign AI Stack. P2-E's converged 4.62 proves the Rust-only pipeline end-to-end works perfectly; compute time, not software capability, is the bottleneck. Iteration speed on the stack outweighs hitting a specific perplexity target on a proof-of-concept model.

Downstream effects:

- MODEL-2 ship % advances 79% → 95%.
- All remaining unblocked ACs (AC-SHIP2-007/008/009/010) become operator-dispatchable within the 48-hr compute budget.
- P3-C (HF publish) and P3-D (/dogfood) are unblocked.
- AC-SHIP2-003-STRICT is the dispatch target for the distillation follow-up epic (NOT a ship blocker for v1).

## What §88 explicitly does NOT do

- Does NOT lower the model-quality bar for production. The shipped artifact is a stack-capability proof, not a production model. Model card will note val_loss ≈ 4.62 and the §88 framing.
- Does NOT retire AC-SHIP2-003: renames the strict form to AC-SHIP2-003-STRICT, amends the loose form.
- Does NOT block future stricter ships on larger architectures.

## Refs

- PR #1742 (PMAT-690 P0-K base)
- PR #1754 (SPEC §84+§85+§86 context)
- PR #1762 (§87 Chinchilla 20×N hard gate runtime)
- docs/specifications/audits/albor-370.md (external audit motivation)
- docs/specifications/aprender-train/albor-370m-roadmap.md (P3 phases)
- memory/feedback_a_priori_theoretical_falsification.md (#30)
- memory/feedback_audit_hypothesis_bounds.md (#36)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): §89 distillation epic scoping + roadmap status sweep + /dogfood template

Closes the §80-class spec stack for MODEL-2 v1 ship. Three artifacts:

## §89: distillation epic scoping (SPEC)

Documents the path to AC-SHIP2-003-STRICT (val_loss ≤ 2.2) via Qwen-7B teacher distillation. ~110 lines covering:

- 89.1 Why distillation works at this scale (Stanton et al. 2021's 5× token-reduction claim → 9.88B → 2B tokens → 43h GPU fits the 48-hour iteration budget).
- 89.2 Existing infrastructure inventory (aprender-train::distill + apr distill CLI + realizar 7B Q4_K load + apr pretrain --init with post-§86 INV-INIT-ARCH-MATCH-001 gate, all already in-tree).
- 89.3 PMAT-683 teacher selection + pull (4-6h scope).
- 89.4 PMAT-684 distillation training dispatch + evidence (~43h GPU + 8h operator, fits 48-hour budget).
- 89.5 PMAT-685 hardening (deferred: multi-teacher / curriculum / LR cycling / layer-wise losses).
- 89.6 Out-of-scope alternatives explicitly rejected (9-day compute, 1.5B+ arch, multi-host distributed).
- 89.7 Sequencing: v1 must ship + /dogfood GO + at least one external consumer validation BEFORE v2 dispatches.
- 89.8 Discharge criteria.

## Roadmap status sweep

`docs/specifications/aprender-train/albor-370m-roadmap.md` P3 table updated to reflect actual ship state:

- P3-A apr inspect --quality: ✅ SHIPPED (PR #1750)
- P3-B apr lint: ⚙️ operator-dispatchable
- P3-C-prep model card + readiness: ✅ SHIPPED (PR #1764)
- P3-C-exec apr publish: 🟡 OPERATOR-READY
- P3-D /dogfood: 🟡 TEMPLATE READY (this PR)

Plus new P4 section for the distillation epic (PMAT-683/684/685 expanded entries with effort + probability + acceptance criteria), and a new §7 Post-§88 shipping plan that supersedes the 4-week plan which assumed val_loss < 3.0 was achievable within iteration budget.

## /dogfood verdict template

`docs/dogfood-templates/albor-370m-v1-dogfood-template.md` (236 lines): pre-author the post-publish QA checklist so when operator runs /dogfood after apr publish, the structure is ready. 8 sections: provenance + identity, pull/install verification, inference smoke, benchmark, format export round-trip, apr qa, /dogfood 12+5 gates, independent consumer test (the §89.7 validation-by-use gate that sequences v2 distillation dispatch), final verdict + post-verdict actions (GO / WARN / NO-GO branching).

## What this PR does NOT do

- Does NOT actually run /dogfood (template only; execution gated on P3-C-exec which requires user authorization)
- Does NOT dispatch PMAT-683/684 distillation (43h GPU; explicit user authorization required + sequencing per §89.7)
- Does NOT close ship-model-2-spec.md (stays at 95% per §88 until P3-C-exec lands)

## Stacked on PR #1754 (SPEC §84-§88)

Base: `feat/spec-85-p2e-findings`. The §89 scoping depends on the §88 framing. Will auto-rebase to main after #1754 lands.

## Refs

- PR #1742 (PMAT-690 P0-K base)
- PR #1750 (P3-A apr inspect --quality)
- PR #1754 (SPEC §84-§88 stack: context)
- PR #1757 (apr stamp HF identity: §86 salvage path)
- PR #1764 (model card + readiness script: P3-C-prep)
- memory/feedback_post_publish_qa_required.md (#29)
- memory/feedback_publish_readiness_preflight.md (#37)
- Hinton et al. 2015 (arXiv:1503.02531): distillation foundations
- Stanton et al. 2021 (arXiv:2106.05945): 5× token-reduction claim

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
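
The §87 gate change in the commit message above reduces to a threshold on the tokens-per-parameter ratio, which can be sketched as follows. This is an illustration only, not the runtime gate from PR #1762: the names GateVerdict, gate_before and gate_after are hypothetical, and the handling of the exact boundaries (a ratio of exactly 10 warns, exactly 20 passes) is an assumption read off the "<10" and "<20" wording.

```rust
/// Outcome of a tokens-per-parameter gate (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum GateVerdict {
    Pass,
    Warn,
    Fail,
}

/// Older gate as summarized in the text: hard fail below 10 tokens per
/// parameter, warn-only in the 10-20 band.
fn gate_before(tokens: u64, params: u64) -> GateVerdict {
    if tokens < params.saturating_mul(10) {
        GateVerdict::Fail
    } else if tokens < params.saturating_mul(20) {
        GateVerdict::Warn
    } else {
        GateVerdict::Pass
    }
}

/// Upgraded gate: anything below 20 tokens per parameter is a hard failure.
fn gate_after(tokens: u64, params: u64) -> GateVerdict {
    if tokens < params.saturating_mul(20) {
        GateVerdict::Fail
    } else {
        GateVerdict::Pass
    }
}

fn main() {
    // Illustrative numbers only: 1_000 parameters at 15 tokens per parameter.
    let params = 1_000;
    let tokens = 15_000;
    assert_eq!(gate_before(tokens, params), GateVerdict::Warn);
    assert_eq!(gate_after(tokens, params), GateVerdict::Fail);
    println!(
        "before: {:?}, after: {:?}",
        gate_before(tokens, params),
        gate_after(tokens, params)
    );
}
```

A run in the former warn-only band is the case that changes: it used to proceed with a warning and is now rejected.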

noahgift enabled auto-merge (squash) May 17, 2026 11:00

noahgift added 3 commits May 17, 2026 13:00

Merge branch 'main' into feat/pmat-690-p2f-val-shard

6e9b208

Merge branch 'main' into feat/pmat-690-p2f-val-shard

ed8cf72

Merge branch 'main' into feat/pmat-690-p2f-val-shard

33dbe30

noahgift merged commit 78fbd45 into main May 17, 2026
10 checks passed

noahgift deleted the feat/pmat-690-p2f-val-shard branch May 17, 2026 12:59

noahgift mentioned this pull request May 17, 2026

docs(spec): §84 + §85 — P2-C/P2-E live findings + P0-K closure live-verified #1754

Merged

3 tasks

noahgift merged 4 commits into main from feat/pmat-690-p2f-val-shard
