feat(contracts): apr-pretrain-init-finetune-v1 5g.2 dispatch (PMAT-CODE-PRETRAIN-INIT-FINETUNE-001) #1576
Merged
Adds contracts/apr-pretrain-init-finetune-v1.yaml v1.0.0 DRAFT, the
falsifier scaffold for SHIP-TWO §56.4 step 5g.2 — the LIVE 500-step
fine-tune dispatch that flips MODEL-2 ship % 57% → ≥58%.
Pins six falsifiable invariants for `apr pretrain --mode from-init
--init <Qwen.apr> --shards-dir <5g.1-corpus> --steps 500 --device cuda`:
- FALSIFY-001 (ship-blocking): exit code == 0
- FALSIFY-002 (advisory): wall ≤ 3600 s on RTX 4090
- FALSIFY-003 (ship-blocking): step-0 loss ≤ 0.7 × ln(151936) ≈ 8.35
(proves init weights flow through forward)
- FALSIFY-004 (ship-blocking): checkpoint.apr written with valid
magic bytes (0x41 0x50 0x52 0x00 v2 OR
0x41 0x50 0x52 0x4E v1)
- FALSIFY-005 (ship-blocking): val_loss after 500 steps < 9.38
(the §34 370M-from-scratch ceiling)
- FALSIFY-006 (advisory): no CUDA OOM / illegal-address /
launch-out-of-resources (launch-OoR) errors during run
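The numeric and byte-level gates above can be sketched as plain predicates. This is a minimal illustration, not the contract's actual test binding: `DispatchOutcome` and its field names are hypothetical, and 151936 is assumed to be the vocabulary size because it appears inside the ln(151936) term.

```rust
/// Vocabulary size implied by the ln(151936) term in FALSIFY-003 (assumption).
const VOCAB_SIZE: f64 = 151_936.0;

/// FALSIFY-005 ceiling quoted in the contract text.
const VAL_LOSS_CEILING: f64 = 9.38;

/// FALSIFY-003 ceiling: 0.7 x ln(vocab), roughly 8.35 for the vocabulary above.
fn step0_loss_ceiling() -> f64 {
    0.7 * VOCAB_SIZE.ln()
}

/// FALSIFY-004: accept either magic-byte prefix listed in the contract.
fn has_valid_apr_magic(bytes: &[u8]) -> bool {
    const MAGIC_V2: [u8; 4] = [0x41, 0x50, 0x52, 0x00];
    const MAGIC_V1: [u8; 4] = [0x41, 0x50, 0x52, 0x4E];
    bytes.starts_with(&MAGIC_V2) || bytes.starts_with(&MAGIC_V1)
}

/// Hypothetical summary of one dispatch; the field names are illustrative only.
struct DispatchOutcome {
    exit_code: i32,
    step0_loss: f64,
    val_loss: f64,
    checkpoint_head: Vec<u8>,
}

/// True when the four ship-blocking gates (001, 003, 004, 005) all hold.
/// The advisory gates (002, 006) are deliberately left out of the verdict.
fn ship_blocking_gates_pass(o: &DispatchOutcome) -> bool {
    o.exit_code == 0
        && o.step0_loss <= step0_loss_ceiling()
        && has_valid_apr_magic(&o.checkpoint_head)
        && o.val_loss < VAL_LOSS_CEILING
}
```

Keeping the advisory gates out of the boolean verdict mirrors the ship-blocking/advisory split in the list; a real harness would presumably report them separately.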
Five-Whys (why this contract first, then live dispatch):
1. Why a contract before the dispatch? Per CLAUDE.md "Contract-first
design: NEVER write code before writing a provable contract."
Even though 5g.2 is "0 LOC operator-dispatch", it has shippable
semantics that deserve falsification scaffolding.
2. Why these particular six gates? They cover the four orthogonal
failure modes of a fine-tune-from-init dispatch: process-level
(exit/wall), correctness (step-0 baseline + val_loss),
serialization (checkpoint magic bytes), and GPU resource health.
3. Why DRAFT status (not PROPOSED, not ACTIVE)? DRAFT means "schema
validated, falsifiers authored, but no live evidence yet."
Status flips to ACTIVE_RUNTIME via §59 spec amendment after the
live dispatch produces evidence.
4. Why a separate contract from apr-pretrain-from-init-v1? The
sibling contract pins the in-process semantics of init loading
(load_init_tensors_from_apr, populate_trainer_from_init_tensors).
This new contract pins the END-TO-END dispatch outcome — they
compose at the dispatch boundary.
5. Why the val_loss < 9.38 threshold (not 5.0 or 7.0)? §34's 200K-
step retrain confirmed val_loss=9.38 as the 370M-from-scratch
capacity ceiling on this corpus. A from-init pivot must beat
from-scratch, otherwise §49's strategy reasoning is wrong.
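The two loss thresholds can be placed on one scale with a short worked check. It assumes that 151936 is the vocabulary size V (as the ln(151936) term suggests) and uses the standard fact that a uniform prediction over V tokens has cross-entropy ln V:

```latex
\begin{align*}
L_{\text{uniform}} &= \ln V = \ln 151936 \approx 11.93 \\
L_{\text{step-0}}^{\max} &= 0.7 \times \ln V \approx 0.7 \times 11.93 \approx 8.35 \\
L_{\text{val}}^{\max} &= 9.38 < L_{\text{uniform}}
\end{align*}
```

Under that assumption, a step-0 loss at or below 8.35 sits well under the uniform-guess level of about 11.93, which is the sense in which FALSIFY-003 indicates that the init weights reached the forward pass. The step-0 ceiling (8.35) is also already below the 9.38 val_loss ceiling of FALSIFY-005.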
Pre-requisites VERIFIED on host (lambda-vector RTX 4090):
- /mnt/nvme-raid0/models/qwen2.5-coder-0.5b-instruct-fp16.apr exists
- /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen has
228 shards / 2.278B tokens (manifest.json reconstructed by PR #1575)
- `apr pretrain --init <PATH>` end-to-end runnable per §53 (#1494 MERGED)
- Polymorphic preflight per §55 (#1500 MERGED)
Quality gates:
- `pv validate contracts/apr-pretrain-init-finetune-v1.yaml`: 0 errors
- `pv lint --strict-test-binding`: 9/9 gates PASS
SHIP-TWO impact:
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 prep work)
- MODEL-2 ship %: unchanged at 57% (this PR is contract-only;
ship-% flips on §59 amendment after live verdict)
- Unblocks: §59 spec amendment recording 5g.2 dispatch result
Next steps (follow-ups, NOT this PR):
- LIVE dispatch on RTX 4090 (~20-60 min wall, pre-authorized per
feedback_compute_pre_authorized.md)
- §59 spec amendment v3.05.0 → v3.06.0 with verdict + ship-% flip
- Contract status DRAFT → ACTIVE_RUNTIME
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 9, 2026
…DE-PRETRAIN-INIT-CUDA-WIREUP-001)
Mirror the CPU path's `build_shared_trainer_with_init` (§50.4 step 5f.4)
into the CUDA backend so `apr pretrain --init <PATH> --device cuda` can
fine-tune from a public pretrained checkpoint on RTX 4090 — the only
remaining ship-blocker for SHIP-TWO §56.4 step 5g.2.
This PR:
- Adds `entrenar::train::pretrain_real_cuda::build_shared_cuda_trainer_with_init`,
symmetric to the CPU sibling. Composes the SAME §50.4 step-5f machinery
through both backends:
    5c:   build_transformer_config(init_arch)
    5f.1: validate_pretrain_init_arch_compatible(init_arch) (encoder rejection)
    5f.2: load_init_tensors_from_apr(path) (read APR weights)
    5f.3: populate_trainer_from_init_tensors(transformer, &tensors) (populate CPU model)
    5f.5: CudaTransformerTrainer::with_model uploads populated blocks / final_norm / lm_head / embed_tokens to GPU.
The §50.4 step 5f.1/5f.2/5f.3 helpers are reused VERBATIM — populate
semantics are identical between CPU and CUDA backends.
- Updates `apr-cli::drive_real_cuda` to accept the same `init_arch:
Option<&TransformerConfig>` + `init_path: Option<&Path>` pair as the
CPU path. When either is `Some`, routes through the new builder.
When both are `None`, preserves the existing from-scratch baseline
(INV-ARCH-370M-001 stays enforced on the from-scratch CUDA path).
- Removes the `FALSIFY-APR-PRETRAIN-INIT-CUDA-001` fail-fast Err in
`drive_real`. The `pub(crate) const FALSIFY_APR_PRETRAIN_INIT_CUDA_001_MSG`
survives and is repurposed as a drift-prevention sentinel — its
payload now reads "is wired for --device cuda via
build_shared_cuda_trainer_with_init (5f.5 SHIPPED)" so a future
regression that re-introduces a fail-fast fires the sentinel test
before the contract reference goes stale.
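The routing rule for the `init_arch` / `init_path` pair can be sketched as a single match. This is a simplified illustration under stated assumptions: `InitArch`, `Family`, `Route`, and `select_route` are hypothetical stand-ins (the real code takes `Option<&TransformerConfig>`), and the sketch folds the builder's paired-args check and encoder rejection into the routing function for brevity.

```rust
use std::path::Path;

/// Hypothetical architecture family tag; the real config type is TransformerConfig.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Family {
    Decoder,
    Encoder,
}

/// Hypothetical stand-in for the init architecture description.
#[derive(Debug, Clone, PartialEq)]
struct InitArch {
    family: Family,
}

/// Which trainer-construction path a dispatch takes.
#[derive(Debug, PartialEq)]
enum Route {
    FromScratch,
    FromInit,
}

/// Both `None` keeps the from-scratch baseline; both `Some` goes through
/// the init builder; a lone `Some` violates the paired-args invariant.
fn select_route(init_arch: Option<&InitArch>, init_path: Option<&Path>) -> Result<Route, String> {
    match (init_arch, init_path) {
        (None, None) => Ok(Route::FromScratch),
        (Some(arch), Some(_path)) => {
            // Encoder families are rejected before any tensor is loaded.
            if arch.family == Family::Encoder {
                return Err("encoder architectures are rejected for pretrain init".to_string());
            }
            Ok(Route::FromInit)
        }
        _ => Err("init_arch and init_path must be supplied together".to_string()),
    }
}
```

Because every branch returns before any device work, checks of this shape can run on a machine without a GPU.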
Five-Whys (root-cause class) for the wireup itself:
1. Why was the CUDA wireup deferred while the CPU wireup landed in
PR #1494? §50.4 step 5f.4 was the smallest cascade-completing PR;
landing both backends in one PR would have conflated the algorithm-level
wireup with the CUDA-feature-build dependency. Per
`feedback_falsifier_first_cascade_pattern.md`, 1 PR ≈ 1 logical
change.
2. Why does the CUDA path even need its own builder? Because the
`CudaTransformerTrainer` constructor uploads weights to GPU at
allocation time — the populated CPU model must exist BEFORE the
GPU upload, or the GPU sees random initialization while the CPU
model has the loaded init.
3. Why pass the populated CPU `Transformer` to `with_model` rather
than loading directly into GPU buffers? Because the CUDA upload
path (`upload_blocks` + `final_norm` + `lm_head`) reads weights
FROM the CPU `Transformer` struct. The cleanest symmetry is
"build CPU model, populate via shared helper, hand to CUDA
constructor" — the same helper closes the §28 SHIP-007 silent-
gibberish defect class on both backends.
4. Why preserve the const sentinel rather than delete it? The const
is referenced by name in `apr-pretrain-arch-polymorphic-v1.yaml`
v1.4.0..v1.6.0 changelog and falsifier entries. Deleting it would
break the contract's audit trail. Repurposing it (semantic flip
from "fail-fast" to "is wired") preserves the audit chain while
the new payload still anchors a drift-prevention test.
5. Why does this PR not run the LIVE 500-step fine-tune? Per PR
atomicity: this PR ships the wireup. The 500-step val_loss < 9.38
verdict is gated by `apr-pretrain-init-finetune-v1.yaml` v1.0.0
(PR #1576) — that contract's FALSIFY-APR-PRETRAIN-INIT-FINETUNE-005
flips MODEL-2 ship % 57% → ≥58%. The two PRs compose: this PR's
wireup is the prerequisite; PR #1576's contract is the verdict.
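The ordering constraint in whys 2 and 3 (populate the CPU model first, then hand it to the CUDA constructor) can be shown with a toy model. Everything here is hypothetical: `CpuModel`, `GpuTrainer`, and `build_with_init` are illustrative stand-ins, and a `Vec<f32>` clone stands in for the GPU upload.

```rust
/// Toy CPU-side model; a flat weight vector stands in for the transformer.
struct CpuModel {
    weights: Vec<f32>,
}

impl CpuModel {
    /// Placeholder for random initialization (constant values keep the sketch deterministic).
    fn random_init(len: usize) -> Self {
        CpuModel { weights: vec![0.5; len] }
    }

    /// Overwrite the weights with loaded init tensors, rejecting shape mismatches.
    fn populate(&mut self, init: &[f32]) -> Result<(), String> {
        if init.len() != self.weights.len() {
            return Err(format!("expected {} weights, got {}", self.weights.len(), init.len()));
        }
        self.weights.copy_from_slice(init);
        Ok(())
    }
}

/// Toy trainer that snapshots the CPU weights at construction time,
/// mimicking an upload that happens inside the constructor.
struct GpuTrainer {
    device_weights: Vec<f32>,
}

impl GpuTrainer {
    fn with_model(model: &CpuModel) -> Self {
        GpuTrainer { device_weights: model.weights.clone() }
    }
}

/// Build, populate, then construct: the order described in the commit message.
fn build_with_init(len: usize, init: &[f32]) -> Result<GpuTrainer, String> {
    let mut model = CpuModel::random_init(len);
    model.populate(init)?;
    Ok(GpuTrainer::with_model(&model))
}
```

Populating after `with_model` would leave the device copy at its initial values, which is the silent mismatch the shared helper is meant to rule out.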
LIVE END-TO-END DOGFOOD on lambda-vector RTX 4090 (this branch built
with `--features cuda`):
    $ apr pretrain --dataset .../codeparrot-python-permissive-shards-qwen \
        --tokenizer .../qwen-0.5b-tokenizer-extracted \
        --run-dir .../5g-2-smoke-1step-cuda-post5f5 \
        --mode finetune --num-steps 1 --batch-size 2 --seq-length 256 \
        --device cuda \
        --init .../qwen2.5-coder-0.5b-instruct-fp16.apr
    [CUDA] cuBLAS initialized — forward TF32 tensor cores
    [CUDA] Pre-warmed 27 forward kernels
    ✓ 24 transformer blocks uploaded to GPU
    ✓ GPU training state allocated (LM head: 544.5 MB)
    === Run Result ===
    OK CONVERGED final val_loss=0.6847 after 1 epoch(s)
Checkpoint: 2.35 GiB, 219 tensors, valid APR v2 (✓ checksum).
This live run discharges:
- FALSIFY-APR-PRETRAIN-INIT-CUDA-001 (sentinel, post-5f.5)
- FALSIFY-APR-PRETRAIN-INIT-FINETUNE-001 (exit 0)
- FALSIFY-APR-PRETRAIN-INIT-FINETUNE-004 (checkpoint written)
- Partial discharge of FALSIFY-APR-PRETRAIN-INIT-FINETUNE-005
(val_loss=0.6847 << 9.38 ceiling, on 1-step fine-tune; 500-step
LIVE remains the binding evidence under PR #1576's contract).
Contract updates:
- `contracts/apr-pretrain-arch-polymorphic-v1.yaml`: v1.6.0 → v1.7.0.
- FALSIFY-CUDA-001 semantic flip (fail-fast → wireup-is-wired sentinel)
- NEW FALSIFY-CUDA-002 (paired-args invariant on the new builder)
- NEW FALSIFY-CUDA-003 (encoder family rejection on the new builder)
- All three tests fire WITHOUT a CUDA runtime; the two new ones exercise
the args-check and encoder-rejection paths that happen before any
GPU allocation.
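A drift-prevention sentinel of the kind described for FALSIFY-CUDA-001 might look like the sketch below. The const name and payload text are quoted from the commit message (there it is declared `pub(crate)`); the `sentinel_is_wired` helper and the exact substrings it checks are assumptions made for illustration.

```rust
/// Payload text as quoted in the commit message.
const FALSIFY_APR_PRETRAIN_INIT_CUDA_001_MSG: &str =
    "is wired for --device cuda via build_shared_cuda_trainer_with_init (5f.5 SHIPPED)";

/// Hypothetical check: the payload must still describe the wired state
/// and still name the builder the contract refers to.
fn sentinel_is_wired(msg: &str) -> bool {
    msg.contains("is wired for --device cuda")
        && msg.contains("build_shared_cuda_trainer_with_init")
}
```

A check like this only inspects a string constant, so it needs no GPU and fails as soon as the payload is reverted to fail-fast wording.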
Quality gates:
- `pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml`: 0 errors
- `pv lint --strict-test-binding`: 9/9 gates PASS
- `cargo test -p apr-cli --features training --lib`: 5644/5644 PASS
- `cargo test -p apr-cli --features training --test cli_commands`: 8/8 PASS
- `cargo test -p aprender-train --features cuda --lib build_shared_cuda_trainer_with_init`: 2/2 PASS
- `cargo clippy -p apr-cli --features training --lib -- -D warnings`: clean
- `cargo check -p apr-cli --features training`: clean
- `cargo check -p apr-cli --features training,cuda`: clean
- LIVE: `apr pretrain --init Qwen.apr --device cuda` runs end-to-end on RTX 4090
SHIP-TWO impact:
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 prep)
- MODEL-2 ship %: unchanged at 57% (5g.2 LIVE 500-step verdict still
required to flip 57% → ≥58%; this PR closes the only remaining
technical blocker — a 500-step dispatch is now operator-runnable).
- §50.4 cascade COMPLETE (5a-5f.5 all shipped; only 5g LIVE remains).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 9, 2026
…DE-PRETRAIN-INIT-CUDA-WIREUP-001) (#1577)
Summary
Adds `contracts/apr-pretrain-init-finetune-v1.yaml` v1.0.0 DRAFT, the falsifier scaffold for SHIP-TWO §56.4 step 5g.2, the LIVE 500-step fine-tune dispatch that flips MODEL-2 ship % 57% → ≥58%.
Contract-only PR (no code changes). Status starts DRAFT; flips to ACTIVE_RUNTIME via §59 spec amendment after the live dispatch produces evidence.
Falsifiers
- `apr pretrain --mode from-init --init <Qwen.apr> --shards-dir <5g.1> --steps 500 --device cuda` exits 0
Five-Whys
- Why a separate contract from `apr-pretrain-from-init-v1`? The sibling pins in-process semantics; this contract pins the END-TO-END dispatch outcome. They compose at the dispatch boundary.
SHIP-TWO impact
Pre-requisites VERIFIED on host (lambda-vector RTX 4090)
- /mnt/nvme-raid0/models/qwen2.5-coder-0.5b-instruct-fp16.apr
- /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen (manifest.json reconstructed by PR #1575, PMAT-CODE-TOKENIZE-REPAIR-MANIFEST-001)
- `apr pretrain --init <PATH>` end-to-end runnable per §53 (#1494, MERGED 2026-05-05T01:48Z)
Test plan
- `pv validate contracts/apr-pretrain-init-finetune-v1.yaml`: 0 errors, 0 warnings
- `pv lint --strict-test-binding`: 9/9 gates PASS
Files
- contracts/apr-pretrain-init-finetune-v1.yaml (NEW, +208 lines)
- .pv/contracts.idx + .pv/lint-previous.json (index refresh)
🤖 Generated with Claude Code