feat(contracts): apr-pretrain-init-finetune-v1 5g.2 dispatch (PMAT-CODE-PRETRAIN-INIT-FINETUNE-001)#1576

Merged
noahgift merged 1 commit into main from feat/apr-pretrain-init-finetune-contract on May 9, 2026

noahgift (Contributor) commented on May 8, 2026

Summary

Adds contracts/apr-pretrain-init-finetune-v1.yaml v1.0.0 DRAFT — the falsifier scaffold for SHIP-TWO §56.4 step 5g.2, the LIVE 500-step fine-tune dispatch that flips MODEL-2 ship % 57% → ≥58%.

Contract-only PR (no code changes). Status starts DRAFT; flips to ACTIVE_RUNTIME via §59 spec amendment after the live dispatch produces evidence.

Falsifiers

| ID | Type | Rule |
|-----|---------------|------|
| 001 | ship-blocking | `apr pretrain --mode from-init --init <Qwen.apr> --shards-dir <5g.1> --steps 500 --device cuda` exits 0 |
| 002 | advisory | wall ≤ 3600 s on RTX 4090 |
| 003 | ship-blocking | step-0 loss ≤ 0.7 × ln(151936) ≈ 8.35 (proves init weights flow through forward) |
| 004 | ship-blocking | checkpoint.apr written with magic bytes 0x41 0x50 0x52 0x00 (v2) or 0x41 0x50 0x52 0x4E (v1) |
| 005 | ship-blocking | val_loss after 500 steps < 9.38 (the §34 370M-from-scratch ceiling); THIS GATE FLIPS SHIP % |
| 006 | advisory | no CUDA OOM / illegal-address / launch-OoR during run |
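
The FALSIFY-003 bound can be checked by hand: a model that guesses uniformly over a 151936-token vocabulary has cross-entropy ln(151936) ≈ 11.93 nats, so a step-0 loss at or below 0.7 of that value suggests the loaded weights are actually reaching the forward pass. The following is a minimal Rust sketch of that arithmetic; the function and constant names are illustrative assumptions and are not taken from the contract or the `apr` codebase.

```rust
/// Vocabulary size used in the FALSIFY-003 bound (taken from the falsifier text).
const VOCAB_SIZE: f64 = 151_936.0;
/// Fraction of the uniform-guess loss allowed at step 0 (taken from the falsifier text).
const STEP0_FRACTION: f64 = 0.7;

/// Cross-entropy, in nats, of a uniform guess over `vocab` tokens.
fn uniform_loss(vocab: f64) -> f64 {
    vocab.ln()
}

/// The FALSIFY-003 threshold: 0.7 x ln(vocab).
fn step0_threshold() -> f64 {
    STEP0_FRACTION * uniform_loss(VOCAB_SIZE)
}

/// Hypothetical helper: true when a step-0 loss satisfies the gate.
/// A NaN or infinite loss never passes.
fn step0_gate_passes(step0_loss: f64) -> bool {
    step0_loss.is_finite() && step0_loss <= step0_threshold()
}
```

Under these assumptions `step0_threshold()` evaluates to roughly 8.35, matching the figure quoted in the table.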

Five-Whys

  1. Why a contract before the dispatch? Per CLAUDE.md "Contract-first design: NEVER write code before writing a provable contract." Even 0-LOC operator dispatches deserve falsification scaffolding.
  2. Why these six gates? They cover the four orthogonal failure modes: process-level (exit/wall), correctness (step-0 baseline + val_loss), serialization (checkpoint magic bytes), and GPU resource health (no CUDA OOM / illegal-address / launch-OoR).
  3. Why DRAFT status? Means "schema validated, falsifiers authored, no live evidence yet." Flips to ACTIVE_RUNTIME on §59 amendment after live dispatch.
  4. Why separate from apr-pretrain-from-init-v1? The sibling pins in-process semantics; this contract pins the END-TO-END dispatch outcome — they compose at the dispatch boundary.
  5. Why val_loss < 9.38? §34's 200K-step retrain confirmed it as the 370M-from-scratch capacity ceiling on this corpus. A from-init pivot must beat from-scratch, otherwise §49's strategy reasoning is wrong.

SHIP-TWO impact

  • MODEL-1 ship %: unchanged at 91% (this is MODEL-2 prep work)
  • MODEL-2 ship %: unchanged at 57% (this PR is contract-only; ship-% flips on §59 after live verdict)
  • Unblocks: §59 amendment recording the 5g.2 dispatch outcome

Pre-requisites VERIFIED on host (lambda-vector RTX 4090)

Test plan

  • pv validate contracts/apr-pretrain-init-finetune-v1.yaml — 0 errors, 0 warnings
  • pv lint --strict-test-binding — 9/9 gates PASS

Files

  • contracts/apr-pretrain-init-finetune-v1.yaml (NEW, +208 lines)
  • .pv/contracts.idx + .pv/lint-previous.json (index refresh)

🤖 Generated with Claude Code

noahgift enabled auto-merge (squash) on May 8, 2026 21:06
… (PMAT-CODE-PRETRAIN-INIT-FINETUNE-001)

Adds contracts/apr-pretrain-init-finetune-v1.yaml v1.0.0 DRAFT, the
falsifier scaffold for SHIP-TWO §56.4 step 5g.2 — the LIVE 500-step
fine-tune dispatch that flips MODEL-2 ship % 57% → ≥58%.

Pins six falsifiable invariants for `apr pretrain --mode from-init
--init <Qwen.apr> --shards-dir <5g.1-corpus> --steps 500 --device cuda`:

- FALSIFY-001 (ship-blocking): exit code == 0
- FALSIFY-002 (advisory):     wall ≤ 3600 s on RTX 4090
- FALSIFY-003 (ship-blocking): step-0 loss ≤ 0.7 × ln(151936) ≈ 8.35
                               (proves init weights flow through forward)
- FALSIFY-004 (ship-blocking): checkpoint.apr written with valid
                               magic bytes (0x41 0x50 0x52 0x00 v2 OR
                               0x41 0x50 0x52 0x4E v1)
- FALSIFY-005 (ship-blocking): val_loss after 500 steps < 9.38
                               (the §34 370M-from-scratch ceiling)
- FALSIFY-006 (advisory):     no CUDA OOM / illegal-address / launch-
                               OoR errors during run
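
As an illustration of FALSIFY-004, the two accepted magic sequences spell `APR\0` (v2) and `APRN` (v1) in ASCII. The sketch below shows one way a check on the first four bytes of a checkpoint could look; the type and function names are assumptions made for this example and are not the repository's API.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// APR format generations named in FALSIFY-004.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AprVersion {
    V1,
    V2,
}

const APR_V2_MAGIC: [u8; 4] = [0x41, 0x50, 0x52, 0x00]; // "APR\0"
const APR_V1_MAGIC: [u8; 4] = [0x41, 0x50, 0x52, 0x4E]; // "APRN"

/// Classify a header by its leading four bytes; `None` means the gate fails.
fn classify_magic(header: &[u8]) -> Option<AprVersion> {
    if header.starts_with(&APR_V2_MAGIC) {
        Some(AprVersion::V2)
    } else if header.starts_with(&APR_V1_MAGIC) {
        Some(AprVersion::V1)
    } else {
        None
    }
}

/// Hypothetical helper: read the first four bytes of a checkpoint file.
/// A file shorter than four bytes is reported as "no valid magic".
fn read_checkpoint_version(path: &Path) -> io::Result<Option<AprVersion>> {
    let mut header = [0u8; 4];
    let mut file = File::open(path)?;
    match file.read_exact(&mut header) {
        Ok(()) => Ok(classify_magic(&header)),
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(None),
        Err(e) => Err(e),
    }
}
```

This only inspects the magic bytes; it says nothing about checksums or tensor contents, which the falsifier as quoted does not cover either.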

Five-Whys (why this contract first, then live dispatch):

1. Why a contract before the dispatch? Per CLAUDE.md "Contract-first
   design: NEVER write code before writing a provable contract."
   Even though 5g.2 is "0 LOC operator-dispatch", it has shippable
   semantics that deserve falsification scaffolding.
2. Why these particular six gates? They cover the four orthogonal
   failure modes of a fine-tune-from-init dispatch: process-level
   (exit/wall), correctness (step-0 baseline + val_loss),
   serialization (checkpoint magic bytes), and GPU resource health
   (no CUDA OOM / illegal-address / launch-OoR).
3. Why DRAFT status (not PROPOSED, not ACTIVE)? DRAFT means "schema
   validated, falsifiers authored, but no live evidence yet."
   Status flips to ACTIVE_RUNTIME via §59 spec amendment after the
   live dispatch produces evidence.
4. Why a separate contract from apr-pretrain-from-init-v1? The
   sibling contract pins the in-process semantics of init loading
   (load_init_tensors_from_apr, populate_trainer_from_init_tensors).
   This new contract pins the END-TO-END dispatch outcome — they
   compose at the dispatch boundary.
5. Why the val_loss < 9.38 threshold (not 5.0 or 7.0)? §34's 200K-
   step retrain confirmed val_loss=9.38 as the 370M-from-scratch
   capacity ceiling on this corpus. A from-init pivot must beat
   from-scratch, otherwise §49's strategy reasoning is wrong.
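
To make the ship-blocking versus advisory split concrete, here is a hedged Rust sketch of how the six gates could be folded into a single verdict. The `RunEvidence` record, its field names, and the `Verdict` type are hypothetical and exist only for this example; the thresholds are the ones quoted in the falsifier list.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    ShipBlocking,
    Advisory,
}

/// Hypothetical evidence record for one dispatch; field names are illustrative.
#[derive(Debug, Clone)]
struct RunEvidence {
    exit_code: i32,
    wall_seconds: f64,
    step0_loss: f64,
    checkpoint_magic_ok: bool,
    val_loss_500: f64,
    cuda_error_count: u32,
}

#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Pass,
    PassWithAdvisories(Vec<&'static str>),
    Fail(Vec<&'static str>),
}

/// Evaluate FALSIFY-001..006. Any ship-blocking failure yields `Fail`;
/// advisory failures alone only annotate a passing verdict.
fn evaluate(e: &RunEvidence) -> Verdict {
    let step0_max = 0.7 * 151_936.0_f64.ln();
    let gates = [
        ("001", Severity::ShipBlocking, e.exit_code == 0),
        ("002", Severity::Advisory, e.wall_seconds <= 3600.0),
        ("003", Severity::ShipBlocking, e.step0_loss <= step0_max),
        ("004", Severity::ShipBlocking, e.checkpoint_magic_ok),
        ("005", Severity::ShipBlocking, e.val_loss_500 < 9.38),
        ("006", Severity::Advisory, e.cuda_error_count == 0),
    ];

    // Collect the IDs of failed gates at a given severity.
    let failed = |sev: Severity| -> Vec<&'static str> {
        gates
            .iter()
            .filter(|(_, s, ok)| *s == sev && !*ok)
            .map(|(id, _, _)| *id)
            .collect()
    };

    let blocking = failed(Severity::ShipBlocking);
    if !blocking.is_empty() {
        return Verdict::Fail(blocking);
    }
    let advisory = failed(Severity::Advisory);
    if advisory.is_empty() {
        Verdict::Pass
    } else {
        Verdict::PassWithAdvisories(advisory)
    }
}
```

Note that FALSIFY-005 is a strict inequality, so a run that lands exactly on 9.38 fails under this reading.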

Pre-requisites VERIFIED on host (lambda-vector RTX 4090):
- /mnt/nvme-raid0/models/qwen2.5-coder-0.5b-instruct-fp16.apr exists
- /mnt/nvme-raid0/data/codeparrot-python-permissive-shards-qwen has
  228 shards / 2.278B tokens (manifest.json reconstructed by PR #1575)
- `apr pretrain --init <PATH>` end-to-end runnable per §53 (#1494 MERGED)
- Polymorphic preflight per §55 (#1500 MERGED)

Quality gates:
- `pv validate contracts/apr-pretrain-init-finetune-v1.yaml`: 0 errors
- `pv lint --strict-test-binding`: 9/9 gates PASS

SHIP-TWO impact:
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 prep work)
- MODEL-2 ship %: unchanged at 57% (this PR is contract-only;
  ship-% flips on §59 amendment after live verdict)
- Unblocks: §59 spec amendment recording 5g.2 dispatch result

Next steps (follow-ups, NOT this PR):
- LIVE dispatch on RTX 4090 (~20-60 min wall, pre-authorized per
  feedback_compute_pre_authorized.md)
- §59 spec amendment v3.05.0 → v3.06.0 with verdict + ship-% flip
- Contract status DRAFT → ACTIVE_RUNTIME

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift force-pushed the feat/apr-pretrain-init-finetune-contract branch from 79623ea to 233b622 on May 9, 2026 05:02
noahgift merged commit 5fde3de into main on May 9, 2026 (10 checks passed)
noahgift deleted the feat/apr-pretrain-init-finetune-contract branch on May 9, 2026 05:19
noahgift added a commit that referenced this pull request May 9, 2026
…DE-PRETRAIN-INIT-CUDA-WIREUP-001)

Mirror the CPU path's `build_shared_trainer_with_init` (§50.4 step 5f.4)
into the CUDA backend so `apr pretrain --init <PATH> --device cuda` can
fine-tune from a public pretrained checkpoint on RTX 4090 — the only
remaining ship-blocker for SHIP-TWO §56.4 step 5g.2.

This PR:

- Adds `entrenar::train::pretrain_real_cuda::build_shared_cuda_trainer_with_init`,
  symmetric to the CPU sibling. Composes the SAME §50.4 step-5f machinery
  through both backends:
    5c:   build_transformer_config(init_arch)
    5f.1: validate_pretrain_init_arch_compatible(init_arch) — encoder rejection
    5f.2: load_init_tensors_from_apr(path) — read APR weights
    5f.3: populate_trainer_from_init_tensors(transformer, &tensors) — populate CPU model
    5f.5: CudaTransformerTrainer::with_model uploads populated blocks
          / final_norm / lm_head / embed_tokens to GPU.
  The §50.4 step 5f.1/5f.2/5f.3 helpers are reused VERBATIM — populate
  semantics are identical between CPU and CUDA backends.

- Updates `apr-cli::drive_real_cuda` to accept the same `init_arch:
  Option<&TransformerConfig>` + `init_path: Option<&Path>` pair as the
  CPU path. When either is `Some`, routes through the new builder.
  When both are `None`, preserves the existing from-scratch baseline
  (INV-ARCH-370M-001 stays enforced on the from-scratch CUDA path).

- Removes the `FALSIFY-APR-PRETRAIN-INIT-CUDA-001` fail-fast Err in
  `drive_real`. The `pub(crate) const FALSIFY_APR_PRETRAIN_INIT_CUDA_001_MSG`
  survives and is repurposed as a drift-prevention sentinel — its
  payload now reads "is wired for --device cuda via
  build_shared_cuda_trainer_with_init (5f.5 SHIPPED)" so a future
  regression that re-introduces a fail-fast fires the sentinel test
  before the contract reference goes stale.
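
The routing rule for the `init_arch` / `init_path` pair can be sketched as a single `match`. This is an illustration only: `TransformerConfig` here is a stand-in struct, `TrainerRoute` and `RouteError` are hypothetical names, and treating a half-specified pair as an error is an assumed reading of the "paired-args invariant", not a confirmed description of `build_shared_cuda_trainer_with_init`.

```rust
use std::path::Path;

/// Stand-in for the real `TransformerConfig`; present only so the sketch compiles.
#[derive(Debug, Clone, PartialEq)]
struct TransformerConfig {
    num_layers: usize,
}

/// Which construction path the driver should take (hypothetical type).
#[derive(Debug, PartialEq)]
enum TrainerRoute<'a> {
    /// Both arguments absent: keep the existing from-scratch baseline.
    FromScratch,
    /// Both arguments present: build the trainer from the init checkpoint.
    FromInit {
        arch: &'a TransformerConfig,
        path: &'a Path,
    },
}

#[derive(Debug, PartialEq)]
enum RouteError {
    /// Exactly one of the two arguments was supplied (assumed to be rejected).
    UnpairedInitArgs,
}

fn route_cuda_trainer<'a>(
    init_arch: Option<&'a TransformerConfig>,
    init_path: Option<&'a Path>,
) -> Result<TrainerRoute<'a>, RouteError> {
    match (init_arch, init_path) {
        (None, None) => Ok(TrainerRoute::FromScratch),
        (Some(arch), Some(path)) => Ok(TrainerRoute::FromInit { arch, path }),
        _ => Err(RouteError::UnpairedInitArgs),
    }
}
```

Because the check happens before any device work, a test of the unpaired case needs no GPU.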

Five-Whys (root-cause class) for the wireup itself:

1. Why was the CUDA wireup deferred while the CPU wireup landed in
   PR #1494? §50.4 step 5f.4 was the smallest cascade-completing PR;
   landing both backends in one PR conflated the algorithm-level
   wireup with the CUDA-feature-build dependency. Per
   `feedback_falsifier_first_cascade_pattern.md`, 1 PR ≈ 1 logical
   change.
2. Why does the CUDA path even need its own builder? Because the
   `CudaTransformerTrainer` constructor uploads weights to GPU at
   allocation time — the populated CPU model must exist BEFORE the
   GPU upload, or the GPU sees random initialization while the CPU
   model has the loaded init.
3. Why pass the populated CPU `Transformer` to `with_model` rather
   than loading directly into GPU buffers? Because the CUDA upload
   path (`upload_blocks` + `final_norm` + `lm_head`) reads weights
   FROM the CPU `Transformer` struct. The cleanest symmetry is
   "build CPU model, populate via shared helper, hand to CUDA
   constructor" — the same helper closes the §28 SHIP-007 silent-
   gibberish defect class on both backends.
4. Why preserve the const sentinel rather than delete it? The const
   is referenced by name in `apr-pretrain-arch-polymorphic-v1.yaml`
   v1.4.0..v1.6.0 changelog and falsifier entries. Deleting it would
   break the contract's audit trail. Repurposing it (semantic flip
   from "fail-fast" to "is wired") preserves the audit chain while
   the new payload still anchors a drift-prevention test.
5. Why does this PR not run the LIVE 500-step fine-tune? Per PR
   atomicity: this PR ships the wireup. The 500-step val_loss < 9.38
   verdict is gated by `apr-pretrain-init-finetune-v1.yaml` v1.0.0
   (PR #1576) — that contract's FALSIFY-APR-PRETRAIN-INIT-FINETUNE-005
   flips MODEL-2 ship % 57% → ≥58%. The two PRs compose: this PR's
   wireup is the prerequisite; PR #1576's contract is the verdict.
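
The ordering constraint from points 2 and 3 (populate the CPU model first, then hand it to the CUDA constructor) can be shown with a toy model. Everything in this sketch is a stand-in: `CpuModel` and `DeviceTrainer` are hypothetical types, and a `Vec<f32>` clone plays the role of the GPU upload.

```rust
/// Stand-in for the CPU `Transformer`; one weight vector is enough for the sketch.
#[derive(Debug, Clone, PartialEq)]
struct CpuModel {
    weights: Vec<f32>,
}

impl CpuModel {
    /// Placeholder for random initialization; a constant keeps the sketch deterministic.
    fn new_uninitialized(len: usize) -> Self {
        Self { weights: vec![0.5; len] }
    }

    /// Plays the role of the shared populate helper: overwrite weights from init tensors.
    fn populate(&mut self, init: &[f32]) -> Result<(), String> {
        if init.len() != self.weights.len() {
            return Err(format!(
                "shape mismatch: model has {} weights, init has {}",
                self.weights.len(),
                init.len()
            ));
        }
        self.weights.copy_from_slice(init);
        Ok(())
    }
}

/// Stand-in for the CUDA trainer: it snapshots the CPU weights at construction,
/// so later changes to the CPU model are not seen on the "device".
struct DeviceTrainer {
    device_weights: Vec<f32>,
}

impl DeviceTrainer {
    fn with_model(model: &CpuModel) -> Self {
        Self { device_weights: model.weights.clone() }
    }
}

/// Correct order: build the CPU model, populate it, then construct the device trainer.
fn build_with_init(len: usize, init: &[f32]) -> Result<DeviceTrainer, String> {
    let mut model = CpuModel::new_uninitialized(len);
    model.populate(init)?;
    Ok(DeviceTrainer::with_model(&model))
}
```

If the two calls were swapped, the device copy would keep the placeholder weights while the CPU model held the loaded init, which is the divergence point 2 describes.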

LIVE END-TO-END DOGFOOD on lambda-vector RTX 4090 (this branch built
with `--features cuda`):

  $ apr pretrain --dataset .../codeparrot-python-permissive-shards-qwen \
        --tokenizer .../qwen-0.5b-tokenizer-extracted \
        --run-dir .../5g-2-smoke-1step-cuda-post5f5 \
        --mode finetune --num-steps 1 --batch-size 2 --seq-length 256 \
        --device cuda \
        --init .../qwen2.5-coder-0.5b-instruct-fp16.apr

  [CUDA] cuBLAS initialized — forward TF32 tensor cores
  [CUDA] Pre-warmed 27 forward kernels
  ✓ 24 transformer blocks uploaded to GPU
  ✓ GPU training state allocated (LM head: 544.5 MB)
  === Run Result ===
    OK CONVERGED  final val_loss=0.6847 after 1 epoch(s)

  Checkpoint: 2.35 GiB, 219 tensors, valid APR v2 (✓ checksum).

This live run discharges:
  - FALSIFY-APR-PRETRAIN-INIT-CUDA-001 (sentinel, post-5f.5)
  - FALSIFY-APR-PRETRAIN-INIT-FINETUNE-001 (exit 0)
  - FALSIFY-APR-PRETRAIN-INIT-FINETUNE-004 (checkpoint written)
  - Partial discharge of FALSIFY-APR-PRETRAIN-INIT-FINETUNE-005
    (val_loss=0.6847 << 9.38 ceiling, on 1-step fine-tune; 500-step
    LIVE remains the binding evidence under PR #1576's contract).

Contract updates:

- `contracts/apr-pretrain-arch-polymorphic-v1.yaml`: v1.6.0 → v1.7.0.
  - FALSIFY-CUDA-001 semantic flip (fail-fast → wireup-is-wired sentinel)
  - NEW FALSIFY-CUDA-002 (paired-args invariant on the new builder)
  - NEW FALSIFY-CUDA-003 (encoder family rejection on the new builder)
  - All three tests (the repurposed 001 sentinel plus the new 002 and
    003) fire WITHOUT a CUDA runtime: they exercise the args-check and
    encoder-rejection paths that run before any GPU allocation.
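
A drift-prevention check on the repurposed sentinel could look roughly like the sketch below. The constant here holds only the payload fragment quoted earlier in this message; the real constant may contain more text, and `sentinel_reports_wired` is a hypothetical helper, not a function in the repository.

```rust
/// Local copy of the quoted payload fragment, used only for illustration.
const FALSIFY_APR_PRETRAIN_INIT_CUDA_001_MSG: &str =
    "is wired for --device cuda via build_shared_cuda_trainer_with_init (5f.5 SHIPPED)";

/// Hypothetical helper: true when the sentinel still describes the wired state
/// and names the builder it depends on.
fn sentinel_reports_wired(msg: &str) -> bool {
    msg.contains("is wired for --device cuda")
        && msg.contains("build_shared_cuda_trainer_with_init")
}
```

A test asserting `sentinel_reports_wired(FALSIFY_APR_PRETRAIN_INIT_CUDA_001_MSG)` is pure string matching, so it runs on a machine with no GPU.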

Quality gates:
- `pv validate contracts/apr-pretrain-arch-polymorphic-v1.yaml`: 0 errors
- `pv lint --strict-test-binding`: 9/9 gates PASS
- `cargo test -p apr-cli --features training --lib`: 5644/5644 PASS
- `cargo test -p apr-cli --features training --test cli_commands`: 8/8 PASS
- `cargo test -p aprender-train --features cuda --lib build_shared_cuda_trainer_with_init`: 2/2 PASS
- `cargo clippy -p apr-cli --features training --lib -- -D warnings`: clean
- `cargo check -p apr-cli --features training`: clean
- `cargo check -p apr-cli --features training,cuda`: clean
- LIVE: `apr pretrain --init Qwen.apr --device cuda` runs end-to-end on RTX 4090

SHIP-TWO impact:
- MODEL-1 ship %: unchanged at 91% (this is MODEL-2 prep)
- MODEL-2 ship %: unchanged at 57% (5g.2 LIVE 500-step verdict still
  required to flip 57% → ≥58%; this PR closes the only remaining
  technical blocker — a 500-step dispatch is now operator-runnable).
- §50.4 cascade COMPLETE (5a-5f.5 all shipped; only 5g LIVE remains).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 9, 2026
…DE-PRETRAIN-INIT-CUDA-WIREUP-001) (#1577)