feat(falsify-ship-013-014): MODEL-2 bundle completes 12/12 PARTIAL coverage by noahgift · Pull Request #1036 · paiml/aprender

noahgift · 2026-04-23T21:13:55Z

Summary

Bundled PARTIAL_ALGORITHM_LEVEL discharge of the last two untouched MODEL-2 AC rows:

FALSIFY-SHIP-013 / AC-SHIP2-003: entrenar pretraining loop reaches target loss (CE ≤ 2.2 on val) — pure f32-threshold verdict fn verdict_from_val_ce_loss(f32) -> Ship013Verdict bound to const AC_SHIP2_003_MAX_VAL_CROSS_ENTROPY_LOSS = 2.2.
FALSIFY-SHIP-014 / AC-SHIP2-004: Training on RTX 4090 completes within 21 days — pure u32-threshold verdict fn verdict_from_training_duration_days(u32) -> Ship014Verdict bound to const AC_SHIP2_004_MAX_TRAINING_DURATION_DAYS = 21.

First bundled double-discharge on the SHIP-TWO-001 surface. Completes MODEL-2 to 12/12 PARTIAL_ALGORITHM_LEVEL touched.
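A minimal sketch of the two verdict fns, using the names and thresholds quoted in the summary. The PR describes them as `const fn`; this sketch uses plain `fn` so that it does not depend on toolchain support for floats in const contexts. The derives and the range-based comparison are assumptions, not the code in `llama_370m.rs`.

```rust
/// Inclusive upper bound on validation cross-entropy (AC-SHIP2-003).
pub const AC_SHIP2_003_MAX_VAL_CROSS_ENTROPY_LOSS: f32 = 2.2;
/// Inclusive upper bound on training duration in days (AC-SHIP2-004).
pub const AC_SHIP2_004_MAX_TRAINING_DURATION_DAYS: u32 = 21;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship013Verdict {
    Pass,
    Fail,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship014Verdict {
    Pass,
    Fail,
}

/// Pass iff the loss lies in [0.0, 2.2].
/// NaN compares false against both bounds, +inf exceeds the upper bound,
/// and negatives (including -inf) fall below the lower bound, so all Fail.
pub fn verdict_from_val_ce_loss(ce: f32) -> Ship013Verdict {
    if (0.0..=AC_SHIP2_003_MAX_VAL_CROSS_ENTROPY_LOSS).contains(&ce) {
        Ship013Verdict::Pass
    } else {
        Ship013Verdict::Fail
    }
}

/// Pass iff the measured duration is at most 21 days.
/// `u32` already rules out negative and non-finite inputs.
pub fn verdict_from_training_duration_days(days: u32) -> Ship014Verdict {
    if days <= AC_SHIP2_004_MAX_TRAINING_DURATION_DAYS {
        Ship014Verdict::Pass
    } else {
        Ship014Verdict::Fail
    }
}
```

Both fns take only the measured number, which is what lets the decision rule be tested before any training run exists.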

Stacking

Stacked on feat/falsify-ship-016-restacked / PR #1035 (v2.37.0 / llama-370m-sovereign-v1.yaml v1.9.0). PR base is main so squash-merge chains sensibly once #1035 lands.

Changes

crates/aprender-train/src/models/llama_370m.rs: 2 new public const floors (f32 + u32), 2 verdict enums, 2 pure const fn verdict fns, 2 mutation-survey unit tests.
contracts/model-families/llama-370m-sovereign-v1.yaml: v1.9.0 → v1.10.0 (stays ACTIVE). Two new gates — GATE-ARCH-370M-013 (binds AC-SHIP2-003 ↔ FALSIFY-SHIP-013) and GATE-ARCH-370M-014 (binds AC-SHIP2-004 ↔ FALSIFY-SHIP-014) — both with discharge_status: PARTIAL_ALGORITHM_LEVEL + ship_blocking: true.
docs/specifications/aprender-train/ship-two-models-spec.md: Version 2.37.0 → 2.38.0; v2.38.0 Date-field entry; AC-SHIP2-003 + AC-SHIP2-004 rows tagged **(PARTIAL_ALGORITHM_LEVEL v2.38.0)**.

Mutation surveys

SHIP-013 — 7 sections: exact 2.2 boundary Pass / ULP-above Fail + ULP-below Pass / clear Pass band {0.0, 0.5, 1.0, 2.0, 2.199} / clear Fail band {2.201, 3.0, 10.0, f32::MAX} / non-finite {NaN, ±∞} conservative Fail / negative-CE domain-violation Fail (H(p,q) ≥ 0 by definition) / 2.2 provenance pin.
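The boundary, ULP, non-finite, and negative sections of the SHIP-013 survey could be sketched as follows. This is an illustration under assumptions, not the test in the PR: the helper names `ulp_above`, `ulp_below`, and `run_ship_013_survey` are hypothetical, and the verdict fn is re-declared as a plain `fn` so the block stands alone.

```rust
const AC_SHIP2_003_MAX_VAL_CROSS_ENTROPY_LOSS: f32 = 2.2;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ship013Verdict {
    Pass,
    Fail,
}

fn verdict_from_val_ce_loss(ce: f32) -> Ship013Verdict {
    if (0.0..=AC_SHIP2_003_MAX_VAL_CROSS_ENTROPY_LOSS).contains(&ce) {
        Ship013Verdict::Pass
    } else {
        Ship013Verdict::Fail
    }
}

/// Next representable f32 above `x`. Valid only for positive finite `x`
/// (hypothetical helper for this sketch).
fn ulp_above(x: f32) -> f32 {
    f32::from_bits(x.to_bits() + 1)
}

/// Next representable f32 below `x`. Valid only for positive finite `x`
/// (hypothetical helper for this sketch).
fn ulp_below(x: f32) -> f32 {
    f32::from_bits(x.to_bits() - 1)
}

/// Runs a subset of the survey sections; panics on the first violation.
fn run_ship_013_survey() {
    let max = AC_SHIP2_003_MAX_VAL_CROSS_ENTROPY_LOSS;
    // Exact boundary is inclusive.
    assert_eq!(verdict_from_val_ce_loss(max), Ship013Verdict::Pass);
    // One ULP above fails, one ULP below passes.
    assert_eq!(verdict_from_val_ce_loss(ulp_above(max)), Ship013Verdict::Fail);
    assert_eq!(verdict_from_val_ce_loss(ulp_below(max)), Ship013Verdict::Pass);
    // Non-finite inputs fail conservatively.
    for bad in [f32::NAN, f32::INFINITY, f32::NEG_INFINITY] {
        assert_eq!(verdict_from_val_ce_loss(bad), Ship013Verdict::Fail);
    }
    // Negative cross-entropy is a domain violation.
    assert_eq!(verdict_from_val_ce_loss(-0.001), Ship013Verdict::Fail);
    // Provenance pin: the constant must stay at 2.2.
    assert_eq!(max, 2.2_f32);
}
```

A mutation that turns `<=` into `<` is caught by the exact-boundary assertion, and one that loosens the threshold by a single ULP is caught by the ULP-above assertion.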

SHIP-014 — 6 sections: exact 21 boundary Pass / adjacent 20→Pass + 22→Fail / clear Pass band {0, 1, 7, 14, 20, 21} / clear Fail band {22, 30, 100, u32::MAX} / monotonicity sweep 0..=42 flipping exactly once at 21→22 / 21 provenance pin.
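The monotonicity sweep for SHIP-014 can be illustrated with a small helper that counts Pass/Fail transitions over a range of days. The helper name `monotonicity_sweep` and its return shape are hypothetical; the verdict fn is re-declared as a plain `fn` so the block is self-contained.

```rust
const AC_SHIP2_004_MAX_TRAINING_DURATION_DAYS: u32 = 21;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ship014Verdict {
    Pass,
    Fail,
}

fn verdict_from_training_duration_days(days: u32) -> Ship014Verdict {
    if days <= AC_SHIP2_004_MAX_TRAINING_DURATION_DAYS {
        Ship014Verdict::Pass
    } else {
        Ship014Verdict::Fail
    }
}

/// Sweeps 0..=max_day and returns (number of verdict flips, first failing day).
/// Hypothetical helper for this sketch.
fn monotonicity_sweep(max_day: u32) -> (usize, Option<u32>) {
    let verdicts: Vec<Ship014Verdict> = (0..=max_day)
        .map(verdict_from_training_duration_days)
        .collect();
    // A flip is any adjacent pair of days with different verdicts.
    let flips = verdicts.windows(2).filter(|w| w[0] != w[1]).count();
    // The index into `verdicts` equals the day number because the sweep starts at 0.
    let first_fail = verdicts
        .iter()
        .position(|v| *v == Ship014Verdict::Fail)
        .map(|i| i as u32);
    (flips, first_fail)
}
```

Under these definitions `monotonicity_sweep(42)` returns `(1, Some(22))`: one flip, located between day 21 and day 22.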

Verification

cargo fmt -p aprender-train --check — clean
cargo test -p aprender-train --lib ship_013 → 1 passed
cargo test -p aprender-train --lib ship_014 → 1 passed
cargo test -p aprender-train --lib llama_370m → 20 passed (no regressions)
cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/model-families/llama-370m-sovereign-v1.yaml → 0 errors, 0 warnings

Full discharge blocks on

SHIP-013: live apr pretrain --mode from-scratch --validate loop on RTX 4090 with --features cuda producing a real MODEL-2 val cross-entropy at the final validation step; feed into verdict_from_val_ce_loss → Pass.
SHIP-014: real wall-clock measurement of a MODEL-2 pretraining run on RTX 4090 from first apr pretrain dispatch to final checkpoint write, converted to integer days; feed into verdict_from_training_duration_days → Pass.

Both full discharges block on AC-SHIP2-003/004 compute-dispatch (blocked today by task #132 — CUDA training backend gap).
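For the SHIP-014 full discharge, the wall-clock measurement has to be turned into an integer day count before it reaches the verdict fn. The PR does not say how partial days are rounded; the sketch below assumes rounding up, which is the conservative choice for a ceiling. The names `elapsed_to_whole_days` and `within_training_budget` are hypothetical.

```rust
use std::time::Duration;

const SECS_PER_DAY: u64 = 86_400;
/// Mirrors AC_SHIP2_004_MAX_TRAINING_DURATION_DAYS from the PR.
const MAX_TRAINING_DAYS: u32 = 21;

/// Converts elapsed wall-clock time to whole days.
/// ASSUMPTION: any partial day rounds up; the PR only says
/// "converted to integer days".
pub fn elapsed_to_whole_days(elapsed: Duration) -> u32 {
    let secs = elapsed.as_secs();
    let has_partial_day = secs % SECS_PER_DAY != 0 || elapsed.subsec_nanos() != 0;
    let days = secs / SECS_PER_DAY + u64::from(has_partial_day);
    // Saturate instead of truncating if the duration is absurdly large.
    days.min(u64::from(u32::MAX)) as u32
}

/// True iff the rounded-up day count is within the 21-day budget.
pub fn within_training_budget(elapsed: Duration) -> bool {
    elapsed_to_whole_days(elapsed) <= MAX_TRAINING_DAYS
}
```

With this rounding, a run of exactly 21 days is within budget, while 21 days plus one second counts as 22 days and is not. Rounding down would instead accept any run shorter than 22 full days, so the choice belongs in the spec rather than in the harness.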

Status shift

MODEL-2 coverage: 8/12 → 12/12 PARTIAL_ALGORITHM_LEVEL touched (complete)
Across both models: 23 PARTIAL + 3 DISCHARGED

Test plan

cargo fmt -p aprender-train --check → clean
cargo test -p aprender-train --lib ship_013 → 1 passed
cargo test -p aprender-train --lib ship_014 → 1 passed
cargo test -p aprender-train --lib llama_370m → 20 passed
pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → 0 errors
CI required status checks green on merge

🤖 Generated with Claude Code

…ce-v1 multi-bind

FALSIFY-SHIP-009 (AC-SHIP1-009 "MODEL-1 teacher license + data provenance recorded in model.apr metadata") attains PARTIAL_ALGORITHM_LEVEL by attaching a second binding to the same C-APR-PROVENANCE contract that already discharges MODEL-2's AC-SHIP2-012. The AprV2Metadata + serde-JSON decision rule is model-agnostic, so one contract cleanly carries both discharges.

Changes:
- contracts/apr-provenance-v1.yaml v1.0.0 → v1.1.0 (stays ACTIVE): new GATE-APR-PROV-004 block binds AC-SHIP1-009 / FALSIFY-SHIP-009 at PARTIAL_ALGORITHM_LEVEL with ship_blocking=true; full discharge blocks on teacher .apr republish populating license, data_source, data_license as named fields (PMAT-686 fixture-swap).
- crates/aprender-core/src/format/tests/provenance_tests.rs:
  - falsify_ship_009_apr_metadata_applies_to_model_1_teacher — teacher-representative round-trip (license="apache-2.0", data_source="qwen2.5-coder-7b-instruct", data_license="apache-2.0").
  - falsify_ship_009_gate_apr_prov_004_has_partial_discharge_marker — include_str! YAML-binding assertion that the new gate has the correct binds_to / falsification_id / discharge_status / flags.
- crates/aprender-core/Cargo.toml: add serde_yaml to [dev-dependencies] (needed for the YAML-binding test).
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0 → v2.24.0: new v2.24.0 amendment block documenting the first MODEL-1 PARTIAL and first multi-model multi-bind on one contract.

Pattern extensions:
- First MODEL-1 PARTIAL (prior six targeted MODEL-2).
- First multi-model multi-bind on ONE contract (prior PARTIALs each had a dedicated contract).
- Sixth falsification of the "exhausted" verdict: SHIP-019 → SHIP-017 → SHIP-020 → SHIP-018 → SHIP-016 → SHIP-009 — sixth is cross-model, strictly more surprising than the prior five.

All 5 provenance tests green (3 SHIP-022 + 2 SHIP-009).

Status after v2.24.0:
- MODEL-2: 3/12 ACTIVE + 7/12 PARTIAL = 10/12 touched (83.3%)
- MODEL-1: 9/10 DISCHARGED (via SHIP-TWO-001-MODEL-1-TEACHER tag) + 1/10 PARTIAL (009). Will flip to fully ACTIVE when PMAT-686 republishes teacher.apr with provenance fields populated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…EVEL discharge (task #149)

MODEL-2 (albor 370M Sovereign) gate #4 at PARTIAL: binds AC-SHIP2-007 ("apr run produces syntactically valid Python on 100 held-out prompts") to FALSIFY-SHIP-017 via new GATE-ARCH-370M-005 with `discharge_status: PARTIAL_ALGORITHM_LEVEL`.

The decision rule — "≤ 1 SyntaxError tolerated out of 100, ≥ 2 is a ship-blocker" — is a pure integer threshold and is proven correct at `cargo test` time today. Full discharge (100-prompt `apr run` harness against a trained 370M .apr) remains PENDING on pretraining compute-dispatch (AC-SHIP2-003/004) — fixture swap is data-only, no harness rewrite required.

Changes:
- crates/aprender-train/src/models/llama_370m.rs:
  - Adds `AC_SHIP2_007_HELDOUT_PROMPT_COUNT` (=100) + `AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS` (=1) consts mirroring the spec §6 harness size and §8.3 FALSIFY-SHIP-017 tolerance.
  - Adds `verdict_from_syntax_error_count(errors) -> Ship017Verdict` const fn — the pure threshold.
  - Adds `falsify_ship_017_syntax_error_count_threshold_logic` — Pass boundary (0,1), Fail boundary (2,50,100), monotonicity sweep ∈ [0,100], and provenance pinning.
  - Adds `falsify_ship_017_gate_arch_370m_005_has_partial_discharge_marker` — binds sovereign contract YAML shape (falsification_id, binds_to, discharge_status, evidence_discharged_by, full_discharge_blocks_on, ship_blocking) to Rust tests via include_str!.
- contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0 (stays ACTIVE): adds GATE-ARCH-370M-005.
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0 → v2.25.0 with amendment block: counter-example survey continues to find new PARTIAL levers after two prior "exhausted" verdicts (SHIP-015 → SHIP-019 → SHIP-017). New status: 3/12 ACTIVE + 4/12 PARTIAL = 7/12 touched (58.3%).

Verification:
- cargo test -p aprender-train --lib llama_370m → 12/12 pass (including both new falsify_ship_017_* tests)
- cargo clippy -p aprender-train --lib -- -D warnings → clean
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → Contract is valid

Closes task #149.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Binds AC-SHIP2-010 (inference decode throughput ≥ 100 tok/s on RTX 4090) to a new GATE-ARCH-370M-006 in the sovereign contract via a pure f32 threshold fn + two unit tests. The compute-heavy half (`apr bench` on a real trained 370M .apr) is deferred to AC-SHIP2-003/004 compute-dispatch; the decision rule itself is proven today.

Changes:
- crates/aprender-train/src/models/llama_370m.rs:
  * AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 = 100.0 (const floor)
  * Ship020Verdict { Pass, Fail }
  * verdict_from_decode_tps(f32) -> Ship020Verdict (fn, non-finite → Fail)
  * falsify_ship_020_decode_tps_threshold_logic (5 invariants: Pass boundary, Fail boundary at one f32 ULP, monotonicity in both directions, conservative Fail for NaN/±∞, provenance pinning that the const stays = 100.0)
  * falsify_ship_020_gate_arch_370m_006_has_partial_discharge_marker (contract parses + advertises PARTIAL_ALGORITHM_LEVEL + evidence_discharged_by populated + full_discharge_blocks_on documented + ship_blocking:true)
- contracts/model-families/llama-370m-sovereign-v1.yaml:
  * v1.5.0 → v1.6.0, stays ACTIVE
  * New GATE-ARCH-370M-006 binding AC-SHIP2-010 ↔ FALSIFY-SHIP-020 with discharge_status: PARTIAL_ALGORITHM_LEVEL
- docs/specifications/aprender-train/ship-two-models-spec.md:
  * v2.23.0 → v2.26.0 with amendment block
  * MODEL-2 ship-gate status updated: 3/12 ACTIVE + 5/12 PARTIAL = 8/12 touched (66.7%)
- crates/aprender-train/src/train/device.rs:
  * 2 pre-existing fmt fixes (6 lines of whitespace) — restores `cargo fmt -p aprender-train --check` green. Pre-existing on origin/main; kept in this PR under Toyota Way "all defects are your defects" rule.

Pattern lesson: v2.22.0 declared MODEL-2 non-compute PARTIAL levers "exhausted" — re-running the counter-example survey has now falsified that verdict three times (SHIP-019 → SHIP-017 → SHIP-020). When a SHIP gate names a threshold / tolerance / ratio / cut-off and the compute-heavy harness is separable from the decision function, the threshold fn can land today at unit-test time — even when the full end-to-end harness is blocked on compute.

Full discharge blocks on: real 370M .apr from AC-SHIP2-003/004 compute-dispatch + three independent `apr bench --tokens 128 --json` medians on RTX 4090 host. Fixture-swap only — no decision-rule rewrite.

Verification:
- cargo test -p aprender-train --lib models::llama_370m → 11/11 PASS
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → "Contract is valid. 0 error(s), 0 warning(s)."
- cargo clippy -p aprender-train --lib → green
- cargo fmt -p aprender-train --check → green

Task #150.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

AC-SHIP2-008 / FALSIFY-SHIP-018 bound via new GATE-ARCH-370M-007 at PARTIAL_ALGORITHM_LEVEL. Pure two-number threshold fn `verdict_from_pass_at_1(correct, total, threshold_pct)` + const `AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT = 30.0` in crates/aprender-train/src/models/llama_370m.rs — proves the spec's 'HumanEval pass@1 ≥ 30.0%' decision rule at `cargo test` time, independent of a trained artifact.

Two unit tests prove:
- boundary (f32-exact 50/100 = 50.0% with ±ULP shift showing `>=` is inclusive; 49/164 and 29/100 fail the 30.0 floor)
- monotonicity (correct sweep 0..=164 at total=164 never flips Pass → Fail)
- div-safety (total=0 fails closed) + sanity (correct>total fails)
- non-finite threshold guard (NaN / ±∞ all Fail)
- provenance pin (const stays = 30.0)
- YAML marker (GATE-ARCH-370M-007 carries PARTIAL_ALGORITHM_LEVEL, binds AC-SHIP2-008, cites FALSIFY-SHIP-018, ship_blocking:true)

Full discharge blocks on real 370M .apr (AC-SHIP2-003/004 compute) + three seed=0 `apr eval --benchmark humaneval --json` median pass@1 values fed into the verdict fn — all three must Pass. Fixture-swap only; no harness rewrite.

6th PARTIAL for MODEL-2 (after SHIP-012/015/017/019/020). Spec v2.22.0's 'exhausted' verdict now falsified 4×. Remaining 5th-PARTIAL candidate: SHIP-016 (`apr qa` 8-of-8 aggregate — not a single threshold). SHIP-013/014 genuinely need real compute.

Contract: llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0 (stays ACTIVE).
Spec: ship-two-models-spec.md v2.23.0 → v2.24.0 (amendment block).

Also: 6-line pre-existing fmt fix in train/device.rs under Toyota Way "all defects are your defects" (same pattern as PR #1005).

Status: MODEL-2 ship-gates 3/12 ACTIVE + 6/12 PARTIAL = 9/12 touched (75.0%). Remaining 3 (003/004/006) all need real 370M compute.

Tests: cargo test -p aprender-train --lib models::llama_370m → 11/11 pass. `pv validate contracts/model-families/llama-370m-sovereign-v1.yaml` → Contract is valid. cargo fmt -p aprender-train --check → clean. cargo clippy -p aprender-train --lib -- -D warnings → clean.

Refs: SHIP-TWO-001, task #151, FALSIFY-SHIP-018.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
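The pass@1 decision rule described in this commit could look roughly like the sketch below. The parameter types (`u32`, `u32`, `f32`) and the enum name `Ship018Verdict` are assumptions made by analogy with the other verdict types; only the fn name, the argument order, and the constant come from the commit message.

```rust
pub const AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT: f32 = 30.0;

/// ASSUMPTION: enum name chosen by analogy with Ship013Verdict etc.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship018Verdict {
    Pass,
    Fail,
}

/// Pass iff `correct / total` expressed as a percentage is >= `threshold_pct`.
/// Fails closed on total == 0, correct > total, and non-finite thresholds.
pub fn verdict_from_pass_at_1(correct: u32, total: u32, threshold_pct: f32) -> Ship018Verdict {
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship018Verdict::Fail;
    }
    // Multiply before dividing so that cases like 30/100 land exactly on 30.0.
    let pct = correct as f32 * 100.0 / total as f32;
    if pct >= threshold_pct {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}
```

With the 30.0 floor, 49/164 (about 29.9%) and 29/100 fail, while 50/164 (about 30.5%) passes.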

…ND verdict fn

Wires GATE-ARCH-370M-008 (AC-SHIP2-006) to a pure verdict_from_qa_gates(&[bool]) -> Ship016Verdict aggregate-AND fn in aprender-train/src/models/llama_370m.rs, proven today by exhaustive 2^8 = 256-combination sweep + single-gate-flip falsifiability + monotonicity + 3 contract-drift guards (slice length 0/7/9/16 → Fail even when all-true). Discharge marker: PARTIAL_ALGORITHM_LEVEL.

Pattern note: SHIP-016 is the first aggregate-AND shape — SHIP-017/018/020 were single-threshold shapes. The proof pattern now covers two distinct decision-rule shapes, confirming decision-rule/compute-harness separation is a reusable pattern, not a one-off.

**5th PARTIAL after "exhausted" verdict falsified 4× already** (SHIP-019 → SHIP-017 → SHIP-020 → SHIP-018 → SHIP-016).

**MODEL-2 ship-gate coverage: 3/12 ACTIVE + 7/12 PARTIAL = 10/12 touched (83.3%).** Remaining 2 truly compute-blocked (003 CE ≤ 2.2, 004 ≤21-day wall-clock) have no fixture-swap trick.

Changes:
- contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0 (GATE-ARCH-370M-008 block added; stays ACTIVE)
- crates/aprender-train/src/models/llama_370m.rs:
  + AC_SHIP2_006_REQUIRED_QA_GATE_COUNT = 8 const
  + Ship016Verdict enum
  + verdict_from_qa_gates(&[bool]) pure fn with aggregate-AND
  + falsify_ship_016_apr_qa_aggregate_and_logic test (2^8 sweep + single-gate-flip + monotonicity + 3 contract-drift guards)
  + falsify_ship_016_gate_arch_370m_008_has_partial_discharge_marker test (YAML binding: binds_to AC-SHIP2-006, falsification_id FALSIFY-SHIP-016, discharge_status PARTIAL_ALGORITHM_LEVEL)
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0 → v2.24.0 (amendment block documenting 5th PARTIAL, first aggregate-AND shape)
- crates/aprender-train/src/train/device.rs: pre-existing fmt fixes bundled per Toyota Way "all defects are your defects"

Full discharge blocks on: real 370M .apr from AC-SHIP2-003/004 compute-dispatch + 8-gate apr qa harness invocation with exit 0 → feed the 8 gate-result booleans into verdict_from_qa_gates and require Ship016Verdict::Pass. Fixture-swap only — no harness rewrite.

Refs #152

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
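The aggregate-AND rule and its exhaustive sweep can be sketched as below. The fn, enum, and constant names come from the commit message; the `usize` type of the constant and the helper `count_passing_combinations` are assumptions for illustration.

```rust
/// ASSUMPTION: typed as usize here so it compares directly with slice lengths.
pub const AC_SHIP2_006_REQUIRED_QA_GATE_COUNT: usize = 8;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship016Verdict {
    Pass,
    Fail,
}

/// Pass iff exactly 8 gate results are supplied and every one is true.
/// Any other slice length fails, even when all entries are true.
pub fn verdict_from_qa_gates(gates: &[bool]) -> Ship016Verdict {
    if gates.len() == AC_SHIP2_006_REQUIRED_QA_GATE_COUNT && gates.iter().all(|&g| g) {
        Ship016Verdict::Pass
    } else {
        Ship016Verdict::Fail
    }
}

/// Enumerates all 2^8 gate combinations and counts how many Pass.
/// Hypothetical helper for this sketch.
pub fn count_passing_combinations() -> usize {
    (0u32..256)
        .filter(|&mask| {
            // Bit i of the mask is the result of gate i.
            let gates: Vec<bool> = (0..8u32).map(|bit| (mask >> bit) & 1 == 1).collect();
            verdict_from_qa_gates(&gates) == Ship016Verdict::Pass
        })
        .count()
}
```

Exactly one of the 256 combinations passes (all eight gates true), so `count_passing_combinations()` returns 1. The length check is what the contract-drift guards exercise: a seven- or nine-element all-true slice must still Fail.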

…X 4090 training-budget PARTIAL discharges (12/12 MODEL-2 complete)

Bundled PARTIAL_ALGORITHM_LEVEL discharge of the last two untouched MODEL-2 AC rows: AC-SHIP2-003 (val CE ≤ 2.2) and AC-SHIP2-004 (training ≤ 21 days on RTX 4090). First bundled double-discharge on the SHIP-TWO-001 surface.

**FALSIFY-SHIP-013 / AC-SHIP2-003 / GATE-ARCH-370M-013** — val CE floor
- `AC_SHIP2_003_MAX_VAL_CROSS_ENTROPY_LOSS: f32 = 2.2`
- `Ship013Verdict { Pass, Fail }`
- `const fn verdict_from_val_ce_loss(f32) -> Ship013Verdict` — Pass iff measured CE is finite AND non-negative AND ≤ 2.2. Negative values Fail conservatively because cross-entropy H(p,q) ≥ 0 by definition.
- `falsify_ship_013_val_ce_loss_threshold_logic` — 7-section mutation survey:
  1. Exact boundary 2.2 → Pass (inclusive floor, not strict <)
  2. ULP asymmetry — above 2.2 → Fail, below 2.2 → Pass
  3. Clear Pass band {0.0, 0.5, 1.0, 2.0, 2.199}
  4. Clear Fail band {2.201, 3.0, 10.0, f32::MAX}
  5. Non-finite {NaN, +∞, -∞} → Fail conservatively
  6. Negative-CE domain-violation Fail ({-0.001, -1.0, -∞})
  7. Provenance pin: const stays = 2.2_f32

**FALSIFY-SHIP-014 / AC-SHIP2-004 / GATE-ARCH-370M-014** — training budget
- `AC_SHIP2_004_MAX_TRAINING_DURATION_DAYS: u32 = 21`
- `Ship014Verdict { Pass, Fail }`
- `const fn verdict_from_training_duration_days(u32) -> Ship014Verdict` — Pass iff measured ≤ 21. u32 auto-rules out negatives and non-finites.
- `falsify_ship_014_training_duration_threshold_logic` — 6-section mutation survey:
  1. Exact boundary 21 → Pass (inclusive ceiling)
  2. Adjacent: 20 → Pass, 22 → Fail
  3. Clear Pass band {0, 1, 7, 14, 20, 21}
  4. Clear Fail band {22, 30, 100, u32::MAX}
  5. Monotonicity sweep 0..=42 — flips exactly once at 21→22
  6. Provenance pin: const stays = 21_u32

**Changes:**
- crates/aprender-train/src/models/llama_370m.rs:
  * 2 new public const floors + 2 verdict enums + 2 pure `const fn` verdict fns
  * 2 new mutation-survey unit tests (inside existing tests mod)
- contracts/model-families/llama-370m-sovereign-v1.yaml:
  * v1.9.0 → v1.10.0, stays ACTIVE
  * New GATE-ARCH-370M-013 binding AC-SHIP2-003 ↔ FALSIFY-SHIP-013 with discharge_status: PARTIAL_ALGORITHM_LEVEL
  * New GATE-ARCH-370M-014 binding AC-SHIP2-004 ↔ FALSIFY-SHIP-014 with discharge_status: PARTIAL_ALGORITHM_LEVEL
  * v1.10.0 changelog entry at top of changelog block
- docs/specifications/aprender-train/ship-two-models-spec.md:
  * Version 2.37.0 → 2.38.0
  * v2.38.0 Date-field entry describing the bundle
  * AC-SHIP2-003 and AC-SHIP2-004 rows tagged `**(PARTIAL_ALGORITHM_LEVEL v2.38.0)**`

**Verification:**
- `cargo fmt -p aprender-train --check` — clean
- `cargo test -p aprender-train --lib ship_013` → 1 passed
- `cargo test -p aprender-train --lib ship_014` → 1 passed
- `cargo test -p aprender-train --lib llama_370m` → 20 passed
- `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/model-families/llama-370m-sovereign-v1.yaml` → 0 errors

**Full discharge still blocks on:**
- SHIP-013: live `apr pretrain --mode from-scratch --validate` loop on RTX 4090 with `--features cuda` producing a real MODEL-2 val CE.
- SHIP-014: real wall-clock measurement of a MODEL-2 pretraining run on RTX 4090 from first `apr pretrain` dispatch to final checkpoint write.

**Status shift:**
- MODEL-2 coverage: 8/12 → **12/12 PARTIAL_ALGORITHM_LEVEL touched** (complete)
- Across both models: 23 PARTIAL + 3 DISCHARGED

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-04-24T11:42:22Z

Superseded by #1044 — 11-PR cascade collapsed into single squash-merge to avoid O(n²) rebase treadmill. Content identical; this branch's commit is in #1044.

noahgift enabled auto-merge (squash) April 23, 2026 21:14

noahgift mentioned this pull request Apr 23, 2026

docs(ship-two-001): v2.39.0 back-annotate AC tables + falsification tests for single-source-of-truth #1037

Closed

6 tasks

noahgift force-pushed the feat/falsify-ship-013-014-bundle branch from c21736f to 6d85204 Compare April 24, 2026 06:47

noahgift and others added 6 commits April 24, 2026 12:40

noahgift force-pushed the feat/falsify-ship-013-014-bundle branch from 6d85204 to be5e58e Compare April 24, 2026 10:46

noahgift mentioned this pull request Apr 24, 2026

feat(ship-two-001): full algorithmic coverage bundle + README contract-backed rewrite (v2.30 → v2.43) #1044

Merged

noahgift closed this Apr 24, 2026

auto-merge was automatically disabled April 24, 2026 11:42
Pull request was closed

noahgift wants to merge 6 commits into main from feat/falsify-ship-013-014-bundle
