feat(ship-two-001): FALSIFY-SHIP-017 AC-SHIP2-007 PARTIAL_ALGORITHM_LEVEL discharge (task #149) by noahgift · Pull Request #1004 · paiml/aprender

noahgift · 2026-04-22T11:57:36Z

Summary

Binds AC-SHIP2-007 ("apr run produces syntactically valid Python on 100 held-out prompts") to FALSIFY-SHIP-017 via new GATE-ARCH-370M-005 in contracts/model-families/llama-370m-sovereign-v1.yaml (v1.5.0 → v1.6.0, stays ACTIVE) with discharge_status: PARTIAL_ALGORITHM_LEVEL.
Adds a pure const fn verdict_from_syntax_error_count(errors: usize) -> Ship017Verdict + two unit tests in crates/aprender-train/src/models/llama_370m.rs proving the threshold rule today.
MODEL-2 ship-gate status: 3/12 ACTIVE + 4/12 PARTIAL = 7/12 touched (58.3%).

The algorithm-level proof

The decision rule — "≤ 1 SyntaxError tolerated out of 100, ≥ 2 is a ship-blocker" — is a pure integer threshold and is proven correct at cargo test time.
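As a minimal sketch of that rule (not the code in this PR): the constant names and the function signature come from the description above, while the `Pass`/`Fail` variant names of `Ship017Verdict` and the `is_monotone_over_heldout_range` helper are assumptions made for illustration.

```rust
/// Harness size and tolerance, using the values pinned in the PR description.
pub const AC_SHIP2_007_HELDOUT_PROMPT_COUNT: usize = 100;
pub const AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS: usize = 1;

/// Assumed shape: the PR names this type but does not list its variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship017Verdict {
    Pass,
    Fail,
}

/// Pure integer threshold: at most one SyntaxError passes, two or more fail.
#[must_use]
pub const fn verdict_from_syntax_error_count(errors: usize) -> Ship017Verdict {
    if errors <= AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS {
        Ship017Verdict::Pass
    } else {
        Ship017Verdict::Fail
    }
}

/// Hypothetical helper: sweeps 0..=100 and checks that a Pass never
/// follows a Fail, which is the monotonicity property described above.
pub fn is_monotone_over_heldout_range() -> bool {
    let mut seen_fail = false;
    for errors in 0..=AC_SHIP2_007_HELDOUT_PROMPT_COUNT {
        match verdict_from_syntax_error_count(errors) {
            Ship017Verdict::Fail => seen_fail = true,
            Ship017Verdict::Pass if seen_fail => return false,
            Ship017Verdict::Pass => {}
        }
    }
    true
}
```

Because the function takes only an error count, it can be exercised without a trained model; a harness would only need to supply the number of prompts whose output failed to parse.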

Two tests cover it:

falsify_ship_017_syntax_error_count_threshold_logic — Pass boundary (0, 1 errors), Fail boundary (2 errors), pathological cases (50, 100 errors all Fail), monotonicity sweep over all errors ∈ [0, 100], and provenance pinning (AC_SHIP2_007_HELDOUT_PROMPT_COUNT == 100, AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS == 1).
falsify_ship_017_gate_arch_370m_005_has_partial_discharge_marker — binds the sovereign contract YAML shape (falsification_id, binds_to, discharge_status, evidence_discharged_by, full_discharge_blocks_on, ship_blocking) to the Rust tests via include_str!.
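The contract-shape binding can be sketched along these lines. This is an illustration under stated assumptions, not the test in the PR: the real test reads the contract with `include_str!`, whereas this sketch uses an inline stand-in. The YAML fragment, its nesting, and its field values are hypothetical placeholders; only the six key names and the `PARTIAL_ALGORITHM_LEVEL` marker are taken from the description. `missing_keys` and `has_partial_discharge_marker` are invented helper names.

```rust
/// Key names listed in the PR description for GATE-ARCH-370M-005.
const REQUIRED_KEYS: [&str; 6] = [
    "falsification_id",
    "binds_to",
    "discharge_status",
    "evidence_discharged_by",
    "full_discharge_blocks_on",
    "ship_blocking",
];

/// Hypothetical stand-in for the contract file; the real layout and
/// values may differ. The PR's test would use `include_str!` instead.
const EXAMPLE_GATE_YAML: &str = r#"
gates:
  - id: GATE-ARCH-370M-005
    falsification_id: FALSIFY-SHIP-017
    binds_to: AC-SHIP2-007
    discharge_status: PARTIAL_ALGORITHM_LEVEL
    evidence_discharged_by: falsify_ship_017_syntax_error_count_threshold_logic
    full_discharge_blocks_on: AC-SHIP2-003/004
    ship_blocking: true
"#;

/// Returns the required keys that do not start any line of the text.
/// A plain line scan, not a YAML parse, so it only checks shape.
pub fn missing_keys(yaml: &str) -> Vec<&'static str> {
    REQUIRED_KEYS
        .iter()
        .copied()
        .filter(|key| {
            let prefix = format!("{key}:");
            !yaml
                .lines()
                .any(|line| line.trim_start().starts_with(prefix.as_str()))
        })
        .collect()
}

/// True when some line carries the partial-discharge marker exactly.
pub fn has_partial_discharge_marker(yaml: &str) -> bool {
    yaml.lines()
        .any(|line| line.trim() == "discharge_status: PARTIAL_ALGORITHM_LEVEL")
}
```

A line scan like this fails the build if a field is renamed or the status changes, which is the point of pinning the contract shape to the Rust tests; a stricter variant could parse the YAML properly.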

Full discharge (100-prompt apr run harness against a trained 370M .apr) blocks on pretraining compute-dispatch (AC-SHIP2-003/004). Fixture swap is data-only — no harness rewrite required.

Pattern confirmation

This is the 4th PARTIAL lever found after two prior "harvesting exhausted" verdicts (SHIP-015 → SHIP-019 → SHIP-017). The spec v2.25.0 amendment records this — re-running the counter-example survey continues to find new levers, so "exhausted" should be treated as provisional.

Test plan

cargo test -p aprender-train --lib llama_370m → 12/12 pass (both new falsify_ship_017_* tests green)
cargo clippy -p aprender-train --lib -- -D warnings → clean
pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → Contract is valid
CI green on self-hosted runners (workspace-test + ci/gate)

Files changed

contracts/model-families/llama-370m-sovereign-v1.yaml — adds GATE-ARCH-370M-005; version bump v1.5.0 → v1.6.0
crates/aprender-train/src/models/llama_370m.rs — adds threshold fn, consts, 2 tests
docs/specifications/aprender-train/ship-two-models-spec.md — v2.23.0 → v2.25.0 with amendment block

Closes task #149.

🤖 Generated with Claude Code

…EVEL discharge (task #149)

MODEL-2 (albor 370M Sovereign) gate #4 at PARTIAL: binds AC-SHIP2-007 ("apr run produces syntactically valid Python on 100 held-out prompts") to FALSIFY-SHIP-017 via new GATE-ARCH-370M-005 with `discharge_status: PARTIAL_ALGORITHM_LEVEL`.

The decision rule — "≤ 1 SyntaxError tolerated out of 100, ≥ 2 is a ship-blocker" — is a pure integer threshold and is proven correct at `cargo test` time today.

Full discharge (100-prompt `apr run` harness against a trained 370M .apr) remains PENDING on pretraining compute-dispatch (AC-SHIP2-003/004) — fixture swap is data-only, no harness rewrite required.

Changes:
- crates/aprender-train/src/models/llama_370m.rs:
  - Adds `AC_SHIP2_007_HELDOUT_PROMPT_COUNT` (=100) + `AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS` (=1) consts mirroring the spec §6 harness size and §8.3 FALSIFY-SHIP-017 tolerance.
  - Adds `verdict_from_syntax_error_count(errors) -> Ship017Verdict` const fn — the pure threshold.
  - Adds `falsify_ship_017_syntax_error_count_threshold_logic` — Pass boundary (0,1), Fail boundary (2,50,100), monotonicity sweep ∈ [0,100], and provenance pinning.
  - Adds `falsify_ship_017_gate_arch_370m_005_has_partial_discharge_marker` — binds sovereign contract YAML shape (falsification_id, binds_to, discharge_status, evidence_discharged_by, full_discharge_blocks_on, ship_blocking) to Rust tests via include_str!.
- contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0 (stays ACTIVE): adds GATE-ARCH-370M-005.
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0 → v2.25.0 with amendment block: counter-example survey continues to find new PARTIAL levers after two prior "exhausted" verdicts (SHIP-015 → SHIP-019 → SHIP-017).

New status: 3/12 ACTIVE + 4/12 PARTIAL = 7/12 touched (58.3%).

Verification:
- cargo test -p aprender-train --lib llama_370m → 12/12 pass (including both new falsify_ship_017_* tests)
- cargo clippy -p aprender-train --lib -- -D warnings → clean
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → Contract is valid

Closes task #149.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…@1 ≥86.00% (1.2 pp noise → 84.80%) (#1021)

* feat(ship-two-001): FALSIFY-SHIP-005 PARTIAL discharge — MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%) clean-branch rebuild

Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005) to a pure two-number threshold verdict fn.

Clean-branch rebuild of the SHIP-005 delta from the now-superseded stacked PR #1015 (commit 8c497a0 which layered on top of SHIP-007 PR #1019 prior to SHIP-002 landing). Re-based directly on current main (f615148) so SHIP-005 stands alone — SHIP-007 (#1019) remains open blocked on infra defect #1020 (runner-disk-guard cross-runner race) and SHIP-018 (#1004) still in flight.

MODEL-1 AC-SHIP1 coverage on main: 3/10 touched (SHIP-009 + SHIP-008 + SHIP-006) pre-SHIP-002; 4/10 once SHIP-002 landed via f615148; 5/10 once this PR lands.

Mirrors MODEL-2 SHIP-018 shape (pass@1 threshold) but uniquely carries a 1.2 pp noise allowance called out by spec §4.2 AC-SHIP1-005 — MODEL-2 has no noise window.

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
- Adds FALSIFY-QW2E-SHIP-005 binding
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
    AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20
    AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
  to `verdict_from_pass_at_1(correct, total, threshold_pct) -> Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
- 8-section mutation survey:
  1. Safe-margin Pass above effective floor (85/100 = 85.0%)
  2. Above nominal floor (87/100 = 87.0%) Pass
  3. Noise-window Fail at nominal (85/100 Fails nominal)
  4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
  5. Monotonicity sweep correct=0..=164 at effective
  6. Div-safety (total=0) + sanity (correct>total) → Fail
  7. Non-finite threshold (NaN, ±∞) → Fail conservatively
  8. Tolerance-bounded provenance pin on all three constants (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
- `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `full_discharge_blocks_on: live apr eval --benchmark humaneval ...` on RTX 4090; 6 named counter_example_classes.

crates/aprender-core/src/metrics/ship_005.rs (NEW, 305 lines):
- Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
- `#[must_use] pub fn verdict_from_pass_at_1(...)` returns `Ship005Verdict::Fail` conservatively on: total=0 (div guard), correct>total (sanity), !threshold.is_finite() (NaN/±∞).
- `falsify_ship_005_humaneval_pass_at_1_threshold_logic` — 1 passing.

Spec `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.

Authored self-contained because SHIP-018 PR #1004 and SHIP-007 PR #1019 are not yet on main. Once they land, the two (or three) `verdict_from_pass_at_1_*` fns should be dedup'd into a single parameterized helper.

Full discharge blocks on: live `apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with --features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance).

Tests:
    cargo test -p aprender-core --lib \
      falsify_ship_005_humaneval_pass_at_1_threshold_logic

Contract:
    cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
      contracts/qwen2-e2e-verification-v1.yaml

Supersedes #1015 (stacked-branch original).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: retrigger after disk-guard stuck workspace-test (#1021)

Previous run 24806028080 workspace-test stuck at 37+min on lib-tests step (vs typical 19min). Canceled. Re-triggering on fresh runner. Tracking: infra issue #1020 second incidence.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
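The pass@1 threshold described in this commit message can be sketched as follows. This is an illustration, not the contents of `ship_005.rs`: the constant names, function name, argument order, and guard conditions come from the message, while the argument types (`usize`, `usize`, `f32`), the `Pass` variant, the `>=` comparison, and deriving the effective floor by subtraction are assumptions.

```rust
pub const AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT: f32 = 86.00;
pub const AC_SHIP1_005_NOISE_ALLOWANCE_PP: f32 = 1.20;
/// Assumed to be derived by subtraction; in f32 this is roughly
/// 84.79999924 rather than exactly 84.80, hence tolerance-bounded checks.
pub const AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT: f32 =
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT - AC_SHIP1_005_NOISE_ALLOWANCE_PP;

/// `Fail` is named in the commit message; `Pass` is an assumed counterpart.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship005Verdict {
    Pass,
    Fail,
}

/// Fails conservatively on an empty run, an impossible count, or a
/// non-finite threshold; otherwise compares the percentage to the floor.
#[must_use]
pub fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship005Verdict {
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship005Verdict::Fail;
    }
    let pass_at_1_pct = 100.0 * (correct as f32) / (total as f32);
    if pass_at_1_pct >= threshold_pct {
        Ship005Verdict::Pass
    } else {
        Ship005Verdict::Fail
    }
}
```

Under these assumptions, 85/100 passes the effective floor but fails the nominal one, and the HumanEval-sized case 139/164 (about 84.756%) falls just below the effective floor, matching sections 1, 3, and 4 of the mutation survey.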

noahgift · 2026-04-23T17:49:44Z

Superseded by #1032 — rebased onto MODEL-1 stack at v2.34.0.

This was referenced Apr 22, 2026

feat(ship-two-001): FALSIFY-SHIP-020 algorithm-level PARTIAL discharge (5th PARTIAL) #1005

Closed

feat(falsify-ship-002): MODEL-1 apr run emits valid Python (zero syntax errors) PARTIAL discharge #1016

Closed

noahgift mentioned this pull request Apr 23, 2026

feat(falsify-ship-017): MODEL-2 AC-SHIP2-007 PARTIAL discharge (restacked) #1032

Closed

5 tasks

noahgift closed this Apr 23, 2026

noahgift wants to merge 1 commit into main from feat/falsify-ship-017-partial-discharge
