falsify(ship): SHIP-018 PARTIAL — humaneval pass@1 ≥30.0% threshold fn by noahgift · Pull Request #1006 · paiml/aprender

noahgift · 2026-04-22T13:39:45Z

Summary

Discharges FALSIFY-SHIP-018 / AC-SHIP2-008 / GATE-ARCH-370M-007 at
PARTIAL_ALGORITHM_LEVEL. Lands a pure two-number threshold function
verdict_from_pass_at_1(correct, total, threshold_pct) -> Ship018Verdict
in crates/aprender-train/src/models/llama_370m.rs plus const
AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT = 30.0 pinning the spec
§5.2 AC-SHIP2-008 floor — proves the HumanEval pass@1 decision rule
at cargo test time, independent of a trained 370M artifact.

This is the 6th PARTIAL for MODEL-2 (after SHIP-012/015/017/019/020).
Spec v2.22.0's "exhausted" verdict is now falsified 4 times.

What ships today

- verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship018Verdict:
  inclusive-floor comparison with conservative-Fail guards.
- Ship018Verdict enum { Pass, Fail }.
- AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT: f32 = 30.0 (contract floor).
- Two unit tests:
  - falsify_ship_018_humaneval_pass_at_1_threshold_logic: boundary
    (f32-exact 50/100 = 50.0% with ±ULP proving inclusive >=),
    below-floor rejection (49/164 ≈ 29.88%, 29/100 = 29.0%), generous
    pass (82/164, 164/164), hard-red (0/164, 1/164), monotonicity
    sweep (correct ∈ 0..=164 at total=164 never flips Pass → Fail),
    div-safety (total=0 fails closed), correct>total sanity, non-finite
    threshold guard (NaN / ±∞ all Fail), provenance pinning (const
    stays = 30.0).
  - falsify_ship_018_gate_arch_370m_007_has_partial_discharge_marker:
    parses SOVEREIGN_CONTRACT_YAML via serde_yaml, asserts the
    gate binds FALSIFY-SHIP-018 / AC-SHIP2-008,
    discharge_status == PARTIAL_ALGORITHM_LEVEL,
    evidence_discharged_by non-empty, full_discharge_blocks_on
    present, ship_blocking: true.
- contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 →
  v1.6.0 (stays ACTIVE) with new GATE-ARCH-370M-007 block + changelog
  entry.
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0
  → v2.24.0 with amendment block.
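
For readers without the diff open, here is a minimal sketch of what a threshold function with these guards could look like. The signature, the enum variants, and the 30.0 constant come from the description above. The guard order, the f32 arithmetic, and the name of the test function are assumptions, not the code in llama_370m.rs.

```rust
/// Contract floor for AC-SHIP2-008 (value taken from the PR description).
pub const AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT: f32 = 30.0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship018Verdict {
    Pass,
    Fail,
}

/// Sketch only: the guard order and the arithmetic below are assumptions.
pub fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship018Verdict {
    // Fail closed on an empty benchmark, an impossible count,
    // or a non-finite threshold (NaN, +inf, -inf).
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship018Verdict::Fail;
    }
    let pass_at_1_pct = 100.0_f32 * (correct as f32) / (total as f32);
    // Inclusive floor: a rate exactly equal to the threshold passes.
    if pass_at_1_pct >= threshold_pct {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}

// Hypothetical test name; it mirrors the boundary and monotonicity checks listed above.
#[test]
fn sketch_inclusive_boundary_and_monotonicity() {
    use Ship018Verdict::{Fail, Pass};

    // 50/100 is exactly 50.0 in f32, so nudging the threshold by one ULP
    // in either direction shows that the comparison is `>=` rather than `>`.
    let one_ulp_up = f32::from_bits(50.0_f32.to_bits() + 1);
    let one_ulp_down = f32::from_bits(50.0_f32.to_bits() - 1);
    assert_eq!(verdict_from_pass_at_1(50, 100, 50.0), Pass);
    assert_eq!(verdict_from_pass_at_1(50, 100, one_ulp_up), Fail);
    assert_eq!(verdict_from_pass_at_1(50, 100, one_ulp_down), Pass);

    // Once the sweep reaches Pass it must never drop back to Fail.
    let mut seen_pass = false;
    for correct in 0..=164 {
        let verdict =
            verdict_from_pass_at_1(correct, 164, AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT);
        if verdict == Pass {
            seen_pass = true;
        } else {
            assert!(!seen_pass, "verdict flipped Pass -> Fail at correct={correct}");
        }
    }
}
```

Keeping the function pure over two counts and a threshold is what lets the decision rule be tested without a checkpoint: the evaluation harness only has to supply `correct` and `total`.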

Pattern lesson

The "exhausted" verdict has now been falsified 4 times:
SHIP-019 → SHIP-017 → SHIP-020 → SHIP-018. Whenever a SHIP-spec
gate names a threshold/tolerance/ratio/cut-off and the compute-heavy
harness is separable from the decision function, the threshold
function can land today. Remaining 5th-PARTIAL candidate worth a
survey pass: SHIP-016 (apr qa 8-of-8 aggregate — separable
but not a single-threshold proof). SHIP-013/014 genuinely need real
RTX 4090 compute (CE loss + 21-day wall-clock).

Full discharge blocks on

Real 370M .apr checkpoint from AC-SHIP2-003/004 compute-dispatch +
three independent apr eval --benchmark humaneval --json median
pass@1 values at seed=0 on the SHIP-TWO-001 canonical host; feed
each into verdict_from_pass_at_1 and require all three Pass.
Fixture-swap only — no harness rewrite required.
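
The all-three-must-Pass rule could be sketched as a small aggregate over the per-run verdicts. `RunCounts` and `full_discharge` are hypothetical names introduced for illustration. The PR does not say how `apr eval --json` output is reduced to `correct` and `total`, so that step is left to the caller, and the verdict function is passed in as a closure rather than assumed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship018Verdict {
    Pass,
    Fail,
}

/// Hypothetical fixture type: one HumanEval run reduced to the two
/// numbers the verdict function consumes.
#[derive(Debug, Clone, Copy)]
pub struct RunCounts {
    pub correct: usize,
    pub total: usize,
}

/// Hypothetical aggregate: Pass only if every one of the three runs passes
/// under the supplied verdict function.
pub fn full_discharge<F>(runs: &[RunCounts; 3], threshold_pct: f32, verdict: F) -> Ship018Verdict
where
    F: Fn(usize, usize, f32) -> Ship018Verdict,
{
    let all_pass = runs
        .iter()
        .all(|run| verdict(run.correct, run.total, threshold_pct) == Ship018Verdict::Pass);
    if all_pass {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}
```

Under this shape, the fixture swap amounts to replacing the synthetic `RunCounts` values with the three measured ones; the decision logic itself stays untouched.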

Status after this PR

MODEL-2 ship-gates: 3/12 ACTIVE (001, 011, 012) + 6/12 PARTIAL
(002 via SHIP-012, 005 via SHIP-015, 007 via SHIP-017, 008 via
SHIP-018, 009 via SHIP-019, 010 via SHIP-020) = 9/12 touched (75.0%).
Remaining 3 genuinely compute-blocked: 003 (CE ≤ 2.2 val loss),
004 (≤21-day wall-clock), 006 (apr qa aggregate).

Also bundled

A 6-line pre-existing fmt fix in crates/aprender-train/src/train/device.rs
under Toyota Way "all defects are your defects" — same pattern as
PR #1005. Without it, cargo fmt -p aprender-train --check fails on
main for reasons unrelated to this PR.

Test plan

- cargo test -p aprender-train --lib models::llama_370m →
  11/11 pass (2 new + 9 existing).
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml
  → Contract is valid.
- cargo fmt -p aprender-train --check → clean.
- cargo clippy -p aprender-train --lib -- -D warnings → clean.
- CI (workspace-test, ci/test, ci/lint, ci/coverage) green.

Refs: SHIP-TWO-001, task #151, FALSIFY-SHIP-018, GATE-ARCH-370M-007.

🤖 Generated with Claude Code

AC-SHIP2-008 / FALSIFY-SHIP-018 bound via new GATE-ARCH-370M-007 at
PARTIAL_ALGORITHM_LEVEL. Pure two-number threshold fn
`verdict_from_pass_at_1(correct, total, threshold_pct)` + const
`AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT = 30.0` in
crates/aprender-train/src/models/llama_370m.rs: proves the spec's
'HumanEval pass@1 ≥ 30.0%' decision rule at `cargo test` time,
independent of a trained artifact.

Two unit tests prove:
- boundary (f32-exact 50/100 = 50.0% with ±ULP shift showing `>=` is
  inclusive; 49/164 and 29/100 fail the 30.0 floor)
- monotonicity (correct sweep 0..=164 at total=164 never flips Pass → Fail)
- div-safety (total=0 fails closed) + sanity (correct>total fails)
- non-finite threshold guard (NaN / ±∞ all Fail)
- provenance pin (const stays = 30.0)
- YAML marker (GATE-ARCH-370M-007 carries PARTIAL_ALGORITHM_LEVEL,
  binds AC-SHIP2-008, cites FALSIFY-SHIP-018, ship_blocking:true)

Full discharge blocks on real 370M .apr (AC-SHIP2-003/004 compute) +
three seed=0 `apr eval --benchmark humaneval --json` median pass@1
values fed into the verdict fn; all three must Pass. Fixture-swap
only; no harness rewrite.

6th PARTIAL for MODEL-2 (after SHIP-012/015/017/019/020). Spec
v2.22.0's 'exhausted' verdict now falsified 4×. Remaining 5th-PARTIAL
candidate: SHIP-016 (`apr qa` 8-of-8 aggregate, not a single
threshold). SHIP-013/014 genuinely need real compute.

Contract: llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0 (stays ACTIVE).
Spec: ship-two-models-spec.md v2.23.0 → v2.24.0 (amendment block).

Also: 6-line pre-existing fmt fix in train/device.rs under Toyota Way
"all defects are your defects" (same pattern as PR #1005).

Status: MODEL-2 ship-gates 3/12 ACTIVE + 6/12 PARTIAL = 9/12 touched
(75.0%). Remaining 3 (003/004/006) all need real 370M compute.

Tests:
- cargo test -p aprender-train --lib models::llama_370m → 11/11 pass.
- `pv validate contracts/model-families/llama-370m-sovereign-v1.yaml`
  → Contract is valid.
- cargo fmt -p aprender-train --check → clean.
- cargo clippy -p aprender-train --lib -- -D warnings → clean.

Refs: SHIP-TWO-001, task #151, FALSIFY-SHIP-018.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-04-23T18:08:40Z

Superseded by #1034 — rebased onto MODEL-1 stack + SHIP-017 + SHIP-020 at v2.36.0.

This was referenced Apr 22, 2026

feat(falsify-ship-005): MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise) PARTIAL discharge #1015

Merged

feat(falsify-ship-018): MODEL-2 AC-SHIP2-008 PARTIAL discharge (restacked) #1034

Closed

noahgift closed this Apr 23, 2026

noahgift wants to merge 1 commit into main from feat/falsify-ship-018-partial-discharge
