falsify(ship): SHIP-018 PARTIAL - humaneval pass@1 ≥30.0% threshold fn #1006
Closed
noahgift wants to merge 1 commit into
AC-SHIP2-008 / FALSIFY-SHIP-018 bound via new GATE-ARCH-370M-007 at
PARTIAL_ALGORITHM_LEVEL. Pure two-number threshold fn
`verdict_from_pass_at_1(correct, total, threshold_pct)` + const
`AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT = 30.0` in
crates/aprender-train/src/models/llama_370m.rs — proves the spec's
'HumanEval pass@1 ≥ 30.0%' decision rule at `cargo test` time,
independent of a trained artifact. Two unit tests prove:
- boundary (f32-exact 50/100 = 50.0% with ±ULP shift showing `>=`
is inclusive; 49/164 and 29/100 fail the 30.0 floor)
- monotonicity (correct sweep 0..=164 at total=164 never flips
Pass → Fail)
- div-safety (total=0 fails closed) + sanity (correct>total fails)
- non-finite threshold guard (NaN / ±∞ all Fail)
- provenance pin (const stays = 30.0)
- YAML marker (GATE-ARCH-370M-007 carries PARTIAL_ALGORITHM_LEVEL,
binds AC-SHIP2-008, cites FALSIFY-SHIP-018, ship_blocking:true)
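The properties listed above can be illustrated with a minimal sketch of the threshold function. The names and signature follow this PR's description; the body is an assumption reconstructed from the listed guards (fail closed on total=0, correct>total, and non-finite thresholds), not the code that landed in llama_370m.rs.

```rust
/// Verdict for the HumanEval pass@1 gate (names taken from the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship018Verdict {
    Pass,
    Fail,
}

/// Contract floor for AC-SHIP2-008, in percent.
pub const AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT: f32 = 30.0;

/// Sketch only: the guard order and the f32 arithmetic are assumptions.
pub fn verdict_from_pass_at_1(
    correct: usize,
    total: usize,
    threshold_pct: f32,
) -> Ship018Verdict {
    // Fail closed on degenerate inputs: empty benchmark, impossible
    // counts, or a NaN / infinite threshold.
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship018Verdict::Fail;
    }
    let pass_pct = (correct as f32 / total as f32) * 100.0;
    // Inclusive floor: exactly meeting the threshold passes.
    if pass_pct >= threshold_pct {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}
```

Under this sketch, 49/164 (about 29.88%) fails the 30.0 floor while 82/164 passes, matching the cases named in the list.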
Full discharge blocks on real 370M .apr (AC-SHIP2-003/004 compute)
+ three seed=0 `apr eval --benchmark humaneval --json` median
pass@1 values fed into the verdict fn — all three must Pass.
Fixture-swap only; no harness rewrite.
6th PARTIAL for MODEL-2 (after SHIP-012/015/017/019/020). Spec
v2.22.0's 'exhausted' verdict now falsified 4×. Remaining 5th-PARTIAL
candidate: SHIP-016 (`apr qa` 8-of-8 aggregate — not a single
threshold). SHIP-013/014 genuinely need real compute.
Contract: llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0 (stays ACTIVE).
Spec: ship-two-models-spec.md v2.23.0 → v2.24.0 (amendment block).
Also: 6-line pre-existing fmt fix in train/device.rs under Toyota
Way "all defects are your defects" (same pattern as PR #1005).
Status: MODEL-2 ship-gates 3/12 ACTIVE + 6/12 PARTIAL = 9/12 touched
(75.0%). Remaining 3 (003/004/006) all need real 370M compute.
Tests: cargo test -p aprender-train --lib models::llama_370m → 11/11
pass. `pv validate contracts/model-families/llama-370m-sovereign-v1.yaml`
→ Contract is valid. cargo fmt -p aprender-train --check → clean.
cargo clippy -p aprender-train --lib -- -D warnings → clean.
Refs: SHIP-TWO-001, task #151, FALSIFY-SHIP-018.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced Apr 22, 2026
Merged
Superseded by #1034, rebased onto MODEL-1 stack + SHIP-017 + SHIP-020 at v2.36.0.
Summary
Discharges FALSIFY-SHIP-018 / AC-SHIP2-008 / GATE-ARCH-370M-007 at
PARTIAL_ALGORITHM_LEVEL. Lands a pure two-number threshold function
`verdict_from_pass_at_1(correct, total, threshold_pct) -> Ship018Verdict` in
`crates/aprender-train/src/models/llama_370m.rs` plus const
`AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT = 30.0` pinning the spec
§5.2 AC-SHIP2-008 floor. This proves the HumanEval pass@1 decision rule
at `cargo test` time, independent of a trained 370M artifact.
This is the 6th PARTIAL for MODEL-2 (after SHIP-012/015/017/019/020).
Spec v2.22.0's "exhausted" verdict is now falsified 4 times.
What ships today
- `verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship018Verdict`:
  inclusive-floor comparison with conservative-Fail guards.
- `Ship018Verdict` enum `{ Pass, Fail }`.
- `AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT: f32 = 30.0` (contract floor).
- `falsify_ship_018_humaneval_pass_at_1_threshold_logic`: boundary
  (f32-exact 50/100 = 50.0% with ±ULP proving inclusive `>=`),
  below-floor rejection (49/164 ≈ 29.88%, 29/100 = 29.0%), generous
  pass (82/164, 164/164), hard-red (0/164, 1/164), monotonicity
  sweep (correct ∈ 0..=164 at total=164 never flips Pass → Fail),
  div-safety (total=0 fails closed), correct>total sanity, non-finite
  threshold guard (NaN / ±∞ all Fail), provenance pinning (const
  stays = 30.0).
- `falsify_ship_018_gate_arch_370m_007_has_partial_discharge_marker`: parses
  `SOVEREIGN_CONTRACT_YAML` via `serde_yaml`, asserts the gate binds
  FALSIFY-SHIP-018 / AC-SHIP2-008, `discharge_status == PARTIAL_ALGORITHM_LEVEL`,
  `evidence_discharged_by` non-empty, `full_discharge_blocks_on` present,
  `ship_blocking: true`.
- `contracts/model-families/llama-370m-sovereign-v1.yaml` v1.5.0 → v1.6.0
  (stays ACTIVE) with new GATE-ARCH-370M-007 block + changelog entry.
- `docs/specifications/aprender-train/ship-two-models-spec.md` v2.23.0 →
  v2.24.0 with amendment block.
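To make the boundary and monotonicity checks concrete, here is a hedged sketch of how they could be expressed. The compact `verdict_from_pass_at_1` below is an assumed stand-in for the real function, and `ulp_up`, `ulp_down`, `boundary_is_inclusive`, and `sweep_is_monotone` are hypothetical helper names that do not appear in the PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship018Verdict {
    Pass,
    Fail,
}

/// Assumed stand-in for the function described in this PR.
pub fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship018Verdict {
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship018Verdict::Fail;
    }
    if (correct as f32 / total as f32) * 100.0 >= threshold_pct {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}

/// Next representable f32 above `x` (hypothetical helper; valid only
/// for positive finite `x`, which is all this sketch needs).
fn ulp_up(x: f32) -> f32 {
    f32::from_bits(x.to_bits() + 1)
}

/// Next representable f32 below `x` (same restriction as `ulp_up`).
fn ulp_down(x: f32) -> f32 {
    f32::from_bits(x.to_bits() - 1)
}

/// 50/100 is exactly 50.0 in f32, so shifting the threshold by one ULP
/// in each direction shows that the comparison is an inclusive `>=`.
pub fn boundary_is_inclusive() -> bool {
    let t = 50.0_f32;
    verdict_from_pass_at_1(50, 100, t) == Ship018Verdict::Pass
        && verdict_from_pass_at_1(50, 100, ulp_down(t)) == Ship018Verdict::Pass
        && verdict_from_pass_at_1(50, 100, ulp_up(t)) == Ship018Verdict::Fail
}

/// Sweeps `correct` over 0..=total and reports whether the verdict
/// ever flips from Pass back to Fail.
pub fn sweep_is_monotone(total: usize, threshold_pct: f32) -> bool {
    let mut seen_pass = false;
    for correct in 0..=total {
        match verdict_from_pass_at_1(correct, total, threshold_pct) {
            Ship018Verdict::Pass => seen_pass = true,
            Ship018Verdict::Fail if seen_pass => return false,
            Ship018Verdict::Fail => {}
        }
    }
    true
}
```

The bit-increment helpers are used instead of a library next-float call so that the sketch does not depend on a particular toolchain version.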
Pattern lesson
The "exhausted" verdict has now been falsified 4 times:
SHIP-019 → SHIP-017 → SHIP-020 → SHIP-018. Whenever a SHIP-spec
gate names a threshold/tolerance/ratio/cut-off and the compute-heavy
harness is separable from the decision function, the threshold
function can land today. Remaining 5th-PARTIAL candidate worth a
survey pass: SHIP-016 (`apr qa` 8-of-8 aggregate: separable
but not a single-threshold proof). SHIP-013/014 genuinely need real
RTX 4090 compute (CE loss + 21-day wall-clock).
Full discharge blocks on
Real 370M .apr checkpoint from AC-SHIP2-003/004 compute-dispatch +
three independent `apr eval --benchmark humaneval --json` median
pass@1 values at seed=0 on the SHIP-TWO-001 canonical host; feed
each into `verdict_from_pass_at_1` and require all three Pass.
Fixture-swap only; no harness rewrite required.
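The all-three-must-Pass rule could be sketched as follows. This assumes each run is reduced to a `(correct, total)` pair, which the PR does not specify; `full_discharge_verdict` is a hypothetical name, and the compact `verdict_from_pass_at_1` is an assumed stand-in for the real function.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship018Verdict {
    Pass,
    Fail,
}

/// Contract floor for AC-SHIP2-008, in percent.
pub const AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT: f32 = 30.0;

/// Assumed stand-in for the function described in this PR.
pub fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship018Verdict {
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship018Verdict::Fail;
    }
    if (correct as f32 / total as f32) * 100.0 >= threshold_pct {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}

/// Hypothetical aggregate: Pass only when exactly three runs are
/// supplied and every one clears the contract floor. Requiring exactly
/// three is an assumption made here so that a missing run fails closed.
pub fn full_discharge_verdict(runs: &[(usize, usize)]) -> Ship018Verdict {
    let all_pass = runs.len() == 3
        && runs.iter().all(|&(correct, total)| {
            verdict_from_pass_at_1(correct, total, AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT)
                == Ship018Verdict::Pass
        });
    if all_pass {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}
```

With this shape, swapping the fixture counts for real evaluation counts is the only change needed; a single run at 49/164 is enough to fail the aggregate.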
Status after this PR
MODEL-2 ship-gates: 3/12 ACTIVE (001, 011, 012) + 6/12 PARTIAL
(002 via SHIP-012, 005 via SHIP-015, 007 via SHIP-017, 008 via
SHIP-018, 009 via SHIP-019, 010 via SHIP-020) = 9/12 touched (75.0%).
Remaining 3 genuinely compute-blocked: 003 (CE ≤ 2.2 val loss),
004 (≤21-day wall-clock), 006 (`apr qa` aggregate).
Also bundled
A 6-line pre-existing fmt fix in
`crates/aprender-train/src/train/device.rs` under Toyota Way
"all defects are your defects", the same pattern as
PR #1005. Without it, `cargo fmt -p aprender-train --check` fails on
main for reasons unrelated to this PR.
Test plan
- `cargo test -p aprender-train --lib models::llama_370m` → 11/11 pass (2 new + 9 existing).
- `pv validate contracts/model-families/llama-370m-sovereign-v1.yaml` → Contract is valid.
- `cargo fmt -p aprender-train --check` → clean.
- `cargo clippy -p aprender-train --lib -- -D warnings` → clean.

Refs: SHIP-TWO-001, task #151, FALSIFY-SHIP-018, GATE-ARCH-370M-007.
🤖 Generated with Claude Code