falsify(ship): SHIP-018 PARTIAL — humaneval pass@1 ≥30.0% threshold fn #1006

Closed
noahgift wants to merge 1 commit into main from feat/falsify-ship-018-partial-discharge
@noahgift

Summary

Discharges FALSIFY-SHIP-018 / AC-SHIP2-008 / GATE-ARCH-370M-007 at
PARTIAL_ALGORITHM_LEVEL. Lands a pure two-number threshold function
verdict_from_pass_at_1(correct, total, threshold_pct) -> Ship018Verdict
in crates/aprender-train/src/models/llama_370m.rs plus const
AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT = 30.0 pinning the spec
§5.2 AC-SHIP2-008 floor — proves the HumanEval pass@1 decision rule
at cargo test time, independent of a trained 370M artifact.

This is the 6th PARTIAL for MODEL-2 (after SHIP-012/015/017/019/020).
Spec v2.22.0's "exhausted" verdict is now falsified 4 times.

What ships today

  • verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship018Verdict
    — inclusive-floor comparison with conservative-Fail guards.
  • Ship018Verdict enum { Pass, Fail }.
  • AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT: f32 = 30.0 (contract floor).
  • Two unit tests:
    • falsify_ship_018_humaneval_pass_at_1_threshold_logic — boundary
      (f32-exact 50/100 = 50.0% with ±ULP proving inclusive >=),
      below-floor rejection (49/164 ≈ 29.88%, 29/100 = 29.0%), generous
      pass (82/164, 164/164), hard-red (0/164, 1/164), monotonicity
      sweep (correct ∈ 0..=164 at total=164 never flips Pass → Fail),
      div-safety (total=0 fails closed), correct>total sanity, non-finite
      threshold guard (NaN / ±∞ all Fail), provenance pinning (const
      stays = 30.0).
    • falsify_ship_018_gate_arch_370m_007_has_partial_discharge_marker
      — parses SOVEREIGN_CONTRACT_YAML via serde_yaml, asserts the
      gate binds FALSIFY-SHIP-018 / AC-SHIP2-008,
      discharge_status == PARTIAL_ALGORITHM_LEVEL,
      evidence_discharged_by non-empty, full_discharge_blocks_on
      present, ship_blocking: true.
  • contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 →
    v1.6.0 (stays ACTIVE) with new GATE-ARCH-370M-007 block + changelog
    entry.
  • docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0
    → v2.24.0 with amendment block.
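
For readers without the diff open, the decision rule listed above can be sketched as follows. This is a minimal sketch, not the code in llama_370m.rs: the signature, the enum, and the constant come from the summary, while the function body (including the order of the guards and the `as f32` conversion) is an assumption reconstructed from the described behaviour.

```rust
/// Outcome of the HumanEval pass@1 gate (enum name taken from the PR summary).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship018Verdict {
    Pass,
    Fail,
}

/// Contract floor in percent (value taken from the PR summary).
pub const AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT: f32 = 30.0;

/// ASSUMED body: only the signature and the described behaviour come from the PR.
pub fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship018Verdict {
    // Conservative-Fail guards: empty run, impossible count, or NaN / ±∞ threshold.
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship018Verdict::Fail;
    }
    // pass@1 as a percentage; the comparison is an inclusive floor (>=).
    let pass_at_1_pct = (correct as f32 / total as f32) * 100.0;
    if pass_at_1_pct >= threshold_pct {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}

fn main() {
    let floor = AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT;
    println!("{:?}", verdict_from_pass_at_1(49, 164, floor)); // Fail (about 29.88%)
    println!("{:?}", verdict_from_pass_at_1(82, 164, floor)); // Pass (50.0%)
    println!("{:?}", verdict_from_pass_at_1(0, 0, floor)); // Fail (total = 0 fails closed)
}
```

Placing the guards before the division means every malformed input maps to Fail instead of depending on how NaN or infinity behaves in a floating-point comparison.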

Pattern lesson

The "exhausted" verdict has now been falsified 4 times:
SHIP-019 → SHIP-017 → SHIP-020 → SHIP-018. Whenever a SHIP-spec
gate names a threshold/tolerance/ratio/cut-off and the compute-heavy
harness is separable from the decision function, the threshold
function can land today. Remaining candidate for a further PARTIAL,
worth a survey pass: SHIP-016 (apr qa 8-of-8 aggregate; separable
but not a single-threshold proof). SHIP-013/014 genuinely need real
RTX 4090 compute (CE loss + 21-day wall-clock).

Full discharge blocks on

Real 370M .apr checkpoint from AC-SHIP2-003/004 compute-dispatch +
three independent apr eval --benchmark humaneval --json median
pass@1 values at seed=0 on the SHIP-TWO-001 canonical host; feed
each into verdict_from_pass_at_1 and require all three Pass.
Fixture-swap only — no harness rewrite required.
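
The "all three must Pass" rule for full discharge could look roughly like the sketch below. It assumes each run has already been reduced to a (correct, total) pair; the JSON schema of apr eval is not shown in this PR, so no parsing is attempted. `full_discharge_verdict` is a hypothetical helper name, and the body of `verdict_from_pass_at_1` is again an assumed reconstruction.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ship018Verdict {
    Pass,
    Fail,
}

/// ASSUMED body, reconstructed from the behaviour described in the PR.
fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship018Verdict {
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship018Verdict::Fail;
    }
    if (correct as f32 / total as f32) * 100.0 >= threshold_pct {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}

/// HYPOTHETICAL helper: Pass only if every one of the three runs passes.
/// Each run is a (correct, total) pair assumed to be extracted upstream.
fn full_discharge_verdict(runs: &[(usize, usize); 3], threshold_pct: f32) -> Ship018Verdict {
    let all_pass = runs
        .iter()
        .all(|&(c, t)| verdict_from_pass_at_1(c, t, threshold_pct) == Ship018Verdict::Pass);
    if all_pass {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}

fn main() {
    // Illustrative counts only, not measured results.
    let runs = [(50, 164), (52, 164), (49, 164)];
    println!("{:?}", full_discharge_verdict(&runs, 30.0)); // Fail: the third run is below the floor
}
```

Requiring every run to pass, rather than averaging them, keeps the gate conservative: a single sub-floor run blocks discharge.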

Status after this PR

MODEL-2 ship-gates: 3/12 ACTIVE (001, 011, 012) + 6/12 PARTIAL
(002 via SHIP-012, 005 via SHIP-015, 007 via SHIP-017, 008 via
SHIP-018, 009 via SHIP-019, 010 via SHIP-020) = 9/12 touched (75.0%).
Remaining 3 genuinely compute-blocked: 003 (CE ≤ 2.2 val loss),
004 (≤21-day wall-clock), 006 (apr qa aggregate).

Also bundled

A 6-line pre-existing fmt fix in crates/aprender-train/src/train/device.rs
under Toyota Way "all defects are your defects" — same pattern as
PR #1005. Without it, cargo fmt -p aprender-train --check fails on
main for reasons unrelated to this PR.

Test plan

  • cargo test -p aprender-train --lib models::llama_370m
    11/11 pass (2 new + 9 existing).
  • pv validate contracts/model-families/llama-370m-sovereign-v1.yaml
    Contract is valid.
  • cargo fmt -p aprender-train --check → clean.
  • cargo clippy -p aprender-train --lib -- -D warnings → clean.
  • CI (workspace-test, ci/test, ci/lint, ci/coverage) green.

Refs: SHIP-TWO-001, task #151, FALSIFY-SHIP-018, GATE-ARCH-370M-007.

🤖 Generated with Claude Code

AC-SHIP2-008 / FALSIFY-SHIP-018 bound via new GATE-ARCH-370M-007 at
PARTIAL_ALGORITHM_LEVEL. Pure two-number threshold fn
`verdict_from_pass_at_1(correct, total, threshold_pct)` + const
`AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT = 30.0` in
crates/aprender-train/src/models/llama_370m.rs — proves the spec's
'HumanEval pass@1 ≥ 30.0%' decision rule at `cargo test` time,
independent of a trained artifact. Two unit tests prove:

  - boundary (f32-exact 50/100 = 50.0% with ±ULP shift showing `>=`
    is inclusive; 49/164 and 29/100 fail the 30.0 floor)
  - monotonicity (correct sweep 0..=164 at total=164 never flips
    Pass → Fail)
  - div-safety (total=0 fails closed) + sanity (correct>total fails)
  - non-finite threshold guard (NaN / ±∞ all Fail)
  - provenance pin (const stays = 30.0)
  - YAML marker (GATE-ARCH-370M-007 carries PARTIAL_ALGORITHM_LEVEL,
    binds AC-SHIP2-008, cites FALSIFY-SHIP-018, ship_blocking:true)
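
The boundary and monotonicity properties in the list above can be checked along these lines. This is an illustrative sketch, not the test added by the commit: `ulp_up`, `ulp_down`, and `sweep_is_monotone` are hypothetical helpers, the ±ULP shift is done through the bit pattern (valid for positive finite values), and the verdict function body is an assumed reconstruction.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ship018Verdict {
    Pass,
    Fail,
}

/// ASSUMED body, reconstructed from the behaviour described in the commit message.
fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship018Verdict {
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship018Verdict::Fail;
    }
    if (correct as f32 / total as f32) * 100.0 >= threshold_pct {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}

/// HYPOTHETICAL helpers: next representable f32 above / below a positive finite x.
fn ulp_up(x: f32) -> f32 {
    f32::from_bits(x.to_bits() + 1)
}
fn ulp_down(x: f32) -> f32 {
    f32::from_bits(x.to_bits() - 1)
}

/// HYPOTHETICAL helper: true if the verdict never flips Pass -> Fail as `correct` grows.
fn sweep_is_monotone(total: usize, threshold_pct: f32) -> bool {
    let mut seen_pass = false;
    for correct in 0..=total {
        match verdict_from_pass_at_1(correct, total, threshold_pct) {
            Ship018Verdict::Pass => seen_pass = true,
            Ship018Verdict::Fail if seen_pass => return false,
            Ship018Verdict::Fail => {}
        }
    }
    true
}

fn main() {
    // 50/100 is exactly 50.0 in f32, so it sits exactly on a 50.0 threshold.
    assert_eq!(verdict_from_pass_at_1(50, 100, 50.0), Ship018Verdict::Pass);
    assert_eq!(verdict_from_pass_at_1(50, 100, ulp_up(50.0)), Ship018Verdict::Fail);
    assert_eq!(verdict_from_pass_at_1(50, 100, ulp_down(50.0)), Ship018Verdict::Pass);
    assert!(sweep_is_monotone(164, 30.0));
    println!("boundary and monotonicity checks passed");
}
```

The boundary uses 50/100 rather than a value on the 30.0 floor because 0.5 is exactly representable in f32, so the one-ULP shift isolates the `>=` comparison from rounding in the division.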

Full discharge blocks on real 370M .apr (AC-SHIP2-003/004 compute)
+ three seed=0 `apr eval --benchmark humaneval --json` median
pass@1 values fed into the verdict fn — all three must Pass.
Fixture-swap only; no harness rewrite.

6th PARTIAL for MODEL-2 (after SHIP-012/015/017/019/020). Spec
v2.22.0's 'exhausted' verdict now falsified 4×. Remaining candidate
for a further PARTIAL: SHIP-016 (`apr qa` 8-of-8 aggregate; not a
single threshold). SHIP-013/014 genuinely need real compute.

Contract: llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0 (stays ACTIVE).
Spec: ship-two-models-spec.md v2.23.0 → v2.24.0 (amendment block).
Also: 6-line pre-existing fmt fix in train/device.rs under Toyota
Way "all defects are your defects" (same pattern as PR #1005).

Status: MODEL-2 ship-gates 3/12 ACTIVE + 6/12 PARTIAL = 9/12 touched
(75.0%). Remaining 3 (003/004/006) all need real 370M compute.

Tests: cargo test -p aprender-train --lib models::llama_370m → 11/11
pass. `pv validate contracts/model-families/llama-370m-sovereign-v1.yaml`
→ Contract is valid. cargo fmt -p aprender-train --check → clean.
cargo clippy -p aprender-train --lib -- -D warnings → clean.

Refs: SHIP-TWO-001, task #151, FALSIFY-SHIP-018.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift

Superseded by #1034 — rebased onto MODEL-1 stack + SHIP-017 + SHIP-020 at v2.36.0.

@noahgift noahgift closed this Apr 23, 2026