feat(falsify-ship-005): MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise) PARTIAL discharge by noahgift · Pull Request #1015 · paiml/aprender

noahgift · 2026-04-22T19:34:51Z

Summary

5th compute-free MODEL-1 PARTIAL lever. Wires AC-SHIP1-005
(`apr eval --benchmark humaneval` ≥ 86.00% pass@1, 1.2 pp noise
allowance → effective floor 84.80%) to a pure two-number threshold
verdict fn. Stacked on #1014 (SHIP-007).

- Contract `contracts/qwen2-e2e-verification-v1.yaml` v1.1.0 → v1.2.0:
  adds `FALSIFY-QW2E-SHIP-005` (ship_blocking, PARTIAL_ALGORITHM_LEVEL)
  binding three constants (nominal 86.00, noise 1.20 pp, effective
  ~84.80) and `verdict_from_pass_at_1(correct, total, threshold_pct)
  -> Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
- 8-section mutation survey: safe-margin Pass above effective /
  above nominal Pass / noise-window Fail at nominal (85/100) /
  below-effective Fail including HumanEval-canonical 139/164 =
  84.756% / monotonicity 0..=164 / div-safety + sanity guards /
  non-finite threshold Fail / tolerance-bounded provenance pin
  (f32 `86.0 − 1.2 ≈ 84.79999924` ≠ exact 84.80).
- Spec v2.26.0 → v2.27.0: AC-SHIP1-005 + FALSIFY-SHIP-005 rows
  annotated `(PARTIAL_ALGORITHM_LEVEL v2.27.0)`; amendment
  history entry notes MODEL-1 coverage 4/10 → 5/10 touched and
  11 PARTIAL + 3 DISCHARGED across both models.

Mirrors MODEL-2 SHIP-018 pass@1 shape but uniquely carries the
1.2 pp noise allowance (MODEL-2 has none). Authored self-contained
because SHIP-018 PR #1006 not yet on main; follow-up dedup tracked
when both MODEL-1 and MODEL-2 `verdict_from_pass_at_1_*` fns
converge.
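
For readers without the diff at hand, the shape of the verdict function described in this summary can be sketched roughly as follows. This is a minimal illustration, not the contents of `ship_005.rs`: the integer and float types, the shortened constant names, the `Pass` variant spelling, and the inclusive `>=` comparison are all assumptions; only the function name, the return type name, the three constant values, and the three conservative `Fail` guards come from the PR text.

```rust
/// Outcome of the threshold check. The type name comes from the PR;
/// the exact variant set is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship005Verdict {
    Pass,
    Fail,
}

// Hypothetical short names; the values are the three constants named in the PR.
pub const NOMINAL_PCT: f32 = 86.0;
pub const NOISE_ALLOWANCE_PP: f32 = 1.2;
pub const EFFECTIVE_PCT: f32 = NOMINAL_PCT - NOISE_ALLOWANCE_PP;

/// Pure threshold verdict: no model, no GPU, just counts and a floor.
#[must_use]
pub fn verdict_from_pass_at_1(correct: u32, total: u32, threshold_pct: f32) -> Ship005Verdict {
    // Conservative guards: empty benchmark (division safety), impossible
    // count (sanity), and NaN or infinite threshold all yield Fail.
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship005Verdict::Fail;
    }
    // Assumption: the floor is inclusive, matching the "≥" in the PR title.
    let pass_at_1_pct = 100.0 * (correct as f32) / (total as f32);
    if pass_at_1_pct >= threshold_pct {
        Ship005Verdict::Pass
    } else {
        Ship005Verdict::Fail
    }
}
```

Under these assumptions, 85/100 passes against the effective floor but fails against the nominal one, and 139/164 fails against both. Comparing `EFFECTIVE_PCT` to 84.80 with a small tolerance rather than exact equality keeps the check independent of how the subtraction rounds.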

Test plan

- `cargo test -p aprender-core --lib falsify_ship_005_humaneval_pass_at_1_threshold_logic`: 1 passed
- `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/qwen2-e2e-verification-v1.yaml`: Contract is valid
- Full discharge: live `apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with `--features cuda`; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance)
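
The full-discharge criterion above (median of three runs against the two floors) could be checked along these lines. This is a sketch under stated assumptions: `DischargeOutcome`, `median_of_three`, and `classify_median` are hypothetical names that do not appear in the PR, and the inputs are assumed to be pass@1 percentages already extracted from the `--json` output, whose schema is not described here.

```rust
/// Hypothetical three-way outcome for the live-run gate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DischargeOutcome {
    MeetsNominal,         // median >= 86.00
    WithinNoiseAllowance, // 84.80 <= median < 86.00
    BelowEffective,       // median < 84.80
}

/// Median of exactly three pass@1 percentages; None if any is NaN or infinite.
pub fn median_of_three(runs: [f32; 3]) -> Option<f32> {
    if runs.iter().any(|r| !r.is_finite()) {
        return None;
    }
    let mut sorted = runs;
    // total_cmp gives a total order on f32, so sort_by cannot panic.
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some(sorted[1])
}

/// Classify the median against the nominal and effective floors from the PR.
pub fn classify_median(runs: [f32; 3]) -> Option<DischargeOutcome> {
    let median = median_of_three(runs)?;
    Some(if median >= 86.0 {
        DischargeOutcome::MeetsNominal
    } else if median >= 84.8 {
        DischargeOutcome::WithinNoiseAllowance
    } else {
        DischargeOutcome::BelowEffective
    })
}
```

Using the median rather than the mean means a single outlier run cannot move the result across a floor on its own.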

🤖 Generated with Claude Code

…nEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%)

Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005) to a pure two-number threshold verdict fn. 5th compute-free MODEL-1 lever (SHIP-008 + SHIP-009 + SHIP-006 + SHIP-007 + SHIP-005) brings MODEL-1 AC-SHIP1 coverage to 5/10 touched.

Mirrors MODEL-2 SHIP-018 pattern (pass@1 threshold) but uniquely carries a 1.2 pp noise allowance called out by spec §4.2 AC-SHIP1-005.

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
- Adds FALSIFY-QW2E-SHIP-005 binding
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
    AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20
    AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
  to `verdict_from_pass_at_1(correct, total, threshold_pct) -> Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
- 8-section mutation survey:
  1. Safe-margin Pass above effective floor (85/100 = 85.0%)
  2. Above nominal floor (87/100 = 87.0%) Pass
  3. Noise-window Fail at nominal (85/100 Fails nominal)
  4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
  5. Monotonicity sweep correct=0..=164 at effective
  6. Div-safety (total=0) + sanity (correct>total) → Fail
  7. Non-finite threshold (NaN, ±∞) → Fail conservatively
  8. Tolerance-bounded provenance pin on all three constants (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
- `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `full_discharge_blocks_on: live apr eval --benchmark humaneval ...` on RTX 4090; 6 named counter_example_classes.

crates/aprender-core/src/metrics/ship_005.rs (NEW, 310 lines):
- Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
- `#[must_use] pub fn verdict_from_pass_at_1(...)` returns `Ship005Verdict::Fail` conservatively on: total=0 (div guard), correct>total (sanity), !threshold.is_finite() (NaN/±∞).
- `falsify_ship_005_humaneval_pass_at_1_threshold_logic`: 1 passing.

Spec `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.

Authored self-contained because MODEL-2 SHIP-018 sibling PR has not yet landed on main. Once it does, the two `verdict_from_pass_at_1_*` fns should be dedup'd into a single parameterized helper.

Full discharge blocks on: live `apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with --features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance).

Tests:
  cargo test -p aprender-core --lib \
    falsify_ship_005_humaneval_pass_at_1_threshold_logic

Contract:
  cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
    contracts/qwen2-e2e-verification-v1.yaml

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
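
The monotonicity sweep listed in the mutation survey (item 5) can be illustrated with a small standalone check. This is a sketch, not the test in `ship_005.rs`: `passes`, `first_passing_count`, and `is_monotone` are hypothetical helpers, and the inclusive comparison and `u32`/`f32` types are assumptions.

```rust
/// Hypothetical boolean form of the threshold check, with the same
/// conservative guards the commit message describes.
fn passes(correct: u32, total: u32, threshold_pct: f32) -> bool {
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return false;
    }
    let pct = 100.0 * (correct as f32) / (total as f32);
    pct >= threshold_pct
}

/// Smallest `correct` that passes for a given benchmark size, if any.
pub fn first_passing_count(total: u32, threshold_pct: f32) -> Option<u32> {
    (0..=total).find(|&c| passes(c, total, threshold_pct))
}

/// Monotonicity: once some count passes, every larger count must pass too.
pub fn is_monotone(total: u32, threshold_pct: f32) -> bool {
    let mut seen_pass = false;
    for c in 0..=total {
        let p = passes(c, total, threshold_pct);
        if seen_pass && !p {
            return false; // a Pass followed by a Fail breaks monotonicity
        }
        seen_pass |= p;
    }
    true
}
```

On the 164-problem HumanEval set, this puts the first passing count at 140 for the 84.80 floor and at 142 for the 86.00 floor, which is consistent with 139/164 = 84.756% failing the effective floor.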

noahgift · 2026-04-22T22:31:08Z

Superseded by #1021: This PR was merged into the stacked SHIP-007 branch (feat/falsify-ship-007-partial-discharge), not into main. The SHIP-005 delta has been rebuilt on a fresh branch from main and re-opened as #1021 — identical code/contract/spec modulo commit-message notes that SHIP-005 now stands alone since SHIP-007 #1019 remains blocked on runner-disk-guard infra defect #1020.

…@1 ≥86.00% (1.2 pp noise → 84.80%) (#1021)

* feat(ship-two-001): FALSIFY-SHIP-005 PARTIAL discharge: MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%) clean-branch rebuild

Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005) to a pure two-number threshold verdict fn.

Clean-branch rebuild of the SHIP-005 delta from the now-superseded stacked PR #1015 (commit 8c497a0 which layered on top of SHIP-007 PR #1019 prior to SHIP-002 landing). Re-based directly on current main (f615148) so SHIP-005 stands alone; SHIP-007 (#1019) remains open blocked on infra defect #1020 (runner-disk-guard cross-runner race) and SHIP-018 (#1004) still in flight.

MODEL-1 AC-SHIP1 coverage on main: 3/10 touched (SHIP-009 + SHIP-008 + SHIP-006) pre-SHIP-002; 4/10 once SHIP-002 landed via f615148; 5/10 once this PR lands.

Mirrors MODEL-2 SHIP-018 shape (pass@1 threshold) but uniquely carries a 1.2 pp noise allowance called out by spec §4.2 AC-SHIP1-005; MODEL-2 has no noise window.

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
- Adds FALSIFY-QW2E-SHIP-005 binding
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
    AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20
    AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
  to `verdict_from_pass_at_1(correct, total, threshold_pct) -> Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
- 8-section mutation survey:
  1. Safe-margin Pass above effective floor (85/100 = 85.0%)
  2. Above nominal floor (87/100 = 87.0%) Pass
  3. Noise-window Fail at nominal (85/100 Fails nominal)
  4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
  5. Monotonicity sweep correct=0..=164 at effective
  6. Div-safety (total=0) + sanity (correct>total) → Fail
  7. Non-finite threshold (NaN, ±∞) → Fail conservatively
  8. Tolerance-bounded provenance pin on all three constants (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
- `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `full_discharge_blocks_on: live apr eval --benchmark humaneval ...` on RTX 4090; 6 named counter_example_classes.

crates/aprender-core/src/metrics/ship_005.rs (NEW, 305 lines):
- Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
- `#[must_use] pub fn verdict_from_pass_at_1(...)` returns `Ship005Verdict::Fail` conservatively on: total=0 (div guard), correct>total (sanity), !threshold.is_finite() (NaN/±∞).
- `falsify_ship_005_humaneval_pass_at_1_threshold_logic`: 1 passing.

Spec `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.

Authored self-contained because SHIP-018 PR #1004 and SHIP-007 PR #1019 are not yet on main. Once they land, the two (or three) `verdict_from_pass_at_1_*` fns should be dedup'd into a single parameterized helper.

Full discharge blocks on: live `apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with --features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance).

Tests:
  cargo test -p aprender-core --lib \
    falsify_ship_005_humaneval_pass_at_1_threshold_logic

Contract:
  cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
    contracts/qwen2-e2e-verification-v1.yaml

Supersedes #1015 (stacked-branch original).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: retrigger after disk-guard stuck workspace-test (#1021)

Previous run 24806028080 workspace-test stuck at 37+min on lib-tests step (vs typical 19min). Canceled. Re-triggering on fresh runner. Tracking: infra issue #1020 second incidence.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift merged commit 8c497a0 into feat/falsify-ship-007-partial-discharge Apr 22, 2026
1 check passed

noahgift deleted the feat/falsify-ship-005-partial-discharge branch April 22, 2026 19:36

noahgift mentioned this pull request Apr 22, 2026

feat(ship-two-001): FALSIFY-SHIP-005 PARTIAL — MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise → 84.80%) #1021

Merged

6 tasks

noahgift mentioned this pull request Apr 23, 2026

CI infra: disk-guard cross-runner race wipes target/ mid-build (ci/coverage persistent flake) #1020

Closed

