feat(ship-two-001): FALSIFY-SHIP-005 PARTIAL — MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise → 84.80%) by noahgift · Pull Request #1021 · paiml/aprender

noahgift · 2026-04-22T22:30:26Z

Summary

Wires MODEL-1 apr eval --benchmark humaneval ship floor (AC-SHIP1-005) to a pure two-number threshold verdict fn at PARTIAL_ALGORITHM_LEVEL.
Clean-branch rebuild of the SHIP-005 delta from the now-superseded stacked PR #1015, "feat(falsify-ship-005): MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise) PARTIAL discharge" (commit 8c497a0, layered atop SHIP-007 PR #1019, "feat(falsify-ship-007): MODEL-1 apr bench decode ≥30 tok/s PARTIAL (clean branch)"). Rebased on current main (f615148) so SHIP-005 stands alone.
Mirrors MODEL-2 SHIP-018 shape (pass@1 threshold) but uniquely carries a 1.2 pp noise allowance per spec §4.2 AC-SHIP1-005. MODEL-2 has no noise window.

Coverage delta

MODEL-1 AC-SHIP1 coverage: 4/10 → 5/10 (SHIP-009 + SHIP-008 + SHIP-006 + SHIP-002 → + SHIP-005). Remaining 5 MODEL-1 levers (001/003/004/007/010) touched at 0, 1, 1, 1 respectively (SHIP-007 pending #1019 infra-blocked).

Contract delta

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0 — adds FALSIFY-QW2E-SHIP-005 binding:

AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20
AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80

to verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship005Verdict in crates/aprender-core/src/metrics/ship_005.rs — conservative Fail on total=0, correct>total, or non-finite threshold_pct (NaN, ±∞).
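The binding above can be sketched as follows. This is a minimal illustration, not the contents of ship_005.rs: the `Pass` variant, the derivation of the effective constant by subtraction, the `>=` comparison, and the f64 intermediate are assumptions; only the constant names, the function signature, and the three conservative Fail guards come from the description.

```rust
/// Nominal ship floor for MODEL-1 HumanEval pass@1, in percent.
pub const AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT: f32 = 86.00;
/// Noise allowance, in percentage points.
pub const AC_SHIP1_005_NOISE_ALLOWANCE_PP: f32 = 1.20;
/// Effective floor (about 84.80). Assumed here to be derived by subtraction.
pub const AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT: f32 =
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT - AC_SHIP1_005_NOISE_ALLOWANCE_PP;

/// Two-state verdict. `Fail` is named in the PR; `Pass` is an assumed counterpart.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship005Verdict {
    Pass,
    Fail,
}

/// Pure threshold verdict: no I/O, no model, just two counts and a floor.
#[must_use]
pub fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship005Verdict {
    // Conservative guards: division safety, input sanity, non-finite threshold.
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship005Verdict::Fail;
    }
    // Multiply before dividing so that exact percentages (e.g. 86/100) stay exact.
    let pass_pct = (correct as f64) * 100.0 / (total as f64);
    if pass_pct >= f64::from(threshold_pct) {
        Ship005Verdict::Pass
    } else {
        Ship005Verdict::Fail
    }
}
```

Under these assumptions, 85/100 passes the effective floor but fails the nominal one, and the HumanEval-shaped 139/164 (about 84.756%) fails both.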

Algorithm-level mutation survey (8 sections)

Safe-margin Pass above effective floor (85/100 = 85.0%)
Above nominal floor (87/100 = 87.0%) Pass
Noise-window Fail at nominal (85/100 Fails nominal threshold)
Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
Monotonicity sweep correct=0..=164 at effective threshold
Div-safety (total=0) + sanity (correct>total) → Fail
Non-finite threshold (NaN, ±∞) → Fail conservatively
Tolerance-bounded provenance pin on all three constants (f32 86.0 − 1.2 ≈ 84.79999924 ≠ exact 84.80 — pin uses < 1e-4)
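Two of the survey sections, the monotonicity sweep and the tolerance-bounded provenance pin, lend themselves to small generic helpers. The sketch below is illustrative only: `is_monotone`, `first_passing_count`, and `provenance_pinned` are hypothetical names that do not appear in the PR, and the predicate is passed in as a closure so the helpers make no assumption about how the real verdict fn is written.

```rust
/// Monotonicity sweep: once some `correct` count passes, every larger count
/// must pass too. Returns false if a pass is ever followed by a fail.
pub fn is_monotone<F: Fn(usize) -> bool>(total: usize, passes: F) -> bool {
    let mut seen_pass = false;
    for correct in 0..=total {
        let p = passes(correct);
        if seen_pass && !p {
            return false;
        }
        seen_pass |= p;
    }
    true
}

/// Smallest `correct` count in 0..=total that passes, if any.
pub fn first_passing_count<F: Fn(usize) -> bool>(total: usize, passes: F) -> Option<usize> {
    (0..=total).find(|&c| passes(c))
}

/// Provenance pin: checks that `effective` equals `nominal - noise` within
/// `tol`, rather than bit-for-bit, because f32 subtraction need not land
/// exactly on the decimal literal.
pub fn provenance_pinned(nominal: f32, noise: f32, effective: f32, tol: f32) -> bool {
    ((nominal - noise) - effective).abs() < tol
}
```

For a 164-problem run against an 84.80% floor, a sweep of this kind should place the first passing count at 140, one above the 139/164 counter-example listed in the survey.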

Spec delta

docs/specifications/aprender-train/ship-two-models-spec.md v2.26.0 → v2.27.0:

AC-SHIP1-005 row annotated **(PARTIAL_ALGORITHM_LEVEL v2.27.0)**
FALSIFY-SHIP-005 row in §6 annotated with contract + test reference
Amendment entry noting 11 PARTIAL + 3 DISCHARGED across both models; MODEL-1 at 5/10

Full discharge blocks on

Live apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json on RTX 4090 with --features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance).
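The live gate reduces to a median over three pass@1 readings compared against a floor. A minimal sketch, assuming the three percentages have already been parsed from the `--json` output (the parsing step and both helper names are hypothetical, and the floor is left as a parameter because the description allows either 86.00 or 84.80):

```rust
/// Median of exactly three pass@1 percentages.
/// Returns None if any reading is non-finite, mirroring the conservative
/// treatment of NaN and ±∞ elsewhere in this PR.
pub fn median_of_three(mut runs: [f32; 3]) -> Option<f32> {
    if runs.iter().any(|r| !r.is_finite()) {
        return None;
    }
    // total_cmp gives a total order on f32, so the sort cannot panic.
    runs.sort_by(|a, b| a.total_cmp(b));
    Some(runs[1])
}

/// True when the median of the three runs is at or above `floor_pct`.
/// A non-finite floor or reading yields false.
pub fn median_meets_floor(runs: [f32; 3], floor_pct: f32) -> bool {
    floor_pct.is_finite() && median_of_three(runs).map_or(false, |m| m >= floor_pct)
}
```

With made-up readings of 85.4, 86.6, and 84.1, the median is 85.4, which clears the 84.80 effective floor but not the 86.00 nominal one.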

Supersedes

PR feat(falsify-ship-005): MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise) PARTIAL discharge #1015 (stacked-branch original; can be closed after this merges)

Test plan

cargo test -p aprender-core --lib falsify_ship_005_humaneval_pass_at_1_threshold_logic — PASS
cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/qwen2-e2e-verification-v1.yaml — 0 errors, 0 warnings, "Contract is valid."
cargo fmt --package aprender-core --check — clean
cargo clippy -p aprender-core --lib --no-deps — clean
CI ci / gate + workspace-test green (may hit infra defect #1020, "CI infra: disk-guard cross-runner race wipes target/ mid-build (ci/coverage persistent flake)", if the runner-disk-guard race reappears)
Admin merge when green

🤖 Generated with Claude Code

…nEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%) clean-branch rebuild

Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005) to a pure two-number threshold verdict fn.

Clean-branch rebuild of the SHIP-005 delta from the now-superseded stacked PR #1015 (commit 8c497a0, which layered on top of SHIP-007 PR #1019 prior to SHIP-002 landing). Re-based directly on current main (f615148) so SHIP-005 stands alone; SHIP-007 (#1019) remains open, blocked on infra defect #1020 (runner-disk-guard cross-runner race), and SHIP-018 (#1004) is still in flight.

MODEL-1 AC-SHIP1 coverage on main: 3/10 touched (SHIP-009 + SHIP-008 + SHIP-006) pre-SHIP-002; 4/10 once SHIP-002 landed via f615148; 5/10 once this PR lands.

Mirrors MODEL-2 SHIP-018 shape (pass@1 threshold) but uniquely carries a 1.2 pp noise allowance called out by spec §4.2 AC-SHIP1-005. MODEL-2 has no noise window.

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
- Adds FALSIFY-QW2E-SHIP-005 binding
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
    AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20
    AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
  to `verdict_from_pass_at_1(correct, total, threshold_pct) -> Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
- 8-section mutation survey:
  1. Safe-margin Pass above effective floor (85/100 = 85.0%)
  2. Above nominal floor (87/100 = 87.0%) Pass
  3. Noise-window Fail at nominal (85/100 Fails nominal)
  4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
  5. Monotonicity sweep correct=0..=164 at effective
  6. Div-safety (total=0) + sanity (correct>total) → Fail
  7. Non-finite threshold (NaN, ±∞) → Fail conservatively
  8. Tolerance-bounded provenance pin on all three constants (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
- `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `full_discharge_blocks_on: live apr eval --benchmark humaneval ...` on RTX 4090; 6 named counter_example_classes.

crates/aprender-core/src/metrics/ship_005.rs (NEW, 305 lines):
- Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
- `#[must_use] pub fn verdict_from_pass_at_1(...)` returns `Ship005Verdict::Fail` conservatively on: total=0 (div guard), correct>total (sanity), !threshold.is_finite() (NaN/±∞).
- `falsify_ship_005_humaneval_pass_at_1_threshold_logic`: 1 passing.

Spec `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.

Authored self-contained because SHIP-018 PR #1004 and SHIP-007 PR #1019 are not yet on main. Once they land, the two (or three) `verdict_from_pass_at_1_*` fns should be dedup'd into a single parameterized helper.

Full discharge blocks on: live `apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with --features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance).

Tests:
    cargo test -p aprender-core --lib \
        falsify_ship_005_humaneval_pass_at_1_threshold_logic

Contract:
    cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
        contracts/qwen2-e2e-verification-v1.yaml

Supersedes #1015 (stacked-branch original).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This was referenced Apr 22, 2026

feat(falsify-ship-005): MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise) PARTIAL discharge #1015

Merged

CI infra: disk-guard cross-runner race wipes target/ mid-build (ci/coverage persistent flake) #1020

Closed

noahgift added 3 commits April 23, 2026 01:08

ci: retrigger after disk-guard stuck workspace-test (#1021)

9f190fc

Previous run 24806028080 workspace-test stuck at 37+min on lib-tests step (vs typical 19min). Canceled. Re-triggering on fresh runner. Tracking: infra issue #1020 second incidence.

Merge branch 'main' into feat/falsify-ship-005-clean

9ed9c30

Merge branch 'main' into feat/falsify-ship-005-clean

6e3ab40

noahgift merged commit 9e3286d into main Apr 23, 2026
17 of 20 checks passed

noahgift deleted the feat/falsify-ship-005-clean branch April 23, 2026 01:20

noahgift merged 4 commits into main from feat/falsify-ship-005-clean
