feat(ship-two-001): FALSIFY-SHIP-005 PARTIAL — MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise → 84.80%)#1021
Merged
…nEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%) clean-branch rebuild

Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005) to a pure two-number threshold verdict fn.

Clean-branch rebuild of the SHIP-005 delta from the now-superseded stacked PR #1015 (commit 8c497a0, which layered on top of SHIP-007 PR #1019 prior to SHIP-002 landing). Re-based directly on current main (f615148) so SHIP-005 stands alone; SHIP-007 (#1019) remains open, blocked on infra defect #1020 (runner-disk-guard cross-runner race), and SHIP-018 (#1004) is still in flight.

MODEL-1 AC-SHIP1 coverage on main: 3/10 touched (SHIP-009 + SHIP-008 + SHIP-006) pre-SHIP-002; 4/10 once SHIP-002 landed via f615148; 5/10 once this PR lands. Mirrors MODEL-2 SHIP-018 shape (pass@1 threshold) but uniquely carries a 1.2 pp noise allowance called out by spec §4.2 AC-SHIP1-005; MODEL-2 has no noise window.

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
- Adds FALSIFY-QW2E-SHIP-005 binding
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
    AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20
    AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
  to `verdict_from_pass_at_1(correct, total, threshold_pct) -> Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
- 8-section mutation survey:
    1. Safe-margin Pass above effective floor (85/100 = 85.0%)
    2. Above nominal floor (87/100 = 87.0%) Pass
    3. Noise-window Fail at nominal (85/100 Fails nominal)
    4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
    5. Monotonicity sweep correct=0..=164 at effective
    6. Div-safety (total=0) + sanity (correct>total) → Fail
    7. Non-finite threshold (NaN, ±∞) → Fail conservatively
    8. Tolerance-bounded provenance pin on all three constants (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
- `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `full_discharge_blocks_on: live apr eval --benchmark humaneval ...` on RTX 4090; 6 named counter_example_classes.

crates/aprender-core/src/metrics/ship_005.rs (NEW, 305 lines):
- Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
- `#[must_use] pub fn verdict_from_pass_at_1(...)` returns `Ship005Verdict::Fail` conservatively on: total=0 (div guard), correct>total (sanity), !threshold.is_finite() (NaN/±∞).
- `falsify_ship_005_humaneval_pass_at_1_threshold_logic`: 1 passing.

Spec `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends an amendment entry noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.

Authored self-contained because SHIP-018 PR #1004 and SHIP-007 PR #1019 are not yet on main. Once they land, the two (or three) `verdict_from_pass_at_1_*` fns should be dedup'd into a single parameterized helper.

Full discharge blocks on: live `apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with --features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance).

Tests:
    cargo test -p aprender-core --lib \
      falsify_ship_005_humaneval_pass_at_1_threshold_logic

Contract:
    cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
      contracts/qwen2-e2e-verification-v1.yaml

Supersedes #1015 (stacked-branch original).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
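For readers without the diff open, the verdict logic described in the commit message above can be sketched roughly as follows. This is an illustrative reconstruction, not the contents of `ship_005.rs`: the constant names, function signature, and `Ship005Verdict` name come from the PR text, while the enum's variants and derives, the use of `f64` for the ratio, and the `>=` comparison are assumptions.

```rust
/// Two-state verdict. The type name is from the PR text; the derive list is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship005Verdict {
    Pass,
    Fail,
}

/// Nominal ship floor in percent, as stated in the contract delta.
pub const AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT: f32 = 86.00;
/// Noise allowance in percentage points.
pub const AC_SHIP1_005_NOISE_ALLOWANCE_PP: f32 = 1.20;
/// Effective floor, derived here by subtraction (assumption: the real file may
/// use a literal). Compare with a tolerance rather than with `==`.
pub const AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT: f32 =
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT - AC_SHIP1_005_NOISE_ALLOWANCE_PP;

/// Returns `Pass` only when `correct / total`, as a percentage, meets the threshold.
/// Every malformed input fails conservatively.
#[must_use]
pub fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship005Verdict {
    // Division guard, sanity guard, and non-finite threshold guard (NaN, ±∞).
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship005Verdict::Fail;
    }
    // Multiply before dividing so that e.g. 85/100 is exactly 85.0.
    let pass_at_1_pct = ((correct as f64) * 100.0) / (total as f64);
    if pass_at_1_pct >= f64::from(threshold_pct) {
        Ship005Verdict::Pass
    } else {
        Ship005Verdict::Fail
    }
}
```

Under these assumptions, 85/100 passes at the effective floor but fails at the nominal one, and the HumanEval-canonical 139/164 fails at both, matching sections 1, 3, and 4 of the survey.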
Previous run 24806028080 workspace-test stuck at 37+min on lib-tests step (vs typical 19min). Canceled. Re-triggering on fresh runner. Tracking: infra issue #1020 second incidence.
Summary
Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005) to a pure two-number threshold verdict fn at `PARTIAL_ALGORITHM_LEVEL`.
Coverage delta
MODEL-1 AC-SHIP1 coverage: 4/10 → 5/10 (SHIP-009 + SHIP-008 + SHIP-006 + SHIP-002 → + SHIP-005). Remaining 5 MODEL-1 levers (001/003/004/007/010) touched at 0, 1, 1, 1 respectively (SHIP-007 pending #1019 infra-blocked).
Contract delta
`contracts/qwen2-e2e-verification-v1.yaml` v1.1.0 → v1.2.0: adds `FALSIFY-QW2E-SHIP-005` binding:

- `AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00`
- `AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20`
- `AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80`

to `verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`, with a conservative `Fail` on `total=0`, `correct>total`, or non-finite `threshold_pct` (NaN, ±∞).
Algorithm-level mutation survey (8 sections)
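1. Safe-margin `Pass` above effective floor (85/100 = 85.0%)
2. Above nominal floor (87/100 = 87.0%) → `Pass`
3. Noise-window `Fail` at nominal (85/100 Fails nominal threshold)
4. Below-effective `Fail` incl. HumanEval-canonical 139/164 = 84.756%
5. Monotonicity sweep correct=0..=164 at effective
6. Div-safety (total=0) + sanity (correct>total) → `Fail`
7. Non-finite threshold (NaN, ±∞) → `Fail` conservatively
8. Tolerance-bounded provenance pin on all three constants (`86.0 − 1.2 ≈ 84.79999924` ≠ exact 84.80; pin uses `< 1e-4`)

Spec delta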
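`docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0: annotates the AC-SHIP1-005 and FALSIFY-SHIP-005 rows `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends an amendment entry noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.
Full discharge blocks on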
Live `apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with `--features cuda`; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance).
Supersedes
Test plan
- `cargo test -p aprender-core --lib falsify_ship_005_humaneval_pass_at_1_threshold_logic`: PASS
- `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/qwen2-e2e-verification-v1.yaml`: 0 errors, 0 warnings, "Contract is valid."
- `cargo fmt --package aprender-core --check`: clean
- `cargo clippy -p aprender-core --lib --no-deps`: clean
- `ci / gate` + `workspace-test` green (may hit infra defect #1020, "CI infra: disk-guard cross-runner race wipes target/ mid-build (ci/coverage persistent flake)", if the runner-disk-guard race reappears)

🤖 Generated with Claude Code
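The full-discharge criterion stated earlier (median pass@1 across three live runs, compared against the nominal or effective floor) is not implemented by this PR. A minimal sketch of how such a check might look is given here. Every name in it (`median_of_three`, `meets_floor`, `NOMINAL_FLOOR_PCT`, `EFFECTIVE_FLOOR_PCT`) is hypothetical, and the inputs are assumed to be pass@1 percentages already parsed from the `--json` output.

```rust
/// Hypothetical floors in percent, mirroring the 86.00 / 84.80 figures in the PR text.
pub const NOMINAL_FLOOR_PCT: f64 = 86.00;
pub const EFFECTIVE_FLOOR_PCT: f64 = 84.80;

/// Median of exactly three pass@1 percentages.
/// Returns `None` if any value is non-finite, so bad data cannot pass silently.
pub fn median_of_three(runs: [f64; 3]) -> Option<f64> {
    if runs.iter().any(|r| !r.is_finite()) {
        return None;
    }
    let mut sorted = runs;
    // `total_cmp` gives a total order on f64, so the sort cannot panic.
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some(sorted[1])
}

/// True only when the median of the three runs meets `floor_pct`.
pub fn meets_floor(runs: [f64; 3], floor_pct: f64) -> bool {
    matches!(median_of_three(runs), Some(median) if median >= floor_pct)
}
```

With this shape, `meets_floor(runs, NOMINAL_FLOOR_PCT)` would answer the strict criterion and `meets_floor(runs, EFFECTIVE_FLOOR_PCT)` the noise-adjusted one; taking the median rather than the mean keeps a single outlier run from deciding the verdict.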