feat(ship-two-001): FALSIFY-SHIP-005 PARTIAL — MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise → 84.80%)#1021

Merged
noahgift merged 4 commits into main from feat/falsify-ship-005-clean on Apr 23, 2026

@noahgift

Contributor

Summary

Coverage delta

MODEL-1 AC-SHIP1 coverage: 4/10 → 5/10 (SHIP-009 + SHIP-008 + SHIP-006 + SHIP-002 → + SHIP-005). Remaining 5 MODEL-1 levers (001/003/004/007/010) touched at 0, 1, 1, 1 respectively (SHIP-007 pending #1019 infra-blocked).

Contract delta

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0 — adds FALSIFY-QW2E-SHIP-005 binding:

  • AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
  • AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20
  • AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80

to verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship005Verdict in crates/aprender-core/src/metrics/ship_005.rs — conservative Fail on total=0, correct>total, or non-finite threshold_pct (NaN, ±∞).
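A minimal sketch of the verdict logic described above, not the actual contents of `ship_005.rs`. It assumes a two-variant `Ship005Verdict` enum (only `Fail` is named in this PR; `Pass` and the derive list are assumptions), an inclusive `>=` comparison, and an effective floor stored as the literal `84.80`:

```rust
/// Verdict for the AC-SHIP1-005 HumanEval gate.
/// Assumption: the real enum has exactly these two variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship005Verdict {
    Pass,
    Fail,
}

/// Nominal ship floor, in percent.
pub const AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT: f32 = 86.00;
/// Noise allowance, in percentage points.
pub const AC_SHIP1_005_NOISE_ALLOWANCE_PP: f32 = 1.20;
/// Effective floor (nominal minus allowance). Assumption: stored as a literal.
pub const AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT: f32 = 84.80;

/// Maps a raw (correct, total) count to a verdict against `threshold_pct`.
/// Fails conservatively on every input that cannot be evaluated safely.
#[must_use]
pub fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship005Verdict {
    // Guards: division by zero, impossible counts, NaN or infinite threshold.
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship005Verdict::Fail;
    }
    // Assumption: the percentage is computed in f64 to limit rounding error.
    let pass_at_1_pct = 100.0 * correct as f64 / total as f64;
    if pass_at_1_pct >= f64::from(threshold_pct) {
        Ship005Verdict::Pass
    } else {
        Ship005Verdict::Fail
    }
}
```

Under these assumptions 85/100 passes the effective floor but fails the nominal one, which is the noise-window case the survey below exercises.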

Algorithm-level mutation survey (8 sections)

  1. Safe-margin Pass above effective floor (85/100 = 85.0%)
  2. Above nominal floor (87/100 = 87.0%) Pass
  3. Noise-window Fail at nominal (85/100 Fails nominal threshold)
  4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
  5. Monotonicity sweep correct=0..=164 at effective threshold
  6. Div-safety (total=0) + sanity (correct>total) → Fail
  7. Non-finite threshold (NaN, ±∞) → Fail conservatively
  8. Tolerance-bounded provenance pin on all three constants (f32 86.0 − 1.2 ≈ 84.79999924 ≠ exact 84.80 — pin uses < 1e-4)
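Sections 5 and 8 of the survey can be sketched as two small checks. The helper names below (`is_monotone_in_correct`, `first_passing_count`, `provenance_pin_holds`) are hypothetical and do not come from the PR; the predicate is passed in as a closure so the sketch stays independent of the real verdict function:

```rust
/// Section 5: returns true if `passes(correct)` never flips from pass back
/// to fail as `correct` sweeps 0..=total.
pub fn is_monotone_in_correct(total: usize, passes: impl Fn(usize) -> bool) -> bool {
    let mut seen_pass = false;
    for correct in 0..=total {
        let p = passes(correct);
        if seen_pass && !p {
            return false; // a Fail after a Pass breaks monotonicity
        }
        seen_pass |= p;
    }
    true
}

/// Smallest `correct` count that passes, if any count in 0..=total does.
pub fn first_passing_count(total: usize, passes: impl Fn(usize) -> bool) -> Option<usize> {
    (0..=total).find(|&c| passes(c))
}

/// Section 8: tolerance-bounded pin. f32 subtraction is not guaranteed to
/// reproduce the decimal result exactly, so compare within 1e-4.
pub fn provenance_pin_holds(nominal: f32, noise: f32, effective: f32) -> bool {
    ((nominal - noise) - effective).abs() < 1e-4
}
```

With HumanEval's 164 problems and an 84.80% floor, the first passing count in such a sweep is 140, since 139/164 = 84.756% falls just short.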

Spec delta

docs/specifications/aprender-train/ship-two-models-spec.md v2.26.0 → v2.27.0:

  • AC-SHIP1-005 row annotated **(PARTIAL_ALGORITHM_LEVEL v2.27.0)**
  • FALSIFY-SHIP-005 row in §6 annotated with contract + test reference
  • Amendment entry noting 11 PARTIAL + 3 DISCHARGED across both models; MODEL-1 at 5/10

Full discharge blocks on

Live apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json on RTX 4090 with --features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance).
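The median-of-three criterion can be sketched as follows. This is an illustration only: `median_of_three`, `classify_median`, and `DischargeOutcome` are hypothetical names, and parsing the `--json` output of `apr eval` is left out because its schema is not shown in this PR.

```rust
/// Nominal and effective floors from the contract, in percent.
const NOMINAL_PCT: f64 = 86.00;
const EFFECTIVE_PCT: f64 = 84.80;

/// Hypothetical outcome of a live three-run evaluation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DischargeOutcome {
    MeetsNominal,
    MeetsEffectiveOnly,
    BelowFloor,
}

/// Median of exactly three pass@1 percentages.
/// Returns None if any run reported a non-finite value.
pub fn median_of_three(runs: [f64; 3]) -> Option<f64> {
    if runs.iter().any(|r| !r.is_finite()) {
        return None;
    }
    let mut sorted = runs;
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some(sorted[1])
}

/// Classifies the median against both floors; unusable input is treated
/// conservatively as below the floor.
pub fn classify_median(runs: [f64; 3]) -> DischargeOutcome {
    match median_of_three(runs) {
        Some(m) if m >= NOMINAL_PCT => DischargeOutcome::MeetsNominal,
        Some(m) if m >= EFFECTIVE_PCT => DischargeOutcome::MeetsEffectiveOnly,
        _ => DischargeOutcome::BelowFloor,
    }
}
```

Using the median rather than the mean keeps a single outlier run from deciding the verdict in either direction.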

Supersedes

#1015 (stacked-branch original).

Test plan

  • cargo test -p aprender-core --lib falsify_ship_005_humaneval_pass_at_1_threshold_logic — PASS
  • cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/qwen2-e2e-verification-v1.yaml — 0 errors, 0 warnings, "Contract is valid."
  • cargo fmt --package aprender-core --check — clean
  • cargo clippy -p aprender-core --lib --no-deps — clean
  • CI ci / gate + workspace-test green (may hit infra defect #1020, "CI infra: disk-guard cross-runner race wipes target/ mid-build (ci/coverage persistent flake)", if the runner-disk-guard race reappears)
  • Admin merge when green

🤖 Generated with Claude Code

…nEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%) clean-branch rebuild

Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005)
to a pure two-number threshold verdict fn. Clean-branch rebuild of the
SHIP-005 delta from the now-superseded stacked PR #1015 (commit
8c497a0 which layered on top of SHIP-007 PR #1019 prior to SHIP-002
landing). Re-based directly on current main (f615148) so SHIP-005
stands alone — SHIP-007 (#1019) remains open blocked on infra defect
#1020 (runner-disk-guard cross-runner race) and SHIP-018 (#1004) still
in flight.

MODEL-1 AC-SHIP1 coverage on main: 3/10 touched (SHIP-009 + SHIP-008 +
SHIP-006) pre-SHIP-002; 4/10 once SHIP-002 landed via f615148;
5/10 once this PR lands. Mirrors MODEL-2 SHIP-018 shape (pass@1
threshold) but uniquely carries a 1.2 pp noise allowance called out by
spec §4.2 AC-SHIP1-005 — MODEL-2 has no noise window.

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
  - Adds FALSIFY-QW2E-SHIP-005 binding
      AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
      AC_SHIP1_005_NOISE_ALLOWANCE_PP              = 1.20
      AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
    to `verdict_from_pass_at_1(correct, total, threshold_pct) ->
    Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
  - 8-section mutation survey:
      1. Safe-margin Pass above effective floor (85/100 = 85.0%)
      2. Above nominal floor (87/100 = 87.0%) Pass
      3. Noise-window Fail at nominal (85/100 Fails nominal)
      4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
      5. Monotonicity sweep correct=0..=164 at effective
      6. Div-safety (total=0) + sanity (correct>total) → Fail
      7. Non-finite threshold (NaN, ±∞) → Fail conservatively
      8. Tolerance-bounded provenance pin on all three constants
         (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
  - `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`,
    `full_discharge_blocks_on: live apr eval --benchmark humaneval ...`
    on RTX 4090; 6 named counter_example_classes.

crates/aprender-core/src/metrics/ship_005.rs (NEW, 305 lines):
  - Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
  - `#[must_use] pub fn verdict_from_pass_at_1(...)` returns
    `Ship005Verdict::Fail` conservatively on: total=0 (div guard),
    correct>total (sanity), !threshold.is_finite() (NaN/±∞).
  - `falsify_ship_005_humaneval_pass_at_1_threshold_logic` — 1 passing.

Spec `docs/specifications/aprender-train/ship-two-models-spec.md`
  v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows
  `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry
  noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.

Authored self-contained because SHIP-018 PR #1004 and SHIP-007 PR
#1019 are not yet on main. Once they land, the two (or three)
`verdict_from_pass_at_1_*` fns should be dedup'd into a single
parameterized helper.

Full discharge blocks on: live `apr eval --benchmark humaneval
paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with
--features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or
≥ 84.80 under the 1.2 pp noise allowance).

Tests:
  cargo test -p aprender-core --lib \
    falsify_ship_005_humaneval_pass_at_1_threshold_logic
Contract:
  cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
    contracts/qwen2-e2e-verification-v1.yaml

Supersedes #1015 (stacked-branch original).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Previous run 24806028080: workspace-test stuck at 37+ min on the lib-tests
step (vs. typical 19 min). Canceled. Re-triggering on a fresh runner.

Tracking: infra issue #1020 second incidence.
@noahgift merged commit 9e3286d into main on Apr 23, 2026
17 of 20 checks passed
@noahgift deleted the feat/falsify-ship-005-clean branch on April 23, 2026 01:20