feat(falsify-ship-007): MODEL-1 apr bench decode ≥30 tok/s PARTIAL (clean branch) by noahgift · Pull Request #1019 · paiml/aprender

noahgift · 2026-04-22T21:21:58Z

Summary

Clean-branch rebuild of FALSIFY-SHIP-007 PARTIAL_ALGORITHM_LEVEL discharge on post-SHIP-002 main (f615148, contract v1.1.0). Supersedes stale PR #1014 which was stacked on SHIP-008/SHIP-006 branches that had not yet merged to main.

Binds the MODEL-1 ship floor (`apr bench --iterations 5 --max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` ≥ 30.0 tok/s on RTX 4090) to a pure f32-threshold verdict fn, `verdict_from_decode_tps(f32) -> Ship007Verdict`, in `crates/aprender-core/src/bench/ship_007.rs`. The decision rule is proven today via a 7-section mutation survey; the compute-heavy half (a live `apr bench` run on RTX 4090) is deferred to hardware evidence collection.
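For readers who have not opened `ship_007.rs`, the following is a minimal sketch of what a pure f32-threshold verdict fn of this shape could look like. The constant, enum, and function names are taken from this PR's commit message; the function body is an assumption reconstructed from the described behavior (≥ floor passes, non-finite input fails) and is not a copy of the merged file.

```rust
/// Ship floor named in the commit message: 30.0 tok/s for the 7B Q4_K model on RTX 4090.
pub const AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B: f32 = 30.0;

/// Two-state verdict, as named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship007Verdict {
    Pass,
    Fail,
}

/// Assumed body: a measurement passes only if it is finite and at or above
/// the floor. NaN and both infinities fail conservatively.
#[must_use]
pub fn verdict_from_decode_tps(decode_tps: f32) -> Ship007Verdict {
    if decode_tps.is_finite() && decode_tps >= AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B {
        Ship007Verdict::Pass
    } else {
        Ship007Verdict::Fail
    }
}
```

Because the fn takes a plain `f32` and touches no hardware, the boundary and non-finite cases can be exercised on any CI runner, which is what makes the algorithm-level half dischargeable before RTX 4090 evidence exists.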

- contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0 (adds FALSIFY-QW2E-SHIP-007)
- crates/aprender-core/src/bench/ship_007.rs (NEW 179 LoC)
- crates/aprender-core/src/bench/mod.rs (1-line registration)
- docs/specifications/aprender-train/ship-two-models-spec.md v2.26.0 → v2.27.0

MODEL-1 AC-SHIP1 coverage: 4/10 → 5/10 touched (+ SHIP-007).

Mirrors MODEL-2 SHIP-020 single-f32-threshold shape — once both land, the two verdict_from_decode_tps_* fns should be deduplicated into a single parameterized helper.

Test plan

- `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed
- `pv validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors, 0 warnings
- `cargo clippy -p aprender-core --lib -- -D warnings` → clean
- `cargo fmt --check -p aprender-core` → clean
- CI `ci / gate` passes on self-hosted intel-clean-room runner
- CI `workspace-test` passes

🤖 Generated with Claude Code

noahgift · 2026-04-22T23:19:27Z

All required CI green. Updating branch to resolve BEHIND state for admin merge.

…@1 ≥86.00% (1.2 pp noise → 84.80%) (#1021)

* feat(ship-two-001): FALSIFY-SHIP-005 PARTIAL discharge — MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%) clean-branch rebuild

Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005) to a pure two-number threshold verdict fn.

Clean-branch rebuild of the SHIP-005 delta from the now-superseded stacked PR #1015 (commit 8c497a0 which layered on top of SHIP-007 PR #1019 prior to SHIP-002 landing). Re-based directly on current main (f615148) so SHIP-005 stands alone; SHIP-007 (#1019) remains open blocked on infra defect #1020 (runner-disk-guard cross-runner race) and SHIP-018 (#1004) still in flight.

MODEL-1 AC-SHIP1 coverage on main: 3/10 touched (SHIP-009 + SHIP-008 + SHIP-006) pre-SHIP-002; 4/10 once SHIP-002 landed via f615148; 5/10 once this PR lands.

Mirrors MODEL-2 SHIP-018 shape (pass@1 threshold) but uniquely carries a 1.2 pp noise allowance called out by spec §4.2 AC-SHIP1-005; MODEL-2 has no noise window.

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
- Adds FALSIFY-QW2E-SHIP-005 binding
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
    AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20
    AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
  to `verdict_from_pass_at_1(correct, total, threshold_pct) -> Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
- 8-section mutation survey:
  1. Safe-margin Pass above effective floor (85/100 = 85.0%)
  2. Above nominal floor (87/100 = 87.0%) Pass
  3. Noise-window Fail at nominal (85/100 Fails nominal)
  4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
  5. Monotonicity sweep correct=0..=164 at effective
  6. Div-safety (total=0) + sanity (correct>total) → Fail
  7. Non-finite threshold (NaN, ±∞) → Fail conservatively
  8. Tolerance-bounded provenance pin on all three constants (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
- `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `full_discharge_blocks_on: live apr eval --benchmark humaneval ...` on RTX 4090; 6 named counter_example_classes.

crates/aprender-core/src/metrics/ship_005.rs (NEW, 305 lines):
- Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
- `#[must_use] pub fn verdict_from_pass_at_1(...)` returns `Ship005Verdict::Fail` conservatively on: total=0 (div guard), correct>total (sanity), !threshold.is_finite() (NaN/±∞).
- `falsify_ship_005_humaneval_pass_at_1_threshold_logic`: 1 passing.

Spec `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.

Authored self-contained because SHIP-018 PR #1004 and SHIP-007 PR #1019 are not yet on main. Once they land, the two (or three) `verdict_from_pass_at_1_*` fns should be dedup'd into a single parameterized helper.

Full discharge blocks on: live `apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with --features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance).

Tests:
  cargo test -p aprender-core --lib \
    falsify_ship_005_humaneval_pass_at_1_threshold_logic

Contract:
  cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
    contracts/qwen2-e2e-verification-v1.yaml

Supersedes #1015 (stacked-branch original).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: retrigger after disk-guard stuck workspace-test (#1021)

Previous run 24806028080 workspace-test stuck at 37+min on lib-tests step (vs typical 19min). Canceled. Re-triggering on fresh runner. Tracking: infra issue #1020 second incidence.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
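The SHIP-005 verdict rule described in the commit message above can be sketched as follows. The constant, enum, and function names come from the commit message; the parameter types (`u32`, `f32`), the derivation of the effective floor as nominal minus allowance, and the function body are assumptions made for illustration, not the contents of `ship_005.rs`.

```rust
pub const AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT: f32 = 86.00;
pub const AC_SHIP1_005_NOISE_ALLOWANCE_PP: f32 = 1.20;
/// Assumption: the effective floor is derived as nominal minus allowance.
/// The result is only approximately 84.80 in f32, so compare with a tolerance.
pub const AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT: f32 =
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT - AC_SHIP1_005_NOISE_ALLOWANCE_PP;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship005Verdict {
    Pass,
    Fail,
}

/// Assumed body: fail conservatively on total == 0, correct > total, or a
/// non-finite threshold; otherwise pass when pass@1 (in percent) meets the threshold.
#[must_use]
pub fn verdict_from_pass_at_1(correct: u32, total: u32, threshold_pct: f32) -> Ship005Verdict {
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship005Verdict::Fail;
    }
    // Multiply before dividing so round counts such as 85/100 stay exact.
    let pass_at_1_pct = (correct as f32 * 100.0) / total as f32;
    if pass_at_1_pct >= threshold_pct {
        Ship005Verdict::Pass
    } else {
        Ship005Verdict::Fail
    }
}
```

Passing the threshold as an argument lets the same fn answer both questions in the survey: 85/100 passes against the effective floor but fails against the nominal one.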

…RL + SHA-256 verdict rules (6/10) (#1022)

Wires MODEL-1 AC-SHIP1-010 ("published artifact URL resolves; SHA-256 matches manifest") to pure algorithm-level decision rules in `crates/aprender-core/src/format/ship_010.rs` via two verdict fns and a 7-section mutation survey per side.

Two constants bind the ship rules:
- `AC_SHIP1_010_SHA256_HEX_LEN = 64`: sha256 = 32 bytes = 64 lowercase hex chars, per canonical `sha256sum` output. Any digest whose length differs, whose case differs, or whose bytes are not `[0-9a-f]` is rejected before any equality comparison.
- `AC_SHIP1_010_REQUIRED_URL_SCHEME = "https://"`: TLS floor codified as a byte-literal per §4.2 (plaintext `http://` is MITM-spoofable and is a ship-blocker, not a warning).

Two pure verdict fns in `format/ship_010.rs`:
- `verdict_from_sha256_match(expected_hex, actual_hex) -> Ship010Verdict`: length gate + canonical-lowercase-hex gate + byte-equal compare. Short-circuits to `Fail` before any string comparison when either input is malformed.
- `verdict_from_manifest_url(url) -> Ship010Verdict`: starts-with `https://` + non-empty host + ASCII-whitespace/control byte rejection. Accepts `huggingface.co/...` and `...amazonaws.com/...` canonical forms; rejects plaintext, scheme-less, empty-host, and whitespace/control-poisoned URLs.

7-section mutation survey per fn (proves each precondition is load-bearing):
- SHA-256: identical-hex Pass / single-hex-flip Fail / wrong-length Fail / uppercase-hex rejected / non-hex (`g`..`z`) rejected / all-zero guard / provenance pin on constant `AC_SHIP1_010_SHA256_HEX_LEN`.
- URL: HF canonical Pass / S3 canonical Pass / plaintext `http://` Fail / scheme-less Fail / empty-host (`https://`) Fail / whitespace-control `\n \t \r` rejected / provenance pin on constant `AC_SHIP1_010_REQUIRED_URL_SCHEME`.

Contract `publish-manifest-v1.yaml` v1.3.0 → v1.4.0 adds a new `FALSIFY-SHIP-010` block under `falsification_tests:` binding the parent AC (`parent_acceptance_criteria: AC-SHIP1-010`), listing the two constants under `binds_constants:`, and pointing `evidence_discharged_by:` at the three Rust test fns. Status stays ACTIVE; discharge level is `PARTIAL_ALGORITHM_LEVEL`. Full discharge blocks on a live `curl -sSI <artifact_url>` 200-OK + `sha256sum <local_file>` against a freshly-pulled `paiml/qwen2.5-coder-7b-apache-q4k-v1` file, verified against the manifest SHA-256 on a host with HF network egress.

Coverage math post-landing:
- MODEL-1: 5/10 → **6/10** touched (1 DISCHARGED from SHIP-001, plus five PARTIALs on SHIP-002 / SHIP-005 / SHIP-006 / SHIP-007 / SHIP-008 and now SHIP-010). First MODEL-1 network-dependent PARTIAL; others have been format / algorithm / threshold rules.
- Combined both-models tally: 12 PARTIAL + 3 DISCHARGED (was 11 + 3).

Why self-contained (not stacked on PR #1019 SHIP-007): SHIP-010 lives in `format/` next to LAYOUT contracts, not in `metrics/` or `qa/`; the two domains are orthogonal, so the PR is based on fresh main rather than stacked. `publish-manifest-v1.yaml` has no overlap with `qwen2-e2e-verification-v1.yaml` (SHIP-007 home).

Dogfood evidence:
- `cargo build -p aprender-core --lib` → green (14.83s)
- `cargo test -p aprender-core --lib format::ship_010` → `3 passed; 0 failed; 0 ignored`
- `pv validate contracts/publish-manifest-v1.yaml` → `0 error(s), 0 warning(s). Contract is valid.`
- `cargo fmt -p aprender-core -- --check` → clean

Spec bump: v2.27.0 → v2.28.0 (entry added at top of header; AC table row for AC-SHIP1-010 tagged `PARTIAL_ALGORITHM_LEVEL v2.28.0`).
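The two SHIP-010 gates described above could be sketched roughly as below. Constant, enum, and function names are taken from the commit message; the `&str` parameter types, the private helper `is_canonical_sha256_hex`, and both function bodies are assumptions for illustration. The "all-zero guard" from the survey is left out because the commit message does not say what it is expected to do.

```rust
pub const AC_SHIP1_010_SHA256_HEX_LEN: usize = 64;
pub const AC_SHIP1_010_REQUIRED_URL_SCHEME: &str = "https://";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship010Verdict {
    Pass,
    Fail,
}

/// Hypothetical helper: exactly 64 bytes, each in `[0-9a-f]`.
fn is_canonical_sha256_hex(s: &str) -> bool {
    s.len() == AC_SHIP1_010_SHA256_HEX_LEN
        && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Assumed body: both digests must be canonical before they are compared.
#[must_use]
pub fn verdict_from_sha256_match(expected_hex: &str, actual_hex: &str) -> Ship010Verdict {
    if is_canonical_sha256_hex(expected_hex)
        && is_canonical_sha256_hex(actual_hex)
        && expected_hex == actual_hex
    {
        Ship010Verdict::Pass
    } else {
        Ship010Verdict::Fail
    }
}

/// Assumed body: require the https scheme, a non-empty host, and no ASCII
/// whitespace or control bytes anywhere in the URL.
#[must_use]
pub fn verdict_from_manifest_url(url: &str) -> Ship010Verdict {
    let Some(rest) = url.strip_prefix(AC_SHIP1_010_REQUIRED_URL_SCHEME) else {
        return Ship010Verdict::Fail;
    };
    // Everything up to the first '/' is treated as the host.
    let host = rest.split('/').next().unwrap_or("");
    let poisoned = url
        .bytes()
        .any(|b| b.is_ascii_whitespace() || b.is_ascii_control());
    if host.is_empty() || poisoned {
        Ship010Verdict::Fail
    } else {
        Ship010Verdict::Pass
    }
}
```

Checking the format before the equality compare means an uppercase or truncated digest fails even when both sides agree, which is the behavior the uppercase-hex and wrong-length survey sections describe.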

…lean branch)

Clean-branch rebuild of SHIP-007 PARTIAL_ALGORITHM_LEVEL discharge on main (superseding stale PR #1014 which was stacked on feat/falsify-ship-008/006-partial-discharge branches that had not yet merged to main). Algorithm commit carries the same 7-section mutation survey as the original be6d129, re-based onto post-SHIP-002 main (commit f615148, contract v1.1.0).

Wires AC-SHIP1-007 "apr bench decode throughput ≥30 tok/s on RTX 4090 (7B Q4_K target)" at PARTIAL_ALGORITHM_LEVEL: a pure f32 threshold verdict fn bound to the MODEL-1 teacher ship floor. Decision rule is proven today; compute-heavy half (live `apr bench` on RTX 4090) is deferred to hardware evidence collection.

Files:
- `crates/aprender-core/src/bench/ship_007.rs` (NEW): `AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B = 30.0`, `Ship007Verdict { Pass, Fail }`, `verdict_from_decode_tps(f32) -> Ship007Verdict`, `falsify_ship_007_decode_tps_threshold_logic` 7-section survey:
  1. boundary (30.0 exactly → Pass; contract is ≥, not >)
  2. one-ULP-below → Fail (sharpest off-by-one counter-example)
  3. clear Pass band (45 / 100 tok/s)
  4. clear Fail band (0 / 10 / 29.999999)
  5. monotonicity above floor + below floor
  6. non-finite → Fail conservatively (NaN, +∞, -∞)
  7. provenance pin binding 30.0 to spec §4.2.
- `crates/aprender-core/src/bench/mod.rs`: register `pub mod ship_007;`.
- `contracts/qwen2-e2e-verification-v1.yaml` v1.1.0 → v1.2.0: adds `FALSIFY-QW2E-SHIP-007` with `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by` pointing at ship_007.rs + the harness test, and `full_discharge_blocks_on` live `apr bench --iterations 5 --max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` on RTX 4090 with --features cuda; median of 5 iterations must be ≥ 30.0.
- `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0: annotates AC-SHIP1-007 row with PARTIAL_ALGORITHM_LEVEL v2.27.0 marker and adds v2.27.0 amendment entry.

Design: mirrors MODEL-2 SHIP-020 single-f32-threshold shape (PR #1005 not yet on main). Once both ship, the two `verdict_from_decode_tps_*` fns should be deduplicated into a single parameterized helper `verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` with model-specific floors pinned as module-level consts. MODEL-1 floor is 30.0 (7B Q4_K, bandwidth-bound at ~3.5× the 370M size); MODEL-2 floor is 100.0 (370M sovereign, compute-bound at RTX 4090 bandwidth).

MODEL-1 AC-SHIP1 coverage: 4/10 touched (SHIP-009 + SHIP-008 + SHIP-006 + SHIP-002) → **5/10** touched (+ SHIP-007).

Test: `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed.
Contract: `pv validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors.
Clippy: `cargo clippy -p aprender-core --lib -- -D warnings` → clean.
Fmt: `cargo fmt --check -p aprender-core` → clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
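The deduplicated helper proposed in the design note might look like the sketch below. The signature `verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` and the two floor values come from the commit message; the constant names `MODEL_1_DECODE_TPS_FLOOR` and `MODEL_2_DECODE_TPS_FLOOR`, the `median_tps` helper (motivated by the "median of 5 iterations" wording), and all bodies are hypothetical.

```rust
/// Hypothetical const names; the floor values are the ones stated in the commit message.
pub const MODEL_1_DECODE_TPS_FLOOR: f32 = 30.0;
pub const MODEL_2_DECODE_TPS_FLOOR: f32 = 100.0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThresholdVerdict {
    Pass,
    Fail,
}

/// Assumed body: pass only when both values are finite and measured >= floor.
#[must_use]
pub fn verdict_from_decode_tps(measured: f32, floor: f32) -> ThresholdVerdict {
    if measured.is_finite() && floor.is_finite() && measured >= floor {
        ThresholdVerdict::Pass
    } else {
        ThresholdVerdict::Fail
    }
}

/// Hypothetical helper: median of per-iteration tok/s samples.
/// Returns None for an empty slice or if any sample is non-finite.
#[must_use]
pub fn median_tps(samples: &[f32]) -> Option<f32> {
    if samples.is_empty() || samples.iter().any(|s| !s.is_finite()) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    })
}
```

With this shape, each model's gate reduces to a one-line call such as `verdict_from_decode_tps(median, MODEL_1_DECODE_TPS_FLOOR)`, and the boundary and non-finite survey sections only need to be written once.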

noahgift mentioned this pull request Apr 23, 2026

feat(ship-two-001): FALSIFY-SHIP-010 PARTIAL discharge — MODEL-1 HF URL + SHA-256 (6/10) #1022

Merged

6 tasks

noahgift and others added 2 commits April 23, 2026 05:36

ci: retrigger after 3 disk-guard race failures

0f8a778

noahgift force-pushed the feat/falsify-ship-007-clean branch from fa1b7fc to 0f8a778 Compare April 23, 2026 03:36

noahgift mentioned this pull request Apr 23, 2026

Fleet CI: per-PR CARGO_TARGET_DIR missing in paiml/.github reusable sovereign-ci.yml — 15 disk-guard collisions on aprender PR #1019 (2026-04-23) paiml/organizational-intelligence-plugin#12

Closed

Merge branch 'main' into feat/falsify-ship-007-clean

934fd5e

noahgift merged commit 651e07b into main Apr 23, 2026
10 checks passed

noahgift deleted the feat/falsify-ship-007-clean branch April 23, 2026 07:52

noahgift mentioned this pull request Apr 23, 2026

docs(ship-two-001): v2.30.0 — SHIP-007 merged + fleet CI hardened + session wrap #1024

Closed

3 tasks

noahgift merged 3 commits into main from feat/falsify-ship-007-clean
