
feat(falsify-ship-007): MODEL-1 apr bench decode ≥30 tok/s PARTIAL (clean branch) #1019

Merged
noahgift merged 3 commits into main from feat/falsify-ship-007-clean on Apr 23, 2026


@noahgift


Summary

Clean-branch rebuild of FALSIFY-SHIP-007 PARTIAL_ALGORITHM_LEVEL discharge on post-SHIP-002 main (f615148, contract v1.1.0). Supersedes stale PR #1014, which was stacked on SHIP-008/SHIP-006 branches that had not yet merged to main.

Binds the MODEL-1 ship floor (`apr bench --iterations 5 --max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` ≥ 30.0 tok/s on RTX 4090) to a pure f32-threshold verdict fn, `verdict_from_decode_tps(f32) -> Ship007Verdict`, in `crates/aprender-core/src/bench/ship_007.rs`. The decision rule is proven today via a 7-section mutation survey; the compute-heavy half (live `apr bench` on RTX 4090) is deferred to hardware evidence collection.
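
For readers without the diff open, the decision rule described above can be sketched roughly as follows. The constant, enum, and fn names come from this PR; the body is an assumption reconstructed from the description (≥ comparison, non-finite inputs fail) and may differ from what `ship_007.rs` actually contains.

```rust
/// MODEL-1 decode-throughput ship floor in tok/s (value taken from the PR text).
pub const AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B: f32 = 30.0;

/// Two-state verdict, as named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship007Verdict {
    Pass,
    Fail,
}

/// Sketch of the pure threshold rule (body is an assumption, not the merged code).
#[must_use]
pub fn verdict_from_decode_tps(measured_tps: f32) -> Ship007Verdict {
    // NaN, +inf and -inf are treated as measurement failures, never as a Pass.
    if !measured_tps.is_finite() {
        return Ship007Verdict::Fail;
    }
    // The contract is ">=", so exactly 30.0 passes.
    if measured_tps >= AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B {
        Ship007Verdict::Pass
    } else {
        Ship007Verdict::Fail
    }
}
```

Rejecting `+∞` explicitly matters here: a bare `>=` comparison would let an infinite reading pass, which is presumably why the survey lists non-finite inputs as their own section.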

  • contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0 (adds FALSIFY-QW2E-SHIP-007)
  • crates/aprender-core/src/bench/ship_007.rs (NEW 179 LoC)
  • crates/aprender-core/src/bench/mod.rs (1-line registration)
  • docs/specifications/aprender-train/ship-two-models-spec.md v2.26.0 → v2.27.0

MODEL-1 AC-SHIP1 coverage: 4/10 → 5/10 touched (+ SHIP-007).

Mirrors MODEL-2 SHIP-020 single-f32-threshold shape — once both land, the two verdict_from_decode_tps_* fns should be deduplicated into a single parameterized helper.

Test plan

  • cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic → 1 passed
  • pv validate contracts/qwen2-e2e-verification-v1.yaml → 0 errors, 0 warnings
  • cargo clippy -p aprender-core --lib -- -D warnings → clean
  • cargo fmt --check -p aprender-core → clean
  • CI ci / gate passes on self-hosted intel-clean-room runner
  • CI workspace-test passes

🤖 Generated with Claude Code

@noahgift


All required CI green. Updating branch to resolve BEHIND state for admin merge.

noahgift added a commit that referenced this pull request Apr 23, 2026
…@1 ≥86.00% (1.2 pp noise → 84.80%) (#1021)

* feat(ship-two-001): FALSIFY-SHIP-005 PARTIAL discharge — MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%) clean-branch rebuild

Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005)
to a pure two-number threshold verdict fn. Clean-branch rebuild of the
SHIP-005 delta from the now-superseded stacked PR #1015 (commit
8c497a0 which layered on top of SHIP-007 PR #1019 prior to SHIP-002
landing). Re-based directly on current main (f615148) so SHIP-005
stands alone — SHIP-007 (#1019) remains open blocked on infra defect
#1020 (runner-disk-guard cross-runner race) and SHIP-018 (#1004) still
in flight.

MODEL-1 AC-SHIP1 coverage on main: 3/10 touched (SHIP-009 + SHIP-008 +
SHIP-006) pre-SHIP-002; 4/10 once SHIP-002 landed via f615148;
5/10 once this PR lands. Mirrors MODEL-2 SHIP-018 shape (pass@1
threshold) but uniquely carries a 1.2 pp noise allowance called out by
spec §4.2 AC-SHIP1-005 — MODEL-2 has no noise window.

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
  - Adds FALSIFY-QW2E-SHIP-005 binding
      AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
      AC_SHIP1_005_NOISE_ALLOWANCE_PP              = 1.20
      AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
    to `verdict_from_pass_at_1(correct, total, threshold_pct) ->
    Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
  - 8-section mutation survey:
      1. Safe-margin Pass above effective floor (85/100 = 85.0%)
      2. Above nominal floor (87/100 = 87.0%) Pass
      3. Noise-window Fail at nominal (85/100 Fails nominal)
      4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
      5. Monotonicity sweep correct=0..=164 at effective
      6. Div-safety (total=0) + sanity (correct>total) → Fail
      7. Non-finite threshold (NaN, ±∞) → Fail conservatively
      8. Tolerance-bounded provenance pin on all three constants
         (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
  - `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`,
    `full_discharge_blocks_on: live apr eval --benchmark humaneval ...`
    on RTX 4090; 6 named counter_example_classes.

crates/aprender-core/src/metrics/ship_005.rs (NEW, 305 lines):
  - Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
  - `#[must_use] pub fn verdict_from_pass_at_1(...)` returns
    `Ship005Verdict::Fail` conservatively on: total=0 (div guard),
    correct>total (sanity), !threshold.is_finite() (NaN/±∞).
  - `falsify_ship_005_humaneval_pass_at_1_threshold_logic` — 1 passing.
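
A minimal sketch of the three-constant rule, for illustration only. The constant names, fn name, and guard list are taken from this commit message; the parameter types (`u32`, `f32`) and the choice to compute the percentage in `f64` are assumptions, not a copy of `ship_005.rs`.

```rust
pub const AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT: f32 = 86.00;
pub const AC_SHIP1_005_NOISE_ALLOWANCE_PP: f32 = 1.20;
/// Derived in f32, so the value is close to, but not exactly, 84.80.
pub const AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT: f32 =
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT - AC_SHIP1_005_NOISE_ALLOWANCE_PP;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship005Verdict {
    Pass,
    Fail,
}

/// Sketch: pass@1 percentage compared against a caller-supplied threshold.
#[must_use]
pub fn verdict_from_pass_at_1(correct: u32, total: u32, threshold_pct: f32) -> Ship005Verdict {
    // Guards listed in the commit message: div-by-zero, sanity, non-finite threshold.
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship005Verdict::Fail;
    }
    // Assumption: compute in f64 so the ratio itself adds no f32 rounding noise.
    let pass_pct = f64::from(correct) * 100.0 / f64::from(total);
    if pass_pct >= f64::from(threshold_pct) {
        Ship005Verdict::Pass
    } else {
        Ship005Verdict::Fail
    }
}
```

Under this sketch, 85/100 passes at the effective floor but fails at the nominal one, and the HumanEval-canonical 139/164 (about 84.756%) fails at both, matching sections 1, 3, and 4 of the survey.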

Spec `docs/specifications/aprender-train/ship-two-models-spec.md`
  v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows
  `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry
  noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.

Authored self-contained because SHIP-018 PR #1004 and SHIP-007 PR
#1019 are not yet on main. Once they land, the two (or three)
`verdict_from_pass_at_1_*` fns should be dedup'd into a single
parameterized helper.

Full discharge blocks on: live `apr eval --benchmark humaneval
paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with
--features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or
≥ 84.80 under the 1.2 pp noise allowance).

Tests:
  cargo test -p aprender-core --lib \
    falsify_ship_005_humaneval_pass_at_1_threshold_logic
Contract:
  cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
    contracts/qwen2-e2e-verification-v1.yaml

Supersedes #1015 (stacked-branch original).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: retrigger after disk-guard stuck workspace-test (#1021)

Previous run 24806028080 workspace-test stuck at 37+min on lib-tests step
(vs typical 19min). Canceled. Re-triggering on fresh runner.

Tracking: infra issue #1020 second incidence.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 23, 2026
…RL + SHA-256 verdict rules (6/10) (#1022)

Wires MODEL-1 AC-SHIP1-010 ("published artifact URL resolves; SHA-256
matches manifest") to pure algorithm-level decision rules in
`crates/aprender-core/src/format/ship_010.rs` via two verdict fns and
a 7-section mutation survey per side.

Two constants bind the ship rules:

- `AC_SHIP1_010_SHA256_HEX_LEN = 64` — sha256 = 32 bytes = 64 lowercase
  hex chars, per canonical `sha256sum` output. Any digest whose length
  differs, whose case differs, or whose bytes are not `[0-9a-f]` is
  rejected before any equality comparison.

- `AC_SHIP1_010_REQUIRED_URL_SCHEME = "https://"` — TLS floor codified
  as a byte-literal per §4.2 (plaintext `http://` is MITM-spoofable
  and is a ship-blocker, not a warning).

Two pure verdict fns in `format/ship_010.rs`:

- `verdict_from_sha256_match(expected_hex, actual_hex) -> Ship010Verdict`
  — length gate + canonical-lowercase-hex gate + byte-equal compare.
  Short-circuits to `Fail` before any string comparison when either
  input is malformed.

- `verdict_from_manifest_url(url) -> Ship010Verdict` — starts-with
  `https://` + non-empty host + ASCII-whitespace/control byte rejection.
  Accepts `huggingface.co/...` and `...amazonaws.com/...` canonical
  forms; rejects plaintext, scheme-less, empty-host, and
  whitespace/control-poisoned URLs.
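
As a rough illustration of the digest side, the gates described above could look like the sketch below. The constant, enum, and fn names come from this commit message; the `&str` parameter types, the private helper `is_canonical_sha256_hex`, and the body are assumptions. The "all-zero guard" mentioned in the survey is not modelled here because its exact semantics are not stated.

```rust
pub const AC_SHIP1_010_SHA256_HEX_LEN: usize = 64;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship010Verdict {
    Pass,
    Fail,
}

/// Hypothetical helper: exactly 64 bytes, each in [0-9a-f] (lowercase only).
fn is_canonical_sha256_hex(s: &str) -> bool {
    s.len() == AC_SHIP1_010_SHA256_HEX_LEN
        && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Sketch: length gate + canonical-lowercase-hex gate, then byte-equal compare.
#[must_use]
pub fn verdict_from_sha256_match(expected_hex: &str, actual_hex: &str) -> Ship010Verdict {
    // Malformed input on either side fails before any equality comparison.
    if !is_canonical_sha256_hex(expected_hex) || !is_canonical_sha256_hex(actual_hex) {
        return Ship010Verdict::Fail;
    }
    if expected_hex.as_bytes() == actual_hex.as_bytes() {
        Ship010Verdict::Pass
    } else {
        Ship010Verdict::Fail
    }
}
```

Validating both inputs first means two identical but malformed strings (for example, two uppercase digests) cannot accidentally compare equal and pass.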

7-section mutation survey per fn (proves each precondition is
load-bearing):

- SHA-256: identical-hex Pass / single-hex-flip Fail / wrong-length
  Fail / uppercase-hex rejected / non-hex (`g`..`z`) rejected / all-zero
  guard / provenance pin on constant `AC_SHIP1_010_SHA256_HEX_LEN`.

- URL: HF canonical Pass / S3 canonical Pass / plaintext `http://`
  Fail / scheme-less Fail / empty-host (`https://`) Fail /
  whitespace-control `\n \t \r` rejected / provenance pin on constant
  `AC_SHIP1_010_REQUIRED_URL_SCHEME`.
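
The URL-side checks listed above can be sketched in a similar way. Again, only the constant, enum, and fn names are from this commit message; the body is an assumption, and treating the host as everything up to the first `/`, `?`, or `#` is a simplification chosen for this illustration.

```rust
pub const AC_SHIP1_010_REQUIRED_URL_SCHEME: &str = "https://";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship010Verdict {
    Pass,
    Fail,
}

/// Sketch: https-only scheme, non-empty host, no ASCII whitespace/control bytes.
#[must_use]
pub fn verdict_from_manifest_url(url: &str) -> Ship010Verdict {
    // Reject whitespace and control bytes anywhere in the URL (covers \n, \t, \r, space).
    if url
        .bytes()
        .any(|b| b.is_ascii_whitespace() || b.is_ascii_control())
    {
        return Ship010Verdict::Fail;
    }
    // Plaintext http:// and scheme-less URLs both fail this prefix check.
    let rest = match url.strip_prefix(AC_SHIP1_010_REQUIRED_URL_SCHEME) {
        Some(rest) => rest,
        None => return Ship010Verdict::Fail,
    };
    // Simplified host extraction (assumption): text before the first '/', '?' or '#'.
    let host = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    if host.is_empty() {
        Ship010Verdict::Fail
    } else {
        Ship010Verdict::Pass
    }
}
```

This is a syntactic gate only; whether the URL actually resolves is the part that remains deferred to the live `curl` check.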

Contract `publish-manifest-v1.yaml` v1.3.0 → v1.4.0 adds a new
`FALSIFY-SHIP-010` block under `falsification_tests:` binding the
parent AC (`parent_acceptance_criteria: AC-SHIP1-010`), listing the
two constants under `binds_constants:`, and pointing
`evidence_discharged_by:` at the three Rust test fns. Status stays
ACTIVE; discharge level is `PARTIAL_ALGORITHM_LEVEL` — full discharge
blocks on a live `curl -sSI <artifact_url>` 200-OK + `sha256sum
<local_file>` against a freshly-pulled `paiml/qwen2.5-coder-7b-apache-q4k-v1`
file, verified against the manifest SHA-256 on a host with HF
network egress.

Coverage math post-landing:

- MODEL-1: 5/10 → **6/10** touched (1 DISCHARGED from SHIP-001, plus
  five PARTIALs on SHIP-002 / SHIP-005 / SHIP-006 / SHIP-007 /
  SHIP-008 and now SHIP-010). First MODEL-1 network-dependent PARTIAL
  — others have been format / algorithm / threshold rules.

- Combined both-models tally: 12 PARTIAL + 3 DISCHARGED (was 11 + 3).

Why self-contained (not stacked on PR #1019 SHIP-007): SHIP-010 lives
in `format/` next to LAYOUT contracts, not in `metrics/` or `qa/`;
the two domains are orthogonal, so the PR is based on fresh main
rather than stacked. `publish-manifest-v1.yaml` has no overlap with
`qwen2-e2e-verification-v1.yaml` (SHIP-007 home).

Dogfood evidence:

- `cargo build -p aprender-core --lib` → green (14.83s)
- `cargo test -p aprender-core --lib format::ship_010` →
  `3 passed; 0 failed; 0 ignored`
- `pv validate contracts/publish-manifest-v1.yaml` →
  `0 error(s), 0 warning(s). Contract is valid.`
- `cargo fmt -p aprender-core -- --check` → clean

Spec bump: v2.27.0 → v2.28.0 (entry added at top of header; AC table
row for AC-SHIP1-010 tagged `PARTIAL_ALGORITHM_LEVEL v2.28.0`).
noahgift and others added 2 commits April 23, 2026 05:36
…lean branch)

Clean-branch rebuild of SHIP-007 PARTIAL_ALGORITHM_LEVEL discharge on
main (superseding stale PR #1014 which was stacked on
feat/falsify-ship-008/006-partial-discharge branches that had not yet
merged to main). Algorithm commit carries the same 7-section mutation
survey as the original be6d129, re-based onto post-SHIP-002 main
(commit f615148, contract v1.1.0).

Wires AC-SHIP1-007 "apr bench decode throughput ≥30 tok/s on RTX 4090
(7B Q4_K target)" at PARTIAL_ALGORITHM_LEVEL: a pure f32 threshold
verdict fn bound to the MODEL-1 teacher ship floor. Decision rule is
proven today; compute-heavy half (live `apr bench` on RTX 4090) is
deferred to hardware evidence collection.

Files:
- `crates/aprender-core/src/bench/ship_007.rs` (NEW) —
  `AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B = 30.0`,
  `Ship007Verdict { Pass, Fail }`,
  `verdict_from_decode_tps(f32) -> Ship007Verdict`,
  `falsify_ship_007_decode_tps_threshold_logic` 7-section survey:
    1. boundary (30.0 exactly → Pass; contract is ≥, not >)
    2. one-ULP-below → Fail (sharpest off-by-one counter-example)
    3. clear Pass band (45 / 100 tok/s)
    4. clear Fail band (0 / 10 / 29.999999)
    5. monotonicity above floor + below floor
    6. non-finite → Fail conservatively (NaN, +∞, -∞)
    7. provenance pin binding 30.0 to spec §4.2.

- `crates/aprender-core/src/bench/mod.rs` — register `pub mod ship_007;`.

- `contracts/qwen2-e2e-verification-v1.yaml` v1.1.0 → v1.2.0 — adds
  `FALSIFY-QW2E-SHIP-007` with `ship_blocking: true`,
  `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by`
  pointing at ship_007.rs + the harness test, and
  `full_discharge_blocks_on` live `apr bench --iterations 5
  --max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` on RTX 4090
  with --features cuda; median of 5 iterations must be ≥ 30.0.

- `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0
  → v2.27.0 — annotates AC-SHIP1-007 row with PARTIAL_ALGORITHM_LEVEL
  v2.27.0 marker and adds v2.27.0 amendment entry.
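
The full-discharge condition above ("median of 5 iterations must be ≥ 30.0") is not implemented in this PR. Purely as an illustration of that condition, a median check might look like the sketch below; `median_tps` and `median_meets_floor` are hypothetical names, and how `apr bench` actually reports per-iteration throughput is not specified here.

```rust
/// Hypothetical helper: median of the samples, or None if the slice is empty
/// or contains any non-finite value.
pub fn median_tps(samples: &[f32]) -> Option<f32> {
    if samples.is_empty() || samples.iter().any(|s| !s.is_finite()) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order on f32, so sorting cannot panic.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        // Even count: average the two middle samples.
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}

/// Hypothetical helper: true only when a median exists and is at or above the floor.
pub fn median_meets_floor(samples: &[f32], floor: f32) -> bool {
    matches!(median_tps(samples), Some(m) if m >= floor)
}
```

With five samples such as 31.0, 29.0, 35.0, 30.5, and 28.0, the median is 30.5, so the run would meet a 30.0 floor even though two individual iterations fell below it.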

Design: mirrors MODEL-2 SHIP-020 single-f32-threshold shape (PR #1005
not yet on main). Once both ship, the two `verdict_from_decode_tps_*`
fns should be deduplicated into a single parameterized helper
`verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` with
model-specific floors pinned as module-level consts. MODEL-1 floor is
30.0 (7B Q4_K, bandwidth-bound at ~3.5× the 370M size); MODEL-2 floor
is 100.0 (370M sovereign, compute-bound at RTX 4090 bandwidth).
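
One possible shape for that deduplicated helper is sketched below. The signature `verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` and the two floor values come from the paragraph above; the const names `MODEL_1_MIN_DECODE_TPS` and `MODEL_2_MIN_DECODE_TPS` are hypothetical, and failing on a non-finite floor is an added assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThresholdVerdict {
    Pass,
    Fail,
}

/// Hypothetical module-level floors in tok/s (values from the commit message).
pub const MODEL_1_MIN_DECODE_TPS: f32 = 30.0;
pub const MODEL_2_MIN_DECODE_TPS: f32 = 100.0;

/// Sketch of a single parameterized rule shared by both models.
#[must_use]
pub fn verdict_from_decode_tps(measured: f32, floor: f32) -> ThresholdVerdict {
    // Any non-finite measurement or floor fails conservatively.
    if measured.is_finite() && floor.is_finite() && measured >= floor {
        ThresholdVerdict::Pass
    } else {
        ThresholdVerdict::Fail
    }
}
```

A call site would then read `verdict_from_decode_tps(tps, MODEL_1_MIN_DECODE_TPS)`, keeping the model-specific number at the call site and the comparison logic in one place.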

MODEL-1 AC-SHIP1 coverage: 4/10 touched (SHIP-009 + SHIP-008 + SHIP-006
+ SHIP-002) → **5/10** touched (+ SHIP-007).

Test: `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed.
Contract: `pv validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors.
Clippy: `cargo clippy -p aprender-core --lib -- -D warnings` → clean.
Fmt: `cargo fmt --check -p aprender-core` → clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 651e07b into main Apr 23, 2026
10 checks passed
@noahgift noahgift deleted the feat/falsify-ship-007-clean branch April 23, 2026 07:52
