feat(falsify-ship-007): MODEL-1 apr bench decode ≥30 tok/s PARTIAL discharge#1014
Closed
noahgift wants to merge 4 commits into
Closed
feat(falsify-ship-007): MODEL-1 apr bench decode ≥30 tok/s PARTIAL discharge#1014noahgift wants to merge 4 commits into
noahgift wants to merge 4 commits into
Conversation
Discharge FALSIFY-SHIP-008 / AC-SHIP1-008 at PARTIAL_ALGORITHM_LEVEL.
- contracts/chat-template-v1.yaml v1.0.0 -> v1.1.0: adds
GATE-CHAT-SHIP-008 binding ChatMLTemplate::format_conversation to
the canonical Qwen2.5-Coder-7B (system, user) golden via a pure
verdict_from_chat_template_render const fn. ship_blocking: true,
discharge_status: PARTIAL_ALGORITHM_LEVEL; full discharge blocks
on live `apr run paiml/qwen2.5-coder-7b-apache-q4k-v1` completion
diff against golden.
- crates/aprender-core/src/text/chat_template/ship_008.rs (new):
AC_SHIP1_008_CANONICAL_{SYSTEM,USER,GOLDEN} constants +
Ship008Verdict enum + verdict_from_chat_template_render const fn
(byte-equality, UTF-8-safe) + 5-section mutation survey
(engine-binding, empty Fail, missing-gen-prompt Fail, wrong-delim
Fail, swapped-roles Fail, single-byte flip Fail) + symmetry +
provenance pin.
- crates/aprender-core/src/text/chat_template/mod.rs: include!
ship_008.rs alongside existing template.rs, raw_template.rs.
- docs/specifications/aprender-train/ship-two-models-spec.md
v2.23.0 -> v2.24.0: AC-SHIP1-008 row + FALSIFY-SHIP-008 row
annotated PARTIAL_ALGORITHM_LEVEL; v2.24.0 amendment entry
records MODEL-1 coverage 1/10 -> 2/10 (first MODEL-1
non-provenance PARTIAL; mirrors SHIP-016/017/018/020 pattern).
Test: cargo test -p aprender-core --lib
falsify_ship_008_chat_template_render_bind -> 1 passed
Contract: pv validate contracts/chat-template-v1.yaml -> Contract is valid
Refs: SHIP-TWO-001, task #155
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…arge Wires AC-SHIP1-006 "apr qa <model> — all 8 gates PASS" at PARTIAL_ALGORITHM_LEVEL: a pure aggregate-AND verdict fn bound to the 8-gate ship criterion from `docs/specifications/components/qa.md` §3 (golden / throughput / ollama parity / gpu speedup / tensor contracts / format parity / ptx parity / metadata). Files: - `crates/aprender-core/src/qa/ship_006.rs` (NEW, 217 lines) — `verdict_from_qa_gates(&[bool]) -> Ship006Verdict` const fn with 7-section mutation survey: all-Pass→Pass, all-Fail→Fail, single-gate-flip × 8, exhaustive 2^8=256 bitmask proof, Pass→Fail monotonicity, length-drift counter-examples (0 / 7 / 9 / 16), provenance pin (AC_SHIP1_006_REQUIRED_QA_GATE_COUNT = 8). - `crates/aprender-core/src/qa/mod.rs` — register `pub mod ship_006;`. - `contracts/apr-model-qa-v1.yaml` v1.1.0 → v1.2.0 — adds `FALSIFY-QA-SHIP-006` with `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by` pointing at ship_006.rs + the harness test, and `full_discharge_blocks_on` live `apr qa paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on an RTX 4090 host (8× `"pass": true` entries in the JSON body). - `docs/specifications/aprender-train/ship-two-models-spec.md` v2.24.0 → v2.25.0 — annotates AC-SHIP1-006 + FALSIFY-SHIP-006 rows with PARTIAL_ALGORITHM_LEVEL markers and adds v2.25.0 amendment entry. Design: mirrors the aggregate-AND shape set by MODEL-2 SHIP-016 (task #152 on `feat/falsify-ship-016-partial-discharge`, not yet on main). Authored self-contained because SHIP-016 hasn't landed; once both ship, the two `verdict_from_qa_gates_*` fns should be deduplicated into a single parameterized helper. Required gate count differs by model (both 8 today — the spec's "All must Pass" is model-independent). MODEL-1 AC-SHIP1 coverage: 2/10 touched (SHIP-008 + SHIP-009) → **3/10** touched (+ SHIP-006). First MODEL-1 aggregate-AND PARTIAL. Full discharge blocks on a live `apr qa` run against the teacher weights on RTX 4090; the compute-heavy portion is intentionally out of scope here. Test: `cargo test -p aprender-core --lib falsify_ship_006_apr_qa_eight_gates_aggregate` → 1 passed. Contract: `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/apr-model-qa-v1.yaml` → 0 errors. Stacked on #1012 (feat/falsify-ship-008-partial-discharge). Spec v2.25.0 builds on v2.24.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…scharge
Wires AC-SHIP1-007 "apr bench decode throughput ≥30 tok/s on RTX 4090
(7B Q4_K target)" at PARTIAL_ALGORITHM_LEVEL: a pure f32 threshold
verdict fn bound to the MODEL-1 teacher ship floor. The decision rule
is proven today; the compute-heavy half (live `apr bench` on RTX 4090)
is deferred to hardware evidence collection.
Files:
- `crates/aprender-core/src/bench/ship_007.rs` (NEW, 158 lines) —
`AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B = 30.0`,
`Ship007Verdict { Pass, Fail }`,
`verdict_from_decode_tps(f32) -> Ship007Verdict`,
`falsify_ship_007_decode_tps_threshold_logic` 7-section survey:
1. boundary (30.0 exactly → Pass; the contract is ≥, not >)
2. one-ULP-below → Fail (sharpest off-by-one counter-example)
3. clear Pass band (45 / 100 tok/s)
4. clear Fail band (0 / 10 / 29.999999)
5. monotonicity above floor + below floor
6. non-finite → Fail conservatively (NaN, +∞, -∞)
7. provenance pin binding the 30.0 constant to spec §4.2.
- `crates/aprender-core/src/bench/mod.rs` — register `pub mod ship_007;`.
- `contracts/qwen2-e2e-verification-v1.yaml` v1.0.0 → v1.1.0 — adds
`FALSIFY-QW2E-SHIP-007` with `ship_blocking: true`,
`discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by`
pointing at ship_007.rs + the harness test, and
`full_discharge_blocks_on` live `apr bench --iterations 5
--max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` on RTX 4090
with --features cuda; median of 5 iterations must be ≥ 30.0. Also
4 `counter_example_classes` (regressed_kernel / drifted_constant /
relaxed_rule / nan_promoted).
- `docs/specifications/aprender-train/ship-two-models-spec.md`
v2.25.0 → v2.26.0 — annotates AC-SHIP1-007 + FALSIFY-SHIP-007 rows
with PARTIAL_ALGORITHM_LEVEL markers and adds v2.26.0 amendment entry.
Design: mirrors the MODEL-2 SHIP-020 single-f32-threshold shape (task
#150 on `feat/falsify-ship-020-partial-discharge`, PR #1005 not yet on
main). Authored self-contained because SHIP-020 hasn't landed; once
both ship, the two `verdict_from_decode_tps_*` fns should be
deduplicated into a single parameterized helper
`verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` with
the model-specific floor pinned as a module-level const. MODEL-1 floor
is 30.0 (7B Q4_K, bandwidth-bound at ~3.5× the size of 370M); MODEL-2
floor is 100.0 (370M sovereign, compute-bound at RTX 4090 bandwidth).
MODEL-1 AC-SHIP1 coverage: 3/10 touched (SHIP-008 + SHIP-009 +
SHIP-006) → **4/10** touched (+ SHIP-007).
Test: `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed.
Contract: `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors.
Stacked on #1013 (feat/falsify-ship-006-partial-discharge), which is
itself stacked on #1012 (feat/falsify-ship-008-partial-discharge).
Spec v2.26.0 builds on v2.25.0 which builds on v2.24.0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Base automatically changed from
feat/falsify-ship-008-partial-discharge
to
main
April 22, 2026 16:55
Merged
3 tasks
…nEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%) (#1015) Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005) to a pure two-number threshold verdict fn. 5th compute-free MODEL-1 lever (SHIP-008 + SHIP-009 + SHIP-006 + SHIP-007 + SHIP-005) brings MODEL-1 AC-SHIP1 coverage to 5/10 touched. Mirrors MODEL-2 SHIP-018 pattern (pass@1 threshold) but uniquely carries a 1.2 pp noise allowance called out by spec §4.2 AC-SHIP1-005. contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0: - Adds FALSIFY-QW2E-SHIP-005 binding AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00 AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20 AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80 to `verdict_from_pass_at_1(correct, total, threshold_pct) -> Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`. - 8-section mutation survey: 1. Safe-margin Pass above effective floor (85/100 = 85.0%) 2. Above nominal floor (87/100 = 87.0%) Pass 3. Noise-window Fail at nominal (85/100 Fails nominal) 4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756% 5. Monotonicity sweep correct=0..=164 at effective 6. Div-safety (total=0) + sanity (correct>total) → Fail 7. Non-finite threshold (NaN, ±∞) → Fail conservatively 8. Tolerance-bounded provenance pin on all three constants (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80). - `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `full_discharge_blocks_on: live apr eval --benchmark humaneval ...` on RTX 4090; 6 named counter_example_classes. crates/aprender-core/src/metrics/ship_005.rs (NEW, 310 lines): - Three-constant design unique to MODEL-1 (SHIP-007/018 had one). - `#[must_use] pub fn verdict_from_pass_at_1(...)` returns `Ship005Verdict::Fail` conservatively on: total=0 (div guard), correct>total (sanity), !threshold.is_finite() (NaN/±∞). - `falsify_ship_005_humaneval_pass_at_1_threshold_logic` — 1 passing. Spec `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10. Authored self-contained because MODEL-2 SHIP-018 sibling PR has not yet landed on main. Once it does, the two `verdict_from_pass_at_1_*` fns should be dedup'd into a single parameterized helper. Full discharge blocks on: live `apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with --features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance). Tests: cargo test -p aprender-core --lib \ falsify_ship_005_humaneval_pass_at_1_threshold_logic Contract: cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \ contracts/qwen2-e2e-verification-v1.yaml Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced Apr 22, 2026
Contributor
Author
|
Superseded by PR #1019 — clean-branch rebuild on post-SHIP-002 main (contract v1.1.0 → v1.2.0). The original branch was stacked on feat/falsify-ship-008/006-partial-discharge which had not merged to main, producing CONFLICTING state. Closing stale. |
noahgift
added a commit
that referenced
this pull request
Apr 23, 2026
…lean branch) Clean-branch rebuild of SHIP-007 PARTIAL_ALGORITHM_LEVEL discharge on main (superseding stale PR #1014 which was stacked on feat/falsify-ship-008/006-partial-discharge branches that had not yet merged to main). Algorithm commit carries the same 7-section mutation survey as the original be6d129, re-based onto post-SHIP-002 main (commit f615148, contract v1.1.0). Wires AC-SHIP1-007 "apr bench decode throughput ≥30 tok/s on RTX 4090 (7B Q4_K target)" at PARTIAL_ALGORITHM_LEVEL: a pure f32 threshold verdict fn bound to the MODEL-1 teacher ship floor. Decision rule is proven today; compute-heavy half (live `apr bench` on RTX 4090) is deferred to hardware evidence collection. Files: - `crates/aprender-core/src/bench/ship_007.rs` (NEW) — `AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B = 30.0`, `Ship007Verdict { Pass, Fail }`, `verdict_from_decode_tps(f32) -> Ship007Verdict`, `falsify_ship_007_decode_tps_threshold_logic` 7-section survey: 1. boundary (30.0 exactly → Pass; contract is ≥, not >) 2. one-ULP-below → Fail (sharpest off-by-one counter-example) 3. clear Pass band (45 / 100 tok/s) 4. clear Fail band (0 / 10 / 29.999999) 5. monotonicity above floor + below floor 6. non-finite → Fail conservatively (NaN, +∞, -∞) 7. provenance pin binding 30.0 to spec §4.2. - `crates/aprender-core/src/bench/mod.rs` — register `pub mod ship_007;`. - `contracts/qwen2-e2e-verification-v1.yaml` v1.1.0 → v1.2.0 — adds `FALSIFY-QW2E-SHIP-007` with `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by` pointing at ship_007.rs + the harness test, and `full_discharge_blocks_on` live `apr bench --iterations 5 --max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` on RTX 4090 with --features cuda; median of 5 iterations must be ≥ 30.0. - `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0 — annotates AC-SHIP1-007 row with PARTIAL_ALGORITHM_LEVEL v2.27.0 marker and adds v2.27.0 amendment entry. Design: mirrors MODEL-2 SHIP-020 single-f32-threshold shape (PR #1005 not yet on main). Once both ship, the two `verdict_from_decode_tps_*` fns should be deduplicated into a single parameterized helper `verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` with model-specific floors pinned as module-level consts. MODEL-1 floor is 30.0 (7B Q4_K, bandwidth-bound at ~3.5× the 370M size); MODEL-2 floor is 100.0 (370M sovereign, compute-bound at RTX 4090 bandwidth). MODEL-1 AC-SHIP1 coverage: 4/10 touched (SHIP-009 + SHIP-008 + SHIP-006 + SHIP-002) → **5/10** touched (+ SHIP-007). Test: `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed. Contract: `pv validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors. Clippy: `cargo clippy -p aprender-core --lib -- -D warnings` → clean. Fmt: `cargo fmt --check -p aprender-core` → clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
Apr 23, 2026
…lean branch) (#1019) * feat(falsify-ship-007): MODEL-1 apr bench decode ≥30 tok/s PARTIAL (clean branch) Clean-branch rebuild of SHIP-007 PARTIAL_ALGORITHM_LEVEL discharge on main (superseding stale PR #1014 which was stacked on feat/falsify-ship-008/006-partial-discharge branches that had not yet merged to main). Algorithm commit carries the same 7-section mutation survey as the original be6d129, re-based onto post-SHIP-002 main (commit f615148, contract v1.1.0). Wires AC-SHIP1-007 "apr bench decode throughput ≥30 tok/s on RTX 4090 (7B Q4_K target)" at PARTIAL_ALGORITHM_LEVEL: a pure f32 threshold verdict fn bound to the MODEL-1 teacher ship floor. Decision rule is proven today; compute-heavy half (live `apr bench` on RTX 4090) is deferred to hardware evidence collection. Files: - `crates/aprender-core/src/bench/ship_007.rs` (NEW) — `AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B = 30.0`, `Ship007Verdict { Pass, Fail }`, `verdict_from_decode_tps(f32) -> Ship007Verdict`, `falsify_ship_007_decode_tps_threshold_logic` 7-section survey: 1. boundary (30.0 exactly → Pass; contract is ≥, not >) 2. one-ULP-below → Fail (sharpest off-by-one counter-example) 3. clear Pass band (45 / 100 tok/s) 4. clear Fail band (0 / 10 / 29.999999) 5. monotonicity above floor + below floor 6. non-finite → Fail conservatively (NaN, +∞, -∞) 7. provenance pin binding 30.0 to spec §4.2. - `crates/aprender-core/src/bench/mod.rs` — register `pub mod ship_007;`. - `contracts/qwen2-e2e-verification-v1.yaml` v1.1.0 → v1.2.0 — adds `FALSIFY-QW2E-SHIP-007` with `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by` pointing at ship_007.rs + the harness test, and `full_discharge_blocks_on` live `apr bench --iterations 5 --max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` on RTX 4090 with --features cuda; median of 5 iterations must be ≥ 30.0. - `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0 — annotates AC-SHIP1-007 row with PARTIAL_ALGORITHM_LEVEL v2.27.0 marker and adds v2.27.0 amendment entry. Design: mirrors MODEL-2 SHIP-020 single-f32-threshold shape (PR #1005 not yet on main). Once both ship, the two `verdict_from_decode_tps_*` fns should be deduplicated into a single parameterized helper `verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` with model-specific floors pinned as module-level consts. MODEL-1 floor is 30.0 (7B Q4_K, bandwidth-bound at ~3.5× the 370M size); MODEL-2 floor is 100.0 (370M sovereign, compute-bound at RTX 4090 bandwidth). MODEL-1 AC-SHIP1 coverage: 4/10 touched (SHIP-009 + SHIP-008 + SHIP-006 + SHIP-002) → **5/10** touched (+ SHIP-007). Test: `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed. Contract: `pv validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors. Clippy: `cargo clippy -p aprender-core --lib -- -D warnings` → clean. Fmt: `cargo fmt --check -p aprender-core` → clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * ci: retrigger after 3 disk-guard race failures --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Wires AC-SHIP1-007 (`apr bench` decode ≥30 tok/s on RTX 4090 for 7B Q4_K)
at `PARTIAL_ALGORITHM_LEVEL`: a pure f32 threshold verdict fn bound to the
MODEL-1 teacher ship floor.
MODEL-1 AC-SHIP1 coverage: 3/10 → 4/10 touched (after SHIP-008 + SHIP-009 + SHIP-006).
What changed
Design
Test plan
Stacked on #1012 (which absorbed #1013)
Base = `feat/falsify-ship-008-partial-discharge` (PR #1012 — already contains SHIP-008 + SHIP-006 after #1013 merged into its branch). When #1012 merges to main, GitHub will automatically retarget this PR to `main`.
🤖 Generated with Claude Code