feat(falsify-ship-020): MODEL-2 AC-SHIP2-010 PARTIAL discharge (restacked) #1033
Closed
noahgift wants to merge 3 commits into
This was referenced Apr 23, 2026
4ad6754 to
2555eb1
…ce-v1 multi-bind
FALSIFY-SHIP-009 (AC-SHIP1-009 "MODEL-1 teacher license + data
provenance recorded in model.apr metadata") attains
PARTIAL_ALGORITHM_LEVEL by attaching a second binding to the same
C-APR-PROVENANCE contract that already discharges MODEL-2's
AC-SHIP2-012. The AprV2Metadata + serde-JSON decision rule is
model-agnostic, so one contract cleanly carries both discharges.
Changes:
- contracts/apr-provenance-v1.yaml v1.0.0 → v1.1.0 (stays ACTIVE):
new GATE-APR-PROV-004 block binds AC-SHIP1-009 / FALSIFY-SHIP-009
at PARTIAL_ALGORITHM_LEVEL with ship_blocking=true; full discharge
blocks on teacher .apr republish populating license, data_source,
data_license as named fields (PMAT-686 fixture-swap).
- crates/aprender-core/src/format/tests/provenance_tests.rs:
- falsify_ship_009_apr_metadata_applies_to_model_1_teacher —
teacher-representative round-trip (license="apache-2.0",
data_source="qwen2.5-coder-7b-instruct", data_license="apache-2.0").
- falsify_ship_009_gate_apr_prov_004_has_partial_discharge_marker —
include_str! YAML-binding assertion that the new gate has the
correct binds_to / falsification_id / discharge_status / flags.
- crates/aprender-core/Cargo.toml: add serde_yaml to [dev-dependencies]
(needed for the YAML-binding test).
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0
→ v2.24.0: new v2.24.0 amendment block documenting the first
MODEL-1 PARTIAL and first multi-model multi-bind on one contract.
Pattern extensions:
- First MODEL-1 PARTIAL (prior six targeted MODEL-2).
- First multi-model multi-bind on ONE contract (prior PARTIALs each
had a dedicated contract).
- Sixth falsification of the "exhausted" verdict: SHIP-019 →
SHIP-017 → SHIP-020 → SHIP-018 → SHIP-016 → SHIP-009 — sixth is
cross-model, strictly more surprising than the prior five.
All 5 provenance tests green (3 SHIP-022 + 2 SHIP-009).
Status after v2.24.0:
- MODEL-2: 3/12 ACTIVE + 7/12 PARTIAL = 10/12 touched (83.3%)
- MODEL-1: 9/10 DISCHARGED (via SHIP-TWO-001-MODEL-1-TEACHER tag) +
1/10 PARTIAL (009). Will flip to fully ACTIVE when PMAT-686
republishes teacher.apr with provenance fields populated.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…EVEL discharge (task #149)
MODEL-2 (albor 370M Sovereign) gate #4 at PARTIAL: binds AC-SHIP2-007
("apr run produces syntactically valid Python on 100 held-out prompts")
to FALSIFY-SHIP-017 via new GATE-ARCH-370M-005 with
`discharge_status: PARTIAL_ALGORITHM_LEVEL`.
The decision rule ("≤ 1 SyntaxError tolerated out of 100, ≥ 2 is a
ship-blocker") is a pure integer threshold and is proven correct at
`cargo test` time today. Full discharge (100-prompt `apr run` harness
against a trained 370M .apr) remains PENDING on pretraining
compute-dispatch (AC-SHIP2-003/004). The fixture swap is data-only; no
harness rewrite is required.
Changes:
- crates/aprender-train/src/models/llama_370m.rs:
  - Adds `AC_SHIP2_007_HELDOUT_PROMPT_COUNT` (=100) +
    `AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS` (=1) consts mirroring the
    spec §6 harness size and §8.3 FALSIFY-SHIP-017 tolerance.
  - Adds `verdict_from_syntax_error_count(errors) -> Ship017Verdict`
    const fn: the pure threshold.
  - Adds `falsify_ship_017_syntax_error_count_threshold_logic`:
    Pass boundary (0,1), Fail boundary (2,50,100), monotonicity sweep
    ∈ [0,100], and provenance pinning.
  - Adds `falsify_ship_017_gate_arch_370m_005_has_partial_discharge_marker`:
    binds sovereign contract YAML shape (falsification_id, binds_to,
    discharge_status, evidence_discharged_by, full_discharge_blocks_on,
    ship_blocking) to Rust tests via include_str!.
- contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0
  (stays ACTIVE): adds GATE-ARCH-370M-005.
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0
  → v2.25.0 with amendment block: counter-example survey continues to
  find new PARTIAL levers after two prior "exhausted" verdicts
  (SHIP-015 → SHIP-019 → SHIP-017).
New status: 3/12 ACTIVE + 4/12 PARTIAL = 7/12 touched (58.3%).
Verification:
- cargo test -p aprender-train --lib llama_370m → 12/12 pass
  (including both new falsify_ship_017_* tests)
- cargo clippy -p aprender-train --lib -- -D warnings → clean
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml
  → Contract is valid
Closes task #149.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
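For readers without the diff open, the integer threshold described in this commit could look roughly like the sketch below. The constant and function names come from the commit message; the `u32` types, the `Pass`/`Fail` variant names for `Ship017Verdict`, and the derive list are assumptions, not the merged code.

```rust
// Sketch only: names are taken from the commit message, while the
// integer type (u32) and the enum variants are assumptions.

/// Number of held-out prompts in the AC-SHIP2-007 harness (spec §6).
pub const AC_SHIP2_007_HELDOUT_PROMPT_COUNT: u32 = 100;

/// At most this many SyntaxErrors are tolerated (spec §8.3).
pub const AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS: u32 = 1;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship017Verdict {
    Pass,
    Fail,
}

/// Pure threshold: 0 or 1 errors pass, 2 or more is a ship-blocker.
pub const fn verdict_from_syntax_error_count(errors: u32) -> Ship017Verdict {
    if errors <= AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS {
        Ship017Verdict::Pass
    } else {
        Ship017Verdict::Fail
    }
}
```

Because the function is a `const fn` over an integer, the Pass boundary (0, 1) and the Fail boundary (2 and up) can be checked at `cargo test` time without any trained model.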
Binds AC-SHIP2-010 (inference decode throughput ≥ 100 tok/s on RTX
4090) to a new GATE-ARCH-370M-006 in the sovereign contract via a pure
f32 threshold fn + two unit tests. The compute-heavy half (`apr bench`
on a real trained 370M .apr) is deferred to AC-SHIP2-003/004
compute-dispatch; the decision rule itself is proven today.
Changes:
- crates/aprender-train/src/models/llama_370m.rs:
* AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 = 100.0 (const floor)
* Ship020Verdict { Pass, Fail }
* verdict_from_decode_tps(f32) -> Ship020Verdict (fn, non-finite → Fail)
* falsify_ship_020_decode_tps_threshold_logic (5 invariants:
Pass boundary, Fail boundary at one f32 ULP, monotonicity in
both directions, conservative Fail for NaN/±∞, provenance
pinning that the const stays = 100.0)
* falsify_ship_020_gate_arch_370m_006_has_partial_discharge_marker
(contract parses + advertises PARTIAL_ALGORITHM_LEVEL +
evidence_discharged_by populated + full_discharge_blocks_on
documented + ship_blocking:true)
- contracts/model-families/llama-370m-sovereign-v1.yaml:
* v1.5.0 → v1.6.0, stays ACTIVE
* New GATE-ARCH-370M-006 binding AC-SHIP2-010 ↔ FALSIFY-SHIP-020
with discharge_status: PARTIAL_ALGORITHM_LEVEL
- docs/specifications/aprender-train/ship-two-models-spec.md:
* v2.23.0 → v2.26.0 with amendment block
* MODEL-2 ship-gate status updated: 3/12 ACTIVE + 5/12 PARTIAL =
8/12 touched (66.7%)
- crates/aprender-train/src/train/device.rs:
* 2 pre-existing fmt fixes (6 lines of whitespace) — restores
`cargo fmt -p aprender-train --check` green. Pre-existing on
origin/main; kept in this PR under Toyota Way "all defects are
your defects" rule.
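A minimal sketch of the f32 threshold listed under Changes, assuming the signature given in the commit message. The derive list and the `one_ulp_below` helper are hypothetical additions for illustration; the real implementation in `llama_370m.rs` may differ.

```rust
// Sketch only: const, enum, and fn names follow the commit message;
// `one_ulp_below` is a hypothetical helper for the boundary check.

/// Decode-throughput floor for AC-SHIP2-010 on RTX 4090, in tok/s.
pub const AC_SHIP2_010_MIN_DECODE_TPS_RTX4090: f32 = 100.0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship020Verdict {
    Pass,
    Fail,
}

/// Pure threshold. NaN and both infinities are conservatively a Fail.
pub fn verdict_from_decode_tps(measured_tps: f32) -> Ship020Verdict {
    if measured_tps.is_finite() && measured_tps >= AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 {
        Ship020Verdict::Pass
    } else {
        Ship020Verdict::Fail
    }
}

/// Largest f32 strictly below `x`, valid for positive, finite, normal `x`.
/// Decrementing the bit pattern steps down by exactly one ULP.
pub fn one_ulp_below(x: f32) -> f32 {
    f32::from_bits(x.to_bits() - 1)
}
```

With this shape, `verdict_from_decode_tps(one_ulp_below(100.0))` exercises the Fail boundary one ULP under the floor, and the explicit `is_finite` check keeps `+∞` from passing the `>=` comparison.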
Pattern lesson: v2.22.0 declared MODEL-2 non-compute PARTIAL levers
"exhausted" — re-running the counter-example survey has now falsified
that verdict three times (SHIP-019 → SHIP-017 → SHIP-020). When a
SHIP gate names a threshold / tolerance / ratio / cut-off and the
compute-heavy harness is separable from the decision function, the
threshold fn can land today at unit-test time — even when the full
end-to-end harness is blocked on compute.
Full discharge blocks on: real 370M .apr from AC-SHIP2-003/004
compute-dispatch + three independent `apr bench --tokens 128 --json`
medians on RTX 4090 host. Fixture-swap only — no decision-rule rewrite.
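The commit does not say how the three `apr bench` medians are combined into one verdict. The sketch below assumes one plausible reading: take the middle of the three values and feed it to the same threshold, failing outright if any run reports a non-finite number. `verdict_from_three_medians` is a hypothetical name, and the threshold fn is repeated here only to keep the block self-contained.

```rust
// Sketch only: the aggregation rule (median of the three per-run medians)
// is an assumption; `verdict_from_three_medians` is a hypothetical name.

pub const AC_SHIP2_010_MIN_DECODE_TPS_RTX4090: f32 = 100.0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship020Verdict {
    Pass,
    Fail,
}

pub fn verdict_from_decode_tps(measured_tps: f32) -> Ship020Verdict {
    if measured_tps.is_finite() && measured_tps >= AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 {
        Ship020Verdict::Pass
    } else {
        Ship020Verdict::Fail
    }
}

/// Combines three per-run medians (tok/s) into a single verdict.
pub fn verdict_from_three_medians(medians: [f32; 3]) -> Ship020Verdict {
    // A single NaN or infinite reading invalidates the whole measurement.
    if medians.iter().any(|m| !m.is_finite()) {
        return Ship020Verdict::Fail;
    }
    let mut sorted = medians;
    // total_cmp gives a total order on f32, so sort_by cannot panic.
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Index 1 is the middle value of three.
    verdict_from_decode_tps(sorted[1])
}
```

Under this assumption only the fixture (the three measured values) changes at full discharge; the decision function itself stays as already tested.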
Verification:
- cargo test -p aprender-train --lib models::llama_370m → 11/11 PASS
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml
→ "Contract is valid. 0 error(s), 0 warning(s)."
- cargo clippy -p aprender-train --lib → green
- cargo fmt -p aprender-train --check → green
Task #150.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2555eb1 to
057ec97
auto-merge was automatically disabled
April 24, 2026 11:42
Pull request was closed
Summary
Restacks the MODEL-2 FALSIFY-SHIP-020 PARTIAL discharge (originally PR #1005) onto SHIP-017 (PR #1032), which itself stacks on the MODEL-1 stack.
What this discharges
AC-SHIP2-010 ("`apr bench` decode ≥100 tok/s on RTX 4090 (370M target)") via new GATE-ARCH-370M-006 in `contracts/model-families/llama-370m-sovereign-v1.yaml` (v1.6.0 → v1.7.0, stays ACTIVE) with `discharge_status: PARTIAL_ALGORITHM_LEVEL`.
Decision rule: pure f32 threshold, "median decode throughput ≥ 100 tok/s on RTX 4090 for 370M target", bound in `crates/aprender-train/src/models/llama_370m.rs`:
- `AC_SHIP2_010_MIN_DECODE_TPS_RTX4090: f32 = 100.0`
- `verdict_from_decode_tps(measured_tps: f32) -> Ship020Verdict`
Full discharge blocks on real trained 370M `.apr` + 3 seed=0 `apr bench --tokens 128 --json` medians on RTX 4090.
Stacked on
`feat/falsify-ship-017-restacked` (PR #1032, feat(falsify-ship-017): MODEL-2 AC-SHIP2-007 PARTIAL discharge (restacked))
Spec bump
v2.34.0 → v2.35.0 (new amendment block; AC-SHIP2-010 table row marked **(PARTIAL_ALGORITHM_LEVEL v2.35.0)**).
Aggregate status
MODEL-2 coverage 5/12 → 6/12 touched. MODEL-1 remains fully saturated at 10/10 PARTIAL. Combined: 19 PARTIAL + 3 DISCHARGED across both models.
Verification
- `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/model-families/llama-370m-sovereign-v1.yaml` → Contract is valid, 0 errors
- `cargo test -p aprender-train --lib llama_370m` → 14/14 pass (including both new `falsify_ship_020_*` tests alongside all SHIP-011/017/019 tests)
Supersedes
Supersedes and closes #1005 (the DIRTY pre-stack version).
Test plan
🤖 Generated with Claude Code