feat(falsify-ship-017): MODEL-2 AC-SHIP2-007 PARTIAL discharge (restacked) #1032
Closed
noahgift wants to merge 2 commits into
This was referenced Apr 23, 2026
12298c5 to 42b12cf
…ce-v1 multi-bind
FALSIFY-SHIP-009 (AC-SHIP1-009 "MODEL-1 teacher license + data
provenance recorded in model.apr metadata") attains
PARTIAL_ALGORITHM_LEVEL by attaching a second binding to the same
C-APR-PROVENANCE contract that already discharges MODEL-2's
AC-SHIP2-012. The AprV2Metadata + serde-JSON decision rule is
model-agnostic, so one contract cleanly carries both discharges.
Changes:
- contracts/apr-provenance-v1.yaml v1.0.0 → v1.1.0 (stays ACTIVE):
new GATE-APR-PROV-004 block binds AC-SHIP1-009 / FALSIFY-SHIP-009
at PARTIAL_ALGORITHM_LEVEL with ship_blocking=true; full discharge
blocks on teacher .apr republish populating license, data_source,
data_license as named fields (PMAT-686 fixture-swap).
- crates/aprender-core/src/format/tests/provenance_tests.rs:
- falsify_ship_009_apr_metadata_applies_to_model_1_teacher —
teacher-representative round-trip (license="apache-2.0",
data_source="qwen2.5-coder-7b-instruct", data_license="apache-2.0").
- falsify_ship_009_gate_apr_prov_004_has_partial_discharge_marker —
include_str! YAML-binding assertion that the new gate has the
correct binds_to / falsification_id / discharge_status / flags.
- crates/aprender-core/Cargo.toml: add serde_yaml to [dev-dependencies]
(needed for the YAML-binding test).
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0
→ v2.24.0: new v2.24.0 amendment block documenting the first
MODEL-1 PARTIAL and first multi-model multi-bind on one contract.
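The teacher-representative round-trip described above can be sketched without any external crates. This is only an illustration under stated assumptions: the real test goes through `AprV2Metadata` and serde-JSON, whereas `ProvenanceFields`, `to_json`, `from_json`, and `field` below are hypothetical stand-ins with a hand-rolled encoder that assumes values contain no quotes or escapes. The three field values are the ones quoted in the commit message.

```rust
/// Hypothetical stand-in for the provenance portion of the metadata.
/// The real type (`AprV2Metadata`) and its serde derive are not shown in the source.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProvenanceFields {
    pub license: String,
    pub data_source: String,
    pub data_license: String,
}

impl ProvenanceFields {
    /// Encode as a flat JSON object. Assumes values need no escaping.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"license\":\"{}\",\"data_source\":\"{}\",\"data_license\":\"{}\"}}",
            self.license, self.data_source, self.data_license
        )
    }

    /// Decode the three named fields; returns None if any is missing.
    pub fn from_json(json: &str) -> Option<Self> {
        Some(Self {
            license: field(json, "license")?,
            data_source: field(json, "data_source")?,
            data_license: field(json, "data_license")?,
        })
    }
}

/// Extract the string value of `key` from a flat JSON object (sketch-level parser).
fn field(json: &str, key: &str) -> Option<String> {
    let needle = format!("\"{}\":\"", key);
    let start = json.find(&needle)? + needle.len();
    let end = json[start..].find('"')? + start;
    Some(json[start..end].to_string())
}

/// Teacher-representative values quoted in the commit message.
pub fn teacher_fixture() -> ProvenanceFields {
    ProvenanceFields {
        license: "apache-2.0".to_string(),
        data_source: "qwen2.5-coder-7b-instruct".to_string(),
        data_license: "apache-2.0".to_string(),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn provenance_round_trips() {
        let original = teacher_fixture();
        let decoded = ProvenanceFields::from_json(&original.to_json());
        assert_eq!(decoded, Some(original));
    }
}
```

The point of the round-trip is that the three provenance fields survive serialization as named fields, which is what the full discharge requires of the republished teacher `.apr`.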
Pattern extensions:
- First MODEL-1 PARTIAL (prior six targeted MODEL-2).
- First multi-model multi-bind on ONE contract (prior PARTIALs each
had a dedicated contract).
- Sixth falsification of the "exhausted" verdict: SHIP-019 →
SHIP-017 → SHIP-020 → SHIP-018 → SHIP-016 → SHIP-009 — sixth is
cross-model, strictly more surprising than the prior five.
All 5 provenance tests green (3 SHIP-022 + 2 SHIP-009).
Status after v2.24.0:
- MODEL-2: 3/12 ACTIVE + 7/12 PARTIAL = 10/12 touched (83.3%)
- MODEL-1: 9/10 DISCHARGED (via SHIP-TWO-001-MODEL-1-TEACHER tag) +
1/10 PARTIAL (009). Will flip to fully ACTIVE when PMAT-686
republishes teacher.apr with provenance fields populated.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
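A rough sketch of the YAML-binding assertion for GATE-APR-PROV-004 follows. It is not the test from this commit: the real test loads the contract with `include_str!` and parses it with `serde_yaml`, while this version uses a plain line scan so that it runs with the standard library only. `CONTRACT_EXCERPT`, its layout, the `GATE-EXAMPLE-OTHER` entry, and the `gate_field` helper are all assumptions made for illustration; only the four key/value pairs checked for GATE-APR-PROV-004 come from the commit message.

```rust
/// Hypothetical excerpt standing in for the contract file that the real
/// test embeds via `include_str!`. The layout is assumed, not copied.
pub const CONTRACT_EXCERPT: &str = r#"
gates:
  - id: GATE-EXAMPLE-OTHER
    discharge_status: DISCHARGED
  - id: GATE-APR-PROV-004
    binds_to: AC-SHIP1-009
    falsification_id: FALSIFY-SHIP-009
    discharge_status: PARTIAL_ALGORITHM_LEVEL
    ship_blocking: true
"#;

/// Return the value of `key` inside the list item whose `id` is `gate_id`.
/// A naive scan: it assumes one `key: value` pair per line and that each
/// gate starts with a `- id: ...` line.
pub fn gate_field<'a>(yaml: &'a str, gate_id: &str, key: &str) -> Option<&'a str> {
    let mut in_gate = false;
    for line in yaml.lines() {
        let trimmed = line.trim();
        // A new list item opens a new gate block; check whether it is ours.
        if let Some(item) = trimmed.strip_prefix("- ") {
            in_gate = item.trim().strip_prefix("id:").map(str::trim) == Some(gate_id);
            continue;
        }
        if !in_gate {
            continue;
        }
        if let Some((k, v)) = trimmed.split_once(':') {
            if k.trim() == key {
                return Some(v.trim());
            }
        }
    }
    None
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn gate_apr_prov_004_has_partial_discharge_marker() {
        let gate = "GATE-APR-PROV-004";
        assert_eq!(gate_field(CONTRACT_EXCERPT, gate, "binds_to"), Some("AC-SHIP1-009"));
        assert_eq!(gate_field(CONTRACT_EXCERPT, gate, "falsification_id"), Some("FALSIFY-SHIP-009"));
        assert_eq!(gate_field(CONTRACT_EXCERPT, gate, "discharge_status"), Some("PARTIAL_ALGORITHM_LEVEL"));
        assert_eq!(gate_field(CONTRACT_EXCERPT, gate, "ship_blocking"), Some("true"));
    }
}
```

Binding the contract text into the test binary means a YAML edit that drops or renames the gate fails `cargo test` rather than drifting silently from the Rust side.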
…EVEL discharge (task #149)
MODEL-2 (albor 370M Sovereign) gate #4 at PARTIAL: binds AC-SHIP2-007
("apr run produces syntactically valid Python on 100 held-out prompts")
to FALSIFY-SHIP-017 via new GATE-ARCH-370M-005 with
`discharge_status: PARTIAL_ALGORITHM_LEVEL`.
The decision rule ("≤ 1 SyntaxError tolerated out of 100, ≥ 2 is a
ship-blocker") is a pure integer threshold and is proven correct at
`cargo test` time today.
Full discharge (100-prompt `apr run` harness against a trained 370M
.apr) remains PENDING on pretraining compute-dispatch
(AC-SHIP2-003/004); the fixture swap is data-only, no harness rewrite
required.
Changes:
- crates/aprender-train/src/models/llama_370m.rs:
- Adds `AC_SHIP2_007_HELDOUT_PROMPT_COUNT` (=100) +
`AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS` (=1) consts mirroring the
spec §6 harness size and §8.3 FALSIFY-SHIP-017 tolerance.
- Adds `verdict_from_syntax_error_count(errors) -> Ship017Verdict`
const fn, the pure threshold.
- Adds `falsify_ship_017_syntax_error_count_threshold_logic`:
Pass boundary (0,1), Fail boundary (2,50,100), monotonicity sweep
∈ [0,100], and provenance pinning.
- Adds `falsify_ship_017_gate_arch_370m_005_has_partial_discharge_marker`:
binds sovereign contract YAML shape (falsification_id, binds_to,
discharge_status, evidence_discharged_by, full_discharge_blocks_on,
ship_blocking) to Rust tests via include_str!.
- contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0
(stays ACTIVE): adds GATE-ARCH-370M-005.
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0
→ v2.25.0 with amendment block: counter-example survey continues to
find new PARTIAL levers after two prior "exhausted" verdicts
(SHIP-015 → SHIP-019 → SHIP-017).
New status: 3/12 ACTIVE + 4/12 PARTIAL = 7/12 touched (58.3%).
Verification:
- cargo test -p aprender-train --lib llama_370m → 12/12 pass
(including both new falsify_ship_017_* tests)
- cargo clippy -p aprender-train --lib -- -D warnings → clean
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml
→ Contract is valid
Closes task #149.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
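The pure threshold at the heart of this commit can be sketched as below. The constant names, their values, the function signature, and the boundary cases (0, 1 pass; 2, 50, 100 fail; monotonicity over [0, 100]) are taken from the commit message. The `Pass`/`Fail` variant names and the test name are assumptions, since the source names only the `Ship017Verdict` type, and the provenance-pinning part of the real test is not reproduced.

```rust
/// Number of held-out prompts in the AC-SHIP2-007 harness.
pub const AC_SHIP2_007_HELDOUT_PROMPT_COUNT: usize = 100;
/// Largest SyntaxError count that still passes the gate.
pub const AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS: usize = 1;

/// Verdict of the FALSIFY-SHIP-017 decision rule.
/// Variant names are assumed for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship017Verdict {
    Pass,
    Fail,
}

/// Pure integer threshold: at most one SyntaxError is tolerated,
/// two or more is a ship-blocker.
pub const fn verdict_from_syntax_error_count(errors: usize) -> Ship017Verdict {
    if errors <= AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS {
        Ship017Verdict::Pass
    } else {
        Ship017Verdict::Fail
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn threshold_boundaries_and_monotonicity() {
        // Pass boundary.
        for errors in [0, 1] {
            assert_eq!(verdict_from_syntax_error_count(errors), Ship017Verdict::Pass);
        }
        // Fail boundary.
        for errors in [2, 50, 100] {
            assert_eq!(verdict_from_syntax_error_count(errors), Ship017Verdict::Fail);
        }
        // Monotonicity: once the verdict flips to Fail it never returns to Pass.
        let mut seen_fail = false;
        for errors in 0..=AC_SHIP2_007_HELDOUT_PROMPT_COUNT {
            match verdict_from_syntax_error_count(errors) {
                Ship017Verdict::Fail => seen_fail = true,
                Ship017Verdict::Pass => assert!(!seen_fail, "Pass after Fail at {errors}"),
            }
        }
    }
}
```

Because the function is a `const fn` over an integer, the rule can be checked today without a trained model; the later fixture swap only changes where the error count comes from.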
42b12cf to 166c305
auto-merge was automatically disabled
April 24, 2026 11:42
Pull request was closed
Summary
Restacks the MODEL-2 FALSIFY-SHIP-017 PARTIAL discharge (originally PR #1004) onto the completed MODEL-1 stack (SHIP-009 at v2.33.0 → this PR at v2.34.0).
What this discharges
AC-SHIP2-007 ("`apr run` produces syntactically valid Python on 100 held-out prompts") via new GATE-ARCH-370M-005 in `contracts/model-families/llama-370m-sovereign-v1.yaml` (v1.5.0 → v1.6.0, stays ACTIVE) with `discharge_status: PARTIAL_ALGORITHM_LEVEL`.
Decision rule: "≤ 1 SyntaxError tolerated out of 100 held-out prompts, ≥ 2 is a ship-blocker", a pure integer threshold bound in `crates/aprender-train/src/models/llama_370m.rs`:
- `AC_SHIP2_007_HELDOUT_PROMPT_COUNT = 100`
- `AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS = 1`
- `verdict_from_syntax_error_count(errors: usize) -> Ship017Verdict`
- (… `include_str!`)
Full discharge blocks on real trained 370M `.apr` + 100-prompt `apr run` harness (AC-SHIP2-003/004 pretraining compute-dispatch): fixture swap only, no harness rewrite.
Stacked on
`main@57adc9f0c` (SHIP-009 via PR "feat(falsify-ship-009): MODEL-1 apr-provenance multi-bind PARTIAL discharge (10/10, last MODEL-1 row)" #1031)
Spec bump
v2.33.0 → v2.34.0 (amendment block rewritten; AC-SHIP2-007 table row marked **(PARTIAL_ALGORITHM_LEVEL v2.34.0)**).
Aggregate status
MODEL-2 coverage 4/12 → 5/12 touched. MODEL-1 remains fully saturated at 10/10 PARTIAL. Combined: 18 PARTIAL + 3 DISCHARGED across both models.
Verification
- `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/model-families/llama-370m-sovereign-v1.yaml` → Contract is valid, 0 errors
- `cargo test -p aprender-train --lib llama_370m` → 12/12 pass (including both new `falsify_ship_017_*` tests)
Supersedes
Supersedes and closes #1004 (the DIRTY pre-stack version).
Test plan
🤖 Generated with Claude Code