feat(falsify-ship-016): MODEL-2 AC-SHIP2-006 PARTIAL discharge (restacked) #1035

Closed: noahgift wants to merge 5 commits into main from feat/falsify-ship-016-restacked

Summary

Restacks the MODEL-2 FALSIFY-SHIP-016 PARTIAL discharge (originally PR #1008) onto SHIP-018 (PR #1034) which itself stacks on SHIP-020 (PR #1033), SHIP-017 (PR #1032), and the MODEL-1 stack.

What this discharges

AC-SHIP2-006 "apr qa <model>.apr — all 8 gates PASS" is discharged via the new GATE-ARCH-370M-008 in contracts/model-families/llama-370m-sovereign-v1.yaml (v1.8.0 → v1.9.0, stays ACTIVE) with discharge_status: PARTIAL_ALGORITHM_LEVEL.

Decision rule: pure aggregate-AND over 8 Boolean gate-results (golden_output / throughput / ollama_parity / gpu_vs_cpu_speedup / tensor_contract / cross_format_parity / ptx_parity / probar) bound in crates/aprender-train/src/models/llama_370m.rs:

  • AC_SHIP2_006_REQUIRED_QA_GATE_COUNT: usize = 8
  • verdict_from_qa_gates(gate_results: &[bool]) -> Ship016Verdict
  • 2 falsification tests: exhaustive 2^8 = 256-combination sweep + YAML shape bind

Full discharge blocks on real trained 370M .apr + apr qa <model>.apr exit 0 with all 8 gates green on RTX 4090.
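The aggregate-AND rule described above can be sketched as follows. This is a minimal illustration, not the code in llama_370m.rs: the constant and function names come from this PR, while the `Fail` variant, the derives, and the fail-closed handling of a wrong slice length are assumptions inferred from the commit messages.

```rust
/// Number of `apr qa` gates AC-SHIP2-006 requires (value taken from the PR text).
pub const AC_SHIP2_006_REQUIRED_QA_GATE_COUNT: usize = 8;

/// Verdict type named in the PR; the `Fail` variant and derives are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship016Verdict {
    Pass,
    Fail,
}

/// Aggregate-AND: Pass only when exactly 8 results are supplied and all are true.
pub fn verdict_from_qa_gates(gate_results: &[bool]) -> Ship016Verdict {
    // A wrong slice length suggests the harness and the contract have drifted,
    // so this sketch fails closed even if every supplied value is true.
    if gate_results.len() != AC_SHIP2_006_REQUIRED_QA_GATE_COUNT {
        return Ship016Verdict::Fail;
    }
    if gate_results.iter().all(|&ok| ok) {
        Ship016Verdict::Pass
    } else {
        Ship016Verdict::Fail
    }
}
```

Under these assumptions, the eventual fixture swap would only change where the eight booleans come from; the function itself would stay as is.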

Stacked on

Spec bump

v2.36.0 → v2.37.0 (new amendment block; AC-SHIP2-006 table row marked **(PARTIAL_ALGORITHM_LEVEL v2.37.0)**).

Aggregate status

MODEL-2 coverage 7/12 → 8/12 touched. MODEL-1 still fully saturated at 10/10 PARTIAL. Combined: 21 PARTIAL + 3 DISCHARGED across both models. This completes the compute-free MODEL-2 PARTIAL harvest within the restacking window — remaining 2 gates (003 val loss ≤ 2.2 / 004 ≤21-day wall-clock) are genuinely compute-bound.

Verification

  • cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/model-families/llama-370m-sovereign-v1.yaml → Contract is valid, 0 errors
  • cargo test -p aprender-train --lib llama_370m → 18/18 pass (all prior + both new falsify_ship_016_* tests)

Supersedes

Supersedes and closes #1008 (the DIRTY pre-stack version).

Test plan

  • Contract validates clean
  • All new tests green
  • CI gate + workspace-test pass

🤖 Generated with Claude Code

noahgift and others added 5 commits April 24, 2026 12:40
…ce-v1 multi-bind

FALSIFY-SHIP-009 (AC-SHIP1-009 "MODEL-1 teacher license + data
provenance recorded in model.apr metadata") attains
PARTIAL_ALGORITHM_LEVEL by attaching a second binding to the same
C-APR-PROVENANCE contract that already discharges MODEL-2's
AC-SHIP2-012. The AprV2Metadata + serde-JSON decision rule is
model-agnostic, so one contract cleanly carries both discharges.

Changes:
- contracts/apr-provenance-v1.yaml v1.0.0 → v1.1.0 (stays ACTIVE):
  new GATE-APR-PROV-004 block binds AC-SHIP1-009 / FALSIFY-SHIP-009
  at PARTIAL_ALGORITHM_LEVEL with ship_blocking=true; full discharge
  blocks on teacher .apr republish populating license, data_source,
  data_license as named fields (PMAT-686 fixture-swap).
- crates/aprender-core/src/format/tests/provenance_tests.rs:
  - falsify_ship_009_apr_metadata_applies_to_model_1_teacher —
    teacher-representative round-trip (license="apache-2.0",
    data_source="qwen2.5-coder-7b-instruct", data_license="apache-2.0").
  - falsify_ship_009_gate_apr_prov_004_has_partial_discharge_marker —
    include_str! YAML-binding assertion that the new gate has the
    correct binds_to / falsification_id / discharge_status / flags.
- crates/aprender-core/Cargo.toml: add serde_yaml to [dev-dependencies]
  (needed for the YAML-binding test).
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0
  → v2.24.0: new v2.24.0 amendment block documenting the first
  MODEL-1 PARTIAL and first multi-model multi-bind on one contract.

Pattern extensions:
- First MODEL-1 PARTIAL (prior six targeted MODEL-2).
- First multi-model multi-bind on ONE contract (prior PARTIALs each
  had a dedicated contract).
- Sixth falsification of the "exhausted" verdict: SHIP-019 →
  SHIP-017 → SHIP-020 → SHIP-018 → SHIP-016 → SHIP-009 — sixth is
  cross-model, strictly more surprising than the prior five.

All 5 provenance tests green (3 SHIP-022 + 2 SHIP-009).

Status after v2.24.0:
- MODEL-2: 3/12 ACTIVE + 7/12 PARTIAL = 10/12 touched (83.3%)
- MODEL-1: 9/10 DISCHARGED (via SHIP-TWO-001-MODEL-1-TEACHER tag) +
  1/10 PARTIAL (009). Will flip to fully ACTIVE when PMAT-686
  republishes teacher.apr with provenance fields populated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…EVEL discharge (task #149)

MODEL-2 (albor 370M Sovereign) gate #4 at PARTIAL: binds
AC-SHIP2-007 ("apr run produces syntactically valid Python on 100
held-out prompts") to FALSIFY-SHIP-017 via new GATE-ARCH-370M-005
with `discharge_status: PARTIAL_ALGORITHM_LEVEL`.

The decision rule — "≤ 1 SyntaxError tolerated out of 100, ≥ 2 is
a ship-blocker" — is a pure integer threshold and is proven correct
at `cargo test` time today. Full discharge (100-prompt `apr run`
harness against a trained 370M .apr) remains PENDING on pretraining
compute-dispatch (AC-SHIP2-003/004) — fixture swap is data-only, no
harness rewrite required.
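A minimal sketch of the integer threshold described above. The constant and function names are the ones this commit lists; the `usize` types, the `Fail` variant, and the derives are assumptions, and the real code may differ.

```rust
/// Held-out prompt count from the spec (value taken from the commit message).
pub const AC_SHIP2_007_HELDOUT_PROMPT_COUNT: usize = 100;
/// Maximum number of SyntaxErrors tolerated before the gate blocks shipping.
pub const AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS: usize = 1;

/// Verdict type named in the commit; the variants and derives are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship017Verdict {
    Pass,
    Fail,
}

/// Pure threshold: 0 or 1 errors pass, 2 or more fail.
pub const fn verdict_from_syntax_error_count(errors: usize) -> Ship017Verdict {
    if errors <= AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS {
        Ship017Verdict::Pass
    } else {
        Ship017Verdict::Fail
    }
}
```

Because the function is a `const fn` over a plain integer, it can be checked at `cargo test` time without any model artifact.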

Changes:
- crates/aprender-train/src/models/llama_370m.rs:
  - Adds `AC_SHIP2_007_HELDOUT_PROMPT_COUNT` (=100) +
    `AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS` (=1) consts mirroring
    the spec §6 harness size and §8.3 FALSIFY-SHIP-017 tolerance.
  - Adds `verdict_from_syntax_error_count(errors) -> Ship017Verdict`
    const fn — the pure threshold.
  - Adds `falsify_ship_017_syntax_error_count_threshold_logic` —
    Pass boundary (0,1), Fail boundary (2,50,100), monotonicity
    sweep ∈ [0,100], and provenance pinning.
  - Adds `falsify_ship_017_gate_arch_370m_005_has_partial_discharge_marker`
    — binds sovereign contract YAML shape (falsification_id,
    binds_to, discharge_status, evidence_discharged_by,
    full_discharge_blocks_on, ship_blocking) to Rust tests via
    include_str!.
- contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 →
  v1.6.0 (stays ACTIVE): adds GATE-ARCH-370M-005.
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0
  → v2.25.0 with amendment block: counter-example survey continues
  to find new PARTIAL levers after two prior "exhausted" verdicts
  (SHIP-015 → SHIP-019 → SHIP-017). New status: 3/12 ACTIVE + 4/12
  PARTIAL = 7/12 touched (58.3%).
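The YAML-shape binding in the list above can be illustrated with a std-only sketch. The real test reportedly uses include_str! on the contract file (and another commit adds serde_yaml for a similar test); here a hand-rolled line scan stands in for a YAML parser, and the excerpt layout, including the `- id:` key, is hypothetical because the contract file itself is not shown in this PR.

```rust
/// Hypothetical excerpt; the real contract layout is not shown in this PR.
const CONTRACT_EXCERPT: &str = "\
gates:
  - id: GATE-ARCH-370M-005
    falsification_id: FALSIFY-SHIP-017
    binds_to: AC-SHIP2-007
    discharge_status: PARTIAL_ALGORITHM_LEVEL
    ship_blocking: true
";

/// Returns the value of `key` inside the block that starts at `- id: <gate_id>`.
/// This is a naive line scan for illustration, not a YAML parser.
fn gate_field<'a>(yaml: &'a str, gate_id: &str, key: &str) -> Option<&'a str> {
    let mut in_gate = false;
    for line in yaml.lines() {
        let trimmed = line.trim();
        // A new `- id:` line opens a gate block; track whether it is the one we want.
        if let Some(id) = trimmed.strip_prefix("- id:") {
            in_gate = id.trim() == gate_id;
            continue;
        }
        if in_gate {
            if let Some((k, v)) = trimmed.split_once(':') {
                if k.trim() == key {
                    return Some(v.trim());
                }
            }
        }
    }
    None
}
```

A binding test in this style would assert that `gate_field(CONTRACT_EXCERPT, "GATE-ARCH-370M-005", "discharge_status")` yields `PARTIAL_ALGORITHM_LEVEL`, so that editing the contract without updating the Rust side fails at test time.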

Verification:
- cargo test -p aprender-train --lib llama_370m → 12/12 pass
  (including both new falsify_ship_017_* tests)
- cargo clippy -p aprender-train --lib -- -D warnings → clean
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml
  → Contract is valid

Closes task #149.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Binds AC-SHIP2-010 (inference decode throughput ≥ 100 tok/s on RTX
4090) to a new GATE-ARCH-370M-006 in the sovereign contract via a pure
f32 threshold fn + two unit tests. The compute-heavy half (`apr bench`
on a real trained 370M .apr) is deferred to AC-SHIP2-003/004
compute-dispatch; the decision rule itself is proven today.

Changes:
- crates/aprender-train/src/models/llama_370m.rs:
  * AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 = 100.0 (const floor)
  * Ship020Verdict { Pass, Fail }
  * verdict_from_decode_tps(f32) -> Ship020Verdict (fn, non-finite → Fail)
  * falsify_ship_020_decode_tps_threshold_logic (5 invariants:
    Pass boundary, Fail boundary at one f32 ULP, monotonicity in
    both directions, conservative Fail for NaN/±∞, provenance
    pinning that the const stays = 100.0)
  * falsify_ship_020_gate_arch_370m_006_has_partial_discharge_marker
    (contract parses + advertises PARTIAL_ALGORITHM_LEVEL +
    evidence_discharged_by populated + full_discharge_blocks_on
    documented + ship_blocking:true)

- contracts/model-families/llama-370m-sovereign-v1.yaml:
  * v1.5.0 → v1.6.0, stays ACTIVE
  * New GATE-ARCH-370M-006 binding AC-SHIP2-010 ↔ FALSIFY-SHIP-020
    with discharge_status: PARTIAL_ALGORITHM_LEVEL

- docs/specifications/aprender-train/ship-two-models-spec.md:
  * v2.23.0 → v2.26.0 with amendment block
  * MODEL-2 ship-gate status updated: 3/12 ACTIVE + 5/12 PARTIAL =
    8/12 touched (66.7%)

- crates/aprender-train/src/train/device.rs:
  * 2 pre-existing fmt fixes (6 lines of whitespace) — restores
    `cargo fmt -p aprender-train --check` green. Pre-existing on
    origin/main; kept in this PR under Toyota Way "all defects are
    your defects" rule.
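The f32 threshold listed above might look like the sketch below. The constant, enum, and function names are from this commit; the derives and the `one_ulp_below` helper are illustrative assumptions, added only to show how a "Fail boundary at one f32 ULP" check could be built.

```rust
/// Decode-throughput floor in tokens per second (value taken from the commit message).
pub const AC_SHIP2_010_MIN_DECODE_TPS_RTX4090: f32 = 100.0;

/// Verdict type listed in the commit; the derives are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship020Verdict {
    Pass,
    Fail,
}

/// Pass when the measured throughput is finite and at or above the floor.
pub fn verdict_from_decode_tps(tps: f32) -> Ship020Verdict {
    // NaN and ±∞ are treated as broken measurements and fail conservatively.
    if tps.is_finite() && tps >= AC_SHIP2_010_MIN_DECODE_TPS_RTX4090 {
        Ship020Verdict::Pass
    } else {
        Ship020Verdict::Fail
    }
}

/// Hypothetical helper: the largest f32 strictly below `x`.
/// Only valid for positive, finite, non-zero `x`.
pub fn one_ulp_below(x: f32) -> f32 {
    f32::from_bits(x.to_bits() - 1)
}
```

With this shape, `verdict_from_decode_tps(100.0)` passes while `verdict_from_decode_tps(one_ulp_below(100.0))` fails, which pins the comparison as an inclusive `>=`.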

Pattern lesson: v2.22.0 declared MODEL-2 non-compute PARTIAL levers
"exhausted" — re-running the counter-example survey has now falsified
that verdict three times (SHIP-019 → SHIP-017 → SHIP-020). When a
SHIP gate names a threshold / tolerance / ratio / cut-off and the
compute-heavy harness is separable from the decision function, the
threshold fn can land today at unit-test time — even when the full
end-to-end harness is blocked on compute.

Full discharge blocks on: real 370M .apr from AC-SHIP2-003/004
compute-dispatch + three independent `apr bench --tokens 128 --json`
medians on RTX 4090 host. Fixture-swap only — no decision-rule rewrite.

Verification:
- cargo test -p aprender-train --lib models::llama_370m → 11/11 PASS
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml
  → "Contract is valid. 0 error(s), 0 warning(s)."
- cargo clippy -p aprender-train --lib → green
- cargo fmt -p aprender-train --check → green

Task #150.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
AC-SHIP2-008 / FALSIFY-SHIP-018 bound via new GATE-ARCH-370M-007 at
PARTIAL_ALGORITHM_LEVEL. Pure two-number threshold fn
`verdict_from_pass_at_1(correct, total, threshold_pct)` + const
`AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT = 30.0` in
crates/aprender-train/src/models/llama_370m.rs — proves the spec's
'HumanEval pass@1 ≥ 30.0%' decision rule at `cargo test` time,
independent of a trained artifact. Two unit tests prove:

  - boundary (f32-exact 50/100 = 50.0% with ±ULP shift showing `>=`
    is inclusive; 49/164 and 29/100 fail the 30.0 floor)
  - monotonicity (correct sweep 0..=164 at total=164 never flips
    Pass → Fail)
  - div-safety (total=0 fails closed) + sanity (correct>total fails)
  - non-finite threshold guard (NaN / ±∞ all Fail)
  - provenance pin (const stays = 30.0)
  - YAML marker (GATE-ARCH-370M-007 carries PARTIAL_ALGORITHM_LEVEL,
    binds AC-SHIP2-008, cites FALSIFY-SHIP-018, ship_blocking:true)
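The invariants listed above suggest a function shaped roughly like the following sketch. The constant and function names are from this commit; the `Ship018Verdict` name (by analogy with the other gates), the parameter types, and the order of the guards are assumptions.

```rust
/// HumanEval pass@1 floor in percent (value taken from the commit message).
pub const AC_SHIP2_008_MIN_HUMANEVAL_PASS_AT_1_PCT: f32 = 30.0;

/// Assumed verdict type, named by analogy with the other gates in this stack.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship018Verdict {
    Pass,
    Fail,
}

/// Pass when `correct / total`, as a percentage, is at or above `threshold_pct`.
pub fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship018Verdict {
    // Fail closed on an empty run, an impossible count, or a non-finite threshold.
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship018Verdict::Fail;
    }
    let pct = 100.0 * (correct as f32) / (total as f32);
    if pct >= threshold_pct {
        Ship018Verdict::Pass
    } else {
        Ship018Verdict::Fail
    }
}
```

For example, 49 of 164 is about 29.9% and fails the 30.0 floor, while 50 of 164 is about 30.5% and passes.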

Full discharge blocks on real 370M .apr (AC-SHIP2-003/004 compute)
+ three seed=0 `apr eval --benchmark humaneval --json` median
pass@1 values fed into the verdict fn — all three must Pass.
Fixture-swap only; no harness rewrite.

6th PARTIAL for MODEL-2 (after SHIP-012/015/017/019/020). Spec
v2.22.0's 'exhausted' verdict now falsified 4×. Remaining 5th-PARTIAL
candidate: SHIP-016 (`apr qa` 8-of-8 aggregate — not a single
threshold). SHIP-013/014 genuinely need real compute.

Contract: llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0 (stays ACTIVE).
Spec: ship-two-models-spec.md v2.23.0 → v2.24.0 (amendment block).
Also: 6-line pre-existing fmt fix in train/device.rs under Toyota
Way "all defects are your defects" (same pattern as PR #1005).

Status: MODEL-2 ship-gates 3/12 ACTIVE + 6/12 PARTIAL = 9/12 touched
(75.0%). Remaining 3 (003/004/006) all need real 370M compute.

Tests: cargo test -p aprender-train --lib models::llama_370m → 11/11
pass. `pv validate contracts/model-families/llama-370m-sovereign-v1.yaml`
→ Contract is valid. cargo fmt -p aprender-train --check → clean.
cargo clippy -p aprender-train --lib -- -D warnings → clean.

Refs: SHIP-TWO-001, task #151, FALSIFY-SHIP-018.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ND verdict fn

Wires GATE-ARCH-370M-008 (AC-SHIP2-006) to a pure
verdict_from_qa_gates(&[bool]) -> Ship016Verdict aggregate-AND fn in
aprender-train/src/models/llama_370m.rs, proven today by exhaustive
2^8 = 256-combination sweep + single-gate-flip falsifiability +
monotonicity + 3 contract-drift guards (slice length 0/7/9/16 → Fail
even when all-true). Discharge marker: PARTIAL_ALGORITHM_LEVEL.
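The proof obligations named above (exhaustive sweep, single-gate flip, monotonicity, length guards) can be sketched as small checker functions. Everything here is illustrative: `all_gates_pass` is a boolean stand-in for verdict_from_qa_gates, and the helper names are hypothetical rather than taken from the repository's tests.

```rust
const GATE_COUNT: usize = 8;

/// Stand-in for the aggregate-AND verdict: true only for exactly 8 true results.
fn all_gates_pass(gates: &[bool]) -> bool {
    gates.len() == GATE_COUNT && gates.iter().all(|&g| g)
}

/// Decodes bit `i` of `mask` as the result of gate `i`.
fn gates_from_mask(mask: u16) -> [bool; GATE_COUNT] {
    let mut gates = [false; GATE_COUNT];
    for (i, gate) in gates.iter_mut().enumerate() {
        *gate = ((mask >> i) & 1) == 1;
    }
    gates
}

/// Exhaustive sweep over all 2^8 = 256 combinations, counting the passing ones.
fn count_passing_masks() -> usize {
    (0u16..256)
        .filter(|&mask| all_gates_pass(&gates_from_mask(mask)))
        .count()
}

/// Single-gate flip: turning any one gate red in an all-green run must fail.
fn single_flip_always_fails() -> bool {
    (0..GATE_COUNT).all(|i| {
        let mut gates = [true; GATE_COUNT];
        gates[i] = false;
        !all_gates_pass(&gates)
    })
}

/// Monotonicity: turning an extra gate green never changes a pass into a fail.
fn pass_is_monotone() -> bool {
    (0u16..256).all(|mask| {
        (0..GATE_COUNT).all(|i| {
            let with_extra = mask | (1u16 << i);
            !all_gates_pass(&gates_from_mask(mask)) || all_gates_pass(&gates_from_mask(with_extra))
        })
    })
}

/// Length guards: all-true slices of the wrong length must still fail.
fn wrong_lengths_fail() -> bool {
    [0usize, 7, 9, 16].iter().all(|&n| !all_gates_pass(&vec![true; n]))
}
```

In this sketch exactly one of the 256 masks (all bits set) passes, which is the property an exhaustive sweep makes cheap to verify for an eight-input Boolean rule.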

Pattern note: SHIP-016 is the first aggregate-AND shape —
SHIP-017/018/020 were single-threshold shapes. The proof pattern now
covers two distinct decision-rule shapes, confirming
decision-rule/compute-harness separation is a reusable pattern, not a
one-off.

**5th PARTIAL after "exhausted" verdict falsified 4× already**
(SHIP-019 → SHIP-017 → SHIP-020 → SHIP-018 → SHIP-016).

**MODEL-2 ship-gate coverage: 3/12 ACTIVE + 7/12 PARTIAL = 10/12
touched (83.3%).** Remaining 2 truly compute-blocked (003 CE ≤ 2.2,
004 ≤21-day wall-clock) have no fixture-swap trick.

Changes:
- contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 → v1.6.0
  (GATE-ARCH-370M-008 block added; stays ACTIVE)
- crates/aprender-train/src/models/llama_370m.rs:
  + AC_SHIP2_006_REQUIRED_QA_GATE_COUNT = 8 const
  + Ship016Verdict enum
  + verdict_from_qa_gates(&[bool]) pure fn with aggregate-AND
  + falsify_ship_016_apr_qa_aggregate_and_logic test (2^8 sweep +
    single-gate-flip + monotonicity + 3 contract-drift guards)
  + falsify_ship_016_gate_arch_370m_008_has_partial_discharge_marker
    test (YAML binding: binds_to AC-SHIP2-006, falsification_id
    FALSIFY-SHIP-016, discharge_status PARTIAL_ALGORITHM_LEVEL)
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0
  → v2.24.0 (amendment block documenting 5th PARTIAL, first
  aggregate-AND shape)
- crates/aprender-train/src/train/device.rs: pre-existing fmt fixes
  bundled per Toyota Way "all defects are your defects"

Full discharge blocks on: real 370M .apr from AC-SHIP2-003/004
compute-dispatch + 8-gate apr qa harness invocation with exit 0 →
feed the 8 gate-result booleans into verdict_from_qa_gates and
require Ship016Verdict::Pass. Fixture-swap only — no harness rewrite.

Refs #152

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift

Superseded by #1044 — 11-PR cascade collapsed into single squash-merge to avoid O(n²) rebase treadmill. Content identical; this branch's commit is in #1044.

@noahgift noahgift closed this Apr 24, 2026
auto-merge was automatically disabled April 24, 2026 11:42

Pull request was closed
