feat(ship-two-001): FALSIFY-SHIP-017 AC-SHIP2-007 PARTIAL_ALGORITHM_LEVEL discharge (task #149) #1004

Closed
noahgift wants to merge 1 commit into main from feat/falsify-ship-017-partial-discharge

@noahgift

Summary

  • Binds AC-SHIP2-007 ("apr run produces syntactically valid Python on 100 held-out prompts") to FALSIFY-SHIP-017 via new GATE-ARCH-370M-005 in contracts/model-families/llama-370m-sovereign-v1.yaml (v1.5.0 → v1.6.0, stays ACTIVE) with discharge_status: PARTIAL_ALGORITHM_LEVEL.
  • Adds a pure const fn verdict_from_syntax_error_count(errors: usize) -> Ship017Verdict + two unit tests in crates/aprender-train/src/models/llama_370m.rs proving the threshold rule today.
  • MODEL-2 ship-gate status: 3/12 ACTIVE + 4/12 PARTIAL = 7/12 touched (58.3%).

The algorithm-level proof

The decision rule — "≤ 1 SyntaxError tolerated out of 100, ≥ 2 is a ship-blocker" — is a pure integer threshold and is proven correct at cargo test time.
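
As a rough illustration of the threshold rule, the sketch below reuses the constant and function names quoted in this PR. The `Pass`/`Fail` variant names, the derives, and the `usize` type of the constants are assumptions, not the contents of `llama_370m.rs`.

```rust
/// Verdict for FALSIFY-SHIP-017. The variant names are assumed for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship017Verdict {
    Pass,
    Fail,
}

/// Held-out prompt count named in AC-SHIP2-007.
pub const AC_SHIP2_007_HELDOUT_PROMPT_COUNT: usize = 100;

/// Largest SyntaxError count that still passes the gate.
pub const AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS: usize = 1;

/// Pure integer threshold: at most 1 error passes, 2 or more is a ship-blocker.
#[must_use]
pub const fn verdict_from_syntax_error_count(errors: usize) -> Ship017Verdict {
    if errors <= AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS {
        Ship017Verdict::Pass
    } else {
        Ship017Verdict::Fail
    }
}
```

Because the function is a `const fn` over a single integer, the boundary cases (0, 1, 2) and a sweep over the full held-out range can be checked without a trained model.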

Two tests cover it:

  1. falsify_ship_017_syntax_error_count_threshold_logic — Pass boundary (0, 1 errors), Fail boundary (2 errors), pathological cases (50, 100 errors all Fail), monotonicity sweep over all errors ∈ [0, 100], and provenance pinning (AC_SHIP2_007_HELDOUT_PROMPT_COUNT == 100, AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS == 1).
  2. falsify_ship_017_gate_arch_370m_005_has_partial_discharge_marker — binds the sovereign contract YAML shape (falsification_id, binds_to, discharge_status, evidence_discharged_by, full_discharge_blocks_on, ship_blocking) to the Rust tests via include_str!.
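
The contract-shape binding in the second test could look roughly like the sketch below. The six key names come from the list above. The helper `missing_gate_keys`, the inline `SAMPLE_GATE_YAML` fragment, and its placeholder values are hypothetical stand-ins. The real test is described as loading the contract file with `include_str!`, which is not reproduced here.

```rust
/// Keys this PR says the gate entry must carry.
const REQUIRED_GATE_KEYS: [&str; 6] = [
    "falsification_id",
    "binds_to",
    "discharge_status",
    "evidence_discharged_by",
    "full_discharge_blocks_on",
    "ship_blocking",
];

/// Hypothetical stand-in for the contract YAML. Layout and placeholder
/// values are assumptions, not the real file contents.
const SAMPLE_GATE_YAML: &str = "\
- id: GATE-ARCH-370M-005
  falsification_id: FALSIFY-SHIP-017
  binds_to: AC-SHIP2-007
  discharge_status: PARTIAL_ALGORITHM_LEVEL
  evidence_discharged_by: placeholder
  full_discharge_blocks_on: placeholder
  ship_blocking: placeholder
";

/// Returns every required key that does not start some line of `yaml`
/// (after leading whitespace). A plain text scan, not a YAML parser.
fn missing_gate_keys(yaml: &str) -> Vec<&'static str> {
    REQUIRED_GATE_KEYS
        .iter()
        .copied()
        .filter(|key| {
            let prefix = format!("{key}:");
            !yaml
                .lines()
                .any(|line| line.trim_start().starts_with(prefix.as_str()))
        })
        .collect()
}
```

A test built on this would assert that `missing_gate_keys` returns an empty list for the contract text, so dropping or renaming a key in the YAML fails `cargo test`.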

Full discharge (100-prompt apr run harness against a trained 370M .apr) blocks on pretraining compute-dispatch (AC-SHIP2-003/004). Fixture swap is data-only — no harness rewrite required.

Pattern confirmation

This is the 4th PARTIAL lever found after two prior "harvesting exhausted" verdicts (SHIP-015 → SHIP-019 → SHIP-017). The spec v2.25.0 amendment records this — re-running the counter-example survey continues to find new levers, so "exhausted" should be treated as provisional.

Test plan

  • cargo test -p aprender-train --lib llama_370m → 12/12 pass (both new falsify_ship_017_* tests green)
  • cargo clippy -p aprender-train --lib -- -D warnings → clean
  • pv validate contracts/model-families/llama-370m-sovereign-v1.yaml → Contract is valid
  • CI green on self-hosted runners (workspace-test + ci/gate)

Files changed

  • contracts/model-families/llama-370m-sovereign-v1.yaml — adds GATE-ARCH-370M-005; version bump v1.5.0 → v1.6.0
  • crates/aprender-train/src/models/llama_370m.rs — adds threshold fn, consts, 2 tests
  • docs/specifications/aprender-train/ship-two-models-spec.md — v2.23.0 → v2.25.0 with amendment block

Closes task #149.

🤖 Generated with Claude Code

…EVEL discharge (task #149)

MODEL-2 (albor 370M Sovereign) gate #4 at PARTIAL: binds
AC-SHIP2-007 ("apr run produces syntactically valid Python on 100
held-out prompts") to FALSIFY-SHIP-017 via new GATE-ARCH-370M-005
with `discharge_status: PARTIAL_ALGORITHM_LEVEL`.

The decision rule — "≤ 1 SyntaxError tolerated out of 100, ≥ 2 is
a ship-blocker" — is a pure integer threshold and is proven correct
at `cargo test` time today. Full discharge (100-prompt `apr run`
harness against a trained 370M .apr) remains PENDING on pretraining
compute-dispatch (AC-SHIP2-003/004) — fixture swap is data-only, no
harness rewrite required.

Changes:
- crates/aprender-train/src/models/llama_370m.rs:
  - Adds `AC_SHIP2_007_HELDOUT_PROMPT_COUNT` (=100) +
    `AC_SHIP2_007_MAX_TOLERATED_SYNTAX_ERRORS` (=1) consts mirroring
    the spec §6 harness size and §8.3 FALSIFY-SHIP-017 tolerance.
  - Adds `verdict_from_syntax_error_count(errors) -> Ship017Verdict`
    const fn — the pure threshold.
  - Adds `falsify_ship_017_syntax_error_count_threshold_logic` —
    Pass boundary (0,1), Fail boundary (2,50,100), monotonicity
    sweep ∈ [0,100], and provenance pinning.
  - Adds `falsify_ship_017_gate_arch_370m_005_has_partial_discharge_marker`
    — binds sovereign contract YAML shape (falsification_id,
    binds_to, discharge_status, evidence_discharged_by,
    full_discharge_blocks_on, ship_blocking) to Rust tests via
    include_str!.
- contracts/model-families/llama-370m-sovereign-v1.yaml v1.5.0 →
  v1.6.0 (stays ACTIVE): adds GATE-ARCH-370M-005.
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0
  → v2.25.0 with amendment block: counter-example survey continues
  to find new PARTIAL levers after two prior "exhausted" verdicts
  (SHIP-015 → SHIP-019 → SHIP-017). New status: 3/12 ACTIVE + 4/12
  PARTIAL = 7/12 touched (58.3%).

Verification:
- cargo test -p aprender-train --lib llama_370m → 12/12 pass
  (including both new falsify_ship_017_* tests)
- cargo clippy -p aprender-train --lib -- -D warnings → clean
- pv validate contracts/model-families/llama-370m-sovereign-v1.yaml
  → Contract is valid

Closes task #149.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 23, 2026
…@1 ≥86.00% (1.2 pp noise → 84.80%) (#1021)

* feat(ship-two-001): FALSIFY-SHIP-005 PARTIAL discharge — MODEL-1 HumanEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%) clean-branch rebuild

Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005)
to a pure two-number threshold verdict fn. Clean-branch rebuild of the
SHIP-005 delta from the now-superseded stacked PR #1015 (commit
8c497a0 which layered on top of SHIP-007 PR #1019 prior to SHIP-002
landing). Re-based directly on current main (f615148) so SHIP-005
stands alone — SHIP-007 (#1019) remains open blocked on infra defect
#1020 (runner-disk-guard cross-runner race) and SHIP-018 (#1004) still
in flight.

MODEL-1 AC-SHIP1 coverage on main: 3/10 touched (SHIP-009 + SHIP-008 +
SHIP-006) pre-SHIP-002; 4/10 once SHIP-002 landed via f615148;
5/10 once this PR lands. Mirrors MODEL-2 SHIP-018 shape (pass@1
threshold) but uniquely carries a 1.2 pp noise allowance called out by
spec §4.2 AC-SHIP1-005 — MODEL-2 has no noise window.

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
  - Adds FALSIFY-QW2E-SHIP-005 binding
      AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
      AC_SHIP1_005_NOISE_ALLOWANCE_PP              = 1.20
      AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
    to `verdict_from_pass_at_1(correct, total, threshold_pct) ->
    Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
  - 8-section mutation survey:
      1. Safe-margin Pass above effective floor (85/100 = 85.0%)
      2. Above nominal floor (87/100 = 87.0%) Pass
      3. Noise-window Fail at nominal (85/100 Fails nominal)
      4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
      5. Monotonicity sweep correct=0..=164 at effective
      6. Div-safety (total=0) + sanity (correct>total) → Fail
      7. Non-finite threshold (NaN, ±∞) → Fail conservatively
      8. Tolerance-bounded provenance pin on all three constants
         (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
  - `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`,
    `full_discharge_blocks_on: live apr eval --benchmark humaneval ...`
    on RTX 4090; 6 named counter_example_classes.

crates/aprender-core/src/metrics/ship_005.rs (NEW, 305 lines):
  - Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
  - `#[must_use] pub fn verdict_from_pass_at_1(...)` returns
    `Ship005Verdict::Fail` conservatively on: total=0 (div guard),
    correct>total (sanity), !threshold.is_finite() (NaN/±∞).
  - `falsify_ship_005_humaneval_pass_at_1_threshold_logic` — 1 passing.
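
A minimal sketch of the verdict function described above, for illustration only. The function and constant names are taken from this commit message. The parameter types (`usize`, `f32`), the `Pass`/`Fail` variants, and the percentage arithmetic are assumptions, not the contents of `ship_005.rs`.

```rust
/// Verdict for FALSIFY-SHIP-005. Variant names are assumed for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship005Verdict {
    Pass,
    Fail,
}

pub const AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT: f32 = 86.00;
pub const AC_SHIP1_005_NOISE_ALLOWANCE_PP: f32 = 1.20;
/// Derived floor. In f32 this is close to, but not exactly, 84.80.
pub const AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT: f32 =
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT - AC_SHIP1_005_NOISE_ALLOWANCE_PP;

/// Fails conservatively on a zero denominator, an impossible count,
/// or a non-finite threshold; otherwise compares pass@1 to the floor.
#[must_use]
pub fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship005Verdict {
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship005Verdict::Fail;
    }
    let pass_pct = 100.0 * correct as f32 / total as f32;
    if pass_pct >= threshold_pct {
        Ship005Verdict::Pass
    } else {
        Ship005Verdict::Fail
    }
}
```

Under these assumptions, 85/100 passes the effective floor but fails the nominal one, and the HumanEval-sized case 139/164 (about 84.756%) fails both.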

Spec `docs/specifications/aprender-train/ship-two-models-spec.md`
  v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows
  `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry
  noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.

Authored self-contained because SHIP-018 PR #1004 and SHIP-007 PR
#1019 are not yet on main. Once they land, the two (or three)
`verdict_from_pass_at_1_*` fns should be dedup'd into a single
parameterized helper.

Full discharge blocks on: live `apr eval --benchmark humaneval
paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with
--features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or
≥ 84.80 under the 1.2 pp noise allowance).

Tests:
  cargo test -p aprender-core --lib \
    falsify_ship_005_humaneval_pass_at_1_threshold_logic
Contract:
  cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
    contracts/qwen2-e2e-verification-v1.yaml

Supersedes #1015 (stacked-branch original).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: retrigger after disk-guard stuck workspace-test (#1021)

Previous run 24806028080 workspace-test stuck at 37+min on lib-tests step
(vs typical 19min). Canceled. Re-triggering on fresh runner.

Tracking: infra issue #1020 second incidence.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift

Superseded by #1032 — rebased onto MODEL-1 stack at v2.34.0.

@noahgift noahgift closed this Apr 23, 2026