
feat(p1-0): apr-cli-pull-dataset-v1 — contract for apr pull dataset --include --license-allowlist #1080

Merged
noahgift merged 2 commits into main from feat/p1-0-apr-cli-pull-dataset-contract on Apr 27, 2026

@noahgift


Summary

P1.0 of the SHIP-TWO-001 §26.9 corpus pipeline. Authoring contracts/apr-cli-pull-dataset-v1.yaml is the prerequisite for P1.1 (extend apr CLI) per the §26.8.1 binding methodology rule: when apr lacks a feature, author contract → extend apr → use extended stack tool. Never route around to non-stack CLIs (huggingface-cli) or deprecated namespaces (batuta hf pull).

What this contract codifies

| Equation | Domain | Invariants |
|----------|--------|------------|
| apr_pull_dataset_signature | apr-cli surface | dataset asset-type via positional dispatch; backward-compat model path |
| include_glob_semantics | shard-pattern selection | fnmatch globs; no-match = fail-fast (NOT silent no-op) |
| license_allowlist_semantics | per-row license filter | case-insensitive SPDX-id; default column license; row-level |
| registry_drift_prevention | 3-surface coherence | clap + yaml + cli_commands test all updated atomically |
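
To make the first invariant concrete, here is a minimal sketch of positional dispatch with a backward-compatible model path. It is written in Python for brevity; the real CLI is built with clap, and `parse_pull_args` and its return shape are hypothetical names chosen for illustration, not part of the contract.

```python
def parse_pull_args(argv):
    """Classify the positional arguments that follow `apr pull` (illustrative only).

    Assumption: a leading literal `dataset` selects the dataset asset-type;
    any other first positional is treated as the legacy `apr pull <MODEL>` form.
    """
    if not argv:
        raise ValueError("expected <MODEL> or `dataset <REPO>`")
    if argv[0] == "dataset":
        if len(argv) < 2:
            raise ValueError("`apr pull dataset` requires <REPO>")
        return {"asset_type": "dataset", "target": argv[1]}
    # Backward-compatible path: the first positional is the model itself.
    return {"asset_type": "model", "target": argv[0]}
```

One consequence of this style of dispatch is that a model literally named `dataset` would be shadowed by the new asset-type; whether the contract addresses that case is not stated here.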

Falsification tests (8 total)

  • FALSIFY-APR-PULL-DATASET-001: subcommand exists with required flags
  • -002: include glob filters correctly (1 file in/1 file out)
  • -003: no-match glob fails fast (exit non-zero)
  • -004: license allowlist drops disallowed rows (parquet row filter)
  • -005: model-path backward compatible (apr pull <MODEL> unchanged)
  • -006: 3-surface drift prevention (registry test passes)
  • -007: pv validate exits 0
  • -008: deprecated namespaces (batuta hf pull, huggingface-cli) not used in P1
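
Tests -002 and -003 pin down the `--include` behaviour: a glob selects matching shards, and a glob that matches nothing is an error instead of an empty result. A rough Python sketch of that rule, using the standard-library `fnmatch` module, might look like the following. `select_shards`, `NoShardsMatched`, and the file names in the usage note are hypothetical; the actual implementation lands in P1.1.

```python
from fnmatch import fnmatchcase


class NoShardsMatched(Exception):
    """Raised when an include pattern selects no files (fail-fast)."""


def select_shards(filenames, pattern):
    """Return the files matching `pattern`, or raise if none match."""
    matched = [name for name in filenames if fnmatchcase(name, pattern)]
    if not matched:
        # Fail fast: an empty selection is treated as an error, never a no-op.
        raise NoShardsMatched(
            f"--include {pattern!r} matched none of {len(filenames)} files"
        )
    return matched
```

For example, `select_shards(["data/train-00000.parquet", "README.md"], "*.parquet")` keeps only the parquet file, because fnmatch's `*` also matches `/`. A caller would translate `NoShardsMatched` into a non-zero exit code.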

Validation

$ pv validate contracts/apr-cli-pull-dataset-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.

$ pv score contracts/apr-cli-pull-dataset-v1.yaml
apr-cli-pull-dataset-v1 — 0.71 (Grade C)
  Spec: 0.70 | Falsify: 1.00 | Kani: 0.25 | Lean: 0.50 | Bind: 1.00

Kani/Lean scores upgrade in P1.1 once implementation provides harnesses + theorems.

Status

PROPOSED — promotion to ACTIVE requires:

  • P1.1 implementation lands
  • All 8 FALSIFY-APR-PULL-DATASET-* tests pass live
  • apr-cli-commands-v1.yaml registry updated
  • cli_commands::registered_commands() test PASSES with new dataset asset-type
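
The last two promotion conditions are the 3-surface coherence check: the clap definition, the registry YAML, and the `cli_commands` test must all list the same commands. As a sketch of the idea only (the real check is the Rust registry test), drift can be detected by comparing the three sets. `find_drift` and the surface labels are hypothetical.

```python
def find_drift(surfaces):
    """Report, per surface, the commands it is missing.

    `surfaces` maps a surface name to the set of command names it declares.
    An empty result means all surfaces agree.
    """
    all_commands = set().union(*surfaces.values())
    drift = {}
    for name, commands in surfaces.items():
        missing = all_commands - commands
        if missing:
            drift[name] = sorted(missing)
    return drift
```

With `{"clap": {"pull", "pull dataset"}, "yaml": {"pull"}, "test": {"pull", "pull dataset"}}` the function reports that the YAML surface lacks `pull dataset`, which is the kind of partial update the atomic-update invariant is meant to rule out.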

Spec references

  • SPEC-SHIP-TWO-001 §26.8 — apr-is-canonical binding methodology rule
  • SPEC-SHIP-TWO-001 §26.9 — P1.0 prerequisite of corpus pipeline
  • feedback_monorepo_single_source_of_truth.md — APR-MONO consolidation 2026-04-23
  • feedback_fix_root_cause_never_route_around.md
  • feedback_cli_subcommand_three_surface_drift.md

Test plan

  • pv validate contracts/apr-cli-pull-dataset-v1.yaml exits 0
  • CI workspace-test passes
  • CI gate passes
  • Contract status field is PROPOSED (not yet ACTIVE)

🤖 Generated with Claude Code

…--include --license-allowlist` per spec §26.8

P1.0 of the SHIP-TWO-001 §26.9 corpus pipeline. Authoring this
contract is the prerequisite for P1.1 (extend apr CLI) per the
binding methodology rule §26.8.1: when `apr` lacks a feature,
author contract → extend apr → use extended stack tool. Never
route around to non-stack CLIs (huggingface-cli) or to deprecated
namespaces (batuta hf pull).

Contract defines:
- New `apr pull dataset <REPO>` asset-type (currently apr pull is
  model-only with `apr pull <MODEL>`)
- --include <GLOB> for shard-pattern selection (fnmatch, no-match
  fails fast)
- --license-allowlist <CSV> for row-level SPDX-id filtering
- --revision <REV> propagated from existing model path
- --output <DIR> with sensible default
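
As an illustration of the `--license-allowlist` semantics listed above (row-level, case-insensitive SPDX-id match, default column `license`), a minimal Python sketch follows. `filter_rows_by_license` is a hypothetical name, rows are modelled as plain dicts instead of parquet records, and dropping rows with a missing license value is an assumption, not something the contract text here specifies.

```python
def filter_rows_by_license(rows, allowlist_csv, column="license"):
    """Keep rows whose license is in the comma-separated allowlist.

    Matching is case-insensitive. Rows with a missing or non-string
    license value are dropped (assumption for this sketch).
    """
    allowed = {
        item.strip().lower() for item in allowlist_csv.split(",") if item.strip()
    }
    kept = []
    for row in rows:
        value = row.get(column)
        if isinstance(value, str) and value.strip().lower() in allowed:
            kept.append(row)
    return kept
```

Called with the allowlist `"mit, Apache-2.0"`, a row licensed `MIT` is kept while rows licensed `gpl-3.0` or lacking the column are filtered out.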

8 falsification tests cover:
- Subcommand exists with required flags
- include glob filters correctly
- No-match glob fails fast (not silent no-op)
- License allowlist drops disallowed rows
- Model-path backward compatibility preserved
- 3-surface drift prevention (clap + registry yaml + cli_commands test)
- pv validate passes
- Deprecated namespaces (batuta hf pull, huggingface-cli) not used
  in P1 pipeline

4 proof obligations (2 invariant + 1 safety + 1 liveness).
2 Kani harnesses with bounds.

`pv validate` exits 0, 0 errors / 0 warnings.
`pv score` = 0.71 Grade C — Falsify 1.00, Spec 0.70, Bind 1.00.
Kani/Lean scores upgrade in P1.1 with implementation.

Status: PROPOSED. Promotion to ACTIVE requires P1.1
(implementation) + all 8 FALSIFY tests passing live.

Spec: SPEC-SHIP-TWO-001 §26.8 + §26.9
References:
- feedback_monorepo_single_source_of_truth.md (APR-MONO)
- feedback_fix_root_cause_never_route_around.md
- feedback_cli_subcommand_three_surface_drift.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) April 27, 2026 06:41
@noahgift noahgift merged commit 0bd4d96 into main Apr 27, 2026
10 checks passed
@noahgift noahgift deleted the feat/p1-0-apr-cli-pull-dataset-contract branch April 27, 2026 07:42
noahgift added a commit that referenced this pull request Apr 27, 2026
…oard + critical-path map — spec v2.73.0 → v2.74.0 (#1087)

Session-end snapshot consolidating today's 10-PR cascade into a
single source-of-truth for next session.

The goal: ship two models to HF, both built end-to-end on the
in-tree Sovereign AI Stack.

Coverage scoreboard EOD 2026-04-27:
| Category    | DISCHARGED | PARTIAL | Total | %D  |
|-------------|-----------:|--------:|------:|----:|
| MODEL-1     |          5 |       5 |    10 | 50% |
| MODEL-2     |          3 |       9 |    12 | 25% |
| GPUTRAIN    |          7 |       0 |     7 |100% |
| Ship Gates  |          - |      12 |    12 |  0% |
| Falsifiers  |          - |       7 |     7 |  0% |
| Sum         |         15 |      33 |    48 | 31% |
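
The Sum row can be re-derived from the category rows. The short Python check below does so, assuming the `-` entries for Ship Gates and Falsifiers count as zero DISCHARGED; `summarize` is a hypothetical helper name.

```python
# (DISCHARGED, PARTIAL) per category, copied from the table above;
# "-" is treated as 0 for this check.
scoreboard = {
    "MODEL-1": (5, 5),
    "MODEL-2": (3, 9),
    "GPUTRAIN": (7, 0),
    "Ship Gates": (0, 12),
    "Falsifiers": (0, 7),
}


def summarize(rows):
    """Return (discharged, partial, total, percent discharged)."""
    discharged = sum(d for d, _ in rows.values())
    partial = sum(p for _, p in rows.values())
    total = discharged + partial
    return discharged, partial, total, round(100 * discharged / total)


print(summarize(scoreboard))  # (15, 33, 48, 31)
```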

Critical path — MODEL-1: PR E (replace helpers::f32_matmul with
Q4K-fused dispatch) discharges 5 PARTIALs at one fix site.
~150-300 LOC.

Critical path — MODEL-2: P1.1 (apr pull dataset extension) →
P1.4 (corpus pull) → P2 (100K-step training) discharges 9
PARTIALs.

10-PR session cascade (6 merged, 4 open + this):
- #1076-#1080: spec + contract foundation (MERGED)
- #1081: P3 PR A scaffold (MERGED)
- #1082-#1083: P3 PR B+C wiring (OPEN, stacked)
- #1084-#1085: §27/§28 binding criterion + root cause (OPEN)
- #1086: PR D forward-parity contract (OPEN)

Falsification chain (complete, root-reached):
§15.4 → §16 → §17 → §23 → §27 → §28 → PR D contract → PR E (next)
"forward path" → ... → "APR F32 vs GGUF Q4K matmul precision"
                            → "binding criterion as durable spec"
                            → "fix at mod_apr_transformer.rs:138-140"

Methodology preserved: zero eprintln!, zero route-arounds, apr
canonical, contract-first, lambda-labs pre-authorized, 5-whys
reaches root.

Next session: PR E first (5 ACs), then P1.1 + P1.4 + P2
(9 ACs).

Spec v2.73.0 → v2.74.0. No coverage flip at amendment — §29 is
a scoreboard, not a discharge.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 28, 2026
…confirms §25 corpus-diversity hypothesis — v2.77 → v2.78 (#1094)

P1 corpus pipeline complete end-to-end. P2 MODEL-2 retrain on 565.6M-token
codeparrot Python+permissive corpus (7.6× the 4× CSN-Python baseline)
pushes val_loss from the 9.7507 plateau to 9.3837, a 0.367-nat (3.8%)
improvement with the SAME training configuration.
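
For readers who think in perplexity, the drop can be converted with a few lines of Python. This assumes val_loss is a mean per-token cross-entropy in nats, so that perplexity is `exp(val_loss)`; that interpretation is inferred from the "nat" unit and is not stated elsewhere in this commit.

```python
import math

baseline_loss = 9.7507   # plateau on 4x CSN-Python
retrained_loss = 9.3837  # best val_loss on the codeparrot corpus

delta_nats = baseline_loss - retrained_loss
# Under the assumption above, perplexities scale by exp(-delta).
perplexity_ratio = math.exp(-delta_nats)

print(f"delta = {delta_nats:.3f} nats")
print(f"perplexity ratio = {perplexity_ratio:.3f}")
```

Under that assumption, the retrained model's perplexity is roughly 0.69 times the baseline's.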

§25 had concluded (after 80K-step LR-budget falsification on 4× CSN-Python):
  "There is no LR/step configuration that beats val_loss=9.75 on
   CSN-Python — only Stack v2 will move the needle."

§33 confirms this empirically. The corpus-diversity binding criterion of
§26.9 is satisfied.

## Pipeline (all stack-canonical, no muda)

| Phase | Outcome |
|-------|---------|
| P1.0 contract authored (PROPOSED → ACTIVE) | #1080, #1089 |
| P1.1 apr pull dataset extension | #1089 MERGED |
| P1.4 codeparrot pull | 80 shards / 27 GB |
| P1.5a parquet → JSONL filter | 405,904 rows / 3.17 GB |
| P1.5b BPE encode-corpus | 57 shards / 565.6M tokens / 10h |
| P2 MODEL-2 retrain on RTX 4090 | EARLY_STOP at 51 ep / 47 min |

Total wall time from contract authoring to val_loss=9.3837: ~14 hours.

## Training curve highlights

- epoch 0: train=9.7567, val=10.0698 (init)
- epoch 10: train=9.4610, val=9.5657 (post-warmup)
- epoch 30: train=9.2x, val=9.42x
- epoch 44: val=9.3837 (BEST)
- epoch 50: train=9.2093, val=9.3889 (EARLY_STOP next)

Full per-epoch metadata in evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json.

## Coverage impact

§33 is binding evidence for SHIP-021 (corpus diversity binding) — promotion
to DISCHARGED is deferred to a separate PR that updates the SHIP-021
contract atomically. Spec scoreboard unchanged (15+33) in this PR.

## Files

- evidence/model-2-codeparrot-retrain-2026-04-28/launch.log
- evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json
- §33 spec section (8 subsections, ~80 lines)
- Header: v2.77.0 → v2.78.0

## Methodology landed

The §26.8 stack-tool-extension rule paid off concretely:
- 6h authoring cost (P1.0 contract + P1.1 impl) → permanent apr capability
- Every future dataset pull benefits
- §33's val_loss=9.3837 is downstream proof of the methodology

This commit represents the first cycle in §22→§33 where the spec amendment
has the same priority as the empirical result.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>