feat(p1-0): apr-cli-pull-dataset-v1 — contract for `apr pull dataset --include --license-allowlist` by noahgift · Pull Request #1080 · paiml/aprender

noahgift · 2026-04-27T06:41:51Z

Summary

P1.0 of the SHIP-TWO-001 §26.9 corpus pipeline. Authoring contracts/apr-cli-pull-dataset-v1.yaml is the prerequisite for P1.1 (extend apr CLI) per the §26.8.1 binding methodology rule: when apr lacks a feature, author contract → extend apr → use extended stack tool. Never route around to non-stack CLIs (huggingface-cli) or deprecated namespaces (batuta hf pull).

What this contract codifies

| Equation | Domain | Invariants |
|----------|--------|------------|
| `apr_pull_dataset_signature` | apr-cli surface | dataset asset-type via positional dispatch; backward-compat model path |
| `include_glob_semantics` | shard-pattern selection | fnmatch globs; no-match = fail-fast (NOT silent no-op) |
| `license_allowlist_semantics` | per-row license filter | case-insensitive SPDX-id; default column `license`; row-level |
| `registry_drift_prevention` | 3-surface coherence | clap + yaml + cli_commands test all updated atomically |
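
The `include_glob_semantics` invariant (fnmatch globs, no-match fails fast) can be illustrated with a minimal Python sketch. This is not the apr implementation: `select_shards`, `NoShardsMatched`, and the file names are hypothetical, and Python's `fnmatch` module stands in for whatever matcher apr actually uses.

```python
from fnmatch import fnmatchcase


class NoShardsMatched(Exception):
    """Raised when an --include pattern selects nothing (hypothetical name)."""


def select_shards(files, pattern):
    """Return the files matching `pattern`, or raise if none match."""
    selected = [name for name in files if fnmatchcase(name, pattern)]
    if not selected:
        # Fail fast instead of silently pulling nothing.
        raise NoShardsMatched(
            f"--include {pattern!r} matched 0 of {len(files)} files"
        )
    return selected


# Hypothetical repository listing, for illustration only.
repo_files = [
    "README.md",
    "data/train-00000.parquet",
    "data/train-00001.parquet",
]
print(select_shards(repo_files, "data/train-00000*.parquet"))
```

A caller would presumably map `NoShardsMatched` to a non-zero exit code, which is the behavior FALSIFY-APR-PULL-DATASET-003 checks.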

Falsification tests (8 total)

FALSIFY-APR-PULL-DATASET-001: subcommand exists with required flags
-002: include glob filters correctly (1 file in/1 file out)
-003: no-match glob fails fast (exit non-zero)
-004: license allowlist drops disallowed rows (parquet row filter)
-005: model-path backward compatible (apr pull <MODEL> unchanged)
-006: 3-surface drift prevention (registry test passes)
-007: pv validate exits 0
-008: deprecated namespaces (batuta hf pull, huggingface-cli) not used in P1
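
Test -004 exercises the row-level license filter. The sketch below shows one way a case-insensitive SPDX-id allowlist over a `license` column could behave. It operates on plain dicts rather than parquet, and `parse_allowlist` and `filter_rows` are hypothetical names. Dropping rows that lack a license value is an assumption of this sketch, not something the contract text states.

```python
def parse_allowlist(csv_value):
    """Split a comma-separated allowlist into a case-folded set of SPDX ids."""
    return {item.strip().casefold() for item in csv_value.split(",") if item.strip()}


def filter_rows(rows, allowlist, column="license"):
    """Keep rows whose license column is in the allowlist (case-insensitive)."""
    kept = []
    for row in rows:
        value = row.get(column)
        # Assumption: rows with a missing or non-string license are dropped.
        if isinstance(value, str) and value.strip().casefold() in allowlist:
            kept.append(row)
    return kept


# Hypothetical rows, for illustration only.
rows = [
    {"path": "a.py", "license": "MIT"},
    {"path": "b.py", "license": "gpl-3.0"},
    {"path": "c.py", "license": "apache-2.0"},
    {"path": "d.py"},
]
allow = parse_allowlist("mit, Apache-2.0")
print([r["path"] for r in filter_rows(rows, allow)])
```

Case-folding both sides keeps `MIT` and `mit` equivalent, matching the "case-insensitive SPDX-id" invariant in the table above.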

Validation

$ pv validate contracts/apr-cli-pull-dataset-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.

$ pv score contracts/apr-cli-pull-dataset-v1.yaml
apr-cli-pull-dataset-v1 — 0.71 (Grade C)
  Spec: 0.70 | Falsify: 1.00 | Kani: 0.25 | Lean: 0.50 | Bind: 1.00

Kani/Lean scores upgrade in P1.1 once implementation provides harnesses + theorems.

Status

PROPOSED — promotion to ACTIVE requires:

P1.1 implementation lands
All 8 FALSIFY-APR-PULL-DATASET-* tests pass live
apr-cli-commands-v1.yaml registry updated
cli_commands::registered_commands() test PASSES with new dataset asset-type
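
The three-surface coherence check behind the last two items can be pictured as a set comparison. The sketch below is a hypothetical Python analogue, not the actual `cli_commands::registered_commands()` test. The surface names and command lists are invented for illustration.

```python
def find_drift(surfaces):
    """Map each surface to the commands it lacks relative to the union of all surfaces."""
    union = set().union(*surfaces.values())
    return {
        name: sorted(union - commands)
        for name, commands in surfaces.items()
        if union - commands
    }


# Simulated drift: the test surface was not updated with the new asset-type.
surfaces = {
    "clap": {"pull", "pull dataset"},
    "registry_yaml": {"pull", "pull dataset"},
    "cli_commands_test": {"pull"},
}
print(find_drift(surfaces))
```

An empty result means all three surfaces agree. Any non-empty entry names the surface that was left behind.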

Spec references

SPEC-SHIP-TWO-001 §26.8 — apr-is-canonical binding methodology rule
SPEC-SHIP-TWO-001 §26.9 — P1.0 prerequisite of corpus pipeline
feedback_monorepo_single_source_of_truth.md — APR-MONO consolidation 2026-04-23
feedback_fix_root_cause_never_route_around.md
feedback_cli_subcommand_three_surface_drift.md

Test plan

pv validate contracts/apr-cli-pull-dataset-v1.yaml exits 0
CI workspace-test passes
CI gate passes
Contract status field is PROPOSED (not yet ACTIVE)

🤖 Generated with Claude Code

…--include --license-allowlist` per spec §26.8

P1.0 of the SHIP-TWO-001 §26.9 corpus pipeline. Authoring this contract is the prerequisite for P1.1 (extend apr CLI) per the binding methodology rule §26.8.1: when `apr` lacks a feature, author contract → extend apr → use extended stack tool. Never route around to non-stack CLIs (huggingface-cli) or to deprecated namespaces (batuta hf pull).

Contract defines:
- New `apr pull dataset <REPO>` asset-type (currently apr pull is model-only with `apr pull <MODEL>`)
- --include <GLOB> for shard-pattern selection (fnmatch, no-match fails fast)
- --license-allowlist <CSV> for row-level SPDX-id filtering
- --revision <REV> propagated from existing model path
- --output <DIR> with sensible default

8 falsification tests cover:
- Subcommand exists with required flags
- include glob filters correctly
- No-match glob fails fast (not silent no-op)
- License allowlist drops disallowed rows
- Model-path backward compatibility preserved
- 3-surface drift prevention (clap + registry yaml + cli_commands test)
- pv validate passes
- Deprecated namespaces (batuta hf pull, huggingface-cli) not used in P1 pipeline

4 proof obligations (1 invariant + 1 invariant + 1 safety + 1 liveness). 2 Kani harnesses with bounds.

`pv validate` exits 0, 0 errors / 0 warnings. `pv score` = 0.71 Grade C: Falsify 1.00, Spec 0.70, Bind 1.00. Kani/Lean scores upgrade in P1.1 with implementation.

Status: PROPOSED. Promotion to ACTIVE requires P1.1 (implementation) + all 8 FALSIFY tests passing live.

Spec: SPEC-SHIP-TWO-001 §26.8 + §26.9

References:
- feedback_monorepo_single_source_of_truth.md (APR-MONO)
- feedback_fix_root_cause_never_route_around.md
- feedback_cli_subcommand_three_surface_drift.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
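
The positional dispatch between `apr pull dataset <REPO>` and the legacy `apr pull <MODEL>` form might look roughly like the following. This is a sketch only: Python's `argparse` stands in for clap, `parse_pull` and the `asset_type` attribute are hypothetical, and the repository names are placeholders. The flag names come from the commit message above. Flag defaults are not modeled.

```python
import argparse


def parse_pull(argv):
    """Parse the arguments that follow `apr pull` (hypothetical helper)."""
    # Positional dispatch: a leading literal `dataset` selects the dataset path.
    # Anything else falls through to the legacy `apr pull <MODEL>` form.
    if argv and argv[0] == "dataset":
        parser = argparse.ArgumentParser(prog="apr pull dataset")
        parser.add_argument("repo")
        parser.add_argument("--include", metavar="GLOB")
        parser.add_argument("--license-allowlist", metavar="CSV")
        parser.add_argument("--revision", metavar="REV")
        parser.add_argument("--output", metavar="DIR")
        args = parser.parse_args(argv[1:])
        args.asset_type = "dataset"
        return args

    parser = argparse.ArgumentParser(prog="apr pull")
    parser.add_argument("model")
    parser.add_argument("--revision", metavar="REV")
    args = parser.parse_args(argv)
    args.asset_type = "model"
    return args


# Placeholder repository names, for illustration only.
ds = parse_pull([
    "dataset", "org/corpus",
    "--include", "*.parquet",
    "--license-allowlist", "mit,apache-2.0",
])
print(ds.asset_type, ds.repo, ds.include, ds.license_allowlist)

legacy = parse_pull(["org/model"])
print(legacy.asset_type, legacy.model)
```

One consequence of dispatching on a literal first positional is that a model literally named `dataset` would become ambiguous. The sketch does not try to resolve that case.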

…oard + critical-path map — spec v2.73.0 → v2.74.0 (#1087)

Session-end snapshot consolidating today's 10-PR cascade into a single source-of-truth for next session. The goal: ship two models to HF, both built end-to-end on the in-tree Sovereign AI Stack.

Coverage scoreboard EOD 2026-04-27:

| Category | DISCHARGED | PARTIAL | Total | %D |
|-------------|-----------:|--------:|------:|----:|
| MODEL-1 | 5 | 5 | 10 | 50% |
| MODEL-2 | 3 | 9 | 12 | 25% |
| GPUTRAIN | 7 | 0 | 7 | 100% |
| Ship Gates | - | 12 | 12 | 0% |
| Falsifiers | - | 7 | 7 | 0% |
| Sum | 15 | 33 | 48 | 31% |

Critical path (MODEL-1): PR E (replace helpers::f32_matmul with Q4K-fused dispatch) discharges 5 PARTIALs at one fix site. ~150-300 LOC.

Critical path (MODEL-2): P1.1 (apr pull dataset extension) → P1.4 (corpus pull) → P2 (100K-step training) discharges 9 PARTIALs.

10-PR session cascade (6 merged, 4 open + this):
- #1076-#1080: spec + contract foundation (MERGED)
- #1081: P3 PR A scaffold (MERGED)
- #1082-#1083: P3 PR B+C wiring (OPEN, stacked)
- #1084-#1085: §27/§28 binding criterion + root cause (OPEN)
- #1086: PR D forward-parity contract (OPEN)

Falsification chain (complete, root-reached):
§15.4 → §16 → §17 → §23 → §27 → §28 → PR D contract → PR E (next)
"forward path" → ... → "APR F32 vs GGUF Q4K matmul precision" → "binding criterion as durable spec" → "fix at mod_apr_transformer.rs:138-140"

Methodology preserved: zero eprintln!, zero route-arounds, apr canonical, contract-first, lambda-labs pre-authorized, 5-whys reaches root.

Next session: PR E first (5 ACs), then P1.1 + P1.4 + P2 (9 ACs).

Spec v2.73.0 → v2.74.0. No coverage flip at amendment; §29 is a scoreboard, not a discharge.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…confirms §25 corpus-diversity hypothesis — v2.77 → v2.78 (#1094)

P1 corpus pipeline complete end-to-end. P2 MODEL-2 retrain on 565.6M-token codeparrot Python+permissive corpus (7.6× the 4× CSN-Python baseline) pushes val_loss from the 9.7507 plateau to 9.3837, a 0.367-nat (4.7%) improvement with the SAME training configuration.

§25 had concluded (after 80K-step LR-budget falsification on 4× CSN-Python): "There is no LR/step configuration that beats val_loss=9.75 on CSN-Python — only Stack v2 will move the needle." §33 confirms this empirically. The corpus-diversity binding criterion of §26.9 is satisfied.

## Pipeline (all stack-canonical, no muda)

| Phase | Outcome |
|-------|---------|
| P1.0 contract authored (PROPOSED → ACTIVE) | #1080 → #1089 |
| P1.1 apr pull dataset extension | #1089 MERGED |
| P1.4 codeparrot pull | 80 shards / 27 GB |
| P1.5a parquet → JSONL filter | 405,904 rows / 3.17 GB |
| P1.5b BPE encode-corpus | 57 shards / 565.6M tokens / 10h |
| P2 MODEL-2 retrain on RTX 4090 | EARLY_STOP at 51 ep / 47 min |

Total wall time from contract authoring to val_loss=9.3837: ~14 hours.

## Training curve highlights

- epoch 0: train=9.7567, val=10.0698 (init)
- epoch 10: train=9.4610, val=9.5657 (post-warmup)
- epoch 30: train=9.2x, val=9.42x
- epoch 44: val=9.3837 (BEST)
- epoch 50: train=9.2093, val=9.3889 (EARLY_STOP next)

Full per-epoch metadata in evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json.

## Coverage impact

§33 is binding evidence for SHIP-021 (corpus diversity binding); promotion to DISCHARGED is deferred to a separate PR that updates the SHIP-021 contract atomically. Spec scoreboard unchanged (15+33) in this PR.

## Files

- evidence/model-2-codeparrot-retrain-2026-04-28/launch.log
- evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json
- §33 spec section (8 subsections, ~80 lines)
- Header: v2.77.0 → v2.78.0

## Methodology landed

The §26.8 stack-tool-extension rule paid off concretely:
- 6h authoring cost (P1.0 contract + P1.1 impl) → permanent apr capability
- Every future dataset pull benefits
- §33's val_loss=9.3837 is downstream proof of the methodology

This commit represents the first cycle in §22→§33 where the spec amendment has the same priority as the empirical result.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) April 27, 2026 06:41

Merge branch 'main' into feat/p1-0-apr-cli-pull-dataset-contract

6df7ee3

noahgift merged commit 0bd4d96 into main Apr 27, 2026
10 checks passed

noahgift deleted the feat/p1-0-apr-cli-pull-dataset-contract branch April 27, 2026 07:42

noahgift mentioned this pull request Apr 27, 2026

docs(ship-two-001): §29 — EOD 2026-04-27 goal recap + coverage scoreboard — spec v2.73.0 → v2.74.0 #1087

Merged

4 tasks

noahgift mentioned this pull request Apr 28, 2026

docs(ship-two-001): §33 — MODEL-2 codeparrot retrain val_loss=9.3837 confirms §25 corpus-diversity hypothesis #1094

Merged

3 tasks

noahgift merged 2 commits into main from feat/p1-0-apr-cli-pull-dataset-contract
