feat(p1-1): apr pull dataset extension — unblocks MODEL-2 corpus diversification per §26.8 by noahgift · Pull Request #1089 · paiml/aprender

noahgift · 2026-04-27T13:53:25Z

Summary

Implements P1.1 of SHIP-TWO-001 §26.8 stack-tool-extension chain. Extends apr pull with the dataset asset-type per apr-cli-pull-dataset-v1.yaml to unblock the corpus pipeline P1.4 → P2 → MODEL-2 convergence.

apr pull dataset <REPO> [--include <GLOB>] [--revision <REV>] [--output <DIR>]

Backward-compat preserved: existing apr pull <MODEL> model path unchanged.
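
The positional-discriminator rule can be sketched roughly as follows. This is a minimal illustration, not the code in `commands_enum.rs` or `dispatch.rs`: the `PullTarget` enum and the `classify_pull` function are hypothetical names, and the real CLI parses its arguments differently.

```rust
/// Hypothetical stand-in for what `apr pull` resolves its arguments to.
#[derive(Debug, PartialEq)]
enum PullTarget {
    /// `apr pull dataset <REPO> [--include <GLOB>]...`
    Dataset { repo: String, include: Vec<String> },
    /// `apr pull <MODEL>` (the pre-existing path, left untouched)
    Model { model_ref: String },
}

/// Decide which puller to run from the positional arguments.
/// The literal first positional `dataset` selects the dataset path;
/// anything else falls through to the model path.
fn classify_pull(positionals: &[&str], include: &[&str]) -> Result<PullTarget, String> {
    match positionals {
        ["dataset", repo, ..] => Ok(PullTarget::Dataset {
            repo: (*repo).to_owned(),
            include: include.iter().map(|g| (*g).to_owned()).collect(),
        }),
        // `dataset` with no repository is an error, not a model named "dataset".
        ["dataset"] => Err("`apr pull dataset` needs a <REPO> argument".to_owned()),
        [model_ref, ..] => Ok(PullTarget::Model {
            model_ref: (*model_ref).to_owned(),
        }),
        [] => Err("nothing to pull".to_owned()),
    }
}
```

Under these assumptions, `classify_pull(&["qwen2.5-coder"], &[])` resolves to the `Model` variant, which is why existing invocations keep working.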

Why this matters (per §26.8)

The previous session flagged a sub-agent attempting to route around the missing apr pull dataset capability via huggingface-cli download --include. Per feedback_stack_tool_extension_not_cli_shim.md (apr is canonical post-APR-MONO), the correct fix is to extend apr — not bypass it. This PR is that fix.

P1 of MODEL-2 (codeparrot/github-code-clean → 1B+ Python tokens → MODEL-2 convergence) was gated on this landing. Empirical §24/§25 results showed corpus-diversity is the binding constraint for MODEL-2 (val_loss=9.75 plateau on 4× CSN-Python).

Falsification verdict (live RTX 4090, 2026-04-27)

| Test | Result |
|------|--------|
| FALSIFY-APR-PULL-DATASET-001 (--help shows --include) | ✅ PASS |
| FALSIFY-APR-PULL-DATASET-002 (glob filters correctly) | ✅ PASS: `--include 'README.md'` on openwebtext → 1 file |
| FALSIFY-APR-PULL-DATASET-003 (no-match exits non-zero) | ✅ PASS: exit code 5 (ValidationFailed) |
| FALSIFY-APR-PULL-DATASET-004 (license-allowlist) | ⏸️ DEFERRED to P1.1.5 (parquet round-trip) |
| FALSIFY-APR-PULL-DATASET-005 (model backward-compat) | ✅ PASS: `apr pull qwen2.5-coder --dry-run` clean |
| FALSIFY-APR-PULL-DATASET-006 (3-surface drift) | ✅ PASS: `cargo test cli_commands` 6/6 |
| FALSIFY-APR-PULL-DATASET-007 (pv validate) | ✅ PASS: contract 1.0.0 → 1.1.0 ACTIVE |
| FALSIFY-APR-PULL-DATASET-008 (no muda) | ✅ PASS |

5 unit tests for filter_files_by_globs PASS.
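
For readers who want a feel for what those tests exercise, here is a rough, dependency-free sketch of include-glob filtering with fail-fast on no match. It is not the PR's `filter_files_by_globs`: the name `filter_by_globs_sketch`, the `FilterError` type, and the tiny hand-rolled matcher are all hypothetical, and it assumes that an empty include list keeps every file and that `*` may match across `/`.

```rust
#[derive(Debug, PartialEq)]
enum FilterError {
    InvalidGlob(String),
    NoMatch,
}

/// One element of a parsed pattern: literal, `?`, `*`, or a `[...]` class.
enum Tok {
    Lit(char),
    Any,
    Star,
    Class(Vec<(char, char)>),
}

/// Parse a pattern; an unclosed or empty `[...]` is reported as invalid.
fn parse(pattern: &str) -> Result<Vec<Tok>, FilterError> {
    let mut toks = Vec::new();
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        toks.push(match c {
            '*' => Tok::Star,
            '?' => Tok::Any,
            '[' => {
                let mut body = Vec::new();
                let mut closed = false;
                for d in chars.by_ref() {
                    if d == ']' {
                        closed = true;
                        break;
                    }
                    body.push(d);
                }
                if !closed || body.is_empty() {
                    return Err(FilterError::InvalidGlob(pattern.to_owned()));
                }
                let (mut ranges, mut i) = (Vec::new(), 0);
                while i < body.len() {
                    if i + 2 < body.len() && body[i + 1] == '-' {
                        ranges.push((body[i], body[i + 2])); // e.g. `0-7`
                        i += 3;
                    } else {
                        ranges.push((body[i], body[i])); // single character
                        i += 1;
                    }
                }
                Tok::Class(ranges)
            }
            other => Tok::Lit(other),
        });
    }
    Ok(toks)
}

/// Backtracking match of a whole path against a parsed pattern.
fn matches(toks: &[Tok], text: &[char]) -> bool {
    match toks.split_first() {
        None => text.is_empty(),
        Some((Tok::Star, rest)) => (0..=text.len()).any(|k| matches(rest, &text[k..])),
        Some((tok, rest)) => match text.split_first() {
            None => false,
            Some((&c, tail)) => {
                let ok = match tok {
                    Tok::Lit(l) => *l == c,
                    Tok::Any => true,
                    Tok::Class(rs) => rs.iter().any(|&(lo, hi)| lo <= c && c <= hi),
                    Tok::Star => unreachable!("handled above"),
                };
                ok && matches(rest, tail)
            }
        },
    }
}

/// Keep files matching ANY include pattern; error out if nothing matches.
fn filter_by_globs_sketch(files: &[&str], includes: &[&str]) -> Result<Vec<String>, FilterError> {
    if includes.is_empty() {
        return Ok(files.iter().map(|f| f.to_string()).collect());
    }
    let patterns = includes.iter().map(|p| parse(p)).collect::<Result<Vec<_>, _>>()?;
    let kept: Vec<String> = files
        .iter()
        .filter(|f| {
            let chars: Vec<char> = f.chars().collect();
            patterns.iter().any(|p| matches(p, &chars))
        })
        .map(|f| f.to_string())
        .collect();
    if kept.is_empty() { Err(FilterError::NoMatch) } else { Ok(kept) }
}
```

Returning an error for the no-match case, rather than an empty list, is what lets the caller map it to a non-zero exit code instead of silently downloading nothing.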

Implementation details

- commands_enum.rs: Pull variant gains repo, include, output fields. First positional discriminates: "dataset" triggers dataset-pull; otherwise falls through to existing model-pull.
- dispatch.rs: Routes to pull::run_dataset(...) when model_ref == "dataset".
- commands/pull_dataset.rs (NEW, 196 LOC): HF API listing (/api/datasets/{repo}/tree/{rev}?recursive=1) + glob filtering + fail-fast on no-match + reuse of existing download_file_with_progress streaming helper.
- Default cache: ~/.cache/aprender/datasets/<repo>/ (or $XDG_CACHE_HOME/aprender/datasets/<repo>/).
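
The default-cache rule above can be illustrated with a small sketch. The function name `default_dataset_cache` and its signature are hypothetical; the environment values are passed in as arguments so the logic stays testable, and treating an empty `XDG_CACHE_HOME` as unset is an assumption.

```rust
use std::path::PathBuf;

/// Resolve the dataset cache directory for `repo`.
/// Prefers `$XDG_CACHE_HOME` when it is set and non-empty, else `~/.cache`.
fn default_dataset_cache(xdg_cache_home: Option<&str>, home: &str, repo: &str) -> PathBuf {
    let base = match xdg_cache_home {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(home).join(".cache"),
    };
    // A repo such as `codeparrot/github-code-clean` becomes nested directories.
    base.join("aprender").join("datasets").join(repo)
}
```

A caller would typically feed it `std::env::var("XDG_CACHE_HOME").ok().as_deref()` and the user's home directory; an explicit `--output <DIR>` would bypass this lookup entirely.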

P1.1.5 deferred

License-allowlist row filter (--license-allowlist <CSV>, --license-column <NAME>) requires parquet/jsonl row-level filtering — non-trivial new infrastructure (parquet-rs round-trip, jsonl re-encode). Scoped out per §26.8 acceptable-exception list. Contract retains the equation specification; implementation tracked separately.

Test plan

- cargo build --release -p apr-cli: clean
- cargo test -p apr-cli --release --lib pull_dataset_tests (5/5 PASS)
- cargo test -p apr-cli --release --test cli_commands (6/6 PASS)
- pv validate contracts/apr-cli-pull-dataset-v1.yaml (0 errors)
- Live smoke: apr pull dataset openwebtext --include 'README.md' --output /tmp/x → 1 file pulled
- Live smoke: apr pull dataset openwebtext --include 'no/such/*' → exit 5
- Backward-compat: apr pull qwen2.5-coder --dry-run exits 0

Next session

P1.4 (pull codeparrot/github-code-clean) can now proceed via:

apr pull dataset codeparrot/github-code-clean \
    --include 'data/train-000[0-7][0-9]-of-00880.parquet' \
    --output /mnt/nvme-raid0/datasets/github-code-clean

Then P2 retrain MODEL-2 on the bigger corpus (~7.3 hr GPU).

🤖 Generated with Claude Code

Implements P1.1 of SHIP-TWO-001 §26.8 stack-tool-extension chain. Extends `apr pull` with the dataset asset-type per `apr-cli-pull-dataset-v1.yaml`:

    apr pull dataset <REPO> [--include <GLOB>] [--revision <REV>] [--output <DIR>]

Backward-compat preserved: `apr pull <MODEL>` model path unchanged.

## What this enables

P1 of the corpus pipeline (codeparrot/github-code-clean → 1B+ Python tokens → MODEL-2 convergence) is gated on this. Stack-canonical workflow replacing the route-around `huggingface-cli download --include` muda flagged in §26.8.

## Implementation

- **commands_enum.rs**: New `Pull` fields: `repo: Option<String>`, `include: Vec<String>`, `output: Option<PathBuf>`. Discriminates on first positional == "dataset" (per contract: "asset-type is a positional discriminator, not a flag").
- **dispatch.rs**: Routes to `pull::run_dataset(repo, include, revision, output)` when first positional == "dataset"; else falls through to existing model puller.
- **commands/pull_dataset.rs** (NEW, 196 LOC): Lists files via HF API `/api/datasets/{repo}/tree/{rev}?recursive=1`; filters by glob patterns; fail-fast on no-match; streams matched files via existing `download_file_with_progress` helper.
- **Default cache**: `~/.cache/aprender/datasets/<repo>/` (or `$XDG_CACHE_HOME/aprender/datasets/<repo>/`).

## Falsification verdict (live on noah-Lambda-Vector RTX 4090)

| ID | Rule | Result |
|----|------|--------|
| FALSIFY-APR-PULL-DATASET-001 | --help shows --include | PASS (visible in `apr pull dataset --help`) |
| FALSIFY-APR-PULL-DATASET-002 | include glob filters correctly | PASS (openwebtext --include 'README.md' → 1 file) |
| FALSIFY-APR-PULL-DATASET-003 | no-match exits non-zero | PASS (exit code 5 = ValidationFailed) |
| FALSIFY-APR-PULL-DATASET-004 | license-allowlist filters rows | DEFERRED to P1.1.5 (parquet round-trip required) |
| FALSIFY-APR-PULL-DATASET-005 | model path backward-compat | PASS (`apr pull qwen2.5-coder --dry-run` resolves cleanly) |
| FALSIFY-APR-PULL-DATASET-006 | 3-surface drift gate | PASS (`cargo test cli_commands` 6/6 pass) |
| FALSIFY-APR-PULL-DATASET-007 | pv validate exits 0 | PASS (contract bumped 1.0.0 → 1.1.0 ACTIVE) |
| FALSIFY-APR-PULL-DATASET-008 | no batuta hf pull / huggingface-cli | PASS (no regressions in P1 scripts) |

5 unit tests for `filter_files_by_globs` (empty, glob match, no-match, multi-include union, invalid glob) PASS.

## Contract bump

- `metadata.version`: 1.0.0 → 1.1.0
- `metadata.status`: PROPOSED → ACTIVE
- `proof_obligations[2].type`: safety → soundness (pv-validator schema fix)
- `proof_obligations[3].type`: liveness → termination (pv-validator schema fix)

## P1.1.5 deferred

License-allowlist row filter (`--license-allowlist <CSV>`, `--license-column <NAME>`) requires parquet/jsonl row-level filtering: non-trivial new infrastructure (parquet-rs round-trip, jsonl re-encode). Scoped out per spec §26.8 acceptable-exception list. Contract retains the equation specification; implementation tracked separately.

## Files changed

- `crates/apr-cli/src/commands_enum.rs` (Pull variant)
- `crates/apr-cli/src/dispatch.rs` (asset_type dispatch)
- `crates/apr-cli/src/commands/pull.rs` (include!() new module)
- `crates/apr-cli/src/commands/pull_dataset.rs` (NEW)
- `crates/apr-cli/src/lib_dispatch_coverage.rs` (test fixture)
- `crates/apr-cli/src/lib_extract_paths.rs` (test fixture)
- `crates/apr-cli/src/lib_parse_export.rs` (test fixtures, 3 sites)
- `contracts/apr-cli-pull-dataset-v1.yaml` (PROPOSED → ACTIVE 1.1.0)
- `crates/apr-cli/src/lib.rs` (rustfmt-only)
- `crates/apr-cli/src/commands/stamp.rs` (rustfmt-only)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…confirms §25 corpus-diversity hypothesis — v2.77 → v2.78 (#1094)

P1 corpus pipeline complete end-to-end. P2 MODEL-2 retrain on 565.6M-token codeparrot Python+permissive corpus (7.6× the 4× CSN-Python baseline) pushes val_loss from the 9.7507 plateau to 9.3837, a 0.367-nat (3.8%) improvement with the SAME training configuration.

§25 had concluded (after 80K-step LR-budget falsification on 4× CSN-Python): "There is no LR/step configuration that beats val_loss=9.75 on CSN-Python — only Stack v2 will move the needle." §33 confirms this empirically. The corpus-diversity binding criterion of §26.9 is satisfied.

## Pipeline (all stack-canonical, no muda)

| Phase | Outcome |
|-------|---------|
| P1.0 contract authored (PROPOSED → ACTIVE) | #1080 → #1089 |
| P1.1 apr pull dataset extension | #1089 MERGED |
| P1.4 codeparrot pull | 80 shards / 27 GB |
| P1.5a parquet → JSONL filter | 405,904 rows / 3.17 GB |
| P1.5b BPE encode-corpus | 57 shards / 565.6M tokens / 10h |
| P2 MODEL-2 retrain on RTX 4090 | EARLY_STOP at 51 ep / 47 min |

Total wall time from contract authoring to val_loss=9.3837: ~14 hours.

## Training curve highlights

- epoch 0: train=9.7567, val=10.0698 (init)
- epoch 10: train=9.4610, val=9.5657 (post-warmup)
- epoch 30: train=9.2x, val=9.42x
- epoch 44: val=9.3837 (BEST)
- epoch 50: train=9.2093, val=9.3889 (EARLY_STOP next)

Full per-epoch metadata in evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json.

## Coverage impact

§33 is binding evidence for SHIP-021 (corpus diversity binding); promotion to DISCHARGED is deferred to a separate PR that updates the SHIP-021 contract atomically. Spec scoreboard unchanged (15+33) in this PR.

## Files

- evidence/model-2-codeparrot-retrain-2026-04-28/launch.log
- evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json
- §33 spec section (8 subsections, ~80 lines)
- Header: v2.77.0 → v2.78.0

## Methodology landed

The §26.8 stack-tool-extension rule paid off concretely:

- 6h authoring cost (P1.0 contract + P1.1 impl) → permanent apr capability
- Every future dataset pull benefits
- §33's val_loss=9.3837 is downstream proof of the methodology

This commit represents the first cycle in §22→§33 where the spec amendment has the same priority as the empirical result.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) April 27, 2026 13:53

noahgift force-pushed the feat/p1-1-apr-pull-dataset branch from b762a0b to 1626923 Compare April 27, 2026 14:20

noahgift merged commit 6f5cfee into main Apr 27, 2026
10 checks passed

noahgift deleted the feat/p1-1-apr-pull-dataset branch April 27, 2026 14:39

noahgift mentioned this pull request Apr 28, 2026

docs(ship-two-001): §33 — MODEL-2 codeparrot retrain val_loss=9.3837 confirms §25 corpus-diversity hypothesis #1094

Merged

3 tasks

noahgift merged 1 commit into main from feat/p1-1-apr-pull-dataset
