feat(p1-1): apr pull dataset extension, unblocks MODEL-2 corpus diversification per §26.8 #1089
Merged
Implements P1.1 of SHIP-TWO-001 §26.8 stack-tool-extension chain. Extends
`apr pull` with the dataset asset-type per `apr-cli-pull-dataset-v1.yaml`:

    apr pull dataset <REPO> [--include <GLOB>] [--revision <REV>] [--output <DIR>]

Backward-compat preserved: `apr pull <MODEL>` model path unchanged.
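
The positional-discriminator rule can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from this PR: `PullTarget`, `classify_pull`, and `flag_value` are hypothetical names, the hand-rolled argument loop stands in for whatever parser `apr-cli` actually uses, and only the flags from the usage line above are recognized.

```rust
use std::path::PathBuf;

/// What the arguments after `apr pull` resolve to (hypothetical type).
#[derive(Debug, PartialEq)]
enum PullTarget {
    /// `apr pull <MODEL>`: the pre-existing model path, left untouched.
    Model { model_ref: String },
    /// `apr pull dataset <REPO> [...]`: the new dataset path.
    Dataset {
        repo: String,
        include: Vec<String>,
        revision: Option<String>,
        output: Option<PathBuf>,
    },
}

/// Read the value that must follow a flag such as `--include`.
fn flag_value(it: &mut std::slice::Iter<'_, &str>, flag: &str) -> Result<String, String> {
    it.next()
        .map(|v| v.to_string())
        .ok_or_else(|| format!("{flag} requires a value"))
}

/// Classify the arguments that follow `apr pull` (illustrative parser only).
fn classify_pull(args: &[&str]) -> Result<PullTarget, String> {
    let (first, rest) = args
        .split_first()
        .ok_or("expected <MODEL> or `dataset <REPO>`")?;

    // The asset type is a positional discriminator, not a flag:
    // anything other than the literal word "dataset" is a model reference.
    if *first != "dataset" {
        return Ok(PullTarget::Model { model_ref: first.to_string() });
    }

    let mut repo = None;
    let mut include = Vec::new();
    let mut revision = None;
    let mut output = None;
    let mut it = rest.iter();
    while let Some(arg) = it.next() {
        match *arg {
            // `--include` may repeat; each occurrence adds one glob.
            "--include" => include.push(flag_value(&mut it, "--include")?),
            "--revision" => revision = Some(flag_value(&mut it, "--revision")?),
            "--output" => output = Some(PathBuf::from(flag_value(&mut it, "--output")?)),
            flag if flag.starts_with("--") => return Err(format!("unknown flag {flag}")),
            positional if repo.is_none() => repo = Some(positional.to_string()),
            extra => return Err(format!("unexpected argument {extra}")),
        }
    }
    let repo = repo.ok_or("missing <REPO>")?;
    Ok(PullTarget::Dataset { repo, include, revision, output })
}
```

In this sketch `classify_pull(&["qwen2.5-coder"])` resolves to the model variant and `classify_pull(&["dataset", "openwebtext", "--include", "README.md"])` to the dataset variant. One consequence of discriminating on the first positional is that, here, a model literally named `dataset` could not be addressed through the bare model path.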
## What this enables
P1 of the corpus pipeline (codeparrot/github-code-clean → 1B+ Python tokens
→ MODEL-2 convergence) is gated on this. It is the stack-canonical workflow and
replaces the `huggingface-cli download --include` route-around that §26.8
flagged as muda (waste).
## Implementation
- **commands_enum.rs**: New `Pull` fields — `repo: Option<String>`,
`include: Vec<String>`, `output: Option<PathBuf>`. Discriminates on first
positional == "dataset" (per contract: "asset-type is a positional
discriminator, not a flag").
- **dispatch.rs**: Routes to `pull::run_dataset(repo, include, revision, output)`
when first positional == "dataset"; else falls through to existing model
puller.
- **commands/pull_dataset.rs** (NEW, 196 LOC): Lists files via HF API
`/api/datasets/{repo}/tree/{rev}?recursive=1`; filters by glob patterns;
fail-fast on no-match; streams matched files via existing
`download_file_with_progress` helper.
- **Default cache**: `~/.cache/aprender/datasets/<repo>/` (or
`$XDG_CACHE_HOME/aprender/datasets/<repo>/`).
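
As a rough illustration of the listing request and the default cache location described in the list above, the two path computations might look like this. The `https://huggingface.co` host, the `main` fallback revision, and both function names are assumptions that this PR does not state; environment values are passed in as arguments so the logic can be exercised without touching the process environment.

```rust
use std::path::PathBuf;

/// Build the recursive tree-listing URL for a dataset repo.
/// ASSUMPTIONS: the host and the "main" default revision are not taken from the PR.
fn dataset_tree_url(repo: &str, revision: Option<&str>) -> String {
    let rev = revision.unwrap_or("main");
    format!("https://huggingface.co/api/datasets/{repo}/tree/{rev}?recursive=1")
}

/// Resolve the default dataset cache directory:
/// `$XDG_CACHE_HOME/aprender/datasets/<repo>/` when the variable is set and
/// non-empty, otherwise `~/.cache/aprender/datasets/<repo>/`.
fn default_dataset_cache(xdg_cache_home: Option<&str>, home: &str, repo: &str) -> PathBuf {
    let base = match xdg_cache_home {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(home).join(".cache"),
    };
    // A repo such as "codeparrot/github-code-clean" becomes nested directories.
    base.join("aprender").join("datasets").join(repo)
}
```

A caller would typically feed `std::env::var("XDG_CACHE_HOME").ok().as_deref()` and the user's home directory into `default_dataset_cache`. The sketch does not percent-encode revisions, so a revision containing `/` would need extra handling.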
## Falsification verdict (live on noah-Lambda-Vector RTX 4090)
| ID | Rule | Result |
|----|------|--------|
| FALSIFY-APR-PULL-DATASET-001 | --help shows --include | PASS (visible in `apr pull dataset --help`) |
| FALSIFY-APR-PULL-DATASET-002 | include glob filters correctly | PASS (openwebtext --include 'README.md' → 1 file) |
| FALSIFY-APR-PULL-DATASET-003 | no-match exits non-zero | PASS (exit code 5 = ValidationFailed) |
| FALSIFY-APR-PULL-DATASET-004 | license-allowlist filters rows | DEFERRED to P1.1.5 (parquet round-trip required) |
| FALSIFY-APR-PULL-DATASET-005 | model path backward-compat | PASS (`apr pull qwen2.5-coder --dry-run` resolves cleanly) |
| FALSIFY-APR-PULL-DATASET-006 | 3-surface drift gate | PASS (`cargo test cli_commands` 6/6 pass) |
| FALSIFY-APR-PULL-DATASET-007 | pv validate exits 0 | PASS (contract bumped 1.0.0 → 1.1.0 ACTIVE) |
| FALSIFY-APR-PULL-DATASET-008 | no batuta hf pull / huggingface-cli | PASS (no regressions in P1 scripts) |
5 unit tests for `filter_files_by_globs` (empty, glob match, no-match, multi-include
union, invalid glob) PASS.
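
The five cases named above can be illustrated with a small stand-in. Only the name `filter_files_by_globs` comes from this PR; the signature, the `FilterError` type, the hand-written `wildcard_match` (supporting just `*` and `?`, with `*` allowed to cross `/`), the "empty include list means every file" rule, and the definition of an invalid glob are all assumptions made for this sketch. The real implementation may well rely on a glob crate with different semantics.

```rust
/// Errors the sketch can report (hypothetical type).
#[derive(Debug, PartialEq)]
enum FilterError {
    InvalidGlob(String),
    NoMatch,
}

/// Match `text` against `pattern`: `*` matches any run of characters
/// (including `/`), `?` matches exactly one character.
fn wildcard_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Index of the last `*` in the pattern and the text index it was tried at.
    let mut backtrack: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            backtrack = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '?' || p[pi] == t[ti]) {
            pi += 1;
            ti += 1;
        } else if let Some((star_pi, star_ti)) = backtrack {
            // Let the last `*` absorb one more character, then retry.
            pi = star_pi + 1;
            ti = star_ti + 1;
            backtrack = Some((star_pi, star_ti + 1));
        } else {
            return false;
        }
    }
    // Whatever pattern remains must consist only of `*`.
    p[pi..].iter().all(|&c| c == '*')
}

/// Keep files matching at least one include glob (union semantics).
/// ASSUMPTION: an empty include list selects every file.
fn filter_files_by_globs(files: &[String], include: &[String]) -> Result<Vec<String>, FilterError> {
    for pat in include {
        // This sketch rejects empty patterns and bracket classes it does not implement.
        if pat.is_empty() || pat.contains('[') {
            return Err(FilterError::InvalidGlob(pat.clone()));
        }
    }
    let matched: Vec<String> = files
        .iter()
        .filter(|f| include.is_empty() || include.iter().any(|p| wildcard_match(p, f)))
        .cloned()
        .collect();
    if matched.is_empty() {
        // Fail fast: the caller maps this to a non-zero exit.
        return Err(FilterError::NoMatch);
    }
    Ok(matched)
}
```

In the CLI described here, the no-match case would surface as the `ValidationFailed` exit (code 5) reported for FALSIFY-APR-PULL-DATASET-003.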
## Contract bump
- `metadata.version`: 1.0.0 → 1.1.0
- `metadata.status`: PROPOSED → ACTIVE
- `proof_obligations[2].type`: safety → soundness (pv-validator schema fix)
- `proof_obligations[3].type`: liveness → termination (pv-validator schema fix)
## P1.1.5 deferred
License-allowlist row filter (`--license-allowlist <CSV>`,
`--license-column <NAME>`) requires parquet/jsonl row-level filtering, which is
non-trivial new infrastructure (parquet-rs round-trip, jsonl re-encode). Scoped
out per spec §26.8 acceptable-exception list. Contract retains the equation
specification; implementation tracked separately.
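
To make the deferred behavior concrete, the row predicate alone (without the parquet/jsonl round-trip that makes P1.1.5 expensive) might look like the sketch below. Everything here is hypothetical: the function names, the in-memory `HashMap` row representation, case-insensitive matching, and the choice to drop rows that lack the license column are assumptions, not the contract's specification.

```rust
use std::collections::{HashMap, HashSet};

/// Parse a `--license-allowlist <CSV>` value into a normalized set.
/// ASSUMPTION: entries are trimmed and compared case-insensitively.
fn parse_allowlist(csv: &str) -> HashSet<String> {
    csv.split(',')
        .map(|s| s.trim().to_ascii_lowercase())
        .filter(|s| !s.is_empty())
        .collect()
}

/// Keep only rows whose `license_column` value is in the allowlist.
/// ASSUMPTION: rows missing the column are dropped rather than kept.
fn filter_rows_by_license(
    rows: Vec<HashMap<String, String>>,
    license_column: &str,
    allowlist: &HashSet<String>,
) -> Vec<HashMap<String, String>> {
    rows.into_iter()
        .filter(|row| {
            row.get(license_column)
                .map(|lic| allowlist.contains(&lic.trim().to_ascii_lowercase()))
                .unwrap_or(false)
        })
        .collect()
}
```

The predicate is the easy part; decoding each parquet or jsonl row into such a map and re-encoding the survivors is the infrastructure this PR scopes out.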
## Files changed
- `crates/apr-cli/src/commands_enum.rs` (Pull variant)
- `crates/apr-cli/src/dispatch.rs` (asset_type dispatch)
- `crates/apr-cli/src/commands/pull.rs` (include!() new module)
- `crates/apr-cli/src/commands/pull_dataset.rs` (NEW)
- `crates/apr-cli/src/lib_dispatch_coverage.rs` (test fixture)
- `crates/apr-cli/src/lib_extract_paths.rs` (test fixture)
- `crates/apr-cli/src/lib_parse_export.rs` (test fixtures, 3 sites)
- `contracts/apr-cli-pull-dataset-v1.yaml` (PROPOSED → ACTIVE 1.1.0)
- `crates/apr-cli/src/lib.rs` (rustfmt-only)
- `crates/apr-cli/src/commands/stamp.rs` (rustfmt-only)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on Apr 28, 2026:
…confirms §25 corpus-diversity hypothesis: v2.77 → v2.78 (#1094)

P1 corpus pipeline complete end-to-end. P2 MODEL-2 retrain on the 565.6M-token codeparrot Python+permissive corpus (7.6× the 4× CSN-Python baseline) pushes val_loss from the 9.7507 plateau to 9.3837, a 0.367-nat (3.8%) improvement with the SAME training configuration.

§25 had concluded (after 80K-step LR-budget falsification on 4× CSN-Python): "There is no LR/step configuration that beats val_loss=9.75 on CSN-Python — only Stack v2 will move the needle." §33 confirms this empirically. The corpus-diversity binding criterion of §26.9 is satisfied.

## Pipeline (all stack-canonical, no muda)
| Phase | Outcome |
|-------|---------|
| P1.0 contract authored (PROPOSED → ACTIVE) | #1080 → #1089 |
| P1.1 apr pull dataset extension | #1089 MERGED |
| P1.4 codeparrot pull | 80 shards / 27 GB |
| P1.5a parquet → JSONL filter | 405,904 rows / 3.17 GB |
| P1.5b BPE encode-corpus | 57 shards / 565.6M tokens / 10h |
| P2 MODEL-2 retrain on RTX 4090 | EARLY_STOP at 51 ep / 47 min |

Total wall time from contract authoring to val_loss=9.3837: ~14 hours.

## Training curve highlights
- epoch 0: train=9.7567, val=10.0698 (init)
- epoch 10: train=9.4610, val=9.5657 (post-warmup)
- epoch 30: train=9.2x, val=9.42x
- epoch 44: val=9.3837 (BEST)
- epoch 50: train=9.2093, val=9.3889 (EARLY_STOP next)

Full per-epoch metadata in evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json.

## Coverage impact
§33 is binding evidence for SHIP-021 (corpus diversity binding); promotion to DISCHARGED is deferred to a separate PR that updates the SHIP-021 contract atomically. Spec scoreboard unchanged (15+33) in this PR.

## Files
- evidence/model-2-codeparrot-retrain-2026-04-28/launch.log
- evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json
- §33 spec section (8 subsections, ~80 lines)
- Header: v2.77.0 → v2.78.0

## Methodology landed
The §26.8 stack-tool-extension rule paid off concretely:
- 6h authoring cost (P1.0 contract + P1.1 impl) → permanent apr capability
- Every future dataset pull benefits
- §33's val_loss=9.3837 is downstream proof of the methodology

This commit represents the first cycle in §22→§33 where the spec amendment has the same priority as the empirical result.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
## Summary
Implements P1.1 of SHIP-TWO-001 §26.8 stack-tool-extension chain. Extends `apr pull` with the dataset asset-type per `apr-cli-pull-dataset-v1.yaml` to unblock the corpus pipeline P1.4 → P2 → MODEL-2 convergence.

Backward-compat preserved: existing `apr pull <MODEL>` model path unchanged.

## Why this matters (per §26.8)
The previous session flagged a sub-agent attempting to route around the missing `apr pull dataset` capability via `huggingface-cli download --include`. Per `feedback_stack_tool_extension_not_cli_shim.md` (apr is canonical post-APR-MONO), the correct fix is to extend `apr`, not bypass it. This PR is that fix.

P1 of MODEL-2 (codeparrot/github-code-clean → 1B+ Python tokens → MODEL-2 convergence) was gated on this landing. Empirical §24/§25 results showed corpus-diversity is the binding constraint for MODEL-2 (val_loss=9.75 plateau on 4× CSN-Python).

## Falsification verdict (live RTX 4090, 2026-04-27)
- `--include 'README.md'` on openwebtext → 1 file
- `apr pull qwen2.5-coder --dry-run` clean
- `cargo test cli_commands` 6/6

5 unit tests for `filter_files_by_globs` PASS.

## Implementation details
- `commands_enum.rs`: `Pull` variant gains `repo`, `include`, `output` fields. First positional discriminates: `"dataset"` triggers dataset-pull; otherwise falls through to existing model-pull.
- `dispatch.rs`: routes to `pull::run_dataset(...)` when `model_ref == "dataset"`.
- `commands/pull_dataset.rs` (NEW): HF API listing (`/api/datasets/{repo}/tree/{rev}?recursive=1`) + glob filtering + fail-fast on no-match + reuse of existing `download_file_with_progress` streaming helper.
- Default cache: `~/.cache/aprender/datasets/<repo>/` (or `$XDG_CACHE_HOME/aprender/datasets/<repo>/`).

## P1.1.5 deferred
License-allowlist row filter (`--license-allowlist <CSV>`, `--license-column <NAME>`) requires parquet/jsonl row-level filtering, which is non-trivial new infrastructure (parquet-rs round-trip, jsonl re-encode). Scoped out per §26.8 acceptable-exception list. Contract retains the equation specification; implementation tracked separately.

## Test plan
- `cargo build --release -p apr-cli` clean
- `cargo test -p apr-cli --release --lib pull_dataset_tests` (5/5 PASS)
- `cargo test -p apr-cli --release --test cli_commands` (6/6 PASS)
- `pv validate contracts/apr-cli-pull-dataset-v1.yaml` (0 errors)
- `apr pull dataset openwebtext --include 'README.md' --output /tmp/x` → 1 file pulled
- `apr pull dataset openwebtext --include 'no/such/*'` → exit 5
- `apr pull qwen2.5-coder --dry-run` exits 0

## Next session
P1.4 (pull codeparrot/github-code-clean) can now proceed via `apr pull dataset`.
Then P2 retrain MODEL-2 on the bigger corpus (~7.3 hr GPU).
🤖 Generated with Claude Code