feat(p1-1): apr pull dataset extension — unblocks MODEL-2 corpus diversification per §26.8 #1089

Merged: noahgift merged 1 commit into main from feat/p1-1-apr-pull-dataset on Apr 27, 2026

Summary

Implements P1.1 of SHIP-TWO-001 §26.8 stack-tool-extension chain. Extends apr pull with the dataset asset-type per apr-cli-pull-dataset-v1.yaml to unblock the corpus pipeline P1.4 → P2 → MODEL-2 convergence.

    apr pull dataset <REPO> [--include <GLOB>] [--revision <REV>] [--output <DIR>]

Backward-compat preserved: existing apr pull <MODEL> model path unchanged.

Why this matters (per §26.8)

The previous session flagged a sub-agent attempting to route around the missing apr pull dataset capability via huggingface-cli download --include. Per feedback_stack_tool_extension_not_cli_shim.md (apr is canonical post-APR-MONO), the correct fix is to extend apr — not bypass it. This PR is that fix.

P1 of MODEL-2 (codeparrot/github-code-clean → 1B+ Python tokens → MODEL-2 convergence) was gated on this landing. Empirical §24/§25 results showed corpus-diversity is the binding constraint for MODEL-2 (val_loss=9.75 plateau on 4× CSN-Python).

Falsification verdict (live RTX 4090, 2026-04-27)

| Test | Result |
|------|--------|
| FALSIFY-APR-PULL-DATASET-001 (--help shows --include) | ✅ PASS |
| FALSIFY-APR-PULL-DATASET-002 (glob filters correctly) | ✅ PASS (--include 'README.md' on openwebtext → 1 file) |
| FALSIFY-APR-PULL-DATASET-003 (no-match exits non-zero) | ✅ PASS (exit code 5, ValidationFailed) |
| FALSIFY-APR-PULL-DATASET-004 (license-allowlist) | ⏸️ DEFERRED to P1.1.5 (parquet round-trip) |
| FALSIFY-APR-PULL-DATASET-005 (model backward-compat) | ✅ PASS (apr pull qwen2.5-coder --dry-run clean) |
| FALSIFY-APR-PULL-DATASET-006 (3-surface drift) | ✅ PASS (cargo test cli_commands 6/6) |
| FALSIFY-APR-PULL-DATASET-007 (pv validate) | ✅ PASS (contract 1.0.0 → 1.1.0 ACTIVE) |
| FALSIFY-APR-PULL-DATASET-008 (no muda) | ✅ PASS |

5 unit tests for filter_files_by_globs PASS.

Implementation details

  • commands_enum.rs: Pull variant gains repo, include, output fields. First positional discriminates: "dataset" triggers dataset-pull; otherwise falls through to existing model-pull.
  • dispatch.rs: Routes to pull::run_dataset(...) when model_ref == "dataset".
  • commands/pull_dataset.rs (NEW, 196 LOC): HF API listing (/api/datasets/{repo}/tree/{rev}?recursive=1) + glob filtering + fail-fast on no-match + reuse of existing download_file_with_progress streaming helper.
  • Default cache: ~/.cache/aprender/datasets/<repo>/ (or $XDG_CACHE_HOME/aprender/datasets/<repo>/).
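
The positional-discriminator routing described above can be sketched roughly as follows. This is an illustration only: `PullRoute` and `route_pull` are hypothetical names, and the real code goes through the `Pull` variant in commands_enum.rs and `pull::run_dataset(...)` in dispatch.rs, whose exact signatures are not shown in this PR text.

```rust
/// Hypothetical summary of where a parsed `apr pull` invocation ends up.
#[derive(Debug, PartialEq)]
enum PullRoute {
    /// `apr pull dataset <REPO> ...`
    Dataset { repo: String, include: Vec<String> },
    /// `apr pull <MODEL>` (the pre-existing model path)
    Model { model_ref: String },
}

/// Assumed shape: `model_ref` is the first positional, `repo` the second.
fn route_pull(
    model_ref: &str,
    repo: Option<&str>,
    include: &[String],
) -> Result<PullRoute, String> {
    if model_ref == "dataset" {
        // The literal "dataset" selects the dataset puller and needs a repo.
        let repo = repo.ok_or_else(|| "dataset pull requires <REPO>".to_string())?;
        Ok(PullRoute::Dataset {
            repo: repo.to_string(),
            include: include.to_vec(),
        })
    } else {
        // Anything else falls through to the unchanged model puller.
        Ok(PullRoute::Model {
            model_ref: model_ref.to_string(),
        })
    }
}

fn main() {
    let include = vec!["README.md".to_string()];
    println!("{:?}", route_pull("dataset", Some("openwebtext"), &include));
    println!("{:?}", route_pull("qwen2.5-coder", None, &[]));
}
```

One consequence of this design is that a model literally named "dataset" can no longer be pulled by bare name, which is the trade-off the contract accepts by making the asset type a positional rather than a flag.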

P1.1.5 deferred

License-allowlist row filter (--license-allowlist <CSV>, --license-column <NAME>) requires parquet/jsonl row-level filtering — non-trivial new infrastructure (parquet-rs round-trip, jsonl re-encode). Scoped out per §26.8 acceptable-exception list. Contract retains the equation specification; implementation tracked separately.

Test plan

  • cargo build --release -p apr-cli clean
  • cargo test -p apr-cli --release --lib pull_dataset_tests (5/5 PASS)
  • cargo test -p apr-cli --release --test cli_commands (6/6 PASS)
  • pv validate contracts/apr-cli-pull-dataset-v1.yaml (0 errors)
  • Live smoke: apr pull dataset openwebtext --include 'README.md' --output /tmp/x → 1 file pulled
  • Live smoke: apr pull dataset openwebtext --include 'no/such/*' → exit 5
  • Backward-compat: apr pull qwen2.5-coder --dry-run exits 0

Next session

P1.4 (pull codeparrot/github-code-clean) can now proceed via:

    apr pull dataset codeparrot/github-code-clean \
        --include 'data/train-000[0-7][0-9]-of-00880.parquet' \
        --output /mnt/nvme-raid0/datasets/github-code-clean

Then P2 retrain MODEL-2 on the bigger corpus (~7.3 hr GPU).
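
As a sanity check on the `--include` pattern in that command, the sketch below enumerates which shard names it would select. It assumes shards are zero-padded and numbered from 00000 to 00879 (inferred from the `-of-00880` suffix, not confirmed by this PR); `index_matches` and `matching_shards` are hypothetical helpers, not part of apr.

```rust
/// True when a zero-padded shard index matches the glob fragment `000[0-7][0-9]`.
fn index_matches(idx: &str) -> bool {
    let b = idx.as_bytes();
    b.len() == 5
        && idx.starts_with("000")
        && (b'0'..=b'7').contains(&b[3])
        && (b'0'..=b'9').contains(&b[4])
}

/// Lists the shard paths the pattern selects, assuming shards are
/// numbered 00000 up to `total - 1`.
fn matching_shards(total: u32) -> Vec<String> {
    (0..total)
        .map(|i| format!("{:05}", i))
        .filter(|idx| index_matches(idx))
        .map(|idx| format!("data/train-{}-of-{:05}.parquet", idx, total))
        .collect()
}

fn main() {
    let shards = matching_shards(880);
    println!("{} shards selected", shards.len());
    println!("first: {}", shards[0]);
    println!("last:  {}", shards[shards.len() - 1]);
}
```

Under that numbering assumption the pattern covers indices 00000 through 00079, which lines up with the 80 shards reported for P1.4 in the follow-up commit.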

🤖 Generated with Claude Code

noahgift enabled auto-merge (squash) on April 27, 2026 at 13:53

Implements P1.1 of SHIP-TWO-001 §26.8 stack-tool-extension chain. Extends
`apr pull` with the dataset asset-type per `apr-cli-pull-dataset-v1.yaml`:

    apr pull dataset <REPO> [--include <GLOB>] [--revision <REV>] [--output <DIR>]

Backward-compat preserved: `apr pull <MODEL>` model path unchanged.

## What this enables

P1 of the corpus pipeline (codeparrot/github-code-clean → 1B+ Python tokens
→ MODEL-2 convergence) is gated on this. Stack-canonical workflow replacing
the route-around `huggingface-cli download --include` muda flagged in §26.8.

## Implementation

- **commands_enum.rs**: New `Pull` fields — `repo: Option<String>`,
  `include: Vec<String>`, `output: Option<PathBuf>`. Discriminates on first
  positional == "dataset" (per contract: "asset-type is a positional
  discriminator, not a flag").
- **dispatch.rs**: Routes to `pull::run_dataset(repo, include, revision, output)`
  when first positional == "dataset"; else falls through to existing model
  puller.
- **commands/pull_dataset.rs** (NEW, 196 LOC): Lists files via HF API
  `/api/datasets/{repo}/tree/{rev}?recursive=1`; filters by glob patterns;
  fail-fast on no-match; streams matched files via existing
  `download_file_with_progress` helper.
- **Default cache**: `~/.cache/aprender/datasets/<repo>/` (or
  `$XDG_CACHE_HOME/aprender/datasets/<repo>/`).
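
To make the listing endpoint and the default cache location concrete, here is a minimal sketch of how the two strings could be assembled. The function names `tree_url` and `default_cache_dir`, the `https://huggingface.co` host constant, and the use of `main` as a revision in the demo are assumptions for illustration; only the path templates come from the list above.

```rust
use std::path::PathBuf;

/// Assumed API host; the PR text only gives the path template.
const HF_HOST: &str = "https://huggingface.co";

/// Builds the recursive file-listing URL for a dataset repo at a revision.
fn tree_url(repo: &str, rev: &str) -> String {
    format!("{}/api/datasets/{}/tree/{}?recursive=1", HF_HOST, repo, rev)
}

/// Resolves the default cache directory: `$XDG_CACHE_HOME` when set and
/// non-empty, otherwise `<home>/.cache`, followed by `aprender/datasets/<repo>`.
fn default_cache_dir(xdg_cache_home: Option<&str>, home: &str, repo: &str) -> PathBuf {
    let base = match xdg_cache_home {
        Some(x) if !x.is_empty() => PathBuf::from(x),
        _ => PathBuf::from(home).join(".cache"),
    };
    base.join("aprender").join("datasets").join(repo)
}

fn main() {
    let xdg = std::env::var("XDG_CACHE_HOME").ok();
    let home = std::env::var("HOME").unwrap_or_else(|_| ".".to_string());
    println!("{}", tree_url("openwebtext", "main"));
    println!(
        "{}",
        default_cache_dir(xdg.as_deref(), &home, "openwebtext").display()
    );
}
```

Passing the environment values in as parameters, rather than reading them inside `default_cache_dir`, keeps the resolution logic easy to unit-test.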

## Falsification verdict (live on noah-Lambda-Vector RTX 4090)

| ID | Rule | Result |
|----|------|--------|
| FALSIFY-APR-PULL-DATASET-001 | --help shows --include | PASS (visible in `apr pull dataset --help`) |
| FALSIFY-APR-PULL-DATASET-002 | include glob filters correctly | PASS (openwebtext --include 'README.md' → 1 file) |
| FALSIFY-APR-PULL-DATASET-003 | no-match exits non-zero | PASS (exit code 5 = ValidationFailed) |
| FALSIFY-APR-PULL-DATASET-004 | license-allowlist filters rows | DEFERRED to P1.1.5 (parquet round-trip required) |
| FALSIFY-APR-PULL-DATASET-005 | model path backward-compat | PASS (`apr pull qwen2.5-coder --dry-run` resolves cleanly) |
| FALSIFY-APR-PULL-DATASET-006 | 3-surface drift gate | PASS (`cargo test cli_commands` 6/6 pass) |
| FALSIFY-APR-PULL-DATASET-007 | pv validate exits 0 | PASS (contract bumped 1.0.0 → 1.1.0 ACTIVE) |
| FALSIFY-APR-PULL-DATASET-008 | no batuta hf pull / huggingface-cli | PASS (no regressions in P1 scripts) |

5 unit tests for `filter_files_by_globs` (empty, glob match, no-match, multi-include
union, invalid glob) PASS.
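
The five cases named above can be illustrated with a dependency-free sketch. It borrows the name `filter_files_by_globs`, but the signature, the `FilterError` type, and the hand-rolled matcher are all assumptions, not the code in pull_dataset.rs. In this sketch an empty include list keeps every file and `*` may cross `/`; the real behaviour may differ.

```rust
#[derive(Debug, PartialEq)]
enum FilterError {
    InvalidGlob(String),
    NoMatch,
}

enum Tok {
    Lit(char),
    AnyOne,                   // ?
    AnyRun,                   // *
    Class(Vec<(char, char)>), // [a-z0-9]
}

/// Compiles a glob into tokens; an unclosed or empty `[` class is invalid.
fn compile(pattern: &str) -> Result<Vec<Tok>, FilterError> {
    let chars: Vec<char> = pattern.chars().collect();
    let bad = || FilterError::InvalidGlob(pattern.to_string());
    let (mut toks, mut i) = (Vec::new(), 0);
    while i < chars.len() {
        match chars[i] {
            '*' => toks.push(Tok::AnyRun),
            '?' => toks.push(Tok::AnyOne),
            '[' => {
                let rest = &chars[i + 1..];
                let len = rest.iter().position(|&c| c == ']').ok_or_else(bad)?;
                let body = &rest[..len];
                if body.is_empty() {
                    return Err(bad());
                }
                let (mut ranges, mut j) = (Vec::new(), 0);
                while j < body.len() {
                    if j + 2 < body.len() && body[j + 1] == '-' {
                        ranges.push((body[j], body[j + 2]));
                        j += 3;
                    } else {
                        ranges.push((body[j], body[j]));
                        j += 1;
                    }
                }
                toks.push(Tok::Class(ranges));
                i += len + 1; // now points at the closing ']'
            }
            c => toks.push(Tok::Lit(c)),
        }
        i += 1;
    }
    Ok(toks)
}

/// Backtracking matcher: simple, not optimised for pathological patterns.
fn matches(toks: &[Tok], text: &[char]) -> bool {
    match toks.split_first() {
        None => text.is_empty(),
        Some((Tok::AnyRun, rest)) => (0..=text.len()).any(|k| matches(rest, &text[k..])),
        Some((tok, rest)) => match text.split_first() {
            None => false,
            Some((&c, tail)) => {
                let ok = match tok {
                    Tok::Lit(l) => *l == c,
                    Tok::AnyOne => true,
                    Tok::Class(rs) => rs.iter().any(|&(lo, hi)| lo <= c && c <= hi),
                    Tok::AnyRun => unreachable!(),
                };
                ok && matches(rest, tail)
            }
        },
    }
}

/// Keeps files matching any include pattern; errors when nothing matches.
fn filter_files_by_globs(files: &[String], include: &[String]) -> Result<Vec<String>, FilterError> {
    if include.is_empty() {
        return Ok(files.to_vec()); // assumption: no patterns means no filtering
    }
    let compiled = include.iter().map(|p| compile(p)).collect::<Result<Vec<_>, _>>()?;
    let kept: Vec<String> = files
        .iter()
        .filter(|f| {
            let text: Vec<char> = f.chars().collect();
            compiled.iter().any(|toks| matches(toks, &text))
        })
        .cloned()
        .collect();
    if kept.is_empty() { Err(FilterError::NoMatch) } else { Ok(kept) }
}

fn main() {
    let files: Vec<String> = ["README.md", "data/train-00000-of-00880.parquet"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    match filter_files_by_globs(&files, &["no/such/*".to_string()]) {
        Ok(kept) => println!("kept: {:?}", kept),
        Err(e) => {
            eprintln!("error: {:?}", e);
            std::process::exit(5); // exit code 5 = ValidationFailed, per the table above
        }
    }
}
```

Compiling every pattern before touching the file list means an invalid glob is reported even when an earlier pattern would already have matched.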

## Contract bump

- `metadata.version`: 1.0.0 → 1.1.0
- `metadata.status`: PROPOSED → ACTIVE
- `proof_obligations[2].type`: safety → soundness (pv-validator schema fix)
- `proof_obligations[3].type`: liveness → termination (pv-validator schema fix)

## P1.1.5 deferred

License-allowlist row filter (`--license-allowlist <CSV>`,
`--license-column <NAME>`) requires parquet/jsonl row-level filtering, which is
non-trivial new infrastructure (parquet-rs round-trip, jsonl re-encode). Scoped
out per spec §26.8 acceptable-exception list. Contract retains the equation
specification; implementation tracked separately.

## Files changed

- `crates/apr-cli/src/commands_enum.rs` (Pull variant)
- `crates/apr-cli/src/dispatch.rs` (asset_type dispatch)
- `crates/apr-cli/src/commands/pull.rs` (include!() new module)
- `crates/apr-cli/src/commands/pull_dataset.rs` (NEW)
- `crates/apr-cli/src/lib_dispatch_coverage.rs` (test fixture)
- `crates/apr-cli/src/lib_extract_paths.rs` (test fixture)
- `crates/apr-cli/src/lib_parse_export.rs` (test fixtures, 3 sites)
- `contracts/apr-cli-pull-dataset-v1.yaml` (PROPOSED → ACTIVE 1.1.0)
- `crates/apr-cli/src/lib.rs` (rustfmt-only)
- `crates/apr-cli/src/commands/stamp.rs` (rustfmt-only)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift force-pushed the feat/p1-1-apr-pull-dataset branch from b762a0b to 1626923 on April 27, 2026 at 14:20
noahgift merged commit 6f5cfee into main on Apr 27, 2026 (10 checks passed)
noahgift deleted the feat/p1-1-apr-pull-dataset branch on April 27, 2026 at 14:39
noahgift added a commit that referenced this pull request Apr 28, 2026
…confirms §25 corpus-diversity hypothesis — v2.77 → v2.78 (#1094)

P1 corpus pipeline complete end-to-end. P2 MODEL-2 retrain on 565.6M-token
codeparrot Python+permissive corpus (7.6× the 4× CSN-Python baseline)
pushes val_loss from the 9.7507 plateau to 9.3837, a 0.367-nat (3.8%)
improvement with the SAME training configuration.
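
The arithmetic behind these figures can be reproduced with a few lines. The sketch assumes val_loss is a mean per-token cross-entropy in nats, so that exponentiating the difference gives a perplexity ratio; `improvement` is a hypothetical helper, not part of the training stack.

```rust
/// Returns (absolute drop in nats, relative drop, perplexity ratio new/old).
/// The perplexity ratio is only meaningful if the loss is a mean
/// per-token cross-entropy measured in nats (an assumption here).
fn improvement(baseline: f64, retrained: f64) -> (f64, f64, f64) {
    let delta = baseline - retrained;
    let relative = delta / baseline;
    let ppl_ratio = (-delta).exp();
    (delta, relative, ppl_ratio)
}

fn main() {
    let (delta, relative, ppl_ratio) = improvement(9.7507, 9.3837);
    println!("absolute drop:    {:.4} nats", delta);
    println!("relative drop:    {:.2}%", relative * 100.0);
    println!("perplexity ratio: {:.4}", ppl_ratio);
}
```

The perplexity ratio is often the more intuitive number, because a fixed drop in nats corresponds to the same multiplicative change in perplexity regardless of the starting loss.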

§25 had concluded (after 80K-step LR-budget falsification on 4× CSN-Python):
  "There is no LR/step configuration that beats val_loss=9.75 on
   CSN-Python — only Stack v2 will move the needle."

§33 confirms this empirically. The corpus-diversity binding criterion of
§26.9 is satisfied.

## Pipeline (all stack-canonical, no muda)

| Phase | Outcome |
|-------|---------|
| P1.0 contract authored (PROPOSED → ACTIVE) | #1080, #1089 |
| P1.1 apr pull dataset extension | #1089 MERGED |
| P1.4 codeparrot pull | 80 shards / 27 GB |
| P1.5a parquet → JSONL filter | 405,904 rows / 3.17 GB |
| P1.5b BPE encode-corpus | 57 shards / 565.6M tokens / 10h |
| P2 MODEL-2 retrain on RTX 4090 | EARLY_STOP at 51 ep / 47 min |

Total wall time from contract authoring to val_loss=9.3837: ~14 hours.

## Training curve highlights

- epoch 0: train=9.7567, val=10.0698 (init)
- epoch 10: train=9.4610, val=9.5657 (post-warmup)
- epoch 30: train=9.2x, val=9.42x
- epoch 44: val=9.3837 (BEST)
- epoch 50: train=9.2093, val=9.3889 (EARLY_STOP next)

Full per-epoch metadata in evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json.

## Coverage impact

§33 is binding evidence for SHIP-021 (corpus diversity binding) — promotion
to DISCHARGED is deferred to a separate PR that updates the SHIP-021
contract atomically. Spec scoreboard unchanged (15+33) in this PR.

## Files

- evidence/model-2-codeparrot-retrain-2026-04-28/launch.log
- evidence/model-2-codeparrot-retrain-2026-04-28/all-epochs.json
- §33 spec section (8 subsections, ~80 lines)
- Header: v2.77.0 → v2.78.0

## Methodology landed

The §26.8 stack-tool-extension rule paid off concretely:
- 6h authoring cost (P1.0 contract + P1.1 impl) → permanent apr capability
- Every future dataset pull benefits
- §33's val_loss=9.3837 is downstream proof of the methodology

This commit represents the first cycle in §22→§33 where the spec amendment
has the same priority as the empirical result.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>