fix(apr-cli): pull dataset — honor --dry-run and follow HF tree pagination (#1410) by noahgift · Pull Request #1411 · paiml/aprender

noahgift · 2026-05-02T14:33:53Z

Closes #1410.

Summary

Two correlated defects in `apr pull dataset`, discovered during a Stack v2 Python pull on noah-Lambda-Vector. Both are fixed at the root cause and covered by new provable-contract gates.

| Defect | Symptom | Fix |
|---|---|---|
| A | `--dry-run` performed full downloads (57 MB written before abort during dataset probe) | Thread `dry_run: bool` through dispatch → `run_dataset`; print plan and return without creating output dir or downloading |
| B | `data/Python/*.parquet` matched 0 files even though the 9 shards exist (HF tree caps at 1000 entries; Python is past the cap alphabetically) | Replace single-call listing with a loop that follows `Link: <next>; rel="next"` until exhausted |
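
The fix for defect B hinges on reading the `rel="next"` target out of a `Link` header. The sketch below shows one way such a parser could look. It is not the PR's `parse_link_next_url`: the name `parse_link_next_url_sketch`, the signature, and the handling of edge cases are assumptions made for illustration, and it deliberately ignores quoted parameter values that contain `<` or `;`.

```rust
/// Hypothetical sketch (not the code from this PR): return the URL of the
/// entry carrying `rel="next"` in a `Link` header value, if there is one.
fn parse_link_next_url_sketch(header: &str) -> Option<String> {
    // Entries look like `<url>; param=value; ..., <url>; param=value`.
    // Splitting on ',' would break URLs that contain commas, so split on
    // the '<' that opens each URI reference instead.
    for entry in header.split('<').skip(1) {
        // `entry` is now `url>; rel="next", ` (the trailing comma is optional).
        let (url, params) = match entry.split_once('>') {
            Some(parts) => parts,
            None => continue, // malformed entry without a closing '>'
        };
        let is_next = params
            .split(';')
            .map(|p| p.trim().trim_end_matches(',').trim())
            .filter_map(|p| p.split_once('='))
            .any(|(key, value)| {
                // `rel` may hold several space-separated relation types.
                key.trim().eq_ignore_ascii_case("rel")
                    && value
                        .trim()
                        .trim_matches('"')
                        .split_whitespace()
                        .any(|rel| rel.eq_ignore_ascii_case("next"))
            });
        let url = url.trim();
        if is_next && !url.is_empty() {
            return Some(url.to_string());
        }
    }
    None
}
```

Keeping the parser a pure `&str -> Option<String>` function means it can be unit-tested with literal header strings and no HTTP client.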

Live verification on noah-Lambda-Vector RTX 4090

Before:
  ✓ 341 files in repo            ← truncated
  ✓ 0 files match include globs
  error: no files matched

After:
  ✓ 921 files in repo            ← pagination working
  ✓ 9 files match include globs
  [dry-run 1/9] data/Python/train-00000-of-00009.parquet
  ... (9 lines)
  ✓ dry-run complete: 9 files matched (no downloads)
  $ ls /tmp/test-pull-dry/        # directory not even created
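
The "After" output can be read as a plan-then-return flow: once the file list is filtered, a dry run prints one line per match and exits before any directory is created. The sketch below illustrates that gating under stated assumptions. `run_dataset_sketch`, `PullOutcome`, and the injected `download` callback are hypothetical names, not the real `run_dataset` signature.

```rust
use std::path::Path;

/// Hypothetical outcome type, used only by this sketch.
#[derive(Debug, PartialEq)]
enum PullOutcome {
    DryRun { planned: Vec<String> },
    Downloaded { count: usize },
}

/// Sketch of dry-run gating. `matched` is the already-filtered file list;
/// `download` stands in for the real download loop (an assumption).
fn run_dataset_sketch<F>(
    matched: &[String],
    out_dir: &Path,
    dry_run: bool,
    mut download: F,
) -> std::io::Result<PullOutcome>
where
    F: FnMut(&str, &Path) -> std::io::Result<()>,
{
    if dry_run {
        // Print the plan and return before touching the filesystem.
        let total = matched.len();
        let planned: Vec<String> = matched
            .iter()
            .enumerate()
            .map(|(i, path)| format!("[dry-run {}/{}] {}", i + 1, total, path))
            .collect();
        for line in &planned {
            println!("{}", line);
        }
        println!("dry-run complete: {} files matched (no downloads)", total);
        return Ok(PullOutcome::DryRun { planned });
    }

    // Only the real path creates the output directory and downloads.
    std::fs::create_dir_all(out_dir)?;
    for path in matched {
        download(path, out_dir)?;
    }
    Ok(PullOutcome::Downloaded { count: matched.len() })
}
```

Placing the `dry_run` check ahead of `create_dir_all` is what makes "directory not even created" checkable: a test can pass a path that does not exist and assert that it still does not exist afterwards.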

Five Whys

See commit message — root-causes traced for both defects (no falsifier covering dry-run dataset path; Stack v2 is the first repo to exercise multi-page listing at scale).

Contract gates

Adds 2 falsifiers to apr-cli-pull-dataset-v1.yaml at PARTIAL_ALGORITHM_LEVEL via 5 new unit tests on parse_link_next_url:

FALSIFY-PULL-DATASET-009 — dry-run on dataset path produces zero file writes
FALSIFY-PULL-DATASET-010 — HF tree listing follows pagination
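
FALSIFY-PULL-DATASET-010 is about the listing loop itself, which can also be exercised without a network if the page fetch is injected. The following is a minimal sketch of such a loop, not the PR's `list_dataset_repo_files`: the `Page` struct, the `fetch` closure, and the `max_pages` guard are all assumptions introduced here.

```rust
/// One page of a paginated listing (hypothetical shape for this sketch).
struct Page {
    /// File paths carried by this page.
    paths: Vec<String>,
    /// URL of the next page, already extracted from the `Link` header.
    next: Option<String>,
}

/// Follow `next` links until a page has none, collecting every path.
/// `max_pages` is an assumed safety cap so a server that keeps returning
/// a next link cannot make the loop run forever.
fn list_all_paths<F>(
    first_url: &str,
    max_pages: usize,
    mut fetch: F,
) -> Result<Vec<String>, String>
where
    F: FnMut(&str) -> Result<Page, String>,
{
    let mut all = Vec::new();
    let mut url = Some(first_url.to_string());
    let mut pages = 0;

    while let Some(current) = url {
        if pages == max_pages {
            return Err(format!("pagination exceeded {} pages", max_pages));
        }
        let page = fetch(&current)?;
        all.extend(page.paths);
        url = page.next; // `None` ends the loop
        pages += 1;
    }
    Ok(all)
}
```

A test can then hand `list_all_paths` a closure that serves two canned pages and assert that paths from the second page appear in the result, which is the property the single-call listing violated.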

Ship % update

Unblocks MODEL-2 ship — Stack v2 (~25 GB Python parquets) is the empirical lever past the 9.75 val_loss plateau. Once this lands, the standing pull command works:

apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet" -o /mnt/nvme-raid0/datasets/the-stack-v2

Test plan

cargo test -p apr-cli --lib pull_dataset — 10/10 pass (5 existing + 5 new)
Live --dry-run on Stack v2 reports 9 matches, 0 disk writes
Live pagination sees 921 files vs. previous 341
CI green (gate + workspace-test)

🤖 Generated with Claude Code

…ation (#1410)

Two correlated defects in `apr pull dataset` discovered during Stack v2 Python pull on noah-Lambda-Vector. Both fixed at root.

## Defect A: `--dry-run` ignored on dataset path

Root cause: `dispatch.rs` model-vs-dataset fork forwarded `dry_run` only to the model puller; `pull_dataset::run_dataset` never accepted the flag.

Result: `apr pull dataset --dry-run` performed full downloads in violation of the documented "no network I/O" contract. Observed 57 MB hit disk before manual abort during a "probe" of bigcode/the-stack-v2.

Fix: thread `dry_run: bool` through to `run_dataset`, list+filter, then print "[dry-run i/N] <path>" lines and return WITHOUT creating the output directory or invoking the download loop.

## Defect B: HF tree API silently truncated at 1000 entries

Root cause: `list_dataset_repo_files` issued a single `GET tree?recursive=1` and treated the response as exhaustive. HF caps each page at 1000 entries and exposes pagination via `Link: <next-url>; rel="next"` header. Datasets with >1000 paths (Stack v2 has ~5000+ across 658 languages) were silently truncated.

Symptom: `apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet"` matched 0 files because Python falls alphabetically past the first 341 single-shard languages in the truncated listing.

Fix: replace single-call list with a loop that follows `Link: rel="next"` until exhausted. Added `parse_link_next_url` (pure RFC 5988 parser, 5 unit tests) so the pagination logic is testable without HTTP.

## Five Whys (defect A: dry-run)

1. Why does `--dry-run` trigger downloads? `run_dataset` doesn't accept a `dry_run` parameter.
2. Why doesn't it accept one? Dispatch only forwards it on the model path; the dataset branch dropped it on the floor.
3. Why doesn't dispatch forward it? No falsifier asserts "dry-run on dataset path produces zero file writes"; only the model code path is gated (FALSIFY-CRUX-A-01-001).
4. Why no such falsifier? Original P1.1 (#164) focused on the functional shard-pattern download path; dry-run was tested only against the model branch and the dataset extension inherited only the happy-path tests.
5. Why not caught at runtime? Until Stack v2, dataset pulls were small-scale (codeparrot/github-code-clean smoke). 25 GB Stack v2 is the first time dry-run was probed at scale; bug surfaced as 57 MB of unwanted writes before manual abort.

## Five Whys (defect B: pagination)

1. Why doesn't `data/Python/*.parquet` match? The 9 Python shards are not in the file listing apr enumerates.
2. Why aren't they enumerated? `list_dataset_repo_files` only sees 341 files (HF's first-page limit).
3. Why does HF cap at 1000? `tree?recursive=1` is paginated with a 1000-entry hard limit; pagination is via `Link: rel=next` header.
4. Why doesn't apr follow it? Original implementation tested against codeparrot (440 files in single page); multi-page case never exercised at scale.
5. Why not caught? No contract gate asserts "listing is complete for repos > 1000 paths". Symptom is silent: missing files just don't match `--include`, surfaces as "0 files match" with no upstream signal.

## Contract additions

- FALSIFY-PULL-DATASET-009: dry-run on dataset path produces zero file writes
- FALSIFY-PULL-DATASET-010: HF tree listing follows pagination (>1000 paths visible for repos that have them)

Both bound at PARTIAL_ALGORITHM_LEVEL via unit tests in `pull_dataset_tests` (5 new tests covering the Link header parser). Full discharge requires live `apr pull dataset bigcode/the-stack-v2` exit-0 + 9 file matches + 0 disk writes evidence (captured below).

## Live verification on noah-Lambda-Vector RTX 4090

Before fix:
  $ apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet"
  ✓ 341 files in repo            ← truncated
  ✓ 0 files match include globs
  error: no files matched

After fix:
  $ apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet" --dry-run
  ✓ 921 files in repo            ← pagination working
  ✓ 9 files match include globs
  [dry-run 1/9] data/Python/train-00000-of-00009.parquet
  ... (9 lines)
  ✓ dry-run complete: 9 files matched (no downloads)
  $ ls -la /tmp/test-pull-dry/   # directory not even created

## Stack v2 unblocked

This unblocks MODEL-2 ship: Stack v2 multi-billion-token Python corpus download (~25 GB across 9 shards) is the empirical lever past the 9.75 val_loss plateau established by 4× CSN-Python runs (memory: project_2026_04_27_session_complete_handoff.md).

Closes #1410.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 2, 2026 14:41

Merge branch 'main' into fix/pull-dataset-dry-run-and-pagination

a5a4828

noahgift merged commit 9bc0d4e into main May 3, 2026
10 checks passed

noahgift deleted the fix/pull-dataset-dry-run-and-pagination branch May 3, 2026 06:18

noahgift merged 2 commits into main from fix/pull-dataset-dry-run-and-pagination