fix(apr-cli): pull dataset — honor --dry-run and follow HF tree pagination (#1410) #1411
Merged
…ation (#1410)

Two correlated defects in `apr pull dataset` were discovered during a Stack v2 Python pull on noah-Lambda-Vector. Both are fixed at the root.

## Defect A: `--dry-run` ignored on dataset path

Root cause: the model-vs-dataset fork in `dispatch.rs` forwarded `dry_run` only to the model puller; `pull_dataset::run_dataset` never accepted the flag.

Result: `apr pull dataset --dry-run` performed full downloads in violation of the documented "no network I/O" contract. 57 MB hit disk before a manual abort during a "probe" of bigcode/the-stack-v2.

Fix: thread `dry_run: bool` through to `run_dataset`, list and filter, then print "[dry-run i/N] <path>" lines and return WITHOUT creating the output directory or invoking the download loop.

## Defect B: HF tree API silently truncated at 1000 entries

Root cause: `list_dataset_repo_files` issued a single `GET tree?recursive=1` and treated the response as exhaustive. HF caps each page at 1000 entries and exposes pagination via a `Link: <next-url>; rel="next"` header. Datasets with >1000 paths (Stack v2 has ~5000+ across 658 languages) were silently truncated.

Symptom: `apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet"` matched 0 files because Python falls alphabetically past the first 341 single-shard languages in the truncated listing.

Fix: replace the single-call list with a loop that follows `Link: rel="next"` until exhausted. Added `parse_link_next_url` (pure RFC 5988 parser, 5 unit tests) so the pagination logic is testable without HTTP.

## Five Whys (defect A: dry-run)

1. Why does `--dry-run` trigger downloads? `run_dataset` doesn't accept a `dry_run` parameter.
2. Why doesn't it accept one? Dispatch only forwards it on the model path; the dataset branch dropped it on the floor.
3. Why doesn't dispatch forward it? No falsifier asserts "dry-run on dataset path produces zero file writes"; only the model code path is gated (FALSIFY-CRUX-A-01-001).
4. Why no such falsifier? The original P1.1 (#164) focused on the functional shard-pattern download path; dry-run was tested only against the model branch, and the dataset extension inherited only the happy-path tests.
5. Why not caught at runtime? Until Stack v2, dataset pulls were small-scale (codeparrot/github-code-clean smoke). The 25 GB Stack v2 pull is the first time dry-run was probed at scale; the bug surfaced as 57 MB of unwanted writes before manual abort.

## Five Whys (defect B: pagination)

1. Why doesn't `data/Python/*.parquet` match? The 9 Python shards are not in the file listing apr enumerates.
2. Why aren't they enumerated? `list_dataset_repo_files` sees only the 341 files returned on HF's first page.
3. Why does HF cap at 1000? `tree?recursive=1` is paginated with a 1000-entry hard limit; pagination is via the `Link: rel=next` header.
4. Why doesn't apr follow it? The original implementation was tested against codeparrot (440 files in a single page); the multi-page case was never exercised at scale.
5. Why not caught? No contract gate asserts "listing is complete for repos > 1000 paths". The symptom is silent: missing files just don't match `--include`, which surfaces as "0 files match" with no upstream signal.

## Contract additions

- FALSIFY-PULL-DATASET-009: dry-run on dataset path produces zero file writes
- FALSIFY-PULL-DATASET-010: HF tree listing follows pagination (>1000 paths visible for repos that have them)

Both are bound at PARTIAL_ALGORITHM_LEVEL via unit tests in `pull_dataset_tests` (5 new tests covering the Link header parser). Full discharge requires live evidence from `apr pull dataset bigcode/the-stack-v2`: exit 0, 9 file matches, and 0 disk writes (captured below).
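As an illustration of the kind of pure Link-header parser described above, here is a minimal Rust sketch. It is not the code from this change: only the name `parse_link_next_url` comes from the commit message, while the signature and body are assumptions written from the description (find the link-value whose `rel` includes `next` and return its target). It also takes a shortcut by treating the next `<` as the start of the next link-value, so a quoted parameter containing `<` would confuse it.

```rust
/// Hypothetical sketch: return the target URL of the `rel="next"` link in a
/// `Link` header value, or `None` when no such link is present.
pub fn parse_link_next_url(header: &str) -> Option<String> {
    let mut rest = header;
    // Each link-value has the shape `<url>; param=value; ...`.
    while let Some(open) = rest.find('<') {
        let after_open = &rest[open + 1..];
        // A missing `>` means the header is malformed: give up.
        let close = after_open.find('>')?;
        let url = &after_open[..close];
        let tail = &after_open[close + 1..];

        // Simplification: parameters run until the next `<` (or end of input).
        let params_end = tail.find('<').unwrap_or(tail.len());
        let params = &tail[..params_end];

        let is_next = params.split(';').any(|param| {
            // Drop the comma that separates this link-value from the next.
            let param = param.trim().trim_end_matches(',').trim();
            match param.split_once('=') {
                Some((key, value)) => {
                    key.trim().eq_ignore_ascii_case("rel")
                        && value
                            .trim()
                            .trim_matches('"')
                            // `rel` may carry several space-separated relation types.
                            .split_whitespace()
                            .any(|rel| rel.eq_ignore_ascii_case("next"))
                }
                None => false,
            }
        });
        if is_next {
            return Some(url.to_string());
        }
        rest = &tail[params_end..];
    }
    None
}
```

Because the function takes a plain string and returns an `Option`, it can be exercised with literal header values and no HTTP client, which is the testability property the commit message calls out.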
## Live verification on noah-Lambda-Vector RTX 4090

Before fix:

    $ apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet"
    ✓ 341 files in repo          ← truncated
    ✓ 0 files match include globs
    error: no files matched

After fix:

    $ apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet" --dry-run
    ✓ 921 files in repo          ← pagination working
    ✓ 9 files match include globs
    [dry-run 1/9] data/Python/train-00000-of-00009.parquet
    ... (9 lines)
    ✓ dry-run complete: 9 files matched (no downloads)
    $ ls -la /tmp/test-pull-dry/   # directory not even created

## Stack v2 unblocked

This unblocks MODEL-2 ship: the Stack v2 multi-billion-token Python corpus download (~25 GB across 9 shards) is the empirical lever past the 9.75 val_loss plateau established by 4× CSN-Python runs (memory: project_2026_04_27_session_complete_handoff.md).

Closes #1410.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes #1410.
Summary
Two correlated defects in `apr pull dataset` discovered during a Stack v2 Python pull on lambda-vector. Both fixed at root with provable-contract gates.

| Defect | Symptom | Fix |
| --- | --- | --- |
| A | `--dry-run` performed full downloads (57 MB written before abort during dataset probe) | Thread `dry_run: bool` through dispatch → `run_dataset`; print plan and return without creating output dir or downloading |
| B | `data/Python/*.parquet` matched 0 files even though the 9 shards exist (HF tree caps at 1000 entries; Python is past the cap alphabetically) | Follow `Link: <next>; rel="next"` until exhausted |

Live verification on noah-Lambda-Vector RTX 4090
Five Whys
See the commit message: root causes are traced for both defects (no falsifier covered the dry-run dataset path; Stack v2 is the first repo to exercise multi-page listing at scale).
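The multi-page listing case can be sketched without any network access by injecting the page fetcher. The Rust below is an assumption-laden illustration, not the code in this PR: `Page`, `list_all_paths`, the `fetch_page` closure, and the `MAX_PAGES` guard are all hypothetical names introduced here. In a real client the closure would issue the HTTP request and derive `next_url` from the `Link: rel="next"` response header.

```rust
/// One page of a listing: its paths plus the URL of the next page, if any.
/// (Hypothetical type for this sketch.)
pub struct Page {
    pub paths: Vec<String>,
    pub next_url: Option<String>,
}

/// Collect paths from every page, following `next_url` until it is `None`.
/// `fetch_page` stands in for the HTTP call so the loop is testable offline.
pub fn list_all_paths<F>(first_url: &str, mut fetch_page: F) -> Result<Vec<String>, String>
where
    F: FnMut(&str) -> Result<Page, String>,
{
    // Assumed safety cap so a server that keeps returning a next link
    // cannot make the loop run forever.
    const MAX_PAGES: usize = 10_000;

    let mut all = Vec::new();
    let mut next = Some(first_url.to_string());
    let mut pages = 0usize;

    while let Some(url) = next.take() {
        pages += 1;
        if pages > MAX_PAGES {
            return Err(format!("gave up after {} pages", MAX_PAGES));
        }
        let page = fetch_page(&url)?;
        all.extend(page.paths);
        // `None` here ends the loop: the listing is exhausted.
        next = page.next_url;
    }
    Ok(all)
}
```

A test can pass a closure that maps two or three fake URLs to canned pages and assert that paths from the last page are present, which is exactly the case a single-call listing misses.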
Contract gates
Adds 2 falsifiers to `apr-cli-pull-dataset-v1.yaml` at PARTIAL_ALGORITHM_LEVEL via 5 new unit tests on `parse_link_next_url`:

- `FALSIFY-PULL-DATASET-009`: dry-run on dataset path produces zero file writes
- `FALSIFY-PULL-DATASET-010`: HF tree listing follows pagination

Ship % update
Unblocks MODEL-2 ship — Stack v2 (~25 GB Python parquets) is the empirical lever past the 9.75 val_loss plateau. Once this lands, the standing pull command works:
Test plan
- `cargo test -p apr-cli --lib pull_dataset`: 10/10 pass (5 existing + 5 new)
- `--dry-run` on Stack v2 reports 9 matches, 0 disk writes

🤖 Generated with Claude Code
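The "0 disk writes" property of the dry-run path can also be checked in isolation. The following Rust sketch assumes a hypothetical function `run_dataset_sketch` whose name, signature, and injected `log` writer are inventions for illustration; it is not the `run_dataset` from this PR. It shows only the shape of the early return: print the plan in the "[dry-run i/N] <path>" format, then return before the output directory is created.

```rust
use std::io::Write;
use std::path::Path;

/// Hypothetical sketch of a dataset pull entry point with a dry-run branch.
/// `matched` is the already listed and filtered set of repo paths.
pub fn run_dataset_sketch<W: Write>(
    matched: &[String],
    out_dir: &Path,
    dry_run: bool,
    log: &mut W,
) -> std::io::Result<()> {
    if dry_run {
        let total = matched.len();
        for (i, path) in matched.iter().enumerate() {
            writeln!(log, "[dry-run {}/{}] {}", i + 1, total, path)?;
        }
        // Return before touching the filesystem: no directory, no downloads.
        return Ok(());
    }

    // Only the real path creates the output directory.
    std::fs::create_dir_all(out_dir)?;
    // The download loop is intentionally omitted from this sketch.
    Ok(())
}
```

Writing the plan to an injected `Write` rather than directly to stdout lets a test capture the output in a `Vec<u8>` and then assert that `out_dir` does not exist afterwards.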