
fix(apr-cli): pull dataset — honor --dry-run and follow HF tree pagination (#1410) #1411

Merged
noahgift merged 2 commits into main from fix/pull-dataset-dry-run-and-pagination
May 3, 2026

Conversation


@noahgift noahgift commented May 2, 2026

Contributor

Closes #1410.

Summary

Two correlated defects in apr pull dataset discovered during a Stack v2 Python pull on lambda-vector. Both fixed at root with provable-contract gates.

| Defect | Symptom | Fix |
|---|---|---|
| A | --dry-run performed full downloads (57 MB written before abort during dataset probe) | Thread dry_run: bool through dispatch → run_dataset; print plan and return without creating output dir or downloading |
| B | data/Python/*.parquet matched 0 files even though the 9 shards exist (HF tree caps at 1000 entries; Python is past the cap alphabetically) | Replace single-call listing with a loop that follows Link: <next>; rel="next" until exhausted |

Live verification on noah-Lambda-Vector RTX 4090

Before:
  ✓ 341 files in repo            ← truncated
  ✓ 0 files match include globs
  error: no files matched

After:
  ✓ 921 files in repo            ← pagination working
  ✓ 9 files match include globs
  [dry-run 1/9] data/Python/train-00000-of-00009.parquet
  ... (9 lines)
  ✓ dry-run complete: 9 files matched (no downloads)
  $ ls /tmp/test-pull-dry/        # directory not even created

Five Whys

See commit message — root-causes traced for both defects (no falsifier covering dry-run dataset path; Stack v2 is the first repo to exercise multi-page listing at scale).

Contract gates

Adds 2 falsifiers to apr-cli-pull-dataset-v1.yaml at PARTIAL_ALGORITHM_LEVEL via 5 new unit tests on parse_link_next_url:

  • FALSIFY-PULL-DATASET-009 — dry-run on dataset path produces zero file writes
  • FALSIFY-PULL-DATASET-010 — HF tree listing follows pagination

Ship % update

Unblocks MODEL-2 ship — Stack v2 (~25 GB Python parquets) is the empirical lever past the 9.75 val_loss plateau. Once this lands, the standing pull command works:

apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet" -o /mnt/nvme-raid0/datasets/the-stack-v2

Test plan

  • cargo test -p apr-cli --lib pull_dataset — 10/10 pass (5 existing + 5 new)
  • Live --dry-run on Stack v2 reports 9 matches, 0 disk writes
  • Live pagination sees 921 files vs. previous 341
  • CI green (gate + workspace-test)

🤖 Generated with Claude Code

…ation (#1410)

Two correlated defects in `apr pull dataset` discovered during Stack v2
Python pull on noah-Lambda-Vector. Both fixed at root.

## Defect A — `--dry-run` ignored on dataset path

Root cause: `dispatch.rs` model-vs-dataset fork forwarded `dry_run` only
to the model puller; `pull_dataset::run_dataset` never accepted the flag.
Result: `apr pull dataset --dry-run` performed full downloads in
violation of the documented "no network I/O" contract — observed 57 MB
hit disk before manual abort during a "probe" of bigcode/the-stack-v2.

Fix: thread `dry_run: bool` through to `run_dataset`, list+filter, then
print "[dry-run i/N] <path>" lines and return WITHOUT creating the
output directory or invoking the download loop.
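
The control flow described in the fix can be sketched as follows. This is a minimal illustration, not the actual `run_dataset` code: `run_dataset_sketch`, `PullOutcome`, and the injected `download` closure are hypothetical names introduced here, and the listing and filtering step is assumed to have already produced `matched`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical outcome type for this sketch; not the real apr-cli API.
#[derive(Debug, PartialEq)]
pub enum PullOutcome {
    /// Dry run: number of files that would have been downloaded.
    Planned(usize),
    /// Real run: number of files handed to the downloader.
    Downloaded(usize),
}

/// Sketch of a dataset pull that honors `dry_run`.
/// `matched` is the already listed and filtered set of repo paths;
/// `download` stands in for whatever performs the real transfer.
pub fn run_dataset_sketch<F>(
    matched: &[String],
    out_dir: &Path,
    dry_run: bool,
    mut download: F,
) -> io::Result<PullOutcome>
where
    F: FnMut(&str, &Path) -> io::Result<()>,
{
    if dry_run {
        // Print the plan and return before touching the filesystem.
        for (i, path) in matched.iter().enumerate() {
            println!("[dry-run {}/{}] {}", i + 1, matched.len(), path);
        }
        return Ok(PullOutcome::Planned(matched.len()));
    }

    // Only the real path creates the output directory and downloads.
    fs::create_dir_all(out_dir)?;
    for path in matched {
        download(path, out_dir)?;
    }
    Ok(PullOutcome::Downloaded(matched.len()))
}
```

Placing the early return ahead of `create_dir_all` is what makes "zero file writes" checkable from the outside: a test can assert that the output directory does not exist after a dry run.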

## Defect B — HF tree API silently truncated at 1000 entries

Root cause: `list_dataset_repo_files` issued a single `GET tree?recursive=1`
and treated the response as exhaustive. HF caps each page at 1000
entries and exposes pagination via `Link: <next-url>; rel="next"` header.
Datasets with >1000 paths (Stack v2 has ~5000+ across 658 languages)
were silently truncated. Symptom: `apr pull dataset bigcode/the-stack-v2
--include "data/Python/*.parquet"` matched 0 files because Python falls
alphabetically past the first 341 single-shard languages in the truncated
listing.

Fix: replace single-call list with a loop that follows `Link: rel="next"`
until exhausted. Added `parse_link_next_url` (pure RFC 5988 parser, 5
unit tests) so the pagination logic is testable without HTTP.
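
For illustration, a pure `Link` header parser along these lines might look like the sketch below. It is an assumption about the shape of `parse_link_next_url`, not the merged implementation; the helper `is_rel_next` is hypothetical, and the sketch does not handle a `<` inside a quoted parameter value.

```rust
/// Sketch of a Link-header parser: returns the target of the first
/// link-value whose `rel` parameter contains the token `next`.
/// Returns None if the header has no such link or is malformed.
pub fn parse_link_next_url(header: &str) -> Option<String> {
    let mut rest = header;
    while let Some(open) = rest.find('<') {
        let after_open = &rest[open + 1..];
        // An unterminated `<...` means the header is malformed.
        let close = after_open.find('>')?;
        let url = &after_open[..close];
        let tail = &after_open[close + 1..];
        // Parameters run until the next link-value begins (or the header ends).
        let params_end = tail.find('<').unwrap_or(tail.len());
        let params = &tail[..params_end];
        if params.split(';').any(is_rel_next) {
            return Some(url.to_string());
        }
        rest = &tail[params_end..];
    }
    None
}

/// Hypothetical helper: true if `param` is a `rel` parameter listing `next`.
fn is_rel_next(param: &str) -> bool {
    // Drop the comma that separates this link-value from the following one.
    let param = param.trim().trim_end_matches(',').trim();
    match param.split_once('=') {
        Some((name, value)) if name.trim().eq_ignore_ascii_case("rel") => value
            .trim()
            .trim_matches('"')
            .split_whitespace()
            .any(|token| token.eq_ignore_ascii_case("next")),
        _ => false,
    }
}
```

Because the function takes a `&str` and returns an `Option<String>`, it can be exercised with literal header strings and no HTTP client.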

## Five Whys (defect A — dry-run)

1. Why does `--dry-run` trigger downloads? — `run_dataset` doesn't
   accept a `dry_run` parameter.
2. Why doesn't it accept one? — Dispatch only forwards it on the model
   path; the dataset branch dropped it on the floor.
3. Why doesn't dispatch forward it? — No falsifier asserts "dry-run on
   dataset path produces zero file writes" — only the model code path
   is gated (FALSIFY-CRUX-A-01-001).
4. Why no such falsifier? — Original P1.1 (#164) focused on the
   functional shard-pattern download path; dry-run was tested only
   against the model branch and the dataset extension inherited only
   the happy-path tests.
5. Why not caught at runtime? — Until Stack v2, dataset pulls were
   small-scale (codeparrot/github-code-clean smoke). 25 GB Stack v2 is
   the first time dry-run was probed at scale; bug surfaced as 57 MB
   of unwanted writes before manual abort.

## Five Whys (defect B — pagination)

1. Why doesn't `data/Python/*.parquet` match? — The 9 Python shards
   are not in the file listing apr enumerates.
2. Why aren't they enumerated? — `list_dataset_repo_files` only sees
   the 341 files on HF's first page; later pages are never requested.
3. Why does HF cap at 1000? — `tree?recursive=1` is paginated with a
   1000-entry hard limit; pagination is via `Link: rel=next` header.
4. Why doesn't apr follow it? — Original implementation tested against
   codeparrot (440 files in single page); multi-page case never
   exercised at scale.
5. Why not caught? — No contract gate asserts "listing is complete for
   repos > 1000 paths". Symptom is silent — missing files just don't
   match `--include`, surfaces as "0 files match" with no upstream
   signal.
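
The multi-page case in the list above can be exercised without a network by injecting the page fetcher. The following is a hedged sketch rather than the code in `list_dataset_repo_files`: `Page`, `ListError`, `list_all_paths`, and the `max_pages` guard are all hypothetical, and `next_url` is assumed to be the already parsed `rel="next"` target of each response.

```rust
/// One page of a listing as seen by this sketch: the paths on the page
/// plus the already parsed `rel="next"` target, if any.
pub struct Page {
    pub paths: Vec<String>,
    pub next_url: Option<String>,
}

/// Hypothetical error type: either the fetcher failed, or the listing
/// did not terminate within the allowed number of pages.
#[derive(Debug, PartialEq)]
pub enum ListError<E> {
    Fetch(E),
    PageLimitExceeded(usize),
}

/// Follow `next_url` links until a page has none, collecting every path.
/// Hitting `max_pages` is reported as an error instead of returning a
/// silently truncated listing.
pub fn list_all_paths<F, E>(
    first_url: &str,
    max_pages: usize,
    mut fetch_page: F,
) -> Result<Vec<String>, ListError<E>>
where
    F: FnMut(&str) -> Result<Page, E>,
{
    let mut all = Vec::new();
    let mut next = Some(first_url.to_string());
    let mut pages = 0usize;

    while let Some(url) = next {
        if pages == max_pages {
            return Err(ListError::PageLimitExceeded(max_pages));
        }
        let page = fetch_page(&url).map_err(ListError::Fetch)?;
        all.extend(page.paths);
        next = page.next_url;
        pages += 1;
    }
    Ok(all)
}
```

The page limit is a design choice of this sketch: it bounds the loop against a server that keeps returning `next` links, while surfacing the condition loudly so that an incomplete listing cannot be mistaken for "0 files match".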

## Contract additions

- FALSIFY-PULL-DATASET-009 — dry-run on dataset path produces zero
  file writes
- FALSIFY-PULL-DATASET-010 — HF tree listing follows pagination
  (>1000 paths visible for repos that have them)

Both bound at PARTIAL_ALGORITHM_LEVEL via unit tests in
`pull_dataset_tests` (5 new tests covering the Link header parser).
Full discharge requires live `apr pull dataset bigcode/the-stack-v2`
exit-0 + 9 file matches + 0 disk writes evidence (captured below).

## Live verification on noah-Lambda-Vector RTX 4090

Before fix:
  $ apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet"
  ✓ 341 files in repo  ← truncated
  ✓ 0 files match include globs
  error: no files matched

After fix:
  $ apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet" --dry-run
  ✓ 921 files in repo  ← pagination working
  ✓ 9 files match include globs
  [dry-run 1/9] data/Python/train-00000-of-00009.parquet
  ... (9 lines)
  ✓ dry-run complete: 9 files matched (no downloads)
  $ ls -la /tmp/test-pull-dry/   # directory not even created

## Stack v2 unblocked

This unblocks MODEL-2 ship: Stack v2 multi-billion-token Python corpus
download (~25 GB across 9 shards) is the empirical lever past the 9.75
val_loss plateau established by 4× CSN-Python runs (memory:
project_2026_04_27_session_complete_handoff.md).

Closes #1410.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 2, 2026 14:41
@noahgift noahgift merged commit 9bc0d4e into main May 3, 2026
10 checks passed
@noahgift noahgift deleted the fix/pull-dataset-dry-run-and-pagination branch May 3, 2026 06:18
Development

Successfully merging this pull request may close these issues.

apr pull dataset: --dry-run is ignored AND HF tree paginates silently (Stack v2 unpullable)
