apr pull dataset: --dry-run is ignored AND HF tree paginates silently (Stack v2 unpullable) #1410

@noahgift

Two correlated defects in `apr pull dataset`, discovered while pulling the Python shards of `bigcode/the-stack-v2`.

Defect A — `--dry-run` performs network I/O and disk writes (contract violation)

Reproduction:
```bash
apr pull dataset bigcode/the-stack-v2 --include "*" --dry-run
# Actually downloads 341 files. Wrote 57 MB to disk before being killed.
```

Expected: per the clap help text — `--dry-run` should "resolve short name to canonical URL and exit without performing any network I/O".

Actual: `crates/apr-cli/src/commands/pull_dataset.rs::run_dataset` has no `dry_run` parameter. The dispatch path for `model_ref == "dataset"` never threads the flag through, so dataset pulls always download regardless of `--dry-run`.

Five Whys:

  1. Why does `--dry-run` trigger downloads? — `run_dataset` doesn't accept a `dry_run` parameter.
  2. Why doesn't it accept one? — The dispatch site (in dispatch.rs) routes `model_ref == "dataset"` directly to `run_dataset` without forwarding `pull.dry_run`.
  3. Why doesn't dispatch forward it? — The contract (`apr-cli-pull-dataset-v1.yaml`) has no falsifier asserting "dry-run on dataset path produces zero files".
  4. Why no such falsifier? — The original PR (P1.1, task #164) focused on the basic shard-pattern download path; dry-run was tested only on the model code path (FALSIFY-CRUX-A-01-001).
  5. Why not caught at runtime? — Until now, dataset pulls were always small-scale (`codeparrot/github-code-clean` smoke). The 25 GB Stack v2 pull is the first time a dry-run probe was attempted at scale; the bug surfaced as 57 MB of unwanted disk writes.
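
To make the first two whys concrete, here is a minimal sketch of a dataset pull that accepts the flag and returns before touching the network or disk. It is not the actual `run_dataset`: the name `run_dataset_sketch`, the `PullReport` type, and the injected `download` closure are all hypothetical, chosen so the dry-run path can be shown never to call the downloader.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical report type: what matched and what was actually written.
#[derive(Debug, PartialEq)]
pub struct PullReport {
    pub matched: Vec<String>,
    pub written: Vec<PathBuf>,
}

/// Sketch of a dataset pull with `dry_run` threaded through.
/// `download` stands in for the real per-file fetch (an assumption made
/// for illustration); it is injected so the dry-run path is observable.
pub fn run_dataset_sketch<F>(
    files: &[String],
    output: &Path,
    dry_run: bool,
    mut download: F,
) -> std::io::Result<PullReport>
where
    F: FnMut(&str, &Path) -> std::io::Result<()>,
{
    let matched: Vec<String> = files.to_vec();
    let mut written = Vec::new();

    if dry_run {
        // Report what would be pulled, then return before any download.
        return Ok(PullReport { matched, written });
    }

    for file in &matched {
        let dest = output.join(file);
        download(file, &dest)?;
        written.push(dest);
    }
    Ok(PullReport { matched, written })
}
```

With `dry_run` set, the returned report still lists every matched file, but `written` stays empty and the closure is never invoked.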

Defect B — HF tree API silently truncates at 1000 entries

Reproduction:
```bash
curl -sS -H "Authorization: Bearer $HF_TOKEN" \
"https://huggingface.co/api/datasets/bigcode/the-stack-v2/tree/main?recursive=1" \
| jq 'length'
# 1000, but the dataset has thousands of paths
```

`apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet"` returns "0 files match" because Python's directory comes alphabetically after the 341 single-shard languages in the first page.

Expected: `apr` paginates until exhausted, sees all paths (estimated >5,000 file entries for Stack v2 across 658 languages, many with multiple parquets).

Actual: `crates/apr-cli/src/commands/pull_dataset.rs::list_dataset_repo_files` calls `tree?recursive=1` once and treats the response as the complete listing. The listing is silently truncated to the first 1000 entries.

Five Whys:

  1. Why doesn't `data/Python/*.parquet` match? — The 9 Python shards are not in the file listing apr enumerates.
  2. Why aren't they enumerated? — `list_dataset_repo_files` only sees the 341 files contained in HF's first page of 1000 tree entries.
  3. Why does HF cap at 1000? — `/api/datasets/<repo>/tree/<revision>?recursive=1` is a paginated endpoint with a hard limit of 1000 entries per response. Pagination is via the `Link` header (next URL with cursor).
  4. Why doesn't apr follow it? — The original implementation (P1.1) was tested against `codeparrot/github-code-clean` (440 files in a single page). The multi-page case was never exercised.
  5. Why not caught? — No contract falsifier asserts "dataset listing is exhaustive for repos > 1000 paths". Symptoms are silent: missing files just don't match `--include`, surface as "0 files match" with no upstream signal.
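
Since pagination is signalled through the `Link` header, the client has to pull the `rel="next"` target out of that header on every response. The following is a minimal sketch of that step using only the standard library. The function name `next_page_url` is hypothetical, and the parser is deliberately simplified: it assumes URLs contain no `,` or `;`, which a full RFC 8288 parser would not assume.

```rust
/// Extract the `rel="next"` target from an HTTP `Link` header value,
/// e.g. `<https://host/path?cursor=abc>; rel="next"`.
/// Returns `None` when no next page is advertised.
///
/// Simplification (assumption): link values are split on ',' and ';',
/// so URLs containing those characters are not handled.
pub fn next_page_url(link_header: &str) -> Option<String> {
    for part in link_header.split(',') {
        let mut pieces = part.split(';');
        // First piece is the `<url>` target, the rest are parameters.
        let target = pieces.next()?.trim();
        let is_next = pieces.any(|p| {
            let p = p.trim();
            p == "rel=\"next\"" || p == "rel=next"
        });
        if is_next {
            let url = target.strip_prefix('<')?.strip_suffix('>')?;
            return Some(url.to_string());
        }
    }
    None
}
```

A `None` result is the only signal that the listing is complete; a single-call implementation never looks at the header and so cannot distinguish a full listing from a truncated one.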

Severity

Both defects block MODEL-2 ship: Stack v2 is the corpus that empirically moves val_loss past the 9.75 plateau (per memory `project_2026_04_27_session_complete_handoff.md`).

Proposed fix

  1. Thread `dry_run` through dispatch → `run_dataset(repo, include, revision, output, dry_run)` and skip download loop when set.
  2. Replace the single-call `list_dataset_repo_files` with a paginated loop that follows the HF `Link: <url>; rel="next"` header until exhausted.
  3. Add 2 contract falsifiers to `apr-cli-pull-dataset-v1.yaml`:
    • `FALSIFY-PULL-DATASET-009` — dry-run path produces zero file writes
    • `FALSIFY-PULL-DATASET-010` — listing is complete for repos with > 1000 paths
  4. Live test: `apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet" --dry-run` reports 9 matches, no disk writes.
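
The paginated loop in step 2 and the completeness check in `FALSIFY-PULL-DATASET-010` could look roughly like the sketch below. Everything here is illustrative rather than taken from the `apr` codebase: `Page`, `list_all`, and the injected `fetch` closure are hypothetical names, and the HTTP call is abstracted away so the loop can be exercised against a fake multi-page source. The `max_pages` guard returns an error instead of truncating, so an oversized listing cannot fail silently.

```rust
/// One page of a listing: its entries plus the URL of the next page, if any.
/// (Hypothetical type for illustration.)
pub struct Page {
    pub entries: Vec<String>,
    pub next: Option<String>,
}

/// Follow `next` links until the source reports no further page.
/// `fetch` stands in for the HTTP request plus `Link` header parsing.
/// Exceeding `max_pages` is an error, never a silent truncation.
pub fn list_all<F>(
    first_url: &str,
    max_pages: usize,
    mut fetch: F,
) -> Result<Vec<String>, String>
where
    F: FnMut(&str) -> Result<Page, String>,
{
    let mut all = Vec::new();
    let mut url = Some(first_url.to_string());
    let mut pages = 0usize;

    while let Some(current) = url.take() {
        if pages == max_pages {
            return Err(format!(
                "listing exceeded {} pages; refusing to truncate",
                max_pages
            ));
        }
        let page = fetch(&current)?;
        all.extend(page.entries);
        url = page.next;
        pages += 1;
    }
    Ok(all)
}
```

A falsifier in this style would feed `list_all` a fake source with more than 1000 paths split across pages and assert that the returned listing has the full length.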

Workaround (until fixed)

Use the Hugging Face `huggingface_hub` Python library directly:
```bash
uv run --with huggingface_hub python -c '
from huggingface_hub import snapshot_download
snapshot_download(
    "bigcode/the-stack-v2",
    repo_type="dataset",
    allow_patterns="data/Python/*.parquet",
    local_dir="/mnt/nvme-raid0/datasets/the-stack-v2",
)'
```

(Per memory `feedback_stack_tool_extension_not_cli_shim.md`: this is exactly the kind of muda the rule forbids — fix `apr` instead.)

🤖 Generated with Claude Code

Labels: bug