apr pull dataset: --dry-run is ignored AND HF tree paginates silently (Stack v2 unpullable) #1410

## Two correlated defects in `apr pull dataset` discovered while pulling `bigcode/the-stack-v2` Python shards

### Defect A: `--dry-run` performs network I/O and disk writes (contract violation)

**Reproduction**:
```bash
apr pull dataset bigcode/the-stack-v2 --include "*" --dry-run
# Actually downloads 341 files; 57 MB were written to disk before the run was killed.
```

**Expected**: per the clap help text, `--dry-run` should "resolve short name to canonical URL and exit without performing any network I/O".

**Actual**: `crates/apr-cli/src/commands/pull_dataset.rs::run_dataset` has no `dry_run` parameter. The dispatch path for `model_ref == "dataset"` never threads the flag through, so dataset pulls always download, regardless of `--dry-run`.

**Five Whys**:
1. Why does `--dry-run` trigger downloads? `run_dataset` doesn't accept a `dry_run` parameter.
2. Why doesn't it accept one? The dispatch site (in `dispatch.rs`) routes `model_ref == "dataset"` directly to `run_dataset` without forwarding `pull.dry_run`.
3. Why doesn't dispatch forward it? The contract (`apr-cli-pull-dataset-v1.yaml`) has no falsifier asserting "dry-run on dataset path produces zero files".
4. Why no such falsifier? The original PR (P1.1, task #164) focused on the basic shard-pattern download path; dry-run was tested only on the model code path (FALSIFY-CRUX-A-01-001).
5. Why not caught at runtime? Until now, dataset pulls were always small-scale (the `codeparrot/github-code-clean` smoke test). The 25 GB Stack v2 pull is the first time a dry-run probe was attempted at scale; the bug surfaced as 57 MB of unwanted disk writes.
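
The intended control flow can be sketched as follows. This is a Python illustration for brevity, not the Rust in `pull_dataset.rs`: the signature, the injected `download` callable, and the shard name are assumptions made for the example. The only point it makes is that the `dry_run` check returns before any directory is created or any file is fetched, which is what a falsifier for the dataset path would need to observe.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def run_dataset(
    files: List[str],
    output: Path,
    download: Callable[[str, Path], None],
    dry_run: bool = False,
) -> List[str]:
    """Return the planned file list; touch disk only when dry_run is False."""
    if dry_run:
        # Report the plan and return before creating directories or downloading.
        for path in files:
            print(f"[dry-run] would download {path}")
        return files

    output.mkdir(parents=True, exist_ok=True)
    for path in files:
        download(path, output / path)
    return files


# Falsifier-style check: a dry run must not call the downloader or write files.
calls: List[str] = []


def fake_download(path: str, dest: Path) -> None:
    """Stand-in downloader that records the call and writes a tiny file."""
    calls.append(path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(b"shard")


shards = ["data/Python/train-00000.parquet"]  # hypothetical shard name
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "the-stack-v2"
    planned = run_dataset(shards, out, fake_download, dry_run=True)
    dry_run_calls = len(calls)
    dry_run_wrote = out.exists()

    run_dataset(shards, out, fake_download, dry_run=False)
    real_run_wrote = (out / shards[0]).exists()
```

Passing the downloader in as a parameter is a choice made for this sketch so the check runs offline; the real fix only needs the flag forwarded from dispatch.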

### Defect B: HF tree API silently truncates at 1000 entries

**Reproduction**:
```bash
curl -sS -H "Authorization: Bearer $HF_TOKEN" \
  "https://huggingface.co/api/datasets/bigcode/the-stack-v2/tree/main?recursive=1" \
  | jq 'length'
# 1000, but the dataset has thousands of paths
```

`apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet"` returns "0 files match" because the Python directory sorts alphabetically after the 341 single-shard languages returned in the first page, so its shards never appear in the listing.

**Expected**: `apr` paginates until the listing is exhausted and sees all paths (estimated >5,000 file entries for Stack v2 across 658 languages, many with multiple Parquet files).

**Actual**: `crates/apr-cli/src/commands/pull_dataset.rs::list_dataset_repo_files` calls `tree?recursive=1` once and treats the response as the complete listing. The listing is silently truncated to the first 1000 entries.
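
The symptom can be reproduced offline with a toy listing. In the sketch below the shard paths are invented, the page limit is shrunk from 1000 to 3 so the effect is visible, and Python's `fnmatch` stands in for whatever glob matcher `apr` actually uses (an assumption; note that `fnmatch`'s `*` also matches `/`). A pattern that is perfectly valid finds nothing when only the first page is inspected.

```python
from fnmatch import fnmatchcase
from typing import List

PAGE_LIMIT = 3  # stand-in for the 1000-entry cap, shrunk for illustration

# Hypothetical shard paths, sorted alphabetically like the tree listing.
full_listing = sorted([
    "data/Ada/train-00000.parquet",
    "data/C/train-00000.parquet",
    "data/Go/train-00000.parquet",
    "data/Python/train-00000.parquet",
    "data/Python/train-00001.parquet",
    "data/Rust/train-00000.parquet",
])

# What a single, unpaginated request would return.
first_page_only = full_listing[:PAGE_LIMIT]


def matches(paths: List[str], pattern: str) -> List[str]:
    """Return the paths that match a shell-style glob pattern."""
    return [p for p in paths if fnmatchcase(p, pattern)]


pattern = "data/Python/*.parquet"
truncated_hits = matches(first_page_only, pattern)
complete_hits = matches(full_listing, pattern)

print(f"truncated listing: {len(truncated_hits)} files match")
print(f"complete listing:  {len(complete_hits)} files match")
```

Nothing in the truncated case distinguishes "no such files" from "files exist beyond the first page", which is why the failure is silent.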

**Five Whys**:
1. Why doesn't `data/Python/*.parquet` match? The 9 Python shards are not in the file listing `apr` enumerates.
2. Why aren't they enumerated? `list_dataset_repo_files` only sees the 341 files present in the first page of the response (HF caps each page at 1000 entries).
3. Why does HF cap at 1000? `/api/datasets/<repo>/tree/<rev>?recursive=1` is a paginated endpoint with a hard 1000-entry per-response limit. Pagination is via the `Link` header (next URL with cursor).
4. Why doesn't `apr` follow it? The original implementation (P1.1) was tested against `codeparrot/github-code-clean` (440 files, which fit in a single page). The multi-page case was never exercised.
5. Why not caught? No contract falsifier asserts "dataset listing is exhaustive for repos > 1000 paths". The symptoms are silent: missing files simply fail to match `--include` and surface as "0 files match" with no upstream signal.
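
A minimal sketch of following the `Link` header until it is exhausted, again in Python rather than the project's Rust. The function names, the `fetch` callable, the `example.invalid` URLs, and the `type`/`path` entry fields are assumptions chosen so the loop can run offline against canned pages; a real client would issue HTTP requests and should use a more thorough `Link` parser than this regex.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# One page of results: (entries, raw Link header value or None).
Page = Tuple[List[Dict], Optional[str]]

# Matches `<url>; rel="next"` without crossing into the next comma-separated link.
_NEXT_LINK = re.compile(r'<([^>]+)>\s*;[^,]*\brel="?next"?')


def next_url(link_header: Optional[str]) -> Optional[str]:
    """Extract the rel="next" target from a Link header, if present."""
    if not link_header:
        return None
    match = _NEXT_LINK.search(link_header)
    return match.group(1) if match else None


def list_all_entries(first_url: str, fetch: Callable[[str], Page]) -> List[Dict]:
    """Collect entries from every page, following rel="next" links."""
    entries: List[Dict] = []
    seen = set()
    url: Optional[str] = first_url
    while url is not None and url not in seen:
        seen.add(url)  # guard against a server that repeats a cursor
        page_entries, link_header = fetch(url)
        entries.extend(page_entries)
        url = next_url(link_header)
    return entries


# Offline stand-in for the endpoint: three canned pages keyed by URL.
BASE = "https://example.invalid/tree/main?recursive=1"
PAGES: Dict[str, Page] = {
    BASE: (
        [{"type": "file", "path": "data/Ada/a.parquet"}],
        f'<{BASE}&cursor=p2>; rel="next"',
    ),
    f"{BASE}&cursor=p2": (
        [{"type": "directory", "path": "data/Python"}],
        f'<{BASE}&cursor=p3>; rel="next"',
    ),
    f"{BASE}&cursor=p3": (
        [{"type": "file", "path": "data/Python/b.parquet"}],
        None,  # no Link header on the last page
    ),
}

all_entries = list_all_entries(BASE, PAGES.__getitem__)
file_paths = [e["path"] for e in all_entries if e["type"] == "file"]
print(file_paths)
```

The loop terminates when a response carries no `rel="next"` link, so a listing-completeness falsifier could assert that the number of collected entries exceeds the per-page cap for a repo known to be larger than one page.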

### Severity

Both defects block shipping MODEL-2: Stack v2 is the corpus that empirically moves val_loss past the 9.75 plateau (per memory `project_2026_04_27_session_complete_handoff.md`).

### Proposed fix

1. Thread `dry_run` through dispatch → `run_dataset(repo, include, revision, output, dry_run)` and skip the download loop when it is set.
2. Replace the single-call `list_dataset_repo_files` with a paginated loop that follows the HF `Link: <next>; rel="next"` header until exhausted.
3. Add two contract falsifiers to `apr-cli-pull-dataset-v1.yaml`:
   - `FALSIFY-PULL-DATASET-009`: dry-run path produces zero file writes
   - `FALSIFY-PULL-DATASET-010`: listing is complete for repos with > 1000 paths
4. Live test: `apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet" --dry-run` reports 9 matches and performs no disk writes.

### Workaround (until fixed)

Use the Hugging Face `huggingface_hub` Python library directly:
```bash
uv run --with huggingface_hub python -c '
from huggingface_hub import snapshot_download
snapshot_download("bigcode/the-stack-v2", repo_type="dataset",
                  allow_patterns="data/Python/*.parquet",
                  local_dir="/mnt/nvme-raid0/datasets/the-stack-v2")'
```

(Per memory `feedback_stack_tool_extension_not_cli_shim.md`: this is exactly the kind of muda the rule forbids; fix `apr` instead.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
