Two correlated defects in `apr pull dataset` discovered while pulling `bigcode/the-stack-v2` Python shards
Defect A: `--dry-run` performs network I/O and disk writes (contract violation)
Reproduction:
```bash
apr pull dataset bigcode/the-stack-v2 --include "*" --dry-run
# Actually downloads 341 files. Wrote 57 MB to disk before being killed.
```
Expected: per the clap help text, `--dry-run` should "resolve short name to canonical URL and exit without performing any network I/O".
Actual: `crates/apr-cli/src/commands/pull_dataset.rs::run_dataset` has no `dry_run` parameter. The dispatch path for `model_ref == "dataset"` never threads the flag through, so dataset pulls always download regardless of `--dry-run`.
Five Whys:
1. Why does `--dry-run` trigger downloads? `run_dataset` doesn't accept a `dry_run` parameter.
2. Why doesn't it accept one? The dispatch site (in dispatch.rs) routes `model_ref == "dataset"` directly to `run_dataset` without forwarding `pull.dry_run`.
3. Why doesn't dispatch forward it? The contract (`apr-cli-pull-dataset-v1.yaml`) has no falsifier asserting "dry-run on dataset path produces zero files".
4. Why no such falsifier? The original PR (P1.1, task "`apr convert` doesn't work for gguf format" #164) focused on the basic shard-pattern download path; dry-run was tested only on the model code path (FALSIFY-CRUX-A-01-001).
5. Why not caught at runtime? Until now, dataset pulls were always small-scale (`codeparrot/github-code-clean` smoke). The 25 GB Stack v2 pull is the first time a dry-run probe was attempted at scale; the bug surfaced as 57 MB of unwanted disk writes.
Defect B: HF tree API silently truncates at 1000 entries
Reproduction:
```bash
curl -sS -H "Authorization: Bearer $HF_TOKEN" \
  "https://huggingface.co/api/datasets/bigcode/the-stack-v2/tree/main?recursive=1" \
  | jq 'length'
# 1000, but the dataset has thousands of paths
```
`apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet"` returns "0 files match" because Python's directory comes alphabetically after the 341 single-shard languages in the first page.
Expected: `apr` paginates until exhausted, sees all paths (estimated >5,000 file entries for Stack v2 across 658 languages, many with multiple parquets).
Actual: `crates/apr-cli/src/commands/pull_dataset.rs::list_dataset_repo_files` calls `tree?recursive=1` once and treats the response as the complete listing. The listing is silently truncated to the first 1000 entries.
Five Whys:
1. Why doesn't `data/Python/*.parquet` match? The 9 Python shards are not in the file listing `apr` enumerates.
2. Why aren't they enumerated? `list_dataset_repo_files` only sees the 341 files in HF's first page.
3. Why does HF cap at 1000? `/api/datasets/<repo>/tree/<revision>?recursive=1` is a paginated endpoint with a hard 1000-entry per-response limit. Pagination is via the `Link` header (next URL with cursor).
4. Why doesn't `apr` follow it? The original implementation (P1.1) was tested against `codeparrot/github-code-clean` (440 files in a single page). The multi-page case was never exercised.
5. Why not caught? No contract falsifier asserts "dataset listing is exhaustive for repos > 1000 paths". Symptoms are silent: missing files just don't match `--include` and surface as "0 files match" with no upstream signal.
Severity
Both defects block MODEL-2 ship: Stack v2 is the corpus that empirically moves val_loss past the 9.75 plateau (per memory `project_2026_04_27_session_complete_handoff.md`).
Proposed fix
1. Thread `dry_run` through dispatch → `run_dataset(repo, include, revision, output, dry_run)` and skip the download loop when set.
2. Replace the single-call `list_dataset_repo_files` with a paginated loop following the HF `Link: <url>; rel="next"` header until exhausted.
3. Add 2 contract falsifiers to `apr-cli-pull-dataset-v1.yaml`:
   - `FALSIFY-PULL-DATASET-009`: dry-run path produces zero file writes
   - `FALSIFY-PULL-DATASET-010`: listing is complete for repos with > 1000 paths
4. Live test: `apr pull dataset bigcode/the-stack-v2 --include "data/Python/*.parquet" --dry-run` reports 9 matches, no disk writes.
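The `dry_run` threading in the proposed fix could look roughly like the sketch below. It is not the actual `run_dataset` from `pull_dataset.rs`: the `PullReport` type, the `matched` argument (standing in for the already-filtered listing), and the `download` callback are all assumptions made so the example is self-contained and free of network I/O.

```rust
use std::path::{Path, PathBuf};

/// Outcome of a dataset pull; `written` stays empty on a dry run.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, Default, PartialEq)]
pub struct PullReport {
    pub matched: Vec<String>,
    pub written: Vec<PathBuf>,
}

/// Hypothetical shape of `run_dataset` once `dry_run` is threaded through.
/// `matched` stands in for the filtered file listing and `download` for
/// whatever fetches a single file; both are assumptions of this sketch.
pub fn run_dataset<F>(
    matched: Vec<String>,
    output: &Path,
    dry_run: bool,
    mut download: F,
) -> Result<PullReport, String>
where
    F: FnMut(&str, &Path) -> Result<(), String>,
{
    let mut report = PullReport { matched, written: Vec::new() };
    if dry_run {
        // Report what would be fetched, then return before any download.
        for path in &report.matched {
            println!("would download {path}");
        }
        return Ok(report);
    }
    for path in &report.matched {
        let dest = output.join(path);
        download(path.as_str(), dest.as_path())?;
        report.written.push(dest);
    }
    Ok(report)
}
```

Passing the downloader in as a callback is a design choice for the sketch: a falsifier along the lines of `FALSIFY-PULL-DATASET-009` can then assert that the callback is never invoked when `dry_run` is set.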
Workaround (until fixed)
Use HF Python `huggingface_hub` directly:
```bash
uv run --with huggingface_hub python -c '
from huggingface_hub import snapshot_download
snapshot_download("bigcode/the-stack-v2", repo_type="dataset",
allow_patterns="data/Python/*.parquet",
local_dir="/mnt/nvme-raid0/datasets/the-stack-v2")'
```
(Per memory `feedback_stack_tool_extension_not_cli_shim.md`: this is exactly the kind of muda the rule forbids — fix `apr` instead.)
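For the pagination half of the fix in `apr`, a minimal sketch of following `Link: <url>; rel="next"` until the listing is exhausted is shown here. The `Page` type, `next_link`, `list_all`, and the `fetch` callback are hypothetical names, not the existing `list_dataset_repo_files` code. The header parser is deliberately simple and assumes the next URL contains no commas.

```rust
/// One page of a listing: the entries plus the raw `Link` header, if present.
/// Hypothetical type, introduced only for this sketch.
pub struct Page {
    pub paths: Vec<String>,
    pub link: Option<String>,
}

/// Extract the `rel="next"` target from an HTTP `Link` header value, if any.
/// Simplification: splits on ',' and so assumes URLs contain no commas.
pub fn next_link(header: &str) -> Option<String> {
    header.split(',').find_map(|part| {
        let (target, params) = part.split_once(';')?;
        let is_next = params
            .split(';')
            .any(|p| p.trim().eq_ignore_ascii_case("rel=\"next\""));
        if !is_next {
            return None;
        }
        let url = target.trim().strip_prefix('<')?.strip_suffix('>')?;
        Some(url.to_string())
    })
}

/// Collect every path by requesting pages until no `rel="next"` link remains.
/// `fetch` stands in for the HTTP call and is an assumption of this sketch.
pub fn list_all<F>(first_url: &str, mut fetch: F) -> Result<Vec<String>, String>
where
    F: FnMut(&str) -> Result<Page, String>,
{
    let mut all = Vec::new();
    let mut url = Some(first_url.to_string());
    while let Some(current) = url {
        let page = fetch(current.as_str())?;
        all.extend(page.paths);
        // Stop when the response carries no next-page link.
        url = page.link.as_deref().and_then(next_link);
    }
    Ok(all)
}
```

Keeping the HTTP call behind `fetch` lets a falsifier such as `FALSIFY-PULL-DATASET-010` feed in synthetic multi-page responses and check that the listing is complete, without touching the network.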
🤖 Generated with Claude Code