
Streamline and aggregate file updates for reporting to python #340

Merged
hoytak merged 9 commits into main from hoytak/250516-aggregate-updates
May 20, 2025

Conversation

@hoytak (Collaborator) commented May 16, 2025

With the current incremental progress updates, the number of updates sent to Python is substantial, and each one has to acquire the global GIL lock. This noticeably reduces upload speed on fast connections.

This PR introduces an intermediate aggregation class that cheaply accumulates all incoming progress updates, then sends the aggregated update list on to Python once every 200 ms. This eliminates the thread contention caused by the frequent incremental updates while still reporting accurate progress to the user.
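
The idea can be sketched in plain Python (an illustrative model, not the actual hf_xet `AggregatingProgressUpdater` API — class and method names here are invented): increments are accumulated under a cheap local lock, and the expensive downstream callback fires at most once per flush interval.

```python
import threading
import time
from collections import defaultdict

class AggregatingUpdater:
    """Sketch of a time-based aggregating progress updater: buffers
    per-item byte counts and forwards a batched dict to a downstream
    callback at most once per flush interval."""

    def __init__(self, flush_callback, flush_interval=0.2):
        self._flush_callback = flush_callback
        self._flush_interval = flush_interval
        self._lock = threading.Lock()
        self._pending = defaultdict(int)
        self._last_flush = time.monotonic()

    def update(self, item, n_bytes):
        # Hot path: just accumulate under a local lock; no downstream call.
        with self._lock:
            self._pending[item] += n_bytes
            now = time.monotonic()
            if now - self._last_flush < self._flush_interval:
                return
            batch, self._pending = dict(self._pending), defaultdict(int)
            self._last_flush = now
        # Expensive downstream call (e.g. one that must take the GIL)
        # happens at most once per interval, outside the lock.
        self._flush_callback(batch)

    def flush(self):
        # Drain any remaining buffered updates (e.g. at end of transfer).
        with self._lock:
            batch, self._pending = dict(self._pending), defaultdict(int)
        if batch:
            self._flush_callback(batch)
```

The key design point is that the per-update cost is one short critical section on a private lock; the global resource (the GIL, in the real code) is only touched on flush.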

@hoytak hoytak requested review from assafvayner and seanses May 16, 2025 20:41
@hoytak (Collaborator, Author) commented May 16, 2025

Note that this depends on #328; will update this PR once that is merged.

@assafvayner (Contributor) left a comment

Can you provide a sample python snippet of how to use the new constructs?

@hoytak hoytak merged commit 4faec0b into main May 20, 2025
4 checks passed
@hoytak hoytak deleted the hoytak/250516-aggregate-updates branch May 20, 2025 18:35
rajatarya pushed a commit that referenced this pull request Feb 27, 2026
## Summary

Wrap download progress updaters in `AggregatingProgressUpdater` to
eliminate GIL contention when Python callers provide per-file progress
callbacks.

The upload path has had this aggregation since v1.1.3 (PR #340), but the
download path was missed. Without aggregation, each XORB chunk triggers
a `spawn_blocking` + `Python::with_gil()` callback. With many concurrent
file downloads, this causes severe GIL contention — measured as a **4x
throughput reduction** (3000 MB/s → 750 MB/s on a 25 Gbps link).

The fix wraps the caller-provided `TrackingProgressUpdater` in an
`AggregatingProgressUpdater` (200ms flush interval) inside
`download_file_with_updater()`, matching the pattern already used by
`FileUploadSession`. This reduces Python callback frequency from
thousands/sec to ~5/sec per file while preserving progress bar feedback.
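
A quick sanity check on the rates quoted above (a sketch; the 64 KiB chunk size is an assumption for illustration, not the actual xet chunk size):

```python
# With a 200 ms flush interval, each file flushes at most once per interval.
flush_interval_s = 0.2
aggregated_rate = 1 / flush_interval_s  # callbacks/sec per file
assert aggregated_rate == 5.0           # the "~5/sec per file" figure

# Without aggregation, callbacks scale with throughput. At the degraded
# 750 MB/s and an assumed 64 KiB chunk size:
throughput_bytes_per_s = 750 * 1024 * 1024
chunk_bytes = 64 * 1024
raw_rate = throughput_bytes_per_s / chunk_bytes
assert raw_rate == 12000.0  # "thousands/sec", consistent with the claim
```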

## Root cause

When `huggingface_hub` calls `hf_xet.download_files()`, it passes a
per-file Python callback for progress bar updates. On the Rust side,
each callback invocation goes through:

```
report_bytes_written() / report_transfer_progress()
  → tokio::spawn(register_updates())
    → spawn_blocking(Python::with_gil(callback))
```

With the detailed download progress tracking added in PR #645 (hf-xet
v1.3.0), both `report_bytes_written` and `report_transfer_progress` fire
per chunk, roughly doubling callback frequency. With 8+ concurrent file
downloads, each spawning dozens of concurrent XORB streams, the GIL
becomes a severe bottleneck.
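
The serialization effect can be mimicked in plain Python with an ordinary lock standing in for the GIL (a toy model; the real path goes through `spawn_blocking` and `Python::with_gil`): when every per-chunk callback must hold one global lock, total wall-clock time grows with the total number of callbacks, regardless of how parallel the downloads are.

```python
import threading
import time

def serialized_callback_time(n_threads, callbacks_per_thread, hold_s):
    """Toy model of GIL contention: each 'callback' holds a single
    global lock for hold_s seconds; returns total wall-clock time."""
    gil = threading.Lock()  # stand-in for the GIL

    def worker():
        for _ in range(callbacks_per_thread):
            with gil:           # every progress callback serializes here
                time.sleep(hold_s)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    t0 = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - t0

# Because all lock holds serialize, 8 threads take roughly 8x as long
# as 1 thread for the same per-thread callback count.
```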

## History

The problem has existed since xet download support was introduced, but
worsened over time:

| Version | Date | Impact |
|---------|------|--------|
| `huggingface_hub v0.30.0` / `hf-xet 0.1.x` | Mar 2025 | Moderate — synchronous `with_gil()` per chunk, but hf_xet was an optional extra |
| `huggingface_hub v0.31.0` / `hf-xet >=1.1.0` | May 2025 | Moderate — hf-xet became a hard dependency on x86_64/arm64 |
| `hf-xet v1.1.3` | Jun 2025 | Upload path fixed with `AggregatingProgressUpdater` (PR #340); download path left unprotected |
| `hf-xet v1.3.0` | Feb 2026 | **Severe** — PR #645 added detailed per-chunk progress tracking to downloads, doubling callback frequency without aggregation |

PR #340 explicitly noted: *"each [update] has to acquire a global GIL
lock. This negatively affects the upload speed on fast connections"* —
the same problem, but only the upload side was addressed.

## Benchmarks

Downloading 3 safetensors files (16.1 GB total) from
`Qwen/Qwen3.5-35B-A3B` on a 25 Gbps machine:

| Test | Before | After |
|------|--------|-------|
| `download_files()` with `progress_updater=None` (baseline) | 3119 MB/s | 3119 MB/s |
| `download_files()` with per-file Python callbacks | **746 MB/s** | **1789 MB/s** |
| `snapshot_download()` (full Python CLI path with tqdm) | ~750 MB/s | **2395 MB/s** |

Progress callback overhead drops from **4x slowdown to <1%**.
