
fix: prevent download stall on large file reconstruction #698

Merged

hoytak merged 2 commits into main from fix/reconstruction-download-stall
Mar 11, 2026

Conversation

@XciD (Member) commented Mar 11, 2026

Summary

Fixes download stalls/deadlocks on large file reconstruction (reported on 48.5 GB GGUF files). The root cause is a circular dependency: the main reconstruction loop holds a buffer semaphore permit while blocking on CAS connection permit acquisition, and xorb write locks held during HTTP downloads cause CAS permit starvation.

Changes

  1. Single-flight xorb downloads via OnceCell (xorb_block.rs): replaces RwLock<Option<...>> with tokio::sync::OnceCell. Only one task per xorb block acquires a CAS permit and downloads the data; concurrent callers wait on the same result without acquiring permits or duplicating work. This eliminates duplicate downloads, prevents double-counted transfer progress, and avoids a failing duplicate from killing the reconstruction.

  2. Decouple CAS permit from buffer permit (file_term.rs): the main loop no longer blocks on CAS permits while holding a buffer permit. The spawned download task delegates to retrieve_data which handles permit acquisition internally via the OnceCell single-flight. This breaks the circular dependency that causes stalls.

  3. Improve error propagation (sequential_writer.rs): when the background writer channel closes, check RunState for the original error before returning a generic "channel closed" message.

Root cause

The reconstruction pipeline has three resource pools: buffer permits (bounded semaphore), CAS download permits (64 concurrent), and per-xorb write locks.

Before this fix, the main loop would:

  1. Acquire a buffer permit (blocking if buffer full)
  2. Call get_data_task() which acquires a CAS permit (blocking if pool exhausted)
  3. Inside retrieve_data(), hold a write lock during the entire HTTP download

This creates two deadlock vectors:

  • Buffer vs CAS: buffer fills up with terms waiting for CAS permits, but CAS permits are held by tasks blocked behind xorb write locks, and the writer can't drain the buffer because it's waiting for those tasks
  • CAS vs write lock: multiple tasks sharing the same xorb each hold a CAS permit while blocked on the write lock, starving other xorbs of permits

Reproduction

Reliably reproducible with small buffer:

HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE=64mb \
HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_LIMIT=64mb \
python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download('unsloth/Qwen3-Coder-Next-GGUF', 'Qwen3-Coder-Next-Q4_K_M.gguf', local_dir='/tmp/test')"
  • Before fix: stalls at ~3.4 GB, no progression (deadlock)
  • After fix: continuous progression, completes successfully

With default buffer (2 GB), the stall is intermittent depending on network speed (consistently reproduced on slower connections).

XciD added 2 commits March 11, 2026 13:37
Move CAS download permit acquisition from the main reconstruction loop
into the spawned download task. Previously, the main loop acquired the
permit while holding a buffer permit, coupling buffer backpressure with
CAS connection concurrency. When the buffer filled up, the main loop
could block on CAS permit acquisition indefinitely, stalling the entire
download pipeline.

Additionally, release the xorb write lock before performing the HTTP
download in retrieve_data(). The old code held the write lock during the
entire download, causing other tasks sharing the same xorb to queue up
while holding CAS permits idle. This trades occasional duplicate
downloads for much better CAS permit utilization.

Also improve error propagation in sequential_writer: when the background
writer channel closes, check RunState for the original error before
returning a generic "channel closed" message.

Reproduces deterministically with small buffer sizes
(HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE=64mb) on large files
(e.g. 48.5 GB GGUF). With default buffer (2 GB), the stall is
intermittent depending on network speed.

Tested: 117 unit tests pass, verified fix on unsloth/Qwen3-Coder-Next
Q4_K_M.gguf (48.5 GB, 2366 terms, 932 xorbs).
Replace the RwLock-based caching with tokio::sync::OnceCell, ensuring
only one task per xorb block acquires a CAS permit and downloads the
data. Concurrent callers wait on the same result without acquiring
permits or duplicating work.

This addresses the Codex review feedback:
- Eliminates duplicate downloads (bandwidth, CPU, memory waste)
- Fixes progress accounting (no double-counted transfer bytes)
- Prevents a failing duplicate from killing reconstruction when
  another succeeds
- CAS permit acquisition moves inside the OnceCell init, so waiters
  never hold idle permits
@XciD XciD marked this pull request as ready for review March 11, 2026 13:21
@XciD XciD requested a review from hoytak March 11, 2026 14:07
@hoytak (Collaborator) left a comment


Excellent catch here; good fix.

@hoytak hoytak merged commit 9ba5fb3 into main Mar 11, 2026
7 checks passed
@hoytak hoytak deleted the fix/reconstruction-download-stall branch March 11, 2026 15:37
@seanses (Collaborator) commented Mar 11, 2026

Sorry I'm late to the PR, trying to understand the "deadlock" and wanted to double check if my understanding is correct. This is my interaction with Claude:

Your understanding is correct for a single xorb
If only the "N tasks waiting on the write lock for xorb X" problem existed in isolation, it's a throughput degradation, not a deadlock. Eventually the one task that holds the write lock finishes downloading, the 63 waiters see the cached data and return, CAS permits are released, and everything proceeds.

The buffer makes it worse, but still not a strict deadlock
In the old code, get_data_task acquires the CAS permit before returning the DataFuture and before that future is registered with the writer. So the main loop's flow is:

  1. acquire buffer_permit ← blocks if buffer full
  2. await get_data_task() ← acquires CAS permit here, then returns DataFuture
  3. writer.enqueue(data_future, buffer_permit)
If the main loop is stuck waiting for a CAS permit, it's holding a buffer permit without having enqueued anything to the writer. The writer can't drain that slot (it was never registered). So buffer capacity is reduced by the number of terms stuck waiting for CAS permits — but since the main loop is sequential, that's at most 1 at a time.

What makes it an effective indefinite stall in practice
The real issue is the combination on slow connections:

Say buffer = 64 slots and CAS pool = 64 permits:

  • The main loop fills all 64 buffer slots with terms for the same large xorb X (common in GGUF files where one xorb block is referenced by hundreds of file terms). All 64 CAS permits are consumed.
  • 1 task holds the write lock and downloads xorb X over a slow connection (hundreds of MB, minutes to download)
  • 63 tasks hold CAS permits waiting for the write lock — doing nothing useful
  • The writer is waiting for all 64 futures to complete — it can't release buffer permits until they do
  • The main loop is blocked: the buffer is full, so no new work can be enqueued
The system isn't in a true circular deadlock — the write lock holder is making progress. But the effective parallelism has collapsed to 1 active download at any time, and that download may be large. On the slow connection used to reproduce the bug, this means minutes of no observable progress, which is indistinguishable from a deadlock to the user.

When the write lock holder eventually finishes, all 63 waiters immediately return (cached data), 64 buffer permits are released simultaneously, and the main loop floods the next 64 terms — many of which might again be for the same xorb, restarting the cycle.

Why the fix actually helps
The OnceCell single-flight means only 1 task ever acquires a CAS permit per xorb block. The other 63 tasks just wait on the OnceCell future without holding any permits. So:

  • CAS permits reflect actual concurrent downloads, not write-lock waiters
  • Buffer permits aren't wasted on tasks stuck waiting for CAS permits (since the permit is acquired inside the spawned task, not in the main loop)
  • True download parallelism is restored across different xorb blocks
So "deadlock" in the PR description is slightly loose — it's really a throughput collapse to near-zero caused by CAS permit starvation, which on a slow connection looks and feels like a deadlock.

