
fix: prevent download stall on large file reconstruction #698

Merged

hoytak merged 2 commits into main from fix/reconstruction-download-stall
Mar 11, 2026

Conversation

@XciD (Member) commented Mar 11, 2026

Summary

Fixes download stalls/deadlocks on large file reconstruction (reported on 48.5 GB GGUF files). The root cause is a circular dependency: the main reconstruction loop holds a buffer semaphore permit while blocking on CAS connection permit acquisition, and xorb write locks held during HTTP downloads cause CAS permit starvation.

Changes

  1. Single-flight xorb downloads via OnceCell (xorb_block.rs): replaces RwLock<Option<...>> with tokio::sync::OnceCell. Only one task per xorb block acquires a CAS permit and downloads the data; concurrent callers wait on the same result without acquiring permits or duplicating work. This eliminates duplicate downloads, prevents double-counted transfer progress, and avoids a failing duplicate from killing the reconstruction.

  2. Decouple CAS permit from buffer permit (file_term.rs): the main loop no longer blocks on CAS permits while holding a buffer permit. The spawned download task delegates to retrieve_data which handles permit acquisition internally via the OnceCell single-flight. This breaks the circular dependency that causes stalls.

  3. Improve error propagation (sequential_writer.rs): when the background writer channel closes, check RunState for the original error before returning a generic "channel closed" message.

Root cause

The reconstruction pipeline has three resource pools: buffer permits (bounded semaphore), CAS download permits (64 concurrent), and per-xorb write locks.

Before this fix, the main loop would:

  1. Acquire a buffer permit (blocking if buffer full)
  2. Call get_data_task() which acquires a CAS permit (blocking if pool exhausted)
  3. Inside retrieve_data(), hold a write lock during the entire HTTP download

This creates two deadlock vectors:

  • Buffer vs CAS: buffer fills up with terms waiting for CAS permits, but CAS permits are held by tasks blocked behind xorb write locks, and the writer can't drain the buffer because it's waiting for those tasks
  • CAS vs write lock: multiple tasks sharing the same xorb each hold a CAS permit while blocked on the write lock, starving other xorbs of permits

Reproduction

Reliably reproducible with small buffer:

HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE=64mb \
HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_LIMIT=64mb \
python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download('unsloth/Qwen3-Coder-Next-GGUF', 'Qwen3-Coder-Next-Q4_K_M.gguf', local_dir='/tmp/test')"
  • Before fix: stalls at ~3.4 GB, no progression (deadlock)
  • After fix: continuous progression, completes successfully

With default buffer (2 GB), the stall is intermittent depending on network speed (consistently reproduced on slower connections).

XciD added 2 commits March 11, 2026 13:37
Move CAS download permit acquisition from the main reconstruction loop
into the spawned download task. Previously, the main loop acquired the
permit while holding a buffer permit, coupling buffer backpressure with
CAS connection concurrency. When the buffer filled up, the main loop
could block on CAS permit acquisition indefinitely, stalling the entire
download pipeline.

Additionally, release the xorb write lock before performing the HTTP
download in retrieve_data(). The old code held the write lock during the
entire download, causing other tasks sharing the same xorb to queue up
while holding CAS permits idle. This trades occasional duplicate
downloads for much better CAS permit utilization.

Also improve error propagation in sequential_writer: when the background
writer channel closes, check RunState for the original error before
returning a generic "channel closed" message.

Reproduces deterministically with small buffer sizes
(HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE=64mb) on large files
(e.g. 48.5 GB GGUF). With default buffer (2 GB), the stall is
intermittent depending on network speed.

Tested: 117 unit tests pass, verified fix on unsloth/Qwen3-Coder-Next
Q4_K_M.gguf (48.5 GB, 2366 terms, 932 xorbs).
Replace the RwLock-based caching with tokio::sync::OnceCell, ensuring
only one task per xorb block acquires a CAS permit and downloads the
data. Concurrent callers wait on the same result without acquiring
permits or duplicating work.

This addresses the Codex review feedback:
- Eliminates duplicate downloads (bandwidth, CPU, memory waste)
- Fixes progress accounting (no double-counted transfer bytes)
- Prevents a failing duplicate from killing reconstruction when
  another succeeds
- CAS permit acquisition moves inside the OnceCell init, so waiters
  never hold idle permits
@XciD XciD marked this pull request as ready for review March 11, 2026 13:21
@XciD XciD requested a review from hoytak March 11, 2026 14:07
@hoytak (Collaborator) left a comment


Excellent catch here; good fix.

@hoytak hoytak merged commit 9ba5fb3 into main Mar 11, 2026
7 checks passed
@hoytak hoytak deleted the fix/reconstruction-download-stall branch March 11, 2026 15:37
@seanses (Collaborator) commented Mar 11, 2026

Sorry I'm late to the PR, trying to understand the "deadlock" and wanted to double check if my understanding is correct. This is my interaction with Claude:

Your understanding is correct for a single xorb
If only the "N tasks waiting on the write lock for xorb X" problem existed in isolation, it's a throughput degradation, not a deadlock. Eventually the one task that holds the write lock finishes downloading, the 63 waiters see the cached data and return, CAS permits are released, and everything proceeds.

The buffer makes it worse, but still not a strict deadlock
In the old code, get_data_task acquires the CAS permit before returning the DataFuture and before that future is registered with the writer. So the main loop's flow is:

  1. acquire buffer_permit ← blocks if buffer full
  2. await get_data_task() ← acquires CAS permit here, then returns DataFuture
  3. writer.enqueue(data_future, buffer_permit)
If the main loop is stuck waiting for a CAS permit, it's holding a buffer permit without having enqueued anything to the writer. The writer can't drain that slot (it was never registered). So buffer capacity is reduced by the number of terms stuck waiting for CAS permits — but since the main loop is sequential, that's at most 1 at a time.

What makes it an effective indefinite stall in practice
The real issue is the combination on slow connections:

Say buffer = 64 slots and CAS pool = 64 permits:

  • The main loop fills all 64 buffer slots with terms for the same large xorb X (common in GGUF files where one xorb block is referenced by hundreds of file terms). All 64 CAS permits are consumed.
  • 1 task holds the write lock and downloads xorb X over a slow connection (hundreds of MB, minutes to download)
  • 63 tasks hold CAS permits waiting for the write lock — doing nothing useful
  • The writer is waiting for all 64 futures to complete — it can't release buffer permits until they do
  • The main loop is blocked: the buffer is full, so no new work can be enqueued
The system isn't in a true circular deadlock — the write lock holder is making progress. But the effective parallelism has collapsed to 1 active download at any time, and that download may be large. On the slow connection used to reproduce the bug, this means minutes of no observable progress, which is indistinguishable from a deadlock to the user.

When the write lock holder eventually finishes, all 63 waiters immediately return (cached data), 64 buffer permits are released simultaneously, and the main loop floods the next 64 terms — many of which might again be for the same xorb, restarting the cycle.

Why the fix actually helps
The OnceCell single-flight means only 1 task ever acquires a CAS permit per xorb block. The other 63 tasks just wait on the OnceCell future without holding any permits. So:

  • CAS permits reflect actual concurrent downloads, not write-lock waiters
  • Buffer permits aren't wasted on tasks stuck waiting for CAS permits (since the permit is acquired inside the spawned task, not in the main loop)
  • True download parallelism is restored across different xorb blocks
So "deadlock" in the PR description is slightly loose — it's really a throughput collapse to near-zero caused by CAS permit starvation, which on a slow connection looks and feels like a deadlock.

