fix: prevent download stall on large file reconstruction #698
Move CAS download permit acquisition from the main reconstruction loop into the spawned download task. Previously, the main loop acquired the permit while holding a buffer permit, coupling buffer backpressure with CAS connection concurrency. When the buffer filled up, the main loop could block on CAS permit acquisition indefinitely, stalling the entire download pipeline.

Additionally, release the xorb write lock before performing the HTTP download in `retrieve_data()`. The old code held the write lock during the entire download, causing other tasks sharing the same xorb to queue up while holding CAS permits idle. This trades occasional duplicate downloads for much better CAS permit utilization.

Also improve error propagation in `sequential_writer`: when the background writer channel closes, check `RunState` for the original error before returning a generic "channel closed" message.

Reproduces deterministically with small buffer sizes (`HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE=64mb`) on large files (e.g. a 48.5 GB GGUF). With the default buffer (2 GB), the stall is intermittent depending on network speed.

Tested: 117 unit tests pass; verified the fix on unsloth/Qwen3-Coder-Next Q4_K_M.gguf (48.5 GB, 2366 terms, 932 xorbs).
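The error-propagation improvement can be sketched with a standard-library `mpsc` channel and a shared error slot standing in for the writer's `RunState` (all names here are illustrative, not the PR's actual code): when a send fails because the background writer exited, the sender consults the recorded error before falling back to a generic message.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for the writer's RunState: records the first error.
#[derive(Default)]
struct RunState {
    error: Mutex<Option<String>>,
}

fn send_chunk(
    tx: &mpsc::Sender<Vec<u8>>,
    state: &RunState,
    chunk: Vec<u8>,
) -> Result<(), String> {
    tx.send(chunk).map_err(|_| {
        // The channel only closes when the background writer has exited;
        // prefer its recorded error over a generic "channel closed" message.
        state
            .error
            .lock()
            .unwrap()
            .clone()
            .unwrap_or_else(|| "background writer channel closed".to_string())
    })
}

fn demo() -> String {
    let state = Arc::new(RunState::default());
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    let writer_state = Arc::clone(&state);
    let writer = thread::spawn(move || {
        // Simulate the writer hitting a disk error: record it, then drop rx,
        // which closes the channel from the receiving side.
        *writer_state.error.lock().unwrap() = Some("disk full".to_string());
        drop(rx);
    });
    writer.join().unwrap();

    // The send fails, and the caller sees the original error, not a generic one.
    send_chunk(&tx, &state, vec![1, 2, 3]).unwrap_err()
}

fn main() {
    assert_eq!(demo(), "disk full");
    println!("{}", demo());
}
```

The key point is that the sender treats "channel closed" as a symptom and surfaces the writer's root-cause error instead.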
Replace the RwLock-based caching with `tokio::sync::OnceCell`, ensuring only one task per xorb block acquires a CAS permit and downloads the data. Concurrent callers wait on the same result without acquiring permits or duplicating work. This addresses the Codex review feedback:

- Eliminates duplicate downloads (bandwidth, CPU, and memory waste)
- Fixes progress accounting (no double-counted transfer bytes)
- Prevents a failing duplicate from killing reconstruction when another succeeds
- Moves CAS permit acquisition inside the `OnceCell` init, so waiters never hold idle permits
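The single-flight behavior can be sketched with the standard library's `std::sync::OnceLock`, the synchronous analogue of `tokio::sync::OnceCell` (a minimal illustration, not the PR's actual code): many threads race to fill one cell, but the expensive initializer runs exactly once and every other caller blocks until it finishes, then shares the result.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

// Returns how many times the simulated "download" closure actually ran.
fn single_flight_demo() -> usize {
    // Shared cell standing in for one xorb block's cached data.
    let cell: Arc<OnceLock<Vec<u8>>> = Arc::new(OnceLock::new());
    let downloads = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..8)
        .map(|_| {
            let cell = Arc::clone(&cell);
            let downloads = Arc::clone(&downloads);
            thread::spawn(move || {
                // get_or_init runs the closure at most once; every other
                // caller blocks until it finishes, then shares the result.
                let data = cell.get_or_init(|| {
                    downloads.fetch_add(1, Ordering::SeqCst);
                    vec![0u8; 1024] // stand-in for the HTTP download
                });
                assert_eq!(data.len(), 1024);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    downloads.load(Ordering::SeqCst)
}

fn main() {
    // Exactly one thread performed the "download"; the rest waited on it.
    assert_eq!(single_flight_demo(), 1);
}
```

In the async version, putting the CAS permit acquisition inside the init closure means only the one initializing task ever holds a permit; waiters hold nothing.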
hoytak
left a comment
Excellent catch here; good fix.
Sorry I'm late to the PR; I'm trying to understand the "deadlock" and wanted to double-check that my understanding is correct. This is my interaction with Claude:
**Summary**
Fixes download stalls/deadlocks on large file reconstruction (reported on 48.5 GB GGUF files). The root cause is a circular dependency: the main reconstruction loop holds a buffer semaphore permit while blocking on CAS connection permit acquisition, and xorb write locks held during HTTP downloads cause CAS permit starvation.
**Changes**
- **Single-flight xorb downloads via `OnceCell`** (`xorb_block.rs`): replaces `RwLock<Option<...>>` with `tokio::sync::OnceCell`. Only one task per xorb block acquires a CAS permit and downloads the data; concurrent callers wait on the same result without acquiring permits or duplicating work. This eliminates duplicate downloads, prevents double-counted transfer progress, and avoids a failing duplicate killing the reconstruction.
- **Decouple CAS permit from buffer permit** (`file_term.rs`): the main loop no longer blocks on CAS permits while holding a buffer permit. The spawned download task delegates to `retrieve_data`, which handles permit acquisition internally via the `OnceCell` single-flight. This breaks the circular dependency that causes stalls.
- **Improve error propagation** (`sequential_writer.rs`): when the background writer channel closes, check `RunState` for the original error before returning a generic "channel closed" message.

**Root cause**
The reconstruction pipeline has three resource pools: buffer permits (bounded semaphore), CAS download permits (64 concurrent), and per-xorb write locks.
Before this fix, the main loop would:
1. Call `get_data_task()`, which acquires a CAS permit (blocking if the pool is exhausted) while already holding a buffer permit
2. Inside `retrieve_data()`, hold the xorb write lock during the entire HTTP download

This creates two deadlock vectors: buffer backpressure can leave the main loop blocked on CAS permit acquisition indefinitely, and write locks held across downloads keep CAS permits idle, starving the pool.
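The decoupling can be sketched with a hand-rolled counting semaphore standing in for `tokio::sync::Semaphore` (all names and sizes here are illustrative): the main loop acquires only the buffer permit, and the CAS permit is acquired inside the spawned task, so the loop never holds a buffer permit while waiting on CAS.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Minimal counting semaphore, standing in for tokio::sync::Semaphore.
struct Semaphore {
    count: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { count: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut c = self.count.lock().unwrap();
        while *c == 0 {
            c = self.cv.wait(c).unwrap();
        }
        *c -= 1;
    }
    fn release(&self) {
        *self.count.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

fn demo() -> usize {
    let buffer = Arc::new(Semaphore::new(2)); // buffer permits (backpressure)
    let cas = Arc::new(Semaphore::new(1));    // CAS connection permits

    let mut handles = Vec::new();
    for i in 0..4usize {
        // Main loop: acquire ONLY the buffer permit here.
        buffer.acquire();
        let buffer = Arc::clone(&buffer);
        let cas = Arc::clone(&cas);
        handles.push(thread::spawn(move || {
            // Spawned task: the CAS permit is acquired here, so the main
            // loop never waits on CAS while holding a buffer permit.
            cas.acquire();
            // ... the HTTP download would happen here ...
            cas.release();
            buffer.release();
            i
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    assert_eq!(demo(), 6); // all four tasks complete without stalling
}
```

With the old ordering (CAS acquired in the main loop while the buffer permit is held), exhausting either pool could wedge the pipeline; with this ordering, buffer backpressure and CAS concurrency throttle independently.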
**Reproduction**
Reliably reproducible with a small buffer: `HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE=64mb`.
With default buffer (2 GB), the stall is intermittent depending on network speed (consistently reproduced on slower connections).