[Fix] Make concurrent downloads safe even when file locking is broken#4306
Merged
Conversation
…ption Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
hanouticelina
approved these changes
Jun 5, 2026
hanouticelina
left a comment
Collaborator
There was a problem hiding this comment.
the solution looks good to me 👍
Contributor
|
This PR has been shipped as part of the v1.18.0 release. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR:
filelockis not bullet-proof on some filesystems (see issue)filelocki.e. we always download to a new tmp file instead of resuming from incomplete filehf_xetis disabled.Alternative to #4228 (thanks @vadiklyutiy for the report and investigation!) addressing the cache corruption reported there, also related to tox-dev/filelock#389 and huggingface/transformers#30859.
On Lustre/GPFS and some NFS mounts,
flock(2)silently succeeds for every caller, soWeakFileLockis a no-op and concurrenthf_hub_downloadcalls end up appending to the same shared<etag>.incompletefile. Worst case is silent cache poisoning: the first process to finish passes its consistency check and renames the file into the blob store, while a slower process still holds an open fd on the renamed inode and keeps appending to the now-live blob.The bug can be reproduced on any machine (no Lustre needed) with the script below: it patches
fcntl.flockto a silent no-op — exactly what those filesystems do — and races twohf_hub_downloadprocesses with worst-case timing. Onmain, it reliably poisons the cache: a corrupted blob (134,981,632 bytes instead of 125,162,496, wrong sha256) behind a valid pointer, which is the same symptom as thevllm servecrash described in #4228. With this PR, the same script ends with both processes succeeding and a bit-identical blob.repro_4228.pyInstead of detecting broken filesystems, this PR removes the lock from the correctness equation: each download now writes to a process-unique temporary file (
<etag>.<random>.incomplete) and atomically renames it into place, andrefs/<revision>is written atomically (tmp file + rename) so readers can never observe a truncated ref — that ref write happens outside any lock today and was racy even on local filesystems. With this change a broken lock costs only duplicated bandwidth, on any filesystem, including configurations that detection can't catch (FUSE wrappers, Lustrelocalflock, NFSlocal_lock=flock). The lock is kept as a best-effort way to avoid downloading the same file twice.Warning
Breaking change: interrupted downloads are no longer resumed by a later call — each download writes to its own unique temporary file (in-process retry/resume in
http_getis unaffected). Sharing a partial file across processes is exactly what made the corruption possible. Xet-backed downloads (the default for new repos) recover most of the resume value through chunk dedup.🤖 Generated with Claude Code
Note
Medium Risk
Changes core download/cache write paths and drops cross-process resume from shared incomplete files, which can increase bandwidth on failure but fixes silent blob corruption on shared/cluster filesystems.
Overview
Makes Hub cache downloads safe when
flock-based locks are ineffective (e.g. Lustre/GPFS/some NFS), where concurrent processes could corrupt a shared<etag>.incompleteblob._download_to_tmp_and_moveno longer appends to or resumes a shared incomplete file. Each download writes to a process-unique*.{uuid}.incompletepath, then_chmod_and_moveinto the final blob; afinallyblock removes any leftover temp file. Cross-process resume from.incompleteis removed (in-processhttp_getretries are unchanged)._cache_commit_hash_for_specific_revisionnow writes refs via tmp +os.replaceso readers never see a half-written ref.WeakFileLockstays as best-effort deduplication only; comments document that correctness does not depend on it. Tests drop resume/force_downloadincomplete tests, add cleanup of*.incomplete, and adjust the legacy-complete-incomplete regression scenario.Reviewed by Cursor Bugbot for commit 75cbbca. Bugbot is set up for automated code reviews on this repo. Configure here.