[Cache] Detect Lustre/GPFS/NFS in WeakFileLock and fall back to SoftFileLock#4228
[Cache] Detect Lustre/GPFS/NFS in WeakFileLock and fall back to SoftFileLock#4228vadiklyutiy wants to merge 1 commit into
WeakFileLock and fall back to SoftFileLock#4228Conversation
…ileLock to prevent silent flock no-ops from corrupting shared HF caches under concurrent writers. Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default mode and found 3 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 748df01. Configure here.
| class _Statfs(ctypes.Structure): | ||
| # Layout differs by libc/arch; we only need `f_type`, which is the first field on Linux. | ||
| # Reserve 128 bytes total — comfortably larger than any known `struct statfs`. | ||
| _fields_ = [("f_type", ctypes.c_long), ("_pad", ctypes.c_ubyte * 120)] |
There was a problem hiding this comment.
Heap buffer overflow on macOS in _fs_magic
High Severity
The os.name != "posix" guard allows this function to proceed on macOS (where os.name is "posix"), but the 128-byte _Statfs buffer is far too small for macOS's struct statfs (~2168 bytes, which includes two MAXPATHLEN=1024 character arrays for mount paths). When libc.statfs() is called on macOS, it writes the full struct into the undersized buffer, causing a heap buffer overflow that corrupts memory. This is a known ctypes-on-Darwin trap. The guard needs to use sys.platform == "linux" (or startswith("linux")) since the struct layout and magic numbers are Linux-specific.
Reviewed by Cursor Bugbot for commit 748df01. Configure here.
| cache_dir = os.path.dirname(os.path.abspath(str(lock_file))) | ||
| lock: BaseFileLock | ||
| if _should_use_soft_lock(cache_dir): | ||
| lock = SoftFileLock(lock_file, timeout=log_interval) |
There was a problem hiding this comment.
SoftFileLock missing mode=0o664 for shared caches
Medium Severity
The SoftFileLock path omits mode=0o664, while the FileLock path on line 251 explicitly passes it. With the default mode and typical umask (022), lock files are created as 0o644 (not group-writable). This breaks multi-user access on shared caches — precisely the scenario where SoftFileLock is selected (Lustre, GPFS, NFS). The docstring on line 239 explicitly promises group-writable lock files.
Reviewed by Cursor Bugbot for commit 748df01. Configure here.
| result = q.get_nowait() | ||
| except Exception: | ||
| result = "error" | ||
| return result == "ok" |
There was a problem hiding this comment.
Probe doesn't catch multiprocessing exceptions despite fail-closed claim
Medium Severity
The docstring for _flock_actually_serializes claims "Fail-closed: any timeout, exception, or ambiguous result returns False", but the outer try block only has a finally clause — no except. If ctx.Queue(), ctx.Process(), or p.start() raise (common in Docker containers without --ipc=host, or on HPC nodes hitting process limits), the exception propagates through _should_use_soft_lock and crashes WeakFileLock instead of gracefully falling back to SoftFileLock. This is especially problematic since the target users (shared HPC caches) are the most likely to run in restricted environments.
Reviewed by Cursor Bugbot for commit 748df01. Configure here.
|
Hi @vadiklyutiy , have you run into this issue yourself? As mentioned by |
|
Hi @Wauplin First about
and
This is a real issue we encountered during large-scale vLLM testing at NVIDIA. Across approximately 1,000 runs, with dozens running concurrently, we observed 3–7 such failures. Yes, I used AI, especially for writing tests, but I fully understand the purpose of the changes and have personally reviewed every line of code. This PR is not about identifying a potential issue that is likely not real, asking Claude to fix it, and then pushing a PR just to add a contribution to Second about
I'd point to the next comment in the same issue tox-dev/filelock#389 (comment). That describe that this is a real issue and his workaround to patch filelock by changing helped in that case. Also in this huggingface/transformers#30859 (comment) comment
I propose to reopen it. We of course can ping pong the issue between
The main change here is where we choose using All other changes is accurate implementation of |
|
Hey @vadiklyutiy , thanks for the answer and added details. Sorry for the initial closing, we receive a lot of contributions and it's sometimes hard to determine if people actually ran into the issue they are reporting. So yes, let's reopen this PR and see what we can do about it! |
|
Hi @vadiklyutiy, after some investigation and local debugging, I came to the conclusion that the safest bet is to never resume from an incomplete file. This should solve your issue at the cost of introducing a breaking change in the current behavior. However this is not too much of a problem since I push a separate PR (#4306) with the fix. Do you mind taking a look and let me know what you think? Would be even better if you could test it on a real setup on your side to confirm it resolves your problem. Thanks in advance! 🤗 (once #4306 gets merged, I'll close this PR) |


Problem
On Lustre, GPFS, and many NFS mounts,
fcntl.flock(2)returns success for every caller without blocking — soWeakFileLockis a silent no-op on shared HF caches. Concurrenthf_hub_downloadcycles can interleaveO_APPENDwrites to<etag>.incomplete, producing a syntactically-valid but semantically-wrong blob (or a transiently-emptyrefs/<revision>).The downstream symptom seen on multi-process
vllm servejobs sharingHF_HOMEon Lustre is a sporadictransformers.AutoConfig.from_pretrainedcrash on one of N siblings:Unrecognized model … Should have a "model_type" key in its config.json. The sameconfig.jsonwas read successfully ~50 ms earlier in the same process, and 7 of 8 sibling processes parse it fine — a classic race.The current code only falls back to
SoftFileLockwhenfilelockraisesNotImplementedError, which Lustre/GPFS never do.Related:
tox-dev/filelock#389,huggingface/transformers#30859.Solution
WeakFileLocknow decides between nativeFileLockandSoftFileLockup-front, via two layers insrc/huggingface_hub/utils/_fixes.py:statfs(2)family detection — Lustre, GPFS, NFS, SMB, CIFS magic numbers route toSoftFileLock. Cross-node-correct, no IPC needed.multiprocessing.get_context("spawn")for pickle-safety) that re-opens a probe file and triesLOCK_EX|LOCK_NBwhile the parent still holds the lock. Any timeout, exception, or non-"ok"result →SoftFileLock.Both helpers are
@functools.lru_cache'd per cache directory.Two opt-out env vars cover edge cases:
HF_HUB_USE_SOFT_FILELOCK=1— force soft regardless of detection (for exotic FUSE mounts that wrap an unsafe backend).HF_HUB_FORCE_FLOCK=1— force native for sites that mount Lustre with cluster-wideflocksupport and want the small perf benefit.The legacy
NotImplementedError→SoftFileLockfallback inside the acquisition loop is preserved as belt-and-suspenders.Tests
New
TestSoftLockDetection(10 cases):SoftFileLock.tmp_path(under/tmp, ext4) → asserts nativeFileLock; exercises the real probe end-to-end.HF_HUB_USE_SOFT_FILELOCK=Trueoverrides a healthy magic →SoftFileLock.HF_HUB_FORCE_FLOCK=Trueoverrides a Lustre magic →FileLock.0) →SoftFileLock(verifies fail-closed).Results:
pytest tests/test_utils_fixes.py -v→ 14 passed (10 new + 4 pre-existing).pytest tests/test_local_folder.py→ 13 passed, 2 skipped.pytest tests/test_file_download.py→ bit-identical toorigin/mainbaseline (failures are sandboxHF_HUB_OFFLINEnetwork restrictions, not regressions; confirmed viagit stash+ rerun).make styleandmake quality→ clean, includingty check src.mypy src(CI parity) →Success: no issues found in 176 source files.Real-world detection on the dev box, which happens to use NFS for
$HOME:_fs_magic("/home/scratch.vgimpelson_ent/huggingface_hub")→0x6969(NFS) →_should_use_soft_lockreturnsTrue._fs_magic("/tmp")→0xef53(ext4) →_should_use_soft_lockreturnsFalse.Exactly the cases the fix is designed to discriminate.
Note
Medium Risk
Changes core Hub cache locking behavior on POSIX and shared/network filesystems; wrong detection could hurt performance (unnecessary soft locks) or, if forced native flock is misused, reintroduce rare concurrent-write corruption.
Overview
Fixes silent cache corruption when multiple processes share
HF_HUB_CACHEon parallel filesystems whereflock(2)appears to succeed but does not serialize writers.WeakFileLocknow choosesSoftFileLockvs nativeFileLockbefore acquisition, instead of only falling back onNotImplementedError. Detection usesstatfs(2)magic numbers for Lustre, GPFS, NFS, SMB, and CIFS, plus a fail-closed child-process probe for unknown POSIX filesystems. Decisions are cached per cache directory.New knobs:
HF_HUB_USE_SOFT_FILELOCK(always soft) andHF_HUB_FORCE_FLOCK(always native when cluster-wide flock is known good). Constants and docs were added;TestSoftLockDetectioncovers magics, local FS, env overrides, and probe failure.Reviewed by Cursor Bugbot for commit 748df01. Bugbot is set up for automated code reviews on this repo. Configure here.