[Cache] Detect Lustre/GPFS/NFS in `WeakFileLock` and fall back to `SoftFileLock` by vadiklyutiy · Pull Request #4228 · huggingface/huggingface_hub

vadiklyutiy · 2026-05-15T02:10:18Z

Problem

On Lustre, GPFS, and many NFS mounts, fcntl.flock(2) returns success for every caller without blocking — so WeakFileLock is a silent no-op on shared HF caches. Concurrent hf_hub_download cycles can interleave O_APPEND writes to <etag>.incomplete, producing a syntactically-valid but semantically-wrong blob (or a transiently-empty refs/<revision>).

The downstream symptom seen on multi-process vllm serve jobs sharing HF_HOME on Lustre is a sporadic transformers.AutoConfig.from_pretrained crash on one of N siblings: Unrecognized model … Should have a "model_type" key in its config.json. The same config.json was read successfully ~50 ms earlier in the same process, and 7 of 8 sibling processes parse it fine — a classic race.

The current code only falls back to SoftFileLock when filelock raises NotImplementedError, which Lustre/GPFS never do.

Related: tox-dev/filelock#389, huggingface/transformers#30859.

Solution

WeakFileLock now decides between native FileLock and SoftFileLock up-front, via two layers in src/huggingface_hub/utils/_fixes.py:

Primary: statfs(2) family detection — Lustre, GPFS, NFS, SMB, CIFS magic numbers route to SoftFileLock. Cross-node-correct, no IPC needed.
Secondary: fail-closed runtime probe for unknown filesystems — spawns a child process (multiprocessing.get_context("spawn") for pickle-safety) that re-opens a probe file and tries LOCK_EX|LOCK_NB while the parent still holds the lock. Any timeout, exception, or non-"ok" result → SoftFileLock.

Both helpers are @functools.lru_cache'd per cache directory.

Two opt-out env vars cover edge cases:

HF_HUB_USE_SOFT_FILELOCK=1 — force soft regardless of detection (for exotic FUSE mounts that wrap an unsafe backend).
HF_HUB_FORCE_FLOCK=1 — force native for sites that mount Lustre with cluster-wide flock support and want the small perf benefit.

The legacy NotImplementedError → SoftFileLock fallback inside the acquisition loop is preserved as belt-and-suspenders.

Tests

New TestSoftLockDetection (10 cases):

5 parametrized known-broken magics (Lustre / GPFS / NFS / SMB / CIFS) → asserts SoftFileLock.
Local FS via tmp_path (under /tmp, ext4) → asserts native FileLock; exercises the real probe end-to-end.
HF_HUB_USE_SOFT_FILELOCK=True overrides a healthy magic → SoftFileLock.
HF_HUB_FORCE_FLOCK=True overrides a Lustre magic → FileLock.
Probe failure with unknown magic (0) → SoftFileLock (verifies fail-closed).

Results:

pytest tests/test_utils_fixes.py -v → 14 passed (10 new + 4 pre-existing).
pytest tests/test_local_folder.py → 13 passed, 2 skipped.
pytest tests/test_file_download.py → bit-identical to origin/main baseline (failures are sandbox HF_HUB_OFFLINE network restrictions, not regressions; confirmed via git stash + rerun).
make style and make quality → clean, including ty check src.
mypy src (CI parity) → Success: no issues found in 176 source files.

Real-world detection on the dev box, which happens to use NFS for $HOME:

_fs_magic("/home/scratch.vgimpelson_ent/huggingface_hub") → 0x6969 (NFS) → _should_use_soft_lock returns True.
_fs_magic("/tmp") → 0xef53 (ext4) → _should_use_soft_lock returns False.

Exactly the cases the fix is designed to discriminate.

Note

Medium Risk
Changes core Hub cache locking behavior on POSIX and shared/network filesystems; wrong detection could hurt performance (unnecessary soft locks) or, if forced native flock is misused, reintroduce rare concurrent-write corruption.

Overview
Fixes silent cache corruption when multiple processes share HF_HUB_CACHE on parallel filesystems where flock(2) appears to succeed but does not serialize writers.

WeakFileLock now chooses SoftFileLock vs native FileLock before acquisition, instead of only falling back on NotImplementedError. Detection uses statfs(2) magic numbers for Lustre, GPFS, NFS, SMB, and CIFS, plus a fail-closed child-process probe for unknown POSIX filesystems. Decisions are cached per cache directory.

New knobs: HF_HUB_USE_SOFT_FILELOCK (always soft) and HF_HUB_FORCE_FLOCK (always native when cluster-wide flock is known good). Constants and docs were added; TestSoftLockDetection covers magics, local FS, env overrides, and probe failure.

^{Reviewed by Cursor Bugbot for commit 748df01. Bugbot is set up for automated code reviews on this repo. Configure here.}

…ileLock to prevent silent flock no-ops from corrupting shared HF caches under concurrent writers. Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

cursor

Cursor Bugbot has reviewed your changes using default mode and found 3 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 748df01. Configure here.}

cursor · 2026-05-15T02:21:15Z

+    class _Statfs(ctypes.Structure):
+        # Layout differs by libc/arch; we only need `f_type`, which is the first field on Linux.
+        # Reserve 128 bytes total — comfortably larger than any known `struct statfs`.
+        _fields_ = [("f_type", ctypes.c_long), ("_pad", ctypes.c_ubyte * 120)]


Heap buffer overflow on macOS in _fs_magic

High Severity

The os.name != "posix" guard allows this function to proceed on macOS (where os.name is "posix"), but the 128-byte _Statfs buffer is far too small for macOS's struct statfs (~2168 bytes, which includes two MAXPATHLEN=1024 character arrays for mount paths). When libc.statfs() is called on macOS, it writes the full struct into the undersized buffer, causing a heap buffer overflow that corrupts memory. This is a known ctypes-on-Darwin trap. The guard needs to use sys.platform == "linux" (or startswith("linux")) since the struct layout and magic numbers are Linux-specific.

^{Reviewed by Cursor Bugbot for commit 748df01. Configure here.}

cursor · 2026-05-15T02:21:15Z

+    cache_dir = os.path.dirname(os.path.abspath(str(lock_file)))
+    lock: BaseFileLock
+    if _should_use_soft_lock(cache_dir):
+        lock = SoftFileLock(lock_file, timeout=log_interval)


SoftFileLock missing mode=0o664 for shared caches

Medium Severity

The SoftFileLock path omits mode=0o664, while the FileLock path on line 251 explicitly passes it. With the default mode and typical umask (022), lock files are created as 0o644 (not group-writable). This breaks multi-user access on shared caches — precisely the scenario where SoftFileLock is selected (Lustre, GPFS, NFS). The docstring on line 239 explicitly promises group-writable lock files.

^{Reviewed by Cursor Bugbot for commit 748df01. Configure here.}

cursor · 2026-05-15T02:21:15Z

+            result = q.get_nowait()
+        except Exception:
+            result = "error"
+        return result == "ok"


Probe doesn't catch multiprocessing exceptions despite fail-closed claim

Medium Severity

The docstring for _flock_actually_serializes claims "Fail-closed: any timeout, exception, or ambiguous result returns False", but the outer try block only has a finally clause — no except. If ctx.Queue(), ctx.Process(), or p.start() raise (common in Docker containers without --ipc=host, or on HPC nodes hitting process limits), the exception propagates through _should_use_soft_lock and crashes WeakFileLock instead of gracefully falling back to SoftFileLock. This is especially problematic since the target users (shared HPC caches) are the most likely to run in restricted environments.

^{Reviewed by Cursor Bugbot for commit 748df01. Configure here.}

Wauplin · 2026-05-19T09:14:23Z

Hi @vadiklyutiy , have you run into this issue yourself?

As mentioned by @gaborbernat in tox-dev/filelock#389 (comment) this really sound like a fnlock Python bug that should be solve in there rather than applying a weak patch in a downstream library (this PR). Also this PR feels heavily AI-generated, which is not a bad thing in itself but without a proper setup to test the PR I'm reluctant to review/merge it. I'll therefore close it for now but happy to reopen if you think it's necessary.

vadiklyutiy · 2026-05-28T19:14:29Z

Hi @Wauplin

First about

have you run into this issue yourself?

and

Also this PR feels heavily AI-generated, which is not a bad thing in itself but without a proper setup to test the PR I'm reluctant to review/merge it.

This is a real issue we encountered during large-scale vLLM testing at NVIDIA. Across approximately 1,000 runs, with dozens running concurrently, we observed 3–7 such failures.

Yes, I used AI, especially for writing tests, but I fully understand the purpose of the changes and have personally reviewed every line of code.

This PR is not about identifying a potential issue that is likely not real, asking Claude to fix it, and then pushing a PR just to add a contribution to huggingface_hub to my resume. You can simply check my public contributions on GitHub; it should be clear that I do not need that.

Second about

As mentioned by @gaborbernat in tox-dev/filelock#389 (comment) this really sound like a fnlock Python bug that should be solve in there rather than applying a weak patch in a downstream library (this PR)

I'd point to the next comment in the same issue tox-dev/filelock#389 (comment). That describe that this is a real issue and his workaround to patch filelock by changing flock to lockf

sed -i "s/fcntl.flock/fcntl.lockf/" site-packages/filelock/_unix.py

helped in that case.

Also in this huggingface/transformers#30859 (comment) comment SoftFileLock was a solution/workaround.

I'll therefore close it for now but happy to reopen if you think it's necessary.

I propose to reopen it. We of course can ping pong the issue between filelock or pyhton but this is real issue.

huggingface_hub has a ready interfaces SoftFileLock that I use here.

The main change here is
https://github.com/vadiklyutiy/huggingface_hub/blob/filelock-lustre-gpfs-fallback/src/huggingface_hub/utils/_fixes.py#L248-L251

where we choose using SoftFileLock for network file systems.

All other changes is accurate implementation of _should_use_soft_lock - checking underlying file system.

Wauplin · 2026-05-29T14:40:47Z

Hey @vadiklyutiy , thanks for the answer and added details. Sorry for the initial closing, we receive a lot of contributions and it's sometimes hard to determine if people actually ran into the issue they are reporting. So yes, let's reopen this PR and see what we can do about it!

Wauplin · 2026-06-03T15:35:10Z

Hi @vadiklyutiy, after some investigation and local debugging, I came to the conclusion that the safest bet is to never resume from an incomplete file. This should solve your issue at the cost of introducing a breaking change in the current behavior. However this is not too much of a problem since hf_xet do not allow resuming from an incomplete file so for now it was more of a "legacy nice to have" rather than a useful feature.

I push a separate PR (#4306) with the fix. Do you mind taking a look and let me know what you think? Would be even better if you could test it on a real setup on your side to confirm it resolves your problem. Thanks in advance! 🤗

(once #4306 gets merged, I'll close this PR)

vadiklyutiy · 2026-06-09T19:02:52Z

@Wauplin
#4306 looks good. Thank you for fixing. I will test it.

[Cache] Detect Lustre/GPFS/NFS in WeakFileLock and fall back to SoftF…

748df01

…ileLock to prevent silent flock no-ops from corrupting shared HF caches under concurrent writers. Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

cursor Bot reviewed May 15, 2026

View reviewed changes

Wauplin closed this May 19, 2026

Wauplin reopened this May 29, 2026

Wauplin mentioned this pull request Jun 3, 2026

[Fix] Make concurrent downloads safe even when file locking is broken #4306

Merged

vadiklyutiy closed this Jun 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Cache] Detect Lustre/GPFS/NFS in `WeakFileLock` and fall back to `SoftFileLock`#4228

[Cache] Detect Lustre/GPFS/NFS in `WeakFileLock` and fall back to `SoftFileLock`#4228
vadiklyutiy wants to merge 1 commit into
huggingface:mainfrom
vadiklyutiy:filelock-lustre-gpfs-fallback

vadiklyutiy commented May 15, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 15, 2026

Uh oh!

cursor Bot May 15, 2026

Uh oh!

cursor Bot May 15, 2026

Uh oh!

Wauplin commented May 19, 2026 •

edited

Loading

Uh oh!

vadiklyutiy commented May 28, 2026

Uh oh!

Wauplin commented May 29, 2026

Uh oh!

Wauplin commented Jun 3, 2026 •

edited

Loading

Uh oh!

vadiklyutiy commented Jun 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vadiklyutiy commented May 15, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Tests

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 15, 2026

Choose a reason for hiding this comment

Heap buffer overflow on macOS in _fs_magic

Uh oh!

cursor Bot May 15, 2026

Choose a reason for hiding this comment

SoftFileLock missing mode=0o664 for shared caches

Uh oh!

cursor Bot May 15, 2026

Choose a reason for hiding this comment

Probe doesn't catch multiprocessing exceptions despite fail-closed claim

Uh oh!

Wauplin commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vadiklyutiy commented May 28, 2026

Uh oh!

Wauplin commented May 29, 2026

Uh oh!

Wauplin commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vadiklyutiy commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vadiklyutiy commented May 15, 2026 •

edited by cursor Bot

Loading

Heap buffer overflow on macOS in `_fs_magic`

SoftFileLock missing `mode=0o664` for shared caches

Wauplin commented May 19, 2026 •

edited

Loading

Wauplin commented Jun 3, 2026 •

edited

Loading

vadiklyutiy commented Jun 9, 2026 •

edited

Loading