Skip to content

[Cache] Detect Lustre/GPFS/NFS in WeakFileLock and fall back to SoftFileLock#4228

Closed
vadiklyutiy wants to merge 1 commit into
huggingface:mainfrom
vadiklyutiy:filelock-lustre-gpfs-fallback
Closed

[Cache] Detect Lustre/GPFS/NFS in WeakFileLock and fall back to SoftFileLock#4228
vadiklyutiy wants to merge 1 commit into
huggingface:mainfrom
vadiklyutiy:filelock-lustre-gpfs-fallback

Conversation

@vadiklyutiy

@vadiklyutiy vadiklyutiy commented May 15, 2026

Copy link
Copy Markdown

Problem

On Lustre, GPFS, and many NFS mounts, fcntl.flock(2) returns success for every caller without blocking — so WeakFileLock is a silent no-op on shared HF caches. Concurrent hf_hub_download cycles can interleave O_APPEND writes to <etag>.incomplete, producing a syntactically-valid but semantically-wrong blob (or a transiently-empty refs/<revision>).

The downstream symptom seen on multi-process vllm serve jobs sharing HF_HOME on Lustre is a sporadic transformers.AutoConfig.from_pretrained crash on one of N siblings: Unrecognized model … Should have a "model_type" key in its config.json. The same config.json was read successfully ~50 ms earlier in the same process, and 7 of 8 sibling processes parse it fine — a classic race.

The current code only falls back to SoftFileLock when filelock raises NotImplementedError, which Lustre/GPFS never do.

Related: tox-dev/filelock#389, huggingface/transformers#30859.

Solution

WeakFileLock now decides between native FileLock and SoftFileLock up-front, via two layers in src/huggingface_hub/utils/_fixes.py:

  1. Primary: statfs(2) family detection — Lustre, GPFS, NFS, SMB, CIFS magic numbers route to SoftFileLock. Cross-node-correct, no IPC needed.
  2. Secondary: fail-closed runtime probe for unknown filesystems — spawns a child process (multiprocessing.get_context("spawn") for pickle-safety) that re-opens a probe file and tries LOCK_EX|LOCK_NB while the parent still holds the lock. Any timeout, exception, or non-"ok" result → SoftFileLock.

Both helpers are @functools.lru_cache'd per cache directory.

Two opt-out env vars cover edge cases:

  • HF_HUB_USE_SOFT_FILELOCK=1 — force soft regardless of detection (for exotic FUSE mounts that wrap an unsafe backend).
  • HF_HUB_FORCE_FLOCK=1 — force native for sites that mount Lustre with cluster-wide flock support and want the small perf benefit.

The legacy NotImplementedErrorSoftFileLock fallback inside the acquisition loop is preserved as belt-and-suspenders.

Tests

New TestSoftLockDetection (10 cases):

  • 5 parametrized known-broken magics (Lustre / GPFS / NFS / SMB / CIFS) → asserts SoftFileLock.
  • Local FS via tmp_path (under /tmp, ext4) → asserts native FileLock; exercises the real probe end-to-end.
  • HF_HUB_USE_SOFT_FILELOCK=True overrides a healthy magic → SoftFileLock.
  • HF_HUB_FORCE_FLOCK=True overrides a Lustre magic → FileLock.
  • Probe failure with unknown magic (0) → SoftFileLock (verifies fail-closed).

Results:

  • pytest tests/test_utils_fixes.py -v14 passed (10 new + 4 pre-existing).
  • pytest tests/test_local_folder.py13 passed, 2 skipped.
  • pytest tests/test_file_download.py → bit-identical to origin/main baseline (failures are sandbox HF_HUB_OFFLINE network restrictions, not regressions; confirmed via git stash + rerun).
  • make style and make quality → clean, including ty check src.
  • mypy src (CI parity) → Success: no issues found in 176 source files.

Real-world detection on the dev box, which happens to use NFS for $HOME:

  • _fs_magic("/home/scratch.vgimpelson_ent/huggingface_hub")0x6969 (NFS) → _should_use_soft_lock returns True.
  • _fs_magic("/tmp")0xef53 (ext4) → _should_use_soft_lock returns False.

Exactly the cases the fix is designed to discriminate.


Note

Medium Risk
Changes core Hub cache locking behavior on POSIX and shared/network filesystems; wrong detection could hurt performance (unnecessary soft locks) or, if forced native flock is misused, reintroduce rare concurrent-write corruption.

Overview
Fixes silent cache corruption when multiple processes share HF_HUB_CACHE on parallel filesystems where flock(2) appears to succeed but does not serialize writers.

WeakFileLock now chooses SoftFileLock vs native FileLock before acquisition, instead of only falling back on NotImplementedError. Detection uses statfs(2) magic numbers for Lustre, GPFS, NFS, SMB, and CIFS, plus a fail-closed child-process probe for unknown POSIX filesystems. Decisions are cached per cache directory.

New knobs: HF_HUB_USE_SOFT_FILELOCK (always soft) and HF_HUB_FORCE_FLOCK (always native when cluster-wide flock is known good). Constants and docs were added; TestSoftLockDetection covers magics, local FS, env overrides, and probe failure.

Reviewed by Cursor Bugbot for commit 748df01. Bugbot is set up for automated code reviews on this repo. Configure here.

…ileLock to prevent silent flock no-ops from corrupting shared HF caches under concurrent writers.

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default mode and found 3 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 748df01. Configure here.

class _Statfs(ctypes.Structure):
# Layout differs by libc/arch; we only need `f_type`, which is the first field on Linux.
# Reserve 128 bytes total — comfortably larger than any known `struct statfs`.
_fields_ = [("f_type", ctypes.c_long), ("_pad", ctypes.c_ubyte * 120)]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heap buffer overflow on macOS in _fs_magic

High Severity

The os.name != "posix" guard allows this function to proceed on macOS (where os.name is "posix"), but the 128-byte _Statfs buffer is far too small for macOS's struct statfs (~2168 bytes, which includes two MAXPATHLEN=1024 character arrays for mount paths). When libc.statfs() is called on macOS, it writes the full struct into the undersized buffer, causing a heap buffer overflow that corrupts memory. This is a known ctypes-on-Darwin trap. The guard needs to use sys.platform == "linux" (or startswith("linux")) since the struct layout and magic numbers are Linux-specific.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 748df01. Configure here.

cache_dir = os.path.dirname(os.path.abspath(str(lock_file)))
lock: BaseFileLock
if _should_use_soft_lock(cache_dir):
lock = SoftFileLock(lock_file, timeout=log_interval)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SoftFileLock missing mode=0o664 for shared caches

Medium Severity

The SoftFileLock path omits mode=0o664, while the FileLock path on line 251 explicitly passes it. With the default mode and typical umask (022), lock files are created as 0o644 (not group-writable). This breaks multi-user access on shared caches — precisely the scenario where SoftFileLock is selected (Lustre, GPFS, NFS). The docstring on line 239 explicitly promises group-writable lock files.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 748df01. Configure here.

result = q.get_nowait()
except Exception:
result = "error"
return result == "ok"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probe doesn't catch multiprocessing exceptions despite fail-closed claim

Medium Severity

The docstring for _flock_actually_serializes claims "Fail-closed: any timeout, exception, or ambiguous result returns False", but the outer try block only has a finally clause — no except. If ctx.Queue(), ctx.Process(), or p.start() raise (common in Docker containers without --ipc=host, or on HPC nodes hitting process limits), the exception propagates through _should_use_soft_lock and crashes WeakFileLock instead of gracefully falling back to SoftFileLock. This is especially problematic since the target users (shared HPC caches) are the most likely to run in restricted environments.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 748df01. Configure here.

@Wauplin

Wauplin commented May 19, 2026

Copy link
Copy Markdown
Collaborator

Hi @vadiklyutiy , have you run into this issue yourself?

As mentioned by @gaborbernat in tox-dev/filelock#389 (comment) this really sound like a fnlock Python bug that should be solve in there rather than applying a weak patch in a downstream library (this PR). Also this PR feels heavily AI-generated, which is not a bad thing in itself but without a proper setup to test the PR I'm reluctant to review/merge it. I'll therefore close it for now but happy to reopen if you think it's necessary.

@Wauplin Wauplin closed this May 19, 2026
@vadiklyutiy

Copy link
Copy Markdown
Author

Hi @Wauplin

First about

have you run into this issue yourself?

and

Also this PR feels heavily AI-generated, which is not a bad thing in itself but without a proper setup to test the PR I'm reluctant to review/merge it.

This is a real issue we encountered during large-scale vLLM testing at NVIDIA. Across approximately 1,000 runs, with dozens running concurrently, we observed 3–7 such failures.

Yes, I used AI, especially for writing tests, but I fully understand the purpose of the changes and have personally reviewed every line of code.

This PR is not about identifying a potential issue that is likely not real, asking Claude to fix it, and then pushing a PR just to add a contribution to huggingface_hub to my resume. You can simply check my public contributions on GitHub; it should be clear that I do not need that.

Second about

As mentioned by @gaborbernat in tox-dev/filelock#389 (comment) this really sound like a fnlock Python bug that should be solve in there rather than applying a weak patch in a downstream library (this PR)

I'd point to the next comment in the same issue tox-dev/filelock#389 (comment). That describe that this is a real issue and his workaround to patch filelock by changing flock to lockf

sed -i "s/fcntl.flock/fcntl.lockf/" site-packages/filelock/_unix.py

helped in that case.

Also in this huggingface/transformers#30859 (comment) comment SoftFileLock was a solution/workaround.

I'll therefore close it for now but happy to reopen if you think it's necessary.

I propose to reopen it. We of course can ping pong the issue between filelock or pyhton but this is real issue.

huggingface_hub has a ready interfaces SoftFileLock that I use here.

The main change here is
https://github.com/vadiklyutiy/huggingface_hub/blob/filelock-lustre-gpfs-fallback/src/huggingface_hub/utils/_fixes.py#L248-L251

where we choose using SoftFileLock for network file systems.

All other changes is accurate implementation of _should_use_soft_lock - checking underlying file system.

@Wauplin

Wauplin commented May 29, 2026

Copy link
Copy Markdown
Collaborator

Hey @vadiklyutiy , thanks for the answer and added details. Sorry for the initial closing, we receive a lot of contributions and it's sometimes hard to determine if people actually ran into the issue they are reporting. So yes, let's reopen this PR and see what we can do about it!

@Wauplin

Wauplin commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Hi @vadiklyutiy, after some investigation and local debugging, I came to the conclusion that the safest bet is to never resume from an incomplete file. This should solve your issue at the cost of introducing a breaking change in the current behavior. However this is not too much of a problem since hf_xet do not allow resuming from an incomplete file so for now it was more of a "legacy nice to have" rather than a useful feature.

I push a separate PR (#4306) with the fix. Do you mind taking a look and let me know what you think? Would be even better if you could test it on a real setup on your side to confirm it resolves your problem. Thanks in advance! 🤗

(once #4306 gets merged, I'll close this PR)

@vadiklyutiy

vadiklyutiy commented Jun 9, 2026

Copy link
Copy Markdown
Author

@Wauplin
#4306 looks good. Thank you for fixing. I will test it.

@vadiklyutiy vadiklyutiy closed this Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants