Description
FileNotFoundError in _acquire() on FUSE/NFS with concurrent processes since 3.21.0
Summary
Since 3.21.0 (PR #408), _release() on Unix calls Path(self.lock_file).unlink(). This causes FileNotFoundError when multiple processes contend on the same lock file over a FUSE or NFS filesystem.
PR #484 already reverted this behavior on Windows for the same class of race condition. The Unix codepath has the same problem.
Environment
- Multi-node distributed training (2-8 nodes, 8 GPUs per node, 16-64 processes)
- Shared FUSE-backed filesystem (CSI volume) for cache directories
- HuggingFace datasets library calling filelock internally during load_dataset()
- filelock 3.21.0 through 3.24.2 (all affected)
Reproduces when
Multiple processes across nodes call FileLock.acquire() on the same lock file path on a shared FUSE/NFS mount. The more processes, the more likely.
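The contention pattern can be approximated with a standard-library stress sketch that does not use filelock itself. The names `stress_lock` and `_worker` are hypothetical, and the release step only approximates the 3.21.0+ behavior (the library's exact ordering of unlock, close, and unlink may differ). It is Unix-only, and on a local filesystem it would be expected to report zero failures; a nonzero count would be expected only on a mount where the race exists.

```python
import fcntl
import multiprocessing as mp
import os
from contextlib import suppress


def _worker(lock_path, iterations, results):
    """Repeatedly acquire the lock, then release it and unlink the file."""
    enoent = 0
    for _ in range(iterations):
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_WRONLY, 0o644)
        except FileNotFoundError:
            enoent += 1  # the failure described in this report
            continue
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)
            with suppress(OSError):
                os.unlink(lock_path)  # approximates unlink-on-release
            fcntl.flock(fd, fcntl.LOCK_UN)
        finally:
            os.close(fd)
    results.put(enoent)


def stress_lock(lock_path, processes=8, iterations=200):
    """Return the total number of ENOENT failures seen across all workers."""
    ctx = mp.get_context("fork")  # assumption: a Unix host with fork support
    results = ctx.Queue()
    workers = [
        ctx.Process(target=_worker, args=(lock_path, iterations, results))
        for _ in range(processes)
    ]
    for worker in workers:
        worker.start()
    total = sum(results.get(timeout=120) for _ in workers)
    for worker in workers:
        worker.join()
    return total
```

Pointing `lock_path` at a file on the shared mount and running the same function on several nodes at once would be closer to the reported setup than a single-host run.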
Traceback
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/shared-cache/datasets/...lock'
File "filelock/_api.py", line 288, in acquire
self._acquire()
File "filelock/_unix.py", line 49, in _acquire
fd = os.open(self.lock_file, open_flags, open_mode)
Root cause
On local ext4, os.open(path, O_CREAT|O_WRONLY) is atomic in the kernel — it cannot return ENOENT even during concurrent unlink. On FUSE/NFS, this syscall is translated into separate protocol operations (LOOKUP + CREATE). Between those two operations, another process's unlink() from _release() can remove the file, causing os.open(O_CREAT) to return ENOENT.
The top-level os.open() in _acquire() does not catch FileNotFoundError:
fd = os.open(self.lock_file, open_flags, open_mode)  # uncaught ENOENT
The nested FileNotFoundError handler only covers the PermissionError fallback path, not this primary path.
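For illustration, one way a caller could tolerate the transient ENOENT is to retry the open a few times. This is a minimal sketch, not filelock's code: `open_with_enoent_retry` and its `attempts` and `delay` parameters are hypothetical names. It also cannot tell a racing unlink apart from a genuinely missing parent directory, so the final failure is re-raised.

```python
import os
import time


def open_with_enoent_retry(path, flags, mode=0o644, attempts=5, delay=0.01):
    """Call os.open, retrying when it raises FileNotFoundError (ENOENT)."""
    for attempt in range(attempts):
        try:
            return os.open(path, flags, mode)
        except FileNotFoundError:
            if attempt == attempts - 1:
                raise  # still missing after all attempts: surface the error
            time.sleep(delay)  # give the racing unlink a moment to finish
    raise ValueError("attempts must be at least 1")
```

A retry narrows the window but does not close it, which is why removing the unlink is the more robust option.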
Why 3.20.x was safe
_release() in 3.20.x explicitly preserved lock files:
# Do not remove the lockfile:
# https://github.com/tox-dev/py-filelock/issues/31
def _release(self) -> None:
    fd = cast("int", self._context.lock_file_fd)
    self._context.lock_file_fd = None
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
No unlink = no race window = os.open(O_CREAT) always succeeds, even on FUSE.
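The same idea as a standalone, Unix-only sketch outside the library: the lock file is created on first acquire and never removed, so later opens find an existing path. The function names `acquire` and `release` are hypothetical and are not filelock's API.

```python
import fcntl
import os


def acquire(lock_path, mode=0o644):
    """Open (creating if needed) the lock file and take an exclusive flock."""
    fd = os.open(lock_path, os.O_CREAT | os.O_WRONLY, mode)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
    except BaseException:
        os.close(fd)  # do not leak the descriptor if locking fails
        raise
    return fd


def release(fd):
    """Unlock and close; the lock file is intentionally left on disk."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

The cost of this design is a leftover zero-byte file per lock path, which is the trade-off 3.20.x accepted.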
Suggested fix
Apply the same fix as PR #484 did for Windows: remove unlink() from _unix.py _release(), or gate it behind an opt-in flag (e.g. FileLock(path, delete_on_release=False)).
Workaround
Pin filelock<3.21 or filelock==3.20.4.
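To catch an accidental upgrade early, a job could check the installed version at startup. The sketch below is an assumption-laden helper, not part of filelock: `in_reported_range` is a hypothetical name, it only understands plain numeric versions such as "3.21.0", and it encodes only the range listed in this report (3.21.0 through 3.24.2), saying nothing about later releases.

```python
def in_reported_range(version):
    """Return True if version falls in the range this report lists as affected."""
    parts = [int(p) for p in version.split(".")[:3]]
    parts += [0] * (3 - len(parts))  # pad "3.21" to (3, 21, 0)
    return (3, 21, 0) <= tuple(parts) <= (3, 24, 2)
```

In practice the argument could come from `importlib.metadata.version("filelock")`, with the job failing fast or logging a warning when the check returns True.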