FileNotFoundError in _acquire() on FUSE/NFS with concurrent processes since 3.21.0 #494

@LoganVegnaSHOP

Summary

Since 3.21.0 (PR #408), _release() on Unix calls Path(self.lock_file).unlink(). This causes FileNotFoundError when multiple processes contend on the same lock file over a FUSE or NFS filesystem.

PR #484 already reverted this behavior on Windows for the same class of race condition. The Unix codepath has the same problem.

Environment

  • Multi-node distributed training (2-8 nodes, 8 GPUs per node, 16-64 processes)
  • Shared FUSE-backed filesystem (CSI volume) for cache directories
  • HuggingFace datasets library calling filelock internally during load_dataset()
  • filelock 3.21.0 through 3.24.2 (all affected)

Reproduces when

Multiple processes across nodes call FileLock.acquire() on the same lock file path on a shared FUSE/NFS mount. The failure becomes more likely as the number of contending processes grows.
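The contention pattern above can be approximated without filelock itself. The following is a rough, stdlib-only sketch, not filelock's code: the helper names (_worker, count_enoent) are hypothetical, the open flags are taken from the description in this issue, and a blocking flock() stands in for filelock's acquire loop. It assumes a Unix host where the "fork" start method is available. Pointed at a path on the shared FUSE/NFS mount, a non-zero count would be consistent with the race reported here; on a local filesystem it is expected to stay at zero.

```python
import fcntl
import multiprocessing as mp
import os


def _worker(lock_path, iterations, queue):
    """Open, lock, unlock, close, unlink in a loop; count ENOENT from os.open."""
    enoent = 0
    try:
        for _ in range(iterations):
            try:
                # Flags as described in this issue (assumption, not copied
                # from filelock's source).
                fd = os.open(lock_path, os.O_CREAT | os.O_WRONLY, 0o644)
            except FileNotFoundError:
                enoent += 1
                continue
            try:
                fcntl.flock(fd, fcntl.LOCK_EX)
                fcntl.flock(fd, fcntl.LOCK_UN)
            finally:
                os.close(fd)
            # Mimic an unlink-on-release step.
            try:
                os.unlink(lock_path)
            except FileNotFoundError:
                pass  # another process already removed the file
    finally:
        # Always report, so the parent never blocks on a missing result.
        queue.put(enoent)


def count_enoent(lock_path, processes=8, iterations=200):
    """Run several contending workers and return the total ENOENT count."""
    ctx = mp.get_context("fork")  # Unix-only start method
    queue = ctx.Queue()
    procs = [
        ctx.Process(target=_worker, args=(lock_path, iterations, queue))
        for _ in range(processes)
    ]
    for proc in procs:
        proc.start()
    total = sum(queue.get() for _ in procs)
    for proc in procs:
        proc.join()
    return total
```

To exercise it, call count_enoent() with a lock path inside the shared cache directory and run it from several nodes at once; the process and iteration counts are arbitrary defaults.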

Traceback

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/shared-cache/datasets/...lock'

  File "filelock/_api.py", line 288, in acquire
    self._acquire()
  File "filelock/_unix.py", line 49, in _acquire
    fd = os.open(self.lock_file, open_flags, open_mode)

Root cause

On local ext4, os.open(path, O_CREAT|O_WRONLY) is atomic in the kernel: it cannot return ENOENT even during a concurrent unlink. On FUSE/NFS, this syscall is translated into separate protocol operations (LOOKUP + CREATE). If another process's unlink() from _release() removes the file between those two operations, os.open(O_CREAT) returns ENOENT.

The top-level os.open() in _acquire() does not catch FileNotFoundError:

fd = os.open(self.lock_file, open_flags, open_mode)  # uncaught ENOENT

The nested FileNotFoundError handler only covers the PermissionError fallback path, not this primary path.
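For illustration, one way to make the primary path tolerate this transient ENOENT is to retry the open. This is a minimal sketch of that idea under stated assumptions, not filelock's code and not the fix proposed in this issue: open_with_retry and its attempts, delay, and opener parameters are hypothetical names chosen for the example.

```python
import os
import time


def open_with_retry(path, flags, mode=0o644, attempts=5, delay=0.01, opener=os.open):
    """Call opener(path, flags, mode), retrying on FileNotFoundError.

    Retries only when O_CREAT is set: without O_CREAT, ENOENT is a
    legitimate answer rather than a sign of a race.
    """
    tries = max(1, attempts) if flags & os.O_CREAT else 1
    for attempt in range(1, tries + 1):
        try:
            return opener(path, flags, mode)
        except FileNotFoundError:
            if attempt == tries:
                raise  # e.g. the parent directory really is missing
            time.sleep(delay)
```

A retry only narrows the window; it does not remove the underlying unlink race, which is why the change suggested in this issue targets _release() instead.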

Why 3.20.x was safe

_release() in 3.20.x explicitly preserved lock files:

# Do not remove the lockfile:
#   https://github.com/tox-dev/py-filelock/issues/31
def _release(self) -> None:
    fd = cast("int", self._context.lock_file_fd)
    self._context.lock_file_fd = None
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)

Without the unlink there is no race window, so os.open(O_CREAT) does not hit ENOENT, even on FUSE.

Suggested fix

Apply the same fix as PR #484 did for Windows: remove unlink() from _unix.py _release(), or gate it behind an opt-in flag (e.g. FileLock(path, delete_on_release=False)).
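The opt-in variant could look roughly like the sketch below. It is a standalone illustration, not a patch against filelock: SketchFlockLock is a hypothetical class, delete_on_release is the flag name floated above rather than an existing filelock parameter, and the position of unlink() relative to the unlock is an arbitrary choice made for the example.

```python
import fcntl
import os


class SketchFlockLock:
    """Hypothetical minimal flock-based lock; not filelock's actual class."""

    def __init__(self, lock_file, delete_on_release=False):
        self.lock_file = lock_file
        # Default False: the lock file is preserved, as in 3.20.x.
        self.delete_on_release = delete_on_release
        self._fd = None

    def acquire(self):
        fd = os.open(self.lock_file, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)  # blocking, for simplicity
        except OSError:
            os.close(fd)
            raise
        self._fd = fd

    def release(self):
        fd, self._fd = self._fd, None
        if fd is None:
            return  # not held
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
        if self.delete_on_release:
            # Only callers who explicitly opt in pay for the unlink race.
            try:
                os.unlink(self.lock_file)
            except FileNotFoundError:
                pass
```

With the default, release() leaves the lock file in place, so concurrent openers on a shared mount never see it disappear; callers on local filesystems who want cleanup can pass delete_on_release=True.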

Workaround

Pin filelock<3.21 or filelock==3.20.4.
