An issue is observed when using filelock on GPFS.
Specifically, the problem occurs when Hugging Face datasets creates file locks on a multi-node cluster with a GPFS filesystem.
More details can be found in
huggingface/transformers#30859
Reproduction code (from @thinkahead):
I faced this problem when using datasets/builder.py for multi-node fine-tuning. By default, filelock's FileLock resolves to UnixFileLock because "import fcntl" succeeds. On GPFS, UnixFileLock did not work, and I had to use SoftFileLock instead. You can try a simple test in /scrip_continual_pretraining/ from 2 nodes:
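import time
from filelock import FileLock
# Workaround: uncomment to use SoftFileLock instead of the platform default.
#from filelock import SoftFileLock as FileLock

file_path = "/gpfs/text.txt"
lock_path = "/gpfs/test.lock"
lock = FileLock(lock_path, timeout=30)
with lock:
    print("Inside")
    time.sleep(15)  # hold the lock long enough to observe the other nodes
    with open(file_path, "a") as f:
        f.write("Hello there!")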
If you run this on multiple nodes, you will see "Inside" printed on all nodes immediately. This is the problem: every node believes it holds the lock at the same time.
If you run this on a single node (multiple separate Python processes), only one process prints "Inside", and the others wait 15 seconds before printing it in turn.
With the SoftFileLock line above uncommented, the remaining nodes wait, which shows that it locks correctly across multiple nodes.
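To illustrate why a soft lock can behave differently from an fcntl-based one, here is a minimal standard-library sketch of existence-based locking: the lock is held by whoever manages to create the lock file exclusively, and released by deleting it. This is not filelock's code. The helper name `soft_lock` and its parameters are hypothetical, and the sketch assumes the filesystem makes exclusive file creation atomic across nodes, which should be verified on the cluster in question.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def soft_lock(lock_path, timeout=30.0, poll_interval=0.05):
    """Hypothetical helper: hold lock_path as an existence-based lock."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL raises FileExistsError if the file already
            # exists, so only one caller at a time can create it.
            fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_interval)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)  # releasing the lock means deleting the file
```

It would be used as `with soft_lock("/gpfs/test.lock"):` in place of the `with lock:` block in the test above. One trade-off of this approach is that a process that crashes while holding the lock leaves the lock file behind, and it has to be removed manually.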
Originally posted by @thinkahead in huggingface/transformers#30859