With large number of short jobs joblib fails with CancelledError #957
Closed
Description
If I run this code:
import time
from joblib import Parallel, delayed, load
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
from joblib import parallel_backend
cluster = SLURMCluster(diagnostics_port=1243)
cluster.scale(4)
client = Client(cluster) # registers as the default "dask" backend
parallel_backend('dask')
time.sleep(10) # wait for the cluster to be up and running
print("cluster setup")
def do_launch_classification():
    def do_launch_classification_study():
        time.sleep(0.1)
    Parallel(n_jobs=-1,
             verbose=1000)(delayed(do_launch_classification_study)()
                           for study_path in range(34)
                           for model, name in enumerate(range(24))
                           for i_split, split in enumerate(range(5)))
do_launch_classification()

I get this error:
distributed.client - ERROR - Error in callback <function DaskDistributedBackend.apply_async.<locals>.callback_wrapper at 0x7fc813d945f0> of <Future: status: cancelled, type: list, key: do_launch_classification_study-batch-234b27edb1e0493f90035dc2c1de8c61>:
Traceback (most recent call last):
  File "/scratch/hrichard/venvs/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 287, in execute_callback
    fn(fut)
  File "/home/parietal/hrichard/joblib/joblib/_dask.py", line 260, in callback_wrapper
    result = future.result()
  File "/scratch/hrichard/venvs/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 224, in result
    raise result
concurrent.futures._base.CancelledError: do_launch_classification_study-batch-234b27edb1e0493f90035dc2c1de8c61

My jobqueue.yaml:
jobqueue:
  slurm:
    name: dask-worker
    cores: 40
    memory: 250GB
    processes: 2
    local-directory: /scratch/hrichard/dask
    log-directory: /scratch/hrichard/dask/log
    death-timeout: 30

Interestingly, if I set time.sleep(1) instead of time.sleep(0.1), everything works fine.
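Since longer tasks seem to avoid the error, one possible workaround (a sketch only, assuming the failure is tied to how short each task is, which is not confirmed) would be to group several short jobs into one task. The helpers `chunked` and `run_chunk` and the chunk size of 10 are hypothetical names and values chosen for illustration; the sleep is shortened so the example runs quickly.

```python
import itertools
import time


def do_launch_classification_study():
    # Stand-in for the short job from the report (0.1 s there, shortened here).
    time.sleep(0.001)
    return 1


def chunked(iterable, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


def run_chunk(chunk):
    """Run several short jobs inside a single, longer task (hypothetical helper)."""
    return [do_launch_classification_study() for _ in chunk]


# Same 34 * 24 * 5 parameter combinations as in the reproduction script.
params = list(itertools.product(range(34), range(24), range(5)))

# Chunk size 10 is an arbitrary assumption; each chunk becomes one task.
chunks = list(chunked(params, 10))
```

Each chunk could then be submitted as a single `delayed(run_chunk)(chunk)` call to `Parallel`, so the scheduler sees 408 longer tasks instead of 4080 very short ones. This only sidesteps the symptom and does not explain the `CancelledError` itself.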