Skip to content

With large number of short jobs joblib fails with CancelledError #957

@hugorichard

Description

@hugorichard

If I run this code

    import time
    from joblib import Parallel, delayed, load
    from dask_jobqueue import SLURMCluster
    from dask.distributed import Client
    from joblib import parallel_backend
    cluster = SLURMCluster(diagnostics_port=1243)
    cluster.scale(4)
    client = Client(cluster)  # registers as the default "dask" backend 
    parallel_backend('dask')
    time.sleep(10)  # wait for the cluster to be up and running
    print("cluster setup")


    def do_launch_classification():
        def do_launch_classification_study():
            time.sleep(0.1)

        Parallel(n_jobs=-1,
             verbose=1000)(delayed(do_launch_classification_study)()
                           for study_path in range(34)
                           for model, name in enumerate(range(24))
                           for i_split, split in enumerate(range(5)))
    do_launch_classification()

I get this error

distributed.client - ERROR - Error in callback <function DaskDistributedBackend.apply_async.<locals>.callback_wrapper at 0x7fc813d945f0> of <Future: status: cancelled, type: list, key: do_launch_classification_study-batch-234b27edb1e0493f90035dc2c1de8c61>:
Traceback (most recent call last):
  File "/scratch/hrichard/venvs/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 287, in execute_callback
    fn(fut)
  File "/home/parietal/hrichard/joblib/joblib/_dask.py", line 260, in callback_wrapper
    result = future.result()
  File "/scratch/hrichard/venvs/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 224, in result
    raise result
concurrent.futures._base.CancelledError: do_launch_classification_study-batch-234b27edb1e0493f90035dc2c1de8c61

My jobqueue.yaml:

    jobqueue:

        slurm:

            name: dask-worker

            cores: 40

            memory: 250GB

            processes: 2

            local-directory: /scratch/hrichard/dask 

            log-directory: /scratch/hrichard/dask/log 

            death-timeout: 30

Interestingly if I set time.sleep(1) instead of time.sleep(0.1) everything works fine.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions