
Handle nested parallelism in distributed.joblib #1705

Merged

TomAugspurger merged 6 commits into dask:master from TomAugspurger:distributed-joblib-nested on Jan 25, 2018

Conversation

@TomAugspurger
Member

This implements the new API introduced in
joblib/joblib#538 to ensure nested parallelism works
correctly.

This breaks our API: previously, DaskDistributedBackend had a client attribute
holding the Client that was used. I believe (though I may be wrong) that the backend object gets serialized in nested calls, and clients are not serializable.

Nested parallelism means workers creating a DaskDistributedBackend with no
'scheduler_host'. To avoid deadlocks, worker threads will secede and rejoin.

cc @ogrisel.

@mrocklin
Member

I wonder if maybe we should just use get_client() all the time. Perhaps it was a mistake to create a client within this wrapper. Perhaps we should expect the user to create a client externally and either pass that in or else depend on the get_client() global default. Thoughts?

try:
    rejoin()
except AttributeError:
    pass
Member

Should we rejoin here? Typically this is necessary if there is additional computational work to do in the task. That may not be the case here. It's harmless either way, but might add a tiny bit of unwanted delay.

Member Author

I wasn't sure about that either. To make sure I understand, when the worker thread secedes, the scheduler will tell the worker to make a new thread to take its place? In that case, yes we should be OK just not rejoining.

Member

Right, the thread leaves the thread pool, leaving it with n - 1 threads. We make a new thread to take its place. Relevant code here: https://github.com/dask/distributed/blob/master/distributed/threadpoolexecutor.py#L94-L105

def secede(adjust=True):
    """ Have this thread secede from the ThreadPoolExecutor
    See Also
    --------
    rejoin: rejoin the thread pool
    """
    thread_state.proceed = False
    with threads_lock:
        thread_state.executor._threads.remove(threading.current_thread())
        if adjust:
            thread_state.executor._adjust_thread_count()
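For intuition, here is a pure-stdlib toy (hypothetical, not distributed's implementation) of the deadlock that secede() avoids: a task in a single-thread pool submits a subtask to the same pool and blocks waiting on it, so the subtask can never start.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

pool = ThreadPoolExecutor(max_workers=1)

def outer():
    # The subtask is queued behind this very task, which occupies the
    # pool's only thread, so it cannot start while we wait on it.
    inner = pool.submit(lambda: 42)
    return inner.result(timeout=1)

starved = False
try:
    pool.submit(outer).result()
except TimeoutError:
    starved = True  # the nested task never ran within the timeout
pool.shutdown(wait=True)
print("starved:", starved)
```

In distributed, secede() lets the blocked thread leave the pool (with a replacement thread started to keep the pool at full size), so nested work can proceed; rejoin() reverses it.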

if joblib is None or LooseVersion(joblib.__version__) <= "0.11.0":
    pytest.skip("Joblib >= 0.11.1 required.")
Parallel = joblib.Parallel
delayed = joblib.delayed
Member

Slight preference to leave these fully namespaced, especially given the double use of `delayed`.


with cluster() as (s, [a, b]):
    with joblib.parallel_backend('dask.distributed', loop=loop,
                                 scheduler_host=s['address']) as (ba, _):
Member

Right, so about the "let's just always use the default client" argument from above, this would become the following:

with Client(s['address'], loop=loop) as client:
    with joblib.parallel_backend('dask') as (ba, _):
        ...

@TomAugspurger
Member Author

Perhaps we should expect the user to create a client externally and either pass that in or else depend on the get_client() global default.

Ah, I was getting a bit confused earlier about when get_client was raising a ValueError. I was thinking the ValueError only came when you were on a worker, but it could also be when the user hasn't created a client. I'll think more about requiring an existing client. I'm 50/50 right now.

Maybe @jcrist has thoughts here too.

@mrocklin
Member

Ah, I was getting a bit confused earlier about when get_client was raising a ValueError. I was thinking the ValueError only came when you were on a worker, but it could also be when the user hasn't created a client. I'll think more about requiring an existing client. I'm 50/50 right now.

I think that we can evolve into this API somewhat smoothly. I might suggest the following interface:

def __init__(self, scheduler_host=None, client=None, **kwargs):
    if client is None:
        if scheduler_host is not None:
            client = Client(scheduler_host, **kwargs)
        else:
            client = get_client()
    ...

This is now a common convention. It also simplifies nested calling.
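As a runnable sketch of that resolution order (with hypothetical stand-ins for distributed's Client and get_client so it runs without dask), the precedence would be: explicit client, then scheduler_host, then the global default.

```python
# Hedged sketch of the proposed __init__ resolution order; make_client and
# get_default are hypothetical placeholders, not the real distributed API.
def resolve_client(scheduler_host=None, client=None,
                   make_client=lambda host: f"Client({host})",
                   get_default=lambda: "default-client"):
    if client is not None:
        return client                       # explicit client wins
    if scheduler_host is not None:
        return make_client(scheduler_host)  # connect to that scheduler
    return get_default()                    # fall back to get_client()

explicit = resolve_client(client="my-client")
by_host = resolve_client(scheduler_host="tcp://10.0.0.1:8786")
fallback = resolve_client()
```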
@mrocklin
Member

I've pushed my proposed changes to this fork/branch. Please feel free to accept or reject them as you like.

@mrocklin
Member

+1 from me

Also it looks like we need some of these changes due to a new release of joblib?

@TomAugspurger
Member Author

TomAugspurger commented Jan 24, 2018 via email

pass

yield
# No need to rejoin here
Contributor

Why? Won't that create too many threads on the distributed worker nodes?

Also, is it valid to call secede() several times in the same thread? That would happen if there are several consecutive calls to Parallel() in a nested function. We should probably add a test for that case.

Member

If we're likely to do more work in this thread after completing then yes, we should rejoin.

@ogrisel I suspect that you're thinking of cases like the following?

results = Parallel()(...)

# do work

results2 = Parallel()(...)

Contributor

yes

delayed = joblib.delayed

def get_nested_pids():
    return Parallel(n_jobs=2)(delayed(os.getpid)() for _ in range(2))
Contributor

It would be more interesting to have:

def get_nested_pids():
    pids = set(Parallel(n_jobs=2)(delayed(os.getpid)() for _ in range(2)))
    return pids.union(Parallel(n_jobs=2)(delayed(os.getpid)() for _ in range(2)))

Member

This was a good test. It failed before and passes with the most recent commit.

Contributor

@ogrisel ogrisel left a comment

LGTM.
