Conversation
This is intended to be a base for LocalCluster (and others) that want to specify more heterogeneous information about workers.
I'm going to try this out with some heterogeneous GPU machines. This feels like a nice base on which to rewrite and clean up LocalCluster though, a prospect that I'm excited about :)
Is the spec intended to be per-worker? E.g.:

spec = {
    'worker1': {"cls": Worker, "options": {"ncores": 1}},
    'nanny1': {"cls": Nanny, "options": {"ncores": 2}},
    'worker2': {"cls": Worker, "options": {"ncores": 1}},
    'nanny2': {"cls": Nanny, "options": {"ncores": 2}},
    'worker3': {"cls": Worker, "options": {"ncores": 1}},
    'nanny3': {"cls": Nanny, "options": {"ncores": 2}},
    ...
}
I'm just wondering if now is a good time to introduce the concept of "worker pools": e.g. you would pass

>>> pool_specs = {
...     'default': {
...         'worker': {"cls": Worker, "options": {"ncores": 1}},
...         'nanny': {"cls": Nanny, "options": {"ncores": 2}},
...     },
...     'no-nanny': {
...         'worker': {"cls": Worker, "options": {"ncores": 1}},
...     },
... }
>>> worker_specs = {'worker1': 'default', 'worker2': 'no-nanny'}
>>> cluster = SpecCluster(workers=worker_specs, pools=pool_specs)
Previously nannies could leak out in various ways
This is related to that issue, but is lower level. I think that it would enable other people to add things like pools more easily. If this is something that you'd like to explore, I encourage you to do so now. I agree that now would be a good time to explore this to help guide design.
distributed/deploy/spec.py
    # If people call this frequently, we only want to run it once
    return self._correct_state_waiting
else:
    task = asyncio.Task(self._correct_state_internal())
You shouldn't create Tasks manually, but instead use asyncio.ensure_future.
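For illustration, a minimal standalone version of that change (names mirror the diff but are stripped of the class context; this is a sketch, not the PR's code):

import asyncio

async def _correct_state_internal():
    pass  # placeholder body

async def main():
    # ensure_future schedules the coroutine on the running loop and returns a
    # Task, without constructing asyncio.Task directly.
    task = asyncio.ensure_future(_correct_state_internal())
    await task

asyncio.run(main())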
distributed/deploy/spec.py
d = self.worker_spec[name]
cls, opts = d["cls"], d.get("options", {})
if "name" not in opts:
    opts = toolz.merge({"name": name}, opts, {"loop": self.loop})
Did you mean to include the loop in here?
Yes, ideally we want the worker to use the IOLoop used by the cluster object.
I mean that loop is only added if name is not in opts. Wouldn't you always want to pass it?
Ah, indeed. Looking at this again, it looks like we do this in an async def function anyway, so IOLoop.current() should be valid regardless. I'll remove the reference to loop entirely, which should also help reduce the contract.
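A rough sketch of what that simplification might look like (a hypothetical helper for illustration; only toolz.merge and the spec format come from the diff above):

import toolz

def build_worker_options(name, spec):
    # Always inject the worker name; no explicit loop is passed because the
    # calling coroutine already runs on the cluster's IOLoop, which
    # IOLoop.current() will pick up.
    cls, opts = spec["cls"], spec.get("options", {})
    opts = toolz.merge({"name": name}, opts)
    return cls, opts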
if workers:
    await asyncio.wait(workers)
    for w in workers:
        w._cluster = weakref.ref(self)
What is the cluster weakref for?
There are a lot of weakrefs around now. They're useful when tracking down leaking references to things.
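For reference, a weak reference points at the cluster without keeping it alive, which is what makes it handy for spotting leaks (a minimal standalone illustration, not the PR's code):

import weakref

class Cluster:
    pass

cluster = Cluster()
ref = weakref.ref(cluster)   # does not increase the cluster's refcount
assert ref() is cluster      # dereference while the cluster is alive
del cluster                  # in CPython the object is collected immediately here
assert ref() is None         # a dead weakref dereferences to None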
distributed/deploy/spec.py
for w in workers:
    w._cluster = weakref.ref(self)
    if self.status == "running":
        await w
The non-running workers are never awaited; what happens to them? They're still added to the workers dict below.
This is again a tornado/asyncio difference. I've removed the running check and made things optimal, I think for both async def and gen.coroutine style functions.
async def _close(self):
    while self.status == "closing":
        await asyncio.sleep(0.1)
Instead of polling, could we have a future for the closing operation (created by the first call to _close), and just wait on that?
Good thought. I'm inclined to wait on this for now though if that's ok.
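For the record, the suggested pattern could look roughly like this (class and attribute names are assumptions for illustration, not the PR's code):

import asyncio

class Cluster:
    def __init__(self):
        self.status = "running"
        self._close_task = None

    async def _close_internal(self):
        ...  # actual teardown logic would go here
        self.status = "closed"

    async def _close(self):
        # The first caller creates the close task; later callers await the
        # same future instead of polling self.status in a sleep loop.
        if self._close_task is None:
            self._close_task = asyncio.ensure_future(self._close_internal())
        await self._close_task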
def _correct_state(self):
    if self._correct_state_waiting:
        # If people call this frequently, we only want to run it once
I think this drops scale requests while a current scale request is processing:
- Call scale
- Spec updated
- Correct-state task starts; task is stored as _correct_state_waiting
- scale returns
- Call scale again
- Spec updated
- Since the previous call is still in progress, state is not corrected and no new workers are started/stopped. Spec and tasks are now out of sync. Also, since there are multiple await calls in _correct_state_internal, the worker_spec can be different at different points in that function, leading to potential bugs.
One naive solution would be to have a background task that loops forever, waiting on an event:

    while self.running:
        await self._spec_updated.wait()
        # update workers to match spec
        # After updating, only clear the event if things are up to date
        # If things aren't up to date, then we loop again
        if self.spec_matches_current_state():
            self._spec_updated.clear()

Then _correct_state would look like:

    def _correct_state(self):
        # set the event, it's only ever cleared in the loop
        # We force synchronization here to prevent scheduling tons
        # of tasks all setting the event, this blocks until it's set.
        return self.sync(self._mark_state_updated)

    async def _mark_state_updated(self):
        self._spec_updated.set()

There are likely other ways to handle this. In dask-gateway I have a task per worker/scheduler. As the spec updates, unfinished tasks are cancelled or new ones are fired. If a previous scale call is still in progress for a cluster, scale will block until that call has finished. Note that this only blocks while we update our internal task state (cancelling/firing new tasks), not until those tasks have completed.
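A rough sketch of that task-per-worker bookkeeping (method and attribute names here are assumptions for illustration, not dask-gateway's actual API):

import asyncio

class Cluster:
    def __init__(self):
        self.worker_spec = {}  # name -> spec dictionary
        self._tasks = {}       # name -> asyncio.Task launching that worker

    async def _launch_worker(self, name, spec):
        ...  # ask the resource manager for this worker

    def _sync_tasks_to_spec(self):
        # Cancel tasks for workers that are no longer in the spec.
        for name in list(self._tasks):
            if name not in self.worker_spec:
                self._tasks.pop(name).cancel()
        # Fire new tasks for workers that were just added to the spec.
        for name, spec in self.worker_spec.items():
            if name not in self._tasks:
                self._tasks[name] = asyncio.ensure_future(
                    self._launch_worker(name, spec)
                )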
since previous call is still in progress, state is not corrected, no new workers are started/stopped. Spec and tasks are now out of sync. Also, since there are multiple await calls in _correct_state_internal, the worker_spec can be different at different points in that function, leading to potential bugs.
So, the _correct_state_waiting attribute isn't the currently running task, it's the currently enqueued one. Once _correct_state starts running it immediately clears this attribute. After someone calls scale there is a clean, not-yet-run _correct_state_waiting future that will run soon.
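In other words, the enqueue-once behaviour described above looks roughly like this (a sketch reconstructed from the description, not the exact diff):

import asyncio

class Cluster:
    def __init__(self):
        self._correct_state_waiting = None

    def _correct_state(self):
        if self._correct_state_waiting:
            # A correction is already enqueued but has not started; reuse it.
            return self._correct_state_waiting
        task = asyncio.ensure_future(self._correct_state_internal())
        self._correct_state_waiting = task
        return task

    async def _correct_state_internal(self):
        # Clear immediately so that scale calls arriving while we run will
        # enqueue a fresh correction that sees the newest worker_spec.
        self._correct_state_waiting = None
        ...  # start/stop workers to match self.worker_spec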
Since _correct_state_internal waits on the created workers, this does mean that there's no way to cancel pending workers. This is fine for LocalCluster, but would be problematic if used as a base class for other cluster managers. The following would request and start 100 workers before scaling back down afaict:
cluster.scale(100)
cluster.scale(2)
I think that this depends on what you mean by "waits on".
One approach is that for a cluster manager to reach a correct state it only has to successfully submit a request to the resource manager and receive an acknowledgement that the resource manager is handling it. We're not guaranteeing full deployment, merely that we've done our part of the job. I would expect this to almost always be fairly fast.
Separately, there is now a Client.wait_for_workers(n=10) method that might be used for full client <-> scheduler checks.
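For example, a usage sketch (the worker count is arbitrary, and cluster is assumed to be a cluster object like the ones above):

from dask.distributed import Client

client = Client(cluster)       # connect a client to the cluster's scheduler
cluster.scale(10)              # ask the cluster manager for 10 workers
client.wait_for_workers(10)    # block until the scheduler actually sees them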
async def _start(self):
    while self.status == "starting":
        await asyncio.sleep(0.01)
Same here as with closing: we could wait on the start task instead of polling.
def __enter__(self):
    self.sync(self._correct_state)
    self.sync(self._wait_for_workers)
Does this mean that __enter__ will only complete once the initial n workers have started? What happens if we request 2 workers, 1 starts, and 1 fails?
Yes, this might hang. I'm not sure we ever had a test in our test suite with this case. I'll add something.
Thanks for the review @jcrist! If you have a chance to pass through things tomorrow I would appreciate it.
I plan to merge this later today if there are no further comments. Tests here are pretty decent, although I'll need to overhaul adaptive. I'd like to do this in a separate PR though.
OK. Merging this in. I intend to be active in this area for a while, so if there are still issues please feel free to raise them. I plan to do the following:
LocalCluster.__repr__ was removed in dask#2675.
This is intended to be a base for LocalCluster (and others) that want to specify more heterogeneous information about workers. This forces the use of Python 3 and introduces more asyncio and async def handling. This cleans up a number of intermittent testing failures and improves our testing harness hygiene.
Additionally, this PR does the following:
Docstring
Cluster that requires a full specification of workers
This attempts to handle much of the logistics of cleanly setting up and
tearing down a scheduler and workers, without handling any of the logic
around user inputs. It should form the base of other cluster creation
functions.
Examples
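A usage sketch filled in from the spec format discussed above (worker names and options are illustrative):

>>> from dask.distributed import Worker, Nanny
>>> spec = {
...     'my-worker': {"cls": Worker, "options": {"ncores": 1}},
...     'my-nanny': {"cls": Nanny, "options": {"ncores": 2}},
... }
>>> cluster = SpecCluster(workers=spec)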