Stateful Actors by mrocklin · Pull Request #2133 · dask/distributed

mrocklin · 2018-07-19T16:54:15Z

This allows Dask to manage remote stateful classes. This has a few advantages:

We can update state quickly without having to engage pure tasks
Actor operations happen between workers without the scheduler, and so have lower latency and don't suffer the same bottlenecks

And a few drawbacks:

Makes no attempt at resilience to worker failure
Not fully integrated with the task scheduling framework (passing actor futures around won't give the same semantics as with normal futures, for example)

Example

In [1]: from dask.distributed import Client

In [2]: client = Client(processes=False)

In [3]: class Counter:
   ...:     n = 0
   ...:     def __init__(self):
   ...:         self.n = 0
   ...:     def increment(self):
   ...:         self.n += 1
   ...:         return self.n
   ...:     

In [4]: counter = client.submit(Counter, actors=True)

In [5]: counter = counter.result()

In [6]: counter.n
Out[6]: 0

In [7]: counter.increment()
Out[7]: <distributed.actor.ActorFuture at 0x7f31b44d5d68>

In [8]: _.result()
Out[8]: 1

In [9]: counter.n
Out[9]: 1

In [10]: %time counter.increment().result()
CPU times: user 4.06 ms, sys: 21 µs, total: 4.08 ms
Wall time: 3.91 ms
Out[10]: 2

In [11]: %%time
    ...: for i in range(1000):
    ...:     counter.increment()
    ...: 
CPU times: user 566 ms, sys: 40.1 ms, total: 606 ms
Wall time: 585 ms

In [12]: %time counter.n
CPU times: user 5.65 ms, sys: 125 µs, total: 5.77 ms
Wall time: 4.25 ms
Out[12]: 1002

Performance

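Current roundtrip latency is around a millisecond with direct_to_workers=True. The timings below show about 3.6 ms without it and about 1.25 ms with it.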

In [14]: %timeit counter.n
3.57 ms ± 727 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [15]: client = Client(direct_to_workers=True)

In [16]: counter = client.submit(Counter, actor=True).result()

In [17]: %timeit counter.n
1.25 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

mrocklin · 2018-07-26T00:53:40Z

I've removed the WIP label. This seems decent enough for me, at least as a first pass. I suspect that we'll change the implementation of ActorFuture eventually, but I think that the user API seems stabilized.

stsievert · 2018-07-29T17:33:22Z

I've drafted a parameter server that uses these Actors: https://gist.github.com/stsievert/7135380e1227236bde03a852cae93a37

I have a question: shouldn't client.gather work on actor futures too? Here's a minimal example of what I mean:

In [1]: from distributed import Client

In [2]: class Dummy:
   ...:     def __init__(self, value):
   ...:         self.value = value
   ...:     def get_value(self):
   ...:         return self.value
   ...:

In [3]: client = Client()

In [4]: futures = client.map(Dummy, [1, 2], actor=True)

In [5]: dummies = client.gather(futures)

In [6]: values = [dummy.get_value() for dummy in dummies]

In [7]: values
Out[7]: [<ActorFuture>, <ActorFuture>]

In [8]: client.gather(values)
Out[8]: [<ActorFuture>, <ActorFuture>]

I expected client.gather(values) == [1, 2], as with regular futures:

In [6]: v = [client.submit(lambda x: x, i) for i in range(2)]

In [7]: v
Out[7]:
[<Future: status: finished, type: int, key: <lambda>-e0ea2bfe577f8de576a4698ab710d5ae>,
 <Future: status: finished, type: int, key: <lambda>-982d509e7b5d6f50f0a2ed3fb2838fcf>]

In [8]: client.gather(v)
Out[8]: [0, 1]

mrocklin · 2018-07-29T17:56:16Z

I have a question: shouldn't client.gather work on actor futures too?

I agree that it would be nice to unify everything and maybe that will happen some day. But that isn't how things work now, and I don't think it's likely to be that way soon. I recommend just using the .result() API
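A minimal sketch of the .result() approach applied to a list of actor futures. The helper name gather_results is hypothetical and not part of distributed; it only assumes that each item exposes a blocking .result() method, as ActorFuture does in the examples above. Standard-library futures are used as stand-ins so the snippet runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def gather_results(futures, timeout=None):
    """Block on each future in order and return the plain values.

    Hypothetical helper: works with anything exposing .result(), e.g.
    gather_results([dummy.get_value() for dummy in dummies]).
    """
    return [future.result(timeout=timeout) for future in futures]


# Stand-in demonstration with standard-library futures instead of ActorFutures.
with ThreadPoolExecutor(max_workers=2) as pool:
    stand_ins = [pool.submit(lambda x: x, i) for i in (1, 2)]
    values = gather_results(stand_ins)
```

Note that this waits on the futures one after another, so the total wait is bounded by the slowest call only if the calls were all issued before the first .result().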

mrocklin · 2018-07-30T18:44:06Z

@jcrist can I ask for your review on this?

jcrist

I only gave this a cursory review - the actual implementation is a bit opaque to me.

As far as api/docs, I'm a bit confused by the barrier between Future and ActorFuture - when is it ok to mix them and when do they need to be separated?

Additionally, overloading the existing api seems odd to me, specifically in graph-level operations. compute(graph, actors=True) seems to indicate that only the end results are actors, not the intermediate values, but I could also interpret things the other way.

Do you see use cases for returning Actor objects from compute/persist? If not, I'd be inclined to not overload the existing methods, and only support something like client.new_actor(cls, *args, **kwargs), which may help make things clearer (at least for me).
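For illustration, the suggested entry point could be layered on top of the existing keyword rather than replacing it. This is only a sketch: new_actor does not exist in distributed, and the code assumes that client.submit(..., actor=True) returns a future whose .result() is the actor handle, as in the example at the top of this PR.

```python
def new_actor(client, cls, *args, **kwargs):
    """Hypothetical convenience wrapper around client.submit(..., actor=True).

    `client` is assumed to behave like distributed.Client; extra keyword
    arguments are forwarded to submit unchanged.
    """
    future = client.submit(cls, *args, actor=True, **kwargs)
    # Block until the actor has been created on a worker, then return its handle.
    return future.result()
```

With a running client this would be called as counter = new_actor(client, Counter).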

jcrist · 2018-07-30T19:50:20Z

distributed/actor.py

+
+    def __dir__(self):
+        o = set(dir(type(self)))
+        o.update({attr for attr in dir(self._cls) if not attr.startswith('_')})


No need for { brackets here

jcrist · 2018-07-30T19:51:54Z

distributed/actor.py

+    def __getattr__(self, key):
+        if not hasattr(self._cls, key):
+            raise AttributeError("%s does not have attribute %s" %
+                                 (type(self).__name__, key))


Should be able to just rely on getattr(self._cls, key) raising this error below.

Thanks, fixed
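Both review points above can be seen in a toy proxy class. This is a sketch under assumptions, not the distributed.Actor implementation: the Proxy class is hypothetical and simply forwards attribute lookups to the wrapped class.

```python
class Proxy:
    """Toy stand-in for an actor handle; forwards lookups to a wrapped class."""

    def __init__(self, cls):
        self._cls = cls

    def __dir__(self):
        o = set(dir(type(self)))
        # set.update accepts any iterable, so a bare generator expression works.
        o.update(attr for attr in dir(self._cls) if not attr.startswith('_'))
        return sorted(o)

    def __getattr__(self, key):
        # getattr raises AttributeError on its own when the name is missing,
        # so no explicit hasattr check is needed first.
        return getattr(self._cls, key)


class Counter:
    n = 0

    def increment(self):
        self.n += 1
        return self.n


proxy = Proxy(Counter)
public_names = dir(proxy)  # includes 'increment' and 'n'
```

One difference worth noting: the error raised by the bare getattr names the wrapped class rather than the proxy type.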

jcrist · 2018-07-30T19:56:11Z

distributed/client.py

        allow_other_workers = kwargs.pop('allow_other_workers', False)
+        actor = kwargs.pop('actor', False)
+        actors = kwargs.pop('actors', False)
+        actor = actor or actors


actor = kwargs.pop('actor', kwargs.pop('actors', False))

Why support both actor and actors here? I'd prefer only a single boolean flag between all functions to keep things consistent.

So actors would be general purpose, but it seemed a bit odd for submit

client.submit(Foo, actors=True)

It also seemed error prone to have one keyword for submit and one for map, so I just used both in both places. I agree that this is wonky though and am not surprised that it would not survive review. Do you have any suggestions? Use actors= everywhere, including submit?
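The two spellings are not quite equivalent, which may matter if both keywords are kept. A small sketch with hypothetical function names, using plain dicts in place of the real kwargs:

```python
def pop_with_or(kwargs):
    # Behaviour of the code under review: either flag enables actors.
    actor = kwargs.pop('actor', False)
    actors = kwargs.pop('actors', False)
    return actor or actors


def pop_nested(kwargs):
    # Suggested one-liner: the inner pop always runs (arguments are evaluated
    # eagerly), but its value is only used when 'actor' is absent.
    return kwargs.pop('actor', kwargs.pop('actors', False))


# Same answer when only one keyword is given.
same = pop_with_or({'actors': True}) == pop_nested({'actors': True})

# Different answer when actor=False is passed explicitly alongside actors=True.
or_result = pop_with_or({'actor': False, 'actors': True})
nested_result = pop_nested({'actor': False, 'actors': True})
```

In both versions the two keys are removed from kwargs, so neither leaks through to the task.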

jcrist · 2018-07-30T19:56:31Z

distributed/client.py

        fifo_timeout = kwargs.pop('fifo_timeout', '100ms')
+        actor = kwargs.pop('actor', False)
+        actors = kwargs.pop('actors', False)
+        actor = actor or actors


actor = kwargs.pop('actor', kwargs.pop('actors', False))

jcrist · 2018-07-30T19:59:18Z

distributed/core.py

+        try:
+            exception = protocol.pickle.loads(exception)
+        except Exception:
+            exception = Exception(exception)


Why are you doing this?

Sometimes exceptions don't come in as serialized bytes, sometimes they come in as just a string of an error message. This is the case when the worker produces a non-serializable Exception (happens sometimes) or when the scheduler needs to return an exception. This came up in testing in this PR. I think that long-term we probably need to have a better structure where we return a message that includes the exception and how the exception is represented. I would prefer not to handle that in this PR though.
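The fallback described here can be sketched in isolation. The helper name load_exception is hypothetical, and the standard pickle module stands in for distributed's protocol.pickle:

```python
import pickle


def load_exception(payload):
    """Return an exception object from pickled bytes or a plain message.

    Hypothetical helper mirroring the try/except in the diff above.
    """
    try:
        exception = pickle.loads(payload)
    except Exception:
        # Payload was not valid pickle data, e.g. a plain error string.
        exception = Exception(payload)
    return exception


# A pickled exception round-trips to its original type.
from_bytes = load_exception(pickle.dumps(ValueError("bad value")))

# A plain string is wrapped in a generic Exception instead.
from_text = load_exception("worker failed")
```

The caller cannot tell from the result which path was taken, which is the gap the longer-term message structure mentioned above would close.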

jcrist · 2018-07-30T20:02:06Z

distributed/scheduler.py

+
+    for ts in ws.actors:
+        if ts.state not in {'memory', 'processing'}:
+            import pdb; pdb.set_trace()


Leftover from debugging.

Thanks, fixed.

mrocklin · 2018-07-30T21:06:52Z

Do you see use cases for returning Actor objects from compute/persist? If not, I'd be inclined to not overload the existing methods, and only support something like client.new_actor(cls, *args, **kwargs), which may help make things clearer (at least for me).

I've just added a test that uses actors with compute. Yes, I think that this is a valid use case.

There is also some maintenance cost to adding new future-creating methods like submit/map/compute/persist. Any new option like retries or restrictions ends up being added to all of these. I'm inclined to reuse when possible.

mrocklin · 2018-07-30T21:07:23Z

As far as api/docs, I'm a bit confused by the barrier between Future and ActorFuture - when is it ok to mix them and when do they need to be separated?

I'll try to add some documentation around this point. Thank you for raising it.

mrocklin · 2018-07-30T22:59:14Z

When I ran through a benchmark with a pseudo-parameter-server workload I found that I was getting latencies in the 5-10ms range, which seems pretty high. This led to some work in profiling the scheduler and worker administrative threads (where I suspect most of the blame lies). Hopefully I can get this down in the future.

mrocklin · 2018-07-30T23:00:14Z

My expectation is that Dask will run somewhere around 1-2ms in the moderate future. We would probably have to look to some more serious changes to get below that (but that's certainly possible as well).

mrocklin force-pushed the actor branch from 513973c to ae5f6a7 Compare July 20, 2018 12:58

stsievert mentioned this pull request Jul 20, 2018

Parameter Server dask/dask-ml#171

Open

mrocklin force-pushed the actor branch from 59caeac to b8b985b Compare July 25, 2018 17:31

mrocklin changed the title ~~[WIP] Stateful Actors~~ Stateful Actors Jul 26, 2018

mrocklin force-pushed the actor branch from 6f16125 to bebf6a5 Compare July 26, 2018 19:35

jcrist reviewed Jul 30, 2018


mrocklin mentioned this pull request Aug 2, 2018

Investigate worker overhead #2156

Open

mrocklin force-pushed the actor branch 3 times, most recently from 140cfc2 to 1ff5c21 Compare August 5, 2018 14:51

shoyer mentioned this pull request Aug 6, 2018

Consolidating all tasks that write to a file on a single worker #2163

Open

mrocklin added 3 commits August 6, 2018 13:30

add direct_to_workers to Client

a3d6659

add Scheduler.proxy to workers

2043f95

Implement Actors

3ab9883

mrocklin force-pushed the actor branch from 6e88657 to 3ab9883 Compare August 6, 2018 17:31

mrocklin merged commit b16ee25 into dask:master Aug 6, 2018

mrocklin deleted the actor branch August 6, 2018 18:01
