Refactor scheduler to use TaskState objects rather than dictionaries #1594

Merged
pitrou merged 68 commits into dask:master from pitrou:scheduler_state_refactor on Dec 11, 2017

Conversation

@pitrou (Member) commented Nov 29, 2017

This refactors the scheduler to use TaskState, WorkerState, and ClientState objects rather than a forest of dictionaries. It was originally planned in order to make Dask more amenable to acceleration with compilers like PyPy and Cython, but is also probably a more accessible organization for new developers.

See #854
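For readers skimming the diff, the gist of the change can be sketched as plain Python classes. This is a rough sketch only, with attribute names taken from the snippets quoted in this thread; the real definitions in distributed/scheduler.py carry many more fields:

```python
class TaskState:
    """Sketch: one object per task, replacing parallel per-field dicts."""
    __slots__ = ("key", "state", "dependencies", "dependents",
                 "waiters", "who_wants", "exception_blame")

    def __init__(self, key):
        self.key = key
        self.state = "released"
        self.dependencies = set()   # TaskState objects this task needs
        self.dependents = set()     # TaskState objects that need this task
        self.waiters = set()        # dependents still waiting on this task
        self.who_wants = set()      # ClientState objects wanting the result
        self.exception_blame = None

# Dependencies now link objects directly instead of going through keys:
a, b = TaskState("a"), TaskState("b")
b.dependencies.add(a)
a.dependents.add(b)
```

Transitions then mutate these objects in place instead of updating half a dozen dictionaries keyed by task key.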

@mrocklin (Member) left a comment

In general what is here seems sensible to me. I suspect that it will improve readability for others as well.

Comment thread distributed/scheduler.py Outdated
@@ -239,78 +474,115 @@ def __init__(
        # Communication state
        self.loop = loop or IOLoop.current()
        self.worker_comms = dict()
        # XXX rename to client_comms?
Member:

+1

Comment thread distributed/scheduler.py Outdated
  self.log_event(['all', address], {'action': 'remove-worker',
                                    'worker': address,
-                                   'processing-tasks': self.processing[address]})
+                                   'processing-tasks': ws.processing})
Member:

It occurs to me that we should make a copy of the dict here (unrelated to this PR though).
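A minimal, self-contained illustration of why a copy would matter here (made-up values; the real call logs `ws.processing`):

```python
# Without dict(...), the logged event would alias the live mapping and
# keep changing after the event is recorded.
processing = {"task-1": 1.5, "task-2": 0.3}     # made-up task durations
event = {"action": "remove-worker",
         "processing-tasks": dict(processing)}  # snapshot copy
processing.clear()                              # scheduler moves on
assert event["processing-tasks"] == {"task-1": 1.5, "task-2": 0.3}
```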

Comment thread distributed/scheduler.py Outdated
recommendations[dts.key] = 'released'

if not ts.waiters and not ts.who_wants:
# XXX what about 'fire-and-forget'?
Member:

Fire-and-forget should just be another client in ts.who_wants, no?

Member Author:

Well, yes, but fire-and-forget doesn't require the task to remain alive.

Comment thread distributed/scheduler.py
"""
Transform a dict of {task state: value} into a dict of {task key: value}.
"""
return {ts.key: value for ts, value in task_dict.items()}
Member:

I suppose that these will eventually go away as you work through more of the peripheral modules?

Member Author:

Ideally yes. Some, though, are used a lot in tests, so that will require a fair bit of manual fixing (unless there's a way to automate the changes).
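A self-contained sketch of what these compatibility helpers do, with TaskState reduced to a stub and a hypothetical helper name:

```python
class TaskState:                 # stub for illustration only
    def __init__(self, key):
        self.key = key

def task_key_dict(task_dict):    # hypothetical name for the helper quoted above
    """Transform a dict of {task state: value} into {task key: value}."""
    return {ts.key: value for ts, value in task_dict.items()}

d = {TaskState("x"): 1, TaskState("y"): 2}
assert task_key_dict(d) == {"x": 1, "y": 2}
```

Legacy callers (and tests) that expect key-indexed dicts can keep working against such shims until they are ported to TaskState objects.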

Comment thread distributed/bokeh/scheduler_html.py Outdated
processing = 0
waiting = 0
waiting_data = 0
for ts in scheduler.task_states.values():
Member:

This is a little unfortunate. I'm curious to see in what other situations we lose constant time diagnostics.

Member Author:

AFAIR, diagnostics and bokeh are the only places.

@mrocklin (Member) commented Dec 7, 2017

@pitrou in your opinion what remains to be done here?

Some questions:

  1. Are you satisfied with current scheduling state documentation? If so then I may take a pass over it.
  2. Are there legacy mappings that we should add back in for a release or two?

@pitrou (Member, Author) commented Dec 7, 2017

> Are you satisfied with current scheduling state documentation? If so then I may take a pass over it.

Yes, but I'd appreciate your reviewing it.

> Are there legacy mappings that we should add back in for a release or two?

Yes, I removed some of them when I had finished suppressing all usage of them, but that was probably a mistake.

> what remains to be done here?

Other than the above, nothing IMHO.

Comment thread docs/source/scheduling-state.rst Outdated

These are the values that will eventually be sent to a worker when the task
is ready to run.
.. class:: TaskState
Member:

Would it be possible to move this documentation to the class docstring?

Member Author:

I'll try to. Hopefully that will render correctly.

@mrocklin (Member) commented Dec 7, 2017

> Yes, I removed some of them when I had finished suppressing all usage of them but that was probably a mistake.

If you have time to add these back in that would be helpful.

I'm proposing a micro-release by the end of the day. After that I'm comfortable merging this. I might take another pass through tests to see if there are other things to clean up.

@pitrou (Member, Author) commented Dec 7, 2017

> If you have time to add these back in that would be helpful.

I can do that on Monday.

@mrocklin mrocklin changed the title Scheduler state refactor Refactor scheduler to use TaskState objects rather than dictionaries Dec 7, 2017
@mrocklin (Member) commented Dec 8, 2017

In the future if memory becomes an issue then we might consider using tuples or lists instead of sets for some of the TaskState attributes. In the common case the number of dependencies/dependents can be quite small and our most common operation is iterating over the collection.
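For scale, the per-collection overhead is easy to check with `sys.getsizeof` (CPython container overhead only; exact numbers vary by version and platform):

```python
import sys

deps = ["task-%d" % i for i in range(3)]   # a typically small dependency list
set_size = sys.getsizeof(set(deps))
tuple_size = sys.getsizeof(tuple(deps))

# Sets pay for a sparse hash table; tuples store just the item pointers,
# at the cost of O(n) membership tests and no cheap add/discard.
print(set_size, tuple_size)
```

For a handful of dependencies, iterating a tuple is as fast as iterating a set, so the trade-off only bites if add/discard/membership operations are frequent.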

@pitrou (Member, Author) commented Dec 8, 2017

That's a possibility. Though if you look at the measurements above I wonder if other things come into play - 10 kB per task doesn't really match the basic size of a TaskState... and the sets were already there before (they were dict values rather than attribute values).

@mrocklin (Member) commented Dec 8, 2017

Do you have any suggestions on how to determine the origin of memory costs?

@pitrou (Member, Author) commented Dec 8, 2017

First I would disable or minimize any sort of persistent logging (transition_log etc.). Then perhaps tracemalloc can help diagnose what is going on.
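A minimal `tracemalloc` session along those lines (the workload here is a stand-in, not the scheduler):

```python
import tracemalloc

tracemalloc.start()
# Stand-in for the workload under investigation,
# e.g. running update_graph() on a large graph.
data = [{"key": "task-%d" % i} for i in range(10000)]
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Show the top allocation sites by total size.
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```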

@pitrou (Member, Author) commented Dec 8, 2017

Btw, disabling stealing seems to make runtimes much more predictable and memory consumption lower as well.

@pitrou (Member, Author) commented Dec 8, 2017

If I monitor memory consumption just before and just after update_graph(), I get the following for 65535 tasks:

  • before: 194 MB
  • after: 342 MB (and that number doesn't really grow afterwards)

so each task actually costs 2.3 kB in the scheduler itself (notwithstanding stealing and other stuff).

More generally, I think any potentially costly operation could be wrapped in CPU and memory measurements, so that we have the option of logging what happens.

Update: with N=32768, the benchmark creates 65536 tasks not 32767 :-)
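One way to sketch the "wrap costly operations in CPU and memory measurements" idea using only the stdlib (a hypothetical helper, not part of this PR):

```python
import contextlib
import time
import tracemalloc

@contextlib.contextmanager
def measured(label):
    """Log wall time and peak allocated memory for a block of code."""
    tracemalloc.start()
    t0 = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print("%s: %.3f s, peak %.1f KiB" % (label, elapsed, peak / 1024))

# Stand-in for wrapping something like update_graph():
with measured("build 65536 task dicts"):
    tasks = [{"key": i} for i in range(65536)]
```

Such a wrapper could be enabled conditionally so that the cost of measurement itself stays out of production runs.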

@mrocklin (Member):
OK to merge from me.

@pitrou pitrou merged commit 8404684 into dask:master Dec 11, 2017
@pitrou pitrou deleted the scheduler_state_refactor branch December 11, 2017 15:14
azjps added a commit to azjps/dask-drmaa that referenced this pull request Mar 15, 2018
In dask/distributed#1594, the scheduler's
internal maps of task objects were changed from using their keys to
using TaskState objects. However, dask_drmaa.Adaptive was still
querying for keys, causing new workers to never find the memory
resource constraints for pending tasks and consequently tasks
to never find workers with sufficient resources. This was causing
the unit test test_adaptive_memory to wait indefinitely.

Try to fix this to support both distributed pre- and post- 1.21.0,
and un-skip test_adaptive_memory.
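The pre-/post-1.21.0 compatibility can be sketched as follows (hypothetical helper; the actual fix lives in dask_drmaa.Adaptive):

```python
def task_key(task):
    """Return the plain key whether the scheduler hands us a string key
    (distributed < 1.21.0) or a TaskState object (>= 1.21.0)."""
    return getattr(task, "key", task)

class FakeTaskState:             # stand-in for distributed's TaskState
    def __init__(self, key):
        self.key = key

assert task_key("x") == "x"                  # pre-1.21.0 style
assert task_key(FakeTaskState("x")) == "x"   # post-1.21.0 style
```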
jakirkham pushed a commit to dask/dask-drmaa that referenced this pull request Mar 24, 2018
* Compatibility fixes with distributed 1.21.3

- Support passing kwargs to distributed.Adaptive.__init__, which now
  takes keyword arguments like minimum and maximum [number of workers].
- Add an optional workers argument to _retire_workers() to match
  dask/distributed#1797 -- currently Adaptive
  raises a TypeError.

* Adaptive memory resource compatibility fix for distributed==1.21.0

(Commit body identical to the azjps commit message quoted above.)

* basestring -> six.string_types

(Was testing on python2, switching to python2/3-compatible)

* Add six to requirements.txt

Also a couple of miscellaneous comments, including
Windows-specific comment for running docker-based tests.

* Undo windows comments (moving to a separate PR)

* Drop support for distributed < 1.21.0

Update requirements.txt to require distributed >= 1.21.0,
since there are some internal changes in the way tasks
are stored. Also drop the corresponding backwards-
compatibility fixes. Feel free to revert if
distributed 1.20.x support is desired.
Comment thread distributed/scheduler.py
for ts in touched_tasks:
    for dts in ts.dependencies:
        if dts.exception_blame:
            ts.exception_blame = dts.exception_blame
            recommendations[key] = 'erred'
Member:

Guessing key here should now be ts.key. Fixing in PR (#1900).
