Skip to content

Task objects with Annotations#6217

Closed
sjperkins wants to merge 113 commits intodask:mainfrom
sjperkins:task-objects
Closed

Task objects with Annotations#6217
sjperkins wants to merge 113 commits intodask:mainfrom
sjperkins:task-objects

Conversation

@sjperkins
Copy link
Member

  • Tests added / passed
  • Passes black dask / flake8 dask

@sjperkins sjperkins marked this pull request as draft May 17, 2020 19:47
@sjperkins
Copy link
Member Author

Adds a task object slotted on function, args, kwargs and annotations.

Notes regarding this implementation:

  1. Task.from_spec converts task tuples into Task objects, and recursively applies this conversion on nested python types such as lists and dicts, according to https://docs.dask.org/en/latest/spec.html#definitions.
  2. dsk = Task.from_spec(dsk) lines are hard-coded into dask.local.get_async and dask.core.get. This exists to prove that Task.from_spec successfuly translates dask graphs in the test suite. They should be removed if this PR progresses.
  3. Old style tasks and new style task should be able to intermingle.
  4. Benchmarking is needed, obviously once the lines in (2) have been removed.

with normal dicts and functions which generate dicts
@sjperkins
Copy link
Member Author

sjperkins commented Jul 24, 2020

Just a quick update on this PR: I think the functionality is in a good state.

  1. The test cases pass on this PR and all except one on the associated distributed PR
  2. The local scheduling code handles both Task objects and Task tuples
  3. The distributed scheduler handles both Task objects and Task tuples
  4. The optimization code handles both Task objects and Task tuples

I therefore consider the bulk of the implementation complete. I would still like to:

  1. benchmark
  2. demonstrate some use cases with the distributed scheduler and SchedulerPlugins

@sjperkins
Copy link
Member Author

@mrocklin and others.

Following on from the meeting discussion, I thought I'd try and summarise the various approaches to supporting newer Task style objects:

1 Task Objects and Tuple Tasks with Conversion

Pros:

  • Flexible collection scheduling via scheduler plugins
  • dask Complexity + Maintenance stay the same as only need to deal with new style tasks
  • Supports old task graphs in downstream projects.
  • Blockwise + SubGraphCallable's use new style Task objects

Cons:

  • Task Conversion is an expense

    • When performed in cull for e.g. increases cull runtime by 6X in benchmarks

2 Task Objects and Tuple Tasks, no Conversion

Pros:

  • Flexible collection scheduling via scheduler plugins
  • Supports old task graphs in downstream projects
  • Blockwise + SubGraphCallable's use new style Task objects

Cons:

  • dask Complexity + Maintenance increases due to need to support both style tasks throughout codebase (+downstream)

    • subs benchmarks are ~20% slower supporting both in the current branch

3 Only Task Objects, no Conversion

Pros:

  • Flexible collection scheduling via scheduler plugins
  • Entire dask + distributed code base converted to new style Task objects
  • No runtime task conversion required, I might guess that benchmarks would retain their performance characteristics.
  • dask Complexity + Maintenance stay the same as only need to deal with new style tasks

Cons:

  • Will break low-level graph compatibility for downstream projects (dask 3.0?)

Currently, this PR is a combination of (1) and (2) as it is predicated on not breaking downstream projects. This results in either the expense of Conversion, or the expense of maintaining a code base with both Task objects and Tuple tasks.

I was intrigued by your idea of converting the entire dask code base to Task objects. The only downside to this that I can see is breaking downstream projects that use Tuple Tasks. However, the new graph format (dask 3.0?) would be simpler and I'd be interested in going forward with this approach. Another thought: the task conversion process (Task.from_spec, Task.from_call) could simply be repurposed as an upgrade path.

To summarise: This PR is at a three-way cross-roads, described above and it would be useful to identify what is desirable and what can be discarded, before going forward.

I'd be interested in hearing your thoughts on the subject. Perhaps an initial review is worthwhile at this point?

/cc @o-smirnov, @landmanbester, @JSKenyon @kkraus14 @madsbk

@sjperkins
Copy link
Member Author

I'm aware of and following #5644 and #6438

@sjperkins
Copy link
Member Author

Note that dask/distributed#2180 is the distributed sister PR and I've configured the CI in both PR's to refer to the opposite PR.

@mrocklin mrocklin mentioned this pull request Oct 2, 2020
Base automatically changed from master to main March 8, 2021 20:19
@jsignell
Copy link
Member

@sjperkins is this PR still useful as a proof of concept given that there are now task annotations?

@sjperkins
Copy link
Member Author

@jsignell I don't think I'll take it further, but it might be interesting for historical reasons. This would depend on the sjperkins fork remaining. I don't object to closing the PR.

@mrocklin mrocklin closed this Jun 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants