
[WIP] Rewrite dask order#10557

Closed
fjetter wants to merge 14 commits into dask:main from fjetter:rewrite_dask_order

Conversation

@fjetter
Member

@fjetter fjetter commented Oct 12, 2023

This is a rewrite of the existing dask.order implementation

Closes dask/distributed#8255

This is still WIP but for most unit tests it provides equivalent or strictly better ordering. There are still a couple of failures I have to address but the failures do not look critical and may even point to improper test logic.

From my perspective, this rewrite preserves most of the key logic of the current algorithm while removing many specialized edge cases. I will follow up with a more rigorous description of the changes once CI is green.

TODO:

  • Perf testing of the actual order algorithm
  • Coiled benchmarks

@mrocklin
Member

mrocklin commented Oct 12, 2023 via email



```python
def test_reduce_with_many_common_dependents():
    ndeps = 3
```
Member Author


TODO: Add some parametrization for this and n_reducers
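For reference, here's a sketch of what that parametrization could look like. The parameter values and the graph builder are hypothetical; the builder only mimics the shape of a reduction where every reducer shares the same common dependencies:

```python
import pytest


def build_reduce_graph(ndeps, n_reducers):
    """Toy dask-style graph dict: every reducer depends on the
    same ``ndeps`` common dependency tasks."""
    dsk = {f"dep-{i}": (dict,) for i in range(ndeps)}  # placeholder tasks
    common = tuple(dsk)
    for r in range(n_reducers):
        dsk[f"reduce-{r}"] = (dict, *common)
    return dsk


# Hypothetical parametrization; the values are illustrative only.
@pytest.mark.parametrize("ndeps", [1, 3, 7])
@pytest.mark.parametrize("n_reducers", [2, 5])
def test_reduce_with_many_common_dependents(ndeps, n_reducers):
    dsk = build_reduce_graph(ndeps, n_reducers)
    assert len(dsk) == ndeps + n_reducers
```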

Comment on lines 664 to 665
```python
@pytest.mark.xfail(reason="Why is `cde` a better path? Why even start at a0?")
def test_switching_dependents(abcde):
```
Member Author


I'm a little conflicted about this test case. Given this graph, the test asserts two properties

  1. We start executing at a0
  2. Once b3 is freed we are preferring cde over finishing the a* branch

I believe 1) is up for interpretation. The current implementation intentionally starts with computation branches that are "tall and narrow", i.e. we should start with a0. This also makes sense from the classical argument of starting with the longest critical path. That argument is softened, however, in multi-threaded execution environments, since b3 is a producer of potential work: it unlocks multiple computation branches. Changing this causes other tests to fail that I'd need to investigate a little more before assessing what is truly better.

Moving on to 2), assuming that we are indeed starting with a0, the test assertions are a little ambiguous about whether a4 should be executed before cde or after. Regardless, there is an actual case to be made for b3, since it is, strictly speaking, the path of lower memory pressure. The reason this is true, however, is that the a branch further splits into two tasks: assuming equal weight per task, holding onto a7 and a8 before we release b3 is indeed one more task in memory.
This argument is pretty fragile and relies on how the branches split out. If any of the cde branches split further, the argument would no longer hold.

So, all in all, I don't see why we would want to change our goal. Even if we were sensitive enough to the topology to infer how many final splits there are per branch (which we aren't right now, and this is difficult or impossible to compute for non-trees), I'm not convinced that this alone is reason to "switch". I could see a stronger case for changing the starting point, though.

Given that there is already a disclaimer that this is specific to the current implementation, I'm inclined to skip this for now and move on to testing. If I decide to go through with this implementation, I will likely delete the test case.
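As a self-contained illustration of the "tall and narrow first" idea (this is not the dask.order implementation, and the graph below is only loosely shaped like the one in the test): picking a starting point by the height of the dependency chain below each output naturally prefers the long a* branch:

```python
def height(dsk, key, cache=None):
    """Length of the longest dependency chain at or below ``key``
    in a toy graph mapping key -> set of dependencies."""
    if cache is None:
        cache = {}
    if key not in cache:
        deps = dsk[key]
        cache[key] = 1 + (max(height(dsk, d, cache) for d in deps) if deps else 0)
    return cache[key]


# Toy DAG: a tall, narrow "a" chain and a shorter "b" branch.
dsk = {
    "a0": set(), "a1": {"a0"}, "a2": {"a1"}, "a3": {"a2"},
    "b0": set(), "b1": {"b0"}, "b3": {"b1"},
}
outputs = ["a3", "b3"]
start = max(outputs, key=lambda k: height(dsk, k))
print(start)  # -> a3, the taller branch
```

The actual heuristic is more involved, and this longest-critical-path argument is exactly what the comment above notes is softened under multi-threaded execution, but it captures why a0 is the intended starting point.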

@fjetter
Member Author

fjetter commented Oct 13, 2023

TLDR: The preliminary benchmark results are interesting but not super exciting. The results so far are not convincing enough on their own to move forward with this, but I believe the proposed algorithm is easier to maintain, so I'll look into it a little more.
There are a couple of perf issues with this implementation that I'll need to review (there are a couple of very hot loops) before rerunning. If I can stabilize performance, this may still be interesting.

(The one thing that may be worth a closer look is test_sum_residual, which shows an impressive relative improvement in memory usage. The test is relatively small scale, so I'll have to review it more closely to assess whether real-world workloads would benefit.)

Wall Time

There seems to be a mild overall positive trend in wall time.

(Please ignore the TPCH improvements. I didn't notice that the benchmark is configured to compare released dask-expr vs main, so I'll have to rerun this.)

There are some runtime regressions in things like test_trivial_workstealing, which amounts to only two or three seconds in absolute terms. I'm currently reviewing the code, and there are some perf issues that I think cause this. They should be gone in the next run as well.

[image: wall time benchmark comparison]

(Peak) Memory Usage

Again, the impact of this PR appears to be overall positive (again, please ignore the TPCH block that is deeply red here).

However, the large relative gains here are small in absolute terms, so this is not super exciting.

[image: peak memory benchmark comparison]

@fjetter
Member Author

fjetter commented Oct 13, 2023

Ok, I just got the new benchmark results, which include a couple of performance fixes to the algorithm, and they look much more robust (also with aligned dask-expr versions for the TPCH stuff).

Wall Time

There is still an issue with the work-stealing benchmark, but I'm not terribly concerned about that.

[image: wall time benchmark comparison]

(Peak) Memory

[images: peak memory benchmark comparisons]

The signal around peak memory reduction for double diff and zarr filtering seems to be solid

@fjetter
Member Author

fjetter commented Oct 13, 2023

Apart from the complexity reduction and some apparent memory performance gains, I'm also interested in this implementation because it aligns pretty well with prior ideas for dealing with co-assignment and STA (dask/distributed#7141) that ultimately failed because of ordering issues. I may revisit that idea, but from a dask.order perspective rather than as the extension discussed there.

@fjetter fjetter force-pushed the rewrite_dask_order branch from e34825a to d188da6 on October 17, 2023
@fjetter
Member Author

fjetter commented Oct 18, 2023

The performance of the initial implementation was not good enough to be usable. I have applied some optimizations and think it is fine now: it is slower in most cases and faster in a few, but either way not by orders of magnitude.

For this optimization I used the benchmarks in the dask-benchmarks repo.

[image: dask-benchmarks timing comparison]

I'm currently waiting for updated large scale benchmarks.

@fjetter
Member Author

fjetter commented Oct 18, 2023

The bad performance in the worst cases stems from the "change of tactical goal", a bit of logic I preserved from the earlier implementation. Essentially, we do something expensive (e.g. sorting or diffing sets) only to pause that effort shortly afterwards and start again from scratch. This could likely be salvaged at the cost of readability, but until we have more solid large-scale benchmarks I won't tune this further.
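To illustrate the kind of cost involved (a generic sketch, not the actual dask.order code): recomputing a set difference inside a hot loop costs O(n) per iteration, while maintaining the set incrementally is amortized O(1) per step:

```python
def remaining_recompute(all_tasks, finished_in_order):
    """Recompute the full set difference at every step: O(n) per step."""
    out = []
    done = set()
    for t in finished_in_order:
        done.add(t)
        out.append(len(all_tasks - done))  # fresh O(n) diff each iteration
    return out


def remaining_incremental(all_tasks, finished_in_order):
    """Maintain the remaining set incrementally: O(1) amortized per step."""
    out = []
    remaining = set(all_tasks)
    for t in finished_in_order:
        remaining.discard(t)
        out.append(len(remaining))
    return out


# Both strategies produce identical results; only the cost differs.
tasks = set(range(1000))
order = list(range(1000))
assert remaining_recompute(tasks, order) == remaining_incremental(tasks, order)
```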

@fjetter
Member Author

fjetter commented Oct 18, 2023

Large-scale benchmarks are still a mixed bag. Memory usage looks similar to the above, overall the same or better. However, the runtime of the algorithm itself is not negligible, i.e. we cannot merge this as is.

I will pause work on this for a while

@fjetter
Member Author

fjetter commented Dec 13, 2023

@fjetter fjetter closed this Dec 13, 2023


Development

Successfully merging this pull request may close these issues.

Regression in test_decide_worker_coschedule_order_neighbors

2 participants