
[WIP] Rewrite dask order#10557

Closed
fjetter wants to merge 14 commits into dask:main from fjetter:rewrite_dask_order

Conversation

@fjetter
Member

@fjetter fjetter commented Oct 12, 2023

This is a rewrite of the existing dask.order implementation

Closes dask/distributed#8255

This is still WIP but for most unit tests it provides equivalent or strictly better ordering. There are still a couple of failures I have to address but the failures do not look critical and may even point to improper test logic.

From my perspective, this rewrite preserves most of the key logic of the current algorithm while removing many specialized edge cases. I will follow up with a more rigorous description of the changes once CI is green.

TODO:

  • Perf testing of the actual order algorithm
  • Coiled benchmarks

@mrocklin
Member

mrocklin commented Oct 12, 2023 via email



```python
def test_reduce_with_many_common_dependents():
    ndeps = 3
```
Member Author


TODO: Add some parametrization for this and n_reducers
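For reference, here's a sketch of what that parametrization could look like. The parameter values and the graph builder are hypothetical; the builder only mimics the shape of a reduction where every reducer shares the same common dependencies:

```python
import pytest


def build_reduce_graph(ndeps, n_reducers):
    """Toy dask-style graph dict: every reducer depends on the
    same ``ndeps`` common dependency tasks."""
    dsk = {f"dep-{i}": (dict,) for i in range(ndeps)}  # placeholder tasks
    common = tuple(dsk)
    for r in range(n_reducers):
        dsk[f"reduce-{r}"] = (dict, *common)
    return dsk


# Hypothetical parametrization; the values are illustrative only.
@pytest.mark.parametrize("ndeps", [1, 3, 7])
@pytest.mark.parametrize("n_reducers", [2, 5])
def test_reduce_with_many_common_dependents(ndeps, n_reducers):
    dsk = build_reduce_graph(ndeps, n_reducers)
    assert len(dsk) == ndeps + n_reducers
```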

Comment on lines 664 to 665
```python
@pytest.mark.xfail(reason="Why is `cde` a better path? Why even start at a0?")
def test_switching_dependents(abcde):
```
Member Author


I'm a little conflicted about this test case. Given this graph, the test asserts two properties

  1. We start executing at a0
  2. Once b3 is freed we are preferring cde over finishing the a* branch

I believe 1) is up for interpretation. The current implementation intentionally starts with computation branches that are "tall and narrow", i.e. we should start with a0. This also makes sense from the classical argument of starting with the longest critical path. That argument is softened, however, in multi-threaded execution environments, since b3 is a producer of potential work: it unlocks multiple computation branches. Changing this causes other tests to fail that I'd need to investigate a little more before assessing what is truly better.

Moving on to 2), assuming that we are indeed starting with a0, the test assertions are a little ambiguous about whether a4 should be executed before cde or after. Regardless, there is an actual case to be made for b3, since it is, strictly speaking, the path of lower memory pressure. The reason this is true, however, is that the a branch further splits into two tasks: assuming equal weight per task, holding onto a7 and a8 before we release b3 is indeed one more task in memory.
This argument is pretty fragile and relies on how the branches split out. If any of the cde branches split further, the argument would no longer hold.

So, all in all, I don't see why we would want to change our goal. Even if we were sensitive enough to the topology to infer how many final splits there are per branch (which we aren't right now, and this is difficult or impossible to compute for non-trees), I'm not convinced that this alone is reason to "switch". I could see a stronger case for changing the starting point, though.

Given that there is already a disclaimer that this is specific to the current implementation, I'm inclined to skip this for now and move on to testing. If I decide to go through with this implementation, I will likely delete the test case.
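As a self-contained illustration of the "tall and narrow first" idea (this is not the dask.order implementation, and the graph below is only loosely shaped like the one in the test): picking a starting point by the height of the dependency chain below each output naturally prefers the long a* branch:

```python
def height(dsk, key, cache=None):
    """Length of the longest dependency chain at or below ``key``
    in a toy graph mapping key -> set of dependencies."""
    if cache is None:
        cache = {}
    if key not in cache:
        deps = dsk[key]
        cache[key] = 1 + (max(height(dsk, d, cache) for d in deps) if deps else 0)
    return cache[key]


# Toy DAG: a tall, narrow "a" chain and a shorter "b" branch.
dsk = {
    "a0": set(), "a1": {"a0"}, "a2": {"a1"}, "a3": {"a2"},
    "b0": set(), "b1": {"b0"}, "b3": {"b1"},
}
outputs = ["a3", "b3"]
start = max(outputs, key=lambda k: height(dsk, k))
print(start)  # -> a3, the taller branch
```

The actual heuristic is more involved, and this longest-critical-path argument is exactly what the comment above notes is softened under multi-threaded execution, but it captures why a0 is the intended starting point.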

@fjetter
Member Author

fjetter commented Oct 13, 2023

TLDR: The preliminary benchmark results are interesting but not super exciting. The results so far are not convincing enough on their own to move forward with this, but I believe the proposed algorithm is easier to maintain, so I'll look into it a little more.
There are a couple of perf issues with this implementation that I'll need to review (there are a couple of very hot loops) before rerunning. If I can stabilize performance, this may still be interesting.

(The one thing that may be worth a closer look is test_sum_residual, which shows an impressive relative improvement in memory usage. The test is relatively small scale, so I'll have to review it more closely to assess whether real-world workloads would benefit.)

Wall Time

There seems to be a mild overall positive trend in wall time.

(Please ignore the TPCH improvements. I didn't notice that the benchmark is configured to compare released dask-expr vs main, so I'll have to rerun this.)

There are some runtime regressions in things like test_trivial_workstealing, which amounts to only two or three seconds in absolute terms. I'm currently reviewing the code, and there are some perf issues that I think cause this. They should be gone in the next run as well.

[image: wall time benchmark comparison]

(Peak) Memory Usage

Again, the impact of this PR appears to be overall positive (again, please ignore the TPCH block that is deeply red here).

However, the large relative gains here are small in absolute terms, so this is not super exciting.

[image: peak memory benchmark comparison]

@fjetter
Member Author

fjetter commented Oct 13, 2023

Ok, I just got the new benchmark results, which include a couple of performance fixes to the algorithm, and they look much more robust (also with aligned dask-expr versions for the TPCH stuff).

Wall Time

There is still an issue with the work-stealing benchmark, but I'm not terribly concerned about that.

[image: wall time benchmark comparison]

(Peak) Memory

[images: peak memory benchmark comparisons]

The signal around peak memory reduction for double diff and zarr filtering seems to be solid

@fjetter
Member Author

fjetter commented Oct 13, 2023

Apart from the complexity reduction and some apparent memory performance gains, I'm also interested in this implementation because it aligns pretty well with prior ideas for dealing with co-assignment and STA (dask/distributed#7141) that ultimately failed because of ordering issues. I may revisit that idea, but from a dask.order perspective rather than as the extension discussed there.

@fjetter fjetter force-pushed the rewrite_dask_order branch from e34825a to d188da6 on October 17, 2023
@fjetter
Member Author

fjetter commented Oct 18, 2023

The performance of the initial implementation was not good enough to be usable. I have applied some optimizations and think it is fine now: it is slower in most cases and faster in a few, but either way not by orders of magnitude.

For this optimization I used the benchmarks in the dask-benchmarks repo.

[image: dask-benchmarks timing comparison]

I'm currently waiting for updated large scale benchmarks.

@fjetter
Member Author

fjetter commented Oct 18, 2023

The bad performance in the worst cases stems from the "change of tactical goal", a bit of logic I preserved from the earlier implementation. Essentially, we do something expensive (e.g. sorting or diffing sets) only to pause that effort shortly afterwards and start again from scratch. This could likely be salvaged at the cost of readability, but until we have more solid large-scale benchmarks I won't tune this further.
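To illustrate the kind of cost involved (a generic sketch, not the actual dask.order code): recomputing a set difference inside a hot loop costs O(n) per iteration, while maintaining the set incrementally is amortized O(1) per step:

```python
def remaining_recompute(all_tasks, finished_in_order):
    """Recompute the full set difference at every step: O(n) per step."""
    out = []
    done = set()
    for t in finished_in_order:
        done.add(t)
        out.append(len(all_tasks - done))  # fresh O(n) diff each iteration
    return out


def remaining_incremental(all_tasks, finished_in_order):
    """Maintain the remaining set incrementally: O(1) amortized per step."""
    out = []
    remaining = set(all_tasks)
    for t in finished_in_order:
        remaining.discard(t)
        out.append(len(remaining))
    return out


# Both strategies produce identical results; only the cost differs.
tasks = set(range(1000))
order = list(range(1000))
assert remaining_recompute(tasks, order) == remaining_incremental(tasks, order)
```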

@fjetter
Member Author

fjetter commented Oct 18, 2023

Large-scale benchmarks are still a mixed bag. Memory usage looks similar to the above, overall the same or better. However, the runtime of the algorithm itself is not negligible, i.e. we cannot merge this as is.

I will pause work on this for a while

@fjetter
Member Author

fjetter commented Dec 13, 2023

@fjetter fjetter closed this Dec 13, 2023


Development

Successfully merging this pull request may close these issues.

Regression in test_decide_worker_coschedule_order_neighbors

2 participants