Exciting
On Thu, Oct 12, 2023 at 7:10 AM Florian Jetter wrote:
This is a rewrite of the existing dask.order implementation
Closes dask/distributed#8255
This is still a WIP, but for most unit tests it provides equivalent or
strictly better ordering. There are still a couple of failures I have to
address, but they do not look critical and may even point to improper
test logic.
From my perspective this rewrite preserves most of the key logic of the
current algorithm while removing plenty of specialized edge cases. I
will follow up with a more rigorous description of the changes once CI
is green.
TODO:
- Perf testing of the actual order algorithm
- Coiled benchmarks
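For context on what is being rewritten: `dask.order.order` takes a raw task graph and assigns each key an integer priority that a scheduler can use to pick the next runnable task. A minimal sketch of the API (the toy graph below is illustrative, not from this PR):

```python
from dask.order import order

# A tiny raw task graph: two independent producers feeding one reduction.
dsk = {
    "a": (lambda: 1,),
    "b": (lambda: 2,),
    "c": (sum, ["a", "b"]),
}

# order() does not execute anything; it only inspects the dependency
# structure and returns {key: priority}, with lower numbers scheduled first.
priorities = order(dsk)

# Priorities are topologically consistent: a task always has a higher
# priority number than its dependencies, so "c" must come last here.
assert set(priorities) == {"a", "b", "c"}
assert priorities["c"] == 2
```

The interesting part, and what this PR changes, is how ties between runnable tasks are broken so that intermediate results can be released from memory as early as possible.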
You can view, comment on, or merge this pull request online at:
https://github.com/dask/dask/pull/10557

Commit Summary
- 27bb6f5 more fixes to dask.order
- bbfa1d8 Rewrite dask.order

File Changes (3 files: https://github.com/dask/dask/pull/10557/files)
- M dask/base.py (9)
- M dask/order.py (627)
- M dask/tests/test_order.py (75)

Patch Links:
- https://github.com/dask/dask/pull/10557.patch
- https://github.com/dask/dask/pull/10557.diff
dask/tests/test_order.py (Outdated)
    def test_reduce_with_many_common_dependents():
        ndeps = 3
TODO: Add some parametrization for this and n_reducers
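The suggested parametrization could look roughly like this (a sketch only; the concrete values and the test body are placeholders, not the PR's actual code):

```python
import pytest

# Hypothetical parametrization over both the number of common dependents
# and the number of reducers; the value lists here are illustrative.
@pytest.mark.parametrize("ndeps", [1, 3, 7])
@pytest.mark.parametrize("n_reducers", [2, 4])
def test_reduce_with_many_common_dependents(ndeps, n_reducers):
    # Build a reduction graph with `ndeps` common dependents feeding
    # `n_reducers` reducers, then assert on the dask.order output.
    ...
```

Stacking two `parametrize` decorators runs the test over the cross product of both value lists.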
dask/tests/test_order.py (Outdated)

    @pytest.mark.xfail(reason="Why is `cde` a better path? Why even start at a0?")
    def test_switching_dependents(abcde):
I'm a little conflicted about this test case. Given this graph, the test asserts two properties:
1. We start executing at a0.
2. Once b3 is freed, we prefer cde over finishing the a* branch.
I believe 1. is up for interpretation. The current implementation intentionally starts with computation branches that are "tall and narrow", i.e. we should start with a0. From the classical argument about starting with the longest critical path, this also makes sense. The critical-path argument is, however, softened in multi-threaded execution environments, since b3 is a producer of potential work: it unlocks multiple computation branches. Changing this causes other tests to fail that I'd need to investigate a little more before making an assessment on what is truly better.
Moving on to 2., assuming that we do indeed start with a0, the test assertions are a little ambiguous about whether a4 should be executed before cde or after. Regardless, there is an actual case to be made for b3, since it is, strictly speaking, the path of lower memory pressure. However, the reason this is true is that the a branch further splits into two tasks, and assuming equal weight per task, holding onto a7 and a8 before we release b3 is indeed one more task in memory.
This argument is pretty fragile and relies on how the branches split out. If any of the cde branches would further split apart, the argument would no longer be true.
So, all in all, I don't see why we would want to change our goal. Even if we were sensitive enough about the topology to infer how many final splits there are per branch (which we aren't right now and this is difficult/impossible to compute for non-trees), I'm not entirely convinced that this is reason alone to "switch". I could see a stronger case about changing the starting point, though.
Given that there is already a disclaimer that this is specific to the current implementation, I'm inclined to skip this for now and move on to testing. If I decide to go through with this implementation, I will likely delete the test case.
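The "tall and narrow" intuition above can be made concrete with a toy height computation (this is an illustration of the heuristic, not dask's actual code; the graph and the helper are made up):

```python
# Toy illustration of the "tall and narrow first" heuristic: among runnable
# roots, prefer the branch whose longest dependency chain is deepest.
def height(key, dependencies, cache=None):
    """Length of the longest dependency chain at or below `key`."""
    if cache is None:
        cache = {}
    if key not in cache:
        cache[key] = 1 + max(
            (height(d, dependencies, cache) for d in dependencies[key]),
            default=0,
        )
    return cache[key]

# A tall, narrow chain a0 -> a1 -> a2 -> a3 versus a shallow, wide fan-in b.
dependencies = {
    "a3": {"a2"}, "a2": {"a1"}, "a1": {"a0"}, "a0": set(),
    "b": {"c0", "c1", "c2"}, "c0": set(), "c1": set(), "c2": set(),
}

# The chain wins on height (4 vs 2), so a heuristic like this starts at a0.
assert height("a3", dependencies) > height("b", dependencies)
```

As the comment above notes, this kind of static measure says nothing about how a branch splits further down, which is exactly why the argument for switching to b3 is fragile.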
TLDR: The preliminary benchmark results are interesting but not super exciting. So far they are not necessarily convincing enough to move forward with this, but I believe the proposed algorithm is easier to maintain, so I'll look into it a little more. (The one thing that may be worth looking into more closely is test_sum_residual, which shows an impressive relative improvement in memory usage. The test is relatively small scale, so I'll have to review it more closely to assess whether real-world workloads would benefit from this.)

Wall Time

There seems to be a mild general positive trend in wall time. (Please ignore the TPCH improvements; I didn't notice that the benchmark configuration is set up to compare released dask-expr vs main, so I'll have to rerun this.) There are some runtime regressions in things like …

(Peak) Memory Usage

Again, the impact of this PR appears to be overall positive (again, please ignore the TPCH block that is deeply red here). However, the huge relative gains here are relatively small in absolute values, so this is not super exciting.
Ok, just got the finished new benchmark results, which include a couple of performance fixes to the algorithm, and the results look much more robust (also with aligned dask-expr versions for the TPCH stuff).

Wall Time

There is still a thing with the work stealing, but I'm not terribly concerned about that.

(Peak) Memory

The signal around peak memory reduction for double diff and zarr filtering seems to be solid.
Apart from complexity reduction and apparently some memory performance gains, I'm also interested in this implementation because it aligns pretty well with prior ideas to deal with co-assignment and STA (dask/distributed#7141) that ultimately failed because of ordering issues. I may revisit that idea again, but from a dask.order perspective and not as an extension as discussed there.
Force-pushed e34825a to d188da6.
The bad performance in the worst cases stems from the "change of tactical goal", which is still a bit of logic I preserved from earlier. Essentially, we do something expensive (e.g. sorting or diff-ing sets) only to pause this effort shortly afterwards and start again from scratch. This could likely be salvaged at the cost of readability, but until we have more solid large-scale benchmarks, I won't tune this further.
Large-scale benchmarks are still a mixed bag. Memory usage looks similar to the above and is overall the same or better. However, the runtime cost of the ordering algorithm itself seems to be non-negligible, i.e. we cannot merge this as-is. I will pause work on this for a while.