Remove source and target optimization in array.store #9732
Conversation
Ah I should have added, since

I think conda-forge is having a difficult day today.

This PR slows down my processing compared to dask 2022.12.0. With dask 2022.12.0, my script takes 09:18.68. With this PR it takes 10:14.39. I will try to come up with an MCVE that reproduces the slowdown.
@gerritholl I'll check out your satpy-related issues and hopefully do some profiling today. It may be something satpy is doing against "best practice", but I'm surprised that this PR would have worsened performance unless my assumptions about where graph optimizations are done (when compute is called) are wrong.

Question for @mrocklin and @jakirkham (or others): if I give a dask array to a Delayed function and then compute the result, is the graph optimized any differently than if I had computed the dask array directly? Put another way, does optimizing a dask array graph mean additional optimizations are considered that aren't considered for a Delayed function? For Gerrit's issue, he's having trouble reproducing the slow results he was getting before. I'm going to try some other use cases with Satpy to see if I can reproduce weird results between 2022.12.0 and this PR.
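To make the question concrete, here is a minimal sketch of the two compute paths being compared (a hypothetical example, not code from the PR): computing a dask array directly goes through the Array optimizer, while wrapping it in a delayed call routes the combined graph through the Delayed optimizer instead.

```python
import dask
import dask.array as da

x = da.ones((10, 10), chunks=(5, 5)) + 1

# Path 1: computing the array directly -- the graph is optimized with
# Array.__dask_optimize__ (e.g. task fusion, slicing optimizations).
direct = float(x.sum().compute())

# Path 2: the same array fed into a delayed function -- the combined
# graph is optimized with Delayed.__dask_optimize__ instead, so the
# embedded array layers don't get array-specific optimizations.
wrapped = dask.delayed(float)(x.sum()).compute()

assert direct == wrapped  # same result, different optimization path
```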
```python
sources_layer = Array.__dask_optimize__(
    sources_hlg, list(core.flatten([e.__dask_keys__() for e in sources]))
)
```
In PR #2980, Matt suggested we add this to ensure Array optimizations are applied before computation, since they wouldn't be for a Delayed object ( #2980 (comment) ). So presumably, if this is dropped, these optimizations would need to be handled another way.
That said, I have not closely followed the changes to graphs, optimizations, object specializations, etc. in Dask, so maybe best practices have changed. Maybe @rjzamora would know 🙂
I think that's what I've been seeing, but results still aren't as expected. I tried doing `dask.config.set(delayed_optimize=da.optimization.optimize)` in hopes of making the Delayed objects get optimized as if they were Arrays (which they are below the top two levels or so of the graph). But even with this I still saw worse performance than dask main with my own Satpy-based example.
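For reference, that configuration attempt looks roughly like this (a sketch; whether the array optimizer is actually appropriate for a mixed Delayed/Array graph is exactly the open question here):

```python
import dask
import dask.array as da
from dask.array.optimization import optimize as array_optimize  # same as da.optimization.optimize

x = da.ones((10, 10), chunks=(5, 5))
d = dask.delayed(float)(x.sum())

# Override the optimizer used for Delayed objects so the embedded
# array layers are optimized as if the whole graph were an Array graph.
with dask.config.set(delayed_optimize=array_optimize):
    result = d.compute()
```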
Overall, from a high level, my hope is that there would be something we could do so that no matter where these graphs appear inside other graphs (as inputs to delayed objects, etc.), they all optimize the same way when computed/optimized with other graphs. 🤞
Admittedly, the PR linked above and the associated discussion are four years old. It wouldn't surprise me if best practices had changed (potentially a lot) in the intervening time. So I would defer to others more involved with HLG (like Rick) on what those are today.
@jakirkham to get this PR moving forward, or at least the discussion, who are the specific people you think could give insight on this problem (can you mention them here)? The problem being: we want to optimize a dask graph consisting of Delayed objects that may have dask arrays embedded in them, with the hope that those arrays could be optimized in the same way regardless of whether the final task is a Delayed object or not.
Sorry not to be very helpful here. Will defer to @mrocklin on next steps
As discussed in #8380, there are cases where a user may want to store multiple sources or store to multiple targets that share tasks somewhere in their history. In current `main`, calling `da.store` with `compute=False` results in the individual dask graphs being optimized (fused tasks, renamed tasks, etc.), which means future computations won't be able to combine/share tasks any more. In simplified code this looks like:

As I was working on a fix for this, I realized that I don't think there is any need for `da.store` to optimize anything before final computation. All computation will optimize the graphs anyway, so the only benefit to early optimization is reducing the size of the graphs/dictionaries that are passed around internally in `da.store` (a small benefit, if I'm not mistaken).

Lastly, a question on dask PR best practice: what are the feelings about including refactorings in a PR alongside bug fixes? For example, the `da.store` function could be split into a number of sub-functions, which would reduce its cyclomatic complexity and make it easier to read, but it could make the PR as a whole harder to review.

`pre-commit run --all-files`
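The "simplified code" snippet referenced in the description above was lost in extraction; the behavior it described might be sketched something like this (a hypothetical reconstruction, using plain NumPy arrays as store targets):

```python
import numpy as np
import dask
import dask.array as da

# Two computations that share tasks in their history (both depend on x + 1):
x = da.ones((10, 10), chunks=(5, 5))
shared = x + 1
a = shared * 2
b = shared * 3

target_a = np.empty((10, 10))
target_b = np.empty((10, 10))

# With compute=False, da.store on current main optimizes (fuses/renames)
# each graph independently, so the shared `x + 1` tasks can no longer be
# recognized and combined when the two stores are computed together:
store_a = da.store(a, target_a, compute=False)
store_b = da.store(b, target_b, compute=False)
dask.compute(store_a, store_b)
```

The results are identical either way; the difference is only in whether the shared tasks are deduplicated in the combined graph.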