Skip to content

Remove source and target optimization in array.store#9732

Closed
djhoese wants to merge 3 commits intodask:mainfrom
djhoese:bugfix-store-nocompute-optimize
Closed

Remove source and target optimization in array.store#9732
djhoese wants to merge 3 commits intodask:mainfrom
djhoese:bugfix-store-nocompute-optimize

Conversation

@djhoese
Copy link
Copy Markdown
Contributor

@djhoese djhoese commented Dec 8, 2022

As discussed in #8380, there are cases where a user may want to store multiple sources or store to multiple targets that share tasks somewhere in their history. In current main calling da.store with compute=False results in the individual dask graphs being optimized (fused tasks, renamed tasks, etc) which means future computations won't be able to combine/share tasks any more. In simplified code this looks like:

src1 = ...
src2 = ...  # <- shares tasks with src1
store1 = da.store(src1, target1, compute=False)
store2 = da.store(src2, target2, compute=False)
da.compute(store1, store2)  # <- this will compute shared tasks twice

As I was working on a fix for this, I realized that I don't think there is any need for da.store to optimize anything before final computation. I think all computation will optimize the graphs anyway so the only benefit to early optimization is to reduce the graphs/dictionaries that are passed around internally in da.store (a small benefit if I'm not mistaken).

Lastly a question on dask PR best practice: What are the feelings about including refactorings in a PR alongside bug fixes? For example, the da.store function could be split into a lot of sub-functions that would reduce cyclomatic complexity of the function and make it easier to read. But for overall appearance of the PR it could make it harder to review.

@github-actions github-actions bot added the array label Dec 8, 2022
@djhoese
Copy link
Copy Markdown
Contributor Author

djhoese commented Dec 8, 2022

Ah I should have added, since targets can also be Delayed objects, they are also optimized as part of the store function. In this PR I've removed this optimization as well and include a test showing unnecessary re-computation of a task shared by two targets.

@djhoese
Copy link
Copy Markdown
Contributor Author

djhoese commented Dec 8, 2022

I think conda-forge is having a difficult day today.

@gerritholl
Copy link
Copy Markdown
Contributor

This PR slows down my processing compared to dask 2022.12.0. With dask 2022.12.0, my script takes 09:18.68. With this PR it takes 10:14.39. I will try to come up with a MCVE that reproduces the slowdown.

@mrocklin
Copy link
Copy Markdown
Member

@jakirkham

@djhoese
Copy link
Copy Markdown
Contributor Author

djhoese commented Dec 16, 2022

@gerritholl I'll check out your satpy-related issues and hopefully do some profiling today. It may be something satpy is doing against "best practice", but I'm surprised that this PR would have worsened performance unless my assumptions about where graph optimizations are being done (when compute is called) are wrong.

@djhoese
Copy link
Copy Markdown
Contributor Author

djhoese commented Dec 16, 2022

Question for @mrocklin and @jakirkham (or others): if I give a dask array to a Delayed function and then compute the result, is the graph optimized any differently than if I had computed the dask array directly? Put another way, does optimizing a dask array graph mean additional optimizations are considered that aren't considered for a Delayed function?

For Gerrit's issue, he's having trouble reproducing the slow results he was getting before. I'm going to try some other use cases with Satpy to see if I can weird results between 2022.12.0 and this PR.

Comment on lines -1169 to -1171
sources_layer = Array.__dask_optimize__(
sources_hlg, list(core.flatten([e.__dask_keys__() for e in sources]))
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PR ( #2980 ), Matt suggested we add this to ensure Array optimizations are applied before computation (since they wouldn't be for a Delayed object) ( #2980 (comment) ). So presumably if this is dropped it, these optimizations would need to be handled another way

That said, have not really closely followed the changes to graphs, optimizations, object specializations, etc. in Dask. So maybe best practices have changed. Maybe @rjzamora would know 🙂

Copy link
Copy Markdown
Contributor Author

@djhoese djhoese Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's what I've been seeing, but results still aren't as expected. I tried doing dask.config.set(delayed_optimize=da.optimization.optimize) in hopes of making the Delayed objects get optimized as if they were Arrays (which they are below the top 2 levels or so of the graph). But even with this I still saw worse performance than dask main with my own Satpy-based example.

Overall, from a high level, my hope is that there would be something we could do so no matter where these graphs appear inside other graphs (as inputs to delayed objects, etc) that they all optimize the same when computed/optimized with other graphs. 🤞

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly the PR linked above and associated discussion is 4yrs old. It wouldn't surprise me if best practices had changed (potentially a lot) in the intervening time. So would defer to others more involved with HLG (like Rick) on what those are today.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakirkham to get this PR moving forward, or at least the discussion, what are the user accounts of the specific people you think could give insight on this problem (can you mention them here)? That problem being wanting to optimize a dask graph consisting of Delayed objects with the potential for dask arrays involved. The hope being that those arrays could be optimized in the same way regardless of the final task being a Delayed object or not.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry not to be very helpful here. Will defer to @mrocklin on next steps

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Apr 15, 2024
@djhoese
Copy link
Copy Markdown
Contributor Author

djhoese commented Apr 8, 2025

I think this has been replaced by #11859 and #11844.

@djhoese djhoese closed this Apr 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

array needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

da.store loses dependency information

4 participants