Skip to content

Conversation

@fjetter
Copy link
Member

@fjetter fjetter commented Apr 2, 2025

There was a bug in the earlier code that would not actually optimize HLGs jointly under some circumstances.

The fix I am proposing here is to perform the HLG merging during the Expression optimization step. From all I can tell this still does not materialize the layers sooner than they did before (there are still cases, e.g. if materialization is triggered to compute __dask_keys__ but we've done this before already)

Further, the FinalizeComputeHLG class no longer assembles the low level graph but instead the HighlevelGraph assembly is a separate step and this step is overriden in the finalizer. The way, the sequence can properly bubble down the finalize without compromising joint optimization

Actually closes #8380

cc @djhoese

@fjetter fjetter requested a review from Copilot April 2, 2025 10:20
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request addresses a bug in the joint optimization of HLGs by modifying how dependencies are handled during the expression optimization step and refactoring the finalization of compute graphs. Key changes include:

  • Adding and updating tests in dask/tests/test_expr.py and dask/array/tests/test_array_core.py to validate joint optimization and materialization behaviors.
  • Revising HLGExpr in dask/_expr.py to use a new property (hlg) and updating dask_graph and finalize_compute methods.
  • Refactoring the HLGFinalizeCompute class to subclass HLGExpr and adjust the final graph assembly process.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
dask/tests/test_expr.py Added tests for HLGExpr finalization and joint optimization of HLGs.
dask/array/tests/test_array_core.py Introduced a new test to verify shared task optimization with stored sources.
dask/_expr.py Updated methods in HLGExpr and refactored HLGFinalizeCompute for improved dependency handling.
Comments suppressed due to low confidence (1)

dask/_expr.py:1232

  • [nitpick] In HLGFinalizeCompute, the property 'hlg' retrieves the operand using 'expr = self.operand("dsk")' and then accesses 'expr.dsk'. For consistency with the rest of HLGExpr, consider using 'expr.hlg' instead of 'expr.dsk' throughout this property.
def hlg(self):

@github-actions
Copy link
Contributor

github-actions bot commented Apr 2, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      9 files  ± 0        9 suites  ±0   3h 34m 6s ⏱️ + 5m 26s
 18 004 tests + 3   16 790 ✅ + 3   1 214 💤 ±0  0 ❌ ±0 
161 110 runs  +27  149 003 ✅ +27  12 107 💤 ±0  0 ❌ ±0 

Results for commit 80fb427. ± Comparison against base commit d9e07f5.

♻️ This comment has been updated with latest results.

@fjetter fjetter requested a review from Copilot April 2, 2025 14:50
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a bug in the HLG joint optimization by merging HLGs during the Expression optimization step and separating out the HighLevelGraph assembly in the finalizer. It also adds tests to ensure correct handling of expression sequences and optimizations.

  • Added new functions and tests in test_expr.py to validate HLGExpr sequence finalization and nested keys.
  • Modified optimize_double in test_base.py to avoid altering keys starting with "finalize".
  • Introduced a new store test in test_array_core.py to verify shared task optimization.
  • Refactored HLGExpr, _HLGExprSequence, and HLGFinalizeCompute in _expr.py to improve and separate finalization and optimization logic.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
dask/tests/test_expr.py New tests and helper functions for HLGExpr sequence optimization and key finalization.
dask/tests/test_base.py Adjusted optimize_double to skip optimization for keys starting with "finalize".
dask/array/tests/test_array_core.py Added a test to confirm correct task sharing post store and optimization.
dask/_expr.py Changes in HLGExpr, _HLGExprSequence, and HLGFinalizeCompute to adjust optimization flow.
Comments suppressed due to low confidence (3)

dask/tests/test_base.py:822

  • Ensure key_split is imported or defined within this module; otherwise, optimize_double may fail at runtime.
return { k: (mul, 2, v) if not key_split(k).startswith("finalize") else v for k, v in dsk.items() }

dask/_expr.py:1247

  • Returning self._layer() in dask_graph of HLGFinalizeCompute may be unclear if _layer is not explicitly defined; consider defining a dedicated _layer method to clearly separate finalized graph data.
return self._layer()

dask/_expr.py:1116

  • The assertion 'assert all_keys is not None' is redundant since all_keys is initialized as a list; consider removing it for cleaner code.
assert all_keys is not None

dask/_expr.py Outdated
all_keys.extend(op.__dask_keys__())
else:
all_keys.append(op.__dask_keys__())
assert all_keys is not None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all_keys is a list. Did you want to test non-emptiness?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was old code. The bot already told me this is stupid. I just didn't push the commit yet

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. No need to block on the assertion.

@fjetter fjetter merged commit 01202ab into dask:main Apr 4, 2025
4 of 5 checks passed
@fjetter fjetter deleted the hlg_expr_refactor branch April 4, 2025 09:23
@pfackeldey
Copy link

Our CI tests hang indefinitely over in https://github.com/dask-contrib/dask-awkward and https://github.com/scikit-hep/uproot5 since this PR (found this out using git bisect). This issue seems to be only present for running tests with pytest, there's no problem (as far as we can tell) with just running dask code otherwise. (If I understand it correctly, it seems to be stuck forever in queue.get() for the get_sync scheduler, but I'm not very sure about this.)

Do you have an idea what could cause this @fjetter & @hendrikmakait?
Thanks for your help in advance 🙏

cc @martindurant & @lgray

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

da.store loses dependency information

3 participants