-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Ensure HLGs are using dependencies properly in optimization #11859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request addresses a bug in the joint optimization of HLGs by modifying how dependencies are handled during the expression optimization step and refactoring the finalization of compute graphs. Key changes include:
- Adding and updating tests in dask/tests/test_expr.py and dask/array/tests/test_array_core.py to validate joint optimization and materialization behaviors.
- Revising HLGExpr in dask/_expr.py to use a new property (hlg) and updating dask_graph and finalize_compute methods.
- Refactoring the HLGFinalizeCompute class to subclass HLGExpr and adjust the final graph assembly process.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| dask/tests/test_expr.py | Added tests for HLGExpr finalization and joint optimization of HLGs. |
| dask/array/tests/test_array_core.py | Introduced a new test to verify shared task optimization with stored sources. |
| dask/_expr.py | Updated methods in HLGExpr and refactored HLGFinalizeCompute for improved dependency handling. |
Comments suppressed due to low confidence (1)
dask/_expr.py:1232
- [nitpick] In HLGFinalizeCompute, the property 'hlg' retrieves the operand using 'expr = self.operand("dsk")' and then accesses 'expr.dsk'. For consistency with the rest of HLGExpr, consider using 'expr.hlg' instead of 'expr.dsk' throughout this property.
def hlg(self):
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 9 files ± 0 9 suites ±0 3h 34m 6s ⏱️ + 5m 26s Results for commit 80fb427. ± Comparison against base commit d9e07f5. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses a bug in the HLG joint optimization by merging HLGs during the Expression optimization step and separating out the HighLevelGraph assembly in the finalizer. It also adds tests to ensure correct handling of expression sequences and optimizations.
- Added new functions and tests in test_expr.py to validate HLGExpr sequence finalization and nested keys.
- Modified optimize_double in test_base.py to avoid altering keys starting with "finalize".
- Introduced a new store test in test_array_core.py to verify shared task optimization.
- Refactored HLGExpr, _HLGExprSequence, and HLGFinalizeCompute in _expr.py to improve and separate finalization and optimization logic.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| dask/tests/test_expr.py | New tests and helper functions for HLGExpr sequence optimization and key finalization. |
| dask/tests/test_base.py | Adjusted optimize_double to skip optimization for keys starting with "finalize". |
| dask/array/tests/test_array_core.py | Added a test to confirm correct task sharing post store and optimization. |
| dask/_expr.py | Changes in HLGExpr, _HLGExprSequence, and HLGFinalizeCompute to adjust optimization flow. |
Comments suppressed due to low confidence (3)
dask/tests/test_base.py:822
- Ensure key_split is imported or defined within this module; otherwise, optimize_double may fail at runtime.
return { k: (mul, 2, v) if not key_split(k).startswith("finalize") else v for k, v in dsk.items() }
dask/_expr.py:1247
- Returning self._layer() in dask_graph of HLGFinalizeCompute may be unclear if _layer is not explicitly defined; consider defining a dedicated _layer method to clearly separate finalized graph data.
return self._layer()
dask/_expr.py:1116
- The assertion 'assert all_keys is not None' is redundant since all_keys is initialized as a list; consider removing it for cleaner code.
assert all_keys is not None
dask/_expr.py
Outdated
| all_keys.extend(op.__dask_keys__()) | ||
| else: | ||
| all_keys.append(op.__dask_keys__()) | ||
| assert all_keys is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all_keys is a list. Did you want to test non-emptiness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that was old code. The bot already told me this is stupid. I just didn't push the commit yet
hendrikmakait
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. No need to block on the assertion.
|
Our CI tests hang indefinitely over in https://github.com/dask-contrib/dask-awkward and https://github.com/scikit-hep/uproot5 since this PR (found this out using Do you have an idea what could cause this @fjetter & @hendrikmakait? cc @martindurant & @lgray |
There was a bug in the earlier code that would not actually optimize HLGs jointly under some circumstances.
The fix I am proposing here is to perform the HLG merging during the Expression optimization step. From all I can tell this still does not materialize the layers sooner than they did before (there are still cases, e.g. if materialization is triggered to compute
__dask_keys__but we've done this before already)Further, the
FinalizeComputeHLGclass no longer assembles the low level graph but instead the HighlevelGraph assembly is a separate step and this step is overriden in the finalizer. The way, the sequence can properly bubble down the finalize without compromising joint optimizationActually closes #8380
cc @djhoese