HighLevelGraphs (HLGs) are discussed in many places throughout Dask’s GitHub organization, but there is no single place with an up-to-date status and outlook. This issue should be modified if/when the HLG roadmap changes.
Last Update: 2022/08/29
Current status
HighLevelGraph development is currently “frozen”.
The ongoing development of new `highlevelgraph.Layer` classes and/or `HighLevelGraph` optimizations is strongly discouraged. Bug fixes are obviously fine, but expansion of the existing system should be avoided. Dask users and downstream developers should avoid interacting with `Layer` classes directly. The motivation for the current development freeze is a consensus that (1) the current design of HLGs is not sufficient for effective graph optimizations, and (2) HLG `Layer`s are too complicated to implement.
Reasons for current pain
- HLG serialization is convoluted and buggy
- The current API design makes it difficult to implement an effective `Layer` class. For example, it can be challenging to implement `cull` without materializing the subgraph
- The current collection/HighLevelGraph/Layer hierarchy makes it nearly impossible to perform general optimizations. Some important optimizations that are missing or significantly limited:
  - Dask does not track partition metadata (like Parquet statistics) to optimize simple queries like `len`, `min`, and `max`.
  - Dask does not support automatic predicate pushdown (i.e. Dask will not detect filter operations in the task graph and use them to re-write IO layers with `filters=`).
  - Dask does not support automatic column projection when the implicit or explicit column selection and the IO operation are separated by non-trivial operations like merge, shuffle, and/or groupby. This limitation often requires the user to explicitly pass `read_parquet(..., columns=...)` to reduce memory pressure.
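To make the predicate-pushdown limitation concrete, here is a minimal toy sketch of what such an optimization would do at the high level. The `IOLayer`, `FilterLayer`, and `pushdown` names are invented for illustration; they are not real Dask classes, and real pushdown would have to work on actual HLG layers rather than a flat list:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class IOLayer:
    """Toy stand-in for a read_parquet-style IO layer."""
    path: str
    columns: Optional[List[str]] = None
    filters: Optional[list] = None

@dataclass
class FilterLayer:
    """Toy stand-in for a filter operation found in the graph."""
    predicate: Tuple[str, str, object]  # e.g. ("x", ">", 5)

def pushdown(layers):
    """Fold each FilterLayer into the IOLayer directly before it."""
    out = []
    for layer in layers:
        if isinstance(layer, FilterLayer) and out and isinstance(out[-1], IOLayer):
            io = out[-1]
            io.filters = (io.filters or []) + [layer.predicate]
        else:
            out.append(layer)
    return out

graph = [IOLayer("data.parquet"), FilterLayer(("x", ">", 5))]
optimized = pushdown(graph)  # filter folded into the IO layer's filters=
```

Today this kind of rewrite is effectively impossible to do generically, because the relevant information is buried in low-level task tuples rather than exposed on the layer.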
Roadmap
1 - Address HLG-serialization
Relevant PRs:
- WIP: Ship graphs from client to scheduler with pickle distributed#6028
- Remove dask_distributed_pack methods from HighLevelGraphs and Layers #8864
Serialization for client-scheduler communication has been a consistent challenge since the introduction of highlevelgraph.Layer. Most of this pain can be resolved by adopting Pickle for HLG serialization.
The current plan is to move forward with distributed#6028 and #8864, which propose always using pickle to send the HLG from the client to the scheduler, and therefore removing the problematic `__dask_distributed_pack__` protocol from `HighLevelGraph`/`Layer`.
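The core of the proposal is that a layer only needs to be an ordinary picklable Python object, with no custom pack/unpack protocol. A minimal illustration (`ToyLayer` is an invented stand-in, not a real Dask class):

```python
import pickle

class ToyLayer:
    """Invented stand-in for a graph layer; any picklable object works."""
    def __init__(self, name, tasks):
        self.name = name
        self.tasks = tasks

# A layer holding a single low-level task
layer = ToyLayer("x", {("x", 0): (sum, [1, 2, 3])})

payload = pickle.dumps(layer)    # what the client would send
restored = pickle.loads(payload) # what the scheduler would receive
```

One round-trip through `pickle.dumps`/`pickle.loads` replaces the bespoke `__dask_distributed_pack__`/`__dask_distributed_unpack__` pair, which is where much of the serialization pain has lived.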
2 - Explore alternate designs for current collection/HLG/Layer API
Even if serialization challenges are addressed by 6028 (and follow-up work), the current collection/HLG/Layer hierarchy and HLG/Layer design will still pose a serious problem for developers. Therefore we probably need to do one of the following:
- Option A - Revise the `HighLevelGraph` and `Layer` classes to simplify the implementation of new `Layer` subclasses and to encapsulate the necessary information for high-level optimizations. This may require some refactoring of the collection APIs to move `meta`/`divisions`/`chunks` ownership into the HLG.
- Option B - Redesign the collection APIs to use a high-level-expression (HLE) API instead of HLGs/Layers (see original HLE discussion in High Level Expressions #7933). One possible variation on an HLE-like redesign is under active exploration (see [Discussion] Dask-collection refactor plan #9117), but that work is still very far from reflecting a clear design proposal. In fact, the final outcome of that work may be a recommendation to go with Option A.
Regardless of the specific path we choose, it seems likely that we will want to manage graph-materialization logic and collection metadata in the same place. Therefore, it may be difficult to avoid significant changes to the DataFrame and Array APIs.
Possible "Option A" Proposal
Step 1: Simplify the Layer and HLG-culling APIs (See: #9216)
- Remove `Mapping` base to simplify culling and make "accidental" Layer materialization highly unlikely.
  - The "new" Layer should only be materialized by an explicit method (like `Layer.subgraph(keys)`), which should include implicit culling.
- Remove boilerplate code related to `Mapping` behavior (`keys()`, `__len__`, etc.)
- Remove `Layer.cull` and `HighLevelGraph.cull`.
- Outcome: Layer design and maintenance becomes significantly simpler. HLG-based optimizations (besides culling) are not addressed in any way.
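A rough sketch of what a non-`Mapping` layer with an explicit, implicitly-culling materialization method could look like. `SketchLayer` and its internals are hypothetical; the proposed `Layer.subgraph(keys)` API may differ in detail:

```python
class SketchLayer:
    """Hypothetical non-Mapping layer: tasks are only reachable via
    an explicit subgraph() call, so accidental materialization (e.g.
    by iterating the layer like a dict) is impossible."""

    def __init__(self, tasks, dependencies):
        self._tasks = tasks        # key -> task
        self._deps = dependencies  # key -> set of in-layer dependency keys

    def subgraph(self, keys):
        """Materialize only the tasks needed for `keys` (implicit cull)."""
        needed, stack = set(), list(keys)
        while stack:
            k = stack.pop()
            if k not in needed:
                needed.add(k)
                stack.extend(self._deps.get(k, ()))
        return {k: self._tasks[k] for k in needed}

layer = SketchLayer(
    tasks={"a": 1, "b": 2, "c": 3},
    dependencies={"b": {"a"}},
)
culled = layer.subgraph(["b"])  # "c" is never materialized
```

Because there is no `Mapping` interface, the only way to get concrete tasks is to ask for specific output keys, which makes culling automatic rather than a separate `cull` method to maintain.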
Step 2: Revise HighLevelGraph/Layer dependency tracking
- Extend the `Layer` API to track its own collection dependencies.
- Add an `output_layer` attribute to `HighLevelGraph`, and make this the primary "state" of an HLG. `HighLevelGraph.layers` and `HighLevelGraph.dependencies` should simply refer to `HighLevelGraph.output_layer`.
- Outcome: `Layer` objects track direct Layer dependencies by tracking direct collection dependencies.
Step 3: Move collection metadata into Layer
- Move `name`, `meta`, and `divisions` management from the collection object to `Layer`.
  - It may be best to store and manage this metadata within a new `CollectionMetadata` class. This may make it easier to modify/expand the kinds of metadata we track at the `Layer` level (e.g. partition statistics).
- Outcome: The "identity" of a Dask collection becomes synonymous with the HLG's output layer.
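A hedged sketch of what the hypothetical `CollectionMetadata` container might hold. The field set is an assumption (the `statistics` slot illustrates where per-partition Parquet statistics could live); the real design would likely differ:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CollectionMetadata:
    """Hypothetical metadata container owned by the Layer, not the
    collection.  Fields shown here are illustrative assumptions."""
    name: str
    meta: Any = None                    # e.g. an empty pandas DataFrame
    divisions: Optional[tuple] = None   # DataFrame partition boundaries
    statistics: Optional[dict] = None   # room for per-partition stats

class MetaLayer:
    """Hypothetical layer carrying its collection's metadata."""
    def __init__(self, metadata: CollectionMetadata):
        self.metadata = metadata

layer = MetaLayer(CollectionMetadata(name="read-parquet",
                                     divisions=(0, 10, 20)))
```

With metadata attached to the output layer, a collection object becomes little more than a handle on that layer, and optimizations can read and update metadata without reaching back into the collection API.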
Step 4: Add regeneration mechanism to Layer
- Add a `creation_info` attribute to `Layer`.
  - This attribute should specify the `func`, `args`, and `kwargs` needed to re-create the collection associated with the current `Layer`. `Layer` dependencies can be regenerated recursively.
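A toy sketch of the regeneration idea: if a layer records how it was created, an optimizer can re-create it with modified arguments (e.g. injecting a column projection). The `creation_info` structure and `RegenLayer`/`fake_read` names are assumptions for illustration only:

```python
class RegenLayer:
    """Hypothetical layer that remembers how it was created."""

    def __init__(self, func, *args, **kwargs):
        self.creation_info = {"func": func, "args": args, "kwargs": kwargs}

    def regenerate(self, **new_kwargs):
        """Re-create the layer's collection, overriding selected kwargs.
        In a full design, dependencies would be regenerated recursively."""
        info = self.creation_info
        kwargs = {**info["kwargs"], **new_kwargs}
        return info["func"](*info["args"], **kwargs)

def fake_read(path, columns=None):
    """Toy stand-in for an IO function like read_parquet."""
    return ("read", path, columns)

layer = RegenLayer(fake_read, "data.parquet")
# An optimizer could inject a column projection at regeneration time:
projected = layer.regenerate(columns=["x"])
```

This is the mechanism that would let high-level optimizations like automatic column projection rewrite an IO layer after the fact, instead of requiring the user to pass `columns=` up front.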
Step 5: Re-visit HLG optimizations
3 - Reach consensus on collection/HLG/Layer design
At some point in the medium-term future, it will be necessary to make a decision to pursue option A or option B. If there is no clear reference implementation and/or design document in place for option B, then I suggest we focus on option A (changing the existing HLG/Layer design). The motivation for making A the “default” is an acknowledgment that attacking option B in an incremental way is likely to be more difficult without a detailed plan in place.
4 - Develop new roadmap
At this point, the existing “roadmap” should be obsolete, and ready for replacement.