
[Discussion] HighLevelGraph Development Roadmap #9159

@rjzamora

Description

HighLevelGraphs (HLGs) are discussed in many places throughout Dask’s GitHub organization, but there is no single place with an up-to-date status and outlook. This issue should be modified if/when the HLG roadmap changes.

Last Update: 2022/08/29

Current status

HighLevelGraph development is currently “frozen”.

The ongoing development of new highlevelgraph.Layer classes and/or HighLevelGraph optimizations is strongly discouraged. Bug fixes are obviously fine, but expansion of the existing system should be avoided. Dask users and downstream developers should also avoid interacting with Layer classes directly. The motivation for the current development freeze is a consensus that (1) the current design of HLGs is not sufficient for effective graph optimizations, and (2) HLG Layers are too complicated to implement.

Reasons for current pain

  • HLG serialization is convoluted and buggy
  • The current API design makes it difficult to implement an effective Layer class. For example, it can be challenging to implement cull without materializing the subgraph
  • The current collection/HighLevelGraph/Layer hierarchy makes it nearly impossible to perform general optimizations. Some important optimizations that are missing or significantly limited:
    • Dask does not track partition metadata (like Parquet statistics) to optimize simple queries like len, min, and max.
    • Dask does not support automatic predicate pushdown (i.e. Dask will not detect filter operations in the task graph, and use them to re-write IO Layers with filters=).
    • Dask does not support automatic column projection when the implicit or explicit selection and IO operation are separated by non-trivial operations like merge, shuffle and/or groupby. This limitation often requires the user to explicitly define read_parquet(..., columns=...) to reduce memory pressure.
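The culling pain point above can be made concrete with a toy sketch (all names here are illustrative, not Dask's actual API): a layer that is just a dict of tasks must touch every task to cull, while a layer that knows its own structure can cull symbolically without ever materializing the subgraph.

```python
def cull_materialized(layer, keys):
    """Culling a dict-based layer requires iterating over every task."""
    return {k: v for k, v in layer.items() if k in keys}


class RangeLayer:
    """A symbolic layer representing tasks ('x', 0)..('x', n-1), never materialized."""

    def __init__(self, name, npartitions):
        self.name = name
        self.npartitions = npartitions

    def cull(self, keys):
        # Culling only narrows the partition range: cost scales with the
        # number of requested keys, not the total number of tasks.
        parts = sorted(i for (name, i) in keys if name == self.name)
        return RangeLayer(self.name, len(parts))


materialized = {("x", i): ("noop", i) for i in range(1000)}
wanted = {("x", 0), ("x", 1)}

assert len(cull_materialized(materialized, wanted)) == 2
assert RangeLayer("x", 1000).cull(wanted).npartitions == 2
```

The hard part in practice is that many real Layer implementations cannot express `cull` this cheaply, which is exactly the API-design problem described above.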

Roadmap

1 - Address HLG-serialization

Relevant PRs:

Serialization for client-scheduler communication has been a consistent challenge since the introduction of highlevelgraph.Layer. Most of this pain can be resolved by adopting Pickle for HLG serialization.

The current plan is to move forward with distributed#6028 and #8864, which propose that we always use pickle to send the HLG from the client to the scheduler, and therefore remove the problematic __dask_distributed_pack__ protocol from HighLevelGraph/Layer.
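The shape of that change can be sketched with a toy layer (ToyLayer is hypothetical; only the general idea of replacing the pack/unpack protocol with pickle comes from the proposal): the whole object round-trips through a single pickle call, with no bespoke pack/unpack methods to maintain.

```python
import pickle


class ToyLayer:
    """A stand-in for a Layer; no __dask_distributed_pack__ needed."""

    def __init__(self, name, tasks):
        self.name = name
        self.tasks = tasks


layer = ToyLayer("add-1", {("add-1", 0): ("add", 1, 2)})

# Client -> scheduler: one pickle round trip replaces the pack protocol.
payload = pickle.dumps(layer)
restored = pickle.loads(payload)

assert restored.name == layer.name
assert restored.tasks == layer.tasks
```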

2 - Explore alternate designs for current collection/HLG/Layer API

Even if serialization challenges are addressed by 6028 (and follow-up work), the current collection/HLG/Layer hierarchy and HLG/Layer design will still pose a serious problem for developers. Therefore we probably need to do one of the following:

  • Option A - Revise the HighLevelGraph and Layer classes to simplify the implementation of new Layer subclasses and to encapsulate the necessary information for high-level optimizations. This may require some refactoring of the collection APIs to move meta/divisions/chunks ownership into the HLG.
  • Option B - Redesign the collection APIs to use a high-level-expression (HLE) API instead of HLGs/Layers (see original HLE discussion in High Level Expressions #7933). One possible variation on an HLE-like redesign is under active exploration (see [Discussion] Dask-collection refactor plan #9117), but that work is still very far from reflecting a clear design proposal. In fact, the final outcome of that work may be a recommendation to go with option A.

Regardless of the specific path we choose, it seems likely that we will want to manage graph-materialization logic and collection metadata in the same place. Therefore, it may be difficult to avoid significant changes to the DataFrame and Array APIs.

Possible "Option A" Proposal

Step 1: Simplify the Layer and HLG-culling APIs (See: #9216)

  • Remove Mapping base to simplify culling and make "accidental" Layer materialization highly unlikely.
    • The "new" Layer should only be materialized by an explicit method (like Layer.subgraph(keys)), which should include implicit culling.
  • Remove boilerplate code related to Mapping behavior (keys(), __len__, etc.)
  • Remove Layer.cull and HighLevelGraph.cull.
  • Outcome: Layer design and maintenance becomes significantly simpler. HLG-based optimizations (besides culling) are not addressed in any way.
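A hedged sketch of Step 1 (the `subgraph` method name comes from the text above; everything else is illustrative): a Layer that is not a Mapping, so iteration and `len()` cannot accidentally materialize it, and whose only path to concrete tasks is an explicit `subgraph(keys)` call that culls as it materializes.

```python
class SimpleLayer:
    """A Layer without a Mapping base class."""

    def __init__(self, name, npartitions):
        self.name = name
        self.npartitions = npartitions

    def subgraph(self, keys):
        # Explicit materialization with implicit culling: only the
        # requested keys are ever turned into concrete tasks.
        return {
            (self.name, i): ("make-partition", i)
            for (name, i) in keys
            if name == self.name
        }


layer = SimpleLayer("x", 1000)
tasks = layer.subgraph({("x", 3), ("x", 7)})
assert set(tasks) == {("x", 3), ("x", 7)}
```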

Step 2: Revise HighLevelGraph/Layer dependency tracking

  • Extend the Layer API to track its own collection dependencies.
  • Add an output_layer attribute to HighLevelGraph, and make this the primary "state" of an HLG
    • HighLevelGraph.layers and HighLevelGraph.dependencies should simply refer to HighLevelGraph.output_layer
  • Outcome: Layer objects track direct Layer dependencies by tracking direct collection dependencies.
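Step 2 can be sketched as follows (the `output_layer` attribute is named in the text; the traversal logic is a hypothetical illustration): the HLG's primary state is a single output layer, and `layers`/`dependencies` are derived by walking back from it.

```python
class Layer:
    def __init__(self, name, dependencies=()):
        self.name = name
        self.dependencies = list(dependencies)


class HighLevelGraph:
    def __init__(self, output_layer):
        self.output_layer = output_layer  # the primary "state" of the HLG

    @property
    def layers(self):
        # Derived by recursion from output_layer, not stored separately.
        out, stack = {}, [self.output_layer]
        while stack:
            layer = stack.pop()
            if layer.name not in out:
                out[layer.name] = layer
                stack.extend(layer.dependencies)
        return out

    @property
    def dependencies(self):
        return {
            name: {d.name for d in layer.dependencies}
            for name, layer in self.layers.items()
        }


io = Layer("read-parquet")
filtered = Layer("filter", dependencies=[io])
hlg = HighLevelGraph(filtered)

assert set(hlg.layers) == {"read-parquet", "filter"}
assert hlg.dependencies["filter"] == {"read-parquet"}
```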

Step 3: Move collection metadata into Layer

  • Move name, meta, and divisions management from the collection object to Layer
    • It may be best to store and manage this metadata within a new CollectionMetadata class. This may make it easier to modify/expand the kinds of metadata we track at the Layer level (e.g. partition statistics).
  • Outcome: The "identity" of a Dask collection becomes synonymous with the HLG's output layer.
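One possible shape for Step 3 (the `CollectionMetadata` class name comes from the text; its fields here are illustrative): name/meta/divisions live on the Layer, bundled in a small metadata object that is easy to extend later, e.g. with partition statistics.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionMetadata:
    name: str
    meta: object              # e.g. an empty pandas DataFrame
    divisions: tuple
    statistics: dict = field(default_factory=dict)  # room for future growth


class Layer:
    def __init__(self, metadata):
        self.metadata = metadata


meta = CollectionMetadata(name="df", meta=None, divisions=(0, 10, 20))
layer = Layer(meta)

# The collection's "identity" is the output layer's metadata.
assert layer.metadata.name == "df"
assert layer.metadata.divisions == (0, 10, 20)
```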

Step 4: Add regeneration mechanism to Layer

  • Add creation_info attribute to Layer
    • This attribute should specify the func, args, and kwargs needed to re-create the collection associated with the current Layer.
    • Layer dependencies can be regenerated recursively
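Step 4 might look like the following (the `creation_info` attribute with `func`/`args`/`kwargs` fields comes from the text; the `regenerate` helper is hypothetical): recording how a layer was created lets an optimizer rebuild it with modified arguments, e.g. injecting a column projection into an IO layer.

```python
class Layer:
    def __init__(self, creation_info):
        # {"func": ..., "args": ..., "kwargs": ...}
        self.creation_info = creation_info


def make_layer(columns=None):
    """Stand-in for an IO-layer constructor like read_parquet."""
    return Layer({"func": make_layer, "args": (), "kwargs": {"columns": columns}})


def regenerate(layer, **overrides):
    # Re-create the layer with (possibly) modified keyword arguments.
    info = layer.creation_info
    kwargs = {**info["kwargs"], **overrides}
    return info["func"](*info["args"], **kwargs)


original = make_layer()
projected = regenerate(original, columns=["a", "b"])
assert projected.creation_info["kwargs"]["columns"] == ["a", "b"]
```

Because each layer also knows its dependencies (Step 2), the same mechanism can be applied recursively to rebuild an entire graph.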

Step 5: Re-visit HLG optimizations

3 - Reach consensus on collection/HLG/Layer design

At some point in the medium-term future, it will be necessary to make a decision to pursue option A or option B. If there is no clear reference implementation and/or design document in place for option B, then I suggest we focus on option A (changing the existing HLG/Layer design). The motivation for making A the “default” is an acknowledgment that attacking option B in an incremental way is likely to be more difficult without a detailed plan in place.

4 - Develop new roadmap

At this point, the existing “roadmap” should be obsolete, and ready for replacement.
