HighLevelGraphs (HLGs) are discussed in many places throughout Dask’s GitHub organization, but there is no single place with an up-to-date status and outlook. This issue should be modified if/when the HLG roadmap changes.
Last Update: 2022/08/29
Current status
HighLevelGraph development is currently “frozen”.
The ongoing development of new `highlevelgraph.Layer` classes and/or `HighLevelGraph` optimizations is strongly discouraged. Bug fixes are obviously fine, but expansion of the existing system should be avoided. Dask users and downstream developers should avoid interacting with `Layer` classes directly. The motivation for the current development freeze is a consensus that (1) the current design of HLGs is not sufficient for effective graph optimizations, and (2) HLG `Layer`s are too complicated to implement.
Reasons for current pain
- HLG serialization is convoluted and buggy
- The current API design makes it difficult to implement an effective `Layer` class. For example, it can be challenging to implement `cull` without materializing the subgraph
- The current collection/HighLevelGraph/Layer hierarchy makes it nearly impossible to perform general optimizations. Some important optimizations that are missing or significantly limited:
  - Dask does not track partition metadata (like Parquet statistics) to optimize simple queries like `len`, `min`, and `max`.
  - Dask does not support automatic predicate pushdown (i.e. Dask will not detect filter operations in the task graph and use them to re-write IO layers with `filters=`).
  - Dask does not support automatic column projection when the implicit or explicit column selection and the IO operation are separated by non-trivial operations like merge, shuffle, and/or groupby. This limitation often requires the user to explicitly pass `read_parquet(..., columns=...)` to reduce memory pressure.
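To make the predicate-pushdown limitation concrete, here is a minimal toy sketch of what such an optimization would do at the high level. The `IOLayer`, `FilterLayer`, and `pushdown` names are invented for illustration; they are not real Dask classes, and real pushdown would have to work on actual HLG layers rather than a flat list:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class IOLayer:
    """Toy stand-in for a read_parquet-style IO layer."""
    path: str
    columns: Optional[List[str]] = None
    filters: Optional[list] = None

@dataclass
class FilterLayer:
    """Toy stand-in for a filter operation found in the graph."""
    predicate: Tuple[str, str, object]  # e.g. ("x", ">", 5)

def pushdown(layers):
    """Fold each FilterLayer into the IOLayer directly before it."""
    out = []
    for layer in layers:
        if isinstance(layer, FilterLayer) and out and isinstance(out[-1], IOLayer):
            io = out[-1]
            io.filters = (io.filters or []) + [layer.predicate]
        else:
            out.append(layer)
    return out

graph = [IOLayer("data.parquet"), FilterLayer(("x", ">", 5))]
optimized = pushdown(graph)  # filter folded into the IO layer's filters=
```

Today this kind of rewrite is effectively impossible to do generically, because the relevant information is buried in low-level task tuples rather than exposed on the layer.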
Roadmap
1 - Address HLG-serialization
Relevant PRs:
- WIP: Ship graphs from client to scheduler with pickle distributed#6028
- Remove dask_distributed_pack methods from HighLevelGraphs and Layers #8864
Serialization for client-scheduler communication has been a consistent challenge since the introduction of highlevelgraph.Layer. Most of this pain can be resolved by adopting Pickle for HLG serialization.
The current plan is to move forward with distributed#6028 and #8864, which propose always using pickle to send the HLG from the client to the scheduler, and therefore removing the problematic `__dask_distributed_pack__` protocol from `HighLevelGraph`/`Layer`.
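The core of the proposal is that a layer only needs to be an ordinary picklable Python object, with no custom pack/unpack protocol. A minimal illustration (`ToyLayer` is an invented stand-in, not a real Dask class):

```python
import pickle

class ToyLayer:
    """Invented stand-in for a graph layer; any picklable object works."""
    def __init__(self, name, tasks):
        self.name = name
        self.tasks = tasks

# A layer holding a single low-level task
layer = ToyLayer("x", {("x", 0): (sum, [1, 2, 3])})

payload = pickle.dumps(layer)    # what the client would send
restored = pickle.loads(payload) # what the scheduler would receive
```

One round-trip through `pickle.dumps`/`pickle.loads` replaces the bespoke `__dask_distributed_pack__`/`__dask_distributed_unpack__` pair, which is where much of the serialization pain has lived.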
2 - Explore alternate designs for current collection/HLG/Layer API
Even if serialization challenges are addressed by 6028 (and follow-up work), the current collection/HLG/Layer hierarchy and HLG/Layer design will still pose a serious problem for developers. Therefore we probably need to do one of the following:
- Option A - Revise the `HighLevelGraph` and `Layer` classes to simplify the implementation of new `Layer` subclasses and to encapsulate the necessary information for high-level optimizations. This may require some refactoring of the collection APIs to move `meta`/`divisions`/`chunks` ownership into the HLG.
- Option B - Redesign the collection APIs to use a high-level-expression (HLE) API instead of HLGs/Layers (see original HLE discussion in High Level Expressions #7933). One possible variation on an HLE-like redesign is under active exploration (see [Discussion] Dask-collection refactor plan #9117), but that work is still very far from reflecting a clear design proposal. In fact, the final outcome of that work may be a recommendation to go with Option A.
Regardless of the specific path we choose, it seems likely that we will want to manage graph-materialization logic and collection metadata in the same place. Therefore, it may be difficult to avoid significant changes to the DataFrame and Array APIs.
Possible "Option A" Proposal
Step 1: Simplify the Layer and HLG-culling APIs (See: #9216)
- Remove `Mapping` base to simplify culling and make "accidental" Layer materialization highly unlikely.
  - The "new" Layer should only be materialized by an explicit method (like `Layer.subgraph(keys)`), which should include implicit culling.
- Remove boilerplate code related to `Mapping` behavior (`keys()`, `__len__`, etc.)
- Remove `Layer.cull` and `HighLevelGraph.cull`.
- Outcome: Layer design and maintenance becomes significantly simpler. HLG-based optimizations (besides culling) are not addressed in any way.
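A rough sketch of what a non-`Mapping` layer with an explicit, implicitly-culling materialization method could look like. `SketchLayer` and its internals are hypothetical; the proposed `Layer.subgraph(keys)` API may differ in detail:

```python
class SketchLayer:
    """Hypothetical non-Mapping layer: tasks are only reachable via
    an explicit subgraph() call, so accidental materialization (e.g.
    by iterating the layer like a dict) is impossible."""

    def __init__(self, tasks, dependencies):
        self._tasks = tasks        # key -> task
        self._deps = dependencies  # key -> set of in-layer dependency keys

    def subgraph(self, keys):
        """Materialize only the tasks needed for `keys` (implicit cull)."""
        needed, stack = set(), list(keys)
        while stack:
            k = stack.pop()
            if k not in needed:
                needed.add(k)
                stack.extend(self._deps.get(k, ()))
        return {k: self._tasks[k] for k in needed}

layer = SketchLayer(
    tasks={"a": 1, "b": 2, "c": 3},
    dependencies={"b": {"a"}},
)
culled = layer.subgraph(["b"])  # "c" is never materialized
```

Because there is no `Mapping` interface, the only way to get concrete tasks is to ask for specific output keys, which makes culling automatic rather than a separate `cull` method to maintain.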
Step 2: Revise HighLevelGraph/Layer dependency tracking
- Extend the `Layer` API to track its own collection dependencies.
- Add an `output_layer` attribute to `HighLevelGraph`, and make this the primary "state" of an HLG. `HighLevelGraph.layers` and `HighLevelGraph.dependencies` should simply refer to `HighLevelGraph.output_layer`.
- Outcome: `Layer` objects track direct Layer dependencies by tracking direct collection dependencies.
Step 3: Move collection metadata into Layer
- Move `name`, `meta`, and `divisions` management from the collection object to `Layer`.
  - It may be best to store and manage this metadata within a new `CollectionMetadata` class. This may make it easier to modify/expand the kinds of metadata we track at the `Layer` level (e.g. partition statistics).
- Outcome: The "identity" of a Dask collection becomes synonymous with the HLG's output layer.
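A hedged sketch of what the hypothetical `CollectionMetadata` container might hold. The field set is an assumption (the `statistics` slot illustrates where per-partition Parquet statistics could live); the real design would likely differ:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CollectionMetadata:
    """Hypothetical metadata container owned by the Layer, not the
    collection.  Fields shown here are illustrative assumptions."""
    name: str
    meta: Any = None                    # e.g. an empty pandas DataFrame
    divisions: Optional[tuple] = None   # DataFrame partition boundaries
    statistics: Optional[dict] = None   # room for per-partition stats

class MetaLayer:
    """Hypothetical layer carrying its collection's metadata."""
    def __init__(self, metadata: CollectionMetadata):
        self.metadata = metadata

layer = MetaLayer(CollectionMetadata(name="read-parquet",
                                     divisions=(0, 10, 20)))
```

With metadata attached to the output layer, a collection object becomes little more than a handle on that layer, and optimizations can read and update metadata without reaching back into the collection API.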
Step 4: Add regeneration mechanism to Layer
- Add a `creation_info` attribute to `Layer`.
  - This attribute should specify the `func`, `args`, and `kwargs` needed to re-create the collection associated with the current `Layer`. `Layer` dependencies can be regenerated recursively.
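A toy sketch of the regeneration idea: if a layer records how it was created, an optimizer can re-create it with modified arguments (e.g. injecting a column projection). The `creation_info` structure and `RegenLayer`/`fake_read` names are assumptions for illustration only:

```python
class RegenLayer:
    """Hypothetical layer that remembers how it was created."""

    def __init__(self, func, *args, **kwargs):
        self.creation_info = {"func": func, "args": args, "kwargs": kwargs}

    def regenerate(self, **new_kwargs):
        """Re-create the layer's collection, overriding selected kwargs.
        In a full design, dependencies would be regenerated recursively."""
        info = self.creation_info
        kwargs = {**info["kwargs"], **new_kwargs}
        return info["func"](*info["args"], **kwargs)

def fake_read(path, columns=None):
    """Toy stand-in for an IO function like read_parquet."""
    return ("read", path, columns)

layer = RegenLayer(fake_read, "data.parquet")
# An optimizer could inject a column projection at regeneration time:
projected = layer.regenerate(columns=["x"])
```

This is the mechanism that would let high-level optimizations like automatic column projection rewrite an IO layer after the fact, instead of requiring the user to pass `columns=` up front.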
Step 5: Re-visit HLG optimizations
3 - Reach consensus on collection/HLG/Layer design
At some point in the medium-term future, it will be necessary to make a decision to pursue option A or option B. If there is no clear reference implementation and/or design document in place for option B, then I suggest we focus on option A (changing the existing HLG/Layer design). The motivation for making A the “default” is an acknowledgment that attacking option B in an incremental way is likely to be more difficult without a detailed plan in place.
4 - Develop new roadmap
At this point, the existing “roadmap” should be obsolete, and ready for replacement.