Standardizing type for `divisions`

Looking through Dask's codebase, there doesn't seem to be a consistent type for a Dask object's `divisions`: in some places (like [`set_sorted_index`](https://github.com/dask/dask/blob/f5881891505b9a2ba2da195befb11ad7b4c7bb23/dask/dataframe/shuffle.py#L1017)) we return an object with tuple `divisions`, while in others (such as [`set_partition`](https://github.com/dask/dask/blob/f5881891505b9a2ba2da195befb11ad7b4c7bb23/dask/dataframe/shuffle.py#L325)) we return an object with list `divisions`. This becomes an issue when we compare `divisions` between different objects: two objects' `divisions` can contain identical elements and still compare as unequal, because a list never equals a tuple in Python.
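
As a quick illustration of the comparison problem, here is a plain-Python sketch with no Dask involved; the division values are made up for the example:

```python
# Two containers with identical elements but different types.
list_divisions = [0, 10, 20, 30, 39]
tuple_divisions = (0, 10, 20, 30, 39)

# Python compares the container types first, so a list and a tuple
# are never equal, regardless of their contents.
print(list_divisions == tuple_divisions)  # False
print(list_divisions != tuple_divisions)  # True
```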

Some questions that come to mind:

- Is there an ideal type for `divisions`? I would assume tuples since `divisions` is generally treated as immutable even in the list case, but list functionality is used in several places in the codebase to assemble `divisions`.
- If there is an ideal type for `divisions`, how can we enforce it? It seems like one reason this problem exists is that, in most places, list and tuple `divisions` behave exactly the same; issues typically arise only when they are compared. One potential solution would be to make `divisions` a property with a setter method that either:
    - Implicitly sets the input value to whatever type we desire `divisions` to be
    - Raises an error if the input value is not the proper `divisions` type
- If there is no ideal type for `divisions`, is there a workaround for comparisons?
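
To make the property idea above concrete, here is a rough sketch of both setter behaviors. `FrameSketch` and its `strict` flag are hypothetical stand-ins for illustration, not Dask's actual classes or API:

```python
class FrameSketch:
    """Hypothetical stand-in for a Dask collection; not Dask's real class."""

    def __init__(self, divisions, strict=False):
        self._strict = strict
        self.divisions = divisions  # routed through the setter below

    @property
    def divisions(self):
        return self._divisions

    @divisions.setter
    def divisions(self, value):
        if self._strict and not isinstance(value, tuple):
            # Second option: reject anything that is not already a tuple.
            raise TypeError(
                f"divisions must be a tuple, got {type(value).__name__}"
            )
        # First option: implicitly coerce any iterable to a tuple.
        self._divisions = tuple(value)


a = FrameSketch([None] * 5)   # list input is coerced
b = FrameSketch((None,) * 5)  # tuple input is kept as a tuple
print(a.divisions == b.divisions)  # True
```

With coercion, code that assembles `divisions` as a list keeps working, and every stored value ends up as a tuple. The strict variant would instead surface each list-producing call site as an error.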

cc @jsignell as I notice you are doing some work on `divisions` in #8379

EDIT:

To give additional context, I encountered this issue while debugging some breakage in dask-sql:

In some cases, when performing `JOIN` operations, dask-sql implicitly calls [`single_partition_join`](https://github.com/dask/dask/blob/f5881891505b9a2ba2da195befb11ad7b4c7bb23/dask/dataframe/multi.py#L389-L444) through `dd.merge`. Recently, #8341 did some refactoring to this function which, among other things, changed the divisions of the merged result from a tuple to a list (I don't think this was an intention of the PR, just a side effect).

This causes breakage later on in dask-sql if we attempt to subscript the result of this merge operation with a Series (something like `df[df[col].where(...)]`) whose divisions contain the same elements but are stored as a tuple, because `DataFrame.__getitem__` does a divisions check to decide whether or not to call `_maybe_align_partitions`:

https://github.com/dask/dask/blob/f5881891505b9a2ba2da195befb11ad7b4c7bb23/dask/dataframe/core.py#L4108-L4113

And `_maybe_align_partitions` does a divisions equality check to decide whether or not to actually `align_partitions` (which fails if not all `divisions` are known):

https://github.com/dask/dask/blob/f5881891505b9a2ba2da195befb11ad7b4c7bb23/dask/dataframe/multi.py#L166-L169

Here's a minimal reproducer of that particular issue:

```python
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({"a": list(range(40))})
ddf = dd.from_pandas(df, npartitions=4)

cond = ddf.a > 20

# set unknown but unequal divisions (list vs. tuple)
ddf.divisions = [None] * 5
cond.divisions = (None,) * 5

ddf[cond]
```
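
If no single type is enforced, one possible workaround for comparisons would be to normalize both sides before checking equality. The sketch below uses plain Python values and a hypothetical helper `divisions_match`, which is not part of Dask:

```python
def divisions_match(*all_divisions):
    """Hypothetical helper: compare divisions element-wise, ignoring list vs. tuple."""
    normalized = [tuple(d) for d in all_divisions]
    return all(d == normalized[0] for d in normalized)


frame_divisions = [None] * 5  # list, like the merged result
key_divisions = (None,) * 5   # tuple, like the Series used as a key

# The plain comparison reports a mismatch purely because of the types.
print(frame_divisions != key_divisions)  # True

# The normalized comparison sees identical elements.
print(divisions_match(frame_divisions, key_divisions))  # True
```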

For reference, the code behind the two permalinks above.

The check in `DataFrame.__getitem__` (`dask/dataframe/core.py`):

    if isinstance(key, Series):
        # do not perform dummy calculation, as columns will not be changed.
        if self.divisions != key.divisions:
            from .multi import _maybe_align_partitions

            self, key = _maybe_align_partitions([self, key])

The check in `_maybe_align_partitions` (`dask/dataframe/multi.py`):

    divisions = dfs[0].divisions
    if not all(df.divisions == divisions for df in dfs):
        dfs2 = iter(align_partitions(*dfs)[0])
        return [a if not isinstance(a, _Frame) else next(dfs2) for a in args]
