When divisions has repeats, set_index puts all data in the last partition instead of balancing it #8437

@gjoseph92

Description

Reported in https://stackoverflow.com/a/70178087/17100540 by @DahnJ.

Given imbalanced data, you want the same value to be split across multiple output partitions:

In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"A": [0] * 10, "B": range(10)})
In [4]: ddf = dd.from_pandas(df, npartitions=5)
In [5]: s = ddf.set_index("A")
In [6]: s.divisions
(0, 0, 0, 0, 0, 0)

However, when the shuffle actually happens, all the data ends up in the last possible partition:

In [7]: import dask
In [8]: dask.compute(*s.to_delayed())
Out[8]: 
(Empty DataFrame
 Columns: [B]
 Index: [],
 Empty DataFrame
 Columns: [B]
 Index: [],
 Empty DataFrame
 Columns: [B]
 Index: [],
 Empty DataFrame
 Columns: [B]
 Index: [],
    B
 A   
 0  2
 0  3
 0  4
 0  5
 0  8
 0  9
 0  0
 0  1
 0  6
 0  7)

EDIT: I realized that although set_index will calculate (0, 0, 0, 0, 0, 0) as good divisions, if you pass in divisions=(0, 0, 0, 0, 0, 0), you'll get ValueError: New division must be unique, except for the last element. So clearly there's some disagreement about whether repeated values are even valid divisions (xref #8393, cc @charlesbluca).

This, plus @SultanOrazbayev's last code snippet on SO, makes me think DataFrame generally doesn't support duplicate divisions, and it's simply a bug that set_index doesn't deduplicate the output of partition_quantiles. That said, I think we should support duplicate divisions, since it's a reasonable thing to need when you have imbalanced data.

The problem is in the set_partitions_pre step of set_index, where we calculate which output partition number each row should belong to. We call searchsorted on divisions (as a Series); s is the values we're reindexing by:

partitions = divisions.searchsorted(s, side="right") - 1

Because of side="right", when there are duplicate values in divisions, searchsorted returns the insertion point just past the last duplicate, so after subtracting 1 every row maps to the index of the last duplicate. That's why all the data ends up in the last possible partition. With side="left", it would be the other way around (all data in the first partition).
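The effect is easy to reproduce with the stdlib bisect module, whose bisect_right/bisect_left match searchsorted's side="right"/side="left":

```python
# Reproducing the behavior with stdlib bisect: bisect_right matches
# searchsorted(side="right"), bisect_left matches side="left".
from bisect import bisect_left, bisect_right

divisions = (0, 0, 0, 0, 0, 0)  # 6 divisions -> 5 output partitions

# side="right": insertion point past the last duplicate, so every 0
# maps to index 5, which presumably gets clipped to the last real
# partition (4).
print(bisect_right(divisions, 0) - 1)  # 5

# side="left": insertion point before the first duplicate, so every 0
# maps to the first partition instead.
print(bisect_left(divisions, 0) - 1)  # -1
```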

With some clever counting, I think we could deal with this while still using searchsorted. We'd probably want to remove duplicates from divisions and keep an auxiliary list of how many times each division repeats (basically compute a run-length encoding). Then, if a row's target division value is duplicated, we pick randomly among the N candidate partitions? Or something like that.
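A hypothetical sketch of that scheme, using stdlib bisect in place of searchsorted (the function name and the clipping rules are mine, not dask's):

```python
# Hypothetical sketch: spread rows whose index value equals a repeated
# division across all the partitions bounded by a copy of that value.
import random
from bisect import bisect_left, bisect_right

def assign_partition(value, divisions, rng=random):
    nparts = len(divisions) - 1  # n+1 divisions bound n partitions
    left = bisect_left(divisions, value)    # searchsorted side="left"
    right = bisect_right(divisions, value)  # searchsorted side="right"
    if right - left <= 1:
        # Unique (or absent) division value: keep the current
        # side="right" rule, clipped into the valid partition range.
        return min(max(right - 1, 0), nparts - 1)
    # Repeated division value: since divisions are per-partition
    # min/max bounds, any partition touching a copy is a valid target.
    return rng.randint(max(left - 1, 0), min(right - 1, nparts - 1))

assign_partition(0.5, (0, 1, 1, 2))  # -> 0 (unambiguous)
assign_partition(1, (0, 1, 1, 2))    # -> one of 0, 1, 2 at random
```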

The way the output partition is selected from N options is probably the trickiest part. Assuming that divisions represents an approximately uniform partitioning of the data, then I think picking from the N options at random would maintain that desired distribution.

Though I'm not sure how to handle the "edge" partitions. For divisions=(0, 1, 1, 2), a value of 1 could go to any of the three output partitions. However, the first and last partitions will also get 0s and 2s, respectively, whereas the middle partition will only get 1s. So to maintain uniform partition sizes, you'd want to send more 1s to the middle partition than the others—basically bias the random selection towards that? The problem is, you don't know how much to do so, because you don't know how many 1s there are relative to 0s and 2s, or even how many other values there are between 0 and 1.
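To make that concrete, here's a toy enumeration of the candidate partitions for divisions=(0, 1, 1, 2) (the candidates helper is illustrative only, not dask code):

```python
# For divisions = (0, 1, 1, 2) there are three partitions: partition 0
# spans [0, 1], partition 1 spans [1, 1], partition 2 spans [1, 2].
# Only the value 1 is ambiguous, and the middle partition can only
# ever receive 1s.
from collections import Counter
import random

def candidates(value):
    # Partitions that could legally hold `value` (illustrative helper).
    if value < 1:
        return [0]        # only the first partition covers [0, 1)
    if value > 1:
        return [2]        # only the last partition covers (1, 2]
    return [0, 1, 2]      # a 1 fits any of the three partitions

# Same divisions, different shares of 1s: a uniform random choice
# can't know how many 1s to divert to the middle partition to keep
# the partition sizes even.
rng = random.Random(0)
for ones in (2, 20):
    data = [0.5] * 5 + [1] * ones + [1.5] * 5
    sizes = Counter(rng.choice(candidates(v)) for v in data)
    print(dict(sorted(sizes.items())))
```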

Environment:

  • Dask version: a5aecac
  • Python version: 3.8.8
  • Operating System: macOS
  • Install method (conda, pip, source): source
