[Discussion] Don't compute divisions by default in `set_index`?

Calling `df.set_index('foo')` currently calculates all of `df` immediately, in order to pick values for `divisions`. This is helpful, but also makes it very easy to trigger a large computation, and the eagerness is inconsistent with Dask's lazy API.

Downsides of this are illustrated in [this blog post](https://targomo.medium.com/how-we-learned-to-love-dask-and-achieved-a-40x-speedup-aa14e72d99c0#534f) by @DahnJ. This default behavior makes it very easy for you to do something that's very not performant, and relatively hard to figure out why it's happening.

If you're running a whole script (versus a notebook with separate cells), it would not be immediately clear that `set_index` is even calling compute internally, and you'll be left puzzling over the dashboard about why the "calculation seems to re-set and begin again".

Additionally, since it's so easy to compute divisions automatically, users rarely interact with them. This makes the idea of setting them manually intimidating, when in fact it's actually pretty straightforward. (Hopefully better documentation helps with this https://github.com/dask/dask/pull/8379). And since users don't interact with them, they wouldn't realize there's even an option to avoid this computation until they read the docstring of `set_index`.

Basically, I think _not_ passing divisions to `set_index` should be considered an antipattern. There are only a few cases where it's unavoidable and you need to calculate divisions, and it's always far less performant.

Therefore, if we agree it's an antipattern, we should have our API guide users away from it (the "pit of success").

So I propose:
```python-traceback
>>> df.set_index('foo')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: No divisions specified. Pass `divisions=[...]` to specify the "dividing lines" used to split
the new index into partitions.

Alternatively, pass `divisions="compute"` to have Dask compute good divisions for you right now.
This can be expensive, since it computes the entire DataFrame eagerly.
If you use `divisions="compute"`, we recommend printing the `.divisions` of the resulting DataFrame
and copy-pasting them back into this line of code, so if you re-run it, divisions won't be re-calculated.

For more information, see https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.
```

This way:
1. Users have to be explicitly aware of the eager compute, and have to ask for it.
2. Users can't pick the antipattern without first being presented with the recommended alternative.
3. The `"compute"` in the code is hopefully a flag to other readers that some computation is going on; you don't have to know `set_index`'s default behavior to recognize this.

Of course, we'd have a deprecation cycle with a warning before this became an error. I'd also be excited to add some tools to help users:
1. Generate reasonable divisions when they know something about the data (save typing)
2. Check whether the divisions they've picked are good
3. Suggest "how" to pick divisions?

cc @jsignell @scharlottej13 @ian-r-rose @jrbourbeau @jcrist 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Discussion] Don't compute divisions by default in `set_index`? #8435

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Discussion] Don't compute divisions by default in set_index? #8435

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[Discussion] Don't compute divisions by default in `set_index`? #8435