Skip to content

[Discussion] Don't compute divisions by default in set_index? #8435

@gjoseph92

Description

@gjoseph92

Calling df.set_index('foo') currently calculates all of df immediately, in order to pick values for divisions. This is helpful, but also makes it very easy to trigger a large computation, and the eagerness is inconsistent with Dask's lazy API.

Downsides of this are illustrated in this blog post by @DahnJ. This default behavior makes it very easy for you to do something that's very not performant, and relatively hard to figure out why it's happening.

If you're running a whole script (versus a notebook with separate cells), it would not be immediately clear that set_index is even calling compute internally, and you'll be left puzzling over the dashboard about why the "calculation seems to re-set and begin again".

Additionally, since it's so easy to compute divisions automatically, users rarely interact with them. This makes the idea of setting them manually intimidating, when in fact it's actually pretty straightforward. (Hopefully better documentation helps with this #8379). And since users don't interact with them, they wouldn't realize there's even an option to avoid this computation until they read the docstring of set_index.

Basically, I think not passing divisions to set_index should be considered an antipattern. There are only a few cases where it's unavoidable and you need to calculate divisions, and it's always far less performant.

Therefore, if we agree it's an antipattern, we should have our API guide users away from it (the "pit of success").

So I propose:

>>> df.set_index('foo')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: No divisions specified. Pass `divisions=[...]` to specify the "dividing lines" used to split
the new index into partitions.

Alternatively, pass `divisions="compute"` to have Dask compute good divisions for you right now.
This can be expensive, since it computes the entire DataFrame eagerly.
If you use `divisions="compute"`, we recommend printing the `.divisions` of the resulting DataFrame
and copy-pasting them back into this line of code, so if you re-run it, divisions won't be re-calculated.

For more information, see https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.set_index.html.

This way:

  1. Users have to be explicitly aware of the eager compute, and have to ask for it.
  2. Users can't pick the antipattern without first being presented with the recommended alternative.
  3. The "compute" in the code is hopefully a flag to other readers that some computation is going on; you don't have to know set_index's default behavior to recognize this.

Of course, we'd have a deprecation cycle with a warning before this became an error. I'd also be excited to add some tools to help users:

  1. Generate reasonable divisions when they know something about the data (save typing)
  2. Check whether the divisions they've picked are good
  3. Suggest "how" to pick divisions?

cc @jsignell @scharlottej13 @ian-r-rose @jrbourbeau @jcrist

Metadata

Metadata

Assignees

No one assigned

    Labels

    dataframediscussionDiscussing a topic with no specific actions yetneeds attentionIt's been a while since this was pushed on. Needs attention from the owner or a maintainer.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions