
Non-index-based partitioning of Dask DataFrames #6246

@TomAugspurger

#6066 recently added a DataFrame.shuffle
method for partitioning a dataframe by one or more columns based on the hash of
the columns' values. All rows with an equal hash will end up in the same partition.
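
To make the partitioning rule concrete, here is a minimal pure-Python sketch of hash-based partitioning on one or more columns. It is not Dask's implementation: `stable_hash` and `hash_partition` are hypothetical names, rows are plain dicts, and CRC32 is only a stand-in for whatever hash function the shuffle actually uses.

```python
import zlib


def stable_hash(values):
    """Deterministic hash of a tuple of column values (CRC32 of its repr)."""
    return zlib.crc32(repr(values).encode("utf-8"))


def hash_partition(rows, on, npartitions):
    """Assign each row (a dict) to a partition by hashing the `on` columns."""
    partitions = [[] for _ in range(npartitions)]
    for row in rows:
        key = tuple(row[col] for col in on)
        partitions[stable_hash(key) % npartitions].append(row)
    return partitions


rows = [
    {"name": "alice", "x": 1},
    {"name": "bob", "x": 2},
    {"name": "alice", "x": 3},
    {"name": "carol", "x": 4},
    {"name": "bob", "x": 5},
]
parts = hash_partition(rows, on=["name"], npartitions=3)

# Rows sharing a key always land together, so each name
# shows up in exactly one partition.
for name in ("alice", "bob", "carol"):
    holders = [i for i, part in enumerate(parts)
               if any(r["name"] == name for r in part)]
    assert len(holders) == 1
```

Passing several column names in `on` hashes the tuple of their values, which is one way to read "partitioning by one or more columns".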

How should this be exposed to users through the DataFrame data model? At the
moment, .shuffle() just results in a complex task graph (and perhaps the
clearing of .divisions).

  1. How should this affect .divisions? The current semantics and implementation
    of .divisions don't handle hash-based column partitioning. The semantics
    are off, since we don't actually sort the data by the (hashed) values. And we
    might partition by multiple columns, so we might want a 2-D structure for
    whatever concept generalizes .divisions.
  2. As usual with data model questions: how should this affect the repr?
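
The mismatch described in point 1 can be illustrated with a small pure-Python sketch. It assumes integer keys and uses `key % npartitions` as a stand-in hash; `modulo_partition` and `ranges_are_ordered` are hypothetical helpers, and the ordering check is only a rough approximation of what sorted divisions assume.

```python
def modulo_partition(keys, npartitions):
    """Place each integer key in partition `key % npartitions` (stand-in hash)."""
    parts = [[] for _ in range(npartitions)]
    for key in keys:
        parts[key % npartitions].append(key)
    return parts


def ranges_are_ordered(parts):
    """True if each partition's max is <= the next partition's min,
    roughly the property that sorted divisions rely on."""
    bounds = [(min(p), max(p)) for p in parts if p]
    return all(prev_max <= next_min
               for (_, prev_max), (next_min, _) in zip(bounds, bounds[1:]))


hashed = modulo_partition(range(8), npartitions=2)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
by_value = [[0, 1, 2, 3], [4, 5, 6, 7]]             # range-partitioned by value

print(ranges_are_ordered(hashed))    # False: the value ranges interleave
print(ranges_are_ordered(by_value))  # True: boundaries 0, 4, 7 describe the layout
```

After hash partitioning, a list of boundary values no longer says where a given key lives; the partition is determined by the hash of the key instead.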

Sorry for just raising questions and no answers thus far. I'll think on it some more.

cc @rjzamora.
