Non-index-based partitioning of Dask DataFrames

https://github.com/dask/dask/pull/6066 recently added a `DataFrame.shuffle`
method for partitioning a dataframe by one or more columns based on the hash of
the columns' values. All rows with an equal hash will end up in the same partition.
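
To make that guarantee concrete, here is a minimal pure-Python sketch of hash partitioning. It is not Dask's implementation: the `hash_partition` helper, the use of `zlib.crc32` as the hash, and the sample rows are all assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, on, npartitions):
    """Assign each row (a dict) to a partition by hashing the `on` columns.

    Hypothetical helper for illustration only; CRC32 stands in for
    whatever hash a real implementation would use.
    """
    partitions = defaultdict(list)
    for row in rows:
        # Build a deterministic byte key from the selected column values.
        key = repr(tuple(row[col] for col in on)).encode()
        partitions[zlib.crc32(key) % npartitions].append(row)
    return dict(partitions)


rows = [
    {"name": "alice", "x": 1},
    {"name": "bob", "x": 2},
    {"name": "alice", "x": 3},
    {"name": "carol", "x": 4},
    {"name": "bob", "x": 5},
]
parts = hash_partition(rows, on=["name"], npartitions=3)

# Record which partition(s) each distinct key landed in.
homes = defaultdict(set)
for pid, part in parts.items():
    for row in part:
        homes[row["name"]].add(pid)
```

Each entry in `homes` contains exactly one partition id, which is the property the shuffle provides. Nothing in the result says how the keys are ordered across partitions.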

How should this be exposed to users through the DataFrame data model? At the
moment, `.shuffle()` just results in a complex task graph (and perhaps the
clearing of `.divisions`).

1. How should this affect `.divisions`? The current semantics and implementation
   of `.divisions` don't handle hash-based column partitioning. The semantics
   are off, since we don't actually sort the data by the (hashed) values. And we
   might partition by multiple columns, so whatever concept generalizes
   `.divisions` might need a 2-D structure.
2. As usual with data model questions: how should this affect the repr?
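
The mismatch in the first question can be sketched without Dask. `.divisions` describes partitions as sorted, non-overlapping ranges of index values, whereas after hashing, the key ranges held by different partitions generally interleave. In the sketch below, `partition_of` and the CRC32 hash are stand-ins I've assumed for illustration, not what Dask uses.

```python
import zlib


def partition_of(key, npartitions):
    """Hypothetical stand-in for a hash-based partition assignment."""
    return zlib.crc32(str(key).encode()) % npartitions


npartitions = 4

# Track the (min, max) key seen in each partition.
ranges = {}
for k in range(100):
    pid = partition_of(k, npartitions)
    lo, hi = ranges.get(pid, (k, k))
    ranges[pid] = (min(lo, k), max(hi, k))

# Divisions-style boundaries only exist if, once partitions are ordered by
# their smallest key, each partition ends before the next one begins.
ordered = sorted(ranges.values())
is_divisions_like = all(
    a_hi <= b_lo for (_, a_hi), (b_lo, _) in zip(ordered, ordered[1:])
)
```

Here `is_divisions_like` comes out `False`: the per-partition key ranges overlap, so no single sorted tuple of boundaries can describe where a given key lives. Any generalization would need to record the partitioning columns and the hashing scheme instead.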

Sorry for just raising questions and no answers thus far. I'll think on it some more.

cc @rjzamora.

Non-index-based partitioning of Dask DataFrames #6246

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Non-index-based partitioning of Dask DataFrames #6246

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions