-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Open
Labels
Description
#6066 recently added a DataFrame.shuffle
method for partitioning a dataframe by one or more columns based on the hash of
the column's values. All rows with an equal has will end up in the same partition.
How should this be exposed to users through the DataFrame data model? At the
moment, .shuffle() just results in a complex task graph (and perhaps the
clearing of .divisions).
- How should this affect
.divisions? The current semantics and implementation
of.divisionsdoesn't handle hash-based, column partitioning. The semantics
are off, since we don't actually sort the data by the (hashed) values. And we
might partition by multiple columns, so we might want a 2-D structure to
whatever concept generalizes.divisions. - As usual with data model questions: how should this affect the repr?
Sorry for just raising questions and no answers thus far. I'll think on it some more.
cc @rjzamora.
Reactions are currently unavailable