Conversation
    def _collapse(partition):
        return pd.Series(
            list(partition.itertuples(index=False, name=None)),
            index=partition.index,
            name=tuple(partition.columns),
        )
Oooo - It may make sense to precede this PR with a simpler PR to support multi-column sort_values using this trick.
@charlesbluca - Note that this approach is not as performant as direct DataFrame.quantiles/DataFrame.searchsorted support in pandas, but it should "unblock" multi-column sorting :)
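For readers following along, here is a minimal sketch of how the tuple-collapsing trick could drive a multi-column partition assignment. It reuses `_collapse` from the diff above. Everything else is an assumption made for illustration and not code from this PR or from Dask: the sample frame, the `npartitions` value, the evenly spaced divider selection, and the use of Python's `bisect` in place of a DataFrame-level `searchsorted`.

```python
import bisect

import pandas as pd


def _collapse(partition):
    # Same helper as in the diff: one tuple per row, so that several
    # columns can be compared lexicographically as a single key.
    return pd.Series(
        list(partition.itertuples(index=False, name=None)),
        index=partition.index,
        name=tuple(partition.columns),
    )


# Hypothetical sample data: column "a" alone has too few distinct
# values to split the rows evenly, so "b" is needed as a tie-breaker.
df = pd.DataFrame({"a": [2, 1, 2, 1, 3, 1], "b": [0, 5, 1, 2, 0, 9]})
keys = _collapse(df[["a", "b"]])

# Assumed divider selection: take evenly spaced keys from the sorted
# tuples (a stand-in for a proper quantile computation).
npartitions = 3
ordered = sorted(keys.tolist())
step = len(ordered) // npartitions
dividers = [ordered[i * step] for i in range(1, npartitions)]

# Plain-Python stand-in for searchsorted: find the output partition
# for each row by bisecting the divider list with the row's tuple key.
partition_ids = [bisect.bisect_right(dividers, key) for key in keys]
print(partition_ids)
```

Because every key is a Python tuple held in an object-dtype Series, each comparison goes through the interpreter, which is consistent with the performance caveat above.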
Yeah this looks nice - thanks for the heads up! I can start up a WIP using this in sort_values
Cool yeah! I expect this to take a while to work out. There are still some open questions about how things should behave. So anything that can come out of here and be useful is great!
@charlesbluca - I started exploring this a bit in this branch (couldn't help myself). It is quite slow compared to 0th column partitioning, but does seem to work for cases where multiple columns are required for sufficient repartitioning.
black dask / flake8 dask / isort dask
This first commit is pulled from @TomAugspurger's original branch: TomAugspurger@0e741e1
My plan is to try to keep moving forward with that work and raise NotImplementedError all over the place.