
WIP: Add even-odd block sort for dataframes#2367

Closed
chmp wants to merge 2 commits into dask:master from chmp:feature/sort_values

Conversation

@chmp
Contributor

@chmp chmp commented May 21, 2017

This PR implements #958. It uses a block-wise even-odd sort. Currently, only DataFrame.sort_values is implemented. I'd prefer to gauge interest before implementing Series.sort_values (which should take minimal effort).

The sort is performed in approximately npartitions iterations. In each iteration, pairs of neighboring partitions are merged, sorted, and split again. To propagate data across the whole frame, the pairing alternates between even and odd neighbors from one iteration to the next. Within each partition, pandas' default sort (quicksort) is used.
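The merge-sort-split scheme described above can be sketched eagerly on in-memory pandas blocks. This is a hypothetical standalone illustration of even-odd block sort, not the PR's actual dask-graph implementation; the function name and structure are my own:

```python
import pandas as pd


def even_odd_block_sort(blocks, by):
    """Globally sort a list of pandas DataFrames by column `by`.

    Eager sketch of even-odd block sort: sort each block locally,
    then run ~len(blocks) phases of merge-split between alternating
    even and odd neighbor pairs.
    """
    blocks = [b.sort_values(by) for b in blocks]  # local sort first
    p = len(blocks)
    for it in range(p):
        start = it % 2  # alternate even pairs (0,1),(2,3),... and odd pairs (1,2),(3,4),...
        for i in range(start, p - 1, 2):
            # merge two neighboring blocks, sort, and split again,
            # preserving the original block sizes
            merged = pd.concat([blocks[i], blocks[i + 1]]).sort_values(by)
            n_left = len(blocks[i])
            blocks[i] = merged.iloc[:n_left]
            blocks[i + 1] = merged.iloc[n_left:]
    return blocks
```

By the standard odd-even transposition argument, p phases suffice for p pre-sorted blocks, so concatenating the returned blocks yields a globally sorted frame.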

@mrocklin
Member

I'm curious if we can reuse the existing logic around shuffling and setting the index.

@chmp
Contributor Author

chmp commented May 21, 2017

TBH, I didn't look at existing functionality, but copied the code verbatim from a project of mine where I needed sort_values. Since I spent hardly any time on this PR, I'm perfectly fine with closing it, if you feel this implementation introduces too much code duplication.

EDIT: regarding the failed test: it is not caused by my contributions. All tests I added pass successfully. Not sure whether the timeout is a real issue, though.

@mrocklin
Member

I think that having a sort_values method is valuable, so I'd like to see this work continue. However, I do suspect that the implementation currently in dataframe/shuffle.py is worth looking at; it can likely be faster in some situations.

@chmp
Contributor Author

chmp commented May 22, 2017

You were right, it's quite simple to use set_index to implement sort_values. However, dataframes in which one value appears much more frequently than the others are problematic. Sorting them will create hugely unbalanced partitions:

>>> df = pd.DataFrame({'val': [0] * 100})
>>> ddf = dd.from_pandas(df, npartitions=10)
>>> len(ddf.set_index('val').reset_index().get_partition(9).compute())
100

For this reason, I made the set_index based implementation optional and kept my original one. If you'd like, I can also remove the alternative implementation altogether.

@mrocklin
Member

It may be that we can use some of the logic behind set_index without actually using set_index. My concern about the even-odd sorting technique is that it might involve sending many copies of the data around the network. Is this assumption correct? If not, can you provide a brief explanation of the costs of this algorithm?

The approach taken in set_index, or shuffle.py generally, might be worth looking into.
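The cost concern raised above can be made concrete with back-of-envelope arithmetic (my own estimate, not measurements from this PR): each of the ~npartitions iterations merges and re-splits every neighboring pair, so essentially all n rows are rewritten (and, on a cluster, potentially re-sent) once per iteration, giving O(n * p) total data movement versus roughly O(n) for an idealized one-pass shuffle:

```python
# Hypothetical cost model (assumption, not benchmarked): rows moved
# by each strategy for n rows across p partitions.
def even_odd_rows_moved(n_rows, n_partitions):
    # each of ~n_partitions iterations touches essentially all rows
    return n_rows * n_partitions


def shuffle_rows_moved(n_rows):
    # an idealized single-pass shuffle moves each row about once
    return n_rows


print(even_odd_rows_moved(10**6, 100))  # 100000000
print(shuffle_rows_moved(10**6))        # 1000000
```

Under this model, the even-odd approach moves p times more data than a shuffle, which grows with the partition count.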

@mrocklin
Member

For example maybe we shouldn't do even/odd but should break things up into groups of four or eight or some other number. Maybe it makes sense to re-partition data beforehand to reduce the number of copies. Etc. These sorts of problems and more have been handled in the code in shuffle.py. It would be nice to have solutions to these problems in sort_values as well and it would be nice not to have two copies of the same code lying about.

Thoughts on how to merge these approaches?

@chmp
Contributor Author

chmp commented Jun 7, 2017

Hm. Maybe it's best to let somebody else pick up this thread. Since I don't have access to machines to test on, I can only speculate about performance, which makes it really hard to adapt the implementation.

