WIP: Add even-odd block sort for dataframes#2367
WIP: Add even-odd block sort for dataframes#2367chmp wants to merge 2 commits intodask:masterfrom chmp:feature/sort_values
Conversation
|
I'm curious if we can reuse the existing logic around shuffling and setting the index. |
|
TBH, I didn't look at existing functionality, but copied the code verbatim from a project of mine where I needed EDIT: regarding the failed test: it is not caused by my code contributions. All tests added by me passed successfully. Note sure, whether the timeout is a real issue though. |
|
I think that having a |
|
You were right, it's quite simple to use >>> df = pd.DataFrame({'val': [0] * 100})
>>> ddf = dd.from_pandas(df, npartitions=10)
>>> len(set_index(degenerate_ddf, 'val').reset_index().get_partition(9).compute())
100For this reason, I made the set_index based implementation optional and kept my original one. If you'd like I can also remove the alternative implementation all-together. |
|
It may be that we can use some of the logic behind set_index without actually using set_index. My concern about the even-odd sorting technique is that it might involve sending many copies of the data around the network. Is this assumption correct? If not then can you provide a brief explanation of the costs of this algorithm? The approach taken in |
|
For example maybe we shouldn't do even/odd but should break things up into groups of four or eight or some other number. Maybe it makes sense to re-partition data beforehand to reduce the number of copies. Etc. These sorts of problems and more have been handled in the code in shuffle.py. It would be nice to have solutions to these problems in sort_values as well and it would be nice not to have two copies of the same code lying about. Thoughts on how to merge these approaches? |
|
Hm. Maybe it's best to let somebody else pickup that thread. Since I don't have access to machines to test on, I can only speculate about performance. That makes it really hard to adapt the implementation. |
This PR implements #958. It uses a block-wise even-odd sort. Currently, only
DataFrame.sort_valuesis implemented. I'd prefer to gauge interest first, before implementingSeries.sort_values(which should take minimal effort).The sort is performed in approximately
npartitionsiterations. For each iteration, two neighboring partitions are merged, sorted and split again. To transfer information, the neighborhoods are shifted for each iteration. Within each partition the pandas standard, i.e., quicksort, is used.