
WIP: Add even-odd block sort for dataframes#2367

Closed
chmp wants to merge 2 commits into dask:master from chmp:feature/sort_values

Conversation

@chmp
Contributor

@chmp chmp commented May 21, 2017

This PR implements #958. It uses a block-wise even-odd sort. Currently, only DataFrame.sort_values is implemented. I'd prefer to gauge interest before implementing Series.sort_values (which should take minimal effort).

The sort is performed in approximately npartitions iterations. In each iteration, pairs of neighboring partitions are merged, sorted, and split again. To propagate data across the whole frame, the pairing alternates between even and odd neighbors from one iteration to the next. Within each partition, pandas' default sort (quicksort) is used.
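The merge-sort-split scheme described above can be sketched eagerly on in-memory pandas blocks. This is a hypothetical standalone illustration of even-odd block sort, not the PR's actual dask-graph implementation; the function name and structure are my own:

```python
import pandas as pd


def even_odd_block_sort(blocks, by):
    """Globally sort a list of pandas DataFrames by column `by`.

    Eager sketch of even-odd block sort: sort each block locally,
    then run ~len(blocks) phases of merge-split between alternating
    even and odd neighbor pairs.
    """
    blocks = [b.sort_values(by) for b in blocks]  # local sort first
    p = len(blocks)
    for it in range(p):
        start = it % 2  # alternate even pairs (0,1),(2,3),... and odd pairs (1,2),(3,4),...
        for i in range(start, p - 1, 2):
            # merge two neighboring blocks, sort, and split again,
            # preserving the original block sizes
            merged = pd.concat([blocks[i], blocks[i + 1]]).sort_values(by)
            n_left = len(blocks[i])
            blocks[i] = merged.iloc[:n_left]
            blocks[i + 1] = merged.iloc[n_left:]
    return blocks
```

By the standard odd-even transposition argument, p phases suffice for p pre-sorted blocks, so concatenating the returned blocks yields a globally sorted frame.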

@mrocklin
Member

I'm curious if we can reuse the existing logic around shuffling and setting the index.

@chmp
Contributor Author

chmp commented May 21, 2017

TBH, I didn't look at existing functionality, but copied the code verbatim from a project of mine where I needed sort_values. Since I spent hardly any time on this PR, I'm perfectly fine with closing it, if you feel this implementation introduces too much code duplication.

EDIT: regarding the failed test: it is not caused by my contributions. All tests I added pass successfully. Not sure whether the timeout is a real issue, though.

@mrocklin
Member

I think that having a sort_values method is valuable, so I'd like to see this work continue. However, I do suspect that the implementation currently in dataframe/shuffle.py is worth looking at; it can likely be faster in some situations.

@chmp
Contributor Author

chmp commented May 22, 2017

You were right, it's quite simple to use set_index to implement sort_values. However, dataframes in which one value appears much more frequently than the others are problematic. Sorting them will create hugely unbalanced partitions:

>>> df = pd.DataFrame({'val': [0] * 100})
>>> ddf = dd.from_pandas(df, npartitions=10)
>>> len(ddf.set_index('val').reset_index().get_partition(9).compute())
100

For this reason, I made the set_index based implementation optional and kept my original one. If you'd like, I can also remove the alternative implementation altogether.

@mrocklin
Member

It may be that we can use some of the logic behind set_index without actually using set_index. My concern about the even-odd sorting technique is that it might involve sending many copies of the data around the network. Is this assumption correct? If not, can you provide a brief explanation of the costs of this algorithm?

The approach taken in set_index, or shuffle.py generally, might be worth looking into.
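The cost concern raised above can be made concrete with back-of-envelope arithmetic (my own estimate, not measurements from this PR): each of the ~npartitions iterations merges and re-splits every neighboring pair, so essentially all n rows are rewritten (and, on a cluster, potentially re-sent) once per iteration, giving O(n * p) total data movement versus roughly O(n) for an idealized one-pass shuffle:

```python
# Hypothetical cost model (assumption, not benchmarked): rows moved
# by each strategy for n rows across p partitions.
def even_odd_rows_moved(n_rows, n_partitions):
    # each of ~n_partitions iterations touches essentially all rows
    return n_rows * n_partitions


def shuffle_rows_moved(n_rows):
    # an idealized single-pass shuffle moves each row about once
    return n_rows


print(even_odd_rows_moved(10**6, 100))  # 100000000
print(shuffle_rows_moved(10**6))        # 1000000
```

Under this model, the even-odd approach moves p times more data than a shuffle, which grows with the partition count.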

@mrocklin
Member

For example maybe we shouldn't do even/odd but should break things up into groups of four or eight or some other number. Maybe it makes sense to re-partition data beforehand to reduce the number of copies. Etc. These sorts of problems and more have been handled in the code in shuffle.py. It would be nice to have solutions to these problems in sort_values as well and it would be nice not to have two copies of the same code lying about.

Thoughts on how to merge these approaches?

@chmp
Contributor Author

chmp commented Jun 7, 2017

Hm. Maybe it's best to let somebody else pick up this thread. Since I don't have access to machines to test on, I can only speculate about performance, which makes it really hard to adapt the implementation.

