-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10796: [C++] Implement optimized RecordBatch sorting #8890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b87ef64 to
661487a
Compare
|
cc @kou |
661487a to
07611a8
Compare
kou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Great!
left-to-right radix sort
Wow!
| [&](uint64_t index) { return !array.IsNull(index); }); | ||
| // Sort all nulls by second and following sort keys | ||
| // TODO: could we instead run an independent sort from the second key on | ||
| // this slice? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like ConcreteRecordBatchColumnSorter's next_column_?
It would work.
| private: | ||
| // TODO instead of resolving chunks for each column independently, we could | ||
| // split the table into RecordBatches and pay the cost of chunked indexing | ||
| // at the first column only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we always do it?
My understanding that each chunked array in a table can have different number of chunks. For example, the table is valid:
a: [[0, 1], [2, 3, 4]]
b: [[10], [11, 12], [13], [14]]
I'm not sure we can split the table into record batches efficiently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TableBatchReader can be used for that.
Add two RecordBatch sorting implementations:
Both implementations benefit from direct indexed access into the contiguous RecordBatch columns (as opposed to table sorting, which must index into the chunks).
Add some RecordBatch-sorting benchmarks.
Also, add and improve tests; and fix a bug related to sorting of NaNs and nulls.
Benchmarks (changes less than 10% in absolute value not shown):