ARROW-14165: [C++] Improve table sort performance #11273
Conversation
When sorting a table, rechunk it homogeneously as record batches, to pay the price of chunked indexing once for all columns. This helps performance when cardinality is low in the first sort column, yielding up to a 60% speedup on the set of sorting benchmarks.
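To make the mechanism concrete, here is a minimal sketch (not the code from this PR) of reading a possibly irregularly chunked table as uniform record batches with TableBatchReader; the batch size of 64 K rows is an arbitrary illustration:

#include <arrow/api.h>

// Iterate a chunked table as uniform record batches, so chunk boundaries are
// resolved once per batch instead of once per value comparison.
arrow::Status VisitAsUniformBatches(const arrow::Table& table) {
  arrow::TableBatchReader reader(table);
  reader.set_chunksize(64 * 1024);  // arbitrary uniform batch size for illustration
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader.ReadNext(&batch));
    if (batch == nullptr) break;  // end of table
    // Every column of `batch` is a plain Array here: comparisons within the
    // batch need no chunk-index lookups.
  }
  return arrow::Status::OK();
}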
@ursabot please benchmark
Benchmark runs are scheduled for baseline = 83e4591 and contender = eb0bb4e. Results will be available as each benchmark for each run completes.
ChunkedArrayVector columns;
columns.reserve(fields.size());
for (const auto& factory : column_factories) {
  columns.push_back(std::make_shared<ChunkedArray>(factory(length)));
}
auto table = Table::Make(schema, std::move(columns));
ASSERT_OK(table->ValidateFull());
Why not just construct the batch directly here? (It seems before, it reused the table since the chunking was uniform, but now we're building a new table - might as well just make the batch directly.)
Hmm, good point!
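One possible shape for that suggestion (a sketch, not necessarily what was committed; `fields`, `column_factories`, `length`, and `schema` are the helpers from the test snippet above, with each factory assumed to return a single Array):

ArrayVector arrays;
arrays.reserve(fields.size());
for (const auto& factory : column_factories) {
  arrays.push_back(factory(length));
}
auto batch = RecordBatch::Make(schema, length, std::move(arrays));
ASSERT_OK(batch->ValidateFull());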
// XXX this implementation is rather inefficient as it computes chunk indices
// at every comparison. Instead we should iterate over individual batches
// and remember ChunkLocation entries in the max-heap.
CC @aocsa (we can just file a follow-up for this)
Filed ARROW-14183 for this.
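For reference, the follow-up amounts to caching the resolved chunk position in each heap entry instead of recomputing it per comparison. A hypothetical sketch of such an entry (the struct and field names below are illustrative, not Arrow APIs):

struct HeapEntry {
  int64_t logical_index;   // row index over the whole chunked column
  int chunk_index;         // chunk containing the row, resolved once
  int64_t index_in_chunk;  // offset within that chunk, resolved once
};

// A comparison can then read values directly, e.g.
//   chunks[entry.chunk_index]->GetView(entry.index_in_chunk)
// instead of binary-searching chunk offsets for every logical index.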
When sorting a table, rechunk it homogeneously as record batches, to pay the price of chunked indexing once for all columns. This helps performance when cardinality is low in the first sort column, yielding up to a 60% speedup on the set of sorting benchmarks.

Closes apache#11273 from pitrou/ARROW-14165-table-sort

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: David Li <li.davidm96@gmail.com>
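The optimization is transparent to callers of the table sort API. A minimal usage sketch, assuming the standard arrow::compute entry points (SortIndices and Take) rather than anything specific to this patch; the helper name SortTableByColumn is only for illustration:

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Sort a table by one column and materialize the reordered table.
arrow::Result<std::shared_ptr<arrow::Table>> SortTableByColumn(
    const std::shared_ptr<arrow::Table>& table, const std::string& column) {
  arrow::compute::SortOptions options({arrow::compute::SortKey(column)});
  ARROW_ASSIGN_OR_RAISE(
      std::shared_ptr<arrow::Array> indices,
      arrow::compute::SortIndices(arrow::Datum(table), options));
  ARROW_ASSIGN_OR_RAISE(
      arrow::Datum sorted,
      arrow::compute::Take(arrow::Datum(table), arrow::Datum(indices)));
  return sorted.table();
}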