
Conversation

@pitrou (Member) commented Sep 29, 2021

When sorting a table, rechunk it homogeneously as record batches, to pay the price of chunked indexing once for all columns.

This helps performance when cardinality is low in the first sort column, yielding up to a 60% speedup on the set of sorting benchmarks.
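The rationale above can be sketched outside Arrow: resolving a logical row index in a chunked column requires a binary search over cumulative chunk offsets, and when every column shares the same chunk layout that search happens once per row instead of once per column. A minimal illustrative sketch — not Arrow's implementation; the name ChunkLocation merely echoes Arrow's type of the same name:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Resolve a logical row index into (chunk index, index within chunk)
// by binary search over the cumulative chunk offsets.
std::pair<std::size_t, std::size_t> ChunkLocation(
    const std::vector<std::size_t>& chunk_lengths, std::size_t logical_index) {
  std::vector<std::size_t> offsets;
  std::size_t total = 0;
  for (std::size_t n : chunk_lengths) {
    offsets.push_back(total);
    total += n;
  }
  // Find the last chunk whose starting offset is <= logical_index.
  auto it = std::upper_bound(offsets.begin(), offsets.end(), logical_index);
  std::size_t chunk = static_cast<std::size_t>(it - offsets.begin()) - 1;
  return {chunk, logical_index - offsets[chunk]};
}
```

With heterogeneous chunking, comparing row i across k columns needs k such lookups; after rechunking the table so all columns share one layout, a single lookup serves every column.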

@pitrou (Member, Author) commented Sep 29, 2021

@ursabot please benchmark

@github-actions commented

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@ursabot commented Sep 29, 2021

Benchmark runs are scheduled for baseline = 83e4591 and contender = eb0bb4e. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

Comment on lines 1886 to 1893
// Assemble the table under test from per-column chunked arrays.
ChunkedArrayVector columns;
columns.reserve(fields.size());
for (const auto& factory : column_factories) {
columns.push_back(std::make_shared<ChunkedArray>(factory(length)));
}
auto table = Table::Make(schema, std::move(columns));
ASSERT_OK(table->ValidateFull());

Member commented:

Why not just construct the batch directly here? (It seems before, it reused the table since the chunking was uniform, but now we're building a new table - might as well just make the batch directly.)

@pitrou (Member, Author) replied:

Hmm, good point!


// XXX this implementation is rather inefficient as it computes chunk indices
// at every comparison. Instead we should iterate over individual batches
// and remember ChunkLocation entries in the max-heap.
Member commented:

CC @aocsa (we can just file a follow-up for this)

Member replied:

Filed ARROW-14183 for this.
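The follow-up idea in the quoted comment — remembering chunk locations in the heap instead of recomputing chunk indices at every comparison — can be sketched as a k-way merge whose heap entries carry (value, chunk index, index within chunk). This is an illustrative sketch only, not ARROW-14183's actual code; Arrow's comment mentions a max-heap, while this uses a min-heap for ascending output:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Merge pre-sorted chunks. Each heap entry remembers where it came from
// (chunk index, index within chunk), so advancing to the next element
// never requires resolving a logical index back to a chunk.
std::vector<int> MergeSortedChunks(const std::vector<std::vector<int>>& chunks) {
  using Entry = std::tuple<int, std::size_t, std::size_t>;  // (value, chunk, offset)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (std::size_t c = 0; c < chunks.size(); ++c) {
    if (!chunks[c].empty()) heap.emplace(chunks[c][0], c, 0);
  }
  std::vector<int> out;
  while (!heap.empty()) {
    auto [value, c, i] = heap.top();
    heap.pop();
    out.push_back(value);
    // Push this chunk's next element, reusing the remembered location.
    if (i + 1 < chunks[c].size()) heap.emplace(chunks[c][i + 1], c, i + 1);
  }
  return out;
}
```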

@lidavidm lidavidm closed this in 2b001ac Sep 30, 2021
@pitrou pitrou deleted the ARROW-14165-table-sort branch September 30, 2021 13:04
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021

Closes apache#11273 from pitrou/ARROW-14165-table-sort

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: David Li <li.davidm96@gmail.com>
