
Conversation

@pitrou (Member) commented Sep 29, 2021

When sorting a table, rechunk it homogeneously as record batches, to pay the price of chunked indexing once for all columns.

This helps performance when cardinality is low in the first sort column, yielding up to a 60% speedup on the set of sorting benchmarks.
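The rationale above can be sketched outside Arrow: resolving a logical row index in a chunked column requires a binary search over cumulative chunk offsets, and when every column shares the same chunk layout that search happens once per row instead of once per column. A minimal illustrative sketch — not Arrow's implementation; the name ChunkLocation merely echoes Arrow's type of the same name:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Resolve a logical row index into (chunk index, index within chunk)
// by binary search over the cumulative chunk offsets.
std::pair<std::size_t, std::size_t> ChunkLocation(
    const std::vector<std::size_t>& chunk_lengths, std::size_t logical_index) {
  std::vector<std::size_t> offsets;
  std::size_t total = 0;
  for (std::size_t n : chunk_lengths) {
    offsets.push_back(total);
    total += n;
  }
  // Find the last chunk whose starting offset is <= logical_index.
  auto it = std::upper_bound(offsets.begin(), offsets.end(), logical_index);
  std::size_t chunk = static_cast<std::size_t>(it - offsets.begin()) - 1;
  return {chunk, logical_index - offsets[chunk]};
}
```

With heterogeneous chunking, comparing row i across k columns needs k such lookups; after rechunking the table so all columns share one layout, a single lookup serves every column.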

@pitrou (Member, Author) commented Sep 29, 2021

@ursabot please benchmark

@github-actions commented

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@ursabot commented Sep 29, 2021

Benchmark runs are scheduled for baseline = 83e4591 and contender = eb0bb4e. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

Comment on lines 1886 to 1893
// Assemble the table under test from per-column chunked arrays.
ChunkedArrayVector columns;
columns.reserve(fields.size());
for (const auto& factory : column_factories) {
columns.push_back(std::make_shared<ChunkedArray>(factory(length)));
}
auto table = Table::Make(schema, std::move(columns));
ASSERT_OK(table->ValidateFull());

Member commented:

Why not just construct the batch directly here? (It seems before, it reused the table since the chunking was uniform, but now we're building a new table - might as well just make the batch directly.)

@pitrou (Member, Author) replied:

Hmm, good point!


// XXX this implementation is rather inefficient as it computes chunk indices
// at every comparison. Instead we should iterate over individual batches
// and remember ChunkLocation entries in the max-heap.
Member commented:

CC @aocsa (we can just file a follow-up for this)

Member replied:

Filed ARROW-14183 for this.
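The follow-up idea in the quoted comment — remembering chunk locations in the heap instead of recomputing chunk indices at every comparison — can be sketched as a k-way merge whose heap entries carry (value, chunk index, index within chunk). This is an illustrative sketch only, not ARROW-14183's actual code; Arrow's comment mentions a max-heap, while this uses a min-heap for ascending output:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Merge pre-sorted chunks. Each heap entry remembers where it came from
// (chunk index, index within chunk), so advancing to the next element
// never requires resolving a logical index back to a chunk.
std::vector<int> MergeSortedChunks(const std::vector<std::vector<int>>& chunks) {
  using Entry = std::tuple<int, std::size_t, std::size_t>;  // (value, chunk, offset)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (std::size_t c = 0; c < chunks.size(); ++c) {
    if (!chunks[c].empty()) heap.emplace(chunks[c][0], c, 0);
  }
  std::vector<int> out;
  while (!heap.empty()) {
    auto [value, c, i] = heap.top();
    heap.pop();
    out.push_back(value);
    // Push this chunk's next element, reusing the remembered location.
    if (i + 1 < chunks[c].size()) heap.emplace(chunks[c][i + 1], c, i + 1);
  }
  return out;
}
```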

@lidavidm lidavidm closed this in 2b001ac Sep 30, 2021
@pitrou pitrou deleted the ARROW-14165-table-sort branch September 30, 2021 13:04
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021

Closes apache#11273 from pitrou/ARROW-14165-table-sort

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: David Li <li.davidm96@gmail.com>
