Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Sep 28, 2021

Use the same strategy as for chunked array sorting:

  • first sort each RecordBatch individually, taking advantage of data contiguity for fast indexing
  • then merge sorted batches recursively, using slower chunked indexing

Benchmarks show up to 200% speedups on some benchmark parameters, along with very minor regressions.

@github-actions
Copy link

Use the same strategy as for chunked array sorting:
- first sort each RecordBatch individually, taking advantage of
  data contiguity for fast indexing
- then merge sorted batches recursively, using slower chunked
  indexing

Benchmarks show up to 200% speedups on some benchmark parameters, along with very minor regressions.
@pitrou pitrou force-pushed the ARROW-10898-table-merge-sort branch from a24e0eb to f25bf77 Compare September 28, 2021 18:54
@pitrou
Copy link
Member Author

pitrou commented Sep 28, 2021

@ursabot please benchmark

@ursabot
Copy link

ursabot commented Sep 28, 2021

Benchmark runs are scheduled for baseline = 8dbe574 and contender = f25bf77. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@pitrou pitrou closed this in a46d262 Sep 29, 2021
@pitrou pitrou deleted the ARROW-10898-table-merge-sort branch September 29, 2021 07:12
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Use the same strategy as for chunked array sorting:
- first sort each RecordBatch individually, taking advantage of data contiguity for fast indexing
- then merge sorted batches recursively, using slower chunked indexing

Benchmarks show up to 200% speedups on some benchmark parameters, along with very minor regressions.

Closes apache#11262 from pitrou/ARROW-10898-table-merge-sort

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants