Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Dec 10, 2020

Add two RecordBatch sorting implementations:

  • A single-pass left-to-right radix sort that's fast up to ~8 sort keys
  • A single-pass multiple-key-comparing sort that gives decent performance for large numbers of sort keys

Both implementations benefit from direct indexed access into the contiguous RecordBatch columns (as opposed to table sorting, which must index into the chunks).

Add some RecordBatch-sorting benchmarks.

Also, add and improve tests; and fix a bug related to sorting of NaNs and nulls.

Benchmarks (changes less than 10% in absolute value not shown):

                                           benchmark            baseline           contender  change %                                                                                                                                                            counters
10   RecordBatchSortIndicesInt64Narrow/1048576/100/8    1.482m items/sec    5.410m items/sec   265.083    {'run_name': 'RecordBatchSortIndicesInt64Narrow/1048576/100/8', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 1}
20     RecordBatchSortIndicesInt64Narrow/1048576/0/8    1.524m items/sec    5.478m items/sec   259.478      {'run_name': 'RecordBatchSortIndicesInt64Narrow/1048576/0/8', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 1}
60   RecordBatchSortIndicesInt64Narrow/1048576/100/2    2.276m items/sec    7.803m items/sec   242.839    {'run_name': 'RecordBatchSortIndicesInt64Narrow/1048576/100/2', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 2}
21     RecordBatchSortIndicesInt64Narrow/1048576/0/2    2.340m items/sec    7.802m items/sec   233.369      {'run_name': 'RecordBatchSortIndicesInt64Narrow/1048576/0/2', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 2}
23     RecordBatchSortIndicesInt64Wide/1048576/100/2    4.673m items/sec    9.867m items/sec   111.164      {'run_name': 'RecordBatchSortIndicesInt64Wide/1048576/100/2', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
61       RecordBatchSortIndicesInt64Wide/1048576/0/2    4.677m items/sec    9.820m items/sec   109.971        {'run_name': 'RecordBatchSortIndicesInt64Wide/1048576/0/2', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
35     RecordBatchSortIndicesInt64Wide/1048576/100/8    4.680m items/sec    9.822m items/sec   109.850      {'run_name': 'RecordBatchSortIndicesInt64Wide/1048576/100/8', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
55       RecordBatchSortIndicesInt64Wide/1048576/0/8    4.755m items/sec    9.933m items/sec   108.895        {'run_name': 'RecordBatchSortIndicesInt64Wide/1048576/0/8', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
59      RecordBatchSortIndicesInt64Wide/1048576/0/16    4.794m items/sec    8.408m items/sec    75.389       {'run_name': 'RecordBatchSortIndicesInt64Wide/1048576/0/16', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
16    RecordBatchSortIndicesInt64Wide/1048576/100/16    4.733m items/sec    8.177m items/sec    72.780     {'run_name': 'RecordBatchSortIndicesInt64Wide/1048576/100/16', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 3}
29    RecordBatchSortIndicesInt64Narrow/1048576/0/16    1.640m items/sec    2.627m items/sec    60.146     {'run_name': 'RecordBatchSortIndicesInt64Narrow/1048576/0/16', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 1}
9   RecordBatchSortIndicesInt64Narrow/1048576/100/16    1.559m items/sec    2.342m items/sec    50.201   {'run_name': 'RecordBatchSortIndicesInt64Narrow/1048576/100/16', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 1}
4          TableSortIndicesInt64Narrow/1048576/0/2/1    2.415m items/sec    2.699m items/sec    11.723          {'run_name': 'TableSortIndicesInt64Narrow/1048576/0/2/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 2}
51        TableSortIndicesInt64Narrow/1048576/0/2/32    1.814m items/sec    2.023m items/sec    11.513         {'run_name': 'TableSortIndicesInt64Narrow/1048576/0/2/32', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 1}
49        TableSortIndicesInt64Narrow/1048576/0/16/4    1.542m items/sec    1.717m items/sec    11.361         {'run_name': 'TableSortIndicesInt64Narrow/1048576/0/16/4', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 1}
30         TableSortIndicesInt64Narrow/1048576/0/2/4    2.272m items/sec    2.516m items/sec    10.733          {'run_name': 'TableSortIndicesInt64Narrow/1048576/0/2/4', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 2}
25         TableSortIndicesInt64Narrow/1048576/0/8/4    1.542m items/sec    1.706m items/sec    10.628          {'run_name': 'TableSortIndicesInt64Narrow/1048576/0/8/4', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 1}
11        TableSortIndicesInt64Narrow/1048576/0/16/1    1.691m items/sec    1.866m items/sec    10.316         {'run_name': 'TableSortIndicesInt64Narrow/1048576/0/16/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 1}
12         TableSortIndicesInt64Narrow/1048576/0/8/1    1.683m items/sec    1.856m items/sec    10.280          {'run_name': 'TableSortIndicesInt64Narrow/1048576/0/8/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 1}
[...]
6      RecordBatchSortIndicesInt64Narrow/1048576/0/1  185.050m items/sec  164.579m items/sec   -11.062    {'run_name': 'RecordBatchSortIndicesInt64Narrow/1048576/0/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 122}

@pitrou pitrou requested a review from bkietz December 10, 2020 17:52
@pitrou pitrou force-pushed the ARROW-10796-batch-sort branch 2 times, most recently from b87ef64 to 661487a Compare December 10, 2020 18:02
@pitrou
Copy link
Member Author

pitrou commented Dec 10, 2020

cc @kou

@pitrou pitrou force-pushed the ARROW-10796-batch-sort branch from 661487a to 07611a8 Compare December 10, 2020 18:30
@github-actions
Copy link

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
Great!

left-to-right radix sort

Wow!

[&](uint64_t index) { return !array.IsNull(index); });
// Sort all nulls by second and following sort keys
// TODO: could we instead run an independent sort from the second key on
// this slice?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like ConcreteRecordBatchColumnSorter's next_column_?
It would work.

private:
// TODO instead of resolving chunks for each column independently, we could
// split the table into RecordBatches and pay the cost of chunked indexing
// at the first column only.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we always do it?
My understanding that each chunked array in a table can have different number of chunks. For example, the table is valid:

a: [[0, 1], [2, 3, 4]]
b: [[10], [11, 12], [13], [14]]

I'm not sure we can split the table into record batches efficiently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TableBatchReader can be used for that.

@pitrou pitrou closed this in 8ae596a Dec 14, 2020
@pitrou pitrou deleted the ARROW-10796-batch-sort branch December 14, 2020 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants