-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-17623: [C++][Acero] Helper classes for ranking in window functions #14049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-17623: [C++][Acero] Helper classes for ranking in window functions #14049
Conversation
|
|
0c606d6 to
5ad5c25
Compare
5ad5c25 to
773826e
Compare
4de6fb4 to
82f17cc
Compare
82f17cc to
4ce8e8c
Compare
jvanstraten
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westonpace asked me to have a look at this and accompanying google docs. I think I managed to get through the docs alright, but with the additional optimizations in the merge tree code especially proved to be a bit too much for me to handle within the span of a couple hours... 😅 Nevertheless, I'm pretty sure I did find one bug (kEmptyRange used directly as a condition) and a couple of misspelling/typo and comment nits. Other than that this looks pretty solid for as far as I understand what's going on.
|
|
||
| // Storage for a bit vector to be used with BitVectorNavigator and its variants. | ||
| // | ||
| // Supports weaved bit vectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While these are (presumably) Arrow-internal classes, and while I managed to figure out from the implementation what certain terms mean, I feel like this code could do with just a little higher comment density. For example:
| // Supports weaved bit vectors. | |
| // Supports weaved bit vectors, a cache-efficient representation of multiple bit vectors | |
| // with independent navigators (num_children) of equal length (num_bits_per_child). |
(assuming I understood correctly from context) would have saved me a couple of minutes of staring at the implementation.
| } | ||
| }; | ||
|
|
||
| class BitMatrixWithCounts { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
band_size doesn't seem to really be explained anywhere;
| class BitMatrixWithCounts { | |
| // A matrix of bits represented by a bit vector with counts of size bit_count for each | |
| // row. Storage is organized into bands of weaved bit vectors for cache-efficiency, each | |
| // with band_size children. | |
| // | |
| class BitMatrixWithCounts { |
| void SplitSubset(int source_level, const T* source_level_vector, T* target_level_vector, | ||
| int64_t read_begin, int64_t read_end, ThreadContext& thread_ctx); | ||
|
|
||
| void SetMorselLoglen(int morsel_loglen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this function might be a remnant from debugging? Since it's private, dead, doesn't really do much, and calling it seems like a great way to break the data structure :)
Also, I don't think the word "morsel" is defined in a comment anywhere. I think I pieced it together from the implementation and the google doc (though it doesn't use the term anywhere either), but IMO that shouldn't be necessary for what seems to be a pretty core concept for the data structure.
| int64_t num_bits = source_end - source_begin; | ||
| if (num_bits == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is paranoid, but I personally always use < instead of == in instances like this. I imagine negative ranges should never happen, but I don't think replacing == with < would cost any additional processor cycles, and I'm pretty sure this would segfault rather spectacularly if given a negative range as things stand.
| // Compute boundaries of the children nodes (cummulative sum of children | ||
| // sizes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Compute boundaries of the children nodes (cummulative sum of children | |
| // sizes). | |
| // Compute boundaries of the children nodes (cumulative sum of children | |
| // sizes). |
| } | ||
| }; | ||
|
|
||
| enum class WindowFrameSequenceType { CUMMULATIVE, SLIDING, GENERIC }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This misspelling of "cumulative" does require refactoring.
| int64_t batch_begin, int64_t batch_end, int64_t* results) { | ||
| int64_t num_rows = tie_begins.bit_count(); | ||
|
|
||
| if (rank_type == RankType::ROW_NUMBER) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't a switch statement be more suitable for this group of if statements? Likewise in WindowRank_Framed1D::Eval().
|
|
||
| if (rank_type == RankType::RANK_TIES_HIGH) { | ||
| // To compute TIES_HIGH variant, we can reverse boundaries, | ||
| // global ranks by substracting their values from num_rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // global ranks by substracting their values from num_rows | |
| // global ranks by subtracting their values from num_rows |
| // Note that the number of rows considered to be in the frame depends | ||
| // whether the current row is inside or outside of the ranges defining its |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Note that the number of rows considered to be in the frame depends | |
| // whether the current row is inside or outside of the ranges defining its | |
| // Note that the number of rows considered to be in the frame depends on | |
| // whether the current row is inside or outside of the ranges defining its |
| case RankType::ROW_NUMBER: | ||
| // Count the number of rows in the frame with lower global rank. | ||
| // | ||
| if (global_ranks_within_frame[frame_row_index] < | ||
| global_ranks[current_row_index]) { | ||
| ++rank; | ||
| } | ||
| break; | ||
| case RankType::RANK_TIES_LOW: | ||
| // Count the number of rows in the frame with lower global rank. | ||
| // | ||
| if (global_ranks_within_frame[frame_row_index] < | ||
| global_ranks[current_row_index]) { | ||
| ++rank; | ||
| } | ||
| break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bodies of these cases are identical. I think that may be correct, but wouldn't it be better to be explicit about that by collapsing them?
| case RankType::ROW_NUMBER: | |
| // Count the number of rows in the frame with lower global rank. | |
| // | |
| if (global_ranks_within_frame[frame_row_index] < | |
| global_ranks[current_row_index]) { | |
| ++rank; | |
| } | |
| break; | |
| case RankType::RANK_TIES_LOW: | |
| // Count the number of rows in the frame with lower global rank. | |
| // | |
| if (global_ranks_within_frame[frame_row_index] < | |
| global_ranks[current_row_index]) { | |
| ++rank; | |
| } | |
| break; | |
| case RankType::ROW_NUMBER: | |
| case RankType::RANK_TIES_LOW: | |
| // Count the number of rows in the frame with lower global rank. | |
| // | |
| if (global_ranks_within_frame[frame_row_index] < | |
| global_ranks[current_row_index]) { | |
| ++rank; | |
| } | |
| break; |
alkis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pass by review. I only reviewed bit_vector_navigator.h.
| : container_(container), child_index_(child_index) {} | ||
|
|
||
| int64_t block_count() const { | ||
| return bit_util::CeilDiv(container_->num_bits_per_child_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CeilDiv takes arguments as signed integers so it ends up generating suboptimal code.
Because of the above, the codegen here and below has unnecessary branches: https://godbolt.org/z/6Me84646P
| int64_t last_word = word_count() - 1; | ||
| int num_bits_last_word = | ||
| static_cast<int>((bit_count() - 1) % BitVectorWithCountsBase::kBitsPerWord + 1); | ||
| uint64_t last_word_mask = ~0ULL >> (64 - num_bits_last_word); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If num_bits_last_word is 0 this will not generate 0 mask. Is this a bug?
| int64_t begin = 0; | ||
| int64_t end = block_count(); | ||
| while (end - begin > 1) { | ||
| int64_t middle = (begin + end) / 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in principle this can overflow so it should be begin + (end - begin) / 2
| int reject_left_half = | ||
| (rank >= top_counters[apply_stride_and_offset(middle)]) ? 1 : 0; | ||
| begin = begin + (middle - begin) * reject_left_half; | ||
| end = middle + (end - middle) * reject_left_half; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so unlike std::upper_bound this is a branchless binary search which will be good when the window is small and bad when the window is large. Do we know it always pays off?
| int reject_left_half = | ||
| (rank >= top_counters[apply_stride_and_offset(middle)]) ? 1 : 0; | ||
| begin = begin + (middle - begin) * reject_left_half; | ||
| end = middle + (end - middle) * reject_left_half; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why multiply when you can use conditional moves? cmov is 1 cycle and multiply is 3+.
| int64_t begin = 0; | ||
| int64_t end = block_count(); | ||
| while (end - begin > 1) { | ||
| int64_t middle = (begin + end) / 2; | ||
| int reject_left_half = | ||
| (rank >= top_counters[apply_stride_and_offset(middle)]) ? 1 : 0; | ||
| begin = begin + (middle - begin) * reject_left_half; | ||
| end = middle + (end - middle) * reject_left_half; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be written with only one cmov/mul? Something like:
int i = 0;
for (int n = block_count(); n > 1; ) {
int mid = n / 2;
i += (rank >= top_counts[apply_stride_and_offset(i + mid)]) ? mid : 0;
n -= mid;
}
caveat emptor: untested so could be buggy
| // Continue with binary search in the selected word. | ||
| // | ||
| uint64_t word = words[apply_stride_and_offset(word_index)]; | ||
| int pop_count_prefix = 0; | ||
| int bit_count_prefix = 0; | ||
| const uint64_t masks[6] = {0xFFFFFFFFULL, 0xFFFFULL, 0xFFULL, 0xFULL, 0x3ULL, 0x1ULL}; | ||
| int bit_count_left_half = 32; | ||
| for (int i = 0; i < 6; ++i) { | ||
| int pop_count_left_half = | ||
| static_cast<int>(ARROW_POPCOUNT64((word >> bit_count_prefix) & masks[i])); | ||
| int reject_left_half = (rank >= pop_count_prefix + pop_count_left_half) ? 1 : 0; | ||
| pop_count_prefix += reject_left_half * pop_count_left_half; | ||
| bit_count_prefix += reject_left_half * bit_count_left_half; | ||
| bit_count_left_half /= 2; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be done faster with tzcnt and pdep instructions wherever available except zen2 (pdep is slow there). Otherwise the binary search could look like above for a tighter loop.
| // | ||
| if (rank_begin < 0) { | ||
| int64_t num_ranks_to_skip = | ||
| std::min(rank_end, static_cast<int64_t>(0)) - rank_begin; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: static_cast<int64_t>(0) -> int64_t{0} or 0LL
|
Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍 |
Helper classes for evaluating different variants of ranks, as specified in window functions section of SQL Standard. The details of how they work can be found in ranking related sections of this document: https://docs.google.com/document/d/1oOCwr6IUfuBg9DdLyxUocEA-HaxVP9eOqvqtIDeJUwU/.
These helper classes will later be used to build window functions exec node inside Acero.
There are two data structures introduced in this change: bit vector navigator and merge tree. The details of these structures are described in the related sections of this document: https://docs.google.com/document/d/1VFeO1Ma2LArWU30fOWeY_vBtlBoJWUcHZFcsCNgYCIs/.
Merge tree is implemented in a way that already supports parallel execution, even though it is used in a single-threaded way inside the rank computing class.
There are three scenarios that require different processing for ranks: global ranks, framed 1D ranks and framed 2D ranks. In the first case all input rows use the entire input set as a frame within which the rank of each row is calculated. In the other two cases every row can use its own frame, that can be composed of up to 3 ranges of rows in the sort order related to frames. This sort order may or may not be the same as the sort order used for ranking, if it is we are dealing with 1D case (a single attribute is used for sorting for the purpose of defining the frame and for the purpose of determining the ranking) otherwise it is a 2D case (two different sort orders related to two different attributes are involved).
There are two sets of classes for computing ranks: one set is performance optimized and is meant as the implementation to use, the other set is a reference implementation used for comparison of results in testing and is impractically inefficient for many cases.