Skip to content

Conversation

@michalursa
Copy link
Contributor

@michalursa michalursa commented Sep 6, 2022

Helper classes for evaluating different variants of ranks, as specified in window functions section of SQL Standard. The details of how they work can be found in ranking related sections of this document: https://docs.google.com/document/d/1oOCwr6IUfuBg9DdLyxUocEA-HaxVP9eOqvqtIDeJUwU/.

These helper classes will later be used to build window functions exec node inside Acero.

There are two data structures introduced in this change: bit vector navigator and merge tree. The details of these structures are described in the related sections of this document: https://docs.google.com/document/d/1VFeO1Ma2LArWU30fOWeY_vBtlBoJWUcHZFcsCNgYCIs/.

Merge tree is implemented in a way that already supports parallel execution, even though it is used in a single-threaded way inside the rank computing class.

There are three scenarios that require different processing for ranks: global ranks, framed 1D ranks and framed 2D ranks. In the first case all input rows use the entire input set as a frame within which the rank of each row is calculated. In the other two cases every row can use its own frame, that can be composed of up to 3 ranges of rows in the sort order related to frames. This sort order may or may not be the same as the sort order used for ranking, if it is we are dealing with 1D case (a single attribute is used for sorting for the purpose of defining the frame and for the purpose of determining the ranking) otherwise it is a 2D case (two different sort orders related to two different attributes are involved).

There are two sets of classes for computing ranks: one set is performance optimized and is meant as the implementation to use, the other set is a reference implementation used for comparison of results in testing and is impractically inefficient for many cases.

@github-actions
Copy link

github-actions bot commented Sep 6, 2022

@github-actions
Copy link

github-actions bot commented Sep 6, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@michalursa michalursa force-pushed the ARROW-17623-window-func-ranking branch from 0c606d6 to 5ad5c25 Compare November 15, 2022 01:35
@michalursa michalursa force-pushed the ARROW-17623-window-func-ranking branch from 5ad5c25 to 773826e Compare November 15, 2022 01:44
@michalursa michalursa marked this pull request as ready for review November 15, 2022 01:44
@michalursa michalursa force-pushed the ARROW-17623-window-func-ranking branch 2 times, most recently from 4de6fb4 to 82f17cc Compare November 15, 2022 03:58
@michalursa michalursa force-pushed the ARROW-17623-window-func-ranking branch from 82f17cc to 4ce8e8c Compare November 15, 2022 05:04
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace asked me to have a look at this and accompanying google docs. I think I managed to get through the docs alright, but with the additional optimizations in the merge tree code especially proved to be a bit too much for me to handle within the span of a couple hours... 😅 Nevertheless, I'm pretty sure I did find one bug (kEmptyRange used directly as a condition) and a couple of misspelling/typo and comment nits. Other than that this looks pretty solid for as far as I understand what's going on.


// Storage for a bit vector to be used with BitVectorNavigator and its variants.
//
// Supports weaved bit vectors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While these are (presumably) Arrow-internal classes, and while I managed to figure out from the implementation what certain terms mean, I feel like this code could do with just a little higher comment density. For example:

Suggested change
// Supports weaved bit vectors.
// Supports weaved bit vectors, a cache-efficient representation of multiple bit vectors
// with independent navigators (num_children) of equal length (num_bits_per_child).

(assuming I understood correctly from context) would have saved me a couple of minutes of staring at the implementation.

}
};

class BitMatrixWithCounts {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

band_size doesn't seem to really be explained anywhere;

Suggested change
class BitMatrixWithCounts {
// A matrix of bits represented by a bit vector with counts of size bit_count for each
// row. Storage is organized into bands of weaved bit vectors for cache-efficiency, each
// with band_size children.
//
class BitMatrixWithCounts {

void SplitSubset(int source_level, const T* source_level_vector, T* target_level_vector,
int64_t read_begin, int64_t read_end, ThreadContext& thread_ctx);

void SetMorselLoglen(int morsel_loglen);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this function might be a remnant from debugging? Since it's private, dead, doesn't really do much, and calling it seems like a great way to break the data structure :)

Also, I don't think the word "morsel" is defined in a comment anywhere. I think I pieced it together from the implementation and the google doc (though it doesn't use the term anywhere either), but IMO that shouldn't be necessary for what seems to be a pretty core concept for the data structure.

Comment on lines +240 to +241
int64_t num_bits = source_end - source_begin;
if (num_bits == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is paranoid, but I personally always use < instead of == in instances like this. I imagine negative ranges should never happen, but I don't think replacing == with < would cost any additional processor cycles, and I'm pretty sure this would segfault rather spectacularly if given a negative range as things stand.

Comment on lines +415 to +416
// Compute boundaries of the children nodes (cummulative sum of children
// sizes).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Compute boundaries of the children nodes (cummulative sum of children
// sizes).
// Compute boundaries of the children nodes (cumulative sum of children
// sizes).

}
};

enum class WindowFrameSequenceType { CUMMULATIVE, SLIDING, GENERIC };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This misspelling of "cumulative" does require refactoring.

int64_t batch_begin, int64_t batch_end, int64_t* results) {
int64_t num_rows = tie_begins.bit_count();

if (rank_type == RankType::ROW_NUMBER) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't a switch statement be more suitable for this group of if statements? Likewise in WindowRank_Framed1D::Eval().


if (rank_type == RankType::RANK_TIES_HIGH) {
// To compute TIES_HIGH variant, we can reverse boundaries,
// global ranks by substracting their values from num_rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// global ranks by substracting their values from num_rows
// global ranks by subtracting their values from num_rows

Comment on lines +124 to +125
// Note that the number of rows considered to be in the frame depends
// whether the current row is inside or outside of the ranges defining its
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Note that the number of rows considered to be in the frame depends
// whether the current row is inside or outside of the ranges defining its
// Note that the number of rows considered to be in the frame depends on
// whether the current row is inside or outside of the ranges defining its

Comment on lines +404 to +419
case RankType::ROW_NUMBER:
// Count the number of rows in the frame with lower global rank.
//
if (global_ranks_within_frame[frame_row_index] <
global_ranks[current_row_index]) {
++rank;
}
break;
case RankType::RANK_TIES_LOW:
// Count the number of rows in the frame with lower global rank.
//
if (global_ranks_within_frame[frame_row_index] <
global_ranks[current_row_index]) {
++rank;
}
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bodies of these cases are identical. I think that may be correct, but wouldn't it be better to be explicit about that by collapsing them?

Suggested change
case RankType::ROW_NUMBER:
// Count the number of rows in the frame with lower global rank.
//
if (global_ranks_within_frame[frame_row_index] <
global_ranks[current_row_index]) {
++rank;
}
break;
case RankType::RANK_TIES_LOW:
// Count the number of rows in the frame with lower global rank.
//
if (global_ranks_within_frame[frame_row_index] <
global_ranks[current_row_index]) {
++rank;
}
break;
case RankType::ROW_NUMBER:
case RankType::RANK_TIES_LOW:
// Count the number of rows in the frame with lower global rank.
//
if (global_ranks_within_frame[frame_row_index] <
global_ranks[current_row_index]) {
++rank;
}
break;

Copy link
Contributor

@alkis alkis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass by review. I only reviewed bit_vector_navigator.h.

: container_(container), child_index_(child_index) {}

int64_t block_count() const {
return bit_util::CeilDiv(container_->num_bits_per_child_,
Copy link
Contributor

@alkis alkis Mar 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CeilDiv takes arguments as signed integers so it ends up generating suboptimal code.

Because of the above, the codegen here and below has unnecessary branches: https://godbolt.org/z/6Me84646P

int64_t last_word = word_count() - 1;
int num_bits_last_word =
static_cast<int>((bit_count() - 1) % BitVectorWithCountsBase::kBitsPerWord + 1);
uint64_t last_word_mask = ~0ULL >> (64 - num_bits_last_word);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If num_bits_last_word is 0 this will not generate 0 mask. Is this a bug?

int64_t begin = 0;
int64_t end = block_count();
while (end - begin > 1) {
int64_t middle = (begin + end) / 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in principle this can overflow so it should be begin + (end - begin) / 2

Comment on lines +182 to +186
int reject_left_half =
(rank >= top_counters[apply_stride_and_offset(middle)]) ? 1 : 0;
begin = begin + (middle - begin) * reject_left_half;
end = middle + (end - middle) * reject_left_half;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so unlike std::upper_bound this is a branchless binary search which will be good when the window is small and bad when the window is large. Do we know it always pays off?

Comment on lines +182 to +185
int reject_left_half =
(rank >= top_counters[apply_stride_and_offset(middle)]) ? 1 : 0;
begin = begin + (middle - begin) * reject_left_half;
end = middle + (end - middle) * reject_left_half;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why multiply when you can use conditional moves? cmov is 1 cycle and multiply is 3+.

Comment on lines +178 to +186
int64_t begin = 0;
int64_t end = block_count();
while (end - begin > 1) {
int64_t middle = (begin + end) / 2;
int reject_left_half =
(rank >= top_counters[apply_stride_and_offset(middle)]) ? 1 : 0;
begin = begin + (middle - begin) * reject_left_half;
end = middle + (end - middle) * reject_left_half;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be written with only one cmov/mul? Something like:

int i = 0;
for (int n = block_count(); n > 1; ) {
  int mid = n / 2;
  i += (rank >= top_counts[apply_stride_and_offset(i + mid)]) ? mid : 0;
  n -= mid;
}

caveat emptor: untested so could be buggy

Comment on lines +207 to +221
// Continue with binary search in the selected word.
//
uint64_t word = words[apply_stride_and_offset(word_index)];
int pop_count_prefix = 0;
int bit_count_prefix = 0;
const uint64_t masks[6] = {0xFFFFFFFFULL, 0xFFFFULL, 0xFFULL, 0xFULL, 0x3ULL, 0x1ULL};
int bit_count_left_half = 32;
for (int i = 0; i < 6; ++i) {
int pop_count_left_half =
static_cast<int>(ARROW_POPCOUNT64((word >> bit_count_prefix) & masks[i]));
int reject_left_half = (rank >= pop_count_prefix + pop_count_left_half) ? 1 : 0;
pop_count_prefix += reject_left_half * pop_count_left_half;
bit_count_prefix += reject_left_half * bit_count_left_half;
bit_count_left_half /= 2;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done faster with tzcnt and pdep instructions wherever available except zen2 (pdep is slow there). Otherwise the binary search could look like above for a tighter loop.

//
if (rank_begin < 0) {
int64_t num_ranks_to_skip =
std::min(rank_end, static_cast<int64_t>(0)) - rank_begin;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: static_cast<int64_t>(0) -> int64_t{0} or 0LL

@amol-
Copy link
Member

amol- commented Mar 30, 2023

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

@amol- amol- closed this Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants