ARROW-17623: [C++][Acero] Helper classes for ranking in window functions #14049

michalursa · 2022-09-06T06:42:52Z

Helper classes for evaluating different variants of ranks, as specified in window functions section of SQL Standard. The details of how they work can be found in ranking related sections of this document: https://docs.google.com/document/d/1oOCwr6IUfuBg9DdLyxUocEA-HaxVP9eOqvqtIDeJUwU/.

These helper classes will later be used to build window functions exec node inside Acero.

There are two data structures introduced in this change: bit vector navigator and merge tree. The details of these structures are described in the related sections of this document: https://docs.google.com/document/d/1VFeO1Ma2LArWU30fOWeY_vBtlBoJWUcHZFcsCNgYCIs/.

Merge tree is implemented in a way that already supports parallel execution, even though it is used in a single-threaded way inside the rank computing class.

There are three scenarios that require different processing for ranks: global ranks, framed 1D ranks and framed 2D ranks. In the first case all input rows use the entire input set as a frame within which the rank of each row is calculated. In the other two cases every row can use its own frame, that can be composed of up to 3 ranges of rows in the sort order related to frames. This sort order may or may not be the same as the sort order used for ranking, if it is we are dealing with 1D case (a single attribute is used for sorting for the purpose of defining the frame and for the purpose of determining the ranking) otherwise it is a 2D case (two different sort orders related to two different attributes are involved).

There are two sets of classes for computing ranks: one set is performance optimized and is meant as the implementation to use, the other set is a reference implementation used for comparison of results in testing and is impractically inefficient for many cases.

github-actions · 2022-09-06T06:55:11Z

https://issues.apache.org/jira/browse/ARROW-17623

github-actions · 2022-09-06T06:55:13Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

jvanstraten

@westonpace asked me to have a look at this and accompanying google docs. I think I managed to get through the docs alright, but with the additional optimizations in the merge tree code especially proved to be a bit too much for me to handle within the span of a couple hours... 😅 Nevertheless, I'm pretty sure I did find one bug (kEmptyRange used directly as a condition) and a couple of misspelling/typo and comment nits. Other than that this looks pretty solid for as far as I understand what's going on.

jvanstraten · 2022-11-16T15:37:12Z

cpp/src/arrow/compute/exec/window_functions/bit_vector_navigator.h

+
+// Storage for a bit vector to be used with BitVectorNavigator and its variants.
+//
+// Supports weaved bit vectors.


While these are (presumably) Arrow-internal classes, and while I managed to figure out from the implementation what certain terms mean, I feel like this code could do with just a little higher comment density. For example:

Suggested change

// Supports weaved bit vectors.

// Supports weaved bit vectors, a cache-efficient representation of multiple bit vectors

// with independent navigators (num_children) of equal length (num_bits_per_child).

(assuming I understood correctly from context) would have saved me a couple of minutes of staring at the implementation.

jvanstraten · 2022-11-16T15:40:12Z

cpp/src/arrow/compute/exec/window_functions/bit_vector_navigator.h

+  }
+};
+
+class BitMatrixWithCounts {


band_size doesn't seem to really be explained anywhere;

Suggested change

class BitMatrixWithCounts {

// A matrix of bits represented by a bit vector with counts of size bit_count for each

// row. Storage is organized into bands of weaved bit vectors for cache-efficiency, each

// with band_size children.

//

class BitMatrixWithCounts {

jvanstraten · 2022-11-16T15:47:29Z

cpp/src/arrow/compute/exec/window_functions/merge_tree.h

+  void SplitSubset(int source_level, const T* source_level_vector, T* target_level_vector,
+                   int64_t read_begin, int64_t read_end, ThreadContext& thread_ctx);
+
+  void SetMorselLoglen(int morsel_loglen);


I feel like this function might be a remnant from debugging? Since it's private, dead, doesn't really do much, and calling it seems like a great way to break the data structure :)

Also, I don't think the word "morsel" is defined in a comment anywhere. I think I pieced it together from the implementation and the google doc (though it doesn't use the term anywhere either), but IMO that shouldn't be necessary for what seems to be a pretty core concept for the data structure.

jvanstraten · 2022-11-16T15:53:19Z

cpp/src/arrow/compute/exec/window_functions/merge_tree.cc

+  int64_t num_bits = source_end - source_begin;
+  if (num_bits == 0) {


Maybe this is paranoid, but I personally always use < instead of == in instances like this. I imagine negative ranges should never happen, but I don't think replacing == with < would cost any additional processor cycles, and I'm pretty sure this would segfault rather spectacularly if given a negative range as things stand.

jvanstraten · 2022-11-16T15:53:56Z

cpp/src/arrow/compute/exec/window_functions/merge_tree.cc

+    // Compute boundaries of the children nodes (cummulative sum of children
+    // sizes).


Suggested change

// Compute boundaries of the children nodes (cummulative sum of children

// sizes).

// Compute boundaries of the children nodes (cumulative sum of children

// sizes).

jvanstraten · 2022-11-16T15:59:21Z

cpp/src/arrow/compute/exec/window_functions/window_frame.h

+  }
+};
+
+enum class WindowFrameSequenceType { CUMMULATIVE, SLIDING, GENERIC };


This misspelling of "cumulative" does require refactoring.

jvanstraten · 2022-11-16T16:00:22Z

cpp/src/arrow/compute/exec/window_functions/window_rank.cc

+                             int64_t batch_begin, int64_t batch_end, int64_t* results) {
+  int64_t num_rows = tie_begins.bit_count();
+
+  if (rank_type == RankType::ROW_NUMBER) {


Wouldn't a switch statement be more suitable for this group of if statements? Likewise in WindowRank_Framed1D::Eval().

jvanstraten · 2022-11-16T16:01:33Z

cpp/src/arrow/compute/exec/window_functions/window_rank.cc

+
+  if (rank_type == RankType::RANK_TIES_HIGH) {
+    // To compute TIES_HIGH variant, we can reverse boundaries,
+    // global ranks by substracting their values from num_rows


Suggested change

// global ranks by substracting their values from num_rows

// global ranks by subtracting their values from num_rows

jvanstraten · 2022-11-16T16:02:14Z

cpp/src/arrow/compute/exec/window_functions/window_rank.cc

+    // Note that the number of rows considered to be in the frame depends
+    // whether the current row is inside or outside of the ranges defining its


Suggested change

// Note that the number of rows considered to be in the frame depends

// whether the current row is inside or outside of the ranges defining its

// Note that the number of rows considered to be in the frame depends on

// whether the current row is inside or outside of the ranges defining its

jvanstraten · 2022-11-16T16:04:00Z

cpp/src/arrow/compute/exec/window_functions/window_rank.cc

+        case RankType::ROW_NUMBER:
+          // Count the number of rows in the frame with lower global rank.
+          //
+          if (global_ranks_within_frame[frame_row_index] <
+              global_ranks[current_row_index]) {
+            ++rank;
+          }
+          break;
+        case RankType::RANK_TIES_LOW:
+          // Count the number of rows in the frame with lower global rank.
+          //
+          if (global_ranks_within_frame[frame_row_index] <
+              global_ranks[current_row_index]) {
+            ++rank;
+          }
+          break;


The bodies of these cases are identical. I think that may be correct, but wouldn't it be better to be explicit about that by collapsing them?

Suggested change

case RankType::ROW_NUMBER:

// Count the number of rows in the frame with lower global rank.

//

if (global_ranks_within_frame[frame_row_index] <

global_ranks[current_row_index]) {

++rank;

}

break;

case RankType::RANK_TIES_LOW:

// Count the number of rows in the frame with lower global rank.

//

if (global_ranks_within_frame[frame_row_index] <

global_ranks[current_row_index]) {

++rank;

}

break;

case RankType::ROW_NUMBER:

case RankType::RANK_TIES_LOW:

// Count the number of rows in the frame with lower global rank.

//

if (global_ranks_within_frame[frame_row_index] <

global_ranks[current_row_index]) {

++rank;

}

break;

alkis

Pass by review. I only reviewed bit_vector_navigator.h.

alkis · 2023-03-05T23:05:30Z

cpp/src/arrow/compute/exec/window_functions/bit_vector_navigator.h

+      : container_(container), child_index_(child_index) {}
+
+  int64_t block_count() const {
+    return bit_util::CeilDiv(container_->num_bits_per_child_,


CeilDiv takes arguments as signed integers so it ends up generating suboptimal code.

Because of the above, the codegen here and below has unnecessary branches: https://godbolt.org/z/6Me84646P

alkis · 2023-03-05T23:10:09Z

cpp/src/arrow/compute/exec/window_functions/bit_vector_navigator.h

+    int64_t last_word = word_count() - 1;
+    int num_bits_last_word =
+        static_cast<int>((bit_count() - 1) % BitVectorWithCountsBase::kBitsPerWord + 1);
+    uint64_t last_word_mask = ~0ULL >> (64 - num_bits_last_word);


If num_bits_last_word is 0 this will not generate 0 mask. Is this a bug?

alkis · 2023-03-05T23:16:53Z

cpp/src/arrow/compute/exec/window_functions/bit_vector_navigator.h

+    int64_t begin = 0;
+    int64_t end = block_count();
+    while (end - begin > 1) {
+      int64_t middle = (begin + end) / 2;


in principle this can overflow so it should be begin + (end - begin) / 2

alkis · 2023-03-05T23:21:01Z

cpp/src/arrow/compute/exec/window_functions/bit_vector_navigator.h

+      int reject_left_half =
+          (rank >= top_counters[apply_stride_and_offset(middle)]) ? 1 : 0;
+      begin = begin + (middle - begin) * reject_left_half;
+      end = middle + (end - middle) * reject_left_half;
+    }


so unlike std::upper_bound this is a branchless binary search which will be good when the window is small and bad when the window is large. Do we know it always pays off?

alkis · 2023-03-05T23:22:22Z

cpp/src/arrow/compute/exec/window_functions/bit_vector_navigator.h

+      int reject_left_half =
+          (rank >= top_counters[apply_stride_and_offset(middle)]) ? 1 : 0;
+      begin = begin + (middle - begin) * reject_left_half;
+      end = middle + (end - middle) * reject_left_half;


why multiply when you can use conditional moves? cmov is 1 cycle and multiply is 3+.

alkis · 2023-03-05T23:28:16Z

cpp/src/arrow/compute/exec/window_functions/bit_vector_navigator.h

+    int64_t begin = 0;
+    int64_t end = block_count();
+    while (end - begin > 1) {
+      int64_t middle = (begin + end) / 2;
+      int reject_left_half =
+          (rank >= top_counters[apply_stride_and_offset(middle)]) ? 1 : 0;
+      begin = begin + (middle - begin) * reject_left_half;
+      end = middle + (end - middle) * reject_left_half;
+    }


Could this be written with only one cmov/mul? Something like:

int i = 0; for (int n = block_count(); n > 1; ) { int mid = n / 2; i += (rank >= top_counts[apply_stride_and_offset(i + mid)]) ? mid : 0; n -= mid; }

caveat emptor: untested so could be buggy

alkis · 2023-03-05T23:40:24Z

cpp/src/arrow/compute/exec/window_functions/bit_vector_navigator.h

+    // Continue with binary search in the selected word.
+    //
+    uint64_t word = words[apply_stride_and_offset(word_index)];
+    int pop_count_prefix = 0;
+    int bit_count_prefix = 0;
+    const uint64_t masks[6] = {0xFFFFFFFFULL, 0xFFFFULL, 0xFFULL, 0xFULL, 0x3ULL, 0x1ULL};
+    int bit_count_left_half = 32;
+    for (int i = 0; i < 6; ++i) {
+      int pop_count_left_half =
+          static_cast<int>(ARROW_POPCOUNT64((word >> bit_count_prefix) & masks[i]));
+      int reject_left_half = (rank >= pop_count_prefix + pop_count_left_half) ? 1 : 0;
+      pop_count_prefix += reject_left_half * pop_count_left_half;
+      bit_count_prefix += reject_left_half * bit_count_left_half;
+      bit_count_left_half /= 2;
+    }


This can be done faster with tzcnt and pdep instructions wherever available except zen2 (pdep is slow there). Otherwise the binary search could look like above for a tighter loop.

alkis · 2023-03-05T23:41:40Z

cpp/src/arrow/compute/exec/window_functions/bit_vector_navigator.h

+    //
+    if (rank_begin < 0) {
+      int64_t num_ranks_to_skip =
+          std::min(rank_end, static_cast<int64_t>(0)) - rank_begin;


nit: static_cast<int64_t>(0) -> int64_t{0} or 0LL

amol- · 2023-03-30T17:26:42Z

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

github-actions bot added the Component: C++ label Sep 6, 2022

michalursa force-pushed the ARROW-17623-window-func-ranking branch from 0c606d6 to 5ad5c25 Compare November 15, 2022 01:35

michalursa added 2 commits November 14, 2022 17:42

Window Functions adding helper classes for ranking functions

b970021

Fixing bugs in window rank functions

faf10fd

michalursa force-pushed the ARROW-17623-window-func-ranking branch from 5ad5c25 to 773826e Compare November 15, 2022 01:44

michalursa marked this pull request as ready for review November 15, 2022 01:44

michalursa force-pushed the ARROW-17623-window-func-ranking branch 2 times, most recently from 4de6fb4 to 82f17cc Compare November 15, 2022 03:58

Window Functions rewrite of helper classes for ranking

4ce8e8c

michalursa force-pushed the ARROW-17623-window-func-ranking branch from 82f17cc to 4ce8e8c Compare November 15, 2022 05:04

jvanstraten reviewed Nov 16, 2022

View reviewed changes

asfimport mentioned this pull request Nov 16, 2022

[C++][Acero] Window Functions add helper classes for ranking #32867

Open

alkis reviewed Mar 5, 2023

View reviewed changes

amol- closed this Mar 30, 2023

	// Supports weaved bit vectors.
	// Supports weaved bit vectors, a cache-efficient representation of multiple bit vectors
	// with independent navigators (num_children) of equal length (num_bits_per_child).

-class BitMatrixWithCounts {
+// A matrix of bits represented by a bit vector with counts of size bit_count for each
+// row. Storage is organized into bands of weaved bit vectors for cache-efficiency, each
+// with band_size children.
+//
+class BitMatrixWithCounts {

		int64_t num_bits = source_end - source_begin;
		if (num_bits == 0) {

		// Compute boundaries of the children nodes (cummulative sum of children
		// sizes).

	// global ranks by substracting their values from num_rows
	// global ranks by subtracting their values from num_rows

		// Note that the number of rows considered to be in the frame depends
		// whether the current row is inside or outside of the ranges defining its

ARROW-17623: [C++][Acero] Helper classes for ranking in window functions #14049

ARROW-17623: [C++][Acero] Helper classes for ranking in window functions #14049

Uh oh!

Conversation

michalursa commented Sep 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Sep 6, 2022

Uh oh!

github-actions bot commented Sep 6, 2022

Uh oh!

jvanstraten left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alkis left a comment

Choose a reason for hiding this comment

Uh oh!

alkis Mar 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

amol- commented Mar 30, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

michalursa commented Sep 6, 2022 •

edited

Loading

alkis Mar 5, 2023 •

edited

Loading