ARROW-12010: [C++][Compute] Improve performance of the hash table used in GroupIdentifier #9768

michalursa · 2021-03-22T08:35:27Z

This is a draft of the code that maps an arbitrary set of input columns, treated as the key of a grouping operation, to a vector of integer group identifiers (identical combinations of key column values receive the same id).
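To make the mapping concrete, here is a minimal sketch of assigning group ids to rows keyed by two columns. It is not the code in this PR: it uses `std::map` where the PR uses a custom hash table, and the name `AssignGroupIds` and the choice of two `int32_t` key columns are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical helper: assigns a dense 32-bit group id to each row, where a
// row's key is the pair (col_a[i], col_b[i]). Ids are handed out in order of
// first appearance, so equal key combinations always share an id.
std::vector<uint32_t> AssignGroupIds(const std::vector<int32_t>& col_a,
                                     const std::vector<int32_t>& col_b) {
  std::map<std::pair<int32_t, int32_t>, uint32_t> key_to_id;
  std::vector<uint32_t> ids;
  ids.reserve(col_a.size());
  for (std::size_t i = 0; i < col_a.size() && i < col_b.size(); ++i) {
    const auto key = std::make_pair(col_a[i], col_b[i]);
    // The id a new key would receive is the number of keys seen so far.
    const auto next_id = static_cast<uint32_t>(key_to_id.size());
    // emplace leaves an existing entry untouched and returns it.
    const auto it = key_to_id.emplace(key, next_id).first;
    ids.push_back(it->second);
  }
  return ids;
}
```

For the key columns `{1, 2, 1, 2}` and `{7, 7, 7, 8}`, rows 0 and 2 share a key and therefore share an id, while the other two rows each get their own.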

I will continue working on it and updating it with:

integration with the initial hash group-by implementation in the Arrow project, once it is finished and merged into master
unit tests
documentation

At this point group ids, row ids, offsets, and hash values are all 32-bit. Overflow checks are missing in the current version and still need to be added.

The entry point for id mapping is the GroupBy class. It uses three main modules: storage (defined in the groupby_storage* files), hashing (defined in the groupby_hash* files), and the hash table (defined in the groupby_map* files). Key values stored with the hash table are row-oriented. The storage code defines functions that convert from column-oriented storage to row-oriented storage and back. It also implements key comparison and appending keys to the incremental store.
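The column-to-row conversion mentioned above can be sketched roughly as follows. This is only an illustration for fixed-width columns without nulls. The `FixedColumn` struct and the `EncodeRows` function are hypothetical and are not the PR's storage classes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-width column: `width` bytes per value, stored back to back.
struct FixedColumn {
  std::size_t width;
  std::vector<uint8_t> data;
};

// Packs value i of every column next to each other, producing row-oriented
// bytes: row i occupies [i * row_size, (i + 1) * row_size).
std::vector<uint8_t> EncodeRows(const std::vector<FixedColumn>& cols,
                                std::size_t num_rows) {
  std::size_t row_size = 0;
  for (const auto& col : cols) row_size += col.width;

  std::vector<uint8_t> rows(row_size * num_rows);
  std::size_t offset_in_row = 0;
  for (const auto& col : cols) {
    // Process one column at a time, scattering its values into the rows.
    for (std::size_t i = 0; i < num_rows; ++i) {
      std::memcpy(rows.data() + i * row_size + offset_in_row,
                  col.data.data() + i * col.width, col.width);
    }
    offset_in_row += col.width;
  }
  return rows;
}
```

With the keys laid out this way, comparing two keys becomes a single contiguous byte comparison per row rather than one comparison per column.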

I plan to add a design doc in the form of a README file later on.

The individual modules and functions here have been unit tested and pass, but the unit tests are not included in this change yet.

github-actions · 2021-03-22T09:15:27Z

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename the pull request title to the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

github-actions · 2021-03-23T05:14:52Z

https://issues.apache.org/jira/browse/ARROW-12010

bkietz · 2021-04-20T15:12:42Z

cpp/src/arrow/CMakeLists.txt

+    list(APPEND ARROW_SRCS engine/key_hash_avx2.cc)
+    set_source_files_properties(engine/key_hash_avx2.cc PROPERTIES
+                                SKIP_PRECOMPILE_HEADERS ON)
+    set_source_files_properties(engine/key_hash_avx2.cc PROPERTIES
+                                COMPILE_FLAGS ${ARROW_AVX2_FLAG})


Let's extract this to a macro:

    macro(append_avx2_src SRC)
      if(ARROW_HAVE_RUNTIME_AVX2)
        list(APPEND ARROW_SRCS ${SRC})
        set_source_files_properties(${SRC} PROPERTIES SKIP_PRECOMPILE_HEADERS ON)
        set_source_files_properties(${SRC} PROPERTIES COMPILE_FLAGS ${ARROW_AVX2_FLAG})
      endif()
    endmacro()

bkietz · 2021-04-20T15:17:42Z

cpp/src/arrow/compute/kernels/hash_aggregate_test.cc

    ExpectConsume(*ExecBatch::Make(key_batch), expected);
  }

+  void AssertEquivalentIds(const Datum& expected, const Datum& actual) {


Please include a comment describing what this does compared to AssertDatumsEqual

bkietz · 2021-04-20T15:19:14Z

cpp/src/arrow/compute/kernels/hash_aggregate_test.cc

+        max_right_id = right_ids[i];
+      }
+    }
+    std::vector<bool> right_to_left_present;


std::vector can be sized and initialized on construction

Suggested change

std::vector<bool> right_to_left_present;

std::vector<bool> right_to_left_present(max_right_id + 1, false);
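The helper under review appears to compare group ids up to relabeling rather than for exact equality, judging by its name and the `right_to_left_present` bookkeeping. Under that assumption, a minimal sketch of such a check could look like this. `IdsAreEquivalent` is a hypothetical name, and the hash maps stand in for the vectors used in the test.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Returns true when `left` and `right` describe the same grouping, i.e. there
// is a one-to-one relabeling of ids that turns one vector into the other.
bool IdsAreEquivalent(const std::vector<uint32_t>& left,
                      const std::vector<uint32_t>& right) {
  if (left.size() != right.size()) return false;
  std::unordered_map<uint32_t, uint32_t> left_to_right;
  std::unordered_map<uint32_t, uint32_t> right_to_left;
  for (std::size_t i = 0; i < left.size(); ++i) {
    // emplace keeps the first mapping seen; later rows must agree with it
    // in both directions, otherwise the relabeling is not one-to-one.
    const auto l2r = left_to_right.emplace(left[i], right[i]).first;
    const auto r2l = right_to_left.emplace(right[i], left[i]).first;
    if (l2r->second != right[i] || r2l->second != left[i]) return false;
  }
  return true;
}
```

Checking both directions matters: `{0, 1, 0}` against `{0, 0, 0}` passes a left-to-right check alone, yet the two vectors describe different groupings.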

cpp/src/arrow/engine/groupby.h

cpp/src/arrow/engine/key_encode.h

bkietz · 2021-04-20T15:59:19Z

cpp/src/arrow/engine/util.h

+    --num_vectors_;
+  }
+  static constexpr int64_t padding = 64;
+  MemoryPool* pool_;


TempVectorStack doesn't seem to use this MemoryPool outside of init(), could you remove it and add a comment describing the use of this class?

cpp/src/arrow/engine/util.h

bkietz · 2021-04-20T16:02:26Z

cpp/src/arrow/engine/util_avx2.cc

+      // second
+      input = _mm256_shuffle_epi8(
+          input, _mm256_setr_epi64x(0x0e0c0a0806040200ULL, 0x0f0d0b0907050301ULL,
+                                    0x0e0c0a0806040200ULL, 0x0f0d0b0907050301ULL));


Please create constexpr variables for any magic numbers
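One way to act on this is to name the two control words and document what they select. The sketch below does that and adds a scalar emulation of the byte shuffle on a single 16-byte lane, so the meaning of the constants can be checked without AVX2. The constant names and `ShuffleLane` are hypothetical, and the emulation only assumes the usual per-lane byte-shuffle semantics.

```cpp
#include <array>
#include <cstdint>

// Named constants for the two 64-bit shuffle-control words in the snippet.
// Read byte by byte from the least significant end, the first selects source
// bytes 0, 2, ..., 14 and the second selects source bytes 1, 3, ..., 15.
constexpr uint64_t kSelectEvenBytes = 0x0e0c0a0806040200ULL;
constexpr uint64_t kSelectOddBytes = 0x0f0d0b0907050301ULL;

// Scalar emulation of a byte shuffle on one 16-byte lane: control_lo drives
// output bytes 0..7 and control_hi drives output bytes 8..15.
std::array<uint8_t, 16> ShuffleLane(const std::array<uint8_t, 16>& input,
                                    uint64_t control_lo, uint64_t control_hi) {
  std::array<uint8_t, 16> output{};
  for (int j = 0; j < 16; ++j) {
    const uint64_t word = (j < 8) ? control_lo : control_hi;
    const uint8_t control = static_cast<uint8_t>(word >> (8 * (j % 8)));
    // As in pshufb, a set high bit in the control byte zeroes the output byte;
    // otherwise the low four bits index into the source lane.
    output[j] = (control & 0x80) ? uint8_t{0} : input[control & 0x0f];
  }
  return output;
}
```

Applied to a lane holding the bytes 0 through 15, these controls gather the even-indexed bytes into the low half and the odd-indexed bytes into the high half.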

bkietz · 2021-04-20T16:06:50Z

cpp/src/arrow/engine/key_map.h

+// ** The order of bytes is reversed - highest byte represents 0th bucket.
+// No other part of data structure uses this reversed order.
+//
+class SwissTable {


Please add a higher-level doc comment describing this class and its utility. Specifically, describe how equality comparison and appending/storage of new entries are deferred to callbacks.

Separately, add a comment detailing its implementation and usage of stamps for vectorized probing.
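To illustrate what "deferred to callbacks" can mean here, the following is a deliberately simplified sketch and not the SwissTable in this PR. It uses plain linear probing, has no stamps or vectorized probing, and the class name `CallbackHashTable` and its interface are assumptions. The point is only that the table stores hashes and group ids, while key comparison and key storage live with the caller.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical, much simplified table with deferred callbacks: it stores only
// (hash, group id) pairs and never sees the keys themselves.
class CallbackHashTable {
 public:
  // equal(row, group_id): does input row `row` match the key stored for group_id?
  using EqualFn = std::function<bool(uint32_t, uint32_t)>;
  // append(row): store the key of input row `row` and return its new group id.
  using AppendFn = std::function<uint32_t(uint32_t)>;

  explicit CallbackHashTable(std::size_t log_slots)
      : slots_(std::size_t{1} << log_slots) {}

  // Returns the group id for `row`, inserting a new group when no match exists.
  // The caller must keep the number of groups below the number of slots,
  // otherwise probing for a missing key never terminates.
  uint32_t FindOrInsert(uint32_t row, uint32_t hash, const EqualFn& equal,
                        const AppendFn& append) {
    const std::size_t mask = slots_.size() - 1;
    for (std::size_t pos = hash & mask;; pos = (pos + 1) & mask) {
      Slot& slot = slots_[pos];
      if (!slot.used) {
        slot.used = true;
        slot.hash = hash;
        slot.group_id = append(row);
        return slot.group_id;
      }
      // Cheap hash comparison first; the callback settles real key equality.
      if (slot.hash == hash && equal(row, slot.group_id)) return slot.group_id;
    }
  }

 private:
  struct Slot {
    bool used = false;
    uint32_t hash = 0;
    uint32_t group_id = 0;
  };
  std::vector<Slot> slots_;
};
```

A caller would typically pass lambdas that compare an input row against its own row-oriented key store and append to that store, so the table stays independent of key types and layout.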

bkietz · 2021-04-20T16:33:05Z

cpp/src/arrow/engine/key_encode.h

+
+  class EncoderNulls {
+   public:
+    static void Encode(KeyRowArray& rows, const std::vector<KeyColumnArray>& cols,


Please avoid mutable references. For out arguments, please use a mutable pointer

Suggested change

static void Encode(KeyRowArray& rows, const std::vector<KeyColumnArray>& cols,

static void Encode(KeyRowArray* rows, const std::vector<KeyColumnArray>& cols,

bkietz · 2021-04-20T16:51:26Z

cpp/src/arrow/engine/key_encode.cc

+          *reinterpret_cast<uint16_t*>(row_base + i * row_size) =
+              reinterpret_cast<const uint16_t*>(col_base)[i];
+        }


There are a lot of unaligned accesses like this one. This is undefined behavior in C++ and is not supported on all platforms. Could we use SafeLoadAs and SafeStore? If those produce a performance regression, can we optimize them?
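For reference, the usual way to make such accesses well defined is to go through `std::memcpy`, which compilers typically turn into a single load or store on platforms that allow unaligned access. The sketch below is self-contained and does not use the SafeLoadAs and SafeStore helpers named above. `LoadUnaligned`, `StoreUnaligned`, and `CopyUint16Column` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Reads a T from a possibly unaligned address without undefined behavior.
template <typename T>
T LoadUnaligned(const uint8_t* src) {
  static_assert(std::is_trivially_copyable<T>::value,
                "T must be trivially copyable");
  T value;
  std::memcpy(&value, src, sizeof(T));
  return value;
}

// Writes a T to a possibly unaligned address without undefined behavior.
template <typename T>
void StoreUnaligned(uint8_t* dst, T value) {
  static_assert(std::is_trivially_copyable<T>::value,
                "T must be trivially copyable");
  std::memcpy(dst, &value, sizeof(T));
}

// Mirrors the loop in the diff: copies a column of 16-bit values into the
// start of each row, where rows are row_size bytes apart and row_size may be
// odd, so the destination is not guaranteed to be 2-byte aligned.
void CopyUint16Column(uint8_t* row_base, const uint8_t* col_base,
                      std::size_t row_size, std::size_t num_rows) {
  for (std::size_t i = 0; i < num_rows; ++i) {
    StoreUnaligned<uint16_t>(
        row_base + i * row_size,
        LoadUnaligned<uint16_t>(col_base + i * sizeof(uint16_t)));
  }
}
```

Whether this costs anything compared with the `reinterpret_cast` version would need to be measured on the hash and encode paths.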

cpp/src/arrow/compute/exec/key_hash.cc

bkietz · 2021-05-19T14:17:37Z

@michalursa looks like tests are hanging on our big-endian CI. Is this quick to address, or should we leave this for a follow-up?

nealrichardson · 2021-05-19T20:40:51Z

> @michalursa looks like tests are hanging on our bigendian CI. Is this quick to address or should we leave this for follow up

cc @kiszk

michalursa · 2021-05-19T22:13:56Z

> @michalursa looks like tests are hanging on our bigendian CI. Is this quick to address or should we leave this for follow up
>
> cc @kiszk

It requires a bit of thinking (or debugging) to make the code work on big-endian platforms. But since we still have the other group-by implementation, for now I am disabling the new one on big-endian architectures and will move this issue to a separate JIRA.
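A workaround of this kind usually amounts to a byte-order check in the code that chooses between implementations. The sketch below shows the idea with a runtime probe. The names `IsLittleEndian`, `GrouperKind`, and `ChooseGrouper` are hypothetical, and the actual change may well use a compile-time macro instead.

```cpp
#include <cstdint>
#include <cstring>

// Runtime byte-order probe: on a little-endian machine the least significant
// byte of a multi-byte integer is stored first.
bool IsLittleEndian() {
  const uint16_t probe = 1;
  uint8_t first_byte = 0;
  std::memcpy(&first_byte, &probe, 1);
  return first_byte == 1;
}

enum class GrouperKind { kFast, kFallback };

// Hypothetical selection logic: take the new fast path only when it supports
// the given keys and the platform is little endian; otherwise fall back to
// the existing implementation.
GrouperKind ChooseGrouper(bool fast_path_supports_keys) {
  return (IsLittleEndian() && fast_path_supports_keys) ? GrouperKind::kFast
                                                       : GrouperKind::kFallback;
}
```

Keeping the decision in one place makes it straightforward to lift the restriction once the big-endian issue is resolved.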

bkietz

LGTM, I'll merge.

Please add follow-up JIRAs and link them here, including:

Extract vectorized bit utilities
Enable GrouperFastImpl for big endian platforms
Unit tests should iterate through supported hardware flags (so an AVX2 build should test AVX2 and scalar implementations)

michalursa marked this pull request as draft March 22, 2021 08:36

michalursa marked this pull request as ready for review March 22, 2021 08:39

michalursa marked this pull request as draft March 22, 2021 08:45

github-actions bot added the Component: C++ label Mar 22, 2021

michalursa changed the title ~~Arrow-12010: [C++][Compute] DRAFT Improve performance of the hash table used in GroupIdentifier~~ ARROW-12010: [C++][Compute] DRAFT Improve performance of the hash table used in GroupIdentifier Mar 23, 2021

bkietz self-requested a review March 23, 2021 20:40

michalursa force-pushed the ARROW-12010-GroupIdentifier branch from 765371c to 66c2b88 Compare March 24, 2021 07:29

bkietz requested changes Apr 20, 2021

bkietz reviewed Apr 20, 2021

bkietz marked this pull request as ready for review May 5, 2021 18:27

bkietz changed the title ~~ARROW-12010: [C++][Compute] DRAFT Improve performance of the hash table used in GroupIdentifier~~ ARROW-12010: [C++][Compute] Improve performance of the hash table used in GroupIdentifier May 5, 2021

bkietz force-pushed the ARROW-12010-GroupIdentifier branch from cddc9a2 to f23c506 Compare May 6, 2021 17:03

michalursa mentioned this pull request May 10, 2021

ARROW-12725: [C++][Compute] Column at a time hash and comparison in group by #10290

Closed

bkietz requested changes May 11, 2021

ARROW-12010: [C++][Compute] Improve performance of the hash table use…

ca50d93

…d in Grouper

bkietz force-pushed the ARROW-12010-GroupIdentifier branch from caded37 to ca50d93 Compare May 12, 2021 00:53

bkietz and others added 11 commits May 12, 2021 11:48

lint fixes

cdd053c

GrouperFastImpl: adding comments in the code

ee87e86

GrouperFastImpl: replacing cpu_info with hardware_flags

6fb7837

GrouperFastImpl: fixing build errors

a65697c

GrouperFastImpl: fixing build errors

dab5462

Merge branch 'ARROW-12010-GroupIdentifier' of https://github.com/mich…

9e53424

…alursa/arrow into ARROW-12010-GroupIdentifier

GrouperFastImpl: fixing build errors

f9581ed

GrouperFastImpl: more build fixes

6a93dff

GrouperFastImpl: more build fixes

3cdd601

Merge branch 'ARROW-12010-GroupIdentifier' of https://github.com/mich…

6f16a70

…alursa/arrow into ARROW-12010-GroupIdentifier

GrouperFastImpl: more build fixes

113706a

GrouperFastImpl: fixing more build errors plus a workaround for bigen…

4ca96af

…dian problem

bkietz approved these changes May 20, 2021

bkietz closed this in c697a41 May 20, 2021

asfimport mentioned this pull request May 20, 2021

[C++][Compute] Improve performance of the hash table used in GroupIdentifier #27841

Closed

uchenily mentioned this pull request Mar 20, 2025

[C++][Compute][Acero] Poor aggregate performance when there is a large number of batches on the build side #45847

Open
