Skip to content

Conversation

@michalursa
Copy link
Contributor

This is the draft version of the code implementing functionality for mapping arbitrary set of input columns considered a key in grouping operation into a vector containing integer group identifiers (same combinations of input key columns get same ids).

I will continue working on it and updating it with:

  • integration with initial hash group by implementation in Arrow project, once it is finished and merged into master
  • unit tests
  • documentation

At this point group ids, row ids, offsets, hash values are 32-bit. The overflow checks are missing in current version and still need to be fixed.

The entry point for id mapping is GroupBy class. It uses three main modules: storage defined in groupby_storage* files, hash defined in groupby_hash* files and hash table defined in groupby_map* files. Key values stored with the hash table are row oriented. Storage part of the code defines functions converting from column oriented storage to row oriented storage and back. It also implements comparison and appending keys to the incremental store.

I plan to add design doc in a form of a readme file later on.

The individual modules and functions present here have been tested with unit tests and are passing them but unit tests are not included in this change yet.

@michalursa michalursa marked this pull request as draft March 22, 2021 08:36
@michalursa michalursa marked this pull request as ready for review March 22, 2021 08:39
@michalursa michalursa marked this pull request as draft March 22, 2021 08:45
@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@michalursa michalursa changed the title Arrow-12010: [C++][Compute] DRAFT Improve performance of the hash table used in GroupIdentifier ARROW-12010: [C++][Compute] DRAFT Improve performance of the hash table used in GroupIdentifier Mar 23, 2021
@github-actions
Copy link

@bkietz bkietz self-requested a review March 23, 2021 20:40
@michalursa michalursa force-pushed the ARROW-12010-GroupIdentifier branch from 765371c to 66c2b88 Compare March 24, 2021 07:29
Comment on lines 409 to 413
list(APPEND ARROW_SRCS engine/key_hash_avx2.cc)
set_source_files_properties(engine/key_hash_avx2.cc PROPERTIES
SKIP_PRECOMPILE_HEADERS ON)
set_source_files_properties(engine/key_hash_avx2.cc PROPERTIES
COMPILE_FLAGS ${ARROW_AVX2_FLAG})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's extract this to a macro,

macro(append_avx2_src SRC)
  if(ARROW_HAVE_RUNTIME_AVX2)
    list(APPEND ARROW_SRCS ${SRC})
    set_source_files_properties(${SRC} PROPERTIES
                                SKIP_PRECOMPILE_HEADERS ON)
    set_source_files_properties(${SRC} PROPERTIES
                                COMPILE_FLAGS ${ARROW_AVX2_FLAG})
  endif()
endmacro()

ExpectConsume(*ExecBatch::Make(key_batch), expected);
}

void AssertEquivalentIds(const Datum& expected, const Datum& actual) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a comment describing what this does compared to AssertDatumsEqual

max_right_id = right_ids[i];
}
}
std::vector<bool> right_to_left_present;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::vector can be sized and initialized on construction

Suggested change
std::vector<bool> right_to_left_present;
std::vector<bool> right_to_left_present(max_right_id + 1, false);

--num_vectors_;
}
static constexpr int64_t padding = 64;
MemoryPool* pool_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TempVectorStack doesn't seem to use this MemoryPool outside of init(), could you remove it and add a comment describing the use of this class?

// second
input = _mm256_shuffle_epi8(
input, _mm256_setr_epi64x(0x0e0c0a0806040200ULL, 0x0f0d0b0907050301ULL,
0x0e0c0a0806040200ULL, 0x0f0d0b0907050301ULL));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create constexpr variables for any magic numbers

// ** The order of bytes is reversed - highest byte represents 0th bucket.
// No other part of data structure uses this reversed order.
//
class SwissTable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a higher level doc comment describing this class and it's utility. Specifically: describe how equality comparison and appending/storage of new entries are deferred to callbacks.

Separately, add a comment detailing its implementation and usage of stamps for vectorized probing.


class EncoderNulls {
public:
static void Encode(KeyRowArray& rows, const std::vector<KeyColumnArray>& cols,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid mutable references. For out arguments, please use a mutable pointer

Suggested change
static void Encode(KeyRowArray& rows, const std::vector<KeyColumnArray>& cols,
static void Encode(KeyRowArray* rows, const std::vector<KeyColumnArray>& cols,

Comment on lines 451 to 450
*reinterpret_cast<uint16_t*>(row_base + i * row_size) =
reinterpret_cast<const uint16_t*>(col_base)[i];
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of unaligned accesses like this one. This is undefined behavior in C++ and it's not supported on all platforms. Could we use SafeLoadAs and SafeStore? If those produce a performance regression, can we optimize them?

@bkietz bkietz marked this pull request as ready for review May 5, 2021 18:27
@bkietz bkietz changed the title ARROW-12010: [C++][Compute] DRAFT Improve performance of the hash table used in GroupIdentifier ARROW-12010: [C++][Compute] Improve performance of the hash table used in GroupIdentifier May 5, 2021
@bkietz bkietz force-pushed the ARROW-12010-GroupIdentifier branch from cddc9a2 to f23c506 Compare May 6, 2021 17:03
@bkietz bkietz force-pushed the ARROW-12010-GroupIdentifier branch from caded37 to ca50d93 Compare May 12, 2021 00:53
@bkietz
Copy link
Member

bkietz commented May 19, 2021

@michalursa looks like tests are hanging on our bigendian CI. Is this quick to address or should we leave this for follow up

@nealrichardson
Copy link
Member

@michalursa looks like tests are hanging on our bigendian CI. Is this quick to address or should we leave this for follow up

cc @kiszk

@michalursa
Copy link
Contributor Author

@michalursa looks like tests are hanging on our bigendian CI. Is this quick to address or should we leave this for follow up

cc @kiszk

It requires a bit of thinking (or debugging) to make the code work with big endian. But since we still have the other group by implementation, for now I am disabling new one on big endian architectures, and will move this issue to a separate jira.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'll merge.

Please add follow up JIRAs and link them here, including:

  • Extract vectorized bit utilities
  • Enable GrouperFastImpl for big endian platforms
  • Unit tests should iterate through supported hardware flags (so an AVX2 build should test AVX2 and scalar implementations)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants