Skip to content

Conversation

@michalursa
Copy link
Contributor

@michalursa michalursa commented May 10, 2021

Current implementation of hash group by (GrouperFastImpl class) converts input ExecBatches to row-oriented format,
then hashes and compares rows as if they were a single column.

It is more efficient to avoid relatively costly encoding and instead compute hashes of individual columns in column-oriented format mixing them together, and similarly comparing column-oriented data to row-oriented data in the hash table without converting.

Encoding only happens for a subset of input rows that are inserted into the hash table - they introduce new groups.
Keys in hash table remain stored as row-oriented.

There are 3 components that changed:

  • key_compare.cc - contains implementation of one column at a time comparison between column-oriented and row-oriented data
  • key_hash.cc - implements hashing individual columns of column-oriented key data and mixing the individual results together for a final multi-column hash
  • key_encode.cc - implements encoding in row-format only a selection of input rows; previously all input rows were encoded without support for selective encoding

Performance before and after:
Here are selected results (in millions of rows per second) of arrow-compute-aggregate-benchmark for 0.01% nulls in input columns (Ubuntu VM, Clang 10.0, Intel I9-10980XE):

Test Before (AVX2) After (AVX2) Before (Scalar) After (Scalar)
TinyString 53 96 25 38
SmallString 48 80 23 36
MediumString 47 65 22 32
TinyInt 340 407 130 136
SmallInt 246 287 131 137
MediumInt 194 208 106 111

@github-actions
Copy link

@michalursa michalursa changed the title ARROW-12010: [C++][Compute] Column at a time hash and comparison in group by ARROW-12725: [C++][Compute] Column at a time hash and comparison in group by May 12, 2021
@github-actions
Copy link

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michalursa looks like this needs to be rebased

@michalursa michalursa force-pushed the ARROW-12725-column-at-a-time-groupby branch from 478197f to a620889 Compare August 18, 2021 20:54
@michalursa michalursa force-pushed the ARROW-12725-column-at-a-time-groupby branch from a620889 to 5294f98 Compare August 18, 2021 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants