Conversation

@lidavidm (Member)

Note these don't use pairwise summation and so may be prone to precision issues.
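
For context, a minimal sketch of the distinction, assuming a plain floating-point accumulator (illustrative only, not the kernel's implementation): naive left-to-right accumulation lets rounding error grow with the number of values, while pairwise summation recursively splits the input and sums the halves, which keeps error growth roughly logarithmic.

```cpp
#include <cstddef>

// Illustrative only, not the kernel's implementation.
double NaiveSum(const double* values, std::size_t n) {
  double sum = 0.0;
  for (std::size_t i = 0; i < n; ++i) sum += values[i];  // rounding error accumulates linearly
  return sum;
}

double PairwiseSum(const double* values, std::size_t n) {
  if (n <= 8) return NaiveSum(values, n);  // small base case: sum directly
  const std::size_t half = n / 2;
  // Recursively sum each half; error grows ~O(log n) instead of ~O(n).
  return PairwiseSum(values, half) + PairwiseSum(values + half, n - half);
}
```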

Member

Should there be the same comment in GroupedSumImpl?

Member

The error message should be updated for this operation.

Member

It's a pity this duplicates the existing math from the scalar aggregate kernel.

How would you feel about factoring the underlying math into a simple VarStdOp<ArrowType> that you feed values to? You would have one VarStdOp in the scalar aggregate kernel and num_groups_ of them in the hash aggregate kernel.

That might perform a bit differently, because you would have an array-of-structures std::vector<VarStdOp> rather than the current structure-of-arrays of three BufferBuilders, but I'm not sure it really matters here.

OTOH, Consume would not really benefit, because the scalar aggregate kernel uses pairwise summation for floating-point input.

cc @bkietz for advice
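
For concreteness, here is a minimal sketch of what such a VarStdOp accumulator could look like (untemplated and double-only for brevity; the member names, the Welford-style Consume, and the Chan-et-al.-style merge are assumptions for illustration, not the code in this PR). The scalar kernel would own a single instance, while the hash kernel would keep a std::vector<VarStdOp> indexed by group id:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sketch of the suggested accumulator (not the actual Arrow code).
// Keeps a running count, mean, and sum of squared deviations (m2), so that
// variance/stddev can be finalized with any ddof.
struct VarStdOp {
  int64_t count = 0;
  double mean = 0.0;
  double m2 = 0.0;  // sum of squared deviations from the running mean

  // Feed one non-null value (Welford's online update).
  void Consume(double value) {
    ++count;
    const double delta = value - mean;
    mean += delta / count;
    m2 += delta * (value - mean);
  }

  // Combine another partial state (Chan et al. parallel merge), e.g. in Merge().
  void MergeFrom(const VarStdOp& other) {
    if (other.count == 0) return;
    if (count == 0) {
      *this = other;
      return;
    }
    const double delta = other.mean - mean;
    const int64_t total = count + other.count;
    m2 += other.m2 + delta * delta * static_cast<double>(count) * other.count / total;
    mean += delta * other.count / total;
    count = total;
  }

  // Variance with the given ddof, or nullopt when there are too few values.
  std::optional<double> FinalizeVariance(int ddof) const {
    if (count - ddof <= 0) return std::nullopt;
    return m2 / (count - ddof);
  }
};
```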

Member

Also @cyb70289

Member Author

Another option that Ben's mentioned would be to treat scalar aggregation as a hash aggregation with one group, though then we should immediately tackle the pairwise summation issue.

Member Author

I'll see if I can get a comparison up before I'm out.

Member Author

It's unfortunately not very good. See #10813.

Member Author

I'll try refactoring it instead and compare the performance.

Member

These tests are unfortunately verbose, but perhaps you could nevertheless add another test with a different ddof (and one group with an insufficient number of non-null values)?
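
To make that edge case concrete, here is a hypothetical illustration (not this PR's test code): with delta degrees of freedom d, the variance is m2 / (count - d), so a group whose non-null count is less than or equal to d should yield a null result.

```cpp
#include <cassert>
#include <optional>

// Hypothetical helper (not the PR's test code): variance with the given ddof,
// or nullopt (a null slot in the result array) when there are too few values.
std::optional<double> Variance(double m2, long long count, int ddof) {
  if (count - ddof <= 0) return std::nullopt;
  return m2 / (count - ddof);
}

int main() {
  // A group {1, 2, 3}: mean = 2, m2 = (1-2)^2 + (2-2)^2 + (3-2)^2 = 2.
  assert(Variance(2.0, 3, /*ddof=*/1).value() == 1.0);  // sample variance
  assert(Variance(2.0, 3, /*ddof=*/2).value() == 2.0);  // custom ddof
  assert(!Variance(2.0, 3, /*ddof=*/3).has_value());    // count <= ddof -> null
  return 0;
}
```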

@lidavidm (Member Author)
@ursabot please benchmark lang=C++

@ursabot commented Jul 27, 2021

Benchmark runs are scheduled for baseline = 31b60f3 and contender = df99462. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
- [Skipped ⚠️ Provided benchmark filters do not have any benchmark groups to be executed on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2 (mimalloc)
- [Skipped ⚠️ Only ['Python', 'R', 'JavaScript'] langs are supported on ursa-i9-9960x] ursa-i9-9960x (mimalloc)
- [Failed] ursa-thinkcentre-m75q (mimalloc)

Supported benchmarks:
- ursa-i9-9960x: langs = Python, R, JavaScript
- ursa-thinkcentre-m75q: langs = C++, Java
- ec2-t3-xlarge-us-east-2: cloud = True

@pitrou (Member) commented Jul 29, 2021

@ursabot please benchmark lang=C++

@ursabot commented Jul 29, 2021

Benchmark runs are scheduled for baseline = 31b60f3 and contender = 90c91fa. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
- [Skipped ⚠️ Provided benchmark filters do not have any benchmark groups to be executed on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2 (mimalloc)
- [Skipped ⚠️ Only ['Python', 'R', 'JavaScript'] langs are supported on ursa-i9-9960x] ursa-i9-9960x (mimalloc)
- [Finished ⬇️0.67% ⬆️0.05%] ursa-thinkcentre-m75q (mimalloc)

Supported benchmarks:
- ursa-i9-9960x: langs = Python, R, JavaScript
- ursa-thinkcentre-m75q: langs = C++, Java
- ec2-t3-xlarge-us-east-2: cloud = True

@lidavidm (Member Author) commented Aug 2, 2021

From conbench it looks like the changes generally don't affect the performance of the existing variance/stddev kernels except for maybe the int32 variance case (likely a fluke?).

@pitrou (Member) commented Aug 2, 2021

It's difficult to navigate the conbench results, but they don't seem concerning in any way.

@pitrou (Member) left a comment

+1, thank you very much @lidavidm
