ARROW-13295: [C++] add hash_mean, hash_variance, hash_stddev kernels #10792
Should there be the same comment in GroupedSumImpl?
The error message should be updated for this operation.
A pity this is duplicating the existing math from the scalar aggregate kernel.
How would you feel about factoring the underlying math into a simple VarStdOp<ArrowType> that you would feed values to? You would have one VarStdOp in the scalar aggregate kernel and num_groups_ of them in the hash aggregate kernel.
That might perform a bit differently because you would have an array-of-structures std::vector<VarStdOp> rather than the structure-of-arrays of the current three BufferBuilders, but I'm not sure it's really important here.
OTOH, Consume would not really benefit because the scalar aggregate kernel uses pairwise summation for floating-point input.
cc @bkietz for advice
Also @cyb70289
Another option that Ben's mentioned would be to treat scalar aggregation as a hash aggregation with one group, though then we should immediately tackle the pairwise summation issue.
I'll see if I can get a comparison up before I'm out.
It's unfortunately not very good. See #10813.
I'll try refactoring it instead and compare the performance.
These tests are unfortunately verbose, but perhaps you could nevertheless add another test with a different ddof (and one group with an insufficient number of non-null values)?
@ursabot please benchmark lang=C++

Benchmark runs are scheduled for baseline = 31b60f3 and contender = df99462. Results will be available as each benchmark for each run completes.

@ursabot please benchmark lang=C++

Benchmark runs are scheduled for baseline = 31b60f3 and contender = 90c91fa. Results will be available as each benchmark for each run completes.

From conbench it looks like the changes generally don't affect the performance of the existing variance/stddev kernels, except for maybe the int32 variance case (likely a fluke?).

It's difficult to navigate the conbench results, but the results don't seem concerning in any way.
pitrou left a comment:
+1, thank you very much @lidavidm
Note these don't use pairwise summation and so may be prone to precision issues.