-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-3121: [C++] Mean aggregate kernel #3708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3121: [C++] Mean aggregate kernel #3708
Conversation
7f31a91 to
91d9b63
Compare
Codecov Report
@@ Coverage Diff @@
## master #3708 +/- ##
==========================================
+ Coverage 87.65% 87.68% +0.02%
==========================================
Files 701 703 +2
Lines 85747 85824 +77
Branches 1200 1200
==========================================
+ Hits 75162 75254 +92
+ Misses 10466 10456 -10
+ Partials 119 114 -5
Continue to review full report at Codecov.
|
02a0bcf to
2d4c2ba
Compare
|
I think the nodejs failure is unrelated and this is ready to merge. |
|
@bkietz @pitrou @emkornfield This is ready to review. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just one comment.
cpp/src/arrow/compute/test-util.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should assert false or abort on other datum kinds, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll also try to include this while implementing ARROW-3123.
cpp/src/arrow/compute/test-util.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😄
I think you want typename std::enable_if<!IsFloatingPoint<Type>::value>::type, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh, shame on me! Clearly something is wrong then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and ConsumeSparse, ConsumeTiny) looks pretty handy, maybe later we could refactor it into BitUtil::VisitBitmap() or so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Soon the complexity of future needs will outgrow the current implementation, e.g. we'll need to mix with selection vectors/bitmap, groups, etc. I intend to open a discussion in the mailing list about providing iterator facilities with many options (batch, masked, groups).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crazy edge case: if we're summing floating point and one of the null masked values happens to be NaN then multiplying by zero will not prevent the sum from also being NaN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I'll see what I can do without too much change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the performance diff of doing ternary operator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also would it pay to have a constant expression (or just static array) in bit utils that precomputes all the shifts similar to the popcount array below
2d4c2ba to
7929140
Compare
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I saw any blocking issues mostly style stuff and documentation suggestions (please do fix the NaN issue raised).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the performance diff of doing ternary operator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
Le 26/02/2019 à 14:27, François Saint-Jacques a écrit :
Soon the complexity of future needs will outgrow the current
implementation, e.g. we'll need to mix with selection vectors/bitmap,
groups, etc. I intend to open a discussion in the mailing list about
providing iterator facilities with many options (batch, masked, groups).
Also take a look at existing facilities (ArrayDataVisitor) in
visitor-inline.h
Regards
Antoine.
|
|
@pitrou, the initial (sum) implementation used the visitor facility. |
|
This is good to go, the failure is from python https://issues.apache.org/jira/browse/ARROW-4684 |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| Datum(std::make_shared<ScalarType>(0.0))); | ||
|
|
||
| ValidateMean<TypeParam>(&this->ctx_, "[1, 1, 1, 1, 1, 1, 1, 1]", | ||
| Datum(std::make_shared<ScalarType>(1.0))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For self-documenting code you could do:
auto null = std::make_shared<ScalarType>(0.0, false);
Also keep in mind that Datum has implicit ctors, so these Datum(...) are unneeded
| if (data_type == nullptr) | ||
| return Status::Invalid("Datum must be array-like"); | ||
| else if (!is_integer(data_type->id()) && !is_floating(data_type->id())) | ||
| return Status::Invalid("Datum must contain a NumericType"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to always use braces
Implements the mean (average) kernel aggregates on numeric columns. The final type is always a double. Refactored the Sum kernel implementation to share common parts, notably the consume part is identical. Only the Finalize and output type differ.