-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13627: [C++] Fully support ScalarAggregateOptions in (hash) any/all/sum/product/mean #10942
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
D'oh, I forgot to check the R tests that motivated this in the first place. |
|
Blocked on ARROW-13638. |
r/src/compute.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, this appeared unused to me.
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R changes look great; one question about the C++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a possible optimization: if options.skip_nulls, either check the bitmask up front for missings and exit early if any, or exit after the first one is found? It looks like as it is, we still go through and count/sum/etc. all non-null values always.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, we can short-circuit as soon as nulls_observed if we have !skip_nulls. Updated, thanks for pointing this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about this condition: if there are nulls and options.skip_nulls is false, this kernel can still return true (when this->any is true)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it looked weird to me as well, but that is how Kleene logic works, and you can observe this in R:
> any(c(NA), na.rm = FALSE)
[1] NA
> any(c(NA, TRUE), na.rm = FALSE)
[1] TRUE
> any(c(NA, FALSE), na.rm = FALSE)
[1] NA
> any(c(), na.rm = FALSE)
[1] FALSEThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here: if the input is [false, true, null] and skip_nulls is false, then the result is false rather than null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and this matches base R/dplyr's behavior:
> all(c(FALSE, TRUE, NA), na.rm = FALSE)
[1] FALSE
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the two (probably silly) questions I asked about the any/all semantics, here are some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove the non-zero min_count tests here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cases as written were pointless, so I removed them. I've added some new cases instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sidenote: the Sum, Product and Mean aggregators probably have a lot of code in common. Do you think it can be factored out in some kind of mixin or base class?
Or, conversely, that the operation-specific code can be moved into a separate class on which the main aggregator implementation would be templated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made all 5 kernels use CRTP. GroupedMeanImpl is kind of iffy under this pattern but the other 4 kernels consolidate nicely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably shortcut here:
if (!BitUtil::GetBit(seen, *g) && BitUtil::GetBit(bitmap, position)) {
BitUtil::SetBitTo(seen, *g);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly to the remark above about Sum / Mean / Product, I also wonder if Any and All can be reconciled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this SumMeanProductKeepNulls?
| ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/4)); | ||
| ValidateMean<TypeParam>(json, null_result, | ||
| ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/15)); | ||
| ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like there should be some tests with skip_nulls=false and a non-zero min_count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping here.
This should let R easily support na.RM = TRUE / FALSE by setting skip_nulls = false / true (respectively).