-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-6978: [R] Add bindings for sum and mean compute kernels #7308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Good observations. If you open JIRA issues for these we can address each one in due course, none strike me as too difficult (with the null/not-null thing, what we would do is short-circuit and return null if there is any null. Maybe you want to implement that "in software" in R since I'm unclear how useful the option is in C++) |
|
I can make JIRAs. Check and return NA if there are any NA is what I did in R in this PR. That works, of course, but I wonder if we do want it in C++ for consistency with other functions (min/max handle this in C++ already, for example). |
|
OK, we should introduce a |
|
Also for pandas such null-skipping option woudl be useful (although the default there aligns with what arrow already does, but pandas also exposes that as an option)
This would mean that the output type starts to depend on the values of the input, and not just on the types of the input (which is something to avoid, I think?). |
5e7248c to
247a104
Compare
Was curious how easy this would be and what issues I'd run into. So far I've done `sum()` and `mean()`, and I hit some roadblocks in doing min/max. Issues: * See workarounds in [sum.Array](https://github.com/apache/arrow/compare/master...nealrichardson:r-sum?expand=1#diff-695287341c6d5011a12d7d9bd3ae07adR350-R357). (1) missing/null values are always dropped, while R provides an option for how to treat missingness. Interestingly, it looks like min and max do have this null option supported (in `MinMaxOptions`) but sum and mean do not (yet).([ARROW-9054](https://issues.apache.org/jira/browse/ARROW-9054)) * (2) There is no sum method for boolean type ([ARROW-9055](https://issues.apache.org/jira/browse/ARROW-9055)) * There is no sum implemented for Scalars, which may be fine (I would expect it to be no-op for numeric types, just thought logically it maybe shouldn't error) ([ARROW-9056](https://issues.apache.org/jira/browse/ARROW-9056)) * For min and max, the path is to call the minmax kernel, which returns a struct containing min and max fields, and then extract the appropriate field from the struct. However, StructScalar doesn't seem to have field accessing methods like StructArray has ([ARROW-8769](https://issues.apache.org/jira/browse/ARROW-8769)) * I also ran into the fact that MakeArrayFromScalar can't handle structs (previously ticketed as [ARROW-6604](https://issues.apache.org/jira/browse/ARROW-6604)), which means I can't convert a ScalarStruct to an R object (data.frame). Not required for these kernel bindings though if ARROW-9070 happens. Also: * Summing integers seems to promote to return int64 if given int32 (I didn't try with smaller ints), even when overflow is not a danger (I was adding numbers 1 to 5). It would be nice if it returned the same type it got unless it has to go bigger to avoid overflow. * Given how flexible it is to use CallFunction on a new function without having to write any additional C++ binding, it was surprising/frustrating that in order to call minmax, I had to first add boilerplate to instantiate default MinMaxOptions https://github.com/apache/arrow/pull/7308/files#diff-ffc5e6f7dfee7f9bed7733b7fb41e632R163-R167. I would expect that kernels would be able to instantiate their own default options if called with no options. ([ARROW-9091](https://issues.apache.org/jira/browse/ARROW-9091)) Closes apache#7308 from nealrichardson/r-sum Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Was curious how easy this would be and what issues I'd run into. So far I've done
sum()andmean(), and I hit some roadblocks in doing min/max.Issues:
MinMaxOptions) but sum and mean do not (yet).(ARROW-9054)Also: