Skip to content

Conversation

@nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented May 31, 2020

Was curious how easy this would be and what issues I'd run into. So far I've done sum() and mean(), and I hit some roadblocks in doing min/max.

Issues:

  • See workarounds in sum.Array. (1) missing/null values are always dropped, while R provides an option for how to treat missingness. Interestingly, it looks like min and max do have this null option supported (in MinMaxOptions) but sum and mean do not (yet).(ARROW-9054)
  • (2) There is no sum method for boolean type (ARROW-9055)
  • There is no sum implemented for Scalars, which may be fine (I would expect it to be no-op for numeric types, just thought logically it maybe shouldn't error) (ARROW-9056)
  • For min and max, the path is to call the minmax kernel, which returns a struct containing min and max fields, and then extract the appropriate field from the struct. However, StructScalar doesn't seem to have field accessing methods like StructArray has (ARROW-8769)
  • I also ran into the fact that MakeArrayFromScalar can't handle structs (previously ticketed as ARROW-6604), which means I can't convert a ScalarStruct to an R object (data.frame). Not required for these kernel bindings though if ARROW-9070 happens.

Also:

  • Summing integers seems to promote to return int64 if given int32 (I didn't try with smaller ints), even when overflow is not a danger (I was adding numbers 1 to 5). It would be nice if it returned the same type it got unless it has to go bigger to avoid overflow.
  • Given how flexible it is to use CallFunction on a new function without having to write any additional C++ binding, it was surprising/frustrating that in order to call minmax, I had to first add boilerplate to instantiate default MinMaxOptions https://github.com/apache/arrow/pull/7308/files#diff-ffc5e6f7dfee7f9bed7733b7fb41e632R163-R167. I would expect that kernels would be able to instantiate their own default options if called with no options. (ARROW-9091)

@github-actions
Copy link

@wesm
Copy link
Member

wesm commented Jun 1, 2020

Good observations. If you open JIRA issues for these we can address each one in due course, none strike me as too difficult (with the null/not-null thing, what we would do is short-circuit and return null if there is any null. Maybe you want to implement that "in software" in R since I'm unclear how useful the option is in C++)

@nealrichardson
Copy link
Member Author

I can make JIRAs.

Check and return NA if there are any NA is what I did in R in this PR. That works, of course, but I wonder if we do want it in C++ for consistency with other functions (min/max handle this in C++ already, for example).

@wesm
Copy link
Member

wesm commented Jun 1, 2020

OK, we should introduce a ScalarAggregateOptions providing null handling behavior, then

@jorisvandenbossche
Copy link
Member

Also for pandas such null-skipping option woudl be useful (although the default there aligns with what arrow already does, but pandas also exposes that as an option)

Summing integers seems to promote to return int64 if given int32 (I didn't try with smaller ints), even when overflow is not a danger (I was adding numbers 1 to 5). It would be nice if it returned the same type it got unless it has to go bigger to avoid overflow.

This would mean that the output type starts to depend on the values of the input, and not just on the types of the input (which is something to avoid, I think?).
But does the kernel machinery allow right now to specify the output type? (or is this always implicitly inferred from the input types?) This is how numpy solves this: by default it will also use int64 for the result, but it has the option to specify the output dtype (so you could do np.array([1, 2, 3], dtype="int8").sum(dtype="int8") to preserve int8 type for the sum)

@nealrichardson nealrichardson force-pushed the r-sum branch 2 times, most recently from 5e7248c to 247a104 Compare June 6, 2020 22:01
@nealrichardson nealrichardson marked this pull request as ready for review June 6, 2020 22:19
sgnkc pushed a commit to sgnkc/arrow that referenced this pull request Jun 11, 2020
Was curious how easy this would be and what issues I'd run into. So far I've done `sum()` and `mean()`, and I hit some roadblocks in doing min/max.

Issues:

* See workarounds in [sum.Array](https://github.com/apache/arrow/compare/master...nealrichardson:r-sum?expand=1#diff-695287341c6d5011a12d7d9bd3ae07adR350-R357). (1) missing/null values are always dropped, while R provides an option for how to treat missingness. Interestingly, it looks like min and max do have this null option supported (in `MinMaxOptions`) but sum and mean do not (yet).([ARROW-9054](https://issues.apache.org/jira/browse/ARROW-9054))
* (2) There is no sum method for boolean type ([ARROW-9055](https://issues.apache.org/jira/browse/ARROW-9055))
* There is no sum implemented for Scalars, which may be fine (I would expect it to be no-op for numeric types, just thought logically it maybe shouldn't error) ([ARROW-9056](https://issues.apache.org/jira/browse/ARROW-9056))
* For min and max, the path is to call the minmax kernel, which returns a struct containing min and max fields, and then extract the appropriate field from the struct. However, StructScalar doesn't seem to have field accessing methods like StructArray has ([ARROW-8769](https://issues.apache.org/jira/browse/ARROW-8769))
* I also ran into the fact that MakeArrayFromScalar can't handle structs (previously ticketed as [ARROW-6604](https://issues.apache.org/jira/browse/ARROW-6604)), which means I can't convert a ScalarStruct to an R object (data.frame). Not required for these kernel bindings though if ARROW-9070 happens.

Also:

* Summing integers seems to promote to return int64 if given int32 (I didn't try with smaller ints), even when overflow is not a danger (I was adding numbers 1 to 5). It would be nice if it returned the same type it got unless it has to go bigger to avoid overflow.
* Given how flexible it is to use CallFunction on a new function without having to write any additional C++ binding, it was surprising/frustrating that in order to call minmax, I had to first add boilerplate to instantiate default MinMaxOptions https://github.com/apache/arrow/pull/7308/files#diff-ffc5e6f7dfee7f9bed7733b7fb41e632R163-R167. I would expect that kernels would be able to instantiate their own default options if called with no options. ([ARROW-9091](https://issues.apache.org/jira/browse/ARROW-9091))

Closes apache#7308 from nealrichardson/r-sum

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
@nealrichardson nealrichardson deleted the r-sum branch August 19, 2020 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants