ARROW-13627: [C++] Fully support ScalarAggregateOptions in (hash) any/all/sum/product/mean #10942

lidavidm · 2021-08-16T15:47:11Z

This should let R easily support na.rm = TRUE / FALSE by setting skip_nulls = true / false (respectively).

github-actions · 2021-08-16T15:47:30Z

https://issues.apache.org/jira/browse/ARROW-13627

lidavidm · 2021-08-16T17:28:55Z

D'oh, I forgot to check the R tests that motivated this in the first place.

lidavidm · 2021-08-16T18:40:09Z

Blocked on ARROW-13638.

lidavidm · 2021-08-17T22:06:33Z

r/src/compute.cpp

FWIW, this appeared unused to me.

nealrichardson

R changes look great; one question about the C++

nealrichardson · 2021-08-18T12:58:33Z

cpp/src/arrow/compute/kernels/aggregate_basic.cc

Is there a possible optimization: if !options.skip_nulls, either check the validity bitmap up front for nulls and exit early if there are any, or exit as soon as the first one is found? As it stands, it looks like we always go through and count/sum/etc. all non-null values.

Ah yes, we can short-circuit as soon as nulls_observed if we have !skip_nulls. Updated, thanks for pointing this out.
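To make the short-circuit concrete, here is a minimal sketch of a sum reduction that honors both options. It is not the Arrow kernel: `SketchOptions` and `SketchSum` are hypothetical names, nulls are modelled with `std::optional`, and the only assumption carried over from this thread is that a null input with skip_nulls = false, or fewer than min_count non-null values, produces a null result.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for ScalarAggregateOptions; only the two fields
// discussed in this thread are modelled.
struct SketchOptions {
  bool skip_nulls = true;
  uint32_t min_count = 1;
};

// Sums `values`, where std::nullopt represents a null slot.
// Returns std::nullopt when the result should be null.
std::optional<int64_t> SketchSum(const std::vector<std::optional<int64_t>>& values,
                                 const SketchOptions& options) {
  int64_t sum = 0;
  int64_t count = 0;
  for (const auto& v : values) {
    if (!v.has_value()) {
      // With skip_nulls == false a single null decides the result,
      // so there is no point in scanning the rest of the input.
      if (!options.skip_nulls) return std::nullopt;
      continue;
    }
    sum += *v;
    ++count;
  }
  // Too few non-null values also yields a null result.
  if (count < static_cast<int64_t>(options.min_count)) return std::nullopt;
  return sum;
}
```

With the input [1, 2, null], this sketch returns null when skip_nulls is false, 3 when skip_nulls is true and min_count is 1, and null again when min_count is raised to 3.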

pitrou · 2021-08-19T09:27:33Z

cpp/src/arrow/compute/kernels/aggregate_basic.cc

I'm curious about this condition: if there are nulls and options.skip_nulls is false, this kernel can still return true (when this->any is true)?

Yes, it looked weird to me as well, but that is how Kleene logic works, and you can observe this in R:

> any(c(NA), na.rm = FALSE)
[1] NA
> any(c(NA, TRUE), na.rm = FALSE)
[1] TRUE
> any(c(NA, FALSE), na.rm = FALSE)
[1] NA
> any(c(), na.rm = FALSE)
[1] FALSE

pitrou · 2021-08-19T09:30:48Z

cpp/src/arrow/compute/kernels/aggregate_basic.cc

Same question here: if the input is [false, true, null] and skip_nulls is false, then the result is false rather than null?

Yes, and this matches base R/dplyr's behavior:

> all(c(FALSE, TRUE, NA), na.rm = FALSE)
[1] FALSE
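The Kleene semantics described in these two replies can be sketched as follows. This is an illustration only, not the kernel code: `MaybeBool`, `SketchAny` and `SketchAll` are hypothetical names, nulls are modelled with `std::optional`, and min_count is ignored for brevity.

```cpp
#include <optional>
#include <vector>

// std::nullopt models a null (NA in R).
using MaybeBool = std::optional<bool>;

// Kleene "any": a true value decides the result regardless of nulls;
// otherwise a null makes the result null unless nulls are skipped.
MaybeBool SketchAny(const std::vector<MaybeBool>& values, bool skip_nulls) {
  bool saw_null = false;
  for (const auto& v : values) {
    if (!v.has_value()) {
      saw_null = true;
      continue;
    }
    if (*v) return true;  // true wins over null
  }
  if (saw_null && !skip_nulls) return std::nullopt;
  return false;
}

// Kleene "all": a false value decides the result regardless of nulls;
// otherwise a null makes the result null unless nulls are skipped.
MaybeBool SketchAll(const std::vector<MaybeBool>& values, bool skip_nulls) {
  bool saw_null = false;
  for (const auto& v : values) {
    if (!v.has_value()) {
      saw_null = true;
      continue;
    }
    if (!*v) return false;  // false wins over null
  }
  if (saw_null && !skip_nulls) return std::nullopt;
  return true;
}
```

With skip_nulls set to false, the sketch reproduces the R results quoted above: any over [null, true] is true, any over [null, false] is null, and all over [false, true, null] is false.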

pitrou

Apart from the two (probably silly) questions I asked about the any/all semantics, here are some comments.

pitrou · 2021-08-19T09:34:53Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

Why remove the non-zero min_count tests here?

The cases as written were pointless, so I removed them. I've added some new cases instead.

pitrou · 2021-08-19T09:35:27Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

Same question here.

pitrou · 2021-08-19T09:45:23Z

cpp/src/arrow/compute/kernels/hash_aggregate.cc

Sidenote: the Sum, Product and Mean aggregators probably have a lot of code in common. Do you think it can be factored out in some kind of mixin or base class?
Or, conversely, that the operation-specific code can be moved into a separate class on which the main aggregator implementation would be templated?

I've made all 5 kernels use CRTP. GroupedMeanImpl is kind of iffy under this pattern but the other 4 kernels consolidate nicely.
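For readers unfamiliar with the pattern, the consolidation can be pictured roughly like this. The sketch is not the code in hash_aggregate.cc: the class names, the use of `std::vector` for per-group state, and the `Identity`/`Reduce` hooks are all assumptions made for illustration, and null handling is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CRTP base: owns the per-group state and the consume loop,
// and delegates the operation-specific pieces to Derived.
template <typename Derived>
class GroupedReducerSketch {
 public:
  explicit GroupedReducerSketch(std::size_t num_groups)
      : reduced_(num_groups, Derived::Identity()) {}

  // group_ids[i] is the group that values[i] belongs to.
  void Consume(const std::vector<int64_t>& values,
               const std::vector<std::size_t>& group_ids) {
    for (std::size_t i = 0; i < values.size(); ++i) {
      int64_t& slot = reduced_[group_ids[i]];
      slot = Derived::Reduce(slot, values[i]);
    }
  }

  const std::vector<int64_t>& result() const { return reduced_; }

 private:
  std::vector<int64_t> reduced_;
};

// Each concrete aggregator only supplies its identity and combine step.
struct GroupedSumSketch : GroupedReducerSketch<GroupedSumSketch> {
  using Base = GroupedReducerSketch<GroupedSumSketch>;
  using Base::Base;
  static int64_t Identity() { return 0; }
  static int64_t Reduce(int64_t acc, int64_t v) { return acc + v; }
};

struct GroupedProductSketch : GroupedReducerSketch<GroupedProductSketch> {
  using Base = GroupedReducerSketch<GroupedProductSketch>;
  using Base::Base;
  static int64_t Identity() { return 1; }
  static int64_t Reduce(int64_t acc, int64_t v) { return acc * v; }
};
```

A mean does not fit this shape as neatly, since it needs a per-group count and a final division in addition to the running sum, which may be part of why it sits less comfortably under the pattern.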

pitrou · 2021-08-19T09:47:35Z

cpp/src/arrow/compute/kernels/hash_aggregate.cc

Can probably shortcut here:

if (!BitUtil::GetBit(seen, *g) && BitUtil::GetBit(bitmap, position)) {
  BitUtil::SetBit(seen, *g);
}

pitrou · 2021-08-19T09:49:20Z

cpp/src/arrow/compute/kernels/hash_aggregate.cc

Similarly to the remark above about Sum / Mean / Product, I also wonder if Any and All can be reconciled.

pitrou · 2021-08-19T09:50:45Z

cpp/src/arrow/compute/kernels/hash_aggregate_test.cc

Call this SumMeanProductKeepNulls?

pitrou · 2021-08-23T16:19:59Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

-                          ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/4));
  ValidateMean<TypeParam>(json, null_result,
-                          ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/15));
+                          ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/0));


It seems like there should be some tests with skip_nulls=false and a non-zero min_count?

Added, sorry.

pitrou · 2021-08-23T16:20:15Z

cpp/src/arrow/compute/kernels/aggregate_test.cc

github-actions bot added Component: C++ Component: Python labels Aug 16, 2021

lidavidm marked this pull request as draft August 16, 2021 17:28

lidavidm force-pushed the arrow-13627 branch from 9b73866 to 45d0ab2 Compare August 17, 2021 21:36

lidavidm marked this pull request as ready for review August 17, 2021 21:36

github-actions bot added the Component: R label Aug 17, 2021

nealrichardson approved these changes Aug 18, 2021

pitrou reviewed Aug 19, 2021

pitrou mentioned this pull request Aug 19, 2021

ARROW-13613: [C++] Add decimal support to (hash) sum/mean/product #10931

Closed

lidavidm added 5 commits August 19, 2021 14:10

ARROW-13627: [C++] Support skip_nulls in sum/mean/product

b406294

ARROW-13627: [C++] Short-circuit

68bd7d7

ARROW-13627: [C++] Restore tests

0f8857f

ARROW-13627: [C++] Consolidate grouped any/all impl

a2d8def

ARROW-13627: [C++] Consolidate grouped sum/mean/product impl

4ec5fb0

lidavidm force-pushed the arrow-13627 branch from ef9c81a to 4ec5fb0 Compare August 19, 2021 19:14

pitrou reviewed Aug 23, 2021

ARROW-13627: [C++] Add more test cases

a332bd8

pitrou approved these changes Aug 23, 2021

pitrou closed this in b93f38a Aug 23, 2021

asfimport mentioned this pull request Sep 7, 2021

[C++] ScalarAggregateOptions don't make sense (in hash aggregation) #29266

Closed
