Skip to content

Conversation

@lidavidm
Copy link
Member

@lidavidm lidavidm commented Aug 16, 2021

This should let R easily support na.RM = TRUE / FALSE by setting skip_nulls = false / true (respectively).

@github-actions
Copy link

@lidavidm
Copy link
Member Author

D'oh, I forgot to check the R tests that motivated this in the first place.

@lidavidm lidavidm marked this pull request as draft August 16, 2021 17:28
@lidavidm
Copy link
Member Author

Blocked on ARROW-13638.

@lidavidm lidavidm marked this pull request as ready for review August 17, 2021 21:36
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this appeared unused to me.

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R changes look great; one question about the C++

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a possible optimization: if options.skip_nulls, either check the bitmask up front for missings and exit early if any, or exit after the first one is found? It looks like as it is, we still go through and count/sum/etc. all non-null values always.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, we can short-circuit as soon as nulls_observed if we have !skip_nulls. Updated, thanks for pointing this out.

Copy link
Member

@pitrou pitrou Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about this condition: if there are nulls and options.skip_nulls is false, this kernel can still return true (when this->any is true)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it looked weird to me as well, but that is how Kleene logic works, and you can observe this in R:

> any(c(NA), na.rm = FALSE)
[1] NA
> any(c(NA, TRUE), na.rm = FALSE)
[1] TRUE
> any(c(NA, FALSE), na.rm = FALSE)
[1] NA
> any(c(), na.rm = FALSE)
[1] FALSE

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here: if the input is [false, true, null] and skip_nulls is false, then the result is false rather than null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and this matches base R/dplyr's behavior:

> all(c(FALSE, TRUE, NA), na.rm = FALSE)
[1] FALSE

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the two (probably silly) questions I asked about the any/all semantics, here are some comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove the non-zero min_count tests here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cases as written were pointless, so I removed them. I've added some new cases instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sidenote: the Sum, Product and Mean aggregators probably have a lot of code in common. Do you think it can be factored out in some kind of mixin or base class?
Or, conversely, that the operation-specific code can be moved into a separate class on which the main aggregator implementation would be templated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made all 5 kernels use CRTP. GroupedMeanImpl is kind of iffy under this pattern but the other 4 kernels consolidate nicely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably shortcut here:

if (!BitUtil::GetBit(seen, *g) && BitUtil::GetBit(bitmap, position)) {
  BitUtil::SetBitTo(seen, *g);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to the remark above about Sum / Mean / Product, I also wonder if Any and All can be reconciled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this SumMeanProductKeepNulls?

ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/4));
ValidateMean<TypeParam>(json, null_result,
ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/15));
ScalarAggregateOptions(/*skip_nulls=*/false, /*min_count=*/0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there should be some tests with skip_nulls=false and a non-zero min_count?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, sorry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants