-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12959: [C++][R] Option for is_null(NaN) to evaluate to true #10896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12959: [C++][R] Option for is_null(NaN) to evaluate to true #10896
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably made much simpler, e.g.:
bool* out_value = &checked_cast<BooleanScalar*>(out)->value;
if (in.type->id() == Type::FLOAT) {
*out_value = !options.nan_is_null || !std::isnan(internal::UnboxScalar<FloatType>::Unbox(in));
}
else if (in.type->id() == Type::DOUBLE) {
*out_value = !options.nan_is_null || !std::isnan(internal::UnboxScalar<DoubleType>::Unbox(in));
}
else {
*out_value = true;
}
return Status::OK();There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a check for validity prior to checking float type to prevent unnecessary checks when null bit is enabled. Nit: Also, I would change logic to to use && instead of ||, feels a bit more readable (and less characters).
bool* out_value = &checked_cast<BooleanScalar*>(out)->value;
if (in.is_valid) {
switch (in.type->id()) {
case Type::FLOAT:
*out_value = options.nan_is_null && std::isnan(internal::UnboxScalar<FloatType>::Unbox(in));
case Type::DOUBLE:
*out_value = options.nan_is_null && std::isnan(internal::UnboxScalar<DoubleType>::Unbox(in));
default:
*out_value = false;
}
} else {
*out_value = true;
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou I would be curious if it is worth it to have a specialized version for floating point types (similar to how arithmetic kernels are implemented), as the options.nan_is_null check does not applies to other data types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the version for Scalar inputs. Micro-optimizing it is futile. The version for Array inputs may be more interesting ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can put this in the anonymous namespace above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was addressed on 2e772c1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is comparing pointer values, which is incorrect, even though it will do what you expect most of the time.
(one could e.g. instantiate a separate instance using std::make_shared<FloatType>())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the break statement for the default case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed on 9a69e3e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the check for NaN values only applies to floating-point types and when nan_is_null is true, other types/cases can use the logic as it was.
In the case of ArrayData there are 3 scenarios:
a) arr.GetNullCount() == arr.length // All data is null
b) arr.GetNullCount() == 0 // No data is null
c) arr.MayHaveNulls() == true // Some data is null
IIUC, the null bitmap of the input ArrayData is not guaranteed to be consistent with the data (an ArrayData can be malformed bc buffer values can be modified directly). Scenarios (a)-(b) invoke arr.GetNullCount() which iterates through all the arr values and update the null count.
Given that scenarios (b)-(c) are the common case and the array data has to be traversed to identify the NaN values, (as an optimization) I suggest to not check the null count at all. Nevertheless, only check for NaNs in non-null indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed f404910 which contains the ArrayData validity for nulls (not sure if it's the best way to do).
There is a way to test those changes in R ? As the ticket is also related to R language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to test kernel implementation in R (or Python), first bindings have to be put in place with the desired default option. The C++ kernels are invoked via the language bindings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Christian8491 when this PR is ready, the C++ code is all reviewed and approved, and the CI is all green, please let me know and I can push a commit that changes the R bindings to use this new option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure @ianmcook I will.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou The is_null compute function most of the time will iterate through the bitmap of the input ArrayData and when nan_is_null option is set, it can increase the null count. Would it be a good idea to update the null count of the input array (arr.SetNullCount(...)) as a side-effect of invoking this compute function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would you update the null count of the input array? I'm not sure I understand what you're proposing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point now, thanks! Modifying the input array does not makes sense because nan_is_null property applies only to the intermediate/output of the compute function. Please ignore these comments.
…N-to-evaluate-to-t ARROW-12959: [C++] Option for is_null(NaN) to evaluate to true
docs/source/cpp/compute.rst
Outdated
|
|
||
| * \(9) Output is true iff the corresponding input element is null. | ||
| * \(9) Output is true if the corresponding input element is null or if NaN | ||
| values are treated as null via the :struct:`NanNullOptions`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrase to Output is true if the corresponding input element is null. NaN values can be considered as null via ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed on 915c868
|
Add Python tests that make use of the arr = pa.array([1, 2, 3, None, np.nan])
arr.is_null() # expected = pa.array([False, False, False, True, False])
arr.is_null(nan_is_null=True) # expected = pa.array([False, False, False, True, True]) |
|
@ianmcook could you provide me some help for R bindings in order to use this new option. |
|
@Christian8491 could you please confirm that it is safe to set |
Yes @ianmcook as the logic demands a first check for floating type cc @edponce |
|
I pushed 9a4f39a to update the R bindings to use this option. Thanks again @Christian8491 for doing this! |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some more comments. I'm going to push the required changes myself.
cpp/src/arrow/compute/api_scalar.h
Outdated
| int64_t start, stop, step; | ||
| }; | ||
|
|
||
| class ARROW_EXPORT NanNullOptions : public FunctionOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest renaming this NullOptions, in case other values than NaN need to be considered null at some point.
| {"values"}); | ||
| const FunctionDoc is_null_doc( | ||
| "Return true if null, NaN values can be considered as null", | ||
| ("For each input value, emit true if the value is null. Default behavior is to emit " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By convention, function descriptions are explicitly line-wrapped with \n characters. I suggest doing the same here.
| CheckScalarUnary("is_null", MakeScalar(std::numeric_limits<double>::infinity()), | ||
| MakeScalar(false), &options); | ||
| CheckScalarUnary("is_null", MakeNullScalar(float64()), MakeScalar(true), &options); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the repetition can easily be avoided here, for example by making this a templated test.
| return Expression._call("is_valid", [self]) | ||
|
|
||
| def is_null(self): | ||
| def is_null(self, bint nan_is_null=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make this argument keyword-only.
python/pyarrow/array.pxi
Outdated
| return 0 | ||
|
|
||
| def is_null(self): | ||
| def is_null(self, nan_is_null=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
| StrftimeOptions, | ||
| DayOfWeekOptions, | ||
| NanNullOptions, | ||
| TakeOptions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should try to maintain this alphabetically-ordered (though DayOfWeekOptions already violates it :-/).
|
|
||
| TEST_F(TestFloatValidityKernels, FloatScalarIsNull) { | ||
| CheckScalarUnary("is_null", MakeScalar(42.0f), MakeScalar(false)); | ||
| CheckScalarUnary("is_null", MakeScalar(std::nanf("")), MakeScalar(false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These separate tests shouldn't be necessary, as CheckScalarUnary already tests Scalar values implicitly when Array values are given.
|
Thanks @pitrou for the feedback! Will be more carefully next time with details like |
No description provided.