ARROW-12959: [C++][R] Option for is_null(NaN) to evaluate to true #10896

Christian8491 · 2021-08-06T18:35:32Z

No description provided.

github-actions · 2021-08-06T18:35:51Z

https://issues.apache.org/jira/browse/ARROW-12959

cpp/src/arrow/compute/api_scalar.cc

pitrou · 2021-08-09T16:20:55Z

cpp/src/arrow/compute/kernels/scalar_validity.cc

This can probably be made much simpler, e.g.:

    bool* out_value = &checked_cast<BooleanScalar*>(out)->value;
    if (in.type->id() == Type::FLOAT) {
      *out_value = !options.nan_is_null ||
                   !std::isnan(internal::UnboxScalar<FloatType>::Unbox(in));
    } else if (in.type->id() == Type::DOUBLE) {
      *out_value = !options.nan_is_null ||
                   !std::isnan(internal::UnboxScalar<DoubleType>::Unbox(in));
    } else {
      *out_value = true;
    }
    return Status::OK();

I would add a validity check before checking for a float type, to avoid unnecessary checks when the value is null. Nit: I would also change the logic to use && instead of ||; it feels a bit more readable (and uses fewer characters).

    bool* out_value = &checked_cast<BooleanScalar*>(out)->value;
    if (in.is_valid) {
      switch (in.type->id()) {
        case Type::FLOAT:
          *out_value = options.nan_is_null &&
                       std::isnan(internal::UnboxScalar<FloatType>::Unbox(in));
          break;  // prevents fall-through from overwriting the result
        case Type::DOUBLE:
          *out_value = options.nan_is_null &&
                       std::isnan(internal::UnboxScalar<DoubleType>::Unbox(in));
          break;  // prevents fall-through from overwriting the result
        default:
          *out_value = false;
      }
    } else {
      *out_value = true;
    }

@pitrou I would be curious whether it is worth having a specialized version for floating-point types (similar to how arithmetic kernels are implemented), as the options.nan_is_null check does not apply to other data types.

This is the version for Scalar inputs. Micro-optimizing it is futile. The version for Array inputs may be more interesting ;-)
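
For readers following along, the scalar semantics discussed above can be modelled outside Arrow. This is only a minimal Python sketch: the `Scalar` record, its fields, and the string type ids are hypothetical stand-ins and not part of the Arrow API.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Hypothetical type ids standing in for Type::FLOAT and Type::DOUBLE.
FLOAT_TYPES = {"float", "double"}


@dataclass
class Scalar:
    """Hypothetical stand-in for an Arrow scalar; not the real API."""
    type_id: str
    value: Optional[float] = None
    is_valid: bool = True


def scalar_is_null(scalar: Scalar, nan_is_null: bool = False) -> bool:
    # A scalar whose validity flag is unset is null regardless of its type.
    if not scalar.is_valid:
        return True
    # NaN counts as null only for floating-point types, and only on request.
    if nan_is_null and scalar.type_id in FLOAT_TYPES:
        return math.isnan(scalar.value)
    return False
```

Checking validity first means the value of a null scalar is never inspected, and non-float types never reach the NaN test.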

pitrou · 2021-08-09T16:22:11Z

cpp/src/arrow/compute/kernels/scalar_validity.cc

You can put this in the anonymous namespace above.

It was addressed on 2e772c1

pitrou · 2021-08-09T16:24:23Z

cpp/src/arrow/compute/kernels/scalar_validity.cc

This is comparing pointer values, which is incorrect, even though it will do what you expect most of the time.
(one could e.g. instantiate a separate instance using std::make_shared<FloatType>())
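
The same pitfall can be illustrated by analogy in Python, where `is` plays the role of a pointer comparison. This is a sketch under stated assumptions: the `DataType` class and both helper functions are hypothetical and only mirror the idea of comparing a type id instead of an object address.

```python
class DataType:
    """Hypothetical stand-in for an Arrow data type; not the real API."""

    def __init__(self, type_id: str):
        self.type_id = type_id


def is_float_by_identity(dtype: DataType, singleton: DataType) -> bool:
    # Fragile: true only when the very same object is passed in.
    return dtype is singleton


def is_float_by_id(dtype: DataType) -> bool:
    # Robust: compares the type id, so any equivalent instance matches.
    return dtype.type_id == "float"


FLOAT32 = DataType("float")   # the shared instance most callers would use
separate = DataType("float")  # an equivalent instance built separately
```

Here `is_float_by_identity(separate, FLOAT32)` fails even though both objects describe the same type, while `is_float_by_id(separate)` succeeds.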

edponce · 2021-08-09T21:29:22Z

cpp/src/arrow/compute/kernels/scalar_validity.cc

Remove the break statement for the default case.

Removed on 9a69e3e

edponce · 2021-08-09T23:27:40Z

cpp/src/arrow/compute/kernels/scalar_validity.cc

Note that the check for NaN values only applies to floating-point types, and only when nan_is_null is true. Other types/cases can use the logic as it was.

In the case of ArrayData there are 3 scenarios:
a) arr.GetNullCount() == arr.length // All data is null
b) arr.GetNullCount() == 0 // No data is null
c) arr.MayHaveNulls() == true // Some data is null

IIUC, the null bitmap of the input ArrayData is not guaranteed to be consistent with the data (an ArrayData can be malformed because buffer values can be modified directly). Scenarios (a)-(b) invoke arr.GetNullCount(), which iterates through all the arr values and updates the null count.

Given that scenarios (b)-(c) are the common case and the array data has to be traversed anyway to identify the NaN values, I suggest (as an optimization) not checking the null count at all. Nevertheless, only check for NaNs at non-null indices.
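
The suggestion above can be sketched in plain Python. This is an illustration only, assuming a list of booleans in place of the validity bitmap and a hypothetical `is_floating` flag in place of the type check; it is not the kernel's actual implementation.

```python
import math
from typing import List, Optional, Sequence


def array_is_null(values: Sequence[float],
                  validity: Optional[Sequence[bool]] = None,
                  nan_is_null: bool = False,
                  is_floating: bool = True) -> List[bool]:
    """Model of is_null over an array; validity[i] == False means slot i is null."""
    # Start from the validity bitmap without consulting any null count.
    if validity is None:
        out = [False] * len(values)
    else:
        out = [not valid for valid in validity]
    # Scan for NaN only for floating-point data with the option enabled,
    # and skip slots that are already null.
    if nan_is_null and is_floating:
        for i, value in enumerate(values):
            if not out[i] and math.isnan(value):
                out[i] = True
    return out
```

With `is_floating=False` the value buffer is never read, which matches the expectation that the option has no effect on non-float inputs.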

I have pushed f404910, which contains the ArrayData validity handling for nulls (not sure if it's the best way to do it).
Is there a way to test those changes in R? The ticket is also related to the R language.

No need to test the kernel implementation in R (or Python). First, the bindings have to be put in place with the desired default option. The C++ kernels are invoked via the language bindings.

@Christian8491 when this PR is ready, the C++ code is all reviewed and approved, and the CI is all green, please let me know and I can push a commit that changes the R bindings to use this new option.

Sure @ianmcook I will.

@pitrou Most of the time, the is_null compute function will iterate through the bitmap of the input ArrayData, and when the nan_is_null option is set, it can increase the null count. Would it be a good idea to update the null count of the input array (arr.SetNullCount(...)) as a side effect of invoking this compute function?

Why would you update the null count of the input array? I'm not sure I understand what you're proposing.

I see your point now, thanks! Modifying the input array does not make sense because the nan_is_null property applies only to the intermediate/output of the compute function. Please ignore these comments.

edponce · 2021-08-18T16:31:27Z

docs/source/cpp/compute.rst


-* \(9) Output is true iff the corresponding input element is null.
+* \(9) Output is true if the corresponding input element is null or if NaN
+  values are treated as null via the :struct:`NanNullOptions`.


Rephrase to: "Output is true if the corresponding input element is null. NaN values can be considered as null via ..."

Addressed on 915c868

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

edponce · 2021-08-18T19:53:51Z

Add Python tests that make use of the nan_is_null option, see https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_compute.py#L1302

import numpy as np
import pyarrow as pa

arr = pa.array([1, 2, 3, None, np.nan])
arr.is_null()                  # expected = pa.array([False, False, False, True, False])
arr.is_null(nan_is_null=True)  # expected = pa.array([False, False, False, True, True])

Christian8491 · 2021-08-19T03:09:06Z

@ianmcook could you help me with the R bindings so that they use this new option?

ianmcook · 2021-08-20T22:26:13Z

@Christian8491 could you please confirm that it is safe to set nan_is_null to true when the datum has a non-float type, and it will have no effect or performance cost in that case?

Christian8491 · 2021-08-20T22:30:33Z

> @Christian8491 could you please confirm that it is safe to set nan_is_null to true when the datum has a non-float type, and it will have no effect or performance cost in that case?

Yes @ianmcook, since the logic first checks for a floating-point type. cc @edponce

ianmcook · 2021-08-20T23:05:42Z

I pushed 9a4f39a to update the R bindings to use this option. Thanks again @Christian8491 for doing this!

pitrou

Here are some more comments. I'm going to push the required changes myself.

pitrou · 2021-08-26T12:16:57Z

cpp/src/arrow/compute/api_scalar.h

  int64_t start, stop, step;
 };

+class ARROW_EXPORT NanNullOptions : public FunctionOptions {


I suggest renaming this to NullOptions, in case values other than NaN need to be considered null at some point.

pitrou · 2021-08-26T12:18:32Z

cpp/src/arrow/compute/kernels/scalar_validity.cc

-                              {"values"});
+const FunctionDoc is_null_doc(
+    "Return true if null, NaN values can be considered as null",
+    ("For each input value, emit true if the value is null. Default behavior is to emit "


By convention, function descriptions are explicitly line-wrapped with \n characters. I suggest doing the same here.

pitrou · 2021-08-26T12:19:49Z

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

+  CheckScalarUnary("is_null", MakeScalar(std::numeric_limits<double>::infinity()),
+                   MakeScalar(false), &options);
+  CheckScalarUnary("is_null", MakeNullScalar(float64()), MakeScalar(true), &options);
+}


I think the repetition can easily be avoided here, for example by making this a templated test.
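
The idea of a templated test carries over to other test setups as a parameterized loop. A rough Python sketch, assuming a hypothetical `is_null_value` stand-in for the kernel and using `struct` to force each value through a 32-bit or 64-bit float; the cases mirror the ones in the tests quoted in this review.

```python
import math
import struct


def roundtrip(value: float, fmt: str) -> float:
    # Pack and unpack to force the value through the given float width.
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def is_null_value(value, nan_is_null=False):
    # Hypothetical stand-in for the kernel under test; None models a null slot.
    if value is None:
        return True
    return nan_is_null and math.isnan(value)


# (input, expected by default, expected with nan_is_null=True)
CASES = [
    (42.0, False, False),
    (float("nan"), False, True),
    (float("inf"), False, False),
    (None, True, True),
]


def run_cases(fmt: str) -> int:
    """Run every case for one float width; return the number of checks made."""
    checks = 0
    for value, expected_default, expected_nan in CASES:
        typed = None if value is None else roundtrip(value, fmt)
        assert is_null_value(typed) == expected_default
        assert is_null_value(typed, nan_is_null=True) == expected_nan
        checks += 2
    return checks
```

Calling `run_cases("f")` and `run_cases("d")` exercises the same table for both widths, so the cases are written once.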

pitrou · 2021-08-26T12:20:32Z

python/pyarrow/_dataset.pyx

        return Expression._call("is_valid", [self])

-    def is_null(self):
+    def is_null(self, bint nan_is_null=False):


I would make this argument keyword-only.
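
A keyword-only parameter is declared by placing it after a bare `*` in the signature. The sketch below is plain Python with a hypothetical `Expression` stand-in; the actual method lives in Cython and builds a compute expression, which is not reproduced here.

```python
class Expression:
    """Hypothetical stand-in for the dataset Expression class; not pyarrow's code."""

    def __init__(self, name: str):
        self.name = name

    def is_null(self, *, nan_is_null: bool = False) -> str:
        # The bare * makes every parameter after it keyword-only.
        return f"is_null({self.name}, nan_is_null={nan_is_null})"
```

With this signature, `expr.is_null(nan_is_null=True)` works, while `expr.is_null(True)` raises a `TypeError`, so the meaning of the flag is always visible at the call site.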

pitrou · 2021-08-26T12:20:38Z

python/pyarrow/array.pxi

            return 0

-    def is_null(self):
+    def is_null(self, nan_is_null=False):


pitrou · 2021-08-26T12:21:16Z

python/pyarrow/compute.py

    StrftimeOptions,
    DayOfWeekOptions,
+    NanNullOptions,
    TakeOptions,


We should try to keep this list alphabetically ordered (though DayOfWeekOptions already violates it :-/).

pitrou · 2021-08-26T12:35:49Z

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

+
+TEST_F(TestFloatValidityKernels, FloatScalarIsNull) {
+  CheckScalarUnary("is_null", MakeScalar(42.0f), MakeScalar(false));
+  CheckScalarUnary("is_null", MakeScalar(std::nanf("")), MakeScalar(false));


These separate tests shouldn't be necessary, as CheckScalarUnary already tests Scalar values implicitly when Array values are given.

Christian8491 · 2021-08-26T16:27:49Z

Thanks @pitrou for the feedback! I will be more careful next time with details like alphabetical ordering 👍

github-actions bot added the Component: C++ label Aug 6, 2021

Christian8491 changed the title ~~ARROW-12959: [C++] option for is null_na_n to evaluate to true~~ ARROW-12959: [C++] Option for is_null(NaN) to evaluate to true Aug 6, 2021

edponce suggested changes Aug 7, 2021

pitrou requested changes Aug 9, 2021

edponce suggested changes Aug 9, 2021

edponce reviewed Aug 9, 2021

Christian8491 added 14 commits August 11, 2021 12:49

Add NanNullOptions for IsNull kernel

a0239aa

Add kNanNullOptionsType to RegisterScalarOptions

bdc3453

Add Init param for MafeFunction and docs for is_null kernel

518bd3b

Add defaults() method to NanNullOptions class

c98c665

Add implementation for IsNullOperator, Scalar case

c993efc

Fix test compilation issue

da9fecc

Apply clang format, add tests for isnull

7b6ca78

Improve message for is_null tests

e4adc7d

Apply requested changes for IsNullOperator

8bdfe99

Remove default break, add todo to handle ArrayData for is_null

685c038

Apply SetBitsTo for NaN values when passed NanNullOptions

05b458b

move kNanNullOptions to anonymous namespace

c47838e

Fix for arrow-compute-expression-test (is_null)

5f6e8dd

Add bindings in cython

d91d64f

github-actions bot added the Component: Python label Aug 11, 2021

Christian8491 added 8 commits August 11, 2021 19:48

Add specialized tests for is_null

41515f1

Fix Sanitizer for KNanNullOptions

918539f

Remove cython bindigs

527ad78

binding in cython layer, fix python tests (theorically)

7847bc3

Add NanNullOptions for IsNull kernel

4cf4e4b

Add kNanNullOptionsType to RegisterScalarOptions

8b0c427

Add Init param for MafeFunction and docs for is_null kernel

59625b3

Add defaults() method to NanNullOptions class

3f91336

edponce and others added 3 commits August 18, 2021 11:21

update function doc

8d26437

fix NanNullOptions in Python bindings

4ad9337

Merge pull request #2 from edponce/ARROW-12959-R-Option-for-is-nullNa…

2831063

…N-to-evaluate-to-t ARROW-12959: [C++] Option for is_null(NaN) to evaluate to true

edponce suggested changes Aug 18, 2021

edponce reviewed Aug 18, 2021

Christian8491 added 2 commits August 18, 2021 17:18

Requested changes for scalar_validity_test and docs

915c868

Add python test using nan_is_null

4588822

Christian8491 marked this pull request as ready for review August 19, 2021 03:06

Update R bindings

9a4f39a

github-actions bot added the Component: R label Aug 20, 2021

ianmcook changed the title ~~ARROW-12959: [C++] Option for is_null(NaN) to evaluate to true~~ ARROW-12959: [C++][R] Option for is_null(NaN) to evaluate to true Aug 20, 2021

edponce approved these changes Aug 21, 2021

Christian8491 added 4 commits August 23, 2021 14:58

Merge branch 'master' into ARROW-12959-Option-for-is-nullNaN-to-evalu…

1a9a826

…ate-to-tru

Add cpp/submodules

71b85d2

Merge branch 'master' into ARROW-12959-Option-for-is-nullNaN-to-evalu…

50afc05

…ate-to-tru

parquet_testing sub without changes

7804c49

ianmcook requested a review from pitrou August 26, 2021 01:16

Merge branch 'master' into ARROW-12959-Option-for-is-nullNaN-to-evalu…

0e47c99

…ate-to-tru

pitrou requested changes Aug 26, 2021

Address review comments, also restructure docs a bit

30751fc

pitrou approved these changes Aug 26, 2021

Some more test cleanup

f09d52a

pitrou closed this in ee5a86a Aug 26, 2021

asfimport mentioned this pull request Aug 26, 2021

[C++][R] Option for is_null(NaN) to evaluate to true #28680

Closed
