Skip to content

Conversation

@Christian8491
Copy link
Contributor

No description provided.

@github-actions
Copy link

github-actions bot commented Aug 6, 2021

@Christian8491 Christian8491 changed the title ARROW-12959: [C++] option for is null_na_n to evaluate to true ARROW-12959: [C++] Option for is_null(NaN) to evaluate to true Aug 6, 2021
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can probably made much simpler, e.g.:

bool* out_value = &checked_cast<BooleanScalar*>(out)->value;
if (in.type->id() == Type::FLOAT) {
  *out_value = !options.nan_is_null || !std::isnan(internal::UnboxScalar<FloatType>::Unbox(in));
}
else if (in.type->id() == Type::DOUBLE) {
  *out_value = !options.nan_is_null || !std::isnan(internal::UnboxScalar<DoubleType>::Unbox(in));
}
else {
  *out_value = true;
}
return Status::OK();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a check for validity prior to checking float type to prevent unnecessary checks when null bit is enabled. Nit: Also, I would change logic to to use && instead of ||, feels a bit more readable (and less characters).

    bool* out_value = &checked_cast<BooleanScalar*>(out)->value;
    if (in.is_valid) {
      switch (in.type->id()) {
        case Type::FLOAT:
          *out_value = options.nan_is_null && std::isnan(internal::UnboxScalar<FloatType>::Unbox(in));
        case Type::DOUBLE:
          *out_value = options.nan_is_null && std::isnan(internal::UnboxScalar<DoubleType>::Unbox(in));
        default:
          *out_value = false;
      }
    } else {
      *out_value = true;
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou I would be curious if it is worth it to have a specialized version for floating point types (similar to how arithmetic kernels are implemented), as the options.nan_is_null check does not applies to other data types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the version for Scalar inputs. Micro-optimizing it is futile. The version for Array inputs may be more interesting ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can put this in the anonymous namespace above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was addressed on 2e772c1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is comparing pointer values, which is incorrect, even though it will do what you expect most of the time.
(one could e.g. instantiate a separate instance using std::make_shared<FloatType>())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the break statement for the default case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed on 9a69e3e

Copy link
Contributor

@edponce edponce Aug 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the check for NaN values only applies to floating-point types and when nan_is_null is true, other types/cases can use the logic as it was.

In the case of ArrayData there are 3 scenarios:
a) arr.GetNullCount() == arr.length // All data is null
b) arr.GetNullCount() == 0 // No data is null
c) arr.MayHaveNulls() == true // Some data is null

IIUC, the null bitmap of the input ArrayData is not guaranteed to be consistent with the data (an ArrayData can be malformed bc buffer values can be modified directly). Scenarios (a)-(b) invoke arr.GetNullCount() which iterates through all the arr values and update the null count.

Given that scenarios (b)-(c) are the common case and the array data has to be traversed to identify the NaN values, (as an optimization) I suggest to not check the null count at all. Nevertheless, only check for NaNs in non-null indices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed f404910 which contains the ArrayData validity for nulls (not sure if it's the best way to do).
There is a way to test those changes in R ? As the ticket is also related to R language.

Copy link
Contributor

@edponce edponce Aug 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to test kernel implementation in R (or Python), first bindings have to be put in place with the desired default option. The C++ kernels are invoked via the language bindings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Christian8491 when this PR is ready, the C++ code is all reviewed and approved, and the CI is all green, please let me know and I can push a commit that changes the R bindings to use this new option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure @ianmcook I will.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou The is_null compute function most of the time will iterate through the bitmap of the input ArrayData and when nan_is_null option is set, it can increase the null count. Would it be a good idea to update the null count of the input array (arr.SetNullCount(...)) as a side-effect of invoking this compute function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would you update the null count of the input array? I'm not sure I understand what you're proposing.

Copy link
Contributor

@edponce edponce Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point now, thanks! Modifying the input array does not makes sense because nan_is_null property applies only to the intermediate/output of the compute function. Please ignore these comments.


* \(9) Output is true iff the corresponding input element is null.
* \(9) Output is true if the corresponding input element is null or if NaN
values are treated as null via the :struct:`NanNullOptions`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrase to Output is true if the corresponding input element is null. NaN values can be considered as null via ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed on 915c868

@edponce
Copy link
Contributor

edponce commented Aug 18, 2021

Add Python tests that make use of the nan_is_null option, see https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_compute.py#L1302

arr = pa.array([1, 2, 3, None, np.nan])
arr.is_null()                  # expected = pa.array([False, False, False, True, False])
arr.is_null(nan_is_null=True)  # expected = pa.array([False, False, False, True, True])

@Christian8491 Christian8491 marked this pull request as ready for review August 19, 2021 03:06
@Christian8491
Copy link
Contributor Author

@ianmcook could you provide me some help for R bindings in order to use this new option.

@ianmcook
Copy link
Member

@Christian8491 could you please confirm that it is safe to set nan_is_null to true when the datum has a non-float type, and it will have no effect or performance cost in that case?

@Christian8491
Copy link
Contributor Author

@Christian8491 could you please confirm that it is safe to set nan_is_null to true when the datum has a non-float type, and it will have no effect or performance cost in that case?

Yes @ianmcook as the logic demands a first check for floating type cc @edponce

@ianmcook ianmcook changed the title ARROW-12959: [C++] Option for is_null(NaN) to evaluate to true ARROW-12959: [C++][R] Option for is_null(NaN) to evaluate to true Aug 20, 2021
@ianmcook
Copy link
Member

I pushed 9a4f39a to update the R bindings to use this option. Thanks again @Christian8491 for doing this!

@ianmcook ianmcook requested a review from pitrou August 26, 2021 01:16
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some more comments. I'm going to push the required changes myself.

int64_t start, stop, step;
};

class ARROW_EXPORT NanNullOptions : public FunctionOptions {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest renaming this NullOptions, in case other values than NaN need to be considered null at some point.

{"values"});
const FunctionDoc is_null_doc(
"Return true if null, NaN values can be considered as null",
("For each input value, emit true if the value is null. Default behavior is to emit "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By convention, function descriptions are explicitly line-wrapped with \n characters. I suggest doing the same here.

CheckScalarUnary("is_null", MakeScalar(std::numeric_limits<double>::infinity()),
MakeScalar(false), &options);
CheckScalarUnary("is_null", MakeNullScalar(float64()), MakeScalar(true), &options);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the repetition can easily be avoided here, for example by making this a templated test.

return Expression._call("is_valid", [self])

def is_null(self):
def is_null(self, bint nan_is_null=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this argument keyword-only.

return 0

def is_null(self):
def is_null(self, nan_is_null=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

StrftimeOptions,
DayOfWeekOptions,
NanNullOptions,
TakeOptions,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try to maintain this alphabetically-ordered (though DayOfWeekOptions already violates it :-/).


TEST_F(TestFloatValidityKernels, FloatScalarIsNull) {
CheckScalarUnary("is_null", MakeScalar(42.0f), MakeScalar(false));
CheckScalarUnary("is_null", MakeScalar(std::nanf("")), MakeScalar(false));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These separate tests shouldn't be necessary, as CheckScalarUnary already tests Scalar values implicitly when Array values are given.

@Christian8491
Copy link
Contributor Author

Thanks @pitrou for the feedback! Will be more carefully next time with details like alphabetically-ordered things 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants