Skip to content

Conversation

@Fokko
Copy link
Contributor

@Fokko Fokko commented Feb 14, 2023

@Fokko Fokko requested a review from AlenkaF as a code owner February 14, 2023 22:28
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #34154 has been automatically assigned in GitHub to PR creator.

@Fokko Fokko force-pushed the fd-nan-expression branch from 78ec952 to 36db249 Compare February 14, 2023 22:45
@Fokko
Copy link
Contributor Author

Fokko commented Feb 14, 2023

Is this expected? 🔍

>       assert result.equals(expected)
E       assert False
E        +  where False = <built-in method equals of pyarrow.lib.BooleanArray object at 0x7f41dfe37c90>(<pyarrow.lib.BooleanArray object at 0x7f41dfe37a60>\n[\n  false,\n  false,\n  false,\n  false,\n  true\n])
E        +    where <built-in method equals of pyarrow.lib.BooleanArray object at 0x7f41dfe37c90> = <pyarrow.lib.BooleanArray object at 0x7f41dfe37c90>\n[\n  false,\n  false,\n  false,\n  null,\n  true\n].equals

@AlenkaF
Copy link
Member

AlenkaF commented Feb 15, 2023

Looking at the tests in C++, I think it is expected:

TEST(TestValidityKernels, IsNan) {
for (const auto& ty : IntTypes()) {
CheckScalar("is_nan", {ArrayFromJSON(ty, "[0, 1, 42, null]")},
ArrayFromJSON(boolean(), "[false, false, false, null]"));
}
for (const auto& ty : {decimal128(4, 2), decimal256(4, 2)}) {
CheckScalar("is_nan", {ArrayFromJSON(ty, R"(["0.00", "1.01", "-42.00", null])")},
ArrayFromJSON(boolean(), "[false, false, false, null]"));
}
CheckScalar("is_nan", {std::make_shared<NullArray>(4)},
ArrayFromJSON(boolean(), "[null, null, null, null]"));
}

void TestIsNan() {
auto ty = this->type_singleton();
CheckScalarUnary("is_nan", ArrayFromJSON(ty, "[]"), ArrayFromJSON(boolean(), "[]"));
// All NaN
CheckScalarUnary("is_nan", ArrayFromJSON(ty, "[NaN, NaN, NaN, NaN, NaN]"),
ArrayFromJSON(boolean(), "[true, true, true, true, true]"));
// No NaN
CheckScalarUnary(
"is_nan", ArrayFromJSON(ty, "[0.0, 1.0, 2.0, 3.0, Inf, null]"),
ArrayFromJSON(boolean(), "[false, false, false, false, false, null]"));
// Some NaNs
CheckScalarUnary("is_nan", ArrayFromJSON(ty, "[0.0, NaN, 2.0, NaN, Inf, null]"),
ArrayFromJSON(boolean(), "[false, true, false, true, false, null]"));
}
};

@jorisvandenbossche
Copy link
Member

Is this expected?

It's not super clear from the assert message, but so it is the "null" that is different (see my inline comment). As Alenka mentioned above, that is indeed expected (and tested on the C++ side) behaviour.

In general, predicate functions (checking something, returning a boolean) propagate nulls (in the idea of "you don't know the value, so you also don't know if it is NaN or not, so whether it should be True or False).
(one exception on that rule is of course is_null to check for nulls itself)

@Fokko
Copy link
Contributor Author

Fokko commented Feb 23, 2023

I learned something today. Thanks @AlenkaF for pointing that out.

Updated the PR with also adding the option of returning None's in the docstring 👍🏻

@AlenkaF
Copy link
Member

AlenkaF commented Feb 23, 2023

Great, thanks for adding this!
One thing I see missing is taking into account that is_nan expression only handles numeric types. What happens if we call is_nan on an array of strings or any non-numeric type?

@Fokko Fokko force-pushed the fd-nan-expression branch 2 times, most recently from a56f3fb to 2229918 Compare February 23, 2023 15:23
@Fokko Fokko force-pushed the fd-nan-expression branch from 2229918 to 2793dd5 Compare February 23, 2023 15:38
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good, added some comments on the docstrings.

You also still have a related doctest failure:

__________________ [doctest] pyarrow.lib.ChunkedArray.is_nan ___________________
331 ChunkedArray.is_nan(self)
332 
333         Return boolean array indicating the nan values.
334 
335         Examples
336         --------
337         >>> import pyarrow as pa
338         >>> import numpy as np
339         >>> arr = pa.chunked_array([[2, np.nan, 4], [4, None, 100]])
340         >>> arr.is_nan()
Differences (unified diff with -expected +actual):
    @@ -1,12 +1,10 @@
    -<pyarrow.lib.ChunkedArray object at ...>
    +<pyarrow.lib.ChunkedArray object at 0x7f91d0ff3360>
     [
       [
         false,
         true,
    -    false
    -  ],
    -  [
         false,
    -    None,
    +    false,
    +    null,
         false
       ]

FYI you can check those locally as well: https://arrow.apache.org/docs/dev/developers/python.html#doctest

Returns
-------
array : boolean Array, where null values are None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this "where null values are None" mean exactly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That a null value, will result in a Python None:

def test_is_nan():
arr = pa.array([1, 2, 3, None, np.nan])
result = arr.is_nan()
expected = pa.array([False, False, False, None, True])
assert result.equals(expected)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK, I didn't get that at first. Now, I don't think that is something we need to mention here in this docstring (that's true for all functions that return arrow data). And it's also only when converting to native Python objects that this happens, the actual result here is a pyarrow.array, and that doesn't use None (so the explanation would be relevant for a method like Array.to_pylist().

The reason you see it in the tests, is because we allow you to use None to specify a null value when constructing an array (but the None gets converted to a proper "null").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Okay, that's fine with me as well. It was not directly obvious for me, but it makes sense now I'm aware of it.

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@github-actions github-actions bot added awaiting committer review Awaiting committer review awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 2, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 3, 2023
@jorisvandenbossche jorisvandenbossche changed the title GH-34154: [Python] Add is_nan expression GH-34154: [Python] Add is_nan method to Array and Expression Mar 9, 2023
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jorisvandenbossche jorisvandenbossche merged commit 2ff4e3a into apache:main Mar 9, 2023
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Mar 9, 2023
@ursabot
Copy link

ursabot commented Mar 9, 2023

Benchmark runs are scheduled for baseline = b679a96 and contender = 2ff4e3a. 2ff4e3a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.15% ⬆️0.03%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.06% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 2ff4e3a2 ec2-t3-xlarge-us-east-2
[Finished] 2ff4e3a2 test-mac-arm
[Finished] 2ff4e3a2 ursa-i9-9960x
[Finished] 2ff4e3a2 ursa-thinkcentre-m75q
[Finished] b679a96d ec2-t3-xlarge-us-east-2
[Failed] b679a96d test-mac-arm
[Finished] b679a96d ursa-i9-9960x
[Finished] b679a96d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Python] Add is_nan for Python

4 participants