GH-34154: [Python] Add `is_nan` method to Array and Expression #34184

Fokko · 2023-02-14T22:28:56Z

Closes: [Python] Add is_nan for Python #34154

github-actions · 2023-02-14T22:29:17Z

Closes: [Python] Add is_nan for Python #34154

github-actions · 2023-02-14T22:29:20Z

⚠️ GitHub issue #34154 has been automatically assigned in GitHub to PR creator.

Closes apache#34154

Fokko · 2023-02-14T23:00:37Z

Is this expected? 🔍

>       assert result.equals(expected)
E       assert False
E        +  where False = <built-in method equals of pyarrow.lib.BooleanArray object at 0x7f41dfe37c90>(<pyarrow.lib.BooleanArray object at 0x7f41dfe37a60>\n[\n  false,\n  false,\n  false,\n  false,\n  true\n])
E        +    where <built-in method equals of pyarrow.lib.BooleanArray object at 0x7f41dfe37c90> = <pyarrow.lib.BooleanArray object at 0x7f41dfe37c90>\n[\n  false,\n  false,\n  false,\n  null,\n  true\n].equals

AlenkaF · 2023-02-15T09:00:06Z

Looking at the tests in C++, I think it is expected:

arrow/cpp/src/arrow/compute/kernels/scalar_validity_test.cc

Lines 121 to 132 in 53752ad

    
           TEST(TestValidityKernels, IsNan) { 
        
             for (const auto& ty : IntTypes()) { 
        
               CheckScalar("is_nan", {ArrayFromJSON(ty, "[0, 1, 42, null]")}, 
        
                           ArrayFromJSON(boolean(), "[false, false, false, null]")); 
        
             } 
        
             for (const auto& ty : {decimal128(4, 2), decimal256(4, 2)}) { 
        
               CheckScalar("is_nan", {ArrayFromJSON(ty, R"(["0.00", "1.01", "-42.00", null])")}, 
        
                           ArrayFromJSON(boolean(), "[false, false, false, null]")); 
        
             } 
        
             CheckScalar("is_nan", {std::make_shared<NullArray>(4)}, 
        
                         ArrayFromJSON(boolean(), "[null, null, null, null]")); 
        
           }

arrow/cpp/src/arrow/compute/kernels/scalar_validity_test.cc

Lines 229 to 244 in 53752ad

    
             void TestIsNan() { 
        
               auto ty = this->type_singleton(); 
        
               CheckScalarUnary("is_nan", ArrayFromJSON(ty, "[]"), ArrayFromJSON(boolean(), "[]")); 
        
               // All NaN 
        
               CheckScalarUnary("is_nan", ArrayFromJSON(ty, "[NaN, NaN, NaN, NaN, NaN]"), 
        
                                ArrayFromJSON(boolean(), "[true, true, true, true, true]")); 
        
               // No NaN 
        
               CheckScalarUnary( 
        
                   "is_nan", ArrayFromJSON(ty, "[0.0, 1.0, 2.0, 3.0, Inf, null]"), 
        
                   ArrayFromJSON(boolean(), "[false, false, false, false, false, null]")); 
        
               // Some NaNs 
        
               CheckScalarUnary("is_nan", ArrayFromJSON(ty, "[0.0, NaN, 2.0, NaN, Inf, null]"), 
        
                                ArrayFromJSON(boolean(), "[false, true, false, true, false, null]")); 
        
             } 
        
           };

python/pyarrow/tests/test_compute.py

jorisvandenbossche · 2023-02-16T15:29:21Z

Is this expected?

It's not super clear from the assert message, but so it is the "null" that is different (see my inline comment). As Alenka mentioned above, that is indeed expected (and tested on the C++ side) behaviour.

In general, predicate functions (checking something, returning a boolean) propagate nulls (in the idea of "you don't know the value, so you also don't know if it is NaN or not, so whether it should be True or False).
(one exception on that rule is of course is_null to check for nulls itself)

Fokko · 2023-02-23T07:55:38Z

I learned something today. Thanks @AlenkaF for pointing that out.

Updated the PR with also adding the option of returning None's in the docstring 👍🏻

…pression

AlenkaF · 2023-02-23T11:37:23Z

Great, thanks for adding this!
One thing I see missing is taking into account that is_nan expression only handles numeric types. What happens if we call is_nan on an array of strings or any non-numeric type?

jorisvandenbossche

Code looks good, added some comments on the docstrings.

You also still have a related doctest failure:

__________________ [doctest] pyarrow.lib.ChunkedArray.is_nan ___________________
331 ChunkedArray.is_nan(self)
332 
333         Return boolean array indicating the nan values.
334 
335         Examples
336         --------
337         >>> import pyarrow as pa
338         >>> import numpy as np
339         >>> arr = pa.chunked_array([[2, np.nan, 4], [4, None, 100]])
340         >>> arr.is_nan()
Differences (unified diff with -expected +actual):
    @@ -1,12 +1,10 @@
    -<pyarrow.lib.ChunkedArray object at ...>
    +<pyarrow.lib.ChunkedArray object at 0x7f91d0ff3360>
     [
       [
         false,
         true,
    -    false
    -  ],
    -  [
         false,
    -    None,
    +    false,
    +    null,
         false
       ]

FYI you can check those locally as well: https://arrow.apache.org/docs/dev/developers/python.html#doctest

jorisvandenbossche · 2023-02-28T08:47:40Z

python/pyarrow/array.pxi

+
+        Returns
+        -------
+        array : boolean Array, where null values are None


What does this "where null values are None" mean exactly?

That a null value, will result in a Python None:

arrow/python/pyarrow/tests/test_compute.py

Lines 1639 to 1643 in df625f2

def test_is_nan():

arr = pa.array([1, 2, 3, None, np.nan])

result = arr.is_nan()

expected = pa.array([False, False, False, None, True])

assert result.equals(expected)

Ah, OK, I didn't get that at first. Now, I don't think that is something we need to mention here in this docstring (that's true for all functions that return arrow data). And it's also only when converting to native Python objects that this happens, the actual result here is a pyarrow.array, and that doesn't use None (so the explanation would be relevant for a method like Array.to_pylist().

The reason you see it in the tests, is because we allow you to use None to specify a null value when constructing an array (but the None gets converted to a proper "null").

@jorisvandenbossche Okay, that's fine with me as well. It was not directly obvious for me, but it makes sense now I'm aware of it.

python/pyarrow/_compute.pyx

python/pyarrow/array.pxi

python/pyarrow/table.pxi

python/pyarrow/tests/test_compute.py

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

…n-expression

…pression

jorisvandenbossche

Thanks!

ursabot · 2023-03-09T23:36:52Z

Benchmark runs are scheduled for baseline = b679a96 and contender = 2ff4e3a. 2ff4e3a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.15% ⬆️0.03%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.06% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 2ff4e3a2 ec2-t3-xlarge-us-east-2
[Finished] 2ff4e3a2 test-mac-arm
[Finished] 2ff4e3a2 ursa-i9-9960x
[Finished] 2ff4e3a2 ursa-thinkcentre-m75q
[Finished] b679a96d ec2-t3-xlarge-us-east-2
[Failed] b679a96d test-mac-arm
[Finished] b679a96d ursa-i9-9960x
[Finished] b679a96d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Fokko requested a review from AlenkaF as a code owner February 14, 2023 22:28

github-actions bot added the Component: Python label Feb 14, 2023

apacheGH-34154: [Python] Add is_nan expression

36db249

Closes apache#34154

Fokko force-pushed the fd-nan-expression branch from 78ec952 to 36db249 Compare February 14, 2023 22:45

jorisvandenbossche reviewed Feb 16, 2023

View reviewed changes

python/pyarrow/tests/test_compute.py Outdated Show resolved Hide resolved

Fokko added 2 commits February 23, 2023 08:55

Update test and docstring

ba349b5

Merge branch 'main' of https://github.com/apache/arrow into fd-nan-ex…

1020257

…pression

Fokko force-pushed the fd-nan-expression branch 2 times, most recently from a56f3fb to 2229918 Compare February 23, 2023 15:23

Add tests

2793dd5

Fokko force-pushed the fd-nan-expression branch from 2229918 to 2793dd5 Compare February 23, 2023 15:38

Oops

df625f2

jorisvandenbossche reviewed Feb 28, 2023

View reviewed changes

Apply suggestions from code review

c254cf5

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

github-actions bot added awaiting committer review Awaiting committer review awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 2, 2023

Update array.pxi

d4c1361

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 3, 2023

Fokko added 3 commits March 3, 2023 22:00

Fix tests

384e62d

Merge branch 'fd-nan-expression' of github.com:Fokko/arrow into fd-na…

80533a5

…n-expression

Merge branch 'main' of https://github.com/apache/arrow into fd-nan-ex…

86b35e7

…pression

Merge branch 'main' of https://github.com/apache/arrow into fd-nan-ex…

9d2dde1

…pression

jorisvandenbossche changed the title ~~GH-34154: [Python] Add is_nan expression~~ GH-34154: [Python] Add is_nan method to Array and Expression Mar 9, 2023

jorisvandenbossche approved these changes Mar 9, 2023

View reviewed changes

jorisvandenbossche merged commit 2ff4e3a into apache:main Mar 9, 2023

jorisvandenbossche mentioned this pull request Mar 9, 2023

[Python] Add is_nan for Python #34154

Closed

github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Mar 9, 2023

Fokko mentioned this pull request Mar 15, 2023

Python: Integration tests apache/iceberg#6398

Merged

	def test_is_nan():
	arr = pa.array([1, 2, 3, None, np.nan])
	result = arr.is_nan()
	expected = pa.array([False, False, False, None, True])
	assert result.equals(expected)

GH-34154: [Python] Add is_nan method to Array and Expression #34184

GH-34154: [Python] Add is_nan method to Array and Expression #34184

Uh oh!

Conversation

Fokko commented Feb 14, 2023 • edited by jorisvandenbossche Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Feb 14, 2023

Uh oh!

github-actions bot commented Feb 14, 2023

Uh oh!

Fokko commented Feb 14, 2023

Uh oh!

AlenkaF commented Feb 15, 2023

Uh oh!

Uh oh!

jorisvandenbossche commented Feb 16, 2023

Uh oh!

Fokko commented Feb 23, 2023

Uh oh!

AlenkaF commented Feb 23, 2023

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche Feb 28, 2023

Choose a reason for hiding this comment

Uh oh!

Fokko Mar 2, 2023

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche Mar 2, 2023

Choose a reason for hiding this comment

Uh oh!

Fokko Mar 3, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

ursabot commented Mar 9, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

GH-34154: [Python] Add `is_nan` method to Array and Expression #34184

GH-34154: [Python] Add `is_nan` method to Array and Expression #34184

Fokko commented Feb 14, 2023 •

edited by jorisvandenbossche

Loading