ARROW-971: [C++][Compute] IsValid, IsNull kernels #7410

bkietz · 2020-06-11T21:13:57Z

No description provided.

github-actions · 2020-06-11T21:16:59Z

https://issues.apache.org/jira/browse/ARROW-971

cpp/src/arrow/compute/kernels/codegen_internal.h

cpp/src/arrow/compute/exec.cc

cpp/src/arrow/util/iterator_test.cc

cpp/src/arrow/compute/kernels/test_util.h

wesm · 2020-06-13T02:58:09Z

I rebased after fixing the linting issue ARROW-9120

cpp/src/arrow/testing/random.h

cpp/src/arrow/array/builder_base.cc

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

wesm · 2020-06-13T03:15:29Z

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

Honestly this seems quite complex / opaque to me -- I am not going to be eager to write unit tests in this fashion. I would personally rather keep things as simple and explicit as possible so it's obvious what is being tested.

I agree that this looks over-engineered for what it seems to be doing (I may be misunderstanding) -- basically generating test data and checking expected values.

I'll separate this into a separate PR, then. I think testing against random data has value, similar to using a fuzzer. Currently our random tests have a lot of boilerplate for generating the inputs and a lot of ad-hoc code for computing the expected values. IMHO it's worthwhile to have an interface for doing sanity checks across a wide swath of parameter space without needing to specify each of those manually (even if this is not enabled by default).

You are free to open a separate PR, but I'm against this particular implementation of this approach to testing. Unless something materially changes about the way the developer specifies the tests, I'm likely to vote against the PR.

https://issues.apache.org/jira/browse/ARROW-9135

@wesm if you have any ideas which would make this interface simpler, could you comment in the JIRA?

I'm not sure that this approach to testing is viable at all for this part of the project. The examples that were in this PR effectively contained reimplementations of the kernel logic in the property test specification. Let's look at Add, for example:

const auto& out_type = args[0]->type;
if (!args[0]->is_valid || !args[1]->is_valid) {
  return MakeNullScalar(out_type);
}
if (is_integer(out_type->id())) {
  ARROW_ASSIGN_OR_RAISE(auto lhs, Cast<UInt64Scalar>(args[0]));
  ARROW_ASSIGN_OR_RAISE(auto rhs, Cast<UInt64Scalar>(args[1]));
  return UInt64Scalar(lhs->value + rhs->value).CastTo(out_type);
}
if (is_floating(out_type->id())) {
  ARROW_ASSIGN_OR_RAISE(auto lhs, Cast<DoubleScalar>(args[0]));
  ARROW_ASSIGN_OR_RAISE(auto rhs, Cast<DoubleScalar>(args[1]));
  return DoubleScalar(lhs->value + rhs->value).CastTo(out_type);
}
return Status::NotImplemented("NYI");

I don't think it's a good idea to have a parallel collection of reimplementations of kernels.

wesm · 2020-06-13T03:19:56Z

I'm not thrilled about the testing mixin. Can we split all that out into a separate PR (if at all, will see what @pitrou thinks about it) so that these kernels are not held hostage over it?

cpp/src/arrow/compute/kernels/scalar_validity.cc

pitrou

I agree with Wes that the added test harness doesn't seem worthwhile.

cpp/src/arrow/testing/random.h

pitrou · 2020-06-15T12:13:30Z

cpp/src/arrow/compute/kernels/test_util.h

I'm not sure what the name "property" is supposed to refer to.

https://en.wikipedia.org/wiki/Property_testing

pitrou · 2020-06-15T12:14:35Z

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

Am I wrong, or is this generating a new test subclass for simply different parameter values? This does not seem sound, especially as it will blow up compile times.

It isn't generating new classes. TYPED_TEST_SUITE does this (one template instantiation per type), but INSTANTIATE_TEST_SUITE_P only iterates over the provided values (which in this case are all ScalarFunctionPropertyTestParam).

pitrou · 2020-06-15T12:15:23Z

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

I agree that this looks over-engineered for what it seems to be doing (I may be misunderstanding) -- basically generating test data and checking expected values.

cpp/src/arrow/compute/kernels/scalar_validity.cc

wesm

Looks good, a handful of comments

wesm · 2020-06-16T15:35:32Z

cpp/src/arrow/compute/kernels/scalar_validity.cc

+
+  static void Call(KernelContext* ctx, const ArrayData& arr, ArrayData* out) {
+    if (arr.buffers[0] != nullptr && out->offset == arr.offset &&
+        out->length == arr.length) {


I don't think that the offset and length checks are needed anymore; you can either remove them or turn them into DCHECKs.

As it happens, this fails for sliced input, so I'll restore the checks. For your interest, one approach I tried for is_valid was a MetaFunction which invoked a no-op ScalarFunction with INTERSECTION null handling and then yanked the null bitmap from the result. Unfortunately, INTERSECTION doesn't currently support the zero copy case, and I wasn't sure the approach would be acceptable, but it did avoid repeating the null bitmap handling logic. What do you think?

Ah yes, indeed zero copy is not possible here in general. What do you think about doing the zero copy when arr.offset % 8 == 0? Then you can slice the bitmap.
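
The byte-aligned case suggested here can be sketched outside of Arrow as follows. This is a minimal illustration, not the PR's code: `Bitmap`, `ValidityResult`, and `ExtractValidity` are hypothetical names, a `std::vector<uint8_t>` stands in for an Arrow buffer, and the "zero copy" branch copies a byte range where a real implementation would share the input buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a validity bitmap: bit i lives in byte i / 8 at
// position i % 8 (least-significant bit first).
using Bitmap = std::vector<uint8_t>;

inline bool GetBit(const Bitmap& bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

inline void SetBit(Bitmap* bits, int64_t i) {
  (*bits)[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
}

struct ValidityResult {
  Bitmap bits;     // bytes backing the result, read from bit 0
  bool zero_copy;  // true if the bytes are a plain byte range of the input
};

// Extracts `length` validity bits starting at bit `offset`.
// Assumes `validity` holds at least (offset + length + 7) / 8 bytes.
ValidityResult ExtractValidity(const Bitmap& validity, int64_t offset,
                               int64_t length) {
  assert(offset >= 0 && length >= 0);
  const int64_t num_bytes = (length + 7) / 8;
  if (offset % 8 == 0) {
    // Byte-aligned: the result is exactly the bytes starting at offset / 8.
    // A real buffer implementation could share this range instead of copying.
    auto first = validity.begin() + offset / 8;
    return {Bitmap(first, first + num_bytes), true};
  }
  // Unaligned: the bits straddle byte boundaries, so copy them one by one
  // into a freshly allocated, zero-initialized bitmap.
  Bitmap out(num_bytes, 0);
  for (int64_t i = 0; i < length; ++i) {
    if (GetBit(validity, offset + i)) SetBit(&out, i);
  }
  return {out, false};
}
```

With an offset of 8 the function takes the aligned branch and returns the input bytes unchanged; with an offset of 3 it falls back to allocating and copying bit by bit.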

cpp/src/arrow/compute/kernels/scalar_validity.cc

cpp/src/arrow/compute/kernels/scalar_validity_test.cc

cpp/src/arrow/testing/random.h

cpp/src/arrow/testing/random.cc

wesm

+1. I understand that the changes to CheckScalarUnary break some of the ascii_* kernels. I'll disable those tests in this patch and then I'll fix them in a patch that is based on this patch.

wesm · 2020-06-16T21:01:51Z

cpp/src/arrow/compute/kernels/scalar_validity.cc

+      out->buffers[1] = arr.offset == 0
+                            ? arr.buffers[0]
+                            : SliceBuffer(arr.buffers[0], arr.offset / 8, arr.length / 8);
+      out->offset = arr.offset % 8;


I don't know off the top of my head what will be the implications of modifying the offset of the output value, but I think it should be okay, and we can always fix it later if it becomes an issue

In general I'm not clear on what the kernel is/isn't allowed to modify in the out data. I've added can_write_into_slices=false, so IIUC the kernel should always exclusively own the out data.

It's not worth thinking too hard about right now, because kernel pipelining (temporary elision) is not implemented yet; until that happens, it would be just speculation.
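
To make the offset bookkeeping in this exchange concrete, here is a small sketch of slicing a bitmap at a byte boundary while keeping a residual bit offset on the output. `BitmapSlice`, `SliceForOffset`, and `ReadBit` are hypothetical helpers, not Arrow APIs. The byte count is rounded up so that the slice covers the residual bits plus `length`; that rounding is an assumption of this sketch rather than a description of the merged code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Describes which bytes of the input bitmap are kept and where the first
// logical bit sits inside the first kept byte.
struct BitmapSlice {
  int64_t byte_start;  // index of the first kept byte (offset / 8)
  int64_t byte_count;  // number of kept bytes, rounded up
  int64_t bit_offset;  // residual offset in [0, 8) (offset % 8)
};

inline BitmapSlice SliceForOffset(int64_t offset, int64_t length) {
  assert(offset >= 0 && length >= 0);
  const int64_t bit_offset = offset % 8;
  // Round up so the last partially used byte is included.
  const int64_t byte_count = (bit_offset + length + 7) / 8;
  return {offset / 8, byte_count, bit_offset};
}

// Reads bit i of a least-significant-bit-first bitmap.
inline bool ReadBit(const std::vector<uint8_t>& bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}
```

A consumer reads logical element `i` of the output as bit `bit_offset + i` of the sliced bytes, which is the same bit as `offset + i` in the original bitmap.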

wesm · 2020-06-16T21:05:14Z

@bkietz if you want to disable the ascii_* tests that are failing, please go ahead.

wesm · 2020-06-16T22:02:56Z

Merging

wesm reviewed Jun 11, 2020

cpp/src/arrow/compute/kernels/codegen_internal.h

bkietz commented Jun 12, 2020

cpp/src/arrow/compute/exec.cc

bkietz commented Jun 12, 2020

cpp/src/arrow/util/iterator_test.cc

bkietz commented Jun 12, 2020

cpp/src/arrow/compute/kernels/test_util.h

wesm force-pushed the 971-Implement-Array-isvalid-n branch from 5703901 to 8fa6330 Compare June 13, 2020 02:57

wesm reviewed Jun 13, 2020

cpp/src/arrow/compute/kernels/scalar_validity.cc

pitrou reviewed Jun 15, 2020

bkietz added 6 commits June 15, 2020 10:42

ARROW-971: [C++][Compute] IsValid, IsNull kernels

8295c3f

revert Property testing

9085048

use CheckScalarUnary

174a53f

document RAG::ArrayOf

1d6f6c9

remove unused #includes

71d082e

explicit internal:: usage

7835000

bkietz force-pushed the 971-Implement-Array-isvalid-n branch from 8fa6330 to 7835000 Compare June 15, 2020 14:43

bkietz added 4 commits June 15, 2020 10:59

fix anonymous namespace

8034e7a

switch to in-kernel allocation when zero copy is unavailable

932ecc1

remove debug check

cada19f

rename Operators for validity kernels

95e2545

wesm reviewed Jun 16, 2020

bkietz added 2 commits June 16, 2020 13:28

review comments

6f31cd2

use zero copy with buffer slices

5bd1057

wesm approved these changes Jun 16, 2020

Disable AsciiUpper/AsciiLower unit tests until ARROW-9122 done, add missing break statements

be7a03f

wesm closed this in 60cdc75 Jun 16, 2020

bkietz deleted the 971-Implement-Array-isvalid-n branch February 25, 2021 16:32

This was referenced Jun 17, 2020

[C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions #16570

Closed

[C++] Improve configurability of RandomArrayGenerator::ArrayOf #25257

Closed

asfimport mentioned this pull request Jun 15, 2020

[C++][Compute] Provide a kernel property testing API #25246

Closed
