-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-971: [C++][Compute] IsValid, IsNull kernels #7410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5703901 to
8fa6330
Compare
|
I rebased after fixing the linting issue ARROW-9120 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly this seems quite complex / opaque to me -- I am not going to be eager to write unit tests in this fashion. I would personally rather keep things as simple and explicit as possible so it's obvious what is being tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this looks over-engineered for what it seems to be doing (I may be misunderstanding) -- basically generating test data and checking expected values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll separate this into a separate PR, then. I think testing against random data has value, similar to using a fuzzer. Currently our random tests have a lot of boilerplate for generating the inputs and a lot of ad-hoc code for computing the expected values. IMHO it's worthwhile to have an interface for doing sanity checks across a wide swath of parameter space without needing to specify each of those manually (even if this is not enabled by default).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are free to open a separate PR but I'm against this particular implementation of this approach to testing so unless something materially changes about the way the developer specifies the tests I'm likely to vote against the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wesm if you have any ideas which would make this interface simpler, could you comment in the JIRA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that the approach to testing is viable at all for this part of the project. The examples that were in this PR contained effectively reimplementations of the kernel logic in the property test specification. Let's look at Add for example
const auto& out_type = args[0]->type;
if (!args[0]->is_valid || !args[1]->is_valid) {
return MakeNullScalar(out_type);
}
if (is_integer(out_type->id())) {
ARROW_ASSIGN_OR_RAISE(auto lhs, Cast<UInt64Scalar>(args[0]));
ARROW_ASSIGN_OR_RAISE(auto rhs, Cast<UInt64Scalar>(args[1]));
return UInt64Scalar(lhs->value + rhs->value).CastTo(out_type);
}
if (is_floating(out_type->id())) {
ARROW_ASSIGN_OR_RAISE(auto lhs, Cast<DoubleScalar>(args[0]));
ARROW_ASSIGN_OR_RAISE(auto rhs, Cast<DoubleScalar>(args[1]));
return DoubleScalar(lhs->value + rhs->value).CastTo(out_type);
}
return Status::NotImplemented("NYI");
I don't think it's a good idea to have a parallel collection of reimplementations of kernels.
|
I'm not thrilled about the testing mixin. Can we split all that out into a separate PR (if at all, will see what @pitrou thinks about it) so that these kernels are not held hostage over it? |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Wes that the added test harness doesn't seem worthwhile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what the name "property" is supposed to refer to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I wrong, or is this generating a new test subclass for simply different parameter values? This does not seem sound, especially as it will blow up compile times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't generating new classes. TYPED_TEST_SUITE does this (one template instantiation per type) but INSTANTIATE_TEST_SUITE_P only iterates over the provided values (which in this case are all ScalarFunctionPropertyTestParam)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this looks over-engineered for what it seems to be doing (I may be misunderstanding) -- basically generating test data and checking expected values.
8fa6330 to
7835000
Compare
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, a handful of comments
|
|
||
| static void Call(KernelContext* ctx, const ArrayData& arr, ArrayData* out) { | ||
| if (arr.buffers[0] != nullptr && out->offset == arr.offset && | ||
| out->length == arr.length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that the offset and length checks are needed anymore, you can either remove them or make them DCHECK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it happens this fails for sliced input. I'll restore the checks. For your interest, one approach i tried for is_valid was a MetaFunction which invoked a no-op ScalarFunction with INTERSECTION null handling then yanked the null bitmap from that. Unfortunately INTERSECTION doesn't currently support the zero copy case and I wasn't sure the approach would be acceptable but it did avoid repetition of null bitmap handling logic. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, indeed zero copy is not possible here in general. What do you think about doing the zero copy when arr.offset % 8 == 0? Then you can slice the bitmap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I understand that the changes to CheckScalarUnary break some of the ascii_* kernels. I'll disable those tests in this patch and then I'll fix them in a patch that is based on this patch.
| out->buffers[1] = arr.offset == 0 | ||
| ? arr.buffers[0] | ||
| : SliceBuffer(arr.buffers[0], arr.offset / 8, arr.length / 8); | ||
| out->offset = arr.offset % 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know off the top of my head what will be the implications of modifying the offset of the output value, but I think it should be okay, and we can always fix it later if it becomes an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I'm not clear on what the kernel is/isn't allowed to modify in the out data. I've added can_write_into_slices=false , so IIUC the kernel should always exclusively own the out data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not worth thinking too hard right now because kernel pipelining (temporary elision) is not implemented yet so until that happens it would be just speculation.
|
@bkietz if you want to disable the ascii_* tests that are failing please go ahead |
…issing break statements
|
Merging |
No description provided.