-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1567: [C++] Implement "fill_null" function that replaces null values with a scalar value #7635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello and welcome to the project! Thanks for starting to work on this. I have some high level and low level comments and can help guide you through this work since this part of the project is pretty new
cpp/src/arrow/compute/api_scalar.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of these input validation checks should be here. Instead, the kernel should be implemented as an Arity::Binary() kernel with input validation handled by the kernel dispatch / executor layer. It's fine if the initial version has the type signature Array/Scalar instead of Any/Any
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been reworked
cpp/src/arrow/compute/api_scalar.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments above. I think this should be implemented as a binary function since the "fill values" could be provided by an array. This also will allow the execution layer to (in the near future) insert implicit casts where needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per above, I think this kernel would be better implemented without a KernelInit function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output is never null (unless the fill value is null), right? So you don't need to touch the validity bitmap. In fact, the default mode of the kernel should not allocate one (also for reasons of zero copy, I will comment below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reworked this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree:
- If the fill value is not null, then the output is non-null, so there is no need to allocate a validity bitmap in most cases. So the default behavior should be to use
NullHandling::COMPUTED_NO_PREALLOCATEfor the nulls - Since the kernel can do zero-copy when the input has no nulls, this needs to use
MemAllocation::NO_PREALLOCATEand instead leave memory allocation to the kernel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking time to review - please bear with me while I work through all these.
First point makes sense. For the second, do you mind expanding on that for my understanding? Zero copy in relation to what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You see the places where you're doing
*out_arr = data;
I presume your intent is to pass along the memory from the input argument (when the input has no nulls)? That's what's meant by zero-copy, no need to do any processing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked, following your guidance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should rework this to use a combination of BitBlockCounter and probably GenerateBitsUnrolled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The has been reworked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are declared elsewhere I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see them declared a couple of times in the code base as PrimitiveDictionaries eg in scalar_set_lookup_test.cc - should I use that instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use the ArrayFromJSON functions instead for specifying the test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior of the kernel when passing a scalar with is_valid=false is not validated (and probably yield incorrect results)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have included a test case
8a0ab4a to
168a5b9
Compare
|
This is close to what I'm looking for. I'm going to push some changes to this branch in a little while and then will merge this |
… width, other fixes
168a5b9 to
bc45320
Compare
|
I changed the implementations to do a single memory allocation and avoid the builder classes, which will be faster, and fixed some other stuff. Additionally, instantiating fewer templates since we can use e.g. a UInt64 template to process all 64-bit types including Double, etc. |
| } | ||
| if (!fill_value.is_scalar()) { | ||
| ctx->SetStatus(Status::Invalid("fill value must be a scalar")); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: these type checks are not necessary. You can safely assume that once Exec is called that the types have already been checked
|
+1, will merge once the build passes |
|
@c-jamie thanks for the patch, could you let me know your ASF JIRA username (or create one if you don't have one) so I can assign the issue to you? |
|
I've commented on https://issues.apache.org/jira/browse/ARROW-1567 Thanks for all the help! |
This PR implements fill null for most primitive types.
Please note this is my first time contribiting to such a project. I've attempted to follow all available guidelines.
There also may be obvious implementation details missing, as I am relatively new to CPP and it's intricacies.