-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-11044: [C++] Add "replace" kernel #9024
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jorisvandenbossche: Would you have you look at this in-progress PR and share your thought ? Replace kernel implemented for most (if not all) types but missing null handling. |
|
Note on ArrayData::GetValues(i) in the context of commit ab2d3d2:
On my case, I fixed the bug in my code by forcing absolute_offset to 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use full imports and put them inside the anonymous namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @emkornfield, not sure to get your point... Do you mean:
namespace arrow {
namespace compute {
namespace internal {
namespace {
using ::arrow::internal::BitBlockCount;
using ::arrow::internal::BitBlockCounter;
...
...
}
}
}
}
❓
|
I wouldn't expect a "replace" operation to do this. Instead, this looks more like a "select" operation. @wesm What do you think? |
|
This might actually rather be the "setitem" kernel as proposed in https://issues.apache.org/jira/browse/ARROW-9430 ? (which is something we want as well) |
|
@jorisvandenbossche: Thank you for bringing up ARROW-9430. Then @ALL please tell me what would be a good name. I may have to rework the code a bit. As mentionned in the top description, I started based on fill_null kernel code and the "null handling" logic came as an afterthought, so the code might look a bit convoluted right now... |
|
"setitem" is confusing to me (I would expect something where you pass indices and values to set at those indices). "select" is used in Numpy. Another possible is "cond". |
That's https://issues.apache.org/jira/browse/ARROW-9431. But it's true that for a "setitem" kernel, you would expect the "values to set" ( So what this PR proposes, is actually more a "where" kernel (using numpy's terminology), for which we have https://issues.apache.org/jira/browse/ARROW-10640. |
|
I'd say that personally, I find "select" more intuitively guessable than "where" (which is a weird name for a function). But I could also live with it if it's Numpy's choice. @michalursa |
|
Agreed that "where" is an unintuitive name. For me it's fine to directly use the more general name from numpy, which would actually be both "choose" or "select" (I think the single boolean mask can be seen as a special case for both generalized functions: assuming that the single boolean mask from "where" is interpreted as 0/1 integer indices) for "choose", or is interpreted as a boolean mask + its inverse mask for "select"). Another option mentioned in the JIRA is "if_else" (term used in the R world) or "case_when" (inspired on SQL). |
|
"if_else", "choose", "select" are all fine with me. |
|
Wow, I was not expecting such brainstorming on the name of this kernel. My initial intent was to mimic pandas.Series.replace(to_replace, value), where in C++ to_replace could be an BooleanArray mask (same length as input array) which trigger value replacement => So this kernel could be used after any combination of Compare and Boolean operations to implement specific replacement logic. This could be extended to another implementation where to_replace could be an IntegerArray (length <= as input array) of indexes to replace. Here is how I use "replace" in combination with "is_nan" (ARROW-11043) to implement fillna(): |
Run clang-format to fix indentation and pass CI lint check.
Fix newline characters at end of file for CI lint check.
Add null handling to "replace" kernel: * Add some more tests with nulls (start with the boolean kernel)
Add null handling to "replace" kernel: * Add null handling to the boolean kernel implementation. * Test pass but still need more tests.
Generalize null handling to "replace" kernel: * Add null handling to the Number, Timestamps, and String kernel implementations. * Add more tests regarding null handling logic (with nulls either in input values and mask). * Still need to fix a tricky bug in test of the String kernel implementation.
Fix String kernel implementation: * offsets were used inconsistently hence a bug on sliced arrays. * ArrayData::GetValues<T>(i) use "byte offset" even for buffer[2] which store variable width values... => Should it rather use buffer[1] value offset ?
All type kernels implemented and tested, polish PR: * apply clang-format to pass CI lint check.
Sorry for the distraction :-) That said, for clarity it seems that "if_else" would be both explicit and non-misleading. The signature could be Do you want to update this PR? |
d4608a9 to
356c300
Compare
|
@bu2 would you have time to update this based on the above comments? (#9024 (comment)) |
| out_values += block.length; | ||
| in_values += block.length; | ||
| } | ||
| output->buffers[1] = out_buf; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment, you could use std::move here. (applies to several places, where buffer shared_ptr's are assigned)
|
Closing as superceded by #10410. |
Purpose a "replace" compute kernel which could fulfil
ARROW-10641 - [C++] A "replace" or "map" kernel to replace values in array based on mapping (@jorisvandenbossche)ARROW-9430 - [C++/Python] Kernel for SetItem(BooleanArray, values) (@xhochy). The implementation started on the basis of "fill_null" kernel. But it takes an additional BooleanArray parameter which is used as a mask to trigger value replacement.With following "null handling" logic:
(each column indicates the validity bit and the corresponding value. "v" for current value, "r" for replacement value).