-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12712: [C++] String repeat kernel #11023
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Related to a replicate operation, there was a previous discussion in Zulip chat of having a general replicate functionality where string repeat is a particular case. Arrow already has |
|
You may instead be interested in two things I added recently: ArrayBuilder::AppendScalar(const Scalar&, int64_t) and ArrayBuilder::AppendArraySlice. This would let you implement a generalized repeat without allocating and concatenating lots of intermediate arrays, and would let you preallocate the final array as well. |
|
@lidavidm Those Also, note that the current |
|
Sure, I'm talking about more general repeat methods, though I guess now I question what you might want to repeat other than binary-like types and I suppose lists. |
|
The current str_repeat(['a', 'b', 'c'], repeats=[1,2,3]) # ['a', 'bb', 'ccc']Possible solution: Override the |
To me, this means that the kernel is simply a binary kernel. |
|
I agree that this is a binary kernel because the number of repeats is required. |
6942fa8 to
23d7516
Compare
|
This PR depends on #11082 (ARROW-13898) which adds supports for string binary compute functions. |
7e0babd to
f487172
Compare
|
@edponce Please ping when this is ready for review. Thanks! |
|
Ready for review cc @pitrou |
|
Based on the semantics of the scalar binary kernels, I am adding a kernel exec generator for binary string transforms. This includes an output adapter and array iterator. |
|
Feel free to undraft when this is ready @edponce . |
d4ed4f8 to
60c73c7
Compare
60c73c7 to
b1d4cb1
Compare
|
I have the following questions which I am not sure how to resolve:
Warning: Expression strrep(x, 3) not supported in Arrow; pulling data into R |
|
For 1: why support floating point or boolean arguments in the first place? It seems quite odd. I think those should be explicit casts if the user wants them, and they can choose safe/unsafe cast as appropriate. For 2: I don't think there's a clear argument either way. If the difference in behavior is critical in a particular application, the data could always be checked/massaged either way beforehand. Otherwise I would lean towards explicitly erroring. (You could argue that in Python, you're explicitly calling into Arrow and hence it's clear there may be a difference, while in R the user is likely using dplyr and so a difference may not be top-of-mind.) |
|
I agree with David about erroring: it seems a bit odd to me that negative values are silently assumed to be 0. In the first message, you have the following error: That seems odd, I would expect |
d5d98f9 to
6aabb4f
Compare
|
Renamed R internal function |
|
Benchmark runs are scheduled for baseline = 5897217 and contender = 0ead7c9. 0ead7c9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
This PR adds the string repeat compute function named "string_repeat". String repeat is a binary function that accepts Binary/StringType(s) and the repetition value(s). Repetition values can be:
To support inputs of different shapes for this kernel, kernel exec generators and base classes for binary string transforms are also included.