Skip to content

Conversation

@edponce
Copy link
Contributor

@edponce edponce commented Aug 27, 2021

This PR adds the string repeat compute function named "string_repeat". String repeat is a binary function that accepts Binary/StringType(s) and the repetition value(s). Repetition values can be:

  • a single value applied to all strings
  • an array of values where each repeat count corresponds to the string in the same position

To support inputs of different shapes for this kernel, kernel exec generators and base classes for binary string transforms are also included.

@github-actions
Copy link

@edponce
Copy link
Contributor Author

edponce commented Aug 28, 2021

Related to a replicate operation, there was a previous discussion in Zulip chat of having a general replicate functionality where string repeat is a particular case.

Arrow already has MakeArrayFromScalar and RepeatedArrayFactory which use concatenate implementation internally. Can this be used in this PR? These are specifically for Array types and in kernel transform method uses raw pointers.

@lidavidm
Copy link
Member

You may instead be interested in two things I added recently: ArrayBuilder::AppendScalar(const Scalar&, int64_t) and ArrayBuilder::AppendArraySlice. This would let you implement a generalized repeat without allocating and concatenating lots of intermediate arrays, and would let you preallocate the final array as well.

@edponce
Copy link
Contributor Author

edponce commented Aug 28, 2021

@lidavidm Those ArrayBuilder methods do work to perform this operation but will require not following the common approach used for string kernels based on the already provided StringTransformXXX infrastructure. Specifically, it would require overriding ExecArray() (while duplicating most of it). For how things currently are, I think using the ArrayBuilder/MakeScalar methods for StrRepeat is not preferable.

Also, note that the current StrRepeat implementation only allocates once the entire array for all repeated strings via ExecArray(). StrRepeat overrides MaxCodeunits() to return input_ncodeunits * n_repeats.

@lidavidm
Copy link
Member

Sure, I'm talking about more general repeat methods, though I guess now I question what you might want to repeat other than binary-like types and I suppose lists.

@edponce
Copy link
Contributor Author

edponce commented Aug 30, 2021

The current StringTransformXXX classes do not easily support non-scalar options. In this PR, we want to be able to do the following:

str_repeat(['a', 'b', 'c'], repeats=[1,2,3])  # ['a', 'bb', 'ccc']

Possible solution: Override the ExecArray of StringTransformExecBase and specialize for kernels that require the current index of the input string. This is done by passing the string index to the transform->Transform(..., i) call. We need to keep in mind that these indexes are relative to the current ExecBatch so we need to offset accordingly.

cc @pitrou @lidavidm

@pitrou
Copy link
Member

pitrou commented Aug 30, 2021

The current StringTransformXXX classes do not easily support non-scalar options. In this PR, we want to be able to do the following:

str_repeat(['a', 'b', 'c'], repeats=[1,2,3])  # ['a', 'bb', 'ccc']

To me, this means that the kernel is simply a binary kernel.

@edponce
Copy link
Contributor Author

edponce commented Aug 30, 2021

I agree that this is a binary kernel because the number of repeats is required.

@edponce edponce force-pushed the ARROW-12712-String-repeat-kernel branch 2 times, most recently from 6942fa8 to 23d7516 Compare September 4, 2021 04:21
@edponce
Copy link
Contributor Author

edponce commented Sep 4, 2021

This PR depends on #11082 (ARROW-13898) which adds supports for string binary compute functions.

@edponce edponce force-pushed the ARROW-12712-String-repeat-kernel branch 3 times, most recently from 7e0babd to f487172 Compare September 9, 2021 04:48
@pitrou
Copy link
Member

pitrou commented Sep 9, 2021

@edponce Please ping when this is ready for review. Thanks!

@edponce
Copy link
Contributor Author

edponce commented Sep 9, 2021

Ready for review cc @pitrou

@edponce edponce marked this pull request as draft September 13, 2021 01:43
@edponce
Copy link
Contributor Author

edponce commented Sep 14, 2021

Based on the semantics of the scalar binary kernels, I am adding a kernel exec generator for binary string transforms. This includes an output adapter and array iterator.

@pitrou
Copy link
Member

pitrou commented Sep 16, 2021

Feel free to undraft when this is ready @edponce .

@edponce edponce force-pushed the ARROW-12712-String-repeat-kernel branch from d4ed4f8 to 60c73c7 Compare September 20, 2021 09:24
@edponce edponce force-pushed the ARROW-12712-String-repeat-kernel branch from 60c73c7 to b1d4cb1 Compare October 19, 2021 01:18
@edponce edponce marked this pull request as ready for review October 19, 2021 09:05
@edponce edponce requested a review from pitrou October 19, 2021 09:12
@edponce
Copy link
Contributor Author

edponce commented Oct 19, 2021

I have the following questions which I am not sure how to resolve:

  1. I tried allowing integers, floating point, and boolean to the num_repeats argument. These are casted to Int64Type via DispatchBest but the floating point casting to integers triggers truncation error. How can this be achieved?
  2. Should an error be return if num_repeats argument is non-negative? Currently, a negative value is treated as a zero-value to match Python behavior, but base R strrep triggers error.
  3. Added R binding named as strrep but the base R version is used instead cc @jonkeane
Warning: Expression strrep(x, 3) not supported in Arrow; pulling data into R

cc @lidavidm @bkietz

@lidavidm
Copy link
Member

For 1: why support floating point or boolean arguments in the first place? It seems quite odd. I think those should be explicit casts if the user wants them, and they can choose safe/unsafe cast as appropriate.

For 2: I don't think there's a clear argument either way. If the difference in behavior is critical in a particular application, the data could always be checked/massaged either way beforehand. Otherwise I would lean towards explicitly erroring. (You could argue that in Python, you're explicitly calling into Arrow and hence it's clear there may be a difference, while in R the user is likely using dplyr and so a difference may not be top-of-mind.)

@jonkeane
Copy link
Member

I agree with David about erroring: it seems a bit odd to me that negative values are silently assumed to be 0.

In the first message, you have the following error:

Warning: Expression strrep(x, 3) not supported in Arrow; pulling data into R

That seems odd, I would expect 3 to be fine, is there a missing - somewhere?

@edponce edponce force-pushed the ARROW-12712-String-repeat-kernel branch from d5d98f9 to 6aabb4f Compare November 3, 2021 15:55
@edponce
Copy link
Contributor Author

edponce commented Nov 3, 2021

Renamed R internal function str_dup to duplicate_string because it was shadowing stringr's str_dup and kernel binding for binary_repeat.
Thanks to @thisisnic for identifying this subtle issue!

@lidavidm lidavidm closed this in 0ead7c9 Nov 4, 2021
@ursabot
Copy link

ursabot commented Nov 4, 2021

Benchmark runs are scheduled for baseline = 5897217 and contender = 0ead7c9. 0ead7c9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.54% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.25% ⬆️0.89%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants