ARROW-12709: [C++] Add binary_join_element_wise #10520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for working on this @lidavidm! I will add the relevant functions to the R bindings after this is merged (ARROW-11514).
Some naming nitpicks ;)
Or perhaps
That would indeed be more consistent.
Ah, maybe then we should rename
The use of "binary" in the names of these string join kernels is unfortunate; it's not clear at first glance whether "binary" is a reference to the arity or to the input type.
I'm unclear on what the intended behavior is when you choose the
The null handling behavior never affects the separator: it only describes how to handle the other values. It's intended to let you mimic libcudf. If this makes sense, I'll try to update the docs to be clearer.
Oh, I see; I was confused about what happens when the separator is an array; I see now. Thank you!
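For concreteness, here is a minimal C++ sketch of the semantics described above, calling the kernel through the generic CallFunction API. The options class name (JoinOptions) and the SKIP enumerator are assumptions based on this thread and the diff excerpt further down; check the merged headers for the exact spelling.

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <iostream>
#include <memory>
#include <vector>

namespace cp = arrow::compute;

arrow::Status RunExample() {
  // Two value columns (both ["a", "b", null]) and a separator column ["-", null, "-"].
  arrow::StringBuilder value_builder, sep_builder;
  ARROW_RETURN_NOT_OK(value_builder.Append("a"));
  ARROW_RETURN_NOT_OK(value_builder.Append("b"));
  ARROW_RETURN_NOT_OK(value_builder.AppendNull());
  ARROW_RETURN_NOT_OK(sep_builder.Append("-"));
  ARROW_RETURN_NOT_OK(sep_builder.AppendNull());
  ARROW_RETURN_NOT_OK(sep_builder.Append("-"));
  std::shared_ptr<arrow::Array> values, separators;
  ARROW_RETURN_NOT_OK(value_builder.Finish(&values));
  ARROW_RETURN_NOT_OK(sep_builder.Finish(&separators));

  // SKIP only governs the value arguments; per the explanation above, the null
  // separator slot (row 1) should still yield a null output.
  cp::JoinOptions options(cp::JoinOptions::SKIP);
  std::vector<arrow::Datum> args = {values, values, separators};
  ARROW_ASSIGN_OR_RAISE(arrow::Datum joined,
                        cp::CallFunction("binary_join_element_wise", args, &options));
  std::cout << joined.make_array()->ToString() << std::endl;
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = RunExample();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
```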
Yeah, personally I would be fine with having an alias that uses "string" to give a more recognizable name ("string_join_element_wise/list"), but I'm not sure we want to start adding aliases in general...
pitrou left a comment
Thanks a lot. It would be nice to add a benchmark alongside the binary_join (ideally with the same data, to compare performance of the two kernels).
I added a benchmark but am not too happy with the performance. It looks like there's a lot of overhead in the core kernel implementation and not all that much time is spent actually copying strings. Accordingly, performance increases the more arrays there are to concatenate: this is with n=10 arrays to concatenate, but with n=100 the bytes per second goes up to ~1000M/s. The benchmark is not exactly the same as BinaryJoin, since BinaryJoin can concatenate a varying number of strings per output slot while BinaryJoinElementWise joins a fixed number of strings per output slot. (That, plus the extra options offered by this kernel, probably explains why there's more overhead here - BinaryJoin doesn't have to deal with unboxing scalars or arrays at every step, tracking nulls, etc. - but there are definitely ways to improve the implementation here too.)
I'm not really surprised and was actually expecting worse results :-) There are two factors:
An implementation iterating over each column in turn might be more performant, but at the cost of more complexity:
I was just thinking that. The current implementation already computes the output lengths (accounting for nullability) up front, so it might not be too bad. I also noticed that the current implementation effectively branches on scalar/array twice per column in the loop; trying to optimize that improved the cases with fewer columns at the expense of those with more columns (though that may be the right tradeoff). But I'll give the columnwise approach a try.
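For illustration only, here is a self-contained toy sketch (not the actual kernel code) contrasting the two loop orders under discussion, with scalar/array dispatch, null handling, and offset buffers stripped out:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Each column is a plain vector of strings; nulls, scalars, and validity
// buffers are omitted so only the loop order differs between the two versions.
using Column = std::vector<std::string>;

// Row-oriented: finish each output row before moving on to the next. One pass
// over the output, but every inner step touches a different column.
std::vector<std::string> JoinRowWise(const std::vector<Column>& columns,
                                     const std::string& separator) {
  std::vector<std::string> out(columns.empty() ? 0 : columns[0].size());
  for (size_t row = 0; row < out.size(); ++row) {
    for (size_t col = 0; col < columns.size(); ++col) {
      if (col > 0) out[row] += separator;
      out[row] += columns[col][row];
    }
  }
  return out;
}

// Column-oriented: sweep each column in turn. The inner loop stays within one
// column, but every output row is revisited once per column, which costs more
// when there are only a few columns.
std::vector<std::string> JoinColumnWise(const std::vector<Column>& columns,
                                        const std::string& separator) {
  std::vector<std::string> out(columns.empty() ? 0 : columns[0].size());
  for (size_t col = 0; col < columns.size(); ++col) {
    for (size_t row = 0; row < out.size(); ++row) {
      if (col > 0) out[row] += separator;
      out[row] += columns[col][row];
    }
  }
  return out;
}

int main() {
  std::vector<Column> columns = {{"foo", "bar"}, {"x", "y"}, {"1", "2"}};
  for (const auto& s : JoinRowWise(columns, "-")) std::cout << s << "\n";
  for (const auto& s : JoinColumnWise(columns, "-")) std::cout << s << "\n";
}
```

The real kernel additionally precomputes output lengths and handles scalar arguments and nulls, which is where the extra per-step branching mentioned above comes from.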
Awkwardly, once made columnwise it's slower with fewer columns (though much faster with a lot of columns). Before: After: This holds even for longer input arrays, though the crossover point moves down a bit.
Cool, then let's keep the row-oriented version. People are highly unlikely to call this with more than a dozen columns, IMHO.
A minor tweak improves the <64-column case by ~10%, at least (though at quite a cost to the many-columns case):
Other ideas if you want to go a bit further:
SeparatorFactory make_separator) {
  // Unfortunately benchmark is not 1:1 with BinaryJoin since BinaryJoin can join a
  // varying number of inputs per output
  const int64_t n_strings = 1000;
I think this might be a bit low. For the few columns case, this may be exposing some fixed cost overhead.
Ah that's fair. I've bumped it up to 65536 rows.
Old impl:
-----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar 104542 ns 104540 ns 6671 bytes_per_second=1113.52M/s
BinaryJoinArrayArray 114749 ns 114750 ns 6062 bytes_per_second=1014.45M/s
BinaryJoinElementWiseArrayScalar/2 3017902 ns 3017894 ns 229 bytes_per_second=506.693M/s
BinaryJoinElementWiseArrayScalar/8 10470391 ns 10470204 ns 67 bytes_per_second=585.289M/s
BinaryJoinElementWiseArrayScalar/64 77328574 ns 77273626 ns 9 bytes_per_second=634.122M/s
BinaryJoinElementWiseArrayScalar/128 112622860 ns 112606581 ns 6 bytes_per_second=870.334M/s
BinaryJoinElementWiseArrayArray/2 3508926 ns 3508898 ns 200 bytes_per_second=435.791M/s
BinaryJoinElementWiseArrayArray/8 11349990 ns 11349827 ns 61 bytes_per_second=539.928M/s
BinaryJoinElementWiseArrayArray/64 76564083 ns 76563761 ns 9 bytes_per_second=640.001M/s
BinaryJoinElementWiseArrayArray/128 111560076 ns 111556809 ns 6 bytes_per_second=878.524M/s
Current impl:
-----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar 107309 ns 107307 ns 6537 bytes_per_second=1084.81M/s
BinaryJoinArrayArray 117132 ns 117131 ns 5966 bytes_per_second=993.829M/s
BinaryJoinElementWiseArrayScalar/2 2624376 ns 2624345 ns 267 bytes_per_second=582.677M/s
BinaryJoinElementWiseArrayScalar/8 9394312 ns 9394273 ns 73 bytes_per_second=652.322M/s
BinaryJoinElementWiseArrayScalar/64 77934790 ns 77934514 ns 9 bytes_per_second=628.744M/s
BinaryJoinElementWiseArrayScalar/128 128550012 ns 128549432 ns 5 bytes_per_second=762.394M/s
BinaryJoinElementWiseArrayArray/2 3258818 ns 3258767 ns 214 bytes_per_second=469.24M/s
BinaryJoinElementWiseArrayArray/8 10201948 ns 10201752 ns 66 bytes_per_second=600.69M/s
BinaryJoinElementWiseArrayArray/64 79305591 ns 79303789 ns 8 bytes_per_second=617.888M/s
BinaryJoinElementWiseArrayArray/128 129524057 ns 129524481 ns 5 bytes_per_second=756.655M/s
Still about the same relative to each other.
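For readers who want to reproduce numbers in this shape outside the Arrow tree, a minimal, hypothetical Google Benchmark harness might look roughly like this; the helper (MakeStringArray), the string width, and the options class name are assumptions, not the actual arrow/compute benchmark code:

```cpp
#include <cstdlib>
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <benchmark/benchmark.h>

namespace cp = arrow::compute;

// Build an array of `length` identical `width`-byte strings. (The real Arrow
// benchmark uses random data; constant data keeps this sketch short.)
static std::shared_ptr<arrow::Array> MakeStringArray(int64_t length, int width) {
  arrow::StringBuilder builder;
  for (int64_t i = 0; i < length; ++i) {
    if (!builder.Append(std::string(width, 'x')).ok()) std::abort();
  }
  std::shared_ptr<arrow::Array> out;
  if (!builder.Finish(&out).ok()) std::abort();
  return out;
}

static void BinaryJoinElementWiseArrayScalar(benchmark::State& state) {
  const int64_t n_rows = 65536;  // the row count mentioned above
  const int width = 16;
  const int64_t n_columns = state.range(0);

  // n_columns value arrays, plus a scalar separator as the last argument.
  std::vector<arrow::Datum> args(n_columns, MakeStringArray(n_rows, width));
  std::shared_ptr<arrow::Scalar> separator =
      std::make_shared<arrow::StringScalar>("-");
  args.emplace_back(separator);

  cp::JoinOptions options;  // assumed options class name
  for (auto _ : state) {
    arrow::Datum result =
        cp::CallFunction("binary_join_element_wise", args, &options).ValueOrDie();
    benchmark::DoNotOptimize(result);
  }
  // Count only value bytes so bytes_per_second tracks string copying.
  state.SetBytesProcessed(state.iterations() * n_columns * n_rows * width);
}
BENCHMARK(BinaryJoinElementWiseArrayScalar)->Arg(2)->Arg(8)->Arg(64)->Arg(128);

BENCHMARK_MAIN();
```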
@lidavidm Do you want to try out more optimization ideas or is this ready for review again?
This should be ready - I think we can revisit the performance if it becomes a concern.
  /// Nulls in inputs are skipped.
  SKIP,
  /// Nulls in inputs are replaced with the replacement string.
  REPLACE,
Nit, but I think we should avoid ALL_CAPS because of potential conflicts with third-party macros. What do you think?
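As a concrete (hypothetical) illustration of the macro concern, with enumerator names borrowed from the diff above; the enum name here is a guess, not the actual Arrow declaration:

```cpp
// Sketch only (not the actual Arrow declaration). The macro hazard: if some
// previously included third-party header defines, say,
//   #define REPLACE 1
// the preprocessor rewrites the enumerator before the compiler sees it, and
// this declaration no longer compiles. CamelCase names (EmitNull, Skip,
// Replace) sidestep that class of collision.
enum NullHandlingBehavior {
  EMIT_NULL,  // emit a null output if any input value is null (the default)
  SKIP,       // skip null input values
  REPLACE,    // substitute null input values with a replacement string
};
```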
I was trying to stay consistent with the existing enums. (Also see the whole mailing-list discussion…)
If we reach a consensus there I'm happy to rename all the enums.
There are some other things I noticed in ARROW-13025, like a toplevel enum (not an enum class).
Fair enough.
Hmm, I think FunctionOptions classes and enums are recent enough that we may want to do a cleanup pass on them.
(as a separate JIRA, of course!)
pitrou left a comment
+1, can merge if green
Hmm, apparently some Docker images failed to pull :-/
In the Python bindings this seems to work OK:
I'm probably mistaken (and forgive my clumsy R), but it appears to work for me?
Thanks Joris and David. @lidavidm in your examples in the above comment, I believe the
But when I call the
The problem I'm seeing only manifests when running it through dplyr, like so:
So it sounds like this is happening in the dplyr layer. I added a few lines of code in #10547 to work around this, so no action required. Thanks!
This adds a variadic scalar string join kernel, using the last argument (min 1 argument) as the separator. An options class allows emitting null (the default), skipping null non-separator arguments, or replacing null non-separator arguments with another string (mimicking libcudf).

Closes apache#10520 from lidavidm/arrow-12709

Lead-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
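As a hedged usage sketch to round out this summary, the REPLACE behavior might be exercised like so; the JoinOptions name, its (null_handling, null_replacement) constructor, and the expected output shown in the comment are assumptions based on the description above rather than the merged code:

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

#include <iostream>
#include <memory>
#include <vector>

namespace cp = arrow::compute;

arrow::Status RunExample() {
  // Columns ["x", null] and ["y", "z"], joined with a scalar "-" separator.
  arrow::StringBuilder b1, b2;
  ARROW_RETURN_NOT_OK(b1.Append("x"));
  ARROW_RETURN_NOT_OK(b1.AppendNull());
  ARROW_RETURN_NOT_OK(b2.Append("y"));
  ARROW_RETURN_NOT_OK(b2.Append("z"));
  std::shared_ptr<arrow::Array> col1, col2;
  ARROW_RETURN_NOT_OK(b1.Finish(&col1));
  ARROW_RETURN_NOT_OK(b2.Finish(&col2));
  std::shared_ptr<arrow::Scalar> separator =
      std::make_shared<arrow::StringScalar>("-");

  // REPLACE substitutes "?" for null *values* only; a null separator would
  // still null out the whole row.
  cp::JoinOptions options(cp::JoinOptions::REPLACE, "?");
  std::vector<arrow::Datum> args = {col1, col2, separator};
  ARROW_ASSIGN_OR_RAISE(arrow::Datum joined,
                        cp::CallFunction("binary_join_element_wise", args, &options));
  std::cout << joined.make_array()->ToString() << std::endl;  // expected: ["x-y", "?-z"]
  return arrow::Status::OK();
}

int main() { return RunExample().ok() ? 0 : 1; }
```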