
Conversation

@lidavidm
Member

This adds a variadic scalar string join kernel that uses the last argument (at least one argument is required) as the separator. An options class allows emitting null (the default), skipping null non-separator arguments, or replacing null non-separator arguments with another string (mimicking libcudf).


@ianmcook
Member

Thanks for working on this @lidavidm!

I will add the relevant functions to the R bindings after this is merged (ARROW-11514).

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 14, 2021

Some naming nitpicks ;)
I think "var_args_join" is not super clear. Having some notion that it is for string data would be good, and the scalar list-of-strings join kernel that was just added in ARROW-10959 is called "binary_join" ("binary" because it supports all binary types, not just strings, which I think is the case here as well). So something like "binary_join_var_args"?
Another reference point is the recently added variadic element-wise min/max kernels, where "element_wise" was used instead of "var_args" (which I think is more descriptive of what the kernel does, rather than describing the implementation detail of being variadic). That would give something like "binary_join_element_wise" (a mouthful ..)

@lidavidm
Member Author

Or perhaps element_wise_binary_join, since the min kernel is named element_wise_min?

@jorisvandenbossche
Member

That would indeed be more consistent.
Personally, with searching for the function / tab-completion in mind, I think having the name start with "binary_join" or "string_join" is more useful (I could say the same about the min/max version, though).
We could also rename the just merged "binary_join" to "binary_join_list" to more clearly differentiate the different kernels.

@lidavidm
Member Author

Ah, maybe then we should rename element_wise_min to min_element_wise.

@ianmcook
Member

The use of "binary" in the names of these string join kernels is unfortunate; it's not clear at first glance whether "binary" is a reference to the arity or to the input type.

@lidavidm changed the title from "ARROW-12709: [C++] Add var_args_join" to "ARROW-12709: [C++] Add binary_join_element_wise" (Jun 14, 2021)
@ianmcook
Member

ianmcook commented Jun 14, 2021

I'm unclear on what the intended behavior is when you choose the SKIP null handling behavior and the separator is an array. Could you describe that please? Thanks

@lidavidm
Member Author

lidavidm commented Jun 14, 2021

The null handling behavior never affects the separator: it only describes how to handle the other values. It's intended to let you mimic libcudf.

binary_join_element_wise(['a', 'b', 'c', 'h'], ['d', null, 'e', 'i'], ['f', 'g', null, 'j'], [';', '-', ',', null])
EMIT_NULL:                ['a;d;f', null, null, null]
SKIP:                     ['a;d;f', 'b-g', 'c,e', null] # b and g are joined; the middle null value is ignored
REPLACE(replacement=foo): ['a;d;f', 'b-foo-g', 'c,e,foo', null]

If this makes sense, I'll try to update the docs to be clearer.
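For readers following along, the three modes above can be modeled in a few lines of pure Python. This is only an illustration of the semantics, not the actual C++ kernel; the function and parameter names here are made up:

```python
def join_element_wise(columns, separators, null_handling="emit_null",
                      null_replacement=""):
    """Pure-Python model of the kernel's semantics (None stands in for null).

    `columns` are the value columns; `separators` is the final, separator
    argument. A null separator always yields a null output row.
    """
    out = []
    for row, sep in enumerate(separators):
        values = [col[row] for col in columns]
        if sep is None:
            out.append(None)
        elif null_handling == "emit_null" and any(v is None for v in values):
            out.append(None)
        else:
            if null_handling == "skip":
                values = [v for v in values if v is not None]
            elif null_handling == "replace":
                values = [null_replacement if v is None else v for v in values]
            out.append(sep.join(values))
    return out

cols = [['a', 'b', 'c', 'h'], ['d', None, 'e', 'i'], ['f', 'g', None, 'j']]
seps = [';', '-', ',', None]
join_element_wise(cols, seps)                    # ['a;d;f', None, None, None]
join_element_wise(cols, seps, "skip")            # ['a;d;f', 'b-g', 'c,e', None]
join_element_wise(cols, seps, "replace", "foo")  # ['a;d;f', 'b-foo-g', 'c,e,foo', None]
```

Note that the null handling never applies to the separator column: a null separator always produces a null output, in all three modes.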

@ianmcook
Member

Oh, I see! I was confused about what happens when the separator is an array, but that's clear now. Thank you!

@jorisvandenbossche
Member

> The use of "binary" in the names of these string join kernels is unfortunate; it's not clear at first glance whether "binary" is a reference to the arity or to the input type.

Yeah, personally I would be fine with adding an alias that uses "string" to give it a more recognizable name ("string_join_element_wise/list"), but I'm not sure we want to start adding aliases in general ..

@pitrou left a comment

Thanks a lot. It would be nice to add a benchmark alongside the binary_join (ideally with the same data, to compare performance of the two kernels).

@lidavidm
Member Author

I added a benchmark, but I'm not too happy with the performance. It looks like there's a lot of overhead in the core kernel implementation, and not all that much time is spent actually copying strings. Accordingly, performance improves the more arrays there are to concatenate: this run is with n=10 arrays, but with n=100 the throughput goes up to ~1000M bytes/s.

The benchmark is not exactly 1:1 with BinaryJoin, since BinaryJoin can concatenate a varying number of strings per output slot while BinaryJoinElementWise joins a fixed number of strings per output slot. (That, plus the extra options this kernel offers, probably explains why there's more overhead here: BinaryJoin doesn't have to unbox scalars or arrays at every step, track nulls, etc. That said, there are definitely ways to improve the implementation here too.)

-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                104640 ns       104632 ns         6693 bytes_per_second=1112.55M/s
BinaryJoinArrayArray                 115231 ns       115119 ns         6084 bytes_per_second=1011.2M/s
BinaryJoinElementWiseArrayScalar     218594 ns       218524 ns         3206 bytes_per_second=534.461M/s
BinaryJoinElementWiseArrayArray      218135 ns       218114 ns         3209 bytes_per_second=535.466M/s

@pitrou
Member

pitrou commented Jun 15, 2021

I'm not really surprised and was actually expecting worse results :-) There are two factors:

  • the reshuffling (row-oriented vs. column-oriented) is an intrinsic problem of this operation and will stress the memory subsystem much more (while binary_join is purely sequential)
  • the implementation iterates one row at a time, and therefore does dynamic switching between the various cases (scalar, array, null options...) in each loop iteration

@pitrou
Member

pitrou commented Jun 15, 2021

An implementation iterating over each column in turn might be more performant, but at the cost of more complexity:

  • you first need to compute the output null bitmap and offsets
  • you need to keep a vector of output pointers into the output data (a bit similar to the CSV writer)
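To make the idea concrete, here is a rough pure-Python sketch of that column-at-a-time strategy, restricted to a fixed scalar separator and emit-null semantics. The real kernel would operate on Arrow buffers in C++; all names here are invented for illustration:

```python
def columnwise_join(columns, sep):
    """Sketch of the column-at-a-time strategy: first compute per-row validity
    and output offsets, then sweep one column at a time, writing each value
    into its precomputed slot (akin to the CSV writer's vector of output
    pointers). Values are bytes; None stands in for null."""
    nrows = len(columns[0])
    ncols = len(columns)
    # Pass 1: validity, per-row output lengths, and offsets into one flat buffer.
    valid = [all(col[i] is not None for col in columns) for i in range(nrows)]
    lengths = [
        sum(len(col[i]) for col in columns) + len(sep) * (ncols - 1)
        if valid[i] else 0
        for i in range(nrows)
    ]
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    out = bytearray(offsets[-1])
    # Pass 2: one sweep per column; cursor[i] tracks where row i writes next.
    cursor = offsets[:-1].copy()
    for j, col in enumerate(columns):
        for i in range(nrows):
            if not valid[i]:
                continue
            if j > 0:
                out[cursor[i]:cursor[i] + len(sep)] = sep
                cursor[i] += len(sep)
            v = col[i]
            out[cursor[i]:cursor[i] + len(v)] = v
            cursor[i] += len(v)
    # Slice the flat buffer back into per-row strings (null rows -> None).
    return [
        out[offsets[i]:offsets[i + 1]].decode() if valid[i] else None
        for i in range(nrows)
    ]

columnwise_join([[b'a', b'b'], [b'd', None], [b'f', b'g']], b';')
# -> ['a;d;f', None]
```

The first pass pins down validity and offsets; the per-column sweeps then write into disjoint, precomputed slots, so each sweep touches one input buffer sequentially.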

@lidavidm
Member Author

I was just thinking that. The current implementation already computes the output lengths (accounting for nullability) up front, so it might not be too bad.

I also noticed that the current implementation effectively branches on scalar/array twice per column in the loop; trying to optimize that improved the cases with fewer columns at the expense of those with more columns (though that may be the right tradeoff). But I'll give the column-wise approach a try.
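The scalar/array branching mentioned here can be hoisted out of the row loop by resolving each argument to an accessor once, up front. A minimal pure-Python sketch of the idea (illustrative names only, not the Arrow C++ code):

```python
def make_getter(arg):
    """Resolve the scalar-vs-array question once per column, not once per row."""
    if isinstance(arg, list):   # array argument: index per row
        return lambda i: arg[i]
    return lambda i: arg        # scalar argument: same value for every row

def join_rows(args, sep, nrows):
    """Row-oriented join with the per-argument dispatch hoisted out of the loop."""
    getters = [make_getter(a) for a in args]  # the branch happens here, once
    sep_get = make_getter(sep)
    out = []
    for i in range(nrows):
        values = [g(i) for g in getters]
        s = sep_get(i)
        out.append(None if s is None or None in values else s.join(values))
    return out

join_rows([['a', 'b'], 'x'], ';', 2)  # -> ['a;x', 'b;x']
```

Each isinstance check now happens once per column instead of once per row; the C++ analogue would be dispatching to a specialized code path per argument kind before entering the loop.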

@lidavidm
Member Author

Awkwardly, once made column-wise it's slower with fewer columns (though much faster with a lot of columns).

Before:

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                    104808 ns       104807 ns         6670 bytes_per_second=1110.69M/s
BinaryJoinArrayArray                     115151 ns       115150 ns         6081 bytes_per_second=1010.93M/s
BinaryJoinElementWiseArrayScalar/2        48782 ns        48781 ns        14300 bytes_per_second=481.829M/s
BinaryJoinElementWiseArrayScalar/8       172532 ns       172526 ns         4088 bytes_per_second=540.323M/s
BinaryJoinElementWiseArrayScalar/64      873553 ns       873533 ns          801 bytes_per_second=855.336M/s
BinaryJoinElementWiseArrayScalar/128    1240809 ns      1240796 ns          568 bytes_per_second=1.17663G/s
BinaryJoinElementWiseArrayArray/2         57097 ns        57096 ns        12238 bytes_per_second=411.659M/s
BinaryJoinElementWiseArrayArray/8        184990 ns       184989 ns         3726 bytes_per_second=503.921M/s
BinaryJoinElementWiseArrayArray/64       886392 ns       886393 ns          781 bytes_per_second=842.927M/s
BinaryJoinElementWiseArrayArray/128     1229550 ns      1229457 ns          569 bytes_per_second=1.18749G/s

After:

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                    107844 ns       107840 ns         6519 bytes_per_second=1079.45M/s
BinaryJoinArrayArray                     115818 ns       115816 ns         5997 bytes_per_second=1005.11M/s
BinaryJoinElementWiseArrayScalar/2        46680 ns        46680 ns        14774 bytes_per_second=503.519M/s
BinaryJoinElementWiseArrayScalar/8       164875 ns       164871 ns         4230 bytes_per_second=565.411M/s
BinaryJoinElementWiseArrayScalar/64      796642 ns       796620 ns          884 bytes_per_second=937.919M/s
BinaryJoinElementWiseArrayScalar/128     735257 ns       735242 ns          954 bytes_per_second=1.98569G/s
BinaryJoinElementWiseArrayArray/2         58653 ns        58651 ns        11987 bytes_per_second=400.746M/s
BinaryJoinElementWiseArrayArray/8        222740 ns       222735 ns         3147 bytes_per_second=418.523M/s
BinaryJoinElementWiseArrayArray/64       966210 ns       966211 ns          727 bytes_per_second=773.294M/s
BinaryJoinElementWiseArrayArray/128      833193 ns       833171 ns          850 bytes_per_second=1.75229G/s

This holds even for longer input arrays, though the crossover point moves down a bit.

@pitrou
Member

pitrou commented Jun 15, 2021

Cool, then let's keep the row-oriented version. People are highly unlikely to call this with more than a dozen columns, IMHO.

@lidavidm
Member Author

A minor tweak improves the <64 column case by ~10%, at least (though at quite the expense of the lots-of-columns case):

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                    105557 ns       105555 ns         6631 bytes_per_second=1102.82M/s
BinaryJoinArrayArray                     115929 ns       115928 ns         6002 bytes_per_second=1004.14M/s
BinaryJoinElementWiseArrayScalar/2        42475 ns        42474 ns        16457 bytes_per_second=553.377M/s
BinaryJoinElementWiseArrayScalar/8       146675 ns       146674 ns         4764 bytes_per_second=635.559M/s
BinaryJoinElementWiseArrayScalar/64     1002679 ns      1002670 ns          707 bytes_per_second=745.175M/s
BinaryJoinElementWiseArrayScalar/128    1889566 ns      1889539 ns          366 bytes_per_second=791.199M/s
BinaryJoinElementWiseArrayArray/2         52690 ns        52689 ns        13200 bytes_per_second=446.092M/s
BinaryJoinElementWiseArrayArray/8        161577 ns       161576 ns         4281 bytes_per_second=576.939M/s
BinaryJoinElementWiseArrayArray/64      1003126 ns      1003108 ns          687 bytes_per_second=744.85M/s
BinaryJoinElementWiseArrayArray/128     1859612 ns      1859601 ns          367 bytes_per_second=803.936M/s

@pitrou
Member

pitrou commented Jun 15, 2021

Other ideas if you want to go a bit further:

  • unroll the handling of the separator column (i.e. do it outside the loop)
  • unroll the handling of the first two non-separator columns (similarly)

SeparatorFactory make_separator) {
// Unfortunately benchmark is not 1:1 with BinaryJoin since BinaryJoin can join a
// varying number of inputs per output
const int64_t n_strings = 1000;
Member

I think this might be a bit low. For the few-columns case, it may be exposing some fixed-cost overhead.

Member Author

Ah that's fair. I've bumped it up to 65536 rows.

Old impl:

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                    104542 ns       104540 ns         6671 bytes_per_second=1113.52M/s
BinaryJoinArrayArray                     114749 ns       114750 ns         6062 bytes_per_second=1014.45M/s
BinaryJoinElementWiseArrayScalar/2      3017902 ns      3017894 ns          229 bytes_per_second=506.693M/s
BinaryJoinElementWiseArrayScalar/8     10470391 ns     10470204 ns           67 bytes_per_second=585.289M/s
BinaryJoinElementWiseArrayScalar/64    77328574 ns     77273626 ns            9 bytes_per_second=634.122M/s
BinaryJoinElementWiseArrayScalar/128  112622860 ns    112606581 ns            6 bytes_per_second=870.334M/s
BinaryJoinElementWiseArrayArray/2       3508926 ns      3508898 ns          200 bytes_per_second=435.791M/s
BinaryJoinElementWiseArrayArray/8      11349990 ns     11349827 ns           61 bytes_per_second=539.928M/s
BinaryJoinElementWiseArrayArray/64     76564083 ns     76563761 ns            9 bytes_per_second=640.001M/s
BinaryJoinElementWiseArrayArray/128   111560076 ns    111556809 ns            6 bytes_per_second=878.524M/s

Current impl:

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                    107309 ns       107307 ns         6537 bytes_per_second=1084.81M/s
BinaryJoinArrayArray                     117132 ns       117131 ns         5966 bytes_per_second=993.829M/s
BinaryJoinElementWiseArrayScalar/2      2624376 ns      2624345 ns          267 bytes_per_second=582.677M/s
BinaryJoinElementWiseArrayScalar/8      9394312 ns      9394273 ns           73 bytes_per_second=652.322M/s
BinaryJoinElementWiseArrayScalar/64    77934790 ns     77934514 ns            9 bytes_per_second=628.744M/s
BinaryJoinElementWiseArrayScalar/128  128550012 ns    128549432 ns            5 bytes_per_second=762.394M/s
BinaryJoinElementWiseArrayArray/2       3258818 ns      3258767 ns          214 bytes_per_second=469.24M/s
BinaryJoinElementWiseArrayArray/8      10201948 ns     10201752 ns           66 bytes_per_second=600.69M/s
BinaryJoinElementWiseArrayArray/64     79305591 ns     79303789 ns            8 bytes_per_second=617.888M/s
BinaryJoinElementWiseArrayArray/128   129524057 ns    129524481 ns            5 bytes_per_second=756.655M/s

Still about the same relative to each other.

@pitrou
Member

pitrou commented Jun 16, 2021

@lidavidm Do you want to try out more optimization ideas or is this ready for review again?

@lidavidm
Member Author

This should be ready - I think we can revisit the performance if it becomes a concern.

/// Nulls in inputs are skipped.
SKIP,
/// Nulls in inputs are replaced with the replacement string.
REPLACE,
Member

Nit, but I think we should avoid ALL_CAPS because of potential conflicts with third-party macros. What do you think?

Member Author

I was trying to stay consistent with the existing enums. (Also see the whole mailing-list discussion…)

If we reach a consensus there I'm happy to rename all the enums.

Member Author

There are some other things I noticed in ARROW-13025, like a top-level enum (not an enum class).

Member

Fair enough.

Member

Hmm, I think FunctionOptions classes and enums are recent enough that we may want to do a cleanup pass on them.

Member

(as a separate JIRA, of course!)

@pitrou left a comment

+1, can merge if green

@pitrou
Member

pitrou commented Jun 16, 2021

Hmm, apparently some Docker images failed to pull :-/

@pitrou closed this in 57ecc73 (Jun 16, 2021)
@ianmcook
Member

ianmcook commented Jun 17, 2021

@lidavidm Working on the R bindings for this in #10547, I noticed that when using the REPLACE null handling behavior, scalar null inputs are not replaced with the replacement string; instead it emits a null. Assuming that's not just a quirk of the R bindings, would that be straightforward to fix?

@jorisvandenbossche
Member

In the Python bindings this seems to work OK:

In [16]: pc.binary_join_element_wise(["a", "b"], pa.scalar(None, type="string"), "c", ["-", "+"], null_handling="replace", null_replacement="NULL")
Out[16]: 
<pyarrow.lib.StringArray object at 0x7f4e48ce96a0>
[
  "a-NULL-c",
  "b+NULL+c"
]

@lidavidm
Member Author

I'm probably mistaken (and forgive my clumsy R), but it appears to work for me?

> paste0(Scalar$create("a"), Scalar$create(Array$create(c("c", NA))[2]), Scalar$create("c"))
[1] "aNAc"
> paste0(Scalar$create("a"), Scalar$create(Array$create(c("c", NA))[2]), Array$create(c("c", "d")))
[1] "aNAc" "aNAd"
> paste0(Array$create(c("a", "b")), Scalar$create(Array$create(c("c", NA))[2]), Array$create(c("c", "d")))
[1] "aNAc" "bNAd"

@ianmcook
Member

Thanks Joris and David.

@lidavidm in your examples in the above comment, I believe the paste0() operation is running in R, not in Arrow. It's converting the Arrow Scalars to R character vectors and combining the strings there.

But when I call the binary_join_element_wise kernel through the call_function() interface, it works as expected:

> call_function("binary_join_element_wise", Scalar$create("a"), Scalar$create(NA_character_), Scalar$create("c"), Scalar$create(""), options = list(null_handling = NullHandlingBehavior$REPLACE, null_replacement = "NA"))
Scalar
aNAc

The problem I'm seeing only manifests when running it through dplyr, like so:

> Table$create(x = "a", z = "c") %>%
+   transmute(paste0(x, NA_character_, z)) %>%
+   collect()
# A tibble: 1 x 1
  `paste0(x, NA, z)`
  <chr>             
1 NA                

So it sounds like this is happening in the dplyr layer.

I added a few lines of code in #10547 to work around this, so no action required. Thanks!

sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021
This adds a variadic scalar string join kernel, using the last argument (min 1 argument) as the separator. An options class allows emitting null (the default), skipping null non-separator arguments, or replacing null non-separator arguments with another string (mimicking libcudf).

Closes apache#10520 from lidavidm/arrow-12709

Lead-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>