
Conversation

@lidavidm
Member

This adds a variadic scalar string join kernel that uses the last argument (at least one argument is required) as the separator. An options class allows emitting null (the default), skipping null non-separator arguments, or replacing null non-separator arguments with another string (mimicking libcudf).


@ianmcook
Member

Thanks for working on this @lidavidm!

I will add the relevant functions to the R bindings after this is merged (ARROW-11514).

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 14, 2021

Some naming nitpicks ;)
I think "var_args_join" is not super clear. Having some notion that it is for string data would be good, and the scalar list-of-strings join kernel that was just added in ARROW-10959 is called "binary_join" ("binary" because it supports all binary types, not just strings, which I think is the case here as well). So something like "binary_join_var_args"?
Another reference point is the recently added variadic element-wise min/max kernels, where "element_wise" was used instead of "var_args" (which I think is more descriptive of what the kernel does, rather than describing the implementation detail of being variadic). That would give something like "binary_join_element_wise" (a mouthful ..)

@lidavidm
Member Author

Or perhaps element_wise_binary_join, since the min kernel is named element_wise_min?

@jorisvandenbossche
Member

That would indeed be more consistent.
Personally, with searching for the function / tab-completion in mind, I think having the name start with "binary_join" or "string_join" is more useful (I could say the same about the min/max version, though).
We could also rename the just merged "binary_join" to "binary_join_list" to more clearly differentiate the different kernels.

@lidavidm
Member Author

Ah, maybe then we should rename element_wise_min to min_element_wise.

@ianmcook
Member

The use of "binary" in the names of these string join kernels is unfortunate; it's not clear at first glance whether "binary" is a reference to the arity or to the input type.

@lidavidm changed the title from "ARROW-12709: [C++] Add var_args_join" to "ARROW-12709: [C++] Add binary_join_element_wise" (Jun 14, 2021)
@ianmcook
Member

ianmcook commented Jun 14, 2021

I'm unclear on what the intended behavior is when you choose the SKIP null handling behavior and the separator is an array. Could you describe that please? Thanks

@lidavidm
Member Author

lidavidm commented Jun 14, 2021

The null handling behavior never affects the separator: it only describes how to handle the other values. It's intended to let you mimic libcudf.

binary_join_element_wise(['a', 'b', 'c', 'h'], ['d', null, 'e', 'i'], ['f', 'g', null, 'j'], [';', '-', ',', null])
EMIT_NULL:                ['a;d;f', null, null, null]
SKIP:                     ['a;d;f', 'b-g', 'c,e', null] # b and g are joined; the middle null value is ignored
REPLACE(replacement=foo): ['a;d;f', 'b-foo-g', 'c,e,foo', null]

If this makes sense, I'll try to update the docs to be clearer.
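For readers following along, the three modes above can be modeled in a few lines of pure Python. This is only an illustration of the semantics, not the actual C++ kernel; the function and parameter names here are made up:

```python
def join_element_wise(columns, separators, null_handling="emit_null",
                      null_replacement=""):
    """Pure-Python model of the kernel's semantics (None stands in for null).

    `columns` are the value columns; `separators` is the final, separator
    argument. A null separator always yields a null output row.
    """
    out = []
    for row, sep in enumerate(separators):
        values = [col[row] for col in columns]
        if sep is None:
            out.append(None)
        elif null_handling == "emit_null" and any(v is None for v in values):
            out.append(None)
        else:
            if null_handling == "skip":
                values = [v for v in values if v is not None]
            elif null_handling == "replace":
                values = [null_replacement if v is None else v for v in values]
            out.append(sep.join(values))
    return out

cols = [['a', 'b', 'c', 'h'], ['d', None, 'e', 'i'], ['f', 'g', None, 'j']]
seps = [';', '-', ',', None]
join_element_wise(cols, seps)                    # ['a;d;f', None, None, None]
join_element_wise(cols, seps, "skip")            # ['a;d;f', 'b-g', 'c,e', None]
join_element_wise(cols, seps, "replace", "foo")  # ['a;d;f', 'b-foo-g', 'c,e,foo', None]
```

Note that the null handling never applies to the separator column: a null separator always produces a null output, in all three modes.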

@ianmcook
Member

Oh, I see! I was confused about what happens when the separator is an array, but that's clear now. Thank you!

@jorisvandenbossche
Member

> The use of "binary" in the names of these string join kernels is unfortunate; it's not clear at first glance whether "binary" is a reference to the arity or to the input type.

Yeah, personally I would be fine with adding an alias that uses "string" to give it a more recognizable name ("string_join_element_wise/list"), but I'm not sure we want to start adding aliases in general ..

@pitrou left a comment

Thanks a lot. It would be nice to add a benchmark alongside the binary_join (ideally with the same data, to compare performance of the two kernels).

@lidavidm
Member Author

I added a benchmark, but I'm not too happy with the performance. It looks like there's a lot of overhead in the core kernel implementation, and not all that much time is spent actually copying strings. Accordingly, performance improves the more arrays there are to concatenate: this run is with n=10 arrays, but with n=100 the throughput goes up to ~1000M bytes/s.

The benchmark is not exactly 1:1 with BinaryJoin, since BinaryJoin can concatenate a varying number of strings per output slot while BinaryJoinElementWise joins a fixed number of strings per output slot. (That, plus the extra options this kernel offers, probably explains why there's more overhead here: BinaryJoin doesn't have to unbox scalars or arrays at every step, track nulls, etc. That said, there are definitely ways to improve the implementation here too.)

-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                104640 ns       104632 ns         6693 bytes_per_second=1112.55M/s
BinaryJoinArrayArray                 115231 ns       115119 ns         6084 bytes_per_second=1011.2M/s
BinaryJoinElementWiseArrayScalar     218594 ns       218524 ns         3206 bytes_per_second=534.461M/s
BinaryJoinElementWiseArrayArray      218135 ns       218114 ns         3209 bytes_per_second=535.466M/s

@pitrou
Member

pitrou commented Jun 15, 2021

I'm not really surprised and was actually expecting worse results :-) There are two factors:

  • the reshuffling (row-oriented vs. column-oriented) is an intrinsic problem of this operation and will stress the memory subsystem much more (while binary_join is purely sequential)
  • the implementation iterates one row at a time, and therefore does dynamic switching between the various cases (scalar, array, null options...) in each loop iteration

@pitrou
Member

pitrou commented Jun 15, 2021

An implementation iterating over each column in turn might be more performant, but at the cost of more complexity:

  • you first need to compute the output null bitmap and offsets
  • you need to keep a vector of output pointers into the output data (a bit similar to the CSV writer)
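To make the idea concrete, here is a rough pure-Python sketch of that column-at-a-time strategy, restricted to a fixed scalar separator and emit-null semantics. The real kernel would operate on Arrow buffers in C++; all names here are invented for illustration:

```python
def columnwise_join(columns, sep):
    """Sketch of the column-at-a-time strategy: first compute per-row validity
    and output offsets, then sweep one column at a time, writing each value
    into its precomputed slot (akin to the CSV writer's vector of output
    pointers). Values are bytes; None stands in for null."""
    nrows = len(columns[0])
    ncols = len(columns)
    # Pass 1: validity, per-row output lengths, and offsets into one flat buffer.
    valid = [all(col[i] is not None for col in columns) for i in range(nrows)]
    lengths = [
        sum(len(col[i]) for col in columns) + len(sep) * (ncols - 1)
        if valid[i] else 0
        for i in range(nrows)
    ]
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    out = bytearray(offsets[-1])
    # Pass 2: one sweep per column; cursor[i] tracks where row i writes next.
    cursor = offsets[:-1].copy()
    for j, col in enumerate(columns):
        for i in range(nrows):
            if not valid[i]:
                continue
            if j > 0:
                out[cursor[i]:cursor[i] + len(sep)] = sep
                cursor[i] += len(sep)
            v = col[i]
            out[cursor[i]:cursor[i] + len(v)] = v
            cursor[i] += len(v)
    # Slice the flat buffer back into per-row strings (null rows -> None).
    return [
        out[offsets[i]:offsets[i + 1]].decode() if valid[i] else None
        for i in range(nrows)
    ]

columnwise_join([[b'a', b'b'], [b'd', None], [b'f', b'g']], b';')
# -> ['a;d;f', None]
```

The first pass pins down validity and offsets; the per-column sweeps then write into disjoint, precomputed slots, so each sweep touches one input buffer sequentially.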

@lidavidm
Member Author

I was just thinking that. The current implementation already computes the output lengths (accounting for nullability) up front, so it might not be too bad.

I also noticed that the current implementation effectively branches on scalar/array twice per column in the loop; trying to optimize that improved the cases with fewer columns at the expense of those with more columns (though that may be the right tradeoff). But I'll give the column-wise approach a try.
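The scalar/array branching mentioned here can be hoisted out of the row loop by resolving each argument to an accessor once, up front. A minimal pure-Python sketch of the idea (illustrative names only, not the Arrow C++ code):

```python
def make_getter(arg):
    """Resolve the scalar-vs-array question once per column, not once per row."""
    if isinstance(arg, list):   # array argument: index per row
        return lambda i: arg[i]
    return lambda i: arg        # scalar argument: same value for every row

def join_rows(args, sep, nrows):
    """Row-oriented join with the per-argument dispatch hoisted out of the loop."""
    getters = [make_getter(a) for a in args]  # the branch happens here, once
    sep_get = make_getter(sep)
    out = []
    for i in range(nrows):
        values = [g(i) for g in getters]
        s = sep_get(i)
        out.append(None if s is None or None in values else s.join(values))
    return out

join_rows([['a', 'b'], 'x'], ';', 2)  # -> ['a;x', 'b;x']
```

Each isinstance check now happens once per column instead of once per row; the C++ analogue would be dispatching to a specialized code path per argument kind before entering the loop.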

@lidavidm
Member Author

Awkwardly, once made column-wise it's slower with fewer columns (though much faster with a lot of columns).

Before:

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                    104808 ns       104807 ns         6670 bytes_per_second=1110.69M/s
BinaryJoinArrayArray                     115151 ns       115150 ns         6081 bytes_per_second=1010.93M/s
BinaryJoinElementWiseArrayScalar/2        48782 ns        48781 ns        14300 bytes_per_second=481.829M/s
BinaryJoinElementWiseArrayScalar/8       172532 ns       172526 ns         4088 bytes_per_second=540.323M/s
BinaryJoinElementWiseArrayScalar/64      873553 ns       873533 ns          801 bytes_per_second=855.336M/s
BinaryJoinElementWiseArrayScalar/128    1240809 ns      1240796 ns          568 bytes_per_second=1.17663G/s
BinaryJoinElementWiseArrayArray/2         57097 ns        57096 ns        12238 bytes_per_second=411.659M/s
BinaryJoinElementWiseArrayArray/8        184990 ns       184989 ns         3726 bytes_per_second=503.921M/s
BinaryJoinElementWiseArrayArray/64       886392 ns       886393 ns          781 bytes_per_second=842.927M/s
BinaryJoinElementWiseArrayArray/128     1229550 ns      1229457 ns          569 bytes_per_second=1.18749G/s

After:

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                    107844 ns       107840 ns         6519 bytes_per_second=1079.45M/s
BinaryJoinArrayArray                     115818 ns       115816 ns         5997 bytes_per_second=1005.11M/s
BinaryJoinElementWiseArrayScalar/2        46680 ns        46680 ns        14774 bytes_per_second=503.519M/s
BinaryJoinElementWiseArrayScalar/8       164875 ns       164871 ns         4230 bytes_per_second=565.411M/s
BinaryJoinElementWiseArrayScalar/64      796642 ns       796620 ns          884 bytes_per_second=937.919M/s
BinaryJoinElementWiseArrayScalar/128     735257 ns       735242 ns          954 bytes_per_second=1.98569G/s
BinaryJoinElementWiseArrayArray/2         58653 ns        58651 ns        11987 bytes_per_second=400.746M/s
BinaryJoinElementWiseArrayArray/8        222740 ns       222735 ns         3147 bytes_per_second=418.523M/s
BinaryJoinElementWiseArrayArray/64       966210 ns       966211 ns          727 bytes_per_second=773.294M/s
BinaryJoinElementWiseArrayArray/128      833193 ns       833171 ns          850 bytes_per_second=1.75229G/s

This holds even for longer input arrays, though the crossover point moves down a bit.

@pitrou
Member

pitrou commented Jun 15, 2021

Cool, then let's keep the row-oriented version. People are highly unlikely to call this with more than a dozen columns, IMHO.

@lidavidm
Member Author

A minor tweak improves the <64 column case by ~10%, at least (though at quite the expense of the lots-of-columns case):

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                    105557 ns       105555 ns         6631 bytes_per_second=1102.82M/s
BinaryJoinArrayArray                     115929 ns       115928 ns         6002 bytes_per_second=1004.14M/s
BinaryJoinElementWiseArrayScalar/2        42475 ns        42474 ns        16457 bytes_per_second=553.377M/s
BinaryJoinElementWiseArrayScalar/8       146675 ns       146674 ns         4764 bytes_per_second=635.559M/s
BinaryJoinElementWiseArrayScalar/64     1002679 ns      1002670 ns          707 bytes_per_second=745.175M/s
BinaryJoinElementWiseArrayScalar/128    1889566 ns      1889539 ns          366 bytes_per_second=791.199M/s
BinaryJoinElementWiseArrayArray/2         52690 ns        52689 ns        13200 bytes_per_second=446.092M/s
BinaryJoinElementWiseArrayArray/8        161577 ns       161576 ns         4281 bytes_per_second=576.939M/s
BinaryJoinElementWiseArrayArray/64      1003126 ns      1003108 ns          687 bytes_per_second=744.85M/s
BinaryJoinElementWiseArrayArray/128     1859612 ns      1859601 ns          367 bytes_per_second=803.936M/s

@pitrou
Member

pitrou commented Jun 15, 2021

Other ideas if you want to go a bit further:

  • unroll the handling of the separator column (i.e. do it outside the loop)
  • unroll the handling of the first two non-separator columns (similarly)

SeparatorFactory make_separator) {
// Unfortunately benchmark is not 1:1 with BinaryJoin since BinaryJoin can join a
// varying number of inputs per output
const int64_t n_strings = 1000;
Member

I think this might be a bit low. For the few-columns case, it may be exposing some fixed-cost overhead.

Member Author

Ah that's fair. I've bumped it up to 65536 rows.

Old impl:

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                    104542 ns       104540 ns         6671 bytes_per_second=1113.52M/s
BinaryJoinArrayArray                     114749 ns       114750 ns         6062 bytes_per_second=1014.45M/s
BinaryJoinElementWiseArrayScalar/2      3017902 ns      3017894 ns          229 bytes_per_second=506.693M/s
BinaryJoinElementWiseArrayScalar/8     10470391 ns     10470204 ns           67 bytes_per_second=585.289M/s
BinaryJoinElementWiseArrayScalar/64    77328574 ns     77273626 ns            9 bytes_per_second=634.122M/s
BinaryJoinElementWiseArrayScalar/128  112622860 ns    112606581 ns            6 bytes_per_second=870.334M/s
BinaryJoinElementWiseArrayArray/2       3508926 ns      3508898 ns          200 bytes_per_second=435.791M/s
BinaryJoinElementWiseArrayArray/8      11349990 ns     11349827 ns           61 bytes_per_second=539.928M/s
BinaryJoinElementWiseArrayArray/64     76564083 ns     76563761 ns            9 bytes_per_second=640.001M/s
BinaryJoinElementWiseArrayArray/128   111560076 ns    111556809 ns            6 bytes_per_second=878.524M/s

Current impl:

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
BinaryJoinArrayScalar                    107309 ns       107307 ns         6537 bytes_per_second=1084.81M/s
BinaryJoinArrayArray                     117132 ns       117131 ns         5966 bytes_per_second=993.829M/s
BinaryJoinElementWiseArrayScalar/2      2624376 ns      2624345 ns          267 bytes_per_second=582.677M/s
BinaryJoinElementWiseArrayScalar/8      9394312 ns      9394273 ns           73 bytes_per_second=652.322M/s
BinaryJoinElementWiseArrayScalar/64    77934790 ns     77934514 ns            9 bytes_per_second=628.744M/s
BinaryJoinElementWiseArrayScalar/128  128550012 ns    128549432 ns            5 bytes_per_second=762.394M/s
BinaryJoinElementWiseArrayArray/2       3258818 ns      3258767 ns          214 bytes_per_second=469.24M/s
BinaryJoinElementWiseArrayArray/8      10201948 ns     10201752 ns           66 bytes_per_second=600.69M/s
BinaryJoinElementWiseArrayArray/64     79305591 ns     79303789 ns            8 bytes_per_second=617.888M/s
BinaryJoinElementWiseArrayArray/128   129524057 ns    129524481 ns            5 bytes_per_second=756.655M/s

Still about the same relative to each other.

@pitrou
Member

pitrou commented Jun 16, 2021

@lidavidm Do you want to try out more optimization ideas or is this ready for review again?

@lidavidm
Member Author

This should be ready - I think we can revisit the performance if it becomes a concern.

/// Nulls in inputs are skipped.
SKIP,
/// Nulls in inputs are replaced with the replacement string.
REPLACE,
Member

Nit, but I think we should avoid ALL_CAPS because of potential conflicts with third-party macros. What do you think?

Member Author

I was trying to stay consistent with the existing enums. (Also see the whole mailing-list discussion…)

If we reach a consensus there I'm happy to rename all the enums.

Member Author

There are some other things I noticed in ARROW-13025, like a top-level enum (not an enum class).

Member

Fair enough.

Member

Hmm, I think FunctionOptions classes and enums are recent enough that we may want to do a cleanup pass on them.

Member

(as a separate JIRA, of course!)

@pitrou left a comment

+1, can merge if green

@pitrou
Member

pitrou commented Jun 16, 2021

Hmm, apparently some Docker images failed to pull :-/

@pitrou closed this in 57ecc73 (Jun 16, 2021)
@ianmcook
Member

ianmcook commented Jun 17, 2021

@lidavidm Working on the R bindings for this in #10547, I noticed that when using the REPLACE null handling behavior, scalar null inputs are not replaced with the replacement string; instead it emits a null. Assuming that's not just a quirk of the R bindings, would that be straightforward to fix?

@jorisvandenbossche
Member

In the Python bindings this seems to work OK:

In [16]: pc.binary_join_element_wise(["a", "b"], pa.scalar(None, type="string"), "c", ["-", "+"], null_handling="replace", null_replacement="NULL")
Out[16]: 
<pyarrow.lib.StringArray object at 0x7f4e48ce96a0>
[
  "a-NULL-c",
  "b+NULL+c"
]

@lidavidm
Member Author

I'm probably mistaken (and forgive my clumsy R), but it appears to work for me?

> paste0(Scalar$create("a"), Scalar$create(Array$create(c("c", NA))[2]), Scalar$create("c"))
[1] "aNAc"
> paste0(Scalar$create("a"), Scalar$create(Array$create(c("c", NA))[2]), Array$create(c("c", "d")))
[1] "aNAc" "aNAd"
> paste0(Array$create(c("a", "b")), Scalar$create(Array$create(c("c", NA))[2]), Array$create(c("c", "d")))
[1] "aNAc" "bNAd"

@ianmcook
Member

Thanks Joris and David.

@lidavidm in your examples in the above comment, I believe the paste0() operation is running in R, not in Arrow. It's converting the Arrow Scalars to R character vectors and combining the strings there.

But when I call the binary_join_element_wise kernel through the call_function() interface, it works as expected:

> call_function("binary_join_element_wise", Scalar$create("a"), Scalar$create(NA_character_), Scalar$create("c"), Scalar$create(""), options = list(null_handling = NullHandlingBehavior$REPLACE, null_replacement = "NA"))
Scalar
aNAc

The problem I'm seeing only manifests when running it through dplyr, like so:

> Table$create(x = "a", z = "c") %>%
+   transmute(paste0(x, NA_character_, z)) %>%
+   collect()
# A tibble: 1 x 1
  `paste0(x, NA, z)`
  <chr>             
1 NA                

So it sounds like this is happening in the dplyr layer.

I added a few lines of code in #10547 to work around this, so no action required. Thanks!

sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Jun 23, 2021
This adds a variadic scalar string join kernel, using the last argument (min 1 argument) as the separator. An options class allows emitting null (the default), skipping null non-separator arguments, or replacing null non-separator arguments with another string (mimicking libcudf).

Closes apache#10520 from lidavidm/arrow-12709

Lead-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>