ARROW-9991: [C++] Split kernels for strings/binary #8271

maartenbreddels · 2020-09-25T11:59:27Z

Contains:

split_pattern kernel with max_split and reverse option
ascii_split_whitespace similar to Python's bytes.split
utf8_split_whitespace similar to Python's str.split

It should be easy to add new split methods, e.g. a regex one in the future.

github-actions · 2020-09-25T12:12:43Z

https://issues.apache.org/jira/browse/ARROW-9991

jorisvandenbossche

Thanks for this! A few high-level comments (didn't (yet) look at C++ implementation, just the docs/tests)

jorisvandenbossche · 2020-09-30T12:48:15Z

python/pyarrow/compute.py

This might be premature optimization, but in theory the pattern could also be a second argument ([array, pattern]), keeping the option open to later implement [array, array_of_patterns] where both could be an array of the same length? (in case you want a different split pattern per value)
That would also make SplitOptions could be shared between both split_pattern and split_whitespace

(ah, I see you commented about that above yourself)

I can take a look, I think it makes sense to add this indeed.

cpp/src/arrow/compute/api_scalar.h

jorisvandenbossche · 2020-09-30T12:55:58Z

cpp/src/arrow/compute/kernels/scalar_string.cc

Should the kernel name be string_split_pattern instead of split_pattern? (like we have string_is_ascii and not is_ascii and binary_length and not length)

I don't know, because it can in principle also take binary arrays.

Then binary_split_pattern seems "most correct"? Although I personally don't really like the fact that this naming scheme "hides" useful string functionality behind the binary_ prefix (which is the same with binary_length, though)

jorisvandenbossche · 2020-09-30T13:09:45Z

This also raised a question: Should both the list entry and the string array entry have a missing/null value if the input string contains a null value? I think it should because if we ask for the underlying string array, and the string value that the missing list entry points to is not a missing value, it will be an empty string, which seems odd to me.

Buf if the original string was a null, then the output also contains a (top-level) null, and not a list with a null?

And then such top-level nulls are typically not put as an entry in the actual values array (or at least when arrow is building up a list array itself, the format might also not allow otherwise). Eg:

In [10]: arr = pa.array([["a", "b"], None, ["c"], [None]])

In [11]: arr
Out[11]: 
<pyarrow.lib.ListArray object at 0x7f2316c1d108>
[
  [
    "a",
    "b"
  ],
  null,
  [
    "c"
  ],
  [
    null
  ]
]

In [12]: arr.values
Out[12]: 
<pyarrow.lib.StringArray object at 0x7f2316c1d5e8>
[
  "a",
  "b",
  "c",
  null
]

(so only the null that was inside a list is represented as null in the underlying values array)

pitrou

I didn't look at everything yet since it seems the PR is unfinished.

Among the comments below, please especially notice the ones mentioning how to use buffer builders and VisitArrayDataInline. This will make your life easier, and further maintenance too.

pitrou · 2020-09-30T14:17:43Z

cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc

Is there really a point in benchmarking random data? Presumably 0 split will happen...
You also don't need to worry about benchmarks in this PR, they may be added later.

I've changed the pattern to "a", which is more meaningful for a true benchmark. But I use these as a sanity check if there aren't and major performance issues after refactoring, so I'd like to keep this in.

cpp/src/arrow/compute/kernels/scalar_string_test.cc

cpp/src/arrow/util/utf8.h

python/pyarrow/_compute.pyx

cpp/src/arrow/compute/kernels/scalar_string.cc

pitrou · 2020-09-30T14:45:04Z

cpp/src/arrow/compute/kernels/scalar_string.cc

More than double. Your example shows it growing from 2 to 5, but if the pattern is length 1 (probably a common case?) you're really allocating one output string per input byte. This is probably too much.

Instead, you may use:

a presized BufferBuilder for the string bytes

a dynamically-sized TypedBufferBuilder<offset_type> for the string offsets

a presized TypedBufferBuilder<int32_t> for the list offsets

And the output's null bitmap is the same as the input's null bitmap, so no need to recompute it!

Thanks for those tips, this improved the code a lot. If the offsets were preallocated is there a helper class to wrap that? (say after https://issues.apache.org/jira/browse/ARROW-10207 is fixed).

pitrou · 2020-10-07T09:53:18Z

@maartenbreddels Please ping me when you need a new review.

maartenbreddels · 2020-10-07T12:03:09Z

@pitrou @jorisvandenbossche I think this is ready for review, provided the points below are not an issue.

Open questions

kernel name: string_split_pattern vs split_pattern. I can imagine that we also want to support binary_split_pattern, so should we have to separate kernel names for this, or keep it split_pattern and implement support for binary in the future?
API docs: I'm not sure the compute.rst docs I added should follow some standard, or if that table can be produce using some tool, I've done it by hand for now.
Should split_pattern be a binary kernel, where pattern is the second argument? As @jorisvandenbossche suggested we could start by only implementing only the second argument being a scalar. I tried to implement this, but I had to change the testing framework/tools for this a lot. So I think the should be a separate issue, otherwise this PR will never end. OTOH, this will mean a breaking change.

pitrou · 2020-10-07T12:07:03Z

I can imagine that we also want to support binary_split_pattern, so should we have to separate kernel names for this, or keep it split_pattern and implement support for binary in the future?

The implementation would be exactly the same, right? I think "split_pattern" is fine.

API docs: I'm not sure the compute.rst docs I added should follow some standard, or if that table can be produce using some tool, I've done it by hand for now.

We do it by hand indeed.

Should split_pattern be a binary kernel, where pattern is the second argument?

I don't think that would make sense, no.

cpp/src/arrow/compute/kernels/scalar_string.cc

docs/source/cpp/compute.rst

pitrou · 2020-10-19T10:34:56Z

@maartenbreddels Don't hesitate to tell me when this PR is ready again.

maartenbreddels · 2020-10-19T10:53:55Z

Hi @pitrou, yes, I've addressed all comments, should be good for a (last?) review.

pitrou · 2020-10-19T13:20:45Z

Will update for the new FunctionDoc functionality and then merge.

pitrou · 2020-10-19T14:32:46Z

macOS CI failures are unrelated.

pitrou · 2020-10-19T14:33:37Z

Thank you very much @maartenbreddels !

maartenbreddels · 2020-10-19T17:34:40Z

Thanks for the reviews Antoine!

Contains: * `split_pattern` kernel with max_split and reverse option * `ascii_split_whitespace` similar to Python's `bytes.split` * `utf8_split_whitespace` similar to Python's `str.split` It should be easy to add new split methods, e.g. a regex one in the future. Closes #8271 from maartenbreddels/ARROW-9991 Authored-by: Maarten A. Breddels <maartenbreddels@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>

maartenbreddels force-pushed the ARROW-9991 branch from e4209ee to 9420f3d Compare September 25, 2020 14:36

jorisvandenbossche reviewed Sep 30, 2020

View reviewed changes

pitrou reviewed Sep 30, 2020

View reviewed changes

maartenbreddels force-pushed the ARROW-9991 branch from 4359a53 to 4f403d3 Compare October 6, 2020 13:33

maartenbreddels force-pushed the ARROW-9991 branch from 5e8bb4f to 82488b0 Compare October 7, 2020 10:49

pitrou reviewed Oct 8, 2020

View reviewed changes

maartenbreddels force-pushed the ARROW-9991 branch from 33a34e7 to b3c90bb Compare October 14, 2020 07:35

ARROW-9991: [C++] Split kernels for strings/binary

7d2b6a2

pitrou force-pushed the ARROW-9991 branch from 218b8bc to 7d2b6a2 Compare October 19, 2020 13:33

pitrou changed the title ~~ARROW-9991: [C++] split kernels for strings/binary~~ ARROW-9991: [C++] Split kernels for strings/binary Oct 19, 2020

pitrou closed this in e56cb37 Oct 19, 2020

maartenbreddels deleted the ARROW-9991 branch October 19, 2020 17:34

maartenbreddels mentioned this pull request Jan 25, 2021

String features for use with Apache Arrow v3.0 vaexio/vaex#1186

Merged

5 tasks

This was referenced May 10, 2021

[C++] split kernels for strings/binary #26016

Closed

[C++] Unary kernels that results in a list have no preallocated offset buffer #26211

Closed

[C++] String split kernels do not propagate nulls correctly on sliced input #26212

Closed

ARROW-9991: [C++] Split kernels for strings/binary #8271

ARROW-9991: [C++] Split kernels for strings/binary #8271

Uh oh!

Conversation

maartenbreddels commented Sep 25, 2020 • edited by pitrou Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Sep 25, 2020

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche commented Sep 30, 2020

Uh oh!

pitrou left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pitrou commented Oct 7, 2020

Uh oh!

maartenbreddels commented Oct 7, 2020

Uh oh!

pitrou commented Oct 7, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pitrou commented Oct 19, 2020

Uh oh!

maartenbreddels commented Oct 19, 2020

Uh oh!

pitrou commented Oct 19, 2020

Uh oh!

pitrou commented Oct 19, 2020

Uh oh!

pitrou commented Oct 19, 2020

Uh oh!

maartenbreddels commented Oct 19, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

maartenbreddels commented Sep 25, 2020 •

edited by pitrou

Loading