Skip to content

Conversation

@maartenbreddels
Copy link
Contributor

@maartenbreddels maartenbreddels commented Sep 25, 2020

Contains:

  • split_pattern kernel with max_split and reverse option
  • ascii_split_whitespace similar to Python's bytes.split
  • utf8_split_whitespace similar to Python's str.split

It should be easy to add new split methods, e.g. a regex one in the future.

@github-actions
Copy link

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! A few high-level comments (didn't (yet) look at C++ implementation, just the docs/tests)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be premature optimization, but in theory the pattern could also be a second argument ([array, pattern]), keeping the option open to later implement [array, array_of_patterns] where both could be an array of the same length? (in case you want a different split pattern per value)
That would also make SplitOptions could be shared between both split_pattern and split_whitespace

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(ah, I see you commented about that above yourself)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can take a look, I think it makes sense to add this indeed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the kernel name be string_split_pattern instead of split_pattern? (like we have string_is_ascii and not is_ascii and binary_length and not length)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, because it can in principle also take binary arrays.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then binary_split_pattern seems "most correct"? Although I personally don't really like the fact that this naming scheme "hides" useful string functionality behind the binary_ prefix (which is the same with binary_length, though)

@jorisvandenbossche
Copy link
Member

This also raised a question: Should both the list entry and the string array entry have a missing/null value if the input string contains a null value? I think it should because if we ask for the underlying string array, and the string value that the missing list entry points to is not a missing value, it will be an empty string, which seems odd to me.

Buf if the original string was a null, then the output also contains a (top-level) null, and not a list with a null?

And then such top-level nulls are typically not put as an entry in the actual values array (or at least when arrow is building up a list array itself, the format might also not allow otherwise). Eg:

In [10]: arr = pa.array([["a", "b"], None, ["c"], [None]])

In [11]: arr
Out[11]: 
<pyarrow.lib.ListArray object at 0x7f2316c1d108>
[
  [
    "a",
    "b"
  ],
  null,
  [
    "c"
  ],
  [
    null
  ]
]

In [12]: arr.values
Out[12]: 
<pyarrow.lib.StringArray object at 0x7f2316c1d5e8>
[
  "a",
  "b",
  "c",
  null
]

(so only the null that was inside a list is represented as null in the underlying values array)

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look at everything yet since it seems the PR is unfinished.

Among the comments below, please especially notice the ones mentioning how to use buffer builders and VisitArrayDataInline. This will make your life easier, and further maintenance too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there really a point in benchmarking random data? Presumably 0 split will happen...
You also don't need to worry about benchmarks in this PR, they may be added later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the pattern to "a", which is more meaningful for a true benchmark. But I use these as a sanity check if there aren't and major performance issues after refactoring, so I'd like to keep this in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More than double. Your example shows it growing from 2 to 5, but if the pattern is length 1 (probably a common case?) you're really allocating one output string per input byte. This is probably too much.

Instead, you may use:

  • a presized BufferBuilder for the string bytes
  • a dynamically-sized TypedBufferBuilder<offset_type> for the string offsets
  • a presized TypedBufferBuilder<int32_t> for the list offsets

And the output's null bitmap is the same as the input's null bitmap, so no need to recompute it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for those tips, this improved the code a lot. If the offsets were preallocated is there a helper class to wrap that? (say after https://issues.apache.org/jira/browse/ARROW-10207 is fixed).

@pitrou
Copy link
Member

pitrou commented Oct 7, 2020

@maartenbreddels Please ping me when you need a new review.

@maartenbreddels
Copy link
Contributor Author

@pitrou @jorisvandenbossche I think this is ready for review, provided the points below are not an issue.

Open questions

  • kernel name: string_split_pattern vs split_pattern. I can imagine that we also want to support binary_split_pattern, so should we have to separate kernel names for this, or keep it split_pattern and implement support for binary in the future?
  • API docs: I'm not sure the compute.rst docs I added should follow some standard, or if that table can be produce using some tool, I've done it by hand for now.
  • Should split_pattern be a binary kernel, where pattern is the second argument? As @jorisvandenbossche suggested we could start by only implementing only the second argument being a scalar. I tried to implement this, but I had to change the testing framework/tools for this a lot. So I think the should be a separate issue, otherwise this PR will never end. OTOH, this will mean a breaking change.

@pitrou
Copy link
Member

pitrou commented Oct 7, 2020

I can imagine that we also want to support binary_split_pattern, so should we have to separate kernel names for this, or keep it split_pattern and implement support for binary in the future?

The implementation would be exactly the same, right? I think "split_pattern" is fine.

API docs: I'm not sure the compute.rst docs I added should follow some standard, or if that table can be produce using some tool, I've done it by hand for now.

We do it by hand indeed.

Should split_pattern be a binary kernel, where pattern is the second argument?

I don't think that would make sense, no.

@pitrou
Copy link
Member

pitrou commented Oct 19, 2020

@maartenbreddels Don't hesitate to tell me when this PR is ready again.

@maartenbreddels
Copy link
Contributor Author

Hi @pitrou, yes, I've addressed all comments, should be good for a (last?) review.

@pitrou
Copy link
Member

pitrou commented Oct 19, 2020

Will update for the new FunctionDoc functionality and then merge.

@pitrou pitrou changed the title ARROW-9991: [C++] split kernels for strings/binary ARROW-9991: [C++] Split kernels for strings/binary Oct 19, 2020
@pitrou
Copy link
Member

pitrou commented Oct 19, 2020

macOS CI failures are unrelated.

@pitrou pitrou closed this in e56cb37 Oct 19, 2020
@pitrou
Copy link
Member

pitrou commented Oct 19, 2020

Thank you very much @maartenbreddels !

@maartenbreddels
Copy link
Contributor Author

Thanks for the reviews Antoine!

@maartenbreddels maartenbreddels deleted the ARROW-9991 branch October 19, 2020 17:34
kszucs pushed a commit that referenced this pull request Oct 19, 2020
Contains:
 * `split_pattern` kernel with max_split and reverse option
 * `ascii_split_whitespace` similar to Python's `bytes.split`
 * `utf8_split_whitespace` similar to Python's `str.split`

It should be easy to add new split methods, e.g. a regex one in the future.

Closes #8271 from maartenbreddels/ARROW-9991

Authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants