-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-9991: [C++] Split kernels for strings/binary #8271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e4209ee to
9420f3d
Compare
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! A few high-level comments (didn't (yet) look at C++ implementation, just the docs/tests)
python/pyarrow/compute.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be premature optimization, but in theory the pattern could also be a second argument ([array, pattern]), keeping the option open to later implement [array, array_of_patterns] where both could be an array of the same length? (in case you want a different split pattern per value)
That would also make SplitOptions could be shared between both split_pattern and split_whitespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(ah, I see you commented about that above yourself)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can take a look, I think it makes sense to add this indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the kernel name be string_split_pattern instead of split_pattern? (like we have string_is_ascii and not is_ascii and binary_length and not length)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, because it can in principle also take binary arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then binary_split_pattern seems "most correct"? Although I personally don't really like the fact that this naming scheme "hides" useful string functionality behind the binary_ prefix (which is the same with binary_length, though)
Buf if the original string was a null, then the output also contains a (top-level) null, and not a list with a null? And then such top-level nulls are typically not put as an entry in the actual values array (or at least when arrow is building up a list array itself, the format might also not allow otherwise). Eg: (so only the null that was inside a list is represented as null in the underlying values array) |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't look at everything yet since it seems the PR is unfinished.
Among the comments below, please especially notice the ones mentioning how to use buffer builders and VisitArrayDataInline. This will make your life easier, and further maintenance too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there really a point in benchmarking random data? Presumably 0 split will happen...
You also don't need to worry about benchmarks in this PR, they may be added later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the pattern to "a", which is more meaningful for a true benchmark. But I use these as a sanity check if there aren't and major performance issues after refactoring, so I'd like to keep this in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More than double. Your example shows it growing from 2 to 5, but if the pattern is length 1 (probably a common case?) you're really allocating one output string per input byte. This is probably too much.
Instead, you may use:
- a presized
BufferBuilderfor the string bytes - a dynamically-sized
TypedBufferBuilder<offset_type>for the string offsets - a presized
TypedBufferBuilder<int32_t>for the list offsets
And the output's null bitmap is the same as the input's null bitmap, so no need to recompute it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for those tips, this improved the code a lot. If the offsets were preallocated is there a helper class to wrap that? (say after https://issues.apache.org/jira/browse/ARROW-10207 is fixed).
4359a53 to
4f403d3
Compare
|
@maartenbreddels Please ping me when you need a new review. |
5e8bb4f to
82488b0
Compare
|
@pitrou @jorisvandenbossche I think this is ready for review, provided the points below are not an issue. Open questions
|
The implementation would be exactly the same, right? I think "split_pattern" is fine.
We do it by hand indeed.
I don't think that would make sense, no. |
33a34e7 to
b3c90bb
Compare
|
@maartenbreddels Don't hesitate to tell me when this PR is ready again. |
|
Hi @pitrou, yes, I've addressed all comments, should be good for a (last?) review. |
|
Will update for the new |
|
macOS CI failures are unrelated. |
|
Thank you very much @maartenbreddels ! |
|
Thanks for the reviews Antoine! |
Contains: * `split_pattern` kernel with max_split and reverse option * `ascii_split_whitespace` similar to Python's `bytes.split` * `utf8_split_whitespace` similar to Python's `str.split` It should be easy to add new split methods, e.g. a regex one in the future. Closes #8271 from maartenbreddels/ARROW-9991 Authored-by: Maarten A. Breddels <maartenbreddels@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Contains:
split_patternkernel with max_split and reverse optionascii_split_whitespacesimilar to Python'sbytes.splitutf8_split_whitespacesimilar to Python'sstr.splitIt should be easy to add new split methods, e.g. a regex one in the future.