-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12713 [C++] String reverse kernel #10317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Learned from @cyb70289 that new kernels need to extend both C++ and Python documentation:
|
6b4695f to
24a6088
Compare
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I left some comments about wording - I don't think we need to explain the Unicode details here, just mention that we operate on codepoints/code units, and not grapheme clusters.
| return utf8_char_found == 0; | ||
| } | ||
|
|
||
| static Status InvalidStatus() { return Status::Invalid("Non-ascii sequence in input"); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we be consistent with ASCII?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think you should add InvalidStatus methods with specific messages that do not necessarily generalize for all string errors. Why not directly invoke Status::Invalid("Specific message to specific code block")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am reusing the StringTransform class for both ascii and utf8 kernels. In that, it always throws a Status::Invalid("Invalid UTF8 sequence in input") which is not quite right for an ascii kernel. That's why it was generalized and offloaded to Derived::InvalidStatus() 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify here, the structure of the kernel means the 'implementation' doesn't directly return an error, only true/false. Would you prefer to change it so that it returns a status instead?
|
Remember to update Python documentation as well, https://github.com/apache/arrow/blob/master/docs/source/python/api/compute.rst |
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more nit from me (tests are failing due to a small capitalization issue)
Co-authored-by: David Li <li.davidm96@gmail.com>
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for this.
|
The formatting needs to be fixed here (see the logs for the Dev action), the other CI failures look unrelated. You could try |
|
@lidavidm I think this looks okay now. |
| // would produce arrays with same lengths. Hence checking offset buffer equality | ||
| auto malformed_input = ArrayFromJSON(this->type(), "[\"ɑ\xFFɑa\", \"ɽ\xe1\xbdɽa\"]"); | ||
| const Result<Datum>& res = CallFunction("utf8_reverse", {malformed_input}); | ||
| ASSERT_TRUE(res->array()->buffers[1]->Equals(*malformed_input->data()->buffers[1])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit, but FTR we probably have a AssertBufferEquals (or perhaps AssertBuffersEqual :-)).
This PR adds ascii and utf8 reverse kernels.