-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10195: [C++] Add string struct extract kernel using re2 #8459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
should re2 be available on all platform? appveyor for instance gives: |
|
@maartenbreddels Can you rebase this? That would be helpful for me setting up |
9041521 to
5372dc3
Compare
|
I cleaned it up a bit, but it might fail running the test, hope this is enough for you for now. |
5372dc3 to
f66bc36
Compare
|
There are two issues with this PR: empty slicesIn #8728 https://github.com/apache/arrow/pull/8728/files#diff-7771ecc138c4ecc2dc8498affe04f5b7f182c4b77b18a512c3dd07a82d45aa3dR116 was added, which adds an empty slice to the test, which breaks my test (which is good). However, my kernel will not be called, meaning i cannot set the output's type (which depends on the input regex). Shall I open a JIRA for that? The error is: (showing the struct mismatch) ChunkedArray failsCommenting out the empty slice test, the next failure is with
I fail to see a difference in this output (maybe a null value in a child?), and this is difficult to debug in gdb. Maybe an opportunity to improve the printing of differences (since I fail to see any). |
b500ed4 to
f1bd901
Compare
d4608a9 to
356c300
Compare
f1bd901 to
43773ea
Compare
|
@maartenbreddels what's the status of this PR? Ready for review, or are there still open issues? (you listed some above at #8459 (comment)) |
|
This needs rebase now that #8468 has merged. |
1357f98 to
58cbcb7
Compare
|
Rebased, but there are two issues left:
|
|
@maartenbreddels Will you need help here? |
|
Yes, that would be appriciated, the null values I find difficult to debug, and the second issue I'm not sure that is good solution. |
|
Ok, I'm looking at this. |
|
Naming nit @jorisvandenbossche @jonkeane @ianmcook . |
f565700 to
9b1f6e0
Compare
|
@maartenbreddels For the record:
I am not sure what was causing this, but my small rewrite of the kernel fixed it.
This was solved by defining a |
In pandas, this is called "extract" (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html), so from that point of view this is a good name. The |
|
One thing I noted while quickly trying it out: When having an optional group like the above, I would maybe expect a null instead of an empty string in the first child? |
Agreed, but unfortunately re2 doesn't give us that information... |
cpp/src/arrow/compute/api_scalar.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming nit: I would maybe call this "pattern" instead of "regex"
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, will merge
|
Thanks for finishing this Antoine!
Good to know, thanks. |
|
@maartenbreddels You're welcome :-) |
The second commit adds re2 to the linked libraries. @xhochy how should this be done, should I open a separate issue for this?