-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10306: [C++] Add string replacement kernel #8468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
You may want to remove the regex variant of this if you want to move this forward without depending on resolving the re2 dependency issue. |
|
@maartenbreddels would it be practical to split this into two PRs? (one for plain replace, other for re2-based replace) Or would you prefer first to have the re2 dependency issue resolved? (ARROW-10541) |
c55f9ec to
1099f4e
Compare
6f41599 to
e1c0571
Compare
|
I'd rather keep this 1 PR, looks like #8756 is working |
|
@pitrou this is ready for review, failure seems unrelated (minio on windows). |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat! See comments below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReplaceString below is basically independent from Type, but using this idiom may compile it twice. Can you find another way to parametrize the kernel?
(hint: perhaps use composition rather than inheritance)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I misunderstand, but via offset_type we are not independent of Type right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I don't understand why offset_type is being used here. ValueDataBuilder is basically a TypedBufferBuilder<uint8_t>, it's used for building the string data, it doesn't deal with string offsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, yes, I see it now. I'm using offset_type in this class as well, which I shouldn't, I think that's what led me to this. This requires a bit of refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern occurs more often in the file, I didn't realize this lead to slower compilation and probably larger binary sizes. I think it requires a refactor that is larger than this PR. Also, I won't have the time currently to do this. Can we merge this as is, and I'll open a Jira issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you don't need to refactor other kernels for now, but I suppose this one could easily be adapted, no? :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly as above, this looks basically independent from Type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the GlobalReplace loop works a bit differently, it calls Match then Rewrite, avoiding the duplicate matching calls. Not sure it's worth optimizing this, though:
https://github.com/google/re2/blob/master/re2/re2.cc#L427
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I prefer to keep it as it is, I left a comment in the code so this doesn't get lost.
e1c0571 to
60d6589
Compare
3f8aa29 to
bd5e8fe
Compare
|
@maartenbreddels Is it ready for review again? Feel free to ping me. |
|
@pitrou Yes, apart from an unanswered question this is ready for review 👍 |
d4608a9 to
356c300
Compare
bd5e8fe to
436a9e2
Compare
|
@pitrou this is ready for review (assuming you agree with the above plan of doing a refactor later on) |
|
(gentle ping here, would really like to see those PRs merged for 4.0 in April!) |
|
I'll rebase and update this PR. |
|
Hmm, there's a re2 link error on RTools 3.5: |
|
Haven't seen that before, and we've been building with re2 for months now. Maybe this is the first time we're building something that actually uses re2? It's possible that re2 needs a "backport" library built with the rtools3.5 toolchain; an immediate workaround could be to build with |
|
@github-actions crossbow submit -g r |
|
Revision: 82bb60d Submitted crossbow builds: ursacomputing/crossbow @ actions-233 |
|
The as-cran failure is legit: https://github.com/ursacomputing/crossbow/runs/2197449739?check_suite_focus=true#step:7:210 It's a weird build setup that uses |
|
I made ARROW-12094 for fixing that build. Merging now. |
Two new kernels