Skip to content

Conversation

@lidavidm
Copy link
Member

This adds a split_pattern_regex kernel using RE2.

Caveats:

  • RE2 requires us to wrap the user's regex in a capture group in order to actually get the matched delimiter.
  • Reverse splitting is not implemented - there's not a good way to do this with RE2.
  • In R, strsplit behaves differently - trailing empty splits are no longer dropped:
> df <- tibble(x = c("foo bar"))
> (df %>% mutate(x = strsplit(x, "bar")) %>% collect())$x
[[1]]
[1] "foo "

> (record_batch(df) %>% mutate(x = strsplit(x, "bar")) %>% collect())$x
<list<character>[1]>
[[1]]
[1] "foo " ""   

So the behavior here does not exactly match R. Though this was already the case:

> (df %>% mutate(x = strsplit(x, "bar", fixed = TRUE)) %>% collect())$x
[[1]]
[1] "foo "

> (record_batch(df) %>% mutate(x = strsplit(x, "bar", fixed = TRUE)) %>% collect())$x
<list<character>[1]>
[[1]]
[1] "foo " ""    

@lidavidm
Copy link
Member Author

One question I have is if we might want to still use split_pattern if we detect the pattern is not actually a regex, just so we don't invoke RE2/require an Arrow-with-RE2 build in all cases.

@nealrichardson
Copy link
Member

One question I have is if we might want to still use split_pattern if we detect the pattern is not actually a regex, just so we don't invoke RE2/require an Arrow-with-RE2 build in all cases.

@thisisnic did something similar in #10190, specifically https://github.com/apache/arrow/pull/10190/files#diff-3a791b6bbfdb1605c74f85ca5e9854503ed70ac8b274288919dfbec7ce0cef02R549-R552

@lidavidm
Copy link
Member Author

One question I have is if we might want to still use split_pattern if we detect the pattern is not actually a regex, just so we don't invoke RE2/require an Arrow-with-RE2 build in all cases.

@thisisnic did something similar in #10190, specifically https://github.com/apache/arrow/pull/10190/files#diff-3a791b6bbfdb1605c74f85ca5e9854503ed70ac8b274288919dfbec7ce0cef02R549-R552

Right, and currently this removes that check in favor of always dispatching to the new kernel unless fixed = TRUE. But we could keep that check and only dispatch if fixed = FALSE and the pattern is actually regex-like. That'd let you still use strsplit even if you built Arrow without RE2 in the same cases as before.

It's a very minor point, though - maybe not worth doing.

@nealrichardson
Copy link
Member

Yeah, I thought you were proposing moving that logic into C++ or something. Not sure it's worth doing, if you call a _regex function you must be expecting a build with RE2, and I'd expect that RE2 would optimize non-regex evaluation itself.

@github-actions
Copy link

Copy link
Contributor

@cyb70289 cyb70289 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, +1

@lidavidm
Copy link
Member Author

Thanks for the review, merging now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants