BUG: .str methods failing on PyArrow using regex with \Z#63705
BUG: .str methods failing on PyArrow using regex with \Z#63705jorisvandenbossche merged 12 commits intopandas-dev:mainfrom
Conversation
| if isinstance(pat, re.Pattern): | ||
| pat, case, flags = self._preprocess_re_pattern(pat, case) | ||
|
|
||
| if pat.endswith("\\Z") and not pat.endswith("\\\\Z"): |
There was a problem hiding this comment.
This correctly handles what I think would be the vast majority of real cases and as far as I can tell has no false positives. However it does have false negatives. The following should give true:
r"\\\Z"(in general, an odd number of\)r"text(\Z)r"\Z text"
Only the first seems like it could maybe occur in practice, the other two are likely erroneous. And while I could see this being used with lookarounds, we'll already be falling back to object dtype with those so we don't need to be worried about them here.
There was a problem hiding this comment.
Thanks for the analysis!
r"\\\Z"(in general, an odd number of\)
Should we count the number of trailing \? (and so only get here if the number of \ is not even). Something like:
if pat.endswith("\\Z") and not ((len(pat[:-1]) - len(pat[:-1].rstrip("\\"))) % 2 == 0):
There was a problem hiding this comment.
We can remove one of the pat[:-1] (which I think will give a copy of the entire string). Also removed the not but I'm fine if that remains as the original.
# Second condition counts the number of `\` that pat ends with prior to Z
if pat.endswith("\\Z") and (len(pat) - len(pat[:-1].rstrip("\\")) + 1) % 2 == 1:There was a problem hiding this comment.
Added that and added some test cases to one of the tests
| elif not pat.startswith("^"): | ||
| pat = f"^({pat[0:-1]})$" | ||
| return self._str_match(pat, case, flags, na) | ||
| return ArrowStringArrayMixin._str_match(self, pat, case, flags, na) |
There was a problem hiding this comment.
@rhshadrach I changed this and added a self._has_regex_lookaround(pat) to ArrowStringArray._str_fullmatch, to ensure the method here certainly uses the mixin version of _str_match, and not the ArrowStringArray, which would otherwise call the validation methods a second time (which can fail after replacing \Z to \z for older python, and generally we can also avoid the overhead of validating twice)
rhshadrach
left a comment
There was a problem hiding this comment.
lgtm; good to merge this as-is without handling higher number of \ as discussed above.
Closes #63385
This is only for match and needs to be generalized, if we think this approach is workable
The RE2 engine only supports
\zand not\Z, and Python 3.14 actually also just added\zand kept\Zas an alias for compat. So for python users, up to recently, you could only use\Z, but pyarrow engine requires\z. As a user changing manually to use\zonly works with Python 3.14+, so I think it would be good to handle this under the hood for the user.