BUG: .str methods failing on PyArrow using regex with \Z #63705

Merged
jorisvandenbossche merged 12 commits into pandas-dev:main from jorisvandenbossche:string-dtype-regex-z
Jan 21, 2026
Conversation

@jorisvandenbossche
Member

Closes #63385

This currently only handles match and still needs to be generalized to the other methods, if we think this approach is workable.

The RE2 engine (used by PyArrow) only supports \z and not \Z. Python 3.14 added \z and kept \Z as an alias for compatibility. So until recently, Python users could only write \Z, while the PyArrow engine requires \z. Manually switching to \z only works on Python 3.14+, so I think it would be good to handle this under the hood for the user.
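
For context, a minimal sketch of what `\Z` means in Python's `re` module and why users write it instead of `$`. The sample string is an illustrative assumption and is not taken from the linked issue.

```python
import re

text = "abc\n"

# `$` (without re.MULTILINE) also matches just before a trailing newline.
dollar = re.search(r"abc$", text)

# `\Z` matches only at the very end of the string, so the newline blocks it.
strict = re.search(r"abc\Z", text)

print(dollar is not None, strict is not None)
```

On Python 3.14+ the spelling `\z` behaves the same as `\Z`; on older versions only `\Z` is accepted, which is the mismatch with RE2 described above.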

@jorisvandenbossche jorisvandenbossche added this to the 3.0 milestone Jan 16, 2026
@jorisvandenbossche jorisvandenbossche added the Strings (String extension data type and string data) and Arrow (pyarrow functionality) labels Jan 16, 2026
Comment thread pandas/core/arrays/string_arrow.py Outdated
if isinstance(pat, re.Pattern):
    pat, case, flags = self._preprocess_re_pattern(pat, case)

if pat.endswith("\\Z") and not pat.endswith("\\\\Z"):
Member

This correctly handles what I think would be the vast majority of real cases and as far as I can tell has no false positives. However it does have false negatives. The following should give true:

  • r"\\\Z" (in general, an odd number of \)
  • r"text(\Z)"
  • r"\Z text"

Only the first seems like it could maybe occur in practice, the other two are likely erroneous. And while I could see this being used with lookarounds, we'll already be falling back to object dtype with those so we don't need to be worried about them here.
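
A small sketch of the cases listed above, using the condition from the line under review. The helper name `simple_check` and the sample patterns are illustrative assumptions, not code from the PR.

```python
def simple_check(pat: str) -> bool:
    # Same condition as the reviewed line: ends with `\Z`, but not with `\\Z`.
    return pat.endswith("\\Z") and not pat.endswith("\\\\Z")


cases = [
    r"abc\Z",     # plain end anchor: detected
    r"abc\\Z",    # escaped backslash followed by a literal Z: correctly skipped
    r"\\\Z",      # escaped backslash followed by an anchor: missed
    r"text(\Z)",  # anchor inside a group: missed
    r"\Z text",   # anchor not at the end of the pattern: missed
]
results = {pat: simple_check(pat) for pat in cases}

for pat, detected in results.items():
    print(f"{pat!r}: {detected}")
```

Only the first pattern is detected; the last three are the false negatives described in the comment.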

Member Author

Thanks for the analysis!

r"\\\Z" (in general, an odd number of \)

Should we count the number of trailing \? (and so only get here if the number of \ is not even). Something like:

if pat.endswith("\\Z") and not ((len(pat[:-1]) - len(pat[:-1].rstrip("\\"))) % 2 == 0):

Member

@rhshadrach rhshadrach Jan 20, 2026

We can remove one of the pat[:-1] calls (which I think makes a copy of the entire string). I also removed the not, but I'm fine if that stays as in the original.

# Second condition counts the number of `\` that pat ends with prior to Z
if pat.endswith("\\Z") and (len(pat) - len(pat[:-1].rstrip("\\")) + 1) % 2 == 1:
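
To make the parity argument explicit, here is a sketch that counts the backslashes directly and compares the result with the condition suggested above. The function names `ends_with_anchor` and `review_variant` are hypothetical and exist only for this illustration.

```python
def ends_with_anchor(pat: str) -> bool:
    """Return True if `pat` ends with an unescaped `\\Z` anchor."""
    if not pat.endswith("Z"):
        return False
    body = pat[:-1]
    # Number of consecutive backslashes immediately before the final Z.
    n_backslashes = len(body) - len(body.rstrip("\\"))
    # An odd count means the last backslash is not itself escaped.
    return n_backslashes % 2 == 1


def review_variant(pat: str) -> bool:
    # The condition from the review comment; it differs from the true
    # backslash count by 2, so the parity is the same.
    return pat.endswith("\\Z") and (len(pat) - len(pat[:-1].rstrip("\\")) + 1) % 2 == 1


patterns = [r"\Z", r"\\Z", r"\\\Z", r"a\\\\Z", "Z", "abc"]
agree = all(ends_with_anchor(p) == review_variant(p) for p in patterns)
print(agree)
```

Both versions accept `r"\Z"` and `r"\\\Z"` and reject `r"\\Z"`, since in the latter the backslash before `Z` is escaped.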

Member Author

Added that and added some test cases to one of the tests

@jorisvandenbossche jorisvandenbossche changed the title from "BUG: .str.contains et al failing with PyArrow and using \Z" to "BUG: .str methods failing on PyArrow using regex with \Z" Jan 18, 2026
     elif not pat.startswith("^"):
         pat = f"^({pat[0:-1]})$"
-    return self._str_match(pat, case, flags, na)
+    return ArrowStringArrayMixin._str_match(self, pat, case, flags, na)
Member Author

@rhshadrach I changed this and added a self._has_regex_lookaround(pat) check to ArrowStringArray._str_fullmatch, so that the method here always uses the mixin version of _str_match rather than the ArrowStringArray one. The latter would otherwise call the validation methods a second time, which can fail on older Python versions after \Z has been replaced with \z. In general this also avoids the overhead of validating twice.
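
The dispatch issue can be illustrated with a toy class hierarchy. This is a simplified sketch with hypothetical classes (`MatchMixin`, `ValidatingArray`) and a made-up `_validate` counter; it is not the pandas implementation.

```python
class MatchMixin:
    """Hypothetical stand-in for the shared mixin."""

    def _str_match(self, pat: str) -> str:
        return f"match:{pat}"

    def _str_fullmatch(self, pat: str) -> str:
        # Call the mixin implementation explicitly so that a subclass
        # override of _str_match (and its validation) is bypassed.
        return MatchMixin._str_match(self, f"^({pat})$")


class ValidatingArray(MatchMixin):
    """Hypothetical subclass that validates the pattern before delegating."""

    def __init__(self) -> None:
        self.validations = 0

    def _validate(self, pat: str) -> None:
        self.validations += 1

    def _str_match(self, pat: str) -> str:
        self._validate(pat)
        return MatchMixin._str_match(self, pat)

    def _str_fullmatch(self, pat: str) -> str:
        self._validate(pat)
        return MatchMixin._str_fullmatch(self, pat)


arr = ValidatingArray()
out = arr._str_fullmatch("abc")
print(out, arr.validations)
```

With the explicit `MatchMixin._str_match(self, ...)` call the pattern is validated once. Had the mixin used `self._str_match(...)`, the subclass override would run and validate the already rewritten pattern a second time.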

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review January 19, 2026 14:13
Member

@rhshadrach rhshadrach left a comment

lgtm; good to merge this as-is without handling a higher number of \ as discussed above.

@jorisvandenbossche jorisvandenbossche merged commit c9b51fa into pandas-dev:main Jan 21, 2026
39 of 42 checks passed
@jorisvandenbossche jorisvandenbossche deleted the string-dtype-regex-z branch January 21, 2026 10:21
vkverma9534 pushed a commit to vkverma9534/pandas that referenced this pull request Jan 30, 2026
Development

Successfully merging this pull request may close these issues.

BUG: string replace results in invalid regular expression: invalid perl operator: (?<=

2 participants