bpo-31672: string: Use re.A | re.I flag for identifier pattern#3872
Conversation
re.A | re.I flag for identifier pattern
As documented, identifier should be ASCII. Since we forgot re.A flag, it matched to some non ASCII characters. For backward compatibility, we need to remove re.A flag after pattern is compiled.
b056576 to
c43ec46
Compare
warsaw
left a comment
There was a problem hiding this comment.
It seems a little weird, but with comments I think it could be okay. I guess I'm a +0 so I'd like to get some other opinions.
| idpattern = r'[_a-z][_a-z0-9]*' | ||
| braceidpattern = None | ||
| flags = _re.IGNORECASE | ||
| flags = _re.IGNORECASE | _re.ASCII |
There was a problem hiding this comment.
I would add a comment here too, since a person reading the code may not notice the restored flag below.
|
|
||
| # We use re.I | re.A when compiling Template.idpattern, but restore old flag | ||
| # for backward compatibility. | ||
| Template.flags = _re.IGNORECASE |
There was a problem hiding this comment.
How about:
We use re.I | re.A while compiling Template.idpattern in the metaclass above, but since
flags is part of the public API, we restore its original documented value for backward
compatibility.
?
|
Do you think it's worth waiting to see how bpo-31690 resolves? I'd definitely prefer the inline 'a' flag approach instead. |
I agree with you. Pending this pull request until bpo-31690. |
warsaw
left a comment
There was a problem hiding this comment.
I think it's worth adding a NEWS entry instead of a skip-news label, to pass the bevedere test. Other than that, this looks great, thanks!
|
@warsaw I added NEWS entry and updated the doc. Would you review it too? |
warsaw
left a comment
There was a problem hiding this comment.
Looks great, thanks! I noticed a few misspellings, and made some documentation change suggestions.
Also, I'm noticing the tests are failing.
| ``[_a-z][_a-z0-9]*``. If this is given and *braceidpattern* is ``None`` | ||
| this pattern will also apply to braced placeholders. | ||
| ``(?-i:[_a-zA-Z][_a-zA-Z0-9]*)``. Since default *flags* is | ||
| ``re.IGNORECASE``, ``[a-z]``Without local flag ``-i``, is used to avoid to match with non ASCII characters. |
There was a problem hiding this comment.
There should be a space before the W, but I also feel like the sentence starting with "Since default..." should just be moved to, and consolidated with, the note:: section that just follows.
There was a problem hiding this comment.
Oh, I missed removing this sentence after writing note section.
| .. note:: | ||
|
|
||
| Default *flags* is ``re.IGNORECASE``. So the pattern ``[a-z]`` can match | ||
| with some non ASCII characters. That's why We use local ``-i`` flag here. |
| with some non ASCII characters. That's why We use local ``-i`` flag here. | ||
|
|
||
| When overrinding this class, please consider overriding *flags* with ``0`` | ||
| or ``re.IGNORECASE | re.ASCII``. |
There was a problem hiding this comment.
How about: "When subclassing, please..."
Why? Or in other words, can you add a short explanation of why they should consider this?
| @@ -0,0 +1,2 @@ | |||
| ``idpattern`` in ``string.Template`` matched some non ASCII characters. Now | |||
| it uses ``-i`` regular expression local flag to avoid non ASCII characters. | |||
| raises(ValueError, s.substitute, dict(who='tim')) | ||
| # Template.idpattern should match to only ASCII characters. | ||
| # https://bugs.python.org/issue31672 | ||
| s = Template("$who likes $ı") # (0x131, DOTLESS I) |
There was a problem hiding this comment.
Test also 'İ' (0x130). 'İ'.lower() == 'i'. In older Python versions [a-z] didn't match 'ı', but matched 'İ'.
|
|
||
| Default *flags* is ``re.IGNORECASE``. So the pattern ``[a-z]`` can match | ||
| with some non ASCII characters. That's why We use local ``-i`` flag here. | ||
| with some non-ASCII characters. That's why We use local ``-i`` flag here. |
|
Should I backport this to 3.6? It seems minor enough to not backport. |
|
We have spent so much time on this issue only to find 3.6-compatible solution. I think that in 3.7-only solution we could break backward compatibility and remove the re.IGNORECASE flag if not found more compatible solution. |
|
Thanks @methane for the PR 🌮🎉.. I'm working now to backport this PR to: 3.6. |
|
Sorry, @methane, I could not cleanly backport this to |
…dentifiers (pythonGH-3872) Pattern `[a-z]` with `IGNORECASE` flag can match to some non-ASCII characters. Straightforward solution for this is using `IGNORECASE | ASCII` flag. But users may subclass `Template` and override only `idpattern`. So we want to avoid changing `Template.flags`. So this commit uses local flag `-i` for `idpattern` and change `[a-z]` to `[a-zA-Z]`.. (cherry picked from commit b22273e)
|
GH-3982 is a backport of this pull request to the 3.6 branch. |
…iers (GH-3872) Pattern `[a-z]` with `IGNORECASE` flag can match to some non-ASCII characters. Straightforward solution for this is using `IGNORECASE | ASCII` flag. But users may subclass `Template` and override only `idpattern`. So we want to avoid changing `Template.flags`. So this commit uses local flag `-i` for `idpattern` and change `[a-z]` to `[a-zA-Z]`. (cherry picked from commit b22273e)
As documented, identifier should be ASCII.
Since we forgot re.A flag, it matched to some non ASCII characters.
For backward compatibility, we need to remove re.A flag after
pattern is compiled.
https://bugs.python.org/issue31672