Skip to content

bpo-31672: string: Use re.A | re.I flag for identifier pattern#3872

Merged
methane merged 7 commits into
python:masterfrom
methane:string-template-ascii
Oct 13, 2017
Merged

bpo-31672: string: Use re.A | re.I flag for identifier pattern#3872
methane merged 7 commits into
python:masterfrom
methane:string-template-ascii

Conversation

@methane

@methane methane commented Oct 3, 2017

Copy link
Copy Markdown
Member

As documented, identifier should be ASCII.
Since we forgot re.A flag, it matched to some non ASCII characters.

For backward compatibility, we need to remove re.A flag after
pattern is compiled.

https://bugs.python.org/issue31672

@methane methane changed the title bpo-31672: strings.Template should use re.A flag bpo-31672: string: Use re.A | re.I flag for identifier pattern Oct 3, 2017
As documented, identifier should be ASCII.
Since we forgot re.A flag, it matched to some non ASCII characters.

For backward compatibility, we need to remove re.A flag after
pattern is compiled.
@methane methane force-pushed the string-template-ascii branch from b056576 to c43ec46 Compare October 3, 2017 16:35

@warsaw warsaw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a little weird, but with comments I think it could be okay. I guess I'm a +0 so I'd like to get some other opinions.

Comment thread Lib/string.py Outdated
idpattern = r'[_a-z][_a-z0-9]*'
braceidpattern = None
flags = _re.IGNORECASE
flags = _re.IGNORECASE | _re.ASCII

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment here too, since a person reading the code may not notice the restored flag below.

Comment thread Lib/string.py Outdated

# We use re.I | re.A when compiling Template.idpattern, but restore old flag
# for backward compatibility.
Template.flags = _re.IGNORECASE

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

We use re.I | re.A while compiling Template.idpattern in the metaclass above, but since
flags is part of the public API, we restore its original documented value for backward
compatibility.

?

@warsaw

warsaw commented Oct 4, 2017

Copy link
Copy Markdown
Member

Do you think it's worth waiting to see how bpo-31690 resolves? I'd definitely prefer the inline 'a' flag approach instead.

@methane

methane commented Oct 4, 2017

Copy link
Copy Markdown
Member Author

Do you think it's worth waiting to see how bpo-31690 resolves? I'd definitely prefer the inline 'a' flag approach instead.

I agree with you. Pending this pull request until bpo-31690.

@warsaw warsaw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth adding a NEWS entry instead of a skip-news label, to pass the bevedere test. Other than that, this looks great, thanks!

@methane

methane commented Oct 12, 2017

Copy link
Copy Markdown
Member Author

@warsaw I added NEWS entry and updated the doc. Would you review it too?

@warsaw warsaw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks! I noticed a few misspellings, and made some documentation change suggestions.

Also, I'm noticing the tests are failing.

Comment thread Doc/library/string.rst Outdated
``[_a-z][_a-z0-9]*``. If this is given and *braceidpattern* is ``None``
this pattern will also apply to braced placeholders.
``(?-i:[_a-zA-Z][_a-zA-Z0-9]*)``. Since default *flags* is
``re.IGNORECASE``, ``[a-z]``Without local flag ``-i``, is used to avoid to match with non ASCII characters.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a space before the W, but I also feel like the sentence starting with "Since default..." should just be moved to, and consolidated with, the note:: section that just follows.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed removing this sentence after writing note section.

Comment thread Doc/library/string.rst Outdated
.. note::

Default *flags* is ``re.IGNORECASE``. So the pattern ``[a-z]`` can match
with some non ASCII characters. That's why We use local ``-i`` flag here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"non-ASCII"

Also s/We/we/

Comment thread Doc/library/string.rst Outdated
with some non ASCII characters. That's why We use local ``-i`` flag here.

When overrinding this class, please consider overriding *flags* with ``0``
or ``re.IGNORECASE | re.ASCII``.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about: "When subclassing, please..."

Why? Or in other words, can you add a short explanation of why they should consider this?

@@ -0,0 +1,2 @@
``idpattern`` in ``string.Template`` matched some non ASCII characters. Now
it uses ``-i`` regular expression local flag to avoid non ASCII characters.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"non-ASCII" in two places.

Comment thread Lib/test/test_string.py Outdated
raises(ValueError, s.substitute, dict(who='tim'))
# Template.idpattern should match to only ASCII characters.
# https://bugs.python.org/issue31672
s = Template("$who likes $ı") # (0x131, DOTLESS I)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test also 'İ' (0x130). 'İ'.lower() == 'i'. In older Python versions [a-z] didn't match 'ı', but matched 'İ'.

Comment thread Doc/library/string.rst Outdated

Default *flags* is ``re.IGNORECASE``. So the pattern ``[a-z]`` can match
with some non ASCII characters. That's why We use local ``-i`` flag here.
with some non-ASCII characters. That's why We use local ``-i`` flag here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We -> we

@methane methane merged commit b22273e into python:master Oct 13, 2017
@methane

methane commented Oct 13, 2017

Copy link
Copy Markdown
Member Author

Should I backport this to 3.6?

It seems minor enough to not backport.
But it seems safe enough to backport too.

@serhiy-storchaka

Copy link
Copy Markdown
Member

We have spent so much time on this issue only to find 3.6-compatible solution. I think that in 3.7-only solution we could break backward compatibility and remove the re.IGNORECASE flag if not found more compatible solution.

@miss-islington

Copy link
Copy Markdown
Contributor

Thanks @methane for the PR 🌮🎉.. I'm working now to backport this PR to: 3.6.
🐍🍒⛏🤖

@miss-islington

Copy link
Copy Markdown
Contributor

Sorry, @methane, I could not cleanly backport this to 3.6 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker b22273ec5d1992b0cbe078b887427ae9977dfb78 3.6

methane added a commit to methane/cpython that referenced this pull request Oct 13, 2017
…dentifiers (pythonGH-3872)

Pattern `[a-z]` with `IGNORECASE` flag can match to some non-ASCII characters.

Straightforward solution for this is using `IGNORECASE | ASCII` flag.
But users may subclass `Template` and override only `idpattern`. So we want to
avoid changing `Template.flags`.

So this commit uses local flag `-i` for `idpattern` and change `[a-z]` to `[a-zA-Z]`..
(cherry picked from commit b22273e)
@methane methane deleted the string-template-ascii branch October 13, 2017 07:31
@bedevere-bot

Copy link
Copy Markdown

GH-3982 is a backport of this pull request to the 3.6 branch.

methane added a commit that referenced this pull request Oct 14, 2017
…iers (GH-3872)

Pattern `[a-z]` with `IGNORECASE` flag can match to some non-ASCII characters.

Straightforward solution for this is using `IGNORECASE | ASCII` flag.
But users may subclass `Template` and override only `idpattern`. So we want to
avoid changing `Template.flags`.

So this commit uses local flag `-i` for `idpattern` and change `[a-z]` to `[a-zA-Z]`.
(cherry picked from commit b22273e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants