Skip to content

Use str.isidentifier to match idents on python 3#731

Merged
davidism merged 9 commits intomasterfrom
feature/kill-stringdefs
Jul 4, 2017
Merged

Use str.isidentifier to match idents on python 3#731
davidism merged 9 commits intomasterfrom
feature/kill-stringdefs

Conversation

@mitsuhiko
Copy link
Copy Markdown
Contributor

@mitsuhiko mitsuhiko commented Jul 1, 2017

Not sure I like this very much and I need to ensure still that \w is a superset of the identifier ranges. It is however also what tokenize does on python 3 so there is precedent. Moreover though the error message is different now for bad identifiers but I assume not many users will notice.

Refs #729

@lf-
Copy link
Copy Markdown

lf- commented Jul 1, 2017

I confirmed that \w matches all meaningful identifiers:

import re
import sys
import unicodedata

cre = re.compile(r'\w')
for cp in range(sys.maxunicode + 1):
    s = chr(cp)
    if s.isidentifier() and not re.match(cre, s):
        print(hex(cp), unicodedata.name(s))

Valid identifiers not matched by \w:

0x1885 MONGOLIAN LETTER ALI GALI BALUDA
0x1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
0x2118 SCRIPT CAPITAL P
0x212e ESTIMATED SYMBOL

Beyond that, it appears that the clauses for the try-except at the start are backwards, causing python 2 to raise and python 3 to not have unicode identifier support.

@mitsuhiko
Copy link
Copy Markdown
Contributor Author

Need to add a test to this one but I fixed the reverse thing and added characters missing. Oddly enough I can only reproduce \w not matching on the latter two but not the mongolian characters.

@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 2, 2017

Need to also check not keyword.iskeyword(value) to catch things like True and class which str.isidentifier returns true for.

Or maybe not, since they'll raise SyntaxError later if used improperly.

@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 2, 2017

env.from_string('{{ ℮ }}') (or any of the four special case characters) raises TemplateSyntaxError: unexpected char '℮' at 3. Other Unicode characters work as expected.

davidism added 2 commits July 2, 2017 09:18
currently fails on special case unicode
avoids duplicate work for internal prs
@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 2, 2017

Pushed a simple test, currently fails for the four special case characters.

@pallets pallets deleted a comment from lf- Jul 2, 2017
@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 3, 2017

The regex is specifically failing on the \b boundaries, if they are removed the special cases work. This is because \b means "transition from \w to \W", and the special cases are not in \w.

@mitsuhiko
Copy link
Copy Markdown
Contributor Author

I wonder if that was not already broken before.

@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 3, 2017

Just checked, e.from_string('{{ ℮ }}') works on master.

@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 3, 2017

I don't think the \b is necessary, it's not present in the previous stringdef regex. All tests still pass with the boundaries removed.

There were boundaries on the Python 2 regex, but they don't seem necessary either.

@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 3, 2017

I reported the \w issue to Python: http://bugs.python.org/issue30838

@lf- thanks for helping figure this out.

@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 3, 2017

Writing more tests based on the previous stringdef, https://raw.githubusercontent.com/pallets/jinja/2.9.5/jinja2/_stringdefs.py, turns up some invalid characters in the previous regex. The following start characters were matched but were not valid as start or continue characters:

\u309B KATAKANA-HIRAGANA VOICED SOUND MARK
\u309C KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

The following were matched and are valid continue characters, but are no longer matched by \w:

\u309B MIDDLE DOT
\u309C GREEK ANO TELEIA

@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 3, 2017

Then I realized we were missing a test in that script: ('a' + c).isidentifier() for characters that are valid for continue but not start. That reveals a lot more that aren't matched by \w.

2097 characters not matched. 😞

@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 3, 2017

Going to have to bring back _stringdefs, but still simplified.

@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 3, 2017

Original _stringdefs:

         len   sizeof
start    48194 96462
continue 49826 99726
total    98020 196188

New _identifier:

len: 635
sizeof: 2616

Timing is hard to test, but on my machine re.purge(); re.compile(...) is down from ~100 ms to ~840 µs.

Strategy was to collapse contiguous ranges and rely on str.isidentifier to validate so that the regex is simpler.

@mitsuhiko
Copy link
Copy Markdown
Contributor Author

That sounds good. Ideally we can reuse the same thing as we had before to regenerate the regex though we need to make sure we hit the set that is missing in all versions of Python 3.

@davidism davidism force-pushed the feature/kill-stringdefs branch from ad47a4b to 511384f Compare July 4, 2017 16:59
new version uses ~2KB vs 200KB memory, is ~100x faster to load
move script to generate pattern to scripts directory
add more tests
@davidism davidism force-pushed the feature/kill-stringdefs branch from 511384f to fb1e453 Compare July 4, 2017 17:00
@davidism
Copy link
Copy Markdown
Member

davidism commented Jul 4, 2017

At the cost of about twice as much space, the regex could be made more accurate by omitting \w and generating the full range of valid characters. At the cost of twice again as much space, we could remove the need for calling str.isidentifer during lexing by distinguishing start and continue characters in the regex. For now, I'm happy with where we are and will leave \w and isidentifier in.

@davidism davidism merged commit 47ef6a3 into master Jul 4, 2017
@davidism davidism deleted the feature/kill-stringdefs branch July 4, 2017 18:04
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Nov 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants