Use str.isidentifier to match idents on python 3 by mitsuhiko · Pull Request #731 · pallets/jinja

mitsuhiko · 2017-07-01T18:41:44Z

I'm not sure I like this very much, and I still need to ensure that \w is a superset of the identifier ranges. It is, however, also what tokenize does on Python 3, so there is precedent. The error message for bad identifiers is now different, but I assume not many users will notice.

Refs #729

lf- · 2017-07-01T21:28:13Z

I confirmed that \w matches all meaningful identifiers:

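import re
import sys
import unicodedata

cre = re.compile(r'\w')
# Walk every code point and report single characters that are valid
# identifiers on their own (i.e. valid start characters) but that \w misses.
for cp in range(sys.maxunicode + 1):
    s = chr(cp)
    if s.isidentifier() and not cre.match(s):
        print(hex(cp), unicodedata.name(s))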

Valid identifiers not matched by \w:

0x1885 MONGOLIAN LETTER ALI GALI BALUDA
0x1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
0x2118 SCRIPT CAPITAL P
0x212e ESTIMATED SYMBOL

Beyond that, it appears that the clauses for the try-except at the start are backwards, causing python 2 to raise and python 3 to not have unicode identifier support.

mitsuhiko · 2017-07-01T22:31:08Z

I still need to add a test for this, but I fixed the reversed try-except clauses and added the missing characters. Oddly enough, I can only reproduce \w failing to match the latter two, not the Mongolian characters.

davidism · 2017-07-02T15:04:11Z

~~Need to also check not keyword.iskeyword(value) to catch things like True and class which str.isidentifier returns true for.~~

Or maybe not, since they'll raise SyntaxError later if used improperly.
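For reference, the behavior under discussion can be checked directly. This is a minimal sketch using only the standard library; `is_plain_name` is a hypothetical helper for illustration, not something in this PR.

```python
import keyword

# str.isidentifier only looks at the characters; it knows nothing about
# reserved words, so keywords pass the check too.
for name in ["foo", "class", "True", "1abc"]:
    print(name, name.isidentifier(), keyword.iskeyword(name))


def is_plain_name(value):
    """Hypothetical helper: a valid identifier that is not a Python keyword."""
    return value.isidentifier() and not keyword.iskeyword(value)
```

Whether the lexer should apply a check like this is exactly the open question above; the sketch only shows what the stricter test would look like.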

davidism · 2017-07-02T15:32:43Z

env.from_string('{{ ℮ }}') (or any of the four special case characters) raises TemplateSyntaxError: unexpected char '℮' at 3. Other Unicode characters work as expected.

davidism · 2017-07-02T16:20:31Z

Pushed a simple test, currently fails for the four special case characters.

davidism · 2017-07-03T14:39:57Z

The regex is failing specifically on the \b boundaries; if they are removed, the special cases work. This is because \b only matches at a transition between \w and \W, and the special cases are not in \w.
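A small reproduction of the boundary problem, as a sketch. The two patterns are simplified stand-ins rather than the actual name regex from this PR, and the result depends on \w continuing to exclude U+212E in the Python version used.

```python
import re

template = "{{ ℮ }}"

# U+212E is a valid identifier character, but \w does not match it, so
# there is no \w/\W transition on either side of it for \b to match.
with_boundary = re.compile(r"\b[\w\u212e]+\b")
without_boundary = re.compile(r"[\w\u212e]+")

print(with_boundary.search(template))     # no match expected
print(without_boundary.search(template))  # matches the single character
```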

mitsuhiko · 2017-07-03T14:41:42Z

I wonder if that was not already broken before.

davidism · 2017-07-03T14:48:39Z

Just checked, e.from_string('{{ ℮ }}') works on master.

davidism · 2017-07-03T15:00:20Z

I don't think the \b is necessary; it's not present in the previous stringdef regex. All tests still pass with the boundaries removed.

There were boundaries on the Python 2 regex, but they don't seem necessary either.

davidism · 2017-07-03T15:47:02Z

I reported the \w issue to Python: http://bugs.python.org/issue30838

@lf- thanks for helping figure this out.

davidism · 2017-07-03T16:15:27Z

Writing more tests based on the previous stringdef, https://raw.githubusercontent.com/pallets/jinja/2.9.5/jinja2/_stringdefs.py, turns up some invalid characters in the previous regex. The following start characters were matched but were not valid as start or continue characters:

\u309B KATAKANA-HIRAGANA VOICED SOUND MARK
\u309C KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

The following were matched and are valid continue characters, but are no longer matched by \w:

\u00B7 MIDDLE DOT
\u0387 GREEK ANO TELEIA

davidism · 2017-07-03T16:17:35Z

Then I realized we were missing a test in that script: ('a' + c).isidentifier() for characters that are valid for continue but not start. That reveals a lot more that aren't matched by \w.

2097 characters not matched. 😞
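The extended check can be sketched as below. This is a reconstruction under assumptions, not the exact script that was run; `unmatched_identifier_chars` is a made-up name, and the resulting count depends on the Unicode database bundled with the interpreter, so it may not equal the number quoted above.

```python
import re
import sys

word = re.compile(r"\w")


def unmatched_identifier_chars():
    """Collect characters usable in an identifier that the word class misses."""
    missing = []
    for cp in range(sys.maxunicode + 1):
        c = chr(cp)
        # c.isidentifier() covers start characters; ("a" + c).isidentifier()
        # additionally covers characters that are only valid as continue.
        if (c.isidentifier() or ("a" + c).isidentifier()) and not word.match(c):
            missing.append(c)
    return missing


missing = unmatched_identifier_chars()
print(len(missing))
```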

davidism · 2017-07-03T16:44:28Z

Going to have to bring back _stringdefs, but still simplified.

davidism · 2017-07-03T20:10:31Z

Original _stringdefs:

         len   sizeof
start    48194 96462
continue 49826 99726
total    98020 196188

New _identifier:

len: 635
sizeof: 2616

Timing is hard to test, but on my machine re.purge(); re.compile(...) is down from ~100 ms to ~840 µs.

Strategy was to collapse contiguous ranges and rely on str.isidentifier to validate so that the regex is simpler.
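A minimal sketch of the range-collapsing idea, assuming the input is the set of code points that \w misses. The function names and the short `extra` list are illustrative only; the real generator script and its full character set are not reproduced here.

```python
import re


def collapse_ranges(code_points):
    """Collapse code points into sorted, inclusive (start, end) ranges."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            # Extend the current range by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def build_class(ranges):
    """Render ranges as the body of a regex character class."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(re.escape(chr(start)))
        else:
            parts.append(re.escape(chr(start)) + "-" + re.escape(chr(end)))
    return "".join(parts)


# A handful of the characters mentioned in this thread, for illustration.
extra = [0x00B7, 0x0387, 0x1885, 0x1886, 0x2118, 0x212E]
ranges = collapse_ranges(extra)
name_re = re.compile("[\\w" + build_class(ranges) + "]+")
```

Adjacent code points such as 0x1885 and 0x1886 end up as a single `start-end` entry, which is where the size reduction comes from.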

mitsuhiko · 2017-07-03T20:12:59Z

That sounds good. Ideally we can reuse the same thing as we had before to regenerate the regex though we need to make sure we hit the set that is missing in all versions of Python 3.

davidism · 2017-07-04T17:10:33Z

At the cost of about twice as much space, the regex could be made more accurate by omitting \w and generating the full range of valid characters. At the cost of twice again as much space, we could remove the need for calling str.isidentifier during lexing by distinguishing start and continue characters in the regex. For now, I'm happy with where we are and will leave \w and isidentifier in.
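The trade-off kept here, a permissive regex followed by a str.isidentifier check, can be sketched roughly as follows. The pattern is a simplified stand-in with only a few extra characters, and `match_name` is a hypothetical helper, not the lexer code in this PR.

```python
import re

# Simplified stand-in for the name pattern: \w plus a few characters it misses.
name_re = re.compile(r"[\w\u00b7\u0387\u2118\u212e]+")


def match_name(source, pos=0):
    """Return the identifier starting at pos, or None if there is none."""
    m = name_re.match(source, pos)
    if m is None:
        return None
    value = m.group()
    # The regex does not distinguish start from continue characters, so
    # digits or continue-only characters can come first. str.isidentifier
    # makes the final decision.
    return value if value.isidentifier() else None
```

Distinguishing start and continue characters in the pattern itself would make the final check unnecessary, at the size cost described above.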

Use str.isidentifier to match idents on python 3

f823bdb

mitsuhiko added 2 commits July 1, 2017 23:29

Inversed invalid logic

d6a4a34

Added missing identifiers to the name re

2177fc4

This was referenced Jul 2, 2017

High compute-time penalty for Unicode identifiers #707

Closed

jinja2._stringdefs uses much RAM #666

Closed

davidism added 2 commits July 2, 2017 09:18

test for new identifier lexer

c5d78be

currently fails on special case unicode

only test master and maintenance branches

5de1f1b

avoids duplicate work for internal prs

davidism added 2 commits July 2, 2017 09:30

fix unicode for py2

c8b37d4

switch back to unicode escapes

896aed2

pallets deleted a comment from lf- Jul 2, 2017

remove unnecessary \b from name regex

1f1f031

davidism force-pushed the feature/kill-stringdefs branch from ad47a4b to 511384f Compare July 4, 2017 16:59

go back to generating regex, simplified

fb1e453

new version uses ~2KB vs 200KB memory, is ~100x faster to load
move script to generate pattern to scripts directory
add more tests

davidism force-pushed the feature/kill-stringdefs branch from 511384f to fb1e453 Compare July 4, 2017 17:00

davidism merged commit 47ef6a3 into master Jul 4, 2017

davidism deleted the feature/kill-stringdefs branch July 4, 2017 18:04

methane mentioned this pull request Sep 6, 2017

[Request] Could importing asyncio optional? #765

Closed

github-actions Bot locked as resolved and limited conversation to collaborators Nov 13, 2020
