Use str.isidentifier to match idents on Python 3 #731
Conversation
I confirmed that

```python
import re
import sys
import unicodedata

cre = re.compile(r'\w')
for cp in range(sys.maxunicode + 1):
    s = chr(cp)
    if s.isidentifier() and not re.match(cre, s):
        print(hex(cp), unicodedata.name(s))
```

prints the valid identifiers not matched by `\w`. Beyond that, it appears that the clauses of the try-except at the start are backwards, causing Python 2 to raise and Python 3 to not have Unicode identifier support.
Need to add a test to this one, but I fixed the reversed clauses and added the missing characters. Oddly enough I can only reproduce
Or maybe not, since they'll raise
currently fails on special case unicode
avoids duplicate work for internal prs
Pushed a simple test; it currently fails for the four special-case characters.
The regex is specifically failing on the
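As an aside, a quick way to enumerate the disagreement set between `str.isidentifier` and `\w` is a sketch like the one below. The set comes from Unicode's Other_ID_Start/Other_ID_Continue properties; U+2118 and U+212E are two known members per PEP 3131, not something stated in this PR.

```python
import re
import sys

word = re.compile(r"\w")

# Characters that are valid in identifiers but that \w does not match.
special = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if chr(cp).isidentifier() and not word.match(chr(cp))
]
print([hex(ord(c)) for c in special])
```

The exact output depends on the Unicode version shipped with the interpreter, which is why hard-coding such a list in a regex is fragile.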
I wonder if that was not already broken before. |
Just checked,
I don't think the

There were boundaries on the Python 2 regex, but they don't seem necessary either.
I reported the

@lf- thanks for helping figure this out.
Writing more tests based on the previous `_stringdefs` module, https://raw.githubusercontent.com/pallets/jinja/2.9.5/jinja2/_stringdefs.py, turns up some invalid characters in the previous regex. The following start characters were matched but were not valid as start or continue characters:

The following were matched and are valid continue characters, but are no longer matched by
Then I realized we were missing a test in that script: 2097 characters not matched. 😞
Going to have to bring back
Original:

New:

Timing is hard to test, but on my machine

Strategy was to collapse contiguous ranges and rely on
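The range-collapsing step could be sketched like this. This is a minimal illustration of the general technique, not the PR's actual generator script; `collapse` and `to_char_class` are names I made up for the sketch.

```python
import re
import sys


def collapse(codepoints):
    """Collapse a sorted iterable of codepoints into inclusive (start, end) runs."""
    ranges = []
    for cp in codepoints:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current contiguous run
        else:
            ranges.append([cp, cp])  # start a new run
    return [tuple(r) for r in ranges]


def to_char_class(ranges):
    """Render runs as a regex character class, e.g. [a-z] for (97, 122)."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(re.escape(chr(start)))
        else:
            parts.append(f"{re.escape(chr(start))}-{re.escape(chr(end))}")
    return "[" + "".join(parts) + "]"


# Example: all non-ASCII codepoints Python allows to continue an identifier.
extra = [
    cp for cp in range(128, sys.maxunicode + 1) if ("a" + chr(cp)).isidentifier()
]
pattern = re.compile(to_char_class(collapse(extra)))
```

Encoding runs as `start-end` pairs inside a single character class is what keeps the compiled pattern small compared to enumerating every codepoint individually.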
That sounds good. Ideally we can reuse the same thing we had before to regenerate the regex, though we need to make sure we hit the set that is missing in all versions of Python 3.
Force-pushed from ad47a4b to 511384f
- new version uses ~2KB vs 200KB memory, is ~100x faster to load
- move script to generate pattern to scripts directory
- add more tests
Force-pushed from 511384f to fb1e453
At the cost of about twice as much space, the regex could be made more accurate by omitting
Not sure I like this very much, and I still need to ensure that `\w` is a superset of the identifier ranges. It is, however, also what tokenize does on Python 3, so there is precedent. Moreover, the error message is now different for bad identifiers, but I assume not many users will notice.

Refs #729
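A sketch of how one might probe where `\w` over-matches, i.e. characters `\w` accepts that are not valid anywhere in an identifier. This over-matching is why the error message for bad identifiers changes: such characters are now lexed as part of a name and rejected later. The superscript-two example is my illustration, not something cited in the PR.

```python
import re
import sys

word = re.compile(r"\w")

# Characters \w matches that are valid neither at the start nor in the
# continuation of a Python identifier (e.g. superscript digits like "\u00b2").
overmatched = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if word.match(chr(cp))
    and not chr(cp).isidentifier()
    and not ("a" + chr(cp)).isidentifier()
]
```

If this list is non-empty, a `\w`-based name pattern accepts tokens that `str.isidentifier` would reject, which is the accuracy trade-off discussed above.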