Apply NFKC normalization to unicode identifiers in the lexer #10412
AlexWaygood merged 6 commits into astral-sh:main
Conversation
dhruvmanila
left a comment
This is great with such a minimal change. Can you please write / update documentation around this change? Maybe the lex_identifier or the Name token?
It's good to go from my side but I'll let @MichaReiser give the final approval.
Thanks! I had a go at writing some docs in d785be5 -- how does that look to you?
Looks good. Thank you!
///
/// Unicode names are NFKC-normalized by the lexer,
/// matching [the behaviour of Python's lexer](https://docs.python.org/3/reference/lexical_analysis.html#identifiers)
name: Box<str>,
I was eventually hoping to remove all owned data from the lexer tokens (e.g., prior to this change, we could've conceivably removed this field altogether; if we remove more similar fields from other tokens, we can eventually reduce the size of the Tok enum, which could be great for performance). This now precludes us from doing so. But I don't have enough context on the future design of the lexer-parser to know if it matters.
I think we can still accomplish this, and Name isn't the only token that must return the parsed value (e.g. FStrings etc.).
What I have in mind is to:
- Replace Tok with TokenKind (which holds no data)
- Add a take_value() method on Lexer that "takes" the current value (an enum over Name, Int etc.)
This design also fits better into our new parser, which already does exactly this internally (except that it "takes" the value from Tok). The advantage is that we only pay the overhead of reading or writing a value if it is a value token (and we're interested in the value).
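The proposed split can be sketched in a few lines. This is only an illustration of the shape of the design, in Python rather than Rust; the names (`TokenKind`, `next_token`, `take_value`) follow the comment above but are not ruff's actual API:

```python
from enum import Enum, auto

class TokenKind(Enum):
    # Bare kinds carry no payload, so the token enum stays small.
    NAME = auto()
    INT = auto()
    NEWLINE = auto()

class Lexer:
    """The payload for the most recent value-carrying token lives on the
    lexer itself and is handed out on demand, not stored in every token."""

    def __init__(self):
        self._value = None

    def next_token(self, kind, value=None):
        # Stand-in for real lexing: only value tokens write a payload.
        self._value = value
        return kind

    def take_value(self):
        # The parser calls this only when it needs the payload, so
        # non-value tokens never pay for reading or writing one.
        value, self._value = self._value, None
        return value

lexer = Lexer()
assert lexer.next_token(TokenKind.NAME, "x") is TokenKind.NAME
assert lexer.take_value() == "x"
```

The key property is that a `TokenKind` is just a discriminant; consuming the value is an explicit, one-shot operation driven by the parser.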
This is a good point. We could potentially move this to the parse_identifier method in the parser as that's the final destination for where this value needs to be. I could come back to this once the new parser is merged and I'm looking into the feedback loop between the lexer and parser.
charliermarsh
left a comment
Looks reasonable to me, I'll defer to Micha and Dhruv to approve.
Thanks all!
Summary
A second attempt to fix #5003, hopefully without the performance problems that #10381 suffered from.
Python applies NFKC normalization to identifiers that use unicode characters. That means that F821 should not be emitted if ruff encounters the following snippet (but on main, it is), as from Python's perspective, these are all the same identifier:

This PR fixes that false positive by NFKC-normalizing identifiers as they are encountered in the lexer.
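The normalization in question can be demonstrated with Python's standard library. This is only an illustration of the behaviour being matched, not ruff's Rust implementation:

```python
import unicodedata

# U+FB01 (LATIN SMALL LIGATURE FI) and fullwidth letters are distinct
# code points, but NFKC folds them onto their plain ASCII spellings.
assert unicodedata.normalize("NFKC", "\ufb01le") == "file"
assert unicodedata.normalize("NFKC", "ｎａｍｅ") == "name"

# CPython applies the same normalization to identifiers at compile time,
# so the ligature spelling binds the plain name "file".
namespace = {}
exec("\ufb01le = 1", namespace)
assert "file" in namespace
```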
Test Plan
cargo test