formatter: multi char tokens in SimpleTokenizer #5610
formatter: multi char tokens in SimpleTokenizer #5610MichaReiser merged 12 commits intoastral-sh:mainfrom
Conversation
7af5c9c to
9671e62
Compare
PR Check ResultsEcosystem✅ ecosystem check detected no changes. BenchmarkLinuxWindows |
|
|
||
| c => { | ||
| let kind = TokenKind::from_non_trivia_char(c); | ||
| let kind = if self.eat_multichar("if", c) { |
There was a problem hiding this comment.
I believe this will incorrectly lex ify as an if keyword because it doesn't test what follows after. Lexing out keyword is a bit harder than I assumed it is.
I'm also unsure if LLVM is able to short-circuit if the expression, for example, starts with a number.
I think the way I would implement this is to implement lexing for identifiers, and then use a match on the lexed identifier to test if it is a keyword. You'll need to add a dependency to unic-ucd-ident (used by RustPython's lexer) to detect the start/end of an identifier.
if is_identifier_start(c) {
self.cursor.eat_while(is_identifier_continuation);
// If we want to support all identifiers, then we need to handle the string prefixes `b'`, `rf` etc. but we can ignore those for keywords
let range = TextRange::at(self.offset, token_len);
let kind = match &self.source[range] { // This requires storing the whole source on the tokenizer. I think that's fine
"if" => TokenKind::If,
"else" => TokenKind::Else,
"match" => TokenKind::Match // Match is a soft keyword that depends on the context but we can always lex it as a keyword and leave it to the caller (parser) to decide if it should be handled as an identifier or keyword.
...,
_ => TokenKind::Other // Potentially an identifier, but only if it isn't a string prefix. We can ignore this for now https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
}
}
fn is_identifier_start(c: char) -> bool {
c.is_ascii_alphabetic() || c == '_' || is_non_ascii_identifier_start(c)
}
// Checks if the character c is a valid continuation character as described
// in https://docs.python.org/3/reference/lexical_analysis.html#identifiers
fn is_identifier_continuation(c: char) -> bool {
if c.is_ascii() {
matches!(c, 'a'..='z' | 'A'..='Z' | '_' | '0'..='9')
} else {
is_xid_continue(c)
}
}
fn is_non_ascii_identifier_start(c: char) -> bool {
is_xid_start(c)
}Something similar should work for the backward parsing, except that it must start with an is_non_ascii_identifier_start and the last (or first, depending on how you look at the problem) must be is_identifier_start . We should be able to move the match expression into some shared function.
|
Neat. Thank you for working on this. I think there's an edge case that I didn't consider initially, making this a little less simple than I assumed (we may also need to rename the |
|
Thanks, will have a go |
|
in unicode fun, some multi-character grapheme clusters (including one from we can maybe fix by using something like https://crates.io/crates/unicode-segmentation but i'm not sure of the perf implications and i guess we don't need this while we only support keyword tokens |
| TokenKind::Continuation | ||
| } else { | ||
| let kind = TokenKind::from_non_trivia_char(c); | ||
| let kind = if is_identifier_start(c) { |
There was a problem hiding this comment.
I believe we need to change this to test if it is an id continuation because we are testing the last character and not the first
There was a problem hiding this comment.
ah yes. but i guess we also need to change the strategy a bit, since we must _finish with an identifier_start? ah you mentioned this below
5407cac to
e6d9551
Compare
MichaReiser
left a comment
There was a problem hiding this comment.
This is awesome. Well done! I'm sorry that it turned out much more complicated than I thought. Let me know when you're ready to merge.
| .next() | ||
| .is_none(), | ||
| "Didn't expect any other token" | ||
| token.kind() == token_kind, |
There was a problem hiding this comment.
Nit: We can use debug_assert_eq now to compare the kinds.
| token.kind() == token_kind, | ||
| "expected a `{token_kind:?}` token", | ||
| ); | ||
| debug_assert!(tokens.next().is_none(), "Didn't expect any other token"); |
There was a problem hiding this comment.
Nit: We can use debug_assert_eq(..., None) here. It should give us a better error message when it throws (I believe)
| let token_len = self.cursor.token_len(); | ||
|
|
||
| let range = TextRange::at(self.offset, token_len); | ||
| self.parse_identifier(range) |
There was a problem hiding this comment.
Nit: The method name indicates to me that this parses an identifier and not a keyword. I haven't been able to come up with a name that I like, but was thinking of match_keyword or to_keyword_or_other
There was a problem hiding this comment.
sure, i don't mind. i also considered making it a method on TokenKind (similar to from_non_trivia_char)
| if self.source[range] | ||
| .chars() | ||
| .next() | ||
| .is_some_and(is_identifier_start) | ||
| { |
There was a problem hiding this comment.
Not really. You could use let mut identifier_start = c and then update the variable in the eat_back_while callback if it is an identifier_continuation but I would prefer what you have right now.
| /// `match` | ||
| Match, | ||
|
|
||
| /// Any other non trivia token. Always has a length of 1 |
There was a problem hiding this comment.
oops i guess this is no longer true (length 1)
|
ok i think that's ready from me now |
|
thanks for the impl help and the review! |
|
Thank you for implementing. This is so cool :) Not long, and our |
No description provided.