Implement re-lexing logic for better error recovery #11845
Conversation
```
1 | from x import (a, b
2 | 1 + 1
3 | from x import (a, b,
  |                     ^ Syntax Error: Expected ')', found newline
```
Woah, this is so much better. Nice!
What happens if we have something like … In this case, the parser should recover from the unclosed …
Is the code snippet up to date, as you've mentioned?
MichaReiser left a comment
I love this. It's so exciting to see how the hard work is paying off and we now get much better error messages.
I've left some open questions before approving, but this is definitely looking good!
Do you know how our messages compare with CPython or Pyright? Do you see any significant difference?
```rust
/// and the caller is responsible for updating its state accordingly.
///
/// This method is a no-op if the lexer isn't in a parenthesized context.
pub(crate) fn re_lex_logical_token(&mut self) -> bool {
```
What's your reasoning for returning a bool over the re-lexed kind of the token?
I think initially I kept it as Option<TokenKind> but the current usage doesn't really require the token itself. I don't think there's any other particular reason to use bool apart from that.
What I meant was to return the re-lexed token kind, or the current token kind if no re-lexing happens, so that it's just a `TokenKind` (in line with `lex_token`). But if all we need is a bool, then that's fine.
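A minimal sketch of the alternative signature suggested here, using invented stand-in types (`TokenKind`, `Lexer`, `nesting` are illustrative, not the actual ruff definitions):

```rust
// Illustrative stand-ins for the parser types; not the actual ruff API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    NonLogicalNewline,
    Newline,
    Comma,
}

struct Lexer {
    current: TokenKind,
    nesting: u32, // parenthesization depth (assumption for this sketch)
}

impl Lexer {
    /// Variant that mirrors `lex_token`: always returns a `TokenKind`,
    /// either the re-lexed kind or the unchanged current kind.
    fn re_lex_logical_token(&mut self) -> TokenKind {
        // Only a non-logical newline inside parentheses gets re-lexed
        // into a logical `Newline` during error recovery.
        if self.nesting > 0 && self.current == TokenKind::NonLogicalNewline {
            self.nesting = 0;
            self.current = TokenKind::Newline;
        }
        self.current
    }
}

fn main() {
    let mut lexer = Lexer { current: TokenKind::NonLogicalNewline, nesting: 2 };
    assert_eq!(lexer.re_lex_logical_token(), TokenKind::Newline);

    // No-op outside a parenthesized context.
    let mut flat = Lexer { current: TokenKind::Comma, nesting: 0 };
    assert_eq!(flat.re_lex_logical_token(), TokenKind::Comma);
}
```

The caller can then compare the returned kind against the previous one if it still needs the boolean "did anything change" signal.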
```rust
/// previous current token. This also means that the current position of the lexer has changed
/// and the caller is responsible for updating its state accordingly.
///
/// This method is a no-op if the lexer isn't in a parenthesized context.
```
I would expand the comment with an explanation of which tokens this is relevant for, e.g. which tokens may change in this situation or what kind of different tokens could be emitted.
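One possible expansion along these lines; the wording is a suggestion based on the behavior discussed in this PR, not the merged docs:

```rust
/// Re-lexes the current token, assuming the enclosing parenthesis was
/// never closed.
///
/// This is only relevant when the lexer is inside a parenthesized context
/// and the current token is a `NonLogicalNewline`: error recovery may
/// decide the parenthesis is unclosed, in which case the newline must be
/// re-emitted as a logical `Newline` token so that statement-level
/// recovery can kick in.
///
/// This method is a no-op if the lexer isn't in a parenthesized context.
```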
```
1 | class Foo[T1, *T2(a, b):
  |                  ^ Syntax Error: Expected ']', found '('
2 |     pass
3 | x = 10
```
This is pretty cool! Nice to see how all the work accumulated to now having a single, accurate error message.
```
3 | def foo():
```
Yes, no more EOF parse errors!
```
2 | def foo():
  | ^^^ Syntax Error: Expected an indented block after function definition
```

```
28 | # The lexer is nested with multiple levels of parentheses
29 | if call(foo, [a, b
   |                   ^ Syntax Error: Expected ']', found NonLogicalNewline
```
Should we map `NonLogicalNewline` to "newline"? It seems an odd error message to show to users.
Yeah, it is odd. I could map it to "newline" in the `Display` implementation.
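A sketch of that mapping, on an illustrative subset of the token kinds (the enum and variant names here are stand-ins, not the actual ruff definitions):

```rust
use std::fmt;

// Illustrative subset of the token kinds; not the actual ruff enum.
#[derive(Debug, Clone, Copy)]
enum TokenKind {
    Newline,
    NonLogicalNewline,
    Comma,
}

impl fmt::Display for TokenKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = match self {
            // Map both newline kinds to the user-facing word "newline"
            // so error messages never mention `NonLogicalNewline`.
            TokenKind::Newline | TokenKind::NonLogicalNewline => "newline",
            TokenKind::Comma => "','",
        };
        f.write_str(s)
    }
}

fn main() {
    assert_eq!(TokenKind::NonLogicalNewline.to_string(), "newline");
    println!("Expected ']', found {}", TokenKind::NonLogicalNewline);
}
```

With this, the message above would read `Expected ']', found newline` instead of exposing the internal token name.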
Uhm, not sure what happened here:

```
(a, [b,
c
)
```
Ok, so this is a bit problematic because when the lexer emitted the …
Ok, I've fixed this bug and expanded the documentation and inline comments.
The CPython parser highlights two kinds of errors, which is probably done by the lexer:
While Pyright gets really confused:
CodSpeed Performance Report: Merging #11845 will not alter performance.
## Summary

This PR is a follow-up on #11845 to add the re-lexing logic for normal list parsing. Normal list parsing is basically parsing elements without any separator in between, i.e., there can only be trivia tokens between two elements. Currently, this is only being used for parsing **assignment statements** and **f-string elements**. Assignment statements cannot be in a parenthesized context, but f-strings can have curly braces, so this PR is specifically for them.

I don't think this is an ideal recovery, but the problem is that both the lexer and the parser could add an error for f-strings. If the lexer adds an error, it'll emit an `Unknown` token instead, while the parser adds the error directly. I think we'd need to move all f-string errors to be emitted by the parser instead. This way the parser can correctly inform the lexer that it's out of an f-string, and then the lexer can pop the current f-string context off the stack.

## Test Plan

Add test cases, update the snapshots, and run the fuzzer.
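A rough model of "normal list" parsing as described above, where elements are separated only by trivia and anything else terminates the list so the caller can recover; the names here are invented for illustration, not ruff's actual parser API:

```rust
// Illustrative token set; not the actual ruff token stream.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tok {
    Element,
    Trivia,     // comments, non-logical whitespace
    CloseBrace, // e.g. the `}` ending an f-string replacement field
    Eof,
}

/// Parse consecutive elements with no separator token required:
/// only trivia may sit between two elements, and any other token
/// ends the list, handing control back to the caller for recovery.
fn parse_normal_list(tokens: &[Tok]) -> usize {
    let mut count = 0;
    for tok in tokens {
        match tok {
            Tok::Element => count += 1,
            Tok::Trivia => continue,
            _ => break, // not an element: stop instead of consuming it
        }
    }
    count
}

fn main() {
    let toks = [Tok::Element, Tok::Trivia, Tok::Element, Tok::CloseBrace, Tok::Eof];
    assert_eq!(parse_normal_list(&toks), 2);
}
```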
Summary
This PR implements the re-lexing logic in the parser.
This logic is only applied when recovering from an error during list parsing. The logic is as follows:
- the lexer emits a `Newline` token instead of `NonLogicalNewline`.

It turns out that the list parsing isn't that happy with the results, so it requires some re-arranging such that the following two errors are raised correctly:
For (1), the following scenarios need to be considered:
For (2), the following scenarios need to be considered:
(see the `eat` call above). And the parser doesn't take the re-lexing route on a comma token.

resolves: #11640
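The recovery flow described above can be sketched roughly as follows; this is a simplified model with invented names (`Parser`, `recover_in_list`, an in-place token rewrite standing in for actually moving the lexer back), not ruff's real implementation:

```rust
// Illustrative token kinds; not the actual ruff enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    NonLogicalNewline,
    Newline,
    Comma,
    Name,
    Eof,
}

struct Parser {
    tokens: Vec<TokenKind>,
    pos: usize,
    nesting: u32, // parenthesization depth
}

impl Parser {
    fn current(&self) -> TokenKind {
        *self.tokens.get(self.pos).unwrap_or(&TokenKind::Eof)
    }

    /// Error recovery while parsing a parenthesized list: if we're inside
    /// parentheses and hit a non-logical newline, assume the parenthesis
    /// was never closed and "re-lex" the token as a logical `Newline`,
    /// letting the list parser terminate instead of eating the rest of
    /// the file.
    fn recover_in_list(&mut self) -> bool {
        if self.nesting > 0 && self.current() == TokenKind::NonLogicalNewline {
            self.tokens[self.pos] = TokenKind::Newline;
            self.nesting -= 1;
            return true; // re-lexed; the caller re-checks the current token
        }
        false
    }
}

fn main() {
    // Models `from x import (a, b` followed by a newline.
    let mut p = Parser {
        tokens: vec![
            TokenKind::Name,
            TokenKind::Comma,
            TokenKind::Name,
            TokenKind::NonLogicalNewline,
        ],
        pos: 3,
        nesting: 1,
    };
    assert!(p.recover_in_list());
    assert_eq!(p.current(), TokenKind::Newline);
}
```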
Test Plan