Implement re-lexing logic for better error recovery #11845
Conversation
```
1 | from x import (a, b
2 | 1 + 1
3 | from x import (a, b,
  |                     ^ Syntax Error: Expected ')', found newline
```
Woah, this is so much better. Nice!
What happens if we have something like … In this case, the parser should recover from the unclosed …
Is the code snippet up to date, as you've mentioned?
MichaReiser left a comment
I love this. It's so exciting to see how the hard work is paying off and we now get much better error messages.
I've left some open questions before approving, but this is definitely looking good!
Do you know how our messages compare with CPython or Pyright? Do you see any significant difference?
```rust
/// and the caller is responsible for updating its state accordingly.
///
/// This method is a no-op if the lexer isn't in a parenthesized context.
pub(crate) fn re_lex_logical_token(&mut self) -> bool {
```
What's your reasoning for returning a bool over the re-lexed kind of the token?
I think initially I kept it as Option<TokenKind> but the current usage doesn't really require the token itself. I don't think there's any other particular reason to use bool apart from that.
What I meant was to return the re-lexed token kind, or the current token kind if no re-lexing happens, so that it's just a `TokenKind` (in line with `lex_token`). But if all we need is a bool, then that's fine.
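A minimal sketch of the alternative signature suggested here, using invented stand-in types (`TokenKind`, `Lexer`, `nesting` are illustrative, not the actual ruff definitions):

```rust
// Illustrative stand-ins for the parser types; not the actual ruff API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    NonLogicalNewline,
    Newline,
    Comma,
}

struct Lexer {
    current: TokenKind,
    nesting: u32, // parenthesization depth (assumption for this sketch)
}

impl Lexer {
    /// Variant that mirrors `lex_token`: always returns a `TokenKind`,
    /// either the re-lexed kind or the unchanged current kind.
    fn re_lex_logical_token(&mut self) -> TokenKind {
        // Only a non-logical newline inside parentheses gets re-lexed
        // into a logical `Newline` during error recovery.
        if self.nesting > 0 && self.current == TokenKind::NonLogicalNewline {
            self.nesting = 0;
            self.current = TokenKind::Newline;
        }
        self.current
    }
}

fn main() {
    let mut lexer = Lexer { current: TokenKind::NonLogicalNewline, nesting: 2 };
    assert_eq!(lexer.re_lex_logical_token(), TokenKind::Newline);

    // No-op outside a parenthesized context.
    let mut flat = Lexer { current: TokenKind::Comma, nesting: 0 };
    assert_eq!(flat.re_lex_logical_token(), TokenKind::Comma);
}
```

The caller can then compare the returned kind against the previous one if it still needs the boolean "did anything change" signal.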
```rust
/// previous current token. This also means that the current position of the lexer has changed
/// and the caller is responsible for updating its state accordingly.
///
/// This method is a no-op if the lexer isn't in a parenthesized context.
```
I would expand the comment with an explanation of which tokens this is relevant for, e.g. which tokens may change in this situation or what kind of different tokens could be emitted.
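One possible expansion along these lines; the wording is a suggestion based on the behavior discussed in this PR, not the merged docs:

```rust
/// Re-lexes the current token, assuming the enclosing parenthesis was
/// never closed.
///
/// This is only relevant when the lexer is inside a parenthesized context
/// and the current token is a `NonLogicalNewline`: error recovery may
/// decide the parenthesis is unclosed, in which case the newline must be
/// re-emitted as a logical `Newline` token so that statement-level
/// recovery can kick in.
///
/// This method is a no-op if the lexer isn't in a parenthesized context.
```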
```
1 | class Foo[T1, *T2(a, b):
  |                  ^ Syntax Error: Expected ']', found '('
2 |     pass
3 | x = 10
```
This is pretty cool! Nice to see how all the work accumulated to now having a single, accurate error message.
```
3 | def foo():
```
Yes, no more EOF parse errors!
```
2 | def foo():
  | ^^^ Syntax Error: Expected an indented block after function definition
```

```
28 | # The lexer is nested with multiple levels of parentheses
29 | if call(foo, [a, b
   |                   ^ Syntax Error: Expected ']', found NonLogicalNewline
```
Should we map `NonLogicalNewline` to "newline"? It seems an odd error message to show to users.
Yeah, it is odd. I could map it to "newline" in the `Display` implementation.
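A sketch of that mapping, on an illustrative subset of the token kinds (the enum and variant names here are stand-ins, not the actual ruff definitions):

```rust
use std::fmt;

// Illustrative subset of the token kinds; not the actual ruff enum.
#[derive(Debug, Clone, Copy)]
enum TokenKind {
    Newline,
    NonLogicalNewline,
    Comma,
}

impl fmt::Display for TokenKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = match self {
            // Map both newline kinds to the user-facing word "newline"
            // so error messages never mention `NonLogicalNewline`.
            TokenKind::Newline | TokenKind::NonLogicalNewline => "newline",
            TokenKind::Comma => "','",
        };
        f.write_str(s)
    }
}

fn main() {
    assert_eq!(TokenKind::NonLogicalNewline.to_string(), "newline");
    println!("Expected ']', found {}", TokenKind::NonLogicalNewline);
}
```

With this, the message above would read `Expected ']', found newline` instead of exposing the internal token name.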
Uhm, not sure what happened here:

```
(a, [b,
c
)
```
Ok, so this is a bit problematic because when the lexer emitted the …
Ok, I've fixed this bug and expanded the documentation and inline comments.
The CPython parser highlights two kinds of errors, which is probably done by the lexer:
While Pyright gets really confused:
CodSpeed Performance Report: Merging #11845 will not alter performance.
## Summary

This PR is a follow-up on #11845 to add the re-lexing logic for normal list parsing. Normal list parsing is basically parsing elements without any separator in between, i.e., there can only be trivia tokens between two elements. Currently, this is only being used for parsing **assignment statements** and **f-string elements**. Assignment statements cannot be in a parenthesized context, but f-strings can have curly braces, so this PR is specifically for them.

I don't think this is an ideal recovery, but the problem is that both the lexer and the parser could add an error for f-strings. If the lexer adds an error, it'll emit an `Unknown` token instead, while the parser adds the error directly. I think we'd need to move all f-string errors to be emitted by the parser instead. This way the parser can correctly inform the lexer that it's out of an f-string, and then the lexer can pop the current f-string context off the stack.

## Test Plan

Add test cases, update the snapshots, and run the fuzzer.
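A rough model of "normal list" parsing as described above, where elements are separated only by trivia and anything else terminates the list so the caller can recover; the names here are invented for illustration, not ruff's actual parser API:

```rust
// Illustrative token set; not the actual ruff token stream.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tok {
    Element,
    Trivia,     // comments, non-logical whitespace
    CloseBrace, // e.g. the `}` ending an f-string replacement field
    Eof,
}

/// Parse consecutive elements with no separator token required:
/// only trivia may sit between two elements, and any other token
/// ends the list, handing control back to the caller for recovery.
fn parse_normal_list(tokens: &[Tok]) -> usize {
    let mut count = 0;
    for tok in tokens {
        match tok {
            Tok::Element => count += 1,
            Tok::Trivia => continue,
            _ => break, // not an element: stop instead of consuming it
        }
    }
    count
}

fn main() {
    let toks = [Tok::Element, Tok::Trivia, Tok::Element, Tok::CloseBrace, Tok::Eof];
    assert_eq!(parse_normal_list(&toks), 2);
}
```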
Summary
This PR implements the re-lexing logic in the parser.
This logic is only applied when recovering from an error during list parsing. The logic is as follows:
- the lexer emits a `Newline` token instead of `NonLogicalNewline`.

It turns out that the list parsing isn't that happy with the results, so it requires some re-arranging such that the following two errors are raised correctly:
For (1), the following scenarios need to be considered:
For (2), the following scenarios need to be considered:
(see the `eat` call above). And the parser doesn't take the re-lexing route on a comma token.

resolves: #11640
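The recovery flow described above can be sketched roughly as follows; this is a simplified model with invented names (`Parser`, `recover_in_list`, an in-place token rewrite standing in for actually moving the lexer back), not ruff's real implementation:

```rust
// Illustrative token kinds; not the actual ruff enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    NonLogicalNewline,
    Newline,
    Comma,
    Name,
    Eof,
}

struct Parser {
    tokens: Vec<TokenKind>,
    pos: usize,
    nesting: u32, // parenthesization depth
}

impl Parser {
    fn current(&self) -> TokenKind {
        *self.tokens.get(self.pos).unwrap_or(&TokenKind::Eof)
    }

    /// Error recovery while parsing a parenthesized list: if we're inside
    /// parentheses and hit a non-logical newline, assume the parenthesis
    /// was never closed and "re-lex" the token as a logical `Newline`,
    /// letting the list parser terminate instead of eating the rest of
    /// the file.
    fn recover_in_list(&mut self) -> bool {
        if self.nesting > 0 && self.current() == TokenKind::NonLogicalNewline {
            self.tokens[self.pos] = TokenKind::Newline;
            self.nesting -= 1;
            return true; // re-lexed; the caller re-checks the current token
        }
        false
    }
}

fn main() {
    // Models `from x import (a, b` followed by a newline.
    let mut p = Parser {
        tokens: vec![
            TokenKind::Name,
            TokenKind::Comma,
            TokenKind::Name,
            TokenKind::NonLogicalNewline,
        ],
        pos: 3,
        nesting: 1,
    };
    assert!(p.recover_in_list());
    assert_eq!(p.current(), TokenKind::Newline);
}
```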
Test Plan