Attempt error recovery in the TerminatedBy parsers by Xanewok · Pull Request #567 · NomicFoundation/slang

Xanewok · 2023-08-14T12:48:33Z

Based on #564

EDIT:
As asked, this was split into supporting PRs (#579 and #580, both merged now); it no longer uses the Error CST node and only contains terminated-by recovery.

To support an error side-channel in a backtracking scenario, a Vec<ParseError> was added to the Stream, which now serves more as a parse context. To backtrack, we first record the position of the stream and the number of errors emitted so far in a Marker struct; later, we reset the errors and the position to that recorded Marker (basically copying what chumsky does).

This wasn't changed everywhere, as Stream::set_position in the lexer will never emit errors (only parsers do), nor will the optional trivia parsers. I'm happy to polish the lexer interface a bit to accommodate this change later, if that's okay. First, I wanted to make sure the approach is fine and accepted before we proceed with the DelimitedBy error recovery.
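The marker-based backtracking described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `mark`, `rewind`, `eat` and `emit_error` method names and the fields of `ParseError` are assumptions, and "resetting the errors" is assumed to mean truncating the error list back to the recorded length.

```rust
/// Hypothetical error type; the real `ParseError` likely carries more detail.
#[derive(Debug, Clone, PartialEq)]
pub struct ParseError {
    pub position: usize,
    pub message: String,
}

/// Snapshot of the context taken before a speculative parse.
#[derive(Debug, Clone, Copy)]
pub struct Marker {
    position: usize,
    error_count: usize,
}

/// Parse context: input cursor plus the error side-channel.
pub struct Stream<'s> {
    source: &'s str,
    position: usize,
    errors: Vec<ParseError>,
}

impl<'s> Stream<'s> {
    pub fn new(source: &'s str) -> Self {
        Self { source, position: 0, errors: Vec::new() }
    }

    /// Record where we are and how many errors exist so far.
    pub fn mark(&self) -> Marker {
        Marker { position: self.position, error_count: self.errors.len() }
    }

    /// Backtrack: restore the position and drop errors emitted since `marker`.
    pub fn rewind(&mut self, marker: Marker) {
        self.position = marker.position;
        self.errors.truncate(marker.error_count);
    }

    /// Consume `text` if it is next in the input.
    pub fn eat(&mut self, text: &str) -> bool {
        if self.source[self.position..].starts_with(text) {
            self.position += text.len();
            true
        } else {
            false
        }
    }

    /// Push an error at the current position onto the side-channel.
    pub fn emit_error(&mut self, message: &str) {
        self.errors.push(ParseError { position: self.position, message: message.to_owned() });
    }

    pub fn position(&self) -> usize {
        self.position
    }

    pub fn errors(&self) -> &[ParseError] {
        &self.errors
    }
}
```

Errors emitted before the marker survive a rewind, while errors from the abandoned attempt are discarded together with the consumed input.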

changeset-bot · 2023-08-14T12:48:36Z

⚠️ No Changeset found

Latest commit: a855405

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

OmarTawfik · 2023-08-24T06:25:01Z

.changeset/wicked-beds-buy.md

+"@nomicfoundation/slang": minor
+---
+
+We attempt simple CST error recovery for common "terminated-by" and "delimited-by" groups instead of bailing out on incomplete/invalid input


nit: "terminated-by" and "delimited-by" are internal grammar concepts. I suggest using user-facing terms. Maybe something like "terminators (semicolon)" and "delimiters (brackets)" instead?

crates/codegen/grammar/src/dsl.rs

crates/codegen/grammar/src/grammar.rs

crates/codegen/parser/runtime/src/cst.rs

OmarTawfik · 2023-08-24T06:50:49Z

crates/solidity/outputs/npm/tests/src/errors.ts

  expect(errors).toHaveLength(1);

-  const report = errors[0]?.toErrorReport("test.sol", source, /* withColor */ false);
+  let reports = errors.map((error: any) => error.toErrorReport("test.sol", source, /* withColor */ false));


I wonder why the change here? We already asserted it has a length of 1 above. Also, why would we use any here?

The change was here because at some point two errors were emitted, and I blessed the snapshots every time for consistency; I didn't revert that after it went back to only one error. The any was because the type was not inferred, but now I see that it's unresolved in Node.js rather than pointing to parse_output.ParseOutput. I can fix that in a follow-up and revert the code change here.

because the type was not inferred but now I see that it's unresolved in Node.js

Is the generated .d.ts broken somehow? Not blocking for this PR, but I would have expected the earlier test to fail if it cannot resolve the types, since we compile with strict: true?

OmarTawfik · 2023-08-24T06:54:04Z

crates/solidity/testing/utils/src/cst_snapshots/test_nodes.rs

    Rule(RuleKind),
    Token(TokenKind),
    Trivia(TokenKind),
+    Error,


I suggest storing/rendering the error message from ErrorNode as well, instead of dropping it.

I understand the consensus is to drop the Node, so this probably won't end up in a CST

OmarTawfik · 2023-08-24T06:57:26Z

crates/solidity/testing/snapshots/cst_output/Block/unchecked/generated/0.4.11-failure.yml


 Tree:
-  - Block (Rule): # 0..24 "{ unchecked { x = 1; } }"
+  - Block (Rule): # 0..22 "{ unchecked { x = 1; }"


The Block changed here from 24 to 22 bytes, as we dropped the last }. Is that intentional?

The CST should always be complete (covers the entire input).

I can bring back the SKIPPED token handling for the incomplete parse
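To make the completeness requirement concrete, here is a small sketch of how unparsed trailing input could be wrapped in a SKIPPED token so that the tree always covers the whole input. The `Node` type, the `complete` function and the "Root" wrapper rule are all hypothetical simplifications, and the sketch assumes the parsed node starts at offset 0 and covers a contiguous prefix.

```rust
/// Hypothetical, simplified CST node; the real node types carry kinds and ranges.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Rule { name: &'static str, children: Vec<Node> },
    Token { name: &'static str, text: String },
}

impl Node {
    /// Number of input bytes covered by this node.
    pub fn text_len(&self) -> usize {
        match self {
            Node::Rule { children, .. } => children.iter().map(Node::text_len).sum(),
            Node::Token { text, .. } => text.len(),
        }
    }
}

/// Make sure the result covers the whole input: any unparsed trailing
/// text becomes a single SKIPPED token next to the parsed node.
/// Assumes `parsed` covers a prefix of `input` ending on a char boundary.
pub fn complete(parsed: Node, input: &str) -> Node {
    let covered = parsed.text_len();
    if covered >= input.len() {
        return parsed;
    }
    let skipped = Node::Token { name: "SKIPPED", text: input[covered..].to_owned() };
    Node::Rule { name: "Root", children: vec![parsed, skipped] }
}
```

With the snapshot above, a Block covering 0..22 of the 24-byte input would be followed by a SKIPPED token holding the remaining two bytes.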

OmarTawfik · 2023-08-24T07:01:08Z

...dity/testing/snapshots/cst_output/SourceUnit/partial_definition/generated/0.4.11-failure.yml

                  - FunctionDefinition (Rule): # 18..28 "  function"
                      - FunctionKeyword (Token): "function" # 20..28
-      - SKIPPED (Token): "" # 28..28
+                      - Error: "" # 28..28


Two empty error nodes here. Is that expected?
Do we need to create them in this case?

If this is a WIP result - maybe we should split this into the commits before and after. I'd prefer not to publish this CST result.

OmarTawfik

Left a few questions on the structure/test output.

crates/codegen/grammar/src/grammar.rs

crates/codegen/grammar/src/dsl.rs

AntonyBlakey · 2023-08-24T09:07:27Z

crates/codegen/parser/generator/src/parser_definition.rs

            if_true
        } else {
            let flags = self.iter().map(|vqr| {
-                let flag = format_ident!(


Not keen on this change - the existing code has the version checks in the scanner/parser 'inner loop' as simple bool tests.

In any case, I don't think this change should be in this PR.

Wanted to reduce ad-hoc identifier generation since it makes the harness/code harder to follow and adds unnecessary state to the language IMO (these should be initialized and read-only) but I can revert that; I don't have strong feelings about this
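For readers unfamiliar with the trade-off, the "simple bool tests" approach can be sketched as below. This is only an illustration under assumptions: the `Language` fields, the flag name and the tuple-based `Version` are hypothetical stand-ins, and the 0.8.0 threshold for unchecked blocks is used purely as an example of a version-gated construct.

```rust
/// Hypothetical version triple; the real code uses a proper `Version` type.
pub type Version = (u32, u32, u32);

pub struct Language {
    version: Version,
    // Computed once in `new` and only read afterwards.
    version_is_at_least_0_8_0: bool,
}

impl Language {
    pub fn new(version: Version) -> Self {
        Self {
            version,
            // Tuples compare lexicographically, which matches version ordering here.
            version_is_at_least_0_8_0: version >= (0, 8, 0),
        }
    }

    /// Inner-loop check: a plain bool test, no version comparison.
    pub fn allows_unchecked_block(&self) -> bool {
        self.version_is_at_least_0_8_0
    }

    pub fn version(&self) -> Version {
        self.version
    }
}
```

The flag is initialized in the constructor and never mutated, which keeps both properties discussed here: cheap checks in the hot path and read-only state.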

AntonyBlakey · 2023-08-24T09:10:23Z

crates/codegen/parser/runtime/src/lexer.rs

@@ -0,0 +1,165 @@
+use crate::{


I very much like this.

AntonyBlakey · 2023-08-24T11:05:43Z

crates/codegen/parser/runtime/src/support/choice_helper.rs

+/// a full match if possible, otherwise on the best incomplete match.
 pub struct ChoiceHelper {
    result: ParserResult,
-    is_done: bool,


The fact that the result value does double duty as the done sentinel might be more efficient in some way, but is less easy to comprehend at a glance.

I didn't think about efficiency here but more about making it impossible to encode an invalid state (e.g. it's impossible to have is_done == true && matches!(result, IncompleteResult(..))).

The code is ultimately a reduction that settles on a particular ParserResult state, so I wanted to make it clear that is_done is derived from the result and not used alongside it. Maybe the finished_state! is a bit of an overkill; I can try to make it more obvious, but I'd rather not reintroduce is_done as state, as it made it harder for me to follow what uses/mutates/depends on what.
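The idea of deriving is_done from the result can be sketched as follows. The `ParserResult` variants, the `consider` and `finish` methods, and the rule that a longer incomplete match beats a shorter one are assumptions made for illustration; the actual helper in this PR may differ.

```rust
/// Hypothetical, simplified result type; the real `ParserResult` has richer payloads.
#[derive(Debug, Clone, PartialEq)]
pub enum ParserResult {
    NoMatch,
    Incomplete { consumed: usize },
    Match { consumed: usize },
}

/// Settles on a full match if possible, otherwise on the best incomplete match.
pub struct ChoiceHelper {
    result: ParserResult,
}

impl ChoiceHelper {
    pub fn new() -> Self {
        Self { result: ParserResult::NoMatch }
    }

    /// Derived from `result`, so it can never disagree with it.
    pub fn is_done(&self) -> bool {
        matches!(self.result, ParserResult::Match { .. })
    }

    /// Offer the outcome of one alternative; returns `true` once the choice is settled.
    pub fn consider(&mut self, candidate: ParserResult) -> bool {
        if self.is_done() {
            return true;
        }
        let better = match (&self.result, &candidate) {
            // Any full match wins over whatever we have so far.
            (_, ParserResult::Match { .. }) => true,
            (ParserResult::NoMatch, ParserResult::Incomplete { .. }) => true,
            // Assumption: the longer incomplete match is the "best" one.
            (
                ParserResult::Incomplete { consumed: old },
                ParserResult::Incomplete { consumed: new },
            ) => new > old,
            _ => false,
        };
        if better {
            self.result = candidate;
        }
        self.is_done()
    }

    pub fn finish(self) -> ParserResult {
        self.result
    }
}
```

Because there is no separate flag, a state like "done, but holding an incomplete result" simply cannot be constructed.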

AntonyBlakey · 2023-08-24T11:15:54Z

crates/codegen/parser/generator/src/parser_definition.rs

                            let mut helper = SequenceHelper::new();
-                            loop {
+                            // Poor man's try block
+                            std::iter::once(()).try_for_each(|_| {


IMO this should be a macro. It might be better than loop, but the hack is uglier, and it still abuses the concept of repetition.

This was cleaned up later on
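For context, the trick under discussion lets `?` exit a local scope instead of the whole function. The sketch below shows it next to an immediately invoked closure, which gives the same early exit without involving an iterator. The `step` function is hypothetical, and the second variant is only an illustration, not necessarily what the later cleanup used.

```rust
/// Hypothetical step: succeeds with the doubled value unless the input is zero.
fn step(value: u32) -> Option<u32> {
    if value == 0 {
        None
    } else {
        Some(value * 2)
    }
}

/// The "poor man's try block": `?` inside the closure exits only the closure,
/// and the one-element iterator exists solely to run that closure once.
pub fn with_once_trick(values: &[u32]) -> Vec<u32> {
    let mut collected = Vec::new();
    let _ = std::iter::once(()).try_for_each(|_| -> Option<()> {
        for &value in values {
            collected.push(step(value)?);
        }
        Some(())
    });
    collected
}

/// Same control flow with an immediately invoked closure.
pub fn with_closure(values: &[u32]) -> Vec<u32> {
    let mut collected = Vec::new();
    let _ = (|| -> Option<()> {
        for &value in values {
            collected.push(step(value)?);
        }
        Some(())
    })();
    collected
}
```

Both functions keep the results gathered before the first failing step and then stop.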

AntonyBlakey · 2023-08-24T11:23:33Z

crates/codegen/parser/runtime/src/templates/language.tera


 impl Language {
-    const VERSIONS: &'static [&'static str] = &[
+    pub const SUPPORTED_VERSIONS: &[Version] = &[


Nice polishing.

AntonyBlakey · 2023-08-24T11:59:24Z

...dity/testing/snapshots/cst_output/SourceUnit/partial_definition/generated/0.4.11-failure.yml

                  - FunctionDefinition (Rule): # 18..28 "  function"
                      - FunctionKeyword (Token): "function" # 20..28
-      - SKIPPED (Token): "" # 28..28
+                      - Error: "" # 28..28


If this is a WIP result - maybe we should split this into the commits before and after. I'd prefer not to publish this CST result.

AntonyBlakey · 2023-08-24T12:08:44Z

...ing/snapshots/cst_output/ContractMembersList/incomplete_recovery/generated/0.8.0-failure.yml

+                  - FunctionKeyword (Token): "function" # 95..103
+                  - Error: "" # 103..103
+          - Error: "nobody() public\n\nfunction noargs public {\n\tbreak" # 104..152
+          - Semicolon (Token): ";" # 152..153


We can't have this on main if this doesn't produce a complete CST covering every char of the input.

@AntonyBlakey

The 59645e6 change is unrelated but I hope it's simple enough that I can squeeze it in as well here. This includes some smaller, some bigger refactorings while I worked on #567. I hope they are useful on their own and this will help with later work related to #567, as I'll need more lexer-related utilities (i.e. skipping or consuming until a token is found) and it got somewhat unwieldy when I left it in the language.tera. Next up, I can separate lexer-related bits out of the language.tera into a dedicated lexer.tera that may contain `LexicalContext` (cc @AntonyBlakey wrt #567 (comment)) if you think it's worthwhile.
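A "skip until a token is found" utility of the kind mentioned above might look roughly like this. It is a sketch under assumptions: the `TokenKind` variants, the `Recovery` type and the `skip_until` signature are hypothetical, and stopping at a closing token so that recovery does not run past the enclosing construct is a design choice of this example rather than something stated in the PR.

```rust
use std::ops::Range;

/// Hypothetical token kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TokenKind {
    Identifier,
    Keyword,
    Semicolon,
    CloseBrace,
}

/// Outcome of trying to resynchronize on a terminator.
#[derive(Debug, PartialEq)]
pub enum Recovery {
    /// Terminator found; the tokens in `skipped` were unexpected.
    Recovered { skipped: Range<usize>, terminator_index: usize },
    /// No terminator found before the input or the enclosing construct ended.
    Failed,
}

/// Scan forward from `start` until `terminator` is found, giving up at `stop_at`.
pub fn skip_until(
    tokens: &[TokenKind],
    start: usize,
    terminator: TokenKind,
    stop_at: TokenKind,
) -> Recovery {
    // An out-of-range `start` is treated as an empty remainder.
    let rest = tokens.get(start..).unwrap_or(&[]);
    for (offset, &kind) in rest.iter().enumerate() {
        let index = start + offset;
        if kind == terminator {
            return Recovery::Recovered { skipped: start..index, terminator_index: index };
        }
        if kind == stop_at {
            break;
        }
    }
    Recovery::Failed
}
```

A caller would typically report one error spanning the skipped range and then continue parsing right after the terminator.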

Most importantly, this documents and tries to simplify the parser helpers and improves legibility of the generated parse functions. Done as part of the #567

Xanewok · 2023-08-29T12:12:10Z

Updated the PR, see OP #567 (comment).

AntonyBlakey · 2023-08-29T13:28:56Z

crates/codegen/parser/runtime/src/support/stream.rs

+
 use super::super::text_index::TextIndex;

 pub struct Stream<'s> {


Given that this is now more than a stream, it should be renamed to ParserContext - either including a Stream or having a Stream trait. And wherever &mut Stream is used, the param name should be changed as well.

Done in a855405

Xanewok force-pushed the error-recovery branch 5 times, most recently from 950d580 to c9cad99 Compare August 23, 2023 11:39

Xanewok changed the title ~~Polish parser utilities and support naively recovering from bad termination~~ Support simple error recovery in termination/delimitation and polish parser utilities Aug 23, 2023

Xanewok marked this pull request as ready for review August 23, 2023 12:13

Xanewok requested a review from a team as a code owner August 23, 2023 12:13

OmarTawfik reviewed Aug 24, 2023

AntonyBlakey suggested changes Aug 24, 2023

This was referenced Aug 28, 2023

Separate lexer utilities #579

Merged

Polish parser helpers #580

Merged

github-merge-queue bot pushed a commit that referenced this pull request Aug 29, 2023

Polish parser helpers (#580)

8125d65

Most importantly, this documents and tries to simplify the parser helpers and improves legibility of the generated parse functions. Done as part of the #567

Xanewok added 3 commits August 29, 2023 13:59

Add initial TerminatedBy error recovery tests

d6dc62e

Attempt recovering from incomplete matches followed by a terminator

772912b

Attempt recovering from extra tokens until the terminator

b13bdce

Xanewok force-pushed the error-recovery branch from d0f601d to b13bdce Compare August 29, 2023 12:00

Xanewok changed the title ~~Support simple error recovery in termination/delimitation and polish parser utilities~~ Attempt error recovery in the TerminatedBy parsers Aug 29, 2023

AntonyBlakey suggested changes Aug 29, 2023

Rename Stream to ParserContext

a855405

AntonyBlakey approved these changes Aug 29, 2023

AntonyBlakey added this pull request to the merge queue Aug 29, 2023

Merged via the queue into NomicFoundation:main with commit 7c6b6c9 Aug 29, 2023

