Skip to content

Attempt error recovery in the TerminatedBy parsers#567

Merged
AntonyBlakey merged 4 commits intoNomicFoundation:mainfrom
Xanewok:error-recovery
Aug 29, 2023
Merged

Attempt error recovery in the TerminatedBy parsers#567
AntonyBlakey merged 4 commits intoNomicFoundation:mainfrom
Xanewok:error-recovery

Conversation

@Xanewok
Copy link
Copy Markdown
Contributor

@Xanewok Xanewok commented Aug 14, 2023

Based on #564

EDIT:
As asked, this was split into supporting PRs (#579, #580, merged now), doesn't use the Error CST node anymore and only contains terminated-by recovery.

To support error side-channel in a backtracking scenario, Vec<ParseError> was added to the Stream which serves more as a parse context now. To backtrack, we first record the position of the stream and how many errors are there; then, we reset the errors and the position to the first recorded Marker struct (basically copying what chumsky does).

This wasn't changed everywhere as Stream::set_position in the lexer will never emit errors (only parsers do), nor will the optional trivia parsers. I'm happy to polish the lexer interface a bit to accommodate for this change later, if that's okay. First, I wanted to make sure the approach is fine and accepted, until we proceed with the DelimitedBy error recovery.

@changeset-bot
Copy link
Copy Markdown

changeset-bot bot commented Aug 14, 2023

⚠️ No Changeset found

Latest commit: a855405

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@Xanewok Xanewok force-pushed the error-recovery branch 5 times, most recently from 950d580 to c9cad99 Compare August 23, 2023 11:39
@Xanewok Xanewok changed the title Polish parser utilities and support naively recovering from bad termination Support simple error recovery in termination/delimitation and polish parser utilities Aug 23, 2023
@Xanewok Xanewok marked this pull request as ready for review August 23, 2023 12:13
@Xanewok Xanewok requested a review from a team as a code owner August 23, 2023 12:13
"@nomicfoundation/slang": minor
---

We attempt simple CST error recovery for common "terminated-by" and "delimited-by" groups instead of bailing out on incomplete/invalid input
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "terminated-by" and "delimited-by" are internal grammar concepts. I suggest using user-facing terms. Maybe something like "terminators (semicolon)" and "delimiters (brackets)" instead?

expect(errors).toHaveLength(1);

const report = errors[0]?.toErrorReport("test.sol", source, /* withColor */ false);
let reports = errors.map((error: any) => error.toErrorReport("test.sol", source, /* withColor */ false));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why the change here? we already asserted it has a length of 1 above. Also, why would we use any here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change was here as at some points there were two errors emitted, and I blessed the snapshots everytime for consistency; didn't revert that after it was only one error. The any was because the type was not inferred but now I see that it's unresolved in Node.js rather than being pointed to parse_output.ParseOutput. I can fix that in a follow-up and I can revert the code change here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the type was not inferred but now I see that it's unresolved in Node.js

Is the generated .d.ts is broken somehow? not blocking to this PR, but I would have expected the earlier test to fail if it cannot resolve the types, since we compile with struct: true?

Rule(RuleKind),
Token(TokenKind),
Trivia(TokenKind),
Error,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest storing/rendering the error message from ErrorNode as well, instead of dropping it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the consensus is to drop the Node, so this probably won't end up in a CST


Tree:
- Block (Rule): # 0..24 "{ unchecked { x = 1; } }"
- Block (Rule): # 0..22 "{ unchecked { x = 1; }"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Block changed here from 24 to 22 bytes, as we dropped the last }. Is that intentional?

The CST should always be complete (covers the entire input).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can bring back the SKIPPED token handling for the incomplete parse

- FunctionDefinition (Rule): # 18..28 " function"
- FunctionKeyword (Token): "function" # 20..28
- SKIPPED (Token): "" # 28..28
- Error: "" # 28..28
Copy link
Copy Markdown
Contributor

@OmarTawfik OmarTawfik Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two empty error nodes here. Is that expected?
Do we need to create them in this case?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this a WIP result - maybe we should split this into the commits before and after. I'd prefer not to publish this CST result.

Copy link
Copy Markdown
Contributor

@OmarTawfik OmarTawfik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few questions on the structure/test output.

if_true
} else {
let flags = self.iter().map(|vqr| {
let flag = format_ident!(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not keen on this change - the existing code has the version checks in the scanner/parser 'inner loop' as simple bool tests.

In any case, I don't think this change should be in this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wanted to reduce ad-hoc identifier generation since it makes the harness/code harder to follow and adds unnecessary state to the language IMO (these should be initialized and read-only) but I can revert that; I don't have strong feelings about this

@@ -0,0 +1,165 @@
use crate::{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much like this.

/// a full match if possible, otherwise on the best incomplete match.
pub struct ChoiceHelper {
result: ParserResult,
is_done: bool,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that the result value does double duty as the done sentinel might be more efficient in some way, but is less easy to comprehend at a glance.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think about efficiency here but more about being impossible to encode invalid state (e.g. it's impossible to have is_done == true && matches!(result, IncompleteResult(..)))).

The code is ultimately a reduction that settles on a particular ParserResult state, so I wanted to make it clear that is_done is derived from the result and not used alongside it. Maybe the finished_state! is a bit of an overkill, I can try to make it more obvious but I'd rather not reintroduce is_done as state as it made me harder to follow what uses/mutates/depends on what.

let mut helper = SequenceHelper::new();
loop {
// Poor man's try block
std::iter::once(()).try_for_each(|_| {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO should be a macro. It might better than loop, but the hack is uglier. At it still abuses the concept of repetition.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was cleaned up later on


impl Language {
const VERSIONS: &'static [&'static str] = &[
pub const SUPPORTED_VERSIONS: &[Version] = &[
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice polishing.

- FunctionDefinition (Rule): # 18..28 " function"
- FunctionKeyword (Token): "function" # 20..28
- SKIPPED (Token): "" # 28..28
- Error: "" # 28..28
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this a WIP result - maybe we should split this into the commits before and after. I'd prefer not to publish this CST result.

- FunctionKeyword (Token): "function" # 95..103
- Error: "" # 103..103
- Error: "nobody() public\n\nfunction noargs public {\n\tbreak" # 104..152
- Semicolon (Token): ";" # 152..153
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't have this on main if this doesn't produce a complete CST covering every char of the input.

This was referenced Aug 28, 2023
github-merge-queue bot pushed a commit that referenced this pull request Aug 29, 2023
The 59645e6 change is unrelated but I
hope it's simple enough that I can squeeze it in as well here.

This includes some smaller, some bigger refactorings while I worked on
#567. I hope they are useful on their own and this will help with later
work related to #567, as I'll need more lexer-related utilities (i.e.
skipping or consuming until a token is found) and it got somewhat
unwieldy when I left it in the language.tera.

Next up, I can separate lexer-related bits out of the language.tera into
a dedicated lexer.tera that may contain `LexicalContext` (cc
@AntonyBlakey wrt
#567 (comment))
if if you think it's worthwhile.
github-merge-queue bot pushed a commit that referenced this pull request Aug 29, 2023
Most importantly, this documents and tries to simplify the parser
helpers and improves legibility of the generated parse functions. Done
as part of the #567
@Xanewok Xanewok changed the title Support simple error recovery in termination/delimitation and polish parser utilities Attempt error recovery in the TerminatedBy parsers Aug 29, 2023
@Xanewok
Copy link
Copy Markdown
Contributor Author

Xanewok commented Aug 29, 2023

Updated the PR, see OP #567 (comment).


use super::super::text_index::TextIndex;

pub struct Stream<'s> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is now more than a stream, it should be renamed to ParserContext - either including a Stream or having trait Stream. And wherever &mut Stream is used the param name changed as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a855405

@AntonyBlakey AntonyBlakey added this pull request to the merge queue Aug 29, 2023
Merged via the queue into NomicFoundation:main with commit 7c6b6c9 Aug 29, 2023
rollup-smithbm0p added a commit to rollup-smithbm0p/slang that referenced this pull request Dec 26, 2025
The 59645e669c72a8468f510531268bb6ed57ae1a29 change is unrelated but I
hope it's simple enough that I can squeeze it in as well here.

This includes some smaller, some bigger refactorings while I worked on
#567. I hope they are useful on their own and this will help with later
work related to #567, as I'll need more lexer-related utilities (i.e.
skipping or consuming until a token is found) and it got somewhat
unwieldy when I left it in the language.tera.

Next up, I can separate lexer-related bits out of the language.tera into
a dedicated lexer.tera that may contain `LexicalContext` (cc
@AntonyBlakey wrt
NomicFoundation/slang#567 (comment))
if if you think it's worthwhile.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants