Skip to content

Add 'reserved word' construct: improve error recovery by avoiding treating reserved words as other tokens#1635

Closed
maxbrunsfeld wants to merge 16 commits intomasterfrom
reserved-rule
Closed

Add 'reserved word' construct: improve error recovery by avoiding treating reserved words as other tokens#1635
maxbrunsfeld wants to merge 16 commits intomasterfrom
reserved-rule

Conversation

@maxbrunsfeld
Copy link
Contributor

@maxbrunsfeld maxbrunsfeld commented Feb 3, 2022

This is a second attempt to solve a problem described in #246. I'll copy some of the description here.

Background

Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes. In general, Tree-sitter is permissive in allowing words that are keywords in some places to be used freely as names in other places.

Sometimes this permissiveness causes unexpected error recoveries. Consider this syntax error in Rust:

fn main() {
    a.  // <- incomplete

    if b {
        c();
    }
}

Currently, when tree-sitter-rust encounters this code, it doesn't detect an error until the word b, because it interprets the word if as a field/method name on the a object. It doesn't see if as a keyword, because the keyword if would not be valid in that position.

Because the error is detected too late, it's not possible to recover well. Tree-sitter fails to recognize the if_statement, and sees it instead as a continuation of the expression above:

rust tree with bad recovery
(source_file [0, 0] - [7, 0]
  (function_item [0, 0] - [5, 3]
    name: (identifier [0, 3] - [0, 7])
    parameters: (parameters [0, 7] - [0, 9])
    body: (block [0, 10] - [5, 3]
      (field_expression [1, 2] - [3, 4]
        value: (identifier [1, 2] - [1, 3])
        field: (field_identifier [3, 2] - [3, 4]))
      (call_expression [3, 5] - [4, 7]
        function: (struct_expression [3, 5] - [4, 5]
          name: (type_identifier [3, 5] - [3, 6])
          body: (field_initializer_list [3, 7] - [4, 5]
            (shorthand_field_initializer [4, 4] - [4, 5]
              (identifier [4, 4] - [4, 5]))))
        arguments: (arguments [4, 5] - [4, 7]))))
  (ERROR [6, 0] - [6, 1]))

The reserved property

In order improve this error recovery, the grammar author needs a way to explicitly indicate that certain keywords are reserved. That is - even if they are not technically valid, they should still be recognized as separate from any other tokens that would match that string (such as an identifier). In Rust, most keywords like if and let are reserved in all contexts.

This PR introduces a new top-level property on grammars called reserved, which lets you list these reserved words.

module.exports = grammar({
  name: 'rust',

  reserved: $ => [
    "enum",
    "fn",
    "for",
    "if",
    "let",
    "loop",
    "match",
    "mod",
    "struct",
    "type",
    "while",
  ],

  // ...
});

When using this new feature, and parsing the same rust code as above, the error is now detected at the correct time (at the if token), because Tree-sitter still treats if as a keyword and not an identifier, even though the keyword is unexpected. This allows error recovery to be much better: preserving the entire if_statement, and marking the incomplete a. line as an error.

rust tree with good recovery
(source_file [0, 0] - [7, 0]
  (function_item [0, 0] - [6, 1]
    name: (identifier [0, 3] - [0, 7])
    parameters: (parameters [0, 7] - [0, 9])
    body: (block [0, 10] - [6, 1]
      (ERROR [1, 2] - [1, 4]
        (identifier [1, 2] - [1, 3]))
      (if_expression [3, 2] - [5, 3]
        condition: (identifier [3, 5] - [3, 6])
        consequence: (block [3, 7] - [5, 3]
          (call_expression [4, 4] - [4, 7]
            function: (identifier [4, 4] - [4, 5])
            arguments: (arguments [4, 5] - [4, 7])))))))

Contextual Reserved Words

Many languages have a more complex system of reserved words, in which words are reserved in some contexts, but not others. For example, in JavaScript, the word if cannot be used in a local declaration or an expression, but it can be used as the name of an object property:

var a = {if: true}; // <- valid
var b =             // <- incomplete
if (c) {            // <- error should be detected at this "if"
  a.if();           // <- valid
}

The current version of tree-sitter-javascript will treat the if properties as valid, which is correct, but it will fail to detect the error on the if token on line 3, similarly to the Rust example described above.

javscript tree with bad error recovery
(program [0, 0] - [6, 0]
  (variable_declaration [0, 0] - [0, 19]
    (variable_declarator [0, 4] - [0, 18]
      name: (identifier [0, 4] - [0, 5])
      value: (object [0, 8] - [0, 18]
        (pair [0, 9] - [0, 17]
          key: (property_identifier [0, 9] - [0, 11])
          value: (true [0, 13] - [0, 17])))))
  (comment [0, 20] - [0, 31])
  (variable_declaration [1, 0] - [2, 6]
    (variable_declarator [1, 4] - [2, 6]
      name: (identifier [1, 4] - [1, 5])
      (comment [1, 20] - [1, 36])
      value: (call_expression [2, 0] - [2, 6]
        function: (identifier [2, 0] - [2, 2])
        arguments: (arguments [2, 3] - [2, 6]
          (identifier [2, 4] - [2, 5])))))
  (statement_block [2, 7] - [4, 1]
    (comment [2, 20] - [2, 63])
    (expression_statement [3, 2] - [3, 9]
      (call_expression [3, 2] - [3, 8]
        function: (member_expression [3, 2] - [3, 6]
          object: (identifier [3, 2] - [3, 3])
          property: (property_identifier [3, 4] - [3, 6]))
        arguments: (arguments [3, 6] - [3, 8])))
    (comment [3, 20] - [3, 31])))
test.js	3 ms	(MISSING ";" [2, 6] - [2, 6])

In order to allow the valid usages, but detect the invalid ones, the grammar author needs a way to indicate in the JavaScript grammar that if is normally a reserved word, but it is still allowed in property names.

The reserved Grammar Rule

In addition to the top-level reserved property, this PR also introduces a new rule function, reserved(reservedWords, rule), which lets you override the set of reserved words in a certain context. In the case of JavaScript, we actually want to remove all reserved words in the context of object properties:

grammar({
  name: 'javascript',
  
  reserved: $ => [
    'if',
    'for',
    'function',
    'var',
    // ...
  ],

  rules: {
    // ...
    
    _property_name: $ => reserved([], choice(
      alias($.identifier, $.property_name),
      // ...
    )),
  }
});

In this particular case, we call reserved() with an empty array to indicate that there are no reserved words in that context. In other use cases, we might pass an alternative set of reserved words.

Details

  • Reserved Word Semantics - Right now, reserved words only take effect in parse states where the grammar's word token would be valid. For example, when inside of a string literal, the reserved words won't cause the lexer to recognize the contents of a string as an if keyword.
  • Combining Reserved Word Sets - In many parse states, there are multiple possible rules that could be in progress. Right now, the reserved words for these parse states will be the union of the reserved words associated with each in-progress rule.

Remaining Tasks

  • Make it easier to use the reserved function correctly. Change the semantics of reserved word overrides so that one reserved word set is always chosen for a given state; don't join multiple sets by taking their union. If one rule has higher precedence, take its reserved word set. Raise an error if there are multiple reserved word sets with the same precedence.
  • Fix performance regression in SQL grammar
  • Automatically handle local ambiguities involving non-reserved keywords. When a keyword is not reserved in a given context, (i.e. it may be used as an identifier there), the parser often needs to see the token that follows the keyword before it can determine if should be interpreted as a keyword, or a as an ordinary identifier.
    To implement this, many Tree-sitter grammars have a rule called "reserved_identifier" or something like that, which wraps a set of keywords, and aliases them as an identifier. This works, but it's a bit confusing, and I'd like to remove the need for this pattern. I think that now, we should be able to automatically apply this kind of handling to any keyword that is not reserved.
  • Allow specifying reserved words that aren't used elsewhere.

@dfgordon
Copy link

dfgordon commented Feb 6, 2022

Great! This would be the ideal answer to #1472. I was able to solve with a tricky scanner, but this sounds much better.
Question: how will this treat white space? I ask because the notes above mention it will be triggered (in part?) by the word token being defined, and this is connected, I found, with the regex notion of word boundaries, such as white space. For my case I need to exclude reserved words from identifiers even if they are broken up by spaces. Spaces need to be eaten up as if they didn't exist. Put another way, will it be possible to fully control what a word boundary is?

@maxbrunsfeld
Copy link
Contributor Author

maxbrunsfeld commented Apr 3, 2022

Ok, I believe this PR is now working correctly. I've opened PRs making use of the API in some grammars. There are some improvements that I'd like to make before merging and releasing this.

EDIT - moved these TODOs to tasks in the PR description.

@dfgordon
Copy link

dfgordon commented Apr 4, 2022

Trying another way: can the property be massaged to produce an error when an identifier contains an embedded reserved word?

tl;dr

Suppose and is the token for logical and, and we are emulating a parser that expects all spaces to be stripped at a prior stage. Then sand and grand is an error because it is seen as s and and gr and. TS will handle that fine, but will typically allow a declaration of sand, since and is not expected in that context. But in languages like this such a declaration is forbidden to protect the programmer. My solution was to lex identifiers using a scanner that rejects anything with a reserved word embedded in it. Wondering if the PR would provide a scanner-free solution.

@paf31
Copy link
Contributor

paf31 commented Apr 7, 2022

First of all, thank you very much for this - I think this looks like a great approach and has the potential to be a very straightforward API for working with error recovery and incomplete documents.

I have a couple of notes, having tried this out today, and I hope to be able to provide some more detailed notes soon:

  1. This seems to impact the speed of compilation quite drastically, i.e. on the order of a 10x slowdown. Is that expected? Perhaps I'm building it incorrectly, I'm not a Rust user, but I've built a release build using cargo and temporarily just copied it into my node_modules folder.
  2. I've had trouble so far if I try to use the $ in the reserved: ($) => ... top-level configuration, i.e. if my reserved tokens are anything but strings. Specifically, I want to be able to use regexes to allow for case-insensitive reserved keywords, but the results don't seem to be as good as when using string literals.
  3. Am I right to assume that if a keyword is in the reserved context at some point, then there should never be a parse tree result which includes a name with that keyword as its text, in that context? If so, I'm definitely seeing cases where that is not the case (if this is a bug, I can try to distill this down to some minimal test cases). And if not, is there an equally simple way for me to think about what reserved should restrict in the resulting parse trees?

Thanks!

@maxbrunsfeld
Copy link
Contributor Author

This seems to impact the speed of compilation quite drastically

I didn't expect that. I'd like to fix it before merging. Can you share a grammar that reproduces the issue?

Specifically, I want to be able to use regexes to allow for case-insensitive reserved keywords

They shouldn't need to be strings. I do think that right now, they need to be existing tokens in the grammar which "overlap" (match some common strings) with your grammar's word token. I'd be curious about cases where it's not working as expected with regexes.

Am I right to assume that if a keyword is in the reserved context at some point, then there should never be a parse tree result which includes a name with that keyword as its text, in that context?

Yeah, although there is some subtlety right now in determining which keywords are in the context at a given point. This is related to the second problem I mentioned above, about the reserved function being a bit hard to use correctly.

@maxbrunsfeld
Copy link
Contributor Author

can the property be massaged to produce an error when an identifier contains an embedded reserved word?

@dfgordon For right now, I don't think this feature will support that. I think using custom code in an external scanner is probably the best way to do it.

@paf31
Copy link
Contributor

paf31 commented Apr 18, 2022

@maxbrunsfeld Thanks for your reply. We're putting together a small grammar which reproduces the issue we're seeing, and we will follow up shortly.

@ioseb
Copy link

ioseb commented Apr 18, 2022

@maxbrunsfeld we put up demo grammar as @paf31 promised: https://github.com/DeepChannel/reserved-word

Here are some of the issues we are experiencing:

  1. We can not use plain text keywords inside reserved: [...] array. When compiling TS complains with following message "Reserved words must be tokens".

This works:

reserved: ($) => [$.SELECT, $.AS, $.UNION, $.INTERSECT, $.EXCEPT],

But this does not:

reserved: ($) => ["SELECT", "AS", "UNION", "INTERSECT", "EXCEPT"],
  1. Behaviour is same for both - case insensitive and plain text rules. kywords.js includes following two lines which allow to switch between such rules:
result[value.toUpperCase()] = (_) => toCaseInsensitiveRule(value);
// result[value.toUpperCase()] = (_) => value.toUpperCase();
  1. Overriding reserved keywords doesn't have any effect also. Below is the example where for the first choice global reserved: [] set should work and for the second rule there is an override:
alias: ($) =>
  choice(
    $.identifier,
    seq(
      $.AS,
      // Allow any reserved keywords except SELECT
      reserved([$.SELECT], $.identifier)
    )
  ),
  1. In grammar.json I see all of the defined reserved rules:
"reserved": [
    {
      "type": "SYMBOL",
      "name": "SELECT"
    },
    {
      "type": "SYMBOL",
      "name": "AS"
    },
    {
      "type": "SYMBOL",
      "name": "UNION"
    },
    {
      "type": "SYMBOL",
      "name": "INTERSECT"
    },
    {
      "type": "SYMBOL",
      "name": "EXCEPT"
    }
]
  1. In parser.c there is a definition for reserved keywords count but final result doesn't actually work:
#define MAX_RESERVED_WORD_SET_SIZE 5

Maybe this is the issue related to .wasm?

Following example should not parse UNION keyword as identifier:

SELECT a + 
UNION 
SELECT 1 SELECT, a - b

Parse tree:

program [0, 0] - [3, 0]
  statement [0, 0] - [2, 22]
    select_stmt [0, 0] - [2, 22]
      simple_select [0, 0] - [2, 22]
        SELECT [0, 0] - [0, 6]
        ERROR [0, 7] - [2, 6]
          target_el [0, 7] - [2, 6]
            expr [0, 7] - [1, 5]
              binary_expr [0, 7] - [1, 5]
                left: expr [0, 7] - [0, 8]
                  identifier [0, 7] - [0, 8]
                    name [0, 7] - [0, 8]
                right: expr [1, 0] - [1, 5]
                  identifier [1, 0] - [1, 5]
                    name [1, 0] - [1, 5]
            alias [2, 0] - [2, 6]
              identifier [2, 0] - [2, 6]
                name [2, 0] - [2, 6]
        target_list [2, 7] - [2, 22]
          target_el [2, 7] - [2, 15]
            expr [2, 7] - [2, 8]
              int [2, 7] - [2, 8]
            alias [2, 9] - [2, 15]
              identifier [2, 9] - [2, 15]
                name [2, 9] - [2, 15]
          target_el [2, 17] - [2, 22]
            expr [2, 17] - [2, 22]
              binary_expr [2, 17] - [2, 22]
                left: expr [2, 17] - [2, 18]
                  identifier [2, 17] - [2, 18]
                    name [2, 17] - [2, 18]
                right: expr [2, 21] - [2, 22]
                  identifier [2, 21] - [2, 22]
                    name [2, 21] - [2, 22]

I would expect it to look something like this:

program [0, 0] - [3, 0]
  statement [0, 0] - [2, 22]
    select_stmt [0, 0] - [2, 22]
      simple_select [0, 0] - [0, 12]
        SELECT [0, 0] - [0, 6]
        target_list [0, 7] - [0, 12]
          target_el [0, 7] - [0, 12]
            expr [0, 7] - [0, 12]
              [ERROR]
                binary_expr [0, 7] - [0, 12]
                  left: expr [0, 7] - [0, 8]
                    identifier [0, 7] - [0, 8]
                      name [0, 7] - [0, 8]
      select_combinator [1, 0] - [1, 5]
        UNION [1, 0] - [1, 5]
      select_stmt [2, 0] - [2, 22]
        simple_select [2, 0] - [2, 22]
          SELECT [2, 0] - [2, 6]
          target_list [2, 7] - [2, 22]
            target_el [2, 7] - [2, 15]
              expr [2, 7] - [2, 8]
                int [2, 7] - [2, 8]
              alias [2, 9] - [2, 15]
                identifier [2, 9] - [2, 15]
                  name [2, 9] - [2, 15]
            target_el [2, 17] - [2, 22]
              expr [2, 17] - [2, 22]
                binary_expr [2, 17] - [2, 22]
                  left: expr [2, 17] - [2, 18]
                    identifier [2, 17] - [2, 18]
                      name [2, 17] - [2, 18]
                  right: expr [2, 21] - [2, 22]
                    identifier [2, 21] - [2, 22]
                      name [2, 21] - [2, 22]

@ioseb
Copy link

ioseb commented Apr 26, 2022

@maxbrunsfeld below are benchmark results for our relatively large grammar.

Benchmark results for master branch:

Benchmark 1: npm run build
  Time (mean ± σ):     13.106 s ±  0.553 s    [User: 13.244 s, System: 0.773 s]
  Range (min … max):   12.451 s … 14.053 s    10 runs

Benchmark results for reserved-rule branch:

Benchmark 1: npm run build
  Time (mean ± σ):     196.793 s ±  9.894 s    [User: 196.355 s, System: 0.953 s]
  Range (min … max):   185.614 s … 211.372 s    10 runs

npm run build resolves to tree-sitter generate && tree-sitter build-wasm

Machine:

MacBook Pro (16-inch, 2019)
Processor 2,3 GHz 8-Core Intel Core i9
Memory 16 GB 2667 MHz DDR4

@paf31
Copy link
Contributor

paf31 commented May 5, 2022

Am I right to assume that if a keyword is in the reserved context at some point, then there should never be a parse tree result which includes a name with that keyword as its text, in that context?

Yeah, although there is some subtlety right now in determining which keywords are in the context at a given point. This is related to the second problem I mentioned above, about the reserved function being a bit hard to use correctly.

I meant to follow up with a bit more context here, based on the example repo that @ioseb provided. If you try the playground in that repo, you can see the problem that I was originally referring to:

  • If you paste the input text SELECT x UNION there, you will see that UNION is correctly marked as an error node, because UNION is reserved, so it cannot take the place of the alias.
  • SELECT x as UNION parses with UNION as an identifier - again this is correct, because the nested reserved call has removed UNION from the reserved set in that scope, as I understand it.
  • SELECT x SELECT y is interesting. I'll paste the output below.
  • SELECT x AS SELECT y seems to result in a similar tree.

Output for SELECT x SELECT y:

program [0, 0] - [1, 0]
  statement [0, 0] - [0, 17]
    select_stmt [0, 0] - [0, 17]
      simple_select [0, 0] - [0, 17]
        SELECT [0, 0] - [0, 6]
        ERROR [0, 7] - [0, 15]
          target_el [0, 7] - [0, 15]
            expr [0, 7] - [0, 8]
              identifier [0, 7] - [0, 8]
                name [0, 7] - [0, 8]
            alias [0, 9] - [0, 15]
              identifier [0, 9] - [0, 15]
                name [0, 9] - [0, 15]
        target_list [0, 16] - [0, 17]
          target_el [0, 16] - [0, 17]
            expr [0, 16] - [0, 17]
              identifier [0, 16] - [0, 17]
                name [0, 16] - [0, 17]

As you can see, the ERROR range was introduced surrounding x SELECT, but the second SELECT was still parsed as a name, even though it was in the reserved set.

If you could please help me to understand what the intended behavior should be here, I'd appreciate it, because my understanding is that the parser should have backtracked at this point, and returned some tree in which no SELECT keyword is marked as a name. Hopefully that's the intent, and I've not misunderstood, but please correct me if I'm wrong.

Thanks again!

@gdamore
Copy link
Contributor

gdamore commented Oct 6, 2022

I'm working on a D grammar for tree-sitter. D is a C family language, and like most such languages has a some (D has a plethora of them) words that are reserved in all contexts. There are a ton of conflicts in the grammar that I'm very happy that this PR will address. (At one point I considered moving the entire logic of isolating keywords from identifiers in the lexer -- I had an implementation that did it even -- but I very much like the approach here better.) Btw, my lexer does identify strings, comments, and a couple of other oddball constructs so that they show up as tokens and cannot conflict with reserved words.

@mingodad
Copy link

There is also another way that help when we need soft keywords see here https://sqlite.org/src/doc/trunk/doc/lemon.html#pfallback then %fallback directive.

@obs-gh-zuozhiwang
Copy link

I'm working on a pretty simple search DSL that allows boolean operators: and, or, not, and search tokens. From other discussions, it seems the only way is to do this in an external scanner.

I want to contribute one way of doing it using regex. Basically I'm negating and/or/not, but it's very ugly as negative lookahead is not supported.

        // original regex: /([^<>\s\\'"()]|\\.)+/
        // exclude the keywords: and, or, not
        // regex logic is: 
        // 1: any string not start with a, o, or n
        // 2: if starts with a,   must not follow n
        // 3: if starts with an,  must not follow d
        // 4: if starts with and, must follow with more characters
        // ... (repeat the same logic for or, not)
        token: _ => /((?:(?:[^aon<>\s\\'"()]|\\.)(?:[^<>\s\\'"()]|\\.)*)|a|(?:a[^n<>\s\\'"()](?:[^<>\s\\'"()]|\\.)*)|an|(?:an[^d<>\s\\'"()](?:[^<>\s\\'"()]|\\.)*)|(?:and(?:[^<>\s\\'"()]|\\.)+)|o|(?:o[^r<>\s\\'"()](?:[^<>\s\\'"()]|\\.)*)|(?:or(?:[^<>\s\\'"()]|\\.)+)|n|(?:n[^o<>\s\\'"()](?:[^<>\s\\'"()]|\\.)*)|no|(?:no[^t<>\s\\'"()](?:[^<>\s\\'"()]|\\.)*)|(?:not(?:[^<>\s\\'"()]|\\.)+))/i,

I very much look forward to having this feature merged.

@amaanq
Copy link
Member

amaanq commented Nov 22, 2024

Hey, just commenting for everyone in this thread that's interested, #3896 supersedes this PR, and is now working with everything on the todo list fixed minus the Allow specifying reserved words that aren't used elsewhere. item, since Max and I agreed that it doesn't really make sense to reserve keywords that aren't used anywhere - it feels like a weird feature, and doesn't seem like it'd be used.

Questions/Discussion should happen in the follow-up PR, so I'll be closing this

@amaanq amaanq closed this Nov 22, 2024
@ObserverOfTime ObserverOfTime removed this from the 0.25 milestone Nov 23, 2024
@amaanq amaanq deleted the reserved-rule branch December 27, 2024 03:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants