Introduce an EXCLUDE rule by maxbrunsfeld · Pull Request #246 · tree-sitter/tree-sitter

maxbrunsfeld · 2018-12-04T17:10:38Z

Background

Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes and states. In general, Tree-sitter tends to be permissive in allowing words that are keywords in some places to be used freely as names in other places.

Sometimes this permissiveness causes unexpected error recoveries. Consider this C syntax error:

float // <-- error
int main() {}

Currently, when tree-sitter-c encounters this code, it doesn't detect an error until the word main, because it interprets the word int as a variable, declared with type float. It doesn't see int as a keyword, because the keyword int wouldn't be allowed in that position.

Solution

In order to improve this error recovery, the grammar author needs a way to explicitly indicate that certain keywords are not allowed in certain places. For example, in C, primitive types like int and control-flow keywords like while are not allowed as variable names in declarators.

This PR introduces a new EXCLUDE rule to the underlying JSON schema. From JavaScript, you can use it like this:

declarator: choice(
  $.pointer_declarator,
  $.array_declarator,
  $.identifier.exclude('if', 'int', ...etc)
)

Conceptually, you're saying "a declarator can match an identifier, but not these other tokens".

Implementation

Internally, all Tree-sitter needs to do is to insert the excluded tokens (if, int, etc) into the set of valid lookahead symbols that it uses when tokenizing, in the relevant states. Then, when the lexer sees the string "if", it will recognize it as an if token, not just an identifier. Then, as always when there's an error, the parser will find that there are no valid parse actions for the token if.
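As a rough illustration of the mechanism described above, here is a toy model of context-aware lexing. It is not Tree-sitter's actual implementation: the TOKENS table, the lex function, and the two lookahead sets are hypothetical names invented for this sketch. It only shows that adding the excluded keywords to a state's valid lookahead set changes how the string int is tokenized in that state.

```javascript
// Toy model only; all names here are hypothetical, not Tree-sitter internals.
// Keyword (string) tokens come first so they win over the identifier regex.
const TOKENS = [
  { name: 'int', match: /^int\b/ },
  { name: 'float', match: /^float\b/ },
  { name: 'identifier', match: /^[A-Za-z_]\w*/ },
];

// Lex one token at the start of `input`, considering only the tokens
// that are in the current state's valid lookahead set.
function lex(input, validLookaheads) {
  for (const token of TOKENS) {
    if (!validLookaheads.has(token.name)) continue;
    const m = token.match.exec(input);
    if (m) return { type: token.name, text: m[0] };
  }
  return null;
}

// Assumed parse state right after `float`: only a declarator may follow.
const declaratorState = new Set(['identifier']);
// The same state with the excluded keywords added to the lookahead set.
const declaratorStateWithExcludes = new Set([...declaratorState, 'int', 'float']);

const permissive = lex('int main() {}', declaratorState);
const strict = lex('int main() {}', declaratorStateWithExcludes);
console.log(permissive.type, strict.type);
```

With the permissive set, int is lexed as an identifier. With the excluded keywords added, it is lexed as the int keyword, for which the parser in that state would have no valid action.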

Alternatives

I could have instead introduced a new field on the entire grammar called keywords. Then, if we added if to the grammar's keywords, the word if would always be treated as its own token, in every parse state.

This is less general, and it wouldn't really work AFAICT. Even in C, there are states where the word int should not be treated as a keyword: for example, inside of a string ("int"), or as the name of a macro definition (#define int). And in other languages, there are many more cases like this than there are in C. For example, in JavaScript, it's fine to have an object property named if.

Relevant Issues

atom/language-c#308

JoeBlanchard · 2019-02-01T02:02:48Z

Hey Max,

I have been working on developing Tree-sitter grammars for extensible languages developed by the Minnesota Extensible Language Tools Group. This group actually wrote the paper on context-aware scanning that you mentioned as part of your underlying research for this project.

This rule seems useful, but also a hassle to list all the keywords that should be excluded wherever an identifier should appear. I was wondering if you had considered implementing a version of lexer classes where these classes of terminals can submit to or dominate other classes of terminals to specify precedence. This can be used to specify that one prefers keywords over identifiers. The paper describes how this works fully.

The benefit of doing this is, as mentioned in the paper in section 2 under the lexical precedence relation header, that we can always view higher precedence terms as being a member of the valid lookahead set when parsing so we can fix the same problem as you are doing with the exclude rule. This could be done once using submits to and dominates instead of excluding the keywords every place the identifier occurs.

Also, using submits to and dominates with lexer classes defines a partial precedence on terminals that is less constraining and allows more flexibility than the total precedence currently required in Tree-sitter. This would help Tree-sitter support more languages more easily. I know it'd likely be a significant undertaking to implement this. So, I'm wondering what your rationale is against doing this, and whether you would consider implementing it in the future.

I'm not sure where the best place to post this was, but I guess this will do. Thanks for your time!

maxbrunsfeld · 2019-02-01T16:44:22Z

Good to hear from you @JoeBlanchard! There's some things that I like about the approach the paper describes, but I don't fully understand how it addresses this problem.

we can always view higher precedence terms as being a member of the valid lookahead set when parsing so we can fix the same problem as you are doing with the exclude rule.

I don't think that's the behavior that I want though. By default, in any given parse state, I don't want any syntactically-invalid tokens to be added to the valid lookahead set. Rather, I want the default to be that we produce the most permissive possible parser-lexer pairing.

For example, in JavaScript you can use words like if and else as identifiers, as long as you're in a context where the if and else keywords wouldn't be valid.

if (a) {
  const b = {
    if: 1,
    else: 2,
  };
} else {
  // ...
}

In Tree-sitter's JavaScript grammar, this 'Just Works', with no extra effort because:

When an if keyword is valid, it "dominates" the identifier token because of Tree-sitter's default rule that String tokens dominate Regex tokens.
Inside of the Object literal above, the if and else keywords are not valid lookaheads.

This even comes up in older languages like C, because you can do stuff like this:

#define float double
#ifdef float
// ...
#endif

And in code like that☝️, we parse float as an identifier.

If we implicitly added keywords like if to every lookahead set containing identifier, it seems like we'd lose these behaviors.

maxbrunsfeld · 2019-02-01T16:53:47Z

In other words, I want Tree-sitter's default to be highly permissive, treating all keywords as contextual, because most languages really do behave this way in many cases. Our number 1 priority is to successfully parse all code that's valid. If we successfully parse some code that's invalid, that's not a big problem to me.

Once you have a working parser, I do want to add a more niche API that lets you manually expand the valid lookahead set in specific contexts. That's what this exclude is for.

This rule seems useful, but also a hassle to list all the keywords that should be excluded wherever an identifier should appear.

The way I'm imagining it, we'd only call exclude in one place in tree-sitter-c: we'd just add it to the _declarator rule like I showed above, in order to improve Tree-sitter's error recovery in the specific case of code like

float // not done typing yet
int foo;

maxbrunsfeld · 2019-02-01T16:54:55Z

I might be misunderstanding the way dominates works in the paper though. Am I missing something?

Overall, my impression is that fundamentally, I do want this concept to be context-specific, rather than global, whereas dominates and submitsTo are global concepts.

JoeBlanchard · 2019-02-05T04:56:53Z

Thanks for getting back @maxbrunsfeld

Overall, my impression is that fundamentally, I do want this concept to be context-specific, rather than global, whereas dominates and submitsTo are global concepts.

Yes, you are correct that dominates and submitsTo are global concepts. Your reasoning makes a lot of sense for the way you have written grammars in Tree-sitter. The approach of keywords, such as if and else, always dominating identifiers for the purpose of error detection would cause parse errors in the examples you provided despite being correct programs.

Additionally, based on the development timeline of Tree-sitter it is logical that implementing this behavior was not a priority.

However, grammars can be written such that submitsTo and dominates do work. For example, in the C example you provided.

This even comes up in older languages like C, because you can do stuff like this:
#define float double
#ifdef float
// ...
#endif
And in code like that☝️, we parse float as an identifier.

If we implicitly added keywords like if to every lookahead set containing identifier, it seems like we'd lose these behaviors.

Here a grammar could be written that has a macro_identifier and identifier rule that both match the same regular expression but are separate rules. However, the keywords would only dominate the identifier and not the macro_identifier. Thus, since a macro_identifier is in the valid lookahead set while identifier is not, the keywords will not be added and this will be parsed correctly.
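To make the comparison concrete, the following sketch models a global dominates relation as described in this comment. The DOMINATES table and the expandLookaheads function are hypothetical and are not part of Tree-sitter or of the paper's tooling. They only illustrate how keywords would be added to any lookahead set containing a terminal they dominate, and why a separate macro_identifier terminal avoids that.

```javascript
// Hypothetical global relation: each keyword lists the terminals it dominates.
const DOMINATES = {
  float: ['identifier'],
  int: ['identifier'],
};

// Add every keyword that dominates some terminal already in the lookahead set.
function expandLookaheads(validLookaheads) {
  const expanded = new Set(validLookaheads);
  for (const [keyword, dominated] of Object.entries(DOMINATES)) {
    if (dominated.some((terminal) => validLookaheads.has(terminal))) {
      expanded.add(keyword);
    }
  }
  return expanded;
}

// Assumed state where a declarator is expected: identifier is a valid lookahead.
const afterTypeName = expandLookaheads(new Set(['identifier', '*']));
// Assumed state after #define: only macro_identifier is a valid lookahead.
const afterDefine = expandLookaheads(new Set(['macro_identifier']));

console.log([...afterTypeName], [...afterDefine]);
```

In the first state the keywords float and int join the lookahead set, so they can no longer be lexed as identifiers there. In the second state nothing is added, so #define float double still sees float as a macro_identifier.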

The main advantage of submitsTo and dominates is that it transforms precedence, which is a total ordering on all terminals, to a partial ordering only on all terminals where precedence is important and those terminals will appear in the same valid lookahead set. This feature is very useful when writing extensible programming languages, which is what the MELT group does, but may not provide as much benefit to most languages used in Atom and on GitHub.

Thanks for getting back to me with such a detailed response! By the way, the reason I have been looking into Tree-sitter is that we are attempting to auto-generate the grammar.js file for a Tree-sitter specification from a Silver grammar specification to generate highlighters for ableC and extensions developed for ableC. I'll probably implement the use of submitsTo and dominates in Silver using the new EXCLUDE rule for now. When do you plan to merge the rule?

maxbrunsfeld · 2019-02-05T23:20:17Z

The main advantage of submitsTo and dominates is that it transforms precedence, which is a total ordering on all terminals, to a partial ordering only on all terminals where precedence is important

Yeah, I can see how that would help with maintainability in some cases, both for lexical disambiguation and for syntactic disambiguation as well. I think that numerical precedence is very easy to understand for people, but there are definitely cases where it's awkward to force things into a total ordering.

When do you plan to merge the rule?

Unfortunately, I got distracted with some other issues, and I haven't had time to work on this. I never even fully got it working correctly, so there's still a bit of work to do.

oxalica · 2021-10-07T19:40:09Z

I need the feature to exclude keywords from identifiers.

@maxbrunsfeld

treating all keywords as contextual

Yes for javascript but HUGE NO for other languages using hard keywords like nix-lang. We need this to be customizable.

I just hit the same issue: a hard keyword is incorrectly parsed as an identifier in a location where an identifier is expected, instead of raising an error. This heavily messes up error recovery, since the delimiter-like keyword is eaten as an identifier.

Given the simplified grammar rules:

word: $ => $.identifier,
rules: {
  identifier: $ => /[a-z]+/,
  let: $ => seq("let", repeat($.bind), "in", field("body", $.expr)),
  bind: $ => seq(field("attrpath", $.identifier), "=", field("expression", $.expr), ";"),
  expr: $ => choice($.identifier, $.number, $.app, ...),
  app: $ => prec.left(..., seq(field("function", $.expr), field("argument", $.expr))), // function apply
}

While we are typing a new bind b =,

let
  a = 1;
  b =
in
  a

The resulting tree looks like this:

(ERROR
  (bind
    (identifier)
    (integer))
  (identifier)
  (app
    (identifier)
    (identifier)))

Here we can see that the keyword in is parsed as part of a function application in a, which should be impossible. The parser consumes the keyword in and fails to construct the whole top-level let structure. (bind is not only used in let; it is highlighted differently depending on whether it is inside a let, so this breaks all highlighting inside the let.)

If we make the identifier rule exclude the keyword in, say with identifier: $ => /[a-hj-z][a-z]*|i[a-mo-z][a-z]*|in[a-z]+/, then we get a pretty good error-recovered result:

(let
  (bind
    (identifier)
    (integer))
  (ERROR
    (identifier))
  (identifier))

But of course it's nearly impossible to construct a regex to exclude tens of keywords while only matching other valid identifiers.
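A quick way to see how fragile this is: the sketch below anchors the hand-written regex from above and runs it against a few sample words using plain JavaScript RegExp (Tree-sitter's own regex handling may differ in details). The matches helper and the constant name are hypothetical and used only here.

```javascript
// The hand-written pattern from above, anchored so it must match a whole word.
const IDENTIFIER_WITHOUT_IN = /^(?:[a-hj-z][a-z]*|i[a-mo-z][a-z]*|in[a-z]+)$/;

// Hypothetical helper: does `word` count as an identifier under this pattern?
const matches = (word) => IDENTIFIER_WITHOUT_IN.test(word);

const results = {
  in: matches('in'),       // the keyword we want to reject
  inner: matches('inner'), // starts with "in" but is longer
  if: matches('if'),       // another word starting with "i"
  a: matches('a'),         // single-letter name
  i: matches('i'),         // single-letter name starting with "i"
};
console.log(results);
```

Excluding a single two-letter keyword already takes three alternatives, and the pattern as written also rejects the single-letter name i, which the original /[a-z]+/ accepts.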

maxbrunsfeld · 2021-10-07T19:52:15Z

I just hit the same issue: a hard keyword is incorrectly parsed as an identifier in a location where an identifier is expected, instead of raising an error. This heavily messes up error recovery, since the delimiter-like keyword is eaten as an identifier.

Yeah, I strongly agree. This is actually a problem for some of my own current use-cases as well. I plan to fix it pretty soon; I'm not sure if I'll go with this same exclude API though; it's been a long time since I worked on this, so I might come up with something better.

Let me know if you have suggestions for what you'd like to see in the grammar API.

Julian · 2021-10-07T20:01:14Z

I have also been watching this ticket, but because my parser became totally unusable (taking multiple hours to generate), I "hackily" implemented this functionality via helper functions.

My hack (which is still not good but at least unblocked me) is here: https://github.com/Julian/tree-sitter-lean/blob/main/grammar/util.js#L9-L29

oxalica · 2021-10-08T18:52:58Z

I'm kind of curious about how we would write a hard-keyword grammar with the exclude rule.

If we have identifier: $ => /\w+/.exclude('some', 'keyword') and word: $ => $.identifier then the keyword extraction just always fails, yes? So do we need another rule, keywords: $ => /\w+/ with word: $ => $.keywords, and do we also need to put keywords before identifier so that keywords match first, since these two regexes intersect?
This seems less intuitive than a new keywords grammar field.

maxbrunsfeld · 2021-10-08T19:39:44Z

the keyword extraction just always fails, yes?

No, I don't think it needs to affect keyword extraction in any way.

The .exclude('if', 'for', ...) method is not something that you would call in the definition of identifier (as you wrote). Instead, you would call it in some particular rule where you're using an identifier, to indicate that in that specific position, a given set of tokens should be treated as distinct from identifier. See the example in the PR body.

So in your grammar, you would prevent the error recovery mistake that you mentioned by changing your expr rule to say $.identifier.exclude('in').

There might turn out to be several different rules in which you want the same excludes (like bind and expr in your case). In that case, you could simply define a constant in your grammar file like this:

const FORBIDDEN_IDENTS = ['in', 'let'];

Then in a few places in your grammar, you might say this:

$.identifier.exclude(...FORBIDDEN_IDENTS)
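Putting these pieces together, here is a sketch of how the simplified grammar from earlier in the thread might use the proposed method. The symbol, $, choice, and seq definitions below are minimal stand-ins so the snippet runs on its own; in a real grammar.js they come from the Tree-sitter CLI. The exclude method follows the API proposed in this PR, and the EXCLUDE object shape shown here is an assumption made for illustration, not the actual JSON schema.

```javascript
// Stand-ins for the grammar DSL (assumption: simplified shapes, illustration only).
function symbol(name) {
  return {
    type: 'SYMBOL',
    name,
    // Method proposed in this PR; the returned object shape is hypothetical.
    exclude(...tokens) {
      return { type: 'EXCLUDE', content: { type: 'SYMBOL', name }, excluded: tokens };
    },
  };
}
const $ = new Proxy({}, { get: (_target, name) => symbol(name) });
const choice = (...members) => ({ type: 'CHOICE', members });
const seq = (...members) => ({ type: 'SEQ', members });

// Shared list of tokens that must not be lexed as an identifier in these positions.
const FORBIDDEN_IDENTS = ['in', 'let'];

const rules = {
  bind: ($) => seq($.identifier.exclude(...FORBIDDEN_IDENTS), '=', $.expr, ';'),
  expr: ($) => choice($.identifier.exclude(...FORBIDDEN_IDENTS), $.number),
};

// Evaluate each rule function against the stand-in `$`.
const built = Object.fromEntries(
  Object.entries(rules).map(([name, rule]) => [name, rule($)])
);
console.log(JSON.stringify(built.expr, null, 2));
```

Keeping the forbidden tokens in one constant means the list only has to be updated in one place, even when several rules apply the same excludes.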

oxalica · 2021-10-08T19:50:56Z

Instead, you would call it some particular rule where you're using an identifier

Okay so if we need hard keywords everywhere, we need something like this. Seems more acceptable.

word: $ => $.identifier_like,
rules: {
  identifier_like: $ => /\w+/,
  identifier: $ => $.identifier_like.exclude(...keywords),
  // ... tons of rules using $.identifier
}

maxbrunsfeld · 2021-10-12T21:20:18Z

Ok, I have brought this PR back from the dead. So far, I'm keeping the grammar API the same way I originally proposed it.

smatlapudi · 2021-12-05T07:00:39Z

@maxbrunsfeld, I am checking on this PR because I have the same issue with the following SQL code, which has invalid syntax (the , after col_one should not be there). The parser I generated does not report an error as expected; instead, it takes the from keyword as a column name.
Note: the from clause is optional in this SQL grammar, and a bare select expr, expr statement is valid.

select 
  col_one,
 from 
  table_one

Does this exclude solution also work for case-insensitive keywords (as in the SQL grammar here)?

This allows you to explicitly disallow certain tokens that would otherwise be matched by a more general regex.

JoostK · 2022-01-20T07:43:16Z

Asking here since this was closed: is there still incentive to support something like this in the future?

maxbrunsfeld · 2022-01-20T15:53:20Z

Yes, I’m going to work more on this, but give it a slightly different API and terminology. I’ll link to a new PR soon.

maxbrunsfeld · 2022-01-20T18:53:45Z

/cc @paf31

maxbrunsfeld · 2022-02-03T19:30:33Z

Subscribe to #1635 for further updates on this. I plan to land that PR soon, but it's not quite done.

This was referenced Dec 4, 2018

Undefined words are highlighted as keywords atom/language-c#308

Open

Use new exclude rule to ensure keywords are always recognized tree-sitter/tree-sitter-c#16

Open

maxbrunsfeld force-pushed the exclude-rule branch from 2a3eb12 to 5c57167 Compare December 4, 2018 20:13

maxbrunsfeld mentioned this pull request Dec 4, 2018

Expose the new EXCLUDE rule from the JS grammar API tree-sitter/tree-sitter-cli#50

Open

maxbrunsfeld force-pushed the master branch 4 times, most recently from 10d1ff6 to 3340168 Compare March 21, 2019 00:02

maxbrunsfeld force-pushed the master branch 2 times, most recently from 4a14bcd to f9a3998 Compare September 5, 2019 22:43

maxbrunsfeld mentioned this pull request Jul 14, 2020

tree-sitter does not detect potential conflict #688

Closed

Julian added a commit to Julian/tree-sitter-lean that referenced this pull request Apr 8, 2021

Can't be done without tree-sitter/tree-sitter#246

07224a9

ahelwer mentioned this pull request Apr 25, 2021

Is a tree-sitter generated ast suitable as the base ast for a new compile-to-wasm language? #1073

Closed

ahelwer mentioned this pull request Jul 9, 2021

Exclude keywords from identifiers tlaplus-community/tree-sitter-tlaplus#14

Closed

maxbrunsfeld force-pushed the exclude-rule branch from 5c57167 to 9e3e4af Compare October 12, 2021 21:17

maxbrunsfeld force-pushed the exclude-rule branch from c5909f8 to 3542f39 Compare October 13, 2021 16:08

jonatanklosko mentioned this pull request Dec 17, 2021

Improve highlighting in basic IEx snippets elixir-lang/tree-sitter-elixir#17

Closed

This was referenced Jan 3, 2022

Invalid variable names are not marked as errors tree-sitter/tree-sitter-javascript#204

Open

Incremental highlighting constantly changes color of unrelated lines emacs-tree-sitter/elisp-tree-sitter#200

Open

SalBakraa mentioned this pull request Jan 4, 2022

External safe nav fwcd/tree-sitter-kotlin#48

Merged

maxbrunsfeld added 2 commits January 19, 2022 16:42

Introduce an 'exclude' rule

3e869fe

This allows you to explicitly disallow certain tokens that would otherwise be matched by a more general regex.

Delete unused code, tweak whitespace

4821f03

maxbrunsfeld force-pushed the exclude-rule branch from 3542f39 to 4821f03 Compare January 20, 2022 00:42

Change API terminology to refer to 'keywords'

c3217f5

maxbrunsfeld closed this Jan 20, 2022

maxbrunsfeld deleted the exclude-rule branch January 20, 2022 00:56

maxbrunsfeld mentioned this pull request Feb 3, 2022

Add 'reserved word' construct: improve error recovery by avoiding treating reserved words as other tokens #1635

Closed

4 tasks

ahelwer mentioned this pull request May 25, 2022

How to encode a subtractive grammar rule? #1754

Closed

sogaiu mentioned this pull request Feb 22, 2023

Make it compilable as a c library sogaiu/tree-sitter-clojure#44

Closed

amaanq mentioned this pull request Nov 23, 2024

Add 'reserved word' construct again, with a better API #3896

Merged

6 participants
