Skip to content

Introduce an EXCLUDE rule#246

Closed
maxbrunsfeld wants to merge 3 commits intomasterfrom
exclude-rule
Closed

Introduce an EXCLUDE rule#246
maxbrunsfeld wants to merge 3 commits intomasterfrom
exclude-rule

Conversation

@maxbrunsfeld
Copy link
Contributor

Background

Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes and states. In general, Tree-sitter tends to be permissive in allowing words that are keywords in some places to be used freely as names in other places.

Sometimes this permissiveness causes unexpected error recoveries. Consider this C syntax error:

float // <-- error
int main() {}

Currently, when tree-sitter-c encounters this code, it doesn't detect an error until the word main, because it interprets the word int as a variable, declared with type float. It doesn't see int as a keyword, because the keyword int wouldn't be allowed in that position.

Solution

In order improve this error recovery, the grammar author needs a way to explicitly indicate that certain keywords are not allowed in certain places. For example in C, primitive types like int and control-flow keywords like while are not allowed as variable names in declarators.

This PR introduces a new EXCLUDE rule to the underlying JSON schema. From JavaScript, you can use it like this:

declarator: choice(
  $.pointer_declarator,
  $.array_declarator,
  $.identifier.exclude('if', 'int', ...etc)
)

Conceptually, you're saying "a declarator can match an identifier, but not these other tokens".

Implementation

Internally, all Tree-sitter needs to do is to insert the excluded tokens (if, int, etc) into the set of valid lookahead symbols that it uses when tokenizing, in the relevant states. Then, when the lexer sees the string "if", it will recognize it as an if token, not just an identifier. Then, as always when there's an error, the parser will find that there are no valid parse actions for the token if.

Alternatives

I could have instead introduced a new field on the entire grammar called keywords. Then, if we added if to the grammar's keywords, the word if would always be treated as its own token, in every parse state.

This is less general, and it wouldn't really work AFAICT. Even in C, there are states where the word int should not be treated as a keyword. For example, inside of a string ("int"), as the name of a macro definition #define int. And in other languages, there are many more cases like this than there are in C. For example in JavaScript, it's fine to have an object property named if.

Relevant Issues

atom/language-c#308

@JoeBlanchard
Copy link

Hey Max,

I have been working on developing Treesitter grammars for extensible languages developed by the Minnesota Extensible Language Tools Group. This group actually wrote the paper on context aware scanning that you mentioned as part of your underlying research for this project.

This rule seems useful, but also a hassle to list all the keywords that should be excluded wherever an identifier should appear. I was wondering if you had considered implementing a version of lexer classes where these classes of terminals can submit to or dominate other classes of terminals to specify precedence. This can be used to specify that one prefers keywords over identifiers. The paper describes how this works fully.

The benefit of doing this is, as mentioned in the paper in section 2 under the lexical precedence relation header, that we can always view higher precedence terms as being a member of the valid lookahead set when parsing so we can fix the same problem as you are doing with the exclude rule. This could be done once using submits to and dominates instead of excluding the keywords every place the identifier occurs.

Also, using submits to and dominates with lexer classes defines a partial precedence on terminals that is less constraining and allows more flexibility than the total precedence currently required in Treesitter. This would help Treesitter support more languages more easily. I know it'd likely be a significant undertaking to implement this. So, I'm wondering your rational against doing this and why or why not you would consider implementing this in the future.

I'm not sure where the best place to post this was, but I guess this will do. Thanks for your time!

@maxbrunsfeld
Copy link
Contributor Author

maxbrunsfeld commented Feb 1, 2019

Good to hear from you @JoeBlanchard! There's some things that I like about the approach the paper describes, but I don't fully understand how it addresses this problem.

we can always view higher precedence terms as being a member of the valid lookahead set when parsing so we can fix the same problem as you are doing with the exclude rule.

I don't think that's the behavior that I want though. By default, in any given parse state, I don't want any syntactically-invalid tokens to be added to the valid lookahead set. Rather, I want the default to be that we produce the most permissive possible parser-lexer pairing.

For example, in JavaScript you can use words like if and else as identifiers, as long as you're in a context where the if and else keywords wouldn't be valid.

if (a) {
  const b = {
    if: 1,
    else: 2,
  };
} else {
  // ...
}

In Tree-sitter's JavaScript grammar, this 'Just Works', with no extra effort because:

  • When an if keywords is valid, it "dominates" the identifier token because of Tree-sitter's default rule that String tokens dominate Regex tokens.
  • Inside of the Object literal above, the if and else keywords are not valid lookaheads.

This even comes up in older languages like C, because you can do stuff like this:

#define float double
#ifdef float
// ...
#endif

And in code like that☝️, we parse float as an identifier.

If we implicitly added keywords like if to every lookahead set containing identifier, it seems like we'd lose these behaviors.

@maxbrunsfeld
Copy link
Contributor Author

In other words, I want Tree-sitter's default to be highly permissive, treating all keywords as contextual, because most languages really do behave this way in many cases. Our number 1 priority is to successfully parse all code that's valid. If we successfully parse some code that's invalid, that's not a big problem to me.

Once you have a working parser, I do want to add a more niche API that lets you manually expand the valid lookahead set in specific contexts. That's what this exclude is for.

This rule seems useful, but also a hassle to list all the keywords that should be excluded wherever an identifier should appear.

The way I'm imagining it, we'd only call exclude in one place in tree-sitter-c: we'd just add it to the _declarator rule like I showed above, in order to improve Tree-sitter's error recovery in the specific case of code like

float // not done typing yet
int foo;

@maxbrunsfeld
Copy link
Contributor Author

I might be misunderstanding the way dominates works in the paper though. Am I missing something?

Overall, my impression is that fundamentally, I do want this concept to be context-specific, rather than global, whereas dominates and submitsTo are global concepts.

@JoeBlanchard
Copy link

Thanks for getting back @maxbrunsfeld

Overall, my impression is that fundamentally, I do want this concept to be context-specific, rather than global, whereas dominates and submitsTo are global concepts.

Yes you are correct that domiantes and submitsTo are global concepts. Your reasoning makes a lot of sense for the way you have written grammars in Tree-sitter. The approach of keywords, such as if and else, always dominating identifiers for the purpose of error detection would cause parse errors in the examples you provided despite being correct programs.

Additionally, based on the development timeline of Tree-sitter it is logical that implementing this behavior was not a priority.

However, grammars can be written such that submitsTo and dominates do work. For example, in the C example you provided.

This even comes up in older languages like C, because you can do stuff like this:

#define float double
#ifdef float
// ...
#endif

And in code like that☝️, we parse float as an identifier.

If we implicitly added keywords like if to every lookahead set containing identifier, it seems like we'd lose these behaviors.

Here a grammar could be written that has a macro_identifier and identifier rule that both match the same regular expression but are separate rules. However, the keywords would only dominate the identifier and not the macro_identifier. Thus, since a macro_identifier is in the valid lookahead set while identifier is not, the keywords will not be added and this will be parsed correctly.

The main advantage of submitsTo and dominates is that it transforms precedence, which is a total ordering on all terminals, to a partial ordering only on all terminals where precedence is important and those terminals will appear in the same valid lookahead set. This feature is very useful when writing extensible programming languages which is what the MELT group does, but may not provide as much benefit to most languages used in Atom and on Github.

Thanks for getting back to me with such a detailed response! By the way, the reason I have been looking into Tree-sitter is that we are attempting to auto-generate the grammar.js file for a Tree-sitter specification from a Silver grammar specification to generate highlighters for ableC and extensions developed for ableC. I'll probably implement the use of submitsTo and dominates in Silver using the new EXCLUDE rule for now. When do you plan to merge the rule?

@maxbrunsfeld
Copy link
Contributor Author

The main advantage of submitsTo and dominates is that it transforms precedence, which is a total ordering on all terminals, to a partial ordering only on all terminals where precedence is important

Yeah, I can see how that would help with maintainability in some cases, both for lexical disambiguation and for syntactic disambiguation as well. I think that numerical precedence is very easy to understand for people, but there are definitely cases where it's awkward to force things into a total ordering.

When do you plan to merge the rule?

Unfortunately, I got distracted with some other issues, and I haven't had time to work on this. I never even fully got it working correctly, so there's still a bit of work to do.

@oxalica
Copy link

oxalica commented Oct 7, 2021

I need the feature to exclude keywords from identifiers.

@maxbrunsfeld

treating all keywords as contextual

Yes for javascript but HUGE NO for other languages using hard keywords like nix-lang. We need this to be customizable.

I just hit the same issue that an hard keyword is incorrectly parsed as an identifier in the location where identifier is expected, instead of raising an error. It heavily mess up the error recovery since the delimiter-like keyword is eaten as identifier.

Given the simplified grammar rules:

word: $ => $.identifier,
rules: {
  identifier: $ => /[a-z]+/,
  let: $ => seq("let", repeat($.bind), "in", field("body", $.expr)),
  bind: $ => seq(field("attrpath", $.identifier), "=", field("expression", $.expr), ";"),
  expr: $ => choice($.identifier, $.number, $.app, ...),
  app: $ => prec.left(..., seq(field("function", $.expr), field("argument", $.expr))), // function apply
}

When we typing a new bind b =,

let
  a = 1;
  b =
in
  a

The tree is like

(ERROR
  (bind
    (identifier)
    (integer))
  (identifier)
  (app
    (identifier)
    (identifier)))

Here we can see the keyword in is parsed as a function application in a, which should be impossible. It consume the keyword in and failed to construct the top-level whole let structure. (bind is not only used in let, they have different highlighting depends on if it's in let. So this breaks all highlighting inside the let)

If we let identifier rule exclude keywords in, to say identifier: $ => /[a-hj-z][a-z]*|i[a-mo-z][a-z]*|in[a-z]+/, then we get a pretty good error recovered result:

(let
  (bind
    (identifier)
    (integer))
  (ERROR
    (identifier))
  (identifier))

But of course it's nearly impossible to construct a regex to exclude tens of keywords while only matching other valid identifiers.

@maxbrunsfeld
Copy link
Contributor Author

maxbrunsfeld commented Oct 7, 2021

I just hit the same issue that an hard keyword is incorrectly parsed as an identifier in the location where identifier is expected, instead of raising an error. It heavily mess up the error recovery since the delimiter-like keyword is eaten as identifier.

Yeah, I strongly agree. This is actually a problem for some of my own current use-cases as well. I plan to fix it pretty soon; I'm not sure if I'll go with this same exclude API though; it's been a long time since I worked on this, so I might come up with something better.

Let me know if you have suggestions for what you'd like to see in the grammar API.

@Julian
Copy link
Contributor

Julian commented Oct 7, 2021

I also have been watching this ticket, but because my parser essentially became totally unusable (taking multiple hours to generate) I essentially "hackily" implemented this functionality via helper functions.

My hack (which is still not good but at least unblocked me) is here: https://github.com/Julian/tree-sitter-lean/blob/main/grammar/util.js#L9-L29

@oxalica
Copy link

oxalica commented Oct 8, 2021

I'm kind of curious about how we write a hard-keyword grammar with exclude rule.

If we have identifier: $ => /\w+/.exclude('some', 'keyword') and word: $ => $.identifier then the keyword extraction just always fails, yes? So do we need to have another keywords: $ => /\w+/ with word: $ => $.keywords, and also have keywords before identifier to match keywords first since these two regex intersects?
This seems not quite intuitive comparing to a new keywords grammar field.

@maxbrunsfeld
Copy link
Contributor Author

maxbrunsfeld commented Oct 8, 2021

the keyword extraction just always fails, yes?

No, I don't think it it needs to affect keyword extraction in any way.

The .exclude('if', 'for', ...) method is not something that you would call in the definition of identifier (as you wrote). Instead, you would call it some particular rule where you're using an identifier, to indicate that in that specific position, a given set of tokens should be treated as distinct from identifier. See the example in the PR body.

So in your grammar, you would prevent the error recovery mistake that you mentioned by changing your expr rule to say $.identifier.exclude('in').

There might turn out to be several different rules in which you want the same excludes (like bind and expr in your case). In that case, you could simply define a constant in your grammar file like this:

const FORBIDDEN_IDENTS = ['in', 'let'];

Then in a few places in your grammar, you might say this:

$.identifier.exclude(...FORBIDDEN_IDENTS)

@oxalica
Copy link

oxalica commented Oct 8, 2021

Instead, you would call it some particular rule where you're using an identifier

Okay so if we need hard keywords everywhere, we need something like this. Seems more acceptable.

word: $ => $.identifier_like,
rules: {
  identifier_like: $ => /\w+/,
  identifier: $ => $.identifier_like.exclude(...keywords),
  // ... tons of rules using $.identifier
}

@maxbrunsfeld
Copy link
Contributor Author

Ok, I have brought this PR back from the dead. So far, I'm keeping the grammar API the same way I originally proposed it.

@smatlapudi
Copy link

@maxbrunsfeld, I am checking on this PR as I have the same issue with following SQL code with invalid syntax ( , after col_one should not be there). The parser I generated not throwing error as expected but it is taking from keyword as column name.
Note: from clause is optional this SQL grammar and just a select expr, expr statement is valid one.

select 
  col_one,
 from 
  table_one

Is this exclude solution work for case insensitive keywords ( like SQL grammar here) use-case also ?

This allows you to explicitly disallow certain tokens that would otherwise
be matched by a more general regex.
@maxbrunsfeld maxbrunsfeld deleted the exclude-rule branch January 20, 2022 00:56
@JoostK
Copy link

JoostK commented Jan 20, 2022

Asking here since this was closed: is there still incentive to support something like this in the future?

@maxbrunsfeld
Copy link
Contributor Author

Yes, I’m going to work more on this, but give it a slightly different API and terminology. I’ll link to a new PR soon.

@maxbrunsfeld
Copy link
Contributor Author

/cc @paf31

@maxbrunsfeld
Copy link
Contributor Author

Subscribe to #1635 for further updates on this. I plan to land that PR soon, but it's not quite done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants