Conversation
2a3eb12 to
5c57167
Compare
|
Hey Max, I have been working on developing Treesitter grammars for extensible languages developed by the Minnesota Extensible Language Tools Group. This group actually wrote the paper on context aware scanning that you mentioned as part of your underlying research for this project. This rule seems useful, but also a hassle to list all the keywords that should be excluded wherever an identifier should appear. I was wondering if you had considered implementing a version of lexer classes where these classes of terminals can submit to or dominate other classes of terminals to specify precedence. This can be used to specify that one prefers keywords over identifiers. The paper describes how this works fully. The benefit of doing this is, as mentioned in the paper in section 2 under the lexical precedence relation header, that we can always view higher precedence terms as being a member of the valid lookahead set when parsing so we can fix the same problem as you are doing with the exclude rule. This could be done once using submits to and dominates instead of excluding the keywords every place the identifier occurs. Also, using submits to and dominates with lexer classes defines a partial precedence on terminals that is less constraining and allows more flexibility than the total precedence currently required in Treesitter. This would help Treesitter support more languages more easily. I know it'd likely be a significant undertaking to implement this. So, I'm wondering your rational against doing this and why or why not you would consider implementing this in the future. I'm not sure where the best place to post this was, but I guess this will do. Thanks for your time! |
|
Good to hear from you @JoeBlanchard! There's some things that I like about the approach the paper describes, but I don't fully understand how it addresses this problem.
I don't think that's the behavior that I want though. By default, in any given parse state, I don't want any syntactically-invalid tokens to be added to the valid lookahead set. Rather, I want the default to be that we produce the most permissive possible parser-lexer pairing. For example, in JavaScript you can use words like if (a) {
const b = {
if: 1,
else: 2,
};
} else {
// ...
}In Tree-sitter's JavaScript grammar, this 'Just Works', with no extra effort because:
This even comes up in older languages like C, because you can do stuff like this: #define float double
#ifdef float
// ...
#endifAnd in code like that☝️, we parse If we implicitly added keywords like |
|
In other words, I want Tree-sitter's default to be highly permissive, treating all keywords as contextual, because most languages really do behave this way in many cases. Our number 1 priority is to successfully parse all code that's valid. If we successfully parse some code that's invalid, that's not a big problem to me. Once you have a working parser, I do want to add a more niche API that lets you manually expand the valid lookahead set in specific contexts. That's what this
The way I'm imagining it, we'd only call float // not done typing yet
int foo; |
|
I might be misunderstanding the way Overall, my impression is that fundamentally, I do want this concept to be context-specific, rather than global, whereas |
|
Thanks for getting back @maxbrunsfeld
Yes you are correct that Additionally, based on the development timeline of Tree-sitter it is logical that implementing this behavior was not a priority. However, grammars can be written such that
Here a grammar could be written that has a The main advantage of Thanks for getting back to me with such a detailed response! By the way, the reason I have been looking into Tree-sitter is that we are attempting to auto-generate the |
Yeah, I can see how that would help with maintainability in some cases, both for lexical disambiguation and for syntactic disambiguation as well. I think that numerical precedence is very easy to understand for people, but there are definitely cases where it's awkward to force things into a total ordering.
Unfortunately, I got distracted with some other issues, and I haven't had time to work on this. I never even fully got it working correctly, so there's still a bit of work to do. |
10d1ff6 to
3340168
Compare
4a14bcd to
f9a3998
Compare
|
I need the feature to exclude keywords from identifiers.
Yes for javascript but HUGE NO for other languages using hard keywords like nix-lang. We need this to be customizable. I just hit the same issue that an hard keyword is incorrectly parsed as an identifier in the location where identifier is expected, instead of raising an error. It heavily mess up the error recovery since the delimiter-like keyword is eaten as identifier. Given the simplified grammar rules: word: $ => $.identifier,
rules: {
identifier: $ => /[a-z]+/,
let: $ => seq("let", repeat($.bind), "in", field("body", $.expr)),
bind: $ => seq(field("attrpath", $.identifier), "=", field("expression", $.expr), ";"),
expr: $ => choice($.identifier, $.number, $.app, ...),
app: $ => prec.left(..., seq(field("function", $.expr), field("argument", $.expr))), // function apply
}When we typing a new bind let
a = 1;
b =
in
aThe tree is like Here we can see the keyword If we let identifier rule exclude keywords But of course it's nearly impossible to construct a regex to exclude tens of keywords while only matching other valid identifiers. |
Yeah, I strongly agree. This is actually a problem for some of my own current use-cases as well. I plan to fix it pretty soon; I'm not sure if I'll go with this same Let me know if you have suggestions for what you'd like to see in the grammar API. |
|
I also have been watching this ticket, but because my parser essentially became totally unusable (taking multiple hours to generate) I essentially "hackily" implemented this functionality via helper functions. My hack (which is still not good but at least unblocked me) is here: https://github.com/Julian/tree-sitter-lean/blob/main/grammar/util.js#L9-L29 |
|
I'm kind of curious about how we write a hard-keyword grammar with If we have |
No, I don't think it it needs to affect keyword extraction in any way. The So in your grammar, you would prevent the error recovery mistake that you mentioned by changing your There might turn out to be several different rules in which you want the same excludes (like const FORBIDDEN_IDENTS = ['in', 'let'];Then in a few places in your grammar, you might say this: $.identifier.exclude(...FORBIDDEN_IDENTS) |
Okay so if we need hard keywords everywhere, we need something like this. Seems more acceptable. word: $ => $.identifier_like,
rules: {
identifier_like: $ => /\w+/,
identifier: $ => $.identifier_like.exclude(...keywords),
// ... tons of rules using $.identifier
} |
5c57167 to
9e3e4af
Compare
|
Ok, I have brought this PR back from the dead. So far, I'm keeping the grammar API the same way I originally proposed it. |
c5909f8 to
3542f39
Compare
|
@maxbrunsfeld, I am checking on this PR as I have the same issue with following SQL code with invalid syntax ( Is this |
This allows you to explicitly disallow certain tokens that would otherwise be matched by a more general regex.
3542f39 to
4821f03
Compare
|
Asking here since this was closed: is there still incentive to support something like this in the future? |
|
Yes, I’m going to work more on this, but give it a slightly different API and terminology. I’ll link to a new PR soon. |
|
/cc @paf31 |
|
Subscribe to #1635 for further updates on this. I plan to land that PR soon, but it's not quite done. |
Background
Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes and states. In general, Tree-sitter tends to be permissive in allowing words that are keywords in some places to be used freely as names in other places.
Sometimes this permissiveness causes unexpected error recoveries. Consider this C syntax error:
Currently, when tree-sitter-c encounters this code, it doesn't detect an error until the word
main, because it interprets the wordintas a variable, declared with typefloat. It doesn't seeintas a keyword, because the keywordintwouldn't be allowed in that position.Solution
In order improve this error recovery, the grammar author needs a way to explicitly indicate that certain keywords are not allowed in certain places. For example in C, primitive types like
intand control-flow keywords likewhileare not allowed as variable names in declarators.This PR introduces a new
EXCLUDErule to the underlying JSON schema. From JavaScript, you can use it like this:Conceptually, you're saying "a declarator can match an identifier, but not these other tokens".
Implementation
Internally, all Tree-sitter needs to do is to insert the excluded tokens (
if,int, etc) into the set of valid lookahead symbols that it uses when tokenizing, in the relevant states. Then, when the lexer sees the string "if", it will recognize it as aniftoken, not just an identifier. Then, as always when there's an error, the parser will find that there are no valid parse actions for the tokenif.Alternatives
I could have instead introduced a new field on the entire grammar called
keywords. Then, if we addedifto the grammar'skeywords, the wordifwould always be treated as its own token, in every parse state.This is less general, and it wouldn't really work AFAICT. Even in C, there are states where the word
intshould not be treated as a keyword. For example, inside of a string ("int"), as the name of a macro definition#define int. And in other languages, there are many more cases like this than there are in C. For example in JavaScript, it's fine to have an object property namedif.Relevant Issues
atom/language-c#308