Add 'reserved word' construct: improve error recovery by avoiding treating reserved words as other tokens#1635
Add 'reserved word' construct: improve error recovery by avoiding treating reserved words as other tokens#1635maxbrunsfeld wants to merge 16 commits intomasterfrom
Conversation
Reserving a set of tokens makes those tokens always valid from the lexer's perspective, even if they aren't valid syntactically. You can set a default set of reserved tokens using the top-level `reserved` property on the grammar, and selectively override this set of reserved tokens in certain contexts using the `reserved` rule function.
Not other tokens with which they overlap
Dont' rely on a special parse action list
|
Great! This would be the ideal answer to #1472. I was able to solve with a tricky scanner, but this sounds much better. |
|
Ok, I believe this PR is now working correctly. I've opened PRs making use of the API in some grammars. There are some improvements that I'd like to make before merging and releasing this. EDIT - moved these TODOs to tasks in the PR description. |
|
Trying another way: can the property be massaged to produce an error when an identifier contains an embedded reserved word? tl;dr Suppose |
|
First of all, thank you very much for this - I think this looks like a great approach and has the potential to be a very straightforward API for working with error recovery and incomplete documents. I have a couple of notes, having tried this out today, and I hope to be able to provide some more detailed notes soon:
Thanks! |
I didn't expect that. I'd like to fix it before merging. Can you share a grammar that reproduces the issue?
They shouldn't need to be strings. I do think that right now, they need to be existing tokens in the grammar which "overlap" (match some common strings) with your grammar's
Yeah, although there is some subtlety right now in determining which keywords are in the context at a given point. This is related to the second problem I mentioned above, about the |
@dfgordon For right now, I don't think this feature will support that. I think using custom code in an external scanner is probably the best way to do it. |
|
@maxbrunsfeld Thanks for your reply. We're putting together a small grammar which reproduces the issue we're seeing, and we will follow up shortly. |
|
@maxbrunsfeld we put up demo grammar as @paf31 promised: https://github.com/DeepChannel/reserved-word Here are some of the issues we are experiencing:
This works: reserved: ($) => [$.SELECT, $.AS, $.UNION, $.INTERSECT, $.EXCEPT],But this does not: reserved: ($) => ["SELECT", "AS", "UNION", "INTERSECT", "EXCEPT"],
result[value.toUpperCase()] = (_) => toCaseInsensitiveRule(value);
// result[value.toUpperCase()] = (_) => value.toUpperCase();
alias: ($) =>
choice(
$.identifier,
seq(
$.AS,
// Allow any reserved keywords except SELECT
reserved([$.SELECT], $.identifier)
)
),
"reserved": [
{
"type": "SYMBOL",
"name": "SELECT"
},
{
"type": "SYMBOL",
"name": "AS"
},
{
"type": "SYMBOL",
"name": "UNION"
},
{
"type": "SYMBOL",
"name": "INTERSECT"
},
{
"type": "SYMBOL",
"name": "EXCEPT"
}
]
#define MAX_RESERVED_WORD_SET_SIZE 5Maybe this is the issue related to .wasm? Following example should not parse UNION keyword as identifier: SELECT a +
UNION
SELECT 1 SELECT, a - bParse tree: I would expect it to look something like this: |
|
@maxbrunsfeld below are benchmark results for our relatively large grammar. Benchmark results for Benchmark results for
Machine: |
I meant to follow up with a bit more context here, based on the example repo that @ioseb provided. If you try the playground in that repo, you can see the problem that I was originally referring to:
Output for As you can see, the ERROR range was introduced surrounding If you could please help me to understand what the intended behavior should be here, I'd appreciate it, because my understanding is that the parser should have backtracked at this point, and returned some tree in which no Thanks again! |
|
I'm working on a D grammar for tree-sitter. D is a C family language, and like most such languages has a some (D has a plethora of them) words that are reserved in all contexts. There are a ton of conflicts in the grammar that I'm very happy that this PR will address. (At one point I considered moving the entire logic of isolating keywords from identifiers in the lexer -- I had an implementation that did it even -- but I very much like the approach here better.) Btw, my lexer does identify strings, comments, and a couple of other oddball constructs so that they show up as tokens and cannot conflict with reserved words. |
|
There is also another way that help when we need |
|
I'm working on a pretty simple search DSL that allows boolean operators: and, or, not, and search tokens. From other discussions, it seems the only way is to do this in an external scanner. I want to contribute one way of doing it using regex. Basically I'm negating and/or/not, but it's very ugly as negative lookahead is not supported. I very much look forward to having this feature merged. |
|
Hey, just commenting for everyone in this thread that's interested, #3896 supersedes this PR, and is now working with everything on the todo list fixed minus the Questions/Discussion should happen in the follow-up PR, so I'll be closing this |
This is a second attempt to solve a problem described in #246. I'll copy some of the description here.
Background
Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes. In general, Tree-sitter is permissive in allowing words that are keywords in some places to be used freely as names in other places.
Sometimes this permissiveness causes unexpected error recoveries. Consider this syntax error in Rust:
Currently, when tree-sitter-rust encounters this code, it doesn't detect an error until the word
b, because it interprets the wordifas a field/method name on theaobject. It doesn't seeifas a keyword, because the keywordifwould not be valid in that position.Because the error is detected too late, it's not possible to recover well. Tree-sitter fails to recognize the
if_statement, and sees it instead as a continuation of the expression above:rust tree with bad recovery
The
reservedpropertyIn order improve this error recovery, the grammar author needs a way to explicitly indicate that certain keywords are reserved. That is - even if they are not technically valid, they should still be recognized as separate from any other tokens that would match that string (such as an
identifier). In Rust, most keywords likeifandletare reserved in all contexts.This PR introduces a new top-level property on grammars called
reserved, which lets you list these reserved words.When using this new feature, and parsing the same rust code as above, the error is now detected at the correct time (at the
iftoken), because Tree-sitter still treatsifas a keyword and not an identifier, even though the keyword is unexpected. This allows error recovery to be much better: preserving the entireif_statement, and marking the incompletea.line as an error.rust tree with good recovery
Contextual Reserved Words
Many languages have a more complex system of reserved words, in which words are reserved in some contexts, but not others. For example, in JavaScript, the word
ifcannot be used in a local declaration or an expression, but it can be used as the name of an object property:The current version of tree-sitter-javascript will treat the
ifproperties as valid, which is correct, but it will fail to detect the error on theiftoken on line 3, similarly to the Rust example described above.javscript tree with bad error recovery
In order to allow the valid usages, but detect the invalid ones, the grammar author needs a way to indicate in the JavaScript grammar that
ifis normally a reserved word, but it is still allowed in property names.The
reservedGrammar RuleIn addition to the top-level
reservedproperty, this PR also introduces a new rule function,reserved(reservedWords, rule), which lets you override the set of reserved words in a certain context. In the case of JavaScript, we actually want to remove all reserved words in the context of object properties:In this particular case, we call
reserved()with an empty array to indicate that there are no reserved words in that context. In other use cases, we might pass an alternative set of reserved words.Details
wordtoken would be valid. For example, when inside of a string literal, the reserved words won't cause the lexer to recognize the contents of a string as anifkeyword.Remaining Tasks
reservedfunction correctly. Change the semantics of reserved word overrides so that one reserved word set is always chosen for a given state; don't join multiple sets by taking their union. If one rule has higher precedence, take its reserved word set. Raise an error if there are multiple reserved word sets with the same precedence.To implement this, many Tree-sitter grammars have a rule called "reserved_identifier" or something like that, which wraps a set of keywords, and aliases them as an identifier. This works, but it's a bit confusing, and I'd like to remove the need for this pattern. I think that now, we should be able to automatically apply this kind of handling to any keyword that is not reserved.