Add 'reserved word' construct again, with a better API #3896
Merged
Looking at these test failures, I see that we still need to implement loading of the reserved-word-related fields on …
This would be very useful for Pascal. Does this work only for strings or also for regular expressions? Because Pascal is case-insensitive.
MarcinR-DevBrother added a commit to F1R3FLY-io/rholang-rs that referenced this pull request (Sep 11, 2025)
berecik pushed a commit to F1R3FLY-io/rholang-rs that referenced this pull request (Sep 15, 2025)
kopecs added a commit to semgrep/ocaml-tree-sitter-core that referenced this pull request (Feb 2, 2026):
Some major changes are needed:
- the `tree-sitter` CLI has had some breaking changes, notably in 0.24.0 where configuration moved to tree_sitter.json. Thus, the `--no-bindings` flag no longer exists (instead it should be configured through tree_sitter.json)
- the ABI of the generated code differs. We ask for ABI 14 (what was current for 0.20.6 and 0.22.6), but this is behind latest.
- Version 0.25.x added the `reserved` construct: <tree-sitter/tree-sitter#3896>, so we need to handle that when consuming generated `grammar.json` files.
This is a third attempt to solve the problem described in #246. It supersedes #1635.
Background
Tree-sitter uses context-aware tokenization - in a given parse state, Tree-sitter only recognizes tokens that are syntactically valid in that state. This is what allows Tree-sitter to tokenize languages correctly without requiring the grammar author to think about different lexer modes. In general, Tree-sitter is permissive in allowing words that are keywords in some places to be used freely as names in other places.
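A toy model can make context-aware tokenization more concrete. The sketch below is not Tree-sitter's implementation: the state names (`statement_start`, `after_dot`), the token sets, and the `lexWord` helper are all hypothetical, chosen only to show how the same word can be classified differently depending on which tokens are valid in the current state.

```javascript
// Toy model only. Real Tree-sitter derives the valid-token sets from its
// generated parse table; these state names and sets are hypothetical.
const VALID_TOKENS = {
  statement_start: ['if', 'identifier'],
  after_dot: ['identifier'],
};

const KEYWORDS = new Set(['if']);

// Classify one word, given the tokens that are valid in the current state.
function lexWord(word, state) {
  const valid = VALID_TOKENS[state];
  if (KEYWORDS.has(word) && valid.includes(word)) {
    return { type: word, text: word }; // the keyword is valid in this state
  }
  if (valid.includes('identifier')) {
    return { type: 'identifier', text: word }; // otherwise treat it as a name
  }
  return { type: 'ERROR', text: word };
}

console.log(lexWord('if', 'statement_start').type); // "if"
console.log(lexWord('if', 'after_dot').type);       // "identifier"
```

In this model the word `if` after a dot is silently accepted as a name, which is the permissive behavior described above.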
Sometimes this permissiveness causes unexpected error recoveries. Consider a syntax error in Rust in which an incomplete field access, `a.`, is followed on the next line by a statement that begins `if b`.
Currently, when tree-sitter-rust encounters this code, it doesn't detect an error until the word `b`, because it interprets the word `if` as a field/method name on the `a` object. It doesn't see `if` as a keyword, because the keyword `if` would not be valid in that position.
Because the error is detected too late, it's not possible to recover well. Tree-sitter fails to recognize the `if_statement`, and sees it instead as a continuation of the expression above.
(Image: rust tree with bad recovery)
The `reserved` property
In order to improve this error recovery, the grammar author needs a way to explicitly indicate that certain keywords are reserved. That is, even if they are not technically valid, they should still be recognized as separate from any other tokens that would match that string (such as an `identifier`). In Rust, most keywords like `if` and `let` are reserved in all contexts.
This PR introduces a new top-level property on grammars called `reserved`, which is an object much like a grammar's `rules` property. In this object, the first property represents the global reserved rules, so typically this should be called "global", though, much like the `rules` property's start rule, any name works.
When using this new feature, and parsing the same Rust code as above, the error is now detected at the correct time (at the `if` token), because Tree-sitter still treats `if` as a keyword and not an identifier, even though the keyword is unexpected. This allows error recovery to be much better: preserving the entire `if_statement`, and marking the incomplete `a.` line as an error.
(Image: rust tree with good recovery)
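As a rough illustration of the top-level `reserved` property described above, the fragment below sketches what the relevant part of a grammar file might look like. The grammar name, the keyword list, and the single `identifier` rule are assumptions made for the example, and the object is kept as a plain value (rather than being passed to the `grammar(...)` function that the tree-sitter CLI provides) so that it runs on its own.

```javascript
// Sketch of the relevant part of a grammar.js file. In a real grammar this
// object would be passed to the `grammar(...)` function from the tree-sitter
// CLI; it is a plain object here so the snippet is self-contained.
const rustLikeGrammar = {
  name: 'rust_like', // hypothetical grammar name

  // The Details section relates reserved words to the `word` token,
  // so one is declared here.
  word: $ => $.identifier,

  reserved: {
    // The first set acts as the global set; "global" is the usual name.
    global: $ => ['if', 'let', 'fn', 'return'], // abbreviated, illustrative list
  },

  rules: {
    // Only the rule needed by `word` is shown; all other rules are omitted.
    identifier: $ => /[a-zA-Z_][a-zA-Z0-9_]*/,
  },
};

console.log(Object.keys(rustLikeGrammar.reserved)[0]);
console.log(rustLikeGrammar.reserved.global({}));
```

Because the global set is identified by position rather than by name, keeping it as the first property of `reserved` matters more than what it is called.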
Contextual Reserved Words
Many languages have a more complex system of reserved words, in which words are reserved in some contexts, but not others. For example, in JavaScript, the word `if` cannot be used in a local declaration or an expression, but it can be used as the name of an object property.
The current version of tree-sitter-javascript will treat the `if` properties as valid, which is correct, but it will fail to detect the error on the `if` token on line 3, similarly to the Rust example described above.
(Image: javascript tree with bad error recovery)
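The contextual rule for `if` in JavaScript can be checked directly in a JavaScript engine. The snippet below is an illustration rather than the example from the PR; the `parses` helper is a hypothetical name, and it parses the invalid source indirectly through `new Function` so that the snippet itself remains valid.

```javascript
// `if` is accepted as a property name, in an object literal and after a dot.
const flags = { if: true };
const viaDot = flags.if;

// Report whether a piece of source text parses as a function body.
function parses(source) {
  try {
    new Function(source); // only parses the source; it is never called
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

console.log(viaDot);                             // true
console.log(parses('var o = { if: 1 }; o.if;')); // true
console.log(parses('var if = 1;'));              // false
```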
In order to allow the valid usages, but detect the invalid ones, the grammar author needs a way to indicate in the JavaScript grammar that `if` is normally a reserved word, but it is still allowed in property names.
The `reserved` Grammar Rule
In addition to the top-level `reserved` property, this PR also introduces a new rule function, `reserved(reservedWordSetName, rule)`, which lets you override the set of reserved words in a certain context. In the case of JavaScript, we actually want to remove all reserved words in the context of object properties.
In this particular case, we call `reserved()` with the name of the property in the `reserved` object, in this case, 'properties', to indicate that there are no reserved words in that context. In other use cases, we might pass an alternative set of reserved words. The reserved word set named by the first argument of `reserved` is the one used in this context.
Details
- Reserved words are only recognized where the `word` token would be valid. For example, when inside of a string literal, the reserved words won't cause the lexer to recognize the contents of a string as an `if` keyword.
- … the `global` set always has a precedence of 0, and `properties` in the JavaScript example has a precedence of 1.
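The `reserved()` rule function and the empty `properties` word set described above might be wired together roughly as follows. This is a sketch under stated assumptions: the grammar name, the keyword list, and the `property_name` rule are hypothetical, and `reserved` is defined locally as a stand-in so the snippet runs outside the tree-sitter CLI, where the real function is provided as a global and may return a different shape.

```javascript
// Stand-in for the DSL function of the same name. The real `reserved` is a
// global inside grammar.js; the shape of this return value is an assumption.
const reserved = (wordSetName, rule) => ({ kind: 'reserved', wordSetName, rule });

const jsLikeGrammar = {
  name: 'js_like', // hypothetical grammar name
  word: $ => $.identifier,

  reserved: {
    global: $ => ['if', 'var', 'function'], // abbreviated, illustrative list
    properties: $ => [],                    // no reserved words in property names
  },

  rules: {
    identifier: $ => /[a-zA-Z_$][a-zA-Z0-9_$]*/,
    // Property names use the empty `properties` set, so `if` is allowed here.
    property_name: $ => reserved('properties', $.identifier),
  },
};

// Hypothetical stand-in for the `$` object that resolves rule references.
const $ = { identifier: 'identifier' };
console.log(jsLikeGrammar.rules.property_name($).wordSetName); // "properties"
console.log(jsLikeGrammar.reserved.properties($).length);      // 0
```

Passing a set name instead of an inline list keeps every word set declared in one place, the top-level `reserved` object.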