Use DSL v2 for parser codegen by Xanewok · Pull Request #650 · NomicFoundation/slang

Xanewok · 2023-11-10T14:18:00Z

Part of #638

Outline

The new definition is used to construct the grammar definition model from the DSL v1, which is then used by the codegen to create the parser and rule/token kinds, like before.

The rough translation done here is to translate every struct/enum/repeated/separated/precedence expression as a rule/non-terminal kind (with the exception of choice nodes, which are inlined, as per #650 (comment)); the rest (i.e. trivia, fragment, token, keywords) are translated as "old" tokens/terminals.

More context

In general, this requires more polish and decision which rules are we comfortable with leaving and fixing leftover issues such as explicit PrecedenceExpression#rule_name added to fit the old DSL v1 model/codegen better.

I specifically didn't want to touch the v1 codegen/model yet, because it helped being able to incrementally migrate to a new definition and I wanted to limit changes done in a single PR, so that it's easier to review. We can modify the codegen/model to fit the v2 model better once we fully migrate and remove the old DSL v1.

The translation/construction is done at build "run-time" rather than inside the newly introduced definition "compiler" proc macro. There's not much explicit reason for that, other than the fact that it was quicker for me to work on (quicker dev cycle) and the logic is just plain Rust code. This can be moved inside the proc macro later on, if we decide that it's better that way.

Differences

Because the DSL v2 takes a more named and structured approach (as it's designed to be a model for the strongly typed CST/AST), translating it 1-1 to our current parser structure is impossible or even desired.

There are some rules that were introduced and it doesn't make much sense for some to replicate the old behaviour.

At the moment, it's impossible to specify an optional separated item, so the following ones are introduced as wrappers:
- TupleValue (used in TupleValuesList)
- TupleMemberDeconstruction (used in TupleMembersList)
Some sequences in the structs are optional, but these are now introduced and named, rather than inlined as before:
- IndexAccessEnd
- ImportAlias
- UsingAlias
- VariableDeclarationValue
~~Previously inlined sequence parsers now have to be named~~ (fixed with Retrofit {Hex,Decimal}NumberExpression from v2 into v0/v1 spec #657)
- NumericExpression is now split into two, now named, choices: HexNumberExpression and DecimalNumberExpression

Moreover, the following was done to bring back the old CST shape as much as possible:

Some of the new rules where renamed back to the old ones, e.g. some repeated nodes have List suffix again
The (Yul)Statement are outlined (as in v0/v1) and re-introduced as a struct wrapper
ArgumentsDeclaration is outlined (as in v0/v1) as well
FunctionCallOptions is outlined again as a (NamedArgs NamedArgs*) rather than NamedArgs | NamedArgs+ to better match the old CST in the singular case (but differs from the old CST in the multiple case)
Fix some language definitions in DSL v1/v2 #653

This was done to unify the definitions where possible and to help with reviewing the CST changes by minimizing the shape differences. Once we move off v0 (#637) and v1 (#638), we will be free to change the final shape, since the old definitions will stop tying us down quite a bit.

Outstanding issues

Separate outstanding FIXMEs that need to be done after the migration into a task list
- LeadingTrivia in v2 is using the v1 definition for now
- SourceUnit is hacked to always be followed by Leading Trivia (copies v1; trivia model in v2 is a bit different)
- Clean up PrecedenceExpression#rule_name and adapt the codegen model to fit the v2 definition better
- Stop leaking identifiers by adapting either v1 or v2 models
- Keyword trie inclusion should be reworked to not require synthetic rules over all keywords (v1 model) and to properly allow keywords to overlap identifiers (Allow keywords to overlap identifiers #568)
Fix the Prettier error caused by newly possible nested, empty rules:

[error] crates/solidity/testing/snapshots/cst_output/TupleExpression/empty/generated/0.4.11-success.yml: SyntaxError: All collection items must start at the same column (11:9)
[error]    9 |   - TupleExpression (Rule): # 0..3 "( )"
[error]   10 |       - OpenParen (Token): "(" # 0..1
[error] > 11 |       - TupleValuesList (Rule): [] # 1..1
[error]      |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[error] > 12 |           - TupleValue (Rule): [] # 1..1
[error]      | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[error] > 13 |       - CloseParen (Token): ")" # 2..3
[error]      | ^
[error]   14 |

changeset-bot · 2023-11-10T14:18:04Z

⚠️ No Changeset found

Latest commit: e9add9b

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

crates/solidity/inputs/language/src/definition.rs

crates/codegen/language/definition/src/model/types.rs

crates/solidity/outputs/cargo/crate/src/generated/kinds.rs

...ots/cst_output/ContractDefinition/member_constructor_definition/generated/0.4.22-success.yml

...ty/testing/snapshots/cst_output/Expression/function_call_options/generated/0.6.2-success.yml

...dity/testing/snapshots/cst_output/SourceUnit/end_of_file_trivia/generated/0.8.22-failure.yml

Xanewok · 2023-11-14T20:05:15Z

With b5c51ae it's now possible to specify an optional (inline) repeated item, so this removes one more CST difference and solves the last Prettier error, incidentally.

crates/codegen/language/definition/src/model/types.rs

Part of #652 - Fixes versioning for `UsingDirectiveSymbol` in DSL v1 - Fixes event parameter definition in DSL v2 - Fixes `ElseBranch` sequence fields to be required themselves in DSL v2 - Syncs ordered choice for `(Yul)Statement` across DSLs (helps minimize diff for #650) I'll probably bundle remaining renames and restructures for DSL v2 in #650 already, but wanted to land this here since it fixes some tests and definitions separately from migrating the parser to DSL v2.

Part of #652 Split out of #650

…656) Part of #638 Split out of #650

Split from #650 We talked about this recently at our stand-up. These are never reserved in the default context and are only used in the (lexical) context of pragmas, so let's reflect that in the spec. Not updating v1 in this context since #650 is about to obsolete that and v0 does not have lexical contexts that might be impacted.

Part of #652 Helps with #650 These make more sense, given that these are the names used in the official grammar (fixed-bytes, signed-integer-type, unsigned-integer-type) and "XXXKeyword" name implies a single atom and not a more complex rule (e.g. looking at the `fixed` definition...). I think we should revert the accidental name change in v2 and later decide if we want to change these. EDIT: This now does the opposite and backports the name change to v0/v1.

crates/solidity/outputs/cargo/build/src/main.rs

crates/codegen/language/definition/src/model/types.rs

crates/solidity/inputs/language/src/dsl.rs

crates/solidity/inputs/language/src/grammar.rs

crates/solidity/outputs/cargo/build/src/main.rs

crates/solidity/outputs/cargo/crate/src/generated/language.rs

Current codegen relies on the rule to be defined in the same lexical context that the referred to keyword in order for it to be included in this context's trie, so we move the assembly statement up to the default context, so that the `assembly` keyword is picked up.

OmarTawfik

Looks great! Thank you!

Xanewok · 2023-11-28T21:31:44Z

Thanks a lot for the reviews! Here goes nothing...

Part of #638 ### Outline The new definition is used to construct the grammar definition model from the DSL v1, which is then used by the codegen to create the parser and rule/token kinds, like before. The rough translation done here is to translate every struct/enum/repeated/separated/precedence expression as a rule/non-terminal kind (with the exception of choice nodes, which are inlined, as per NomicFoundation/slang#650 (comment)); the rest (i.e. trivia, fragment, token, keywords) are translated as "old" tokens/terminals. ### More context In general, this requires more polish and decision which rules are we comfortable with leaving and fixing leftover issues such as explicit `PrecedenceExpression#rule_name` added to fit the old DSL v1 model/codegen better. I specifically didn't want to touch the v1 codegen/model yet, because it helped being able to incrementally migrate to a new definition and I wanted to limit changes done in a single PR, so that it's easier to review. We can modify the codegen/model to fit the v2 model better once we fully migrate and remove the old DSL v1. The translation/construction is done at build "run-time" rather than inside the newly introduced definition "compiler" proc macro. There's not much explicit reason for that, other than the fact that it was quicker for me to work on (quicker dev cycle) and the logic is just plain Rust code. This can be moved inside the proc macro later on, if we decide that it's better that way. ### Differences Because the DSL v2 takes a more named and structured approach (as it's designed to be a model for the strongly typed CST/AST), translating it 1-1 to our current parser structure is impossible or even desired. There are some rules that were introduced and it doesn't make much sense for some to replicate the old behaviour. 1. At the moment, it's impossible to specify an optional *separated* item, so the following ones are introduced as wrappers: - `TupleValue` (used in `TupleValuesList`) - `TupleMemberDeconstruction` (used in `TupleMembersList`) 2. Some sequences in the structs are optional, but these are now introduced and named, rather than inlined as before: - `IndexAccessEnd` - `ImportAlias` - `UsingAlias` - `VariableDeclarationValue` 3. ~Previously inlined sequence parsers now have to be named~ (fixed with #657) - `NumericExpression` is now split into two, now named, choices: `HexNumberExpression` and `DecimalNumberExpression` Moreover, the following was done to bring back the old CST shape as much as possible: - Some of the new rules where renamed back to the old ones, e.g. some repeated nodes have `List` suffix again - The (Yul)Statement are outlined (as in v0/v1) and re-introduced as a struct wrapper - `ArgumentsDeclaration` is outlined (as in v0/v1) as well - `FunctionCallOptions` is outlined again as a `(NamedArgs NamedArgs*)` rather than `NamedArgs | NamedArgs+` to better match the old CST in the singular case (but differs from the old CST in the multiple case) - NomicFoundation/slang#653 This was done to unify the definitions where possible and to help with reviewing the CST changes by minimizing the shape differences. Once we move off v0 (#637) and v1 (#638), we will be free to change the final shape, since the old definitions will stop tying us down quite a bit. ### Outstanding issues - [x] Separate outstanding FIXMEs that need to be done after the migration into a task list * LeadingTrivia in v2 is using the v1 definition for now * SourceUnit is hacked to always be followed by Leading Trivia (copies v1; trivia model in v2 is a bit different) * Clean up `PrecedenceExpression#rule_name` and adapt the codegen model to fit the v2 definition better * Stop leaking identifiers by adapting either v1 or v2 models * Keyword trie inclusion should be reworked to not require synthetic rules over all keywords (v1 model) and to properly allow keywords to overlap identifiers (#568) - [x] Fix the Prettier error caused by newly possible nested, empty rules: ``` [error] crates/solidity/testing/snapshots/cst_output/TupleExpression/empty/generated/0.4.11-success.yml: SyntaxError: All collection items must start at the same column (11:9) [error] 9 | - TupleExpression (Rule): # 0..3 "( )" [error] 10 | - OpenParen (Token): "(" # 0..1 [error] > 11 | - TupleValuesList (Rule): [] # 1..1 [error] | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [error] > 12 | - TupleValue (Rule): [] # 1..1 [error] | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [error] > 13 | - CloseParen (Token): ")" # 2..3 [error] | ^ [error] 14 | ```

Xanewok force-pushed the codegen-use-dslv2 branch 2 times, most recently from 664f1ad to 08a20b6 Compare November 10, 2023 22:42

OmarTawfik reviewed Nov 11, 2023

View reviewed changes

crates/solidity/inputs/language/src/definition.rs Outdated Show resolved Hide resolved