Status: The toolset is still undergoing a major rewrite. Consider this toolkit as "pre-alpha". Old tools are being removed, and new ones are being added. Features are being added, while bugs are constantly being fixed. The XPath/XQuery engine is still being rewritten.
The repo g4-scripts contains a collection of Bash scripts that use Trash. The repo also contains XQuery scripts that implement complex operations on a parse tree. You can also read about Trash in detail in my blog.
Trash is a collection of ~40 command-line tools to analyze and transform Antlr parse trees and grammars. The toolkit can: generate a parser application for an Antlr4 grammar for any target and any OS; analyze the grammar for common problems; automate changes applied to a grammar scraped from a specification; transform parse trees for transpiling and preprocessing source code. With the Antlr toolkit and the collection of Antlr grammars, one can write programming language tools quickly and easily.
The toolkit is designed around a JSON representation of parse trees and command-line tools that read, modify, and write those trees via standard input and output. Complex refactorings can be achieved by chaining different commands together.
Each app in Trash is implemented as a Dotnet Tool console application, and can be used on Windows, Linux, or Mac.
No prerequisites are required other than installing the .NET SDK and the toolchains for any other targets you want to use.
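Before installing, it can help to confirm that the `dotnet` command is actually on your PATH. The following is a minimal sketch, not part of Trash; `have_tool` is a hypothetical helper name introduced only for this illustration.

```sh
# have_tool NAME: succeed if NAME resolves to a command on PATH.
# Hypothetical helper for illustration; not part of Trash.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool dotnet; then
    echo "dotnet found, SDK version: $(dotnet --version)"
else
    echo "dotnet not found; install the .NET SDK first" >&2
fi
```

The same helper can be reused to check for the toolchain of any other target you plan to generate for.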
The toolkit uses Antlr and XPath2. The code is implemented in C#.
The toolkit was used to scrape and refactor the Dart2 grammar from the spec. See this script.
dotnet tool install -g trash
dotnet tool uninstall -g trash
dotnet new tool-manifest
dotnet tool install trash
- dotnet trash analyze -- Analyze a grammar
- dotnet trash caret -- Caret operations on a parse tree
- dotnet trash clonereplace -- Clone and replace in a grammar
- dotnet trash combine -- Combine a split Antlr4 grammar
- dotnet trash convert -- Convert a grammar from one form to another
- dotnet trash cover -- Code coverage analysis
- dotnet trash dot -- Print a parse tree in Graphviz Dot format
- dotnet trash extract -- Extract from a parse tree
- dotnet trash ff -- Outputs FIRST and FOLLOW sets of a grammar
- dotnet trash foldlit -- Perform fold transform on grammar with literals
- dotnet trash gen -- Generate an Antlr4 parser for a given target language
- dotnet trash genvsc -- Generate VS Code extension files
- dotnet trash glob -- Glob file patterns
- dotnet trash iconv -- Convert file encoding
- dotnet trash itext -- Get strings from a PDF file
- dotnet trash json -- Print a parse tree in JSON structured format
- dotnet trash nullable -- Nullable analysis of a grammar
- dotnet trash parse -- Parse a grammar or use a generated parser to parse input
- dotnet trash perf -- Perform performance analysis of an Antlr grammar parse
- dotnet trash query -- Query parse trees using XPath
- dotnet trash rename -- Rename symbols in a grammar
- dotnet trash sort -- Sort rules in a grammar
- dotnet trash split -- Split a combined Antlr4 grammar
- dotnet trash sponge -- Extract parsing results of a Trash command into files
- dotnet trash text -- Print a parse tree with a specific interval
- dotnet trash tokens -- Print tokens in a parse tree
- dotnet trash tree -- Print a parse tree in a human-readable format
- dotnet trash unfold -- Perform an unfold transform on a grammar
- dotnet trash unfoldlit -- Perform unfold transform with literals on a grammar
- dotnet trash ungroup -- Perform an ungroup transform on a grammar
- dotnet trash wdog -- Kill a program that runs too long
- dotnet trash xgrep -- Search using XPath in parse trees
- dotnet trash xml -- Print a parse tree in XML structured format
- dotnet trash xml2 -- Print an enumeration of all paths in a parse tree to leaves
git clone https://github.com/antlr/grammars-v4
cd grammars-v4/python/python
dotnet trash parse *.g4 | dotnet trash query 'grep //grammarDecl' | dotnet trash text
# Output:
# PythonLexer.g4:lexer grammar PythonLexer;
# PythonParser.g4:parser grammar PythonParser;
dotnet trash gen
cd Generated
dotnet build
cat - <<EOF | dotnet trash parse | dotnet trash query 'grep //test' | dotnet trash text
x == y
x == y if z == b else a == u
lambda: a
lambda x, y: a
EOF
# Output:
# a
# lambda x, y: a
# a
# lambda: a
# a == u
# x == y if z == b else a == u
# x == y
dotnet trash parse -i "a == b" | dotnet trash tree
trtree is only one of several ways to view parse tree data. Other programs for different output are trjson for JSON output, trxml for XML output, trst for Antlr runtime ToStringTree output, trdot, trprint for input text for the parse, and tragl.
dotnet trash parse ada.g2 | dotnet trash convert | trprint | less
This command parses an old Antlr2 grammar using trparse, converts the parse tree data to Antlr4 syntax using trconvert, and finally prints out the converted parse tree data, ada.g4, using trprint. Other grammars that can be converted are Antlr3, Bison, and ISO EBNF. In order to use the grammar to parse data, you will need to convert it to an Antlr4 grammar.
mkdir foobar; cd foobar; dotnet trash gen
This command creates a parser application for the C# target.
If executed in an empty directory, which is done in the example
shown above, trgen
creates an application using the Arithmetic grammar.
If executed in a directory containing
a pom.xml for the Antlr Maven plugin, trgen will create a program according
to the information specified in the pom.xml file. Either way, it creates a directory
Generated/ and places the source code there.
trgen has many options to generate a parser from any Antlr4 grammar, for any target.
However, if a parser is generated for the C# target and built using the .NET SDK, then trparse
can execute the generated parser, and it can be used with all the other tools in Trash. NB:
In order to use the generated parser application, you must first build it:
dotnet restore Generated/Test.csproj
dotnet build Generated/Test.csproj
dotnet trash parse -i "1+2+3" | dotnet trash tree
After using trgen to generate a parser program in C#, shown previously,
and after building the program, you can run the parser using trparse. This program
looks for the generated parser in directory Generated/. If it exists,
it will run the parser application in the directory. You can pass
as command-line arguments an input string or input file. If no command-line
arguments are supplied, the program will read stdin. The output of trparse, as
with most tools of Trash, is parse tree data.
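Because the parse step relies on a Generated/ directory being present, a script can guard against running it too early. This is only a sketch of a check a caller might add, under the assumption that testing for the directory is enough; it does not show what trparse does internally, and `has_generated_parser` is a hypothetical name.

```sh
# has_generated_parser DIR: succeed if DIR contains a Generated/ directory.
# Hypothetical helper for illustration; not part of Trash.
# It only checks that the directory exists, not that the parser was built.
has_generated_parser() {
    [ -d "${1:-.}/Generated" ]
}

if has_generated_parser .; then
    echo "Generated/ found; the parse command can use the generated parser."
else
    echo "No Generated/ here; generate and build a parser first." >&2
fi
```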
mkdir empty; cd empty; dotnet trash gen; dotnet build Generated/Test.csproj; \
dotnet trash parse -i "1+2+3" | dotnet trash query "grep //SCIENTIFIC_NUMBER" | trst
With this command, a directory is created, and a parser for the Arithmetic grammar is generated, built,
and then run using trparse.
The trparse tool unifies all parsing, whether it's parsing a grammar or parsing input
using a generated parser application. The output from the trparse tool is a parse
tree which you can search. Trquery
is the generalized search program for parse trees. Trquery uses XPath expressions to
precisely identify nodes in the parse tree.
XPath was added to Antlr4, but Trash takes the idea
further with the addition of an XPath2 engine ported from the
Eclipse Web toolkit.
XPath is a well-defined language that should be
used more often in compiler construction.
dotnet trash parse Arithmetic.g4 | dotnet trash rename "//parserRuleSpec//labeledAlt//RULE_REF[text() = 'expression']" "xxx" | dotnet trash text > new-source.g4
dotnet trash parse Arithmetic.g4 | dotnet trash rename -r "expression,expression_;atom,atom_;scientific,scientific_" | trprint
In these two examples, the Arithmetic grammar is parsed.
trrename reads the parse tree data and
modifies it by renaming symbols in two ways. In the first, an XPath expression selects the RULE_REF
nodes whose text is expression, and those are renamed. In the second, the tree is assumed to be an Antlr4 parse tree,
and a semicolon-separated list of old,new pairs specifies the renames. The resulting code is reconstructed and saved.
trrename does not rename symbols in actions, nor does it rename identifiers corresponding to the
grammar symbols in any support source code (but it could if the tool is extended).
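The pair list passed with -r in the second example is a semicolon-separated list of comma-separated old,new pairs. The sketch below only illustrates how that string breaks down into individual renames; `print_renames` is a hypothetical helper, and this is not how the tool itself parses the option.

```sh
# print_renames SPEC: print each "old,new" pair from a
# semicolon-separated list as "old -> new", one per line.
# Hypothetical helper for illustration; not part of Trash.
print_renames() {
    printf '%s\n' "$1" | tr ';' '\n' | while IFS=, read -r old new; do
        printf '%s -> %s\n' "$old" "$new"
    done
}

print_renames "expression,expression_;atom,atom_;scientific,scientific_"
# Output:
# expression -> expression_
# atom -> atom_
# scientific -> scientific_
```

Listing the pairs this way before running a rename makes it easier to spot a mistyped pair.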
git clone https://github.com/antlr/grammars-v4.git; \
cd grammars-v4/java/java9; \
dotnet trash gen; dotnet build Generated/Test.csproj;\
dotnet trash parse examples/AllInOne8.java | dotnet trash query "grep //methodDeclaration" | trst | wc
This command clones the Antlr4 grammars-v4 repo, generates a parser for the Java9 grammar,
then runs the parser on examples/AllInOne8.java.
The parse tree is then piped to trquery to find all parse tree nodes of
type methodDeclaration. The result is converted to a simple string and counted using
wc.
dotnet trash parse Java9.g4 | trstrip | dotnet trash text > Essential-Java9.g4
Since Antlr2, one can write a combined parser/lexer in one file, or a split parser/lexer in two files. While it's not hard to split or combine a grammar, it's tedious. Automating these transformations is necessary because Antlr4 requires the grammars to be split when super classes are needed for different targets.
dotnet trash combine ArithmeticLexer.g4 ArithmeticParser.g4 | trprint > Arithmetic.g4
This command calls trcombine which parses two split grammar files ArithmeticLexer.g4 and ArithmeticParser.g4, and creates a combined grammar for the two.
dotnet trash parse Arithmetic.g4 | dotnet trash split | dotnet trash sponge -o true
This command calls trsplit, which splits the grammar into two parse tree results, one that defines ArithmeticLexer.g4 and the other that defines ArithmeticParser.g4. The tool trsponge is similar to tee in Linux: the parse tree data is split and placed in files.
A parsing result set is a JSON serialization of an array of:
- A set of parse tree nodes.
- Parser information related to the parse tree nodes.
- Lexer information related to the parse tree nodes.
- The name of the input corresponding to the parse tree nodes.
- The input text corresponding to the parse tree nodes.
Most commands in Trash read and/or write parsing result sets.
| Grammars | File suffix |
|---|---|
| Antlr4 | .g4 |
| Antlr3 | .g3 |
| Antlr2 | .g2 |
| Bison | .y |
| LBNF | .cf |
| W3C EBNF | .ebnf |
| ISO 14977 | .iso14977, .iso |
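
When scripting over a directory of mixed grammars, the suffix table above can be mirrored in a small lookup. This is a sketch only; `grammar_kind` is a hypothetical helper, not a Trash command, and it looks only at the file name.

```sh
# grammar_kind FILE: print the grammar type implied by FILE's suffix,
# following the table above. Returns non-zero for unknown suffixes.
# Hypothetical helper for illustration; not part of Trash.
grammar_kind() {
    case "$1" in
        *.g4)             echo "Antlr4" ;;
        *.g3)             echo "Antlr3" ;;
        *.g2)             echo "Antlr2" ;;
        *.y)              echo "Bison" ;;
        *.cf)             echo "LBNF" ;;
        *.ebnf)           echo "W3C EBNF" ;;
        *.iso14977|*.iso) echo "ISO 14977" ;;
        *)                echo "unknown"; return 1 ;;
    esac
}

grammar_kind ada.g2         # prints "Antlr2"
grammar_kind Arithmetic.g4  # prints "Antlr4"
```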
Trash provides a number of transformations that can help to make grammars cleaner (reformatting), more readable (reducing the length of the RHS of a rule), and more efficient (reducing the number of non-terminals) for Antlr.
Some of these refactorings are very specific for Antlr due to the way the parser works, e.g., converting a prioritized chain of productions recognizing an arithmetic expression to a recursive alternate form. The refactorings implemented are:
- Remove useless parentheses
- Remove useless parser rules
- Rename lexer or parser symbol
- Unfold
- Group alts
- Ungroup alts
- Upper and lower case string literals
- Fold
- Replace direct left recursion with right recursion
- Replace direct left/right recursion with Kleene operator
- Replace indirect left recursion with right recursion
- Replace parser rule symbols that conflict with Antlr keywords
- Replace string literals in parser with lexer symbols
- Replace string literals in parser with lexer symbols, with lexer rule create
- Delabel removes the annoying and mostly useless labeling in an Antlr grammar
The source code for the toolkit is open source, free of charge, and free of ads. For the latest developments on the toolkit, check out my blog.
git clone https://github.com/kaby76/Trash
cd Trash
make clean; make; make install
You must have the .NET SDK version 10 installed to build and run.
See https://github.com/kaby76/Trash/releases.
If you have any questions, email me at ken.domino gmail.com