Issue #10272: Upgrade Java Grammar from ANTLR2 to ANTLR4#10280
Conversation
e215dcc to
7dddea2
Compare
| * @param hiddenBefore comment token preceding this DetailAstImpl | ||
| */ | ||
| public void setHiddenBefore(List<Token> hiddenBefore) { | ||
| this.hiddenBefore = Collections.unmodifiableList(hiddenBefore); |
There was a problem hiding this comment.
523: setHiddenBefore() Assignment to List<Token> field 'hiddenBefore' from parameter hiddenBefore
532: setHiddenAfter() Assignment to List<Token> field 'hiddenAfter' from parameter hiddenAfter
| assertEquals(oldParent, newSibling.getParent(), "Invalid parent"); | ||
| assertNull(newSibling.getNextSibling(), "Invalid next sibling"); | ||
| assertEquals(newSibling, child.getNextSibling(), "Invalid child"); | ||
| assertSame(newSibling, child.getNextSibling(), "Invalid child"); |
There was a problem hiding this comment.
401: testAddNextSiblingNullParent() assertEquals() may be 'assertSame()'
| * @see #INTERFACE_DEF | ||
| **/ | ||
| public static final int EOF = GeneratedJavaTokenTypes.EOF; | ||
| public static final int EOF = Recognizer.EOF; |
There was a problem hiding this comment.
This must be Recognizer and not JavaParser.
46: EOF Static field EOF declared in class 'org.antlr.v4.runtime.Recognizer' but referenced via subclass 'com.puppycrawl.tools.checkstyle.grammar.java.JavaLexer'
e23487a to
c72b9d5
Compare
| final boolean isMethodViolation = isMethod && !isJavaStyle; | ||
| final boolean isVariableViolation = !isMethod | ||
| && isJavaStyle != javaStyle | ||
| && typeAST.getType() != TokenTypes.TYPE_ARGUMENT; |
There was a problem hiding this comment.
Now that GENERIC_END token positions are always reported correctly, and emitted accurately, this line was no longer covered.
Example:
protected Pair<Integer, Pair<String, Pair<String, Object>>[]>[] values3a;Difference between latest master and PR branch:
https://www.diffchecker.com/IuwJEwmB
0e6bcc4 to
2f5c283
Compare
I have been keeping an eye on this, and it's about the same. 10 runs of 5 wercker runs, averaged out: ANTLR2 - 3052.6 s, ANTLR4 - 2579.2 s Report generation time is about the same (Check: ~4.7 hours, AST: ~50 min). If there is any particular data that you would like to see, or repos you would like me to test on, I will do so, but since ANTLR4 parsing times have been consistent with ANTLR2 parsing times (except for openjdk14, 16), I have not done extensive time trials. |
None from me. I wanted to ensure we were looking at this, and discussing if there was any issues being presented. If upgrading ANTLR doesn't cause a slowdown, then that atleast means ANTLR itself isn't the issue for slowdown we see with Javadocs. |
| ); | ||
|
|
||
| @Test | ||
| public void testAllVisitMethodsAreOverridden() { |
There was a problem hiding this comment.
With ANTLR2, we relied on the automatic AST generation from the file parsing operation; with ANTLR4 we have to generate our own AST with the visitor. This leaves room for error; we could add a production rule to the grammar, but not visit the rule context and create an AST. I have created this test to make sure that this doesn't happen; anytime a new rule or labeled alternative is created, a new visit... method is generated in CheckstyleJavaParserBaseVisitor automatically. This test checks that all visit methods are overridden, or that the method name is explicitly added to VISIT_METHODS_NOT_OVERRIDDEN list with a comment to explain why we don't override the visit method and build an AST.
| } | ||
|
|
||
| @Test | ||
| public void testOrderOfVisitMethodsAndProductionRules() throws Exception { |
There was a problem hiding this comment.
I have determined that keeping production rules in CheckstyleJavaParser.g4 and visit methods in JavaAstVisitor.java helps to ease maintenance, and should be considered a best practice.
Example:
Just as we keep related production rules together in the grammar, so too should we keep related visit methods together in JavaAstVisitor.java. This test makes sure that we keep these two files consistent.
…s and expected output
|
After all reviews are completed, I will squash all commits into one, and run one final regression report before we merge this PR. |
| /** | ||
| * Visitor class used to build Checkstyle's Java AST from the parse tree produced by | ||
| * {@link CheckstyleJavaParser}. In each {@code visit...} method, we visit the children of a node | ||
| * (which correspond to subrules) or create terminal nodes (tokens), and return a subtree as a | ||
| * result. | ||
| * | ||
| * <p>Example:</p> | ||
| * | ||
| * <p>The following package declaration:</p> | ||
| * <pre> | ||
| * package com.puppycrawl.tools.checkstyle; | ||
| * </pre> | ||
| * | ||
| * <p> | ||
| * Will be parsed by the {@code packageDeclaration} rule from {@code CheckstyleJavaParser.g4}: | ||
| * </p> | ||
| * <pre> | ||
| * packageDeclaration | ||
| * : annotations[true] LITERAL_PACKAGE qualifiedName SEMI | ||
| * ; | ||
| * </pre> | ||
| * | ||
| * <p> | ||
| * We override the {@code visitPackageDeclaration} method generated by ANTLR in | ||
| * {@code CheckstyleJavaParserBaseVisitor} at | ||
| * {@link JavaAstVisitor#visitPackageDeclaration(CheckstyleJavaParser.PackageDeclarationContext)} | ||
| * to create a subtree based on the subrules and tokens found in the {@code packageDeclaration} | ||
| * subrule accordingly, thus producing the following AST: | ||
| * </p> | ||
| * <pre> | ||
| * PACKAGE_DEF -> package | ||
| * |--ANNOTATIONS -> ANNOTATIONS | ||
| * |--DOT -> . | ||
| * | |--DOT -> . | ||
| * | | |--DOT -> . | ||
| * | | | |--IDENT -> com | ||
| * | | | `--IDENT -> puppycrawl | ||
| * | | `--IDENT -> tools | ||
| * | `--IDENT -> checkstyle | ||
| * `--SEMI -> ; | ||
| * </pre> | ||
| * <p> | ||
| * See https://github.com/checkstyle/checkstyle/pull/10434 for a good example of how | ||
| * to make changes to Checkstyle's grammar and AST. | ||
| * </p> | ||
| * <p> | ||
| * The order of {@code visit...} methods in {@code JavaAstVisitor.java} and production rules in | ||
| * {@code CheckstyleJavaParser.g4} should be consistent to ease maintenance. | ||
| * </p> | ||
| */ |
There was a problem hiding this comment.
Should I write a more detailed README/ guide with code examples on how to update grammar/ visitor? If so, should it live in the same directory as the parser and lexer grammar?
|
I'm seeing huge memory usage / OutOfMemoryErrors in the Antlr runtime with the 9.x series. With the 8.x series my project(s) builds complete on a Maven heap of only 256M, but by bumping Checkstyle from 8.45.1 to any of the 9.x series, the minimum heap sits at around 768M (fails at 512M, succeeds at 768M). This suggests that Checkstyle memory usage is somewhere in the order of 2-3 times what it was previously. Is this something you have observed / an unavoidable consequence of Antlr 4, or is it possible there are leaks somewhere? |
We are conducting an investigation, and have an open issue at #10934. Please leave a comment there with details about your project, and if it is open source, share a link to your repo. This issue has been noted previously with ANTLR4, there are a few reports such as |
|
For now kind of unavoidable We keep #10934 open to find out root reason and maybe fix problem by PR to antrl4 project |

Closes #10272.
In summary, there are no regressions. All differences (in Check regression reports and AST regression reports) fall into the following categories:
DOToperator placement, both are consistent with Checkstyle vision of AST (DOToperator as parent of expression).InputAntlr4AstRegressionUncommon3.java(from 3d471c3#r679410108)Check Regression Reports
https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_checks-nonjavadoc-error/index.html
https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_checks-only-javadoc-error/index.html
https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part1/index.html
https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part2/index.html
https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part3/index.html
https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part4/index.html
Changes at https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part4/spoon/index.html are from non-compilable files and correct
>positioning.All other changes are from files with unicode characters that we previously could not parse.
https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part5/index.html
Changes at https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part5/guava/index.html are from corrected
DOToperator parent placement.All other changes are from the same non-compilable files and files containing unicode chars as above.
https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part6/index.html
InputAntlr4AstRegressionUncommon3.java.https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_sevntu-check-regression_part_1/index.html
Change at https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_sevntu-check-regression_part_1/guava/index.html from same correct
DOToperator parent placement as above.All other changes are from the same non-compilable files and files containing unicode chars as above.
https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_sevntu-check-regression_part_2/index.html
AST Regression Report
All repos:
https://nmancus1.github.io/issue-10272_check_diff_reports_2021_07_25/diff-antlr/index.html
Regression was found in
lombok-astat https://nmancus1.github.io/issue-10272_check_diff_reports_2021_07_25/diff-antlr/lombok-ast/index.html#A1 , but is fixed. but is fixed.Updated AST regression report for
lombok-ast: https://nmancus1.github.io/issue-10272_check_diff_reports_2021_07_25/diff-antlr-lombok/index.html