Skip to content

Issue #10272: Upgrade Java Grammar from ANTLR2 to ANTLR4#10280

Merged
romani merged 8 commits into
checkstyle:masterfrom
nrmancuso:issue-10272
Aug 11, 2021
Merged

Issue #10272: Upgrade Java Grammar from ANTLR2 to ANTLR4#10280
romani merged 8 commits into
checkstyle:masterfrom
nrmancuso:issue-10272

Conversation

@nrmancuso

@nrmancuso nrmancuso commented Jul 7, 2021

Copy link
Copy Markdown
Contributor

Closes #10272.

In summary, there are no regressions. All differences (in Check regression reports and AST regression reports) fall into the following categories:

  1. Diffs are in file that could not be parsed previously, but now we can (usually due to unicode characters).
  2. Diffs are in a file that we cannot parse any longer, that is not compilable.
  3. Diffs are from correct type parameter placement. This usually happens when one of the types is an array type. Example: Changes at https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part4/spoon/index.html
  4. Diffs are from change in exception message.
  5. Two cases of changes in DOT operator placement, both are consistent with Checkstyle vision of AST (DOT operator as parent of expression).
    1. https://nmancus1.github.io/issue-10272_check_diff_reports_2021_07_25/diff-antlr/spoon/index.html#A20
    2. https://nmancus1.github.io/issue-10272_check_diff_reports_2021_07_25/diff-antlr/guava-mvnstyle/index.html#A91
  6. One case of change in line length now that we can parse unicode correctly: https://nmancus1.github.io/issue-10272_check_diff_reports_2021_07_25/diff-antlr/my-checkstyle/index.html#A239
  7. New violations on previously not parsable file InputAntlr4AstRegressionUncommon3.java(from 3d471c3#r679410108)

Check Regression Reports

https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_checks-nonjavadoc-error/index.html


https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_checks-only-javadoc-error/index.html

  • Changes are identical to above.

https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part1/index.html


https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part2/index.html


https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part3/index.html


https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part4/index.html


https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part5/index.html


https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_part6/index.html

  • New violations on previously not parsable file InputAntlr4AstRegressionUncommon3.java.

https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_sevntu-check-regression_part_1/index.html


https://nmancus1.github.io/issue-10272_check_diff_reports_2021_08_09/diff_sevntu-check-regression_part_2/index.html

  • All other changes are from the same non-compilable files and files containing unicode chars as above.

AST Regression Report

All repos:
https://nmancus1.github.io/issue-10272_check_diff_reports_2021_07_25/diff-antlr/index.html

Regression was found in lombok-ast at https://nmancus1.github.io/issue-10272_check_diff_reports_2021_07_25/diff-antlr/lombok-ast/index.html#A1 , but is fixed. but is fixed.

Updated AST regression report for lombok-ast : https://nmancus1.github.io/issue-10272_check_diff_reports_2021_07_25/diff-antlr-lombok/index.html

@nrmancuso nrmancuso marked this pull request as draft July 7, 2021 21:31
@nrmancuso nrmancuso force-pushed the issue-10272 branch 4 times, most recently from e215dcc to 7dddea2 Compare July 9, 2021 01:24
* @param hiddenBefore comment token preceding this DetailAstImpl
*/
public void setHiddenBefore(List<Token> hiddenBefore) {
this.hiddenBefore = Collections.unmodifiableList(hiddenBefore);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From https://teamcity.jetbrains.com/viewLog.html?buildId=3543062&tab=Inspection&buildTypeId=Checkstyle_IdeaInspectionsPullRequest:

523: setHiddenBefore() Assignment to List<Token> field 'hiddenBefore' from parameter hiddenBefore 
532: setHiddenAfter() Assignment to List<Token> field 'hiddenAfter' from parameter hiddenAfter 

assertEquals(oldParent, newSibling.getParent(), "Invalid parent");
assertNull(newSibling.getNextSibling(), "Invalid next sibling");
assertEquals(newSibling, child.getNextSibling(), "Invalid child");
assertSame(newSibling, child.getNextSibling(), "Invalid child");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* @see #INTERFACE_DEF
**/
public static final int EOF = GeneratedJavaTokenTypes.EOF;
public static final int EOF = Recognizer.EOF;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be Recognizer and not JavaParser.

From https://teamcity.jetbrains.com/viewLog.html?buildId=3543062&tab=Inspection&buildTypeId=Checkstyle_IdeaInspectionsPullRequest:

46: EOF Static field EOF declared in class 'org.antlr.v4.runtime.Recognizer' but referenced via subclass 'com.puppycrawl.tools.checkstyle.grammar.java.JavaLexer' 

@nrmancuso nrmancuso force-pushed the issue-10272 branch 12 times, most recently from e23487a to c72b9d5 Compare July 15, 2021 13:44
final boolean isMethodViolation = isMethod && !isJavaStyle;
final boolean isVariableViolation = !isMethod
&& isJavaStyle != javaStyle
&& typeAST.getType() != TokenTypes.TYPE_ARGUMENT;

@nrmancuso nrmancuso Jul 15, 2021

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that GENERIC_END token positions are always reported correctly, and emitted accurately, this line was no longer covered.

Example:

        protected Pair<Integer, Pair<String, Pair<String, Object>>[]>[] values3a;

Difference between latest master and PR branch:
https://www.diffchecker.com/IuwJEwmB

@nrmancuso nrmancuso force-pushed the issue-10272 branch 8 times, most recently from 0e6bcc4 to 2f5c283 Compare July 19, 2021 18:20
Comment thread src/main/java/com/puppycrawl/tools/checkstyle/DetailAstImpl.java
@nrmancuso

nrmancuso commented Aug 4, 2021

Copy link
Copy Markdown
Contributor Author

Has there been any tests to see what is the time difference needed in parsing with ANTLR 4?

I have been keeping an eye on this, and it's about the same.

10 runs of mvn clean verify yields +/- 20 seconds (average total time is 4:42) on my machine, there is no clear winner.

5 wercker runs, averaged out: ANTLR2 - 3052.6 s, ANTLR4 - 2579.2 s
all individual no-exception and no-error tests are +/- ~40 seconds

Report generation time is about the same (Check: ~4.7 hours, AST: ~50 min).

If there is any particular data that you would like to see, or repos you would like me to test on, I will do so, but since ANTLR4 parsing times have been consistent with ANTLR2 parsing times (except for openjdk14, 16), I have not done extensive time trials.

@rnveach

rnveach commented Aug 5, 2021

Copy link
Copy Markdown
Contributor

If there is any particular data that you would like to see

None from me. I wanted to ensure we were looking at this, and discussing if there was any issues being presented.

If upgrading ANTLR doesn't cause a slowdown, then that atleast means ANTLR itself isn't the issue for slowdown we see with Javadocs.

);

@Test
public void testAllVisitMethodsAreOverridden() {

@nrmancuso nrmancuso Aug 6, 2021

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With ANTLR2, we relied on the automatic AST generation from the file parsing operation; with ANTLR4 we have to generate our own AST with the visitor. This leaves room for error; we could add a production rule to the grammar, but not visit the rule context and create an AST. I have created this test to make sure that this doesn't happen; anytime a new rule or labeled alternative is created, a new visit... method is generated in CheckstyleJavaParserBaseVisitor automatically. This test checks that all visit methods are overridden, or that the method name is explicitly added to VISIT_METHODS_NOT_OVERRIDDEN list with a comment to explain why we don't override the visit method and build an AST.

}

@Test
public void testOrderOfVisitMethodsAndProductionRules() throws Exception {

@nrmancuso nrmancuso Aug 6, 2021

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have determined that keeping production rules in CheckstyleJavaParser.g4 and visit methods in JavaAstVisitor.java helps to ease maintenance, and should be considered a best practice.

Example:

image

Just as we keep related production rules together in the grammar, so too should we keep related visit methods together in JavaAstVisitor.java. This test makes sure that we keep these two files consistent.

@romani romani left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

items:

Comment thread src/main/java/com/puppycrawl/tools/checkstyle/JavaParser.java Outdated
Comment thread src/main/java/com/puppycrawl/tools/checkstyle/api/DetailAST.java Outdated
@nrmancuso

Copy link
Copy Markdown
Contributor Author

After all reviews are completed, I will squash all commits into one, and run one final regression report before we merge this PR.

Comment on lines +44 to +93
/**
* Visitor class used to build Checkstyle's Java AST from the parse tree produced by
* {@link CheckstyleJavaParser}. In each {@code visit...} method, we visit the children of a node
* (which correspond to subrules) or create terminal nodes (tokens), and return a subtree as a
* result.
*
* <p>Example:</p>
*
* <p>The following package declaration:</p>
* <pre>
* package com.puppycrawl.tools.checkstyle;
* </pre>
*
* <p>
* Will be parsed by the {@code packageDeclaration} rule from {@code CheckstyleJavaParser.g4}:
* </p>
* <pre>
* packageDeclaration
* : annotations[true] LITERAL_PACKAGE qualifiedName SEMI
* ;
* </pre>
*
* <p>
* We override the {@code visitPackageDeclaration} method generated by ANTLR in
* {@code CheckstyleJavaParserBaseVisitor} at
* {@link JavaAstVisitor#visitPackageDeclaration(CheckstyleJavaParser.PackageDeclarationContext)}
* to create a subtree based on the subrules and tokens found in the {@code packageDeclaration}
* subrule accordingly, thus producing the following AST:
* </p>
* <pre>
* PACKAGE_DEF -&gt; package
* |--ANNOTATIONS -&gt; ANNOTATIONS
* |--DOT -&gt; .
* | |--DOT -&gt; .
* | | |--DOT -&gt; .
* | | | |--IDENT -&gt; com
* | | | `--IDENT -&gt; puppycrawl
* | | `--IDENT -&gt; tools
* | `--IDENT -&gt; checkstyle
* `--SEMI -&gt; ;
* </pre>
* <p>
* See https://github.com/checkstyle/checkstyle/pull/10434 for a good example of how
* to make changes to Checkstyle's grammar and AST.
* </p>
* <p>
* The order of {@code visit...} methods in {@code JavaAstVisitor.java} and production rules in
* {@code CheckstyleJavaParser.g4} should be consistent to ease maintenance.
* </p>
*/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I write a more detailed README/ guide with code examples on how to update grammar/ visitor? If so, should it live in the same directory as the parser and lexer grammar?

@romani romani left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge this

@md-5

md-5 commented Nov 22, 2021

Copy link
Copy Markdown

I'm seeing huge memory usage / OutOfMemoryErrors in the Antlr runtime with the 9.x series.

With the 8.x series my project(s) builds complete on a Maven heap of only 256M, but by bumping Checkstyle from 8.45.1 to any of the 9.x series, the minimum heap sits at around 768M (fails at 512M, succeeds at 768M).

This suggests that Checkstyle memory usage is somewhere in the order of 2-3 times what it was previously.

Is this something you have observed / an unavoidable consequence of Antlr 4, or is it possible there are leaks somewhere?

@nrmancuso

Copy link
Copy Markdown
Contributor Author

Is this something you have observed / an unavoidable consequence of Antlr 4, or is it possible there are leaks somewhere?

We are conducting an investigation, and have an open issue at #10934. Please leave a comment there with details about your project, and if it is open source, share a link to your repo.

This issue has been noted previously with ANTLR4, there are a few reports such as
antlr/antlr4#2384.

@romani

romani commented Nov 22, 2021

Copy link
Copy Markdown
Member

For now kind of unavoidable

We keep #10934 open to find out root reason and maybe fix problem by PR to antrl4 project

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Upgrade Java Grammar from ANTLR2 to ANTLR4

7 participants