We need to revisit #8050. Our Javadoc grammar should have the same level of Unicode support as our Java grammar. We cannot claim full Unicode character support until it does.
While our ANTLR4 Java grammar now supports Unicode characters, our ANTLR4 Javadoc grammar still does not:
➜ src cat Test.java
public class Test {
/**
*
* @param ᇤᇥᇦᇧᇨᇩᇪᇫᇬᇭᇮ
*/
void aᇭᇮᇯᇰᇱᇲᇳᇴᇵᇶᇷᇸᇹᇺ(int ᇤᇥᇦᇧᇨᇩᇪᇫᇬᇭᇮ) {}
}
➜ src java -jar checkstyle-9.0-SNAPSHOT-all.jar -T Test.java
CLASS_DEF -> CLASS_DEF [1:0]
|--MODIFIERS -> MODIFIERS [1:0]
| `--LITERAL_PUBLIC -> public [1:0]
|--LITERAL_CLASS -> class [1:7]
|--IDENT -> Test [1:13]
`--OBJBLOCK -> OBJBLOCK [1:18]
|--LCURLY -> { [1:18]
|--METHOD_DEF -> METHOD_DEF [7:4]
| |--MODIFIERS -> MODIFIERS [7:4]
| |--TYPE -> TYPE [7:4]
| | |--BLOCK_COMMENT_BEGIN -> /* [3:4]
| | | |--COMMENT_CONTENT -> *\n *\n * @param ᇤᇥᇦᇧᇨᇩᇪᇫᇬᇭᇮ\n [3:6]
| | | `--BLOCK_COMMENT_END -> */ [6:5]
| | `--LITERAL_VOID -> void [7:4]
| |--IDENT -> aᇭᇮᇯᇰᇱᇲᇳᇴᇵᇶᇷᇸᇹᇺ [7:9]
| |--LPAREN -> ( [7:24]
| |--PARAMETERS -> PARAMETERS [7:25]
| | `--PARAMETER_DEF -> PARAMETER_DEF [7:25]
| | |--MODIFIERS -> MODIFIERS [7:25]
| | |--TYPE -> TYPE [7:25]
| | | `--LITERAL_INT -> int [7:25]
| | `--IDENT -> ᇤᇥᇦᇧᇨᇩᇪᇫᇬᇭᇮ [7:29]
| |--RPAREN -> ) [7:40]
| `--SLIST -> { [7:42]
| `--RCURLY -> } [7:43]
`--RCURLY -> } [10:0]
➜ src java -jar checkstyle-9.0-SNAPSHOT-all.jar -J Test.java
Exception in thread "main" java.lang.IllegalArgumentException: [ERROR:5] Javadoc comment at column 15 has parse error. Details: no viable alternative at input 'ᇤ' while parsing JAVADOC_TAG
at com.puppycrawl.tools.checkstyle.DetailNodeTreeStringPrinter.parseJavadocAsDetailNode(DetailNodeTreeStringPrinter.java:71)
at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.parseAndPrintJavadocTree(AstTreeStringPrinter.java:116)
at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:97)
at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:101)
at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:101)
at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:101)
at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:101)
at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:101)
at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:79)
at com.puppycrawl.tools.checkstyle.Main.runCli(Main.java:306)
at com.puppycrawl.tools.checkstyle.Main.execute(Main.java:191)
at com.puppycrawl.tools.checkstyle.Main.main(Main.java:126)
This will be a complicated update, since many lexer rules define their own character classes separately, and not always consistently:
checkstyle git:(master) cat $(find . -name "JavadocLexer.g4") | grep "\["
CUSTOM_NAME: '@' [a-zA-Z0-9:._-]+ {isJavadocTagAvailable}?;
PARAMETER_NAME: [a-zA-Z0-9<>_$]+ -> mode(DEFAULT_MODE);
CLASS: [A-Z] [a-zA-Z0-9_$]* {referenceCatched = true;};
MEMBER: [a-zA-Z0-9_$]+ {!insideReferenceArguments}?;
ARGUMENT: ([a-zA-Z0-9_$] | '.' | '[' | ']')+ {insideReferenceArguments}?;
FIELD_NAME: [a-zA-Z0-9_$]+ -> mode(serialFieldFieldType);
FIELD_TYPE: [a-zA-Z0-9_$]+ -> mode(DEFAULT_MODE);
CLASS_NAME: ([a-zA-Z0-9_$] | '.')+ -> mode(DEFAULT_MODE);
CustomName1: '@' [a-zA-Z0-9:._-]+ {recognizeXmlTags=false;}
Brackets: '{' (~[}] | Brackets)* '}' -> type(CHAR);
Text: ~[}] -> type(CHAR);
fragment JavaLetter: [A-Za-z_$];
fragment JavaLetterOrDigit: [0-9A-Za-z_$];
HEXDIGIT : [a-fA-F0-9] ;
DIGIT : [0-9] ;
: [:a-zA-Z]
FragmentReference: ([a-zA-Z0-9_-] | '.')+
| ([a-zA-Z0-9_-] | '.')* '#' [a-zA-Z0-9_-]+ ( '(' (([a-zA-Z0-9_-] | '.')+ | ',' | ' ')* ')' )?
ATTR_VALUE : '"' ~[<"]* '"' {!attributeCatched}? {attributeCatched=true;}
| '\'' ~[<']* '\'' {!attributeCatched}? {attributeCatched=true;}
| (~[> \t\r\n/] | SlashInAttr)+ {!attributeCatched}? {attributeCatched=true;}
CUSTOM_NAME: '@' [a-zA-Z0-9:._-]+ {isJavadocTagAvailable}?;
CLASS: [A-Z] [a-zA-Z0-9_$]* {referenceCatched = true;};
MEMBER: [a-zA-Z0-9_$]+ {!insideReferenceArguments}?;
ARGUMENT: ([a-zA-Z0-9_$] | '.' | '[' | ']')+ {insideReferenceArguments}?;
FIELD_NAME: [a-zA-Z0-9_$]+ -> mode(serialFieldFieldType);
FIELD_TYPE: [a-zA-Z0-9_$]+ -> mode(DEFAULT_MODE);
CLASS_NAME: ([a-zA-Z0-9_$] | '.')+ -> mode(DEFAULT_MODE);
CustomName1: '@' [a-zA-Z0-9:._-]+ {recognizeXmlTags=false;}
Brackets: '{' (~[}] | Brackets)* '}' -> type(CHAR);
Text: ~[}] -> type(CHAR);
: [a-zA-Z$_]
| ~[\u0000-\u007F\uD800-\uDBFF]
| [\uD800-\uDBFF] [\uDC00-\uDFFF]
| [0-9]
HEXDIGIT : [a-fA-F0-9] ;
DIGIT : [0-9] ;
: [:a-zA-Z]
FragmentReference: ([a-zA-Z0-9_-] | '.')+
| ([a-zA-Z0-9_-] | '.')* '#' [a-zA-Z0-9_-]+ ( '(' (([a-zA-Z0-9_-] | '.')+ | ',' | ' ')* ')' )?
ATTR_VALUE : '"' ~[<"]* '"' {!attributeCatched}? {attributeCatched=true;}
| '\'' ~[<']* '\'' {!attributeCatched}? {attributeCatched=true;}
| (~[> \t\r\n/] | SlashInAttr)+ {!attributeCatched}? {attributeCatched=true;}
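To illustrate why the `@param` line in Test.java cannot be tokenized, here is a minimal sketch that checks the character class from the PARAMETER_NAME rule against an ASCII name and a non-ASCII name. It uses `java.util.regex` as a stand-in for the lexer, which is only an approximation of ANTLR's matching; the class and method names are made up for this example.

```java
import java.util.regex.Pattern;

public class AsciiOnlyParameterName {
    // Same character class as the PARAMETER_NAME lexer rule shown above.
    // Assumption: java.util.regex approximates the lexer's behavior here.
    static final Pattern PARAMETER_NAME = Pattern.compile("[a-zA-Z0-9<>_$]+");

    // Returns true only if the whole text consists of characters in the class.
    static boolean matchesParameterName(String text) {
        return PARAMETER_NAME.matcher(text).matches();
    }

    public static void main(String[] args) {
        // Plain ASCII identifier: accepted.
        System.out.println(matchesParameterName("value")); // true
        // Hangul jamo letters, similar to the parameter name in Test.java,
        // written as escapes to avoid source-encoding issues: rejected.
        System.out.println(matchesParameterName("\u11E4\u11E5\u11E6")); // false
    }
}
```

The same ASCII-only class appears, with small variations, in MEMBER, ARGUMENT, FIELD_NAME, FIELD_TYPE, and CLASS_NAME, so each of those rules would presumably reject such names too.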
It would be a good idea to emulate the Java lexer's Letter fragment rule and reuse it throughout the Javadoc lexer grammar, so that every rule that matches a name is covered. For example, the FIELD_TYPE rule should become something like:
FIELD_TYPE: IDENT -> mode(DEFAULT_MODE);
where IDENT is defined as:
IDENT: Letter LetterOrDigit*;
fragment LetterOrDigit
: Letter
| [0-9]
;
fragment Letter
// these are the "java letters" below 0x7F
: [a-zA-Z$_]
// covers all characters above 0x7F which are not a surrogate
| ~[\u0000-\u007F\uD800-\uDBFF]
// covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
| [\uD800-\uDBFF] [\uDC00-\uDFFF]
;
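As a rough way to reason about what the proposed IDENT rule would accept, the sketch below re-implements `Letter LetterOrDigit*` in plain Java over UTF-16 code units. This is an assumption-laden model and not the generated lexer: `LetterSketch`, `matchLetter`, and `isIdent` are hypothetical names, and the real lexer may see the input differently depending on how the character stream is created.

```java
public class LetterSketch {
    // Mirrors the first alternative of Letter: [a-zA-Z$_]
    static boolean isAsciiJavaLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || c == '$' || c == '_';
    }

    // Returns how many UTF-16 code units Letter would consume at index i,
    // or 0 if no alternative of Letter matches there.
    static int matchLetter(String s, int i) {
        char c = s.charAt(i);
        if (isAsciiJavaLetter(c)) {
            return 1;
        }
        if (Character.isHighSurrogate(c)) {
            // [\uD800-\uDBFF] [\uDC00-\uDFFF]: a full surrogate pair is required.
            boolean hasLow = i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            return hasLow ? 2 : 0;
        }
        // ~[\u0000-\u007F\uD800-\uDBFF]: anything above 0x7F that is not a
        // high surrogate (high surrogates were handled above).
        return c > 0x7F ? 1 : 0;
    }

    // Models IDENT: Letter LetterOrDigit*
    static boolean isIdent(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int i = matchLetter(s, 0);
        if (i == 0) {
            return false;
        }
        while (i < s.length()) {
            char c = s.charAt(i);
            int consumed = (c >= '0' && c <= '9') ? 1 : matchLetter(s, i);
            if (consumed == 0) {
                return false;
            }
            i += consumed;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdent("value"));              // true
        System.out.println(isIdent("\u11E4\u11E5\u11E6")); // true
        System.out.println(isIdent("1abc"));               // false
    }
}
```

Note that this rule is deliberately permissive: any character above 0x7F counts as a letter, so it accepts more than `Character.isJavaIdentifierStart` would. That matches the Java lexer's fragment as quoted above, which is the behavior this issue proposes to mirror.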