Javadoc grammar does not support unicode characters #10629

@nrmancuso

We need to revisit #8050. Checkstyle's Javadoc grammar should support unicode characters to the same extent as our Java grammar; until it does, we cannot claim full unicode character support.

While our ANTLR4 Java grammar now supports unicode characters, our ANTLR4 Javadoc grammar still does not:

 ➜  src cat Test.java                                         
public class Test {

    /**
     *
     * @param ᇤᇥᇦᇧᇨᇩᇪᇫᇬᇭᇮ
     */
    void aᇭᇮᇯᇰᇱᇲᇳᇴᇵᇶᇷᇸᇹᇺ(int ᇤᇥᇦᇧᇨᇩᇪᇫᇬᇭᇮ) {}


}

 ➜  src java -jar checkstyle-9.0-SNAPSHOT-all.jar -T Test.java
CLASS_DEF -> CLASS_DEF [1:0]
|--MODIFIERS -> MODIFIERS [1:0]
|   `--LITERAL_PUBLIC -> public [1:0]
|--LITERAL_CLASS -> class [1:7]
|--IDENT -> Test [1:13]
`--OBJBLOCK -> OBJBLOCK [1:18]
    |--LCURLY -> { [1:18]
    |--METHOD_DEF -> METHOD_DEF [7:4]
    |   |--MODIFIERS -> MODIFIERS [7:4]
    |   |--TYPE -> TYPE [7:4]
    |   |   |--BLOCK_COMMENT_BEGIN -> /* [3:4]
    |   |   |   |--COMMENT_CONTENT -> *\n     *\n     * @param ᇤᇥᇦᇧᇨᇩᇪᇫᇬᇭᇮ\n      [3:6]
    |   |   |   `--BLOCK_COMMENT_END -> */ [6:5]
    |   |   `--LITERAL_VOID -> void [7:4]
    |   |--IDENT -> aᇭᇮᇯᇰᇱᇲᇳᇴᇵᇶᇷᇸᇹᇺ [7:9]
    |   |--LPAREN -> ( [7:24]
    |   |--PARAMETERS -> PARAMETERS [7:25]
    |   |   `--PARAMETER_DEF -> PARAMETER_DEF [7:25]
    |   |       |--MODIFIERS -> MODIFIERS [7:25]
    |   |       |--TYPE -> TYPE [7:25]
    |   |       |   `--LITERAL_INT -> int [7:25]
    |   |       `--IDENT -> ᇤᇥᇦᇧᇨᇩᇪᇫᇬᇭᇮ [7:29]
    |   |--RPAREN -> ) [7:40]
    |   `--SLIST -> { [7:42]
    |       `--RCURLY -> } [7:43]
    `--RCURLY -> } [10:0]

 ➜  src java -jar checkstyle-9.0-SNAPSHOT-all.jar -J Test.java
Exception in thread "main" java.lang.IllegalArgumentException: [ERROR:5] Javadoc comment at column 15 has parse error. Details: no viable alternative at input '' while parsing JAVADOC_TAG
        at com.puppycrawl.tools.checkstyle.DetailNodeTreeStringPrinter.parseJavadocAsDetailNode(DetailNodeTreeStringPrinter.java:71)
        at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.parseAndPrintJavadocTree(AstTreeStringPrinter.java:116)
        at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:97)
        at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:101)
        at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:101)
        at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:101)
        at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:101)
        at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:101)
        at com.puppycrawl.tools.checkstyle.AstTreeStringPrinter.printJavaAndJavadocTree(AstTreeStringPrinter.java:79)
        at com.puppycrawl.tools.checkstyle.Main.runCli(Main.java:306)
        at com.puppycrawl.tools.checkstyle.Main.execute(Main.java:191)
        at com.puppycrawl.tools.checkstyle.Main.main(Main.java:126)

This will be a complicated update, since many lexer rules define their own character classes separately and not always consistently:

 ➜  checkstyle git:(master) cat $(find . -name "JavadocLexer.g4") | grep "\["
CUSTOM_NAME: '@' [a-zA-Z0-9:._-]+ {isJavadocTagAvailable}?;
PARAMETER_NAME: [a-zA-Z0-9<>_$]+ -> mode(DEFAULT_MODE);
CLASS: [A-Z] [a-zA-Z0-9_$]* {referenceCatched = true;};
MEMBER: [a-zA-Z0-9_$]+ {!insideReferenceArguments}?;
ARGUMENT: ([a-zA-Z0-9_$] | '.' | '[' | ']')+ {insideReferenceArguments}?;
FIELD_NAME: [a-zA-Z0-9_$]+ -> mode(serialFieldFieldType);
FIELD_TYPE: [a-zA-Z0-9_$]+ -> mode(DEFAULT_MODE);
CLASS_NAME: ([a-zA-Z0-9_$] | '.')+ -> mode(DEFAULT_MODE);
CustomName1: '@' [a-zA-Z0-9:._-]+ {recognizeXmlTags=false;}
Brackets: '{' (~[}] | Brackets)* '}' -> type(CHAR);
Text: ~[}] -> type(CHAR);
fragment JavaLetter: [A-Za-z_$];
fragment JavaLetterOrDigit: [0-9A-Za-z_$];
HEXDIGIT    :   [a-fA-F0-9] ;
DIGIT       :   [0-9] ;
            :   [:a-zA-Z]
FragmentReference: ([a-zA-Z0-9_-] | '.')+
      | ([a-zA-Z0-9_-] | '.')* '#' [a-zA-Z0-9_-]+ ( '(' (([a-zA-Z0-9_-] | '.')+ | ',' | ' ')* ')' )?
ATTR_VALUE  : '"' ~[<"]* '"'        {!attributeCatched}? {attributeCatched=true;}
            | '\'' ~[<']* '\''      {!attributeCatched}? {attributeCatched=true;}
            | (~[> \t\r\n/] | SlashInAttr)+ {!attributeCatched}? {attributeCatched=true;}
CUSTOM_NAME: '@' [a-zA-Z0-9:._-]+ {isJavadocTagAvailable}?;
CLASS: [A-Z] [a-zA-Z0-9_$]* {referenceCatched = true;};
MEMBER: [a-zA-Z0-9_$]+ {!insideReferenceArguments}?;
ARGUMENT: ([a-zA-Z0-9_$] | '.' | '[' | ']')+ {insideReferenceArguments}?;
FIELD_NAME: [a-zA-Z0-9_$]+ -> mode(serialFieldFieldType);
FIELD_TYPE: [a-zA-Z0-9_$]+ -> mode(DEFAULT_MODE);
CLASS_NAME: ([a-zA-Z0-9_$] | '.')+ -> mode(DEFAULT_MODE);
CustomName1: '@' [a-zA-Z0-9:._-]+ {recognizeXmlTags=false;}
Brackets: '{' (~[}] | Brackets)* '}' -> type(CHAR);
Text: ~[}] -> type(CHAR);
    : [a-zA-Z$_]
    | ~[\u0000-\u007F\uD800-\uDBFF]
    | [\uD800-\uDBFF] [\uDC00-\uDFFF]
    | [0-9]
HEXDIGIT    :   [a-fA-F0-9] ;
DIGIT       :   [0-9] ;
            :   [:a-zA-Z]
FragmentReference: ([a-zA-Z0-9_-] | '.')+
      | ([a-zA-Z0-9_-] | '.')* '#' [a-zA-Z0-9_-]+ ( '(' (([a-zA-Z0-9_-] | '.')+ | ',' | ' ')* ')' )?
ATTR_VALUE  : '"' ~[<"]* '"'        {!attributeCatched}? {attributeCatched=true;}
            | '\'' ~[<']* '\''      {!attributeCatched}? {attributeCatched=true;}
            | (~[> \t\r\n/] | SlashInAttr)+ {!attributeCatched}? {attributeCatched=true;}
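
To illustrate why the `@param` line fails, here is a small standalone Java sketch (not Checkstyle code) that applies the ASCII-only character class from the PARAMETER_NAME rule to a non-ASCII name. The class `AsciiParamNameDemo` and both helper methods are hypothetical names made up for this illustration; only the character class is copied from the grep output. The sample name uses three letters from the Hangul Jamo block, similar to the ones in Test.java.

```java
import java.util.regex.Pattern;

public class AsciiParamNameDemo {
    // Character class copied from the PARAMETER_NAME lexer rule.
    private static final Pattern PARAMETER_NAME = Pattern.compile("[a-zA-Z0-9<>_$]+");

    /** Hypothetical helper: true if the whole name fits the ASCII-only rule. */
    static boolean matchesLexerRule(String name) {
        return PARAMETER_NAME.matcher(name).matches();
    }

    /** True if every code point is legal in a Java identifier (keywords aside). */
    static boolean isJavaIdentifier(String name) {
        if (name.isEmpty() || !Character.isJavaIdentifierStart(name.codePointAt(0))) {
            return false;
        }
        return name.codePoints().skip(1).allMatch(Character::isJavaIdentifierPart);
    }

    public static void main(String[] args) {
        // Three Hangul Jamo letters, written as escapes to avoid encoding issues.
        String name = "\u11E4\u11E5\u11E6";
        System.out.println(matchesLexerRule(name));  // false
        System.out.println(isJavaIdentifier(name));  // true
    }
}
```

The name is a legal Java identifier, yet the lexer rule's character class rejects it, which is consistent with the "no viable alternative" error reported while parsing JAVADOC_TAG.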

It would be a good idea to emulate the Java lexer's Letter fragment rule and reuse it throughout the Javadoc lexer grammar, so that every rule that matches a name accepts the same set of characters. For example, the FIELD_TYPE rule should become something like:

FIELD_TYPE: IDENT -> mode(DEFAULT_MODE);

where IDENT is defined as:

IDENT:         Letter LetterOrDigit*;

fragment LetterOrDigit
    : Letter
    | [0-9]
    ;

fragment Letter
    // these are the "java letters" below 0x7F
    : [a-zA-Z$_]
    // covers all characters above 0x7F which are not a surrogate
    | ~[\u0000-\u007F\uD800-\uDBFF]
    // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
    | [\uD800-\uDBFF] [\uDC00-\uDFFF]
    ;
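
As a rough way to reason about what the proposed IDENT rule would accept, the sketch below re-implements the Letter and LetterOrDigit fragments in plain Java, one code point at a time. It is only an approximation written for this issue, not generated lexer code; the class `JavadocIdentSketch` and its method names are hypothetical.

```java
public class JavadocIdentSketch {

    /** Mirrors the proposed Letter fragment, applied to a single code point. */
    static boolean isLetter(int cp) {
        if (cp <= 0x7F) {
            // [a-zA-Z$_]: the "java letters" below 0x7F
            return cp >= 'a' && cp <= 'z'
                || cp >= 'A' && cp <= 'Z'
                || cp == '$' || cp == '_';
        }
        if (Character.isSupplementaryCodePoint(cp)) {
            // [\uD800-\uDBFF] [\uDC00-\uDFFF]: a well-formed surrogate pair
            return true;
        }
        // ~[\u0000-\u007F\uD800-\uDBFF]: anything else except a lone high surrogate
        return cp < 0xD800 || cp > 0xDBFF;
    }

    /** Mirrors LetterOrDigit: Letter | [0-9]. */
    static boolean isLetterOrDigit(int cp) {
        return isLetter(cp) || cp >= '0' && cp <= '9';
    }

    /** Mirrors IDENT: Letter LetterOrDigit*. */
    static boolean isIdent(String text) {
        int[] cps = text.codePoints().toArray();
        if (cps.length == 0 || !isLetter(cps[0])) {
            return false;
        }
        for (int i = 1; i < cps.length; i++) {
            if (!isLetterOrDigit(cps[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdent("a\u11ED\u11EE")); // true: ASCII letter plus Hangul Jamo
        System.out.println(isIdent("FIELD_1"));       // true
        System.out.println(isIdent("9abc"));          // false: must start with a Letter
    }
}
```

One consequence worth noting: the current `[a-zA-Z0-9_$]+` classes accept a name that starts with a digit, while `Letter LetterOrDigit*` does not, so switching rules to IDENT changes more than unicode handling.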
