Would it be possible to support non-ASCII characters in identifiers? #8050

@blerner

According to Checkstyle's documented limitations, Checkstyle does not support UTF-8 characters, and indeed the grammar defines an overly simple regex for recognizing identifiers. That isn't quite accurate, though: non-ASCII characters in strings or comments are matched by the . regex and are therefore silently accepted.

Per the JLS, section 3.8, identifiers are richer than the simple [A-Za-z_$][0-9A-Za-z_$]* regex defined in JavadocLexer.g4. I think a closer approximation that's compatible with ANTLR's lexer rule elements would be

// Digits (\p{Nd}) may appear in an identifier but may not start one.
fragment JavaIdentStart: [\p{L}\p{Nl}\p{Sc}\p{Pc}];
fragment JavaIdentPart: [\p{L}\p{Nl}\p{Nd}\p{Sc}\p{Pc}\p{Mn}\p{Mc}];
fragment Identifier: JavaIdentStart (JavaIdentPart)*;
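
For comparison, JLS 3.8 defines identifier characters in terms of Character.isJavaIdentifierStart(int) and Character.isJavaIdentifierPart(int), so any lexer rule should agree with those two methods. Here is a minimal sketch of that reference behavior; the class and method names are made up for illustration and are not Checkstyle code, and keywords are deliberately not excluded.

```java
public final class IdentifierCheck {
    private IdentifierCheck() {
    }

    /**
     * Returns true if s has the lexical shape of a Java identifier per JLS 3.8.
     * Hypothetical helper: keywords and literals such as "class" or "true"
     * are NOT rejected here, only the character classes are checked.
     */
    public static boolean isIdentifierShaped(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        // Walk by code point so supplementary characters are handled correctly.
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length();) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifierShaped("caf\u00e9")); // non-ASCII letter: true
        System.out.println(isIdentifierShaped("1abc"));      // leading digit: false
        System.out.println(isIdentifierShaped("e\u0301"));   // combining mark after a letter: true
    }
}
```

A test comparing the lexer's Identifier rule against a helper like this over a sample of code points would show where the character-class approximation diverges from the JLS definition.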

I see that there are prior issues similar to this (mainly #4562), but my question is: if it's technically feasible and not too difficult to change the definition of Identifier to something like the above, so that it is more faithful to the JLS, why should this limitation continue? If this is simply a hard rule that Checkstyle insists upon, feel free to close this issue as a duplicate, but it seems like an odd inconsistency to enforce.