Skip to content

Issue #8050: Support UTF-8 characters in identifiers#10437

Merged
rnveach merged 1 commit into
checkstyle:masterfrom
nrmancuso:issue-8050
Aug 13, 2021
Merged

Issue #8050: Support UTF-8 characters in identifiers#10437
rnveach merged 1 commit into
checkstyle:masterfrom
nrmancuso:issue-8050

Conversation

@nrmancuso

@nrmancuso nrmancuso commented Jul 27, 2021

Copy link
Copy Markdown
Contributor

Closes #8050.

The original proposed test input made a file that was too long (javac wouldn't compile it), so I combined characters into identifiers of 25 characters for each int. I generated the input file by iterating through every existing code point, and checking to see if it was a valid java identifier, so everything should be covered here. This input file must be non-compilable, because later versions of java support more characters, and I see no reason to limit the test to Java 8 characters since we support up to Java 16.

@nrmancuso nrmancuso marked this pull request as draft July 27, 2021 15:47
Comment thread src/xdocs/writingchecks.xml
@nrmancuso nrmancuso marked this pull request as ready for review August 2, 2021 12:04
@nrmancuso

Copy link
Copy Markdown
Contributor Author

This PR must not be merged until #10280 is merged.

@romani

romani commented Aug 11, 2021

Copy link
Copy Markdown
Member

@nmancus1 , please fix conflict

@romani romani left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok to merge, no functional changes

@romani romani requested a review from pbludov August 11, 2021 22:02
Comment thread src/xdocs/writingchecks.xml
Comment thread src/xdocs/writingchecks.xml
@nrmancuso

Copy link
Copy Markdown
Contributor Author

Github, generate site

@github-actions

Copy link
Copy Markdown
Contributor

@nrmancuso

Copy link
Copy Markdown
Contributor Author

Github, generate site

@github-actions

Copy link
Copy Markdown
Contributor

@romani

romani commented Aug 12, 2021

Copy link
Copy Markdown
Member

Github, rebase

@nrmancuso

Copy link
Copy Markdown
Contributor Author

Updated documentation:

image

Comment thread src/xdocs/writingchecks.xml Outdated
tokens (identifiers, keywords)</a>
should be written with
<a href="https://en.wikipedia.org/wiki/ASCII">ASCII</a>
<a href="https://en.wikipedia.org/wiki/UTF-16">UTF-16</a>

@rnveach rnveach Aug 12, 2021

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I am wrong, ASCII isn't a subset of UTF16? It is completely different?
https://javarevisited.blogspot.com/2015/02/difference-between-utf-8-utf-16-and-utf.html

UTF-8 is compatible with ASCII while UTF-16 is incompatible with ASCII

If so then we have to say we support both based on the wording you are using here.

@nrmancuso nrmancuso Aug 12, 2021

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I knew that ASCII was a subset of UTF8, and falsely assumed that ASCII was a subset of UTF16 by transitivity. I will update this. <-- this is totally wrong too, I am going to do some more reading and rewrite the documentation.

@rnveach rnveach Aug 12, 2021

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going based on the title, I assume UTF 8 is also supported?

Maybe we are getting too technical here? The main thing is we support natural Unicode. The exact file format depends more on what is reading in the file, than the grammar itself, right? If its us reading the file, then its tied to something outside the grammar and can be looked at to be fixed if reported. If its ANTLR reading the file, then maybe we should just say we are bound by ANTLR's reading capabilities.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, after doing more research:

Going based on the title, I assume UTF 8 is also supported?

Yes.

Maybe we are getting too technical here?

Probably not; I think it is good to be very specific here. From https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-JavaLetterOrDigit:

Letters and digits may be drawn from the entire Unicode character set[...]. This allows programmers to use identifiers in their programs that are written in their native languages.

I think this should be our goal, as well; our documentation should reflect that we support all unicode characters now.

We could add support for unicode escapes. We could do this in a similar fashion that Java itself does by first translating the unicode escapes to UTF-16, then passing the CharStream to ANTLR. We have a few issues for unicode escape support, but all have been opened by me from examples found in openjdk; I don't think anyone is actually writing code like this:

public class SupplementaryJavaID1 {
    public static void main(String[] s) {
        int \ud801\udc00abc = 1;
        int \ud802\udc00abc = 2;
        int \ud801\udc01abc = 3;
        int def\ud801\udc00 = 4;
        int \ud801\udc00\ud834\udd7b = 5;

        if (\ud801\udc00abc != 1 ||
            \ud802\udc00abc != 2 ||
            \ud801\udc01abc != 3 ||
            def\ud801\udc00 != 4 ||
            \ud801\udc00\ud834\udd7b != 5) {
                throw new RuntimeException("test failed");
        }
    }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: ASCII is a subset of unicode, so we should just state that we support unicode characters, and not escapes. Done.

@romani

romani commented Aug 13, 2021

Copy link
Copy Markdown
Member

Github, generate web site

@github-actions

github-actions Bot commented Aug 13, 2021

Copy link
Copy Markdown
Contributor

@rnveach rnveach merged commit 1e49957 into checkstyle:master Aug 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Would it be possible to support non-ASCII characters in identifiers?

5 participants