Issue #8050: Support UTF-8 characters in identifiers by nrmancuso · Pull Request #10437 · checkstyle/checkstyle

nrmancuso · 2021-07-27T15:45:24Z

Closes #8050.

The originally proposed test input produced a file that was too long (javac wouldn't compile it), so I combined characters into identifiers of 25 characters for each int. I generated the input file by iterating through every existing code point and checking whether it was a valid Java identifier character, so everything should be covered here. This input file must be non-compilable, because later versions of Java support more characters, and I see no reason to limit the test to Java 8 characters since we support up to Java 16.
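For readers who want to reproduce something similar, the generation step described above could look roughly like the sketch below. It is not the script used for this PR: the class name, the output file name, and the choice to keep only code points accepted by `Character.isJavaIdentifierStart` (so that every 25-character chunk is a legal identifier on its own) are all assumptions. The set of accepted code points depends on the JDK that runs the sketch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of checkstyle.
public class IdentifierInputSketch {

    /** Collects every code point that may start a Java identifier on the running JDK. */
    static List<Integer> identifierStartCodePoints() {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isJavaIdentifierStart(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    /** Packs the code points into identifiers of at most {@code size} code points each. */
    static List<String> buildIdentifiers(List<Integer> codePoints, int size) {
        List<String> identifiers = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (int cp : codePoints) {
            // appendCodePoint writes a surrogate pair for supplementary characters
            current.appendCodePoint(cp);
            count++;
            if (count == size) {
                identifiers.add(current.toString());
                current.setLength(0);
                count = 0;
            }
        }
        if (count > 0) {
            identifiers.add(current.toString());
        }
        return identifiers;
    }

    public static void main(String[] args) throws IOException {
        List<String> ids = buildIdentifiers(identifierStartCodePoints(), 25);
        StringBuilder source = new StringBuilder("public class InputSketch {\n");
        for (int i = 0; i < ids.size(); i++) {
            source.append("    int ").append(ids.get(i)).append(" = ").append(i).append(";\n");
        }
        source.append("}\n");
        // The file name is an assumption; the file is written as UTF-8.
        Files.write(Paths.get("InputSketch.java"),
                source.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```

Code points that are valid only as a later character of an identifier (`Character.isJavaIdentifierPart`) are left out of this sketch; covering them would require placing each one after a valid start character.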

nrmancuso · 2021-08-02T12:10:44Z

This PR must not be merged until #10280 is merged.

romani · 2021-08-11T02:48:10Z

@nmancus1, please fix the conflict.

romani

Ok to merge, no functional changes

nrmancuso · 2021-08-12T13:38:16Z

Github, generate site

github-actions · 2021-08-12T14:05:07Z

https://checkstyle-diff-reports.s3.us-east-2.amazonaws.com/3785f76_2021135116/index.html

nrmancuso · 2021-08-12T14:13:42Z

Github, generate site

github-actions · 2021-08-12T14:56:06Z

https://checkstyle-diff-reports.s3.us-east-2.amazonaws.com/3a7de07_2021144714/index.html

romani · 2021-08-12T14:58:29Z

Github, rebase

nrmancuso · 2021-08-12T15:20:56Z

Updated documentation:

rnveach · 2021-08-12T17:03:04Z

            tokens (identifiers, keywords)</a>
          should be written with
-          <a href="https://en.wikipedia.org/wiki/ASCII">ASCII</a>
+          <a href="https://en.wikipedia.org/wiki/UTF-16">UTF-16</a>


Correct me if I am wrong, but ASCII isn't a subset of UTF-16? It is completely different?
https://javarevisited.blogspot.com/2015/02/difference-between-utf-8-utf-16-and-utf.html

> UTF-8 is compatible with ASCII while UTF-16 is incompatible with ASCII

If so, then we have to say we support both, based on the wording you are using here.

~~Yup, I knew that ASCII was a subset of UTF-8, and falsely assumed that ASCII was a subset of UTF-16 by transitivity. I will update this.~~ <-- this is totally wrong too; I am going to do some more reading and rewrite the documentation.

Going based on the title, I assume UTF 8 is also supported?

Maybe we are getting too technical here? The main thing is we support natural Unicode. The exact file format depends more on what is reading in the file than on the grammar itself, right? If it's us reading the file, then it's tied to something outside the grammar and can be looked at and fixed if reported. If it's ANTLR reading the file, then maybe we should just say we are bound by ANTLR's reading capabilities.
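To make the encoding point in this thread concrete, here is a small sketch using only the JDK's standard charsets. The class and method names are made up for illustration. It shows that ASCII text has the same bytes in US-ASCII and UTF-8 but different bytes in UTF-16, and that a non-ASCII identifier decodes to the same `String` whichever Unicode encoding the file used, which is why the file encoding is a concern of whatever reads the file rather than of the grammar.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of checkstyle.
public class EncodingSketch {

    /** Returns true if the text is encoded to identical bytes by both charsets. */
    static boolean sameBytes(String text, Charset first, Charset second) {
        return Arrays.equals(text.getBytes(first), text.getBytes(second));
    }

    /** Encodes the text with the given charset and decodes it again. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String ascii = "abc";
        // ASCII and UTF-8 agree byte for byte on ASCII text.
        System.out.println(sameBytes(ascii, StandardCharsets.US_ASCII, StandardCharsets.UTF_8));    // true
        // UTF-16 uses two bytes per ASCII character, so the bytes differ.
        System.out.println(sameBytes(ascii, StandardCharsets.US_ASCII, StandardCharsets.UTF_16BE)); // false

        // A non-ASCII identifier survives a round trip through either Unicode encoding.
        String identifier = "переменная";
        System.out.println(identifier.equals(roundTrip(identifier, StandardCharsets.UTF_8)));    // true
        System.out.println(identifier.equals(roundTrip(identifier, StandardCharsets.UTF_16BE))); // true
    }
}
```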

Ok, after doing more research:

> Going based on the title, I assume UTF 8 is also supported?

Yes.

> Maybe we are getting too technical here?

Probably not; I think it is good to be very specific here. From https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-JavaLetterOrDigit:

> Letters and digits may be drawn from the entire Unicode character set[...]. This allows programmers to use identifiers in their programs that are written in their native languages.

I think this should be our goal as well; our documentation should reflect that we now support all Unicode characters.

We could add support for Unicode escapes. We could do this in much the same way Java itself does, by first translating the Unicode escapes to UTF-16 and then passing the CharStream to ANTLR. We have a few issues for Unicode escape support, but all of them were opened by me from examples found in openjdk; I don't think anyone is actually writing code like this:

public class SupplementaryJavaID1 {
    public static void main(String[] s) {
        int \ud801\udc00abc = 1;
        int \ud802\udc00abc = 2;
        int \ud801\udc01abc = 3;
        int def\ud801\udc00 = 4;
        int \ud801\udc00\ud834\udd7b = 5;

        if (\ud801\udc00abc != 1 || \ud802\udc00abc != 2
                || \ud801\udc01abc != 3 || def\ud801\udc00 != 4
                || \ud801\udc00\ud834\udd7b != 5) {
            throw new RuntimeException("test failed");
        }
    }
}
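The pre-translation step mentioned above could be sketched along these lines. This is an illustration only, not code from checkstyle: the class and method names are hypothetical, and the handling is a simplified reading of JLS §3.3 (a backslash starts an escape only when preceded by an even number of contiguous backslashes, and any number of `u` characters may follow it). Malformed escapes, which javac rejects, are simply copied through here.

```java
// Hypothetical helper, not part of checkstyle.
public class UnicodeEscapeSketch {

    /** Replaces each well-formed Unicode escape with the UTF-16 code unit it denotes. */
    static String translate(String source) {
        StringBuilder out = new StringBuilder(source.length());
        int backslashRun = 0; // contiguous raw backslashes immediately before index i
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\\' && backslashRun % 2 == 0
                    && i + 1 < source.length() && source.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < source.length() && source.charAt(j) == 'u') {
                    j++; // the JLS allows several 'u' characters in a row
                }
                if (j + 4 <= source.length() && isHex(source, j)) {
                    out.append((char) Integer.parseInt(source.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashRun = 0;
                    continue;
                }
                // Malformed escape: fall through and copy the backslash unchanged.
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** Checks that the four characters starting at {@code from} are ASCII hex digits. */
    private static boolean isHex(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            char ch = s.charAt(k);
            boolean hex = (ch >= '0' && ch <= '9')
                    || (ch >= 'a' && ch <= 'f')
                    || (ch >= 'A' && ch <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String translated = translate("int \\ud801\\udc00abc = 1;");
        // The two escapes form a surrogate pair for the single code point U+10400.
        System.out.println(Integer.toHexString(translated.codePointAt(4))); // 10400
    }
}
```

The translated string could then be handed to the lexer, for example through ANTLR 4's `CharStreams.fromString`; whether that fits the way checkstyle feeds its parser is an assumption here. Line and column numbers reported against the translated text would no longer match the original file, so a real implementation would also need to keep a position mapping.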

Also: ASCII is a subset of Unicode, so we should just state that we support Unicode characters, but not Unicode escapes. Done.

romani · 2021-08-13T22:35:41Z

Github, generate web site

github-actions · 2021-08-13T22:39:36Z

https://checkstyle-diff-reports.s3.us-east-2.amazonaws.com/d6c9592_2021223831/index.html

https://checkstyle-diff-reports.s3.us-east-2.amazonaws.com/d6c9592_2021223831/writingchecks.html#Limitations

nrmancuso marked this pull request as draft July 27, 2021 15:47

nrmancuso force-pushed the issue-8050 branch from 4326850 to be0fb42 Compare July 27, 2021 16:27

nrmancuso commented Jul 27, 2021

Comment thread src/xdocs/writingchecks.xml

nrmancuso force-pushed the issue-8050 branch from be0fb42 to 5047a8b Compare August 2, 2021 12:04

nrmancuso marked this pull request as ready for review August 2, 2021 12:04

pbludov added the blocked label Aug 2, 2021

nrmancuso force-pushed the issue-8050 branch from 5047a8b to 297c49c Compare August 2, 2021 18:43

romani removed the blocked label Aug 11, 2021

nrmancuso force-pushed the issue-8050 branch from 297c49c to 2a38bbc Compare August 11, 2021 04:22

romani approved these changes Aug 11, 2021

romani requested a review from pbludov August 11, 2021 22:02

romani assigned pbludov Aug 11, 2021

esilkensen approved these changes Aug 11, 2021

rnveach reviewed Aug 12, 2021

Comment thread src/xdocs/writingchecks.xml

Comment thread src/xdocs/writingchecks.xml

pbludov approved these changes Aug 12, 2021

pbludov assigned rnveach and unassigned pbludov Aug 12, 2021

nrmancuso force-pushed the issue-8050 branch from 2a38bbc to 3785f76 Compare August 12, 2021 13:36

nrmancuso force-pushed the issue-8050 branch from 3785f76 to 3a7de07 Compare August 12, 2021 14:13

github-actions Bot force-pushed the issue-8050 branch from 3a7de07 to ffcdec3 Compare August 12, 2021 14:59

rnveach reviewed Aug 12, 2021

nrmancuso force-pushed the issue-8050 branch from ffcdec3 to a6df630 Compare August 13, 2021 14:15

Issue checkstyle#8050: Support UTF-8 characters in identifiers

d6c9592

nrmancuso force-pushed the issue-8050 branch from a6df630 to d6c9592 Compare August 13, 2021 14:37

rnveach approved these changes Aug 13, 2021

rnveach merged commit 1e49957 into checkstyle:master Aug 13, 2021

5 participants
