Issue #8050: Support UTF-8 characters in identifiers #10437
This PR must not be merged until #10280 is merged.
@nmancus1, please fix conflict.
romani
left a comment
Ok to merge, no functional changes
Github, generate site
Github, generate site
Github, rebase
3a7de07 to ffcdec3
tokens (identifiers, keywords)</a>
should be written with
<a href="https://en.wikipedia.org/wiki/ASCII">ASCII</a>
<a href="https://en.wikipedia.org/wiki/UTF-16">UTF-16</a>
Correct me if I am wrong, but ASCII isn't a subset of UTF-16, is it? Isn't it completely different?
https://javarevisited.blogspot.com/2015/02/difference-between-utf-8-utf-16-and-utf.html
UTF-8 is compatible with ASCII while UTF-16 is incompatible with ASCII
If so, then we have to say we support both, based on the wording you are using here.
Yup, I knew that ASCII was a subset of UTF-8, and wrongly assumed that ASCII was a subset of UTF-16 by transitivity. I will update this. <-- This is totally wrong too; I am going to do some more reading and rewrite the documentation.
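For reference, the difference shows up at the byte level. Below is a minimal sketch (the class and method names are hypothetical and not part of this PR): ASCII-only text encodes to the same bytes in UTF-8, but not in UTF-16, where each of these characters takes two bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Checkstyle.
final class EncodingCompatSketch {

    /** True if the text has identical bytes in US-ASCII and UTF-8. */
    static boolean asciiMatchesUtf8(String text) {
        return Arrays.equals(
            text.getBytes(StandardCharsets.US_ASCII),
            text.getBytes(StandardCharsets.UTF_8));
    }

    /** True if the text has identical bytes in US-ASCII and UTF-16 (big-endian, no BOM). */
    static boolean asciiMatchesUtf16(String text) {
        return Arrays.equals(
            text.getBytes(StandardCharsets.US_ASCII),
            text.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        System.out.println(asciiMatchesUtf8("int"));  // true
        System.out.println(asciiMatchesUtf16("int")); // false
        // Each ASCII character occupies two bytes in UTF-16.
        System.out.println("int".getBytes(StandardCharsets.UTF_16BE).length); // 6
    }
}
```

So UTF-8 is byte-compatible with ASCII, while UTF-16 only shares the same code point values.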
Going by the title, I assume UTF-8 is also supported?
Maybe we are getting too technical here? The main thing is that we support natural Unicode. The exact file format depends more on what is reading the file than on the grammar itself, right? If it's us reading the file, then it's tied to something outside the grammar and can be looked at and fixed if reported. If it's ANTLR reading the file, then maybe we should just say we are bound by ANTLR's reading capabilities.
Ok, after doing more research:
> Going based on the title, I assume UTF 8 is also supported?
Yes.
> Maybe we are getting too technical here?
Probably not; I think it is good to be very specific here. From https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-JavaLetterOrDigit:
> Letters and digits may be drawn from the entire Unicode character set[...]. This allows programmers to use identifiers in their programs that are written in their native languages.
I think this should be our goal as well; our documentation should reflect that we now support all Unicode characters.
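To illustrate what "the entire Unicode character set" means in practice, here is a small sketch built on the JDK's `Character.isJavaIdentifierStart(int)` and `Character.isJavaIdentifierPart(int)`. The class and method names are hypothetical, and the check deliberately ignores keywords and literals such as `true` or `null`, so it is an approximation and not how the grammar itself decides.

```java
// Hypothetical helper for illustration only; not part of Checkstyle.
final class IdentifierCheckSketch {

    /** Checks identifier characters code point by code point (keywords are not excluded). */
    static boolean isJavaIdentifier(String text) {
        if (text.isEmpty()) {
            return false;
        }
        int first = text.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Step by code point so supplementary characters (surrogate pairs) are handled.
        for (int i = Character.charCount(first); i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!Character.isJavaIdentifierPart(codePoint)) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isJavaIdentifier("переменная")); // Cyrillic letters
        System.out.println(isJavaIdentifier("名前"));        // CJK ideographs
        System.out.println(isJavaIdentifier("1abc"));       // starts with a digit
    }
}
```

The result for a given character depends on the Unicode version bundled with the JDK that runs the check.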
We could add support for Unicode escapes. We could do this in a fashion similar to what Java itself does: first translate the Unicode escapes to UTF-16, then pass the CharStream to ANTLR. We have a few issues open for Unicode escape support, but all of them were opened by me from examples found in openjdk; I don't think anyone is actually writing code like this:
public class SupplementaryJavaID1 {
    public static void main(String[] s) {
        int \ud801\udc00abc = 1;
        int \ud802\udc00abc = 2;
        int \ud801\udc01abc = 3;
        int def\ud801\udc00 = 4;
        int \ud801\udc00\ud834\udd7b = 5;
        if (\ud801\udc00abc != 1 ||
            \ud802\udc00abc != 2 ||
            \ud801\udc01abc != 3 ||
            def\ud801\udc00 != 4 ||
            \ud801\udc00\ud834\udd7b != 5) {
            throw new RuntimeException("test failed");
        }
    }
}
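The pre-translation step mentioned above could look roughly like the sketch below. This is an assumption about one possible approach, not what Checkstyle or javac actually does: the class and method names are hypothetical, malformed escapes are left untouched (javac would report an error), and a backslash produced by an escape is simply not counted when deciding whether a later backslash is eligible. The translated string could then be handed to ANTLR, for example through `CharStreams.fromString` from the ANTLR 4 runtime, which is left out here so the example depends only on the JDK.

```java
// Hypothetical pre-processing sketch; not part of Checkstyle.
final class UnicodeEscapeSketch {

    /** Replaces Unicode escapes (backslash, one or more 'u', four hex digits) with UTF-16 code units. */
    static String translate(String source) {
        StringBuilder out = new StringBuilder(source.length());
        int length = source.length();
        int backslashRun = 0; // contiguous raw backslashes seen just before index i
        int i = 0;
        while (i < length) {
            char c = source.charAt(i);
            // A backslash can start an escape only if preceded by an even number of backslashes.
            if (c == '\\' && backslashRun % 2 == 0
                    && i + 1 < length && source.charAt(i + 1) == 'u') {
                int digits = i + 1;
                while (digits < length && source.charAt(digits) == 'u') {
                    digits++; // any number of 'u' markers is allowed
                }
                if (hasFourHexDigits(source, digits)) {
                    out.append((char) Integer.parseInt(source.substring(digits, digits + 4), 16));
                    i = digits + 4;
                    backslashRun = 0;
                    continue;
                }
            }
            backslashRun = c == '\\' ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    private static boolean hasFourHexDigits(String source, int start) {
        if (start + 4 > source.length()) {
            return false;
        }
        for (int k = start; k < start + 4; k++) {
            char ch = source.charAt(k);
            boolean hex = (ch >= '0' && ch <= '9')
                || (ch >= 'a' && ch <= 'f')
                || (ch >= 'A' && ch <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String translated = translate("int \\ud801\\udc00abc = 1;");
        // The two escapes form one surrogate pair, i.e. a single supplementary code point.
        System.out.println(Integer.toHexString(translated.codePointAt(4))); // 10400
    }
}
```

Because the escapes are decoded into UTF-16 code units, a pair such as the one in the example above becomes a single supplementary code point by the time a lexer sees it.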
Also: ASCII is a subset of Unicode, so we should just state that we support Unicode characters, and not escapes. Done.
Github, generate web site


Closes #8050.
The original proposed test input made a file that was too long (javac wouldn't compile it), so I combined characters into identifiers of 25 characters for each int. I generated the input file by iterating through every existing code point and checking whether it was valid in a Java identifier, so everything should be covered here. This input file must be non-compilable, because later versions of Java support more characters, and I see no reason to limit the test to Java 8 characters since we support up to Java 16.
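The generation approach described above could be sketched roughly as follows. This is not the actual script used for the PR: the class and method names are hypothetical, and two choices are my own assumptions, namely skipping identifier-ignorable characters and prefixing `_` when a chunk would otherwise begin with a character that cannot start an identifier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator sketch; not the script used for the PR.
final class IdentifierInputSketch {

    /** Packs every identifier character known to the running JDK into int declarations. */
    static List<String> declarations(int charsPerIdentifier) {
        List<String> result = new ArrayList<>();
        StringBuilder identifier = new StringBuilder();
        int count = 0; // code points in the identifier being built
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            // Assumption: ignorable characters are skipped so identifiers stay distinct.
            if (!Character.isJavaIdentifierPart(cp) || Character.isIdentifierIgnorable(cp)) {
                continue;
            }
            if (count == 0 && !Character.isJavaIdentifierStart(cp)) {
                identifier.append('_'); // assumption: pad so the identifier has a valid start
                count++;
            }
            identifier.appendCodePoint(cp);
            count++;
            if (count >= charsPerIdentifier) {
                result.add("int " + identifier + " = " + result.size() + ";");
                identifier.setLength(0);
                count = 0;
            }
        }
        if (count > 0) {
            result.add("int " + identifier + " = " + result.size() + ";");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = declarations(25);
        System.out.println(lines.size() + " declarations generated");
        System.out.println(lines.get(0));
    }
}
```

The output depends on the Unicode data of the JDK that runs it, which is the same reason the generated file may not compile on older Java versions.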