Recognize UTF8-encoded Latin-1 letters #11717
In identifiers and labels, we recognize the same Latin-1 letters as before, but through their UTF8 encodings, not their ISO-Latin-1 encodings. Accented letters are normalized to their NFD representation.
This is a modest change: it makes mandating UTF8 less compatibility-breaking, and it solves a practical problem with the maintenance of existing non-English textbooks that relied on latin1 encoding. I must say that I like the idea of starting with (something like) this as a baby step towards proper Unicode identifiers.
Reading the code again, I think it can be made a bit nicer (and avoid the big regexps) by using some UTF8 functions that were introduced in OCaml 4.14. In my mind, it's not clear what we should shoot for: a quick hack like this, or a full implementation of Unicode identifiers as in UAX 31 and @whitequark's old prototype. Opinions welcome.
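As a rough illustration of what "using the UTF8 functions from OCaml 4.14" could look like, here is a minimal sketch that decodes a string with `String.get_utf_8_uchar` and accepts only ASCII identifier characters and precomposed Latin-1 letters. The helper names (`is_latin1_letter`, `is_ascii_ident_char`, `all_ident_chars`) are hypothetical and this is not the code of the PR. It does not handle decomposed (NFD) input and does not check that the first character is a letter.

```ocaml
(* Sketch only, requires OCaml >= 4.14. All function names are hypothetical. *)

(* Letters of the Latin-1 supplement: U+00C0..U+00FF, except
   U+00D7 (multiplication sign) and U+00F7 (division sign). *)
let is_latin1_letter (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  c >= 0xC0 && c <= 0xFF && c <> 0xD7 && c <> 0xF7

(* ASCII characters allowed inside an identifier. *)
let is_ascii_ident_char (u : Uchar.t) : bool =
  Uchar.is_char u
  && (match Uchar.to_char u with
      | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' | '\'' -> true
      | _ -> false)

(* True if [s] is non-empty, valid UTF-8, and made only of the
   characters accepted above. *)
let all_ident_chars (s : string) : bool =
  let n = String.length s in
  let rec loop i =
    if i >= n then true
    else
      let d = String.get_utf_8_uchar s i in
      if not (Uchar.utf_decode_is_valid d) then false
      else
        let u = Uchar.utf_decode_uchar d in
        if is_ascii_ident_char u || is_latin1_letter u
        then loop (i + Uchar.utf_decode_length d)
        else false
  in
  n > 0 && loop 0
```

With this sketch, the UTF-8 bytes of `éléphant` are accepted, while the same word in raw ISO 8859-1 bytes is rejected because it is not valid UTF-8.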
I think solving this would be a good first step to normalise the situation. I suspect most people like me who write end-user apps in OCaml for non-English speaking users or long form documentation in I note however that this does not check that files are UTF-8 encoded, which would be good to have. A few comments.
If the compiler does use Which brings the question… what happens with the file Other than that, except for size, and possible expectations in other tooling downstream (debuggers, profilers) I think the choice between NFD or NFC is mostly irrelevant.
I'm surprised by this. I tried here and both TextEdit and Xcode give me NFC with a CH-FR input method.
Moving to checked UTF-8 encoded files with a compatibility story for latin1 users looks like low-hanging fruit that needs little Unicode machinery. I would rather aim for that first.
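To make the "checked UTF-8 with a latin1 compatibility story" idea concrete, here is one possible sketch, under the assumption (not stated in the thread) that a source which fails UTF-8 validation is treated as ISO 8859-1 and transcoded. The function names are hypothetical; `String.is_valid_utf_8` is available from OCaml 4.14.

```ocaml
(* Sketch only, requires OCaml >= 4.14. Function names are hypothetical. *)

(* Transcode an ISO 8859-1 byte string to UTF-8: each byte denotes
   the Unicode code point with the same numeric value. *)
let latin1_to_utf8 (s : string) : string =
  let b = Buffer.create (String.length s * 2) in
  String.iter (fun c -> Buffer.add_utf_8_uchar b (Uchar.of_char c)) s;
  Buffer.contents b

(* Assumed policy: keep valid UTF-8 as-is, otherwise fall back to
   interpreting the bytes as Latin-1. *)
let source_as_utf8 (s : string) : string =
  if String.is_valid_utf_8 s then s else latin1_to_utf8 s
```

The fallback is a heuristic: a Latin-1 file whose bytes happen to form valid UTF-8 would be read as UTF-8, so an explicit opt-in may be preferable to guessing.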
Thanks for the comments @dbuenzli, your points are well taken. I remembered wrongly about macOS text editors favoring NFD over NFC. Concerning interactions with files (the éléphant in the pièce), @whitequark's analysis is interesting, but I need to run more experiments too.
I strongly recommend NFC. IIUC NFD is widely considered to be a mistake.
Would you care to elaborate? Without proper out-of-band agreements, you have to be prepared for any, even mixed, forms on foreign input. After that it's up to you to convert and use the form that is the easiest to work with for the task at hand internally. There's no such thing as a "mistaken" normal form. However there may be mistaken programs making assumptions they shouldn't make about foreign inputs :-)
Sorry, I meant that Apple’s use of NFD is considered to be a mistake, IIUC |
Thanks for the feedback. The next iteration is here: #11736. It does use NFC instead of NFD, and tries to address the
As a leftover from the 1990's, OCaml currently recognizes accented letters in identifiers provided they are encoded with the now-defunct Latin-1 character set (ISO 8859-1). There's considerable pressure to get rid of this special case and accept ASCII identifiers only + arbitrary UTF-8 in strings and comments, see #1802 for instance. However, I still like my accented letters in identifiers, because they work beautifully for textbooks written in Western languages other than English.
This PR is an old experiment of mine where Latin-1 accented letters are recognized in identifiers using Unicode/UTF-8 encoding instead of ISO 8859-1 encoding. The approach was discussed earlier with @dbuenzli, see #10749 (comment) and following messages. I'm not sure this is a good idea, but I'm offering this implementation as a proof-of-concept that @dbuenzli and others can then tear to pieces.
I felt obliged to normalize the UTF-8 encoding of accented letters (I chose NFD) because different text editors save Unicode text in different forms: e.g. NFC for Emacs and most Linux utilities, and NFD for Xcode, TextEdit, and everything made by Apple.
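To show why normalization matters here, the following sketch maps a handful of precomposed Latin-1 letters to their decomposed (NFD) form, so that both spellings of `é` end up as the same byte sequence. It is an illustration under stated assumptions, not the PR's implementation: the table covers only six letters, the names `decompose` and `to_nfd_subset` are hypothetical, and invalid UTF-8 bytes are replaced by U+FFFD.

```ocaml
(* Sketch only, requires OCaml >= 4.14. Names are hypothetical and the
   table is deliberately incomplete. *)

(* Canonical decomposition of a few precomposed Latin-1 letters:
   base ASCII letter plus one combining mark. *)
let decompose (u : Uchar.t) : (char * int) option =
  match Uchar.to_int u with
  | 0xE0 -> Some ('a', 0x0300) (* a + combining grave *)
  | 0xE7 -> Some ('c', 0x0327) (* c + combining cedilla *)
  | 0xE8 -> Some ('e', 0x0300) (* e + combining grave *)
  | 0xE9 -> Some ('e', 0x0301) (* e + combining acute *)
  | 0xEA -> Some ('e', 0x0302) (* e + combining circumflex *)
  | 0xEB -> Some ('e', 0x0308) (* e + combining diaeresis *)
  | _ -> None

(* Rewrite [s] so that the letters above appear in decomposed form.
   Already-decomposed input is left unchanged. *)
let to_nfd_subset (s : string) : string =
  let n = String.length s in
  let b = Buffer.create (n + 8) in
  let rec loop i =
    if i < n then begin
      let d = String.get_utf_8_uchar s i in
      (* On invalid input this is U+FFFD, the replacement character. *)
      let u = Uchar.utf_decode_uchar d in
      (match decompose u with
       | Some (base, mark) ->
         Buffer.add_char b base;
         Buffer.add_utf_8_uchar b (Uchar.of_int mark)
       | None -> Buffer.add_utf_8_uchar b u);
      loop (i + Uchar.utf_decode_length d)
    end
  in
  loop 0;
  Buffer.contents b
```

After this step, an identifier typed with a precomposed `é` (U+00E9) and one typed as `e` followed by U+0301 compare equal as byte strings.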
Because of the ugly regexps, the lexer automaton gets significantly larger, but not excessively so:
I would still prefer to have a better story to tell, e.g. all UTF-8 letters in identifiers, but it's a bigger change.