Recognize UTF8-encoded Latin-1 letters by xavierleroy · Pull Request #11717 · ocaml/ocaml

xavierleroy · 2022-11-11T17:50:33Z

As a leftover from the 1990s, OCaml currently recognizes accented letters in identifiers provided they are encoded with the now-defunct Latin-1 character set (ISO 8859-1). There's considerable pressure to get rid of this special case and accept only ASCII identifiers, plus arbitrary UTF-8 in strings and comments; see #1802 for instance. However, I still like my accented letters in identifiers, because they work beautifully for textbooks written in Western languages other than English.

This PR is an old experiment of mine where Latin-1 accented letters are recognized in identifiers using Unicode/UTF-8 encoding instead of ISO 8859-1 encoding. The approach was discussed earlier with @dbuenzli, see #10749 (comment) and following messages. I'm not sure this is a good idea, but I'm offering this implementation as a proof-of-concept that @dbuenzli and others can then tear to pieces.

I felt obliged to normalize the UTF-8 encoding of accented letters (I chose NFD) because different text editors save Unicode text in different forms: e.g. NFC for Emacs and most Linux utilities, and NFD for Xcode, TextEdit, and everything made by Apple.
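To make the NFC/NFD distinction concrete, here is a minimal OCaml sketch (not taken from the patch) showing the two UTF-8 byte sequences for "é". The names `e_acute_nfc`, `e_acute_nfd`, and `hex` are illustrative only.

```ocaml
(* "é" in NFC: the single code point U+00E9, two bytes in UTF-8. *)
let e_acute_nfc = "\u{E9}"

(* "é" in NFD: 'e' followed by U+0301 COMBINING ACUTE ACCENT, three bytes. *)
let e_acute_nfd = "e\u{301}"

(* Hex dump of a string's bytes, for inspection. *)
let hex s =
  String.concat " "
    (List.map (fun c -> Printf.sprintf "%02X" (Char.code c))
       (List.of_seq (String.to_seq s)))

let () =
  Printf.printf "NFC: %s\nNFD: %s\n" (hex e_acute_nfc) (hex e_acute_nfd);
  (* Same rendered letter, yet not equal as byte strings. *)
  assert (e_acute_nfc <> e_acute_nfd)
```

Because the two forms differ byte for byte, a lexer that compares identifiers as raw strings has to pick one form and normalize to it.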

Because of the ugly regexps, the lexer automaton gets significantly larger (about twice the states and nearly four times the table size), but not excessively so:

Before: 252 states, 7726 transitions, table size 32416 bytes
After: 515 states, 30181 transitions, table size 123814 bytes

I would still prefer to have a better story to tell, e.g. all UTF-8 letters in identifiers, but it's a bigger change.

In identifiers and labels, we recognize the same Latin-1 letters as before, but through their UTF8 encodings, not their ISO-Latin-1 encodings. Accented letters are normalized to their NFD representation.

gasche · 2022-11-11T20:24:40Z

This is a modest change, it makes mandating UTF8 less-compatibility-breaking, and it solves a practical problem with the maintenance of existing non-English textbooks that relied on latin1 encoding. I must say that I like the idea of starting with (something like) this as a baby step towards proper Unicode identifiers.

xavierleroy · 2022-11-12T16:22:22Z

Reading the code again, I think it can be made a bit nicer (and avoid the big regexps) by using some UTF8 functions that were introduced in OCaml 4.14.
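As a rough illustration of what a decoder-based check could look like, here is a sketch using the UTF-8 decoding functions available in the standard library since OCaml 4.14 (`String.get_utf_8_uchar` and the `Uchar.utf_decode_*` accessors). It is not the code from this PR. The helper names are hypothetical, and the accepted set (U+00C0..U+00FF minus the signs U+00D7 and U+00F7) is an assumption about what "Latin-1 letters" means here. It only handles precomposed (NFC) letters and ignores the rules on which characters may start an identifier.

```ocaml
(* Assumption for illustration: accepted non-ASCII letters are the Latin-1
   letters U+00C0..U+00FF, excluding U+00D7 (×) and U+00F7 (÷). *)
let is_latin1_letter u =
  let c = Uchar.to_int u in
  c >= 0xC0 && c <= 0xFF && c <> 0xD7 && c <> 0xF7

(* ASCII characters allowed inside an identifier. *)
let is_ascii_ident_char u =
  Uchar.is_char u &&
  (match Uchar.to_char u with
   | 'a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '\'' -> true
   | _ -> false)

(* Hypothetical helper: true if [s] is valid UTF-8 and every decoded
   character is an accepted identifier character. *)
let all_ident_chars s =
  let rec loop i =
    if i >= String.length s then true
    else
      let d = String.get_utf_8_uchar s i in
      Uchar.utf_decode_is_valid d
      && (let u = Uchar.utf_decode_uchar d in
          is_ascii_ident_char u || is_latin1_letter u)
      && loop (i + Uchar.utf_decode_length d)
  in
  loop 0
```

With this shape, the set of accepted letters lives in an ordinary predicate rather than in the lexer's regular expressions.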

In my mind, it's not clear what we should shoot for: a quick hack like this or a full implementation of Unicode identifiers as in UAX 31 and @whitequark's old prototype. Opinions welcome.

dbuenzli · 2022-11-13T20:34:35Z

I think solving this would be a good first step to normalise the situation. I suspect most people like me who write end-user apps in OCaml for non-English-speaking users or long-form documentation in .mli and .mld files moved to UTF-8 encoded files a long time ago.

I note however that this does not check that files are UTF-8 encoded which would be good to have.
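For reference, a whole-file validity check can be sketched with standard library functions available since OCaml 4.14 (`String.is_valid_utf_8`, `In_channel.with_open_bin`, `In_channel.input_all`). The helper `file_is_utf_8` is hypothetical and is not something the compiler provides. A real implementation would more likely validate while lexing instead of reading the file twice.

```ocaml
(* Hypothetical helper: read [path] in binary mode and check that its
   entire contents form valid UTF-8. Requires OCaml >= 4.14. *)
let file_is_utf_8 path =
  In_channel.with_open_bin path (fun ic ->
    String.is_valid_utf_8 (In_channel.input_all ic))

let () =
  (* "été" encoded as UTF-8 is valid. *)
  assert (String.is_valid_utf_8 "\u{E9}t\u{E9}");
  (* "été" encoded as Latin-1 (bytes E9 74 E9) is not valid UTF-8. *)
  assert (not (String.is_valid_utf_8 "\xE9t\xE9"))
```

The second assertion shows why such a check also catches legacy Latin-1 sources: a lone byte 0xE9 is never a complete UTF-8 sequence.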

A few comments.

> I felt obliged to normalize the UTF-8 encoding of accented letters (I chose NFD)

If the compiler does use String.capitalize_ascii or Char.uppercase in a meaningful way, NFD may actually be a requirement rather than a choice, unless a Unicode-aware case mapping is added.
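A small sketch of why the normal form matters for ASCII-only case mapping. It assumes nothing beyond the standard `String.capitalize_ascii`, and the example word is arbitrary.

```ocaml
let () =
  (* NFD: the base letter is plain ASCII 'e', so ASCII capitalization
     reaches it and the combining accent is carried along untouched. *)
  assert (String.capitalize_ascii "e\u{301}cole" = "E\u{301}cole");
  (* NFC: the first byte is 0xC3, which is not an ASCII letter,
     so the string comes back unchanged. *)
  assert (String.capitalize_ascii "\u{E9}cole" = "\u{E9}cole")
```

In NFD the result is the decomposed form of "École", whereas in NFC the ASCII function silently leaves "école" in lowercase.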

Which brings up the question… what happens with the file éléphant.ml in the room? @whitequark has a section about file systems. You likely need to normalise the filenames of compilation units. Only Apple's file systems seem to guarantee you NFD. Other systems do nothing about it and likely give you NFC (the output of input methods).
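To illustrate what normalising a filename involves, here is a deliberately partial sketch that recomposes a few base-letter and combining-mark pairs into their precomposed form. The function names and the four-entry table are hypothetical. This is not a real NFC implementation, which needs the full Unicode composition data and canonical reordering, and invalid UTF-8 bytes come out as U+FFFD.

```ocaml
(* Partial, illustrative table: (base, combining mark) -> precomposed. *)
let compose base mark =
  match base, mark with
  | 0x65, 0x301 -> Some 0xE9  (* e + acute -> é *)
  | 0x65, 0x300 -> Some 0xE8  (* e + grave -> è *)
  | 0x61, 0x300 -> Some 0xE0  (* a + grave -> à *)
  | 0x45, 0x301 -> Some 0xC9  (* E + acute -> É *)
  | _ -> None

(* Recompose the pairs known to [compose]; copy everything else as is.
   Requires OCaml >= 4.14 for String.get_utf_8_uchar. *)
let to_nfc_sketch s =
  let len = String.length s in
  let b = Buffer.create len in
  let decode i =
    let d = String.get_utf_8_uchar s i in
    (Uchar.utf_decode_uchar d, Uchar.utf_decode_length d)
  in
  let rec loop i =
    if i < len then begin
      let (u, n) = decode i in
      let j = i + n in
      if j < len then begin
        let (m, k) = decode j in
        match compose (Uchar.to_int u) (Uchar.to_int m) with
        | Some c -> Buffer.add_utf_8_uchar b (Uchar.of_int c); loop (j + k)
        | None -> Buffer.add_utf_8_uchar b u; loop j
      end else
        Buffer.add_utf_8_uchar b u
    end
  in
  loop 0;
  Buffer.contents b

let () =
  (* A decomposed "éléphant.ml" and a precomposed one map to the same name. *)
  assert (to_nfc_sketch "e\u{301}le\u{301}phant.ml" = "\u{E9}l\u{E9}phant.ml");
  assert (to_nfc_sketch "\u{E9}l\u{E9}phant.ml" = "\u{E9}l\u{E9}phant.ml")
```

Whichever form is chosen, the point is that the module name derived from the file and the identifier written in source code end up as the same byte string.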

Other than that, except for size, and possible expectations in other tooling downstream (debuggers, profilers) I think the choice between NFD or NFC is mostly irrelevant.

> because different text editors save Unicode text in different forms: e.g. NFC for Emacs and most Linux utilities, and NFD for xcode, textedit, and everything made by Apple.

I'm surprised by this. I tried here and both TextEdit and Xcode give me NFC with a CH-FR input method.

> In my mind, it's not clear what we should shoot for: a quick hack like this or a full implementation of Unicode identifiers as in UAX 31 and @whitequark's old prototype. Opinions welcome.

Moving to checked UTF-8 encoded files with a compatibility story for latin1 users looks like low-hanging fruit that needs little Unicode machinery. I would rather aim for that first.

xavierleroy · 2022-11-15T14:30:06Z

Thanks for the comments @dbuenzli, your points are well taken. I remembered wrongly about macOS text editors favoring NFD over NFC. Concerning interactions with files (the éléphant in the pièce), @whitequark's analysis is interesting, but I need to run more experiments too.

DemiMarie · 2022-11-16T22:23:14Z

I strongly recommend NFC. IIUC NFD is widely considered to be a mistake.

dbuenzli · 2022-11-16T23:02:07Z

> I strongly recommend NFC. IIUC NFD is widely considered to be a mistake.

Would you care to elaborate?

Without proper out-of-band agreements, you have to be prepared for any, even mixed, forms on foreign input.

After that it's up to you to convert and use the form that is the easiest to work with for the task at hand internally.

There's no such thing as a "mistaken" normal form. However, there may be mistaken programs making assumptions they shouldn't make about foreign inputs :-)

DemiMarie · 2022-11-17T09:20:23Z

> > I strongly recommend NFC. IIUC NFD is widely considered to be a mistake.
>
> Would you care to elaborate?
>
> Without proper out-of-band agreements, you have to be prepared for any, even mixed, forms on foreign input.
>
> After that it's up to you to convert and use the form that is the easiest to work with for the task at hand internally.
>
> There's no such thing as a "mistaken" normal form. However, there may be mistaken programs making assumptions they shouldn't make about foreign inputs :-)

Sorry, I meant that Apple’s use of NFD is considered to be a mistake, IIUC.

xavierleroy · 2022-11-18T16:06:45Z

Thanks for the feedback. The next iteration is here: #11736. It does use NFC instead of NFD, and tries to address the éléphant.ml problem.

Recognize UTF8-encoded Latin-1 letters

bc9452b

xavierleroy mentioned this pull request Nov 11, 2022

Attacks by introducing invisible source code using special Unicode characters #10749

Open

xavierleroy mentioned this pull request Nov 18, 2022

Modest support for Unicode letters in identifiers #11736

Closed

xavierleroy closed this Nov 18, 2022

gasche mentioned this pull request Jul 3, 2023

asmcomp "compile-time constants" do not work on cross-compilers #7250

Closed

favonia mentioned this pull request Dec 2, 2024

Support unescaped Uchar.t literals #12696

Open

xavierleroy wants to merge 1 commit into ocaml:trunk from xavierleroy:latin-utf8