Recognize UTF8-encoded Latin-1 letters #11717

Closed
xavierleroy wants to merge 1 commit into ocaml:trunk from xavierleroy:latin-utf8

@xavierleroy
Contributor

As a leftover from the 1990s, OCaml currently recognizes accented letters in identifiers provided they are encoded with the now-defunct Latin-1 character set (ISO 8859-1). There's considerable pressure to get rid of this special case and accept only ASCII identifiers, plus arbitrary UTF-8 in strings and comments; see #1802 for instance. However, I still like my accented letters in identifiers, because they work beautifully for textbooks written in Western languages other than English.

This PR is an old experiment of mine where Latin-1 accented letters are recognized in identifiers using Unicode/UTF-8 encoding instead of ISO 8859-1 encoding. The approach was discussed earlier with @dbuenzli, see #10749 (comment) and following messages. I'm not sure this is a good idea, but I'm offering this implementation as a proof-of-concept that @dbuenzli and others can then tear to pieces.

I felt obliged to normalize the UTF-8 encoding of accented letters (I chose NFD) because different text editors save Unicode text in different forms: e.g. NFC for Emacs and most Linux utilities, and NFD for xcode, textedit, and everything made by Apple.
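To make the NFC/NFD distinction concrete, here is a minimal sketch (not code from this PR) that prints the UTF-8 bytes of "é" in both forms. The helper name `hex_bytes` is made up for the illustration; the `\u{...}` string escape requires OCaml 4.06 or later.

```ocaml
(* "é" in NFC: the single precomposed code point U+00E9. *)
let e_acute_nfc = "\u{E9}"

(* "é" in NFD: base letter "e" (U+0065) followed by
   COMBINING ACUTE ACCENT (U+0301). *)
let e_acute_nfd = "e\u{301}"

(* Hypothetical helper: render a string as space-separated hex bytes. *)
let hex_bytes s =
  String.to_seq s
  |> Seq.map (fun c -> Printf.sprintf "%02X" (Char.code c))
  |> List.of_seq
  |> String.concat " "

let () =
  Printf.printf "NFC: %s\n" (hex_bytes e_acute_nfc);  (* NFC: C3 A9 *)
  Printf.printf "NFD: %s\n" (hex_bytes e_acute_nfd)   (* NFD: 65 CC 81 *)
```

The two strings display identically but compare as different byte sequences, which is why a lexer has to settle on one form before comparing identifiers.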

Because of the ugly regexps, the lexer automaton gets significantly larger, but not excessively so:

  • Before: 252 states, 7726 transitions, table size 32416 bytes
  • After: 515 states, 30181 transitions, table size 123814 bytes

I would still prefer to have a better story to tell, e.g. all UTF-8 letters in identifiers, but it's a bigger change.

In identifiers and labels, we recognize the same Latin-1 letters as
before, but through their UTF8 encodings, not their ISO-Latin-1
encodings.

Accented letters are normalized to their NFD representation.
@gasche
Member

gasche commented Nov 11, 2022

This is a modest change: it makes mandating UTF8 less compatibility-breaking, and it solves a practical problem with the maintenance of existing non-English textbooks that relied on latin1 encoding. I must say that I like the idea of starting with (something like) this as a baby step towards proper Unicode identifiers.

@xavierleroy
Contributor Author

Reading the code again, I think it can be made a bit nicer (and avoid the big regexps) by using some UTF8 functions that were introduced in OCaml 4.14.
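As an illustration of what the OCaml 4.14 UTF-8 functions allow, the following sketch decodes a string with `String.get_utf_8_uchar` and checks each code point against a predicate instead of a byte-level regexp. It is not the PR's code: `is_latin1_letter` and `all_latin1_letters` are hypothetical names, the letter range is an assumption modeled on the Latin-1 supplement, and only precomposed (NFC) letters are handled.

```ocaml
(* Hypothetical predicate: ASCII letters, plus the letters of the Latin-1
   supplement U+00C0..U+00FF, excluding the multiplication sign (U+00D7)
   and the division sign (U+00F7). *)
let is_latin1_letter u =
  let c = Uchar.to_int u in
  (c >= 0x41 && c <= 0x5A)
  || (c >= 0x61 && c <= 0x7A)
  || (c >= 0xC0 && c <= 0xFF && c <> 0xD7 && c <> 0xF7)

(* True if [s] is valid UTF-8 and every code point satisfies the
   predicate above. *)
let all_latin1_letters s =
  let rec loop i =
    if i >= String.length s then true
    else
      let d = String.get_utf_8_uchar s i in
      Uchar.utf_decode_is_valid d
      && is_latin1_letter (Uchar.utf_decode_uchar d)
      && loop (i + Uchar.utf_decode_length d)
  in
  loop 0

let () =
  assert (all_latin1_letters "\u{E9}l\u{E9}phant");
  (* A raw Latin-1 byte is not valid UTF-8 and is rejected. *)
  assert (not (all_latin1_letters "\xE9"))
```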

In my mind, it's not clear what we should shoot for: a quick hack like this or a full implementation of Unicode identifiers as in UAX 31 and @whitequark's old prototype. Opinions welcome.

@dbuenzli
Contributor

I think solving this would be a good first step to normalise the situation. I suspect most people like me who write end-user apps in OCaml for non-English speaking users or long-form documentation in .mli and .mld files moved to UTF-8 encoded files a long time ago.

I note however that this does not check that files are UTF-8 encoded which would be good to have.
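A minimal sketch of such a check, assuming the whole source file has already been read into a string. `String.is_valid_utf_8` is in the standard library since OCaml 4.14; the function name `check_source_encoding` and the error message are made up for the illustration.

```ocaml
(* Hypothetical check: reject source text that is not valid UTF-8. *)
let check_source_encoding (contents : string) : (unit, string) result =
  if String.is_valid_utf_8 contents then Ok ()
  else Error "source is not valid UTF-8 (perhaps a Latin-1 file?)"

let () =
  (* UTF-8 encoded "été" passes. *)
  assert (check_source_encoding "let \u{E9}t\u{E9} = 1" = Ok ());
  (* The same identifier with raw Latin-1 bytes (0xE9) is rejected. *)
  assert (Result.is_error (check_source_encoding "let \xE9t\xE9 = 1"))
```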

A few comments.

> I felt obliged to normalize the UTF-8 encoding of accented letters (I chose NFD)

If the compiler does use String.capitalize_ascii or Char.uppercase in a meaningful way, NFD may actually be a requirement rather than a choice, unless a Unicode-aware case mapping is added.
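The point about ASCII case mapping can be seen in a small sketch (illustrative only, with the module name "école" chosen as an example): `String.capitalize_ascii` touches only the first byte, so it capitalizes the NFD spelling, whose first byte is a plain ASCII "e", but leaves the NFC spelling unchanged.

```ocaml
let nfd = "e\u{301}cole"   (* "école" with e + combining acute accent *)
let nfc = "\u{E9}cole"     (* "école" with precomposed U+00E9 *)

let () =
  (* NFD: the leading ASCII "e" becomes "E"; the accent is kept. *)
  assert (String.capitalize_ascii nfd = "E\u{301}cole");
  (* NFC: the first byte is 0xC3, not an ASCII letter, so nothing changes
     and we do not obtain "École" (which would start with U+00C9). *)
  assert (String.capitalize_ascii nfc = nfc);
  assert (String.capitalize_ascii nfc <> "\u{C9}cole")
```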

Which raises the question: what happens with the file éléphant.ml in the room? @whitequark has a section about file systems. You likely need to normalise the filenames of compilation units. Only Apple's file systems seem to guarantee you NFD. Other systems do nothing about it and likely give you NFC (the output of input methods).

Other than that, except for size and possible expectations in other tooling downstream (debuggers, profilers), I think the choice between NFD or NFC is mostly irrelevant.

> because different text editors save Unicode text in different forms: e.g. NFC for Emacs and most Linux utilities, and NFD for xcode, textedit, and everything made by Apple.

I'm surprised by this. I tried here and both TextEdit and Xcode give me NFC with a CH-FR input method.

> In my mind, it's not clear what we should shoot for: a quick hack like this or a full implementation of Unicode identifiers as in UAX 31 and @whitequark's old prototype. Opinions welcome.

Moving to checked UTF-8 encoded files with a compatibility story for latin1 users looks like low-hanging fruit that needs little Unicode machinery. I would rather aim for that first.
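One possible piece of such a compatibility story is a one-off transcoding of legacy files. The sketch below relies on the fact that each Latin-1 byte value equals the Unicode code point of the character it denotes; `latin1_to_utf8` is a hypothetical helper name, not an existing tool or compiler function.

```ocaml
(* Hypothetical helper: re-encode a Latin-1 (ISO 8859-1) string as UTF-8.
   Every Latin-1 byte maps to the code point with the same number. *)
let latin1_to_utf8 s =
  let b = Buffer.create (2 * String.length s) in
  String.iter (fun c -> Buffer.add_utf_8_uchar b (Uchar.of_char c)) s;
  Buffer.contents b

let () =
  (* "été" in Latin-1 (bytes E9 74 E9) becomes UTF-8 with U+00E9. *)
  assert (latin1_to_utf8 "\xE9t\xE9" = "\u{E9}t\u{E9}");
  (* Pure ASCII text is unchanged. *)
  assert (latin1_to_utf8 "let x = 1" = "let x = 1")
```

The output uses precomposed code points, so for the Latin-1 letters it is already in NFC.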

@xavierleroy
Contributor Author

Thanks for the comments @dbuenzli, your points are well taken. I remembered wrongly about macOS text editors favoring NFD over NFC. Concerning interactions with files (the éléphant in the pièce), @whitequark's analysis is interesting, but I need to run more experiments too.

@DemiMarie
Contributor

I strongly recommend NFC. IIUC NFD is widely considered to be a mistake.

@dbuenzli
Contributor

> I strongly recommend NFC. IIUC NFD is widely considered to be a mistake.

Would you care to elaborate?

Without proper out-of-band agreements, you have to be prepared for any, even mixed, forms on foreign input.

After that it's up to you to convert and use the form that is the easiest to work with for the task at hand internally.

There's no such thing as a "mistaken" normal form. However there may be mistaken programs making assumptions they shouldn't do on foreign inputs :-)
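To illustrate accepting mixed forms on input and converting to a single internal form, here is a deliberately tiny sketch that composes a base letter followed by a combining accent into its precomposed code point. It is not a real NFC implementation: the composition table covers only four pairs chosen for the example, and `compose` and `to_nfc_sketch` are hypothetical names. Invalid UTF-8 input is replaced by U+FFFD, because that is what `Uchar.utf_decode_uchar` returns for an invalid decode.

```ocaml
(* Hypothetical, deliberately incomplete composition table:
   (ASCII base letter, combining mark) -> precomposed code point. *)
let compose base mark =
  match base, mark with
  | 'e', 0x0301 -> Some 0xE9 (* é *)
  | 'e', 0x0300 -> Some 0xE8 (* è *)
  | 'a', 0x0300 -> Some 0xE0 (* à *)
  | 'E', 0x0301 -> Some 0xC9 (* É *)
  | _ -> None

let to_nfc_sketch s =
  let n = String.length s in
  let b = Buffer.create n in
  let rec loop i =
    if i < n then begin
      let d = String.get_utf_8_uchar s i in
      let u = Uchar.utf_decode_uchar d in
      let next = i + Uchar.utf_decode_length d in
      (* Look one code point ahead for a mark that composes with [u]. *)
      let composed =
        if Uchar.to_int u < 0x80 && next < n then
          let d' = String.get_utf_8_uchar s next in
          let mark = Uchar.to_int (Uchar.utf_decode_uchar d') in
          match compose (Uchar.to_char u) mark with
          | Some c -> Some (Uchar.of_int c, next + Uchar.utf_decode_length d')
          | None -> None
        else None
      in
      match composed with
      | Some (c, j) -> Buffer.add_utf_8_uchar b c; loop j
      | None -> Buffer.add_utf_8_uchar b u; loop next
    end
  in
  loop 0;
  Buffer.contents b

let () =
  (* Decomposed and mixed inputs converge on the same composed spelling. *)
  assert (to_nfc_sketch "e\u{301}le\u{301}phant" = "\u{E9}l\u{E9}phant");
  assert (to_nfc_sketch "\u{E9}le\u{301}phant" = "\u{E9}l\u{E9}phant")
```

A complete normalizer needs the full Unicode composition data and canonical ordering of combining marks, which this sketch does not attempt.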

@DemiMarie
Contributor

> > I strongly recommend NFC. IIUC NFD is widely considered to be a mistake.
>
> Would you care to elaborate?
>
> Without proper out-of-band agreements, you have to be prepared for any, even mixed, forms on foreign input.
>
> After that it's up to you to convert and use the form that is the easiest to work with for the task at hand internally.
>
> There's no such thing as a "mistaken" normal form. However there may be mistaken programs making assumptions they shouldn't do on foreign inputs :-)

Sorry, I meant that Apple’s use of NFD is considered to be a mistake, IIUC.

@xavierleroy
Contributor Author

Thanks for the feedback. The next iteration is here: #11736. It does use NFC instead of NFD, and tries to address the éléphant.ml problem.

4 participants