
Make the character set for OCaml source code officially UTF-8.#1802

Closed
pmetzger wants to merge 1 commit into ocaml:trunk from pmetzger:utf8lex

Conversation

@pmetzger
Member

@pmetzger pmetzger commented May 26, 2018

By doing this, editors and other tools can presume that UTF-8 is the encoding to use for OCaml source. I think this is an idea whose time has come. Latin-1 identifiers have now been deprecated with a warning for many years.

Note: I've made the Changes entry assuming this would NOT be a candidate for 4.07.

  1. lexer.mll: change to no longer recognize latin-1 in identifiers and the like. Previously this was accepted with a warning. UTF-8 encoding conflicts with Latin-1.
  2. lexer.mll: changes to validate that the file is valid UTF-8. This is done by recognizing a regular expression for valid UTF-8 characters in comments, strings, and quoted strings, and by assuring that character literals do not have a literal character with its eighth bit set.
  3. lex.etex: change documentation to state explicitly that the character set for source code is Unicode/UTF-8, but that only ASCII codepoints are permitted in source except for comments and string literals. Also note that character literals can't contain non-ASCII. Note in the identifier syntax that letters mean ASCII letters.
  4. Changes: Note the above has happened.
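The validation described in item 2 can be sketched at the byte level. The following is a simplified illustration (the function name is hypothetical, and it deliberately omits the rejection of overlong 3-byte forms and surrogate code points that a production lexer rule would also need):

```ocaml
(* Simplified sketch of UTF-8 validation: accept ASCII bytes and
   well-formed 2-, 3- and 4-byte sequences.  A production validator
   would also reject overlong 3-byte forms and surrogates. *)
let is_valid_utf_8 s =
  let n = String.length s in
  let byte i = Char.code s.[i] in
  (* A continuation byte matches 0b10xxxxxx. *)
  let cont i = i < n && byte i land 0xC0 = 0x80 in
  let rec go i =
    if i >= n then true
    else
      let b = byte i in
      if b < 0x80 then go (i + 1)                       (* ASCII *)
      else if b land 0xE0 = 0xC0 && b >= 0xC2 && cont (i + 1)
      then go (i + 2)                                   (* 2-byte sequence *)
      else if b land 0xF0 = 0xE0 && cont (i + 1) && cont (i + 2)
      then go (i + 3)                                   (* 3-byte sequence *)
      else if b land 0xF8 = 0xF0 && b <= 0xF4
              && cont (i + 1) && cont (i + 2) && cont (i + 3)
      then go (i + 4)                                   (* 4-byte sequence *)
      else false                                        (* stray byte *)
  in
  go 0
```

For instance, a bare Latin-1 é (the single byte 0xE9) is rejected, while its two-byte UTF-8 encoding (0xC3 0xA9) is accepted.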

Note that this set of changes is a strawman. I would like to know people's opinions on how to do this and which parts should be done which way.

Each of the following questions is followed by the likely available options for answers, and I'd appreciate hearing what people prefer:

  1. Should we be specifying UTF-8 as the encoding of OCaml source code files going forward? (Yes/No?)
  2. If (1) is yes, should we actually check that the files are valid UTF-8 and warn or error if not? If we check, should it be a warning or an error? (Don't Check/Warn/Error?)
  3. If (1) is yes, should we follow up the deprecation of Latin-1 identifier codepoints in 4.01 (that is, bare eighth bit set characters intended to be interpreted as ISO/IEC 8859-1) with removing them as valid in the lexer? (Remove/Leave the current deprecation warning?)
  4. If (3) is yes, should we change the regular expression for identifiers to accept Latin-1 letters encoded as UTF-8? (Yes/No?)

@nojb
Contributor

nojb commented May 26, 2018

By doing this, editors and other tools can presume that UTF-8 is the encoding to use for OCaml source.

Good editors can already auto-detect a file's encoding without any problem. And in any case this PR would not allow them to assume UTF-8, see next point.

… but that only ASCII codepoints are permitted in source except for comments and string literals.

This is not the case: the contents of comments and string literals are not UTF-8-validated, they are processed as raw bytes.

Also note that character literals can't contain codepoints over 255.

Characters in OCaml (i.e. the char type) represent bytes, not code points. Values of type string and bytes are sequences of bytes, do not have any specific encoding, and so they cannot be considered as sequences of code points.
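This byte-oriented view can be observed directly; as an illustration (not part of the PR), the character é written in a UTF-8 source file occupies two bytes, and indexing a string yields individual bytes:

```ocaml
(* OCaml strings are byte sequences, not sequences of code points. *)
let e_acute = "\xc3\xa9"         (* the UTF-8 encoding of U+00E9, é *)
let len = String.length e_acute  (* counts bytes, not characters: 2 *)
let b0 = Char.code e_acute.[0]   (* indexing yields a byte: 0xC3 *)
```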

@pmetzger
Member Author

Good editors can already auto-detect a file's encoding without any problem

And if you want to insert UTF-8 greek chars into a string but there's a Latin-1 identifier you have to use? It would be nice to have the question of what the encoding is settled. I think the clear long term goal has been to go to Unicode the way everything else has. Do you object in some specific way to that?

This is not the case: the contents of comments and string literals are not UTF-8-validated, they are processed as raw bytes.

They don't need to be validated in order for the policy to be that they're supposed to be UTF-8. The point is to set a policy and an intent, not to prevent people from cleverly deciding to include things that aren't actually unicode in their files. People can trick the compiler in all sorts of ways if they want to. Heck, they can use the Obj module to violate the type system, and yet we say that OCaml is strongly typed. If there's a desire to validate the UTF-8 someday, we could do that, but it isn't needed.

Characters in OCaml (i.e. the char type) represent bytes, not code points

This is a distinction without a difference. The documentation needs to explain that 'λ' remains an invalid character constant because a char can't store it, which is what my documentation change does. That is already true right now, by the way.

@gasche
Member

gasche commented May 27, 2018

Personally I think the change makes sense (and the implementation looks correct), but I guess that it would need to be discussed more broadly among maintainers to get consensus.

@pmetzger
Member Author

@gasche Indeed, I presumed that this would not be committed without a significant discussion among the maintainers. The implementation is small, but the agreement needed is broad.

@nojb
Contributor

nojb commented May 27, 2018

FTR, I am not against removing support of latin-1 identifiers. But giving the char type a Unicode-related meaning in the documentation just seems like a bad idea and conceptually confused.

@pmetzger
Member Author

pmetzger commented May 27, 2018

But giving the char type a Unicode-related meaning in the documentation just seems like a bad idea and conceptually confused.

The documentation change I made specifically says it does not have the ability to store arbitrary Unicode characters. That's the whole point, to indicate that character constants can't store more than eight bits, that even though you could write let c = 'ξ' in a file, that it would be a syntax error.

@shindere
Contributor

shindere commented May 28, 2018 via email

@alainfrisch
Contributor

alainfrisch commented May 28, 2018

(Note: codepoints greater than 255 are not permitted within character
literals, as the character type is eight bits for historical reasons.
The Uchar type in the standard library can store arbitrary
Unicode codepoints.)

This is very confusing, as it suggests that (1) codepoints between 128 and 255 can be represented in char,
and (2) that the char is used to represent a subset of codepoints. This is true, but only if the encoding is Latin-1. If we declare that source files are interpreted as being utf-8 encoded:

  • The literal 'é' should be rejected (even though the codepoint is <= 255).
  • The literal '\xe9' should be accepted (even though it does not utf8-encode any Unicode character)
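The distinction in these two bullets can be illustrated as follows (an example added for this write-up, not code from the PR): '\xe9' denotes the single byte 0xE9, while the UTF-8 encoding of é is the two-byte sequence 0xC3 0xA9 and therefore cannot fit in a char.

```ocaml
(* '\xe9' is a plain 8-bit value; the UTF-8 form of é is two bytes. *)
let byte_e9 = Char.code '\xe9'             (* 0xE9 = 233 *)
let utf8_e_acute = "\xc3\xa9"              (* UTF-8 for U+00E9 *)
let fits_in_char = String.length utf8_e_acute = 1   (* false: 2 bytes *)
```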

@dbuenzli
Contributor

dbuenzli commented May 28, 2018

I share the concerns of @alainfrisch, @shindere and @nojb. If you want to say something about this here (personally I'm not sure it's actually useful; I would just delete the note), the emphasis should be put on the fact that, for historical reasons, what OCaml calls a string is simply a sequence of bytes, and what it uses as atoms for strings is the char type, which represents a single byte and is what you get when you index a string.

Member

@Octachron Octachron left a comment


I agree with the idea of making the expected encoding for source files precise, and with the removal of Latin-1 identifiers. However, the additional documentation often seems counter-productive to me:

(Note: codepoints greater than 255 are not permitted within character
literals, as the character type is eight bits for historical reasons.
The Uchar type in the standard library can store arbitrary
Unicode codepoints.)
Member


I agree with @nojb , there is little point in mixing the 8-bit integer type char with unicode characters. I believe that this paragraph should be removed.

Member Author


See below. The issue to me is I don't want someone asking "but why is it that

let foo = "αβγδ"

works but

let foo = 'β'

does not?"

It needs to be said, even if the current phrasing is bad.

characters 223--246 and 248--255 as lowercase letters). This
feature is deprecated and should be avoided for future compatibility.
(Currently, letters consist of the 52 lowercase and uppercase
letters from the ASCII character set.)
Member


These parentheses do not seem warranted. I would also remove "Currently", which makes the sentence unnecessarily controversial.

Member Author


Easily fixed.

Member Author


Done.

@pmetzger
Member Author

Okay, so everyone seems angry about the text about char values. The problem I'm trying to address, in advance, is someone attempting to type

let foo = '⊕'

and having it fail and wondering why it is failing since they can type

let foo = "αβγδ"

just fine.

I'm happy to have whatever sort of explanation in the text people prefer for this, but the documentation does need to explain that it isn't possible to store an arbitrary Unicode codepoint in a char, that a char can only store something that, when converted to an integer, is between 0 and 255.

@pmetzger
Member Author

pmetzger commented May 28, 2018

Also, above, many people have said "char is an eight bit integer". But, char isn't really an eight bit integer type. One cannot type

let c : char = 27

as that is a type error. c is a "small" character (which used, in the old days, to be able to store an arbitrary character), not an eight bit integer. We can pretend it is an eight bit integer for many purposes, but you can't assign numbers to it, you have to assign things like 'a' to it.

OCaml is a strongly typed language. The fact that we know we can perform type puns of various kinds doesn't mean we should.

Arguably, "someday" we should:

  1. Have a real octet type (call it byte or octet or int8 or something)
  2. Slowly move away from the use of char if we have such a type, given that it can't store an arbitrary character any longer, and now we have Uchar for that purpose.

However, that "someday" doesn't seem like today. So for now, we need some language in the manual that explains the current situation is unexpected and violates a naive user's expectations.

@pmetzger
Member Author

@alainfrisch:

(Note: codepoints greater than 255 are not permitted within character literals, as the character type is eight bits for historical reasons. The Uchar type in the standard library can store arbitrary Unicode codepoints.)

This is very confusing, as it suggests that (1) codepoints between 128 and 255 can be represented in char, and (2) that the char is used to represent a subset of codepoints. This is true, but only if the encoding is Latin-1. If we declare that source files are interpreted as being utf-8 encoded:

The literal 'é' should be rejected (even though the codepoint is <= 255).
The literal '\xe9' should be accepted (even though it does not utf8-encode any Unicode character)

I'm not sure there is harm as such in accepting 'é' but I'm fine with making changes to the lexing/parsing code to prevent it if that's the consensus.

@dbuenzli
Contributor

dbuenzli commented May 28, 2018

Also, above, many people have said "char is an eight bit integer". But, char isn't really an eight bit integer type. One cannot type

It is, one can type let c : char = '\027'. Just think of it as special syntax for eight bit integer literals. Really, if you want to avoid violating end user expectations, that's the best model one can expose. Basically, chars are eight bit integers that can be specified either via decimal, octal or hexadecimal numbers or via their representative US-ASCII character code.
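The model described here can be spelled out in a small example (added for illustration): the escape forms and the ASCII character form are just different spellings of the same 8-bit value.

```ocaml
(* A char literal is special syntax for an 8-bit integer. *)
let esc = '\027'               (* decimal escape for code 27 (ESC) *)
let hex = '\x1b'               (* the same value, written in hex *)
let same = esc = hex           (* both denote the byte 27 *)
let round_trip = Char.chr 27 = esc   (* Char.chr/Char.code convert *)
```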

@pmetzger force-pushed the utf8lex branch 2 times, most recently from e870cfe to cf506e6 on May 28, 2018 15:00
@alainfrisch
Contributor

I'm not sure there is harm as such in accepting 'é'

There are two questions:

  • Whether to accept the literal when the source code contains the 0xE9 byte (i.e. the literal 'é', Latin-1 encoded). This is currently the case, and I'm not sure it's worth a dedicated warning to reject it: one should rather, perhaps, add a warning that checks that the entire source file is well encoded in UTF-8.

  • Whether to accept the literal 'é' from a well encoded utf-8 source file, producing the same value as the literal '\xe9'. I think it would be crazy to do so, if the goal is to move towards having all string values representing valid utf-8 encodings.
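For reference, this can be checked with the UTF-8 decoding API that later landed in the standard library (OCaml ≥ 4.14; an illustration, not code from the PR): the lone byte 0xE9 does not decode as UTF-8, while the two bytes 0xC3 0xA9 decode to U+00E9.

```ocaml
(* String.get_utf_8_uchar decodes one UTF-8 sequence at an index. *)
let invalid = String.get_utf_8_uchar "\xe9" 0       (* lone 0xE9 byte *)
let valid = String.get_utf_8_uchar "\xc3\xa9" 0     (* UTF-8 for é *)
let r1 = Uchar.utf_decode_is_valid invalid          (* false *)
let r2 = Uchar.utf_decode_is_valid valid            (* true *)
let cp = Uchar.to_int (Uchar.utf_decode_uchar valid)  (* 0xE9, i.e. U+00E9 *)
```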

@pmetzger
Member Author

All: I've simplified my documentation about valid character literals. Please tell me if this is more acceptable.

As for @alainfrisch and others stating we should check whether the input file is valid UTF-8, if that is desired it can of course be done.

@pmetzger
Member Author

I would be interested in hearing @xavierleroy's thoughts...

@xavierleroy
Contributor

I'll let others comment on how the char type and character literals should be documented.

My concern is with the removal of Latin-1 in identifiers, without any alternative being provided. I agree that production-quality code as well as collaborative projects must use ASCII identifiers, preferably meaningful in English. I do think that for teaching beginners it can make sense to use identifiers in the students' native language. There are more textbooks about Caml in French, German, Italian than in English (and none in non-Western European languages). Looking at my collection I see two French textbooks that use French identifiers with accents, one with French identifiers and no accents, and two with English identifiers. We should show some respect and some consideration for the pedagogical decisions taken by the authors of those books.

So, opening Pandora's box: what would it look like to support some Unicode characters in identifiers? I know Java has been doing this since 1995, and other languages followed suit, so by now there should be some kind of consensus on what Unicode to support in identifiers and how to do it.

@dbuenzli
Contributor

So, opening Pandora's box: what would it look like to support some Unicode characters in identifiers?

There is UAX 31, which deals with these matters. Note that to do it properly you'd still need to implement normalization in the compiler to test for identifier equivalence. You could avoid too much complexity by fixing a Unicode version to support for a few years, extracting the needed information from the Unicode character database and baking it into the OCaml repo. You could also do nothing about identifier equivalence; for example, XML does this (you need to match tags, which are allowed to contain Unicode characters, but those are not normalized, so a document can easily be made to look like valid XML with matching tags while not being so in practice).

@whitequark has done an implementation of Unicode identifiers for OCaml here along with an interesting section about technical details (e.g. interaction with the file system).
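The identifier-equivalence problem mentioned above can be made concrete with a small example (added for illustration): the visually identical strings "é" (precomposed U+00E9) and "e" followed by a combining acute accent (U+0301) have different byte sequences, so without normalization a lexer would treat them as distinct identifiers.

```ocaml
(* Two canonically equivalent spellings of é, as raw UTF-8 bytes. *)
let precomposed = "\xc3\xa9"    (* U+00E9, two bytes *)
let decomposed = "e\xcc\x81"    (* U+0065 U+0301, three bytes *)
let equal_as_bytes = String.equal precomposed decomposed  (* false *)
```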

@pmetzger
Member Author

@xavierleroy: I would suggest that The Right Thing might (might!) be to open Pandora's Box at some point and to permit Unicode identifiers, the use of Unicode math operators, etc., but as @dbuenzli points out, doing that change fully will require significant work.

I see this as an intermediate step along the path towards Unicode adoption. There's a lot more work to do than this, but this seems like a reasonable incremental change (which is why I proposed it.)

@pmetzger
Member Author

pmetzger commented May 30, 2018

@dbuenzli:

Note that to do it properly you'd still need to implement normalization in the compiler to test for identifier equivalence

Not that this is relevant to the current pull request, but would one approach there be to do Unicode normalization for identifiers in the lexical analyzer? (I know relatively little about these issues.)

@xavierleroy
Contributor

xavierleroy commented May 30, 2018

@dbuenzli: thanks for the pointers, reading all that will keep me silent for a while :-)

@whitequark: I didn't know about your efforts. After working on this internationalization stuff for several years, what are your conclusions? Would you recommend it for general use or is it too complicated for its own good?

@pmetzger: I'm still wary of removing a feature (Latin-1 accented letters in identifiers) that causes no harm, might be useful to a few users, and is not planned to come back in a more general form (Unicode "letters" in identifiers).

A reasonable compromise would be to move the part of the documentation that describes Latin-1 letters in identifiers to the "Language extensions" chapter, with a note that this feature is questionable and could go away soon, or be subsumed by a more general support for Unicode "letters" in identifiers.

@whitequark
Member

whitequark commented May 30, 2018

After working on this internationalization stuff for several years, what are your conclusions? Would you recommend it for general use or is it too complicated for its own good?

I was actually surprised by how compact and reliable that implementation is, after writing it. I expected the confusables issue to be much harder to tackle but UAX 31 has everything covered. It's certainly not what I would call "too complicated". In my opinion (and I'm not saying this just because I wrote it--I've killed many of my own projects because they didn't work well enough) m17n can be integrated into OCaml trunk more or less as-is.

I'm still a bit torn on (keyword) localization; it looks like people who use Latin supersets as alphabet don't care for it very much, but those who don't might want it. Since it can be experimented with without any modification of the core compiler (just expose the lexer keyword list in compiler-libs) it seems fine to let it bake a bit in the community and then decide.

@dra27
Member

dra27 commented May 30, 2018

Two thoughts:

  • Why not introduce the deprecation as a configure-time switch, a little like safe-string was? So the default in 4.08 or 4.09 disables latin1 but -enable-latin1-identifiers or something to configure will restore the current behaviour (still with the deprecation warning). It's not one we'd need to test continuously, as it only affects the lexer, but it would allow us to have an opam switch with it enabled and that in turn would allow us to detect approximately how often it's being used...
  • It feels morally odd to me to remove a deprecated feature without doing something which depends on its removal. To me it feels as though UTF-8 validation of string literals and comments should accompany the final removal of the latin1 identifiers? As it happens, Windows Console Unicode Support #1408 may very well soon propose a (correctly implemented) UTF-8 validator, although it'll be in C...

@pmetzger
Member Author

pmetzger commented May 30, 2018

@xavierleroy

I'm still wary of removing a feature (Latin-1 accented letters in identifiers) that causes no harm, might be useful to a few users, and is not planned to come back in a more general form

So, there are two questions here.

  1. What encoding shall OCaml source files be in? If we say it is UTF-8, we can no longer embed naked Latin-1 characters in the sources, since we've then removed support for the Latin-1 encoding. I think this is a step that needs to be taken, regardless. Unicode is now the way things are done, for good or ill.
  2. What characters shall be accepted as part of OCaml identifiers? If we've adopted Unicode and UTF-8, we can retain support for the set of Latin-1 characters we had in identifiers, but it would now require a Unicode-aware lexer, which we don't yet have, though building one seems feasible based on @whitequark's work. However, once we have that, only accepting the characters that were once in Latin-1 seems odd.

My proposal was to adopt UTF-8 for (1) while putting off a decision on (2). However, both could be decided on at once if that was the consensus.

(An aside on issue (2): Latin-1 characters in identifiers have generated a warning for some time now.)

@bluddy
Contributor

bluddy commented May 30, 2018

Regarding latin-1 identifiers or identifiers in foreign languages in general, with all due respect to OCaml's rich history, I think we definitely want to avoid such identifiers going forward. If I had to modify code with identifiers in a foreign language, I would have a very hard time making any progress. If I had to deal with identifiers I couldn't even reproduce with my keyboard, I'd probably just burn the program to the ground and start again. We don't want to encourage any of this IMO.

@pmetzger
Member Author

I should work on it again.

The problem here is that there are several decisions being conflated; there is too much to do in only one step. I think it will be easier if we take them one at a time. I think I'm going to split it into several parts, provided that everyone agrees.

The first would be a PR finalizing obsoleting the use of source code in ISO/IEC 8859-1 — this was started years ago (use of ISO/IEC 8859-1 identifiers already results in warnings.)

@bobzhang
Member

bobzhang commented May 1, 2021

FWIW, I think it is a great step to adopt utf8 for source encoding

@pmetzger
Member Author

pmetzger commented May 2, 2021

I've thought about this a great deal. The big problem with progressing this is that the naive approach to moving to UTF-8 would mean needing to remove the last support for accented ISO Latin 1 characters. This support was deprecated a long time ago, but it appears some people really don't want it gone. Unfortunately, UTF-8 requires the use of the high bit in some bytes. We thus have an entanglement between the two issues.

I had hoped that this could be done in a few steps and would mostly be a policy change, but that isn't really the case because of the resistance to completing the Latin 1 identifier deprecation.

I think the most straightforward way to fix the impasse is to have a unicode/UTF-8 aware lexical analyzer generator to replace the use of ocamllex. Then it would be possible to properly support identifiers with Latin 1 accents in a Unicode context, and the arguments about whether we should or shouldn't support such accents could be deferred to another time — that is, we could separate the issue of Unicode source code from the issue of removing support for Latin 1 accents.

Unfortunately, providing a new lexical analyzer generator is a sufficiently large task that I've been avoiding it for a while. About a week or two of work on sedlex might be enough, but that requires having a week or two to spend on it.

@DemiMarie
Contributor

At this point I think this is a necessary first step to fixing #10749 (a CVE-worthy security hole).

@gasche
Member

gasche commented Mar 10, 2023

The lexer changes of this PR are superseded by #11736. The documentation changes are independent and useful, but need to wait for proper UTF8 support in the compiler (subsuming latin1 features) before merging.

@jonahbeckford
Contributor

jonahbeckford commented Dec 11, 2023

I've thought about this a great deal. The big problem with progressing this is that the naive approach to moving to UTF-8 would mean needing to remove the last support for accented ISO Latin 1 characters. This support was deprecated a long time ago, but it appears some people really don't want it gone. Unfortunately, UTF-8 requires the use of the high bit in some bytes. We thus have an entanglement between the two issues.

Another approach is what Microsoft did to maintain backwards-compatibility for software written decades earlier: Use the BOM as the Unicode signature. Confer with https://en.wikipedia.org/wiki/Byte_order_mark

That would mean keeping the source code encoding as-is (whatever Latin1 variant/tweak of ISO-8859 OCaml supports today) so that all existing source code works but switching to Unicode mode with UTF-8 decoding when there is a BOM.
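A minimal sketch of the BOM-sniffing approach suggested here (the function name is hypothetical): treat a source file as UTF-8 when it begins with the UTF-8 byte order mark 0xEF 0xBB 0xBF, and fall back to the legacy encoding otherwise.

```ocaml
(* The UTF-8 BOM is the three-byte sequence 0xEF 0xBB 0xBF. *)
let utf_8_bom = "\xef\xbb\xbf"

(* Decide the source encoding by checking for a leading BOM. *)
let has_utf_8_bom s =
  String.length s >= 3 && String.sub s 0 3 = utf_8_bom
```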

(Secondary less important question: Out of curiosity, why is the UTF-8 encoding privileged? If a source code encoding can be unambiguously detected, which a BOM does for UTF-8 and UTF-16, and the OCaml standard library can decode it, then it seems natural to support Unicode source code in other encodings. I'm thinking specifically of Windows which requires effort+care to write UTF-8 files.)

EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
@pmetzger
Member Author

pmetzger commented Nov 3, 2024

Okay, is my understanding correct that:

  1. The compiler now accepts UTF-8 encoded characters, with identifiers currently accepted from the ISO Latin 9 character subset of Unicode.
  2. It would be okay to redo this pull request to document the above?

What is the state, btw, of Unicode characters being accepted in comments and strings?

@Octachron I am assuming you would be able to answer the above?

@pmetzger
Member Author

pmetzger commented Nov 3, 2024

So just to be clear, the documentation should reflect the following (independent) issues:

  1. What is the character set and encoding now accepted in OCaml source files (presumably now Unicode in the UTF-8 encoding.)
  2. Also, does the compiler now assure that there aren't invalidly encoded source files, that is, does it produce an error if a file is not valid UTF-8? This should also be documented.
  3. What characters are accepted in identifiers? (I believe that's now ISO Latin 9?)
  4. What characters are accepted in strings? What characters are accepted in comments? What characters are accepted in character literals?

@Octachron
Member

Octachron commented Nov 3, 2024

  1. OCaml identifiers are required to be utf-8 encoded latin-9 characters. It is recommended that the full source code is utf-8 encoded unicode.

    • As previously, there is no restriction on filenames as long as the source code does not need to refer to the corresponding non-valid identifiers (in other words, one can link 秋.cmx). Filenames are still recommended to be valid module identifiers.
  2. The compiler (lexer really) requires identifiers to be validly utf-8 encoded. The file itself can contain invalid utf-8 contents (inside string literals and comments).

  3. Latin-9

    • Same as before: the contents of string literals are not restricted in any way. However, outside of quoted strings, " and \ (aka \034 and \092) must be escaped; thus UTF-8 content is recommended.
    • Same as before: the contents of comments are not restricted (but they cannot contain *) outside of nested string literals).
    • Same as before: character literals are graphical representations of 8-bit integers.
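The escaping rule mentioned above can be illustrated as follows (an example added for this write-up): in an ordinary string literal the characters " and \ must be escaped, while a quoted string literal {|...|} takes its contents verbatim.

```ocaml
(* Ordinary string literals require escaping " and \ ;
   quoted string literals take their contents verbatim. *)
let ordinary = "a \"quoted\" word and a backslash \\"
let quoted = {|a "quoted" word and a backslash \|}
let same = String.equal ordinary quoted   (* identical contents *)
```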

@pmetzger
Member Author

pmetzger commented Nov 5, 2024

So, I think it might be sane, especially now that the identifier issue has been handled, to officially document that the encoding is UTF-8 and actually enforce that in the compiler. This will both remove ambiguity about the encoding and assure that mistakes that might lead to mysterious errors are caught early.

In that case, assuming that patches to enforce the encoding were added, we would document:

  1. That OCaml source files are encoded in UTF-8, and that programmers may rely on this.
  2. That the compiler enforces that the file must be valid UTF-8, producing appropriate errors if it is not.
  3. That OCaml identifiers must use only valid ISO Latin-9 alphabetical / accented alphabetical characters and the digits 0 to 9, as encoded in UTF-8.
  4. That strings may contain any valid Unicode character.
  5. That comments may contain any valid Unicode character.
  6. That character constants must be valid 7 bit ASCII or characters between 128 and 255 specified by an escape sequence.
  7. Should there be an input syntax for Uchars? (I'd say "not now.")

@Octachron
Member

Enforcing that OCaml source code are utf-8 valid can break existing user code and remove valid use cases.

Moreover, I don't see much ambiguity with "The strongly recommended encoding for OCaml source code is utf-8" [1] [2]. Similarly, what are the mysterious errors that you are mentioning?

Overall, I am not convinced that it is sensible to break backward compatibility more than necessary.

Footnotes

  1. using other ascii-compatible encodings in strings and comments is supported

  2. using other encodings in strings and comments is possible but not supported

@dbuenzli
Contributor

dbuenzli commented Nov 6, 2024

Overall, I am not convinced that it is sensible to break backward compatibility more than necessary.

I know we already had an "interesting" discussion about this, but let's make it clear again: it means that OCaml source code is its own, unique binary file format.

This means that it may break or confuse a lot of processes that assume they are dealing with Unicode text like line based processors, search engine indexers etc.

OCaml already has its own "interesting" way of doing line normalization, which means it needs special casing if you want to use any generic line-based text processing tools on OCaml sources. Now the idea that you can't run a simple UTF-8 decoder and re-encode the result to get the same source is another "interesting" twist. It seems like OCaml just wants to corner itself into textual peculiarities. You are just making it harder for everyone who wants to treat OCaml sources as text and apply generic textual procedures: text indexation, transferring OCaml sources over HTTP as text, or storing them in a UTF-8 database column.

If the valid use case is "I want my expectation tests to spit binary data in my sources" then there is a simple solution: these expectation tests should simply escape their binary data so that it becomes valid UTF-8 text. Having UTF-8 encoded data and binary data in the middle of your sources will likely confuse your editor anyways – if it doesn't even force you to save it with Unicode replacement characters where they found offending bytes.

@nojb
Contributor

nojb commented Nov 6, 2024

https://xkcd.com/1172/

@Octachron
Member

This means that it may break or confuse a lot of processes that assume they are dealing with Unicode text like line based processors, search engine indexers etc.

No process operating on text can ever assume that files are always correctly encoded. Similarly, no process working on OCaml source code can assume that its input is valid OCaml source code.

But yes, applying text processing to OCaml source code will work better on source code that is valid UTF-8, which seems like yet another reason for people to use UTF-8 OCaml source code, as officially recommended. I don't mind strengthening the wording to "OCaml's character set is officially UTF-8 (the parser may accept a larger binary format, so don't use the compiler as a UTF-8 validator)".

However, I still don't see any reason to go specifically out of our way to intentionally break (strange but) valid OCaml code before even updating the documentation about the character set.

@dbuenzli
Contributor

dbuenzli commented Nov 7, 2024

No processes operating on text can ever assume that files are always correctly encoded.

I'm not sure what this changes to the discussion.

Any process that requires a particular encoding will either bail out with an error (which is what the OCaml compiler should do) or silently replace decoding errors with the Unicode replacement character, thereby silently changing the semantics of your source or, if you are lucky, breaking it.

Personally I want to be able to transmit any OCaml source code as UTF-8 encoded plain text, store any OCaml source code in a UTF-8-encoded database column, store any OCaml source code in a field of a data format that advertises itself as UTF-8, edit, cut, and paste any OCaml source code in the dumbest UTF-8-aware editor or GUI text widget, and so on… without risking having its semantics silently changed or its source broken or… having to base64-encode it. That's what I expect from a language that advertises itself as having "an emphasis on safety".

The stance advocated here leads to the kind of experience where everything works in 99.9% of cases and fails in an obscure (and enraging) manner in the remaining cases. I think that trying to keep happy the people who have abused OCaml source code as a binary file format is a very wrong usability tradeoff and a disservice to the users of the language.

@nojb
Contributor

nojb commented Nov 7, 2024

I tend to agree with the point @dbuenzli is making. For what it's worth, this is how I would proceed:

  • First, document that OCaml source code must be valid UTF-8 and that the behaviour of the compiler on invalid UTF-8 is undefined.
  • Second, deprecate non-UTF-8 input by emitting a warning or alert in the lexer whenever comments and/or string literals contain invalid UTF-8, without otherwise changing the existing semantics.
  • Third, enforce valid UTF-8 by turning the warning into a hard error. This step would occur at least one release after the introduction of the warning.
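The check needed for the second step is straightforward in recent OCaml. A minimal sketch (my own illustration of the idea, not the actual lexer code; it assumes chunk-by-chunk validation and uses `String.is_valid_utf_8`, which has been in the standard library since OCaml 4.14):

```ocaml
(* Sketch: warn when a lexed chunk (comment or string literal)
   is not valid UTF-8, without rejecting the input. *)
let warn_if_invalid ~what ~line (chunk : string) =
  if not (String.is_valid_utf_8 chunk) then
    Printf.eprintf "Warning: %s at line %d is not valid UTF-8\n" what line

let () =
  warn_if_invalid ~what:"comment" ~line:1 "(* caf\xc3\xa9 *)";  (* UTF-8: ok *)
  warn_if_invalid ~what:"string literal" ~line:2 "caf\xe9"      (* Latin-1: warns *)
```

Because only a warning is emitted, the set of programs accepted by the lexer is unchanged, which matches the second step of the plan.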

@Octachron
Member

I am not opposed to this plan, as long as we agree that the documentation must precede any breaking changes.

@nojb
Contributor

nojb commented Nov 7, 2024

I am not opposed to this plan, as long as we agree that the documentation must precede any breaking changes.

I think everyone will agree to that!

@pmetzger I suggest you rework this PR as mentioned above: adapt the documentation to state that source code must be valid UTF-8 and that behaviour is undefined otherwise. If feeling ambitious, you could try implementing the warning as well, but without any actual change to what is accepted by the lexer.

@Octachron
Member

If feeling ambitious, you could try implementing the warning as well, but without any actual change to what is accepted by the lexer.

This should be done in another PR: the documentation must be backported to 5.3, while the potential warning will not be part of 5.3.

@nojb
Contributor

nojb commented Nov 7, 2024

If feeling ambitious, you could try implementing the warning as well, but without any actual change to what is accepted by the lexer.

This should be done in another PR: the documentation must be backported to 5.3, while the potential warning will not be part of 5.3.

Right, good point.

@Octachron
Member

With the documentation in #13668 merged, I have the impression that all the original content of this PR has been superseded by other PRs.
I am thus tentatively closing this PR.
Thanks everyone for the discussion over the years!

@Octachron Octachron closed this Feb 4, 2025
@pmetzger pmetzger deleted the utf8lex branch February 6, 2025 15:27