Make the character set for OCaml source code officially UTF-8. #1802
pmetzger wants to merge 1 commit into ocaml:trunk
Conversation
Good editors can already auto-detect a file's encoding without any problem. And in any case this PR would not allow them to assume UTF-8, see next point.
This is not the case: the contents of comments and string literals are not UTF-8-validated, they are processed as raw bytes.
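To illustrate this point concretely (a hedged sketch, not part of the PR; the byte values are chosen for illustration): OCaml string literals are plain byte sequences, so a literal that is not valid UTF-8 compiles without complaint.

```ocaml
(* OCaml strings are immutable byte sequences: the compiler does not check
   that a literal is well-formed UTF-8. Both of these compile fine. *)
let not_utf8 = "\xFF\xFE"   (* bytes 0xFF 0xFE can never appear in UTF-8 *)
let alpha    = "\xCE\xB1"   (* the two-byte UTF-8 encoding of U+03B1 (α) *)

let () =
  (* String.length counts bytes, not characters *)
  assert (String.length not_utf8 = 2);
  assert (String.length alpha = 2)
```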
Characters in OCaml (i.e. values of the `char` type) …
And if you want to insert UTF-8 Greek characters into a string, but there's a Latin-1 identifier you have to use? It would be nice to have the question of what the encoding is settled. I think the clear long-term goal has been to go to Unicode the way everything else has. Do you object in some specific way to that?
They don't need to be validated in order for the policy to be that they're supposed to be UTF-8. The point is to set a policy and an intent, not to prevent people from cleverly deciding to include things that aren't actually Unicode in their files. People can trick the compiler in all sorts of ways if they want to. Heck, they can use the …
This is a distinction without a difference. The documentation needs to explain that …
Personally I think the change makes sense (and the implementation looks correct), but I guess that it would need to be discussed more broadly among maintainers to get consensus.
@gasche Indeed, I presumed that this would not be committed without a significant discussion among the maintainers. The implementation is small, but the agreement needed is broad.
FTR, I am not against removing support for Latin-1 identifiers. But giving the …
The documentation change I made specifically says it does not have the ability to store arbitrary Unicode characters. That's the whole point: to indicate that character constants can't store more than eight bits, even though you could write …
Well, I don't know exactly what would be the best way to make things change, but I do think there is room for improvement here. The fact that e.g.

let ch = 'é'

gives rise to code that either compiles or does not compile according to how the file is stored looks very odd to me.
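A sketch of why this happens, written with explicit byte escapes so the example itself is encoding-independent (the names are illustrative):

```ocaml
(* In a Latin-1 file, 'é' is saved as the single byte 0xE9, which lexes as a
   valid char literal. In a UTF-8 file the same glyph is saved as the two
   bytes 0xC3 0xA9, which no longer fits in a one-byte char literal. *)
let latin1_e_acute = '\xE9'     (* what a Latin-1 editor writes for é *)
let utf8_e_acute = "\xC3\xA9"   (* what a UTF-8 editor writes for é *)

let () =
  assert (Char.code latin1_e_acute = 0xE9);
  assert (String.length utf8_e_acute = 2)
```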
This is very confusing, as it suggests that (1) codepoints between 128 and 255 can be represented in …
I share concerns similar to those of @alainfrisch, @shindere and @nojb. If you want to say something about this here (personally I'm not sure it's actually useful to say it here; I would just delete the note), the emphasis should be put on the fact that, for historical reasons, what OCaml calls a `char` is in fact a single byte.
Octachron left a comment
I agree with the idea of specifying the expected encoding for source files and with the removal of Latin-1 identifiers. However, the additional documentation often seems counter-productive to me:
manual/manual/refman/lex.etex
Outdated
(Note: codepoints greater than 255 are not permitted within character literals, as the character type is eight bits for historical reasons. The Uchar type in the standard library can store arbitrary Unicode codepoints.)
I agree with @nojb: there is little point in mixing the 8-bit integer type char with Unicode characters. I believe that this paragraph should be removed.
See below. The issue to me is I don't want someone asking: "but why is it that `let foo = "αβγδ"` works but `let foo = 'β'` does not?"
It needs to be said, even if the current phrasing is bad.
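One possible way to phrase the explanation, sketched as code; the byte counts shown assume a UTF-8-encoded source file, and the escapes are used so the example compiles regardless of how this file is saved:

```ocaml
(* A string literal is just a byte sequence, so multi-byte UTF-8 text fits in
   it. A char literal must denote exactly one byte, so the two-byte UTF-8
   encoding of β cannot appear in one. *)
let greek = "\xCE\xB1\xCE\xB2\xCE\xB3\xCE\xB4"  (* "αβγδ": 4 codepoints, 8 bytes *)

let () =
  assert (String.length greek = 8);
  assert (greek.[0] = '\xCE')   (* indexing a string yields raw bytes *)
```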
manual/manual/refman/lex.etex
Outdated
characters 223--246 and 248--255 as lowercase letters). This feature is deprecated and should be avoided for future compatibility. (Currently, letters consist of the 52 lowercase and uppercase letters from the ASCII character set.)
These parentheses do not seem warranted. I would also remove "Currently", which makes the sentence unnecessarily controversial.
Okay, so everyone seems angry about the text about `let foo = '⊕'` and having it fail, and is wondering why it fails when one can type `let foo = "αβγδ"` just fine. I'm happy to have whatever sort of explanation in the text people prefer for this, but the documentation does need to explain that it isn't possible to store an arbitrary Unicode codepoint in a `char`.
Also, above, several people have suggested that a `char` is effectively a small integer, but one cannot write `let c : char = 27`, as that is a syntax error. OCaml is a strongly typed language. The fact that we know we can perform type puns of various kinds doesn't mean we should. Arguably, "someday" we should:
However, that "someday" doesn't seem like today. So for now, we need some language in the manual that explains that the current situation is unexpected and violates a naive user's expectations.
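The manual text under discussion could be backed by something like this sketch, which contrasts the 8-bit `char` with the standard library's `Uchar.t`:

```ocaml
(* char holds exactly one byte (0..255); Uchar.t holds any Unicode scalar
   value. *)
let top_byte = Char.chr 255      (* fine: 255 fits in a char *)
let beta = Uchar.of_int 0x03B2   (* β as a Unicode codepoint *)

let () =
  assert (Char.code top_byte = 255);
  assert (Uchar.to_int beta = 0x03B2);
  (* anything above 255 is rejected for char *)
  assert (try ignore (Char.chr 0x03B2); false
          with Invalid_argument _ -> true)
```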
I'm not sure there is harm as such in accepting …
It is, one can type …
There are two questions:
All: I've simplified my documentation about valid character literals. Please tell me if this is more acceptable. As for @alainfrisch and others stating we should check whether the input file is valid UTF-8: if that is desired, it can of course be done.
I would be interested in hearing @xavierleroy's thoughts...
I'll let others comment on how the … My concern is with the removal of Latin-1 in identifiers, without any alternative being provided.

I agree that production-quality code as well as collaborative projects must use ASCII identifiers, preferably meaningful in English. I do think that for teaching beginners it can make sense to use identifiers in the students' native language. There are more textbooks about Caml in French, German, and Italian than in English (and none in non-Western European languages). Looking at my collection, I see two French textbooks that use French identifiers with accents, one with French identifiers and no accents, and two with English identifiers. We should show some respect and some consideration for the pedagogical decisions taken by the authors of those books.

So, opening Pandora's box: what would it look like to support some Unicode characters in identifiers? I know Java has been doing this since 1995, and other languages followed suit, so by now there should be some kind of consensus on what Unicode to support in identifiers and how to do it.
There is UAX 31, which deals with these matters. Note that to do it properly you'd still need to implement normalization in the compiler to test for identifier equivalence. You could avoid too much complexity by fixing a Unicode version that is supported for a few years, from which you could extract the information from the Unicode character database and bake it into the OCaml repo.

You could also do nothing about identifier equivalence; for example, XML does this (you need to match tags, which are allowed to contain Unicode characters, but those are not normalized, so you can easily confuse a document into appearing to be valid XML with matching tags when in practice it is not).

@whitequark has done an implementation of Unicode identifiers for OCaml here, along with an interesting section about technical details (e.g. interaction with the file system).
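A small sketch of the identifier-equivalence problem mentioned above: the same accented letter has two Unicode renderings (precomposed NFC vs. decomposed NFD), which are distinct byte strings even though they display identically:

```ocaml
(* "é" as one codepoint (NFC, U+00E9) vs. two (NFD, U+0065 + U+0301).
   Without normalization, identifiers spelled these two ways would be
   different names that look exactly alike on screen. *)
let nfc = "\xC3\xA9"      (* UTF-8 for U+00E9 *)
let nfd = "e\xCC\x81"     (* UTF-8 for U+0065 followed by U+0301 *)

let () =
  assert (nfc <> nfd);
  assert (String.length nfc = 2 && String.length nfd = 3)
```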
@xavierleroy: I would suggest that The Right Thing might (might!) be to open Pandora's Box at some point and to permit Unicode identifiers, the use of Unicode math operators, etc., but as @dbuenzli points out, doing that change fully will require significant work. I see this as an intermediate step along the path towards Unicode adoption. There's a lot more work to do than this, but this seems like a reasonable incremental change (which is why I proposed it).
Not that this is relevant to the current pull request, but would one approach there be to do Unicode normalization for identifiers in the lexical analyzer? (I know relatively little about these issues.) |
@dbuenzli: thanks for the pointers, reading all that will keep me silent for a while :-)

@whitequark: I didn't know about your efforts. After working on this internationalization stuff for several years, what are your conclusions? Would you recommend it for general use or is it too complicated for its own good?

@pmetzger: I'm still wary of removing a feature (Latin-1 accented letters in identifiers) that causes no harm, might be useful to a few users, and is not planned to come back in a more general form (Unicode "letters" in identifiers). A reasonable compromise would be to move the part of the documentation that describes Latin-1 letters in identifiers to the "Language extensions" chapter, with a note that this feature is questionable and could go away soon, or be subsumed by a more general support for Unicode "letters" in identifiers.
I was actually surprised by how compact and reliable that implementation is, after writing it. I expected the confusables issue to be much harder to tackle, but UAX 31 has everything covered. It's certainly not what I would call "too complicated". In my opinion (and I'm not saying this just because I wrote it; I've killed many of my own projects because they didn't work well enough), m17n can be integrated into OCaml trunk more or less as-is.

I'm still a bit torn on (keyword) localization; it looks like people who use Latin supersets as their alphabet don't care for it very much, but those who don't might want it. Since it can be experimented with without any modification of the core compiler (just expose the lexer keyword list in compiler-libs), it seems fine to let it bake a bit in the community and then decide.
Two thoughts:
So, there are two questions here.
My proposal was to adopt UTF-8 for (1) while putting off a decision on (2). However, both could be decided on at once if that was the consensus. (An aside on issue (2): Latin-1 characters in identifiers have generated a warning for some time now.)
Regarding Latin-1 identifiers, or identifiers in foreign languages in general: with all due respect to OCaml's rich history, I think we definitely want to avoid such identifiers going forward. If I had to modify code with identifiers in a foreign language, I would have a very hard time making any progress. If I had to deal with identifiers I couldn't even reproduce with my keyboard, I'd probably just burn the program to the ground and start again. We don't want to encourage any of this IMO.
I should work on it again. The problem here is that there are several decisions being conflated; there is too much to do in only one step. I think it will be easier if we take them one at a time, so I'm going to split this into several parts, provided that everyone agrees. The first would be a PR completing the deprecation of ISO/IEC 8859-1 source code; this was started years ago (use of ISO/IEC 8859-1 identifiers already results in warnings).
FWIW, I think it is a great step to adopt UTF-8 for the source encoding.
I've thought about this a great deal. The big problem with progressing this is that the naive approach to moving to UTF-8 would mean needing to remove the last support for accented ISO Latin-1 characters. This support was deprecated a long time ago, but it appears some people really don't want it gone. Unfortunately, UTF-8 requires the use of the high bit in some bytes, so we have an entanglement between the two issues. I had hoped that this could be done in a few steps and would mostly be a policy change, but that isn't really the case because of the resistance to completing the Latin-1 identifier deprecation.

I think the most straightforward way to fix the impasse is to have a Unicode/UTF-8 aware lexical analyzer generator to replace the use of `ocamllex`. Unfortunately, providing a new lexical analyzer generator is a sufficiently large task that I've been avoiding it for a while. About a week or two of work on …
At this point I think this is a necessary first step to fixing #10749 (a CVE-worthy security hole).
The lexer changes of this PR are superseded by #11736. The documentation changes are independent and useful, but they need to wait for proper UTF-8 support in the compiler (subsuming the Latin-1 features) before merging.
Another approach is what Microsoft did to maintain backwards compatibility for software written decades earlier: use the BOM as the Unicode signature. See https://en.wikipedia.org/wiki/Byte_order_mark

That would mean keeping the source code encoding as-is (whatever Latin-1 variant/tweak of ISO-8859 OCaml supports today), so that all existing source code works, but switching to Unicode mode with UTF-8 decoding when there is a BOM.

(Secondary, less important question: out of curiosity, why is the UTF-8 encoding privileged? If a source code encoding can be unambiguously detected, which a BOM does for UTF-8 and UTF-16, and the OCaml standard library can decode it, then it seems natural to support Unicode source code in other encodings. I'm thinking specifically of Windows, which requires effort and care to write UTF-8 files.)
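A sketch of the BOM-sniffing scheme described above; this is not something the OCaml compiler does, and the helper name is made up for illustration:

```ocaml
(* The UTF-8 BOM is the three bytes EF BB BF. Under the proposed scheme, a
   leading BOM would switch the lexer into UTF-8 mode; otherwise the file
   would be read using the legacy Latin-1 assumptions. *)
let utf8_bom = "\xEF\xBB\xBF"

let has_utf8_bom source =
  String.length source >= 3 && String.sub source 0 3 = utf8_bom

let () =
  assert (has_utf8_bom (utf8_bom ^ "let x = 1"));
  assert (not (has_utf8_bom "let x = 1"))
```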
Okay, is my understanding correct that:
What is the state, btw, of Unicode characters being accepted in comments and strings? @Octachron I am assuming you would be able to answer the above?
So just to be clear, the documentation should reflect the following (independent) issues:
So, I think it might be sane, especially now that the identifier issue has been handled, to officially document that the encoding is UTF-8 and actually enforce that in the compiler. This will both remove ambiguity about the encoding and ensure that mistakes that might lead to mysterious errors are caught early. In that case, assuming that patches to enforce the encoding were added, we would document:
Enforcing that OCaml source code is valid UTF-8 can break existing user code and remove valid use cases. Moreover, I don't see much ambiguity in "The strongly recommended encoding for OCaml source code is UTF-8". Similarly, what are the mysterious errors that you are mentioning? Overall, I am not convinced that it is sensible to break backward compatibility more than necessary.
I know we already had an "interesting" discussion about this, but let's make it clear again: it means that OCaml source code is its own, unique, binary file format. This means that it may break or confuse a lot of processes that assume they are dealing with Unicode text, like line-based processors, search engine indexers, etc.

OCaml already has its own "interesting" way of doing line normalization, which means it needs special casing if you want to use any generic line-based text processing tools on OCaml sources. Now the idea that you can't run a simple UTF-8 decoder and re-encode the result to get the same source is another "interesting" twist. It seems like OCaml just wants to corner itself into textual peculiarities. You are just making it harder for everyone who wants to treat OCaml sources as text and apply generic textual procedures like text indexation, or transfer OCaml sources over HTTP as text, or store them in a UTF-8 database column.

If the valid use case is "I want my expectation tests to spit binary data into my sources", then there is a simple solution: these expectation tests should simply escape their binary data so that it becomes valid UTF-8 text. Having UTF-8 encoded data and binary data in the middle of your sources will likely confuse your editor anyway, if it doesn't outright force you to save the file with Unicode replacement characters where it found offending bytes.
No process operating on text can ever assume that its input files are always correctly encoded. Similarly, no process working on OCaml source code can assume that its input is valid OCaml source code. But yes, applying text processing to OCaml source code will work better on source code that is valid UTF-8, which seems yet another reason for people to use UTF-8 OCaml source code as officially recommended.

I don't mind strengthening the wording to "OCaml's character set is officially UTF-8 (the parser may recognize a larger binary format, so don't use the compiler as a UTF-8 validator)". However, I still don't see any reason to go specifically out of our way to intentionally break (strange) valid OCaml code before even updating the documentation about the charset.
I'm not sure what this changes to the discussion. Any process that requires a particular encoding will either bail out with an error (what the OCaml compiler should do) or silently replace decoding errors with the Unicode replacement character, thereby silently changing the semantics of your source or, if you are lucky, breaking it.

Personally I want to be able to transmit any OCaml source code as UTF-8 encoded plain text, store any OCaml source code in a UTF-8 encoded database column, store any OCaml source code in the field of a data format that advertises itself as UTF-8, edit, cut, and paste any OCaml source code in the dumbest UTF-8 aware editor or GUI text widget, and so on, without risking having its semantics silently changed or its source broken, and without having to base64-encode it. That's what I expect from a language that advertises itself as having "an emphasis on safety".

The stance advocated here leads to those experiences where everything works in 99.9% of cases and fails in an obscure (and enraging) manner in the rest. I think that trying to keep happy the people who abused OCaml's source code as a binary file format is a very wrong usability tradeoff and a disservice to the users of the language.
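For concreteness, here is a rough sketch of what "reject sources that are not valid UTF-8" could mean. This is an illustrative validator written for this discussion, not the compiler's actual behaviour; it checks byte-sequence shape (lead/continuation bytes, overlong forms, surrogates, codepoints above U+10FFFF).

```ocaml
(* Returns true iff s is a well-formed UTF-8 byte sequence. *)
let is_valid_utf8 (s : string) =
  let n = String.length s in
  let byte i = Char.code s.[i] in
  let rec go i =
    if i >= n then true
    else
      let b = byte i in
      if b < 0x80 then go (i + 1)                 (* ASCII *)
      else if b < 0xC2 then false                 (* stray continuation / overlong *)
      else
        let len =
          if b < 0xE0 then 2 else if b < 0xF0 then 3
          else if b < 0xF5 then 4 else 0          (* 0xF5..0xFF: invalid lead *)
        in
        len <> 0 && i + len <= n
        && (let ok = ref true in
            for j = i + 1 to i + len - 1 do
              if byte j land 0xC0 <> 0x80 then ok := false
            done;
            (* reject overlong 3/4-byte forms, surrogates, > U+10FFFF *)
            if b = 0xE0 && byte (i + 1) < 0xA0 then ok := false;
            if b = 0xED && byte (i + 1) > 0x9F then ok := false;
            if b = 0xF0 && byte (i + 1) < 0x90 then ok := false;
            if b = 0xF4 && byte (i + 1) > 0x8F then ok := false;
            !ok)
        && go (i + len)
  in
  go 0

let () =
  assert (is_valid_utf8 "let x = \xCE\xB1");  (* ASCII plus UTF-8 α *)
  assert (not (is_valid_utf8 "\xE9"))         (* a lone Latin-1 é byte *)
```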
I tend to agree with the point that @dbuenzli is making. For what it's worth, this is how I would proceed:
I am not opposed to this plan, as long as we agree that the documentation must precede any breaking changes.
I think everyone will agree to that! @pmetzger I suggest you rework this PR as mentioned above: adapt the documentation to state that source code must be valid UTF-8 and that behaviour is undefined otherwise. If feeling ambitious, you could try implementing the warning as well, but without any actual change to what is accepted by the lexer.
This should be done in another PR: the documentation must be backported to 5.3, while the potential warning will not be part of 5.3.
Right, good point.
With the documentation in #13668 merged, I have the impression that all the original content of this PR has been superseded by other PRs.
By doing this, editors and other tools can presume that UTF-8 is the encoding to use for OCaml source. I think this is an idea whose time has come. Latin-1 identifiers have now been deprecated with a warning for many years.
Note: I've made the `Changes` entry assuming this would NOT be a candidate for 4.07.

Note that this set of changes is a strawman. I would like to know people's opinions on how to do this and which parts should be done which way.
Each of the following question is followed by the likely available options for answers, and I'd appreciate hearing what people prefer: