Toplevel: only escapes bytes and not strings#1231
|
I like some aspects of the change (the idea that unicode letters are printed back) but not others: not escaping the whitespace characters leads to a loss of readability, for example. Of course, sometimes people write string literals with newlines in them, and escaping them instead hurts readability. I have mixed feelings about the automatic choice of Have people studied the problem of "what is the right choice of which characters to print and which characters to escape" before, and are there solutions that do not require more unicode knowledge than available in the OCaml standard library? (Would we want this behavior to depend on the current user locale? In general it seems that people push for locale-independence these days.) |
|
I'm not really fond of the choices made by this PR. These would be my suggestions:
This is a user interface not an API, as such I think it would be legitimate to depend on the user locale. |
|
Concerning points 1 and 2, escaping characters in the C0 control character code set and Concerning point 3 (and 4), I am not completely convinced because it seems much simpler to simply not escape bytes ≥ 0x7F. By doing so, we would keep some compatibility with latin-1 and JIS users at the only cost (for utf-8 users) of not escaping control characters in the C1 code set (in particular NEL) and the more exotic LS or PS newlines (on the other hand, it would even give a workaround for users that really, really want non-escaped newlines). I agree with point 5, but using hexadecimal in escape sequences does not seem to particularly concern the toplevel, and I think this should be changed globally. |
This kind of fortunate coincidence argument doesn't make sense to me. We have been trying to push for some time now for a model where OCaml strings should be UTF-8 encoded. I will let the dev team determine if they find it important that the toplevel is still able to function in a 7-bit environment. But I really think that we should have at least an |
|
I have updated this to escape C0 control characters and string delimiters; in other words, quoted string
# "серафими\t\"многоꙮчитїи\"";;
- : string = "серафими\t\"многоꙮчитїи\""
@dbuenzli, I am not sure what your proposed Anyway, I personally don't dislike the fortunate coincidence that users get back in the toplevel the same string they submitted as input (except for control characters that indeed should not take control of the toplevel printing). |
|
So @pqwy who studied the problem in the context of his
No if
Indeed, in a non Unicode aware terminal this user wouldn't be able to input UTF-8, so she would e.g. input |
I agree that being able to reactivate the escaping of bytes > 0x7E is an undeniable improvement.
Well, |
But in that case the user would leave |
|
I have added the Note that I have also deleted the paragraph in the ocaml man page about |
FTR @xavierleroy removed that from the manual in 5d385f9. I think this may have gone away with the resolution by @damiendoligez of MPR6521 in e60a2db. |
| if isneg then pp_print_char ppf ')' | ||
|
|
||
| (** Escape only C0 control characters (bytes <= 0x1F) and '"' *) | ||
| let print_out_string ppf s = |
0x7F, a.k.a. DEL, is a control character and needs escaping too.
|
This discussion makes me feel younger, because we had pretty much the same discussion back in the early 1990s when Latin-1 support was added to Caml Light and not all environments would support characters above 0x80... Escaping control characters is absolutely necessary, and not only to display TAB, CR and LF meaningfully. For example, if terminal escape sequences are printed verbatim, the display can be completely messed up. A desirable property of the toplevel value printer is that the output, once fed back into the toplevel by cut-and-paste, should parse back to the same value (as much as possible). I think it is the case in the latest incarnation of this PR, but make sure it is. |
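The escaping rule discussed above can be sketched as follows. This is only an illustration, not the code merged in this PR: `escape_for_toplevel` is a hypothetical name, and the choice of decimal escapes for the remaining control characters is an assumption. It escapes C0 control characters, DEL, the double quote and the backslash, and leaves bytes ≥ 0x80 untouched so that UTF-8 text is printed back as is.

```ocaml
(* Sketch only: escape C0 controls, DEL, '"' and '\\';
   copy every other byte (including bytes >= 0x80) unchanged. *)
let escape_for_toplevel (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      match c with
      | '"' -> Buffer.add_string buf "\\\""
      | '\\' -> Buffer.add_string buf "\\\\"
      | '\n' -> Buffer.add_string buf "\\n"
      | '\t' -> Buffer.add_string buf "\\t"
      | '\r' -> Buffer.add_string buf "\\r"
      | '\b' -> Buffer.add_string buf "\\b"
      | '\x00' .. '\x1F' | '\x7F' ->
          (* Remaining control characters: three-digit decimal escape. *)
          Buffer.add_string buf (Printf.sprintf "\\%03d" (Char.code c))
      | _ -> Buffer.add_char buf c)
    s;
  Buffer.contents buf
```

Because the backslash and the double quote are escaped too, the output of such a function, wrapped in double quotes, should parse back to the original string, which is the cut-and-paste property asked for above.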
|
Isn't this going to mess up the formatting? |
It will.
It's worse than that. It's a problem that is impossible to solve without being able to interact with the rendering layer to measure how many cells your UTF-8 encoded string is going to span when rendered -- something no terminal out there will provide you. You can perform some kind of best effort formatting using either |
|
Format output will get messy since Format will tend to overestimate the length of the graphical representation of strings. Some examples, first in Greek:
# String.split_on_char ' ' "Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος οὐλομένην, ἣ μυρί᾿ Ἀχαιοῖς ἄλγε᾿ ἔθηκε, πολλὰς δ᾿ ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν οἰωνοῖσί τε πᾶσι· Διὸς δ᾿ ἐτελείετο βουλή ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.";;
- : string list =
["Μῆνιν"; "ἄειδε"; "θεὰ"; "Πηληϊάδεω";
"Ἀχιλῆος"; "οὐλομένην,"; "ἣ"; "μυρί᾿";
"Ἀχαιοῖς"; "ἄλγε᾿"; "ἔθηκε,"; "πολλὰς";
"δ᾿"; "ἰφθίμους"; "ψυχὰς"; "Ἄϊδι";
"προΐαψεν"; "ἡρώων,"; "αὐτοὺς"; "δὲ";
"ἑλώρια"; "τεῦχε"; "κύνεσσιν"; "οἰωνοῖσί";
"τε"; "πᾶσι·"; "Διὸς"; "δ᾿"; "ἐτελείετο";
"βουλή"; "ἐξ"; "οὗ"; "δὴ"; "τὰ"; "πρῶτα";
"διαστήτην"; "ἐρίσαντε"; "Ἀτρεΐδης"; "τε";
"ἄναξ"; "ἀνδρῶν"; "καὶ"; "δῖος";
"Ἀχιλλεύς."]
Compared to the English version:
# String.split_on_char ' ' "Achilles sing, O Goddess! Peleus' son; His wrath pernicious, who ten thousand woes Caused to Achaia's host, sent many a soul Illustrious into Ades premature, And Heroes gave (so stood the will of Jove) To dogs and to all ravening fowls a prey, When fierce dispute had separated once The noble Chief Achilles from the son Of Atreus, Agamemnon, King of men.";;
- : string list =
["Achilles"; "sing,"; "O"; "Goddess!"; "Peleus'"; "son;"; "His"; "wrath";
"pernicious,"; "who"; "ten"; "thousand"; "woes"; "Caused"; "to"; "Achaia's";
"host,"; "sent"; "many"; "a"; "soul"; "Illustrious"; "into"; "Ades";
"premature,"; "And"; "Heroes"; "gave"; "(so"; "stood"; "the"; "will"; "of";
"Jove)"; "To"; "dogs"; "and"; "to"; "all"; "ravening"; "fowls"; "a";
"prey,"; "When"; "fierce"; "dispute"; "had"; "separated"; "once"; "The";
"noble"; "Chief"; "Achilles"; "from"; "the"; "son"; "Of"; "Atreus,";
"Agamemnon,"; "King"; "of"; "men."]
Similarly with Japanese:
# [
"こぬ人を";
"まつほの浦の";
"夕なぎに";
"やくやもしほの";
"身もこがれつつ"
];;
- : string list =
["こぬ人を"; "まつほの浦の"; "夕なぎに";
"やくやもしほの"; "身もこがれつつ"]
or Sanskrit:
# [
"अग्निमीळे"; "पुरोहितं"; "यज्ञस्य"; "देवं रत्वीजम";
"होतारं"; "रत्नधातमम";
"अग्निः"; "पूर्वेभिर्र्षिभिरीड्यो"; "नूतनैरुत";
"स"; "देवानेह"; "वक्षति"
];;
- : string list =
["अग्निमीळे"; "पुरोहितं";
"यज्ञस्य"; "देवं रत्वीजम";
"होतारं"; "रत्नधातमम"; "अग्निः";
"पूर्वेभिर्र्षिभिरीड्यो";
"नूतनैरुत"; "स"; "देवानेह";
"वक्षति"] |
|
On the other hand, all of these examples are completely unreadable with the current pretty-printing scheme, so the output you show (even if formatted a bit weirdly compared to the English version) is a strong improvement. Given that this only affects the toplevel output (and not calls to Format in user programs), I believe that not having a general solution to length formatting is fine. |
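To make the overestimation concrete, here is a small sketch comparing the byte length that Format uses by default to size a string with the number of Unicode scalar values it contains. It assumes OCaml 4.14 or later for `String.get_utf_8_uchar`; `utf8_scalar_count` is a hypothetical helper, not something from this PR. Note that the scalar count is still not the display width: wide CJK characters and combining marks make the number of terminal cells differ again.

```ocaml
(* Sketch: count the Unicode scalar values of a UTF-8 encoded string.
   Requires OCaml >= 4.14. Invalid bytes are counted as one decode each. *)
let utf8_scalar_count (s : string) : int =
  let rec loop i n =
    if i >= String.length s then n
    else
      let d = String.get_utf_8_uchar s i in
      (* utf_decode_length is always strictly positive, so the loop ends. *)
      loop (i + Uchar.utf_decode_length d) (n + 1)
  in
  loop 0 0

let () =
  (* "κόσμος" written with decimal escapes: 12 bytes, 6 scalar values. *)
  let cosmos = "\206\186\207\140\207\131\206\188\206\191\207\130" in
  Printf.printf "%d bytes, %d scalar values\n"
    (String.length cosmos) (utf8_scalar_count cosmos)
```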
|
I think that Rust’s approach is best long-term one: strings must be in UTF-8 and are immutable. Creating a string that is not valid UTF-8 is undefined behavior. |
|
Also, if I'm reading the code correctly, backslash is not escaped... Why not base your implementation on that of |
They were not? At least not during my tests? (https://github.com/ocaml/ocaml/blob/trunk/stdlib/char.ml#L29)
Of this, I am atrociously guilty. I should have added a test on the testsuite covering the whole ascii range.
As wished, I have reimplemented the string escape in the style of Bytes.escaped. |
|
My comment about CR LF and co was based on a wrong reading of the code, just ignore it and apologies about this. |
xavierleroy
left a comment
Looks very good to me, with a bit of LaTeX tweaking recommended.
Changes
| (Tadeu Zagallo, review by David Allsopp) | ||
|
|
||
| - GPR#1231: improved printing of unicode texts in the toplevel, | ||
| when OCAMLTOP_UTF_8 is not set to false. |
"unless OCAMLTOP_UTF_8 is set to false" would read better, I think.
manual/manual/cmds/top.etex
| The following environment variables are also consulted: | ||
| \begin{options} | ||
| \item["OCAMLTOP_UTF_8"] When printing string values, non-ascii bytes | ||
| (>0x7E) are printed as decimal escape sequence if "OCAMLTOP_UTF_8" is |
(>0x7E) will format poorly in LaTeX, with the > character rendered as upside-down question mark or some such. Please format as a proper math formula:
($ {} > "\0x7E" $)
17b77ac to
3fb6deb
Compare
|
I remarked while reading the code that the "max string length" parameter of the Oval_string node may not be respected, given that escaping (done after this test) may increase the length. However, (1) previous implementations and the Bytes codepath also suffer from this issue and (2) this parameter is currently not fixed by the user (then it would be nice to respect it) but by the Genprintval recursion-depth control code, so it sounds reasonable to only respect it approximately. |
|
@Octachron I think you should feel free to rebase (if you want to squash some of the intermediary commits) and merge. |
3fb6deb to
2e6a78a
Compare
Escaping strings when printing them in the toplevel has the disadvantage of mangling unicode text:
```
# "한글";;
- : string = "\237\149\156\234\184\128"
```
With this commit, strings are no longer escaped, contrary to bytes:
```
# let cosmos = "κόσμος";;
val cosmos : string = "κόσμος"
# Bytes.of_string cosmos;;
- : bytes = Bytes.of_string "\206\186\207\140\207\131\206\188\206\191\207\130"
```
This new behavior can be disabled dynamically by setting the environment variable OCAMLTOP_UTF_8 to false.
This change is not solely aesthetic: the mangling of unicode strings may contribute to the impression of some OCaml newcomers that OCaml has no support for unicode.
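As an illustration of the opt-out described in the commit message, the following sketch reads the environment variable and decides whether strings should be printed unescaped. `utf8_printing_enabled` is a hypothetical helper; treating only the exact value `false` as the opt-out, and every other value or an unset variable as enabled, is an assumption and may differ from what the toplevel actually does.

```ocaml
(* Sketch only: interpret the value of OCAMLTOP_UTF_8.
   Assumption: only the exact string "false" disables UTF-8 printing. *)
let utf8_printing_enabled (value : string option) : bool =
  match value with
  | Some "false" -> false
  | Some _ | None -> true

let () =
  let enabled = utf8_printing_enabled (Sys.getenv_opt "OCAMLTOP_UTF_8") in
  Printf.printf "UTF-8 printing enabled: %b\n" enabled
```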
|
Squashed to a single commit and merged. |
|
There is another place where a similar treatment of strings may be a good idea, namely |
|
Do we want to consider relaxing the escaping in %S? The semantic specification, as I understand it, is to print strings in the way that they can be parsed back as OCaml literals. Maybe we can assume that unicode strings are parsed back as OCaml literals in a unicode file, so that non-escaping is justified? |
|
(On the other hand, printing unicode on Windows seems to still be an open problem, so maybe escaping is not that bad?) |
|
IMO |
|
I was surprised just now by the change in escaping behaviour introduced by this PR. Running OCaml in an Emacs subshell, I see this: whereas before OCaml 4.06 I see this: What's happening: OCaml used to escape characters above 0x80 when printing strings, and printed them using the same decimal format used in string literals. However, since this PR such characters are now printed directly to the terminal instead. Since they can't be displayed directly, the terminal emulator prints them as octal escapes. |
|
I agree the octal escape is confusing. At least, Emacs colors the |
|
Xavier Leroy (2019/01/08 09:31 -0800):
I agree the octal escape is confusing. At least, Emacs colors the
`\377` in red, doesn't it? So, that gives a hint to the (experienced)
Emacs user...
You mean experienced and sighted, right?
|
|
Emacs does indeed colour the |
|
@shindere: while the colour hint is only useful for sighted users, I wonder whether users who don't rely on the visual interface avoid the confusing issue in the first place. For example, |
|
Well unfortunately not. I am using a braille display which does not
really have a way to convey the same kind of information, but it's great
to read that emacspeak does such a great job, so many thanks for having
reported this back!
|
Co-authored-by: Sabine Schmaltz <sabine@tarides.com>
Escaping strings when printing them in the toplevel has the disadvantage of making unicode text unreadable:
This PR proposes to escape only bytes and not strings:
This change is not purely aesthetic: the mangling of unicode strings may contribute to the impression of some OCaml newcomers that OCaml has no support for unicode (and being able to read the corresponding strings in the toplevel benefits all users who cannot fluently read raw unicode code points).
On a more technical note, in presence of string delimiters inside the printed string, the best string delimiters still available are used:
In order of preference, the string delimiters used are
", {|, {t|, {top|, {toplevel|, and in the worst case scenario {toplevel%d|:
With this scheme, all strings should be printable as valid string literals.
One disadvantage of this approach is that newlines and tabs are not escaped anymore
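The delimiter selection described above could be sketched like this. It is an illustration under stated assumptions rather than the PR's code: `contains_sub` and `quote` are hypothetical names, a plain `"` is assumed to be usable whenever the string contains no double quote, a quoted-string identifier is assumed to be usable whenever its closing sequence `|id}` does not occur in the string, and the numbered `{toplevel%d|` fallback is left out (the sketch returns `None` instead).

```ocaml
(* Naive substring test; sufficient for a sketch. *)
let contains_sub (s : string) (sub : string) : bool =
  let n = String.length s and m = String.length sub in
  let rec at i = i + m <= n && (String.sub s i m = sub || at (i + 1)) in
  at 0

(* Pick the first delimiter, in the order of preference listed above,
   whose closing sequence does not occur inside the string.
   Returns None when all fixed candidates collide. *)
let quote (s : string) : string option =
  if not (String.contains s '"') then Some ("\"" ^ s ^ "\"")
  else
    [ ""; "t"; "top"; "toplevel" ]
    |> List.find_opt (fun id -> not (contains_sub s ("|" ^ id ^ "}")))
    |> Option.map (fun id -> Printf.sprintf "{%s|%s|%s}" id s id)
```

Since quoted strings have no escape sequences, newlines and tabs go through verbatim with this scheme, which is the disadvantage noted above.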