Toplevel: only escapes bytes and not strings by Octachron · Pull Request #1231 · ocaml/ocaml

Octachron · 2017-07-08T19:53:20Z

Escaping strings when printing them in the toplevel has the disadvantage of making unicode text unreadable:

# "한글";;
- : string = "\237\149\156\234\184\128"

This PR proposes to escape only bytes and not strings:

# let cosmos = "κόσμος";;
cosmos : string = "κόσμος"
# Bytes.of_string cosmos;;
- : bytes =
Bytes.of_string "\206\186\207\140\207\131\206\188\206\191\207\130"

This change is not purely aesthetic: the mangling of unicode strings may contribute to the impression of some OCaml newcomers that Ocaml has no support for unicode (and being able to read the corresponding strings in the toplevel benefits all users not able to fluently read raw unicode codepoint).

On a more technical note, in presence of string delimiters inside the printed string, the best string delimiters still available are used:

# "日本\"語";;
- : string = {|日本"語|}

In order of preference, the string delimiters used are ", {|, {t|, {top|,{toplevel|, and in the worst case scenario {toplevel%d|:

# "مرسی\n\"|}|t}|top}|toplevel}|toplevel1}|toplevel2}";;
- : string =
{toplevel3|ﻡﺮﺳی
"|}|t}|top}|toplevel}|toplevel1}|toplevel2}|toplevel3}

With this scheme, all strings should be printable as valid string literals.

One disadvantage of this approch is that newline and tabs are not escaped anymore

# "\n\t\n\t\n";;
- : string = "
	
	
"

gasche · 2017-07-08T20:03:46Z

I like some aspects of the change (the idea that unicode letters are printed back) but not others: not-escaping the whitespace characters leads to a loss of readability, for example. Of course, some time people write string literals with newlines in them, and escaping them instead hurts readability.

I have mixed feelings about the automatic choice of {|..|} for escaping; it's a lesser-known feature that may surprise users, but on the other it makes the feature more self-discoverable. Overall I would say that I like it.

Have people studied the problem of "what is the right choice of which characters to print and which characters to escape" before, and are there solution that do not require more unicode knowledge than available in the OCaml standard library?

(Would we want this behavior to depend on the current user locale? In general it seems that people push for locale-independence these days.)

dbuenzli · 2017-07-08T20:38:26Z

I'm not really fond of the choices made by this PR. These would be my suggestions:

Always escape control characters (bytes <= 0x1F) and the string delimiter.
Do not use the "{|" syntax, always delimit strings with "
Consult some kind of environment variable to determine if the terminal is UTF-8 aware (this has to be researched or if nothing reliable is available one ocaml specific should be defined). If the terminal is not UTF-8 aware any byte > 0x7E should still be escaped.
(Ideally the string should be checked for UTF-8 validity and if not valid its non printable US-ASCII bytes should be escaped. But this can be improved once/if we get more UTF-8 tools in the stdlib).
I would really like to have the bytes of string and bytes escaped with hexadecimal numbers. Decimal escapes are a huge PITA if you are dealing with UTF-8 and other byte oriented file formats.

(Would we want this behavior to depend on the current user locale? In general it seems that people push for locale-independence these days.)

This is a user interface not an API, as such I think it would be legitimate to depend on the user locale.

Octachron · 2017-07-08T22:31:23Z

Concerning point 1 and 2, escaping characters in the C0 control character code set and " may be indeed clearer and it rejoins @gasche remarks.

Concerning point 3 (and 4), I am not completely convinced because it seems much simpler to simply not escape bytes ≥ 0x7F. By doing so, we would keep some compatibility with latin-1 and JIS users at the only cost ( for utf-8 users ) of not escaping control characters in the C1 code set (in particular NEL) and the more exotic LS or PS new lines (on the other hand, it would even give a work-arround for users that really really want non-escaped new lines).

I agree with point 5, but using hexadecimal in escaping sequence does not seems to particularly concern the toplevel, and I think this should be changed globally.

dbuenzli · 2017-07-08T22:43:16Z

By doing so, we would keep some compatibility with latin-1 and JIS users at the only cost ( for utf-8 users )

This kind of fortunate coincidence argument doesn't make sense to me. We have been trying to push for sometime now for a model where OCaml strings should be UTF-8 encoded.

I will let the dev team determine if they find it important that the toplevel is still able to function in a 7-bit environment. But I really think that we should have at least an OCAMLTOP_UTF_8 boolean environment variable (that defaults to true), that controls this.

Octachron · 2017-07-09T12:26:35Z

I have updated this to escape C0 control characters and string delimiters; in other words, quoted string
delimiters are gone and \t and \n are now escaped:

# "серафими\t\"многоꙮчитїи\"";;
- : string = "серафими\t\"многоꙮчитїи\""

@dbuenzli, I am not sure what your proposed OCAMLTOP_UTF_8 environment variable is supposed to do? Disable the escaping of bytes >0x7E but only for utf-8 valid byte sequence?

Anyway, I personally don't dislike the fortunate coincidence that users get back in the toplevel the same string they submitted as input (except for control characters that indeed should not take control of the toplevel printing).

dbuenzli · 2017-07-09T13:07:16Z

So @pqwy who studied the problem in the context of his notty library tells me that it seems hopeless to try to find a system environment variable to lookup (though LC_ALL, LANG, LC_CTYPE, could be possibly consulted in that order, but that doesn't tell you if the output actually agrees).

@dbuenzli, I am not sure what your proposed OCAMLTOP_UTF_8 environment variable is supposed to do? Disable the escaping of bytes >0x7E but only for utf-8 valid byte sequence?

No if OCAMLTOP_UTF_8 is set to false you always escape bytes > 0x7E.

Anyway, I personally don't dislike the fortunate coincidence that users get back in the toplevel the same string they submitted as input

Indeed, in a non Unicode aware terminal this user wouldn't be able to input UTF-8, so she would e.g. input "\xC3\xA9" and you would output them unescaped which would break her terminal.

Octachron · 2017-07-09T13:54:29Z

No if OCAMLTOP_UTF_8 is set to false you always escape bytes > 0x7E.

I agree that being able to reactivate the escaping of bytes > 0x7E is a undeniable improvement.

Indeed, in a non Unicode aware terminal this user wouldn't be able to input UTF-8, so she would e.g. input "\xC3\xA9" and you would output them unescaped which would break her terminal.

Well, \xC3\xA9 is also a perfectly valid latin-1(Ã©) or SHIFT-JIS (ﾃｩ) byte sequence, so breaking the terminal might be a bit hyperbolic here.

dbuenzli · 2017-07-09T14:11:36Z

Well, \xC3\xA9 is also a perfectly valid latin-1(Ã©) or SHIFT-JIS (ﾃｩ) byte sequence, so breaking the terminal might be a bit hyperbolic here.

But in that case the user would leave OCAMLTOP_UTF_8 set to true (as strange as it sounds) and would not input hex escapes...

Octachron · 2017-07-09T15:10:24Z

I have added the OCAMLTOP_UTF_8 variable, man page and manual description included.

Note that I have also deleted the paragraph in the ocaml man page about LC_CTYPE=iso_8859_1:
the described behavior did not seem valid anymore (and they are no trace of iso_8859_1 nor LC_CTYPE in the code base). Moreover, OCAMLTOP_UTF_8 supersedes such use of LC_CTYPE.

dbuenzli · 2017-07-09T15:27:00Z

Note that I have also deleted the paragraph in the ocaml man page about LC_CTYPE=iso_8859_1:

FTR @xavierleroy removed that from the manual in 5d385f9. I think this may have gone away with the resolution by @damiendoligez of MPR6521 in e60a2db.

xavierleroy · 2017-07-09T16:35:31Z

typing/oprint.ml

  if isneg then pp_print_char ppf ')'

+(** Escape only C0 control characters (bytes <= 0x1F) and '"' *)
+let print_out_string ppf s =


0x7F, a.k.a. DEL, is a control character and needs escaping too.

Fixed, thanks.

xavierleroy · 2017-07-09T16:41:20Z

This discussion makes me feel younger, because we had pretty much the same discussion back in the early 1990s when Latin-1 support was added to Caml Light and not all environments would support characters above 0x80...

Escaping control characters is absolutely necessary, and not only to display TAB, CR and LF meaningfully. For example, if terminal escape sequences are printed verbatim, the display can be completely messed up.

A desirable property of the toplevel value printer is that the output, once fed back into the toplevel by cut-and-paste, should parse back to the same value (as much as possible). I think it is the case in the latest incarnation of this PR, but make sure it is.

lpw25 · 2017-07-09T17:00:06Z

Isn't this going to mess up the formatting? Format doesn't understand the widths of these characters. I'm lead to believe that this is quite a tricky problem to solve.

dbuenzli · 2017-07-09T17:18:28Z

Isn't this going to mess up the formatting?

It will.

I'm lead to believe that this is quite a tricky problem to solve.

It's worse than that. It's a problem that is impossible to solve without being able to interact with the rendering layer to measure how many cells your UTF-8 encoded string is going to span when rendered -- something no terminal out there will provide you.

You can perform some kind of best effort formatting using either Uucp.tty_width_hint (a form of wcwidth designed by @pqwy for this problem) or Uuseg's pretty-printers but even, none of these solutions are entirely fool proof.

Octachron · 2017-07-09T17:57:49Z

Format output will get messy since Format will tend to overestimate the length of the graphical representation of strings.

Some examples, first in Greek:

# String.split_on_char ' ' "Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος οὐλομένην, ἣ μυρί᾿ Ἀχαιοῖς ἄλγε᾿ ἔθηκε, πολλὰς δ᾿ ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν οἰωνοῖσί τε πᾶσι· Διὸς δ᾿ ἐτελείετο βουλή ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.";;   
  - : string list =
["Μῆνιν"; "ἄειδε"; "θεὰ"; "Πηληϊάδεω";
 "Ἀχιλῆος"; "οὐλομένην,"; "ἣ"; "μυρί᾿";
 "Ἀχαιοῖς"; "ἄλγε᾿"; "ἔθηκε,"; "πολλὰς";
 "δ᾿"; "ἰφθίμους"; "ψυχὰς"; "Ἄϊδι";
 "προΐαψεν"; "ἡρώων,"; "αὐτοὺς"; "δὲ";
 "ἑλώρια"; "τεῦχε"; "κύνεσσιν"; "οἰωνοῖσί";
 "τε"; "πᾶσι·"; "Διὸς"; "δ᾿"; "ἐτελείετο";
 "βουλή"; "ἐξ"; "οὗ"; "δὴ"; "τὰ"; "πρῶτα";
 "διαστήτην"; "ἐρίσαντε"; "Ἀτρεΐδης"; "τε";
 "ἄναξ"; "ἀνδρῶν"; "καὶ"; "δῖος";
 "Ἀχιλλεύς."]

Compared to the english version

# String.split_on_char ' ' "Achilles sing, O Goddess! Peleus' son; His wrath pernicious, who ten thousand woes Caused to Achaia's host, sent many a soul Illustrious into Ades premature, And Heroes gave (so stood the will of Jove) To dogs and to all ravening fowls a prey, When fierce   dispute had separated once The noble Chief Achilles from the son Of Atreus, Agamemnon, King of men.";;    
- : string list =
["Achilles"; "sing,"; "O"; "Goddess!"; "Peleus'"; "son;"; "His"; "wrath";
 "pernicious,"; "who"; "ten"; "thousand"; "woes"; "Caused"; "to"; "Achaia's";
 "host,"; "sent"; "many"; "a"; "soul"; Illustrious"; "into"; "Ades";
 "premature,"; "And"; "Heroes"; "gave"; "(so"; "stood"; "the"; "will"; "of";
 "Jove)"; "To"; "dogs"; "and"; "to"; "all"; "ravening"; "fowls"; "a";
 "prey,"; "When"; "fierce"; "dispute"; "had"; "separated"; "once"; "The";
 "noble"; "Chief"; "Achilles"; "from"; "the"; "son"; "Of"; "Atreus,";
 "Agamemnon,"; "King"; "of"; "men."]

Similarly with Japanese

# [
"こぬ人を";
"まつほの浦の";
"夕なぎに";
"やくやもしほの";
"身もこがれつつ"
];;            
- : string list =
["こぬ人を"; "まつほの浦の"; "夕なぎに";
 "やくやもしほの"; "身もこがれつつ"]

or Sanskrit

# [       
"अग्निमीळे"; "पुरोहितं"; "यज्ञस्य"; "देवं रत्वीजम";
"होतारं"; "रत्नधातमम";
"अग्निः"; "पूर्वेभिर्र्षिभिरीड्यो"; "नूतनैरुत";
"स"; "देवानेह"; "वक्षति"
];;           
- : string list =
["अग्निमीळे"; "पुरोहितं";
 "यज्ञस्य"; "देवं रत्वीजम";
 "होतारं"; "रत्नधातमम"; "अग्निः";
 "पूर्वेभिर्र्षिभिरीड्यो";
 "नूतनैरुत"; "स"; "देवानेह";
 "वक्षति"]

gasche · 2017-07-09T18:59:49Z

On the other hand, any of these examples are completely unreadable with the current pretty-printing scheme, so the output you show (if formatted a bit weirdly compared to the english version) is a strong improvement.

Given that this only affect the toplevel output (and not calls to Format in user programs), I believe that not having a general solution to length formatting is fine.

DemiMarie · 2017-07-09T20:11:22Z

I think that Rust’s approach is best long-term one: strings must be in UTF-8 and are immutable. Creating a string that is not valid UTF-8 is undefined behavior.

xavierleroy · 2017-07-14T17:44:10Z

Also, if I'm reading the code correctly, backslash is not escaped...

Why not base your implementation on that of String.escaped / Bytes.escaped? That would avoid so many regressions.

Octachron · 2017-07-14T22:37:17Z

There is a cosmetic regression w.r.t the current escaping of strings: CR, LF, TAB and BS are printed with numeric escapes, while previously they were printed \r, \n, \t, \b.

They were not? At least not during my tests? (https://github.com/ocaml/ocaml/blob/trunk/stdlib/char.ml#L29)

Also, if I'm reading the code correctly, backslash is not escaped...

Of this, I am atrociously guilty. I should have added a test on the testsuite covering the whole ascii range.
This lack of test is fixed.

Why not base your implementation on that of String.escaped / Bytes.escaped?

As wished, I have reimplemented the string escape in the style of Bytes.escaped.

xavierleroy · 2017-07-15T08:33:59Z

My comment about CR LF and co was based on a wrong reading of the code, just ignore it and apologies about this.

xavierleroy

Looks very good to me, with a bit of LaTeX tweaking recommended.

xavierleroy · 2017-09-11T09:08:06Z

Changes

  (Tadeu Zagallo, review by David Allsopp)

+- GPR#1231: improved printing of unicode texts in the toplevel,
+  when OCAMLTOP_UTF_8 is not set to false.


"unless OCAMLTOP_UTF_8 is set to false" would read better, I think.

I agree. Fixed.

xavierleroy · 2017-09-11T09:13:30Z

manual/manual/cmds/top.etex

 The following environment variables are also consulted:
 \begin{options}
+\item["OCAMLTOP_UTF_8"] When printing string values, non-ascii bytes
+(>0x7E) are printed as decimal escape sequence if "OCAMLTOP_UTF_8" is


(>0x7E) will format poorly in LaTeX, with the > character rendered as upside-down question mark or some such. Please format as a proper math formula:

($ {} > "\0x7E" $)

gasche · 2017-09-12T08:41:07Z

I remarked while reading the code that the "max string length" parameter of the Oval_string node may not be respected, given that escaping (done after this test) may increase the length. However, (1) previous implementations and the Bytes codepath also suffer from this issue and (2) this parameter is currently not fixed by the user (then it would be nice to respect it) but by the Genprintval recursion-depth control code, so it sounds reasonable to only respect it approximately.

gasche · 2017-09-12T08:41:56Z

@Octachron I think you should feel free to rebase (if you want to squash some of the intermediary commits) and merge.

Escaping strings when printing them in the toplevel has the disadvantage of mangling unicode text: ``` \# "한글";; - : string = "\237\149\156\234\184\128" ``` With this commit, strings are not escaped anymore, contrarily to bytes: ``` \# let cosmos = "κόσμος";; cosmos : string = "κόσμος" \# Bytes.of_string cosmos;; - : bytes = Bytes.of_string "\206\186\207\140\207\131\206\188\206\191\207\130" ``` This new behavior can be disabled dynamically by setting the environment variable OCAMLTOP_UTF_8 to false This change is not solely aesthetic: the mangling of unicode string may contribute to the impression of some OCaml newcomers that Ocaml has no support for unicode.

Octachron · 2017-09-13T21:24:45Z

Squashed to a single commit and merged.

nojb · 2017-10-05T15:11:48Z

There is another place where a similar treatment of strings may be a good idea, namely Printexc.to_string, see here. It currently uses %S which ends up escaping Unicode filenames (which can easily appear as arguments of Sys_error).

gasche · 2017-10-05T15:19:03Z

Do we want to consider relaxing the escaping in %S? The semantic specification, as I understand it, is to print strings in the way that they can be parsed back as OCaml literals. Maybe we can assume that unicode strings are parsed back as OCaml literals in a unicode file, so that non-escaping is justified?

gasche · 2017-10-05T15:19:44Z

(On the other hand, printing unicode on Windows seems to still be an open problem, so maybe escaping is not that bad?)

damiendoligez · 2017-10-09T11:57:18Z

IMO Printexc.to_string is a debugging tool, its output should be seen by developers but not by end users. For a developer, it's probably more convenient to have the escaped version of the string.

yallop · 2019-01-08T14:26:03Z

I was surprised just now by the change in escaping behaviour introduced by this PR. Running OCaml in an Emacs subshell, I see this:

# "\255";;
- : string = "\377"

whereas before OCaml 4.06 I see this:

# "\255";;
- : string = "\255"

What's happening: OCaml used to escape characters above 0x80 when printing strings, and printed them using the same decimal format used in string literals. However, since this PR such characters are now printed directly to the terminal instead. Since they can't be displayed directly, the terminal emulator prints them as octal escapes.

xavierleroy · 2019-01-08T17:31:11Z

I agree the octal escape is confusing. At least, Emacs colors the \377 in red, doesn't it? So, that gives a hint to the (experienced) Emacs user...

shindere · 2019-01-08T20:31:31Z

Xavier Leroy (2019/01/08 09:31 -0800):

I agree the octal escape is confusing. At least, Emacs colors the `\377` in red, doesn't it? So, that gives a hint to the (experienced) Emacs user...

You mean experienced and sighted, right?

yallop · 2019-01-08T21:21:03Z

Emacs does indeed colour the \377, and the printed string behaves correctly in other ways: for example, copying and pasting the string produces the original value, and navigation commands treat the displayed \377 as a single character.

yallop · 2019-01-08T21:29:50Z

@shindere: while the colour hint is only useful for sighted users, I wonder whether users who don't rely on the visual interface avoid the confusing issue in the first place. For example, emacspeak seems to treat the printed character as a single unit (which I think it pronounces "y umlaut") rather than reading out the octal escape code. I'm curious whether the interface you use is similarly helpful.

shindere · 2019-01-08T21:42:19Z

Well unfortunately not. I am using a braille display which does not really have a way to convey the same kind of information, but it's great to read that emacspeak does such a great job, so many thanks for having reported this back!

Co-authored-by: Sabine Schmaltz <sabine@tarides.com>

Octachron force-pushed the hello_κόσμος branch from d8ee0b7 to 60fe3c1 Compare July 8, 2017 19:53

xavierleroy reviewed Jul 9, 2017

View reviewed changes

Octachron force-pushed the hello_κόσμος branch from 0b82612 to a9b3057 Compare July 14, 2017 22:35

gasche added the caml-weekly-news label Jul 15, 2017

Octachron force-pushed the hello_κόσμος branch from a9b3057 to c66b9b9 Compare July 16, 2017 19:53

xavierleroy approved these changes Sep 11, 2017

View reviewed changes

xavierleroy added the consider-for-release label Sep 11, 2017

Octachron force-pushed the hello_κόσμος branch from 17b77ac to 3fb6deb Compare September 12, 2017 08:21

Octachron force-pushed the hello_κόσμος branch from 3fb6deb to 2e6a78a Compare September 12, 2017 20:28

Octachron merged commit 588c231 into ocaml:trunk Sep 13, 2017

dra27 mentioned this pull request Oct 8, 2017

caml_sys_isatty: detect Cygwin/MSYS for better -color heuristic #1406

Merged

Octachron mentioned this pull request Jan 27, 2018

POC/WIP: Attempt at prettier printing of UTF-8 strings reasonml/reason#1780

Closed

dbuenzli mentioned this pull request Jun 11, 2019

Switch from camomile to uu* ocaml-community/lambda-term#74

Closed

dbuenzli mentioned this pull request Sep 2, 2021

Expose Oprint.escape_string as String.escape #10457

Open

EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024

Create components for form input and form select (ocaml#1231)

47e5da7

Co-authored-by: Sabine Schmaltz <sabine@tarides.com>

Conversation

Octachron commented Jul 8, 2017

Uh oh!

gasche commented Jul 8, 2017

Uh oh!

dbuenzli commented Jul 8, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Octachron commented Jul 8, 2017

Uh oh!

dbuenzli commented Jul 8, 2017

Uh oh!

Octachron commented Jul 9, 2017

Uh oh!

dbuenzli commented Jul 9, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Octachron commented Jul 9, 2017

Uh oh!

dbuenzli commented Jul 9, 2017

Uh oh!

Octachron commented Jul 9, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dbuenzli commented Jul 9, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xavierleroy Jul 9, 2017

Choose a reason for hiding this comment

Uh oh!

Octachron Jul 9, 2017

Choose a reason for hiding this comment

Uh oh!

xavierleroy commented Jul 9, 2017

Uh oh!

lpw25 commented Jul 9, 2017

Uh oh!

dbuenzli commented Jul 9, 2017

Uh oh!

Octachron commented Jul 9, 2017

Uh oh!

gasche commented Jul 9, 2017

Uh oh!

DemiMarie commented Jul 9, 2017

Uh oh!

xavierleroy commented Jul 14, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Octachron commented Jul 14, 2017

Uh oh!

xavierleroy commented Jul 15, 2017

Uh oh!

xavierleroy left a comment

Choose a reason for hiding this comment

Uh oh!

xavierleroy Sep 11, 2017

Choose a reason for hiding this comment

Uh oh!

Octachron Sep 12, 2017

Choose a reason for hiding this comment

Uh oh!

xavierleroy Sep 11, 2017

Choose a reason for hiding this comment

Uh oh!

Octachron Sep 12, 2017

Choose a reason for hiding this comment

Uh oh!

gasche commented Sep 12, 2017

Uh oh!

gasche commented Sep 12, 2017

Uh oh!

Octachron commented Sep 13, 2017

Uh oh!

nojb commented Oct 5, 2017

Uh oh!

gasche commented Oct 5, 2017

Uh oh!

gasche commented Oct 5, 2017

Uh oh!

damiendoligez commented Oct 9, 2017

dbuenzli commented Jul 8, 2017 •

edited

Loading

dbuenzli commented Jul 9, 2017 •

edited

Loading

Octachron commented Jul 9, 2017 •

edited

Loading

dbuenzli commented Jul 9, 2017 •

edited

Loading

xavierleroy commented Jul 14, 2017 •

edited

Loading