Skip to content

Toplevel: only escapes bytes and not strings#1231

Merged
Octachron merged 1 commit intoocaml:trunkfrom
Octachron:hello_κόσμος
Sep 13, 2017

Hidden character warning

The head ref may contain hidden characters: "hello_\u03ba\u03cc\u03c3\u03bc\u03bf\u03c2"
Merged

Toplevel: only escapes bytes and not strings#1231
Octachron merged 1 commit intoocaml:trunkfrom
Octachron:hello_κόσμος

Conversation

@Octachron
Copy link
Copy Markdown
Member

Escaping strings when printing them in the toplevel has the disadvantage of making unicode text unreadable:

# "한글";;
- : string = "\237\149\156\234\184\128"

This PR proposes to escape only bytes and not strings:

# let cosmos = "κόσμος";;
cosmos : string = "κόσμος"
# Bytes.of_string cosmos;;
- : bytes =
Bytes.of_string "\206\186\207\140\207\131\206\188\206\191\207\130"

This change is not purely aesthetic: the mangling of unicode strings may contribute to the impression of some OCaml newcomers that Ocaml has no support for unicode (and being able to read the corresponding strings in the toplevel benefits all users not able to fluently read raw unicode codepoint).

On a more technical note, in presence of string delimiters inside the printed string, the best string delimiters still available are used:

# "日本\"";;
- : string = {|日本"語|}

In order of preference, the string delimiters used are ", {|, {t|, {top|,{toplevel|, and in the worst case scenario {toplevel%d|:

# "مرسی\n\"|}|t}|top}|toplevel}|toplevel1}|toplevel2}";;
- : string =
{toplevel3|ﻡﺮﺳی
"|}|t}|top}|toplevel}|toplevel1}|toplevel2}|toplevel3}

With this scheme, all strings should be printable as valid string literals.

One disadvantage of this approch is that newline and tabs are not escaped anymore

# "\n\t\n\t\n";;
- : string = "
	
	
"

@Octachron Octachron force-pushed the hello_κόσμος branch from d8ee0b7 to 60fe3c1 Compare July 8, 2017 19:53
@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 8, 2017

I like some aspects of the change (the idea that unicode letters are printed back) but not others: not-escaping the whitespace characters leads to a loss of readability, for example. Of course, some time people write string literals with newlines in them, and escaping them instead hurts readability.

I have mixed feelings about the automatic choice of {|..|} for escaping; it's a lesser-known feature that may surprise users, but on the other it makes the feature more self-discoverable. Overall I would say that I like it.

Have people studied the problem of "what is the right choice of which characters to print and which characters to escape" before, and are there solution that do not require more unicode knowledge than available in the OCaml standard library?

(Would we want this behavior to depend on the current user locale? In general it seems that people push for locale-independence these days.)

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Jul 8, 2017

I'm not really fond of the choices made by this PR. These would be my suggestions:

  1. Always escape control characters (bytes <= 0x1F) and the string delimiter.
  2. Do not use the "{|" syntax, always delimit strings with "
  3. Consult some kind of environment variable to determine if the terminal is UTF-8 aware (this has to be researched or if nothing reliable is available one ocaml specific should be defined). If the terminal is not UTF-8 aware any byte > 0x7E should still be escaped.
  4. (Ideally the string should be checked for UTF-8 validity and if not valid its non printable US-ASCII bytes should be escaped. But this can be improved once/if we get more UTF-8 tools in the stdlib).
  5. I would really like to have the bytes of string and bytes escaped with hexadecimal numbers. Decimal escapes are a huge PITA if you are dealing with UTF-8 and other byte oriented file formats.

(Would we want this behavior to depend on the current user locale? In general it seems that people push for locale-independence these days.)

This is a user interface not an API, as such I think it would be legitimate to depend on the user locale.

@Octachron
Copy link
Copy Markdown
Member Author

Concerning point 1 and 2, escaping characters in the C0 control character code set and " may be indeed clearer and it rejoins @gasche remarks.

Concerning point 3 (and 4), I am not completely convinced because it seems much simpler to simply not escape bytes ≥ 0x7F. By doing so, we would keep some compatibility with latin-1 and JIS users at the only cost ( for utf-8 users ) of not escaping control characters in the C1 code set (in particular NEL) and the more exotic LS or PS new lines (on the other hand, it would even give a work-arround for users that really really want non-escaped new lines).

I agree with point 5, but using hexadecimal in escaping sequence does not seems to particularly concern the toplevel, and I think this should be changed globally.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Jul 8, 2017

By doing so, we would keep some compatibility with latin-1 and JIS users at the only cost ( for utf-8 users )

This kind of fortunate coincidence argument doesn't make sense to me. We have been trying to push for sometime now for a model where OCaml strings should be UTF-8 encoded.

I will let the dev team determine if they find it important that the toplevel is still able to function in a 7-bit environment. But I really think that we should have at least an OCAMLTOP_UTF_8 boolean environment variable (that defaults to true), that controls this.

@Octachron
Copy link
Copy Markdown
Member Author

I have updated this to escape C0 control characters and string delimiters; in other words, quoted string
delimiters are gone and \t and \n are now escaped:

# "серафими\t\"многоꙮчитїи\"";;
- : string = "серафими\t\"многоꙮчитїи\""

@dbuenzli, I am not sure what your proposed OCAMLTOP_UTF_8 environment variable is supposed to do? Disable the escaping of bytes >0x7E but only for utf-8 valid byte sequence?

Anyway, I personally don't dislike the fortunate coincidence that users get back in the toplevel the same string they submitted as input (except for control characters that indeed should not take control of the toplevel printing).

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Jul 9, 2017

So @pqwy who studied the problem in the context of his notty library tells me that it seems hopeless to try to find a system environment variable to lookup (though LC_ALL, LANG, LC_CTYPE, could be possibly consulted in that order, but that doesn't tell you if the output actually agrees).

@dbuenzli, I am not sure what your proposed OCAMLTOP_UTF_8 environment variable is supposed to do? Disable the escaping of bytes >0x7E but only for utf-8 valid byte sequence?

No if OCAMLTOP_UTF_8 is set to false you always escape bytes > 0x7E.

Anyway, I personally don't dislike the fortunate coincidence that users get back in the toplevel the same string they submitted as input

Indeed, in a non Unicode aware terminal this user wouldn't be able to input UTF-8, so she would e.g. input "\xC3\xA9" and you would output them unescaped which would break her terminal.

@Octachron
Copy link
Copy Markdown
Member Author

No if OCAMLTOP_UTF_8 is set to false you always escape bytes > 0x7E.

I agree that being able to reactivate the escaping of bytes > 0x7E is a undeniable improvement.

Indeed, in a non Unicode aware terminal this user wouldn't be able to input UTF-8, so she would e.g. input "\xC3\xA9" and you would output them unescaped which would break her terminal.

Well, \xC3\xA9 is also a perfectly valid latin-1(é) or SHIFT-JIS (テゥ) byte sequence, so breaking the terminal might be a bit hyperbolic here.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Jul 9, 2017

Well, \xC3\xA9 is also a perfectly valid latin-1(é) or SHIFT-JIS (テゥ) byte sequence, so breaking the terminal might be a bit hyperbolic here.

But in that case the user would leave OCAMLTOP_UTF_8 set to true (as strange as it sounds) and would not input hex escapes...

@Octachron
Copy link
Copy Markdown
Member Author

Octachron commented Jul 9, 2017

I have added the OCAMLTOP_UTF_8 variable, man page and manual description included.

Note that I have also deleted the paragraph in the ocaml man page about LC_CTYPE=iso_8859_1:
the described behavior did not seem valid anymore (and they are no trace of iso_8859_1 nor LC_CTYPE in the code base). Moreover, OCAMLTOP_UTF_8 supersedes such use of LC_CTYPE.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Jul 9, 2017

Note that I have also deleted the paragraph in the ocaml man page about LC_CTYPE=iso_8859_1:

FTR @xavierleroy removed that from the manual in 5d385f9. I think this may have gone away with the resolution by @damiendoligez of MPR6521 in e60a2db.

if isneg then pp_print_char ppf ')'

(** Escape only C0 control characters (bytes <= 0x1F) and '"' *)
let print_out_string ppf s =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0x7F, a.k.a. DEL, is a control character and needs escaping too.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.

@xavierleroy
Copy link
Copy Markdown
Contributor

This discussion makes me feel younger, because we had pretty much the same discussion back in the early 1990s when Latin-1 support was added to Caml Light and not all environments would support characters above 0x80...

Escaping control characters is absolutely necessary, and not only to display TAB, CR and LF meaningfully. For example, if terminal escape sequences are printed verbatim, the display can be completely messed up.

A desirable property of the toplevel value printer is that the output, once fed back into the toplevel by cut-and-paste, should parse back to the same value (as much as possible). I think it is the case in the latest incarnation of this PR, but make sure it is.

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Jul 9, 2017

Isn't this going to mess up the formatting? Format doesn't understand the widths of these characters. I'm lead to believe that this is quite a tricky problem to solve.

@dbuenzli
Copy link
Copy Markdown
Contributor

dbuenzli commented Jul 9, 2017

Isn't this going to mess up the formatting?

It will.

I'm lead to believe that this is quite a tricky problem to solve.

It's worse than that. It's a problem that is impossible to solve without being able to interact with the rendering layer to measure how many cells your UTF-8 encoded string is going to span when rendered -- something no terminal out there will provide you.

You can perform some kind of best effort formatting using either Uucp.tty_width_hint (a form of wcwidth designed by @pqwy for this problem) or Uuseg's pretty-printers but even, none of these solutions are entirely fool proof.

@Octachron
Copy link
Copy Markdown
Member Author

Format output will get messy since Format will tend to overestimate the length of the graphical representation of strings.

Some examples, first in Greek:

# String.split_on_char ' ' "Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος οὐλομένην, ἣ μυρί᾿ Ἀχαιοῖς ἄλγε᾿ ἔθηκε, πολλὰς δ᾿ ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν οἰωνοῖσί τε πᾶσι· Διὸς δ᾿ ἐτελείετο βουλή ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.";;   
  - : string list =
["Μῆνιν"; "ἄειδε"; "θεὰ"; "Πηληϊάδεω";
 "Ἀχιλῆος"; "οὐλομένην,"; ""; "μυρί᾿";
 "Ἀχαιοῖς"; "ἄλγε᾿"; "ἔθηκε,"; "πολλὰς";
 "δ᾿"; "ἰφθίμους"; "ψυχὰς"; "Ἄϊδι";
 "προΐαψεν"; "ἡρώων,"; "αὐτοὺς"; "δὲ";
 "ἑλώρια"; "τεῦχε"; "κύνεσσιν"; "οἰωνοῖσί";
 "τε"; "πᾶσι·"; "Διὸς"; "δ᾿"; "ἐτελείετο";
 "βουλή"; "ἐξ"; "οὗ"; "δὴ"; "τὰ"; "πρῶτα";
 "διαστήτην"; "ἐρίσαντε"; "Ἀτρεΐδης"; "τε";
 "ἄναξ"; "ἀνδρῶν"; "καὶ"; "δῖος";
 "Ἀχιλλεύς."]

Compared to the english version

# String.split_on_char ' ' "Achilles sing, O Goddess! Peleus' son; His wrath pernicious, who ten thousand woes Caused to Achaia's host, sent many a soul Illustrious into Ades premature, And Heroes gave (so stood the will of Jove) To dogs and to all ravening fowls a prey, When fierce   dispute had separated once The noble Chief Achilles from the son Of Atreus, Agamemnon, King of men.";;    
- : string list =
["Achilles"; "sing,"; "O"; "Goddess!"; "Peleus'"; "son;"; "His"; "wrath";
 "pernicious,"; "who"; "ten"; "thousand"; "woes"; "Caused"; "to"; "Achaia's";
 "host,"; "sent"; "many"; "a"; "soul"; Illustrious"; "into"; "Ades";
 "premature,"; "And"; "Heroes"; "gave"; "(so"; "stood"; "the"; "will"; "of";
 "Jove)"; "To"; "dogs"; "and"; "to"; "all"; "ravening"; "fowls"; "a";
 "prey,"; "When"; "fierce"; "dispute"; "had"; "separated"; "once"; "The";
 "noble"; "Chief"; "Achilles"; "from"; "the"; "son"; "Of"; "Atreus,";
 "Agamemnon,"; "King"; "of"; "men."]

Similarly with Japanese

# [
"こぬ人を";
"まつほの浦の";
"夕なぎに";
"やくやもしほの";
"身もこがれつつ"
];;            
- : string list =
["こぬ人を"; "まつほの浦の"; "夕なぎに";
 "やくやもしほの"; "身もこがれつつ"]

or Sanskrit

# [       
"अग्निमीळे"; "पुरोहितं"; "यज्ञस्य"; "देवं रत्वीजम";
"होतारं"; "रत्नधातमम";
"अग्निः"; "पूर्वेभिर्र्षिभिरीड्यो"; "नूतनैरुत";
""; "देवानेह"; "वक्षति"
];;           
- : string list =
["अग्निमीळे"; "पुरोहितं";
 "यज्ञस्य"; "देवं रत्वीजम";
 "होतारं"; "रत्नधातमम"; "अग्निः";
 "पूर्वेभिर्र्षिभिरीड्यो";
 "नूतनैरुत"; ""; "देवानेह";
 "वक्षति"]

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 9, 2017

On the other hand, any of these examples are completely unreadable with the current pretty-printing scheme, so the output you show (if formatted a bit weirdly compared to the english version) is a strong improvement.

Given that this only affect the toplevel output (and not calls to Format in user programs), I believe that not having a general solution to length formatting is fine.

@DemiMarie
Copy link
Copy Markdown
Contributor

I think that Rust’s approach is best long-term one: strings must be in UTF-8 and are immutable. Creating a string that is not valid UTF-8 is undefined behavior.

@xavierleroy
Copy link
Copy Markdown
Contributor

xavierleroy commented Jul 14, 2017

Also, if I'm reading the code correctly, backslash is not escaped...

Why not base your implementation on that of String.escaped / Bytes.escaped? That would avoid so many regressions.

@Octachron Octachron force-pushed the hello_κόσμος branch from 0b82612 to a9b3057 Compare July 14, 2017 22:35
@Octachron
Copy link
Copy Markdown
Member Author

There is a cosmetic regression w.r.t the current escaping of strings: CR, LF, TAB and BS are printed with numeric escapes, while previously they were printed \r, \n, \t, \b.

They were not? At least not during my tests? (https://github.com/ocaml/ocaml/blob/trunk/stdlib/char.ml#L29)

Also, if I'm reading the code correctly, backslash is not escaped...

Of this, I am atrociously guilty. I should have added a test on the testsuite covering the whole ascii range.
This lack of test is fixed.

Why not base your implementation on that of String.escaped / Bytes.escaped?

As wished, I have reimplemented the string escape in the style of Bytes.escaped.

@xavierleroy
Copy link
Copy Markdown
Contributor

My comment about CR LF and co was based on a wrong reading of the code, just ignore it and apologies about this.

Copy link
Copy Markdown
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good to me, with a bit of LaTeX tweaking recommended.

Changes Outdated
(Tadeu Zagallo, review by David Allsopp)

- GPR#1231: improved printing of unicode texts in the toplevel,
when OCAMLTOP_UTF_8 is not set to false.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"unless OCAMLTOP_UTF_8 is set to false" would read better, I think.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Fixed.

The following environment variables are also consulted:
\begin{options}
\item["OCAMLTOP_UTF_8"] When printing string values, non-ascii bytes
(>0x7E) are printed as decimal escape sequence if "OCAMLTOP_UTF_8" is
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(>0x7E) will format poorly in LaTeX, with the > character rendered as upside-down question mark or some such. Please format as a proper math formula:

  ($ {} > "\0x7E" $)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@gasche
Copy link
Copy Markdown
Member

gasche commented Sep 12, 2017

I remarked while reading the code that the "max string length" parameter of the Oval_string node may not be respected, given that escaping (done after this test) may increase the length. However, (1) previous implementations and the Bytes codepath also suffer from this issue and (2) this parameter is currently not fixed by the user (then it would be nice to respect it) but by the Genprintval recursion-depth control code, so it sounds reasonable to only respect it approximately.

@gasche
Copy link
Copy Markdown
Member

gasche commented Sep 12, 2017

@Octachron I think you should feel free to rebase (if you want to squash some of the intermediary commits) and merge.

Escaping strings when printing them in the toplevel has the disadvantage
of mangling unicode text:

```
\# "한글";;
- : string = "\237\149\156\234\184\128"
```

With this commit, strings are not escaped anymore, contrarily to bytes:

```
\# let cosmos = "κόσμος";;
cosmos : string = "κόσμος"
\# Bytes.of_string cosmos;;
- : bytes =
Bytes.of_string "\206\186\207\140\207\131\206\188\206\191\207\130"
```

This new behavior can be disabled dynamically by setting the environment
variable OCAMLTOP_UTF_8 to false

This change is not solely aesthetic: the mangling of unicode string may
contribute to the impression of some OCaml newcomers that Ocaml has no
support for unicode.
@Octachron Octachron merged commit 588c231 into ocaml:trunk Sep 13, 2017
@Octachron
Copy link
Copy Markdown
Member Author

Squashed to a single commit and merged.

@nojb
Copy link
Copy Markdown
Contributor

nojb commented Oct 5, 2017

There is another place where a similar treatment of strings may be a good idea, namely Printexc.to_string, see here. It currently uses %S which ends up escaping Unicode filenames (which can easily appear as arguments of Sys_error).

@gasche
Copy link
Copy Markdown
Member

gasche commented Oct 5, 2017

Do we want to consider relaxing the escaping in %S? The semantic specification, as I understand it, is to print strings in the way that they can be parsed back as OCaml literals. Maybe we can assume that unicode strings are parsed back as OCaml literals in a unicode file, so that non-escaping is justified?

@gasche
Copy link
Copy Markdown
Member

gasche commented Oct 5, 2017

(On the other hand, printing unicode on Windows seems to still be an open problem, so maybe escaping is not that bad?)

@damiendoligez
Copy link
Copy Markdown
Member

IMO Printexc.to_string is a debugging tool, its output should be seen by developers but not by end users. For a developer, it's probably more convenient to have the escaped version of the string.

@yallop
Copy link
Copy Markdown
Member

yallop commented Jan 8, 2019

I was surprised just now by the change in escaping behaviour introduced by this PR. Running OCaml in an Emacs subshell, I see this:

# "\255";;
- : string = "\377"

whereas before OCaml 4.06 I see this:

# "\255";;
- : string = "\255"

What's happening: OCaml used to escape characters above 0x80 when printing strings, and printed them using the same decimal format used in string literals. However, since this PR such characters are now printed directly to the terminal instead. Since they can't be displayed directly, the terminal emulator prints them as octal escapes.

@xavierleroy
Copy link
Copy Markdown
Contributor

I agree the octal escape is confusing. At least, Emacs colors the \377 in red, doesn't it? So, that gives a hint to the (experienced) Emacs user...

@shindere
Copy link
Copy Markdown
Contributor

shindere commented Jan 8, 2019 via email

@yallop
Copy link
Copy Markdown
Member

yallop commented Jan 8, 2019

Emacs does indeed colour the \377, and the printed string behaves correctly in other ways: for example, copying and pasting the string produces the original value, and navigation commands treat the displayed \377 as a single character.

@yallop
Copy link
Copy Markdown
Member

yallop commented Jan 8, 2019

@shindere: while the colour hint is only useful for sighted users, I wonder whether users who don't rely on the visual interface avoid the confusing issue in the first place. For example, emacspeak seems to treat the printed character as a single unit (which I think it pronounces "y umlaut") rather than reading out the octal escape code. I'm curious whether the interface you use is similarly helpful.

@shindere
Copy link
Copy Markdown
Contributor

shindere commented Jan 8, 2019 via email

EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Co-authored-by: Sabine Schmaltz <sabine@tarides.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants