Enable UTF-8 on Windows 10 Console (Mark #2)#1444
Conversation
How bad is it not to do it? Even with best efforts, there are several situations where the clean shutdown path is not used. I can think for instance, of: (i) fatal errors such as not-properly detected stack overflows or other segfaults (not necessarily in OCaml code); (ii) direct exit from the main program when the OCaml runtime is embedded. |
|
Yes, please split the PR into independent PRs. It will make it much easier to test and discuss the changes separately. |
|
The support for the Windows Console is uncovering a lot of bugs that show up very late in the release cycle. We could make another choice, which is to revert the If I understand correctly, not doing any of the SetConsoleOutputCP stuff means user of the Console may get segfaults -and we cannot easily detect whether someone is running the Console or not. I wonder whether this can be used as a security flaw. |
|
Gabriel Scherer (2017/10/25 06:54 +0000):
The support for the Windows Console is uncovering a lot of bugs that
show up very late in the release cycle. We could make another choice,
which is to revert the `SetConsoleOutputCP`-related changes from 4.06
(and Windows 10 detection), and advertise that unicode input/output
from the Windows Console is *not* supported in this release. Are there
people out there that think this would be a reasonable idea?
One other possibility I can think of would be to keep this in 4.06 but
advertising this feature to be in a beta state. That way, we may get
more input for users and end-up with a support in 4.07 that would be in
a better shape than with no support at all in 4.06.
|
|
@shindere That would be a reasonable approach. Since as I understand it enabling this feature doesn't make ocaml unstable for majority of the cases. |
I still don't have a clear mental picture of the situation; looking forward to @dra27's blog post! Still, I can see two scenarios:
|
Will this cause trouble running OCaml executables on older versions of Windows? What are all the effects of such a change in the manifest? |
otherlibs/unix/access.c
Outdated
| p = caml_stat_strdup_to_os(String_val(path)); | ||
| caml_enter_blocking_section(); | ||
| ret = access(p, cv_flags); | ||
| fflush(stdout); |
There was a problem hiding this comment.
Was this call to fflush left over from some debugging code?
There was a problem hiding this comment.
Man, that's embarrassing, thanks for spotting that. If you can believe it, I do review my diffs...
|
Rebased to switch to the released FlexDLL 0.37 - I'll let AppVeyor confirm the changes (they're fine in my private fork, but you never know) and then answer these (very helpful) comments. |
|
OK, here goes: @alainfrisch - it's not dreadful not to reset it, it just may not match a user's expectation. I discovered it because I was comparing an intentionally not-working ocamlc with a patched one, and discovered that the output of the broken one was correct as long as I had run the patched one first! What we're doing here is the "standard" way. @gasche - the problem here, as with other things, is that these end up stacked, so I was hoping that we could discuss whether the changes listed are wanted in principle and then carve them off. The error in #1416 is partly an artefact of carving it from its parent #1408. At this stage, I would not bother with the risk of a revert - the code will simply not activate without doing something quite advanced to the executables produced by ocamlc/ocamlopt after they've been built. From a security perspective, we've introduced nothing new here, whether that code is left in or reverted (you could segfault programs before; you still can). @shindere, @bikalgurung - I think this would boil down to shipping with @xavierleroy - the manifest indicates support for a given version of Windows, not a requirement, so manifesting an application for Windows 10 does not prevent (or change) its operation on previous versions. However, that's not how I propose fixing it (I'd intentionally avoided using manifests in #1416) and we should definitely not change something like that at this stage of the release cycle. For the So, I think we have:
Finally, there is a choice:
I would obviously quite like to merge the lot in 4.06 and I'd be happy to add @shindere's suggestion of disabling the new Unicode support by default for 4.06.0, but encouraging users to enable it. |
|
I think that 8187684 (why is the a change to |
Changes to GetVersion/GetVersionEx for Windows 8.1 mean that the OCaml runtime queries the version of kernel32.dll to determine the actual Windows version. Unfortunately, this too is subject to a shim since the wrong field was read - switching from FileVersion to ProductVersion fixes the problem.
The chcp command is erroneously implemented in the Windows Command Prompt and merely returns the value which the Command Processor expects to be active. The OCaml runtime should not leave the Console in CP_UTF8, but rather restore the CodePage when the runtime terminates.
|
Thanks for the explanations concerning |
|
The alternative to restoring before exiting is setting and restoring around each call that outputs to the console, right? Does'nt look like a good option. |
|
@alainfrisch - yes, we could - I agree that it doesn't feel like a good option. @xavierleroy - there are other things which can cause trouble - for example, if I press CTRL+C at the wrong time when Git is displaying output, I can end up with a green/red screen font left over because Git terminates without restoring the previous console colour settings (I don't think that's a thing when using ANSI escape sequences instead, although maybe?). I don't think that's a counter-argument to your worry, but it's possibly worth remembering that all a user has to do if the console is left in an "unacceptable" state is re-issue the |
|
I still don't know whether we should merge this in 4.06 or not. My favorite solution would be to remove all the SetConsoleOutputCp code and Windows-version-detection from 4.06. If we are going to keep the code that is already in there, then what is proposed in (what remains of) this PR seems not-worse to me, so it might make sense to merge it. |
You're the boss. Go ahead. |
|
What about @shindere's idea to make WINDOWS_UNICODE=0 by default and therefore make the whole of this a beta "opt-in" for 4.06? |
|
To my non-expert eyes, the "Unicode in file names" part of the work looks pretty solid and well worth exposing by default in 4.06. It's the "Unicode on the console" part that is currently in flux. |
|
@xavierleroy fair point, which could be addressed by introducing |
damiendoligez
left a comment
There was a problem hiding this comment.
Reviewed by Damien and Gabriel while Xavier was looking away.
|
Cherry-picked in 4.06 : 7103830. |
This GPR started out while trying finally to write a blog post about the new Unicode features on Windows which subsequently turned into a case of Dolan's Rabbithole...
This may well want splitting up - and must not be merged as is (the submodule update in 742df2c is not merged upstream). The following things are addressed:
Unix.accesswas not upgraded to use char_os in Unicode support for the Windows runtime: Let's do it! #1200. This is fixed (with test case) in b59395f and must certainly go in 4.06.0byterun/Makefileto allowOCAML_STDLIB_DIRto be available in C files when compiling on Windows, but this fails if it contains Unicode code points which don't fit the system locale (so 1252 on my system). The patch in bb1419e uses a little trick with GNU iconv (which is part of the base installation on Cygwin and is only used in the Windows build system, so it should be low/no risk to depend on it) to convert the string to Java format and then translate the escapes from that to be compatible with both cl and mingw (TL;DR, the string gets encoded in C99 \uHHHH format, with surrogates and then the \u is translated to \x which works on older Microsoft C compilers)GetVersionEx. The fix is simple in 7e31f50. On the basis that the behaviour of Switch Windows 10 Console to UTF-8 Encoding #1416 was agreed, I would consider this a bug-fix for 4.06.0, but please read onSetConsoleOutputCPhas shown that although cmd's chcp command appears to report that the code page is restored after a process has ended, this isn't the case (Cmd appears to have cached what it thinks the code page is, and doesn't notice). This means that the runtime should restore the code page when it shuts down - this is done in a737357. This wasn't considered in Switch Windows 10 Console to UTF-8 Encoding #1416. I'd obviously quite like it to go in, but I can see an argument could be that the change in Switch Windows 10 Console to UTF-8 Encoding #1416 is presently inactive, so both these commits could be deferred (dra27 makes a pleading face across the channel).cc @gasche, @damiendoligez, @nojb