Skip to content

Enable UTF-8 on Windows 10 Console (Mark #2)#1444

Merged
damiendoligez merged 2 commits intoocaml:trunkfrom
dra27:gpr1416-fixes
Oct 27, 2017
Merged

Enable UTF-8 on Windows 10 Console (Mark #2)#1444
damiendoligez merged 2 commits intoocaml:trunkfrom
dra27:gpr1416-fixes

Conversation

@dra27
Copy link
Copy Markdown
Member

@dra27 dra27 commented Oct 24, 2017

This GPR started out while trying finally to write a blog post about the new Unicode features on Windows which subsequently turned into a case of Dolan's Rabbithole...

This may well want splitting up - and must not be merged as is (the submodule update in 742df2c is not merged upstream). The following things are addressed:

  • Unix.access was not upgraded to use char_os in Unicode support for the Windows runtime: Let's do it! #1200. This is fixed (with test case) in b59395f and must certainly go in 4.06.0
  • There is already trickery in byterun/Makefile to allow OCAML_STDLIB_DIR to be available in C files when compiling on Windows, but this fails if it contains Unicode code points which don't fit the system locale (so 1252 on my system). The patch in bb1419e uses a little trick with GNU iconv (which is part of the base installation on Cygwin and is only used in the Windows build system, so it should be low/no risk to depend on it) to convert the string to Java format and then translate the escapes from that to be compatible with both cl and mingw (TL;DR, the string gets encoded in C99 \uHHHH format, with surrogates and then the \u is translated to \x which works on older Microsoft C compilers)
  • Most embarrassingly (for me), when Windows Console Unicode Support #1408 was being split up to create Switch Windows 10 Console to UTF-8 Encoding #1416, my testing was clearly insufficient on Windows 10. The console changes in Switch Windows 10 Console to UTF-8 Encoding #1416 will only activate if the compiled executable is updated to have a Windows 10/Server 2016 manifest, as the field which was being read from the version information in kernel32.dll is subject to the same shim which affects GetVersionEx. The fix is simple in 7e31f50. On the basis that the behaviour of Switch Windows 10 Console to UTF-8 Encoding #1416 was agreed, I would consider this a bug-fix for 4.06.0, but please read on
  • Further investigation of SetConsoleOutputCP has shown that although cmd's chcp command appears to report that the code page is restored after a process has ended, this isn't the case (Cmd appears to have cached what it thinks the code page is, and doesn't notice). This means that the runtime should restore the code page when it shuts down - this is done in a737357. This wasn't considered in Switch Windows 10 Console to UTF-8 Encoding #1416. I'd obviously quite like it to go in, but I can see an argument could be that the change in Switch Windows 10 Console to UTF-8 Encoding #1416 is presently inactive, so both these commits could be deferred (dra27 makes a pleading face across the channel).
  • I found most of these issues both by compiling OCaml with a PREFIX containing UTF-8 sequences and, as additional torture, even building OCaml in a directory which itself contains UTF-8 sequences. It seems we should do this on AppVeyor. This has led to Fix encoding of high surrogate in UTF-16 flexdll#47. I don't think we should be advertising a release of OCaml which depends on a tool with a faulty UTF-16 encoder, so I think we should be aiming to switch OCaml 4.06.0 to depend on a not-yet-released FlexDLL 0.37 (cc @alainfrisch). 742df2c is a temporary switch to use this patch and points AppVeyor temporarily at my fork of FlexDLL. That should clearly not be merged.
  • Passing UTF-8 between appveyor.yml, Cmd and Bash is really painful! 53b9c66, ee9dc6b and 9cd4f9e make minor tweaks so that OCAMLROOT and OCAMLROOT2 can be painlessly dropped with fairly minimal duplication of the constants between those 3 files in 9f4cf36 which then sets up the UTF-8 tortuous builds. The resulting log file in AppVeyor is riddled with 🐫s in a most visually pleasing fashion... it built successfully on my private run of this, so it had better work now...

cc @gasche, @damiendoligez, @nojb

@alainfrisch
Copy link
Copy Markdown
Contributor

This means that the runtime should restore the code page when it shuts down - this is done in a737357.

How bad is it not to do it? Even with best efforts, there are several situations where the clean shutdown path is not used. I can think for instance, of: (i) fatal errors such as not-properly detected stack overflows or other segfaults (not necessarily in OCaml code); (ii) direct exit from the main program when the OCaml runtime is embedded.

@gasche
Copy link
Copy Markdown
Member

gasche commented Oct 25, 2017

Yes, please split the PR into independent PRs. It will make it much easier to test and discuss the changes separately.

@gasche
Copy link
Copy Markdown
Member

gasche commented Oct 25, 2017

The support for the Windows Console is uncovering a lot of bugs that show up very late in the release cycle. We could make another choice, which is to revert the SetConsoleOutputCP-related changes from 4.06 (and Windows 10 detection), and advertise that unicode input/output from the Windows Console is not supported in this release. Are there people out there that think this would be a reasonable idea?

If I understand correctly, not doing any of the SetConsoleOutputCP stuff means user of the Console may get segfaults -and we cannot easily detect whether someone is running the Console or not. I wonder whether this can be used as a security flaw.

@shindere
Copy link
Copy Markdown
Contributor

shindere commented Oct 25, 2017 via email

@bikallem
Copy link
Copy Markdown

@shindere That would be a reasonable approach. Since as I understand it enabling this feature doesn't make ocaml unstable for majority of the cases.

@xavierleroy
Copy link
Copy Markdown
Contributor

We could make another choice, which is to revert the SetConsoleOutputCP-related changes from 4.06 (and Windows 10 detection), and advertise that unicode input/output from the Windows Console is not supported in this release. Are there people out there that think this would be a reasonable idea?

I still don't have a clear mental picture of the situation; looking forward to @dra27's blog post! Still, I can see two scenarios:

  • The SetConsoleOutputCP business is a partial, Windows 10-only solution that would be subsumed by the solution based on ReadConsole/WriteConsole described in Windows Console Unicode Support #1408. In this case we should revert the SetConsoleOutputCP changes in 4.06 and get the Windows Console Unicode Support #1408 solution ready for 4.07.
  • The SetConsoleOutputCP business is a prerequisite to other solutions such as Windows Console Unicode Support #1408. In this case we should keep the 4.06 code as it is today, even if it's not functional for the reasons listed in the present PR. Or if we want to make sure the code is innocuous, we comment out the call to SetConsoleOutputCP but leave all the other code around in preparation for improvement in 4.07.

@xavierleroy
Copy link
Copy Markdown
Contributor

The console changes in #1416 will only activate if the compiled executable is updated to have a Windows 10/Server 2016 manifest

Will this cause trouble running OCaml executables on older versions of Windows? What are all the effects of such a change in the manifest?

p = caml_stat_strdup_to_os(String_val(path));
caml_enter_blocking_section();
ret = access(p, cv_flags);
fflush(stdout);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this call to fflush left over from some debugging code?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Man, that's embarrassing, thanks for spotting that. If you can believe it, I do review my diffs...

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Oct 25, 2017

Rebased to switch to the released FlexDLL 0.37 - I'll let AppVeyor confirm the changes (they're fine in my private fork, but you never know) and then answer these (very helpful) comments.

@dra27 dra27 removed the suspended label Oct 25, 2017
@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Oct 25, 2017

OK, here goes:

@alainfrisch - it's not dreadful not to reset it, it just may not match a user's expectation. I discovered it because I was comparing an intentionally not-working ocamlc with a patched one, and discovered that the output of the broken one was correct as long as I had run the patched one first! What we're doing here is the "standard" way.

@gasche - the problem here, as with other things, is that these end up stacked, so I was hoping that we could discuss whether the changes listed are wanted in principle and then carve them off. The error in #1416 is partly an artefact of carving it from its parent #1408. At this stage, I would not bother with the risk of a revert - the code will simply not activate without doing something quite advanced to the executables produced by ocamlc/ocamlopt after they've been built. From a security perspective, we've introduced nothing new here, whether that code is left in or reverted (you could segfault programs before; you still can).

@shindere, @bikalgurung - I think this would boil down to shipping with WINDOWS_UNICODE=0 in the template config/Makefile.m*. I can get on board with that - if we did so, then I'd hope that ca9f26d can be merged as, by inspection, that code only does anything if WINDOWS_UNICODE=1.

@xavierleroy - the manifest indicates support for a given version of Windows, not a requirement, so manifesting an application for Windows 10 does not prevent (or change) its operation on previous versions. However, that's not how I propose fixing it (I'd intentionally avoided using manifests in #1416) and we should definitely not change something like that at this stage of the release cycle. For the SetConsoleOutputCP business, it's not quite as you say for the first scenario. The Windows 10 solution will not be subsumed by #1408 - the output-related stuff in #1408 allows previous versions to operate correctly, but on Windows 10 that code is not activated. We could use #1408 everywhere (I believe @nojb found that this is what Git-for-Windows does), but my personal feeling is that we should workaround old broken behaviour and not keep workarounds in place when the underlying function is fixed (as in the floating point special value hacks and VS2015/2017). For the second scenario, it's not so much that #1408 depends on #1416, the patch would just alter slightly if we'd reverted #1416 - #1408 assumes that Unicode is always enabled, but engages a different function to write to the console when running prior to Windows 10.

So, I think we have:

  • 8187684 is a build system change (which is tested fully in this PR using AppVeyor and can be verified also on precheck). It's not a regression, as you couldn't compile Windows OCaml in a directory including UTF-8 sequences before, so it's up to you whether that's just for trunk.
  • 06775b3, thanks to @nojb spotting my hideous diff artefact, can now be split off to another GPR for both 4.06 and trunk, which I'm happy to do (the reason it's here is that the issue affects ocamldebug which breaks the testsuite when loading files from within a Unicode directory, that's how I found it).
  • fdb1255 updates FlexDLL to 0.37 and can also be split off to another GPR for both 4.06 and trunk. This new version uses the new Buffer.add_utf_16le_uchar which fixes the encoding of UTF-16 surrogate pairs.
  • the last four commits are all AppVeyor-related and non-critical therefore and are dependent on the previous two items being merged.

Finally, there is a choice:

  • Definitively disable Switch Windows 10 Console to UTF-8 Encoding #1416 by commenting out the call to caml_setup_win32_terminal in byterun/sys.c (the call to caml_probe_win32_version is safe, though that could be commented out too)
  • Fix Switch Windows 10 Console to UTF-8 Encoding #1416 by merging 5c16a6d so that OS detection returns caml_win32_major == 10 on Windows 10 instead of caml_win32_major == 6 and cope with the fact that this leaves the console in UTF-8 mode after any OCaml-compiled process terminates
  • Do the previous and also merge ca9f26d so that under normal circumstances we restore the codepage on exit.

I would obviously quite like to merge the lot in 4.06 and I'd be happy to add @shindere's suggestion of disabling the new Unicode support by default for 4.06.0, but encouraging users to enable it.

@gasche
Copy link
Copy Markdown
Member

gasche commented Oct 25, 2017

I think that 8187684 (why is the a change to dynlink.c in a "build system change"?), 06775b3 and fdb1255 must all be sent to separate PRs, and I would also send the AppVeyor fixes away on their own (referring to the dependency). People can keep discussing what to do for the Windows Console here, but if you keep all changes bundled the probability that they all get reviewed properly and we can check that they were all reviewed properly by tomorrow seems close to zero.

dra27 added 2 commits October 25, 2017 15:24
Changes to GetVersion/GetVersionEx for Windows 8.1 mean that the OCaml
runtime queries the version of kernel32.dll to determine the actual
Windows version. Unfortunately, this too is subject to a shim since the
wrong field was read - switching from FileVersion to ProductVersion fixes
the problem.
The chcp command is erroneously implemented in the Windows Command Prompt
and merely returns the value which the Command Processor expects to be
active. The OCaml runtime should not leave the Console in CP_UTF8, but
rather restore the CodePage when the runtime terminates.
@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Oct 25, 2017

Split off to #1446, #1447, #1448 and #1449... it will take Travis/AppVeyor a little while to catch up.

@xavierleroy
Copy link
Copy Markdown
Contributor

Thanks for the explanations concerning SetConsoleOutputCP and alternatives. I can agree that SetConsoleOutputCP is the way of the future, but I find it worrying (bordering on not acceptable) that OCaml is supposed to restore the CP to its initial value before exiting, as this looks very hard to guarantee.

@alainfrisch
Copy link
Copy Markdown
Contributor

The alternative to restoring before exiting is setting and restoring around each call that outputs to the console, right? Does'nt look like a good option.

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Oct 25, 2017

@alainfrisch - yes, we could - I agree that it doesn't feel like a good option.

@xavierleroy - there are other things which can cause trouble - for example, if I press CTRL+C at the wrong time when Git is displaying output, I can end up with a green/red screen font left over because Git terminates without restoring the previous console colour settings (I don't think that's a thing when using ANSI escape sequences instead, although maybe?). I don't think that's a counter-argument to your worry, but it's possibly worth remembering that all a user has to do if the console is left in an "unacceptable" state is re-issue the chcp command or reload the console. If the application has been started directly, then the console is "ours" anyway, so we can do what we like - if the application has been spawned by another process, then the parent application had the option of controlling access to its console and is supposed to allow for the fact that it may be returned in a different state (arguably, Cmd is misbehaving in not restoring the code page itself and certainly misbehaving in reporting that the wrong one is active).

@gasche gasche added this to the consider-for-4.06.0 milestone Oct 26, 2017
@dra27 dra27 changed the title Various Windows Unicode fixes Enable UTF-8 on Windows 10 Console (Mark #2) Oct 26, 2017
@gasche
Copy link
Copy Markdown
Member

gasche commented Oct 27, 2017

I still don't know whether we should merge this in 4.06 or not.

My favorite solution would be to remove all the SetConsoleOutputCp code and Windows-version-detection from 4.06. If we are going to keep the code that is already in there, then what is proposed in (what remains of) this PR seems not-worse to me, so it might make sense to merge it.

@xavierleroy
Copy link
Copy Markdown
Contributor

My favorite solution would be to remove all the SetConsoleOutputCp code and Windows-version-detection from 4.06.

You're the boss. Go ahead.

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Oct 27, 2017

What about @shindere's idea to make WINDOWS_UNICODE=0 by default and therefore make the whole of this a beta "opt-in" for 4.06?

@xavierleroy
Copy link
Copy Markdown
Contributor

To my non-expert eyes, the "Unicode in file names" part of the work looks pretty solid and well worth exposing by default in 4.06. It's the "Unicode on the console" part that is currently in flux.

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Oct 27, 2017

@xavierleroy fair point, which could be addressed by introducing WINDOWS_UNICODE_CONSOLE and using it in the one relevant place in byterun/sys.c?

Copy link
Copy Markdown
Member

@damiendoligez damiendoligez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed by Damien and Gabriel while Xavier was looking away.

@damiendoligez damiendoligez merged commit 10b309f into ocaml:trunk Oct 27, 2017
@gasche
Copy link
Copy Markdown
Member

gasche commented Oct 27, 2017

Cherry-picked in 4.06 : 7103830.

gasche pushed a commit that referenced this pull request Oct 27, 2017
Enable UTF-8 on Windows 10 Console (Mark #2)
(cherry picked from commit 10b309f)
@dra27 dra27 deleted the gpr1416-fixes branch July 6, 2021 13:52
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants