Skip to content

Remove dependency on libwinpthread-1.dll and GCC runtime library for mingw-w64 ports#11586

Merged
xavierleroy merged 6 commits intoocaml:trunkfrom
dra27:static-winpthreads
Oct 11, 2022
Merged

Remove dependency on libwinpthread-1.dll and GCC runtime library for mingw-w64 ports#11586
xavierleroy merged 6 commits intoocaml:trunkfrom
dra27:static-winpthreads

Conversation

@dra27
Copy link
Copy Markdown
Member

@dra27 dra27 commented Sep 29, 2022

This PR alters the mingw-w64 build to link the winpthreads library statically, removing the need for additional runtime DLLs (or, another way, restoring the OCaml 4.x lack of need for additional runtime DLLs).

Doing this proved slightly more complicated than I'd expected, as it turns out that GCC's implementation of implicit TLS (__thread) pulls in both its runtime library and also dynamically links the winpthreads library when building DLLs. This is something we've avoided in the past in the i686 port by using -static-libgcc but not only is that slightly worrying to have to do for the 64-bit port as well, but it also doesn't work for the pthreads linking.

This PR consists of three very closely linked parts:

  • The PTHREAD_ machinery from otherlibs/systhreads/Makefile is removed as pthreads is now always linked as part of the runtime and it would result in double-linking when switching to static winpthreads.
  • On Windows, Caml_state now uses a similar technique to pthread_getspecific on macOS, rather than implicit TLS via __thread. As a very minor aside, the SET_Caml_state macro is shifted inside a CAML_INTERNALS block, since it's only used in domain.c where we can actually load the required Windows headers to use TlsSetValue (more on this below).
  • winpthreads is then linked statically, which requires support -l: library linking specification flexdll#106 (this PR teaches flexlink to understand -l: syntax).

For background, there is no better explanation of Windows TLS than Ken Johnson's Nynaeve Adventures in Windows debugging blog (one should also, of course, read Raymond Chen on the same subject).

The way I propose implementing Caml_state warrants (considerable) further explanation. The problem with Caml_state is that it's in a public header. Our headers and codebase do not interoperate well with windows.h (cf. the trickery required around ATOM) and it seems best to keep it that way, as it's likely that user code suffers the same problem. This already means that the macro implementation of Caml_state wants to avoid using Windows API functions and types, because we have to declare compatible versions in the header, and it's really awkward. Coupled with that is the fact that TlsGetValue somewhat inexplicably insists on clearing GetLastError (see the related previous fix in #10220), I've elected here to expand the macro definitions from winternl.h for NtCurrentTeb and then have caml_make_domain_state_key both reserve a TLS slot and also compute the offset in the TEB where that TLS slot will be.

I did implement a version which instead included windows.h, but this involves some really ugly pre-processor mucking with symbol names in various other files (both in the debugger and in the interpreter). I also looked at the alternative with manual declarations for TlsGetValue, SetLastError and GetLastError and it's similarly hideous and, frankly, looks more brittle than just computing the TEB offset for the TLS slot. It's also fundamentally upsetting to have to make 4 system calls and 2 branches just to dereference a single pointer! I also experimented just doing it the macOS way and using winpthread's pthread_getspecific but this also suffers from the fact that we're in a public header and directly exposing pthreads in this way makes linking potentially awkward for user C code. The stub DLL for systhreads has to jump through a minor hoop to do this, but that seems acceptable for that one library. For Windows, winpthreads provides a nice abstraction for simpler code in most of the runtime, but I don't think it's a good idea for the fact we do this to be exposed in the API (the sync module, etc. hides the implementation of the primitives).

Back in Windows 9x days, manually reading the TEB would have been utter madness - the 16 bit implementations of TLS have actually changed between versions of 16-bit Windows. However, since the very beginning of Windows NT, both the C compiler and loader have colluded on the "internal" format of the TEB in order to implement implicit TLS (with quite a few infamous limitations along the way, which is why GCC uses some emulation). A side-effect of this collusion is that the TLS layout in the TEB pretty much cannot change (but you of course read all 8 parts of Nynaeve and therefore have already read that!). So what I'm proposing here is actually what the C compiler should be doing (and, when MSVC gets resurrected, is possibly what the MSVC port will actually do, although I confess I'll be surprised if it actually works across DLL loading and flexdll). I have chosen to restrict the implementation to the first 64 TLS slots only (later in the TEB is a pointer to the other 1024 slots added with Windows Vista). This should be no problem at all for ocamlopt-compiled programs (I think there are 3 other TLS slots in use before this one gets allocated); some particularly strange "OCaml called from C" implementation should hopefully be able to call caml_make_domain_state_key before allocating gazillions of TLS slots.

Given the DLL hell which can arise on mingw-w64 (see all the fun in opam for manifesting the C++ mingw-w64 runtime DLLs), I hope that one slightly hairy macro definition and a bit of dirty pointer arithmetic is an acceptable cost. As it happens, the combined facts the the TLS slot is set using TlsSetValue and then retrieved directly from the TEB coupled with quite how fundamental Caml_state is to the runtime means that this all gets quite severely tested just building the compiler, let alone running the testsuite!

This PR is marked as Draft only to prevent merging before the FlexDLL PR is merged. This branch has been through precheck#778.

@dra27 dra27 added this to the 5.0 milestone Sep 29, 2022
@xavierleroy
Copy link
Copy Markdown
Contributor

If we go this way (it's a big "if"), all the Windows hackery should be hidden behind functions defined in runtime/win32.c. We cannot pollute <caml/mlvalues.h> with so much stuff.

With apologies in advance for a silly question: why can't we use pthread_getspecific as in macOS? The winpthread implementation of pthread_getspecific is not good enough?

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Sep 30, 2022

The problem looks well-researched and the result looks nicer than it sounds. As a writer of C library without a Windows development environment, I prefer indeed if there is no additional hoops to go through to have it working on the various Windows ports.

The Caml_state_opt macro was written with CSE in mind, which compilers apply when they realise it is a TLS access (see e.g. the definition of Caml_state in debug mode and the new checks just introduced). Is there CSE with this definition of Caml_state_opt? This sounds plausible, depending on how the intrinsic is treated by the compiler. (Look e.g. at the generated code for caml_check_pending_actions.)

@xavierleroy
Copy link
Copy Markdown
Contributor

Here is a proof of concept : https://github.com/xavierleroy/ocaml/tree/refactor-domain-state . In a nutshell, configure determines whether we can use __thread variables in all contexts. If not, accesses to caml_state are hidden behind a getter function and a setter function, the getter being declared pure so that GCC and Clang can do some CSE. This way, the overhead of the getter function is kept low.

Three implementations of the getter and the setter are provided:

  • in runtime/win32.c, an implementation based on TlsGetValue, preserving LastError; that's where @dra27's superoptimized implementation could go;
  • in runtime/unix.c, a generic implementation based on PThread TLS;
  • in runtime/unix.c, a macOS implementation that uses a static __thread variable. (The problem with macOS is accessing __thread variables from a DLL, but here the access is through functions, so it seems OK.)

I wonder whether the third implementation (static __thread) would work for MinGW as well.

Let me know what you think and feel free to use this code in the current PR.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Oct 4, 2022

This looks nice but since you change the implementation of Caml_state for MacOS, it is worth trying to replace __thread with _Thread_local and defining HAS_FULL_THREAD_VARIABLES for MacOS (IIRC the two keywords are treated differently due to ABI issues, with _Thread_local support being more recent).

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Oct 4, 2022

Just an update from my side: I'd been benchmarking this approach over the weekend, but I hadn't marked the function as pure. What I had found was that across the board on Windows it's slower (virtually every test in the testsuite runs more slowly, and both the sequential and parallel build are noticeably slower). There's one outlier in the sequential build which is why I hadn't yet posted any more detail, but I'll redo the comparisons with the function marked pure. Some of the tests in the testsuite were consistently 20% slower with the getter/setter vs full TLS. Obviously there's still a choice in the approach we eventually adopt, but my main conclusion so far is that it is definitely "hot"!

@xavierleroy
Copy link
Copy Markdown
Contributor

Obviously there's still a choice in the approach we eventually adopt, but my main conclusion so far is that it is definitely "hot"!

Then you'll like my next proposal even more... (Teaser!)

@xavierleroy
Copy link
Copy Markdown
Contributor

xavierleroy commented Oct 4, 2022

Proof of concept, take 2: https://github.com/xavierleroy/ocaml/tree/refactor-domain-state-2 Now simpler and faster!

It starts with the assumption that caml_state can always be defined __thread and used like this within the runtime system. The only restriction is that on macOS and MinGW, DLLs must not access it directly, so they need to go through a getter function. (No setter function because DLLs don't need to set caml_state.)

The resulting code is quite simple and should be faster than the existing code on macOS (no pthread_getspecific calls) and at least as fast as the Win32 code proposed here in this PR.

Let me know what you think. If your thoughts are positive I'll submit it as a PR.

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Oct 4, 2022

My thoughts are very positive - the Windows-specific bits are gone! 🙂 I'll rebase what I've got here (switching to static linking) on to your TLS implementation and see what the performance numbers are looking like.

@dra27 dra27 force-pushed the static-winpthreads branch from 34b1218 to ea0255a Compare October 10, 2022 13:18
@dra27 dra27 marked this pull request as ready for review October 10, 2022 13:19
@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Oct 10, 2022

Rebased over #11598 (also running through precheck#784)

Copy link
Copy Markdown
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to us (@Octachron & @xavierleroy), but there's one thing we'd like to understand better, see below.

Comment on lines +18 to +22
#if defined(_WIN32) && !defined(NATIVE_CODE)
/* Ensure that pthread.h marks symbols __declspec(dllimport) so that they can be
picked up from the runtime (which will have linked winpthreads statically) */
#define DLL_EXPORT
#endif
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Octachron and I don't quite understand what this is doing. Which pthread.h file is being affected by this #define?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pthread.h from the winpthreads library in mingw-w64 (/usr/x86_64-w64-mingw32/sys-root/mingw/include/ptjhread.h). If DLL_EXPORT is defined and IN_WINPTHREAD is not defined, then it decorates all the API functions with __declspec(dllimport) (as we would if they were declared CAMLextern in our headers).

That doesn't appear to be expressly documented - if for some reason the header changed that, the possible impact would be the delightful "Cannot relocate" errors.

It has to be done here because caml/domain.h #include's caml/platform.h which #include's <pthread.h> (i.e. by the time st_pthreads.h does #include <pthread.h> it has already happened).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanations. We were looking at the wrong pthread.h file.

This trick is kind of fragile indeed, but no need to think of alternatives until it breaks.

@xavierleroy
Copy link
Copy Markdown
Contributor

CI precheck looks happy, so let me merge this in trunk and @Octachron will choose when to pick it in 5.0.

@xavierleroy xavierleroy merged commit cd36f15 into ocaml:trunk Oct 11, 2022
dra27 pushed a commit to dra27/ocaml that referenced this pull request Oct 11, 2022
Remove dependency on libwinpthread-1.dll and GCC runtime library for mingw-w64 ports

(cherry picked from commit cd36f15)
@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Oct 11, 2022

This PR combined with #11598 is going through precheck#785 on a test branch for pushing to 5.0

@Octachron
Copy link
Copy Markdown
Member

Since precheck is non-red, it seems fine to cherry-pick those changes to 5.0 .

@dra27
Copy link
Copy Markdown
Member Author

dra27 commented Oct 11, 2022

Done: 0be01f6 sitting on dcb4c83

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants