
Simpler handling of caml_state on macOS and MinGW#11598

Merged
xavierleroy merged 2 commits into ocaml:trunk from xavierleroy:refactor-domain-state-2
Oct 10, 2022

Conversation

@xavierleroy
Contributor

@xavierleroy xavierleroy commented Oct 4, 2022

Currently, caml_state is defined as a thread-local (__thread) variable, except on macOS, because macOS doesn't allow a DLL to access a __thread variable defined in the main program. MinGW with static linking of libwinpthread has the same limitation; see #11586.

The current workaround for macOS, based on pthread_getspecific, is costly, and the proposed workarounds for MinGW are complicated.

This PR offers a simpler and more efficient solution to this problem. It starts with the assumption that caml_state can always be defined with __thread and directly used within the runtime system. The only restriction is that on macOS and MinGW, DLLs must not access caml_state directly, so we make them go through a getter function defined in the main program instead. (No setter function because DLLs don't need to set caml_state.)

The getter function is declared pure so that GCC and Clang can perform CSE on repeated calls to the getter (see https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html).

The resulting code is quite simple and should be faster than the existing code on macOS (no pthread_getspecific calls) and at least as fast as the Win32 code proposed earlier in #11586.

@xavierleroy xavierleroy added this to the 5.0 milestone Oct 4, 2022
Member

@dra27 dra27 left a comment


I'm still running some timing comparisons, but I spotted a macro typo in the meantime

#define SET_Caml_state(x) (caml_state = (x))
#else
#ifdef __GNUC__
__attribute__((pure))
Member


This could be detected with:

#if defined (__GNUC__) || defined(__has_attribute)
#  if defined(__GNUC__) || __has_attribute(pure)
  __attribute__((pure))
#  endif
#endif

(or by defining Caml_has_attribute in misc.h a la Caml_has_builtin) but given that every version of GCC we support has this attribute, presumably the same is true of clang?

Contributor Author


AFAIK, unknown attributes are silently ignored. But, yes, Clang seems to support the attribute as well as GCC.

@gadmm
Contributor

gadmm commented Oct 7, 2022

> The resulting code is quite simple and should be faster than the existing code on macOS (no pthread_getspecific calls)

As per the previous discussion on this topic, it is not straightforward that using __thread or _Thread_local results in more efficient code than pthread_getspecific. The source code seems simple but the generated code not so much.

On Arm, __thread and _Thread_local seem to be compiled with an indirect call, both following the model for _Thread_local on x86-64:

10045fbdc: e0 23 00 b0     adrp    x0, 0x1008dc000 <_caml_parse_ocamlrunparam+0x68c>
10045fbe0: 00 20 2f 91     add    x0, x0, #3016
10045fbe4: 08 00 40 f9     ldr    x8, [x0]
10045fbe8: 00 01 3f d6     blr    x8        ; call `tlv_get_addr`
10045fbec: 08 00 40 f9     ldr    x8, [x0]
...
libdyld.dylib`tlv_get_addr:
    0x19cb1c530 <+0>:   ldr    x16, [x0, #0x8]
    0x19cb1c534 <+4>:   mrs    x17, TPIDRRO_EL0      ; Thread ID register
    0x19cb1c538 <+8>:   and    x17, x17, #0xfffffffffffffff8
    0x19cb1c53c <+12>:  ldr    x17, [x17, x16, lsl #3]
    0x19cb1c540 <+16>:  cbz    x17, 0x19cb1c550          ; <+32>
    0x19cb1c544 <+20>:  ldr    x16, [x0, #0x10]
    0x19cb1c548 <+24>:  add    x0, x17, x16
    0x19cb1c54c <+28>:  ret    

In contrast, using pthread_getspecific one gets:

100026788: 60 02 40 f9     ldr    x0, [x19]
10002678c: 06 2b 00 94     bl    0x1000313a4 <_write+0x1000313a4>
...
1000313a4: 30 00 00 f0     adrp    x16, 0x100038000 <__stubs+0x460>
1000313a8: 10 7a 41 f9     ldr    x16, [x16, #752]
1000313ac: 00 02 1f d6     br    x16   ; call pthread_getspecific
...
libsystem_pthread.dylib`pthread_getspecific:
    0x19cb10094 <+0>: mrs    x8, TPIDRRO_EL0
    0x19cb10098 <+4>: ldr    x0, [x8, x0, lsl #3]
    0x19cb1009c <+8>: ret 

In other words, pthread_getspecific compiles to the ELF local-exec model plus one direct call and one indirect call. The indirect call is due to dynamic linking of pthread_getspecific. I do not know whether it is possible to avoid the linkage table or whether it is an essential limitation of the platform.

However, clang fails to do CSE with pthread_getspecific. You can force CSE though with something like:

static inline caml_domain_state *caml_get_domain_state(void) __attribute__((pure));
static inline caml_domain_state *caml_get_domain_state(void)
{
  return (caml_domain_state*) pthread_getspecific(caml_domain_state_key);
}
#define Caml_state_opt (caml_get_domain_state())

> at least as fast as the Win32 code proposed earlier in #11586

This is also not clear to me, especially when IN_CAML_RUNTIME is not defined. With #11586, it seems that the generated code is similar to the ELF local-exec model whether you are in the runtime or not. @dra27 essentially showed that you can have your cake and eat it too.

I am especially interested in performance from outside the runtime. The performance of accesses to Caml_state is important for user programs too (e.g. for both local roots and our library boxroot).

A few notes:

  • When replacing __thread with _Thread_local and removing the wrapping on macOS + Arm64, the linker still complains, which is all the more surprising because this expensive scheme is similar to the one that supports dynamic linking in ELF. (Remember @kayceesrk's experiment suggesting that _Thread_local was properly implemented on macOS + x86-64; this issue also suggests that something was working with x86-64 but no longer with M1.)
  • One reason why _Thread_local might be more expensive is that accessing it must not cause a crash even if the runtime is not started yet. This is a nice convenience (e.g. in our library boxroot, to guarantee safety in corner cases of Rust programs), but it might be cheaper to check this initialization separately if really needed.
  • Using the pure attribute does not guarantee the best CSE. The CSE the compiler performs for __thread lies between the one obtained with pure and the one obtained with const (the latter of which is unsound here), probably because the compiler has less aliasing information available. This can be seen with caml_check_pending_actions (in this case the common Caml_state should be factored by hand).

@xavierleroy
Contributor Author

> I am especially interested in performance from outside the runtime

The rest of us care about performance within the runtime. We routinely spend 1/3rd of the running time in the GC, damn it.

@dra27
Member

dra27 commented Oct 7, 2022

(sorry, my ability to investigate what I wanted to for this is being hampered by ill-health)

TL;DR I'd really like this PR in next week's beta, and I'd really like the static linking part of #11586 to be either with it or in a subsequent 5.0 beta. In more detail:

  • Primary goal for me is no runtime libraries on Windows (because it's a usability disaster)
  • In terms of how this PR works, it should be possible for at least native stubs to use direct access to caml_state too. I was wondering if there's an immediate win to defining IN_CAML_RUNTIME in the appropriate place in otherlibs/systhreads/Makefile as that already compiles the C stubs for bytecode and native separately, and st_stubs.c has quite a lot of explicit use of Caml_state
  • I'm interested as to whether the same might be true for all the implicit uses of Caml_state in libunix.a, but that can wait. It needs performance figures to justify it.
  • Likewise, it might be that there's an actual benefit in ocamlmklib to having the C code for static vs shared use compiled separately, but that can "super" wait! It likewise needs performance figures to justify it - what's interesting in this case is that on macOS and Windows at least, we have a strong case for static linking over shared linking 🙈
  • The TLS implementation I proposed in Remove dependency on libwinpthread-1.dll and GCC runtime library for mingw-w64 ports #11586 does appear to be faster (based on both sequential build time and, more importantly, on the sequential runtime of the testsuite), but to me for 5.0 that's really not that important (and we might choose to live with it anyway). My implementation does incur a slight cost to building apparently, as <intrin.h> is a very slow header for cpp to process (and it gets pulled in by mlvalues.h, so it affects everything)... it's maybe having your cake and eating it, but minus the icing.

@xavierleroy
Contributor Author

> TL;DR I'd really like this PR in next week's beta

I'm in favor too! Why don't you start by approving this PR?

> In terms of how this PR works, it should be possible for at least native stubs to use direct access to caml_state too

Yes, that's right.

At some point I considered using direct access as soon as __PIC__ is not defined. But this seems quite unreliable, as gcc often defines __PIC__ even if the -fPIC option is not explicitly given. So there is no existing preprocessor define that we can use to distinguish between native stubs and bytecode stubs.

If we want to go this way, maybe IN_CAML_RUNTIME is not the best name to convey what we're doing.

Member

@dra27 dra27 left a comment


> I'm in favor too! Why don't you start by approving this PR?

It was only waiting on whether you wanted to do anything with the systhreads Makefile now or later. However, I'm happy with this being merged as-is, thank you!

@xavierleroy
Contributor Author

I'm building the MinGW-64 version right now, so as to look at the code generated by GCC... Additional comments welcome!

@xavierleroy
Contributor Author

I was hoping GCC would compile __thread variables like in Linux, as a direct memory access relative to a segment register, but the MinGW code is pretty awful, with a call to a __emutls_get_address library function that doesn't look cheap. It's regrettable that neither macOS nor MinGW implement __thread fully (w.r.t. DLLs) and efficiently...

@xavierleroy
Contributor Author

> It's regrettable that neither macOS nor MinGW implement __thread fully (w.r.t. DLLs) and efficiently...

But, as a consequence, the overhead of calling the caml_get_domain_state function instead of accessing caml_state directly is low (proportionally speaking), so we probably don't need to go out of our way to remove this overhead from OCaml-C stub libraries.

@gadmm
Contributor

gadmm commented Oct 7, 2022

Well wishes to you @dra27.

@xavierleroy:

> The rest of us care about performance within the runtime. We routinely spend 1/3rd of the running time in the GC, damn it.

I do not understand what this is trying to claim and how this relates to my results.

If what matters is really runtime performance, the best-known solution on macOS is to use the current pthread_getspecific approach and add the attribute pure. It also has the advantage of not needing a workaround to link dynamically. On MinGW, @dra27's not-so-dirty hack was also going to be hard to beat, and it does not need special-casing of dynamic linking.

It makes sense to use the approach in this PR for MinGW, with hopes that #11586 becomes mature enough at some later point. It does not make sense to change the current macOS implementation in this direction. Later I can send my patch to add the attribute pure for pthread_getspecific.

@xavierleroy
Contributor Author

> If what matters is really runtime performance, the best-known solution on macOS is to use the current pthread_getspecific approach and add the attribute pure.

No. I'm getting tired of this discussion, so I just did some quick performance measurements on a Mac M1. The time it takes to evaluate Caml_state_opt:

  • on trunk (direct call to pthread_getspecific): 1.3 ns
  • with this PR: 1.0 ns if IN_CAML_RUNTIME is defined, 1.6 ns otherwise.

CSE is not exercised in these measurements, but keep in mind that the code that offers the best CSE opportunities is this PR in IN_CAML_RUNTIME mode.

My old tiny benchmark suite (with Knuth-Bendix and the like) shows no meaningful speed differences; they are all lost in the noise. The GC bench (which spends all its time allocating and mutating data structures) is 0.9% faster with this PR, but I would not consider this meaningful.

So, unless new concerns are expressed, I propose to merge this PR in 5.0 tomorrow.

@gadmm
Contributor

gadmm commented Oct 9, 2022

It is hard to conclude something from this kind of micro-benchmark without an interpretation of what is going on. Could you please send us your micro-benchmark? Naively, the 60% overhead measured for wrapping in a function seems a bit large for the compiled code I have seen. In addition, I am curious about controlling for the impact of indirect branch prediction, which should be more favourable to pthread_getspecific (since it happens in a single location). That being said, I had access to an M1 Mac temporarily through lucky circumstances and I cannot guarantee that I will have the occasion to further help understand what is going on.

pure is enough for the CSE for which Caml_check_caml_state() and the like were written, which is the one that worries me. The rest of the runtime already tends to factor Caml_state more than CSE alone can do. So I do not think that the difference between flavours of CSE is a factor.

I am also unconvinced by the GC bench, but mainly because a difference of 0.9% could come from unrelated things, and we also do not know the impact of the missing CSE when thread-local access is expensive. Can you please give us a pointer to the benchmark to reproduce?

In summary:

  • I am more-or-less convinced that the impact is small, so this is not necessarily an expression of concern regarding merging this PR (given that you seem in a rush);
  • I am not convinced that we understand what is going on nor which implementation is the most efficient;
  • I am not convinced that the impact will be too small to revisit the approach later on (especially given the gains announced on win32).

Simpler handling of caml_state on macOS and MinGW

macOS and MinGW don't quite support __thread variables defined in the
main program and accessed from a DLL.  So, special handling of
caml_state is required on these platforms.  This commit refactors
the way this is done.

- Introduce HAS_FULL_THREAD_VARIABLES, determined by configure,
  to say that DLLs can access __thread variables from the main program.

- If not, a getter function is used for reading `caml_state`
  (`caml_get_domain_state` in runtime/domain.c).

- DLLs never write to `caml_state`, so remove the `SET_Caml_state` macro.

- This implementation works on macOS (with HAS_FULL_THREAD_VARIABLES not
  defined), so remove the old macOS implementation based on pthread TLS.

Add the IN_CAML_RUNTIME preprocessor define, which is set when producing
object files in runtime/.  There's no risk of accessing across DLLs in
this case.
@xavierleroy xavierleroy force-pushed the refactor-domain-state-2 branch from 69baac8 to 6b7c5a5 Compare October 10, 2022 13:03
@xavierleroy xavierleroy changed the title Simpler, more efficient handling of caml_state on macOS and MinGW Simpler handling of caml_state on macOS and MinGW Oct 10, 2022
@xavierleroy
Contributor Author

I agree there's room for improvement in the future, esp. for the Win32 code. On the other hand this PR and #11586 should not block the beta releases of OCaml 5.0, for which we have a tight schedule. So, I'm taking the liberty to merge this PR in trunk, hoping that @dra27 can build the static linking of winpthreads on top of it. @Octachron will then decide whether and when to integrate this work in 5.0.

@xavierleroy xavierleroy merged commit 1851722 into ocaml:trunk Oct 10, 2022
@xavierleroy xavierleroy deleted the refactor-domain-state-2 branch October 11, 2022 08:19
dra27 pushed a commit to dra27/ocaml that referenced this pull request Oct 11, 2022
Simpler handling of `caml_state` on macOS and MinGW

(cherry picked from commit 1851722)
@dra27
Member

dra27 commented Oct 11, 2022

Cherry-picked to 5.0 with #11586 in dcb4c83

@gadmm
Contributor

gadmm commented Oct 21, 2022

I would still like to understand what is going on since @xavierleroy's benchmarks surprise me. Xavier, can you please send us the code for your benchmarks?

@xavierleroy
Contributor Author

I probably should not, as I'm pretty sure any code that I post is going to be held against me. But against my better judgment, here is the trivial benchmark for Caml_state evaluation:
https://gist.github.com/xavierleroy/7fe9457b095580acd19174259252fc10

And here is the GC benchmark, adapted from Java code that has been around for a while:
https://gist.github.com/xavierleroy/bd07e239c847f67f6a5f0fa4ac52dc65

@gadmm
Contributor

gadmm commented Oct 24, 2022

Thanks.

@gadmm
Contributor

gadmm commented Nov 10, 2022

It turns out that clang optimizes TLS accesses very aggressively, even at -O0, regardless of your asm memory barrier. In particular it factors the indirect call to get the TLS address which is the expensive part in the Apple implementation of thread_local. So the benchmark was probably not measuring what you thought.

I ran my own benchmarks, and it is better to use pthread_getspecific: it is a bit faster (though the speedup is likely not as dramatic as @dra27's optimization on Windows), and it allows us to get rid of IN_CAML_RUNTIME.

In the case where TLS is wrapped in a function in a shared library, clang does not optimize the benchmark away. I measured the time to do 1024 accesses (fluctuations +/- 10% excluding unknown effects):

| duration / #accesses | Cold caches | Hot caches |
| --- | --- | --- |
| pthread_getspecific | 17.2 ns | 1.8 ns |
| wrapped thread_local | 23.4 ns | 2.5 ns |

In this case my results are consistent with yours.

My trick to prevent false optimizations by clang is to measure the time to access 1024 different TLS variables (clang still does some optimizations in this case which you would not find in actual code, but it does not entirely optimize the benchmark away).

| duration / #accesses | Cold caches | Hot caches |
| --- | --- | --- |
| pthread_getspecific | 17.7 ns | 1.9 ns |
| thread_local | 35.3 ns | 2.9 ns |

We see the cost of missing branch target prediction in the case of thread_local.

This is consistent with what one can predict by looking at the generated code for the various implementations. One conclusion is that it is simpler to compare implementations by looking at the generated code (or at least it is better to look at the code without doing benchmarks than to do benchmarks without looking at the code).

Details of the experiment here: https://gitlab.com/gadmm/bench_tls/ (inspired by https://blog.cloudflare.com/branch-predictor/ for cold cache measurement, esp. BTB).
