Simpler handling of caml_state on macOS and MinGW#11598
Simpler handling of caml_state on macOS and MinGW#11598xavierleroy merged 2 commits intoocaml:trunkfrom
caml_state on macOS and MinGW#11598Conversation
dra27
left a comment
There was a problem hiding this comment.
I'm still running some timing comparisons, but I spotted a macro typo in the meantime
| #define SET_Caml_state(x) (caml_state = (x)) | ||
| #else | ||
| #ifdef GNUC | ||
| __attribute__((pure)) |
There was a problem hiding this comment.
This could be detected with:
#if defined (__GNUC__) || defined(__has_attribute)
# if defined(__GNUC__) || __has_attribute(pure)
__attribute__((pure))
# endif
#endif(or by defining Caml_has_attribute in misc.h a la Caml_has_builtin) but given that every version of GCC we support has this attribute, presumably the same is true of clang?
There was a problem hiding this comment.
AFAIK, unknown attributes are silently ignored. But, yes, Clang seems to support the attribute as well as GCC.
As per the previous discussion on this topic, it is not straightforward that using On Arm, In contrast, using In other words However, clang fails to do CSE with static inline caml_domain_state *caml_get_domain_state(void) __attribute__((pure));
static inline caml_domain_state *caml_get_domain_state(void)
{
return (caml_domain_state*) pthread_getspecific(caml_domain_state_key);
}
#define Caml_state_opt (caml_get_domain_state())
This is also not clear to me, especially when I am especially interested in performance from outside the runtime. The performance of accesses to A few notes:
|
The rest of us care about performance within the runtime. We routinely spend 1/3rd of the running time in the GC, damn it. |
|
(sorry, my ability to investigate what I wanted to for this is being hampered by ill-health) TL;DR I'd really like this PR in next week's beta, and I'd really like the static linking part of #11586 to be either with it or in a subsequent 5.0 beta. In more detail:
|
I'm in favor too! Why don't you start by approving this PR?
Yes, that's right. At some point I considered using direct access as soon as If we want to go this way, maybe |
dra27
left a comment
There was a problem hiding this comment.
I'm in favor too! Why don't you start by approving this PR?
It was only waiting on whether you wanted to do anything with the systhreads Makefile now or later. However, I'm happy with this being merged as-is, thank you!
|
I'm building the MinGW-64 version right now, so as to look at the code generated by GCC... Additional comments welcome! |
|
I was hoping GCC would compile |
But, as a consequence, the overhead of calling the |
|
Well wishes to you @dra27.
I do not understand what this is trying to claim and how this relates to my results. If what matter is really runtime performance, the best-known solution on MacOS is to use the current It makes sense to use the approach in this PR for MinGW with hopes that #11586 becomes mature enough at some later point. It does not make sense to change the current MacOS implementation in this direction. Later I can send my patch to add the attribute |
No. I'm getting tired of this discussion, so I just did some quick performance measurements on a Mac M1. The time it takes to evaluate
CSE is not exercised in these measurements, but keep in mind that the code that offers the best CSE opportunities is this PR in My old tiny benchmark suite (with Knuth-Bendix and the like) shows no meaningful speed differences, they are all lost in the noise. The GC bench (that spends all its time allocating and mutating data structures) is 0.9% faster with this PR, but I would not consider this meaningful. So, unless new concerns are expressed, I propose to merge this PR in 5.0 tomorrow. |
|
It is hard to conclude something from this kind of micro-benchmark without an interpretation of what is going on. Could you please send us your micro-benchmark? Naively, the 60% overhead measured for wrapping in a function seems a bit large for the compiled code I have seen. In addition, I am curious about controlling for the impact of indirect branch prediction, which should be more favourable to
I am also unconvinced by the GC bench, but mainly because a difference of 0.9% could come from unrelated things, and we also do not know the impact of the missing CSE when thread-local access is expensive. Can you please give us a pointer to the benchmark to reproduce? In summary:
|
… MinGW macOS and MinGW don't quite support __thread variables defined in the main program and accessed from a DLL. So, special handling of caml_state is required on these platforms. This commit refactors the way this is done. - Introduce HAS_FULL_THREAD_VARIABLES, determined by configure, to say that DLLs can access __thread variables from the main program. - If not, a getter function is used for reading `caml_state` (`caml_get_domain_state` in runtime/domain.c). - DLLs never write to `caml_state`, so remove the `SET_Caml_state` macro. - This implementation works on macOS (with HAS_FULL_THREAD_VARIABLES not defined), so remove the old macOS implementation based on pthread TLS.
There's no risk of accessing across DLLs in this case. Add the IN_CAML_RUNTIME preprocessor define, which is set when producing object files in runtime/
69baac8 to
6b7c5a5
Compare
caml_state on macOS and MinGWcaml_state on macOS and MinGW
|
I agree there's room for improvement in the future, esp. for the Win32 code. On the other hand this PR and #11586 should not block the beta releases of OCaml 5.0, for which we have a tight schedule. So, I'm taking the liberty to merge this PR in trunk, hoping that @dra27 can build the static linking of winpthreads on top of it. @Octachron will then decide whether and when to integrate this work in 5.0. |
Simpler handling of `caml_state` on macOS and MinGW (cherry picked from commit 1851722)
|
I would still like to understand what is going on since @xavierleroy's benchmarks surprise me. Xavier, can you please send us the code for your benchmarks? |
|
I probably should not, as I'm pretty sure any code that I post is going to be held against me. But against my better judgment, here is the trivial benchmark for And here is the GC benchmark, adapted from Java code that has been around for a while: |
|
Thanks. |
|
It turns out that clang optimizes TLS accesses very aggressively, even at I ran my own benchmarks and it is better to use In the case where TLS is wrapped in a function in a shared library, clang does not optimize the benchmark away. I measured the time to do 1024 accesses (fluctuations +/- 10% excluding unknown effects):
In this case my results are consistent with yours. My trick to prevent false optimizations by clang is to measure the time to access 1024 different TLS variables (clang still does some optimizations in this case which you would not find in actual code, but it does not entirely optimize the benchmark away).
We see the cost of missing branch target prediction in the case of thread_local. This is consistent with what one can predict by looking at the generated code for the various implementations, so one conclusion is that it is simpler to compare implementations by looking at the generated code (or at least it is better to look at the code without doing benchmarks, than doing benchmarks without looking at the code). Details of the experiment here: https://gitlab.com/gadmm/bench_tls/ (inspired by https://blog.cloudflare.com/branch-predictor/ for cold cache measurement, esp. BTB). |
Currently,
caml_stateis defined as a C11__threadvariable, except on macOS because it doesn't allow a DLL to access a__threadvariable defined in the main program. MinGW with static linking of libwinpthread has the same limitation, see #11586 .The current workaround for macOS, based on
pthread_getspecific, is costly, and the proposed workarounds for MinGW are complicated.This PR offers a simpler and more efficient solution to this problem. It starts with the assumption that
caml_statecan always be defined with__threadand directly used within the runtime system. The only restriction is that on macOS and MinGW, DLLs must not accesscaml_statedirectly, so we make them go through a getter function defined in the main program instead. (No setter function because DLLs don't need to setcaml_state.)The getter function is declared
pureso that GCC and Clang can perform CSE on repeated calls to the getter (see https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html).The resulting code is quite simple and should be faster than the existing code on macOS (no
pthread_getspecificcalls)and at least as fast as the Win32 code proposed earlier in #11586.