Skip to content

Add C function caml_thread_has_lock#11138

Closed
gadmm wants to merge 2 commits intoocaml:trunkfrom
gadmm:caml_try_get_caml_state2
Closed

Add C function caml_thread_has_lock#11138
gadmm wants to merge 2 commits intoocaml:trunkfrom
gadmm:caml_try_get_caml_state2

Conversation

@gadmm
Copy link
Copy Markdown
Contributor

@gadmm gadmm commented Mar 27, 2022

From the commit log:

The function caml_thread_has_lock can be called from any C thread and returns true if the thread belongs to a domain and holds the lock of its domain.

It necessary for implementing C functions which are generic on whether the domain lock is held or not. In case caml_thread_has_lock returns false, the programmer can for instance:

  • report an error
  • try to acquire the domain lock
  • do something else (e.g. delay an operation until the lock is acquired)

One application is for the OCaml-Rust FFI: in some situations, it is not possible to enforce with static typing that some function runs only when the runtime lock is held (notably destructors). This function lets us implement:

  • a dynamic check of whether the lock is held (which plays well with the lifetime-based GC safety),
  • a way to acquire the domain lock only if it is not already held.

In OCaml 4, it is possible to implement this feature on the programmer-side (using internals) by customizing the enter/leave runtime hooks. Alas, this strategy does not work in OCaml 5. One would have to somehow install hooks after systhread is loaded, but before the second domain is spawned (after which hooks can no longer be modified), and this is very hard to do. So we need a native solution for OCaml 5.


Use-cases

  • We would use it as described above in the OCaml-Rust FFI. We suspect that this function has not appeared before in the C FFI because of a difference in code scale, and Rust poses unique challenges (the type system does not always lets us track the lock ownership, notably in destructors).

  • We would like to use it in the implementation of ocaml-boxroot to implement lock-free deallocation. If the lock is held, deallocation of a boxroot from the same domain is immediate and cheap, otherwise it is a "remote" deallocation involving a lock-free auxiliary freelist.

  • An old discussion already inquired about the existence of this function: https://discuss.ocaml.org/t/determining-if-the-runtime-system-is-currently-acquired-from-c/6934.

Notes

  • For OCaml 4, we can use the solution sketched above, but I can also backport the feature upon popular request.

  • No pressure, but having it in OCaml 5.0 would avoid a gap in the availability of the feature, assuming we implement it for OCaml 4 using internals.

  • This implementation adds a field to the domain state. It is a simple implementation. I experimented with an implementation that recycles internals and avoids a new field (see gadmm@4cff734b5), but the strategy did not work without systhreads. I just thought of another implementation that stores the info in a thread-local variable, not sure which one is best (edit: it's more complicated due to OSX TLS issues and the separate compilation of systhreads).

(cc @gasche, @jhjourdan)

Copy link
Copy Markdown
Contributor

@sadiqj sadiqj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation for this seems clear and the implementation looks fine. Only a minor comment about the naming in st_stubs.c.

caml_thread_restore_runtime_state();
}

static int thread_has_lock(void)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this isn't consistent with the other hooks' default implementations and prefixed with caml_?

Copy link
Copy Markdown
Contributor Author

@gadmm gadmm Apr 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not realise there was such a convention, to me this was just a static function, so without a prefix being needed.

@gadmm gadmm marked this pull request as draft April 3, 2022 22:18
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 3, 2022

Thanks for the review. It turns out that for ocaml-boxroot, we cannot afford the overhead of using something else than a boolean in the TLS (which lets the lock status be obtained with a single instruction on x86-64), or at least something not much more expensive. So I would like to propose an implementation along these lines, without changing the rationale and public API you have already reviewed (it is fine for us to access a boolean TLS variable protected with CAML_INTERNALS since we already rely on internals; the normal way to access it remains a public function caml_thread_has_lock). This should also give a much simpler implementation at least on platforms other than MacOS.

Now I am in the rabbit hole of TLS implementations on MacOS. I am looking for the best way to implement it there (and I have no access to a Mac, even Godbolt misses it). I read that C++11 thread_local (so I assume also GCC __thread and C11 _Thread_local) is supported since XCode 8. Is there anyone who I could ask questions to on this topic? My questions are probably also related to the implementation of Caml_state on MacOS which I would like to understand.

@gadmm gadmm force-pushed the caml_try_get_caml_state2 branch from a29634c to c7d7d25 Compare April 3, 2022 23:39
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 3, 2022

Here is the dead-simple implementation that ticks all the boxes for me (except MacOS implementation for now, which CI will probably complain about).

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 4, 2022

MacOS CI passes, so I un-draft this PR. It would be nice if someone familiar with the TLS restrictions with MacOS could please have a look, though (to explain why this seems to work but for Caml_state you need to use pthread_gespecific).

@gadmm gadmm marked this pull request as ready for review April 4, 2022 17:00
@xavierleroy
Copy link
Copy Markdown
Contributor

I find this PR perplexing. Neither POSIX threads nor the Win32 API make it easy to check "do I own this mutex already?". Maybe it's a weakness of these APIs, but maybe it's a sign that this query is rarely needed, or even that it is an "anti-pattern" that should not be encouraged.

Concerning the per-domain master lock, I agree there could be a need in C stub functions to acquire it even if we hold it already. This could be achieved by using a recursive lock instead of an "error-checking" lock. Recursive locks are common enough (e.g. in Java) to not be an anti-pattern.

But the need here (in the ocaml-boxroot project) seems to be different and to call for completely different code paths depending on whether the master lock is owned by the calling thread or not. That's a completely new situation to me. Could we have more explanations?

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 4, 2022

  • The situation with a global/domain lock might be specific. The way mutexes are usually used (e.g. held for the shortest time possible), such a test would be useless because you are supposed to keep track of the locking (same reason why recursive mutexes have a bad rep). It is also fairly straightforward to implement by hand if you have access to the lock. So I think it is not about mutex APIs, but about having a domain lock which is ambient and represents a richer abstraction (e.g. in this case there can be a performance incentive to not release the lock immediately), and maybe about programming language questions (e.g. interaction between two expressive programming languages where programmers can get creative, which makes it hard to determine in advance an "intended" use of the API).

  • Regarding recursive locking of the runtime lock, I see that it has been discussed before (Make Ocaml global lock reentrant from the same thread #5299). I just found Acquisition of runtime lock #4787 which also requests the current feature. Both give examples involving the need to program generically in terms of the lock status. For those, a recursive lock might be enough, but I am also convinced by the arguments against recursive mutexes. caml_thread_has_lock does not seem necessary for these situations, but I am now thinking that having it would be an alternative to having recursive locking as a default behaviour. It might be better to have error-checking as a default, plus a workaround for people who really mean it.

  • In case you are in a destructor, you cannot know whether you already have the lock. Also you do not want to lock in a destructor in general. You might even be in a thread that does not belong to any domain! So there is already an issue with the idea of doing malloc(size_of(value)) and register the pointer as a generational global root to store it in a foreign data structure, as people already do in C, because you might not always be able to clean that up. (I think that this is not just a type system limitation of Rust not letting you ensure that the destructor only runs when the runtime capability is held, because such a constraint would be too severe; but a symptom of a more general problem.)

  • Now on to specific boxroot needs. Following concurrent allocator designs, you want to distinguish local deallocations from remote deallocations, for performance (the latter need atomic operations and delay the effective deallocation). Moreover, modern allocators tend towards having per-CPU caches rather than per-thread caches (which is expensive if you have a lot of threads), so the domain abstraction is a good fit that might save a lot of headaches. Another reason to have domain-local data structures... is that scanning hooks run per-domain in the first place. So in the case of boxroot, the question “is the deallocation local or remote?” amounts to “do I hold the lock for this domain?”. This also solves the problem that we do not want to acquire the domain lock inside a destructor.

  • Summing up, the point of view of this PR is that there are legitimate situations where you might be coming from code paths that both own and do not own the domain lock, and also that the subsequent step is not necessarily "acquire the domain lock". However, we can discuss whether the concept of boxroot is becoming too tied up with internal runtime details. We could also have a tighter coupling between OCaml and boxroot, such as introducing an "internal" used only by boxroot. (Of course, a solution is to integrate boxroot with OCaml, but this discussion is for the RFC.)

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 5, 2022

To explain the boxroot situation differently, you could imagine that all deallocations are initially "remote", and that "local" deallocations are merely introduced as an optimisation for the common case. This avoids seeing it as a "completely different" code path.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 5, 2022

A possible alternative is to strengthen the runtime hooks to let us implement (an approximation of) the feature on our side (this would involve making more hooks thread-safe and making systhreads better respect other people's hooks, but it seems doable).

@gasche
Copy link
Copy Markdown
Member

gasche commented Apr 5, 2022

A possible alternative is to strengthen the runtime hooks to let us implement (an approximation of) the feature on our side (this would involve making more hooks thread-safe and making systhreads better respect other people's hooks, but it seems doable).

I think this is a valuable idea. Long-term, I think we will push to have boxroot (or something simliar) inside the runtime, but we are also learning that the ability to experiment with runtime extensions through hooks is quite nice -- it lets people deploy a runtime extension in a much shorter timeframe, and provide interesting experimental data for runtime evolutions.

@xavierleroy
Copy link
Copy Markdown
Contributor

Thanks @gadmm for all the explanations. I need more time to chew on all this, esp. since I'm neck deep in other things. Other opinions are most welcome.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 6, 2022

I think this is a valuable idea. Long-term, I think we will push to have boxroot (or something simliar) inside the runtime, but we are also learning that the ability to experiment with runtime extensions through hooks is quite nice -- it lets people deploy a runtime extension in a much shorter timeframe, and provide interesting experimental data for runtime evolutions.

To be clear, the solution I am thinking about might be ok-ish for boxroot, but will not constitute a solution for the general feature under discussion. On the plus side, it would let boxroot have the best possible emitted code for TLS.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 6, 2022

Looking into jemalloc's implementation, they use an extern __thread variable for the critical thread-local data, and they have similar performance/compatibility constraints. There is a different implementation for Windows but as far as I can see the MacOS implementation is the generic one. So I have to ask again, what is the problem with TLS on MacOS that requires a different implementation of Caml_state? (I can add this question to my review of the multicore PR, but this will also concern the usage of TLS in this PR.)

@gadmm gadmm mentioned this pull request Apr 6, 2022
@xavierleroy
Copy link
Copy Markdown
Contributor

+1 for a frank discussion of TLS in macOS as of today, and removal of macOS-specific cruft that may have been needed in the past but may no longer be needed.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 11, 2022

cc @ctk21, @bluddy

@kayceesrk
Copy link
Copy Markdown
Contributor

kayceesrk commented Apr 15, 2022

@gadmm

Here is the summary of the issue: ocaml-multicore/ocaml-multicore#57 (comment). The corresponding LLVM bug report is here: llvm/llvm-project#26195 which has been closed without a fix.

FWIW, on my machine, the same example in the LLVM bug report does not fail if __thread is replaced with thread_local. My machine has the following clang and xcode versions:

% clang --version
Apple clang version 13.1.6 (clang-1316.0.21.2.3)
Target: x86_64-apple-darwin21.3.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
% /usr/bin/xcodebuild -version 

Xcode 13.3.1
Build version 13E500a

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 15, 2022

The LLVM bug report sounds more like "maybe fixed in a new version". It seems that the thread_local solution was not available at the time you introduced the pthread_getspecific solution (since XCode 8 was being released at around the same time). It is worth a try. It might not be faster than pthread_getspecific, though (see https://maskray.me/blog/2021-02-14-all-about-thread-local-storage#macos-tls).

So benchmarks are in order before one can simplify the implementation using thread_local... Looking at the compiled asm of the executable should be instructive too, to see if there is a link-time optimisation (this is how it is efficient on linux for non-dynamically-loaded code despite -fPIC).

Interestingly, they have reserved a couple of static key values for system libraries which have an optimised inline implementation of pthread_getspecific. So it is possible to get the efficient 1-instruction implementation using the segment register, but only if you work at Apple.

For the current PR I think I will propose a version using _Thread_local.

@gadmm gadmm force-pushed the caml_try_get_caml_state2 branch from c7d7d25 to 46d9f88 Compare April 15, 2022 15:55
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Apr 15, 2022

Here is a version that solves TLS issues on MacOS. All we need now is an agreement on the design and motivation.

gadmm added 2 commits April 18, 2022 23:30
This function can be called from any C thread and returns true if the
thread belongs to a domain and holds the lock of its domain.

It necessary for implementing C functions which are generic on whether
the domain lock is held or not. In case `caml_thread_has_lock` returns
false, the programmer can for instance:
- report an error
- try to acquire the domain lock
- do something else (e.g. delay an operation until the lock is
  acquired)

One application is for the OCaml-Rust FFI: in some situations, it is
not possible to enforce with static typing that some function runs
only when the runtime lock is held (notably destructors). This
function lets us implement:
- a dynamic check of whether the lock is held (which plays well with
  the lifetime-based GC safety),
- a way to acquire the domain lock only if it is not already held.

In OCaml 4, it is possible to implement this feature on the
programmer-side (using internals) by customizing the enter/leave
runtime hooks. Alas, this strategy does not work in OCaml 5. One would
have to somehow install hooks after systhread is loaded, but before
the second domain is spawned (after which hooks can no longer be
modified), and this is very hard to do. So we need a native solution
for OCaml 5.
@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented May 22, 2022

An alternate implementation is proposed at #11272.

@gadmm
Copy link
Copy Markdown
Contributor Author

gadmm commented Jul 16, 2022

Since this alternate implementation is merged, I am closing this PR. Thank you for the discussion.

Also, the result of the discussion about TLS on MacOS is that Caml_state might be implemented using thread_local instead of explicit TLS (but that benchmarking/looking at the generated code is necessary to know whether this is better). I will let you open an issue referencing the discussion if this is important.

@gadmm gadmm closed this Jul 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants