This is mostly the work of Vesa Karvonen as first discussed in https://discuss.ocaml.org/t/a-hack-to-implement-efficient-tls-thread-local-storage/13264 .
Do you think DLS is useful if TLS is also present? Can the TLS and DLS implementation share the same interface (and possibly some of the implementation)? In particular, would a Line 100 in 10f1334 |
On the implementation

I agree that a
The implementation is related to the original hack of @polytypic. I must say that I would be more at ease with a proper implementation in the runtime, as a
Now that Dynarray is merged, can you use it? Note: the Domain.DLS implementation is not specific to domains, except for the

On the design

Do we want to have separate DLS and TLS values, or should we consider making the current DLS state actually thread-local? This would be even simpler to implement (I think that it suffices to save and restore |
I'm tempted by this approach. (If it can be made to work.) For one thing, this would provide a TLS implementation that is less hackish than the one proposed here (fewer Obj.magic). For another thing, it might fix the thread-related data races in the current DLS implementation (#12677). WDYT? |
This would be a breaking change, possibly for the better. With this change, we would lose the ability to share state between all the threads running on a given domain. However, it is not clear to me whether we'd need this ability. One may wish to get a unique id for the current domain. For this, we have |
|
Maybe we should consider writing a small design document to explain the current state and the proposed change, to discuss more broadly within the multicore-using community. (We could also write a PR that implements this; I'm not sure if @c-cube is interested in trying this or if this should be someone else.) |
I don't think I have much use for it, but having I have no opinion on
I think we could, but here it's a trivial resize, I don't think it needs to pull a dependency like that. The main issue imho is with the heterogeneity of the data. I think it's possible to use the classic tricks of universal values (e.g. by storing universal variants). I had a little experiment here: https://github.com/c-cube/moonpool/blob/wip-safer-tls/src/thread_local_storage.real.ml which translated to, I think, a couple percent slowdown in a TLS-heavy application (fork-join where TLS is used to find the current thread's scheduler state, into which new tasks are pushed). There's a bit of
Are |
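For readers unfamiliar with the "universal values" trick mentioned above, here is a minimal sketch of one classic encoding using extensible variants. It is not the code from the linked moonpool experiment; the module name `Univ` and its functions are hypothetical and chosen only for illustration. Each key carries its own private constructor, so values of different types can share one array without `Obj.magic`.

```ocaml
(* Hypothetical sketch of a universal type; names are illustrative only. *)
module Univ : sig
  type t                      (* a value of some hidden type *)
  type 'a key                 (* a typed handle used to inject/project *)
  val new_key : unit -> 'a key
  val inject : 'a key -> 'a -> t
  val project : 'a key -> t -> 'a option
end = struct
  type t = ..

  type 'a key = { inject : 'a -> t; project : t -> 'a option }

  let new_key (type a) () : a key =
    (* Each call creates a fresh constructor known only to this key. *)
    let module M = struct type t += V of a end in
    { inject = (fun x -> M.V x);
      project = (function M.V x -> Some x | _ -> None) }

  let inject k = k.inject
  let project k = k.project
end

(* A heterogeneous "storage array" holding an int and a string. *)
let int_key : int Univ.key = Univ.new_key ()
let string_key : string Univ.key = Univ.new_key ()
let slots = [| Univ.inject int_key 42; Univ.inject string_key "hello" |]
```

Projecting with the wrong key yields `None` rather than a type confusion, at the price of a pattern match (and one allocation per stored value), which may be where a small slowdown of the kind described above comes from.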
I fully agree. The old VM-based implementation of threads (with context switching in the bytecode interpreter, and non-blocking system calls) was "green threads", but the current OS-based implementation is of a different nature. We already have synchronization mechanisms (mutexes, condition variables, semaphores) that work equally well for domains, threads, or any mixture of both. So, having local storage that works for domains as well as threads could be nice too. |
I strongly second that: the original was written with the intent of being a fine hack that does not require any changes to the runtime system. I don't think this part of the hack is desirable moving forward.
This makes sense: having multiple facilities for domain-local or thread-local storage sounds confusing and prone to programming errors. Maybe the cleaner exit here is to have a general TLS facility. |
|
Naive question: if we wanted thread-local storage, could we implement it by just reusing the pthread facility for thread-local storage? Here is what I understand so far:
|
TLS is a scarce resource, and also not necessarily born equal - for example, the first 64 TLS slots on Windows are faster than the last 1024 slots. Not necessarily a blocker to just wrapping the underlying TLS, but on the surface it sounds sensible to have a single native TLS slot used for OCaml's TLS "array".
That may well be true, especially in native code - although DLS is backed by TLS (the domain_state is backed by TLS), it's just a normal OCaml array from the mutator's perspective (which is what is causing the problem when systhreads concurrently access DLS).
The backup thread is only responsible for taking part in the GC - it must never run the mutator. Beyond that, I think it is and should always be the case that the mutator can only move to a different thread of execution by being explicitly moved using an effect. IIUC, these days even systhreads remain tied to the same domain, even if the original domain has terminated, so you don't even get "surprises" from a systhread perspective with DLS (just races...). |
|
About using the pthread native TLS, like @gasche says: I'm no expert really, but that means every access to a given TLS key would potentially have to set up the TLS slot (which presumably includes adding it to GC roots)? And when the thread terminates, all TLS slots must be removed from the list of GC roots? Having a single array sounds safer, but then is it really a cleaner solution than a new field to store the same array in

edit: I think I would be game to try and implement adding a field to the thread struct, and primitives |
|
On
I think that would be a valid solution. If we decide that this function is performance-critical we can define a compiler primitive rather than a noalloc C function, but a noalloc C primitive should already be nice. (A compiler primitive is certainly not a prerequisite for a PR.) If you decide to make DLS thread-local, the change is very simple I think (I haven't tried!), you can just save the current |
But how do you unregister all these roots at thread exit? The POSIX threads API for TLS is quite poor, e.g. there's no clean way to iterate over all TLS (key, data) pairs for the current thread. And under the hood it's implemented generally just like we would implement it: an extensible array of values, pointed to from some kind of thread descriptor block. Better use our implementation. |
|
pthread_key_create takes an optional destructor/finalizer argument, so we could unregister the root at this point. |
the field is a regular field stored in caml_thread_struct, with a pair of C primitives to access it (just like DLS).
|
I decided to give a try to adding a field to |
It is not clear to me why we would need TLS (this PR) if DLS were made thread-local. The only issue there is the name "domain-local" storage which is in fact incorrect at that point and is actually "thread-local". |
|
Isn't there some code already in the wild that might get broken if DLS becomes thread local?
Or maybe it'd become weirdly inefficient, I'm not sure. Does domainslib use DLS?
|
I don't think so. Eio uses DLS for tracing, but the breaking change looks like it can be fixed easily. CC @talex5. Domainslib uses DLS. But this should also be OK.
I don't see how. The only cost incurred is when we switch between threads where we will now save and restore an additional word. The GC will have one additional root to scan. The |
TLS would be better here anyway. At the moment, we can only use |
|
Superseded by #12724 |
|
So what would be a nice API to offer both options to programmers? Can we do something with less overall complexity than proposing two modules with very similar interfaces and implementations? |
|
In terms of API, just having the current |
Unless I misunderstand, this is already available in the OCaml runtime as

```c
CAMLprim value caml_ml_domain_dense_id(value unused)
{
  CAMLnoalloc;
  return Val_int(Caml_state->id);
}
```

```ocaml
external domain_dense_id : unit -> int = "caml_domain_ml_dense_id" [@@noalloc]
```
|
|
I think that it is perfectly reasonable to provide basic building blocks in the compiler distribution in some way or another. On the other hand, I like it better when the advanced stuff is kept in places that are clearly marked as advanced and not "we expect everyone to use this regularly". Having a submodule of Ref dedicated to shared state concurrency would feel unfortunate to me, while having a Ref submodule of Domain (or Thread for that matter) feels more appropriate. (And yes, ideally we would have good implementations of common, higher-level approaches to concurrency available in the stdlib someday. But we need more experiments before we can do this. In particular I don't think that anyone should regret not having integrated, say, Domainslib, as my understanding is that people keep finding better ways to do these things, which would not necessarily be easy to retrofit.) |
|
I didn’t have anything in mind particularly for split from parent. I guess I don’t understand what you mean by
In order to distinguish thread creation from domain creation, would your proposal store the domain id in the value so that the child can identify whether to initialise a new value? |
|
There would be a primary data structure, a concurrent map indexed by domain id. The TLS-initializer would retrieve the value from there and cache it. The first access does not need to be efficient. In case the value does not exist in the primary structure, it would create a new one by calling a supplied DLS-initializer for the key. |
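To make the "primary data structure" part of this proposal concrete, here is a minimal sketch, assuming a mutex-protected hash table as the concurrent map. The module name `Slow_dls` is hypothetical, and the TLS caching layer is deliberately left out because it depends on a `Thread.TLS` module that is only proposed here. The sketch relies on `Mutex.protect`, which assumes OCaml 5.1 or later.

```ocaml
(* Hypothetical sketch: a slow domain-local cell backed by a shared table
   indexed by domain id. A TLS initializer would call [get] once per
   thread and cache the result. *)
module Slow_dls : sig
  type 'a t
  val make : init:(unit -> 'a) -> 'a t
  val get : 'a t -> 'a
end = struct
  type 'a t = {
    init : unit -> 'a;            (* the supplied DLS-initializer *)
    lock : Mutex.t;               (* protects [table] *)
    table : (int, 'a) Hashtbl.t;  (* domain id -> domain-local value *)
  }

  let make ~init = { init; lock = Mutex.create (); table = Hashtbl.create 8 }

  let get t =
    let id = (Domain.self () :> int) in
    (* [init] runs while the lock is held, so it must not call [get] on
       the same cell. Entries of terminated domains are never removed in
       this sketch. *)
    Mutex.protect t.lock (fun () ->
      match Hashtbl.find_opt t.table id with
      | Some v -> v
      | None ->
        let v = t.init () in
        Hashtbl.add t.table id v;
        v)
end

(* Each domain gets its own number, assigned on first access. *)
let next = Atomic.make 0
let cell = Slow_dls.make ~init:(fun () -> Atomic.fetch_and_add next 1)
let in_main_first = Slow_dls.get cell
let in_main_again = Slow_dls.get cell
let in_other = Domain.join (Domain.spawn (fun () -> Slow_dls.get cell))
```

Every access takes the lock, which is why this structure is only meant for the first, slow lookup; repeated accesses would go through the per-thread cache.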
So what is the logic here ? We are going to make terrible names, inconsistent signatures and an untidy |
|
"key" is an established name for thread-local storage, e.g. in the
pthreads API. It's also similar to things like `Hmap.Key`, so that seems
fairly consistent to me? Documentation does need to emphasize that these
keys must only be created statically and not in a loop, but otherwise I
don't find the naming particularly bad.
|
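No, these keys act on explicit dictionaries that you manipulate yourself. There's no dictionary here. "ref" is an established name for reference cells which is what these things are. I don't see the need to add new terminology here. And since nowadays any argument can be won by mentioning rust, here [we have](https://docs.rs/ref_thread_local/latest/ref_thread_local/). |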
|
On Thu, 30 Nov 2023, Daniel Bünzli wrote:

> "ref" is an established name for reference cells which is what these
> things are. I don't see the need to add new terminology here.

A ref is something the user manipulates directly. These are kind-of-globals.

> And since nowadays any argument can be won by mentioning rust, here [we have](https://docs.rs/ref_thread_local/latest/ref_thread_local/).

Do you mean https://doc.rust-lang.org/std/thread/struct.LocalKey.html ?
(in the stdlib, not a random crate)
|
It feels useful to know this fact generally in |
|
There are several threads of conversation going on here and on #12724. I'll try to summarise them here.

Should DLS be present alongside TLS?

There is at least one request for DLS to be retained: #12719 (comment). The concrete use cases are:
The DLS implementation currently isn't thread-safe (#12677), and there is a sketch to make it thread-safe using

Should DLS be present alongside TLS? -- No.

If the answer to the previous question is no, then one can simulate DLS by either
IIUC, both solutions require some effort from the programmer to utilise them correctly. Both solutions will require a level of indirection if the domain-local value is not read-only -- the value stored is a mutable reference to the actual value.

Should DLS be present alongside TLS? -- Yes.

If we agree that we want DLS alongside TLS, it may be easiest to implement them directly in the compiler using distinct
I personally would prefer this direction.

Ergonomics

Then there is the question of the API for DLS and TLS and whether this would be a good opportunity to introduce a |
|
Thanks for the great summary. A few points:
The interfaces for TLS and DLS might end up different for reasons of intra-domain synchronisation. |
I'm trying to understand this comment. If either the initialisation or the
Can you explain why?
Indeed. In my experience with programming with DLS, rarely is there a need for |
|
The access to shared state in init is a matter of synchronizing this shared state. However the creation of shared state by init needs synchronisation for DLS (e.g. two threads racing to initialize some mutable data structure). So I am curious to hear what are the synchronisation needs for init based on various examples. The situation is similar to the synchronisation problem for Lazy, but restricted to a single domain. (Hence, a bit like the problems of Lazy in OCaml 4 with systhreads—with the difference that the only use-cases for DLS are with multiple threads, so it has to be dealt with.) If you remember the proposition for a thread-safe design for Lazy, various synchronization methods were supported. |
|
If I understand correctly:
|
|
@gasche — you have a point that domains are few, and initializations are few (since DLS slots are few). |
No, I think the problem of single-domain thread-safety is still there, whether the programmer uses DLS.set or mutations of the value itself. (The point is then to use cheaper forms of synchronisation than cross-domain ones: nanomutexes, reasoning on polling locations, etc.) |
Why is this general concurrent programming problem of concern to DLS? If the programmer uses mutable state as a value in DLS, which may be accessed by several domains, then the programmer has to include the necessary synchronization. |
|
We agree on this. I was reacting to "due to the limited interleaving of threads", which does not eliminate the concern for races. But this concern is on the user. The difference between set and init is intriguing. For init we do not know where to store a domain-local mutex (for instance). But one could use a unique mutex for all domains and store it in the closure of init. This can be used as a solution to bootstrap per-domain nanomutexes. |
Yes. Indeed. Agreed. The fact that we have fewer chances of races / concurrency bugs due to limited concurrency does not make the problem go away. That said, limited concurrency on rare operations may mean that we can have more expensive synchronization on such operations (if that were possible).

How do we make forward progress on this? We seem to agree that having TLS is useful. We also seem to head towards the idea that DLS is useful. What interface it should have, whether to implement it on top of TLS, whether to have a different API from TLS, and whether it can be implemented in a concurrency-safe way are still open questions. Would it make sense to start working towards a design such that we aim to have:
I'm a little reluctant to include |
Personally I remain unconvinced that it is worth offering both TLS and DLS. I think that we could have a single interface that does TLS, and make it expressive enough to let users define domain-constant keys easily -- and without an unacceptable performance cost. The overall system would be simpler and easier to think about. If there is an issue with this design, I suspect that it would come from the (non-)interaction with other cooperative layers on top of domains: we keep making Thread a first-class citizen in the runtime, does this have a negative impact on Eio and other such libraries? |
With my moderator hat on: I heard that people were unhappy with the tone of this discussion, so maybe we could just kill it off for now. |
|
To substantiate @gasche's point with a bit of code, would the following approach work? The following functor

```ocaml
module type CELL = sig
  type 'a t
  val make : init:(unit -> 'a) -> 'a t
  val get : 'a t -> 'a
end

module DLS_fast (* fast domain-local storage *)
    (TLS : CELL) (* a fast thread-local storage *)
    (DLS_slow : CELL) (* a slow domain-local storage *)
  : CELL
= struct
  type 'a t = 'a TLS.t
  let get = TLS.get
  let make ~init =
    let dls_key = DLS_slow.make ~init in
    TLS.make ~init:(fun () -> DLS_slow.get dls_key)
end
```

If the user needs |
|
This is very strange, this comment does not appear for me. |
|
@polytypic regarding #12719 (comment): consider avoiding screenshots of textual conversations, as they are inaccessible for folks with disabilities. |
I would consider Threads (systhreads) as first-class citizens in the runtime. Domains don't replace all uses of threads. This was previously discussed here: ocaml-multicore/ocaml-multicore#100 (comment).

Some external calls do block, and that's unavoidable. Offloading them to a systhread is the right approach. Consider that you are reading megabytes from disk and loading them into memory. This operation may spend a considerable amount of time in the kernel. You'd like to be able to do other things in the meantime, such as running other user-level threads. You really don't want to assign a different domain to do this task: you don't gain anything from the ability to utilise another core in userspace, since the operation will be blocked in the kernel, and you suffer the cost (GC, synchronization) of having an additional domain. Having a systhread on the same domain is the right way to do this operation. IIUC, this is how

Taking a step back from OCaml, other language runtimes also have a separation between "objects representing units of parallelism" and "threads that are able to get hold of this object to execute", in addition to having "lightweight user-level threads". Two prominent examples are Go and GHC Haskell. They also do it for the same reason -- handling synchronous calls without losing parallelism. Go has a notion of a goroutine G, a machine M and a processor P: https://go.dev/src/runtime/proc.go. They can roughly be mapped to OCaml -- G ~> Fibers, M ~> systhread, P ~> Domains. GHC Haskell has Haskell Execution Contexts (HECs), threads (created through Control.Concurrent.fork*), and OS threads (managed by the runtime). They roughly map to OCaml as follows -- HECs ~> Domains, OS thread ~> systhread, thread ~> fiber. The difference between OCaml and GHC/Go is that their compilers provide user-level threads as a primitive, which ours doesn't. Moreover, unlike in OCaml, the mappings of thread/G to OS thread/M and of OS thread/M to HEC/P are managed by the runtime.
I believe that there is utility in treating systhreads as a first-class citizen in the runtime.
I agree with this. There is value in supporting DLS natively rather than over TLS. The implementation in the compiler is no more complex; we can implement that with a single primitive lowered through the compiler. I'd be happy to implement this PR myself, building on top of #12724 and @polytypic's thread-safe DLS implementation. |
I agree. The argument for a separate DLS mechanism seems to require that you have a use case where:
I'm not sure I buy the existence of this use case, and it's certainly a thin argument for adding a feature to the compiler and stdlib. I'm also pretty sympathetic to getting rid of |
|
I would prefer to have a simple design with limited implementation complexity and maintenance surface. The current DLS implementation is a source of maintenance work already. We had to extend it with inheritance (
I am not excited at the idea of repeating similar design and implementation work anew for a separate TLS module. The design space is a bit different, the constraints are a bit different, and people suggest maybe letting go of some features (inheritance). In the worst scenario we end up with two non-trivial interfaces that we need to support painfully over time, with little to no sharing of code, documentation or behavior between them, and that is twice the pain. I don't want this. Personally I think that a reasonable compromise would be as follows:
@polytypic, does that seem reasonable to you? |
|
That seems reasonable to me.
Essentially, this means taking #12724 to completion.
Do you mean that TLS is not super fast or that implementing DLS on TLS is not super fast? The former isn't true. |
What I have in mind I guess is that the current use-cases of Domain.DLS that I know of (those in the stdlib) are not super performance-sensitive. (The most sensitive use is for global PRNG state.) For all of those it would be acceptable to move from DLS to TLS, even if we have a few thousands of threads. |
This PR adds `Thread.TLS`, a thread-local counterpart to `Domain.DLS`.
The discussion that started this is
https://discuss.ocaml.org/t/a-hack-to-implement-efficient-tls-thread-local-storage/13264 .
I personally think that TLS is more useful than DLS because threads are more
flexible and convenient than domains (mostly because you can start many more
threads than domains).
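For context, here is a short example of the existing `Domain.DLS` interface that the proposed module is described as a counterpart to. It uses only `Domain.DLS` from the standard library (OCaml 5). The assumption that `Thread.TLS` would expose a similar `new_key` / `get` shape comes from the description above and is not something this example demonstrates.

```ocaml
(* A per-domain counter: each domain lazily gets its own [ref 0]. *)
let counter : int ref Domain.DLS.key =
  Domain.DLS.new_key (fun () -> ref 0)

(* Increment the calling domain's counter and return its new value. *)
let bump () =
  let r = Domain.DLS.get counter in
  incr r;
  !r

(* Two bumps in the main domain, then one bump in a fresh domain. *)
let main_after_two =
  ignore (bump ());
  bump ()

let other_after_one = Domain.join (Domain.spawn bump)
```

The fresh domain starts from its own initial value, so its counter is independent of the main domain's. With the current DLS, all systhreads running on the same domain would share one counter, which is the behaviour a thread-local variant would change.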