Skip to content

add thread local storage#12719

Closed
c-cube wants to merge 5 commits intoocaml:trunkfrom
c-cube:add-thread-local-storage
Closed

add thread local storage#12719
c-cube wants to merge 5 commits intoocaml:trunkfrom
c-cube:add-thread-local-storage

Conversation

@c-cube
Copy link
Copy Markdown
Contributor

@c-cube c-cube commented Nov 5, 2023

This PR adds Thread.TLS, a thread-local pendant to Domain.DLS.

The discussion that started this is
https://discuss.ocaml.org/t/a-hack-to-implement-efficient-tls-thread-local-storage/13264 .
I personally think that TLS is more useful than DLS because threads are more
flexible and convenient than domains (mostly because you can start many more
threads than domains).

@kayceesrk
Copy link
Copy Markdown
Contributor

I personally think that TLS is more useful than DLS because threads are more
flexible and convenient than domains (mostly because you can start many more
threads than domains).

Do you think DLS is useful if TLS is also present?

Can the TLS and DLS implementation share the same interface (and possibly some of the implementation)? In particular, would a split_from_parent argument for TLS.new_key not be useful?

val new_key : ?split_from_parent:('a -> 'a) -> (unit -> 'a) -> 'a key

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 6, 2023

On the implementation

I agree that a split_from_parent key would be useful, for the same reasons that it is useful for domains, in particular to seed thread-local random generator state.

The implementation is related to the orginal hack of @polytypic. I must say that I would be more at ease with a proper implementation in the runtime, as a domain_state field that is saved in the caml_thread_struct data -- see otherlibs/systhreads/st_stubs.c

Now that Dynarray is merged, can you use it?

Note: the Domain.DLS implementation is not specific to domains, except for the get_dls_state and set_dls_state primitives. Instead of duplicating the implementations, I wonder if it would be possible to define it as a functor over those functions, or even tweak those to take a flag asking for either the DLS or the TLS state (the performance may be better with a flag if the functor application do not inline; to be tested).

On the design

Do we want to have separate DLS and TLS values, or should we consider making the current DLS state actually thread-local? This would be even simpler to implement (I think that it suffices to save and restore Caml_state->dls_root in caml_thread_struct), and it may actually be the right thing to do. On the other hand, this would be yet another case of special-casing the Threads module as a blessed green-thread implementation on top of domains, enabling features natively that are not available to other libraries.

@xavierleroy
Copy link
Copy Markdown
Contributor

should we consider making the current DLS state actually thread-local? This would be even simpler to implement (I think that it suffices to save and restore Caml_state->dls_root in caml_thread_struct), and it may actually be the right thing to do.

I'm tempted by this approach. (If it can be made to work.) For one thing, this would provide a TLS implementation that is less hackish than the one proposed here (fewer Obj.magic). For another thing, it might fix the thread-related data races in the current DLS implementation (#12677). WDYT?

@kayceesrk
Copy link
Copy Markdown
Contributor

should we consider making the current DLS state actually thread-local?

This would be a breaking change, possibly for the better. With this change, we would lose the ability to share state between all the threads running on a given domain. However, it is not clear to me whether we'd need this ability. One may wish to get a unique id for the current domain. For this, we have Domain.self. So I'm in favour of this change.

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 6, 2023

Maybe we should consider writing a small design document to explain the current state and proposed change, to discuss more broadly within the multicore-using community. (We could also write a PR that implement this; I'm not sure if @c-cube is interested in trying this or this should be someone else.)

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Nov 6, 2023

@kayceesrk

Do you think DLS is useful if TLS is also present?

I don't think I have much use for it, but having Random's global state be domain-local can kind of make sense, I suppose? I'm just not sure there's an upside to using DLS if we had good TLS (it'd make Random thread-safe which is a nice plus.)

I have no opinion on split_from_parent but why not, any overhead would just be at creation time.


@gasche

Now that Dynarray is merged, can you use it?

I think we could, but here it's a trivial resize, I don't think it needs to pull a dependency like that.

The main issue imho is with the heterogeneity of the data. I think it's possible to use the classic tricks of universal values (e.g. by storing universal variants). I had a little experiment here: https://github.com/c-cube/moonpool/blob/wip-safer-tls/src/thread_local_storage.real.ml which translated to, I think, a couple percent slowdown in a TLS-heavy application (fork-join where TLS is used to find the current thread's scheduler state, into which new tasks are pushed). There's a bit of Obj remaining to check whether the array is initialized for the current thread but it could probably use a dummy value somewhere instead?

On the other hand, this would be yet another case of special-casing the Threads module as a blessed green-thread implementation on top of domains, enabling features natively that are not available to other libraries.

Are Thread.t green threads?? I always thought of them as platform native threads (ie. pthread typically), but restricted by their exclusive access to the GC. They're already blessed as special because I don't think you can really implement the same thing on top of domains + effects (preemption, being scheduled independently by the OS, blocking syscalls, etc.).

@xavierleroy
Copy link
Copy Markdown
Contributor

Are Thread.t green threads?? I always thought of them as platform native threads (ie. pthread typically), but restricted by their exclusive access to the GC.

I fully agree. The old VM-based implementation of threads (with context switching in the bytecode interpreter, and non-blocking system calls) was "green threads", but the current OS-based implementation is of a different nature.

We already have synchronization mechanisms (mutexes, condition variables, semaphores) that work equally well for domains, threads, or any mixture of both. So, having local storage that works for domains as well as threads could be nice too.

@abbysmal
Copy link
Copy Markdown
Contributor

abbysmal commented Nov 7, 2023

The implementation is related to the orginal hack of @polytypic. I must say that I would be more at ease with a proper implementation in the runtime, as a domain_state field that is saved in the caml_thread_struct data -- see otherlibs/systhreads/st_stubs.c

I strongly second that: the original was written with the intent of being a fine hack that does not require any changes to the runtime system. I don't think this part of the hack is desirable moving forward.

We already have synchronization mechanisms (mutexes, condition variables, semaphores) that work equally well for domains, threads, or any mixture of both. So, having local storage that works for domains as well as threads could be nice too.

This make sense, having multiple facilities for domain or thread local storage sounds confusing and prone to programming errors, maybe the cleaner exit here is to have a general TLS facility.

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 7, 2023

Naive question: if we wanted thread-local storage, could we implement by just reusing the pthread facility for thread-local storage?

Here is what I understand so far:

  • There may be performance considerations, it might be that our current implementation of DLS is faster than the pthread TLS on some systems.
  • This assumes that OCaml user code on an OCaml user thread only ever runs on the same system thread. I am not sure that this is the case, as domains create some administrative worker threads for each domain. In particular, I wonder if the 'backup thread' of each domain could end up running mutator code.
  • It is unclear that we could easily offer the current API, in particular the split_from_parent function, if we just reused thread-local storage. I think we would need a small book-keeping layer on top of the pthread facility. (Not a blocker.)

@dra27
Copy link
Copy Markdown
Member

dra27 commented Nov 7, 2023

Naive question: if we wanted thread-local storage, could we implement by just reusing the pthread facility for thread-local storage?

TLS is a scarce resource, and also not necessarily born equal - for example, the first 64 TLS slots on Windows are faster than the last 1024 slots. Not necessarily a blocker to just wrapping the underlying TLS, but on the surface it sounds sensible to have a single native TLS slot used for OCaml's TLS "array".

Here is what I understand so far:

  • There may be performance considerations, it might be that our current implementation of DLS is faster than the pthread TLS on some systems.

That may well be true, especially in native code - although DLS is backed by TLS (the domain_state is backed by TLS), it's just a normal OCaml array from the mutator's perspective (which is what is causing the problem when systhreads concurrently access DLS).

  • This assumes that OCaml user code on an OCaml user thread only ever runs on the same system thread. I am not sure that this is the case, as domains create some administrative worker threads for each domain. In particular, I wonder if the 'backup thread' of each domain could end up running mutator code.

The backup thread is only responsible for taking part in the GC - it must never run the mutator. Beyond that, I think it is and should always be the case that the mutator can only move to a different thread of execution by being explicitly moved using an effect. IIUC, these days even systhreads remain tied to the same domain, even if the original domain has terminated, so you don't even get "surprises" from a systhread perspective with DLS (just races...).

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Nov 7, 2023

About using the pthread native TLS, like @gasche says: I'm no expert really, but that means every access to a given TLS key would potentially have to setup the TLS slot (which presumably includes adding it to GC roots)? And when the thread terminates, all TLS slots must be removed from the list of GC roots?

Having a single array sounds safer, but then is it really a cleaner solution than a new field to store the same array in caml_thread_struct?

edit: I think I would be game to try and implement adding a field to the thread struct, and primitives caml_tls_{g,s}et to access it, if it's a valid solution.

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 7, 2023

On get you would call pthread_getspecific; if no value is set for the current thread you get a NULL, and at this point you have to compute an initial value, set the key with it, and, yes, register it as a GC root. (This could use boxroots if available.)

edit: I think I would be game to try and implement adding a field to the thread struct, and primitives caml_tls_{g,s}et to access it, if it's a valid solution.

I think that would be a valid solution. If we decide that this function is performance-critical we can define a compiler primitive rather than a noalloc C function, but a noalloc C primitive should already be nice. (A compiler primitive is certainly not a prerequisite for a PR.)

If you decide to make DLS thread-local, the change is very simple I think (I haven't tried!), you can just save the current dls_root -- this would be a really small change I think. If you want to have a separate TLS field, you have to add a field to both Caml_state and caml_thread_struct and add new access functions, etc., so this is more work. If you decide to implement both, you can have two PRs :-)

@xavierleroy
Copy link
Copy Markdown
Contributor

xavierleroy commented Nov 7, 2023

On get you would call pthread_getspecific; if no value is set for the current thread you get a NULL, and at this point you have to compute an initial value, set the key with it, and, yes, register it as a GC root.

But how do you unregister all these roots at thread exit? The POSIX threads API for TLS is quite poor, e.g. there's no clean way to iterate over all TLS (key, data) pairs for the current thread. And under the hood it's implemented generally just like we would implement it: an extensible array of values, pointed to from some kind of thread descriptor block. Better use our implementation.

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 7, 2023

pthread_key_create takes an optional destructor/finalizer argument, so we could unregister the root at this point.

the field is a regular field stored in caml_thread_struct, with a pair
of C primitives to access it (just like DLS).
@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Nov 8, 2023

I decided to give a try to adding a field to caml_thread_struct alongside a pair of C primitives. The change isn't massive, but I'm not sure I updated all the necessary places in st_stubs.c … Mutating the value in caml_thread_tls_set is done without any barrier because I assumed that the root will be properly marked in caml_thread_scan_roots.

@kayceesrk
Copy link
Copy Markdown
Contributor

If you decide to make DLS thread-local, the change is very simple I think (I haven't tried!), you can just save the current dls_root -- this would be a really small change I think. If you want to have a separate TLS field, you have to add a field to both Caml_state and caml_thread_struct and add new access functions, etc., so this is more work. If you decide to implement both, you can have two PRs :-)

It is not clear to me why we would need TLS (this PR) if DLS were made thread-local. The only issue there is the name "domain-local" storage which is in fact incorrect at that point and is actually "thread-local".

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Nov 8, 2023 via email

@kayceesrk
Copy link
Copy Markdown
Contributor

Isn't there some code already in the wild that might get broken if DLS becomes thread-local?

I don't think so. Eio uses DLS for tracing, but the breaking change looks like it can be fixed easily. CC @talex5.

https://github.com/ocaml-multicore/eio/blob/bc1e231b64a69af5226d998d3286d235b61b0f5f/lib_eio/core/trace.ml#L10

Domainslib uses DLS. But this should also be OK.

https://github.com/ocaml-multicore/domainslib/blob/abaa4a37e177b8e9fb56cb8fd646a693214ad5de/lib/chan.ml#L22

Or maybe it'd become weirdly inefficient,

I don't see how. The only cost incurred is when we switch between threads where we will now save and restore an additional word. The GC will have one additional root to scan. The get and set operations remain just as fast.

@talex5
Copy link
Copy Markdown
Contributor

talex5 commented Nov 8, 2023

Eio uses DLS for tracing, but the breaking change looks like it can be fixed easily.

TLS would be better here anyway. At the moment, we can only use mint_id from the main thread in each domain.

@c-cube c-cube mentioned this pull request Nov 9, 2023
@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Nov 20, 2023

superseeded by #12724

@c-cube c-cube closed this Nov 20, 2023
@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 29, 2023

So what would be a nice API to offer both options to programmers? Can we do something with less overall complexity than proposing two modules with very similar interfaces and implementations?

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Nov 29, 2023

In terms of API, just having the current Thread_local_storage and Domain.DLS (renamed to Domain_local_storage?) would be best imho. I don't understand how that's not the minimal complexity interface: uniform interface, two distinct names for distinct kinds of storage, distinct key type. Trying to merge them into the same abstraction just calls for mistakes and/or slowdowns?

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 29, 2023

@polytypic

Another example is that I'm just about to implement a notion of a "domain index" using DLS. The idea is that domain indices would basically be non-negative integers unique only among the currently existing domains and allocated so that they are as densely packed as possible (i.e. maximum domain index is always less than the maximum number of domains that have existed at any point).

Unless I misunderstand, this is already available in the OCaml runtime as Caml_state->id, which should be easy to expose through a [@noalloc] FFI binding. Something like (untested):

CAMLprim value caml_ml_domain_dense_id(value unused)
{
  CAMLnoalloc;
  return Val_int(Caml_state->id);
}
external domain_dense_id : unit -> int = "caml_domain_ml_dense_id" [@@noalloc]

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 30, 2023

I think that it is perfectly reasonable to provide basic building blocks in the compiler distribution in some way or another. On the other hand, I like it better when the advanced stuff is kept in places that are clearly marked as advanced and not "we expect everyone to use this regularly". Having a submodule of Ref dedicated to shared state concurrency would feel unfortunate to me, while having a Ref submodule of Domain (or Thread for that matter) feels more appropriate.

(And yes, ideally we would have good implementations of common, higher-level approaches to concurrency available in the stdlib someday. But we need more experiments before we can do this. In particular I don't think that anyone should regret not having integrated, say, Domainslib, as my understanding is that people keep finding better ways to do these things, which would not necessarily be easy to retrofit.)

@kayceesrk
Copy link
Copy Markdown
Contributor

@gadmm

I didn’t have anything in mind particularly for split from parent. I guess I don’t understand what you mean by

it seems easier to cache the value from the initializer,

In order to distinguish thread creation from domain would your proposal store the domain id in the value so that the child can identify whether to initialise a new value?

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Nov 30, 2023

There would be a primary data structure, a concurrent map indexed by domain id. The TLS-initializer would retrieve the value from there and cache it. The first access does not need to be efficient.

In case the value does not exist in the primary structure, it would create a new one by calling a supplied DLS-initializer for the key.

@dbuenzli
Copy link
Copy Markdown
Contributor

@dbuenzli an interesting suggestion, but I'm not sure that we want to make it so easy to use {domain,thread}-local keys/references.

So what is the logic here ? We are going to make terrible names, inconsistent signatures and an untidy Stdlib so that it doesn't get used ?

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Nov 30, 2023 via email

@dbuenzli
Copy link
Copy Markdown
Contributor

It's also similar to things like Hmap.Key, so that seems fairly consistent to me?

No these keys act on explicit dictionaries that you manipulate yourself. There's no dictionary here.

"ref" is an established name for reference cells which is what these
things are. I don't see the need to add new terminology here.

And since nowadays any argument can be won by mentioning rust, here we have.

@c-cube
Copy link
Copy Markdown
Contributor Author

c-cube commented Nov 30, 2023 via email

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Nov 30, 2023

This isn't so straight-forward as I'd originally thought. In order to decide whether to copy or split/initialise new state, we also need to know whether the child runs on the same domain as the parent (thread) or a different one (domain).

It feels useful to know this fact generally in split_from_parent: perhaps the TLS interface should just include passing this information to the split function.

@kayceesrk
Copy link
Copy Markdown
Contributor

There are several threads of conversation going on here and on #12724. I'll try to summarise them here.

Should DLS be present alongside TLS?

There is at least one request for DLS to be retained: #12719 (comment). The concrete use cases are:

DLS implementation currently isn't thread-safe: #12677, and there is a sketch to make it thread safe using [@poll error] attribute to catch bugs: https://discuss.ocaml.org/t/using-poll-error-attribute-to-implement-systhread-safe-data-structures/12804.

Should DLS be present alongside TLS? -- No.

If the answer to the previous question is no, then one can simulate DLS by either

  1. using a concurrent hash map + caching (add thread local storage #12719 (comment)) or
  2. use split_from_parent to decide to create a new value or copy/split (add thread local storage #12719 (comment))

IIUC, both solutions require some effort from the programmer to utilise them correctly. Both solutions will require a level of indirection if the domain-local value is not read-only -- the value stored is a mutable reference to the actual value.

Should DLS be present alongside TLS? -- Yes.

If we agree that we want DLS alongside TLS, it may be easiest to implement them directly in the compiler using distinct Caml_state fields for DLS and TLS. The implementation is fairly straight-forward. We currently have the DLS implementation in trunk and TLS in #12724. The idea would be to have both.

I personally would prefer this direction.

Ergonomics

Then there is the question of the API for DLS and TLS and whether this would be a good opportunity to introduce a Ref module: #12719 (comment). It may be better to come to this issue after we decide on the higher-level question of whether to have both DLS and TLS.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Dec 1, 2023

Thanks for the great summary. A few points:

  • How does one ensures synchronisation between threads when running the initialization function? There could be more than one valid way to do it, and maybe not all are expressible with the current API. (Which methods of synchronisation are currently expressible?) This could argue in favour of a library-side solution if this offers more flexibility. I have not had the time to think more about this issue, but I am curious how you envision synchronisation for init.
  • In particular, thread-safe DLS might be slightly less straightforward to implement than described.
  • Regarding the need for set, I am under the impression that the examples given do not need it. In the presence of writes, there is also again the question of synchronisation. For instance, the programmer might want an indirection anyway to store a nanomutex, if one needed mutability.

The interfaces for TLS and DLS might end up different for reasons of intra-domain synchronisation.

@kayceesrk
Copy link
Copy Markdown
Contributor

kayceesrk commented Dec 2, 2023

How does one ensures synchronisation between threads when running the initialization function? There could be more than one valid way to do it, and maybe not all are expressible with the current API. (Which methods of synchronisation are currently expressible?) This could argue in favour of a library-side solution if this offers more flexibility. I have not had the time to think more about this issue, but I am curious how you envision synchronisation for init.

I'm trying to understand this comment. If either the initialisation or the split_from_parent functions involve shared state, then the user must utilise the synchronisation options available (such as Mutex, Atomics, etc). I must be missing something.

In particular, thread-safe DLS might be slightly less straightforward to implement than described.

Can you explain why?

Regarding the need for set, I am under the impression that the examples given do not need it. In the presence of writes, there is also again the question of synchronisation. For instance, the programmer might want an indirection anyway to store a nanomutex, if one needed mutability.

Indeed. In my experience with programming with DLS, rarely is there a need for DLS.set.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Dec 3, 2023

The access to shared state in init is a matter of synchronizing this shared state. However the creation of shared state by init needs synchronisation for DLS (e.g. two threads racing to initialize some mutable data structure). So I am curious to hear what are the synchronisation needs for init based on various examples.

The situation is similar to the synchronisation problem for Lazy, but restricted to a single domain. (Hence, a bit like the problems of Lazy in OCaml 4 with systhreads—with the difference that the only use-cases for DLS are with multiple threads, so it has to be dealt with.) If you remember the proposition for a thread-safe design for Lazy, various synchronization methods were supported.

@gasche
Copy link
Copy Markdown
Member

gasche commented Dec 3, 2023

If I understand correctly:

  • no cross-thread synchronization is necessary for TLS, given that each thread has a separate value
  • if we want to keep a DLS interface around, we have to worry about several threads on a single domain racing to operate on the DLS key. But then:
    • due to the limited interleaving for threads, in practice we only have to worry about races during initialization of the DLS value.
    • given that initialization is an infrequent operation, I guess we could just take a lock for that operation? (The question is where to define/initialize this lock.)

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Dec 3, 2023

@gasche — you have a point that domains are few, and initializations are few (since DLS slots are few).

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Dec 3, 2023

due to the limited interleaving for threads, in practice we only have to worry about races during initialization of the DLS value

No, I think the problem of single-domain thread-safety is still there, whether the programmer uses DLS.set or mutations of the value itself. (The point is then to use cheaper forms synchronisation than cross-domain ones: nanomutexes, reasoning on polling locations, etc.)

@kayceesrk
Copy link
Copy Markdown
Contributor

No, I think the problem of single-domain thread-safety is still there, whether the programmer uses DLS.set or mutations of the value itself. (The point is then to use cheaper forms synchronisation than cross-domain ones: nanomutexes, reasoning on polling locations, etc.)

Why is this general concurrent programming problem of concern to DLS? If the programmer uses mutable state as a value in DLS, which may be accessed by several domains, then the programmer has to include the necessary synchronization.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Dec 4, 2023

We agree on this. I was reacting to "due to the limited interleaving of threads", which does not eliminate the concern for races. But this concern is on the user.

The difference between set and init is intriguing. For init we do not know where to store a domain-local mutex (for instance). But one could use a unique mutex for all domains and store it in the closure of init. This can be used as a solution to bootstrap per-domain nanomutexes.

@kayceesrk
Copy link
Copy Markdown
Contributor

kayceesrk commented Dec 6, 2023

I was reacting to "due to the limited interleaving of threads", which does not eliminate the concern for races. But this concern is on the user.

Yes. Indeed. Agreed. The fact that we have fewer chances of races / concurrency bugs due to limited concurrency does not make the problem go away. That said, limited concurrency on rare operations may mean that we can have more expensive synchronization on such operations (if that were possible).

How do we make forward progress on this?

We seem to agree that having TLS is useful. We also seem to head towards the idea that DLS is useful. What interface it should be, whether to implement it on top of TLS, whether to have a different API from TLS, whether it can be implemented in a concurrecy-safe way, are still open questions.

Would it make sense to start working towards a design such that we aim to have:

  1. Separate TLS and DLS in Stdlib
  2. DLS get and set should be concurrency safe
  3. Define what concurrency safety means for init
  4. Develop a set of micro benchmarks with which we can evaluate alternatives.

I'm a little reluctant to include Ref module in the standard library to be included in this discussion as it feels like this would take the focus away from the issues at hand. This is also because I don't myself find a good pull for having the Ref module. Perhaps @dbuenzli can take this discussion up separately?

@gasche
Copy link
Copy Markdown
Member

gasche commented Dec 6, 2023

We seem to agree that having TLS is useful. We also seem to head towards the idea that DLS is useful.

Personally I remain unconvinced that it is worth offering both TLS and DLS. I think that we could have a single interface that does TLS, and make it expressive enough to let users define domain-constant keys easily -- and without an unacceptable performance cost. The overall system would be simpler and easier to think about.

If there is an issue with this design, I suspect that it would come from the (non-)interaction with other cooperative layers on top of domains: we keep making Thread a first-class citizen in the runtime, does this have a negative impact on Eio and other such libraries?

@gasche
Copy link
Copy Markdown
Member

gasche commented Dec 6, 2023

I'm a little reluctant to include Ref module in the standard library to be included in this discussion as it feels like this would take the focus away from the issues at hand. This is also because I don't myself find a good pull for having the Ref module. Perhaps @dbuenzli can take this discussion up separately?

With my moderator hat on: I heard that people were unhappy with the tone of this discussion, so maybe we could just kill it off for now.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Dec 6, 2023

To substantiate @gasche's point with a bit of code, would the following approach work? The following functor DLS_fast creates a fast DLS by caching a slow DLS with a fast TLS.

module type CELL = sig
  type 'a t
  val make : init:(unit -> 'a) -> 'a t
  val get : 'a t -> 'a
end

module DLS_fast (* fast domain-local storage *)
    (TLS : CELL) (* a fast thread-local storage *)
    (DLS_slow : CELL) (* a slow domain-local storage *)
  : CELL
= struct
  type 'a t = 'a TLS.t
  let get = TLS.get
  let make ~init =
    let dls_key = DLS_slow.make ~init in
    TLS.make ~init:(fun () -> DLS_slow.get dls_key)
end

If the user needs set, then an extra indirection is needed, in which case it is better for the user to have mutability built in their type (at worst, they use 'a ref DLS).

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Dec 6, 2023

This is very strange, this comment does not appear for me.

@kayceesrk
Copy link
Copy Markdown
Contributor

@polytypic reg #12719 (comment)

consider avoiding screenshots of textual conversations as it is inaccessible for folks with disabilities.

@kayceesrk
Copy link
Copy Markdown
Contributor

we keep making Thread a first-class citizen in the runtime, does this have a negative impact on Eio and other such libraries?

I would consider Threads (systhreads) as first-class citizens in the runtime. Domains don't replace all uses of threads. This was previously discussed here: ocaml-multicore/ocaml-multicore#100 (comment).

Some external calls do block, and that's unavoidable. Offloading them to a systhread is the right approach. Consider that you are reading megabytes from disk and loading them into memory. This operation may spend a considerable amount of time in the kernel. You'd like to be able to do other things in the meantime, such as running other user-level threads. You really don't want to assign a different domain to do this task as you don't really gain anything by having the ability to utilise another code in userspace as the operation will be blocked in the kernel, and you suffer the cost (GC, synchronization) of having an additional domain. Having a systhread on the same domain is the right way to do this operation. IIUC, this is how async scheduler works. Eio provides this as a basic operation: https://github.com/ocaml-multicore/eio/blob/a26c3ebccc5dcf1480aa72c39dd645b43ab32db2/lib_eio/unix/eio_unix.mli#L53C1-L53C1.

Taking a step back from OCaml, other language runtimes also have a separation between "objects representing units of parallelism" and "threads that are able to get hold of this object to execute" in addition to having "lightweight user-level threads"Two prominent examples are Go and GHC Haskell. . They also do it for the same reason -- handling synchronous calls without losing parallelism.

Go has a notion of a goroutine G, a machine M and a processor P: https://go.dev/src/runtime/proc.go. They can roughly be mapped to OCaml -- G ~> Fibers, M ~> systhread, P ~> Domains. GHC Haskell has Haskell Execution Contexts (HECs), threads (created through Control.Concurrent.fork*), and OS threads (managed by the runtime). They roughly map to OCaml as follows -- HECs ~> Domains, OS thread ~> systhread, thread ~> fiber. The difference between OCaml and GHC/Go is that the compiler provides user-level threads as a primitive, which we don't. Moreover, unlike OCaml, the mappings of thread/G to OS thread/M and OS thread/M to HEC/P is managed by the runtime.

I believe that there is utility in treating systhreads as a first-class citizen in the runtime.

The whole point of providing facilities like DLS and TLS is their performance.
The point-of-view from which I'm looking at this is what is possible (especially in terms of performance) and then what practical applications that allows.

I agree with this. There is value in supporting DLS natively rather than over TLS. The implementation in the compiler is no more complex; we can implement that with a single primitive lowered through the compiler.

I'd be happy to implement this PR myself, building on top of #12724 and @polytypic's thread-safe DLS implementation.

@lpw25
Copy link
Copy Markdown
Contributor

lpw25 commented Dec 7, 2023

Personally I remain unconvinced that it is worth offering both TLS and DLS.

I agree. The argument for a separate DLS mechanism seems to require that you have a use case where:

  • You need domain local storage
  • The storage you need is mutable so you can't just share the data between all threads in a domain
  • The performance of the DLS is so critical that the extra indirection is unacceptable

I'm not sure I buy the existence of this use case, and it's certainly a thin argument for adding a feature to the compiler and stdlib.

I'm also pretty sympathetic to getting rid of set from TLS and DLS interfaces, it just seems like a conceptually simpler primitive if its immutable.

@gasche
Copy link
Copy Markdown
Member

gasche commented Dec 7, 2023

I would prefer to have a simple design with limited implementation complexity and maintenance surface.

The current DLS implementation is a source of maintenance work already. We had to extend it with inheritance (split_from_parent), it is being (rightfully) pointed out that there are thread-safety issues that still need to be fixed. It is also:

  • not fast enough for certain use-cases (as @polytypic pointed out), which would require additional improvements, and
  • not adapted to creating many keys (as @lyrm pointed out), which may require additional changes, or clearly delineating yet another approach that people should use instead of DLS -- possibly outside the stdlib.

I am not excited at the idea of repeating similar design and implementation work anew for a separate TLS module. The design space is a bit different, the constraints a bit different, people suggest to maybe let go of some feature (inheritance). In the worse scenario we end up with two non-trivial interfaces that we need to support painfully over time, with little to no sharing of code, documentation or behavior between them, and that is twice the pain. I don't want this.

Personally I think that a reasonable compromise would be as follows:

  • We repurpose the current DLS runtime support to become TLS, which I now think is the right default for the use-cases of DLS that I know of and are appropriate with the current implementation (not super performance-sensitive, not creating too many keys, etc.).
  • We expose domain->id, or some other mechanism preferred by @polytypic, to let experts implement domain-indexed structures outside the stdlib for now. This adds a small implementation surface (much smaller than a full DLS module), with complementary benefits (this is usable in more scenarios than DLS today). If we can make their life easier by adding extra runtime services, for example "cache-padded arrays" or something, let's discuss this and be open.
  • We leave the door open to reintroducing a DLS abstraction if there are clear use-cases and convincing out-of-tree experiments that show that having something upstream (for example with support through compiler primitives?) is beneficial over out-of-tree approaches.

@polytypic, does that seem reasonable to you?

@kayceesrk
Copy link
Copy Markdown
Contributor

That seems reasonable to me.

We repurpose the current DLS runtime support to become TLS,

Essentially, this means taking #12724 to completion.

not super performance-sensitive,

Do you mean that TLS is not super fast or that implementing DLS on TLS is not super fast? The former isn't true.

@gasche
Copy link
Copy Markdown
Member

gasche commented Dec 8, 2023

Do you mean that TLS is not super fast or that implementing DLS on TLS is not super fast? The former isn't true.

What I have in mind I guess is that the current use-cases of Domain.DLS that I know of (those in the stdlib) are not super performance-sensitive. (The most sensitive use is for global PRNG state.) For all of those it would be acceptable to move from DLS to TLS, even if we have a few thousands of threads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants