
make dls thread local#12724

Closed
c-cube wants to merge 10 commits into ocaml:trunk from c-cube:make-dls-thread-local

Conversation

@c-cube
Contributor

@c-cube c-cube commented Nov 9, 2023

This is an attempt at an alternative implementation to #12719. It segfaults, though; apparently the current DLS state isn't fully initialized when my new code in thread.ml is called.

edit: if we make it work and it's decided to be the one good solution, there would be some squashing to be done.

@gadmm
Contributor

gadmm commented Nov 9, 2023

Regarding the module names, would it be better to simply rename Domain.DLS into Domain.TLS? This could be clearer in the long term, and in the short term it forces programs relying on the experimental Domain module to be fixed if needed.

Contributor

@gadmm gadmm left a comment

Some comments below, perhaps related to the crash.

@kayceesrk
Contributor

Regarding the module names, would it be better to simply rename Domain.DLS into Domain.TLS?

This is a good idea. I have also been thinking that DLS is not an appropriate name for something that's thread local.

Also, I did wonder whether the acronym should be dropped in favour of a longer name LocalStorage.

@c-cube c-cube force-pushed the make-dls-thread-local branch from 7432b01 to 16ea52a on November 13, 2023 04:30
CAMLassert(This_thread != NULL);
caml_thread_t this_thread = This_thread;
this_thread->current_stack = Caml_state->current_stack;
this_thread->dls_state = Caml_state->dls_root;
Contributor

Should both dls_state and dls_root be renamed to the same thing, similar to other fields? One option is tls_state or tls_root.

external id : t -> int = "caml_thread_id" [@@noalloc]
external join : t -> unit = "caml_thread_join"

type dls_state = Obj.t array
Contributor

All uses of dls should be renamed to tls?

Contributor

@gadmm gadmm left a comment

I have a small suggestion below to avoid making create_dls public and to remove a (preexisting) bug.

@gadmm
Contributor

gadmm commented Nov 13, 2023

Since we're here to talk about implementing TLS, and to solve issues of thread-safety (etc.): one realistic usage (and one which I would like to rely on) consists in being able to access TLS from async callbacks. Think profiling, inside a memprof callback.

The guarantee needed is rather weak and is already satisfied by the current implementation: the user accesses inside async callbacks only already-initialised TLS fields (or with a dummy init), only for reading, and in exchange the implementation guarantees that races are innocuous.

Could this guarantee be documented?
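As an illustration of this usage pattern, the sketch below is written against the existing Domain.DLS API (OCaml ≥ 5.0), since the TLS module discussed here does not exist yet; the names samples, init_profiling and record_sample are made up for the example.

```ocaml
(* Illustrative sketch: an eagerly-initialised slot that async callbacks
   only ever read. *)
let samples : int list ref Domain.DLS.key =
  Domain.DLS.new_key (fun () -> ref [])

(* Called once from ordinary code, so the slot is initialised before any
   async callback can fire. *)
let init_profiling () = ignore (Domain.DLS.get samples)

(* Fits the guarantee discussed above: the storage access itself is a read
   of an already-initialised field; only the ref's contents are mutated. *)
let record_sample n =
  let r = Domain.DLS.get samples in
  r := n :: !r
```

With this shape, a memprof or signal callback that calls record_sample never triggers slot initialisation, which is the weak guarantee being asked for.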

@gadmm
Contributor

gadmm commented Nov 13, 2023

It looks like the wording of the documentation in domain.mli should be updated. In particular I am curious about what effect split_from_parent should have when spawning a thread. (In this PR, it is used when spawning a new domain but not when spawning a new thread.)

@c-cube
Contributor Author

c-cube commented Nov 14, 2023

I'd be happy to work more on the documentation, if there is a general agreement that this PR is going in the right direction (and that #12719 isn't — I can close it).

Regarding split_from_parent, it's useful for random generators, but it looks to me like it means upfront work at every point where a thread is created. The function TLS.set_initial_keys (which would have to be called in Thread.create as well if we wanted to generalize it to all threads) eagerly creates, in the child, all TLS slots that are present in the parent.

Because threads are much more lightweight than domains (and likely to be created in great numbers) this might be a bit more of a performance issue, I think. What do people think?

@kayceesrk
Contributor

Because threads are much more lightweight than domains (and likely to be created in great numbers) this might be a bit more of a performance issue, I think. What do people think?

The cost is proportional to the number of TLS slots, isn't it? Thread creation is cheaper than domains but is still an expensive operation. Is there a realistic example where the cost of TLS initialisation is the bottleneck in thread creation? If we don't have such an example, it may be ok to go ahead with the same API. The documentation of new_key already suggests the use of lazy if the initialisation functions are expensive. This would be applicable to TLS as well.

ocaml/stdlib/domain.mli

Lines 113 to 128 in b4308a0

If the splitting function is expensive or requires
child-side computation, consider using ['a Lazy.t key]:
{[
  let init () = ...
  let split_from_parent parent_value =
    ... parent-side computation ...;
    lazy (
      ... child-side computation ...
    )
  let key = Domain.DLS.new_key ~split_from_parent init
  let get () = Lazy.force (Domain.DLS.get key)
]}

@gasche
Member

gasche commented Nov 14, 2023

Ideally one would want to initialize the value for the child thread on demand, when it is first requested. The problem is that deriving the value may require computations and state updates on the parent thread. (For Random, for example, splitting the PRNG state in two mutates the parent state to produce the child state.) If we try to do this on demand we would have to synchronize the child and the parent thread again, which would be complex and costly, and may not even be possible if the parent thread has terminated at that point. It is much simpler to perform the parent-side computation eagerly on thread creation, as the parent and child are synchronizing at this point anyway.

The documentation that @kayceesrk is pointing out explains how to perform the parent-side computation eagerly on thread creation, but wrap any child-side computation in a lazy thunk to be forced later.

Your point remains that thread creation has to pay a cost that is linear in the number of TLS keys with inheritance. That is true, but that seems to be the simplest option available for the provided expressivity, and as KC pointed out we are not aware of performance issues arising from this particular design choice yet.
If people don't need this inheritance mechanism, they don't have to use it and there is no startup cost.

@xavierleroy
Contributor

xavierleroy commented Nov 14, 2023

To put the costs in perspective: Random.split takes 1/200th of the time of Thread.create + Thread.join (on a Mac M1).

Edit: more precise ratios are 0.44% for macOS on an M1 and 0.37% for Linux on an AMD Zen4.

So, the cost of Random.split at every thread creation is very small, although not completely lost in the noise.
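A rough, illustrative micro-benchmark in this spirit could look like the sketch below (OCaml ≥ 5.0). Domain.spawn + Domain.join stands in for Thread.create + Thread.join so that nothing outside the stdlib needs linking; the absolute numbers and the ratio are machine-dependent and not the figures quoted above.

```ocaml
(* Rough timing sketch; illustrative only. *)
let time_per_call n f =
  let t0 = Sys.time () in
  for _ = 1 to n do ignore (Sys.opaque_identity (f ())) done;
  (Sys.time () -. t0) /. float_of_int n

let () =
  let split = time_per_call 100_000 Random.split in
  let spawn = time_per_call 1_000 (fun () -> Domain.join (Domain.spawn ignore)) in
  Printf.printf "Random.split: %.3e s/call; spawn+join: %.3e s/call\n" split spawn
```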

@gadmm
Contributor

gadmm commented Nov 17, 2023

I'd be happy to work more on the documentation, if there is a general agreement that this PR is going in the right direction (and that #12719 isn't — I can close it).

I'd be happy to keep helping push this PR, but I won't be the one to say whether DLS should keep coexisting with TLS or not. I take the continued discussion by various people as a sign of interest in this approach, but let's help @c-cube by saying so explicitly. (There is the point about races in the current implementation of DLS—I do not make it an argument because the masking proposal is also an efficient way to make the current approach sound.)

@kayceesrk
Contributor

but let's help @c-cube by saying so explicitly.

I am happy to support this proposal. From the discussion in #12719, it is not clear whether we need DLS at all if we have TLS. See this supporting comment #12719 (comment). There is not much use of DLS in the wild and the breaking change is ok today.

The risk with switching DLS to TLS is whether there are uses that we aren't seeing today which may necessitate a DLS. In the absence of such a need for DLS, I am happy to move forward with replacing this with TLS as (a) TLS seems more appropriate and (b) it fixes races observed in #12677.

masking proposal

@gadmm has this been written down? It would be useful to see how far we are from usable masking. In particular, what does it mean for callbacks. (We should probably discuss the details of this elsewhere so as to not digress.)

@xavierleroy
Contributor

For what it's worth, I very much like the idea of turning domain-local storage to thread-local storage, so that "it just works" across domains, threads, and combinations of both, just like mutexes, condition variables and semaphores "just work".

If I remember correctly, this idea was briefly mentioned while we were planning the Multicore OCaml -> OCaml 5 merge. (I do remember a Zoom discussion during a stay-at-home order, so it must have been in 2020.) Nobody was opposed, but nobody wanted to work on it at that time, since there were so many other things to do before. Now is a good time to revisit this idea.

@c-cube
Contributor Author

c-cube commented Nov 18, 2023

Alright, then it seems to me that what I need to do is:

  • update names in code (dls_set and the likes become tls_set, etc.)
  • update documentation
  • adapt split_from_parent so that it works at thread creation time too. I think it might be useful to have a hidden implementation module that can be used to share code between Domain.spawn and Thread.create wrt storing the list of default TLS slots.

@c-cube
Contributor Author

c-cube commented Nov 21, 2023

Running into a few issues while trying to add CamlinternalTLS (with shared definitions and state for Thread and Domain):

time/libcamlrun.a .
make ocamlc
make[3]: Entering directory '/home/simon/w/ocaml'
  LINKC ocamlc
File "_none_", line 1:
Error: Module CamlinternalTLS is unavailable (required by Stdlib__Domain)

I've updated stdlib/.depend but maybe not enough…

@kayceesrk
Contributor

kayceesrk commented Nov 21, 2023

Running into a few issues while trying to add CamlinternalTLS (with shared definitions and state for Thread and Domain):

You'll need to add the new module to stdlib/StdlibModules

 git diff   
diff --git a/stdlib/StdlibModules b/stdlib/StdlibModules
index efd654069e..14b6714717 100644
--- a/stdlib/StdlibModules
+++ b/stdlib/StdlibModules
@@ -70,6 +70,7 @@ STDLIB_MODULE_BASENAMES = \
   mutex \
   condition \
   semaphore \
+  camlinternalTLS \
   domain \
   camlinternalFormat \
   printf \

seems to work.

@dbuenzli
Contributor

I'm not exactly sure why this is kept in the Domain module and an alias is added to Thread. The latter seems like the right place, no? Fewer names, please.

Also, why not stick to the usual conventions? This should rather be:

module Thread.Key : sig
  type 'a t
  val make : (unit -> 'a) -> 'a t
  val get : 'a t -> 'a 
  val set : 'a t -> 'a -> unit
end

Which incidentally brings us nearer to pthread_key.
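To see the shape of this proposal in use, the sketch below implements the suggested Key signature on top of the existing Domain.DLS, purely for illustration; in the proposal the module would live in Thread and be per-thread rather than per-domain.

```ocaml
(* Sketch of the proposed [Key] interface, backed by Domain.DLS here only so
   that the example is self-contained on OCaml >= 5.0. *)
module Key : sig
  type 'a t
  val make : (unit -> 'a) -> 'a t
  val get : 'a t -> 'a
  val set : 'a t -> 'a -> unit
end = struct
  type 'a t = 'a Domain.DLS.key
  let make init = Domain.DLS.new_key init
  let get = Domain.DLS.get
  let set = Domain.DLS.set
end

(* usage: one counter per thread (per domain with this stand-in) *)
let counter : int ref Key.t = Key.make (fun () -> ref 0)
let () = incr (Key.get counter)
```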

@kayceesrk
Contributor

I'm not exactly sure why this is kept in the Domain module and an alias is added to Thread. The latter seems like the right place, no? Fewer names, please.

Would this not make it necessary to link threads in order to use TLS even if the program only uses domains?

@dbuenzli
Contributor

I thought all modules of the threads library had migrated to the stdlib (Mutex, Condition, Semaphore). Is there any reason why Thread wasn't ?

In any case I'd suggest not multiplying the names, avoid using obscure acronyms and stick to the conventions, as I already suggested in the past.

@kayceesrk
Contributor

Is there any reason why Thread wasn't ?

When linked with threads, there is some extra overhead to save and restore runtime state, which is unnecessary when the program only has one thread. It is possible that we can do this lazily at the point when the first thread is created, but this is a bit of extra work.

In addition, if the Thread module were moved to Stdlib, then there is no need for the thread library to exist. This may need some changes for the cli tools (ocamlc and ocamlopt would no longer need -I +threads). I am not fully aware of the wider implications of removing the thread library.

In any case I'd suggest not multiplying the names, avoiding obscure acronyms, and sticking to the conventions, as I already suggested in the past (#11030 (comment)).

I also mentioned something similar in #12724 (comment). It would be useful to avoid abbreviations.

Another alternative would be to move the module out of Domain and have it directly under Stdlib as Stdlib.ThreadLocalStorage. I don't see a particular reason to have it under Domain.

@c-cube
Contributor Author

c-cube commented Nov 22, 2023

Ok, this now uses common code for initialization in both domain.ml and thread.ml, thanks to a new camlinternalTLS module. I haven't worked on documentation yet (I assume it's mostly updating some docstrings and the manual) but I'd like to get feedback on the implementation first. I'm personally satisfied with the design, as a future user.

Note: I did not rename dls_get and the two externals get_dls_state and set_dls_state because I think it requires a bootstrap and I don't think it's worth it?

@kayceesrk
Contributor

but I'd like to get feedback on the implementation first.

Can you mark the PR as ready for review, please?

@c-cube c-cube marked this pull request as ready for review on November 22, 2023 04:19
@c-cube c-cube force-pushed the make-dls-thread-local branch from c0ca1d5 to b272704 on November 22, 2023 04:28
@kayceesrk kayceesrk force-pushed the make-dls-thread-local branch from 60d890f to 2b96245 on December 9, 2023 08:19
| None -> f
| Some old_exit -> (fun () -> (f (); old_exit ()))
in
at_exit_callbacks := IMap.add domain_id new_exit m)
Contributor

We do not know much about the future usages of Domain.at_exit so we don't know whether the mutex is expensive. Do note, though, that it is fairly straightforward to avoid holding the mutex in most cases (more details on request).

Minor detail, this change fixes a thread-unsafety of Domain.at_exit of current trunk.

Contributor

Thanks. Does the thread-unsafety of Domain.at_exit have an issue on ocaml/ocaml?

Do note, though, that it is fairly straightforward to avoid holding the mutex in most cases (more details on request).

If it is very straightforward, I am happy to hear. With that said, do you think we can have it as a separate PR? We've spent quite a bit of time on this PR and #12719 already. I'm keen to get this done and dusted as a first cut.

Contributor

Thanks. Does the thread-unsafety of Domain.at_exit have an issue on ocaml/ocaml?

Not that I know of, I just noticed it while reviewing.

If it is very straightforward, I am happy to hear. With that said, do you think we can have it as a separate PR? We've spent quite a bit of time on this PR and #12719 already. I'm keen to get this done and dusted as a first cut.

Sure, at the same time we can check if we need a "yet to run" flag as for Stdlib.at_exit. In this case I think this commit is good to go.

Contributor

@kayceesrk kayceesrk Dec 9, 2023

we can check if we need a "yet to run" flag as for Stdlib.at_exit. In this case I think this commit is good to go.

sounds good.

@kayceesrk kayceesrk force-pushed the make-dls-thread-local branch from dc5bea8 to 73af69c on December 9, 2023 11:30
@kayceesrk
Contributor

The pending tasks that I planned to do are done: #12724 (comment). The PR is ready IMO.

This is a stdlib change and requires at least two approvals from maintainers(?).

Member

@gasche gasche left a comment

I looked at the code, which seems fine overall. Two broader remarks:

  • We use Domain.at_exit in conjunction with Domain.DLS.new_key to clean up keyed values. Do we need Thread.at_exit now? (I suspect that we do.)

  • I reviewed the Thread_local_storage implementation to convince myself that it is thread-safe. (It is easier to have a thread-safe TLS than a thread-safe DLS, as we know that no other thread uses the same indices.) I believe that I found various issues related to re-entrancy, which could arise not just due to threads, but also in single-threaded programs. One issue can only happen with asynchronous callbacks (running at poll points), so maybe we could decide to ignore it for now, but there are also bugs that (I think) can be observed when the initializer passed to Thread_local_storage.make itself accesses TLS values or registers new keys, and those I think we should fix if we can -- or document the restrictions clearly.

let temp_file_key = Thread_local_storage.new_key (fun _ ->
  let tmp = snd (Filename.open_temp_file "" "") in
  Domain.at_exit (fun () -> close_out_noerr tmp);
  tmp)
Member

My intuition is that this example is now broken, it should use a Thread.at_exit function that does not appear to currently exist? (And so it should rather be documented as a use-case for Thread.at_exit than Domain.at_exit.)

Contributor

@kayceesrk kayceesrk Dec 10, 2023

The example isn't broken (well it is, since there is a typo). A temporary file descriptor will now be created per thread, which will be closed when the domain that the thread is running on exits. It is not what you would want ideally, but the file does get closed.

It may be useful to have a Thread.at_exit, but that's a conversation that I'd like to not have in this PR.

Member

(I think the typo is that new_key should now be make.)

When I wrote the comment I thought that the domain would only close one file descriptor on exit (which is broken), but in fact the cumulative behavior of Domain.at_exit (adding a new callback instead of replacing the callback) means, as you point out, that all descriptors will be closed on domain termination.

I still consider that this example is not right, however. Consider in particular the case of a single domain that regularly spawns short-lived threads (to execute a blocking call, for example). What the user needs is to keep a number of open file descriptors that corresponds to the number of threads that are alive. Instead, the number of open file descriptors will grow in an unbounded way until program termination. This is not just "not ideal"; in practice it is not acceptable for this workload.

Remark: putting a finalizer on the value produced by the initialisation function would work better¹ than using Domain.at_exit. If we don't want to have Thread.at_exit, maybe we should rewrite the example to use a finalizer?

¹: I assume that thread termination clears the TLS array of the thread, but I am not sure that this is the case with the current PR.
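The finalizer variant could look like the sketch below, written against today's Domain.DLS (OCaml ≥ 5.0) and resting on the assumption in footnote ¹ that a terminated thread's storage becomes unreachable; the temp-file prefix and suffix are arbitrary.

```ocaml
(* Sketch: cleanup attached to the channel itself, so it runs when the
   value is collected, instead of piling callbacks onto Domain.at_exit. *)
let temp_file_key =
  Domain.DLS.new_key (fun () ->
    let tmp = snd (Filename.open_temp_file "tls" ".tmp") in
    (* close the descriptor when [tmp] becomes unreachable, e.g. after the
       owning thread terminates and its storage is collected *)
    Gc.finalise close_out_noerr tmp;
    tmp)
```

With this shape, the number of live descriptors tracks the number of live threads (up to GC latency), rather than growing until domain termination.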

always happens, regardless of whether the child thread or
domain will use it.
If the splitting function is expensive or requires
child-side computation, consider using ['a Lazy.t key]:
Member

I think that it would also be nice to provide a code example of accessing Domain.id to determine whether we are spawning a new domain, or a new thread on the same domain.

let v = st.(idx) in
if v == unique_value then
let v' = Obj.repr (init ()) in
st.(idx) <- (Sys.opaque_identity v');
Member

This implementation is still incorrect due to re-entrancy issues: if init () in turn calls maybe_grow, st could be wrong at this point. I think that the following would work better:

let v' = Obj.repr (init ()) in
(* the tls array may have been resized by [init ()],
    call [maybe_grow] again to get the up-to-date array. *)
let st = maybe_grow idx in
st.(idx) <- Sys.opaque_identity v';

Contributor

This and other similar issues you pointed out are not due to this PR's change -- making DLS thread-local. These issues are present on trunk. I would prefer if these unrelated issues do not delay this PR getting in. There are already many related issues that need sorting out.

We should move this and others to a separate issue. What do you think?

Contributor

Fixed this one as suggested.

Member

Note that I am happy to defer those three reentrancy issues to a later PR as you suggested. They make the code more complex and would probably deserve an independent review.

Contributor

This one is straightforward. So picked this change here. Other ones are more complicated, and I'd like to defer those to a separate issue.

let new_sz = compute_new_size (max 8 sz) in
let new_st = Array.make new_sz unique_value in
Array.blit st 0 new_st 0 sz;
set_tls_state new_st;
Member

Nitpick: this implementation is incorrect in the presence of asynchronous callbacks -- code executed at allocation and poll points. The call to Array.make may run asynchronous code that calls maybe_grow and initializes and defines new keys that are missing from st.

This is an obscure scenario and we may want to leave things as-is if there is no easy fix. But I think that there would be an easy fix, due to the fact that the tls array only grows and never shrinks:

    let new_sz = compute_new_size (max 8 sz) in
    let new_st = Array.make new_sz unique_value in
    (* the [new_st] allocation may result in an asynchronous resize of the
       tls_state array: continue from the current array. *)
    let cur_st = get_tls_state () in
    let cur_sz = Array.length cur_st in
    if idx < cur_sz then cur_st
    else begin
      assert (cur_sz <= new_sz);
      Array.blit cur_st 0 new_st 0 cur_sz;
      set_tls_state new_st;
      new_st
    end

(Note: I am bad at reasoning about poll points, so I am not sure that there are no poll points in the new version between the binding of cur_st and the set_tls_state call. If there are poll points in there, possibly after Array.blit for example, the new version is still incorrect for asynchronous callbacks.)

Contributor

Same as #12724 (comment) (?).

List.map
(fun (KI ((idx, _) as k, split)) ->
(idx, Obj.repr (split (get k))))
(Atomic.get parent_keys)
Member

There is a reentrancy issue here if the calls to get k change the value of parent_keys by registering new keys that use inheritance.

I wrote the following not-too-horrible version that adds new keys in a loop.

let get_initial_keys () : (int * Obj.t) list =
  (* compute the key-value pair for the child from an initializer *)
  let kv_of_ki (KI ((idx, _) as k, split)) =
    (idx, Obj.repr (split (get k)))
  in
  (* We compute key-value pairs for all 'parent keys' in a loop,
     because the list of parent keys may grow during the
     computation. *)
  let rec loop acc last_parent_keys =
    let new_parent_keys = Atomic.get parent_keys in
    if new_parent_keys == last_parent_keys then acc
    else
      (* process all key initializers that are in [new_parent_keys]
         but not in [last_parent_keys], then loop again. *)
      let rec add_new_keys acc keys =
        if keys == last_parent_keys
        then loop acc new_parent_keys
        else match keys with
          | [] -> loop acc new_parent_keys
          | ki :: rest ->
            add_new_keys (kv_of_ki ki :: acc) rest
      in
      add_new_keys acc new_parent_keys
  in
  loop [] []

Contributor

Same as #12724 (comment) (?).

@gasche
Member

gasche commented Dec 9, 2023

I wonder about the following two things:

  1. Do we understand well the interaction between TLS and the "callback registering" functions, namely Domain.before_first_spawn, Domain.at_exit, and the Thread equivalent of those?

  2. Do we have consensus, have the people who were actively involved in the discussions so far agreed that the current approach is a reasonable step moving forward?

let domain_id = (self () :> int) in
Mutex.protect at_exit_mutex (fun () ->
let m = !at_exit_callbacks in
let old_exit_opt = IMap.find_opt domain_id m in
Member

I find this implementation surprisingly complex and I wonder if this is the right approach.

  1. Maybe we actually want to have Thread.at_exit, which would be naturally easier to implement on top of Thread_local_storage? (Note: even if we did, we probably want to keep a Domain.at_exit around with the current semantics, but that would mostly be for advanced out-of-tree experiments.)

  2. We discussed a way to implement DLS on top of TLS, isn't this function a good opportunity to try this? For example, it would seem fairly natural to me to implement this as a (unit -> unit) ref Thread_local_storage.t that is shared across all threads of a given domain. Am I missing something? (If we try, and it appears that DLS-on-top-of-TLS is worse than we thought, maybe that is a reason to reconsider.)
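The shape suggested in point 2 could look roughly like the sketch below. Domain.DLS stands in for the not-yet-existing Thread_local_storage so the example runs on OCaml ≥ 5.0 (under the PR's per-thread semantics, the ref would additionally have to be shared among a domain's threads, e.g. via split_from_parent at thread creation); my_at_exit and run_at_exit are made-up names.

```ocaml
(* Sketch: the at_exit callback lives in a mutable cell reached through
   local storage, instead of a globally-locked map of domain ids. *)
let at_exit_cell : (unit -> unit) ref Domain.DLS.key =
  Domain.DLS.new_key (fun () -> ref (fun () -> ()))

let my_at_exit f =
  let cell = Domain.DLS.get at_exit_cell in
  let old = !cell in
  (* compose callbacks LIFO, as Domain.at_exit does *)
  cell := (fun () -> f (); old ())

(* the runtime would invoke this on domain termination *)
let run_at_exit () = !(Domain.DLS.get at_exit_cell) ()
```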

Contributor

I'd like to understand what is complex with this implementation. It is a single piece of shared state at_exit_callbacks protected by a mutex.

If we want to retain Domain.at_exit, I'd rather keep the current implementation, and think about Thread.at_exit in a separate PR.

Member

Part of the argument we used to move DLS to TLS is that it "should be easy" to implement DLS on top of TLS.

But then, on the one time where we do need an implementation of DLS inside the compiler distribution, we punt and implement a globally-locked map of domain identifiers. Is this our recommendation for users as well, that they implement a globally-locked map of domain identifiers? If not, why don't we follow our own recommendations?

I think that we should do our homework and implement Domain.at_exit as DLS on top of TLS. I see two possible outcomes:

  1. If it turns out to be simple and pleasant, this would be better I think than your current implementation. Plus we have an example of DLS-over-TLS implementation we can point people at. Win!
  2. If it turns out to be harder than we thought, or have performance downsides that we find problematic, then it suggests that our initial intuition was wrong and maybe we should reconsider or at least think harder. I would rather figure this out now than after a release.

Contributor

@kayceesrk kayceesrk Dec 11, 2023

Part of the argument we used to move DLS to TLS is that it "should be easy" to implement DLS on top of TLS.

It was not my understanding that this was easy. But I sympathise with the push for having a DLS for Domain.at_exit.

The lazy-based implementation (#12719 (comment)) has a concurrency issue: multiple threads on a domain may concurrently perform DLS.get and hence concurrently force the same lazy value, which is not concurrency-safe.

I haven't investigated the concurrent hash map + caching solution. This requires a concurrent hash map in the standard library @gadmm? or am I misunderstanding the solution?

If we are providing a DLS, I would prefer to implement it directly as a field in domain state, just as TLS is. We could potentially share the implementation between the two and focus on making them both concurrency-safe and reentrant.

@gasche
Member

gasche commented Dec 10, 2023

@kayceesrk I understand the idea to get work done by pushing stuff for later, and I agree for the parts that can clearly be dealt with separately -- the reentrancy issues.

On the other hand, I would want us to ensure that the new design that we propose is sound and consistent. The previous design with first-class domain-local storage was sound and consistent, but inconvenient for users. If we want to move to first-class thread-local storage, I would want to ensure that the resulting design is nice and not somewhat-broken due to various compromises. I think that, in particular, the questions around Domain.at_exit and Thread.at_exit should get a serious discussion, rather than being swept under the carpet because we are in a hurry to merge.

(But then this is just my opinion. If there is a consensus among other people to say that the current state is just fine and we should go ahead, let's! I don't know that we have reached any such consensus yet.)

CAMLassert(th != NULL);
Active_thread = th;
Caml_state->current_stack = th->current_stack;
Caml_state->tls_state = th->tls_state;
Member

I think that since Caml_state->tls_state is registered as a generational global root, this should use caml_modify_generational_global_root.

Member

Question: shouldn't each th->tls_state field also be registered as a root?

Member

Answer: there is no need to register each th->tls_state as a root, because it is scanned during caml_thread_scan_roots. This also answers my question related to finalizers: it should be the case that the tls_state of a terminated thread is collected by the GC, as it will not be kept alive by caml_thread_scan_roots after thread termination.

@c-cube
Contributor Author

c-cube commented Dec 10, 2023 via email

@kayceesrk
Contributor

Do we need Thread.at_exit now? (I suspect that we do.)

(opening a can of worms)

Where would you want to have the Thread.at_exit? I suspect in the Thread module. Note that at_exit callbacks are used by Stdlib

The format one should be a thread-local one (don't mix format buffers of different threads), and IINM, the gc one should be a domain-local one. Does that mean that we should move Thread module to Stdlib? IIUC, the hooks for saving and restoring state will not be installed unless the program uses the Thread module.

type 'a t
(** Type of a thread-local storage key *)

val make: ?split_from_parent:('a -> 'a) -> (unit -> 'a) -> 'a t
Contributor

Since I think we are free to change the API compared to the existing DLS one, a couple of suggestions:

  1. I think this would make more sense as two functions:
val make : (unit -> 'a) -> 'a t
val make_with_split : 'a -> split:('a -> 'a) -> 'a t
  2. I think it would be good to tell split whether we are splitting a new thread or a new domain. Something like:
val make_with_split : 'a -> split:(same_domain:bool -> 'a -> 'a) -> 'a t
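For concreteness, here is a hypothetical sketch of how such a two-function API could be layered over the existing Domain.DLS. The module name Tls_sketch is invented, and the ~same_domain distinction is omitted since Domain.DLS only sees domain spawns:

```ocaml
(* Hypothetical sketch of the two-function API suggested above, layered
   over the existing Domain.DLS; Tls_sketch is an invented name and the
   ~same_domain flag is omitted because Domain.DLS only distinguishes
   domain spawns. *)
module Tls_sketch : sig
  type 'a t
  val make : (unit -> 'a) -> 'a t
  val make_with_split : 'a -> split:('a -> 'a) -> 'a t
  val get : 'a t -> 'a
end = struct
  type 'a t = 'a Domain.DLS.key
  let make init = Domain.DLS.new_key init
  let make_with_split v ~split =
    Domain.DLS.new_key ~split_from_parent:split (fun () -> v)
  let get k = Domain.DLS.get k
end

let () =
  let k = Tls_sketch.make_with_split 10 ~split:(fun v -> v + 1) in
  assert (Tls_sketch.get k = 10);
  let d = Domain.spawn (fun () -> Tls_sketch.get k) in
  (* the child received the split of the parent's value *)
  assert (Domain.join d = 11)
```

Note that this layering inherits the questions raised below about what value pre-existing domains should see.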

Member

@lpw25 I'm not sure what you mean by the 'a parameter of make_with_split. Maybe your intention is that this is the initial value on the main domain, that new domains will split from. But note that at the point where we call make_with_split, there may be several domains running already; what is the behavior you have in mind in this case?

With the current API, any domain that was created before the key will call the (unit -> 'a) initializer to get its value, lazily, on the first key access. (Note that this (unit -> 'a) thunk could include some subtle logic to derive those values from a common one. The only thing it cannot reconstruct is the parent-child relationship between existing domains.)

Some possible approaches that one may have in mind:

  • All existing domains get assigned the same 'a value. I don't think this works, especially in the case where 'a has mutable state, as the result could be very racy.
  • The initial domain has the 'a value. Other pre-existing domains initialize their value by splitting the initial-domain value on the first access. This is probably the best we can do with the API you propose, but note that for non-pre-existing domains the split function is called on the parent, not on the child, and here we would call it on the child. I don't think this really works well. (In particular, the split function may mutate the parent's state, which is what the PRNG split does, and calling it from the child with the parent's value is very racy again.)
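The lazy first-access behavior of the current API (no split) can be illustrated concretely; this is illustrative code, not part of this PR:

```ocaml
(* Illustration of the current Domain.DLS semantics: without
   split_from_parent, every domain initializes its value lazily, on first
   access, by calling the (unit -> 'a) thunk; the parent's mutations are
   not inherited by the child. *)
let key = Domain.DLS.new_key (fun () -> ref 0)

let () =
  Domain.DLS.get key := 42;                               (* parent's copy *)
  let d = Domain.spawn (fun () -> !(Domain.DLS.get key)) in
  assert (Domain.join d = 0);    (* the child ran the initializer afresh *)
  assert (!(Domain.DLS.get key) = 42)  (* the parent's copy is untouched *)
```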

Member

Note: I am reopening the discussion of DLS API in #13356 and would be happy to incorporate @lpw25's suggestion, but I am currently blocked by the interrogations above.

Contributor

I wrote that a while ago, but I think I was suggesting that if you use a split, it should run eagerly for existing domains, starting from the initial domain and following the tree structure of their creation. This gives you a consistent semantics regardless of when you happen to call make relative to the creation of other domains. Perhaps the cost of running the split functions eagerly outweighs the benefit, but I would have thought that the cases that want split are more likely to be run at top level, before the creation of any additional domains.

Separately from that part of the suggestion, I really think you should separate make_with_split from a make without it. Providing the split parameter fundamentally changes the behaviour of the function, which is not a good use of optional arguments. Another way to put it is that there is no value of split_from_parent that reproduces the behaviour of providing no split_from_parent, and I think that is an API smell.

Member

@gasche gasche Aug 5, 2024

I think I was suggesting that if you use a split that it should run eagerly for existing domains starting from the initial domain based on the tree structure of their creation.

This would be tricky to implement as those existing domains are already running concurrently.

I guess that we could interrupt domains in turn following the spawning tree, so that they call the split function asynchronously at the next poll point. First we interrupt domain 0, it splits a value for each domain that it spawned, and then in turn it interrupts those domains if they themselves have children, etc. We can do this asynchronously, only DLS.get operations need to block until the value for the current child domain has been computed.

I'm not sure what to do if a domain in the spawn tree has already terminated. (I don't think we currently impose that child domains terminate before their parents, only that all domains are eventually joined.) We could reasonably decide that its own parent (the grand-parent of the domain we are spawning) is then in charge of computing the split value. (I think the behavior is less well-behaved if the child itself does it.)

Contributor

Sorry, I meant that the call to make would immediately run split to get the values for all the existing domains, and then set them. This ensures that the splitting is all going to be thread safe. However, it does mean that split will not always be run on the parent thread, so if you use Thread_local_state.get from within a split function you won't always receive the values for the parent thread. Perhaps that is enough of a downside not to do it this way?

I find it a little difficult to judge the split API without a more precise understanding of the nature of its use cases. Are there use cases for it beyond the splittable PRNG? If not, should we tailor it more specifically to that case?

@gasche
Member

gasche commented Jul 30, 2024

I looked at Domain.at_exit usage in the wild. Except for the in-compiler uses which KC audited at #12724 (comment) , the two use-cases I found are from @polytypic:

Both use-cases follow the same pattern. Those libraries create a per-domain worker thread, meant to service all the threads of that domain (without cross-domain synchronization costs). The picos_select worker thread services select calls, and the domain_local_timeout worker thread services set_timeoutf calls. In both cases Domain.at_exit is called to stop the worker thread when the domain exits.
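The shared pattern can be sketched as follows; this is illustrative only, since the real libraries spawn a service thread and drain a request queue, which is elided here:

```ocaml
(* Sketch of the per-domain worker pattern: code running in each domain
   registers a Domain.at_exit callback that shuts the domain's worker
   down when the domain exits. The actual request servicing is elided. *)
let worker_stopped = Atomic.make false

let start_worker_for_current_domain () =
  (* ... spawn the per-domain service thread here ... *)
  Domain.at_exit (fun () ->
      (* signal the worker to stop (and join it, in a real library) *)
      Atomic.set worker_stopped true)

let () =
  let d = Domain.spawn start_worker_for_current_domain in
  Domain.join d;
  assert (Atomic.get worker_stopped)  (* the callback ran at domain exit *)
```

Domain.join synchronizes with the domain's termination, so after the join the at_exit callback is guaranteed to have run.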

@gasche
Member

gasche commented Aug 2, 2024

I have opened a new PR that goes back to providing a TLS module separate from Domain.DLS (and avoids the awkward question of confusing naming or a painful transition of existing code), and/but tries to share both the interface and the implementation of the two modules: #13355 .

(I think that it would be premature to declare that the present PR is superseded by that new PR, given that they represent two fairly different designs and people may reasonably decide to prefer the one here.)

@c-cube
Contributor Author

c-cube commented Sep 2, 2024

replaced by #13355

@c-cube c-cube closed this Sep 2, 2024
@c-cube c-cube deleted the make-dls-thread-local branch September 2, 2024 00:46