Conversation
Regarding the module names, would it be better to simply rename …
gadmm
left a comment
Some comments below, perhaps related to the crash.
This is a good idea. I have also been thinking that DLS is not an appropriate name for something that's thread local. Also, I did wonder whether the acronym should be dropped in favour of a longer name …
otherlibs/systhreads/st_stubs.c
Outdated
```c
CAMLassert(This_thread != NULL);
caml_thread_t this_thread = This_thread;
this_thread->current_stack = Caml_state->current_stack;
this_thread->dls_state = Caml_state->dls_root;
```
Should both dls_state and dls_root be renamed to the same thing, similar to other fields? One option is tls_state or tls_root.
otherlibs/systhreads/thread.ml
Outdated
```ocaml
external id : t -> int = "caml_thread_id" [@@noalloc]
external join : t -> unit = "caml_thread_join"

type dls_state = Obj.t array
```
All uses of dls should be renamed to tls?
gadmm
left a comment
I have a small suggestion below to avoid making create_dls public and to remove a (preexisting) bug.
Since we're here to talk about implementing TLS, and to solve issues of thread-safety (etc.): one realistic usage (and one which I would like to rely on) consists in being able to access TLS from async callbacks. Think profiling, inside a memprof callback. The guarantee needed is rather weak and is already satisfied by the current implementation: the user accesses inside async callbacks only already-initialised TLS fields (or ones with a dummy value). Could this guarantee be documented?
It looks like the wording of the documentation in domain.mli should be updated. In particular I am curious about what effect …
I'd be happy to work more on the documentation, if there is a general agreement that this PR is going in the right direction (and that #12719 isn't — I can close it). Regarding …: because threads are much more lightweight than domains (and likely to be created in great numbers) this might be a bit more of a performance issue, I think. What do people think?
The cost is proportional to the number of TLS slots, isn't it? Thread creation is cheaper than domain creation but is still an expensive operation. Is there a realistic example where the cost of TLS initialisation is the bottleneck in thread creation? If we don't have such an example, it may be ok to go ahead with the same API. See the documentation at lines 113 to 128 in b4308a0.
Ideally one would want to initialize the value for the child thread on demand, when it is first requested. The problem is that deriving the value may require computations and state updates on the parent thread. (For random for example, splitting the PRNG state in two mutates the parent state to produce the child state.) If we try to do this on-demand we would have to synchronize the child and the parent thread again, which would be complex and costly, and may not even be possible if the parent thread is terminated at that point. It is much simpler to perform the parent-side computation eagery on thread creation, as the parent and child are synchronizing at this point anyway. The documentation that @kayceesrk is pointing out explains how to perform the parent-side computation eagerly on thread creation, but wrap any child-side computation in a lazy thunk to be forced later. Your point remains that thread creation has to pay a cost that is linear in the number of TLS keys with inheritance. That is true, but that seems to be the simplest option available for the provided expressivity, and as KC pointed out we are not aware of performance issues arising from this particular design choice yet. |
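As a concrete illustration of this pattern, here is a hedged sketch written against the existing Domain.DLS API (which this PR generalises to threads). The counter is a stand-in for parent-side state such as a PRNG; none of this is code from the PR.

```ocaml
(* Sketch of the eager-parent / lazy-child pattern discussed above,
   using today's Domain.DLS. [counter] stands in for parent-side
   state such as a PRNG that must be updated by the parent. *)
let counter = ref 0

let key : int Lazy.t Domain.DLS.key =
  Domain.DLS.new_key
    ~split_from_parent:(fun _parent ->
      (* runs eagerly on the parent, at spawn time *)
      incr counter;
      let snapshot = !counter in
      (* child-side computation is deferred in a lazy thunk,
         forced on the child's first access *)
      lazy (snapshot * 100))
    (fun () -> lazy 0)

let get () = Lazy.force (Domain.DLS.get key)
```

With this shape, spawning only pays the parent-side split per key; any expensive child-side work runs on the child's first access.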
To put the costs in perspective: … Edit: more precise ratios are 0.44% for macOS on an M1 and 0.37% for Linux on an AMD Zen4. So, the cost of …
I'd be happy to keep helping push this PR, but I won't be the one to say whether DLS should keep coexisting with TLS or not. I take the continued discussion by various people as a sign of interest in this approach, but let's help @c-cube by saying so explicitly. (There is the point about races in the current implementation of DLS—I do not make it an argument because the masking proposal is also an efficient way to make the current approach sound.)
I am happy to support this proposal. From the discussion in #12719, it is not clear whether we need DLS at all if we have TLS. See this supporting comment #12719 (comment). There is not much use of DLS in the wild and the breaking change is ok today. The risk with switching DLS to TLS is whether there are uses that we aren't seeing today which may necessitate a DLS. In the absence of such a need for DLS, I am happy to move forward with replacing this with TLS as (a) TLS seems more appropriate and (b) it fixes races observed in #12677.
@gadmm has this been written down? It would be useful to see how far we are from usable masking. In particular, what does it mean for callbacks. (We should probably discuss the details of this elsewhere so as to not digress.)
For what it's worth, I very much like the idea of turning domain-local storage to thread-local storage, so that "it just works" across domains, threads, and combinations of both, just like mutexes, condition variables and semaphores "just work". If I remember correctly, this idea was briefly mentioned while we were planning the Multicore OCaml -> OCaml 5 merge. (I do remember a Zoom discussion during a stay-at-home order, so it must have been in 2020.) Nobody was opposed, but nobody wanted to work on it at that time, since there were so many other things to do before. Now is a good time to revisit this idea.
Alright, then it seems to me that what I need to do is:
Meeting a few issues to try and add … I've updated …
You'll need to add the new module to stdlib/StdlibModules:

```diff
diff --git a/stdlib/StdlibModules b/stdlib/StdlibModules
index efd654069e..14b6714717 100644
--- a/stdlib/StdlibModules
+++ b/stdlib/StdlibModules
@@ -70,6 +70,7 @@ STDLIB_MODULE_BASENAMES = \
   mutex \
   condition \
   semaphore \
+  camlinternalTLS \
   domain \
   camlinternalFormat \
   printf \
```

Seems to work.
I'm not exactly sure why this is kept in the … Also why not stick to the usual conventions? This should rather be:

```ocaml
module Thread.Key : sig
  type 'a t
  val make : (unit -> 'a) -> 'a t
  val get : 'a t -> 'a
  val set : 'a t -> 'a -> unit
end
```

Which incidentally brings us nearer to …
Would this not make it necessary to link threads in order to use TLS even if the program only uses domains?
I thought all modules of the threads library had migrated to the stdlib (…). In any case I'd suggest not multiplying the names, avoiding obscure acronyms, and sticking to the conventions, as I already suggested in the past.
When linked with threads, there is some extra overhead to save and restore runtime state, which is unnecessary when the program only has one thread. It is possible that we can do this lazily at the point when the first thread is created, but this is a bit of extra work. In addition, if the Thread module were moved to Stdlib, then there is no need for the thread library to exist. This may need some changes for the CLI tools (ocamlc and ocamlopt would no longer need …).
I also mentioned something similar in #12724 (comment). It would be useful to avoid abbreviations. Another alternative would be to move the module out of Domain and have it directly under Stdlib as …
Ok, this now uses common code for initialization in both domain.ml and thread.ml, thanks to a new internal module (camlinternalTLS). Note: I did not rename …
Can you mark the PR as ready for review, please?
```ocaml
| None -> f
| Some old_exit -> (fun () -> (f (); old_exit ()))
in
at_exit_callbacks := IMap.add domain_id new_exit m)
```
We do not know much about the future usages of Domain.at_exit so we don't know whether the mutex is expensive. Do note, though, that it is fairly straightforward to avoid holding the mutex in most cases (more details on request).
Minor detail, this change fixes a thread-unsafety of Domain.at_exit of current trunk.
Thanks. Does the thread-unsafety of Domain.at_exit have an issue on ocaml/ocaml?
Do note, though, that it is fairly straightforward to avoid holding the mutex in most cases (more details on request).
If it is very straightforward, I am happy to hear. With that said, do you think we can have it as a separate PR? We've spent quite a bit of time on this PR and #12719 already. I'm keen to get this done and dusted as a first cut.
> Thanks. Does the thread-unsafety of Domain.at_exit have an issue on ocaml/ocaml?
Not that I know of, I just noticed it while reviewing.
> If it is very straightforward, I am happy to hear. With that said, do you think we can have it as a separate PR? We've spent quite a bit of time on this PR and #12719 already. I'm keen to get this done and dusted as a first cut.
Sure, at the same time we can check if we need a "yet to run" flag as for Stdlib.at_exit. In this case I think this commit is good to go.
> we can check if we need a "yet to run" flag as for Stdlib.at_exit. In this case I think this commit is good to go.
sounds good.
The pending tasks that I planned to do are done: #12724 (comment). The PR is ready IMO. This is a stdlib change and requires at least two approvals from maintainers(?).
gasche
left a comment
I looked at the code, which seems fine overall. Two broader remarks:
- We use Domain.at_exit in conjunction with Domain.DLS.new_key, to clean up keyed values. Do we need Thread.at_exit now? (I suspect that we do.)
- I reviewed the Thread_local_storage implementation to convince myself that it is thread-safe. (It is easier to have a thread-safe TLS than a thread-safe DLS, as we know that no other thread uses the same indices.) I believe that I found various issues related to re-entrancy, that could arise not just due to threads, but also in single-threaded programs. One issue can only happen with asynchronous callbacks (running on poll points), so maybe we could decide to ignore it for now, but there are also bugs that (I think) can be observed when the initializer passed to Thread_local_storage.make itself accesses TLS values or registers new keys, and those I think we should fix if we can -- or document the restrictions clearly.
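To make the re-entrancy concern concrete, here is a hedged example written against today's Domain.DLS (the PR's make initializers have the same shape): the initializer of one key performs a storage access itself, so a get runs inside a get, and the outer get may hold a stale storage array if the inner access triggers a resize.

```ocaml
(* The kind of re-entrant initializer at issue, written against
   today's Domain.DLS (not code from the PR). The initializer of
   [k2] itself reads another key, so [get k2] runs [get k1], and
   possibly a storage resize, while the outer [get] is in progress. *)
let k1 = Domain.DLS.new_key (fun () -> 1)

let k2 =
  Domain.DLS.new_key (fun () ->
    (* re-entrant access: may grow the per-domain storage array while
       the outer [get] still holds a reference to the old array *)
    Domain.DLS.get k1 + 1)
```

On a fresh domain this particular example happens to work, because the storage array starts out large enough; the bug described above only manifests when the inner access forces a resize.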
```ocaml
let temp_file_key = Thread_local_storage.new_key (fun _ ->
  let tmp = snd (Filename.open_temp_file "" "") in
  Domain.at_exit (fun () -> close_out_noerr tmp);
  tmp)
```
My intuition is that this example is now broken, it should use a Thread.at_exit function that does not appear to currently exist? (And so it should rather be documented as a use-case for Thread.at_exit than Domain.at_exit.)
The example isn't broken (well it is, since there is a typo). A temporary file descriptor will now be created per thread, which will be closed when the domain that the thread is running on exits. It is not what you would want ideally, but the file does get closed.
It may be useful to have a Thread.at_exit, but that's a conversation that I'd like to not have in this PR.
(I think the typo is that new_key should now be make.)
When I wrote the comment I thought that the domain would only close one file descriptor on exit (which is broken), but in fact the cumulative behavior of Domain.at_exit (adding a new callback instead of replacing the callback) means, as you point out, that all descriptors will be closed on domain termination.
I still consider that this example is not right, however. Consider in particular the case of a single domain that regularly spawns short-lived threads (to execute a blocking call, for example). What the user needs is to be able to keep a number of file descriptors alive that corresponds to the number of threads that are alive. Instead the number of file descriptors open will grow in an unbounded way until program termination. This is not just "no ideal", in practice it is not acceptable for this workload.
Remark: putting a finalizer on the value produced by the initialisation function would work better¹ than using Domain.at_exit. If we don't want to have Thread.at_exit, maybe we should rewrite the example to use a finalizer?
¹: I assume that thread termination clears the TLS array of the thread, but I am not sure that this is the case with the current PR.
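A hedged sketch of that finaliser-based variant, written with today's Domain.DLS so it is self-contained (the same shape would apply to the PR's TLS keys, under the assumption stated in the footnote that a terminated thread's storage becomes collectable):

```ocaml
(* Sketch: close the per-thread temporary file via a finaliser on the
   channel, instead of accumulating callbacks in Domain.at_exit. The
   channel is closed when its storage entry becomes unreachable,
   e.g. after the owning thread terminates and its storage is
   collected. Written against today's Domain.DLS for concreteness. *)
let temp_file_key =
  Domain.DLS.new_key (fun () ->
    let tmp = snd (Filename.open_temp_file "" "") in
    (* out_channel is a custom block, so Gc.finalise accepts it *)
    Gc.finalise close_out_noerr tmp;
    tmp)
```

This bounds the number of live descriptors by the number of live threads, at the cost of relying on GC timing for the actual close.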
```
always happens, regardless of whether the child thread or
domain will use it.
If the splitting function is expensive or requires
child-side computation, consider using ['a Lazy.t key]:
```
I think that it would also be nice to provide a code example of accessing Domain.id to determine whether we are spawning a new domain, or a new thread on the same domain.
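One possible shape for such an example, hedged: the split function runs on the parent, so it can record the parent's domain id eagerly and let the child compare it with its own id in a lazy thunk. Written against today's Domain.DLS (where the ids always differ); under this PR's TLS they would be equal for a thread spawned on the same domain.

```ocaml
(* Sketch of the suggested documentation example: capture the parent's
   domain id in the split function (which runs on the parent), and
   compare it with the child's own id on first access. *)
let key : string Lazy.t Domain.DLS.key =
  Domain.DLS.new_key
    ~split_from_parent:(fun _parent ->
      let parent_id = (Domain.self () :> int) in  (* runs on the parent *)
      lazy (
        if (Domain.self () :> int) = parent_id
        then "spawned on the same domain"  (* a thread, under this PR *)
        else "spawned on a new domain"))
    (fun () -> lazy "initial")

let describe () = Lazy.force (Domain.DLS.get key)
```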
stdlib/camlinternalTLS.ml
Outdated
```ocaml
let v = st.(idx) in
if v == unique_value then
  let v' = Obj.repr (init ()) in
  st.(idx) <- (Sys.opaque_identity v');
```
This implementation is still incorrect due to re-entrancy issues: if init () in turn calls maybe_grow, st could be wrong at this point. I think that the following would work better:
```ocaml
let v' = Obj.repr (init ()) in
(* the tls array may have been resized by [init ()],
   call [maybe_grow] again to get the up-to-date array. *)
let st = maybe_grow idx in
st.(idx) <- Sys.opaque_identity v';
```
This and other similar issues you pointed out are not due to this PR's change -- making DLS thread-local. These issues are present on trunk. I would prefer if these unrelated issues do not delay this PR getting in. There are already many related issues that need sorting out.
We should move this and others to a separate issue. What do you think?
Fixed this one as suggested.
Note that I am happy to defer those three reentrancy issues to a later PR as you suggested. They make the code more complex and would probably deserve an independent review.
This one is straightforward. So picked this change here. Other ones are more complicated, and I'd like to defer those to a separate issue.
```ocaml
let new_sz = compute_new_size (max 8 sz) in
let new_st = Array.make new_sz unique_value in
Array.blit st 0 new_st 0 sz;
set_tls_state new_st;
```
Nitpick: this implementation is incorrect in presence of asynchronous callbacks -- code executed on allocation and poll points. The call to Array.make may call asynchronous code that would maybe_grow and initialize and define new keys that are missing from st.
This is an obscure scenario and we may want to leave things as-is if there is no easy fix. But I think that there would be an easy fix, due to the fact that the tls array only grows and never shrinks:
```ocaml
let new_sz = compute_new_size (max 8 sz) in
let new_st = Array.make new_sz unique_value in
(* the [new_st] allocation may result in an asynchronous resize of the
   tls_state array: continue from the current array. *)
let cur_st = get_tls_state () in
let cur_sz = Array.length cur_st in
if idx < cur_sz then cur_st
else begin
  assert (cur_sz <= new_sz);
  Array.blit cur_st 0 new_st 0 cur_sz;
  set_tls_state new_st;
  new_st
end
```

(Note: I am bad at reasoning about poll points, so I am not sure that there are no poll points in the new version between the binding of cur_st and the set_tls_state call. If there are poll points in there, possibly after Array.blit for example, the new version is still incorrect for asynchronous callbacks.)
```ocaml
List.map
  (fun (KI ((idx, _) as k, split)) ->
    (idx, Obj.repr (split (get k))))
  (Atomic.get parent_keys)
```
There is a reentrancy issue here if the calls to get k change the value of parent_keys by registering new keys that use inheritance.
I wrote the following not-too-horrible version that adds new keys in a loop.
```ocaml
let get_initial_keys () : (int * Obj.t) list =
  (* compute the key-value pair for the child from an initializer *)
  let kv_of_ki (KI ((idx, _) as k, split)) =
    (idx, Obj.repr (split (get k)))
  in
  (* We compute key-value pairs for all 'parent keys' in a loop,
     because the list of parent keys may grow during the
     computation. *)
  let rec loop acc last_parent_keys =
    let new_parent_keys = Atomic.get parent_keys in
    if new_parent_keys == last_parent_keys then acc
    else
      (* process all key initializers that are in [new_parent_keys]
         but not in [last_parent_keys], then loop again. *)
      let rec add_new_keys acc keys =
        if keys == last_parent_keys
        then loop acc new_parent_keys
        else match keys with
        | [] -> loop acc new_parent_keys
        | ki :: rest ->
          add_new_keys (kv_of_ki ki :: acc) rest
      in
      add_new_keys acc new_parent_keys
  in
  loop [] []
```
I wonder about the following two things:
```ocaml
let domain_id = (self () :> int) in
Mutex.protect at_exit_mutex (fun () ->
  let m = !at_exit_callbacks in
  let old_exit_opt = IMap.find_opt domain_id m in
```
I find this implementation surprisingly complex and I wonder if this is the right approach.
- Maybe we actually want to have Thread.at_exit, which would be naturally easier to implement on top of Thread_local_storage? (Note: even if we did, we probably want to keep a Domain.at_exit around with the current semantics, but that would mostly be for advanced out-of-tree experiments.)
- We discussed a way to implement DLS on top of TLS, isn't this function a good opportunity to try this? For example, it would seem fairly natural to me to implement this as a (unit -> unit) ref Thread_local_storage.t that is shared across all threads of a given domain. Am I missing something? (If we try, and it appears that DLS-on-top-of-TLS is worse than we thought, maybe that is a reason to reconsider.)
I'd like to understand what is complex with this implementation. It is a single piece of shared state at_exit_callbacks protected by a mutex.
If we want to retain Domain.at_exit, I'd rather keep the current implementation, and think about Thread.at_exit in a separate PR.
Part of the argument we used to move DLS to TLS is that it "should be easy" to implement DLS on top of TLS.
But then, on the one time where we do need an implementation of DLS inside the compiler distribution, we punt and implement a globally-locked map of domain identifiers. Is this our recommendation for users as well, that they implement a globally-locked map of domain identifiers? If not, why don't we follow our own recommendations?
I think that we should do our homework and implement Domain.at_exit as DLS on top of TLS. I see two possible outcomes:
- If it turns out to be simple and pleasant, this would be better I think than your current implementation. Plus we have an example of DLS-over-TLS implementation we can point people at. Win!
- If it turns out to be harder than we thought, or have performance downsides that we find problematic, then it suggests that our initial intuition was wrong and maybe we should reconsider or at least think harder. I would rather figure this out now than after a release.
> Part of the argument we used to move DLS to TLS is that it "should be easy" to implement DLS on top of TLS.
It was not my understanding that this was easy. But I sympathise with the push for having a DLS for Domain.at_exit.
The lazy-based implementation (#12719 (comment)) has a concurrency issue in that multiple threads on a domain may concurrently perform DLS.get, which will concurrently force the lazy value, which is not concurrency safe.
I haven't investigated the concurrent hash map + caching solution. This requires a concurrent hash map in the standard library, @gadmm? Or am I misunderstanding the solution?
If we are providing a DLS, I would prefer to implement it directly as a field in domain state, just as TLS is. We could potentially share the implementation between the two and focus on making them both concurrency-safe and reentrant.
@kayceesrk I understand the idea to get work done by pushing stuff for later, and I agree for the parts that can clearly be dealt with separately -- the reentrancy issues. On the other hand, I would want us to ensure that the new design that we propose is sound and consistent. The previous design with first-class domain-local storage was sound and consistent, but inconvenient for users. If we want to move to first-class thread-local storage, I would want to ensure that the resulting design is nice and not somewhat-broken due to various compromises. I think that, in particular, the questions around … remain open. (But then this is just my opinion. If there is a consensus among other people to say that the current state is just fine and we should go ahead, let's! I don't know that we have reached any such consensus yet.)
```c
CAMLassert(th != NULL);
Active_thread = th;
Caml_state->current_stack = th->current_stack;
Caml_state->tls_state = th->tls_state;
```
I think that since Caml_state->tls_state is registered as a generational global root, this should use caml_modify_generational_global_root.
Question: shouldn't each th->tls_state field also be registered as a root?
Answer: there is no need to register each th->tls_state as a root, because it is scanned during caml_thread_scan_roots. This also answers my question related to finalizers: it should be the case that the tls_state of a terminated thread is collected by the GC, as it will not be kept alive by caml_thread_scan_roots after thread termination.
I think it doesn't need to be, it's declared as a root in a C function somewhere in the C stubs. Like a lot of the thread state.
(opening a can of worms) Where would you want to have the …? The format one should be a thread-local one (don't mix format buffers of different threads), and IINM, the gc one should be a domain-local one. Does that mean that we should move …?
```ocaml
type 'a t
(** Type of a thread-local storage key *)

val make: ?split_from_parent:('a -> 'a) -> (unit -> 'a) -> 'a t
```
Since I think we are free to change the API compared to the existing DLS one, a couple of suggestions:
- I think this would make more sense as two functions:

```ocaml
val make : (unit -> 'a) -> 'a t
val make_with_split : 'a -> split:('a -> 'a) -> 'a t
```

- I think it would be good to tell split whether we are splitting a new thread or splitting a new domain. Something like:

```ocaml
val make_with_split : 'a -> split:(same_domain:bool -> 'a -> 'a) -> 'a t
```
@lpw25 I'm not sure what you mean with the 'a parameter of make_with_split. Maybe your intention is that this is the initial value on the main domain, that new domains will split from. But note that at the point where we call make_with_split, there may be several domains running already, what is the behavior you have in mind in this case?
With the current API, any domain that was created before the key will call the (unit -> 'a) initializer to get its value, lazily, on the first key access. (Note that this (unit -> 'a) thunk could include some subtle logic to derive those values from a common one. The only thing it cannot reconstruct is the parent-child relationship between existing domains.)
Some possible approaches that one may have in mind:
- All existing domains get assigned the same 'a value. I don't think this works, especially in the case where 'a has mutable state, as the result could be very racy.
- The initial domain has the 'a value. Other pre-existing domains initialize their value by splitting the initial-domain value on the first access. This is probably the best we can do with the API you propose, but note that for non-pre-existing domains the split function is called on the parent, not on the child, and here we would call it on the child. I don't think this really works well. (In particular, the split function may mutate the parent's state, this is what the PRNG split does, and calling it from the child with the parent's value is very racy again.)
I wrote that a while ago, but I think I was suggesting that if you use a split that it should run eagerly for existing domains starting from the initial domain based on the tree structure of their creation. This gives you a consistent semantics regardless of when you happen to call make relative to the creation of other domains. Perhaps the cost of running the split functions eagerly overrides the benefit, but I would have thought that the cases that wanted split were more likely to be run at top-level before the creation of any additional domains.
Separately from that part of the suggestion, I really think you should separate make_with_split from make. Providing the split parameter fundamentally changes the behaviour of the function, which is not a good use of optional arguments. Another way to put it is that there is no value for split_from_parent that reproduces the behaviour of providing no split_from_parent and I think that is an API smell.
> I think I was suggesting that if you use a split that it should run eagerly for existing domains starting from the initial domain based on the tree structure of their creation.
This would be tricky to implement as those existing domains are already running concurrently.
I guess that we could interrupt domains in turn following the spawning tree, so that they call the split function asynchronously at the next poll point. First we interrupt domain 0, it splits a value for each domain that it spawned, and then in turn it interrupts those domains if they themselves have children, etc. We can do this asynchronously, only DLS.get operations need to block until the value for the current child domain has been computed.
I'm not sure what to do if a domain in the spawn tree has already terminated. (I don't think we currently impose that child domains terminate before their parents, only that all domains are eventually joined.) We could reasonably decide that its own parent (the grand-parent of the domain we are spawning) is then in charge of computing the split value. (I think the behavior is less well-behaved if the child itself does it.)
Sorry, I meant that the call to make would immediately run split to get the values for all the existing domains, and then set them. This ensures that the splitting is all going to be thread safe. However, it does mean that split will not always be run on the parent thread, so if you use Thread_local_storage.get from within a split function you won't always receive the values for the parent thread. Perhaps that is enough of a downside not to do it this way?
I find it a little difficult to judge the split API without a more precise understanding of the nature of its use cases. Are there use cases for it beyond the splittable PRNG? If not, should we tailor it more specifically to that case?
I looked at … Both use-cases follow the same pattern. Those libraries create a per-domain worker thread, meant to service all the threads of that domain (without cross-domain synchronization costs). The picos_select worker thread services …
I have opened a new PR that goes back to providing a TLS module separate from Domain.DLS (and avoids the awkward question of confusing naming or painful transition of existing code), and/but tries to share both the interface and the implementation of both modules: #13355. (I think that it would be premature to declare that the present PR is superseded by that new PR, given that they represent two fairly different designs and people may reasonably decide to prefer the one here.)
replaced by #13355 |
This is an attempt at an alternative implementation to #12719 . It segfaults though, apparently the current DLS state isn't fully initialized when my new code in thread.ml is called.
edit: if we make it work and it's decided to be the one good solution, there would be some squashing to be done.