Fix locking bug with custom runtime events#12900
Conversation
|
I tested this and it fixes #12897 for me - thanks! |
|
Great, thanks! Now this needs a review. (I marked this "5.2" as this is a bugfix.) |
|
This can still deadlock if a signal handler emits an event. Here's a test-case:

```ocaml
let unit =
  let encode _buf () =
    print_endline "Start encoding event";
    Unix.sleep 2;
    print_endline "Finished encoding event";
    0
  in
  let decode _buf _len = () in
  Runtime_events.Type.register ~encode ~decode

type Runtime_events.User.tag += My_event

let my_event = Runtime_events.User.register "event" My_event unit

let handle_signal _ =
  print_endline "Signal handler called; writing trace event...";
  Runtime_events.User.write my_event ()

let () =
  Runtime_events.start ();
  Sys.set_signal Sys.sigalrm (Signal_handle handle_signal);
  ignore (Unix.alarm 1 : int);
  Runtime_events.User.write my_event ()
```

The output is:

I think that if the mutex is locked then it should just create a fresh buffer rather than waiting. (possibly it could |
|
My previous version was using something like you suggest, but I replaced it with a mutex. My worry was that the cache-stealing design that you proposed would silently degrade performance for programs that arguably are using runtime events incorrectly, and that a clean failure is better. Is it supposed to be correct to emit a custom runtime event from inside a signal handler? |
I don't know. According to @OlivierNicole #12840 (comment):
So I guess it depends on the definition of "etc". But it seems reasonable to support it, given that doing so is actually simpler than using a mutex. Having your service crash after a few hours with a mysterious and difficult-to-reproduce "Resource deadlock avoided" doesn't seem helpful to me, especially when reusing the buffer is only an optimisation anyway. |
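A minimal sketch of the non-blocking fallback being suggested here (`cache_mutex` and `cached_buffer` are hypothetical names for illustration, not code from the PR):

```ocaml
(* Sketch only: [cache_mutex] and [cached_buffer] are hypothetical. *)
let cache_mutex = Mutex.create ()
let cached_buffer = Bytes.create 1024

let with_write_buffer consumer =
  if Mutex.try_lock cache_mutex then
    (* The cached buffer is free: use it, releasing it afterwards. *)
    Fun.protect
      ~finally:(fun () -> Mutex.unlock cache_mutex)
      (fun () -> consumer cached_buffer)
  else
    (* Already held, e.g. we re-entered from a signal handler:
       fall back to a fresh buffer instead of blocking. *)
    consumer (Bytes.create 1024)
```

The key point is that `Mutex.try_lock` never blocks, so a re-entrant write can never deadlock; reusing the cached buffer remains only an optimisation.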
|
Okay, I'll move to a more permissive version. Thanks! |
|
Thanks! Another issue is that we now allocate the buffer in all cases, whereas before the C code checked
|
|
I pushed a new version that uses a stack of buffers (instead of just one protected by a mutex) on each domain, and also implements your suggestion of a fresh-buffer fallback. Can you torture it with your signal handlers? |
|
I guess this relies on OCaml not switching threads between the …

(I would guess one buffer per domain would be OK. It should be very unusual that we switch threads or handle a signal while writing an event, so I suspect optimising for this isn't useful.) |
My mental model: allocations, function calls/returns and backward jumps may poll. Matching or reference assignments do not introduce poll points.
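The mental model above can be illustrated on a list-pop helper (an illustration, not a language guarantee):

```ocaml
(* Illustration only: under the mental model above, loads, stores and
   pattern matches do not poll, while allocations and function
   calls/returns may. *)
let pop buffers =
  match !buffers with   (* dereference + match: no poll point *)
  | [] -> None          (* [None] is immediate: no allocation *)
  | b :: bs ->
      buffers := bs;    (* reference assignment: no poll point *)
      Some b            (* allocates a [Some] block: may poll *)
```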
I am not sure. If you do have a single-domain, multi-threaded program that uses runtime events at a fine enough granularity, it may occur regularly that two threads race to write a runtime event if the serialization function allocates something. (I am especially wary that the result could be a silent but noticeable degradation in performance, which would then be fairly painful for the user to track down and understand, with no actual way to fix it.) |
|
OK then. The new version passes the signal test-case above. |
cc64ce6 to 4060e0b (force-push)
|
Great, thanks for testing! This still needs a review. |
|
It’s the next item on my list, but I can’t get to it before Monday. |
OlivierNicole
left a comment
I believe this fixes the bug and is probably correct, but the fact that it’s OK to mutate a linked list in a way that may drop elements—because it only affects performance and not correctness—deserves a comment in the source IMO.
```ocaml
let with_write_buffer =
  (* [caml_runtime_events_user_write] needs a write buffer in which
     the user-provided serializer will write its data. The user may
     want to write a lot of custom events fast, so we want to cache
     the write buffer across calls.

     To be safe for multi-domain programs, we use domain-local
     storage for the write buffer. To accommodate multi-threaded
     programs (without depending on the Thread module), we store
     a list of caches for each domain. This might leak a bit of
     memory: the number of buffers for a domain is equal to the
     maximum number of threads that requested a buffer concurrently,
     and we never free those buffers. *)
  let create_buffer () = Bytes.create 1024 in
  let write_buffer_cache = Domain.DLS.new_key (fun () -> ref []) in
  fun consumer ->
    let buffers = Domain.DLS.get write_buffer_cache in
    let buf =
      match !buffers with
      | [] -> create_buffer ()
      | b :: bs -> buffers := bs; b
    in
    Fun.protect ~finally:(fun () -> buffers := buf :: !buffers)
      (fun () -> consumer buf)
```
So if I understand correctly your intent:

- This implementation is not vulnerable to "`Domain.DLS` is not thread-safe" (#12677) because the worst that can happen is that, if two threads on the domain run `Domain.DLS.get` concurrently, and this is the first time `Domain.DLS.get` is being run on that domain, then they will get different mutable cells; but this will only cause the drop of an allocated cache, which merely causes unnecessary allocation next time, but no functional bug;
- Thread switches may happen on safe points such as the allocations in `create_buffer ()` and `buf :: !buffers`, and as a result elements of the list can be dropped. But likewise, this causes at worst an additional buffer allocation (and we expect it to be rare).
Is it an accurate description of why you think the code is thread-safe?
I think we should fix #12677, and proposed #12889 to do so. The code here is not meant to be robust against #12677.
Your point on the fact that `buf :: !buffers` allocates is something that I did not have in mind, thanks! The code is functionally correct for the reason you mention -- it is an issue with this caching stuff that most ways to get it wrong will still pass the tests. I believe that I could:
1. Pretend that there is no issue and write nothing about that. Very simple!
2. Write a lengthy comment on why dropping buffers is fine here because it never happens in practice; I don't like it. Annoying.
3. Write a CAS loop to push the buffer back, which is more code and fewer comments.
I went with (3) in a new commit in this PR, could you have a look?
In fact I believe that the code with `buffers := buf :: !buffers` is not correct because, in addition to sometimes dropping buffers, it could also re-add to the stack some buffers that have been stolen by a thread and not been given back yet. This could result in two threads racing to serialize into the same buffer, which is incorrect.
Yes, that does seem possible.
4060e0b to 3b27201 (force-push)
```ocaml
let rec push buffers buf =
  (* intended to be thread-safe *)
  let cur_buffers = !buffers in
  let cons = buf :: cur_buffers in
  (* retry if !buffers changed under our feet: *)
  (* begin atomic *)
  if !buffers == cur_buffers then
```
I fear that the compiler might perform CSE, making this safeguard ineffective.
I stand corrected, it looks like allocations cause equations over mutable loads to be dropped.
What you should worry about here is that the compiler is 100% allowed to move the allocation into the branch (after the test).
You can use Sys.opaque_identity around the allocation to prevent that, but in the end there is no specification that would prevent the compiler from inserting a poll between the equality check and the assignment so you can't be provably safe.
You know better than me about this, but it is not easy to work with the fact that there is no documentation of what "the compiler is allowed to do" -- I agree with you that having more precise semantics for this would be useful. Is this allowed in Flambda land? I would have thought that `!buffers` (in the test `!buffers == cur_buffers`) is considered as having coeffects, and that `buf :: cur_buffers` is considered as having generative effects, and that the two thus cannot be permuted -- but I lost myself trying to check the implementation for this.
> Is this allowed in Flambda land?
Technically yes, our semantics would allow this. But I don't think we have implemented any transformations that would make use of that in Flambda 1.
> ... is considered as having generative effects, and that the two thus cannot be permuted
Generative effects are a different category of effects. The only generative effects we track are memory allocations, and we do not track co-effects in this category, so they are allowed to be permuted with everything. Other examples of operations that could have generative effects only could be random number generation, if that was a language builtin, or monotonic counters such as caml_fresh_oo_id.
If we wanted to accurately model the runtime execution we could give memory co-effects to everything that interacts with the GC (polls and C calls, typically), but that would also mean that we wouldn't be allowed to remove any allocation if there might be a poll later in the program (just like you cannot remove an assignment to a reference if there might be a read from it later). So not very useful for an optimiser.
My opinion is that this code should use atomic cells instead of references if consistency is required. If not (i.e. if we can deal with dropping a few bits of data from time to time), then `let push buffers buf = buffers := buf :: !buffers` will work.
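A sketch of the atomic variant being suggested (illustrative: the per-domain cache would then hold a `Bytes.t list Atomic.t` instead of a `ref`):

```ocaml
(* Illustrative sketch: a lock-free push on an [Atomic.t]-held list,
   retrying on contention (a Treiber-stack-style loop). *)
let rec push (buffers : Bytes.t list Atomic.t) buf =
  let cur = Atomic.get buffers in
  if not (Atomic.compare_and_set buffers cur (buf :: cur)) then
    push buffers buf
```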
> Moving a memory read across a poll point is not valid in my book
You wouldn't necessarily move anything. Typically:
```ocaml
let original () =
  let block = allocate () in
  let field = read () in
  (block, field)

let reified () =
  let block = allocate () in
  let field = read () in
  (* Add a new allocation just before use *)
  let reified_block = allocate () in
  (reified_block, field)

let simplified () =
  (* Remove the old, unused allocation *)
  let field = read () in
  let reified_block = allocate () in
  (reified_block, field)
```

This is basically how float/int64/... unboxing works, and it produces the same results as moving the allocation but without actually moving anything.
Note that the ability to replace one allocated block with another of the same shape is very important for various other optimisations too, such as constant lifting and sharing of structurally equal (immutable) values.
> (... if it is aware of the implicit placement of poll points)
I believe that poll points were designed as a low-level mechanism to ensure we have bounds on the time spent without synchronisation. My opinion is that as we did not add anything to the language that relates to critical sections or polling primitives, users should not expect to be able to reason about poll points at the source level (in fact, I believe that what allocates or not should be left unspecified in source-level semantics).
I'm all for having new specific language constructs for reasoning about critical sections though; that would solve most of these problems, and it would not be too hard to correctly fit that into the existing compiler code.
> My opinion is that this code should use atomic cells instead of references
Mph, I don't want to do this because the point is precisely to avoid any synchronization bottleneck between runtime_event users. I use Domain.DLS to avoid synchronization, I don't want to use Atomic for single-domain multi-threaded programming.
I reformulated the code using an explicit `compare_and_set` operation that is annotated `[@poll error]` (this implies `[@inline never]`); does that approach look correct to you?
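Such a poll-free compare-and-set could look roughly like this (a sketch, not necessarily the PR's exact code; note that the cons cell must be allocated outside the `[@poll error]` function, since allocation is itself a poll point):

```ocaml
(* Sketch: [@poll error] makes the compiler reject any poll point in
   the body, so no thread switch or signal handler can run between
   the read of [r] and the write to it. *)
let[@poll error] compare_and_set r seen v =
  if !r == seen then (r := v; true) else false

(* Allocate the new cons cell *before* the poll-free section, then
   publish it, retrying if another thread got there first. *)
let rec push buffers buf =
  let cur = !buffers in
  let cons = buf :: cur in
  if not (compare_and_set buffers cur cons) then push buffers buf
```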
It looks more robust (although I don't consider the [@poll error] mechanism as foolproof), but I wouldn't be surprised if it ends up more expensive than regular atomic operations.
I'll take "expensive like a direct function call" over "complex stuff that might create weird bottlenecks on relaxed machines and work fine everywhere else" any day.
a937e43 to 8b8b5d8 (force-push)
|
My understanding is that, after some iteration and with the help of @lthls, I have put the code into a state that both @OlivierNicole and @lthls should be okay with. For now I shall wait for the review process to run its course. |
8b8b5d8 to 091934f (force-push)
|
The new code looks correct to me. Do we want to add @talex5’s bug reproducers (the initial one and the one with a signal handler) as regression tests? |
|
I'm not fond of the initial test, it's a torture test that seems likely to break on some weird CI machines. The new test is fine, but I'm not sure what to test about its output -- just that it runs normally? I feel lazy about this. Would you by chance be willing to propose an integration of the test on top of my PR branch, that I could cherry-pick into the PR? |
|
Sure, I added it in OlivierNicole@63cf577. I can’t swear that it will pass on absolutely all OSes and CI configurations. |
|
Thanks! I cherry-picked your commit. |
|
(Note: if/when you are feeling like approving the PR, you should do this explicitly.) |
OlivierNicole
left a comment
My bad, looks like SIGALRM is not available on Windows. We should at least disable this test on Windows (see OlivierNicole/ocaml@af432c2f0d). Otherwise, LGTM.
81dcf4d to 223dcda (force-push)
|
I started a precheck build to test the test on our various CI machines. This still needs an approval from a maintainer to be merged. (From their own review or on behalf of @OlivierNicole.) |
|
Precheck did find a failure on ocaml-linux-64, on the test |
|
I noted down the failure for later because it’s a worrying one, but I too can’t see any link with this PR. |
|
This PR still needs to be approved by a maintainer to be merged. It aims for 5.2. |
damiendoligez
left a comment
Looks correct, although I'm left wondering whether it would have been simpler to implement the buffer-pooling logic in C rather than OCaml.
|
Another remark: will you revisit this code when we switch from DLS to TLS? |
Should we? Yes. |
|
Thanks for the review!
It may be easier to write, but also easy to get wrong in subtle ways. Currently I agree that it is easier to write thread-safe code in C than in OCaml, but I think that it is worth trying the hard stuff a bit to understand what we need and improve the situation there. |
|
I had a discussion with @damiendoligez on how to implement this directly in C. The initial reason why I did it on the OCaml side is that I intended to take a lock -- and taking a lock from OCaml is safer than taking a lock from C, precisely for the reasons mentioned in the issue report. But locks don't work well in re-entrancy scenarios so we gave up on that, and doing it in C would be fine. But then we still have to decide on how to pool our buffers -- do we keep at most one buffer, or a stack of buffers, or a per-thread buffer, or a per-thread stack of buffers? with pretty much the same tradeoffs and the same complexity. So in the end we agreed to merge as-is and work on something else. |
Suggested-by: Thomas Leonard <talex5@gmail.com>
223dcda to b6c277b (force-push)
|
Another reason to do it in OCaml is that it's not obvious we want a buffer pool anyway; that was just to preserve the current API. For example, if I want to trace a string then it would be more efficient to pass the string directly, rather than having C code give me a buffer that I have to copy it into first, only to have the C code copy it again to the ring. |
|
I don't disagree (we discussed the idea that maybe the custom-runtime-events API should evolve based on your EIO feedback, and I think it's a good idea), but note that the PR code still imposes its buffer pool. I think that you mean that the resulting code is slightly closer to the final API you have in mind, where the user serializes on their own and only calls the runtime-event logic after that. For now we take a buffer from the OCaml side, but then we call the user serializer from C. |
Fix locking bug with custom runtime events (cherry picked from commit 84bab61)
This is a proposal to fix #12897.