Fix locking bug with custom runtime events#12900
Conversation
|
I tested this and it fixes #12897 for me - thanks! |
|
Great, thanks! Now this needs a review. (I marked this "5.2" as this is a bugfix.) |
|
This can still deadlock if a signal handler emits an event. Here's a test-case:

```ocaml
let unit =
  let encode _buf () =
    print_endline "Start encoding event";
    Unix.sleep 2;
    print_endline "Finished encoding event";
    0
  in
  let decode _buf _len = () in
  Runtime_events.Type.register ~encode ~decode

type Runtime_events.User.tag += My_event

let my_event = Runtime_events.User.register "event" My_event unit

let handle_signal _ =
  print_endline "Signal handler called; writing trace event...";
  Runtime_events.User.write my_event ()

let () =
  Runtime_events.start ();
  Sys.set_signal Sys.sigalrm (Signal_handle handle_signal);
  ignore (Unix.alarm 1 : int);
  Runtime_events.User.write my_event ()
```

The output is:

I think that if the mutex is locked then it should just create a fresh buffer rather than waiting. (possibly it could |
|
My previous version was using something like you suggest, but I replaced it with a mutex. My worry was that the cache-stealing design that you proposed would silently degrade performance for programs that arguably are using runtime events incorrectly, and that a clean failure is better. Is it supposed to be correct to emit a custom runtime event from inside a signal handler? |
I don't know. According to @OlivierNicole #12840 (comment):
So I guess it depends on the definition of "etc". But it seems reasonable to support it, given that doing so is actually simpler than using a mutex. Having your service crash after a few hours with a mysterious and difficult-to-reproduce "Resource deadlock avoided" doesn't seem helpful to me, especially when reusing the buffer is only an optimisation anyway. |
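A minimal sketch of the non-blocking fallback being suggested here (`cache_mutex` and `cached_buffer` are hypothetical names for illustration, not code from the PR):

```ocaml
(* Sketch only: [cache_mutex] and [cached_buffer] are hypothetical. *)
let cache_mutex = Mutex.create ()
let cached_buffer = Bytes.create 1024

let with_write_buffer consumer =
  if Mutex.try_lock cache_mutex then
    (* The cached buffer is free: use it, releasing it afterwards. *)
    Fun.protect
      ~finally:(fun () -> Mutex.unlock cache_mutex)
      (fun () -> consumer cached_buffer)
  else
    (* Already held, e.g. we re-entered from a signal handler:
       fall back to a fresh buffer instead of blocking. *)
    consumer (Bytes.create 1024)
```

The key point is that `Mutex.try_lock` never blocks, so a re-entrant write can never deadlock; reusing the cached buffer remains only an optimisation.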
|
Okay, I'll move to a more permissive version. Thanks! |
|
Thanks! Another issue is that we now allocate the buffer in all cases, whereas before the C code checked
|
|
I pushed a new version that uses a stack of buffers (instead of just one protected by a mutex) on each domain, and also implements your suggestion of a fresh-buffer fallback. Can you torture it with your signal handlers? |
|
I guess this relies on OCaml not switching threads between the …

(I would guess one buffer per domain would be OK. It should be very unusual that we switch threads or handle a signal while writing an event, so I suspect optimising for this isn't useful.) |
My mental model: allocations, function calls/returns and backward jumps may poll. Matching or reference assignments do not introduce poll points.
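The mental model above can be illustrated on a list-pop helper (an illustration, not a language guarantee):

```ocaml
(* Illustration only: under the mental model above, loads, stores and
   pattern matches do not poll, while allocations and function
   calls/returns may. *)
let pop buffers =
  match !buffers with   (* dereference + match: no poll point *)
  | [] -> None          (* [None] is immediate: no allocation *)
  | b :: bs ->
      buffers := bs;    (* reference assignment: no poll point *)
      Some b            (* allocates a [Some] block: may poll *)
```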
I am not sure. If you do have a single-domain, multi-threaded program that uses runtime events at a fine enough granularity, it may occur regularly that two threads race to write a runtime event if the serialization function allocates something. (I am especially wary that the result could be a silent but noticeable degradation in performance, which would then be fairly painful for the user to track down and understand, with no actual way to fix it.) |
|
OK then. The new version passes the signal test-case above. |
cc64ce6 to 4060e0b (force-push)
|
Great, thanks for testing! This still needs a review. |
|
It’s the next item on my list, but I can’t get to it before Monday. |
OlivierNicole
left a comment
I believe this fixes the bug and is probably correct, but the fact that it’s OK to mutate a linked list in a way that may drop elements—because it only affects performance and not correctness—deserves a comment in the source IMO.
```ocaml
let with_write_buffer =
  (* [caml_runtime_events_user_write] needs a write buffer in which
     the user-provided serializer will write its data. The user may
     want to write a lot of custom events fast, so we want to cache
     the write buffer across calls.

     To be safe for multi-domain programs, we use domain-local
     storage for the write buffer. To accommodate multi-threaded
     programs (without depending on the Thread module), we store
     a list of caches for each domain. This might leak a bit of
     memory: the number of buffers for a domain is equal to the
     maximum number of threads that requested a buffer concurrently,
     and we never free those buffers. *)
  let create_buffer () = Bytes.create 1024 in
  let write_buffer_cache = Domain.DLS.new_key (fun () -> ref []) in
  fun consumer ->
    let buffers = Domain.DLS.get write_buffer_cache in
    let buf =
      match !buffers with
      | [] -> create_buffer ()
      | b :: bs -> buffers := bs; b
    in
    Fun.protect ~finally:(fun () -> buffers := buf :: !buffers)
      (fun () -> consumer buf)
```
So if I understand correctly your intent:

- This implementation is not vulnerable to "`Domain.DLS` is not thread-safe" (#12677) because the worst that can happen is that, if two threads on the domain run `Domain.DLS.get` concurrently, and this is the first time `Domain.DLS.get` is being run on that domain, then they will get different mutable cells; but this will only cause the drop of an allocated cache, which merely causes unnecessary allocation next time, but no functional bug;
- Thread switches may happen on safe points such as the allocations in `create_buffer ()` and `buf :: !buffers`, and as a result elements of the list can be dropped. But likewise, this causes at worst an additional buffer allocation (and we expect it to be rare).
Is it an accurate description of why you think the code is thread-safe?
I think we should fix #12677, and proposed #12889 to do so. The code here is not meant to be robust against #12677.
Your point on the fact that `buf :: !buffers` allocates is something that I did not have in mind, thanks! The code is functionally correct for the reason you mention -- it is an issue with this caching stuff that most ways to get it wrong will still pass the tests. I believe that I could:
1. Pretend that there is no issue and write nothing about that. Very simple!
2. Write a lengthy comment on why dropping buffers is fine here because it never happens in practice; I don't like it. Annoying.
3. Write a CAS loop to push the buffer back, which is more code and fewer comments.
I went with (3) in a new commit in this PR, could you have a look?
In fact I believe that the code with `buffers := buf :: !buffers` is not correct because, in addition to sometimes dropping buffers, it could also re-add to the stack some buffers that have been stolen by a thread and not been given back yet. This could result in two threads racing to serialize into the same buffer, which is incorrect.
Yes, that does seem possible.
4060e0b to 3b27201 (force-push)
```ocaml
let rec push buffers buf =
  (* intended to be thread-safe *)
  let cur_buffers = !buffers in
  let cons = buf :: cur_buffers in
  (* retry if !buffers changed under our feet: *)
  (* begin atomic *)
  if !buffers == cur_buffers then
```
I fear that the compiler might perform CSE, making this safeguard ineffective.
I stand corrected, it looks like allocations cause equations over mutable loads to be dropped.
What you should worry about here is that the compiler is 100% allowed to move the allocation into the branch (after the test).
You can use Sys.opaque_identity around the allocation to prevent that, but in the end there is no specification that would prevent the compiler from inserting a poll between the equality check and the assignment so you can't be provably safe.
You know better than me about this, but it is not easy to work with the fact that there is no documentation of what "the compiler is allowed to do" -- I agree with you that having more precise semantics for this would be useful. Is this allowed in Flambda land? I would have thought that `!buffers` (in the test `!buffers == cur_buffers`) is considered as having coeffects, and that `buf :: cur_buffers` is considered as having generative effects, and that the two thus cannot be permuted -- but I lost myself trying to check the implementation for this.
> Is this allowed in Flambda land?
Technically yes, our semantics would allow this. But I don't think we have implemented any transformations that would make use of that in Flambda 1.
> ... is considered as having generative effects, and that the two thus cannot be permuted
Generative effects are a different category of effects. The only generative effects we track are memory allocations, and we do not track co-effects in this category, so they are allowed to be permuted with everything. Other examples of operations that could have generative effects only could be random number generation, if that was a language builtin, or monotonic counters such as caml_fresh_oo_id.
If we wanted to accurately model the runtime execution we could give memory co-effects to everything that interacts with the GC (polls and C calls, typically), but that would also mean that we wouldn't be allowed to remove any allocation if there might be a poll later in the program (just like you cannot remove an assignment to a reference if there might be a read from it later). So not very useful for an optimiser.
My opinion is that this code should use atomic cells instead of references if consistency is required. If not (i.e. if we can deal with dropping a few bits of data from time to time), then `let push buffers buf = buffers := buf :: !buffers` will work.
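A sketch of the atomic variant being suggested (illustrative: the per-domain cache would then hold a `Bytes.t list Atomic.t` instead of a `ref`):

```ocaml
(* Illustrative sketch: a lock-free push on an [Atomic.t]-held list,
   retrying on contention (a Treiber-stack-style loop). *)
let rec push (buffers : Bytes.t list Atomic.t) buf =
  let cur = Atomic.get buffers in
  if not (Atomic.compare_and_set buffers cur (buf :: cur)) then
    push buffers buf
```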
> Moving a memory read across a poll point is not valid in my book
You wouldn't necessarily move anything. Typically:
```ocaml
let original () =
  let block = allocate () in
  let field = read () in
  (block, field)

let reified () =
  let block = allocate () in
  let field = read () in
  (* Add a new allocation just before use *)
  let reified_block = allocate () in
  (reified_block, field)

let simplified () =
  (* Remove the old, unused allocation *)
  let field = read () in
  let reified_block = allocate () in
  (reified_block, field)
```

This is basically how float/int64/... unboxing works, and it produces the same results as moving the allocation but without actually moving anything.
Note that the ability to replace one allocated block with another of the same shape is very important for various other optimisations too, such as constant lifting and sharing of structurally equal (immutable) values.
> (... if it is aware of the implicit placement of poll points)
I believe that poll points were designed as a low-level mechanism to ensure we have bounds on the time spent without synchronisation. My opinion is that as we did not add anything to the language that relates to critical sections or polling primitives, users should not expect to be able to reason about poll points at the source level (in fact, I believe that what allocates or not should be left unspecified in source-level semantics).
I'm all for having new specific language constructs for reasoning about critical sections though; that would solve most of these problems, and it would not be too hard to correctly fit that into the existing compiler code.
> My opinion is that this code should use atomic cells instead of references
Mph, I don't want to do this because the point is precisely to avoid any synchronization bottleneck between runtime_event users. I use Domain.DLS to avoid synchronization, I don't want to use Atomic for single-domain multi-threaded programming.
I reformulated the code using an explicit `compare_and_set` operation that is annotated `[@poll error]` (this implies `[@inline never]`); does that approach look correct to you?
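Such a poll-free compare-and-set could look roughly like this (a sketch, not necessarily the PR's exact code; note that the cons cell must be allocated outside the `[@poll error]` function, since allocation is itself a poll point):

```ocaml
(* Sketch: [@poll error] makes the compiler reject any poll point in
   the body, so no thread switch or signal handler can run between
   the read of [r] and the write to it. *)
let[@poll error] compare_and_set r seen v =
  if !r == seen then (r := v; true) else false

(* Allocate the new cons cell *before* the poll-free section, then
   publish it, retrying if another thread got there first. *)
let rec push buffers buf =
  let cur = !buffers in
  let cons = buf :: cur in
  if not (compare_and_set buffers cur cons) then push buffers buf
```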
It looks more robust (although I don't consider the [@poll error] mechanism as foolproof), but I wouldn't be surprised if it ends up more expensive than regular atomic operations.
I'll take "expensive like a direct function call" over "complex stuff that might create weird bottlenecks on relaxed machines and work fine everywhere else" any day.
a937e43 to 8b8b5d8 (force-push)
|
My understanding is that, after some iteration and with the help of @lthls, I have put the code into a state that both @OlivierNicole and @lthls should be okay with. For now I shall wait for the review process to run its course. |
8b8b5d8 to 091934f (force-push)
|
The new code looks correct to me. Do we want to add @talex5’s bug reproducers (the initial one and the one with a signal handler) as regression tests? |
|
I'm not fond of the initial test, it's a torture test that seems likely to break on some weird CI machines. The new test is fine, but I'm not sure what to test about its output -- just that it runs normally? I feel lazy about this. Would you by chance be willing to propose an integration of the test on top of my PR branch, that I could cherry-pick into the PR? |
|
Sure, I added it in OlivierNicole@63cf577. I can’t swear that it will pass on absolutely all OSes and CI configurations. |
|
Thanks! I cherry-picked your commit. |
|
(Note: if/when you are feeling like approving the PR, you should do this explicitly.) |
OlivierNicole
left a comment
My bad, looks like SIGALRM is not available on Windows. We should at least disable this test on Windows (see OlivierNicole/ocaml@af432c2f0d). Otherwise, LGTM.
81dcf4d to 223dcda (force-push)
|
I started a precheck build to test the test on our various CI machines. This still needs an approval from a maintainer to be merged. (From their own review or on behalf of @OlivierNicole.) |
|
Precheck did find a failure on ocaml-linux-64, on the test |
|
I noted down the failure for later because it’s a worrying one, but I too can’t see any link with this PR. |
|
This PR still needs to be approved by a maintainer to be merged. It aims for 5.2. |
damiendoligez
left a comment
Looks correct, although I'm left wondering whether it would have been simpler to implement the buffer-pooling logic in C rather than OCaml.
|
Another remark: will you revisit this code when we switch from DLS to TLS? |
Should we? Yes. |
|
Thanks for the review!
It may be easier to write, but also easy to get wrong in subtle ways. Currently I agree that it is easier to write thread-safe code in C than in OCaml, but I think that it is worth trying the hard stuff a bit to understand what we need and improve the situation there. |
|
I had a discussion with @damiendoligez on how to implement this directly in C. The initial reason why I did it on the OCaml side is that I intended to take a lock -- and taking a lock from OCaml is safer than taking a lock from C, precisely for the reasons mentioned in the issue report. But locks don't work well in re-entrancy scenarios so we gave up on that, and doing it in C would be fine. But then we still have to decide on how to pool our buffers -- do we keep at most one buffer, or a stack of buffers, or a per-thread buffer, or a per-thread stack of buffers? with pretty much the same tradeoffs and the same complexity. So in the end we agreed to merge as-is and work on something else. |
Suggested-by: Thomas Leonard <talex5@gmail.com>
223dcda to b6c277b (force-push)
|
Another reason to do it in OCaml is that it's not obvious we want a buffer pool anyway; that was just to preserve the current API. For example, if I want to trace a string then it would be more efficient to pass the string directly, rather than having C code give me a buffer that I have to copy it into first, only to have the C code copy it again to the ring. |
|
I don't disagree (we discussed the idea that maybe the custom-runtime-events API should evolve based on your EIO feedback, and I think it's a good idea), but note that the PR code still imposes its buffer pool. I think that you mean that the resulting code is slightly closer to the final API you have in mind, where the user serializes on their own and only calls the runtime-event logic after that. For now we take a buffer from the OCaml side, but then we call the user serializer from C. |
Fix locking bug with custom runtime events (cherry picked from commit 84bab61)
This is a proposal to fix #12897.