Fix locking bug with custom runtime events #12900

Merged
gasche merged 4 commits into ocaml:trunk from gasche:runtime-events-fix-lock
Feb 8, 2024

Conversation

@gasche
Member

@gasche gasche commented Jan 15, 2024

This is a proposal to fix #12897.

@talex5
Contributor

talex5 commented Jan 16, 2024

I tested this and it fixes #12897 for me - thanks!

@gasche gasche added the bug label Jan 16, 2024
@gasche gasche added this to the 5.2 milestone Jan 16, 2024
@gasche
Member Author

gasche commented Jan 16, 2024

Great, thanks!

Now this needs a review.

(I marked this "5.2" as this is a bugfix.)

@talex5
Contributor

talex5 commented Jan 17, 2024

This can still deadlock if a signal handler emits an event. Here's a test-case:

let unit =
  let encode _buf () =
    print_endline "Start encoding event";
    Unix.sleep 2;
    print_endline "Finished encoding event";
    0
  in
  let decode _buf _len = () in
  Runtime_events.Type.register ~encode ~decode

type Runtime_events.User.tag += My_event
let my_event = Runtime_events.User.register "event" My_event unit

let handle_signal _ =
  print_endline "Signal handler called; writing trace event...";
  Runtime_events.User.write my_event ()

let () =
  Runtime_events.start ();
  Sys.set_signal Sys.sigalrm (Signal_handle handle_signal);
  ignore (Unix.alarm 1 : int);
  Runtime_events.User.write my_event ()

The output is:

Start encoding event             
Signal handler called; writing trace event...
Fatal error: exception Sys_error("Mutex.lock: Resource deadlock avoided")

I think that if the mutex is locked then it should just create a fresh buffer rather than waiting.

(possibly it could Atomic.exchange buf None to try to get the buffer, and so avoid using a mutex at all)
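For concreteness, the lock-free idea suggested above could look something like the following sketch. The names, the 1024-byte buffer size, and the use of an option cell are illustrative assumptions, not the PR's actual code:

```ocaml
(* Sketch of the cache-stealing idea: the cached buffer lives in an
   Atomic.t option. A writer takes it with Atomic.exchange, leaving
   None behind; a concurrent or re-entrant writer (e.g. a signal
   handler) then finds None and allocates a fresh buffer instead of
   blocking on a mutex. *)

let cached : Bytes.t option Atomic.t = Atomic.make None

let with_write_buffer consumer =
  let buf =
    match Atomic.exchange cached None with
    | Some b -> b                 (* stole the cached buffer *)
    | None -> Bytes.create 1024   (* cache empty or taken: allocate *)
  in
  Fun.protect
    ~finally:(fun () -> Atomic.set cached (Some buf))
    (fun () -> consumer buf)
```

Note that the unconditional Atomic.set in ~finally can overwrite (and thus drop) a buffer installed by a racing writer; that is exactly the "reusing the buffer is only an optimisation" trade-off being discussed.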

@gasche
Member Author

gasche commented Jan 17, 2024

My previous version used something like what you suggest, but I replaced it with a mutex. My worry was that the cache-stealing design that you proposed would silently degrade performance for programs that arguably use runtime events incorrectly, and that a clean failure is better.

Is it supposed to be correct to emit a Custom runtime event from inside a signal handler?

@talex5
Contributor

talex5 commented Jan 17, 2024

Is it supposed to be correct to emit a Custom runtime event from inside a signal handler?

I don't know. According to @OlivierNicole #12840 (comment):

I would be of the opinion to stick to the implicit convention that if the manual is silent about these issues (thread safety, authorization to allocate, etc.), it means that the API is safe (thread-safe, GC-safe, etc.).

So I guess it depends on the definition of "etc".

But it seems reasonable to support it, given that doing so is actually simpler than using a mutex. Having your service crash after a few hours with a mysterious and difficult-to-reproduce "Resource deadlock avoided" doesn't seem helpful to me, especially when reusing the buffer is only an optimisation anyway.

@gasche
Member Author

gasche commented Jan 17, 2024

Okay, I'll move to a more permissive version. Thanks!

@talex5
Contributor

talex5 commented Jan 17, 2024

Thanks! Another issue is that we now allocate the buffer in all cases, whereas before the C code checked ring_is_active first.

ring_is_active should probably be exposed to OCaml code. Then write can check it quickly first, for the common fast case where events are disabled.

@gasche
Member Author

gasche commented Jan 17, 2024

I pushed a new version that uses a stack of buffers (instead of just one protected by a mutex) on each domain, and also implements your suggestion of a ring_is_active fast path.

Can you torture it with your signal handlers?

@talex5
Contributor

talex5 commented Jan 17, 2024

I guess this relies on OCaml not switching threads between the match !buffers with and the buffers := bs? Does it need some [@poll error] annotation then? I don't really understand the rules about when this is safe.

(I would guess one buffer per domain would be OK. It should be very unusual that we switch threads or handle a signal while writing an event, so I suspect optimising for this isn't useful.)

@gasche
Member Author

gasche commented Jan 17, 2024

I don't really understand the rules about when this is safe.

My mental model: allocations, function calls/returns and backward jumps may poll. Matching or reference assignments do not introduce poll points.

I would guess one buffer per domain would be OK. It should be very unusual that we switch threads or handle a signal while writing an event, so I suspect optimising for this isn't useful.

I am not sure. If you do have a single-domain, multi-thread program that uses runtime events at a fine-enough granularity, it may occur regularly that two threads race to write a runtime event if the serialization function allocates something. (I am especially wary that the result could be a silent but noticeable degradation in performance, which would then be fairly painful for the user to track down and understand, with no actual way to fix it.)

@talex5
Contributor

talex5 commented Jan 19, 2024

OK then. The new version passes the signal test-case above.

@gasche gasche force-pushed the runtime-events-fix-lock branch from cc64ce6 to 4060e0b Compare January 19, 2024 11:18
@gasche
Member Author

gasche commented Jan 19, 2024

Great, thanks for testing! This still needs a review.

@OlivierNicole
Contributor

It’s the next item on my list, but I can’t get to it before Monday.

Contributor

@OlivierNicole OlivierNicole left a comment


I believe this fixes the bug and is probably correct, but the fact that it’s OK to mutate a linked list in a way that may drop elements—because it only affects performance and not correctness—deserves a comment in the source IMO.

Comment on lines +246 to +292
let with_write_buffer =
  (* [caml_runtime_events_user_write] needs a write buffer in which
     the user-provided serializer will write its data. The user may
     want to write a lot of custom events fast, so we want to cache
     the write buffer across calls.

     To be safe for multi-domain programs, we use domain-local
     storage for the write buffer. To accommodate multi-threaded
     programs (without depending on the Thread module), we store
     a list of caches for each domain. This might leak a bit of
     memory: the number of buffers for a domain is equal to the
     maximum number of threads that requested a buffer concurrently,
     and we never free those buffers. *)
  let create_buffer () = Bytes.create 1024 in
  let write_buffer_cache = Domain.DLS.new_key (fun () -> ref []) in
  fun consumer ->
    let buffers = Domain.DLS.get write_buffer_cache in
    let buf =
      match !buffers with
      | [] -> create_buffer ()
      | b::bs -> buffers := bs; b
    in
    Fun.protect ~finally:(fun () -> buffers := buf :: !buffers)
      (fun () -> consumer buf)
Contributor

So if I understand correctly your intent:

  • This implementation is not vulnerable to "Domain.DLS is not thread-safe" (#12677) because the worst that can happen is that, if two threads on the domain run Domain.DLS.get concurrently, and this is the first time Domain.DLS.get is being run on that domain, then they will get different mutable cells; but this will only cause the drop of an allocated cache, which merely causes an unnecessary allocation next time, with no functional bug;
  • Thread switches may happen on safe points such as the allocations in create_buffer () and buf :: !buffers, and as a result elements of the list can be dropped. But likewise, this causes at worst an additional buffer allocation (and we expect it to be rare).

Is it an accurate description of why you think the code is thread-safe?

Member Author

I think we should fix #12677, and proposed #12889 to do so. The code here is not meant to be robust against #12677.

Your point about the fact that buf :: !buffers allocates is something that I did not have in mind, thanks! The code is functionally correct for the reason you mention -- it is an issue with this caching stuff that most ways to get it wrong will still pass the tests. I believe that I could:

  1. Pretend that there is no issue and write nothing about it. Very simple!
  2. Write a lengthy comment on why dropping buffers is fine here -- I don't like it, but it never happens in practice. Annoying.
  3. Write a CAS loop to push the buffer back, which is more code and fewer comments.

I went with (3) in a new commit in this PR, could you have a look?

Member Author

@gasche gasche Jan 22, 2024

In fact I believe that the code with buffers := buf :: !buffers is not correct because, in addition to sometimes dropping buffers, it could also re-add to the stack some buffers that have been stolen by a thread and not been given back yet. This could result in two threads racing to serialize into the same buffer, which is incorrect.

Contributor

Yes, that does seem possible.

@gasche gasche force-pushed the runtime-events-fix-lock branch from 4060e0b to 3b27201 Compare January 22, 2024 14:54
Comment on lines +273 to +279
let rec push buffers buf =
  (* intended to be thread-safe *)
  let cur_buffers = !buffers in
  let cons = buf :: cur_buffers in
  (* retry if !buffers changed under our feet: *)
  (* begin atomic *)
  if !buffers == cur_buffers then
Contributor

I fear that the compiler might perform CSE, making this safeguard ineffective.

Contributor

I stand corrected, it looks like allocations cause equations over mutable loads to be dropped.

Contributor

What you should worry about here is that the compiler is 100% allowed to move the allocation into the branch (after the test).
You can use Sys.opaque_identity around the allocation to prevent that, but in the end there is no specification that would prevent the compiler from inserting a poll between the equality check and the assignment so you can't be provably safe.

Member Author

You know better than me about this, but it is not easy to work with the fact that there is no documentation of what "the compiler is allowed to do" -- I agree with you that having more precise semantics for this would be useful. Is this allowed in Flambda land? I would have thought that !buffers (in the test !buffers == cur_buffers) is considered as having coeffects, and that buf :: cur_buffers is considered as having generative effects, and that the two thus cannot be permuted -- but I lost myself trying to check the implementation for this.

Contributor

Is this allowed in Flambda land?

Technically yes, our semantics would allow this. But I don't think we have implemented any transformations that would make use of that in Flambda 1.

... is considered as having generative effects, and that the two thus cannot be permuted

Generative effects are a different category of effects. The only generative effects we track are memory allocations, and we do not track co-effects in this category, so they are allowed to be permuted with everything. Other examples of operations that could have generative effects only could be random number generation, if that was a language builtin, or monotonic counters such as caml_fresh_oo_id.
If we wanted to accurately model the runtime execution we could give memory co-effects to everything that interacts with the GC (polls and C calls, typically), but that would also mean that we wouldn't be allowed to remove any allocation if there might be a poll later in the program (just like you cannot remove an assignment to a reference if there might be a read from it later). So not very useful for an optimiser.

Contributor

My opinion is that this code should use atomic cells instead of references if consistency is required. If not (i.e. if we can deal with dropping a few bits of data from time to time), then let push buffers buf = buffers := buf :: !buffers will work.
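For reference, the atomic-cell variant mentioned here would be an ordinary Treiber-style stack over an Atomic.t. This is an illustrative sketch, not the code the PR ended up with; it is consistent even across domains, at the cost of atomic operations on the hot path (which is exactly the synchronization cost gasche objects to below):

```ocaml
(* A minimal lock-free stack of buffers: push and pop retry their
   compare-and-set until the cell has not changed under their feet. *)

let buffers : Bytes.t list Atomic.t = Atomic.make []

let rec push buf =
  let cur = Atomic.get buffers in
  if not (Atomic.compare_and_set buffers cur (buf :: cur)) then push buf

let rec pop () =
  match Atomic.get buffers with
  | [] -> None
  | (b :: rest) as cur ->
    if Atomic.compare_and_set buffers cur rest then Some b else pop ()
```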

Contributor

Moving a memory read across a poll point is not valid in my book

You wouldn't necessarily move anything. Typically:

let original () =
  let block = allocate () in
  let field = read () in
  (block, field)

let reified  () =
  let block = allocate () in
  let field = read () in
  (* Add a new allocation just before use *)
  let reified_block = allocate () in
  (reified_block, field)

let simplified  () =
  (* Remove the old, unused allocation *)
  let field = read () in
  let reified_block = allocate () in
  (reified_block, field)

This is basically how float/int64/... unboxing works, and it produces the same results as moving the allocation but without actually moving anything.
Note that the ability to replace one allocated block with another of the same shape is very important for various other optimisations too, such as constant lifting and sharing of structurally equal (immutable) values.

(... if it is aware of the implicit placement of poll points)

I believe that poll points were designed as a low-level mechanism to ensure we have bounds on the time spent without synchronisation. My opinion is that as we did not add anything to the language that relates to critical sections or polling primitives, users should not expect to be able to reason about poll points at the source level (in fact, I believe that what allocates or not should be left unspecified in source-level semantics).

I'm all for having new specific language constructs for reasoning about critical sections though; that would solve most of these problems, and it would not be too hard to correctly fit that into the existing compiler code.

Member Author

My opinion is that this code should use atomic cells instead of references

Mph, I don't want to do this because the point is precisely to avoid any synchronization bottleneck between runtime_event users. I use Domain.DLS to avoid synchronization, I don't want to use Atomic for single-domain multi-threaded programming.

I reformulated the code using an explicit compare_and_set operation that is annotated @poll error (this implies [@inline never]), does that approach look correct to you?
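The general shape of such a helper might be as follows; this is a sketch under the assumption that the whole body compiles without poll points, and the actual helper in the PR may differ in name and details:

```ocaml
(* A compare-and-set over a plain ref. [@poll error] asks the compiler
   to reject any poll point in the body, so no thread switch or signal
   handler can run between the read and the write on the same domain.
   This is NOT safe across domains; it only makes sense here because
   the ref lives in domain-local storage, so only same-domain
   preemption matters. *)
let[@poll error] compare_and_set (r : 'a ref) cur next =
  if !r == cur then (r := next; true) else false
```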

Contributor

It looks more robust (although I don't consider the [@poll error] mechanism as foolproof), but I wouldn't be surprised if it ends up more expensive than regular atomic operations.

Member Author

@gasche gasche Jan 23, 2024

I'll take "expensive like a direct function call" over "complex stuff that might create weird bottlenecks on relaxed machines and work fine everywhere else" any day.

@gasche gasche force-pushed the runtime-events-fix-lock branch 2 times, most recently from a937e43 to 8b8b5d8 Compare January 23, 2024 14:38
@gasche
Member Author

gasche commented Jan 23, 2024

My understanding is that, after a bit and with the help of @lthls, I have put the code into a state that both @OlivierNicole and @lthls should be okay with. For now I shall wait for the review process to run its course.

@gasche gasche force-pushed the runtime-events-fix-lock branch from 8b8b5d8 to 091934f Compare January 23, 2024 15:03
@OlivierNicole
Contributor

The new code looks correct to me. Do we want to add @talex5’s bug reproducers (the initial one and the one with a signal handler) as regression tests?

@gasche
Member Author

gasche commented Jan 23, 2024

I'm not fond of the initial test, it's a torture test that seems likely to break on some weird CI machines. The new test is fine, but I'm not sure what to test about its output -- just that it runs normally? I feel lazy about this. Would you by chance be willing to propose an integration of the test on top of my PR branch, that I could cherry-pick into the PR?

@OlivierNicole
Contributor

Sure, I added it in OlivierNicole@63cf577. I can’t swear that it will pass on absolutely all OSes and CI configurations.

@gasche
Member Author

gasche commented Jan 23, 2024

Thanks! I cherry-picked your commit.

@gasche
Member Author

gasche commented Jan 23, 2024

(Note: if/when you feel like Approving the PR, you should do so explicitly.)

Contributor

@OlivierNicole OlivierNicole left a comment

My bad, looks like SIGALRM is not available on Windows. We should at least disable this test on Windows (see OlivierNicole/ocaml@af432c2f0d). Otherwise, LGTM.

@gasche gasche force-pushed the runtime-events-fix-lock branch from 81dcf4d to 223dcda Compare January 23, 2024 19:30
@gasche
Member Author

gasche commented Jan 23, 2024

I started a precheck build to test the test on our various CI machines.

This still needs an approval from a maintainer to be merged. (From their own review or on behalf of @OlivierNicole.)

@gasche
Member Author

gasche commented Jan 23, 2024

Precheck did find a failure on ocaml-linux-64, on the test parallel/domain_parallel_spawn_burn.ml (segfault in bytecode). Given that the test in question does not use runtime_events as far as I can see, I will assume that this failure is independent of this PR.

@OlivierNicole
Contributor

I noted down the failure for later because it’s a worrying one, but I too can’t see any link with this PR.

@gasche
Member Author

gasche commented Feb 1, 2024

This PR still needs to be approved by a maintainer to be merged. It aims for 5.2.

Member

@damiendoligez damiendoligez left a comment

Looks correct, although I'm left wondering whether it would have been simpler to implement the buffer-pooling logic in C rather than OCaml.

@damiendoligez
Member

Another remark: will you revisit this code when we switch from DLS to TLS?

@gasche
Member Author

gasche commented Feb 8, 2024

Another remark: will you revisit this code when we switch from DLS to TLS?

Should we? Yes.
Will we? Life is a mystery.

@gasche
Member Author

gasche commented Feb 8, 2024

Thanks for the review!

I'm left wondering whether it would have been simpler to implement the buffer-pooling logic in C rather than OCaml.

It may be easier to write, but also easy to get wrong in subtle ways. Currently I agree that it is easier to write thread-safe code in C than in OCaml, but I think that it is worth trying the hard stuff a bit to understand what we need and improve the situation there.

@gasche
Member Author

gasche commented Feb 8, 2024

I had a discussion with @damiendoligez on how to implement this directly in C.

The initial reason why I did it on the OCaml side is that I intended to take a lock -- and taking a lock from OCaml is safer than taking a lock from C, precisely for the reasons mentioned in the issue report. But locks don't work well in re-entrancy scenarios so we gave up on that, and doing it in C would be fine. But then we still have to decide on how to pool our buffers -- do we keep at most one buffer, or a stack of buffers, or a per-thread buffer, or a per-thread stack of buffers? with pretty much the same tradeoffs and the same complexity. So in the end we agreed to merge as-is and work on something else.

@gasche gasche force-pushed the runtime-events-fix-lock branch from 223dcda to b6c277b Compare February 8, 2024 15:36
@gasche gasche added the merge-me label Feb 8, 2024
@talex5
Contributor

talex5 commented Feb 8, 2024

Another reason to do it in OCaml is that it's not obvious we want a buffer pool anyway; that was just to preserve the current API. For example, if I want to trace a string then it would be more efficient to pass the string directly, rather than having C code give me a buffer that I have to copy it into first, only to have the C code copy it again to the ring.

@gasche
Member Author

gasche commented Feb 8, 2024

I don't disagree (we discussed the idea that maybe the custom-runtime-events API should evolve based on your EIO feedback, and I think it's a good idea), but note that the PR code still imposes its buffer pool. I think that you mean that the resulting code is slightly closer to the final API you have in mind, where the user serializes on their own and only calls the runtime-event logic after that. For now we take a buffer from the OCaml side, but then we call the user serializer from C.

@gasche gasche merged commit 84bab61 into ocaml:trunk Feb 8, 2024
gasche added a commit that referenced this pull request Feb 12, 2024
Fix locking bug with custom runtime events

(cherry picked from commit 84bab61)

Successfully merging this pull request may close these issues.

Writing custom events causes GC deadlocks

5 participants