Skip to content

Condition.broadcast must interlock as well#324

Merged
talex5 merged 1 commit intoocaml-multicore:mainfrom
haesbaert:condrace
Sep 24, 2022
Merged

Condition.broadcast must interlock as well#324
talex5 merged 1 commit intoocaml-multicore:mainfrom
haesbaert:condrace

Conversation

@haesbaert
Copy link
Copy Markdown
Contributor

There's a race where broadcast doesn't interlock and a in-flight waiter will have a lost wakeup:

CPU0 at:
    Eio.mutex_lock mutex;
    check_condition -> maybe abort
    Mutex.lock t.mutex; (inlined await for visibility)
B:  Eio_mutex.unlock mutex;
---> HERE
    match Waiters.await ~mutex:(Some t.mutex) t.waiters t.id with
    | ()           -> Eio_mutex.lock mutex
    | exception ex -> Eio_mutex.lock mutex; raise ex

CPU1 at:
    Eio.Mutex.use_rw ~protect:false mutex (fun () -> foo := true)
---> HERE
    Waiters.wake_all t.waiters (); (inlined broadcast for visibility)

1 - CPU0:HERE already tested the condition, it's false so it's ready to sleep,
    it interlocked with the runtime Mutex and released the Eio one.
2 - CPU1 sets the condition just after CPU0:B and now drains the queue and
    wakes the waiters, but CPU0 didn't make the waiting list yet, so it loses
    the wakeup.
3 - CPU0 waits forever.

The fix is to lock t.mutex, we don't need the full dance: LOCK(eio_mutex)->LOCK(mutex)->UNLOCK(eio_mutex), just locking the inner mutex is enough.

disclaimer: It's late and I might have missed something.

There's a race where broadcast doesn't interlock and a in-flight waiter will
have a lost wakeup:

CPU0 at:
    Eio.mutex_lock mutex;
    check_condition -> maybe abort
    Mutex.lock t.mutex; (inlined await for visibility)
B:  Eio_mutex.unlock mutex;
---> HERE
    match Waiters.await ~mutex:(Some t.mutex) t.waiters t.id with
    | ()           -> Eio_mutex.lock mutex
    | exception ex -> Eio_mutex.lock mutex; raise ex

CPU1 at:
    Eio.Mutex.use_rw ~protect:false mutex (fun () -> foo := true)
---> HERE
    Waiters.wake_all t.waiters (); (inlined broadcast for visibility)

1 - CPU0:HERE already tested the condition, it's false so it's ready to sleep,
    it interlocked with the runtime Mutex and released the Eio one.
2 - CPU1 sets the condition just after CPU0:B and now drains the queue and
    wakes the waiters, but CPU0 didn't make the waiting list yet, so it loses
    the wakeup.
3 - CPU0 waits forever.

The fix is to lock t.mutex, we don't need the full dance:
LOCK(eio_mutex)->LOCK(mutex)->UNLOCK(eio_mutex), just locking the
inner mutex is enough.

disclaimer: It's late and I might have missed something.
Copy link
Copy Markdown
Collaborator

@talex5 talex5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs for Waiters.wake_all say:

If [t] is shared between domains, the caller must hold the mutex while calling this.

So, yeah, the old code is obviously wrong.

Thanks!

@talex5 talex5 merged commit c3431a0 into ocaml-multicore:main Sep 24, 2022
talex5 added a commit to talex5/opam-repository that referenced this pull request Oct 12, 2022
CHANGES:

Changes:

- Update to OCaml 5.0.0~beta1 (@anmonteiro @talex5 ocaml-multicore/eio#346).

- Add API for seekable read/writes (@nojb ocaml-multicore/eio#307).

- Add `Flow.write` (@haesbaert ocaml-multicore/eio#318).
  This provides an optimised alternative to `copy` in the case where you are writing from a buffer.

- Add `Net.with_tcp_connect` (@bikallem ocaml-multicore/eio#302).
  Convenience function for opening a TCP connection.

- Add `Eio.Time.Timeout` (@talex5 ocaml-multicore/eio#320).
  Makes it easier to pass timeouts around.

- Add `Eio_mock.Clock` (@talex5 ocaml-multicore/eio#328).
  Control time in tests.

- Add `Buf_read.take_while1` and `skip_while1` (@bikallem ocaml-multicore/eio#309).
  These fail if no characters match.

- Make the type parameter for `Promise.t` covariant (@anmonteiro ocaml-multicore/eio#300).

- Move list functions into a dedicated submodule (@raphael-proust ocaml-multicore/eio#315).

- Direct implementation of `Flow.source_string` (@c-cube ocaml-multicore/eio#317).
  Slightly faster.

Bug fixes:

- `Condition.broadcast` must interlock as well (@haesbaert ocaml-multicore/eio#324).

- Split the reads into no more than 2^32-1 for luv (@haesbaert @talex5 @EduardoRFS ocaml-multicore/eio#343).
  Luv uses a 32 bit int for buffer sizes and wraps if the value passed is too big.

- eio_luv: allow `Net.connect` to be cancelled (@talex5 @nojb ocaml-multicore/eio#311).

- eio_main: Use dedicated log source (@anmonteiro ocaml-multicore/eio#326).

- linux_eio: fix kernel version number in log message (@talex5 @nojb ocaml-multicore/eio#314).

- Account for stack differences in the socketpair test (issue ocaml-multicore/eio#312) (@haesbaert ocaml-multicore/eio#313).

Documentation:

- Add Best Practices section to README (@talex5 ocaml-multicore/eio#299).

- Documentation improvements (@talex5 ocaml-multicore/eio#295 ocaml-multicore/eio#337).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants