
Allow sharing of libuv poll handles#279

Merged
talex5 merged 11 commits into ocaml-multicore:main from patricoferris:fix-luv-poll-handles
Aug 23, 2022

Conversation

@patricoferris
Collaborator

Multiple poll handles for the same file descriptor are not allowed by libuv as @talex5 noted in ocaml-multicore/lwt_eio#11

http://docs.libuv.org/en/v1.x/poll.html says:

It is not okay to have multiple active poll handles for the same socket, this can cause libuv to busyloop or otherwise malfunction.

This can happen seemingly quite easily, for example:

let () =
  Eio_luv.run @@ fun _env ->
  Eio.Switch.run @@ fun sw ->
  let src, _dst = Eio_unix.socketpair ~sw () in
  let fd = Option.get @@ Eio_unix.FD.peek_opt src in
  Eio.Fiber.both
    (fun () -> Eio_unix.await_writable fd)
    (fun () -> Eio_unix.await_writable fd)
(* Eio_luv.Luv_error(EEXIST) (* file already exists *) *)

This PR makes it so we keep track of the various fibers that are waiting on read and write events, and we enqueue them accordingly based on the events that the polling callback receives. This means the same handle must register for both kinds of events and then we decide how to handle them, rather than having a dedicated READABLE and a dedicated WRITABLE poll handle.

So far this PR fixes both the simple example and the original Irmin example that caused me to find this (code here); however, I haven't tried:

  • Awaiting across multiple domains
  • I believe the logic for when to stop a handle is currently incorrect: a fiber could be cancelled, but this doesn't remove it from the Lf_queue, and I'm using the Lf_queue size to check whether any fibers are waiting on the handle -- perhaps an integer that we decrement/increment would suffice here?
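The bookkeeping described above can be modelled with plain OCaml queues. This is only an illustrative sketch -- the names (`waiters`, `dispatch`, `safe_to_stop`) and the use of `Queue` are assumptions for exposition, not Eio_luv's actual internals:

```ocaml
(* A sketch of per-FD waiter tracking: one queue of suspended fibers per
   event kind, all woken from a single poll callback. *)

type event = [ `Readable | `Writable ]

type waiters = {
  read : (unit -> unit) Queue.t;   (* fibers waiting for readability *)
  write : (unit -> unit) Queue.t;  (* fibers waiting for writability *)
}

let create () = { read = Queue.create (); write = Queue.create () }

(* Called from the single poll callback: wake every fiber whose event fired. *)
let dispatch t (events : event list) =
  let drain q = Queue.iter (fun k -> k ()) q; Queue.clear q in
  if List.mem `Readable events then drain t.read;
  if List.mem `Writable events then drain t.write

(* The handle can be stopped once nobody is waiting on either queue. *)
let safe_to_stop t = Queue.is_empty t.read && Queue.is_empty t.write
```

The second bullet above is exactly the weak point of this model: a cancelled fiber's callback stays in the `Queue`, so `safe_to_stop` can return false spuriously.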

@talex5
Collaborator

talex5 commented Aug 15, 2022

I had a look at lwt_luv to see how they handle it:

open Lwt.Infix

let () =
  let () = Lwt_engine.set (new Lwt_luv.engine) in
  let socket, _ = Lwt_unix.(socketpair PF_UNIX SOCK_STREAM 0) in
  Lwt_main.run begin
    let _a = Lwt_unix.wait_read socket in
    let rec aux () =
      Lwt_unix.wait_write socket >>= fun () ->
      Lwt_unix.write socket (Bytes.of_string "a") 0 1 >>= fun (_ : int) ->
      aux ()
    in
    aux ()
  end

The result is:

Fatal error: exception Failure(
"Could not register fd for write polling, this is probably a error in Lwt,
please open a issue on the repo. \n
Error message: EEXIST")

However, Lwt does handle the case of multiple readers OK (just not a mix of readers and writers):
https://github.com/ocsigen/lwt/blob/5.6.1/src/unix/lwt_unix.cppo.ml#L495

I believe the logic for when to stop a handle is currently incorrect: a fiber could be cancelled, but this doesn't remove it from the Lf_queue, and I'm using the Lf_queue size to check whether any fibers are waiting on the handle -- perhaps an integer that we decrement/increment would suffice here?

You probably want a Lwt_dllist here instead of an Lf_queue. There shouldn't be any need for anything lock-free here, since we run a separate luv event loop in each domain.
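For illustration, here is a minimal sentinel-based doubly-linked list in the spirit of Lwt_dllist (a self-contained sketch, not Lwt's actual implementation). The point is that removal is O(1), so a cancelled fiber can unlink its own node directly and an `is_empty` check on the list stays accurate:

```ocaml
(* A tiny doubly-linked list with a sentinel node. Each waiter keeps the
   node returned by [add] and calls [remove] on it if cancelled. *)

type 'a node = {
  mutable value : 'a option;   (* None only for the sentinel *)
  mutable prev : 'a node;
  mutable next : 'a node;
}

type 'a t = 'a node            (* represented by its sentinel *)

let create () =
  let rec sentinel = { value = None; prev = sentinel; next = sentinel } in
  sentinel

let is_empty t = t.next == t

(* Append at the tail; return the node so the caller can remove it later. *)
let add t v =
  let node = { value = Some v; prev = t.prev; next = t } in
  t.prev.next <- node;
  t.prev <- node;
  node

(* Unlink in O(1); safe to run from a fiber's cancellation handler. *)
let remove node =
  node.prev.next <- node.next;
  node.next.prev <- node.prev;
  node.prev <- node;
  node.next <- node
```

With this shape, "are any fibers still waiting?" is just `is_empty`, and cancellation cannot leave stale entries behind.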

This means the same handle must register for both kinds of events and then we decide how to handle them

I don't think this will work. http://docs.libuv.org/en/v1.x/poll.html#c.uv_poll_start says:

If one of the events UV_READABLE or UV_WRITABLE are set, the callback will be called again, as long as the given fd/socket remains readable or writable accordingly.

However, it also seems unnecessary:

Note: Calling uv_poll_start() on a handle that is already active is fine. Doing so will update the events mask that is being watched for.

It's not clear to me what happens to the callback if you call start multiple times. Possibly it ignores the second one, or disconnects the first. In any case, stop doesn't say which callback to stop.

@patricoferris
Collaborator Author

Yes, I'm not really sure what the actual behaviour is. From the source code, it seems that calling Poll.start multiple times installs a new callback each time: https://github.com/libuv/libuv/blob/988f2bfc4defb9a85a536a3e645834c161143ee0/src/unix/poll.c#L152. This tutorial, under "Setting up Polling", seems to suggest the same, and they're happy enough because the same callback is installed each time.

In b12c0d6 I've made it so that the callbacks are the same, and a cancelled fiber will remove itself from the list of continuations to enqueue, so the is_empty check in safe_to_stop_polling is (in my mind) now correct.

In 4999630 I've added the extra logic to handle simultaneously awaiting writability and readability on an FD. The bit mask for events is updated to whatever you pass to Poll.start; I had thought it might be bitwise ORed with whatever is already there, but this is not the case.
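Since the mask is replaced rather than ORed, it has to be recomputed from the current set of waiters on every call to Poll.start. A minimal sketch of that computation, where `has_readers`/`has_writers` stand in for the real "is this waiter queue non-empty?" checks (all names here are illustrative):

```ocaml
type event = [ `Readable | `Writable ]

(* The mask passed to Poll.start must reflect everything anyone is
   currently waiting for, since it replaces the previous mask outright. *)
let required_mask ~has_readers ~has_writers : event list =
  (if has_readers then [ `Readable ] else [])
  @ (if has_writers then [ `Writable ] else [])
```

A fiber registering interest in one direction would recompute this from both queues before restarting the poll, so the other direction's waiters are never silently dropped.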

Collaborator

@talex5 talex5 left a comment


This looks good to me (apart from minor whitespace issues). I've put some suggested simplifications at 4985168. Feel free to include those if they look sensible. I put fd in the fd_event_waiters. We have to store it anyway, and it simplifies things to have it here rather than captured in a closure. Sorry about the extra re-indentation!

Do we need to fail all waiters if close is called too?

@patricoferris
Collaborator Author

Yes, your changes are very sensible :)) What tool do you use for indentation and whitespace, ocp-indent?

Do we need to fail all waiters if close is called too?

This seems reasonable; the docs say:

The user should not close a file descriptor while it is being polled by an active poll handle. This can cause the handle to report an error, but it might also start polling another socket. However the fd can be safely closed immediately after a call to uv_poll_stop() or uv_close().

It seems reasonable to me to fail any suspended fibers; would this approach work on Linux?
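Failing the suspended fibers on close could look roughly like this -- a sketch under the assumption that each waiter is stored as a result-expecting continuation; the `Closed` exception and all field names are hypothetical:

```ocaml
exception Closed

(* Each waiter is a continuation expecting either a wakeup or an error. *)
type waiter = (unit, exn) result -> unit

type waiters = {
  read : waiter Queue.t;
  write : waiter Queue.t;
}

(* On close, wake every suspended fiber with an error rather than leaving
   it parked on an FD that no longer exists. *)
let fail_all t =
  let fail q = Queue.iter (fun k -> k (Error Closed)) q; Queue.clear q in
  fail t.read;
  fail t.write
```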

Collaborator

@talex5 talex5 left a comment


What tool do you use for indentation and whitespaces, ocp-indent?

I use ocp-indent for indentation. For whitespace, .git/hooks/pre-commit.sample will detect trailing spaces (if you remember to enable it).

It seems reasonable to me to fail any suspended fibers, would this approach work on Linux?

I'm not sure if it's worth doing. I think Linux only uses the FD to look up the file object when you submit the request, and it's quite happy to continue waiting for the file after the FD is closed. It's more a limitation of luv that files can't be used after their FD is closed. Maybe it's needed on Windows? Though it might be nice to be consistent.

I've pushed a couple more changes:

  • One removes the use of Obj.magic by stashing the FD away at the start. Using Obj.magic with an external library like this seems pretty risky.
  • I got it to update the poll mask in the case where one queue becomes empty but the other isn't. It's hard to test for this because it just uses a lot of CPU if you forget, but doesn't stop working. I also moved the code around a bit to group all the polling stuff together.

If that looks OK to you, I'm happy to merge it.

@talex5
Collaborator

talex5 commented Aug 22, 2022

One removes the use of Obj.magic ...

Hmm, this doesn't work for all handles. I guess the FD isn't known initially in some cases.

@talex5 talex5 force-pushed the fix-luv-poll-handles branch from 0dbda02 to d6792e2 Compare August 22, 2022 10:54
@talex5 talex5 force-pushed the fix-luv-poll-handles branch from d6792e2 to 8408dd4 Compare August 22, 2022 11:05
@talex5
Collaborator

talex5 commented Aug 22, 2022

OK, fixed it by adding a constraint. It seems we only ever use handle types that may have FDs anyway. I also got it to ignore EBADF when getting the FD, since sockets don't get their FD set immediately (e.g. a failure after TCP.init and before bind wouldn't have any FD).

@patricoferris
Collaborator Author

OK, fixed it by adding a constraint. It seems we only ever use handle types that may have FDs anyway.

Ah, I wasn't sure if that was the case. Definitely better than Obj.magic! All looks good to me :))

@talex5 talex5 merged commit 7d34e99 into ocaml-multicore:main Aug 23, 2022
@talex5
Collaborator

talex5 commented Aug 23, 2022

Thanks!

talex5 added a commit to talex5/opam-repository that referenced this pull request Aug 26, 2022
CHANGES:

New features:

- Add `Eio.Condition` (@TheLortex @talex5 ocaml-multicore/eio#277).
  Allows a fiber to wait for some condition to become true.

- Add `Eio.Net.getaddrinfo` and `getnameinfo` (@bikallem @talex5 ocaml-multicore/eio#278 ocaml-multicore/eio#288 ocaml-multicore/eio#291).
  Convert between host names and addresses.

- Add `Eio.Debug` (@talex5 ocaml-multicore/eio#276).
  Currently, this allows overriding the `traceln` function.

- `Buf_write.create`: make switch optional (@talex5 ocaml-multicore/eio#283).
  This makes things easier for people porting code from Faraday.

Bug fixes:

- Allow sharing of libuv poll handles (@patricoferris @talex5 ocaml-multicore/eio#279).
  Luv doesn't allow two callers to watch the same file handle, so we need to handle that in Eio.

Other changes:

- Upgrade to uring 0.4 (@talex5 ocaml-multicore/eio#290).

- Mention `Mutex`, `Semaphore` and `Condition` in the README (@talex5 ocaml-multicore/eio#281).
