
Eio_linux: handle sleeping tasks that are now due before paused fibers#213

Merged
talex5 merged 2 commits into ocaml-multicore:main from TheLortex:time-scheduling on May 19, 2022

Conversation

@TheLortex

When using Eio I encountered an issue while porting a test in the mirage arp library.

The failing code can be reduced to the following:

```ocaml
let rec loop () =
  Eio.Fiber.yield ();
  loop ()

let () =
  Eio_main.run @@ fun env ->
  let clock = Eio.Stdenv.clock env in
  Eio.Fiber.first
    loop
    (fun () -> Eio.Time.sleep clock 0.1)
```

I expected this code to finish after 0.1s of execution, but instead it loops forever. This is because the scheduler prioritizes paused fibers over sleeping fibers that are now due.

This PR adds a `Zzz.pop` call before the check for paused fibers.
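As a rough illustration of the fix (a toy model, not Eio's actual scheduler; `schedule`, `run_q`, `sleep_q`, and `now` are hypothetical names), the idea is to consult the sleep queue for due timers before popping the next paused fiber:

```ocaml
(* Minimal model of the scheduling fix, assuming a sleep queue sorted by
   wake-up time. All names here are illustrative, not Eio's internals. *)
let schedule ~now ~run_q ~sleep_q =
  match sleep_q with
  | (t, fiber) :: rest when t <= now ->
      (* A sleeper is due: resume it before any paused fiber. *)
      `Run (fiber, run_q, rest)
  | _ ->
      match run_q with
      | fiber :: rest -> `Run (fiber, rest, sleep_q)
      | [] -> `Idle
```

With this ordering, a fiber that keeps yielding can no longer starve a sleeper whose deadline has already passed.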

@TheLortex TheLortex changed the title handle sleeping tasks that are now due before paused fibers Eio_linux: handle sleeping tasks that are now due before paused fibers Apr 25, 2022
@haesbaert
Contributor

haesbaert commented Apr 25, 2022

I'm not familiar with the code, so excuse me if I'm asking something stupid.
But couldn't you add the now-due task to the existing run queue and keep the same loop? That would make the tasks run in round-robin order instead of always running the wakers first. Of course, if you reached a case with enough jobs on Zzz, the run queue would then starve, in essence just inverting your problem.

@talex5
Collaborator

talex5 commented Apr 25, 2022

This does need fixing, but I think we need a more general solution. A fiber that keeps yielding will also prevent any other IO from happening, not just sleeps.

Perhaps we should keep a "check-for-IO" job on the run queue. When it gets to the head, we dispatch any IO (and add another check-for-IO job)?
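A minimal sketch of that idea (hypothetical names and types; not Eio's implementation): the run queue holds ordinary fibers plus one sentinel job, and whenever the sentinel reaches the head we poll for IO and re-enqueue it.

```ocaml
(* Sketch of a "check-for-IO" sentinel in the run queue. When the sentinel
   reaches the head, we poll for IO and push a fresh sentinel, so IO is
   checked once per pass over the queue even if fibers keep yielding. *)
type job = Fiber of (unit -> unit) | Check_io

let run_one queue ~poll_io =
  match Queue.pop queue with
  | Fiber f -> f ()
  | Check_io ->
      poll_io ();                 (* dispatch any completed IO *)
      Queue.push Check_io queue   (* keep exactly one sentinel queued *)
```

Because the sentinel cycles through the queue like any other job, each runnable fiber gets at most one turn between IO checks.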

@TheLortex
Author

Thanks for your reviews! Yes, I should have mentioned that this was written off the top of my head to fix the specific issue and start a discussion.

Perhaps we should keep a "check-for-IO" job on the run queue. When it gets to the head, we dispatch any IO (and add another check-for-IO job)?

That's what I was thinking, I can try to implement that.

@TheLortex
Author

TheLortex commented Apr 25, 2022

@talex5 I have updated the PR. Please take a look and tell me if something is wrong.
The invariant I'm enforcing is that at any time the run queue contains exactly one IO task.
As a consequence, the exit cases have to check whether the run queue has a single element (instead of being empty).
I have added `has_single_element`, but I'm not sure of its thread-safety.

Collaborator

@talex5 talex5 left a comment


Thanks; this looks generally sensible!

But with this change, make test_luv now hangs, so we probably need a similar fix there.

Other notes:

  • I think it would be simpler to delay adding the IO event until just before returning. Then you don't need has_single_element. I had a go at that here: https://github.com/ocaml-multicore/eio/compare/main...talex5:fairness?expand=1

  • The fake test clock in prelude.ml says:

    At the moment, the scheduler only checks for expired timers when the run-queue is
    empty, so this is a convenient way to wait for the system to be idle.

    It still seems to work, but we might need to make it more robust, or at least update the comment to warn that it might not work now.

@TheLortex
Author

Thank you for your review.

But with this change, make test_luv now hangs, so we probably need a similar fix there.

I have added a fix to the luv backend in the second commit. In particular, I'm using `Luv.Async.send` to yield back to the event loop when the IO task is due.

I think it would be simpler to delay adding the IO event until just before returning.

Indeed, that sounds better and reduces the diff. I have force-pushed my first commit with the suggested changes.

It still seems to be working, but might need to make it more robust, or at least update the comment to warn that it might not work now.

Do you have an idea of how to make it more robust? Do you have tests that wait for the run queue to become empty?

@talex5
Collaborator

talex5 commented May 6, 2022

I have added a fix to the luv backend in the second commit. In particular I'm using Luv.Async.send to yield back to the event loop when the IO task is due.

This has a large effect on the benchmarks! e.g.

Before:

$ EIO_BACKEND=luv make bench
dune exec -- ./bench/bench_yield.exe
n_fibers, ns/iter, promoted/iter
       1,   92.68,        4.9931
       2,   91.56,        4.9955
       3,   91.59,        4.9979

After:

$ EIO_BACKEND=luv make bench
dune exec -- ./bench/bench_yield.exe
n_fibers, ns/iter, promoted/iter
       1,  968.06,        9.9792
       2,  535.77,        7.4967
       3,  395.65,        6.6500

io-uring is also affected, though not as much.
Before:

$ EIO_BACKEND=io-uring make bench
dune exec -- ./bench/bench_yield.exe
n_fibers, ns/iter, promoted/iter
       1,   79.11,        4.9939
       2,   77.59,        4.9954
       3,   78.49,        4.9969

After:

$ EIO_BACKEND=io-uring make bench
dune exec -- ./bench/bench_yield.exe
n_fibers, ns/iter, promoted/iter
       1,  165.01,        9.9854
       2,  118.29,        7.4761
       3,  106.06,        6.6621

@TheLortex
Author

TheLortex commented May 6, 2022

Thank you for running the benchmarks. I expected a performance cost, but not that high for the luv backend. A single yielding task is indeed the worst case: the run queue has 1 task and 1 IO event at any time, so the IO check runs once every two iterations.

To address the performance cost globally, what we can do is remove the IO event from the run queue, and have instead a counter-based strategy:

  • initially counter is at zero
  • each run queue pop increases the counter
  • when the counter is higher than a threshold (which value?), take the IO path instead of popping the next task
  • when the IO path is taken set the counter to zero

This strategy has a negligible runtime cost when no busy yielding is happening.
When busy yielding is happening, the cost depends on the number of tasks run between each IO lookup: the higher the threshold, the lower the overhead, but the higher the latency.
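The steps above could be sketched like this (a toy model; `threshold`, the mutable counter, and the function names are all hypothetical, not Eio's actual scheduler):

```ocaml
(* Counter-based IO polling: run up to [threshold] tasks between IO
   checks. Purely illustrative. *)
let threshold = 32   (* tuning knob: higher = less overhead, more latency *)
let counter = ref 0

let next_step ~run_q ~poll_io =
  if !counter >= threshold then begin
    counter := 0;    (* IO path taken: reset the counter *)
    poll_io ()
  end else
    match run_q with
    | task :: _ -> incr counter; task ()
    | [] -> counter := 0; poll_io ()  (* queue empty: fall back to IO *)
```

Under busy yielding this amortises the IO check over `threshold` task runs, while an idle run queue still reaches the IO path immediately.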

What do you think? As for the luv backend, I'm not familiar enough with it to figure out how to improve performance on that side.

Collaborator

@talex5 talex5 left a comment


Thinking more about this, I don't think the performance change for Eio_linux is a problem: the yield benchmark previously wasn't checking for IO, but now it is, and that's correct. It's not surprising that it's slightly slower.

I tested the luv changes with the HTTP benchmark (https://github.com/ocaml-multicore/retro-httpaf-bench, adjusted to use Eio_luv) and this PR didn't make any noticeable difference, so I think it's fine.

However, the tests are failing in CI (but they work for me locally). Any idea why that is?

@TheLortex
Author

I have no idea why the CI failure is happening. I have also been able to reproduce the test using the Docker instructions of the CI job.

@talex5 talex5 force-pushed the time-scheduling branch from 1ff8c44 to 1d2e80d on May 19, 2022
@talex5 talex5 merged commit 29ff919 into ocaml-multicore:main May 19, 2022
@talex5
Collaborator

talex5 commented May 19, 2022

Thanks!

talex5 added a commit to talex5/opam-repository that referenced this pull request Jun 28, 2022
CHANGES:

API changes:

- `Net.accept_sub` is deprecated in favour of `accept_fork` (@talex5 ocaml-multicore/eio#240).
  `Fiber.fork_on_accept`, which it used internally, has been removed.

- Allow short writes in `Read_source_buffer` (@talex5 ocaml-multicore/eio#239).
  The reader is no longer required to consume all the data in one go.
  Also, add `Linux_eio.Low_level.writev_single` to expose this behaviour directly.

- `Eio.Unix_perm` is now `Eio.Dir.Unix_perm`.

New features:

- Add `Eio.Mutex` (@TheLortex @talex5 ocaml-multicore/eio#223).

- Add `Eio.Buf_write` (@talex5 ocaml-multicore/eio#235).
  This is a buffered writer for Eio sinks, based on Faraday.

- Add `Eio_mock` library for testing (@talex5 ocaml-multicore/eio#228).
  At the moment it has mock flows and networks.

- Add `Eio_mock.Backend` (@talex5 ocaml-multicore/eio#237 ocaml-multicore/eio#238).
  Allows running tests without needing a dependency on eio_main.
  Also, as it is single-threaded, it can detect deadlocks in test code instead of just hanging.

- Add `Buf_read.{of_buffer, of_string, parse_string{,_exn}, return}` (@talex5 ocaml-multicore/eio#225).

- Add `<*>` combinator to `Buf_read.Syntax` (@talex5 ocaml-multicore/eio#227).

- Add `Eio.Dir.read_dir` (@patricoferris @talex5 ocaml-multicore/eio#207 ocaml-multicore/eio#218 ocaml-multicore/eio#219)

Performance:

- Add `Buf_read` benchmark and optimise it a bit (@talex5 ocaml-multicore/eio#230).

- Inline `Buf_read.consume` to improve performance (@talex5 ocaml-multicore/eio#232).

Bug fixes / minor changes:

- Allow IO to happen even if a fiber keeps yielding (@TheLortex @talex5 ocaml-multicore/eio#213).

- Fallback for `traceln` without an effect handler (@talex5 ocaml-multicore/eio#226).
  `traceln` now works outside of an event loop too.

- Check for cancellation when creating a non-protected child context (@talex5 ocaml-multicore/eio#222).

- eio_linux: handle EINTR when calling `getrandom` (@bikallem ocaml-multicore/eio#212).

- Update to cmdliner.1.1.0 (@talex5 ocaml-multicore/eio#190).
talex5 added a commit to talex5/opam-repository that referenced this pull request Jun 28, 2022