
[Scheduler] add timer mechanism #4751

Merged
rgrinberg merged 1 commit into ocaml:main from rgrinberg:scheduler-timer
Jun 29, 2021

Conversation

@rgrinberg
Member

@rgrinberg commented Jun 17, 2021

Allow waiting for a duration to elapse before resuming a fiber

  • Scheduler.sleep
  • Scheduler.wait_for_process_with_timeout
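The mechanism can be pictured with a minimal model (illustrative only, not dune's actual implementation; all names below are made up): pending timers are kept as (deadline, id) pairs, and each polling tick partitions them into expired and still-active sets.

```ocaml
(* Toy model of a timer queue: a tick splits timers into expired and active.
   This is a sketch, not dune's scheduler code. *)
let tick ~now timers =
  List.partition (fun (deadline, _id) -> deadline <= now) timers

let () =
  let timers = [ (1.0, "a"); (3.0, "b"); (2.0, "c") ] in
  let expired, active = tick ~now:2.0 timers in
  Printf.printf "expired: %s\n" (String.concat "," (List.map snd expired));
  Printf.printf "active: %s\n" (String.concat "," (List.map snd active))
```

The real scheduler additionally has to resume fibers and synchronize with the event queue; the partition step is the core idea.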

@rgrinberg requested a review from aalekseyev June 17, 2021 21:03
@rgrinberg force-pushed the scheduler-timer branch 2 times, most recently from 057fac5 to f5b82bd on June 17, 2021 21:12
}

let polling_loop t () =
let rec loop () =
Member Author


I could change the polling loop to wait until there are active timers, but I'm not sure if it's worth it.

Collaborator


Not sure either.

@rgrinberg force-pushed the scheduler-timer branch 3 times, most recently from 958c993 to 46119b5 on June 18, 2021 07:54
@rgrinberg
Member Author

@aalekseyev ping on this PR (and the related test PR).

; mutable running_count : int
}

let is_running t pid =
Collaborator

@aalekseyev commented Jun 22, 2021


This is_running function is suspect: it unlocks the mutex before returning, so by the time the caller sees the result it may already be stale.
In fact, even if that is solved, we still end up racing a kill against a wait, which is not good.
In particular, if the wait finishes before the kill happens, the kill will hit a process id that was "released" by free, which means we may kill a different process that happens to share the pid.

...

Oh wait, the same bug exists in Process_watcher.killall if I'm not mistaken.

Comment thread src/dune_engine/scheduler.ml Outdated
Option.iter watcher ~f:(fun watcher ->
ignore (wait_for_process t (Dune_file_watcher.pid watcher) : _ Fiber.t));
ignore (kill_and_wait_for_all_processes t : saw_signal);
if Lazy.is_val t.timer then Timer.close (Lazy.force t.timer);
Collaborator


Aren't we using go multiple times in some cases (in tests)? I'm worried we'll use the timer after close or (which is weirder) force the lazy value after close.

Member Author


Aren't we using go multiple times in some cases (in tests)?

We're using a new scheduler each time, so I don't see how it's a problem.

I'm worried we'll use the timer after close or (which is weirder) force the lazy value after close.

Creating a timer after close is already a code error. We do still allow cancelling timers after close, however. Should that be allowed?

Collaborator


Fair enough, no need to change anything then.

Event.Queue.send_invalidation_event t.events invalidation;
Fiber.return ()

let wait_for_process_with_timeout t pid ~timeout =
Collaborator


By the way, this is not a great way to "time out" a process, because it kills a single process instead of a process group. The "timeout" command-line tool creates a process group and kills the entire group, which is better if the command you're running spawns child processes.

I suggest adding that to a comment and to take that into account when writing tests.
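To illustrate the difference, here is a self-contained, Unix-only sketch (not dune's code; it requires linking the unix library) of killing an entire process group: the child makes itself a session and group leader with setsid, and passing a negative pid to kill delivers the signal to every member of that group.

```ocaml
(* Sketch: kill a whole process group, the way the "timeout" tool does. *)
let () =
  match Unix.fork () with
  | 0 ->
    (* Child: become leader of a fresh session (and thus process group). *)
    ignore (Unix.setsid ());
    Unix.sleepf 10.;
    exit 0
  | pid ->
    Unix.sleepf 0.2 (* give the child time to call setsid *);
    (* A negative pid delivers the signal to the whole group. *)
    Unix.kill (-pid) Sys.sigterm;
    let _, status = Unix.waitpid [] pid in
    (match status with
     | Unix.WSIGNALED s when s = Sys.sigterm -> print_endline "group killed"
     | _ -> print_endline "unexpected status")
```

With a plain kill pid, any grandchildren the command spawned would survive; signaling the group reaches them too.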

Member Author


Good point. In fact, isn't this a problem with our process clean-up in general? We could write bindings for killpg and address this in both places.

Collaborator

@aalekseyev commented Jun 23, 2021


Indeed. Now that you say it I think we have an issue open about it somewhere.

Collaborator


By the way, it's not simply a matter of writing bindings: it also means we'd have to spawn every process in its own process group, which requires a patch for spawn, and means that Ctrl-C will no longer be sent to those processes, etc. All of that is probably fine, just not as self-contained a change as one might expect.

@aalekseyev
Collaborator

I did the first pass of review and it looks alright overall, with some improvements to be made.
I feel unsure about adding all this stuff to scheduler.ml if the whole motivation is to make tests a bit easier to debug, though.
Do you expect to need it for other things soon, or are we adding it only for the convenience of debugging tests?

@jeremiedimino, you probably thought before about how the alarms should be integrated with dune scheduler.
Do you have any opinion on this?

@rgrinberg
Member Author

Do you expect to need it for other things soon, or are we adding it only for the convenience of debugging tests?

I'd like to use it for two other things:

  • Adding a general timeout action. E.g. (timeout 2 (run ./foo)).
  • Debouncing progress notifications.

@rgrinberg
Member Author

I've used this PR to replace a dedicated polling thread with this new mechanism.

@rgrinberg
Member Author

This PR should be ready then. There are still two issues:

  • We aren't killing process groups. I created a separate issue for this.
  • The various pid-related races. To be honest, I'm not very concerned about these, because the OS takes some time before it starts reusing pids (at least on the operating systems I'm familiar with). But it's certainly worth fixing in the long run.

@ghost

ghost commented Jun 24, 2021

@jeremiedimino, you probably thought before about how the alarms should be integrated with dune scheduler.
Do you have any opinion on this?

Only vaguely; we haven't needed alarms much so far. I had imagined something a bit more sophisticated, where the alarm thread could be interrupted in order to add a new alarm. IIUC, here we chose a simpler path where sleeps are implemented via an alarm thread that wakes up 10 times per second. This method seems fine to me for the purpose of adding action timeouts and rate-limiting notifications.

However, it seems to me that the documentation should mention the alarm precision to give clearer expectations to developers. For instance, someone might be surprised when they do sleep 0.001 and the thread ends up pausing for more than 0.1 seconds.
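That expectation can be made concrete with a tiny model (hypothetical names, assuming the 0.1 s tick implied by the 10-wakeups-per-second loop above): a requested sleep is effectively rounded up to the next tick.

```ocaml
(* Model of a coarse alarm thread: sleeps round up to the 0.1 s resolution. *)
let resolution = 0.1

let effective_sleep requested =
  (* The alarm can only fire on a tick boundary, so round up. *)
  let ticks = int_of_float (ceil (requested /. resolution)) in
  float_of_int ticks *. resolution

let () =
  (* A 1 ms sleep still waits roughly a full tick. *)
  Printf.printf "%.1f\n" (effective_sleep 0.001);
  Printf.printf "%.1f\n" (effective_sleep 0.25)
```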

Regarding pid related races, I didn't follow the discussion in detail, but just wanted to point out that we had such races in the past in the Windows implementation. These were fixed.

Comment thread src/dune_engine/scheduler.mli Outdated
@rgrinberg
Member Author

@jeremiedimino:

  • I changed the code to use Fiber.of_thunk
  • And improved the comments to note that timers smaller than the resolution aren't going to work well

@aalekseyev
Collaborator

@jeremiedimino ,

Regarding pid related races, ... we had such races in the past in the Windows implementation. These were fixed.

I'm sure some things were fixed, but on unix I see we're running a blocking wait and we're trying to interrupt that with a kill. That will always have a race, no?

@ghost

ghost commented Jun 28, 2021

Indeed, I don't know how to solve this race. Except maybe by interrupting the wait call when we want to kill something?

For the Windows race, the only explanation I found for what we were observing was that Windows reuses pids quickly. If that's not the case on Unix, as @rgrinberg mentions, then we have less chance of hitting this race.

Comment thread src/dune_engine/scheduler.ml Outdated
in
t.alarms <- active;
Mutex.unlock t.mutex;
if active <> [] then Event.Queue.send_timers_completed t.events expired;
Collaborator

@aalekseyev commented Jun 29, 2021


Meant to check expired <> [] here? That's probably why the CI times out.
By the way, do you know if we can reduce the timeout from 6 hours to, say, one hour?
Also, you can use a simple pattern-match here. Why use polymorphic equality when you can easily avoid it?
Another suggestion: you could make the type accepted by send_timers_completed a Non_empty_list.t and then you can't make this error.

Member Author


Also, you can use a simple pattern-match here. Why use polymorphic equality when you can easily avoid it?

The pattern match is a couple of lines longer :)

you could make the type accepted by send_timers_completed a Non_empty_list.t and then you can't make this error.

Good idea. I keep forgetting that we introduced this.
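A hedged sketch of that suggestion (the module and function below are stand-ins, not dune's actual Non_empty_list or event-queue API): if send_timers_completed only accepts a provably non-empty list, forgetting the emptiness check becomes a type error instead of a runtime bug.

```ocaml
(* Sketch: a non-empty list type makes "send an empty batch" unrepresentable. *)
module Nonempty_list = struct
  type 'a t = { hd : 'a; tl : 'a list }

  let of_list = function
    | [] -> None
    | hd :: tl -> Some { hd; tl }

  let length { tl; _ } = 1 + List.length tl
end

(* Stand-in for the real event-queue function. *)
let send_timers_completed (expired : int Nonempty_list.t) =
  Printf.printf "completed %d timer(s)\n" (Nonempty_list.length expired)

let () =
  (* The only way to build a [Nonempty_list.t] is through [of_list],
     which forces the caller to handle the empty case explicitly. *)
  match Nonempty_list.of_list [ 1; 2; 3 ] with
  | None -> ()
  | Some expired -> send_timers_completed expired
```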

Collaborator

@aalekseyev left a comment


Looks good to me, except for the small bug I wrote about.

@aalekseyev
Collaborator

aalekseyev commented Jun 29, 2021

@jeremiedimino, I can think of two ways:

  1. Rely on the SIGCHLD signal to wake up the polling loop. This is what Async does, but Async also, for some reason, loops over all processes and does waitpid on each. I think we can instead simply do a wait to find out which process terminated, which should be more efficient.
  2. As you say, interrupt our own thread with pthread_kill. That should work, but is probably less portable than SIGCHLD.
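Option 1 can be sketched as follows (Unix-only, requires linking the unix library; not dune's code): a single Unix.wait blocks until some child exits and reports which one, with no need to loop waitpid over every tracked pid.

```ocaml
(* Sketch: Unix.wait tells us which child terminated. *)
let spawn seconds =
  match Unix.fork () with
  | 0 -> Unix.sleepf seconds; exit 0
  | pid -> pid

let () =
  let slow = spawn 5.0 in
  let fast = spawn 0.1 in
  (* Blocks until *some* child exits, returning its pid. *)
  let exited, _status = Unix.wait () in
  Printf.printf "fast exited first: %b\n" (exited = fast);
  (* Clean up the remaining child. *)
  Unix.kill slow Sys.sigkill;
  ignore (Unix.waitpid [] slow)
```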

@ghost

ghost commented Jun 29, 2021

The SIGCHLD method seems good.

@rgrinberg force-pushed the scheduler-timer branch 2 times, most recently from abcc123 to 4286e76 on June 29, 2021 17:17
* Allow waiting for a duration to elapse before resuming a fiber
* Add timeout argument to wait_for_process

Signed-off-by: Rudi Grinberg <me@rgrinberg.com>
@rgrinberg merged commit dedd703 into ocaml:main Jun 29, 2021