[Scheduler] add timer mechanism (#4751)
Conversation
(force-pushed from 057fac5 to f5b82bd)
```ocaml
let polling_loop t () =
  let rec loop () =
```
I could change the polling loop to wait until there are active timers, but I'm not sure if it's worth it.
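For concreteness, the alternative mentioned here could look something like the sketch below: the polling thread blocks on a condition variable while no timers are registered, instead of waking up unconditionally. All type and field names are illustrative, not dune's actual API.

```ocaml
(* Hypothetical sketch: block while there are no timers, then poll.
   Requires the [threads] library. *)
type t =
  { mutex : Mutex.t
  ; timers_available : Condition.t
  ; mutable timers : float list (* absolute deadlines *)
  }

let has_work t = t.timers <> []

let polling_loop t () =
  let rec loop () =
    Mutex.lock t.mutex;
    (* Sleep until someone registers a timer and signals us. *)
    while not (has_work t) do
      Condition.wait t.timers_available t.mutex
    done;
    Mutex.unlock t.mutex;
    Thread.delay 0.1; (* then poll ~10 times per second as before *)
    loop ()
  in
  loop ()
```

Whoever registers a timer would take the mutex, add the deadline, and call `Condition.signal t.timers_available`; the cost is a little extra locking on the registration path.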
(force-pushed from 958c993 to 46119b5, then to 71f2b0c)
@aalekseyev ping on this PR (and the related test PR).
```ocaml
  ; mutable running_count : int
  }

let is_running t pid =
```
This `is_running` function is suspect because it unlocks the mutex, so by the time it returns its result may already be stale.
In fact, even if this is solved, we end up racing a `kill` against a `wait`, which is not good.
In particular, if `wait` finishes before `kill` happens, then `kill` will hit a process id that was already released by `free`, which means we may kill a different process that happens to share the pid.
...
Oh wait, the same bug exists in `Process_watcher.killall`, if I'm not mistaken.
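The check-then-act shape of the race described here can be distilled to the sketch below (all names hypothetical). No fix is claimed; as discussed later in the thread, holding the lock longer does not fully close the pid-reuse window either.

```ocaml
(* Between [Mutex.unlock] and [Unix.kill], the waiter thread may reap
   the process and release its slot, after which the OS can hand the
   same pid to an unrelated process. *)
let racy_kill ~running ~mutex pid signal =
  Mutex.lock mutex;
  let alive = Hashtbl.mem running pid in
  Mutex.unlock mutex;
  (* <- window: the wait may complete and free the pid right here *)
  if alive then Unix.kill pid signal
```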
```ocaml
Option.iter watcher ~f:(fun watcher ->
  ignore (wait_for_process t (Dune_file_watcher.pid watcher) : _ Fiber.t));
ignore (kill_and_wait_for_all_processes t : saw_signal);
if Lazy.is_val t.timer then Timer.close (Lazy.force t.timer);
```
Aren't we using `go` multiple times in some cases (in tests)? I'm worried we'll use the timer after `close`, or (which is weirder) force the lazy value after `close`.
> Aren't we using `go` multiple times in some cases (in tests)?

We're using a new scheduler each time, so I don't see how it's a problem.

> I'm worried we'll use the timer after `close` or (which is weirder) force the lazy value after `close`.

Creating a timer after `close` is already a code error. We still allow cancelling timers after `close`, however. Should that be allowed?
Fair enough, no need to change anything then.
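The lifecycle agreed on above can be sketched as follows. Names are hypothetical and dune's real code reports programming errors through its own error module; the sketch uses `invalid_arg` to stay self-contained.

```ocaml
(* Creating a timer after [close] is a programming error; cancelling
   stays a harmless no-op even after [close]. *)
module Timer = struct
  type t =
    { mutable closed : bool
    ; mutable pending : int list (* ids of registered timers *)
    }

  let create () = { closed = false; pending = [] }
  let close t = t.closed <- true

  let sleep t id =
    if t.closed then invalid_arg "Timer.sleep: timer already closed";
    t.pending <- id :: t.pending

  let cancel t id =
    (* deliberately allowed even after [close] *)
    t.pending <- List.filter (fun x -> x <> id) t.pending
end
```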
```ocaml
Event.Queue.send_invalidation_event t.events invalidation;
Fiber.return ()

let wait_for_process_with_timeout t pid ~timeout =
```
By the way, this is not a great way to "timeout" a process, because it kills one process instead of killing a process group. The `timeout` command-line tool creates a process group and kills the entire group, which is better if the command you're running is going to have child processes.
I suggest adding that as a comment and taking it into account when writing tests.
Good point. In fact, isn't this a problem with our process clean-up in general? We could write bindings for `killpg` and address this in both places.
Indeed. Now that you say it I think we have an issue open about it somewhere.
By the way, it's not simply a matter of writing bindings: it also means that we'd have to spawn every process in its own process group, which means that we need a patch for `spawn` and that C-c will no longer be sent to them, etc. All of that is probably fine, just not as self-contained a change as one might expect.
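For illustration, the process-group approach discussed here could be sketched like this (this is not what dune currently does; function names are hypothetical). The child calls `Unix.setsid`, becoming the leader of a fresh session and process group, and signalling the negative pid then reaches the whole group, grandchildren included.

```ocaml
let spawn_in_own_group prog args =
  match Unix.fork () with
  | 0 ->
    (* child: detach into a fresh session/process group *)
    ignore (Unix.setsid () : int);
    (* [execvp] does not return on success *)
    Unix.execvp prog args
  | pid ->
    (* the child is the group leader, so its pid is also the pgid *)
    pid

let kill_group pgid signal =
  (* kill(2) with a negative pid signals every member of the group *)
  Unix.kill (-pgid) signal
```

As noted in the comment above, the trade-off is that such processes leave the terminal's foreground group, so C-c no longer reaches them automatically.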
I did the first pass of review and it looks alright overall, with some improvements to be made. @jeremiedimino, you have probably thought before about how the alarms should be integrated with the dune scheduler.
I'd like to use it for two other things:
I use this PR to replace a dedicated thread used for polling with this new mechanism.
(force-pushed from 09e3915 to 80905be)
This PR should be ready then. There are still two issues:
Only vaguely, we haven't needed alarms much so far. I had imagined something a bit more sophisticated, where the alarm thread could be interrupted in order to add a new alarm. IIUC, here we chose a simpler path where sleeps are implemented via an alarm thread that wakes up 10 times per second. This method seems fine to me for the purpose of adding action timeouts and rate-limiting notifications. However, it seems to me that the documentation should mention the alarm precision to give clearer expectations to developers. For instance, someone might be surprised when they do

Regarding pid-related races, I didn't follow the discussion in detail, but just wanted to point out that we had such races in the past in the Windows implementation. These were fixed.
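The coarse-grained design described here can be sketched as follows (names illustrative): the alarm thread wakes every `tick` seconds and fires every alarm whose deadline has passed, so observed sleep durations are quantized to roughly `tick`.

```ocaml
let tick = 0.1 (* seconds, i.e. ~10 wake-ups per second *)

(* Pure helper: split deadlines into (expired, still_active). *)
let partition_expired ~now deadlines =
  List.partition (fun d -> d <= now) deadlines

let alarm_loop deadlines_ref fire =
  while true do
    Thread.delay tick;
    let now = Unix.gettimeofday () in
    let expired, active = partition_expired ~now !deadlines_ref in
    deadlines_ref := active;
    List.iter fire expired
  done
```

With this scheme a requested sleep of, say, 0.05 s can take up to ~0.1 s to fire, which is exactly the precision caveat the documentation should advertise.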
I'm sure some things were fixed, but on Unix I see we're running a blocking
Indeed, I don't know how to solve this race, except interrupting the

For the Windows race, the only explanation I found that could explain what we were observing was that Windows would be reusing pids quickly. If that's not the case on Unix, as @rgrinberg mentions, then we have less chance of hitting this race.
```ocaml
in
t.alarms <- active;
Mutex.unlock t.mutex;
if active <> [] then Event.Queue.send_timers_completed t.events expired;
```
Meant to check `expired <> []` here? That's probably why the CI times out.
By the way, do you know if we can reduce the timeout from 6 hours to, say, one hour?
Also, you can use a simple pattern match here. Why use polymorphic equality when you can easily avoid it?
Another suggestion: you could make the type accepted by `send_timers_completed` a `Non_empty_list.t`, and then you can't make this error.
> Also, you can use a simple pattern match here. Why use polymorphic equality when you can easily avoid it?

The pattern match is a couple of lines longer :)

> you could make the type accepted by `send_timers_completed` a `Non_empty_list.t` and then you can't make this error.

Good idea. I keep forgetting that we introduced this.
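For reference, the idea behind the suggestion can be sketched like this (constructor and function names are illustrative, not dune's actual `Non_empty_list`): if the sending function only accepts a non-empty list, forgetting the `expired <> []` check becomes a type error rather than a runtime bug.

```ocaml
module Nonempty_list = struct
  type 'a t = T of 'a * 'a list

  let of_list = function
    | [] -> None
    | x :: xs -> Some (T (x, xs))

  let to_list (T (x, xs)) = x :: xs
end

(* The call site is forced to handle the empty case explicitly: *)
let send_if_any send expired =
  match Nonempty_list.of_list expired with
  | None -> () (* nothing expired: don't wake the event queue *)
  | Some nonempty -> send nonempty
```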
aalekseyev left a comment:
Looks good to me, except for the small bug I wrote about.
@jeremiedimino, I can think of two ways:
The
(force-pushed from abcc123 to 4286e76)
* Allow waiting for a duration to elapse before resuming a fiber
* Add timeout argument to `wait_for_process`

Signed-off-by: Rudi Grinberg <me@rgrinberg.com>
(force-pushed from 4286e76 to 9713536)
Allow waiting for a duration to elapse before resuming a fiber