
[Scheduler] add timer mechanism #4751

Merged
rgrinberg merged 1 commit into ocaml:main from rgrinberg:scheduler-timer
Jun 29, 2021

Conversation

@rgrinberg
Member

@rgrinberg commented Jun 17, 2021

Allow waiting for a duration to elapse before resuming a fiber

  • Scheduler.sleep
  • Scheduler.wait_for_process_with_timeout
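The mechanism can be pictured with a minimal model (illustrative only, not dune's actual implementation; all names below are made up): pending timers are kept as (deadline, id) pairs, and each polling tick partitions them into expired and still-active sets.

```ocaml
(* Toy model of a timer queue: a tick splits timers into expired and active.
   This is a sketch, not dune's scheduler code. *)
let tick ~now timers =
  List.partition (fun (deadline, _id) -> deadline <= now) timers

let () =
  let timers = [ (1.0, "a"); (3.0, "b"); (2.0, "c") ] in
  let expired, active = tick ~now:2.0 timers in
  Printf.printf "expired: %s\n" (String.concat "," (List.map snd expired));
  Printf.printf "active: %s\n" (String.concat "," (List.map snd active))
```

The real scheduler additionally has to resume fibers and synchronize with the event queue; the partition step is the core idea.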

@rgrinberg requested a review from aalekseyev June 17, 2021 21:03
@rgrinberg force-pushed the scheduler-timer branch 2 times, most recently from 057fac5 to f5b82bd on June 17, 2021 21:12
}

let polling_loop t () =
let rec loop () =
Member Author


I could change the polling loop to wait until there are active timers, but I'm not sure if it's worth it.

Collaborator


Not sure either.

@rgrinberg force-pushed the scheduler-timer branch 3 times, most recently from 958c993 to 46119b5 on June 18, 2021 07:54
@rgrinberg
Member Author

@aalekseyev ping on this PR (and the related test PR).

; mutable running_count : int
}

let is_running t pid =
Collaborator

@aalekseyev commented Jun 22, 2021


This is_running function is suspect: it unlocks the mutex before returning, so by the time the caller sees the result it may already be stale.
In fact, even if that is solved, we still end up racing a kill against a wait, which is not good.
In particular, if the wait finishes before the kill happens, the kill will hit a process id that was "released" by free, which means we may kill a different process that happens to share the pid.

...

Oh wait, the same bug exists in Process_watcher.killall if I'm not mistaken.

Comment thread src/dune_engine/scheduler.ml Outdated
Option.iter watcher ~f:(fun watcher ->
ignore (wait_for_process t (Dune_file_watcher.pid watcher) : _ Fiber.t));
ignore (kill_and_wait_for_all_processes t : saw_signal);
if Lazy.is_val t.timer then Timer.close (Lazy.force t.timer);
Collaborator


Aren't we using go multiple times in some cases (in tests)? I'm worried we'll use the timer after close or (which is weirder) force the lazy value after close.

Member Author


Aren't we using go multiple times in some cases (in tests)?

We're using a new scheduler each time, so I don't see how it's a problem.

I'm worried we'll use the timer after close or (which is weirder) force the lazy value after close.

Creating a timer after close is already a code error. We do still allow cancelling timers after close, however. Should that be allowed?

Collaborator


Fair enough, no need to change anything then.

Event.Queue.send_invalidation_event t.events invalidation;
Fiber.return ()

let wait_for_process_with_timeout t pid ~timeout =
Collaborator


By the way, this is not a great way to "time out" a process, because it kills a single process instead of a process group. The "timeout" command-line tool creates a process group and kills the entire group, which is better if the command you're running spawns child processes.

I suggest adding that to a comment and to take that into account when writing tests.
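To illustrate the difference, here is a self-contained, Unix-only sketch (not dune's code; it requires linking the unix library) of killing an entire process group: the child makes itself a session and group leader with setsid, and passing a negative pid to kill delivers the signal to every member of that group.

```ocaml
(* Sketch: kill a whole process group, the way the "timeout" tool does. *)
let () =
  match Unix.fork () with
  | 0 ->
    (* Child: become leader of a fresh session (and thus process group). *)
    ignore (Unix.setsid ());
    Unix.sleepf 10.;
    exit 0
  | pid ->
    Unix.sleepf 0.2 (* give the child time to call setsid *);
    (* A negative pid delivers the signal to the whole group. *)
    Unix.kill (-pid) Sys.sigterm;
    let _, status = Unix.waitpid [] pid in
    (match status with
     | Unix.WSIGNALED s when s = Sys.sigterm -> print_endline "group killed"
     | _ -> print_endline "unexpected status")
```

With a plain kill pid, any grandchildren the command spawned would survive; signaling the group reaches them too.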

Member Author


Good point. In fact, isn't this a problem with our process clean-up in general? We could write bindings for killpg and address this in both places.

Collaborator

@aalekseyev commented Jun 23, 2021


Indeed. Now that you say it I think we have an issue open about it somewhere.

Collaborator


By the way, it's not simply a matter of writing bindings: it also means we'd have to spawn every process in its own process group, which requires a patch for spawn, and means that Ctrl-C will no longer be sent to those processes, etc. All of that is probably fine, just not as self-contained a change as one might expect.

@aalekseyev
Collaborator

I did the first pass of review and it looks alright overall, with some improvements to be made.
I feel unsure about adding all this stuff to scheduler.ml if the whole motivation is to make tests a bit easier to debug, though.
Do you expect to need it for other things soon, or are we adding it only for the convenience of debugging tests?

@jeremiedimino, you probably thought before about how the alarms should be integrated with dune scheduler.
Do you have any opinion on this?

@rgrinberg
Member Author

Do you expect to need it for other things soon, or are we adding it only for the convenience of debugging tests?

I'd like to use it for two other things:

  • Adding a general timeout action. E.g. (timeout 2 (run ./foo)).
  • Debouncing progress notifications.

@rgrinberg
Member Author

I've used this PR to replace a dedicated polling thread with this new mechanism.

@rgrinberg
Member Author

This PR should be ready then. There are still two issues:

  • We aren't killing process groups. I created a separate issue for this.
  • The various pid-related races. To be honest, I'm not very concerned about these, because the OS takes some time before it starts reusing pids (at least on the operating systems I'm familiar with). But it's certainly worth fixing in the long run.

@ghost

ghost commented Jun 24, 2021

@jeremiedimino, you probably thought before about how the alarms should be integrated with dune scheduler.
Do you have any opinion on this?

Only vaguely; we haven't needed alarms much so far. I had imagined something a bit more sophisticated, where the alarm thread could be interrupted in order to add a new alarm. IIUC, here we chose a simpler path where sleeps are implemented via an alarm thread that wakes up 10 times per second. This method seems fine to me for the purpose of adding action timeouts and rate-limiting notifications.

However, it seems to me that the documentation should mention the alarm precision to give clearer expectations to developers. For instance, someone might be surprised when they do sleep 0.001 and the thread ends up pausing for more than 0.1 seconds.
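That expectation can be made concrete with a tiny model (hypothetical names, assuming the 0.1 s tick implied by the 10-wakeups-per-second loop above): a requested sleep is effectively rounded up to the next tick.

```ocaml
(* Model of a coarse alarm thread: sleeps round up to the 0.1 s resolution. *)
let resolution = 0.1

let effective_sleep requested =
  (* The alarm can only fire on a tick boundary, so round up. *)
  let ticks = int_of_float (ceil (requested /. resolution)) in
  float_of_int ticks *. resolution

let () =
  (* A 1 ms sleep still waits roughly a full tick. *)
  Printf.printf "%.1f\n" (effective_sleep 0.001);
  Printf.printf "%.1f\n" (effective_sleep 0.25)
```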

Regarding pid related races, I didn't follow the discussion in detail, but just wanted to point out that we had such races in the past in the Windows implementation. These were fixed.

Comment thread src/dune_engine/scheduler.mli Outdated
@rgrinberg
Member Author

@jeremiedimino:

  • I changed the code to use Fiber.of_thunk
  • And improved the comments to note that timers smaller than the resolution aren't going to work well

@aalekseyev
Collaborator

@jeremiedimino ,

Regarding pid related races, ... we had such races in the past in the Windows implementation. These were fixed.

I'm sure some things were fixed, but on unix I see we're running a blocking wait and we're trying to interrupt that with a kill. That will always have a race, no?

@ghost

ghost commented Jun 28, 2021

Indeed, I don't know how to solve this race. Except maybe by interrupting the wait call when we want to kill something?

For the Windows race, the only explanation I found for what we were observing was that Windows reuses pids quickly. If that's not the case on Unix, as @rgrinberg mentions, then we have less chance of hitting this race.

Comment thread src/dune_engine/scheduler.ml Outdated
in
t.alarms <- active;
Mutex.unlock t.mutex;
if active <> [] then Event.Queue.send_timers_completed t.events expired;
Collaborator

@aalekseyev commented Jun 29, 2021


Meant to check expired <> [] here? That's probably why the CI times out.
By the way, do you know if we can reduce the timeout from 6 hours to, say, one hour?
Also, you can use a simple pattern-match here. Why use polymorphic equality when you can easily avoid it?
Another suggestion: you could make the type accepted by send_timers_completed a Non_empty_list.t and then you can't make this error.

Member Author


Also, you can use a simple pattern-match here. Why use polymorphic equality when you can easily avoid it?

The pattern match is a couple of lines longer :)

you could make the type accepted by send_timers_completed a Non_empty_list.t and then you can't make this error.

Good idea. I keep forgetting that we introduced this.
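A hedged sketch of that suggestion (the module and function below are stand-ins, not dune's actual Non_empty_list or event-queue API): if send_timers_completed only accepts a provably non-empty list, forgetting the emptiness check becomes a type error instead of a runtime bug.

```ocaml
(* Sketch: a non-empty list type makes "send an empty batch" unrepresentable. *)
module Nonempty_list = struct
  type 'a t = { hd : 'a; tl : 'a list }

  let of_list = function
    | [] -> None
    | hd :: tl -> Some { hd; tl }

  let length { tl; _ } = 1 + List.length tl
end

(* Stand-in for the real event-queue function. *)
let send_timers_completed (expired : int Nonempty_list.t) =
  Printf.printf "completed %d timer(s)\n" (Nonempty_list.length expired)

let () =
  (* The only way to build a [Nonempty_list.t] is through [of_list],
     which forces the caller to handle the empty case explicitly. *)
  match Nonempty_list.of_list [ 1; 2; 3 ] with
  | None -> ()
  | Some expired -> send_timers_completed expired
```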

Collaborator

@aalekseyev left a comment


Looks good to me, except for the small bug I wrote about.

@aalekseyev
Collaborator

aalekseyev commented Jun 29, 2021

@jeremiedimino, I can think of two ways:

  1. Rely on the SIGCHLD signal to wake up the polling loop. This is what Async does, but Async also, for some reason, loops over all processes and does waitpid on each. I think we can instead simply do a wait to find out which process terminated, which should be more efficient.
  2. As you say, interrupt our own thread with pthread_kill. That should work, but is probably less portable than SIGCHLD.
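Option 1 can be sketched as follows (Unix-only, requires linking the unix library; not dune's code): a single Unix.wait blocks until some child exits and reports which one, with no need to loop waitpid over every tracked pid.

```ocaml
(* Sketch: Unix.wait tells us which child terminated. *)
let spawn seconds =
  match Unix.fork () with
  | 0 -> Unix.sleepf seconds; exit 0
  | pid -> pid

let () =
  let slow = spawn 5.0 in
  let fast = spawn 0.1 in
  (* Blocks until *some* child exits, returning its pid. *)
  let exited, _status = Unix.wait () in
  Printf.printf "fast exited first: %b\n" (exited = fast);
  (* Clean up the remaining child. *)
  Unix.kill slow Sys.sigkill;
  ignore (Unix.waitpid [] slow)
```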

@ghost

ghost commented Jun 29, 2021

The SIGCHLD method seems good.

@rgrinberg force-pushed the scheduler-timer branch 2 times, most recently from abcc123 to 4286e76 on June 29, 2021 17:17
* Allow waiting for a duration to elapse before resuming a fiber
* Add timeout argument to wait_for_process

Signed-off-by: Rudi Grinberg <me@rgrinberg.com>
@rgrinberg merged commit dedd703 into ocaml:main Jun 29, 2021