Fix infinite loop in signal handling.#12342
Conversation
Co-authored-by: Guillaume Munch-Maccagnoni <Guillaume.Munch-Maccagnoniමinria.fr> This commit fixes a race in the signal handling code that creates an infinite loop in `caml_enter_blocking_section`. In order to process signals, the function `caml_process_pending_signals_exn` first checks whether signals are pending in the flag (`caml_pending_signals` array) and whether the signal is part of the current signal set. Then, it clears the pending status in the flag for that signal and calls `caml_execute_signals_exn`, which masks the signal in the current thread's signal set while the OCaml handler is running. It is possible that a signal may arrive after the signal pending flag is reset but before the signal is masked. This puts the system in inconsistent state. The signal handler may call `caml_enter_blocking_section. The function `caml_enter_blocking_section` ensures that all the pending signals are processed before entering the blocking section by calling `caml_process_pending_signals_exn` **until the pending flag is cleared**. Recall that the signal is masked in the thread's signal set. In order to process the pending signal, `caml_process_pending_signals_exn` checks whether signals are pending by checking the flag (yes) and also whether the signal is part of the current signal set (no). So the signal is not processed. Hence, `caml_enter_blocking_section` goes into an infinite loop. The fix here is to perform a weaker check for pending signals by checking whether the `young_limit` is set to `UINTNAT_MAX` (which is the mechanism by which a domain is interrupted when a signal arrives, but also in other situations).
b40b75b to
b26eb26
Compare
|
Whether we want to include this as-is depends on if we think #11307 will closely follow or not. The change to Previously we could start a minor GC or a major slice on a call to Now if there's a pending signal the main thread will take it, finish it and then go off and do the blocking operation. It does fix the infinite loop though. |
|
Given that the code is currently executing OCaml code when calling I expect #11307 to follow soon. I've already reviewed it and left comments. The bug this PR fixes is a blocker for @smondet for their production deployment. Hence, I'm keen for this fix to go in even if it is not the optimal one. |
I agree, I was thinking more of the GC interrupts asynchronously arriving from other domains. Given #11307 is close behind and this is down as a 5.1 requirement, LGTM. |
|
Thanks @kayceesrk for submitting the patch. The reviewers should address the risk that programs that did not use to use signals will start seeing OCaml callbacks being called inside |
|
Regarding the addition of a polling location for the GC: this can a priori affect cross-domain interrupts and explicitly-requested minor & major GC from the same domain. The former is a race situation so I do not think adding a polling location among many others will result in noticeable change. The latter could be a concern if there is code that requested GCs before entering a blocking section, and relied on the GC not being performed immediately; this is unlikely especially as the backup GC thread is a recent addition. Note that the same question can be asked with #11307 which changes behaviour in a similar way. |
b26eb26 to
ca7090c
Compare
|
Thanks for the approval @sadiqj. Merging this now. @Octachron please can you cherry-pick this for 5.1? |
Fix infinite loop in signal handling. (cherry picked from commit c7be435)
|
Cherry-picked on 5.1 as 592c18e . |
Originally authored by @gadmm. The code is from the patch at #11307 (comment).
Fixes #12253.
Issue
This commit fixes a race in the signal handling code that creates an infinite loop in
caml_enter_blocking_section.In order to process signals, the function
caml_process_pending_signals_exnfirst checks whether signals are pending in the flag (caml_pending_signalsarray) and whether the signal belongs to the current signal set of the thread. Then, it clears the pending status in the flag for that signal and callscaml_execute_signals_exn.caml_execute_signals_exnmasks the signal in the current thread's signal set while it runs the OCaml signal handler.It is possible that a signal may arrive after the signal pending flag is reset by
caml_process_pending_signals_exnbut before the signal is masked bycaml_execute_signals_exn. The C signal handler for the racy signal sets the pending status flag after which the signal is masked. This puts the system in an inconsistent state.The OCaml signal handler invoked in
caml_execute_signals_exnmay callcaml_enter_blocking_section. The functioncaml_enter_blocking_sectionensures that all the pending signals are processed before entering the blocking section by callingcaml_process_pending_signals_exnuntil the pending flag is cleared.Recall that the signal is masked in the thread's signal set. In order to process the pending signal,
caml_process_pending_signals_exnchecks whether signals are pending by checking the flag (yes) and also whether the signal is part of the current signal set (no, since it has been masked bycaml_execute_signal_exn). So the pending flag is not cleared and the signal is not processed. Hence,caml_enter_blocking_sectiongoes into an infinite loop.Fix
The fix is to not check the pending flag for retrying the loop in
caml_enter_blocking_sectionbut rather check whether the young limit is set toUINTNAT_MAX. Setting the young limit toUINTNAT_MAXis done whenever a domain needs to be interrupted such as signals but also requesting major slice, minor gc, etc. We replacecaml_process_pending_signals_exnat the beginning of thecaml_enter_blocking_sectionloop withcaml_process_pending_actions, which handles all the interrupt actions and not just signals. Importantly,caml_process_pending_actionsresets the young limit. By deciding to retry the loop only when the young limit isUINTNAT_MAX, we avoid the infinite loop. Note that the signal is not forgotten as the pending flag still remembers that the signal arrived. We will process it after the signal is unmasked.A more comprehensive fix for asynchronous actions including signals should arrive in #11307. This PR fixes the bug for 5.1 release.