Skip to content

Fix infinite loop in signal handling.#12342

Merged
kayceesrk merged 2 commits intoocaml:trunkfrom
kayceesrk:fix_signal_hang
Jul 2, 2023
Merged

Fix infinite loop in signal handling.#12342
kayceesrk merged 2 commits intoocaml:trunkfrom
kayceesrk:fix_signal_hang

Conversation

@kayceesrk
Copy link
Copy Markdown
Contributor

@kayceesrk kayceesrk commented Jun 30, 2023

Originally authored by @gadmm. The code is from the patch at #11307 (comment).

Fixes #12253.

Issue

This commit fixes a race in the signal handling code that creates an infinite loop in caml_enter_blocking_section.

In order to process signals, the function caml_process_pending_signals_exn first checks whether signals are pending in the flag (caml_pending_signals array) and whether the signal belongs to the current signal set of the thread. Then, it clears the pending status in the flag for that signal and calls caml_execute_signals_exn. caml_execute_signals_exn masks the signal in the current thread's signal set while it runs the OCaml signal handler.

It is possible that a signal may arrive after the signal pending flag is reset by caml_process_pending_signals_exn but before the signal is masked by caml_execute_signals_exn. The C signal handler for the racy signal sets the pending status flag after which the signal is masked. This puts the system in an inconsistent state.

The OCaml signal handler invoked in caml_execute_signals_exn may call caml_enter_blocking_section. The function caml_enter_blocking_section ensures that all the pending signals are processed before entering the blocking section by calling caml_process_pending_signals_exn until the pending flag is cleared.

Recall that the signal is masked in the thread's signal set. In order to process the pending signal, caml_process_pending_signals_exn checks whether signals are pending by checking the flag (yes) and also whether the signal is part of the current signal set (no, since it has been masked by caml_execute_signal_exn). So the pending flag is not cleared and the signal is not processed. Hence, caml_enter_blocking_section goes into an infinite loop.

Fix

The fix is to not check the pending flag for retrying the loop in caml_enter_blocking_section but rather check whether the young limit is set to UINTNAT_MAX. Setting the young limit to UINTNAT_MAX is done whenever a domain needs to be interrupted such as signals but also requesting major slice, minor gc, etc. We replace caml_process_pending_signals_exn at the beginning of the caml_enter_blocking_section loop with caml_process_pending_actions, which handles all the interrupt actions and not just signals. Importantly, caml_process_pending_actions resets the young limit. By deciding to retry the loop only when the young limit is UINTNAT_MAX, we avoid the infinite loop. Note that the signal is not forgotten as the pending flag still remembers that the signal arrived. We will process it after the signal is unmasked.

A more comprehensive fix for asynchronous actions including signals should arrive in #11307. This PR fixes the bug for 5.1 release.

@Octachron Octachron added this to the 5.1 milestone Jun 30, 2023
Co-authored-by: Guillaume Munch-Maccagnoni <Guillaume.Munch-Maccagnoniමinria.fr>

This commit fixes a race in the signal handling code that creates an
infinite loop in `caml_enter_blocking_section`.

In order to process signals, the function
`caml_process_pending_signals_exn` first checks whether signals are
pending in the flag (`caml_pending_signals` array) and whether the
signal is part of the current signal set. Then, it clears the pending
status in the flag for that signal and calls `caml_execute_signals_exn`,
which masks the signal in the current thread's signal set while the
OCaml handler is running.

It is possible that a signal may arrive after the signal pending flag is
reset but before the signal is masked. This puts the system in
inconsistent state.

The signal handler may call `caml_enter_blocking_section. The function
`caml_enter_blocking_section` ensures that all the pending signals are
processed before entering the blocking section by calling
`caml_process_pending_signals_exn` **until the pending flag is
cleared**.

Recall that the signal is masked in the thread's signal set. In order to
process the pending signal, `caml_process_pending_signals_exn` checks
whether signals are pending by checking the flag (yes) and also whether
the signal is part of the current signal set (no). So the signal is not
processed. Hence, `caml_enter_blocking_section` goes into an infinite
loop.

The fix here is to perform a weaker check for pending signals by
checking whether the `young_limit` is set to `UINTNAT_MAX` (which is the
mechanism by which a domain is interrupted when a signal arrives, but
also in other situations).
@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented Jul 1, 2023

Whether we want to include this as-is depends on if we think #11307 will closely follow or not.

The change to caml_process_pending_actions means there's a less than ideal change in behaviour.

Previously we could start a minor GC or a major slice on a call to caml_enter_blocking_section only if there was a pending signal that we ran an OCaml signal handler for. If there was a pending GC the backup thread would handle it while the main thread went off and did it's blocking operation in parallel.

Now if there's a pending signal the main thread will take it, finish it and then go off and do the blocking operation.

It does fix the infinite loop though.

@kayceesrk
Copy link
Copy Markdown
Contributor Author

Given that the code is currently executing OCaml code when calling caml_enter_blocking_section, one could possibly argue that any major or minor GC work that is now performed in the caml_enter_blocking_section function may presumably be done in the last allocation in the OCaml code before the call to this function? Yes, the code introduces less than ideal behaviour, but it is only marginally worse than current one.

I expect #11307 to follow soon. I've already reviewed it and left comments.

The bug this PR fixes is a blocker for @smondet for their production deployment. Hence, I'm keen for this fix to go in even if it is not the optimal one.

@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented Jul 1, 2023

Given that the code is currently executing OCaml code when calling caml_enter_blocking_section, one could possibly argue that any major or minor GC work that is now performed in the caml_enter_blocking_section function may presumably be done in the last allocation in the OCaml code

I agree, I was thinking more of the GC interrupts asynchronously arriving from other domains.

Given #11307 is close behind and this is down as a 5.1 requirement, LGTM.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 1, 2023

Thanks @kayceesrk for submitting the patch.

The reviewers should address the risk that programs that did not use to use signals will start seeing OCaml callbacks being called inside caml_enter_blocking_section (see longer comment at #11307 (comment)). Or any other reason this change has not been made before.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 1, 2023

Regarding the addition of a polling location for the GC: this can a priori affect cross-domain interrupts and explicitly-requested minor & major GC from the same domain. The former is a race situation so I do not think adding a polling location among many others will result in noticeable change. The latter could be a concern if there is code that requested GCs before entering a blocking section, and relied on the GC not being performed immediately; this is unlikely especially as the backup GC thread is a recent addition. Note that the same question can be asked with #11307 which changes behaviour in a similar way.

@kayceesrk kayceesrk merged commit c7be435 into ocaml:trunk Jul 2, 2023
@kayceesrk
Copy link
Copy Markdown
Contributor Author

Thanks for the approval @sadiqj. Merging this now.

@Octachron please can you cherry-pick this for 5.1?

Octachron pushed a commit that referenced this pull request Jul 3, 2023
Fix infinite loop in signal handling.

(cherry picked from commit c7be435)
@Octachron
Copy link
Copy Markdown
Member

Cherry-picked on 5.1 as 592c18e .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race in signal.c causes programs to hang

4 participants