Skip to content

EINTR-based signals, again#9722

Merged
xavierleroy merged 6 commits intoocaml:trunkfrom
stedolan:eintr-again
Jul 28, 2020
Merged

EINTR-based signals, again#9722
xavierleroy merged 6 commits intoocaml:trunkfrom
stedolan:eintr-again

Conversation

@stedolan
Copy link
Copy Markdown
Contributor

@stedolan stedolan commented Jun 29, 2020

This is a (greatly simplified) reimplementation of #1128, to solve a collection of tricky locking and signal-handling issues.

In trunk, when a signal arrives it is handled:

  • at the next polling point, if not in a blocking section
  • If in a blocking section, then:
    • immediately, if the systhreads library has not been linked
    • at the next polling point, if the systhreads library has been linked

Polling points include all allocations in OCaml and calls to caml_process_pending_actions from C. (Before #8691 / #9027 / #8897, all allocations in C were also polling points, but no longer)

There are a number of problems with the current behaviour:

  • It is odd for the responsiveness of a program to signals to depend on whether systhreads has been linked.

  • (IO mutex and threads #5141, Bad locking in io.c #7503) When they interrupt code using mutexes, the signal handlers run under an unknown collection of locks, which can cause deadlocks. This is particularly bad for signal handlers that write a message to stdout or stderr, as io.c maintains mutexes on those channels.

  • If a signal is handled immediately during a blocking section that was executing a non-async-signal-safe function, arbitrary corruption can occur. (It is possible to write a blocking section that only uses async-signal-safe functions, but this requires care, and various blocking sections even in the stdlib are not async-signal-safe).

  • If a signal is handled immediately during a blocking section, then any progress made by that blocking section is discarded if the handler raises. Specifically, a signal may occur after a system call has successfully returned but before the blocking section has ended, and its return value was lost. There is in general no way to recover from this in the case of say, writing to a pipe.

This patch fixes these issues by:

  • Always waiting for the next polling point, and never handling signals immediately (regardless of whether systhreads is linked)

  • Modifying the blocking read and write calls in io.c to insert a polling point (and hence handle signals) before retrying a system call on receipt of an EINTR.

If a signal arrives during a blocking system call, instead of handling the signal then and there, we mark the pending signal and rely on the OS to make the system call fail with EINTR, at which point we check for pending signals before retrying the operation.

Compared with the previous EINTR patch (#1128), this one is much simpler. When #1128 was written, there was no easy way to delay handling of a signal to a convenient point, and so that patch included some gymnastics to ensure it was always safe to handle one.

Since #8691 / #9027 / #8897, polling points are less frequent in C code, and delaying a signal until convenient is easier. To allow this, the C API is modified by:

  • Adding caml_check_pending_actions, which checks for pending actions without yet running them
  • Adding caml_enter_blocking_section_no_pending, which enters a blocking section without checking for pending actions
  • Never checking for pending actions in caml_leave_blocking_section

(This PR is best read commit-by-commit)

TODO:

  • There's currently a signals_are_pending = 1 in caml_leave_blocking_section, which might cause unnecessary work. I'll investigate.
  • Windows support (left until after Windows Console Unicode Support #1408 is merged)

@stedolan stedolan mentioned this pull request Jun 29, 2020
4 tasks
} else {
nread = caml_read_fd(channel->fd, channel->flags, channel->buff,
channel->end - channel->buff);
if (nread == Io_interrupted) goto again;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional that there is no corresponding check for nread == 0 here to check for an EOF condition, as there is with caml_read_fd a few hunk above?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is odd! I'm trying to preserve existing behaviour here, and it looks like caml_refill raises end_of_file on a zero read but caml_getblock does not. The only user of getblock in the stdlib (md5.c) has an explicit == 0 check after the call.

Copy link
Copy Markdown
Contributor

@gadmm gadmm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying the blocking read and write calls in io.c to insert a polling point (and hence handle signals) before retrying a system call on receipt of an EINTR.

The way I understand the previous code, it already polled before retrying the system call, inside caml_enter_blocking_section and caml_leave_blocking_section. The new version ensures that channels are unlocked when signals are called, though. Am I right?

The PR contains a proposal for new public API functions, and a behaviour change for caml_leave_blocking_section, about which I am a bit sceptical (comments below).

If a signal is handled immediately during a blocking section, then any progress made by that blocking section is discarded if the handler raises. Specifically, a signal may occur after a system call has successfully returned but before the blocking section has ended, and its return value was lost. There is in general no way to recover from this in the case of say, writing to a pipe.

I do not see exactly what situations you are referring to, but could these situations also be dealt with explicit clean-up after call to a caml_{enter,leave}_blocking_section_exn function as the ones proposed at #8997?

Now, the clean-up approach cannot deal with the problem of unlocking channels, but I think that recursive mutexes might be a good solution to locking twice in this situation, which would simplify the code.

Another question: it is not clear to me that the (non-)SA_RESTART behaviour of signal is portable. Linux signal man page contradicts the glibc manual about glibc's own behaviour (which is also different from the Linux system call)... I did not find anything in POSIX. I wonder if it is better to use sigaction with explicit omission of SA_RESTART.

Like Jacques-Henri, I had further questions about what guarantees the PR should try to achieve. I thought a bit about it while reviewing the other patch; I will write about it later if I find the time.


/* Functions and macros that can be called from C. Take arguments of
type struct channel *. No locking is performed. */
type struct channel *. The channel must be locked before calling these. */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify the documentation, the change applies to all the 14 functions and macros taking a channel as argument that follow?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I've rearranged this file to make it clearer when the channel must be locked.

#endif

CAMLextern void caml_enter_blocking_section (void);
CAMLextern void caml_enter_blocking_section_no_pending (void);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding its inclusion in the public (non-internal) part of the interface: at #8997 there is a proposal to add a variant caml_enter_blocking_section_exn as part of making raising explicit in the C api, and #8961 implements masking which disables the processing of signals inside caml_enter_blocking_section in a different manner. I am slightly worried about this interface becoming complicated with many possible variants, so I wonder what minimal part of the interface should be offered to the programmer. (The function could also be proposed in the internals part of the interface.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, having too many variants of these functions is definitely bad. I'm not sure what the correct set is.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behaviour for enter/leave_section you are implementing is precisely the one at #8961 when the "uninterruptible" mask is set. Would an extra call to caml_mask/caml_unmask (from 3a72aab9f) be worth the greater simplicity of the public API? If so, a solution would be to keep caml_enter_blocking_section_no_pending for now, but only in the internal part of the API, and then later I can make the change.

Memprof callbacks. Assumes that the runtime lock is held. Can raise
exceptions asynchronously into OCaml code. */

CAMLextern int caml_check_pending_actions (void);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding its inclusion in the public (non-internal) part of the interface (as well): such a check was explicitly left out of the public interface. If there is a good reason to let the programmer do the check, this can change of course, but my reasoning was that it was not clear that it is useful for the programmer, whereas keeping the race between checking and executing signals under control inside the process_pending_* functions makes the signals api and delivery logic easier to evolve should it become necessary. (I remember ocamlcc used to execute signals in a loop up until caml_something_to_do was false; while this probably worked, I do not think this is the intended semantics of polling.) #8961 proposes a similar function (taking masking into account), but in the internals part of the interface.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you looked at the use of this function in this patch? It's difficult to replicate without it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but there is a problem of atomicity. Essentially, if caml_check_pending_actions returns false and a signal arrives after the check but before the system call, then we have to wait for the end of the system call to execute the signal handler.

So, essentially, in this PR, we are trying to do some kind of "best effort", praying that no signal will arrive during this short period of time, and I think this function is useful for that.

The usual solution to avoid this problem is to use the "self-pipe trick" instead of a volatile variable, or to use the pselect system call. But none of these is compatible with all blocking calls (e.g., calling open on a named pipe, locking a pthread mutex, ...)....

So, I agree with @gadmm that this function should be avoided in theory to get well-behaved code, but in practice the POSIX API does not give us the choice to not write bogus code.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if this is a really small detail and if I am excessively nit-picking with naming. If you insist in making caml_check_pending_actions public, can we please rename it into something like caml_has_action_to_do like at #8961, in order to make the distinction with simply having an action pending? The distinction becomes important after #8961 (we must compare the value of the actual action pending with the value of the mask), but if you make the function public now, then I am stuck with the slightly misleading name.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second look, your naming is nicely consistent with the other functions *_pending_actions. Sorry for the excessive bikeshedding, please do as you see fit.

CAMLexport int caml_flush_partial(struct channel *channel)
{
int towrite, written;
again:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merely a matter of style, but couldn't all the gotos be replaced with simple loops? Would that be better?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could certainly do this with a state variable and a loop. I copied the coding style here from the EINTR handling in runtime/unix.c, but I'm not especially attached to it. It's not obvious to me which would be clearer.

handled at this point. */
signals_are_pending = 1;
caml_raise_if_exception(caml_process_pending_signals_exn());
//caml_raise_if_exception(caml_process_pending_signals_exn());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for changing the polling behaviour of caml_leave_blocking_section, instead of e.g. introducing a new function caml_leave_blocking_section_no_pending?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is usually a bug to check for pending signals in caml_leave_blocking_section? You generally have a result (success or error) of whatever was in the blocking section, and discarding it because a signal handler raises is probably not what you want.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

But if would be better to remove the line instead of commenting it out.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jhjourdan. Please remove the commented-out code. Optionally, you could add a comment that explains "We used to process pending signals at this point but it is not a good idea because...".

runtime/unix.c Outdated
#endif
if (retcode == -1) {
if (errno == EINTR) goto again;
if (errno == EINTR) return -1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (errno == EINTR) return -1;
if (errno == EINTR) return Io_interrupted;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

@stedolan
Copy link
Copy Markdown
Contributor Author

stedolan commented Jun 30, 2020

The way I understand the previous code, it already polled before retrying the system call, inside caml_enter_blocking_section and caml_leave_blocking_section. The new version ensures that channels are unlocked when signals are called, though. Am I right?

Yes, that's correct. I'll update the PR description to make this clearer.

If a signal is handled immediately during a blocking section, then any progress made by that blocking section is discarded if the handler raises. Specifically, a signal may occur after a system call has successfully returned but before the blocking section has ended, and its return value was lost. There is in general no way to recover from this in the case of say, writing to a pipe.

I do not see exactly what situations you are referring to, but could these situations also be dealt with explicit clean-up after call to a caml_{enter,leave}_blocking_section_exn function as the ones proposed at #8997?

Now, the clean-up approach cannot deal with the problem of unlocking channels, but I think that recursive mutexes might be a good solution to locking twice in this situation, which would simplify the code.

Using recursive mutexes here would introduce subtle bugs. The arguments to e.g. caml_read_fd would be invalidated by a recursive use of the same channel. This is hard to spot and hard to fix.

It is much simpler to prevent signal handlers running between the accesses to chan in io.c and the call to read.

Another question: it is not clear to me that the (non-)SA_RESTART behaviour of signal is portable. Linux signal man page contradicts the glibc manual about glibc's own behaviour (which is also different from the Linux system call)... I did not find anything in POSIX. I wonder if it is better to use sigaction with explicit omission of SA_RESTART.

I think we unconditionally use sigaction? I agree signal is dicey here, but I'm not sure where it's used.

@stedolan
Copy link
Copy Markdown
Contributor Author

Another question: it is not clear to me that the (non-)SA_RESTART behaviour of signal is portable. Linux signal man page contradicts the glibc manual about glibc's own behaviour (which is also different from the Linux system call)... I did not find anything in POSIX. I wonder if it is better to use sigaction with explicit omission of SA_RESTART.

I think we unconditionally use sigaction? I agree signal is dicey here, but I'm not sure where it's used.

Ah, there's some code that uses signal if POSIX_SIGNALS isn't defined. I think this might only be reachable on Windows?

Copy link
Copy Markdown
Contributor

@jhjourdan jhjourdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I uncovered a bug in the handling of roots in io.c.

Apart from that, I'm copying the comment I already done in #1128:

It is not clear to me which guarantee this PR is trying to achieve. Indeed, there is still no guarantee that a signal will be handled with reasonable delay: if a signal arrives right before a system call, then no EINTR error code will be returned, and we will need to wait for the end of the system call before the signal gets handled. If we do not want that guarantee, then we can very well avoid checking for signals during I/O. The fix for this problem is the infamous "self-pipe trick". I know this is a larger change and does not handle ll the system calls, but at least it should be considered for this PR.

Apart from that, I am not quite sure I understand why #8691 simplified this PR that much. Which is the OCaml runtime function which checked for signals in #1128 but no longer does so?

Comment on lines +144 to +145
Lock (dbg_in);
Lock (dbg_out);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me what's the purpose of this.
1- Without systhreads, this is a no-op.
2- With systhreads, ocamldebug doesn't work (see #4538), but even if this worked, taking the lock when opening the channel and releasing it when closing the channel clearly does not prevent races.

So, I think locks should not be taken and released in the debugger. If you want, you can add a comment explaining this is not necessary since ocamldebug is not supported with systhreads.

Otherwise, it would be good to comment why we are adding these instructions at these non-usual places.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is to ensure that the channel is always locked when doing I/O. This is needed because the I/O functions now temporarily unlock the channel when handling a signal. Unlocking a channel which is not locked is bad.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but (un)locking without the thread library is no-op. So I would prefer simply checking here that the thread library is not loaded.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the "declaration of intent" to lock even if this is a noop because it makes the code more readable, but I agree that a comment clarifying with @jhjourdan's point is desirable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that this would work, but it seems unnecessarily fragile. The comments in io.h now state that the channel should be locked while doing any I/O, so why not follow that advice here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok for letting the locking instructions here. That said, instead of a bare comment, I would prefer checking whether the systhread library and raise a fatal error if so. It is actually unsafe to use both features at the same time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I am somewhat disturbed about these locking instructions is that they do not seem to protect anything, since any OCaml code can be executed in the critical section. So we are left choosing between two ways of writing fragile code:
1- Code that would break if locking suddenly stops being a no-op in single-threaded mode
2- Code that would break if somehow the debugger code was executed in a re-entrant way.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue is that in all other places, only one channel is locked at a time. This is important for the mechanism that unlocks channels when an exception arises (cf Unlock_exn), which only remembers one channel. This is not a bug here as long as they are no-ops, but nowhere else you are going to see two locks at once, so it can be confusing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gadmm is right: the Unlock_exn mechanism assumes that Lock/Unlock calls are lexically well-bracketted. In particular, an unrelated exception may very well unlock one of the debugger locks, which is clearly unexpected. The only reason this is not a bug is that when the debugger is loaded, systhread cannot be loaded, and hence locks are no-ops.

Clearly, this is fragile, and I would prefer getting rid of these two locks, with a check that systhread is not loaded (and some comments explaining all that).

Memprof callbacks. Assumes that the runtime lock is held. Can raise
exceptions asynchronously into OCaml code. */

CAMLextern int caml_check_pending_actions (void);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but there is a problem of atomicity. Essentially, if caml_check_pending_actions returns false and a signal arrives after the check but before the system call, then we have to wait for the end of the system call to execute the signal handler.

So, essentially, in this PR, we are trying to do some kind of "best effort", praying that no signal will arrive during this short period of time, and I think this function is useful for that.

The usual solution to avoid this problem is to use the "self-pipe trick" instead of a volatile variable, or to use the pselect system call. But none of these is compatible with all blocking calls (e.g., calling open on a named pipe, locking a pthread mutex, ...)....

So, I agree with @gadmm that this function should be avoided in theory to get well-behaved code, but in practice the POSIX API does not give us the choice to not write bogus code.

Comment on lines -262 to +276
return caml_read_fd(fd, 0, p, n);
int r;
do {
r = caml_read_fd(fd, 0, p, n);
} while (r == Io_interrupted);
return r;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of coherence, cold you please use a goto and a call to check_pending here too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not. I don't know the code that calls this and whether it locks.

handled at this point. */
signals_are_pending = 1;
caml_raise_if_exception(caml_process_pending_signals_exn());
//caml_raise_if_exception(caml_process_pending_signals_exn());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

But if would be better to remove the line instead of commenting it out.

runtime/io.c Outdated
{
int n, free, towrite, written;
again:
check_pending(channel);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a problem with this loop and the interaction with the GC. Indeed, the parameter p is called with &Byte(buff, pos), which means that it points to something in the heap which may move when signal handler are executed.

(Note that the loop existed in unix.c before that PR, but this is still a bug, so let's fix it now.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was already fixed by f43daf9 ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.

@xavierleroy
Copy link
Copy Markdown
Contributor

I'm sorry I haven't had time for a proper review, and will not until next week. I just wanted to say I'm glad to see this work, as it looks simpler than #1128 yet should fix several issues.

Concerning the lack of atomicity pointed out by @jhjourdan : the current implementation has similar problems, and many other problems that this PR fixes, so 1- it's still a big step forward, and 2- once this is reviewed and merged, we can think of addressing the atomicity problem with the self-pipe trick.

@stedolan
Copy link
Copy Markdown
Contributor Author

stedolan commented Jul 6, 2020

@gadmm, @jhjourdan This patch makes signal handling have the following properties, which do not currently hold:

  • Signal handlers are run only at polling points. In particular, signal handlers do not interrupt arbitrary C code
  • Signal handlers are never run while holding channel locks
  • Signal handlers can do channel I/O and GC without corrupting state

There is an additional nice property, which as @xavierleroy points out did not hold before this patch and does not hold afterwards:

  • The runtime should never wait an unbounded time before handling a signal

The reason this does not hold is the same before and after this patch: a race between the last check for signals and entering a blocking operation. I have some ideas for a generic fix for this, but they'll be a separate PR.

(In other words: the current runtime has both safety and liveness issues. This patch fixes only the former)


@jhjourdan

Apart from that, I am not quite sure I understand why #8691 simplified this PR that much. Which is the OCaml runtime function which checked for signals in #1128 but no longer does so?

There are broadly two coding styles that allow correct handling of asynchronous signals.

  1. Write code that can be interrupted at any point, and try to ensure that it safely cleans up resources / locks / etc. if it is ever interrupted.
  2. Write code that polls for signals at designated points where cleanup is easy, and try to ensure that it always reaches such a point shortly after a signal is delivered.

#1128 attempted style 1, which required extremely careful and fragile coding, and needed fairly extensive changes to the internal interfaces of io.c / unix.c. This patch attempts style 2, which is much simpler.

At the time of writing #1128, I didn't see style 2 as an option, since almost any interaction with the runtime could in principle run signal handlers. Since #8691, controlling when signals may run seems easier. (Having said that, it's likely that style 2 might have worked before then, by being careful about which runtime functions were called)

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 6, 2020

Ah, there's some code that uses signal if POSIX_SIGNALS isn't defined. I think this might only be reachable on Windows?

This is indeed what must have confused me. Thanks for checking.

Copy link
Copy Markdown
Contributor

@jhjourdan jhjourdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this new version I OK for merging.

My only minor concern is about the locking in debugger.c.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 7, 2020

My thoughts about the possible guarantees, and the replies to previous points, all in one message (the only one).

In this PR there are two changes that I think are better to discuss separately:

  1. No longer interrupt blocking sections by running OCaml signal handlers from C signal handlers.
  2. Unlock channels before checking for pending signals, and publish the API to allow users to write this kind of code.

My concerns are mainly about the API part of 2.

1. No longer interrupt blocking sections by running OCaml code inside signal handlers

The question is what should it be replaced with (either now if deemed necessary, or later). The titles of the PRs let us think (or me at least) that this would involve handling EINTR inside more blocking sections than before, and this is what I have been thinking about. The current PR does not change where EINTR is handled.

Taking a step back (these notes are from when I was reviewing the previous PR), there are several things achieved (in spirit at least) by the current broken implementation:

The user does not need to handle EINTR

This has been requested at #7508:

System call wrappers provided in the standard library should be retried automatically when they fail with EINTR, to relieve application code from the burden of doing so.

referring to by a similar python proposal (https://www.python.org/dev/peps/pep-0475/). This sounds reasonable, independently of the next points. Currently, EINTR is only handled for read and write in unix.c.

Guarantee that signals are handled promptly enough

The race between the signal and the start of the system call has been discussed at length. A couple of points though:

  • the race is special, because if the signal can be re-sent, then the race is no longer an issue. This is true for sigint where the user can press Ctrl-C again, less obvious for signals like sigterm (there might be workarounds though by waiting for the signal on a separate thread). I believe one could live with the race if necessary. (This is a "do not fight POSIX" approach.)
  • the self-pipe trick is interesting because in addition to avoiding the race, it becomes possible to interrupt the blocking sections without having to rely on signals (which is cumbersome in the presence of threads).

Blocking sections in general can be interrupted, not merely system calls able to return EINTR

This point has not been raised before, but I wanted to raise it because the question of whether OCaml will support cleanly interrupting blocking calls in general is going to influence how the masking API (#8961) is designed (details better discussed elsewhere). As a reference point, blocking calls in GHC (relevant here in the context of their experiment with masking) are interruptible in general, for instance takeMVar and putMVar are interruptible (whereas Mutex.lock and Condition.wait are not interruptible in OCaml). (For POSIX threads it seems that one could use semaphores, for instance, with sem_wait returning EINTR.)

I do not believe it is necessary to do any of this as part of this PR (because this did not work reliably up to now, and has always been disabled under systhreads), but I wonder what was the original purpose of the non-systhreads behaviour (e.g. make the toplevel more responsive to sigint?), and whether this could inform what replacement is desirable in the short term (if any).

2. Unlock channels before checking for pending signals, and publish the API to allow users to write this kind of code

The way I understand it now, the API changes mix two different goals:

  1. one the one hand, make public what is needed to implement unlocking as in this PR,
  2. on the other hand, trying to deal with the consistency of state and the release of resources should an async exception arise.

Unlocking

This PR allows using the same channel in a signal handler and in the main code (typically stdout). Note: this costs the atomicity for the channel operation that previously was guaranteed by the lock. I am not sure what impact this is going to have. This includes being interrupted by a signal (whose handler need not use the channel), only to yield to another thread that wants to use the same channel. (Recursive mutexes would avoid this but as you note this looks at least as complicated to implement.)

It is still delicate to implement this, while I am not convinced that this use-case should be supported, at least in the general case. In general, the programmer could be told to either avoid sharing a mutex between async callbacks and the main code, or to have the critical sections be uninterruptible.

So, as far as the API proposal is concerned, I am not convinced that it is good to offer the functions caml_enter_blocking_section_no_pending and caml_check_pending_actions in the public API (given other reasons I mentioned previously).

In addition, while the proposed implementation of unlocking is good enough, an alternative I have thought about would be to use an errorcheck mutex so as to be able to raise a nice exception in case the signal handler tries to re-use a channel (and to document this fact). Then this would preserve the atomicity.

State/resource safety

Signal handlers are run only at polling points. In particular, signal handlers do not interrupt arbitrary C code

This concerns the change in caml_leave_blocking_section, which currently polls to raise the interrupt as soon as possible. So there is #8997 which already proposes caml_enter/leave_blocking_section_exn variants for being able to clean-up after an exception, and #8961 that gives the same behaviour as yours under the appropriate mask. (There is a nice example usage with the implementation of caml_sys_open in sys.c which is leaky and this is bad, but the proposed caml_leave_blocking_section_exn is meant to fix this.)

Changing the polling behaviour of caml_leave_blocking_section is a radical move that I would not have dared try. I would not know how to evaluate the consequences. I think it is good that signals are checked first thing, as there are not that many polling points. I also believe that the change is unnecessary and overkill: I am missing what is different between the problems you see with polling there, and the usual way ML interrupts invalidate state.

What worries me is if the implication is that it should be the responsibility of the person who polls to ensure the consistency of state, because this is not the correct way to deal with the problem. Instead, it is the responsibility of the person catching the interrupt to make sure that no invalid state escapes. (I have documented at https://gitlab.com/gadmm/memprof-limits/-/blob/master/doc/recovering.md this, which is the only way I know to safely handle interrupts.) To take your example of writing to a pipe, one should not care about losing the result of the syscall, because the pipe itself should disappear before the interrupt is caught, and never be seen again.

(The alternative is to add a new internal function caml_leave_blocking_section_no_pending for the moment.)

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 9, 2020

In short, what disturbs me with the unlocking and the moving of polling to the topmost callers is that this tries to treat asynchronous callbacks as if they were synchronous. Moreover, the API change means that this is going to be considered a good pattern for users to follow. But one has to make the burden of asynchronous callbacks bearable, and I am worried that this might not be a viable strategy at scale. It is much simpler to consider that users are not entitled to share locks between the asynchronous callbacks and the rest of the program.

I have updated #8997, which deals with a few leaks in the runtime by explicitly returning the exception in caml_enter/leave_blocking_section and applying standard resource-management patterns. From this I do not see how no longer polling in caml_leave_blocking_section would make things much simpler. Next I will try to apply the recipe in #8997 to experiment with the alternative I proposed for the second part of your PR (where one gets an error when locking recursively).

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 12, 2020

  • Resource-safe C interface, part 2 #8997 now fixes the leaks in io.c. It is worth it to guarantee that the channel state remains valid at all times since it is simple to implement. For comparison with the current PR, see 8971cec1 for caml_read_fd and 6e0152bb for caml_write_fd. They rely on boilerplate which is also used to fix other bugs (67250e56).

    This approach is incompatible with unlocking the channels before running the asynchronous callbacks, in which case a larger change like yours is necessary (it is a different coding style as you describe). The upside of this coding style is that the fixes at Resource-safe C interface, part 2 #8997 are mechanical and change the current behaviour the least, and the API proposal is accordingly simpler.

  • In addition, at Make systhread mutexes errorcheck #9757 I implemented the raising of an exception if a mutex is being locked recursively. (I think it is worthwhile to consider it separately, because it fixes deadlocks with Mutex too.)

I found it interesting to discover these issues with state invalidation with your work around this PR. I believe that with #8997 and #9757, the approach in the second part of this PR is not necessarily needed, and it comes down to whether one wants to preserve the current behaviour of programs under systhread or under no systhread:

@stedolan
Copy link
Copy Markdown
Contributor Author

@gadmm I'll try to respond to as many of your points as I can here:

1. No longer interrupt blocking sections by running OCaml code inside signal handlers

The user does not need to handle EINTR
this would involve handling EINTR inside more blocking sections than before

This is a misconception. Currently, blocking sections return EINTR:

  • when interrupted by a signal, if systhreads is not linked
  • when interrupted by a signal whose handler does not raise (whether or not systhreads is linked)

So, with this patch the set of EINTR-returning places does not change. The difference is that EINTR is used to signal interrupts even for signal handlers which raise, rather than using the inherently-unsafe approach of running OCaml code in a Unix signal handler.

Guarantee that signals are handled promptly enough

I agree with your points here. I'd like to discuss them in a separate PR.

Blocking sections in general can be interrupted, not merely system calls able to return EINTR
As a reference point, blocking calls in GHC (relevant here in the context of their experiment with masking) are interruptible in general

GHC implements interruptible blocking C calls in exactly the way proposed by this patch - to interrupt, it causes a system call to return EINTR. See the GHC manual, section 10.2.5. There is no other way to make arbitrary blocking sections interruptible in general.

2. Unlock channels before checking for pending signals, and publish the API to allow users to write this kind of code

Unlocking
Note: this costs the atomicity for the channel operation that previously was guaranteed by the lock

In what operation is atomicity lost? For most of these operations, partial results are returned without checking for signals, so I'm not sure there can be I/O interleaved with signal handling.

Besides, I am not convinced there was any such guarantee! The implementations used a lock internally to avoid corruption, but no guarantee was made anywhere. Many operations on channels (e.g. print_endline) in fact involve multiple lock-unlock cycles, and can interleave output when used from multiple threads.

an alternative I have thought about would be to use an errorcheck mutex so as to be able to raise a nice exception in case the signal handler tries to re-use a channel (and to document this fact).

I think this is a bad idea. Printing a message to stdout or stderr is a very common use of signal handlers, and under your proposal this operation could nondeterministically raise a mysterious exception about mutexes.

Your approach in #8997 + #9757 still seems to run arbitrary user code while holding a channel lock. What happens if a signal handler prints to stderr, while the main program is also doing so? What happens if a signal handler closes stderr? Have you tested your code with my example program from #1128?

@xavierleroy
Copy link
Copy Markdown
Contributor

While reviewing, I took the liberty of pushing a small Win32 fix directly to the branch (commit f4bf623). Since Win32 has no notion of signal that can interrupt a system call, the change is mostly cosmetic: replace caml_enter_blocking_section with caml_enter_blocking_section_no_pending, for consistency with what was done in runtime/unix.c.

Copy link
Copy Markdown
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now pretty confident about this code. Even if it doesn't fix all the atomicity issues, it's a huge improvement in safety.

Two minor suggestions below, but nothing that should block merging this PR.

Changes Outdated
which has never existed.
(Jacques-Henri Jourdan, review by Xavier Leroy)

- #1128, #9722: EINTR-based signals
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a more descriptive entry, for example:

#1128, #7503, #9036, #9722: revised handling of signals.
Avoid running OCaml signal handlers in the middle of blocked system calls.
Instead, make sure system calls interrupted by a signal return quickly
to a program point where the signal handler can safely run.
Also revise the locking of I/O buffers to avoid deadlocks.

Copy link
Copy Markdown
Contributor

@gadmm gadmm Jul 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plus a description of changes to the public API and polling behaviour.

handled at this point. */
signals_are_pending = 1;
caml_raise_if_exception(caml_process_pending_signals_exn());
//caml_raise_if_exception(caml_process_pending_signals_exn());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jhjourdan. Please remove the commented-out code. Optionally, you could add a comment that explains "We used to process pending signals at this point but it is not a good idea because...".

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 14, 2020

@stedolan, for 1. I was unclear, the first point was not meant to be about a property we might lose but one that was requested elsewhere. I think we are in agreement overall.

For 2., I had in mind operations like caml_input_value, caml_output_value and caml_input_scan_line which are atomic currently (and thus Stdlib.input_value and Stdlib.output_value). Your point is that for instance Stdlib.input_line is already buggy if somebody else accesses the channel concurrently, so it does not matter if caml_input_scan_line loses atomicity. I agree, and I was not making this behaviour change critical (because there is good chance that the code this would affect is already buggy).

Printing a message to stdout or stderr is a very common use of signal handlers, and under your proposal this operation could nondeterministically raise a mysterious exception about mutexes.

Currently you could have mysterious deadlocks (with systhreads) or crashes (without it) instead. My point of view is that these concurrent accesses do bad things in general and would generally be considered as bugs, so it is better to fail early with an exception. You are interested in a use-case where the bad thing is acceptable. The approach at #9757 is biased towards fast bug reporting, but there are workarounds if the programmer insists, such as (AFAIU) creating a separate channel from a dup of stdout/stderr.

But, I agree that the additional guarantee of not running callbacks while holding a lock can be sensible in this case (and the specification "use a lock internally to avoid corruption, but make no guarantee of atomicity"). My point of view is mainly from thinking about what is a good API for the programmer, and I am not convinced by the public API and behaviour changes. In particular, #8997 shows that polling inside caml_leave_blocking_section is not an obstacle to ensuring the consistency of state in general, and your issue with it is rather specific to what you are trying to achieve and to me does not warrant such a large change in polling behaviour.

Copy link
Copy Markdown
Contributor

@gadmm gadmm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion did not convince me about the public API and behaviour change regarding polling. To sum up my points:

Not polling inside caml_leave_blocking_section was based on a very broad argument about state corruption, but this is nothing specific to system calls but is a general concern when dealing with interrupts. #8997 shows that there is no obstacle to ensuring the consistency of channel state in this case. Instead, the reason you want to not poll here is to be able to unlock the mutex before running signal handlers, which is fairly specific to what you are trying to achieve, relying on a non-compositional coding style where the programmer must control every places where polling could happen.

First, an alternative, compositional way to control where polling happens is being proposed at #8961, so I do not see the opportunity to encourage this style by making _no_pending functions public at this point.

Second, you could introduce a specific function caml_leave_blocking_section_no_pending, but instead you change the behaviour for everybody who ever acquires the runtime lock. That signals are checked as soon as you acquire the runtime lock is a nice and natural guarantee, and I am not convinced that the change will make things much simpler. For instance, the number of functions that one has to transform into an _exn variant for explicit resource clean-up in #8997 does not change since they will already contain caml_enter_blocking_section, i.e. this polling location comes for free in terms of non-raising guarantees of functions.

@xavierleroy
Copy link
Copy Markdown
Contributor

you could introduce a specific function caml_leave_blocking_section_no_pending, but instead you change the behaviour for everybody who ever acquires the runtime lock.

I can sympathize with this argument. Perhaps a new caml_leave_blocking_section_no_pending function could be introduced by symmetry with caml_enter_blocking_section_no_pending and to minimize changes?

I would still like this PR to be merged well in time for release 4.12. #8961 and #8997 look like longer-term projects to me, of the kind that will not be reviewed in time for 4.12.

@jhjourdan
Copy link
Copy Markdown
Contributor

you could introduce a specific function caml_leave_blocking_section_no_pending, but instead you change the behaviour for everybody who ever acquires the runtime lock.

I can sympathize with this argument. Perhaps a new caml_leave_blocking_section_no_pending function could be introduced by symmetry with caml_enter_blocking_section_no_pending and to minimize changes?

The thing is, it's not like this is a change that is likely to break programs. This will simply delay some signals a little more (until the next polling point). But since polling point (should be) are inserted regularly, the introduced delay should be too long.

@stedolan
Copy link
Copy Markdown
Contributor Author

In particular, #8997 shows that polling inside caml_leave_blocking_section is not an obstacle to ensuring the consistency of state in general

#8997 introduces caml_enter_blocking_section_noexn and caml_leave_blocking_section_noexn (with the semantics of caml_enter_blocking_section_no_pending and caml_leave_blocking_section here). From what I can tell, they are introduced because polling inside caml_leave_blocking_section is an obstacle to ensuring the consistency of state in general.

In the cases where #8997 uses the polling variant of leave_blocking_section, it risks discarding an exception from a failed operation. (Perhaps you don't care about this case, but this case is what my argument above was about).

I have yet to see a program whose behaviour is improved by polling in leave_blocking_section, but I've seen several where the polling results in a bug. The usual argument against removing a polling point from a function is that it might cause unbounded delay if a program repeatedly calls it, but that can't happen here: any code that calls caml_leave_blocking_section repeatedly must also call caml_enter_blocking_section repeatedly, and that's a better place to poll.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 15, 2020

In particular, #8997 shows that polling inside caml_leave_blocking_section is not an obstacle to ensuring the consistency of state in general

#8997 introduces caml_enter_blocking_section_noexn and caml_leave_blocking_section_noexn (with the semantics of caml_enter_blocking_section_no_pending and caml_leave_blocking_section here).

Yes, we agree that sometimes we need this behaviour! This is the point of #8961 (not yet ready to land).

From what I can tell, they are introduced because polling inside caml_leave_blocking_section is an obstacle to ensuring the consistency of state in general.

Not really, this is about avoiding leaks rather than invalidating the state of values that are still in use, and it is exclusively used in the release path (I did not remove polling locations). The transformation used at #8997 is systematic and explains why you need them in the release path. If you used the raising version, then it tells you to use the _exn variant, test for an exception, and clean-up, that is... proceed with what you were doing in the release path (re-release the lock, re-test for an exception, etc.). From this, one quickly concludes that signals must be masked during release, as a general principle. Also, while I applied a systematic transformation that preserved all polling locations, I agree that there are places where I could have simplified the code if I removed polling at some places (both when acquiring and releasing the lock). My objection was rather that with masking and the _exn variants, the user eventually will have enough tools, and it is better to keep the public API as simple as possible. If they have to be superseded, then it is easier if they remain internal.

In the cases where #8997 uses the polling variant of leave_blocking_section, it risks discarding an exception from a failed operation. (Perhaps you don't care about this case, but this case is what my argument above was about).

True, I considered the error from the failed operation as discardable. Your point is quite subtle and interesting. If we apply the general rule not to drop exceptions, I agree that one cannot poll when acquiring the lock. But this generally applies to exceptions that we do not know where they come from. Here we know what the exception is, and we could consider that it has not been raised yet (one could permute the polling and the conversion the error code into the exception). At least this is my reasoning here.

I have yet to see a program whose behaviour is improved by polling in leave_blocking_section, but I've seen several where the polling results in a bug. The usual argument against removing a polling point from a function is that it might cause unbounded delay if a program repeatedly calls it, but that can't happen here: any code that calls caml_leave_blocking_section repeatedly must also call caml_enter_blocking_section repeatedly, and that's a better place to poll.

I agree that this "usual argument" does not apply and I like the argument. Rather, I am very cautious about this sort of radical change. My approach is to not assume that I can know what code exists in the wild; here maybe some code is already poor in polling locations. And as Jacques-Henri recalls, polling should be inserted regularly. Right when acquiring the lock has the most chance to be helpful.

@xavierleroy
Copy link
Copy Markdown
Contributor

Could we please make progress towards merging this PR? The points of disagreement have been presented in details, and look small to me, so it should be possible to move forward.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 17, 2020

I think the issue with caml_leave_blocking_section is no longer blocking. While the arguments given initially seem incorrect to me, I agree with @stedolan's new explanation about the change being unable to cause unbounded delays. I am still uncomfortable with such a radical change, but I leave it to @stedolan's appreciation.

(I think the change deserved such a discussion. Some of the disagreement can probably be explained by the fact that we have different ideas of how frequently the runtime should poll, which has never been specified.)

Whether we should be wary of adding too many and redundant functions to the public API is also a matter of taste to some degree, so I leave it to @stedolan's appreciation.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 17, 2020

Since Win32 has no notion of signal that can interrupt a system call, the change is mostly cosmetic

It is worth noting that according to @stedolan's link, it is possible to interrupt the system call:

[Vista and later only] The RTS calls the Win32 function CancelSynchronousIo, which will cause a blocking I/O operation to return with the error ERROR_OPERATION_ABORTED.

@xavierleroy
Copy link
Copy Markdown
Contributor

It is worth noting that according to @stedolan's link, it is possible to interrupt the system call: [Vista and later only] The RTS calls the Win32 function CancelSynchronousIo, which will cause a blocking I/O operation to return with the error ERROR_OPERATION_ABORTED.

Good to know! I see there's also WSACancelBlockingCall for socket I/O. I'll think about how to take this into account in runtime/win32.c. It shouldn't be hard to return the magic "Io_interrupted" result in case of cancellation. What is less clear to me is whether and how cancellation should be mapped to a signal so that it can be handled using the signal handler machinery.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 17, 2020

Here is a consequence of not polling in caml_leave_blocking_section (AFAIU).

Currently, if a system call gives EINTR (other than caml_fd_read and caml_fd_write from this PR), then caml_leave_blocking_section will perform the check for signals. After this PR, it might raise Sys_error or Unix.Unix_error instead. Examples include caml_ml_close_channel and caml_sys_open from the runtime, wrappers from the Unix module, and probably 3rd-party code in the wild.

Now, user code might typically perform some clean-up before handling the exception. In the current incorrect way to do clean-up available to the OCaml programmer (e.g. Fun.protect or manual try/re-raise), a signal handler being executed inside the clean-up code and raising is likely to cause a leak. Currently, this is a race condition that happens rarely (or so we hope). By raising Sys_error or Unix.Unix_error while a signal is pending, these leaks no longer result from a race condition but from the way the system call wrappers are implemented.

Furthermore, if it was possible to run clean-up code correctly, that is with pending signals masked, then the execution of the signal handler will instead be delayed until after all the clean-up is done. One consequence is that if Sys_error or Unix.Unix_error was not handled specifically, then the task will see an uncaught Sys_error or Unix.Unix_error exception. Yet we have seen code that treats the exception arising from the signal handler specially (for instance taking the decision to respawn the task instead of aborting, and other examples at #8873 (comment)). In this case, the execution of the signal handler might happen too late. (This is also true, in current OCaml, if the code does no clean-up or if clean-up does not poll.)

Thus, even if we believe that the program will poll soon enough, it might not be where we want it. A solution is that the users should first check for EINTR and poll, but polling in caml_leave_blocking_section has been doing that for them.

Do you agree with this analysis?

stedolan added 3 commits July 27, 2020 17:29
The debugger's use of channels doesn't support locking, but it
doesn't work on threaded programs anyway.
@stedolan
Copy link
Copy Markdown
Contributor Author

stedolan commented Jul 27, 2020

I've rebased the history into a reasonable series of commits, and added tests for various cases of signal handling, GC and I/O. These tests uncovered a couple more bugs (e.g. out_channel_length didn't lock the channel) so there are a few extra changes to io.c.

I'd appreciate someone having a quick read of the io.c changes, but otherwise I consider this ready to merge (let's see if CI agrees...)

Changes Outdated
Comment on lines +97 to +99
- #1128, #9722: EINTR-based signal handling: ensure that signals are handled
only at polling points (which no longer includes leave_blocking_section),
and that I/O locks are not held during signal handling.
Copy link
Copy Markdown
Contributor

@gadmm gadmm Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- #1128, #9722: EINTR-based signal handling: ensure that signals are handled
only at polling points (which no longer includes leave_blocking_section),
and that I/O locks are not held during signal handling.
* #1128, #9722: EINTR-based signal handling: ensure that signals are handled
only at polling points, and that I/O locks are not held during signal
handling. The function `caml_leave_blocking_section` no longer checks for
pending signals, this is replaced by polling inside `caml_raise`. This is a
change we believe will fix bugs in user C code.

edit: forgot to mention caml_raise.

Comment on lines +37 to +41
if (caml_check_pending_actions()) {
CAMLparam1 (v);
caml_process_pending_actions();
CAMLdrop;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (caml_check_pending_actions()) {
CAMLparam1 (v);
caml_process_pending_actions();
CAMLdrop;
}
v = caml_process_pending_actions_with_root(v);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed, thanks.

(I added an assertion that !Is_exception_result(v), because failing to call Extract_exception before raise is an easy mistake to make, and the implementation of caml_process_pending_actions_with_root will do something strange in that case)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

Comment on lines +65 to +68

if (caml_check_pending_actions()) {
CAMLparam1 (v);
caml_process_pending_actions();
CAMLdrop;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (caml_check_pending_actions()) {
CAMLparam1 (v);
caml_process_pending_actions();
CAMLdrop;
}
v = caml_process_pending_actions_with_root(v);

Comment on lines +192 to +195
if (check_for_pending_signals()) {
signals_are_pending = 1;
caml_set_action_pending();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. At #8961 I was simply calling caml_set_action_pending after removing signals_are_pending.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if we could call caml_set_action_pending(); even if check_for_pending_signals() returns 0.

First, it does not cost much : the only cost is that we will examine all the potential callbacks at the next polling point. This is not much more expensive than calling check_for_pending_signals(). Second, with #9674, there will be another source of thread-local callback, which will need updating caml_something_to_do at every context switch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep it like this. The cost of set_action_pending isn't zero: the next allocation will fail and spill/restore all registers.

In many programs, entering and leaving a blocking section is massively more frequent than context-switching between threads. If you want to implement per-thread callbacks, I think it would make more sense to do them in caml_thread_{save,restore}_runtime_state in the systhreads library, which is called from the blocking_section_hooks. (Also, doing it there gives you the opportunity to optimise the common case where no context switch has occurred, and the thread taking the lock is the last one to release it)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, go for it.

retcode = caml_sigmask_hook(how, &set, &oldset);
caml_leave_blocking_section();
/* Run any handlers for just-unmasked pending signals */
caml_process_pending_actions();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, this is another example of behaviour change from not polling in caml_leave_blocking_section, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see the commit message.

This is related to #9802: POSIX stipulates polling points for signals in exactly two functions: kill and sigprocmask (or its pthread_sigmask alias). sigprocmask had one via caml_leave_blocking_section, but kill did not. #9802 adds a test that ensures both are present.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I saw it.

stedolan added 2 commits July 28, 2020 10:54
To preserve behaviour, explicit polls are added:

   - in caml_raise, to raise the right exception when as system
     call is interrupted by a signal.

   - in sigprocmask, to ensure that signals are handled as soon
     as they are unmasked.
Copy link
Copy Markdown
Contributor

@jhjourdan jhjourdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current version looks good to me, except a small comment in signal.c.

Comment on lines +192 to +195
if (check_for_pending_signals()) {
signals_are_pending = 1;
caml_set_action_pending();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if we could call caml_set_action_pending(); even if check_for_pending_signals() returns 0.

First, it does not cost much : the only cost is that we will examine all the potential callbacks at the next polling point. This is not much more expensive than calling check_for_pending_signals(). Second, with #9674, there will be another source of thread-local callback, which will need updating caml_something_to_do at every context switch.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 28, 2020

I think the Changes file deserves some clarification, in the vein of how I sketched it in my comment above.

@stedolan
Copy link
Copy Markdown
Contributor Author

I've updated the Changes entry, by editing together my, @xavierleroy and @gadmm's proposals.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 28, 2020

I've updated the Changes entry, by editing together my, @xavierleroy and @gadmm's proposals.

Looks good, thanks!

@xavierleroy
Copy link
Copy Markdown
Contributor

I had a quick look at runtime/io.c again, and it looked good. At any rate, this PR has been reviewed in exquisite details and polished with much care (thank you, author and reviewers!), so it's high time to merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stdlib.flush hangs when used within signal handler Bad locking in io.c

6 participants