Conversation
| } else { | ||
| nread = caml_read_fd(channel->fd, channel->flags, channel->buff, | ||
| channel->end - channel->buff); | ||
| if (nread == Io_interrupted) goto again; |
There was a problem hiding this comment.
Is it intentional that there is no corresponding check for nread == 0 here to check for an EOF condition, as there is with caml_read_fd a few hunk above?
There was a problem hiding this comment.
That is odd! I'm trying to preserve existing behaviour here, and it looks like caml_refill raises end_of_file on a zero read but caml_getblock does not. The only user of getblock in the stdlib (md5.c) has an explicit == 0 check after the call.
There was a problem hiding this comment.
Modifying the blocking read and write calls in io.c to insert a polling point (and hence handle signals) before retrying a system call on receipt of an EINTR.
The way I understand the previous code, it already polled before retrying the system call, inside caml_enter_blocking_section and caml_leave_blocking_section. The new version ensures that channels are unlocked when signals are called, though. Am I right?
The PR contains a proposal for new public API functions, and a behaviour change for caml_leave_blocking_section, about which I am a bit sceptical (comments below).
If a signal is handled immediately during a blocking section, then any progress made by that blocking section is discarded if the handler raises. Specifically, a signal may occur after a system call has successfully returned but before the blocking section has ended, and its return value was lost. There is in general no way to recover from this in the case of say, writing to a pipe.
I do not see exactly what situations you are referring to, but could these situations also be dealt with explicit clean-up after call to a caml_{enter,leave}_blocking_section_exn function as the ones proposed at #8997?
Now, the clean-up approach cannot deal with the problem of unlocking channels, but I think that recursive mutexes might be a good solution to locking twice in this situation, which would simplify the code.
Another question: it is not clear to me that the (non-)SA_RESTART behaviour of signal is portable. Linux signal man page contradicts the glibc manual about glibc's own behaviour (which is also different from the Linux system call)... I did not find anything in POSIX. I wonder if it is better to use sigaction with explicit omission of SA_RESTART.
Like Jacques-Henri, I had further questions about what guarantees the PR should try to achieve. I thought a bit about it while reviewing the other patch; I will write about it later if I find the time.
runtime/caml/io.h
Outdated
|
|
||
| /* Functions and macros that can be called from C. Take arguments of | ||
| type struct channel *. No locking is performed. */ | ||
| type struct channel *. The channel must be locked before calling these. */ |
There was a problem hiding this comment.
To clarify the documentation, the change applies to all the 14 functions and macros taking a channel as argument that follow?
There was a problem hiding this comment.
Good point. I've rearranged this file to make it clearer when the channel must be locked.
| #endif | ||
|
|
||
| CAMLextern void caml_enter_blocking_section (void); | ||
| CAMLextern void caml_enter_blocking_section_no_pending (void); |
There was a problem hiding this comment.
Regarding its inclusion in the public (non-internal) part of the interface: at #8997 there is a proposal to add a variant caml_enter_blocking_section_exn as part of making raising explicit in the C api, and #8961 implements masking which disables the processing of signals inside caml_enter_blocking_section in a different manner. I am slightly worried about this interface becoming complicated with many possible variants, so I wonder what minimal part of the interface should be offered to the programmer. (The function could also be proposed in the internals part of the interface.)
There was a problem hiding this comment.
Agreed, having too many variants of these functions is definitely bad. I'm not sure what the correct set is.
There was a problem hiding this comment.
The behaviour for enter/leave_section you are implementing is precisely the one at #8961 when the "uninterruptible" mask is set. Would an extra call to caml_mask/caml_unmask (from 3a72aab9f) be worth the greater simplicity of the public API? If so, a solution would be to keep caml_enter_blocking_section_no_pending for now, but only in the internal part of the API, and then later I can make the change.
| Memprof callbacks. Assumes that the runtime lock is held. Can raise | ||
| exceptions asynchronously into OCaml code. */ | ||
|
|
||
| CAMLextern int caml_check_pending_actions (void); |
There was a problem hiding this comment.
Regarding its inclusion in the public (non-internal) part of the interface (as well): such a check was explicitly left out of the public interface. If there is a good reason to let the programmer do the check, this can change of course, but my reasoning was that it was not clear that it is useful for the programmer, whereas keeping the race between checking and executing signals under control inside the process_pending_* functions makes the signals api and delivery logic easier to evolve should it become necessary. (I remember ocamlcc used to execute signals in a loop up until caml_something_to_do was false; while this probably worked, I do not think this is the intended semantics of polling.) #8961 proposes a similar function (taking masking into account), but in the internals part of the interface.
There was a problem hiding this comment.
Have you looked at the use of this function in this patch? It's difficult to replicate without it.
There was a problem hiding this comment.
Yes, but there is a problem of atomicity. Essentially, if caml_check_pending_actions returns false and a signal arrives after the check but before the system call, then we have to wait for the end of the system call to execute the signal handler.
So, essentially, in this PR, we are trying to do some kind of "best effort", praying that no signal will arrive during this short period of time, and I think this function is useful for that.
The usual solution to avoid this problem is to use the "self-pipe trick" instead of a volatile variable, or to use the pselect system call. But none of these is compatible with all blocking calls (e.g., calling open on a named pipe, locking a pthread mutex, ...)....
So, I agree with @gadmm that this function should be avoided in theory to get well-behaved code, but in practice the POSIX API does not give us the choice to not write bogus code.
There was a problem hiding this comment.
Apologies if this is a really small detail and if I am excessively nit-picking with naming. If you insist in making caml_check_pending_actions public, can we please rename it into something like caml_has_action_to_do like at #8961, in order to make the distinction with simply having an action pending? The distinction becomes important after #8961 (we must compare the value of the actual action pending with the value of the mask), but if you make the function public now, then I am stuck with the slightly misleading name.
There was a problem hiding this comment.
On a second look, your naming is nicely consistent with the other functions *_pending_actions. Sorry for the excessive bikeshedding, please do as you see fit.
| CAMLexport int caml_flush_partial(struct channel *channel) | ||
| { | ||
| int towrite, written; | ||
| again: |
There was a problem hiding this comment.
Merely a matter of style, but couldn't all the gotos be replaced with simple loops? Would that be better?
There was a problem hiding this comment.
You could certainly do this with a state variable and a loop. I copied the coding style here from the EINTR handling in runtime/unix.c, but I'm not especially attached to it. It's not obvious to me which would be clearer.
runtime/signals.c
Outdated
| handled at this point. */ | ||
| signals_are_pending = 1; | ||
| caml_raise_if_exception(caml_process_pending_signals_exn()); | ||
| //caml_raise_if_exception(caml_process_pending_signals_exn()); |
There was a problem hiding this comment.
What is the reason for changing the polling behaviour of caml_leave_blocking_section, instead of e.g. introducing a new function caml_leave_blocking_section_no_pending?
There was a problem hiding this comment.
I think it is usually a bug to check for pending signals in caml_leave_blocking_section? You generally have a result (success or error) of whatever was in the blocking section, and discarding it because a signal handler raises is probably not what you want.
There was a problem hiding this comment.
+1
But if would be better to remove the line instead of commenting it out.
There was a problem hiding this comment.
I agree with @jhjourdan. Please remove the commented-out code. Optionally, you could add a comment that explains "We used to process pending signals at this point but it is not a good idea because...".
runtime/unix.c
Outdated
| #endif | ||
| if (retcode == -1) { | ||
| if (errno == EINTR) goto again; | ||
| if (errno == EINTR) return -1; |
There was a problem hiding this comment.
| if (errno == EINTR) return -1; | |
| if (errno == EINTR) return Io_interrupted; |
Yes, that's correct. I'll update the PR description to make this clearer.
Using recursive mutexes here would introduce subtle bugs. The arguments to e.g. It is much simpler to prevent signal handlers running between the accesses to
I think we unconditionally use sigaction? I agree |
Ah, there's some code that uses |
jhjourdan
left a comment
There was a problem hiding this comment.
I think I uncovered a bug in the handling of roots in io.c.
Apart from that, I'm copying the comment I already done in #1128:
It is not clear to me which guarantee this PR is trying to achieve. Indeed, there is still no guarantee that a signal will be handled with reasonable delay: if a signal arrives right before a system call, then no EINTR error code will be returned, and we will need to wait for the end of the system call before the signal gets handled. If we do not want that guarantee, then we can very well avoid checking for signals during I/O. The fix for this problem is the infamous "self-pipe trick". I know this is a larger change and does not handle ll the system calls, but at least it should be considered for this PR.
Apart from that, I am not quite sure I understand why #8691 simplified this PR that much. Which is the OCaml runtime function which checked for signals in #1128 but no longer does so?
runtime/debugger.c
Outdated
| Lock (dbg_in); | ||
| Lock (dbg_out); |
There was a problem hiding this comment.
It's not clear to me what's the purpose of this.
1- Without systhreads, this is a no-op.
2- With systhreads, ocamldebug doesn't work (see #4538), but even if this worked, taking the lock when opening the channel and releasing it when closing the channel clearly does not prevent races.
So, I think locks should not be taken and released in the debugger. If you want, you can add a comment explaining this is not necessary since ocamldebug is not supported with systhreads.
Otherwise, it would be good to comment why we are adding these instructions at these non-usual places.
There was a problem hiding this comment.
The point is to ensure that the channel is always locked when doing I/O. This is needed because the I/O functions now temporarily unlock the channel when handling a signal. Unlocking a channel which is not locked is bad.
There was a problem hiding this comment.
Ok, but (un)locking without the thread library is no-op. So I would prefer simply checking here that the thread library is not loaded.
There was a problem hiding this comment.
I like the "declaration of intent" to lock even if this is a noop because it makes the code more readable, but I agree that a comment clarifying with @jhjourdan's point is desirable.
There was a problem hiding this comment.
You're right that this would work, but it seems unnecessarily fragile. The comments in io.h now state that the channel should be locked while doing any I/O, so why not follow that advice here?
There was a problem hiding this comment.
Ok for letting the locking instructions here. That said, instead of a bare comment, I would prefer checking whether the systhread library and raise a fatal error if so. It is actually unsafe to use both features at the same time.
There was a problem hiding this comment.
The reason I am somewhat disturbed about these locking instructions is that they do not seem to protect anything, since any OCaml code can be executed in the critical section. So we are left choosing between two ways of writing fragile code:
1- Code that would break if locking suddenly stops being a no-op in single-threaded mode
2- Code that would break if somehow the debugger code was executed in a re-entrant way.
There was a problem hiding this comment.
Another issue is that in all other places, only one channel is locked at a time. This is important for the mechanism that unlocks channels when an exception arises (cf Unlock_exn), which only remembers one channel. This is not a bug here as long as they are no-ops, but nowhere else you are going to see two locks at once, so it can be confusing.
There was a problem hiding this comment.
@gadmm is right: the Unlock_exn mechanism assumes that Lock/Unlock calls are lexically well-bracketted. In particular, an unrelated exception may very well unlock one of the debugger locks, which is clearly unexpected. The only reason this is not a bug is that when the debugger is loaded, systhread cannot be loaded, and hence locks are no-ops.
Clearly, this is fragile, and I would prefer getting rid of these two locks, with a check that systhread is not loaded (and some comments explaining all that).
| Memprof callbacks. Assumes that the runtime lock is held. Can raise | ||
| exceptions asynchronously into OCaml code. */ | ||
|
|
||
| CAMLextern int caml_check_pending_actions (void); |
There was a problem hiding this comment.
Yes, but there is a problem of atomicity. Essentially, if caml_check_pending_actions returns false and a signal arrives after the check but before the system call, then we have to wait for the end of the system call to execute the signal handler.
So, essentially, in this PR, we are trying to do some kind of "best effort", praying that no signal will arrive during this short period of time, and I think this function is useful for that.
The usual solution to avoid this problem is to use the "self-pipe trick" instead of a volatile variable, or to use the pselect system call. But none of these is compatible with all blocking calls (e.g., calling open on a named pipe, locking a pthread mutex, ...)....
So, I agree with @gadmm that this function should be avoided in theory to get well-behaved code, but in practice the POSIX API does not give us the choice to not write bogus code.
| return caml_read_fd(fd, 0, p, n); | ||
| int r; | ||
| do { | ||
| r = caml_read_fd(fd, 0, p, n); | ||
| } while (r == Io_interrupted); | ||
| return r; |
There was a problem hiding this comment.
For the sake of coherence, cold you please use a goto and a call to check_pending here too?
There was a problem hiding this comment.
I'd rather not. I don't know the code that calls this and whether it locks.
runtime/signals.c
Outdated
| handled at this point. */ | ||
| signals_are_pending = 1; | ||
| caml_raise_if_exception(caml_process_pending_signals_exn()); | ||
| //caml_raise_if_exception(caml_process_pending_signals_exn()); |
There was a problem hiding this comment.
+1
But if would be better to remove the line instead of commenting it out.
runtime/io.c
Outdated
| { | ||
| int n, free, towrite, written; | ||
| again: | ||
| check_pending(channel); |
There was a problem hiding this comment.
I think there is a problem with this loop and the interaction with the GC. Indeed, the parameter p is called with &Byte(buff, pos), which means that it points to something in the heap which may move when signal handler are executed.
(Note that the loop existed in unix.c before that PR, but this is still a bug, so let's fix it now.)
There was a problem hiding this comment.
I think this was already fixed by f43daf9 ?
|
I'm sorry I haven't had time for a proper review, and will not until next week. I just wanted to say I'm glad to see this work, as it looks simpler than #1128 yet should fix several issues. Concerning the lack of atomicity pointed out by @jhjourdan : the current implementation has similar problems, and many other problems that this PR fixes, so 1- it's still a big step forward, and 2- once this is reviewed and merged, we can think of addressing the atomicity problem with the self-pipe trick. |
|
@gadmm, @jhjourdan This patch makes signal handling have the following properties, which do not currently hold:
There is an additional nice property, which as @xavierleroy points out did not hold before this patch and does not hold afterwards:
The reason this does not hold is the same before and after this patch: a race between the last check for signals and entering a blocking operation. I have some ideas for a generic fix for this, but they'll be a separate PR. (In other words: the current runtime has both safety and liveness issues. This patch fixes only the former)
There are broadly two coding styles that allow correct handling of asynchronous signals.
#1128 attempted style 1, which required extremely careful and fragile coding, and needed fairly extensive changes to the internal interfaces of io.c / unix.c. This patch attempts style 2, which is much simpler. At the time of writing #1128, I didn't see style 2 as an option, since almost any interaction with the runtime could in principle run signal handlers. Since #8691, controlling when signals may run seems easier. (Having said that, it's likely that style 2 might have worked before then, by being careful about which runtime functions were called) |
This is indeed what must have confused me. Thanks for checking. |
jhjourdan
left a comment
There was a problem hiding this comment.
I think this new version I OK for merging.
My only minor concern is about the locking in debugger.c.
|
My thoughts about the possible guarantees, and the replies to previous points, all in one message (the only one). In this PR there are two changes that I think are better to discuss separately:
My concerns are mainly about the API part of 2. 1. No longer interrupt blocking sections by running OCaml code inside signal handlersThe question is what should it be replaced with (either now if deemed necessary, or later). The titles of the PRs let us think (or me at least) that this would involve handling EINTR inside more blocking sections than before, and this is what I have been thinking about. The current PR does not change where EINTR is handled. Taking a step back (these notes are from when I was reviewing the previous PR), there are several things achieved (in spirit at least) by the current broken implementation: The user does not need to handle EINTRThis has been requested at #7508:
referring to by a similar python proposal (https://www.python.org/dev/peps/pep-0475/). This sounds reasonable, independently of the next points. Currently, EINTR is only handled for read and write in Guarantee that signals are handled promptly enoughThe race between the signal and the start of the system call has been discussed at length. A couple of points though:
Blocking sections in general can be interrupted, not merely system calls able to return EINTRThis point has not been raised before, but I wanted to raise it because the question of whether OCaml will support cleanly interrupting blocking calls in general is going to influence how the masking API (#8961) is designed (details better discussed elsewhere). As a reference point, blocking calls in GHC (relevant here in the context of their experiment with masking) are interruptible in general, for instance takeMVar and putMVar are interruptible (whereas Mutex.lock and Condition.wait are not interruptible in OCaml). (For POSIX threads it seems that one could use semaphores, for instance, with I do not believe it is necessary to do any of this as part of this PR (because this did not work reliably up to now, and has always been disabled under systhreads), but I wonder what was the original purpose of the non-systhreads behaviour (e.g. make the toplevel more responsive to sigint?), and whether this could inform what replacement is desirable in the short term (if any). 2. Unlock channels before checking for pending signals, and publish the API to allow users to write this kind of codeThe way I understand it now, the API changes mix two different goals:
UnlockingThis PR allows using the same channel in a signal handler and in the main code (typically stdout). Note: this costs the atomicity for the channel operation that previously was guaranteed by the lock. I am not sure what impact this is going to have. This includes being interrupted by a signal (whose handler need not use the channel), only to yield to another thread that wants to use the same channel. (Recursive mutexes would avoid this but as you note this looks at least as complicated to implement.) It is still delicate to implement this, while I am not convinced that this use-case should be supported, at least in the general case. In general, the programmer could be told to either avoid sharing a mutex between async callbacks and the main code, or to have the critical sections be uninterruptible. So, as far as the API proposal is concerned, I am not convinced that it is good to offer the functions In addition, while the proposed implementation of unlocking is good enough, an alternative I have thought about would be to use an errorcheck mutex so as to be able to raise a nice exception in case the signal handler tries to re-use a channel (and to document this fact). Then this would preserve the atomicity. State/resource safety
This concerns the change in Changing the polling behaviour of What worries me is if the implication is that it should be the responsibility of the person who polls to ensure the consistency of state, because this is not the correct way to deal with the problem. Instead, it is the responsibility of the person catching the interrupt to make sure that no invalid state escapes. (I have documented at https://gitlab.com/gadmm/memprof-limits/-/blob/master/doc/recovering.md this, which is the only way I know to safely handle interrupts.) To take your example of writing to a pipe, one should not care about losing the result of the syscall, because the pipe itself should disappear before the interrupt is caught, and never be seen again. (The alternative is to add a new internal function |
|
In short, what disturbs me with the unlocking and the moving of polling to the topmost callers is that this tries to treat asynchronous callbacks as if they were synchronous. Moreover, the API change means that this is going to be considered a good pattern for users to follow. But one has to make the burden of asynchronous callbacks bearable, and I am worried that this might not be a viable strategy at scale. It is much simpler to consider that users are not entitled to share locks between the asynchronous callbacks and the rest of the program. I have updated #8997, which deals with a few leaks in the runtime by explicitly returning the exception in |
I found it interesting to discover these issues with state invalidation with your work around this PR. I believe that with #8997 and #9757, the approach in the second part of this PR is not necessarily needed, and it comes down to whether one wants to preserve the current behaviour of programs under systhread or under no systhread:
|
|
@gadmm I'll try to respond to as many of your points as I can here: 1. No longer interrupt blocking sections by running OCaml code inside signal handlers
This is a misconception. Currently, blocking sections return EINTR:
So, with this patch the set of EINTR-returning places does not change. The difference is that EINTR is used to signal interrupts even for signal handlers which raise, rather than using the inherently-unsafe approach of running OCaml code in a Unix signal handler.
I agree with your points here. I'd like to discuss them in a separate PR.
GHC implements interruptible blocking C calls in exactly the way proposed by this patch - to interrupt, it causes a system call to return EINTR. See the GHC manual, section 10.2.5. There is no other way to make arbitrary blocking sections interruptible in general. 2. Unlock channels before checking for pending signals, and publish the API to allow users to write this kind of code
In what operation is atomicity lost? For most of these operations, partial results are returned without checking for signals, so I'm not sure there can be I/O interleaved with signal handling. Besides, I am not convinced there was any such guarantee! The implementations used a lock internally to avoid corruption, but no guarantee was made anywhere. Many operations on channels (e.g.
I think this is a bad idea. Printing a message to stdout or stderr is a very common use of signal handlers, and under your proposal this operation could nondeterministically raise a mysterious exception about mutexes. Your approach in #8997 + #9757 still seems to run arbitrary user code while holding a channel lock. What happens if a signal handler prints to stderr, while the main program is also doing so? What happens if a signal handler closes stderr? Have you tested your code with my example program from #1128? |
|
While reviewing, I took the liberty of pushing a small Win32 fix directly to the branch (commit f4bf623). Since Win32 has no notion of signal that can interrupt a system call, the change is mostly cosmetic: replace |
xavierleroy
left a comment
There was a problem hiding this comment.
I'm now pretty confident about this code. Even if it doesn't fix all the atomicity issues, it's a huge improvement in safety.
Two minor suggestions below, but nothing that should block merging this PR.
Changes
Outdated
| which has never existed. | ||
| (Jacques-Henri Jourdan, review by Xavier Leroy) | ||
|
|
||
| - #1128, #9722: EINTR-based signals |
There was a problem hiding this comment.
I would prefer a more descriptive entry, for example:
#1128, #7503, #9036, #9722: revised handling of signals.
Avoid running OCaml signal handlers in the middle of blocked system calls.
Instead, make sure system calls interrupted by a signal return quickly
to a program point where the signal handler can safely run.
Also revise the locking of I/O buffers to avoid deadlocks.
There was a problem hiding this comment.
plus a description of changes to the public API and polling behaviour.
runtime/signals.c
Outdated
| handled at this point. */ | ||
| signals_are_pending = 1; | ||
| caml_raise_if_exception(caml_process_pending_signals_exn()); | ||
| //caml_raise_if_exception(caml_process_pending_signals_exn()); |
There was a problem hiding this comment.
I agree with @jhjourdan. Please remove the commented-out code. Optionally, you could add a comment that explains "We used to process pending signals at this point but it is not a good idea because...".
|
@stedolan, for 1. I was unclear, the first point was not meant to be about a property we might lose but one that was requested elsewhere. I think we are in agreement overall. For 2., I had in mind operations like
Currently you could have mysterious deadlocks (with systhreads) or crashes (without it) instead. My point of view is that these concurrent accesses do bad things in general and would generally be considered as bugs, so it is better to fail early with an exception. You are interested in a use-case where the bad thing is acceptable. The approach at #9757 is biased towards fast bug reporting, but there are workarounds if the programmer insists, such as (AFAIU) creating a separate channel from a dup of stdout/stderr. But, I agree that the additional guarantee of not running callbacks while holding a lock can be sensible in this case (and the specification "use a lock internally to avoid corruption, but make no guarantee of atomicity"). My point of view is mainly from thinking about what is a good API for the programmer, and I am not convinced by the public API and behaviour changes. In particular, #8997 shows that polling inside |
There was a problem hiding this comment.
The discussion did not convince me about the public API and behaviour change regarding polling. To sum up my points:
Not polling inside caml_leave_blocking_section was based on a very broad argument about state corruption, but this is nothing specific to system calls but is a general concern when dealing with interrupts. #8997 shows that there is no obstacle to ensuring the consistency of channel state in this case. Instead, the reason you want to not poll here is to be able to unlock the mutex before running signal handlers, which is fairly specific to what you are trying to achieve, relying on a non-compositional coding style where the programmer must control every places where polling could happen.
First, an alternative, compositional way to control where polling happens is being proposed at #8961, so I do not see the opportunity to encourage this style by making _no_pending functions public at this point.
Second, you could introduce a specific function caml_leave_blocking_section_no_pending, but instead you change the behaviour for everybody who ever acquires the runtime lock. That signals are checked as soon as you acquire the runtime lock is a nice and natural guarantee, and I am not convinced that the change will make things much simpler. For instance, the number of functions that one has to transform into an _exn variant for explicit resource clean-up in #8997 does not change since they will already contain caml_enter_blocking_section, i.e. this polling location comes for free in terms of non-raising guarantees of functions.
I can sympathize with this argument. Perhaps a new I would still like this PR to be merged well in time for release 4.12. #8961 and #8997 look like longer-term projects to me, of the kind that will not be reviewed in time for 4.12. |
The thing is, it's not like this is a change that is likely to break programs. This will simply delay some signals a little more (until the next polling point). But since polling point (should be) are inserted regularly, the introduced delay should be too long. |
#8997 introduces In the cases where #8997 uses the polling variant of I have yet to see a program whose behaviour is improved by polling in leave_blocking_section, but I've seen several where the polling results in a bug. The usual argument against removing a polling point from a function is that it might cause unbounded delay if a program repeatedly calls it, but that can't happen here: any code that calls caml_leave_blocking_section repeatedly must also call caml_enter_blocking_section repeatedly, and that's a better place to poll. |
Yes, we agree that sometimes we need this behaviour! This is the point of #8961 (not yet ready to land).
Not really, this is about avoiding leaks rather than invalidating the state of values that are still in use, and it is exclusively used in the release path (I did not remove polling locations). The transformation used at #8997 is systematic and explains why you need them in the release path. If you used the raising version, then it tells you to use the
True, I considered the error from the failed operation as discardable. Your point is quite subtle and interesting. If we apply the general rule not to drop exceptions, I agree that one cannot poll when acquiring the lock. But this generally applies to exceptions that we do not know where they come from. Here we know what the exception is, and we could consider that it has not been raised yet (one could permute the polling and the conversion the error code into the exception). At least this is my reasoning here.
I agree that this "usual argument" does not apply and I like the argument. Rather, I am very cautious about this sort of radical change. My approach is to not assume that I can know what code exists in the wild; here maybe some code is already poor in polling locations. And as Jacques-Henri recalls, polling should be inserted regularly. Right when acquiring the lock has the most chance to be helpful. |
|
Could we please make progress towards merging this PR? The points of disagreement have been presented in details, and look small to me, so it should be possible to move forward. |
|
I think the issue with (I think the change deserved such a discussion. Some of the disagreement can probably be explained by the fact that we have different ideas of how frequently the runtime should poll, which has never been specified.) Whether we should be wary of adding too many and redundant functions to the public API is also a matter of taste to some degree, so I leave it to @stedolan's appreciation. |
It is worth noting that according to @stedolan's link, it is possible to interrupt the system call:
|
Good to know! I see there's also |
|
Here is a consequence of not polling in Currently, if a system call gives EINTR (other than Now, user code might typically perform some clean-up before handling the exception. In the current incorrect way to do clean-up available to the OCaml programmer (e.g. Fun.protect or manual try/re-raise), a signal handler being executed inside the clean-up code and raising is likely to cause a leak. Currently, this is a race condition that happens rarely (or so we hope). By raising Furthermore, if it was possible to run clean-up code correctly, that is with pending signals masked, then the execution of the signal handler will instead be delayed until after all the clean-up is done. One consequence is that if Thus, even if we believe that the program will poll soon enough, it might not be where we want it. A solution is that the users should first check for EINTR and poll, but polling in Do you agree with this analysis? |
The debugger's use of channels doesn't support locking, but it doesn't work on threaded programs anyway.
|
I've rebased the history into a reasonable series of commits, and added tests for various cases of signal handling, GC and I/O. These tests uncovered a couple more bugs (e.g. I'd appreciate someone having a quick read of the io.c changes, but otherwise I consider this ready to merge (let's see if CI agrees...) |
Changes
Outdated
| - #1128, #9722: EINTR-based signal handling: ensure that signals are handled | ||
| only at polling points (which no longer includes leave_blocking_section), | ||
| and that I/O locks are not held during signal handling. |
There was a problem hiding this comment.
| - #1128, #9722: EINTR-based signal handling: ensure that signals are handled | |
| only at polling points (which no longer includes leave_blocking_section), | |
| and that I/O locks are not held during signal handling. | |
| * #1128, #9722: EINTR-based signal handling: ensure that signals are handled | |
| only at polling points, and that I/O locks are not held during signal | |
| handling. The function `caml_leave_blocking_section` no longer checks for | |
| pending signals, this is replaced by polling inside `caml_raise`. This is a | |
| change we believe will fix bugs in user C code. |
edit: forgot to mention caml_raise.
runtime/fail_byt.c
Outdated
| if (caml_check_pending_actions()) { | ||
| CAMLparam1 (v); | ||
| caml_process_pending_actions(); | ||
| CAMLdrop; | ||
| } |
There was a problem hiding this comment.
| if (caml_check_pending_actions()) { | |
| CAMLparam1 (v); | |
| caml_process_pending_actions(); | |
| CAMLdrop; | |
| } | |
| v = caml_process_pending_actions_with_root(v); |
There was a problem hiding this comment.
Changed, thanks.
(I added an assertion that !Is_exception_result(v), because failing to call Extract_exception before raise is an easy mistake to make, and the implementation of caml_process_pending_actions_with_root will do something strange in that case)
|
|
||
| if (caml_check_pending_actions()) { | ||
| CAMLparam1 (v); | ||
| caml_process_pending_actions(); | ||
| CAMLdrop; | ||
| } | ||
|
|
There was a problem hiding this comment.
| if (caml_check_pending_actions()) { | |
| CAMLparam1 (v); | |
| caml_process_pending_actions(); | |
| CAMLdrop; | |
| } | |
| v = caml_process_pending_actions_with_root(v); |
| if (check_for_pending_signals()) { | ||
| signals_are_pending = 1; | ||
| caml_set_action_pending(); | ||
| } |
There was a problem hiding this comment.
LGTM. At #8961 I was simply calling caml_set_action_pending after removing signals_are_pending.
There was a problem hiding this comment.
I would prefer if we could call caml_set_action_pending(); even if check_for_pending_signals() returns 0.
First, it does not cost much : the only cost is that we will examine all the potential callbacks at the next polling point. This is not much more expensive than calling check_for_pending_signals(). Second, with #9674, there will be another source of thread-local callback, which will need updating caml_something_to_do at every context switch.
There was a problem hiding this comment.
I'd prefer to keep it like this. The cost of set_action_pending isn't zero: the next allocation will fail and spill/restore all registers.
In many programs, entering and leaving a blocking section is massively more frequent than context-switching between threads. If you want to implement per-thread callbacks, I think it would make more sense to do them in caml_thread_{save,restore}_runtime_state in the systhreads library, which is called from the blocking_section_hooks. (Also, doing it there gives you the opportunity to optimise the common case where no context switch has occurred, and the thread taking the lock is the last one to release it)
| retcode = caml_sigmask_hook(how, &set, &oldset); | ||
| caml_leave_blocking_section(); | ||
| /* Run any handlers for just-unmasked pending signals */ | ||
| caml_process_pending_actions(); |
There was a problem hiding this comment.
To be clear, this is another example of behaviour change from not polling in caml_leave_blocking_section, right?
There was a problem hiding this comment.
To preserve behaviour, explicit polls are added:
- in caml_raise, to raise the right exception when as system
call is interrupted by a signal.
- in sigprocmask, to ensure that signals are handled as soon
as they are unmasked.
jhjourdan
left a comment
There was a problem hiding this comment.
The current version looks good to me, except a small comment in signal.c.
| if (check_for_pending_signals()) { | ||
| signals_are_pending = 1; | ||
| caml_set_action_pending(); | ||
| } |
There was a problem hiding this comment.
I would prefer if we could call caml_set_action_pending(); even if check_for_pending_signals() returns 0.
First, it does not cost much : the only cost is that we will examine all the potential callbacks at the next polling point. This is not much more expensive than calling check_for_pending_signals(). Second, with #9674, there will be another source of thread-local callback, which will need updating caml_something_to_do at every context switch.
|
I think the Changes file deserves some clarification, in the vein of how I sketched it in my comment above. |
|
I've updated the Changes entry, by editing together my, @xavierleroy and @gadmm's proposals. |
Looks good, thanks! |
|
I had a quick look at runtime/io.c again, and it looked good. At any rate, this PR has been reviewed in exquisite details and polished with much care (thank you, author and reviewers!), so it's high time to merge it. |
This is a (greatly simplified) reimplementation of #1128, to solve a collection of tricky locking and signal-handling issues.
In trunk, when a signal arrives it is handled:
systhreadslibrary has not been linkedsysthreadslibrary has been linkedPolling points include all allocations in OCaml and calls to
caml_process_pending_actionsfrom C. (Before #8691 / #9027 / #8897, all allocations in C were also polling points, but no longer)There are a number of problems with the current behaviour:
It is odd for the responsiveness of a program to signals to depend on whether
systhreadshas been linked.(IO mutex and threads #5141, Bad locking in io.c #7503) When they interrupt code using mutexes, the signal handlers run under an unknown collection of locks, which can cause deadlocks. This is particularly bad for signal handlers that write a message to stdout or stderr, as io.c maintains mutexes on those channels.
If a signal is handled immediately during a blocking section that was executing a non-async-signal-safe function, arbitrary corruption can occur. (It is possible to write a blocking section that only uses async-signal-safe functions, but this requires care, and various blocking sections even in the stdlib are not async-signal-safe).
If a signal is handled immediately during a blocking section, then any progress made by that blocking section is discarded if the handler raises. Specifically, a signal may occur after a system call has successfully returned but before the blocking section has ended, and its return value was lost. There is in general no way to recover from this in the case of say, writing to a pipe.
This patch fixes these issues by:
Always waiting for the next polling point, and never handling signals immediately (regardless of whether
systhreadsis linked)Modifying the blocking
readandwritecalls inio.cto insert a polling point (and hence handle signals) before retrying a system call on receipt of an EINTR.If a signal arrives during a blocking system call, instead of handling the signal then and there, we mark the pending signal and rely on the OS to make the system call fail with EINTR, at which point we check for pending signals before retrying the operation.
Compared with the previous EINTR patch (#1128), this one is much simpler. When #1128 was written, there was no easy way to delay handling of a signal to a convenient point, and so that patch included some gymnastics to ensure it was always safe to handle one.
Since #8691 / #9027 / #8897, polling points are less frequent in C code, and delaying a signal until convenient is easier. To allow this, the C API is modified by:
caml_check_pending_actions, which checks for pending actions without yet running themcaml_enter_blocking_section_no_pending, which enters a blocking section without checking for pending actionscaml_leave_blocking_section(This PR is best read commit-by-commit)
TODO:
signals_are_pending = 1incaml_leave_blocking_section, which might cause unnecessary work. I'll investigate.Windows support(left until after Windows Console Unicode Support #1408 is merged)