feat(cram): cleanup subprocesses on exit#11827
Conversation
4cabe9e to
9573983
Compare
9573983 to
eff4020
Compare
Signed-off-by: Ali Caglayan <alizter@gmail.com>
Signed-off-by: Ali Caglayan <alizter@gmail.com>
adbbb35 to
49cc1b4
Compare
This should fix a common problem where users run subprocesses, for example using `./script.sh &` inside their cram tests. If these subprocesses do not terminate by themselves, they become orphaned and have to be manually killed. A common use case is testing a web-server. We add a more sophisticated trap for the underlying shell script of a cram test so that it can successfully terminate all of its children. Signed-off-by: Ali Caglayan <alizter@gmail.com>
6595f3f to
5889ad5
Compare
Signed-off-by: Ali Caglayan <alizter@gmail.com>
5889ad5 to
883b219
Compare
| subprocess PIDs as not to orphan them. The first "trap" will kill all | ||
| processes in the process group and the second "trap" will make sure the main | ||
| shell process exits gracefully. *) | ||
| fprln oc {|trap "trap 'exit 0' TERM && kill -- -$$" EXIT|}; |
There was a problem hiding this comment.
I'd like a little more detail in the comment here for future readers, specifically to understand the reason for registering a trap handler inside another trap handler. Here's my guess at why it's written this way:
The kill -- -$$ sends TERM to all processes in the process group. Is the reason for the "inner" trap so that when said TERM is received by the main shell process that it exits with 0? I guess there's no easy way to send TERM to all processes in the current process's group except for the current process? And if you were to move the inner trap outside of the EXIT trap handler (and just do kill -- -$$ on EXIT) then the process could exit 0 in response to an unexpected TERM which we don't want.
There was a problem hiding this comment.
Exactly, this was my solution to avoiding killing the main process. If we end up killing the main process, then Dune will complain that the process is was running received a kill signal, what we want instead is to preserve the old trap behaviour which was to exit 0.
By switching the trapping behaviour just before we kill the process group, we can signal to all the processes in the group that they should exit when encountering TERM, which they all promptly do.
gridbugs
left a comment
There was a problem hiding this comment.
Very useful! Minor comment to clarify the nested trap but otherwise happy with this.
rgrinberg
left a comment
There was a problem hiding this comment.
I don't think this is the right fix. We can use the same mechanism for cleanup that we do for regular actions.
|
Thinking about this again, it's probably fine to have some sort of behavior for cram tests, but it needs to provide some sort of an improvement over the regular behavior of killing the process group, and a way to disable this behavior for when the user wants to write their own signal handler. One way to improve the behavior could be to tell the user which pids are being leaked and making the test fail if there's leakage. |
|
Closing as it's not useful in its current state. |
This PR adds a reproduction case for #11820 where a subprocess is started in a cram test but becomes orphaned once the cram test terminates.
We then introduce the following fix. We use a more sophisticated
trapat the start of the underlying shell script in order to terminate all processes in the process group. In order to avoid the main shell being terminated, we use a second trap to exit gracefully.Fix #11820
This also allows us to re-enable some previously flakey tests in the CI.
A future improvement would be to run cram tests entirely as dune actions themselves, that way we can delegate process handling to the engine and have better cross platform support. That will require a lot more work however, so in the meantime we offer this improvement.