Fix graceful shutdown "broken pipe" error#527
Conversation
| t.Fatalf("expected exec.ExitError, got %T, %s", err, err) | ||
| } | ||
| if !ee.Exited() { | ||
| t.Fatalf("expected process to have exited, but it did not (%s)", ee.ProcessState.String()) |
There was a problem hiding this comment.
This is where we get the signal: broken pipe error:
expected process to have exited, but it did not (signal: broken pipe)
Terraform v1.0 has slightly different behaviour. Without the fix, the error is the same but with the fix it's this:
expected process to have exited, but it did not (signal: killed)
This is because Terraform doesn't shut down within the 1-second WaitDelay we set on line 212. I tried to figure out why but failed. I suspect Terraform v1.1 introduced a subtle change to the way Terraform handles signals, but I'm not sure. I figure this is orthogonal to this issue though. All v1.1+ versions behave the same. So this test skips any version of Terraform below 1.1.0.
| // ReadBytes will block until bytes are read, which can cause a delay in | ||
| // returning even if the command's context has been canceled. Use a separate | ||
| // goroutine to prompt ReadBytes to return on cancel | ||
| closeCtx, closeCancel := context.WithCancel(ctx) | ||
| defer closeCancel() | ||
| go func() { | ||
| select { | ||
| case <-ctx.Done(): | ||
| r.Close() | ||
| case <-closeCtx.Done(): | ||
| return | ||
| } | ||
| }() |
There was a problem hiding this comment.
This comment is pretty explicit about purposefully closing the pipes early, rather than letting them close "naturally". For graceful shutdown to work, we need to keep the pipes open as long as terraform is running.
The key here is removing r.Close() on line 264. Once we do that, all context-related code in this function is a no-op, so we can remove all of it.
|
@busser Thank you for the deep investigation and a very helpful and detailed PR description, much appreciated. 🙏🏻 I ran the whole matrix of tests which raised some failures unfortunately. Could you look into these failures? While we focus on fixing those I would be happy to accept & merge 7434949 in a separate PR if you can raise it. Feel free to cc me on that PR and I'll try to merge it in a timely manner so you can rebase here and keep this PR focused on solving the bug. Regarding the tests run time - we can always trim some versions if it becomes a problem but I think the trade-offs here are in favour of correctness over speed considering the current cadence of contributions to the project. |
|
I'm wondering if these failures could be simply a result of a flakey test as it is dependent on If that is the case, we could try wrapping it in the new experimental As to how we could introduce it - Go 1.25 is around the corner (expected for August, i.e. in a few weeks). We could wait until the next Go release, bump the version and just use it. Alternatively I'd also be open to modifying the CI workflows such that we can use it as an experiment. |
|
Thanks for looking into this! I investigated the test failures and I don't think it's a flaky test issue. Rather, these failures are related to something I mentioned earlier: Terraform pre-1.1 doesn't seem to react when sent an interrupt signal. The tests fail specifically for Terraform versions 0.12 to 1.0 inclusive. The I don't think Terraform ignoring the signal has anything to do with To address this, I'll add a commit to skip that test for versions up to 1.0, similar to the test I introduced in this PR. I'll also open a separate PR with the commit that adds all the versions to test as you suggested, so we can get that merged first and keep this PR focused on the bug fix. |
cb3fd9c to
a09bccb
Compare
|
@busser The 1.1 behaviour change is a fantastic find. 👏🏻 I managed to find specifically the PR which introduced the change: hashicorp/terraform#29825 - as verified by rebuilding Terraform with and without that commit and re-running your test which fails without that patch. That PR was actually missed from the public changelog, perhaps because it was difficult to tell how exactly it impacted end users and what they can possibly do about it. This is now - sic retrospectively - a bit clearer. Whilst cancellations <1.1 were not graceful, the library would still honour the cancellation in some shape or form, i.e. calling code of tfexec would not be left hanging. In this PR, the Having looked into why it is left hanging, I could observe a zombie process which is left hanging even after the Of course closing the pipes early (before cancellation) like we do now (prior to this PR) is still wrong and we need to fix that. I just can't think of any decent backwards-compatible solution right now. Need to think about this a bit more. Constructive ideas welcomed. |
|
Some ideas I ran through:
If I really had to choose between those I would probably choose (1) and made it very clear in the changelog to end-users that they can still fall back to the old behaviour. I think it would set us up better for correct handling being the default. Alternative ideas are welcomed! |
|
One additional observation: The process becomes a zombie only after the SIGKILL, the SIGINT is just being ignored by Terraform <1.1. The reason we end up with a zombie is because Go chooses to close the I/O pipes only after sending the SIGKILL and only in some specific circumstances: https://cs.opensource.google/go/go/+/refs/tags/go1.24.5:src/os/exec/exec.go;l=828-859 In theory this means we could check if the process has exited after SIGINT and before SIGKILL and if not, close the pipes and send the SIGKILL. Practically though it means re-implementing the stdlib logic behind Just sharing that as a possible solution number (4). |
|
Here is a prototype as per (4): https://gist.github.com/radeksimko/8e29a393c4101a79846565fe893cb9bb I think it makes the right trade-offs, mainly it remains a hidden implementation detail which we can relatively easily deprecate later. It makes the existing @busser Do you have any thoughts on this topic? |
|
Hey @radeksimko, sorry for the late response, I just got back from summer break. I like option 1: fix the pipe closing bug and add a configuration option to allow pre-1.1 users to re-enable the previous behaviour. This would indeed be a change in behaviour but I don't think it's a breaking change. I would call it a bug fix that reveals another bug. The To quote the Terraform v1.x Compatibility Promises:
In our case, fixing the pipe-closing bug allows graceful shutdown to work for Terraform 1.1+ and reveals that Terraform 1.0 ignores SIGINT. I think it's reasonable to fix this bug in terraform-exec even if it reveals a bug in an old version of Terraform. I also think option 1 would be easier to implement than option 4, since we'd be keeping the exact same behavior regarding closing pipes. In Option 4, the behavior changes: the pipes now close after the delay rather than before. Users would also not have any configuration options to revert back to the current behavior, in case they don't want a graceful shutdown. I think option 2 is what the compatibility promise called remaining "bug-compatible". This pipe closing issue is clearly a bug and keeping it the default seems unnecessary. Option 3, as you've noted, could indeed have performance implications. I think it also breaks an implicit assumptions users of terraform-exec make that each method call results in one command. I think it would be surprising to discover that Happy to get your thoughts on this :) |
|
@busser That is a fair assessment. I'm fine with (1). I'd propose the following description for the new field // EnableLegacyPipeClosing causes the library to "force-close" stdio pipes.
// This works around a bug in Terraform < v1.1 that would otherwise leave
// the process (and caller) hanging after graceful shutdown.
//
// This option can be safely ignored (set to false) with Terraform 1.1+.
EnableLegacyPipeClosing boolI will leave the implementation up to you - feel free to ping me once you have it ready. |
|
@radeksimko Ready for your review :) |
radeksimko
left a comment
There was a problem hiding this comment.
LGTM, Thank you for working on this and for talking through all the details at length.
Much appreciated.
As reported in comments on PR #512, the graceful shutdown behaviour introduced in that PR revealed an issue with how
tfexecattempts to shut downterraform.The symptoms were that when we canceled the
context.Contextpassed toexec.CommandContextintfexec.Terraform.buildTerraformCmd, rather than initiating a graceful shutdown ofterraformand waiting for it to exit,terraformwould suddenly exit with asignal: broken pipeerror.The root cause
PR #523 provides a way to reproduce the error, which allowed me to do some good old
log.Printfdebugging to figure out what's happening. Thanks @dastrobu for that!The bug was revealed by PR #512 but it did not introduce it. That PR adds a graceful shutdown mechanism built on
context.Context. When atfexecuser calls a method ontfexec.Terraform, we create anexec.Commandconfigured to send anos.Interruptsignal toterraformwhen the user cancels theircontext.Context. This works fine andterraformdoes receive that signal and begins its usual shutdown process.We expect
tfexecto wait forterraformto exit, and it does. Since the context was canceled,tfexecuses the customcmdErrerror type to aggregate both thecontext.Context's error and theexec.Command's error. Intfexec's tests, thecontext.Context's error hides the other error which is why it hasn't been caught.The
exec.Command's error issignal: broken pipe. But why?Independently of the graceful shutdown mechanism,
tfexecalso pipesterraform's output streams (stdoutandstderr) to buffers for later use. We close those pipes when thecontext.Contextis canceled, so beforeterraformhas finished shutting down. During its shutdown process,terraformattempts to log something and boom 💥:broken pipe.The fix
The fix is to not close those pipes.
exec.Command.Waitdoes it for us after theterraformprocess exits. This consists of removing any use ofcontext.Contextin the function that copiesterraform's output from pipes into buffers. That function initially stopped as soon ascontext.Contextwas canceled, but that's the opposite of what we want. Callingexec.Command.Waitwill close the pipes whenterraformstops, so we don't need to close them ourselves.This pull request
This PR is split into
4546 commits:Update thee2etestpackage to test all current v1.x versions of Terraform. I wanted to make sure this behaviour wasn't specific to any one version ofterraform. It's not. I'm fine with removing that commit if you don't want to make the E2E test suite too long to run.e2etestthat detects thisbroken pipeissue. The test doesn't specifically look for that error, but rather for cases where theterraformcommand stopped because a signal forced it to instead of stopping on its own withos.Exit. On Unix systems,broken pipeis signal 13. Without the fix, this new test fails with thesignal: broken pipeerror.tfexecin any way.