Skip to content

Fix graceful shutdown "broken pipe" error#527

Merged
radeksimko merged 6 commits intohashicorp:mainfrom
busser:arthur/fix-broken-pipe-during-shutdown
Aug 26, 2025
Merged

Fix graceful shutdown "broken pipe" error#527
radeksimko merged 6 commits intohashicorp:mainfrom
busser:arthur/fix-broken-pipe-during-shutdown

Conversation

@busser
Copy link
Contributor

@busser busser commented Jul 22, 2025

As reported in comments on PR #512, the graceful shutdown behaviour introduced in that PR revealed an issue with how tfexec attempts to shut down terraform.

The symptoms were that when we canceled the context.Context passed to exec.CommandContext in tfexec.Terraform.buildTerraformCmd, rather than initiating a graceful shutdown of terraform and waiting for it to exit, terraform would suddenly exit with a signal: broken pipe error.

The root cause

PR #523 provides a way to reproduce the error, which allowed me to do some good old log.Printf debugging to figure out what's happening. Thanks @dastrobu for that!

The bug was revealed by PR #512 but it did not introduce it. That PR adds a graceful shutdown mechanism built on context.Context. When a tfexec user calls a method on tfexec.Terraform, we create an exec.Command configured to send an os.Interrupt signal to terraform when the user cancels their context.Context. This works fine and terraform does receive that signal and begins its usual shutdown process.

We expect tfexec to wait for terraform to exit, and it does. Since the context was canceled, tfexec uses the custom cmdErr error type to aggregate both the context.Context's error and the exec.Command's error. In tfexec's tests, the context.Context's error hides the other error which is why it hasn't been caught.

The exec.Command's error is signal: broken pipe. But why?

Independently of the graceful shutdown mechanism, tfexec also pipes terraform's output streams (stdout and stderr) to buffers for later use. We close those pipes when the context.Context is canceled, so before terraform has finished shutting down. During its shutdown process, terraform attempts to log something and boom 💥: broken pipe.

The fix

The fix is to not close those pipes. exec.Command.Wait does it for us after the terraform process exits. This consists of removing any use of context.Context in the function that copies terraform's output from pipes into buffers. That function initially stopped as soon as context.Context was canceled, but that's the opposite of what we want. Calling exec.Command.Wait will close the pipes when terraform stops, so we don't need to close them ourselves.

This pull request

This PR is split into 4 5 4 6 commits:

  1. (rebased) Update the e2etest package to test all current v1.x versions of Terraform. I wanted to make sure this behaviour wasn't specific to any one version of terraform. It's not. I'm fine with removing that commit if you don't want to make the E2E test suite too long to run.
  2. Add a new test to e2etest that detects this broken pipe issue. The test doesn't specifically look for that error, but rather for cases where the terraform command stopped because a signal forced it to instead of stopping on its own with os.Exit. On Unix systems, broken pipe is signal 13. Without the fix, this new test fails with the signal: broken pipe error.
  3. Implement the fix.
  4. Apply the same fix to Linux-only code.
  5. (added) Skip graceful shutdown tests for pre-1.1 versions of Terraform. Terraform doesn't seem react to the receiving an interrupt signal in v1.0 and earlier, so no graceful shutdown happens. I don't think this is related to tfexec in any way.
  6. (added) Add an option to explicitly enable the pipe-closing behavior, to work around pre-1.1 versions of Terraform ignoring the interrupt signal.
  7. (added) Delete a bit of dead code from a test.

@busser busser requested review from a team as code owners July 22, 2025 17:32
@busser busser marked this pull request as draft July 22, 2025 17:32
@busser busser changed the title arthur/fix broken pipe during shutdown Fix graceful shutdown "broken pipe" error Jul 22, 2025
@busser busser changed the title Fix graceful shutdown "broken pipe" error Fix "broken pipe" error during graceful shutdown on Unix systems Jul 22, 2025
@busser busser changed the title Fix "broken pipe" error during graceful shutdown on Unix systems Fix graceful shutdown "broken pipe" error Jul 22, 2025
t.Fatalf("expected exec.ExitError, got %T, %s", err, err)
}
if !ee.Exited() {
t.Fatalf("expected process to have exited, but it did not (%s)", ee.ProcessState.String())
Copy link
Contributor Author

@busser busser Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we get the signal: broken pipe error:

expected process to have exited, but it did not (signal: broken pipe)

Terraform v1.0 has slightly different behaviour. Without the fix, the error is the same but with the fix it's this:

expected process to have exited, but it did not (signal: killed)

This is because Terraform doesn't shut down within the 1-second WaitDelay we set on line 212. I tried to figure out why but failed. I suspect Terraform v1.1 introduced a subtle change to the way Terraform handles signals, but I'm not sure. I figure this is orthogonal to this issue though. All v1.1+ versions behave the same. So this test skips any version of Terraform below 1.1.0.

Comment on lines -256 to -268
// ReadBytes will block until bytes are read, which can cause a delay in
// returning even if the command's context has been canceled. Use a separate
// goroutine to prompt ReadBytes to return on cancel
closeCtx, closeCancel := context.WithCancel(ctx)
defer closeCancel()
go func() {
select {
case <-ctx.Done():
r.Close()
case <-closeCtx.Done():
return
}
}()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is pretty explicit about purposefully closing the pipes early, rather than letting them close "naturally". For graceful shutdown to work, we need to keep the pipes open as long as terraform is running.

The key here is removing r.Close() on line 264. Once we do that, all context-related code in this function is a no-op, so we can remove all of it.

@radeksimko
Copy link
Member

@busser Thank you for the deep investigation and a very helpful and detailed PR description, much appreciated. 🙏🏻

I ran the whole matrix of tests which raised some failures unfortunately. Could you look into these failures?


While we focus on fixing those I would be happy to accept & merge 7434949 in a separate PR if you can raise it. Feel free to cc me on that PR and I'll try to merge it in a timely manner so you can rebase here and keep this PR focused on solving the bug. Regarding the tests run time - we can always trim some versions if it becomes a problem but I think the trade-offs here are in favour of correctness over speed considering the current cadence of contributions to the project.

@radeksimko
Copy link
Member

I'm wondering if these failures could be simply a result of a flakey test as it is dependent on time.Sleep?

If that is the case, we could try wrapping it in the new experimental testing/synctest potentially, unless we find some other way of removing the flakiness.

As to how we could introduce it - Go 1.25 is around the corner (expected for August, i.e. in a few weeks). We could wait until the next Go release, bump the version and just use it. Alternatively I'd also be open to modifying the CI workflows such that we can use it as an experiment.

@busser
Copy link
Contributor Author

busser commented Jul 23, 2025

Thanks for looking into this! I investigated the test failures and I don't think it's a flaky test issue. Rather, these failures are related to something I mentioned earlier: Terraform pre-1.1 doesn't seem to react when sent an interrupt signal.

The tests fail specifically for Terraform versions 0.12 to 1.0 inclusive. The TestContext_sleepTimeoutExpired test (the only one that fails) is already skipped for versions under 0.12, and starting with version 1.1, Terraform shuts down correctly when signaled.

I don't think Terraform ignoring the signal has anything to do with tfexec itself. The issue doesn't discriminate based on OS, so it's likely not a platform-specific signaling issue. The test fails now because before this PR's fix, Terraform crashed due to the broken pipes, which the test considered a successful shutdown. Now that the crash is gone, tests detect that Terraform doesn't shut down properly for these older versions.

To address this, I'll add a commit to skip that test for versions up to 1.0, similar to the test I introduced in this PR.

I'll also open a separate PR with the commit that adds all the versions to test as you suggested, so we can get that merged first and keep this PR focused on the bug fix.

@busser busser force-pushed the arthur/fix-broken-pipe-during-shutdown branch from cb3fd9c to a09bccb Compare July 23, 2025 17:35
@radeksimko
Copy link
Member

@busser The 1.1 behaviour change is a fantastic find. 👏🏻

I managed to find specifically the PR which introduced the change: hashicorp/terraform#29825 - as verified by rebuilding Terraform with and without that commit and re-running your test which fails without that patch.

That PR was actually missed from the public changelog, perhaps because it was difficult to tell how exactly it impacted end users and what they can possibly do about it. This is now - sic retrospectively - a bit clearer.


Whilst cancellations <1.1 were not graceful, the library would still honour the cancellation in some shape or form, i.e. calling code of tfexec would not be left hanging.

In this PR, the TestContext_sleepTimeoutExpired fails after 10 seconds w/ Terraform <1.1, i.e. it would hit the 10s (or any longer) timeout in the test. I think that implies that consumers of tfexec using it alongside Terraform <1.1 would now be left hanging on attempted cancellation, which does not seem ideal. 🤔

Having looked into why it is left hanging, I could observe a zombie process which is left hanging even after the Kill() initiated via WaitDelay. 😑

Of course closing the pipes early (before cancellation) like we do now (prior to this PR) is still wrong and we need to fix that.

I just can't think of any decent backwards-compatible solution right now. Need to think about this a bit more. Constructive ideas welcomed.

@radeksimko
Copy link
Member

Some ideas I ran through:

  1. gate the OLD behaviour behind a field on Terraform struct - e.g. LegacyPipeClosingEnabled - this would be practically a breaking change for users of <1.1
  2. gate the NEW behaviour behind a field on Terraform struct - e.g. EnableGracefulCancellation - this would make the desired (correct) behaviour opt-in
  3. call version before every command and gate and decide on the behaviour based on version - this could have some undesirable performance impact

If I really had to choose between those I would probably choose (1) and made it very clear in the changelog to end-users that they can still fall back to the old behaviour. I think it would set us up better for correct handling being the default.

Alternative ideas are welcomed!

@radeksimko
Copy link
Member

radeksimko commented Jul 24, 2025

One additional observation:

The process becomes a zombie only after the SIGKILL, the SIGINT is just being ignored by Terraform <1.1. The reason we end up with a zombie is because Go chooses to close the I/O pipes only after sending the SIGKILL and only in some specific circumstances: https://cs.opensource.google/go/go/+/refs/tags/go1.24.5:src/os/exec/exec.go;l=828-859

In theory this means we could check if the process has exited after SIGINT and before SIGKILL and if not, close the pipes and send the SIGKILL. Practically though it means re-implementing the stdlib logic behind WaitDelay or at least working around it.

Just sharing that as a possible solution number (4).

@radeksimko
Copy link
Member

Here is a prototype as per (4): https://gist.github.com/radeksimko/8e29a393c4101a79846565fe893cb9bb

I think it makes the right trade-offs, mainly it remains a hidden implementation detail which we can relatively easily deprecate later.

It makes the existing TestContext_sleepTimeoutExpired test pass on both <1.1 versions and 1.1+.

@busser Do you have any thoughts on this topic?

@busser
Copy link
Contributor Author

busser commented Aug 18, 2025

Hey @radeksimko, sorry for the late response, I just got back from summer break.

I like option 1: fix the pipe closing bug and add a configuration option to allow pre-1.1 users to re-enable the previous behaviour. This would indeed be a change in behaviour but I don't think it's a breaking change. I would call it a bug fix that reveals another bug. The EnableLegacyPipeClosing setting would also allow any user to fall back to the current behavior, if it matters to them.

To quote the Terraform v1.x Compatibility Promises:

If the actual behavior of a feature differs from what we explicitly documented as the feature's behavior, we will usually treat that as a bug and change the feature to match the documentation, although we will avoid making such changes if they seem likely to cause broad compatibility problems. We cannot promise to always remain "bug-compatible" with previous releases, but we will consider such fixes carefully to minimize their impact.

In our case, fixing the pipe-closing bug allows graceful shutdown to work for Terraform 1.1+ and reveals that Terraform 1.0 ignores SIGINT. I think it's reasonable to fix this bug in terraform-exec even if it reveals a bug in an old version of Terraform.

I also think option 1 would be easier to implement than option 4, since we'd be keeping the exact same behavior regarding closing pipes. In Option 4, the behavior changes: the pipes now close after the delay rather than before. Users would also not have any configuration options to revert back to the current behavior, in case they don't want a graceful shutdown.

I think option 2 is what the compatibility promise called remaining "bug-compatible". This pipe closing issue is clearly a bug and keeping it the default seems unnecessary.

Option 3, as you've noted, could indeed have performance implications. I think it also breaks an implicit assumptions users of terraform-exec make that each method call results in one command. I think it would be surprising to discover that Plan(ctx) runs both version and plan. I've seen people use terraform-exec with other Terraform-like executables that aren't fully identical. For example, Plan(ctx) works with all versions of Terragrunt but Version(ctx) does not 😅 I've run into this issue as a maintainer of https://github.com/busser/tfautomv.

Happy to get your thoughts on this :)

@radeksimko
Copy link
Member

@busser That is a fair assessment. I'm fine with (1).

I'd propose the following description for the new field

// EnableLegacyPipeClosing causes the library to "force-close" stdio pipes.
// This works around a bug in Terraform < v1.1 that would otherwise leave
// the process (and caller) hanging after graceful shutdown.
//
// This option can be safely ignored (set to false) with Terraform 1.1+.
EnableLegacyPipeClosing bool

I will leave the implementation up to you - feel free to ping me once you have it ready.

@busser
Copy link
Contributor Author

busser commented Aug 19, 2025

@radeksimko Ready for your review :)

Copy link
Member

@radeksimko radeksimko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thank you for working on this and for talking through all the details at length.

Much appreciated.

@radeksimko radeksimko merged commit 1597b75 into hashicorp:main Aug 26, 2025
107 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants