Skip to content

test/testenv/build.go: don't deadlock when sub-test panics#895

Closed
invidian wants to merge 1 commit into
project-dalec:mainfrom
invidian:invidian/fix-integration-test-deadlock-on-panic
Closed

test/testenv/build.go: don't deadlock when sub-test panics#895
invidian wants to merge 1 commit into
project-dalec:mainfrom
invidian:invidian/fix-integration-test-deadlock-on-panic

Conversation

@invidian

@invidian invidian commented Dec 10, 2025

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

Right now if a sub-test inside RunTest() panics, test run deadlocks, because done channel never gets closed, which causes cleanup from getBuildOutputStream() to block forever.

Closing done channel on t.Cleanup before goroutine starts ensuring that the channel is closed when test ends, even if it panics and without the race condition hazard in case test would panic before goroutine starts.

Special notes for your reviewer:

To reproduce existing issue, apply the following patch:

diff --git test/linux_target_test.go test/linux_target_test.go
index 4bcb84d..998fbb6 100644
--- test/linux_target_test.go
+++ test/linux_target_test.go
@@ -1252,6 +1252,10 @@ WantedBy=multi-user.target
                }

                testEnv.RunTest(ctx, t, func(ctx context.Context, client gwclient.Client) {
+                       t.Run("foo", func(t *testing.T) {
+                               panic("foo")
+                       })
+
                        req := newSolveRequest(withBuildTarget(testConfig.Target.Container), withSpec(ctx, t, spec))
                        solveT(ctx, t, client, req)
                })

Then run:

go test --timeout=10s ./test -run=TestRockylinux9/test_systemd_unit_single -v

Copilot AI review requested due to automatic review settings December 10, 2025 19:07

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a deadlock issue in the test infrastructure that occurs when a sub-test panics during test execution. The fix ensures that the done channel is properly closed via t.Cleanup rather than relying on a goroutine's defer statement, preventing the cleanup function in getBuildOutputStream from blocking indefinitely.

Key changes:

  • Moved channel close operation from goroutine's defer to t.Cleanup in displaySolveStatus
  • Added clear comments explaining the cleanup ordering requirement to prevent deadlocks

MorrisLaw
MorrisLaw previously approved these changes Dec 10, 2025
Comment thread test/testenv/build.go Outdated

// This cleanup must run before cleanup from getBuildOutputStream, otherwise cleanup in getBuildOutputStream
// may block forever if for example a sub-test panics.
t.Cleanup(func() { close(done) })

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the right fix.
We want to block until display.UpdateFrom finishes so that we have a complete log file before attempting to read it.

So the issue is display.UpdateFrom is not finishing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so here is minimal reproducer for this issue:

package test

import (
	"bufio"
	"context"
	"io"
	"os"
	"path/filepath"
	"testing"

	"github.com/moby/buildkit/client"
	gwclient "github.com/moby/buildkit/frontend/gateway/client"
	"github.com/moby/buildkit/util/progress/progressui"
)

func getBuildOutputStream(ctx context.Context, t *testing.T, done <-chan struct{}) io.Writer {
	t.Helper()

	f, err := os.OpenFile(filepath.Join(t.TempDir(), "build.log"), os.O_RDWR|os.O_CREATE, 0o600)
	if err != nil {
		t.Fatalf("Error opening temp file: %v", err)
	}

	t.Cleanup(func() {
		defer f.Close()

		select { // TODO: Blocks here.
		case <-ctx.Done():
			return
		case <-done:
		}

		_, err = f.Seek(0, io.SeekStart)
		if err != nil {
			t.Log(err)
			return
		}

		scanner := bufio.NewScanner(f)
		for scanner.Scan() {
			t.Log(scanner.Text())
		}
		if err := scanner.Err(); err != nil {
			t.Log(err)
		}
	})

	return f
}

func displaySolveStatus(ctx context.Context, t *testing.T) chan *client.SolveStatus {
	ch := make(chan *client.SolveStatus)
	done := make(chan struct{})

	output := getBuildOutputStream(ctx, t, done)
	display, err := progressui.NewDisplay(output, progressui.AutoMode, progressui.WithPhase(t.Name()))
	if err != nil {
		t.Fatal(err)
	}

	go func() {
		defer close(done)

		_, err := display.UpdateFrom(ctx, ch) // TODO: Blocks here.
		if err != nil {
			t.Log(err)
		}
	}()

	return ch
}

func build(ch chan *client.SolveStatus, f func(ctx context.Context, gwc gwclient.Client)) {
	f(nil, nil)
	close(ch)
}

func fakeRunTest(ctx context.Context, t *testing.T, f func(ctx context.Context, gwc gwclient.Client)) {
	ch := displaySolveStatus(ctx, t)
	build(ch, f)
}

func TestReproducer(t *testing.T) {
	ctx := context.Background()

	t.Run("foo", func(t *testing.T) {
		t.Parallel()

		fakeRunTest(ctx, t, func(ctx context.Context, gwc gwclient.Client) {
			t.Run("panicking-subtest", func(t *testing.T) {
				panic("subtest error")
			})
		})
	})
}

Indeed, the issue is that display.UpdateFrom(ctx, ch) never returns, done never gets closed, so t.Cleanup() in getBuildOutputStream() never returns.

Now, display.UpdateFrom will return either when ctx is cancelled or ch is closed. We don't cancel the context right now and ch never gets closed as it is normally done by buildkit client Build() call, but since this one panics, defer closing the channel never runs.

Maybe we need something like this then?

  ch := displaySolveStatus(ctx, t)
  t.Cleanup(func() {
    if t.Failed() {
      close(ch)                                                                                                                    }
  })
  build(ch, f)

So if build() call never finishes and test fails, we close the channel explicitly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm,

  t.Cleanup(func() {
    if t.Failed() {
      close(ch)                                                                                                                    }
  })

that seems to panic sometimes.

I am trying the following patch right now:

diff --git test/testenv/buildx.go test/testenv/buildx.go
index b099b61..a3a12e2 100644
--- test/testenv/buildx.go
+++ test/testenv/buildx.go
@@ -384,6 +384,18 @@ func (b *BuildxEnv) RunTest(ctx context.Context, t *testing.T, f TestFunc, opts
                f(&so)
        }

+       buildComplete := false
+
+       t.Cleanup(func() {
+               if t.Failed() && !buildComplete {
+                       select {
+                       case <-ch:
+                       default:
+                               close(ch)
+                       }
+               }
+       })
+
        _, err = c.Build(ctx, so, "", func(ctx context.Context, gwc gwclient.Client) (*gwclient.Result, error) {
                gwc = &clientForceDalecWithInput{gwc}

@@ -393,6 +405,8 @@ func (b *BuildxEnv) RunTest(ctx context.Context, t *testing.T, f TestFunc, opts
                }
                b.mu.Unlock()
                f(ctx, gwc)
+               buildComplete = true
+
                return gwclient.NewResult(), nil
        }, ch)

@MorrisLaw MorrisLaw dismissed their stale review December 10, 2025 19:38

we need to address Brian's comment about display.UpdateFrom being the culprit

Right now if a sub-test inside TestFunc panics, test run deadlocks,
because ch channel never gets closed by Build() which causes cleanup from
getBuildOutputStream() to block forever.

Closing ch channel on t.Cleanup when test fails, Build() did not
return and ch did not get closed seems to be addressing the issue.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
@invidian invidian force-pushed the invidian/fix-integration-test-deadlock-on-panic branch from 8bbd095 to 13e7e58 Compare December 12, 2025 16:15
@invidian invidian requested a review from cpuguy83 December 12, 2025 16:16
Comment thread test/testenv/buildx.go
select {
case <-ch:
default:
close(ch)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a channel we own.
The buildkit client should be closing this when the build completes.

Have you tried doing a recover from the panic inside c.Build?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think I finally understood how the execution flow works, but I am not sure about the fix yet. So IIUC, this is what happens:

  1. A sub-test created with t.Run panics. It runs in a separate goroutine as in https://cs.opensource.google/go/go/+/refs/tags/go1.25.5:src/testing/testing.go;l=1997.
  2. On panic, Go runtime starts calling defer function in reverse order to the calls, so first its test cleanup called: https://cs.opensource.google/go/go/+/refs/tags/go1.25.5:src/testing/testing.go;l=1926.
  3. If I understand correctly and from what I have observed, all cleanup functions get called in revert order from being registered, here: https://cs.opensource.google/go/go/+/refs/tags/go1.25.5:src/testing/testing.go;l=1560 if there is no sub-tests or tests running in parallel.
    Example:
func TestPanicPropagation(t *testing.T) {
  t.Cleanup(func() {
    fmt.Println("Grandparent cleanup called")
  })

  t.Run("parent", func(t *testing.T) {
    t.Cleanup(func() {
      fmt.Println("Parent cleanup called")
    })

    t.Run("panicking-test", func(t *testing.T) {
      t.Cleanup(func() {
        fmt.Println("Panic child cleanup called")
      })

      panic("panic")
    })

    t.Run("regular-test", func(t *testing.T) {
      t.Cleanup(func() {
        fmt.Println("Regular child cleanup called")
      })
    })
  })

  t.Run("parent-regular", func(t *testing.T) {
    t.Cleanup(func() {
      fmt.Println("Parent regular child cleanup called")
    })
  })
}

Gives:

=== RUN   TestPanicPropagation
=== RUN   TestPanicPropagation/parent
=== RUN   TestPanicPropagation/parent/panicking-test
Panic child cleanup called
Parent cleanup called
Grandparent cleanup called
--- FAIL: TestPanicPropagation (0.00s)
    --- FAIL: TestPanicPropagation/parent (0.00s)
        --- FAIL: TestPanicPropagation/parent/panicking-test (0.00s)
panic: panic [recovered, repanicked]

goroutine 22 [running]:
testing.tRunner.func1.2({0xc52420, 0xeeebe0})
        testing/testing.go:1872 +0x237
testing.tRunner.func1()
        testing/testing.go:1875 +0x35b
panic({0xc52420?, 0xeeebe0?})
        runtime/panic.go:783 +0x132
command-line-arguments.TestPanicPropagation.func2.2(0xc000105180?)
        ./foo_test.go:102 +0x2f
testing.tRunner(0xc000105180, 0xe133c0)
        testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 21
        testing/testing.go:1997 +0x465
FAIL    command-line-arguments  0.007s
FAIL

This means first cleanup to be called is the one in getBuildOutputStream(), blocking until either done is closed or ctx is cancelled.

  1. While the cleanup is running, done would normally be closed by displaySolveStatus() once ch is closed by c.Build. However, that never happens as c.Build blocks calling t.Run(), which panicked and is now running the stuck cleanups. This of course creates a deadlock.

Now that I think about it more, it's not a huge deal overall, since it will just cause the test suite to hang until timeout when panic inside RunTest combined with t.Run occurs, which is probably pretty rare, even though I ran into in pretty quickly. The tests are still going to fail, so we don't risk any false negatives here, just some annoyance.

Still, that seems like a bug in a RunTest. If we decide that a fix for this is not trivial, I will just open an issue for it, so we defer fixing it.

I am also curious about the purpose of RunTest itself. I understand it provides building capabilities with gwclient.Client integrated with testing.T, but I am wondering if building a test suite, with multiple sub-tests, is it desired to have those tests inside or outside RunTest (as this is the source of this issue actually).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each subtest needs to do its own RunTest call.
RunTest is really about providing a gwclient.Client that has the frontend wired in, and in order to get there the normal (non-gwclient) build also has to be configured.

The testing.T integration is for removing boiler plate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about structured sub-tests to aim for having a one assertion per sub-test? So something like:

    t.Run("result_metadata_contains_image_config", func(t *testing.T) {
      t.Parallel()

      testEnv.RunTest(ctx, t, func(ctx context.Context, gwc gwclient.Client) {
        spec := spec
        spec.Tests = nil

        sr := newSolveRequest(
          withSpec(ctx, t, &spec),
          withBuildTarget(testConfig.Target.Container),
        )

        beforeBuild := time.Now()
        res := solveT(ctx, t, gwc, sr)

        dt, ok := res.Metadata[exptypes.ExporterImageConfigKey]
        assert.Assert(t, ok, "result metadata should contain an image config: available metadata: %s", strings.Join(maps.Keys(res.Metadata), ", "))

        t.Run("in_docker_image_spec_format", func(t *testing.T) {
          var cfg dalec.DockerImageSpec
          assert.Assert(t, json.Unmarshal(dt, &cfg))

          t.Run("with_created_timestamp_older_than_current_time", func(t *testing.T) {
            assert.Check(t, cfg.Created.After(beforeBuild))
            assert.Check(t, cfg.Created.Before(time.Now()))
          })
        })
      })
    })

should actually be:

		t.Run("result_metadata_contains_image_config", func(t *testing.T) {
			t.Parallel()

			beforeBuild := time.Now()
			var res *gwclient.Result

			testEnv.RunTest(ctx, t, func(ctx context.Context, gwc gwclient.Client) {
				spec := spec
				spec.Tests = nil

				sr := newSolveRequest(
					withSpec(ctx, t, &spec),
					withBuildTarget(testConfig.Target.Container),
				)

				res = solveT(ctx, t, gwc, sr)
			})

			dt, ok := res.Metadata[exptypes.ExporterImageConfigKey]
			assert.Assert(t, ok, "result metadata should contain an image config: available metadata: %s", strings.Join(maps.Keys(res.Metadata), ", "))

			t.Run("in_docker_image_spec_format", func(t *testing.T) {
				var cfg dalec.DockerImageSpec
				assert.Assert(t, json.Unmarshal(dt, &cfg))

				t.Run("with_created_timestamp_older_than_current_time", func(t *testing.T) {
					assert.Check(t, cfg.Created.After(beforeBuild))
					assert.Check(t, cfg.Created.Before(time.Now()))
				})
			})
		})

What do you think?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's all fine.
Recalling more on this now, I think maybe its the t.Parallel that can have interesting interactions, e.g.:

t.Parallel()
// whatever test setup
testEnv.RunTest(func(ctx context.Context, gwc gwclient.Client) {
  t.Run("whatever", func(t *testing.T) {
    t.Parallel()
  })
})

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.Parallel() added to the mix is yet another scenario to consider indeed. But I think the code should be safe to handle both parallel and non-parallel tests.

I wonder then if we want to change something to the test code to handle that somehow better (like proposed in this PR) or we just don't call t.Run inside testEnv.RunTest by convention (or linter etc.).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.Run should be fine, but t.Run with t.Parallel inside of a t.Parallel is the problem.

But yeah, linter to help prevent the bad case is probably simplest.

@invidian

Copy link
Copy Markdown
Contributor Author

Opened #905 to eventually address this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants