test/testenv/build.go: don't deadlock when sub-test panics by invidian · Pull Request #895 · project-dalec/dalec

invidian · 2025-12-10T19:07:56Z

What this PR does / why we need it:

Right now if a sub-test inside RunTest() panics, test run deadlocks, because done channel never gets closed, which causes cleanup from getBuildOutputStream() to block forever.

Closing done channel on t.Cleanup before goroutine starts ensuring that the channel is closed when test ends, even if it panics and without the race condition hazard in case test would panic before goroutine starts.

Special notes for your reviewer:

To reproduce existing issue, apply the following patch:

diff --git test/linux_target_test.go test/linux_target_test.go
index 4bcb84d..998fbb6 100644
--- test/linux_target_test.go
+++ test/linux_target_test.go
@@ -1252,6 +1252,10 @@ WantedBy=multi-user.target
                }

                testEnv.RunTest(ctx, t, func(ctx context.Context, client gwclient.Client) {
+                       t.Run("foo", func(t *testing.T) {
+                               panic("foo")
+                       })
+
                        req := newSolveRequest(withBuildTarget(testConfig.Target.Container), withSpec(ctx, t, spec))
                        solveT(ctx, t, client, req)
                })

Then run:

go test --timeout=10s ./test -run=TestRockylinux9/test_systemd_unit_single -v

Copilot

Pull request overview

This PR fixes a deadlock issue in the test infrastructure that occurs when a sub-test panics during test execution. The fix ensures that the done channel is properly closed via t.Cleanup rather than relying on a goroutine's defer statement, preventing the cleanup function in getBuildOutputStream from blocking indefinitely.

Key changes:

Moved channel close operation from goroutine's defer to t.Cleanup in displaySolveStatus
Added clear comments explaining the cleanup ordering requirement to prevent deadlocks

cpuguy83 · 2025-12-10T19:33:46Z

+
+	// This cleanup must run before cleanup from getBuildOutputStream, otherwise cleanup in getBuildOutputStream
+	// may block forever if for example a sub-test panics.
+	t.Cleanup(func() { close(done) })


I'm not sure this is the right fix.
We want to block until display.UpdateFrom finishes so that we have a complete log file before attempting to read it.

So the issue is display.UpdateFrom is not finishing.

Hmm, so here is minimal reproducer for this issue:

package test import ( "bufio" "context" "io" "os" "path/filepath" "testing" "github.com/moby/buildkit/client" gwclient "github.com/moby/buildkit/frontend/gateway/client" "github.com/moby/buildkit/util/progress/progressui" ) func getBuildOutputStream(ctx context.Context, t *testing.T, done <-chan struct{}) io.Writer { t.Helper() f, err := os.OpenFile(filepath.Join(t.TempDir(), "build.log"), os.O_RDWR|os.O_CREATE, 0o600) if err != nil { t.Fatalf("Error opening temp file: %v", err) } t.Cleanup(func() { defer f.Close() select { // TODO: Blocks here. case <-ctx.Done(): return case <-done: } _, err = f.Seek(0, io.SeekStart) if err != nil { t.Log(err) return } scanner := bufio.NewScanner(f) for scanner.Scan() { t.Log(scanner.Text()) } if err := scanner.Err(); err != nil { t.Log(err) } }) return f } func displaySolveStatus(ctx context.Context, t *testing.T) chan *client.SolveStatus { ch := make(chan *client.SolveStatus) done := make(chan struct{}) output := getBuildOutputStream(ctx, t, done) display, err := progressui.NewDisplay(output, progressui.AutoMode, progressui.WithPhase(t.Name())) if err != nil { t.Fatal(err) } go func() { defer close(done) _, err := display.UpdateFrom(ctx, ch) // TODO: Blocks here. if err != nil { t.Log(err) } }() return ch } func build(ch chan *client.SolveStatus, f func(ctx context.Context, gwc gwclient.Client)) { f(nil, nil) close(ch) } func fakeRunTest(ctx context.Context, t *testing.T, f func(ctx context.Context, gwc gwclient.Client)) { ch := displaySolveStatus(ctx, t) build(ch, f) } func TestReproducer(t *testing.T) { ctx := context.Background() t.Run("foo", func(t *testing.T) { t.Parallel() fakeRunTest(ctx, t, func(ctx context.Context, gwc gwclient.Client) { t.Run("panicking-subtest", func(t *testing.T) { panic("subtest error") }) }) }) }

Indeed, the issue is that display.UpdateFrom(ctx, ch) never returns, done never gets closed, so t.Cleanup() in getBuildOutputStream() never returns.

Now, display.UpdateFrom will return either when ctx is cancelled or ch is closed. We don't cancel the context right now and ch never gets closed as it is normally done by buildkit client Build() call, but since this one panics, defer closing the channel never runs.

Maybe we need something like this then?

ch := displaySolveStatus(ctx, t) t.Cleanup(func() { if t.Failed() { close(ch) } }) build(ch, f)

So if build() call never finishes and test fails, we close the channel explicitly?

Hmm,

t.Cleanup(func() { if t.Failed() { close(ch) } })

that seems to panic sometimes.

I am trying the following patch right now:

diff --git test/testenv/buildx.go test/testenv/buildx.go index b099b61..a3a12e2 100644 --- test/testenv/buildx.go +++ test/testenv/buildx.go @@ -384,6 +384,18 @@ func (b *BuildxEnv) RunTest(ctx context.Context, t *testing.T, f TestFunc, opts f(&so) } + buildComplete := false + + t.Cleanup(func() { + if t.Failed() && !buildComplete { + select { + case <-ch: + default: + close(ch) + } + } + }) + _, err = c.Build(ctx, so, "", func(ctx context.Context, gwc gwclient.Client) (*gwclient.Result, error) { gwc = &clientForceDalecWithInput{gwc} @@ -393,6 +405,8 @@ func (b *BuildxEnv) RunTest(ctx context.Context, t *testing.T, f TestFunc, opts } b.mu.Unlock() f(ctx, gwc) + buildComplete = true + return gwclient.NewResult(), nil }, ch)

we need to address Brian's comment about display.UpdateFrom being the culprit

Right now if a sub-test inside TestFunc panics, test run deadlocks, because ch channel never gets closed by Build() which causes cleanup from getBuildOutputStream() to block forever. Closing ch channel on t.Cleanup when test fails, Build() did not return and ch did not get closed seems to be addressing the issue. Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

cpuguy83 · 2025-12-12T16:53:06Z

+			select {
+			case <-ch:
+			default:
+				close(ch)


This is not a channel we own.
The buildkit client should be closing this when the build completes.

Have you tried doing a recover from the panic inside c.Build?

Funnily enough: https://github.com/project-copacetic/copacetic/pull/1399/changes#r2614958617

Hmm, I think I finally understood how the execution flow works, but I am not sure about the fix yet. So IIUC, this is what happens:

A sub-test created with t.Run panics. It runs in a separate goroutine as in https://cs.opensource.google/go/go/+/refs/tags/go1.25.5:src/testing/testing.go;l=1997.

On panic, Go runtime starts calling defer function in reverse order to the calls, so first its test cleanup called: https://cs.opensource.google/go/go/+/refs/tags/go1.25.5:src/testing/testing.go;l=1926.

If I understand correctly and from what I have observed, all cleanup functions get called in revert order from being registered, here: https://cs.opensource.google/go/go/+/refs/tags/go1.25.5:src/testing/testing.go;l=1560 if there is no sub-tests or tests running in parallel.
Example:

func TestPanicPropagation(t *testing.T) { t.Cleanup(func() { fmt.Println("Grandparent cleanup called") }) t.Run("parent", func(t *testing.T) { t.Cleanup(func() { fmt.Println("Parent cleanup called") }) t.Run("panicking-test", func(t *testing.T) { t.Cleanup(func() { fmt.Println("Panic child cleanup called") }) panic("panic") }) t.Run("regular-test", func(t *testing.T) { t.Cleanup(func() { fmt.Println("Regular child cleanup called") }) }) }) t.Run("parent-regular", func(t *testing.T) { t.Cleanup(func() { fmt.Println("Parent regular child cleanup called") }) }) }

Gives:

=== RUN TestPanicPropagation === RUN TestPanicPropagation/parent === RUN TestPanicPropagation/parent/panicking-test Panic child cleanup called Parent cleanup called Grandparent cleanup called --- FAIL: TestPanicPropagation (0.00s) --- FAIL: TestPanicPropagation/parent (0.00s) --- FAIL: TestPanicPropagation/parent/panicking-test (0.00s) panic: panic [recovered, repanicked] goroutine 22 [running]: testing.tRunner.func1.2({0xc52420, 0xeeebe0}) testing/testing.go:1872 +0x237 testing.tRunner.func1() testing/testing.go:1875 +0x35b panic({0xc52420?, 0xeeebe0?}) runtime/panic.go:783 +0x132 command-line-arguments.TestPanicPropagation.func2.2(0xc000105180?) ./foo_test.go:102 +0x2f testing.tRunner(0xc000105180, 0xe133c0) testing/testing.go:1934 +0xea created by testing.(*T).Run in goroutine 21 testing/testing.go:1997 +0x465 FAIL command-line-arguments 0.007s FAIL

This means first cleanup to be called is the one in getBuildOutputStream(), blocking until either done is closed or ctx is cancelled.

While the cleanup is running, done would normally be closed by displaySolveStatus() once ch is closed by c.Build. However, that never happens as c.Build blocks calling t.Run(), which panicked and is now running the stuck cleanups. This of course creates a deadlock.

Now that I think about it more, it's not a huge deal overall, since it will just cause the test suite to hang until timeout when panic inside RunTest combined with t.Run occurs, which is probably pretty rare, even though I ran into in pretty quickly. The tests are still going to fail, so we don't risk any false negatives here, just some annoyance.

Still, that seems like a bug in a RunTest. If we decide that a fix for this is not trivial, I will just open an issue for it, so we defer fixing it.

I am also curious about the purpose of RunTest itself. I understand it provides building capabilities with gwclient.Client integrated with testing.T, but I am wondering if building a test suite, with multiple sub-tests, is it desired to have those tests inside or outside RunTest (as this is the source of this issue actually).

Each subtest needs to do its own RunTest call.
RunTest is really about providing a gwclient.Client that has the frontend wired in, and in order to get there the normal (non-gwclient) build also has to be configured.

The testing.T integration is for removing boiler plate.

What about structured sub-tests to aim for having a one assertion per sub-test? So something like:

t.Run("result_metadata_contains_image_config", func(t *testing.T) { t.Parallel() testEnv.RunTest(ctx, t, func(ctx context.Context, gwc gwclient.Client) { spec := spec spec.Tests = nil sr := newSolveRequest( withSpec(ctx, t, &spec), withBuildTarget(testConfig.Target.Container), ) beforeBuild := time.Now() res := solveT(ctx, t, gwc, sr) dt, ok := res.Metadata[exptypes.ExporterImageConfigKey] assert.Assert(t, ok, "result metadata should contain an image config: available metadata: %s", strings.Join(maps.Keys(res.Metadata), ", ")) t.Run("in_docker_image_spec_format", func(t *testing.T) { var cfg dalec.DockerImageSpec assert.Assert(t, json.Unmarshal(dt, &cfg)) t.Run("with_created_timestamp_older_than_current_time", func(t *testing.T) { assert.Check(t, cfg.Created.After(beforeBuild)) assert.Check(t, cfg.Created.Before(time.Now())) }) }) }) })

should actually be:

t.Run("result_metadata_contains_image_config", func(t *testing.T) { t.Parallel() beforeBuild := time.Now() var res *gwclient.Result testEnv.RunTest(ctx, t, func(ctx context.Context, gwc gwclient.Client) { spec := spec spec.Tests = nil sr := newSolveRequest( withSpec(ctx, t, &spec), withBuildTarget(testConfig.Target.Container), ) res = solveT(ctx, t, gwc, sr) }) dt, ok := res.Metadata[exptypes.ExporterImageConfigKey] assert.Assert(t, ok, "result metadata should contain an image config: available metadata: %s", strings.Join(maps.Keys(res.Metadata), ", ")) t.Run("in_docker_image_spec_format", func(t *testing.T) { var cfg dalec.DockerImageSpec assert.Assert(t, json.Unmarshal(dt, &cfg)) t.Run("with_created_timestamp_older_than_current_time", func(t *testing.T) { assert.Check(t, cfg.Created.After(beforeBuild)) assert.Check(t, cfg.Created.Before(time.Now())) }) }) })

What do you think?

That's all fine.
Recalling more on this now, I think maybe its the t.Parallel that can have interesting interactions, e.g.:

t.Parallel() // whatever test setup testEnv.RunTest(func(ctx context.Context, gwc gwclient.Client) { t.Run("whatever", func(t *testing.T) { t.Parallel() }) })

t.Parallel() added to the mix is yet another scenario to consider indeed. But I think the code should be safe to handle both parallel and non-parallel tests.

I wonder then if we want to change something to the test code to handle that somehow better (like proposed in this PR) or we just don't call t.Run inside testEnv.RunTest by convention (or linter etc.).

t.Run should be fine, but t.Run with t.Parallel inside of a t.Parallel is the problem.

But yeah, linter to help prevent the bad case is probably simplest.

invidian · 2025-12-16T15:22:10Z

Opened #905 to eventually address this.

Copilot AI review requested due to automatic review settings December 10, 2025 19:07

Copilot started reviewing on behalf of invidian December 10, 2025 19:08 View session

Copilot AI reviewed Dec 10, 2025

View reviewed changes

MorrisLaw previously approved these changes Dec 10, 2025

View reviewed changes

cpuguy83 requested changes Dec 10, 2025

View reviewed changes

invidian force-pushed the invidian/fix-integration-test-deadlock-on-panic branch from 8bbd095 to 13e7e58 Compare December 12, 2025 16:15

invidian requested a review from cpuguy83 December 12, 2025 16:16

cpuguy83 reviewed Dec 12, 2025

View reviewed changes

invidian mentioned this pull request Dec 16, 2025

[BUG] test/testenv: RunTest deadlocks until go test timeouts when t.Run sub-test panics #905

Open

invidian closed this Dec 16, 2025

invidian mentioned this pull request Feb 20, 2026

[BUG] Docker daemon crashing or restarting causes integration tests to hang and timeout #972

Closed

1 task

Conversation

invidian commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

invidian commented Dec 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

invidian commented Dec 10, 2025 •

edited

Loading