test/testenv/build.go: don't deadlock when sub-test panics#895
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a deadlock issue in the test infrastructure that occurs when a sub-test panics during test execution. The fix ensures that the done channel is properly closed via t.Cleanup rather than relying on a goroutine's defer statement, preventing the cleanup function in getBuildOutputStream from blocking indefinitely.
Key changes:
- Moved channel close operation from goroutine's defer to
t.CleanupindisplaySolveStatus - Added clear comments explaining the cleanup ordering requirement to prevent deadlocks
|
|
||
| // This cleanup must run before cleanup from getBuildOutputStream, otherwise cleanup in getBuildOutputStream | ||
| // may block forever if for example a sub-test panics. | ||
| t.Cleanup(func() { close(done) }) |
There was a problem hiding this comment.
I'm not sure this is the right fix.
We want to block until display.UpdateFrom finishes so that we have a complete log file before attempting to read it.
So the issue is display.UpdateFrom is not finishing.
There was a problem hiding this comment.
Hmm, so here is minimal reproducer for this issue:
package test
import (
"bufio"
"context"
"io"
"os"
"path/filepath"
"testing"
"github.com/moby/buildkit/client"
gwclient "github.com/moby/buildkit/frontend/gateway/client"
"github.com/moby/buildkit/util/progress/progressui"
)
func getBuildOutputStream(ctx context.Context, t *testing.T, done <-chan struct{}) io.Writer {
t.Helper()
f, err := os.OpenFile(filepath.Join(t.TempDir(), "build.log"), os.O_RDWR|os.O_CREATE, 0o600)
if err != nil {
t.Fatalf("Error opening temp file: %v", err)
}
t.Cleanup(func() {
defer f.Close()
select { // TODO: Blocks here.
case <-ctx.Done():
return
case <-done:
}
_, err = f.Seek(0, io.SeekStart)
if err != nil {
t.Log(err)
return
}
scanner := bufio.NewScanner(f)
for scanner.Scan() {
t.Log(scanner.Text())
}
if err := scanner.Err(); err != nil {
t.Log(err)
}
})
return f
}
func displaySolveStatus(ctx context.Context, t *testing.T) chan *client.SolveStatus {
ch := make(chan *client.SolveStatus)
done := make(chan struct{})
output := getBuildOutputStream(ctx, t, done)
display, err := progressui.NewDisplay(output, progressui.AutoMode, progressui.WithPhase(t.Name()))
if err != nil {
t.Fatal(err)
}
go func() {
defer close(done)
_, err := display.UpdateFrom(ctx, ch) // TODO: Blocks here.
if err != nil {
t.Log(err)
}
}()
return ch
}
func build(ch chan *client.SolveStatus, f func(ctx context.Context, gwc gwclient.Client)) {
f(nil, nil)
close(ch)
}
func fakeRunTest(ctx context.Context, t *testing.T, f func(ctx context.Context, gwc gwclient.Client)) {
ch := displaySolveStatus(ctx, t)
build(ch, f)
}
func TestReproducer(t *testing.T) {
ctx := context.Background()
t.Run("foo", func(t *testing.T) {
t.Parallel()
fakeRunTest(ctx, t, func(ctx context.Context, gwc gwclient.Client) {
t.Run("panicking-subtest", func(t *testing.T) {
panic("subtest error")
})
})
})
}Indeed, the issue is that display.UpdateFrom(ctx, ch) never returns, done never gets closed, so t.Cleanup() in getBuildOutputStream() never returns.
Now, display.UpdateFrom will return either when ctx is cancelled or ch is closed. We don't cancel the context right now and ch never gets closed as it is normally done by buildkit client Build() call, but since this one panics, defer closing the channel never runs.
Maybe we need something like this then?
ch := displaySolveStatus(ctx, t)
t.Cleanup(func() {
if t.Failed() {
close(ch) }
})
build(ch, f)So if build() call never finishes and test fails, we close the channel explicitly?
There was a problem hiding this comment.
Hmm,
t.Cleanup(func() {
if t.Failed() {
close(ch) }
})that seems to panic sometimes.
I am trying the following patch right now:
diff --git test/testenv/buildx.go test/testenv/buildx.go
index b099b61..a3a12e2 100644
--- test/testenv/buildx.go
+++ test/testenv/buildx.go
@@ -384,6 +384,18 @@ func (b *BuildxEnv) RunTest(ctx context.Context, t *testing.T, f TestFunc, opts
f(&so)
}
+ buildComplete := false
+
+ t.Cleanup(func() {
+ if t.Failed() && !buildComplete {
+ select {
+ case <-ch:
+ default:
+ close(ch)
+ }
+ }
+ })
+
_, err = c.Build(ctx, so, "", func(ctx context.Context, gwc gwclient.Client) (*gwclient.Result, error) {
gwc = &clientForceDalecWithInput{gwc}
@@ -393,6 +405,8 @@ func (b *BuildxEnv) RunTest(ctx context.Context, t *testing.T, f TestFunc, opts
}
b.mu.Unlock()
f(ctx, gwc)
+ buildComplete = true
+
return gwclient.NewResult(), nil
}, ch)we need to address Brian's comment about display.UpdateFrom being the culprit
Right now if a sub-test inside TestFunc panics, test run deadlocks, because ch channel never gets closed by Build() which causes cleanup from getBuildOutputStream() to block forever. Closing ch channel on t.Cleanup when test fails, Build() did not return and ch did not get closed seems to be addressing the issue. Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
8bbd095 to
13e7e58
Compare
| select { | ||
| case <-ch: | ||
| default: | ||
| close(ch) |
There was a problem hiding this comment.
This is not a channel we own.
The buildkit client should be closing this when the build completes.
Have you tried doing a recover from the panic inside c.Build?
There was a problem hiding this comment.
There was a problem hiding this comment.
Hmm, I think I finally understood how the execution flow works, but I am not sure about the fix yet. So IIUC, this is what happens:
- A sub-test created with
t.Runpanics. It runs in a separate goroutine as in https://cs.opensource.google/go/go/+/refs/tags/go1.25.5:src/testing/testing.go;l=1997. - On panic, Go runtime starts calling
deferfunction in reverse order to the calls, so first its test cleanup called: https://cs.opensource.google/go/go/+/refs/tags/go1.25.5:src/testing/testing.go;l=1926. - If I understand correctly and from what I have observed, all cleanup functions get called in revert order from being registered, here: https://cs.opensource.google/go/go/+/refs/tags/go1.25.5:src/testing/testing.go;l=1560 if there is no sub-tests or tests running in parallel.
Example:
func TestPanicPropagation(t *testing.T) {
t.Cleanup(func() {
fmt.Println("Grandparent cleanup called")
})
t.Run("parent", func(t *testing.T) {
t.Cleanup(func() {
fmt.Println("Parent cleanup called")
})
t.Run("panicking-test", func(t *testing.T) {
t.Cleanup(func() {
fmt.Println("Panic child cleanup called")
})
panic("panic")
})
t.Run("regular-test", func(t *testing.T) {
t.Cleanup(func() {
fmt.Println("Regular child cleanup called")
})
})
})
t.Run("parent-regular", func(t *testing.T) {
t.Cleanup(func() {
fmt.Println("Parent regular child cleanup called")
})
})
}Gives:
=== RUN TestPanicPropagation
=== RUN TestPanicPropagation/parent
=== RUN TestPanicPropagation/parent/panicking-test
Panic child cleanup called
Parent cleanup called
Grandparent cleanup called
--- FAIL: TestPanicPropagation (0.00s)
--- FAIL: TestPanicPropagation/parent (0.00s)
--- FAIL: TestPanicPropagation/parent/panicking-test (0.00s)
panic: panic [recovered, repanicked]
goroutine 22 [running]:
testing.tRunner.func1.2({0xc52420, 0xeeebe0})
testing/testing.go:1872 +0x237
testing.tRunner.func1()
testing/testing.go:1875 +0x35b
panic({0xc52420?, 0xeeebe0?})
runtime/panic.go:783 +0x132
command-line-arguments.TestPanicPropagation.func2.2(0xc000105180?)
./foo_test.go:102 +0x2f
testing.tRunner(0xc000105180, 0xe133c0)
testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 21
testing/testing.go:1997 +0x465
FAIL command-line-arguments 0.007s
FAILThis means first cleanup to be called is the one in getBuildOutputStream(), blocking until either done is closed or ctx is cancelled.
- While the cleanup is running,
donewould normally be closed bydisplaySolveStatus()oncechis closed byc.Build. However, that never happens asc.Buildblocks callingt.Run(), which panicked and is now running the stuck cleanups. This of course creates a deadlock.
Now that I think about it more, it's not a huge deal overall, since it will just cause the test suite to hang until timeout when panic inside RunTest combined with t.Run occurs, which is probably pretty rare, even though I ran into in pretty quickly. The tests are still going to fail, so we don't risk any false negatives here, just some annoyance.
Still, that seems like a bug in a RunTest. If we decide that a fix for this is not trivial, I will just open an issue for it, so we defer fixing it.
I am also curious about the purpose of RunTest itself. I understand it provides building capabilities with gwclient.Client integrated with testing.T, but I am wondering if building a test suite, with multiple sub-tests, is it desired to have those tests inside or outside RunTest (as this is the source of this issue actually).
There was a problem hiding this comment.
Each subtest needs to do its own RunTest call.
RunTest is really about providing a gwclient.Client that has the frontend wired in, and in order to get there the normal (non-gwclient) build also has to be configured.
The testing.T integration is for removing boiler plate.
There was a problem hiding this comment.
What about structured sub-tests to aim for having a one assertion per sub-test? So something like:
t.Run("result_metadata_contains_image_config", func(t *testing.T) {
t.Parallel()
testEnv.RunTest(ctx, t, func(ctx context.Context, gwc gwclient.Client) {
spec := spec
spec.Tests = nil
sr := newSolveRequest(
withSpec(ctx, t, &spec),
withBuildTarget(testConfig.Target.Container),
)
beforeBuild := time.Now()
res := solveT(ctx, t, gwc, sr)
dt, ok := res.Metadata[exptypes.ExporterImageConfigKey]
assert.Assert(t, ok, "result metadata should contain an image config: available metadata: %s", strings.Join(maps.Keys(res.Metadata), ", "))
t.Run("in_docker_image_spec_format", func(t *testing.T) {
var cfg dalec.DockerImageSpec
assert.Assert(t, json.Unmarshal(dt, &cfg))
t.Run("with_created_timestamp_older_than_current_time", func(t *testing.T) {
assert.Check(t, cfg.Created.After(beforeBuild))
assert.Check(t, cfg.Created.Before(time.Now()))
})
})
})
})should actually be:
t.Run("result_metadata_contains_image_config", func(t *testing.T) {
t.Parallel()
beforeBuild := time.Now()
var res *gwclient.Result
testEnv.RunTest(ctx, t, func(ctx context.Context, gwc gwclient.Client) {
spec := spec
spec.Tests = nil
sr := newSolveRequest(
withSpec(ctx, t, &spec),
withBuildTarget(testConfig.Target.Container),
)
res = solveT(ctx, t, gwc, sr)
})
dt, ok := res.Metadata[exptypes.ExporterImageConfigKey]
assert.Assert(t, ok, "result metadata should contain an image config: available metadata: %s", strings.Join(maps.Keys(res.Metadata), ", "))
t.Run("in_docker_image_spec_format", func(t *testing.T) {
var cfg dalec.DockerImageSpec
assert.Assert(t, json.Unmarshal(dt, &cfg))
t.Run("with_created_timestamp_older_than_current_time", func(t *testing.T) {
assert.Check(t, cfg.Created.After(beforeBuild))
assert.Check(t, cfg.Created.Before(time.Now()))
})
})
})What do you think?
There was a problem hiding this comment.
That's all fine.
Recalling more on this now, I think maybe its the t.Parallel that can have interesting interactions, e.g.:
t.Parallel()
// whatever test setup
testEnv.RunTest(func(ctx context.Context, gwc gwclient.Client) {
t.Run("whatever", func(t *testing.T) {
t.Parallel()
})
})
There was a problem hiding this comment.
t.Parallel() added to the mix is yet another scenario to consider indeed. But I think the code should be safe to handle both parallel and non-parallel tests.
I wonder then if we want to change something to the test code to handle that somehow better (like proposed in this PR) or we just don't call t.Run inside testEnv.RunTest by convention (or linter etc.).
There was a problem hiding this comment.
t.Run should be fine, but t.Run with t.Parallel inside of a t.Parallel is the problem.
But yeah, linter to help prevent the bad case is probably simplest.
|
Opened #905 to eventually address this. |
What this PR does / why we need it:
Right now if a sub-test inside RunTest() panics, test run deadlocks, because done channel never gets closed, which causes cleanup from getBuildOutputStream() to block forever.
Closing done channel on t.Cleanup before goroutine starts ensuring that the channel is closed when test ends, even if it panics and without the race condition hazard in case test would panic before goroutine starts.
Special notes for your reviewer:
To reproduce existing issue, apply the following patch:
Then run:
go test --timeout=10s ./test -run=TestRockylinux9/test_systemd_unit_single -v