Skip to content

fix(test): root-cause the recurring TempDir teardown CI flake#3403

Merged
esengine merged 1 commit into
main-v2from
fix/test-tempdir-teardown-flaky
Jun 7, 2026
Merged

fix(test): root-cause the recurring TempDir teardown CI flake#3403
esengine merged 1 commit into
main-v2from
fix/test-tempdir-teardown-flaky

Conversation

@esengine

@esengine esengine commented Jun 7, 2026

Copy link
Copy Markdown
Owner

Problem

Windows/Linux CI intermittently reddens on t.TempDir auto-cleanup even though every assertion in the test passed:

  • Windows: TempDir RemoveAll cleanup: ... The process cannot access the file because it is being used by another process.
  • Linux: TempDir RemoveAll cleanup: ... directory not empty

It hits different tests each run — internal/boot legacy-migration tests and desktop capabilities tests — which is why reruns kept landing on it (the recent [1/6]..[6/6] series got blocked on it repeatedly).

#3371 attributed this to t.Chdir and switched a single test to WorkspaceRoot, but that same test (TestBuildMigratesLegacySessionsFromConfigSessionDir) kept failing afterward — so the real cause is a teardown race, not the process cwd.

Root cause

The failing tests all build a full Controller. At teardown a background resource can still hold a file under the temp dir for a few milliseconds after Close() returns, and t.TempDir's automatic RemoveAll races it.

Fixes (two, independent)

  1. jobs.Manager.Close is now graceful. It previously only cancel()-ed the session context and returned without waiting for job goroutines to observe the cancel and exit. Added a sync.WaitGroup: Start does Add(1)/defer Done(), Close does cancel() then Wait(). Jobs observe the cancel through their run context (exec.CommandContext kills a bash job's process), so the wait is bounded.

  2. robustTempDir test helper (boot + desktop) that retries RemoveAll for a short window, absorbing the brief post-Close interval where a background resource still holds a file. A dir that never frees is t.Logf-ed, not fatal, so a genuine leak stays visible without reintroducing the flake.

Verification

  • gofmt clean; go vet clean.
  • internal/jobs, internal/control, internal/boot pass; the two previously-flaky boot tests pass -count=200 locally.
  • desktop test binary compiles.
  • Real validation is this PR's own Windows + desktop CI (the flake only reproduces there).

Follow-up

Unblocks #3367 and #3370 (the last two of the [1/6]..[6/6] series), which were red purely on this flake. Once this lands, those two get rebased onto it and should go green normally.

Windows/Linux CI intermittently reddened on t.TempDir auto-cleanup
(RemoveAll: "being used by another process" on Windows, "directory not
empty" on Linux) even though every assertion passed — across the boot
legacy-migration tests and the desktop capabilities tests. #3371 blamed
t.Chdir and fixed one test, but that same test kept failing afterward, so
the real cause is a teardown race, not the cwd.

Two independent fixes:

1. jobs.Manager.Close only cancelled the session context and returned
   without waiting for job goroutines to observe the cancel and exit, so a
   caller tearing down a t.TempDir could race a goroutine still holding a
   file under it. Close now waits on a WaitGroup — shutdown is graceful.

2. Add a robustTempDir test helper (boot + desktop) that retries RemoveAll
   for a short window, absorbing the brief post-Close interval where a
   background resource still holds a file under the dir. A dir that never
   frees is logged, not fatal, so a genuine leak stays visible.
@esengine esengine requested a review from SivanCola as a code owner June 7, 2026 01:39
@github-actions github-actions Bot added v2 Go rewrite (1.x) — main-v2 branch, active development desktop Wails desktop app (desktop/**) config Configuration & setup (internal/config) and removed v2 Go rewrite (1.x) — main-v2 branch, active development labels Jun 7, 2026
@esengine esengine merged commit 55e131b into main-v2 Jun 7, 2026
10 checks passed
@esengine esengine deleted the fix/test-tempdir-teardown-flaky branch June 7, 2026 01:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

config Configuration & setup (internal/config) desktop Wails desktop app (desktop/**)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant