Symptom
Four it.live tests in packages/opencode/test/session/prompt-effect.test.ts fail with duration hitting ~3015-3016ms on Windows CI, matching the explicit { timeout: 3_000 } passed to each helper call.
| Test | File:Line | Explicit timeout |
| --- | --- | --- |
| prompt submitted during an active run is included in the next LLM input | prompt-effect.test.ts:938 | 3_000 |
| shell rejects with BusyError when loop running | prompt-effect.test.ts:1054 | 3_000 |
| loop waits while shell runs and starts after shell exits | prompt-effect.test.ts:1214 | 3_000 |
| shell completion resumes queued loop callers | prompt-effect.test.ts:1252 | 3_000 |
Evidence (last 5 Windows CI runs on dev, 2026-04-16 → 2026-04-17)
| Test | Ubuntu pass | Windows pass p50 / max | Windows fail count |
| --- | --- | --- | --- |
| prompt submitted during active run | 324ms | 2281 / 2391ms | 1 × 3015ms |
| shell rejects with BusyError | 238ms | 1531 / 1531ms | 1 × 3016ms |
| loop waits while shell runs | 574ms | 2774 / 2828ms | 3 × 3016ms |
| shell completion resumes | 591ms | 2367 / 2375ms | 2 × 3016ms |
Fail duration (~3016ms) lands precisely on the explicit 3s cap. The slowest passing Windows run is 2828ms, which is 94% of the 3s ceiling, so any runner jitter pushes a run over.
Windows is 4-7× slower than Ubuntu on these tests. it.live uses a real clock (TestClock replaced with the live layer), so real runner slowness shows up here while it doesn't for tests using TestClock.
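The ratios quoted above can be re-derived from the evidence table. The sketch below is only a convenience for checking the arithmetic: the numbers are copied from the table, and the `Row` type and variable names are made up for this example.

```typescript
// Timings copied from the evidence table above (milliseconds).
type Row = { test: string; ubuntu: number; winP50: number; winMax: number }

const rows: Row[] = [
  { test: "prompt submitted during active run", ubuntu: 324, winP50: 2281, winMax: 2391 },
  { test: "shell rejects with BusyError", ubuntu: 238, winP50: 1531, winMax: 1531 },
  { test: "loop waits while shell runs", ubuntu: 574, winP50: 2774, winMax: 2828 },
  { test: "shell completion resumes", ubuntu: 591, winP50: 2367, winMax: 2375 },
]

const TIMEOUT_MS = 3_000

// slowdown   = Windows p50 divided by the Ubuntu pass time
// ceilingUse = slowest passing Windows run divided by the explicit timeout
const summary = rows.map((r) => ({
  test: r.test,
  slowdown: r.winP50 / r.ubuntu,
  ceilingUse: r.winMax / TIMEOUT_MS,
}))

for (const s of summary) {
  console.log(
    `${s.test}: ${s.slowdown.toFixed(1)}x slower, ${(s.ceilingUse * 100).toFixed(0)}% of timeout`,
  )
}
```

The slowdown comes out between roughly 4× and 7×, and the worst passing run uses about 94% of the 3s budget.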
Proposed fix
Add an OS-aware applyScale in packages/opencode/test/lib/effect.ts so it.live and testEffect helpers scale user-specified timeouts on Windows:
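    // Multiply user-specified timeouts by 3 on Windows; other platforms are unchanged.
    const scaleForOS = (base: number) =>
      process.platform === "win32" ? base * 3 : base

    // Accepts the same shapes the helpers take: a bare timeout or an options object.
    const applyScale = (opts?: number | TestOptions) => {
      if (opts === undefined) return undefined
      if (typeof opts === "number") return scaleForOS(opts)
      if (opts.timeout !== undefined) return { ...opts, timeout: scaleForOS(opts.timeout) }
      return opts
    }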
Apply inside effect, effect.only, effect.skip, live, live.only, live.skip — pass applyScale(opts) through to test(name, fn, ...).
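One way the pass-through could look is sketched below. This is not the actual contents of test/lib/effect.ts, whose shape is not shown here: `TestOptions` is a local stand-in type, `Register`, `makeApplyScale`, and `withScale` are hypothetical names, and the platform is taken as a parameter (instead of reading process.platform) so the behaviour can be checked on any OS.

```typescript
// Hypothetical stand-in for the runner's options type; only `timeout` matters here.
type TestOptions = { timeout?: number; retry?: number }
type Register = (name: string, fn: () => Promise<void>, opts?: number | TestOptions) => void

const WINDOWS_SCALE = 3

// Same logic as applyScale above, with the platform injected for testability.
const makeApplyScale = (platform: string) => {
  const scaleForOS = (base: number) => (platform === "win32" ? base * WINDOWS_SCALE : base)
  return (opts?: number | TestOptions): number | TestOptions | undefined => {
    if (opts === undefined) return undefined
    if (typeof opts === "number") return scaleForOS(opts)
    if (opts.timeout !== undefined) return { ...opts, timeout: scaleForOS(opts.timeout) }
    return opts
  }
}

// Wraps any registrar (test, test.only, test.skip) so scaled options reach it.
const withScale = (register: Register, platform: string): Register => {
  const applyScale = makeApplyScale(platform)
  return (name, fn, opts) => register(name, fn, applyScale(opts))
}

// Usage with a recording registrar standing in for the real `test`.
const seen: Array<number | TestOptions | undefined> = []
const record: Register = (_name, _fn, opts) => {
  seen.push(opts)
}

const liveOnWindows = withScale(record, "win32")
liveOnWindows("explicit object", async () => {}, { timeout: 3_000 })
liveOnWindows("explicit number", async () => {}, 3_000)
liveOnWindows("no options", async () => {})

const liveOnLinux = withScale(record, "linux")
liveOnLinux("explicit object", async () => {}, { timeout: 3_000 })

console.log(seen)
```

Wrapping the registrar once keeps the six helper variants from each repeating the scaling call, and tests that pass no options still reach the runner with `undefined`.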
- Coefficient × 3: Windows pass max observed is 2828ms. 3s × 3 = 9s gives ~3× headroom over the observed max. If flakes persist, raise to × 5 rather than adding more escape hatches.
- Does NOT affect tests that pass undefined (they keep the Bun global --timeout 30000 from package.json).
- Does NOT affect Ubuntu (coefficient is 1× for non-win32).
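The headroom figures for the candidate coefficients can be checked with a few lines of arithmetic. This is an illustrative sketch only: the constants come from the numbers above, and the `headroom` helper is a made-up name for this example.

```typescript
const BASE_TIMEOUT_MS = 3_000
// Slowest passing Windows run from the evidence table.
const WINDOWS_PASS_MAX_MS = 2_828

// Scaled timeout and how many times it exceeds the slowest observed pass.
const headroom = (coefficient: number) => {
  const scaled = BASE_TIMEOUT_MS * coefficient
  return { coefficient, scaled, ratio: scaled / WINDOWS_PASS_MAX_MS }
}

for (const c of [1, 3, 5]) {
  const h = headroom(c)
  console.log(`x${h.coefficient}: ${h.scaled}ms timeout, ${h.ratio.toFixed(2)}x the slowest pass`)
}
```

At × 1 the timeout sits only about 6% above the slowest pass, which matches the observed flakiness; × 3 and × 5 give roughly 3.2× and 5.3×.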
Alternative considered
Bump the four explicit 3_000 literals to 9_000 inline. Rejected — the same Windows-runner slowness will hit future it.live tests; centralizing the scale is reusable and keeps the Ubuntu-authoring experience unchanged.
Verification
Run CI 5× after merge. Expect these 4 tests to pass on Windows in all runs. If any still flake, the coefficient is wrong — go to × 5 and collect another 5 runs before closing.
Out of scope
timeout-headroom tests (bare test(...), hit 30s Bun global, mocked Npm.install but still slow on Windows): tracked in flaky(windows): mocked config-deps tests stall for 18-29s #16