CI flake: CLI UI tests intermittently fail across macos/windows/ubuntu (AppContainer footer-remeasure, InputPrompt suggestion submit, AskUserQuestionDialog key handling) #4429

## Summary

Several CLI UI tests intermittently fail on CI across all three runner platforms (macOS, Windows, Ubuntu). Failures reproduce on `main` independently of any PR's content. Every flake observed so far falls into the same class: a vitest assertion on a render spy or key-sequence handler that expects an exact call count or argument list, where ink's async rerender timing or fake-timer interaction causes the spy to fire one extra time, or not to have received the expected arguments by the time the assertion runs.

## Recurring failing tests

| Test | First flake observed | Failure shape |
|---|---|---|
| `src/ui/AppContainer.test.tsx > AppContainer State Management > Terminal Height Calculation > does not remeasure footer height for sticky todo status-only updates` | well before PR #4386 | `expected "spy" to be called 1 times, but got 2 times` |
| `src/ui/components/InputPrompt.test.tsx > InputPrompt > prompt suggestions > accepts and submits the prompt suggestion on Enter when the buffer is empty` | well before PR #4386 | `expected "spy" to be called with arguments: [ 'commit this' ]` |
| `src/ui/components/messages/AskUserQuestionDialog.test.tsx > <AskUserQuestionDialog /> > single-select interaction > keeps bare k/j in custom input while Ctrl+P/N still navigates options` | well before PR #4386 | (same async-spy shape) |

## Evidence — failures on `main` (recent, sampled)

| Run | Created (UTC) | Platform | Failing test |
|---|---|---|---|
| https://github.com/QwenLM/qwen-code/actions/runs/26213996190 | 2026-05-21 08:13 | ubuntu | `AppContainer > does not remeasure footer height ...` |
| https://github.com/QwenLM/qwen-code/actions/runs/26213457435 | 2026-05-21 08:01 | windows | `InputPrompt > accepts and submits the prompt suggestion ...` |
| https://github.com/QwenLM/qwen-code/actions/runs/26208239117 | 2026-05-21 05:54 | windows | `AppContainer > does not remeasure footer height ...` |
| https://github.com/QwenLM/qwen-code/actions/runs/26207015376 | 2026-05-21 05:18 | macos | `AppContainer > does not remeasure footer height ...` |
| https://github.com/QwenLM/qwen-code/actions/runs/26204481218 | 2026-05-21 03:55 | macos | `AppContainer > does not remeasure footer height ...` |

Five of the eight most recent CI runs on `main` failed; all five failures fall in this class. PR-level CI runs hit them at roughly the same rate; PR #4386 hit them in three of its first four runs (different test each time, all in this class).

## What we know

- All three tests pass reliably on local dev machines (re-ran each locally; immediate pass).
- All three tests interact with ink rendering + an async `useEffect` or `useState` rerender; the spy assertion measures a call count or an arg shape that depends on timer/microtask ordering.
- Failures are not deterministic per-platform — the same test that fails on Windows in one run passes on Windows in the next.
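
To make the suspected failure class concrete, here is a minimal, dependency-free sketch of how an exact call-count assertion can depend on microtask ordering. It is a model only: `createSpy`, `renderWithDeferredUpdate`, and `flushMicrotasks` are hypothetical stand-ins, not code from the repository, and this is not a confirmed root cause.

```typescript
// Hypothetical stand-in for a render spy: counts how often it was invoked.
function createSpy(): { fn: () => void; calls: () => number } {
  let count = 0;
  return {
    fn: () => {
      count += 1;
    },
    calls: () => count,
  };
}

// Models a component that renders once synchronously and then rerenders
// on a later microtask (as a state update from an effect might).
function renderWithDeferredUpdate(onRender: () => void): void {
  onRender(); // initial render
  Promise.resolve().then(() => onRender()); // deferred rerender
}

// Yields once so that already-queued microtasks run first.
async function flushMicrotasks(): Promise<void> {
  await Promise.resolve();
}

async function demo(): Promise<{ beforeFlush: number; afterFlush: number }> {
  const spy = createSpy();
  renderWithDeferredUpdate(spy.fn);

  // An assertion placed here sees only the initial render.
  const beforeFlush = spy.calls();

  await flushMicrotasks();

  // An assertion placed here also sees the deferred rerender.
  const afterFlush = spy.calls();

  return { beforeFlush, afterFlush };
}

demo().then(({ beforeFlush, afterFlush }) => {
  console.log(`calls before flush: ${beforeFlush}, after flush: ${afterFlush}`);
});
```

In this model the count is 1 before the flush and 2 after it, which matches the `expected "spy" to be called 1 times, but got 2 times` shape. Whether the real tests hit the same ordering issue still needs to be confirmed by the audit.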

## What would help

This isn't a hard fix request, more a tracking issue so PR authors stop re-triaging the same flake across rounds. Reasonable next steps if someone takes it:

1. Quarantine the three tests (e.g. `test.skipIf(process.env.CI)`, or a longer `testTimeout` plus `retry: 2` in `vitest.config.ts`) until root-caused, so CI signal improves immediately.
2. Audit those three tests for the underlying race — most likely candidates: missing `act()` wrappers around state updates, real timers leaking from prior tests in the file, or `useEffect` cleanup running on a different tick than the spy expected.
3. Optionally: scope `retry` to these three tests (vitest accepts a per-test `retry` option) rather than setting it globally, so retries don't mask other tests' bugs.
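
A rough sketch of what steps 1 and 3 could look like in a test file. It assumes a Vitest version that accepts a per-test options object as the second argument (older versions take it as the last argument). The suite name, test names, and bodies are placeholders, not the real tests.

```typescript
import { describe, expect, it } from 'vitest';

// Most CI providers, including GitHub Actions, set CI=true.
const onCI = Boolean(process.env.CI);

describe('quarantine sketch (placeholder suite)', () => {
  // Step 1: skip the flaky test on CI only; it still runs locally.
  it.skipIf(onCI)('placeholder for a flaky test', () => {
    expect(true).toBe(true);
  });

  // Step 3: retry only this test, so a global retry setting
  // does not hide unrelated failures elsewhere.
  it('placeholder for a retried test', { retry: 2 }, () => {
    expect(true).toBe(true);
  });
});
```

Skipping removes the signal on CI entirely, while a scoped retry keeps the test running but tolerates an occasional failure. Either is a stopgap until the audit in step 2 finds the race.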

## Why not just fix it here

This is well out of scope for any feature/bugfix PR — the flake exists on `main` and predates any single PR. Triaging the actual root cause is a focused exercise on infrastructure / test-harness, not on a code change. Filing this issue so the work can be picked up independently.

