Fix/issue 115 kiro codex status detection by haofeif · Pull Request #1 · call-me-ram/cli-agent-orchestrator

haofeif · 2026-06-10T14:30:19Z

Summary

Fixes the e2e suite on top of PR awslabs#273 by latching ready statuses against an 8 KB-buffer eviction flap. Targets continue/issue-115-event-driven so it lands as part of the event-driven pipeline work.

Without these changes the e2e suites fail across all four primary providers:

codex: 60 s init timeouts, completion timeouts (9/12 failing)
claude_code: completion timeouts under load
gemini_cli: 240 s init timeouts, "TUI footer leaked into output" extraction failures
kiro_cli: spurious idle/processing flaps under TUI redraw

Why it was broken

PR awslabs#273 moves status detection to an event-driven pipeline that feeds a rolling 8 KB FIFO buffer to each provider's get_status(buffer). After an agent settles, the TUI keeps emitting bytes for several seconds (status-bar refreshes, cursor positioning, footer repaints). Those bytes evict the idle / response markers from the 8 KB window, so get_status() can't see them anymore and falls back to PROCESSING or UNKNOWN.

Result: status flaps rapidly between IDLE/COMPLETED and PROCESSING/UNKNOWN, and both server-side wait_until_status() and the e2e tests' 1 Hz HTTP polling miss the brief ready windows.

There's also a codex-specific corollary: when the user-message marker (›) evicts before the assistant bullet, codex's get_status() returns IDLE because last_user is None, silently overwriting a previously-published COMPLETED.

And a gemini-specific extraction problem: gemini already declared extraction_retries = 3 but the retry-with-delay logic only fires when a provider also sets extraction_tail_lines. Gemini didn't, so its escalating-fetch path (200 → 500 → 1000 → 5000 lines, no waits) ran each step once and fell through to [PARTIAL RESPONSE] while the Ink-TUI was still rendering.

What changed

`services/status_monitor.py` — sticky ready-status latching

Once any of {IDLE, COMPLETED, WAITING_USER_ANSWER, ERROR} latches, refuse two downgrades:

ready → PROCESSING / UNKNOWN — the typical buffer-eviction flap.
COMPLETED → IDLE — codex-style off-by-one where the user marker evicts before the assistant marker.

Block is released by notify_input_sent(terminal_id), a one-shot revert gate that callers fire whenever they're about to send external input that legitimately starts a new processing cycle.

`services/terminal_service.py` — arm gate at runtime input boundaries

send_input and send_special_key now call notify_input_sent() before delivering the keystrokes to tmux. Without this, a latched IDLE/COMPLETED would block the genuine PROCESSING signal that arrives once the agent picks up the new message.

`providers/{codex,claude_code,gemini_cli,kiro_cli}.py` — arm gate during init

Each initialize() arms the gate before:

the warm-up echo (codex, gemini)
the CLI launch keystroke
bypass-permissions / workspace-trust prompt acknowledgements (codex, claude_code)
kiro's --legacy-ui fallback (/exit + relaunch)

Untested providers (copilot, hermes, kimi, opencode, q) are intentionally not modified in this PR — adding stickiness arming there without exercising it would be untested behavior change.

`providers/codex.py` — assistant-marker fallback

In get_status(), when last_user is None (the › marker has been evicted by a long response), look above the TUI footer for an assistant bullet and return COMPLETED instead of IDLE. Without this, the COMPLETED→IDLE block above would lock the terminal at IDLE while the response is still visible.

`providers/gemini_cli.py` — extraction_tail_lines = 5000

Routes gemini extraction through the fixed-tail path that honors extraction_retries (3 × 10 s waits between captures). 5 000 lines covers any realistic single-turn response without truncating the query box.

Test plan

E2E results on continue/issue-115-event-driven + this commit (against live providers, real tmux/Ink TUIs):

Provider	Result	Notes
codex	10 passed, 1 failed, 1 xfailed	Remaining failure is a pre-existing supervisor-orchestration extraction issue, unrelated to status detection
claude_code	12 passed	Required `mwinit` refresh on the test host (Bedrock-backed claude prompts for Midway re-auth before TUI launches)
gemini_cli	12 passed	All 3 prior extraction failures fixed by `extraction_tail_lines`
kiro_cli	11 passed	One flake on first run (`test_send_message_to_inbox`), passed on retry

Run with:

uv run cao-server &
uv run pytest test/e2e/ -m e2e -k "Codex or ClaudeCode or GeminiCLI or KiroCli" --no-cov

What is intentionally NOT in this PR

No flag-gating for the stickiness behavior. The pre-fix behavior is the buggy regression you're trying to fix in Event-driven architecture: rebase onto main + green the suite (continues #115) awslabs/cli-agent-orchestrator#273; an opt-in flag would never get turned on and would bit-rot. See discussion if you want this gated regardless.
No changes to copilot/hermes/kimi/opencode/q providers. Their e2e suites weren't run, so the same notify_input_sent() wiring is deferred to a follow-up where those tests can verify it.
No fix for the codex supervisor-orchestration extraction failure (raw ANSI escape sequences in _get_full_output). Pre-existing issue, separate problem domain, deserves its own PR.

The event-driven status pipeline strips terminal escapes from the raw 8KB FIFO buffer to feed provider.get_status(). _LINE_START_CSI in strip_terminal_escapes already turned CHA (\x1b[1G) and CNL (\x1b[E) into \n so per-line patterns work, but missed CUP — Cursor Position to column 1 (\x1b[<row>;1H). Codex's TUI lays out its bottom prompt + status bar via CUP rather than CHA: \x1b[46;1H› places the idle ``›`` at column 1 of row 46. Without normalising this, the ``›`` glyph stayed glued mid-stream, e.g. ›Improve documentation in @filenameopenai.gpt-5.5 medium · /tmp/... The per-line check at codex.py:get_status:370 only inspects the bottom five lines for ``\s*(?:❯|›|codex>)``, so the prompt was never detected and codex sessions reported PROCESSING forever — every codex e2e test hit "Codex initialization timed out after 60 seconds" (0/12 passing). Extend _LINE_START_CSI to also match \x1b[\d+;1H. After the fix the ›-prefixed idle prompt sits on its own line, the existing IDLE_PROMPT_PATTERN check matches, and codex idle detection works: in replay against captured FIFO logs, 14/15 codex sessions reach idle correctly. Remaining e2e failures are MCP startup >60s and OpenAI ``stream disconnected before completion`` API errors — outside the scope of status detection. Add a regression test covering both \x1b[46;1H› (codex's bottom prompt) and the bare \x1b[1;1Hb form.

…on event-driven pipeline Under tmux capture-pane, a TUI that redraws over its boot screen hides "Initializing..." and the MCP-init line — Kiro's TUI -> --legacy-ui fallback path also calls StatusMonitor.reset_buffer when retrying so the rolling buffer stays clean. Yolo mode does NOT take that fallback path: it forces --legacy-ui directly at launch, so its rolling FIFO buffer keeps the boot bytes forever even after kiro has redrawn over them and the actual interactive prompt is showing below. Until now get_status returned PROCESSING unconditionally whenever TUI_INITIALIZING_PATTERN matched anywhere in the buffer. Yolo sessions therefore reported PROCESSING for the entire session and wait_until_status({IDLE, COMPLETED}) timed out — kiro yolo e2e went from passing to 0/11 under the new event-driven pipeline (#273). Make Check 0 position-aware: only return PROCESSING from a matching TUI_INITIALIZING_PATTERN when no real ``[agent] >`` idle prompt appears AFTER the last init match. The new-TUI placeholder ("Ask a question or describe a task") is intentionally NOT counted as a real idle prompt because the new TUI renders it during boot; that pre-existing test (test_tui_initializing_yields_processing_despite_idle_placeholder, issue #211) still passes. Update test_mcp_server_init_yields_processing — its old fixture asserted boot-line + "[developer] !>" should yield PROCESSING, but real captured Kiro logs show the [developer] prompt only renders AFTER init completes, so the assumption was incorrect. Replace the fixture with the actual placeholder text Kiro shows during boot, and add test_mcp_server_init_with_post_init_prompt_yields_idle to lock in the post-init redrawn-stale case (the yolo failure mode). Verified end-to-end: kiro e2e (yolo, --legacy-ui) 0/11 -> 11/11.

… flap PR #273's event-driven pipeline derives status from a rolling 8KB FIFO buffer fed to per-provider get_status(). TUI redraws (status-bar refreshes, cursor positioning, footer repaints) keep emitting bytes for seconds AFTER the agent settles, eventually evicting the idle/response markers from the 8KB window. Without latching, status flaps rapidly between IDLE/COMPLETED and PROCESSING/UNKNOWN, and both wait_until_status (server-side) and the e2e tests' HTTP polling miss the brief ready windows — causing codex 60s init timeouts, gemini 240s init timeouts, and completion-timeout failures. StatusMonitor now refuses two downgrades once a ready status latches: - ready -> PROCESSING/UNKNOWN (typical buffer-eviction flap) - COMPLETED -> IDLE (codex's user-marker evicts before assistant's, so last_user is None and provider falls back to IDLE, silently overwriting COMPLETED) notify_input_sent() arms a one-shot revert gate so the latch releases when external input legitimately starts a new processing cycle: - terminal_service.send_input / send_special_key (runtime input) - each tested provider's initialize() before its launch keystrokes and bypass/trust prompt acknowledgements codex get_status now also handles the "long response evicted the user marker" case directly: when last_user is None, scan above the TUI footer for an assistant bullet and return COMPLETED instead of IDLE (was returning IDLE forever, which the COMPLETED->IDLE block above would otherwise cement as an off-by-one defence). gemini_cli now declares extraction_tail_lines = 5000 so its existing extraction_retries (3 x 10s) actually fires. The escalating-fetch path (200/500/1000/5000) ran each step once with no inter-step waits, so Ink-TUI redraws still rendering at extraction time fell through to [PARTIAL RESPONSE], leaving the TUI footer in the output and tripping the "? for shortcuts not in output" assertion. E2E results on PR #273 with these changes: codex 10/12 passing (1 pre-existing extraction issue, 1 xfail) claude 12/12 gemini 12/12 kiro 11/11

haofeif · 2026-06-10T14:31:13Z

@call-me-ram can you help to review this PR if you could ?

…e thread-safe Two hardening fixes to the sticky ready-status latch (PR #1): 1. Arm semantics: notify_input_sent()'s one-shot arm was consumed by ANY new ready latch, including a ready→ready downgrade flap (COMPLETED→IDLE when a large paste evicts the response markers, WAITING_USER_ANSWER→IDLE after a permission keystroke). Consuming the arm there blocks the genuine PROCESSING that follows, so the terminal reads ready while the agent is busy — and InboxService delivers on IDLE/COMPLETED, so a queued message could be pasted mid-response. The arm is now consumed only by a PROCESSING transition or a genuine non-ready→ready (init-style) latch. 2. Thread safety: the latch decision is a read-modify-write sequence (read armed → decide transition → consume arm) executed on the asyncio consumer, while notify_input_sent()/get_status()/clear_terminal() run on FastAPI threadpool, inbox delivery workers, and the cleanup thread. Guard StatusMonitor state with a lock (provider regex analysis and bus.publish stay outside it). Also covers the pre-existing Copilot review comments about unsynchronized _buffers/_last_status access. Adds a latching state-machine test suite (12 cases) pinning blocked downgrades, arm consumption, flap survival, and event publication.

haofeif added 3 commits June 10, 2026 10:19

call-me-ram merged commit 2f7bd32 into call-me-ram:continue/issue-115-event-driven Jun 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/issue 115 kiro codex status detection#1

Fix/issue 115 kiro codex status detection#1
call-me-ram merged 3 commits into
call-me-ram:continue/issue-115-event-drivenfrom
awslabs:fix/issue-115-kiro-codex-status-detection

haofeif commented Jun 10, 2026

Uh oh!

haofeif commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

haofeif commented Jun 10, 2026

Summary

Why it was broken

What changed

services/status_monitor.py — sticky ready-status latching

services/terminal_service.py — arm gate at runtime input boundaries

providers/{codex,claude_code,gemini_cli,kiro_cli}.py — arm gate during init

providers/codex.py — assistant-marker fallback

providers/gemini_cli.py — extraction_tail_lines = 5000

Test plan

What is intentionally NOT in this PR

Uh oh!

haofeif commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`services/status_monitor.py` — sticky ready-status latching

`services/terminal_service.py` — arm gate at runtime input boundaries

`providers/{codex,claude_code,gemini_cli,kiro_cli}.py` — arm gate during init

`providers/codex.py` — assistant-marker fallback

`providers/gemini_cli.py` — extraction_tail_lines = 5000