Event driven architecture#115
Conversation
|
@patricka3125 feel free to have a look at this PR too. |
|
@tuanknguyen is this document still correct ? the change wont impact the user experience right ? https://github.com/awslabs/cli-agent-orchestrator/tree/main/examples/assign ? also would you mind fixing the unit testing errors ? |
|
@haofeif that's correct, this does not affect the assign or handoff or send message pattern at all. All unit tests also passed. Separately, we need to consolidate the provider implementation. there's quite a bit of code duplication across the different providers |
| terminal_id = terminal_id_from_topic(event["topic"]) | ||
| log_path = TERMINAL_LOG_DIR / f"{terminal_id}.log" | ||
| with open(log_path, "a") as f: | ||
| f.write(event["data"]["data"]) |
There was a problem hiding this comment.
could consider using asyncio.to_thread here to minimize blocking in event loop
There was a problem hiding this comment.
This is a sync file write inside an async loop, every output will block the event loop briefly
There was a problem hiding this comment.
good catch! I'll update it.
|
|
||
| Returns: | ||
| bool: True if a message was sent, False otherwise | ||
| def deliver_pending(self, terminal_id: str) -> None: |
There was a problem hiding this comment.
would we ever consider delivering all messages at once here? Or perhaps add some way for user to specify message size as a flag?
There was a problem hiding this comment.
sure thing. I'll add number of messages as an optional param default to 1
|
@tuanknguyen thanks for it. also it looks like the current change breaks the handoff function. In my test using
Let me retry the handoff to the Code Reviewer Agent.
Running tool handoff with the param (from mcp server: cao-mcp-server)
⋮ {
⋮ "agent_profile": "reviewer",
⋮ "message": "You are the Code Reviewer Agent. Please read the review task description at the following absolute path and perform a code review:\n\n/Users/haofeif/Amazon-WorkDocs/Code/AIMLAoD/AIAgents/OpenshiftToEKS/cli-agent-orchestrator/cli-agent-orchestrator/tasks/hello-world-review.md\n\nRead the code file referenced in the task and provide your review. End with a clear verdict: APPROVED or NEEDS CHANGES (with specific feedback)."
⋮ }
When testing the Assign examples, similar issues seem to be hanging the handoff session. |
|
@tuanknguyen also i did the e2e test results for 3 providers, i guess let's test the kiro and claude code in your end and see how it goes E2E Test Results —
|
| Provider | main (baseline) |
feat/event-driven-messaging |
Regression |
|---|---|---|---|
| Kiro CLI | 8/8 ✅ | 0/8 ❌ | Yes — COMPLETED never detected |
| Claude Code | 8/8 ✅ | 0/8 ❌ | Yes — IDLE never detected during init |
| Kimi CLI | 8/8 ✅ | 6-7/8 |
Partial — 1 consistent fail, 1 flaky |
Test Location
All E2E tests are in test/e2e/:
| Test File | Test Classes | What It Tests |
|---|---|---|
test_assign.py |
TestKiroCliAssign, TestClaudeCodeAssign, TestKimiCliAssign |
Create worker terminal, send task, verify COMPLETED, extract output |
test_handoff.py |
TestKiroCliHandoff, TestClaudeCodeHandoff, TestKimiCliHandoff |
Create terminal, send task, poll for COMPLETED, extract response |
test_send_message.py |
TestKiroCliSendMessage, TestClaudeCodeSendMessage, TestKimiCliSendMessage |
Create 2 terminals, send inbox message, verify delivery |
test_supervisor_orchestration.py |
TestKiroCliSupervisorOrchestration, TestClaudeCodeSupervisorOrchestration, TestKimiCliSupervisorOrchestration |
Supervisor delegates via |
| handoff/assign MCP tools |
Failure Details
Kiro CLI — 0/8 (all failed)
All 8 tests fail at wait_for_status(terminal_id, "completed"). Terminal creation and init succeed (IDLE detected), message is sent successfully, but COMPLETED is never detected after the agent finishes.
| Test | Error |
|---|---|
| test_assign.py::TestKiroCliAssign::test_assign_data_analyst | Worker did not reach COMPLETED within 180s (provider=kiro_cli) |
| test_assign.py::TestKiroCliAssign::test_assign_report_generator | Worker did not reach COMPLETED within 180s (provider=kiro_cli) |
| test_assign.py::TestKiroCliAssign::test_assign_with_callback | Worker did not reach COMPLETED within 180s (provider=kiro_cli) |
| test_handoff.py::TestKiroCliHandoff::test_handoff_simple_function | Terminal did not reach COMPLETED within 180s (provider=kiro_cli) |
| test_handoff.py::TestKiroCliHandoff::test_handoff_second_task` | `Terminal did not reach COMPLETED within 180s (provider=kiro_cli) |
| test_send_message.py::TestKiroCliSendMessage::test_send_message_to_inbox | Receiver should have transitioned from IDLE after inbox delivery within 60s, got: idle |
| test_supervisor_orchestration.py::TestKiroCliSupervisorOrchestration::test_supervisor_handoff | Supervisor did not reach COMPLETED within 300s. Last status: idle |
| test_supervisor_orchestration.py::TestKiroCliSupervisorOrchestration::test_supervisor_assign_and_handoff | Supervisor did not reach COMPLETED within 300s. Last status: idle |
Possible Root cause: StatusMonitor returns idle instead of completed. Kiro CLI's get_status() needs both a green arrow pattern (response) AND an idle prompt after it to return COMPLETED. The 8KB rolling buffer loses the green arrow as the response grows, leaving only the idle prompt → returns IDLE.
Claude Code — 0/8 (all failed)
All 8 tests fail at create_terminal() — the terminal never initializes.
| Test | Error |
|---|---|
test_assign.py::TestClaudeCodeAssign::test_assign_data_analyst |
Claude Code initialization timed out after 30 seconds |
test_assign.py::TestClaudeCodeAssign::test_assign_report_generator |
Claude Code initialization timed out after 30 seconds |
test_assign.py::TestClaudeCodeAssign::test_assign_with_callback |
Claude Code initialization timed out after 30 seconds |
test_handoff.py::TestClaudeCodeHandoff::test_handoff_simple_function |
Claude Code initialization timed out after 30 seconds |
test_handoff.py::TestClaudeCodeHandoff::test_handoff_second_task |
Claude Code initialization timed out after 30 seconds |
test_send_message.py::TestClaudeCodeSendMessage::test_send_message_to_inbox |
Claude Code initialization timed out after 30 seconds |
test_supervisor_orchestration.py::TestClaudeCodeSupervisorOrchestration::test_supervisor_handoff |
Claude Code initialization timed out after 30 seconds |
test_supervisor_orchestration.py::TestClaudeCodeSupervisorOrchestration::test_supervisor_assign_and_handoff |
Claude Code initialization timed out after 30 seconds |
Root cause: StatusMonitor never detects IDLE during provider.initialize(). The FIFO → EventBus → StatusMonitor pipeline doesn't deliver output fast enough (or at all) for Claude Code's get_status() to match the IDLE prompt pattern within the 30s timeout.
Architectural Possible Root Cause
On main, get_terminal() calls provider.get_status() which reads fresh tmux scrollback (tmux capture-pane) on every poll — full history available.
On feat/event-driven-messaging, get_terminal() calls status_monitor.get_status() which returns a cached status derived from an 8KB rolling buffer fed by the FIFO pipeline.
This causes two problems:
- Buffer truncation: Long agent responses push early patterns (e.g., Kiro CLI's green arrow) out of the 8KB window, breaking COMPLETED detection
- Pipeline timing: The FIFO → EventBus → StatusMonitor pipeline may not deliver output fast enough during provider initialization, breaking IDLE detection for Claude Code
|
@haofeif thanks for flagging the issue. I noticed that the issue was with the regex of the COMPLETE status detection. We now read from the buffer directly which means that there are raw control characters or ANSI escape. It's not due to sequencing or event bus not delivering fast enough. I'll update and push the changes. |
|
Any updates on the progress of this PR? I have also been ecountering issues mentioned in #131 on a main branch build and this seems a promising fix 👀 |
yes this will be continued worked on |
Bring the event-driven architecture branch up to date with main (98 commits) and reconcile the rewrite with features that landed after it forked: eager inbox delivery (awslabs#251), the OpenCode poller, env-var forwarding (awslabs#259), memory curation (awslabs#254/awslabs#262), CORS auto-derive (awslabs#261), DNS host validation (awslabs#124), and the self-send guard (awslabs#24). Highlights: - Providers adopt the async initialize() + get_status(buffer) contract; copilot_cli/opencode_cli converted; kiro keeps colour-only ANSI stripping so carriage-return-redraw permission prompts aren't misread as idle. - Event-driven InboxService.deliver_pending with the awslabs#251 eager gate and message-sender attribution; OpenCode poller retained as a status-driven method; the watchdog (PollingObserver/LogFileHandler) is removed. - terminal_service.create_terminal is async (FIFO + StatusMonitor wiring); session_service.create_session, flow_service.execute_flow, the API endpoints, and `cao flow run` updated to await. - memory_service curated path and the flow CLI fixed to the new contract. Full unit suite green (1908 passed); black + isort clean.
…266) Inbox messages could stay PENDING forever when the receiving terminal was already idle: the single immediate (on POST) delivery attempt can observe a stale status and skip, and the log-watching observer never fires again because an idle agent produces no new log output. With both fast paths missed there was no fallback (issue #131). Add a provider-agnostic reconciliation daemon that runs on a slower interval than the watchdog and re-attempts delivery for any message left PENDING past a grace window. The grace window keeps the sweep from competing with the immediate and watchdog paths for freshly queued messages — it only adopts ones those paths have already missed. - list_pending_receiver_ids_older_than: provider-agnostic query for stuck receivers, joined against terminals so deleted receivers are skipped; the cutoff uses local-naive datetime.now() to match InboxModel.created_at. - reconcile_orphaned_messages: sweep body, reusing check_and_send_pending_messages. - inbox_reconciliation_daemon: background loop wired into the server lifespan. - INBOX_RECONCILE_INTERVAL / INBOX_RECONCILE_GRACE_SECONDS constants. The sweep reuses the existing delivery helper and so shares its known duplicate-wakeup race; making delivery atomic is left to the unified delivery engine tracked in GH #115. Documented in docs/inbox-delivery.md. Fixes #131 Co-authored-by: haofeif <56006724+haofeif@users.noreply.github.com>
…ues #115) (#273) * refactored but wait_for_* not working * working * merge from main * fix merge conflicts * update docs and tests * update docs * rebase and update docs and tests * update tests * clean up routine * formatting and update kimi_cli to event-driven * fix kiro_cli answer marker pattern * fix(events): address review feedback on the event-driven pipeline Resolve four review comments on PR #273: - InboxService.run() ran deliver_pending() synchronously inside the asyncio consumer loop, blocking it on DB + tmux I/O and starving the StatusMonitor/LogWriter consumers. Offload delivery via asyncio.to_thread, matching the threading discipline documented for the event bus and the existing OpenCode poller. - Event-driven deliveries were started without a PluginRegistry, so they skipped PostSendMessageEvent hooks. Thread the registry through run() -> deliver_pending so status-driven deliveries get the same plugin attribution as the immediate and OpenCode-poller paths. - FifoManager.stop_reader() early-returned when no in-memory reader was tracked, never unlinking the FIFO. Retention cleanup after a restart (empty _readers) would leak *.fifo files unbounded. Always best-effort unlink the FIFO; only log when a reader was actually stopped. - Docs claimed StatusMonitor falls back to a generic shell-prompt pattern before init; _detect_status returns UNKNOWN until a provider registers. Correct the docs to match. Adds regression tests for the registry threading, the to_thread offload, and the stale-FIFO unlink. * fix(events): address Copilot review on PR #273 - flow_service: stop FIFO readers and clear StatusMonitor buffers when recycling an existing flow session, preventing leaked reader threads and stale *.fifo files across repeated flow runs. - flow CLI run: bootstrap the in-process event pipeline (bus loop + StatusMonitor/LogWriter consumers) so execute_flow's async create_terminal/initialize path doesn't hang waiting for status. - inbox_service: deliver pending messages in contiguous same-sender runs so PostSendMessageEvent attribution stays correct when a batch spans multiple senders (num_messages=0). - copilot_cli tests: replace no-op initialize test with real async success and trust-prompt coverage; drop dead placeholder. - docs: correct CODEBASE.md inbox flow (deliver_pending, not the removed check_and_send_pending_messages); document that OutputMode.FULL returns the bounded StatusMonitor buffer, not unbounded scrollback. * fix(status): raw-stream-safe get_status + newest Claude Code TUI support Addresses haofeif's PR #273 review (raw FIFO buffer fed to get_status) plus newest-Claude-Code TUI changes found while reproducing it locally. - claude_code/codex/kimi get_status: strip terminal escapes (utils.text.strip_terminal_escapes) before structural checks, so the cursor-positioning / in-place-redraw sequences in the raw pipe-pane stream no longer break detection. Document the raw-buffer input contract on BaseProvider.get_status. - claude_code: gated support for the newest Claude Code TUI (the box pattern gates it, so older builds keep the legacy logic): live spinner renders ABOVE a box-drawn input prompt; completion shows '* <Verb>ed for Ns' instead of the old marker. Detect both PROCESSING and COMPLETED there. - kiro_cli: clear the StatusMonitor buffer (new StatusMonitor.reset_buffer) on the TUI -> --legacy-ui fallback so the retry is not derived from stale bytes. - input submission: newest Claude Code swallows an Enter sent too soon after a bracketed paste; make the post-paste delay provider-tunable (BaseProvider.paste_submit_delay; Claude overrides to 2.0s). - create_terminal: nudge the shell after pipe-pane attaches so the prompt is captured (fast shells drew it before capture started, timing out wait_for_shell). Full existing unit suite stays green (2012 passed); changes are additive. * fix(claude_code): detect newest Claude Code TUI completion + asterisk spinner in raw stream * fix(claude_code): re-insert spaces for column-positioned text so completion summary is detected The newest Claude Code TUI sometimes redraws the completion summary and spinner with CHA/CUF cursor-positioning escapes instead of literal spaces (e.g. '✻\x1b[3GWorked\x1b[10Gfor\x1b[14G3s'). strip_terminal_escapes dropped those moves with no space, gluing words ('Workedfor3s') so COMPLETION_SUMMARY_PATTERN and the spinner check never matched — get_status stayed IDLE through a finished turn and handoff/poll_until_done timed out. Replace forward horizontal cursor moves with a single space. Verified against three real handoff captures + full suite green. * fix(claude_code): recognize newest-TUI '●' response glyph in message extraction The newest Claude Code TUI renders the response marker as '●' (U+25CF) instead of '⏺' (U+23FA), so extract_last_message_from_script raised 'No Claude Code response found' on a finished handoff even after COMPLETED was detected. Add a line-start- anchored EXTRACTION_RESPONSE_PATTERN matching both glyphs (anchored so the footer effort indicator '… esc to interrupt ● high · /effort' is not mistaken for a marker), and trim the '✻ Worked for Ns' completion stat. RESPONSE_PATTERN used by get_status is left ⏺-only so the legacy COMPLETED check cannot fire mid-stream. * fix(claude_code): boxless completion summary after the last separator wins over a stale spinner When the newest TUI repaints a finished turn boxless ('✻ <Verb>ed for Ns' + ❯) below the previous frame's box separator, a stale spinner remained above that separator. The spinner-before-separator PRIMARY walk fired PROCESSING off the stale spinner and never reached the COMPLETED branch, so a real handoff stayed PROCESSING until timeout. Skip that walk when a completion summary appears after the last separator (the boxless-redraw signature; genuine processing only has the footer there). Verified against four real handoff captures. * fix(claude_code): COMPLETED via ● response marker, robust to clipped/missing completion summary The newest TUI's completion summary is fragile in the raw stream: the duration can be clipped ('✻ Crunched for ' with no Ns) or rendered on a ·/* glyph frame the summary pattern excludes, leaving a finished turn stuck at IDLE. After the PROCESSING checks (which already rule out a LIVE spinner), treat a completion summary OR a start-of-line response marker (⏺/●) plus a visible prompt as COMPLETED — dropping the position-based spinner-freshness guard that mis-counted a stale spinner. Verified against four real handoff captures (greet + multiply). * fix(claude_code): tolerate clipped completion summary ('✻ Crunched for ' no duration) The newest TUI sometimes writes the completion summary's duration via a separate cursor-positioned write that the raw stream splits off, leaving '✻ Crunched for ' with no 'Ns'. The strict COMPLETION_SUMMARY_PATTERN (requires 'for Ns') then failed to recognize completion, so the spinner-before-separator walk reported PROCESSING off a stale spinner and a finished handoff intermittently stuck at PROCESSING/IDLE until timeout. Add GET_STATUS_COMPLETION_PATTERN (glyph + 'for', no ellipsis, no digits required) for get_status detection + the boxless-tail guard; extraction keeps the strict pattern. Verified: six real handoff captures all settle COMPLETED; live-processing and compaction still PROCESSING. * fix(kimi_cli): support the redesigned "Kimi Code" TUI in get_status The newest Kimi CLI (the "Kimi Code" rebuild) replaced the ✨/💫 emoji prompt with a boxed input area ("── input ──"), an "agent (<model> ●)" status bar, a "context: N%" usage footer, and a braille-spinner working indicator ("⠧ Thinking… Ns · N tokens"). The legacy detector keyed on the emoji prompt, so it never observed IDLE and every Kimi terminal timed out at init — this was the failure flagged on #273 (now confirmed to be a Kimi-side TUI redesign, not a regression from the event-driven refactor). get_status now detects the new TUI (gated on the new status/footer markers so legacy emoji builds are unchanged): - READY = status bar / context footer present with no live spinner. - PROCESSING vs READY is decided by POSITION (last braille spinner line vs last ready-chrome line), so stale spinner frames lingering in the 8KB rolling buffer don't pin a finished turn at PROCESSING. - COMPLETED vs IDLE latches on a "•" bullet (a turn produced output); the welcome banner / update nag have none, so a fresh terminal reads IDLE and avoids a premature-completion race when the first task is sent. Adds regression tests driven by real captured raw pipe-pane buffers (idle / processing / completed) of the new TUI. Also enables Claude <-> Kimi cross-provider e2e: - examples/cross-provider/data_analyst_kimi_cli.md (provider: kimi_cli). - TestCrossProviderClaudeToKimi / TestCrossProviderKimiToClaude. - Fix the cross-provider test helper to resolve the worker's provider from its profile (the add-terminal endpoint treats an explicit provider param as authoritative, which silently suppressed the profile override). Verified on real CLIs (kimi 1.47.0 + claude 2.1.x) against a cao-server built from this branch: - TestKimiCliHandoff (2) + TestKimiCliAssign (3): pass. - Claude->Kimi and Kimi->Claude cross-provider assign: pass. Full unit suite green (2202 passed); mypy / black / isort clean. * fix(herdr): make status detection backend-aware so agent init works on herdr The event-driven rewrite made the FIFO -> EventBus -> StatusMonitor pipeline the sole source of shell-readiness and terminal status. The herdr backend deliberately skips that pipeline (its pipe_pane is a no-op; no FIFO reader is started for it, since it delivers via socket events), so for a herdr terminal the StatusMonitor buffer stayed "" and status stayed UNKNOWN forever. As a result provider.initialize() timed out in wait_for_shell()/wait_until_status(), and every status read (API GET /terminals/{id}, busy checks, provider init loops) returned UNKNOWN -- terminal creation was broken for essentially every provider on herdr. Fix at the single chokepoint instead of per-call-site: - status_monitor.get_status() is now backend-aware: for event-inbox backends it derives status on demand from the provider's get_status() (which consults backend.get_native_status()). This fixes all read sites at once -- API status, wait_until_status(), flow busy checks, curator liveness, and the copilot/gemini init loops -- without each having to special-case the backend. - wait_for_shell() reads backend.get_history() directly for event-inbox backends rather than the never-populated StatusMonitor buffer. - wait_until_status() reverts to simply polling the now backend-aware get_status(), removing the duplicated backend logic. The tmux path is unchanged (supports_event_inbox() is False -> same pushed-status read as before). Verified end-to-end against a real herdr 0.6.8 server: a live Claude Code agent launched in a herdr pane, reached idle, and completed a task. Adds test/services/test_status_monitor.py and herdr cases to test/utils/test_terminal.py. * fix(text): normalise CUP-to-column-1 escape so codex idle prompt detects The event-driven status pipeline strips terminal escapes from the raw 8KB FIFO buffer to feed provider.get_status(). _LINE_START_CSI in strip_terminal_escapes already turned CHA (\x1b[1G) and CNL (\x1b[E) into \n so per-line patterns work, but missed CUP — Cursor Position to column 1 (\x1b[<row>;1H). Codex's TUI lays out its bottom prompt + status bar via CUP rather than CHA: \x1b[46;1H› places the idle ``›`` at column 1 of row 46. Without normalising this, the ``›`` glyph stayed glued mid-stream, e.g. ›Improve documentation in @filenameopenai.gpt-5.5 medium · /tmp/... The per-line check at codex.py:get_status:370 only inspects the bottom five lines for ``\s*(?:❯|›|codex>)``, so the prompt was never detected and codex sessions reported PROCESSING forever — every codex e2e test hit "Codex initialization timed out after 60 seconds" (0/12 passing). Extend _LINE_START_CSI to also match \x1b[\d+;1H. After the fix the ›-prefixed idle prompt sits on its own line, the existing IDLE_PROMPT_PATTERN check matches, and codex idle detection works: in replay against captured FIFO logs, 14/15 codex sessions reach idle correctly. Remaining e2e failures are MCP startup >60s and OpenAI ``stream disconnected before completion`` API errors — outside the scope of status detection. Add a regression test covering both \x1b[46;1H› (codex's bottom prompt) and the bare \x1b[1;1Hb form. * fix(kiro_cli): position-aware Initializing/MCP-init check fixes yolo on event-driven pipeline Under tmux capture-pane, a TUI that redraws over its boot screen hides "Initializing..." and the MCP-init line — Kiro's TUI -> --legacy-ui fallback path also calls StatusMonitor.reset_buffer when retrying so the rolling buffer stays clean. Yolo mode does NOT take that fallback path: it forces --legacy-ui directly at launch, so its rolling FIFO buffer keeps the boot bytes forever even after kiro has redrawn over them and the actual interactive prompt is showing below. Until now get_status returned PROCESSING unconditionally whenever TUI_INITIALIZING_PATTERN matched anywhere in the buffer. Yolo sessions therefore reported PROCESSING for the entire session and wait_until_status({IDLE, COMPLETED}) timed out — kiro yolo e2e went from passing to 0/11 under the new event-driven pipeline (#273). Make Check 0 position-aware: only return PROCESSING from a matching TUI_INITIALIZING_PATTERN when no real ``[agent] >`` idle prompt appears AFTER the last init match. The new-TUI placeholder ("Ask a question or describe a task") is intentionally NOT counted as a real idle prompt because the new TUI renders it during boot; that pre-existing test (test_tui_initializing_yields_processing_despite_idle_placeholder, issue #211) still passes. Update test_mcp_server_init_yields_processing — its old fixture asserted boot-line + "[developer] !>" should yield PROCESSING, but real captured Kiro logs show the [developer] prompt only renders AFTER init completes, so the assumption was incorrect. Replace the fixture with the actual placeholder text Kiro shows during boot, and add test_mcp_server_init_with_post_init_prompt_yields_idle to lock in the post-init redrawn-stale case (the yolo failure mode). Verified end-to-end: kiro e2e (yolo, --legacy-ui) 0/11 -> 11/11. * fix(status): sticky ready-status latching to stop 8KB-buffer eviction flap PR #273's event-driven pipeline derives status from a rolling 8KB FIFO buffer fed to per-provider get_status(). TUI redraws (status-bar refreshes, cursor positioning, footer repaints) keep emitting bytes for seconds AFTER the agent settles, eventually evicting the idle/response markers from the 8KB window. Without latching, status flaps rapidly between IDLE/COMPLETED and PROCESSING/UNKNOWN, and both wait_until_status (server-side) and the e2e tests' HTTP polling miss the brief ready windows — causing codex 60s init timeouts, gemini 240s init timeouts, and completion-timeout failures. StatusMonitor now refuses two downgrades once a ready status latches: - ready -> PROCESSING/UNKNOWN (typical buffer-eviction flap) - COMPLETED -> IDLE (codex's user-marker evicts before assistant's, so last_user is None and provider falls back to IDLE, silently overwriting COMPLETED) notify_input_sent() arms a one-shot revert gate so the latch releases when external input legitimately starts a new processing cycle: - terminal_service.send_input / send_special_key (runtime input) - each tested provider's initialize() before its launch keystrokes and bypass/trust prompt acknowledgements codex get_status now also handles the "long response evicted the user marker" case directly: when last_user is None, scan above the TUI footer for an assistant bullet and return COMPLETED instead of IDLE (was returning IDLE forever, which the COMPLETED->IDLE block above would otherwise cement as an off-by-one defence). gemini_cli now declares extraction_tail_lines = 5000 so its existing extraction_retries (3 x 10s) actually fires. The escalating-fetch path (200/500/1000/5000) ran each step once with no inter-step waits, so Ink-TUI redraws still rendering at extraction time fell through to [PARTIAL RESPONSE], leaving the TUI footer in the output and tripping the "? for shortcuts not in output" assertion. E2E results on PR #273 with these changes: codex 10/12 passing (1 pre-existing extraction issue, 1 xfail) claude 12/12 gemini 12/12 kiro 11/11 * fix(status): keep input-arm across ready→ready flaps; make latch state thread-safe Two hardening fixes to the sticky ready-status latch (PR #1): 1. Arm semantics: notify_input_sent()'s one-shot arm was consumed by ANY new ready latch, including a ready→ready downgrade flap (COMPLETED→IDLE when a large paste evicts the response markers, WAITING_USER_ANSWER→IDLE after a permission keystroke). Consuming the arm there blocks the genuine PROCESSING that follows, so the terminal reads ready while the agent is busy — and InboxService delivers on IDLE/COMPLETED, so a queued message could be pasted mid-response. The arm is now consumed only by a PROCESSING transition or a genuine non-ready→ready (init-style) latch. 2. Thread safety: the latch decision is a read-modify-write sequence (read armed → decide transition → consume arm) executed on the asyncio consumer, while notify_input_sent()/get_status()/clear_terminal() run on FastAPI threadpool, inbox delivery workers, and the cleanup thread. Guard StatusMonitor state with a lock (provider regex analysis and bus.publish stay outside it). Also covers the pre-existing Copilot review comments about unsynchronized _buffers/_last_status access. Adds a latching state-machine test suite (12 cases) pinning blocked downgrades, arm consumption, flap survival, and event publication. * fix(claude_code): live MCP tool call no longer misreads as COMPLETED Real failure from the supervisor-handoff e2e: mid-turn, Claude shows an interim completion summary from an earlier thinking phase, then keeps working — '● Calling cao-mcp-server… (ctrl+o to expand)' with a live '✢ Misting… (33s · ↑ 332 tokens)' spinner and a '⎿ Tip: …' hint line between the spinner and the input box. Two gaps combined into a false COMPLETED, which the StatusMonitor ready-latch then pinned until the next input, so the e2e test extracted mid-flight output: - _boxless_completion_tail treated ANY completion summary after the last separator as a finished turn. It now requires that no live spinner (glyph + gerund + ellipsis) renders after the summary — an interim summary followed by a fresh spinner keeps the turn PROCESSING. - The new-TUI box-spinner check looked only at the single line directly above the input box; '⎿ Tip:' hint lines and blanks render between the live spinner and the box, hiding it. The check now walks up to 4 lines, skipping ⎿ continuation lines and blanks. Regression tests reproduce both shapes (fail pre-fix) plus a guard that a summary with no following spinner still reads COMPLETED. * fix(kimi_cli): extract responses from the newest Kimi Code TUI The newest 'Kimi Code' TUI renders user messages as ✨-prefixed prompt lines (no ╭─ input box), responses as • bullets, and a '── input ──' rule plus status bar as footer. extract_last_message_from_script() preferred the last ╰─ box-end as the response anchor — which in the new TUI matches decorative boot banners (Kimi welcome box, FastMCP server banner) ABOVE the conversation — and only stopped at a bare idle prompt, which the new TUI never renders. Result: the 'response' was sliced from the boot screen and ran to end-of-capture (raw spinner frames, status bar), failing the supervisor-assign e2e. - Anchor on the LATEST marker (box-end vs prompt-with-input) instead of box-first: the response always follows the last user input, whichever marker style rendered it. - Stop the response at the new-TUI footer: the '── input ──' rule or the status-bar/context-footer line, in addition to the bare idle prompt. Fixtures are ground truth from a live Kimi Code session. * fix(events): address remaining Copilot review nits on PR #273 - LogWriter: open log files with explicit encoding='utf-8', errors='replace' — a non-UTF-8 platform locale (POSIX/C) would raise UnicodeEncodeError on the first unencodable chunk and stop log persistence for that terminal. - EventBus.set_loop: accept Optional[loop] (test fixtures already call set_loop(None)); publish() now guards against the loop being closed during shutdown instead of raising RuntimeError from a FIFO reader thread draining its last chunks. - BaseProvider.get_status docstring: soften the strip_terminal_escapes 'should' to MAY with an explicit note that kiro_cli intentionally preserves raw \r for permission-prompt detection and must not be 'fixed' to comply. The 'except Exception swallows CancelledError' comments are intentionally NOT addressed: since Python 3.8, asyncio.CancelledError derives from BaseException, so those loops do not swallow cancellation. * test(e2e): read kimi skill catalog from the agent-file system.md Kimi (v1.20+) launches via 'cd <temp_dir> && kimi --agent-file <temp_dir>/agent.yaml', writing the system prompt — including the injected skill catalog — to <temp_dir>/system.md. The catalog is no longer in the CLI command string, and the full-screen Kimi Code TUI clears the visible screen on startup, so asserting against raw tmux scrollback always failed. Parse the temp dir from the launch command (capture-pane -J so soft-wrapped lines can't split the path mid-token) and assert against system.md on disk — same pattern as the existing gemini GEMINI.md branch. * fix(kimi_cli): honest turn-in-flight detection for the newest Kimi Code TUI The newest TUI renders the live spinner BETWEEN the '── input ──' rule and the status bar, and repaints the status bar with every frame — so the spinner-vs-ready-chrome position compare read 'ready' mid-turn whenever a full footer repaint was the freshest chunk (measured: one supervisor turn flapped completed↔processing 29 times in a 57KB stream). The StatusMonitor ready-latch then pinned the first false COMPLETED ~130ms after dispatch, so supervisor flows extracted mid-flight output (supervisor-assign e2e). Detection now uses signals validated by replaying captured live streams: - Live-spinner lines: braille/moon glyphs (incl. tool-call lines like '⠹ Using handoff({…})'), EXCLUDING boot chrome ('⠧ MCP Servers: 0/1', '⠋ Loading configuration...') which legitimately shows braille while idle at the welcome screen. - In flight when a live spinner is within the freshest 15 lines OR renders after the last '•' bullet (covers chunk boundaries where streamed thinking text temporarily pushed the spinner out of the tail). - Dispatch grace: 5s after send_input() (mark_input_received now stamps the dispatch), bridging the paste→first-spinner-frame gap. - Rendered-pane confirmation: a ready-looking chunk boundary mid-turn is byte-identical to one at real completion (stale spinner ~21 lines back, bullets 2-3 from the end in BOTH), so when the stream says ready post-dispatch, confirm against the rendered pane — tmux's compositor has resolved every in-place redraw, so a visible spinner is live, not stale. Also raises the e2e SUPERVISOR_COMPLETION_TIMEOUT 300→600s: a kimi worker alone takes ~3m20s on the report-template handoff; 300s only looked sufficient while the false early COMPLETED was masking the real duration. E2E (live agents): kimi supervisor handoff + assign_and_handoff now pass; kimi handoff x2, send_message, skills unaffected and green. * fix(tools): close claude_code tool-restriction escapes via Task/Monitor/NotebookEdit The allowed-tools e2e caught restricted agents escaping their restrictions live: a reviewer (no Bash/Write) created the forbidden file anyway — first 'via a delegated subagent that ran the write through a shell command' (Task), then on retry via 'the Monitor tool' (background shell scripts). CAO computes --disallowedTools as (tool universe - allowed natives), and claude_code's universe only contained Bash/Read/Edit/Write/Glob/Grep — so the execution-capable tools outside it were never disallowed. Gate them by privilege equivalence: - execute_bash now maps Bash, BashOutput, KillShell, Task (subagents run with their own full toolset), and Monitor (runs arbitrary shell scripts). Anything these can do, Bash can too, so profiles allowed execute_bash lose nothing. - fs_write now maps NotebookEdit alongside Edit/Write (.ipynb writes). Known limitation: this is enumeration against an evolving toolset — each new execution-capable Claude Code tool needs adding here. A deny-by-default mechanism upstream would be the durable fix. E2E (live, herdr backend): all 4 TestClaudeCodeAllowedTools pass, including the two that previously demonstrated the escape. --------- Co-authored-by: Tuan Nguyen <tuankn@amazon.com> Co-authored-by: haofeif <56006724+haofeif@users.noreply.github.com> Co-authored-by: Tuan Nguyen <32879640+tuanknguyen@users.noreply.github.com> Co-authored-by: Feng, Haofei <haofeif@amazon.com>
Notes to Reviewers
This PR replaces the watchdog-based polling architecture with an event-driven pub/sub system for terminal output processing, status detection, and inbox message delivery. Terminal output now streams through named FIFOs into an in-process event bus, eliminating filesystem polling and expensive tmux subprocess calls.
event_bus.py,fifo_reader.py,status_monitor.py,log_writer.pybase.py, all 5 providers,terminal.py(utils)time.sleep→await asyncio.sleepthroughout init chaininbox_service.py,terminal_service.py,main.py(lifespan)session_service.py,cleanup_service.py,terminal_service.delete_terminalevent-driven-architecture.md,CODEBASE.md,constants.py,pyproject.tomltest_codex_provider_unit.py,test_gemini_cli_unit.py,test_inbox_service.pyStart here: Read
docs/event-driven-architecture.mdfor the full architecture overview, ASCII data-flow diagram, and component roles before diving into code.Key changes:
PollingObserverwith FIFO readers (named pipes +os.read()) for real-time terminal output streaminginitialize(),wait_for_shell(),wait_until_status())watchdogandaiofilesdependenciesBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.