Skip to content

Event driven architecture#115

Open
tuanknguyen wants to merge 17 commits into
mainfrom
feat/event-driven-messaging
Open

Event driven architecture#115
tuanknguyen wants to merge 17 commits into
mainfrom
feat/event-driven-messaging

Conversation

@tuanknguyen

Copy link
Copy Markdown
Contributor

Notes to Reviewers

This PR replaces the watchdog-based polling architecture with an event-driven pub/sub system for terminal output processing, status detection, and inbox message delivery. Terminal output now streams through named FIFOs into an in-process event bus, eliminating filesystem polling and expensive tmux subprocess calls.

Category Files Review Priority
Event bus core event_bus.py, fifo_reader.py, status_monitor.py, log_writer.py Critical — new infrastructure, threading/async boundary
Async conversion base.py, all 5 providers, terminal.py (utils) Hightime.sleepawait asyncio.sleep throughout init chain
Service rewrites inbox_service.py, terminal_service.py, main.py (lifespan) High — wiring changes, startup/shutdown lifecycle
Cleanup paths session_service.py, cleanup_service.py, terminal_service.delete_terminal Medium — FIFO + state cleanup on delete
Docs & config event-driven-architecture.md, CODEBASE.md, constants.py, pyproject.toml Low
Tests test_codex_provider_unit.py, test_gemini_cli_unit.py, test_inbox_service.py Low — adapted to new signatures

Start here: Read docs/event-driven-architecture.md for the full architecture overview, ASCII data-flow diagram, and component roles before diving into code.

Key changes:

  • Introduced an in-process EventBus with wildcard topic matching and thread-safe publishing
  • Replaced watchdog PollingObserver with FIFO readers (named pipes + os.read()) for real-time terminal output streaming
  • Added StatusMonitor (rolling 8KB buffer + provider pattern matching) as the single source of truth for terminal status
  • Added LogWriter consumer for debug log persistence
  • Refactored InboxService from file-watching handler to event-driven consumer
  • Converted the entire initialization chain to async (initialize(), wait_for_shell(), wait_until_status())
  • Removed watchdog and aiofiles dependencies

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@tuanknguyen tuanknguyen requested review from a team and haofeif March 13, 2026 02:10
@haofeif

haofeif commented Mar 13, 2026

Copy link
Copy Markdown
Contributor

@patricka3125 feel free to have a look at this PR too.

@haofeif haofeif added the enhancement New feature or request label Mar 13, 2026
@haofeif

haofeif commented Mar 13, 2026

Copy link
Copy Markdown
Contributor

@tuanknguyen is this document still correct ? the change wont impact the user experience right ? https://github.com/awslabs/cli-agent-orchestrator/tree/main/examples/assign ?

also would you mind fixing the unit testing errors ?

@tuanknguyen

Copy link
Copy Markdown
Contributor Author

@haofeif that's correct, this does not affect the assign or handoff or send message pattern at all. All unit tests also passed.

Separately, we need to consolidate the provider implementation. there's quite a bit of code duplication across the different providers

terminal_id = terminal_id_from_topic(event["topic"])
log_path = TERMINAL_LOG_DIR / f"{terminal_id}.log"
with open(log_path, "a") as f:
f.write(event["data"]["data"])

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could consider using asyncio.to_thread here to minimize blocking in event loop

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a sync file write inside an async loop, every output will block the event loop briefly

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! I'll update it.


Returns:
bool: True if a message was sent, False otherwise
def deliver_pending(self, terminal_id: str) -> None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we ever consider delivering all messages at once here? Or perhaps add some way for user to specify message size as a flag?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure thing. I'll add number of messages as an optional param default to 1

@haofeif

haofeif commented Mar 16, 2026

Copy link
Copy Markdown
Contributor

@tuanknguyen thanks for it. also it looks like the current change breaks the handoff function. In my test using code_supervisor, developer and reviewer agent. i found some issues (in kiro-cli)

  • code_supervisor is able to handoff to developer, but when it completes the job, the code_supervisor agent is not aware of its completion (i have to ctrl+C the handoff from supervisor agent and then tells it that it is completed)
  • code_supervisor then will handoff to reviewer, but no extra tmux agent is spinned up. It was hung as below messages and no further tmux session is created.
Let me retry the handoff to the Code Reviewer Agent.
Running tool handoff with the param (from mcp server: cao-mcp-server)
 ⋮  {
 ⋮    "agent_profile": "reviewer",
 ⋮    "message": "You are the Code Reviewer Agent. Please read the review task description at the following absolute path and perform a code review:\n\n/Users/haofeif/Amazon-WorkDocs/Code/AIMLAoD/AIAgents/OpenshiftToEKS/cli-agent-orchestrator/cli-agent-orchestrator/tasks/hello-world-review.md\n\nRead the code file referenced in the task and provide your review. End with a clear verdict: APPROVED or NEEDS CHANGES (with specific feedback)."
 ⋮  }
 

When testing the Assign examples, similar issues seem to be hanging the handoff session.

@haofeif

haofeif commented Mar 16, 2026

Copy link
Copy Markdown
Contributor

@tuanknguyen also i did the e2e test results for 3 providers, i guess let's test the kiro and claude code in your end and see how it goes

E2E Test Results — feat/event-driven-messaging

Summary

Provider main (baseline) feat/event-driven-messaging Regression
Kiro CLI 8/8 ✅ 0/8 ❌ Yes — COMPLETED never detected
Claude Code 8/8 ✅ 0/8 ❌ Yes — IDLE never detected during init
Kimi CLI 8/8 ✅ 6-7/8 ⚠️ Partial — 1 consistent fail, 1 flaky

Test Location

All E2E tests are in test/e2e/:

Test File Test Classes What It Tests
test_assign.py TestKiroCliAssign, TestClaudeCodeAssign, TestKimiCliAssign Create worker terminal, send task, verify COMPLETED, extract output
test_handoff.py TestKiroCliHandoff, TestClaudeCodeHandoff, TestKimiCliHandoff Create terminal, send task, poll for COMPLETED, extract response
test_send_message.py TestKiroCliSendMessage, TestClaudeCodeSendMessage, TestKimiCliSendMessage Create 2 terminals, send inbox message, verify delivery
test_supervisor_orchestration.py TestKiroCliSupervisorOrchestration, TestClaudeCodeSupervisorOrchestration, TestKimiCliSupervisorOrchestration Supervisor delegates via
handoff/assign MCP tools

Failure Details

Kiro CLI — 0/8 (all failed)

All 8 tests fail at wait_for_status(terminal_id, "completed"). Terminal creation and init succeed (IDLE detected), message is sent successfully, but COMPLETED is never detected after the agent finishes.

Test Error
test_assign.py::TestKiroCliAssign::test_assign_data_analyst Worker did not reach COMPLETED within 180s (provider=kiro_cli)
test_assign.py::TestKiroCliAssign::test_assign_report_generator Worker did not reach COMPLETED within 180s (provider=kiro_cli)
test_assign.py::TestKiroCliAssign::test_assign_with_callback Worker did not reach COMPLETED within 180s (provider=kiro_cli)
test_handoff.py::TestKiroCliHandoff::test_handoff_simple_function Terminal did not reach COMPLETED within 180s (provider=kiro_cli)
test_handoff.py::TestKiroCliHandoff::test_handoff_second_task` `Terminal did not reach COMPLETED within 180s (provider=kiro_cli)
test_send_message.py::TestKiroCliSendMessage::test_send_message_to_inbox Receiver should have transitioned from IDLE after inbox delivery within 60s, got: idle
test_supervisor_orchestration.py::TestKiroCliSupervisorOrchestration::test_supervisor_handoff Supervisor did not reach COMPLETED within 300s. Last status: idle
test_supervisor_orchestration.py::TestKiroCliSupervisorOrchestration::test_supervisor_assign_and_handoff Supervisor did not reach COMPLETED within 300s. Last status: idle

Possible Root cause: StatusMonitor returns idle instead of completed. Kiro CLI's get_status() needs both a green arrow pattern (response) AND an idle prompt after it to return COMPLETED. The 8KB rolling buffer loses the green arrow as the response grows, leaving only the idle prompt → returns IDLE.

Claude Code — 0/8 (all failed)

All 8 tests fail at create_terminal() — the terminal never initializes.

Test Error
test_assign.py::TestClaudeCodeAssign::test_assign_data_analyst Claude Code initialization timed out after 30 seconds
test_assign.py::TestClaudeCodeAssign::test_assign_report_generator Claude Code initialization timed out after 30 seconds
test_assign.py::TestClaudeCodeAssign::test_assign_with_callback Claude Code initialization timed out after 30 seconds
test_handoff.py::TestClaudeCodeHandoff::test_handoff_simple_function Claude Code initialization timed out after 30 seconds
test_handoff.py::TestClaudeCodeHandoff::test_handoff_second_task Claude Code initialization timed out after 30 seconds
test_send_message.py::TestClaudeCodeSendMessage::test_send_message_to_inbox Claude Code initialization timed out after 30 seconds
test_supervisor_orchestration.py::TestClaudeCodeSupervisorOrchestration::test_supervisor_handoff Claude Code initialization timed out after 30 seconds
test_supervisor_orchestration.py::TestClaudeCodeSupervisorOrchestration::test_supervisor_assign_and_handoff Claude Code initialization timed out after 30 seconds

Root cause: StatusMonitor never detects IDLE during provider.initialize(). The FIFO → EventBus → StatusMonitor pipeline doesn't deliver output fast enough (or at all) for Claude Code's get_status() to match the IDLE prompt pattern within the 30s timeout.

Architectural Possible Root Cause

On main, get_terminal() calls provider.get_status() which reads fresh tmux scrollback (tmux capture-pane) on every poll — full history available.

On feat/event-driven-messaging, get_terminal() calls status_monitor.get_status() which returns a cached status derived from an 8KB rolling buffer fed by the FIFO pipeline.

This causes two problems:

  1. Buffer truncation: Long agent responses push early patterns (e.g., Kiro CLI's green arrow) out of the 8KB window, breaking COMPLETED detection
  2. Pipeline timing: The FIFO → EventBus → StatusMonitor pipeline may not deliver output fast enough during provider initialization, breaking IDLE detection for Claude Code

@tuanknguyen

Copy link
Copy Markdown
Contributor Author

@haofeif thanks for flagging the issue. I noticed that the issue was with the regex of the COMPLETE status detection. We now read from the buffer directly which means that there are raw control characters or ANSI escape. It's not due to sequencing or event bus not delivering fast enough. I'll update and push the changes.

@patricka3125

Copy link
Copy Markdown
Collaborator

Any updates on the progress of this PR? I have also been ecountering issues mentioned in #131 on a main branch build and this seems a promising fix 👀

@haofeif

haofeif commented Apr 7, 2026

Copy link
Copy Markdown
Contributor

Any updates on the progress of this PR? I have also been ecountering issues mentioned in #131 on a main branch build and this seems a promising fix 👀

yes this will be continued worked on

call-me-ram added a commit to call-me-ram/cli-agent-orchestrator that referenced this pull request Jun 3, 2026
Bring the event-driven architecture branch up to date with main (98
commits) and reconcile the rewrite with features that landed after it
forked: eager inbox delivery (awslabs#251), the OpenCode poller, env-var
forwarding (awslabs#259), memory curation (awslabs#254/awslabs#262), CORS auto-derive (awslabs#261),
DNS host validation (awslabs#124), and the self-send guard (awslabs#24).

Highlights:
- Providers adopt the async initialize() + get_status(buffer) contract;
  copilot_cli/opencode_cli converted; kiro keeps colour-only ANSI
  stripping so carriage-return-redraw permission prompts aren't misread
  as idle.
- Event-driven InboxService.deliver_pending with the awslabs#251 eager gate and
  message-sender attribution; OpenCode poller retained as a status-driven
  method; the watchdog (PollingObserver/LogFileHandler) is removed.
- terminal_service.create_terminal is async (FIFO + StatusMonitor wiring);
  session_service.create_session, flow_service.execute_flow, the API
  endpoints, and `cao flow run` updated to await.
- memory_service curated path and the flow CLI fixed to the new contract.

Full unit suite green (1908 passed); black + isort clean.
haofeif added a commit that referenced this pull request Jun 4, 2026
…266)

Inbox messages could stay PENDING forever when the receiving terminal was
already idle: the single immediate (on POST) delivery attempt can observe a
stale status and skip, and the log-watching observer never fires again
because an idle agent produces no new log output. With both fast paths
missed there was no fallback (issue #131).

Add a provider-agnostic reconciliation daemon that runs on a slower interval
than the watchdog and re-attempts delivery for any message left PENDING past
a grace window. The grace window keeps the sweep from competing with the
immediate and watchdog paths for freshly queued messages — it only adopts
ones those paths have already missed.

- list_pending_receiver_ids_older_than: provider-agnostic query for stuck
  receivers, joined against terminals so deleted receivers are skipped; the
  cutoff uses local-naive datetime.now() to match InboxModel.created_at.
- reconcile_orphaned_messages: sweep body, reusing check_and_send_pending_messages.
- inbox_reconciliation_daemon: background loop wired into the server lifespan.
- INBOX_RECONCILE_INTERVAL / INBOX_RECONCILE_GRACE_SECONDS constants.

The sweep reuses the existing delivery helper and so shares its known
duplicate-wakeup race; making delivery atomic is left to the unified delivery
engine tracked in GH #115. Documented in docs/inbox-delivery.md.

Fixes #131

Co-authored-by: haofeif <56006724+haofeif@users.noreply.github.com>
haofeif added a commit that referenced this pull request Jun 11, 2026
…ues #115) (#273)

* refactored but wait_for_* not working

* working

* merge from main

* fix merge conflicts

* update docs and tests

* update docs

* rebase and update docs and tests

* update tests

* clean up routine

* formatting and update kimi_cli to event-driven

* fix kiro_cli answer marker pattern

* fix(events): address review feedback on the event-driven pipeline

Resolve four review comments on PR #273:

- InboxService.run() ran deliver_pending() synchronously inside the
  asyncio consumer loop, blocking it on DB + tmux I/O and starving the
  StatusMonitor/LogWriter consumers. Offload delivery via
  asyncio.to_thread, matching the threading discipline documented for
  the event bus and the existing OpenCode poller.
- Event-driven deliveries were started without a PluginRegistry, so they
  skipped PostSendMessageEvent hooks. Thread the registry through
  run() -> deliver_pending so status-driven deliveries get the same
  plugin attribution as the immediate and OpenCode-poller paths.
- FifoManager.stop_reader() early-returned when no in-memory reader was
  tracked, never unlinking the FIFO. Retention cleanup after a restart
  (empty _readers) would leak *.fifo files unbounded. Always best-effort
  unlink the FIFO; only log when a reader was actually stopped.
- Docs claimed StatusMonitor falls back to a generic shell-prompt pattern
  before init; _detect_status returns UNKNOWN until a provider registers.
  Correct the docs to match.

Adds regression tests for the registry threading, the to_thread offload,
and the stale-FIFO unlink.

* fix(events): address Copilot review on PR #273

- flow_service: stop FIFO readers and clear StatusMonitor buffers when
  recycling an existing flow session, preventing leaked reader threads
  and stale *.fifo files across repeated flow runs.
- flow CLI run: bootstrap the in-process event pipeline (bus loop +
  StatusMonitor/LogWriter consumers) so execute_flow's async
  create_terminal/initialize path doesn't hang waiting for status.
- inbox_service: deliver pending messages in contiguous same-sender runs
  so PostSendMessageEvent attribution stays correct when a batch spans
  multiple senders (num_messages=0).
- copilot_cli tests: replace no-op initialize test with real async
  success and trust-prompt coverage; drop dead placeholder.
- docs: correct CODEBASE.md inbox flow (deliver_pending, not the removed
  check_and_send_pending_messages); document that OutputMode.FULL returns
  the bounded StatusMonitor buffer, not unbounded scrollback.

* fix(status): raw-stream-safe get_status + newest Claude Code TUI support

Addresses haofeif's PR #273 review (raw FIFO buffer fed to get_status) plus
newest-Claude-Code TUI changes found while reproducing it locally.

- claude_code/codex/kimi get_status: strip terminal escapes
  (utils.text.strip_terminal_escapes) before structural checks, so the
  cursor-positioning / in-place-redraw sequences in the raw pipe-pane stream no
  longer break detection. Document the raw-buffer input contract on
  BaseProvider.get_status.
- claude_code: gated support for the newest Claude Code TUI (the box pattern
  gates it, so older builds keep the legacy logic): live spinner renders ABOVE a
  box-drawn input prompt; completion shows '* <Verb>ed for Ns' instead of the
  old marker. Detect both PROCESSING and COMPLETED there.
- kiro_cli: clear the StatusMonitor buffer (new StatusMonitor.reset_buffer) on
  the TUI -> --legacy-ui fallback so the retry is not derived from stale bytes.
- input submission: newest Claude Code swallows an Enter sent too soon after a
  bracketed paste; make the post-paste delay provider-tunable
  (BaseProvider.paste_submit_delay; Claude overrides to 2.0s).
- create_terminal: nudge the shell after pipe-pane attaches so the prompt is
  captured (fast shells drew it before capture started, timing out wait_for_shell).

Full existing unit suite stays green (2012 passed); changes are additive.

* fix(claude_code): detect newest Claude Code TUI completion + asterisk spinner in raw stream

* fix(claude_code): re-insert spaces for column-positioned text so completion summary is detected

The newest Claude Code TUI sometimes redraws the completion summary and spinner
with CHA/CUF cursor-positioning escapes instead of literal spaces (e.g.
'✻\x1b[3GWorked\x1b[10Gfor\x1b[14G3s'). strip_terminal_escapes dropped those moves
with no space, gluing words ('Workedfor3s') so COMPLETION_SUMMARY_PATTERN and the
spinner check never matched — get_status stayed IDLE through a finished turn and
handoff/poll_until_done timed out. Replace forward horizontal cursor moves with a
single space. Verified against three real handoff captures + full suite green.

* fix(claude_code): recognize newest-TUI '●' response glyph in message extraction

The newest Claude Code TUI renders the response marker as '●' (U+25CF) instead of
'⏺' (U+23FA), so extract_last_message_from_script raised 'No Claude Code response
found' on a finished handoff even after COMPLETED was detected. Add a line-start-
anchored EXTRACTION_RESPONSE_PATTERN matching both glyphs (anchored so the footer
effort indicator '… esc to interrupt ● high · /effort' is not mistaken for a
marker), and trim the '✻ Worked for Ns' completion stat. RESPONSE_PATTERN used by
get_status is left ⏺-only so the legacy COMPLETED check cannot fire mid-stream.

* fix(claude_code): boxless completion summary after the last separator wins over a stale spinner

When the newest TUI repaints a finished turn boxless ('✻ <Verb>ed for Ns' + ❯)
below the previous frame's box separator, a stale spinner remained above that
separator. The spinner-before-separator PRIMARY walk fired PROCESSING off the
stale spinner and never reached the COMPLETED branch, so a real handoff stayed
PROCESSING until timeout. Skip that walk when a completion summary appears after
the last separator (the boxless-redraw signature; genuine processing only has the
footer there). Verified against four real handoff captures.

* fix(claude_code): COMPLETED via ● response marker, robust to clipped/missing completion summary

The newest TUI's completion summary is fragile in the raw stream: the duration
can be clipped ('✻ Crunched for ' with no Ns) or rendered on a ·/* glyph frame
the summary pattern excludes, leaving a finished turn stuck at IDLE. After the
PROCESSING checks (which already rule out a LIVE spinner), treat a completion
summary OR a start-of-line response marker (⏺/●) plus a visible prompt as
COMPLETED — dropping the position-based spinner-freshness guard that mis-counted a
stale spinner. Verified against four real handoff captures (greet + multiply).

* fix(claude_code): tolerate clipped completion summary ('✻ Crunched for ' no duration)

The newest TUI sometimes writes the completion summary's duration via a separate
cursor-positioned write that the raw stream splits off, leaving '✻ Crunched for '
with no 'Ns'. The strict COMPLETION_SUMMARY_PATTERN (requires 'for Ns') then
failed to recognize completion, so the spinner-before-separator walk reported
PROCESSING off a stale spinner and a finished handoff intermittently stuck at
PROCESSING/IDLE until timeout. Add GET_STATUS_COMPLETION_PATTERN (glyph + 'for',
no ellipsis, no digits required) for get_status detection + the boxless-tail
guard; extraction keeps the strict pattern. Verified: six real handoff captures
all settle COMPLETED; live-processing and compaction still PROCESSING.

* fix(kimi_cli): support the redesigned "Kimi Code" TUI in get_status

The newest Kimi CLI (the "Kimi Code" rebuild) replaced the ✨/💫 emoji
prompt with a boxed input area ("── input ──"), an "agent (<model> ●)"
status bar, a "context: N%" usage footer, and a braille-spinner working
indicator ("⠧ Thinking… Ns · N tokens"). The legacy detector keyed on the
emoji prompt, so it never observed IDLE and every Kimi terminal timed out
at init — this was the failure flagged on #273 (now confirmed to be a
Kimi-side TUI redesign, not a regression from the event-driven refactor).

get_status now detects the new TUI (gated on the new status/footer markers
so legacy emoji builds are unchanged):
- READY = status bar / context footer present with no live spinner.
- PROCESSING vs READY is decided by POSITION (last braille spinner line vs
  last ready-chrome line), so stale spinner frames lingering in the 8KB
  rolling buffer don't pin a finished turn at PROCESSING.
- COMPLETED vs IDLE latches on a "•" bullet (a turn produced output); the
  welcome banner / update nag have none, so a fresh terminal reads IDLE and
  avoids a premature-completion race when the first task is sent.

Adds regression tests driven by real captured raw pipe-pane buffers
(idle / processing / completed) of the new TUI.

Also enables Claude <-> Kimi cross-provider e2e:
- examples/cross-provider/data_analyst_kimi_cli.md (provider: kimi_cli).
- TestCrossProviderClaudeToKimi / TestCrossProviderKimiToClaude.
- Fix the cross-provider test helper to resolve the worker's provider from
  its profile (the add-terminal endpoint treats an explicit provider param
  as authoritative, which silently suppressed the profile override).

Verified on real CLIs (kimi 1.47.0 + claude 2.1.x) against a cao-server
built from this branch:
- TestKimiCliHandoff (2) + TestKimiCliAssign (3): pass.
- Claude->Kimi and Kimi->Claude cross-provider assign: pass.
Full unit suite green (2202 passed); mypy / black / isort clean.

* fix(herdr): make status detection backend-aware so agent init works on herdr

The event-driven rewrite made the FIFO -> EventBus -> StatusMonitor pipeline the
sole source of shell-readiness and terminal status. The herdr backend deliberately
skips that pipeline (its pipe_pane is a no-op; no FIFO reader is started for it,
since it delivers via socket events), so for a herdr terminal the StatusMonitor
buffer stayed "" and status stayed UNKNOWN forever. As a result provider.initialize()
timed out in wait_for_shell()/wait_until_status(), and every status read (API
GET /terminals/{id}, busy checks, provider init loops) returned UNKNOWN -- terminal
creation was broken for essentially every provider on herdr.

Fix at the single chokepoint instead of per-call-site:
- status_monitor.get_status() is now backend-aware: for event-inbox backends it
  derives status on demand from the provider's get_status() (which consults
  backend.get_native_status()). This fixes all read sites at once -- API status,
  wait_until_status(), flow busy checks, curator liveness, and the copilot/gemini
  init loops -- without each having to special-case the backend.
- wait_for_shell() reads backend.get_history() directly for event-inbox backends
  rather than the never-populated StatusMonitor buffer.
- wait_until_status() reverts to simply polling the now backend-aware get_status(),
  removing the duplicated backend logic.

The tmux path is unchanged (supports_event_inbox() is False -> same pushed-status
read as before). Verified end-to-end against a real herdr 0.6.8 server: a live
Claude Code agent launched in a herdr pane, reached idle, and completed a task.
Adds test/services/test_status_monitor.py and herdr cases to test/utils/test_terminal.py.

* fix(text): normalise CUP-to-column-1 escape so codex idle prompt detects

The event-driven status pipeline strips terminal escapes from the raw
8KB FIFO buffer to feed provider.get_status(). _LINE_START_CSI in
strip_terminal_escapes already turned CHA (\x1b[1G) and CNL (\x1b[E)
into \n so per-line patterns work, but missed CUP — Cursor Position
to column 1 (\x1b[<row>;1H).

Codex's TUI lays out its bottom prompt + status bar via CUP rather
than CHA: \x1b[46;1H› places the idle ``›`` at column 1 of row 46.
Without normalising this, the ``›`` glyph stayed glued mid-stream,
e.g.

    ›Improve documentation in @filenameopenai.gpt-5.5 medium · /tmp/...

The per-line check at codex.py:get_status:370 only inspects the bottom
five lines for ``\s*(?:❯|›|codex>)``, so the prompt was never detected
and codex sessions reported PROCESSING forever — every codex e2e test
hit "Codex initialization timed out after 60 seconds" (0/12 passing).

Extend _LINE_START_CSI to also match \x1b[\d+;1H. After the fix the
›-prefixed idle prompt sits on its own line, the existing
IDLE_PROMPT_PATTERN check matches, and codex idle detection works:
in replay against captured FIFO logs, 14/15 codex sessions reach idle
correctly. Remaining e2e failures are MCP startup >60s and OpenAI
``stream disconnected before completion`` API errors — outside the
scope of status detection.

Add a regression test covering both \x1b[46;1H› (codex's bottom
prompt) and the bare \x1b[1;1Hb form.

* fix(kiro_cli): position-aware Initializing/MCP-init check fixes yolo on event-driven pipeline

Under tmux capture-pane, a TUI that redraws over its boot screen hides
"Initializing..." and the MCP-init line — Kiro's TUI -> --legacy-ui
fallback path also calls StatusMonitor.reset_buffer when retrying so
the rolling buffer stays clean. Yolo mode does NOT take that fallback
path: it forces --legacy-ui directly at launch, so its rolling FIFO
buffer keeps the boot bytes forever even after kiro has redrawn over
them and the actual interactive prompt is showing below.

Until now get_status returned PROCESSING unconditionally whenever
TUI_INITIALIZING_PATTERN matched anywhere in the buffer. Yolo
sessions therefore reported PROCESSING for the entire session and
wait_until_status({IDLE, COMPLETED}) timed out — kiro yolo e2e
went from passing to 0/11 under the new event-driven pipeline (#273).

Make Check 0 position-aware: only return PROCESSING from a matching
TUI_INITIALIZING_PATTERN when no real ``[agent] >`` idle prompt
appears AFTER the last init match. The new-TUI placeholder
("Ask a question or describe a task") is intentionally NOT counted
as a real idle prompt because the new TUI renders it during boot;
that pre-existing test
(test_tui_initializing_yields_processing_despite_idle_placeholder,
issue #211) still passes.

Update test_mcp_server_init_yields_processing — its old fixture
asserted boot-line + "[developer] !>" should yield PROCESSING, but
real captured Kiro logs show the [developer] prompt only renders
AFTER init completes, so the assumption was incorrect. Replace the
fixture with the actual placeholder text Kiro shows during boot,
and add test_mcp_server_init_with_post_init_prompt_yields_idle to
lock in the post-init redrawn-stale case (the yolo failure mode).

Verified end-to-end: kiro e2e (yolo, --legacy-ui) 0/11 -> 11/11.

* fix(status): sticky ready-status latching to stop 8KB-buffer eviction flap

PR #273's event-driven pipeline derives status from a rolling 8KB FIFO
buffer fed to per-provider get_status(). TUI redraws (status-bar
refreshes, cursor positioning, footer repaints) keep emitting bytes
for seconds AFTER the agent settles, eventually evicting the
idle/response markers from the 8KB window. Without latching, status
flaps rapidly between IDLE/COMPLETED and PROCESSING/UNKNOWN, and both
wait_until_status (server-side) and the e2e tests' HTTP polling miss
the brief ready windows — causing codex 60s init timeouts, gemini 240s
init timeouts, and completion-timeout failures.

StatusMonitor now refuses two downgrades once a ready status latches:
  - ready -> PROCESSING/UNKNOWN (typical buffer-eviction flap)
  - COMPLETED -> IDLE (codex's user-marker evicts before assistant's,
    so last_user is None and provider falls back to IDLE, silently
    overwriting COMPLETED)

notify_input_sent() arms a one-shot revert gate so the latch releases
when external input legitimately starts a new processing cycle:
  - terminal_service.send_input / send_special_key (runtime input)
  - each tested provider's initialize() before its launch keystrokes
    and bypass/trust prompt acknowledgements

codex get_status now also handles the "long response evicted the user
marker" case directly: when last_user is None, scan above the TUI
footer for an assistant bullet and return COMPLETED instead of IDLE
(was returning IDLE forever, which the COMPLETED->IDLE block above
would otherwise cement as an off-by-one defence).

gemini_cli now declares extraction_tail_lines = 5000 so its existing
extraction_retries (3 x 10s) actually fires. The escalating-fetch path
(200/500/1000/5000) ran each step once with no inter-step waits, so
Ink-TUI redraws still rendering at extraction time fell through to
[PARTIAL RESPONSE], leaving the TUI footer in the output and tripping
the "? for shortcuts not in output" assertion.

E2E results on PR #273 with these changes:
  codex      10/12 passing (1 pre-existing extraction issue, 1 xfail)
  claude     12/12
  gemini     12/12
  kiro       11/11

* fix(status): keep input-arm across ready→ready flaps; make latch state thread-safe

Two hardening fixes to the sticky ready-status latch (PR #1):

1. Arm semantics: notify_input_sent()'s one-shot arm was consumed by ANY
   new ready latch, including a ready→ready downgrade flap (COMPLETED→IDLE
   when a large paste evicts the response markers, WAITING_USER_ANSWER→IDLE
   after a permission keystroke). Consuming the arm there blocks the
   genuine PROCESSING that follows, so the terminal reads ready while the
   agent is busy — and InboxService delivers on IDLE/COMPLETED, so a queued
   message could be pasted mid-response. The arm is now consumed only by a
   PROCESSING transition or a genuine non-ready→ready (init-style) latch.

2. Thread safety: the latch decision is a read-modify-write sequence
   (read armed → decide transition → consume arm) executed on the asyncio
   consumer, while notify_input_sent()/get_status()/clear_terminal() run on
   FastAPI threadpool, inbox delivery workers, and the cleanup thread.
   Guard StatusMonitor state with a lock (provider regex analysis and
   bus.publish stay outside it). Also covers the pre-existing Copilot
   review comments about unsynchronized _buffers/_last_status access.

Adds a latching state-machine test suite (12 cases) pinning blocked
downgrades, arm consumption, flap survival, and event publication.

* fix(claude_code): live MCP tool call no longer misreads as COMPLETED

Real failure from the supervisor-handoff e2e: mid-turn, Claude shows an
interim completion summary from an earlier thinking phase, then keeps
working — '● Calling cao-mcp-server… (ctrl+o to expand)' with a live
'✢ Misting… (33s · ↑ 332 tokens)' spinner and a '⎿ Tip: …' hint line
between the spinner and the input box. Two gaps combined into a false
COMPLETED, which the StatusMonitor ready-latch then pinned until the next
input, so the e2e test extracted mid-flight output:

- _boxless_completion_tail treated ANY completion summary after the last
  separator as a finished turn. It now requires that no live spinner
  (glyph + gerund + ellipsis) renders after the summary — an interim
  summary followed by a fresh spinner keeps the turn PROCESSING.
- The new-TUI box-spinner check looked only at the single line directly
  above the input box; '⎿ Tip:' hint lines and blanks render between the
  live spinner and the box, hiding it. The check now walks up to 4 lines,
  skipping ⎿ continuation lines and blanks.

Regression tests reproduce both shapes (fail pre-fix) plus a guard that a
summary with no following spinner still reads COMPLETED.

* fix(kimi_cli): extract responses from the newest Kimi Code TUI

The newest 'Kimi Code' TUI renders user messages as ✨-prefixed prompt
lines (no ╭─ input box), responses as • bullets, and a '── input ──' rule
plus status bar as footer. extract_last_message_from_script() preferred
the last ╰─ box-end as the response anchor — which in the new TUI matches
decorative boot banners (Kimi welcome box, FastMCP server banner) ABOVE
the conversation — and only stopped at a bare idle prompt, which the new
TUI never renders. Result: the 'response' was sliced from the boot screen
and ran to end-of-capture (raw spinner frames, status bar), failing the
supervisor-assign e2e.

- Anchor on the LATEST marker (box-end vs prompt-with-input) instead of
  box-first: the response always follows the last user input, whichever
  marker style rendered it.
- Stop the response at the new-TUI footer: the '── input ──' rule or the
  status-bar/context-footer line, in addition to the bare idle prompt.

Fixtures are ground truth from a live Kimi Code session.

* fix(events): address remaining Copilot review nits on PR #273

- LogWriter: open log files with explicit encoding='utf-8',
  errors='replace' — a non-UTF-8 platform locale (POSIX/C) would raise
  UnicodeEncodeError on the first unencodable chunk and stop log
  persistence for that terminal.
- EventBus.set_loop: accept Optional[loop] (test fixtures already call
  set_loop(None)); publish() now guards against the loop being closed
  during shutdown instead of raising RuntimeError from a FIFO reader
  thread draining its last chunks.
- BaseProvider.get_status docstring: soften the strip_terminal_escapes
  'should' to MAY with an explicit note that kiro_cli intentionally
  preserves raw \r for permission-prompt detection and must not be
  'fixed' to comply.

The 'except Exception swallows CancelledError' comments are intentionally
NOT addressed: since Python 3.8, asyncio.CancelledError derives from
BaseException, so those loops do not swallow cancellation.

* test(e2e): read kimi skill catalog from the agent-file system.md

Kimi (v1.20+) launches via 'cd <temp_dir> && kimi --agent-file
<temp_dir>/agent.yaml', writing the system prompt — including the
injected skill catalog — to <temp_dir>/system.md. The catalog is no
longer in the CLI command string, and the full-screen Kimi Code TUI
clears the visible screen on startup, so asserting against raw tmux
scrollback always failed. Parse the temp dir from the launch command
(capture-pane -J so soft-wrapped lines can't split the path mid-token)
and assert against system.md on disk — same pattern as the existing
gemini GEMINI.md branch.

* fix(kimi_cli): honest turn-in-flight detection for the newest Kimi Code TUI

The newest TUI renders the live spinner BETWEEN the '── input ──' rule and
the status bar, and repaints the status bar with every frame — so the
spinner-vs-ready-chrome position compare read 'ready' mid-turn whenever a
full footer repaint was the freshest chunk (measured: one supervisor turn
flapped completed↔processing 29 times in a 57KB stream). The StatusMonitor
ready-latch then pinned the first false COMPLETED ~130ms after dispatch,
so supervisor flows extracted mid-flight output (supervisor-assign e2e).

Detection now uses signals validated by replaying captured live streams:

- Live-spinner lines: braille/moon glyphs (incl. tool-call lines like
  '⠹ Using handoff({…})'), EXCLUDING boot chrome ('⠧ MCP Servers: 0/1',
  '⠋ Loading configuration...') which legitimately shows braille while
  idle at the welcome screen.
- In flight when a live spinner is within the freshest 15 lines OR renders
  after the last '•' bullet (covers chunk boundaries where streamed
  thinking text temporarily pushed the spinner out of the tail).
- Dispatch grace: 5s after send_input() (mark_input_received now stamps
  the dispatch), bridging the paste→first-spinner-frame gap.
- Rendered-pane confirmation: a ready-looking chunk boundary mid-turn is
  byte-identical to one at real completion (stale spinner ~21 lines back,
  bullets 2-3 from the end in BOTH), so when the stream says ready
  post-dispatch, confirm against the rendered pane — tmux's compositor has
  resolved every in-place redraw, so a visible spinner is live, not stale.

Also raises the e2e SUPERVISOR_COMPLETION_TIMEOUT 300→600s: a kimi worker
alone takes ~3m20s on the report-template handoff; 300s only looked
sufficient while the false early COMPLETED was masking the real duration.

E2E (live agents): kimi supervisor handoff + assign_and_handoff now pass;
kimi handoff x2, send_message, skills unaffected and green.

* fix(tools): close claude_code tool-restriction escapes via Task/Monitor/NotebookEdit

The allowed-tools e2e caught restricted agents escaping their restrictions
live: a reviewer (no Bash/Write) created the forbidden file anyway — first
'via a delegated subagent that ran the write through a shell command'
(Task), then on retry via 'the Monitor tool' (background shell scripts).

CAO computes --disallowedTools as (tool universe - allowed natives), and
claude_code's universe only contained Bash/Read/Edit/Write/Glob/Grep — so
the execution-capable tools outside it were never disallowed. Gate them by
privilege equivalence:

- execute_bash now maps Bash, BashOutput, KillShell, Task (subagents run
  with their own full toolset), and Monitor (runs arbitrary shell scripts).
  Anything these can do, Bash can too, so profiles allowed execute_bash
  lose nothing.
- fs_write now maps NotebookEdit alongside Edit/Write (.ipynb writes).

Known limitation: this is enumeration against an evolving toolset — each
new execution-capable Claude Code tool needs adding here. A deny-by-default
mechanism upstream would be the durable fix.

E2E (live, herdr backend): all 4 TestClaudeCodeAllowedTools pass, including
the two that previously demonstrated the escape.

---------

Co-authored-by: Tuan Nguyen <tuankn@amazon.com>
Co-authored-by: haofeif <56006724+haofeif@users.noreply.github.com>
Co-authored-by: Tuan Nguyen <32879640+tuanknguyen@users.noreply.github.com>
Co-authored-by: Feng, Haofei <haofeif@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants