fix(inbox): mark messages DELIVERED before send_input to stop double delivery#265
Conversation
…delivery check_and_send_pending_messages() updated the message status to DELIVERED only after calling terminal_service.send_input(). send_input() types into the tmux pane, which writes to the terminal log file and re-triggers the watchdog on_modified handler. That re-entrant delivery check still saw the message as PENDING and delivered it a second time, corrupting agent input streams in multi-agent workflows. Move the DELIVERED status update ahead of send_input() so the watchdog re-fire finds no pending message. The send_input() failure path still resets the status to FAILED. Add a regression test asserting DELIVERED is recorded before send_input(), and update the failure-path test for the new ordering. Fixes awslabs#164
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #265 +/- ##
=======================================
Coverage ? 92.19%
=======================================
Files ? 69
Lines ? 6833
Branches ? 0
=======================================
Hits ? 6300
Misses ? 533
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Fixes a re-entrant inbox delivery race where terminal_service.send_input() triggers the watchdog (via terminal log writes) while the message is still PENDING, causing the same inbox message to be delivered twice.
Changes:
- Update inbox delivery flow to mark messages
DELIVEREDbefore callingterminal_service.send_input()to close the watchdog re-entrancy window. - Update the failure-path unit test to expect the new
DELIVERED → FAILEDtransition ordering whensend_input()raises. - Add a regression unit test asserting
DELIVEREDis recorded beforesend_input()is invoked (Issue #164).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/cli_agent_orchestrator/services/inbox_service.py |
Moves the DELIVERED status update ahead of send_input() to prevent double delivery from watchdog re-entry. |
test/services/test_inbox_service.py |
Adds regression coverage for call ordering and updates the send-failure expectation for the new status transition sequence. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Reconcile the event-driven inbox rewrite with awslabs#265 and awslabs#266, which just landed on main and edit the same files: - inbox_service.py: keep the event-driven InboxService, but apply awslabs#265's ordering — mark messages DELIVERED before send_input so output echoed back through the FIFO/StatusMonitor pipeline can't re-enter deliver_pending and double-deliver a still-PENDING message (awslabs#164). Add awslabs#266's reconcile_orphaned_messages as a method that routes stale PENDING receivers back through deliver_pending (awslabs#131). - api/main.py: keep the event-bus consumers + OpenCode poller; add awslabs#266's inbox_reconciliation_daemon task (start + cancel). Drop awslabs#266's watchdog PollingObserver wiring — the event bus replaced it here. - constants.py: keep both INBOX_POLLING_INTERVAL and awslabs#266's INBOX_RECONCILE_INTERVAL/GRACE; reword the comments off the watchdog. - tests: adapt awslabs#265's ordering regression and awslabs#266's reconcile + lifespan tests to the event-driven class API (deliver_pending, status_monitor, the singleton's bound method); drop the watchdog LogFileHandler tests. Full unit suite green: 1958 passed, 1 skipped.
…ery (#271) * feat(herdr): terminal backend abstraction, HerdrBackend, and inbox service Implements the herdr-backend-support OpenSpec change: TerminalBackend ABC, TmuxBackend wrapper, HerdrBackend (CLI + socket API), BackendFactory with config-driven selection, native status detection, and HerdrInboxService (socket event-based inbox delivery, not yet wired into the app lifespan). Parks work-in-progress ahead of the fix-herdr-inbox-delivery-wiring change, which wires HerdrInboxService into the running app and fixes the eager delivery modal-injection defect. * fix(herdr): wire inbox lifespan + fix eager gate Two bugs prevented inbox message delivery, plus the supporting wiring to make herdr-backed delivery work end to end. Bug A — eager delivery eligibility gate (inbox_service.py): WAITING_USER_ANSWER was incorrectly included in the eager delivery eligibility check. A terminal waiting on a user answer is not buffering input for a next turn, so delivering eagerly in that state is wrong. Restrict eager eligibility to PROCESSING only. Bug B — HerdrInboxService never wired into lifespan (api/main.py): Under the herdr backend the service was never instantiated or started, so zero messages were delivered. The lifespan now selects the delivery mechanism by backend: HerdrInboxService (socket events) for HerdrBackend, PollingObserver (file polling) otherwise, with matching shutdown for each. Supporting wiring: - Add HerdrInboxRegistry (module-level get/set singleton) with a TYPE_CHECKING-guarded import to break the circular import chain between terminal_service and herdr_inbox_service. - Connect terminal create/delete to register/unregister the terminal with the herdr inbox service; failures are logged, not fatal. - Expose HerdrBackend.herdr_session for the service to bind its socket. - Fix the async seam in HerdrInboxService: capture the running loop in start() and use run_coroutine_threadsafe() for pane subscription, since register_terminal() is called from the synchronous create_terminal path. - Clear the HerdrInboxService singleton on shutdown (set_herdr_inbox_service now accepts None) so the registry does not retain a stale reference after the service task is cancelled, keeping startup and shutdown symmetric. Update providers/base.py and constants.py docs to match the corrected PROCESSING-only eager delivery semantics. Tests: 66 passed across the five affected suites. Refs: none (internal herdr bug fix) 🤖 Assisted by Amazon Q Developer * fix(herdr): resolve panes live by workspace/tab label, not in-memory cache Fixes silent message misdelivery after a CAO server restart. The window resolution path keyed off an in-memory _window_to_terminal map populated only at create time and never rebuilt; after a restart it was empty, and resolution fell through to a fallback that returned the FIRST pane in the workspace, misrouting messages to the wrong terminal (marked delivered). Resolve from herdr's live state every call instead: workspace label (= session) -> tab label (= window) -> pane on that tab. herdr is a separate process that survives CAO restarts, so this is restart-safe with no new persistence. Window names are unique by construction (profile + uuid suffix). - _resolve_pane_id_from_window: always resolve via the label chain; wrap backend errors as TerminalNotFoundError so the inbox path can leave messages pending - remove _window_to_terminal entirely (declaration, populate, cleanup sites) - remove the silent first-pane fallback; raise TerminalNotFoundError instead - inbox_service: treat TerminalNotFoundError on send as transient -> leave the message pending for retry (not FAILED, not misrouted) - _resolve_pane_id (socket inbox path) unchanged Tests: test_herdr_backend (label-chain resolution, fail-loud) and test_inbox_service (pending-on-resolution-failure) pass. Implements OpenSpec change fix-herdr-pane-resolution-across-restart. * fix(herdr): reconcile stale panes on reconnect, subscribe lifecycle events Root cause of 106K/day reconnect storm: _resubscribe_all() was sending subscriptions for pane_ids that no longer exist in herdr. Herdr responds to any invalid pane subscription by ECONNRESET-ing the entire socket. Backoff never escalated because it was reset after _connect() success, not after full setup. Changes: - Add _reconcile(): runs on every socket connect, before _resubscribe_all(). Calls 'herdr pane list', prunes _pane_to_terminal of stale entries, deletes orphaned DB terminal records via delete_terminal(), kills workspaces with zero remaining live terminals. Builds _workspace_to_session map for lifecycle event handling. - Add _subscribe_lifecycle_events(): subscribes to pane.closed and workspace.closed globally after resubscribe. No pane_id needed. - Add _handle_lifecycle_event(): pane.closed deletes single terminal and kills workspace if now empty. workspace.closed deletes all terminals for that session and prunes maps using pane_id prefix matching. - Fix backoff reset: moved to after successful reconcile + resubscribe + lifecycle-subscribe, not just _connect(). - Fix import: get_backend() is in backends.registry, not backends.__init__. Two call sites updated to match the pattern used everywhere else in src/. - Fix double get_terminal_metadata() call in pane.closed handler: refactor to walrus-operator list comprehension, eliminating None dereference. - Add 13 unit tests covering _reconcile prune/no-op/failure paths, _subscribe_lifecycle_events message format, and _handle_lifecycle_event for pane.closed and workspace.closed including unknown-entity no-ops. 🤖 Generated with Amazon Q Developer * fix(herdr): cross-check DB terminals against live herdr tabs on reconcile _reconcile() previously only pruned panes it tracked in _pane_to_terminal. On server restart the map starts empty, so ghost DB records from prior runs (terminals whose herdr tabs no longer exist) were never cleaned up. This caused repeated 'Terminal not found' errors and broke 'cao session list' (CLI silently drops a session when its conductor terminal returns 500). Fix in two parts: 1. herdr_inbox_service.py: after the in-memory map prune, run `herdr tab list` and cross-check DB terminals for every tracked workspace. Any DB terminal whose tmux_window label is absent from live herdr tabs is deleted. This runs on every socket connect, covering both startup and reconnect scenarios. 2. api/main.py: GET /terminals/{id} catches TerminalNotFoundError from the backend and returns 404 instead of 500. Ghost terminals that survive until the next server restart now fail with the correct status code, making CLI error handling straightforward. 10 new tests (33 total in test_herdr_inbox_service.py). * fix(herdr): run ghost terminal cleanup at server startup, not just on reconnect The DB cross-check in _reconcile() only fires when _pane_to_terminal is non-empty (the socket loop is gated by that condition). On a fresh server start with no registered terminals, _reconcile() never runs, so ghost records from prior runs survive indefinitely. Extract the workspace/tab cross-check into _startup_db_cleanup() and call it unconditionally from start() before the socket loop begins. The method fetches herdr workspace list and tab list directly, building its own workspace map without depending on _pane_to_terminal or _workspace_to_session. 3 new tests (36 total). * docs(herdr): add herdr backend setup and configuration guide * style: apply black formatting after rebase onto upstream/main * style: fix isort import ordering after rebase * fix(herdr): repair tests broken by rebase + implement /health components Five test/impl fixes surfaced after rebasing onto upstream/main: - /health endpoint: implement the `components` block (cao/herdr/claude) that test_health_check_returns_ok already asserted. herdr/claude probed via shutil.which -> "ok" or "unavailable". - test_herdr_backend: fix infinite hang in test_logs_warning_when_socket_never_appears. The fake_time stub returned 6.0 but _ensure_session_running's poll deadline is now time()+15.0, so 6.0 < 15.0 looped forever. Bump stub to 16.0. - test_tmux_backend: update create_session/create_window delegation assertions for the new extra_env kwarg. - test_api_endpoints + test_plugin_lifespan: patch get_backend to a TmuxBackend mock so the lifespan PollingObserver assertions hold regardless of the local terminal_backend config. - test_kiro_cli_unit: test_mcp_server_init_yields_processing patched the removed kiro_cli.tmux_client; switch to get_backend mock like the sibling tests. * fix(herdr): reconcile rebase conflicts with upstream inbox + codex changes After rebasing onto upstream/main, three integration points needed repair: - inbox_service: upstream now marks a message DELIVERED *before* send_input (PR #265, closes the watchdog re-entrancy race). My TerminalNotFoundError handler returned without resetting that optimistic status, so a transient pane-resolution failure would strand the message as DELIVERED-but-unsent. Reset to PENDING in the handler so the retry path still fires. - test_inbox_service: update test_resolution_failure_leaves_message_pending to assert the final status is PENDING (DELIVERED then reset), not "never touched". - test_codex_provider_unit: upstream's new TestCodexV0136FooterFormat tests patched codex.tmux_client, which this branch migrated to get_backend. Switch the three get_status tests to get_backend; drop the unused patch on the script-only extraction test. * fix(herdr): update launch tests to patch get_backend, not removed subprocess import launch.py migrated tmux attach from subprocess.run(["tmux", "attach-session"...]) to get_backend().attach_session(), so the module no longer imports subprocess. Six launch tests still patched cli.commands.launch.subprocess.run as a defensive no-op, which now fails with "module launch has no attribute 'subprocess'" during patch setup. Repoint those patches at launch.get_backend, which exists and is the correct seam for the attach path. Full CI-equivalent suite (test/, minus integration/e2e) now passes: 2120 passed. * fix(herdr): patch os.path.exists in factory herdr test for CI test_returns_herdr_when_configured constructed a real HerdrBackend, whose __init__ calls _ensure_session_running(). When the session socket is absent that method runs subprocess.Popen(["herdr", ...]), which raises FileNotFoundError on hosts where herdr is not installed (CI). The test only passed locally because a live ~/.config/herdr/sessions/cao/herdr.sock made _ensure_session_running short-circuit before the Popen call. Patch herdr_backend.os.path.exists to True so __init__ finds the socket and skips server startup, matching the existing fixture in test_herdr_backend.py. The factory's own config_path.exists() is pathlib.Path.exists and is unaffected by this module-scoped patch. Reproduced the CI failure locally by running with herdr off PATH; all 7 factory tests and the full 101-test backends suite pass after the fix. * docs(herdr): link herdr backend guide from README The herdr backend setup guide (docs/herdr.md) was added in this branch but nothing in the README pointed to it, so it was undiscoverable. Add a link in the "Sessions run in tmux" section, next to the existing docs/tmux.md link, since herdr is the alternative terminal backend. * fix(herdr): address Copilot review findings on PR #271 Four findings from the automated reviewer, verified against the code: - terminal_service.get_output: the "partial response" message claimed the marker was not found in 5000 lines, but the final attempt captures the full scrollback (full_history=True). Reworded to "full scrollback" so the prefix matches what was actually searched. - HerdrBackend.get_history ignored strip_escapes, while TmuxBackend honors it. The one strip_escapes=True caller (delete-terminal snapshot) was only getting plain text by accident because herdr defaults to text output. Map strip_escapes=True to herdr's native "--format text"; leave the default path untouched so existing provider output parsing on the herdr backend is unchanged. Added two unit tests for the flag mapping. - docs/herdr.md prerequisites said the herdr server must be running before CAO starts, but HerdrBackend._ensure_session_running auto-starts it when the session socket is absent. Clarified that pre-starting is optional. - test_kiro_cli_integration use_tmux_backend fixture set the module-level backend registry singleton without restoring it, leaking the tmux pin into later tests by execution order. Save and restore the previous backend. Not changed: the Copilot note about TmuxBackend.attach_session using check=True. The sole caller (cli/commands/launch.py) already wraps the call in a broad except that converts any exception to a click.ClickException, so a failed attach surfaces as a normal CLI error, not an uncaught stack trace. Backends + terminal_service suites pass (146 tests) with herdr off PATH. * fix(herdr): stop socket reconnect storm by using one events.subscribe per connection HerdrInboxService flapped the herdr socket ~44x/min indefinitely: connect, subscribe, herdr resets the connection ~225ms later, 1s backoff, repeat. The event-driven inbox path delivered nothing as a result (every pane.agent_status_changed was lost between resets); delivery only worked via the polling fallback, and the churn produced ~50MB log files per day. Root cause: herdr 0.6.8 resets the entire connection when it receives a SECOND events.subscribe on a connection that already has an active subscription. _socket_loop sent one events.subscribe per managed pane (_resubscribe_all) and then a second for the lifecycle events (_subscribe_lifecycle_events), so the 2nd call always tore down the connection. The backoff never escalated past the 1.0s base because it reset after each apparently-successful setup. Reproduced and fix validated against a live herdr 0.6.8 socket: a single events.subscribe carrying every pane's agent-status plus pane.closed and workspace.closed is acknowledged once and keeps the connection open with events streaming; any second subscribe resets at ~150-220ms. Changes: - Replace _resubscribe_all + _subscribe_lifecycle_events + _subscribe_pane with a single _subscribe_all_events that builds one subscriptions array (one pane.agent_status_changed per managed pane -- pane_id is required, herdr rejects the wildcard form -- plus the two lifecycle events). - register_terminal on a live connection now forces a reconnect via _force_reconnect (close the writer so _socket_loop rebuilds the combined subscription) instead of sending a second, connection-killing subscribe. herdr exposes no incremental add-subscription API. - Update tests to pin the single-subscribe-per-connection contract. * fix(herdr): route lifecycle events by herdr's real wire format Socket-driven pane.closed / workspace.closed cleanup never ran. _event_loop read the event name from the "type" key and matched the dotted names "pane.closed" / "workspace.closed", but herdr 0.6.8 puts the name in the "event" key using underscore names ("pane_closed" / "workspace_closed"). The condition never matched, so every lifecycle event fell through and real-time cleanup was dead -- leaving the periodic reconcile/startup sweeps as the only thing removing ghost terminals. Captured the real wire format live from herdr 0.6.8: {"event":"pane_closed","data":{"pane_id":...,"workspace_id":...}} {"event":"workspace_closed","data":{"workspace_id":...}} {"event":"pane.agent_status_changed","data":{"agent_status":...,"pane_id":...}} Fix: read the name from "event" (falling back to "type") and normalize underscores to dots before routing. The agent-status path was already correct (it reads pane_id/agent_status from the "data" wrapper and does not depend on the top-level name match). Validated live end-to-end: the real _event_loop over the real herdr socket now routes an actual pane close to _handle_lifecycle_event. Updated the routing test to use herdr's real event shape (the old test used a fabricated {"type":"pane.closed"} shape that never occurs on the wire) and added end-to-end pane_closed cleanup and agent-status delivery tests. * docs(herdr): use bare cao-server to match README Reviewer noted the README starts the server with `cao-server` (no `uv run` prefix), but herdr.md used `uv run cao-server`. cao-server is a project.scripts entry point, so the bare form works once installed. Aligned both occurrences (Launching + troubleshooting) with the README convention. * fix(herdr): patch backend registry in rebased terminal_service tests The rebase onto upstream/main pulled in four upstream tests in test_terminal_service_full.py that still patched terminal_service.tmux_client. This branch migrated terminal_service off the module-level tmux_client to get_backend(), so that attribute no longer exists and the patch setup raised AttributeError. Repointed the four @patch targets at backends.registry._backend (the seam the already-migrated sibling tests in this file use). The mock's method surface (session_exists, create_session, ...) is identical on the backend, so the test bodies are unchanged. Full CI-equivalent suite: 2180 passed, 1 skipped. * fix(herdr): forward extra_env and prune workspace.closed by session Two findings from maintainer review of PR #271. P1: HerdrBackend dropped cao launch --env. create_session() and create_window() accepted extra_env but called _inject_env_vars() without it, so operator-forwarded vars (AWS_REGION, MNEMOSYNE_DIR, ...) worked under tmux and silently vanished under herdr. Thread extra_env through to the injection path and emit shell `export` statements for it. Reuse TmuxClient's filtering policy (blocked provider prefixes, 2048-byte value cap) so the two backends cannot drift. herdr injects via `pane send-text` (a shell command line, unlike tmux's exec-style -e KEY=VALUE argv), so operator-supplied values are passed through shlex.quote to stay injection-safe. Added tests for both create paths covering forwarding, blocked-prefix/oversize filtering, and value quoting. P2: workspace.closed map cleanup pruned via pane_id.startswith(workspace_id). herdr renumbers compact pane_ids and gives no guarantee they begin with the workspace_id, so the prefix test was unreliable. Prune by matching each tracked terminal's DB session against the closed workspace's session instead, mirroring the session match already used in the pane.closed handler. Updated the test to use compact pane_ids that do not share the workspace_id prefix, proving the prune keys off DB session ownership. Full CI-equivalent suite: 2184 passed, 1 skipped. * feat(herdr): add cao-server --terminal to override backend Adds a --terminal {tmux,herdr} flag to cao-server so the backend can be selected at launch without editing config.json, addressing reviewer feedback on PR #271. - BackendFactory.create() gains a backend_override parameter that takes precedence over terminal_backend in config.json. Other config keys (herdr_session) are still read, so --terminal herdr honors a configured session name. - main() parses --terminal and, when set, resolves the backend via BackendFactory and set_backend() before uvicorn starts. The backend singleton is otherwise lazy (first touched in the lifespan), so setting it up front cleanly controls every get_backend() consumer. Without the flag, behavior is unchanged: the factory reads config.json on first use. Precedence: --terminal flag > config.json terminal_backend > tmux default. Tests: factory override precedence (over tmux/herdr config, no-config, session passthrough, unknown name); main() sets the backend when --terminal is present and leaves it lazy when absent. docs/herdr.md documents the flag. Full CI-equivalent suite: 2191 passed, 1 skipped.
Summary
Fixes #164. Inbox messages were delivered twice to the receiving terminal, corrupting agent input streams in multi-agent
assign+send_messageflows.Root cause
check_and_send_pending_messages()set the message status toDELIVEREDonly after callingterminal_service.send_input().send_input()types into the tmux pane, which writes to the terminal log file and re-triggers the watchdogon_modifiedhandler. That re-entrant delivery check still saw the message asPENDINGand delivered it a second time.Fix
Move the
update_message_status(..., DELIVERED)call ahead ofsend_input()so the watchdog re-fire finds no pending message. The failure path still resets the status toFAILEDifsend_input()raises.Tests
DELIVEREDis recorded beforesend_input().DELIVEREDthenFAILED).pytest test/ --ignore=test/e2e -m "not integration"), along withblack --checkandisort --check-only.