Skip to content

fix(inbox): mark messages DELIVERED before send_input to stop double delivery#265

Merged
haofeif merged 3 commits into
awslabs:mainfrom
call-me-ram:fix/issue-164-inbox-double-delivery
Jun 4, 2026
Merged

fix(inbox): mark messages DELIVERED before send_input to stop double delivery#265
haofeif merged 3 commits into
awslabs:mainfrom
call-me-ram:fix/issue-164-inbox-double-delivery

Conversation

@call-me-ram

Copy link
Copy Markdown
Contributor

Summary

Fixes #164. Inbox messages were delivered twice to the receiving terminal, corrupting agent input streams in multi-agent assign + send_message flows.

Root cause

check_and_send_pending_messages() set the message status to DELIVERED only after calling terminal_service.send_input(). send_input() types into the tmux pane, which writes to the terminal log file and re-triggers the watchdog on_modified handler. That re-entrant delivery check still saw the message as PENDING and delivered it a second time.

Fix

Move the update_message_status(..., DELIVERED) call ahead of send_input() so the watchdog re-fire finds no pending message. The failure path still resets the status to FAILED if send_input() raises.

Tests

  • Added a regression test asserting DELIVERED is recorded before send_input().
  • Updated the failure-path test for the new ordering (DELIVERED then FAILED).
  • Full unit suite passes locally (pytest test/ --ignore=test/e2e -m "not integration"), along with black --check and isort --check-only.

…delivery

check_and_send_pending_messages() updated the message status to DELIVERED
only after calling terminal_service.send_input(). send_input() types into
the tmux pane, which writes to the terminal log file and re-triggers the
watchdog on_modified handler. That re-entrant delivery check still saw the
message as PENDING and delivered it a second time, corrupting agent input
streams in multi-agent workflows.

Move the DELIVERED status update ahead of send_input() so the watchdog
re-fire finds no pending message. The send_input() failure path still
resets the status to FAILED.

Add a regression test asserting DELIVERED is recorded before send_input(),
and update the failure-path test for the new ordering.

Fixes awslabs#164
@codecov-commenter

codecov-commenter commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@2e67a52). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #265   +/-   ##
=======================================
  Coverage        ?   92.19%           
=======================================
  Files           ?       69           
  Lines           ?     6833           
  Branches        ?        0           
=======================================
  Hits            ?     6300           
  Misses          ?      533           
  Partials        ?        0           
Flag Coverage Δ
unittests 92.19% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a re-entrant inbox delivery race where terminal_service.send_input() triggers the watchdog (via terminal log writes) while the message is still PENDING, causing the same inbox message to be delivered twice.

Changes:

  • Update inbox delivery flow to mark messages DELIVERED before calling terminal_service.send_input() to close the watchdog re-entrancy window.
  • Update the failure-path unit test to expect the new DELIVERED → FAILED transition ordering when send_input() raises.
  • Add a regression unit test asserting DELIVERED is recorded before send_input() is invoked (Issue #164).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/cli_agent_orchestrator/services/inbox_service.py Moves the DELIVERED status update ahead of send_input() to prevent double delivery from watchdog re-entry.
test/services/test_inbox_service.py Adds regression coverage for call ordering and updates the send-failure expectation for the new status transition sequence.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@haofeif haofeif added the bug Something isn't working label Jun 4, 2026
@haofeif haofeif merged commit 067deea into awslabs:main Jun 4, 2026
7 checks passed
call-me-ram added a commit to call-me-ram/cli-agent-orchestrator that referenced this pull request Jun 4, 2026
Reconcile the event-driven inbox rewrite with awslabs#265 and awslabs#266, which just
landed on main and edit the same files:

- inbox_service.py: keep the event-driven InboxService, but apply awslabs#265's
  ordering — mark messages DELIVERED before send_input so output echoed
  back through the FIFO/StatusMonitor pipeline can't re-enter
  deliver_pending and double-deliver a still-PENDING message (awslabs#164). Add
  awslabs#266's reconcile_orphaned_messages as a method that routes stale
  PENDING receivers back through deliver_pending (awslabs#131).
- api/main.py: keep the event-bus consumers + OpenCode poller; add awslabs#266's
  inbox_reconciliation_daemon task (start + cancel). Drop awslabs#266's watchdog
  PollingObserver wiring — the event bus replaced it here.
- constants.py: keep both INBOX_POLLING_INTERVAL and awslabs#266's
  INBOX_RECONCILE_INTERVAL/GRACE; reword the comments off the watchdog.
- tests: adapt awslabs#265's ordering regression and awslabs#266's reconcile + lifespan
  tests to the event-driven class API (deliver_pending, status_monitor,
  the singleton's bound method); drop the watchdog LogFileHandler tests.

Full unit suite green: 1958 passed, 1 skipped.
haofeif pushed a commit that referenced this pull request Jun 7, 2026
…ery (#271)

* feat(herdr): terminal backend abstraction, HerdrBackend, and inbox service

Implements the herdr-backend-support OpenSpec change: TerminalBackend ABC,
TmuxBackend wrapper, HerdrBackend (CLI + socket API), BackendFactory with
config-driven selection, native status detection, and HerdrInboxService
(socket event-based inbox delivery, not yet wired into the app lifespan).

Parks work-in-progress ahead of the fix-herdr-inbox-delivery-wiring change,
which wires HerdrInboxService into the running app and fixes the eager
delivery modal-injection defect.

* fix(herdr): wire inbox lifespan + fix eager gate

Two bugs prevented inbox message delivery, plus the supporting wiring
to make herdr-backed delivery work end to end.

Bug A — eager delivery eligibility gate (inbox_service.py):
WAITING_USER_ANSWER was incorrectly included in the eager delivery
eligibility check. A terminal waiting on a user answer is not buffering
input for a next turn, so delivering eagerly in that state is wrong.
Restrict eager eligibility to PROCESSING only.

Bug B — HerdrInboxService never wired into lifespan (api/main.py):
Under the herdr backend the service was never instantiated or started,
so zero messages were delivered. The lifespan now selects the delivery
mechanism by backend: HerdrInboxService (socket events) for HerdrBackend,
PollingObserver (file polling) otherwise, with matching shutdown for each.

Supporting wiring:
- Add HerdrInboxRegistry (module-level get/set singleton) with a
  TYPE_CHECKING-guarded import to break the circular import chain
  between terminal_service and herdr_inbox_service.
- Connect terminal create/delete to register/unregister the terminal
  with the herdr inbox service; failures are logged, not fatal.
- Expose HerdrBackend.herdr_session for the service to bind its socket.
- Fix the async seam in HerdrInboxService: capture the running loop in
  start() and use run_coroutine_threadsafe() for pane subscription, since
  register_terminal() is called from the synchronous create_terminal path.
- Clear the HerdrInboxService singleton on shutdown (set_herdr_inbox_service
  now accepts None) so the registry does not retain a stale reference after
  the service task is cancelled, keeping startup and shutdown symmetric.

Update providers/base.py and constants.py docs to match the corrected
PROCESSING-only eager delivery semantics.

Tests: 66 passed across the five affected suites.

Refs: none (internal herdr bug fix)

🤖 Assisted by Amazon Q Developer

* fix(herdr): resolve panes live by workspace/tab label, not in-memory cache

Fixes silent message misdelivery after a CAO server restart. The window
resolution path keyed off an in-memory _window_to_terminal map populated only
at create time and never rebuilt; after a restart it was empty, and resolution
fell through to a fallback that returned the FIRST pane in the workspace,
misrouting messages to the wrong terminal (marked delivered).

Resolve from herdr's live state every call instead: workspace label (= session)
-> tab label (= window) -> pane on that tab. herdr is a separate process that
survives CAO restarts, so this is restart-safe with no new persistence. Window
names are unique by construction (profile + uuid suffix).

- _resolve_pane_id_from_window: always resolve via the label chain; wrap backend
  errors as TerminalNotFoundError so the inbox path can leave messages pending
- remove _window_to_terminal entirely (declaration, populate, cleanup sites)
- remove the silent first-pane fallback; raise TerminalNotFoundError instead
- inbox_service: treat TerminalNotFoundError on send as transient -> leave the
  message pending for retry (not FAILED, not misrouted)
- _resolve_pane_id (socket inbox path) unchanged

Tests: test_herdr_backend (label-chain resolution, fail-loud) and
test_inbox_service (pending-on-resolution-failure) pass.

Implements OpenSpec change fix-herdr-pane-resolution-across-restart.

* fix(herdr): reconcile stale panes on reconnect, subscribe lifecycle events

Root cause of 106K/day reconnect storm: _resubscribe_all() was sending
subscriptions for pane_ids that no longer exist in herdr. Herdr responds
to any invalid pane subscription by ECONNRESET-ing the entire socket.
Backoff never escalated because it was reset after _connect() success,
not after full setup.

Changes:
- Add _reconcile(): runs on every socket connect, before _resubscribe_all().
  Calls 'herdr pane list', prunes _pane_to_terminal of stale entries,
  deletes orphaned DB terminal records via delete_terminal(), kills
  workspaces with zero remaining live terminals. Builds
  _workspace_to_session map for lifecycle event handling.
- Add _subscribe_lifecycle_events(): subscribes to pane.closed and
  workspace.closed globally after resubscribe. No pane_id needed.
- Add _handle_lifecycle_event(): pane.closed deletes single terminal and
  kills workspace if now empty. workspace.closed deletes all terminals
  for that session and prunes maps using pane_id prefix matching.
- Fix backoff reset: moved to after successful reconcile + resubscribe +
  lifecycle-subscribe, not just _connect().
- Fix import: get_backend() is in backends.registry, not backends.__init__.
  Two call sites updated to match the pattern used everywhere else in src/.
- Fix double get_terminal_metadata() call in pane.closed handler: refactor
  to walrus-operator list comprehension, eliminating None dereference.
- Add 13 unit tests covering _reconcile prune/no-op/failure paths,
  _subscribe_lifecycle_events message format, and _handle_lifecycle_event
  for pane.closed and workspace.closed including unknown-entity no-ops.

🤖 Generated with Amazon Q Developer

* fix(herdr): cross-check DB terminals against live herdr tabs on reconcile

_reconcile() previously only pruned panes it tracked in _pane_to_terminal.
On server restart the map starts empty, so ghost DB records from prior runs
(terminals whose herdr tabs no longer exist) were never cleaned up. This
caused repeated 'Terminal not found' errors and broke 'cao session list'
(CLI silently drops a session when its conductor terminal returns 500).

Fix in two parts:

1. herdr_inbox_service.py: after the in-memory map prune, run `herdr tab
   list` and cross-check DB terminals for every tracked workspace. Any DB
   terminal whose tmux_window label is absent from live herdr tabs is
   deleted. This runs on every socket connect, covering both startup and
   reconnect scenarios.

2. api/main.py: GET /terminals/{id} catches TerminalNotFoundError from the
   backend and returns 404 instead of 500. Ghost terminals that survive
   until the next server restart now fail with the correct status code,
   making CLI error handling straightforward.

10 new tests (33 total in test_herdr_inbox_service.py).

* fix(herdr): run ghost terminal cleanup at server startup, not just on reconnect

The DB cross-check in _reconcile() only fires when _pane_to_terminal is
non-empty (the socket loop is gated by that condition). On a fresh server
start with no registered terminals, _reconcile() never runs, so ghost records
from prior runs survive indefinitely.

Extract the workspace/tab cross-check into _startup_db_cleanup() and call it
unconditionally from start() before the socket loop begins. The method fetches
herdr workspace list and tab list directly, building its own workspace map
without depending on _pane_to_terminal or _workspace_to_session.

3 new tests (36 total).

* docs(herdr): add herdr backend setup and configuration guide

* style: apply black formatting after rebase onto upstream/main

* style: fix isort import ordering after rebase

* fix(herdr): repair tests broken by rebase + implement /health components

Five test/impl fixes surfaced after rebasing onto upstream/main:

- /health endpoint: implement the `components` block (cao/herdr/claude) that
  test_health_check_returns_ok already asserted. herdr/claude probed via
  shutil.which -> "ok" or "unavailable".
- test_herdr_backend: fix infinite hang in test_logs_warning_when_socket_never_appears.
  The fake_time stub returned 6.0 but _ensure_session_running's poll deadline is
  now time()+15.0, so 6.0 < 15.0 looped forever. Bump stub to 16.0.
- test_tmux_backend: update create_session/create_window delegation assertions
  for the new extra_env kwarg.
- test_api_endpoints + test_plugin_lifespan: patch get_backend to a TmuxBackend
  mock so the lifespan PollingObserver assertions hold regardless of the local
  terminal_backend config.
- test_kiro_cli_unit: test_mcp_server_init_yields_processing patched the removed
  kiro_cli.tmux_client; switch to get_backend mock like the sibling tests.

* fix(herdr): reconcile rebase conflicts with upstream inbox + codex changes

After rebasing onto upstream/main, three integration points needed repair:

- inbox_service: upstream now marks a message DELIVERED *before* send_input
  (PR #265, closes the watchdog re-entrancy race). My TerminalNotFoundError
  handler returned without resetting that optimistic status, so a transient
  pane-resolution failure would strand the message as DELIVERED-but-unsent.
  Reset to PENDING in the handler so the retry path still fires.
- test_inbox_service: update test_resolution_failure_leaves_message_pending to
  assert the final status is PENDING (DELIVERED then reset), not "never touched".
- test_codex_provider_unit: upstream's new TestCodexV0136FooterFormat tests
  patched codex.tmux_client, which this branch migrated to get_backend. Switch
  the three get_status tests to get_backend; drop the unused patch on the
  script-only extraction test.

* fix(herdr): update launch tests to patch get_backend, not removed subprocess import

launch.py migrated tmux attach from subprocess.run(["tmux", "attach-session"...])
to get_backend().attach_session(), so the module no longer imports subprocess.
Six launch tests still patched cli.commands.launch.subprocess.run as a defensive
no-op, which now fails with "module launch has no attribute 'subprocess'" during
patch setup. Repoint those patches at launch.get_backend, which exists and is the
correct seam for the attach path.

Full CI-equivalent suite (test/, minus integration/e2e) now passes: 2120 passed.

* fix(herdr): patch os.path.exists in factory herdr test for CI

test_returns_herdr_when_configured constructed a real HerdrBackend, whose
__init__ calls _ensure_session_running(). When the session socket is absent
that method runs subprocess.Popen(["herdr", ...]), which raises
FileNotFoundError on hosts where herdr is not installed (CI). The test only
passed locally because a live ~/.config/herdr/sessions/cao/herdr.sock made
_ensure_session_running short-circuit before the Popen call.

Patch herdr_backend.os.path.exists to True so __init__ finds the socket and
skips server startup, matching the existing fixture in test_herdr_backend.py.
The factory's own config_path.exists() is pathlib.Path.exists and is
unaffected by this module-scoped patch.

Reproduced the CI failure locally by running with herdr off PATH; all 7
factory tests and the full 101-test backends suite pass after the fix.

* docs(herdr): link herdr backend guide from README

The herdr backend setup guide (docs/herdr.md) was added in this branch but
nothing in the README pointed to it, so it was undiscoverable. Add a link in
the "Sessions run in tmux" section, next to the existing docs/tmux.md link,
since herdr is the alternative terminal backend.

* fix(herdr): address Copilot review findings on PR #271

Four findings from the automated reviewer, verified against the code:

- terminal_service.get_output: the "partial response" message claimed the
  marker was not found in 5000 lines, but the final attempt captures the full
  scrollback (full_history=True). Reworded to "full scrollback" so the prefix
  matches what was actually searched.

- HerdrBackend.get_history ignored strip_escapes, while TmuxBackend honors it.
  The one strip_escapes=True caller (delete-terminal snapshot) was only getting
  plain text by accident because herdr defaults to text output. Map
  strip_escapes=True to herdr's native "--format text"; leave the default path
  untouched so existing provider output parsing on the herdr backend is
  unchanged. Added two unit tests for the flag mapping.

- docs/herdr.md prerequisites said the herdr server must be running before CAO
  starts, but HerdrBackend._ensure_session_running auto-starts it when the
  session socket is absent. Clarified that pre-starting is optional.

- test_kiro_cli_integration use_tmux_backend fixture set the module-level
  backend registry singleton without restoring it, leaking the tmux pin into
  later tests by execution order. Save and restore the previous backend.

Not changed: the Copilot note about TmuxBackend.attach_session using
check=True. The sole caller (cli/commands/launch.py) already wraps the call in
a broad except that converts any exception to a click.ClickException, so a
failed attach surfaces as a normal CLI error, not an uncaught stack trace.

Backends + terminal_service suites pass (146 tests) with herdr off PATH.

* fix(herdr): stop socket reconnect storm by using one events.subscribe per connection

HerdrInboxService flapped the herdr socket ~44x/min indefinitely: connect,
subscribe, herdr resets the connection ~225ms later, 1s backoff, repeat. The
event-driven inbox path delivered nothing as a result (every
pane.agent_status_changed was lost between resets); delivery only worked via
the polling fallback, and the churn produced ~50MB log files per day.

Root cause: herdr 0.6.8 resets the entire connection when it receives a SECOND
events.subscribe on a connection that already has an active subscription.
_socket_loop sent one events.subscribe per managed pane (_resubscribe_all) and
then a second for the lifecycle events (_subscribe_lifecycle_events), so the
2nd call always tore down the connection. The backoff never escalated past the
1.0s base because it reset after each apparently-successful setup.

Reproduced and fix validated against a live herdr 0.6.8 socket: a single
events.subscribe carrying every pane's agent-status plus pane.closed and
workspace.closed is acknowledged once and keeps the connection open with events
streaming; any second subscribe resets at ~150-220ms.

Changes:
- Replace _resubscribe_all + _subscribe_lifecycle_events + _subscribe_pane with
  a single _subscribe_all_events that builds one subscriptions array (one
  pane.agent_status_changed per managed pane -- pane_id is required, herdr
  rejects the wildcard form -- plus the two lifecycle events).
- register_terminal on a live connection now forces a reconnect via
  _force_reconnect (close the writer so _socket_loop rebuilds the combined
  subscription) instead of sending a second, connection-killing subscribe.
  herdr exposes no incremental add-subscription API.
- Update tests to pin the single-subscribe-per-connection contract.

* fix(herdr): route lifecycle events by herdr's real wire format

Socket-driven pane.closed / workspace.closed cleanup never ran. _event_loop
read the event name from the "type" key and matched the dotted names
"pane.closed" / "workspace.closed", but herdr 0.6.8 puts the name in the
"event" key using underscore names ("pane_closed" / "workspace_closed"). The
condition never matched, so every lifecycle event fell through and real-time
cleanup was dead -- leaving the periodic reconcile/startup sweeps as the only
thing removing ghost terminals.

Captured the real wire format live from herdr 0.6.8:
  {"event":"pane_closed","data":{"pane_id":...,"workspace_id":...}}
  {"event":"workspace_closed","data":{"workspace_id":...}}
  {"event":"pane.agent_status_changed","data":{"agent_status":...,"pane_id":...}}

Fix: read the name from "event" (falling back to "type") and normalize
underscores to dots before routing. The agent-status path was already correct
(it reads pane_id/agent_status from the "data" wrapper and does not depend on
the top-level name match).

Validated live end-to-end: the real _event_loop over the real herdr socket now
routes an actual pane close to _handle_lifecycle_event. Updated the routing
test to use herdr's real event shape (the old test used a fabricated
{"type":"pane.closed"} shape that never occurs on the wire) and added
end-to-end pane_closed cleanup and agent-status delivery tests.

* docs(herdr): use bare cao-server to match README

Reviewer noted the README starts the server with `cao-server` (no `uv run`
prefix), but herdr.md used `uv run cao-server`. cao-server is a project.scripts
entry point, so the bare form works once installed. Aligned both occurrences
(Launching + troubleshooting) with the README convention.

* fix(herdr): patch backend registry in rebased terminal_service tests

The rebase onto upstream/main pulled in four upstream tests in
test_terminal_service_full.py that still patched
terminal_service.tmux_client. This branch migrated terminal_service off the
module-level tmux_client to get_backend(), so that attribute no longer exists
and the patch setup raised AttributeError.

Repointed the four @patch targets at backends.registry._backend (the seam the
already-migrated sibling tests in this file use). The mock's method surface
(session_exists, create_session, ...) is identical on the backend, so the test
bodies are unchanged.

Full CI-equivalent suite: 2180 passed, 1 skipped.

* fix(herdr): forward extra_env and prune workspace.closed by session

Two findings from maintainer review of PR #271.

P1: HerdrBackend dropped cao launch --env. create_session() and
create_window() accepted extra_env but called _inject_env_vars() without it,
so operator-forwarded vars (AWS_REGION, MNEMOSYNE_DIR, ...) worked under tmux
and silently vanished under herdr. Thread extra_env through to the injection
path and emit shell `export` statements for it. Reuse TmuxClient's filtering
policy (blocked provider prefixes, 2048-byte value cap) so the two backends
cannot drift. herdr injects via `pane send-text` (a shell command line, unlike
tmux's exec-style -e KEY=VALUE argv), so operator-supplied values are passed
through shlex.quote to stay injection-safe. Added tests for both create paths
covering forwarding, blocked-prefix/oversize filtering, and value quoting.

P2: workspace.closed map cleanup pruned via pane_id.startswith(workspace_id).
herdr renumbers compact pane_ids and gives no guarantee they begin with the
workspace_id, so the prefix test was unreliable. Prune by matching each tracked
terminal's DB session against the closed workspace's session instead, mirroring
the session match already used in the pane.closed handler. Updated the test to
use compact pane_ids that do not share the workspace_id prefix, proving the
prune keys off DB session ownership.

Full CI-equivalent suite: 2184 passed, 1 skipped.

* feat(herdr): add cao-server --terminal to override backend

Adds a --terminal {tmux,herdr} flag to cao-server so the backend can be
selected at launch without editing config.json, addressing reviewer feedback
on PR #271.

- BackendFactory.create() gains a backend_override parameter that takes
  precedence over terminal_backend in config.json. Other config keys
  (herdr_session) are still read, so --terminal herdr honors a configured
  session name.
- main() parses --terminal and, when set, resolves the backend via
  BackendFactory and set_backend() before uvicorn starts. The backend
  singleton is otherwise lazy (first touched in the lifespan), so setting it
  up front cleanly controls every get_backend() consumer. Without the flag,
  behavior is unchanged: the factory reads config.json on first use.

Precedence: --terminal flag > config.json terminal_backend > tmux default.

Tests: factory override precedence (over tmux/herdr config, no-config, session
passthrough, unknown name); main() sets the backend when --terminal is present
and leaves it lazy when absent. docs/herdr.md documents the flag.

Full CI-equivalent suite: 2191 passed, 1 skipped.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] Inbox messages delivered twice due to race condition in check_and_send_pending_messages

4 participants