feat(inbox): add reconciliation sweep for orphaned PENDING messages#266
Conversation
Inbox messages could stay PENDING forever when the receiving terminal was already idle: the single immediate (on POST) delivery attempt can observe a stale status and skip, and the log-watching observer never fires again because an idle agent produces no new log output. With both fast paths missed there was no fallback (issue awslabs#131). Add a provider-agnostic reconciliation daemon that runs on a slower interval than the watchdog and re-attempts delivery for any message left PENDING past a grace window. The grace window keeps the sweep from competing with the immediate and watchdog paths for freshly queued messages — it only adopts ones those paths have already missed. - list_pending_receiver_ids_older_than: provider-agnostic query for stuck receivers, joined against terminals so deleted receivers are skipped; the cutoff uses local-naive datetime.now() to match InboxModel.created_at. - reconcile_orphaned_messages: sweep body, reusing check_and_send_pending_messages. - inbox_reconciliation_daemon: background loop wired into the server lifespan. - INBOX_RECONCILE_INTERVAL / INBOX_RECONCILE_GRACE_SECONDS constants. The sweep reuses the existing delivery helper and so shares its known duplicate-wakeup race; making delivery atomic is left to the unified delivery engine tracked in GH awslabs#115. Documented in docs/inbox-delivery.md. Fixes awslabs#131
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #266 +/- ##
=======================================
Coverage ? 92.19%
=======================================
Files ? 69
Lines ? 6833
Branches ? 0
=======================================
Hits ? 6300
Misses ? 533
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
@call-me-ram you are welcome to continue on working #115 if you have the time. i think it was a good start but didnt pass the test. |
There was a problem hiding this comment.
Pull request overview
This PR adds a provider-agnostic “reconciliation sweep” to recover inbox messages that can otherwise remain PENDING indefinitely when the receiver is already idle (issue #131). It fits into the existing inbox delivery architecture as a slow safety net that re-routes orphaned messages back through the existing check_and_send_pending_messages() gate, without replacing the fast delivery paths.
Changes:
- Added a new DB query (
list_pending_receiver_ids_older_than) to find receivers with stalePENDINGmessages while skipping deleted terminals via a join. - Added
reconcile_orphaned_messages()plus aninbox_reconciliation_daemon()background task wired into FastAPI lifespan startup/shutdown. - Added unit tests (DB + service + daemon/lifespan) and documentation updates describing the reconciliation design and grace window behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
src/cli_agent_orchestrator/clients/database.py |
Adds provider-agnostic query to find receivers with stale pending messages (with terminal join). |
src/cli_agent_orchestrator/services/inbox_service.py |
Adds reconciliation sweep function that reuses existing delivery gate for stale receivers. |
src/cli_agent_orchestrator/api/main.py |
Adds reconciliation daemon and wires it into lifespan startup/shutdown cancellation. |
src/cli_agent_orchestrator/constants.py |
Introduces reconciliation interval and grace-window constants. |
test/clients/test_database.py |
Adds in-memory DB test to validate age cutoff, status filter, and terminal join behavior. |
test/services/test_inbox_service.py |
Adds unit tests for sweep behavior and per-receiver failure isolation. |
test/api/test_api_endpoints.py |
Adds daemon + lifespan cancellation tests for reconciliation task. |
docs/inbox-delivery.md |
Documents reconciliation sweep rationale, grace window, and interaction with OpenCode poller. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Thanks @haofeif i'll take it on. The branch had fallen well behind main and was conflicting, so i have merged latest main and am working through the conflicts + getting the tests green. will update the PR and ping you once it passes |
haofeif
left a comment
There was a problem hiding this comment.
LGTM thanks again for your contribution @call-me-ram
Reconcile the event-driven inbox rewrite with awslabs#265 and awslabs#266, which just landed on main and edit the same files: - inbox_service.py: keep the event-driven InboxService, but apply awslabs#265's ordering — mark messages DELIVERED before send_input so output echoed back through the FIFO/StatusMonitor pipeline can't re-enter deliver_pending and double-deliver a still-PENDING message (awslabs#164). Add awslabs#266's reconcile_orphaned_messages as a method that routes stale PENDING receivers back through deliver_pending (awslabs#131). - api/main.py: keep the event-bus consumers + OpenCode poller; add awslabs#266's inbox_reconciliation_daemon task (start + cancel). Drop awslabs#266's watchdog PollingObserver wiring — the event bus replaced it here. - constants.py: keep both INBOX_POLLING_INTERVAL and awslabs#266's INBOX_RECONCILE_INTERVAL/GRACE; reword the comments off the watchdog. - tests: adapt awslabs#265's ordering regression and awslabs#266's reconcile + lifespan tests to the event-driven class API (deliver_pending, status_monitor, the singleton's bound method); drop the watchdog LogFileHandler tests. Full unit suite green: 1958 passed, 1 skipped.
Thanks @haofeif |
Summary
Fixes #131. Inbox messages could stay
PENDINGforever when the receiving terminal was already idle, stalling multi-agent workflows until a human re-sent the message.Root cause
Delivery has two fast paths, and both can miss when the receiver is already idle:
get_status()is momentarily stale.PollingObserver): only fires on terminal log-file changes — which an already-idle agent never produces.With both missed, there was no fallback, so the message was orphaned.
Fix
A provider-agnostic reconciliation sweep (
inbox_reconciliation_daemon) runs on a slower interval than the watchdog and re-attempts delivery for any message leftPENDINGpast a grace window, routing it back through the existingcheck_and_send_pending_messages()gate.The grace window (
INBOX_RECONCILE_GRACE_SECONDS) is the key design choice: the sweep only adopts messages older than the window, so it never competes with the immediate/watchdog paths for freshly queued messages. It coexists with — rather than replaces — the OpenCode poller, which serves a different role (fast primary wakeup vs. slow provider-agnostic safety net).Details:
list_pending_receiver_ids_older_than(seconds)— provider-agnostic query for stuck receivers, joined againstterminalsso deleted receivers are skipped. The cutoff uses local-naivedatetime.now()to matchInboxModel.created_at(same convention ascleanup_service).reconcile_orphaned_messages()— the sweep body, a sibling ofpoll_opencode_pending_messages.INBOX_RECONCILE_INTERVAL/INBOX_RECONCILE_GRACE_SECONDSconstants.docs/inbox-delivery.md.Scope note
The sweep reuses the existing delivery helper and so shares its known duplicate-wakeup race. Making delivery atomic touches code shared by all delivery paths and is better done as part of the unified delivery engine already tracked in GH #115, so it is intentionally out of scope here. The grace window keeps the sweep from materially widening the race in practice — it only acts on messages the fast paths have already abandoned.
Tests
pytest test/ --ignore=test/e2e -m "not integration"), along withblack --checkandisort --check-only.