Skip to content

feat(inbox): add reconciliation sweep for orphaned PENDING messages#266

Merged
haofeif merged 2 commits into
awslabs:mainfrom
call-me-ram:feat/issue-131-inbox-reconciliation
Jun 4, 2026
Merged

feat(inbox): add reconciliation sweep for orphaned PENDING messages#266
haofeif merged 2 commits into
awslabs:mainfrom
call-me-ram:feat/issue-131-inbox-reconciliation

Conversation

@call-me-ram

Copy link
Copy Markdown
Contributor

Summary

Fixes #131. Inbox messages could stay PENDING forever when the receiving terminal was already idle, stalling multi-agent workflows until a human re-sent the message.

Root cause

Delivery has two fast paths, and both can miss when the receiver is already idle:

  • Immediate (on POST): a single attempt that skips if get_status() is momentarily stale.
  • Watchdog (PollingObserver): only fires on terminal log-file changes — which an already-idle agent never produces.

With both missed, there was no fallback, so the message was orphaned.

Fix

A provider-agnostic reconciliation sweep (inbox_reconciliation_daemon) runs on a slower interval than the watchdog and re-attempts delivery for any message left PENDING past a grace window, routing it back through the existing check_and_send_pending_messages() gate.

The grace window (INBOX_RECONCILE_GRACE_SECONDS) is the key design choice: the sweep only adopts messages older than the window, so it never competes with the immediate/watchdog paths for freshly queued messages. It coexists with — rather than replaces — the OpenCode poller, which serves a different role (fast primary wakeup vs. slow provider-agnostic safety net).

Details:

  • list_pending_receiver_ids_older_than(seconds) — provider-agnostic query for stuck receivers, joined against terminals so deleted receivers are skipped. The cutoff uses local-naive datetime.now() to match InboxModel.created_at (same convention as cleanup_service).
  • reconcile_orphaned_messages() — the sweep body, a sibling of poll_opencode_pending_messages.
  • INBOX_RECONCILE_INTERVAL / INBOX_RECONCILE_GRACE_SECONDS constants.
  • Documented in docs/inbox-delivery.md.

Scope note

The sweep reuses the existing delivery helper and so shares its known duplicate-wakeup race. Making delivery atomic touches code shared by all delivery paths and is better done as part of the unified delivery engine already tracked in GH #115, so it is intentionally out of scope here. The grace window keeps the sweep from materially widening the race in practice — it only acts on messages the fast paths have already abandoned.

Tests

  • New DB test exercises the age cutoff, status filter, and terminal join against a real in-memory database.
  • Service-level tests for the sweep (delivers per stuck receiver; survives a per-receiver failure).
  • Daemon tests: dispatches the sweep with the registry, and the task is cancelled on server shutdown.
  • Full unit suite passes locally (pytest test/ --ignore=test/e2e -m "not integration"), along with black --check and isort --check-only.

Inbox messages could stay PENDING forever when the receiving terminal was
already idle: the single immediate (on POST) delivery attempt can observe a
stale status and skip, and the log-watching observer never fires again
because an idle agent produces no new log output. With both fast paths
missed there was no fallback (issue awslabs#131).

Add a provider-agnostic reconciliation daemon that runs on a slower interval
than the watchdog and re-attempts delivery for any message left PENDING past
a grace window. The grace window keeps the sweep from competing with the
immediate and watchdog paths for freshly queued messages — it only adopts
ones those paths have already missed.

- list_pending_receiver_ids_older_than: provider-agnostic query for stuck
  receivers, joined against terminals so deleted receivers are skipped; the
  cutoff uses local-naive datetime.now() to match InboxModel.created_at.
- reconcile_orphaned_messages: sweep body, reusing check_and_send_pending_messages.
- inbox_reconciliation_daemon: background loop wired into the server lifespan.
- INBOX_RECONCILE_INTERVAL / INBOX_RECONCILE_GRACE_SECONDS constants.

The sweep reuses the existing delivery helper and so shares its known
duplicate-wakeup race; making delivery atomic is left to the unified delivery
engine tracked in GH awslabs#115. Documented in docs/inbox-delivery.md.

Fixes awslabs#131
@codecov-commenter

codecov-commenter commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.33333% with 2 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@ccc8a92). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/cli_agent_orchestrator/api/main.py 85.71% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #266   +/-   ##
=======================================
  Coverage        ?   92.19%           
=======================================
  Files           ?       69           
  Lines           ?     6833           
  Branches        ?        0           
=======================================
  Hits            ?     6300           
  Misses          ?      533           
  Partials        ?        0           
Flag Coverage Δ
unittests 92.19% <93.33%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@haofeif

haofeif commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

@call-me-ram you are welcome to continue on working #115 if you have the time. i think it was a good start but didnt pass the test.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a provider-agnostic “reconciliation sweep” to recover inbox messages that can otherwise remain PENDING indefinitely when the receiver is already idle (issue #131). It fits into the existing inbox delivery architecture as a slow safety net that re-routes orphaned messages back through the existing check_and_send_pending_messages() gate, without replacing the fast delivery paths.

Changes:

  • Added a new DB query (list_pending_receiver_ids_older_than) to find receivers with stale PENDING messages while skipping deleted terminals via a join.
  • Added reconcile_orphaned_messages() plus an inbox_reconciliation_daemon() background task wired into FastAPI lifespan startup/shutdown.
  • Added unit tests (DB + service + daemon/lifespan) and documentation updates describing the reconciliation design and grace window behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/cli_agent_orchestrator/clients/database.py Adds provider-agnostic query to find receivers with stale pending messages (with terminal join).
src/cli_agent_orchestrator/services/inbox_service.py Adds reconciliation sweep function that reuses existing delivery gate for stale receivers.
src/cli_agent_orchestrator/api/main.py Adds reconciliation daemon and wires it into lifespan startup/shutdown cancellation.
src/cli_agent_orchestrator/constants.py Introduces reconciliation interval and grace-window constants.
test/clients/test_database.py Adds in-memory DB test to validate age cutoff, status filter, and terminal join behavior.
test/services/test_inbox_service.py Adds unit tests for sweep behavior and per-receiver failure isolation.
test/api/test_api_endpoints.py Adds daemon + lifespan cancellation tests for reconciliation task.
docs/inbox-delivery.md Documents reconciliation sweep rationale, grace window, and interaction with OpenCode poller.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@haofeif haofeif added the enhancement New feature or request label Jun 3, 2026
@call-me-ram

Copy link
Copy Markdown
Contributor Author

Thanks @haofeif i'll take it on. The branch had fallen well behind main and was conflicting, so i have merged latest main and am working through the conflicts + getting the tests green. will update the PR and ping you once it passes

@haofeif haofeif left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks again for your contribution @call-me-ram

@haofeif haofeif merged commit 2e67a52 into awslabs:main Jun 4, 2026
8 checks passed
call-me-ram added a commit to call-me-ram/cli-agent-orchestrator that referenced this pull request Jun 4, 2026
Reconcile the event-driven inbox rewrite with awslabs#265 and awslabs#266, which just
landed on main and edit the same files:

- inbox_service.py: keep the event-driven InboxService, but apply awslabs#265's
  ordering — mark messages DELIVERED before send_input so output echoed
  back through the FIFO/StatusMonitor pipeline can't re-enter
  deliver_pending and double-deliver a still-PENDING message (awslabs#164). Add
  awslabs#266's reconcile_orphaned_messages as a method that routes stale
  PENDING receivers back through deliver_pending (awslabs#131).
- api/main.py: keep the event-bus consumers + OpenCode poller; add awslabs#266's
  inbox_reconciliation_daemon task (start + cancel). Drop awslabs#266's watchdog
  PollingObserver wiring — the event bus replaced it here.
- constants.py: keep both INBOX_POLLING_INTERVAL and awslabs#266's
  INBOX_RECONCILE_INTERVAL/GRACE; reword the comments off the watchdog.
- tests: adapt awslabs#265's ordering regression and awslabs#266's reconcile + lifespan
  tests to the event-driven class API (deliver_pending, status_monitor,
  the singleton's bound method); drop the watchdog LogFileHandler tests.

Full unit suite green: 1958 passed, 1 skipped.
@call-me-ram

Copy link
Copy Markdown
Contributor Author

LGTM thanks again for your contribution @call-me-ram

Thanks @haofeif ♥️🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Inbox messages stuck in PENDING when receiving agent is already idle

4 participants