Inbox messages stuck in PENDING when receiving agent is already idle #131

@kdzhang

Summary

Inbox messages can get stuck in PENDING indefinitely when the receiving agent is already idle at the time the message is posted. This affects all providers — Kiro CLI, Claude Code, etc. — because the issue is in the delivery architecture, not in any specific provider's status detection.

Provider: observed with Kiro CLI; the cause appears to be provider-agnostic

Impact

  • Agent-to-agent messaging silently fails — messages stay PENDING forever
  • Multi-agent workflows stall waiting for callbacks that were sent but never delivered
  • Requires manual intervention (resending the message) to unblock

Reproduction

  1. Start cao-server and a multi-agent session with 3+ agents
  2. Agent A finishes work and goes idle (no more log output)
  3. Agent B calls send_message to Agent A
  4. Message stays PENDING — Agent A never receives it

This happens intermittently in long-running sessions (4-8 hours) with multiple concurrent agents. We observe it several times per session.

Root Cause

The inbox has two delivery paths:

Path 1 — Immediate delivery (on POST): POST /terminals/{id}/inbox/messages calls check_and_send_pending_messages(receiver_id), which calls provider.get_status(). If IDLE or COMPLETED, delivers immediately. This is a single-shot attempt with no retry. If get_status() returns a stale or incorrect status at that moment, delivery is skipped.

Path 2 — PollingObserver: Monitors TERMINAL_LOG_DIR for .log file changes every 5 seconds. On change → check pending → check idle → deliver. But if the agent is already idle and not producing output, the log file doesn't change, so the observer never fires again.

The gap: If Path 1 fails (stale status at the wrong moment) and the agent is already idle (Path 2 never triggers), the message is permanently orphaned. There is no fallback mechanism.

Possible Directions

  • A periodic background check for orphaned PENDING messages (similar to the existing flow_daemon() pattern)
  • Retry logic on the immediate delivery path (e.g., a few attempts with short delays)
  • A fallback poll triggered when a new message is queued but the watcher hasn't fired within N seconds
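The first two directions could look roughly like the sketch below. It makes several assumptions: `try_deliver` and `receivers_with_pending` are hypothetical callables, `check_and_send` stands in for `check_and_send_pending_messages(receiver_id)`, and the retry counts and intervals are placeholders. It does not reflect how `flow_daemon()` is actually implemented.

```python
import threading
import time
from typing import Callable, Iterable


def deliver_with_retry(
    try_deliver: Callable[[], bool],
    attempts: int = 3,
    delay_s: float = 0.5,
) -> bool:
    """Retry the immediate delivery a few times before giving up."""
    for attempt in range(attempts):
        if try_deliver():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False


def sweep_once(
    receivers_with_pending: Callable[[], Iterable[str]],
    check_and_send: Callable[[str], None],
) -> int:
    """One fallback pass over every receiver that still has PENDING messages."""
    count = 0
    for receiver_id in receivers_with_pending():
        check_and_send(receiver_id)
        count += 1
    return count


def start_sweeper(
    receivers_with_pending: Callable[[], Iterable[str]],
    check_and_send: Callable[[str], None],
    interval_s: float = 30.0,
) -> threading.Event:
    """Run sweep_once on a daemon thread until the returned event is set."""
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            sweep_once(receivers_with_pending, check_and_send)

    threading.Thread(target=loop, daemon=True, name="inbox-sweeper").start()
    return stop


# Demo: the status is stale for two reads, then correct on the third.
statuses = iter(["processing", "processing", "idle"])
delivered = []


def try_deliver() -> bool:
    if next(statuses) == "idle":
        delivered.append("msg-1")
        return True
    return False


ok = deliver_with_retry(try_deliver, attempts=3, delay_s=0.0)
print(ok, delivered)
```

Retry narrows the window in which a stale read can drop a message, while the sweep closes it: a message is then re-checked at least once per interval, even if the log never changes. The two are complementary, and the sweep alone would be enough to stop messages from staying PENDING indefinitely.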

Related Issues

The related issues both improve get_status() accuracy, but this issue is distinct: even with perfect status detection, the single-shot immediate delivery can miss due to timing, and there is no fallback when it does.

Environment

  • cao-server at commit 331e8d7
  • macOS, Kiro CLI provider
  • Observed across multiple multi-day sessions

Labels: bug
