Feature: Auto-resume unanswered sessions after gateway restart #51917

@skillz-xx

Problem

After a gateway restart (SIGUSR1, config change, update, or manual restart), all active agent sessions are interrupted. If an agent was mid-conversation in a Signal group (or any channel), the session dies and the agent never follows up. The user has to re-send their message or poke the agent with "?" to get a response.

This is especially painful when:

  • A config change triggers an automatic restart
  • The gateway restarts during an active conversation
  • Multiple agents across multiple Signal groups are affected simultaneously

The user experience is terrible: messages show as "read" (Signal read receipts) but never get a response, so it looks like the agent is ignoring you.

Current Workarounds

1. historyLimit (partial fix)

Setting channels.signal.historyLimit: 15 means agents see recent group messages when a new session starts. But this only helps once someone sends a new message; until then the agent sits idle.

2. BOOT.md + scan script (workaround)

We built a BOOT.md that runs on gateway:startup via the boot-md hook. It scans all agent session transcripts for Signal groups where the last message was from a user (unanswered), then sends sessions_send nudges to those agents.

This works but is fragile:

  • It costs a full agent turn on the main agent every restart
  • The scan script reads raw JSONL transcripts (implementation detail that could change)
  • It cannot detect messages lost during the drain window
  • It is capped at 5 nudges per boot to avoid token storms, so any further unanswered sessions stay silent
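
For reference, the scan step of this workaround can be sketched roughly as follows. This is an illustration, not our actual script: it assumes one `*.jsonl` transcript file per session and that each entry is a JSON object with a `role` field of `"user"` or `"assistant"`. Both the file layout and the field name are assumptions, which is exactly the implementation-detail coupling that makes the approach fragile.

```python
import json
from pathlib import Path
from typing import Iterable, List, Optional

MAX_NUDGES_PER_BOOT = 5  # cap used by the workaround to avoid token storms


def last_message_role(lines: Iterable[str]) -> Optional[str]:
    """Return the role of the last user/assistant entry in a JSONL transcript."""
    role = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated line left by an interrupted write
        # The "role" field is a hypothetical transcript schema.
        if isinstance(entry, dict) and entry.get("role") in ("user", "assistant"):
            role = entry["role"]
    return role


def find_unanswered(transcript_dir: Path, limit: int = MAX_NUDGES_PER_BOOT) -> List[str]:
    """Return up to `limit` session ids whose last message came from a user."""
    unanswered: List[str] = []
    for path in sorted(transcript_dir.glob("*.jsonl")):
        if len(unanswered) >= limit:
            break
        with path.open(encoding="utf-8") as fh:
            if last_message_role(fh) == "user":
                unanswered.append(path.stem)
    return unanswered
```

Each id returned would then get one sessions_send nudge. Note that nothing here can see messages that never reached a transcript, which is why the drain window stays a blind spot.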

Proposed Solution

Native session resumption after restart

When the gateway comes back up after a SIGUSR1 restart:

  1. Detect interrupted sessions — sessions that had an active turn aborted by drain, or sessions where the last transcript entry is a user message with no assistant response
  2. Auto-resume those sessions — inject a system event like: "The gateway restarted. Review conversation context and respond to any unanswered messages." or simply re-process the last user message
  3. Scope it to channel sessions only — skip heartbeat, subagent, and boot sessions
  4. Rate limit — cap at N concurrent resumptions to avoid API storms
  5. Configurable — add a config key like session.resumeAfterRestart: true/false (default: true)
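
To make the five points above concrete, here is a minimal sketch of the selection and rate-limiting logic. It is not based on the gateway's real internals: the `Session` record, its fields, the `send` callback, and the `max_concurrent` default are all hypothetical names chosen for illustration, and `enabled` stands in for the proposed session.resumeAfterRestart key.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Iterable, List

RESUME_PROMPT = (
    "The gateway restarted. Review conversation context and "
    "respond to any unanswered messages."
)


@dataclass
class Session:
    """Hypothetical view of a session as seen at startup."""
    session_id: str
    kind: str               # e.g. "channel", "heartbeat", "subagent", "boot"
    aborted_by_drain: bool  # an active turn was cut off by the drain
    last_role: str          # role of the last transcript entry


def needs_resume(session: Session) -> bool:
    """Points 1 and 3: interrupted or unanswered, and channel sessions only."""
    if session.kind != "channel":
        return False
    return session.aborted_by_drain or session.last_role == "user"


async def resume_all(
    sessions: Iterable[Session],
    send: Callable[[str, str], Awaitable[None]],
    enabled: bool = True,     # point 5: stands in for session.resumeAfterRestart
    max_concurrent: int = 3,  # point 4: the "N" in "cap at N concurrent resumptions"
) -> List[str]:
    """Inject the resume prompt into every qualifying session, N at a time."""
    if not enabled:
        return []
    gate = asyncio.Semaphore(max_concurrent)

    async def resume_one(session: Session) -> str:
        async with gate:  # at most max_concurrent sends in flight
            await send(session.session_id, RESUME_PROMPT)  # point 2
        return session.session_id

    targets = [s for s in sessions if needs_resume(s)]
    return list(await asyncio.gather(*(resume_one(s) for s in targets)))
```

The semaphore only bounds how many resumptions run at once; every qualifying session is still resumed eventually, unlike the hard cap of 5 in the BOOT.md workaround.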

Bonus: Drain-aware message queuing

Instead of rejecting messages with GatewayDrainingError, the gateway should queue them silently while draining (the code already has resetAllLanes() for this, but it does not always work). Messages received during drain should be replayed after restart, not rejected.
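
The intended behavior can be sketched as a small hold-and-replay buffer. This is only an illustration of the idea: `DrainAwareInbox` and its methods are hypothetical names, not existing gateway code, and the in-memory queue shown here would not survive a process restart, so a real implementation would need to persist held messages.

```python
from collections import deque
from typing import Callable, Deque


class DrainAwareInbox:
    """Holds incoming messages during drain and replays them afterwards."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self._draining = False
        self._pending: Deque[str] = deque()

    def begin_drain(self) -> None:
        """Stop delivering; later messages are held instead of rejected."""
        self._draining = True

    def receive(self, message: str) -> bool:
        """Deliver now, or hold while draining. Returns True if delivered."""
        if self._draining:
            self._pending.append(message)
            return False
        self._handler(message)
        return True

    def end_drain(self) -> int:
        """Replay held messages in arrival order; return how many were replayed."""
        self._draining = False
        replayed = 0
        while self._pending:
            self._handler(self._pending.popleft())
            replayed += 1
        return replayed
```

Replaying in arrival order matters in group chats, where a later message often only makes sense after the earlier one.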

Environment

  • OpenClaw 2026.3.13
  • Signal channel with ~27 bound agents across Signal groups
  • Frequent restarts due to config changes, updates, and development

Impact

This affects every multi-agent Signal setup. Any restart = broken conversations across all active groups. The user has to manually re-engage every agent that was mid-conversation.
