Feature: Graceful Gateway Restart with Session Recovery #57425

@lmdeagles

Description

Problem

When the gateway restarts — whether from openclaw gateway restart, a config change, SIGUSR1, or a crash — all in-flight work is silently killed. There is no mechanism for sessions to know they were interrupted, no way for parent sessions to learn their subagents died, and no recovery path other than waiting for the next heartbeat or manual intervention.

This is the single biggest reliability gap in multi-agent OpenClaw deployments.

What happens today

  1. Gateway receives restart signal
  2. Drain period begins (90s timeout)
  3. Active sessions are killed after drain
  4. Gateway comes back up — fresh slate
  5. Sessions persist on disk (JSONL) but no one reads them
  6. Subagents die without reporting back to parents
  7. Cron jobs interrupted mid-execution are not retried until next scheduled time
  8. Users in group chats see "read" receipts but never get responses

Real-world impact

Running 7 agents across Discord + iMessage + cron, a single gateway restart can:

  • Kill 3-4 active conversations simultaneously
  • Orphan subagents mid-task (research, code generation, file operations)
  • Drop cron jobs that were minutes into expensive multi-tool workflows
  • Leave group chat messages permanently unanswered
  • Break multi-step workflows where step N completed but step N+1 never fires

The blast radius scales with agent count. This is not a solo-agent problem.

Prior art: Hermes Agent

Hermes (Nous Research) is building this in their multi-agent architecture (issue #344):

  • Per-tool-call checkpointing — Sub-agent state persisted to ~/.hermes/checkpoints/ after each tool call. On failure, resume from checkpoint.
  • ResponseStore persistence — SQLite-backed state that survives restarts (shipped in v0.4.0)
  • Three-level failure escalation — Retry → Replan → Decompose further
  • One-shot job recovery — Interrupted cron-like jobs are automatically retried
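None of the Hermes internals above are reproduced here, but the first bullet — per-tool-call checkpointing with resume — can be sketched in a few lines. Everything below (the `Checkpoint` shape, the directory, the `saveCheckpoint`/`resumeFrom` names) is hypothetical illustration under stated assumptions, not Hermes code:

```typescript
// Hypothetical sketch of per-tool-call checkpointing: after each tool call,
// persist the step index and accumulated state so a restarted worker can
// resume from the last completed step instead of starting over.
import * as fs from "fs";
import * as path from "path";

interface Checkpoint {
  taskId: string;
  lastCompletedStep: number;
  state: Record<string, unknown>;
}

const CHECKPOINT_DIR = "/tmp/checkpoints-demo"; // stand-in for ~/.hermes/checkpoints/

function saveCheckpoint(cp: Checkpoint): void {
  fs.mkdirSync(CHECKPOINT_DIR, { recursive: true });
  const file = path.join(CHECKPOINT_DIR, `${cp.taskId}.json`);
  // Write to a temp file and rename so a crash mid-write never leaves a
  // truncated checkpoint behind.
  fs.writeFileSync(file + ".tmp", JSON.stringify(cp));
  fs.renameSync(file + ".tmp", file);
}

function loadCheckpoint(taskId: string): Checkpoint | null {
  const file = path.join(CHECKPOINT_DIR, `${taskId}.json`);
  if (!fs.existsSync(file)) return null;
  return JSON.parse(fs.readFileSync(file, "utf8")) as Checkpoint;
}

// Resume: skip every step that completed before the restart.
function resumeFrom(taskId: string, steps: string[]): string[] {
  const cp = loadCheckpoint(taskId);
  const start = cp ? cp.lastCompletedStep + 1 : 0;
  return steps.slice(start);
}

// Simulate a worker that finished step 1 of a 3-step task before dying.
saveCheckpoint({ taskId: "research-1", lastCompletedStep: 1, state: { urls: 3 } });
```

The atomic write-then-rename is the important detail: a checkpoint that can be half-written is worse than none.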

Proposed solution

1. Pre-restart session manifest

Before sending SIGTERM to workers, the gateway should enumerate active sessions and write a manifest:

{
  "timestamp": "2026-03-29T23:25:30Z",
  "reason": "config-change",
  "triggeredBy": "agent:main:guild-agent-birthing",
  "activeSessions": [
    {
      "key": "agent:main:main",
      "status": "processing",
      "lastUserMessage": "Can you check the garden plan?",
      "activeSubagents": ["agent:sage:guild-gardening"],
      "channel": "discord",
      "channelTarget": "user:344256406146383874"
    }
  ],
  "activeCronRuns": [
    {
      "jobId": "5a820e42-...",
      "jobName": "Pulse: Nightly Ecosystem Scan",
      "startedAt": "2026-03-29T23:20:00Z",
      "status": "running"
    }
  ]
}
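A minimal sketch of the manifest writer, assuming the gateway's session registry can enumerate active sessions and cron runs. The types and the `writeRestartManifest` name are illustrative, not existing OpenClaw API:

```typescript
// Step 1 sketch: before signalling workers, serialize in-flight work to a
// manifest that the post-restart recovery pass can consume.
import * as fs from "fs";

interface SessionEntry { key: string; status: string; activeSubagents: string[] }
interface CronEntry { jobId: string; jobName: string; startedAt: string; status: string }

function writeRestartManifest(
  pathOut: string,
  reason: string,
  sessions: SessionEntry[],
  cronRuns: CronEntry[],
): void {
  const manifest = {
    timestamp: new Date().toISOString(),
    reason,
    // Only work that is actually in flight belongs in the manifest.
    activeSessions: sessions.filter((s) => s.status === "processing"),
    activeCronRuns: cronRuns.filter((c) => c.status === "running"),
  };
  // Atomic write: never leave a half-written manifest for the next boot.
  fs.writeFileSync(pathOut + ".tmp", JSON.stringify(manifest, null, 2));
  fs.renameSync(pathOut + ".tmp", pathOut);
}
```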

2. Post-restart session recovery

After startup, read the manifest and for each interrupted session, inject a system event:

[System] Gateway restarted at {time}. Reason: {reason}. You were interrupted mid-task. Review conversation context and respond to any unanswered messages.

For interrupted cron runs, re-queue with a retry flag. For sessions with active subagents, notify the parent that its subagent was killed.
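The recovery pass can be sketched as a read–act–delete cycle, so a second restart never replays stale events. `injectSystemEvent` and `requeueCron` are hypothetical callbacks standing in for whatever the gateway actually exposes:

```typescript
// Step 2 sketch: on boot, consume the manifest exactly once.
import * as fs from "fs";

interface Manifest {
  timestamp: string;
  reason: string;
  activeSessions: { key: string; activeSubagents: string[] }[];
  activeCronRuns: { jobId: string }[];
}

function recoverFromManifest(
  manifestPath: string,
  injectSystemEvent: (sessionKey: string, text: string) => void,
  requeueCron: (jobId: string) => void,
): void {
  if (!fs.existsSync(manifestPath)) return; // clean start, nothing to recover
  const m: Manifest = JSON.parse(fs.readFileSync(manifestPath, "utf8"));
  for (const s of m.activeSessions) {
    injectSystemEvent(
      s.key,
      `[System] Gateway restarted at ${m.timestamp}. Reason: ${m.reason}. ` +
        `You were interrupted mid-task. Review conversation context and ` +
        `respond to any unanswered messages.`,
    );
    // Parents learn that their subagents died instead of waiting forever.
    for (const sub of s.activeSubagents) {
      injectSystemEvent(s.key, `[System] Your subagent ${sub} was killed by the restart.`);
    }
  }
  for (const c of m.activeCronRuns) requeueCron(c.jobId);
  fs.unlinkSync(manifestPath); // consume exactly once
}
```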

3. Restart readiness gate

When gateway restart is called:

  • Enumerate active sessions
  • If count > 0, return a warning with the list of sessions that will be interrupted
  • Require --force to skip the check
  • For agent-triggered restarts, return the warning as a tool result so the agent can decide
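The gate itself reduces to a pure check that the CLI path and the agent tool path could share; the names and shapes below are illustrative only:

```typescript
// Step 3 sketch: refuse to restart when active sessions exceed the
// configured threshold, unless explicitly forced.
interface GateResult { proceed: boolean; warning?: string }

function checkRestartGate(
  activeSessionKeys: string[],
  threshold: number, // maps to readinessGateThreshold in the proposed config
  force: boolean,    // set by --force (CLI) or an explicit tool argument
): GateResult {
  if (force || activeSessionKeys.length <= threshold) return { proceed: true };
  return {
    proceed: false,
    warning:
      `Restart would interrupt ${activeSessionKeys.length} active session(s): ` +
      activeSessionKeys.join(", ") +
      `. Re-run with --force to restart anyway.`,
  };
}
```

Returning the warning as data rather than printing it is what lets the same check serve both humans and agents: the CLI prints it, the tool path hands it back as a tool result.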

4. Drain-aware message queuing

Messages received during drain should be queued and replayed after restart, not rejected. The current resetAllLanes() mechanism should be made reliable.
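One way to sketch the drain queue, ignoring persistence across the process boundary (in practice the buffer would have to be flushed to disk alongside the manifest so it survives the restart itself; all names below are illustrative):

```typescript
// Step 4 sketch: buffer inbound messages while draining, replay them in
// arrival order once the gateway is back up.
interface InboundMessage { channel: string; target: string; body: string }

class DrainQueue {
  private draining = false;
  private buffer: InboundMessage[] = [];

  beginDrain(): void { this.draining = true; }

  // Returns true if the message was queued; the caller must not dispatch it.
  intercept(msg: InboundMessage): boolean {
    if (!this.draining) return false;
    this.buffer.push(msg);
    return true;
  }

  // After restart: hand every buffered message back to the normal dispatcher,
  // in the order it arrived. Returns the number replayed.
  replay(dispatch: (msg: InboundMessage) => void): number {
    const pending = this.buffer;
    this.buffer = [];
    this.draining = false;
    pending.forEach(dispatch);
    return pending.length;
  }
}
```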

5. Configuration

{
  "gateway": {
    "restart": {
      "sessionRecovery": true,
      "cronRetryOnInterrupt": true,
      "readinessGate": true,
      "readinessGateThreshold": 0,
      "drainQueueMessages": true,
      "manifestPath": "restart-manifest.json"
    }
  }
}

What this does NOT solve (and shouldn't)

  • Crash recovery — No pre-crash manifest without periodic state snapshots or WAL-style journaling. Separate issue.
  • Subagent state resume — Injecting "you were interrupted" is enough. Actually resuming a half-completed tool call chain is complex and error-prone.
  • Idempotency — Agents should still design idempotent workflows. Recovery is a safety net, not a substitute for good design.

Current workaround

We've built a 3-layer userspace workaround that validates the approach:

  1. Pre-restart manifest script — Shell script calls openclaw sessions --all-agents --active 10 --json, writes restart-manifest.json
  2. BOOT.md hook — Reads the manifest on startup, sends a notification summarizing interrupted work, deletes the manifest
  3. One-shot POL cron — Backup proof-of-life scheduled before restart, fires after startup

This works (tested twice, clean results both times), but it's fragile — the manifest capture is best-effort, BOOT.md runs in a fresh context with no memory of what sessions were doing, and the entire thing bypasses OpenClaw's session management.

Environment

  • OpenClaw 2026.3.28
  • macOS (Darwin 25.3.0, arm64), LaunchAgent
  • 7 agents, 3 Discord bots, BlueBubbles iMessage, 12 cron jobs
  • Restarts happen 2-5x per day during active development
