Feature: Graceful Gateway Restart with Session Recovery #57425

@lmdeagles

Description

Problem

When the gateway restarts — whether from openclaw gateway restart, a config change, SIGUSR1, or a crash — all in-flight work is silently killed. There is no mechanism for sessions to know they were interrupted, no way for parent sessions to learn their subagents died, and no recovery path other than waiting for the next heartbeat or manual intervention.

This is the single biggest reliability gap in multi-agent OpenClaw deployments.

What happens today

  1. Gateway receives restart signal
  2. Drain period begins (90s timeout)
  3. Active sessions are killed after drain
  4. Gateway comes back up — fresh slate
  5. Sessions persist on disk (JSONL) but no one reads them
  6. Subagents die without reporting back to parents
  7. Cron jobs interrupted mid-execution are not retried until next scheduled time
  8. Users in group chats see "read" receipts but never get responses

Real-world impact

Running 7 agents across Discord + iMessage + cron, a single gateway restart can:

  • Kill 3-4 active conversations simultaneously
  • Orphan subagents mid-task (research, code generation, file operations)
  • Drop cron jobs that were minutes into expensive multi-tool workflows
  • Leave group chat messages permanently unanswered
  • Break multi-step workflows where step N completed but step N+1 never fires

The blast radius scales with agent count. This is not a solo-agent problem.

Prior art: Hermes Agent

Hermes (Nous Research) is building this in their multi-agent architecture (issue #344):

  • Per-tool-call checkpointing — Sub-agent state persisted to ~/.hermes/checkpoints/ after each tool call. On failure, resume from checkpoint.
  • ResponseStore persistence — SQLite-backed state that survives restarts (shipped in v0.4.0)
  • Three-level failure escalation — Retry → Replan → Decompose further
  • One-shot job recovery — Interrupted cron-like jobs are automatically retried
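None of the Hermes internals above are reproduced here, but the first bullet — per-tool-call checkpointing with resume — can be sketched in a few lines. Everything below (the `Checkpoint` shape, the directory, the `saveCheckpoint`/`resumeFrom` names) is hypothetical illustration under stated assumptions, not Hermes code:

```typescript
// Hypothetical sketch of per-tool-call checkpointing: after each tool call,
// persist the step index and accumulated state so a restarted worker can
// resume from the last completed step instead of starting over.
import * as fs from "fs";
import * as path from "path";

interface Checkpoint {
  taskId: string;
  lastCompletedStep: number;
  state: Record<string, unknown>;
}

const CHECKPOINT_DIR = "/tmp/checkpoints-demo"; // stand-in for ~/.hermes/checkpoints/

function saveCheckpoint(cp: Checkpoint): void {
  fs.mkdirSync(CHECKPOINT_DIR, { recursive: true });
  const file = path.join(CHECKPOINT_DIR, `${cp.taskId}.json`);
  // Write to a temp file and rename so a crash mid-write never leaves a
  // truncated checkpoint behind.
  fs.writeFileSync(file + ".tmp", JSON.stringify(cp));
  fs.renameSync(file + ".tmp", file);
}

function loadCheckpoint(taskId: string): Checkpoint | null {
  const file = path.join(CHECKPOINT_DIR, `${taskId}.json`);
  if (!fs.existsSync(file)) return null;
  return JSON.parse(fs.readFileSync(file, "utf8")) as Checkpoint;
}

// Resume: skip every step that completed before the restart.
function resumeFrom(taskId: string, steps: string[]): string[] {
  const cp = loadCheckpoint(taskId);
  const start = cp ? cp.lastCompletedStep + 1 : 0;
  return steps.slice(start);
}

// Simulate a worker that finished step 1 of a 3-step task before dying.
saveCheckpoint({ taskId: "research-1", lastCompletedStep: 1, state: { urls: 3 } });
```

The atomic write-then-rename is the important detail: a checkpoint that can be half-written is worse than none.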

Proposed solution

1. Pre-restart session manifest

Before sending SIGTERM to workers, the gateway should enumerate active sessions and write a manifest:

{
  "timestamp": "2026-03-29T23:25:30Z",
  "reason": "config-change",
  "triggeredBy": "agent:main:guild-agent-birthing",
  "activeSessions": [
    {
      "key": "agent:main:main",
      "status": "processing",
      "lastUserMessage": "Can you check the garden plan?",
      "activeSubagents": ["agent:sage:guild-gardening"],
      "channel": "discord",
      "channelTarget": "user:344256406146383874"
    }
  ],
  "activeCronRuns": [
    {
      "jobId": "5a820e42-...",
      "jobName": "Pulse: Nightly Ecosystem Scan",
      "startedAt": "2026-03-29T23:20:00Z",
      "status": "running"
    }
  ]
}
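A minimal sketch of the manifest writer, assuming the gateway's session registry can enumerate active sessions and cron runs. The types and the `writeRestartManifest` name are illustrative, not existing OpenClaw API:

```typescript
// Step 1 sketch: before signalling workers, serialize in-flight work to a
// manifest that the post-restart recovery pass can consume.
import * as fs from "fs";

interface SessionEntry { key: string; status: string; activeSubagents: string[] }
interface CronEntry { jobId: string; jobName: string; startedAt: string; status: string }

function writeRestartManifest(
  pathOut: string,
  reason: string,
  sessions: SessionEntry[],
  cronRuns: CronEntry[],
): void {
  const manifest = {
    timestamp: new Date().toISOString(),
    reason,
    // Only work that is actually in flight belongs in the manifest.
    activeSessions: sessions.filter((s) => s.status === "processing"),
    activeCronRuns: cronRuns.filter((c) => c.status === "running"),
  };
  // Atomic write: never leave a half-written manifest for the next boot.
  fs.writeFileSync(pathOut + ".tmp", JSON.stringify(manifest, null, 2));
  fs.renameSync(pathOut + ".tmp", pathOut);
}
```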

2. Post-restart session recovery

After startup, read the manifest and for each interrupted session, inject a system event:

[System] Gateway restarted at {time}. Reason: {reason}. You were interrupted mid-task. Review conversation context and respond to any unanswered messages.

For interrupted cron runs, re-queue with a retry flag. For sessions with active subagents, notify the parent that its subagent was killed.
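The recovery pass can be sketched as a read–act–delete cycle, so a second restart never replays stale events. `injectSystemEvent` and `requeueCron` are hypothetical callbacks standing in for whatever the gateway actually exposes:

```typescript
// Step 2 sketch: on boot, consume the manifest exactly once.
import * as fs from "fs";

interface Manifest {
  timestamp: string;
  reason: string;
  activeSessions: { key: string; activeSubagents: string[] }[];
  activeCronRuns: { jobId: string }[];
}

function recoverFromManifest(
  manifestPath: string,
  injectSystemEvent: (sessionKey: string, text: string) => void,
  requeueCron: (jobId: string) => void,
): void {
  if (!fs.existsSync(manifestPath)) return; // clean start, nothing to recover
  const m: Manifest = JSON.parse(fs.readFileSync(manifestPath, "utf8"));
  for (const s of m.activeSessions) {
    injectSystemEvent(
      s.key,
      `[System] Gateway restarted at ${m.timestamp}. Reason: ${m.reason}. ` +
        `You were interrupted mid-task. Review conversation context and ` +
        `respond to any unanswered messages.`,
    );
    // Parents learn that their subagents died instead of waiting forever.
    for (const sub of s.activeSubagents) {
      injectSystemEvent(s.key, `[System] Your subagent ${sub} was killed by the restart.`);
    }
  }
  for (const c of m.activeCronRuns) requeueCron(c.jobId);
  fs.unlinkSync(manifestPath); // consume exactly once
}
```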

3. Restart readiness gate

When gateway restart is called:

  • Enumerate active sessions
  • If count > 0, return a warning with the list of sessions that will be interrupted
  • Require --force to skip the check
  • For agent-triggered restarts, return the warning as a tool result so the agent can decide
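The gate itself reduces to a pure check that the CLI path and the agent tool path could share; the names and shapes below are illustrative only:

```typescript
// Step 3 sketch: refuse to restart when active sessions exceed the
// configured threshold, unless explicitly forced.
interface GateResult { proceed: boolean; warning?: string }

function checkRestartGate(
  activeSessionKeys: string[],
  threshold: number, // maps to readinessGateThreshold in the proposed config
  force: boolean,    // set by --force (CLI) or an explicit tool argument
): GateResult {
  if (force || activeSessionKeys.length <= threshold) return { proceed: true };
  return {
    proceed: false,
    warning:
      `Restart would interrupt ${activeSessionKeys.length} active session(s): ` +
      activeSessionKeys.join(", ") +
      `. Re-run with --force to restart anyway.`,
  };
}
```

Returning the warning as data rather than printing it is what lets the same check serve both humans and agents: the CLI prints it, the tool path hands it back as a tool result.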

4. Drain-aware message queuing

Messages received during drain should be queued and replayed after restart, not rejected. The current resetAllLanes() mechanism should be made reliable.
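One way to sketch the drain queue, ignoring persistence across the process boundary (in practice the buffer would have to be flushed to disk alongside the manifest so it survives the restart itself; all names below are illustrative):

```typescript
// Step 4 sketch: buffer inbound messages while draining, replay them in
// arrival order once the gateway is back up.
interface InboundMessage { channel: string; target: string; body: string }

class DrainQueue {
  private draining = false;
  private buffer: InboundMessage[] = [];

  beginDrain(): void { this.draining = true; }

  // Returns true if the message was queued; the caller must not dispatch it.
  intercept(msg: InboundMessage): boolean {
    if (!this.draining) return false;
    this.buffer.push(msg);
    return true;
  }

  // After restart: hand every buffered message back to the normal dispatcher,
  // in the order it arrived. Returns the number replayed.
  replay(dispatch: (msg: InboundMessage) => void): number {
    const pending = this.buffer;
    this.buffer = [];
    this.draining = false;
    pending.forEach(dispatch);
    return pending.length;
  }
}
```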

5. Configuration

{
  "gateway": {
    "restart": {
      "sessionRecovery": true,
      "cronRetryOnInterrupt": true,
      "readinessGate": true,
      "readinessGateThreshold": 0,
      "drainQueueMessages": true,
      "manifestPath": "restart-manifest.json"
    }
  }
}

What this does NOT solve (and shouldn't)

  • Crash recovery — No pre-crash manifest without periodic state snapshots or WAL-style journaling. Separate issue.
  • Subagent state resume — Injecting "you were interrupted" is enough. Actually resuming a half-completed tool call chain is complex and error-prone.
  • Idempotency — Agents should still design idempotent workflows. Recovery is a safety net, not a substitute for good design.

Current workaround

We've built a 3-layer userspace workaround that validates the approach:

  1. Pre-restart manifest script — Shell script calls openclaw sessions --all-agents --active 10 --json, writes restart-manifest.json
  2. BOOT.md hook — Reads the manifest on startup, sends a notification summarizing interrupted work, deletes the manifest
  3. One-shot POL cron — Backup proof-of-life scheduled before restart, fires after startup

This works (tested twice, clean results both times), but it's fragile — the manifest capture is best-effort, BOOT.md runs in a fresh context with no memory of what sessions were doing, and the entire thing bypasses OpenClaw's session management.

Environment

  • OpenClaw 2026.3.28
  • macOS (Darwin 25.3.0, arm64), LaunchAgent
  • 7 agents, 3 Discord bots, BlueBubbles iMessage, 12 cron jobs
  • Restarts happen 2-5x per day during active development
