Problem
When the gateway restarts — whether from openclaw gateway restart, a config change, SIGUSR1, or a crash — all in-flight work is silently killed. There is no mechanism for sessions to know they were interrupted, no way for parent sessions to learn their subagents died, and no recovery path other than waiting for the next heartbeat or manual intervention.
This is the single biggest reliability gap in multi-agent OpenClaw deployments.
What happens today
- Gateway receives restart signal
- Drain period begins (90s timeout)
- Active sessions are killed after drain
- Gateway comes back up — fresh slate
- Sessions persist on disk (JSONL) but no one reads them
- Subagents die without reporting back to parents
- Cron jobs interrupted mid-execution are not retried until next scheduled time
- Users in group chats see "read" receipts but never get responses
Real-world impact
Running 7 agents across Discord + iMessage + cron, a single gateway restart can:
- Kill 3-4 active conversations simultaneously
- Orphan subagents mid-task (research, code generation, file operations)
- Drop cron jobs that were minutes into expensive multi-tool workflows
- Leave group chat messages permanently unanswered
- Break multi-step workflows where step N completed but step N+1 never fires
The blast radius scales with agent count. This is not a solo-agent problem.
Existing community reports
- gateway restart from agent session causes self-decapitation on macOS (launchd) #43311 — agent-triggered restart kills its own session
Prior art: Hermes Agent
Hermes (Nous Research) is building this in their multi-agent architecture (issue #344):
- Per-tool-call checkpointing — Sub-agent state persisted to ~/.hermes/checkpoints/ after each tool call. On failure, resume from checkpoint.
- ResponseStore persistence — SQLite-backed state that survives restarts (shipped in v0.4.0)
- Three-level failure escalation — Retry → Replan → Decompose further
- One-shot job recovery — Interrupted cron-like jobs are automatically retried
Proposed solution
1. Pre-restart session manifest
Before sending SIGTERM to workers, the gateway should enumerate active sessions and write a manifest:
{
  "timestamp": "2026-03-29T23:25:30Z",
  "reason": "config-change",
  "triggeredBy": "agent:main:guild-agent-birthing",
  "activeSessions": [
    {
      "key": "agent:main:main",
      "status": "processing",
      "lastUserMessage": "Can you check the garden plan?",
      "activeSubagents": ["agent:sage:guild-gardening"],
      "channel": "discord",
      "channelTarget": "user:344256406146383874"
    }
  ],
  "activeCronRuns": [
    {
      "jobId": "5a820e42-...",
      "jobName": "Pulse: Nightly Ecosystem Scan",
      "startedAt": "2026-03-29T23:20:00Z",
      "status": "running"
    }
  ]
}
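A minimal sketch of how the gateway could assemble and write this manifest before signaling workers. The names (`buildManifest`, `writeManifest`, the `SessionInfo` shape) are illustrative, not existing OpenClaw APIs; the point is filtering to in-flight work only and writing atomically.

```typescript
import { writeFileSync, renameSync } from "node:fs";

// Illustrative shapes mirroring the manifest above; not OpenClaw's real types.
interface SessionInfo {
  key: string;
  status: string;
  lastUserMessage?: string;
  activeSubagents: string[];
  channel: string;
  channelTarget: string;
}

interface CronRun {
  jobId: string;
  jobName: string;
  startedAt: string;
  status: string;
}

// Build the manifest object, keeping only work that is actually in flight.
function buildManifest(
  reason: string,
  triggeredBy: string,
  sessions: SessionInfo[],
  cronRuns: CronRun[],
) {
  return {
    timestamp: new Date().toISOString(),
    reason,
    triggeredBy,
    activeSessions: sessions.filter((s) => s.status === "processing"),
    activeCronRuns: cronRuns.filter((r) => r.status === "running"),
  };
}

// Write-then-rename so a crash mid-write never leaves a truncated manifest
// for the post-restart reader to choke on.
function writeManifest(path: string, manifest: object): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(manifest, null, 2));
  renameSync(tmp, path);
}
```

The atomic write matters because the manifest is produced at the most fragile moment of the restart; a half-written file is worse than none.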
2. Post-restart session recovery
After startup, read the manifest and for each interrupted session, inject a system event:
[System] Gateway restarted at {time}. Reason: {reason}. You were interrupted mid-task. Review conversation context and respond to any unanswered messages.
For interrupted cron runs, re-queue with a retry flag. For sessions with active subagents, notify the parent that its subagent was killed.
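The recovery pass above can be sketched as a pure planning step that maps a manifest to a list of actions: one system-event injection per interrupted session, one per killed subagent (delivered to the parent), and one retry-flagged re-queue per interrupted cron run. Shapes and action names are hypothetical, not OpenClaw's real session API.

```typescript
// Subset of the restart manifest that recovery planning needs.
interface ManifestSession {
  key: string;
  activeSubagents: string[];
}

interface RestartManifest {
  timestamp: string;
  reason: string;
  activeSessions: ManifestSession[];
  activeCronRuns: { jobId: string }[];
}

type RecoveryAction =
  | { kind: "inject-system-event"; sessionKey: string; text: string }
  | { kind: "requeue-cron"; jobId: string; retry: true };

function planRecovery(m: RestartManifest): RecoveryAction[] {
  const actions: RecoveryAction[] = [];
  for (const s of m.activeSessions) {
    // Tell the interrupted session what happened and ask it to pick up.
    actions.push({
      kind: "inject-system-event",
      sessionKey: s.key,
      text:
        `[System] Gateway restarted at ${m.timestamp}. Reason: ${m.reason}. ` +
        `You were interrupted mid-task. Review conversation context and ` +
        `respond to any unanswered messages.`,
    });
    // Parents also learn that their subagents were killed.
    for (const sub of s.activeSubagents) {
      actions.push({
        kind: "inject-system-event",
        sessionKey: s.key,
        text: `[System] Your subagent ${sub} was terminated by the gateway restart.`,
      });
    }
  }
  for (const run of m.activeCronRuns) {
    actions.push({ kind: "requeue-cron", jobId: run.jobId, retry: true });
  }
  return actions;
}
```

Keeping planning separate from execution makes the recovery logic testable without a live gateway, and makes it easy to log the plan before applying it.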
3. Restart readiness gate
When gateway restart is called:
- Enumerate active sessions
- If count > 0, return a warning with the list of sessions that will be interrupted
- Require --force to skip the check
- For agent-triggered restarts, return the warning as a tool result so the agent can decide
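The gate itself is a small check. A sketch, assuming a hypothetical checkRestartGate helper and the readinessGateThreshold semantics proposed in the configuration below:

```typescript
interface GateResult {
  proceed: boolean;
  warning?: string;
}

// Refuse to restart while more than `threshold` sessions are active,
// unless --force was passed. Threshold 0 means any active session blocks.
function checkRestartGate(
  activeSessionKeys: string[],
  force: boolean,
  threshold = 0,
): GateResult {
  if (force || activeSessionKeys.length <= threshold) {
    return { proceed: true };
  }
  return {
    proceed: false,
    warning:
      `${activeSessionKeys.length} active session(s) will be interrupted: ` +
      activeSessionKeys.join(", ") +
      `. Re-run with --force to restart anyway.`,
  };
}
```

For agent-triggered restarts, the same `warning` string would come back as the tool result, letting the agent decide whether to force the restart or wait.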
4. Drain-aware message queuing
Messages received during drain should be queued and replayed after restart, not rejected. The current resetAllLanes() mechanism should be made reliable.
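The queue-and-replay discipline can be sketched as below. This is an in-memory illustration of the contract only; a real implementation would spill the buffer to disk alongside the restart manifest, since the gateway process itself dies between beginDrain and replay. The class and method names are hypothetical.

```typescript
// Buffer messages that arrive during the drain window and replay them
// after restart instead of rejecting them.
class DrainQueue<T> {
  private draining = false;
  private buffer: T[] = [];

  beginDrain(): void {
    this.draining = true;
  }

  // Returns true if the message was delivered now, false if it was queued.
  offer(msg: T, deliver: (m: T) => void): boolean {
    if (this.draining) {
      this.buffer.push(msg);
      return false;
    }
    deliver(msg);
    return true;
  }

  // Called once the gateway is back up: replay in arrival order.
  // Returns the number of messages replayed.
  replay(deliver: (m: T) => void): number {
    this.draining = false;
    const pending = this.buffer;
    this.buffer = [];
    for (const m of pending) deliver(m);
    return pending.length;
  }
}
```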
5. Configuration
{
  "gateway": {
    "restart": {
      "sessionRecovery": true,
      "cronRetryOnInterrupt": true,
      "readinessGate": true,
      "readinessGateThreshold": 0,
      "drainQueueMessages": true,
      "manifestPath": "restart-manifest.json"
    }
  }
}
What this does NOT solve (and shouldn't)
- Crash recovery — No pre-crash manifest without periodic state snapshots or WAL-style journaling. Separate issue.
- Subagent state resume — Injecting "you were interrupted" is enough. Actually resuming a half-completed tool call chain is complex and error-prone.
- Idempotency — Agents should still design idempotent workflows. Recovery is a safety net, not a substitute for good design.
Current workaround
We've built a 3-layer userspace workaround that validates the approach:
- Pre-restart manifest script — Shell script calls openclaw sessions --all-agents --active 10 --json, writes restart-manifest.json
- BOOT.md hook — Reads the manifest on startup, sends a notification summarizing interrupted work, deletes the manifest
- One-shot POL cron — Backup proof-of-life scheduled before restart, fires after startup
This works (tested twice, clean results both times), but it's fragile — the manifest capture is best-effort, BOOT.md runs in a fresh context with no memory of what sessions were doing, and the entire thing bypasses OpenClaw's session management.
Environment
- OpenClaw 2026.3.28
- macOS (Darwin 25.3.0, arm64), LaunchAgent
- 7 agents, 3 Discord bots, BlueBubbles iMessage, 12 cron jobs
- Restarts happen 2-5x per day during active development