Resume interrupted agent sessions and cron runs after gateway restart #30043

@alexcf

Feature Request

After a gateway restart (planned or crash recovery), all in-flight work is lost:

  • Active agent sessions lose their conversation context
  • Running cron jobs are silently dropped
  • There's no mechanism to resume or retry interrupted work

Current Behaviour

  1. Gateway crashes or restarts
  2. Watchdog detects the outage and restarts the gateway (~5 min later)
  3. All in-flight sessions and cron runs are gone
  4. Agent wakes up fresh with no knowledge of what was in progress
  5. Cron jobs that were mid-execution are not retried until their next scheduled time

Desired Behaviour

Session recovery:

  • On restart, the gateway should detect sessions that were active at shutdown
  • Inject a system event into recovered sessions indicating the restart (e.g. [System] Gateway restarted. Previous session context may be incomplete.)
  • Optionally: persist session state to disk so context survives restarts
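
A minimal sketch of the session recovery idea above, assuming a hypothetical JSON journal file that the gateway updates whenever a session opens or closes. `SessionJournal`, its methods, and the file layout are illustrative names, not existing gateway APIs. Anything still listed in the journal at startup is treated as "active at shutdown" and gets the restart notice appended.

```python
import json
import os
import tempfile
from pathlib import Path

RESTART_NOTICE = "[System] Gateway restarted. Previous session context may be incomplete."


class SessionJournal:
    """Hypothetical on-disk record of which sessions are currently active."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self):
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text())

    def _save(self, sessions):
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written journal behind.
        fd, tmp = tempfile.mkstemp(dir=str(self.path.parent))
        with os.fdopen(fd, "w") as f:
            json.dump(sessions, f)
        os.replace(tmp, self.path)

    def mark_active(self, session_id, agent):
        sessions = self._load()
        sessions[session_id] = {"agent": agent, "events": []}
        self._save(sessions)

    def mark_closed(self, session_id):
        sessions = self._load()
        sessions.pop(session_id, None)
        self._save(sessions)

    def recover(self):
        """Call once at startup: anything still listed was active at shutdown."""
        sessions = self._load()
        for state in sessions.values():
            state["events"].append(RESTART_NOTICE)
        self._save(sessions)
        return sessions
```

The journal only records which sessions existed; persisting full conversation context (the optional third point) would mean storing more per session than this sketch does.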

Cron run recovery:

  • Track in-flight cron runs in a durable store (e.g. SQLite or file)
  • On restart, check for interrupted runs
  • Re-queue interrupted runs with a flag indicating they're retries
  • Respect a configurable retry policy (e.g. max retries, backoff)
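
The cron recovery steps above could look roughly like this, assuming SQLite as the durable store. The table layout, the `CronRunStore` class, and the `max_retries` / `base_backoff_s` parameters are all hypothetical names chosen for illustration; the backoff shown is plain exponential doubling, which is one possible policy among several.

```python
import sqlite3
import time


class CronRunStore:
    """Hypothetical durable record of in-flight cron runs."""

    def __init__(self, db_path, max_retries=3, base_backoff_s=60):
        self.db = sqlite3.connect(db_path)
        self.max_retries = max_retries
        self.base_backoff_s = base_backoff_s
        # status is one of: running, done, queued, failed.
        # attempt is 0 for the first try; not_before is epoch seconds.
        self.db.execute(
            """CREATE TABLE IF NOT EXISTS cron_runs (
                   run_id INTEGER PRIMARY KEY AUTOINCREMENT,
                   job TEXT NOT NULL,
                   status TEXT NOT NULL,
                   attempt INTEGER NOT NULL,
                   not_before REAL NOT NULL
               )"""
        )
        self.db.commit()

    def start(self, job, attempt=0):
        cur = self.db.execute(
            "INSERT INTO cron_runs (job, status, attempt, not_before) "
            "VALUES (?, 'running', ?, ?)",
            (job, attempt, time.time()),
        )
        self.db.commit()
        return cur.lastrowid

    def finish(self, run_id):
        self.db.execute(
            "UPDATE cron_runs SET status = 'done' WHERE run_id = ?", (run_id,)
        )
        self.db.commit()

    def recover(self, now=None):
        """Call once at startup: rows still 'running' were interrupted."""
        now = time.time() if now is None else now
        requeued = []
        rows = self.db.execute(
            "SELECT run_id, job, attempt FROM cron_runs WHERE status = 'running'"
        ).fetchall()
        for run_id, job, attempt in rows:
            next_attempt = attempt + 1
            if next_attempt > self.max_retries:
                # Retry budget exhausted: record the failure, do not re-queue.
                self.db.execute(
                    "UPDATE cron_runs SET status = 'failed' WHERE run_id = ?",
                    (run_id,),
                )
                continue
            not_before = now + self.base_backoff_s * 2 ** attempt
            self.db.execute(
                "UPDATE cron_runs SET status = 'queued', attempt = ?, "
                "not_before = ? WHERE run_id = ?",
                (next_attempt, not_before, run_id),
            )
            requeued.append(
                {"job": job, "attempt": next_attempt,
                 "is_retry": True, "not_before": not_before}
            )
        self.db.commit()
        return requeued
```

A scheduler would call `start()` before a job runs, `finish()` when it completes, and `recover()` once at gateway startup. The `is_retry` flag lets a job such as a long email cleanup decide whether to resume or start over.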

Wake mechanism:

  • After successful restart, automatically send a wake event to all agents that had active sessions
  • This ensures agents can check for in-progress work rather than waiting for their next heartbeat
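
As a rough illustration of the wake step, the sketch below assumes the gateway already has a mapping of sessions that were active at shutdown (session id to a dict with an `agent` key) and some way to deliver an event to an agent. Both the mapping shape and the `send_event` callback are hypothetical; the event payload is also made up for the example.

```python
def wake_agents(recovered_sessions, send_event):
    """Send one wake event per agent that had an active session at shutdown."""
    # De-duplicate so an agent with several sessions is woken only once.
    agents = sorted({s["agent"] for s in recovered_sessions.values()})
    for agent in agents:
        send_event(agent, {"type": "wake", "reason": "gateway_restart"})
    return agents
```

Called once after a successful restart, this would replace the manual cron wake event listed in the workarounds.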

Workarounds Currently in Use

  • External watchdog script (watchdog.sh) handles restart detection
  • Manual cron wake event after restart to kick agents
  • Daily memory files for manual context recovery
  • openclaw-safe-restart scripts for planned restarts

Impact

This is especially important for:

  • Long-running cron jobs (e.g. email cleanup that takes 9+ minutes)
  • Multi-step agent workflows that get interrupted
  • Users with longer heartbeat intervals (1-2h) who won't notice the gap

Environment

  • macOS, LaunchAgent-based gateway
  • Watchdog runs every 5 minutes
  • Multiple agents (kit, cron-bot, etc.)

Labels: stale (marked as stale due to inactivity)
