RFC: Task Continuation Across Gateway Restarts (Checkpoint + Auto-Resume) #60864

@Gernspi

Summary

OpenClaw currently lacks a standardized, first-class mechanism to continue in-progress tasks across gateway restarts. When a restart occurs, whether explicit or required due to config changes, runs are interrupted and may require manual recovery or external orchestration.

This RFC proposes a pragmatic v1 checkpoint + auto-resume mechanism:

  • No full in-memory state rehydration
  • No external scripts required
  • Continuation implemented as a new follow-up run seeded from a persisted checkpoint
  • Task continues automatically without requiring a new user prompt

Goals

  • Persist sufficient state before restart to continue the same task
  • Transition run state to paused_for_restart
  • On startup, automatically resume exactly once (idempotent / at-most-once)
  • Resume via a new run with:
      • same user goal
      • same plan progress (last completed + next step)
      • relevant intermediate context/artifacts
  • No manual intervention required
  • Explicit failure states, with no silent “running without progress”

Non-Goals (v1)

  • Full RAM snapshot or exact process rehydration
  • Perfect reproduction of all tool/subagent internal state
  • Mandatory full chat transcript restoration
  • External orchestration as a requirement

Current Limitation / Motivation

  • Runs are interrupted by restart.
  • There is no built-in continuation mechanism.
  • External workarounds are required today.
  • Runs may remain “running” without progress after restart, or require manual re-prompting to reconstruct context.

Proposed Approach

Checkpoint Lifecycle

A run that requires a restart should follow a standardized lifecycle:

running → paused_for_restart → resuming → resumed

An implementation may treat resuming as a short-lived internal state, but the persisted state model should still be able to represent it.
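
The lifecycle can be read as a small state machine. The following is a minimal sketch in Python, chosen only for illustration; the names RunStatus, ALLOWED, and transition are hypothetical and not part of OpenClaw. It also includes the failed_resume status that this RFC lists later under "Status".

```python
from enum import Enum


class RunStatus(str, Enum):
    RUNNING = "running"
    PAUSED_FOR_RESTART = "paused_for_restart"
    RESUMING = "resuming"
    RESUMED = "resumed"
    FAILED_RESUME = "failed_resume"


# Allowed moves, following the lifecycle above. RESUMED and FAILED_RESUME
# are treated as terminal here, which is an assumption of this sketch.
ALLOWED = {
    RunStatus.RUNNING: {RunStatus.PAUSED_FOR_RESTART},
    RunStatus.PAUSED_FOR_RESTART: {RunStatus.RESUMING},
    RunStatus.RESUMING: {RunStatus.RESUMED, RunStatus.FAILED_RESUME},
    RunStatus.RESUMED: set(),
    RunStatus.FAILED_RESUME: set(),
}


def transition(current: RunStatus, target: RunStatus) -> RunStatus:
    """Return target if the move is legal, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Rejecting illegal moves in one place makes "no silent running without progress" easier to enforce, since every status change has to pass through the same check.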

Trigger Conditions

Checkpointing should occur for:

  • Explicit restart, for example openclaw gateway restart
  • Restart required due to config reload, for example “config change requires restart”

As soon as a restart becomes part of normal task execution, continuation should apply.

Checkpoint Content (minimal)

A checkpoint should be sufficient to resume the same task functionally, not as a full runtime snapshot.

Identity

  • original run/task IDs
  • origin/routing: channel/provider, peer/chat identifiers
  • account ID where applicable
  • timestamps

Task semantics

  • userGoal for the original request
  • plan plus cursor:
      • last completed step
      • next step
  • relevant intermediate results/artifacts:
      • file diffs
      • computed outputs
      • decisions
  • tool-context references, without secrets

Resume metadata

  • resume reason: explicit restart vs restart-required-by-reload
  • resume policy: at-most-once

Status

  • paused_for_restart | resuming | resumed | failed_resume
  • error diagnostics where applicable

Storage should be local, durable, and use atomic writes.
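
One possible shape for such a checkpoint, sketched in Python with JSON serialization. Every field name below is an assumption derived from the lists above, not an agreed schema; storage format is still an open question in this RFC.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class PlanCursor:
    last_completed_step: Optional[str]
    next_step: Optional[str]


@dataclass
class Checkpoint:
    # Identity
    run_id: str
    task_id: str
    channel: str
    peer_id: str
    created_at: str  # ISO 8601 timestamp
    # Task semantics
    user_goal: str
    plan: List[str]
    cursor: PlanCursor
    artifacts: List[Dict[str, str]] = field(default_factory=list)
    tool_context_refs: List[str] = field(default_factory=list)  # references only, no secrets
    # Resume metadata
    resume_reason: str = "explicit_restart"  # or "restart_required_by_reload"
    resume_policy: str = "at_most_once"
    # Status
    status: str = "paused_for_restart"
    error: Optional[str] = None
    account_id: Optional[str] = None
    schema_version: int = 1  # hypothetical, to allow later format changes


def to_json(cp: Checkpoint) -> str:
    return json.dumps(asdict(cp), indent=2, sort_keys=True)


def from_json(text: str) -> Checkpoint:
    raw = json.loads(text)
    raw["cursor"] = PlanCursor(**raw["cursor"])
    return Checkpoint(**raw)
```

Keeping the cursor as plain step identifiers rather than planner objects is what lets the checkpoint stay a functional description of the task instead of a runtime snapshot.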

Resume Mechanism

On gateway start:

  • detect checkpoints in paused_for_restart
  • atomically claim one for resume
  • create a new follow-up run seeded with checkpoint context
  • mark checkpoint as resumed, linking it to the new run ID
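
The four startup steps above can be sketched as follows. This is an illustration under stated assumptions: InMemoryCheckpointStore is a hypothetical stand-in for the durable store, create_followup_run is a hypothetical callback that starts the new run and returns its ID, and the field names (runId, claimToken, resumedRunId) are illustrative.

```python
import threading
import uuid
from typing import Callable, Dict, List, Optional


class InMemoryCheckpointStore:
    """Hypothetical stand-in for the durable checkpoint store."""

    def __init__(self, checkpoints: List[dict]) -> None:
        self._lock = threading.Lock()
        self._items: Dict[str, dict] = {c["runId"]: dict(c) for c in checkpoints}

    def paused_ids(self) -> List[str]:
        with self._lock:
            return [rid for rid, c in self._items.items()
                    if c["status"] == "paused_for_restart"]

    def claim(self, run_id: str) -> Optional[str]:
        """Move paused_for_restart -> resuming; return a token, or None if not claimable."""
        with self._lock:
            cp = self._items[run_id]
            if cp["status"] != "paused_for_restart":
                return None
            cp["status"] = "resuming"
            cp["claimToken"] = uuid.uuid4().hex
            return cp["claimToken"]

    def finish(self, run_id: str, token: str, status: str, **extra: str) -> None:
        with self._lock:
            cp = self._items[run_id]
            if cp.get("claimToken") != token:
                raise RuntimeError("claim token mismatch")
            cp.update(extra)
            cp["status"] = status

    def get(self, run_id: str) -> dict:
        with self._lock:
            return dict(self._items[run_id])


def resume_on_startup(store: InMemoryCheckpointStore,
                      create_followup_run: Callable[[dict], str]) -> List[str]:
    """Detect, claim, create a follow-up run, then mark resumed or failed_resume."""
    new_run_ids: List[str] = []
    for run_id in store.paused_ids():
        token = store.claim(run_id)
        if token is None:
            continue  # another startup path already claimed this checkpoint
        try:
            new_id = create_followup_run(store.get(run_id))
        except Exception as exc:
            store.finish(run_id, token, "failed_resume", error=str(exc))
            continue
        store.finish(run_id, token, "resumed", resumedRunId=new_id)
        new_run_ids.append(new_id)
    return new_run_ids
```

Calling resume_on_startup a second time on the same store returns an empty list, because no checkpoint is left in paused_for_restart.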

Idempotency / At-Most-Once

Requirements:

  • no duplicate resume
  • safe under restart loops
  • atomic state transitions

Suggested model:

  • paused_for_restart → resuming(token) → resumed | failed_resume
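
One way to get an atomic claim on a local filesystem is exclusive file creation: only one caller can create the claim file, and the file survives restart loops. The sketch below assumes one JSON file per checkpoint and a sibling file with a hypothetical .claim suffix; neither is specified by this RFC.

```python
import os
import uuid
from pathlib import Path
from typing import Optional


def try_claim(checkpoint_path: Path) -> Optional[str]:
    """Try to claim a checkpoint for resume.

    Returns a fresh claim token if this caller won the claim, or None if a
    claim file already exists (another startup path, or an earlier start).
    """
    claim_path = checkpoint_path.with_suffix(".claim")
    token = uuid.uuid4().hex
    try:
        # O_CREAT | O_EXCL fails with FileExistsError if the file is already there,
        # so at most one caller succeeds.
        fd = os.open(claim_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return None
    with os.fdopen(fd, "w") as fh:
        fh.write(token)
        fh.flush()
        os.fsync(fh.fileno())  # persist the claim before creating the new run
    return token
```

Because the claim is persisted before the follow-up run is created, a crash between the two steps leaves a claimed checkpoint with no run. That favors at-most-once over guaranteed resume, and such stale claims would need the bounded-retry or manual path described under Failure Handling.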

Failure Handling

  • checkpoint write failure → explicit error state, no silent restart
  • resume failure → failed_resume with error details
  • bounded retries only, no infinite loops
  • multiple checkpoints resumed independently
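
A bounded retry loop for the resume step might look like the sketch below. It assumes the hypothetical attempt callable is safe to retry, meaning it either creates the follow-up run or fails before creating anything; the attempt count and backoff values are placeholders, not proposed defaults.

```python
import time
from typing import Callable, Tuple


def resume_with_bounded_retries(
    attempt: Callable[[], str],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Return ("resumed", new_run_id) or ("failed_resume", error_details)."""
    last_error = ""
    for n in range(1, max_attempts + 1):
        try:
            return "resumed", attempt()
        except Exception as exc:
            last_error = f"attempt {n}/{max_attempts}: {exc}"
            if n < max_attempts:
                sleep(base_delay_s * 2 ** (n - 1))  # exponential backoff
    # The budget is exhausted: surface an explicit failure instead of looping.
    return "failed_resume", last_error
```

The returned error string is what would be stored alongside failed_resume as the error details.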

UI / Messaging Behavior

  • No additional user prompt should be required to resume.
  • Optional minimal info: “Resumed after restart; continuing from step X…”
  • No duplicate notifications and no debug dumps.
  • If resume fails, show explicit status plus error summary.

Backwards Compatibility

  • Feature should activate only when restart is required, either explicitly or internally detected.
  • Existing behavior remains unchanged otherwise.

Acceptance Criteria

  • A run that hits requires_restart is checkpointed before restart.
  • The run transitions to paused_for_restart.
  • After gateway start, resume occurs automatically and exactly once.
  • Resume creates a new run that continues the same task with:
      • same goal
      • same plan progress, including last and next step
      • relevant context and artifacts
  • No manual input is required.
  • No double resume occurs.
  • Clear failure states exist, with no silent hangs.

Open Questions

  • checkpoint storage location and format
  • definition of safe plan-step boundaries
  • handling non-idempotent tool calls with side effects
  • cross-backend consistency for subagents, ACP, and similar runtimes

PR Implementation Plan

Overview / Strategy

Implement this in two small, reviewable phases:

  1. Checkpoint persistence + status model
  2. Auto-resume on gateway startup

v1 should stay intentionally minimal: resume by creating a new follow-up run seeded from checkpoint data, not by trying to rehydrate in-memory runtime state.

Phase 1: Checkpoint + Status Model

Areas likely affected

  • Gateway restart orchestration
  • Restart-required-by-reload logic
  • Run/task runtime state model
  • Persistence/storage layer

Core changes

  • Add explicit statuses:
      • paused_for_restart
      • resuming
      • resumed
      • failed_resume
  • Define checkpoint schema, for example JSON, containing:
      • IDs and routing info
      • goal
      • plan cursor with last and next step
      • artifact summary such as paths and diff pointers
      • tool-context references without secrets
      • resume reason and policy
  • Implement durable checkpoint store, for example under:
      • ~/.openclaw/state/checkpoints/*.json
  • Use atomic write strategy:
      • write temp file
      • rename into place
  • Add checkpoint creation hook:
      • when restart is required, serialize checkpoint for each impacted run
      • transition run state to paused_for_restart
      • fail clearly if checkpoint write fails
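
The "write temp file, rename into place" strategy can be sketched as below. The function name and the one-file-per-run layout are assumptions; the temp file is created in the same directory so that the rename stays on one filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint_atomic(directory: Path, run_id: str, payload: dict) -> Path:
    """Write <directory>/<run_id>.json so readers never see a partial file."""
    directory.mkdir(parents=True, exist_ok=True)
    final_path = directory / f"{run_id}.json"
    fd, tmp_name = tempfile.mkstemp(dir=directory, prefix=f".{run_id}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())  # make the contents durable before the rename
        os.replace(tmp_name, final_path)  # atomic replace on the same filesystem
    except BaseException:
        # Clean up the temp file and re-raise, so the caller can fail clearly
        # instead of restarting without a checkpoint.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return final_path
```

This sketch does not fsync the containing directory; on POSIX systems that extra step may be needed for the rename itself to survive a power loss.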

Tests

  • unit tests for schema validation
  • unit tests for atomic write and atomic claim
  • unit tests for run status transitions
  • integration-style test for restart-required flow creating checkpoint and paused state

Risks

  • defining a stable plan cursor if planner state is currently too implicit
  • accidentally persisting secrets from tool or environment context
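
To illustrate the second risk, the sketch below redacts values by key name before a tool context is serialized. The key pattern is a hypothetical heuristic and will miss secrets stored under innocuous names; an allow-list of fields that may be persisted is likely the safer design.

```python
import re
from typing import Any

# Hypothetical heuristic: key names that commonly hold credentials.
SENSITIVE_KEY = re.compile(
    r"(secret|token|password|passwd|api[_-]?key|authorization|cookie)",
    re.IGNORECASE,
)
REDACTED = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive-looking dict entries masked."""
    if isinstance(value, dict):
        return {
            key: REDACTED if SENSITIVE_KEY.search(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

For example, a context containing an API_KEY environment entry would be persisted with that value replaced by [REDACTED], while unrelated entries such as PATH pass through unchanged.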

Phase 2: Auto-Resume Hook

Areas likely affected

  • Gateway startup/bootstrap sequence
  • Run scheduler/dispatcher
  • Status reporting

Core changes

  • On gateway start:
      • scan checkpoint store for paused_for_restart
      • claim checkpoint atomically
      • create a new run with:
          • same routing
          • same goal
          • seeded context summary
      • mark checkpoint resumed with resumedRunId
  • Enforce at-most-once:
      • claim token plus persisted transition prevents double resume
      • handle concurrent startup paths safely

Tests

  • unit tests for claim semantics and no double resume
  • integration test:
      • create paused checkpoint
      • run startup resume hook
      • assert exactly one new run is created
      • assert checkpoint is marked resumed
  • failure-mode tests:
      • corrupted checkpoint file → failed_resume with diagnostic
      • resume creation failure → failed_resume without infinite retry
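
The corrupted-file case could be exercised against a loader like the one sketched here. The function name, the required-field list, and the shape of the returned record are all assumptions for illustration.

```python
import json
from pathlib import Path

# Hypothetical minimum set of fields a checkpoint needs to be resumable.
REQUIRED_FIELDS = ("runId", "userGoal", "status")


def load_for_resume(path: Path) -> dict:
    """Load a checkpoint, or return a failed_resume record with a diagnostic."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:  # JSON and decoding errors are ValueErrors
        return {"status": "failed_resume",
                "error": f"unreadable checkpoint {path.name}: {exc}"}
    if not isinstance(data, dict):
        return {"status": "failed_resume",
                "error": f"checkpoint {path.name} is not a JSON object"}
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        return {"status": "failed_resume",
                "error": f"checkpoint {path.name} missing fields: {', '.join(missing)}"}
    return data
```

A test would write a truncated JSON file, call the loader, and assert that the result has status failed_resume with a non-empty error and that no follow-up run was created.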

Failure Modes / Handling Checklist

  • Checkpoint write fails → run marked failed and restart blocked or clearly reported
  • Resume fails → checkpoint marked failed_resume; no follow-up run created
  • Restart loop → checkpoint remains paused/resuming with bounded retry or manual intervention path
  • Multiple paused runs → each checkpoint resumed independently with bounded concurrency
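
For the last item, independent resume with bounded concurrency can be sketched with a fixed-size worker pool. The names resume_all and resume_one are hypothetical, and a gateway that is not thread-based would use its own equivalent of a concurrency limit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def resume_all(
    checkpoint_ids: List[str],
    resume_one: Callable[[str], None],
    max_workers: int = 2,
) -> Dict[str, str]:
    """Resume each checkpoint independently, at most max_workers at a time."""

    def safe(checkpoint_id: str) -> str:
        try:
            resume_one(checkpoint_id)
            return "resumed"
        except Exception:
            # One failing checkpoint must not block or fail the others.
            return "failed_resume"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(checkpoint_ids, pool.map(safe, checkpoint_ids)))
```

Each checkpoint gets its own outcome, so a single corrupted or unresumable run ends in failed_resume without affecting the rest.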

Deliberate Constraints for v1

  • Not full memory rehydration
  • Not restoring complete subagent graphs
  • Resume is always a new run seeded from checkpoint
  • No external scripts required

Metadata

Labels

  • P2 (Normal backlog priority with limited blast radius.)
  • clawsweeper:fix-shape-clear (ClawSweeper found a clear likely implementation shape for this issue.)
  • clawsweeper:needs-maintainer-review (ClawSweeper marked this issue as needing maintainer review before automation.)
  • clawsweeper:needs-product-decision (ClawSweeper marked this issue as needing a product or behavior decision.)
  • clawsweeper:no-new-fix-pr (ClawSweeper does not recommend queueing a new automated fix PR for this issue.)
  • impact:data-loss (Can lose, corrupt, or silently drop user/session/config data.)
  • impact:message-loss (Channel message delivery can be lost, duplicated, or misrouted.)
  • impact:session-state (Session, memory, transcript, context, or agent state can drift or corrupt.)
  • issue-rating: 🌊 off-meta tidepool (Issue quality rating does not apply to this item.)
