## RFC: Task Continuation Across Gateway Restarts (Checkpoint + Auto-Resume)

### Summary

OpenClaw currently lacks a standardized, first-class mechanism to continue **in-progress tasks** across **gateway restarts**. When a restart occurs, whether explicit or required due to config changes, runs are interrupted and may require manual recovery or external orchestration.

This RFC proposes a pragmatic **v1 checkpoint + auto-resume mechanism**:

* **No full in-memory state rehydration**
* **No external scripts required**
* Continuation implemented as a **new follow-up run seeded from a persisted checkpoint**
* Task continues automatically **without requiring a new user prompt**

### Goals

* Persist sufficient state before restart to continue the **same task**
* Transition run state to **`paused_for_restart`**
* On startup, automatically resume each paused run **exactly once**: the resume is idempotent, and a checkpoint is never resumed twice (at-most-once)
* Resume via a new run with:
  * same **user goal**
  * same **plan progress** (last completed + next step)
  * relevant intermediate context/artifacts
* No manual intervention required
* Explicit failure states, with no silent “running without progress”

### Non-Goals (v1)

* Full RAM snapshot or exact process rehydration
* Perfect reproduction of all tool/subagent internal state
* Mandatory full chat transcript restoration
* External orchestration as a requirement

### Current Limitation / Motivation

* Runs are interrupted by restart.
* There is no built-in continuation mechanism.
* External workarounds are required today.
* Runs may remain “running” without progress after restart, or require manual re-prompting to reconstruct context.

### Proposed Approach

#### Checkpoint Lifecycle

A run that requires a restart should follow a standardized lifecycle:

`running → paused_for_restart → resuming → resumed`

An implementation may treat `resuming` as an internal, short-lived state, but the persisted state model should still support it.
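
The lifecycle can be sketched as an explicit transition table. This is a minimal illustration, not OpenClaw code: `RunStatus`, `ALLOWED`, and `transition` are hypothetical names, and the `failed_resume` edges are taken from the status list later in this RFC.

```python
from enum import Enum


class RunStatus(str, Enum):
    RUNNING = "running"
    PAUSED_FOR_RESTART = "paused_for_restart"
    RESUMING = "resuming"
    RESUMED = "resumed"
    FAILED_RESUME = "failed_resume"


# Hypothetical transition table; terminal states have no outgoing edges.
ALLOWED = {
    RunStatus.RUNNING: {RunStatus.PAUSED_FOR_RESTART},
    RunStatus.PAUSED_FOR_RESTART: {RunStatus.RESUMING, RunStatus.FAILED_RESUME},
    RunStatus.RESUMING: {RunStatus.RESUMED, RunStatus.FAILED_RESUME},
    RunStatus.RESUMED: set(),
    RunStatus.FAILED_RESUME: set(),
}


def transition(current: RunStatus, target: RunStatus) -> RunStatus:
    """Return target if the edge is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Rejecting unknown edges up front means a bug in the restart path surfaces as an error rather than as a run stuck in an undefined state.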

#### Trigger Conditions

Checkpointing should occur for:

* Explicit restart, for example `openclaw gateway restart`
* Restart required due to config reload, for example “config change requires restart”

Whenever a restart is triggered as part of normal task execution, continuation should apply.

#### Checkpoint Content (minimal)

A checkpoint should be sufficient to resume the same task functionally, not as a full runtime snapshot.

**Identity**

* original run/task IDs
* origin/routing: channel/provider, peer/chat identifiers
* account ID where applicable
* timestamps

**Task semantics**

* `userGoal` for the original request
* plan plus cursor:
  * last completed step
  * next step
* relevant intermediate results/artifacts:
  * file diffs
  * computed outputs
  * decisions
* tool-context references, without secrets

**Resume metadata**

* resume reason: explicit restart vs restart-required-by-reload
* resume policy: at-most-once

**Status**

* `paused_for_restart | resuming | resumed | failed_resume`
* error diagnostics where applicable

Storage should be local, durable, and use atomic writes.
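
One possible shape for such a checkpoint is sketched below as Python dataclasses serialized to JSON. Apart from `userGoal` and the status values, every field name here (`planCursor`, `toolContextRefs`, `resumeReason`, and so on) is an assumption for illustration; the actual schema is an open question in this RFC.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class PlanCursor:
    lastCompletedStep: Optional[str]
    nextStep: Optional[str]


@dataclass
class Checkpoint:
    # Identity
    runId: str
    taskId: str
    routing: dict  # channel/provider and peer/chat identifiers
    createdAt: str  # ISO 8601 timestamp
    # Task semantics
    userGoal: str
    planCursor: PlanCursor
    artifacts: list = field(default_factory=list)  # diffs, outputs, decisions
    toolContextRefs: list = field(default_factory=list)  # references only, no secrets
    # Resume metadata
    resumeReason: str = "explicit_restart"
    resumePolicy: str = "at_most_once"
    # Status
    status: str = "paused_for_restart"
    error: Optional[str] = None
    accountId: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Checkpoint":
        data = json.loads(text)
        data["planCursor"] = PlanCursor(**data["planCursor"])
        return cls(**data)
```

A checkpoint built this way round-trips through `to_json` and `from_json` unchanged, which is the property a schema-validation unit test would check first.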

#### Resume Mechanism

On gateway start:

* detect checkpoints in `paused_for_restart`
* atomically claim one for resume
* create a new follow-up run seeded with checkpoint context
* mark checkpoint as `resumed`, linking it to the new run ID
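
The startup steps above could look roughly like the following sketch. It assumes one JSON file per checkpoint in a local directory and a single startup path; `resume_on_startup`, `_write_atomic`, and the `create_run` callback are hypothetical names. The read-then-write claim shown here is not safe against concurrent claimers, which would need the atomic claim required under Idempotency.

```python
import json
import os
from pathlib import Path
from typing import Callable, List


def _write_atomic(path: Path, data: dict) -> None:
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(data))
    os.replace(tmp, path)  # atomic rename on the same filesystem


def resume_on_startup(store: Path, create_run: Callable[[dict], str]) -> List[str]:
    """Resume every checkpoint in paused_for_restart; return the new run IDs."""
    new_run_ids = []
    for path in sorted(store.glob("*.json")):
        try:
            cp = json.loads(path.read_text())
        except json.JSONDecodeError:
            continue  # unreadable checkpoint: reporting it is out of scope for this sketch
        if not isinstance(cp, dict) or cp.get("status") != "paused_for_restart":
            continue
        # Persist "resuming" before creating the run, so a crash at this
        # point cannot lead to a second resume on the next startup.
        cp["status"] = "resuming"
        _write_atomic(path, cp)
        try:
            run_id = create_run(cp)
        except Exception as exc:
            cp["status"] = "failed_resume"
            cp["error"] = str(exc)
        else:
            cp["status"] = "resumed"
            cp["resumedRunId"] = run_id
            new_run_ids.append(run_id)
        _write_atomic(path, cp)
    return new_run_ids
```

Calling the function a second time finds no checkpoint in `paused_for_restart`, so no further runs are created.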

#### Idempotency / At-Most-Once

Requirements:

* no duplicate resume
* safe under restart loops
* atomic state transitions

Suggested model:

* `paused_for_restart → resuming(token) → resumed | failed_resume`
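
One way to get an atomic claim on a local filesystem is exclusive file creation. The sketch below is an assumption, not a decided design: `try_claim` and the sidecar `.claim` file are hypothetical. It relies on `os.open` with `O_CREAT | O_EXCL`, which fails if the file already exists, so at most one caller obtains a token.

```python
import os
import uuid
from pathlib import Path
from typing import Optional


def try_claim(checkpoint_path: Path) -> Optional[str]:
    """Try to claim a checkpoint; return a claim token, or None if already claimed."""
    token = uuid.uuid4().hex
    claim_path = checkpoint_path.with_suffix(".claim")
    try:
        # O_EXCL makes creation fail if the claim file exists, so only one caller wins.
        fd = os.open(claim_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return None
    with os.fdopen(fd, "w") as f:
        f.write(token)
        f.flush()
        os.fsync(f.fileno())  # make the claim durable before resuming
    return token
```

Because the claim file survives a crash, a gateway that dies while resuming will not claim the same checkpoint again on the next start. That gives at-most-once behavior, at the cost of needing an explicit recovery path for claims that were never completed.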

#### Failure Handling

* checkpoint write failure → explicit error state, no silent restart
* resume failure → `failed_resume` with error details
* bounded retries only, no infinite loops
* multiple checkpoints resumed independently
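
Bounded retries that also survive restart loops could be implemented by persisting an attempt counter in the checkpoint itself. In this sketch the `resumeAttempts` field, the `plan_resume_attempt` function, and the default limit of 3 are all assumptions; the caller is expected to persist the returned checkpoint before it attempts the resume.

```python
def plan_resume_attempt(checkpoint: dict, max_attempts: int = 3) -> dict:
    """Return the checkpoint's next state: another attempt, or failed_resume."""
    attempts = checkpoint.get("resumeAttempts", 0)
    if attempts >= max_attempts:
        return {
            **checkpoint,
            "status": "failed_resume",
            "error": f"gave up after {attempts} resume attempts",
        }
    # Count the attempt before it happens, so a crash mid-resume still
    # consumes one attempt and a restart loop terminates.
    return {**checkpoint, "status": "resuming", "resumeAttempts": attempts + 1}
```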

### UI / Messaging Behavior

* No additional user prompt should be required to resume.
* Optional minimal info: “Resumed after restart; continuing from step X…”
* No duplicate notifications and no debug dumps.
* If resume fails, show explicit status plus error summary.

### Backwards Compatibility

* The feature should activate only when a restart is required, whether requested explicitly or detected internally.
* Existing behavior remains unchanged otherwise.

### Acceptance Criteria

* A run that hits `requires_restart` is checkpointed before restart.
* The run transitions to `paused_for_restart`.
* After gateway start, resume occurs automatically and **exactly once**.
* Resume creates a **new run** that continues the same task with:
  * same goal
  * same plan progress, including last and next step
  * relevant context and artifacts
* No manual input is required.
* No double resume occurs.
* Clear failure states exist, with no silent hangs.

### Open Questions

* checkpoint storage location and format
* definition of safe plan-step boundaries
* handling non-idempotent tool calls with side effects
* cross-backend consistency for subagents, ACP, and similar runtimes

## PR Implementation Plan

### Overview / Strategy

Implement this in two small, reviewable phases:

1. **Checkpoint persistence + status model**
2. **Auto-resume on gateway startup**

v1 should stay intentionally minimal: resume by creating a **new follow-up run** seeded from checkpoint data, not by trying to rehydrate in-memory runtime state.

### Phase 1: Checkpoint + Status Model

**Areas likely affected**

* Gateway restart orchestration
* Restart-required-by-reload logic
* Run/task runtime state model
* Persistence/storage layer

**Core changes**

* Add explicit statuses:
  * `paused_for_restart`
  * `resuming`
  * `resumed`
  * `failed_resume`
* Define checkpoint schema, for example JSON, containing:
  * IDs and routing info
  * goal
  * plan cursor with last and next step
  * artifact summary such as paths and diff pointers
  * tool-context references without secrets
  * resume reason and policy
* Implement durable checkpoint store, for example under:
  * `~/.openclaw/state/checkpoints/*.json`
* Use atomic write strategy:
  * write temp file
  * rename into place
* Add checkpoint creation hook:
  * when restart is required, serialize checkpoint for each impacted run
  * transition run state to `paused_for_restart`
  * fail clearly if checkpoint write fails
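
The temp-file-plus-rename strategy can be sketched as follows. `write_checkpoint_atomic` is a hypothetical helper and the one-file-per-checkpoint layout is an assumption. It uses `os.replace`, which replaces the destination atomically when source and destination are on the same filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint_atomic(store: Path, checkpoint_id: str, data: dict) -> Path:
    """Write a checkpoint so readers see the old file or the new one, never a partial one."""
    store.mkdir(parents=True, exist_ok=True)
    final_path = store / f"{checkpoint_id}.json"
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=store, prefix=f".{checkpoint_id}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, final_path)
    except BaseException:
        # Clean up the temp file and re-raise so the failure is explicit.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path
```

Any exception propagates to the caller, which matches the requirement to fail clearly when a checkpoint cannot be written. Syncing the containing directory after the rename is omitted here and may be needed for full durability on some filesystems.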

**Tests**

* unit tests for schema validation
* unit tests for atomic write and atomic claim
* unit tests for run status transitions
* integration-style test for restart-required flow creating checkpoint and paused state

**Risks**

* defining a stable plan cursor if planner state is currently too implicit
* accidentally persisting secrets from tool or environment context
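
To reduce the risk of persisting secrets, tool context could pass through a redaction step before serialization. The sketch below is only a last-resort filter under assumed key names: `SECRET_KEY_PATTERN` and `redact` are hypothetical, and a key-name denylist will miss secrets stored under innocuous keys. An allowlist of fields known to be safe would be the stricter option.

```python
import re

# Hypothetical denylist of key names that commonly hold credentials.
SECRET_KEY_PATTERN = re.compile(
    r"(token|secret|password|api[_-]?key|authorization)", re.IGNORECASE
)


def redact(value):
    """Recursively replace values stored under secret-looking keys."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if SECRET_KEY_PATTERN.search(str(k)) else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value
```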

### Phase 2: Auto-Resume Hook

**Areas likely affected**

* Gateway startup/bootstrap sequence
* Run scheduler/dispatcher
* Status reporting

**Core changes**

* On gateway start:
  * scan checkpoint store for `paused_for_restart`
  * claim checkpoint atomically
  * create a new run with:
    * same routing
    * same goal
    * seeded context summary
  * mark checkpoint `resumed` with `resumedRunId`
* Enforce at-most-once:
  * claim token plus persisted transition prevents double resume
  * handle concurrent startup paths safely

**Tests**

* unit tests for claim semantics and no double resume
* integration test:
  * create paused checkpoint
  * run startup resume hook
  * assert exactly one new run is created
  * assert checkpoint is marked resumed
* failure-mode tests:
  * corrupted checkpoint file → `failed_resume` with diagnostic
  * resume creation failure → `failed_resume` without infinite retry

### Failure Modes / Handling Checklist

* Checkpoint write fails → run marked failed and restart blocked or clearly reported
* Resume fails → checkpoint marked `failed_resume`; no follow-up run created
* Restart loop → checkpoint remains paused/resuming with bounded retry or manual intervention path
* Multiple paused runs → each checkpoint resumed independently with bounded concurrency
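
Independent resume with bounded concurrency might be sketched with a thread pool, as below. `resume_all`, the `resume_one` callback, and the default limit of 4 are assumptions, and each checkpoint is assumed to carry a unique `runId`. A failure in one checkpoint is recorded as `failed_resume` without affecting the others.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def resume_all(
    checkpoints: List[dict],
    resume_one: Callable[[dict], str],
    max_concurrency: int = 4,
) -> Dict[str, Tuple[str, str]]:
    """Resume checkpoints independently; map each original run ID to (status, detail)."""
    results = {}
    # max_workers caps how many resumes run at the same time.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {cp["runId"]: pool.submit(resume_one, cp) for cp in checkpoints}
        for run_id, future in futures.items():
            try:
                results[run_id] = ("resumed", future.result())
            except Exception as exc:
                results[run_id] = ("failed_resume", str(exc))
    return results
```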

### Deliberate Constraints for v1

* Not full memory rehydration
* Not restoring complete subagent graphs
* Resume is always a new run seeded from checkpoint
* No external scripts required


RFC: Task Continuation Across Gateway Restarts (Checkpoint + Auto-Resume) #60864