RFC: Task Continuation Across Gateway Restarts (Checkpoint + Auto-Resume)
Summary
OpenClaw currently lacks a standardized, first-class mechanism to continue in-progress tasks across gateway restarts. When a restart occurs, whether explicit or required due to config changes, runs are interrupted and may require manual recovery or external orchestration.
This RFC proposes a pragmatic v1 checkpoint + auto-resume mechanism:
- No full in-memory state rehydration
- No external scripts required
- Continuation implemented as a new follow-up run seeded from a persisted checkpoint
- Task continues automatically without requiring a new user prompt
Goals
- Persist sufficient state before restart to continue the same task
- Transition run state to paused_for_restart
- On startup, automatically resume exactly once (idempotent / at-most-once)
- Resume via a new run with:
  - same user goal
  - same plan progress (last completed + next step)
  - relevant intermediate context/artifacts
- No manual intervention required
- Explicit failure states, with no silent “running without progress”
Non-Goals (v1)
- Full RAM snapshot or exact process rehydration
- Perfect reproduction of all tool/subagent internal state
- Mandatory full chat transcript restoration
- External orchestration as a requirement
Current Limitation / Motivation
- Runs are interrupted by restart.
- There is no built-in continuation mechanism.
- External workarounds are required today.
- Runs may remain “running” without progress after restart, or require manual re-prompting to reconstruct context.
Proposed Approach
Checkpoint Lifecycle
A run that requires a restart should follow a standardized lifecycle:
running → paused_for_restart → resuming → resumed
An implementation may skip the resuming state internally if it is not needed, but the persisted state model should support it.
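The lifecycle can be expressed as a small transition table. The following is a minimal sketch, not the proposed implementation: the names RunStatus, ALLOWED, and transition are hypothetical, and the failed_resume branch is taken from the status list later in this RFC. It assumes a strict model in which resuming is never skipped.

```python
from enum import Enum


class RunStatus(str, Enum):
    RUNNING = "running"
    PAUSED_FOR_RESTART = "paused_for_restart"
    RESUMING = "resuming"
    RESUMED = "resumed"
    FAILED_RESUME = "failed_resume"


# Allowed transitions; terminal states map to an empty set.
ALLOWED = {
    RunStatus.RUNNING: {RunStatus.PAUSED_FOR_RESTART},
    RunStatus.PAUSED_FOR_RESTART: {RunStatus.RESUMING},
    RunStatus.RESUMING: {RunStatus.RESUMED, RunStatus.FAILED_RESUME},
    RunStatus.RESUMED: set(),
    RunStatus.FAILED_RESUME: set(),
}


def transition(current: RunStatus, target: RunStatus) -> RunStatus:
    """Return target if the move is legal, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table in one place means every code path that changes a run's status is checked against the same rules, which makes "no silent hangs" easier to test.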
Trigger Conditions
Checkpointing should occur for:
- Explicit restart, for example openclaw gateway restart
- Restart required due to config reload, for example “config change requires restart”
As soon as a restart becomes part of normal task execution, continuation should apply.
Checkpoint Content (minimal)
A checkpoint should be sufficient to resume the same task functionally, not as a full runtime snapshot.
Identity
- original run/task IDs
- origin/routing: channel/provider, peer/chat identifiers
- account ID where applicable
- timestamps
Task semantics
- userGoal for the original request
- plan plus cursor:
  - last completed step
  - next step
- relevant intermediate results/artifacts:
  - file diffs
  - computed outputs
  - decisions
- tool-context references, without secrets
Resume metadata
- resume reason: explicit restart vs restart-required-by-reload
- resume policy: at-most-once
Status
- paused_for_restart | resuming | resumed | failed_resume
- error diagnostics where applicable
Storage should be local, durable, and use atomic writes.
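To make the content list concrete, here is one possible shape for a checkpoint and a validator for it. This is a sketch under assumptions: only userGoal and the four status values come from this RFC. Every other field name (runId, routing, createdAt, plan, nextStep, resumeReason, resumePolicy) and the reason strings are hypothetical placeholders, since storage format is still an open question.

```python
from typing import Any, Dict, List

VALID_STATUSES = {"paused_for_restart", "resuming", "resumed", "failed_resume"}
VALID_REASONS = {"explicit_restart", "restart_required_by_reload"}  # hypothetical names

# Hypothetical required fields and their expected JSON types.
REQUIRED_FIELDS = {
    "runId": str,
    "routing": dict,
    "createdAt": str,
    "userGoal": str,
    "plan": list,
    "nextStep": int,
    "resumeReason": str,
    "resumePolicy": str,
    "status": str,
}


def validate_checkpoint(data: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the checkpoint is usable."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"wrong type for {name}")
    status = data.get("status")
    if isinstance(status, str) and status not in VALID_STATUSES:
        problems.append(f"unknown status: {status}")
    reason = data.get("resumeReason")
    if isinstance(reason, str) and reason not in VALID_REASONS:
        problems.append(f"unknown resume reason: {reason}")
    return problems


# Illustrative checkpoint; all values are made up.
example = {
    "runId": "run-123",
    "routing": {"channel": "example-channel", "peerId": "peer-1"},
    "createdAt": "2025-01-01T00:00:00Z",
    "userGoal": "Apply the config change and rerun the checks",
    "plan": ["edit config", "restart gateway", "rerun checks"],
    "nextStep": 2,
    "resumeReason": "restart_required_by_reload",
    "resumePolicy": "at_most_once",
    "status": "paused_for_restart",
}
```

Returning a list of problems instead of raising on the first one gives the failed_resume diagnostics something specific to report.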
Resume Mechanism
On gateway start:
- detect checkpoints in paused_for_restart
- atomically claim one for resume
- create a new follow-up run seeded with checkpoint context
- mark checkpoint as resumed, linking it to the new run ID
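The startup steps above could look roughly like the following. This is a sketch under assumptions, not OpenClaw's actual startup code: resume_paused, write_json_atomic, and the create_run callback are hypothetical, and checkpoints are assumed to be one JSON file each. The read-then-write claim shown here only protects against restart loops in a single startup path; it is not safe against two processes scanning concurrently.

```python
import json
import os
from pathlib import Path
from typing import Callable, List


def write_json_atomic(path: Path, data: dict) -> None:
    """Write to a sibling temp file, then rename over the target."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(data))
    os.replace(tmp, path)


def resume_paused(store: Path, create_run: Callable[[dict], str]) -> List[str]:
    """Resume every checkpoint in paused_for_restart; return the new run IDs."""
    new_run_ids = []
    for path in sorted(store.glob("*.json")):
        cp = json.loads(path.read_text())
        if cp.get("status") != "paused_for_restart":
            continue  # already claimed, resumed, or failed
        # Persist the claim before creating the run, so a crash between the
        # two steps leaves the checkpoint in resuming rather than re-resumable.
        cp["status"] = "resuming"
        write_json_atomic(path, cp)
        cp["resumedRunId"] = create_run(cp)
        cp["status"] = "resumed"
        write_json_atomic(path, cp)
        new_run_ids.append(cp["resumedRunId"])
    return new_run_ids
```

Calling resume_paused a second time on the same store returns an empty list, because no checkpoint is left in paused_for_restart.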
Idempotency / At-Most-Once
Requirements:
- no duplicate resume
- safe under restart loops
- atomic state transitions
Suggested model:
paused_for_restart → resuming(token) → resumed | failed_resume
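One way to get an atomic claim on a local filesystem is exclusive file creation. The sketch below assumes a separate claim file next to each checkpoint, which is a design choice made for illustration and not something this RFC specifies; try_claim and the .claim suffix are hypothetical.

```python
import os
import uuid
from pathlib import Path
from typing import Optional


def try_claim(checkpoint_path: Path) -> Optional[str]:
    """Return a fresh claim token, or None if the checkpoint is already claimed."""
    claim_path = checkpoint_path.with_suffix(".claim")
    token = uuid.uuid4().hex
    try:
        # O_EXCL makes creation fail if the file already exists,
        # so at most one caller can win the claim.
        fd = os.open(claim_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return None
    with os.fdopen(fd, "w") as f:
        f.write(token)
        f.flush()
        os.fsync(f.fileno())  # make the claim durable before acting on it
    return token
```

Because the claim file survives a crash, a gateway that dies mid-resume will not resume the same checkpoint again on the next start. Deciding when a stale claim may be released is left to the bounded-retry or manual-intervention path.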
Failure Handling
- checkpoint write failure → explicit error state, no silent restart
- resume failure → failed_resume with error details
- bounded retries only, no infinite loops
- multiple checkpoints resumed independently
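Bounded retries can be tracked on the checkpoint itself. In this sketch the attempts and error fields, the MAX_RESUME_ATTEMPTS value, and the create_run callback are all hypothetical. It also assumes that when create_run raises, no follow-up run was created, so returning to paused_for_restart does not risk a duplicate.

```python
from typing import Callable

MAX_RESUME_ATTEMPTS = 3  # hypothetical bound


def attempt_resume(cp: dict, create_run: Callable[[dict], str]) -> dict:
    """Try to resume once; record the outcome on the checkpoint and return it."""
    cp["attempts"] = cp.get("attempts", 0) + 1
    try:
        cp["resumedRunId"] = create_run(cp)
        cp["status"] = "resumed"
    except Exception as exc:
        cp["error"] = f"{type(exc).__name__}: {exc}"
        if cp["attempts"] >= MAX_RESUME_ATTEMPTS:
            cp["status"] = "failed_resume"  # terminal: no further retries
        else:
            cp["status"] = "paused_for_restart"  # eligible for another attempt
    return cp
```

Storing the attempt counter in the persisted checkpoint, not in memory, is what keeps the bound intact across a restart loop.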
UI / Messaging Behavior
- No additional user prompt should be required to resume.
- Optional minimal info: “Resumed after restart; continuing from step X…”
- No duplicate notifications and no debug dumps.
- If resume fails, show explicit status plus error summary.
Backwards Compatibility
- Feature should activate only when restart is required, either explicitly or internally detected.
- Existing behavior remains unchanged otherwise.
Acceptance Criteria
- A run that hits requires_restart is checkpointed before restart.
- The run transitions to paused_for_restart.
- After gateway start, resume occurs automatically and exactly once.
- Resume creates a new run that continues the same task with:
  - same goal
  - same plan progress, including last and next step
  - relevant context and artifacts
- No manual input is required.
- No double resume occurs.
- Clear failure states exist, with no silent hangs.
Open Questions
- checkpoint storage location and format
- definition of safe plan-step boundaries
- handling non-idempotent tool calls with side effects
- cross-backend consistency for subagents, ACP, and similar runtimes
PR Implementation Plan
Overview / Strategy
Implement this in two small, reviewable phases:
- Checkpoint persistence + status model
- Auto-resume on gateway startup
v1 should stay intentionally minimal: resume by creating a new follow-up run seeded from checkpoint data, not by trying to rehydrate in-memory runtime state.
Phase 1: Checkpoint + Status Model
Areas likely affected
- Gateway restart orchestration
- Restart-required-by-reload logic
- Run/task runtime state model
- Persistence/storage layer
Core changes
- Add explicit statuses:
  - paused_for_restart
  - resuming
  - resumed
  - failed_resume
- Define checkpoint schema, for example JSON, containing:
  - IDs and routing info
  - goal
  - plan cursor with last and next step
  - artifact summary such as paths and diff pointers
  - tool-context references without secrets
  - resume reason and policy
- Implement durable checkpoint store, for example under:
  - ~/.openclaw/state/checkpoints/*.json
- Use atomic write strategy:
  - write temp file
  - rename into place
- Add checkpoint creation hook:
  - when restart is required, serialize checkpoint for each impacted run
  - transition run state to paused_for_restart
  - fail clearly if checkpoint write fails
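The temp-file-then-rename strategy might be implemented along these lines. This is a sketch: write_checkpoint_atomic is a hypothetical helper, and the one-file-per-run naming is an assumption. The temp file is created in the same directory as the target so the rename stays on one filesystem, which is what makes it atomic.

```python
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint_atomic(store: Path, run_id: str, checkpoint: dict) -> Path:
    """Write checkpoint JSON so readers see either the old file or the new one."""
    store.mkdir(parents=True, exist_ok=True)
    final_path = store / f"{run_id}.json"
    fd, tmp_name = tempfile.mkstemp(dir=str(store), prefix=f".{run_id}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(checkpoint, f)
            f.flush()
            os.fsync(f.fileno())  # flush contents to disk before the rename
        os.replace(tmp_name, final_path)  # atomic within one filesystem
    except BaseException:
        # Clean up the temp file and let the caller see the failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path
```

Any exception propagates to the caller, which matches the requirement to fail clearly: the creation hook can refuse to mark the run paused_for_restart if this call raises.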
Tests
- unit tests for schema validation
- unit tests for atomic write and atomic claim
- unit tests for run status transitions
- integration-style test for restart-required flow creating checkpoint and paused state
Risks
- defining a stable plan cursor if planner state is currently too implicit
- accidentally persisting secrets from tool or environment context
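One way to reduce the secrets risk is to persist tool context through an allowlist, so that unknown keys are dropped by default. The key names below are purely illustrative assumptions; the real set would depend on what each tool needs in order to be referenced after a restart.

```python
from typing import Any, Dict

# Hypothetical allowlist: only these keys are ever written to a checkpoint.
ALLOWED_TOOL_CONTEXT_KEYS = {"tool", "ref", "path"}


def sanitize_tool_context(ctx: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only allowlisted keys; everything else, including secrets, is dropped."""
    return {k: v for k, v in ctx.items() if k in ALLOWED_TOOL_CONTEXT_KEYS}
```

An allowlist fails safe: a newly added context field stays out of the checkpoint until someone decides it is safe to store, whereas a denylist would persist it silently.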
Phase 2: Auto-Resume Hook
Areas likely affected
- Gateway startup/bootstrap sequence
- Run scheduler/dispatcher
- Status reporting
Core changes
- On gateway start:
  - scan checkpoint store for paused_for_restart
  - claim checkpoint atomically
  - create a new run with:
    - same routing
    - same goal
    - seeded context summary
  - mark checkpoint resumed with resumedRunId
- Enforce at-most-once:
  - claim token plus persisted transition prevents double resume
  - handle concurrent startup paths safely
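Seeding the follow-up run could be as simple as projecting the checkpoint into a run request. This is a sketch under assumptions: build_followup_run and the field names runId, routing, plan, nextStep, artifacts, and parentRunId are hypothetical, and nextStep is assumed to be a zero-based index into plan.

```python
from typing import Any, Dict


def build_followup_run(cp: Dict[str, Any]) -> Dict[str, Any]:
    """Build a new-run request carrying routing, goal, and a context summary."""
    plan = cp["plan"]
    cursor = cp["nextStep"]  # index of the first step that has not completed
    if cursor < len(plan):
        note = (
            f"Resumed after restart; continuing from step {cursor + 1} "
            f"of {len(plan)}: {plan[cursor]}"
        )
    else:
        note = "Resumed after restart; no plan steps remain."
    return {
        "parentRunId": cp["runId"],
        "routing": cp["routing"],
        "goal": cp["userGoal"],
        "context": {
            "completedSteps": plan[:cursor],
            "remainingSteps": plan[cursor:],
            "artifacts": cp.get("artifacts", []),
            "note": note,
        },
    }
```

The note doubles as the optional user-facing message, so the UI text and the seeded context cannot drift apart.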
Tests
- unit tests for claim semantics and no double resume
- integration test:
  - create paused checkpoint
  - run startup resume hook
  - assert exactly one new run is created
  - assert checkpoint is marked resumed
- failure-mode tests:
  - corrupted checkpoint file → failed_resume with diagnostic
  - resume creation failure → failed_resume without infinite retry
Failure Modes / Handling Checklist
- Checkpoint write fails → run marked failed and restart blocked or clearly reported
- Resume fails → checkpoint marked failed_resume; no follow-up run created
- Restart loop → checkpoint remains paused/resuming with bounded retry or manual intervention path
- Multiple paused runs → each checkpoint resumed independently with bounded concurrency
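Independent resumption with bounded concurrency can be sketched with a fixed-size worker pool. The names resume_all and resume_one, the runId key, and the default of four workers are hypothetical. The sketch assumes each checkpoint has already been claimed, so workers never compete for the same one.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def resume_all(
    checkpoints: List[dict],
    resume_one: Callable[[dict], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Resume each checkpoint independently; one failure does not stop the rest."""

    def safe(cp: dict) -> str:
        try:
            resume_one(cp)
            return "resumed"
        except Exception:
            return "failed_resume"

    # The pool size caps how many resumes run at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(safe, checkpoints))  # map preserves input order
    return {cp["runId"]: status for cp, status in zip(checkpoints, statuses)}
```

Catching the exception inside each worker keeps failures isolated, so a single corrupted checkpoint ends in failed_resume without blocking the others.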
Deliberate Constraints for v1
- Not full memory rehydration
- Not restoring complete subagent graphs
- Resume is always a new run seeded from checkpoint
- No external scripts required