Closed
Description
Problem
When a cron job fails due to a transient error (e.g., model provider rate limit cooldown, temporary network outage), the job is immediately set to enabled: false with no retry. This is especially problematic for one-shot jobs (schedule.kind: "at") since there is no next scheduled run — the job is effectively lost.
Observed behavior
- Cron job fires on schedule ✅
- Model provider returns rate limit (429) → OpenClaw enters cooldown
- Job state: lastStatus: "error", enabled: false
- One-shot job with deleteAfterRun: true → permanently disabled, never retries
Expected behavior
For transient/retryable errors (rate limit, network timeout, provider cooldown), the scheduler should:
- Automatically retry with exponential backoff (e.g., 1m → 2m → 5m → 10m)
- Only disable/fail permanently after max retries exhausted
- Distinguish between transient vs permanent errors (auth failure = permanent, rate limit = transient)
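The three points above can be sketched as a single failure handler. This is only an illustration of the requested behavior: the names ErrorKind, JobState, and onJobFailure are hypothetical and are not OpenClaw's actual API, and the backoff schedule simply reuses the 1m → 2m → 5m → 10m example.

```typescript
// Hypothetical types and names; none of this is taken from OpenClaw's code.
type ErrorKind = "rate_limit" | "network" | "auth" | "unknown";

interface JobState {
  enabled: boolean;
  lastStatus: "ok" | "error";
  attempt: number;       // number of failed attempts so far
  nextRunAtMs?: number;  // when the next retry is due, if any
}

// Assumption: only these kinds are treated as transient.
const TRANSIENT: ReadonlySet<ErrorKind> = new Set<ErrorKind>(["rate_limit", "network"]);

// Backoff schedule from the example above: 1m, 2m, 5m, 10m.
const BACKOFF_MS: number[] = [60000, 120000, 300000, 600000];

function onJobFailure(state: JobState, kind: ErrorKind, nowMs: number): JobState {
  const attempt = state.attempt + 1;
  const delay = BACKOFF_MS[attempt - 1];

  if (!TRANSIENT.has(kind) || delay === undefined) {
    // Permanent error, or the retry schedule is exhausted: disable the job.
    return { enabled: false, lastStatus: "error", attempt };
  }

  // Transient error with retries left: keep the job enabled and schedule a retry.
  return { enabled: true, lastStatus: "error", attempt, nextRunAtMs: nowMs + delay };
}
```

With this shape, a one-shot job that hits a 429 stays enabled with a nextRunAtMs one minute out, while an auth failure disables it immediately.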
Context
This is standard in most modern schedulers:
- AWS EventBridge: retry policy with up to 185 retries
- Kubernetes CronJob: backoffLimit for retry count
- Celery/Bull: built-in support for exponential backoff on retry
Suggestion
A possible configuration could look like:
{
  "retry": {
    "maxAttempts": 3,
    "backoffMs": [60000, 120000, 300000],
    "retryOn": ["rate_limit", "network"]
  }
}
Or a simpler global default: retry transient errors up to 3 times with exponential backoff.
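To make the semantics of the suggested config concrete, here is a rough sketch of how a scheduler might interpret it. The RetryConfig interface and retryDelayMs function are hypothetical names. It also assumes that maxAttempts counts retries (not the initial run) and that the last backoffMs entry is reused if the array is shorter than maxAttempts.

```typescript
// Hypothetical shape mirroring the "retry" object suggested above.
interface RetryConfig {
  maxAttempts: number;
  backoffMs: number[];
  retryOn: string[];
}

const exampleConfig: RetryConfig = {
  maxAttempts: 3,
  backoffMs: [60000, 120000, 300000],
  retryOn: ["rate_limit", "network"],
};

// Returns the delay before the next retry, or null if the job should not be retried.
// failedAttempts is the number of failures so far, including the one just observed.
function retryDelayMs(
  config: RetryConfig,
  errorKind: string,
  failedAttempts: number
): number | null {
  // Error kinds not listed in retryOn are treated as permanent.
  if (config.retryOn.indexOf(errorKind) === -1) return null;
  // Assumption: maxAttempts is the maximum number of retries.
  if (failedAttempts < 1 || failedAttempts > config.maxAttempts) return null;
  if (config.backoffMs.length === 0) return null;
  // Assumption: clamp to the last configured delay when the array is short.
  const index = Math.min(failedAttempts - 1, config.backoffMs.length - 1);
  const delay = config.backoffMs[index];
  return delay === undefined ? null : delay;
}
```

Under these assumptions, the first rate-limit failure yields a 60000 ms delay, the third yields 300000 ms, and a fourth failure returns null, at which point the job would be disabled.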
Environment
- OpenClaw 2026.2.15
- macOS (Darwin arm64)