Open
Labels
enhancement (New feature or request)
Description
Problem
OpenClaw cron jobs already apply an exponential backoff based on consecutive errors, but the backoff is reason-agnostic. In practice, different failure reasons should trigger different safety behavior:
- billing/quota exhausted (402 / insufficient_quota): continuing to retry is wasteful; jobs should stop (or switch model/provider if configured) and notify once.
- auth (401/403): retries are wasteful until credentials are fixed; jobs should stop and notify once.
- rate limit (429): retries should use Retry-After and/or exponential backoff; do not disable jobs by default.
- timeout/network: temporary; backoff but keep enabled.
- format (400): likely prompt/tooling bug; disable and surface actionable error.
This matters most for background/automated workloads (cron/heartbeat) where silent retry storms create unnecessary cost and log noise.
Current Behavior
- `src/cron/service/timer.ts` applies a fixed backoff schedule based on `consecutiveErrors` (30s→1m→5m→15m→60m), regardless of error type.
- Agents already have error classification via `FailoverReason` (billing/rate_limit/auth/timeout/format) in `src/agents/failover-error.ts`, but cron does not use it to select mitigation.
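For reference, the reason-agnostic schedule described above can be sketched roughly as follows. This is not the code in `src/cron/service/timer.ts`; the names `BACKOFF_SCHEDULE_MS` and `fixedBackoffMs` are hypothetical, and mapping the first error to the 30s step is an assumption.

```typescript
// Hypothetical sketch: the schedule values come from the issue text,
// the identifiers do not come from the OpenClaw codebase.
const BACKOFF_SCHEDULE_MS: readonly number[] = [
  30_000,      // 1st consecutive error: 30s (assumed mapping)
  60_000,      // 2nd: 1m
  5 * 60_000,  // 3rd: 5m
  15 * 60_000, // 4th: 15m
  60 * 60_000, // 5th and later: 60m
];

// Returns the delay before the next attempt. The error type is never
// consulted, which is the limitation this issue is about.
function fixedBackoffMs(consecutiveErrors: number): number {
  if (consecutiveErrors <= 0) return 0; // no errors, no extra delay
  const index = Math.min(consecutiveErrors, BACKOFF_SCHEDULE_MS.length) - 1;
  return BACKOFF_SCHEDULE_MS[index];
}
```

With this shape, a 402 and a network timeout both wait 30s after the first failure and both keep retrying indefinitely at 60m intervals.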
Proposed Behavior (Default Policy)
Introduce a reason-aware cron guard layer that classifies terminal errors and applies policy:
Classification
- Reuse/align with `FailoverReason` classification (`billing`, `rate_limit`, `auth`, `timeout`, `format`, `unknown`).
- Ensure tool-level failures (e.g. Brave 429) don’t get misclassified as billing (see Bug: Tool failures (e.g., Brave Search 429) incorrectly surfaced as 'billing error' in Telegram #14245).
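A minimal sketch of such a classifier, assuming the cron layer can see an HTTP status, an optional error code, and whether the failure came from the model provider or from a tool. The `CronErrorInfo` shape and `classifyCronError` are hypothetical; only the reason names come from the list above. A real implementation would presumably delegate to the existing helpers in `src/agents/failover-error.ts`.

```typescript
// Reason names mirror the list in the issue; everything else is hypothetical.
type FailoverReason =
  | "billing" | "rate_limit" | "auth" | "timeout" | "format" | "unknown";

interface CronErrorInfo {
  status?: number;             // HTTP status, if the error carried one
  code?: string;               // error code, e.g. "insufficient_quota"
  source: "provider" | "tool"; // where the failure originated (assumed field)
}

function classifyCronError(err: CronErrorInfo): FailoverReason {
  // Billing only applies to the model provider, so a tool's 402/429
  // can never be reported as a billing error.
  if (
    err.source === "provider" &&
    (err.status === 402 || err.code === "insufficient_quota")
  ) {
    return "billing";
  }
  if (err.status === 429) return "rate_limit";
  if (err.status === 401 || err.status === 403) return "auth";
  if (
    err.status === 408 || err.status === 504 ||
    err.code === "ETIMEDOUT" || err.code === "ECONNRESET"
  ) {
    return "timeout";
  }
  if (err.status === 400) return "format";
  return "unknown";
}
```

Checking the billing case before the generic 429 case matters because some providers report exhausted quota with a 429 status plus an `insufficient_quota` code.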
Mitigations
- billing/auth/format: circuit-break (disable job) after 1 failure (or small N), persist `lastError`, and optionally deliver a single user-facing alert (once per window).
- rate_limit: respect `Retry-After`/provider headers when available; apply exponential backoff but keep job enabled.
- timeout/network: apply backoff; keep enabled.
- unknown: conservative backoff; optionally circuit-break after higher threshold.
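The mitigation list above could be expressed as a pure decision function along these lines. This is a sketch under stated assumptions: `decideMitigation`, `parseRetryAfterMs`, the 30s base, the 60m cap, and the threshold of 10 for `unknown` are all illustrative and not taken from the codebase. The only external fact relied on is that `Retry-After` carries either a number of seconds or an HTTP-date.

```typescript
type FailoverReason =
  | "billing" | "rate_limit" | "auth" | "timeout" | "format" | "unknown";

interface Mitigation {
  disable: boolean;         // circuit-break the job
  retryInMs: number | null; // delay before the next attempt; null when disabled
  notify: boolean;          // an alert is warranted (dedup handled elsewhere)
}

interface MitigationInput {
  reason: FailoverReason;
  consecutiveErrors: number; // includes the current failure
  retryAfterHeader?: string; // raw Retry-After value, if any
  nowMs: number;
}

const BASE_MS = 30_000;               // assumed base delay
const MAX_MS = 60 * 60_000;           // assumed cap
const UNKNOWN_DISABLE_THRESHOLD = 10; // assumed "higher threshold"

// Retry-After is either delta-seconds ("120") or an HTTP-date.
function parseRetryAfterMs(value: string | undefined, nowMs: number): number | null {
  if (value === undefined) return null;
  const trimmed = value.trim();
  if (trimmed === "") return null;
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

function exponentialBackoffMs(consecutiveErrors: number): number {
  const exponent = Math.max(0, consecutiveErrors - 1);
  return Math.min(MAX_MS, BASE_MS * 2 ** exponent);
}

function decideMitigation(input: MitigationInput): Mitigation {
  const backoff = exponentialBackoffMs(input.consecutiveErrors);
  switch (input.reason) {
    case "billing":
    case "auth":
    case "format":
      // Retrying cannot succeed until a human intervenes.
      return { disable: true, retryInMs: null, notify: true };
    case "rate_limit": {
      const retryAfter = parseRetryAfterMs(input.retryAfterHeader, input.nowMs);
      // Never retry sooner than the provider asked.
      return { disable: false, retryInMs: Math.max(backoff, retryAfter ?? 0), notify: false };
    }
    case "timeout":
      return { disable: false, retryInMs: backoff, notify: false };
    case "unknown": {
      const disable = input.consecutiveErrors >= UNKNOWN_DISABLE_THRESHOLD;
      return { disable, retryInMs: disable ? null : backoff, notify: disable };
    }
  }
}
```

Keeping the decision free of side effects makes the default policy easy to unit-test and to override from configuration.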
Observability
- Record `lastErrorReason` in job state to aid debugging and UI visibility.
- Optional: add `cooldownUntilMs`/`disabledUntilMs` fields at the job level (separate from auth-profile cooldowns).
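One way the proposed fields might sit in job state, together with a small helper a UI or listing command could use to summarize a job. The `CronJobState` shape, the `describeJob` helper, and the exact semantics of `cooldownUntilMs` (temporary pause, job stays enabled) versus `disabledUntilMs` (time-bounded circuit-break) are assumptions for illustration; only the three field names come from the proposal.

```typescript
type FailoverReason =
  | "billing" | "rate_limit" | "auth" | "timeout" | "format" | "unknown";

// Hypothetical job state; only lastErrorReason, cooldownUntilMs and
// disabledUntilMs are named in the proposal.
interface CronJobState {
  enabled: boolean;
  consecutiveErrors: number;
  lastError?: string;
  lastErrorReason?: FailoverReason;
  cooldownUntilMs?: number; // assumed: backoff window, job remains enabled
  disabledUntilMs?: number; // assumed: circuit-break that expires on its own
}

type JobStatus = "ready" | "cooling_down" | "disabled";

interface JobSummary {
  status: JobStatus;
  lastErrorReason: FailoverReason | null;
}

function describeJob(state: CronJobState, nowMs: number): JobSummary {
  const lastErrorReason = state.lastErrorReason ?? null;
  const timedDisable =
    state.disabledUntilMs !== undefined && nowMs < state.disabledUntilMs;
  if (!state.enabled || timedDisable) {
    return { status: "disabled", lastErrorReason };
  }
  if (state.cooldownUntilMs !== undefined && nowMs < state.cooldownUntilMs) {
    return { status: "cooling_down", lastErrorReason };
  }
  return { status: "ready", lastErrorReason };
}
```

For example, a job that hit a 429 would show up as `cooling_down` with reason `rate_limit`, while a job stopped by a 402 would show as `disabled` with reason `billing`.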
Acceptance Criteria
- Cron avoids retry storms when quota is exhausted or auth is invalid.
- Rate limits back off appropriately without disabling jobs.
- Job state exposes the reason for the last error.
- Behavior is configurable but has safe defaults.
Related
- feat: Add automatic retry with exponential backoff for rate limits (429) #8894 (429 retry/backoff)
- Add retry logic for failed cron jobs #13609 (retry logic for failed cron jobs)
- Feature Request: Per-model usage logging for cost tracking #13219 (usage logging for cost tracking)
- [Feature]: Provider rate-limit / quota query tool #13923 (provider quota/rate limit query tool)
- [Bug]: Gateway sends empty messages on persistent API errors (429/402) instead of user-facing explanation #2202 (429/402 handling)
- Bug: Tool failures (e.g., Brave Search 429) incorrectly surfaced as 'billing error' in Telegram #14245 (misclassification of tool failures as billing)
Implementation Sketch
- Extend cron execution pipeline to capture/normalize error objects and classify via shared helper (or reuse `coerceToFailoverError`/`resolveFailoverReasonFromError`).
- Adjust `applyJobResult` in `src/cron/service/timer.ts` to compute nextRun / disable decision based on reason + counters.
- Persist reason into job state and expose via `cron.list`/UI.
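To make the state transitions concrete, here is a rough sketch of what a reason-aware result handler might look like, including the "alert once per window" behavior. It does not reflect the real signature of `applyJobResult`; `applyJobResultSketch`, `JobState`, `JobResult`, and `ApplyOptions` are hypothetical, and the retry delay is passed in rather than computed so the example stays focused on persistence and alert deduplication.

```typescript
type FailoverReason =
  | "billing" | "rate_limit" | "auth" | "timeout" | "format" | "unknown";

interface JobState {
  enabled: boolean;
  consecutiveErrors: number;
  nextRunAtMs: number;
  lastError?: string;
  lastErrorReason?: FailoverReason;
  lastAlertAtMs?: number; // assumed field used to deduplicate alerts
}

type JobResult =
  | { ok: true }
  | { ok: false; message: string; reason: FailoverReason };

interface ApplyOptions {
  nowMs: number;
  intervalMs: number;    // normal schedule interval
  backoffMs: number;     // delay for retryable failures, computed elsewhere
  alertWindowMs: number; // at most one alert per window
}

const CIRCUIT_BREAK_REASONS: ReadonlySet<FailoverReason> =
  new Set<FailoverReason>(["billing", "auth", "format"]);

function applyJobResultSketch(
  state: JobState,
  result: JobResult,
  opts: ApplyOptions,
): { state: JobState; alert: boolean } {
  if (result.ok) {
    // Success resets the error counter; lastError is kept for inspection.
    return {
      state: { ...state, consecutiveErrors: 0, nextRunAtMs: opts.nowMs + opts.intervalMs },
      alert: false,
    };
  }
  const shouldDisable = CIRCUIT_BREAK_REASONS.has(result.reason);
  const alertDue =
    state.lastAlertAtMs === undefined ||
    opts.nowMs - state.lastAlertAtMs >= opts.alertWindowMs;
  const alert = shouldDisable && alertDue;
  return {
    state: {
      ...state,
      enabled: shouldDisable ? false : state.enabled,
      consecutiveErrors: state.consecutiveErrors + 1,
      // A disabled job is not rescheduled; retryable failures back off.
      nextRunAtMs: shouldDisable ? state.nextRunAtMs : opts.nowMs + opts.backoffMs,
      lastError: result.message,
      lastErrorReason: result.reason,
      lastAlertAtMs: alert ? opts.nowMs : state.lastAlertAtMs,
    },
    alert,
  };
}
```

Returning the new state and the alert decision together leaves delivery to the caller, so the same function can back both the scheduler and tests.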