
Feature: Reason-aware cron guardrails (quota/auth/rate-limit aware backoff + circuit breaker) #14376

@futuremind2026

Description


Problem

OpenClaw cron jobs already apply an escalating backoff based on consecutive errors, but the backoff is reason-agnostic. In practice, different failure reasons call for different safety behavior:

  • billing/quota exhausted (402 / insufficient_quota): continuing to retry is wasteful; jobs should stop (or switch model/provider if configured) and notify once.
  • auth (401/403): retries are wasteful until credentials are fixed; jobs should stop and notify once.
  • rate limit (429): retries should honor Retry-After and/or use exponential backoff; do not disable jobs by default.
  • timeout/network: usually transient; back off but keep the job enabled.
  • format (400): likely a prompt or tooling bug; disable the job and surface an actionable error.
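The status-to-reason mapping in the list above can be sketched as a small classifier. This is a minimal sketch under stated assumptions: `CronErrorReason`, `NormalizedError`, and `classifyCronError` are hypothetical names, and the error shape is assumed to be already normalized. Billing is checked before 429 because some providers report an exhausted quota as a 429 carrying the code `insufficient_quota`.

```typescript
// Hypothetical reason union mirroring the categories listed above.
// The real FailoverReason in src/agents/failover-error.ts may be shaped differently.
type CronErrorReason =
  | "billing"
  | "auth"
  | "rate_limit"
  | "timeout"
  | "format"
  | "unknown";

// Hypothetical normalized error shape; real provider errors vary by SDK.
interface NormalizedError {
  status?: number; // HTTP status code, if the provider returned one
  code?: string; // provider or Node error code, e.g. "insufficient_quota", "ECONNRESET"
  name?: string; // error name, e.g. "TimeoutError" from AbortSignal.timeout()
}

// Node system error codes treated as transient network failures (assumed list).
const NETWORK_CODES = new Set(["ETIMEDOUT", "ECONNRESET", "ECONNREFUSED", "EAI_AGAIN"]);

// Maps an error to a reason. Billing is checked first so that a 429 with
// code "insufficient_quota" is not mistaken for an ordinary rate limit.
function classifyCronError(err: NormalizedError): CronErrorReason {
  if (err.status === 402 || err.code === "insufficient_quota") return "billing";
  if (err.status === 401 || err.status === 403) return "auth";
  if (err.status === 429) return "rate_limit";
  if (err.status === 400) return "format";
  if (
    err.status === 408 ||
    err.status === 504 ||
    err.name === "TimeoutError" ||
    (err.code !== undefined && NETWORK_CODES.has(err.code))
  ) {
    return "timeout"; // timeouts and transient network failures share one bucket
  }
  return "unknown";
}
```

In the real codebase this helper would more likely delegate to coerceToFailoverError/resolveFailoverReasonFromError rather than duplicate the mapping; it is written out here only to make the ordering concern visible.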

This matters most for background/automated workloads (cron/heartbeat) where silent retry storms create unnecessary cost and log noise.

Current Behavior

  • src/cron/service/timer.ts applies a fixed backoff schedule based on consecutiveErrors (30s→1m→5m→15m→60m), regardless of error type.
  • Agents already have error classification via FailoverReason (billing/rate_limit/auth/timeout/format) in src/agents/failover-error.ts, but cron does not use it to select a mitigation.
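To make the reason-agnostic behavior concrete, here is a sketch of a fixed schedule like the one described above. The constant and function names are hypothetical, not taken from timer.ts, and it assumes the last step (60m) repeats once the schedule is exhausted.

```typescript
// The fixed schedule described above: 30s, 1m, 5m, 15m, 60m.
// Names are illustrative; they are not taken from timer.ts.
const ERROR_BACKOFF_SCHEDULE_MS: readonly number[] = [
  30_000,
  60_000,
  5 * 60_000,
  15 * 60_000,
  60 * 60_000,
];

// Reason-agnostic: the delay depends only on how many errors occurred in a row.
function errorBackoffMs(consecutiveErrors: number): number {
  // Clamp into [0, length - 1] so the last step (60m) repeats indefinitely.
  const idx = Math.min(
    Math.max(consecutiveErrors - 1, 0),
    ERROR_BACKOFF_SCHEDULE_MS.length - 1,
  );
  return ERROR_BACKOFF_SCHEDULE_MS[idx]!; // idx is clamped, so this is defined
}
```

Under this assumption a job whose quota is exhausted gets exactly the same delays as one that hit a dropped connection, and would keep retrying hourly until someone intervenes.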

Proposed Behavior (Default Policy)

Introduce a reason-aware cron guard layer that classifies terminal errors and applies a per-reason policy:

Classification

Mitigations

  • billing/auth/format: circuit-break (disable the job) after 1 failure (or a small N), persist lastError, and optionally deliver a single user-facing alert (at most once per window).
  • rate_limit: respect Retry-After/provider headers when available; apply exponential backoff but keep job enabled.
  • timeout/network: apply backoff; keep enabled.
  • unknown: conservative backoff; optionally circuit-break after a higher threshold.
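The mitigation table above can be expressed as one pure decision function. This is a minimal sketch, assuming hypothetical names (`decideCronGuard`, `parseRetryAfterMs`, `DISABLE_AFTER`) and arbitrary thresholds; in particular the "higher threshold" of 10 for unknown errors is a placeholder. Retry-After is parsed per RFC 9110 as either delta-seconds or an HTTP-date.

```typescript
type CronErrorReason =
  | "billing" | "auth" | "rate_limit" | "timeout" | "format" | "unknown";

type GuardDecision =
  | { action: "disable"; reason: CronErrorReason }
  | { action: "backoff"; reason: CronErrorReason; delayMs: number };

// Hypothetical thresholds: circuit-break after N consecutive failures.
// Reasons absent from this map (rate_limit, timeout) are never disabled.
const DISABLE_AFTER: Partial<Record<CronErrorReason, number>> = {
  billing: 1,
  auth: 1,
  format: 1,
  unknown: 10, // placeholder for the "higher threshold"
};

const BASE_DELAY_MS = 30_000;
const MAX_DELAY_MS = 60 * 60_000;

// Retry-After is either delta-seconds or an HTTP-date (RFC 9110).
function parseRetryAfterMs(value: string | undefined, nowMs: number): number | undefined {
  if (value === undefined) return undefined;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? undefined : Math.max(dateMs - nowMs, 0);
}

function decideCronGuard(
  reason: CronErrorReason,
  consecutiveErrors: number, // including the failure being handled now
  retryAfter: string | undefined,
  nowMs: number,
): GuardDecision {
  const threshold = DISABLE_AFTER[reason];
  if (threshold !== undefined && consecutiveErrors >= threshold) {
    return { action: "disable", reason };
  }
  // Exponential backoff: 30s, 1m, 2m, 4m, ... capped at 60m.
  const expMs = Math.min(
    BASE_DELAY_MS * 2 ** Math.max(consecutiveErrors - 1, 0),
    MAX_DELAY_MS,
  );
  if (reason === "rate_limit") {
    // Never retry earlier than the provider asked, even past the cap.
    const hintMs = parseRetryAfterMs(retryAfter, nowMs);
    return { action: "backoff", reason, delayMs: Math.max(hintMs ?? 0, expMs) };
  }
  return { action: "backoff", reason, delayMs: expMs };
}
```

Taking the larger of the Retry-After hint and the exponential delay is one reasonable reading of "respect Retry-After": the provider's hint acts as a floor, while repeated 429s without a hint still escalate.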

Observability

  • Record lastErrorReason in job state to aid debugging and UI visibility.
  • Optional: add cooldownUntilMs/disabledUntilMs fields at the job level (separate from auth-profile cooldowns).
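A sketch of how the proposed fields might surface in a status line. `CronJobState` and `describeJobStatus` are hypothetical, and the semantics assumed here are that `disabledUntilMs` is a timed circuit-break while `cooldownUntilMs` is an ordinary backoff window.

```typescript
type CronErrorReason =
  | "billing" | "auth" | "rate_limit" | "timeout" | "format" | "unknown";

// Hypothetical slice of per-job state with the proposed observability fields.
interface CronJobState {
  enabled: boolean;
  consecutiveErrors: number;
  lastError?: string;
  lastErrorReason?: CronErrorReason;
  cooldownUntilMs?: number; // job-level backoff window (not an auth-profile cooldown)
  disabledUntilMs?: number; // timed circuit-break; absent means none
}

// A status string that a cron.list-style view could show next to each job.
function describeJobStatus(state: CronJobState, nowMs: number): string {
  const why = state.lastErrorReason ? ` (${state.lastErrorReason})` : "";
  if (!state.enabled) return `disabled${why}`;
  if (state.disabledUntilMs !== undefined && nowMs < state.disabledUntilMs) {
    return `disabled until ${new Date(state.disabledUntilMs).toISOString()}${why}`;
  }
  if (state.cooldownUntilMs !== undefined && nowMs < state.cooldownUntilMs) {
    return `cooling down until ${new Date(state.cooldownUntilMs).toISOString()}${why}`;
  }
  // Cooldown has expired but the last run still failed.
  if (state.consecutiveErrors > 0) return `retrying${why}`;
  return "ok";
}
```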

Acceptance Criteria

  • Cron avoids retry storms when quota is exhausted or auth is invalid.
  • Rate limits back off appropriately without disabling jobs.
  • Job state exposes the reason for the last error.
  • Behavior is configurable but has safe defaults.
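One way to satisfy "configurable but with safe defaults" is a per-reason default table merged with partial user overrides. The shape below (`ReasonPolicy`, `disableAfter`, `notifyOnDisable`, `resolveGuardConfig`) is hypothetical; the default values follow the proposed policy, with 10 as an arbitrary stand-in for the unknown-reason threshold.

```typescript
type CronErrorReason =
  | "billing" | "auth" | "rate_limit" | "timeout" | "format" | "unknown";

// Hypothetical per-reason policy knobs.
interface ReasonPolicy {
  disableAfter?: number; // consecutive failures before circuit-breaking; undefined = never
  notifyOnDisable: boolean; // send one user-facing alert when the breaker trips
}

type GuardConfig = Record<CronErrorReason, ReasonPolicy>;
type GuardOverrides = Partial<Record<CronErrorReason, Partial<ReasonPolicy>>>;

// Safe defaults following the proposal: stop on billing/auth/format,
// never auto-disable on rate limits or timeouts.
const DEFAULT_GUARD_CONFIG: GuardConfig = {
  billing: { disableAfter: 1, notifyOnDisable: true },
  auth: { disableAfter: 1, notifyOnDisable: true },
  format: { disableAfter: 1, notifyOnDisable: true },
  rate_limit: { notifyOnDisable: false },
  timeout: { notifyOnDisable: false },
  unknown: { disableAfter: 10, notifyOnDisable: true },
};

// Merges overrides over the defaults one reason at a time, so a partial
// override (e.g. only auth.disableAfter) keeps every other default intact.
// Each entry is a fresh object, so callers cannot mutate the defaults.
function resolveGuardConfig(overrides: GuardOverrides = {}): GuardConfig {
  const resolved = {} as GuardConfig;
  for (const reason of Object.keys(DEFAULT_GUARD_CONFIG) as CronErrorReason[]) {
    resolved[reason] = { ...DEFAULT_GUARD_CONFIG[reason], ...overrides[reason] };
  }
  return resolved;
}
```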

Related

Implementation Sketch

  • Extend the cron execution pipeline to capture and normalize error objects and classify them via a shared helper (or reuse coerceToFailoverError/resolveFailoverReasonFromError).
  • Adjust applyJobResult in src/cron/service/timer.ts to compute the next run time or the disable decision from the reason and error counters.
  • Persist the reason in job state and expose it via cron.list and the UI.
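A sketch of what a reason-aware counterpart to applyJobResult could look like. It is not the real timer.ts signature: `applyGuardedResult`, `RunOutcome`, and `DecideFn` are hypothetical, the function returns a new state rather than mutating, and it assumes the backoff delay is combined with the job's normal schedule by taking the later of the two.

```typescript
type CronErrorReason =
  | "billing" | "auth" | "rate_limit" | "timeout" | "format" | "unknown";

type GuardDecision =
  | { action: "disable"; reason: CronErrorReason }
  | { action: "backoff"; reason: CronErrorReason; delayMs: number };

// Hypothetical decision hook, e.g. a per-reason policy function.
type DecideFn = (reason: CronErrorReason, consecutiveErrors: number) => GuardDecision;

interface CronJobState {
  enabled: boolean;
  consecutiveErrors: number;
  nextRunAtMs?: number;
  lastError?: string;
  lastErrorReason?: CronErrorReason;
}

type RunOutcome =
  | { ok: true }
  | { ok: false; error: string; reason: CronErrorReason };

function applyGuardedResult(
  state: CronJobState,
  outcome: RunOutcome,
  nowMs: number,
  scheduledNextRunAtMs: number, // next run per the job's normal schedule
  decide: DecideFn,
): CronJobState {
  if (outcome.ok) {
    // Success resets the counter; lastErrorReason is kept for debugging.
    return { ...state, consecutiveErrors: 0, nextRunAtMs: scheduledNextRunAtMs };
  }
  const consecutiveErrors = state.consecutiveErrors + 1;
  const decision = decide(outcome.reason, consecutiveErrors);
  const failed: CronJobState = {
    ...state,
    consecutiveErrors,
    lastError: outcome.error,
    lastErrorReason: outcome.reason,
  };
  if (decision.action === "disable") {
    return { ...failed, enabled: false, nextRunAtMs: undefined };
  }
  // Back off, but never run earlier than the normal schedule would.
  return {
    ...failed,
    nextRunAtMs: Math.max(scheduledNextRunAtMs, nowMs + decision.delayMs),
  };
}
```

Passing the policy in as `decide` keeps the state transition testable on its own and lets the classification and threshold logic evolve separately.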
