fix(cron): retry recurring jobs on transient errors with backoff #46423
onesmash wants to merge 31 commits into openclaw:main from
Conversation
Force-pushed from b2dbc92 to daed538
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2dbc92e3a
```typescript
// backoff regardless of schedule interval. Without this, long-period jobs
// (hourly, daily) would wait until the next natural schedule time even
// when the error is self-healing (e.g. a brief network blip).
job.state.nextRunAtMs = backoffNext;
```
Respect preserveSchedule on forced transient failures
When applyJobResult is invoked from a manual force run (src/cron/service/ops.ts:479-490 passes preserveSchedule: true), transient errors now always write nextRunAtMs = backoffNext and ignore the preserved natural schedule. That means a forced run of a long-interval recurring every job can now trigger an unexpected automatic retry within ~30s after a network blip, whereas the prior behavior kept the original cadence/anchor for force runs.
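If forced runs should keep their cadence, one way to express the intended behavior is sketched below; `nextRunAfterTransientError` is a hypothetical helper, while `preserveSchedule`, `normalNext`, and `backoffNext` are names already used in this PR:

```typescript
// Sketch only: choose the next run time after a transient error.
// On a forced run with preserveSchedule, keep the natural schedule but never
// schedule earlier than the backoff floor; otherwise schedule the backoff retry.
function nextRunAfterTransientError(
  normalNext: number,
  backoffNext: number,
  preserveSchedule: boolean,
): number {
  return preserveSchedule ? Math.max(normalNext, backoffNext) : backoffNext;
}
```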
Force-pushed from daed538 to d3951f5
Greptile Summary

This PR correctly extends the transient-error retry logic (previously only for one-shot `at` jobs) to recurring `cron`/`every` jobs. Key issues found:
Confidence Score: 3/5
```diff
@@ -414,7 +414,8 @@ export function applyJobResult(
      }
    }
  } else if (result.status === "error" && job.enabled) {
    // Apply exponential backoff for errored jobs to prevent retry storms.
    const retryConfig = resolveRetryConfig(state.deps.cronConfig);
    const transient = isTransientCronError(result.error, retryConfig.retryOn);
    const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1);
```
retryConfig.backoffMs fetched but silently ignored
`resolveRetryConfig` is called to get `retryConfig`, and `retryConfig.retryOn` is correctly forwarded to `isTransientCronError`. However, `retryConfig.backoffMs` is never passed to `errorBackoffMs`, so any per-deployment backoff schedule configured via `cronConfig.retry.backoffMs` is silently ignored for recurring jobs. The one-shot path at line 385 correctly forwards `retryConfig.backoffMs`:

```typescript
const backoff = errorBackoffMs(consecutive, retryConfig.backoffMs);
```

The recurring path should do the same:

```suggestion
const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1, retryConfig.backoffMs);
```
Without this, cronConfig.retry.backoffMs only takes effect for at jobs, not cron/every jobs, which is inconsistent and likely unintended.
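For reference, the backoff helper behaves like a ladder lookup throughout this thread (1st error 30s, 2nd 60s, 3rd 5m on the default schedule). A minimal sketch of that assumed shape, not the project's actual implementation:

```typescript
// Default ladder as described in this PR: 30s, 60s, 5m, 15m, 60m.
const DEFAULT_BACKOFF_SCHEDULE_MS = [30_000, 60_000, 300_000, 900_000, 3_600_000];

// Assumed shape: the Nth consecutive error selects the Nth schedule entry,
// clamped to the last entry once the schedule is exhausted.
function errorBackoffMs(
  consecutiveErrors: number,
  schedule: readonly number[] = DEFAULT_BACKOFF_SCHEDULE_MS,
): number {
  const idx = Math.min(Math.max(consecutiveErrors, 1), schedule.length) - 1;
  return schedule[idx];
}

errorBackoffMs(1);                    // 30_000
errorBackoffMs(3, [60_000, 300_000]); // 300_000 (clamped to the last entry)
```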
Addressed in `7a98b18f5`: recurring error path now takes backoff schedule from `resolveRetryConfig(state.deps.cronConfig).backoffMs` when `cron.retry.backoffMs` is set (same reference as one-shot). Full default ladder still used when unset. Also merged latest `openclaw/main` into this branch to refresh CI.
| it("uses increasing backoff on consecutive transient errors for a daily job", () => { | ||
| const state = makeState(); | ||
|
|
||
| const runError = (consecutiveErrors: number) => { | ||
| const job = makeCronJob("0 10 * * *"); | ||
| job.state.consecutiveErrors = consecutiveErrors; | ||
| applyJobResult(state, job, { | ||
| status: "error", | ||
| error: "econnreset", | ||
| startedAt: NOW_MS, | ||
| endedAt: ENDED_AT, | ||
| }); | ||
| return job.state.nextRunAtMs!; | ||
| }; | ||
|
|
||
| const first = runError(0); // 1st error → 30s backoff | ||
| const second = runError(1); // 2nd error → 60s backoff | ||
| const third = runError(2); // 3rd error → 5min backoff | ||
|
|
||
| expect(second).toBeGreaterThan(first); | ||
| expect(third).toBeGreaterThan(second); | ||
| // All well within 5 minutes, not waiting 24 hours | ||
| expect(third).toBeLessThanOrEqual(ENDED_AT + 5 * 60_000); | ||
| }); |
Backoff escalation test creates fresh job instances instead of reusing one
Each runError(n) call creates a fresh CronJob and manually pre-sets consecutiveErrors before calling applyJobResult. This means the test does not exercise the actual counter accumulation path across repeated failures on the same job — it only verifies that errorBackoffMs returns a larger value for larger inputs.
A more realistic regression test would apply errors to the same job instance in sequence:
it("uses increasing backoff on consecutive transient errors for a daily job", () => {
const state = makeState();
const job = makeCronJob("0 10 * * *");
const recordNext = () => {
applyJobResult(state, job, {
status: "error",
error: "econnreset",
startedAt: NOW_MS,
endedAt: ENDED_AT,
});
return job.state.nextRunAtMs!;
};
const first = recordNext(); // consecutiveErrors → 1, 30s backoff
const second = recordNext(); // consecutiveErrors → 2, 60s backoff
const third = recordNext(); // consecutiveErrors → 3, 5min backoff
expect(second).toBeGreaterThan(first);
expect(third).toBeGreaterThan(second);
expect(third).toBeLessThanOrEqual(ENDED_AT + 5 * 60_000);
});This also implicitly validates that consecutiveErrors increments correctly rather than relying on manually supplied values.
💡 Codex Review
Reviewed commit: 1305d8fd3a
```diff
- const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1);
  const retryConfig = resolveRetryConfig(state.deps.cronConfig);
  const transient = isTransientCronError(result.error, retryConfig.retryOn);
+ const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1, retryConfig.backoffMs);
```
Keep recurring retries on full backoff ladder
Passing retryConfig.backoffMs into the recurring error path changes the default ladder for recurring jobs from [30s, 60s, 5m, 15m, 60m] to only the first three entries, because resolveRetryConfig falls back to DEFAULT_BACKOFF_SCHEDULE_MS.slice(0, 3). After the 3rd consecutive failure, recurring jobs now retry every 5 minutes instead of backing off to 15/60 minutes, which increases retry pressure for short-interval jobs and weakens the existing retry-storm protection.
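To make the regression concrete (ladder values are taken from the comment above; the names match those used elsewhere in this thread):

```typescript
// Full recurring ladder vs. the one-shot fallback produced by resolveRetryConfig.
const DEFAULT_BACKOFF_SCHEDULE_MS = [30_000, 60_000, 300_000, 900_000, 3_600_000]; // 30s, 60s, 5m, 15m, 60m
const oneShotFallback = DEFAULT_BACKOFF_SCHEDULE_MS.slice(0, 3);                   // 30s, 60s, 5m

// After the 3rd consecutive failure, a recurring job on the fallback schedule keeps
// retrying every 5 minutes instead of backing off to 15 and then 60 minutes.
```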
💡 Codex Review
Reviewed commit: 3fdfd5a35e
```typescript
const backoffSchedule =
  Array.isArray(state.deps.cronConfig?.retry?.backoffMs) &&
  state.deps.cronConfig.retry.backoffMs.length > 0
    ? state.deps.cronConfig.retry.backoffMs
    : DEFAULT_BACKOFF_SCHEDULE_MS;
```
Honor custom retry backoff during startup catch-up
Using cronConfig.retry.backoffMs here changes recurring jobs’ persisted nextRunAtMs, but the restart gate in isErrorBackoffPending still computes backoff with the default ladder (errorBackoffMs(consecutiveErrors)), so the two paths diverge. In a configured environment with a longer custom backoff (for example 2h), a gateway restart after 30s+ can treat backoff as expired and planStartupCatchup (with allowCronMissedRunByLastRun) may run the cron job before the configured cooldown and before its stored nextRunAtMs, which breaks operator-configured retry pacing after restarts.
Merged latest
💡 Codex Review
Reviewed commit: d72ffc41e8
```typescript
const retryConfig = resolveRetryConfig(state.deps.cronConfig);
const transient = isTransientCronError(result.error, retryConfig.retryOn);
```
Decouple recurring transient checks from one-shot retryOn config
The recurring error path now calls isTransientCronError(..., retryConfig.retryOn), so any custom cron.retry.retryOn setting will also gate recurring retries. This is a regression for deployments that tuned cron.retry for one-shot jobs only (as documented in src/config/schema.help.ts:1199), because a config like retryOn: ["rate_limit"] will make recurring network/timeout failures skip the new backoff retry behavior and wait until the next natural schedule again. Please keep recurring transient classification independent of one-shot retry config (or update the config contract everywhere).
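For illustration, a config of the kind this comment worries about might look like the following (the object shape is assumed; only the `cron.retry.retryOn` / `backoffMs` keys are taken from this thread):

```typescript
// Hypothetical operator config tuned for one-shot (`at`) retries only.
const cronConfig = {
  retry: {
    retryOn: ["rate_limit"],      // narrow one-shot transient classes
    backoffMs: [60_000, 300_000], // custom one-shot backoff schedule
  },
};

// If the recurring path reuses retryConfig.retryOn, a recurring job failing with
// "econnreset" is no longer classified as transient and waits for its next natural run.
```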
💡 Codex Review
Reviewed commit: 7a98b18f54
```typescript
const recurringBackoffMs =
  Array.isArray(state.deps.cronConfig?.retry?.backoffMs) &&
  state.deps.cronConfig.retry.backoffMs.length > 0
    ? retryConfig.backoffMs
    : DEFAULT_BACKOFF_SCHEDULE_MS;
```
Align restart backoff gating with recurring backoff config
Using cron.retry.backoffMs for recurring errors here changes persisted nextRunAtMs, but startup replay gating still uses isErrorBackoffPending with the default ladder only. In a config like cron.retry.backoffMs: [7200000], a cron job that failed transiently can be replayed on restart (via missed-slot logic) well before its configured 2-hour cooldown and before its stored nextRunAtMs, because the gate considers backoff elapsed after the default 30s. This breaks operator-configured retry pacing specifically after restarts.
Addressed in 52a71a93a: startup missed-slot gating now reuses the same recurring backoff schedule as applyJobResult, so custom cron.retry.backoffMs prevents premature replay after restart. Added a restart-catchup regression test for the custom-backoff case, and targeted cron tests passed.
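A rough sketch of the alignment described here; `isErrorBackoffPending` is named in the review above, but the parameterization and the `lastRunEndedAtMs` field are assumptions for illustration:

```typescript
// Sketch: the restart gate computes its backoff window from the same schedule the
// recurring error path used when it persisted nextRunAtMs, not the default ladder.
function isErrorBackoffPending(
  state: { consecutiveErrors?: number; lastRunEndedAtMs?: number },
  nowMs: number,
  backoffScheduleMs: readonly number[],
): boolean {
  const consecutive = state.consecutiveErrors ?? 0;
  if (consecutive === 0 || backoffScheduleMs.length === 0) return false;
  const backoff = backoffScheduleMs[Math.min(consecutive, backoffScheduleMs.length) - 1];
  return nowMs < (state.lastRunEndedAtMs ?? 0) + backoff;
}

// With cron.retry.backoffMs: [7_200_000], a restart 30s after a transient failure
// still reports the backoff as pending, so catch-up does not replay the job early.
```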
Scheduled maintenance check (2026-03-19)

Code review items: Greptile
CI: the failures in the latest run are mainly
Suggest re-running the failed jobs on the upstream side, or waiting until main is stable and checking whether they are still red. No pending changes remain on the PR logic side.
Scheduled maintenance (2026-03-19)

HEAD:
CI (run
Codex P2 (
Merge: the PR is
Scheduled maintenance (2026-03-27)

HEAD:
CI (run
To-do: the PR currently has no pending code changes,
Force-pushed from 52a71a9 to 30aa07f
💡 Codex Review
Reviewed commit: 30aa07f73c
```typescript
const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1);
// Recurring jobs: use default transient patterns only. cron.retry.retryOn is documented
// for one-shot retries; narrowing it must not disable recurring network/timeout retries.
const transient = isTransientCronError(result.error);
```
Classify built-in timeout errors as transient
The new recurring retry branch now depends on isTransientCronError(result.error), but the scheduler’s own timeout failures are emitted as "cron: job execution timed out" (from timeoutErrorMessage()), which does not match the current timeout transient pattern (timeout|etimedout). In practice, timed-out recurring jobs (for example a daily job that hits the execution timeout) are treated as permanent errors and wait for the next natural schedule instead of taking the intended backoff retry path, so the timeout case in this fix is not actually covered.
Addressed in 18f0891690: updated `TRANSIENT_PATTERNS.timeout` to `/(timeout|timed\s+out|etimedout)/i` so "cron: job execution timed out" from `timeoutErrorMessage()` is correctly classified as transient. Added a regression test for the exact internal message.
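The widened pattern can be sanity-checked in isolation (the regex is quoted from the comment above; the constant name below is illustrative, not the project's actual module):

```typescript
// Timeout pattern now also matches the scheduler's own "timed out" wording.
const timeoutTransientPattern = /(timeout|timed\s+out|etimedout)/i;

console.log(timeoutTransientPattern.test("cron: job execution timed out")); // true
console.log(timeoutTransientPattern.test("ETIMEDOUT"));                     // true
console.log(timeoutTransientPattern.test("permission denied"));             // false
```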
- When `applyJobResult` is invoked with `preserveSchedule` (e.g. a manual force run), transient errors now keep the preserved natural schedule with a backoff floor instead of always scheduling an immediate backoff retry. Made-with: Cursor
- …backoffMs: respect `cronConfig.retry.backoffMs` when set; otherwise use `DEFAULT_BACKOFF_SCHEDULE_MS` (full ladder) for the recurring error path to preserve retry-storm protection (15m/60m), instead of resolving via `resolveRetryConfig`, which defaults to `slice(0, 3)`. Made-with: Cursor
- Align the recurring path with one-shot: pass `retryConfig.backoffMs` to `errorBackoffMs` so `cronConfig.retry.backoffMs` is respected for `cron`/`every` jobs (addresses review feedback). Made-with: Cursor
Thanks for the PR. This branch appears to include a large replay of unrelated main/release commits, so it is not reviewable as a focused Telegram change. Please reopen as a narrow PR with only the intended fix. |
Summary

Recurring (`cron`/`every`) jobs now detect transient errors (network, timeout, rate-limit, 5xx) and retry with exponential backoff, instead of waiting until the next natural schedule time.

Root Cause

One-shot (`at`) jobs already had `isTransientCronError` detection (added in #24355), but the recurring job path at `src/cron/service/timer.ts:416` was missing it. For a daily job with a network error, `Math.max(normalNext=+24h, backoffNext=+30s)` always picked the 24-hour schedule, making backoff completely ineffective.

Behavior Change
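A small illustration with the numbers from the root-cause note (variable names follow that note; the real code paths differ):

```typescript
const nowMs = Date.now();
const normalNext = nowMs + 24 * 60 * 60_000; // next natural run of a daily job
const backoffNext = nowMs + 30_000;          // 30s transient-error backoff

// Before: the recurring path always kept the natural schedule, so backoff never applied.
const oldNextRunAtMs = Math.max(normalNext, backoffNext); // ~+24h

// After: a transient error schedules the backoff retry instead of waiting a full day.
const newNextRunAtMs = backoffNext; // ~+30s
```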
Test Plan

- `src/cron/service.issue-46341-recurring-transient-retry.test.ts`: 5 new tests covering transient retry, permanent-error schedule, and backoff escalation
- `pnpm test` passes

Closes #46341