
fix(cron): retry recurring jobs on transient errors with backoff#46423

Closed

onesmash wants to merge 31 commits into openclaw:main from onesmash:fix/recurring-cron-transient-retry

Conversation

@onesmash

Summary

Root Cause

One-shot (at) jobs already had isTransientCronError detection (added in #24355), but the recurring job path at src/cron/service/timer.ts:416 was missing it. For a daily job with a network error, Math.max(normalNext=+24h, backoffNext=+30s) always picked the 24-hour schedule, making backoff completely ineffective.
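To make the failure mode concrete, here is a self-contained toy model of the scheduling decision. The names below (`normalNext`, `backoffNext`) are illustrative, not the literal `applyJobResult` code:

```typescript
// Toy model of the bug: a daily job fails with a transient error.
const HOUR_MS = 3_600_000;
const endedAt = Date.now();

const normalNext = endedAt + 24 * HOUR_MS; // next natural daily slot
const backoffNext = endedAt + 30_000;      // first rung of the backoff ladder

// Before the fix: the natural slot wins whenever the schedule period exceeds
// the backoff, so the 30s retry never fires for hourly/daily jobs.
const before = Math.max(normalNext, backoffNext); // === normalNext (+24h)

// After the fix: a transient error schedules the backoff retry directly.
const transient = true;
const after = transient ? backoffNext : normalNext; // endedAt + 30s

console.log({ before, after });
```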

Behavior Change

| Error type | Before | After |
| --- | --- | --- |
| Transient (network/timeout/5xx/rate-limit) | Wait for next natural schedule | Retry with backoff (30s → 1m → 5m → 15m → 60m) |
| Permanent | Wait for next natural schedule | Wait for next natural schedule (unchanged) |
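The ladder in the table can be modeled as indexing into a schedule array clamped at the last rung. This is an illustrative sketch under that assumption, not the repo's actual `errorBackoffMs`:

```typescript
// Assumed default ladder matching the table: 30s → 1m → 5m → 15m → 60m.
const DEFAULT_BACKOFF_SCHEDULE_MS = [30_000, 60_000, 300_000, 900_000, 3_600_000];

// Illustrative errorBackoffMs: pick the rung for the Nth consecutive error,
// clamping at the last entry so backoff never exceeds 60m.
function errorBackoffMs(
  consecutiveErrors: number,
  schedule: number[] = DEFAULT_BACKOFF_SCHEDULE_MS,
): number {
  const idx = Math.min(Math.max(consecutiveErrors, 1), schedule.length) - 1;
  return schedule[idx];
}

console.log(errorBackoffMs(1)); // 30000   (first failure: 30s)
console.log(errorBackoffMs(6)); // 3600000 (clamped at 60m)
```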

Test Plan

  • src/cron/service.issue-46341-recurring-transient-retry.test.ts — 5 new tests covering transient retry, permanent-error scheduling, and backoff escalation
  • pnpm test passes

Closes #46341

@onesmash force-pushed the fix/recurring-cron-transient-retry branch from b2dbc92 to daed538 on March 14, 2026 17:22

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2dbc92e3a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/cron/service/timer.ts (outdated)
```typescript
// backoff regardless of schedule interval. Without this, long-period jobs
// (hourly, daily) would wait until the next natural schedule time even
// when the error is self-healing (e.g. a brief network blip).
job.state.nextRunAtMs = backoffNext;
```

P2: Respect preserveSchedule on forced transient failures

When applyJobResult is invoked from a manual force run (src/cron/service/ops.ts:479-490 passes preserveSchedule: true), transient errors now always write nextRunAtMs = backoffNext and ignore the preserved natural schedule. That means a forced run of a long-interval recurring `every` job can now trigger an unexpected automatic retry within ~30s after a network blip, whereas the prior behavior kept the original cadence/anchor for force runs.
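The guard the author later describes keeps the preserved schedule as a floor. A minimal sketch of that shape, with hypothetical local names rather than the literal timer.ts code:

```typescript
// Forced runs (preserveSchedule) keep their natural cadence as a floor, so a
// transient failure cannot pull a long-interval job into a ~30s retry loop.
function nextRunOnTransientError(
  backoffNext: number,
  normalNext: number | undefined,
  preserveSchedule: boolean,
): number {
  if (preserveSchedule && normalNext !== undefined) {
    return Math.max(normalNext, backoffNext); // preserved cadence wins
  }
  return backoffNext; // normal path: retry on the backoff ladder
}
```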


@onesmash force-pushed the fix/recurring-cron-transient-retry branch from daed538 to d3951f5 on March 14, 2026 17:22
@greptile-apps
Contributor

greptile-apps (bot) commented Mar 14, 2026

Greptile Summary

This PR correctly extends the transient-error retry logic (previously only for at one-shot jobs) to recurring cron/every jobs. The root cause — Math.max(normalNext, backoffNext) always favouring the natural schedule for long-period jobs — is accurately diagnosed and fixed.

Key issues found:

  • retryConfig.backoffMs silently ignored for recurring jobs (timer.ts:419): resolveRetryConfig is called and its retryOn list is forwarded to isTransientCronError, but retryConfig.backoffMs is never passed to errorBackoffMs. The one-shot path correctly passes retryConfig.backoffMs; the recurring path always falls back to DEFAULT_BACKOFF_SCHEDULE_MS, making cronConfig.retry.backoffMs a no-op for cron/every jobs.

  • New transient variable not included in log statement (timer.ts:444–453): The transient flag is the key branching signal but is absent from the log.info call, reducing observability. A pre-existing duplicate consecutiveErrors key in the same log object was also left in place and should be replaced with transient.

  • Backoff escalation test uses fresh job instances (test line 145–168): Each call to runError pre-seeds consecutiveErrors on a new job rather than letting errors accumulate on the same instance, so the counter accumulation code path is not exercised by this test.

Confidence Score: 3/5

  • The fix is directionally correct but has a configuration bug (retryConfig.backoffMs ignored) that could affect deployments relying on custom backoff schedules.
  • The core logic is sound and the failure mode (transient errors on long-period recurring jobs not retrying with backoff) is clearly fixed. However, retryConfig.backoffMs being fetched but not forwarded means any operator-configured backoff schedule silently has no effect on recurring jobs — a real functional gap. The missing transient log field is a lower-severity observability gap. These issues are confined to the error-handling path and are unlikely to cause data loss, but the configuration inconsistency warrants a fix before merging.
  • Pay close attention to src/cron/service/timer.ts around the errorBackoffMs call in the recurring error branch (line 419).

Comments Outside Diff (1)

  1. src/cron/service/timer.ts, lines 444-453

    transient computed but not logged; duplicate key in log object

    The PR introduces the transient variable, which is now the key branching signal, but it is not included in the log.info call. This makes it hard to distinguish transient-path retries from permanent-path retries in production logs. Additionally, the pre-existing log object already has a duplicate consecutiveErrors key (lines 447–448) — one of those was almost certainly intended to be transient.

    Suggested fix:

```typescript
      state.deps.log.info(
        {
          jobId: job.id,
          jobName: job.name,
          transient,
          consecutiveErrors: job.state.consecutiveErrors,
          backoffMs: backoff,
          nextRunAtMs: job.state.nextRunAtMs,
        },
        "cron: applying error backoff",
      );
```
Last reviewed commit: daed538

Comment thread src/cron/service/timer.ts (outdated)
Diff hunk (@@ -414,7 +414,8 @@ export function applyJobResult):

```typescript
    }
  }
} else if (result.status === "error" && job.enabled) {
  // Apply exponential backoff for errored jobs to prevent retry storms.
  const retryConfig = resolveRetryConfig(state.deps.cronConfig);
  const transient = isTransientCronError(result.error, retryConfig.retryOn);
  const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1);
```
Contributor

retryConfig.backoffMs fetched but silently ignored

resolveRetryConfig is called to get retryConfig, and retryConfig.retryOn is correctly forwarded to isTransientCronError. However, retryConfig.backoffMs is never passed to errorBackoffMs, so any per-deployment backoff schedule configured via cronConfig.retry.backoffMs is silently ignored for recurring jobs. The one-shot path at line 385 correctly forwards retryConfig.backoffMs:

```typescript
const backoff = errorBackoffMs(consecutive, retryConfig.backoffMs);
```

The recurring path should do the same:

Suggested change:

```diff
-const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1);
+const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1, retryConfig.backoffMs);
```

Without this, cronConfig.retry.backoffMs only takes effect for at jobs, not cron/every jobs, which is inconsistent and likely unintended.

Author

Addressed in `7a98b18f5`: recurring error path now takes backoff schedule from `resolveRetryConfig(state.deps.cronConfig).backoffMs` when `cron.retry.backoffMs` is set (same reference as one-shot). Full default ladder still used when unset. Also merged latest `openclaw/main` into this branch to refresh CI.

Comment thread src/cron/service.issue-46341-recurring-transient-retry.test.ts, lines +145 to +168
it("uses increasing backoff on consecutive transient errors for a daily job", () => {
const state = makeState();

const runError = (consecutiveErrors: number) => {
const job = makeCronJob("0 10 * * *");
job.state.consecutiveErrors = consecutiveErrors;
applyJobResult(state, job, {
status: "error",
error: "econnreset",
startedAt: NOW_MS,
endedAt: ENDED_AT,
});
return job.state.nextRunAtMs!;
};

const first = runError(0); // 1st error → 30s backoff
const second = runError(1); // 2nd error → 60s backoff
const third = runError(2); // 3rd error → 5min backoff

expect(second).toBeGreaterThan(first);
expect(third).toBeGreaterThan(second);
// All well within 5 minutes, not waiting 24 hours
expect(third).toBeLessThanOrEqual(ENDED_AT + 5 * 60_000);
});
Contributor

Backoff escalation test creates fresh job instances instead of reusing one

Each runError(n) call creates a fresh CronJob and manually pre-sets consecutiveErrors before calling applyJobResult. This means the test does not exercise the actual counter accumulation path across repeated failures on the same job — it only verifies that errorBackoffMs returns a larger value for larger inputs.

A more realistic regression test would apply errors to the same job instance in sequence:

it("uses increasing backoff on consecutive transient errors for a daily job", () => {
  const state = makeState();
  const job = makeCronJob("0 10 * * *");

  const recordNext = () => {
    applyJobResult(state, job, {
      status: "error",
      error: "econnreset",
      startedAt: NOW_MS,
      endedAt: ENDED_AT,
    });
    return job.state.nextRunAtMs!;
  };

  const first = recordNext();   // consecutiveErrors → 1, 30s backoff
  const second = recordNext();  // consecutiveErrors → 2, 60s backoff
  const third = recordNext();   // consecutiveErrors → 3, 5min backoff

  expect(second).toBeGreaterThan(first);
  expect(third).toBeGreaterThan(second);
  expect(third).toBeLessThanOrEqual(ENDED_AT + 5 * 60_000);
});

This also implicitly validates that consecutiveErrors increments correctly rather than relying on manually supplied values.


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1305d8fd3a


Comment thread src/cron/service/timer.ts (outdated)
```diff
-const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1);
+const retryConfig = resolveRetryConfig(state.deps.cronConfig);
+const transient = isTransientCronError(result.error, retryConfig.retryOn);
+const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1, retryConfig.backoffMs);
```

P2: Keep recurring retries on full backoff ladder

Passing retryConfig.backoffMs into the recurring error path changes the default ladder for recurring jobs from [30s, 60s, 5m, 15m, 60m] to only the first three entries, because resolveRetryConfig falls back to DEFAULT_BACKOFF_SCHEDULE_MS.slice(0, 3). After the 3rd consecutive failure, recurring jobs now retry every 5 minutes instead of backing off to 15/60 minutes, which increases retry pressure for short-interval jobs and weakens the existing retry-storm protection.
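A toy illustration of the truncation described above, assuming resolveRetryConfig defaults to the first three rungs as the comment states:

```typescript
const DEFAULT_BACKOFF_SCHEDULE_MS = [30_000, 60_000, 300_000, 900_000, 3_600_000];

// Per the comment, resolveRetryConfig falls back to the first three rungs:
const oneShotDefault = DEFAULT_BACKOFF_SCHEDULE_MS.slice(0, 3); // [30s, 1m, 5m]

// Feeding that into the recurring path caps every later failure at 5 minutes,
// dropping the 15m/60m retry-storm protection:
const rung = (n: number, s: number[]) => s[Math.min(Math.max(n, 1), s.length) - 1];
console.log(rung(4, oneShotDefault));              // 300000 (5m)
console.log(rung(4, DEFAULT_BACKOFF_SCHEDULE_MS)); // 900000 (15m)
```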



@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3fdfd5a35e


Comment thread src/cron/service/timer.ts (outdated), lines +421 to +425
```typescript
const backoffSchedule =
  Array.isArray(state.deps.cronConfig?.retry?.backoffMs) &&
  state.deps.cronConfig.retry.backoffMs.length > 0
    ? state.deps.cronConfig.retry.backoffMs
    : DEFAULT_BACKOFF_SCHEDULE_MS;
```

P2: Honor custom retry backoff during startup catch-up

Using cronConfig.retry.backoffMs here changes recurring jobs’ persisted nextRunAtMs, but the restart gate in isErrorBackoffPending still computes backoff with the default ladder (errorBackoffMs(consecutiveErrors)), so the two paths diverge. In a configured environment with a longer custom backoff (for example 2h), a gateway restart after 30s+ can treat backoff as expired and planStartupCatchup (with allowCronMissedRunByLastRun) may run the cron job before the configured cooldown and before its stored nextRunAtMs, which breaks operator-configured retry pacing after restarts.


@onesmash
Author

Merged latest main (commit d72ffc41) to refresh CI — prior run had stale-base diff failure, unrelated TS errors, and OOM in test shards. Maintainer: please re-check CI on this push.


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d72ffc41e8


Comment thread src/cron/service/timer.ts (outdated), lines +417 to +418
```typescript
const retryConfig = resolveRetryConfig(state.deps.cronConfig);
const transient = isTransientCronError(result.error, retryConfig.retryOn);
```

P2: Decouple recurring transient checks from one-shot retryOn config

The recurring error path now calls isTransientCronError(..., retryConfig.retryOn), so any custom cron.retry.retryOn setting will also gate recurring retries. This is a regression for deployments that tuned cron.retry for one-shot jobs only (as documented in src/config/schema.help.ts:1199), because a config like retryOn: ["rate_limit"] will make recurring network/timeout failures skip the new backoff retry behavior and wait until the next natural schedule again. Please keep recurring transient classification independent of one-shot retry config (or update the config contract everywhere).


@onesmash
Author

Push 27d0a309d — Codex P2 (retryOn)

Change: Recurring cron/every error path now calls isTransientCronError(error) without cron.retry.retryOn, so a narrowed retryOn (documented for one-shot jobs) no longer disables network/timeout/5xx early retry on recurring jobs. Added regression test; jobName added to error-backoff log.
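A toy classifier showing why forwarding retryOn was a regression for recurring jobs; the pattern table below is assumed for illustration and is not the repo's TRANSIENT_PATTERNS:

```typescript
// Assumed transient classes; a narrowed retryOn of ["rate_limit"] would stop
// classifying network errors as transient, disabling the new early retry.
const PATTERNS: Record<string, RegExp> = {
  network: /(econnreset|econnrefused|enotfound|socket hang up)/i,
  timeout: /(timeout|timed\s+out|etimedout)/i,
  rate_limit: /(429|rate.?limit)/i,
};

function isTransientToy(error: string, retryOn?: string[]): boolean {
  const classes = retryOn ?? Object.keys(PATTERNS); // default: all classes
  return classes.some((k) => PATTERNS[k]?.test(error) ?? false);
}

console.log(isTransientToy("econnreset", ["rate_limit"])); // false: retry skipped
console.log(isTransientToy("econnreset"));                 // true: default classes
```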

CI note (prior run 23242260608): Red jobs look unrelated to this PR:

  • extensions / channels / contracts: extensions/tlon submodule npm install fails with TS1360 in channelContentConfig.ts / groupTemplates.ts.
  • extensions: llm-task-tool.test.ts 1 failed (~142s); provider-ollama-setup.js missing in embedded-agent path.
  • windows: shard failures (likely flake/load).

Please re-run CI on this push; cron unit tests and core shards were green.


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a98b18f54


Comment thread src/cron/service/timer.ts (outdated), lines +423 to +427
```typescript
const recurringBackoffMs =
  Array.isArray(state.deps.cronConfig?.retry?.backoffMs) &&
  state.deps.cronConfig.retry.backoffMs.length > 0
    ? retryConfig.backoffMs
    : DEFAULT_BACKOFF_SCHEDULE_MS;
```

P2: Align restart backoff gating with recurring backoff config

Using cron.retry.backoffMs for recurring errors here changes persisted nextRunAtMs, but startup replay gating still uses isErrorBackoffPending with the default ladder only. In a config like cron.retry.backoffMs: [7200000], a cron job that failed transiently can be replayed on restart (via missed-slot logic) well before its configured 2-hour cooldown and before its stored nextRunAtMs, because the gate considers backoff elapsed after the default 30s. This breaks operator-configured retry pacing specifically after restarts.


Author

Addressed in 52a71a93a: startup missed-slot gating now reuses the same recurring backoff schedule as applyJobResult, so custom cron.retry.backoffMs prevents premature replay after restart. Added a restart-catchup regression test for the custom-backoff case, and targeted cron tests passed.
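A hypothetical shape of the aligned gate: both the error path and startup catch-up derive backoff from the same schedule, so a custom cron.retry.backoffMs like [7_200_000] (2h) blocks premature replay. The signature and names here are assumptions, not the repo's actual API:

```typescript
// Assumed gate: backoff is pending while "now" is inside the window computed
// from the SAME schedule applyJobResult used to set nextRunAtMs.
function isErrorBackoffPending(
  lastErrorAtMs: number,
  consecutiveErrors: number,
  nowMs: number,
  scheduleMs: number[],
): boolean {
  const idx = Math.min(Math.max(consecutiveErrors, 1), scheduleMs.length) - 1;
  return nowMs < lastErrorAtMs + scheduleMs[idx];
}

console.log(isErrorBackoffPending(0, 1, 60_000, [7_200_000])); // true: 2h cooldown holds
console.log(isErrorBackoffPending(0, 1, 60_000, [30_000]));    // false: 30s elapsed
```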

@onesmash
Author

Scheduled maintenance check (2026-03-19)

Code review items: the three points Greptile raised on daed538 are covered at the current HEAD 7a98b18:

  • cron.retry.backoffMs: when a custom backoffMs is configured, the recurring path passes retryConfig.backoffMs into errorBackoffMs (matching the one-shot path); when unset, DEFAULT_BACKOFF_SCHEDULE_MS is still used.
  • Logging: the cron: applying error backoff log now includes transient, with no duplicate consecutiveErrors key.
  • Tests: uses increasing backoff on consecutive transient errors for a daily job now calls applyJobResult repeatedly on the same job instance.

CI: failures in the latest run are mainly pnpm test:channels / pnpm test:extensions / Windows shards; the logs show large numbers of extensions/* test failures, with no overlap with this PR's src/cron changes. Locally, vitest run src/cron/service.issue-46341-recurring-transient-retry.test.ts passes 6/6.

Suggestion: re-run the failed jobs upstream, or wait for main to stabilize and check whether they are still red. No outstanding items on the PR logic side.

@onesmash
Author

Scheduled maintenance (2026-03-19)

HEAD: 7a98b18 — the three points Greptile raised on daed538 (passing retry.backoffMs into the recurring path, the transient log field, and the consecutive-backoff test on a single job instance) are covered on the current branch.

CI (run 23271919932): pnpm test:channels / pnpm test:extensions, Linux test shard 1, and several Windows shards are red; the logs show failures concentrated in extensions/telegram, discord, slack, whatsapp, and signal tests, with no overlap with this PR's src/cron changes. Without repo admin rights I cannot re-run the failed jobs; maintainers, please re-run them in Actions or wait for upstream to stabilize.

Codex P2 (preserveSchedule + transient): timer.ts now uses Math.max(normalNext, backoffNext) when opts.preserveSchedule && normalNext !== undefined, so forced runs keep their natural cadence rather than only the short backoff; if this looks right, please resolve the corresponding inline review.

Merge: the PR is MERGEABLE, but reviewDecision is still empty; a maintainer review/approve is needed.

@onesmash
Author

Scheduled maintenance (2026-03-27)

HEAD: 52a71a93a — all code review items are covered:

  • Greptile: retryConfig.backoffMs, the transient log field, and the consecutive-backoff test on a single job instance ✅
  • Codex P2: preserveSchedule (the transient path uses Math.max(normalNext, backoffNext)), recurringBackoffMs, startup catch-up backoff consistency, and retryOn decoupling ✅

CI (run 23307940289, 2026-03-19): one macos failure, unrelated to this PR's src/cron changes (the macOS job finished at 18:58 UTC on the shared run number; flaky/pre-existing); all 30 extension-fast checks are green.

To do: no code changes pending on the PR; reviewDecision is still empty, awaiting maintainer review/approve. If convenient, please re-run the failed jobs or approve and merge.

@onesmash force-pushed the fix/recurring-cron-transient-retry branch from 52a71a9 to 30aa07f on March 28, 2026 20:27

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30aa07f73c


Comment thread src/cron/service/timer.ts
```typescript
const backoff = errorBackoffMs(job.state.consecutiveErrors ?? 1);
// Recurring jobs: use default transient patterns only. cron.retry.retryOn is documented
// for one-shot retries; narrowing it must not disable recurring network/timeout retries.
const transient = isTransientCronError(result.error);
```

P2: Classify built-in timeout errors as transient

The new recurring retry branch now depends on isTransientCronError(result.error), but the scheduler’s own timeout failures are emitted as "cron: job execution timed out" (from timeoutErrorMessage()), which does not match the current timeout transient pattern (timeout|etimedout). In practice, timed-out recurring jobs (for example a daily job that hits the execution timeout) are treated as permanent errors and wait for the next natural schedule instead of taking the intended backoff retry path, so the timeout case in this fix is not actually covered.


Author

Addressed in 18f0891690: updated TRANSIENT_PATTERNS.timeout to /(timeout|timed\\s+out|etimedout)/i so "cron: job execution timed out" from timeoutErrorMessage() is correctly classified as transient. Added regression test for the exact internal message.
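A quick check of the widened pattern against the internal message, using the regex literals quoted in this thread:

```typescript
const oldTimeoutPattern = /(timeout|etimedout)/i;
const newTimeoutPattern = /(timeout|timed\s+out|etimedout)/i;

const msg = "cron: job execution timed out"; // emitted by timeoutErrorMessage()

console.log(oldTimeoutPattern.test(msg)); // false: treated as permanent before
console.log(newTimeoutPattern.test(msg)); // true: now classified as transient
```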

Commit messages on this branch:

  • When applyJobResult is invoked with preserveSchedule (e.g. a manual force run), transient errors now keep the preserved natural schedule with a backoff floor instead of always scheduling an immediate backoff retry. (Made-with: Cursor)

  • …backoffMs: respect cronConfig.retry.backoffMs when set; otherwise use DEFAULT_BACKOFF_SCHEDULE_MS (the full ladder) for the recurring error path to preserve retry-storm protection (15m/60m), instead of resolving via resolveRetryConfig, which defaults to slice(0,3). (Made-with: Cursor)

  • Align the recurring path with one-shot: pass retryConfig.backoffMs to errorBackoffMs so cronConfig.retry.backoffMs is respected for cron/every jobs (addresses review feedback). (Made-with: Cursor)
@obviyus
Contributor

obviyus commented May 9, 2026

Thanks for the PR. This branch appears to include a large replay of unrelated main/release commits, so it is not reviewable as a focused cron change. Please reopen as a narrow PR with only the intended fix.




Development

Successfully merging this pull request may close these issues.

[Bug] Recurring cron jobs silently skip transient-error retry — only one-shot jobs retry with backoff

3 participants