Skip to content

fix(cron): start isolated job timeout after queue wait#41796

Closed
ayanesakura wants to merge 4 commits intoopenclaw:mainfrom
ayanesakura:fix/cron-timeout-after-lane-start
Closed

fix(cron): start isolated job timeout after queue wait#41796
ayanesakura wants to merge 4 commits intoopenclaw:mainfrom
ayanesakura:fix/cron-timeout-after-lane-start

Conversation

@ayanesakura
Copy link
Copy Markdown
Contributor

Summary

Fix isolated cron job timeout semantics so queue wait on the shared cron lane does not consume the entire execution timeout budget.

Closes #41783.

Root cause

Before this change, executeJobCoreWithTimeout() started the cron job timeout timer before the isolated agent job actually got CPU on the downstream cron lane. Under contention, a job could sit queued for minutes and then immediately fail with cron: job execution timed out once execution finally began.

What changed

  • removed the outer timeout wrapper for due-job execution in onTimer
  • moved isolated-agent timeout enforcement into the isolated branch of executeJobCore()
  • start the timeout right before runIsolatedAgentJob(...)
  • preserve upstream abort behavior by combining signals with AbortSignal.any(...)
  • normalize timeout-triggered aborts back to cron: job execution timed out

Validation

Ran targeted tests locally:

  • pnpm vitest run src/cron/service.issue-regressions.test.ts -t "suppresses isolated follow-up side effects after timeout"
  • pnpm vitest run src/cron/service.issue-regressions.test.ts src/cron/service/timeout-policy.test.ts

Both passed.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4b80a7769

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/cron/service/timer.ts Outdated
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 10, 2026

Greptile Summary

This PR fixes a legitimate bug where isolated cron job timeout timers started too early (during queue wait on the shared cron lane), causing jobs to appear to time out immediately upon receiving CPU. The core fix — moving the setTimeout to just before runIsolatedAgentJob and combining signals via AbortSignal.any — is well-structured for isolated jobs.

Key concerns:

  • Regression for non-isolated jobs: Replacing executeJobCoreWithTimeout with executeJobCore in runDueJob removes the outer timeout wrapper for all job types, not just isolated ones. Main-session jobs with wakeMode === "now" previously respected the user-configured cronJobTimeoutMs via the abort signal threaded through the heartbeat retry loop; they now always wait up to maxWaitMs (2 min) regardless of configuration.
  • Overly broad post-return abort check: After runIsolatedAgentJob resolves, checking combinedAbortSignal.aborted unconditionally returns a timeout error — including for non-timeout upstream aborts — which could silently discard a successful result and misreport the error type for future callers that pass unrelated abort signals.
  • The jobTimeoutMs variable declared at line 632 in runDueJob is now only used for logging, which may be confusing since it previously drove enforcement.

Confidence Score: 2/5

  • Merge with caution — the fix is correct for isolated jobs but inadvertently removes timeout enforcement for main-session jobs and has inconsistent abort-reason handling.
  • The isolated-job timeout fix is sound and solves the reported issue. However, switching runDueJob to call executeJobCore directly (instead of executeJobCoreWithTimeout) is a broader change that silently removes timeout behavior for non-isolated job types in the onTimer path. This is a real behavioral regression for users with main-session jobs that have a configured cronJobTimeoutMs shorter than 2 minutes. Additionally, the abort-reason check at lines 1153–1155 doesn't distinguish between timeout and non-timeout aborts, creating a subtle correctness issue if upstream abort signals are passed with non-timeout reasons. These issues lower confidence enough to warrant fixes before merging.
  • src/cron/service/timer.ts — specifically the runDueJob function around line 635 and the post-return abort check at lines 1153–1155

Last reviewed commit: d4b80a7

Comment thread src/cron/service/timer.ts Outdated
Comment thread src/cron/service/timer.ts Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e8c0e36fb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/cron/service/timer.ts Outdated
@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling size: M and removed size: S labels Mar 10, 2026
@ayanesakura
Copy link
Copy Markdown
Contributor Author

Addressed the timeout review follow-up in 50dfd02. Main-session due jobs keep the outer timeout, isolated execution timeout now starts when execution actually begins, timer-driven isolated runs keep a separate hard guard, and cron timer delays are clamped to the Node timer max.

@jlcbk
Copy link
Copy Markdown

jlcbk commented Mar 10, 2026

Tested locally on the PR head:

  • pnpm format:check ✅
  • pnpm tsgo ✅
  • pnpm build ✅
  • pnpm test src/cron/service.issue-regressions.test.ts ✅

This looks like the right fix for the cron-lane queue-wait timeout semantics (isolated timeout starts after execution begins), and the added regression coverage is solid.

@ayanesakura
Copy link
Copy Markdown
Contributor Author

Closing in favor of #42482, which addresses both the deferred timeout issue (#41783) and the per-attempt AbortController issue (#37505) in a single comprehensive fix. #42482's broader scope makes it a better solution for the cron timeout problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling gateway Gateway runtime size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(cron): job timeout includes cron-lane queue wait time

2 participants