fix(cron): per-attempt AbortControllers and deferred execution timeout by frankbuild · Pull Request #42482 · openclaw/openclaw

frankbuild · 2026-03-10T20:07:26Z

Summary

Fix two root causes of isolated cron agentTurn jobs hanging until timeout (#42464):

1. Per-attempt AbortControllers (#37505, "Cron job timeout aborts entire model fallback chain via shared AbortController"): The cron timeout fires a shared AbortController signal that propagates to all subsequent model fallback attempts, killing them instantly (~100ms) without making a network request. Each fallback attempt in runCronIsolatedAgentTurn now gets its own AbortController linked to the parent signal, so when one attempt is aborted the next attempt starts with a fresh (non-aborted) controller.
2. Deferred execution timeout (#41783, "bug(cron): job timeout includes cron-lane queue wait time"): The timeout timer in executeJobCoreWithTimeout started immediately via Promise.race, but the job may wait in the lane queue (inside runEmbeddedPiAgent) before doing real work. executeJobCore now accepts an onExecutionStart callback and calls it right before runIsolatedAgentJob, deferring the timeout clock until actual execution begins. A 2× safety backstop prevents indefinite hangs if the callback is never called.

Files changed

src/cron/isolated-agent/run.ts — per-attempt AbortController in the runWithModelFallback run callback
src/cron/service/timer.ts — deferred timeout arming via onExecutionStart callback
src/cron/service.cron-timeout-abort.test.ts — 5 new tests covering both fixes

Test plan

All 502 existing cron tests pass (64 test files)
All 60 model-fallback tests pass
5 new tests covering:
- Deferred timeout does not fire during queue wait
- executeJobCore calls onExecutionStart before running isolated jobs
- executeJobCore calls onExecutionStart for main session jobs
- Abort signal fires correctly on timeout
- executeJobCoreWithTimeout still times out correctly with deferred start

Related

Closes #37505
Closes #41783
Fixes #42632
Refs #42464 #40237

Supersedes PR #41796, "fix(cron): start isolated job timeout after queue wait" (closed by its author in favour of this PR; same deferred timeout fix but narrower scope)
Independent repro in #42632, "cron sessionTarget="isolated" + agentTurn can time out on a minimal prompt" (6 confirmations from affected users)
WS self-contention (#40237, "[Bug]: Gateway WS self-contention still unresolved — cron tool timeouts from active sessions (#5703/#6508 circular-duped)") is addressed separately in PR #42497, "fix: route embedded tool calls through in-process dispatch (#40237)"

greptile-apps · 2026-03-10T20:15:39Z

Greptile Summary

This PR fixes two independent root causes that caused isolated cron agentTurn jobs to hang until the full timeout rather than failing fast or surviving model fallbacks.

Fix 1 — Per-attempt AbortController (src/cron/isolated-agent/run.ts): The shared parent AbortSignal was previously forwarded directly to runEmbeddedPiAgent. When the timeout fired mid-attempt, subsequent fallback attempts received the already-aborted parent signal and bailed out instantly. Each attempt now gets its own AbortController linked to the parent via a { once: true } listener; the finally block removes the listener cleanly. Logic and cleanup are correct.
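The per-attempt pattern described in Fix 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's diff: `withAttemptSignal` is a hypothetical helper name, and the sketch assumes (as the description implies) that only aborts occurring during an attempt are forwarded, so a parent that is already aborted does not pre-abort the next attempt.

```typescript
// Hypothetical helper: the name and shape are illustrative, not taken from the PR diff.
async function withAttemptSignal<T>(
  parent: AbortSignal | undefined,
  attempt: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  // Fresh controller per attempt, so an abort seen by an earlier attempt
  // does not leave this one pre-aborted.
  const controller = new AbortController();
  const onParentAbort = () => controller.abort();

  // Assumption: the sketch deliberately does not check parent.aborted up front;
  // it only forwards an abort that fires while this attempt is running.
  parent?.addEventListener("abort", onParentAbort, { once: true });
  try {
    return await attempt(controller.signal);
  } finally {
    // Detach so finished attempts do not stay registered on the parent signal.
    parent?.removeEventListener("abort", onParentAbort);
  }
}
```

Whether a new attempt should start at all once the parent budget is exhausted is a separate policy question; the sketch only shows the linking and cleanup mechanics.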

Fix 2 — Deferred execution timeout (src/cron/service/timer.ts): The job-level timeout timer previously started immediately in Promise.race, allowing lane-queue wait time to consume the execution budget before real work began. The timer is now armed via an onExecutionStart callback called inside executeJobCore just before runIsolatedAgentJob. A 2× safety backstop ensures the race never hangs if onExecutionStart is never reached (e.g., early validation failure). Idempotency (armed flag) and cleanup (finally) are handled correctly.
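A rough sketch of the deferred-arming idea in Fix 2, assuming a wrapper that hands the work function an `onExecutionStart` callback. The function name `runWithDeferredTimeout`, the error messages, and the choice to drop the backstop once the real timer is armed are assumptions of this sketch, not details confirmed by the PR. It builds the promise by hand, which behaves like racing the work against the timers.

```typescript
// Hypothetical wrapper; the PR's executeJobCoreWithTimeout may differ in detail.
function runWithDeferredTimeout<T>(
  work: (onExecutionStart: () => void) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let armed = false;
    let execTimer: ReturnType<typeof setTimeout> | undefined;

    // Backstop: if onExecutionStart is never called, fail after 2x the budget.
    const backstopTimer = setTimeout(
      () => reject(new Error("backstop timeout")),
      timeoutMs * 2,
    );

    const onExecutionStart = () => {
      if (armed) return; // idempotent: only the first call arms the timer
      armed = true;
      // Assumption of this sketch: the backstop is dropped once the real timer is armed.
      clearTimeout(backstopTimer);
      execTimer = setTimeout(
        () => reject(new Error("execution timeout")),
        timeoutMs,
      );
    };

    const cleanup = () => {
      clearTimeout(execTimer);
      clearTimeout(backstopTimer);
    };

    // Whichever settles first wins; later resolve/reject calls are ignored.
    work(onExecutionStart).then(resolve, reject).finally(cleanup);
  });
}
```

With this shape, time spent before `onExecutionStart` is called counts only against the 2× backstop, not against the execution budget.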

Both fixes are logically sound, properly scoped, and well-tested with 5 new regression tests covering deferred timeout behavior, onExecutionStart ordering, and abort signal propagation. All 502 existing cron tests pass.

Confidence Score: 4/5

Safe to merge — both fixes are logically correct and thoroughly tested with 5 new tests plus all 502 existing tests passing.
The PR implements two well-scoped, independent fixes for isolated cron timeout hangs. The deferred execution timeout correctly arms the timer only after onExecutionStart is called, preventing queue-wait time from consuming the budget. The per-attempt AbortController logic is clean: each fallback attempt gets a fresh controller linked to the parent signal with proper cleanup. The 5 new tests validate core behavior: timeout deferral, onExecutionStart ordering, and abort signal propagation. Score is held at 4 rather than 5 primarily due to the first test not simulating an actual queue-wait scenario—while the test still validates happy-path behavior, the specific queue-wait regression case (job waiting in lane queue for 50ms before execution) lacks explicit automated coverage.
No files require special attention. Core logic in both timer.ts and run.ts is correct.

Last reviewed commit: f6818f3

spectra-the-bot · 2026-03-11T23:32:09Z

Heavily impacted by both issues described here — wanted to add a field report.

We have several isolated agentTurn cron jobs running on a busy gateway with multiple agents (main, forge, governance). The combination of the shared AbortController and the non-deferred timeout created a compounding failure mode today: cron lane saturation caused lane waits long enough to burn through a significant portion of the job's timeout budget before execution even started. When the first attempt was aborted (by the now-expired timer), the poisoned signal cascaded to all fallback attempts, producing FailoverError: LLM request timed out across the entire fallback chain simultaneously.

The result looked like a provider outage — multiple agents hitting FailoverError at the same moment — even though interactive sessions on the same gateway were unaffected. Took several hours and an external forensic pass to correctly attribute the cause.

Workaround we landed on: pin the cron model to the agent's primary model to avoid triggering the fallback chain at all. Not ideal but stable.

Looking forward to this merging. Happy to provide test data or a reproduction config if useful.

frankbuild · 2026-03-13T11:23:20Z

Hi @tyler6204 @vincentkoc — this PR has been CI-green and Greptile 4/5 ("safe to merge") for a few days now. We also got a field report from another user confirming the real-world impact (see spectra-the-bot's comment above), and a duplicate issue #42632 was independently filed. A competing PR #41796 was closed in favour of this approach. Would appreciate a review when you get a chance 🙏

…successfully

If the execution timeout fires during cleanup after the actual work is done, abortSignal.aborted will be true when executeJobCore returns, even though runIsolatedAgentJob resolved with status 'ok'. The previous check discarded the successful res entirely and returned a timeout error.

Fix: only override with timeout error if the session itself did not complete successfully (res.status !== 'ok').

Fixes: openclaw#42482-followup
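The check described in this commit message might look roughly like the sketch below. The `JobResult` type, the `applyTimeoutOverride` name, and every status value other than 'ok' are assumptions made for illustration; only the `res.status !== 'ok'` condition comes from the commit message.

```typescript
// Hypothetical result shape; only the "ok" status is named in the commit message.
type JobResult = { status: "ok" | "error" | "skipped"; error?: string };

// Decide whether an aborted signal should turn the result into a timeout error.
function applyTimeoutOverride(
  res: JobResult,
  abortSignal: AbortSignal,
  timeoutMessage: string,
): JobResult {
  // Override only when the timeout fired AND the job did not already succeed.
  if (abortSignal.aborted && res.status !== "ok") {
    return { status: "error", error: timeoutMessage };
  }
  // A successful result survives a timeout that fired during cleanup.
  return res;
}
```

For example, a result of `{ status: "ok" }` is returned unchanged even when `abortSignal.aborted` is true, which is the case the earlier check got wrong.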

Fix two root causes of cron agentTurn jobs hanging until timeout:

1. Shared AbortController kills fallback chain (openclaw#37505): When the cron timeout fires, it aborts a shared signal that propagates to all subsequent model fallback attempts, killing them instantly (~100ms). Now each fallback attempt in runCronIsolatedAgentTurn gets its own AbortController linked to the parent signal, so new attempts start with a fresh (non-aborted) controller.
2. Queue wait consumes execution timeout (openclaw#41783): The timeout timer started immediately in executeJobCoreWithTimeout, but the job may wait in the lane queue (inside runEmbeddedPiAgent) before doing real work. Now executeJobCore accepts an onExecutionStart callback and calls it right before runIsolatedAgentJob, deferring the timeout clock until actual execution begins. A 2x safety backstop prevents indefinite hangs if the callback is never called.

Closes openclaw#37505
Closes openclaw#41783
Refs openclaw#42464 openclaw#40237

howel52 · 2026-04-15T08:02:33Z

I’ve been hitting this issue quite frequently as well, so really appreciate the work that’s already gone into this fix.

I noticed the PR has been open for a while and seems technically mature already. Just wondering if there are any blockers at the moment that are holding up the merge?

Happy to help validate this against our production workload if needed.

@vincentkoc @tyler6204

openclaw-barnacle · 2026-04-15T08:02:43Z

Please don’t spam-ping multiple maintainers at once. Be patient, or join our community Discord for help: https://discord.gg/clawd

openclaw-barnacle · 2026-04-27T04:35:29Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

clawsweeper · 2026-04-27T05:15:26Z

Thanks for the context here. I swept through the related work, and this is now duplicate or superseded.

Close as superseded: current main has already shipped the deferred-timeout fix, while this PR's timer hook still arms too early and the remaining abort/fallback semantics are now tracked by narrower follow-up work.

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Review details

Best possible solution:

Close this branch as superseded, keep the shipped timeout implementation, and resolve terminal cron-budget abort behavior in the active shared failover-policy follow-ups with focused contract tests.

Do we have a high-confidence way to reproduce the issue?

Partial yes: current-main tests reproduce the queue-wait timeout boundary, and source inspection shows the PR diff arms its callback before the runner lane wait it meant to exclude. The remaining abort-policy behavior is source-visible but should be verified in the narrower canonical follow-up.

Is this the best way to solve the issue?

No: this PR is no longer the best fix because its deferred-timeout callback fires before the actual embedded lane start and current main already shipped the correct timeout boundary. The remaining abort semantics belong in the shared failover policy tracked by the newer follow-ups.

Security review:

Security review cleared: The PR changes cron runtime code and tests only; I found no workflow, dependency, script, credential, or supply-chain surface in the diff.

What I checked:

current-main-deferred-timeout: Current main defers timeout arming for non-main agentTurn jobs through onExecutionStarted instead of starting the timer immediately. (src/cron/service/timer.ts:120, 5ecd01ff94d6)
runner-lane-callback: The embedded runner calls onExecutionStarted after entering the session/global lane path, which matches the queue-wait boundary the PR intended to fix. (src/agents/pi-embedded-runner/run.ts:379, 5ecd01ff94d6)
regression-coverage: Current regression coverage verifies isolated execution timeout is not spent while waiting for runner lane acquisition and only aborts after the execution-start callback fires. (src/cron/service/timer.regression.test.ts:618, 5ecd01ff94d6)
release-provenance: CHANGELOG.md lists the shipped 2026.4.29 fix for starting isolated agent-turn timeouts after the runner enters its effective execution lane, which fixes #41783 ("bug(cron): job timeout includes cron-lane queue wait time"). (CHANGELOG.md:1429, 5ecd01ff94d6)
pr-diff-timer-blocker: The submitted diff invokes the new execution-start callback immediately before runIsolatedAgentJob, but the queue wait described by the PR occurs inside that runner path. (src/cron/service/timer.ts:1167, fd818d32018d)
remaining-abort-policy-track: Current failover policy treats external aborts as surfaced errors rather than model-fallback candidates, and the provided timeline links narrower active follow-ups for terminal-abort semantics: #60388 ("Don't trigger model fallback when abort reason is the run's own timeout budget"), #52365 ("fix(cron): stop fallback attempts when cron budget is exhausted"), and #62682 ("fix(agents): distinguish terminal aborts from retryable failures (#60388)"). (src/agents/pi-embedded-runner/run/failover-policy.ts:119, 5ecd01ff94d6)
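The boundary difference noted in the checks above (a callback fired just before the runner call versus one fired after the lane is acquired) can be illustrated with a toy lane. `Lane` and `runInLane` are hypothetical stand-ins, not the runner's real queue or API; only the ordering of the callback relative to the queue wait matters here.

```typescript
// Hypothetical single-slot lane: tasks run one at a time, in submission order.
class Lane {
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive whether the task succeeds or fails.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

// Fires onExecutionStarted only once the lane slot is held, so any time spent
// queued behind earlier tasks happens before the callback, not after it.
function runInLane<T>(
  lane: Lane,
  work: () => Promise<T>,
  onExecutionStarted: () => void,
): Promise<T> {
  return lane.run(async () => {
    onExecutionStarted();
    return work();
  });
}
```

Calling `onExecutionStarted()` before `lane.run(...)` instead would start the clock while the task may still be queued, which is the early-arming problem the review describes.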

Likely related people:

steipete: Commit metadata shows Peter authored the shipped deferred-timeout fix plus adjacent timed-out cron cleanup and cron lane concurrency work. (role: recent maintainer; confidence: high; commits: 729147dcb523, 61d53f98d314, 7d74c29dcc07; files: src/cron/service/timer.ts, src/agents/pi-embedded-runner/run.ts, src/cron/isolated-agent/run-executor.ts)
ayanesakura: Filed the queue-wait timeout report and authored the earlier superseded timeout PR; the shipped changelog entry credits this context. (role: related contributor; confidence: medium; commits: 729147dcb523; files: src/cron/service/timer.ts, src/cron/service/timer.regression.test.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 5ecd01ff94d6.

openclaw-barnacle Bot added the size: M label Mar 10, 2026

goofy814 mentioned this pull request Mar 11, 2026

cron sessionTarget="isolated" + agentTurn can time out on a minimal prompt #42632

Closed

github-actions Bot mentioned this pull request Mar 12, 2026

🦞 OpenClaw ecosystem daily newsletter 2026-03-12 compasify/agents-radar#31

Open

ayanesakura mentioned this pull request Mar 12, 2026

fix(cron): start isolated job timeout after queue wait #41796

Closed

frankbuild mentioned this pull request Mar 12, 2026

[Bug]: Isolated cron agentTurn jobs hang until timeout — providers reachable, interactive sessions unaffected #42464

Closed

frankbuild force-pushed the fix/cron-abort-timeout-42464 branch from 88a2abe to 2154f40 on March 13, 2026 03:14

frankbuild force-pushed the fix/cron-abort-timeout-42464 branch from 2154f40 to fd818d3 on March 14, 2026 09:42

simonusa mentioned this pull request Apr 7, 2026

fix(agents): distinguish terminal aborts from retryable failures (#60388) #62682

Open

3 tasks

openclaw-barnacle Bot added the stale label (Marked as stale due to inactivity) Apr 27, 2026

clawsweeper Bot mentioned this pull request Apr 27, 2026

fix(cron): start job timeout after execution begins, not at enqueue time #42680

Closed

BingqingLyu mentioned this pull request Apr 27, 2026

fix(cron): per-attempt AbortControllers and deferred execution timeout BingqingLyu/openclaw#392

Open

3 tasks

openclaw-barnacle Bot removed the stale label (Marked as stale due to inactivity) Apr 28, 2026

BingqingLyu mentioned this pull request Apr 28, 2026

fix(agents): distinguish terminal aborts from retryable failures (#60388) BingqingLyu/openclaw#2114

Open

3 tasks

This was referenced Apr 28, 2026

bug(cron): job timeout includes cron-lane queue wait time #41783

Closed

Don't trigger model fallback when abort reason is the run's own timeout budget #60388

Open

fix(cron): stop fallback attempts when cron budget is exhausted #52365

Open

clawsweeper Bot closed this May 3, 2026

frankbuild wants to merge 1 commit into openclaw:main from frankbuild:fix/cron-abort-timeout-42464

3 participants
