fix(cron): per-attempt AbortControllers and deferred execution timeout#42482
fix(cron): per-attempt AbortControllers and deferred execution timeout#42482frankbuild wants to merge 1 commit intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR fixes two independent root causes that caused isolated cron Fix 1 — Per-attempt Fix 2 — Deferred execution timeout ( Both fixes are logically sound, properly scoped, and well-tested with 5 new regression tests covering deferred timeout behavior, Confidence Score: 4/5
Last reviewed commit: f6818f3 |
|
Heavily impacted by both issues described here — wanted to add a field report. We have several isolated The result looked like a provider outage — multiple agents hitting FailoverError at the same moment — even though interactive sessions on the same gateway were unaffected. Took several hours and an external forensic pass to correctly attribute the cause. Workaround we landed on: pin the cron model to the agent's primary model to avoid triggering the fallback chain at all. Not ideal but stable. Looking forward to this merging. Happy to provide test data or a reproduction config if useful. |
88a2abe to
2154f40
Compare
|
Hi @tyler6204 @vincentkoc — this PR has been CI-green and Greptile 4/5 ("safe to merge") for a few days now. We also got a field report from another user confirming the real-world impact (see spectra-the-bot's comment above), and a duplicate issue #42632 was independently filed. A competing PR #41796 was closed in favour of this approach. Would appreciate a review when you get a chance 🙏 |
…successfully If the execution timeout fires during cleanup after the actual work is done, abortSignal.aborted will be true when executeJobCore returns, even though runIsolatedAgentJob resolved with status 'ok'. The previous check discarded the successful res entirely and returned a timeout error. Fix: only override with timeout error if the session itself did not complete successfully (res.status !== 'ok'). Fixes: openclaw#42482-followup
Fix two root causes of cron agentTurn jobs hanging until timeout: 1. Shared AbortController kills fallback chain (openclaw#37505): When the cron timeout fires, it aborts a shared signal that propagates to all subsequent model fallback attempts, killing them instantly (~100ms). Now each fallback attempt in runCronIsolatedAgentTurn gets its own AbortController linked to the parent signal, so new attempts start with a fresh (non-aborted) controller. 2. Queue wait consumes execution timeout (openclaw#41783): The timeout timer started immediately in executeJobCoreWithTimeout, but the job may wait in the lane queue (inside runEmbeddedPiAgent) before doing real work. Now executeJobCore accepts an onExecutionStart callback and calls it right before runIsolatedAgentJob, deferring the timeout clock until actual execution begins. A 2x safety backstop prevents indefinite hangs if the callback is never called. Closes openclaw#37505 Closes openclaw#41783 Refs openclaw#42464 openclaw#40237
2154f40 to
fd818d3
Compare
…successfully If the execution timeout fires during cleanup after the actual work is done, abortSignal.aborted will be true when executeJobCore returns, even though runIsolatedAgentJob resolved with status 'ok'. The previous check discarded the successful res entirely and returned a timeout error. Fix: only override with timeout error if the session itself did not complete successfully (res.status !== 'ok'). Fixes: openclaw#42482-followup
|
I’ve been hitting this issue quite frequently as well, so really appreciate the work that’s already gone into this fix. I noticed the PR has been open for a while and seems technically mature already. Just wondering if there are any blockers at the moment that are holding up the merge? Happy to help validate this against our production workload if needed. |
|
Please don’t spam-ping multiple maintainers at once. Be patient, or join our community Discord for help: https://discord.gg/clawd |
|
This pull request has been automatically marked as stale due to inactivity. |
|
Thanks for the context here. I swept through the related work, and this is now duplicate or superseded. Close as superseded: current main has already shipped the deferred-timeout fix, while this PR's timer hook still arms too early and the remaining abort/fallback semantics are now tracked by narrower follow-up work. So I’m closing this here and keeping the remaining discussion on the canonical linked item. Review detailsBest possible solution: Close this branch as superseded, keep the shipped timeout implementation, and resolve terminal cron-budget abort behavior in the active shared failover-policy follow-ups with focused contract tests. Do we have a high-confidence way to reproduce the issue? Partial yes: current-main tests reproduce the queue-wait timeout boundary, and source inspection shows the PR diff arms its callback before the runner lane wait it meant to exclude. The remaining abort-policy behavior is source-visible but should be verified in the narrower canonical follow-up. Is this the best way to solve the issue? No: this PR is no longer the best fix because its deferred-timeout callback fires before the actual embedded lane start and current main already shipped the correct timeout boundary. The remaining abort semantics belong in the shared failover policy tracked by the newer follow-ups. Security review: Security review cleared: The PR changes cron runtime code and tests only; I found no workflow, dependency, script, credential, or supply-chain surface in the diff. What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 5ecd01ff94d6. |
Summary
Fix two root causes of isolated cron
agentTurnjobs hanging until timeout (#42464):Per-attempt AbortControllers (Cron job timeout aborts entire model fallback chain via shared AbortController #37505): The cron timeout fires a shared
AbortControllersignal that propagates to all subsequent model fallback attempts, killing them instantly (~100ms) without making a network request. Each fallback attempt inrunCronIsolatedAgentTurnnow gets its ownAbortControllerlinked to the parent signal, so when one attempt is aborted the next attempt starts with a fresh (non-aborted) controller.Deferred execution timeout (bug(cron): job timeout includes cron-lane queue wait time #41783): The timeout timer in
executeJobCoreWithTimeoutstarted immediately viaPromise.race, but the job may wait in the lane queue (insiderunEmbeddedPiAgent) before doing real work.executeJobCorenow accepts anonExecutionStartcallback and calls it right beforerunIsolatedAgentJob, deferring the timeout clock until actual execution begins. A 2× safety backstop prevents indefinite hangs if the callback is never called.Files changed
src/cron/isolated-agent/run.ts— per-attemptAbortControllerin therunWithModelFallbackrun callbacksrc/cron/service/timer.ts— deferred timeout arming viaonExecutionStartcallbacksrc/cron/service.cron-timeout-abort.test.ts— 5 new tests covering both fixesTest plan
executeJobCorecallsonExecutionStartbefore running isolated jobsexecuteJobCorecallsonExecutionStartfor main session jobsexecuteJobCoreWithTimeoutstill times out correctly with deferred startCloses #37505
Closes #41783
Fixes #42632
Refs #42464 #40237
Related
sessionTarget="isolated"+agentTurncan time out on a minimal prompt #42632 (6 confirmations from affected users)