
fix(cron): per-attempt AbortControllers and deferred execution timeout #42482

Closed
frankbuild wants to merge 1 commit into openclaw:main from frankbuild:fix/cron-abort-timeout-42464

Conversation


frankbuild (Contributor) commented Mar 10, 2026

Summary

Fix two root causes of isolated cron agentTurn jobs hanging until timeout (#42464):

  • Per-attempt AbortControllers (#37505 — cron job timeout aborts entire model fallback chain via shared AbortController): The cron timeout fires a shared AbortController signal that propagates to all subsequent model fallback attempts, killing them instantly (~100ms) without making a network request. Each fallback attempt in runCronIsolatedAgentTurn now gets its own AbortController linked to the parent signal, so when one attempt is aborted the next attempt starts with a fresh (non-aborted) controller.

  • Deferred execution timeout (#41783 — job timeout includes cron-lane queue wait time): The timeout timer in executeJobCoreWithTimeout started immediately via Promise.race, but the job may wait in the lane queue (inside runEmbeddedPiAgent) before doing real work. executeJobCore now accepts an onExecutionStart callback and calls it right before runIsolatedAgentJob, deferring the timeout clock until actual execution begins. A 2× safety backstop prevents an indefinite hang if the callback is never called.
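
The per-attempt linking described above can be sketched roughly like this (a minimal illustration only — makeAttemptController and its shape are assumptions for this sketch, not the PR's actual code in run.ts):

```typescript
// Each fallback attempt gets a fresh controller linked to the parent signal,
// so aborting one attempt never poisons the next attempt's signal.
function makeAttemptController(parent: AbortSignal): {
  controller: AbortController;
  cleanup: () => void;
} {
  const controller = new AbortController();
  const onAbort = () => controller.abort();
  if (parent.aborted) {
    // Parent already aborted: this attempt must not start real work.
    controller.abort();
  } else {
    // Forward a parent abort exactly once to this attempt's controller.
    parent.addEventListener("abort", onAbort, { once: true });
  }
  // Intended to be called in the attempt's finally block so finished
  // attempts stop listening to the parent.
  const cleanup = () => parent.removeEventListener("abort", onAbort);
  return { controller, cleanup };
}
```

The key property is that aborting an attempt's own controller never flows upward, while a parent abort still reaches whichever attempt is live.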

Files changed

  • src/cron/isolated-agent/run.ts — per-attempt AbortController in the runWithModelFallback run callback
  • src/cron/service/timer.ts — deferred timeout arming via onExecutionStart callback
  • src/cron/service.cron-timeout-abort.test.ts — 5 new tests covering both fixes

Test plan

  • All 502 existing cron tests pass (64 test files)
  • All 60 model-fallback tests pass
  • 5 new tests covering:
    • Deferred timeout does not fire during queue wait
    • executeJobCore calls onExecutionStart before running isolated jobs
    • executeJobCore calls onExecutionStart for main session jobs
    • Abort signal fires correctly on timeout
    • executeJobCoreWithTimeout still times out correctly with deferred start

Closes #37505
Closes #41783
Fixes #42632
Refs #42464 #40237



greptile-apps Bot commented Mar 10, 2026

Greptile Summary

This PR fixes two independent root causes that caused isolated cron agentTurn jobs to hang until the full timeout rather than failing fast or surviving model fallbacks.

Fix 1 — Per-attempt AbortController (src/cron/isolated-agent/run.ts): The shared parent AbortSignal was previously forwarded directly to runEmbeddedPiAgent. When the timeout fired mid-attempt, subsequent fallback attempts received the already-aborted parent signal and bailed out instantly. Each attempt now gets its own AbortController linked to the parent via a { once: true } listener; the finally block removes the listener cleanly. Logic and cleanup are correct.

Fix 2 — Deferred execution timeout (src/cron/service/timer.ts): The job-level timeout timer previously started immediately in Promise.race, allowing lane-queue wait time to consume the execution budget before real work began. The timer is now armed via an onExecutionStart callback called inside executeJobCore just before runIsolatedAgentJob. A 2× safety backstop ensures the race never hangs if onExecutionStart is never reached (e.g., early validation failure). Idempotency (armed flag) and cleanup (finally) are handled correctly.
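
The deferred-arming pattern this describes can be sketched as follows (an illustrative reconstruction under stated assumptions — raceWithDeferredTimeout is a made-up name, not the actual executeJobCoreWithTimeout implementation):

```typescript
// Timeout timer is armed only when onExecutionStart fires, so lane-queue
// wait time does not consume the execution budget. A 2x backstop guarantees
// the race cannot hang if onExecutionStart is never reached.
async function raceWithDeferredTimeout<T>(
  run: (onExecutionStart: () => void) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let armed = false;
  let fire!: () => void;

  const timeout = new Promise<never>((_, reject) => {
    fire = () => reject(new Error("cron job timed out"));
  });
  // Backstop: even if onExecutionStart never fires (e.g. an early
  // validation failure), fail at 2x the budget rather than hanging.
  const backstop = setTimeout(() => fire(), timeoutMs * 2);
  const onExecutionStart = () => {
    if (armed) return; // idempotent: the real timer is armed at most once
    armed = true;
    timer = setTimeout(() => fire(), timeoutMs);
  };
  try {
    return await Promise.race([run(onExecutionStart), timeout]);
  } finally {
    // Cleanup so a cleared timer cannot reject after a successful run.
    if (timer) clearTimeout(timer);
    clearTimeout(backstop);
  }
}
```

A job that waits in the queue longer than its budget, then finishes quickly once started, resolves normally under this shape, which is exactly the regression the deferred arming targets.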

Both fixes are logically sound, properly scoped, and well-tested with 5 new regression tests covering deferred timeout behavior, onExecutionStart ordering, and abort signal propagation. All 502 existing cron tests pass.

Confidence Score: 4/5

  • Safe to merge — both fixes are logically correct and thoroughly tested with 5 new tests plus all 502 existing tests passing.
  • The PR implements two well-scoped, independent fixes for isolated cron timeout hangs. The deferred execution timeout correctly arms the timer only after onExecutionStart is called, preventing queue-wait time from consuming the budget. The per-attempt AbortController logic is clean: each fallback attempt gets a fresh controller linked to the parent signal with proper cleanup. The 5 new tests validate core behavior: timeout deferral, onExecutionStart ordering, and abort signal propagation. Score is held at 4 rather than 5 primarily due to the first test not simulating an actual queue-wait scenario—while the test still validates happy-path behavior, the specific queue-wait regression case (job waiting in lane queue for 50ms before execution) lacks explicit automated coverage.
  • No files require special attention. Core logic in both timer.ts and run.ts is correct.

Last reviewed commit: f6818f3

@spectra-the-bot

Heavily impacted by both issues described here — wanted to add a field report.

We have several isolated agentTurn cron jobs running on a busy gateway with multiple agents (main, forge, governance). The combination of the shared AbortController and the non-deferred timeout created a compounding failure mode today: cron lane saturation caused lane waits long enough to burn through a significant portion of the job's timeout budget before execution even started. When the first attempt was aborted (by the now-expired timer), the poisoned signal cascaded to all fallback attempts, producing FailoverError: LLM request timed out across the entire fallback chain simultaneously.

The result looked like a provider outage — multiple agents hitting FailoverError at the same moment — even though interactive sessions on the same gateway were unaffected. Took several hours and an external forensic pass to correctly attribute the cause.

Workaround we landed on: pin the cron model to the agent's primary model to avoid triggering the fallback chain at all. Not ideal but stable.

Looking forward to this merging. Happy to provide test data or a reproduction config if useful.

@frankbuild

Hi @tyler6204 @vincentkoc — this PR has been CI-green and Greptile 4/5 ("safe to merge") for a few days now. We also got a field report from another user confirming the real-world impact (see spectra-the-bot's comment above), and a duplicate issue #42632 was independently filed. A competing PR #41796 was closed in favour of this approach. Would appreciate a review when you get a chance 🙏

spectra-the-bot added a commit to spectra-the-bot/openclaw that referenced this pull request Mar 13, 2026
…successfully

If the execution timeout fires during cleanup after the actual work is done,
abortSignal.aborted will be true when executeJobCore returns, even though
runIsolatedAgentJob resolved with status 'ok'. The previous check discarded
the successful res entirely and returned a timeout error.

Fix: only override with timeout error if the session itself did not complete
successfully (res.status !== 'ok').

Fixes: openclaw#42482-followup
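
The guard that commit describes could look roughly like this (types and the finalizeResult name are simplified assumptions for illustration, not the follow-up's actual code):

```typescript
// If the execution timeout fires during post-success cleanup, the abort
// signal is already set when executeJobCore returns. A completed result
// must not be discarded in that window.
type JobResult = { status: "ok" | "error"; error?: string };

function finalizeResult(res: JobResult, abortSignal: AbortSignal): JobResult {
  // Only override with a timeout error if the session itself did not
  // complete successfully.
  if (abortSignal.aborted && res.status !== "ok") {
    return { status: "error", error: "cron job timed out" };
  }
  return res;
}
```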
Fix two root causes of cron agentTurn jobs hanging until timeout:

1. Shared AbortController kills fallback chain (openclaw#37505): When the cron
   timeout fires, it aborts a shared signal that propagates to all
   subsequent model fallback attempts, killing them instantly (~100ms).
   Now each fallback attempt in runCronIsolatedAgentTurn gets its own
   AbortController linked to the parent signal, so new attempts start
   with a fresh (non-aborted) controller.

2. Queue wait consumes execution timeout (openclaw#41783): The timeout timer
   started immediately in executeJobCoreWithTimeout, but the job may
   wait in the lane queue (inside runEmbeddedPiAgent) before doing real
   work. Now executeJobCore accepts an onExecutionStart callback and
   calls it right before runIsolatedAgentJob, deferring the timeout
   clock until actual execution begins. A 2x safety backstop prevents
   indefinite hangs if the callback is never called.

Closes openclaw#37505
Closes openclaw#41783
Refs openclaw#42464 openclaw#40237
frankbuild force-pushed the fix/cron-abort-timeout-42464 branch from 2154f40 to fd818d3 on March 14, 2026 at 09:42
spectra-the-bot added a commit to spectra-the-bot/openclaw that referenced this pull request Mar 14, 2026

howel52 commented Apr 15, 2026

I’ve been hitting this issue quite frequently as well, so really appreciate the work that’s already gone into this fix.

I noticed the PR has been open for a while and seems technically mature already. Just wondering if there are any blockers at the moment that are holding up the merge?

Happy to help validate this against our production workload if needed.

@vincentkoc @tyler6204

@openclaw-barnacle

Please don’t spam-ping multiple maintainers at once. Be patient, or join our community Discord for help: https://discord.gg/clawd

@openclaw-barnacle

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

openclaw-barnacle Bot added the "stale" label (marked as stale due to inactivity) on Apr 27, 2026

clawsweeper Bot commented Apr 27, 2026

Thanks for the context here. I swept through the related work, and this PR is now a duplicate or has been superseded.

Close as superseded: current main has already shipped the deferred-timeout fix, while this PR's timer hook still arms too early and the remaining abort/fallback semantics are now tracked by narrower follow-up work.

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Review details

Best possible solution:

Close this branch as superseded, keep the shipped timeout implementation, and resolve terminal cron-budget abort behavior in the active shared failover-policy follow-ups with focused contract tests.

Do we have a high-confidence way to reproduce the issue?

Partially, yes: current-main tests reproduce the queue-wait timeout boundary, and source inspection shows the PR diff arms its callback before the runner lane wait it was meant to exclude. The remaining abort-policy behavior is source-visible but should be verified in the narrower canonical follow-up.

Is this the best way to solve the issue?

No: this PR is no longer the best fix because its deferred-timeout callback fires before the actual embedded lane start and current main already shipped the correct timeout boundary. The remaining abort semantics belong in the shared failover policy tracked by the newer follow-ups.

Security review:

Security review cleared: The PR changes cron runtime code and tests only; I found no workflow, dependency, script, credential, or supply-chain surface in the diff.

What I checked:

Likely related people:

  • steipete: Commit metadata shows Peter authored the shipped deferred-timeout fix plus adjacent timed-out cron cleanup and cron lane concurrency work. (role: recent maintainer; confidence: high; commits: 729147dcb523, 61d53f98d314, 7d74c29dcc07; files: src/cron/service/timer.ts, src/agents/pi-embedded-runner/run.ts, src/cron/isolated-agent/run-executor.ts)
  • ayanesakura: Filed the queue-wait timeout report and authored the earlier superseded timeout PR; the shipped changelog entry credits this context. (role: related contributor; confidence: medium; commits: 729147dcb523; files: src/cron/service/timer.ts, src/cron/service/timer.regression.test.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 5ecd01ff94d6.
