Skip to content

fix(cron): use per-attempt abort controller so model fallback chain survives timeout#38149

Closed
Sid-Qin wants to merge 1 commit intoopenclaw:mainfrom
Sid-Qin:fix/37505-cron-timeout-per-attempt-abort
Closed

fix(cron): use per-attempt abort controller so model fallback chain survives timeout#38149
Sid-Qin wants to merge 1 commit intoopenclaw:mainfrom
Sid-Qin:fix/37505-cron-timeout-per-attempt-abort

Conversation

@Sid-Qin
Copy link
Copy Markdown
Contributor

@Sid-Qin Sid-Qin commented Mar 6, 2026

Summary

  • Problem: Cron job timeout aborts a shared AbortController. When the first model attempt times out, subsequent fallback attempts immediately fail because they receive an already-aborted signal.
  • Why it matters: Model fallback chains become useless under cron timeout — the first timeout kills all alternatives, preventing recovery with a different (possibly faster) model.
  • What changed: In the run callback passed to runWithModelFallback, create a fresh AbortController for each attempt. Forward the outer signal's abort event to the per-attempt controller, so new attempts start with a non-aborted signal while remaining cancellable. Added 2 unit tests.
  • What did NOT change: timer.ts timeout logic, runWithModelFallback loop, CLI agent path (uses its own timeout), outer abort semantics.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Cron jobs with model fallback chains now correctly try alternative models after the primary times out, instead of aborting the entire chain.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS Darwin 25.3.0 (arm64)
  • Runtime: Node v25.6.1

Steps

  1. Configure a cron job with a slow primary model and a fallback chain
  2. Set a timeout that triggers before the primary completes

Expected

  • After primary times out, fallback models are attempted

Actual

  • Before fix: All fallback attempts fail immediately with "cron: job execution timed out" because they see the aborted signal
  • After fix: Each fallback attempt gets a fresh abort controller

Evidence

 ✓ src/cron/isolated-agent/run.abort-signal.test.ts (2 tests) 5ms
   ✓ passes a per-attempt abort signal to runEmbeddedPiAgent, not the outer signal
   ✓ forwards outer abort to the per-attempt signal

Human Verification (required)

  • Verified scenarios: fresh signal per attempt, outer abort forwarded, listener cleanup on success
  • Edge cases checked: outer already aborted at attempt start, multiple sequential attempts
  • What I did not verify: Live cron job with real model timeout + fallback

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert: Revert this commit
  • Files/config to restore: src/cron/isolated-agent/run.ts
  • Known bad symptoms: If per-attempt controller leaks (unlikely — addEventListener with once: true), abort forwarding could be delayed

Risks and Mitigations

Low — the change is scoped to the run callback closure. The outer abort is still forwarded, so cancellation semantics are preserved. The only behavior change is that new fallback attempts get a fresh signal.

…urvives timeout

The cron timeout handler aborts a shared AbortController. When the first
model attempt times out, subsequent fallback attempts receive an already-
aborted signal and fail immediately. Create a fresh AbortController for
each fallback attempt and forward the outer abort to it, so new attempts
start with a non-aborted signal while still being cancellable.

Closes openclaw#37505
@aisle-research-bot
Copy link
Copy Markdown

aisle-research-bot Bot commented Mar 6, 2026

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

# Severity Title
1 🔵 Low AbortSignal listener leak and ignored pre-aborted signal in runCronIsolatedAgentTurn attempt runner

1. 🔵 AbortSignal listener leak and ignored pre-aborted signal in runCronIsolatedAgentTurn attempt runner

Property Value
Severity Low
CWE CWE-772
Location src/cron/isolated-agent/run.ts:471-549

Description

In runCronIsolatedAgentTurn, the per-attempt abort forwarding registers an abort event listener on the outer abortSignal, but the listener cleanup is not guaranteed on all exit paths.

Impacts:

  • Listener leak / retained closures: If the attempt exits via an exception/rejection (e.g., runEmbeddedPiAgent(...) throws) or via the runCliAgent(...) early-return path, the abort listener is never removed. Over multiple model fallback attempts and/or multiple prompts within the same cron run, this can accumulate listeners and retain AbortController/closure objects.
  • DoS / resource overuse risk: runWithModelFallback can invoke run(...) multiple times (per candidate model). Repeated failing attempts can therefore register multiple listeners without cleanup, growing memory use and adding abort propagation overhead.
  • Cancellation bypass (functional/security impact): The previous code threw immediately when abortSignal.aborted was already true. The new code skips listener registration when already aborted, but also does not abort the per-attempt controller; the attempt proceeds with a non-aborted attemptSignal, potentially continuing expensive work after the job timed out/canceled.

Vulnerable code (listener added, but removed only on one successful path):

const attemptController = new AbortController();
const attemptSignal = attemptController.signal;
const forwardAbort = () => attemptController.abort(abortSignal?.reason);
if (abortSignal && !abortSignal.aborted) {
  abortSignal.addEventListener("abort", forwardAbort, { once: true });
}
...
if (isCliProvider(providerOverride, cfgWithAgentDefaults)) {
  const result = await runCliAgent({ ... });
  ...
  return result; // <- listener is never removed
}
const result = await runEmbeddedPiAgent({ ..., abortSignal: attemptSignal, ... });
...
abortSignal?.removeEventListener("abort", forwardAbort);
return result;

Why this can repeat:

  • runWithModelFallback loops through fallback candidates and calls params.run(...) for each attempt until one succeeds or all fail, so the above run callback may execute multiple times per cron turn.

Recommendation

Ensure listener cleanup and correct abort propagation on all exit paths by using try/finally, and immediately abort the per-attempt controller when the outer signal is already aborted.

Suggested pattern:

const attemptController = new AbortController();
const attemptSignal = attemptController.signal;
const forwardAbort = () => attemptController.abort(abortSignal?.reason);

if (abortSignal?.aborted) {
  attemptController.abort(abortSignal.reason);
} else if (abortSignal) {
  abortSignal.addEventListener("abort", forwardAbort, { once: true });
}

try {
  if (isCliProvider(providerOverride, cfgWithAgentDefaults)) {
    return await runCliAgent({ /* ... */ });
  }

  const result = await runEmbeddedPiAgent({/* ... */
    abortSignal: attemptSignal,
  });
  return result;
} finally {
  abortSignal?.removeEventListener("abort", forwardAbort);
}

This guarantees:

  • no leaked listeners on exceptions or early returns
  • per-attempt signal is aborted immediately if the outer job was already canceled/timed out

Analyzed PR: #38149 at commit b02874f

Last updated on: 2026-03-06T17:02:08Z

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 6, 2026

Greptile Summary

This PR correctly fixes a bug where the shared AbortController used by the cron timeout would abort the first model attempt and immediately fail all subsequent fallback attempts.

The fix: Creates a fresh AbortController per runWithModelFallback attempt with addEventListener("abort", ..., { once: true }) to forward the outer signal's abort event. This allows fallback models to start with a clean, non-aborted signal while remaining cancellable by the outer timeout.

Why it works: Each attempt gets its own controller. When the outer signal aborts, the { once: true } listener fires once and self-removes, forwarding the abort to the current attempt's controller. The loop in runWithModelFallback then proceeds to the next fallback with a new controller.

Tests: Well-structured coverage of signal isolation (confirms per-attempt controller is not the outer signal) and abort forwarding behavior.

Confidence Score: 5/5

  • The fix correctly implements per-attempt abort controllers that preserve fallback chain functionality under timeout.
  • The core fix is sound: fresh AbortController per attempt with proper signal forwarding ensures fallback attempts start with non-aborted signals while remaining cancellable by the outer timeout. Listener cleanup via { once: true } is sufficient for the job-scoped lifetime. Tests validate signal isolation and forwarding behavior. No functional issues identified.
  • No files require special attention

Last reviewed commit: b02874f

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b02874ff92

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +475 to +476
if (abortSignal && !abortSignal.aborted) {
abortSignal.addEventListener("abort", forwardAbort, { once: true });
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Abort per-attempt signal when outer timeout is already set

This logic only forwards cancellation when abortSignal is not yet aborted, so if the cron timeout fires between fallback attempts (or the signal is already aborted on entry), the new attemptSignal stays live and the next model attempt still runs. In executeJobCoreWithTimeout (src/cron/service/timer.ts), that outer signal is the job deadline mechanism, so this change can let timed-out jobs keep executing model/tool work (and potentially send side effects) after the scheduler has already treated the run as timed out.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cron job timeout aborts entire model fallback chain via shared AbortController

2 participants