fix(cron): stop fallback attempts when cron budget is exhausted by imwyvern · Pull Request #52365 · openclaw/openclaw

imwyvern · 2026-03-22T16:34:00Z

Problem

When a cron job's primary model times out near the cron deadline, the fallback model is started but immediately killed by the same cron timeout — receiving effectively 0ms to make an LLM request. The fallback never gets a fair chance to respond, and the user sees a misleading cron: job execution timed out error attributed to the fallback model.

Observed in production logs (2026-03-22):

[08:10:30.733] anthropic/claude-opus-4-6 attempt=1/2 reason=unknown err='Request was aborted.' duration=28667ms
[08:10:30.735] openai-codex/gpt-5.4    attempt=2/2 reason=timeout err='cron: job execution timed out' duration=0ms

The cron timeout wraps the entire fallback chain. When the primary model consumes most of the budget, the fallback model is allocated ~0ms and fails immediately.

Solution

Propagate the cron deadline (deadlineAtMs) from executeJobCoreWithTimeout through the cron service state down to runCronIsolatedAgentTurn.
Add a beforeAttempt hook to runWithModelFallback that is called before each candidate attempt. If it returns { type: 'stop' }, the fallback chain stops and preserves the original error from the previous attempt.
Check remaining budget in the cron layer's beforeAttempt callback. If less than min(30s, timeout/4) remains before the cron deadline, the fallback is skipped with a clear message:

Skipping fallback: only 50ms remain before cron timeout (need at least 30000ms).

This ensures:

The primary model's real error is preserved (not masked by a fake cron timeout on the fallback)
No wasted LLM API calls that would be aborted instantly
The beforeAttempt hook is generic and reusable for other pre-attempt checks

Changes

src/agents/model-fallback.ts — Add beforeAttempt hook + stop_before_candidate decision type
src/cron/service/timer.ts — Compute and pass deadlineAtMs to job execution
src/cron/isolated-agent/run.ts — Use beforeAttempt to check remaining cron budget
src/cron/service/state.ts + src/gateway/server-cron.ts — Thread deadlineAtMs through interfaces

Tests

model-fallback.test.ts — beforeAttempt stop behavior
run.fallback-time-budget.test.ts — Cron deadline propagation + fallback budget exhaustion
service.issue-regressions.test.ts — Deadline passed to isolated runs

greptile-apps · 2026-03-22T16:37:15Z

Greptile Summary

This PR fixes a real production bug where a cron job's fallback model was allocated ~0ms of budget when the primary model consumed most of the cron timeout, resulting in an immediate spurious failure attributed to the fallback. The fix propagates the outer cron deadline (deadlineAtMs) through the service layer into runCronIsolatedAgentTurn and adds a generic beforeAttempt hook to runWithModelFallback that lets callers veto a fallback candidate before the LLM call is made.

model-fallback.ts — beforeAttempt hook + stop_before_candidate log decision; clean placement after the cooldown-skip block so the hook is only invoked for candidates that would actually run. A break on stop falls through to throwFallbackFailureSummary, correctly re-throwing the primary model's real error.
timer.ts — deadlineAtMs = Date.now() + jobTimeoutMs computed at the moment the race starts; passed through executeJobCore → runIsolatedAgentJob.
run.ts — beforeAttempt callback checks remaining budget against max(1 s, min(30 s, timeout/4)), short-circuits the primary attempt guard (attempt <= 1), and handles missing/invalid deadline gracefully.
Tests — Coverage for the hook stop behavior, cron deadline propagation, and budget exhaustion. One assertion in service.issue-regressions.test.ts uses .toBe() to compare two separate Date.now() calls, which is slightly timing-sensitive and could flake under CI load (see inline comment).

Confidence Score: 5/5

Safe to merge — fix is well-scoped, backward-compatible, and tested end-to-end.
The production bug is clearly described and the fix is minimal and correct. The beforeAttempt hook is optional and defaults to a no-op, so existing callers are unaffected. Deadline propagation is purely additive through optional parameters. The only rough edge is a timing-sensitive .toBe() assertion in one test, which is unlikely to flake in practice but worth tidying.
No files require special attention beyond the minor test assertion noted in src/cron/service.issue-regressions.test.ts.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/cron/service.issue-regressions.test.ts
Line: 1115-1119

Comment:
**Flaky test: exact `Date.now()` equality**

`wallNow` is captured before the async function call, while `deadlineAtMs` inside `executeJobCoreWithTimeout` is set to a second `Date.now()` invocation. If even 1 ms elapses between the two calls (very possible under CI load), the `.toBe()` assertion fails. A range check or fake timers would make this deterministic:

```suggestion
    const wallNow = Date.now();
    const result = await executeJobCoreWithTimeout(state, cronJob);

    expect(result).toMatchObject({ status: "ok", summary: "done" });
    // Allow a small window for the two Date.now() calls to differ
    expect(observedDeadlineAtMs).toBeGreaterThanOrEqual(wallNow + timeoutSeconds * 1000);
    expect(observedDeadlineAtMs).toBeLessThan(wallNow + timeoutSeconds * 1000 + 500);
```

Alternatively, mock `Date.now` with `vi.spyOn(Date, "now").mockReturnValue(wallNow)` to make both calls return the same value and keep the exact `.toBe` check.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "fix(cron): stop fallback attempts when c..." | Re-trigger Greptile}

greptile-apps · 2026-03-22T16:37:19Z

+    const wallNow = Date.now();
+    const result = await executeJobCoreWithTimeout(state, cronJob);
+
+    expect(result).toMatchObject({ status: "ok", summary: "done" });
+    expect(observedDeadlineAtMs).toBe(wallNow + timeoutSeconds * 1000);


Flaky test: exact Date.now() equality

wallNow is captured before the async function call, while deadlineAtMs inside executeJobCoreWithTimeout is set to a second Date.now() invocation. If even 1 ms elapses between the two calls (very possible under CI load), the .toBe() assertion fails. A range check or fake timers would make this deterministic:

Suggested change

const wallNow = Date.now();

const result = await executeJobCoreWithTimeout(state, cronJob);

expect(result).toMatchObject({ status: "ok", summary: "done" });

expect(observedDeadlineAtMs).toBe(wallNow + timeoutSeconds * 1000);

const wallNow = Date.now();

const result = await executeJobCoreWithTimeout(state, cronJob);

expect(result).toMatchObject({ status: "ok", summary: "done" });

// Allow a small window for the two Date.now() calls to differ

expect(observedDeadlineAtMs).toBeGreaterThanOrEqual(wallNow + timeoutSeconds * 1000);

expect(observedDeadlineAtMs).toBeLessThan(wallNow + timeoutSeconds * 1000 + 500);

Alternatively, mock Date.now with vi.spyOn(Date, "now").mockReturnValue(wallNow) to make both calls return the same value and keep the exact .toBe check.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cron/service.issue-regressions.test.ts Line: 1115-1119 Comment: **Flaky test: exact `Date.now()` equality** `wallNow` is captured before the async function call, while `deadlineAtMs` inside `executeJobCoreWithTimeout` is set to a second `Date.now()` invocation. If even 1 ms elapses between the two calls (very possible under CI load), the `.toBe()` assertion fails. A range check or fake timers would make this deterministic: ```suggestion const wallNow = Date.now(); const result = await executeJobCoreWithTimeout(state, cronJob); expect(result).toMatchObject({ status: "ok", summary: "done" }); // Allow a small window for the two Date.now() calls to differ expect(observedDeadlineAtMs).toBeGreaterThanOrEqual(wallNow + timeoutSeconds * 1000); expect(observedDeadlineAtMs).toBeLessThan(wallNow + timeoutSeconds * 1000 + 500); ``` Alternatively, mock `Date.now` with `vi.spyOn(Date, "now").mockReturnValue(wallNow)` to make both calls return the same value and keep the exact `.toBe` check. How can I resolve this? If you propose a fix, please make it concise.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93927ee1ab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-22T16:39:58Z

+    if (beforeAttemptResult?.type === "stop") {
+      logModelFallbackDecision({
+        decision: "stop_before_candidate",
+        runId: params.runId,
+        requestedProvider: params.provider,
+        requestedModel: params.model,
+        candidate,
+        attempt: i + 1,
+        total: candidates.length,
+        reason: beforeAttemptResult.reason ?? "timeout",
+        error:
+          beforeAttemptResult.error ??
+          `Stopped fallback before ${candidate.provider}/${candidate.model}.`,
+        nextCandidate: candidates[i + 1],
+        isPrimary,
+        requestedModelMatched: requestedModel,
+        fallbackConfigured: hasFallbackCandidates,
+        previousAttempts: attempts,
+      });
+      break;


Preserve the last real error when a budget stop follows skips

When beforeAttempt returns stop here, the loop just breaks and falls through to throwFallbackFailureSummary(). That only rethrows the prior error when attempts.length <= 1, so any skipped candidate recorded earlier in this same run (for example the existing cooldown/auth skip branches) turns the result into a synthesized All models failed ... summary instead of the original primary failure. In a chain like “primary fails, next fallback is skipped, later fallback is stopped for low cron budget”, this change still masks the real error the commit is trying to preserve.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 34334173d5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-23T01:55:17Z

+      if (lastError !== undefined) {
+        throw lastError;
+      }
+      break;


Propagate stop errors when all prior candidates were skipped

When beforeAttempt stops here after only skip-only entries have accumulated (for example, the requested model is skipped by the cooldown/auth branch earlier in runWithModelFallback, and the cron budget gate fires on the next runnable fallback), lastError is still undefined. This break then falls through to throwFallbackFailureSummary(), so callers see a synthesized All models failed ... error instead of the explicit stop message from beforeAttempt. In that cron path the new budget-stop behavior is still masked by a misleading summary.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d628daad75

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-23T02:12:29Z

+    const beforeAttemptResult = await params.beforeAttempt?.({
+      candidate,
+      attempt: i + 1,
+      total: candidates.length,
+      previousAttempts: attempts,
+      isPrimary,
+      requestedModelMatched: requestedModel,
+      fallbackConfigured: hasFallbackCandidates,
+    });


Run the stop hook before marking cooldown probes

When beforeAttempt stops a near-deadline cron run on a cooldowned provider, the code above has already called markProbeAttempt() for that candidate before any request is sent. Because lastProbeAttempt is global and throttles probes for 30 seconds, the next fallback run can incorrectly skip a real cooldown-recovery probe even though this run never actually probed the provider. Moving the stop check ahead of the cooldown probe bookkeeping would avoid burning probe slots on aborted attempts.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1bbc38af88

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-23T02:20:30Z

+          const minRemainingMs = resolveCronFallbackMinRemainingMs(cronTimeoutMs);
+          if (remainingMs >= minRemainingMs) {


Base the fallback guard on the actual agent timeout

For cron jobs that rely on agents.defaults.timeoutSeconds instead of payload.timeoutSeconds, timeoutMs in this function can be much smaller than the cron safety timeout. The new guard uses cronTimeoutMs here, so those jobs still require 30s remaining before any fallback, even when each model attempt is capped to 5–10s. A primary failure in the last 20–25s will now skip fallbacks that still had enough budget to complete under the configured agent timeout.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-23T02:20:30Z

+git branch -D fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted local" || echo "Already gone"
+git push origin --delete fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted remote" || echo "Already gone from remote"


Remove the unconditional remote branch deletion

This helper always deletes fix/subagent-timeout-partial-results at the end, even though that branch is not in BRANCHES and the script never verifies that it was merged. Running the script to rebase the listed branches will also remove that remote branch from origin, which is destructive if the PR is still open or another maintainer is still using it.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1bbc38af88

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-23T02:20:34Z

+          const minRemainingMs = resolveCronFallbackMinRemainingMs(cronTimeoutMs);
+          if (remainingMs >= minRemainingMs) {


Base the fallback guard on the actual agent timeout

For cron jobs that rely on agents.defaults.timeoutSeconds instead of payload.timeoutSeconds, timeoutMs in this function can be much smaller than the cron safety timeout. The new guard uses cronTimeoutMs here, so those jobs still require 30s remaining before any fallback, even when each model attempt is capped to 5–10s. A primary failure in the last 20–25s will now skip fallbacks that still had enough budget to complete under the configured agent timeout.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-23T02:20:34Z

+git branch -D fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted local" || echo "Already gone"
+git push origin --delete fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted remote" || echo "Already gone from remote"


Remove the unconditional remote branch deletion

This helper always deletes fix/subagent-timeout-partial-results at the end, even though that branch is not in BRANCHES and the script never verifies that it was merged. Running the script to rebase the listed branches will also remove that remote branch from origin, which is destructive if the PR is still open or another maintainer is still using it.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4e231e921

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

imwyvern · 2026-04-11T17:36:16Z

Follow-up complete on this branch:

rebuilt from upstream/main and cherry-picked 80026983b91fa9fdcb1376640ed24ecef9d03786
rechecked the Codex P1 on model-fallback.ts; the current stop path already rethrows lastError, so no extra fallback logic change was needed
added one small follow-up fix for the regression test fixture import so the branch type-checks cleanly
verification: NODE_OPTIONS='--max-old-space-size=8192' npx tsc --noEmit

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8734c9627

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-11T17:43:26Z

+  const cronTimeoutMs = resolveCronJobTimeoutMs(params.job);
+  const fallbackMinRemainingMs =
+    typeof cronTimeoutMs === "number" && cronTimeoutMs > 0
+      ? resolveCronFallbackMinRemainingMs(cronTimeoutMs)


Base fallback guard on actual agent timeout

This computes fallbackMinRemainingMs from resolveCronJobTimeoutMs(params.job), which defaults agent-turn jobs to a 60-minute safety window when payload.timeoutSeconds is unset. In the same function, execution still uses prepared.context.timeoutMs for each model call, so jobs configured via agents.defaults.timeoutSeconds (for example 5–10s) can have fallbacks blocked whenever under 30s remain even though another attempt could still complete within the real per-attempt timeout. That makes the new guard overly strict and can incorrectly suppress useful fallback attempts.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-11T17:43:26Z

+      if (lastError !== undefined) {
+        throw lastError;
+      }
+      break;


Propagate stop reason when no prior candidate failed

When beforeAttempt requests stop but lastError is still undefined (for example, earlier candidates were only skipped for cooldown/auth), this branch falls through via break and later throws a synthesized fallback summary. That drops the explicit stop message from beforeAttempt and returns a misleading generic failure instead of the cron-budget stop reason, which is precisely the condition this hook is intended to communicate.

Useful? React with 👍 / 👎.

clawsweeper · 2026-04-28T22:13:52Z

Codex review: needs real behavior proof before merge.

Summary
The PR adds a model-fallback pre-attempt stop hook, threads cron deadline data into isolated cron execution, skips low-budget fallback candidates, adds regression tests, and updates the changelog.

Reproducibility: yes. The PR body includes production logs, and current-main source inspection shows no pre-attempt fallback guard before cron-selected fallback candidates run.

Real behavior proof
Needs real behavior proof before merge: The PR includes production before logs and test/typecheck/format verification, but no after-fix real cron run evidence showing the fallback skip behavior in a real setup. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
External PR needs contributor-owned real behavior proof plus fixes for the execution-start deadline and stop-error findings before merge.

Security
Cleared: Security review cleared: the diff is limited to TypeScript cron/model-fallback runtime code, tests, and a changelog entry, with no workflow, dependency, lockfile, package, secret, or artifact-download changes.

Review findings

[P2] Derive fallback deadlines from the execution-start timeout — src/cron/service/timer.ts:117
[P2] Preserve stop errors when no candidate failed — src/agents/model-fallback.ts:1114-1115

Review details

Best possible solution:

Land a rebased patch that keeps deferred isolated-timeout semantics, uses an execution-start-aligned fallback deadline, preserves explicit stop errors, and includes real behavior proof plus focused tests.

Do we have a high-confidence way to reproduce the issue?

Yes. The PR body includes production logs, and current-main source inspection shows no pre-attempt fallback guard before cron-selected fallback candidates run.

Is this the best way to solve the issue?

No. The approach is maintainable, but this PR is not the best mergeable implementation until the propagated deadline uses the same clock origin as the deferred cron timeout and stop-only failures preserve the hook's explicit error.

Full review comments:

[P2] Derive fallback deadlines from the execution-start timeout — src/cron/service/timer.ts:117
For detached agent-turn jobs, this captures deadlineAtMs before setup even though current main intentionally starts the actual abort timer only when onExecutionStarted fires. Slow setup or queue acquisition can therefore make the new fallback guard skip a candidate while the real cron timeout still has a full execution budget remaining.
Confidence: 0.91
[P2] Preserve stop errors when no candidate failed — src/agents/model-fallback.ts:1114-1115
If beforeAttempt stops after only cooldown/auth skips, lastError is still undefined. Breaking here falls through to the generic fallback summary, dropping the explicit Skipping fallback... error returned by the hook and reintroducing a misleading failure message.
Confidence: 0.87

Overall correctness: patch is incorrect
Overall confidence: 0.9

Acceptance criteria:

pnpm test src/agents/model-fallback.test.ts src/cron/isolated-agent/run.fallback-time-budget.test.ts src/cron/service/timer.test.ts src/cron/service/timer.regression.test.ts
pnpm check:changed in Testbox after rebase/adaptation

What I checked:

Current main lacks the pre-attempt guard: Current main's runWithModelFallback parameters end at classifyResult and the loop proceeds from cooldown handling directly into runFallbackAttempt, so cron cannot veto a selected fallback candidate before the model call. (src/agents/model-fallback.ts:827, 1819e41d2683)
Current main defers detached cron timeout start: executeJobCoreWithTimeout only starts the detached isolated agent-turn timeout from onExecutionStarted and has cleanupTimedOutCronAgentRun on timeout, so any propagated fallback deadline must share that same clock origin. (src/cron/service/timer.ts:125, 1819e41d2683)
PR head computes the propagated deadline too early: Latest PR head computes deadlineAtMs before the deferred timeout has started, then passes that fixed timestamp through to detached execution. (src/cron/service/timer.ts:117, d5591c3868c9)
PR head can still drop an explicit stop error: When beforeAttempt returns stop, the loop only sets rethrowLastError when lastError is defined and then breaks; skip-only prior candidates still fall through to a synthesized fallback summary. (src/agents/model-fallback.ts:1114, d5591c3868c9)
Related work is complementary, not a superseding merge: The open terminal-abort PR describes itself as complementary to this proactive cron-budget hook; the closed deferred-timeout PR is adjacent context, not a merged replacement for this exact guard.
Latest PR proof is not real behavior proof: The latest contributor comment reports targeted tests, formatting, and typecheck, but does not include after-fix logs, terminal output, screenshot, recording, or artifact from a real cron run showing the low-budget fallback being skipped. (d5591c3868c9)

Likely related people:

steipete: Recent GitHub path history shows steipete authored the deferred isolated-timeout start and timed-out-run cleanup that this PR must preserve, plus several adjacent model-fallback changes. (role: recent maintainer; confidence: high; commits: 729147dcb523, 61d53f98d314, aec5efed8d43; files: src/cron/service/timer.ts, src/cron/isolated-agent/run.ts, src/cron/isolated-agent/run-executor.ts)
vincentkoc: Recent path history links vincentkoc to model-fallback short-circuit work and cron isolated-runtime seam work near the affected fallback and cron execution boundaries. (role: adjacent fallback and cron history owner; confidence: medium; commits: 43a003b8a062, 21d850dd6656, 35176f3cb730; files: src/agents/model-fallback.ts, src/cron/isolated-agent/run.ts, src/cron/isolated-agent/run-executor.ts)
mbelinky: Recent cron service history links mbelinky to detached task lifecycle and next-run/backoff behavior around cron runtime state and timer maintenance. (role: adjacent cron lifecycle maintainer; confidence: medium; commits: 0787266637fd, 190a4b48697b; files: src/cron/service/timer.ts, src/cron/service/state.ts)

Remaining risk / open question:

Exact-head CI still has failing check-test-types, check-lint, checks-node-core-runtime-shared, and check-additional-extension-bundled jobs at d5591c3.
The contributor has not yet supplied after-fix real cron behavior proof; tests and copied verification commands do not prove the production runtime path.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1819e41d2683.

clawsweeper · 2026-04-29T16:55:17Z

Codex review: found issues before merge.

What this changes:

The PR adds a pre-attempt stop hook to model fallback, threads cron deadline information into isolated cron execution, and adds tests for skipping fallback attempts when little cron budget remains.

Maintainer follow-up before merge:

This is an open implementation PR with useful code, but the remaining action is maintainer review, rebase, and timeout-policy coordination rather than an autonomous replacement branch.

Review findings:

[P2] Base the fallback budget on the attempt timeout — src/cron/isolated-agent/run.ts:986

Review details

Best possible solution:

Land a rebased, narrow fix that keeps current deferred isolated-timeout cleanup semantics, adds only the needed generic pre-attempt stop seam, computes the skip threshold from the correct execution budget, and coordinates the terminal-abort contract with #42482 and #62682.

Full review comments:

[P2] Base the fallback budget on the attempt timeout — src/cron/isolated-agent/run.ts:986
resolveCronJobTimeoutMs(params.job) falls back to the 60-minute agent-turn safety timeout when payload.timeoutSeconds is unset, while the actual model attempt uses prepared.context.timeoutMs from agents.defaults.timeoutSeconds. With a 5-10s agent default, this guard still requires 30s of remaining outer budget and skips fallbacks that could have completed within the real attempt timeout.
Confidence: 0.88

Overall correctness: patch is incorrect
Overall confidence: 0.86

Acceptance criteria:

pnpm test src/agents/model-fallback.test.ts src/cron/isolated-agent/run.fallback-time-budget.test.ts src/cron/service/timer.test.ts src/cron/service/timer.regression.test.ts
pnpm check:changed in Testbox after rebase if shared cron/agent runtime files remain touched

What I checked:

Current main has no pre-attempt fallback hook: runWithModelFallback accepts config, provider/model, fallback overrides, run handlers, and result classification, but no beforeAttempt parameter; after cooldown handling it proceeds to runFallbackAttempt. (src/agents/model-fallback.ts:751, e46dccb35374)
Current cron executor cannot stop before candidate selection: createCronPromptExecutor calls runWithModelFallback without a pre-candidate guard; the cron abort check lives inside the run callback, after the candidate has already been selected. (src/cron/isolated-agent/run-executor.ts:121, e46dccb35374)
Current timeout plumbing has start notification but no deadline: executeJobCoreWithTimeout now defers isolated agent-turn timeouts via onExecutionStarted, and the state/gateway/run APIs expose onExecutionStarted rather than deadlineAtMs. (src/cron/service/timer.ts:102, e46dccb35374)
Existing tests cover adjacent timeout behavior only: Main has deferred-timeout and cleanup regression tests, but targeted search found no beforeAttempt, stop_before_candidate, Skipping fallback, or fallback budget test on the active branch. (src/cron/service/timer.regression.test.ts:644, e46dccb35374)
PR branch still carries the intended implementation surface: Provided PR context for head 48a621558196 shows additions for beforeAttempt, stop_before_candidate, deadlineAtMs threading, cron fallback minimum-budget checks, and src/cron/isolated-agent/run.fallback-time-budget.test.ts. (src/cron/isolated-agent/run.ts:985, 48a621558196)
Security review pass: The provided file set is TypeScript source and tests under src/agents, src/cron, and src/gateway; it does not add workflows, dependency manifests, lockfiles, lifecycle scripts, downloaded artifacts, or package publishing metadata. (48a621558196)

Likely related people:

steipete: Local blame/log on current main show Peter Steinberger authored the recent cron timeout cleanup and the current snapshot touching the central fallback, cron timer, isolated executor, and gateway cron wiring. (role: recent maintainer; confidence: high; commits: 61d53f98d314, 34d11d57579d; files: src/agents/model-fallback.ts, src/cron/service/timer.ts, src/cron/isolated-agent/run.ts)
vincentkoc: The existing ClawSweeper path-history context for this PR links Vincent Koc to the snapshot commit covering model fallback, cron timer, isolated executor, and gateway cron wiring. (role: adjacent fallback and cron history owner; confidence: medium; commits: ad2516b1c876; files: src/agents/model-fallback.ts, src/cron/service/timer.ts, src/cron/isolated-agent/run-executor.ts)
mbelinky: Prior path-history context links mbelinky to cron service and task lifecycle work near the timer/state deadline and execution lifecycle area affected by this PR. (role: adjacent cron service maintainer; confidence: medium; commits: 190a4b48697b, c60282421524; files: src/cron/service/timer.ts, src/cron/service/state.ts)

Remaining risk / open question:

The PR’s fallback guard appears too strict for jobs that inherit agents.defaults.timeoutSeconds, because it bases the threshold on the outer cron safety timeout rather than the actual per-attempt timeout.
The change overlaps open abort/timeout policy work in fix(cron): per-attempt AbortControllers and deferred execution timeout #42482 and fix(agents): distinguish terminal aborts from retryable failures (#60388) #62682, so maintainers need to keep retryable provider failures distinct from terminal cron-budget aborts.
The branch is size XL and touches shared cron/agent runtime files, so it needs a rebase onto current main and targeted validation before merge.

Codex review notes: model gpt-5.5, reasoning high; reviewed against e46dccb35374.

imwyvern · 2026-05-08T04:31:46Z

Rebased onto upstream/main (HEAD d5591c3). Addressed Codex review P2:

Fallback budget now uses the per-attempt timeout (prepared.context.timeoutMs) instead of the outer cron safety timeout. Jobs that rely on agents.defaults.timeoutSeconds no longer skip fallbacks spuriously.
Added regression test for outer-large / attempt-small budget scenario.

Verification:

pnpm test src/agents/model-fallback.test.ts src/cron/isolated-agent/run.fallback-time-budget.test.ts src/cron/service/timer.test.ts src/cron/service/timer.regression.test.ts — PASS
pnpm exec oxfmt --check --threads=1 src/agents/model-fallback.ts src/cron/isolated-agent/run.ts src/cron/isolated-agent/run-executor.ts src/cron/service/timer.ts CHANGELOG.md — PASS
pnpm exec tsc -p tsconfig.core.json --noEmit — PASS

CHANGELOG updated. Ready for another look.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d5591c3868

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-08T04:33:59Z

  }

  const runAbortController = new AbortController();
+  const deadlineAtMs = Date.now() + jobTimeoutMs;


Anchor fallback budget to the deferred timeout start

For detached agentTurn jobs, this deadline is computed before setup even though executeJobCoreWithTimeout intentionally defers starting the actual abort timer until onExecutionStarted fires. If delivery/session/skills/model setup takes significant time, the new beforeAttempt check sees that setup time as already spent and can skip a fallback with <budget remaining, while the real cron timeout still has a full jobTimeoutMs from execution start. Please derive deadlineAtMs when startTimeout() runs or otherwise use the same clock origin as the deferred timer.

Useful? React with 👍 / 👎.

openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling size: M labels Mar 22, 2026

greptile-apps Bot reviewed Mar 22, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Mar 22, 2026

View reviewed changes

imwyvern force-pushed the fix/cron-fallback-timeout branch 2 times, most recently from 6d3f4d7 to 3433417 Compare March 23, 2026 01:49

chatgpt-codex-connector Bot reviewed Mar 23, 2026

View reviewed changes

imwyvern force-pushed the fix/cron-fallback-timeout branch from 3433417 to d628daa Compare March 23, 2026 02:05

openclaw-barnacle Bot added channel: mattermost Channel integration: mattermost channel: msteams Channel integration: msteams channel: nextcloud-talk Channel integration: nextcloud-talk channel: irc labels Mar 23, 2026

chatgpt-codex-connector Bot reviewed Mar 23, 2026

View reviewed changes

imwyvern force-pushed the fix/cron-fallback-timeout branch from d628daa to 1bbc38a Compare March 23, 2026 02:14

chatgpt-codex-connector Bot reviewed Mar 23, 2026

View reviewed changes

imwyvern force-pushed the fix/cron-fallback-timeout branch from 1bbc38a to d31660a Compare March 28, 2026 09:37

openclaw-barnacle Bot removed channel: mattermost Channel integration: mattermost channel: msteams Channel integration: msteams channel: irc labels Mar 28, 2026

imwyvern force-pushed the fix/cron-fallback-timeout branch from d31660a to 9cab596 Compare March 28, 2026 15:35

openclaw-barnacle Bot removed the channel: nextcloud-talk Channel integration: nextcloud-talk label Mar 28, 2026

imwyvern force-pushed the fix/cron-fallback-timeout branch from 9cab596 to 97875ec Compare March 28, 2026 16:05

openclaw-barnacle Bot added the docs Improvements or additions to documentation label Mar 28, 2026

imwyvern force-pushed the fix/cron-fallback-timeout branch from 97875ec to a4e231e Compare March 30, 2026 08:34

openclaw-barnacle Bot removed the docs Improvements or additions to documentation label Mar 30, 2026

chatgpt-codex-connector Bot reviewed Mar 30, 2026

View reviewed changes

Comment thread src/cron/service/timer.ts Outdated

imwyvern force-pushed the fix/cron-fallback-timeout branch from a4e231e to 3f552d9 Compare April 1, 2026 06:49

imwyvern force-pushed the fix/cron-fallback-timeout branch from 8002698 to d8734c9 Compare April 11, 2026 17:36

chatgpt-codex-connector Bot reviewed Apr 11, 2026

View reviewed changes

imwyvern force-pushed the fix/cron-fallback-timeout branch 4 times, most recently from 15a4f17 to a5cf851 Compare April 19, 2026 18:00

iPythoning mentioned this pull request Apr 22, 2026

Feature request: session.maxDurationMinutes and session.maxTokensPerSession config keys #64463

Open

imwyvern force-pushed the fix/cron-fallback-timeout branch 7 times, most recently from df70038 to 61fa650 Compare April 26, 2026 17:29

BingqingLyu mentioned this pull request Apr 27, 2026

fix(cron): stop fallback attempts when cron budget is exhausted BingqingLyu/openclaw#1239

Open

clawsweeper Bot mentioned this pull request Apr 28, 2026

fix(cron): per-attempt AbortControllers and deferred execution timeout #42482

Closed

3 tasks

BingqingLyu mentioned this pull request Apr 28, 2026

fix(agents): distinguish terminal aborts from retryable failures (#60388) BingqingLyu/openclaw#2114

Open

3 tasks

imwyvern force-pushed the fix/cron-fallback-timeout branch from 48a6215 to f0a3253 Compare April 30, 2026 15:13

imwyvern and others added 5 commits May 7, 2026 21:25

fix(cron): stop fallback attempts when cron budget is exhausted

a89f2cb

test(cron): fix issue regression fixture import

287951b

fix(cron): normalize fallback gate returns

22bcf92

fix(cron): address P1 review feedback for fallback budget

11cfc43

test(cron): cover default timeout fallback budget

d5591c3

imwyvern force-pushed the fix/cron-fallback-timeout branch from f0a3253 to d5591c3 Compare May 8, 2026 04:31

openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 8, 2026

chatgpt-codex-connector Bot reviewed May 8, 2026

View reviewed changes

		const minRemainingMs = resolveCronFallbackMinRemainingMs(cronTimeoutMs);
		if (remainingMs >= minRemainingMs) {

		git branch -D fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted local" \|\| echo "Already gone"
		git push origin --delete fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted remote" \|\| echo "Already gone from remote"

Uh oh!

Conversation

imwyvern commented Mar 22, 2026

Problem

Solution

Changes

Tests

Uh oh!

greptile-apps Bot commented Mar 22, 2026

Greptile Summary

Confidence Score: 5/5

Uh oh!

greptile-apps Bot Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

imwyvern commented Apr 11, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

clawsweeper Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clawsweeper Bot commented Apr 29, 2026

Uh oh!

imwyvern commented May 8, 2026

Uh oh!

clawsweeper Bot commented Apr 28, 2026 •

edited

Loading