Skip to content

fix(cron): stop fallback attempts when cron budget is exhausted#52365

Open
imwyvern wants to merge 5 commits intoopenclaw:mainfrom
imwyvern:fix/cron-fallback-timeout
Open

fix(cron): stop fallback attempts when cron budget is exhausted#52365
imwyvern wants to merge 5 commits intoopenclaw:mainfrom
imwyvern:fix/cron-fallback-timeout

Conversation

@imwyvern
Copy link
Copy Markdown
Contributor

Problem

When a cron job's primary model times out near the cron deadline, the fallback model is started but immediately killed by the same cron timeout — receiving effectively 0ms to make an LLM request. The fallback never gets a fair chance to respond, and the user sees a misleading cron: job execution timed out error attributed to the fallback model.

Observed in production logs (2026-03-22):

[08:10:30.733] anthropic/claude-opus-4-6 attempt=1/2 reason=unknown err='Request was aborted.' duration=28667ms
[08:10:30.735] openai-codex/gpt-5.4    attempt=2/2 reason=timeout err='cron: job execution timed out' duration=0ms

The cron timeout wraps the entire fallback chain. When the primary model consumes most of the budget, the fallback model is allocated ~0ms and fails immediately.

Solution

  1. Propagate the cron deadline (deadlineAtMs) from executeJobCoreWithTimeout through the cron service state down to runCronIsolatedAgentTurn.

  2. Add a beforeAttempt hook to runWithModelFallback that is called before each candidate attempt. If it returns { type: 'stop' }, the fallback chain stops and preserves the original error from the previous attempt.

  3. Check remaining budget in the cron layer's beforeAttempt callback. If less than min(30s, timeout/4) remains before the cron deadline, the fallback is skipped with a clear message:

    Skipping fallback: only 50ms remain before cron timeout (need at least 30000ms).

This ensures:

  • The primary model's real error is preserved (not masked by a fake cron timeout on the fallback)
  • No wasted LLM API calls that would be aborted instantly
  • The beforeAttempt hook is generic and reusable for other pre-attempt checks

Changes

  • src/agents/model-fallback.ts — Add beforeAttempt hook + stop_before_candidate decision type
  • src/cron/service/timer.ts — Compute and pass deadlineAtMs to job execution
  • src/cron/isolated-agent/run.ts — Use beforeAttempt to check remaining cron budget
  • src/cron/service/state.ts + src/gateway/server-cron.ts — Thread deadlineAtMs through interfaces

Tests

  • model-fallback.test.ts — beforeAttempt stop behavior
  • run.fallback-time-budget.test.ts — Cron deadline propagation + fallback budget exhaustion
  • service.issue-regressions.test.ts — Deadline passed to isolated runs

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling size: M labels Mar 22, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 22, 2026

Greptile Summary

This PR fixes a real production bug where a cron job's fallback model was allocated ~0ms of budget when the primary model consumed most of the cron timeout, resulting in an immediate spurious failure attributed to the fallback. The fix propagates the outer cron deadline (deadlineAtMs) through the service layer into runCronIsolatedAgentTurn and adds a generic beforeAttempt hook to runWithModelFallback that lets callers veto a fallback candidate before the LLM call is made.

  • model-fallback.tsbeforeAttempt hook + stop_before_candidate log decision; clean placement after the cooldown-skip block so the hook is only invoked for candidates that would actually run. A break on stop falls through to throwFallbackFailureSummary, correctly re-throwing the primary model's real error.
  • timer.tsdeadlineAtMs = Date.now() + jobTimeoutMs computed at the moment the race starts; passed through executeJobCorerunIsolatedAgentJob.
  • run.tsbeforeAttempt callback checks remaining budget against max(1 s, min(30 s, timeout/4)), short-circuits the primary attempt guard (attempt <= 1), and handles missing/invalid deadline gracefully.
  • Tests — Coverage for the hook stop behavior, cron deadline propagation, and budget exhaustion. One assertion in service.issue-regressions.test.ts uses .toBe() to compare two separate Date.now() calls, which is slightly timing-sensitive and could flake under CI load (see inline comment).

Confidence Score: 5/5

  • Safe to merge — fix is well-scoped, backward-compatible, and tested end-to-end.
  • The production bug is clearly described and the fix is minimal and correct. The beforeAttempt hook is optional and defaults to a no-op, so existing callers are unaffected. Deadline propagation is purely additive through optional parameters. The only rough edge is a timing-sensitive .toBe() assertion in one test, which is unlikely to flake in practice but worth tidying.
  • No files require special attention beyond the minor test assertion noted in src/cron/service.issue-regressions.test.ts.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/cron/service.issue-regressions.test.ts
Line: 1115-1119

Comment:
**Flaky test: exact `Date.now()` equality**

`wallNow` is captured before the async function call, while `deadlineAtMs` inside `executeJobCoreWithTimeout` is set to a second `Date.now()` invocation. If even 1 ms elapses between the two calls (very possible under CI load), the `.toBe()` assertion fails. A range check or fake timers would make this deterministic:

```suggestion
    const wallNow = Date.now();
    const result = await executeJobCoreWithTimeout(state, cronJob);

    expect(result).toMatchObject({ status: "ok", summary: "done" });
    // Allow a small window for the two Date.now() calls to differ
    expect(observedDeadlineAtMs).toBeGreaterThanOrEqual(wallNow + timeoutSeconds * 1000);
    expect(observedDeadlineAtMs).toBeLessThan(wallNow + timeoutSeconds * 1000 + 500);
```

Alternatively, mock `Date.now` with `vi.spyOn(Date, "now").mockReturnValue(wallNow)` to make both calls return the same value and keep the exact `.toBe` check.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(cron): stop fallback attempts when c..." | Re-trigger Greptile

Comment on lines +1115 to +1119
const wallNow = Date.now();
const result = await executeJobCoreWithTimeout(state, cronJob);

expect(result).toMatchObject({ status: "ok", summary: "done" });
expect(observedDeadlineAtMs).toBe(wallNow + timeoutSeconds * 1000);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Flaky test: exact Date.now() equality

wallNow is captured before the async function call, while deadlineAtMs inside executeJobCoreWithTimeout is set to a second Date.now() invocation. If even 1 ms elapses between the two calls (very possible under CI load), the .toBe() assertion fails. A range check or fake timers would make this deterministic:

Suggested change
const wallNow = Date.now();
const result = await executeJobCoreWithTimeout(state, cronJob);
expect(result).toMatchObject({ status: "ok", summary: "done" });
expect(observedDeadlineAtMs).toBe(wallNow + timeoutSeconds * 1000);
const wallNow = Date.now();
const result = await executeJobCoreWithTimeout(state, cronJob);
expect(result).toMatchObject({ status: "ok", summary: "done" });
// Allow a small window for the two Date.now() calls to differ
expect(observedDeadlineAtMs).toBeGreaterThanOrEqual(wallNow + timeoutSeconds * 1000);
expect(observedDeadlineAtMs).toBeLessThan(wallNow + timeoutSeconds * 1000 + 500);

Alternatively, mock Date.now with vi.spyOn(Date, "now").mockReturnValue(wallNow) to make both calls return the same value and keep the exact .toBe check.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service.issue-regressions.test.ts
Line: 1115-1119

Comment:
**Flaky test: exact `Date.now()` equality**

`wallNow` is captured before the async function call, while `deadlineAtMs` inside `executeJobCoreWithTimeout` is set to a second `Date.now()` invocation. If even 1 ms elapses between the two calls (very possible under CI load), the `.toBe()` assertion fails. A range check or fake timers would make this deterministic:

```suggestion
    const wallNow = Date.now();
    const result = await executeJobCoreWithTimeout(state, cronJob);

    expect(result).toMatchObject({ status: "ok", summary: "done" });
    // Allow a small window for the two Date.now() calls to differ
    expect(observedDeadlineAtMs).toBeGreaterThanOrEqual(wallNow + timeoutSeconds * 1000);
    expect(observedDeadlineAtMs).toBeLessThan(wallNow + timeoutSeconds * 1000 + 500);
```

Alternatively, mock `Date.now` with `vi.spyOn(Date, "now").mockReturnValue(wallNow)` to make both calls return the same value and keep the exact `.toBe` check.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93927ee1ab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +687 to +706
if (beforeAttemptResult?.type === "stop") {
logModelFallbackDecision({
decision: "stop_before_candidate",
runId: params.runId,
requestedProvider: params.provider,
requestedModel: params.model,
candidate,
attempt: i + 1,
total: candidates.length,
reason: beforeAttemptResult.reason ?? "timeout",
error:
beforeAttemptResult.error ??
`Stopped fallback before ${candidate.provider}/${candidate.model}.`,
nextCandidate: candidates[i + 1],
isPrimary,
requestedModelMatched: requestedModel,
fallbackConfigured: hasFallbackCandidates,
previousAttempts: attempts,
});
break;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve the last real error when a budget stop follows skips

When beforeAttempt returns stop here, the loop just breaks and falls through to throwFallbackFailureSummary(). That only rethrows the prior error when attempts.length <= 1, so any skipped candidate recorded earlier in this same run (for example the existing cooldown/auth skip branches) turns the result into a synthesized All models failed ... summary instead of the original primary failure. In a chain like “primary fails, next fallback is skipped, later fallback is stopped for low cron budget”, this change still masks the real error the commit is trying to preserve.

Useful? React with 👍 / 👎.

@imwyvern imwyvern force-pushed the fix/cron-fallback-timeout branch 2 times, most recently from 6d3f4d7 to 3433417 Compare March 23, 2026 01:49
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 34334173d5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/model-fallback.ts Outdated
Comment on lines +706 to +709
if (lastError !== undefined) {
throw lastError;
}
break;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate stop errors when all prior candidates were skipped

When beforeAttempt stops here after only skip-only entries have accumulated (for example, the requested model is skipped by the cooldown/auth branch earlier in runWithModelFallback, and the cron budget gate fires on the next runnable fallback), lastError is still undefined. This break then falls through to throwFallbackFailureSummary(), so callers see a synthesized All models failed ... error instead of the explicit stop message from beforeAttempt. In that cron path the new budget-stop behavior is still masked by a misleading summary.

Useful? React with 👍 / 👎.

@imwyvern imwyvern force-pushed the fix/cron-fallback-timeout branch from 3433417 to d628daa Compare March 23, 2026 02:05
@openclaw-barnacle openclaw-barnacle Bot added channel: mattermost Channel integration: mattermost channel: msteams Channel integration: msteams channel: nextcloud-talk Channel integration: nextcloud-talk channel: irc labels Mar 23, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d628daad75

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +678 to +686
const beforeAttemptResult = await params.beforeAttempt?.({
candidate,
attempt: i + 1,
total: candidates.length,
previousAttempts: attempts,
isPrimary,
requestedModelMatched: requestedModel,
fallbackConfigured: hasFallbackCandidates,
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Run the stop hook before marking cooldown probes

When beforeAttempt stops a near-deadline cron run on a cooldowned provider, the code above has already called markProbeAttempt() for that candidate before any request is sent. Because lastProbeAttempt is global and throttles probes for 30 seconds, the next fallback run can incorrectly skip a real cooldown-recovery probe even though this run never actually probed the provider. Moving the stop check ahead of the cooldown probe bookkeeping would avoid burning probe slots on aborted attempts.

Useful? React with 👍 / 👎.

@imwyvern imwyvern force-pushed the fix/cron-fallback-timeout branch from d628daa to 1bbc38a Compare March 23, 2026 02:14
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1bbc38af88

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cron/isolated-agent/run.ts Outdated
Comment on lines +484 to +485
const minRemainingMs = resolveCronFallbackMinRemainingMs(cronTimeoutMs);
if (remainingMs >= minRemainingMs) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Base the fallback guard on the actual agent timeout

For cron jobs that rely on agents.defaults.timeoutSeconds instead of payload.timeoutSeconds, timeoutMs in this function can be much smaller than the cron safety timeout. The new guard uses cronTimeoutMs here, so those jobs still require 30s remaining before any fallback, even when each model attempt is capped to 5–10s. A primary failure in the last 20–25s will now skip fallbacks that still had enough budget to complete under the configured agent timeout.

Useful? React with 👍 / 👎.

Comment thread rebase-all.sh Outdated
Comment on lines +43 to +44
git branch -D fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted local" || echo "Already gone"
git push origin --delete fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted remote" || echo "Already gone from remote"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove the unconditional remote branch deletion

This helper always deletes fix/subagent-timeout-partial-results at the end, even though that branch is not in BRANCHES and the script never verifies that it was merged. Running the script to rebase the listed branches will also remove that remote branch from origin, which is destructive if the PR is still open or another maintainer is still using it.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1bbc38af88

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cron/isolated-agent/run.ts Outdated
Comment on lines +484 to +485
const minRemainingMs = resolveCronFallbackMinRemainingMs(cronTimeoutMs);
if (remainingMs >= minRemainingMs) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Base the fallback guard on the actual agent timeout

For cron jobs that rely on agents.defaults.timeoutSeconds instead of payload.timeoutSeconds, timeoutMs in this function can be much smaller than the cron safety timeout. The new guard uses cronTimeoutMs here, so those jobs still require 30s remaining before any fallback, even when each model attempt is capped to 5–10s. A primary failure in the last 20–25s will now skip fallbacks that still had enough budget to complete under the configured agent timeout.

Useful? React with 👍 / 👎.

Comment thread rebase-all.sh Outdated
Comment on lines +43 to +44
git branch -D fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted local" || echo "Already gone"
git push origin --delete fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted remote" || echo "Already gone from remote"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove the unconditional remote branch deletion

This helper always deletes fix/subagent-timeout-partial-results at the end, even though that branch is not in BRANCHES and the script never verifies that it was merged. Running the script to rebase the listed branches will also remove that remote branch from origin, which is destructive if the PR is still open or another maintainer is still using it.

Useful? React with 👍 / 👎.

@imwyvern imwyvern force-pushed the fix/cron-fallback-timeout branch from 1bbc38a to d31660a Compare March 28, 2026 09:37
@openclaw-barnacle openclaw-barnacle Bot removed channel: mattermost Channel integration: mattermost channel: msteams Channel integration: msteams channel: irc labels Mar 28, 2026
@imwyvern imwyvern force-pushed the fix/cron-fallback-timeout branch from d31660a to 9cab596 Compare March 28, 2026 15:35
@openclaw-barnacle openclaw-barnacle Bot removed the channel: nextcloud-talk Channel integration: nextcloud-talk label Mar 28, 2026
@imwyvern imwyvern force-pushed the fix/cron-fallback-timeout branch from 9cab596 to 97875ec Compare March 28, 2026 16:05
@openclaw-barnacle openclaw-barnacle Bot added the docs Improvements or additions to documentation label Mar 28, 2026
@imwyvern imwyvern force-pushed the fix/cron-fallback-timeout branch from 97875ec to a4e231e Compare March 30, 2026 08:34
@openclaw-barnacle openclaw-barnacle Bot removed the docs Improvements or additions to documentation label Mar 30, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4e231e921

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cron/service/timer.ts Outdated
@imwyvern imwyvern force-pushed the fix/cron-fallback-timeout branch from a4e231e to 3f552d9 Compare April 1, 2026 06:49
@imwyvern imwyvern force-pushed the fix/cron-fallback-timeout branch from 8002698 to d8734c9 Compare April 11, 2026 17:36
@imwyvern
Copy link
Copy Markdown
Contributor Author

Follow-up complete on this branch:

  • rebuilt from upstream/main and cherry-picked 80026983b91fa9fdcb1376640ed24ecef9d03786
  • rechecked the Codex P1 on model-fallback.ts; the current stop path already rethrows lastError, so no extra fallback logic change was needed
  • added one small follow-up fix for the regression test fixture import so the branch type-checks cleanly
  • verification: NODE_OPTIONS='--max-old-space-size=8192' npx tsc --noEmit

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8734c9627

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cron/isolated-agent/run.ts Outdated
Comment on lines +682 to +685
const cronTimeoutMs = resolveCronJobTimeoutMs(params.job);
const fallbackMinRemainingMs =
typeof cronTimeoutMs === "number" && cronTimeoutMs > 0
? resolveCronFallbackMinRemainingMs(cronTimeoutMs)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Base fallback guard on actual agent timeout

This computes fallbackMinRemainingMs from resolveCronJobTimeoutMs(params.job), which defaults agent-turn jobs to a 60-minute safety window when payload.timeoutSeconds is unset. In the same function, execution still uses prepared.context.timeoutMs for each model call, so jobs configured via agents.defaults.timeoutSeconds (for example 5–10s) can have fallbacks blocked whenever under 30s remain even though another attempt could still complete within the real per-attempt timeout. That makes the new guard overly strict and can incorrectly suppress useful fallback attempts.

Useful? React with 👍 / 👎.

Comment thread src/agents/model-fallback.ts Outdated
Comment on lines +815 to +818
if (lastError !== undefined) {
throw lastError;
}
break;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate stop reason when no prior candidate failed

When beforeAttempt requests stop but lastError is still undefined (for example, earlier candidates were only skipped for cooldown/auth), this branch falls through via break and later throws a synthesized fallback summary. That drops the explicit stop message from beforeAttempt and returns a misleading generic failure instead of the cron-budget stop reason, which is precisely the condition this hook is intended to communicate.

Useful? React with 👍 / 👎.

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 28, 2026

Codex review: needs real behavior proof before merge.

Summary
The PR adds a model-fallback pre-attempt stop hook, threads cron deadline data into isolated cron execution, skips low-budget fallback candidates, adds regression tests, and updates the changelog.

Reproducibility: yes. The PR body includes production logs, and current-main source inspection shows no pre-attempt fallback guard before cron-selected fallback candidates run.

Real behavior proof
Needs real behavior proof before merge: The PR includes production before logs and test/typecheck/format verification, but no after-fix real cron run evidence showing the fallback skip behavior in a real setup. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
External PR needs contributor-owned real behavior proof plus fixes for the execution-start deadline and stop-error findings before merge.

Security
Cleared: Security review cleared: the diff is limited to TypeScript cron/model-fallback runtime code, tests, and a changelog entry, with no workflow, dependency, lockfile, package, secret, or artifact-download changes.

Review findings

  • [P2] Derive fallback deadlines from the execution-start timeout — src/cron/service/timer.ts:117
  • [P2] Preserve stop errors when no candidate failed — src/agents/model-fallback.ts:1114-1115
Review details

Best possible solution:

Land a rebased patch that keeps deferred isolated-timeout semantics, uses an execution-start-aligned fallback deadline, preserves explicit stop errors, and includes real behavior proof plus focused tests.

Do we have a high-confidence way to reproduce the issue?

Yes. The PR body includes production logs, and current-main source inspection shows no pre-attempt fallback guard before cron-selected fallback candidates run.

Is this the best way to solve the issue?

No. The approach is maintainable, but this PR is not the best mergeable implementation until the propagated deadline uses the same clock origin as the deferred cron timeout and stop-only failures preserve the hook's explicit error.

Full review comments:

  • [P2] Derive fallback deadlines from the execution-start timeout — src/cron/service/timer.ts:117
    For detached agent-turn jobs, this captures deadlineAtMs before setup even though current main intentionally starts the actual abort timer only when onExecutionStarted fires. Slow setup or queue acquisition can therefore make the new fallback guard skip a candidate while the real cron timeout still has a full execution budget remaining.
    Confidence: 0.91
  • [P2] Preserve stop errors when no candidate failed — src/agents/model-fallback.ts:1114-1115
    If beforeAttempt stops after only cooldown/auth skips, lastError is still undefined. Breaking here falls through to the generic fallback summary, dropping the explicit Skipping fallback... error returned by the hook and reintroducing a misleading failure message.
    Confidence: 0.87

Overall correctness: patch is incorrect
Overall confidence: 0.9

Acceptance criteria:

  • pnpm test src/agents/model-fallback.test.ts src/cron/isolated-agent/run.fallback-time-budget.test.ts src/cron/service/timer.test.ts src/cron/service/timer.regression.test.ts
  • pnpm check:changed in Testbox after rebase/adaptation

What I checked:

  • Current main lacks the pre-attempt guard: Current main's runWithModelFallback parameters end at classifyResult and the loop proceeds from cooldown handling directly into runFallbackAttempt, so cron cannot veto a selected fallback candidate before the model call. (src/agents/model-fallback.ts:827, 1819e41d2683)
  • Current main defers detached cron timeout start: executeJobCoreWithTimeout only starts the detached isolated agent-turn timeout from onExecutionStarted and has cleanupTimedOutCronAgentRun on timeout, so any propagated fallback deadline must share that same clock origin. (src/cron/service/timer.ts:125, 1819e41d2683)
  • PR head computes the propagated deadline too early: Latest PR head computes deadlineAtMs before the deferred timeout has started, then passes that fixed timestamp through to detached execution. (src/cron/service/timer.ts:117, d5591c3868c9)
  • PR head can still drop an explicit stop error: When beforeAttempt returns stop, the loop only sets rethrowLastError when lastError is defined and then breaks; skip-only prior candidates still fall through to a synthesized fallback summary. (src/agents/model-fallback.ts:1114, d5591c3868c9)
  • Related work is complementary, not a superseding merge: The open terminal-abort PR describes itself as complementary to this proactive cron-budget hook; the closed deferred-timeout PR is adjacent context, not a merged replacement for this exact guard.
  • Latest PR proof is not real behavior proof: The latest contributor comment reports targeted tests, formatting, and typecheck, but does not include after-fix logs, terminal output, screenshot, recording, or artifact from a real cron run showing the low-budget fallback being skipped. (d5591c3868c9)

Likely related people:

  • steipete: Recent GitHub path history shows steipete authored the deferred isolated-timeout start and timed-out-run cleanup that this PR must preserve, plus several adjacent model-fallback changes. (role: recent maintainer; confidence: high; commits: 729147dcb523, 61d53f98d314, aec5efed8d43; files: src/cron/service/timer.ts, src/cron/isolated-agent/run.ts, src/cron/isolated-agent/run-executor.ts)
  • vincentkoc: Recent path history links vincentkoc to model-fallback short-circuit work and cron isolated-runtime seam work near the affected fallback and cron execution boundaries. (role: adjacent fallback and cron history owner; confidence: medium; commits: 43a003b8a062, 21d850dd6656, 35176f3cb730; files: src/agents/model-fallback.ts, src/cron/isolated-agent/run.ts, src/cron/isolated-agent/run-executor.ts)
  • mbelinky: Recent cron service history links mbelinky to detached task lifecycle and next-run/backoff behavior around cron runtime state and timer maintenance. (role: adjacent cron lifecycle maintainer; confidence: medium; commits: 0787266637fd, 190a4b48697b; files: src/cron/service/timer.ts, src/cron/service/state.ts)

Remaining risk / open question:

  • Exact-head CI still has failing check-test-types, check-lint, checks-node-core-runtime-shared, and check-additional-extension-bundled jobs at d5591c3.
  • The contributor has not yet supplied after-fix real cron behavior proof; tests and copied verification commands do not prove the production runtime path.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1819e41d2683.

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: found issues before merge.

What this changes:

The PR adds a pre-attempt stop hook to model fallback, threads cron deadline information into isolated cron execution, and adds tests for skipping fallback attempts when little cron budget remains.

Maintainer follow-up before merge:

This is an open implementation PR with useful code, but the remaining action is maintainer review, rebase, and timeout-policy coordination rather than an autonomous replacement branch.

Review findings:

  • [P2] Base the fallback budget on the attempt timeout — src/cron/isolated-agent/run.ts:986
Review details

Best possible solution:

Land a rebased, narrow fix that keeps current deferred isolated-timeout cleanup semantics, adds only the needed generic pre-attempt stop seam, computes the skip threshold from the correct execution budget, and coordinates the terminal-abort contract with #42482 and #62682.

Full review comments:

  • [P2] Base the fallback budget on the attempt timeout — src/cron/isolated-agent/run.ts:986
    resolveCronJobTimeoutMs(params.job) falls back to the 60-minute agent-turn safety timeout when payload.timeoutSeconds is unset, while the actual model attempt uses prepared.context.timeoutMs from agents.defaults.timeoutSeconds. With a 5-10s agent default, this guard still requires 30s of remaining outer budget and skips fallbacks that could have completed within the real attempt timeout.
    Confidence: 0.88

Overall correctness: patch is incorrect
Overall confidence: 0.86

Acceptance criteria:

  • pnpm test src/agents/model-fallback.test.ts src/cron/isolated-agent/run.fallback-time-budget.test.ts src/cron/service/timer.test.ts src/cron/service/timer.regression.test.ts
  • pnpm check:changed in Testbox after rebase if shared cron/agent runtime files remain touched

What I checked:

  • Current main has no pre-attempt fallback hook: runWithModelFallback accepts config, provider/model, fallback overrides, run handlers, and result classification, but no beforeAttempt parameter; after cooldown handling it proceeds to runFallbackAttempt. (src/agents/model-fallback.ts:751, e46dccb35374)
  • Current cron executor cannot stop before candidate selection: createCronPromptExecutor calls runWithModelFallback without a pre-candidate guard; the cron abort check lives inside the run callback, after the candidate has already been selected. (src/cron/isolated-agent/run-executor.ts:121, e46dccb35374)
  • Current timeout plumbing has start notification but no deadline: executeJobCoreWithTimeout now defers isolated agent-turn timeouts via onExecutionStarted, and the state/gateway/run APIs expose onExecutionStarted rather than deadlineAtMs. (src/cron/service/timer.ts:102, e46dccb35374)
  • Existing tests cover adjacent timeout behavior only: Main has deferred-timeout and cleanup regression tests, but targeted search found no beforeAttempt, stop_before_candidate, Skipping fallback, or fallback budget test on the active branch. (src/cron/service/timer.regression.test.ts:644, e46dccb35374)
  • PR branch still carries the intended implementation surface: Provided PR context for head 48a621558196 shows additions for beforeAttempt, stop_before_candidate, deadlineAtMs threading, cron fallback minimum-budget checks, and src/cron/isolated-agent/run.fallback-time-budget.test.ts. (src/cron/isolated-agent/run.ts:985, 48a621558196)
  • Security review pass: The provided file set is TypeScript source and tests under src/agents, src/cron, and src/gateway; it does not add workflows, dependency manifests, lockfiles, lifecycle scripts, downloaded artifacts, or package publishing metadata. (48a621558196)

Likely related people:

  • steipete: Local blame/log on current main show Peter Steinberger authored the recent cron timeout cleanup and the current snapshot touching the central fallback, cron timer, isolated executor, and gateway cron wiring. (role: recent maintainer; confidence: high; commits: 61d53f98d314, 34d11d57579d; files: src/agents/model-fallback.ts, src/cron/service/timer.ts, src/cron/isolated-agent/run.ts)
  • vincentkoc: The existing ClawSweeper path-history context for this PR links Vincent Koc to the snapshot commit covering model fallback, cron timer, isolated executor, and gateway cron wiring. (role: adjacent fallback and cron history owner; confidence: medium; commits: ad2516b1c876; files: src/agents/model-fallback.ts, src/cron/service/timer.ts, src/cron/isolated-agent/run-executor.ts)
  • mbelinky: Prior path-history context links mbelinky to cron service and task lifecycle work near the timer/state deadline and execution lifecycle area affected by this PR. (role: adjacent cron service maintainer; confidence: medium; commits: 190a4b48697b, c60282421524; files: src/cron/service/timer.ts, src/cron/service/state.ts)

Remaining risk / open question:

Codex review notes: model gpt-5.5, reasoning high; reviewed against e46dccb35374.

@imwyvern imwyvern force-pushed the fix/cron-fallback-timeout branch from 48a6215 to f0a3253 Compare April 30, 2026 15:13
@imwyvern imwyvern force-pushed the fix/cron-fallback-timeout branch from f0a3253 to d5591c3 Compare May 8, 2026 04:31
@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 8, 2026
@imwyvern
Copy link
Copy Markdown
Contributor Author

imwyvern commented May 8, 2026

Rebased onto upstream/main (HEAD d5591c3). Addressed Codex review P2:

  • Fallback budget now uses the per-attempt timeout (prepared.context.timeoutMs) instead of the outer cron safety timeout. Jobs that rely on agents.defaults.timeoutSeconds no longer skip fallbacks spuriously.
  • Added regression test for outer-large / attempt-small budget scenario.

Verification:

  • pnpm test src/agents/model-fallback.test.ts src/cron/isolated-agent/run.fallback-time-budget.test.ts src/cron/service/timer.test.ts src/cron/service/timer.regression.test.ts — PASS
  • pnpm exec oxfmt --check --threads=1 src/agents/model-fallback.ts src/cron/isolated-agent/run.ts src/cron/isolated-agent/run-executor.ts src/cron/service/timer.ts CHANGELOG.md — PASS
  • pnpm exec tsc -p tsconfig.core.json --noEmit — PASS

CHANGELOG updated. Ready for another look.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d5591c3868

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cron/service/timer.ts
}

const runAbortController = new AbortController();
const deadlineAtMs = Date.now() + jobTimeoutMs;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Anchor fallback budget to the deferred timeout start

For detached agentTurn jobs, this deadline is computed before setup even though executeJobCoreWithTimeout intentionally defers starting the actual abort timer until onExecutionStarted fires. If delivery/session/skills/model setup takes significant time, the new beforeAttempt check sees that setup time as already spent and can skip a fallback with <budget remaining, while the real cron timeout still has a full jobTimeoutMs from execution start. Please derive deadlineAtMs when startTimeout() runs or otherwise use the same clock origin as the deferred timer.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling gateway Gateway runtime size: XL triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant