fix(cron): stop fallback attempts when cron budget is exhausted#52365
fix(cron): stop fallback attempts when cron budget is exhausted#52365imwyvern wants to merge 5 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR fixes a real production bug where a cron job's fallback model was allocated ~0ms of budget when the primary model consumed most of the cron timeout, resulting in an immediate spurious failure attributed to the fallback. The fix propagates the outer cron deadline (
Confidence Score: 5/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/cron/service.issue-regressions.test.ts
Line: 1115-1119
Comment:
**Flaky test: exact `Date.now()` equality**
`wallNow` is captured before the async function call, while `deadlineAtMs` inside `executeJobCoreWithTimeout` is set to a second `Date.now()` invocation. If even 1 ms elapses between the two calls (very possible under CI load), the `.toBe()` assertion fails. A range check or fake timers would make this deterministic:
```suggestion
const wallNow = Date.now();
const result = await executeJobCoreWithTimeout(state, cronJob);
expect(result).toMatchObject({ status: "ok", summary: "done" });
// Allow a small window for the two Date.now() calls to differ
expect(observedDeadlineAtMs).toBeGreaterThanOrEqual(wallNow + timeoutSeconds * 1000);
expect(observedDeadlineAtMs).toBeLessThan(wallNow + timeoutSeconds * 1000 + 500);
```
Alternatively, mock `Date.now` with `vi.spyOn(Date, "now").mockReturnValue(wallNow)` to make both calls return the same value and keep the exact `.toBe` check.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(cron): stop fallback attempts when c..." | Re-trigger Greptile |
| const wallNow = Date.now(); | ||
| const result = await executeJobCoreWithTimeout(state, cronJob); | ||
|
|
||
| expect(result).toMatchObject({ status: "ok", summary: "done" }); | ||
| expect(observedDeadlineAtMs).toBe(wallNow + timeoutSeconds * 1000); |
There was a problem hiding this comment.
Flaky test: exact
Date.now() equality
wallNow is captured before the async function call, while deadlineAtMs inside executeJobCoreWithTimeout is set to a second Date.now() invocation. If even 1 ms elapses between the two calls (very possible under CI load), the .toBe() assertion fails. A range check or fake timers would make this deterministic:
| const wallNow = Date.now(); | |
| const result = await executeJobCoreWithTimeout(state, cronJob); | |
| expect(result).toMatchObject({ status: "ok", summary: "done" }); | |
| expect(observedDeadlineAtMs).toBe(wallNow + timeoutSeconds * 1000); | |
| const wallNow = Date.now(); | |
| const result = await executeJobCoreWithTimeout(state, cronJob); | |
| expect(result).toMatchObject({ status: "ok", summary: "done" }); | |
| // Allow a small window for the two Date.now() calls to differ | |
| expect(observedDeadlineAtMs).toBeGreaterThanOrEqual(wallNow + timeoutSeconds * 1000); | |
| expect(observedDeadlineAtMs).toBeLessThan(wallNow + timeoutSeconds * 1000 + 500); |
Alternatively, mock Date.now with vi.spyOn(Date, "now").mockReturnValue(wallNow) to make both calls return the same value and keep the exact .toBe check.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cron/service.issue-regressions.test.ts
Line: 1115-1119
Comment:
**Flaky test: exact `Date.now()` equality**
`wallNow` is captured before the async function call, while `deadlineAtMs` inside `executeJobCoreWithTimeout` is set to a second `Date.now()` invocation. If even 1 ms elapses between the two calls (very possible under CI load), the `.toBe()` assertion fails. A range check or fake timers would make this deterministic:
```suggestion
const wallNow = Date.now();
const result = await executeJobCoreWithTimeout(state, cronJob);
expect(result).toMatchObject({ status: "ok", summary: "done" });
// Allow a small window for the two Date.now() calls to differ
expect(observedDeadlineAtMs).toBeGreaterThanOrEqual(wallNow + timeoutSeconds * 1000);
expect(observedDeadlineAtMs).toBeLessThan(wallNow + timeoutSeconds * 1000 + 500);
```
Alternatively, mock `Date.now` with `vi.spyOn(Date, "now").mockReturnValue(wallNow)` to make both calls return the same value and keep the exact `.toBe` check.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 93927ee1ab
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (beforeAttemptResult?.type === "stop") { | ||
| logModelFallbackDecision({ | ||
| decision: "stop_before_candidate", | ||
| runId: params.runId, | ||
| requestedProvider: params.provider, | ||
| requestedModel: params.model, | ||
| candidate, | ||
| attempt: i + 1, | ||
| total: candidates.length, | ||
| reason: beforeAttemptResult.reason ?? "timeout", | ||
| error: | ||
| beforeAttemptResult.error ?? | ||
| `Stopped fallback before ${candidate.provider}/${candidate.model}.`, | ||
| nextCandidate: candidates[i + 1], | ||
| isPrimary, | ||
| requestedModelMatched: requestedModel, | ||
| fallbackConfigured: hasFallbackCandidates, | ||
| previousAttempts: attempts, | ||
| }); | ||
| break; |
There was a problem hiding this comment.
Preserve the last real error when a budget stop follows skips
When beforeAttempt returns stop here, the loop just breaks and falls through to throwFallbackFailureSummary(). That only rethrows the prior error when attempts.length <= 1, so any skipped candidate recorded earlier in this same run (for example the existing cooldown/auth skip branches) turns the result into a synthesized All models failed ... summary instead of the original primary failure. In a chain like “primary fails, next fallback is skipped, later fallback is stopped for low cron budget”, this change still masks the real error the commit is trying to preserve.
Useful? React with 👍 / 👎.
6d3f4d7 to
3433417
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 34334173d5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (lastError !== undefined) { | ||
| throw lastError; | ||
| } | ||
| break; |
There was a problem hiding this comment.
Propagate stop errors when all prior candidates were skipped
When beforeAttempt stops here after only skip-only entries have accumulated (for example, the requested model is skipped by the cooldown/auth branch earlier in runWithModelFallback, and the cron budget gate fires on the next runnable fallback), lastError is still undefined. This break then falls through to throwFallbackFailureSummary(), so callers see a synthesized All models failed ... error instead of the explicit stop message from beforeAttempt. In that cron path the new budget-stop behavior is still masked by a misleading summary.
Useful? React with 👍 / 👎.
3433417 to
d628daa
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d628daad75
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const beforeAttemptResult = await params.beforeAttempt?.({ | ||
| candidate, | ||
| attempt: i + 1, | ||
| total: candidates.length, | ||
| previousAttempts: attempts, | ||
| isPrimary, | ||
| requestedModelMatched: requestedModel, | ||
| fallbackConfigured: hasFallbackCandidates, | ||
| }); |
There was a problem hiding this comment.
Run the stop hook before marking cooldown probes
When beforeAttempt stops a near-deadline cron run on a cooldowned provider, the code above has already called markProbeAttempt() for that candidate before any request is sent. Because lastProbeAttempt is global and throttles probes for 30 seconds, the next fallback run can incorrectly skip a real cooldown-recovery probe even though this run never actually probed the provider. Moving the stop check ahead of the cooldown probe bookkeeping would avoid burning probe slots on aborted attempts.
Useful? React with 👍 / 👎.
d628daa to
1bbc38a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1bbc38af88
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const minRemainingMs = resolveCronFallbackMinRemainingMs(cronTimeoutMs); | ||
| if (remainingMs >= minRemainingMs) { |
There was a problem hiding this comment.
Base the fallback guard on the actual agent timeout
For cron jobs that rely on agents.defaults.timeoutSeconds instead of payload.timeoutSeconds, timeoutMs in this function can be much smaller than the cron safety timeout. The new guard uses cronTimeoutMs here, so those jobs still require 30s remaining before any fallback, even when each model attempt is capped to 5–10s. A primary failure in the last 20–25s will now skip fallbacks that still had enough budget to complete under the configured agent timeout.
Useful? React with 👍 / 👎.
| git branch -D fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted local" || echo "Already gone" | ||
| git push origin --delete fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted remote" || echo "Already gone from remote" |
There was a problem hiding this comment.
Remove the unconditional remote branch deletion
This helper always deletes fix/subagent-timeout-partial-results at the end, even though that branch is not in BRANCHES and the script never verifies that it was merged. Running the script to rebase the listed branches will also remove that remote branch from origin, which is destructive if the PR is still open or another maintainer is still using it.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1bbc38af88
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const minRemainingMs = resolveCronFallbackMinRemainingMs(cronTimeoutMs); | ||
| if (remainingMs >= minRemainingMs) { |
There was a problem hiding this comment.
Base the fallback guard on the actual agent timeout
For cron jobs that rely on agents.defaults.timeoutSeconds instead of payload.timeoutSeconds, timeoutMs in this function can be much smaller than the cron safety timeout. The new guard uses cronTimeoutMs here, so those jobs still require 30s remaining before any fallback, even when each model attempt is capped to 5–10s. A primary failure in the last 20–25s will now skip fallbacks that still had enough budget to complete under the configured agent timeout.
Useful? React with 👍 / 👎.
| git branch -D fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted local" || echo "Already gone" | ||
| git push origin --delete fix/subagent-timeout-partial-results 2>/dev/null && echo "Deleted remote" || echo "Already gone from remote" |
There was a problem hiding this comment.
Remove the unconditional remote branch deletion
This helper always deletes fix/subagent-timeout-partial-results at the end, even though that branch is not in BRANCHES and the script never verifies that it was merged. Running the script to rebase the listed branches will also remove that remote branch from origin, which is destructive if the PR is still open or another maintainer is still using it.
Useful? React with 👍 / 👎.
1bbc38a to
d31660a
Compare
d31660a to
9cab596
Compare
9cab596 to
97875ec
Compare
97875ec to
a4e231e
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a4e231e921
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
a4e231e to
3f552d9
Compare
8002698 to
d8734c9
Compare
|
Follow-up complete on this branch:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d8734c9627
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const cronTimeoutMs = resolveCronJobTimeoutMs(params.job); | ||
| const fallbackMinRemainingMs = | ||
| typeof cronTimeoutMs === "number" && cronTimeoutMs > 0 | ||
| ? resolveCronFallbackMinRemainingMs(cronTimeoutMs) |
There was a problem hiding this comment.
Base fallback guard on actual agent timeout
This computes fallbackMinRemainingMs from resolveCronJobTimeoutMs(params.job), which defaults agent-turn jobs to a 60-minute safety window when payload.timeoutSeconds is unset. In the same function, execution still uses prepared.context.timeoutMs for each model call, so jobs configured via agents.defaults.timeoutSeconds (for example 5–10s) can have fallbacks blocked whenever under 30s remain even though another attempt could still complete within the real per-attempt timeout. That makes the new guard overly strict and can incorrectly suppress useful fallback attempts.
Useful? React with 👍 / 👎.
| if (lastError !== undefined) { | ||
| throw lastError; | ||
| } | ||
| break; |
There was a problem hiding this comment.
Propagate stop reason when no prior candidate failed
When beforeAttempt requests stop but lastError is still undefined (for example, earlier candidates were only skipped for cooldown/auth), this branch falls through via break and later throws a synthesized fallback summary. That drops the explicit stop message from beforeAttempt and returns a misleading generic failure instead of the cron-budget stop reason, which is precisely the condition this hook is intended to communicate.
Useful? React with 👍 / 👎.
15a4f17 to
a5cf851
Compare
df70038 to
61fa650
Compare
|
Codex review: needs real behavior proof before merge. Summary Reproducibility: yes. The PR body includes production logs, and current-main source inspection shows no pre-attempt fallback guard before cron-selected fallback candidates run. Real behavior proof Next step before merge Security Review findings
Review detailsBest possible solution: Land a rebased patch that keeps deferred isolated-timeout semantics, uses an execution-start-aligned fallback deadline, preserves explicit stop errors, and includes real behavior proof plus focused tests. Do we have a high-confidence way to reproduce the issue? Yes. The PR body includes production logs, and current-main source inspection shows no pre-attempt fallback guard before cron-selected fallback candidates run. Is this the best way to solve the issue? No. The approach is maintainable, but this PR is not the best mergeable implementation until the propagated deadline uses the same clock origin as the deferred cron timeout and stop-only failures preserve the hook's explicit error. Full review comments:
Overall correctness: patch is incorrect Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 1819e41d2683. |
|
Codex review: found issues before merge. What this changes: The PR adds a pre-attempt stop hook to model fallback, threads cron deadline information into isolated cron execution, and adds tests for skipping fallback attempts when little cron budget remains. Maintainer follow-up before merge: This is an open implementation PR with useful code, but the remaining action is maintainer review, rebase, and timeout-policy coordination rather than an autonomous replacement branch. Review findings:
Review detailsBest possible solution: Land a rebased, narrow fix that keeps current deferred isolated-timeout cleanup semantics, adds only the needed generic pre-attempt stop seam, computes the skip threshold from the correct execution budget, and coordinates the terminal-abort contract with #42482 and #62682. Full review comments:
Overall correctness: patch is incorrect Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against e46dccb35374. |
48a6215 to
f0a3253
Compare
f0a3253 to
d5591c3
Compare
|
Rebased onto upstream/main (HEAD d5591c3). Addressed Codex review P2:
Verification:
CHANGELOG updated. Ready for another look. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d5591c3868
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| const runAbortController = new AbortController(); | ||
| const deadlineAtMs = Date.now() + jobTimeoutMs; |
There was a problem hiding this comment.
Anchor fallback budget to the deferred timeout start
For detached agentTurn jobs, this deadline is computed before setup even though executeJobCoreWithTimeout intentionally defers starting the actual abort timer until onExecutionStarted fires. If delivery/session/skills/model setup takes significant time, the new beforeAttempt check sees that setup time as already spent and can skip a fallback with <budget remaining, while the real cron timeout still has a full jobTimeoutMs from execution start. Please derive deadlineAtMs when startTimeout() runs or otherwise use the same clock origin as the deferred timer.
Useful? React with 👍 / 👎.
Problem
When a cron job's primary model times out near the cron deadline, the fallback model is started but immediately killed by the same cron timeout — receiving effectively 0ms to make an LLM request. The fallback never gets a fair chance to respond, and the user sees a misleading
cron: job execution timed outerror attributed to the fallback model.Observed in production logs (2026-03-22):
The cron timeout wraps the entire fallback chain. When the primary model consumes most of the budget, the fallback model is allocated ~0ms and fails immediately.
Solution
Propagate the cron deadline (
deadlineAtMs) fromexecuteJobCoreWithTimeoutthrough the cron service state down torunCronIsolatedAgentTurn.Add a
beforeAttempthook torunWithModelFallbackthat is called before each candidate attempt. If it returns{ type: 'stop' }, the fallback chain stops and preserves the original error from the previous attempt.Check remaining budget in the cron layer's
beforeAttemptcallback. If less thanmin(30s, timeout/4)remains before the cron deadline, the fallback is skipped with a clear message:This ensures:
beforeAttempthook is generic and reusable for other pre-attempt checksChanges
src/agents/model-fallback.ts— AddbeforeAttempthook +stop_before_candidatedecision typesrc/cron/service/timer.ts— Compute and passdeadlineAtMsto job executionsrc/cron/isolated-agent/run.ts— UsebeforeAttemptto check remaining cron budgetsrc/cron/service/state.ts+src/gateway/server-cron.ts— ThreaddeadlineAtMsthrough interfacesTests
model-fallback.test.ts— beforeAttempt stop behaviorrun.fallback-time-budget.test.ts— Cron deadline propagation + fallback budget exhaustionservice.issue-regressions.test.ts— Deadline passed to isolated runs