bug(cron): job timeout includes cron-lane queue wait time #41783

@ayanesakura

Summary

Manual/isolated cron jobs can time out before doing any useful work because the job timeout clock starts before the job is actually scheduled to run on the cron lane.

Observed behavior

A cron job with sessionTarget="isolated" and payload.kind="agentTurn" can sit queued on the cron lane for several minutes and then immediately fail with:

  • cron: job execution timed out
  • embedded run cleanup shows aborted=true timedOut=true
  • prompt phase may end in only a few milliseconds, meaning the job never really got to execute

Relevant log pattern seen in production:

  • lane wait exceeded: lane=cron waitedMs=599950 queueAhead=1
  • then later the run starts
  • then run cleanup ... aborted=true timedOut=true
  • and the overall cron lane task duration roughly matches the configured timeout budget

Root cause

In src/cron/service/timer.ts the timeout is enforced in executeJobCoreWithTimeout() by racing executeJobCore(...) against a timer immediately:

return await Promise.race([
  executeJobCore(state, job, runAbortController.signal),
  new Promise<never>((_, reject) => {
    timeoutId = setTimeout(() => {
      runAbortController.abort(timeoutErrorMessage());
      reject(new Error(timeoutErrorMessage()));
    }, jobTimeoutMs);
  }),
]);

But executeJobCore() for isolated agent jobs may still need to wait for the shared cron lane / downstream lane acquisition before useful work begins.

So queue wait time is effectively charged against the job execution timeout.
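The effect can be reproduced in isolation with a small sketch. Everything below is hypothetical and only mimics the shape of the snippet above: `Lane`, `runWithEagerTimeout`, and `demoEagerTimeout` are made-up names, the lane is a toy single-slot FIFO queue, and the durations are shortened to milliseconds. The real code also aborts the run through an `AbortController`, which is omitted here.

```typescript
// Hypothetical single-slot lane (not the project's real implementation):
// holders are served one at a time in FIFO order.
class Lane {
  private tail: Promise<void> = Promise.resolve();

  // Resolves with a release function once all earlier holders have released.
  acquire(): Promise<() => void> {
    let release!: () => void;
    const held = new Promise<void>((resolve) => {
      release = resolve;
    });
    const ready = this.tail.then(() => release);
    this.tail = this.tail.then(() => held);
    return ready;
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Same structure as the issue's snippet: the timer is armed immediately,
// before the job has acquired the lane.
async function runWithEagerTimeout(lane: Lane, workMs: number, timeoutMs: number): Promise<string> {
  let timeoutId: ReturnType<typeof setTimeout> | undefined;
  let started = false;

  const core = async (): Promise<string> => {
    const release = await lane.acquire(); // queue wait happens inside the race
    try {
      started = true;
      await sleep(workMs);
      return "ok";
    } finally {
      release();
    }
  };

  try {
    return await Promise.race([
      core(),
      new Promise<never>((_, reject) => {
        timeoutId = setTimeout(() => {
          reject(new Error(`cron: job execution timed out (started=${started})`));
        }, timeoutMs);
      }),
    ]);
  } finally {
    clearTimeout(timeoutId);
  }
}

// Another job holds the lane for 100 ms; our job needs 10 ms and has a 50 ms budget.
async function demoEagerTimeout(): Promise<string> {
  const lane = new Lane();
  const releaseOther = await lane.acquire();
  setTimeout(releaseOther, 100);
  try {
    return await runWithEagerTimeout(lane, 10, 50);
  } catch (err) {
    return (err as Error).message;
  }
}
```

In this sketch the job needs only 10 ms of work and has a 50 ms budget, yet it fails with `started=false` because the whole budget is spent waiting for the lane.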

Why this is a bug

The configured cron timeout reads like an execution timeout, but in practice it becomes a queue wait + execution timeout.
This causes false failures under contention and makes manual cron run --force behavior confusing.

Expected behavior

One of these should happen:

  1. timeout should start after the job actually starts executing on its effective lane, or
  2. queue wait and execution should have separate budgets / error messages.

At minimum, queued time should not silently consume the whole execution timeout budget.

Suggested fixes

  • Start the timeout clock only after lane acquisition / actual execution start.
  • Or split timeout into:
    • queue wait timeout
    • execution timeout
  • Or preserve the current behavior but surface a distinct error such as cron: job timed out while waiting in queue.
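A rough sketch of how the first two options could fit together, not a patch against the actual code: the execution timer is armed only after the lane is acquired, and queue wait gets its own budget and its own error message. `Lane`, `withDeadline`, `runJob`, `outcome`, and the `queueTimeoutMs` / `execTimeoutMs` parameters are all hypothetical names introduced for illustration; the lane is a toy single-slot FIFO queue.

```typescript
// Hypothetical single-slot FIFO lane, for illustration only.
class Lane {
  private tail: Promise<void> = Promise.resolve();

  // Resolves with a release function once all earlier holders have released.
  acquire(): Promise<() => void> {
    let release!: () => void;
    const held = new Promise<void>((resolve) => {
      release = resolve;
    });
    const ready = this.tail.then(() => release);
    this.tail = this.tail.then(() => held);
    return ready;
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Rejects with `err` if `p` has not settled within `ms`; always clears its timer.
async function withDeadline<T>(p: Promise<T>, ms: number, err: Error): Promise<T> {
  let id: ReturnType<typeof setTimeout> | undefined;
  try {
    return await Promise.race([
      p,
      new Promise<never>((_, reject) => {
        id = setTimeout(() => reject(err), ms);
      }),
    ]);
  } finally {
    clearTimeout(id);
  }
}

async function runJob<T>(
  lane: Lane,
  work: (signal: AbortSignal) => Promise<T>,
  queueTimeoutMs: number,
  execTimeoutMs: number,
): Promise<T> {
  // Phase 1: queue wait, with its own budget and a distinct error.
  const pending = lane.acquire();
  const release = await withDeadline(
    pending,
    queueTimeoutMs,
    new Error("cron: job timed out while waiting in queue"),
  ).catch((err) => {
    // Hand the slot back as soon as it is eventually granted so the lane does not stall.
    void pending.then((r) => r());
    throw err;
  });

  // Phase 2: execution; the clock starts only now that the lane is held.
  const controller = new AbortController();
  try {
    return await withDeadline(
      work(controller.signal),
      execTimeoutMs,
      new Error("cron: job execution timed out"),
    );
  } catch (err) {
    controller.abort(); // let the work observe cancellation on timeout or failure
    throw err;
  } finally {
    release();
  }
}

// Helper for the usage example: turns a rejection into its message.
async function outcome(p: Promise<string>): Promise<string> {
  try {
    return await p;
  } catch (err) {
    return (err as Error).message;
  }
}

// Usage: the lane is busy for 80 ms, longer than the 50 ms execution budget,
// but the 10 ms job still succeeds because queue wait is budgeted separately.
async function demoSplitBudgets(): Promise<string> {
  const lane = new Lane();
  const releaseOther = await lane.acquire();
  setTimeout(releaseOther, 80);
  return outcome(runJob(lane, () => sleep(10).then(() => "ok"), 500, 50));
}
```

With this split, a caller can tell contention on the lane apart from a run that is genuinely too slow, and each case can be tuned or alerted on independently.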

Notes

This was reproduced while testing an isolated daily digest cron with a 600s timeout. Increasing to 1800s works around the symptom, but does not fix the semantics.
