[Bug] Isolated cron agentTurn jobs with explicit timeoutSeconds still hard-killed at ~300s despite fix in v2026.4.9 #63805

@Crabsticksalad

Bug Description

Isolated cron agentTurn jobs with an explicit payload.timeoutSeconds: 300 are still being hard-killed at approximately 300,000ms — even after the v2026.4.9 fix for the LLM idle timeout regression.

The job-level setTimeout in executeJobCoreWithTimeout fires at exactly the configured timeoutSeconds × 1000, but the embedded runner's internal work is not completing in time. This is different from the fixed LLM idle timeout bug — it is a separate, still-active issue affecting jobs that require more than ~265 seconds of wall-clock time.

Expected Behavior

A cron job with sessionTarget: "isolated", payload.kind: "agentTurn", and payload.timeoutSeconds: 300 should:

  1. Respect the 300,000ms job-level timeout and kill the job at 300s if it genuinely exceeds that budget
  2. Complete successfully if the embedded runner can finish within 300s

Actual Behavior

The job is killed at approximately 300,000ms (job-level timeout) even when the embedded runner is still actively processing — executing commands, running LLM calls that are succeeding, and making visible progress in the session transcript.

Environment

  • OpenClaw: v2026.4.9 (fresh install today)
  • Configured model: MiniMax-M2.7 via minimax-portal
  • Cron job: sessionTarget: "isolated", payload.kind: "agentTurn", timeoutSeconds: 300
  • Delivery: Discord announce channel
  • Session transcript (81KB, 49 entries) shows the session bootstraps, the LLM responds, exec commands run, but the job is hard-killed at ~300s with "cron: job execution timed out"

Timeline of Events

  1. v2026.3.31 era: Cron works fine (scheduled, isolated agentTurn)
  2. v2026.4.1 upgrade: Fails at ~60s (LLM idle timeout bug — now fixed)
  3. v2026.4.9 upgrade: LLM idle timeout is fixed (no more 65s kills), but the job-level 300s hard timeout is now the terminal failure. The fix in d9dc75774b addressed the embedded runner's LLM idle watchdog, not the cron runner's executeJobCoreWithTimeout setTimeout.

Session Transcript Evidence

Session start: 08:50:10
First LLM response: 08:50:16 (6s)
Multiple exec commands run, session actively working
Final entries: 08:53:21-08:54:35 (continued working)
Error fired: 08:54:35 — "aborted | cron: job execution timed out"
Total wall time in transcript: ~265s (not 300s; the ~35s gap suggests part of the budget was consumed outside the transcript, e.g. by bootstrap overhead)

The session was doing useful work until the moment it was killed. The exec commands were running, tool results were being returned. The job was not hung — it was actively processing.
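The gap between the transcript duration and the configured timeout can be checked with a few lines of arithmetic. This is only a sketch using the timestamps quoted above. The helper name `toSeconds` is made up for illustration, and treating the difference as time spent before the first transcript entry is an assumption, not something the logs confirm.

```typescript
// Convert an "HH:MM:SS" timestamp to seconds since midnight.
function toSeconds(hhmmss: string): number {
  const [h, m, s] = hhmmss.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Timestamps taken from the transcript excerpt in this report.
const sessionStart = toSeconds("08:50:10");
const abortFired = toSeconds("08:54:35");
const configuredTimeoutSeconds = 300;

// Time visible in the transcript before the abort.
const observedSeconds = abortFired - sessionStart;

// Portion of the 300s budget that the transcript does not account for.
const unaccountedSeconds = configuredTimeoutSeconds - observedSeconds;

console.log({ observedSeconds, unaccountedSeconds });
// { observedSeconds: 265, unaccountedSeconds: 35 }
```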

Technical Findings

Two Independent Timeout Systems

The codebase has two independent timeout mechanisms that can kill a cron job:

  1. Job-level timeout (executeJobCoreWithTimeout, gateway-cli-*.js)

    • Set via payload.timeoutSeconds (converted to ms)
    • Enforced by a setTimeout(..., jobTimeoutMs) that races against executeJobCore
    • When it fires: runAbortController.abort() + rejection of the Promise race
    • Error: "cron: job execution timed out"
  2. Embedded runner's idle timeout (runEmbeddedPiAgent, pi-embedded-*.js)

    • Set via cfg.agents.defaults.timeoutSeconds (default 600s)
    • Independent setTimeout inside the embedded runner
    • Now fixed in v2026.4.9: inherits agents.defaults.timeoutSeconds and is disabled for unconfigured cron runs

The job-level timeout (300s) fires before the embedded runner's 600s idle watchdog. Both mechanisms are functioning as designed — but the job-level timeout is killing jobs that should be allowed to complete within 300s.
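To make the job-level mechanism concrete, here is a minimal sketch of a timer racing against the work promise, as described in item 1. It is not OpenClaw's actual source: `runWithJobTimeout` and its parameters are hypothetical names, and only the abort-plus-reject behaviour and the error string come from this report.

```typescript
// Hypothetical stand-in for the cron runner's job-level timeout.
// Not the real executeJobCoreWithTimeout; it only mirrors the described shape.
async function runWithJobTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  jobTimeoutMs: number,
): Promise<T> {
  const runAbortController = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Fires on wall-clock time alone; progress inside `work` is never consulted.
      runAbortController.abort();
      reject(new Error("cron: job execution timed out"));
    }, jobTimeoutMs);
  });

  try {
    // Whichever settles first wins: the work itself or the timer.
    return await Promise.race([work(runAbortController.signal), timeout]);
  } finally {
    // Clear the timer so a finished job does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

With this shape, a worker that is still producing output when the timer fires is aborted exactly like one that is hung, which matches the behaviour reported here.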

The Fix That Shipped Only Addressed One Layer

The v2026.4.9 release notes state:

"Agents/timeouts: make the LLM idle timeout inherit agents.defaults.timeoutSeconds when configured, disable the unconfigured idle watchdog for cron runs..."

This fixed the embedded runner's idle watchdog (the ~65s killer). It did not address executeJobCoreWithTimeout's job-level setTimeout, which has always been reading payload.timeoutSeconds correctly but fires regardless of whether the embedded runner is making progress.

Additional Observations

  • When the upstream LLM API returns 529 (overloaded), OpenClaw's retry/failover logic consumes time that counts against the job-level 300s budget. This is expected but compounds the timing pressure.
  • The same model works fine in manual (non-cron) sessions via the Control UI — direct API calls succeed in <1s.
  • payload.model is respected in this setup (the session uses MiniMax-M2.7). A separate bug, "[Bug] Cron isolated session model parameter ignored, falls back to default agent model" (#61294), can cause the model parameter for isolated cron sessions to be ignored depending on configuration format.

Relevant Open Issues

Questions / Things to Investigate

  1. Is executeJobCoreWithTimeout's setTimeout supposed to fire as a hard ceiling, or should it be extended by time spent in the embedded runner's internal retry loops?
  2. Should payload.timeoutSeconds for isolated agentTurn jobs be increased significantly (600s+) to give the embedded runner enough time, given the job-level timeout is a hard ceiling?
  3. Is there overhead in the isolated session bootstrap that's consuming part of the 300s budget before the LLM call even starts?
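Question 1 contrasts two policies: a hard wall-clock ceiling and a deadline that depends on recent activity. The sketch below shows the difference in isolation. `ActivityDeadline` is a hypothetical helper, not an OpenClaw API, and the 60s idle limit in the usage lines is an arbitrary illustrative value. Time is passed in explicitly so the behaviour can be followed without real timers.

```typescript
// Hypothetical helper contrasting the two policies; not part of OpenClaw.
class ActivityDeadline {
  private readonly startMs: number;
  private readonly hardCeilingMs: number;
  private readonly idleLimitMs: number;
  private lastActivityMs: number;

  constructor(startMs: number, hardCeilingMs: number, idleLimitMs: number) {
    this.startMs = startMs;
    this.hardCeilingMs = hardCeilingMs; // e.g. payload.timeoutSeconds * 1000
    this.idleLimitMs = idleLimitMs; // longest allowed silence between progress events
    this.lastActivityMs = startMs;
  }

  // Call whenever the runner reports progress (LLM response, tool result, ...).
  recordProgress(nowMs: number): void {
    this.lastActivityMs = nowMs;
  }

  // Policy A: hard ceiling measured from job start, regardless of progress.
  hardCeilingExpired(nowMs: number): boolean {
    return nowMs - this.startMs >= this.hardCeilingMs;
  }

  // Policy B: expire only after a stretch with no recorded progress.
  idleExpired(nowMs: number): boolean {
    return nowMs - this.lastActivityMs >= this.idleLimitMs;
  }
}

// A job that last made progress at 290s, checked at the 300s mark.
const deadline = new ActivityDeadline(0, 300_000, 60_000);
deadline.recordProgress(290_000);
console.log(deadline.hardCeilingExpired(300_000)); // true
console.log(deadline.idleExpired(300_000)); // false
```

Under policy A the job in this report is killed. Under policy B it would continue, at the cost of no longer having a guaranteed upper bound on run time.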

Reproduction Steps

Create any isolated agentTurn cron job with:

  • sessionTarget: "isolated"
  • payload.timeoutSeconds: 300
  • A moderately complex task (LLM + multiple exec commands)

The job is killed by the 300s job-level timeout once the task needs more than ~265s of transcript time, whether the budget goes to isolated session bootstrap overhead, retries after API errors, or natural processing time.
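For reference, the relevant part of such a job can be written out as an object. This fragment shows only the fields named in this report. The surrounding structure is an assumption, and everything else a real job definition needs (schedule, prompt, delivery) is left out because it is not given here.

```typescript
// Only fields mentioned in this report; the rest of the job definition is
// intentionally omitted rather than guessed.
const reproJob = {
  sessionTarget: "isolated",
  payload: {
    kind: "agentTurn",
    model: "MiniMax-M2.7",
    timeoutSeconds: 300,
  },
};

// Job-level timer budget implied by the config, in milliseconds.
const jobTimeoutMs = reproJob.payload.timeoutSeconds * 1000;
console.log(jobTimeoutMs); // 300000
```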

Tags

bug, regression, cron, isolated, agentTurn, timeout

Metadata

Assignees

No one assigned

    Labels

    close:duplicate (Closed as duplicate)
    dedupe:child (Duplicate issue/PR child in dedupe cluster)
    dedupe:parent (Primary canonical item in dedupe cluster)
