[Bug] Isolated cron agentTurn jobs with explicit timeoutSeconds still hard-killed at ~300s despite fix in v2026.4.9

## Bug Description

Isolated cron `agentTurn` jobs with an explicit `payload.timeoutSeconds: 300` are still being hard-killed at approximately 300,000ms — even after the v2026.4.9 fix for the LLM idle timeout regression.

The job-level `setTimeout` in `executeJobCoreWithTimeout` fires at exactly the configured `timeoutSeconds × 1000`, but the embedded runner's internal work is not completing in time. This is **different from the fixed LLM idle timeout bug** — it is a separate, still-active issue affecting jobs that require more than ~265 seconds of wall-clock time.

## Expected Behavior

A cron job with `sessionTarget: "isolated"`, `payload.kind: "agentTurn"`, and `payload.timeoutSeconds: 300` should:
1. Respect the 300,000ms job-level timeout and kill the job at 300s if it genuinely exceeds that budget
2. Complete successfully if the embedded runner can finish within 300s

## Actual Behavior

The job is killed at approximately 300,000ms (job-level timeout) even when the embedded runner is still actively processing — executing commands, running LLM calls that are succeeding, and making visible progress in the session transcript.

## Environment

- OpenClaw: v2026.4.9 (fresh install today)
- Configured model: MiniMax-M2.7 via minimax-portal
- Cron job: `sessionTarget: "isolated"`, `payload.kind: "agentTurn"`, `timeoutSeconds: 300`
- Delivery: Discord announce channel
- Session transcript (81KB, 49 entries) shows the session bootstraps, the LLM responds, exec commands run, but the job is hard-killed at ~300s with `"cron: job execution timed out"`

## Timeline of Events

1. **v2026.3.31 era**: Cron works fine (scheduled, isolated agentTurn)
2. **v2026.4.1 upgrade**: Fails at ~60s (LLM idle timeout bug — now fixed)
3. **v2026.4.9 upgrade**: LLM idle timeout is fixed (no more 65s kills), but the **job-level 300s hard timeout** is now the terminal failure. The fix in `d9dc75774b` addressed the embedded runner's LLM idle watchdog, not the cron runner's `executeJobCoreWithTimeout` setTimeout.

## Session Transcript Evidence

```
Session start: 08:50:10
First LLM response: 08:50:16 (6s)
Multiple exec commands run, session actively working
Final entries: 08:53:21-08:54:35 (continued working)
Error fired: 08:54:35 — "aborted | cron: job execution timed out"
Total wall time: ~265s (not 300s — timing suggests overhead consumption)
```

The session was **doing useful work until the moment it was killed**. The exec commands were running, tool results were being returned. The job was not hung — it was actively processing.

## Technical Findings

### Two Independent Timeout Systems

The codebase has two independent timeout mechanisms that can kill a cron job:

1. **Job-level timeout** (`executeJobCoreWithTimeout`, `gateway-cli-*.js`)
   - Set via `payload.timeoutSeconds` (converted to ms)
   - Enforced by a `setTimeout(..., jobTimeoutMs)` that races against `executeJobCore`
   - When it fires: `runAbortController.abort()` + rejection of the Promise race
   - Error: `"cron: job execution timed out"`

2. **Embedded runner's idle timeout** (`runEmbeddedPiAgent`, `pi-embedded-*.js`)
   - Set via `cfg.agents.defaults.timeoutSeconds` (default 600s)
   - Independent `setTimeout` inside the embedded runner
   - **Now fixed in v2026.4.9**: inherits `agents.defaults.timeoutSeconds` and is disabled for unconfigured cron runs

The job-level timeout (300s) fires **before** the embedded runner's 600s idle watchdog. Both mechanisms are functioning as designed — but the job-level timeout is killing jobs that should be allowed to complete within 300s.

### The Fix That Shipped Only Addressed One Layer

The v2026.4.9 release notes state:
> "Agents/timeouts: make the LLM idle timeout inherit `agents.defaults.timeoutSeconds` when configured, disable the unconfigured idle watchdog for cron runs..."

This fixed the **embedded runner's idle watchdog** (the ~65s killer). It did **not** address `executeJobCoreWithTimeout`'s job-level `setTimeout`, which has always been reading `payload.timeoutSeconds` correctly but fires regardless of whether the embedded runner is making progress.

### Additional Observations

- When the upstream LLM API returns 529 (overloaded), OpenClaw's retry/failover logic consumes time that counts against the job-level 300s budget. This is expected but compounds the timing pressure.
- The same model works fine in manual (non-cron) sessions via the Control UI — direct API calls succeed in <1s.
- The `payload.model` field IS being respected (session uses MiniMax-M2.7), but the model parameter for isolated cron sessions has a separate bug (#61294) — it may be ignored depending on configuration format.

## Relevant Open Issues

- **#58138** (open): "Cron agentTurn jobs timeout even when script executes successfully" — similar pattern, same `Profile *:default timed out. Trying next account...` behavior, direct API works
- **#59579** (open): "Scheduled cron runs timeout at ~60s in 2026.4.1" — the original 60s regression, now partially fixed but job-level 300s remains
- **#61294** (open): "Cron isolated session model parameter ignored" — model override not respected for isolated sessions
- **#54320** (open): "cron run: default 30s timeout kills long-running jobs" — CLI vs gateway execution path difference

## Questions / Things to Investigate

1. Is `executeJobCoreWithTimeout`'s `setTimeout` supposed to fire as a hard ceiling, or should it be extended by time spent in the embedded runner's internal retry loops?
2. Should `payload.timeoutSeconds` for isolated `agentTurn` jobs be increased significantly (600s+) to give the embedded runner enough time, given the job-level timeout is a hard ceiling?
3. Is there overhead in the isolated session bootstrap that's consuming part of the 300s budget before the LLM call even starts?

## Reproduction Steps

Any isolated `agentTurn` cron job with:
- `sessionTarget: "isolated"`
- `payload.timeoutSeconds: 300`
- A moderately complex task (LLM + multiple exec commands)

Will fail at 300s if the task takes longer than ~265s due to isolated session bootstrap overhead, retry time from API errors, or natural processing time.

## Tags

bug regression cron isolated agentTurn timeout

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug] Isolated cron agentTurn jobs with explicit timeoutSeconds still hard-killed at ~300s despite fix in v2026.4.9 #63805

Bug Description

Expected Behavior

Actual Behavior

Environment

Timeline of Events

Session Transcript Evidence

Technical Findings

Two Independent Timeout Systems

The Fix That Shipped Only Addressed One Layer

Additional Observations

Relevant Open Issues

Questions / Things to Investigate

Reproduction Steps

Tags

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Bug] Isolated cron agentTurn jobs with explicit timeoutSeconds still hard-killed at ~300s despite fix in v2026.4.9 #63805

Description

Bug Description

Expected Behavior

Actual Behavior

Environment

Timeline of Events

Session Transcript Evidence

Technical Findings

Two Independent Timeout Systems

The Fix That Shipped Only Addressed One Layer

Additional Observations

Relevant Open Issues

Questions / Things to Investigate

Reproduction Steps

Tags

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions