bug(cron): isolated-job status is non-deterministic when an agent recovers from a tool error

## Summary

The `status` field returned for `openclaw cron runs --id <jobId>` for an **isolated** (`payload.kind: agentTurn`) job is non-deterministically `ok` or `error` across runs with **identical input** when the agent encounters a recoverable tool error and continues working. Two consecutive runs of the same cron job, with the same prompt and the same underlying failure condition (a Discord message-send returning "Channel is unavailable"), yielded:

| Run | Status | Same prompt | Same outcome (Discord post failed, digest written to fallback log, JSON summary emitted) |
|-----|--------|-------------|---|
| 1   | `ok`   | yes         | yes |
| 2   | `error`| yes         | yes |

This makes the cron status field unreliable as a monitoring signal — operators can't trust `status: ok` to mean "the run did what it was supposed to," nor `status: error` to mean "intervention required." For long-running orchestrator-style cron jobs (where the agent is **expected** to handle tool failures and continue gracefully — e.g. fall back to disk logging when an external channel is unreachable), this is a real production-readiness issue.

## Reproduction

Set up an isolated cron job whose agent prompt:
1. Calls a tool that may fail (in our case, `openclaw message send --channel discord` against an unreachable channel).
2. **Catches the failure** and falls back to a different action (in our case, `write` to a fallback log file).
3. Emits a final JSON summary line (a sentinel marking the agent's own assessment of success/failure).

Trigger the job twice in a row via `openclaw cron run <jobId>`. Inspect with `openclaw cron runs --id <jobId>`.

You will see two entries with the same `summary` content (modulo minor variation in the agent's text), but different `status` values.

### Concrete evidence from our session

```jsonc
// Run 1 — status: "ok"
{
  "runId": "manual:93753d9f-...:1",
  "status": "ok",
  "durationMs": 642885,
  "summary": "```json\n{\"reactiveTriggers\": 0, \"rescuedSweeps\": 2, ..., \"discordPosted\": false, \"errors\": [\"Discord channel unavailable\"]}\n```"
}

// Run 2 — status: "error" (same prompt, same Discord failure, same fallback)
{
  "runId": "manual:93753d9f-...:2",
  "status": "error",
  "durationMs": 337476,
  "summary": "```json\n{\"reactiveTriggers\": 0, \"rescuedSweeps\": 0, ..., \"discordPosted\": false, \"errors\": [\"Discord posting failed: channel unavailable...\"]}\n```"
}
```

Both runs:
- Completed normally (no abort, no timeout)
- Produced the same final JSON sentinel with `discordPosted: false`
- Wrote the digest to the fallback log file per the prompt's failure-handling rule
- Returned cleanly from the agent loop

## Root cause

In `dist/helpers-BalIC4F-.js` (the cron payload classifier — equivalent source likely at `src/cron/...` / `extensions/cron/...`):

```js
const hasErrorPayload = params.payloads.some(p => p?.isError === true)
const lastErrorPayloadIndex = params.payloads.findLastIndex(p => p?.isError === true)
const hasSuccessfulPayloadAfterLastError = !params.runLevelError && lastErrorPayloadIndex >= 0
  && params.payloads.slice(lastErrorPayloadIndex + 1).some(p => p?.isError !== true && Boolean(p?.text?.trim()))

const hasFatalStructuredErrorPayload = hasErrorPayload && !hasSuccessfulPayloadAfterLastError && !hasPendingPresentationWarning
```

And in `dist/isolated-agent-SKs97XgD.js` (the cron isolated-agent runner — `finalizeCronRun`):

```js
const agentDiagnostics = createCronRunDiagnosticsFromAgentResult(finalRunResult, { finalStatus: hasFatalErrorPayload ? "error" : "ok" })
return prepared.withRunSession({
  status: hasFatalErrorPayload ? "error" : "ok",
  // ...
})
```

The classifier is **sensitive to tool-call ordering within a single agent turn**: it walks the payloads in order, finds the last `isError: true`, and asks "did any successful payload come after it?" If yes → `ok`. If no → `error`.

LLMs do not produce deterministic tool-call orderings. Two runs of the same prompt can:
- Emit `write(file)` AFTER `exec(openclaw message send)` → ok
- Emit `write(file)` BEFORE `exec(openclaw message send)`, with the final emit being the failed send → error

Both are functionally identical from the agent's perspective (the agent caught the error and fulfilled its objective via the fallback path), but the cron classifier sees them as opposite outcomes.

## Why this matters

For monitoring / alerting pipelines that route on `status: error`:
- False positives during normal-operation runs that involve any recoverable tool failure
- Missed signals when the agent happened to order tools differently this run
- Operators eventually learn to ignore the status field, defeating its purpose

For our use case (a daily canonry orchestrator that posts a digest to Discord and writes a fallback log when Discord is unreachable), the run is **successful** from a business perspective even if Discord posting fails — the digest is preserved, the operator can read it from disk, and the next day's run will retry. The cron status flipping to `error` on tool-ordering chance is purely a misclassification.

## Suggested fixes

Pick one (or a combination):

### A. Trust the agent's final assistant message as the outcome signal

If the agent's final assistant-visible output is non-empty and non-error-shaped, the run is `ok`. Tool errors during the turn that were followed by *any* further agent action (text or tool call) are evidence the agent saw the error and continued. This is roughly what the current logic intends, but it should be order-agnostic: **was there any successful agent activity at all after the error?** rather than "was the final payload non-error?".

Concretely, replace the trailing-payload check with `hasAnySuccessfulPayload = params.payloads.some(p => p?.isError !== true && Boolean(p?.text?.trim()))`. If a successful payload exists anywhere in the turn alongside an error payload, the agent likely handled the error.

### B. Add an explicit agent-signaled status sentinel

Let jobs declare a regex / JSON sentinel that the cron runner uses to authoritatively determine status from the final output. E.g.:

```yaml
payload:
  kind: agentTurn
  message: "..."
  statusSentinel:
    type: jsonField
    path: "$.discordPosted"   # or "$.ok", "$.success", etc.
    okWhen: { equals: true }
```

When set, the agent's own output is the source of truth. When unset, fall back to current heuristic.

### C. Treat tool errors as `warning` rather than fatal when the agent continues

Introduce a third status — `ok-with-warnings` — for runs where a tool errored but the agent kept going. Preserves the diagnostic value of the existing classifier without misrouting alerts.

My lean is **(A)** because it's the smallest change and matches the spirit of the existing heuristic without the order sensitivity. **(B)** is the most powerful but requires schema changes. **(C)** is the most truthful but introduces a new state operators have to learn.

## Workarounds (for users hitting this today)

- Inspect `summary` instead of `status` if the agent's prompt is known to produce a JSON sentinel.
- Suppress alerts on transient tool errors at the consumer level (downstream of cron).
- Re-order the prompt's failure-handling so the **last** tool call in the recovery path is non-error (e.g., write the fallback file unconditionally at the end of every turn, regardless of whether Discord succeeded). This is brittle but works.

## Related files

- `dist/helpers-BalIC4F-.js` → `resolveCronPayloadOutcome` (lines ~140-205)
- `dist/isolated-agent-SKs97XgD.js` → `finalizeCronRun` (lines ~643-740)
- `dist/server-cron-CM4aws4s.js` → `executeDetachedCronJob` (lines ~1242-1290)

Equivalent sources likely under `src/cron/` or `extensions/cron/` in the repo.

## Environment

- OpenClaw 2026.5.7
- Linux gateway via systemd user service
- Job type: isolated (`sessionTarget: isolated`, `payload.kind: agentTurn`)
- Agent: `main`, model `deepseek/deepseek-v4-pro`

🤖 Surfaced during canonry orchestrator setup; both runs were the same job at https://github.com/AINYC/canonry — diagnosed by tracing the dist code.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bug(cron): isolated-job status is non-deterministic when an agent recovers from a tool error #81514

Summary

Reproduction

Concrete evidence from our session

Root cause

Why this matters

Suggested fixes

A. Trust the agent's final assistant message as the outcome signal

B. Add an explicit agent-signaled status sentinel

C. Treat tool errors as `warning` rather than fatal when the agent continues

Workarounds (for users hitting this today)

Related files

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

bug(cron): isolated-job status is non-deterministic when an agent recovers from a tool error #81514

Description

Summary

Reproduction

Concrete evidence from our session

Root cause

Why this matters

Suggested fixes

A. Trust the agent's final assistant message as the outcome signal

B. Add an explicit agent-signaled status sentinel

C. Treat tool errors as warning rather than fatal when the agent continues

Workarounds (for users hitting this today)

Related files

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

C. Treat tool errors as `warning` rather than fatal when the agent continues