Summary
The status field returned for openclaw cron runs --id <jobId> for an isolated (payload.kind: agentTurn) job is non-deterministically ok or error across runs with identical input when the agent encounters a recoverable tool error and continues working. Two consecutive runs of the same cron job, with the same prompt and the same underlying failure condition (a Discord message-send returning "Channel is unavailable"), yielded:
| Run |
Status |
Same prompt |
Same outcome (Discord post failed, digest written to fallback log, JSON summary emitted) |
| 1 |
ok |
yes |
yes |
| 2 |
error |
yes |
yes |
This makes the cron status field unreliable as a monitoring signal — operators can't trust status: ok to mean "the run did what it was supposed to," nor status: error to mean "intervention required." For long-running orchestrator-style cron jobs (where the agent is expected to handle tool failures and continue gracefully — e.g. fall back to disk logging when an external channel is unreachable), this is a real production-readiness issue.
Reproduction
Set up an isolated cron job whose agent prompt:
- Calls a tool that may fail (in our case,
openclaw message send --channel discord against an unreachable channel).
- Catches the failure and falls back to a different action (in our case,
write to a fallback log file).
- Emits a final JSON summary line (a sentinel marking the agent's own assessment of success/failure).
Trigger the job twice in a row via openclaw cron run <jobId>. Inspect with openclaw cron runs --id <jobId>.
You will see two entries with the same summary content (modulo minor variation in the agent's text), but different status values.
Concrete evidence from our session
Both runs:
- Completed normally (no abort, no timeout)
- Produced the same final JSON sentinel with
discordPosted: false
- Wrote the digest to the fallback log file per the prompt's failure-handling rule
- Returned cleanly from the agent loop
Root cause
In dist/helpers-BalIC4F-.js (the cron payload classifier — equivalent source likely at src/cron/... / extensions/cron/...):
const hasErrorPayload = params.payloads.some(p => p?.isError === true)
const lastErrorPayloadIndex = params.payloads.findLastIndex(p => p?.isError === true)
const hasSuccessfulPayloadAfterLastError = !params.runLevelError && lastErrorPayloadIndex >= 0
&& params.payloads.slice(lastErrorPayloadIndex + 1).some(p => p?.isError !== true && Boolean(p?.text?.trim()))
const hasFatalStructuredErrorPayload = hasErrorPayload && !hasSuccessfulPayloadAfterLastError && !hasPendingPresentationWarning
And in dist/isolated-agent-SKs97XgD.js (the cron isolated-agent runner — finalizeCronRun):
const agentDiagnostics = createCronRunDiagnosticsFromAgentResult(finalRunResult, { finalStatus: hasFatalErrorPayload ? "error" : "ok" })
return prepared.withRunSession({
status: hasFatalErrorPayload ? "error" : "ok",
// ...
})
The classifier is sensitive to tool-call ordering within a single agent turn: it walks the payloads in order, finds the last isError: true, and asks "did any successful payload come after it?" If yes → ok. If no → error.
LLMs do not produce deterministic tool-call orderings. Two runs of the same prompt can:
- Emit
write(file) AFTER exec(openclaw message send) → ok
- Emit
write(file) BEFORE exec(openclaw message send), with the final emit being the failed send → error
Both are functionally identical from the agent's perspective (the agent caught the error and fulfilled its objective via the fallback path), but the cron classifier sees them as opposite outcomes.
Why this matters
For monitoring / alerting pipelines that route on status: error:
- False positives during normal-operation runs that involve any recoverable tool failure
- Missed signals when the agent happened to order tools differently this run
- Operators eventually learn to ignore the status field, defeating its purpose
For our use case (a daily canonry orchestrator that posts a digest to Discord and writes a fallback log when Discord is unreachable), the run is successful from a business perspective even if Discord posting fails — the digest is preserved, the operator can read it from disk, and the next day's run will retry. The cron status flipping to error on tool-ordering chance is purely a misclassification.
Suggested fixes
Pick one (or a combination):
A. Trust the agent's final assistant message as the outcome signal
If the agent's final assistant-visible output is non-empty and non-error-shaped, the run is ok. Tool errors during the turn that were followed by any further agent action (text or tool call) are evidence the agent saw the error and continued. This is roughly what the current logic intends, but it should be order-agnostic: was there any successful agent activity at all after the error? rather than "was the final payload non-error?".
Concretely, replace the trailing-payload check with hasAnySuccessfulPayload = params.payloads.some(p => p?.isError !== true && Boolean(p?.text?.trim())). If a successful payload exists anywhere in the turn alongside an error payload, the agent likely handled the error.
B. Add an explicit agent-signaled status sentinel
Let jobs declare a regex / JSON sentinel that the cron runner uses to authoritatively determine status from the final output. E.g.:
payload:
kind: agentTurn
message: "..."
statusSentinel:
type: jsonField
path: "$.discordPosted" # or "$.ok", "$.success", etc.
okWhen: { equals: true }
When set, the agent's own output is the source of truth. When unset, fall back to current heuristic.
C. Treat tool errors as warning rather than fatal when the agent continues
Introduce a third status — ok-with-warnings — for runs where a tool errored but the agent kept going. Preserves the diagnostic value of the existing classifier without misrouting alerts.
My lean is (A) because it's the smallest change and matches the spirit of the existing heuristic without the order sensitivity. (B) is the most powerful but requires schema changes. (C) is the most truthful but introduces a new state operators have to learn.
Workarounds (for users hitting this today)
- Inspect
summary instead of status if the agent's prompt is known to produce a JSON sentinel.
- Suppress alerts on transient tool errors at the consumer level (downstream of cron).
- Re-order the prompt's failure-handling so the last tool call in the recovery path is non-error (e.g., write the fallback file unconditionally at the end of every turn, regardless of whether Discord succeeded). This is brittle but works.
Related files
dist/helpers-BalIC4F-.js → resolveCronPayloadOutcome (lines ~140-205)
dist/isolated-agent-SKs97XgD.js → finalizeCronRun (lines ~643-740)
dist/server-cron-CM4aws4s.js → executeDetachedCronJob (lines ~1242-1290)
Equivalent sources likely under src/cron/ or extensions/cron/ in the repo.
Environment
- OpenClaw 2026.5.7
- Linux gateway via systemd user service
- Job type: isolated (
sessionTarget: isolated, payload.kind: agentTurn)
- Agent:
main, model deepseek/deepseek-v4-pro
🤖 Surfaced during canonry orchestrator setup; both runs were the same job at https://github.com/AINYC/canonry — diagnosed by tracing the dist code.
Summary
The
statusfield returned foropenclaw cron runs --id <jobId>for an isolated (payload.kind: agentTurn) job is non-deterministicallyokorerroracross runs with identical input when the agent encounters a recoverable tool error and continues working. Two consecutive runs of the same cron job, with the same prompt and the same underlying failure condition (a Discord message-send returning "Channel is unavailable"), yielded:okerrorThis makes the cron status field unreliable as a monitoring signal — operators can't trust
status: okto mean "the run did what it was supposed to," norstatus: errorto mean "intervention required." For long-running orchestrator-style cron jobs (where the agent is expected to handle tool failures and continue gracefully — e.g. fall back to disk logging when an external channel is unreachable), this is a real production-readiness issue.Reproduction
Set up an isolated cron job whose agent prompt:
openclaw message send --channel discordagainst an unreachable channel).writeto a fallback log file).Trigger the job twice in a row via
openclaw cron run <jobId>. Inspect withopenclaw cron runs --id <jobId>.You will see two entries with the same
summarycontent (modulo minor variation in the agent's text), but differentstatusvalues.Concrete evidence from our session
Both runs:
discordPosted: falseRoot cause
In
dist/helpers-BalIC4F-.js(the cron payload classifier — equivalent source likely atsrc/cron/.../extensions/cron/...):And in
dist/isolated-agent-SKs97XgD.js(the cron isolated-agent runner —finalizeCronRun):The classifier is sensitive to tool-call ordering within a single agent turn: it walks the payloads in order, finds the last
isError: true, and asks "did any successful payload come after it?" If yes →ok. If no →error.LLMs do not produce deterministic tool-call orderings. Two runs of the same prompt can:
write(file)AFTERexec(openclaw message send)→ okwrite(file)BEFOREexec(openclaw message send), with the final emit being the failed send → errorBoth are functionally identical from the agent's perspective (the agent caught the error and fulfilled its objective via the fallback path), but the cron classifier sees them as opposite outcomes.
Why this matters
For monitoring / alerting pipelines that route on
status: error:For our use case (a daily canonry orchestrator that posts a digest to Discord and writes a fallback log when Discord is unreachable), the run is successful from a business perspective even if Discord posting fails — the digest is preserved, the operator can read it from disk, and the next day's run will retry. The cron status flipping to
erroron tool-ordering chance is purely a misclassification.Suggested fixes
Pick one (or a combination):
A. Trust the agent's final assistant message as the outcome signal
If the agent's final assistant-visible output is non-empty and non-error-shaped, the run is
ok. Tool errors during the turn that were followed by any further agent action (text or tool call) are evidence the agent saw the error and continued. This is roughly what the current logic intends, but it should be order-agnostic: was there any successful agent activity at all after the error? rather than "was the final payload non-error?".Concretely, replace the trailing-payload check with
hasAnySuccessfulPayload = params.payloads.some(p => p?.isError !== true && Boolean(p?.text?.trim())). If a successful payload exists anywhere in the turn alongside an error payload, the agent likely handled the error.B. Add an explicit agent-signaled status sentinel
Let jobs declare a regex / JSON sentinel that the cron runner uses to authoritatively determine status from the final output. E.g.:
When set, the agent's own output is the source of truth. When unset, fall back to current heuristic.
C. Treat tool errors as
warningrather than fatal when the agent continuesIntroduce a third status —
ok-with-warnings— for runs where a tool errored but the agent kept going. Preserves the diagnostic value of the existing classifier without misrouting alerts.My lean is (A) because it's the smallest change and matches the spirit of the existing heuristic without the order sensitivity. (B) is the most powerful but requires schema changes. (C) is the most truthful but introduces a new state operators have to learn.
Workarounds (for users hitting this today)
summaryinstead ofstatusif the agent's prompt is known to produce a JSON sentinel.Related files
dist/helpers-BalIC4F-.js→resolveCronPayloadOutcome(lines ~140-205)dist/isolated-agent-SKs97XgD.js→finalizeCronRun(lines ~643-740)dist/server-cron-CM4aws4s.js→executeDetachedCronJob(lines ~1242-1290)Equivalent sources likely under
src/cron/orextensions/cron/in the repo.Environment
sessionTarget: isolated,payload.kind: agentTurn)main, modeldeepseek/deepseek-v4-pro🤖 Surfaced during canonry orchestrator setup; both runs were the same job at https://github.com/AINYC/canonry — diagnosed by tracing the dist code.