Commit 61d53f9

fix(cron): clean up timed out agent runs
1 parent c1a42dc commit 61d53f9

16 files changed

Lines changed: 279 additions & 34 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ Docs: https://docs.openclaw.ai
 
 - Build/Gateway: route restart, shutdown, respawn, diagnostics, command-queue cleanup, and runtime cleanup through one stable gateway lifecycle runtime entry so rebuilt packages do not strand long-running gateways on stale hashed chunks. Carries forward #73964. Thanks @pashpashpash.
 - Memory/wiki: keep broad shared-source and generated related-link blocks from turning every page into a search hit, cap noisy backlinks, support all-term searches such as people-routing queries, and prefer readable page body snippets over generated metadata. Thanks @vincentkoc.
+- Cron/Gateway: abort and bounded-clean up timed-out isolated agent turns before recording the timeout, so stale cron sessions cannot leave Discord or other chat lanes stuck in `processing` after a timeout. Thanks @vincentkoc.
 - Agents/errors: suppress malformed streaming tool-call JSON fragments before they reach chat surfaces while preserving provider request-validation diagnostics. Fixes #59076; keeps #59080 as duplicate coverage. (#59118) Thanks @singleGanghood.
 - CLI/models: restore provider-filtered `models list --all --provider <id>` rows for providers without manifest/static catalog coverage, including Anthropic and Amazon Bedrock, while keeping the compatibility fallback off expensive availability and resolver paths. Thanks @shakkernerd.
 - CLI/models: move the OpenAI listable catalog into the plugin manifest so `models list --all --provider openai` uses the manifest fast path instead of loading provider runtime normalization hooks. Thanks @shakkernerd.

docs/automation/cron-jobs.md

Lines changed: 1 addition & 0 deletions
@@ -51,6 +51,7 @@ Cron is the Gateway's built-in scheduler. It persists jobs, wakes the agent at t
 - Isolated cron runs also guard against stale acknowledgement replies. If the first result is just an interim status update (`on it`, `pulling everything together`, and similar hints) and no descendant subagent run is still responsible for the final answer, OpenClaw re-prompts once for the actual result before delivery.
 - Isolated cron runs prefer structured execution-denial metadata from the embedded run, then fall back to known final summary/output markers such as `SYSTEM_RUN_DENIED` and `INVALID_REQUEST`, so a blocked command is not reported as a green run.
 - Isolated cron runs also treat run-level agent failures as job errors even when no reply payload is produced, so model/provider failures increment error counters and trigger failure notifications instead of clearing the job as successful.
+- When an isolated agent-turn job reaches `timeoutSeconds`, cron aborts the underlying agent run and gives it a short cleanup window. If the run does not drain, Gateway-owned cleanup force-clears that run's session ownership before cron records the timeout, so queued chat work is not left behind a stale processing session.
 
 <a id="maintenance"></a>
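The ordering described in the added cron-jobs.md bullet (abort, bounded cleanup window, force-clear if needed, then record the timeout) can be sketched roughly as follows. This is a minimal illustration, not the commit's implementation: `RunHandle`, `drainedWithin`, `handleCronTimeout`, and the `forceClearOwnership` callback are hypothetical names introduced only for this example.

```typescript
// Hypothetical stand-ins for illustration; none of these names come from OpenClaw.
type RunHandle = {
  abort: () => void;
  // Resolves once the run has fully drained; may never resolve for a stuck run.
  done: Promise<void>;
};

// Resolves true if `done` settles within `ms`, false otherwise.
async function drainedWithin(done: Promise<void>, ms: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<false>((resolve) => {
    timer = setTimeout(() => resolve(false), ms);
  });
  try {
    return await Promise.race([done.then(() => true as const), timedOut]);
  } finally {
    clearTimeout(timer); // do not leave the cleanup timer pending
  }
}

async function handleCronTimeout(
  run: RunHandle,
  cleanupWindowMs: number,
  forceClearOwnership: () => void,
): Promise<string[]> {
  const events: string[] = [];

  // 1. Abort the underlying agent run.
  run.abort();
  events.push("aborted");

  // 2. Give the run a short, bounded window to drain.
  if (await drainedWithin(run.done, cleanupWindowMs)) {
    events.push("drained");
  } else {
    // 3. The run did not drain: force-clear its session ownership.
    forceClearOwnership();
    events.push("force-cleared");
  }

  // 4. Only now is the timeout recorded.
  events.push("timeout-recorded");
  return events;
}
```

The point of the sketch is the ordering: the timeout is recorded last, so by the time the job is marked as timed out the session no longer holds the lane.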

docs/concepts/agent-loop.md

Lines changed: 1 addition & 0 deletions
@@ -162,6 +162,7 @@ surfaces, while Codex native hooks remain a separate lower-level Codex mechanism
 
 - `agent.wait` default: 30s (just the wait). `timeoutMs` param overrides.
 - Agent runtime: `agents.defaults.timeoutSeconds` default 172800s (48 hours); enforced in `runEmbeddedPiAgent` abort timer.
+- Cron runtime: isolated agent-turn `timeoutSeconds` is owned by cron. The scheduler starts that timer when execution begins, aborts the underlying run at the configured deadline, then runs bounded cleanup before recording the timeout so a stale child session cannot keep the lane stuck.
 - Stuck-session recovery: with diagnostics enabled, `diagnostics.stuckSessionWarnMs` detects long `processing` sessions. Active embedded runs, active reply operations, and active session-lane tasks remain warning-only by default; if diagnostics show no active work for the session, the watchdog releases the affected session lane so queued startup work can drain.
 - Model idle timeout: OpenClaw aborts a model request when no response chunks arrive before the idle window. `models.providers.<id>.timeoutSeconds` extends this idle watchdog for slow local/self-hosted providers; otherwise OpenClaw uses `agents.defaults.timeoutSeconds` when configured, capped at 120s by default. Cron-triggered runs with no explicit model or agent timeout disable the idle watchdog and rely on the cron outer timeout.
 - Provider HTTP request timeout: `models.providers.<id>.timeoutSeconds` applies to that provider's model HTTP fetches, including connect, headers, body, SDK request timeout, total guarded-fetch abort handling, and model stream idle watchdog. Use this for slow local/self-hosted providers such as Ollama before raising the whole agent runtime timeout.
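The new "Cron runtime" bullet says the scheduler starts its timer when execution begins rather than when the job is picked up. One way to picture that, assuming a callback-style start notification like the `onExecutionStarted` hook in this commit, is the sketch below. `Execute` and `runWithDeadline` are hypothetical names, and the real scheduler may arm and clear its timer differently.

```typescript
// Hypothetical sketch; `Execute` and `runWithDeadline` are not OpenClaw APIs.
type Execute = (ctx: {
  signal: AbortSignal;
  onExecutionStarted: () => void;
}) => Promise<string>;

type DeadlineResult = { status: "ok"; output: string } | { status: "timeout" };

async function runWithDeadline(execute: Execute, timeoutMs: number): Promise<DeadlineResult> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  let timedOut = false;

  // The deadline timer is armed only once the run reports that execution began,
  // so preparation time before that point does not count against the timeout.
  const onExecutionStarted = () => {
    if (timer !== undefined) return; // arm at most once
    timer = setTimeout(() => {
      timedOut = true;
      controller.abort(); // abort the underlying run at the deadline
    }, timeoutMs);
  };

  try {
    const output = await execute({ signal: controller.signal, onExecutionStarted });
    return timedOut ? { status: "timeout" } : { status: "ok", output };
  } catch (err) {
    if (timedOut) return { status: "timeout" };
    throw err; // unrelated failures are not reported as timeouts
  } finally {
    clearTimeout(timer);
  }
}
```

In this sketch a run that spends longer than `timeoutMs` preparing, but finishes quickly once started, still completes normally, which is the behaviour the bullet implies.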

src/agents/pi-embedded-runner.ts

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ export {
   runEmbeddedPiAgent as runEmbeddedAgent,
 } from "./pi-embedded-runner/run.js";
 export {
+  abortAndDrainEmbeddedPiRun,
   abortEmbeddedPiRun,
   abortEmbeddedPiRun as abortEmbeddedAgentRun,
   isEmbeddedPiRunActive,

src/agents/pi-embedded-runner/runs.test.ts

Lines changed: 27 additions & 0 deletions
@@ -2,6 +2,7 @@ import { importFreshModule } from "openclaw/plugin-sdk/test-fixtures";
 import { afterEach, describe, expect, it, vi } from "vitest";
 import {
   __testing,
+  abortAndDrainEmbeddedPiRun,
   abortEmbeddedPiRun,
   clearActiveEmbeddedRun,
   consumeEmbeddedRunModelSwitch,
@@ -65,6 +66,32 @@ describe("pi-embedded runner run registry", () => {
     expect(abortB).toHaveBeenCalledTimes(1);
   });
 
+  it("force-clears an aborted run that does not drain", async () => {
+    vi.useFakeTimers();
+    try {
+      const abortRun = vi.fn();
+      setActiveEmbeddedRun("session-stuck", createRunHandle({ abort: abortRun }), "agent:main");
+
+      const resultPromise = abortAndDrainEmbeddedPiRun({
+        sessionId: "session-stuck",
+        sessionKey: "agent:main",
+        settleMs: 100,
+        forceClear: true,
+        reason: "test_timeout",
+      });
+      await vi.advanceTimersByTimeAsync(100);
+      const result = await resultPromise;
+
+      expect(result).toEqual({ aborted: true, drained: false, forceCleared: true });
+      expect(abortRun).toHaveBeenCalledTimes(1);
+      expect(isEmbeddedPiRunHandleActive("session-stuck")).toBe(false);
+      expect(resolveActiveEmbeddedRunHandleSessionId("agent:main")).toBeUndefined();
+    } finally {
+      await vi.runOnlyPendingTimersAsync();
+      vi.useRealTimers();
+    }
+  });
+
   it("waits for active runs to drain", async () => {
     vi.useFakeTimers();
     try {

src/agents/pi-embedded-runner/runs.ts

Lines changed: 23 additions & 0 deletions
@@ -310,6 +310,29 @@ export function waitForEmbeddedPiRunEnd(sessionId: string, timeoutMs = 15_000):
   });
 }
 
+export type AbortAndDrainEmbeddedPiRunResult = {
+  aborted: boolean;
+  drained: boolean;
+  forceCleared: boolean;
+};
+
+export async function abortAndDrainEmbeddedPiRun(params: {
+  sessionId: string;
+  sessionKey?: string;
+  settleMs?: number;
+  forceClear?: boolean;
+  reason?: string;
+}): Promise<AbortAndDrainEmbeddedPiRunResult> {
+  const settleMs = params.settleMs ?? 15_000;
+  const aborted = abortEmbeddedPiRun(params.sessionId);
+  const drained = aborted ? await waitForEmbeddedPiRunEnd(params.sessionId, settleMs) : false;
+  const forceCleared =
+    params.forceClear === true && (!aborted || !drained)
+      ? forceClearEmbeddedPiRun(params.sessionId, params.sessionKey, params.reason)
+      : false;
+  return { aborted, drained, forceCleared };
+}
+
 function notifyEmbeddedRunEnded(sessionId: string) {
   const waiters = EMBEDDED_RUN_WAITERS.get(sessionId);
   if (!waiters || waiters.size === 0) {
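To make the branching in `abortAndDrainEmbeddedPiRun` easier to reason about, the sketch below restates it as a pure function with the side effects replaced by plain inputs. `resolveOutcome` and its input field names are hypothetical; it also assumes `forceClearEmbeddedPiRun` returns a boolean, which the `forceCleared: boolean` field suggests but this diff does not show directly.

```typescript
type Outcome = { aborted: boolean; drained: boolean; forceCleared: boolean };

// Pure restatement of the decision logic in the diff (illustrative only).
function resolveOutcome(input: {
  abortSucceeded: boolean; // stands in for the abortEmbeddedPiRun(...) result
  drainedInTime: boolean; // stands in for the waitForEmbeddedPiRunEnd(...) result
  forceClear?: boolean; // caller opt-in, as in params.forceClear
  forceClearSucceeded?: boolean; // assumed boolean result of forceClearEmbeddedPiRun(...)
}): Outcome {
  const aborted = input.abortSucceeded;
  // The drain wait only happens when the abort found a run to stop.
  const drained = aborted ? input.drainedInTime : false;
  // Force-clear is attempted only on opt-in and only when the run did not drain.
  const forceCleared =
    input.forceClear === true && (!aborted || !drained)
      ? (input.forceClearSucceeded ?? true)
      : false;
  return { aborted, drained, forceCleared };
}

// The case covered by the new test in runs.test.ts: aborted, never drained, force-cleared.
const stuckRun = resolveOutcome({ abortSucceeded: true, drainedInTime: false, forceClear: true });
```

Because `drained` is always `false` when `aborted` is `false`, the condition `!aborted || !drained` appears to reduce to `!drained`. The longer form in the diff reads as an explicit statement of both cases rather than a behavioural difference.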

src/agents/pi-embedded.runtime.ts

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 export {
+  abortAndDrainEmbeddedPiRun,
   abortEmbeddedPiRun,
   isEmbeddedPiRunActive,
   isEmbeddedPiRunStreaming,

src/agents/pi-embedded.ts

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ export type {
   EmbeddedPiRunResult,
 } from "./pi-embedded-runner.js";
 export {
+  abortAndDrainEmbeddedPiRun,
   abortEmbeddedAgentRun,
   abortEmbeddedPiRun,
   compactEmbeddedAgentSession,

src/cron/isolated-agent/run.ts

Lines changed: 11 additions & 3 deletions
@@ -10,6 +10,7 @@ import { stringifyRouteThreadId } from "../../plugin-sdk/channel-route.js";
 import { normalizeOptionalString } from "../../shared/string-coerce.js";
 import { resolveCronDeliveryPlan, type CronDeliveryPlan } from "../delivery-plan.js";
 import type {
+  CronAgentExecutionStarted,
   CronDeliveryTrace,
   CronDeliveryTraceMessageTarget,
   CronDeliveryTraceTarget,
@@ -424,7 +425,7 @@ type RunCronAgentTurnParams = {
   message: string;
   abortSignal?: AbortSignal;
   signal?: AbortSignal;
-  onExecutionStarted?: () => void;
+  onExecutionStarted?: (info?: CronAgentExecutionStarted) => void;
   sessionKey: string;
   agentId?: string;
   lane?: string;
@@ -1008,7 +1009,7 @@ export async function runCronIsolatedAgentTurn(params: {
   message: string;
   abortSignal?: AbortSignal;
   signal?: AbortSignal;
-  onExecutionStarted?: () => void;
+  onExecutionStarted?: (info?: CronAgentExecutionStarted) => void;
   sessionKey: string;
   agentId?: string;
   lane?: string;
@@ -1026,6 +1027,13 @@ export async function runCronIsolatedAgentTurn(params: {
   if (!prepared.ok) {
     return prepared.result;
   }
+  const notifyExecutionStarted = () =>
+    params.onExecutionStarted?.({
+      jobId: params.job.id,
+      agentId: prepared.context.agentId,
+      sessionId: prepared.context.runSessionId,
+      sessionKey: prepared.context.runSessionKey,
+    });
 
   try {
     const { executeCronRun } = await loadCronExecutorRuntime();
@@ -1054,7 +1062,7 @@ export async function runCronIsolatedAgentTurn(params: {
       commandBody: prepared.context.commandBody,
       persistSessionEntry: prepared.context.persistSessionEntry,
       abortSignal,
-      onExecutionStarted: params.onExecutionStarted,
+      onExecutionStarted: notifyExecutionStarted,
       abortReason,
       isAborted,
       thinkLevel: prepared.context.thinkLevel,

src/cron/service/state.ts

Lines changed: 7 additions & 1 deletion
@@ -7,6 +7,7 @@ import type {
   CronJobCreate,
   CronJobPatch,
   CronMessageChannel,
+  CronAgentExecutionStarted,
   CronRunOutcome,
   CronRunStatus,
   CronRunTelemetry,
@@ -93,7 +94,7 @@ export type CronServiceDeps = {
     job: CronJob;
     message: string;
     abortSignal?: AbortSignal;
-    onExecutionStarted?: () => void;
+    onExecutionStarted?: (info?: CronAgentExecutionStarted) => void;
   }) => Promise<
     {
       summary?: string;
@@ -114,6 +115,11 @@ export type CronServiceDeps = {
     } & CronRunOutcome &
       CronRunTelemetry
   >;
+  cleanupTimedOutAgentRun?: (params: {
+    job: CronJob;
+    timeoutMs: number;
+    execution?: CronAgentExecutionStarted;
+  }) => Promise<void>;
   sendCronFailureAlert?: (params: {
     job: CronJob;
     text: string;
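The new `cleanupTimedOutAgentRun` dependency receives the execution info captured by `onExecutionStarted`. The wiring on the gateway side is not part of the hunks shown here, so the following is only a guess at how a dependency of that shape could delegate to an abort-and-drain helper. `makeCleanupTimedOutAgentRun`, the reduced `job` type, the 5-second default, and the `"cron_timeout"` reason string are all assumptions for illustration. The fields of `CronAgentExecutionStarted` are inferred from the `notifyExecutionStarted` call in run.ts.

```typescript
// Shape inferred from the notifyExecutionStarted call in the diff; optionality is assumed.
type CronAgentExecutionStarted = {
  jobId?: string;
  agentId?: string;
  sessionId?: string;
  sessionKey?: string;
};

// Mirrors the parameter and result shape of abortAndDrainEmbeddedPiRun in the diff.
type AbortAndDrain = (params: {
  sessionId: string;
  sessionKey?: string;
  settleMs?: number;
  forceClear?: boolean;
  reason?: string;
}) => Promise<{ aborted: boolean; drained: boolean; forceCleared: boolean }>;

// Hypothetical factory: builds a cleanupTimedOutAgentRun-shaped callback.
function makeCleanupTimedOutAgentRun(abortAndDrain: AbortAndDrain, settleMs = 5_000) {
  return async (params: {
    job: { id: string }; // simplified stand-in for CronJob
    timeoutMs: number;
    execution?: CronAgentExecutionStarted;
  }): Promise<void> => {
    const sessionId = params.execution?.sessionId;
    // If execution never started there is no run session to clean up.
    if (!sessionId) return;
    await abortAndDrain({
      sessionId,
      sessionKey: params.execution?.sessionKey,
      settleMs, // bounded cleanup window (value assumed)
      forceClear: true, // release session ownership if the run will not drain
      reason: "cron_timeout", // hypothetical reason label
    });
  };
}
```

Passing the helper in as an argument keeps the sketch self-contained and makes the callback easy to exercise with a stub in tests.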
