
Commit 286964c

fix(diagnostics): recover orphaned session activity
Recover idle queued sessions whose diagnostic activity retained stale ownerless model or tool calls by classifying them as recoverable session.stuck after the usual recovery gates.

Yield the event loop before stale session-lock process inspection so sync process lookup cannot monopolize lock contention paths.

Docs now describe the widened session.stuck telemetry contract for recoverable stale bookkeeping, including ownerless activity.

Thanks @samuelsoaress.

Refs #84903.

Co-authored-by: samuelsoaress <samuelsoares177778@gmail.com>
1 parent a67ee0f commit 286964c

9 files changed

Lines changed: 347 additions & 12 deletions

docs/concepts/agent-loop.md

Lines changed: 1 addition & 1 deletion
@@ -165,7 +165,7 @@ surfaces, while Codex native hooks remain a separate lower-level Codex mechanism
 - `agent.wait` default: 30s (just the wait). `timeoutMs` param overrides.
 - Agent runtime: `agents.defaults.timeoutSeconds` default 172800s (48 hours); enforced in `runEmbeddedPiAgent` abort timer.
 - Cron runtime: isolated agent-turn `timeoutSeconds` is owned by cron. The scheduler starts that timer when execution begins, aborts the underlying run at the configured deadline, then runs bounded cleanup before recording the timeout so a stale child session cannot keep the lane stuck.
-- Session liveness diagnostics: with diagnostics enabled, `diagnostics.stuckSessionWarnMs` classifies long `processing` sessions that have no observed reply, tool, status, block, or ACP progress. Active embedded runs, model calls, and tool calls report as `session.long_running`; active work with no recent progress reports as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work. Stale session bookkeeping releases the affected session lane immediately; stalled embedded runs are abort-drained only after `diagnostics.stuckSessionAbortMs` (default: at least 5 minutes and 3x the warning threshold) so queued work can resume without cutting off merely slow runs. Recovery emits structured requested/completed outcomes, and diagnostic state is marked idle only if the same processing generation is still current. Repeated `session.stuck` diagnostics back off while the session remains unchanged.
+- Session liveness diagnostics: with diagnostics enabled, `diagnostics.stuckSessionWarnMs` classifies long `processing` sessions that have no observed reply, tool, status, block, or ACP progress. Active embedded runs, model calls, and tool calls report as `session.long_running`; active work with no recent progress reports as `session.stalled`; `session.stuck` is reserved for recoverable stale session bookkeeping, including idle queued sessions with stale ownerless model/tool activity. Stale session bookkeeping releases the affected session lane immediately after recovery gates pass; stalled embedded runs are abort-drained only after `diagnostics.stuckSessionAbortMs` (default: at least 5 minutes and 3x the warning threshold) so queued work can resume without cutting off merely slow runs. Recovery emits structured requested/completed outcomes, and diagnostic state is marked idle only if the same processing generation is still current. Repeated `session.stuck` diagnostics back off while the session remains unchanged.
 - Model idle timeout: OpenClaw aborts a model request when no response chunks arrive before the idle window. `models.providers.<id>.timeoutSeconds` extends this idle watchdog for slow local/self-hosted providers, but it is still bounded by any lower `agents.defaults.timeoutSeconds` or run-specific timeout because those control the whole agent run. Otherwise OpenClaw uses `agents.defaults.timeoutSeconds` when configured, capped at 120s by default. Cron-triggered runs with no explicit model or agent timeout disable the idle watchdog and rely on the cron outer timeout.
 - Provider HTTP request timeout: `models.providers.<id>.timeoutSeconds` applies to that provider's model HTTP fetches, including connect, headers, body, SDK request timeout, total guarded-fetch abort handling, and model stream idle watchdog. Use this for slow local/self-hosted providers such as Ollama before raising the whole agent runtime timeout, and keep the agent/runtime timeout at least as high when the model request needs to run longer.
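The default for `diagnostics.stuckSessionAbortMs` described in the docs above ("at least 5 minutes and 3x the warning threshold") can be read as the larger of the two values. The following is a minimal sketch under that reading; `resolveStuckSessionAbortMs` is a hypothetical helper name and is not taken from the commit.

```typescript
const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Hypothetical helper: the docs state the default rule, not this function.
// Assumes "at least 5 minutes and 3x the warning threshold" means the
// larger of the two values when no explicit abort threshold is configured.
function resolveStuckSessionAbortMs(warnMs: number, configuredAbortMs?: number): number {
  if (configuredAbortMs !== undefined) {
    return configuredAbortMs;
  }
  return Math.max(FIVE_MINUTES_MS, 3 * warnMs);
}

// With a 60s warning threshold the 5-minute floor applies;
// with a 2-minute warning threshold the 3x rule applies.
console.log(resolveStuckSessionAbortMs(60_000));
console.log(resolveStuckSessionAbortMs(120_000));
```

Under this reading, a short warning threshold never shortens the abort window below five minutes, which matches the stated goal of not cutting off merely slow runs.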

docs/concepts/queue.md

Lines changed: 1 addition & 1 deletion
@@ -126,7 +126,7 @@ keys.
 - If commands seem stuck, enable verbose logs and look for "queued for ...ms" lines to confirm the queue is draining.
 - If you need queue depth, enable verbose logs and watch for queue timing lines.
 - Codex app-server runs that accept a turn and then stop emitting progress are interrupted by the Codex adapter so the active session lane can release instead of waiting for the outer run timeout.
-- When diagnostics are enabled, sessions that remain in `processing` past `diagnostics.stuckSessionWarnMs` with no observed reply, tool, status, block, or ACP progress are classified by current activity. Active work logs as `session.long_running`; active work with no recent progress logs as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work, and only that path can release the affected session lane so queued work drains. Repeated `session.stuck` diagnostics back off while the session remains unchanged.
+- When diagnostics are enabled, sessions that remain in `processing` past `diagnostics.stuckSessionWarnMs` with no observed reply, tool, status, block, or ACP progress are classified by current activity. Active work logs as `session.long_running`; active work with no recent progress logs as `session.stalled`; `session.stuck` is reserved for recoverable stale session bookkeeping, including idle queued sessions with stale ownerless model/tool activity, and only that path can release the affected session lane so queued work drains. Repeated `session.stuck` diagnostics back off while the session remains unchanged.
 
 ## Related

docs/gateway/opentelemetry.md

Lines changed: 5 additions & 4 deletions
@@ -223,8 +223,8 @@ message bodies are also approved for export.
 - `openclaw.queue.depth` (histogram, attrs: `openclaw.lane` or `openclaw.channel=heartbeat`)
 - `openclaw.queue.wait_ms` (histogram, attrs: `openclaw.lane`)
 - `openclaw.session.state` (counter, attrs: `openclaw.state`, `openclaw.reason`)
-- `openclaw.session.stuck` (counter, attrs: `openclaw.state`; emitted only for stale session bookkeeping with no active work)
-- `openclaw.session.stuck_age_ms` (histogram, attrs: `openclaw.state`; emitted only for stale session bookkeeping with no active work)
+- `openclaw.session.stuck` (counter, attrs: `openclaw.state`; emitted for recoverable stale session bookkeeping)
+- `openclaw.session.stuck_age_ms` (histogram, attrs: `openclaw.state`; emitted for recoverable stale session bookkeeping)
 - `openclaw.session.turn.created` (counter, attrs: `openclaw.agent`, `openclaw.channel`, `openclaw.trigger`)
 - `openclaw.session.recovery.requested` (counter, attrs: `openclaw.state`, `openclaw.action`, `openclaw.active_work_kind`, `openclaw.reason`)
 - `openclaw.session.recovery.completed` (counter, attrs: `openclaw.state`, `openclaw.action`, `openclaw.status`, `openclaw.active_work_kind`, `openclaw.reason`)
@@ -249,8 +249,9 @@ OpenClaw classifies sessions by the work it can still observe:
   turns behind the lane can resume. When unset, the abort threshold defaults to
   the safer extended window of at least 5 minutes and 3x
   `diagnostics.stuckSessionWarnMs`.
-- `session.stuck`: stale session bookkeeping with no active work. This releases
-  the affected session lane immediately.
+- `session.stuck`: stale session bookkeeping with no active work, or an idle
+  queued session with stale ownerless model/tool activity. This releases the
+  affected session lane immediately after recovery gates pass.
 
 Recovery emits structured `session.recovery.requested` and
 `session.recovery.completed` events. Diagnostic session state is marked idle

src/agents/session-write-lock.ts

Lines changed: 12 additions & 0 deletions
@@ -42,6 +42,14 @@ export const DEFAULT_SESSION_WRITE_LOCK_MAX_HOLD_MS = 5 * 60 * 1000;
 export const DEFAULT_SESSION_WRITE_LOCK_ACQUIRE_TIMEOUT_MS = 60_000;
 const DEFAULT_WATCHDOG_INTERVAL_MS = 60_000;
 const DEFAULT_TIMEOUT_GRACE_MS = 2 * 60 * 1000;
+
+/**
+ * Yield control to the event loop so other sessions can make progress
+ * while lock contention callbacks run synchronous I/O.
+ */
+function yieldEventLoop(): Promise<void> {
+  return new Promise<void>((resolve) => setImmediate(resolve));
+}
 // A payload-less lock can be left behind if shutdown lands between open("wx")
 // and the owner metadata write. Keep the grace short so 10s callers recover.
 const ORPHAN_LOCK_PAYLOAD_GRACE_MS = 5_000;
@@ -768,6 +776,9 @@ export async function acquireSessionWriteLock(params: {
         return lockPayload as Record<string, unknown>;
       },
       shouldReclaim: async ({ payload, nowMs, heldByThisProcess }) => {
+        // Yield to the event loop before synchronous process inspection
+        // to prevent lock contention retries from starving other sessions.
+        await yieldEventLoop();
         const inspected = inspectLockPayloadForSession({
           payload: payload as LockFilePayload | null,
           staleMs,
@@ -780,6 +791,7 @@ export async function acquireSessionWriteLock(params: {
         return await shouldReclaimContendedLockFile(lockPath, inspected, staleMs, nowMs);
       },
       shouldRemoveStaleLock: async ({ lockPath, normalizedTargetPath, payload }) => {
+        await yieldEventLoop();
         const nowMs = Date.now();
         const heldByThisProcess = sessionLockHeldByThisProcess(normalizedTargetPath);
         const inspected = inspectLockPayloadForSession({
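To illustrate why awaiting `setImmediate` before synchronous work helps, here is a small sketch for Node.js. `yieldEventLoop` has the same shape as the helper added above; `inspectSync`, `retryWithYield`, and `demo` are hypothetical stand-ins for the lock retry path and are not part of the commit.

```typescript
// Same shape as the helper added in this commit.
function yieldEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => setImmediate(resolve));
}

// Hypothetical stand-in for a synchronous process lookup.
function inspectSync(label: string, log: string[]): void {
  log.push(`inspect:${label}`);
}

// Hypothetical retry loop: yields before every synchronous inspection.
async function retryWithYield(attempts: number, log: string[]): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    await yieldEventLoop();
    inspectSync(String(i), log);
  }
}

async function demo(): Promise<string[]> {
  const log: string[] = [];
  const retries = retryWithYield(3, log);
  // Another "session" whose callback is queued on the event loop
  // after the retry loop has already started.
  const other = new Promise<void>((resolve) =>
    setImmediate(() => {
      log.push("other-session");
      resolve();
    }),
  );
  await Promise.all([retries, other]);
  return log;
}

demo().then((log) => console.log(log));
```

Because each retry hands control back to the event loop, the other callback gets to run before the retry loop finishes, instead of waiting behind all three synchronous inspections.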

src/logging/diagnostic-session-attention.test.ts

Lines changed: 54 additions & 1 deletion
@@ -102,9 +102,62 @@ describe("classifySessionAttention", () => {
         recoveryEligible: false,
       },
     },
-  ])("$name", ({ activity, expected, queueDepth }) => {
+    {
+      name: "idle queued stale model activity without active embedded run",
+      state: "idle" as const,
+      queueDepth: 1,
+      activity: {
+        activeWorkKind: "model_call" as const,
+        hasActiveEmbeddedRun: false,
+        lastProgressAgeMs: 31_000,
+        lastProgressReason: "model_call:started",
+      },
+      expected: {
+        eventType: "session.stuck",
+        reason: "queued_work_without_active_run",
+        classification: "stale_session_state",
+        recoveryEligible: true,
+      },
+    },
+    {
+      name: "idle queued stale tool_call activity without active embedded run",
+      state: "idle" as const,
+      queueDepth: 1,
+      activity: {
+        activeWorkKind: "tool_call" as const,
+        hasActiveEmbeddedRun: false,
+        activeToolAgeMs: 31_000,
+        lastProgressAgeMs: 31_000,
+        lastProgressReason: "tool:shell:started",
+      },
+      expected: {
+        eventType: "session.stuck",
+        reason: "queued_work_without_active_run",
+        classification: "stale_session_state",
+        recoveryEligible: true,
+      },
+    },
+    {
+      name: "processing session with orphaned activity is not recoverable",
+      state: "processing" as const,
+      queueDepth: 1,
+      activity: {
+        activeWorkKind: "model_call" as const,
+        hasActiveEmbeddedRun: false,
+        lastProgressAgeMs: 31_000,
+      },
+      expected: {
+        eventType: "session.stalled",
+        reason: "active_work_without_progress",
+        classification: "stalled_agent_run",
+        activeWorkKind: "model_call",
+        recoveryEligible: false,
+      },
+    },
+  ])("$name", ({ activity, expected, queueDepth, state }) => {
     expect(
       classifySessionAttention({
+        state,
         queueDepth,
         activity,
         staleMs: 30_000,

src/logging/diagnostic-session-attention.ts

Lines changed: 18 additions & 0 deletions
@@ -25,11 +25,29 @@ export type SessionAttentionClassification =
     };
 
 export function classifySessionAttention(params: {
+  state?: "idle" | "processing" | "waiting";
   queueDepth: number;
   activity: DiagnosticSessionActivitySnapshot;
   staleMs: number;
 }): SessionAttentionClassification {
   if (params.activity.activeWorkKind) {
+    // Idle session with queued work and stale orphaned activity (no active
+    // embedded owner) should be classified as recoverable stuck state, not as
+    // stalled active work. This prevents orphaned model_call or tool_call
+    // activity from blocking the queue indefinitely.
+    if (
+      params.state === "idle" &&
+      params.queueDepth > 0 &&
+      params.activity.hasActiveEmbeddedRun !== true &&
+      (params.activity.lastProgressAgeMs ?? 0) > params.staleMs
+    ) {
+      return {
+        eventType: "session.stuck",
+        reason: "queued_work_without_active_run",
+        classification: "stale_session_state",
+        recoveryEligible: true,
+      };
+    }
     if (
       params.activity.activeWorkKind === "tool_call" &&
       (params.activity.activeToolAgeMs ?? 0) > params.staleMs &&
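The new guard in this hunk can be exercised in isolation. The sketch below copies only the added condition; `ActivitySnapshot`, `SessionState`, and `isOrphanedIdleActivity` are simplified, hypothetical stand-ins for the real types and function, and the rest of `classifySessionAttention` is not reproduced.

```typescript
// Simplified stand-in for DiagnosticSessionActivitySnapshot:
// only the fields the new branch reads.
type ActivitySnapshot = {
  activeWorkKind?: "model_call" | "tool_call";
  hasActiveEmbeddedRun?: boolean;
  lastProgressAgeMs?: number;
};

type SessionState = "idle" | "processing" | "waiting";

// Hypothetical predicate mirroring the added guard: active-work bookkeeping
// exists, the session is idle with queued work, no embedded run owns the
// activity, and the last progress is older than the stale threshold.
function isOrphanedIdleActivity(params: {
  state?: SessionState;
  queueDepth: number;
  activity: ActivitySnapshot;
  staleMs: number;
}): boolean {
  return (
    params.activity.activeWorkKind !== undefined &&
    params.state === "idle" &&
    params.queueDepth > 0 &&
    params.activity.hasActiveEmbeddedRun !== true &&
    (params.activity.lastProgressAgeMs ?? 0) > params.staleMs
  );
}

const staleModelCall: ActivitySnapshot = {
  activeWorkKind: "model_call",
  hasActiveEmbeddedRun: false,
  lastProgressAgeMs: 31_000,
};

// Idle + queued + stale + ownerless: takes the recoverable session.stuck path.
console.log(isOrphanedIdleActivity({ state: "idle", queueDepth: 1, activity: staleModelCall, staleMs: 30_000 })); // true
// Same activity on a processing session: falls through to the existing checks.
console.log(isOrphanedIdleActivity({ state: "processing", queueDepth: 1, activity: staleModelCall, staleMs: 30_000 })); // false
```

Note that a missing `lastProgressAgeMs` is treated as 0, so activity with no recorded progress age never matches the guard.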

src/logging/diagnostic-stuck-session-recovery.runtime.test.ts

Lines changed: 52 additions & 0 deletions
@@ -552,6 +552,58 @@ describe("stuck session recovery", () => {
     ]);
   });
 
+  it("releases idle queued work without aborting when stale activity has no active owner", async () => {
+    mocks.resolveActiveEmbeddedRunHandleSessionId.mockReturnValue(undefined);
+    mocks.resolveActiveEmbeddedRunSessionId.mockReturnValue(undefined);
+    mocks.isEmbeddedPiRunActive.mockReturnValue(false);
+    mocks.resetCommandLane.mockReturnValue(0);
+
+    const outcome = await recoverStuckDiagnosticSession({
+      sessionId: "idle-stale-model-session",
+      sessionKey: "agent:main:main",
+      ageMs: 180_000,
+      queueDepth: 1,
+      expectedState: "idle",
+    });
+
+    expect(outcome).toMatchObject({
+      status: "released",
+      action: "release_lane",
+      sessionId: "idle-stale-model-session",
+      sessionKey: "agent:main:main",
+      released: 0,
+    });
+    expect(mocks.abortEmbeddedPiRun).not.toHaveBeenCalled();
+    expect(mocks.forceClearEmbeddedPiRun).not.toHaveBeenCalled();
+    expect(mocks.resetCommandLane).toHaveBeenCalledWith("session:agent:main:main");
+  });
+
+  it("releases idle queued work with orphaned tool_call without aborting active work", async () => {
+    mocks.resolveActiveEmbeddedRunHandleSessionId.mockReturnValue(undefined);
+    mocks.resolveActiveEmbeddedRunSessionId.mockReturnValue(undefined);
+    mocks.isEmbeddedPiRunActive.mockReturnValue(false);
+    mocks.resetCommandLane.mockReturnValue(1);
+
+    const outcome = await recoverStuckDiagnosticSession({
+      sessionId: "idle-stale-tool-session",
+      sessionKey: "agent:sub:tool-runner",
+      ageMs: 180_000,
+      queueDepth: 2,
+      expectedState: "idle",
+    });
+
+    expect(outcome).toMatchObject({
+      status: "released",
+      action: "release_lane",
+      sessionId: "idle-stale-tool-session",
+      sessionKey: "agent:sub:tool-runner",
+      released: 1,
+    });
+    expect(mocks.abortEmbeddedPiRun).not.toHaveBeenCalled();
+    expect(mocks.forceClearEmbeddedPiRun).not.toHaveBeenCalled();
+    expect(mocks.resetCommandLane).toHaveBeenCalledWith("session:agent:sub:tool-runner");
+  });
+
   it("releases a stale session-id lane when no session key is available", async () => {
     mocks.isEmbeddedPiRunHandleActive.mockReturnValue(false);
     mocks.resetCommandLane.mockReturnValue(1);
