Commit 1666849

docs(diagnostics): clarify recoverable stuck sessions
1 parent 4176028 commit 1666849

4 files changed

Lines changed: 11 additions & 10 deletions

docs/concepts/agent-loop.md

Lines changed: 1 addition & 1 deletion
@@ -165,7 +165,7 @@ surfaces, while Codex native hooks remain a separate lower-level Codex mechanism
 - `agent.wait` default: 30s (just the wait). `timeoutMs` param overrides.
 - Agent runtime: `agents.defaults.timeoutSeconds` default 172800s (48 hours); enforced in `runEmbeddedPiAgent` abort timer.
 - Cron runtime: isolated agent-turn `timeoutSeconds` is owned by cron. The scheduler starts that timer when execution begins, aborts the underlying run at the configured deadline, then runs bounded cleanup before recording the timeout so a stale child session cannot keep the lane stuck.
-- Session liveness diagnostics: with diagnostics enabled, `diagnostics.stuckSessionWarnMs` classifies long `processing` sessions that have no observed reply, tool, status, block, or ACP progress. Active embedded runs, model calls, and tool calls report as `session.long_running`; active work with no recent progress reports as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work. Stale session bookkeeping releases the affected session lane immediately; stalled embedded runs are abort-drained only after `diagnostics.stuckSessionAbortMs` (default: at least 5 minutes and 3x the warning threshold) so queued work can resume without cutting off merely slow runs. Recovery emits structured requested/completed outcomes, and diagnostic state is marked idle only if the same processing generation is still current. Repeated `session.stuck` diagnostics back off while the session remains unchanged.
+- Session liveness diagnostics: with diagnostics enabled, `diagnostics.stuckSessionWarnMs` classifies long `processing` sessions that have no observed reply, tool, status, block, or ACP progress. Active embedded runs, model calls, and tool calls report as `session.long_running`; active work with no recent progress reports as `session.stalled`; `session.stuck` is reserved for recoverable stale session bookkeeping, including idle queued sessions with stale ownerless model/tool activity. Stale session bookkeeping releases the affected session lane immediately after recovery gates pass; stalled embedded runs are abort-drained only after `diagnostics.stuckSessionAbortMs` (default: at least 5 minutes and 3x the warning threshold) so queued work can resume without cutting off merely slow runs. Recovery emits structured requested/completed outcomes, and diagnostic state is marked idle only if the same processing generation is still current. Repeated `session.stuck` diagnostics back off while the session remains unchanged.
 - Model idle timeout: OpenClaw aborts a model request when no response chunks arrive before the idle window. `models.providers.<id>.timeoutSeconds` extends this idle watchdog for slow local/self-hosted providers, but it is still bounded by any lower `agents.defaults.timeoutSeconds` or run-specific timeout because those control the whole agent run. Otherwise OpenClaw uses `agents.defaults.timeoutSeconds` when configured, capped at 120s by default. Cron-triggered runs with no explicit model or agent timeout disable the idle watchdog and rely on the cron outer timeout.
 - Provider HTTP request timeout: `models.providers.<id>.timeoutSeconds` applies to that provider's model HTTP fetches, including connect, headers, body, SDK request timeout, total guarded-fetch abort handling, and model stream idle watchdog. Use this for slow local/self-hosted providers such as Ollama before raising the whole agent runtime timeout, and keep the agent/runtime timeout at least as high when the model request needs to run longer.
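The default for `diagnostics.stuckSessionAbortMs` described in the changed bullet ("at least 5 minutes and 3x the warning threshold") can be read as the larger of the two values. The following is a minimal sketch of that reading, not code from this commit: `resolveStuckSessionAbortMs` and `MIN_ABORT_MS` are hypothetical names, and the assumption that an explicitly configured value is used unchanged is mine.

```typescript
// Hypothetical helper illustrating one reading of the documented default.
// Neither the function name nor the constant comes from the OpenClaw source.
const MIN_ABORT_MS = 5 * 60_000; // 5 minutes

function resolveStuckSessionAbortMs(
  stuckSessionWarnMs: number,
  configuredAbortMs?: number,
): number {
  // Assumption: an explicitly configured threshold is used as-is.
  if (configuredAbortMs !== undefined) {
    return configuredAbortMs;
  }
  // "At least 5 minutes and 3x the warning threshold": take the larger value.
  return Math.max(MIN_ABORT_MS, 3 * stuckSessionWarnMs);
}

console.log(resolveStuckSessionAbortMs(60_000)); // 300000 (the 5 minute floor wins)
console.log(resolveStuckSessionAbortMs(120_000)); // 360000 (3x the warning threshold wins)
```

Under this reading, a short warning threshold never produces an abort window under five minutes, which matches the stated goal of not cutting off merely slow runs.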

docs/concepts/queue.md

Lines changed: 1 addition & 1 deletion
@@ -126,7 +126,7 @@ keys.
 - If commands seem stuck, enable verbose logs and look for "queued for ...ms" lines to confirm the queue is draining.
 - If you need queue depth, enable verbose logs and watch for queue timing lines.
 - Codex app-server runs that accept a turn and then stop emitting progress are interrupted by the Codex adapter so the active session lane can release instead of waiting for the outer run timeout.
-- When diagnostics are enabled, sessions that remain in `processing` past `diagnostics.stuckSessionWarnMs` with no observed reply, tool, status, block, or ACP progress are classified by current activity. Active work logs as `session.long_running`; active work with no recent progress logs as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work, and only that path can release the affected session lane so queued work drains. Repeated `session.stuck` diagnostics back off while the session remains unchanged.
+- When diagnostics are enabled, sessions that remain in `processing` past `diagnostics.stuckSessionWarnMs` with no observed reply, tool, status, block, or ACP progress are classified by current activity. Active work logs as `session.long_running`; active work with no recent progress logs as `session.stalled`; `session.stuck` is reserved for recoverable stale session bookkeeping, including idle queued sessions with stale ownerless model/tool activity, and only that path can release the affected session lane so queued work drains. Repeated `session.stuck` diagnostics back off while the session remains unchanged.

 ## Related

docs/gateway/opentelemetry.md

Lines changed: 5 additions & 4 deletions
@@ -223,8 +223,8 @@ message bodies are also approved for export.
 - `openclaw.queue.depth` (histogram, attrs: `openclaw.lane` or `openclaw.channel=heartbeat`)
 - `openclaw.queue.wait_ms` (histogram, attrs: `openclaw.lane`)
 - `openclaw.session.state` (counter, attrs: `openclaw.state`, `openclaw.reason`)
-- `openclaw.session.stuck` (counter, attrs: `openclaw.state`; emitted only for stale session bookkeeping with no active work)
-- `openclaw.session.stuck_age_ms` (histogram, attrs: `openclaw.state`; emitted only for stale session bookkeeping with no active work)
+- `openclaw.session.stuck` (counter, attrs: `openclaw.state`; emitted for recoverable stale session bookkeeping)
+- `openclaw.session.stuck_age_ms` (histogram, attrs: `openclaw.state`; emitted for recoverable stale session bookkeeping)
 - `openclaw.session.turn.created` (counter, attrs: `openclaw.agent`, `openclaw.channel`, `openclaw.trigger`)
 - `openclaw.session.recovery.requested` (counter, attrs: `openclaw.state`, `openclaw.action`, `openclaw.active_work_kind`, `openclaw.reason`)
 - `openclaw.session.recovery.completed` (counter, attrs: `openclaw.state`, `openclaw.action`, `openclaw.status`, `openclaw.active_work_kind`, `openclaw.reason`)
@@ -249,8 +249,9 @@ OpenClaw classifies sessions by the work it can still observe:
   turns behind the lane can resume. When unset, the abort threshold defaults to
   the safer extended window of at least 5 minutes and 3x
   `diagnostics.stuckSessionWarnMs`.
-- `session.stuck`: stale session bookkeeping with no active work. This releases
-  the affected session lane immediately.
+- `session.stuck`: stale session bookkeeping with no active work, or an idle
+  queued session with stale ownerless model/tool activity. This releases the
+  affected session lane immediately after recovery gates pass.

 Recovery emits structured `session.recovery.requested` and
 `session.recovery.completed` events. Diagnostic session state is marked idle
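The three classifications in the hunk above can be sketched as a small decision function. This is an illustrative reading of the documentation text only. The `SessionSnapshot` shape, the `activeWork` values, and the choice to treat progress within the warning threshold as "recent" are all assumptions, not the logic in `src/logging/diagnostic.ts`.

```typescript
type Classification =
  | "session.long_running"
  | "session.stalled"
  | "session.stuck";

// Hypothetical snapshot of a session that has already exceeded the warning threshold.
interface SessionSnapshot {
  // "owned": an embedded run, model call, or tool call with a live owner.
  // "ownerless": model/tool activity that no live run claims.
  // "none": only session bookkeeping remains.
  activeWork: "owned" | "ownerless" | "none";
  lastProgressAgeMs?: number; // undefined when no progress was ever observed
}

function classifySession(s: SessionSnapshot, warnMs: number): Classification {
  // Assumption: "recent" means progress was observed within the warning threshold.
  const hasRecentProgress =
    s.lastProgressAgeMs !== undefined && s.lastProgressAgeMs <= warnMs;

  // Stale bookkeeping with no active work is the recoverable "stuck" case.
  if (s.activeWork === "none") return "session.stuck";
  // Stale ownerless model/tool activity is also treated as "stuck" after this commit.
  if (s.activeWork === "ownerless" && !hasRecentProgress) return "session.stuck";
  // Remaining cases are active work, split by whether progress is recent.
  return hasRecentProgress ? "session.long_running" : "session.stalled";
}

console.log(classifySession({ activeWork: "owned", lastProgressAgeMs: 1_000 }, 60_000));
```

Only the `session.stuck` result corresponds to the path that releases the session lane immediately; the sketch leaves out the recovery gates, which the diff does not show.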

src/logging/diagnostic.ts

Lines changed: 4 additions & 4 deletions
@@ -539,7 +539,7 @@ function isActiveAbortRecoveryEligible(params: {
   );
 }

-function isIdleQueuedEmbeddedRunStall(params: {
+function isIdleQueuedRecoverableSessionStall(params: {
   state: {
     state: SessionStateValue;
     queueDepth: number;
@@ -1235,16 +1235,16 @@ export function startDiagnosticHeartbeat(
         { sessionId: state.sessionId, sessionKey: state.sessionKey },
         now,
       );
-      const idleQueuedEmbeddedRunStall = isIdleQueuedEmbeddedRunStall({
+      const idleQueuedRecoverableStall = isIdleQueuedRecoverableSessionStall({
         state,
         activity,
         staleMs: stuckSessionWarnMs,
       });
       if (
         (state.state === "processing" && ageMs > stuckSessionWarnMs) ||
-        idleQueuedEmbeddedRunStall
+        idleQueuedRecoverableStall
       ) {
-        const attentionAgeMs = idleQueuedEmbeddedRunStall
+        const attentionAgeMs = idleQueuedRecoverableStall
           ? (activity.lastProgressAgeMs ?? ageMs)
           : ageMs;
         const classification = logSessionAttention({
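The condition and age selection in the second hunk can be isolated as a standalone sketch. The function below mirrors the expressions visible in the diff, but the `AttentionInput` shape and the name `selectAttentionAgeMs` are hypothetical, and the result of `isIdleQueuedRecoverableSessionStall` is taken as a plain boolean input because its body is not shown in this commit.

```typescript
// Hypothetical input shape; field names follow the variables visible in the diff.
interface AttentionInput {
  stateValue: string; // e.g. "processing" or "idle"
  ageMs: number; // how long the session has been in its current state
  stuckSessionWarnMs: number;
  idleQueuedRecoverableStall: boolean; // assumed to be computed elsewhere
  lastProgressAgeMs?: number;
}

// Returns the age to report when the session needs attention, or undefined otherwise.
function selectAttentionAgeMs(input: AttentionInput): number | undefined {
  const needsAttention =
    (input.stateValue === "processing" && input.ageMs > input.stuckSessionWarnMs) ||
    input.idleQueuedRecoverableStall;
  if (!needsAttention) return undefined;

  // For the idle queued stall, the age of the last observed progress is used
  // when known; otherwise fall back to the session state age.
  return input.idleQueuedRecoverableStall
    ? (input.lastProgressAgeMs ?? input.ageMs)
    : input.ageMs;
}

console.log(
  selectAttentionAgeMs({
    stateValue: "idle",
    ageMs: 2_000,
    stuckSessionWarnMs: 60_000,
    idleQueuedRecoverableStall: true,
    lastProgressAgeMs: 90_000,
  }),
); // 90000
```

The sketch shows why the rename matters for readers of the code: the second branch of the condition fires for idle sessions too, so the reported age has to come from activity rather than from the time spent in the `processing` state.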
