
Commit 2d8d50d

fix: track diagnostic progress before stuck warnings
1 parent 42b7b2b commit 2d8d50d

20 files changed

Lines changed: 289 additions & 36 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ Docs: https://docs.openclaw.ai
 - Channels/status reactions: remove stale non-terminal lifecycle reactions when a run reaches done or error, so Discord does not leave a permanent thinking emoji after completion. Fixes #75458. Thanks @davelutztx.
 - Discord/doctor: migrate unsupported per-channel `agentId` entries under guild channel config into top-level `bindings[]` routes, so `openclaw doctor --fix` preserves the intended agent route instead of stripping it as an unknown key. Fixes #62455. Thanks @lobster-biscuit.
 - Gateway/config: log config health-state write failures instead of silently hiding config observe-recovery write errors. Thanks @sallyom.
+- Diagnostics: reset stuck-session timers on reply, tool, status, block, and ACP progress events, and back off repeated `session.stuck` diagnostics while a session remains unchanged. Supersedes #72010. Thanks @rubencu.
 
 ## 2026.4.30

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-74530fefef9ed55cab302802bc0be413ec56929e73c12d4bf4f1e4d290813adc config-baseline.json
-21db87c2ebec8844e20bf66ea474c08f3adab842234ff334870fe3e8d87995b4 config-baseline.core.json
+8bbb620e445cba64aa8a451cfc1a7142ac24e8c80088d74a2fc813ee9e221680 config-baseline.json
+d145a87759d16d5f58873db337a25cb134ab25e776cd454812dca99bb9cb12a7 config-baseline.core.json
 c401cd3450f1737bc92418cfea301d20b54b7fbef9e6049834acc01af338e538 config-baseline.channel.json
 7731a0b93cb335b56fac4c807447ba659fea51ea7a6cd844dc0ef5616669ee75 config-baseline.plugin.json

docs/concepts/agent-loop.md

Lines changed: 1 addition & 1 deletion
@@ -164,7 +164,7 @@ surfaces, while Codex native hooks remain a separate lower-level Codex mechanism
 - `agent.wait` default: 30s (just the wait). `timeoutMs` param overrides.
 - Agent runtime: `agents.defaults.timeoutSeconds` default 172800s (48 hours); enforced in `runEmbeddedPiAgent` abort timer.
 - Cron runtime: isolated agent-turn `timeoutSeconds` is owned by cron. The scheduler starts that timer when execution begins, aborts the underlying run at the configured deadline, then runs bounded cleanup before recording the timeout so a stale child session cannot keep the lane stuck.
-- Session liveness diagnostics: with diagnostics enabled, `diagnostics.stuckSessionWarnMs` classifies long `processing` sessions. Active embedded runs, model calls, and tool calls report as `session.long_running`; active work with no recent progress reports as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work, and only that path releases the affected session lane so queued startup work can drain.
+- Session liveness diagnostics: with diagnostics enabled, `diagnostics.stuckSessionWarnMs` classifies long `processing` sessions that have no observed reply, tool, status, block, or ACP progress. Active embedded runs, model calls, and tool calls report as `session.long_running`; active work with no recent progress reports as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work, and only that path releases the affected session lane so queued startup work can drain. Repeated `session.stuck` diagnostics back off while the session remains unchanged.
 - Model idle timeout: OpenClaw aborts a model request when no response chunks arrive before the idle window. `models.providers.<id>.timeoutSeconds` extends this idle watchdog for slow local/self-hosted providers; otherwise OpenClaw uses `agents.defaults.timeoutSeconds` when configured, capped at 120s by default. Cron-triggered runs with no explicit model or agent timeout disable the idle watchdog and rely on the cron outer timeout.
 - Provider HTTP request timeout: `models.providers.<id>.timeoutSeconds` applies to that provider's model HTTP fetches, including connect, headers, body, SDK request timeout, total guarded-fetch abort handling, and model stream idle watchdog. Use this for slow local/self-hosted providers such as Ollama before raising the whole agent runtime timeout.
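The session liveness rule in the changed bullet above can be read as a small decision function. The following is a minimal sketch of that reading, not OpenClaw's implementation: `SessionSnapshot`, `classifySession`, and every field name are hypothetical, and how "recent progress" of active work is measured is left as a precomputed boolean because the docs do not specify it.

```typescript
// Hypothetical sketch of the documented classification; none of these names
// come from the OpenClaw codebase.
type Liveness = "ok" | "session.long_running" | "session.stalled" | "session.stuck";

interface SessionSnapshot {
  state: "idle" | "processing";
  // Timestamp (ms) of the last observed reply, tool, status, block, or ACP progress.
  lastProgressAt: number;
  // True while an embedded run, model call, or tool call is in flight.
  hasActiveWork: boolean;
  // Assumption: the caller decides whether that active work progressed recently.
  activeWorkRecentlyProgressed: boolean;
}

function classifySession(
  session: SessionSnapshot,
  now: number,
  stuckSessionWarnMs: number,
): Liveness {
  // Only sessions in `processing` are candidates for a diagnostic.
  if (session.state !== "processing") return "ok";
  // Any observed progress inside the threshold window keeps the session quiet.
  if (now - session.lastProgressAt < stuckSessionWarnMs) return "ok";
  // No active work at all: stale bookkeeping, the only path that releases the lane.
  if (!session.hasActiveWork) return "session.stuck";
  // Active work: distinguish slow-but-moving from silent.
  return session.activeWorkRecentlyProgressed ? "session.long_running" : "session.stalled";
}
```

Under this reading, a `processing` session with no active work and no progress for longer than the threshold is the only case that yields `session.stuck`.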

docs/concepts/queue.md

Lines changed: 1 addition & 1 deletion
@@ -115,7 +115,7 @@ keys.
 - If commands seem stuck, enable verbose logs and look for “queued for …ms” lines to confirm the queue is draining.
 - If you need queue depth, enable verbose logs and watch for queue timing lines.
 - Codex app-server runs that accept a turn and then stop emitting progress are interrupted by the Codex adapter so the active session lane can release instead of waiting for the outer run timeout.
-- When diagnostics are enabled, sessions that remain in `processing` past `diagnostics.stuckSessionWarnMs` are classified by current activity. Active work logs as `session.long_running`; active work with no recent progress logs as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work, and only that path can release the affected session lane so queued work drains.
+- When diagnostics are enabled, sessions that remain in `processing` past `diagnostics.stuckSessionWarnMs` with no observed reply, tool, status, block, or ACP progress are classified by current activity. Active work logs as `session.long_running`; active work with no recent progress logs as `session.stalled`; `session.stuck` is reserved for stale session bookkeeping with no active work, and only that path can release the affected session lane so queued work drains. Repeated `session.stuck` diagnostics back off while the session remains unchanged.
 
 ## Related

docs/gateway/configuration-reference.md

Lines changed: 1 addition & 1 deletion
@@ -937,7 +937,7 @@ Notes:
 
 - `enabled`: master toggle for instrumentation output (default: `true`).
 - `flags`: array of flag strings enabling targeted log output (supports wildcards like `"telegram.*"` or `"*"`).
-- `stuckSessionWarnMs`: age threshold in ms for classifying long-running processing sessions as `session.long_running`, `session.stalled`, or `session.stuck`.
+- `stuckSessionWarnMs`: no-progress age threshold in ms for classifying long-running processing sessions as `session.long_running`, `session.stalled`, or `session.stuck`. Reply, tool, status, block, and ACP progress reset the timer; repeated `session.stuck` diagnostics back off while unchanged.
 - `otel.enabled`: enables the OpenTelemetry export pipeline (default: `false`). For the full configuration, signal catalog, and privacy model, see [OpenTelemetry export](/gateway/opentelemetry).
 - `otel.endpoint`: collector URL for OTel export.
 - `otel.tracesEndpoint` / `otel.metricsEndpoint` / `otel.logsEndpoint`: optional signal-specific OTLP endpoints. When set, they override `otel.endpoint` for that signal only.
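The `stuckSessionWarnMs` entry above describes two behaviors: progress resets the no-progress timer, and repeated `session.stuck` diagnostics back off while nothing changes. A minimal sketch of that pair follows. The class name `StuckWarningTracker`, its methods, and the doubling schedule are all assumptions for illustration; the commit does not state which backoff curve OpenClaw uses.

```typescript
// Hypothetical sketch: a no-progress timer with backoff for repeated warnings.
// The doubling schedule is an assumption, not the documented behavior.
class StuckWarningTracker {
  private readonly warnMs: number;
  private warnings = 0;
  private nextWarnAt: number;

  constructor(warnMs: number, now: number) {
    this.warnMs = warnMs;
    this.nextWarnAt = now + warnMs;
  }

  // Any reply, tool, status, block, or ACP progress resets the timer and the backoff.
  markProgress(now: number): void {
    this.warnings = 0;
    this.nextWarnAt = now + this.warnMs;
  }

  // Returns true when a diagnostic should be emitted at `now`.
  shouldWarn(now: number): boolean {
    if (now < this.nextWarnAt) return false;
    this.warnings += 1;
    // Each repeated warning waits twice as long as the previous one.
    this.nextWarnAt = now + this.warnMs * 2 ** this.warnings;
    return true;
  }
}
```

With a 1000 ms threshold and no progress, this sketch warns at 1000 ms, then not again until 3000 ms, then 7000 ms; a single `markProgress` call returns it to the base threshold.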

src/auto-reply/reply/acp-projector.test.ts

Lines changed: 27 additions & 1 deletion
@@ -5,7 +5,10 @@ import { createAcpTestConfig as createCfg } from "./test-fixtures/acp-runtime.js
 
 type Delivery = { kind: string; text?: string };
 
-function createProjectorHarness(cfgOverrides?: Parameters<typeof createCfg>[0]) {
+function createProjectorHarness(
+  cfgOverrides?: Parameters<typeof createCfg>[0],
+  opts?: { onProgress?: () => void },
+) {
   const deliveries: Delivery[] = [];
   const projector = createAcpReplyProjector({
     cfg: createCfg(cfgOverrides),
@@ -14,6 +17,7 @@ function createProjectorHarness(cfgOverrides?: Parameters<typeof createCfg>[0])
       deliveries.push({ kind, text: payload.text });
       return true;
     },
+    onProgress: opts?.onProgress,
   });
   return { deliveries, projector };
 }
@@ -175,6 +179,28 @@ async function runHiddenBoundaryCase(params: {
 }
 
 describe("createAcpReplyProjector", () => {
+  it("reports progress for ACP runtime events before delivery filtering", async () => {
+    const onProgress = vi.fn();
+    const { projector } = createProjectorHarness(undefined, { onProgress });
+
+    await projector.onEvent({
+      type: "text_delta",
+      stream: "thought",
+      text: "hidden reasoning",
+      tag: "agent_message_chunk",
+    });
+    await projector.onEvent({
+      type: "tool_call",
+      tag: "tool_call",
+      toolCallId: "tool-1",
+      status: "in_progress",
+      title: "Run command",
+      text: "Run command",
+    });
+
+    expect(onProgress).toHaveBeenCalledTimes(2);
+  });
+
   it("coalesces text deltas into bounded block chunks", async () => {
     const { deliveries, projector } = createProjectorHarness();

src/auto-reply/reply/acp-projector.ts

Lines changed: 2 additions & 0 deletions
@@ -173,6 +173,7 @@ export function createAcpReplyProjector(params: {
     payload: ReplyPayload,
     meta?: AcpProjectedDeliveryMeta,
   ) => Promise<boolean>;
+  onProgress?: () => void;
   provider?: string;
   accountId?: string;
 }): AcpReplyProjector {
@@ -403,6 +404,7 @@ export function createAcpReplyProjector(params: {
   };
 
   const onEvent = async (event: AcpRuntimeEvent): Promise<void> => {
+    params.onProgress?.();
     if (event.type === "text_delta") {
       if (event.stream && event.stream !== "output") {
         return;
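The placement of `params.onProgress?.()` as the first statement of `onEvent` is what lets hidden events (for example a `thought` stream delta) still count as liveness. A reduced sketch of that ordering is shown here. `RuntimeEvent` and `createProjectorSketch` are hypothetical stand-ins for illustration, far smaller than the real `AcpRuntimeEvent` and `createAcpReplyProjector`.

```typescript
// Hypothetical, reduced model of the projector: progress is reported before
// any delivery filtering, so filtered events still reset liveness timers.
type RuntimeEvent =
  | { type: "text_delta"; stream?: string; text: string }
  | { type: "tool_call"; title: string };

function createProjectorSketch(params: {
  deliver: (text: string) => void;
  onProgress?: () => void;
}) {
  return {
    onEvent(event: RuntimeEvent): void {
      // Report progress first, regardless of what happens to the event below.
      params.onProgress?.();
      if (event.type === "text_delta") {
        // Non-output streams (such as hidden reasoning) are never delivered.
        if (event.stream && event.stream !== "output") return;
        params.deliver(event.text);
      }
      // Tool calls are not delivered in this sketch.
    },
  };
}
```

Had the callback been placed after the early `return`, a run that only emits hidden reasoning would look silent to the diagnostics even though the runtime is clearly alive.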

src/auto-reply/reply/dispatch-acp.test.ts

Lines changed: 21 additions & 0 deletions
@@ -74,6 +74,10 @@ const mediaUnderstandingMocks = vi.hoisted(() => ({
   applyMediaUnderstanding: vi.fn(async (_params: unknown) => undefined),
 }));
 
+const diagnosticMocks = vi.hoisted(() => ({
+  markDiagnosticSessionProgress: vi.fn(),
+}));
+
 const sessionMetaMocks = vi.hoisted(() => ({
   readAcpSessionEntry: vi.fn<
     (params: { sessionKey: string; cfg?: OpenClawConfig }) => AcpSessionStoreEntry | null
@@ -168,6 +172,10 @@ vi.mock("./dispatch-acp-session.runtime.js", () => ({
     sessionMetaMocks.readAcpSessionEntry(params),
 }));
 
+vi.mock("../../logging/diagnostic.js", () => ({
+  markDiagnosticSessionProgress: diagnosticMocks.markDiagnosticSessionProgress,
+}));
+
 vi.mock("./dispatch-acp-transcript.runtime.js", () => ({
   persistAcpDispatchTranscript: (params: unknown) =>
     transcriptMocks.persistAcpDispatchTranscript(params),
@@ -374,6 +382,7 @@ describe("tryDispatchAcpReply", () => {
     ttsMocks.resolveTtsConfig.mockReturnValue({ mode: "final" });
     mediaUnderstandingMocks.applyMediaUnderstanding.mockReset();
     mediaUnderstandingMocks.applyMediaUnderstanding.mockResolvedValue(undefined);
+    diagnosticMocks.markDiagnosticSessionProgress.mockReset();
     sessionMetaMocks.readAcpSessionEntry.mockReset();
     sessionMetaMocks.readAcpSessionEntry.mockReturnValue(null);
     transcriptMocks.persistAcpDispatchTranscript.mockClear();
@@ -545,6 +554,18 @@ describe("tryDispatchAcpReply", () => {
     expect(onReplyStart).toHaveBeenCalledTimes(1);
   });
 
+  it("does not mark ACP diagnostic progress when diagnostics are disabled", async () => {
+    setReadyAcpResolution();
+    mockVisibleTextTurn();
+
+    await runDispatch({
+      bodyForAgent: "visible",
+      cfg: createAcpTestConfig({ diagnostics: { enabled: false } }),
+    });
+
+    expect(diagnosticMocks.markDiagnosticSessionProgress).not.toHaveBeenCalled();
+  });
+
   it("does not start reply lifecycle for empty ACP prompt", async () => {
     setReadyAcpResolution();
     const onReplyStart = vi.fn();

src/auto-reply/reply/dispatch-acp.ts

Lines changed: 20 additions & 0 deletions
@@ -11,9 +11,11 @@ import type { OpenClawConfig } from "../../config/types.openclaw.js";
 import type { TtsAutoMode } from "../../config/types.tts.js";
 import { logVerbose } from "../../globals.js";
 import { emitAgentEvent } from "../../infra/agent-events.js";
+import { isDiagnosticsEnabled } from "../../infra/diagnostic-events.js";
 import { formatErrorMessage } from "../../infra/errors.js";
 import { generateSecureUuid } from "../../infra/secure-random.js";
 import { prefixSystemMessage } from "../../infra/system-message.js";
+import { markDiagnosticSessionProgress } from "../../logging/diagnostic.js";
 import { resolveAgentIdFromSessionKey } from "../../routing/session-key.js";
 import {
   normalizeLowercaseStringOrEmpty,
@@ -342,6 +344,23 @@ export async function tryDispatchAcpReply(params: {
   }
   const canonicalSessionKey = acpResolution.sessionKey;
   const acpAgentId = resolveAgentIdFromSessionKey(canonicalSessionKey);
+  const progressSessionKeys = isDiagnosticsEnabled(params.cfg)
+    ? Array.from(
+        new Set(
+          [params.ctx.SessionKey, sessionKey, canonicalSessionKey]
+            .map((key) => normalizeOptionalString(key))
+            .filter((key): key is string => Boolean(key)),
+        ),
+      )
+    : [];
+  const markAcpProgress =
+    progressSessionKeys.length > 0
+      ? () => {
+          for (const key of progressSessionKeys) {
+            markDiagnosticSessionProgress({ sessionKey: key });
+          }
+        }
+      : undefined;
 
   let queuedFinal = false;
   const delivery = createAcpDispatchDeliveryCoordinator({
@@ -401,6 +420,7 @@ export async function tryDispatchAcpReply(params: {
     cfg: params.cfg,
     shouldSendToolSummaries: params.shouldSendToolSummaries,
     deliver: delivery.deliver,
+    onProgress: markAcpProgress,
     provider: params.ctx.Surface ?? params.ctx.Provider,
     accountId: effectiveDispatchAccountId,
   });
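The key-collection step in the hunk above can be isolated to show why the callback is `undefined` when diagnostics are off and why each session is marked only once. The sketch below uses hypothetical names throughout: `normalizeKey` is a trimming stand-in for `normalizeOptionalString` (whose exact behavior is not shown in this commit), and `buildProgressMarker` takes the enabled flag and the mark function as plain parameters instead of calling `isDiagnosticsEnabled` and `markDiagnosticSessionProgress`.

```typescript
// Hypothetical stand-in for normalizeOptionalString: trims and drops empty values.
function normalizeKey(value: string | null | undefined): string | undefined {
  const trimmed = value?.trim();
  return trimmed ? trimmed : undefined;
}

// Builds a progress callback over the distinct, non-empty session keys, or
// returns undefined when there is nothing to mark.
function buildProgressMarker(
  diagnosticsEnabled: boolean,
  keys: Array<string | null | undefined>,
  mark: (sessionKey: string) => void,
): (() => void) | undefined {
  if (!diagnosticsEnabled) return undefined;
  const unique = Array.from(
    new Set(keys.map(normalizeKey).filter((key): key is string => Boolean(key))),
  );
  if (unique.length === 0) return undefined;
  return () => {
    for (const key of unique) {
      mark(key);
    }
  };
}
```

Returning `undefined` rather than a no-op keeps the projector's `params.onProgress?.()` call free when diagnostics are disabled, which matches the new test asserting that `markDiagnosticSessionProgress` is never called in that configuration.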

src/auto-reply/reply/dispatch-from-config.acp-abort.test.ts

Lines changed: 1 addition & 0 deletions
@@ -178,6 +178,7 @@ describe("dispatchReplyFromConfig ACP abort", () => {
     diagnosticMocks.logMessageQueued.mockReset();
     diagnosticMocks.logMessageProcessed.mockReset();
     diagnosticMocks.logSessionStateChange.mockReset();
+    diagnosticMocks.markDiagnosticSessionProgress.mockReset();
     agentEventMocks.emitAgentEvent.mockReset();
     agentEventMocks.onAgentEvent.mockReset().mockImplementation(() => () => {});
     setNoAbort();
