fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout by YonganZhang · Pull Request #83049 · openclaw/openclaw

YonganZhang · 2026-05-17T10:22:38Z

Summary

Addresses #82961.

Current runAgentCleanupStep timeout warning is just the envelope:

agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=10000

It tells operators nothing about what the flush is waiting on (queued writer work, event-loop yield, or file append IO). The issue asks for "bounded, non-secret state" alongside the timeout.

Fix

Add an optional getCleanupDiagnostic?: () => string | undefined parameter to runAgentCleanupStep. If provided, the callback is evaluated on timeout (inside a try/catch so a misbehaving probe cannot suppress the warning itself) and its return value is appended:

let diagnostic: string | undefined;
try {
  diagnostic = params.getCleanupDiagnostic?.();
} catch (error) {
  diagnostic = `diagnosticError=${formatErrorMessage(error)}`;
}
const diagnosticSuffix = diagnostic ? ` ${diagnostic}` : "";
params.log.warn(`agent cleanup timed out: ... timeoutMs=${timeoutMs}${diagnosticSuffix}`);

Wire the trajectory caller (src/agents/pi-embedded-runner/run/attempt.ts:4485) to pass a probe that reports whether the trajectory recorder was constructed for the attempt — minimal at this stage; full queued-writer counters (queuedWrites / queuedBytes / activeOp / appendSize) require TrajectoryRuntimeRecorder to expose those fields, which is a follow-up refactor that plugs into the new hook trivially.

Real behavior proof

--- OLD warning (no diagnostic):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000

--- NEW warning (no recorder constructed):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000 writer=none

--- NEW warning (recorder present):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000 writer=present recorderType=runtime

--- NEW warning (probe itself threw):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000 diagnosticError=boom

Behaviour

No callback → byte-identical warning to before this PR (full backward compat for every other runAgentCleanupStep caller).
Callback returns string → appended after timeoutMs=....
Callback throws → message is captured as diagnosticError=...; the timeout warning itself never fails.
Success path unchanged.

Scope

Two-file diff. The hook is general enough for any future cleanup step (bundle-mcp-retire, etc.) to plug into; only pi-trajectory-flush is wired now per the issue's scope.

Follow-up (out of scope)

TrajectoryRuntimeRecorder.diagnosticString?(): string to expose queuedBytes / queuedWrites / activeOp / appendSize / acceptedRuntimeBytes / droppedEvents. Trivial to plug in once that interface lands — just call the recorder method inside the existing getCleanupDiagnostic closure in attempt.ts. Happy to do that as a separate PR.

…leanup timeout `pi-trajectory-flush` (and any other step routed through `runAgentCleanupStep`) currently warns with only the envelope on timeout: agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=10000 This tells operators nothing about what the flush is waiting on (queued writer work, event-loop yield, or file append IO). The issue asks for "bounded, non-secret state" alongside the timeout. Add an optional `getCleanupDiagnostic?: () => string | undefined` parameter to `runAgentCleanupStep`. If provided, the callback is evaluated on timeout (inside a try/catch so a misbehaving probe cannot break the warning itself) and its return value is appended to the existing warning line: agent cleanup timed out: ... timeoutMs=10000 writer=present recorderType=runtime agent cleanup timed out: ... timeoutMs=10000 writer=none agent cleanup timed out: ... timeoutMs=10000 diagnosticError=boom Wire the trajectory caller in `src/agents/pi-embedded-runner/run/attempt.ts` to pass a probe that reports whether the trajectory recorder was constructed for the attempt. The probe is deliberately minimal at this stage — full queued writer stats (queuedWrites / queuedBytes / activeOp / appendSize) require the `TrajectoryRuntimeRecorder` interface to expose those counters, which is a separate refactor. The scaffold here lets that follow-up plug in trivially. Behaviour: * No diagnostic callback → identical warning to before this PR. * Callback returns string → appended after `timeoutMs=...`. * Callback throws → the thrown message is appended (no diagnostic capture failure is allowed to suppress the timeout warning). No behaviour change on the success path; only the timeout-warning content is enriched.

YonganZhang · 2026-05-17T10:23:06Z

Closing — this PR accidentally bundled untracked audit-tool output files. Re-opening with a clean diff.

openclaw-barnacle Bot added agents Agent runtime and tooling size: S labels May 17, 2026

YonganZhang closed this May 17, 2026

openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 17, 2026

YonganZhang mentioned this pull request May 17, 2026

fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout #83050

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout#83049

fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout#83049
YonganZhang wants to merge 1 commit into
openclaw:mainfrom
YonganZhang:yongan/fix-82961-trajectory-flush-timeout-diagnostic

YonganZhang commented May 17, 2026

Uh oh!

YonganZhang commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

YonganZhang commented May 17, 2026

Summary

Fix

Real behavior proof

Behaviour

Scope

Follow-up (out of scope)

Uh oh!

YonganZhang commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant