Skip to content

fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout#83049

Closed
YonganZhang wants to merge 1 commit into
openclaw:mainfrom
YonganZhang:yongan/fix-82961-trajectory-flush-timeout-diagnostic
Closed

fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout#83049
YonganZhang wants to merge 1 commit into
openclaw:mainfrom
YonganZhang:yongan/fix-82961-trajectory-flush-timeout-diagnostic

Conversation

@YonganZhang

Copy link
Copy Markdown
Contributor

Summary

Addresses #82961.

Current runAgentCleanupStep timeout warning is just the envelope:

agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=10000

It tells operators nothing about what the flush is waiting on (queued writer work, event-loop yield, or file append IO). The issue asks for "bounded, non-secret state" alongside the timeout.

Fix

Add an optional getCleanupDiagnostic?: () => string | undefined parameter to runAgentCleanupStep. If provided, the callback is evaluated on timeout (inside a try/catch so a misbehaving probe cannot suppress the warning itself) and its return value is appended:

let diagnostic: string | undefined;
try {
  diagnostic = params.getCleanupDiagnostic?.();
} catch (error) {
  diagnostic = `diagnosticError=${formatErrorMessage(error)}`;
}
const diagnosticSuffix = diagnostic ? ` ${diagnostic}` : "";
params.log.warn(`agent cleanup timed out: ... timeoutMs=${timeoutMs}${diagnosticSuffix}`);

Wire the trajectory caller (src/agents/pi-embedded-runner/run/attempt.ts:4485) to pass a probe that reports whether the trajectory recorder was constructed for the attempt — minimal at this stage; full queued-writer counters (queuedWrites / queuedBytes / activeOp / appendSize) require TrajectoryRuntimeRecorder to expose those fields, which is a follow-up refactor that plugs into the new hook trivially.

Real behavior proof

--- OLD warning (no diagnostic):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000

--- NEW warning (no recorder constructed):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000 writer=none

--- NEW warning (recorder present):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000 writer=present recorderType=runtime

--- NEW warning (probe itself threw):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000 diagnosticError=boom

Behaviour

  • No callback → byte-identical warning to before this PR (full backward compat for every other runAgentCleanupStep caller).
  • Callback returns string → appended after timeoutMs=....
  • Callback throws → message is captured as diagnosticError=...; the timeout warning itself never fails.
  • Success path unchanged.

Scope

Two-file diff. The hook is general enough for any future cleanup step (bundle-mcp-retire, etc.) to plug into; only pi-trajectory-flush is wired now per the issue's scope.

Follow-up (out of scope)

TrajectoryRuntimeRecorder.diagnosticString?(): string to expose queuedBytes / queuedWrites / activeOp / appendSize / acceptedRuntimeBytes / droppedEvents. Trivial to plug in once that interface lands — just call the recorder method inside the existing getCleanupDiagnostic closure in attempt.ts. Happy to do that as a separate PR.

…leanup timeout

`pi-trajectory-flush` (and any other step routed through
`runAgentCleanupStep`) currently warns with only the envelope on timeout:

    agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=10000

This tells operators nothing about what the flush is waiting on
(queued writer work, event-loop yield, or file append IO). The issue
asks for "bounded, non-secret state" alongside the timeout.

Add an optional `getCleanupDiagnostic?: () => string | undefined`
parameter to `runAgentCleanupStep`. If provided, the callback is
evaluated on timeout (inside a try/catch so a misbehaving probe cannot
break the warning itself) and its return value is appended to the
existing warning line:

    agent cleanup timed out: ... timeoutMs=10000 writer=present recorderType=runtime
    agent cleanup timed out: ... timeoutMs=10000 writer=none
    agent cleanup timed out: ... timeoutMs=10000 diagnosticError=boom

Wire the trajectory caller in
`src/agents/pi-embedded-runner/run/attempt.ts` to pass a probe that
reports whether the trajectory recorder was constructed for the
attempt. The probe is deliberately minimal at this stage — full queued
writer stats (queuedWrites / queuedBytes / activeOp / appendSize)
require the `TrajectoryRuntimeRecorder` interface to expose those
counters, which is a separate refactor. The scaffold here lets that
follow-up plug in trivially.

Behaviour:
  * No diagnostic callback → identical warning to before this PR.
  * Callback returns string → appended after `timeoutMs=...`.
  * Callback throws → the thrown message is appended (no diagnostic
    capture failure is allowed to suppress the timeout warning).

No behaviour change on the success path; only the timeout-warning
content is enriched.
@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S labels May 17, 2026
@YonganZhang

Copy link
Copy Markdown
Contributor Author

Closing — this PR accidentally bundled untracked audit-tool output files. Re-opening with a clean diff.

@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant