Skip to content

fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout#83050

Closed
YonganZhang wants to merge 1 commit into
openclaw:mainfrom
YonganZhang:yongan/fix-82961-cleanup-diagnostic-clean
Closed

fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout#83050
YonganZhang wants to merge 1 commit into
openclaw:mainfrom
YonganZhang:yongan/fix-82961-cleanup-diagnostic-clean

Conversation

@YonganZhang

Copy link
Copy Markdown
Contributor

Summary

Addresses #82961. (Re-opens the previously-closed #83049 with a clean two-file diff — that PR accidentally bundled untracked audit-tool output files.)

Current runAgentCleanupStep timeout warning is just the envelope:

agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=10000

It tells operators nothing about what the flush is waiting on (queued writer work, event-loop yield, or file append IO). The issue asks for "bounded, non-secret state" alongside the timeout.

Fix

Add an optional getCleanupDiagnostic?: () => string | undefined parameter to runAgentCleanupStep. If provided, the callback is evaluated on timeout (inside a try/catch so a misbehaving probe cannot suppress the warning itself) and its return value is appended:

let diagnostic: string | undefined;
try {
  diagnostic = params.getCleanupDiagnostic?.();
} catch (error) {
  diagnostic = `diagnosticError=${formatErrorMessage(error)}`;
}
const diagnosticSuffix = diagnostic ? ` ${diagnostic}` : "";
params.log.warn(`agent cleanup timed out: ... timeoutMs=${timeoutMs}${diagnosticSuffix}`);

Wire the trajectory caller (src/agents/pi-embedded-runner/run/attempt.ts:4485) to pass a probe that reports whether the trajectory recorder was constructed for the attempt — minimal at this stage; full queued-writer counters (queuedWrites / queuedBytes / activeOp / appendSize) require TrajectoryRuntimeRecorder to expose those fields, which is a follow-up refactor that plugs into the new hook trivially.

Real behavior proof

--- OLD warning (no diagnostic):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000

--- NEW warning (no recorder constructed):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000 writer=none

--- NEW warning (recorder present):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000 writer=present recorderType=runtime

--- NEW warning (probe itself threw):
  warn> agent cleanup timed out: runId=run-123 sessionId=sess-456 step=pi-trajectory-flush timeoutMs=10000 diagnosticError=boom

Behaviour

  • No callback → byte-identical warning to before this PR (full backward compat for every other runAgentCleanupStep caller).
  • Callback returns string → appended after timeoutMs=....
  • Callback throws → message is captured as diagnosticError=...; the timeout warning itself never fails.
  • Success path unchanged.

Scope

Two-file diff. The hook is general enough for any future cleanup step (bundle-mcp-retire, etc.) to plug into; only pi-trajectory-flush is wired now per the issue's scope.

Follow-up (out of scope)

TrajectoryRuntimeRecorder.diagnosticString?(): string to expose queuedBytes / queuedWrites / activeOp / appendSize / acceptedRuntimeBytes / droppedEvents. Trivial to plug in once that interface lands — just call the recorder method inside the existing getCleanupDiagnostic closure in attempt.ts. Happy to do that as a separate PR.

…leanup timeout

`pi-trajectory-flush` (and any other step routed through
`runAgentCleanupStep`) currently warns with only the envelope on timeout:

    agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=10000

This tells operators nothing about what the flush is waiting on
(queued writer work, event-loop yield, or file append IO). The issue
asks for "bounded, non-secret state" alongside the timeout.

Add an optional `getCleanupDiagnostic?: () => string | undefined`
parameter to `runAgentCleanupStep`. If provided, the callback is
evaluated on timeout (inside a try/catch so a misbehaving probe cannot
break the warning itself) and its return value is appended to the
existing warning line.

Wire the trajectory caller in
`src/agents/pi-embedded-runner/run/attempt.ts` to pass a probe that
reports whether the trajectory recorder was constructed for the
attempt. The probe is deliberately minimal at this stage — full queued
writer stats (queuedWrites / queuedBytes / activeOp / appendSize)
require the `TrajectoryRuntimeRecorder` interface to expose those
counters, which is a separate refactor. The scaffold here lets that
follow-up plug in trivially.

Behaviour:
  * No diagnostic callback → byte-identical warning to before this PR.
  * Callback returns string → appended after `timeoutMs=...`.
  * Callback throws → message is captured as `diagnosticError=...`; the
    timeout warning itself never fails.
  * Success path unchanged.
@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 17, 2026
@clawsweeper

clawsweeper Bot commented May 17, 2026

Copy link
Copy Markdown
Contributor

Thanks for the context here. I swept through the related work, and this is now duplicate or superseded.

Close this PR as superseded: the open canonical PR at #82962 covers the same linked bug with actual queued writer diagnostics, while this PR only reports recorder presence and leaves the requested state unavailable.

Canonical path: Use #82962 as the canonical fix path, then close the linked issue after that PR lands.

So I’m closing this here and keeping the remaining discussion on #82962.

Review details

Best possible solution:

Use #82962 as the canonical fix path, then close the linked issue after that PR lands.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level: current main’s timeout logger and tests show the warning lacks writer state, and the trajectory flush caller passes only the cleanup function. I did not reproduce a live stalled flush.

Is this the best way to solve the issue?

No. This PR adds a usable hook, but the wired diagnostic only reports recorder presence; the narrower maintainable fix is the canonical PR that exposes bounded queue and append state from the writer/recorder and tests it.

Security review:

Security review cleared: The diff adds logging-only internal diagnostics and does not change dependencies, permissions, secrets handling, network calls, or package execution paths.

What I checked:

Likely related people:

  • Peter Steinberger: Current-main blame in this checkout attributes the central cleanup timeout helper, trajectory flush call site, trajectory runtime, and queued writer lines to commit 83e19ca; the queued writer also has earlier refactor history in 817b581. (role: current cleanup and trajectory area contributor; confidence: high; commits: 83e19ca469e2, 817b5812e10a; files: src/agents/run-cleanup-timeout.ts, src/agents/pi-embedded-runner/run/attempt.ts, src/trajectory/runtime.ts)
  • galiniliev: The open canonical PR at fix(agents): add trajectory flush timeout diagnostics #82962 implements the same requested diagnostics across the queued writer, trajectory runtime, cleanup helper, runner call site, and tests. (role: canonical fix proposer; confidence: high; commits: ec8371545d9b; files: src/agents/run-cleanup-timeout.ts, src/agents/queued-file-writer.ts, src/trajectory/runtime.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 9feca3e11efc.

@clawsweeper clawsweeper Bot added P2 Normal backlog priority with limited blast radius. impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. labels May 17, 2026
@clawsweeper clawsweeper Bot closed this May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. P2 Normal backlog priority with limited blast radius. size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant