fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout#83050
fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout#83050YonganZhang wants to merge 1 commit into
Conversation
…leanup timeout
`pi-trajectory-flush` (and any other step routed through
`runAgentCleanupStep`) currently warns with only the envelope on timeout:
agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=10000
This tells operators nothing about what the flush is waiting on
(queued writer work, event-loop yield, or file append IO). The issue
asks for "bounded, non-secret state" alongside the timeout.
Add an optional `getCleanupDiagnostic?: () => string | undefined`
parameter to `runAgentCleanupStep`. If provided, the callback is
evaluated on timeout (inside a try/catch so a misbehaving probe cannot
break the warning itself) and its return value is appended to the
existing warning line.
Wire the trajectory caller in
`src/agents/pi-embedded-runner/run/attempt.ts` to pass a probe that
reports whether the trajectory recorder was constructed for the
attempt. The probe is deliberately minimal at this stage — full queued
writer stats (queuedWrites / queuedBytes / activeOp / appendSize)
require the `TrajectoryRuntimeRecorder` interface to expose those
counters, which is a separate refactor. The scaffold here lets that
follow-up plug in trivially.
Behaviour:
* No diagnostic callback → byte-identical warning to before this PR.
* Callback returns string → appended after `timeoutMs=...`.
* Callback throws → message is captured as `diagnosticError=...`; the
timeout warning itself never fails.
* Success path unchanged.
|
Thanks for the context here. I swept through the related work, and this is now duplicate or superseded. Close this PR as superseded: the open canonical PR at #82962 covers the same linked bug with actual queued writer diagnostics, while this PR only reports recorder presence and leaves the requested state unavailable. Canonical path: Use #82962 as the canonical fix path, then close the linked issue after that PR lands. So I’m closing this here and keeping the remaining discussion on #82962. Review detailsBest possible solution: Use #82962 as the canonical fix path, then close the linked issue after that PR lands. Do we have a high-confidence way to reproduce the issue? Yes, at source level: current main’s timeout logger and tests show the warning lacks writer state, and the trajectory flush caller passes only the cleanup function. I did not reproduce a live stalled flush. Is this the best way to solve the issue? No. This PR adds a usable hook, but the wired diagnostic only reports recorder presence; the narrower maintainable fix is the canonical PR that exposes bounded queue and append state from the writer/recorder and tests it. Security review: Security review cleared: The diff adds logging-only internal diagnostics and does not change dependencies, permissions, secrets handling, network calls, or package execution paths. What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 9feca3e11efc. |
Summary
Addresses #82961. (Re-opens the previously-closed #83049 with a clean two-file diff — that PR accidentally bundled untracked audit-tool output files.)
Current
runAgentCleanupSteptimeout warning is just the envelope:It tells operators nothing about what the flush is waiting on (queued writer work, event-loop yield, or file append IO). The issue asks for "bounded, non-secret state" alongside the timeout.
Fix
Add an optional
getCleanupDiagnostic?: () => string | undefinedparameter torunAgentCleanupStep. If provided, the callback is evaluated on timeout (inside a try/catch so a misbehaving probe cannot suppress the warning itself) and its return value is appended:Wire the trajectory caller (
src/agents/pi-embedded-runner/run/attempt.ts:4485) to pass a probe that reports whether the trajectory recorder was constructed for the attempt — minimal at this stage; full queued-writer counters (queuedWrites / queuedBytes / activeOp / appendSize) requireTrajectoryRuntimeRecorderto expose those fields, which is a follow-up refactor that plugs into the new hook trivially.Real behavior proof
Behaviour
runAgentCleanupStepcaller).timeoutMs=....diagnosticError=...; the timeout warning itself never fails.Scope
Two-file diff. The hook is general enough for any future cleanup step (
bundle-mcp-retire, etc.) to plug into; onlypi-trajectory-flushis wired now per the issue's scope.Follow-up (out of scope)
TrajectoryRuntimeRecorder.diagnosticString?(): stringto exposequeuedBytes / queuedWrites / activeOp / appendSize / acceptedRuntimeBytes / droppedEvents. Trivial to plug in once that interface lands — just call the recorder method inside the existinggetCleanupDiagnosticclosure inattempt.ts. Happy to do that as a separate PR.