fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout#83049
Closed
YonganZhang wants to merge 1 commit into
Closed
Conversation
…leanup timeout
`pi-trajectory-flush` (and any other step routed through
`runAgentCleanupStep`) currently warns with only the envelope on timeout:
agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=10000
This tells operators nothing about what the flush is waiting on
(queued writer work, event-loop yield, or file append IO). The issue
asks for "bounded, non-secret state" alongside the timeout.
Add an optional `getCleanupDiagnostic?: () => string | undefined`
parameter to `runAgentCleanupStep`. If provided, the callback is
evaluated on timeout (inside a try/catch so a misbehaving probe cannot
break the warning itself) and its return value is appended to the
existing warning line:
agent cleanup timed out: ... timeoutMs=10000 writer=present recorderType=runtime
agent cleanup timed out: ... timeoutMs=10000 writer=none
agent cleanup timed out: ... timeoutMs=10000 diagnosticError=boom
Wire the trajectory caller in
`src/agents/pi-embedded-runner/run/attempt.ts` to pass a probe that
reports whether the trajectory recorder was constructed for the
attempt. The probe is deliberately minimal at this stage — full queued
writer stats (queuedWrites / queuedBytes / activeOp / appendSize)
require the `TrajectoryRuntimeRecorder` interface to expose those
counters, which is a separate refactor. The scaffold here lets that
follow-up plug in trivially.
Behaviour:
* No diagnostic callback → identical warning to before this PR.
* Callback returns string → appended after `timeoutMs=...`.
* Callback throws → the thrown message is appended (no diagnostic
capture failure is allowed to suppress the timeout warning).
No behaviour change on the success path; only the timeout-warning
content is enriched.
Contributor
Author
|
Closing — this PR accidentally bundled untracked audit-tool output files. Re-opening with a clean diff. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses #82961.
Current
runAgentCleanupSteptimeout warning is just the envelope:It tells operators nothing about what the flush is waiting on (queued writer work, event-loop yield, or file append IO). The issue asks for "bounded, non-secret state" alongside the timeout.
Fix
Add an optional
getCleanupDiagnostic?: () => string | undefinedparameter torunAgentCleanupStep. If provided, the callback is evaluated on timeout (inside a try/catch so a misbehaving probe cannot suppress the warning itself) and its return value is appended:Wire the trajectory caller (
src/agents/pi-embedded-runner/run/attempt.ts:4485) to pass a probe that reports whether the trajectory recorder was constructed for the attempt — minimal at this stage; full queued-writer counters (queuedWrites / queuedBytes / activeOp / appendSize) requireTrajectoryRuntimeRecorderto expose those fields, which is a follow-up refactor that plugs into the new hook trivially.Real behavior proof
Behaviour
runAgentCleanupStepcaller).timeoutMs=....diagnosticError=...; the timeout warning itself never fails.Scope
Two-file diff. The hook is general enough for any future cleanup step (
bundle-mcp-retire, etc.) to plug into; onlypi-trajectory-flushis wired now per the issue's scope.Follow-up (out of scope)
TrajectoryRuntimeRecorder.diagnosticString?(): stringto exposequeuedBytes / queuedWrites / activeOp / appendSize / acceptedRuntimeBytes / droppedEvents. Trivial to plug in once that interface lands — just call the recorder method inside the existinggetCleanupDiagnosticclosure inattempt.ts. Happy to do that as a separate PR.