Skip to content

Commit 10d73a0

Browse files
committed
fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout
`pi-trajectory-flush` (and any other step routed through `runAgentCleanupStep`) currently warns with only the envelope on timeout: agent cleanup timed out: runId=... sessionId=... step=pi-trajectory-flush timeoutMs=10000 This tells operators nothing about what the flush is waiting on (queued writer work, event-loop yield, or file append IO). The issue asks for "bounded, non-secret state" alongside the timeout. Add an optional `getCleanupDiagnostic?: () => string | undefined` parameter to `runAgentCleanupStep`. If provided, the callback is evaluated on timeout (inside a try/catch so a misbehaving probe cannot break the warning itself) and its return value is appended to the existing warning line. Wire the trajectory caller in `src/agents/pi-embedded-runner/run/attempt.ts` to pass a probe that reports whether the trajectory recorder was constructed for the attempt. The probe is deliberately minimal at this stage — full queued writer stats (queuedWrites / queuedBytes / activeOp / appendSize) require the `TrajectoryRuntimeRecorder` interface to expose those counters, which is a separate refactor. The scaffold here lets that follow-up plug in trivially. Behaviour: * No diagnostic callback → byte-identical warning to before this PR. * Callback returns string → appended after `timeoutMs=...`. * Callback throws → message is captured as `diagnosticError=...`; the timeout warning itself never fails. * Success path unchanged.
1 parent 37806af commit 10d73a0

2 files changed

Lines changed: 30 additions & 1 deletion

File tree

src/agents/pi-embedded-runner/run/attempt.ts

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4490,6 +4490,20 @@ export async function runEmbeddedAttempt(
44904490
cleanup: async () => {
44914491
await trajectoryRecorder?.flush();
44924492
},
4493+
getCleanupDiagnostic: () => {
4494+
// Bounded, non-secret snapshot of what the flush is waiting on.
4495+
// Trajectory recorder is constructed lazily; if it was never
4496+
// initialized for this attempt, the flush hung on the
4497+
// `trajectoryRecorder?.flush()` Promise chain itself, not on writer IO.
4498+
if (!trajectoryRecorder) {
4499+
return "writer=none";
4500+
}
4501+
// The TrajectoryRuntimeRecorder interface does not yet surface
4502+
// queued-write / queued-byte counters; once it does (follow-up to
4503+
// this PR), expose them here. For now, signal at least that the
4504+
// writer existed and the flush was actually entered.
4505+
return "writer=present recorderType=runtime";
4506+
},
44934507
});
44944508
// Always tear down the session (and release the lock) before we leave this attempt.
44954509
//

src/agents/run-cleanup-timeout.ts

Lines changed: 16 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -57,6 +57,13 @@ export async function runAgentCleanupStep(params: {
5757
log: AgentCleanupLogger;
5858
env?: NodeJS.ProcessEnv;
5959
timeoutMs?: number;
60+
/**
61+
* Optional callback evaluated on timeout. Should return a short, bounded,
62+
* non-secret summary of what the cleanup step is waiting on (queued bytes,
63+
* pending operation, etc.) so the timeout warning is actionable instead of
64+
* just reporting the envelope. See issue #82961.
65+
*/
66+
getCleanupDiagnostic?: () => string | undefined;
6067
}): Promise<void> {
6168
const timeoutMs = resolveAgentCleanupStepTimeoutMs({
6269
step: params.step,
@@ -88,8 +95,16 @@ export async function runAgentCleanupStep(params: {
8895
clearTimeout(timeoutHandle);
8996
}
9097
if (result === "timeout") {
98+
let diagnostic: string | undefined;
99+
try {
100+
diagnostic = params.getCleanupDiagnostic?.();
101+
} catch (error) {
102+
// Diagnostic capture must never break the timeout warning itself.
103+
diagnostic = `diagnosticError=${formatErrorMessage(error)}`;
104+
}
105+
const diagnosticSuffix = diagnostic ? ` ${diagnostic}` : "";
91106
params.log.warn(
92-
`agent cleanup timed out: runId=${params.runId} sessionId=${params.sessionId} step=${params.step} timeoutMs=${timeoutMs}`,
107+
`agent cleanup timed out: runId=${params.runId} sessionId=${params.sessionId} step=${params.step} timeoutMs=${timeoutMs}${diagnosticSuffix}`,
93108
);
94109
void cleanupPromise.catch((error) => {
95110
params.log.warn(

0 commit comments

Comments
 (0)