fix: relay ACP parent completions reliably#53125
Conversation
…terSubagentRun errors Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Greptile SummaryThis PR fixes two related reliability gaps in What changed
Two minor points to consider
Confidence Score: 5/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 336-355
Comment:
**Potential double-completion via late lifecycle events**
`finalizeCompletedRelay` calls `completeSubagentRun` with `sendFarewell: true` and then disposes the relay (removing the `onAgentEvent` listener). If a late lifecycle `phase === "end"` event subsequently arrives through the gateway's own event system (not through this relay), the normal registry path will call `completeSubagentRun` a second time with a fresh `Date.now()` timestamp. Because `endedAt` will differ from the first call, `mutated` will be `true` and a second farewell message may be emitted.
In practice this only triggers if the child completes very close to the stall threshold *and* the lifecycle event was in-flight during the probe window — narrow but possible. Adding an early-exit guard in `completeSubagentRun` when `entry.endedAt` is already set (or checking `entry.endedReason`) would fully close this window.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 304-324
Comment:
**Silent catch discards diagnostic context**
The bare `catch {}` on the `callGateway("agent.wait")` call means any transient error (timeout, deserialization failure, network blip) is silently swallowed. Since `probeForCompletedRelay` is the gate that determines whether the stall/timeout warning fires, a hidden error here could cause false stall messages to re-appear in ways that are difficult to diagnose.
Consider adding a debug-level log:
```ts
} catch (err) {
log.debug("probeForCompletedRelay: agent.wait failed for run %s: %s", runId, err);
}
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix: export completeSubagentRun, use spi..." | Re-trigger Greptile |
| void probeForCompletedRelay().then(async (completed) => { | ||
| if (disposed || completed || stallNotified) { | ||
| return; | ||
| } | ||
| try { | ||
| const childOutput = await readSubagentOutput(params.childSessionKey); | ||
| if (childOutput?.trim()) { | ||
| await finalizeCompletedRelay({ status: "ok" }); | ||
| return; | ||
| } | ||
| } catch {} | ||
| if (disposed) { | ||
| return; | ||
| } | ||
| stallNotified = true; | ||
| emit( | ||
| `${relayLabel} has produced no output for ${Math.round(noOutputNoticeMs / 1000)}s. It may be waiting for interactive input.`, | ||
| `${contextPrefix}:stall`, | ||
| ); | ||
| }); |
There was a problem hiding this comment.
Potential double-completion via late lifecycle events
finalizeCompletedRelay calls completeSubagentRun with sendFarewell: true and then disposes the relay (removing the onAgentEvent listener). If a late lifecycle phase === "end" event subsequently arrives through the gateway's own event system (not through this relay), the normal registry path will call completeSubagentRun a second time with a fresh Date.now() timestamp. Because endedAt will differ from the first call, mutated will be true and a second farewell message may be emitted.
In practice this only triggers if the child completes very close to the stall threshold and the lifecycle event was in-flight during the probe window — narrow but possible. Adding an early-exit guard in completeSubagentRun when entry.endedAt is already set (or checking entry.endedReason) would fully close this window.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 336-355
Comment:
**Potential double-completion via late lifecycle events**
`finalizeCompletedRelay` calls `completeSubagentRun` with `sendFarewell: true` and then disposes the relay (removing the `onAgentEvent` listener). If a late lifecycle `phase === "end"` event subsequently arrives through the gateway's own event system (not through this relay), the normal registry path will call `completeSubagentRun` a second time with a fresh `Date.now()` timestamp. Because `endedAt` will differ from the first call, `mutated` will be `true` and a second farewell message may be emitted.
In practice this only triggers if the child completes very close to the stall threshold *and* the lifecycle event was in-flight during the probe window — narrow but possible. Adding an early-exit guard in `completeSubagentRun` when `entry.endedAt` is already set (or checking `entry.endedReason`) would fully close this window.
How can I resolve this? If you propose a fix, please make it concise.| const probeForCompletedRelay = async (timeoutMs = 1): Promise<boolean> => { | ||
| try { | ||
| const wait = await callGateway<{ | ||
| status?: string; | ||
| error?: string; | ||
| startedAt?: number; | ||
| endedAt?: number; | ||
| }>({ | ||
| method: "agent.wait", | ||
| params: { | ||
| runId, | ||
| timeoutMs: Math.max(1, Math.floor(timeoutMs)), | ||
| }, | ||
| timeoutMs: Math.max(1_500, Math.floor(timeoutMs) + 1_000), | ||
| }); | ||
| if (wait?.status === "ok" || wait?.status === "error") { | ||
| return await finalizeCompletedRelay(wait); | ||
| } | ||
| } catch {} | ||
| return false; | ||
| }; |
There was a problem hiding this comment.
Silent catch discards diagnostic context
The bare catch {} on the callGateway("agent.wait") call means any transient error (timeout, deserialization failure, network blip) is silently swallowed. Since probeForCompletedRelay is the gate that determines whether the stall/timeout warning fires, a hidden error here could cause false stall messages to re-appear in ways that are difficult to diagnose.
Consider adding a debug-level log:
} catch (err) {
log.debug("probeForCompletedRelay: agent.wait failed for run %s: %s", runId, err);
}Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 304-324
Comment:
**Silent catch discards diagnostic context**
The bare `catch {}` on the `callGateway("agent.wait")` call means any transient error (timeout, deserialization failure, network blip) is silently swallowed. Since `probeForCompletedRelay` is the gate that determines whether the stall/timeout warning fires, a hidden error here could cause false stall messages to re-appear in ways that are difficult to diagnose.
Consider adding a debug-level log:
```ts
} catch (err) {
log.debug("probeForCompletedRelay: agent.wait failed for run %s: %s", runId, err);
}
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 86544f36e4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const childOutput = await readSubagentOutput(params.childSessionKey); | ||
| if (childOutput?.trim()) { | ||
| await finalizeCompletedRelay({ status: "ok" }); |
There was a problem hiding this comment.
Stop marking stalled ACP runs complete from partial output
If the child is still running when this probe fires, agent.wait returns status: "timeout" rather than a terminal result (checked in src/gateway/server-methods/agent.ts:856-860). This fallback then treats any readSubagentOutput() text as proof of completion, but readSubagentOutput() just returns the latest assistant/tool transcript fragment from the active session (src/agents/subagent-announce.ts:370-426). So any ACP run that emits some progress and then pauses past noOutputNoticeMs/maxRelayLifetimeMs will be announced as completed and cleaned up while it is still executing, dropping later output and sending a premature final answer to the parent.
Useful? React with 👍 / 👎.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0c638055e2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // Idempotency: if the run already ended, bail out to avoid mutating endedAt | ||
| // a second time or re-triggering farewell/cleanup side-effects. | ||
| if (typeof entry.endedAt === "number" && !isRecoveryFromKill) { | ||
| return; |
There was a problem hiding this comment.
Don't bail out when
agent.wait populated endedAt first
registerSubagentRun() always launches waitForSubagentCompletion(), and that helper writes entry.endedAt / entry.outcome before it calls completeSubagentRun() (src/agents/subagent-registry.ts:1458-1474). With this new guard, the first normal terminal result from agent.wait now returns immediately for every registered run, so the cleanup/announce path never executes: no ended hook, no farewell/summary delivery, and no session cleanup/archive follow-up. This affects ordinary spawned subagent and ACP runs that rely on the default agent.wait completion path.
Useful? React with 👍 / 👎.
…zed flag The endedAt-based guard suppressed the normal first completion path because waitForSubagentCompletion sets entry.endedAt before calling completeSubagentRun. Replace with a dedicated completionFinalized flag so the first call always runs full finalization (cleanup, announce, hooks) while duplicate calls after finalization are still no-ops. Also fixes pre-existing TS errors in acp-spawn.ts and acp-spawn-parent-stream.test.ts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b1aa7e5f96
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| registerSubagentRun({ | ||
| runId: childRunId, | ||
| childSessionKey: sessionKey, | ||
| controllerSessionKey: requesterInternalKey, | ||
| requesterSessionKey: requesterInternalKey, | ||
| requesterOrigin: requesterState.origin, |
There was a problem hiding this comment.
Skip subagent-run registration for streamed ACP child sessions
Registering streamTo:"parent" ACP sessions here makes them participate in the generic subagent reactivation path. Both src/gateway/server-methods/agent.ts:658-662 and src/gateway/server-methods/sessions.ts:460-465 call reactivateCompletedSubagentSession() on later sends to the same sessionKey, and replaceSubagentRunAfterSteer() in src/agents/subagent-registry.ts:1312-1335 clones the original requesterSessionKey/requesterOrigin from this record. In mode:"session" (or any reused ACP child session), a follow-up sent directly in the ACP conversation will therefore keep routing its completion announce back to the original parent relay target instead of the conversation the user is now using.
Useful? React with 👍 / 👎.
|
A bit more user-facing context on why this fix matters: This was needed to make Claude Code usable reliably on my remote OpenClaw machine through the Telegram chat flow. Before this change, In practice that meant I could kick off Claude Code remotely from Telegram, see partial progress, and then either get a misleading no-output warning or never receive the final result in the parent conversation. This PR fixes that relay/completion path so remote Claude Code runs report back to the Telegram parent session properly. |
|
Re-checked against current The remaining live seams appear to be split across: ACP lifecycle semantics (#40885), persisted/cross-process parent relay fallback (#54701), and possibly a smaller ACP run-registration gap. I would not merge this as one broad package against current main without first carving it into current-state-focused pieces. |
|
Re-checked against current The remaining live seams appear to be split across ACP lifecycle semantics (#40885), persisted/cross-process parent relay fallback (#54701), and possibly a smaller ACP run-registration gap. I would not merge this as one broad package against current main without first carving it into current-state-focused pieces. |
|
Closing as superseded. This branch was useful diagnosis, but it is now too bundled against current main. The remaining live pieces have been split into narrower current-state fixes: ACP lifecycle/registration in #40885 and persisted/cross-process parent relay fallback in #54701. Keeping this open now adds review noise and overlap rather than a clean merge path. Thanks for the investigation and for narrowing the failure mode. |
Summary
streamTo:"parent"ACP runs so the normal subagent completion announce path can deliver final resultsRoot cause
streamTo:"parent"ACP runs started the parent relay and returned without registering a subagent run. That meant the completion/announce cleanup path had no run record to resolve against, so successful child transcripts could finish without ever relaying the final user-facing completion. Separately, the parent relay could emit a false stall warning when projected assistant/lifecycle events were silent even though the child had already completed.Verification
git diff --checkpasses in this source cloneNotes
I could not run the repo test suite in this clone because the local environment here does not have the project toolchain installed (
pnpm/repo dependencies unavailable in-path for test execution).