Fix ACP parent completion relay#54701
Conversation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Proof for this fix:
I also generated local PNG/MP4 proof artifacts from the same run; GitHub CLI/gist can publish the text artifacts directly, but binary attachment upload is not available from this CLI path. |
Greptile SummaryThis PR fixes a production reliability issue where parent relays for ACP children that completed in a different gateway process would stall indefinitely — never receiving the terminal lifecycle event from the local in-memory event bus — and only time out after the 6-hour Key changes:
Confidence Score: 5/5
Reviews (1): Last reviewed commit: "Fix ACP parent completion relay" | Re-trigger Greptile |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d04d56e19
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa34a27dc4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| persistedRelayState?.state === "idle" && | ||
| (sawPersistedRunningState || persistedStateAdvanced) | ||
| ) { |
There was a problem hiding this comment.
Scope persisted terminal fallback to the spawned run
This completion path trusts session-level acp.state changes without correlating them to this relay’s runId, so a later turn in the same child session can terminate the relay for the wrong run when local lifecycle events are missing. In session mode, another turn can update acp.state before the next poll (default 15s), and this branch will emit run completed/run failed for that later turn because the persisted state is shared across turns. Please gate persisted fallback with a run-specific signal (or persisted run-start timestamp) before disposing.
Useful? React with 👍 / 👎.
|
Maintainer-ready status update: the bot feedback on this PR has already been absorbed in the current branch state, the PR is |
|
Contributor-side this PR looks ready to merge: merge state is CLEAN, visible checks are green, the diff stays narrowly scoped to the ACP parent completion relay implementation/tests, and my attempt to enable auto-merge was blocked only by missing MergePullRequest permission. I do not see a remaining contributor-side blocker here. |
|
Re-checked against current This is currently my leading candidate for the cross-process fallback side of the ACP relay family. |
1 similar comment
|
Re-checked against current This is currently my leading candidate for the cross-process fallback side of the ACP relay family. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6b450d41b7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a208b0364
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| persistedRelayState?.state === "running" && | ||
| persistedStateAdvanced && | ||
| persistedRelayState.lastActivityAt != null && | ||
| persistedRelayState.lastActivityAt > relayStartedAt | ||
| ) { |
There was a problem hiding this comment.
Trust post-start terminal snapshots when running state is missed
The persisted fallback only marks a run as trustworthy after observing a polled running state, but with the default 15s poll interval a child run can transition running -> idle/error entirely between polls. In the exact failure mode this patch targets (no local lifecycle events on this process), that leaves sawPersistedRunningTransition false and the relay never emits completion/error, so it still times out hours later. This is reproducible for short ACP runs that finish before the first poll.
Useful? React with 👍 / 👎.
|
Closing as not merge-ready. Re-reviewing the current branch against the latest comments shows a real data-model gap, not a small condition bug: the persisted fallback is session-scoped, but correctness here needs a run-scoped persisted marker (for example |
Summary
startAcpSpawnParentStreamRelay()so parent relays still emit terminal completion/error notices when the local in-memory agent-event bus never receives the child lifecycle eventidlestate and only trusting persisted state that became current for this relay or was seen transitioning throughrunningidle, persistederror, and stale pre-runidlestateRoot cause
streamTo:"parent"relays were driven entirely by the localonAgentEvent()bus. When the ACP child finished in a different gateway process, the parent relay could still emit its ownstart/stalltimers but never saw the terminalend/errorlifecycle event, so it timed out after hours instead of notifying the parent that the child had finished.Validation
pnpm exec oxfmt --check src/agents/acp-spawn-parent-stream.ts src/agents/acp-spawn-parent-stream.test.tspnpm exec oxlint --type-aware src/agents/acp-spawn-parent-stream.ts src/agents/acp-spawn-parent-stream.test.tspnpm exec vitest --run src/agents/acp-spawn-parent-stream.test.ts src/agents/acp-spawn.test.tspnpm exec vitest --run src/agents/acp-spawn-parent-stream.test.ts src/agents/acp-spawn.test.tscodex run completed./codex run failed: provider disconnectedfrom persisted ACP session stateCloses #54690.