Skip to content

Fix ACP parent completion relay#54701

Closed
giulio-leone wants to merge 4 commits into
openclaw:mainfrom
giulio-leone:fix/54690-acp-parent-completion-relay
Closed

Fix ACP parent completion relay#54701
giulio-leone wants to merge 4 commits into
openclaw:mainfrom
giulio-leone:fix/54690-acp-parent-completion-relay

Conversation

@giulio-leone

Copy link
Copy Markdown
Contributor

Summary

  • add a persisted ACP-session-state fallback in startAcpSpawnParentStreamRelay() so parent relays still emit terminal completion/error notices when the local in-memory agent-event bus never receives the child lifecycle event
  • guard against false positives by ignoring stale pre-run idle state and only trusting persisted state that became current for this relay or was seen transitioning through running
  • add relay regressions for persisted idle, persisted error, and stale pre-run idle state

Root cause

streamTo:"parent" relays were driven entirely by the local onAgentEvent() bus. When the ACP child finished in a different gateway process, the parent relay could still emit its own start/stall timers but never saw the terminal end/error lifecycle event, so it timed out after hours instead of notifying the parent that the child had finished.

Validation

  • pnpm exec oxfmt --check src/agents/acp-spawn-parent-stream.ts src/agents/acp-spawn-parent-stream.test.ts
  • pnpm exec oxlint --type-aware src/agents/acp-spawn-parent-stream.ts src/agents/acp-spawn-parent-stream.test.ts
  • pnpm exec vitest --run src/agents/acp-spawn-parent-stream.test.ts src/agents/acp-spawn.test.ts
  • pnpm exec vitest --run src/agents/acp-spawn-parent-stream.test.ts src/agents/acp-spawn.test.ts
  • filesystem-backed runtime proof with no child lifecycle events emitted on the local bus; the parent still receives codex run completed. / codex run failed: provider disconnected from persisted ACP session state

Closes #54690.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S labels Mar 25, 2026
@giulio-leone

Copy link
Copy Markdown
Contributor Author

Proof for this fix:

  • repo-native validation passed twice:
    • pnpm exec oxfmt --check src/agents/acp-spawn-parent-stream.ts src/agents/acp-spawn-parent-stream.test.ts
    • pnpm exec oxlint --type-aware src/agents/acp-spawn-parent-stream.ts src/agents/acp-spawn-parent-stream.test.ts
    • pnpm exec vitest --run src/agents/acp-spawn-parent-stream.test.ts src/agents/acp-spawn.test.ts (run twice)
  • filesystem-backed runtime proof exercised the failure mode from this issue: the parent relay received no child lifecycle events on the local in-memory bus, and still emitted terminal notices from persisted ACP session state
  • observed terminal events:
    • codex run completed.
    • codex run failed: provider disconnected
  • text artifacts (summary/json/log): https://gist.github.com/giulio-leone/7cf86f5bc3ad72acf7763440ebfa9fde

I also generated local PNG/MP4 proof artifacts from the same run; GitHub CLI/gist can publish the text artifacts directly, but binary attachment upload is not available from this CLI path.

@greptile-apps

greptile-apps Bot commented Mar 25, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a production reliability issue where parent relays for ACP children that completed in a different gateway process would stall indefinitely — never receiving the terminal lifecycle event from the local in-memory event bus — and only time out after the 6-hour maxRelayLifetimeMs deadline. The fix adds a filesystem-backed persisted-state polling fallback inside the existing noOutputWatcherTimer interval, with carefully guarded anti-false-positive conditions.

Key changes:

  • Adds readPersistedAcpRelayState() helper that reads readAcpSessionEntry and normalises the ACP state, with full error containment via try/catch.
  • In noOutputWatcherTimer, polls persisted ACP state on every tick and emits a terminal done:persisted or error:persisted event when the child session key transitions through running → idle or running → error in the persisted store.
  • Guards stale pre-run idle state via persistedActivityAt >= relayStartedAt (timestamp fence) and a sawPersistedRunningState flag, preventing false-positive completions from a previous run's residual state.
  • Adds three targeted regression tests covering persisted idle, persisted error, and stale pre-run idle guard.

Confidence Score: 5/5

  • PR is safe to merge — the fix is well-scoped, logic is sound, and new tests cover all three new code paths.
  • No correctness bugs found. The timestamp fence (persistedActivityAt >= relayStartedAt) and the sawPersistedRunningState flag together correctly prevent false positives from stale pre-run idle/error states. The try/catch in readPersistedAcpRelayState ensures file I/O errors never break relay flow. The dispose() idempotency guard prevents double-emit races between the event bus path and the new persisted-state path. All three new test scenarios validate exactly the cases described in the PR description.
  • No files require special attention.

Reviews (1): Last reviewed commit: "Fix ACP parent completion relay" | Re-trigger Greptile

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d04d56e19

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/acp-spawn-parent-stream.ts Outdated
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa34a27dc4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +308 to +310
persistedRelayState?.state === "idle" &&
(sawPersistedRunningState || persistedStateAdvanced)
) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope persisted terminal fallback to the spawned run

This completion path trusts session-level acp.state changes without correlating them to this relay’s runId, so a later turn in the same child session can terminate the relay for the wrong run when local lifecycle events are missing. In session mode, another turn can update acp.state before the next poll (default 15s), and this branch will emit run completed/run failed for that later turn because the persisted state is shared across turns. Please gate persisted fallback with a run-specific signal (or persisted run-start timestamp) before disposing.

Useful? React with 👍 / 👎.

@giulio-leone

Copy link
Copy Markdown
Contributor Author

Maintainer-ready status update: the bot feedback on this PR has already been absorbed in the current branch state, the PR is CLEAN/mergeable, and the current check suite is green across the matrix (including Windows shards). No further contributor-side changes are pending from my side.

@giulio-leone

Copy link
Copy Markdown
Contributor Author

Contributor-side this PR looks ready to merge: merge state is CLEAN, visible checks are green, the diff stays narrowly scoped to the ACP parent completion relay implementation/tests, and my attempt to enable auto-merge was blocked only by missing MergePullRequest permission. I do not see a remaining contributor-side blocker here.

@vincentkoc

Copy link
Copy Markdown
Member

Re-checked against current origin/main on 2026-03-29. This one still looks live. acp-spawn-parent-stream.ts still appears to rely on the local in-memory onAgentEvent() bus without a persisted-state fallback, so the cross-process parent completion gap is still plausible on current main.

This is currently my leading candidate for the cross-process fallback side of the ACP relay family.

1 similar comment
@vincentkoc

Copy link
Copy Markdown
Member

Re-checked against current origin/main on 2026-03-29. This one still looks live. acp-spawn-parent-stream.ts still appears to rely on the local in-memory onAgentEvent() bus without a persisted-state fallback, so the cross-process parent completion gap is still plausible on current main.

This is currently my leading candidate for the cross-process fallback side of the ACP relay family.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b450d41b7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/acp-spawn-parent-stream.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a208b0364

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +306 to +310
persistedRelayState?.state === "running" &&
persistedStateAdvanced &&
persistedRelayState.lastActivityAt != null &&
persistedRelayState.lastActivityAt > relayStartedAt
) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Trust post-start terminal snapshots when running state is missed

The persisted fallback only marks a run as trustworthy after observing a polled running state, but with the default 15s poll interval a child run can transition running -> idle/error entirely between polls. In the exact failure mode this patch targets (no local lifecycle events on this process), that leaves sawPersistedRunningTransition false and the relay never emits completion/error, so it still times out hours later. This is reproducible for short ACP runs that finish before the first poll.

Useful? React with 👍 / 👎.

@vincentkoc

Copy link
Copy Markdown
Member

Closing as not merge-ready. Re-reviewing the current branch against the latest comments shows a real data-model gap, not a small condition bug: the persisted fallback is session-scoped, but correctness here needs a run-scoped persisted marker (for example currentRunId, lastCompletedRunId, or a persisted run-start timestamp tied to this relay). Without that, we are stuck between two bad choices: trust session-level terminal state too easily and complete the wrong run, or require a polled running transition and miss short runs entirely. The safer next step is a fresh narrower change that first persists a run-scoped ACP marker, then a follow-up relay patch that consumes it. Thanks for the investigation and for narrowing the cross-process completion failure mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: ACP sessions_spawn with streamTo:"parent" never delivers completion event to parent session

2 participants