fix: relay ACP parent completions reliably by fedgaltc-code · Pull Request #53125 · openclaw/openclaw

fedgaltc-code · 2026-03-23T20:20:13Z

Summary

probe ACP parent relays for completed runs before emitting stall/timeout warnings
fall back to child transcript reads when completion output is missing from projected history
register streamTo:"parent" ACP runs so the normal subagent completion announce path can deliver final results

Root cause

streamTo:"parent" ACP runs started the parent relay and returned without registering a subagent run. That meant the completion/announce cleanup path had no run record to resolve against, so successful child transcripts could finish without ever relaying the final user-facing completion. Separately, the parent relay could emit a false stall warning when projected assistant/lifecycle events were silent even though the child had already completed.

Verification

verified locally on the installed OpenClaw package with live Telegram -> OpenClaw -> Claude ACP runs
successful runs no longer show the false 60s stall warning
parent streams now emit terminal completion events
final completion is delivered back to chat
git diff --check passes in this source clone

Notes

I could not run the repo test suite in this clone because the local environment here does not have the project toolchain installed (pnpm/repo dependencies unavailable in-path for test execution).

…terSubagentRun errors Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-03-23T20:25:38Z

Greptile Summary

This PR fixes two related reliability gaps in streamTo:"parent" ACP runs: (1) the run was never registered in subagentRuns, so the normal announce/cleanup path had no record to resolve final completions against, and (2) the parent relay could emit a false 60-second stall warning even when the child had already finished (because projected lifecycle events were silent).

What changed

acp-spawn.ts: calls registerSubagentRun for streamTo:"parent" runs, inside the same effectiveStreamToParent guard that starts the relay. The relay is correctly re-created when the gateway returns a runId that differs from the idempotency key, ensuring both the relay and the registry use the same runId.
acp-spawn-parent-stream.ts: adds probeForCompletedRelay (a short agent.wait call to the gateway) and a readSubagentOutput transcript fallback that are checked before the stall/timeout message is emitted. If either confirms the child finished, finalizeCompletedRelay is called to emit the completion message and invoke completeSubagentRun instead.
subagent-registry.ts: completeSubagentRun is exported so the relay can call it from the probing path.
subagent-announce.ts: readSubagentOutput is exported and gains a synchronous JSONL transcript fallback (readSubagentOutputFromTranscript) as a last resort after the gateway history and latest-assistant-reply lookups.
Tests across all three files migrate from vi.mock factory patterns to vi.spyOn + per-test mockRestore(), and two new tests cover the probe and transcript-fallback paths.

Two minor points to consider

A narrow double-completion window exists if a late lifecycle phase:"end" event arrives after finalizeCompletedRelay has already called completeSubagentRun via the probe path (see inline comment at the stall-watcher block).
The bare catch {} inside probeForCompletedRelay discards diagnostic context; a debug log would make transient probe failures easier to trace.

Confidence Score: 5/5

Safe to merge — the fix correctly addresses the root cause and the two P2 points are edge-case concerns, not blockers.
The core bug (missing registerSubagentRun for streamTo:"parent" runs) is fixed with a minimal, well-scoped change. The probing additions are well-guarded (disposed, stallNotified, and completed checks prevent redundant work). Test coverage is improved via the vi.spyOn migration and two new scenario tests. The double-completion concern only manifests in a narrow race where lifecycle events are in-flight during the probe window, and the existing mutated guard in completeSubagentRun provides partial protection.
No files require special attention, though the double-completion edge case in acp-spawn-parent-stream.ts around finalizeCompletedRelay is worth a follow-up hardening pass.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 336-355

Comment:
**Potential double-completion via late lifecycle events**

`finalizeCompletedRelay` calls `completeSubagentRun` with `sendFarewell: true` and then disposes the relay (removing the `onAgentEvent` listener). If a late lifecycle `phase === "end"` event subsequently arrives through the gateway's own event system (not through this relay), the normal registry path will call `completeSubagentRun` a second time with a fresh `Date.now()` timestamp. Because `endedAt` will differ from the first call, `mutated` will be `true` and a second farewell message may be emitted.

In practice this only triggers if the child completes very close to the stall threshold *and* the lifecycle event was in-flight during the probe window — narrow but possible. Adding an early-exit guard in `completeSubagentRun` when `entry.endedAt` is already set (or checking `entry.endedReason`) would fully close this window.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 304-324

Comment:
**Silent catch discards diagnostic context**

The bare `catch {}` on the `callGateway("agent.wait")` call means any transient error (timeout, deserialization failure, network blip) is silently swallowed. Since `probeForCompletedRelay` is the gate that determines whether the stall/timeout warning fires, a hidden error here could cause false stall messages to re-appear in ways that are difficult to diagnose.

Consider adding a debug-level log:
```ts
} catch (err) {
  log.debug("probeForCompletedRelay: agent.wait failed for run %s: %s", runId, err);
}
```

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "fix: export completeSubagentRun, use spi..." | Re-trigger Greptile}

greptile-apps · 2026-03-23T20:25:42Z

+    void probeForCompletedRelay().then(async (completed) => {
+      if (disposed || completed || stallNotified) {
+        return;
+      }
+      try {
+        const childOutput = await readSubagentOutput(params.childSessionKey);
+        if (childOutput?.trim()) {
+          await finalizeCompletedRelay({ status: "ok" });
+          return;
+        }
+      } catch {}
+      if (disposed) {
+        return;
+      }
+      stallNotified = true;
+      emit(
+        `${relayLabel} has produced no output for ${Math.round(noOutputNoticeMs / 1000)}s. It may be waiting for interactive input.`,
+        `${contextPrefix}:stall`,
+      );
+    });


Potential double-completion via late lifecycle events

finalizeCompletedRelay calls completeSubagentRun with sendFarewell: true and then disposes the relay (removing the onAgentEvent listener). If a late lifecycle phase === "end" event subsequently arrives through the gateway's own event system (not through this relay), the normal registry path will call completeSubagentRun a second time with a fresh Date.now() timestamp. Because endedAt will differ from the first call, mutated will be true and a second farewell message may be emitted.

In practice this only triggers if the child completes very close to the stall threshold and the lifecycle event was in-flight during the probe window — narrow but possible. Adding an early-exit guard in completeSubagentRun when entry.endedAt is already set (or checking entry.endedReason) would fully close this window.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/agents/acp-spawn-parent-stream.ts Line: 336-355 Comment: **Potential double-completion via late lifecycle events** `finalizeCompletedRelay` calls `completeSubagentRun` with `sendFarewell: true` and then disposes the relay (removing the `onAgentEvent` listener). If a late lifecycle `phase === "end"` event subsequently arrives through the gateway's own event system (not through this relay), the normal registry path will call `completeSubagentRun` a second time with a fresh `Date.now()` timestamp. Because `endedAt` will differ from the first call, `mutated` will be `true` and a second farewell message may be emitted. In practice this only triggers if the child completes very close to the stall threshold *and* the lifecycle event was in-flight during the probe window — narrow but possible. Adding an early-exit guard in `completeSubagentRun` when `entry.endedAt` is already set (or checking `entry.endedReason`) would fully close this window. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-03-23T20:25:43Z

+  const probeForCompletedRelay = async (timeoutMs = 1): Promise<boolean> => {
+    try {
+      const wait = await callGateway<{
+        status?: string;
+        error?: string;
+        startedAt?: number;
+        endedAt?: number;
+      }>({
+        method: "agent.wait",
+        params: {
+          runId,
+          timeoutMs: Math.max(1, Math.floor(timeoutMs)),
+        },
+        timeoutMs: Math.max(1_500, Math.floor(timeoutMs) + 1_000),
+      });
+      if (wait?.status === "ok" || wait?.status === "error") {
+        return await finalizeCompletedRelay(wait);
+      }
+    } catch {}
+    return false;
+  };


Silent catch discards diagnostic context

The bare catch {} on the callGateway("agent.wait") call means any transient error (timeout, deserialization failure, network blip) is silently swallowed. Since probeForCompletedRelay is the gate that determines whether the stall/timeout warning fires, a hidden error here could cause false stall messages to re-appear in ways that are difficult to diagnose.

Consider adding a debug-level log:

} catch (err) { log.debug("probeForCompletedRelay: agent.wait failed for run %s: %s", runId, err); }

Prompt To Fix With AI

This is a comment left during a code review. Path: src/agents/acp-spawn-parent-stream.ts Line: 304-324 Comment: **Silent catch discards diagnostic context** The bare `catch {}` on the `callGateway("agent.wait")` call means any transient error (timeout, deserialization failure, network blip) is silently swallowed. Since `probeForCompletedRelay` is the gate that determines whether the stall/timeout warning fires, a hidden error here could cause false stall messages to re-appear in ways that are difficult to diagnose. Consider adding a debug-level log: ```ts } catch (err) { log.debug("probeForCompletedRelay: agent.wait failed for run %s: %s", runId, err); } ``` How can I resolve this? If you propose a fix, please make it concise.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86544f36e4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-23T20:27:21Z

+        const childOutput = await readSubagentOutput(params.childSessionKey);
+        if (childOutput?.trim()) {
+          await finalizeCompletedRelay({ status: "ok" });


Stop marking stalled ACP runs complete from partial output

If the child is still running when this probe fires, agent.wait returns status: "timeout" rather than a terminal result (checked in src/gateway/server-methods/agent.ts:856-860). This fallback then treats any readSubagentOutput() text as proof of completion, but readSubagentOutput() just returns the latest assistant/tool transcript fragment from the active session (src/agents/subagent-announce.ts:370-426). So any ACP run that emits some progress and then pauses past noOutputNoticeMs/maxRelayLifetimeMs will be announced as completed and cleaned up while it is still executing, dropping later output and sending a premature final answer to the parent.

Useful? React with 👍 / 👎.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c638055e2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-23T21:25:26Z

+  // Idempotency: if the run already ended, bail out to avoid mutating endedAt
+  // a second time or re-triggering farewell/cleanup side-effects.
+  if (typeof entry.endedAt === "number" && !isRecoveryFromKill) {
+    return;


Don't bail out when agent.wait populated endedAt first

registerSubagentRun() always launches waitForSubagentCompletion(), and that helper writes entry.endedAt / entry.outcome before it calls completeSubagentRun() (src/agents/subagent-registry.ts:1458-1474). With this new guard, the first normal terminal result from agent.wait now returns immediately for every registered run, so the cleanup/announce path never executes: no ended hook, no farewell/summary delivery, and no session cleanup/archive follow-up. This affects ordinary spawned subagent and ACP runs that rely on the default agent.wait completion path.

Useful? React with 👍 / 👎.

…zed flag The endedAt-based guard suppressed the normal first completion path because waitForSubagentCompletion sets entry.endedAt before calling completeSubagentRun. Replace with a dedicated completionFinalized flag so the first call always runs full finalization (cleanup, announce, hooks) while duplicate calls after finalization are still no-ops. Also fixes pre-existing TS errors in acp-spawn.ts and acp-spawn-parent-stream.test.ts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1aa7e5f96

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-23T23:16:27Z

+      registerSubagentRun({
+        runId: childRunId,
+        childSessionKey: sessionKey,
+        controllerSessionKey: requesterInternalKey,
+        requesterSessionKey: requesterInternalKey,
+        requesterOrigin: requesterState.origin,


Skip subagent-run registration for streamed ACP child sessions

Registering streamTo:"parent" ACP sessions here makes them participate in the generic subagent reactivation path. Both src/gateway/server-methods/agent.ts:658-662 and src/gateway/server-methods/sessions.ts:460-465 call reactivateCompletedSubagentSession() on later sends to the same sessionKey, and replaceSubagentRunAfterSteer() in src/agents/subagent-registry.ts:1312-1335 clones the original requesterSessionKey/requesterOrigin from this record. In mode:"session" (or any reused ACP child session), a follow-up sent directly in the ACP conversation will therefore keep routing its completion announce back to the original parent relay target instead of the conversation the user is now using.

Useful? React with 👍 / 👎.

fedgaltc-code · 2026-03-24T11:11:23Z

A bit more user-facing context on why this fix matters:

This was needed to make Claude Code usable reliably on my remote OpenClaw machine through the Telegram chat flow.

Before this change, streamTo:"parent" ACP runs could appear to stall in the parent session and, in the successful path, the final completion often never made it back to the chat even though the child run had actually finished.

In practice that meant I could kick off Claude Code remotely from Telegram, see partial progress, and then either get a misleading no-output warning or never receive the final result in the parent conversation.

This PR fixes that relay/completion path so remote Claude Code runs report back to the Telegram parent session properly.

vincentkoc · 2026-03-29T07:59:41Z

Re-checked against current origin/main on 2026-03-29. This branch still contains useful diagnosis, but as a merge candidate it is now too bundled.

The remaining live seams appear to be split across: ACP lifecycle semantics (#40885), persisted/cross-process parent relay fallback (#54701), and possibly a smaller ACP run-registration gap. I would not merge this as one broad package against current main without first carving it into current-state-focused pieces.

vincentkoc · 2026-03-29T08:00:40Z

Re-checked against current origin/main on 2026-03-29. This branch still contains useful diagnosis, but as a merge candidate it is now too bundled.

The remaining live seams appear to be split across ACP lifecycle semantics (#40885), persisted/cross-process parent relay fallback (#54701), and possibly a smaller ACP run-registration gap. I would not merge this as one broad package against current main without first carving it into current-state-focused pieces.

vincentkoc · 2026-03-29T09:03:11Z

Closing as superseded. This branch was useful diagnosis, but it is now too bundled against current main. The remaining live pieces have been split into narrower current-state fixes: ACP lifecycle/registration in #40885 and persisted/cross-process parent relay fallback in #54701. Keeping this open now adds review noise and overlap rather than a clean merge path. Thanks for the investigation and for narrowing the failure mode.

Fed and others added 3 commits March 23, 2026 18:04

fix: relay ACP parent completions reliably

cdc2276

fix: relay ACP parent completions reliably

e60e02b

fix: export completeSubagentRun, use spies in stream tests, log regis…

86544f3

…terSubagentRun errors Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openclaw-barnacle Bot added agents Agent runtime and tooling size: M labels Mar 23, 2026

greptile-apps Bot reviewed Mar 23, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Mar 23, 2026

View reviewed changes

Fed and others added 2 commits March 23, 2026 21:16

fix: make completeSubagentRun idempotent for already-ended runs

283f0c2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: log probe errors in probeForCompletedRelay for diagnosability

0c63805

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openclaw-barnacle Bot added size: L and removed size: M labels Mar 23, 2026

chatgpt-codex-connector Bot reviewed Mar 23, 2026

View reviewed changes

vincentkoc self-assigned this Mar 27, 2026

vincentkoc mentioned this pull request Mar 29, 2026

fix: register ACP runs in subagent registry and enable delivery to external channels #46412

Closed

5 tasks

vincentkoc closed this Mar 29, 2026

Uh oh!

Conversation

fedgaltc-code commented Mar 23, 2026

Summary

Root cause

Verification

Notes

Uh oh!

greptile-apps Bot commented Mar 23, 2026

Greptile Summary

Confidence Score: 5/5

Uh oh!

greptile-apps Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

fedgaltc-code commented Mar 24, 2026

Uh oh!

vincentkoc commented Mar 29, 2026

Uh oh!

vincentkoc commented Mar 29, 2026

Uh oh!

vincentkoc commented Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants