Skip to content

fix: relay ACP parent completions reliably#53125

Closed
fedgaltc-code wants to merge 6 commits into
openclaw:mainfrom
fedgaltc-code:fix/acp-parent-relay-completion
Closed

fix: relay ACP parent completions reliably#53125
fedgaltc-code wants to merge 6 commits into
openclaw:mainfrom
fedgaltc-code:fix/acp-parent-relay-completion

Conversation

@fedgaltc-code

Copy link
Copy Markdown

Summary

  • probe ACP parent relays for completed runs before emitting stall/timeout warnings
  • fall back to child transcript reads when completion output is missing from projected history
  • register streamTo:"parent" ACP runs so the normal subagent completion announce path can deliver final results

Root cause

streamTo:"parent" ACP runs started the parent relay and returned without registering a subagent run. That meant the completion/announce cleanup path had no run record to resolve against, so successful child transcripts could finish without ever relaying the final user-facing completion. Separately, the parent relay could emit a false stall warning when projected assistant/lifecycle events were silent even though the child had already completed.

Verification

  • verified locally on the installed OpenClaw package with live Telegram -> OpenClaw -> Claude ACP runs
  • successful runs no longer show the false 60s stall warning
  • parent streams now emit terminal completion events
  • final completion is delivered back to chat
  • git diff --check passes in this source clone

Notes

I could not run the repo test suite in this clone because the local environment here does not have the project toolchain installed (pnpm/repo dependencies unavailable in-path for test execution).

Fed and others added 3 commits March 23, 2026 18:04
@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: M labels Mar 23, 2026
@greptile-apps

greptile-apps Bot commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes two related reliability gaps in streamTo:"parent" ACP runs: (1) the run was never registered in subagentRuns, so the normal announce/cleanup path had no record to resolve final completions against, and (2) the parent relay could emit a false 60-second stall warning even when the child had already finished (because projected lifecycle events were silent).

What changed

  • acp-spawn.ts: calls registerSubagentRun for streamTo:"parent" runs, inside the same effectiveStreamToParent guard that starts the relay. The relay is correctly re-created when the gateway returns a runId that differs from the idempotency key, ensuring both the relay and the registry use the same runId.
  • acp-spawn-parent-stream.ts: adds probeForCompletedRelay (a short agent.wait call to the gateway) and a readSubagentOutput transcript fallback that are checked before the stall/timeout message is emitted. If either confirms the child finished, finalizeCompletedRelay is called to emit the completion message and invoke completeSubagentRun instead.
  • subagent-registry.ts: completeSubagentRun is exported so the relay can call it from the probing path.
  • subagent-announce.ts: readSubagentOutput is exported and gains a synchronous JSONL transcript fallback (readSubagentOutputFromTranscript) as a last resort after the gateway history and latest-assistant-reply lookups.
  • Tests across all three files migrate from vi.mock factory patterns to vi.spyOn + per-test mockRestore(), and two new tests cover the probe and transcript-fallback paths.

Two minor points to consider

  • A narrow double-completion window exists if a late lifecycle phase:"end" event arrives after finalizeCompletedRelay has already called completeSubagentRun via the probe path (see inline comment at the stall-watcher block).
  • The bare catch {} inside probeForCompletedRelay discards diagnostic context; a debug log would make transient probe failures easier to trace.

Confidence Score: 5/5

  • Safe to merge — the fix correctly addresses the root cause and the two P2 points are edge-case concerns, not blockers.
  • The core bug (missing registerSubagentRun for streamTo:"parent" runs) is fixed with a minimal, well-scoped change. The probing additions are well-guarded (disposed, stallNotified, and completed checks prevent redundant work). Test coverage is improved via the vi.spyOn migration and two new scenario tests. The double-completion concern only manifests in a narrow race where lifecycle events are in-flight during the probe window, and the existing mutated guard in completeSubagentRun provides partial protection.
  • No files require special attention, though the double-completion edge case in acp-spawn-parent-stream.ts around finalizeCompletedRelay is worth a follow-up hardening pass.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 336-355

Comment:
**Potential double-completion via late lifecycle events**

`finalizeCompletedRelay` calls `completeSubagentRun` with `sendFarewell: true` and then disposes the relay (removing the `onAgentEvent` listener). If a late lifecycle `phase === "end"` event subsequently arrives through the gateway's own event system (not through this relay), the normal registry path will call `completeSubagentRun` a second time with a fresh `Date.now()` timestamp. Because `endedAt` will differ from the first call, `mutated` will be `true` and a second farewell message may be emitted.

In practice this only triggers if the child completes very close to the stall threshold *and* the lifecycle event was in-flight during the probe window — narrow but possible. Adding an early-exit guard in `completeSubagentRun` when `entry.endedAt` is already set (or checking `entry.endedReason`) would fully close this window.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 304-324

Comment:
**Silent catch discards diagnostic context**

The bare `catch {}` on the `callGateway("agent.wait")` call means any transient error (timeout, deserialization failure, network blip) is silently swallowed. Since `probeForCompletedRelay` is the gate that determines whether the stall/timeout warning fires, a hidden error here could cause false stall messages to re-appear in ways that are difficult to diagnose.

Consider adding a debug-level log:
```ts
} catch (err) {
  log.debug("probeForCompletedRelay: agent.wait failed for run %s: %s", runId, err);
}
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: export completeSubagentRun, use spi..." | Re-trigger Greptile

Comment on lines +336 to +355
void probeForCompletedRelay().then(async (completed) => {
if (disposed || completed || stallNotified) {
return;
}
try {
const childOutput = await readSubagentOutput(params.childSessionKey);
if (childOutput?.trim()) {
await finalizeCompletedRelay({ status: "ok" });
return;
}
} catch {}
if (disposed) {
return;
}
stallNotified = true;
emit(
`${relayLabel} has produced no output for ${Math.round(noOutputNoticeMs / 1000)}s. It may be waiting for interactive input.`,
`${contextPrefix}:stall`,
);
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Potential double-completion via late lifecycle events

finalizeCompletedRelay calls completeSubagentRun with sendFarewell: true and then disposes the relay (removing the onAgentEvent listener). If a late lifecycle phase === "end" event subsequently arrives through the gateway's own event system (not through this relay), the normal registry path will call completeSubagentRun a second time with a fresh Date.now() timestamp. Because endedAt will differ from the first call, mutated will be true and a second farewell message may be emitted.

In practice this only triggers if the child completes very close to the stall threshold and the lifecycle event was in-flight during the probe window — narrow but possible. Adding an early-exit guard in completeSubagentRun when entry.endedAt is already set (or checking entry.endedReason) would fully close this window.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 336-355

Comment:
**Potential double-completion via late lifecycle events**

`finalizeCompletedRelay` calls `completeSubagentRun` with `sendFarewell: true` and then disposes the relay (removing the `onAgentEvent` listener). If a late lifecycle `phase === "end"` event subsequently arrives through the gateway's own event system (not through this relay), the normal registry path will call `completeSubagentRun` a second time with a fresh `Date.now()` timestamp. Because `endedAt` will differ from the first call, `mutated` will be `true` and a second farewell message may be emitted.

In practice this only triggers if the child completes very close to the stall threshold *and* the lifecycle event was in-flight during the probe window — narrow but possible. Adding an early-exit guard in `completeSubagentRun` when `entry.endedAt` is already set (or checking `entry.endedReason`) would fully close this window.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +304 to +324
const probeForCompletedRelay = async (timeoutMs = 1): Promise<boolean> => {
try {
const wait = await callGateway<{
status?: string;
error?: string;
startedAt?: number;
endedAt?: number;
}>({
method: "agent.wait",
params: {
runId,
timeoutMs: Math.max(1, Math.floor(timeoutMs)),
},
timeoutMs: Math.max(1_500, Math.floor(timeoutMs) + 1_000),
});
if (wait?.status === "ok" || wait?.status === "error") {
return await finalizeCompletedRelay(wait);
}
} catch {}
return false;
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent catch discards diagnostic context

The bare catch {} on the callGateway("agent.wait") call means any transient error (timeout, deserialization failure, network blip) is silently swallowed. Since probeForCompletedRelay is the gate that determines whether the stall/timeout warning fires, a hidden error here could cause false stall messages to re-appear in ways that are difficult to diagnose.

Consider adding a debug-level log:

} catch (err) {
  log.debug("probeForCompletedRelay: agent.wait failed for run %s: %s", runId, err);
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/acp-spawn-parent-stream.ts
Line: 304-324

Comment:
**Silent catch discards diagnostic context**

The bare `catch {}` on the `callGateway("agent.wait")` call means any transient error (timeout, deserialization failure, network blip) is silently swallowed. Since `probeForCompletedRelay` is the gate that determines whether the stall/timeout warning fires, a hidden error here could cause false stall messages to re-appear in ways that are difficult to diagnose.

Consider adding a debug-level log:
```ts
} catch (err) {
  log.debug("probeForCompletedRelay: agent.wait failed for run %s: %s", runId, err);
}
```

How can I resolve this? If you propose a fix, please make it concise.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86544f36e4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +341 to +343
const childOutput = await readSubagentOutput(params.childSessionKey);
if (childOutput?.trim()) {
await finalizeCompletedRelay({ status: "ok" });

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stop marking stalled ACP runs complete from partial output

If the child is still running when this probe fires, agent.wait returns status: "timeout" rather than a terminal result (checked in src/gateway/server-methods/agent.ts:856-860). This fallback then treats any readSubagentOutput() text as proof of completion, but readSubagentOutput() just returns the latest assistant/tool transcript fragment from the active session (src/agents/subagent-announce.ts:370-426). So any ACP run that emits some progress and then pauses past noOutputNoticeMs/maxRelayLifetimeMs will be announced as completed and cleaned up while it is still executing, dropping later output and sending a premature final answer to the parent.

Useful? React with 👍 / 👎.

Fed and others added 2 commits March 23, 2026 21:16
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c638055e2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/subagent-registry.ts Outdated
Comment on lines +548 to +551
// Idempotency: if the run already ended, bail out to avoid mutating endedAt
// a second time or re-triggering farewell/cleanup side-effects.
if (typeof entry.endedAt === "number" && !isRecoveryFromKill) {
return;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't bail out when agent.wait populated endedAt first

registerSubagentRun() always launches waitForSubagentCompletion(), and that helper writes entry.endedAt / entry.outcome before it calls completeSubagentRun() (src/agents/subagent-registry.ts:1458-1474). With this new guard, the first normal terminal result from agent.wait now returns immediately for every registered run, so the cleanup/announce path never executes: no ended hook, no farewell/summary delivery, and no session cleanup/archive follow-up. This affects ordinary spawned subagent and ACP runs that rely on the default agent.wait completion path.

Useful? React with 👍 / 👎.

…zed flag

The endedAt-based guard suppressed the normal first completion path
because waitForSubagentCompletion sets entry.endedAt before calling
completeSubagentRun. Replace with a dedicated completionFinalized
flag so the first call always runs full finalization (cleanup, announce,
hooks) while duplicate calls after finalization are still no-ops.

Also fixes pre-existing TS errors in acp-spawn.ts and
acp-spawn-parent-stream.test.ts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1aa7e5f96

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/acp-spawn.ts
Comment on lines +919 to +924
registerSubagentRun({
runId: childRunId,
childSessionKey: sessionKey,
controllerSessionKey: requesterInternalKey,
requesterSessionKey: requesterInternalKey,
requesterOrigin: requesterState.origin,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Skip subagent-run registration for streamed ACP child sessions

Registering streamTo:"parent" ACP sessions here makes them participate in the generic subagent reactivation path. Both src/gateway/server-methods/agent.ts:658-662 and src/gateway/server-methods/sessions.ts:460-465 call reactivateCompletedSubagentSession() on later sends to the same sessionKey, and replaceSubagentRunAfterSteer() in src/agents/subagent-registry.ts:1312-1335 clones the original requesterSessionKey/requesterOrigin from this record. In mode:"session" (or any reused ACP child session), a follow-up sent directly in the ACP conversation will therefore keep routing its completion announce back to the original parent relay target instead of the conversation the user is now using.

Useful? React with 👍 / 👎.

@fedgaltc-code

Copy link
Copy Markdown
Author

A bit more user-facing context on why this fix matters:

This was needed to make Claude Code usable reliably on my remote OpenClaw machine through the Telegram chat flow.

Before this change, streamTo:"parent" ACP runs could appear to stall in the parent session and, in the successful path, the final completion often never made it back to the chat even though the child run had actually finished.

In practice that meant I could kick off Claude Code remotely from Telegram, see partial progress, and then either get a misleading no-output warning or never receive the final result in the parent conversation.

This PR fixes that relay/completion path so remote Claude Code runs report back to the Telegram parent session properly.

@vincentkoc

Copy link
Copy Markdown
Member

Re-checked against current origin/main on 2026-03-29. This branch still contains useful diagnosis, but as a merge candidate it is now too bundled.

The remaining live seams appear to be split across: ACP lifecycle semantics (#40885), persisted/cross-process parent relay fallback (#54701), and possibly a smaller ACP run-registration gap. I would not merge this as one broad package against current main without first carving it into current-state-focused pieces.

@vincentkoc

Copy link
Copy Markdown
Member

Re-checked against current origin/main on 2026-03-29. This branch still contains useful diagnosis, but as a merge candidate it is now too bundled.

The remaining live seams appear to be split across ACP lifecycle semantics (#40885), persisted/cross-process parent relay fallback (#54701), and possibly a smaller ACP run-registration gap. I would not merge this as one broad package against current main without first carving it into current-state-focused pieces.

@vincentkoc

Copy link
Copy Markdown
Member

Closing as superseded. This branch was useful diagnosis, but it is now too bundled against current main. The remaining live pieces have been split into narrower current-state fixes: ACP lifecycle/registration in #40885 and persisted/cross-process parent relay fallback in #54701. Keeping this open now adds review noise and overlap rather than a clean merge path. Thanks for the investigation and for narrowing the failure mode.

@vincentkoc vincentkoc closed this Mar 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants