Skip to content

ACP: recover parent relay output from gateway state#664

Open
BingqingLyu wants to merge 1 commit into
mainfrom
fork-pr-45739-fix-acp-parent-relay-fallback-45205
Open

ACP: recover parent relay output from gateway state#664
BingqingLyu wants to merge 1 commit into
mainfrom
fork-pr-45739-fix-acp-parent-relay-fallback-45205

Conversation

@BingqingLyu

@BingqingLyu BingqingLyu commented Apr 27, 2026

Copy link
Copy Markdown
Owner

Summary

  • Problem: ACP child runs can finish successfully while the parent streamTo: "parent" relay only sees synthetic start/stall notices because it depends on local onAgentEvent(...) delivery.
  • Why it matters: orchestration from the parent session loses child progress and terminal completion, so sessions_spawn looks hung even when the ACP child actually replied or finished work.
  • What changed: the parent relay now backfills child assistant output from gateway chat.history and terminal state from agent.wait, while still preferring local assistant/lifecycle events when they arrive.
  • What did NOT change (scope boundary): ACP spawn semantics, child execution, and the normal local event path are unchanged; this only adds a best-effort fallback when relay events do not cross the process boundary.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • ACP parent relays now recover missing child assistant output and completion from gateway state when local relay events do not arrive, so parent sessions stop getting stuck on start/stall-only updates.

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): Yes
  • Command/tool execution surface changed? (Yes/No): No
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation:
    • The relay now makes best-effort gateway calls to chat.history and agent.wait for the spawned child session/run.
    • Mitigation: calls are scoped to the known child session key / run ID, use short timeouts, and only supplement the existing local event path.

Repro + Verification

Environment

  • OS: macOS host for source tests; live repro previously validated against Debian 13 install from issue pattern
  • Runtime/container: Node v25.6.0, pnpm 10.23.0
  • Model/provider: ACP child via Codex during manual smoke; issue originally reported with Claude/OpenRouter
  • Integration/channel (if any): ACP sessions_spawn with streamTo: "parent"
  • Relevant config (redacted): ACP enabled with backend acpx, parent streaming enabled

Steps

  1. Start an ACP-capable parent session and spawn a child with sessions_spawn, runtime: "acp", and streamTo: "parent".
  2. Use a task that produces visible assistant output quickly.
  3. Observe the parent relay output and child transcript / ACP stream log.

Expected

  • Parent relay receives child assistant progress and terminal completion.

Actual

  • Before the fix, affected runs could emit only synthetic start/stall notices even though the child transcript already contained assistant output and the child had completed.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • pnpm exec vitest run src/agents/acp-spawn-parent-stream.test.ts src/commands/agent.acp.test.ts
    • New relay regression cases cover:
      • missing local assistant + terminal events recovered via chat.history + agent.wait
      • missing local terminal event with assistant deltas still present
      • missing local assistant output recovered before a local lifecycle end
    • Manual live smoke on the installed 2026.3.12 runtime: 5 repeated real ACP spawns with unique markers all produced assistant output plus terminal completion in the generated .acp-stream.jsonl relay logs.
  • Edge cases checked:
    • start notice behavior unchanged
    • no-output stall / resume behavior unchanged
    • max relay lifetime timeout unchanged
    • local whitespace-preserving delta aggregation unchanged
  • What you did not verify:
    • I did not rely on src/agents/acp-spawn.test.ts for this fix because that suite fully mocks acp-spawn-parent-stream.js and, in this tarball snapshot, currently fails during unrelated mock-hoisting/module-load setup before exercising the relay path.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
    • Revert this commit or temporarily remove the fallback poll / gateway backfill from src/agents/acp-spawn-parent-stream.ts.
  • Files/config to restore:
    • src/agents/acp-spawn-parent-stream.ts
    • src/agents/acp-spawn-parent-stream.test.ts
    • CHANGELOG.md
  • Known bad symptoms reviewers should watch for:
    • duplicated child progress relays
    • parent relay completion firing before child output is flushed
    • relay logs missing assistant_history / terminal backfill in the missing-event path

Risks and Mitigations

  • Risk: fallback history polling could duplicate output when local assistant events already arrived.
    • Mitigation: the fallback path bails once local assistant events have been seen and tracks whether history output was already relayed.
  • Risk: extra gateway lookups could introduce relay noise or hangs.
    • Mitigation: calls are short-timeout, best-effort, and capped to the relay lifetime / poll interval already used by the stall watcher.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: ACP child run completes, but parent relay receives no progress or completion events

2 participants