[Bug]: Make subagent completion delivery durable across retry and reconnect gaps #65000

@MertBasar0

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

Deferred subagent completion delivery is unreliable across retry/reconnect boundaries, and parent-facing final delivery
can be delayed or missed even when the child run completes successfully.

Steps to reproduce

  1. Run OpenClaw 2026.4.9.
  2. Start a parent session and launch a longer-running subagent / ACP child run.
  3. While the child is still running, force a reconnect-sensitive path, such as a gateway restart or a delivery
    interruption/disconnect, before final parent delivery.
  4. Wait for the child run to finish successfully.
  5. Observe that completion is recorded, but parent-facing final delivery is not reliably emitted automatically or in a
    timely manner in the affected reconnect-gap path.

Expected behavior

When a child/subagent run finishes successfully, deferred final delivery state should remain durable across
retry/restart/reconnect paths and the parent session should receive the final completion delivery once the system is
able to retry it.

Actual behavior

In reconnect-gap / deferred-delivery paths, the child run can be observed to complete successfully while parent-facing
final delivery remains unreliable. In local debugging, a failed retry path could also overwrite the durable
pending-delivery context with live run fields, which weakens later retry/cleanup behavior.

OpenClaw version

2026.4.9

Operating system

Linux 6.6.114.1-microsoft-standard-WSL2 (x64)

Install method

npm global (live gateway), with source checkout used for patch/test verification

Model

Multiple ACP subagent runs observed; not isolated to a single model

Provider / routing chain

OpenClaw gateway -> ACP subagent session -> parent completion delivery

Additional provider/model setup details

The observed failure appears in subagent completion delivery / follow-up lifecycle behavior rather than a
provider-specific model output path. Reproductions involved ACP child runs and parent completion delivery timing across
reconnect-sensitive conditions.

Logs, screenshots, and evidence

Observed on latest local install after cutover to 2026.4.9.

  Grounded observations:
  - short smoke passed
  - long no-restart smoke passed
  - reconnect-gap path remained unreliable
  - child/subagent work completed successfully
  - parent-facing final delivery did not return cleanly or in a timely manner in the affected path

  Local debugging also found a concrete durability bug:
  - retry-state writes could overwrite the existing durable pending-delivery payload with transient live fields

  Local hardening patch added:
  - payload preservation during retry-state writes
  - persistence/restart coverage
  - targeted tests for cleanup/persistence behavior
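
A minimal sketch of the payload-preservation idea, for illustration only. All type and function names here (`CompletionPayload`, `PendingDelivery`, `recordRetryFailure`) are hypothetical and are not taken from the OpenClaw codebase. It assumes the pending-delivery record holds a payload written once at child completion, plus retry bookkeeping that is the only part a failed retry may change.

```typescript
// Hypothetical shapes for illustration; not OpenClaw's actual types.
interface CompletionPayload {
  childRunId: string;
  parentSessionId: string;
  finalText: string;
}

interface PendingDelivery {
  payload: CompletionPayload; // durable: written once when the child completes
  attempts: number;
  lastError?: string;
  nextRetryAt?: number;
}

// The only fields a retry path is meant to update.
type RetryUpdate = Pick<PendingDelivery, "lastError" | "nextRetryAt">;

// Build the next record after a failed retry. The stored payload is
// re-applied last, so stray live-run fields in `update` cannot replace it.
function recordRetryFailure(
  existing: PendingDelivery,
  update: RetryUpdate,
): PendingDelivery {
  return {
    ...existing,
    ...update,
    payload: existing.payload,
    attempts: existing.attempts + 1,
  };
}

// Usage: a failed retry changes bookkeeping but leaves the payload intact.
const stored: PendingDelivery = {
  payload: { childRunId: "child-1", parentSessionId: "parent-1", finalText: "done" },
  attempts: 0,
};
const afterFailure = recordRetryFailure(stored, {
  lastError: "gateway disconnected",
  nextRetryAt: 1000,
});
```

Narrowing the update type documents the intent, and re-applying `existing.payload` after the spreads enforces it at runtime even if a caller passes an object carrying extra fields.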

Impact and severity

Affected: users relying on parent-visible subagent/ACP completion delivery across reconnect/retry boundaries
Severity: High for affected flows, because final results can be delayed or effectively missed from the parent/user
perspective
Frequency: Intermittent, specifically observed in reconnect-gap / deferred-delivery paths, not in every completion path
Consequence: parent sessions may not receive reliable final completion delivery even though child work completed
successfully

Additional information

This report is about observed reliability behavior, not a claim of a complete root-cause fix.

A local hardening direction appears promising:

  - preserve deferred completion payload durably across retry-state writes
  - keep final delivery as a parent-owned retry obligation rather than relying on a transient handoff moment

However, current evidence does not yet justify claiming a full end-to-end fix for all reconnect-gap scenarios.
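
The "parent-owned retry obligation" direction could look roughly like the sketch below. This is an assumption-laden illustration, not the local patch: `PendingCompletionQueue`, `DeliverFn`, and the field names are hypothetical, and the in-memory record stands in for whatever durable store the gateway would actually use. The point it shows is that a completion stays queued until a delivery attempt succeeds, so a disconnect at completion time only delays delivery.

```typescript
// Hypothetical names throughout; the in-memory record is a stand-in
// for a durable store and would not itself survive a process restart.
interface PendingCompletion {
  childRunId: string;
  parentSessionId: string;
  finalText: string;
  attempts: number;
}

// Returns true when the parent session accepted the delivery.
type DeliverFn = (completion: PendingCompletion) => boolean;

class PendingCompletionQueue {
  private pending: Record<string, PendingCompletion> = {};

  // Record the obligation once; a repeated enqueue for the same child
  // does not overwrite the stored entry.
  enqueue(completion: Omit<PendingCompletion, "attempts">): void {
    if (!(completion.childRunId in this.pending)) {
      this.pending[completion.childRunId] = { ...completion, attempts: 0 };
    }
  }

  // Attempt every pending delivery; keep entries whose delivery failed.
  // Intended to be called on reconnect or on a retry timer.
  drain(deliver: DeliverFn): string[] {
    const delivered: string[] = [];
    for (const id of Object.keys(this.pending)) {
      const entry = this.pending[id];
      let ok = false;
      try {
        ok = deliver(entry);
      } catch {
        ok = false; // a thrown error counts as a failed attempt
      }
      if (ok) {
        delete this.pending[id];
        delivered.push(id);
      } else {
        entry.attempts += 1;
      }
    }
    return delivered;
  }

  size(): number {
    return Object.keys(this.pending).length;
  }
}

// Usage: delivery fails during the reconnect gap, then succeeds afterwards.
const queue = new PendingCompletionQueue();
queue.enqueue({ childRunId: "child-1", parentSessionId: "parent-1", finalText: "done" });
const duringGap = queue.drain(() => false);
const afterReconnect = queue.drain(() => true);
```

An entry is removed only after a successful delivery, which gives at-least-once semantics; a real implementation would also need the parent side to tolerate a duplicate delivery.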
