[Bug]: native subagent spawn can return accepted without durable registry entry #83132

@albert-zen

Description

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

Native sessions_spawn(runtime="subagent") can return accepted and create child session records, but the corresponding run IDs are absent from ~/.openclaw/subagents/runs.json, so subagents list and completion delivery cannot find the kept subagents.

Steps to reproduce

  1. Start OpenClaw 2026.5.16 from a local checkout.
  2. From an embedded chat channel, ask the parent agent to spawn multiple native subagents with runtime:"subagent", mode:"run", cleanup:"keep", then call sessions_yield.
  3. Observe sessions_spawn returning accepted for each child, including childSessionKey and runId.
  4. After the children finish/fail/timeout, run subagents list or inspect ~/.openclaw/subagents/runs.json.
  5. Compare with ~/.openclaw/agents/main/sessions/sessions.json and per-session .jsonl transcript files.

Expected behavior

Every accepted native subagent run should be durably registered before sessions_spawn returns accepted. subagents list should be able to show the run, or at least report an orphaned child session if the registry entry is missing. Completion/result capture should leave a visible diagnostic or partial result instead of silently producing no parent-visible result.
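The ordering described above can be sketched as follows. This is a minimal illustration, not the actual OpenClaw implementation: the `RunRegistry` interface, the `SpawnResult` shape, and `spawnWithDurableRegistration` are hypothetical names, and the read-back check is one assumed way to confirm that the entry reached durable storage.

```typescript
// Hypothetical interfaces for illustration; not the actual OpenClaw API.
interface RunRecord {
  runId: string;
  childSessionKey: string;
}

interface RunRegistry {
  // Write the record through to durable storage (e.g. runs.json).
  persist(record: RunRecord): Promise<void>;
  // Read back from durable storage, not from an in-memory cache.
  has(runId: string): Promise<boolean>;
}

type SpawnResult =
  | { status: "accepted"; runId: string; childSessionKey: string }
  | { status: "error"; runId: string; reason: string };

// Report "accepted" only after the registry entry was written and read back.
async function spawnWithDurableRegistration(
  registry: RunRegistry,
  record: RunRecord,
): Promise<SpawnResult> {
  try {
    await registry.persist(record);
    if (!(await registry.has(record.runId))) {
      return {
        status: "error",
        runId: record.runId,
        reason: "registry entry not found after persist",
      };
    }
  } catch (err) {
    return {
      status: "error",
      runId: record.runId,
      reason: `registry persist failed: ${String(err)}`,
    };
  }
  return { status: "accepted", ...record };
}
```

With this shape, a registry write that is dropped or fails surfaces to the parent as an error at spawn time rather than as a child that later cannot be found.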

Actual behavior

A real run produced five accepted native subagent children whose sessions were present in the session store, but none of the five run IDs appeared in ~/.openclaw/subagents/runs.json. subagents list returned only older registry entries and no recent/active entries for the accepted children.

Observed child sessions present in agents/main/sessions/sessions.json:

agent:main:subagent:2b168bd8-65c1-45c4-9568-adaab7c89acb  runId=8a10e2b6-4b36-4ef1-8510-4de85f48d0ef  status=failed
agent:main:subagent:2c89a506-4519-4c4f-93ac-436b0e5529f4  runId=2ad17fe6-06d3-4463-bdc0-d6fc97fbd3a5  status=failed
agent:main:subagent:0358693c-8b3b-49c3-8f1a-0a42e99f83bc  runId=1e7e3e7c-fe6e-45e3-b157-42584cc4e196  status=timeout
agent:main:subagent:61eb58bf-2623-43bc-8433-c33fb1f73994  runId=b4710394-da7f-49be-8fb7-3e71f106f669  status=done
agent:main:subagent:a4ec2574-2f77-4087-b781-cd5193dd0b90  runId=7c3b4282-af6f-45f2-869a-4341ceeff586  status=done

rg over the OpenClaw state directory found those run IDs in the session/trajectory files, but not in subagents/runs.json.
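The same comparison can be scripted. The sketch below assumes each child session record can be reduced to a session key, a run ID, and a status, and that the registry can be reduced to a set of run IDs. The `ChildSession` type and `findOrphanedSessions` function are hypothetical and do not reflect the real sessions.json or runs.json schemas; the registry ID in the example is a placeholder.

```typescript
// Hypothetical shape: the real sessions.json / runs.json schemas may differ.
interface ChildSession {
  sessionKey: string;
  runId: string;
  status: string;
}

// Return subagent child sessions whose runId has no matching registry entry.
function findOrphanedSessions(
  sessions: ChildSession[],
  registryRunIds: Iterable<string>,
): ChildSession[] {
  const known = new Set(registryRunIds);
  return sessions.filter(
    (s) => s.sessionKey.includes(":subagent:") && !known.has(s.runId),
  );
}

// Two of the observed children from this report.
const observedSessions: ChildSession[] = [
  {
    sessionKey: "agent:main:subagent:2b168bd8-65c1-45c4-9568-adaab7c89acb",
    runId: "8a10e2b6-4b36-4ef1-8510-4de85f48d0ef",
    status: "failed",
  },
  {
    sessionKey: "agent:main:subagent:61eb58bf-2623-43bc-8433-c33fb1f73994",
    runId: "b4710394-da7f-49be-8fb7-3e71f106f669",
    status: "done",
  },
];

// Placeholder standing in for the older registry entries (assumption).
const registryRunIds = ["older-run-id-placeholder"];

// Both observed children are reported, since neither runId is registered.
const orphanedSessions = findOrphanedSessions(observedSessions, registryRunIds);
```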

There were also four older registry entries from a previous wave. Those older entries confirm that cleanup:"keep" can be processed, but several had frozenResultText:null; one had pendingFinalDelivery:true with lastAnnounceDeliveryError="gateway request timeout for agent; direct-primary: gateway request timeout for agent". That looks related to #44925, but the missing-registry symptom above is narrower and independently actionable.

OpenClaw version

2026.5.16, local checkout commit 2bcb0abbb8f40fec6ee103226389c66d58878a8a

Operating system

Windows 11 10.0.26200

Install method

Local source checkout, launched with pnpm openclaw gateway.

Model

DashScope K2.6 was the effective model in the QQ embedded-agent workflow where this was observed.

Provider / routing chain

OpenClaw gateway -> DashScope K2.6 provider route.

Additional provider/model setup details

The same install had also been testing model/reasoning configuration for embedded QQ agent flows. No API keys or provider credentials are included here.

Logs, screenshots, and evidence

Relevant source paths from the local checkout:

src/agents/tools/sessions-spawn-tool.ts
src/agents/subagent-spawn.ts
src/agents/subagent-control.ts
src/agents/subagent-list.ts
src/agents/subagent-registry-lifecycle.ts
src/agents/subagent-announce-output.ts
src/agents/subagent-announce-delivery.ts
src/agents/subagent-run-liveness.ts
src/agents/subagent-registry-helpers.ts

Source-level observations:

- sessions-spawn-tool.ts registers ACP runs after spawn; the native runtime delegates to spawnSubagentDirect.
- spawnSubagentDirect appears to call registerSubagentRun before returning accepted.
- In the observed state, accepted native child sessions existed, but the accepted run IDs were absent from runs.json.
- buildSubagentList only lists registry runs passed to it; it does not reconstruct from sessions.json.
- completion capture can store frozenResultText=null when no final assistant text/history is available.

Impact and severity

Affected: embedded chat workflows using native subagents, observed from a QQ channel.
Severity: high for long-running research/orchestration workflows because the parent can yield waiting for children that later become invisible to subagents list and completion delivery.
Frequency: observed in one production-style multi-subagent run with five accepted children missing from the registry.
Consequence: user-visible completion can appear lost even though child session transcripts exist; operators cannot reliably recover from subagents list alone.

Additional information

This likely overlaps the broader delivery durability family tracked in #44925 and the yielded-parent wake/resumption family in #52249, but the specific acceptance/registry invariant seems worth tracking separately:

  • If registration failed or was not persisted, sessions_spawn should not return accepted.
  • If a child session exists without a registry run, subagents list or diagnostics should surface it as an orphan.
  • Completion capture should fall back to persisted session transcripts when gateway chat.history is empty/unavailable, and should freeze a diagnostic/partial result instead of null for terminal children.
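The third point could look roughly like the sketch below. It assumes transcript entries carry a role and text; `TranscriptEntry`, `resolveFrozenResultText`, and the bracketed diagnostic wording are hypothetical and are not taken from the OpenClaw source.

```typescript
// Hypothetical types and names for illustration only.
type TerminalStatus = "done" | "failed" | "timeout";

interface TranscriptEntry {
  role: "user" | "assistant" | "tool";
  text: string;
}

// Find the most recent non-empty assistant message, if any.
function lastAssistantText(entries: TranscriptEntry[] | null): string | undefined {
  if (!entries) return undefined;
  const hit = [...entries]
    .reverse()
    .find((e) => e.role === "assistant" && e.text.trim() !== "");
  return hit?.text;
}

// Choose the text to freeze for a terminal child run; never returns null.
function resolveFrozenResultText(
  status: TerminalStatus,
  gatewayHistory: TranscriptEntry[] | null,
  persistedTranscript: TranscriptEntry[],
): string {
  // Preferred source: the gateway chat history.
  const fromGateway = lastAssistantText(gatewayHistory);
  if (fromGateway !== undefined) return fromGateway;

  // Fallback: the per-session transcript persisted on disk.
  const fromTranscript = lastAssistantText(persistedTranscript);
  if (fromTranscript !== undefined) {
    return `[partial result recovered from session transcript] ${fromTranscript}`;
  }

  // Last resort: a diagnostic, so the parent sees something instead of null.
  return `[no result captured] child ended with status=${status}; ` +
    "no assistant text found in gateway history or session transcript";
}
```

Returning a string in every branch means a terminal child always produces a parent-visible outcome, even when that outcome is only a diagnostic.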

Metadata

Labels

- P1: High-priority user-facing bug, regression, or broken workflow.
- clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
- clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
- clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
- impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
- impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
