Parent subagent wait can time out before delayed child starts, leaving requester unaware of success #82787

## Summary

The parent subagent wrapper can time out before the child CLI run has even started. In the observed case, the child started about three seconds after the parent wait timed out, completed successfully, and emitted a useful final answer, but the parent/requester stayed in a stale waiting/timed-out state until manually prompted.

This matches user-visible reports where Jarvis appears to be waiting on a subagent that is already done, or only notices completion after a follow-up prompt.

## Evidence

Example label: `pwa-backtest-v2-lumbergh-fixes`

Shared run id: `f69b3958-5826-4d11-ba2f-1c9dd3d7a811`

`tasks/runs.sqlite` shows two rows for the same run id:

| task_id | runtime | agent_id | status | created | started | ended | start lag | run time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `eaad5c31-9f18-4df2-9323-53df32ba49dc` | `subagent` | `main` | `timed_out` | 2026-05-16 17:34:42 MDT | 2026-05-16 17:34:42 MDT | 2026-05-16 17:39:42 MDT | 0.8s | 299.4s |
| `48ae44e4-d071-4a18-9c82-e562baa33cda` | `cli` | `billy` | `succeeded` | 2026-05-16 17:34:42 MDT | 2026-05-16 17:39:45 MDT | 2026-05-16 17:40:32 MDT | 303.1s | 47.2s |

The parent wrapper timed out at 17:39:42, while the child CLI run did not start until 17:39:45 and then completed successfully at 17:40:32.
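
The same pattern can be searched for across runs. The following is a minimal sketch using Python's `sqlite3` module against an in-memory database. The table name `task_runs` and its column names are assumptions for illustration, since the actual schema of `tasks/runs.sqlite` is not shown in this issue. Timestamps are truncated to whole seconds, as in the table above.

```python
import sqlite3

# Hypothetical schema: the real layout of tasks/runs.sqlite may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task_runs (
        task_id TEXT PRIMARY KEY, run_id TEXT, runtime TEXT, agent_id TEXT,
        status TEXT, created_at TEXT, started_at TEXT, ended_at TEXT
    )
    """
)
RUN_ID = "f69b3958-5826-4d11-ba2f-1c9dd3d7a811"
conn.executemany(
    "INSERT INTO task_runs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("eaad5c31-9f18-4df2-9323-53df32ba49dc", RUN_ID, "subagent", "main", "timed_out",
         "2026-05-16 17:34:42", "2026-05-16 17:34:42", "2026-05-16 17:39:42"),
        ("48ae44e4-d071-4a18-9c82-e562baa33cda", RUN_ID, "cli", "billy", "succeeded",
         "2026-05-16 17:34:42", "2026-05-16 17:39:45", "2026-05-16 17:40:32"),
    ],
)

# Find timed-out parent rows whose successful child only started after
# the parent had already ended, and report the gap in seconds.
QUERY = """
SELECT p.run_id, p.task_id, c.task_id,
       CAST(strftime('%s', c.started_at) AS INTEGER)
         - CAST(strftime('%s', p.ended_at) AS INTEGER) AS start_after_timeout_s
FROM task_runs AS p
JOIN task_runs AS c ON c.run_id = p.run_id AND c.task_id <> p.task_id
WHERE p.runtime = 'subagent' AND p.status = 'timed_out'
  AND c.status = 'succeeded'
  AND c.started_at > p.ended_at
"""
late_starts = conn.execute(QUERY).fetchall()
for run_id, parent_id, child_id, gap in late_starts:
    print(f"run {run_id}: child {child_id} started {gap}s after parent {parent_id} timed out")
```

Comparing `started_at` and `ended_at` as strings works here only because both use the same sortable `YYYY-MM-DD HH:MM:SS` layout.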

Relevant log/session lines:

- `logs/gateway.log:295605`: `2026-05-16T17:39:42.225-06:00 [ws] ⇄ res ✓ agent.wait 300012ms ...`
- `agents/billy/sessions/3ba0944a-194f-4237-b0f6-042c02a71fd4.jsonl:54`: prompt timeout recorded at `2026-05-16T23:39:42.796Z`
- `agents/billy/sessions/3ba0944a-194f-4237-b0f6-042c02a71fd4.jsonl:71`: final assistant message at `2026-05-16T23:40:31.369Z`
- `logs/gateway.log:295647`: `2026-05-16T17:40:32.483-06:00 Both fixes are complete and verified...`
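
Note that the gateway log uses local time with a `-06:00` offset while the session files use UTC (`Z`). As a quick sanity check, the four timestamps above can be put on one timeline with Python's `datetime`. The variable names are illustrative only, and `Z` is written as `+00:00` so the snippet also runs on Python versions older than 3.11.

```python
from datetime import datetime

# Timestamps copied from the log/session lines above.
wait_returned = datetime.fromisoformat("2026-05-16T17:39:42.225-06:00")   # gateway.log:295605
prompt_timeout = datetime.fromisoformat("2026-05-16T23:39:42.796+00:00")  # session line 54
final_message = datetime.fromisoformat("2026-05-16T23:40:31.369+00:00")   # session line 71
gateway_final = datetime.fromisoformat("2026-05-16T17:40:32.483-06:00")   # gateway.log:295647

# Aware datetimes compare and subtract correctly across offsets.
timeline = sorted([wait_returned, prompt_timeout, final_message, gateway_final])
timeout_to_prompt_s = (prompt_timeout - wait_returned).total_seconds()
prompt_to_final_s = (final_message - prompt_timeout).total_seconds()
final_to_gateway_s = (gateway_final - final_message).total_seconds()

print(timeout_to_prompt_s, prompt_to_final_s, final_to_gateway_s)
```

The events are already listed in chronological order, and the final assistant message arrives roughly 49 seconds after `agent.wait` returned.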

The final child output included `## Task Complete` and listed both completed fixes, so this was not a failed child run.

## Actual Behavior

The parent subagent task is marked `timed_out` after ~300s wall-clock time from wrapper start.

The child runtime can remain queued or blocked for the entire parent timeout window (a 303.1s start lag against a 300s timeout in the case above) and only start after the wrapper has already timed out.

When the child completes successfully after the parent timeout, the requester/Jarvis does not reliably reconcile that late success into the parent state or notify the requester.

## Expected Behavior

The requester should not lose successful child results because of child start delay.

The parent subagent lifecycle should distinguish between:

- queued/not-started time
- active child runtime
- child completed after parent wait timeout
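
To illustrate why the first two items need separate accounting, here is a small `asyncio` sketch. It is not OpenClaw code: the function names, return strings, and the `started` event are all hypothetical, and the durations are scaled down from seconds to fractions of a second.

```python
import asyncio

async def child(started: asyncio.Event, start_delay: float, work: float) -> str:
    await asyncio.sleep(start_delay)  # time spent queued / not started
    started.set()
    await asyncio.sleep(work)         # active child runtime
    return "done"

async def wait_single_budget(start_delay: float, work: float, budget: float) -> str:
    """One wall-clock budget covering both queue time and runtime."""
    started = asyncio.Event()
    task = asyncio.create_task(child(started, start_delay, work))
    try:
        return await asyncio.wait_for(task, timeout=budget)
    except asyncio.TimeoutError:
        return "timed_out"

async def wait_split_budgets(start_delay: float, work: float,
                             queue_budget: float, run_budget: float) -> str:
    """Separate budgets: the run clock starts only once the child has started."""
    started = asyncio.Event()
    task = asyncio.create_task(child(started, start_delay, work))
    try:
        await asyncio.wait_for(started.wait(), timeout=queue_budget)
    except asyncio.TimeoutError:
        task.cancel()
        return "queue_timed_out"
    try:
        return await asyncio.wait_for(task, timeout=run_budget)
    except asyncio.TimeoutError:
        return "run_timed_out"

# A child that is slow to start but quick to run.
single = asyncio.run(wait_single_budget(start_delay=0.2, work=0.01, budget=0.1))
split = asyncio.run(wait_split_budgets(start_delay=0.2, work=0.01,
                                       queue_budget=1.0, run_budget=0.1))
print(single, split)
```

With a single budget the slow start consumes the whole allowance; with split budgets the same child finishes well within its execution budget.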

If a child completes after the parent wrapper timed out, OpenClaw should reconcile the parent task state and deliver a late completion/update to the requester, or at minimum surface a clear `late_success_after_parent_timeout` state.
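
One way to picture the reconciled state is a small pure function over the parent and child rows. This is only a sketch: apart from `timed_out`, `succeeded`, `failed`, and `late_success_after_parent_timeout`, every state name below is a hypothetical placeholder, as is the `TaskRow` type.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

TERMINAL = {"succeeded", "failed", "timed_out"}

@dataclass
class TaskRow:
    status: str
    started: Optional[datetime] = None
    ended: Optional[datetime] = None

def requester_state(parent: TaskRow, child: TaskRow) -> str:
    """Derive the state the requester should see from both rows."""
    if parent.status != "timed_out":
        return parent.status
    if child.started is None:
        return "timed_out_child_not_started"   # hypothetical name
    if child.status not in TERMINAL:
        return "timed_out_child_running"       # hypothetical name
    if child.status == "succeeded":
        return "late_success_after_parent_timeout"
    return "late_failure_after_parent_timeout"  # hypothetical name

parent = TaskRow("timed_out", datetime(2026, 5, 16, 17, 34, 42), datetime(2026, 5, 16, 17, 39, 42))

# Snapshot at the moment the parent timed out: child still queued.
at_timeout = requester_state(parent, TaskRow("queued"))

# Snapshot after the child finished at 17:40:32.
child_done = TaskRow("succeeded", datetime(2026, 5, 16, 17, 39, 45), datetime(2026, 5, 16, 17, 40, 32))
after_child = requester_state(parent, child_done)
print(at_timeout, after_child)
```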

## Suggested Fixes

- Start the subagent execution timeout when the child runtime actually starts, or maintain separate queue/start timeout and active execution timeout budgets.
- If parent wait times out, keep a watcher/reconciler subscribed to the child run id so late `succeeded`/`failed` states are propagated.
- Reconcile parent rows where `runtime='subagent' AND status='timed_out'` and a child row with the same `run_id` later reaches `succeeded`.
- Emit a requester-visible notification when late child success arrives after the initial wait timed out.
- Update UI state so `Subagent: <label>` does not remain in a stale waiting state when the child row is already terminal.
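
The row-level reconciliation suggested above could look roughly like the following. This is a sketch under assumptions, not the upstream implementation: the `task_runs` table, its columns, and the helper name `reconcile_late_successes` are hypothetical, and the returned run ids stand in for whatever requester notification mechanism exists.

```python
import sqlite3
from typing import List

def reconcile_late_successes(conn: sqlite3.Connection) -> List[str]:
    """Mark timed-out parents whose child later succeeded; return run ids to notify."""
    pending = conn.execute(
        """
        SELECT p.task_id, p.run_id
        FROM task_runs AS p
        JOIN task_runs AS c ON c.run_id = p.run_id AND c.task_id <> p.task_id
        WHERE p.runtime = 'subagent' AND p.status = 'timed_out'
          AND c.status = 'succeeded'
        """
    ).fetchall()
    conn.executemany(
        "UPDATE task_runs SET status = 'late_success_after_parent_timeout' WHERE task_id = ?",
        [(task_id,) for task_id, _ in pending],
    )
    conn.commit()
    return [run_id for _, run_id in pending]

# Demo against an in-memory database with the two rows from this issue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_runs (task_id TEXT PRIMARY KEY, run_id TEXT, runtime TEXT, status TEXT)")
RUN_ID = "f69b3958-5826-4d11-ba2f-1c9dd3d7a811"
conn.executemany(
    "INSERT INTO task_runs VALUES (?, ?, ?, ?)",
    [
        ("eaad5c31-9f18-4df2-9323-53df32ba49dc", RUN_ID, "subagent", "timed_out"),
        ("48ae44e4-d071-4a18-9c82-e562baa33cda", RUN_ID, "cli", "succeeded"),
    ],
)
first_pass = reconcile_late_successes(conn)
second_pass = reconcile_late_successes(conn)  # nothing left to reconcile
print(first_pass, second_pass)
```

Because reconciled parents no longer have `status='timed_out'`, rerunning the pass does not notify the requester twice.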

## Environment

Observed locally on OpenClaw `2026.5.12` on macOS, with `agents.defaults.subagents.runTimeoutSeconds=300`.

## Post-upgrade local patch status (2026-05-19)

Upgraded local install from `2026.5.12` to `2026.5.18 (50a2481)`.

The previous local `subagent-timeout-reconciliation` patch is no longer being carried as a local delta:

- Reapply script status: `unchanged` on `/opt/homebrew/lib/node_modules/openclaw/dist/subagent-registry-Bu5qGLSl.js`.
- The reconciliation marker/behavior was already present in the upgraded bundle.

Post-upgrade smoke checks passed:

- Gateway and CLI version: `2026.5.18`.
- `openclaw status --deep`: Gateway reachable, Discord OK, Telegram OK, event loop healthy.
- `openclaw channels status --json`: all enabled Discord accounts connected (`main`, `farber`, `lumbergh`, `maverick`, `scout`) and Telegram connected.
- `openclaw tasks list --status running --json`: `0` running tasks.

Interpretation: this issue appears covered upstream in `2026.5.18`; no local patch is being carried for this anymore.
