Parent subagent wait can time out before delayed child starts, leaving requester unaware of success #82787

@ramitrkar-hash

Summary

A subagent wrapper can time out while the actual child CLI run has not started yet. In the observed case, the child started a few seconds after the parent wait timed out, completed successfully, and emitted a useful final answer, but the parent/requester remained in a stale waiting/timed-out state until manually poked.

This matches user-visible reports where Jarvis appears to be waiting on a subagent that is already done, or only notices completion after a follow-up prompt.

Evidence

Example label: pwa-backtest-v2-lumbergh-fixes

Shared run id: f69b3958-5826-4d11-ba2f-1c9dd3d7a811

tasks/runs.sqlite shows two rows for the same run id:

| task_id | runtime | agent_id | status | created | started | ended | start lag | run time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| eaad5c31-9f18-4df2-9323-53df32ba49dc | subagent | main | timed_out | 2026-05-16 17:34:42 MDT | 2026-05-16 17:34:42 MDT | 2026-05-16 17:39:42 MDT | 0.8s | 299.4s |
| 48ae44e4-d071-4a18-9c82-e562baa33cda | cli | billy | succeeded | 2026-05-16 17:34:42 MDT | 2026-05-16 17:39:45 MDT | 2026-05-16 17:40:32 MDT | 303.1s | 47.2s |

The parent wrapper timed out at 17:39:42, while the child CLI run did not start until 17:39:45 and then completed successfully at 17:40:32.
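
The timing gap can be checked with a few lines of arithmetic. The sketch below uses only the second-resolution timestamps from the table, so its results differ from the table's sub-second start lag and run time values by a fraction of a second; the helper name `seconds_between` is illustrative and not part of OpenClaw.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two timestamps in the same time zone."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps copied from the table, with the MDT suffix dropped.
created = "2026-05-16 17:34:42"
parent_ended = "2026-05-16 17:39:42"
child_started = "2026-05-16 17:39:45"
child_ended = "2026-05-16 17:40:32"

parent_wait = seconds_between(created, parent_ended)          # 300.0
child_start_lag = seconds_between(created, child_started)     # 303.0
child_run_time = seconds_between(child_started, child_ended)  # 47.0
start_after_timeout = seconds_between(parent_ended, child_started)  # 3.0
```

The child spent roughly 303s waiting to start and only about 47s doing work, so almost the entire 300s budget was consumed before execution began.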

Relevant log/session lines:

  • logs/gateway.log:295605: 2026-05-16T17:39:42.225-06:00 [ws] ⇄ res ✓ agent.wait 300012ms ...
  • agents/billy/sessions/3ba0944a-194f-4237-b0f6-042c02a71fd4.jsonl:54: prompt timeout recorded at 2026-05-16T23:39:42.796Z
  • agents/billy/sessions/3ba0944a-194f-4237-b0f6-042c02a71fd4.jsonl:71: final assistant message at 2026-05-16T23:40:31.369Z
  • logs/gateway.log:295647: 2026-05-16T17:40:32.483-06:00 Both fixes are complete and verified...

The final child output included ## Task Complete and listed both completed fixes, so this was not a failed child run.
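
Because the gateway log uses a -06:00 offset while the session file uses UTC (`Z`), the four log timestamps are easier to compare once normalized. A minimal sketch, assuming nothing beyond the timestamps quoted above (the event labels are illustrative):

```python
from datetime import datetime


def parse_iso(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp; map a trailing Z to +00:00 for older Pythons."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


events = [
    ("gateway: completion text logged", "2026-05-16T17:40:32.483-06:00"),
    ("session: final assistant message", "2026-05-16T23:40:31.369Z"),
    ("gateway: agent.wait returned", "2026-05-16T17:39:42.225-06:00"),
    ("session: prompt timeout recorded", "2026-05-16T23:39:42.796Z"),
]

# Aware datetimes compare correctly across offsets, so a plain sort gives the true order.
ordered = [label for label, ts in sorted(events, key=lambda e: parse_iso(e[1]))]

timeout_to_final = (
    parse_iso("2026-05-16T23:40:31.369Z") - parse_iso("2026-05-16T23:39:42.796Z")
).total_seconds()
```

Sorted this way, the wait returns first, the prompt timeout is recorded about half a second later, and the final assistant message arrives roughly 48.6s after that.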

Actual Behavior

The parent subagent task is marked timed_out after ~300s wall-clock time from wrapper start.

The child runtime can remain queued or blocked for nearly the full parent timeout window, then start after the wrapper has already timed out.

When the child completes successfully after the parent timeout, the requester/Jarvis does not reliably reconcile that late success into the parent state or notify the requester.
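
The failure mode can be reproduced in miniature. The following is a scaled-down simulation, not OpenClaw code: the function names and the sub-second durations are assumptions chosen so that a single wall-clock timeout expires while the child is still queued.

```python
import asyncio


async def child_run(queue_delay: float, work: float, log: list) -> str:
    await asyncio.sleep(queue_delay)  # queued / not started yet
    log.append("child_started")
    await asyncio.sleep(work)         # active execution
    log.append("child_succeeded")
    return "## Task Complete"


async def parent_wait(timeout: float, queue_delay: float, work: float) -> dict:
    log: list = []
    task = asyncio.ensure_future(child_run(queue_delay, work, log))
    try:
        # shield() keeps the child alive when the wait itself times out.
        await asyncio.wait_for(asyncio.shield(task), timeout)
        parent_status = "succeeded"
    except asyncio.TimeoutError:
        parent_status = "timed_out"
        log.append("parent_timed_out")
    # Without something awaiting the child here, its late result would go unseen.
    child_result = await task
    return {"parent_status": parent_status, "child_result": child_result, "log": log}


outcome = asyncio.run(parent_wait(timeout=0.1, queue_delay=0.2, work=0.05))
```

In this run the parent ends up `timed_out` even though the child later returns a successful result, which mirrors the two rows in tasks/runs.sqlite.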

Expected Behavior

The requester should not lose successful child results because of child start delay.

The parent subagent lifecycle should distinguish between:

  • queued/not-started time
  • active child runtime
  • child completed after parent wait timeout

If a child completes after the parent wrapper timed out, OpenClaw should reconcile the parent task state and deliver a late completion/update to the requester, or at minimum surface a clear late_success_after_parent_timeout state.
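
One way to express the distinction is a small classification function over the parent and child rows. This is a sketch under assumptions: the function name, its parameters, and the `queued`/`running` strings are hypothetical; only `timed_out`, `succeeded`, `failed`, and `late_success_after_parent_timeout` come from this report.

```python
from typing import Optional


def classify_lifecycle(
    parent_status: str,
    child_status: str,
    child_started_at: Optional[str],
) -> str:
    """Derive a requester-facing state from the parent and child task rows."""
    if child_status == "succeeded":
        if parent_status == "timed_out":
            # Child finished after the parent wait had already given up.
            return "late_success_after_parent_timeout"
        return "succeeded"
    if child_status == "failed":
        return "failed"
    if child_started_at is None:
        # Time spent here should not count against the execution budget.
        return "queued"
    return "running"


observed = classify_lifecycle("timed_out", "succeeded", "2026-05-16 17:39:45")
```

Applied to the rows in the Evidence section, this yields `late_success_after_parent_timeout` rather than a bare `timed_out`.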

Suggested Fixes

  • Start the subagent execution timeout when the child runtime actually starts, or maintain separate queue/start timeout and active execution timeout budgets.
  • If parent wait times out, keep a watcher/reconciler subscribed to the child run id so late succeeded/failed states are propagated.
  • Reconcile parent rows where runtime='subagent' AND status='timed_out' and a child row with the same run_id later reaches succeeded.
  • Emit a requester-visible notification when late child success arrives after the initial wait timed out.
  • Update UI state so Subagent: <label> does not remain as stale waiting when the child row is already terminal.
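
The reconciliation pass from the list above could look roughly like the following. This is a sketch only: the actual schema of tasks/runs.sqlite is not shown in this report, so the table name `task_runs` and its column layout are assumptions, and the rows are seeded in memory from the Evidence section.

```python
import sqlite3

# Hypothetical schema: table name and columns are assumed, not taken from OpenClaw.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_runs ("
    "task_id TEXT PRIMARY KEY, run_id TEXT, runtime TEXT, status TEXT)"
)
RUN_ID = "f69b3958-5826-4d11-ba2f-1c9dd3d7a811"
PARENT_ID = "eaad5c31-9f18-4df2-9323-53df32ba49dc"
CHILD_ID = "48ae44e4-d071-4a18-9c82-e562baa33cda"
conn.executemany(
    "INSERT INTO task_runs VALUES (?, ?, ?, ?)",
    [
        (PARENT_ID, RUN_ID, "subagent", "timed_out"),
        (CHILD_ID, RUN_ID, "cli", "succeeded"),
    ],
)

# Timed-out subagent parents whose sibling row (same run_id) has succeeded.
FIND_LATE_SUCCESS = """
SELECT p.task_id, c.task_id
FROM task_runs AS p
JOIN task_runs AS c
  ON c.run_id = p.run_id AND c.task_id <> p.task_id
WHERE p.runtime = 'subagent'
  AND p.status = 'timed_out'
  AND c.status = 'succeeded'
"""


def reconcile(db: sqlite3.Connection) -> list:
    """Mark stale parents as late successes and return the (parent, child) pairs."""
    pairs = db.execute(FIND_LATE_SUCCESS).fetchall()
    for parent_id, _child_id in pairs:
        db.execute(
            "UPDATE task_runs SET status = 'late_success_after_parent_timeout' "
            "WHERE task_id = ?",
            (parent_id,),
        )
    db.commit()
    return pairs


reconciled = reconcile(conn)
```

Because the update moves the parent out of `timed_out`, a second pass finds nothing, so the reconciler can run repeatedly without re-notifying the requester.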

Environment

Observed locally on OpenClaw 2026.5.12 on macOS, with agents.defaults.subagents.runTimeoutSeconds=300.

Post-upgrade local patch status (2026-05-19)

Upgraded local install from 2026.5.12 to 2026.5.18 (50a2481).

The previous local subagent-timeout-reconciliation patch is no longer being carried as a local delta:

  • Reapply script status: unchanged on /opt/homebrew/lib/node_modules/openclaw/dist/subagent-registry-Bu5qGLSl.js.
  • The reconciliation marker/behavior was already present in the upgraded bundle.

Post-upgrade smoke checks passed:

  • Gateway and CLI version: 2026.5.18.
  • openclaw status --deep: Gateway reachable, Discord OK, Telegram OK, event loop healthy.
  • openclaw channels status --json: all enabled Discord accounts connected (main, farber, lumbergh, maverick, scout) and Telegram connected.
  • openclaw tasks list --status running --json: 0 running tasks.

Interpretation: this issue appears covered upstream in 2026.5.18; no local patch is being carried for this anymore.

Metadata

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.