Skip to content

Long subagent runs can time out silently without checkpointing or deliverable verification #82784

@ramitrkar-hash

Description

@ramitrkar-hash

Summary

Long specialist subagent runs can appear to be "assigned" and active, but then time out/interruption without producing the required deliverable. A manual status poke can make the agent report useful partial state, which makes the system look like it was idle until nudged. Evidence suggests this is not a queue/dispatch failure: the subagent starts quickly, then gets stuck in an overlong turn with weak checkpointing and weak parent-side deliverable verification.

Recent examples

Observed on 2026-05-16 in Ramit's OpenClaw state.

City expansion analysis

Task label: pwa-city-expansion-analysis / pwa-city-expansion-analysis-v2

Evidence from tasks/runs.sqlite:

  • Maverick run 30954cff-502e-496e-a76a-d2fb23320627
    • created: 2026-05-16 16:34:40 MDT
    • started: 2026-05-16 16:34:41 MDT
    • ended: 2026-05-16 16:50:02 MDT
    • status: timed_out
    • runtime: ~921s
  • Manual/status-poke run 9a00a70a-1dd2-4808-b18d-3c4200cf53dd
    • task: "What's your current status? What have you found so far?"
    • started within ~1s and returned useful partial status in ~56s
  • Clean v2 parent task 2e2f0571-78e8-479f-81cc-6b294b58f3f1
    • status: timed_out, delivery delivered
    • progress summary: [Partial progress: 50 tool call(s) executed before timeout]
    • the required artifact was eventually written, but the task still timed out instead of cleanly completing

The user-visible symptom was: Maverick seemed to sit until Jarvis asked for status, then suddenly reported progress.

Thin-bin assessment

Task label: pwa-thin-bin-assessment

Evidence from tasks/runs.sqlite:

  • Parent task 2fac1f7a-c916-451e-8c1d-eec81cb6a5d6
    • created: 2026-05-16 17:28:32 MDT
    • started: 2026-05-16 17:28:33 MDT
    • ended: 2026-05-16 17:39:03 MDT
    • status: timed_out, delivery delivered
    • progress summary: [Partial progress: 9 tool call(s) executed before timeout]
  • Maverick child run 34f8cc76-9754-470a-bd87-0ea08814118c
    • started within ~1s
    • status: timed_out
    • terminal summary: aborted

The required deliverable did not exist afterward:

/Users/jarviskar/.openclaw/workspace-main/memory/stage5b_thin_bin_systemic_assessment.md

Aggregate signal

Recent local stats over the prior ~3 hours:

  • maverick: 4 runs, 3 timed out, 75.0% timeout rate
  • billy: 13 runs, 0 timed out
  • lumbergh: 13 runs, 2 timed out
  • main: 32 runs, 6 timed out

This points to a reliability issue around long specialist analysis tasks rather than general dispatch.

Expected behavior

For long subagent tasks with a required deliverable:

  1. The subagent should either complete, checkpoint, or proactively report blocked/partial state before timeout.
  2. If the task has a named artifact/deliverable, parent Jarvis should verify the artifact exists and satisfies acceptance criteria before reporting completion.
  3. If a subagent times out after partial tool activity, OpenClaw should surface an explicit "partial progress, resume needed" state and ideally suggest or create a follow-up task that resumes from existing artifacts/logs instead of restarting.
  4. A manual "what is your status?" poke should not be required to extract the partial state.

Actual behavior

  • Subagent starts promptly, so dispatch is working.
  • The run can spend 10-15 minutes inside a turn, time out/interruption, and leave no deliverable or only partial state.
  • Parent-visible state can be ambiguous: status/progress may be delivered even when the required artifact is missing.
  • Manual status steering can produce useful progress info, giving the appearance that the agent was idle until poked.

Suggested fixes

  • Add automatic checkpoint/progress emission for long subagent runs before the timeout boundary.
  • On timeout, include last tool count, last tool/action, known partial files, and whether the named deliverable exists.
  • Add parent-side named-deliverable verification for subagent tasks that specify an output path.
  • Add a "resume from partial work" affordance or automatic follow-up when timeout occurs after nonzero tool activity.
  • Consider warning at spawn time when task size is likely to exceed agents.defaults.subagents.runTimeoutSeconds.

Environment

  • OpenClaw version observed in trajectory metadata: 2026.5.5 and local config last wizard version 2026.5.12
  • OS: macOS / Darwin arm64
  • Config defaults observed:
    • agents.defaults.timeoutSeconds: 300
    • agents.defaults.subagents.runTimeoutSeconds: 300

Workaround applied locally

Updated local workspace-main/skills/agent-handoff/SKILL.md to require splitting/checkpointing/longer budget for Maverick/Farber research tasks and to verify deliverables before reporting completion. This is only a workflow mitigation, not a runtime fix.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Normal backlog priority with limited blast radius.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions