Long subagent runs can time out silently without checkpointing or deliverable verification

## Summary

Long specialist subagent runs can appear to be "assigned" and active, but then time out/interruption without producing the required deliverable. A manual status poke can make the agent report useful partial state, which makes the system look like it was idle until nudged. Evidence suggests this is not a queue/dispatch failure: the subagent starts quickly, then gets stuck in an overlong turn with weak checkpointing and weak parent-side deliverable verification.

## Recent examples

Observed on 2026-05-16 in Ramit's OpenClaw state.

### City expansion analysis

Task label: `pwa-city-expansion-analysis` / `pwa-city-expansion-analysis-v2`

Evidence from `tasks/runs.sqlite`:

- Maverick run `30954cff-502e-496e-a76a-d2fb23320627`
  - created: 2026-05-16 16:34:40 MDT
  - started: 2026-05-16 16:34:41 MDT
  - ended: 2026-05-16 16:50:02 MDT
  - status: `timed_out`
  - runtime: ~921s
- Manual/status-poke run `9a00a70a-1dd2-4808-b18d-3c4200cf53dd`
  - task: "What's your current status? What have you found so far?"
  - started within ~1s and returned useful partial status in ~56s
- Clean v2 parent task `2e2f0571-78e8-479f-81cc-6b294b58f3f1`
  - status: `timed_out`, delivery `delivered`
  - progress summary: `[Partial progress: 50 tool call(s) executed before timeout]`
  - the required artifact was eventually written, but the task still timed out instead of cleanly completing

The user-visible symptom was: Maverick seemed to sit until Jarvis asked for status, then suddenly reported progress.

### Thin-bin assessment

Task label: `pwa-thin-bin-assessment`

Evidence from `tasks/runs.sqlite`:

- Parent task `2fac1f7a-c916-451e-8c1d-eec81cb6a5d6`
  - created: 2026-05-16 17:28:32 MDT
  - started: 2026-05-16 17:28:33 MDT
  - ended: 2026-05-16 17:39:03 MDT
  - status: `timed_out`, delivery `delivered`
  - progress summary: `[Partial progress: 9 tool call(s) executed before timeout]`
- Maverick child run `34f8cc76-9754-470a-bd87-0ea08814118c`
  - started within ~1s
  - status: `timed_out`
  - terminal summary: `aborted`

The required deliverable did not exist afterward:

`/Users/jarviskar/.openclaw/workspace-main/memory/stage5b_thin_bin_systemic_assessment.md`

## Aggregate signal

Recent local stats over the prior ~3 hours:

- `maverick`: 4 runs, 3 timed out, 75.0% timeout rate
- `billy`: 13 runs, 0 timed out
- `lumbergh`: 13 runs, 2 timed out
- `main`: 32 runs, 6 timed out

This points to a reliability issue around long specialist analysis tasks rather than general dispatch.

## Expected behavior

For long subagent tasks with a required deliverable:

1. The subagent should either complete, checkpoint, or proactively report blocked/partial state before timeout.
2. If the task has a named artifact/deliverable, parent Jarvis should verify the artifact exists and satisfies acceptance criteria before reporting completion.
3. If a subagent times out after partial tool activity, OpenClaw should surface an explicit "partial progress, resume needed" state and ideally suggest or create a follow-up task that resumes from existing artifacts/logs instead of restarting.
4. A manual "what is your status?" poke should not be required to extract the partial state.

## Actual behavior

- Subagent starts promptly, so dispatch is working.
- The run can spend 10-15 minutes inside a turn, time out/interruption, and leave no deliverable or only partial state.
- Parent-visible state can be ambiguous: status/progress may be delivered even when the required artifact is missing.
- Manual status steering can produce useful progress info, giving the appearance that the agent was idle until poked.

## Suggested fixes

- Add automatic checkpoint/progress emission for long subagent runs before the timeout boundary.
- On timeout, include last tool count, last tool/action, known partial files, and whether the named deliverable exists.
- Add parent-side named-deliverable verification for subagent tasks that specify an output path.
- Add a "resume from partial work" affordance or automatic follow-up when timeout occurs after nonzero tool activity.
- Consider warning at spawn time when task size is likely to exceed `agents.defaults.subagents.runTimeoutSeconds`.

## Environment

- OpenClaw version observed in trajectory metadata: `2026.5.5` and local config last wizard version `2026.5.12`
- OS: macOS / Darwin arm64
- Config defaults observed:
  - `agents.defaults.timeoutSeconds: 300`
  - `agents.defaults.subagents.runTimeoutSeconds: 300`

## Workaround applied locally

Updated local `workspace-main/skills/agent-handoff/SKILL.md` to require splitting/checkpointing/longer budget for Maverick/Farber research tasks and to verify deliverables before reporting completion. This is only a workflow mitigation, not a runtime fix.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Long subagent runs can time out silently without checkpointing or deliverable verification #82784

Summary

Recent examples

City expansion analysis

Thin-bin assessment

Aggregate signal

Expected behavior

Actual behavior

Suggested fixes

Environment

Workaround applied locally

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Long subagent runs can time out silently without checkpointing or deliverable verification #82784

Description

Summary

Recent examples

City expansion analysis

Thin-bin assessment

Aggregate signal

Expected behavior

Actual behavior

Suggested fixes

Environment

Workaround applied locally

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions