subagent timeout leaves zombie claude -p → late output emitted directly to user transport (bypasses parent agent) #76962

@anibalTNS

Environment

  • OpenClaw 2026.5.2
  • Gateway: local mode, systemd
  • Backend: anthropic/claude-sonnet-4-6 via claude-cli OAuth, fallback openai-codex/gpt-5.4

Bug chain (3 linked issues)

Bug A — Subagent timeout does not kill the physical claude -p process

When a subagent session reaches its timeoutSeconds limit, the parent receives
status: timed out. However, the underlying claude -p process spawned
by the subagent is not terminated. It stays alive as a "zombie" (still
running, not merely defunct) and can continue executing tool calls minutes
after the parent declared it timed out.

Observed: claude -p child of gateway (via ps --ppid <gateway_pid>)
alive for 297–306s after timeoutSeconds: 120/180 expired.
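
The observation above can be repeated with a small helper. This is a minimal sketch, assuming a Linux host with procps-ng ps (for the --ppid option and the etimes column); the names ChildProc, parse_ps_rows, children_of and lingering are hypothetical and are not part of OpenClaw.

```python
import subprocess
from typing import List, NamedTuple


class ChildProc(NamedTuple):
    pid: int
    elapsed_s: int  # seconds since the process started (ps "etimes")
    args: str       # full command line


def parse_ps_rows(text: str) -> List[ChildProc]:
    """Parse header-less `pid etimes args` rows printed by ps."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or truncated rows
        rows.append(ChildProc(int(parts[0]), int(parts[1]), parts[2]))
    return rows


def children_of(ppid: int) -> List[ChildProc]:
    """List direct children of ppid (assumes procps-ng ps on Linux)."""
    result = subprocess.run(
        ["ps", "--ppid", str(ppid), "-o", "pid=,etimes=,args="],
        capture_output=True, text=True, check=False,  # ps exits 1 if no match
    )
    return parse_ps_rows(result.stdout)


def lingering(rows: List[ChildProc], needle: str, older_than_s: int) -> List[ChildProc]:
    """Keep rows whose command line contains needle and that outlived the limit."""
    return [r for r in rows if needle in r.args and r.elapsed_s > older_than_s]
```

Calling lingering(children_of(gateway_pid), "claude -p", 120) after the parent reports status: timed out lists any claude -p children that are still running past the configured limit.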

Bug C — Zombie subagent output emitted raw to user transport

When the zombie process completes its pending tool calls after the parent
has already emitted final_answer, the runtime attempts an automatic
session resume. When that resume fails (e.g., because claude-cli errors
out), it emits the raw subagent output directly to the user transport
with the literal prefix:

"Automatic session resume failed, so sending the status directly"

This bypasses the parent agent entirely. The user receives raw internal
output that should never reach them.

Bug D — Cross-model fallback amplifies the leak

After the claude-cli resume fails, the runtime spawns a new session using
the fallback model (gpt-5.4/Codex) to "explain" the previous error. This
session also emits its final_answer to the transport, resulting in
additional unsolicited messages to the user.

Reproduction steps

  1. Configure a subagent with timeoutSeconds: N (tested: 120, 180).
  2. Have the subagent invoke exec with a command that hits
    exec.approval.waitDecision (queue wait > N seconds).
  3. Parent receives status: timed out and emits its own final_answer.
  4. Approval queue eventually processes the exec command.
  5. Zombie claude -p completes, runtime attempts session resume.
  6. Resume fails → raw output emitted to transport.

Expected behavior

  • When timeoutSeconds expires: kill the physical claude -p process (SIGTERM → SIGKILL).
  • If the parent has already emitted final_answer: discard any late output
    from zombie subagents and do not attempt a session resume.
  • Cancel pending exec.approval queue items when the owning subagent has timed out.
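
The SIGTERM → SIGKILL escalation in the first bullet could look roughly like the following. This is a generic Python sketch of the pattern and not the OpenClaw runtime's code; terminate_with_escalation, run_with_timeout and the grace_s parameter are hypothetical names chosen for illustration.

```python
import subprocess


def terminate_with_escalation(proc: subprocess.Popen, grace_s: float = 5.0) -> int:
    """Send SIGTERM, wait up to grace_s, then SIGKILL. Returns the exit code."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()  # reap the child so no defunct entry remains


def run_with_timeout(cmd, timeout_s: float, grace_s: float = 5.0):
    """Run cmd; on timeout, stop the process instead of only reporting it."""
    proc = subprocess.Popen(cmd)
    try:
        return "completed", proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timed out", terminate_with_escalation(proc, grace_s)
```

This only signals the direct child. If the child spawns its own processes, starting it with start_new_session=True and signalling the whole group with os.killpg is one way to reach the descendants as well.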

Workaround (in place)

Restricting the subagent's exec allowlist to a small set of safe binaries
prevents the approval queue from blocking, which eliminates the reproduction
path. This does not fix the underlying zombie/resume issue.

Severity

High — zombie output reaches the user transport, bypassing the parent agent's
output control. In multi-agent architectures with strict output routing, this
breaks the invariant that only the parent speaks to the user.
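
One way to enforce that invariant is a gate in front of the transport that drops anything arriving after the turn is closed or the subagent has expired, and cancels that subagent's queued approvals at the same time. The sketch below is illustrative only: OutputGate and all of its fields and methods are hypothetical and do not correspond to known OpenClaw internals.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class OutputGate:
    """Tracks closed turns, expired subagents, and queued exec approvals."""
    finalized_turns: Set[str] = field(default_factory=set)
    expired_subagents: Set[str] = field(default_factory=set)
    pending_approvals: Dict[str, List[str]] = field(default_factory=dict)
    dropped: List[str] = field(default_factory=list)

    def queue_approval(self, subagent_id: str, command: str) -> None:
        self.pending_approvals.setdefault(subagent_id, []).append(command)

    def on_subagent_timeout(self, subagent_id: str) -> List[str]:
        """Mark the subagent expired and cancel its queued approvals."""
        self.expired_subagents.add(subagent_id)
        return self.pending_approvals.pop(subagent_id, [])

    def on_final_answer(self, turn_id: str) -> None:
        self.finalized_turns.add(turn_id)

    def accept(self, turn_id: str, subagent_id: str, text: str) -> bool:
        """Return True if output may go to the parent; drop late output."""
        if turn_id in self.finalized_turns or subagent_id in self.expired_subagents:
            self.dropped.append(text)  # keep for logs, never forward to the user
            return False
        return True
```

With a gate like this, a late result from a timed-out subagent is recorded in dropped rather than triggering a resume or reaching the user transport.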

Metadata

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:linked-pr-open: ClawSweeper found an open linked pull request for this issue.
  • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:needs-security-review: ClawSweeper marked this issue as needing security-sensitive review.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:security: Security boundary, credential, authz, sandbox, or sensitive-data risk.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
