Process supervisor: graceful signal escalation and drain timeout for exec tool #66399

@kagura-agent

Problem

When the exec tool times out (either overall-timeout or no-output-timeout), the process supervisor sends an immediate SIGKILL (supervisor.ts#L163):

cancelAdapter = (_reason: TerminationReason) => {
    if (settled) return;
    adapter.kill("SIGKILL");
};

This means:

  1. Processes cannot clean up — temp files, partial writes, network connections are abandoned
  2. No escalation path — SIGTERM (graceful shutdown request) is never attempted
  3. On POSIX, if the process tree does not exit promptly after SIGKILL (for example, a process stuck in uninterruptible sleep, or surviving descendants that keep the stdio pipes open so close never fires), adapter.wait() relies solely on the close event with no independent drain timeout. The 4-second FORCE_KILL_WAIT_FALLBACK_MS fallback only activates on Windows.

Observed impact

Subagent processes (Claude Code, coding agents) that time out lose any in-progress state. Processes that spawn their own children and set up signal handlers for graceful shutdown never get the chance to use them.

Proposed fix: two-phase signal escalation + drain timeout

Phase 1 — Graceful shutdown (SIGTERM + grace period)

cancelAdapter = (reason: TerminationReason) => {
    if (settled) return;
    adapter.kill("SIGTERM");  // Ask nicely first
    forceKillTimer = setTimeout(() => {
        if (!settled) {
            adapter.kill("SIGKILL");  // Force after grace period
        }
    }, GRACEFUL_SHUTDOWN_MS); // e.g. 5000ms
};
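
The snippet above never clears forceKillTimer, so the timer can outlive a process that exits during the grace period, and a second cancel call would arm a second timer. A minimal self-contained sketch of the same escalation with that bookkeeping added is shown here. The names createEscalatingCancel, KillFn, EscalationTimers and markSettled are hypothetical and do not come from supervisor.ts; the injectable timers exist only so the escalation can be exercised without real delays.

```typescript
// Hypothetical sketch, not the supervisor's real API.
type EscalationSignal = "SIGTERM" | "SIGKILL";
type KillFn = (signal: EscalationSignal) => void;

interface EscalationTimers {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const realEscalationTimers: EscalationTimers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

function createEscalatingCancel(
  kill: KillFn,
  gracefulShutdownMs = 5000, // grace period suggested in the issue
  timers: EscalationTimers = realEscalationTimers,
) {
  let settled = false;
  let cancelRequested = false;
  let forceKillTimer: unknown;

  return {
    // Phase 1: SIGTERM now, SIGKILL only if still unsettled after the grace period.
    cancel(): void {
      if (settled || cancelRequested) return; // idempotent: one timer at most
      cancelRequested = true;
      kill("SIGTERM");
      forceKillTimer = timers.set(() => {
        forceKillTimer = undefined;
        if (!settled) kill("SIGKILL");
      }, gracefulShutdownMs);
    },
    // Call when the process has closed, so a pending SIGKILL is dropped.
    markSettled(): void {
      settled = true;
      if (forceKillTimer !== undefined) {
        timers.clear(forceKillTimer);
        forceKillTimer = undefined;
      }
    },
  };
}
```

In the supervisor, markSettled would correspond to wherever settled is set to true today; that mapping is an assumption about code not shown in this issue.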

Phase 2 — Independent drain timeout

After SIGKILL, add a POSIX drain timeout (not just Windows) to prevent wait() from hanging indefinitely:

const DRAIN_TIMEOUT_MS = 10_000; // after SIGKILL, max wait for close event

// In the kill() path, after SIGKILL:
scheduleForceKillWaitFallback("SIGKILL"); // Already exists for Windows
// Extend to all platforms, not just Windows

The existing FORCE_KILL_WAIT_FALLBACK_MS (4000ms, Windows-only) could be unified into a cross-platform drain timeout.
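
One way the cross-platform drain timeout could look is sketched below: a race between the close event and a timer, where whichever fires first settles the wait exactly once. The names watchDrain, DrainOutcome and DrainTimers are hypothetical, and how child.ts actually subscribes to close is not shown in this issue, so the subscription is passed in as a callback.

```typescript
// Hypothetical sketch of a platform-independent drain timeout.
type DrainOutcome = "closed" | "drain-timeout";

interface DrainTimers {
  set(fn: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const realDrainTimers: DrainTimers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

function watchDrain(
  subscribeClose: (listener: () => void) => void, // e.g. (l) => child.once("close", l)
  onSettle: (outcome: DrainOutcome) => void,
  drainTimeoutMs = 10_000, // DRAIN_TIMEOUT_MS from the proposal
  timers: DrainTimers = realDrainTimers,
): void {
  let done = false;
  let handle: unknown;

  const finish = (outcome: DrainOutcome): void => {
    if (done) return; // settle exactly once; a late close is ignored
    done = true;
    timers.clear(handle);
    onSettle(outcome);
  };

  // Start the drain timer, then listen for close; the first to fire wins.
  handle = timers.set(() => finish("drain-timeout"), drainTimeoutMs);
  subscribeClose(() => finish("closed"));
}
```

A caller would start the watch right after sending SIGKILL and resolve wait() from onSettle; what wait() should report on "drain-timeout" (an error, or a synthetic exit status) is left open here.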

Optional: pipe close watchdog

When cancellation is requested, explicitly close stdout/stderr pipes to unblock any blocked readers before sending the kill signal:

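cancelAdapter = (reason: TerminationReason) => {
    if (settled) return;
    // Close pipes first so blocked readers and writers are released
    adapter.stdin?.destroy();   // child sees EOF on its input
    // Assumes the adapter also exposes stdout/stderr streams (not confirmed):
    adapter.stdout?.destroy();
    adapter.stderr?.destroy();
    adapter.kill("SIGTERM");
    // ... escalation timer ...
};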

Prior art

  • multica #947 (linked as "docs(bird): update skill for v0.7.0 commands"): implemented a three-layer defense (pipe close → drain timeout → context-aware select) across all 4 agent backends after observing a daemon stall caused by a hung Claude Code process.
  • Go exec.Cmd.WaitDelay: Go 1.20 added WaitDelay as a built-in mechanism for this pattern. A timer starts once the context is done or Wait observes the process exit; when it elapses, the process is force-killed and its I/O pipes are closed.

Scope

  • src/process/supervisor/supervisor.ts — signal escalation in cancelAdapter
  • src/process/supervisor/adapters/child.ts — cross-platform drain timeout (extend FORCE_KILL_WAIT_FALLBACK_MS to POSIX)
  • Tests for: SIGTERM → SIGKILL escalation, drain timeout on POSIX, pipe close unblocking

Metadata

    Labels

    • P1: High-priority user-facing bug, regression, or broken workflow.
    • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
    • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
    • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
    • impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
    • impact:data-loss: Can lose, corrupt, or silently drop user/session/config data.
    • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
    • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
