Skip to content

refactor(arch): unified timeout abstraction for child-process execution #2562

@jyaunches

Description

@jyaunches

Problem

There is no shared timeout architecture for child-process execution. The codebase uses 5 different ad-hoc approaches, and the most critical execution layer — openshell.ts — has zero timeout support, making every openshell CLI call unbounded.

This was the root cause of the E2E hang introduced by PR #2398 (reverted in #2471): the dashboard recovery chain wired runOpenshell/captureOpenshell calls as deps, but those calls have no timeout. When the sandbox was in a bad state, openshell sandbox download and openshell forward list hung forever.

Current State: 5 Timeout Approaches, No Shared Pattern

Approach Where Example
Raw spawnSync with inline timeout: nemoclaw.ts, sandbox-state.ts, debug.ts, http-probe.ts, shields.ts spawnSync("ssh", [...], { timeout: 15000 })
curl --connect-timeout / --max-time http-probe.ts, inline scripts curl --max-time 3
Deadline loop with remainingMs() nemoclaw.ts connect command while (Date.now() < deadline)
Injected timeoutMs on deps cluster-image-patch.ts (#2487) pullTimeoutMs, buildTimeoutMs, inspectTimeoutMs
No timeout at all openshell.tsrunOpenshellCommand and captureOpenshellCommand Every openshell CLI call

Silent Bug Already Shipping

nemoclaw connect passes timeout: remainingMs() to captureOpenshell, but the wrapper in nemoclaw.ts does not forward it, and captureOpenshellCommand in openshell.ts does not accept it:

// nemoclaw.ts — timeout silently dropped
captureOpenshell(["sandbox", "list"], {
  ignoreError: true,
  timeout: remainingMs(),  // ← dead code
}).output;

// nemoclaw.ts wrapper — does not forward timeout
function captureOpenshell(args, opts = {}) {
  return captureOpenshellCommand(getOpenshellBinary(), args, {
    cwd: ROOT,
    env: opts.env,
    ignoreError: opts.ignoreError,  // timeout not here
    errorLine: console.error,
    exit: (code) => process.exit(code),
  });
}

TypeScript does not catch this because nemoclaw.ts uses untyped opts = {}.

Proposed Solution

  1. Add timeout to openshell.ts interfaces — both RunOpenshellOptions and CaptureOpenshellOptions should accept timeout?: number and forward it to spawnSync.

  2. Define named timeout constants — follow the pattern Aaron established in cluster-image-patch.ts:

    • OPENSHELL_SANDBOX_EXEC_TIMEOUT_MS (e.g. 15s — matches existing SSH timeout)
    • OPENSHELL_FORWARD_LIST_TIMEOUT_MS (e.g. 10s)
    • OPENSHELL_SANDBOX_DOWNLOAD_TIMEOUT_MS (e.g. 30s)
    • OPENSHELL_DEFAULT_TIMEOUT_MS (e.g. 30s fallback)
  3. Forward timeout in nemoclaw.ts wrappersrunOpenshell and captureOpenshell must pass opts.timeout through.

  4. Fix the silent connect bug — the timeout: remainingMs() in the connect readiness loop should actually work once the plumbing is in place.

  5. Audit all runOpenshell/captureOpenshell call sites — decide appropriate timeout for each, default to OPENSHELL_DEFAULT_TIMEOUT_MS for any that do not specify one.

Context

Metadata

Metadata

Assignees

No one assigned

    Labels

    area: architectureArchitecture, design debt, major refactors, or maintainability

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions