Problem
There is no shared timeout architecture for child-process execution. The codebase uses 5 different ad-hoc approaches, and the most critical execution layer — openshell.ts — has zero timeout support, making every openshell CLI call unbounded.
This was the root cause of the E2E hang introduced by PR #2398 (reverted in #2471): the dashboard recovery chain wired runOpenshell/captureOpenshell calls as deps, but those calls have no timeout. When the sandbox was in a bad state, openshell sandbox download and openshell forward list hung forever.
Current State: 5 Timeout Approaches, No Shared Pattern
| Approach |
Where |
Example |
Raw spawnSync with inline timeout: |
nemoclaw.ts, sandbox-state.ts, debug.ts, http-probe.ts, shields.ts |
spawnSync("ssh", [...], { timeout: 15000 }) |
curl --connect-timeout / --max-time |
http-probe.ts, inline scripts |
curl --max-time 3 |
Deadline loop with remainingMs() |
nemoclaw.ts connect command |
while (Date.now() < deadline) |
Injected timeoutMs on deps |
cluster-image-patch.ts (#2487) |
pullTimeoutMs, buildTimeoutMs, inspectTimeoutMs |
| No timeout at all |
openshell.ts — runOpenshellCommand and captureOpenshellCommand |
Every openshell CLI call |
Silent Bug Already Shipping
nemoclaw connect passes timeout: remainingMs() to captureOpenshell, but the wrapper in nemoclaw.ts does not forward it, and captureOpenshellCommand in openshell.ts does not accept it:
// nemoclaw.ts — timeout silently dropped
captureOpenshell(["sandbox", "list"], {
ignoreError: true,
timeout: remainingMs(), // ← dead code
}).output;
// nemoclaw.ts wrapper — does not forward timeout
function captureOpenshell(args, opts = {}) {
return captureOpenshellCommand(getOpenshellBinary(), args, {
cwd: ROOT,
env: opts.env,
ignoreError: opts.ignoreError, // timeout not here
errorLine: console.error,
exit: (code) => process.exit(code),
});
}
TypeScript does not catch this because nemoclaw.ts uses untyped opts = {}.
Proposed Solution
-
Add timeout to openshell.ts interfaces — both RunOpenshellOptions and CaptureOpenshellOptions should accept timeout?: number and forward it to spawnSync.
-
Define named timeout constants — follow the pattern Aaron established in cluster-image-patch.ts:
OPENSHELL_SANDBOX_EXEC_TIMEOUT_MS (e.g. 15s — matches existing SSH timeout)
OPENSHELL_FORWARD_LIST_TIMEOUT_MS (e.g. 10s)
OPENSHELL_SANDBOX_DOWNLOAD_TIMEOUT_MS (e.g. 30s)
OPENSHELL_DEFAULT_TIMEOUT_MS (e.g. 30s fallback)
-
Forward timeout in nemoclaw.ts wrappers — runOpenshell and captureOpenshell must pass opts.timeout through.
-
Fix the silent connect bug — the timeout: remainingMs() in the connect readiness loop should actually work once the plumbing is in place.
-
Audit all runOpenshell/captureOpenshell call sites — decide appropriate timeout for each, default to OPENSHELL_DEFAULT_TIMEOUT_MS for any that do not specify one.
Context
Problem
There is no shared timeout architecture for child-process execution. The codebase uses 5 different ad-hoc approaches, and the most critical execution layer —
openshell.ts— has zero timeout support, making everyopenshellCLI call unbounded.This was the root cause of the E2E hang introduced by PR #2398 (reverted in #2471): the dashboard recovery chain wired
runOpenshell/captureOpenshellcalls as deps, but those calls have no timeout. When the sandbox was in a bad state,openshell sandbox downloadandopenshell forward listhung forever.Current State: 5 Timeout Approaches, No Shared Pattern
spawnSyncwith inlinetimeout:nemoclaw.ts,sandbox-state.ts,debug.ts,http-probe.ts,shields.tsspawnSync("ssh", [...], { timeout: 15000 })curl --connect-timeout/--max-timehttp-probe.ts, inline scriptscurl --max-time 3remainingMs()nemoclaw.tsconnect commandwhile (Date.now() < deadline)timeoutMson depscluster-image-patch.ts(#2487)pullTimeoutMs,buildTimeoutMs,inspectTimeoutMsopenshell.ts—runOpenshellCommandandcaptureOpenshellCommandopenshellCLI callSilent Bug Already Shipping
nemoclaw connectpassestimeout: remainingMs()tocaptureOpenshell, but the wrapper innemoclaw.tsdoes not forward it, andcaptureOpenshellCommandinopenshell.tsdoes not accept it:TypeScript does not catch this because
nemoclaw.tsuses untypedopts = {}.Proposed Solution
Add
timeouttoopenshell.tsinterfaces — bothRunOpenshellOptionsandCaptureOpenshellOptionsshould accepttimeout?: numberand forward it tospawnSync.Define named timeout constants — follow the pattern Aaron established in
cluster-image-patch.ts:OPENSHELL_SANDBOX_EXEC_TIMEOUT_MS(e.g. 15s — matches existing SSH timeout)OPENSHELL_FORWARD_LIST_TIMEOUT_MS(e.g. 10s)OPENSHELL_SANDBOX_DOWNLOAD_TIMEOUT_MS(e.g. 30s)OPENSHELL_DEFAULT_TIMEOUT_MS(e.g. 30s fallback)Forward timeout in
nemoclaw.tswrappers —runOpenshellandcaptureOpenshellmust passopts.timeoutthrough.Fix the silent
connectbug — thetimeout: remainingMs()in the connect readiness loop should actually work once the plumbing is in place.Audit all
runOpenshell/captureOpenshellcall sites — decide appropriate timeout for each, default toOPENSHELL_DEFAULT_TIMEOUT_MSfor any that do not specify one.Context
cluster-image-patch.ts(fix(onboard): auto-patch cluster image for Docker 26+ overlayfs nested mount break #2487) — best current example of disciplined timeout injection