Skip to content

fix(windows): delegate npm global update to detached helper to avoid EBUSY#44614

Closed
Eiman24 wants to merge 1 commit intoopenclaw:mainfrom
Eiman24:fix/windows-ebusy-update
Closed

fix(windows): delegate npm global update to detached helper to avoid EBUSY#44614
Eiman24 wants to merge 1 commit intoopenclaw:mainfrom
Eiman24:fix/windows-ebusy-update

Conversation

@Eiman24
Copy link
Copy Markdown

@Eiman24 Eiman24 commented Mar 13, 2026

Problem

On Windows, the Dashboard "Update now" button (update.run) always fails with EBUSY because:

  1. The Gateway process is running and Node.js keeps loaded .js files locked
  2. npm i -g openclaw@latest needs to rename/delete those locked files
  3. npm fails with EPERM/EBUSY → update never succeeds

This breaks the user expectation that clicking "Update now" should work. The current workaround requires manually stopping the Gateway, running npm, and restarting — which defeats the purpose of having an update button.

Solution

Introduce a detached update flow for Windows that mirrors the existing windows-task-restart.ts pattern:

  1. update-runner.ts detects Windows + global npm install scenario
  2. Spawns a detached .cmd helper script (update-detached-win32.ts)
  3. Returns status: "ok" with detachedResultPath to signal delegation
  4. The update.ts handler sees detachedResultPath → sends response → exits the Gateway process (process.exit(0))
  5. The detached helper:
    • Waits for the Gateway PID to disappear (file locks released)
    • Runs npm i -g openclaw@latest
    • Relaunches the Gateway via Scheduled Task (schtasks /Run)
    • Writes a JSON result file for diagnostics

This is the same approach used by many desktop apps (VS Code, Chrome, etc.): exit the main process, let an updater replace files, then restart.

Changes

File Change
src/infra/update-detached-win32.ts New — Detached update helper (spawn script, read results)
src/infra/update-runner.ts Windows branch in global update path; new detachedResultPath field on UpdateRunResult
src/gateway/server-methods/update.ts Handle detachedResultPath: respond, then process.exit(0) to release file locks
src/infra/update-detached-win32.test.ts New — Tests for spawn and result reading

Non-Windows platforms

No changes. Linux/macOS continue to use the existing in-process npm i -g + SIGUSR1 restart flow, which works fine because those platforms don't lock loaded JS files.

Testing

  • Unit tests for the detached helper module
  • Tested manually on Windows 10 with npm global install

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime size: M labels Mar 13, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 13, 2026

Greptile Summary

This PR introduces a detached-process update flow on Windows to work around the EBUSY/EPERM errors that prevent npm i -g from replacing files locked by the running Gateway process. A .cmd batch helper is spawned in a detached, unref'd child process; it waits for the Gateway PID to disappear, runs the install, relaunches the Gateway via Scheduled Task, and writes a JSON result for diagnostics. The Gateway then calls process.exit(0) after sending its response so the file locks are released. The approach mirrors the existing windows-task-restart pattern and is well-structured.

Issues found:

  • Logic bug (high severity)Math.max(restartDelayMs, 2000) in update.ts:119 where restartDelayMs is number | undefined. When the parameter is absent, Math.max(undefined, 2000) returns NaN, setTimeout(fn, NaN) fires at 0 ms, and process.exit(0) runs before the response can flush. The fix is Math.max(restartDelayMs ?? 0, 2000).
  • Incomplete featurereadDetachedUpdateResult is exported but nothing in this PR reads the result file on Gateway startup. Update failures (e.g. install-failed) are silently dropped with no signal to the user or in logs. Wiring up a startup check to consume this file (and then delete it) would complete the diagnostics loop described in the PR description.

Confidence Score: 3/5

  • Not safe to merge as-is — the NaN setTimeout bug can cause process.exit before the response flushes, breaking the update flow on Windows when restartDelayMs is not supplied.
  • The overall architecture is sound and the batch-script generation is correct, but the Math.max(restartDelayMs, 2000) bug is a reliable, reproducible failure: the common case (no restartDelayMs in the request) yields a 0 ms exit delay, which races against response flushing and likely causes the client to see a dropped connection rather than the intended "detached" response.
  • src/gateway/server-methods/update.ts — fix the Math.max(restartDelayMs ?? 0, 2000) issue on line 119 before merging.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/gateway/server-methods/update.ts
Line: 119

Comment:
**`Math.max(undefined, 2000)` evaluates to `NaN`, causing immediate exit**

`restartDelayMs` has type `number | undefined` (see `parseRestartRequestParams`). When the caller omits this parameter, `restartDelayMs` is `undefined`, and `Math.max(undefined, 2000)` returns `NaN` in JavaScript. `setTimeout(fn, NaN)` treats `NaN` as `0`, so `process.exit(0)` fires on the very next event-loop tick — before the WebSocket / HTTP response written by `respond()` above has had a chance to flush. The client never receives the "detached" response and the update appears to have silently failed.

```suggestion
      }, Math.max(restartDelayMs ?? 0, 2000));
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/infra/update-detached-win32.ts
Line: 178-188

Comment:
**`readDetachedUpdateResult` is exported but never consumed**

The detached helper writes a JSON result file (e.g. `{"ok":false,"reason":"install-failed","exitCode":1}`) so the next Gateway boot can surface the outcome to the user. However, nothing in this PR reads that file on startup. The diagnostic information is silently discarded: a failed update leaves no visible signal in the UI, logs, or restart sentinel.

Consider wiring up a startup check (e.g. in the Gateway boot sequence) that calls `readDetachedUpdateResult` for any pending result path — perhaps stored in the restart sentinel written in `update.ts` at line 90 — and then emits a log warning or an observable update-status event before deleting the result file.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 9f59d37

Comment thread src/gateway/server-methods/update.ts Outdated
setTimeout(() => {
context?.logGateway?.info("Exiting for detached Windows update...");
process.exit(0);
}, Math.max(restartDelayMs, 2000));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Math.max(undefined, 2000) evaluates to NaN, causing immediate exit

restartDelayMs has type number | undefined (see parseRestartRequestParams). When the caller omits this parameter, restartDelayMs is undefined, and Math.max(undefined, 2000) returns NaN in JavaScript. setTimeout(fn, NaN) treats NaN as 0, so process.exit(0) fires on the very next event-loop tick — before the WebSocket / HTTP response written by respond() above has had a chance to flush. The client never receives the "detached" response and the update appears to have silently failed.

Suggested change
}, Math.max(restartDelayMs, 2000));
}, Math.max(restartDelayMs ?? 0, 2000));
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server-methods/update.ts
Line: 119

Comment:
**`Math.max(undefined, 2000)` evaluates to `NaN`, causing immediate exit**

`restartDelayMs` has type `number | undefined` (see `parseRestartRequestParams`). When the caller omits this parameter, `restartDelayMs` is `undefined`, and `Math.max(undefined, 2000)` returns `NaN` in JavaScript. `setTimeout(fn, NaN)` treats `NaN` as `0`, so `process.exit(0)` fires on the very next event-loop tick — before the WebSocket / HTTP response written by `respond()` above has had a chance to flush. The client never receives the "detached" response and the update appears to have silently failed.

```suggestion
      }, Math.max(restartDelayMs ?? 0, 2000));
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +178 to +188
export function readDetachedUpdateResult(
resultPath: string,
): { ok: boolean; reason?: string; exitCode?: number } | null {
try {
const raw = fs.readFileSync(resultPath, "utf8").trim();
const parsed = JSON.parse(raw);
return parsed;
} catch {
return null;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readDetachedUpdateResult is exported but never consumed

The detached helper writes a JSON result file (e.g. {"ok":false,"reason":"install-failed","exitCode":1}) so the next Gateway boot can surface the outcome to the user. However, nothing in this PR reads that file on startup. The diagnostic information is silently discarded: a failed update leaves no visible signal in the UI, logs, or restart sentinel.

Consider wiring up a startup check (e.g. in the Gateway boot sequence) that calls readDetachedUpdateResult for any pending result path — perhaps stored in the restart sentinel written in update.ts at line 90 — and then emits a log warning or an observable update-status event before deleting the result file.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/update-detached-win32.ts
Line: 178-188

Comment:
**`readDetachedUpdateResult` is exported but never consumed**

The detached helper writes a JSON result file (e.g. `{"ok":false,"reason":"install-failed","exitCode":1}`) so the next Gateway boot can surface the outcome to the user. However, nothing in this PR reads that file on startup. The diagnostic information is silently discarded: a failed update leaves no visible signal in the UI, logs, or restart sentinel.

Consider wiring up a startup check (e.g. in the Gateway boot sequence) that calls `readDetachedUpdateResult` for any pending result path — perhaps stored in the restart sentinel written in `update.ts` at line 90 — and then emits a log warning or an observable update-status event before deleting the result file.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f59d379f8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +147 to +151
const child = spawn("cmd.exe", ["/d", "/s", "/c", quoteCmdScriptArg(scriptPath)], {
detached: true,
stdio: "ignore",
windowsHide: true,
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate prepared env to detached updater process

This detached path drops the install environment assembled in runGatewayUpdate: spawnDetachedUpdate accepts env but never passes it to spawn(...), so the helper runs with the parent process environment instead of globalInstallEnv. On Windows, that loses values from createGlobalInstallEnv (notably portable-Git PATH prepends and npm config vars), which can make detached npm i -g fail on machines that rely on those settings even though the previous in-process update path succeeded.

Useful? React with 👍 / 👎.

Comment thread src/gateway/server-methods/update.ts Outdated
setTimeout(() => {
context?.logGateway?.info("Exiting for detached Windows update...");
process.exit(0);
}, Math.max(restartDelayMs, 2000));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clamp detached shutdown delay to helper wait window

The gateway exit delay is taken from restartDelayMs without an upper bound, but the detached helper only waits 60 seconds for this PID before aborting with pid-wait-timeout. If a caller sends restartDelayMs > 60000, the helper exits before the gateway stops, so the install/restart never runs and the later process.exit(0) can leave the gateway down.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff87e1440e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/gateway/server-methods/update.ts Outdated
setTimeout(() => {
context?.logGateway?.info("Exiting for detached Windows update...");
process.exit(0);
}, Math.max(restartDelayMs, 2000));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Default detached exit delay to a finite value

restartDelayMs is optional, but this call computes Math.max(restartDelayMs, 2000) directly; when the caller omits restartDelayMs, that expression becomes NaN, and Node timers treat NaN like an immediate timeout. In the detached update path this can call process.exit(0) almost instantly, which can cut off the update.run response before it flushes to the client in the common default-parameter case.

Useful? React with 👍 / 👎.

Comment on lines +897 to +899
const installArgv = globalInstallArgs(globalManager, spec);
const detached = spawnDetachedUpdate({ installArgv, env: globalInstallEnv as NodeJS.ProcessEnv | undefined });
if (!detached.ok) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve npm omit-optional retry in win32 detached updates

The Windows branch delegates a single globalInstallArgs(...) command to the helper and returns early, so it never reaches the existing npm fallback retry (globalInstallFallbackArgs) used elsewhere in this function. That regresses environments where the first global npm install fails but the --omit=optional retry succeeds, causing Windows updates to fail in cases that were previously handled.

Useful? React with 👍 / 👎.

Comment on lines +134 to +136
const installCommand = params.installArgv
.map((arg) => (arg.includes(" ") ? `"${arg}"` : arg))
.join(" ");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Escape detached install argv for cmd metacharacters

This constructs installCommand by only wrapping args that contain spaces, but cmd.exe still interprets metacharacters like &, |, %, !, and ^ in unescaped arguments. In the detached path, package specs such as signed URLs or local paths containing these characters can be split/expanded and make the update command fail (or execute unintended fragments), whereas the previous in-process argv execution avoided shell parsing.

Useful? React with 👍 / 👎.

…EBUSY

On Windows, 'npm i -g openclaw@latest' fails with EBUSY when the Gateway
process is running because Node.js locks loaded .js files. The Dashboard
'Update now' button always fails on Windows for this reason.

This fix introduces a detached update flow for Windows:
1. update-runner detects Windows + global npm install scenario
2. Spawns a detached .cmd helper script (update-detached-win32.ts)
3. Returns status 'ok' with detachedResultPath so the handler knows
   to exit the Gateway process (not just SIGUSR1 in-process reload)
4. The detached helper waits for the Gateway PID to exit, runs npm
   install, then relaunches the Gateway via Scheduled Task

This mirrors the existing windows-task-restart.ts pattern but adds
the npm install step between stop and restart.

Files changed:
- src/infra/update-detached-win32.ts (new): Detached update helper
- src/infra/update-runner.ts: Windows branch in global update path
- src/gateway/server-methods/update.ts: Handle detachedResultPath
- src/infra/update-detached-win32.test.ts (new): Tests

Fixes: Dashboard 'Update now' EBUSY failure on Windows
@Eiman24 Eiman24 force-pushed the fix/windows-ebusy-update branch from ff87e14 to c83ca14 Compare March 16, 2026 07:41
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c83ca14db7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// Return status "ok" so the caller schedules a graceful shutdown.
// The detached script handles the actual install + restart.
return {
status: "ok",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Report delegated update as pending until install actually finishes

The win32 global-update path returns status: "ok" as soon as the detached helper is spawned, before any install step runs. update.run then uses that status for the API response and restart sentinel, so if the helper later records install-failed or pid-wait-timeout, users still get a success-style restart message on next boot. This masks real update failures and regresses observability versus the previous synchronous path; the delegated flow should not be marked ok until detachedResultPath has been reconciled.

Useful? React with 👍 / 👎.

setTimeout(
() => {
context?.logGateway?.info("Exiting for detached Windows update...");
process.exit(0);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require supervised restart capability before forced detached exit

This branch unconditionally hard-exits the gateway after delegation, but the helper restart mechanism is schtasks-only and no check is performed that the current Windows runtime is actually task-supervised. In unsupervised runs (which the existing restart pipeline explicitly supports via in-process fallback), the helper cannot relaunch the process, so update.run can leave the gateway down immediately after returning success. Gate detached exit behind detected scheduled-task supervision, or keep the prior restart fallback path.

Useful? React with 👍 / 👎.

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 28, 2026

Thanks for the context here. I swept through the related work, and this is now duplicate or superseded.

Close as superseded: the Windows EBUSY update bug is real, but this PR is unsafe as-is and open PR #75649 explicitly carries this work forward with the staged-update, restart-supervision, env, quoting, result-reconciliation, docs, and changelog follow-up.

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Review details

Best possible solution:

Close this source PR and continue review on #75649 as the canonical replacement for #40540, preserving the original contributor credit while landing the safer staged-update design.

Do we have a high-confidence way to reproduce the issue?

Yes. The reproduction path is source-backed by #40540's Windows EBUSY logs plus current main's live Gateway/global npm package update path; I did not run a live Windows update in this read-only review.

Is this the best way to solve the issue?

No. The detached-helper direction is plausible, but this PR is not the best implementation because it bypasses current staged-update safeguards, drops env handling, builds unsafe cmd commands, and reports success before the detached update is reconciled; #75649 is the safer replacement path.

Security review:

Security review needs attention: The PR introduces a Windows command-execution risk in the detached update helper.

  • [medium] Generated cmd script does not preserve argv semantics — src/infra/update-detached-win32.ts:134
    installArgv is converted to one cmd command with only space-based quoting, so shell metacharacters in package specs or paths can alter the command that runs in a supply-chain-sensitive update path.
    Confidence: 0.9

What I checked:

  • Superseding replacement PR exists: Open PR fix(windows): preserve staged update handoff #75649 says it carries forward this PR, preserves the original cherry-picked commit, and adds maintainer follow-up for staged verification, env handling, cmd quoting, restart supervision, fallback restart behavior, detached-result reconciliation, docs, and changelog attribution. (4224f5bc48ca)
  • Original PR remains open and unmerged: The GitHub API reports this PR open at head c83ca14 with no protected labels. (c83ca14db719)
  • Current main still uses the shared package update contract: Current main routes update.run through runGatewayUpdate(...), with restart handling after the update result; it has no detached Windows result path from this PR. (src/gateway/server-methods/update.ts:85, 775c27433fa3)
  • Current main preserves staged npm install and fallback behavior: runGlobalPackageUpdateSteps(...) stages npm installs, retries with global update (omit optional), verifies the staged package, then swaps the staged package root. (src/infra/package-update-steps.ts:332, 775c27433fa3)
  • PR patch has a shutdown-delay bug: The PR branch computes Math.max(restartDelayMs, 2000) even though restartDelayMs is optional, so the default request can schedule an immediate exit. (src/gateway/server-methods/update.ts:121, c83ca14db719)
  • PR patch has a command-execution risk: The detached helper builds a cmd command by quoting only args with spaces, so cmd metacharacters in the package spec or path can alter the executed command. (src/infra/update-detached-win32.ts:134, c83ca14db719)

Likely related people:

  • steipete: Recent commits on the update runner, Gateway update handler, restart helper, and package-update validation make this the clearest current maintainer route for the updater/restart surface. (role: recent maintainer; confidence: high; commits: 781c9b7ab08a, 88da533714e5, e131eaecb502; files: src/infra/package-update-steps.ts, src/infra/update-runner.ts, src/gateway/server-methods/update.ts)
  • shakkernerd: April commits introduced and hardened atomic/staged npm global update behavior that this PR must preserve rather than bypass. (role: introduced staged npm update behavior; confidence: high; commits: 6985c6751c3d, b0127b9f1fb0, 2186080963f4; files: src/infra/package-update-steps.ts, src/infra/update-runner.ts)
  • IAMSamuelRodda: Prior merged work added update restart-sentinel/restart verification plumbing, and open PR fix(windows): preserve staged update handoff #75649 carries this PR forward around that current contract. (role: adjacent owner and replacement PR author; confidence: high; commits: 6c252cc54c3c, 4224f5bc48ca; files: src/gateway/server-methods/update.ts, src/gateway/server-restart-sentinel.ts, src/infra/update-runner.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 775c27433fa3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: M triage: refactor-only Candidate: refactor/cleanup-only PR without maintainer context.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants