Skip to content

fix(windows): detect stale gateway PIDs via netstat#52485

Closed
andyk-ms wants to merge 3 commits intoopenclaw:mainfrom
andyk-ms:fix/windows-stale-pid-detection
Closed

fix(windows): detect stale gateway PIDs via netstat#52485
andyk-ms wants to merge 3 commits intoopenclaw:mainfrom
andyk-ms:fix/windows-stale-pid-detection

Conversation

@andyk-ms
Copy link
Copy Markdown
Contributor

Problem

\ indGatewayPidsOnPortSync()\ returns []\ on Windows (line 89: early return). This means \openclaw gateway stop\ cannot detect or kill stale
ode.exe\ processes holding the gateway port, causing EADDRINUSE on restart.

Fix

  • Add \ indGatewayPidsOnPortWindows()\ — uses
    etstat -ano\ to find PIDs listening on the gateway port, verified via \ asklist\ to confirm they are node.exe processes
  • Add \pollPortOnceWindows()\ — polls port availability after termination using netstat
  • Route both through Windows implementations when \process.platform === 'win32'\

No changes to \ erminateStaleProcessesSync()\ — Node.js \process.kill(pid, 'SIGTERM')\ already works on Windows.

Testing

Tested on Windows 10 (build 26200) with 3 scenarios:

  1. Normal stop+start — stale node.exe detected and killed, clean restart
  2. Port conflict — netstat correctly identifies PID holding port 18789
  3. Clean state — returns empty array when no stale processes exist

Fixes #52049

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 22, 2026

Greptile Summary

This PR fixes the long-standing Windows no-op in findGatewayPidsOnPortSync (which previously returned [] unconditionally on win32) by adding proper netstat-based PID detection and port polling. The approach is architecturally consistent with the existing Unix path and correctly routes through the new implementations at both call sites.

Issues found:

  • P1 logic bug: The try/catch block around spawnSync("tasklist", …) in findGatewayPidsOnPortWindows is dead code. spawnSync never throws; errors are reported via res.error. If tasklist times out or fails, tl.stdout is empty and the PID is silently dropped — the opposite of the stated intent ("include the PID anyway — better to kill a stale process"). This should use an explicit tl.error || tl.status !== 0 check.
  • P2 style: The original JSDoc comment block at lines 81–84 ("Find PIDs of gateway processes listening on the given port using synchronous lsof") was left stranded between the two new functions and is no longer attached to findGatewayPidsOnPortSync. It should be moved.

Confidence Score: 4/5

  • Safe to merge after fixing the dead try/catch around tasklist — the happy path works correctly and the bug only affects the error-handling fallback.
  • The PR correctly solves the Windows stub problem and the netstat-based approach is sound. One P1 logic bug exists: the try/catch around spawnSync("tasklist") is dead code and the intended fallback (include PID on failure) is silently broken. In practice tasklist rarely fails, so this won't block normal usage, but fixing it before merge is strongly recommended. The dangling JSDoc is cosmetic only. No data-loss or security risk.
  • src/infra/restart-stale-pids.ts — specifically the tasklist error-handling block inside findGatewayPidsOnPortWindows

Comments Outside Diff (1)

  1. src/infra/restart-stale-pids.ts, line 81-84 (link)

    P2 Orphaned JSDoc block

    The original JSDoc comment (lines 81–84) was written for findGatewayPidsOnPortSync, but the new Windows function was inserted directly after it. The comment now has no function beneath it and is attached to nothing — TypeScript tooling will associate it with findGatewayPidsOnPortWindows, shadowing that function's own JSDoc. The original comment should either be moved to findGatewayPidsOnPortSync (where it belongs) or removed.

    Move this block down to sit immediately above export function findGatewayPidsOnPortSync and remove it from its current position.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/infra/restart-stale-pids.ts
    Line: 81-84
    
    Comment:
    **Orphaned JSDoc block**
    
    The original JSDoc comment (lines 81–84) was written for `findGatewayPidsOnPortSync`, but the new Windows function was inserted directly after it. The comment now has no function beneath it and is attached to nothing — TypeScript tooling will associate it with `findGatewayPidsOnPortWindows`, shadowing that function's own JSDoc. The original comment should either be moved to `findGatewayPidsOnPortSync` (where it belongs) or removed.
    
    
    
    Move this block down to sit immediately above `export function findGatewayPidsOnPortSync` and remove it from its current position.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/infra/restart-stale-pids.ts
Line: 128-141

Comment:
**Dead `catch` block — fallback silently dropped**

`spawnSync` never throws. When `tasklist` times out or exits non-zero, `tl.error` is populated and `tl.stdout` is empty/null, but the `catch` block is never reached. The stated intent — _"If tasklist fails, include the PID anyway"_ — is silently broken: `tasklist` failures cause the PID to be **dropped** instead of conservatively included.

The fix is to check `tl.error` (or a non-zero/null status) explicitly:

```suggestion
    const tl = spawnSync("tasklist", ["/FI", `PID eq ${pid}`, "/FO", "CSV", "/NH"], {
        encoding: "utf8",
        timeout: spawnTimeoutMs,
        windowsHide: true,
      });
      if (tl.error || tl.status == null || tl.status !== 0) {
        // tasklist failed — include the PID anyway — better to kill a stale
        // process than leave a port occupied.
        confirmed.push(pid);
      } else if (tl.stdout && tl.stdout.toLowerCase().includes("node")) {
        confirmed.push(pid);
      }
```

Replace the whole `try/catch` block with the above (no `try` or `catch` wrappers needed).

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/infra/restart-stale-pids.ts
Line: 81-84

Comment:
**Orphaned JSDoc block**

The original JSDoc comment (lines 81–84) was written for `findGatewayPidsOnPortSync`, but the new Windows function was inserted directly after it. The comment now has no function beneath it and is attached to nothing — TypeScript tooling will associate it with `findGatewayPidsOnPortWindows`, shadowing that function's own JSDoc. The original comment should either be moved to `findGatewayPidsOnPortSync` (where it belongs) or removed.

```suggestion
/**
 * Find PIDs of gateway processes listening on the given port using synchronous lsof.
 * Returns only PIDs that belong to openclaw gateway processes (not the current process).
 */
```

Move this block down to sit immediately above `export function findGatewayPidsOnPortSync` and remove it from its current position.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(windows): detect stale gateway PIDs ..." | Re-trigger Greptile

Comment thread src/infra/restart-stale-pids.ts Outdated
Comment on lines +128 to +141
try {
const tl = spawnSync("tasklist", ["/FI", `PID eq ${pid}`, "/FO", "CSV", "/NH"], {
encoding: "utf8",
timeout: spawnTimeoutMs,
windowsHide: true,
});
if (tl.stdout && tl.stdout.toLowerCase().includes("node")) {
confirmed.push(pid);
}
} catch {
// If tasklist fails, include the PID anyway — better to kill a stale
// process than leave a port occupied.
confirmed.push(pid);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Dead catch block — fallback silently dropped

spawnSync never throws. When tasklist times out or exits non-zero, tl.error is populated and tl.stdout is empty/null, but the catch block is never reached. The stated intent — "If tasklist fails, include the PID anyway" — is silently broken: tasklist failures cause the PID to be dropped instead of conservatively included.

The fix is to check tl.error (or a non-zero/null status) explicitly:

Suggested change
try {
const tl = spawnSync("tasklist", ["/FI", `PID eq ${pid}`, "/FO", "CSV", "/NH"], {
encoding: "utf8",
timeout: spawnTimeoutMs,
windowsHide: true,
});
if (tl.stdout && tl.stdout.toLowerCase().includes("node")) {
confirmed.push(pid);
}
} catch {
// If tasklist fails, include the PID anyway — better to kill a stale
// process than leave a port occupied.
confirmed.push(pid);
}
const tl = spawnSync("tasklist", ["/FI", `PID eq ${pid}`, "/FO", "CSV", "/NH"], {
encoding: "utf8",
timeout: spawnTimeoutMs,
windowsHide: true,
});
if (tl.error || tl.status == null || tl.status !== 0) {
// tasklist failed — include the PID anyway — better to kill a stale
// process than leave a port occupied.
confirmed.push(pid);
} else if (tl.stdout && tl.stdout.toLowerCase().includes("node")) {
confirmed.push(pid);
}

Replace the whole try/catch block with the above (no try or catch wrappers needed).

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/restart-stale-pids.ts
Line: 128-141

Comment:
**Dead `catch` block — fallback silently dropped**

`spawnSync` never throws. When `tasklist` times out or exits non-zero, `tl.error` is populated and `tl.stdout` is empty/null, but the `catch` block is never reached. The stated intent — _"If tasklist fails, include the PID anyway"_ — is silently broken: `tasklist` failures cause the PID to be **dropped** instead of conservatively included.

The fix is to check `tl.error` (or a non-zero/null status) explicitly:

```suggestion
    const tl = spawnSync("tasklist", ["/FI", `PID eq ${pid}`, "/FO", "CSV", "/NH"], {
        encoding: "utf8",
        timeout: spawnTimeoutMs,
        windowsHide: true,
      });
      if (tl.error || tl.status == null || tl.status !== 0) {
        // tasklist failed — include the PID anyway — better to kill a stale
        // process than leave a port occupied.
        confirmed.push(pid);
      } else if (tl.stdout && tl.stdout.toLowerCase().includes("node")) {
        confirmed.push(pid);
      }
```

Replace the whole `try/catch` block with the above (no `try` or `catch` wrappers needed).

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e069075a4b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/infra/restart-stale-pids.ts Outdated
Comment on lines +134 to +135
if (tl.stdout && tl.stdout.toLowerCase().includes("node")) {
confirmed.push(pid);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restrict Windows stale-PID match to OpenClaw process

findGatewayPidsOnPortWindows claims to return only OpenClaw gateway PIDs, but this check accepts any tasklist entry containing node, so any unrelated Node app listening on the configured gateway port can be terminated during cleanStaleGatewayProcessesSync restart cleanup. This is a behavioral regression versus the Unix path (which filters to OpenClaw command names) and can kill non-OpenClaw workloads if they bind the same port.

Useful? React with 👍 / 👎.

@andyk-ms
Copy link
Copy Markdown
Contributor Author

Addressed review feedback with two fixes:

Fix 1 (P1 logic bug): Replaced dead try/catch around spawnSync(tasklist) in findGatewayPidsOnPortWindows with an explicit tl.error || tl.status !== 0 check. spawnSync never throws, errors are reported via the .error property, so the catch block was unreachable dead code. Now errors are properly detected and the PID is included as intended.

Fix 2 (P2 orphaned JSDoc): Moved the stranded JSDoc comment from between the Windows function and findGatewayPidsOnPortSync to sit directly above findGatewayPidsOnPortSync where it belongs.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 315e2f7968

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +129 to +133
if (tl.error || tl.status !== 0) {
// If tasklist fails, include the PID anyway — better to kill a stale
// process than leave a port occupied.
confirmed.push(pid);
continue;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip PIDs when process identity cannot be verified

When tasklist fails, this branch still adds the PID to confirmed, so restart cleanup can terminate whatever process is listening on the gateway port even if it is not OpenClaw. This is a real risk on timeout/permission/transient failures of tasklist: the code explicitly bypasses identity verification and proceeds to kill by PID, which can disrupt unrelated services bound to that port.

Useful? React with 👍 / 👎.

@andyk-ms
Copy link
Copy Markdown
Contributor Author

andyk-ms commented Apr 4, 2026

Pushed test fix: the win32 test was still expecting the old no-op behavior (return [], don't call spawnSync). Updated to mock netstat output and verify the Windows code path is actually exercised. The TypeError on line 99 (res undefined) should be resolved — CI will confirm.

andyk-ms added 3 commits April 4, 2026 14:37
findGatewayPidsOnPortSync() returned [] on Windows, so gateway stop
could not kill stale node.exe holding the port (EADDRINUSE on restart).

Add netstat-based PID detection and port polling for Windows.

Fixes openclaw#52049
@andyk-ms andyk-ms force-pushed the fix/windows-stale-pid-detection branch from 15474e0 to e3e28a8 Compare April 4, 2026 21:37
@andyk-ms
Copy link
Copy Markdown
Contributor Author

andyk-ms commented Apr 5, 2026

Closing — this was superseded by #60480 which landed the same Windows stale-PID cleanup (netstat + command-line verification via isGatewayArgv) with a cleaner shared-helper architecture. All three review concerns (P1: dead catch, P1: broad node match, P2: skip unverified) are addressed in that implementation. Thanks @arifahmedjoy!

@andyk-ms andyk-ms closed this Apr 5, 2026
@arifahmedjoy
Copy link
Copy Markdown
Contributor

You are welcome, @andyk-ms!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: gateway stop doesn't terminate node.exe process on Windows

2 participants