Skip to content

fix(windows): stabilize gateway restart and avoid false stale cleanup [AI-assisted]#50136

Open
CharlesHappy wants to merge 1 commit intoopenclaw:mainfrom
CharlesHappy:fix/windows-gateway-restart
Open

fix(windows): stabilize gateway restart and avoid false stale cleanup [AI-assisted]#50136
CharlesHappy wants to merge 1 commit intoopenclaw:mainfrom
CharlesHappy:fix/windows-gateway-restart

Conversation

@CharlesHappy
Copy link
Copy Markdown

Problem

On native Windows environments, openclaw gateway restart could be unreliable and sometimes incorrect:

  • restart could be reported as failed even when the gateway was actually running
  • unrelated processes could be misclassified as gateway processes and terminated
  • restart recovery could aggressively clean up listeners based on weak attribution
  • restart success could be accepted without verifying whether the process was actually replaced

Root cause

This was caused by several Windows-specific issues:

  1. Unknown listeners treated as stale

    • Due to weak process attribution on Windows, restart recovery could treat unknown listeners as stale and terminate them.
  2. Loose gateway process classification

    • Any process containing "openclaw" in its command line could be misclassified as a gateway process.
  3. Lagging or unreliable runtime metadata

    • schtasks runtime state may be delayed or incomplete, and is not a reliable signal immediately after restart.
  4. Insufficient restart identity validation

    • Restart success relied on probe results without checking whether the process identity actually changed.

What changed

1. Safer restart recovery (Windows)

  • Removed includeUnknownListenersAsStale from the Windows service-managed restart path
  • Only positively identified gateway processes are eligible for stale cleanup
  • Unknown listeners are no longer terminated

2. Stronger restart health criteria

  • On Windows:
    • port busy + probe success is treated as sufficient restart health
    • even when scheduled-task runtime is still unknown or lagging

3. Strict gateway process classification

  • Gateway identification now requires:

    • valid gateway argv (isGatewayArgv(...)), OR
    • executable path clearly matching openclaw(.exe) or openclaw-gateway(.exe)
  • Ambiguous processes remain unknown and are never selected for termination


4. Lightweight restart identity check

  • Capture pre-restart PID
  • After restart:
    • emit a warning if PID appears unchanged
    • helps surface cases where restart did not clearly replace the process

5. Improved diagnostics

Restart logs now include:

  • runtime state
  • probe result
  • listener classification (gateway / unknown / other)
  • stale termination candidates
  • failure reason

6. Tests updated and added

  • src/cli/daemon-cli/restart-health.test.ts
  • src/cli/daemon-cli/lifecycle.test.ts
  • src/infra/ports-format.test.ts
  • src/cli/daemon-cli/status.gather.test.ts

Updates include:

  • adding required GatewayRestartSnapshot fields:
    • probeResult
    • listenerKinds
  • aligning stale mocks/fixtures with the current snapshot shape

Why this fixes Windows restart

Before:

  • Weak runtime signal + weak process attribution + aggressive cleanup
  • → could kill valid processes or misreport restart failure

After:

  • Unknown listeners are never treated as stale
  • Gateway process classification is strict
  • Probe success is trusted over lagging runtime
  • Restart identity is partially verified via PID comparison

This removes false-positive cleanup paths and stabilizes restart behavior on Windows while preserving existing Linux/macOS behavior.


Validation

Automated

  • pnpm check passed
  • Focused tests related to this change all passed:
pnpm test -- src/cli/daemon-cli/restart-health.test.ts \
           src/cli/daemon-cli/lifecycle.test.ts \
           src/infra/ports-format.test.ts \
           src/cli/daemon-cli/status.gather.test.ts

@openclaw-barnacle openclaw-barnacle Bot added cli CLI command changes size: M labels Mar 19, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 19, 2026

Greptile Summary

This PR stabilizes openclaw gateway restart on Windows by tightening three interlocking behaviors: (1) gateway process classification now requires a verifiable argv or exact executable basename match rather than a loose "contains openclaw" substring check, (2) the includeUnknownListenersAsStale flag is removed from both managed-restart call sites so unknown listeners can never be incorrectly terminated, and (3) a lightweight PID identity check warns when the post-restart health snapshot looks identical to the pre-restart state, surfacing lagging schtasks metadata.

Key observations:

  • The new probeResult, listenerKinds, and failureReason fields on GatewayRestartSnapshot cleanly model all diagnostic information that was previously implicit, and the renderRestartDiagnostics additions make those fields visible in logs.
  • Identity warning not in warnings for non-JSON mode (lifecycle.ts lines 249–254): the if (!json) … else … pattern means the warning is either logged or pushed to warnings, never both — inconsistent with every other warning in the same function, which always pushes unconditionally.
  • process.platform leak in restart-health.test.ts (lines 213 and 268): both new tests override process.platform = "win32" without a try/finally restore, letting the override bleed into the next test in execution order. lifecycle.test.ts already demonstrates the correct cleanup pattern.

Confidence Score: 3/5

  • Safe to merge after resolving the identity-warning inconsistency and adding platform-cleanup guards to the two new tests.
  • The core logic changes — removing includeUnknownListenersAsStale, tightening process classification, and adding PID-identity diagnostics — are correct and well-tested on the happy paths. Two issues hold the score back: the identity warning is silently dropped from the warnings array in non-JSON mode (a behavioral inconsistency that could hide the caveat from callers), and two new tests fail to restore process.platform, which can cause silent test pollution of the surrounding suite.
  • src/cli/daemon-cli/lifecycle.ts (identity warning consistency) and src/cli/daemon-cli/restart-health.test.ts (platform cleanup in both new tests).
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle.ts
Line: 249-254

Comment:
**Identity warning not added to `warnings` in non-JSON mode**

The identity warning uses a mutually exclusive `if (!json) ... else ...` pattern — it's either logged *or* pushed to `warnings`, never both. Every other warning in this function (e.g. the stale-pid warning just above) is always pushed to `warnings` unconditionally AND then also logged in non-JSON mode. As a result, any code that inspects the `warnings` array after a non-JSON restart will silently miss this caveat.

```suggestion
          const identityWarning = `Gateway restart became healthy, but process identity did not clearly change from pid ${preRestartRuntimePid}; runtime may be lagging or the old process may still own the listener.`;
          warnings.push(identityWarning);
          if (!json) {
            defaultRuntime.log(theme.warn(identityWarning));
          }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/cli/daemon-cli/restart-health.test.ts
Line: 213-235

Comment:
**`process.platform` override leaks into subsequent tests**

Both new tests in this file (`"treats a busy unknown Windows listener…"` at line 213 and `"reports probe result, listener classification, and failure reason in diagnostics"` at line 268) call `Object.defineProperty(process, "platform", { value: "win32", … })` but never restore the original value. This lets the `"win32"` override bleed into later tests — most notably `"treats auth-closed probe as healthy gateway reachability"` (line 237), which runs on the leaked `"win32"` platform and may silently change its behavior.

The newly added test in `lifecycle.test.ts` already demonstrates the correct pattern using `try / finally`:

```ts
const originalPlatform = process.platform;
Object.defineProperty(process, "platform", { value: "win32", configurable: true });
try {
  // ... test body ...
} finally {
  Object.defineProperty(process, "platform", { value: originalPlatform, configurable: true });
}
```

Apply the same `try / finally` guard to both new tests here (line 213 and line 268).

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "fix(windows): stabil..."

Comment on lines +249 to +254
const identityWarning = `Gateway restart became healthy, but process identity did not clearly change from pid ${preRestartRuntimePid}; runtime may be lagging or the old process may still own the listener.`;
if (!json) {
defaultRuntime.log(theme.warn(identityWarning));
} else {
warnings.push(identityWarning);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Identity warning not added to warnings in non-JSON mode

The identity warning uses a mutually exclusive if (!json) ... else ... pattern — it's either logged or pushed to warnings, never both. Every other warning in this function (e.g. the stale-pid warning just above) is always pushed to warnings unconditionally AND then also logged in non-JSON mode. As a result, any code that inspects the warnings array after a non-JSON restart will silently miss this caveat.

Suggested change
const identityWarning = `Gateway restart became healthy, but process identity did not clearly change from pid ${preRestartRuntimePid}; runtime may be lagging or the old process may still own the listener.`;
if (!json) {
defaultRuntime.log(theme.warn(identityWarning));
} else {
warnings.push(identityWarning);
}
const identityWarning = `Gateway restart became healthy, but process identity did not clearly change from pid ${preRestartRuntimePid}; runtime may be lagging or the old process may still own the listener.`;
warnings.push(identityWarning);
if (!json) {
defaultRuntime.log(theme.warn(identityWarning));
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle.ts
Line: 249-254

Comment:
**Identity warning not added to `warnings` in non-JSON mode**

The identity warning uses a mutually exclusive `if (!json) ... else ...` pattern — it's either logged *or* pushed to `warnings`, never both. Every other warning in this function (e.g. the stale-pid warning just above) is always pushed to `warnings` unconditionally AND then also logged in non-JSON mode. As a result, any code that inspects the `warnings` array after a non-JSON restart will silently miss this caveat.

```suggestion
          const identityWarning = `Gateway restart became healthy, but process identity did not clearly change from pid ${preRestartRuntimePid}; runtime may be lagging or the old process may still own the listener.`;
          warnings.push(identityWarning);
          if (!json) {
            defaultRuntime.log(theme.warn(identityWarning));
          }
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +213 to +235
it("treats a busy unknown Windows listener as healthy when the probe succeeds", async () => {
Object.defineProperty(process, "platform", { value: "win32", configurable: true });
classifyPortListener.mockReturnValue("unknown");
probeGateway.mockResolvedValue({
ok: true,
close: null,
});

const snapshot = await inspectGatewayRestartWithSnapshot({
runtime: { status: "stopped" },
portUsage: {
port: 18789,
status: "busy",
listeners: [{ pid: 9100, commandLine: "" }],
hints: [],
},
includeUnknownListenersAsStale: true,
});

expect(snapshot.healthy).toBe(true);
expect(snapshot.staleGatewayPids).toEqual([]);
expect(snapshot.probeResult).toBe("ok");
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 process.platform override leaks into subsequent tests

Both new tests in this file ("treats a busy unknown Windows listener…" at line 213 and "reports probe result, listener classification, and failure reason in diagnostics" at line 268) call Object.defineProperty(process, "platform", { value: "win32", … }) but never restore the original value. This lets the "win32" override bleed into later tests — most notably "treats auth-closed probe as healthy gateway reachability" (line 237), which runs on the leaked "win32" platform and may silently change its behavior.

The newly added test in lifecycle.test.ts already demonstrates the correct pattern using try / finally:

const originalPlatform = process.platform;
Object.defineProperty(process, "platform", { value: "win32", configurable: true });
try {
  // ... test body ...
} finally {
  Object.defineProperty(process, "platform", { value: originalPlatform, configurable: true });
}

Apply the same try / finally guard to both new tests here (line 213 and line 268).

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/restart-health.test.ts
Line: 213-235

Comment:
**`process.platform` override leaks into subsequent tests**

Both new tests in this file (`"treats a busy unknown Windows listener…"` at line 213 and `"reports probe result, listener classification, and failure reason in diagnostics"` at line 268) call `Object.defineProperty(process, "platform", { value: "win32", … })` but never restore the original value. This lets the `"win32"` override bleed into later tests — most notably `"treats auth-closed probe as healthy gateway reachability"` (line 237), which runs on the leaked `"win32"` platform and may silently change its behavior.

The newly added test in `lifecycle.test.ts` already demonstrates the correct pattern using `try / finally`:

```ts
const originalPlatform = process.platform;
Object.defineProperty(process, "platform", { value: "win32", configurable: true });
try {
  // ... test body ...
} finally {
  Object.defineProperty(process, "platform", { value: originalPlatform, configurable: true });
}
```

Apply the same `try / finally` guard to both new tests here (line 213 and line 268).

How can I resolve this? If you propose a fix, please make it concise.

@vincentkoc vincentkoc added the triage: refactor-only Candidate: refactor/cleanup-only PR without maintainer context. label Apr 26, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 28, 2026

Codex review: found issues before merge.

Summary
The PR changes Windows gateway restart health checks, stale listener cleanup, listener classification, diagnostics, and related daemon/status/port tests.

Reproducibility: yes. by source inspection. On current main, win32 restart health can classify unknown listeners as stale PIDs and the restart path then terminates those returned PIDs; the loose classifier also treats any command text containing "openclaw" as a gateway listener.

Next step before merge
The PR is a useful implementation candidate but needs a maintainer-chosen rebase or replacement because it is dirty against main and touches process termination plus restart-probe trust without automation opt-in.

Security
Needs attention: The diff has no dependency, workflow, secret, install, or supply-chain changes, but the stale branch would weaken current restart-probe trust if rebased naively.

Review findings

  • [P2] Rebase over the current restart-health contract — src/cli/daemon-cli/restart-health.ts:19-30
  • [P2] Preserve the exact restart-probe close allowlist — src/cli/daemon-cli/restart-health.ts:55-66
  • [P3] Push the identity warning before logging it — src/cli/daemon-cli/lifecycle.ts:249-253
Review details

Best possible solution:

Land a rebased or replacement patch that removes unknown-listener termination from Windows restart recovery, tightens listener classification with current argv helpers, preserves current restart-health/version/plugin/channel gates, keeps the exact auth-close allowlist, and adds focused regression coverage plus a changelog entry.

Do we have a high-confidence way to reproduce the issue?

Yes, by source inspection. On current main, win32 restart health can classify unknown listeners as stale PIDs and the restart path then terminates those returned PIDs; the loose classifier also treats any command text containing "openclaw" as a gateway listener.

Is this the best way to solve the issue?

No, not as-is. The PR direction is useful, but the branch must be rebased or replaced so its listener-safety changes are layered onto the current restart-health contract and exact restart-probe close allowlist.

Full review comments:

  • [P2] Rebase over the current restart-health contract — src/cli/daemon-cli/restart-health.ts:19-30
    The branch's GatewayRestartSnapshot and reachability flow predate current main's expected-version, gateway-version, plugin/channel error, wait-outcome, and elapsed-time fields. Resolving this by taking the branch shape would drop restart/update health gates that now protect package restarts, so the listener diagnostics need to be layered onto the current contract instead.
    Confidence: 0.9
  • [P2] Preserve the exact restart-probe close allowlist — src/cli/daemon-cli/restart-health.ts:55-66
    This branch still treats broad 1008 close-reason substrings like auth, token, password, scope, and role as gateway reachability. Current main intentionally uses an exact allowlist so a local non-gateway listener cannot satisfy restart health with auth-looking text; the rebase needs to keep that policy.
    Confidence: 0.87
  • [P3] Push the identity warning before logging it — src/cli/daemon-cli/lifecycle.ts:249-253
    The PID identity warning is only appended to warnings in JSON mode. Keep it consistent with the other restart warnings by pushing it unconditionally, then logging it for human output when !json.
    Confidence: 0.78
  • [P3] Restore process.platform after Windows-specific tests — src/cli/daemon-cli/restart-health.test.ts:213-214
    The new restart-health tests override process.platform to win32 without restoring it, which can leak platform state into later tests and change their behavior. Use the try/finally restore pattern already used in the lifecycle test for both new overrides.
    Confidence: 0.84

Overall correctness: patch is incorrect
Overall confidence: 0.88

Security concerns:

  • [medium] Preserve restart-probe auth-close hardening — src/cli/daemon-cli/restart-health.ts:55
    Current main uses an exact policy-close allowlist so broad auth-looking close reasons from a local non-gateway listener do not make restart health pass. The PR branch predates that hardening and accepts broad substrings, so rebase or replacement work must keep the current allowlist.
    Confidence: 0.86

Acceptance criteria:

  • pnpm test -- src/cli/daemon-cli/restart-health.test.ts src/cli/daemon-cli/lifecycle.test.ts src/infra/ports-format.test.ts src/cli/daemon-cli/status.gather.test.ts
  • pnpm check:changed

What I checked:

  • Current main still passes the Windows unknown-listener fallback into restart health: runDaemonRestart() calls waitForGatewayHealthyRestart() with includeUnknownListenersAsStale: process.platform === "win32" before terminating returned staleGatewayPids. (src/cli/daemon-cli/lifecycle.ts:334, 103cdd9d96f8)
  • Current main can turn unknown Windows listeners into stale PIDs: inspectGatewayRestart() maps listeners classified as unknown into fallbackListenerPids when includeUnknownListenersAsStale is enabled on win32 and the runtime is not running. (src/cli/daemon-cli/restart-health.ts:359, 103cdd9d96f8)
  • Current main still uses loose OpenClaw substring listener classification: classifyPortListener() returns gateway whenever the normalized command text contains "openclaw". (src/infra/ports-format.ts:5, 103cdd9d96f8)
  • The PR branch predates the current restart-health contract: The branch GatewayRestartSnapshot only has probeResult, listenerKinds, and failureReason additions, while current main also carries expectedVersion, gatewayVersion, plugin/channel errors, waitOutcome, and elapsedMs. (src/cli/daemon-cli/restart-health.ts:19, d8520412d5b0)
  • The PR branch predates current auth-close hardening: The branch treats any 1008 close reason containing auth/token/password/scope/role as gateway reachability; current main has an exact allowlist with a spoofing comment. (src/cli/daemon-cli/restart-health.ts:55, d8520412d5b0)
  • GitHub reports the branch is not directly mergeable: The pull API reports mergeable=false, rebaseable=false, and mergeable_state="dirty" for head d852041. (d8520412d5b0)

Likely related people:

  • steipete: Recent commits repeatedly changed restart-health, Windows gateway lifecycle, package restart checks, and port-format helpers in the affected files. (role: recent maintainer and original Windows restart hardening contributor; confidence: high; commits: 5ea03efe92d6, be8a3617d92e, e648f38efcd0; files: src/cli/daemon-cli/restart-health.ts, src/cli/daemon-cli/lifecycle.ts, src/infra/ports-format.ts)
  • vincentkoc: Auth-close restart hardening and adjacent gateway restart/control work landed under this handle in the same source area. (role: recent restart-policy and daemon lifecycle maintainer; confidence: high; commits: c85065eb7f39, f6f8d74419a1, 60d4d5e1fa05; files: src/cli/daemon-cli/restart-health.ts, src/cli/daemon-cli/lifecycle.ts)
  • ProspectOre: The expected-version restart contract that this PR must preserve is tied to a recent restart-health commit by this author. (role: current restart-health contract contributor; confidence: medium; commits: b46ff081f74b; files: src/cli/daemon-cli/restart-health.ts)
  • tmimmanuel: A recent Windows scheduled task restart/install preservation commit touched the same restart-health behavior family. (role: Windows scheduled task restart behavior contributor; confidence: medium; commits: 0fef95b17d72; files: src/cli/daemon-cli/restart-health.ts)
  • NikolaFC: The current HEAD commit adds the safe restart coordinator and touches lifecycle restart handling near the affected path. (role: recent adjacent lifecycle owner; confidence: medium; commits: 103cdd9d96f8; files: src/cli/daemon-cli/lifecycle.ts)

Remaining risk / open question:

  • Native Windows restart behavior was not live-tested in this read-only review; the reproduction assessment is source-based.
  • The branch is dirty against current main, so correctness depends on the conflict resolution preserving newer restart-health and auth-close behavior.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 103cdd9d96f8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli CLI command changes size: M triage: refactor-only Candidate: refactor/cleanup-only PR without maintainer context.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants