[Bug]: Windows Scheduled Task gateway restart/health becomes inconsistent after ready

# [Bug]: Windows Scheduled Task gateway restart/health becomes inconsistent after ready; mixes known probe false negatives with cron/session stale state and post-ready HTTP loss

## Bug type

Regression (worked before, now fails)

## Beta release blocker

No

## Summary

On Windows with the gateway installed as a Scheduled Task, `openclaw gateway restart` can repeatedly time out with:

- `Timed out after 60s waiting for gateway port 18789 to become healthy`
- `Service runtime: status=unknown`
- `Port 18789 is already in use`

This environment appears to hit more than one problem at once:

1. A known local loopback probe false negative on restart (`ws ... code=1008 reason=connect failed` / `device-required`)
2. Cron/job/session state corruption after restart (`runningAtMs` / stale cron session state)
3. An additional post-`ready` instability where the gateway can log `ready (...)` and even bind `18789`, but `/health` and `/` later stop responding or the port becomes free again

I am filing this because the first two have close neighbors in existing issues/PRs, but I have not found a single Windows issue that covers the full combined behavior end-to-end.

## OpenClaw version

OpenClaw 2026.4.8 (`9ece252`)

## Operating system

Windows (PowerShell 5.1.22621.4249)

## Install method

npm global install + `openclaw gateway install` / Scheduled Task

## Model

`bailian/qwen3.5-plus`

## Provider / routing chain

Ali / Bailian

## Additional provider/model setup details

- Node.js upgraded to `v22.22.2`
- Repro observed both before and after upgrade from `2026.4.5` to `2026.4.8`
- Repro observed with normal config and also with external channels/providers largely disabled during bisecting

## Steps to reproduce

1. Install/run gateway on Windows via Scheduled Task
2. Have existing cron jobs in `~/.openclaw/cron/jobs.json`
3. Run `openclaw gateway restart`
4. Observe one or more of the following sequences:

Sequence A:
- CLI waits 60s and prints timeout
- log shows local WS probe closed with `1008` / `connect failed`
- gateway may actually already be alive

Sequence B:
- gateway reaches:
  - `starting HTTP server...`
  - `ready (... plugins, ...s)`
  - `cron: started`
- but `http://127.0.0.1:18789/health` and `/` later time out or the port becomes free again
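
To pin down when Sequence B flips from reachable to unreachable, a small poller that records only state changes can help. This is a minimal sketch, not part of OpenClaw: the function names `probe_health`, `transitions`, and `watch` are hypothetical, and it assumes any HTTP response from `/health` (even a non-2xx one) counts as "reachable".

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple


def probe_health(url: str = "http://127.0.0.1:18789/health", timeout: float = 3.0) -> bool:
    """Return True if the URL answers with any HTTP status before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with 2xx; it is still reachable.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused, timed out, reset, or a broken response: treat as unreachable.
        return False


def transitions(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, bool]]:
    """Keep only the samples where reachability differs from the previous sample."""
    changes = []
    previous = None
    for timestamp, reachable in samples:
        if reachable != previous:
            changes.append((timestamp, reachable))
            previous = reachable
    return changes


def watch(probe: Callable[[], bool], count: int, interval: float) -> List[Tuple[float, bool]]:
    """Poll `probe` `count` times, `interval` seconds apart, and return the transitions."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), probe()))
        time.sleep(interval)
    return transitions(samples)
```

Running `watch(probe_health, count=120, interval=5)` right after the `ready (...)` line appears would give a short list of timestamps to line up against the gateway log.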

Sequence C:
- after UI edits or a restart, cron jobs come back in a stale state
- previously seen local failures included `TypeError: Cannot read properties of undefined (reading 'runningAtMs')`
- stale `runningAtMs` / stale cron session state prevented clean recovery without manual intervention
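
The "impossible stale running state" in Sequence C can be illustrated with a small cleanup pass. This is only a sketch under stated assumptions: the real layout of `jobs.json` is not shown in this issue, so the per-job `state` object holding `runningAtMs` is assumed, and `clear_stale_running` is a hypothetical helper, not OpenClaw code. The idea is that a run marked as started before the current gateway process started cannot still be running.

```python
from typing import Any, Dict, List


def clear_stale_running(jobs: List[Dict[str, Any]], process_started_ms: int) -> List[Any]:
    """Drop `runningAtMs` markers that predate the current process start.

    Assumed (hypothetical) job shape: {"id": ..., "state": {"runningAtMs": ...}}.
    Returns the ids of the jobs whose marker was cleared.
    """
    cleared = []
    for job in jobs:
        state = job.get("state")
        if not isinstance(state, dict):
            # A missing state object is what would make a naive
            # `job.state.runningAtMs` read fail; normalise it to an empty dict.
            job["state"] = {}
            continue
        running_at = state.get("runningAtMs")
        if isinstance(running_at, (int, float)) and running_at < process_started_ms:
            # The run was marked as started by an earlier process; it cannot still be running.
            del state["runningAtMs"]
            cleared.append(job.get("id"))
    return cleared
```

Jobs whose marker is newer than the process start are left alone, so a run that really is in progress is not disturbed.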

## Expected behavior

- `openclaw gateway restart` should succeed when the restarted local gateway is already healthy enough to reject unauthenticated loopback probes
- Scheduled Task runtime and port ownership should stay consistent
- Cron startup should not preserve impossible stale running state
- Once the gateway logs `ready (...)`, `/health` and `/` should remain responsive instead of later hanging or disappearing

## Actual behavior

Observed across repeated runs on 2026-04-08 and 2026-04-09:

- `openclaw gateway restart` times out after 60s
- logs show loopback WS probe closure:
  - `code=1008 reason=connect failed`
  - `"cause":"device-required"`
- sometimes port `18789` is reported busy while runtime status is `unknown`
- sometimes gateway logs `ready (...)` and later port `18789` becomes free again
- sometimes `/health` is briefly reachable, then later times out
- cron previously failed with missing or stale `runningAtMs`-related state
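
Because the reported port state ("busy" vs "free") and the runtime status disagree across runs, an independent check of whether anything accepts TCP connections on the port is useful when collecting evidence. A minimal sketch follows; `is_listening` is a hypothetical helper name, and it only shows that something accepts connections, not which process owns the port.

```python
import socket


def is_listening(port: int = 18789, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port is accepted within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

Calling this alongside `openclaw gateway status --json` makes it easier to tell a stale status report from a listener that has really gone away.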

Representative log lines:

```text
2026-04-09T09:55:37.924+08:00 [gateway/ws] closed before connect ... code=1008 reason=connect failed
2026-04-09T10:02:48.014+08:00 Timed out after 60s waiting for gateway port 18789 to become healthy.
2026-04-09T10:02:48.045+08:00 Service runtime: status=unknown
2026-04-09T10:02:48.049+08:00 Gateway port 18789 status: free.
2026-04-09T10:05:23.293+08:00 [gateway] ready (0 plugins, 27.5s)
2026-04-09T10:05:28.021+08:00 [cron] cron: started
```
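
The gaps between these lines are easier to reason about when computed rather than read off by eye. The sketch below parses the timestamps from the excerpt above and prints the delay before each line; it assumes every line starts with an ISO 8601 timestamp followed by a space, and `parse_line` and `gaps` are hypothetical helper names.

```python
from datetime import datetime
from typing import List, Tuple

LOG_LINES = [
    "2026-04-09T09:55:37.924+08:00 [gateway/ws] closed before connect ... code=1008 reason=connect failed",
    "2026-04-09T10:02:48.014+08:00 Timed out after 60s waiting for gateway port 18789 to become healthy.",
    "2026-04-09T10:02:48.045+08:00 Service runtime: status=unknown",
    "2026-04-09T10:02:48.049+08:00 Gateway port 18789 status: free.",
    "2026-04-09T10:05:23.293+08:00 [gateway] ready (0 plugins, 27.5s)",
    "2026-04-09T10:05:28.021+08:00 [cron] cron: started",
]


def parse_line(line: str) -> Tuple[datetime, str]:
    """Split a log line into its timestamp and the remaining message."""
    stamp, _, message = line.partition(" ")
    return datetime.fromisoformat(stamp), message


def gaps(lines: List[str]) -> List[Tuple[float, str]]:
    """For each line after the first, return (seconds since previous line, message)."""
    parsed = [parse_line(line) for line in lines]
    return [
        ((current[0] - previous[0]).total_seconds(), current[1])
        for previous, current in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    for seconds, message in gaps(LOG_LINES):
        print(f"+{seconds:9.3f}s  {message}")
```

On this excerpt, the `ready` line comes roughly 155 s after the port was reported free. Whether those lines belong to the same restart attempt cannot be determined from the excerpt alone.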

## Related issues / likely overlap

- `#48771` and PR `#48801`: Windows/local restart false negative when loopback WS probe is closed with `1008` / `connect failed` / `device required`
- `#44920`: stale cron `runningAtMs` after restart
- `#59511`: local `http://127.0.0.1:18789/health` not usable after gateway run
- `#60295`: different OS, but similar “restart times out while service state/port ownership is inconsistent”

## What I found during local debugging

I did substantial local debugging because the affected machine is in production use and was stuck:

- upgraded OpenClaw from `2026.4.5` to `2026.4.8`
- upgraded Node.js to `22.22.2`
- isolated/remediated several local issues:
  - old incompatible channel config fields after upgrade
  - untracked local plugin auto-loading
  - stale cron job/session state
- after that cleanup, the remaining issue was still reproducible:
  - gateway reaches `ready (...)`
  - HTTP health/UI later become unreachable or unstable

I also locally patched the CLI to treat loopback HTTP `/health` and local `1008` policy closes as healthy enough for restart probing, which reduced one class of false negatives, but did not eliminate the post-`ready` instability.
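
The actual local patch is not reproduced here. The following is only a sketch of the decision rule it describes, with hypothetical names (`ProbeResult`, `is_alive_for_restart`): a loopback probe counts as "alive" if `/health` answered over HTTP, or if the WebSocket was closed with code `1008`. In RFC 6455, `1008` is the "policy violation" close code, so a peer that sends it is running and actively rejecting the connection.

```python
from dataclasses import dataclass
from typing import Optional

WS_POLICY_VIOLATION = 1008  # RFC 6455 close code for a policy violation


@dataclass
class ProbeResult:
    """Outcome of one loopback probe (hypothetical structure for illustration)."""
    http_health_ok: bool = False          # loopback GET /health got a response
    ws_close_code: Optional[int] = None   # close code of the WS probe, if it was closed
    ws_close_reason: str = ""             # e.g. "connect failed"


def is_alive_for_restart(result: ProbeResult) -> bool:
    """Treat an HTTP health answer or a 1008 policy close as 'the gateway is up'."""
    if result.http_health_ok:
        return True
    # A policy close means a live server refused the unauthenticated probe.
    return result.ws_close_code == WS_POLICY_VIOLATION
```

A rule like this only addresses the restart false negative; it says nothing about whether the gateway stays responsive after `ready (...)`.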

That suggests there may still be a deeper Windows gateway/runtime bug after startup, beyond the already-known restart probe issue.

## Impact and severity

High for Windows users relying on Scheduled Task mode:

- restart automation becomes unreliable
- control UI availability becomes inconsistent
- cron jobs can be left in broken/stale state after restart cycles
- users may see a mixture of “service is up”, “service is unknown”, and “port is free” across the same debugging session

## Logs, screenshots, and evidence

I can provide:

- full `openclaw-2026-04-08.log` / `openclaw-2026-04-09.log`
- `openclaw gateway restart` terminal output
- `openclaw gateway status --json` output from both healthy and unhealthy moments
- details of the stale cron/session state observed in `~/.openclaw/cron/jobs.json` and session index cleanup

## Additional information

If helpful, I can also split this into narrower follow-up issues, each with a repro focused on only one of the following:

1. Windows Scheduled Task + restart probe false negative
2. Cron stale `runningAtMs` / session state after restart
3. Post-`ready` HTTP hang / port disappearance

I have kept them together here because on this machine they appeared stacked together.


[Bug]: Windows Scheduled Task gateway restart/health becomes inconsistent after ready #63491
