[Bug]: Windows Scheduled Task gateway restart/health becomes inconsistent after ready #63491

@ufomaker

[Bug]: Windows Scheduled Task gateway restart/health becomes inconsistent after ready; mixes known probe false negatives with cron/session stale state and post-ready HTTP loss

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

On Windows with the gateway installed as a Scheduled Task, openclaw gateway restart can repeatedly time out with:

  • Timed out after 60s waiting for gateway port 18789 to become healthy
  • Service runtime: status=unknown
  • Port 18789 is already in use

This environment appears to hit more than one problem at once:

  1. A known local loopback probe false negative on restart (ws ... code=1008 reason=connect failed / device-required)
  2. Cron/job/session state corruption after restart (runningAtMs / stale cron session state)
  3. An additional post-ready instability where the gateway can log ready (...) and even bind 18789, but /health and / later stop responding or the port becomes free again

I am filing this because the first two have close neighbors in existing issues/PRs, but I have not found a single Windows issue that covers the full combined behavior end-to-end.

OpenClaw version

OpenClaw 2026.4.8 (9ece252)

Operating system

Windows (PowerShell 5.1.22621.4249)

Install method

npm global install + openclaw gateway install / Scheduled Task

Model

bailian/qwen3.5-plus

Provider / routing chain

Ali / Bailian

Additional provider/model setup details

  • Node.js upgraded to v22.22.2
  • Repro observed both before and after upgrade from 2026.4.5 to 2026.4.8
  • Repro observed with normal config and also with external channels/providers largely disabled during bisecting

Steps to reproduce

  1. Install/run gateway on Windows via Scheduled Task
  2. Have existing cron jobs in ~/.openclaw/cron/jobs.json
  3. Run openclaw gateway restart
  4. Observe one or more of the following sequences:

Sequence A:

  • CLI waits 60s and prints timeout
  • log shows local WS probe closed with 1008 / connect failed
  • gateway may actually already be alive

Sequence B:

  • gateway reaches:
    • starting HTTP server...
    • ready (... plugins, ...s)
    • cron: started
  • but http://127.0.0.1:18789/health and / later time out or the port becomes free again

Sequence C:

  • after UI edits or a restart, cron jobs come back in a stale state
  • previously seen local failures included TypeError: Cannot read properties of undefined (reading 'runningAtMs')
  • stale runningAtMs / stale cron session state prevented clean recovery without manual intervention

Expected behavior

  • openclaw gateway restart should succeed when the restarted local gateway is already healthy enough to reject unauthenticated loopback probes
  • Scheduled Task runtime and port ownership should stay consistent
  • Cron startup should not preserve impossible stale running state
  • Once the gateway logs ready (...), /health and / should remain responsive instead of later hanging or disappearing
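The third expectation (no impossible stale running state after startup) could be sketched roughly as below. This is only an illustration, not OpenClaw's actual cron code: the function name and the assumption that each job keeps `runningAtMs` under an optional `state` object are hypothetical, since the real layout of ~/.openclaw/cron/jobs.json is not shown in this report. The idea is that a freshly started process cannot have any job in flight, so any leftover `runningAtMs` is stale, and a job with no `state` at all must not crash the loader.

```python
import copy
from typing import Any


def clear_stale_running_state(
    jobs: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], int]:
    """Return a cleaned copy of `jobs` plus the number of stale markers removed.

    ASSUMPTION: each job is a dict that may carry a "state" dict with a
    "runningAtMs" field. The field name comes from the report; its location
    in the file is a guess made for illustration.
    """
    cleaned = copy.deepcopy(jobs)  # never mutate the caller's data
    cleared = 0
    for job in cleaned:
        state = job.get("state")
        # Tolerate jobs with a missing or malformed state instead of raising.
        if isinstance(state, dict) and state.get("runningAtMs") is not None:
            state.pop("runningAtMs")
            cleared += 1
    return cleaned, cleared
```

Run once at cron startup, before the scheduler begins, a pass like this would leave every job schedulable again without manual edits.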

Actual behavior

Observed across repeated runs on 2026-04-08 and 2026-04-09:

  • openclaw gateway restart times out after 60s
  • logs show loopback WS probe closure:
    • code=1008 reason=connect failed
    • "cause":"device-required"
  • sometimes port 18789 is reported busy while runtime status is unknown
  • sometimes gateway logs ready (...) and later port 18789 becomes free again
  • sometimes /health is briefly reachable, then later times out
  • cron previously failed with missing or stale runningAtMs-related state
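To pin down when the port and /health flip after ready, a small external watcher can help. The sketch below is a minimal example using only the Python standard library; it is not part of OpenClaw, and the function names, intervals, and timeouts are arbitrary choices. It samples the TCP port and the /health URL from this report and keeps only the moments where the observed state changes.

```python
import http.client
import socket
import time
import urllib.request
from typing import Iterable, Iterator, Tuple

Sample = Tuple[float, bool, bool]  # (unix time, port open?, /health returned 200?)


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_ok(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if GET /health answers with HTTP 200 within the timeout."""
    url = f"http://{host}:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, HTTP errors, malformed replies.
        return False


def sample(host: str = "127.0.0.1", port: int = 18789,
           interval: float = 5.0, count: int = 12) -> Iterator[Sample]:
    """Take `count` samples, `interval` seconds apart."""
    for _ in range(count):
        yield time.time(), port_is_open(host, port), health_ok(host, port)
        time.sleep(interval)


def state_changes(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Yield only samples whose (port, health) pair differs from the previous one."""
    previous = None
    for ts, is_open, healthy in samples:
        if (is_open, healthy) != previous:
            previous = (is_open, healthy)
            yield ts, is_open, healthy
```

Iterating over `state_changes(sample())` right after a restart and printing each tuple gives a compact timeline that can be matched against the gateway log.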

Representative log lines:

2026-04-09T09:55:37.924+08:00 [gateway/ws] closed before connect ... code=1008 reason=connect failed
2026-04-09T10:02:48.014+08:00 Timed out after 60s waiting for gateway port 18789 to become healthy.
2026-04-09T10:02:48.045+08:00 Service runtime: status=unknown
2026-04-09T10:02:48.049+08:00 Gateway port 18789 status: free.
2026-04-09T10:05:23.293+08:00 [gateway] ready (0 plugins, 27.5s)
2026-04-09T10:05:28.021+08:00 [cron] cron: started
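The timestamps above can be lined up mechanically. The following is a small sketch, not an OpenClaw tool, that parses the six excerpted lines verbatim and measures the gap between the CLI timeout and the later ready line; the helper names are made up for this example. Whether those two lines belong to the same restart attempt cannot be determined from the excerpt alone.

```python
from datetime import datetime
from typing import List, Tuple

LOG_LINES = """\
2026-04-09T09:55:37.924+08:00 [gateway/ws] closed before connect ... code=1008 reason=connect failed
2026-04-09T10:02:48.014+08:00 Timed out after 60s waiting for gateway port 18789 to become healthy.
2026-04-09T10:02:48.045+08:00 Service runtime: status=unknown
2026-04-09T10:02:48.049+08:00 Gateway port 18789 status: free.
2026-04-09T10:05:23.293+08:00 [gateway] ready (0 plugins, 27.5s)
2026-04-09T10:05:28.021+08:00 [cron] cron: started
"""

Event = Tuple[datetime, str]


def parse_line(line: str) -> Event:
    """Split 'ISO-timestamp message' into a timezone-aware datetime and the message."""
    stamp, _, message = line.partition(" ")
    return datetime.fromisoformat(stamp), message


def first_time(events: List[Event], needle: str) -> datetime:
    """Timestamp of the first event whose message contains `needle`."""
    return next(ts for ts, message in events if needle in message)


events = [parse_line(line) for line in LOG_LINES.strip().splitlines()]
timeout_at = first_time(events, "Timed out after 60s")
ready_at = first_time(events, "[gateway] ready")
gap_seconds = (ready_at - timeout_at).total_seconds()
print(gap_seconds)  # 155.279
```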

Related issues / likely overlap

  • #48771 and PR #48801: Windows/local restart false negative when loopback WS probe is closed with 1008 / connect failed / device required
  • #44920: stale cron runningAtMs after restart
  • #59511: local http://127.0.0.1:18789/health not usable after gateway run
  • #60295: different OS, but similar “restart times out while service state/port ownership is inconsistent”

What I found during local debugging

I did substantial local debugging because the machine was stuck in production use:

  • upgraded OpenClaw from 2026.4.5 to 2026.4.8
  • upgraded Node.js to 22.22.2
  • isolated/remediated several local issues:
    • old incompatible channel config fields after upgrade
    • untracked local plugin auto-loading
    • stale cron job/session state
  • after that cleanup, the remaining issue was still reproducible:
    • gateway reaches ready (...)
    • HTTP health/UI later become unreachable or unstable

I also locally patched the CLI to treat loopback HTTP /health and local 1008 policy closes as healthy enough for restart probing, which reduced one class of false negatives, but did not eliminate the post-ready instability.

That suggests there may still be a deeper Windows gateway/runtime bug after startup, beyond the already-known restart probe issue.
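The relaxed probe rule described above can be expressed as a small decision function. The sketch below illustrates the idea only; it is not the patch applied locally and not OpenClaw's CLI code, and `ProbeResult` and its fields are hypothetical names. It treats a loopback probe as healthy enough when /health answers 200 or when the WebSocket was closed with 1008 (the policy-violation close code), since a gateway that rejects an unauthenticated probe is by definition up and listening.

```python
from dataclasses import dataclass
from typing import Optional

LOOPBACK_HOSTS = {"127.0.0.1", "::1", "localhost"}
WS_POLICY_VIOLATION = 1008  # WebSocket close code for a policy violation


@dataclass
class ProbeResult:
    """Outcome of one restart probe (hypothetical shape for this example)."""
    host: str
    http_health_status: Optional[int] = None  # None: /health gave no response
    ws_close_code: Optional[int] = None       # None: no WS close frame was seen


def loopback_probe_is_healthy_enough(result: ProbeResult) -> bool:
    """Apply the relaxed rule; it only ever accepts loopback targets."""
    if result.host not in LOOPBACK_HOSTS:
        return False  # remote targets keep whatever strict check already exists
    if result.http_health_status == 200:
        return True
    # A 1008 close means the server accepted the socket and then refused
    # the client by policy, so the process is alive and serving.
    return result.ws_close_code == WS_POLICY_VIOLATION
```

A rule like this only removes the false negative on restart; as noted above, it does nothing for a gateway that stops answering after it has logged ready.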

Impact and severity

High for Windows users relying on Scheduled Task mode:

  • restart automation becomes unreliable
  • control UI availability becomes inconsistent
  • cron jobs can be left in broken/stale state after restart cycles
  • users may see a mixture of “service is up”, “service is unknown”, and “port is free” across the same debugging session

Logs, screenshots, and evidence

I can provide:

  • full openclaw-2026-04-08.log / openclaw-2026-04-09.log
  • openclaw gateway restart terminal output
  • openclaw gateway status --json output from both healthy and unhealthy moments
  • details of the stale cron/session state observed in ~/.openclaw/cron/jobs.json and session index cleanup

Additional information

If helpful, I can also open narrower follow-up issues, each with a repro focused on only one of:

  1. Windows Scheduled Task + restart probe false negative
  2. Cron stale runningAtMs / session state after restart
  3. Post-ready HTTP hang / port disappearance

On this machine the three appeared stacked together, which is why they are reported in one issue here.
