Skip to content

Regression: Windows gateway probe/connect timeout returns in 2026.4.26 #74279

@joost-heijden

Description

@joost-heijden

This looks like a regression of the Windows gateway probe / connect-timeout class of failures that was previously marked fixed around 2026.3.22.

Related locked/closed issues:

Environment:

  • Windows: Microsoft Windows NT 10.0.26200.0
  • Install: npm global
  • OpenClaw latest tested: 2026.4.26
  • Stable workaround: 2026.4.23 (a979721)
  • Node runtime used by gateway: v22.22.2
  • Codex CLI: 0.125.0
  • Gateway: local, loopback, token auth, port 18789

Symptoms on 2026.4.26:

  • Gateway process starts and listens on 127.0.0.1:18789.
  • Logs show http server listening / gateway ready.
  • openclaw gateway probe --timeout 60000 reports loopback connect timeout / unreachable.
  • Gateway logs show handshake-timeout with lastFrameMethod:"connect".
  • This persisted with Telegram disabled and with minimal/near-zero plugins.
  • In contrast, Control UI/webchat and runtime RPC paths could still work, so gateway probe became an unreliable health source.
  • A watchdog relying on gateway probe repeatedly restarted an otherwise usable gateway.

Additional Windows-specific finding:

  • Bundled provider extensions anthropic and lmstudio attempted to load even though they were disabled in config.
  • They failed with ERR_PACKAGE_PATH_NOT_EXPORTED from their bundled @mariozechner/pi-ai dependency.
  • Those load failures appeared to block startup/handshake long enough to trigger more WS handshake timeouts.
  • Quarantining those two broken bundled extension directories plus using longer OPENCLAW_HANDSHAKE_TIMEOUT_MS / OPENCLAW_CONNECT_CHALLENGE_TIMEOUT_MS helped stabilize the local install.

Working workaround:

  • Downgrade/pin OpenClaw to 2026.4.23.
  • Use channels status --deep / channels status --probe as the watchdog health check instead of gateway probe, because gateway probe was a false negative.
  • Keep gateway handshake timeout env vars high on Windows.
  • Keep the broken bundled provider extensions out of the load path until their packaging/export issue is fixed.

Why I think this should be treated as a regression:

  • The previously fixed handshake/probe failure is reproducible again on a newer Windows build.
  • The strongest signal is the same contradiction described in prior issues: gateway/channel runtime healthy and Telegram working, while gateway probe reports timeout/unreachable.
  • There may be two interacting issues: the CLI/probe WS handshake race, plus disabled/broken provider extension loading causing event-loop stalls during gateway/connect handling.

Sanitized key signatures:

  • gateway ready
  • gateway probe: Local loopback ws://127.0.0.1:18789 -> Connect: failed - timeout
  • gateway log: handshake timeout ... lastFrameMethod=connect
  • provider load errors: anthropic/lmstudio failed to load ... ERR_PACKAGE_PATH_NOT_EXPORTED ... @mariozechner/pi-ai

I can provide more sanitized logs if useful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions