[Bug]: CLI commands hang at WebSocket gateway handshake #68944

@WaMaSeDu

CLI Commands Hang at WebSocket Gateway Handshake

Bug Description

Every CLI subcommand that requires a live gateway connection hangs indefinitely. The CLI successfully connects to the gateway WebSocket endpoint and receives the connect.challenge nonce, but then never sends the connect.reply — the command hangs until killed.

Commands that bypass the gateway transport (--version, --help) work correctly.

Environment

  • OpenClaw: 2026.4.15 (041266a)
  • Node: v24.13.1
  • NPM: 11.10.1
  • OS: Windows Server 2022 x64 (Windows_NT 10.0.20348)
  • Platform: VPS-760407
  • Gateway: Running as scheduled task under SYSTEM account
  • Auth: token + device identity (~/.openclaw/identity/)

Affected Commands

Command                    Result     Latency
openclaw --version         ✅ OK      375ms
openclaw --help            ✅ OK      114ms
openclaw sessions          ❌ HANG    8000ms+
openclaw cron list         ❌ HANG    8000ms+
openclaw status            ❌ HANG    8300ms+
openclaw doctor            ❌ HANG    8000ms+
openclaw channels list     ❌ HANG    8200ms+
openclaw models list       ❌ HANG    8200ms+
openclaw agents list       ❌ HANG    8200ms+
openclaw gateway status    ❌ HANG    8200ms+
openclaw backup list       ❌ HANG    6200ms+
openclaw tui               ❌ HANG    6200ms+

Root Cause Tracing

Using a WebSocket client with the auth token to trace the connection flow:

  1. CLI spawns node process → connects to ws://localhost:18789/gateway
  2. Gateway sends connect.challenge event with nonce ✅
  3. CLI receives the challenge ✅
  4. CLI hangs — never sends connect.reply with signed nonce ❌
  5. Gateway waits indefinitely, command times out

The CLI receives the nonce but appears to fail silently when generating/sending the signed reply. This blocks the entire command.
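
To make step 4 concrete, here is a minimal sketch of a reply builder that logs every failure instead of swallowing it. This is not the OpenClaw CLI's actual code: the function name buildChallengeReply, the connect.reply payload fields (nonce, signature) and the base64 encoding are assumptions for illustration. Only the connect.challenge frame shape is taken from the trace in this report.

```javascript
const crypto = require("node:crypto");

// Hypothetical helper: turns a raw connect.challenge frame into a
// connect.reply frame. The reply field names are assumptions, not the real protocol.
function buildChallengeReply(rawFrame, privateKey, log = console.error) {
  let frame;
  try {
    frame = JSON.parse(rawFrame);
  } catch (err) {
    log(`[handshake] unparseable frame: ${err.message}`);
    throw err;
  }
  if (frame.type !== "event" || frame.event !== "connect.challenge") {
    return null; // not a challenge, nothing to reply to
  }
  const nonce = frame.payload && frame.payload.nonce;
  if (typeof nonce !== "string" || nonce.length === 0) {
    const err = new Error("connect.challenge has no nonce");
    log(`[handshake] ${err.message}`);
    throw err;
  }
  try {
    // ECDSA over SHA-256 with the device private key (DER-encoded signature).
    const signature = crypto.sign("sha256", Buffer.from(nonce, "utf8"), privateKey);
    return {
      type: "event",
      event: "connect.reply",
      payload: { nonce, signature: signature.toString("base64") },
    };
  } catch (err) {
    // Log before rethrowing so a signing failure cannot become a silent hang.
    log(`[handshake] signing failed: ${err.message}`);
    throw err;
  }
}
```

If the real signing path is shaped anything like this, a try/catch that drops the error without rethrowing or logging would produce exactly the observed symptom: challenge received, no reply sent, no output.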

What Was Tried

  1. CLI reinstall: npm uninstall -g openclaw + npm install -g openclaw@2026.4.15
    • Result: Same behavior, issue persists
  2. Gateway restart via scheduled task (Stop-ScheduledTask + Start-ScheduledTask)
    • Result: Same behavior after restart
  3. WebSocket trace using ws library with HMAC-SHA256 signed nonce
    • Result: Gateway rejects HMAC-signed connections, confirming the CLI should use device keypair (EC P-256) signing
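
The HMAC rejection in item 3 is what one would expect from any verifier that only holds the device public key. The sketch below illustrates this with a throwaway P-256 keypair standing in for identity/device.json and a placeholder token. Both are assumptions, and the gateway's real verification code is not shown in this report.

```javascript
const crypto = require("node:crypto");

// Returns true only if `signature` verifies against the device public key.
function verifiesWithDeviceKey(nonce, signature, publicKey) {
  try {
    return crypto.verify("sha256", Buffer.from(nonce, "utf8"), publicKey, signature);
  } catch {
    return false; // treat a malformed signature as a failed check
  }
}

// Throwaway keypair standing in for the real device identity (assumption).
const { privateKey, publicKey } = crypto.generateKeyPairSync("ec", {
  namedCurve: "prime256v1", // OpenSSL's name for P-256
});
const nonce = "b5e9bb7c-acc1-4d0f-957e-f29b3964e112"; // nonce from the trace in this report

// What the trace script sent: a MAC keyed with a shared secret (placeholder here).
const hmacSig = crypto.createHmac("sha256", "placeholder-token").update(nonce).digest();
// What a device-key client would send: an ECDSA signature.
const ecdsaSig = crypto.sign("sha256", Buffer.from(nonce, "utf8"), privateKey);

console.log("HMAC-SHA256 accepted:", verifiesWithDeviceKey(nonce, hmacSig, publicKey));
console.log("ECDSA P-256 accepted:", verifiesWithDeviceKey(nonce, ecdsaSig, publicKey));
```

The HMAC output is a 32-byte MAC, not an ECDSA signature, so the first check fails and the second succeeds. This only shows why the trace script's reply could not have been accepted. It says nothing about why the CLI itself never replies.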

Device Identity Status

  • identity/device.json — intact, valid EC keypair
  • identity/device-auth.json — intact, operator token with correct scopes
  • Device ID: 20512fa1c7948b21355756331b40ee88c8e3d27b033ca619b2620d4995cb48f

Possible Causes

  1. Node.js v24.x crypto change — v24.13.1 may have changed behavior in crypto.sign() / crypto.verify() for EC P-256 keys
  2. CLI device signing silently fails — the signing code path returns an error that is swallowed, causing the CLI to never send the reply
  3. CLI respawn transport issue — the CLI respawns itself with modified NODE_OPTIONS; the respawned process may have a different crypto state
  4. Stale nonce rejection — gateway may be rejecting nonces from the current CLI process as replayed or malformed
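
Cause 1 can be checked outside OpenClaw with a standalone script. The sketch below round-trips sign and verify on a fresh P-256 key under the running Node version, in both signature encodings Node supports for ECDSA. The function name ecP256SelfTest is made up for this report, and which encoding the CLI actually uses is not known.

```javascript
const crypto = require("node:crypto");

// Round-trips sign/verify on a fresh P-256 key in both ECDSA signature encodings.
function ecP256SelfTest() {
  const { privateKey, publicKey } = crypto.generateKeyPairSync("ec", {
    namedCurve: "prime256v1",
  });
  const data = Buffer.from(crypto.randomUUID(), "utf8"); // nonce-shaped payload
  const results = {};
  for (const dsaEncoding of ["der", "ieee-p1363"]) {
    try {
      const sig = crypto.sign("sha256", data, { key: privateKey, dsaEncoding });
      const ok = crypto.verify("sha256", data, { key: publicKey, dsaEncoding }, sig);
      results[dsaEncoding] = { ok, bytes: sig.length };
    } catch (err) {
      // Record the error instead of throwing so both encodings are always reported.
      results[dsaEncoding] = { ok: false, error: err.message };
    }
  }
  return { node: process.version, results };
}

console.log(JSON.stringify(ecP256SelfTest(), null, 2));
```

If both encodings report ok: true on v24.13.1, a general crypto regression becomes unlikely and causes 2 and 3 deserve more attention. Running the same script with the NODE_OPTIONS that the respawned CLI process uses (not captured in this report) would also probe cause 3.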

Logs

# WebSocket trace client output (custom ws script, HMAC-SHA256 signing)
Connecting to ws://localhost:18789/gateway with token...
OPEN
Received: {"type":"event","event":"connect.challenge","payload":{"nonce":"b5e9bb7c-acc1-4d0f-957e-f29b3964e112","ts":1776586770161}}
Got nonce: b5e9bb7c-acc1-4d0f-957e-f29b3964e112
Signing with HMAC-SHA256...
Sent reply
[5s timeout] — never receives connect.ack

# Gateway log (last entry before CLI test)
2026-04-09T12:34:09.320+02:00 [gateway] log file: \tmp\openclaw\openclaw-2026-04-09.log
2026-04-09T12:34:09.327+02:00 [gateway] starting channels and sidecars...
2026-04-09T12:34:11.255+02:00 [hooks] loaded 4 internal hook handlers
2026-04-09T12:34:11.264+02:00 [plugins] embedded acpx runtime backend registered
2026-04-09T12:34:11.346+02:00 [browser] control listening on http://127.0.0.1:18791/
2026-04-09T12:34:16.681+02:00 [plugins] embedded acpx runtime backend ready

Gateway log does not show the CLI connection attempt — the challenge is sent but no corresponding connection is logged as established.

Impact

  • All CLI subcommands that interact with the gateway are unusable
  • Cannot inspect sessions, cron jobs, channel status, model configs
  • Automated scripts that rely on CLI fail
  • Gateway health (HTTP) is fine — only the WebSocket transport is affected
  • Rubble Board and direct file reads continue to work as a workaround

Workaround

  • Direct file reads of JSON state files (sessions, cron configs) work in milliseconds
  • Gateway HTTP endpoints respond correctly (/health in 6ms)
  • Rubble Board provides a web UI that bypasses the broken CLI transport

Suggestion

  1. Add logging to the CLI's challenge reply generation — the signing failure is currently silent
  2. Add a timeout on the gateway side for pending challenges — this prevents a stuck CLI from holding resources indefinitely
  3. Consider an HMAC fallback if device key signing fails: the gateway currently rejects HMAC-signed replies, which makes debugging harder
  4. Add CLI-level tracing for the WS handshake (for example, openclaw --log-level trace sessions) to expose exactly where it hangs
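
For suggestion 2, a gateway-side expiry could look roughly like the sketch below. The class name PendingChallenges, its methods, and the 5000 ms TTL are hypothetical and not taken from the gateway source. Time is passed in explicitly so the logic can be exercised without real timers.

```javascript
// Hypothetical tracker for challenges the gateway has issued but not yet seen answered.
class PendingChallenges {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.issuedAt = new Map(); // nonce -> issue time in ms
  }

  // Record that a challenge with this nonce was sent.
  issue(nonce, now = Date.now()) {
    this.issuedAt.set(nonce, now);
  }

  // Returns true if the nonce was pending and unexpired; consumes it either way.
  resolve(nonce, now = Date.now()) {
    const issued = this.issuedAt.get(nonce);
    this.issuedAt.delete(nonce);
    return issued !== undefined && now - issued <= this.ttlMs;
  }

  // Removes and returns nonces older than the TTL so their sockets can be closed.
  sweep(now = Date.now()) {
    const expired = [];
    for (const [nonce, issued] of this.issuedAt) {
      if (now - issued > this.ttlMs) {
        expired.push(nonce);
        this.issuedAt.delete(nonce);
      }
    }
    return expired;
  }
}

const pending = new PendingChallenges(5000);
pending.issue("nonce-a", 0);
pending.issue("nonce-b", 4000);
// At t=6000, nonce-a is 6000 ms old (expired) and nonce-b is 2000 ms old (still pending).
console.log(pending.sweep(6000));
```

A periodic sweep would let the gateway close the socket and log "challenge expired" for each returned nonce, which would also give the gateway log the connection-attempt entry it currently lacks.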

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:auth-provider: Auth, provider routing, model choice, or SecretRef resolution may break.
  • impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
  • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.