
Gateway HTTP/WS dispatch deadlock on Windows + Docker Desktop bind-mount setups (regression in 2026.4.24, persists in .25 and .26) #73874

@purpleant


Summary

On Windows host + Docker Desktop + bind-mounted ~/.openclaw/ setups, the gateway in 2026.4.24, 2026.4.25, and 2026.4.26 logs ready and binds the listener socket, but the request-handling dispatch is deadlocked. Both HTTP and internal-WebSocket transports accept TCP connections and never respond. Slack DMs are queued behind a stuck agent:main:main session and never delivered. Reproduces deterministically on two distinct bots (Bragi and Kvasir) with different histories. Fully working on 2026.4.23 with the same compose/config.

Reproduction environment

  • Host: Windows 11 Pro 26200, Docker Desktop (WSL2 backend)
  • Container image: ghcr.io/openclaw/openclaw:2026.4.26 (and .25, .24) extended via Dockerfile (FROM) with Playwright/Chromium/gh/gog CLI installs
  • Bind mount: ./config:/home/node/.openclaw from a Windows NTFS path
  • Container user: node (uid 1000); compose overrides user: "0:0" so that an entrypoint wrapper can fix up permissions and then run runuser -u node -- "$@"
  • Two bots tested:
    • "Bragi" — extensively used since 2026.3.x, accumulated state, 2.4 MB sessions.json with one large 88%-context-utilized session
    • "Kvasir" — clean state, came directly from 2026.4.23 to .26 with no intermediate migrations

Symptoms (identical on both bots)

After gateway ready:

| Probe | Behavior |
| --- | --- |
| curl http://127.0.0.1:18789/healthz | TCP connection accepted, never any HTTP response; times out at 8s with HTTP 000 |
| curl http://127.0.0.1:18789/ (gateway dashboard) | Same: TCP accept, no response |
| curl http://127.0.0.1:18789/__openclaw__/canvas/ | Same |
| curl http://127.0.0.1:18789/api/status | Same |
| openclaw gateway status --deep (WebSocket probe to ws://127.0.0.1:18789) | Same: timeout |
| openclaw plugins inspect <id> (CLI → gateway RPC) | Hangs |
| openclaw plugins doctor (CLI → gateway RPC) | Hangs / silent |
| codex exec directly (CLI, bypasses gateway) | Works; returns gpt-5.4 reply |
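The "TCP accepted, no HTTP response" rows can be told apart from a plain connection failure with a small probe. This is a minimal sketch using only the Python standard library; the function name `probe_http` and its return labels are made up for illustration and are not part of openclaw. The 8 second default mirrors the curl timeout used above.

```python
import socket


def probe_http(host: str, port: int, path: str = "/healthz", timeout: float = 8.0) -> str:
    """Classify one HTTP probe: connect failure, silent accept, or a status line."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or the connect itself timed out.
        return "connect_failed"
    with sock:
        sock.settimeout(timeout)
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}:{port}\r\n"
            "Connection: close\r\n\r\n"
        )
        try:
            sock.sendall(request.encode("ascii"))
            data = sock.recv(4096)
        except socket.timeout:
            # Handshake completed, but no bytes came back before the timeout.
            return "accepted_no_response"
        except OSError:
            return "connection_error"
    if not data:
        return "closed_without_response"
    # First line of the response, e.g. the HTTP status line.
    return data.split(b"\r\n", 1)[0].decode("latin-1")
```

Calling `probe_http("127.0.0.1", 18789)` inside the container should report `accepted_no_response` for the behavior described in the table, and a status line on a healthy gateway.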

Process state: the openclaw-gateway process is in state Sl (sleeping, multi-threaded) at 4–9% CPU, so it is neither idle nor CPU-spinning. All threads are in state S (sleeping). This is consistent with a stalled event loop.
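The per-thread states can be collected from /proc without extra tooling. The sketch below assumes a Linux /proc layout (`/proc/<pid>/task/<tid>/stat`); the helper names are hypothetical, and the `proc_root` parameter exists only so the function can be pointed at a test directory.

```python
from collections import Counter
from pathlib import Path


def parse_stat_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may contain spaces or ')',
    # so split after the LAST ')' instead of splitting on whitespace.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def thread_states(pid: int, proc_root: str = "/proc") -> Counter:
    """Count thread states (S, R, D, ...) for one process."""
    counts: Counter = Counter()
    for stat_path in Path(proc_root, str(pid), "task").glob("*/stat"):
        try:
            counts[parse_stat_state(stat_path.read_text())] += 1
        except (OSError, ValueError, IndexError):
            continue  # thread exited or the line was malformed
    return counts
```

A result where every thread is counted under `S` matches the process state reported above.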

Kernel TCP state for port 18789:

00000000:4965 -> 00000000:0000 [LISTEN]
0100007F:4965 -> 0100007F:C998 [CLOSE_WAIT]
0100007F:4965 -> 0100007F:E732 [CLOSE_WAIT]
0100007F:4965 -> 0100007F:E29C [CLOSE_WAIT]
... (one CLOSE_WAIT per probe attempt)

Each probe leaves a socket in CLOSE_WAIT: the server accepted the TCP connection, the peer eventually closed, and the server never called close() on its side. This is the classic signature of a Node handler awaiting a promise that never resolves.
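The accumulation can be tracked over time by counting socket states for the gateway port. The following is a rough sketch that parses the text of /proc/net/tcp (IPv4 only); the function name is an assumption, and only a subset of the kernel's state codes is mapped to names. Port 18789 appears as hex 4965 in the local address column.

```python
from collections import Counter

# Subset of the state codes the Linux kernel prints in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_states(proc_net_tcp: str, port: int) -> Counter:
    """Count TCP states for sockets whose LOCAL port equals `port`."""
    counts: Counter = Counter()
    for line in proc_net_tcp.splitlines():
        fields = line.split()
        try:
            # fields[1] is local "ADDR:PORT" in hex, fields[3] is the state code.
            local_port = int(fields[1].rsplit(":", 1)[1], 16)
            state = fields[3]
        except (IndexError, ValueError):
            continue  # header line or malformed row
        if local_port == port:
            counts[TCP_STATES.get(state, state)] += 1
    return counts
```

For example, `count_states(open("/proc/net/tcp").read(), 18789)` run inside the container after a few probes should show one `LISTEN` entry and a growing `CLOSE_WAIT` count if the behavior above is present.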

Cascading downstream symptoms

The dispatch deadlock causes:

  1. Slack provider stalls after logging "channels resolved"; the "socket mode connected" line never appears. The slack plugin needs an internal handshake to the gateway HTTP/WS path during init, and that handshake hangs.

  2. session-write-lock held for 200,000+ ms (max 15,000 ms expected): the agent grabs the lock to process a request, the model call hangs because of the upstream dispatch issue, and the lock stays held for minutes. New Slack DMs queue behind it indefinitely.

  3. stuck session: sessionId=main sessionKey=agent:main:main state=processing age=595s queueDepth=0. The same agent session sits "processing" for ~10 minutes, gets watchdog-released, and the next request immediately gets stuck the same way. Self-perpetuating.

  4. [ws] ⇄ res ✗ nativeHook.invoke errorCode=INVALID_REQUEST errorMessage=native hook relay not found — slack plugin tries to invoke a hook it registered on the gateway. Registration succeeded silently but the gateway's registry doesn't have it on lookup. Strongly suggests plugin registry mismatch / multiple registry instances.

  5. 5 plugin(s) failed to initialize (validation: anthropic, codex, memory-core, openai, slack) — sometimes appears after restart, sometimes doesn't. When it does, the codex agent harness isn't registered, so embedded agent requests fail with Requested agent harness "codex" is not registered and PI fallback is disabled. Even the fallback to anthropic fails because that plugin also failed validation. Inconsistent run-to-run.

  6. [skills] watcher error: EACCES: permission denied, watch '/home/node/.openclaw' (and stat of various subpaths) — the skills watcher subsystem can't traverse ~/.openclaw/ because it's drwx------ root:root (Windows-NTFS bind-mount default mode 0700 owned by uid 0). New behavior in 2026.4.x. Workaround: chown the dir to node before startup.

  7. [heartbeat] failed: EACCES: permission denied, mkdir '/home/node/.openclaw/workspace' — heartbeat subsystem tries to mkdir an already-existing bind-mount sub-mount. Same root cause.

  8. Plugin runtime-deps mirror-lock contention: on first 2026.4.24/.26 startup, plugin runtime deps install into ~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/. If the previous startup died holding the mirror-lock, subsequent startups wait 5 minutes per-plugin (300050ms timeout) for the lock and give up loading that plugin. Lock dir at ~/.openclaw/plugin-runtime-deps/<id>/.openclaw-runtime-mirror.lock/owner.json persists across container restarts. Have to manually rm -rf the lock dir to recover.

  9. Cannot find module '.../slack/pipeline.runtime-<hash>.js' — slack plugin's runtime-deps install reports success, but at least one bundle file is silently missing on first install. Eventually self-resolves on a later startup.

  10. openclaw doctor --fix writes openclaw.json with 0600 root:root perms when invoked via docker exec -t (which inherits compose user: "0:0"). The gateway then can't read its own config and enters a restart loop with "Missing config. Run openclaw setup or set gateway.mode=local". The file has to be chowned manually to recover.

  11. pending.json/paired.json parse-handle race: [gateway] parse/handle error: JsonFileReadError: Failed to read JSON file: ~/.openclaw/devices/pending.json fires every 30s. Files exist and are valid JSON. Probably racing with atomic-rename .tmp swap.

  12. 2026.4.24+ silently rewrites openclaw.json on first start: agents.defaults.model.primary from codex/gpt-5.4 to openai/gpt-5.4, adds openai plugin entry. Persists across rollback to .23 — manual revert needed. (Note: the openai provider in 2026.4.x does work with codex/ChatGPT OAuth via the agentRuntime: {id: "codex"} runtime, but the rewrite caught us off guard initially.)
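For item 8, leftover mirror-lock directories can be listed before startup rather than discovered through a 5 minute wait. This is a sketch only: the lock directory name comes from the path quoted in item 8, while the function name, the age-based staleness heuristic, and the 300 second default are assumptions. It does not read owner.json (its schema is not known here) and it deletes nothing.

```python
import time
from pathlib import Path

# Directory name as quoted in item 8 of this report.
LOCK_DIR_NAME = ".openclaw-runtime-mirror.lock"


def find_stale_locks(deps_root, max_age_s: float = 300.0, now=None):
    """Return lock directories under deps_root/<id>/ older than max_age_s seconds."""
    now = time.time() if now is None else now
    stale = []
    for lock_dir in sorted(Path(deps_root).glob(f"*/{LOCK_DIR_NAME}")):
        try:
            if not lock_dir.is_dir():
                continue
            age = now - lock_dir.stat().st_mtime
        except OSError:
            continue  # lock vanished or is unreadable; skip it
        if age > max_age_s:
            stale.append(lock_dir)
    return stale
```

Run against `~/.openclaw/plugin-runtime-deps` while the gateway is stopped, any path it returns is a candidate for the manual rm -rf described in item 8. Age alone cannot prove that the owning process is gone, so the list is meant for review, not automatic deletion.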

What works on 2026.4.23 with the same setup

  • /healthz returns HTTP 200 in ~20ms
  • All plugins load without validation failures
  • Slack socket-mode connects within ~30s of ready
  • Session-write-lock acquired/released in milliseconds
  • No nativeHook registry mismatches
  • No plugin-runtime-deps install needed (.23 doesn't use that mechanism)

Diagnostic data we collected

  • gateway process state (Sl/Rl, CPU %, thread count, all wchan=0 sleeping)
  • TCP socket state (LISTEN + N CLOSE_WAIT accumulating)
  • Stability bundles in ~/.openclaw/logs/stability/ (only one from a MODULE_NOT_FOUND during very first .24 attempt; nothing for the dispatch deadlocks themselves)
  • openclaw plugins list output (6 plugins enabled — most plugins are still 2026.4.25 in the .26 release, only cerebras/migrate-claude/qqbot bumped to .26)
  • Full container logs from multiple startup attempts

Happy to attach files / run additional diagnostics on request.

What I tried and what did/didn't help

| Step | Effect |
| --- | --- |
| chown -R node:node ~/.openclaw/{tasks,memory,flows,extensions,plugin-runtime-deps,node_modules} | Fixes lots of unrelated EACCES errors but does not fix dispatch deadlock |
| chown node:node ~/.openclaw (the bind-mount root, mode 0700) | Fixes [skills] watcher EACCES storm and [heartbeat] mkdir EACCES |
| rm -rf ~/.openclaw/plugin-runtime-deps/<id>/.openclaw-runtime-mirror.lock/ | Unblocks plugin loading after a stuck-lock startup |
| chown node:node ~/.openclaw/openclaw.json | Fixes "Missing config" restart-loop after openclaw doctor --fix |
| openclaw doctor --fix (interactive) | Migrates legacy embeddedHarness to agentRuntime, but writes config with bad perms (see above) |
| Renaming sessions.json aside | Caused a different startup hang; restoring fixed that |
| compose down && up --force-recreate | Doesn't help; same regression |
| Rolling back to 2026.4.23 (retag local image, restore openclaw.json from pre-update commit, wipe stale node_modules + plugin-runtime-deps) | Fully restores working state |
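Several of the chown steps above address the same class of problem, so it can help to list every path under the bind mount that is not owned by the container user before starting the gateway. The sketch below is illustrative; the function name is hypothetical, and uid 1000 for node is taken from the environment section of this report.

```python
import stat
from pathlib import Path


def audit_ownership(root, expected_uid: int):
    """Return (path, uid, mode) for root and everything below it not owned by expected_uid."""
    root = Path(root)
    findings = []
    for path in [root, *sorted(root.rglob("*"))]:
        try:
            st = path.lstat()  # lstat: report symlinks themselves, not their targets
        except OSError:
            continue  # entry vanished or cannot be examined
        if st.st_uid != expected_uid:
            findings.append((str(path), st.st_uid, stat.filemode(st.st_mode)))
    return findings
```

Running `audit_ownership("/home/node/.openclaw", 1000)` as root inside the container would be expected to surface the root-owned bind-mount root and openclaw.json cases described above. It only reports; any chown remains a manual step.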

Compose context

services:
  gateway:
    image: openclaw-bot:latest    # FROM ghcr.io/openclaw/openclaw:latest
    user: "0:0"
    entrypoint: ["/bin/sh", "/usr/local/bin/fix-codex-perms.sh", "docker-entrypoint.sh"]
    command: ["node", "dist/index.js", "gateway", "--bind", "lan", "--port", "18789"]
    volumes:
      - ./config:/home/node/.openclaw
      - ./workspace:/home/node/.openclaw/workspace
      # plus ./config/codex-config:/home/node/.codex etc.

fix-codex-perms.sh chowns ~/.codex/, ~/.openclaw/{tasks,memory,flows,cron,delivery-queue,node_modules,plugin-runtime-deps}, strips world-writable bits on ~/.openclaw/extensions/*/, then runuser -u node -- "$@".

Asks

  1. Is this dispatch deadlock a known issue you're tracking? It's been present in three consecutive releases (.24, .25, .26).
  2. Would adding ~/.openclaw/ itself (the bind-mount root) and ~/.openclaw/openclaw.json to whatever owns initial perm setup help, or is the breakage independent of that?
  3. Any way to enable verbose dispatcher logging that would capture why the dispatcher is stuck post-ready?
  4. Can you confirm whether the openai/gpt-5.4 provider is intended to work with ChatGPT OAuth (via codex runtime) or whether the auto-migration in .24+ should be conditional on actual API-key auth being present?
