Gateway HTTP/WS dispatch deadlock on Windows + Docker Desktop bind-mount setups (regression in 2026.4.24, persists in .25 and .26)
Summary
On Windows host + Docker Desktop + bind-mounted ~/.openclaw/ setups, the gateway in 2026.4.24, 2026.4.25, and 2026.4.26 logs ready and binds the listener socket, but the request-handling dispatch is deadlocked. Both HTTP and internal-WebSocket transports accept TCP connections and never respond. Slack DMs are queued behind a stuck agent:main:main session and never delivered. Reproduces deterministically on two distinct bots (Bragi and Kvasir) with different histories. Fully working on 2026.4.23 with the same compose/config.
Reproduction environment
- Host: Windows 11 Pro 26200, Docker Desktop (WSL2 backend)
- Container image: ghcr.io/openclaw/openclaw:2026.4.26 (and .25, .24) extended via Dockerfile (FROM) with Playwright/Chromium/gh/gog CLI installs
- Bind mount: ./config:/home/node/.openclaw from a Windows NTFS path
- Container user: node (uid 1000); compose overrides user: "0:0" for an entrypoint wrapper that runs runuser -u node -- "$@" after fix-up perms
- Two bots tested:
  - "Bragi": extensively used since 2026.3.x, accumulated state, 2.4 MB sessions.json with one large 88%-context-utilized session
  - "Kvasir": clean state, came directly from 2026.4.23 to .26 with no intermediate migrations
Symptoms (identical on both bots)
After gateway ready:
| Probe | Behavior |
| --- | --- |
| curl http://127.0.0.1:18789/healthz | TCP connection accepted, never any HTTP response; times out at 8s with HTTP 000 |
| curl http://127.0.0.1:18789/ (gateway dashboard) | Same: TCP accept, no response |
| curl http://127.0.0.1:18789/__openclaw__/canvas/ | Same |
| curl http://127.0.0.1:18789/api/status | Same |
| openclaw gateway status --deep (WebSocket probe to ws://127.0.0.1:18789) | Same: timeout |
| openclaw plugins inspect <id> (CLI → gateway RPC) | Hangs |
| openclaw plugins doctor (CLI → gateway RPC) | Hangs / silent |
| codex exec directly (CLI, bypasses gateway) | Works: returns gpt-5.4 reply |
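To separate "TCP accepted but no HTTP response" from a plain refused connection, a probe can classify the outcomes explicitly. The following is a minimal Python sketch, not part of OpenClaw: the function name `probe` and the result labels are my own, and the 8-second default only mirrors the curl timeout used above.

```python
import socket


def probe(host: str, port: int, path: str = "/healthz", timeout: float = 8.0) -> str:
    """Classify an HTTP probe as 'unreachable', 'accepted_no_response',
    'closed_without_response', or 'responded'. Labels are illustrative."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, host unreachable, or connect timeout.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}:{port}\r\n"
            "Connection: close\r\n\r\n"
        )
        try:
            sock.sendall(request.encode("ascii"))
            data = sock.recv(1024)
        except socket.timeout:
            # The TCP handshake completed, but no bytes came back in time.
            return "accepted_no_response"
        except OSError:
            # For example, the peer reset the connection.
            return "closed_without_response"
    return "responded" if data else "closed_without_response"
```

Calling `probe("127.0.0.1", 18789)` from inside the container on an affected version would be expected to return `accepted_no_response`, matching the table; on 2026.4.23 it should return `responded`.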
Process state: openclaw-gateway PID is Sl (sleeping, multi-threaded), 4–9% CPU. Not idle, not CPU-spinning. All threads S (sleeping). Event-loop deadlock signature.
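The per-thread states can be tallied straight from procfs instead of eyeballing ps output. This is a small sketch assuming a Linux /proc layout; the helper names `parse_state` and `thread_states` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line.

    The comm field is wrapped in parentheses and may itself contain spaces
    or ')', so split on the last ')' before reading the state field.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def thread_states(pid: int, proc_root: str = "/proc") -> Counter:
    """Tally thread states for one process by reading task/<tid>/stat."""
    task_dir = Path(proc_root) / str(pid) / "task"
    return Counter(
        parse_state((tid / "stat").read_text()) for tid in task_dir.iterdir()
    )
```

A result where every thread is in state `S` while CPU usage stays low and non-zero is consistent with the sleeping-but-not-idle picture described above.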
Kernel TCP state for port 18789:
00000000:4965 -> 00000000:0000 [LISTEN]
0100007F:4965 -> 0100007F:C998 [CLOSE_WAIT]
0100007F:4965 -> 0100007F:E732 [CLOSE_WAIT]
0100007F:4965 -> 0100007F:E29C [CLOSE_WAIT]
... (one CLOSE_WAIT per probe attempt)
Each probe leaves a CLOSE_WAIT — server accepted the TCP connection, peer eventually closed, server never close()'d its side. Classic Node await-never-resolves signature.
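The listing above is a readable rendering; the raw /proc/net/tcp file encodes addresses as hex and the state as a hex code (0A for LISTEN, 08 for CLOSE_WAIT). A short sketch for counting states on one port follows. The function names are my own, and the little-endian address decoding assumes an x86-64 or similar little-endian host.

```python
import socket
import struct
from typing import Dict, Tuple

# Subset of the kernel's TCP state codes as printed in /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "08": "CLOSE_WAIT", "0A": "LISTEN"}


def decode_addr(hex_addr: str) -> Tuple[str, int]:
    """Decode an IPv4 'ADDR:PORT' pair as printed in /proc/net/tcp."""
    ip_hex, port_hex = hex_addr.split(":")
    # The address is the raw 32-bit value printed in host byte order
    # (little-endian here); the port is plain hex.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def count_states(proc_net_tcp: str, port: int) -> Dict[str, int]:
    """Count sockets per TCP state whose local port matches `port`."""
    counts: Dict[str, int] = {}
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        _, local_port = decode_addr(fields[1])
        if local_port != port:
            continue
        state = TCP_STATES.get(fields[3], fields[3])
        counts[state] = counts.get(state, 0) + 1
    return counts
```

Feeding it the contents of /proc/net/tcp with `port=18789` after a few probes should show one LISTEN entry plus a CLOSE_WAIT count that grows by one per probe. Sockets bound over IPv6 appear in /proc/net/tcp6, which this sketch does not parse.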
Cascading downstream symptoms
The dispatch deadlock causes:
- Slack provider stalls after channels resolved with no socket mode connected line. The slack plugin needs an internal handshake to the gateway HTTP/WS path during init, and that handshake hangs.
- session-write-lock held for 200,000+ ms (max 15,000 ms expected): the agent grabs the lock to process a request, the model call hangs because of the upstream dispatch issue, and the lock is held for minutes. New Slack DMs queue behind it indefinitely.
- stuck session: sessionId=main sessionKey=agent:main:main state=processing age=595s queueDepth=0 (log line). The same agent session sits "processing" for ~10 minutes, gets watchdog-released, and the next request immediately gets stuck the same way. Self-perpetuating.
- [ws] ⇄ res ✗ nativeHook.invoke errorCode=INVALID_REQUEST errorMessage=native hook relay not found (log line). The slack plugin tries to invoke a hook it registered on the gateway. Registration succeeded silently, but the gateway's registry doesn't have it on lookup. Strongly suggests a plugin registry mismatch / multiple registry instances.
- 5 plugin(s) failed to initialize (validation: anthropic, codex, memory-core, openai, slack) (log line). Sometimes appears after restart, sometimes doesn't. When it does, the codex agent harness isn't registered, so embedded agent requests fail with Requested agent harness "codex" is not registered and PI fallback is disabled. Even the fallback to anthropic fails because that plugin also failed validation. Inconsistent run-to-run.
- [skills] watcher error: EACCES: permission denied, watch '/home/node/.openclaw' (and stat of various subpaths). The skills watcher subsystem can't traverse ~/.openclaw/ because it's drwx------ root:root (Windows-NTFS bind-mount default mode 0700 owned by uid 0). New behavior in 2026.4.x. Workaround: chown the dir to node before startup.
- [heartbeat] failed: EACCES: permission denied, mkdir '/home/node/.openclaw/workspace' (log line). The heartbeat subsystem tries to mkdir an already-existing bind-mount sub-mount. Same root cause as the watcher error.
- Plugin runtime-deps mirror-lock contention: on first 2026.4.24/.26 startup, plugin runtime deps install into ~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/. If the previous startup died holding the mirror-lock, subsequent startups wait 5 minutes per plugin (300050ms timeout) for the lock and then give up loading that plugin. The lock dir at ~/.openclaw/plugin-runtime-deps/<id>/.openclaw-runtime-mirror.lock/ (containing owner.json) persists across container restarts. It has to be removed manually with rm -rf to recover.
- Cannot find module '.../slack/pipeline.runtime-<hash>.js' (log line). The slack plugin's runtime-deps install reports success, but at least one bundle file is silently missing on first install. Eventually self-resolves on a later startup.
- openclaw doctor --fix writes openclaw.json with 0600 root:root perms when invoked via docker exec -t (which inherits compose user: "0:0"). The gateway then can't read its own config and enters a restart loop with "Missing config. Run openclaw setup or set gateway.mode=local". The file has to be chowned manually to recover.
- pending.json/paired.json parse-handle race: [gateway] parse/handle error: JsonFileReadError: Failed to read JSON file: ~/.openclaw/devices/pending.json fires every 30s. The files exist and are valid JSON. Probably racing with an atomic-rename .tmp swap.
- 2026.4.24+ silently rewrites openclaw.json on first start: agents.defaults.model.primary changes from codex/gpt-5.4 to openai/gpt-5.4, and an openai plugin entry is added. The change persists across rollback to .23, so a manual revert is needed. (Note: the openai provider in 2026.4.x does work with codex/ChatGPT OAuth via the agentRuntime: {id: "codex"} runtime, but the rewrite caught us off guard initially.)
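For the mirror-lock item in the list above, leftover lock directories can be listed before startup rather than discovered after a five-minute wait. This is a rough sketch under stated assumptions: the lock directory name and its location one level below plugin-runtime-deps come from the paths in this report, `find_mirror_locks` is a hypothetical helper, and owner.json is deliberately not parsed because its schema is not known to me. Directory mtime is only a hint of staleness, not proof that the owner is dead.

```python
import time
from pathlib import Path
from typing import List, Optional, Tuple

LOCK_DIR_NAME = ".openclaw-runtime-mirror.lock"  # name as seen in this report


def find_mirror_locks(
    deps_root: Path, now: Optional[float] = None
) -> List[Tuple[Path, float]]:
    """Return (lock_dir, age_seconds) for each mirror-lock directory found
    one level below deps_root. Age is derived from the directory mtime."""
    now = time.time() if now is None else now
    found = []
    for lock_dir in sorted(deps_root.glob(f"*/{LOCK_DIR_NAME}")):
        if lock_dir.is_dir():
            found.append((lock_dir, now - lock_dir.stat().st_mtime))
    return found
```

Running it against /home/node/.openclaw/plugin-runtime-deps while the gateway is stopped and getting a non-empty result means a previous startup left a lock behind. The sketch only reports; deleting the directory stays a manual decision.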
What works on 2026.4.23 with the same setup
- /healthz returns HTTP 200 in ~20ms
- All plugins load without validation failures
- Slack socket-mode connects within ~30s of ready
- Session-write-lock acquired/released in milliseconds
- No nativeHook registry mismatches
- No plugin-runtime-deps install needed (.23 doesn't use that mechanism)
Diagnostic data we collected
- Gateway process state (Sl/Rl, CPU %, thread count, all wchan=0 sleeping)
- TCP socket state (LISTEN + N CLOSE_WAIT accumulating)
- Stability bundles in ~/.openclaw/logs/stability/ (only one, from a MODULE_NOT_FOUND during the very first .24 attempt; nothing for the dispatch deadlocks themselves)
- openclaw plugins list output (6 plugins enabled; most plugins are still 2026.4.25 in the .26 release, only cerebras/migrate-claude/qqbot bumped to .26)
- Full container logs from multiple startup attempts
Happy to attach files / run additional diagnostics on request.
What I tried and what did/didn't help
| Step | Effect |
| --- | --- |
| chown -R node:node ~/.openclaw/{tasks,memory,flows,extensions,plugin-runtime-deps,node_modules} | Fixes lots of unrelated EACCES errors but does not fix dispatch deadlock |
| chown node:node ~/.openclaw (the bind-mount root, mode 0700) | Fixes [skills] watcher EACCES storm and [heartbeat] mkdir EACCES |
| rm -rf ~/.openclaw/plugin-runtime-deps/<id>/.openclaw-runtime-mirror.lock/ | Unblocks plugin loading after a stuck-lock startup |
| chown node:node ~/.openclaw/openclaw.json | Fixes "Missing config" restart-loop after openclaw doctor --fix |
| openclaw doctor --fix (interactive) | Migrates legacy embeddedHarness → agentRuntime, but writes config with bad perms (see above) |
| Renaming sessions.json aside | Caused a different startup hang; restoring fixed that |
| compose down && up --force-recreate | Doesn't help; same regression |
| Rolling back to 2026.4.23 (retag local image, restore openclaw.json from pre-update commit, wipe stale node_modules + plugin-runtime-deps) | Fully restores working state |
Compose context
services:
  gateway:
    image: openclaw-bot:latest  # FROM ghcr.io/openclaw/openclaw:latest
    user: "0:0"
    entrypoint: ["/bin/sh", "/usr/local/bin/fix-codex-perms.sh", "docker-entrypoint.sh"]
    command: ["node", "dist/index.js", "gateway", "--bind", "lan", "--port", "18789"]
    volumes:
      - ./config:/home/node/.openclaw
      - ./workspace:/home/node/.openclaw/workspace
      # plus ./config/codex-config:/home/node/.codex etc.
fix-codex-perms.sh chowns ~/.codex/, ~/.openclaw/{tasks,memory,flows,cron,delivery-queue,node_modules,plugin-runtime-deps}, strips world-writable bits on ~/.openclaw/extensions/*/, then runuser -u node -- "$@".
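Since the wrapper does not touch the bind-mount root or openclaw.json, a preflight check that reports which paths the node user cannot use may help narrow things down. The sketch below is hypothetical and deliberately simplified: it applies only the classic owner/group/other mode bits, ignores supplementary groups and ACLs, and assumes gid 1000 for node (the report only states uid 1000).

```python
import os
import stat
from typing import List


def can_access(st: os.stat_result, uid: int, gid: int, need: int) -> bool:
    """Approximate the owner/group/other permission check for one path.

    `need` is an rwx mask (4 = read, 2 = write, 1 = execute/traverse).
    """
    if uid == 0:
        return True  # simplification: treat root as bypassing mode bits
    mode = stat.S_IMODE(st.st_mode)
    if st.st_uid == uid:
        bits = (mode >> 6) & 7
    elif st.st_gid == gid:
        bits = (mode >> 3) & 7
    else:
        bits = mode & 7
    return (bits & need) == need


def preflight(paths: List[str], uid: int = 1000, gid: int = 1000) -> List[str]:
    """Return the paths that uid/gid could not read (files) or
    read and traverse (directories)."""
    blocked = []
    for path in paths:
        st = os.stat(path)
        need = 5 if stat.S_ISDIR(st.st_mode) else 4
        if not can_access(st, uid, gid, need):
            blocked.append(path)
    return blocked
```

Run as root (for example from the entrypoint wrapper before runuser), `preflight(["/home/node/.openclaw", "/home/node/.openclaw/openclaw.json"])` would be expected to return both paths in the drwx------ root:root and 0600 root:root situations described earlier, and an empty list after the chown workarounds.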
Asks
- Is this dispatch deadlock a known issue you're tracking? It's been present in three consecutive releases (.24, .25, .26).
- Would adding ~/.openclaw/ itself (the bind-mount root) and ~/.openclaw/openclaw.json to whatever owns initial perm setup help, or is the breakage independent of that?
- Any way to enable verbose dispatcher logging that would capture why the dispatcher is stuck post-ready?
- Can you confirm whether the openai/gpt-5.4 provider is intended to work with ChatGPT OAuth (via codex runtime) or whether the auto-migration in .24+ should be conditional on actual API-key auth being present?