Gateway HTTP/WS dispatch deadlock on Windows + Docker Desktop bind-mount setups (regression in 2026.4.24, persists in .25 and .26)

# Gateway HTTP/WS dispatch deadlock on Windows + Docker Desktop bind-mount setups (regression in 2026.4.24, persists in .25 and .26)

## Summary

On Windows host + Docker Desktop + bind-mounted `~/.openclaw/` setups, the gateway in 2026.4.24, 2026.4.25, and 2026.4.26 logs `ready` and binds the listener socket, but the **request-handling dispatch is deadlocked**. Both HTTP and internal-WebSocket transports accept TCP connections and never respond. Slack DMs are queued behind a stuck `agent:main:main` session and never delivered. **Reproduces deterministically on two distinct bots (Bragi and Kvasir) with different histories.** Fully working on 2026.4.23 with the same compose/config.

## Reproduction environment

- **Host**: Windows 11 Pro 26200, Docker Desktop (WSL2 backend)
- **Container image**: `ghcr.io/openclaw/openclaw:2026.4.26` (and .25, .24) extended via Dockerfile (`FROM`) with Playwright/Chromium/gh/gog CLI installs
- **Bind mount**: `./config:/home/node/.openclaw` from a Windows NTFS path
- **Container user**: `node` (uid 1000); compose overrides `user: "0:0"` for an entrypoint wrapper that runs `runuser -u node -- "$@"` after fix-up perms
- **Two bots tested**:
  - "Bragi" — extensively used since 2026.3.x, accumulated state, 2.4 MB sessions.json with one large 88%-context-utilized session
  - "Kvasir" — clean state, came directly from 2026.4.23 to .26 with no intermediate migrations

## Symptoms (identical on both bots)

After `gateway ready`:

| Probe | Behavior |
|---|---|
| `curl http://127.0.0.1:18789/healthz` | TCP connection accepted, **never any HTTP response** — times out at 8s with `HTTP 000` |
| `curl http://127.0.0.1:18789/` (gateway dashboard) | Same — TCP accept, no response |
| `curl http://127.0.0.1:18789/__openclaw__/canvas/` | Same |
| `curl http://127.0.0.1:18789/api/status` | Same |
| `openclaw gateway status --deep` (WebSocket probe to `ws://127.0.0.1:18789`) | **Same — timeout** |
| `openclaw plugins inspect <id>` (CLI → gateway RPC) | Hangs |
| `openclaw plugins doctor` (CLI → gateway RPC) | Hangs / silent |
| `codex exec` directly (CLI, bypasses gateway) | **Works** — returns gpt-5.4 reply |

Process state: `openclaw-gateway` PID is `Sl` (sleeping, multi-threaded), 4–9% CPU. Not idle, not CPU-spinning. All threads `S` (sleeping). Event-loop deadlock signature.

Kernel TCP state for port 18789:
```
00000000:4965 -> 00000000:0000 [LISTEN]
0100007F:4965 -> 0100007F:C998 [CLOSE_WAIT]
0100007F:4965 -> 0100007F:E732 [CLOSE_WAIT]
0100007F:4965 -> 0100007F:E29C [CLOSE_WAIT]
... (one CLOSE_WAIT per probe attempt)
```
Each probe leaves a CLOSE_WAIT — server accepted the TCP connection, peer eventually closed, server never `close()`'d its side. Classic Node `await`-never-resolves signature.

## Cascading downstream symptoms

The dispatch deadlock causes:

1. **Slack provider stalls after `channels resolved`** with no `socket mode connected` line. The slack plugin needs an internal handshake to the gateway HTTP/WS path during init, and that handshake hangs.

2. **`session-write-lock` held for 200,000+ ms (max 15,000 ms expected)** — the agent grabs the lock to process a request, model call hangs because of the upstream dispatch issue, lock held for minutes. New Slack DMs queue behind it indefinitely.

3. **`stuck session: sessionId=main sessionKey=agent:main:main state=processing age=595s queueDepth=0`** — the same agent session sits "processing" for ~10 minutes, gets watchdog-released, and the next request immediately stucks the same way. Self-perpetuating.

4. **`[ws] ⇄ res ✗ nativeHook.invoke errorCode=INVALID_REQUEST errorMessage=native hook relay not found`** — slack plugin tries to invoke a hook it registered on the gateway. Registration succeeded silently but the gateway's registry doesn't have it on lookup. Strongly suggests **plugin registry mismatch / multiple registry instances**.

5. **`5 plugin(s) failed to initialize (validation: anthropic, codex, memory-core, openai, slack)`** — sometimes appears after restart, sometimes doesn't. When it does, the codex agent harness isn't registered, so embedded agent requests fail with `Requested agent harness "codex" is not registered and PI fallback is disabled`. Even the fallback to anthropic fails because that plugin also failed validation. Inconsistent run-to-run.

6. **`[skills] watcher error: EACCES: permission denied, watch '/home/node/.openclaw'`** (and `stat` of various subpaths) — the skills watcher subsystem can't traverse `~/.openclaw/` because it's `drwx------ root:root` (Windows-NTFS bind-mount default mode 0700 owned by uid 0). New behavior in 2026.4.x. Workaround: chown the dir to node before startup.

7. **`[heartbeat] failed: EACCES: permission denied, mkdir '/home/node/.openclaw/workspace'`** — heartbeat subsystem tries to mkdir an already-existing bind-mount sub-mount. Same root cause.

8. **Plugin runtime-deps mirror-lock contention**: on first 2026.4.24/.26 startup, plugin runtime deps install into `~/.openclaw/plugin-runtime-deps/openclaw-<version>-<hash>/`. If the previous startup died holding the mirror-lock, subsequent startups wait 5 minutes per-plugin (300050ms timeout) for the lock and give up loading that plugin. Lock dir at `~/.openclaw/plugin-runtime-deps/<id>/.openclaw-runtime-mirror.lock/owner.json` persists across container restarts. Have to manually `rm -rf` the lock dir to recover.

9. **`Cannot find module '.../slack/pipeline.runtime-<hash>.js'`** — slack plugin's runtime-deps install reports success, but at least one bundle file is silently missing on first install. Eventually self-resolves on a later startup.

10. **`openclaw doctor --fix` writes `openclaw.json` with `0600 root:root` perms** when invoked via `docker exec -t` (which inherits compose `user: "0:0"`). The gateway then can't read its own config and fails restart loop with "Missing config. Run `openclaw setup` or set gateway.mode=local". Have to chown the file manually to recover.

11. **`pending.json`/`paired.json` parse-handle race**: `[gateway] parse/handle error: JsonFileReadError: Failed to read JSON file: ~/.openclaw/devices/pending.json` fires every 30s. Files exist and are valid JSON. Probably racing with atomic-rename `.tmp` swap.

12. **2026.4.24+ silently rewrites `openclaw.json`** on first start: `agents.defaults.model.primary` from `codex/gpt-5.4` to `openai/gpt-5.4`, adds `openai` plugin entry. Persists across rollback to .23 — manual revert needed. (Note: the `openai` provider in 2026.4.x does work with codex/ChatGPT OAuth via the `agentRuntime: {id: "codex"}` runtime, but the rewrite caught us off guard initially.)

## What works on 2026.4.23 with the same setup

- `/healthz` returns HTTP 200 in ~20ms
- All plugins load without validation failures
- Slack socket-mode connects within ~30s of `ready`
- Session-write-lock acquired/released in milliseconds
- No `nativeHook` registry mismatches
- No plugin-runtime-deps install needed (.23 doesn't use that mechanism)

## Diagnostic data we collected

- gateway process state (`Sl`/`Rl`, CPU %, thread count, all `wchan=0` sleeping)
- TCP socket state (LISTEN + N CLOSE_WAIT accumulating)
- Stability bundles in `~/.openclaw/logs/stability/` (only one from a `MODULE_NOT_FOUND` during very first .24 attempt; nothing for the dispatch deadlocks themselves)
- `openclaw plugins list` output (6 plugins enabled — most plugins are still 2026.4.25 in the .26 release, only cerebras/migrate-claude/qqbot bumped to .26)
- Full container logs from multiple startup attempts

Happy to attach files / run additional diagnostics on request.

## What I tried and what did/didn't help

| Step | Effect |
|---|---|
| `chown -R node:node ~/.openclaw/{tasks,memory,flows,extensions,plugin-runtime-deps,node_modules}` | Fixes lots of unrelated EACCES errors but **does not fix dispatch deadlock** |
| `chown node:node ~/.openclaw` (the bind-mount root, mode 0700) | Fixes `[skills]` watcher EACCES storm and `[heartbeat]` mkdir EACCES |
| `rm -rf ~/.openclaw/plugin-runtime-deps/<id>/.openclaw-runtime-mirror.lock/` | Unblocks plugin loading after a stuck-lock startup |
| `chown node:node ~/.openclaw/openclaw.json` | Fixes "Missing config" restart-loop after `openclaw doctor --fix` |
| `openclaw doctor --fix` (interactive) | Migrates legacy `embeddedHarness` → `agentRuntime`, but writes config with bad perms (see above) |
| Renaming `sessions.json` aside | Caused a different startup hang; restoring fixed that |
| `compose down && up --force-recreate` | Doesn't help — same regression |
| Rolling back to 2026.4.23 (retag local image, restore `openclaw.json` from pre-update commit, wipe stale `node_modules` + `plugin-runtime-deps`) | **Fully restores working state** |

## Compose context

```yaml
services:
  gateway:
    image: openclaw-bot:latest    # FROM ghcr.io/openclaw/openclaw:latest
    user: "0:0"
    entrypoint: ["/bin/sh", "/usr/local/bin/fix-codex-perms.sh", "docker-entrypoint.sh"]
    command: ["node", "dist/index.js", "gateway", "--bind", "lan", "--port", "18789"]
    volumes:
      - ./config:/home/node/.openclaw
      - ./workspace:/home/node/.openclaw/workspace
      # plus ./config/codex-config:/home/node/.codex etc.
```

`fix-codex-perms.sh` chowns `~/.codex/`, `~/.openclaw/{tasks,memory,flows,cron,delivery-queue,node_modules,plugin-runtime-deps}`, strips world-writable bits on `~/.openclaw/extensions/*/`, then `runuser -u node -- "$@"`.

## Asks

1. Is this dispatch deadlock a known issue you're tracking? It's been present in three consecutive releases (.24, .25, .26).
2. Would adding `~/.openclaw/` itself (the bind-mount root) and `~/.openclaw/openclaw.json` to whatever owns initial perm setup help, or is the breakage independent of that?
3. Any way to enable verbose dispatcher logging that would capture *why* the dispatcher is stuck post-`ready`?
4. Can you confirm whether the `openai/gpt-5.4` provider is intended to work with ChatGPT OAuth (via codex runtime) or whether the auto-migration in .24+ should be conditional on actual API-key auth being present?


Probe	Behavior
`curl http://127.0.0.1:18789/healthz`	TCP connection accepted, never any HTTP response — times out at 8s with `HTTP 000`
`curl http://127.0.0.1:18789/` (gateway dashboard)	Same — TCP accept, no response
`curl http://127.0.0.1:18789/__openclaw__/canvas/`	Same
`curl http://127.0.0.1:18789/api/status`	Same
`openclaw gateway status --deep` (WebSocket probe to `ws://127.0.0.1:18789`)	Same — timeout
`openclaw plugins inspect <id>` (CLI → gateway RPC)	Hangs
`openclaw plugins doctor` (CLI → gateway RPC)	Hangs / silent
`codex exec` directly (CLI, bypasses gateway)	Works — returns gpt-5.4 reply

Step	Effect
`chown -R node:node ~/.openclaw/{tasks,memory,flows,extensions,plugin-runtime-deps,node_modules}`	Fixes lots of unrelated EACCES errors but does not fix dispatch deadlock
`chown node:node ~/.openclaw` (the bind-mount root, mode 0700)	Fixes `[skills]` watcher EACCES storm and `[heartbeat]` mkdir EACCES
`rm -rf ~/.openclaw/plugin-runtime-deps/<id>/.openclaw-runtime-mirror.lock/`	Unblocks plugin loading after a stuck-lock startup
`chown node:node ~/.openclaw/openclaw.json`	Fixes "Missing config" restart-loop after `openclaw doctor --fix`
`openclaw doctor --fix` (interactive)	Migrates legacy `embeddedHarness` → `agentRuntime`, but writes config with bad perms (see above)
Renaming `sessions.json` aside	Caused a different startup hang; restoring fixed that
`compose down && up --force-recreate`	Doesn't help — same regression
Rolling back to 2026.4.23 (retag local image, restore `openclaw.json` from pre-update commit, wipe stale `node_modules` + `plugin-runtime-deps`)	Fully restores working state

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Gateway HTTP/WS dispatch deadlock on Windows + Docker Desktop bind-mount setups (regression in 2026.4.24, persists in .25 and .26) #73874

Gateway HTTP/WS dispatch deadlock on Windows + Docker Desktop bind-mount setups (regression in 2026.4.24, persists in .25 and .26)

Summary

Reproduction environment

Symptoms (identical on both bots)

Cascading downstream symptoms

What works on 2026.4.23 with the same setup

Diagnostic data we collected

What I tried and what did/didn't help

Compose context

Asks

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Gateway HTTP/WS dispatch deadlock on Windows + Docker Desktop bind-mount setups (regression in 2026.4.24, persists in .25 and .26) #73874

Description

Gateway HTTP/WS dispatch deadlock on Windows + Docker Desktop bind-mount setups (regression in 2026.4.24, persists in .25 and .26)

Summary

Reproduction environment

Symptoms (identical on both bots)

Cascading downstream symptoms

What works on 2026.4.23 with the same setup

Diagnostic data we collected

What I tried and what did/didn't help

Compose context

Asks

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions