Discord inbound worker repeatedly times out after 1800s while gateway is still running #71948

@McoreD

Summary

I'm seeing Discord replies fail with:

Discord inbound worker timed out.

In the service logs this appears as:

2026-04-26T10:04:19+08:00 [discord] inbound worker timed out after 1800 seconds (channelId=1497687704866000906, messageId=1497772780744347771)
2026-04-26T12:06:09+08:00 [discord] inbound worker timed out after 1800 seconds (channelId=1489624037758861415, messageId=1497803443287625800)
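For anyone trying to correlate these entries with the rest of the journal, here is a rough Python sketch that pulls the fields out of the two lines above. The regex only covers the exact format quoted here, and the "approximate start" is an assumption on my part: it treats the 1800 second clock as starting when the worker picked up the message, which I have not confirmed.

```python
import re
from datetime import datetime, timedelta

# Matches only the timeout line format quoted in this report.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\S+) \[discord\] inbound worker timed out after (?P<secs>\d+) seconds "
    r"\(channelId=(?P<channel>\d+), messageId=(?P<message>\d+)\)$"
)

def parse_timeout(line: str):
    """Return a dict describing a timeout line, or None if the line does not match."""
    m = TIMEOUT_RE.match(line.strip())
    if m is None:
        return None
    timed_out_at = datetime.fromisoformat(m["ts"])
    secs = int(m["secs"])
    return {
        "timed_out_at": timed_out_at,
        "timeout_seconds": secs,
        "channel_id": m["channel"],
        "message_id": m["message"],
        # Assumption: the timer started when the worker picked up the message.
        "approx_started_at": timed_out_at - timedelta(seconds=secs),
    }

LOG_LINES = [
    "2026-04-26T10:04:19+08:00 [discord] inbound worker timed out after 1800 seconds "
    "(channelId=1497687704866000906, messageId=1497772780744347771)",
    "2026-04-26T12:06:09+08:00 [discord] inbound worker timed out after 1800 seconds "
    "(channelId=1489624037758861415, messageId=1497803443287625800)",
]

events = [e for e in map(parse_timeout, LOG_LINES) if e is not None]
for e in events:
    print(e["message_id"], e["approx_started_at"].isoformat(), "->", e["timed_out_at"].isoformat())
```

The resulting window (approximate start to timeout) is the range I would search in the journal for the related errors.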

The gateway process itself remains active, but Discord inbound work appears to get stuck long enough to hit the 30-minute default timeout. Around the same periods I also see gateway/local worker health symptoms such as websocket handshake timeouts, subagent announce timeouts, session locks, qmd timeouts, and context overflow recovery.

What I expected

A Discord inbound message should either complete, fail with the underlying agent/model error, or surface enough diagnostic context to tell which internal worker/session is stuck.

If this timeout is expected behavior, it would help if the user-facing Discord reply included the agent/session/run id or a clearer reason than just "Discord inbound worker timed out."

What actually happens

The Discord channel gets a generic timeout reply after 1800 seconds.

Nearby logs show related pressure/errors:

[ws] handshake timeout ... peer=127.0.0.1:...->127.0.0.1:18789
Subagent announce failed: Error: gateway timeout after 10000ms
[session-write-lock] releasing lock held for 66843ms / 71211ms / 97029ms
[memory] qmd embed failed ... timed out after 600000ms
[agent/embedded] [context-overflow-diag] ... Context overflow: estimated context size exceeds safe threshold during tool loop.

There were also Discord gateway reconnect/session churn events in the same general window:

[discord] gateway error: Error: socket hang up
[discord] gateway: Gateway websocket closed: 1006
[discord] gateway: Gateway reconnect scheduled ... (invalid-session, resume=false)
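To get a feel for which of these symptoms dominates a given window, the log excerpts above can be bucketed by substring. This is only a sketch: the substrings are copied from the lines quoted in this report, the category names are my own labels, and the sample lines keep the "..." elisions exactly as quoted.

```python
from collections import Counter

# Substrings copied from the excerpts in this report; category names are illustrative labels.
SIGNATURES = {
    "ws_handshake_timeout": "[ws] handshake timeout",
    "subagent_announce_timeout": "Subagent announce failed",
    "session_lock_held": "[session-write-lock] releasing lock held for",
    "qmd_timeout": "qmd embed failed",
    "context_overflow": "[context-overflow-diag]",
    "discord_socket_error": "[discord] gateway error",
    "discord_ws_closed": "Gateway websocket closed",
    "discord_reconnect": "Gateway reconnect scheduled",
}

def classify(line: str) -> str:
    """Return the first matching category for a log line, or 'other'."""
    for category, needle in SIGNATURES.items():
        if needle in line:
            return category
    return "other"

def summarize(lines) -> Counter:
    """Count log lines per category."""
    return Counter(classify(line) for line in lines)

SAMPLE = [
    "[ws] handshake timeout ... peer=127.0.0.1:...->127.0.0.1:18789",
    "Subagent announce failed: Error: gateway timeout after 10000ms",
    "[memory] qmd embed failed ... timed out after 600000ms",
    "[discord] gateway error: Error: socket hang up",
    "[discord] gateway: Gateway websocket closed: 1006",
    "[discord] gateway: Gateway reconnect scheduled ... (invalid-session, resume=false)",
]

for category, count in summarize(SAMPLE).most_common():
    print(category, count)
```

Feeding it the journal lines from the window before each inbound worker timeout would show whether, say, session locks or qmd timeouts cluster there.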

System/config context

  • OpenClaw: 2026.4.24 (46d2415)
  • OS: LMDE 6 / Debian kernel 6.1.0-44-amd64
  • Node: v24.15.0
  • npm: 11.12.1
  • Codex CLI: 0.120.0
  • Running as user systemd service: openclaw-gateway.service
  • Gateway mode: local loopback, port 18789
  • Discord enabled with multiple accounts/bots
  • Discord healthMonitor.enabled=false
  • Discord threadBindings.enabled=true
  • Discord threadBindings.spawnSubagentSessions=true
  • Discord threadBindings.spawnAcpSessions=true
  • Agent defaults:
    • contextTokens=120000
    • primary model openai-codex/gpt-5.5
    • timeoutSeconds=3600
    • subagents.maxConcurrent=5
    • subagents.maxChildrenPerAgent=5
    • subagents.announceTimeoutMs=300000
    • compaction reserve floor 24000
  • Discord inbound worker timeout appears to be using the default 1800000ms / 1800s; I did not find an explicit per-account channels.discord.accounts.<id>.inboundWorker.runTimeoutMs override in my config.
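To put the different limits side by side, here is a small Python sketch that converts the values mentioned in this report to a common unit. The labels are descriptive only, not necessarily real config keys, and I am assuming nothing about whether these clocks cover the same span of work.

```python
# Values copied from this report; labels are descriptive, not confirmed config keys.
TIMEOUTS_MS = {
    "discord inbound worker (default)": 1_800_000,
    "agent timeoutSeconds": 3600 * 1000,
    "subagents.announceTimeoutMs": 300_000,
    "qmd embed (from log)": 600_000,
}

def to_minutes(ms: int) -> float:
    """Convert milliseconds to minutes."""
    return ms / 60_000

# Sort from shortest to longest limit.
ordered = sorted(TIMEOUTS_MS.items(), key=lambda kv: kv[1])
for name, ms in ordered:
    print(f"{name}: {ms // 1000}s ({to_minutes(ms):g} min)")
```

One thing this makes visible: the agent timeoutSeconds (3600s) is twice the inbound worker default (1800s). If both limits apply to the same run, which I have not verified, the inbound worker timeout would always fire first, and that may be why it is the only failure I ever see in Discord.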

At the time of inspection the service was still running but using substantial resources:

Tasks: 80
Memory: 6.5G
CPU: 3d+ accumulated

Why I think this might be an OpenClaw issue

The timeout itself is documented/configured, but in practice it seems to be acting as the only visible failure mode for several possible internal stalls:

  • queued Discord inbound run stuck behind session locks
  • qmd embed/search/update timeouts
  • subagent announce timeouts
  • local gateway websocket handshake timeouts
  • context overflow recovery taking a long time or looping

It would be useful if the inbound worker timeout carried the underlying run/session state into the Discord error reply and logs, or if the queue could cancel/unblock the stuck worker more cleanly before the full 1800s elapses.
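As an illustration of what "carrying the underlying state" could look like, here is a minimal asyncio sketch in Python. It is not OpenClaw code (the real worker presumably lives in the Node codebase), and `RunState`, `InboundTimeout`, and `run_with_timeout` are hypothetical names. The idea is only that the work updates a shared phase marker, so the timeout error can say what the run was waiting on.

```python
import asyncio

class InboundTimeout(Exception):
    """Hypothetical error that records what the run was waiting on when it timed out."""
    def __init__(self, phase: str, timeout_s: float):
        super().__init__(f"inbound worker timed out after {timeout_s}s while waiting on {phase}")
        self.phase = phase
        self.timeout_s = timeout_s

class RunState:
    """Mutable marker the work updates as it moves between phases."""
    def __init__(self) -> None:
        self.phase = "queued"

async def run_with_timeout(work, timeout_s: float):
    state = RunState()
    try:
        # wait_for cancels the work when the timeout elapses.
        return await asyncio.wait_for(work(state), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise InboundTimeout(state.phase, timeout_s) from None

async def stuck_on_lock(state: RunState):
    state.phase = "session lock"
    await asyncio.sleep(10)  # stands in for a stall; cancelled by the timeout

def demo():
    """Run the stalled work with a short timeout and return the resulting error."""
    try:
        asyncio.run(run_with_timeout(stuck_on_lock, 0.05))
    except InboundTimeout as exc:
        return exc
    return None
```

Calling `demo()` returns an `InboundTimeout` whose `phase` is "session lock", which is the kind of detail I would like to see in the log line and the Discord reply.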

Possible improvement

When the inbound worker times out, include something like:

  • account id / agent id
  • session key
  • run id
  • queue depth
  • whether the agent was waiting on model, tool, qmd, session lock, or gateway connect
  • whether the timeout came from default inboundWorker.runTimeoutMs or an explicit account override

That would make this much easier to diagnose from Discord without digging through journal logs.
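A possible shape for that reply, sketched in Python purely for illustration. The field names mirror the list above; `TimeoutDiagnostics`, `format_reply`, and the example values are all hypothetical and not taken from OpenClaw.

```python
from dataclasses import dataclass

@dataclass
class TimeoutDiagnostics:
    """Hypothetical container for the fields listed above."""
    account_id: str
    agent_id: str
    session_key: str
    run_id: str
    queue_depth: int
    waiting_on: str       # e.g. model, tool, qmd, session lock, gateway connect
    timeout_ms: int
    timeout_source: str   # "default" or "account override"

def format_reply(d: TimeoutDiagnostics) -> str:
    """Build a multi-line, user-facing timeout message from the diagnostics."""
    return (
        f"Discord inbound worker timed out after {d.timeout_ms // 1000}s "
        f"({d.timeout_source} inboundWorker.runTimeoutMs).\n"
        f"account={d.account_id} agent={d.agent_id} "
        f"session={d.session_key} run={d.run_id}\n"
        f"queue depth={d.queue_depth}, waiting on: {d.waiting_on}"
    )

# Placeholder values for illustration only.
example = TimeoutDiagnostics(
    account_id="example-account",
    agent_id="example-agent",
    session_key="example-session",
    run_id="example-run",
    queue_depth=3,
    waiting_on="session lock",
    timeout_ms=1_800_000,
    timeout_source="default",
)
print(format_reply(example))
```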
