Discord Message Queue Gets Stuck - DiscordMessageListener Blocking #9238

@openclaw-gh-app

Summary

Discord messages stop being processed after DiscordMessageListener starts taking >30 seconds per event. The queue eventually blocks completely, requiring a gateway restart to recover. This has occurred twice in 24 hours.

Environment

  • OpenClaw Version: 2026.2.1
  • OS: Linux 6.14.0-1017-azure (x64)
  • Node: v22.22.0
  • Channel: Discord
  • Model Provider: github-copilot/claude-opus-4.5

Symptoms

  1. Warnings of the form "[discord] Slow listener detected: DiscordMessageListener took X seconds for event MESSAGE_CREATE" appear
  2. Listener times escalate (34s → 46s → 90s → 133s in severe cases)
  3. Eventually, new messages are not processed at all
  4. Session shows totalTokens: 0 — the message never reaches the LLM
  5. No "task done" log appears — task is stuck indefinitely
  6. The log line "typing TTL reached (2m); stopping typing indicator" appears: the typing indicator starts, but the reply never completes
  7. Gateway restart required to recover

Incident Timeline

Incident 1: Feb 4, 2026 (~18:14 - 19:04 UTC)

18:14:53 — Slow listener: 38.1s
18:20:34 — Slow listener: 92.9s (escalating)
18:30:36 — Slow listener: 133.3s (peak)
18:31:54 — [agent/embedded] embedded run timeout: runId=xxx timeoutMs=600000
18:33-18:56 — Multiple slow listeners (31-97s range)
18:59:35 — SIGUSR1 restart attempted
18:59:42 — Slow listener: 97.5s (still broken after SIGUSR1)
19:03:00 — Full systemctl restart required to fix

Duration stuck: ~40 minutes
Fix: Full service restart (SIGUSR1 did not work)

Incident 2: Feb 5, 2026 (~01:05 - 01:23 UTC)

01:04:57 — Slow listener: 34.6s
01:12:47 — Slow listener: 34s
01:14:58 — Slow listener: 45.9s
01:15:01 — typing TTL reached; no further processing
01:22:54 — SIGTERM sent (pkill)
01:23:04 — Gateway restarted by systemd

Duration stuck: ~13 minutes
Fix: SIGTERM followed by a systemd restart (SIGUSR1 was also attempted, but via a different restart flow)

Key Observations

  1. Blocking occurs before LLM call — Session has 0 tokens when stuck, meaning the hang is in pre-processing (context building, memory search, file reads?)

  2. No timeout on the blocking operation — The absence of a "task done" log suggests that whatever is blocking has no timeout configured

  3. SIGUSR1 may not fully recover — In incident 1, SIGUSR1 restart didn't clear the stuck state; full systemctl restart was needed

  4. Possible correlation with concurrent operations — Incident 1 occurred while SSH operations to a remote VM were happening from another session. Possible resource contention?

  5. Left-over processes on shutdown — Logs show:

    openclaw-gateway.service: Unit process XXXX (openclaw) remains running after unit stopped.
    openclaw-gateway.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
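
To illustrate observation 2, here is a minimal sketch of how a pre-processing step could be bounded by a timeout. The names `withTimeout` and `StageTimeoutError` are hypothetical and do not come from the OpenClaw codebase; the actual stages that would need wrapping are not known from the logs.

```typescript
// Hypothetical helper: none of these names come from the OpenClaw codebase.
class StageTimeoutError extends Error {
  constructor(
    public readonly stage: string,
    public readonly timeoutMs: number,
  ) {
    super(`stage "${stage}" exceeded ${timeoutMs} ms`);
    this.name = "StageTimeoutError";
  }
}

// Race an async step against a timer so the caller always gets control back.
// This does not cancel the underlying work; it only stops waiting for it.
async function withTimeout<T>(
  stage: string,
  timeoutMs: number,
  work: () => Promise<T>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new StageTimeoutError(stage, timeoutMs)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([work(), timeout]);
  } finally {
    clearTimeout(timer); // avoid a stray timer once either side settles
  }
}
```

One caveat: a timer like this only fires if the event loop is free. If the hang is caused by synchronous work blocking the loop, the timeout would not trigger either, so this would only help with stalled asynchronous operations.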
    

Relevant Log Excerpts

Slow Listener Detection Pattern

[discord] Slow listener detected: DiscordMessageListener took 38.1 seconds for event MESSAGE_CREATE
[discord] Slow listener detected: DiscordMessageListener took 92.9 seconds for event MESSAGE_CREATE
[discord] Slow listener detected: DiscordMessageListener took 133.3 seconds for event MESSAGE_CREATE
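
For anyone trying to reproduce or monitor this, the warnings above can be extracted from the gateway log and checked for escalation. This is a rough sketch that assumes the log line format stays exactly as shown in the excerpt; `parseSlowListener` and `isEscalating` are made-up helper names, not part of OpenClaw.

```typescript
// Hypothetical log helpers; the regex assumes the format shown in the excerpt.
interface SlowListenerEvent {
  listener: string;
  seconds: number;
  event: string;
}

const SLOW_LISTENER =
  /Slow listener detected: (\S+) took ([\d.]+) seconds for event (\S+)/;

// Returns the parsed warning, or null if the line is not a slow-listener warning.
function parseSlowListener(line: string): SlowListenerEvent | null {
  const m = SLOW_LISTENER.exec(line);
  if (!m) return null;
  return { listener: m[1], seconds: Number(m[2]), event: m[3] };
}

// True when every duration is strictly larger than the one before it.
// Needs at least two samples to say anything.
function isEscalating(durations: number[]): boolean {
  if (durations.length < 2) return false;
  return durations.every((d, i) => i === 0 || d > durations[i - 1]);
}

// Usage with the three lines from the excerpt.
const excerptLines = [
  "[discord] Slow listener detected: DiscordMessageListener took 38.1 seconds for event MESSAGE_CREATE",
  "[discord] Slow listener detected: DiscordMessageListener took 92.9 seconds for event MESSAGE_CREATE",
  "[discord] Slow listener detected: DiscordMessageListener took 133.3 seconds for event MESSAGE_CREATE",
];
const excerptDurations = excerptLines
  .map(parseSlowListener)
  .filter((e): e is SlowListenerEvent => e !== null)
  .map((e) => e.seconds);
console.log(excerptDurations, isEscalating(excerptDurations));
```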

Lane Wait Diagnostic

[diagnostic] lane wait exceeded: lane=session:agent:main:discord:channel:XXXXXXXXX waitedMs=16703 queueAhead=0

Embedded Run Timeout (Incident 1)

[agent/embedded] embedded run timeout: runId=XXXXX sessionId=XXXXX timeoutMs=600000

Shutdown Issues

openclaw-gateway.service: Unit process XXXXX (openclaw) remains running after unit stopped.
openclaw-gateway.service: Unit process XXXXX (openclaw-gatewa) remains running after unit stopped.
openclaw-gateway.service: This usually indicates unclean termination of a previous run

Questions

  1. What operations occur between message receipt and LLM call that could block without timeout?
  2. Is there synchronous file I/O or blocking operations in the message processing path?
  3. Could memory search / QMD indexing operations be blocking the event loop?
  4. Why does SIGUSR1 sometimes fail to recover the stuck state?
  5. Are the left-over processes on shutdown related to the blocking issue?
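
Questions 2 and 3 could be narrowed down by measuring event loop delay directly. The sketch below uses Node's built-in `monitorEventLoopDelay` from `perf_hooks`; the wrapper `watchEventLoop`, its thresholds, and the callback are assumptions for illustration, not existing OpenClaw code.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Hypothetical wrapper: reports when the worst event-loop stall within a
// sampling window exceeds thresholdMs. Returns a function that stops monitoring.
function watchEventLoop(
  thresholdMs: number,
  windowMs: number,
  onStall: (maxDelayMs: number) => void,
): () => void {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();

  const interval = setInterval(() => {
    const maxDelayMs = histogram.max / 1e6; // histogram values are nanoseconds
    histogram.reset(); // start a fresh window
    if (maxDelayMs > thresholdMs) onStall(maxDelayMs);
  }, windowMs);

  return () => {
    clearInterval(interval);
    histogram.disable();
  };
}
```

If stalls of several seconds show up here while a message is being handled, that would point at synchronous work on the main thread. If the loop stays responsive during a hang, the blocking operation is more likely an awaited promise that never settles.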

Suggested Investigation Areas

  • Add timeouts to all pre-processing operations
  • Add circuit breaker for slow message processing
  • Log more granular timing in the message processing pipeline
  • Investigate if QMD/memory operations can block
  • Check for any sync file operations in the Discord message handler
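
As a starting point for the circuit breaker and granular timing suggestions, something along these lines might work. Everything here is hypothetical: the class name, the `[timing]` log format, and the thresholds are placeholders, and the 30 second figure in the usage comment is simply the point at which the slow listener warnings were observed.

```typescript
// Hypothetical sketch; names and log format are illustrative only.
type Clock = () => number;

class SlowProcessingBreaker {
  private consecutiveSlow = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly slowMs: number, // a run longer than this counts as slow
    private readonly maxSlow: number, // consecutive slow runs before opening
    private readonly cooldownMs: number, // how long the breaker stays open
    private readonly now: Clock = Date.now,
  ) {}

  // Whether a new message may be processed right now.
  canProcess(): boolean {
    if (this.openedAt === null) return true;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: close the breaker and allow a trial run.
      this.openedAt = null;
      this.consecutiveSlow = 0;
      return true;
    }
    return false;
  }

  // Record how long one run took; opens the breaker after maxSlow slow runs in a row.
  record(durationMs: number): void {
    if (durationMs > this.slowMs) {
      this.consecutiveSlow += 1;
      if (this.consecutiveSlow >= this.maxSlow) this.openedAt = this.now();
    } else {
      this.consecutiveSlow = 0;
    }
  }
}

// Run one named stage, log its duration, and feed the duration to the breaker.
async function timedStage<T>(
  name: string,
  breaker: SlowProcessingBreaker,
  work: () => Promise<T>,
  now: Clock = Date.now,
): Promise<T> {
  const start = now();
  try {
    return await work();
  } finally {
    const elapsedMs = now() - start;
    console.log(`[timing] stage=${name} elapsedMs=${elapsedMs}`);
    breaker.record(elapsedMs);
  }
}

// Example wiring: open after 3 consecutive runs over 30 s, stay open for 60 s.
// const breaker = new SlowProcessingBreaker(30_000, 3, 60_000);
// if (breaker.canProcess()) await timedStage("context-build", breaker, buildContext);
```

Per-stage timing would also help answer which step sits between message receipt and the LLM call when the session is stuck at 0 tokens. What the gateway should do while the breaker is open (drop, defer, or reply with an error) is a separate design decision.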

Workaround

Restarting the gateway with systemctl restart openclaw-gateway, or sending SIGTERM so that systemd restarts it, clears the stuck state. SIGUSR1 did not recover it in Incident 1.

Labels: bug, stale