Summary
Discord messages stop being processed after DiscordMessageListener starts taking >30 seconds per event. The queue eventually blocks completely, requiring a gateway restart to recover. This has occurred twice in 24 hours.
Environment
- OpenClaw Version: 2026.2.1
- OS: Linux 6.14.0-1017-azure (x64)
- Node: v22.22.0
- Channel: Discord
- Model Provider: github-copilot/claude-opus-4.5
Symptoms
- "[discord] Slow listener detected: DiscordMessageListener took X seconds for event MESSAGE_CREATE" warnings appear
- Listener times escalate (34s → 46s → 90s → 133s in severe cases)
- Eventually, new messages are not processed at all
- Session shows totalTokens: 0, so the message never reaches the LLM
- No "task done" log appears — task is stuck indefinitely
- "typing TTL reached (2m); stopping typing indicator" appears, as typing starts but never completes
- Gateway restart required to recover
Incident Timeline
Incident 1: Feb 4, 2026 (~18:14 - 19:04 UTC)
18:14:53 — Slow listener: 38.1s
18:20:34 — Slow listener: 92.9s (escalating)
18:30:36 — Slow listener: 133.3s (peak)
18:31:54 — [agent/embedded] embedded run timeout: runId=xxx timeoutMs=600000
18:33-18:56 — Multiple slow listeners (31-97s range)
18:59:35 — SIGUSR1 restart attempted
18:59:42 — Slow listener: 97.5s (still broken after SIGUSR1)
19:03:00 — Full systemctl restart required to fix
Duration stuck: ~40 minutes
Fix: Full service restart (SIGUSR1 did not work)
Incident 2: Feb 5, 2026 (~01:05 - 01:23 UTC)
01:04:57 — Slow listener: 34.6s
01:12:47 — Slow listener: 34s
01:14:58 — Slow listener: 45.9s
01:15:01 — typing TTL reached; no further processing
01:22:54 — SIGTERM sent (pkill)
01:23:04 — Gateway restarted by systemd
Duration stuck: ~13 minutes
Fix: SIGTERM/restart (SIGUSR1 was attempted, but via a different restart flow)
Key Observations
- Blocking occurs before the LLM call: the session has 0 tokens when stuck, so the hang is in pre-processing (context building, memory search, file reads?)
- No timeout on the blocking operation: the missing "task done" log suggests that whatever is blocking has no timeout configured
- SIGUSR1 may not fully recover: in incident 1, a SIGUSR1 restart didn't clear the stuck state; a full systemctl restart was needed
- Possible correlation with concurrent operations: incident 1 occurred while SSH operations to a remote VM were running from another session. Possible resource contention?
- Left-over processes on shutdown. Logs show:
openclaw-gateway.service: Unit process XXXX (openclaw) remains running after unit stopped.
openclaw-gateway.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Relevant Log Excerpts
Slow Listener Detection Pattern
[discord] Slow listener detected: DiscordMessageListener took 38.1 seconds for event MESSAGE_CREATE
[discord] Slow listener detected: DiscordMessageListener took 92.9 seconds for event MESSAGE_CREATE
[discord] Slow listener detected: DiscordMessageListener took 133.3 seconds for event MESSAGE_CREATE
Lane Wait Diagnostic
[diagnostic] lane wait exceeded: lane=session:agent:main:discord:channel:XXXXXXXXX waitedMs=16703 queueAhead=0
Embedded Run Timeout (Incident 1)
[agent/embedded] embedded run timeout: runId=XXXXX sessionId=XXXXX timeoutMs=600000
Shutdown Issues
openclaw-gateway.service: Unit process XXXXX (openclaw) remains running after unit stopped.
openclaw-gateway.service: Unit process XXXXX (openclaw-gatewa) remains running after unit stopped.
openclaw-gateway.service: This usually indicates unclean termination of a previous run
Questions
- What operations occur between message receipt and LLM call that could block without timeout?
- Is there synchronous file I/O or blocking operations in the message processing path?
- Could memory search / QMD indexing operations be blocking the event loop?
- Why does SIGUSR1 sometimes fail to recover the stuck state?
- Are the left-over processes on shutdown related to the blocking issue?
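One way to narrow down the event-loop question is to measure loop delay directly. If the loop itself is being blocked (sync I/O, heavy CPU work), the delay spikes; if the handler is only awaiting a promise that never settles, the delay stays low. The sketch below uses Node's built-in perf_hooks histogram. The helper name watchEventLoop and its parameters are hypothetical and not part of OpenClaw.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Hypothetical helper (not an OpenClaw API): samples event-loop delay and
// calls `report` when the loop was blocked longer than `thresholdMs`
// during the last sampling interval. Returns a function that stops it.
export function watchEventLoop(
  thresholdMs: number,
  intervalMs: number,
  report: (maxDelayMs: number) => void,
): () => void {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();

  const timer = setInterval(() => {
    // Histogram values are reported in nanoseconds.
    const maxDelayMs = histogram.max / 1e6;
    if (maxDelayMs > thresholdMs) {
      report(maxDelayMs);
    }
    histogram.reset();
  }, intervalMs);
  // Do not keep the process alive just for monitoring.
  timer.unref();

  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}
```

If slow listener warnings appear while this monitor stays quiet, the hang is more likely an unresolved await (network, subprocess, lock) than synchronous work on the main thread.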
Suggested Investigation Areas
- Add timeouts to all pre-processing operations
- Add circuit breaker for slow message processing
- Log more granular timing in the message processing pipeline
- Investigate if QMD/memory operations can block
- Check for any sync file operations in the Discord message handler
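The first and third items above could be combined in one small wrapper: each pre-processing stage runs with its own deadline and logs its elapsed time, so a hang shows up as a named, timed-out stage rather than a silent stall. This is a minimal sketch under assumptions: runStage, StageTimeoutError, and the "[timing]" log format are hypothetical, and the actual stages in the OpenClaw pipeline are not known from this report.

```typescript
// Hypothetical error type so callers can tell a timeout from other failures.
export class StageTimeoutError extends Error {
  readonly stage: string;
  readonly timeoutMs: number;

  constructor(stage: string, timeoutMs: number) {
    super(`stage "${stage}" exceeded ${timeoutMs} ms`);
    this.name = "StageTimeoutError";
    this.stage = stage;
    this.timeoutMs = timeoutMs;
  }
}

// Runs one stage with a deadline and logs how long it took.
// The AbortSignal lets a cooperative stage stop once the deadline passes;
// a stage that ignores the signal keeps running in the background.
export async function runStage<T>(
  stage: string,
  timeoutMs: number,
  work: (signal: AbortSignal) => Promise<T>,
  log: (line: string) => void = console.log,
): Promise<T> {
  const controller = new AbortController();
  const startedAt = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new StageTimeoutError(stage, timeoutMs));
    }, timeoutMs);
  });

  try {
    // Whichever settles first wins: the stage result or the deadline.
    return await Promise.race([work(controller.signal), deadline]);
  } finally {
    clearTimeout(timer);
    log(`[timing] stage=${stage} elapsedMs=${Date.now() - startedAt}`);
  }
}
```

A call might look like runStage("memory-search", 10000, (signal) => searchMemory(query, signal)), where searchMemory stands in for whatever the real stage is. Note that the deadline only unblocks the caller; it does not help if the stage blocks the event loop synchronously, because the timer cannot fire in that case.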
Workaround
Gateway restart via systemctl restart openclaw-gateway or SIGTERM clears the stuck state.
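Until the root cause is found, the restart could be triggered by a small watchdog that reads the gateway log. The sketch below covers only the decision logic, based on the warning format quoted in this report. The function names and the "N consecutive warnings above a limit" rule are assumptions, and how the log lines are collected and how the restart is issued is left to the caller.

```typescript
// Matches the slow listener warning format quoted in this report.
const SLOW_LISTENER =
  /Slow listener detected: (\w+) took ([\d.]+) seconds for event (\w+)/;

// Extracts the listener duration in seconds, or null for unrelated lines.
export function slowListenerSeconds(line: string): number | null {
  const match = SLOW_LISTENER.exec(line);
  return match ? Number(match[2]) : null;
}

// Hypothetical rule: restart when the last `count` slow listener warnings
// all exceeded `limitSeconds`. `count` is expected to be a positive integer.
export function shouldRestart(
  lines: string[],
  count: number,
  limitSeconds: number,
): boolean {
  const durations = lines
    .map((line) => slowListenerSeconds(line))
    .filter((s): s is number => s !== null);
  const recent = durations.slice(-count);
  return recent.length === count && recent.every((s) => s > limitSeconds);
}
```

This rule ignores timestamps, so warnings that are hours apart count the same as a burst; a real watchdog would probably also want a time window.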