Discord Message Queue Gets Stuck - DiscordMessageListener Blocking

## Summary

Discord messages stop being processed after `DiscordMessageListener` starts taking >30 seconds per event. The queue eventually blocks completely, requiring a gateway restart to recover. This has occurred twice in 24 hours.

## Environment

- **OpenClaw Version:** 2026.2.1
- **OS:** Linux 6.14.0-1017-azure (x64)
- **Node:** v22.22.0
- **Channel:** Discord
- **Model Provider:** github-copilot/claude-opus-4.5

## Symptoms

1. `[discord] Slow listener detected: DiscordMessageListener took X seconds for event MESSAGE_CREATE` warnings appear
2. Listener times escalate (34s → 46s → 90s → 133s in severe cases)
3. Eventually, new messages are not processed at all
4. Session shows `totalTokens: 0` — the message never reaches the LLM
5. No "task done" log appears — task is stuck indefinitely
6. `typing TTL reached (2m); stopping typing indicator` appears: the typing indicator starts, but the response never completes
7. Gateway restart required to recover

## Incident Timeline

### Incident 1: Feb 4, 2026 (~18:14 - 19:04 UTC)

```
18:14:53 — Slow listener: 38.1s
18:20:34 — Slow listener: 92.9s (escalating)
18:30:36 — Slow listener: 133.3s (peak)
18:31:54 — [agent/embedded] embedded run timeout: runId=xxx timeoutMs=600000
18:33-18:56 — Multiple slow listeners (31-97s range)
18:59:35 — SIGUSR1 restart attempted
18:59:42 — Slow listener: 97.5s (still broken after SIGUSR1)
19:03:00 — Full systemctl restart required to fix
```

**Duration stuck:** ~40 minutes
**Fix:** Full service restart (SIGUSR1 did not work)

### Incident 2: Feb 5, 2026 (~01:05 - 01:23 UTC)

```
01:04:57 — Slow listener: 34.6s
01:12:47 — Slow listener: 34s
01:14:58 — Slow listener: 45.9s
01:15:01 — typing TTL reached; no further processing
01:22:54 — SIGTERM sent (pkill)
01:23:04 — Gateway restarted by systemd
```

**Duration stuck:** ~13 minutes
**Fix:** SIGTERM followed by a systemd restart (SIGUSR1 was attempted, but through a different restart flow)

## Key Observations

1. **Blocking occurs before the LLM call:** The session has 0 tokens when stuck, so the hang is somewhere in pre-processing (possibly context building, memory search, or file reads)

2. **No timeout on the blocking operation:** The absence of a "task done" log suggests that whatever is blocking has no timeout configured

3. **SIGUSR1 may not fully recover:** In incident 1, a SIGUSR1 restart did not clear the stuck state; a full `systemctl` restart was needed

4. **Possible correlation with concurrent operations:** Incident 1 occurred while SSH operations to a remote VM were running from another session. Possible resource contention?

5. **Left-over processes on shutdown:** Logs show:
   ```
   openclaw-gateway.service: Unit process XXXX (openclaw) remains running after unit stopped.
   openclaw-gateway.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
   ```

## Relevant Log Excerpts

### Slow Listener Detection Pattern
```
[discord] Slow listener detected: DiscordMessageListener took 38.1 seconds for event MESSAGE_CREATE
[discord] Slow listener detected: DiscordMessageListener took 92.9 seconds for event MESSAGE_CREATE
[discord] Slow listener detected: DiscordMessageListener took 133.3 seconds for event MESSAGE_CREATE
```
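To track the escalation outside the gateway, the warnings above could be parsed from the journal. This is only a sketch: `parseSlowListener` and `isEscalating` are hypothetical helpers, and the regex is derived solely from the three sample lines quoted in this issue, so it may not match every variant the gateway emits.

```typescript
// Hypothetical helpers for analysing slow-listener warnings.
// The pattern is inferred from the log lines quoted in this issue only.
interface SlowListenerEvent {
  listener: string;
  seconds: number;
  event: string;
}

const SLOW_LISTENER_RE =
  /Slow listener detected: (\w+) took ([\d.]+) seconds for event (\w+)/;

// Returns the parsed warning, or null if the line is not a slow-listener warning.
function parseSlowListener(line: string): SlowListenerEvent | null {
  const match = SLOW_LISTENER_RE.exec(line);
  if (!match) return null;
  const [, listener, seconds, event] = match;
  if (listener === undefined || seconds === undefined || event === undefined) {
    return null;
  }
  return { listener, seconds: Number(seconds), event };
}

// True when every duration is strictly longer than the one before it.
function isEscalating(durations: number[]): boolean {
  if (durations.length < 2) return false;
  let previous = -Infinity;
  for (const d of durations) {
    if (d <= previous) return false;
    previous = d;
  }
  return true;
}
```

Feeding the three durations from incident 1 (38.1, 92.9, 133.3) into `isEscalating` flags the pattern, which could be used to alert before the queue blocks completely.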

### Lane Wait Diagnostic
```
[diagnostic] lane wait exceeded: lane=session:agent:main:discord:channel:XXXXXXXXX waitedMs=16703 queueAhead=0
```

### Embedded Run Timeout (Incident 1)
```
[agent/embedded] embedded run timeout: runId=XXXXX sessionId=XXXXX timeoutMs=600000
```

### Shutdown Issues
```
openclaw-gateway.service: Unit process XXXXX (openclaw) remains running after unit stopped.
openclaw-gateway.service: Unit process XXXXX (openclaw-gatewa) remains running after unit stopped.
openclaw-gateway.service: This usually indicates unclean termination of a previous run
```

## Questions

1. What operations occur between message receipt and LLM call that could block without timeout?
2. Is there synchronous file I/O or blocking operations in the message processing path?
3. Could memory search / QMD indexing operations be blocking the event loop?
4. Why does SIGUSR1 sometimes fail to recover the stuck state?
5. Are the left-over processes on shutdown related to the blocking issue?
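One way to answer questions 2 and 3 might be to measure event loop delay directly. The sketch below uses Node's built-in `monitorEventLoopDelay` from `node:perf_hooks`; `startEventLoopWatch`, its parameters, and the thresholds are hypothetical and not part of OpenClaw.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Hypothetical helper: reports when the event loop was stalled for longer
// than thresholdMs during the last sampling interval.
function startEventLoopWatch(
  thresholdMs: number,
  intervalMs: number,
  onStall: (maxDelayMs: number) => void,
): () => void {
  // Samples loop delay roughly every 20 ms.
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();

  const timer = setInterval(() => {
    const maxDelayMs = histogram.max / 1e6; // histogram values are nanoseconds
    if (maxDelayMs > thresholdMs) onStall(maxDelayMs);
    histogram.reset();
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring

  // Returns a function that stops the watch.
  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}
```

If stalls are reported while a message is stuck, synchronous work is blocking the loop. If no stalls are reported, the hang is more likely an awaited operation that never settles.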

## Suggested Investigation Areas

- Add timeouts to all pre-processing operations
- Add circuit breaker for slow message processing
- Log more granular timing in the message processing pipeline
- Investigate if QMD/memory operations can block
- Check for any sync file operations in the Discord message handler
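The first and third suggestions could be combined in a single wrapper. The following is a minimal sketch, assuming the pre-processing steps are async functions; `runStage`, `StageTimeoutError`, the stage names, and the `[timing]` log format are all hypothetical and not taken from the OpenClaw codebase.

```typescript
import { performance } from "node:perf_hooks";

// Hypothetical error type raised when a stage exceeds its time budget.
class StageTimeoutError extends Error {
  stage: string;
  timeoutMs: number;
  constructor(stage: string, timeoutMs: number) {
    super(`stage "${stage}" exceeded ${timeoutMs} ms`);
    this.name = "StageTimeoutError";
    this.stage = stage;
    this.timeoutMs = timeoutMs;
  }
}

// Runs one pipeline stage with a timeout and logs how long it took.
// The AbortSignal lets cooperative work cancel itself when the budget expires.
async function runStage<T>(
  stage: string,
  timeoutMs: number,
  work: (signal: AbortSignal) => Promise<T>,
  log: (message: string) => void = console.log,
): Promise<T> {
  const controller = new AbortController();
  const started = performance.now();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new StageTimeoutError(stage, timeoutMs));
    }, timeoutMs);
  });

  try {
    return await Promise.race([work(controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
    const elapsedMs = Math.round(performance.now() - started);
    log(`[timing] stage=${stage} elapsedMs=${elapsedMs}`);
  }
}
```

A call might look like `await runStage("memory-search", 10_000, (signal) => searchMemory(query, signal))`, where `searchMemory` is a placeholder. Note that a timer-based timeout only fires if the event loop is free: it catches promises that never settle, but not synchronous code that blocks the loop.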

## Workaround

Gateway restart via `systemctl restart openclaw-gateway` or SIGTERM clears the stuck state.
