Discord ingest lag of 100–400 s on stable connection persists after PR #68159 / 2026.4.1 reconnect-ownership change #71546


## Summary

On `openclaw 2026.4.23`, single-account, stable wired network, the **time between Discord delivering a DM and the agent runtime starting to process it** is regularly **100–400 seconds**. Agent processing + LLM call after that point is healthy at 5–7 s, so the delay is entirely in the message-ingest path. This persists after the recent Discord-lifecycle hardening that closed #56492 (PR #68159, commit `5adf9d2…081f45`), #53132, and #51116.

Filing as a separate, narrow issue per `@steipete`'s instruction in the #53132 close: *"If a similar startup hang is still reproducible on a current build, please open a fresh issue with current logs."* This is a different failure surface from #38596, which is about the health-monitor restart loop. Here the bot does NOT visibly restart; the lag happens inside Carbon's reconnect/RESUME flow on a connection that the gateway still considers up.

## Environment

| Item | Value |
|---|---|
| OpenClaw | **2026.4.23** (`a979721` per `--help` and `status --json runtimeVersion`) |
| `@buape/carbon` | **0.16.0** (latest on npm; pinned exactly by `@openclaw/discord` plugin) |
| `discord-api-types` | `^0.38.47` |
| `ws` | `^8.20.0` |
| OS | macOS 15.6.1 (arm64) |
| Node (gateway runtime) | 22.22.2 |
| Discord setup | 1 bot account, 1 guild, 1 user, MESSAGE_CONTENT intent enabled |
| Network | wired Ethernet, 9–31 ms ping to `gateway.discord.gg`, 0% packet loss |
| Other channels | telegram disabled (`channels.telegram.enabled: false`), mDNS off (`discovery.mdns.mode: "off"`), TTS preflight off (`messages.tts.enabled: false`) |
| Service | launchd `ai.openclaw.gateway`, `gateway.controlUi.allowInsecureAuth: true` (separate concern) |

## Reproduction

1. Configure a single Discord bot, restart the gateway, and wait for `[gateway] ready (6 plugins …)` followed by `[discord] logged in to discord as <id> (OpenClaw)`.
2. Leave the bot idle.
3. Observe in `gateway.log`: `[discord] gateway: Gateway websocket closed: 1000` followed by `Gateway reconnect scheduled in <~1000ms> (close, resume=true)`. This happens every 3–5 minutes while idle.
4. Send a DM to the bot at any point.
5. Most of the time the bot replies in 5–10 s. Periodically (when the inbound message lands during a reconnect window or right around a `did not reach READY within 30000ms` event), the reply takes **100–400 s**.

## Evidence — agent trajectory log

From `~/.openclaw/agents/main/sessions/<sid>.trajectory.jsonl`, with the user `message_id` decoded from the embedded Discord snowflake (Discord epoch `1420070400000`):

```
seq=  1 2026-04-25T07:25:22.028Z session.started
seq=  4 2026-04-25T07:25:23.309Z prompt.submitted
seq=  5 2026-04-25T07:25:26.783Z model.completed   reply="Hi."
seq=  7 2026-04-25T07:25:26.898Z session.ended

user message_id 1497497072411218070 → Discord-snowflake-time 2026-04-25T07:18:44.213Z
=> Discord→openclaw INGEST lag = 397.8 s
   Agent processing total      =   4.9 s
   Model latency only          =   3.5 s
   END-TO-END                  = 402.6 s
```

Three consecutive idle-bot tests on the same session, same wired network:

| User msg | Discord ts (UTC) | session.started | Δ ingest | Δ model | **Δ end-to-end** |
|---|---|---|---|---|---|
| `hi` | 07:18:44.213 | 07:25:22.028 | **397.8 s** | 3.5 s | **402.6 s** |
| `Pingping` | 07:42:45.606 | 07:44:25.532 | **99.9 s** | 5.7 s | **106.5 s** |
| `Double ping` | 07:51:03.549 | 07:53:13.547 | **130.0 s** | 6.7 s | **137.3 s** |

The agent + LLM segment is consistently about 5–7 s. The 100–400 s headline number is entirely in the segment between Discord's gateway and openclaw's `DiscordMessageListener` enqueuing into the agent runtime.
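For anyone who wants to re-derive the ingest figures, here is a minimal sketch of the snowflake decoding used above. It assumes only the Discord epoch quoted earlier (`1420070400000`) and the standard snowflake layout (timestamp in the bits above the low 22); the helper names `snowflakeToUnixMs` and `ingestLagSeconds` are illustrative and do not come from openclaw or Carbon.

```typescript
// Discord snowflakes store a millisecond timestamp in the bits above the
// low 22, measured from the Discord epoch (2015-01-01T00:00:00.000Z).
const DISCORD_EPOCH_MS = 1420070400000n;

/** Returns the creation time (Unix ms) encoded in a Discord snowflake ID. */
function snowflakeToUnixMs(snowflake: string): number {
  // Shift off the 22 low bits (worker, process, increment), then add the epoch.
  return Number((BigInt(snowflake) >> 22n) + DISCORD_EPOCH_MS);
}

/** Ingest lag in seconds between message creation and `session.started`. */
function ingestLagSeconds(messageId: string, sessionStartedIso: string): number {
  return (Date.parse(sessionStartedIso) - snowflakeToUnixMs(messageId)) / 1000;
}

// Values taken from the first trajectory excerpt above.
const sentAt = new Date(snowflakeToUnixMs("1497497072411218070")).toISOString();
const lag = ingestLagSeconds("1497497072411218070", "2026-04-25T07:25:22.028Z");
console.log(sentAt, lag.toFixed(1)); // 2026-04-25T07:18:44.213Z 397.8
```

The ID is parsed as a `BigInt` because snowflakes exceed `Number.MAX_SAFE_INTEGER`; only the decoded millisecond value is converted back to a `number`.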

Correlated `stuck session` warning while the inbound message sits in queue:

```
[diagnostic] stuck session: sessionId=unknown
  sessionKey=agent:main:discord:direct:<userid>
  state=processing age=501s queueDepth=1
```

`queueDepth=1` with a pending message that is eventually answered points to buffering or replay during reconnect, not permanent loss. That makes this distinct from #51116's user-facing claim that messages are "lost": they are delayed, not dropped.

## WS-close cadence (4-hour window, single account, otherwise idle)

```
$ awk '/2026-04-25T15:27:21/{flag=1} flag' gateway.log \
  | grep -oE "Gateway websocket closed: [0-9]+|reconnect scheduled.*\((zombie|invalid-session|close)" \
  | sort | uniq -c | sort -rn

  23 Gateway websocket closed: 1000
   8 reconnect scheduled in <ms>ms (zombie
   8 reconnect scheduled in <ms>ms (invalid-session
   1 Gateway websocket closed: 1006
   ... (close-resume reconnect events)
```

Concrete close timestamps over a single ~26-min window post-restart:

```
15:33:12  closed 1000  (3:14 after login)
15:39:46  closed 1000  (3:28 after re-init)
15:44:21  zombie reconnect
15:49:57  closed 1000
15:53:10  closed 1000
```
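The spacing between the entries listed above can be checked with a short sketch. It uses only the five timestamps from the listing and prints the gap in seconds between consecutive entries; the helper `toSeconds` is an illustrative name, and the gaps include whatever time the reconnect itself took.

```typescript
// Timestamps copied from the listing above (same day, HH:MM:SS).
const closeEvents = ["15:33:12", "15:39:46", "15:44:21", "15:49:57", "15:53:10"];

/** Converts HH:MM:SS to seconds since midnight. */
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Gap in seconds between each event and the one before it.
const gaps = closeEvents
  .slice(1)
  .map((t, i) => toSeconds(t) - toSeconds(closeEvents[i]));
console.log(gaps); // [ 394, 275, 336, 193 ]
```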

Every WS close triggers Carbon's RESUME flow, and inbound messages received during the reconnect window get buffered. RESUME usually succeeds within ~1 s. When it fails, the log shows `discord gateway opened but did not reach READY within 30000ms` (the timeout is defined in `dist/extensions/discord/provider-Bc1Lm79N.js:5897` as `DISCORD_GATEWAY_RUNTIME_READY_TIMEOUT_MS = 3e4`), the channel exits, and the outer `auto-restart attempt 1/10 in 5s` cycle kicks in, adding 30–60 s of unavailability per failed RESUME.

## Network is not the cause

```
$ ping -c 3 gateway.discord.gg
9.319/10.380/11.440 ms (0% loss)
```

Discord-side reachability is fine. CPU on the gateway process is `0.0%` per `ps -o %cpu` during these episodes; RSS 57 MB; no event-loop stall.

## What I tried — A/B downgrade to Carbon 0.15.0 (does not work)

Hypothesis: Carbon 0.16.0 (released 2026-04-16) regressed RESUME vs 0.15.0 (2026-04-10). Tested by:

1. `npm pack @buape/carbon@0.15.0` → atomic swap into `dist/extensions/discord/node_modules/@buape/carbon`.
2. Patched `dist/extensions/discord/package.json` `@buape/carbon` pin to `"0.15.0"`.
3. SIGUSR1 restart.

Result: the bot reaches `[discord] client initialized as <id> (OpenClaw); awaiting gateway readiness` and **never proceeds to `logged in to discord`**. `openclaw status --json` shows `channelSummary: []` for the entire test window. No errors in `gateway.err.log`. Reverted cleanly to 0.16.0 via backup; the symptom described in this issue resumed exactly as before.

So 0.15.0 is incompatible with the current `@openclaw/discord` plugin code path; not a viable downgrade.

## Related issues

- #38596 OPEN — health-monitor restart-loop circuit-breaker bypass. Same family of symptoms but different surface (visible restart loop vs. silent in-Carbon reconnect-with-buffering). Re-commenting there with the same trajectory data to un-stale.
- #51116 CLOSED 2026-04-25 — "WS disconnects every ~10 min, messages lost." Closed as obsolete on main. The "messages lost" framing was too strong — this issue documents that messages are *delayed*, not lost, on current main, but the user-facing latency is still in the hundreds of seconds.
- #56492 CLOSED 2026-04-24 — Carbon `Client` constructor IDENTIFY race. Fixed by PR #68159. The login race no longer happens for me; this issue is about post-login churn, which #68159 doesn't touch.
- #53132 CLOSED 2026-04-24 — multi-account `awaiting gateway readiness` hang. Different class (multi-account); fixed in 2026.3.24+. Single-account doesn't hang on login on current main.
- #43468 CLOSED 2026-03-15 — the (correctly rejected) "remove Carbon" issue. **Not** asking for that here.

## Asks

Concrete and narrow:

1. **Is the 3–5 min close-1000 cadence on idle considered baseline behavior on current main, or a regression worth investigating?** A reproducer on a clean install would settle this — happy to provide more environment detail.
2. **What's the right place in the code path for a buffered-message catch-up after RESUME?** Per Discord's gateway spec, a client that sends Resume (opcode 6) with its `session_id` and last sequence number gets the missed dispatches replayed, followed by a `RESUMED` event, so Carbon's `Client.events` should re-deliver missed events after a resume. If openclaw's `DiscordMessageListener` is being recreated during the inner reconnect, those replayed events would never reach the agent runtime, which would explain why the 100–400 s lag matches the reconnect window. It is worth checking whether the listener is preserved across Carbon's internal reconnects.
3. **Would a `chat.history`-style catch-up on every successful RESUME** (re-fetching the last N messages on each affected channel since `last_message_id`) be in scope as a defensive backstop, even if the underlying buffering issue is fixed? It's the same ask #51116 made, and it'd be a small, opt-in change behind a config flag.
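To make ask 3 concrete, here is a rough sketch of what such a catch-up could look like. It is not existing openclaw or Carbon code: `catchUpAfterResume`, the `seen` set, and the injected `FetchMessages` transport are all hypothetical names. The only external piece it relies on is Discord's Get Channel Messages REST endpoint with its `after` and `limit` query parameters.

```typescript
// Minimal shape of a Discord message object; only `id` is needed here.
interface DiscordMessage {
  id: string;
}

// Hypothetical transport: takes a URL and resolves to the parsed JSON array.
type FetchMessages = (url: string) => Promise<DiscordMessage[]>;

/**
 * Hypothetical catch-up helper (not an existing openclaw or Carbon API).
 * Fetches messages newer than `lastMessageId`, drops those already handled,
 * and returns the remainder oldest-first.
 */
async function catchUpAfterResume(
  channelId: string,
  lastMessageId: string,
  seen: Set<string>,
  fetchMessages: FetchMessages,
  limit = 50,
): Promise<DiscordMessage[]> {
  const url =
    `https://discord.com/api/v10/channels/${channelId}/messages` +
    `?after=${lastMessageId}&limit=${limit}`;
  const batch = await fetchMessages(url);
  return batch
    .filter((m) => !seen.has(m.id))
    .sort((a, b) => {
      // Snowflakes sort chronologically when compared as integers.
      const x = BigInt(a.id);
      const y = BigInt(b.id);
      return x < y ? -1 : x > y ? 1 : 0;
    });
}

// Usage with a stubbed transport (no network access needed).
const stub: FetchMessages = async () => [{ id: "30" }, { id: "10" }, { id: "20" }];
const pending = catchUpAfterResume("123", "5", new Set(["20"]), stub);
pending.then((msgs) => console.log(msgs.map((m) => m.id))); // [ '10', '30' ]
```

In a real integration the transport would be a `fetch` call carrying the bot's `Authorization: Bot <token>` header, and `seen` would be whatever record the listener already keeps of processed message IDs. De-duplicating against that record is what lets the backstop coexist with gateway replay without double-processing.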

## Logs / artifacts available on request

- `/tmp/openclaw/openclaw-2026-04-25.log` (full ndjson trace, ~250 KB at time of writing)
- `~/.openclaw/agents/main/sessions/<sid>.trajectory.jsonl` (the trajectory excerpts above came from here)
- `~/.openclaw/logs/gateway.log`, `gateway.err.log`
- The Carbon 0.15.0 A/B test result (post-restart `awaiting gateway readiness` hang) reproducible on demand.

Happy to upload these as gist links if you'd prefer; let me know which format you'd like.

