Discord ingest lag of 100–400 s on stable connection persists after PR #68159 / 2026.4.1 reconnect-ownership change #71546

@apex-system

Summary

On openclaw 2026.4.23, single-account, stable wired network, the time between Discord delivering a DM and the agent runtime starting to process it is regularly 100–400 seconds. Agent processing + LLM call after that point is healthy at 5–7 s, so the delay is entirely in the message-ingest path. This persists after the recent Discord-lifecycle hardening that closed #56492 (PR #68159, commit 5adf9d2…081f45), #53132, and #51116.

Filing as a separate, narrow issue per @steipete's instruction in the #53132 close: "If a similar startup hang is still reproducible on a current build, please open a fresh issue with current logs." This is a different failure surface than #38596 (which is about the health-monitor restart loop) — here, the bot does NOT visibly restart; the lag happens inside Carbon's reconnect/RESUME flow on a connection that the gateway still considers up.

Environment

OpenClaw: 2026.4.23 (a979721 per --help and status --json runtimeVersion)
@buape/carbon: 0.16.0 (latest on npm; pinned exactly by @openclaw/discord plugin)
discord-api-types: ^0.38.47
ws: ^8.20.0
OS: macOS 15.6.1 (arm64)
Node (gateway runtime): 22.22.2
Discord setup: 1 bot account, 1 guild, 1 user, MESSAGE_CONTENT intent enabled
Network: wired Ethernet, 9–31 ms ping to gateway.discord.gg, 0% packet loss
Other channels: telegram disabled (channels.telegram.enabled: false), mDNS off (discovery.mdns.mode: "off"), TTS preflight off (messages.tts.enabled: false)
Service: launchd ai.openclaw.gateway, gateway.controlUi.allowInsecureAuth: true (separate concern)

Reproduction

  1. Configure single Discord bot, restart gateway, wait for [gateway] ready (6 plugins …) then [discord] logged in to discord as <id> (OpenClaw).
  2. Leave bot idle.
  3. Observe gateway.log: [discord] gateway: Gateway websocket closed: 1000 followed by Gateway reconnect scheduled in <~1000ms> (close, resume=true) — happens every 3–5 minutes on idle.
  4. Send a DM to the bot at any point.
  5. Most of the time the bot replies in 5–10 s. Periodically (when the inbound message lands during a reconnect window or close to a "did not reach READY within 30000ms" event), the reply takes 100–400 s.

Evidence — agent trajectory log

From ~/.openclaw/agents/main/sessions/<sid>.trajectory.jsonl, with the user message_id decoded from the embedded Discord snowflake (Discord epoch 1420070400000):

seq=  1 2026-04-25T07:25:22.028Z session.started
seq=  4 2026-04-25T07:25:23.309Z prompt.submitted
seq=  5 2026-04-25T07:25:26.783Z model.completed   reply="Hi."
seq=  7 2026-04-25T07:25:26.898Z session.ended

user message_id 1497497072411218070 → Discord-snowflake-time 2026-04-25T07:18:44.213Z
=> Discord→openclaw INGEST lag = 397.8 s
   Agent processing total      =   4.9 s
   Model latency only          =   3.5 s
   END-TO-END                  = 402.6 s
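
For anyone who wants to re-check the arithmetic, here is a minimal sketch of the snowflake decoding used above. It assumes only the standard Discord snowflake layout (millisecond timestamp in the bits above the low 22) and the epoch quoted earlier; the helper names are mine and are not part of openclaw or Carbon.

```python
from datetime import datetime, timedelta, timezone

DISCORD_EPOCH_MS = 1420070400000  # Discord epoch, as quoted above
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snowflake_to_datetime(snowflake: int) -> datetime:
    """Decode the creation time embedded in a Discord snowflake ID."""
    # The bits above the low 22 hold milliseconds since the Discord epoch.
    ms_since_unix_epoch = (snowflake >> 22) + DISCORD_EPOCH_MS
    # Integer milliseconds via timedelta avoid float rounding.
    return UNIX_EPOCH + timedelta(milliseconds=ms_since_unix_epoch)


def parse_utc(ts: str) -> datetime:
    """Parse a trajectory-log timestamp such as 2026-04-25T07:25:22.028Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def ingest_lag_seconds(message_id: int, session_started: str) -> float:
    """Seconds between Discord creating the message and session.started."""
    delta = parse_utc(session_started) - snowflake_to_datetime(message_id)
    return delta.total_seconds()


sent = snowflake_to_datetime(1497497072411218070)
lag = ingest_lag_seconds(1497497072411218070, "2026-04-25T07:25:22.028Z")
print(sent.isoformat(timespec="milliseconds"), f"ingest lag = {lag:.1f} s")
```

Running this against the message_id and session.started values above reproduces the 07:18:44.213Z send time and the 397.8 s ingest lag.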

Three consecutive idle-bot tests on the same session, same wired network:

| User msg | Discord ts (UTC) | session.started (UTC) | Δ ingest | Δ model | Δ end-to-end |
| --- | --- | --- | --- | --- | --- |
| hi | 07:18:44.213 | 07:25:22.028 | 397.8 s | 3.5 s | 402.6 s |
| Pingping | 07:42:45.606 | 07:44:25.532 | 99.9 s | 5.7 s | 106.5 s |
| Double ping | 07:51:03.549 | 07:53:13.547 | 130.0 s | 6.7 s | 137.3 s |

The agent + LLM segment is consistently under 10 s (4.9–7.3 s across the three tests above). The 100–400 s headline number lies entirely in the segment between Discord's gateway and openclaw's DiscordMessageListener enqueuing into the agent runtime.

Correlated stuck session warning while the inbound message sits in queue:

[diagnostic] stuck session: sessionId=unknown
  sessionKey=agent:main:discord:direct:<userid>
  state=processing age=501s queueDepth=1

queueDepth=1 with a pending message that does eventually get answered → this is buffering / replay during reconnect, not permanent loss (so distinct from #51116's user-facing claim that messages are "lost" — they're delayed, not dropped).

WS-close cadence (4-hour window, single account, otherwise idle)

$ awk '/2026-04-25T15:27:21/{flag=1} flag' gateway.log \
  | grep -oE "Gateway websocket closed: [0-9]+|reconnect scheduled.*\((zombie|invalid-session|close)" \
  | sort | uniq -c | sort -rn

  23 Gateway websocket closed: 1000
   8 reconnect scheduled in <ms>ms (zombie
   8 reconnect scheduled in <ms>ms (invalid-session
   1 Gateway websocket closed: 1006
   ... (close-resume reconnect events)
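
A rough Python equivalent of the awk/grep pipeline above, in case it is easier to adapt. It is a sketch only: the regex is built from the log fragments quoted in this issue, the sample lines are illustrative stand-ins rather than real log output, and tally_gateway_events is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches either a websocket close code or the reason on a reconnect line.
EVENT_RE = re.compile(
    r"Gateway websocket closed: (?P<code>\d+)"
    r"|reconnect scheduled in \S+ \((?P<reason>zombie|invalid-session|close)"
)


def tally_gateway_events(lines):
    """Count close codes and reconnect reasons across gateway.log lines."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # not a close/reconnect line
        if match.group("code"):
            counts[f"closed {match.group('code')}"] += 1
        else:
            counts[f"reconnect ({match.group('reason')})"] += 1
    return counts


# Illustrative lines only, shaped like the fragments quoted in this issue.
sample = [
    "[discord] gateway: Gateway websocket closed: 1000",
    "[discord] gateway: Gateway reconnect scheduled in 1000ms (close, resume=true)",
    "[discord] gateway: Gateway websocket closed: 1006",
    "[discord] gateway: Gateway reconnect scheduled in 5000ms (zombie, resume=true)",
    "[gateway] some unrelated line",
]
counts = tally_gateway_events(sample)
for event, n in counts.most_common():
    print(f"{n:4d} {event}")
```

Unlike grep -o, this normalizes away the millisecond value, so reconnect lines with different delays collapse into one bucket per reason.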

Concrete close timestamps over a single ~26-min window post-restart:

15:33:12  closed 1000  (3:14 after login)
15:39:46  closed 1000  (3:28 after re-init)
15:44:21  zombie reconnect
15:49:57  closed 1000
15:53:10  closed 1000

Every WS close triggers Carbon's RESUME flow. Inbound messages received during the reconnect window get buffered. RESUME usually succeeds within ~1 s, but when it fails — discord gateway opened but did not reach READY within 30000ms (defined in dist/extensions/discord/provider-Bc1Lm79N.js:5897 as DISCORD_GATEWAY_RUNTIME_READY_TIMEOUT_MS = 3e4) — the channel exits and the outer auto-restart attempt 1/10 in 5s cycle kicks in, adding 30–60 s of unavailability per failed RESUME.

Network is not the cause

$ ping -c 3 gateway.discord.gg
9.319/10.380/11.440 ms (0% loss)

Discord-side reachability is fine. CPU on the gateway process is 0.0% per ps -o %cpu during these episodes; RSS 57 MB; no event-loop stall.

What I tried — A/B downgrade to Carbon 0.15.0 (does not work)

Hypothesis: Carbon 0.16.0 (released 2026-04-16) regressed RESUME vs 0.15.0 (2026-04-10). Tested by:

  1. npm pack @buape/carbon@0.15.0 → atomic swap into dist/extensions/discord/node_modules/@buape/carbon.
  2. Patched dist/extensions/discord/package.json @buape/carbon pin to "0.15.0".
  3. SIGUSR1 restart.

Result: the bot reaches [discord] client initialized as <id> (OpenClaw); awaiting gateway readiness and never proceeds to logged in to discord. openclaw status --json shows channelSummary: [] for the entire test window. No errors in gateway.err.log. Reverted cleanly to 0.16.0 via backup; the symptom described in this issue resumed exactly as before.

So 0.15.0 is incompatible with the current @openclaw/discord plugin code path; not a viable downgrade.

Asks

Concrete and narrow:

  1. Is the 3–5 min close-1000 cadence on idle considered baseline behavior on current main, or a regression worth investigating? A reproducer on a clean install would settle this — happy to provide more environment detail.
  2. What's the right place in the code path for a buffered-message catch-up after RESUME? After a successful resume, Carbon's Client.events should re-deliver missed events per Discord's gateway spec (a Resume, opcode 6, is answered with the missed dispatches replayed in order, followed by a Resumed event). If openclaw's DiscordMessageListener is being recreated during the inner reconnect, those replayed events would never reach the agent runtime, which would explain why the 100–400 s lag matches the reconnect window exactly. Worth checking whether the listener is preserved across Carbon's internal reconnects.
  3. Would a chat.history-style catch-up on every successful RESUME (re-fetching the last N messages on each affected channel since last_message_id) be in scope as a defensive backstop, even if the underlying buffering issue is fixed? It's the same ask that #51116 ("Discord WebSocket disconnects every ~10 minutes, messages lost during reconnect window") made, and it'd be a small, opt-in change behind a config flag.
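
To make ask 3 concrete, here is a minimal sketch of the catch-up logic I have in mind, written in Python only for brevity. Everything in it is hypothetical: catch_up_after_resume, fetch_after, and the already_handled set are illustrative names, not openclaw or Carbon APIs. In a real implementation, fetch_after would presumably wrap Discord's channel-messages REST endpoint with its after and limit query parameters.

```python
from typing import Callable, Dict, Iterable, List, Set

Message = Dict[str, str]  # minimal stand-in: {"id": "<snowflake>", ...}


def catch_up_after_resume(
    fetch_after: Callable[[int, int], Iterable[Message]],
    last_message_id: int,
    already_handled: Set[int],
    limit: int = 50,
) -> List[Message]:
    """Return messages newer than last_message_id that were not yet handled.

    fetch_after(after_id, limit) is a hypothetical callback that returns
    recent channel messages; the result is de-duplicated against
    already_handled and ordered oldest first.
    """
    missed = [
        msg
        for msg in fetch_after(last_message_id, limit)
        if int(msg["id"]) > last_message_id
        and int(msg["id"]) not in already_handled
    ]
    # Snowflake IDs grow with creation time, so sorting by ID is chronological.
    missed.sort(key=lambda msg: int(msg["id"]))
    return missed


def fake_fetch(after_id: int, limit: int) -> List[Message]:
    """Stand-in for a REST call; returns newest-first like a history fetch."""
    history = [{"id": "40"}, {"id": "30"}, {"id": "20"}, {"id": "10"}]
    return [m for m in history if int(m["id"]) > after_id][:limit]


replayed = catch_up_after_resume(fake_fetch, last_message_id=10, already_handled={30})
print([m["id"] for m in replayed])
```

De-duplicating against already-handled IDs is what would make this safe to run on every RESUME, including the ones where Discord's own replay did arrive.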

Logs / artifacts available on request

  • /tmp/openclaw/openclaw-2026-04-25.log (full ndjson trace, ~250 KB at time of writing)
  • ~/.openclaw/agents/main/sessions/<sid>.trajectory.jsonl (the trajectory excerpts above came from here)
  • ~/.openclaw/logs/gateway.log, gateway.err.log
  • The Carbon 0.15.0 A/B test result (post-restart awaiting gateway readiness hang) reproducible on demand.

Happy to upload these as gist links if you'd prefer. Let me know which format you'd like.
